Summary: Used in the NNPreProc layers. Online training fails when there is an empty batch.
Reviewed By: dzhulgakov
Differential Revision: D4235498
fbshipit-source-id: bde00a011831762e44a3f9bf2190d4b241a06ccc
Summary: Each sparse feature is an ID list, and the position of an ID in the list is usually meaningful: the earlier an ID appears, the more important it is. In this diff, we multiply each embedding by a weight that corresponds to its position. With this change, the same ID appearing at different positions has a different norm/length/importance after aggregation. The firstX transformation in sigrid is a special case of this model in which the weights are 1 before n and 0 after n, where n is the argument of firstX.
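A minimal sketch of the idea in plain Python, for illustration only; it is not the operator added in this diff. The names `position_weighted_sum` and `first_x_weights` are hypothetical. The sketch assumes the aggregation is a sum, that positions are 0-indexed (so "before n" means the first n positions), and that positions beyond the weight table get weight 0.

```python
from typing import List, Sequence


def position_weighted_sum(
    ids: Sequence[int],
    embeddings: Sequence[Sequence[float]],
    position_weights: Sequence[float],
) -> List[float]:
    """Sum the embeddings of `ids`, scaling the one at position i by position_weights[i]."""
    dim = len(embeddings[0])
    pooled = [0.0] * dim
    for pos, idx in enumerate(ids):
        # Assumption: positions past the end of the weight table contribute nothing.
        w = position_weights[pos] if pos < len(position_weights) else 0.0
        for d in range(dim):
            pooled[d] += w * embeddings[idx][d]
    return pooled


def first_x_weights(n: int, max_len: int) -> List[float]:
    """Weights that mimic firstX: 1 for the first n positions, 0 afterwards."""
    return [1.0 if pos < n else 0.0 for pos in range(max_len)]


# Toy embedding table with three IDs and dimension 2 (made-up values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ids = [2, 0, 1]

# Weights that decay with position: earlier IDs count more.
decayed = position_weighted_sum(ids, embeddings, [1.0, 0.5, 0.25])

# firstX with n = 2 keeps the first two IDs and drops the third.
truncated = position_weighted_sum(ids, embeddings, first_x_weights(2, len(ids)))
```

In the sketch the weights are fixed constants; whether they are fixed or learned in the actual model is not something this example tries to show.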
Reviewed By: xianjiec
Differential Revision: D4181251
fbshipit-source-id: 2a6f8b7240af445b6bd2052fd24c2d99f39ee7ff