Summary: This diff uses stack workspaces in RecurrentNetwork, which simplifies the implementation and gets rid of scratch blobs.
Reviewed By: salexspb
Differential Revision: D4446813
fbshipit-source-id: 514eec7e4300bdf492a9cb192b40cf4f89acf656
Summary:
It's broken because it relies on add_sparse_bias.
It's not easy to do add_sparse_bias after the switch to loader_param.
DPA would like to try it out :)
Differential Revision: D4447275
fbshipit-source-id: 631cb4995f35383070e44387dc86692ba64b91eb
Summary: Remove usage of recurrent_sizes, so that recurrent states' sizes can depend on the input (e.g. the attention matrix for the beam decoder). I removed recurrent_sizes from the forward and backward steps.
Reviewed By: salexspb
Differential Revision: D4427688
fbshipit-source-id: 580420a294d309c86ec5cb4e677058623b7228e1
Summary:
In this diff I stop passing parameters by name and also remove the hardcoded output ids which were there specifically to make LSTM work. This also makes it possible to avoid using recurrent_sizes in the backward pass (for the forward pass this is done in D4427688).
Using a similar technique it should be simple enough to eliminate blob name passing entirely. Then we can fix scoping. These can be done in a follow-up diff.
Reviewed By: urikz
Differential Revision: D4444614
fbshipit-source-id: 3580a76365502b9f2f09e3d8b7e78084ca739f00
Summary:
A new operator is added for model calibration. Given a piecewise linear function and raw predictions as input, it generates the calibrated mapping as output.
Details can be found in the operator doc.
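As an illustration only (not the operator's actual implementation), here is a minimal numpy sketch of what a piecewise linear calibration mapping does; the knot values are made up:

    # Illustrative sketch: piecewise linear calibration of raw predictions.
    import numpy as np

    bounds = np.array([0.0, 0.1, 0.4, 1.0])  # raw-prediction knots (x-coordinates)
    values = np.array([0.0, 0.3, 0.8, 1.0])  # calibrated output at each knot

    raw_predictions = np.array([0.05, 0.2, 0.7, 0.95])

    # Between consecutive knots the mapping is linear; outside the knots the
    # output is clamped to the first/last value.
    calibrated = np.interp(raw_predictions, bounds, values)
    print(calibrated)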
Differential Revision: D4418640
fbshipit-source-id: f8ff3ea786b0fe233a4ddcb709e5dbf0861ca484
Summary:
This is a handy tool for amortizing expensive operators (e.g.
distributed communication, some heavier kernel launches, etc.) over a
lot of small blobs (e.g. all the biases in a network). We can
coalesce these small blobs in-place into a single blob, act on them in
operators as if they were not coalesced (passing them as inputs to
operators, etc.), and then, for the heavier operators, just work on
the coalesced blob that contains each of these units.
I named it UnsafeCoalesce since it introduces blob aliasing, which
requires care in work such as memory management and graph rewriting
(as in memonger).
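For intuition, a rough sketch of the coalescing idea in plain numpy (this is not the Caffe2 operator itself; blob names are made up):

    import numpy as np

    biases = {
        "fc1_b": np.zeros(16, dtype=np.float32),
        "fc2_b": np.zeros(32, dtype=np.float32),
        "fc3_b": np.zeros(8, dtype=np.float32),
    }

    # Pack all of the small blobs into one contiguous buffer.
    coalesced = np.concatenate([b.ravel() for b in biases.values()])

    # Rebuild each blob as a view (alias) into the coalesced storage.
    views, offset = {}, 0
    for name, b in biases.items():
        views[name] = coalesced[offset:offset + b.size].reshape(b.shape)
        offset += b.size

    # Cheap per-blob work keeps using the views...
    views["fc1_b"] += 1.0
    # ...while an expensive operation (e.g. one collective) touches the whole
    # buffer once.
    coalesced *= 0.5

    # Because the views alias the coalesced buffer, both updates are visible.
    print(views["fc1_b"][:3])  # -> [0.5 0.5 0.5]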
Reviewed By: Yangqing
Differential Revision: D3557149
fbshipit-source-id: 09cff4459b84270fe9e1da3b4a168fd66d01f795
Summary:
This is a first step in improving our RNN story. It provides a wrapper around the current RecurrentNetworkOp implementation which infers most of the redundant parameters and makes the API much simpler.
Also, in order to support general step nets, I added an extra argument to RecurrentNetworkOp.
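As a toy, self-contained sketch of the wrapping idea (nothing below is the real Caffe2 API; all names are hypothetical): a verbose low-level call takes many redundant arguments, and a thin wrapper infers them from the essentials so callers don't have to spell them out.

    def recurrent_op_verbose(step_inputs, recurrent_states, recurrent_sizes, links):
        # Stand-in for the low-level, parameter-heavy op; it just echoes its config.
        return {"step_inputs": step_inputs, "states": recurrent_states,
                "sizes": recurrent_sizes, "links": links}

    def simple_rnn(inputs, hidden_dim):
        # The wrapper derives the redundant parameters from `inputs` and
        # `hidden_dim` instead of making the caller provide each one.
        recurrent_states = ["hidden"]
        recurrent_sizes = [hidden_dim]
        links = [("hidden_prev", "hidden")]  # carry the state across timesteps
        return recurrent_op_verbose(inputs + ["hidden_prev"], recurrent_states,
                                    recurrent_sizes, links)

    print(simple_rnn(["x_t"], hidden_dim=128))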
Future work:
1. Inferring the sizes and types of step net outputs and internal blobs (scratches)
2. Avoid accessing blobs by name in the C++ part
3. Remove the requirement of a 1:1 correspondence between inputs and outputs in the step net
4. Make the Python API support networks with operators like Sum on the border of the Cell net (currently such networks have an issue where gradient blobs on the side are not explicitly created).
Differential Revision: D4268503
fbshipit-source-id: f8a66491c2b55daa730caeed7e9f2b3921541b49
Summary:
This adds Caffe2 support for MKL operators directly with MKLMemory. Included is a
Relu layer that shows how to use it.
Reviewed By: salexspb
Differential Revision: D4322144
fbshipit-source-id: 8b3392c4fd024ab1a7ba7135c349ebd3e1976799
Summary:
The float64 test breaks things on the CUDA side. I am deleting it for now; if
we add it back, let's make sure we run the test on a GPU machine first :)
Reviewed By: azzolini
Differential Revision: D4324427
fbshipit-source-id: 0246fe9dd28a286422ca94c90f5b0fc33a162e74
Summary: Allows collecting samples over multiple batches. The method uses a circular array, so there is no guarantee about the order of the samples. The goal is to get a view of the data across multiple batches.
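A sketch of the circular-buffer behaviour in plain numpy (illustrative only, not the actual implementation):

    import numpy as np

    capacity = 8
    buffer = np.zeros(capacity, dtype=np.float32)
    cursor, seen = 0, 0

    def collect(batch):
        # Overwrite the oldest entries once the buffer is full, so ordering
        # within the buffer is not preserved.
        global cursor, seen
        for x in batch:
            buffer[cursor] = x
            cursor = (cursor + 1) % capacity
            seen += 1

    collect(np.arange(5))      # first batch
    collect(np.arange(5, 11))  # second batch wraps around and overwrites
    print(buffer[:min(seen, capacity)])  # a view of recent data, order not guaranteed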
Reviewed By: salexspb
Differential Revision: D4216181
fbshipit-source-id: bb9e1fa84ac7e04006dcddb53c9347a42ec83dc8
Summary: Used in the NNPreProc layers. An empty batch causes online training to fail.
Reviewed By: dzhulgakov
Differential Revision: D4235498
fbshipit-source-id: bde00a011831762e44a3f9bf2190d4b241a06ccc
Summary: Each sparse feature is an ID list, and usually the position of an ID in the list is meaningful: the earlier the ID appears in the list, the more important it is. In this diff, we multiply each embedding by a weight, where the weight corresponds to the position. With this change, the same ID appearing at different positions will have a different norm/length/importance after aggregation. The firstX transformation in sigrid is a special case of this model where the weights before n are 1 and 0 after n, where n is the argument of firstX.
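An illustrative numpy sketch of position-weighted pooling (table sizes, IDs, and weights below are made up):

    import numpy as np

    vocab_size, emb_dim = 10, 4
    embedding_table = np.random.randn(vocab_size, emb_dim)

    id_list = [3, 7, 3]                           # ID 3 appears at positions 0 and 2
    position_weights = np.array([1.0, 0.6, 0.3])  # learned, one weight per position

    # Position-weighted sum pooling: the same ID contributes differently
    # depending on where it appears in the list.
    pooled = (position_weights[:, None] * embedding_table[id_list]).sum(axis=0)

    # firstX(n) is the special case where the first n weights are 1 and the rest 0.
    n = 2
    firstx_weights = (np.arange(len(id_list)) < n).astype(np.float32)
    firstx_pooled = (firstx_weights[:, None] * embedding_table[id_list]).sum(axis=0)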
Reviewed By: xianjiec
Differential Revision: D4181251
fbshipit-source-id: 2a6f8b7240af445b6bd2052fd24c2d99f39ee7ff