Summary:
Implementation of a new variant of the attention module that adds a recurrent decoder state with one vector per source-side word and strictly increasing values, enabling the model to track the degree to which each source word has already been translated.
The approach is a variant of the approaches described in https://arxiv.org/pdf/1601.04811.pdf. We simply include the sum of all previous attention weights for each encoder word as a new recurrent state (coverage_t). A new linear transform on encoder_outputs is used to produce coverage_weights, which has the same dimensionality as encoder_outputs and implicitly models the fertility of source-side words (placing this extra informational burden on the encoder network).
Thus, for a given source word, the encoder output, the decoder state, and the coverage weights have the same dimensionality, and attention logits are calculated as v * tanh(coverage * coverage_weights + encoder_output + decoder_state).
Note: the entire coverage state for each translation instance is of shape (encoder_length, coverage_units), but the states for the RecurrentNetwork operator, used to train the decoder, must be flat in the data dimension. This state is therefore initialized with shape (encoder_length * coverage_units) [not shown in the open-source library] and reshaped appropriately within the apply_soft_coverage_attention() function.
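A minimal numpy sketch of one decoder step of the logit computation described above. Names follow the summary (encoder_outputs, decoder_state, coverage, coverage_weights, v); the softmax, the accumulation of coverage, and the assumption that coverage_units equals the attention dimension are illustrative, not taken from the actual implementation.
```python
import numpy as np

def soft_coverage_attention_step(encoder_outputs, decoder_state, coverage,
                                 coverage_weights, v):
    """encoder_outputs:  (encoder_length, dim)
    decoder_state:    (dim,)
    coverage:         (encoder_length, dim)  accumulated attention mass (coverage_units == dim assumed)
    coverage_weights: (encoder_length, dim)  linear transform of encoder_outputs
    v:                (dim,)
    """
    # logits_i = v . tanh(coverage_i * coverage_weights_i + encoder_output_i + decoder_state)
    logits = np.tanh(coverage * coverage_weights + encoder_outputs + decoder_state) @ v
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                 # softmax over source words
    context = weights @ encoder_outputs      # attention context vector
    coverage = coverage + weights[:, None]   # strictly increasing coverage state
    return context, weights, coverage
```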
Differential Revision: D5593617
fbshipit-source-id: 7d0522b5eb0b26f22e8429e4461a459f2f16ed46
Summary: Adding support to use kernels, strides, pads etc. as arguments.
Reviewed By: houseroad
Differential Revision: D5710699
fbshipit-source-id: 8b63af4c4a76cd06b637a376aeb29a34c659be2e
Summary:
The _LSTM helper is a legacy piece we had before all the RNNCell awesomeness landed. Now we need to pull it apart and create separate building blocks that people can use for any RNN.
Please note the changes to a test with double scoping. That should go away once we change the RNNCell scoping logic so that each cell adds its own name to the scope for all of its outputs (see another diff: D5613139).
Reviewed By: jhcross
Differential Revision: D5632276
fbshipit-source-id: 1cb568ab995c4c0b3dd1b4bad2d028e34bded9c1
Summary: These were missing and are required for some seq2seq models. Unit tested. The previous implementation of ReduceBackMean shape inference was incorrect, so it has been removed.
Reviewed By: asaadaldien
Differential Revision: D5691262
fbshipit-source-id: 76f868b298440f988635966a410f0232301ca6c4
Summary:
Split the first dimension of a tensor into two, the first of which is fixed and given as an argument.
This can then be used to split a batch into smaller batches and distribute them across workers.
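An illustration of the shape semantics only (numpy); the operator's actual name, argument name, and error handling may differ from this sketch.
```python
import numpy as np

batch = np.arange(24).reshape(6, 4)   # first dimension: 6 examples
num_workers = 2                        # the fixed first output dimension given as an argument

assert batch.shape[0] % num_workers == 0
chunks = batch.reshape(num_workers, batch.shape[0] // num_workers, *batch.shape[1:])
print(chunks.shape)                    # (2, 3, 4): one sub-batch of 3 examples per worker
```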
Reviewed By: harouwu
Differential Revision: D5702175
fbshipit-source-id: 02bb93e49bf9db411b516e149c8e647301dd2ca5
Summary:
This adds a fast path for global max pooling with NCHW. Compared to the equivalent ReduceBackMean, this is about 3.5x faster.
Based on D5533059.
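For reference, the semantics of global max pooling over an NCHW tensor reduce to a max over the spatial axes; the fast path in this diff computes the same result with a tighter loop. A numpy sketch:
```python
import numpy as np

x = np.random.randn(8, 16, 7, 7).astype(np.float32)   # N, C, H, W
y = x.max(axis=(2, 3), keepdims=True)                  # global max pool -> (8, 16, 1, 1)
```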
Reviewed By: akyrola
Differential Revision: D5681122
fbshipit-source-id: 7a4df934044c7dd01888f095f7dd46654aaf4eae
Summary:
Optimizations for SinusoidPositionEncodingOp to make sinusoid position embeddings
more competitive with table-based embeddings.
- Removed most calls to std::pow
- Replaced division with multiplication by the reciprocal
- Reused computation across examples within a batch
Current speedup with batch size of 16, sequence length of 128 and embedding
size of 512 is about 270x (17k embeddings per second -> 4.7M embeddings per
second). The speedup is very dependent on the batch size; at a batch size of 4
this only gets 1.7M embeddings per second.
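As a rough illustration of where the savings come from, here is a numpy sketch of the encoding with the per-dimension inverse frequencies precomputed once and reused; the op's exact output layout (sin/cos interleaving, amplitude handling) is an assumption, not taken from the implementation.
```python
import numpy as np

def sinusoid_position_encoding(seq_len, embedding_size, base=10000.0):
    half = embedding_size // 2
    # Inverse frequencies computed once (no std::pow per element); multiply instead of divide.
    inv_freq = base ** (-np.arange(half) / half)
    pos = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    angles = pos * inv_freq[None, :]                     # (seq_len, half)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

# The same (seq_len, embedding_size) table can be shared by every example in a batch
# with equal sequence lengths, which is where the batch-size-dependent speedup comes from.
```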
Profile: https://pxl.cl/8zf0
Annotated DoRunWithType: P57925031
Reviewed By: jamesr66a
Differential Revision: D5634766
fbshipit-source-id: 0f35bb176164ea547c91de242a0205c5d7adf7cf
Summary:
Add more data augmentation to ImageInputOp
1) Inception-style random sized cropping
2) color jittering
3) color lighting
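A sketch of the Inception-style random-sized cropping idea in numpy; the area/aspect-ratio ranges and the fallback here are illustrative and may differ from the C++ implementation in ImageInputOp.
```python
import numpy as np

def random_sized_crop(img, area_range=(0.08, 1.0), ratio_range=(3 / 4, 4 / 3)):
    h, w = img.shape[:2]
    for _ in range(10):
        area = np.random.uniform(*area_range) * h * w
        ratio = np.random.uniform(*ratio_range)
        crop_w = int(round(np.sqrt(area * ratio)))
        crop_h = int(round(np.sqrt(area / ratio)))
        if crop_w <= w and crop_h <= h:
            x = np.random.randint(0, w - crop_w + 1)
            y = np.random.randint(0, h - crop_h + 1)
            # A real implementation would then resize this crop to the target output size.
            return img[y:y + crop_h, x:x + crop_w]
    # Fallback: center crop of the short side.
    s = min(h, w)
    y, x = (h - s) // 2, (w - s) // 2
    return img[y:y + s, x:x + s]
```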
Reviewed By: panshen1
Differential Revision: D5637726
fbshipit-source-id: 45d9cc69eec9f4d48c1607d80ccd89e325961b1a
Summary:
Adding a range operator in the spirit of np.arange. It is an important building block for a lot of manipulation functions.
This accepts parameters with the same meaning and in the same order as Python's range or np.arange (e.g. `(stop)`, `(start, stop)` or `(start, stop, step)`).
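The three call forms, illustrated with np.arange (which the new operator mirrors):
```python
import numpy as np

np.arange(5)          # (stop)              -> [0 1 2 3 4]
np.arange(2, 5)       # (start, stop)       -> [2 3 4]
np.arange(2, 10, 3)   # (start, stop, step) -> [2 5 8]
```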
Differential Revision: D5616861
fbshipit-source-id: 02622b8bd85ebca125cc881c06fae5b54b7c602a
Summary: The new test ensures the 'add_axis' and 'split' arguments work as intended for tensors of various dimensions. Hypothesis should check various edge cases, such as zeroes in 'split_info' and 1D input with axis=0, add_axis=1.
Reviewed By: hoangmit
Differential Revision: D5645778
fbshipit-source-id: 061f9511a082da54e5c1bbe53a0e7096af4b8d1b
Summary: Implement a brew wrapper for the LayerNorm op. This adds the scalar weight and bias terms to the op.
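Reference math only, as a numpy sketch: normalize across the last axis, then apply the weight and bias terms the brew wrapper adds (treated as scalars per the summary). The brew helper's actual name, signature, and parameter initialization are not shown in this commit message.
```python
import numpy as np

def layer_norm(x, scale, bias, eps=1e-4):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    normalized = (x - mean) / np.sqrt(var + eps)
    return normalized * scale + bias
```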
Reviewed By: jmp84
Differential Revision: D5595836
fbshipit-source-id: 467b2e1158b0c454a149d4b26c47719826e98752
Summary:
Forward-only mode had broken at some point. Two issues: RNNCell did not pass the parameter down to recurrent.py, and recurrent.py itself was broken with forward_only=True after the python3 codemod.
Added a test to rnn_cell_test that checks the forward-only parameter is actually passed, to prevent future breakage.
Reviewed By: jmp84
Differential Revision: D5639306
fbshipit-source-id: b1bbc39d59c3f3734b2f40a1c2f3740c733e0bd4
Summary:
As an alternative to sharing embeddings, we want to explore merging the ID_LISTs in the net.
This commit adds an operator to merge many ID_LIST features into a single one.
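A sketch of the merge semantics, assuming the usual (lengths, values) layout for ID_LIST features; the operator's actual name and its handling of empty lists may differ.
```python
import numpy as np

def merge_id_lists(features):
    """features: list of (lengths, values) pairs, one pair per ID_LIST feature.
    lengths has one entry per example; values holds the concatenated ids."""
    num_examples = len(features[0][0])
    out_lengths, out_values = [], []
    for ex in range(num_examples):
        merged = []
        for lengths, values in features:
            start = int(np.sum(lengths[:ex]))
            merged.extend(values[start:start + lengths[ex]])
        out_lengths.append(len(merged))
        out_values.extend(merged)
    return np.array(out_lengths), np.array(out_values)

# Example: two ID_LIST features over a batch of 2 examples.
f1 = (np.array([2, 1]), np.array([10, 11, 12]))
f2 = (np.array([1, 2]), np.array([20, 21, 22]))
print(merge_id_lists([f1, f2]))   # lengths [3 3], values [10 11 20 12 21 22]
```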
Differential Revision: D5481523
fbshipit-source-id: 446121122a32de5682d5d75a165370bc8d776d03
Summary: This can be used for local attention to mask elements outside of a window
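A numpy sketch of the window-mask idea for local attention: positions outside a radius around the query get a large negative value before the softmax. The radius argument and fill value here are illustrative, not the op's API.
```python
import numpy as np

def window_mask(seq_len, radius, fill=-1e9):
    idx = np.arange(seq_len)
    outside = np.abs(idx[None, :] - idx[:, None]) > radius
    return np.where(outside, fill, 0.0)

print(window_mask(5, 1))   # band of zeros of width 3 around the diagonal, fill elsewhere
```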
Reviewed By: jamesr66a
Differential Revision: D5643677
fbshipit-source-id: 92b33866258ccc7307d5bcf08234610aa3fb152d
Summary:
This diff adds dependency-aware concurrent/parallel execution of operators in stepnets. For CPU, we use multi-threaded execution. For CUDA, we use multiple streams and CUDA events for parallelism and dependency tracking.
Much of the diff is about computing the dependency graph, which was quite tricky because we also need to avoid write races between operators running in multiple timesteps in parallel. Also, recurrent blobs "change name" when passing over a timestep (the "_prev" suffix), so that needs to be handled as well.
This diff also restores the link-ops that I unlanded earlier.
The performance gain of this diff is very good for CPU (same perf as with static_dag, even better for forward-only). On CUDA, the gains are modest, at least with the sizes I was testing with.
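A toy Python illustration of the CPU-side scheduling idea only (not the C++ implementation): ops whose dependencies have all completed are submitted to a thread pool together, wave by wave. The op list, blob names, and pool size are made up for illustration.
```python
from concurrent.futures import ThreadPoolExecutor

ops = {                       # op name -> (input blobs, output blobs)
    "fc1": ({"x"}, {"h1"}),
    "fc2": ({"x"}, {"h2"}),
    "sum": ({"h1", "h2"}, {"y"}),
}

def deps(op):                 # an op depends on every op that produces one of its inputs
    ins, _ = ops[op]
    return {other for other, (_, outs) in ops.items() if outs & ins}

def run(op):
    print("running", op)

done, pending = set(), set(ops)
with ThreadPoolExecutor(max_workers=4) as pool:
    while pending:
        ready = {op for op in pending if deps(op) <= done}
        futures = {op: pool.submit(run, op) for op in ready}   # run the whole wave in parallel
        for op, fut in futures.items():
            fut.result()
            done.add(op)
        pending -= ready
```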
Reviewed By: salexspb
Differential Revision: D5001637
fbshipit-source-id: 3d0a71593d73a9ff22f4c1a5c9abf2a4a0c633c8
Summary:
Implement the forward pass of a SequenceMaskOp to replace https://github.com/caffe2/caffe2/blob/master/caffe2/python/attention.py#L54-L72.
This implements two modes: a sequence-length-based mode and a matrix-triangle mode.
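Reference semantics of the two modes as a numpy sketch; the op's argument names and fill value are assumptions based on the summary above.
```python
import numpy as np

def sequence_length_mask(logits, lengths, fill=-1e9):
    # Mask out positions at or beyond each row's sequence length.
    cols = np.arange(logits.shape[1])[None, :]
    return np.where(cols < np.asarray(lengths)[:, None], logits, fill)

def upper_triangle_mask(logits, fill=-1e9):
    # Matrix-triangle mode: mask the strict upper triangle (e.g. future positions).
    return np.where(np.tril(np.ones_like(logits, dtype=bool)), logits, fill)
```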
Reviewed By: akyrola
Differential Revision: D5615493
fbshipit-source-id: a2ce4a8e655d9b720049010a7856be052c5567eb
Summary: To control the absolute scale/magnitude of this op's output, added a tuning parameter: amplitude.
Reviewed By: jamesr66a
Differential Revision: D5596574
fbshipit-source-id: 3b7e316de55cce6fd686da70aa5658ec3e99b070
Summary: GRU differs from LSTM in that it has only hidden states and no cell states. Reusing the _LSTM code is therefore problematic: we need to delete the part that creates the cell state and change many other places that use a hard-coded 4 (hidden_all, hidden, cell_all, cell) to 2 (hidden_all, hidden). Otherwise GRU breaks during the backward pass, when the optimizer tries to apply gradients to each of the parameters, because the cell state is never used and so there are no gradients for the corresponding parameters (i.e., cell_state_w, cell_state_b).
Differential Revision: D5589309
fbshipit-source-id: f5af67dfe0842acd68223f6da3e96a81639e8049
Summary: This diff implements CUDA version of OneHot operator.
Reviewed By: bddppq
Differential Revision: D5578543
fbshipit-source-id: 55b70e8ec6ee34b647b9140fecbba31b6968f403
Summary: Add CUDA version of GRU operator
Reviewed By: jamesr66a
Differential Revision: D5571043
fbshipit-source-id: 332aa64fc8a9116cc33382f2b2907080e58c13b3
Summary:
This was reverted previously because the gradient op lacked a schema. Added it back and resending.
Differences between this diff and the previously reverted diff:
1. Added a schema for the gradient operator
2. Changed line 95 in kmax_pooling_op.h from CAFFE_ENFORCE to CAFFE_ENFORCE_GE
Reviewed By: xianjiec
Differential Revision: D5568867
fbshipit-source-id: 39813b389a5da803967a561249793afdfce00c58
Summary:
The L1Distance operator used to return a single value denoting the L1 distance over the entire input, instead of a vector with one value per example.
This fixes that.
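Illustration of the fixed behavior in numpy: one L1 distance per example row, rather than a single scalar for the whole batch.
```python
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([[1.5, 2.0], [0.0, 4.0]])

per_example = np.abs(x - y).sum(axis=1)   # [0.5, 3.0]  -- new behavior, one value per row
single_value = np.abs(x - y).sum()        # 3.5         -- old behavior
```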
Reviewed By: Yangqing
Differential Revision: D5570385
fbshipit-source-id: fbab0e0c9262ccbdb3af27262b8baacdeb2d0fc9
Summary:
To train an image model, we can also use a label embedding vector as supervision, as opposed to using SoftmaxLoss/SigmoidCrossEntropyLoss.
In that case, the label is a dense vector. This diff enables such use cases.
Reviewed By: panshen1
Differential Revision: D5556203
fbshipit-source-id: 52c61495e02fab457dc2d43e3345d7dbd5580ab7
Summary:
Implement dot attention as described in https://arxiv.org/abs/1508.04025
This saves the computation of the weighted encoder outputs in `rnn_cell.py`.
When the encoder and decoder dimensions are different, we apply an FC layer, which corresponds to the "general" case below Figure 2 of the paper.
Refactored the unit tests.
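A numpy sketch of dot attention as described above; the projection matrix is a stand-in used only when encoder and decoder dimensions differ, and is not the actual parameter created in `rnn_cell.py`.
```python
import numpy as np

def dot_attention(decoder_state, encoder_outputs, proj=None):
    """decoder_state: (d_dec,), encoder_outputs: (src_len, d_enc), proj: (d_enc, d_dec) or None."""
    query = decoder_state if proj is None else proj @ decoder_state   # FC only if d_enc != d_dec
    logits = encoder_outputs @ query          # dot-product scores; no precomputed weighted outputs
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                  # softmax over source positions
    return weights @ encoder_outputs          # attention context
```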
Reviewed By: jhcross
Differential Revision: D5486976
fbshipit-source-id: f9e9aea675b3b072fbe631bc004199b90a9d95cb