Summary: As stated in the title. This should save a lot of memory when using both train and test workflows.
Reviewed By: jhcross
Differential Revision: D4855436
fbshipit-source-id: 9eeca548eee118e07bd587c46f40e7beb138318e
Summary:
Quite a large diff that makes the cuDNN LSTM and our LSTM produce the same results and provides a Python API for the cuDNN LSTM.
* Added operators RecurrentParamGet and RecurrentParamSet to access the weights and biases for the different gates, both input and recurrent.
* Removed RecurrentInit since it is no longer needed.
* recurrent.cudnn_LSTM() returns a special net and a mapping that can be used to retrieve the parameters from the LSTM.
* recurrent.cudnn_LSTM() can be passed blobs that hold the parameters for the individual gate weights and biases.
* recurrent.InitFromLSTMParams() can be used to initialize our own LSTM from cuDNN params. This way we can test whether cuDNN and our own implementation produce the same result.
recurrent_test.py tests for the equivalence.
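For reference, a small plain-numpy sketch (conceptual only, not the Caffe2 API) of what "per-gate, input/recurrent parameters" means for one LSTM step; these are the kinds of blobs RecurrentParamGet/RecurrentParamSet expose so they can be copied between the cuDNN LSTM and our own:

```
# Conceptual sketch in plain numpy; shapes and gate names are illustrative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W[gate]: input weights, U[gate]: recurrent weights, b[gate]: biases,
    # one entry per gate: 'i' (input), 'f' (forget), 'o' (output), 'g' (cell).
    i = sigmoid(x @ W['i'] + h_prev @ U['i'] + b['i'])
    f = sigmoid(x @ W['f'] + h_prev @ U['f'] + b['f'])
    o = sigmoid(x @ W['o'] + h_prev @ U['o'] + b['o'])
    g = np.tanh(x @ W['g'] + h_prev @ U['g'] + b['g'])
    c = f * c_prev + i * g
    return o * np.tanh(c), c

D, H = 4, 3
rng = np.random.RandomState(0)
W = {k: rng.randn(D, H) for k in 'ifog'}   # per-gate input weights
U = {k: rng.randn(H, H) for k in 'ifog'}   # per-gate recurrent weights
b = {k: np.zeros(H) for k in 'ifog'}       # per-gate biases
h, c = lstm_step(rng.randn(2, D), np.zeros((2, H)), np.zeros((2, H)), W, U, b)
```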
Reviewed By: salexspb
Differential Revision: D4654988
fbshipit-source-id: 6c1547d873cadcf33e03b0e0110248f0a7ab8cb0
Summary:
Uses the cudnnTransformTensor function. It works by shuffling the strides according to the transpose axes. This gives a significant speedup over the current GPU version.
+ moves the transpose test under utility_ops, because hypothesis_test is too big
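As a rough illustration (plain numpy, not the actual CUDA code), the core idea is that a transpose is just the same buffer viewed with permuted shape and strides, which is what cudnnTransformTensor is handed:

```
import numpy as np
from numpy.lib.stride_tricks import as_strided

x = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
axes = (1, 0, 2)                               # requested transpose order

shape = tuple(x.shape[a] for a in axes)        # permute the shape...
strides = tuple(x.strides[a] for a in axes)    # ...and the strides the same way
view = as_strided(x, shape=shape, strides=strides)

assert np.array_equal(view, np.transpose(x, axes))
```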
Reviewed By: jamesr66a
Differential Revision: D4810993
fbshipit-source-id: 82577c4ced1389e70bd5992820ae4d8297a3817f
Summary:
This is pretty tricky to explain, but we can just use backward_links. This way the whole cell uses a blob from the states_grad tensor instead of having its own blob. This should also save a bit of memory.
Differential Revision: D4770798
fbshipit-source-id: 673f85b2c2fdf42c47feeaa24d1e2bf086f012f9
Summary: We accumulate the values of this blob (param_grad) into another special internal blob anyway
Differential Revision: D4768643
fbshipit-source-id: a9d08b7eafd25f278a8db722f9cdb1d0064b852a
Summary: Apart from copying gradient blobs for inputs with initial_cell_input, we needed to perform a similar operation for external parameters used by the step net
Reviewed By: salexspb
Differential Revision: D4752259
fbshipit-source-id: 13ee48cf583ed86221a4cc1cc9f57f5c3a7d2450
Summary: This didn't work for a reason specified in the comments. Also some cleanup in the unit tests; inference now uses a custom workspace to run the cell net.
Reviewed By: urikz
Differential Revision: D4742670
fbshipit-source-id: 04165c029fddec5ae31b20b207faf06d2fa20816
Summary: D4734505 part 2. Remove more instances of the batch_size parameter
Reviewed By: urikz
Differential Revision: D4736906
fbshipit-source-id: fc9d374e9308017d61c427890364c5ab9cec2edf
Summary: Reshape based on tensor shapes in the graph rather than based on a passed-in batch_size parameter
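A tiny numpy sketch of the idea (illustrative only): rather than reshaping with a hard-coded batch_size, the batch dimension is derived from shapes already present in the graph:

```
import numpy as np

T, N, D = 5, 7, 16                 # sequence length, batch size, feature dim
x = np.zeros((T, N, D), dtype=np.float32)

# old style: y = x.reshape(T * batch_size, D), with batch_size passed in
y = x.reshape(-1, D)               # new style: batch size comes from x itself
assert y.shape == (T * N, D)
```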
Reviewed By: urikz
Differential Revision: D4702086
fbshipit-source-id: c4c1d8425cd36c1e86695918eaba2667c27e9601
Summary: For example, test and train nets could have shared workspaces, leading to a race condition. This adds an assertion and appends a running counter to the workspace-blob name.
Reviewed By: jhcross
Differential Revision: D4712152
fbshipit-source-id: 808d7069095bac24ebfe0c9d31ebd134f4cf0956
Summary:
First, this diff includes a full test of data-parallel LSTM, which confirms it works correctly. To make it work, some changes had to be made:
- cell net/step net external inputs must be namespace-scoped
- prevent double-namescoping of cell net inputs
- make data_parallel_model understand recurrent nets so the device mapping works
Reviewed By: salexspb
Differential Revision: D4708840
fbshipit-source-id: 4b0ddc43642d449076a2b6f67ad1c47f84138ff4
Summary:
Created a new function with the specifics of the MI LSTM implementation in Caffe2.
See https://arxiv.org/pdf/1606.06630.pdf for details.
See D4478877 for the implementation of the same in TensorFlow.
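A short numpy sketch of the multiplicative-integration pre-activation from the paper (illustrative only; the real implementation builds this out of Caffe2 operators):

```
import numpy as np

rng = np.random.RandomState(0)
B, D, H = 2, 8, 4
x, h = rng.randn(B, D), rng.randn(B, H)
W, U = rng.randn(D, H), rng.randn(H, H)
alpha, beta1, beta2, b = np.ones(H), np.ones(H), np.ones(H), np.zeros(H)

vanilla = x @ W + h @ U + b                    # ordinary LSTM gate input
mi = alpha * (x @ W) * (h @ U) \
     + beta1 * (h @ U) + beta2 * (x @ W) + b   # MI variant adds a 2nd-order term
```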
Reviewed By: jhcross
Differential Revision: D4669882
fbshipit-source-id: 095bbcf187dbdac2cd79558ff0c8f9f67d8af639
Summary: Super rough implementation of recurrent attention. Planning to factor out the common code between the two functions, as well as between train and eval. I want to get this out and get eyes on it sooner rather than later.
Differential Revision: D4647837
fbshipit-source-id: 54bc4e8ed0df6f04c86c425926decbe89f73b068
Summary:
Implementation of ##LSTMWithAttention##
Still TBD:
1. There are problems with back-propagation, because the gradient is not implemented for ops with broadcasting
2. I need to make initial_recurrent_state be of shape [dim] rather than [1, batch_size, dim], so one doesn't need to provide batch_size to LSTMWithAttention
Differential Revision: D4298735
fbshipit-source-id: 8903fcff4d6a66647ee6d45a6ef28803fc3091e5
Summary:
Pass through the h-value recurrent output unchanged at each LSTM step beyond the valid part of a sequence (computed based on seqLengths, allowing batching of sequences of different lengths). This enables using the final-step output of each sequence as the output when one vector is desired for the entire sequence. The gradient is also passed back unchanged.
Also made some cosmetic changes to recurrent_network_test.py (the seq_lengths offset was corrected; it should be in [1, T] rather than [0, T-1]).
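A numpy sketch of why the pass-through matters (illustrative only): once h is copied unchanged past each sequence's end, the value at the final time step equals each sequence's own last valid output, so a single slice at t = T-1 gives a per-sequence summary vector:

```
import numpy as np

T, N, H = 6, 3, 4
seq_lengths = np.array([6, 3, 5])
h = np.random.randn(T, N, H)

# emulate the pass-through: beyond step seq_lengths[n] - 1, repeat the last valid h
for n in range(N):
    h[seq_lengths[n]:, n, :] = h[seq_lengths[n] - 1, n, :]

last_valid = h[seq_lengths - 1, np.arange(N), :]  # per-sequence final outputs
assert np.allclose(h[-1], last_valid)             # same as just taking t = T-1
```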
Reviewed By: urikz
Differential Revision: D4540307
fbshipit-source-id: 73a9f6326069d713dcb0cdc8d17869317c6dbe96
Summary:
(Caffe2) Modified RecurrentNetworkGradient operator so that training is possible with any of the output blob(s) receiving gradient during the backward pass. This is realized through a new argument for the RecurrentNetwork op, outputs_with_grads, which takes a list of the indices of the output blobs which will receive gradient. The default case (only receiving gradient from the first output blob) remains the default.
A new unit test covers the case where outputs_with_grads = [1, 2], using the Python LSTM wrapper.
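A hedged sketch of how the new argument is attached to the operator; the inputs/outputs listed here are placeholders, not the op's real signature (the unit test goes through the Python LSTM wrapper instead):

```
from caffe2.python import core

op = core.CreateOperator(
    'RecurrentNetwork',
    ['input', 'seq_lengths'],       # placeholder inputs
    ['output', 'hidden', 'cell'],   # placeholder outputs
    outputs_with_grads=[1, 2],      # back-propagate from outputs 1 and 2
    # ...the usual step-net / link arguments are omitted here
)
```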
Reviewed By: urikz
Differential Revision: D4518516
fbshipit-source-id: 5c531582b20f3cf727d1aa91239b4d5a2b8a7c1f
Summary: Updates revise_recurrent_network_op(), which supports cloning recurrent networks by adding a blob-name prefix to string arguments to maintain correspondence. It previously relied on many hard-coded indices referring to the positions of arguments and inputs of RecurrentNetworkOp and its corresponding gradient operator, and therefore broke when the implementation changed. This fix should make it more general and robust.
Differential Revision: D4559768
fbshipit-source-id: fb85b0b1ffb1393dc84760d6ae5dc473e8b764b0
Summary:
I had forgotten to remove this one. The rest of the switch to indexing instead of string names is coming after D4446813 lands, since scratches aren't inputs or outputs and thus can't be indexed.
Reviewed By: urikz
Differential Revision: D4465748
fbshipit-source-id: 2ccbedfb35541ef4a2231d1480eef59025bd5290
Summary: On some inputs TestWarden was failing
Reviewed By: Yangqing
Differential Revision: D4487293
fbshipit-source-id: 3da4b310a619c2b57f033b2dd7727f71403bfd68
Summary: This diff uses stack workspaces in RecurrentNetwork, which allows us to simplify the implementation and get rid of scratches.
Reviewed By: salexspb
Differential Revision: D4446813
fbshipit-source-id: 514eec7e4300bdf492a9cb192b40cf4f89acf656
Summary: Remove the usage of recurrent_sizes, so the recurrent states' sizes can depend on the input (e.g. the attention matrix for the beam decoder). I removed recurrent_sizes from both the forward and backward steps.
Reviewed By: salexspb
Differential Revision: D4427688
fbshipit-source-id: 580420a294d309c86ec5cb4e677058623b7228e1
Summary:
In this diff I stop passing parameters by name and also remove the hardcoded output ids that were there specifically to make LSTM work. This also allows us to avoid using recurrent_sizes in the backward pass (for the forward pass this is done in D4427688).
Using a similar technique it should be simple enough to eliminate blob-name passing entirely. Then we can fix scoping. Both can be done in a follow-up diff.
Reviewed By: urikz
Differential Revision: D4444614
fbshipit-source-id: 3580a76365502b9f2f09e3d8b7e78084ca739f00
Summary:
This is a first step in improving our RNN story. It provides a wrapper around the current RecurrentNetworkOp implementation which infers most of the redundant parameters and makes the API much simpler.
Also, in order to support general step nets, I added an extra argument to RecurrentNetworkOp.
Future work:
1. Infer the step net's output and internal blob (scratch) sizes and types
2. Avoid accessing blobs by name in the C++ part
3. Remove the requirement for 1:1 input/output correspondence in the step net
4. Make the Python API support networks with operators like Sum on the border of the cell net (currently there is an issue with such networks where gradient blobs on the boundary are not explicitly created).
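For orientation, a hedged usage sketch of the wrapper; the function name, argument names, and return value below are assumptions for illustration, not necessarily the exact final API:

```
from caffe2.python import model_helper, recurrent

model = model_helper.ModelHelper(name='lstm_example')

# Instead of hand-wiring RecurrentNetworkOp (step net, links, aliases, ...),
# the wrapper takes just the sequence input, its lengths, the initial
# hidden/cell states and the layer sizes, and infers the rest.
lstm_outputs = recurrent.LSTM(       # assumed wrapper name/signature
    model,
    input_blob='input',              # (T, N, dim_in) input sequence
    seq_lengths='seq_lengths',       # (N,) valid length of each sequence
    initial_states=('hidden_init', 'cell_init'),
    dim_in=64,
    dim_out=128,
    scope='lstm1',
)
```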
Differential Revision: D4268503
fbshipit-source-id: f8a66491c2b55daa730caeed7e9f2b3921541b49