There are small typos in:
- caffe2/python/recurrent.py
- test/distributed/test_c10d_nccl.py
- test/test_fx.py
- torch/csrc/jit/runtime/autodiff.cpp
- torchgen/gen.py
Fixes:
- Should read `propagation` rather than `propogation`.
- Should read `multiplied` rather than `multuplied`.
- Should read `eliminate` rather than `elminate`.
- Should read `dispatcher` rather than `disaptcher`.
Semi-automated pull request generated by
https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81435
Approved by: https://github.com/ngimel
Summary:
There is a tool called `2to3` which you can run with the `future` fixer specifically to remove these; the `caffe2` directory has the most redundant imports:
```2to3 -f future -w caffe2```
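For context, the `future` fixer simply deletes `from __future__ import ...` statements, which are no-ops on Python 3. A minimal illustration (the file contents are made up, not taken from the PR):
```python
# before.py (illustrative; not an actual file from the PR)
from __future__ import absolute_import, division, print_function, unicode_literals

import math

print(math.pi)

# Running `2to3 -f future -w before.py` rewrites the file in place, deleting
# only the `from __future__ import ...` line; everything else stays as-is.
```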
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033
Reviewed By: seemethere
Differential Revision: D23808648
Pulled By: bugra
fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
Summary: Remove scoping assertion because it is not useful and causes errors
Reviewed By: salexspb
Differential Revision: D6538219
fbshipit-source-id: e587e294d4beec1370e6895af9354f0818a4cdd8
Summary: Before this diff, RNNOp was using TextFormat for representing steps. This diff changes RNNOp to prefer a NetDef argument instead. To be backward compatible it still supports TextFormat for existing models, though we can compile RNNs without TextFormat as well.
Reviewed By: salexspb
Differential Revision: D5949330
fbshipit-source-id: 9336a8f5ccf30ad8d8e3a7067b9437e1704b1c9f
Summary: The RNN executor previously relied on getting the mapping from x to x_prev (and gradients) from recurrent.py, but we can just infer them from the links. This makes all models compatible with the RNN executor, given the enable_rnn_executor=1 argument.
Reviewed By: jamesr66a
Differential Revision: D5801436
fbshipit-source-id: 14d0e26dfbad6347f645d907da493187c98e9b17
Summary: As title. Made the configurations op-specific since many models run multiple RNNs.
Reviewed By: jamesr66a
Differential Revision: D5796208
fbshipit-source-id: 88173879dfff9f3f7bf583ccc4f4c6385cca5aca
Summary:
Special executor for RNNs which can exploit parallelism over timesteps. On CPU we use multi-threading, achieving roughly a 3x improvement on 4-layer LSTMs.
With CUDA, perf improvements are more modest, but the structure allows for further optimization. On CUDA, we use multiple streams and events when there is parallelism
over timesteps. In my experiments, using more than 2 streams did not help, though.
The flag --caffe2_rnn_executor can be used to switch the executor off.
Reviewed By: salexspb
Differential Revision: D5749304
fbshipit-source-id: d6f76b3e16598be5b4e8188aff031671ebafaa4c
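The executor itself is C++; purely as a conceptual illustration of "parallelism over timesteps", here is a small Python sketch (the op list, its dependency field, and the two-worker pool are all made up) of running ops from different timesteps concurrently once their dependencies have finished:
```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_rnn_ops(ops, num_threads=2):
    # ops: list of (name, dependency_names, callable); an op is launched as
    # soon as all of its dependencies are done, so ops belonging to
    # different timesteps can overlap on the worker pool.
    done, in_flight = set(), {}
    pending = list(ops)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        while pending or in_flight:
            ready = [op for op in pending if set(op[1]) <= done]
            for op in ready:
                pending.remove(op)
                in_flight[pool.submit(op[2])] = op[0]
            finished, _ = wait(in_flight, return_when=FIRST_COMPLETED)
            for fut in finished:
                fut.result()                 # surface exceptions, if any
                done.add(in_flight.pop(fut))

# Usage: the t1 elementwise op depends on gemms from two timesteps,
# which the two workers (mirroring the "2 streams" observation) may overlap.
run_rnn_ops([
    ("t0/gemm", [], lambda: None),
    ("t1/gemm", [], lambda: None),
    ("t1/elementwise", ["t0/gemm", "t1/gemm"], lambda: None),
])
```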
Summary:
Forward-only mode had broken at some point. Two things: RNNCell did not pass the parameter to recurrent.py, and recurrent.py itself was broken when forward_only=True after the python3 codemod.
Added a test to rnn_cell_test that actually checks the forward_only parameter is passed, to prevent future breakage.
Reviewed By: jmp84
Differential Revision: D5639306
fbshipit-source-id: b1bbc39d59c3f3734b2f40a1c2f3740c733e0bd4
Summary:
This diff adds dependency-aware concurrent/parallel execution of operators in stepnets. For CPU, we use multi-threaded execution. For CUDA, we use multiple streams and cuda events for parallelism and dependency tracking.
Much of the diff is about computing the dependency graph, which was quite tricky because we also need to avoid write races between operators running in multiple timesteps in parallel. Also, recurrent blobs "change name" when passing over a timestep boundary ("_prev"), so that needs to be handled as well.
This diff also restores the link-ops that I unlanded earlier.
The performance gain of this diff is very good for CPU (same perf as with static_dag, even better on forward-only). On CUDA, the gains are modest, at least with the sizes I was testing with.
Reviewed By: salexspb
Differential Revision: D5001637
fbshipit-source-id: 3d0a71593d73a9ff22f4c1a5c9abf2a4a0c633c8
Summary:
Added operator RecurrentNetworkBlobFetcherOp that takes as input a scratch workspace name and prefix, and copies over all blobs in the scratch workspace into the global workspace. This essentially extracts all intermediate recurrent network computation for each timestep.
Added a wrapper in recurrent.py - retrieve_step_blobs(net, prefix='rnn') - which, when called after an RNN is run, will return a list of all blobs extracted from the net.
Reviewed By: akyrola
Differential Revision: D5421926
fbshipit-source-id: 0f35b466d77d3c719fb0e32de7dbcafc6c0d5225
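A hedged usage sketch of the wrapper named in the summary above (not taken from the diff; `pred_net` is a hypothetical, already-built Caffe2 net containing a RecurrentNetwork op, and the return value is assumed to be a list of blob names):
```python
from caffe2.python import recurrent, workspace

workspace.RunNetOnce(pred_net)                  # run the RNN once
step_blob_names = recurrent.retrieve_step_blobs(pred_net, prefix='rnn')
for name in step_blob_names:
    # Per the summary, each per-timestep blob has been copied into the
    # global workspace, so it can be fetched and inspected directly.
    print(name, workspace.FetchBlob(name).shape)
```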
Summary:
Static RNN allows unrolling an RNN into a Caffe2 graph using all the existing cell abstractions. In this diff I introduce several new tests that already caught a few bugs in our RecurrentNetworkOp gradient accumulation logic by comparing it to an unrolled version.
Another use case is perf: potentially we can run an unrolled net faster because DAGNet will have access to the whole graph. The same goes for memonger. But that work is not part of this diff.
Reviewed By: akyrola
Differential Revision: D5200943
fbshipit-source-id: 20f16fc1b2ca500d06ccc60c4cec6e81839149dc
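For readers unfamiliar with the term, "unrolling" just means emitting the cell ops once per timestep into an ordinary graph; a minimal numpy sketch with a toy tanh cell (shapes and cell are illustrative, not the real cell abstractions):
```python
import numpy as np

def unrolled_rnn(x, h0, W_ih, W_hh, b):
    # x: (T, batch, d_in). The loop body is emitted once per timestep, so the
    # "unrolled" graph contains T copies of the same ops sharing one weight set.
    h, outputs = h0, []
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ W_ih + h @ W_hh + b)
        outputs.append(h)
    return np.stack(outputs), h

T, batch, d_in, d_hid = 5, 2, 3, 4
rng = np.random.default_rng(0)
y, h_T = unrolled_rnn(rng.normal(size=(T, batch, d_in)),
                      np.zeros((batch, d_hid)),
                      rng.normal(size=(d_in, d_hid)),
                      rng.normal(size=(d_hid, d_hid)),
                      np.zeros(d_hid))
print(y.shape, h_T.shape)   # (5, 2, 4) (2, 4)
```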
Summary:
There is an edge case where internal gradient blobs of the backward step net should not be considered internally calculated if the only "internal" calculation is in-place.
In the case of the failing attention unit tests, the offending blob was attention_weighted_encoder_context_grad, which was incorrectly considered internal because it was the output (as well as input) of a Reshape on the step net's edge. The caveat here is that the results may be unpredictable if a non-pass-through in-place operation is applied to a blob within a step net which is also consumed both internally and is a recurrent state/output. (This is an extreme edge case, and difficult to explicitly enforce, but it's worth noting.)
Reviewed By: salexspb
Differential Revision: D5198328
fbshipit-source-id: 0cfa8f903fd767fc50e727f238ac3d8cdca03fe0
Summary: These return views in Python 3, which would not do anything in a lot of usages currently present in Caffe2. This diff simply removes (almost) all usages of these two in Caffe2 and sub-projects in favor of comprehensions, which are also easier to read/understand.
Reviewed By: akyrola
Differential Revision: D5142049
fbshipit-source-id: e800631d2df7d0823fed698cae46c486038007dc
Summary: As noted by salexspb, MultiRNNCell had unreliable gradient computation. The problem was that the recurrent gradient and the gradient computed within the backward step net were not being accumulated during the backward pass, but were instead written to the same blob, overwriting each other. This diff fixes that by artificially introducing an extra blob for the internal output, and then accumulating it into the gradient coming from the recurrent connection.
Reviewed By: salexspb
Differential Revision: D5110059
fbshipit-source-id: 16add50989fe8866361bbc21afce5f214c5292fd
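A toy numpy illustration of the bug class being fixed (blob names are made up): if the in-cell gradient and the gradient arriving over the recurrent connection target the same blob, one overwrites the other, whereas the fix stages the in-cell contribution in an extra blob and accumulates:
```python
import numpy as np

recurrent_grad = np.array([1.0, 2.0])   # gradient arriving from h_{t+1}
internal_grad = np.array([0.5, 0.5])    # gradient produced inside the step net

# Buggy behaviour: both writers target the same blob name, so the last
# write simply replaces the recurrent contribution.
blobs = {"h_grad": recurrent_grad.copy()}
blobs["h_grad"] = internal_grad

# Fixed behaviour: the internal output goes to an extra blob and is then
# accumulated into the gradient coming over the recurrent connection.
blobs["h_grad_internal"] = internal_grad
blobs["h_grad_fixed"] = recurrent_grad + blobs["h_grad_internal"]

print(blobs["h_grad"], blobs["h_grad_fixed"])   # [0.5 0.5] [1.5 2.5]
```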
Summary:
This is preamble for the "diagonal executor". Instead of creating a Net for each timestep, we have a single executor for the RecurrentNetworkOp that manages ops per timestep.
This will be used if net_type='rnn', so one can still use the old way by using a net type of 'simple' or 'dag' (so there is an effective kill-switch if there are issues with this).
Did this only for the forward model. The gradient op will follow later on, but it is basically similar, just in reverse order.
Reviewed By: salexspb
Differential Revision: D4979933
fbshipit-source-id: bda77918ec518cb6b29d7021ee036d59eb2dd303
Summary:
This is useful when data has standalone sequences which are
not connected to each other by any meaningful context
Reviewed By: yqwangustc
Differential Revision: D4835164
fbshipit-source-id: f95626acc26acc3eba3bca7efb08ed1dbdb36c83
Summary:
The issue is that AliasOp doesn't work well with the swaps that we do for
param.grad and param.accGrad. Tensors become the same if there is no
reallocation of the gradient tensor inside the backward cell net's
local workspace.
bug explanation from akyrola:
```
gpu_0/decoder/decoder_hidden_encoder_outputs_sum_grad: tensor A
on each timestep back to 0, we Alias
gpu_0/decoder/weighted_encoder_outputs_grad,
so then also
gpu_0/decoder/weighted_encoder_outputs_grad: tensor A
It's acc is:
gpu_0/decoder/weighted_encoder_outputs_grad_acc: tensor B
Now after timesteps, we swap (line 626) with _acc to get
gpu_0/decoder/weighted_encoder_outputs_grad: tensor B
gpu_0/decoder/weighted_encoder_outputs_grad_acc: tensor A
OPTION A -- batch size is same as before or smaller:
Then on next iteration, we do again the Alias to
gpu_0/decoder/decoder_hidden_encoder_outputs_sum_grad, so now
gpu_0/decoder/weighted_encoder_outputs_grad: tensor A
and also
gpu_0/decoder/weighted_encoder_outputs_grad_acc: tensor A
swapping them does nothing and they are the same
OPTION B -- batch size increases
gpu_0/decoder/decoder_hidden_encoder_outputs_sum_grad is reallocated,
becomes tensor C
gpu_0/decoder/weighted_encoder_outputs_grad becomes tensor C with
Alias
gpu_0/decoder/weighted_encoder_outputs_grad_acc: is tensor A
```
Reviewed By: urikz
Differential Revision: D4946730
Tags: rnn, caffe2
fbshipit-source-id: b52d63cb238b81d2ad40e05e70deb32a81336f47
Summary:
Added a forward_only option to recurrent_net and the RNNCells. If this is set, the backward_step_net is not passed to the operator.
When backward_step_net is not available, the operator knows it is in forward_only mode and does not create a workspace for each step, but instead cycles
through a single private workspace.
Note: we could avoid doing a lot of work in the recurrent.py:recurrent_network call when the backward step is not needed, but doing that nicely requires
more refactoring than I wanted to do now. Thus, we still create the backward step nets etc., we just don't pass them to the op.
This can be used to create more efficient inference models. You can also sanitize existing inference nets and remove the backward_step_net argument to
get the benefits.
Reviewed By: salexspb
Differential Revision: D4916482
fbshipit-source-id: c99b93c9cb897c32b0f449253f7f6d6a942618ad
Summary: This is the nice way to re-use RNN layers for training and for inference.
Reviewed By: salexspb
Differential Revision: D4825894
fbshipit-source-id: 779c69758cee8caca6f36bc507e3ea0566f7652a
Summary: Add an option to bias the forget gate one way or another by adding in some float value before the sigmoid is applied.
Differential Revision: D4880712
fbshipit-source-id: 1306a97c29fb31630838b2f96597a46e952d940a
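A minimal numpy sketch of the option, assuming a standard LSTM forget gate with the described float added before the sigmoid (weight names and shapes are illustrative):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(x, h_prev, W_f, U_f, b_f, forget_bias=0.0):
    # Adding a positive constant before the sigmoid starts the gate near 1
    # ("remember by default"); a negative value biases it toward forgetting.
    return sigmoid(x @ W_f + h_prev @ U_f + b_f + forget_bias)

rng = np.random.default_rng(0)
x, h = rng.normal(size=(2, 3)), rng.normal(size=(2, 4))
W, U, b = rng.normal(size=(3, 4)), rng.normal(size=(4, 4)), np.zeros(4)
print(forget_gate(x, h, W, U, b, forget_bias=2.0).mean())  # pushed toward 1
```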
Summary:
prof_dag in step net is not supported
(Note: this ignores all push blocking failures!)
Differential Revision: D4876551
fbshipit-source-id: 4003e60908e51ef052f8656bf527b326676c298c
Summary:
This diff adds an option to recurrent_net to define some cell blobs to be recomputed on the backward step, so they don't need to be stored in the step workspace. This is done by modifying the backward step to automatically include all operators that are needed to produce the output that is to be recomputed, and by storing those blobs in a shared workspace. To enable the shared workspace, I had to modify the stepworkspaces blob to also store a forward shared workspace. Making it a class field won't work since the lifecycle of the blob does not match the lifecycle of the operator.
For basic LSTM, the performance hit is quite modest (about 15% with one setting, but your mileage may vary). For attention models, I am sure this is beneficial, as computing the attention blobs is not expensive.
For basic LSTM, the memory saving is wonderful: each forward workspace only has 4 bytes (for timestep).
I also modified the neural_mt LSTM Cells, but there is no test available, so I am not 100% sure I did it correctly. Please have a look.
Added options to LSTM, MILSTM and LSTMAttention to enable memory mode.
Reviewed By: urikz
Differential Revision: D4853890
fbshipit-source-id: d8d0e0e75a5330d174fbfa39b96d8e4e8c446baa
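To illustrate the recompute-on-backward idea with a toy tanh cell (numpy, not the real cell abstractions or workspace machinery): the forward pass keeps only inputs and hidden states, and the backward pass rebuilds the dropped pre-activations on the fly:
```python
import numpy as np

def forward_cheap(x, h0, W):
    # Store only what the backward pass cannot cheaply rebuild:
    # the inputs and the per-step hidden states.
    hs = [h0]
    for t in range(len(x)):
        hs.append(np.tanh(x[t] @ W + hs[-1]))
    return hs                      # no pre-activation blobs are kept

def backward_recompute(x, hs, W, dh_T):
    dh, dW = dh_T, np.zeros_like(W)
    for t in reversed(range(len(x))):
        pre = x[t] @ W + hs[t]     # recomputed, not loaded from a workspace
        dpre = dh * (1.0 - np.tanh(pre) ** 2)
        dW += x[t].T @ dpre
        dh = dpre                  # gradient w.r.t. hs[t]
    return dW

rng = np.random.default_rng(0)
x = [rng.normal(size=(2, 4)) for _ in range(3)]
W = rng.normal(size=(4, 4))
hs = forward_cheap(x, np.zeros((2, 4)), W)
print(backward_recompute(x, hs, W, np.ones((2, 4))).shape)  # (4, 4)
```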
Summary: As said in the title. This should save a lot of memory if using both train and test workflows.
Reviewed By: jhcross
Differential Revision: D4855436
fbshipit-source-id: 9eeca548eee118e07bd587c46f40e7beb138318e
Summary:
Quite a large diff to make cuDNN LSTM and our LSTM produce the same results and to provide a Python API for the cuDNN LSTM.
* Added operators RecurrentParamGet and RecurrentParamSet to access weights and biases for the different gates, input/recurrent.
* Removed RecurrentInit as not needed
* recurrent.cudnn_LSTM() returns a special net and mapping that can be used to retrieve the parameters from the LSTM
* recurrent.cudnn_LSTM() can be passed blobs that have the parameters for the individual gate weights and biases
* recurrent.InitFromLSTMParams() can be used to initialize our own LSTM from cuDNN params. This way we can test whether cuDNN and our own implementation produce the same result.
recurrent_test.py tests for equivalence.
Reviewed By: salexspb
Differential Revision: D4654988
fbshipit-source-id: 6c1547d873cadcf33e03b0e0110248f0a7ab8cb0
Summary:
Uses the cudnnTransformTensor function. It works by shuffling the strides according to the transpose axes. Significant speedup over the current GPU version.
+ moves the transpose test under utility_ops, because hypothesis_test is too big
Reviewed By: jamesr66a
Differential Revision: D4810993
fbshipit-source-id: 82577c4ced1389e70bd5992820ae4d8297a3817f
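The underlying observation, that a transpose is just a permutation of strides over the same buffer, is easy to see in numpy (illustrative only; the actual kernel here is cudnnTransformTensor):
```python
import numpy as np

x = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
axes = (2, 0, 1)

y = np.transpose(x, axes)                               # no data copied here
print(y.strides == tuple(x.strides[a] for a in axes))   # True: strides permuted
print(np.shares_memory(x, y))                           # True: same buffer

# A transform-tensor-style op then materializes the permuted layout in new memory.
y_contig = np.ascontiguousarray(y)
print(y_contig.strides)                                 # ordinary row-major strides
```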
Summary:
This is pretty tricky to explain, but we can just use
backward_links. This way the whole cell would use a blob from the
states_grad tensor instead of having its own blob. This should also
save a bit of memory.
Differential Revision: D4770798
fbshipit-source-id: 673f85b2c2fdf42c47feeaa24d1e2bf086f012f9
Summary: We accumulate the values of this blob (param_grad) in another special internal blob anyway.
Differential Revision: D4768643
fbshipit-source-id: a9d08b7eafd25f278a8db722f9cdb1d0064b852a
Summary: Apart from copying gradient blobs for inputs with initial_cell_input, we needed to perform a similar operation for external parameters used by the step net
Reviewed By: salexspb
Differential Revision: D4752259
fbshipit-source-id: 13ee48cf583ed86221a4cc1cc9f57f5c3a7d2450
Summary: This didn't work for a reason specified in the comments. Also some cleanup in the unit tests; inference now uses a custom workspace to run the cell net on.
Reviewed By: urikz
Differential Revision: D4742670
fbshipit-source-id: 04165c029fddec5ae31b20b207faf06d2fa20816
Summary: D4734505 part 2. Remove more instances of the batch_size parameter
Reviewed By: urikz
Differential Revision: D4736906
fbshipit-source-id: fc9d374e9308017d61c427890364c5ab9cec2edf
Summary: Reshape based on tensor shapes in the graph rather than based on a passed-in batch_size parameter
Reviewed By: urikz
Differential Revision: D4702086
fbshipit-source-id: c4c1d8425cd36c1e86695918eaba2667c27e9601
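In plain numpy terms, the change amounts to reading the batch dimension off the tensor itself instead of threading a batch_size argument through the model (shapes are illustrative):
```python
import numpy as np

def flatten_states(states):
    # states: (T, batch, dim) -- batch is read off the input, not passed in
    T, batch, dim = states.shape
    return states.reshape(T * batch, dim)

x = np.zeros((7, 3, 16))
print(flatten_states(x).shape)   # (21, 16); works for any batch size
```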
Summary: For example, test and train nets could have shared workspaces, leading to a race condition. This adds an assertion and a running counter to the workspace-blob name.
Reviewed By: jhcross
Differential Revision: D4712152
fbshipit-source-id: 808d7069095bac24ebfe0c9d31ebd134f4cf0956
Summary:
First, this diff includes a full test of data-parallel LSTM, which confirms it works correctly. To make it work, some changes had to be made:
- cell net/step net external inputs must be namespace scoped
- prevent double-namescoping of cellnet inputs
- make data parallel model understand recurrentnets so the device-mapping works
Reviewed By: salexspb
Differential Revision: D4708840
fbshipit-source-id: 4b0ddc43642d449076a2b6f67ad1c47f84138ff4
Summary:
Created a new function with the specifics of the MI LSTM implementation in Caffe2.
See https://arxiv.org/pdf/1606.06630.pdf for details.
See D4478877 for the implementation of the same in tensorflow
Reviewed By: jhcross
Differential Revision: D4669882
fbshipit-source-id: 095bbcf187dbdac2cd79558ff0c8f9f67d8af639
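For reference, my reading of the multiplicative-integration building block from the cited paper, as a minimal numpy sketch (names and shapes are illustrative, not the Caffe2 implementation):
```python
import numpy as np

def mi_preactivation(x, h, W, U, alpha, beta1, beta2, b):
    # Vanilla gate pre-activation:  x @ W + h @ U + b
    # Multiplicative integration (arXiv:1606.06630):
    #   alpha * (x @ W) * (h @ U) + beta1 * (h @ U) + beta2 * (x @ W) + b
    Wx, Uh = x @ W, h @ U
    return alpha * Wx * Uh + beta1 * Uh + beta2 * Wx + b

rng = np.random.default_rng(0)
x, h = rng.normal(size=(2, 3)), rng.normal(size=(2, 4))
W, U = rng.normal(size=(3, 4)), rng.normal(size=(4, 4))
ones, zeros = np.ones(4), np.zeros(4)
# With alpha=0 and beta1=beta2=1 this degenerates to the ordinary additive form.
print(np.allclose(mi_preactivation(x, h, W, U, zeros, ones, ones, zeros),
                  x @ W + h @ U))  # True
```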
Summary: Super rough implementation of recurrent attention. Planning to factor out the common code between the two functions as well as train and eval. I want to get this out and get eyes on it sooner rather than later
Differential Revision: D4647837
fbshipit-source-id: 54bc4e8ed0df6f04c86c425926decbe89f73b068
Summary:
Implementation of ##LSTMWithAttention##
Still TBD:
1. There are problems with backpropagation, because the gradient is not implemented for ops with broadcasting
2. I need to make initial_recurrent_state be of shape [dim] rather than [1, batch_size, dim], so one doesn't need to provide batch_size to LSTMWithAttention
Differential Revision: D4298735
fbshipit-source-id: 8903fcff4d6a66647ee6d45a6ef28803fc3091e5
Summary:
Pass through the h-value recurrent output unchanged at each LSTM step beyond the valid part of a sequence (computed based on seqLengths, allowing batching of sequences of different length). This enables using the final-step output of each sequence as the output when one vector is desired for the entire sequence. The gradient is also passed back unchanged.
Also made some cosmetic changes to recurrent_network_test.py (seq_lengths offset corrected, should be in [1, T] rather than [0, T-1]).
Reviewed By: urikz
Differential Revision: D4540307
fbshipit-source-id: 73a9f6326069d713dcb0cdc8d17869317c6dbe96
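A numpy sketch of the per-timestep rule described above (toy stand-in for the cell; seq_lengths holds each sequence's valid length in [1, T]):
```python
import numpy as np

def step_with_passthrough(t, h_prev, h_candidate, seq_lengths):
    # Within the valid part of a sequence take the freshly computed state;
    # past its end, copy the previous state through unchanged, so the last
    # timestep holds each sequence's final valid output.
    valid = (t < seq_lengths)[:, None]          # (batch, 1) boolean mask
    return np.where(valid, h_candidate, h_prev)

seq_lengths = np.array([3, 1])                  # batch of 2, max T = 3
h = np.zeros((2, 4))
for t in range(3):
    h_candidate = np.full((2, 4), float(t + 1)) # stand-in for the real cell output
    h = step_with_passthrough(t, h, h_candidate, seq_lengths)
print(h[:, 0])                                  # [3. 1.]: each sequence's final valid step
```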
Summary:
(Caffe2) Modified RecurrentNetworkGradient operator so that training is possible with any of the output blob(s) receiving gradient during the backward pass. This is realized through a new argument for the RecurrentNetwork op, outputs_with_grads, which takes a list of the indices of the output blobs which will receive gradient. The default case (only receiving gradient from the first output blob) remains the default.
New unit test covers the case where outputs_with_grads = [1, 2] using Python LSTM wrapper.
Reviewed By: urikz
Differential Revision: D4518516
fbshipit-source-id: 5c531582b20f3cf727d1aa91239b4d5a2b8a7c1f
Summary: Updates the function revise_recurrent_network_op(), which supports cloning recurrent networks by adding a blob-name prefix to string arguments to maintain correspondence. It previously relied on many hard-coded indices referring to the positions of arguments and inputs of RecurrentNetworkOp and its corresponding gradient operator, and therefore broke when the implementation changed. This fix should make it more general and robust.
Differential Revision: D4559768
fbshipit-source-id: fb85b0b1ffb1393dc84760d6ae5dc473e8b764b0