Summary:
There is a tool called `2to3` whose `future` fixer can be targeted specifically to remove these; the `caffe2` directory has the most redundant imports:
```2to3 -f future -w caffe2```
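For reference, the `future` fixer just deletes `__future__` imports, which are no-ops on Python 3; a before/after sketch:
```python
# Before: a typical redundant header that the `future` fixer removes.
from __future__ import absolute_import, division, print_function, unicode_literals

print("hello")  # rest of the file is unchanged

# After running `2to3 -f future -w` on the file, the __future__ line is simply
# deleted, since all of these behaviors are the defaults on Python 3.
```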
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033
Reviewed By: seemethere
Differential Revision: D23808648
Pulled By: bugra
fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
Summary:
The goal of this PR is to unify the cuda and hip device types in the caffe2 Python front end.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14221
Differential Revision: D13148564
Pulled By: bddppq
fbshipit-source-id: ef9bd2c7d238200165f217097ac5727e686d887b
Summary: As title. Made the configurations op-specific since many models run multiple RNNs.
Reviewed By: jamesr66a
Differential Revision: D5796208
fbshipit-source-id: 88173879dfff9f3f7bf583ccc4f4c6385cca5aca
Summary:
Special executor for RNNs which can exploit parallelism over timesteps. For CPU we use multi-threading, achieving a 3x or so improvement on 4-layer LSTMs.
With CUDA, perf improvements are more modest, but the structure allows for optimizing it further. For CUDA, we use multiple streams and events if there is parallelism
over timesteps. In my experiments, it was not good to use more than 2 streams, though.
The flag --caffe2_rnn_executor can be used to switch the executor off.
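A hedged usage sketch (the flag name comes from this summary; the `--flag=value` boolean syntax is an assumption about the usual Caffe2/gflags-style parsing):
```python
from caffe2.python import workspace

# Sketch: turn the timestep-parallel RNN executor off from the Python front end.
workspace.GlobalInit([
    "caffe2",
    "--caffe2_rnn_executor=false",
])
```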
Reviewed By: salexspb
Differential Revision: D5749304
fbshipit-source-id: d6f76b3e16598be5b4e8188aff031671ebafaa4c
Summary:
This diff adds dependency-aware concurrent/parallel execution of operators in stepnets. For CPU, we use multi-threaded execution. For CUDA, we use multiple streams and cuda events for parallelism and dependency tracking.
Much of the diff is about computing the dependency graph, which was quite tricky because we also need to avoid write races between operators running in multiple timesteps in parallel. Also, recurrent blobs "change name" when passing over a timestep ("_prev"), so that needs to be handled as well (a toy sketch of the dependency computation follows below).
This diff also restores the link-ops that I unlanded earlier.
The performance gain of this diff is very good for CPU (same perf as with static_dag, even better on forward-only). On CUDA, the gains are modest, at least with the sizes I was testing with.
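A toy sketch of the kind of read/write dependency analysis described above; this is illustrative only and much simpler than the real implementation, which also handles the cross-timestep cases mentioned here:
```python
# Toy dependency-graph construction over step-net operators, based only on blob
# reads and writes. Purely illustrative; the real analysis in this diff also
# guards against write races across timesteps and handles the "_prev" renaming.
def build_dependencies(ops):
    """ops: list of (name, inputs, outputs). Returns {op: set of ops it must wait for}."""
    deps = {name: set() for name, _, _ in ops}
    last_writer = {}  # blob -> op that last wrote it
    readers = {}      # blob -> ops that read it since the last write
    for name, inputs, outputs in ops:
        for blob in inputs:   # read-after-write dependency
            if blob in last_writer:
                deps[name].add(last_writer[blob])
            readers.setdefault(blob, set()).add(name)
        for blob in outputs:  # write-after-read and write-after-write dependencies
            deps[name].update(readers.pop(blob, set()))
            if blob in last_writer:
                deps[name].add(last_writer[blob])
            last_writer[blob] = name
        deps[name].discard(name)
    return deps

# Example: FC -> ReLU (in place) -> Loss gives a simple chain of dependencies.
print(build_dependencies([
    ("fc", ["x", "w"], ["h"]),
    ("relu", ["h"], ["h"]),
    ("loss", ["h", "label"], ["l"]),
]))
```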
Reviewed By: salexspb
Differential Revision: D5001637
fbshipit-source-id: 3d0a71593d73a9ff22f4c1a5c9abf2a4a0c633c8
Summary:
This works as a standalone Python script because args are global. When used from Flow for monitoring purposes it doesn't work. This diff fixes that.
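The usual shape of this kind of fix, as a generic hedged sketch (hypothetical names, not the script's actual ones): pass the parsed args explicitly instead of relying on a module-level global that only exists when the file runs as a script.
```python
import argparse

def benchmark(args):
    # Accept the parsed namespace as a parameter instead of reading a module-level
    # `args` global, so callers such as a Flow monitoring job can build the
    # namespace themselves and call this function directly.
    return args.hidden_dim * args.seq_length

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--hidden_dim", type=int, default=40)
    parser.add_argument("--seq_length", type=int, default=20)
    benchmark(parser.parse_args())

if __name__ == "__main__":
    main()
```
A caller can then invoke benchmark(argparse.Namespace(hidden_dim=40, seq_length=20)) without touching sys.argv.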
Reviewed By: zem7
Differential Revision: D5349996
fbshipit-source-id: f73842901d975b783e09e9db0565eb81880bbea1
Summary:
A couple of fixes for the broken reporting of lstm_benchmark:
- last_time must be recorded after warm-up
- entry count was incorrectly removed
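A minimal sketch of the timing pattern these fixes restore (hypothetical names, not the benchmark's actual code): the baseline timestamp and entry counter start counting only after warm-up.
```python
import time

def run_benchmark(run_iteration, warmup_iters, measure_iters, entries_per_iter):
    for _ in range(warmup_iters):
        run_iteration()                 # allocations etc. happen here
    last_time = time.time()             # record the baseline AFTER warm-up
    entries = 0
    for _ in range(measure_iters):
        run_iteration()
        entries += entries_per_iter     # keep the entry count; don't drop it
    return entries / (time.time() - last_time)  # entries per second
```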
Reviewed By: salexspb
Differential Revision: D5349890
fbshipit-source-id: 5dd5bdf46594c520b61bc3b57b153f90a6a17903
Summary:
While this is not intended to be the most performant and general solution, we can see from the test plan that in some cases a static DAG RNN can perform better than our own implementation. Hopefully we will get dynamic RNN DAG execution to be at least as fast as this one; then we will not need this one in production, only for testing.
Still putting it into our benchmark for comparison purposes.
Reviewed By: akyrola
Differential Revision: D5210038
fbshipit-source-id: fa44baf51c455872abd6ec5f5d151cf06e15b1fa
Summary:
Use rnn_cell's multi-cell for the LSTM benchmark. While doing this, I had not changed the initial_states and got an inconsistent result from rnn_cell, so I added an assertion that checks the initial states length is 2 * num_layers.
+ fix a division-by-zero error
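The added check amounts to something like this (a sketch, not the exact diff):
```python
def check_initial_states(initial_states, num_layers):
    # Each LSTM layer carries two initial state blobs (hidden state and cell
    # state), so a stacked LSTM expects exactly 2 * num_layers initial states.
    assert len(initial_states) == 2 * num_layers, (
        "Expected %d initial states for %d layers, got %d"
        % (2 * num_layers, num_layers, len(initial_states)))
```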
Reviewed By: salexspb
Differential Revision: D5003177
fbshipit-source-id: a8250b825394c352428a0f067098dfcd7516ab2a
Summary:
We need a warm-up stage because otherwise the first iteration spends too much time doing all the allocations.
Reviewed By: akyrola
Differential Revision: D4986201
fbshipit-source-id: f60a75520988ff3f1540bb157cdc69634f307db4
Summary:
This is useful when data has standalone sequences which are
not connected to each other by any meaningful context
Reviewed By: yqwangustc
Differential Revision: D4835164
fbshipit-source-id: f95626acc26acc3eba3bca7efb08ed1dbdb36c83
Summary:
Added a forward_only option to recurrent_net and the RNNCells. If this is set, the backward_step_net is not passed to the operator.
When backward_step_net is not available, the operator knows it is in forward-only mode and does not create workspaces for each step, but cycles through only one private workspace.
Note: we could avoid doing a lot of work in the recurrent.py:recurrent_network call when the backward step is not needed, but doing that nicely requires more refactoring than I wanted to do now. Thus, we still create the backward step nets etc., but just don't pass them to the op.
This can be used to create more efficient inference models. You can also sanitize existing inference nets and remove the backward_step_net argument to
get the benefits.
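A hedged usage sketch of the option (the keyword follows this summary; the exact rnn_cell.LSTM signature may differ between Caffe2 versions):
```python
# Sketch: build an inference-only LSTM so that no backward_step_net is attached
# and the RecurrentNetwork op cycles through a single private workspace.
from caffe2.python import model_helper, rnn_cell

model = model_helper.ModelHelper(name="lstm_inference")
lstm_outputs = rnn_cell.LSTM(
    model, "input_blob", "seq_lengths",
    initial_states=["hidden_init", "cell_init"],
    dim_in=40, dim_out=40, scope="lstm",
    forward_only=True,  # do not pass a backward step net to the op
)
```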
Reviewed By: salexspb
Differential Revision: D4916482
fbshipit-source-id: c99b93c9cb897c32b0f449253f7f6d6a942618ad
Summary: salexspb recognized that my diff fixing num_layers>1 cuDNN LSTM made it run much slower. Turns out this was caused by adding the dropout states to the gradient op (which it was missing; that was a bug). But since we use dropout=1.0, we don't need to initialize the dropout states, and it turns out this improves the perf of cuDNN LSTM very significantly, at least when hidden_dim is small (2.5x increase with hidden_dim=40). With large hidden_dim, the improvement is more modest.
Reviewed By: salexspb
Differential Revision: D4920543
fbshipit-source-id: 860c9d4c61793252f658dc5e3390bab571476be5
Summary:
If the command line flag caffe2_gpu_memory_tracking is enabled, CUDAContext will keep track of the total memory allocated on each GPU. This requires keeping track of the sizes of the pointers, so it might add some overhead, and is thus optional. The overhead is minimal in practice, though, since we usually don't do allocations after the first iterations.
Added an op GetGPUMemoryUsage() to fetch this data programmatically, and python function utils GetGPUMemoryUsageStats() to call this op and package the results. Modified LSTM benchmark to report these stats.
This tracking is GPU-only for now. CPU allocations are less organized.
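A hedged usage sketch tying the pieces together; the flag and function names come from this summary, while the exact module location (assumed here to be caffe2.python.utils) and the shape of the returned stats are assumptions:
```python
from caffe2.python import utils, workspace

# Enable the optional per-GPU allocation tracking (adds a little bookkeeping).
workspace.GlobalInit(["caffe2", "--caffe2_gpu_memory_tracking=1"])

# ... build and run some GPU nets here ...

# Described above as running the GetGPUMemoryUsage op and packaging the result;
# the exact structure of `stats` is an assumption.
stats = utils.GetGPUMemoryUsageStats()
print(stats)
```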
Reviewed By: asaadaldien
Differential Revision: D4877451
fbshipit-source-id: 857798fe499d8c78cc590783052cbb2d4db56ea0
Summary: This is a nice way to re-use RNN layers for training and for inference.
Reviewed By: salexspb
Differential Revision: D4825894
fbshipit-source-id: 779c69758cee8caca6f36bc507e3ea0566f7652a
Summary:
This diff adds an option to recurrent_net to define some cell blobs to be recomputed on the backward step, so they don't need to be stored in the step workspace. This is done by modifying the backward step to automatically include all operators that are needed to produce the output that is to be recomputed, and by storing those blobs in a shared workspace. To enable the shared workspace, I had to modify the stepworkspaces blob to also store a forward shared workspace. Making it a class field won't work since the lifecycle of the blob does not match the lifecycle of the operator.
For basic LSTM, the performance hit is quite modest (about 15% with one setting, but your mileage may vary). For attention models, I am sure this is beneficial, as computing the attention blobs is not expensive.
For basic LSTM, the memory saving is wonderful: each forward workspace only has 4 bytes (for timestep).
I also modified the neural_mt LSTM Cells, but there is no test available, so I am not 100% sure I did it correctly. Please have a look.
Added options to LSTM, MILSTM and LSTMAttention to enable memory mode.
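As with the forward_only option earlier in this log, a hedged sketch of enabling the memory mode; the memory_optimization keyword is an assumption based on this summary, and the exact rnn_cell signature may differ:
```python
# Sketch: enable the recompute-on-backward memory mode for a basic LSTM.
from caffe2.python import model_helper, rnn_cell

model = model_helper.ModelHelper(name="lstm_memopt")
lstm_outputs = rnn_cell.LSTM(
    model, "input_blob", "seq_lengths",
    initial_states=["hidden_init", "cell_init"],
    dim_in=40, dim_out=40, scope="lstm",
    memory_optimization=True,  # recompute selected cell blobs on the backward step
)
```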
Reviewed By: urikz
Differential Revision: D4853890
fbshipit-source-id: d8d0e0e75a5330d174fbfa39b96d8e4e8c446baa
Summary:
Added Caffe2 command line option --caffe2_print_blob_sizes_at_exit=1; when enabled, it prints all tensor sizes at the workspace destructor. Handy especially when using sub-workspaces, as with RNNs. Note that the sizes are numbers of elements, not bytes. The output is designed to be easy to copy-paste into Excel.
TODO: add sorting
Reviewed By: jamesr66a
Differential Revision: D4844628
fbshipit-source-id: 11608a1710ae5c89bbd741edb506d25496606185
Summary:
Quite a large diff to make cuDNN LSTM and our LSTM produce the same results, and to provide a Python API for the cuDNN LSTM.
* Added operators RecurrentParamGet and RecurrentParamSet to access weights and biases for the different gates, input/recurrent.
* Removed RecurrentInit as not needed
* recurrent.cudnn_LSTM() returns a special net and mapping that can be used to retrieve the parameters from the LSTM
* recurrent.cudnn_LSTM() can be passed blobs that have the parameters for the individual gate weights and biases
* recurrent.InitFromLSTMParams() can be used to initialize our own LSTM from cuDNN params. This way we can test whether cuDNN and our own implementation produce the same result.
recurrent_test.py tests for the equivalency
Reviewed By: salexspb
Differential Revision: D4654988
fbshipit-source-id: 6c1547d873cadcf33e03b0e0110248f0a7ab8cb0
Summary:
Weighted LabelCrossEntropyGradientKernel had a clowny loop over D. Since the operation is completely linear, we can just do it all in one parallel loop. Massive speedup: in my benchmark, from 4s to 20ms.
+ added weights to the lstm_benchmark
Reviewed By: jamesr66a
Differential Revision: D4800889
fbshipit-source-id: f9850bcc56ce34d5d7a613419cd172256633a894
Summary:
We did not parallelize over D, which can be very large, especially in RNN models. This speeds things up significantly: in my quick test with lstm_benchmark and nvprof, the time of RowMaxKernel dropped from 1.2s total to 0.28s total.
+ added SoftmaxWithLoss to the lstm_benchmark
Reviewed By: jamesr66a
Differential Revision: D4800629
fbshipit-source-id: 3400ea1064b1eb2793bc403df2c1b68801d545e5
Summary:
We should resize the workspace vector only when it grows. Otherwise we end up destroying and recreating workspaces constantly when the sequence length varies.
Modified the lstm_benchmark test to randomize the sequence length.
This provides a big perf improvement for the machine translation pipeline. Look at the recurrent network op runtimes; a sketch of the resizing policy follows the numbers below.
WITH:
I0328 12:17:54.073976 492094 prof_dag_net.cc:156] 136.271 ms/iter ( 120.987 ms/iter) RecurrentNetwork
I0328 12:17:54.073982 492094 prof_dag_net.cc:156] 190.074 ms/iter ( 156.828 ms/iter) RecurrentNetworkGradient
WITHOUT:
I0328 12:25:17.658206 518884 prof_dag_net.cc:156] 375.369 ms/iter ( 249.268 ms/iter) RecurrentNetwork
I0328 12:25:17.658211 518884 prof_dag_net.cc:156] 278.892 ms/iter ( 227.29 ms/iter) RecurrentNetworkGradient
With the LSTM benchmark, we get about a 2x speedup.
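The resizing policy described above, as a hedged standalone sketch (the real logic lives in the C++ RecurrentNetwork op; the names here are illustrative):
```python
class StepWorkspaces(object):
    """Only-grow pool of per-timestep workspaces: varying sequence lengths reuse
    existing workspaces instead of destroying and recreating them each run."""

    def __init__(self, make_workspace):
        self._make_workspace = make_workspace
        self._workspaces = []

    def for_sequence(self, seq_len):
        # Grow only when we see a longer sequence than ever before; never shrink.
        while len(self._workspaces) < seq_len:
            self._workspaces.append(self._make_workspace())
        return self._workspaces[:seq_len]
```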
Reviewed By: jamesr66a
Differential Revision: D4789354
fbshipit-source-id: ad72f61974e35b0474abcacdc466ae9c6b4eb0ff
Summary: Just generate some random data and put it through an LSTM (Caffe2 RNN based), using its own output as the gradient value for benchmark purposes. With the default parameters it fits my dev GPU memory, and with the default parameters provided in this diff I got 300k entries per second processed. These entries are split into blocks of seq_length * block_size. Each entry is of size hidden_dim; the LSTM takes in hidden_dim-sized input and produces output of the same size.
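A hedged sketch of the data layout described here (names and shapes are illustrative, not the benchmark's literal code):
```python
import numpy as np

seq_length, block_size, hidden_dim = 20, 64, 40

# One block of seq_length * block_size entries, each entry of size hidden_dim;
# the LSTM consumes hidden_dim-sized inputs and produces same-sized outputs.
inputs = np.random.randn(seq_length, block_size, hidden_dim).astype(np.float32)
seq_lengths = np.full(block_size, seq_length, dtype=np.int32)

# For benchmarking, the LSTM's own output is fed back as the gradient value,
# so no real loss computation is needed.
```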
Reviewed By: salexspb
Differential Revision: D4605815
fbshipit-source-id: dd529302a0a93e8711784c67e4c777c8d6a8cdf4