Commit Graph

12 Commits

Author SHA1 Message Date
Aapo Kyrola
9cb901caf0 Forward-only RNNs
Summary:
Added a forward_only option to recurrent_net and the RNNCells. If it is set, the backward_step_net is not passed to the operator.
When backward_step_net is not available, the operator knows it is in forward_only mode and does not create a workspace for each step, but instead cycles
through a single private workspace.

Note: we could avoid a lot of work in the recurrent.py:recurrent_network call when the backward step is not needed, but doing that cleanly requires
more refactoring than I wanted to do now. Thus, we still create the backward step nets etc., but just don't pass them to the op.

This can be used to create more efficient inference models. You can also sanitize existing inference nets by removing the backward_step_net argument to
get the same benefit.
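
A minimal sketch of how the forward-only path might be requested from Python, assuming the recurrent.LSTM helper exposes the new forward_only keyword; the blob names and dimensions are hypothetical:

    # Build an inference-only LSTM. The helper signature and blob names
    # below are assumptions for illustration, not the exact API.
    from caffe2.python import model_helper, recurrent

    model = model_helper.ModelHelper(name="lstm_inference")
    lstm_outputs = recurrent.LSTM(
        model, "input", "seq_lengths", ("hidden_init", "cell_init"),
        dim_in=40, dim_out=40, scope="lstm",
        forward_only=True,  # backward_step_net is not passed to the op, so a
                            # single private step workspace is reused
    )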

Reviewed By: salexspb

Differential Revision: D4916482

fbshipit-source-id: c99b93c9cb897c32b0f449253f7f6d6a942618ad
2017-04-24 15:52:27 -07:00
Aapo Kyrola
5a856ce03e disable dropout completely when not used
Summary: salexspb noticed that my diff fixing num_layers>1 cuDNN LSTM made it run much slower. It turns out this was caused by adding the dropout states to the gradient op (which it had been missing; that was a bug). But since we use dropout=1.0, we don't need to initialize the dropout states at all, and it turns out this improves the performance of the cuDNN LSTM very significantly, at least when hidden_dim is small (a 2.5x speedup with hidden_dim=40). With a large hidden_dim, the improvement is more modest.

Reviewed By: salexspb

Differential Revision: D4920543

fbshipit-source-id: 860c9d4c61793252f658dc5e3390bab571476be5
2017-04-20 08:40:25 -07:00
Aapo Kyrola
bef5720b76 Flag to report total memory in GPUs + op and python func to retrieve
Summary:
If the command line flag caffe2_gpu_memory_tracking is enabled, CUDAContext will keep track of the total memory allocated on each GPU. This requires keeping track of the sizes of the pointers, which may add some overhead, so it is optional. In practice the overhead is minimal, since we usually don't allocate after the first iterations.

Added an op, GetGPUMemoryUsage(), to fetch this data programmatically, and a python function, utils.GetGPUMemoryUsageStats(), to call this op and package the results. Modified the LSTM benchmark to report these stats.

This tracking is GPU-only for now; CPU allocations are less organized.
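
A hedged sketch of how the flag and the stats helper described above might be used from Python; the flag name comes from this summary, while the module path of GetGPUMemoryUsageStats is an assumption:

    # Enable per-GPU memory tracking and read the totals back.
    from caffe2.python import utils, workspace

    workspace.GlobalInit([
        "caffe2",
        "--caffe2_gpu_memory_tracking=1",  # CUDAContext tracks pointer sizes
    ])

    # ... build and run nets here ...

    # Assumed to live in caffe2.python.utils; runs the GetGPUMemoryUsage op
    # and packages its output.
    stats = utils.GetGPUMemoryUsageStats()
    print(stats)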

Reviewed By: asaadaldien

Differential Revision: D4877451

fbshipit-source-id: 857798fe499d8c78cc590783052cbb2d4db56ea0
2017-04-19 10:49:11 -07:00
Yury Zemlyanskiy
4bf559eddb RNNCell, LSTMCell, LSTMWithAttentionCell
Summary: This is a clean way to re-use RNN layers for both training and inference.
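
A rough sketch of the idea, with all constructor arguments and method names assumed for illustration: the same cell object can drive both the unrolled training path and a single-step decoding path.

    from caffe2.python import model_helper, rnn_cell

    model = model_helper.ModelHelper(name="lstm_cell_demo")
    cell = rnn_cell.LSTMCell(input_size=40, hidden_size=40, name="lstm")

    # Training: unroll the cell over a whole sequence (RecurrentNetwork op).
    seq_out = cell.apply_over_sequence(
        model, inputs="input", seq_lengths="seq_lengths",
        initial_states=("hidden_init", "cell_init"),
    )

    # Inference/decoding: apply the same cell one timestep at a time.
    step_out = cell.apply(
        model, input_t="input_t", seq_lengths="seq_lengths",
        states=("hidden_t", "cell_t"), timestep="timestep",
    )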

Reviewed By: salexspb

Differential Revision: D4825894

fbshipit-source-id: 779c69758cee8caca6f36bc507e3ea0566f7652a
2017-04-18 00:47:20 -07:00
Aapo Kyrola
1e5140aa76 option to recompute blobs in the backward pass with massive memory savings
Summary:
This diff adds an option to recurrent_net to mark some cell blobs to be recomputed on the backward step, so they don't need to be stored in each step workspace. This is done by modifying the backward step to automatically include all operators needed to produce the outputs that are to be recomputed, and by storing those blobs in a shared workspace. To enable the shared workspace, I had to modify the stepworkspaces blob to also store a shared forward workspace; making it a class field won't work since the lifecycle of the blob does not match the lifecycle of the operator.

For the basic LSTM, the performance hit is quite modest (about 15% with one setting, but your mileage may vary). For attention models, I am sure this is beneficial, since computing the attention blobs is not expensive.

For the basic LSTM, the memory saving is dramatic: each forward workspace holds only 4 bytes (for the timestep).

I also modified the neural_mt LSTM Cells, but there is no test available, so I am not 100% sure I did it correctly. Please have a look.

Added options to LSTM, MILSTM and LSTMAttention to enable memory mode.
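
A hedged sketch of opting into the memory mode from Python; the keyword name (memory_optimization) and the helper signature are assumptions for illustration:

    from caffe2.python import model_helper, recurrent

    model = model_helper.ModelHelper(name="lstm_memopt")
    lstm_outputs = recurrent.LSTM(
        model, "input", "seq_lengths", ("hidden_init", "cell_init"),
        dim_in=40, dim_out=40, scope="lstm",
        memory_optimization=True,  # selected cell blobs are recomputed in the
                                   # backward step instead of being kept in
                                   # every step workspace
    )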

Reviewed By: urikz

Differential Revision: D4853890

fbshipit-source-id: d8d0e0e75a5330d174fbfa39b96d8e4e8c446baa
2017-04-11 13:03:48 -07:00
Aapo Kyrola
ffd298376a option to print tensor shapes at exit
Summary:
Added the Caffe2 command line option --caffe2_print_blob_sizes_at_exit=1, which, when enabled, prints all tensor sizes when the workspace is destructed. Especially handy when using sub-workspaces, as with RNNs. Note that the sizes are numbers of elements, not bytes. The output is designed to be easy to copy-paste into Excel.

TODO: add sorting
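
A small sketch of turning the dump on from Python; the flag name comes from this summary, and GlobalInit is the usual way to set Caffe2 command line flags:

    from caffe2.python import workspace

    workspace.GlobalInit([
        "caffe2",
        # Dump element counts (not bytes) of every tensor when the
        # workspace is destructed.
        "--caffe2_print_blob_sizes_at_exit=1",
    ])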

Reviewed By: jamesr66a

Differential Revision: D4844628

fbshipit-source-id: 11608a1710ae5c89bbd741edb506d25496606185
2017-04-06 21:36:04 -07:00
Aapo Kyrola
8da2d75ec8 [Caffe2/Recurrent] recurrent.py API to cuDNN LSTM
Summary:
Quite a large diff to make the cuDNN LSTM and our LSTM produce the same results, and to provide a python API for the cuDNN LSTM.

* Added operators RecurrentParamGet and RecurrentParamSet to access weights and biases for the different gates, input/recurrent.
* Removed RecurrentInit as not needed
* recurrent.cudnn_LSTM() returns a special net and mapping that can be used to retrieve the parameters from the LSTM
* recurrent.cudnn_LSTM() can be passed blobs that have the parameters for the individual gate weights and biases
* recurrent.InitFromLSTMParams() can be used to initialize our own LSTM from cuDNN params. This way we can test whether cuDNN and our own implementation produce the same result.

recurrent_test.py tests for the equivalence of the two implementations.
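
A rough sketch, in the spirit of recurrent_test.py, of how the pieces above fit together; the return values, blob names, and exact signatures are assumptions for illustration:

    from caffe2.python import model_helper, recurrent

    model = model_helper.ModelHelper(name="lstm_equiv")

    # cuDNN LSTM; besides its outputs it hands back a small net plus a
    # mapping that can be run to pull the per-gate weights and biases out of
    # the packed cuDNN parameter blob.
    cudnn_results = recurrent.cudnn_LSTM(
        model, "input", ("hidden_init", "cell_init"),
        dim_in=40, dim_out=40, scope="cudnn_lstm",
    )
    cudnn_param_extract = cudnn_results[-1]  # hypothetical: (net, mapping)

    # Copy the extracted cuDNN gate parameters into our own LSTM's weight
    # and bias blobs, then run both nets and compare their outputs.
    own_lstm_params = ["lstm/gates_t_w", "lstm/gates_t_b"]  # hypothetical
    recurrent.InitFromLSTMParams(own_lstm_params, cudnn_param_extract)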

Reviewed By: salexspb

Differential Revision: D4654988

fbshipit-source-id: 6c1547d873cadcf33e03b0e0110248f0a7ab8cb0
2017-04-05 14:20:23 -07:00
Aapo Kyrola
0771ce312a optimize weighted softmaxwithloss gradient
Summary:
The weighted LabelCrossEntropyGradientKernel had a wasteful serial loop over D. Since the operation is completely linear (element-wise), we can do it all in one parallel loop. Massive speedup: in my benchmark, from 4s to 20ms.

+ added weights to the lstm_benchmark
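
For reference, the math that makes the one-pass version possible: every gradient element depends only on its own softmax probability plus its row's label, weight, and loss scale, so there is no serial dependency over D. A plain NumPy sketch of that math (not the CUDA kernel itself):

    import numpy as np

    def weighted_softmax_loss_grad(probs, labels, weights, scale=1.0):
        # probs: (N, D) softmax outputs; labels: (N,); weights: (N,)
        grad = probs.copy()
        grad[np.arange(len(labels)), labels] -= 1.0  # dL/dlogit = p - 1{label}
        return grad * weights[:, None] * scale       # per-row weight, one pass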

Reviewed By: jamesr66a

Differential Revision: D4800889

fbshipit-source-id: f9850bcc56ce34d5d7a613419cd172256633a894
2017-03-30 23:02:19 -07:00
Aapo Kyrola
8421bf7c60 Faster softmaxWithLoss rowMaxKernel
Summary:
We did not parallelize over D, which can be very large, especially in RNN models. This speeds things up significantly: in my quick test with lstm_benchmark and nvprof, the total time of RowMaxKernel dropped from 1.2s to 0.28s.

+ added softmaxwithloss to the lstm_benchmark

Reviewed By: jamesr66a

Differential Revision: D4800629

fbshipit-source-id: 3400ea1064b1eb2793bc403df2c1b68801d545e5
2017-03-30 15:49:46 -07:00
Aaron Markham
58f7f2b441 doxygen python block added
Summary: Closes https://github.com/caffe2/caffe2/pull/226

Differential Revision: D4793550

Pulled By: JoelMarcey

fbshipit-source-id: cc33e58186304fa8dcac2ee9115dcc271d785b1e
2017-03-29 06:46:16 -07:00
Aapo Kyrola
fd2835887b only resize stepWorkspaces when sequence length increases
Summary:
We should resize the workspace vector only when the sequence length increases. Otherwise we end up constantly destroying and recreating workspaces when the sequence length varies.

Modified the lstm_benchmark test to randomize sequence length.

This provides a big perf improvement to the machine translation pipeline. Look at the recurrent network op runtimes below.

WITH:
I0328 12:17:54.073976 492094 prof_dag_net.cc:156]    136.271 ms/iter (   120.987 ms/iter) RecurrentNetwork
I0328 12:17:54.073982 492094 prof_dag_net.cc:156]    190.074 ms/iter (   156.828 ms/iter) RecurrentNetworkGradient

WITHOUT:
I0328 12:25:17.658206 518884 prof_dag_net.cc:156]    375.369 ms/iter (   249.268 ms/iter) RecurrentNetwork
I0328 12:25:17.658211 518884 prof_dag_net.cc:156]    278.892 ms/iter (    227.29 ms/iter) RecurrentNetworkGradient

With the LSTM benchmark, we get about a 2x speedup.
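
The policy itself is simple; a small Python sketch of the grow-only behavior (the real change lives in the RecurrentNetwork C++ op, so the names here are illustrative):

    def ensure_step_workspaces(step_workspaces, seq_len, make_workspace):
        # Grow the vector when a longer sequence arrives, but never shrink
        # it, so shorter sequences reuse the existing workspaces.
        if len(step_workspaces) < seq_len:
            step_workspaces.extend(
                make_workspace() for _ in range(seq_len - len(step_workspaces))
            )
        return step_workspaces[:seq_len]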

Reviewed By: jamesr66a

Differential Revision: D4789354

fbshipit-source-id: ad72f61974e35b0474abcacdc466ae9c6b4eb0ff
2017-03-28 14:08:00 -07:00
Aapo Kyrola
f84e5360cc LSTM benchmark (Caffe2 RNN based)
Summary: Just generate some random data and push it through an LSTM (Caffe2 RNN based), using its own output as the gradient value, for benchmarking purposes. With the default parameters it fits in my dev GPU's memory. With the default parameters provided in this diff I got about 300k entries per second processed. The entries are split into blocks of seq_length * block_size. Each entry is of size hidden_dim: the LSTM takes a hidden_dim-sized input and produces an output of the same size.
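
For a sense of the accounting behind the entries-per-second number: each iteration pushes one block of seq_length * block_size entries (each of size hidden_dim) through the LSTM. A small sketch with made-up numbers (the defaults and timing below are hypothetical, not the ones from this diff):

    seq_length, block_size, hidden_dim = 20, 100, 40  # hypothetical defaults
    iters, elapsed_sec = 1000, 6.7                    # hypothetical timing

    entries_per_block = seq_length * block_size
    entries_per_sec = iters * entries_per_block / elapsed_sec
    print("%.0f entries/sec, each of size %d" % (entries_per_sec, hidden_dim))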

Reviewed By: salexspb

Differential Revision: D4605815

fbshipit-source-id: dd529302a0a93e8711784c67e4c777c8d6a8cdf4
2017-02-28 23:17:26 -08:00