Commit Graph

20 Commits

Dmytro Dzhulgakov
c0cebc3578 Added flags to lstm, convnet and sparse_nn_benchmarks to print out operators
Summary: pass flags directly to C2

Reviewed By: salexspb

Differential Revision: D5345869

fbshipit-source-id: 22b0e791526c7b0caf1e6a13dd29900df0db8fe8
2017-06-30 23:47:04 -07:00
Alexander Sidorov
a6dee1da32 Make args.fixed_shape in lstm_benchmark work in library mode
Summary:
This works as a standalone Python script because args are
global. When used from Flow for monitoring purposes it doesn't
work. This diff fixes that.
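
A minimal sketch of the fix pattern, with illustrative names (the real change lives in lstm_benchmark; only args.fixed_shape is taken from this commit): pass the parsed args into the entry point instead of reading a module-level global, so library callers can supply their own configuration.

    import argparse

    def Benchmark(args):
        # Read configuration from the passed-in namespace rather than a
        # module-level global, so callers such as Flow can supply their own.
        return args.fixed_shape

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--fixed_shape", action="store_true")
        Benchmark(parser.parse_args())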

Reviewed By: zem7

Differential Revision: D5349996

fbshipit-source-id: f73842901d975b783e09e9db0565eb81880bbea1
2017-06-29 14:55:26 -07:00
Aapo Kyrola
dd6e170b8d fix LSTM benchmark reporting
Summary:
A couple of fixes for the broken reporting in lstm_benchmark:
- last_time must be recorded after warm-up
- the entry count was incorrectly removed

Reviewed By: salexspb

Differential Revision: D5349890

fbshipit-source-id: 5dd5bdf46594c520b61bc3b57b153f90a6a17903
2017-06-29 13:53:17 -07:00
Yiming Wu
fb4c0a664b brew API in lstm benchmark
Summary: Replaced the deprecated CNNModelHelper with the brew API in the LSTM benchmark

Reviewed By: salexspb

Differential Revision: D5342734

fbshipit-source-id: 81a552194bcb0cc3071604340fce6873230964f2
2017-06-28 20:18:12 -07:00
Alexander Sidorov
eefd4b0bb2 Static RNN: gpu support and lstm_benchmark integration
Summary:
While this is not intended to be the most performant or
general solution, the test plan shows that in some cases a static DAG RNN can
perform better than our own implementation. Hopefully we will get
dynamic RNN DAG execution at least as fast as this one; then we will
not need this one in production, only for testing.

Still putting it into our benchmark for comparison purposes.

Reviewed By: akyrola

Differential Revision: D5210038

fbshipit-source-id: fa44baf51c455872abd6ec5f5d151cf06e15b1fa
2017-06-16 11:31:43 -07:00
Aapo Kyrola
d312dcc881 lstm_benchmark use rnn_cell.LSTM multicell + assertion
Summary:
Use rnn_cell's multi-cell for the LSTM benchmark. While doing this, I had not changed the initial_states and got an inconsistent result from rnn_cell, so I added an assertion that the initial states' length is 2 * num_layers (sketched below).

+ fixed a division-by-zero error
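
A minimal sketch of the invariant behind that assertion, with illustrative names (the real check lives in rnn_cell): each LSTM layer carries a hidden state and a cell state, so a stack needs two initial state blobs per layer.

    def check_initial_states(initial_states, num_layers):
        # Each layer needs one hidden-state blob and one cell-state blob.
        assert len(initial_states) == 2 * num_layers, (
            "expected %d initial states (hidden + cell per layer), got %d"
            % (2 * num_layers, len(initial_states)))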

Reviewed By: salexspb

Differential Revision: D5003177

fbshipit-source-id: a8250b825394c352428a0f067098dfcd7516ab2a
2017-05-04 17:02:32 -07:00
Alexander Sidorov
379ac514b8 lstm_benchmark: add warm-up stage, support layers
Summary:
We need a warm-up stage because otherwise the first iteration
spends too much time doing all the allocations.
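
A minimal sketch of the warm-up pattern, assuming the benchmark times a step function (names are illustrative): run untimed iterations first so one-time allocations don't pollute the measured rate, and record the start time only after warm-up.

    import time

    def run_benchmark(step, warmup_iters=1, timed_iters=100):
        for _ in range(warmup_iters):
            step()                      # allocations happen here, untimed
        start = time.time()             # record only after warm-up
        for _ in range(timed_iters):
            step()
        return timed_iters / (time.time() - start)   # iterations per second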

Reviewed By: akyrola

Differential Revision: D4986201

fbshipit-source-id: f60a75520988ff3f1540bb157cdc69634f307db4
2017-05-02 20:34:00 -07:00
Alexander Sidorov
ad6204eb0b LSTM: support dropping hidden / cell states when sequence
Summary:
This is useful when the data has standalone sequences that are
not connected to each other by any meaningful context.

Reviewed By: yqwangustc

Differential Revision: D4835164

fbshipit-source-id: f95626acc26acc3eba3bca7efb08ed1dbdb36c83
2017-04-27 11:47:29 -07:00
Aapo Kyrola
9cb901caf0 Forward-only RNNs
Summary:
Added a forward_only option to recurrent_net and the RNNCells. If it is set, the backward_step_net is not passed to the operator.
When backward_step_net is not available, the operator knows it is in forward-only mode and, instead of creating a workspace for each step,
cycles through a single private workspace.

Note: we could avoid doing a lot of work in the recurrent.py:recurrent_network call when the backward step is not needed, but doing that nicely requires
more refactoring than I wanted to do now. Thus, we still create the backward step nets etc., but just don't pass them to the op.

This can be used to create more efficient inference models. You can also sanitize existing inference nets and remove the backward_step_net argument to
get the benefits.
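
A hedged usage sketch, assuming the rnn_cell.LSTM signature of this era (blob names and dimensions are illustrative):

    from caffe2.python import model_helper, rnn_cell

    model = model_helper.ModelHelper(name="lstm_inference")
    result = rnn_cell.LSTM(
        model,
        input_blob="input",                  # (T, N, dim_in) sequence
        seq_lengths="seq_lengths",
        initial_states=("hidden_init", "cell_init"),
        dim_in=40,
        dim_out=40,
        scope="lstm",
        forward_only=True,                   # backward_step_net not passed to the op
    )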

Reviewed By: salexspb

Differential Revision: D4916482

fbshipit-source-id: c99b93c9cb897c32b0f449253f7f6d6a942618ad
2017-04-24 15:52:27 -07:00
Aapo Kyrola
5a856ce03e disable dropout completely when not used
Summary: salexspb recognized that my diff fixing num_layers > 1 cuDNN LSTM made it run much slower. It turns out this was caused by adding the dropout states to the gradient op (they had been missing, which was a bug). But since we use dropout=1.0, we don't need to initialize the dropout states at all, and it turns out this improves the perf of CuDNN LSTM very significantly, at least when hidden_dim is small (a 2.5x speedup with hidden_dim=40). With a large hidden_dim, the improvement is more modest.

Reviewed By: salexspb

Differential Revision: D4920543

fbshipit-source-id: 860c9d4c61793252f658dc5e3390bab571476be5
2017-04-20 08:40:25 -07:00
Aapo Kyrola
bef5720b76 Flag to report total memory in GPUs + op and python func to retrieve
Summary:
If the command line flag caffe2_gpu_memory_tracking is enabled, CUDAContext will keep track of the total memory allocated on each GPU. This requires keeping track of the sizes of the pointers, which may add some overhead, so it is optional. In practice the overhead is minimal, though, since we usually don't allocate after the first iterations.

Added an op GetGPUMemoryUsage() to fetch this data programmatically, and a Python utility function GetGPUMemoryUsageStats() to call this op and package the results. Modified the LSTM benchmark to report these stats.

This tracking is GPU-only for now; CPU allocations are less organized.
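
A hedged usage sketch, assuming the flag and Python helper land as described in this diff:

    from caffe2.python import utils, workspace

    # Opt in to the tracking; it keeps per-pointer sizes, hence the flag.
    workspace.GlobalInit(["caffe2", "--caffe2_gpu_memory_tracking=1"])

    # ... build and run nets here ...

    # Runs the GetGPUMemoryUsage op and packages the per-GPU totals.
    stats = utils.GetGPUMemoryUsageStats()
    print(stats)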

Reviewed By: asaadaldien

Differential Revision: D4877451

fbshipit-source-id: 857798fe499d8c78cc590783052cbb2d4db56ea0
2017-04-19 10:49:11 -07:00
Yury Zemlyanskiy
4bf559eddb RNNCell, LSTMCell, LSTMWithAttentionCell
Summary: This is a nice way to re-use RNN layers for both training and inference.

Reviewed By: salexspb

Differential Revision: D4825894

fbshipit-source-id: 779c69758cee8caca6f36bc507e3ea0566f7652a
2017-04-18 00:47:20 -07:00
Aapo Kyrola
1e5140aa76 option to recompute blobs in the backward pass with massive memory savings
Summary:
This diff adds an option to recurrent_net to define some cell blobs to be recomputed on the backward step, so they don't need to be stored in the step workspaces. This is done by modifying the backward step to automatically include all operators needed to produce the outputs that are to be recomputed, and by storing those blobs in a shared workspace. To enable the shared workspace, I had to modify the stepworkspaces blob to also store a forward shared workspace; making it a class field won't work since the lifecycle of the blob does not match the lifecycle of the operator.

For basic LSTM, the performance hit is quite modest (about 15% with one setting, but your mileage may vary). For attention models, I am sure this is beneficial, as computing the attention blobs is not expensive.

For basic LSTM, the memory saving is wonderful: each forward workspace only has 4 bytes (for the timestep).

I also modified the neural_mt LSTM cells, but there is no test available, so I am not 100% sure I did it correctly. Please have a look.

Added options to LSTM, MILSTM and LSTMAttention to enable the memory mode.
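
A hedged sketch of opting in, assuming the option is exposed as a memory_optimization keyword on the rnn_cell helpers (the exact name may differ; other arguments are illustrative):

    from caffe2.python import model_helper, rnn_cell

    model = model_helper.ModelHelper(name="lstm_mem_opt")
    result = rnn_cell.LSTM(
        model, "input", "seq_lengths", ("hidden_init", "cell_init"),
        dim_in=40, dim_out=40, scope="lstm",
        memory_optimization=True,   # recompute cell blobs on the backward step
    )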

Reviewed By: urikz

Differential Revision: D4853890

fbshipit-source-id: d8d0e0e75a5330d174fbfa39b96d8e4e8c446baa
2017-04-11 13:03:48 -07:00
Aapo Kyrola
ffd298376a option to print tensor shapes at exit
Summary:
Added the Caffe2 command line option --caffe2_print_blob_sizes_at_exit=1, which, when enabled, prints all tensor sizes in the workspace destructor. Handy especially when using sub-workspaces, as with RNNs. Note that the sizes are element counts, not bytes. The output is designed to be easily copy-pasteable into Excel.

TODO: add sorting
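
A minimal usage sketch; the flag can equally be passed on the command line of any Caffe2 binary:

    from caffe2.python import workspace

    # Sizes (in elements, not bytes) are printed when workspaces are
    # destroyed, including RNN step sub-workspaces.
    workspace.GlobalInit(["caffe2", "--caffe2_print_blob_sizes_at_exit=1"])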

Reviewed By: jamesr66a

Differential Revision: D4844628

fbshipit-source-id: 11608a1710ae5c89bbd741edb506d25496606185
2017-04-06 21:36:04 -07:00
Aapo Kyrola
8da2d75ec8 [Caffe2/Recurrent] recurrent.py API to cuDNN LSTM
Summary:
Quite a large diff to make cuDNN LSTM and our LSTM produce the same results, and to provide a Python API for the cuDNN LSTM.

* Added operators RecurrentParamGet and RecurrentParamSet to access the weights and biases for the different gates, input/recurrent.
* Removed RecurrentInit, as it is no longer needed.
* recurrent.cudnn_LSTM() returns a special net and a mapping that can be used to retrieve the parameters from the LSTM.
* recurrent.cudnn_LSTM() can be passed blobs that hold the parameters for the individual gate weights and biases.
* recurrent.InitFromLSTMParams() can be used to initialize our own LSTM from the cuDNN params. This way we can test whether cuDNN and our own implementation produce the same result.

recurrent_test.py tests for the equivalence.
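
A hedged sketch of that equivalence-test flow; run_cudnn_lstm and run_own_lstm are hypothetical stand-ins for nets built with recurrent.cudnn_LSTM() and with our LSTM initialized via recurrent.InitFromLSTMParams() (see recurrent_test.py for the real usage):

    import numpy as np

    def check_lstm_equivalence(run_cudnn_lstm, run_own_lstm, x):
        # run_cudnn_lstm returns the output plus the parameter mapping;
        # run_own_lstm consumes that mapping to initialize its weights.
        cudnn_out, param_mapping = run_cudnn_lstm(x)
        own_out = run_own_lstm(x, param_mapping)
        assert np.allclose(cudnn_out, own_out, atol=1e-5)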

Reviewed By: salexspb

Differential Revision: D4654988

fbshipit-source-id: 6c1547d873cadcf33e03b0e0110248f0a7ab8cb0
2017-04-05 14:20:23 -07:00
Aapo Kyrola
0771ce312a optimize weighted softmaxwithloss gradient
Summary:
Weighted LabelCrossEntropyGradientKernel had a clowny loop over D. Since the operation is completely linear, we can do it all in one parallel loop. Massive speedup: in my benchmark, from 4 s to 20 ms.

+ added weights to the lstm_benchmark
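
A sketch in numpy of why one pass suffices (loss scaling omitted; variable names are illustrative): every element of the gradient depends only on its own probability, label, and weight.

    import numpy as np

    def weighted_label_cross_entropy_grad(probs, labels, weights):
        # probs: (N, D) softmax outputs; labels: (N,) ints; weights: (N,)
        # dX = (probs - onehot(labels)) * weights[:, None] is elementwise
        # linear, so a single parallel pass over N * D elements is enough.
        grad = probs * weights[:, None]
        grad[np.arange(labels.shape[0]), labels] -= weights
        return grad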

Reviewed By: jamesr66a

Differential Revision: D4800889

fbshipit-source-id: f9850bcc56ce34d5d7a613419cd172256633a894
2017-03-30 23:02:19 -07:00
Aapo Kyrola
8421bf7c60 Faster softmaxWithLoss rowMaxKernel
Summary:
We did not parallelize over D, which can be very large, especially in RNN models. This speeds things up significantly: in my quick test with lstm_benchmark and nvprof, the time of RowMaxKernel dropped from 1.2 s total to 0.28 s total.

+ added softmaxwithloss to the lstm_benchmark

Reviewed By: jamesr66a

Differential Revision: D4800629

fbshipit-source-id: 3400ea1064b1eb2793bc403df2c1b68801d545e5
2017-03-30 15:49:46 -07:00
Aaron Markham
58f7f2b441 doxygen python block added
Summary: Closes https://github.com/caffe2/caffe2/pull/226

Differential Revision: D4793550

Pulled By: JoelMarcey

fbshipit-source-id: cc33e58186304fa8dcac2ee9115dcc271d785b1e
2017-03-29 06:46:16 -07:00
Aapo Kyrola
fd2835887b only resize stepWorkspaces when sequence length increases
Summary:
We should resize the workspace-vector only when it increases. Otherwise we end up destroying and recreating workspaces constantly if sequence length varies.

Modified the lstm_benchmark test to randomize sequence length.

This provides a big perf improvement to the machine translation pipeline. Look at the recurrent network op runtimes:

WITH:
I0328 12:17:54.073976 492094 prof_dag_net.cc:156]    136.271 ms/iter (   120.987 ms/iter) RecurrentNetwork
I0328 12:17:54.073982 492094 prof_dag_net.cc:156]    190.074 ms/iter (   156.828 ms/iter) RecurrentNetworkGradient

WITHOUT:
I0328 12:25:17.658206 518884 prof_dag_net.cc:156]    375.369 ms/iter (   249.268 ms/iter) RecurrentNetwork
I0328 12:25:17.658211 518884 prof_dag_net.cc:156]    278.892 ms/iter (    227.29 ms/iter) RecurrentNetworkGradient

With the LSTM benchmark, we get about a 2x speedup.
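
A sketch of the grow-only policy in Python (the real logic is C++ inside the RecurrentNetwork op; names are illustrative):

    def ensure_step_workspaces(step_workspaces, seq_len, make_workspace):
        # Grow-only: when seq_len shrinks, reuse a prefix of the existing
        # workspaces instead of destroying and recreating them every run.
        while len(step_workspaces) < seq_len:
            step_workspaces.append(make_workspace())
        return step_workspaces[:seq_len]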

Reviewed By: jamesr66a

Differential Revision: D4789354

fbshipit-source-id: ad72f61974e35b0474abcacdc466ae9c6b4eb0ff
2017-03-28 14:08:00 -07:00
Aapo Kyrola
f84e5360cc LSTM benchmark (Caffe2 RNN based)
Summary: Just generate some random data and put it through an LSTM (Caffe2 RNN based), using its own output as the gradient value, for benchmark purposes. With the default parameters provided in this diff, it fits my dev GPU's memory and processes 300k entries per second. The entries are split into blocks of seq_length * block_size; each entry is of size hidden_dim, and the LSTM takes a hidden_dim-sized input and produces an output of the same size.
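
A sketch of the reported metric under those definitions (names are illustrative):

    def entries_per_second(seq_length, block_size, iters, elapsed_sec):
        # One entry is a single hidden_dim-sized input vector; each
        # iteration pushes a block of seq_length * block_size entries.
        return seq_length * block_size * iters / elapsed_sec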

Reviewed By: salexspb

Differential Revision: D4605815

fbshipit-source-id: dd529302a0a93e8711784c67e4c777c8d6a8cdf4
2017-02-28 23:17:26 -08:00