Commit Graph

12 Commits

Author SHA1 Message Date
Aapo Kyrola
9cb901caf0 Forward-only RNNs
Summary:
Added a forward_only option to recurrent_net and the RNNCells. If it is set, the backward_step_net is not passed to the operator.
When backward_step_net is not available, the operator knows it is in forward_only mode and does not create a workspace for each step, but instead cycles
through a single private workspace.

Note: we could avoid a lot of work in the recurrent.py:recurrent_network call when the backward step is not needed, but doing that cleanly requires
more refactoring than I wanted to do now. Thus, we still create the backward step nets etc., but just don't pass them to the op.

This can be used to create more efficient inference models. You can also sanitize existing inference nets by removing the backward_step_net argument to
get the same benefit.
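
A minimal sketch of how the forward-only path might be requested from Python, assuming the recurrent.LSTM helper exposes the new forward_only keyword; the blob names and dimensions are hypothetical:

    # Build an inference-only LSTM. The helper signature and blob names
    # below are assumptions for illustration, not the exact API.
    from caffe2.python import model_helper, recurrent

    model = model_helper.ModelHelper(name="lstm_inference")
    lstm_outputs = recurrent.LSTM(
        model, "input", "seq_lengths", ("hidden_init", "cell_init"),
        dim_in=40, dim_out=40, scope="lstm",
        forward_only=True,  # backward_step_net is not passed to the op, so a
                            # single private step workspace is reused
    )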

Reviewed By: salexspb

Differential Revision: D4916482

fbshipit-source-id: c99b93c9cb897c32b0f449253f7f6d6a942618ad
2017-04-24 15:52:27 -07:00
Aapo Kyrola
5a856ce03e disable dropout completely when not used
Summary: salexspb noticed that my diff fixing num_layers>1 cuDNN LSTM made it run much slower. It turns out this was caused by adding the dropout states to the gradient op (which it had been missing; that was a bug). But since we use dropout=1.0, we don't need to initialize the dropout states at all, and it turns out this improves the performance of the cuDNN LSTM very significantly, at least when hidden_dim is small (a 2.5x speedup with hidden_dim=40). With a large hidden_dim, the improvement is more modest.

Reviewed By: salexspb

Differential Revision: D4920543

fbshipit-source-id: 860c9d4c61793252f658dc5e3390bab571476be5
2017-04-20 08:40:25 -07:00
Aapo Kyrola
bef5720b76 Flag to report total memory in GPUs + op and python func to retrieve
Summary:
If the command line flag caffe2_gpu_memory_tracking is enabled, CUDAContext will keep track of the total memory allocated on each GPU. This requires keeping track of the sizes of the pointers, which may add some overhead, so it is optional. In practice the overhead is minimal, since we usually don't allocate after the first iterations.

Added an op, GetGPUMemoryUsage(), to fetch this data programmatically, and a python function, utils.GetGPUMemoryUsageStats(), to call this op and package the results. Modified the LSTM benchmark to report these stats.

This tracking is GPU-only for now; CPU allocations are less organized.
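
A hedged sketch of how the flag and the stats helper described above might be used from Python; the flag name comes from this summary, while the module path of GetGPUMemoryUsageStats is an assumption:

    # Enable per-GPU memory tracking and read the totals back.
    from caffe2.python import utils, workspace

    workspace.GlobalInit([
        "caffe2",
        "--caffe2_gpu_memory_tracking=1",  # CUDAContext tracks pointer sizes
    ])

    # ... build and run nets here ...

    # Assumed to live in caffe2.python.utils; runs the GetGPUMemoryUsage op
    # and packages its output.
    stats = utils.GetGPUMemoryUsageStats()
    print(stats)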

Reviewed By: asaadaldien

Differential Revision: D4877451

fbshipit-source-id: 857798fe499d8c78cc590783052cbb2d4db56ea0
2017-04-19 10:49:11 -07:00
Yury Zemlyanskiy
4bf559eddb RNNCell, LSTMCell, LSTMWithAttentionCell
Summary: This is a clean way to re-use RNN layers for both training and inference.
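
A rough sketch of the idea, with all constructor arguments and method names assumed for illustration: the same cell object can drive both the unrolled training path and a single-step decoding path.

    from caffe2.python import model_helper, rnn_cell

    model = model_helper.ModelHelper(name="lstm_cell_demo")
    cell = rnn_cell.LSTMCell(input_size=40, hidden_size=40, name="lstm")

    # Training: unroll the cell over a whole sequence (RecurrentNetwork op).
    seq_out = cell.apply_over_sequence(
        model, inputs="input", seq_lengths="seq_lengths",
        initial_states=("hidden_init", "cell_init"),
    )

    # Inference/decoding: apply the same cell one timestep at a time.
    step_out = cell.apply(
        model, input_t="input_t", seq_lengths="seq_lengths",
        states=("hidden_t", "cell_t"), timestep="timestep",
    )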

Reviewed By: salexspb

Differential Revision: D4825894

fbshipit-source-id: 779c69758cee8caca6f36bc507e3ea0566f7652a
2017-04-18 00:47:20 -07:00
Aapo Kyrola
1e5140aa76 option to recompute blobs in the backward pass with massive memory savings
Summary:
This diff adds an option to recurrent_net to mark some cell blobs to be recomputed on the backward step, so they don't need to be stored in each step workspace. This is done by modifying the backward step to automatically include all operators needed to produce the outputs that are to be recomputed, and by storing those blobs in a shared workspace. To enable the shared workspace, I had to modify the stepworkspaces blob to also store a shared forward workspace; making it a class field won't work since the lifecycle of the blob does not match the lifecycle of the operator.

For the basic LSTM, the performance hit is quite modest (about 15% with one setting, but your mileage may vary). For attention models, I am sure this is beneficial, since computing the attention blobs is not expensive.

For the basic LSTM, the memory saving is dramatic: each forward workspace holds only 4 bytes (for the timestep).

I also modified the neural_mt LSTM Cells, but there is no test available, so I am not 100% sure I did it correctly. Please have a look.

Added options to LSTM, MILSTM and LSTMAttention to enable memory mode.
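
A hedged sketch of opting into the memory mode from Python; the keyword name (memory_optimization) and the helper signature are assumptions for illustration:

    from caffe2.python import model_helper, recurrent

    model = model_helper.ModelHelper(name="lstm_memopt")
    lstm_outputs = recurrent.LSTM(
        model, "input", "seq_lengths", ("hidden_init", "cell_init"),
        dim_in=40, dim_out=40, scope="lstm",
        memory_optimization=True,  # selected cell blobs are recomputed in the
                                   # backward step instead of being kept in
                                   # every step workspace
    )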

Reviewed By: urikz

Differential Revision: D4853890

fbshipit-source-id: d8d0e0e75a5330d174fbfa39b96d8e4e8c446baa
2017-04-11 13:03:48 -07:00
Aapo Kyrola
ffd298376a option to print tensor shapes at exit
Summary:
Added the Caffe2 command line option --caffe2_print_blob_sizes_at_exit=1, which, when enabled, prints all tensor sizes when the workspace is destructed. Especially handy when using sub-workspaces, as with RNNs. Note that the sizes are numbers of elements, not bytes. The output is designed to be easy to copy-paste into Excel.

TODO: add sorting
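
A small sketch of turning the dump on from Python; the flag name comes from this summary, and GlobalInit is the usual way to set Caffe2 command line flags:

    from caffe2.python import workspace

    workspace.GlobalInit([
        "caffe2",
        # Dump element counts (not bytes) of every tensor when the
        # workspace is destructed.
        "--caffe2_print_blob_sizes_at_exit=1",
    ])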

Reviewed By: jamesr66a

Differential Revision: D4844628

fbshipit-source-id: 11608a1710ae5c89bbd741edb506d25496606185
2017-04-06 21:36:04 -07:00
Aapo Kyrola
8da2d75ec8 [Caffe2/Recurrent] recurrent.py API to cuDNN LSTM
Summary:
Quite a large diff to make the cuDNN LSTM and our LSTM produce the same results, and to provide a python API for the cuDNN LSTM.

* Added operators RecurrentParamGet and RecurrentParamSet to access weights and biases for the different gates, input/recurrent.
* Removed RecurrentInit as not needed
* recurrent.cudnn_LSTM() returns a special net and mapping that can be used to retrieve the parameters from the LSTM
* recurrent.cudnn_LSTM() can be passed blobs that have the parameters for the individual gate weights and biases
* recurrent.InitFromLSTMParams() can be used to initialize our own LSTM from cuDNN params. This way we can test whether cuDNN and our own implementation produce the same result.

recurrent_test.py tests for the equivalence of the two implementations.
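
A rough sketch, in the spirit of recurrent_test.py, of how the pieces above fit together; the return values, blob names, and exact signatures are assumptions for illustration:

    from caffe2.python import model_helper, recurrent

    model = model_helper.ModelHelper(name="lstm_equiv")

    # cuDNN LSTM; besides its outputs it hands back a small net plus a
    # mapping that can be run to pull the per-gate weights and biases out of
    # the packed cuDNN parameter blob.
    cudnn_results = recurrent.cudnn_LSTM(
        model, "input", ("hidden_init", "cell_init"),
        dim_in=40, dim_out=40, scope="cudnn_lstm",
    )
    cudnn_param_extract = cudnn_results[-1]  # hypothetical: (net, mapping)

    # Copy the extracted cuDNN gate parameters into our own LSTM's weight
    # and bias blobs, then run both nets and compare their outputs.
    own_lstm_params = ["lstm/gates_t_w", "lstm/gates_t_b"]  # hypothetical
    recurrent.InitFromLSTMParams(own_lstm_params, cudnn_param_extract)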

Reviewed By: salexspb

Differential Revision: D4654988

fbshipit-source-id: 6c1547d873cadcf33e03b0e0110248f0a7ab8cb0
2017-04-05 14:20:23 -07:00
Aapo Kyrola
0771ce312a optimize weighted softmaxwithloss gradient
Summary:
The weighted LabelCrossEntropyGradientKernel had a wasteful serial loop over D. Since the operation is completely linear (element-wise), we can do it all in one parallel loop. Massive speedup: in my benchmark, from 4s to 20ms.

+ added weights to the lstm_benchmark
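
For reference, the math that makes the one-pass version possible: every gradient element depends only on its own softmax probability plus its row's label, weight, and loss scale, so there is no serial dependency over D. A plain NumPy sketch of that math (not the CUDA kernel itself):

    import numpy as np

    def weighted_softmax_loss_grad(probs, labels, weights, scale=1.0):
        # probs: (N, D) softmax outputs; labels: (N,); weights: (N,)
        grad = probs.copy()
        grad[np.arange(len(labels)), labels] -= 1.0  # dL/dlogit = p - 1{label}
        return grad * weights[:, None] * scale       # per-row weight, one pass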

Reviewed By: jamesr66a

Differential Revision: D4800889

fbshipit-source-id: f9850bcc56ce34d5d7a613419cd172256633a894
2017-03-30 23:02:19 -07:00
Aapo Kyrola
8421bf7c60 Faster softmaxWithLoss rowMaxKernel
Summary:
We did not parallelize over D, which can be very large, especially in RNN models. This speeds things up significantly: in my quick test with lstm_benchmark and nvprof, the total time of RowMaxKernel dropped from 1.2s to 0.28s.

+ added softmaxwithloss to the lstm_benchmark

Reviewed By: jamesr66a

Differential Revision: D4800629

fbshipit-source-id: 3400ea1064b1eb2793bc403df2c1b68801d545e5
2017-03-30 15:49:46 -07:00
Aaron Markham
58f7f2b441 doxygen python block added
Summary: Closes https://github.com/caffe2/caffe2/pull/226

Differential Revision: D4793550

Pulled By: JoelMarcey

fbshipit-source-id: cc33e58186304fa8dcac2ee9115dcc271d785b1e
2017-03-29 06:46:16 -07:00
Aapo Kyrola
fd2835887b only resize stepWorkspaces when sequence length increases
Summary:
We should resize the workspace vector only when the sequence length increases. Otherwise we end up constantly destroying and recreating workspaces when the sequence length varies.

Modified the lstm_benchmark test to randomize sequence length.

This provides a big perf improvement to the machine translation pipeline. Look at the recurrent network op runtimes below.

WITH:
I0328 12:17:54.073976 492094 prof_dag_net.cc:156]    136.271 ms/iter (   120.987 ms/iter) RecurrentNetwork
I0328 12:17:54.073982 492094 prof_dag_net.cc:156]    190.074 ms/iter (   156.828 ms/iter) RecurrentNetworkGradient

WITHOUT:
I0328 12:25:17.658206 518884 prof_dag_net.cc:156]    375.369 ms/iter (   249.268 ms/iter) RecurrentNetwork
I0328 12:25:17.658211 518884 prof_dag_net.cc:156]    278.892 ms/iter (    227.29 ms/iter) RecurrentNetworkGradient

With the LSTM benchmark, we get about a 2x speedup.
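
The policy itself is simple; a small Python sketch of the grow-only behavior (the real change lives in the RecurrentNetwork C++ op, so the names here are illustrative):

    def ensure_step_workspaces(step_workspaces, seq_len, make_workspace):
        # Grow the vector when a longer sequence arrives, but never shrink
        # it, so shorter sequences reuse the existing workspaces.
        if len(step_workspaces) < seq_len:
            step_workspaces.extend(
                make_workspace() for _ in range(seq_len - len(step_workspaces))
            )
        return step_workspaces[:seq_len]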

Reviewed By: jamesr66a

Differential Revision: D4789354

fbshipit-source-id: ad72f61974e35b0474abcacdc466ae9c6b4eb0ff
2017-03-28 14:08:00 -07:00
Aapo Kyrola
f84e5360cc LSTM benchmark (Caffe2 RNN based)
Summary: Just generate some random data and push it through an LSTM (Caffe2 RNN based), using its own output as the gradient value, for benchmarking purposes. With the default parameters it fits in my dev GPU's memory. With the default parameters provided in this diff I got about 300k entries per second processed. The entries are split into blocks of seq_length * block_size. Each entry is of size hidden_dim: the LSTM takes a hidden_dim-sized input and produces an output of the same size.
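
For a sense of the accounting behind the entries-per-second number: each iteration pushes one block of seq_length * block_size entries (each of size hidden_dim) through the LSTM. A small sketch with made-up numbers (the defaults and timing below are hypothetical, not the ones from this diff):

    seq_length, block_size, hidden_dim = 20, 100, 40  # hypothetical defaults
    iters, elapsed_sec = 1000, 6.7                    # hypothetical timing

    entries_per_block = seq_length * block_size
    entries_per_sec = iters * entries_per_block / elapsed_sec
    print("%.0f entries/sec, each of size %d" % (entries_per_sec, hidden_dim))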

Reviewed By: salexspb

Differential Revision: D4605815

fbshipit-source-id: dd529302a0a93e8711784c67e4c777c8d6a8cdf4
2017-02-28 23:17:26 -08:00