Summary:
We did not previously parallelize over D, which can be very large, especially in RNN models. Doing so speeds things up significantly: in a quick test with lstm_benchmark under nvprof, the total time of RowMaxKernel dropped from 1.2s to 0.28s (a sketch of the parallelized reduction is below).
+ added SoftmaxWithLoss to the lstm_benchmark
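For reference, a minimal sketch of the idea (not the actual Caffe2 kernel): one thread block per row, with the threads striding over the D columns and combining partial maxima in shared memory, instead of a single thread scanning a whole row. The kernel name and the N/D parameters follow this diff; the block size and reduction details are assumptions.

  #include <cfloat>

  __global__ void RowMaxKernel(const int N, const int D,
                               const float* data, float* out) {
    __shared__ float buf[128];  // one slot per thread; blockDim.x must be 128
    for (int row = blockIdx.x; row < N; row += gridDim.x) {
      // Each thread reduces a strided subset of this row's columns.
      float m = -FLT_MAX;
      for (int col = threadIdx.x; col < D; col += blockDim.x) {
        m = fmaxf(m, data[row * D + col]);
      }
      buf[threadIdx.x] = m;
      __syncthreads();
      // Tree reduction in shared memory (blockDim.x must be a power of two).
      for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
          buf[threadIdx.x] = fmaxf(buf[threadIdx.x], buf[threadIdx.x + stride]);
        }
        __syncthreads();
      }
      if (threadIdx.x == 0) {
        out[row] = buf[0];
      }
      __syncthreads();  // buf is reused on the next row iteration
    }
  }

Launched as, e.g., RowMaxKernel<<<min(N, 4096), 128>>>(N, D, data, out). With only one thread per row, most of the GPU sits idle whenever N is small and D is large, which is why the kernel time drop above is so big.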
Reviewed By: jamesr66a
Differential Revision: D4800629
fbshipit-source-id: 3400ea1064b1eb2793bc403df2c1b68801d545e5
Summary:
We should resize the workspace vector only when it grows. Otherwise we end up destroying and recreating workspaces constantly whenever the sequence length varies.
Modified the lstm_benchmark test to randomize sequence length.
This provides a big perf improvement to the machine translation pipeline. Look at the RecurrentNetwork op runtimes; a sketch of the grow-only resize follows the numbers.
WITH:
I0328 12:17:54.073976 492094 prof_dag_net.cc:156] 136.271 ms/iter ( 120.987 ms/iter) RecurrentNetwork
I0328 12:17:54.073982 492094 prof_dag_net.cc:156] 190.074 ms/iter ( 156.828 ms/iter) RecurrentNetworkGradient
WITHOUT:
I0328 12:25:17.658206 518884 prof_dag_net.cc:156] 375.369 ms/iter ( 249.268 ms/iter) RecurrentNetwork
I0328 12:25:17.658211 518884 prof_dag_net.cc:156] 278.892 ms/iter ( 227.29 ms/iter) RecurrentNetworkGradient
With the LSTM benchmark, this gives about a 2x speedup.
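A minimal sketch of the grow-only policy, assuming the per-step workspaces live in a std::vector as the summary describes; the function name and the stand-in Workspace type here are illustrative, not the actual RecurrentNetwork op code.

  #include <memory>
  #include <vector>

  struct Workspace {};  // stand-in for the per-timestep workspace type

  // Grow the vector when the new sequence is longer; never shrink it.
  // Shrinking on a short sequence would destroy workspaces that the next
  // long sequence must recreate -- exactly the churn this diff removes.
  void EnsureStepWorkspaces(std::vector<std::unique_ptr<Workspace>>* ws,
                            int seqLen) {
    for (auto i = ws->size(); i < static_cast<size_t>(seqLen); ++i) {
      ws->push_back(std::make_unique<Workspace>());
    }
    // No resize(seqLen) here: steps beyond seqLen simply go unused
    // until a longer sequence comes along.
  }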
Reviewed By: jamesr66a
Differential Revision: D4789354
fbshipit-source-id: ad72f61974e35b0474abcacdc466ae9c6b4eb0ff
Summary: Just generate some random data and put it through an LSTM (Caffe2 RNN based), using its own output as the gradient value, for benchmark purposes. With the default parameters provided in this diff, it fits in my dev GPU's memory and I got about 300k entries per second processed. The entries are split into blocks of seq_length * block_size; each entry is of size hidden_dim, and the LSTM takes a hidden_dim-sized input and produces an output of the same size. A shape sketch is below.
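As a hypothetical sketch of the shape arithmetic (the parameter values are made up for illustration; the diff's actual defaults may differ):

  #include <cstdio>

  int main() {
    // Assumed values for illustration only.
    const int seq_length = 20;
    const int block_size = 128;
    const int hidden_dim = 512;
    // One block is seq_length * block_size entries, each a hidden_dim
    // vector; the LSTM maps each hidden_dim input to a hidden_dim output.
    const long entries_per_block = static_cast<long>(seq_length) * block_size;
    // At the reported ~300k entries/sec, this is the block throughput:
    std::printf("entries/block = %ld, blocks/sec ~ %.1f\n",
                entries_per_block, 300000.0 / entries_per_block);
    return 0;
  }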
Reviewed By: salexspb
Differential Revision: D4605815
fbshipit-source-id: dd529302a0a93e8711784c67e4c777c8d6a8cdf4