Commit Graph

4 Commits

Author SHA1 Message Date
Aapo Kyrola
8421bf7c60 Faster softmaxWithLoss rowMaxKernel
Summary:
We did not parallelize over D, which can be very large, especially in RNN models. This change speeds things up significantly: in a quick test with lstm_benchmark and nvprof, the total time of RowMaxKernel dropped from 1.2 s to 0.28 s.

+ added SoftmaxWithLoss to the lstm_benchmark
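
The kernel itself is not shown in this log. As a rough illustration (hypothetical names, not the actual Caffe2 code), the fix replaces a serial scan over the D columns of each row with per-thread strided partial maxima followed by a reduction; a CPU analogue of that pattern looks like:

```cpp
#include <algorithm>
#include <cfloat>
#include <vector>

// Hypothetical CPU analogue of the RowMaxKernel change: instead of one
// thread scanning all D columns of a row serially, T simulated "threads"
// each take a strided slice of the columns and compute a partial max,
// and the T partial maxima are then reduced. On the GPU this maps to a
// block-wide parallel reduction over D.
std::vector<float> rowMax(const std::vector<float>& X, int N, int D, int T) {
  std::vector<float> out(N);
  for (int n = 0; n < N; ++n) {
    // Phase 1: strided partial maxima, one per simulated thread.
    std::vector<float> partial(T, -FLT_MAX);
    for (int t = 0; t < T; ++t) {
      for (int d = t; d < D; d += T) {
        partial[t] = std::max(partial[t], X[n * D + d]);
      }
    }
    // Phase 2: reduce the partial maxima (a tree reduction on the GPU).
    out[n] = *std::max_element(partial.begin(), partial.end());
  }
  return out;
}
```

The win comes from phase 1: with D in the thousands (common in RNN hidden states), spreading the column scan over many threads removes the serial bottleneck.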

Reviewed By: jamesr66a

Differential Revision: D4800629

fbshipit-source-id: 3400ea1064b1eb2793bc403df2c1b68801d545e5
2017-03-30 15:49:46 -07:00
Aaron Markham
58f7f2b441 doxygen python block added
Summary: Closes https://github.com/caffe2/caffe2/pull/226

Differential Revision: D4793550

Pulled By: JoelMarcey

fbshipit-source-id: cc33e58186304fa8dcac2ee9115dcc271d785b1e
2017-03-29 06:46:16 -07:00
Aapo Kyrola
fd2835887b only resize stepWorkspaces when sequence length increases
Summary:
We should resize the workspace vector only when the sequence length increases. Otherwise, when sequence length varies, we end up destroying and recreating workspaces constantly.

Modified the lstm_benchmark test to randomize sequence length.

This provides a big perf improvement to the machine translation pipeline. Look at the recurrent network op runtimes:

WITH:
I0328 12:17:54.073976 492094 prof_dag_net.cc:156]    136.271 ms/iter (   120.987 ms/iter) RecurrentNetwork
I0328 12:17:54.073982 492094 prof_dag_net.cc:156]    190.074 ms/iter (   156.828 ms/iter) RecurrentNetworkGradient

WITHOUT:
I0328 12:25:17.658206 518884 prof_dag_net.cc:156]    375.369 ms/iter (   249.268 ms/iter) RecurrentNetwork
I0328 12:25:17.658211 518884 prof_dag_net.cc:156]    278.892 ms/iter (    227.29 ms/iter) RecurrentNetworkGradient

With the LSTM benchmark, we get about a 2x speedup.

Reviewed By: jamesr66a

Differential Revision: D4789354

fbshipit-source-id: ad72f61974e35b0474abcacdc466ae9c6b4eb0ff
2017-03-28 14:08:00 -07:00
Aapo Kyrola
f84e5360cc LSTM benchmark (Caffe2 RNN based)
Summary: Just generate some random data and put it through LSTM (Caffe2 RNN based), using its own output as the gradient value, for benchmark purposes. With the default parameters it fits in my dev GPU's memory. With the default parameters provided in this diff, I got 300k entries per second processed. The entries are split into blocks of seq_length * block_size. Each entry is of size hidden_dim: the LSTM takes a hidden_dim-sized input and produces an output of the same size.

Reviewed By: salexspb

Differential Revision: D4605815

fbshipit-source-id: dd529302a0a93e8711784c67e4c777c8d6a8cdf4
2017-02-28 23:17:26 -08:00