Summary:
We did not previously parallelize over D, which can be very large, especially in RNN models. Doing so speeds things up significantly: in a quick test with lstm_benchmark under nvprof, the total time of RowMaxKernel dropped from 1.2s to 0.28s (a sketch of the parallelized reduction is below).
+ added SoftmaxWithLoss to the lstm_benchmark
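For reference, a minimal sketch of the idea (not the actual Caffe2 kernel): one thread block per row, with the threads striding over the D columns and combining partial maxima in shared memory, instead of a single thread scanning a whole row. The kernel name and the N/D parameters follow this diff; the block size and reduction details are assumptions.

  #include <cfloat>

  __global__ void RowMaxKernel(const int N, const int D,
                               const float* data, float* out) {
    __shared__ float buf[128];  // one slot per thread; blockDim.x must be 128
    for (int row = blockIdx.x; row < N; row += gridDim.x) {
      // Each thread reduces a strided subset of this row's columns.
      float m = -FLT_MAX;
      for (int col = threadIdx.x; col < D; col += blockDim.x) {
        m = fmaxf(m, data[row * D + col]);
      }
      buf[threadIdx.x] = m;
      __syncthreads();
      // Tree reduction in shared memory (blockDim.x must be a power of two).
      for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
          buf[threadIdx.x] = fmaxf(buf[threadIdx.x], buf[threadIdx.x + stride]);
        }
        __syncthreads();
      }
      if (threadIdx.x == 0) {
        out[row] = buf[0];
      }
      __syncthreads();  // buf is reused on the next row iteration
    }
  }

Launched as, e.g., RowMaxKernel<<<min(N, 4096), 128>>>(N, D, data, out). With only one thread per row, most of the GPU sits idle whenever N is small and D is large, which is why the kernel time drop above is so big.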
Reviewed By: jamesr66a
Differential Revision: D4800629
fbshipit-source-id: 3400ea1064b1eb2793bc403df2c1b68801d545e5
Summary:
We should resize the workspace vector only when it grows. Otherwise we end up destroying and recreating workspaces constantly whenever the sequence length varies.
Modified the lstm_benchmark test to randomize sequence length.
This provides a big perf improvement to the machine translation pipeline. Look at the RecurrentNetwork op runtimes; a sketch of the grow-only resize follows the numbers.
WITH:
I0328 12:17:54.073976 492094 prof_dag_net.cc:156] 136.271 ms/iter ( 120.987 ms/iter) RecurrentNetwork
I0328 12:17:54.073982 492094 prof_dag_net.cc:156] 190.074 ms/iter ( 156.828 ms/iter) RecurrentNetworkGradient
WITHOUT:
I0328 12:25:17.658206 518884 prof_dag_net.cc:156] 375.369 ms/iter ( 249.268 ms/iter) RecurrentNetwork
I0328 12:25:17.658211 518884 prof_dag_net.cc:156] 278.892 ms/iter ( 227.29 ms/iter) RecurrentNetworkGradient
With the LSTM benchmark, this gives about a 2x speedup.
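A minimal sketch of the grow-only policy, assuming the per-step workspaces live in a std::vector as the summary describes; the function name and the stand-in Workspace type here are illustrative, not the actual RecurrentNetwork op code.

  #include <memory>
  #include <vector>

  struct Workspace {};  // stand-in for the per-timestep workspace type

  // Grow the vector when the new sequence is longer; never shrink it.
  // Shrinking on a short sequence would destroy workspaces that the next
  // long sequence must recreate -- exactly the churn this diff removes.
  void EnsureStepWorkspaces(std::vector<std::unique_ptr<Workspace>>* ws,
                            int seqLen) {
    for (auto i = ws->size(); i < static_cast<size_t>(seqLen); ++i) {
      ws->push_back(std::make_unique<Workspace>());
    }
    // No resize(seqLen) here: steps beyond seqLen simply go unused
    // until a longer sequence comes along.
  }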
Reviewed By: jamesr66a
Differential Revision: D4789354
fbshipit-source-id: ad72f61974e35b0474abcacdc466ae9c6b4eb0ff
Summary: Just generate some random data and put it through an LSTM (Caffe2 RNN based), using its own output as the gradient value, for benchmark purposes. With the default parameters provided in this diff, it fits in my dev GPU's memory and I got about 300k entries per second processed. The entries are split into blocks of seq_length * block_size; each entry is of size hidden_dim, and the LSTM takes a hidden_dim-sized input and produces an output of the same size. A shape sketch is below.
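As a hypothetical sketch of the shape arithmetic (the parameter values are made up for illustration; the diff's actual defaults may differ):

  #include <cstdio>

  int main() {
    // Assumed values for illustration only.
    const int seq_length = 20;
    const int block_size = 128;
    const int hidden_dim = 512;
    // One block is seq_length * block_size entries, each a hidden_dim
    // vector; the LSTM maps each hidden_dim input to a hidden_dim output.
    const long entries_per_block = static_cast<long>(seq_length) * block_size;
    // At the reported ~300k entries/sec, this is the block throughput:
    std::printf("entries/block = %ld, blocks/sec ~ %.1f\n",
                entries_per_block, 300000.0 / entries_per_block);
    return 0;
  }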
Reviewed By: salexspb
Differential Revision: D4605815
fbshipit-source-id: dd529302a0a93e8711784c67e4c777c8d6a8cdf4