Summary: Generate random data and run it through an LSTM (Caffe2 RNN-based), feeding the LSTM's own output back in as the gradient value, for benchmark purposes. With the default parameters it fits my dev GPU's memory, and with the defaults provided in this diff I measured 300k entries per second. The entries are split into blocks of seq_length * block_size; each entry is of size hidden_dim, and the LSTM takes a hidden_dim-sized input and produces an output of the same size.
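To illustrate the benchmark structure described above (random input, LSTM forward pass, the LSTM's own output reused as the incoming gradient), here is a minimal sketch. It uses PyTorch as a stand-in for the Caffe2 RNN ops in this diff, and the hidden_dim / seq_length / block_size values are hypothetical placeholders, not the diff's real defaults.

    # Sketch of the benchmark loop; PyTorch stand-in for the Caffe2 ops.
    import time
    import torch

    hidden_dim, seq_length, block_size = 40, 20, 128  # hypothetical values
    device = "cuda" if torch.cuda.is_available() else "cpu"

    lstm = torch.nn.LSTM(input_size=hidden_dim, hidden_size=hidden_dim).to(device)

    entries = 0
    start = time.time()
    for _ in range(100):
        # One block of random entries: seq_length x block_size,
        # each entry of size hidden_dim.
        x = torch.randn(seq_length, block_size, hidden_dim, device=device)
        output, _ = lstm(x)
        # The trick from the summary: reuse the LSTM's own output as the
        # incoming gradient instead of computing a real loss.
        output.backward(output.detach())
        lstm.zero_grad()
        entries += seq_length * block_size
    if device == "cuda":
        torch.cuda.synchronize()
    print(f"{entries / (time.time() - start):.0f} entries/sec")

Reusing the output as the gradient keeps the backward pass exercised without the cost (or convergence concerns) of a real objective, which is all a throughput benchmark needs.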
Reviewed By: salexspb
Differential Revision: D4605815
fbshipit-source-id: dd529302a0a93e8711784c67e4c777c8d6a8cdf4