Summary:
These GPU paths are probably even buggier than the CPU paths for sparse gradients with duplicate indices. Both paths apply multiple momentum updates to a duplicated index within a single iteration, but only the GPU path is non-deterministic. Depending on how we decide to address the issues on the CPU path, pooyadavoodi has a good idea for how to make the sparse GPU ops match dense behavior.
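
To illustrate the duplicate-index issue, here is a minimal pure-Python sketch. It assumes a plain momentum SGD rule (`m = mu * m + lr * g; w -= m`) on scalar parameters; the function names, the update rule, and the dedup-by-summing step are all illustrative assumptions, not the actual Caffe2 operator code and not necessarily the approach proposed for the GPU ops. It does not model the GPU non-determinism, which would come from duplicates being processed in an unspecified order.

```python
def sparse_momentum_naive(param, moment, indices, grads, lr=0.1, mu=0.9):
    """Hypothetical sparse update: one momentum step per (index, grad) pair.

    A duplicated index gets its momentum decayed and applied more than once.
    """
    param, moment = list(param), list(moment)
    for i, g in zip(indices, grads):
        moment[i] = mu * moment[i] + lr * g
        param[i] -= moment[i]
    return param, moment


def sparse_momentum_deduped(param, moment, indices, grads, lr=0.1, mu=0.9):
    """Hypothetical dense-equivalent update: sum duplicates, then one step per index."""
    summed = {}
    for i, g in zip(indices, grads):
        summed[i] = summed.get(i, 0.0) + g
    param, moment = list(param), list(moment)
    for i, g in summed.items():
        moment[i] = mu * moment[i] + lr * g
        param[i] -= moment[i]
    return param, moment


# Index 0 appears twice in the sparse gradient.
indices = [0, 0, 1]
grads = [1.0, 1.0, 1.0]
naive_param, naive_moment = sparse_momentum_naive([1.0, 1.0], [0.0, 0.0], indices, grads)
dedup_param, dedup_moment = sparse_momentum_deduped([1.0, 1.0], [0.0, 0.0], indices, grads)
print(naive_param, naive_moment)
print(dedup_param, dedup_moment)
```

With these inputs, the parameter at index 0 ends up at about 0.71 under the per-pair update and at 0.8 when duplicates are summed first, while index 1 (no duplicates) is identical in both.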
Closes https://github.com/caffe2/caffe2/pull/254
Reviewed By: bwasti
Differential Revision: D4871680
Pulled By: dzhulgakov
fbshipit-source-id: 220be57a0f699a22ea85ed4f7022d92d362d06b3