Summary: Based on the benchmark script at `caffe2/experiments/python/device_reduce_sum_bench.py`, device reduce sum is slower for N <= 10000, so SumElements switches to device reduce only for large N. This diff applies the same scheme to SumSqrElements.
Reviewed By: jamesr66a
Differential Revision: D5369868
fbshipit-source-id: ae13a611aff9d3464d1c4950ee155c740a2da339
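Note: the size-based switch described in the summary above can be sketched roughly as follows. This is a CPU-only illustration, not the code in the diff. The names `kDeviceReduceThreshold`, `ReducePath`, `ChooseReducePath`, and `SumSqr` are hypothetical, the cutoff of 10000 is taken from the benchmark note, and whether the real boundary is inclusive is an assumption. In the actual CUDA code the two branches would launch different kernels; here both branches compute the same sum so the example stays runnable without a GPU.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical cutoff: the summary reports device reduce as slower for N <= 10000.
constexpr std::size_t kDeviceReduceThreshold = 10000;

// Hypothetical labels for the two reduction strategies.
enum class ReducePath { kSimpleKernel, kDeviceReduce };

// Pick the device-wide reduction only when the input is larger than the cutoff.
inline ReducePath ChooseReducePath(std::size_t n) {
  return n > kDeviceReduceThreshold ? ReducePath::kDeviceReduce
                                    : ReducePath::kSimpleKernel;
}

// CPU stand-in for a sum-of-squares reduction. On a GPU, each branch of the
// dispatch would call a different kernel; here `used` only records the choice.
inline float SumSqr(const std::vector<float>& x, ReducePath* used = nullptr) {
  const ReducePath path = ChooseReducePath(x.size());
  if (used != nullptr) {
    *used = path;
  }
  float acc = 0.0f;
  for (float v : x) {
    acc += v * v;  // accumulate the square of each element
  }
  return acc;
}
```

Keeping the cutoff in a single constant would let SumElements and SumSqrElements share one threshold if the benchmark is rerun on different hardware.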
Summary: Port SumElements and softmax_ops.cu to use device reduce sum
Reviewed By: akyrola
Differential Revision: D5351881
fbshipit-source-id: ca9604186c261ffcb1480da2a17baab8a4809372