pytorch/caffe2/perfkernels
James Reed d17c22d024 Improve embedding_bag add kernel (#19329)
Summary:
The embedding_bag add kernel was getting pretty poor throughput relative to memory bandwidth. I used this test to measure the memory bandwidth specifically for the AXPY call: https://gist.github.com/jamesr66a/b27ff9ecbe036eed5ec310c0a3cc53c5

I measured ~8 GB/s before this change and ~14 GB/s after it.
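
For readers unfamiliar with the term, AXPY computes y = a*x + y elementwise. The following is a minimal Python sketch of the operation and of one way an effective-bandwidth figure could be derived from a timing. It is not the kernel in this PR or the linked gist; the function names are hypothetical, and the assumption that each element costs three memory streams (read x, read y, write y) of 4 bytes each is an illustrative accounting that may differ from the gist's.

```python
import time

def axpy(a, x, y):
    """Return a*x + y elementwise (the AXPY operation) for equal-length lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]

def axpy_bandwidth_gbs(n, seconds, bytes_per_elem=4):
    """Effective bandwidth in GB/s for an n-element AXPY that took `seconds`.

    Assumption: 3 memory streams per element (read x, read y, write y),
    each of `bytes_per_elem` bytes (4 for float32).
    """
    return 3 * n * bytes_per_elem / seconds / 1e9

if __name__ == "__main__":
    n = 100_000
    x = [1.0] * n
    y = [2.0] * n
    start = time.perf_counter()
    axpy(0.5, x, y)
    elapsed = time.perf_counter() - start
    # A pure-Python loop is far slower than a vectorized kernel; this only
    # demonstrates how the GB/s number is formed from size and time.
    print("GB/s", axpy_bandwidth_gbs(n, elapsed))
```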

This seems to speed up the operator overall by around 1.3x (benchmark: https://gist.github.com/jamesr66a/c533817c334d0be432720ef5e54a4166):

== Before ==

time_per_iter 0.0001298875093460083
GB/s 3.082544287868467

== After ==

time_per_iter 0.00010104801654815674
GB/s 3.9623142905451076

The large difference between the local BW increase and the full-op BW increase likely indicates significant time is being spent elsewhere in the op, so I will investigate that.
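
As a rough check on the numbers above, the sketch below recomputes the overall speedup from the reported timings and compares it with the local AXPY bandwidth gain. The last step uses a simple Amdahl's-law model (only the AXPY portion speeds up, everything else is unchanged) to estimate what fraction of the original op time the AXPY call would have to account for. That model is an assumption made here for illustration, not something measured in the PR, and `implied_fraction` is a hypothetical helper name.

```python
# Figures copied from the commit message.
TIME_BEFORE = 0.0001298875093460083   # seconds per iteration, before
TIME_AFTER = 0.00010104801654815674   # seconds per iteration, after
AXPY_BW_BEFORE = 8.0                  # GB/s (approximate, as reported)
AXPY_BW_AFTER = 14.0                  # GB/s (approximate, as reported)

overall_speedup = TIME_BEFORE / TIME_AFTER
local_speedup = AXPY_BW_AFTER / AXPY_BW_BEFORE

def implied_fraction(local, overall):
    """Fraction f of original runtime spent in the sped-up part.

    Assumes Amdahl's law: overall = 1 / ((1 - f) + f / local).
    Solving for f gives (1 - 1/overall) / (1 - 1/local).
    """
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / local)

axpy_fraction = implied_fraction(local_speedup, overall_speedup)

print(f"overall speedup: {overall_speedup:.3f}x")
print(f"local AXPY speedup: {local_speedup:.2f}x")
print(f"implied AXPY share of original time: {axpy_fraction:.2f}")
```

Under this model the AXPY call would account for only about half of the original per-iteration time, which is consistent with the observation that significant time is spent elsewhere in the op.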

EDIT: I updated this PR to include a call into caffe2/perfkernels. This is the progression:

Before

time_per_iter 8.983819484710693e-05
GB/s 4.456723564864611

After no axpy
time_per_iter 7.19951868057251e-05
GB/s 5.56126065872172

After perfkernels
time_per_iter 5.6699180603027346e-05
GB/s 7.061548257694262

After perfkernels no grad
time_per_iter 4.388842582702637e-05
GB/s 9.122769670026413
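
The progression above can be summarized as per-step and cumulative speedups. The sketch below only does arithmetic on the timings reported in this message; the stage labels are taken from the text and the variable names are illustrative.

```python
# (stage label, seconds per iteration), copied from the commit message.
STAGES = [
    ("before", 8.983819484710693e-05),
    ("after no axpy", 7.19951868057251e-05),
    ("after perfkernels", 5.6699180603027346e-05),
    ("after perfkernels no grad", 4.388842582702637e-05),
]

baseline = STAGES[0][1]

# Speedup of each stage relative to the first ("before") measurement.
cumulative = [baseline / t for _, t in STAGES]

# Speedup of each stage relative to the stage just before it.
stepwise = [1.0] + [
    STAGES[i - 1][1] / STAGES[i][1] for i in range(1, len(STAGES))
]

for (label, _), step, total in zip(STAGES, stepwise, cumulative):
    print(f"{label:28s} step {step:.2f}x  cumulative {total:.2f}x")
```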
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19329

Reviewed By: dzhulgakov

Differential Revision: D14969630

Pulled By: jamesr66a

fbshipit-source-id: 42d1015772c87bedd119e33c0aa2c8105160a738
2019-04-19 19:16:24 -07:00
__init__.py re-enable copy of python files, but be careful that the copy is only … (#14982) 2018-12-11 16:54:08 -08:00
adagrad_avx.cc use fp16<->fp32 intrinsic (#17496) 2019-03-07 02:23:07 -08:00
adagrad.cc more careful use of inline/template function in perfkernels (#15388) 2019-01-30 22:49:37 -08:00
adagrad.h more careful use of auto in sparse operations (#17958) 2019-03-14 22:10:42 -07:00
CMakeLists.txt Resolve errors in perfkernel for Windows (#16031) 2019-01-16 21:51:00 -08:00
common_avx.cc Remove Apache headers from source. 2018-03-27 13:10:18 -07:00
common_avx2.cc Remove Apache headers from source. 2018-03-27 13:10:18 -07:00
common_avx512.cc include avx512vl to avx512 code path (#14733) 2018-12-05 00:50:51 -08:00
common.h Resolve errors in perfkernel for Windows (#16031) 2019-01-16 21:51:00 -08:00
cvtsh_ss_bugfix.h use fp16<->fp32 intrinsic (#17496) 2019-03-07 02:23:07 -08:00
embedding_lookup_avx2.cc more careful use of inline/template function in perfkernels (#15388) 2019-01-30 22:49:37 -08:00
embedding_lookup_fused_8bit_rowwise_avx2.cc more careful use of inline/template function in perfkernels (#15388) 2019-01-30 22:49:37 -08:00
embedding_lookup.cc more careful use of inline/template function in perfkernels (#15388) 2019-01-30 22:49:37 -08:00
embedding_lookup.h Improve embedding_bag add kernel (#19329) 2019-04-19 19:16:24 -07:00
fused_8bit_rowwise_embedding_lookup.cc more careful use of inline/template function in perfkernels (#15388) 2019-01-30 22:49:37 -08:00
fused_8bit_rowwise_embedding_lookup.h more careful use of inline/template function in perfkernels (#15388) 2019-01-30 22:49:37 -08:00
hp_emblookup_codegen.py more careful use of inline/template function in perfkernels (#15388) 2019-01-30 22:49:37 -08:00
math_cpu_avx2.cc more careful use of inline/template function in perfkernels (#15388) 2019-01-30 22:49:37 -08:00
math_cpu_base.cc Resolve errors in perfkernel for Windows (#16031) 2019-01-16 21:51:00 -08:00
math.h Resolve errors in perfkernel for Windows (#16031) 2019-01-16 21:51:00 -08:00
typed_axpy_avx.cc more careful use of inline/template function in perfkernels (#15388) 2019-01-30 22:49:37 -08:00
typed_axpy_avx2.cc more careful use of inline/template function in perfkernels (#15388) 2019-01-30 22:49:37 -08:00
typed_axpy.cc Move math::Axpy function to elementwise lib (#18316) 2019-03-26 12:19:19 -07:00
typed_axpy.h Remove Apache headers from source. 2018-03-27 13:10:18 -07:00