Mirror of https://github.com/zebrajr/pytorch.git, synced 2025-12-07 00:21:07 +01:00
Summary: This was actually getting pretty poor throughput with respect to memory bandwidth. I used this test to measure the memory bandwidth specifically for the AXPY call: https://gist.github.com/jamesr66a/b27ff9ecbe036eed5ec310c0a3cc53c5

I got ~8 GB/s before this change and ~14 GB/s after it. This seems to speed up the operator overall by around 1.3x (benchmark: https://gist.github.com/jamesr66a/c533817c334d0be432720ef5e54a4166):

== Before ==
time_per_iter 0.0001298875093460083
GB/s 3.082544287868467

== After ==
time_per_iter 0.00010104801654815674
GB/s 3.9623142905451076

The large difference between the local bandwidth increase and the full-op bandwidth increase likely indicates that significant time is being spent elsewhere in the op, so I will investigate that.

EDIT: I updated this PR to include a call into caffe2/perfkernels. This is the progression:

Before: time_per_iter 8.983819484710693e-05, GB/s 4.456723564864611

After no axpy: time_per_iter 7.19951868057251e-05, GB/s 5.56126065872172

After perfkernels: time_per_iter 5.6699180603027346e-05, GB/s 7.061548257694262

After perfkernels, no grad: time_per_iter 4.388842582702637e-05, GB/s 9.122769670026413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/19329

Reviewed By: dzhulgakov

Differential Revision: D14969630

Pulled By: jamesr66a

fbshipit-source-id: 42d1015772c87bedd119e33c0aa2c8105160a738

| | | |
|---|---|---|
| __init__.py | ||
| adagrad_avx.cc | ||
| adagrad.cc | ||
| adagrad.h | ||
| CMakeLists.txt | ||
| common_avx.cc | ||
| common_avx2.cc | ||
| common_avx512.cc | ||
| common.h | ||
| cvtsh_ss_bugfix.h | ||
| embedding_lookup_avx2.cc | ||
| embedding_lookup_fused_8bit_rowwise_avx2.cc | ||
| embedding_lookup.cc | ||
| embedding_lookup.h | ||
| fused_8bit_rowwise_embedding_lookup.cc | ||
| fused_8bit_rowwise_embedding_lookup.h | ||
| hp_emblookup_codegen.py | ||
| math_cpu_avx2.cc | ||
| math_cpu_base.cc | ||
| math.h | ||
| typed_axpy_avx.cc | ||
| typed_axpy_avx2.cc | ||
| typed_axpy.cc | ||
| typed_axpy.h | ||
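
The figures in the commit message above can be cross-checked with a little arithmetic. The following is a minimal sketch, not the author's benchmark: it only re-derives speedups and the implied bytes moved per iteration from the reported `time_per_iter` and `GB/s` values. The helper names (`speedup`, `bytes_per_iter`) are hypothetical, and the sketch assumes that the reported GB/s means 1e9 bytes per second, which the commit message does not state.

```python
# Figures copied verbatim from the commit message: (label, time_per_iter in s, GB/s).
FIRST_RUN = [
    ("before", 0.0001298875093460083, 3.082544287868467),
    ("after", 0.00010104801654815674, 3.9623142905451076),
]
PROGRESSION = [
    ("before", 8.983819484710693e-05, 4.456723564864611),
    ("after no axpy", 7.19951868057251e-05, 5.56126065872172),
    ("after perfkernels", 5.6699180603027346e-05, 7.061548257694262),
    ("after perfkernels, no grad", 4.388842582702637e-05, 9.122769670026413),
]


def speedup(t_before: float, t_after: float) -> float:
    """Ratio of per-iteration times; values above 1 mean the change is faster."""
    return t_before / t_after


def bytes_per_iter(time_per_iter: float, gb_per_s: float,
                   bytes_per_gb: float = 1e9) -> float:
    """Bytes moved per iteration implied by a (time, bandwidth) pair.

    bytes_per_gb=1e9 is an assumption about how the benchmark defines a GB.
    """
    return time_per_iter * gb_per_s * bytes_per_gb


if __name__ == "__main__":
    # Full-operator speedup from the first benchmark (described as "around 1.3x").
    print("first run speedup:", speedup(FIRST_RUN[0][1], FIRST_RUN[1][1]))

    # Local AXPY bandwidth ratio from the approximate ~8 GB/s and ~14 GB/s figures.
    print("local AXPY bandwidth ratio:", 14 / 8)

    # Cumulative speedup of each later step relative to the first row.
    base_time = PROGRESSION[0][1]
    for label, t, gbs in PROGRESSION:
        print(label, "speedup:", speedup(base_time, t),
              "implied bytes/iter:", bytes_per_iter(t, gbs))
```

If the implied bytes per iteration come out the same for every row, the bandwidth numbers are just a fixed workload size divided by the measured time, so the time ratios and the bandwidth ratios carry the same information.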