Xiaoqiang Zheng b96a6516a6 Add CPP Full Reduction Benchmarks. (#50193)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50193

* Supports ATen, a native reference implementation, and NNC TE implementations.
* Supports correctness checks against ATen, in addition to performance checks.

Test plans:

* After enabling "BUILD_TENSOREXPR_BENCHMARK" in CMakeLists.txt, run:
* bin/tensorexpr_bench --benchmark_filter=Reduce1D

Measurements:

On a Broadwell E5-2686 CPU,

Reduce1D/Torch/16777216            5638547 ns    5638444 ns        119 BYTES=11.902G/s
Reduce1D/Naive/16777216           19308235 ns   19308184 ns         36 BYTES=3.47567G/s
Reduce1D/NativeRfactor/16777216    8433348 ns    8433038 ns         85 BYTES=7.95785G/s
Reduce1D/NativeVector/16777216     5608836 ns    5608727 ns        124 BYTES=11.9651G/s
Reduce1D/NativeTiled/16777216      5550233 ns    5550221 ns        126 BYTES=12.0912G/s
Reduce1D/TeNaive/16777216         21451047 ns   21450752 ns         33 BYTES=3.12851G/s
Reduce1D/TeSplitTail/16777216     23701732 ns   23701229 ns         30 BYTES=2.83145G/s
Reduce1D/TeSplitMask/16777216     23683589 ns   23682978 ns         30 BYTES=2.83363G/s
Reduce1D/TeRfactorV2/16777216      5378019 ns    5377909 ns        131 BYTES=12.4786G/s

Result summary:

* The single-threaded performance of NNC TeRfactorV2 matches or exceeds both the ATen and the naive AVX2 counterparts.

Follow-up items:

* rfactor does not work well with split
* We don't have a multi-threaded implementation yet.
  * Missing a "parallel" scheduling primitive, which is no different from what we need for pointwise ops.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D25821880

Pulled By: zheng-xq

fbshipit-source-id: 8df3f40d1eed8749c8edcaacae5f0544dbf6bed3
2021-01-21 10:00:50 -08:00