Commit Graph

16 Commits

Author SHA1 Message Date
Bert Maher
a23e82df10 [nnc] Tweak log_nnc_sleef so vectorization kicks in (#51491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51491

The vectorizer heuristic is pretty dumb and only kicks in if the
unroll factor is exactly 8 or 4.

It's still slower than the direct implementation, which isn't surprising.
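
To illustrate the heuristic (a plain C++ sketch, not the NNC code): an unroll factor of 8 gives the loop the shape below, where eight independent lanes per trip map cleanly onto one vector register.

```
#include <cmath>
#include <cstddef>

// Illustrative only: a loop written with an unroll factor of 8, the
// shape the vectorizer heuristic recognizes. std::log stands in for
// the sleef-based fast log in the diff.
void log8(const float* in, float* out, std::size_t n) {
  std::size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    // Eight independent lanes per trip map onto one 256-bit vector.
    for (std::size_t lane = 0; lane < 8; ++lane) {
      out[i + lane] = std::log(in[i + lane]);
    }
  }
  for (; i < n; ++i) {  // scalar tail for the remainder
    out[i] = std::log(in[i]);
  }
}
```
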
ghstack-source-id: 120783426

Test Plan:
`buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench`

Before:
```
---------------------------------------------------------------------------
Benchmark                    Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64           438 ns        438 ns    1795511 log/s=146.259M/s
log_nnc_sleef/512         3196 ns       3195 ns     210032 log/s=160.235M/s
log_nnc_sleef/8192       77467 ns      77466 ns       8859 log/s=105.749M/s
log_nnc_sleef/32768     310206 ns     310202 ns       2170 log/s=105.634M/s
log_nnc_fast/64            100 ns        100 ns    7281074 log/s=637.144M/s
log_nnc_fast/512           546 ns        546 ns    1335816 log/s=938.361M/s
log_nnc_fast/8192         7360 ns       7359 ns      91971 log/s=1.11316G/s
log_nnc_fast/32768       30793 ns      30792 ns      22633 log/s=1064.17M/s
log_aten/64                427 ns        427 ns    1634897 log/s=150.021M/s
log_aten/512               796 ns        796 ns     877318 log/s=643.566M/s
log_aten/8192             6690 ns       6690 ns     102649 log/s=1.22452G/s
log_aten/32768           25357 ns      25350 ns      27808 log/s=1.29263G/s
```

After:
```
---------------------------------------------------------------------------
Benchmark                    Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64           189 ns        188 ns    3872475 log/s=340.585M/s
log_nnc_sleef/512         1307 ns       1307 ns     557770 log/s=391.709M/s
log_nnc_sleef/8192       20259 ns      20257 ns      34240 log/s=404.404M/s
log_nnc_sleef/32768      81556 ns      81470 ns       8767 log/s=402.209M/s
log_nnc_fast/64            110 ns        110 ns    6564558 log/s=581.116M/s
log_nnc_fast/512           554 ns        554 ns    1279304 log/s=923.376M/s
log_nnc_fast/8192         7774 ns       7774 ns      91421 log/s=1053.75M/s
log_nnc_fast/32768       31008 ns      31006 ns      21279 log/s=1056.83M/s
```

Reviewed By: bwasti

Differential Revision: D26139067

fbshipit-source-id: db31897ee9922695ff9dff4ff46e3d3fbd61f4c2
2021-02-01 16:35:37 -08:00
Mikhail Zolotukhin
e975169426 [TensorExpr] Redesign Tensor class. (#50995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50995

This change makes 'Tensor' a thin wrapper over 'Buf' and 'Stmt', and
merges it with the recently introduced 'CompoundTensor'. A statement for
the tensor is either passed directly to the Tensor constructor (as with
'CompoundTensor') or built immediately in the constructor.

LoopNest is no longer responsible for constructing statements from
tensors - it simply stitches together the already-constructed statements
contained in Tensors. A side effect is that we can no longer construct
several loopnests from the same tensors - we need to explicitly clone
the statements if we want to do that. A special copy constructor was
added to LoopNest to make this more convenient (note: this only affects
tests; we don't usually create multiple loopnests elsewhere).
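
A minimal conceptual sketch of the new shape of the class (the types and member names here are illustrative, not the actual tensorexpr declarations):

```
// Conceptual sketch of the redesign; Buf/Stmt stand in for the real
// tensorexpr IR nodes and the member names are illustrative.
struct Buf;   // describes the storage (dims, dtype)
struct Stmt;  // the statement that computes the buffer

class Tensor {
 public:
  // The statement is either passed in ready-made (the old
  // CompoundTensor path) or built right here in the constructor.
  Tensor(Buf* buf, Stmt* stmt) : buf_(buf), stmt_(stmt) {}

  Buf* buf() const { return buf_; }
  Stmt* stmt() const { return stmt_; }

 private:
  Buf* buf_;
  Stmt* stmt_;  // LoopNest now just stitches these together
};
```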

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26038223

Pulled By: ZolotukhinM

fbshipit-source-id: 27a2e5900437cfb0c151e8f89815edec53608e17
2021-01-27 16:14:22 -08:00
Nikita Shulga
97ea95ddd7 Delete tabs from bench_approx.cpp (#51157)
Summary:
Introduced by D25981260 (f08464f31d)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51157

Reviewed By: bwasti

Differential Revision: D26090008

Pulled By: malfet

fbshipit-source-id: b63f1bb1683c7261902de7eaab24a05a5159ce7e
2021-01-26 15:53:47 -08:00
Bert Maher
c4029444d1 [nnc] Per-operator benchmarks (#51093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51093

Operator-level benchmarks comparing eager-mode PyTorch to
NNC-generated fused kernels.  We wouldn't normally see these ops in isolation,
but the comparison points out where NNC is falling short (or doing well).

I threw in a composed hardswish for fun, because it's my favorite activation
function.

Notably, it exposes a bug in our build process that's preventing vectorization
from using `sleef`, so we're using scalar calls to libm with predictably lousy
performance.  Fix incoming.

This benchmark is similar to the pure NNC approach in `microbenchmarks.py`, but
will include the overhead of dispatching the fused kernel through TorchScript.
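
For reference, hardswish composes a handful of elementwise ops: hardswish(x) = x * min(max(x + 3, 0), 6) / 6. A plain C++ sketch of the chain a fuser gets to collapse (the function name is illustrative):

```
#include <algorithm>
#include <cstddef>

// Illustrative sketch: hardswish(x) = x * min(max(x + 3, 0), 6) / 6,
// written as separate elementwise ops so a fuser has something to fuse.
void hardswish(const float* x, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    float t = x[i] + 3.0f;                  // add
    t = std::min(std::max(t, 0.0f), 6.0f);  // clamp (relu6)
    out[i] = x[i] * t * (1.0f / 6.0f);      // mul, scale
  }
}
```
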
ghstack-source-id: 120403675

Test Plan:
```
op                        eager        nnc    speedup
hardswish                 0.187      0.051       3.70
hardswish                 0.052      0.052       1.00
sigmoid                   0.148      1.177       0.13
reciprocal                0.049      0.050       0.98
neg                       0.038      0.037       1.02
relu                      0.037      0.036       1.03
isnan                     0.119      0.020       5.86
log                       0.082      1.330       0.06
log10                     0.148      1.848       0.08
log1p                     0.204      1.413       0.14
log2                      0.285      1.167       0.24
exp                       0.063      1.123       0.06
expm1                     0.402      1.417       0.28
erf                       0.167      0.852       0.20
erfc                      0.181      1.098       0.16
cos                       0.124      0.793       0.16
sin                       0.126      0.838       0.15
tan                       0.285      1.777       0.16
acos                      0.144      1.358       0.11
asin                      0.126      1.193       0.11
cosh                      0.384      1.761       0.22
sinh                      0.390      2.279       0.17
atan                      0.240      1.564       0.15
tanh                      0.320      2.259       0.14
sqrt                      0.043      0.069       0.63
rsqrt                     0.118      0.117       1.01
abs                       0.038      0.037       1.03
ceil                      0.038      0.038       1.01
floor                     0.039      0.039       1.00
round                     0.039      0.292       0.13
trunc                     0.040      0.036       1.12
lgamma                    2.045      2.721       0.75
```

Reviewed By: zheng-xq

Differential Revision: D26069791

fbshipit-source-id: 236e7287ba1b3f67fdcb938949a92bbbdfa13dba
2021-01-26 14:10:08 -08:00
Bram Wasti
f08464f31d [nnc] Add benchmarks
Summary: Adding a set of benchmarks for key operators

Test Plan:
buck build mode/opt -c 'fbcode.caffe2_gpu_type=none' caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 numactl -C 3 ./buck-out/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench

Reviewed By: ZolotukhinM

Differential Revision: D25981260

fbshipit-source-id: 17681fc1527f43ccf9bcc80704415653a627b396
2021-01-26 13:51:33 -08:00
Xiaoqiang Zheng
b96a6516a6 Add CPP Full Reduction Benchmarks. (#50193)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50193

* Supports aten, a native reference implementation, and NNC TE implementations.
* Supports functionality checks against aten, in addition to performance checks.

Test plans:

* Enable "BUILD_TENSOREXPR_BENCHMARK" in CMakeLists.txt.
* Run bin/tensorexpr_bench --benchmark_filter=Reduce1D

Measurements:

On a Broadwell E5-2686 CPU:

```
Reduce1D/Torch/16777216            5638547 ns    5638444 ns        119 BYTES=11.902G/s
Reduce1D/Naive/16777216           19308235 ns   19308184 ns         36 BYTES=3.47567G/s
Reduce1D/NativeRfactor/16777216    8433348 ns    8433038 ns         85 BYTES=7.95785G/s
Reduce1D/NativeVector/16777216     5608836 ns    5608727 ns        124 BYTES=11.9651G/s
Reduce1D/NativeTiled/16777216      5550233 ns    5550221 ns        126 BYTES=12.0912G/s
Reduce1D/TeNaive/16777216         21451047 ns   21450752 ns         33 BYTES=3.12851G/s
Reduce1D/TeSplitTail/16777216     23701732 ns   23701229 ns         30 BYTES=2.83145G/s
Reduce1D/TeSplitMask/16777216     23683589 ns   23682978 ns         30 BYTES=2.83363G/s
Reduce1D/TeRfactorV2/16777216      5378019 ns    5377909 ns        131 BYTES=12.4786G/s
```

Result summary:

* The single-threaded performance of NNC TeRfactorV2 matches or exceeds the aten and AVX2 native counterparts (the rfactor idea is sketched after the follow-up items below).

Follow-up items:

* rfactor does not work well with split.
* We don't have a multi-threaded implementation yet.
  * The missing piece is a "parallel" scheduling primitive, which is no different from what we need for pointwise ops.
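
To make the TeRfactorV2 result concrete, here is a plain C++ sketch of the rfactor technique (not the actual TE schedule): the serial dependence of a 1-D sum is broken by keeping independent per-lane partial sums, which vectorize, and combining them at the end.

```
#include <cstddef>

// Sketch of the rfactor idea behind TeRfactorV2: break the serial
// dependence of a 1-D sum by keeping 8 independent partial sums
// (one per vector lane), then reduce them at the end.
float reduce1d_rfactor(const float* in, std::size_t n) {
  float partial[8] = {};              // independent accumulators
  std::size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    for (std::size_t lane = 0; lane < 8; ++lane) {
      partial[lane] += in[i + lane];  // vectorizable: no cross-lane dep
    }
  }
  float sum = 0.0f;
  for (std::size_t lane = 0; lane < 8; ++lane) {
    sum += partial[lane];             // combine the partial sums
  }
  for (; i < n; ++i) {                // scalar tail
    sum += in[i];
  }
  return sum;
}
```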

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D25821880

Pulled By: zheng-xq

fbshipit-source-id: 8df3f40d1eed8749c8edcaacae5f0544dbf6bed3
2021-01-21 10:00:50 -08:00
Bert Maher
468c99fba4 Reapply D25856891: [te] Benchmark comparing fused overhead to unfused (#50543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50543

Original commit changeset: 2d2f07f79986

Was part of a stack that got reverted.  This is just a benchmark.
ghstack-source-id: 119825594

Test Plan: CI

Reviewed By: navahgar

Differential Revision: D25912439

fbshipit-source-id: 5d9ca45810fff8931a3cfbd03965e11050180676
2021-01-14 14:17:45 -08:00
Mike Ruberry
4ee631cdf0 Revert D25856891: [te] Benchmark comparing fused overhead to unfused
Test Plan: revert-hammer

Differential Revision:
D25856891 (36ae3feb22)

Original commit changeset: 0e99515ec2e7

fbshipit-source-id: 2d2f07f79986ca7815b9eae63e734db76bdfc0c8
2021-01-14 04:33:35 -08:00
Bert Maher
36ae3feb22 [te] Benchmark comparing fused overhead to unfused (#50305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50305

That's it
ghstack-source-id: 119631533

Test Plan:
```
buck run //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -- --benchmark_filter=Overhead
```
```
Run on (24 X 2394.67 MHz CPU s)
2021-01-08 16:06:17
-------------------------------------------------------
Benchmark                Time           CPU Iterations
-------------------------------------------------------
FusedOverhead         2157 ns       2157 ns     311314
UnfusedOverhead       2443 ns       2443 ns     311221
```

Reviewed By: ZolotukhinM

Differential Revision: D25856891

fbshipit-source-id: 0e99515ec2e769a04929157d46903759c03182a3
2021-01-13 12:09:37 -08:00
Bram Wasti
1047957831 [te][reapply] Add fast log approximation based on sleef (#49575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49575

This is a fast log implementation.

benchmark:

```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```
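
For intuition only - this is a generic fast-log sketch, not the sleef-derived kernel in this diff: such implementations typically split x into mantissa and exponent and evaluate a short polynomial on the mantissa, since log(m * 2^e) = log(m) + e * ln 2.

```
#include <cmath>

// Generic fast-log sketch, not the sleef-based code in this diff.
// Decompose x = m * 2^e with m in [0.5, 1), so
//   log(x) = log(m) + e * ln(2),
// and approximate log(m) with a short polynomial in (m - 1).
float fast_log_sketch(float x) {
  int e;
  float m = std::frexp(x, &e);  // x = m * 2^e, m in [0.5, 1)
  float t = m - 1.0f;           // small argument for the polynomial
  // Truncated Taylor series of log(1 + t); real kernels use a tuned
  // minimax polynomial and handle 0/negatives/NaN explicitly.
  float p = t - 0.5f * t * t + (1.0f / 3.0f) * t * t * t;
  return p + static_cast<float>(e) * 0.6931471805599453f;  // + e * ln(2)
}
```
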

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat

Reviewed By: bertmaher

Differential Revision: D25627157

fbshipit-source-id: a4920f4f4005ce617d372b375e790ca966275cd9
2020-12-17 17:02:00 -08:00
Edward Yang
ea4ccc730e Revert D25445815: [te] Add fast log approximation based on sleef
Test Plan: revert-hammer

Differential Revision:
D25445815 (1329066b69)

Original commit changeset: 20696eacd12a

fbshipit-source-id: 38830a6abd16260d60e5dd9a5594e65736a9c782
2020-12-17 15:03:17 -08:00
Bram Wasti
1329066b69 [te] Add fast log approximation based on sleef
Summary:
This is a fast log implementation.

benchmark:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat

Reviewed By: bertmaher

Differential Revision: D25445815

fbshipit-source-id: 20696eacd12a55e797f606f4a6dbbd94c9652888
2020-12-17 14:28:34 -08:00
Bert Maher
464d23e6b4 [te][benchmark] Add more optimized versions of gemm (#48159)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48159

Test Plan: Imported from OSS

Reviewed By: Chillee, ngimel

Differential Revision: D25059742

Pulled By: bertmaher

fbshipit-source-id: f197347f739c5bd2a4182c59ebf4642000c3dd55
2020-11-18 12:21:08 -08:00
Bert Maher
b7261de0df [pytorch][te] Add compilation time benchmark (#46124)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46124

We want to make sure we can actually fuse kernels within a fairly
tight time budget.  So here's a quick benchmark of codegen for a simple
pointwise activation function (swish).  I kept all the intermediate tensors
separate to force TE to actually do inlining.
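
Concretely, swish(x) = x * sigmoid(x); with the intermediates kept separate, the computation looks roughly like the plain C++ sketch below (illustrative only, not the TE program in the benchmark):

```
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the swish decomposition with separate intermediate
// tensors, so the compiler has real inlining work to do.
std::vector<float> swish(const std::vector<float>& x) {
  const std::size_t n = x.size();
  std::vector<float> neg(n), e(n), denom(n), sig(n), out(n);
  for (std::size_t i = 0; i < n; ++i) neg[i] = -x[i];            // t0 = -x
  for (std::size_t i = 0; i < n; ++i) e[i] = std::exp(neg[i]);   // t1 = exp(t0)
  for (std::size_t i = 0; i < n; ++i) denom[i] = 1.0f + e[i];    // t2 = 1 + t1
  for (std::size_t i = 0; i < n; ++i) sig[i] = 1.0f / denom[i];  // sigmoid(x)
  for (std::size_t i = 0; i < n; ++i) out[i] = x[i] * sig[i];    // x * sigmoid(x)
  return out;
}
```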

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
```

I've only run in debug mode so the results aren't super meaningful, but even in
that mode it's 18ms for compilation, 15ms of which are in LLVM.

Update, opt build mode:
```
----------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations
----------------------------------------------------------------------------
BM_CompileSwish                         5123276 ns    5119846 ns        148
BM_CompileSwishLLVMOnly                 4754361 ns    4753701 ns        160
```

Reviewed By: asuhan

Differential Revision: D24232801

fbshipit-source-id: d58a8b7f79bcd9244c49366af7a693e09f24bf76
2020-10-09 23:11:37 -07:00
Bert Maher
f2e569461b [te] Tiled (m=32 x n=32) gemm benchmark (#45905)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45905
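
For reference, the m=32 x n=32 tiling in the title refers to the classic blocking scheme sketched below in plain C++ (a generic illustration, not the TE schedule in this diff):

```
#include <algorithm>
#include <cstddef>

// Generic 32x32-tiled gemm sketch: iterate over m/n tiles so each
// C tile stays cache-resident while streaming over k.
constexpr std::size_t kTile = 32;

void gemm_tiled(const float* A, const float* B, float* C, std::size_t N) {
  for (std::size_t i0 = 0; i0 < N; i0 += kTile)
    for (std::size_t j0 = 0; j0 < N; j0 += kTile)
      for (std::size_t k = 0; k < N; ++k)
        for (std::size_t i = i0; i < std::min(i0 + kTile, N); ++i)
          for (std::size_t j = j0; j < std::min(j0 + kTile, N); ++j)
            C[i * N + j] += A[i * N + k] * B[k * N + j];
}
```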

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D24142402

Pulled By: bertmaher

fbshipit-source-id: b39e18b6985ee1c1f654fba4498ed91ff14d8d5f
2020-10-06 16:57:31 -07:00
Bert Maher
50f89578dd [te] Add a benchmark harness (#45875)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45875

Adds a googlebenchmark harness for perf testing programs generated by
tensorexpr, sans any pytorch wrappings (for python-level benchmarks of
tensorexpr, see benchmarks/tensorexpr).

Currently there's a harness for gemm that sets up the problem using torch (and
also measures the perf of a torch::mm to give a baseline).

Right now there's just an unoptimized implementation that is not expected to be
very fast.  More optimized versions are coming.
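
The harness pattern itself is plain googlebenchmark; a minimal sketch with a GFLOPS rate counter follows (the benchmark name and sizes are illustrative, not the ones in the diff):

```
#include <algorithm>
#include <cstddef>
#include <vector>
#include <benchmark/benchmark.h>

// Minimal googlebenchmark sketch of the harness pattern: a naive gemm
// with a GFLOPS rate counter.
static void BM_GemmNaive(benchmark::State& state) {
  const std::size_t n = static_cast<std::size_t>(state.range(0));
  std::vector<float> a(n * n, 1.0f), b(n * n, 1.0f), c(n * n, 0.0f);
  for (auto _ : state) {
    std::fill(c.begin(), c.end(), 0.0f);  // negligible next to O(n^3) work
    for (std::size_t i = 0; i < n; ++i)
      for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
          c[i * n + j] += a[i * n + k] * b[k * n + j];
    benchmark::DoNotOptimize(c.data());
  }
  // 2*n^3 flops per iteration, reported as a rate (flops/s).
  state.counters["GFLOPS"] = benchmark::Counter(
      2.0 * n * n * n * state.iterations(), benchmark::Counter::kIsRate);
}
BENCHMARK(BM_GemmNaive)->Arg(128);
BENCHMARK_MAIN();
```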

Sample output from my dev box:
```
Run on (48 X 2501 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 256K (x24)
  L3 Unified 30720K (x2)
--------------------------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------
Gemm/Torch/128/128/128                    73405 ns      73403 ns       8614 GFLOPS=57.1411G/s
Gemm/TensorExprNoopt/128/128/128        3073003 ns    3072808 ns        229 GFLOPS=1.36497G/s
```

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D24142403

Pulled By: bertmaher

fbshipit-source-id: 3354aaa56868a43a553acd1ad9a192f28d8e3597
2020-10-06 16:57:27 -07:00