pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Bert Maher	36ae3feb22	[te] Benchmark comparing fused overhead to unfused (#50305) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50305 That's it ghstack-source-id: 119631533 Test Plan: ``` buck run //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -- --benchmark_filter=Overhead ``` ``` Run on (24 X 2394.67 MHz CPU s) 2021-01-08 16:06:17 ------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------- FusedOverhead 2157 ns 2157 ns 311314 UnfusedOverhead 2443 ns 2443 ns 311221 ``` Reviewed By: ZolotukhinM Differential Revision: D25856891 fbshipit-source-id: 0e99515ec2e769a04929157d46903759c03182a3	2021-01-13 12:09:37 -08:00
Bram Wasti	1047957831	[te][reapply] Add fast log approximation based on sleef (#49575) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49575 This is a fast log implementations benchmark: ``` buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none' ``` Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat Reviewed By: bertmaher Differential Revision: D25627157 fbshipit-source-id: a4920f4f4005ce617d372b375e790ca966275cd9	2020-12-17 17:02:00 -08:00
Edward Yang	ea4ccc730e	Revert D25445815: [te] Add fast log approximation based on sleef Test Plan: revert-hammer Differential Revision: D25445815 (`1329066b69`) Original commit changeset: 20696eacd12a fbshipit-source-id: 38830a6abd16260d60e5dd9a5594e65736a9c782	2020-12-17 15:03:17 -08:00
Bram Wasti	1329066b69	[te] Add fast log approximation based on sleef Summary: This is a fast log implementations benchmark: ``` buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none' ``` Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat Reviewed By: bertmaher Differential Revision: D25445815 fbshipit-source-id: 20696eacd12a55e797f606f4a6dbbd94c9652888	2020-12-17 14:28:34 -08:00
Bert Maher	464d23e6b4	[te][benchmark] Add more optimized versions of gemm (#48159) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48159 Test Plan: Imported from OSS Reviewed By: Chillee, ngimel Differential Revision: D25059742 Pulled By: bertmaher fbshipit-source-id: f197347f739c5bd2a4182c59ebf4642000c3dd55	2020-11-18 12:21:08 -08:00
Bert Maher	b7261de0df	[pytorch][te] Add compilation time benchmark (#46124) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46124 We want to make sure we can actually fuse kernels within a fairly tight time budget. So here's a quick benchmark of codegen for a simple pointwise activation function (swish). I kept all the intermediate tensors separate to force TE to actually do inlining. Test Plan: ``` buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench ``` I've only run in debug mode so results aren't super meaningful, but even in that mode it's 18ms for compilation, 15 of which are in llvm. Update, opt build mode: ``` ---------------------------------------------------------------------------- Benchmark Time CPU Iterations ---------------------------------------------------------------------------- BM_CompileSwish 5123276 ns 5119846 ns 148 BM_CompileSwishLLVMOnly 4754361 ns 4753701 ns 160 ``` Reviewed By: asuhan Differential Revision: D24232801 fbshipit-source-id: d58a8b7f79bcd9244c49366af7a693e09f24bf76	2020-10-09 23:11:37 -07:00
Bert Maher	f2e569461b	[te] Tiled (m=32 x n=32) gemm benchmark (#45905) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45905 Test Plan: Imported from OSS Reviewed By: SplitInfinity Differential Revision: D24142402 Pulled By: bertmaher fbshipit-source-id: b39e18b6985ee1c1f654fba4498ed91ff14d8d5f	2020-10-06 16:57:31 -07:00
Bert Maher	50f89578dd	[te] Add a benchmark harness (#45875) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45875 Adds a googlebenchmark harness for perf testing programs generated by tensorexpr, sans any pytorch wrappings (for python-level benchmarks of tensorexpr, see benchmarks/tensorexpr). Currently there's a harness for gemm that sets up the problem using torch (and also measures the perf of a torch::mm to give a baseline). Right now there's just an unoptimized implementation that is expected to be not very fast. More optimized versions are coming. Sample output from my dev box: ``` Run on (48 X 2501 MHz CPU s) CPU Caches: L1 Data 32K (x24) L1 Instruction 32K (x24) L2 Unified 256K (x24) L3 Unified 30720K (x2) -------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... -------------------------------------------------------------------------------------------- Gemm/Torch/128/128/128 73405 ns 73403 ns 8614 GFLOPS=57.1411G/s Gemm/TensorExprNoopt/128/128/128 3073003 ns 3072808 ns 229 GFLOPS=1.36497G/s ``` Test Plan: Imported from OSS Reviewed By: SplitInfinity Differential Revision: D24142403 Pulled By: bertmaher fbshipit-source-id: 3354aaa56868a43a553acd1ad9a192f28d8e3597	2020-10-06 16:57:27 -07:00

8 Commits
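
The GFLOPS counters in the sample output of 50f89578dd can be roughly reproduced from the reported CPU times. This is only a sketch: it assumes the usual 2*M*N*K flop count for an (M x K) by (K x N) matrix multiply, which the commit message does not state, and the function name `gemm_gflops` is made up for illustration. It is not taken from the harness itself.

```python
def gemm_gflops(m: int, n: int, k: int, time_ns: float) -> float:
    """Estimate GFLOP/s for an (m x k) @ (k x n) matmul.

    Assumes 2*m*n*k floating-point operations (one multiply and one add
    per inner-product term); this count is an assumption, not something
    confirmed by the commit message.
    """
    flops = 2 * m * n * k
    # flops per nanosecond is numerically equal to GFLOP/s
    return flops / time_ns


# CPU times (in ns) copied from the sample output in commit 50f89578dd
torch_gflops = gemm_gflops(128, 128, 128, 73403)
noopt_gflops = gemm_gflops(128, 128, 128, 3072808)

# Ratio of the unoptimized TensorExpr time to the torch::mm baseline time
slowdown = 3072808 / 73403

print(f"torch::mm baseline: {torch_gflops:.2f} GFLOP/s")
print(f"TensorExpr no-opt:  {noopt_gflops:.2f} GFLOP/s")
print(f"slowdown factor:    {slowdown:.1f}x")
```

Under that assumption the results land close to the reported `GFLOPS=57.1411G/s` and `GFLOPS=1.36497G/s`; the small gap on the baseline figure is consistent with the times being printed rounded to whole nanoseconds.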