pytorch

OSSForks/pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Commit Graph

Author

SHA1

Message

Date

Bert Maher

b7261de0df

[pytorch][te] Add compilation time benchmark (#46124)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46124

We want to make sure we can actually fuse kernels within a fairly
tight time budget.  So here's a quick benchmark of codegen for a simple
pointwise activation function (swish).  I kept all the intermediate tensors
separate to force TE to actually do inlining.
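
To illustrate what "keeping the intermediate tensors separate" means, here is a minimal pure-Python sketch. It assumes the textbook definition swish(x) = x * sigmoid(x); the benchmark itself may define the activation differently, and the function names below are hypothetical, not part of TE. The first version materializes every intermediate, the second is the single fused expression that inlining should produce.

```python
import math


def swish_unfused(xs):
    """Compute swish with every intermediate stored as its own list."""
    neg = [-x for x in xs]                    # intermediate 1: -x
    exp_neg = [math.exp(v) for v in neg]      # intermediate 2: exp(-x)
    denom = [1.0 + v for v in exp_neg]        # intermediate 3: 1 + exp(-x)
    sigmoid = [1.0 / v for v in denom]        # intermediate 4: sigmoid(x)
    return [x * s for x, s in zip(xs, sigmoid)]


def swish_inlined(xs):
    """Same computation after inlining: one loop, no intermediate lists."""
    return [x * (1.0 / (1.0 + math.exp(-x))) for x in xs]


xs = [-2.0, -0.5, 0.0, 0.5, 2.0]
unfused = swish_unfused(xs)
inlined = swish_inlined(xs)
```

Both versions perform the same arithmetic per element; the difference is only whether the intermediates are stored, which is what the compiler has to work out when it inlines.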

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
```

I've only run in debug mode so results aren't super meaningful, but even in
that mode it's 18 ms for compilation, 15 ms of which are spent in LLVM.

Update, opt build mode:
```
----------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations
----------------------------------------------------------------------------
BM_CompileSwish                         5123276 ns    5119846 ns        148
BM_CompileSwishLLVMOnly                 4754361 ns    4753701 ns        160
```
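
As a rough reading of the numbers above, the snippet below estimates how much of the compile time is spent in LLVM. It assumes that `BM_CompileSwishLLVMOnly` measures only the LLVM portion of the same pipeline (the name suggests this, but the commit message does not say so), and because the two benchmarks are separate runs, the difference is an estimate rather than a measurement.

```python
# Wall-clock times copied from the opt-mode benchmark output (nanoseconds).
compile_total_ns = 5123276   # BM_CompileSwish
llvm_only_ns = 4754361       # BM_CompileSwishLLVMOnly

# Fraction of compile time attributed to LLVM in opt mode.
llvm_share_opt = llvm_only_ns / compile_total_ns

# Estimated time spent outside LLVM, in milliseconds.
non_llvm_ms = (compile_total_ns - llvm_only_ns) / 1e6

# Debug-mode figures quoted in the commit message: 15 ms of 18 ms in LLVM.
llvm_share_debug = 15 / 18

print(f"opt:   LLVM share {llvm_share_opt:.1%}, non-LLVM {non_llvm_ms:.2f} ms")
print(f"debug: LLVM share {llvm_share_debug:.1%}")
```

Under these assumptions, LLVM accounts for the large majority of compile time in both build modes.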

Reviewed By: asuhan

Differential Revision: D24232801

fbshipit-source-id: d58a8b7f79bcd9244c49366af7a693e09f24bf76

2020-10-09 23:11:37 -07:00

Bert Maher

f2e569461b

[te] Tiled (m=32 x n=32) gemm benchmark (#45905)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45905
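
For readers unfamiliar with tiling, the following is a minimal pure-Python sketch of a gemm whose `m` and `n` loops are split into tiles. It is not the TE loop-nest code from this commit; the function names and the plain-list matrix representation are assumptions made for illustration only.

```python
def matmul_naive(a, b):
    """Reference gemm: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def matmul_tiled(a, b, tile_m=32, tile_n=32):
    """Gemm with the m and n loops split into tile_m x tile_n blocks."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile_m):            # outer loop over row tiles
        for j0 in range(0, n, tile_n):        # outer loop over column tiles
            # Inner loops cover one tile; min() handles ragged edges.
            for i in range(i0, min(i0 + tile_m, m)):
                for j in range(j0, min(j0 + tile_n, n)):
                    acc = 0.0
                    for p in range(k):
                        acc += a[i][p] * b[p][j]
                    c[i][j] = acc
    return c


# Small check with shapes that do not divide evenly by the tile size.
a = [[float(i + p) for p in range(3)] for i in range(5)]
b = [[float(p * j + 1) for j in range(7)] for p in range(3)]
tiled = matmul_tiled(a, b, tile_m=2, tile_n=4)
```

Tiling changes only the order in which output elements are visited, so each output tile reuses a limited set of rows of A and columns of B; the result is the same as the untiled loop nest.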

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D24142402

Pulled By: bertmaher

fbshipit-source-id: b39e18b6985ee1c1f654fba4498ed91ff14d8d5f

2020-10-06 16:57:31 -07:00

Bert Maher

50f89578dd

[te] Add a benchmark harness (#45875)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45875

Adds a googlebenchmark harness for perf testing programs generated by
tensorexpr, sans any pytorch wrappings (for python-level benchmarks of
tensorexpr, see benchmarks/tensorexpr).

Currently there's a harness for gemm that sets up the problem using torch (and
also measures the perf of a torch::mm to give a baseline).

Right now there's just an unoptimized implementation that is not expected to be
very fast.  More optimized versions are coming.

Sample output from my dev box:
```
Run on (48 X 2501 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 256K (x24)
  L3 Unified 30720K (x2)
--------------------------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------
Gemm/Torch/128/128/128                    73405 ns      73403 ns       8614 GFLOPS=57.1411G/s
Gemm/TensorExprNoopt/128/128/128        3073003 ns    3072808 ns        229 GFLOPS=1.36497G/s
```
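
The GFLOPS counters in this output can be reproduced with a short calculation. The sketch below assumes the usual gemm operation count of 2 * M * N * K (one multiply and one add per inner-product term) divided by the reported CPU time; the commit message does not state the formula, but the results agree with the reported counters to within rounding.

```python
def gemm_gflops(m, n, k, time_ns):
    """GFLOP/s for an m x k by k x n gemm that took time_ns nanoseconds."""
    flops = 2 * m * n * k          # one multiply + one add per term
    return flops / time_ns         # flops per nanosecond == GFLOP/s


# CPU times copied from the sample output (nanoseconds).
torch_cpu_ns = 73403       # Gemm/Torch/128/128/128
te_noopt_cpu_ns = 3072808  # Gemm/TensorExprNoopt/128/128/128

torch_gflops = gemm_gflops(128, 128, 128, torch_cpu_ns)
te_gflops = gemm_gflops(128, 128, 128, te_noopt_cpu_ns)

# How many times slower the unoptimized TE kernel is than torch::mm.
slowdown = te_noopt_cpu_ns / torch_cpu_ns

print(f"torch: {torch_gflops:.4f} GFLOP/s")
print(f"TE (no opt): {te_gflops:.5f} GFLOP/s")
print(f"slowdown: {slowdown:.1f}x")
```

On these numbers the unoptimized TE kernel is roughly 40 times slower than the `torch::mm` baseline, which is the gap the later optimized versions are meant to close.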

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D24142403

Pulled By: bertmaher

fbshipit-source-id: 3354aaa56868a43a553acd1ad9a192f28d8e3597

2020-10-06 16:57:27 -07:00

3 Commits