Summary:
This is a fast log implementations
benchmark:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```
Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat
Reviewed By: bertmaher
Differential Revision: D25445815
fbshipit-source-id: 20696eacd12a55e797f606f4a6dbbd94c9652888
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46124
We want to make sure we can actually fuse kernels within a fairly
tight time budget. So here's a quick benchmark of codegen for a simple
pointwise activation function (swish). I kept all the intermediate tensors
separate to force TE to actually do inlining.
Test Plan:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
```
I've only run in debug mode so results aren't super meaningful, but even in
that mode it's 18ms for compilation, 15 of which are in llvm.
Update, opt build mode:
```
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
BM_CompileSwish 5123276 ns 5119846 ns 148
BM_CompileSwishLLVMOnly 4754361 ns 4753701 ns 160
```
Reviewed By: asuhan
Differential Revision: D24232801
fbshipit-source-id: d58a8b7f79bcd9244c49366af7a693e09f24bf76
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45875
Adds a googlebenchmark harness for perf testing programs generated by
tensorexpr, sans any pytorch wrappings (for python-level benchmarks of
tensorexpr, see benchmarks/tensorexpr).
Currently there's a harness for gemm that sets up the problem using torch (and
also measures the perf of a torch::mm to give a baseline).
Right now there's just an unoptimized implementation that is expected to be not
very fast. More optimized versions are coming.
Sample output from my dev box:
```
Run on (48 X 2501 MHz CPU s)
CPU Caches:
L1 Data 32K (x24)
L1 Instruction 32K (x24)
L2 Unified 256K (x24)
L3 Unified 30720K (x2)
--------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------
Gemm/Torch/128/128/128 73405 ns 73403 ns 8614 GFLOPS=57.1411G/s
Gemm/TensorExprNoopt/128/128/128 3073003 ns 3072808 ns 229 GFLOPS=1.36497G/s
```
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D24142403
Pulled By: bertmaher
fbshipit-source-id: 3354aaa56868a43a553acd1ad9a192f28d8e3597