Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50995
This change makes 'Tensor' a thin wrapper over 'Buf' and 'Stmt', and
merges it with the recently introduced 'CompoundTensor'. A statement for
the tensor is either passed directly to the Tensor constructor (akin to
'CompoundTensor') or is built immediately in the constructor.
LoopNest is no longer responsible for constructing statements from
tensors - it simply stitches together the already-constructed statements
contained in the Tensors. A side effect is that we can no longer build
several loop nests from the same tensors - we need to explicitly clone
statements to do that. A special copy constructor was added to LoopNest
to make this more convenient (note: this only affects tests; we don't
usually create multiple loop nests elsewhere).
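To illustrate the new flow, here is a minimal sketch assuming the NNC C++ API of roughly that time (Placeholder/Compute/LoopNest); names and spellings are approximate and not taken from this PR:
```
#include <torch/csrc/jit/tensorexpr/loopnest.h>
#include <torch/csrc/jit/tensorexpr/tensor.h>

using namespace torch::jit::tensorexpr;

void example() {
  KernelScope kernel_scope;
  Placeholder a("a", kFloat, {64});
  // The Tensor carries its statement; it is built right here by Compute.
  Tensor* b = Compute("b", {{64, "i"}}, [&](const VarHandle& i) {
    return a.load(i) * 2.0f;
  });

  LoopNest nest1({b});    // stitches the Tensor's existing statement
  LoopNest nest2(nest1);  // the new copy constructor clones statements, so a
                          // second loop nest over the same tensors is possible
}
```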
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D26038223
Pulled By: ZolotukhinM
fbshipit-source-id: 27a2e5900437cfb0c151e8f89815edec53608e17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50193
* Supports aten, native reference, and NNC TE implementations.
* Supports correctness checks against aten, in addition to performance checks.
Test plan:
* After enabling "BUILD_TENSOREXPR_BENCHMARK" in CMakeLists.txt, run:
* bin/tensorexpr_bench --benchmark_filter=Reduce1D
Measurements:
On a Broadwell E5-2686 CPU:
```
Benchmark                               Time        CPU   Iterations
Reduce1D/Torch/16777216              5638547 ns   5638444 ns   119   BYTES=11.902G/s
Reduce1D/Naive/16777216             19308235 ns  19308184 ns    36   BYTES=3.47567G/s
Reduce1D/NativeRfactor/16777216      8433348 ns   8433038 ns    85   BYTES=7.95785G/s
Reduce1D/NativeVector/16777216       5608836 ns   5608727 ns   124   BYTES=11.9651G/s
Reduce1D/NativeTiled/16777216        5550233 ns   5550221 ns   126   BYTES=12.0912G/s
Reduce1D/TeNaive/16777216           21451047 ns  21450752 ns    33   BYTES=3.12851G/s
Reduce1D/TeSplitTail/16777216       23701732 ns  23701229 ns    30   BYTES=2.83145G/s
Reduce1D/TeSplitMask/16777216       23683589 ns  23682978 ns    30   BYTES=2.83363G/s
Reduce1D/TeRfactorV2/16777216        5378019 ns   5377909 ns   131   BYTES=12.4786G/s
```
Result summary:
* The single-threaded NNC TeRfactorV2 variant matches or exceeds both the Aten and the AVX2 naive counterparts.
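For context, the TE variants above are presumably built from a 1-D sum-reduction kernel along the following lines and differ only in the scheduling applied before lowering; this is a hedged sketch assuming the NNC C++ API of that era, not the benchmark's actual code:
```
#include <torch/csrc/jit/tensorexpr/ir_simplifier.h>
#include <torch/csrc/jit/tensorexpr/llvm_codegen.h>
#include <torch/csrc/jit/tensorexpr/loopnest.h>
#include <torch/csrc/jit/tensorexpr/reduction.h>
#include <torch/csrc/jit/tensorexpr/tensor.h>

using namespace torch::jit::tensorexpr;

void reduce1d_naive() {
  KernelScope kernel_scope;
  constexpr int N = 16777216;  // element count used in the runs above
  Placeholder a("a", kFloat, {N});
  // Sum-reduce the whole 1-D buffer into a scalar output.
  Tensor* total = Reduce("sum", {}, Sum(), a, {{N, "i"}});

  LoopNest nest({total});
  // The faster variants (TeSplitTail, TeSplitMask, TeRfactorV2) would apply
  // split/rfactor/vectorize transformations on this loop nest at this point.
  nest.prepareForCodegen();
  Stmt* s = IRSimplifier::simplify(nest.root_stmt());
  LLVMCodeGen cg(s, {a, total});
}
```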
Follow-up items:
* rfactor does not work well with split
* We don't have a multi-threaded implementation yet.
* Missing "parallel" scheduling primitive, which is not different from what we need for pointwise ops.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D25821880
Pulled By: zheng-xq
fbshipit-source-id: 8df3f40d1eed8749c8edcaacae5f0544dbf6bed3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50543
Original commit changeset: 2d2f07f79986
This was part of a stack that got reverted; the change itself is just a benchmark.
ghstack-source-id: 119825594
Test Plan: CI
Reviewed By: navahgar
Differential Revision: D25912439
fbshipit-source-id: 5d9ca45810fff8931a3cfbd03965e11050180676
Summary:
This is a fast log implementation.
Benchmark:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```
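The commit doesn't spell out the algorithm; purely as an illustration of the bit-manipulation style such fast log approximations typically use (not this PR's actual implementation), a minimal sketch:
```
#include <cstdint>
#include <cstring>

// Illustrative only: a classic bit-trick log approximation.
// Reinterpreting a positive float's bits as an integer yields roughly
// (log2(x) + 127) * 2^23, so a scale and a bias give an approximate log.
inline float fast_log_approx(float x) {
  std::uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  float log2x = static_cast<float>(bits) * 1.1920928955078125e-7f  // 1 / 2^23
                - 126.94269504f;                                   // exponent bias + correction
  return log2x * 0.69314718f;  // ln(2) converts log2 to natural log
}
```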
Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat
Reviewed By: bertmaher
Differential Revision: D25445815
fbshipit-source-id: 20696eacd12a55e797f606f4a6dbbd94c9652888
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46124
We want to make sure we can actually fuse kernels within a fairly
tight time budget. So here's a quick benchmark of codegen for a simple
pointwise activation function (swish). I kept all the intermediate tensors
separate to force TE to actually do inlining.
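For reference, per element the swish computation decomposes roughly as follows; each stage corresponds to a separate TE tensor that inlining has to fold away (illustrative scalar form; the benchmark's exact intermediates may differ):
```
#include <cmath>

// Illustrative scalar decomposition of swish(x) = x * sigmoid(x); the benchmark
// keeps stages like these as separate TE tensors, so the compiler has to inline
// them back into one expression.
float swish_decomposed(float x) {
  float neg   = -x;             // stage 1
  float e     = std::exp(neg);  // stage 2
  float denom = 1.0f + e;       // stage 3
  float sig   = 1.0f / denom;   // stage 4: sigmoid(x)
  return x * sig;               // stage 5: swish(x)
}
```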
Test Plan:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
```
I've only run in debug mode so results aren't super meaningful, but even in
that mode it's 18 ms for compilation, 15 ms of which are in LLVM.
Update, with an opt build:
```
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
BM_CompileSwish 5123276 ns 5119846 ns 148
BM_CompileSwishLLVMOnly 4754361 ns 4753701 ns 160
```
Reviewed By: asuhan
Differential Revision: D24232801
fbshipit-source-id: d58a8b7f79bcd9244c49366af7a693e09f24bf76
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45875
Adds a googlebenchmark harness for perf-testing programs generated by
tensorexpr, sans any PyTorch wrappings (for Python-level benchmarks of
tensorexpr, see benchmarks/tensorexpr).
Currently there's a harness for gemm that sets up the problem using torch (and
also measures the perf of torch::mm to give a baseline).
Right now there's just an unoptimized implementation that is not expected to be
very fast. More optimized versions are coming.
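A hypothetical sketch of what the torch::mm baseline in such a harness can look like (function name and registration are illustrative, not the file's actual code):
```
#include <benchmark/benchmark.h>
#include <torch/torch.h>

// Illustrative Google Benchmark baseline around torch::mm; the real harness's
// fixture and registration may differ.
static void Gemm_Torch(benchmark::State& state) {
  const int64_t M = state.range(0), N = state.range(1), K = state.range(2);
  auto A = torch::randn({M, K});
  auto B = torch::randn({K, N});
  for (auto _ : state) {
    auto C = torch::mm(A, B);
    benchmark::DoNotOptimize(C);
  }
  // 2*M*N*K flops per gemm, reported as a rate so it shows up as GFLOPS.
  state.counters["GFLOPS"] = benchmark::Counter(
      2.0 * M * N * K * state.iterations(), benchmark::Counter::kIsRate);
}
BENCHMARK(Gemm_Torch)->Args({128, 128, 128});
BENCHMARK_MAIN();
```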
Sample output from my dev box:
```
Run on (48 X 2501 MHz CPU s)
CPU Caches:
L1 Data 32K (x24)
L1 Instruction 32K (x24)
L2 Unified 256K (x24)
L3 Unified 30720K (x2)
--------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------
Gemm/Torch/128/128/128 73405 ns 73403 ns 8614 GFLOPS=57.1411G/s
Gemm/TensorExprNoopt/128/128/128 3073003 ns 3072808 ns 229 GFLOPS=1.36497G/s
```
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D24142403
Pulled By: bertmaher
fbshipit-source-id: 3354aaa56868a43a553acd1ad9a192f28d8e3597