Commit Graph

16 Commits

Author SHA1 Message Date
Bert Maher
a23e82df10 [nnc] Tweak log_nnc_sleef so vectorization kicks in (#51491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51491

The vectorizer heuristic is pretty dumb and only kicks in if the
unroll factor is exactly 8 or 4.

It's still slower than the direct implementation, which isn't surprising.
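
To illustrate the heuristic (a plain C++ sketch, not the NNC code): an unroll factor of 8 gives the loop the shape below, where eight independent lanes per trip map cleanly onto one vector register.

```
#include <cmath>
#include <cstddef>

// Illustrative only: a loop written with an unroll factor of 8, the
// shape the vectorizer heuristic recognizes. std::log stands in for
// the sleef-based fast log in the diff.
void log8(const float* in, float* out, std::size_t n) {
  std::size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    // Eight independent lanes per trip map onto one 256-bit vector.
    for (std::size_t lane = 0; lane < 8; ++lane) {
      out[i + lane] = std::log(in[i + lane]);
    }
  }
  for (; i < n; ++i) {  // scalar tail for the remainder
    out[i] = std::log(in[i]);
  }
}
```
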
ghstack-source-id: 120783426

Test Plan:
`buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench`

Before:
```
---------------------------------------------------------------------------
Benchmark                    Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64           438 ns        438 ns    1795511 log/s=146.259M/s
log_nnc_sleef/512         3196 ns       3195 ns     210032 log/s=160.235M/s
log_nnc_sleef/8192       77467 ns      77466 ns       8859 log/s=105.749M/s
log_nnc_sleef/32768     310206 ns     310202 ns       2170 log/s=105.634M/s
log_nnc_fast/64            100 ns        100 ns    7281074 log/s=637.144M/s
log_nnc_fast/512           546 ns        546 ns    1335816 log/s=938.361M/s
log_nnc_fast/8192         7360 ns       7359 ns      91971 log/s=1.11316G/s
log_nnc_fast/32768       30793 ns      30792 ns      22633 log/s=1064.17M/s
log_aten/64                427 ns        427 ns    1634897 log/s=150.021M/s
log_aten/512               796 ns        796 ns     877318 log/s=643.566M/s
log_aten/8192             6690 ns       6690 ns     102649 log/s=1.22452G/s
log_aten/32768           25357 ns      25350 ns      27808 log/s=1.29263G/s
```

After:
```
---------------------------------------------------------------------------
Benchmark                    Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64           189 ns        188 ns    3872475 log/s=340.585M/s
log_nnc_sleef/512         1307 ns       1307 ns     557770 log/s=391.709M/s
log_nnc_sleef/8192       20259 ns      20257 ns      34240 log/s=404.404M/s
log_nnc_sleef/32768      81556 ns      81470 ns       8767 log/s=402.209M/s
log_nnc_fast/64            110 ns        110 ns    6564558 log/s=581.116M/s
log_nnc_fast/512           554 ns        554 ns    1279304 log/s=923.376M/s
log_nnc_fast/8192         7774 ns       7774 ns      91421 log/s=1053.75M/s
log_nnc_fast/32768       31008 ns      31006 ns      21279 log/s=1056.83M/s
```

Reviewed By: bwasti

Differential Revision: D26139067

fbshipit-source-id: db31897ee9922695ff9dff4ff46e3d3fbd61f4c2
2021-02-01 16:35:37 -08:00
Mikhail Zolotukhin
e975169426 [TensorExpr] Redesign Tensor class. (#50995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50995

This change makes 'Tensor' a thin wrapper over 'Buf' and 'Stmt', and
merges it with the recently introduced 'CompoundTensor'. A statement for
the tensor is either passed directly to the Tensor constructor (as with
'CompoundTensor') or built immediately in the constructor.

LoopNest is no longer responsible for constructing statements from
tensors - it simply stitches together the already-constructed statements
contained in Tensors. A side effect is that we can no longer construct
several loopnests from the same tensors - we need to explicitly clone
the statements if we want to do that. A special copy constructor was
added to LoopNest to make this more convenient (note: this only affects
tests; we don't usually create multiple loopnests elsewhere).
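
A minimal conceptual sketch of the new shape of the class (the types and member names here are illustrative, not the actual tensorexpr declarations):

```
// Conceptual sketch of the redesign; Buf/Stmt stand in for the real
// tensorexpr IR nodes and the member names are illustrative.
struct Buf;   // describes the storage (dims, dtype)
struct Stmt;  // the statement that computes the buffer

class Tensor {
 public:
  // The statement is either passed in ready-made (the old
  // CompoundTensor path) or built right here in the constructor.
  Tensor(Buf* buf, Stmt* stmt) : buf_(buf), stmt_(stmt) {}

  Buf* buf() const { return buf_; }
  Stmt* stmt() const { return stmt_; }

 private:
  Buf* buf_;
  Stmt* stmt_;  // LoopNest now just stitches these together
};
```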

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26038223

Pulled By: ZolotukhinM

fbshipit-source-id: 27a2e5900437cfb0c151e8f89815edec53608e17
2021-01-27 16:14:22 -08:00
Nikita Shulga
97ea95ddd7 Delete tabs from bench_approx.cpp (#51157)
Summary:
Introduced by D25981260 (f08464f31d)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51157

Reviewed By: bwasti

Differential Revision: D26090008

Pulled By: malfet

fbshipit-source-id: b63f1bb1683c7261902de7eaab24a05a5159ce7e
2021-01-26 15:53:47 -08:00
Bert Maher
c4029444d1 [nnc] Per-operator benchmarks (#51093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51093

Operator-level benchmarks comparing eager-mode PyTorch to
NNC-generated fused kernels.  We wouldn't normally see these ops in isolation,
but the comparison points out where NNC is falling short (or doing well).

I threw in a composed hardswish for fun, because it's my favorite activation
function.

Notably, it exposes a bug in our build process that's preventing vectorization
from using `sleef`, so we're using scalar calls to libm with predictably lousy
performance.  Fix incoming.

This benchmark is similar to the pure NNC approach in `microbenchmarks.py`, but
will include the overhead of dispatching the fused kernel through TorchScript.
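
For reference, hardswish composes a handful of elementwise ops: hardswish(x) = x * min(max(x + 3, 0), 6) / 6. A plain C++ sketch of the chain a fuser gets to collapse (the function name is illustrative):

```
#include <algorithm>
#include <cstddef>

// Illustrative sketch: hardswish(x) = x * min(max(x + 3, 0), 6) / 6,
// written as separate elementwise ops so a fuser has something to fuse.
void hardswish(const float* x, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    float t = x[i] + 3.0f;                  // add
    t = std::min(std::max(t, 0.0f), 6.0f);  // clamp (relu6)
    out[i] = x[i] * t * (1.0f / 6.0f);      // mul, scale
  }
}
```
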
ghstack-source-id: 120403675

Test Plan:
```
op                        eager        nnc    speedup
hardswish                 0.187      0.051       3.70
hardswish                 0.052      0.052       1.00
sigmoid                   0.148      1.177       0.13
reciprocal                0.049      0.050       0.98
neg                       0.038      0.037       1.02
relu                      0.037      0.036       1.03
isnan                     0.119      0.020       5.86
log                       0.082      1.330       0.06
log10                     0.148      1.848       0.08
log1p                     0.204      1.413       0.14
log2                      0.285      1.167       0.24
exp                       0.063      1.123       0.06
expm1                     0.402      1.417       0.28
erf                       0.167      0.852       0.20
erfc                      0.181      1.098       0.16
cos                       0.124      0.793       0.16
sin                       0.126      0.838       0.15
tan                       0.285      1.777       0.16
acos                      0.144      1.358       0.11
asin                      0.126      1.193       0.11
cosh                      0.384      1.761       0.22
sinh                      0.390      2.279       0.17
atan                      0.240      1.564       0.15
tanh                      0.320      2.259       0.14
sqrt                      0.043      0.069       0.63
rsqrt                     0.118      0.117       1.01
abs                       0.038      0.037       1.03
ceil                      0.038      0.038       1.01
floor                     0.039      0.039       1.00
round                     0.039      0.292       0.13
trunc                     0.040      0.036       1.12
lgamma                    2.045      2.721       0.75
```

Reviewed By: zheng-xq

Differential Revision: D26069791

fbshipit-source-id: 236e7287ba1b3f67fdcb938949a92bbbdfa13dba
2021-01-26 14:10:08 -08:00
Bram Wasti
f08464f31d [nnc] Add benchmarks
Summary: Adding a set of benchmarks for key operators

Test Plan:
buck build mode/opt -c 'fbcode.caffe2_gpu_type=none' caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 numactl -C 3 ./buck-out/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench

Reviewed By: ZolotukhinM

Differential Revision: D25981260

fbshipit-source-id: 17681fc1527f43ccf9bcc80704415653a627b396
2021-01-26 13:51:33 -08:00
Xiaoqiang Zheng
b96a6516a6 Add CPP Full Reduction Benchmarks. (#50193)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50193

* Supports aten, a native reference implementation, and NNC TE implementations.
* Supports functionality checks against aten, in addition to performance checks.

Test plans:

* Enable "BUILD_TENSOREXPR_BENCHMARK" in CMakeLists.txt.
* Run bin/tensorexpr_bench --benchmark_filter=Reduce1D

Measurements:

On a Broadwell E5-2686 CPU:

```
Reduce1D/Torch/16777216            5638547 ns    5638444 ns        119 BYTES=11.902G/s
Reduce1D/Naive/16777216           19308235 ns   19308184 ns         36 BYTES=3.47567G/s
Reduce1D/NativeRfactor/16777216    8433348 ns    8433038 ns         85 BYTES=7.95785G/s
Reduce1D/NativeVector/16777216     5608836 ns    5608727 ns        124 BYTES=11.9651G/s
Reduce1D/NativeTiled/16777216      5550233 ns    5550221 ns        126 BYTES=12.0912G/s
Reduce1D/TeNaive/16777216         21451047 ns   21450752 ns         33 BYTES=3.12851G/s
Reduce1D/TeSplitTail/16777216     23701732 ns   23701229 ns         30 BYTES=2.83145G/s
Reduce1D/TeSplitMask/16777216     23683589 ns   23682978 ns         30 BYTES=2.83363G/s
Reduce1D/TeRfactorV2/16777216      5378019 ns    5377909 ns        131 BYTES=12.4786G/s
```

Result summary:

* The single-threaded performance of NNC TeRfactorV2 matches or exceeds the aten and AVX2 native counterparts (the rfactor idea is sketched after the follow-up items below).

Follow-up items:

* rfactor does not work well with split.
* We don't have a multi-threaded implementation yet.
  * The missing piece is a "parallel" scheduling primitive, which is no different from what we need for pointwise ops.
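
To make the TeRfactorV2 result concrete, here is a plain C++ sketch of the rfactor technique (not the actual TE schedule): the serial dependence of a 1-D sum is broken by keeping independent per-lane partial sums, which vectorize, and combining them at the end.

```
#include <cstddef>

// Sketch of the rfactor idea behind TeRfactorV2: break the serial
// dependence of a 1-D sum by keeping 8 independent partial sums
// (one per vector lane), then reduce them at the end.
float reduce1d_rfactor(const float* in, std::size_t n) {
  float partial[8] = {};              // independent accumulators
  std::size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    for (std::size_t lane = 0; lane < 8; ++lane) {
      partial[lane] += in[i + lane];  // vectorizable: no cross-lane dep
    }
  }
  float sum = 0.0f;
  for (std::size_t lane = 0; lane < 8; ++lane) {
    sum += partial[lane];             // combine the partial sums
  }
  for (; i < n; ++i) {                // scalar tail
    sum += in[i];
  }
  return sum;
}
```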

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D25821880

Pulled By: zheng-xq

fbshipit-source-id: 8df3f40d1eed8749c8edcaacae5f0544dbf6bed3
2021-01-21 10:00:50 -08:00
Bert Maher
468c99fba4 Reapply D25856891: [te] Benchmark comparing fused overhead to unfused (#50543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50543

Original commit changeset: 2d2f07f79986

Was part of a stack that got reverted.  This is just a benchmark.
ghstack-source-id: 119825594

Test Plan: CI

Reviewed By: navahgar

Differential Revision: D25912439

fbshipit-source-id: 5d9ca45810fff8931a3cfbd03965e11050180676
2021-01-14 14:17:45 -08:00
Mike Ruberry
4ee631cdf0 Revert D25856891: [te] Benchmark comparing fused overhead to unfused
Test Plan: revert-hammer

Differential Revision:
D25856891 (36ae3feb22)

Original commit changeset: 0e99515ec2e7

fbshipit-source-id: 2d2f07f79986ca7815b9eae63e734db76bdfc0c8
2021-01-14 04:33:35 -08:00
Bert Maher
36ae3feb22 [te] Benchmark comparing fused overhead to unfused (#50305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50305

That's it
ghstack-source-id: 119631533

Test Plan:
```
buck run //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -- --benchmark_filter=Overhead
```
```
Run on (24 X 2394.67 MHz CPU s)
2021-01-08 16:06:17
-------------------------------------------------------
Benchmark                Time           CPU Iterations
-------------------------------------------------------
FusedOverhead         2157 ns       2157 ns     311314
UnfusedOverhead       2443 ns       2443 ns     311221
```

Reviewed By: ZolotukhinM

Differential Revision: D25856891

fbshipit-source-id: 0e99515ec2e769a04929157d46903759c03182a3
2021-01-13 12:09:37 -08:00
Bram Wasti
1047957831 [te][reapply] Add fast log approximation based on sleef (#49575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49575

This is a fast log implementation.

benchmark:

```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```
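
For intuition only - this is a generic fast-log sketch, not the sleef-derived kernel in this diff: such implementations typically split x into mantissa and exponent and evaluate a short polynomial on the mantissa, since log(m * 2^e) = log(m) + e * ln 2.

```
#include <cmath>

// Generic fast-log sketch, not the sleef-based code in this diff.
// Decompose x = m * 2^e with m in [0.5, 1), so
//   log(x) = log(m) + e * ln(2),
// and approximate log(m) with a short polynomial in (m - 1).
float fast_log_sketch(float x) {
  int e;
  float m = std::frexp(x, &e);  // x = m * 2^e, m in [0.5, 1)
  float t = m - 1.0f;           // small argument for the polynomial
  // Truncated Taylor series of log(1 + t); real kernels use a tuned
  // minimax polynomial and handle 0/negatives/NaN explicitly.
  float p = t - 0.5f * t * t + (1.0f / 3.0f) * t * t * t;
  return p + static_cast<float>(e) * 0.6931471805599453f;  // + e * ln(2)
}
```
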

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat

Reviewed By: bertmaher

Differential Revision: D25627157

fbshipit-source-id: a4920f4f4005ce617d372b375e790ca966275cd9
2020-12-17 17:02:00 -08:00
Edward Yang
ea4ccc730e Revert D25445815: [te] Add fast log approximation based on sleef
Test Plan: revert-hammer

Differential Revision:
D25445815 (1329066b69)

Original commit changeset: 20696eacd12a

fbshipit-source-id: 38830a6abd16260d60e5dd9a5594e65736a9c782
2020-12-17 15:03:17 -08:00
Bram Wasti
1329066b69 [te] Add fast log approximation based on sleef
Summary:
This is a fast log implementation.

benchmark:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat

Reviewed By: bertmaher

Differential Revision: D25445815

fbshipit-source-id: 20696eacd12a55e797f606f4a6dbbd94c9652888
2020-12-17 14:28:34 -08:00
Bert Maher
464d23e6b4 [te][benchmark] Add more optimized versions of gemm (#48159)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48159

Test Plan: Imported from OSS

Reviewed By: Chillee, ngimel

Differential Revision: D25059742

Pulled By: bertmaher

fbshipit-source-id: f197347f739c5bd2a4182c59ebf4642000c3dd55
2020-11-18 12:21:08 -08:00
Bert Maher
b7261de0df [pytorch][te] Add compilation time benchmark (#46124)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46124

We want to make sure we can actually fuse kernels within a fairly
tight time budget.  So here's a quick benchmark of codegen for a simple
pointwise activation function (swish).  I kept all the intermediate tensors
separate to force TE to actually do inlining.
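
Concretely, swish(x) = x * sigmoid(x); with the intermediates kept separate, the computation looks roughly like the plain C++ sketch below (illustrative only, not the TE program in the benchmark):

```
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the swish decomposition with separate intermediate
// tensors, so the compiler has real inlining work to do.
std::vector<float> swish(const std::vector<float>& x) {
  const std::size_t n = x.size();
  std::vector<float> neg(n), e(n), denom(n), sig(n), out(n);
  for (std::size_t i = 0; i < n; ++i) neg[i] = -x[i];            // t0 = -x
  for (std::size_t i = 0; i < n; ++i) e[i] = std::exp(neg[i]);   // t1 = exp(t0)
  for (std::size_t i = 0; i < n; ++i) denom[i] = 1.0f + e[i];    // t2 = 1 + t1
  for (std::size_t i = 0; i < n; ++i) sig[i] = 1.0f / denom[i];  // sigmoid(x)
  for (std::size_t i = 0; i < n; ++i) out[i] = x[i] * sig[i];    // x * sigmoid(x)
  return out;
}
```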

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
```

I've only run in debug mode so the results aren't super meaningful, but even in
that mode it's 18ms for compilation, 15ms of which are in LLVM.

Update, opt build mode:
```
----------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations
----------------------------------------------------------------------------
BM_CompileSwish                         5123276 ns    5119846 ns        148
BM_CompileSwishLLVMOnly                 4754361 ns    4753701 ns        160
```

Reviewed By: asuhan

Differential Revision: D24232801

fbshipit-source-id: d58a8b7f79bcd9244c49366af7a693e09f24bf76
2020-10-09 23:11:37 -07:00
Bert Maher
f2e569461b [te] Tiled (m=32 x n=32) gemm benchmark (#45905)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45905
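
For reference, the m=32 x n=32 tiling in the title refers to the classic blocking scheme sketched below in plain C++ (a generic illustration, not the TE schedule in this diff):

```
#include <algorithm>
#include <cstddef>

// Generic 32x32-tiled gemm sketch: iterate over m/n tiles so each
// C tile stays cache-resident while streaming over k.
constexpr std::size_t kTile = 32;

void gemm_tiled(const float* A, const float* B, float* C, std::size_t N) {
  for (std::size_t i0 = 0; i0 < N; i0 += kTile)
    for (std::size_t j0 = 0; j0 < N; j0 += kTile)
      for (std::size_t k = 0; k < N; ++k)
        for (std::size_t i = i0; i < std::min(i0 + kTile, N); ++i)
          for (std::size_t j = j0; j < std::min(j0 + kTile, N); ++j)
            C[i * N + j] += A[i * N + k] * B[k * N + j];
}
```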

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D24142402

Pulled By: bertmaher

fbshipit-source-id: b39e18b6985ee1c1f654fba4498ed91ff14d8d5f
2020-10-06 16:57:31 -07:00
Bert Maher
50f89578dd [te] Add a benchmark harness (#45875)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45875

Adds a googlebenchmark harness for perf testing programs generated by
tensorexpr, sans any pytorch wrappings (for python-level benchmarks of
tensorexpr, see benchmarks/tensorexpr).

Currently there's a harness for gemm that sets up the problem using torch (and
also measures the perf of a torch::mm to give a baseline).

Right now there's just an unoptimized implementation that is not expected to be
very fast.  More optimized versions are coming.
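
The harness pattern itself is plain googlebenchmark; a minimal sketch with a GFLOPS rate counter follows (the benchmark name and sizes are illustrative, not the ones in the diff):

```
#include <algorithm>
#include <cstddef>
#include <vector>
#include <benchmark/benchmark.h>

// Minimal googlebenchmark sketch of the harness pattern: a naive gemm
// with a GFLOPS rate counter.
static void BM_GemmNaive(benchmark::State& state) {
  const std::size_t n = static_cast<std::size_t>(state.range(0));
  std::vector<float> a(n * n, 1.0f), b(n * n, 1.0f), c(n * n, 0.0f);
  for (auto _ : state) {
    std::fill(c.begin(), c.end(), 0.0f);  // negligible next to O(n^3) work
    for (std::size_t i = 0; i < n; ++i)
      for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
          c[i * n + j] += a[i * n + k] * b[k * n + j];
    benchmark::DoNotOptimize(c.data());
  }
  // 2*n^3 flops per iteration, reported as a rate (flops/s).
  state.counters["GFLOPS"] = benchmark::Counter(
      2.0 * n * n * n * state.iterations(), benchmark::Counter::kIsRate);
}
BENCHMARK(BM_GemmNaive)->Arg(128);
BENCHMARK_MAIN();
```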

Sample output from my dev box:
```
Run on (48 X 2501 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 256K (x24)
  L3 Unified 30720K (x2)
--------------------------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------
Gemm/Torch/128/128/128                    73405 ns      73403 ns       8614 GFLOPS=57.1411G/s
Gemm/TensorExprNoopt/128/128/128        3073003 ns    3072808 ns        229 GFLOPS=1.36497G/s
```

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D24142403

Pulled By: bertmaher

fbshipit-source-id: 3354aaa56868a43a553acd1ad9a192f28d8e3597
2020-10-06 16:57:27 -07:00