Commit Graph

20 Commits

Raghavan Raman
d8b53598e9 [nnc] Add an API to unroll loops by a given factor (#72071)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72071
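
For illustration (not part of the commit message): unrolling a loop by a factor F replicates the loop body F times per iteration of the rewritten loop, with a remainder loop for the leftover iterations. A minimal plain-C++ sketch of the effect, with hypothetical names (the actual NNC API operates on its loop IR):

```
// Conceptual effect of "unroll by a factor" on a pointwise loop.
#include <cstddef>
#include <vector>

void scale(std::vector<float>& a, float c) {
  const std::size_t n = a.size();
  constexpr std::size_t kFactor = 4;        // the unroll factor
  std::size_t i = 0;
  for (; i + kFactor <= n; i += kFactor) {  // main loop: body replicated kFactor times
    a[i + 0] *= c;
    a[i + 1] *= c;
    a[i + 2] *= c;
    a[i + 3] *= c;
  }
  for (; i < n; ++i) {                      // remainder iterations
    a[i] *= c;
  }
}
```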

Reviewed By: ngimel

Differential Revision: D33946250

Pulled By: navahgar

fbshipit-source-id: 3f3f92054174620025a9d71154d006f1738953e2
2022-02-03 10:38:07 -08:00
Shashank Chaudhry
89c4e8c22b [NOOP][clangformat][codemod] Enable CLANGFORMAT for some folders in caffe2/* (#67746)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67746

Test Plan: Visual inspection. Sandcastle.

Reviewed By: zertosh

Differential Revision: D31986646

fbshipit-source-id: 91885c20c3cead3853c49abb9fe0a94a67f33cc8
2021-11-03 12:23:14 -07:00
Mikhail Zolotukhin
f23f21dafe [TensorExpr] Remove 'Placeholder' class. (#64887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887

BufHandle has exactly the same functionality and should be used instead.
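
A rough sketch of the replacement pattern described above; the constructor signature and header path are approximations of the tensorexpr API at the time and should be treated as assumptions, not exact code:

```
// Hypothetical sketch: declaring a 2-D float input buffer via BufHandle,
// where Placeholder would have been used before this commit. Signatures
// and the header path are approximate.
#include <torch/csrc/jit/tensorexpr/expr.h>  // header path approximate
using namespace torch::jit::tensorexpr;

ExprHandle M(64), N(32);
BufHandle A("A", {M, N}, kFloat);     // named buffer with dims and dtype
VarHandle i("i", kInt), j("j", kInt);
ExprHandle elem = A.load(i, j);       // loads go through the handle, as with Placeholder
```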

Differential Revision: D30889483

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3
2021-09-14 00:22:44 -07:00
Mikhail Zolotukhin
f0d274294d [TensorExpr] Nuke KernelArena and KernelScope. (#63587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587

Now that there are no classes using KernelArena for memory management, we
can remove it.

Differential Revision: D30429115

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
2021-08-24 00:32:16 -07:00
Mikhail Zolotukhin
62d02f2b57 [TensorExpr] Make 'Tensor' a value type. (#63586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63586

This is another step in the transition away from KernelArena memory management.
Tensor is essentially just a pair of <BufPtr, StmtPtr>, so we don't need
to dynamically allocate it at all: it's cheap to pass by value, and
that's what we're switching to in this commit.

After this change nothing uses KernelScope/KernelArena and they can be
safely removed.
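
A schematic of the value-type idea described above (illustrative only, not the actual torch::jit::tensorexpr class): the object just holds two shared pointers, so copying it is cheap and no arena or heap allocation of the Tensor itself is needed.

```
#include <memory>
#include <utility>

class Buf;    // IR node types elided for the sketch
class Stmt;
using BufPtr  = std::shared_ptr<Buf>;
using StmtPtr = std::shared_ptr<Stmt>;

// Value type: pass and return by value; copies only bump reference counts.
class Tensor {
 public:
  Tensor(BufPtr buf, StmtPtr stmt)
      : buf_(std::move(buf)), stmt_(std::move(stmt)) {}
  BufPtr buf() const { return buf_; }     // the buffer the tensor writes to
  StmtPtr stmt() const { return stmt_; }  // the statement that computes it

 private:
  BufPtr buf_;
  StmtPtr stmt_;
};
```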

Differential Revision: D30429114

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: f90b859cfe863692b7beffbe9bd0e4143df1e819
2021-08-24 00:32:13 -07:00
Mikhail Zolotukhin
dd96c26066 [TensorExpr] More NFC changes like Expr* -> ExprPtr. (#63778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63778

This is preparation for a switch from raw pointers to shared pointers
as the memory model for TE expressions and statements.
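
Illustrative only (assumed names): the point of spelling call sites as ExprPtr rather than Expr* is that the alias can later be flipped from a raw pointer to a shared pointer without touching those call sites.

```
#include <memory>

class Expr;

// Step 1 (this commit's direction): code is written against the alias.
// using ExprPtr = Expr*;                 // raw pointer, arena-owned
// Step 2 (the later switch): only the alias changes.
using ExprPtr = std::shared_ptr<Expr>;    // reference-counted ownership

ExprPtr simplify(ExprPtr e);              // signatures survive the switch unchanged
```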

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30487425

Pulled By: ZolotukhinM

fbshipit-source-id: 9cbe817b7d4e5fc2f150b29bb9b3bf578868f20c
2021-08-24 00:30:49 -07:00
Raghavan Raman
e2467cc43e [NNC] Make splitWithTail transform in-place (#58268)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58268
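
For readers unfamiliar with the transform: splitWithTail splits a loop by a factor into an outer/inner loop pair plus a tail loop covering the remainder. A plain-C++ sketch of the resulting loop structure (hypothetical names, not the NNC API):

```
#include <cstddef>

// Effect of splitting "for (i = 0; i < n; ++i) body(i)" by factor F.
void split_with_tail(std::size_t n, void (*body)(std::size_t)) {
  constexpr std::size_t F = 8;              // split factor
  const std::size_t trip = n / F;           // number of full outer iterations
  for (std::size_t io = 0; io < trip; ++io) {
    for (std::size_t ii = 0; ii < F; ++ii) {
      body(io * F + ii);                    // main outer/inner loops
    }
  }
  for (std::size_t i = trip * F; i < n; ++i) {
    body(i);                                // tail loop for the n % F remainder
  }
}
```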

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28427228

Pulled By: navahgar

fbshipit-source-id: 270b62c4e83739ad21dd68f375120e56881b394f
2021-05-25 11:31:14 -07:00
Bert Maher
c42dd8b257 Revert "Use at::cpu in bench_approx (#56563)" (#56816)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56816

This doesn't actually work.  For some reason the linker can't find
at::cpu::logit_out, and it's not worth digging into why not.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27977406

Pulled By: bertmaher

fbshipit-source-id: d0235a393f25243e2c8a011e9baf267daf483ae4
2021-04-26 23:51:49 -07:00
Bert Maher
57cba8e601 Use at::cpu in bench_approx (#56563)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56563

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27902737

Pulled By: bertmaher

fbshipit-source-id: 66962671afbb093d5ae0b9308a401536c06ce8f5
2021-04-21 22:56:07 -07:00
Raghavan Raman
164de39a11 Fix build failure due to namespace change for log_out and tanh_out (#56278)
Summary:
There is a build failure in `bench_approx.cpp` due to a namespace change for `log_out` and `tanh_out`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56278

Reviewed By: bertmaher, nikithamalgifb

Differential Revision: D27825621

Pulled By: navahgar

fbshipit-source-id: 0bccd324af92a3460610bf475514449f0223de2b
2021-04-16 13:34:32 -07:00
Wenlei Xie
53596cdb73 Remove hacky wrapper for about 100 kernels (#54367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54367

Codemod commands generated by https://github.com/pytorch/pytorch/pull/54098
ghstack-source-id: 124804544

Test Plan: buck build //caffe2/aten/...

Reviewed By: smessmer

Differential Revision: D27210057

fbshipit-source-id: 368dc77843468cfc44535488a040dbc2cb67208d
2021-03-25 10:00:16 -07:00
Zirui Tao
2b202667c1 [1/N] CPU pointwise optimization: Add a benchmark for Relu
Summary: As title

Test Plan:
Building: finished in 01:58.4 min (100%) 16761/16761 jobs, 16761 updated
  Total time: 02:32.3 min
Run on (24 X 2394.45 MHz CPU s)
2021-02-16 21:29:30
----------------------------------------------------------------------------------------------------
Benchmark                                             Time           CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------
relu_nnc/64                                        1738 ns       1738 ns     410535 log/s=36.8257M/s
relu_nnc/512                                       1708 ns       1708 ns     408678 log/s=299.711M/s
relu_nnc/8192                                      3297 ns       3297 ns     214362 log/s=2.48499G/s
relu_nnc/32768                                    10725 ns      10722 ns      61032 log/s=3.05603G/s
log_nnc_sleef/64                                   2076 ns       2075 ns     326248 log/s=30.8436M/s
log_nnc_sleef/512                                  3070 ns       3069 ns     230616 log/s=166.81M/s
log_nnc_sleef/8192                                22214 ns      22210 ns      31251 log/s=368.849M/s
log_nnc_sleef/32768                               85835 ns      85824 ns       8366 log/s=381.804M/s
log_nnc_fast/64                                    1852 ns       1852 ns     379123 log/s=34.5532M/s
log_nnc_fast/512                                   2456 ns       2456 ns     299463 log/s=208.503M/s
log_nnc_fast/8192                                 10953 ns      10952 ns      69894 log/s=747.957M/s
log_nnc_fast/32768                                35424 ns      35422 ns      19986 log/s=925.08M/s
log_nnc_vml/64                                     2361 ns       2361 ns     356220 log/s=27.1063M/s
log_nnc_vml/512                                    2218 ns       2218 ns     313444 log/s=230.857M/s
log_nnc_vml/8192                                   8420 ns       8420 ns      81594 log/s=972.912M/s
log_nnc_vml/32768                                 29484 ns      29484 ns      21701 log/s=1.1114G/s
log_aten/64                                       15970 ns      15970 ns      44401 log/s=4.00742M/s
log_aten/512                                      18344 ns      18344 ns      41056 log/s=27.9114M/s
log_aten/8192                                     24894 ns      24893 ns      27414 log/s=329.084M/s
log_aten/32768                                    29129 ns      29125 ns      22477 log/s=1.12508G/s
logit_nnc_sleef/64                                 2379 ns       2379 ns     261168 logit/s=26.8981M/s
logit_nnc_sleef/512                                5778 ns       5774 ns     114009 logit/s=88.6757M/s
logit_nnc_sleef/8192                              57268 ns      57236 ns      12429 logit/s=143.127M/s
logit_nnc_sleef/32768                            216356 ns     216344 ns       3026 logit/s=151.462M/s
logit_nnc_fast/64                                  2178 ns       2173 ns     282306 logit/s=29.4565M/s
logit_nnc_fast/512                                 2955 ns       2943 ns     202527 logit/s=173.95M/s
logit_nnc_fast/8192                               14836 ns      14835 ns      46794 logit/s=552.192M/s
logit_nnc_fast/32768                              53999 ns      53997 ns      12842 logit/s=606.846M/s
logit_nnc_vml/64                                   2132 ns       2132 ns     335874 logit/s=30.018M/s
logit_nnc_vml/512                                  3029 ns       3029 ns     250988 logit/s=169.058M/s
logit_nnc_vml/8192                                13264 ns      13263 ns      53504 logit/s=617.655M/s
logit_nnc_vml/32768                               49395 ns      48284 ns      14526 logit/s=678.654M/s
logit_aten/64                                     88180 ns      86690 ns       9270 logit/s=738.261k/s
logit_aten/512                                    54682 ns      54489 ns      10000 logit/s=9.3964M/s
logit_aten/8192                                  170878 ns     164357 ns       6965 logit/s=49.8427M/s
logit_aten/32768                                 452291 ns     434638 ns       3967 logit/s=75.3915M/s
logit_caffe2/64                                   30170 ns      29902 ns      24686 logit/s=2.14029M/s
logit_caffe2/512                                 203517 ns     201201 ns       3570 logit/s=2.54472M/s
logit_caffe2/8192                               3199528 ns    3157098 ns        220 logit/s=2.59479M/s
logit_caffe2/32768                             12520838 ns   12504846 ns         56 logit/s=2.62042M/s
tanh_nnc_fast/64                                   1979 ns       1977 ns     309745 tanh/s=32.3752M/s
tanh_nnc_fast/512                                  2331 ns       2331 ns     300937 tanh/s=219.636M/s
tanh_nnc_fast/8192                                 8323 ns       8323 ns      83601 tanh/s=984.26M/s
tanh_nnc_fast/32768                               30767 ns      30766 ns      23024 tanh/s=1065.06M/s
tanh_aten/64                                      17181 ns      17180 ns      36818 tanh/s=3.72522M/s
tanh_aten/512                                     19071 ns      19036 ns      37243 tanh/s=26.8968M/s
tanh_aten/8192                                    53542 ns      52006 ns      16268 tanh/s=157.521M/s
tanh_aten/32768                                  619869 ns     587600 ns       1000 tanh/s=55.7658M/s
tanh_caffe2/64                                     9668 ns       9654 ns      70926 tanh/s=6.62919M/s
tanh_caffe2/512                                   70409 ns      70409 ns       9881 tanh/s=7.27184M/s
tanh_caffe2/8192                                1179098 ns    1179011 ns        644 tanh/s=6.9482M/s
tanh_caffe2/32768                               4384300 ns    4382613 ns        156 tanh/s=7.47682M/s
BatchNorm/ATen/1/64/112/112                    23186429 ns   23183715 ns         27 GB/s=277.028M/s
BatchNorm/ATen/1/256/14/14                      1772907 ns    1770636 ns        394 GB/s=226.703M/s
BatchNorm/ATen/1/128/28/28                      3069417 ns    3069229 ns        232 GB/s=261.569M/s
BatchNorm/ATen/1/64/56/56                       6367276 ns    6367190 ns        111 GB/s=252.173M/s
BatchNorm/ATen/1/512/7/7                        1334734 ns    1334373 ns        516 GB/s=150.411M/s
BatchNorm/ATen/5/64/112/112                   131727903 ns  131721364 ns          7 GB/s=243.792M/s
BatchNorm/ATen/5/256/14/14                      7879002 ns    7874672 ns         85 GB/s=254.873M/s
BatchNorm/ATen/5/128/28/28                     15561373 ns   15269781 ns         42 GB/s=262.877M/s
BatchNorm/ATen/5/64/56/56                      29169722 ns   29107393 ns         24 GB/s=275.812M/s
BatchNorm/ATen/5/512/7/7                        5042006 ns    5028687 ns        100 GB/s=199.559M/s
BatchNorm/NNC/1/64/112/112                      3303598 ns    3271058 ns        188 GB/s=1.96344G/s
BatchNorm/NNC/1/256/14/14                        330641 ns     326644 ns       2033 GB/s=1.22889G/s
BatchNorm/NNC/1/128/28/28                        498706 ns     497894 ns       1131 GB/s=1.61242G/s
BatchNorm/NNC/1/64/56/56                        1116910 ns    1114768 ns        641 GB/s=1.44033G/s
BatchNorm/NNC/1/512/7/7                          163380 ns     163351 ns       3493 GB/s=1.22867G/s
BatchNorm/NNC/5/64/112/112                     16392078 ns   16386427 ns         41 GB/s=1.95971G/s
BatchNorm/NNC/5/256/14/14                       1133781 ns    1133369 ns        674 GB/s=1.77086G/s
BatchNorm/NNC/5/128/28/28                       2053208 ns    2053211 ns        276 GB/s=1.95503G/s
BatchNorm/NNC/5/64/56/56                        3874949 ns    3874734 ns        165 GB/s=2.07193G/s
BatchNorm/NNC/5/512/7/7                          653665 ns     651498 ns       1236 GB/s=1.54033G/s
BatchNorm/ATenRelu/1/64/112/112                36878892 ns   36100523 ns         22 GB/s=177.907M/s
BatchNorm/ATenRelu/1/256/14/14                  6404318 ns    5544976 ns        100 GB/s=72.3913M/s
BatchNorm/ATenRelu/1/128/28/28                  5897059 ns    5735509 ns        106 GB/s=139.973M/s
BatchNorm/ATenRelu/1/64/56/56                  10075458 ns    9965146 ns         62 GB/s=161.125M/s
BatchNorm/ATenRelu/1/512/7/7                    2680507 ns    2662541 ns        254 GB/s=75.3806M/s
BatchNorm/ATenRelu/5/64/112/112               145738113 ns  144253693 ns          5 GB/s=222.612M/s
BatchNorm/ATenRelu/5/256/14/14                 13582519 ns   13427209 ns         65 GB/s=149.476M/s
BatchNorm/ATenRelu/5/128/28/28                 22747138 ns   22627185 ns         31 GB/s=177.401M/s
BatchNorm/ATenRelu/5/64/56/56                  53609692 ns   52936728 ns         15 GB/s=151.656M/s
BatchNorm/ATenRelu/5/512/7/7                   11378314 ns   11083777 ns         65 GB/s=90.5395M/s
BatchNorm/NNCRelu/1/64/112/112                  3154436 ns    3148939 ns        193 GB/s=2.03958G/s
BatchNorm/NNCRelu/1/256/14/14                    337341 ns     337163 ns       1926 GB/s=1.19055G/s
BatchNorm/NNCRelu/1/128/28/28                    505570 ns     505569 ns       1231 GB/s=1.58794G/s
BatchNorm/NNCRelu/1/64/56/56                     903452 ns     903421 ns        659 GB/s=1.77728G/s
BatchNorm/NNCRelu/1/512/7/7                      158521 ns     158321 ns       3781 GB/s=1.2677G/s
BatchNorm/NNCRelu/5/64/112/112                 15488210 ns   15480019 ns         41 GB/s=2.07446G/s
BatchNorm/NNCRelu/5/256/14/14                   1149186 ns    1148963 ns        649 GB/s=1.74683G/s
BatchNorm/NNCRelu/5/128/28/28                   2011589 ns    2011424 ns        320 GB/s=1.99564G/s
BatchNorm/NNCRelu/5/64/56/56                    3776274 ns    3776060 ns        161 GB/s=2.12607G/s
BatchNorm/NNCRelu/5/512/7/7                      699762 ns     699582 ns        975 GB/s=1.43446G/s
BM_CompileSwish                                30471825 ns   30470017 ns         24
BM_CompileSwishLLVMOnly                        27479624 ns   27473475 ns         25
FusedOverhead                                    196219 ns     196195 ns       3342
UnfusedOverhead                                  220210 ns     220119 ns       3302
Gemm/Torch/128/128/128                           115526 ns     115343 ns       7414 GFLOPS=36.3637G/s
Gemm/TensorExprNoopt/128/128/128                3155851 ns    3155706 ns        210 GFLOPS=1.32912G/s
Gemm/TensorExprTile32x32/128/128/128             124454 ns     124452 ns       5774 GFLOPS=33.7021G/s
Gemm/TensorExprTile4x16/128/128/128              174408 ns     174366 ns       3987 GFLOPS=24.0546G/s
Gemm/TensorExprTile4x16VecUnroll/128/128/128      72949 ns      72948 ns       9028 GFLOPS=57.4974G/s
Gemm/TensorExprTile4x16Cache/128/128/128          73237 ns      73234 ns       9501 GFLOPS=57.2726G/s
Reduce1D/Torch/16777216                       426865265 ns  426853756 ns          2 BYTES=157.217M/s
Reduce1D/Naive/16777216                       132347709 ns  132343710 ns          5 BYTES=507.08M/s
Reduce1D/NativeRfactor/16777216               234668375 ns  234664682 ns          3 BYTES=285.978M/s
Reduce1D/TeNaive/16777216                      20468304 ns   20467906 ns         34 BYTES=3.27874G/s
Reduce1D/TeSplitTail/16777216                  20378995 ns   20378678 ns         34 BYTES=3.29309G/s
Reduce1D/TeSplitMask/16777216                  20371783 ns   20371260 ns         36 BYTES=3.29429G/s
Reduce1D/TeRfactorV2/16777216                   8235908 ns    8235723 ns         84 BYTES=8.14851G/s

CPU info:

Running `sudo lshw -class processor` reports 24 CPUs with identical architecture, as follows:

  *-cpu:0
       description: CPU
       product: Intel Core Processor (Broadwell)
       vendor: Intel Corp.
       physical id: 400
       bus info: cpu@0
       version: 6.61.2
       slot: CPU 0
       size: 2GHz
       capacity: 2GHz
       width: 64 bits
       capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp x86-64 constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
       configuration: cores=1 enabledcores=1 microcode=1 threads=1
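
The numbers above are Google Benchmark output. A minimal sketch of how such a pointwise relu benchmark might be registered (simplified and assumed: a plain loop stands in for the NNC, ATen, and Caffe2 kernels the real benchmark measures):

```
#include <benchmark/benchmark.h>
#include <algorithm>
#include <cstdint>
#include <vector>

// Stand-in relu benchmark; the real benchmarks dispatch to NNC codegen,
// ATen, or Caffe2 instead of this plain loop.
static void relu_naive(benchmark::State& state) {
  const int64_t n = state.range(0);
  std::vector<float> in(n, -1.0f), out(n);
  for (auto _ : state) {
    for (int64_t i = 0; i < n; ++i) {
      out[i] = std::max(in[i], 0.0f);
    }
    benchmark::DoNotOptimize(out.data());
  }
  // Rate counter, matching the ".../s=..." UserCounters column above.
  state.counters["relu/s"] = benchmark::Counter(
      static_cast<double>(n) * state.iterations(), benchmark::Counter::kIsRate);
}
BENCHMARK(relu_naive)->Arg(64)->Arg(512)->Arg(8192)->Arg(32768);
BENCHMARK_MAIN();
```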

Reviewed By: bwasti

Differential Revision: D26275048

fbshipit-source-id: 3de669f622eb8cd328787caa878dc0c05de600a5
2021-02-17 17:18:28 -08:00
Bert Maher
602434bcbe [te] Benchmark vml-based logit (#51771)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51771

This benchmarks an NNC implementation of logit based on VML's log
implementation.

It's a modest improvement over the sleef algorithm, but seems to be a bit
slower than aten (at larger sizes), and I'm not totally sure why, since you'd
think a fused logit kernel would be better than doing clamp/sub/div, followed
by log.  And yet...

Note that it's important to vectorize this kernel by 16, even on an 8-wide AVX2
machine; I suspect that it's needed to give the scheduler enough freedom to
fill up both FMA pipes to avoid stalling on fpdiv or (maybe) memory.
ghstack-source-id: 121392349
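
For reference, logit is the log-odds function, logit(p) = log(p / (1 - p)), computed on inputs clamped away from 0 and 1. A scalar reference for the unfused clamp/sub/div-then-log sequence mentioned above (illustrative sketch with hypothetical names, not the NNC kernel):

```
#include <algorithm>
#include <cmath>

float logit_ref(float p, float eps) {
  p = std::min(std::max(p, eps), 1.0f - eps);  // clamp
  return std::log(p / (1.0f - p));             // sub, div, then log
}
```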

Test Plan:
```
-----------------------------------------------------------------------------
Benchmark                      Time           CPU Iterations UserCounters...
-----------------------------------------------------------------------------
logit_nnc_sleef/64           483 ns        483 ns    1452336 logit/s=132.469M/s
logit_nnc_sleef/512         3019 ns       3019 ns     228059 logit/s=169.577M/s
logit_nnc_sleef/8192       71427 ns      71424 ns       9662 logit/s=114.695M/s
logit_nnc_sleef/32768     307062 ns     306722 ns       2406 logit/s=106.833M/s

logit_nnc_fast/64            147 ns        147 ns    4408910 logit/s=434.908M/s
logit_nnc_fast/512           781 ns        781 ns     881230 logit/s=655.53M/s
logit_nnc_fast/8192        12519 ns      12518 ns      55626 logit/s=654.421M/s
logit_nnc_fast/32768       50530 ns      50526 ns      10000 logit/s=648.536M/s

logit_nnc_vml/64             125 ns        125 ns    5551460 logit/s=511.603M/s
logit_nnc_vml/512            733 ns        733 ns     938444 logit/s=698.955M/s
logit_nnc_vml/8192         11282 ns      11280 ns      61610 logit/s=726.23M/s
logit_nnc_vml/32768        45051 ns      44991 ns      15473 logit/s=728.325M/s

logit_aten/64                450 ns        449 ns    1599269 logit/s=142.429M/s
logit_aten/512              1055 ns       1054 ns     665538 logit/s=485.595M/s
logit_aten/8192            10865 ns      10864 ns      64152 logit/s=754.032M/s
logit_aten/32768           42106 ns      42103 ns      16477 logit/s=778.287M/s

logit_caffe2/64              233 ns        233 ns    2952127 logit/s=274.761M/s
logit_caffe2/512            1795 ns       1795 ns     393354 logit/s=285.177M/s
logit_caffe2/8192          29924 ns      29923 ns      23225 logit/s=273.77M/s
logit_caffe2/32768        123899 ns     123893 ns       5642 logit/s=264.487M/s
```

Reviewed By: bwasti

Differential Revision: D26272325

fbshipit-source-id: b9771a96e0150685506dbc625e7894e81c93a688
2021-02-10 02:09:14 -08:00
Bert Maher
2e35fe9535 [te] Implement log approximation using the VML approach (#51752)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51752

Using a straight power series approximation with enough terms gives
precision down to the denormal range, and avoids the fp division used in the
sleef approach.  This is nice because recent CPUs have dual pipelined fma units,
so we can compute 16 logarithms in parallel; whereas there's usually only one
FP divider and it has a fairly high latency/low throughput.
ghstack-source-id: 121392347
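
A rough illustration of the "power series plus FMA, no division" idea, as a scalar toy with truncated Taylor coefficients (the commit uses minimax coefficients and SIMD vectors, so treat this only as a sketch of the structure):

```
#include <cmath>

// Range-reduce x = m * 2^e with m in [0.5, 1), approximate log(m) with a
// short polynomial in t = m - 1 evaluated in Horner form via fused
// multiply-adds, then add e * ln(2). No floating-point division anywhere.
float log_poly(float x) {
  int e;
  const float m = std::frexp(x, &e);  // x = m * 2^e
  const float t = m - 1.0f;
  // log(1 + t) ~= t * (1 - t/2 + t^2/3 - t^3/4 + t^4/5)  (truncated Taylor)
  float p = 0.2f;
  p = std::fma(p, t, -0.25f);
  p = std::fma(p, t, 1.0f / 3.0f);
  p = std::fma(p, t, -0.5f);
  p = std::fma(p, t, 1.0f);
  p *= t;
  const float ln2 = 0.6931471805599453f;
  return std::fma(static_cast<float>(e), ln2, p);  // log(x) = log(m) + e*ln2
}
```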

Test Plan:
On my avx2+fma broadwell:
```
---------------------------------------------------------------------------
Benchmark                    Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64           178 ns        178 ns    3933565 log/s=358.993M/s
log_nnc_sleef/512         1286 ns       1285 ns     559459 log/s=398.354M/s
log_nnc_sleef/8192       19366 ns      19364 ns      36619 log/s=423.053M/s
log_nnc_sleef/32768      79288 ns      79286 ns       8718 log/s=413.287M/s

log_nnc_fast/64             92 ns         92 ns    7644990 log/s=696.939M/s
log_nnc_fast/512           483 ns        483 ns    1426802 log/s=1059.49M/s
log_nnc_fast/8192         7519 ns       7514 ns      95319 log/s=1090.23M/s
log_nnc_fast/32768       31344 ns      31338 ns      22397 log/s=1045.62M/s

log_nnc_vml/64              88 ns         88 ns    7923812 log/s=728.469M/s
log_nnc_vml/512            454 ns        454 ns    1521437 log/s=1.12739G/s
log_nnc_vml/8192          6763 ns       6763 ns     103264 log/s=1.21136G/s
log_nnc_vml/32768        26565 ns      26564 ns      23609 log/s=1.23354G/s

log_aten/64                418 ns        418 ns    1651401 log/s=153.117M/s
log_aten/512               801 ns        801 ns     875857 log/s=638.923M/s
log_aten/8192             6877 ns       6872 ns     100840 log/s=1.19208G/s
log_aten/32768           26989 ns      26988 ns      26268 log/s=1.21416G/s
```

Reviewed By: bwasti, zheng-xq

Differential Revision: D26246400

fbshipit-source-id: dae47ee6baeab1a813ec4d4440748164051aed3d
2021-02-10 02:09:10 -08:00
Bert Maher
a23e82df10 [nnc] Tweak log_nnc_sleef so vectorization kicks in (#51491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51491

The vectorizer heuristic is pretty dumb and only kicks in if the
unroll factor is exactly 8 or 4.

It's still slower than the direct implementation, which isn't surprising.
ghstack-source-id: 120783426

Test Plan:
`buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench`

Before:
```
---------------------------------------------------------------------------
Benchmark                    Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64           438 ns        438 ns    1795511 log/s=146.259M/s
log_nnc_sleef/512         3196 ns       3195 ns     210032 log/s=160.235M/s
log_nnc_sleef/8192       77467 ns      77466 ns       8859 log/s=105.749M/s
log_nnc_sleef/32768     310206 ns     310202 ns       2170 log/s=105.634M/s
log_nnc_fast/64            100 ns        100 ns    7281074 log/s=637.144M/s
log_nnc_fast/512           546 ns        546 ns    1335816 log/s=938.361M/s
log_nnc_fast/8192         7360 ns       7359 ns      91971 log/s=1.11316G/s
log_nnc_fast/32768       30793 ns      30792 ns      22633 log/s=1064.17M/s
log_aten/64           427 ns        427 ns    1634897 log/s=150.021M/s
log_aten/512          796 ns        796 ns     877318 log/s=643.566M/s
log_aten/8192        6690 ns       6690 ns     102649 log/s=1.22452G/s
log_aten/32768      25357 ns      25350 ns      27808 log/s=1.29263G/s
```

After:
```
---------------------------------------------------------------------------
Benchmark                    Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64           189 ns        188 ns    3872475 log/s=340.585M/s
log_nnc_sleef/512         1307 ns       1307 ns     557770 log/s=391.709M/s
log_nnc_sleef/8192       20259 ns      20257 ns      34240 log/s=404.404M/s
log_nnc_sleef/32768      81556 ns      81470 ns       8767 log/s=402.209M/s
log_nnc_fast/64            110 ns        110 ns    6564558 log/s=581.116M/s
log_nnc_fast/512           554 ns        554 ns    1279304 log/s=923.376M/s
log_nnc_fast/8192         7774 ns       7774 ns      91421 log/s=1053.75M/s
log_nnc_fast/32768       31008 ns      31006 ns      21279 log/s=1056.83M/s
```

Reviewed By: bwasti

Differential Revision: D26139067

fbshipit-source-id: db31897ee9922695ff9dff4ff46e3d3fbd61f4c2
2021-02-01 16:35:37 -08:00
Nikita Shulga
97ea95ddd7 Delete tabs from bench_approx.cpp (#51157)
Summary:
Introduced by D25981260 (f08464f31d)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51157

Reviewed By: bwasti

Differential Revision: D26090008

Pulled By: malfet

fbshipit-source-id: b63f1bb1683c7261902de7eaab24a05a5159ce7e
2021-01-26 15:53:47 -08:00
Bram Wasti
f08464f31d [nnc] Add benchmarks
Summary: Adding a set of benchmarks for key operators

Test Plan:
buck build mode/opt -c 'fbcode.caffe2_gpu_type=none' caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 numactl -C 3 ./buck-out/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench

Reviewed By: ZolotukhinM

Differential Revision: D25981260

fbshipit-source-id: 17681fc1527f43ccf9bcc80704415653a627b396
2021-01-26 13:51:33 -08:00
Bram Wasti
1047957831 [te][reapply] Add fast log approximation based on sleef (#49575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49575

This is a fast log implementation.

benchmark:

```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat

Reviewed By: bertmaher

Differential Revision: D25627157

fbshipit-source-id: a4920f4f4005ce617d372b375e790ca966275cd9
2020-12-17 17:02:00 -08:00
Edward Yang
ea4ccc730e Revert D25445815: [te] Add fast log approximation based on sleef
Test Plan: revert-hammer

Differential Revision: D25445815 (1329066b69)

Original commit changeset: 20696eacd12a

fbshipit-source-id: 38830a6abd16260d60e5dd9a5594e65736a9c782
2020-12-17 15:03:17 -08:00
Bram Wasti
1329066b69 [te] Add fast log approximation based on sleef
Summary:
This is a fast log implementation.

benchmark:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat

Reviewed By: bertmaher

Differential Revision: D25445815

fbshipit-source-id: 20696eacd12a55e797f606f4a6dbbd94c9652888
2020-12-17 14:28:34 -08:00