Mikhail Zolotukhin
dd96c26066
[TensorExpr] More NFC changes like Expr* -> ExprPtr. ( #63778 )
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63778
This is preparation for a switch from raw pointers to shared pointers
as the memory model for TE expressions and statements.
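For context, a minimal sketch of the aliasing this prepares for; the `Expr`/`Stmt` stand-ins and the exact alias shape are illustrative, not the actual TE headers:
```
#include <memory>

// Illustrative stand-ins for TE IR nodes, not the real class definitions.
class Expr {};
class Stmt {};

// This commit routes call sites through names like ExprPtr while they are
// still raw pointers, e.g.:
//   using ExprPtr = Expr*;
// so that the planned memory-model switch only has to flip the alias:
using ExprPtr = std::shared_ptr<Expr>;
using StmtPtr = std::shared_ptr<Stmt>;

ExprPtr makeExpr() { return std::make_shared<Expr>(); }
```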
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30487425
Pulled By: ZolotukhinM
fbshipit-source-id: 9cbe817b7d4e5fc2f150b29bb9b3bf578868f20c
2021-08-24 00:30:49 -07:00
Raghavan Raman
e2467cc43e
[NNC] Make splitWithTail transform in-place ( #58268 )
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58268
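For readers unfamiliar with the transform, a plain C++ sketch of the loop structure splitWithTail produces; "in-place" means the existing loop nest is mutated rather than a rewritten copy being returned. The factor `F` and `body` are placeholders:
```
// Original:  for (int i = 0; i < N; i++) body(i);
// After splitWithTail with factor F:
void split_with_tail(int N, int F) {
  for (int io = 0; io < N / F; io++) {      // full F-sized chunks
    for (int ii = 0; ii < F; ii++) {        // inner loop, fixed trip count F
      int i = io * F + ii;
      // body(i);
    }
  }
  for (int t = (N / F) * F; t < N; t++) {   // tail: remaining N % F iterations
    // body(t);
  }
}
```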
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D28427228
Pulled By: navahgar
fbshipit-source-id: 270b62c4e83739ad21dd68f375120e56881b394f
2021-05-25 11:31:14 -07:00
Bert Maher
c42dd8b257
Revert "Use at::cpu in bench_approx ( #56563 )" ( #56816 )
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56816
This doesn't actually work. For some reason the linker can't find
at::cpu::logit_out, and it's not worth digging into why not.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D27977406
Pulled By: bertmaher
fbshipit-source-id: d0235a393f25243e2c8a011e9baf267daf483ae4
2021-04-26 23:51:49 -07:00
Bert Maher
57cba8e601
Use at::cpu in bench_approx ( #56563 )
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56563
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D27902737
Pulled By: bertmaher
fbshipit-source-id: 66962671afbb093d5ae0b9308a401536c06ce8f5
2021-04-21 22:56:07 -07:00
Raghavan Raman
164de39a11
Fix build failure due to namespace change for log_out and tanh_out ( #56278 )
Summary:
There is a build failure in `bench_approx.cpp` due to a namespace change for `log_out` and `tanh_out`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56278
Reviewed By: bertmaher, nikithamalgifb
Differential Revision: D27825621
Pulled By: navahgar
fbshipit-source-id: 0bccd324af92a3460610bf475514449f0223de2b
2021-04-16 13:34:32 -07:00
Wenlei Xie
53596cdb73
Remove hacky wrapper for about 100 kernels ( #54367 )
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54367
Codemod commands generated by https://github.com/pytorch/pytorch/pull/54098
ghstack-source-id: 124804544
Test Plan: buck build //caffe2/aten/...
Reviewed By: smessmer
Differential Revision: D27210057
fbshipit-source-id: 368dc77843468cfc44535488a040dbc2cb67208d
2021-03-25 10:00:16 -07:00
Zirui Tao
2b202667c1
[1/N] CPU pointwise optimization: Add a benchmark for Relu
Summary: As title
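For reference, a minimal Google Benchmark sketch of the kind of relu benchmark added here; the function name, sizes, and the scalar reference loop are assumptions, not the NNC-backed benchmark itself:
```
#include <benchmark/benchmark.h>
#include <algorithm>
#include <vector>

static void relu_reference(benchmark::State& state) {
  const int64_t n = state.range(0);
  std::vector<float> in(n, -1.0f), out(n);
  for (auto _ : state) {
    for (int64_t i = 0; i < n; i++) {
      out[i] = std::max(in[i], 0.0f);        // scalar relu reference
    }
    benchmark::DoNotOptimize(out.data());
  }
  // Report elements/sec, matching the log/s-style counters in the output below.
  state.counters["relu/s"] = benchmark::Counter(
      static_cast<double>(state.iterations()) * n, benchmark::Counter::kIsRate);
}
BENCHMARK(relu_reference)->Arg(64)->Arg(512)->Arg(8192)->Arg(32768);
BENCHMARK_MAIN();
```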
Test Plan:
```
Building: finished in 01:58.4 min (100%) 16761/16761 jobs, 16761 updated
Total time: 02:32.3 min
Run on (24 X 2394.45 MHz CPU s)
2021-02-16 21:29:30
----------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------
relu_nnc/64 1738 ns 1738 ns 410535 log/s=36.8257M/s
relu_nnc/512 1708 ns 1708 ns 408678 log/s=299.711M/s
relu_nnc/8192 3297 ns 3297 ns 214362 log/s=2.48499G/s
relu_nnc/32768 10725 ns 10722 ns 61032 log/s=3.05603G/s
log_nnc_sleef/64 2076 ns 2075 ns 326248 log/s=30.8436M/s
log_nnc_sleef/512 3070 ns 3069 ns 230616 log/s=166.81M/s
log_nnc_sleef/8192 22214 ns 22210 ns 31251 log/s=368.849M/s
log_nnc_sleef/32768 85835 ns 85824 ns 8366 log/s=381.804M/s
log_nnc_fast/64 1852 ns 1852 ns 379123 log/s=34.5532M/s
log_nnc_fast/512 2456 ns 2456 ns 299463 log/s=208.503M/s
log_nnc_fast/8192 10953 ns 10952 ns 69894 log/s=747.957M/s
log_nnc_fast/32768 35424 ns 35422 ns 19986 log/s=925.08M/s
log_nnc_vml/64 2361 ns 2361 ns 356220 log/s=27.1063M/s
log_nnc_vml/512 2218 ns 2218 ns 313444 log/s=230.857M/s
log_nnc_vml/8192 8420 ns 8420 ns 81594 log/s=972.912M/s
log_nnc_vml/32768 29484 ns 29484 ns 21701 log/s=1.1114G/s
log_aten/64 15970 ns 15970 ns 44401 log/s=4.00742M/s
log_aten/512 18344 ns 18344 ns 41056 log/s=27.9114M/s
log_aten/8192 24894 ns 24893 ns 27414 log/s=329.084M/s
log_aten/32768 29129 ns 29125 ns 22477 log/s=1.12508G/s
logit_nnc_sleef/64 2379 ns 2379 ns 261168 logit/s=26.8981M/s
logit_nnc_sleef/512 5778 ns 5774 ns 114009 logit/s=88.6757M/s
logit_nnc_sleef/8192 57268 ns 57236 ns 12429 logit/s=143.127M/s
logit_nnc_sleef/32768 216356 ns 216344 ns 3026 logit/s=151.462M/s
logit_nnc_fast/64 2178 ns 2173 ns 282306 logit/s=29.4565M/s
logit_nnc_fast/512 2955 ns 2943 ns 202527 logit/s=173.95M/s
logit_nnc_fast/8192 14836 ns 14835 ns 46794 logit/s=552.192M/s
logit_nnc_fast/32768 53999 ns 53997 ns 12842 logit/s=606.846M/s
logit_nnc_vml/64 2132 ns 2132 ns 335874 logit/s=30.018M/s
logit_nnc_vml/512 3029 ns 3029 ns 250988 logit/s=169.058M/s
logit_nnc_vml/8192 13264 ns 13263 ns 53504 logit/s=617.655M/s
logit_nnc_vml/32768 49395 ns 48284 ns 14526 logit/s=678.654M/s
logit_aten/64 88180 ns 86690 ns 9270 logit/s=738.261k/s
logit_aten/512 54682 ns 54489 ns 10000 logit/s=9.3964M/s
logit_aten/8192 170878 ns 164357 ns 6965 logit/s=49.8427M/s
logit_aten/32768 452291 ns 434638 ns 3967 logit/s=75.3915M/s
logit_caffe2/64 30170 ns 29902 ns 24686 logit/s=2.14029M/s
logit_caffe2/512 203517 ns 201201 ns 3570 logit/s=2.54472M/s
logit_caffe2/8192 3199528 ns 3157098 ns 220 logit/s=2.59479M/s
logit_caffe2/32768 12520838 ns 12504846 ns 56 logit/s=2.62042M/s
tanh_nnc_fast/64 1979 ns 1977 ns 309745 tanh/s=32.3752M/s
tanh_nnc_fast/512 2331 ns 2331 ns 300937 tanh/s=219.636M/s
tanh_nnc_fast/8192 8323 ns 8323 ns 83601 tanh/s=984.26M/s
tanh_nnc_fast/32768 30767 ns 30766 ns 23024 tanh/s=1065.06M/s
tanh_aten/64 17181 ns 17180 ns 36818 tanh/s=3.72522M/s
tanh_aten/512 19071 ns 19036 ns 37243 tanh/s=26.8968M/s
tanh_aten/8192 53542 ns 52006 ns 16268 tanh/s=157.521M/s
tanh_aten/32768 619869 ns 587600 ns 1000 tanh/s=55.7658M/s
tanh_caffe2/64 9668 ns 9654 ns 70926 tanh/s=6.62919M/s
tanh_caffe2/512 70409 ns 70409 ns 9881 tanh/s=7.27184M/s
tanh_caffe2/8192 1179098 ns 1179011 ns 644 tanh/s=6.9482M/s
tanh_caffe2/32768 4384300 ns 4382613 ns 156 tanh/s=7.47682M/s
BatchNorm/ATen/1/64/112/112 23186429 ns 23183715 ns 27 GB/s=277.028M/s
BatchNorm/ATen/1/256/14/14 1772907 ns 1770636 ns 394 GB/s=226.703M/s
BatchNorm/ATen/1/128/28/28 3069417 ns 3069229 ns 232 GB/s=261.569M/s
BatchNorm/ATen/1/64/56/56 6367276 ns 6367190 ns 111 GB/s=252.173M/s
BatchNorm/ATen/1/512/7/7 1334734 ns 1334373 ns 516 GB/s=150.411M/s
BatchNorm/ATen/5/64/112/112 131727903 ns 131721364 ns 7 GB/s=243.792M/s
BatchNorm/ATen/5/256/14/14 7879002 ns 7874672 ns 85 GB/s=254.873M/s
BatchNorm/ATen/5/128/28/28 15561373 ns 15269781 ns 42 GB/s=262.877M/s
BatchNorm/ATen/5/64/56/56 29169722 ns 29107393 ns 24 GB/s=275.812M/s
BatchNorm/ATen/5/512/7/7 5042006 ns 5028687 ns 100 GB/s=199.559M/s
BatchNorm/NNC/1/64/112/112 3303598 ns 3271058 ns 188 GB/s=1.96344G/s
BatchNorm/NNC/1/256/14/14 330641 ns 326644 ns 2033 GB/s=1.22889G/s
BatchNorm/NNC/1/128/28/28 498706 ns 497894 ns 1131 GB/s=1.61242G/s
BatchNorm/NNC/1/64/56/56 1116910 ns 1114768 ns 641 GB/s=1.44033G/s
BatchNorm/NNC/1/512/7/7 163380 ns 163351 ns 3493 GB/s=1.22867G/s
BatchNorm/NNC/5/64/112/112 16392078 ns 16386427 ns 41 GB/s=1.95971G/s
BatchNorm/NNC/5/256/14/14 1133781 ns 1133369 ns 674 GB/s=1.77086G/s
BatchNorm/NNC/5/128/28/28 2053208 ns 2053211 ns 276 GB/s=1.95503G/s
BatchNorm/NNC/5/64/56/56 3874949 ns 3874734 ns 165 GB/s=2.07193G/s
BatchNorm/NNC/5/512/7/7 653665 ns 651498 ns 1236 GB/s=1.54033G/s
BatchNorm/ATenRelu/1/64/112/112 36878892 ns 36100523 ns 22 GB/s=177.907M/s
BatchNorm/ATenRelu/1/256/14/14 6404318 ns 5544976 ns 100 GB/s=72.3913M/s
BatchNorm/ATenRelu/1/128/28/28 5897059 ns 5735509 ns 106 GB/s=139.973M/s
BatchNorm/ATenRelu/1/64/56/56 10075458 ns 9965146 ns 62 GB/s=161.125M/s
BatchNorm/ATenRelu/1/512/7/7 2680507 ns 2662541 ns 254 GB/s=75.3806M/s
BatchNorm/ATenRelu/5/64/112/112 145738113 ns 144253693 ns 5 GB/s=222.612M/s
BatchNorm/ATenRelu/5/256/14/14 13582519 ns 13427209 ns 65 GB/s=149.476M/s
BatchNorm/ATenRelu/5/128/28/28 22747138 ns 22627185 ns 31 GB/s=177.401M/s
BatchNorm/ATenRelu/5/64/56/56 53609692 ns 52936728 ns 15 GB/s=151.656M/s
BatchNorm/ATenRelu/5/512/7/7 11378314 ns 11083777 ns 65 GB/s=90.5395M/s
BatchNorm/NNCRelu/1/64/112/112 3154436 ns 3148939 ns 193 GB/s=2.03958G/s
BatchNorm/NNCRelu/1/256/14/14 337341 ns 337163 ns 1926 GB/s=1.19055G/s
BatchNorm/NNCRelu/1/128/28/28 505570 ns 505569 ns 1231 GB/s=1.58794G/s
BatchNorm/NNCRelu/1/64/56/56 903452 ns 903421 ns 659 GB/s=1.77728G/s
BatchNorm/NNCRelu/1/512/7/7 158521 ns 158321 ns 3781 GB/s=1.2677G/s
BatchNorm/NNCRelu/5/64/112/112 15488210 ns 15480019 ns 41 GB/s=2.07446G/s
BatchNorm/NNCRelu/5/256/14/14 1149186 ns 1148963 ns 649 GB/s=1.74683G/s
BatchNorm/NNCRelu/5/128/28/28 2011589 ns 2011424 ns 320 GB/s=1.99564G/s
BatchNorm/NNCRelu/5/64/56/56 3776274 ns 3776060 ns 161 GB/s=2.12607G/s
BatchNorm/NNCRelu/5/512/7/7 699762 ns 699582 ns 975 GB/s=1.43446G/s
BM_CompileSwish 30471825 ns 30470017 ns 24
BM_CompileSwishLLVMOnly 27479624 ns 27473475 ns 25
FusedOverhead 196219 ns 196195 ns 3342
UnfusedOverhead 220210 ns 220119 ns 3302
Gemm/Torch/128/128/128 115526 ns 115343 ns 7414 GFLOPS=36.3637G/s
Gemm/TensorExprNoopt/128/128/128 3155851 ns 3155706 ns 210 GFLOPS=1.32912G/s
Gemm/TensorExprTile32x32/128/128/128 124454 ns 124452 ns 5774 GFLOPS=33.7021G/s
Gemm/TensorExprTile4x16/128/128/128 174408 ns 174366 ns 3987 GFLOPS=24.0546G/s
Gemm/TensorExprTile4x16VecUnroll/128/128/128 72949 ns 72948 ns 9028 GFLOPS=57.4974G/s
Gemm/TensorExprTile4x16Cache/128/128/128 73237 ns 73234 ns 9501 GFLOPS=57.2726G/s
Reduce1D/Torch/16777216 426865265 ns 426853756 ns 2 BYTES=157.217M/s
Reduce1D/Naive/16777216 132347709 ns 132343710 ns 5 BYTES=507.08M/s
Reduce1D/NativeRfactor/16777216 234668375 ns 234664682 ns 3 BYTES=285.978M/s
Reduce1D/TeNaive/16777216 20468304 ns 20467906 ns 34 BYTES=3.27874G/s
Reduce1D/TeSplitTail/16777216 20378995 ns 20378678 ns 34 BYTES=3.29309G/s
Reduce1D/TeSplitMask/16777216 20371783 ns 20371260 ns 36 BYTES=3.29429G/s
Reduce1D/TeRfactorV2/16777216 8235908 ns 8235723 ns 84 BYTES=8.14851G/s
```
CPU info:
Running `sudo lshw -class processor` reports 24 CPUs with identical architecture:
```
*-cpu:0
description: CPU
product: Intel Core Processor (Broadwell)
vendor: Intel Corp.
physical id: 400
bus info: cpu@0
version: 6.61.2
slot: CPU 0
size: 2GHz
capacity: 2GHz
width: 64 bits
capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp x86-64 constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
configuration: cores=1 enabledcores=1 microcode=1 threads=1
```
Reviewed By: bwasti
Differential Revision: D26275048
fbshipit-source-id: 3de669f622eb8cd328787caa878dc0c05de600a5
2021-02-17 17:18:28 -08:00
Bert Maher
602434bcbe
[te] Benchmark vml-based logit ( #51771 )
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51771
This benchmarks an NNC implementation of logit based on VML's log
implementation.
It's a modest improvement over the sleef algorithm, but seems to be a bit
slower than aten (at larger sizes), and I'm not totally sure why, since you'd
think a fused logit kernel would be better than doing clamp/sub/div, followed
by log. And yet...
Note that it's important to vectorize this kernel by 16, even on an 8-wide AVX2
machine; I suspect that it's needed to give the scheduler enough freedom to
fill up both FMA pipes to avoid stalling on fpdiv or (maybe) memory.
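A scalar sketch of the fused kernel being compared here: clamp, then log(x / (1 - x)). The NNC version applies the same steps lane-wise using the VML-style log from D26246400; `eps` and the function shape are illustrative:
```
#include <algorithm>
#include <cmath>

float logit_ref(float x, float eps) {
  x = std::min(std::max(x, eps), 1.0f - eps);  // clamp away from {0, 1}
  return std::log(x / (1.0f - x));             // sub, div, then log
}
```
(Vectorizing by 16 on an 8-wide machine keeps two vectors in flight per step, matching the two-FMA-pipe note above.)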
ghstack-source-id: 121392349
Test Plan:
```
-----------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------
logit_nnc_sleef/64 483 ns 483 ns 1452336 logit/s=132.469M/s
logit_nnc_sleef/512 3019 ns 3019 ns 228059 logit/s=169.577M/s
logit_nnc_sleef/8192 71427 ns 71424 ns 9662 logit/s=114.695M/s
logit_nnc_sleef/32768 307062 ns 306722 ns 2406 logit/s=106.833M/s
logit_nnc_fast/64 147 ns 147 ns 4408910 logit/s=434.908M/s
logit_nnc_fast/512 781 ns 781 ns 881230 logit/s=655.53M/s
logit_nnc_fast/8192 12519 ns 12518 ns 55626 logit/s=654.421M/s
logit_nnc_fast/32768 50530 ns 50526 ns 10000 logit/s=648.536M/s
logit_nnc_vml/64 125 ns 125 ns 5551460 logit/s=511.603M/s
logit_nnc_vml/512 733 ns 733 ns 938444 logit/s=698.955M/s
logit_nnc_vml/8192 11282 ns 11280 ns 61610 logit/s=726.23M/s
logit_nnc_vml/32768 45051 ns 44991 ns 15473 logit/s=728.325M/s
logit_aten/64 450 ns 449 ns 1599269 logit/s=142.429M/s
logit_aten/512 1055 ns 1054 ns 665538 logit/s=485.595M/s
logit_aten/8192 10865 ns 10864 ns 64152 logit/s=754.032M/s
logit_aten/32768 42106 ns 42103 ns 16477 logit/s=778.287M/s
logit_caffe2/64 233 ns 233 ns 2952127 logit/s=274.761M/s
logit_caffe2/512 1795 ns 1795 ns 393354 logit/s=285.177M/s
logit_caffe2/8192 29924 ns 29923 ns 23225 logit/s=273.77M/s
logit_caffe2/32768 123899 ns 123893 ns 5642 logit/s=264.487M/s
```
Reviewed By: bwasti
Differential Revision: D26272325
fbshipit-source-id: b9771a96e0150685506dbc625e7894e81c93a688
2021-02-10 02:09:14 -08:00
Bert Maher
2e35fe9535
[te] Implement log approximation using the VML approach ( #51752 )
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51752
Using a straight power series approximation with enough terms gives
precision down to the denormal range, and avoids the fp division used in the
sleef approach. This is nice because recent CPUs have dual pipelined fma units,
so we can compute 16 logarithms in parallel; whereas there's usually only one
FP divider and it has a fairly high latency/low throughput.
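The shape of the computation, as a scalar sketch: range-reduce so the polynomial argument is near zero, then evaluate via fused multiply-adds with no divide on the critical path. The coefficients below are the truncated Taylor series for illustration only; the real kernel uses a tuned polynomial with enough terms for denormal-range precision:
```
#include <cmath>

float log_poly(float x) {
  int e;
  float m = std::frexp(x, &e);                  // x = m * 2^e, m in [0.5, 1)
  if (m < 0.70710678f) { m *= 2.0f; e -= 1; }   // keep m near 1
  float f = m - 1.0f;
  // Horner: one fma per step, so independent elements can stream through
  // both FMA pipes. log(1 + f) ~= f - f^2/2 + f^3/3 - f^4/4 + ...
  float p = -0.25f;
  p = std::fma(p, f, 1.0f / 3.0f);
  p = std::fma(p, f, -0.5f);
  p = std::fma(p, f, 1.0f);
  p *= f;
  return std::fma(static_cast<float>(e), 0.69314718f, p);  // + e * ln(2)
}
```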
ghstack-source-id: 121392347
Test Plan:
On my avx2+fma broadwell:
```
---------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64 178 ns 178 ns 3933565 log/s=358.993M/s
log_nnc_sleef/512 1286 ns 1285 ns 559459 log/s=398.354M/s
log_nnc_sleef/8192 19366 ns 19364 ns 36619 log/s=423.053M/s
log_nnc_sleef/32768 79288 ns 79286 ns 8718 log/s=413.287M/s
log_nnc_fast/64 92 ns 92 ns 7644990 log/s=696.939M/s
log_nnc_fast/512 483 ns 483 ns 1426802 log/s=1059.49M/s
log_nnc_fast/8192 7519 ns 7514 ns 95319 log/s=1090.23M/s
log_nnc_fast/32768 31344 ns 31338 ns 22397 log/s=1045.62M/s
log_nnc_vml/64 88 ns 88 ns 7923812 log/s=728.469M/s
log_nnc_vml/512 454 ns 454 ns 1521437 log/s=1.12739G/s
log_nnc_vml/8192 6763 ns 6763 ns 103264 log/s=1.21136G/s
log_nnc_vml/32768 26565 ns 26564 ns 23609 log/s=1.23354G/s
log_aten/64 418 ns 418 ns 1651401 log/s=153.117M/s
log_aten/512 801 ns 801 ns 875857 log/s=638.923M/s
log_aten/8192 6877 ns 6872 ns 100840 log/s=1.19208G/s
log_aten/32768 26989 ns 26988 ns 26268 log/s=1.21416G/s
```
Reviewed By: bwasti, zheng-xq
Differential Revision: D26246400
fbshipit-source-id: dae47ee6baeab1a813ec4d4440748164051aed3d
2021-02-10 02:09:10 -08:00
Bert Maher
a23e82df10
[nnc] Tweak log_nnc_sleef so vectorization kicks in ( #51491 )
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51491
The vectorizer heuristic is pretty dumb and only kicks in if the
unroll factor is exactly 8 or 4.
It's still slower than the direct implementation, which isn't surprising.
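A plain C++ illustration of the tweak (not the NNC IR transform itself): restructure the loop so the inner trip count is exactly 8, one AVX2 float vector, which is the shape the heuristic recognizes:
```
#include <cmath>

void log_by_8(const float* in, float* out, int n) {
  int i = 0;
  for (; i + 8 <= n; i += 8) {
    for (int v = 0; v < 8; v++) {   // unroll factor of exactly 8 -> vectorized
      out[i + v] = std::log(in[i + v]);
    }
  }
  for (; i < n; i++) {              // scalar tail
    out[i] = std::log(in[i]);
  }
}
```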
ghstack-source-id: 120783426
Test Plan:
`buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench`
Before:
```
---------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64 438 ns 438 ns 1795511 log/s=146.259M/s
log_nnc_sleef/512 3196 ns 3195 ns 210032 log/s=160.235M/s
log_nnc_sleef/8192 77467 ns 77466 ns 8859 log/s=105.749M/s
log_nnc_sleef/32768 310206 ns 310202 ns 2170 log/s=105.634M/s
log_nnc_fast/64 100 ns 100 ns 7281074 log/s=637.144M/s
log_nnc_fast/512 546 ns 546 ns 1335816 log/s=938.361M/s
log_nnc_fast/8192 7360 ns 7359 ns 91971 log/s=1.11316G/s
log_nnc_fast/32768 30793 ns 30792 ns 22633 log/s=1064.17M/s
log_aten/64 427 ns 427 ns 1634897 log/s=150.021M/s
log_aten/512 796 ns 796 ns 877318 log/s=643.566M/s
log_aten/8192 6690 ns 6690 ns 102649 log/s=1.22452G/s
log_aten/32768 25357 ns 25350 ns 27808 log/s=1.29263G/s
```
After:
```
---------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64 189 ns 188 ns 3872475 log/s=340.585M/s
log_nnc_sleef/512 1307 ns 1307 ns 557770 log/s=391.709M/s
log_nnc_sleef/8192 20259 ns 20257 ns 34240 log/s=404.404M/s
log_nnc_sleef/32768 81556 ns 81470 ns 8767 log/s=402.209M/s
log_nnc_fast/64 110 ns 110 ns 6564558 log/s=581.116M/s
log_nnc_fast/512 554 ns 554 ns 1279304 log/s=923.376M/s
log_nnc_fast/8192 7774 ns 7774 ns 91421 log/s=1053.75M/s
log_nnc_fast/32768 31008 ns 31006 ns 21279 log/s=1056.83M/s
```
Reviewed By: bwasti
Differential Revision: D26139067
fbshipit-source-id: db31897ee9922695ff9dff4ff46e3d3fbd61f4c2
2021-02-01 16:35:37 -08:00
Nikita Shulga
97ea95ddd7
Delete tabs from bench_approx.cpp ( #51157 )
Summary:
Introduced by D25981260 (f08464f31d )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51157
Reviewed By: bwasti
Differential Revision: D26090008
Pulled By: malfet
fbshipit-source-id: b63f1bb1683c7261902de7eaab24a05a5159ce7e
2021-01-26 15:53:47 -08:00
Bram Wasti
f08464f31d
[nnc] Add benchmarks
Summary: Adding a set of benchmarks for key operators
Test Plan:
```
buck build mode/opt -c 'fbcode.caffe2_gpu_type=none' caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 numactl -C 3 ./buck-out/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench
```
Reviewed By: ZolotukhinM
Differential Revision: D25981260
fbshipit-source-id: 17681fc1527f43ccf9bcc80704415653a627b396
2021-01-26 13:51:33 -08:00
Bram Wasti
1047957831
[te][reapply] Add fast log approximation based on sleef ( #49575 )
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49575
This is a fast log implementation.
Benchmark:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```
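For contrast with the later VML version (D26246400), a scalar sketch of the sleef-style approach: log(m) = 2·atanh(t) with t = (m - 1) / (m + 1), whose range reduction needs the fp divide that the VML version avoids. Truncated series coefficients, for illustration only:
```
#include <cmath>

float log_sleef_style(float x) {
  int e;
  float m = std::frexp(x, &e);                  // x = m * 2^e, m in [0.5, 1)
  if (m < 0.70710678f) { m *= 2.0f; e -= 1; }   // keep m near 1
  float t = (m - 1.0f) / (m + 1.0f);            // the fp divide
  float t2 = t * t;
  // 2 * atanh(t) = 2 * (t + t^3/3 + t^5/5 + ...)
  float p = 1.0f / 5.0f;
  p = std::fma(p, t2, 1.0f / 3.0f);
  p = std::fma(p, t2, 1.0f);
  return std::fma(static_cast<float>(e), 0.69314718f, 2.0f * t * p);
}
```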
Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat
Reviewed By: bertmaher
Differential Revision: D25627157
fbshipit-source-id: a4920f4f4005ce617d372b375e790ca966275cd9
2020-12-17 17:02:00 -08:00
Edward Yang
ea4ccc730e
Revert D25445815: [te] Add fast log approximation based on sleef
Test Plan: revert-hammer
Differential Revision: D25445815 (1329066b69 )
Original commit changeset: 20696eacd12a
fbshipit-source-id: 38830a6abd16260d60e5dd9a5594e65736a9c782
2020-12-17 15:03:17 -08:00
Bram Wasti
1329066b69
[te] Add fast log approximation based on sleef
Summary:
This is a fast log implementation.
Benchmark:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```
Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat
Reviewed By: bertmaher
Differential Revision: D25445815
fbshipit-source-id: 20696eacd12a55e797f606f4a6dbbd94c9652888
2020-12-17 14:28:34 -08:00