Mirror of https://github.com/zebrajr/pytorch.git
Synced 2025-12-07 12:21:27 +01:00
Latest commit: d57ae6c46d

24 Commits

d57ae6c46d | Revert D26906509: Adding parallel support for the LLVM backend.

Test Plan: revert-hammer
Differential Revision: D26906509

95d2318510 | Adding parallel support for the LLVM backend. (#53243)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53243
Test Plan: Imported from OSS
Reviewed By: bertmaher, Chillee
Differential Revision: D26906509
Pulled By: zheng-xq
fbshipit-source-id: 12c17f2f21af11e73fa4c5b5199043a7a15ecdec

8c798e0622 | Forbid trailing whitespace (#53406)

Summary: Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857

These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
  - `GLOSSARY.md`
  - `aten/src/ATen/core/op_registration/README.md`
  - `scripts/README.md`
  - `torch/csrc/jit/codegen/fuser/README.md`

The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```
I looked over the auto-generated changes and didn't see anything that looked problematic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406

Test Plan: This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377

This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348

Reviewed By: walterddr, seemethere
Differential Revision: D26856620
Pulled By: samestep
fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97

8af648354f | [nnc] Benchmarks for concat (#52592)

Summary: This PR adds a c++ benchmark for "concat" with 3 different versions - 1) aten::cat, 2) NNC implementation with if-then-else, 3) NNC implementation using multiple loops. It also adds a python benchmark for "concat" which can now be invoked with and without CPU fusion.

Here are the results of these benchmarks on a `Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz` machine with `OMP_NUM_THREADS=1`:
```
--------------------------------------------------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------
Concat2D2 (
```

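The difference between the two NNC lowering strategies named above can be illustrated outside NNC. The sketch below is only an illustration in plain Python/NumPy (hypothetical helper names, not the C++ benchmark or NNC's generated code): the first version uses one loop over the output with an if-then-else selecting the source, the second uses one loop per input.

```python
# Illustrative sketch of the two concat lowering strategies the benchmark compares.
# Not NNC codegen; just the loop structures written out by hand.
import numpy as np

def concat_if_then_else(a, b):
    # One loop over the output; an if-then-else picks which input to read.
    out = np.empty(a.shape[0] + b.shape[0], dtype=a.dtype)
    for i in range(out.shape[0]):
        out[i] = a[i] if i < a.shape[0] else b[i - a.shape[0]]
    return out

def concat_multiple_loops(a, b):
    # One loop per input; each loop writes its own contiguous output region.
    out = np.empty(a.shape[0] + b.shape[0], dtype=a.dtype)
    for i in range(a.shape[0]):
        out[i] = a[i]
    for j in range(b.shape[0]):
        out[a.shape[0] + j] = b[j]
    return out

a = np.random.rand(512).astype(np.float32)
b = np.random.rand(256).astype(np.float32)
assert np.array_equal(concat_if_then_else(a, b), np.concatenate([a, b]))
assert np.array_equal(concat_multiple_loops(a, b), np.concatenate([a, b]))
```
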
2b202667c1 | [1/N] CPU pointwise optimization: Add a benchmark for Relu

Summary: As title

Test Plan:
```
Building: finished in 01:58.4 min (100%) 16761/16761 jobs, 16761 updated
Total time: 02:32.3 min
Run on (24 X 2394.45 MHz CPU s)
2021-02-16 21:29:30
----------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------
relu_nnc/64 1738 ns 1738 ns 410535 log/s=36.8257M/s
relu_nnc/512 1708 ns 1708 ns 408678 log/s=299.711M/s
relu_nnc/8192 3297 ns 3297 ns 214362 log/s=2.48499G/s
relu_nnc/32768 10725 ns 10722 ns 61032 log/s=3.05603G/s
log_nnc_sleef/64 2076 ns 2075 ns 326248 log/s=30.8436M/s
log_nnc_sleef/512 3070 ns 3069 ns 230616 log/s=166.81M/s
log_nnc_sleef/8192 22214 ns 22210 ns 31251 log/s=368.849M/s
log_nnc_sleef/32768 85835 ns 85824 ns 8366 log/s=381.804M/s
log_nnc_fast/64 1852 ns 1852 ns 379123 log/s=34.5532M/s
log_nnc_fast/512 2456 ns 2456 ns 299463 log/s=208.503M/s
log_nnc_fast/8192 10953 ns 10952 ns 69894 log/s=747.957M/s
log_nnc_fast/32768 35424 ns 35422 ns 19986 log/s=925.08M/s
log_nnc_vml/64 2361 ns 2361 ns 356220 log/s=27.1063M/s
log_nnc_vml/512 2218 ns 2218 ns 313444 log/s=230.857M/s
log_nnc_vml/8192 8420 ns 8420 ns 81594 log/s=972.912M/s
log_nnc_vml/32768 29484 ns 29484 ns 21701 log/s=1.1114G/s
log_aten/64 15970 ns 15970 ns 44401 log/s=4.00742M/s
log_aten/512 18344 ns 18344 ns 41056 log/s=27.9114M/s
log_aten/8192 24894 ns 24893 ns 27414 log/s=329.084M/s
log_aten/32768 29129 ns 29125 ns 22477 log/s=1.12508G/s
logit_nnc_sleef/64 2379 ns 2379 ns 261168 logit/s=26.8981M/s
logit_nnc_sleef/512 5778 ns 5774 ns 114009 logit/s=88.6757M/s
logit_nnc_sleef/8192 57268 ns 57236 ns 12429 logit/s=143.127M/s
logit_nnc_sleef/32768 216356 ns 216344 ns 3026 logit/s=151.462M/s
logit_nnc_fast/64 2178 ns 2173 ns 282306 logit/s=29.4565M/s
logit_nnc_fast/512 2955 ns 2943 ns 202527 logit/s=173.95M/s
logit_nnc_fast/8192 14836 ns 14835 ns 46794 logit/s=552.192M/s
logit_nnc_fast/32768 53999 ns 53997 ns 12842 logit/s=606.846M/s
logit_nnc_vml/64 2132 ns 2132 ns 335874 logit/s=30.018M/s
logit_nnc_vml/512 3029 ns 3029 ns 250988 logit/s=169.058M/s
logit_nnc_vml/8192 13264 ns 13263 ns 53504 logit/s=617.655M/s
logit_nnc_vml/32768 49395 ns 48284 ns 14526 logit/s=678.654M/s
logit_aten/64 88180 ns 86690 ns 9270 logit/s=738.261k/s
logit_aten/512 54682 ns 54489 ns 10000 logit/s=9.3964M/s
logit_aten/8192 170878 ns 164357 ns 6965 logit/s=49.8427M/s
logit_aten/32768 452291 ns 434638 ns 3967 logit/s=75.3915M/s
logit_caffe2/64 30170 ns 29902 ns 24686 logit/s=2.14029M/s
logit_caffe2/512 203517 ns 201201 ns 3570 logit/s=2.54472M/s
logit_caffe2/8192 3199528 ns 3157098 ns 220 logit/s=2.59479M/s
logit_caffe2/32768 12520838 ns 12504846 ns 56 logit/s=2.62042M/s
tanh_nnc_fast/64 1979 ns 1977 ns 309745 tanh/s=32.3752M/s
tanh_nnc_fast/512 2331 ns 2331 ns 300937 tanh/s=219.636M/s
tanh_nnc_fast/8192 8323 ns 8323 ns 83601 tanh/s=984.26M/s
tanh_nnc_fast/32768 30767 ns 30766 ns 23024 tanh/s=1065.06M/s
tanh_aten/64 17181 ns 17180 ns 36818 tanh/s=3.72522M/s
tanh_aten/512 19071 ns 19036 ns 37243 tanh/s=26.8968M/s
tanh_aten/8192 53542 ns 52006 ns 16268 tanh/s=157.521M/s
tanh_aten/32768 619869 ns 587600 ns 1000 tanh/s=55.7658M/s
tanh_caffe2/64 9668 ns 9654 ns 70926 tanh/s=6.62919M/s
tanh_caffe2/512 70409 ns 70409 ns 9881 tanh/s=7.27184M/s
tanh_caffe2/8192 1179098 ns 1179011 ns 644 tanh/s=6.9482M/s
tanh_caffe2/32768 4384300 ns 4382613 ns 156 tanh/s=7.47682M/s
BatchNorm/ATen/1/64/112/112 23186429 ns 23183715 ns 27 GB/s=277.028M/s
BatchNorm/ATen/1/256/14/14 1772907 ns 1770636 ns 394 GB/s=226.703M/s
BatchNorm/ATen/1/128/28/28 3069417 ns 3069229 ns 232 GB/s=261.569M/s
BatchNorm/ATen/1/64/56/56 6367276 ns 6367190 ns 111 GB/s=252.173M/s
BatchNorm/ATen/1/512/7/7 1334734 ns 1334373 ns 516 GB/s=150.411M/s
BatchNorm/ATen/5/64/112/112 131727903 ns 131721364 ns 7 GB/s=243.792M/s
BatchNorm/ATen/5/256/14/14 7879002 ns 7874672 ns 85 GB/s=254.873M/s
BatchNorm/ATen/5/128/28/28 15561373 ns 15269781 ns 42 GB/s=262.877M/s
BatchNorm/ATen/5/64/56/56 29169722 ns 29107393 ns 24 GB/s=275.812M/s
BatchNorm/ATen/5/512/7/7 5042006 ns 5028687 ns 100 GB/s=199.559M/s
BatchNorm/NNC/1/64/112/112 3303598 ns 3271058 ns 188 GB/s=1.96344G/s
BatchNorm/NNC/1/256/14/14 330641 ns 326644 ns 2033 GB/s=1.22889G/s
BatchNorm/NNC/1/128/28/28 498706 ns 497894 ns 1131 GB/s=1.61242G/s
BatchNorm/NNC/1/64/56/56 1116910 ns 1114768 ns 641 GB/s=1.44033G/s
BatchNorm/NNC/1/512/7/7 163380 ns 163351 ns 3493 GB/s=1.22867G/s
BatchNorm/NNC/5/64/112/112 16392078 ns 16386427 ns 41 GB/s=1.95971G/s
BatchNorm/NNC/5/256/14/14 1133781 ns 1133369 ns 674 GB/s=1.77086G/s
BatchNorm/NNC/5/128/28/28 2053208 ns 2053211 ns 276 GB/s=1.95503G/s
BatchNorm/NNC/5/64/56/56 3874949 ns 3874734 ns 165 GB/s=2.07193G/s
BatchNorm/NNC/5/512/7/7 653665 ns 651498 ns 1236 GB/s=1.54033G/s
BatchNorm/ATenRelu/1/64/112/112 36878892 ns 36100523 ns 22 GB/s=177.907M/s
BatchNorm/ATenRelu/1/256/14/14 6404318 ns 5544976 ns 100 GB/s=72.3913M/s
BatchNorm/ATenRelu/1/128/28/28 5897059 ns 5735509 ns 106 GB/s=139.973M/s
BatchNorm/ATenRelu/1/64/56/56 10075458 ns 9965146 ns 62 GB/s=161.125M/s
BatchNorm/ATenRelu/1/512/7/7 2680507 ns 2662541 ns 254 GB/s=75.3806M/s
BatchNorm/ATenRelu/5/64/112/112 145738113 ns 144253693 ns 5 GB/s=222.612M/s
BatchNorm/ATenRelu/5/256/14/14 13582519 ns 13427209 ns 65 GB/s=149.476M/s
BatchNorm/ATenRelu/5/128/28/28 22747138 ns 22627185 ns 31 GB/s=177.401M/s
BatchNorm/ATenRelu/5/64/56/56 53609692 ns 52936728 ns 15 GB/s=151.656M/s
BatchNorm/ATenRelu/5/512/7/7 11378314 ns 11083777 ns 65 GB/s=90.5395M/s
BatchNorm/NNCRelu/1/64/112/112 3154436 ns 3148939 ns 193 GB/s=2.03958G/s
BatchNorm/NNCRelu/1/256/14/14 337341 ns 337163 ns 1926 GB/s=1.19055G/s
BatchNorm/NNCRelu/1/128/28/28 505570 ns 505569 ns 1231 GB/s=1.58794G/s
BatchNorm/NNCRelu/1/64/56/56 903452 ns 903421 ns 659 GB/s=1.77728G/s
BatchNorm/NNCRelu/1/512/7/7 158521 ns 158321 ns 3781 GB/s=1.2677G/s
BatchNorm/NNCRelu/5/64/112/112 15488210 ns 15480019 ns 41 GB/s=2.07446G/s
BatchNorm/NNCRelu/5/256/14/14 1149186 ns 1148963 ns 649 GB/s=1.74683G/s
BatchNorm/NNCRelu/5/128/28/28 2011589 ns 2011424 ns 320 GB/s=1.99564G/s
BatchNorm/NNCRelu/5/64/56/56 3776274 ns 3776060 ns 161 GB/s=2.12607G/s
BatchNorm/NNCRelu/5/512/7/7 699762 ns 699582 ns 975 GB/s=1.43446G/s
BM_CompileSwish 30471825 ns 30470017 ns 24
BM_CompileSwishLLVMOnly 27479624 ns 27473475 ns 25
FusedOverhead 196219 ns 196195 ns 3342
UnfusedOverhead 220210 ns 220119 ns 3302
Gemm/Torch/128/128/128 115526 ns 115343 ns 7414 GFLOPS=36.3637G/s
Gemm/TensorExprNoopt/128/128/128 3155851 ns 3155706 ns 210 GFLOPS=1.32912G/s
Gemm/TensorExprTile32x32/128/128/128 124454 ns 124452 ns 5774 GFLOPS=33.7021G/s
Gemm/TensorExprTile4x16/128/128/128 174408 ns 174366 ns 3987 GFLOPS=24.0546G/s
Gemm/TensorExprTile4x16VecUnroll/128/128/128 72949 ns 72948 ns 9028 GFLOPS=57.4974G/s
Gemm/TensorExprTile4x16Cache/128/128/128 73237 ns 73234 ns 9501 GFLOPS=57.2726G/s
Reduce1D/Torch/16777216 426865265 ns 426853756 ns 2 BYTES=157.217M/s
Reduce1D/Naive/16777216 132347709 ns 132343710 ns 5 BYTES=507.08M/s
Reduce1D/NativeRfactor/16777216 234668375 ns 234664682 ns 3 BYTES=285.978M/s
Reduce1D/TeNaive/16777216 20468304 ns 20467906 ns 34 BYTES=3.27874G/s
Reduce1D/TeSplitTail/16777216 20378995 ns 20378678 ns 34 BYTES=3.29309G/s
Reduce1D/TeSplitMask/16777216 20371783 ns 20371260 ns 36 BYTES=3.29429G/s
Reduce1D/TeRfactorV2/16777216 8235908 ns 8235723 ns 84 BYTES=8.14851G/s
```

CPU info: Running `sudo lshw -class processor` reports 24 CPUs with identical architecture, as follows:

```
*-cpu:0
description: CPU
product: Intel Core Processor (Broadwell)
vendor: Intel Corp.
physical id: 400
bus info: cpu@0
version: 6.61.2
slot: CPU 0
size: 2GHz
capacity: 2GHz
width: 64 bits
capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp x86-64 constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
configuration: cores=1 enabledcores=1 microcode=1 threads=1
```

Reviewed By: bwasti
Differential Revision: D26275048
fbshipit-source-id: 3de669f622eb8cd328787caa878dc0c05de600a5
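
For a rough eager-mode point of comparison at the same tensor sizes, a minimal timing sketch such as the following can be used. This is only an illustration under arbitrary iteration counts, not the C++ googlebenchmark added in this commit.

```python
# Hypothetical eager-mode timing sketch for relu at the benchmarked sizes.
import time
import torch

def bench(fn, x, iters=1000):
    fn(x)                          # warm up
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters

torch.set_num_threads(1)           # match OMP_NUM_THREADS=1 from the Test Plan
for n in (64, 512, 8192, 32768):
    x = torch.randn(n)
    print(f"relu_eager/{n}: {bench(torch.relu, x) * 1e9:.0f} ns")
```
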
71d5a8ea62 | [nnc] Benchmark inference batchnorm (#52251)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52251

Batchnorm in inference is just a bunch of pointwise ops. NNC should be able to do a good job of this, and indeed it does. For fun I've included a fused BN->Relu (although the real fusion fun would be Conv->BN->Relu...).
```
---------------------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------
BatchNorm/ATen/1/64/112/112 252886 ns 252875 ns 2785 GB/s=25.3981G/s
BatchNorm/ATen/1/256/14/14 12145 ns 12145 ns 55347 GB/s=33.0525G/s
BatchNorm/ATen/1/128/28/28 18919 ns 18918 ns 37749 GB/s=42.437G/s
BatchNorm/ATen/1/64/56/56 61434 ns 61433 ns 11315 GB/s=26.1363G/s
BatchNorm/ATen/1/512/7/7 11924 ns 11923 ns 59070 GB/s=16.8327G/s
BatchNorm/ATen/5/64/112/112 1873321 ns 1873292 ns 382 GB/s=17.1424G/s
BatchNorm/ATen/5/256/14/14 83470 ns 83459 ns 8538 GB/s=24.0483G/s
BatchNorm/ATen/5/128/28/28 157521 ns 157520 ns 4440 GB/s=25.4829G/s
BatchNorm/ATen/5/64/56/56 314675 ns 314670 ns 2235 GB/s=25.513G/s
BatchNorm/ATen/5/512/7/7 48129 ns 48128 ns 14582 GB/s=20.851G/s
BatchNorm/NNC/1/64/112/112 249454 ns 249428 ns 2802 GB/s=25.749G/s
BatchNorm/NNC/1/256/14/14 9321 ns 9321 ns 74573 GB/s=43.066G/s
BatchNorm/NNC/1/128/28/28 16874 ns 16873 ns 40999 GB/s=47.5797G/s
BatchNorm/NNC/1/64/56/56 59276 ns 59275 ns 12047 GB/s=27.0878G/s
BatchNorm/NNC/1/512/7/7 3452 ns 3452 ns 202610 GB/s=58.1394G/s
BatchNorm/NNC/5/64/112/112 1820201 ns 1820038 ns 373 GB/s=17.6439G/s
BatchNorm/NNC/5/256/14/14 78429 ns 78420 ns 8871 GB/s=25.5935G/s
BatchNorm/NNC/5/128/28/28 155214 ns 155202 ns 4514 GB/s=25.8635G/s
BatchNorm/NNC/5/64/56/56 311454 ns 311449 ns 2163 GB/s=25.7768G/s
BatchNorm/NNC/5/512/7/7 26853 ns 26851 ns 25283 GB/s=37.3735G/s
BatchNorm/ATenRelu/1/64/112/112 378879 ns 378849 ns 1844 GB/s=16.9528G/s
BatchNorm/ATenRelu/1/256/14/14 16707 ns 16705 ns 41391 GB/s=24.029G/s
BatchNorm/ATenRelu/1/128/28/28 30235 ns 30235 ns 23060 GB/s=26.5529G/s
BatchNorm/ATenRelu/1/64/56/56 91164 ns 91160 ns 7662 GB/s=17.6132G/s
BatchNorm/ATenRelu/1/512/7/7 14681 ns 14681 ns 46088 GB/s=13.6707G/s
BatchNorm/ATenRelu/5/64/112/112 2864060 ns 2863566 ns 243 GB/s=11.2142G/s
BatchNorm/ATenRelu/5/256/14/14 118376 ns 118367 ns 5907 GB/s=16.9561G/s
BatchNorm/ATenRelu/5/128/28/28 237893 ns 237873 ns 2936 GB/s=16.8749G/s
BatchNorm/ATenRelu/5/64/56/56 472452 ns 472386 ns 1479 GB/s=16.9949G/s
BatchNorm/ATenRelu/5/512/7/7 61389 ns 61379 ns 11442 GB/s=16.3496G/s
BatchNorm/NNCRelu/1/64/112/112 248378 ns 248341 ns 2812 GB/s=25.8618G/s
BatchNorm/NNCRelu/1/256/14/14 9965 ns 9964 ns 76013 GB/s=40.2861G/s
BatchNorm/NNCRelu/1/128/28/28 16153 ns 16153 ns 43343 GB/s=49.7004G/s
BatchNorm/NNCRelu/1/64/56/56 58761 ns 58757 ns 12095 GB/s=27.3265G/s
BatchNorm/NNCRelu/1/512/7/7 10529 ns 10529 ns 66590 GB/s=19.0625G/s
BatchNorm/NNCRelu/5/64/112/112 1799001 ns 1798757 ns 362 GB/s=17.8527G/s
BatchNorm/NNCRelu/5/256/14/14 78252 ns 78246 ns 8974 GB/s=25.6504G/s
BatchNorm/NNCRelu/5/128/28/28 154940 ns 154923 ns 4483 GB/s=25.9102G/s
BatchNorm/NNCRelu/5/64/56/56 312329 ns 312324 ns 2244 GB/s=25.7046G/s
BatchNorm/NNCRelu/5/512/7/7 51203 ns 51199 ns 13559 GB/s=19.6004G/s
```

Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D26440786
Pulled By: bertmaher
fbshipit-source-id: 7d3f7bf6eee4c37736e9875d31ae1b483af9fb6f

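Since the point of the message is that inference-mode batchnorm is "just a bunch of pointwise ops", the sketch below writes that composition out in plain PyTorch, with an optional fused relu at the end. It is an illustration of the math the fused kernel computes, not the NNC-generated kernel.

```python
# Illustrative inference batchnorm as pointwise ops, optionally fused with relu.
import torch

def batchnorm_inference(x, mean, var, weight, bias, eps=1e-5, with_relu=False):
    # x: (N, C, H, W); per-channel stats broadcast over N, H, W.
    scale = weight / torch.sqrt(var + eps)           # pointwise per channel
    shift = bias - mean * scale
    y = x * scale.view(1, -1, 1, 1) + shift.view(1, -1, 1, 1)
    return torch.relu(y) if with_relu else y

x = torch.randn(1, 64, 112, 112)
bn = torch.nn.BatchNorm2d(64).eval()
with torch.no_grad():
    expected = torch.relu(bn(x))
    got = batchnorm_inference(x, bn.running_mean, bn.running_var,
                              bn.weight, bn.bias, bn.eps, with_relu=True)
print(torch.allclose(expected, got, atol=1e-5))
```
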
602434bcbe | [te] Benchmark vml-based logit (#51771)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51771

This benchmarks an NNC implementation of logit based on VML's log implementation. It's a modest improvement over the sleef algorithm, but seems to be a bit slower than aten (at larger sizes), and I'm not totally sure why, since you'd think a fused logit kernel would be better than doing clamp/sub/div, followed by log. And yet...

Note that it's important to vectorize this kernel by 16, even on an 8-wide AVX2 machine; I suspect that it's needed to give the scheduler enough freedom to fill up both FMA pipes to avoid stalling on fpdiv or (maybe) memory.

ghstack-source-id: 121392349

Test Plan:
```
-----------------------------------------------------------------------------
Benchmark                   Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------
logit_nnc_sleef/64 483 ns 483 ns 1452336 logit/s=132.469M/s
logit_nnc_sleef/512 3019 ns 3019 ns 228059 logit/s=169.577M/s
logit_nnc_sleef/8192 71427 ns 71424 ns 9662 logit/s=114.695M/s
logit_nnc_sleef/32768 307062 ns 306722 ns 2406 logit/s=106.833M/s
logit_nnc_fast/64 147 ns 147 ns 4408910 logit/s=434.908M/s
logit_nnc_fast/512 781 ns 781 ns 881230 logit/s=655.53M/s
logit_nnc_fast/8192 12519 ns 12518 ns 55626 logit/s=654.421M/s
logit_nnc_fast/32768 50530 ns 50526 ns 10000 logit/s=648.536M/s
logit_nnc_vml/64 125 ns 125 ns 5551460 logit/s=511.603M/s
logit_nnc_vml/512 733 ns 733 ns 938444 logit/s=698.955M/s
logit_nnc_vml/8192 11282 ns 11280 ns 61610 logit/s=726.23M/s
logit_nnc_vml/32768 45051 ns 44991 ns 15473 logit/s=728.325M/s
logit_aten/64 450 ns 449 ns 1599269 logit/s=142.429M/s
logit_aten/512 1055 ns 1054 ns 665538 logit/s=485.595M/s
logit_aten/8192 10865 ns 10864 ns 64152 logit/s=754.032M/s
logit_aten/32768 42106 ns 42103 ns 16477 logit/s=778.287M/s
logit_caffe2/64 233 ns 233 ns 2952127 logit/s=274.761M/s
logit_caffe2/512 1795 ns 1795 ns 393354 logit/s=285.177M/s
logit_caffe2/8192 29924 ns 29923 ns 23225 logit/s=273.77M/s
logit_caffe2/32768 123899 ns 123893 ns 5642 logit/s=264.487M/s
```

Reviewed By: bwasti
Differential Revision: D26272325
fbshipit-source-id: b9771a96e0150685506dbc625e7894e81c93a688

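The unfused sequence the message mentions, clamp then sub then div then log, can be written out directly. The sketch below is an illustration of that composition in plain PyTorch (not the NNC kernel), checked against `torch.logit`:

```python
# Illustrative unfused logit: clamp, sub, div, then log.
import torch

def logit_unfused(x, eps=1e-6):
    x = torch.clamp(x, eps, 1.0 - eps)   # clamp
    denom = 1.0 - x                      # sub
    ratio = x / denom                    # div
    return torch.log(ratio)              # log

x = torch.rand(8192)
print(torch.allclose(logit_unfused(x), torch.logit(x, eps=1e-6), atol=1e-6))
```
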
2e35fe9535 | [te] Implement log approximation using the VML approach (#51752)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51752

Using a straight power series approximation with enough terms gives precision down to the denormal range, and avoids the fp division used in the sleef approach. This is nice because recent CPUs have dual pipelined fma units, so we can compute 16 logarithms in parallel; whereas there's usually only one FP divider and it has a fairly high latency/low throughput.

ghstack-source-id: 121392347

Test Plan: On my avx2+fma broadwell:
```
---------------------------------------------------------------------------
Benchmark                 Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64 178 ns 178 ns 3933565 log/s=358.993M/s
log_nnc_sleef/512 1286 ns 1285 ns 559459 log/s=398.354M/s
log_nnc_sleef/8192 19366 ns 19364 ns 36619 log/s=423.053M/s
log_nnc_sleef/32768 79288 ns 79286 ns 8718 log/s=413.287M/s
log_nnc_fast/64 92 ns 92 ns 7644990 log/s=696.939M/s
log_nnc_fast/512 483 ns 483 ns 1426802 log/s=1059.49M/s
log_nnc_fast/8192 7519 ns 7514 ns 95319 log/s=1090.23M/s
log_nnc_fast/32768 31344 ns 31338 ns 22397 log/s=1045.62M/s
log_nnc_vml/64 88 ns 88 ns 7923812 log/s=728.469M/s
log_nnc_vml/512 454 ns 454 ns 1521437 log/s=1.12739G/s
log_nnc_vml/8192 6763 ns 6763 ns 103264 log/s=1.21136G/s
log_nnc_vml/32768 26565 ns 26564 ns 23609 log/s=1.23354G/s
log_aten/64 418 ns 418 ns 1651401 log/s=153.117M/s
log_aten/512 801 ns 801 ns 875857 log/s=638.923M/s
log_aten/8192 6877 ns 6872 ns 100840 log/s=1.19208G/s
log_aten/32768 26989 ns 26988 ns 26268 log/s=1.21416G/s
```

Reviewed By: bwasti, zheng-xq
Differential Revision: D26246400
fbshipit-source-id: dae47ee6baeab1a813ec4d4440748164051aed3d

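The idea described above, reduce the argument and then evaluate a division-free power series for the log of the mantissa, can be sketched in a few lines. This is only a low-order NumPy illustration of the shape of the approach; the actual VML/NNC implementation uses a carefully fitted polynomial and tighter range reduction.

```python
# Illustrative division-free log: x = m * 2^e, log(x) = e*ln(2) + log1p(m - 1),
# with log1p approximated by a truncated power series (FMA-friendly, no division).
import math
import numpy as np

def log_series(x, terms=12):
    m, e = np.frexp(x)              # x = m * 2**e with m in [0.5, 1)
    t = m - 1.0                     # |t| <= 0.5, so the series converges
    acc = np.zeros_like(t)
    for k in range(terms, 0, -1):   # Horner evaluation of sum((-1)^(k+1) * t^k / k)
        acc = acc * t + ((-1.0) ** (k + 1)) / k
    return e * math.log(2.0) + acc * t

x = np.random.rand(8).astype(np.float64) * 100 + 1e-3
print(np.max(np.abs(log_series(x) - np.log(x))))   # small for moderate 'terms'
```
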
a23e82df10 | [nnc] Tweak log_nnc_sleef so vectorization kicks in (#51491)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51491

The vectorizer heuristic is pretty dumb and only kicks in if the unroll factor is exactly 8 or 4. It's still slower than direct implementation, which isn't surprising.

ghstack-source-id: 120783426

Test Plan: `buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench`

Before:
```
---------------------------------------------------------------------------
Benchmark                 Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64 438 ns 438 ns 1795511 log/s=146.259M/s
log_nnc_sleef/512 3196 ns 3195 ns 210032 log/s=160.235M/s
log_nnc_sleef/8192 77467 ns 77466 ns 8859 log/s=105.749M/s
log_nnc_sleef/32768 310206 ns 310202 ns 2170 log/s=105.634M/s
log_nnc_fast/64 100 ns 100 ns 7281074 log/s=637.144M/s
log_nnc_fast/512 546 ns 546 ns 1335816 log/s=938.361M/s
log_nnc_fast/8192 7360 ns 7359 ns 91971 log/s=1.11316G/s
log_nnc_fast/32768 30793 ns 30792 ns 22633 log/s=1064.17M/s
log_aten/64 427 ns 427 ns 1634897 log/s=150.021M/s
log_aten/512 796 ns 796 ns 877318 log/s=643.566M/s
log_aten/8192 6690 ns 6690 ns 102649 log/s=1.22452G/s
log_aten/32768 25357 ns 25350 ns 27808 log/s=1.29263G/s
```

After:
```
---------------------------------------------------------------------------
Benchmark                 Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64 189 ns 188 ns 3872475 log/s=340.585M/s
log_nnc_sleef/512 1307 ns 1307 ns 557770 log/s=391.709M/s
log_nnc_sleef/8192 20259 ns 20257 ns 34240 log/s=404.404M/s
log_nnc_sleef/32768 81556 ns 81470 ns 8767 log/s=402.209M/s
log_nnc_fast/64 110 ns 110 ns 6564558 log/s=581.116M/s
log_nnc_fast/512 554 ns 554 ns 1279304 log/s=923.376M/s
log_nnc_fast/8192 7774 ns 7774 ns 91421 log/s=1053.75M/s
log_nnc_fast/32768 31008 ns 31006 ns 21279 log/s=1056.83M/s
```

Reviewed By: bwasti
Differential Revision: D26139067
fbshipit-source-id: db31897ee9922695ff9dff4ff46e3d3fbd61f4c2

e975169426 | [TensorExpr] Redesign Tensor class. (#50995)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50995

This change makes 'Tensor' a thin wrapper over 'Buf' and 'Stmt', and merges it with recently introduced 'CompoundTensor'. A statement for the tensor is either passed directly to the Tensor constructor (akin to 'CompoundTensor'), or is built immediately in constructor. LoopNest is no longer responsible for constructing statements from tensors - it simply stitches already constructed statements contained in Tensors.

This has a side effect that now we cannot construct several loopnests from the same tensors - we need to explicitly clone statements if we want to do that. A special copy constructor was added to LoopNest to make it more convenient (note: this only affects tests, we don't usually create multiple loopnests in other places).

Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D26038223
Pulled By: ZolotukhinM
fbshipit-source-id: 27a2e5900437cfb0c151e8f89815edec53608e17

97ea95ddd7 | Delete tabs from becnh_approx.cpp (#51157)

Summary: Introduced by D25981260

c4029444d1 | [nnc] Per-operator benchmarks (#51093)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51093

Operator level benchmarks comparing eager-mode PyTorch to NNC-generated fused kernels. We wouldn't normally see these in isolation, but it points out where NNC is falling short (or doing well). I threw in a composed hardswish for fun, because it's my favorite activation function.

Notably, it exposes a bug in our build process that's preventing vectorization from using `sleef`, so we're using scalar calls to libm with predictably lousy performance. Fix incoming.

This benchmark is similar to the pure NNC approach in `microbenchmarks.py`, but will include the overhead of dispatching the fused kernel through TorchScript.

ghstack-source-id: 120403675

Test Plan:
```
op          eager   nnc     speedup
hardswish   0.187   0.051   3.70
hardswish   0.052   0.052   1.00
sigmoid     0.148   1.177   0.13
reciprocal  0.049   0.050   0.98
neg         0.038   0.037   1.02
relu        0.037   0.036   1.03
isnan       0.119   0.020   5.86
log         0.082   1.330   0.06
log10       0.148   1.848   0.08
log1p       0.204   1.413   0.14
log2        0.285   1.167   0.24
exp         0.063   1.123   0.06
expm1       0.402   1.417   0.28
erf         0.167   0.852   0.20
erfc        0.181   1.098   0.16
cos         0.124   0.793   0.16
sin         0.126   0.838   0.15
tan         0.285   1.777   0.16
acos        0.144   1.358   0.11
asin        0.126   1.193   0.11
cosh        0.384   1.761   0.22
sinh        0.390   2.279   0.17
atan        0.240   1.564   0.15
tanh        0.320   2.259   0.14
sqrt        0.043   0.069   0.63
rsqrt       0.118   0.117   1.01
abs         0.038   0.037   1.03
ceil        0.038   0.038   1.01
floor       0.039   0.039   1.00
round       0.039   0.292   0.13
trunc       0.040   0.036   1.12
lgamma      2.045   2.721   0.75
```

Reviewed By: zheng-xq
Differential Revision: D26069791
fbshipit-source-id: 236e7287ba1b3f67fdcb938949a92bbbdfa13dba

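The "composed hardswish" mentioned above is just a chain of pointwise ops, which is what makes it a good fusion candidate. A plain PyTorch sketch of that composition (an illustration, not the benchmark script in this commit):

```python
# Illustrative composed hardswish: x * clamp(x + 3, 0, 6) / 6, written as separate
# pointwise ops so a fuser has something to fuse.
import torch

def hardswish_composed(x):
    t = x + 3.0
    t = torch.clamp(t, 0.0, 6.0)   # relu6(x + 3)
    return x * t / 6.0

x = torch.randn(1 << 15)
print(torch.allclose(hardswish_composed(x), torch.nn.functional.hardswish(x), atol=1e-6))

# Scripting the function is roughly what "dispatching the fused kernel through
# TorchScript" refers to; whether it actually fuses depends on the active fuser.
scripted = torch.jit.script(hardswish_composed)
```
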
f08464f31d | [nnc] Add benchmarks

Summary: Adding a set of benchmarks for key operators

Test Plan:
```
buck build mode/opt -c 'fbcode.caffe2_gpu_type=none' caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 numactl -C 3 ./buck-out/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench
```

Reviewed By: ZolotukhinM
Differential Revision: D25981260
fbshipit-source-id: 17681fc1527f43ccf9bcc80704415653a627b396

b96a6516a6 | Add CPP Full Reduction Benchmarks. (#50193)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50193
* Supports aten, native reference implementation, and NNC TE implementations.
* Supports functionality checks against aten, in addition to performance checks.

Test plans:
* After enabling "BUILD_TENSOREXPR_BENCHMARK" in CMakeLists.txt,
* bin/tensorexpr_bench --benchmark_filter=Reduce1D

Measurements: On a Broadwell E5-2686 CPU,
```
Reduce1D/Torch/16777216 5638547 ns 5638444 ns 119 BYTES=11.902G/s
Reduce1D/Naive/16777216 19308235 ns 19308184 ns 36 BYTES=3.47567G/s
Reduce1D/NativeRfactor/16777216 8433348 ns 8433038 ns 85 BYTES=7.95785G/s
Reduce1D/NativeVector/16777216 5608836 ns 5608727 ns 124 BYTES=11.9651G/s
Reduce1D/NativeTiled/16777216 5550233 ns 5550221 ns 126 BYTES=12.0912G/s
Reduce1D/TeNaive/16777216 21451047 ns 21450752 ns 33 BYTES=3.12851G/s
Reduce1D/TeSplitTail/16777216 23701732 ns 23701229 ns 30 BYTES=2.83145G/s
Reduce1D/TeSplitMask/16777216 23683589 ns 23682978 ns 30 BYTES=2.83363G/s
Reduce1D/TeRfactorV2/16777216 5378019 ns 5377909 ns 131 BYTES=12.4786G/s
```

Result summary:
* The single-threaded performance with NNC TeRfactorV2 matches and exceeds the Aten and avx2 naive counterparts.

Follow-up items:
* rfactor does not work well with split
* We don't have a multi-threaded implementation yet.
* Missing "parallel" scheduling primitive, which is not different from what we need for pointwise ops.

Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D25821880
Pulled By: zheng-xq
fbshipit-source-id: 8df3f40d1eed8749c8edcaacae5f0544dbf6bed3

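The rfactor variants above split the reduction into independent partial sums, which is the part that vectorizes, and then combine the partials. The sketch below shows that idea in plain NumPy; it is an illustration of the scheduling idea, not the NNC TE schedule.

```python
# Illustrative rfactor-style 1D sum: keep 'lanes' independent partial accumulators
# (the vectorizable part), then reduce the partials and the tail.
import numpy as np

def sum_rfactor(x, lanes=8):
    n = (len(x) // lanes) * lanes
    partials = x[:n].reshape(-1, lanes).sum(axis=0)  # lanes independent running sums
    return partials.sum() + x[n:].sum()              # combine partials, then the tail

x = np.random.rand(16_777_216).astype(np.float32)
print(sum_rfactor(x), x.sum())
```
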
468c99fba4 | Reapply D25856891: [te] Benchmark comparing fused overhead to unfused (#50543)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50543
Original commit changeset: 2d2f07f79986
Was part of a stack that got reverted. This is just a benchmark.
ghstack-source-id: 119825594

Test Plan: CI
Reviewed By: navahgar
Differential Revision: D25912439
fbshipit-source-id: 5d9ca45810fff8931a3cfbd03965e11050180676

4ee631cdf0 | Revert D25856891: [te] Benchmark comparing fused overhead to unfused

Test Plan: revert-hammer
Differential Revision: D25856891

36ae3feb22 | [te] Benchmark comparing fused overhead to unfused (#50305)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50305
That's it
ghstack-source-id: 119631533

Test Plan:
```
buck run //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -- --benchmark_filter=Overhead
```
```
Run on (24 X 2394.67 MHz CPU s)
2021-01-08 16:06:17
-------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-------------------------------------------------------
FusedOverhead          2157 ns         2157 ns       311314
UnfusedOverhead        2443 ns         2443 ns       311221
```

Reviewed By: ZolotukhinM
Differential Revision: D25856891
fbshipit-source-id: 0e99515ec2e769a04929157d46903759c03182a3

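A rough way to get a feel for the per-call overhead this benchmark isolates is to time a tiny pointwise chain eagerly versus through a TorchScript-compiled (and therefore fusable) function. The sketch below is an illustration under that assumption, not the C++ benchmark in this commit, and whether the scripted path actually fuses depends on the active JIT fuser and PyTorch version.

```python
# Hypothetical dispatch-overhead sketch: a tiny pointwise chain, eager vs scripted.
import time
import torch

def chain(x):
    return ((x * 2.0) + 1.0).relu()

scripted = torch.jit.script(chain)
x = torch.randn(64)

def bench(fn, iters=20000):
    for _ in range(10):
        fn(x)                      # warm up (lets the JIT profile/compile)
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters * 1e9

print(f"eager:    {bench(chain):.0f} ns/call")
print(f"scripted: {bench(scripted):.0f} ns/call")
```
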
1047957831 | [te][reapply] Add fast log approximation based on sleef (#49575)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49575

This is a fast log implementations benchmark:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat
Reviewed By: bertmaher
Differential Revision: D25627157
fbshipit-source-id: a4920f4f4005ce617d372b375e790ca966275cd9

ea4ccc730e | Revert D25445815: [te] Add fast log approximation based on sleef

Test Plan: revert-hammer
Differential Revision: D25445815

1329066b69 | [te] Add fast log approximation based on sleef

Summary: This is a fast log implementations benchmark:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat
Reviewed By: bertmaher
Differential Revision: D25445815
fbshipit-source-id: 20696eacd12a55e797f606f4a6dbbd94c9652888

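For contrast with the division-free series shown earlier, a sleef-style log reduces to the mantissa and then evaluates an atanh-like odd series in t = (m - 1) / (m + 1), which costs one division per element. This NumPy sketch is only a rough illustration of that shape, not sleef's actual coefficients or range reduction.

```python
# Rough sleef-style log sketch: one division to form t = (m-1)/(m+1), then
# 2 * (t + t^3/3 + t^5/5 + ...). Not sleef's actual polynomial.
import math
import numpy as np

def log_sleef_style(x, terms=5):
    m, e = np.frexp(x)                    # x = m * 2**e, m in [0.5, 1)
    t = (m - 1.0) / (m + 1.0)             # the one fp division
    t2 = t * t
    acc = np.zeros_like(t)
    for k in range(2 * terms - 1, 0, -2): # Horner over odd powers: 1/1, 1/3, 1/5, ...
        acc = acc * t2 + 1.0 / k
    return e * math.log(2.0) + 2.0 * t * acc

x = np.random.rand(8) * 1000 + 1e-3
print(np.max(np.abs(log_sleef_style(x) - np.log(x))))
```
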
464d23e6b4 | [te][benchmark] Add more optimized versions of gemm (#48159)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48159
Test Plan: Imported from OSS
Reviewed By: Chillee, ngimel
Differential Revision: D25059742
Pulled By: bertmaher
fbshipit-source-id: f197347f739c5bd2a4182c59ebf4642000c3dd55

b7261de0df | [pytorch][te] Add compilation time benchmark (#46124)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46124

We want to make sure we can actually fuse kernels within a fairly tight time budget. So here's a quick benchmark of codegen for a simple pointwise activation function (swish). I kept all the intermediate tensors separate to force TE to actually do inlining.

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
```
I've only run in debug mode so results aren't super meaningful, but even in that mode it's 18ms for compilation, 15 of which are in llvm.

Update, opt build mode:
```
----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
BM_CompileSwish 5123276 ns 5119846 ns 148
BM_CompileSwishLLVMOnly 4754361 ns 4753701 ns 160
```

Reviewed By: asuhan
Differential Revision: D24232801
fbshipit-source-id: d58a8b7f79bcd9244c49366af7a693e09f24bf76

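The "swish with separate intermediate tensors" idea can be mirrored in Python, and scripting it gives a crude feel for first-call (compile) versus warm-call cost. This is an illustration under those assumptions, not the C++ BM_CompileSwish benchmark, and with the profiling executor some compilation may happen on later calls instead.

```python
# Illustrative swish (x * sigmoid(x)) with separate intermediates, plus a crude
# first-call vs warm-call timing of the scripted version.
import time
import torch

def swish(x):
    s = torch.sigmoid(x)   # kept as a separate intermediate on purpose
    y = x * s
    return y

scripted = torch.jit.script(swish)
x = torch.randn(1 << 15)

t0 = time.perf_counter()
scripted(x)                # first call: profiling/compilation work happens around here
t1 = time.perf_counter()
scripted(x)                # warm call
t2 = time.perf_counter()
print(f"first call {1e3 * (t1 - t0):.2f} ms, warm call {1e3 * (t2 - t1):.2f} ms")
```
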
f2e569461b | [te] Tiled (m=32 x n=32) gemm benchmark (#45905)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45905
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D24142402
Pulled By: bertmaher
fbshipit-source-id: b39e18b6985ee1c1f654fba4498ed91ff14d8d5f

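A tiled gemm of the kind this benchmark exercises can be sketched in NumPy by blocking the m and n loops into 32x32 output tiles. This is an illustration of the blocking idea only, not the NNC/TensorExpr-generated kernel, and it assumes sizes divisible by the tile.

```python
# Illustrative 32x32-tiled gemm: block the output into tiles and compute each tile
# from a row-panel of A and a column-panel of B.
import numpy as np

def gemm_tiled(a, b, tile_m=32, tile_n=32):
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile_m):
        for j in range(0, n, tile_n):
            c[i:i + tile_m, j:j + tile_n] = a[i:i + tile_m, :] @ b[:, j:j + tile_n]
    return c

a = np.random.rand(128, 128)
b = np.random.rand(128, 128)
print(np.allclose(gemm_tiled(a, b), a @ b))
```
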
50f89578dd | [te] Add a benchmark harness (#45875)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45875

Adds a googlebenchmark harness for perf testing programs generated by tensorexpr, sans any pytorch wrappings (for python-level benchmarks of tensorexpr, see benchmarks/tensorexpr). Currently there's a harness for gemm that sets up the problem using torch (and also measures the perf of a torch::mm to give a baseline). Right now there's just an unoptimized implementation that is expected to be not very fast. More optimized versions are coming.

Sample output from my dev box:
```
Run on (48 X 2501 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 256K (x24)
  L3 Unified 30720K (x2)
--------------------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------
Gemm/Torch/128/128/128 73405 ns 73403 ns 8614 GFLOPS=57.1411G/s
Gemm/TensorExprNoopt/128/128/128 3073003 ns 3072808 ns 229 GFLOPS=1.36497G/s
```

Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D24142403
Pulled By: bertmaher
fbshipit-source-id: 3354aaa56868a43a553acd1ad9a192f28d8e3597

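The GFLOPS user counter reported by such a harness is simply roughly 2*M*N*K floating-point operations divided by the per-iteration time. A small Python sketch of the same bookkeeping around `torch.mm` (an illustration, not the googlebenchmark harness added here):

```python
# Illustrative GFLOPS bookkeeping for a 128x128x128 gemm baseline using torch.mm.
import time
import torch

M = N = K = 128
a, b = torch.randn(M, K), torch.randn(K, N)

iters = 1000
torch.mm(a, b)                      # warm up
start = time.perf_counter()
for _ in range(iters):
    torch.mm(a, b)
elapsed = (time.perf_counter() - start) / iters

gflops = 2 * M * N * K / elapsed / 1e9
print(f"Gemm/Torch/{M}/{N}/{K}: {elapsed * 1e9:.0f} ns, GFLOPS={gflops:.2f}G/s")
```
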