Commit Graph

296 Commits

Norman Ponte
2e8b9c7785 [TorchArrow][AIBench] Add AIBench Metrics for TorchArrow Inference Benchmark Test (#75035)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75035

- rename `--ai_pep_format` to `--report_aibench` to reflect the underlying framework's name change

Reviewed By: tgalkovskyi

Differential Revision: D35257017

fbshipit-source-id: 6c0a2e4585db928b029484d4b81165bfc99bff9f
(cherry picked from commit 18f4962539ccb09a3c33b146206342ea3930f275)
2022-04-01 00:35:42 +00:00
Sergii Dymchenko
5b011fc6eb Fix Undefined variable in QInterpolateBenchmark
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73130
Approved by: https://github.com/malfet
2022-03-09 00:14:15 +00:00
Sergii Dymchenko
486572223b Fix command example (#72847)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72847

Reviewed By: malfet

Differential Revision: D34260868

Pulled By: kit1980

fbshipit-source-id: 1b225f3c2c7a822e44df4bbd91766e6533eab6d7
(cherry picked from commit c9e874c4d8)
2022-02-16 21:45:45 +00:00
Ben Koopman
c2c859bdf2 [quant][embedding qat] Add benchmarks for QAT Embedding+EmbeddingBag (#66560)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66560

Test Plan: Imported from OSS

Reviewed By: HDCharles

Differential Revision: D31618282

Pulled By: b-koopman

fbshipit-source-id: ebfe723cfc4004f413f157e65532d64e8d0274b3
2021-11-19 06:29:19 -08:00
Michael Suo
5c3529a86d [lint] small pass to make lint clean (#68367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68367

- bmm_test.py was using syntax not allowed in Python 3.6
- Some suppressions were not placed on the correct line.

With this file,
```
lintrunner --paths-cmd='git grep -Il .'
```
passes successfully.

Test Plan: Imported from OSS

Reviewed By: janeyx99, mrshenli

Differential Revision: D32436644

Pulled By: suo

fbshipit-source-id: ae9300c6593d8564fb326822de157d00f4aaa3c2
2021-11-16 10:27:00 -08:00
Shashank Chaudhry
89c4e8c22b [NOOP][clangformat][codemod] Enable CLANGFORMAT for some folders in caffe2/* (#67746)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67746

Test Plan: Visual inspection. Sandcastle.

Reviewed By: zertosh

Differential Revision: D31986646

fbshipit-source-id: 91885c20c3cead3853c49abb9fe0a94a67f33cc8
2021-11-03 12:23:14 -07:00
Bin Wen
6900aacf54 [fbcode] Fix operator_benchmark with jit mode (#67382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67382

two simple updates:

* fix running the benchmark with `--use_jit`; previously it would fail with this error:

  torch.jit.frontend.UnsupportedNodeError: import statements aren't supported:
  File "/proc/self/fd/3/bmm_test.py", line 9
  def __invoke_main():
    import ctypes
    ~~~~~~ <--- HERE
    import ctypes.util
    import errno

* add matmul to the bmm benchmark, as in D31837588 (a minimal sketch of such a benchmark follows below)
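
For reference, a minimal operator_benchmark entry for `torch.bmm` might look like the sketch below; shapes, tags, and names are illustrative rather than the exact contents of bmm_test.py:

```python
import operator_benchmark as op_bench
import torch

# Illustrative configs; the real bmm_test.py may use different shapes/tags.
bmm_configs = op_bench.cross_product_configs(
    B=[4, 32], M=[5, 25], N=[3, 20], K=[2, 30],
    device=['cpu'],
    tags=['short'],
)

class BmmBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, B, M, N, K, device):
        self.inputs = {
            'batch1': torch.rand(B, M, K, device=device),
            'batch2': torch.rand(B, K, N, device=device),
        }
        self.set_module_name('bmm')

    def forward(self, batch1, batch2):
        return torch.bmm(batch1, batch2)

op_bench.generate_pt_test(bmm_configs, BmmBenchmark)

if __name__ == '__main__':
    op_bench.benchmark_runner.main()
```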

Test Plan:
buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:bmm_test -- --forward_only=True --mkl_num_threads=1 --omp_num_threads=1 --use_jit=True

Reviewed By: ShijunK

Differential Revision: D31960528

fbshipit-source-id: 84b892934149784d1b8a0f90b0233cc2f1cf1f5f
2021-10-28 08:48:10 -07:00
Vasiliy Kuznetsov
d802877dfa speed up quantized interpolate for channels last (#66525)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66525

This should solve https://github.com/pytorch/pytorch/issues/60015

There were two `q_zero_point()` accesses inside a for loop, which was
expensive. Moving them to before the loop sped a microbenchmark up 10x;
a sketch of the pattern follows below.
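
The fix is plain loop-invariant hoisting; a minimal Python sketch of the pattern (the actual change lives in the ATen C++ kernel, so these names are illustrative):

```python
def dequantize_rows_slow(qx, rows):
    # q_zero_point()/q_scale() are loop-invariant, yet queried on every iteration.
    return [(r - qx.q_zero_point()) * qx.q_scale() for r in rows]

def dequantize_rows_fast(qx, rows):
    # Hoist the accessor calls out of the loop; each call has a fixed overhead.
    zp, scale = qx.q_zero_point(), qx.q_scale()
    return [(r - zp) * scale for r in rows]
```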

Test Plan:
```
// comment out benchmarks unrelated to original issue, for simplicity
cd benchmarks/operator_benchmark
python -m pt.qinterpolate_test

// before: 2994 us
// after: 324 us
// full results: https://gist.github.com/vkuzo/cc5ef9526dc0cda170d6d63498c16453
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D31592422

fbshipit-source-id: b6078ac1039573bbe545275f7aedfd580910b459
2021-10-14 08:11:26 -07:00
Alexandr Guzhva
b8e1999253 [quant] Add op benchmark for GPU FakeQuantizePerChannel with float zero_points (#66183)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66183

Add a GPU benchmark for fakeQuant, similar to #65241
ghstack-source-id: 139810414

Test Plan: https://pxl.cl/1QjJM

Reviewed By: b-koopman

Differential Revision: D31288158

fbshipit-source-id: 65526248b5c7b70f0bc32a86b08f50b4cbc7a83d
2021-10-06 08:07:42 -07:00
Vasiliy Kuznetsov
e3af4be963 pytorch quantization ao migration phase 2: caffe2/benchmark (#65833)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65833

Renames `torch.quantization` to `torch.ao.quantization` in `caffe2/benchmarks`
folder.

```
find caffe2/benchmarks/ -type f -name "*.py" -print0 | xargs -0 sed -i "s/torch\.quantization/torch.ao.quantization/g"
```
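
For user code, the migration only changes the import path; a hedged example using `quantize_dynamic`, one of the migrated names:

```python
# Before the migration:
# from torch.quantization import quantize_dynamic
# After the migration:
from torch.ao.quantization import quantize_dynamic
```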

Test Plan: CI

Reviewed By: z-a-f

Differential Revision: D31275963

fbshipit-source-id: 8596bf28df5c3ad2c4490ac8abb285d6517c0116
2021-10-01 06:17:36 -07:00
Philip Meier
aebde1bc2b deprecate device getter from torch.testing namespace (#63844)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63844

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31141433

Pulled By: mruberry

fbshipit-source-id: a29331278ab99a19e225e2cb357458e3db4f9732
2021-09-29 02:40:52 -07:00
Ben Koopman
6a6ee92e36 [quant] Add op benchmark for CPU FakeQuantizePerChannel with float zero_points (#65241)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65241

Test Plan: Imported from OSS

Reviewed By: jingsh

Differential Revision: D31150087

Pulled By: b-koopman

fbshipit-source-id: a00d4995841eee81305d0007c908473cc3d5a727
2021-09-27 16:01:49 -07:00
Eddie Ren
9c73a48ecf ND Embeddings benchmark - Standardize randomized inputs (#64707)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64707

Use torch.randn instead of torch.from_numpy to generate the tensor
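
A hedged sketch of the change (variable names and sizes are illustrative, not the benchmark's actual code); note that `torch.randn` samples a normal distribution, whereas the NumPy round-trip below samples uniformly:

```python
import numpy as np
import torch

num_embeddings, embedding_dim = 80, 128  # illustrative sizes

# Before: generate in NumPy, then convert.
weights = torch.from_numpy(
    np.random.uniform(0, 1, size=(num_embeddings, embedding_dim)).astype(np.float32)
)

# After: generate the randomized input directly in PyTorch.
weights = torch.randn(num_embeddings, embedding_dim, dtype=torch.float32)
```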

Test Plan: buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test

Reviewed By: jingsh

Differential Revision: D30817302

fbshipit-source-id: 924c05517812b4b9f7df05a8999f9236cfe7b672
2021-09-13 06:47:35 -07:00
Eddie Ren
3fbb49e75d Extend 2Dim embedding bag benchmarking to include 3Dim benchmarks (#64647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64647

Add support for benchmarking 8-bit quantization of N-D batched embeddings (see the example below). Currently this only works for 3-dim embeddings; generalizing from 3-dim to N-dim still requires thought.
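
For context, the 2-dim byte prepack op that this extends (a hedged example; the 3-dim batched path is what this diff adds to the benchmark):

```python
import torch

# 8-bit (byte) row-wise quantization of an embedding weight matrix.
weight = torch.randn(10, 16, dtype=torch.float32)
packed = torch.ops.quantized.embedding_bag_byte_prepack(weight)
```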

Test Plan: ```buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test```

Reviewed By: jingsh

Differential Revision: D30770085

fbshipit-source-id: 26659020f3458991592065a05366bde0f060494e
2021-09-10 16:49:02 -07:00
Harut Movsisyan
956c8fa01e Microbenchmarking matrix mult (einsum, torch.mult, torch.mm) (#63654)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63654

Test Plan:
```
> buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:matrix_mult_test

# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: einsum_bmm
# Mode: Eager
# Name: einsum_bmm_B4_M5_N3_K2_cpu
# Input: B: 4, M: 5, N: 3, K: 2, device: cpu
Forward Execution Time (us) : 27.970

# Benchmarking PyTorch: einsum_bmm
# Mode: Eager
# Name: einsum_bmm_B32_M25_N20_K30_cpu
# Input: B: 32, M: 25, N: 20, K: 30, device: cpu
Forward Execution Time (us) : 41.830

# Benchmarking PyTorch: einsum_bmm
# Mode: Eager
# Name: einsum_bmm_B128_M100_N120_K110_cpu
# Input: B: 128, M: 100, N: 120, K: 110, device: cpu
Forward Execution Time (us) : 499.114

# Benchmarking PyTorch: bmm
# Mode: Eager
# Name: bmm_B4_M5_N3_K2_cpu
# Input: B: 4, M: 5, N: 3, K: 2, device: cpu
Forward Execution Time (us) : 6.268

# Benchmarking PyTorch: bmm
# Mode: Eager
# Name: bmm_B32_M25_N20_K30_cpu
# Input: B: 32, M: 25, N: 20, K: 30, device: cpu
Forward Execution Time (us) : 12.676

# Benchmarking PyTorch: bmm
# Mode: Eager
# Name: bmm_B128_M100_N120_K110_cpu
# Input: B: 128, M: 100, N: 120, K: 110, device: cpu
Forward Execution Time (us) : 438.219

# Benchmarking PyTorch: einsum_elementwise
# Mode: Eager
# Name: einsum_elementwise_B4_M5_N3_cpu
# Input: B: 4, M: 5, N: 3, device: cpu
Forward Execution Time (us) : 7.657

# Benchmarking PyTorch: einsum_elementwise
# Mode: Eager
# Name: einsum_elementwise_B32_M25_N20_cpu
# Input: B: 32, M: 25, N: 20, device: cpu
Forward Execution Time (us) : 18.523

# Benchmarking PyTorch: einsum_elementwise
# Mode: Eager
# Name: einsum_elementwise_B100_M90_N110_cpu
# Input: B: 100, M: 90, N: 110, device: cpu
Forward Execution Time (us) : 55.103

# Benchmarking PyTorch: mul
# Mode: Eager
# Name: mul_B4_M5_N3_cpu
# Input: B: 4, M: 5, N: 3, device: cpu
Forward Execution Time (us) : 2.501

# Benchmarking PyTorch: mul
# Mode: Eager
# Name: mul_B32_M25_N20_cpu
# Input: B: 32, M: 25, N: 20, device: cpu
Forward Execution Time (us) : 10.589

# Benchmarking PyTorch: mul
# Mode: Eager
# Name: mul_B100_M90_N110_cpu
# Input: B: 100, M: 90, N: 110, device: cpu
Forward Execution Time (us) : 50.102
```

Reviewed By: ajyu

Differential Revision: D30455179

fbshipit-source-id: 9f2d92b2d2b860f41a8e59be2cc086d75b587f7b
2021-08-24 16:26:26 -07:00
Supriya Rao
7a15576a65 [quant] update FakeQuant modules to use tensor qparams (#61318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61318

Remove the `float()` and `int()` calls in the forward function so that we can directly use the tensor qparams in the fake_quantize operator.

Calling `float()`/`int()` internally calls `item()`, which can trigger a GPU->CPU copy if the original tensors reside on the GPU (sketched below).
Local benchmark P427668213
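
A hedged sketch of the effect on the public op (tensor qparams were introduced by this stack; the values below are illustrative):

```python
import torch

x = torch.randn(64)
scale = torch.tensor([0.1])                       # tensor qparam, stays on device
zero_point = torch.tensor([0], dtype=torch.int32)

# Before (sketch): float(scale) / int(zero_point) went through item(), which
# syncs GPU -> CPU when the qparams live on the GPU.
# After: pass the tensors straight through; no sync in the forward pass.
y = torch.fake_quantize_per_tensor_affine(x, scale, zero_point, 0, 255)
```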

Before this change
```
                                               Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
---------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                     aten::_aminmax         2.57%       1.507ms         3.10%       1.819ms      36.371us       2.872ms         4.81%       2.872ms      57.446us            50
              aten::fake_quantize_per_tensor_affine         1.04%     610.915us         3.60%       2.114ms      42.276us     472.896us         0.79%       2.698ms      53.962us            50
    aten::fake_quantize_per_tensor_affine_cachemask         1.69%     993.626us         2.56%       1.503ms      30.058us       2.225ms         3.73%       2.225ms      44.504us            50
                                   aten::is_nonzero         3.85%       2.258ms        19.68%      11.540ms      46.161us       2.168ms         3.63%      11.084ms      44.336us           250
                                   aten::zeros_like         1.82%       1.064ms         6.65%       3.901ms      39.007us       1.531ms         2.57%       3.905ms      39.045us           100
                                           aten::eq        13.80%       8.093ms        25.90%      15.189ms      37.972us       9.580ms        16.05%      15.566ms      38.914us           400
                                         aten::item         5.67%       3.323ms        21.50%      12.607ms      36.019us       3.233ms         5.42%      12.167ms      34.762us           350
                                        aten::zeros         0.94%     549.208us         2.93%       1.717ms      34.343us     688.928us         1.15%       1.695ms      33.894us            50
                                           aten::le         2.52%       1.478ms         4.50%       2.641ms      26.411us       1.753ms         2.94%       2.845ms      28.448us           100
                                         aten::rsub         1.04%     608.715us         2.44%       1.433ms      28.667us     532.000us         0.89%       1.418ms      28.353us            50
                                          aten::max         1.54%     905.401us         4.62%       2.711ms      27.106us     847.488us         1.42%       2.697ms      26.969us           100
                                         aten::ones         0.92%     542.159us         2.16%       1.266ms      25.324us     661.856us         1.11%       1.301ms      26.017us            50
                                          aten::min         0.82%     479.167us         2.15%       1.258ms      25.160us     407.808us         0.68%       1.276ms      25.530us            50
                          aten::_local_scalar_dense        15.83%       9.284ms        15.83%       9.284ms      26.526us       8.934ms        14.97%       8.934ms      25.524us           350
                                        aten::clamp         2.35%       1.378ms         4.21%       2.467ms      24.669us       1.546ms         2.59%       2.461ms      24.612us           100
                                        aten::zero_         2.53%       1.482ms         5.65%       3.316ms      22.108us       1.326ms         2.22%       3.380ms      22.531us           150
                                      aten::maximum         3.08%       1.805ms         3.08%       1.805ms      18.052us       1.849ms         3.10%       1.849ms      18.494us           100
                                      aten::minimum         1.33%     778.854us         1.33%     778.854us      15.577us     868.672us         1.46%     868.672us      17.373us            50
                                        aten::round         1.36%     799.910us         1.36%     799.910us      15.998us     809.568us         1.36%     809.568us      16.191us            50
                                        aten::copy_         6.61%       3.878ms         6.61%       3.878ms      15.513us       4.036ms         6.76%       4.036ms      16.143us           250
                                          aten::div         2.53%       1.483ms         2.53%       1.483ms      14.833us       1.535ms         2.57%       1.535ms      15.353us           100
                                          aten::mul         2.44%       1.431ms         2.44%       1.431ms      14.314us       1.478ms         2.48%       1.478ms      14.782us           100
                                       aten::detach         1.46%     855.670us         2.41%       1.411ms      14.110us     832.448us         1.39%       1.395ms      13.949us           100
                                          aten::add         2.22%       1.301ms         2.22%       1.301ms      13.008us       1.383ms         2.32%       1.383ms      13.828us           100
                                        aten::fill_         4.18%       2.452ms         4.18%       2.452ms      12.262us       2.693ms         4.51%       2.693ms      13.463us           200
                                          aten::sub         5.06%       2.967ms         5.06%       2.967ms      14.837us       2.675ms         4.48%       2.675ms      13.374us           200
                                           aten::to         2.10%       1.230ms         3.65%       2.140ms      10.701us       1.310ms         2.20%       2.062ms      10.310us           200
                                       aten::select         1.28%     749.144us         1.49%     874.227us       8.742us     863.232us         1.45%     863.232us       8.632us           100
                                             detach         0.95%     555.326us         0.95%     555.326us       5.553us     562.496us         0.94%     562.496us       5.625us           100
                                   aten::as_strided         0.40%     232.289us         0.40%     232.289us       1.161us       0.000us         0.00%       0.000us       0.000us           200
                                        aten::empty         2.93%       1.720ms         2.93%       1.720ms       3.439us       0.000us         0.00%       0.000us       0.000us           500
                                      aten::resize_         1.04%     611.313us         1.04%     611.313us       2.038us       0.000us         0.00%       0.000us       0.000us           300
                                   aten::empty_like         0.75%     438.585us         1.77%       1.036ms       5.180us       0.000us         0.00%       0.000us       0.000us           200
                                aten::empty_strided         1.36%     799.442us         1.36%     799.442us       3.198us       0.000us         0.00%       0.000us       0.000us           250
---------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 58.645ms
Self CUDA time total: 59.674ms
```

After this change
```

test_fake_quant_profiler (scripts.supriyar.benchmark.module_bench.ProfilerBench) ... -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                  aten::fake_quantize_per_tensor_affine         0.98%     505.210us         4.38%       2.259ms      45.187us     419.424us         0.78%       3.218ms      64.367us            50
                                         aten::_aminmax         2.78%       1.434ms         3.42%       1.766ms      35.321us       2.825ms         5.27%       2.825ms      56.505us            50
aten::fake_quantize_per_tensor_affine_cachemask_tens...         2.38%       1.229ms         3.40%       1.754ms      35.083us       2.799ms         5.22%       2.799ms      55.979us            50
                                             aten::rsub         0.94%     485.040us         5.02%       2.590ms      51.793us     458.976us         0.86%       2.587ms      51.747us            50
                                       aten::is_nonzero         3.78%       1.952ms        23.64%      12.196ms      48.786us       2.055ms         3.83%      11.986ms      47.944us           250
                                             aten::item         6.92%       3.572ms        19.86%      10.244ms      40.977us       3.670ms         6.85%       9.931ms      39.724us           250
                                       aten::zeros_like         1.65%     848.874us         6.64%       3.426ms      34.260us       1.397ms         2.61%       3.572ms      35.717us           100
                                            aten::zeros         0.85%     436.691us         3.00%       1.549ms      30.984us     551.936us         1.03%       1.576ms      31.516us            50
                                               aten::eq        10.60%       5.467ms        20.26%      10.452ms      26.130us       7.018ms        13.09%      10.832ms      27.079us           400
                                               aten::le         2.58%       1.332ms         4.67%       2.407ms      24.074us       1.580ms         2.95%       2.614ms      26.144us           100
                              aten::_local_scalar_dense        12.93%       6.673ms        12.93%       6.673ms      26.691us       6.261ms        11.68%       6.261ms      25.046us           250
                                            aten::clamp         2.43%       1.253ms         4.37%       2.256ms      22.560us       1.431ms         2.67%       2.273ms      22.725us           100
                                             aten::ones         0.89%     460.133us         2.18%       1.123ms      22.467us     570.496us         1.06%       1.128ms      22.551us            50
                                              aten::min         0.74%     383.132us         2.06%       1.065ms      21.296us     377.536us         0.70%       1.091ms      21.824us            50
                                            aten::zero_         2.36%       1.219ms         5.87%       3.029ms      20.194us       1.261ms         2.35%       3.199ms      21.327us           150
                                              aten::max         1.51%     779.081us         4.06%       2.096ms      20.960us     791.680us         1.48%       2.130ms      21.295us           100
                                              aten::sub         7.97%       4.111ms         7.97%       4.111ms      20.556us       3.847ms         7.18%       3.847ms      19.234us           200
                                              aten::div         2.94%       1.516ms         2.94%       1.516ms      15.158us       1.580ms         2.95%       1.580ms      15.798us           100
                                            aten::round         1.45%     750.445us         1.45%     750.445us      15.009us     756.064us         1.41%     756.064us      15.121us            50
                                            aten::copy_         6.88%       3.548ms         6.88%       3.548ms      14.190us       3.701ms         6.90%       3.701ms      14.803us           250
                                          aten::minimum         1.32%     681.654us         1.32%     681.654us      13.633us     713.664us         1.33%     713.664us      14.273us            50
                                          aten::maximum         2.55%       1.317ms         2.55%       1.317ms      13.169us       1.338ms         2.50%       1.338ms      13.378us           100
                                              aten::mul         2.63%       1.358ms         2.63%       1.358ms      13.581us       1.328ms         2.48%       1.328ms      13.283us           100
                                           aten::detach         1.34%     688.820us         2.35%       1.211ms      12.110us     772.800us         1.44%       1.278ms      12.779us           100
                                            aten::fill_         4.53%       2.338ms         4.53%       2.338ms      11.692us       2.495ms         4.65%       2.495ms      12.473us           200
                                              aten::add         2.32%       1.197ms         2.32%       1.197ms      11.968us       1.240ms         2.31%       1.240ms      12.405us           100
                                               aten::to         2.07%       1.069ms         3.66%       1.889ms       9.443us       1.224ms         2.28%       1.975ms       9.874us           200
                                           aten::select         1.44%     743.042us         1.64%     848.207us       8.482us     641.600us         1.20%     641.600us       6.416us           100
                                                 detach         1.01%     522.155us         1.01%     522.155us       5.222us     505.088us         0.94%     505.088us       5.051us           100
                                       aten::as_strided         0.44%     227.884us         0.44%     227.884us       1.139us       0.000us         0.00%       0.000us       0.000us           200
                                            aten::empty         3.20%       1.652ms         3.20%       1.652ms       3.304us       0.000us         0.00%       0.000us       0.000us           500
                                          aten::resize_         1.25%     646.711us         1.25%     646.711us       2.156us       0.000us         0.00%       0.000us       0.000us           300
                                       aten::empty_like         0.79%     407.768us         2.07%       1.067ms       5.334us       0.000us         0.00%       0.000us       0.000us           200
                                    aten::empty_strided         1.52%     785.788us         1.52%     785.788us       3.143us       0.000us         0.00%       0.000us       0.000us           250
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 51.590ms
Self CUDA time total: 53.609ms
```
ghstack-source-id: 133370215

Test Plan: buck test mode/dev-nosan caffe2/test/:quantization

Reviewed By: raghuramank100

Differential Revision: D29566512

fbshipit-source-id: 1aefca51f99949da7334bcfe504848275c9f952c
2021-07-10 19:43:02 -07:00
Kimish Patel
3176f16691 [Pytorch benchmark] Add BMM benchmark (#59595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59595

ghstack-source-id: 130946743

Test Plan: bmm_test

Reviewed By: mingzhe09088

Differential Revision: D28873228

fbshipit-source-id: 6e4cb04bb6c63f5f68d8f23c13738e2d58ab499c
2021-06-10 08:24:29 -07:00
Kimish Patel
8b63573c31 [PyTorch Operator Benchmark] gelu benchmark (#59334)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59334

Add gelu op benchmark
ghstack-source-id: 130947172

Test Plan: gelu_test

Reviewed By: hl475

Differential Revision: D28842959

fbshipit-source-id: 93e23e027a488412488ecf22335d7d915f6cc3b4
2021-06-09 16:09:37 -07:00
Rong Rong (AI Infra)
277f587496 rename benchmark_cpp_extension (#58708)
Summary:
Currently the cpp_extension build in benchmarks is misleading, as it has the same name as torch.utils.cpp_extension

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58708

Test Plan:
Run from `./benchmarks/operator_benchmark/pt_extension` folder:
```
python setup.py install
python cpp_extension_test.py
```

Note: CI doesn't matter here, as the benchmarks/ folder is currently not compiled/tested in CI

Reviewed By: robieta

Differential Revision: D28585582

Pulled By: walterddr

fbshipit-source-id: fc071040cf3cb52ee6c9252b2c5a0c3043393f57
2021-05-24 11:04:02 -07:00
Peter Bell
0c2d38264a Improve BatchNorm1d performance (CUDA) (#57786)
Summary:
Part of gh-38915, resubmit of gh-57034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57786

Reviewed By: mruberry

Differential Revision: D28290284

Pulled By: ngimel

fbshipit-source-id: 8768578ba9ace6a948cb8145c0091e0ea49b12da
2021-05-08 19:09:29 -07:00
Sam Estep
2992ff3fb8 Revert D28142447: Improve BatchNorm1d performance (CUDA)
Test Plan: revert-hammer

Differential Revision:
D28142447 (b2936ad8fa)

Original commit changeset: c70109780e20

fbshipit-source-id: e93f6d00d644697b106f5ea8ab79872f353b51c6
2021-05-06 15:01:19 -07:00
Peter Bell
b2936ad8fa Improve BatchNorm1d performance (CUDA) (#57034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57034

Resolves gh-38915

For the example given in the issue, BatchNorm1d on cuDNN is around 12x slower
than BatchNorm2d. Internally, cuDNN expects at least a 4d tensor (N, C, H, W)
so these two modules actually call the same cuDNN code. My assumption is that
cuDNN just isn't optimized for H=W=1.

Instead, this disables cuDNN for 2d batch_norm inputs and improves the CUDA
implementation of `native_batch_norm` to be competitive with cuDNN. For the
example in the issue, `BatchNorm1d` now takes 335 us compared to 6.3 ms before,
an 18x speedup.
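
One hedged way to reproduce the comparison with `torch.utils.benchmark` (the channel count and batch size below are assumptions, not necessarily the shapes from gh-38915):

```python
import torch
import torch.nn as nn
from torch.utils.benchmark import Timer

bn = nn.BatchNorm1d(256).cuda()
x = torch.randn(65536, 256, device='cuda')  # 2d (N, C) input, i.e. H = W = 1 internally

t = Timer(stmt='bn(x)', globals={'bn': bn, 'x': x})
print(t.blocked_autorange())  # compare per-call time before/after this change
```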

Before this change, nvprof shows:
```
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.64%  630.95ms       100  6.3095ms  5.6427ms  8.8800ms  void cudnn::bn_fw_tr_1C11_kernel_NCHW<float, float, int=512, bool=0, int=2>(cudnnTensorStruct, float const *, cudnn::bn_fw_tr_1C11_kernel_NCHW<float, float, int=512, bool=0, int=2>, cudnnTensorStruct*, float const *, float const , cudnnTensorStruct*, cudnnTensorStruct*, cudnnTensorStruct**, float const *, float const *, float const *, cudnnTensorStruct*, cudnnTensorStruct*)
```

But after, it shows:
```
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   54.76%  14.352ms       100  143.52us  123.52us  756.28us  _ZN2at6native27unrolled_elementwise_kernelIZZZNS0_72_GLOBAL__N__48_tmpxft_001e82d0_00000000_7_Normalization_cpp1_ii_db66e07022batch_norm_elementwiseERKNS_6TensorES5_RKN3c108optionalIS3_EESA_S5_S5_ENKUlvE_clEvENKUlvE2_clEvEUlfffffE_NS_6detail5ArrayIPcLi6EEE16OffsetCalculatorILi5EjESI_ILi1EjENS0_6memory15LoadWithoutCastENSL_16StoreWithoutCastEEEviT_T0_T1_T2_T3_T4_
                   35.09%  9.1951ms       100  91.950us  84.415us  362.17us  void at::native::reduce_kernel<int=256, int=2, at::native::ReduceOp<float, at::native::WelfordOps<float, float, int, float, thrust::pair<float, float>>, unsigned int, float, int=2>>(float)
                    0.71%  186.14us       100  1.8610us  1.8240us  1.9840us  _ZN2at6native72_GLOBAL__N__48_tmpxft_001e82d0_00000000_7_Normalization_cpp1_ii_db66e07045unrolled_elementwise_kernel_for_multi_outputsILi3EZZZNS1_34batch_norm_update_stats_and_invertERKNS_6TensorES5_S5_S5_ddlENKUlvE_clEvENKUlvE2_clEvEUlffffE_NS_6detail5ArrayIPcLi7EEE23TrivialOffsetCalculatorILi4EjESD_ILi3EjEEEviT0_T1_T2_T3_
                     0.59%  153.37us       100  1.5330us  1.4720us  2.6240us  void at::native::vectorized_elementwise_kernel<int=4, at::native::BUnaryFunctor<at::native::AddFunctor<long>>, at::detail::Array<char*, int=2>>(int, long, at::native::AddFunctor<long>)
```

I think there is similar scope to improve the backward implementation.

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D28142447

Pulled By: ngimel

fbshipit-source-id: c70109780e206fa85e50a31e90a1cb4c533199da
2021-05-06 12:14:02 -07:00
Sam Estep
e3900d2ba5 Add lint for unqualified noqa (#56272)
Summary:
As this diff shows, currently there are a couple hundred instances of raw `noqa` in the codebase, which just ignore all errors on a given line. That isn't great, so this PR changes all existing instances of that antipattern to qualify the `noqa` with respect to a specific error code, and adds a lint to prevent more of this from happening in the future.

Interestingly, some of the examples the `noqa` lint catches are genuine attempts to qualify the `noqa` with a specific error code, such as these two:
```
test/jit/test_misc.py:27:            print(f"{hello + ' ' + test}, I'm a {test}") # noqa E999
test/jit/test_misc.py:28:            print(f"format blank") # noqa F541
```
However, those are still wrong because they are [missing a colon](https://flake8.pycqa.org/en/3.9.1/user/violations.html#in-line-ignoring-errors), which actually causes the error code to be completely ignored:

- If you change them to anything else, the warnings will still be suppressed.
- If you add the necessary colons then it is revealed that `E261` was also being suppressed, unintentionally:
  ```
  test/jit/test_misc.py:27:57: E261 at least two spaces before inline comment
  test/jit/test_misc.py:28:35: E261 at least two spaces before inline comment
  ```

I did try using [flake8-noqa](https://pypi.org/project/flake8-noqa/) instead of a custom `git grep` lint, but it didn't seem to work. This PR is definitely missing some of the functionality that flake8-noqa is supposed to provide, though, so if someone can figure out how to use it, we should do that instead.
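
In short, the colon is what scopes the suppression; a minimal illustration (not taken from the diff):

```python
import os  # noqa: F401  -- suppresses only F401 ("imported but unused")
import sys  # noqa F401  -- missing colon: flake8 ignores ALL codes on this line
```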

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56272

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI run (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2365189927

Reviewed By: janeyx99

Differential Revision: D27830127

Pulled By: samestep

fbshipit-source-id: d6dcf4f945ebd18cd76c46a07f3b408296864fcb
2021-04-19 13:16:18 -07:00
Sam Estep
cc11aaaa60 Disallow non-breaking spaces (#55465)
Summary:
malfet found a couple of these in https://github.com/pytorch/pytorch/issues/55346; this PR removes the rest and adds a lint that prevents them from being accidentally added again in the future. It also removes the `-o` flag added in https://github.com/pytorch/pytorch/issues/53733 (which was unnecessarily hiding context without reducing the number of lines of output), and updates the lint error messages to reflect that the individual line numbers are shown in the logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55465

Test Plan:
The "Lint / quick-checks" job in GitHub Actions should succeed on this PR. To verify that the lint does correctly find and error on non-breaking spaces, checkout ece075195d and run it locally:
```sh
(! git --no-pager grep -In $'\u00a0' -- . || (echo "The above lines have non-breaking spaces (U+00A0); please convert them to spaces (U+0020)"; false))
```
It should print over a hundred lines of output and exit with status 1.

Reviewed By: janeyx99

Differential Revision: D27622136

Pulled By: samestep

fbshipit-source-id: e7ffd5a9519093e7a0ffdf55e9291f63e21ce841
2021-04-08 15:44:44 -07:00
vfdev-5
2b07bcf9eb [operator benchmarks] Added more interpolation test cases (#54584)
Summary:
Description:
- Added uint8 nearest test case
- Added 3d vectorization test case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54584

Reviewed By: malfet

Differential Revision: D27291303

Pulled By: fmassa

fbshipit-source-id: 236ee5af351c8dc34ec3cdb7dda662c77feb8cf0
2021-03-24 11:46:27 -07:00
Haichuan Yang
25a9f45a5a fix broken quantization_test in operator_benchmark (#53153)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53153

This diff fixes quantization_test in operator_benchmark, which broke when the py_module for learnable fake_quantization was removed.
ghstack-source-id: 123103477

Test Plan: `buck run mode/opt //caffe2/benchmarks/operator_benchmark/pt:quantization_test`

Reviewed By: z-a-f

Differential Revision: D26764881

fbshipit-source-id: 8d40c6eb5e7090ca65f48982c837f7dc87d14378
2021-03-08 12:12:57 -08:00
Sam Estep
8c798e0622 Forbid trailing whitespace (#53406)
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857

These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
  - `GLOSSARY.md`
  - `aten/src/ATen/core/op_registration/README.md`
  - `scripts/README.md`
  - `torch/csrc/jit/codegen/fuser/README.md`

The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```

I looked over the auto-generated changes and didn't see anything that looked problematic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406

Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377

This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348

Reviewed By: walterddr, seemethere

Differential Revision: D26856620

Pulled By: samestep

fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
2021-03-05 17:22:55 -08:00
Nicolas Hug
5095332ab9 Minor cleanup of interpolate microbenchmark
Summary: Minor cleanup, addresses comments from https://www.internalfb.com/diff/D26780116 (1559fa6a5c)

Test Plan:
```
➜  vision buck run //caffe2/benchmarks/operator_benchmark/pt:interpolate_test -- --tag_filter short
Parsing buck files: finished in 0.6 sec
Building: finished in 6.2 sec (100%) 10951/10951 jobs, 0 updated
  Total time: 6.9 sec
/data/users/nicolashug/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/interpolate_test#link-tree/torch/utils/cpp_extension.py:3: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastTrue_modenearest
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: True, mode: nearest
Forward Execution Time (us) : 1346.156

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastTrue_modelinear
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: True, mode: linear
Forward Execution Time (us) : 1283.784

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastTrue_modebicubic
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: True, mode: bicubic
Forward Execution Time (us) : 4769.578

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastFalse_modenearest
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: False, mode: nearest
Forward Execution Time (us) : 982.910

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastFalse_modelinear
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: False, mode: linear
Forward Execution Time (us) : 1182.191

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastFalse_modebicubic
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: False, mode: bicubic
Forward Execution Time (us) : 3545.873

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,600,400)_output_size(240,240)_channels_lastTrue_modenearest
# Input: input_size: (1, 3, 600, 400), output_size: (240, 240), channels_last: True, mode: nearest
Forward Execution Time (us) : 34373.955

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,600,400)_output_size(240,240)_channels_lastTrue_modelinear
# Input: input_size: (1, 3, 600, 400), output_size: (240, 240), channels_last: True, mode: linear
Forward Execution Time (us) : 42248.109

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,600,400)_output_size(240,240)_channels_lastTrue_modebicubic
# Input: input_size: (1, 3, 600, 400), output_size: (240, 240), channels_last: True, mode: bicubic
Forward Execution Time (us) : 405944.286
...
```

Reviewed By: fmassa

Differential Revision: D26782757

fbshipit-source-id: 2039e1e6b4fea2b56bb4bcf2a017476f928e4928
2021-03-04 05:36:28 -08:00
vfdev-5
1559fa6a5c [operator benchmarks] Added more modes to interpolation tests (#53186)
Summary:
Description:
- Added more modes: bicubic and nearest to interpolation tests
- Added a test case for downsampling a small image

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53186

Reviewed By: albanD

Differential Revision: D26780116

Pulled By: fmassa

fbshipit-source-id: f4f498e6e1da1ec131e6d9d9f42dc482135ae9e2
2021-03-03 09:18:38 -08:00
vfdev-5
cb1596a193 [operator_benchmark] Added channels last 3d option to interpolate test (#53117)
Summary:
Description:

- Added channels last 3d option to interpolate test
  - split config non-4d into two : 3d and 5d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53117

Reviewed By: NicolasHug

Differential Revision: D26754243

Pulled By: fmassa

fbshipit-source-id: 49bbab3bb47de27790e39537d0fbeca0f01782c4
2021-03-02 11:54:45 -08:00
Nicolas Hug
9cf6be6b3e Fix torch.nn.functional.interpolate microbenchmark for non-4D inputs
Summary: This diff fixes the `interpolate` microbenchmark for non-4D inputs, which are not supported by the `bilinear` mode
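
For context, `mode` must match the input dimensionality (`linear` for 3D, `bilinear` for 4D, `trilinear` for 5D); a small example using the shapes from the test plan below:

```python
import torch
import torch.nn.functional as F

F.interpolate(torch.randn(1, 3, 16, 320, 320), size=(8, 256, 256), mode='trilinear')  # 5D
F.interpolate(torch.randn(4, 512, 320), size=256, mode='linear')                       # 3D
F.interpolate(torch.randn(1, 3, 60, 40), size=(24, 24), mode='bilinear')               # 4D
```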

Test Plan:
5D and 3D:

```
# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,16,320,320)_output_size(8,256,256)
# Input: input_size: (1, 3, 16, 320, 320), output_size: (8, 256, 256)
Forward Execution Time (us) : 221008.660

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(4,512,320)_output_size(256,)
# Input: input_size: (4, 512, 320), output_size: (256,)
Forward Execution Time (us) : 9727.900

```

4D
```
# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastTrue
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: True
Forward Execution Time (us) : 375.181

```

Reviewed By: fmassa

Differential Revision: D26486678

fbshipit-source-id: 5d476afba3f35da9f8b86db16e21505bdb00888b
2021-02-18 02:07:54 -08:00
Vuk Radovic
4501b52fe5 Benchmark for torch.ops.quantized.linear_prepack_fp16 operator (#52229)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52229

Create benchmarks for the
torch.ops.quantized.linear_prepack_fp16 and torch.ops.quantized.linear_unpack_fp16 operators.

The benchmarks for these operators are written in the same format as the benchmarks for the other operators; a usage sketch of the ops follows below.
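
For reference, the ops under benchmark can be exercised directly; a hedged example (requires a PyTorch build with FBGEMM, and the shape is illustrative):

```python
import torch

W = torch.randn(64, 256)  # (N, K) weight
packed = torch.ops.quantized.linear_prepack_fp16(W)
W_back, bias = torch.ops.quantized.linear_unpack_fp16(packed)
```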

Test Plan:
linear_prepack_fp16 test was successfully run with various parameters:

Sample test run output:
```
 ----------------------------------------
 PyTorch/Caffe2 Operator Micro-benchmarks
 ----------------------------------------
 Tag : long

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M8_N32_K256_cpu
 Input: M: 8, N: 32, K: 256, device: cpu
Forward Execution Time (us) : 14.002

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M8_N32_K512_cpu
 Input: M: 8, N: 32, K: 512, device: cpu
Forward Execution Time (us) : 14.114

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M8_N64_K256_cpu
 Input: M: 8, N: 64, K: 256, device: cpu
Forward Execution Time (us) : 19.355

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M8_N64_K512_cpu
 Input: M: 8, N: 64, K: 512, device: cpu
Forward Execution Time (us) : 19.056

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M128_N32_K256_cpu
 Input: M: 128, N: 32, K: 256, device: cpu
Forward Execution Time (us) : 115.963

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M128_N32_K512_cpu
 Input: M: 128, N: 32, K: 512, device: cpu
Forward Execution Time (us) : 116.259

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M128_N64_K256_cpu
 Input: M: 128, N: 64, K: 256, device: cpu
Forward Execution Time (us) : 229.336

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M128_N64_K512_cpu
 Input: M: 128, N: 64, K: 512, device: cpu
Forward Execution Time (us) : 220.016
```

linear_unpack_fp16 test was successfully run with identical parameters

Reviewed By: b-koopman

Differential Revision: D26403343

fbshipit-source-id: 11a98e56177952b94f291006975b0b719f48d1b9
2021-02-17 08:02:01 -08:00
Nicolas Hug
50e6f0fdb6 Add benchmark for torch.nn.functional.interpolate
Summary:
This diff adds a new microbenchmark for the
`torch.nn.functional.interpolate` operator, using OpBench.
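
The benchmarked call for one of the `short` configs below, as a hedged reconstruction (the mode is an assumption; later commits added explicit mode coverage):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 60, 40).to(memory_format=torch.channels_last)  # channels_last: True
y = F.interpolate(x, size=(24, 24), mode='bilinear', align_corners=False)
```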

Test Plan:
```
[nicolashug@59262.od ~/fbsource/fbcode/caffe2/benchmarks/operator_benchmark/pt (39207820)]$ buck run //caffe2/benchmarks/operator_benchmark/pt:interpolate_test -- --tag_filter short
Starting new Buck daemon...
Buck daemon started.
Parsing buck files: finished in 06:30.7 min
Creating action graph: finished in 33.9 sec
Building: finished in 02:53.4 min (100%) 24224/24224 jobs, 24224 updated
  Total time: 09:58.2 min
/data/sandcastle/boxes/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/interpolate_test#link-tree/torch/utils/cpp_extension.py:3: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastTrue
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: True
Forward Execution Time (us) : 510.818

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastFalse
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: False
Forward Execution Time (us) : 684.324

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,600,400)_output_size(240,240)_channels_lastTrue
# Input: input_size: (1, 3, 600, 400), output_size: (240, 240), channels_last: True
Forward Execution Time (us) : 33791.970

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,600,400)_output_size(240,240)_channels_lastFalse
# Input: input_size: (1, 3, 600, 400), output_size: (240, 240), channels_last: False
Forward Execution Time (us) : 50120.585

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,320,320)_output_size(256,256)_channels_lastTrue
# Input: input_size: (1, 3, 320, 320), output_size: (256, 256), channels_last: True
Forward Execution Time (us) : 37668.089

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,320,320)_output_size(256,256)_channels_lastFalse
# Input: input_size: (1, 3, 320, 320), output_size: (256, 256), channels_last: False
Forward Execution Time (us) : 56869.472
```

Reviewed By: fmassa

Differential Revision: D26225318

fbshipit-source-id: 7757296192e630c42a6e4913c5c1d93af11d286d
2021-02-10 08:28:16 -08:00
Marat Subkhankulov
721ba97eb6 Create op benchmark for stack (#51263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51263

- Add benchmark for stack op

Test Plan:
```
buck build mode/opt //caffe2/benchmarks/operator_benchmark/pt:stack_test --show-output
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/stack_test.par --tag_filter=static_runtime | grep Execution

Forward Execution Time (us) : 6.380
Forward Execution Time (us) : 6.553
Forward Execution Time (us) : 14.904
Forward Execution Time (us) : 5.657
Forward Execution Time (us) : 5.612
Forward Execution Time (us) : 6.051
Forward Execution Time (us) : 4.225
Forward Execution Time (us) : 4.240
Forward Execution Time (us) : 6.280
Forward Execution Time (us) : 6.267
Forward Execution Time (us) : 418.932
Forward Execution Time (us) : 417.694
Forward Execution Time (us) : 1592.455
Forward Execution Time (us) : 2919.261
Forward Execution Time (us) : 211.458
Forward Execution Time (us) : 211.518
Forward Execution Time (us) : 783.953
Forward Execution Time (us) : 1457.823
Forward Execution Time (us) : 2032.816
Forward Execution Time (us) : 2090.662
Forward Execution Time (us) : 6487.098
Forward Execution Time (us) : 11874.702
Forward Execution Time (us) : 2123.830
Forward Execution Time (us) : 2195.453
Forward Execution Time (us) : 6435.978
Forward Execution Time (us) : 11852.205
Forward Execution Time (us) : 2036.526
Forward Execution Time (us) : 2055.618
Forward Execution Time (us) : 6417.192
Forward Execution Time (us) : 12468.744
Forward Execution Time (us) : 4959.704
Forward Execution Time (us) : 5121.823
Forward Execution Time (us) : 5082.105
Forward Execution Time (us) : 5395.936
Forward Execution Time (us) : 5162.756
Forward Execution Time (us) : 23798.080
Forward Execution Time (us) : 4957.921
Forward Execution Time (us) : 4971.234
Forward Execution Time (us) : 5005.909
Forward Execution Time (us) : 5159.614
Forward Execution Time (us) : 5013.221
Forward Execution Time (us) : 20238.741
Forward Execution Time (us) : 7632.439
Forward Execution Time (us) : 7589.376
Forward Execution Time (us) : 7859.937
Forward Execution Time (us) : 8214.213
Forward Execution Time (us) : 11606.562
Forward Execution Time (us) : 34612.919
```

Reviewed By: hlu1

Differential Revision: D25859143

fbshipit-source-id: a1b735ce87f57b5eb67e223e549248a2cd7663c1
2021-01-30 10:32:14 -08:00
Vasiliy Kuznetsov
983b8e6b62 fake_quant: add a more memory efficient version (#50561)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50561

Not for review yet, a bunch of TODOs need finalizing.

tl;dr: add an alternative implementation of `fake_quantize` which saves
a mask during the forward pass and uses it to calculate the backward (see the
sketch after the list below).

There are two benefits:

1. the backward function no longer needs the input Tensor, and it can be
gc'ed earlier by autograd.  On MobileNetV2, this reduces QAT overhead
by ~15% (TODO: link, and absolute numbers).  We add an additional mask Tensor
to pass around, but its size is 4x smaller than the input tensor. A
future optimization would be to pack the mask bitwise and unpack in the
backward.

2. the computation of `qval` can be done only once in the forward and
reused in the backward. No perf change observed, TODO verify with better
metrics.
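
A hedged sketch of the idea as a Python `autograd.Function`; the real implementation is an ATen kernel (`fake_quantize_per_tensor_affine_cachemask`, visible in the profiler tables elsewhere in this log), so the straight-through backward below is a reconstruction:

```python
import torch

class FakeQuantMemEfficient(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale, zero_point, quant_min, quant_max):
        q = torch.round(x / scale) + zero_point
        mask = (q >= quant_min) & (q <= quant_max)   # bool mask, 4x smaller than float x
        y = (torch.clamp(q, quant_min, quant_max) - zero_point) * scale
        ctx.save_for_backward(mask)                  # x itself can be freed by autograd
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        # Straight-through estimator: gradient flows only where x was in range.
        return grad_out * mask, None, None, None, None

y = FakeQuantMemEfficient.apply(torch.randn(8, requires_grad=True), 0.1, 0, 0, 255)
```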

TODO: describe in more detail

Test Plan:
OSS / torchvision / MobileNetV2
```
python references/classification/train_quantization.py
  --print-freq 1
  --data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/
  --output-dir ~/nfs/pytorch_vision_tests/
  --backend qnnpack
  --epochs 5
TODO paste results here
```

TODO more

Imported from OSS

Reviewed By: ngimel

Differential Revision: D25918519

fbshipit-source-id: ec544ca063f984de0f765bf833f205c99d6c18b6
2021-01-27 19:36:04 -08:00
Marat Subkhankulov
dea9af5c06 Cat benchmark: use mobile feed tensor shapes and torch.cat out-variant (#50778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50778

- use tensor shapes from ctr_mobilefeed merge net
- use the pt cat out-variant for a fairer comparison; otherwise the benchmark includes the time to construct the result tensor (illustrated below)
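
A hedged illustration of the out-variant point (shapes taken from the configs in the test plan below):

```python
import torch

a, b = torch.randn(20, 160), torch.randn(20, 14)
out = torch.empty(20, 174)
# With out=, the result buffer is preallocated, so the measured time excludes
# allocating the result tensor -- matching what the C2 Concat benchmark measures.
torch.cat([a, b], dim=1, out=out)
```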

Test Plan:
turbo off, devbig machine
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/c2/concat_test.par --tag_filter=static_runtime
```
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : static_runtime

# Benchmarking Caffe2: concat
# Name: concat_sizes(1,40)_N5_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: (1, 40), N: 5, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 0.619

# Benchmarking Caffe2: concat
# Name: concat_sizes[(1,160),(1,14)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(1, 160), (1, 14)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 0.369

# Benchmarking Caffe2: concat
# Name: concat_sizes[(1,20,40),(1,4,40),(1,5,40)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(1, 20, 40), (1, 4, 40), (1, 5, 40)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 0.590

# Benchmarking Caffe2: concat
# Name: concat_sizes[(1,580),(1,174)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(1, 580), (1, 174)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 0.412

# Benchmarking Caffe2: concat
# Name: concat_sizes(20,40)_N5_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: (20, 40), N: 5, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 2.464

# Benchmarking Caffe2: concat
# Name: concat_sizes[(20,160),(20,14)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(20, 160), (20, 14)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 1.652

# Benchmarking Caffe2: concat
# Name: concat_sizes[(20,20,40),(20,4,40),(20,5,40)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(20, 20, 40), (20, 4, 40), (20, 5, 40)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 9.312

# Benchmarking Caffe2: concat
# Name: concat_sizes[(20,580),(20,174)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(20, 580), (20, 174)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 6.532
```
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/cat_test.par --tag_filter=static_runtime
```
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : static_runtime

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,160),(1,14)]_N-1_dim1_cpu
# Input: sizes: [(1, 160), (1, 14)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 3.313

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,20,40),(1,4,40),(1,5,40)]_N-1_dim1_cpu
# Input: sizes: [(1, 20, 40), (1, 4, 40), (1, 5, 40)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 3.680

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,580),(1,174)]_N-1_dim1_cpu
# Input: sizes: [(1, 580), (1, 174)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 3.452

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,160),(20,14)]_N-1_dim1_cpu
# Input: sizes: [(20, 160), (20, 14)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 4.653

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,20,40),(20,4,40),(20,5,40)]_N-1_dim1_cpu
# Input: sizes: [(20, 20, 40), (20, 4, 40), (20, 5, 40)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 7.364

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,580),(20,174)]_N-1_dim1_cpu
# Input: sizes: [(20, 580), (20, 174)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 7.055
```

Reviewed By: hlu1

Differential Revision: D25839036

fbshipit-source-id: 7a6a234f41dfcc56246a80141fe0c84f769a5a85
2021-01-19 22:50:28 -08:00
Marat Subkhankulov
49896c48e0 Caffe2 Concat operator benchmark (#50449)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50449

Port the torch.cat operator benchmark to the caffe2 Concat op to measure the difference in performance.

The previous diff, D25738076, was abandoned to rerun the GitHub CI tests.

Test Plan:
Tested on devbig by running both pt and c2 benchmarks. Compiled with mode/opt

Inputs:
```
size, number of inputs, cat dimension, device
----------------------------------------------------
(1, 1, 1), N: 2, dim: 0, device: cpu
(512, 512, 2), N: 2, dim: 1, device: cpu
(128, 1024, 2), N: 2, dim: 1, device: cpu
(1024, 1024, 2), N: 2, dim: 0, device: cpu
(1025, 1023, 2), N: 2, dim: 1, device: cpu
(1024, 1024, 2), N: 2, dim: 2, device: cpu
[<function <lambda> at 0x7f922718e8c0>, 111, 65], N: 5, dim: 0, device: cpu
[96, <function <lambda> at 0x7f9226dad710>, 64], N: 5, dim: 1, device: cpu
[128, 64, <function <lambda> at 0x7f91a3625ef0>], N: 5, dim: 2, device: cpu
[<function <lambda> at 0x7f91a3625f80>, 32, 64], N: 50, dim: 0, device: cpu
[32, <function <lambda> at 0x7f91a3621050>, 64], N: 50, dim: 1, device: cpu
[33, 65, <function <lambda> at 0x7f91a36210e0>], N: 50, dim: 2, device: cpu
(64, 32, 4, 16, 32), N: 2, dim: 2, device: cpu
(16, 32, 4, 16, 32), N: 8, dim: 2, device: cpu
(9, 31, 5, 15, 33), N: 17, dim: 4, device: cpu
[<function <lambda> at 0x7f91a3621170>], N: 100, dim: 0, device: cpu
[<function <lambda> at 0x7f91a3621200>], N: 1000, dim: 0, device: cpu
[<function <lambda> at 0x7f91a3621290>], N: 2000, dim: 0, device: cpu
[<function <lambda> at 0x7f91a3621320>], N: 3000, dim: 0, device: cpu
```

```
pytorch: MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/cat_test.par --tag_filter=all
caffe2: MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/c2/concat_test.par --tag_filter=all
```
```
Metric: Forward Execution Time (us)

pytorch             | caffe2
--------------------------------
 4.066              | 0.312
 351.507            | 584.033
 184.649            | 292.157
 9482.895           | 6845.112
 9558.988           | 6847.511
 13730.016          | 14118.505
 6324.371           | 4840.883
 4613.497           | 3702.213
 7504.718           | 7889.751
 9882.978           | 7364.350
 10087.076          | 7483.178
 16849.556          | 18092.295
 19181.075          | 13363.742
 19296.508          | 13466.863
 34157.449          | 56320.073
 176.483            | 267.106
 322.247            | 352.782
 480.064            | 460.214
 607.381            | 476.908
```

Reviewed By: hlu1

Differential Revision: D25890595

fbshipit-source-id: f53e125c0680bc2ebf722d1da5ec964bec585fdd
2021-01-12 18:27:44 -08:00
Shijun Kong
2de345d44d Add op bench for caffe2 quantile op (#49598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49598

Add op bench for caffe2 quantile op

Test Plan: `buck run mode/opt caffe2/benchmarks/operator_benchmark/c2:quantile_op_test -- --warmup_iterations=10000 --iterations=10000`

Reviewed By: radkris-git

Differential Revision: D25590085

fbshipit-source-id: 0db58ac87c595b2bf2958f6299a1bf2ccea019db
2020-12-18 08:32:59 -08:00
Ansha Yu
cb3169d7a8 [aten] index_select dim 1 (#47077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47077

Add benchmarks for PT index_select, batch_index_select, and C2's BatchGather.
Add a batch_index_select implementation based on the C2 BatchGather implementation.

This currently falls back to index_select for the backward and CUDA implementations.

Alternatively, we could look into why index_select is slower and replace the
original implementation instead.
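
As a quick illustration of the operation being benchmarked (sizes here are arbitrary), `index_select` along dim 1 gathers columns per row, which is the batched-gather pattern that C2's BatchGather serves:

```python
import torch

x = torch.randn(256, 512)              # M x N table
idx = torch.randint(0, 512, (128,))    # column indices to gather
out = x.index_select(1, idx)           # gathers 128 columns from every row
assert out.shape == (256, 128)
```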

Test Plan:
./buck-out/opt/gen/caffe2/benchmarks/operator_benchmark/c2/batch_gather_test.par
./buck-out/opt/gen/caffe2/benchmarks/operator_benchmark/pt/index_select_test.par

PT results comparing three variants: no optimization, the block_size-1-only fix, and the full dim=1 implementation
```
# no optimization
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K1_dim1_cpu
# Input: M: 256, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 353.450

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K1_dim1_cpu
# Input: M: 512, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 862.492

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K2_dim1_cpu
# Input: M: 256, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 4555.344

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K2_dim1_cpu
# Input: M: 512, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 11003.279
```
```
# block size 1 only
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K1_dim1_cpu
# Input: M: 256, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 129.240

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K1_dim1_cpu
# Input: M: 512, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 266.776

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K2_dim1_cpu
# Input: M: 256, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 4508.593

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K2_dim1_cpu
# Input: M: 512, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 10391.655
```
```
# dim 1
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M8_N8_K1_dim1_cpu
# Input: M: 8, N: 8, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 3.736

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K1_dim1_cpu
# Input: M: 256, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 130.460

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K1_dim1_cpu
# Input: M: 512, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 267.706

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M8_N8_K2_dim1_cpu
# Input: M: 8, N: 8, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 4.187

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K2_dim1_cpu
# Input: M: 256, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 1739.550

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K2_dim1_cpu
# Input: M: 512, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 3468.332
```
C2 results:

```
# Benchmarking Caffe2: batch_gather
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1203 13:19:35.310904 782584 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: batch_gather_M8_N8_K1_devicecpu
# Input: M: 8, N: 8, K: 1, device: cpu
Forward Execution Time (us) : 0.308

# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M256_N512_K1_devicecpu
# Input: M: 256, N: 512, K: 1, device: cpu
Forward Execution Time (us) : 90.517

# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M512_N512_K1_devicecpu
# Input: M: 512, N: 512, K: 1, device: cpu
Forward Execution Time (us) : 200.009

# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M8_N8_K2_devicecpu
# Input: M: 8, N: 8, K: 2, device: cpu
Forward Execution Time (us) : 0.539

# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M256_N512_K2_devicecpu
# Input: M: 256, N: 512, K: 2, device: cpu
Forward Execution Time (us) : 1001.540

# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M512_N512_K2_devicecpu
# Input: M: 512, N: 512, K: 2, device: cpu
Forward Execution Time (us) : 2005.870
```

buck test dper3/dper3/modules/low_level_modules/tests:single_operators_test -- test_batch_gather

Reviewed By: hlu1

Differential Revision: D24630227

fbshipit-source-id: cd205a30d96a33d239f3266820ada9a90093cf91
2020-12-14 15:39:33 -08:00
Brian Hirsh
c7cc8a48c0 migrating some straggler pytorch ops in fbcode to the new registration API (#48954)
Summary:
I already migrated the majority of fbcode ops to the new registration API, but there are a few stragglers (mostly new files that were created in the last two weeks).

The goal is mostly to stamp out as much of the legacy registration API usage as possible, so that people only see the new API when they look around the code for examples of how to register their own ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48954

ghstack-source-id: 118140663

Test Plan: Ran buck targets for each file that I migrated

Reviewed By: ezyang

Differential Revision: D25380422

fbshipit-source-id: 268139a1d7b9ef14c07befdf9e5a31f15b96a48c
2020-12-09 14:42:29 -08:00
Yang Wang
0125e14c9a [OpBench] change relu entry point after D24747035
Summary: D24747035 (1478e5ec2a) removes the `nnq.functional.relu` entry point. Adjust the op benchmark to use `torch.nn.ReLU` accordingly.
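
A minimal sketch of the replacement (assuming a per-tensor quantized input, as in the qactivation benchmarks):

```python
import torch

relu = torch.nn.ReLU()
x = torch.randn(4, 4)
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)
qy = relu(qx)  # nn.ReLU also accepts quantized tensors, so one module covers both
```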

Test Plan: buck run caffe2/benchmarks/operator_benchmark/pt:qactivation_test -- --use_jit  --iterations 1 --warmup_iterations 1

Reviewed By: mingzhe09088

Differential Revision: D24961625

fbshipit-source-id: 5ed0ec7fa6d8cfefc8e7fc8324cf9a2a3e59de90
2020-11-13 15:38:27 -08:00
Yang Wang
9ee4f499f0 [OpBench] add _consume_op.list for processing input with type of List[Tensor] (#47890)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47890

As titled. Fixes failures when running `chunk_test`, `split_test`, `qobserver`, and `sort` in `qunary` in JIT mode: the output of `chunk_op` is a list of tensors, which the current `_consume_op` cannot handle.
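
Conceptually, the consume helper needs a TorchScript-visible overload typed for lists; a rough sketch under that assumption (the actual helper lives inside the benchmark framework):

```python
from typing import List

import torch

@torch.jit.script
def _consume(v: torch.Tensor) -> torch.Tensor:
    # keeps the op's output alive so the JIT cannot dead-code-eliminate it
    return v

@torch.jit.script
def _consume_op_list(v: List[torch.Tensor]) -> List[torch.Tensor]:
    # list-typed variant for ops like chunk/split that return List[Tensor]
    return v
```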

Test Plan:
OSS:
python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1 --use_jit

Reviewed By: mingzhe09088

Differential Revision: D24774105

fbshipit-source-id: 210a0345b8526ebf3c24f4d0794e20b2ff6cef3d
2020-11-12 23:29:40 -08:00
Yang Wang
8ff0b6fef8 [OpBenchMobile] Enable operator_benchmark to run the benchmark on mobile through AiBench (#47767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47767

This diff implements the functionality of running benchmarks on mobile on top of the operator_benchmark framework. It does so in a few steps:

1. create a scripted module from the existing benchmark case
2. run the mobile-specific optimization pass on the scripted module
3. run the scripted module on AiBench by calling its Python API

A small change in how a benchmark case is written is introduced so that local and mobile runs can share the same interface: inputs become arguments of the `forward` function, so that the mobile optimization pass can run successfully (otherwise everything would be optimized away by constant propagation).
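
Steps 1 and 2 correspond roughly to the standard scripting and mobile-optimization flow; a sketch (the AiBench submission in step 3 is internal and omitted):

```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

class AddBenchmark(torch.nn.Module):
    # inputs are forward() arguments, per the interface change described above
    def forward(self, input_one: torch.Tensor, input_two: torch.Tensor):
        return input_one + input_two

scripted = torch.jit.script(AddBenchmark())    # step 1: script the benchmark
mobile_module = optimize_for_mobile(scripted)  # step 2: mobile optimization pass
```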

Test Plan:
## local op_bench run

buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test --  --iterations 1 --warmup_iterations 1

buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test --  --iterations 1 --warmup_iterations 1 --use_jit

Exceptions: the `py_module` op in `FakeQuantizePerTensorBaseOpBenchmark` and `FakeQuantizePerChannelBaseOpBenchmark` under JIT mode. These tests also failed in the base version:

```
RuntimeError:
Module 'FakeQuantizePerChannelOpBenchmark' has no attribute 'op_func' (This function exists as an attribute on the Python module, but we failed to compile it to a TorchScript function.
The error stack is reproduced here:

Python builtin <built-in method apply of FunctionMeta object at 0x619000c652a0> is currently not supported in Torchscript:
  File "/data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/quantization_test#link-tree/quantization_test.py", line 260
    quant_min: int, quant_max: int
):
    return _LearnableFakeQuantizePerChannelOp.apply(input, scale, zero_point, axis, quant_min, quant_max, 1.0)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
:
  File "/data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/quantization_test#link-tree/quantization_test.py", line 313
        axis: int, quant_min: int, quant_max: int
    ):
        return self.op_func(input, scale, zero_point, axis, quant_min, quant_max)
               ~~~~~~~~~~~~ <--- HERE
```

A `_consume_op` typing mismatch affects chunk, split, qobserver, and sort in qunary; these will be fixed in D24774105.

## OSS test

python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1 --use_jit
python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1

## saved module graph
```
module __torch__.mobile_benchmark_utils.OpBenchmarkMobile {
  parameters {
  }
  attributes {
    training = True
    num_iters = 1
    benchmark = <__torch__.pt.add_test.___torch_mangle_4.AddBenchmark object at 0x6070001b8b50>
  }
  methods {
    method forward {
      graph(%self : __torch__.mobile_benchmark_utils.OpBenchmarkMobile):
        %12 : None = prim::Constant() # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:9:4
        %4 : bool = prim::Constant[value=1]() # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:10:8
        %1 : int = prim::GetAttr[name="num_iters"](%self)
         = prim::Loop(%1, %4) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:10:8
          block0(%i : int):
            %6 : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark = prim::GetAttr[name="benchmark"](%self)
            %7 : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark = prim::GetAttr[name="benchmark"](%self)
            %self.inputs_tuple : (Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu), Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu)) = prim::Constant[value=({0.48884}, {0.809042})]()
            %9 : Tensor, %10 : Tensor = prim::TupleUnpack(%self.inputs_tuple)
            %23 : int = prim::Constant[value=1]()
            %24 : Tensor = aten::add(%9, %10, %23) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/pt/add_test.py:39:15
            -> (%4)
        return (%12)

    }
  }
  submodules {
    module __torch__.pt.add_test.___torch_mangle_4.AddBenchmark {
      parameters {
      }
      attributes {
        mobile_optimized = True
      }
      methods {
        method forward {
          graph(%self : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark,
                %input_one.1 : Tensor,
                %input_two.1 : Tensor):
            %3 : int = prim::Constant[value=1]()
            %4 : Tensor = aten::add(%input_one.1, %input_two.1, %3) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/pt/add_test.py:39:15
            return (%4)

        }
        method get_inputs {
          graph(%self : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark):
            %self.inputs_tuple : (Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu), Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu)) = prim::Constant[value=({0.48884}, {0.809042})]()
            return (%self.inputs_tuple)

        }
      }
      submodules {
      }
    }
  }
}

```

Reviewed By: kimishpatel

Differential Revision: D24322214

fbshipit-source-id: 335317eca4f40c4083883eb41dc47caf25cbdfd1
2020-11-12 17:15:05 -08:00
Meng Wang
f692af209d add unittest for operator benchmark (#47678)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47678

Add a unittest for the operator benchmark framework.
It covers the cases below:
```
generate_c2_test
generate_c2_gradient_test
generate_pt_test
generate_pt_gradient_test
generate_pt_tests_from_op_list
```
Also fixed two issues (incorrect function signatures) found by the unittest in `benchmark_caffe2.py`.
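
For reference, a minimal sketch of the kind of generated PT case these unit tests exercise (illustrative names, assuming the public `operator_benchmark` API):

```python
import operator_benchmark as op_bench
import torch

add_config = op_bench.config_list(attr_names=['M'], attrs=[[8]], tags=['short'])

class AddBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, M):
        self.inputs = {
            'a': torch.rand(M, requires_grad=True),
            'b': torch.rand(M, requires_grad=True),
        }
        self.set_module_name('add')

    def forward(self, a, b):
        return a + b

# registers the forward case and the gradient case, matching the
# "Forward/Backward Execution Time" lines in the output below
op_bench.generate_pt_test(add_config, AddBenchmark)
op_bench.generate_pt_gradient_test(add_config, AddBenchmark)
```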

Test Plan:
arc lint
buck run caffe2/benchmarks/operator_benchmark:operator_benchmark_unittest
```
test_c2_single_op (operator_benchmark_unittest.BenchmarkTest) ... # ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1109 23:08:39.932207 639464 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: add_M8
# Input: M: 8
Forward Execution Time (us) : 36.474

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Backward Execution Time (us) : 42.281

ok
test_pt_list_of_ops (operator_benchmark_unittest.BenchmarkTest) ... # ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Forward Execution Time (us) : 36.579

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Backward Execution Time (us) : 42.734

# Benchmarking PyTorch: abs
# Mode: Eager
# Name: abs_M8
# Input: M: 8
Forward Execution Time (us) : 148.929

# Benchmarking PyTorch: abs_
# Mode: Eager
# Name: abs__M8
# Input: M: 8
Forward Execution Time (us) : 71.909

ok
test_pt_single_op (operator_benchmark_unittest.BenchmarkTest) ... # ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Forward Execution Time (us) : 36.860

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Backward Execution Time (us) : 42.293

# Benchmarking PyTorch: abs
# Mode: Eager
# Name: abs_M8
# Input: M: 8
Forward Execution Time (us) : 148.999

# Benchmarking PyTorch: abs_
# Mode: Eager
# Name: abs__M8
# Input: M: 8
Forward Execution Time (us) : 71.941

# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M8
# Input: M: 8
Forward Execution Time (us) : 179.108

# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M8
# Input: M: 8
Backward Execution Time (us) : 1205.902

ok
```
buck run caffe2/benchmarks/operator_benchmark/c2:add_test
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1109 23:20:11.551795 654290 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: add_M8_N16_K32_dtypeint
# Input: M: 8, N: 16, K: 32, dtype: int
Forward Execution Time (us) : 984.510

# Benchmarking Caffe2: add
# Name: add_M16_N16_K64_dtypefloat
# Input: M: 16, N: 16, K: 64, dtype: float
Forward Execution Time (us) : 68.526

# Benchmarking Caffe2: add
# Name: add_M64_N64_K128_dtypeint
# Input: M: 64, N: 64, K: 128, dtype: int
Forward Execution Time (us) : 101617.076
```

Reviewed By: mingzhe09088

Differential Revision: D24854414

fbshipit-source-id: 6676549909da6700b42f322c4ad6e8e2ef5b86b5
2020-11-10 15:45:36 -08:00
Radhakrishnan Venkataramani
163adb9fa7 Add HalfToFloat + FloatToHalf operators to PyTorch (#45092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45092

Adding two operators:
1. at::float_to_half -> converts an FP32 tensor to an FP16 tensor
2. at::half_to_float -> converts an FP16 tensor to an FP32 tensor

These operators internally use the kernel provided by FBGEMM; both C2 and PT use the same FBGEMM kernel underneath.
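
From the user's point of view this is the ordinary cast path; a sketch of the equivalent conversion (whether a plain cast dispatches to the new FBGEMM-backed kernels depends on the integration):

```python
import torch

x = torch.randn(512, 512)  # FP32 tensor
h = x.half()               # FP32 -> FP16, the conversion at::float_to_half targets
y = h.float()              # FP16 -> FP32, the conversion at::half_to_float targets
```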

Test Plan:
buck test //caffe2/test:torch -- .*test_half_tensor.*

Run benchmark locally using

```
buck run //caffe2/benchmarks/operator_benchmark/pt:tensor_to_test
```

AIBench results are pending; I don't expect them to finish soon, as we have a large queue with jobs pending for 2+ days.

Benchmark for a 512x512 tensor with the FBGEMM implementation:

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 1246.332

# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 1734.304
```

Benchmark for a 512x512 tensor on trunk, with no FBGEMM integration:

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 169045.724

# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 152382.494
```

Reviewed By: ngimel

Differential Revision: D23824869

fbshipit-source-id: ef044459b6c8c6e5ddded72080204c6a0ab4582c
2020-11-10 12:00:53 -08:00
Shijun Kong
220b3bd667 Add op benchmark for batch box cox as baseline (#47275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47275

```
# Benchmarking Caffe2: batch_box_cox
# Name: batch_box_cox_M64_N64_dtypedouble
# Input: M: 64, N: 64, dtype: double
Forward Execution Time (us) : 49.005
```
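
For context, the Box-Cox transform that the C2 op computes column-wise is roughly the following (a reference sketch, not the benchmarked kernel):

```python
import torch

def batch_box_cox(data: torch.Tensor, lambda1: torch.Tensor,
                  lambda2: torch.Tensor) -> torch.Tensor:
    # data: (M, N); lambda1, lambda2: (N,) per-column parameters.
    # y = ((x + l2) ** l1 - 1) / l1 when l1 != 0, else log(x + l2)
    shifted = data + lambda2  # broadcasts per column
    safe_l1 = torch.where(lambda1 == 0, torch.ones_like(lambda1), lambda1)
    power = (shifted.pow(safe_l1) - 1) / safe_l1
    return torch.where(lambda1 == 0, shifted.log(), power)
```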

Test Plan: `buck run mode/opt caffe2/benchmarks/operator_benchmark/c2:batch_box_cox_test -- --iterations=1000  --warmup 100`

Reviewed By: houseroad

Differential Revision: D24675426

fbshipit-source-id: 8bb1f3076dc6b01e7b63468136ddf3d9b6d7e5d2
2020-11-05 07:16:32 -08:00
Supriya Rao
d8c3b2b10c [quant][pyper] Add support for pruned weights in embedding_bag_byte lookup (#47329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47329

Supports pruned weights, along with a mapping for the compressed indices.
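
A rough sketch of the lookup with pruning, assuming the quantized prepack/lookup ops and a mapping whose entry i gives the compressed row for original row i (the identity mapping below is hypothetical, with no rows actually pruned):

```python
import torch

weight = torch.randn(10, 16)
q_rows = torch.ops.quantized.embedding_bag_byte_prepack(weight)  # 8-bit rows
indices = torch.tensor([0, 2, 5])
offsets = torch.tensor([0, 3])
mapping = torch.arange(10, dtype=torch.int32)  # identity: nothing pruned
out = torch.ops.quantized.embedding_bag_byte_rowwise_offsets(
    q_rows, indices, offsets,
    pruned_weights=True,
    compressed_indices_mapping=mapping,
)
```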

Test Plan:
python test/test_quantization.py TestQuantizedEmbeddingOps

Imported from OSS

Reviewed By: qizzzh

Differential Revision: D24719909

fbshipit-source-id: f998f4039e84bbe1886e492a3bff6aa5f56b6b0f
2020-11-04 22:33:33 -08:00
Sheng Qin
c9222b7471 Implement clip_ranges operator for PyTorch
Test Plan:
unit test for correctness
```
buck test caffe2/torch/fb/sparsenn:test -- test_clip_ranges
Parsing buck files: finished in 1.6 sec
Creating action graph: finished in 18.9 sec
Building: finished in 15.0 sec (100%) 9442/9442 jobs, 1 updated
  Total time: 35.6 sec
More details at https://www.internalfb.com/intern/buck/build/66fb17de-859e-4d01-89bf-5c5de2950693
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 80f5e0c2-7db2-48a4-b148-25dd34651682
Trace available for this run at /tmp/tpx-20201026-123217.050766/trace.log
Started reporting to test run: https://our.intern.facebook.com/intern/testinfra/testrun/4503599665041422
    ✓ ListingSuccess: caffe2/torch/fb/sparsenn:test - main (14.912)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_clip_ranges (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (14.098)
Summary
  Pass: 1
  ListingSuccess: 1
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/4503599665041422
```
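
For context, the clipping semantics are roughly the following (a sketch, assuming each range is an (offset, length) pair whose length is truncated to MAX_LENGTH; the real op lives in the sparsenn extension):

```python
import torch

def clip_ranges(ranges: torch.Tensor, max_length: int) -> torch.Tensor:
    # ranges: (..., 2) tensor of (offset, length) pairs
    out = ranges.clone()
    out[..., 1] = out[..., 1].clamp(max=max_length)
    return out

r = torch.tensor([[0, 6], [6, 3]], dtype=torch.int32)
print(clip_ranges(r, 4))  # lengths clipped to at most 4
```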

New benchmark perf test:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypetorch.int32_cpu
# Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 155.765

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypetorch.int32_cpu
# Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 156.248

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypetorch.int32_cpu
# Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 156.634

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypetorch.int32_cpu
# Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 155.408

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypetorch.int32_cpu
# Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 165.168
```

Compared with the old implementation, there is a gain of **around 300us**:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypetorch.int32_cpu
# Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 443.012

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypetorch.int32_cpu
# Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 446.480

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypetorch.int32_cpu
# Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 444.064

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypetorch.int32_cpu
# Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 445.511

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypetorch.int32_cpu
# Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 450.468
```

Reviewed By: MarcioPorto

Differential Revision: D24546110

fbshipit-source-id: e6c9b38e911f177f97961ede5bf375107f240363
2020-10-28 09:46:37 -07:00
Sheng Qin
c6858fd71a Set up benchmarks for ClipRanges operator for Caffe2 and PyTorch
Summary: As titled, adding benchmark tests for the ClipRanges operator in both Caffe2 and PyTorch.

Test Plan:
Benchmark test for Caffe2:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: clip_ranges
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1026 12:30:33.938997 2658759 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypeint32
# Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: int32
Forward Execution Time (us) : 5.805

# Benchmarking Caffe2: clip_ranges
# Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypeint32
# Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: int32
Forward Execution Time (us) : 5.913

# Benchmarking Caffe2: clip_ranges
# Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypeint32
# Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: int32
Forward Execution Time (us) : 5.941

# Benchmarking Caffe2: clip_ranges
# Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypeint32
# Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: int32
Forward Execution Time (us) : 5.868

# Benchmarking Caffe2: clip_ranges
# Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypeint32
# Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: int32
Forward Execution Time (us) : 6.408
```

Benchmark test for PyTorch:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypetorch.int32_cpu
# Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 443.012

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypetorch.int32_cpu
# Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 446.480

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypetorch.int32_cpu
# Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 444.064

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypetorch.int32_cpu
# Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 445.511

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypetorch.int32_cpu
# Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 450.468
```

Reviewed By: MarcioPorto

Differential Revision: D24500468

fbshipit-source-id: a582090a3982005af272cb10cdd257b2b2e787c4
2020-10-28 09:42:10 -07:00
Shijun Kong
d5cd781cd3 Update dper3 to use torch.nan_to_num and nan_to_num_ (#46873)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46873

OSS:
Add an op benchmark for torch.nan_to_num and torch.nan_to_num_.
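
For illustration, these ops replace NaN and infinities with finite values (by default 0 for NaN and the dtype's max/min for +/-inf):

```python
import torch

x = torch.tensor([float('nan'), float('inf'), -float('inf'), 1.0])
print(torch.nan_to_num(x))                                     # default fills
print(torch.nan_to_num(x, nan=0.0, posinf=1e6, neginf=-1e6))   # custom fills
x.nan_to_num_()                                                # in-place variant
```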

Test Plan:
OSS:
`buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:nan_to_num_test`

Reviewed By: qizzzh, houseroad

Differential Revision: D24521835

fbshipit-source-id: 1fd50a99e5329ffec2d470525ce6976d39424958
2020-10-27 06:41:48 -07:00