Mirror of https://github.com/zebrajr/pytorch.git, synced 2025-12-07 12:21:27 +01:00 (a732bbea23, 296 commits).

Commit history (SHA1 and message):
2e8b9c7785  [TorchArrow][AIBench] Add AIBench Metrics for TorchArrow Inference Benchmark Test (#75035)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75035 - modify `--ai_pep_format` to `--report_aibench` to better reflect underlying framework name change Reviewed By: tgalkovskyi Differential Revision: D35257017 fbshipit-source-id: 6c0a2e4585db928b029484d4b81165bfc99bff9f (cherry picked from commit 18f4962539ccb09a3c33b146206342ea3930f275)

5b011fc6eb  Fix Undefined variable in QInterpolateBenchmark
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73130 Approved by: https://github.com/malfet

486572223b  Fix command example (#72847)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72847
Reviewed By: malfet
Differential Revision: D34260868
Pulled By: kit1980
fbshipit-source-id: 1b225f3c2c7a822e44df4bbd91766e6533eab6d7
(cherry picked from commit

c2c859bdf2  [quant][embedding qat] Add benchmarks for QAT Embedding+EmbeddingBag (#66560)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66560 Test Plan: Imported from OSS Reviewed By: HDCharles Differential Revision: D31618282 Pulled By: b-koopman fbshipit-source-id: ebfe723cfc4004f413f157e65532d64e8d0274b3

5c3529a86d  [lint] small pass to make lint clean (#68367)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68367
- bmm_test.py was using syntax not allowed in 3.6
- Some suppressions were not placed on the correct line.
With this file, `lintrunner --paths-cmd='git grep -Il .'` passes successfully.
Test Plan: Imported from OSS Reviewed By: janeyx99, mrshenli Differential Revision: D32436644 Pulled By: suo fbshipit-source-id: ae9300c6593d8564fb326822de157d00f4aaa3c2

89c4e8c22b  [NOOP][clangformat][codemod] Enable CLANGFORMAT for some folders in caffe2/* (#67746)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67746 Test Plan: Visual inspection. Sandcastle. Reviewed By: zertosh Differential Revision: D31986646 fbshipit-source-id: 91885c20c3cead3853c49abb9fe0a94a67f33cc8

6900aacf54  [fbcode] Fix operator_benchmark with jit mode (#67382)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67382 two simple updates: * fix running benchmark with --use_jit. Previously will fail with error torch.jit.frontend.UnsupportedNodeError: import statements aren't supported: File "/proc/self/fd/3/bmm_test.py", line 9 def __invoke_main(): import ctypes ~~~~~~ <--- HERE import ctypes.util import errno * add matmul to bmm benchmark as D31837588 Test Plan: buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:bmm_test -- --forward_only=True --mkl_num_threads=1 --omp_num_threads=1 --use_jit=True Reviewed By: ShijunK Differential Revision: D31960528 fbshipit-source-id: 84b892934149784d1b8a0f90b0233cc2f1cf1f5f |

d802877dfa  speed up quantized interpolate for channels last (#66525)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66525 This should solve https://github.com/pytorch/pytorch/issues/60015 There were two `q_zero_point()` accesses inside a for loop which was expensive. Moving them to before the loop sped things up 10x for a microbenchmark. Test Plan: ``` // comment out benchmarks unrelated to original issue, for simplicity cd benchmarks/operator_benchmark python -m pt.qinterpolate_test // before: 2994 us // after: 324 us // full results: https://gist.github.com/vkuzo/cc5ef9526dc0cda170d6d63498c16453 ``` Imported from OSS Reviewed By: jerryzh168 Differential Revision: D31592422 fbshipit-source-id: b6078ac1039573bbe545275f7aedfd580910b459 |
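
To illustrate the optimization this commit describes (a sketch only; the real change is in the C++ quantized upsampling kernel, not Python), hoisting an accessor such as `q_zero_point()` out of the hot loop avoids paying its dispatch cost on every iteration:

```
import torch

qx = torch.quantize_per_tensor(torch.rand(1, 3, 32, 32),
                               scale=0.1, zero_point=4, dtype=torch.quint8)

# Slow pattern: the accessor runs on every loop iteration.
acc = 0
for _ in range(qx.size(2)):
    acc += qx.q_zero_point()

# Fast pattern: read the quantization parameter once, before the loop.
zp = qx.q_zero_point()
acc = sum(zp for _ in range(qx.size(2)))
```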

b8e1999253  [quant] Add op benchmark for GPU FakeQuantizePerChannel with float zero_points (#66183)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66183 Add a GPU benchmark for fakeQuant, similar to #65241 ghstack-source-id: 139810414 Test Plan: https://pxl.cl/1QjJM Reviewed By: b-koopman Differential Revision: D31288158 fbshipit-source-id: 65526248b5c7b70f0bc32a86b08f50b4cbc7a83d

e3af4be963  pytorch quantization ao migration phase 2: caffe2/benchmark (#65833)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65833 Renames `torch.quantization` to `torch.ao.quantization` in `caffe2/benchmarks` folder.
```
find caffe2/benchmarks/ -type f -name "*.py" -print0 | xargs -0 sed -i "s/torch\.quantization/torch.ao.quantization/g"
```
Test Plan: CI Reviewed By: z-a-f Differential Revision: D31275963 fbshipit-source-id: 8596bf28df5c3ad2c4490ac8abb285d6517c0116

aebde1bc2b  deprecate device getter from torch.testing namespace (#63844)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63844 Test Plan: Imported from OSS Reviewed By: H-Huang Differential Revision: D31141433 Pulled By: mruberry fbshipit-source-id: a29331278ab99a19e225e2cb357458e3db4f9732

6a6ee92e36  [quant] Add op benchmark for CPU FakeQuantizePerChannel with float zero_points (#65241)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65241 Test Plan: Imported from OSS Reviewed By: jingsh Differential Revision: D31150087 Pulled By: b-koopman fbshipit-source-id: a00d4995841eee81305d0007c908473cc3d5a727

9c73a48ecf  ND Embeddings benchmark - Standardize randomized inputs (#64707)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64707 Use torch.randn instead of torch.from_numpy to generate the tensor Test Plan: buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test Reviewed By: jingsh Differential Revision: D30817302 fbshipit-source-id: 924c05517812b4b9f7df05a8999f9236cfe7b672

3fbb49e75d  Extend 2Dim embedding bag benchmarking to include 3Dim benchmarks (#64647)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64647 Add support for benchmarking of 8 bit quantizations of N-D batched embeddings. Currently only works for 3Dim embeddings and still requires thought on ramping up from 3Dim to NDim. Test Plan: `buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test` Reviewed By: jingsh Differential Revision: D30770085 fbshipit-source-id: 26659020f3458991592065a05366bde0f060494e

956c8fa01e  Microbenchmarking matrix mult (einsum, torch.mult, torch.mm) (#63654)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63654 Test Plan: ``` > buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:matrix_mult_test # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: einsum_bmm # Mode: Eager # Name: einsum_bmm_B4_M5_N3_K2_cpu # Input: B: 4, M: 5, N: 3, K: 2, device: cpu Forward Execution Time (us) : 27.970 # Benchmarking PyTorch: einsum_bmm # Mode: Eager # Name: einsum_bmm_B32_M25_N20_K30_cpu # Input: B: 32, M: 25, N: 20, K: 30, device: cpu Forward Execution Time (us) : 41.830 # Benchmarking PyTorch: einsum_bmm # Mode: Eager # Name: einsum_bmm_B128_M100_N120_K110_cpu # Input: B: 128, M: 100, N: 120, K: 110, device: cpu Forward Execution Time (us) : 499.114 # Benchmarking PyTorch: bmm # Mode: Eager # Name: bmm_B4_M5_N3_K2_cpu # Input: B: 4, M: 5, N: 3, K: 2, device: cpu Forward Execution Time (us) : 6.268 # Benchmarking PyTorch: bmm # Mode: Eager # Name: bmm_B32_M25_N20_K30_cpu # Input: B: 32, M: 25, N: 20, K: 30, device: cpu Forward Execution Time (us) : 12.676 # Benchmarking PyTorch: bmm # Mode: Eager # Name: bmm_B128_M100_N120_K110_cpu # Input: B: 128, M: 100, N: 120, K: 110, device: cpu Forward Execution Time (us) : 438.219 # Benchmarking PyTorch: einsum_elementwise # Mode: Eager # Name: einsum_elementwise_B4_M5_N3_cpu # Input: B: 4, M: 5, N: 3, device: cpu Forward Execution Time (us) : 7.657 # Benchmarking PyTorch: einsum_elementwise # Mode: Eager # Name: einsum_elementwise_B32_M25_N20_cpu # Input: B: 32, M: 25, N: 20, device: cpu Forward Execution Time (us) : 18.523 # Benchmarking PyTorch: einsum_elementwise # Mode: Eager # Name: einsum_elementwise_B100_M90_N110_cpu # Input: B: 100, M: 90, N: 110, device: cpu Forward Execution Time (us) : 55.103 # Benchmarking PyTorch: mul # Mode: Eager # Name: mul_B4_M5_N3_cpu # Input: B: 4, M: 5, N: 3, device: cpu Forward Execution Time (us) : 2.501 # Benchmarking PyTorch: mul # Mode: Eager # Name: mul_B32_M25_N20_cpu # Input: B: 32, M: 25, N: 20, device: cpu Forward Execution Time (us) : 10.589 # Benchmarking PyTorch: mul # Mode: Eager # Name: mul_B100_M90_N110_cpu # Input: B: 100, M: 90, N: 110, device: cpu Forward Execution Time (us) : 50.102 Reviewed By: ajyu Differential Revision: D30455179 fbshipit-source-id: 9f2d92b2d2b860f41a8e59be2cc086d75b587f7b |
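
For context, the three kernels compared above compute the same results; a small sketch of the equivalences (shapes follow the short configs in the output, and are illustrative only):

```
import torch

B, M, N, K = 4, 5, 3, 2
a = torch.rand(B, M, N)
b = torch.rand(B, N, K)

# Batched matrix multiply: einsum spelling vs. the dedicated torch.bmm kernel.
out_einsum = torch.einsum("bij,bjk->bik", a, b)
out_bmm = torch.bmm(a, b)
assert torch.allclose(out_einsum, out_bmm)

# Elementwise multiply: einsum spelling vs. torch.mul.
c = torch.rand(B, M, N)
out_einsum_mul = torch.einsum("bij,bij->bij", a, c)
out_mul = torch.mul(a, c)
assert torch.allclose(out_einsum_mul, out_mul)
```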

7a15576a65  [quant] update FakeQuant modules to use tensor qparams (#61318)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61318 Remove the `float()` and `int()` calls in the forward function so that we can directly use the tensor qparams in the fake_quantize operator. Calling `float()/int()` internally calls `item()` which can trigger a gpu-> cpu copy if the original tensors reside on GPU. Local benchmark P427668213 Before this change ``` Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls --------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ aten::_aminmax 2.57% 1.507ms 3.10% 1.819ms 36.371us 2.872ms 4.81% 2.872ms 57.446us 50 aten::fake_quantize_per_tensor_affine 1.04% 610.915us 3.60% 2.114ms 42.276us 472.896us 0.79% 2.698ms 53.962us 50 aten::fake_quantize_per_tensor_affine_cachemask 1.69% 993.626us 2.56% 1.503ms 30.058us 2.225ms 3.73% 2.225ms 44.504us 50 aten::is_nonzero 3.85% 2.258ms 19.68% 11.540ms 46.161us 2.168ms 3.63% 11.084ms 44.336us 250 aten::zeros_like 1.82% 1.064ms 6.65% 3.901ms 39.007us 1.531ms 2.57% 3.905ms 39.045us 100 aten::eq 13.80% 8.093ms 25.90% 15.189ms 37.972us 9.580ms 16.05% 15.566ms 38.914us 400 aten::item 5.67% 3.323ms 21.50% 12.607ms 36.019us 3.233ms 5.42% 12.167ms 34.762us 350 aten::zeros 0.94% 549.208us 2.93% 1.717ms 34.343us 688.928us 1.15% 1.695ms 33.894us 50 aten::le 2.52% 1.478ms 4.50% 2.641ms 26.411us 1.753ms 2.94% 2.845ms 28.448us 100 aten::rsub 1.04% 608.715us 2.44% 1.433ms 28.667us 532.000us 0.89% 1.418ms 28.353us 50 aten::max 1.54% 905.401us 4.62% 2.711ms 27.106us 847.488us 1.42% 2.697ms 26.969us 100 aten::ones 0.92% 542.159us 2.16% 1.266ms 25.324us 661.856us 1.11% 1.301ms 26.017us 50 aten::min 0.82% 479.167us 2.15% 1.258ms 25.160us 407.808us 0.68% 1.276ms 25.530us 50 aten::_local_scalar_dense 15.83% 9.284ms 15.83% 9.284ms 26.526us 8.934ms 14.97% 8.934ms 25.524us 350 aten::clamp 2.35% 1.378ms 4.21% 2.467ms 24.669us 1.546ms 2.59% 2.461ms 24.612us 100 aten::zero_ 2.53% 1.482ms 5.65% 3.316ms 22.108us 1.326ms 2.22% 3.380ms 22.531us 150 aten::maximum 3.08% 1.805ms 3.08% 1.805ms 18.052us 1.849ms 3.10% 1.849ms 18.494us 100 aten::minimum 1.33% 778.854us 1.33% 778.854us 15.577us 868.672us 1.46% 868.672us 17.373us 50 aten::round 1.36% 799.910us 1.36% 799.910us 15.998us 809.568us 1.36% 809.568us 16.191us 50 aten::copy_ 6.61% 3.878ms 6.61% 3.878ms 15.513us 4.036ms 6.76% 4.036ms 16.143us 250 aten::div 2.53% 1.483ms 2.53% 1.483ms 14.833us 1.535ms 2.57% 1.535ms 15.353us 100 aten::mul 2.44% 1.431ms 2.44% 1.431ms 14.314us 1.478ms 2.48% 1.478ms 14.782us 100 aten::detach 1.46% 855.670us 2.41% 1.411ms 14.110us 832.448us 1.39% 1.395ms 13.949us 100 aten::add 2.22% 1.301ms 2.22% 1.301ms 13.008us 1.383ms 2.32% 1.383ms 13.828us 100 aten::fill_ 4.18% 2.452ms 4.18% 2.452ms 12.262us 2.693ms 4.51% 2.693ms 13.463us 200 aten::sub 5.06% 2.967ms 5.06% 2.967ms 14.837us 2.675ms 4.48% 2.675ms 13.374us 200 aten::to 2.10% 1.230ms 3.65% 2.140ms 10.701us 1.310ms 2.20% 2.062ms 10.310us 200 aten::select 1.28% 749.144us 1.49% 874.227us 8.742us 863.232us 1.45% 863.232us 8.632us 100 detach 0.95% 555.326us 0.95% 555.326us 5.553us 562.496us 0.94% 562.496us 5.625us 100 aten::as_strided 0.40% 232.289us 0.40% 232.289us 1.161us 0.000us 0.00% 0.000us 0.000us 200 aten::empty 2.93% 1.720ms 2.93% 1.720ms 3.439us 0.000us 0.00% 0.000us 0.000us 500 aten::resize_ 1.04% 611.313us 1.04% 611.313us 2.038us 0.000us 0.00% 0.000us 0.000us 300 aten::empty_like 0.75% 438.585us 
1.77% 1.036ms 5.180us 0.000us 0.00% 0.000us 0.000us 200 aten::empty_strided 1.36% 799.442us 1.36% 799.442us 3.198us 0.000us 0.00% 0.000us 0.000us 250 --------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 58.645ms Self CUDA time total: 59.674ms ``` After this change ``` test_fake_quant_profiler (scripts.supriyar.benchmark.module_bench.ProfilerBench) ... ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ aten::fake_quantize_per_tensor_affine 0.98% 505.210us 4.38% 2.259ms 45.187us 419.424us 0.78% 3.218ms 64.367us 50 aten::_aminmax 2.78% 1.434ms 3.42% 1.766ms 35.321us 2.825ms 5.27% 2.825ms 56.505us 50 aten::fake_quantize_per_tensor_affine_cachemask_tens... 2.38% 1.229ms 3.40% 1.754ms 35.083us 2.799ms 5.22% 2.799ms 55.979us 50 aten::rsub 0.94% 485.040us 5.02% 2.590ms 51.793us 458.976us 0.86% 2.587ms 51.747us 50 aten::is_nonzero 3.78% 1.952ms 23.64% 12.196ms 48.786us 2.055ms 3.83% 11.986ms 47.944us 250 aten::item 6.92% 3.572ms 19.86% 10.244ms 40.977us 3.670ms 6.85% 9.931ms 39.724us 250 aten::zeros_like 1.65% 848.874us 6.64% 3.426ms 34.260us 1.397ms 2.61% 3.572ms 35.717us 100 aten::zeros 0.85% 436.691us 3.00% 1.549ms 30.984us 551.936us 1.03% 1.576ms 31.516us 50 aten::eq 10.60% 5.467ms 20.26% 10.452ms 26.130us 7.018ms 13.09% 10.832ms 27.079us 400 aten::le 2.58% 1.332ms 4.67% 2.407ms 24.074us 1.580ms 2.95% 2.614ms 26.144us 100 aten::_local_scalar_dense 12.93% 6.673ms 12.93% 6.673ms 26.691us 6.261ms 11.68% 6.261ms 25.046us 250 aten::clamp 2.43% 1.253ms 4.37% 2.256ms 22.560us 1.431ms 2.67% 2.273ms 22.725us 100 aten::ones 0.89% 460.133us 2.18% 1.123ms 22.467us 570.496us 1.06% 1.128ms 22.551us 50 aten::min 0.74% 383.132us 2.06% 1.065ms 21.296us 377.536us 0.70% 1.091ms 21.824us 50 aten::zero_ 2.36% 1.219ms 5.87% 3.029ms 20.194us 1.261ms 2.35% 3.199ms 21.327us 150 aten::max 1.51% 779.081us 4.06% 2.096ms 20.960us 791.680us 1.48% 2.130ms 21.295us 100 aten::sub 7.97% 4.111ms 7.97% 4.111ms 20.556us 3.847ms 7.18% 3.847ms 19.234us 200 aten::div 2.94% 1.516ms 2.94% 1.516ms 15.158us 1.580ms 2.95% 1.580ms 15.798us 100 aten::round 1.45% 750.445us 1.45% 750.445us 15.009us 756.064us 1.41% 756.064us 15.121us 50 aten::copy_ 6.88% 3.548ms 6.88% 3.548ms 14.190us 3.701ms 6.90% 3.701ms 14.803us 250 aten::minimum 1.32% 681.654us 1.32% 681.654us 13.633us 713.664us 1.33% 713.664us 14.273us 50 aten::maximum 2.55% 1.317ms 2.55% 1.317ms 13.169us 1.338ms 2.50% 1.338ms 13.378us 100 aten::mul 2.63% 1.358ms 2.63% 1.358ms 13.581us 1.328ms 2.48% 1.328ms 13.283us 100 aten::detach 1.34% 688.820us 2.35% 1.211ms 12.110us 772.800us 1.44% 1.278ms 12.779us 100 aten::fill_ 4.53% 2.338ms 4.53% 2.338ms 11.692us 2.495ms 4.65% 2.495ms 12.473us 200 aten::add 2.32% 1.197ms 2.32% 1.197ms 11.968us 1.240ms 2.31% 1.240ms 12.405us 100 aten::to 2.07% 1.069ms 3.66% 1.889ms 9.443us 1.224ms 2.28% 1.975ms 9.874us 200 aten::select 1.44% 743.042us 1.64% 848.207us 8.482us 641.600us 1.20% 641.600us 6.416us 100 detach 1.01% 522.155us 1.01% 522.155us 5.222us 505.088us 0.94% 505.088us 
5.051us 100 aten::as_strided 0.44% 227.884us 0.44% 227.884us 1.139us 0.000us 0.00% 0.000us 0.000us 200 aten::empty 3.20% 1.652ms 3.20% 1.652ms 3.304us 0.000us 0.00% 0.000us 0.000us 500 aten::resize_ 1.25% 646.711us 1.25% 646.711us 2.156us 0.000us 0.00% 0.000us 0.000us 300 aten::empty_like 0.79% 407.768us 2.07% 1.067ms 5.334us 0.000us 0.00% 0.000us 0.000us 200 aten::empty_strided 1.52% 785.788us 1.52% 785.788us 3.143us 0.000us 0.00% 0.000us 0.000us 250 ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 51.590ms Self CUDA time total: 53.609ms ghstack-source-id: 133370215 Test Plan: buck test mode/dev-nosan caffe2/test/:quantization Reviewed By: raghuramank100 Differential Revision: D29566512 fbshipit-source-id: 1aefca51f99949da7334bcfe504848275c9f952c |
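
A minimal sketch of the pattern this change removes, assuming a PyTorch version in which `torch.fake_quantize_per_tensor_affine` accepts tensor qparams; the tensors here are hypothetical stand-ins for the module's scale/zero_point buffers, which may live on the GPU:

```
import torch

x = torch.rand(64, 64)
scale = torch.tensor(0.1)                        # stand-in for the module's scale buffer
zero_point = torch.tensor(0, dtype=torch.int32)  # stand-in for the zero_point buffer

# Before: item()/float()/int() force a device-to-host copy when the buffers are on GPU.
y_sync = torch.fake_quantize_per_tensor_affine(
    x, float(scale.item()), int(zero_point.item()), 0, 255)

# After: pass the tensors directly, so no synchronization is triggered.
y_async = torch.fake_quantize_per_tensor_affine(x, scale, zero_point, 0, 255)

assert torch.allclose(y_sync, y_async)
```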

3176f16691  [Pytorch benchmark] Add BMM benchmark (#59595)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59595 ghstack-source-id: 130946743 Test Plan: bmm_test Reviewed By: mingzhe09088 Differential Revision: D28873228 fbshipit-source-id: 6e4cb04bb6c63f5f68d8f23c13738e2d58ab499c

8b63573c31  [PyTorch Operator Benchmark] gelu benchmark (#59334)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59334 Add gelu op benchmark ghstack-source-id: 130947172 Test Plan: gelu_test Reviewed By: hl475 Differential Revision: D28842959 fbshipit-source-id: 93e23e027a488412488ecf22335d7d915f6cc3b4

277f587496  rename benchmark_cpp_extension (#58708)
Summary: Currently the cpp_extension build in benchmarks is misleading as it has the same name as torch.utils.cpp_extension. Pull Request resolved: https://github.com/pytorch/pytorch/pull/58708 Test Plan: Run from the `./benchmarks/operator_benchmark/pt_extension` folder:
```
python setup.py install
python cpp_extension_test.py
```
Note: CI doesn't matter as currently the benchmarks/ folder is not compiled/tested against CI. Reviewed By: robieta Differential Revision: D28585582 Pulled By: walterddr fbshipit-source-id: fc071040cf3cb52ee6c9252b2c5a0c3043393f57

0c2d38264a  Improve BatchNorm1d performance (CUDA) (#57786)
Summary: Part of gh-38915, resubmit of gh-57034 Pull Request resolved: https://github.com/pytorch/pytorch/pull/57786 Reviewed By: mruberry Differential Revision: D28290284 Pulled By: ngimel fbshipit-source-id: 8768578ba9ace6a948cb8145c0091e0ea49b12da

2992ff3fb8  Revert D28142447: Improve BatchNorm1d performance (CUDA)
Test Plan: revert-hammer
Differential Revision: D28142447 (

b2936ad8fa  Improve BatchNorm1d performance (CUDA) (#57034)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57034 Resolves gh-38915 For the example given in the issue, BatchNorm1d on cuDNN is around 12x slower than BatchNorm2d. Internally, cuDNN expects at least a 4d tensor (N, C, H, W) so these two modules actually call the same cuDNN code. My assumption is that cuDNN just isn't optimized for H=W=1. Instead, this disables cudnn for 2d batch_norm inputs and improves the CUDA implementation of `native_batch_norm` to be competative with cuDNN. For the example in the issue, `BatchNorm1d` now takes 335 us compared to 6.3 ms before, or a 18x speedup. Before this change, nvprof shows: ``` Type Time(%) Time Calls Avg Min Max Name GPU activities: 99.64% 630.95ms 100 6.3095ms 5.6427ms 8.8800ms void cudnn::bn_fw_tr_1C11_kernel_NCHW<float, float, int=512, bool=0, int=2>(cudnnTensorStruct, float const *, cudnn::bn_fw_tr_1C11_kernel_NCHW<float, float, int=512, bool=0, int=2>, cudnnTensorStruct*, float const *, float const , cudnnTensorStruct*, cudnnTensorStruct*, cudnnTensorStruct**, float const *, float const *, float const *, cudnnTensorStruct*, cudnnTensorStruct*) ``` But after, it shows: ``` Type Time(%) Time Calls Avg Min Max Name GPU activities: 54.76% 14.352ms 100 143.52us 123.52us 756.28us _ZN2at6native27unrolled_elementwise_kernelIZZZNS0_72_GLOBAL__N__48_tmpxft_001e82d0_00000000_7_Normalization_cpp1_ii_db66e07022batch_norm_elementwiseERKNS_6TensorES5_RKN3c108optionalIS3_EESA_S5_S5_ENKUlvE_clEvENKUlvE2_clEvEUlfffffE_NS_6detail5ArrayIPcLi6EEE16OffsetCalculatorILi5EjESI_ILi1EjENS0_6memory15LoadWithoutCastENSL_16StoreWithoutCastEEEviT_T0_T1_T2_T3_T4_ 35.09% 9.1951ms 100 91.950us 84.415us 362.17us void at::native::reduce_kernel<int=256, int=2, at::native::ReduceOp<float, at::native::WelfordOps<float, float, int, float, thrust::pair<float, float>>, unsigned int, float, int=2>>(float) 0.71% 186.14us 100 1.8610us 1.8240us 1.9840us _ZN2at6native72_GLOBAL__N__48_tmpxft_001e82d0_00000000_7_Normalization_cpp1_ii_db66e07045unrolled_elementwise_kernel_for_multi_outputsILi3EZZZNS1_34batch_norm_update_stats_and_invertERKNS_6TensorES5_S5_S5_ddlENKUlvE_clEvENKUlvE2_clEvEUlffffE_NS_6detail5ArrayIPcLi7EEE23TrivialOffsetCalculatorILi4EjESD_ILi3EjEEEviT0_T1_T2_T3_ 0.59% 153.37us 100 1.5330us 1.4720us 2.6240us void at::native::vectorized_elementwise_kernel<int=4, at::native::BUnaryFunctor<at::native::AddFunctor<long>>, at::detail::Array<char*, int=2>>(int, long, at::native::AddFunctor<long>) ``` I think there is similar scope to improve the backward implementation. Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D28142447 Pulled By: ngimel fbshipit-source-id: c70109780e206fa85e50a31e90a1cb4c533199da |

e3900d2ba5  Add lint for unqualified noqa (#56272)
Summary:
As this diff shows, currently there are a couple hundred instances of raw `noqa` in the codebase, which just ignore all errors on a given line. That isn't great, so this PR changes all existing instances of that antipattern to qualify the `noqa` with respect to a specific error code, and adds a lint to prevent more of this from happening in the future.
Interestingly, some of the examples the `noqa` lint catches are genuine attempts to qualify the `noqa` with a specific error code, such as these two:
```
test/jit/test_misc.py:27: print(f"{hello + ' ' + test}, I'm a {test}") # noqa E999
test/jit/test_misc.py:28: print(f"format blank") # noqa F541
```
However, those are still wrong because they are [missing a colon](https://flake8.pycqa.org/en/3.9.1/user/violations.html#in-line-ignoring-errors), which actually causes the error code to be completely ignored:
- If you change them to anything else, the warnings will still be suppressed.
- If you add the necessary colons then it is revealed that `E261` was also being suppressed, unintentionally:
```
test/jit/test_misc.py:27:57: E261 at least two spaces before inline comment
test/jit/test_misc.py:28:35: E261 at least two spaces before inline comment
```
I did try using [flake8-noqa](https://pypi.org/project/flake8-noqa/) instead of a custom `git grep` lint, but it didn't seem to work. This PR is definitely missing some of the functionality that flake8-noqa is supposed to provide, though, so if someone can figure out how to use it, we should do that instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56272
Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI run (before this PR was finished) failed:
- https://github.com/pytorch/pytorch/runs/2365189927
Reviewed By: janeyx99
Differential Revision: D27830127
Pulled By: samestep
fbshipit-source-id: d6dcf4f945ebd18cd76c46a07f3b408296864fcb
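
For reference, the qualified form that flake8 actually honors needs a colon between `noqa` and the error code; without it the suppression silently becomes a bare `noqa` (a small illustrative sketch):

```
# Correct: the colon makes flake8 suppress only E501 on this line.
long_line = "x" * 200  # noqa: E501

# Wrong: without the colon this is treated as a bare noqa, so every error on the line is suppressed.
other_line = "y" * 200  # noqa E501
```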

cc11aaaa60  Disallow non-breaking spaces (#55465)
Summary:
malfet found a couple of these in https://github.com/pytorch/pytorch/issues/55346; this PR removes the rest and adds a lint that prevents them from being accidentally added again in the future. It also removes the `-o` flag added in https://github.com/pytorch/pytorch/issues/53733 (which was unnecessarily hiding context without reducing the number of lines of output), and updates the lint error messages to reflect that the individual line numbers are shown in the logs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55465
Test Plan:
The "Lint / quick-checks" job in GitHub Actions should succeed on this PR. To verify that the lint does correctly find and error on non-breaking spaces, checkout

2b07bcf9eb  [operator benchmarks] Added more interpolation test cases (#54584)
Summary: Description:
- Added uint8 nearest test case
- Added 3d vectorization test case
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54584 Reviewed By: malfet Differential Revision: D27291303 Pulled By: fmassa fbshipit-source-id: 236ee5af351c8dc34ec3cdb7dda662c77feb8cf0

25a9f45a5a  fix broken quantization_test in operator_benchmark (#53153)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53153 This diff fixes quantization_test in operator_benchmark, which broke when the py_module for learnable fake_quantization was removed. ghstack-source-id: 123103477 Test Plan: `buck run mode/opt //caffe2/benchmarks/operator_benchmark/pt:quantization_test` Reviewed By: z-a-f Differential Revision: D26764881 fbshipit-source-id: 8d40c6eb5e7090ca65f48982c837f7dc87d14378

8c798e0622  Forbid trailing whitespace (#53406)
Summary: Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857 These are the only hand-written parts of this diff: - the addition to `.github/workflows/lint.yml` - the file endings changed in these four files (to appease FB-internal land-blocking lints): - `GLOSSARY.md` - `aten/src/ATen/core/op_registration/README.md` - `scripts/README.md` - `torch/csrc/jit/codegen/fuser/README.md` The rest was generated by running this command (on macOS): ``` git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//' ``` I looked over the auto-generated changes and didn't see anything that looked problematic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406 Test Plan: This run (after adding the lint but before removing existing trailing spaces) failed: - https://github.com/pytorch/pytorch/runs/2043032377 This run (on the tip of this PR) succeeded: - https://github.com/pytorch/pytorch/runs/2043296348 Reviewed By: walterddr, seemethere Differential Revision: D26856620 Pulled By: samestep fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97 |

5095332ab9  Minor cleanup of interpolate microbenchmark
Summary: Minor cleanup, addresses comments from https://www.internalfb.com/diff/D26780116 (

1559fa6a5c  [operator benchmarks] Added more modes to interpolation tests (#53186)
Summary: Description:
- Added more modes (bicubic and nearest) to interpolation tests
- Added a test case for downsampling a small image
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53186 Reviewed By: albanD Differential Revision: D26780116 Pulled By: fmassa fbshipit-source-id: f4f498e6e1da1ec131e6d9d9f42dc482135ae9e2

cb1596a193  [operator_benchmark] Added channels last 3d option to interpolate test (#53117)
Summary: Description:
- Added channels last 3d option to interpolate test
- Split the non-4d config into two: 3d and 5d
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53117 Reviewed By: NicolasHug Differential Revision: D26754243 Pulled By: fmassa fbshipit-source-id: 49bbab3bb47de27790e39537d0fbeca0f01782c4

9cf6be6b3e  Fix torch.nn.functional.interpolate microbenchmark for non-4D inputs
Summary: This diff fixes the `interpolate` microbenchmark for non-4D inputs, which are not supported by the `bilinear` mode Test Plan: 5D and 3D: ``` # Benchmarking PyTorch: interpolate # Mode: Eager # Name: interpolate_input_size(1,3,16,320,320)_output_size(8,256,256) # Input: input_size: (1, 3, 16, 320, 320), output_size: (8, 256, 256) Forward Execution Time (us) : 221008.660 # Benchmarking PyTorch: interpolate # Mode: Eager # Name: interpolate_input_size(4,512,320)_output_size(256,) # Input: input_size: (4, 512, 320), output_size: (256,) Forward Execution Time (us) : 9727.900 ``` 4D ``` # Benchmarking PyTorch: interpolate # Mode: Eager # Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastTrue # Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: True Forward Execution Time (us) : 375.181 ``` Reviewed By: fmassa Differential Revision: D26486678 fbshipit-source-id: 5d476afba3f35da9f8b86db16e21505bdb00888b |

4501b52fe5  Benchmark for torch.ops.quantized.linear_prepack_fp16 operator (#52229)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52229 Create benchmarks for torch.ops.quantized.linear_prepack_fp16 and torch.ops.quantized.linear_unpack_fp16 operators Benchmark for these operators are written in the same format as the other benchmarks for other operators. Test Plan: linear_prepack_fp16 test was successfully run with various parameters: Sample test run output: ---------------------------------------- PyTorch/Caffe2 Operator Micro-benchmarks ---------------------------------------- Tag : long Benchmarking PyTorch: linear_prepack_fp16 Mode: Eager Name: linear_prepack_fp16_M8_N32_K256_cpu Input: M: 8, N: 32, K: 256, device: cpu Forward Execution Time (us) : 14.002 Benchmarking PyTorch: linear_prepack_fp16 Mode: Eager Name: linear_prepack_fp16_M8_N32_K512_cpu Input: M: 8, N: 32, K: 512, device: cpu Forward Execution Time (us) : 14.114 Benchmarking PyTorch: linear_prepack_fp16 Mode: Eager Name: linear_prepack_fp16_M8_N64_K256_cpu Input: M: 8, N: 64, K: 256, device: cpu Forward Execution Time (us) : 19.355 Benchmarking PyTorch: linear_prepack_fp16 Mode: Eager Name: linear_prepack_fp16_M8_N64_K512_cpu Input: M: 8, N: 64, K: 512, device: cpu Forward Execution Time (us) : 19.056 Benchmarking PyTorch: linear_prepack_fp16 Mode: Eager Name: linear_prepack_fp16_M128_N32_K256_cpu Input: M: 128, N: 32, K: 256, device: cpu Forward Execution Time (us) : 115.963 Benchmarking PyTorch: linear_prepack_fp16 Mode: Eager Name: linear_prepack_fp16_M128_N32_K512_cpu Input: M: 128, N: 32, K: 512, device: cpu Forward Execution Time (us) : 116.259 Benchmarking PyTorch: linear_prepack_fp16 Mode: Eager Name: linear_prepack_fp16_M128_N64_K256_cpu Input: M: 128, N: 64, K: 256, device: cpu Forward Execution Time (us) : 229.336 Benchmarking PyTorch: linear_prepack_fp16 Mode: Eager Name: linear_prepack_fp16_M128_N64_K512_cpu Input: M: 128, N: 64, K: 512, device: cpu Forward Execution Time (us) : 220.016 linear_unpack_fp16 test was successfully run with identical parameters Reviewed By: b-koopman Differential Revision: D26403343 fbshipit-source-id: 11a98e56177952b94f291006975b0b719f48d1b9 |

50e6f0fdb6  Add benchmark for torch.nn.functional.interpolate
Summary: This diff adds a new microbencharmk for the `torch.nn.functional.interpolate` operator, using OpBench Test Plan: ``` [nicolashug@59262.od ~/fbsource/fbcode/caffe2/benchmarks/operator_benchmark/pt (39207820)]$ buck run //caffe2/benchmarks/operator_benchmark/pt:interpolate_test -- --tag_filter short Starting new Buck daemon... Buck daemon started. Parsing buck files: finished in 06:30.7 min Creating action graph: finished in 33.9 sec Building: finished in 02:53.4 min (100%) 24224/24224 jobs, 24224 updated Total time: 09:58.2 min /data/sandcastle/boxes/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/interpolate_test#link-tree/torch/utils/cpp_extension.py:3: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses import imp # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: interpolate # Mode: Eager # Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastTrue # Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: True Forward Execution Time (us) : 510.818 # Benchmarking PyTorch: interpolate # Mode: Eager # Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastFalse # Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: False Forward Execution Time (us) : 684.324 # Benchmarking PyTorch: interpolate # Mode: Eager # Name: interpolate_input_size(1,3,600,400)_output_size(240,240)_channels_lastTrue # Input: input_size: (1, 3, 600, 400), output_size: (240, 240), channels_last: True Forward Execution Time (us) : 33791.970 # Benchmarking PyTorch: interpolate # Mode: Eager # Name: interpolate_input_size(1,3,600,400)_output_size(240,240)_channels_lastFalse # Input: input_size: (1, 3, 600, 400), output_size: (240, 240), channels_last: False Forward Execution Time (us) : 50120.585 # Benchmarking PyTorch: interpolate # Mode: Eager # Name: interpolate_input_size(1,3,320,320)_output_size(256,256)_channels_lastTrue # Input: input_size: (1, 3, 320, 320), output_size: (256, 256), channels_last: True Forward Execution Time (us) : 37668.089 # Benchmarking PyTorch: interpolate # Mode: Eager # Name: interpolate_input_size(1,3,320,320)_output_size(256,256)_channels_lastFalse # Input: input_size: (1, 3, 320, 320), output_size: (256, 256), channels_last: False Forward Execution Time (us) : 56869.472 ``` Reviewed By: fmassa Differential Revision: D26225318 fbshipit-source-id: 7757296192e630c42a6e4913c5c1d93af11d286d |
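
A minimal sketch of the call being benchmarked, using the bilinear / channels-last configuration from the test plan above (sizes are illustrative):

```
import torch
import torch.nn.functional as F

x = torch.rand(1, 3, 320, 320)

# channels_last: True case from the benchmark names above.
x_cl = x.to(memory_format=torch.channels_last)
out_cl = F.interpolate(x_cl, size=(256, 256), mode="bilinear", align_corners=False)

# channels_last: False (contiguous NCHW) case.
out_contig = F.interpolate(x, size=(256, 256), mode="bilinear", align_corners=False)
```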

721ba97eb6  Create op benchmark for stack (#51263)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51263 - Add benchmark for stack op Test Plan: ``` buck build mode/opt //caffe2/benchmarks/operator_benchmark/pt:stack_test --show-output MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/stack_test.par --tag_filter=static_runtime | grep Execution Forward Execution Time (us) : 6.380 Forward Execution Time (us) : 6.553 Forward Execution Time (us) : 14.904 Forward Execution Time (us) : 5.657 Forward Execution Time (us) : 5.612 Forward Execution Time (us) : 6.051 Forward Execution Time (us) : 4.225 Forward Execution Time (us) : 4.240 Forward Execution Time (us) : 6.280 Forward Execution Time (us) : 6.267 Forward Execution Time (us) : 418.932 Forward Execution Time (us) : 417.694 Forward Execution Time (us) : 1592.455 Forward Execution Time (us) : 2919.261 Forward Execution Time (us) : 211.458 Forward Execution Time (us) : 211.518 Forward Execution Time (us) : 783.953 Forward Execution Time (us) : 1457.823 Forward Execution Time (us) : 2032.816 Forward Execution Time (us) : 2090.662 Forward Execution Time (us) : 6487.098 Forward Execution Time (us) : 11874.702 Forward Execution Time (us) : 2123.830 Forward Execution Time (us) : 2195.453 Forward Execution Time (us) : 6435.978 Forward Execution Time (us) : 11852.205 Forward Execution Time (us) : 2036.526 Forward Execution Time (us) : 2055.618 Forward Execution Time (us) : 6417.192 Forward Execution Time (us) : 12468.744 Forward Execution Time (us) : 4959.704 Forward Execution Time (us) : 5121.823 Forward Execution Time (us) : 5082.105 Forward Execution Time (us) : 5395.936 Forward Execution Time (us) : 5162.756 Forward Execution Time (us) : 23798.080 Forward Execution Time (us) : 4957.921 Forward Execution Time (us) : 4971.234 Forward Execution Time (us) : 5005.909 Forward Execution Time (us) : 5159.614 Forward Execution Time (us) : 5013.221 Forward Execution Time (us) : 20238.741 Forward Execution Time (us) : 7632.439 Forward Execution Time (us) : 7589.376 Forward Execution Time (us) : 7859.937 Forward Execution Time (us) : 8214.213 Forward Execution Time (us) : 11606.562 Forward Execution Time (us) : 34612.919 ``` Reviewed By: hlu1 Differential Revision: D25859143 fbshipit-source-id: a1b735ce87f57b5eb67e223e549248a2cd7663c1 |

983b8e6b62  fake_quant: add a more memory efficient version (#50561)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50561 Not for review yet, a bunch of TODOs need finalizing. tl;dr; add an alternative implementation of `fake_quantize` which saves a ask during the forward pass and uses it to calculate the backward. There are two benefits: 1. the backward function no longer needs the input Tensor, and it can be gc'ed earlier by autograd. On MobileNetV2, this reduces QAT overhead by ~15% (TODO: link, and absolute numbers). We add an additional mask Tensor to pass around, but its size is 4x smaller than the input tensor. A future optimization would be to pack the mask bitwise and unpack in the backward. 2. the computation of `qval` can be done only once in the forward and reused in the backward. No perf change observed, TODO verify with better matrics. TODO: describe in more detail Test Plan: OSS / torchvision / MobileNetV2 ``` python references/classification/train_quantization.py --print-freq 1 --data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/ --output-dir ~/nfs/pytorch_vision_tests/ --backend qnnpack --epochs 5 TODO paste results here ``` TODO more Imported from OSS Reviewed By: ngimel Differential Revision: D25918519 fbshipit-source-id: ec544ca063f984de0f765bf833f205c99d6c18b6 |
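
A rough sketch of the saved-mask idea described above, expressed as a custom autograd function (an illustration only, not the actual fused `cachemask` kernel): the forward pass saves a boolean mask instead of the input, so the input tensor can be freed by autograd before backward.

```
import torch

class FakeQuantWithMask(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale, zero_point, quant_min, quant_max):
        q = torch.round(x / scale) + zero_point
        mask = (q >= quant_min) & (q <= quant_max)   # 1 byte/element, vs. 4 for x
        y = (torch.clamp(q, quant_min, quant_max) - zero_point) * scale
        ctx.save_for_backward(mask)                  # x itself is not saved
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        return grad_out * mask, None, None, None, None

x = torch.rand(8, 8, requires_grad=True)
y = FakeQuantWithMask.apply(x, 0.1, 0, 0, 255)
y.sum().backward()
```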

dea9af5c06  Cat benchmark: use mobile feed tensor shapes and torch.cat out-variant (#50778)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50778 - use tensor shapes from ctr_mobilefeed merge net - use pt cat out-variant for a fairer comparison otherwise benchmark includes time to construct result tensor Test Plan: turbo off, devbig machine ``` MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/c2/concat_test.par --tag_filter=static_runtime ``` ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : static_runtime # Benchmarking Caffe2: concat # Name: concat_sizes(1,40)_N5_axis1_add_axis0_devicecpu_dtypefloat # Input: sizes: (1, 40), N: 5, axis: 1, add_axis: 0, device: cpu, dtype: float Forward Execution Time (us) : 0.619 # Benchmarking Caffe2: concat # Name: concat_sizes[(1,160),(1,14)]_N-1_axis1_add_axis0_devicecpu_dtypefloat # Input: sizes: [(1, 160), (1, 14)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float Forward Execution Time (us) : 0.369 # Benchmarking Caffe2: concat # Name: concat_sizes[(1,20,40),(1,4,40),(1,5,40)]_N-1_axis1_add_axis0_devicecpu_dtypefloat # Input: sizes: [(1, 20, 40), (1, 4, 40), (1, 5, 40)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float Forward Execution Time (us) : 0.590 # Benchmarking Caffe2: concat # Name: concat_sizes[(1,580),(1,174)]_N-1_axis1_add_axis0_devicecpu_dtypefloat # Input: sizes: [(1, 580), (1, 174)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float Forward Execution Time (us) : 0.412 # Benchmarking Caffe2: concat # Name: concat_sizes(20,40)_N5_axis1_add_axis0_devicecpu_dtypefloat # Input: sizes: (20, 40), N: 5, axis: 1, add_axis: 0, device: cpu, dtype: float Forward Execution Time (us) : 2.464 # Benchmarking Caffe2: concat # Name: concat_sizes[(20,160),(20,14)]_N-1_axis1_add_axis0_devicecpu_dtypefloat # Input: sizes: [(20, 160), (20, 14)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float Forward Execution Time (us) : 1.652 # Benchmarking Caffe2: concat # Name: concat_sizes[(20,20,40),(20,4,40),(20,5,40)]_N-1_axis1_add_axis0_devicecpu_dtypefloat # Input: sizes: [(20, 20, 40), (20, 4, 40), (20, 5, 40)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float Forward Execution Time (us) : 9.312 # Benchmarking Caffe2: concat # Name: concat_sizes[(20,580),(20,174)]_N-1_axis1_add_axis0_devicecpu_dtypefloat # Input: sizes: [(20, 580), (20, 174)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float Forward Execution Time (us) : 6.532 ``` ``` MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/cat_test.par --tag_filter=static_runtime ``` ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : static_runtime # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[(1,160),(1,14)]_N-1_dim1_cpu # Input: sizes: [(1, 160), (1, 14)], N: -1, dim: 1, device: cpu Forward Execution Time (us) : 3.313 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[(1,20,40),(1,4,40),(1,5,40)]_N-1_dim1_cpu # Input: sizes: [(1, 20, 40), (1, 4, 40), (1, 5, 40)], N: -1, dim: 1, device: cpu Forward Execution Time (us) : 3.680 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[(1,580),(1,174)]_N-1_dim1_cpu # Input: sizes: [(1, 580), (1, 174)], N: -1, dim: 1, device: cpu Forward Execution Time (us) : 3.452 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[(20,160),(20,14)]_N-1_dim1_cpu # Input: sizes: [(20, 160), (20, 14)], N: -1, dim: 1, device: cpu 
Forward Execution Time (us) : 4.653 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[(20,20,40),(20,4,40),(20,5,40)]_N-1_dim1_cpu # Input: sizes: [(20, 20, 40), (20, 4, 40), (20, 5, 40)], N: -1, dim: 1, device: cpu Forward Execution Time (us) : 7.364 # Benchmarking PyTorch: cat # Mode: Eager # Name: cat_sizes[(20,580),(20,174)]_N-1_dim1_cpu # Input: sizes: [(20, 580), (20, 174)], N: -1, dim: 1, device: cpu Forward Execution Time (us) : 7.055 ``` Reviewed By: hlu1 Differential Revision: D25839036 fbshipit-source-id: 7a6a234f41dfcc56246a80141fe0c84f769a5a85 |
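
For reference, the out-variant referred to above preallocates the result so the benchmark excludes allocation cost from the comparison (a sketch with illustrative shapes):

```
import torch

a = torch.rand(20, 160)
b = torch.rand(20, 14)

# Default form allocates the result tensor on every call.
result = torch.cat([a, b], dim=1)

# Out-variant reuses a preallocated buffer, keeping allocation time out of the measurement.
out = torch.empty(20, 174)
torch.cat([a, b], dim=1, out=out)
```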

49896c48e0  Caffe2 Concat operator benchmark (#50449)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50449 Port caffe2 operator benchmark from torch.cat to caffe2 concat to measure the difference in performance. previous diff abandoned to rerun github CI tests. D25738076 Test Plan: Tested on devbig by running both pt and c2 benchmarks. Compiled with mode/opt Inputs: ``` size, number of inputs, cat dimension, device ---------------------------------------------------- (1, 1, 1), N: 2, dim: 0, device: cpu (512, 512, 2), N: 2, dim: 1, device: cpu (128, 1024, 2), N: 2, dim: 1, device: cpu (1024, 1024, 2), N: 2, dim: 0, device: cpu (1025, 1023, 2), N: 2, dim: 1, device: cpu (1024, 1024, 2), N: 2, dim: 2, device: cpu [<function <lambda> at 0x7f922718e8c0>, 111, 65], N: 5, dim: 0, device: cpu [96, <function <lambda> at 0x7f9226dad710>, 64], N: 5, dim: 1, device: cpu [128, 64, <function <lambda> at 0x7f91a3625ef0>], N: 5, dim: 2, device: cpu [<function <lambda> at 0x7f91a3625f80>, 32, 64], N: 50, dim: 0, device: cpu [32, <function <lambda> at 0x7f91a3621050>, 64], N: 50, dim: 1, device: cpu [33, 65, <function <lambda> at 0x7f91a36210e0>], N: 50, dim: 2, device: cpu (64, 32, 4, 16, 32), N: 2, dim: 2, device: cpu (16, 32, 4, 16, 32), N: 8, dim: 2, device: cpu (9, 31, 5, 15, 33), N: 17, dim: 4, device: cpu [<function <lambda> at 0x7f91a3621170>], N: 100, dim: 0, device: cpu [<function <lambda> at 0x7f91a3621200>], N: 1000, dim: 0, device: cpu [<function <lambda> at 0x7f91a3621290>], N: 2000, dim: 0, device: cpu [<function <lambda> at 0x7f91a3621320>], N: 3000, dim: 0, device: cpu ``` ``` pytorch: MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/cat_test.par --tag_filter=all caffe2: MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/c2/concat_test.par --tag_filter=all ``` ``` Metric: Forward Execution Time (us) pytorch | caffe2 -------------------------------- 4.066 | 0.312 351.507 | 584.033 184.649 | 292.157 9482.895 | 6845.112 9558.988 | 6847.511 13730.016 | 14118.505 6324.371 | 4840.883 4613.497 | 3702.213 7504.718 | 7889.751 9882.978 | 7364.350 10087.076 | 7483.178 16849.556 | 18092.295 19181.075 | 13363.742 19296.508 | 13466.863 34157.449 | 56320.073 176.483 | 267.106 322.247 | 352.782 480.064 | 460.214 607.381 | 476.908 ``` Reviewed By: hlu1 Differential Revision: D25890595 fbshipit-source-id: f53e125c0680bc2ebf722d1da5ec964bec585fdd |

2de345d44d  Add op bench for caffe2 quantile op (#49598)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49598 Add op bench for caffe2 quantile op Test Plan: `buck run mode/opt caffe2/benchmarks/operator_benchmark/c2:quantile_op_test -- --warmup_iterations=10000 --iterations=10000` Reviewed By: radkris-git Differential Revision: D25590085 fbshipit-source-id: 0db58ac87c595b2bf2958f6299a1bf2ccea019db

cb3169d7a8  [aten] index_select dim 1 (#47077)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47077 Add benchmarks for pt index_select, batch_index_select, and c2's BatchGather Add batch_index_select implementation based on the C2 BatchGather implementation This currently falls back to index_select for backwards and cuda implementations. Alternatively, we can look into the specifics of why index_select is slower and replace the original implementation instead. Test Plan: ./buck-out/opt/gen/caffe2/benchmarks/operator_benchmark/c2/batch_gather_test.par ./buck-out/opt/gen/caffe2/benchmarks/operator_benchmark/pt/index_select_test.par PT results comparing without fix, block_size 1 only, and all dim=1 ``` # no optimization # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: index_select # Mode: Eager # Name: index_select_M256_N512_K1_dim1_cpu # Input: M: 256, N: 512, K: 1, dim: 1, device: cpu Forward Execution Time (us) : 353.450 # Benchmarking PyTorch: index_select # Mode: Eager # Name: index_select_M512_N512_K1_dim1_cpu # Input: M: 512, N: 512, K: 1, dim: 1, device: cpu Forward Execution Time (us) : 862.492 # Benchmarking PyTorch: index_select # Mode: Eager # Name: index_select_M256_N512_K2_dim1_cpu # Input: M: 256, N: 512, K: 2, dim: 1, device: cpu Forward Execution Time (us) : 4555.344 # Benchmarking PyTorch: index_select # Mode: Eager # Name: index_select_M512_N512_K2_dim1_cpu # Input: M: 512, N: 512, K: 2, dim: 1, device: cpu Forward Execution Time (us) : 11003.279 ``` ``` # block size 1 only # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: index_select # Mode: Eager # Name: index_select_M256_N512_K1_dim1_cpu # Input: M: 256, N: 512, K: 1, dim: 1, device: cpu Forward Execution Time (us) : 129.240 # Benchmarking PyTorch: index_select # Mode: Eager # Name: index_select_M512_N512_K1_dim1_cpu # Input: M: 512, N: 512, K: 1, dim: 1, device: cpu Forward Execution Time (us) : 266.776 # Benchmarking PyTorch: index_select # Mode: Eager # Name: index_select_M256_N512_K2_dim1_cpu # Input: M: 256, N: 512, K: 2, dim: 1, device: cpu Forward Execution Time (us) : 4508.593 # Benchmarking PyTorch: index_select # Mode: Eager # Name: index_select_M512_N512_K2_dim1_cpu # Input: M: 512, N: 512, K: 2, dim: 1, device: cpu Forward Execution Time (us) : 10391.655 ``` ``` # dim 1 # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: index_select # Mode: Eager # Name: index_select_M8_N8_K1_dim1_cpu # Input: M: 8, N: 8, K: 1, dim: 1, device: cpu Forward Execution Time (us) : 3.736 # Benchmarking PyTorch: index_select # Mode: Eager # Name: index_select_M256_N512_K1_dim1_cpu # Input: M: 256, N: 512, K: 1, dim: 1, device: cpu Forward Execution Time (us) : 130.460 # Benchmarking PyTorch: index_select # Mode: Eager # Name: index_select_M512_N512_K1_dim1_cpu # Input: M: 512, N: 512, K: 1, dim: 1, device: cpu Forward Execution Time (us) : 267.706 # Benchmarking PyTorch: index_select # Mode: Eager # Name: index_select_M8_N8_K2_dim1_cpu # Input: M: 8, N: 8, K: 2, dim: 1, device: cpu Forward Execution Time (us) : 4.187 # Benchmarking PyTorch: index_select # Mode: Eager # Name: index_select_M256_N512_K2_dim1_cpu # Input: M: 256, N: 512, K: 2, dim: 1, device: cpu Forward Execution Time (us) : 1739.550 # Benchmarking PyTorch: index_select # Mode: Eager # 
Name: index_select_M512_N512_K2_dim1_cpu # Input: M: 512, N: 512, K: 2, dim: 1, device: cpu Forward Execution Time (us) : 3468.332 ``` C2 results: ```# Benchmarking Caffe2: batch_gather WARNING: Logging before InitGoogleLogging() is written to STDERR W1203 13:19:35.310904 782584 init.h:137] Caffe2 GlobalInit should be run before any other API calls. # Name: batch_gather_M8_N8_K1_devicecpu # Input: M: 8, N: 8, K: 1, device: cpu Forward Execution Time (us) : 0.308 # Benchmarking Caffe2: batch_gather # Name: batch_gather_M256_N512_K1_devicecpu # Input: M: 256, N: 512, K: 1, device: cpu Forward Execution Time (us) : 90.517 # Benchmarking Caffe2: batch_gather # Name: batch_gather_M512_N512_K1_devicecpu # Input: M: 512, N: 512, K: 1, device: cpu Forward Execution Time (us) : 200.009 # Benchmarking Caffe2: batch_gather # Name: batch_gather_M8_N8_K2_devicecpu # Input: M: 8, N: 8, K: 2, device: cpu Forward Execution Time (us) : 0.539 # Benchmarking Caffe2: batch_gather # Name: batch_gather_M256_N512_K2_devicecpu # Input: M: 256, N: 512, K: 2, device: cpu Forward Execution Time (us) : 1001.540 # Benchmarking Caffe2: batch_gather # Name: batch_gather_M512_N512_K2_devicecpu # Input: M: 512, N: 512, K: 2, device: cpu Forward Execution Time (us) : 2005.870 ``` buck test dper3/dper3/modules/low_level_modules/tests:single_operators_test -- test_batch_gather Reviewed By: hlu1 Differential Revision: D24630227 fbshipit-source-id: cd205a30d96a33d239f3266820ada9a90093cf91 |
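
For context, a small usage sketch of the PyTorch op being compared against Caffe2's BatchGather (shapes follow the M/N/K configs above and are illustrative):

```
import torch

M, N, K = 256, 512, 2
x = torch.rand(M, N, K)
idx = torch.randint(0, N, (N,))

# Gather along dim 1, i.e. the same access pattern as Caffe2's BatchGather.
out = torch.index_select(x, 1, idx)  # shape (M, N, K)
```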

c7cc8a48c0  migrating some straggler pytorch ops in fbcode to the new registration API (#48954)
Summary: I already migrated the majority of fbcode ops to the new registration API, but there are a few stragglers (mostly new files that were created in the last two weeks). The goal is mostly to stamp out as much of the legacy registration API usage as possible, so that people only see the new API when they look around the code for examples of how to register their own ops. Pull Request resolved: https://github.com/pytorch/pytorch/pull/48954 ghstack-source-id: 118140663 Test Plan: Ran buck targets for each file that I migrated Reviewed By: ezyang Differential Revision: D25380422 fbshipit-source-id: 268139a1d7b9ef14c07befdf9e5a31f15b96a48c

0125e14c9a  [OpBench] change relu entry point after D24747035
Summary: D24747035 (

9ee4f499f0  [OpBench] add _consume_op.list for processing input with type of List[Tensor] (#47890)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47890 As titled. This fixes issues when running `chunk_test`, `split_test`, `qobserver`, and `sort` in `qunary` in JIT mode, because the output of `chunk_op` is a list of tensors which cannot be handled by the current `_consume_op`. Test Plan: OSS: python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1 --use_jit Reviewed By: mingzhe09088 Differential Revision: D24774105 fbshipit-source-id: 210a0345b8526ebf3c24f4d0794e20b2ff6cef3d

8ff0b6fef8  [OpBenchMobile] Enable operator_benchmark to run the benchmark on mobile through AiBench (#47767)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47767 This diff implements the functionality of running benchmark on mobile on top of operator_benchmark framework. It does so through a few steps: 1. create a scripted module from existing benchmark case. 2. run mobile specific optimization pass on the scripted module 3. run the scripted module on AiBench by calling its Python API A small change in the way of writing a benchmark case is introduced so that both local and mobile run can share the same interface. The change is about having inputs as arguments of the `forward` function, so that mobile optimization pass can be run successfully (otherwise everything will be optimized away by constant propagation). Test Plan: ## local op_bench run buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --iterations 1 --warmup_iterations 1 buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --iterations 1 --warmup_iterations 1 --use_jit Exceptions: `py_module` op in `FakeQuantizePerTensorBaseOpBenchmark` and `FakeQuantizePerChannelBaseOpBenchmark` under JIT mode. These tests also failed in the base version ``` RuntimeError: Module 'FakeQuantizePerChannelOpBenchmark' has no attribute 'op_func' (This function exists as an attribute on the Python module, but we failed to compile it to a TorchScript function. The error stack is reproduced here: Python builtin <built-in method apply of FunctionMeta object at 0x619000c652a0> is currently not supported in Torchscript: File "/data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/quantization_test#link-tree/quantization_test.py", line 260 quant_min: int, quant_max: int ): return _LearnableFakeQuantizePerChannelOp.apply(input, scale, zero_point, axis, quant_min, quant_max, 1.0) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE : File "/data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/quantization_test#link-tree/quantization_test.py", line 313 axis: int, quant_min: int, quant_max: int ): return self.op_func(input, scale, zero_point, axis, quant_min, quant_max) ~~~~~~~~~~~~ <--- HERE ``` `_consume_op` typing mismatch: chunk, split, qobserver, sort in qunary. 
These will be fixed in D24774105 ## OSS test python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1 --use_jit python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1 ## saved module graph ``` module __torch__.mobile_benchmark_utils.OpBenchmarkMobile { parameters { } attributes { training = True num_iters = 1 benchmark = <__torch__.pt.add_test.___torch_mangle_4.AddBenchmark object at 0x6070001b8b50> } methods { method forward { graph(%self : __torch__.mobile_benchmark_utils.OpBenchmarkMobile): %12 : None = prim::Constant() # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:9:4 %4 : bool = prim::Constant[value=1]() # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:10:8 %1 : int = prim::GetAttr[name="num_iters"](%self) = prim::Loop(%1, %4) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:10:8 block0(%i : int): %6 : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark = prim::GetAttr[name="benchmark"](%self) %7 : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark = prim::GetAttr[name="benchmark"](%self) %self.inputs_tuple : (Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu), Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu)) = prim::Constant[value=({0.48884}, {0.809042})]() %9 : Tensor, %10 : Tensor = prim::TupleUnpack(%self.inputs_tuple) %23 : int = prim::Constant[value=1]() %24 : Tensor = aten::add(%9, %10, %23) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/pt/add_test.py:39:15 -> (%4) return (%12) } } submodules { module __torch__.pt.add_test.___torch_mangle_4.AddBenchmark { parameters { } attributes { mobile_optimized = True } methods { method forward { graph(%self : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark, %input_one.1 : Tensor, %input_two.1 : Tensor): %3 : int = prim::Constant[value=1]() %4 : Tensor = aten::add(%input_one.1, %input_two.1, %3) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/pt/add_test.py:39:15 return (%4) } method get_inputs { graph(%self : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark): %self.inputs_tuple : (Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu), Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu)) = prim::Constant[value=({0.48884}, {0.809042})]() return (%self.inputs_tuple) } } submodules { } } } } ``` Reviewed By: kimishpatel Differential Revision: D24322214 fbshipit-source-id: 335317eca4f40c4083883eb41dc47caf25cbdfd1 |

f692af209d  add unittest for operator benchmark (#47678)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47678 add unittest for operator benchmark. Covers below cases: ``` generate_c2_test generate_c2_gradient_test generate_pt_test generate_pt_gradient_test generate_pt_tests_from_op_list ``` Also fixed two issues (incorrect fn signature) found by the unittest in `benchmark_caffe2.py` Test Plan: arc lint buck run caffe2/benchmarks/operator_benchmark:operator_benchmark_unittest ``` test_c2_single_op (operator_benchmark_unittest.BenchmarkTest) ... # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking Caffe2: add WARNING: Logging before InitGoogleLogging() is written to STDERR W1109 23:08:39.932207 639464 init.h:137] Caffe2 GlobalInit should be run before any other API calls. # Name: add_M8 # Input: M: 8 Forward Execution Time (us) : 36.474 # Benchmarking Caffe2: add # Name: add_M8 # Input: M: 8 Backward Execution Time (us) : 42.281 ok test_pt_list_of_ops (operator_benchmark_unittest.BenchmarkTest) ... # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking Caffe2: add # Name: add_M8 # Input: M: 8 Forward Execution Time (us) : 36.579 # Benchmarking Caffe2: add # Name: add_M8 # Input: M: 8 Backward Execution Time (us) : 42.734 # Benchmarking PyTorch: abs # Mode: Eager # Name: abs_M8 # Input: M: 8 Forward Execution Time (us) : 148.929 # Benchmarking PyTorch: abs_ # Mode: Eager # Name: abs__M8 # Input: M: 8 Forward Execution Time (us) : 71.909 ok test_pt_single_op (operator_benchmark_unittest.BenchmarkTest) ... # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking Caffe2: add # Name: add_M8 # Input: M: 8 Forward Execution Time (us) : 36.860 # Benchmarking Caffe2: add # Name: add_M8 # Input: M: 8 Backward Execution Time (us) : 42.293 # Benchmarking PyTorch: abs # Mode: Eager # Name: abs_M8 # Input: M: 8 Forward Execution Time (us) : 148.999 # Benchmarking PyTorch: abs_ # Mode: Eager # Name: abs__M8 # Input: M: 8 Forward Execution Time (us) : 71.941 # Benchmarking PyTorch: add # Mode: Eager # Name: add_M8 # Input: M: 8 Forward Execution Time (us) : 179.108 # Benchmarking PyTorch: add # Mode: Eager # Name: add_M8 # Input: M: 8 Backward Execution Time (us) : 1205.902 ok ``` buck run caffe2/benchmarks/operator_benchmark/c2:add_test ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking Caffe2: add WARNING: Logging before InitGoogleLogging() is written to STDERR W1109 23:20:11.551795 654290 init.h:137] Caffe2 GlobalInit should be run before any other API calls. # Name: add_M8_N16_K32_dtypeint # Input: M: 8, N: 16, K: 32, dtype: int Forward Execution Time (us) : 984.510 # Benchmarking Caffe2: add # Name: add_M16_N16_K64_dtypefloat # Input: M: 16, N: 16, K: 64, dtype: float Forward Execution Time (us) : 68.526 # Benchmarking Caffe2: add # Name: add_M64_N64_K128_dtypeint # Input: M: 64, N: 64, K: 128, dtype: int Forward Execution Time (us) : 101617.076 ``` Reviewed By: mingzhe09088 Differential Revision: D24854414 fbshipit-source-id: 6676549909da6700b42f322c4ad6e8e2ef5b86b5 |
||
|
|
163adb9fa7 |
Add HalfToFloat + FloatToHalf operators to PyTorch (#45092)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45092 Adding two operators: 1. at::float_to_half -> Converts FP32 tensor to FP16 tensor 2. at::half_to_float -> Converts FP16 tensor to FP32 tensor. These operators internally use the kernel provided by FBGeMM. Both C2 and PT will use the same FBGeMM kernel underneath. Test Plan: buck test //caffe2/test:torch -- .*test_half_tensor.* Run benchmark locally using ``` buck run //caffe2/benchmarks/operator_benchmark/pt:tensor_to_test ``` AIBench results are pending; I do not expect them to finish soon, as we have a large queue with jobs pending for 2+ days. Benchmark for a 512x512 tensor with the FBGeMM implementation: ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark # Mode: Eager # Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu # Input: M: 512, N: 512, device: cpu Forward Execution Time (us) : 1246.332 # Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark # Mode: Eager # Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu # Input: M: 512, N: 512, device: cpu Forward Execution Time (us) : 1734.304 ``` Benchmark for a 512x512 tensor on trunk, with no FBGeMM integration: ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark # Mode: Eager # Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu # Input: M: 512, N: 512, device: cpu Forward Execution Time (us) : 169045.724 # Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark # Mode: Eager # Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu # Input: M: 512, N: 512, device: cpu Forward Execution Time (us) : 152382.494 ``` Reviewed By: ngimel Differential Revision: D23824869 fbshipit-source-id: ef044459b6c8c6e5ddded72080204c6a0ab4582c |
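The `at::float_to_half` / `at::half_to_float` entry points are internal; for a quick out-of-tree sanity check, the same conversion path can be approximated from Python with `Tensor.to`. A rough sketch (the 512x512 shape mirrors the table above; this is not the `tensor_to_test` harness from the test plan):

```
import timeit
import torch

x = torch.rand(512, 512)      # FP32 source
h = x.to(torch.half)          # FP16 source

# Total time for 1000 runs, converted to microseconds per conversion.
f2h_us = timeit.timeit(lambda: x.to(torch.half), number=1000) / 1000 * 1e6
h2f_us = timeit.timeit(lambda: h.to(torch.float), number=1000) / 1000 * 1e6
print(f"float->half: {f2h_us:.1f} us, half->float: {h2f_us:.1f} us")
```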
||
|
|
220b3bd667 |
Add op benchmark for batch box cox as baseline (#47275)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47275 ``` # Benchmarking Caffe2: batch_box_cox # Name: batch_box_cox_M64_N64_dtypedouble # Input: M: 64, N: 64, dtype: double Forward Execution Time (us) : 49.005 ``` Test Plan: `buck run mode/opt caffe2/benchmarks/operator_benchmark/c2:batch_box_cox_test -- --iterations=1000 --warmup 100` Reviewed By: houseroad Differential Revision: D24675426 fbshipit-source-id: 8bb1f3076dc6b01e7b63468136ddf3d9b6d7e5d2 |
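As a reference for what the `batch_box_cox` baseline computes: the shifted Box-Cox transform is `((x + lambda2)^lambda1 - 1) / lambda1` when `lambda1 != 0`, and `log(x + lambda2)` when `lambda1 == 0`. A minimal PyTorch sketch, assuming the Caffe2 BatchBoxCox convention of per-column `lambda1`/`lambda2` parameters:

```
import torch

def batch_box_cox(data: torch.Tensor, lambda1: torch.Tensor, lambda2: torch.Tensor) -> torch.Tensor:
    # data: (M, N); lambda1, lambda2: (N,) per-column parameters.
    shifted = data + lambda2
    # torch.where evaluates both branches; the division-by-zero branch is
    # discarded wherever lambda1 == 0, so the log branch is returned there.
    return torch.where(
        lambda1 == 0,
        torch.log(shifted),
        (shifted.pow(lambda1) - 1.0) / lambda1,
    )
```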
||
|
|
d8c3b2b10c |
[quant][pyper] Add support for pruned weights in embedding_bag_byte lookup (#47329)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47329 Supports pruned weights along with mapping for the compressed indices Test Plan: python test/test_quantization.py TestQuantizedEmbeddingOps Imported from OSS Reviewed By: qizzzh Differential Revision: D24719909 fbshipit-source-id: f998f4039e84bbe1886e492a3bff6aa5f56b6b0f |
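Conceptually, the pruned path replaces a direct weight-row lookup with an indirection through a compressed-indices mapping, where pruned rows map to a sentinel and contribute zeros. A schematic sketch of that idea (not the quantized kernel; the `compressed_indices_mapping` name follows the quantized embedding convention, and `-1` as the pruned-row sentinel is an assumption here):

```
import torch

def lookup_with_pruning(compressed_weight: torch.Tensor,
                        indices: torch.Tensor,
                        compressed_indices_mapping: torch.Tensor) -> torch.Tensor:
    # Map original row ids to rows of the pruned (compressed) weight matrix;
    # entries equal to -1 denote pruned rows and are returned as zeros.
    remapped = compressed_indices_mapping[indices]
    out = torch.zeros(indices.numel(), compressed_weight.size(1),
                      dtype=compressed_weight.dtype)
    keep = remapped >= 0
    out[keep] = compressed_weight[remapped[keep]]
    return out
```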
||
|
|
c9222b7471 |
Implement clip_ranges operator for PyTorch
Test Plan: unit test for correctness ``` buck test caffe2/torch/fb/sparsenn:test -- test_clip_ranges Parsing buck files: finished in 1.6 sec Creating action graph: finished in 18.9 sec Building: finished in 15.0 sec (100%) 9442/9442 jobs, 1 updated Total time: 35.6 sec More details at https://www.internalfb.com/intern/buck/build/66fb17de-859e-4d01-89bf-5c5de2950693 Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details. Running with tpx session id: 80f5e0c2-7db2-48a4-b148-25dd34651682 Trace available for this run at /tmp/tpx-20201026-123217.050766/trace.log Started reporting to test run: https://our.intern.facebook.com/intern/testinfra/testrun/4503599665041422 ✓ ListingSuccess: caffe2/torch/fb/sparsenn:test - main (14.912) ✓ Pass: caffe2/torch/fb/sparsenn:test - test_clip_ranges (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (14.098) Summary Pass: 1 ListingSuccess: 1 Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/4503599665041422 ``` new benchmark perf test ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: clip_ranges # Mode: JIT # Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypetorch.int32_cpu # Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: torch.int32, device: cpu Forward Execution Time (us) : 155.765 # Benchmarking PyTorch: clip_ranges # Mode: JIT # Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypetorch.int32_cpu # Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: torch.int32, device: cpu Forward Execution Time (us) : 156.248 # Benchmarking PyTorch: clip_ranges # Mode: JIT # Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypetorch.int32_cpu # Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: torch.int32, device: cpu Forward Execution Time (us) : 156.634 # Benchmarking PyTorch: clip_ranges # Mode: JIT # Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypetorch.int32_cpu # Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: torch.int32, device: cpu Forward Execution Time (us) : 155.408 # Benchmarking PyTorch: clip_ranges # Mode: JIT # Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypetorch.int32_cpu # Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: torch.int32, device: cpu Forward Execution Time (us) : 165.168 ``` Compare with the old implementation, there are **around 300us gain** ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: clip_ranges # Mode: JIT # Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypetorch.int32_cpu # Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: torch.int32, device: cpu Forward Execution Time (us) : 443.012 # Benchmarking PyTorch: clip_ranges # Mode: JIT # Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypetorch.int32_cpu # Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: torch.int32, device: cpu Forward Execution Time (us) : 446.480 # Benchmarking PyTorch: clip_ranges # Mode: JIT # Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypetorch.int32_cpu # Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: torch.int32, device: cpu Forward Execution Time (us) : 444.064 # Benchmarking PyTorch: clip_ranges # Mode: JIT # Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypetorch.int32_cpu # Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: torch.int32, device: cpu Forward Execution Time (us) : 445.511 # Benchmarking PyTorch: clip_ranges # 
Mode: JIT # Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypetorch.int32_cpu # Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: torch.int32, device: cpu Forward Execution Time (us) : 450.468 ``` Reviewed By: MarcioPorto Differential Revision: D24546110 fbshipit-source-id: e6c9b38e911f177f97961ede5bf375107f240363 |
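For context on what the numbers above measure: `clip_ranges` takes a tensor of `(offset, length)` pairs and clamps each length to `MAX_LENGTH`, leaving offsets untouched. A rough functional sketch of that behavior (shapes and semantics inferred from the Caffe2 ClipRanges operator, not the exact PyTorch implementation in this diff):

```
import torch

def clip_ranges(ranges: torch.Tensor, max_length: int) -> torch.Tensor:
    # ranges: (..., 2) integer tensor where the last dim holds (offset, length);
    # each length is clamped to max_length, offsets are left unchanged.
    out = ranges.clone()
    out[..., 1] = out[..., 1].clamp(max=max_length)
    return out
```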
||
|
|
c6858fd71a |
Set up benchmarks for ClipRanges operator for Caffe2 and PyTorch
Summary: As title, adding the benchmark tests for ClipRanges operators. Test Plan: benchmark test for Caffe2 ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking Caffe2: clip_ranges WARNING: Logging before InitGoogleLogging() is written to STDERR W1026 12:30:33.938997 2658759 init.h:137] Caffe2 GlobalInit should be run before any other API calls. # Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypeint32 # Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: int32 Forward Execution Time (us) : 5.805 # Benchmarking Caffe2: clip_ranges # Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypeint32 # Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: int32 Forward Execution Time (us) : 5.913 # Benchmarking Caffe2: clip_ranges # Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypeint32 # Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: int32 Forward Execution Time (us) : 5.941 # Benchmarking Caffe2: clip_ranges # Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypeint32 # Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: int32 Forward Execution Time (us) : 5.868 # Benchmarking Caffe2: clip_ranges # Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypeint32 # Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: int32 Forward Execution Time (us) : 6.408 ``` benchmark test for PyTorch ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: clip_ranges # Mode: JIT # Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypetorch.int32_cpu # Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: torch.int32, device: cpu Forward Execution Time (us) : 443.012 # Benchmarking PyTorch: clip_ranges # Mode: JIT # Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypetorch.int32_cpu # Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: torch.int32, device: cpu Forward Execution Time (us) : 446.480 # Benchmarking PyTorch: clip_ranges # Mode: JIT # Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypetorch.int32_cpu # Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: torch.int32, device: cpu Forward Execution Time (us) : 444.064 # Benchmarking PyTorch: clip_ranges # Mode: JIT # Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypetorch.int32_cpu # Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: torch.int32, device: cpu Forward Execution Time (us) : 445.511 # Benchmarking PyTorch: clip_ranges # Mode: JIT # Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypetorch.int32_cpu # Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: torch.int32, device: cpu Forward Execution Time (us) : 450.468 ``` Reviewed By: MarcioPorto Differential Revision: D24500468 fbshipit-source-id: a582090a3982005af272cb10cdd257b2b2e787c4 |
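The configurations in these runs appear to be a small cross product over (LENGTH, M, N, MAX_LENGTH). A sketch of what such a config could look like in the operator_benchmark harness (values reconstructed from the reported benchmark names, so treat them as illustrative):

```
import operator_benchmark as op_bench
import torch

clip_ranges_short_configs = op_bench.config_list(
    attr_names=["LENGTH", "M", "N", "MAX_LENGTH", "dtype"],
    attrs=[
        [6, 1, 2, 1, torch.int32],
        [7, 1, 2, 2, torch.int32],
        [8, 1, 2, 3, torch.int32],
        [9, 1, 2, 4, torch.int32],
        [10, 1, 2, 5, torch.int32],
    ],
    cross_product_configs={"device": ["cpu"]},
    tags=["short"],
)
```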
||
|
|
d5cd781cd3 |
Update dper3 to use torch.nan_to_num and nan_to_num_ (#46873)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46873 OSS: Add op benchmark for torch.nan_to_num and torch.nan_to_num_ Test Plan: OSS: `buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:nan_to_num_test` Reviewed By: qizzzh, houseroad Differential Revision: D24521835 fbshipit-source-id: 1fd50a99e5329ffec2d470525ce6976d39424958 |
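For reference, `torch.nan_to_num` replaces NaN and infinite values with finite ones, and `nan_to_num_` does the same in place; a quick usage sketch of the two ops the new benchmark exercises:

```
import torch

x = torch.tensor([float("nan"), float("inf"), float("-inf"), 3.14])

# Out-of-place: NaN -> 0, +inf -> 1e4, -inf -> -1e4.
y = torch.nan_to_num(x, nan=0.0, posinf=1e4, neginf=-1e4)

# In-place variant; unspecified posinf/neginf default to the dtype's max/min.
x.nan_to_num_(nan=0.0)
```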