Commit Graph

296 Commits

Norman Ponte
2e8b9c7785 [TorchArrow][AIBench] Add AIBench Metrics for TorchArrow Inference Benchmark Test (#75035)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75035

- rename `--ai_pep_format` to `--report_aibench` to reflect the underlying framework's name change

Reviewed By: tgalkovskyi

Differential Revision: D35257017

fbshipit-source-id: 6c0a2e4585db928b029484d4b81165bfc99bff9f
(cherry picked from commit 18f4962539ccb09a3c33b146206342ea3930f275)
2022-04-01 00:35:42 +00:00
Sergii Dymchenko
5b011fc6eb Fix Undefined variable in QInterpolateBenchmark
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73130
Approved by: https://github.com/malfet
2022-03-09 00:14:15 +00:00
Sergii Dymchenko
486572223b Fix command example (#72847)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72847

Reviewed By: malfet

Differential Revision: D34260868

Pulled By: kit1980

fbshipit-source-id: 1b225f3c2c7a822e44df4bbd91766e6533eab6d7
(cherry picked from commit c9e874c4d8)
2022-02-16 21:45:45 +00:00
Ben Koopman
c2c859bdf2 [quant][embedding qat] Add benchmarks for QAT Embedding+EmbeddingBag (#66560)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66560

Test Plan: Imported from OSS

Reviewed By: HDCharles

Differential Revision: D31618282

Pulled By: b-koopman

fbshipit-source-id: ebfe723cfc4004f413f157e65532d64e8d0274b3
2021-11-19 06:29:19 -08:00
Michael Suo
5c3529a86d [lint] small pass to make lint clean (#68367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68367

- bmm_test.py was using syntax not allowed in Python 3.6
- Some suppressions were not placed on the correct line.

With this file,
```
lintrunner --paths-cmd='git grep -Il .'
```
passes successfully.

Test Plan: Imported from OSS

Reviewed By: janeyx99, mrshenli

Differential Revision: D32436644

Pulled By: suo

fbshipit-source-id: ae9300c6593d8564fb326822de157d00f4aaa3c2
2021-11-16 10:27:00 -08:00
Shashank Chaudhry
89c4e8c22b [NOOP][clangformat][codemod] Enable CLANGFORMAT for some folders in caffe2/* (#67746)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67746

Test Plan: Visual inspection. Sandcastle.

Reviewed By: zertosh

Differential Revision: D31986646

fbshipit-source-id: 91885c20c3cead3853c49abb9fe0a94a67f33cc8
2021-11-03 12:23:14 -07:00
Bin Wen
6900aacf54 [fbcode] Fix operator_benchmark with jit mode (#67382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67382

two simple updates:

* fix running the benchmark with `--use_jit`; previously it would fail with this error:

  torch.jit.frontend.UnsupportedNodeError: import statements aren't supported:
  File "/proc/self/fd/3/bmm_test.py", line 9
  def __invoke_main():
    import ctypes
    ~~~~~~ <--- HERE
    import ctypes.util
    import errno

* add matmul to the bmm benchmark, as in D31837588 (a minimal sketch of such a benchmark follows below)
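
For reference, a minimal operator_benchmark entry for `torch.bmm` might look like the sketch below; shapes, tags, and names are illustrative rather than the exact contents of bmm_test.py:

```python
import operator_benchmark as op_bench
import torch

# Illustrative configs; the real bmm_test.py may use different shapes/tags.
bmm_configs = op_bench.cross_product_configs(
    B=[4, 32], M=[5, 25], N=[3, 20], K=[2, 30],
    device=['cpu'],
    tags=['short'],
)

class BmmBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, B, M, N, K, device):
        self.inputs = {
            'batch1': torch.rand(B, M, K, device=device),
            'batch2': torch.rand(B, K, N, device=device),
        }
        self.set_module_name('bmm')

    def forward(self, batch1, batch2):
        return torch.bmm(batch1, batch2)

op_bench.generate_pt_test(bmm_configs, BmmBenchmark)

if __name__ == '__main__':
    op_bench.benchmark_runner.main()
```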

Test Plan:
buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:bmm_test -- --forward_only=True --mkl_num_threads=1 --omp_num_threads=1 --use_jit=True

Reviewed By: ShijunK

Differential Revision: D31960528

fbshipit-source-id: 84b892934149784d1b8a0f90b0233cc2f1cf1f5f
2021-10-28 08:48:10 -07:00
Vasiliy Kuznetsov
d802877dfa speed up quantized interpolate for channels last (#66525)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66525

This should solve https://github.com/pytorch/pytorch/issues/60015

There were two `q_zero_point()` accesses inside a for loop, which was
expensive. Moving them to before the loop sped a microbenchmark up 10x;
a sketch of the pattern follows below.
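
The fix is plain loop-invariant hoisting; a minimal Python sketch of the pattern (the actual change lives in the ATen C++ kernel, so these names are illustrative):

```python
def dequantize_rows_slow(qx, rows):
    # q_zero_point()/q_scale() are loop-invariant, yet queried on every iteration.
    return [(r - qx.q_zero_point()) * qx.q_scale() for r in rows]

def dequantize_rows_fast(qx, rows):
    # Hoist the accessor calls out of the loop; each call has a fixed overhead.
    zp, scale = qx.q_zero_point(), qx.q_scale()
    return [(r - zp) * scale for r in rows]
```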

Test Plan:
```
// comment out benchmarks unrelated to original issue, for simplicity
cd benchmarks/operator_benchmark
python -m pt.qinterpolate_test

// before: 2994 us
// after: 324 us
// full results: https://gist.github.com/vkuzo/cc5ef9526dc0cda170d6d63498c16453
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D31592422

fbshipit-source-id: b6078ac1039573bbe545275f7aedfd580910b459
2021-10-14 08:11:26 -07:00
Alexandr Guzhva
b8e1999253 [quant] Add op benchmark for GPU FakeQuantizePerChannel with float zero_points (#66183)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66183

Add a GPU benchmark for fakeQuant, similar to #65241
ghstack-source-id: 139810414

Test Plan: https://pxl.cl/1QjJM

Reviewed By: b-koopman

Differential Revision: D31288158

fbshipit-source-id: 65526248b5c7b70f0bc32a86b08f50b4cbc7a83d
2021-10-06 08:07:42 -07:00
Vasiliy Kuznetsov
e3af4be963 pytorch quantization ao migration phase 2: caffe2/benchmark (#65833)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65833

Renames `torch.quantization` to `torch.ao.quantization` in `caffe2/benchmarks`
folder.

```
find caffe2/benchmarks/ -type f -name "*.py" -print0 | xargs -0 sed -i "s/torch\.quantization/torch.ao.quantization/g"
```
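
For user code, the migration only changes the import path; a hedged example using `quantize_dynamic`, one of the migrated names:

```python
# Before the migration:
# from torch.quantization import quantize_dynamic
# After the migration:
from torch.ao.quantization import quantize_dynamic
```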

Test Plan: CI

Reviewed By: z-a-f

Differential Revision: D31275963

fbshipit-source-id: 8596bf28df5c3ad2c4490ac8abb285d6517c0116
2021-10-01 06:17:36 -07:00
Philip Meier
aebde1bc2b deprecate device getter from torch.testing namespace (#63844)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63844

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31141433

Pulled By: mruberry

fbshipit-source-id: a29331278ab99a19e225e2cb357458e3db4f9732
2021-09-29 02:40:52 -07:00
Ben Koopman
6a6ee92e36 [quant] Add op benchmark for CPU FakeQuantizePerChannel with float zero_points (#65241)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65241

Test Plan: Imported from OSS

Reviewed By: jingsh

Differential Revision: D31150087

Pulled By: b-koopman

fbshipit-source-id: a00d4995841eee81305d0007c908473cc3d5a727
2021-09-27 16:01:49 -07:00
Eddie Ren
9c73a48ecf ND Embeddings benchmark - Standardize randomized inputs (#64707)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64707

Use torch.randn instead of torch.from_numpy to generate the tensor
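
A hedged sketch of the change (variable names and sizes are illustrative, not the benchmark's actual code); note that `torch.randn` samples a normal distribution, whereas the NumPy round-trip below samples uniformly:

```python
import numpy as np
import torch

num_embeddings, embedding_dim = 80, 128  # illustrative sizes

# Before: generate in NumPy, then convert.
weights = torch.from_numpy(
    np.random.uniform(0, 1, size=(num_embeddings, embedding_dim)).astype(np.float32)
)

# After: generate the randomized input directly in PyTorch.
weights = torch.randn(num_embeddings, embedding_dim, dtype=torch.float32)
```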

Test Plan: buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test

Reviewed By: jingsh

Differential Revision: D30817302

fbshipit-source-id: 924c05517812b4b9f7df05a8999f9236cfe7b672
2021-09-13 06:47:35 -07:00
Eddie Ren
3fbb49e75d Extend 2Dim embedding bag benchmarking to include 3Dim benchmarks (#64647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64647

Add support for benchmarking 8-bit quantization of N-D batched embeddings (see the example below). Currently this only works for 3-dim embeddings; generalizing from 3-dim to N-dim still requires thought.
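
For context, the 2-dim byte prepack op that this extends (a hedged example; the 3-dim batched path is what this diff adds to the benchmark):

```python
import torch

# 8-bit (byte) row-wise quantization of an embedding weight matrix.
weight = torch.randn(10, 16, dtype=torch.float32)
packed = torch.ops.quantized.embedding_bag_byte_prepack(weight)
```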

Test Plan: ```buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test```

Reviewed By: jingsh

Differential Revision: D30770085

fbshipit-source-id: 26659020f3458991592065a05366bde0f060494e
2021-09-10 16:49:02 -07:00
Harut Movsisyan
956c8fa01e Microbenchmarking matrix mult (einsum, torch.mult, torch.mm) (#63654)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63654

Test Plan:
```
> buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:matrix_mult_test

# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: einsum_bmm
# Mode: Eager
# Name: einsum_bmm_B4_M5_N3_K2_cpu
# Input: B: 4, M: 5, N: 3, K: 2, device: cpu
Forward Execution Time (us) : 27.970

# Benchmarking PyTorch: einsum_bmm
# Mode: Eager
# Name: einsum_bmm_B32_M25_N20_K30_cpu
# Input: B: 32, M: 25, N: 20, K: 30, device: cpu
Forward Execution Time (us) : 41.830

# Benchmarking PyTorch: einsum_bmm
# Mode: Eager
# Name: einsum_bmm_B128_M100_N120_K110_cpu
# Input: B: 128, M: 100, N: 120, K: 110, device: cpu
Forward Execution Time (us) : 499.114

# Benchmarking PyTorch: bmm
# Mode: Eager
# Name: bmm_B4_M5_N3_K2_cpu
# Input: B: 4, M: 5, N: 3, K: 2, device: cpu
Forward Execution Time (us) : 6.268

# Benchmarking PyTorch: bmm
# Mode: Eager
# Name: bmm_B32_M25_N20_K30_cpu
# Input: B: 32, M: 25, N: 20, K: 30, device: cpu
Forward Execution Time (us) : 12.676

# Benchmarking PyTorch: bmm
# Mode: Eager
# Name: bmm_B128_M100_N120_K110_cpu
# Input: B: 128, M: 100, N: 120, K: 110, device: cpu
Forward Execution Time (us) : 438.219

# Benchmarking PyTorch: einsum_elementwise
# Mode: Eager
# Name: einsum_elementwise_B4_M5_N3_cpu
# Input: B: 4, M: 5, N: 3, device: cpu
Forward Execution Time (us) : 7.657

# Benchmarking PyTorch: einsum_elementwise
# Mode: Eager
# Name: einsum_elementwise_B32_M25_N20_cpu
# Input: B: 32, M: 25, N: 20, device: cpu
Forward Execution Time (us) : 18.523

# Benchmarking PyTorch: einsum_elementwise
# Mode: Eager
# Name: einsum_elementwise_B100_M90_N110_cpu
# Input: B: 100, M: 90, N: 110, device: cpu
Forward Execution Time (us) : 55.103

# Benchmarking PyTorch: mul
# Mode: Eager
# Name: mul_B4_M5_N3_cpu
# Input: B: 4, M: 5, N: 3, device: cpu
Forward Execution Time (us) : 2.501

# Benchmarking PyTorch: mul
# Mode: Eager
# Name: mul_B32_M25_N20_cpu
# Input: B: 32, M: 25, N: 20, device: cpu
Forward Execution Time (us) : 10.589

# Benchmarking PyTorch: mul
# Mode: Eager
# Name: mul_B100_M90_N110_cpu
# Input: B: 100, M: 90, N: 110, device: cpu
Forward Execution Time (us) : 50.102
```

Reviewed By: ajyu

Differential Revision: D30455179

fbshipit-source-id: 9f2d92b2d2b860f41a8e59be2cc086d75b587f7b
2021-08-24 16:26:26 -07:00
Supriya Rao
7a15576a65 [quant] update FakeQuant modules to use tensor qparams (#61318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61318

Remove the `float()` and `int()` calls in the forward function so that we can directly use the tensor qparams in the fake_quantize operator.

Calling `float()`/`int()` internally calls `item()`, which can trigger a GPU->CPU copy if the original tensors reside on the GPU (sketched below).
Local benchmark P427668213
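
A hedged sketch of the effect on the public op (tensor qparams were introduced by this stack; the values below are illustrative):

```python
import torch

x = torch.randn(64)
scale = torch.tensor([0.1])                       # tensor qparam, stays on device
zero_point = torch.tensor([0], dtype=torch.int32)

# Before (sketch): float(scale) / int(zero_point) went through item(), which
# syncs GPU -> CPU when the qparams live on the GPU.
# After: pass the tensors straight through; no sync in the forward pass.
y = torch.fake_quantize_per_tensor_affine(x, scale, zero_point, 0, 255)
```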

Before this change
```
                                               Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
---------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                     aten::_aminmax         2.57%       1.507ms         3.10%       1.819ms      36.371us       2.872ms         4.81%       2.872ms      57.446us            50
              aten::fake_quantize_per_tensor_affine         1.04%     610.915us         3.60%       2.114ms      42.276us     472.896us         0.79%       2.698ms      53.962us            50
    aten::fake_quantize_per_tensor_affine_cachemask         1.69%     993.626us         2.56%       1.503ms      30.058us       2.225ms         3.73%       2.225ms      44.504us            50
                                   aten::is_nonzero         3.85%       2.258ms        19.68%      11.540ms      46.161us       2.168ms         3.63%      11.084ms      44.336us           250
                                   aten::zeros_like         1.82%       1.064ms         6.65%       3.901ms      39.007us       1.531ms         2.57%       3.905ms      39.045us           100
                                           aten::eq        13.80%       8.093ms        25.90%      15.189ms      37.972us       9.580ms        16.05%      15.566ms      38.914us           400
                                         aten::item         5.67%       3.323ms        21.50%      12.607ms      36.019us       3.233ms         5.42%      12.167ms      34.762us           350
                                        aten::zeros         0.94%     549.208us         2.93%       1.717ms      34.343us     688.928us         1.15%       1.695ms      33.894us            50
                                           aten::le         2.52%       1.478ms         4.50%       2.641ms      26.411us       1.753ms         2.94%       2.845ms      28.448us           100
                                         aten::rsub         1.04%     608.715us         2.44%       1.433ms      28.667us     532.000us         0.89%       1.418ms      28.353us            50
                                          aten::max         1.54%     905.401us         4.62%       2.711ms      27.106us     847.488us         1.42%       2.697ms      26.969us           100
                                         aten::ones         0.92%     542.159us         2.16%       1.266ms      25.324us     661.856us         1.11%       1.301ms      26.017us            50
                                          aten::min         0.82%     479.167us         2.15%       1.258ms      25.160us     407.808us         0.68%       1.276ms      25.530us            50
                          aten::_local_scalar_dense        15.83%       9.284ms        15.83%       9.284ms      26.526us       8.934ms        14.97%       8.934ms      25.524us           350
                                        aten::clamp         2.35%       1.378ms         4.21%       2.467ms      24.669us       1.546ms         2.59%       2.461ms      24.612us           100
                                        aten::zero_         2.53%       1.482ms         5.65%       3.316ms      22.108us       1.326ms         2.22%       3.380ms      22.531us           150
                                      aten::maximum         3.08%       1.805ms         3.08%       1.805ms      18.052us       1.849ms         3.10%       1.849ms      18.494us           100
                                      aten::minimum         1.33%     778.854us         1.33%     778.854us      15.577us     868.672us         1.46%     868.672us      17.373us            50
                                        aten::round         1.36%     799.910us         1.36%     799.910us      15.998us     809.568us         1.36%     809.568us      16.191us            50
                                        aten::copy_         6.61%       3.878ms         6.61%       3.878ms      15.513us       4.036ms         6.76%       4.036ms      16.143us           250
                                          aten::div         2.53%       1.483ms         2.53%       1.483ms      14.833us       1.535ms         2.57%       1.535ms      15.353us           100
                                          aten::mul         2.44%       1.431ms         2.44%       1.431ms      14.314us       1.478ms         2.48%       1.478ms      14.782us           100
                                       aten::detach         1.46%     855.670us         2.41%       1.411ms      14.110us     832.448us         1.39%       1.395ms      13.949us           100
                                          aten::add         2.22%       1.301ms         2.22%       1.301ms      13.008us       1.383ms         2.32%       1.383ms      13.828us           100
                                        aten::fill_         4.18%       2.452ms         4.18%       2.452ms      12.262us       2.693ms         4.51%       2.693ms      13.463us           200
                                          aten::sub         5.06%       2.967ms         5.06%       2.967ms      14.837us       2.675ms         4.48%       2.675ms      13.374us           200
                                           aten::to         2.10%       1.230ms         3.65%       2.140ms      10.701us       1.310ms         2.20%       2.062ms      10.310us           200
                                       aten::select         1.28%     749.144us         1.49%     874.227us       8.742us     863.232us         1.45%     863.232us       8.632us           100
                                             detach         0.95%     555.326us         0.95%     555.326us       5.553us     562.496us         0.94%     562.496us       5.625us           100
                                   aten::as_strided         0.40%     232.289us         0.40%     232.289us       1.161us       0.000us         0.00%       0.000us       0.000us           200
                                        aten::empty         2.93%       1.720ms         2.93%       1.720ms       3.439us       0.000us         0.00%       0.000us       0.000us           500
                                      aten::resize_         1.04%     611.313us         1.04%     611.313us       2.038us       0.000us         0.00%       0.000us       0.000us           300
                                   aten::empty_like         0.75%     438.585us         1.77%       1.036ms       5.180us       0.000us         0.00%       0.000us       0.000us           200
                                aten::empty_strided         1.36%     799.442us         1.36%     799.442us       3.198us       0.000us         0.00%       0.000us       0.000us           250
---------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 58.645ms
Self CUDA time total: 59.674ms
```

After this change
```

test_fake_quant_profiler (scripts.supriyar.benchmark.module_bench.ProfilerBench) ... -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                  aten::fake_quantize_per_tensor_affine         0.98%     505.210us         4.38%       2.259ms      45.187us     419.424us         0.78%       3.218ms      64.367us            50
                                         aten::_aminmax         2.78%       1.434ms         3.42%       1.766ms      35.321us       2.825ms         5.27%       2.825ms      56.505us            50
aten::fake_quantize_per_tensor_affine_cachemask_tens...         2.38%       1.229ms         3.40%       1.754ms      35.083us       2.799ms         5.22%       2.799ms      55.979us            50
                                             aten::rsub         0.94%     485.040us         5.02%       2.590ms      51.793us     458.976us         0.86%       2.587ms      51.747us            50
                                       aten::is_nonzero         3.78%       1.952ms        23.64%      12.196ms      48.786us       2.055ms         3.83%      11.986ms      47.944us           250
                                             aten::item         6.92%       3.572ms        19.86%      10.244ms      40.977us       3.670ms         6.85%       9.931ms      39.724us           250
                                       aten::zeros_like         1.65%     848.874us         6.64%       3.426ms      34.260us       1.397ms         2.61%       3.572ms      35.717us           100
                                            aten::zeros         0.85%     436.691us         3.00%       1.549ms      30.984us     551.936us         1.03%       1.576ms      31.516us            50
                                               aten::eq        10.60%       5.467ms        20.26%      10.452ms      26.130us       7.018ms        13.09%      10.832ms      27.079us           400
                                               aten::le         2.58%       1.332ms         4.67%       2.407ms      24.074us       1.580ms         2.95%       2.614ms      26.144us           100
                              aten::_local_scalar_dense        12.93%       6.673ms        12.93%       6.673ms      26.691us       6.261ms        11.68%       6.261ms      25.046us           250
                                            aten::clamp         2.43%       1.253ms         4.37%       2.256ms      22.560us       1.431ms         2.67%       2.273ms      22.725us           100
                                             aten::ones         0.89%     460.133us         2.18%       1.123ms      22.467us     570.496us         1.06%       1.128ms      22.551us            50
                                              aten::min         0.74%     383.132us         2.06%       1.065ms      21.296us     377.536us         0.70%       1.091ms      21.824us            50
                                            aten::zero_         2.36%       1.219ms         5.87%       3.029ms      20.194us       1.261ms         2.35%       3.199ms      21.327us           150
                                              aten::max         1.51%     779.081us         4.06%       2.096ms      20.960us     791.680us         1.48%       2.130ms      21.295us           100
                                              aten::sub         7.97%       4.111ms         7.97%       4.111ms      20.556us       3.847ms         7.18%       3.847ms      19.234us           200
                                              aten::div         2.94%       1.516ms         2.94%       1.516ms      15.158us       1.580ms         2.95%       1.580ms      15.798us           100
                                            aten::round         1.45%     750.445us         1.45%     750.445us      15.009us     756.064us         1.41%     756.064us      15.121us            50
                                            aten::copy_         6.88%       3.548ms         6.88%       3.548ms      14.190us       3.701ms         6.90%       3.701ms      14.803us           250
                                          aten::minimum         1.32%     681.654us         1.32%     681.654us      13.633us     713.664us         1.33%     713.664us      14.273us            50
                                          aten::maximum         2.55%       1.317ms         2.55%       1.317ms      13.169us       1.338ms         2.50%       1.338ms      13.378us           100
                                              aten::mul         2.63%       1.358ms         2.63%       1.358ms      13.581us       1.328ms         2.48%       1.328ms      13.283us           100
                                           aten::detach         1.34%     688.820us         2.35%       1.211ms      12.110us     772.800us         1.44%       1.278ms      12.779us           100
                                            aten::fill_         4.53%       2.338ms         4.53%       2.338ms      11.692us       2.495ms         4.65%       2.495ms      12.473us           200
                                              aten::add         2.32%       1.197ms         2.32%       1.197ms      11.968us       1.240ms         2.31%       1.240ms      12.405us           100
                                               aten::to         2.07%       1.069ms         3.66%       1.889ms       9.443us       1.224ms         2.28%       1.975ms       9.874us           200
                                           aten::select         1.44%     743.042us         1.64%     848.207us       8.482us     641.600us         1.20%     641.600us       6.416us           100
                                                 detach         1.01%     522.155us         1.01%     522.155us       5.222us     505.088us         0.94%     505.088us       5.051us           100
                                       aten::as_strided         0.44%     227.884us         0.44%     227.884us       1.139us       0.000us         0.00%       0.000us       0.000us           200
                                            aten::empty         3.20%       1.652ms         3.20%       1.652ms       3.304us       0.000us         0.00%       0.000us       0.000us           500
                                          aten::resize_         1.25%     646.711us         1.25%     646.711us       2.156us       0.000us         0.00%       0.000us       0.000us           300
                                       aten::empty_like         0.79%     407.768us         2.07%       1.067ms       5.334us       0.000us         0.00%       0.000us       0.000us           200
                                    aten::empty_strided         1.52%     785.788us         1.52%     785.788us       3.143us       0.000us         0.00%       0.000us       0.000us           250
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 51.590ms
Self CUDA time total: 53.609ms
```
ghstack-source-id: 133370215

Test Plan: buck test mode/dev-nosan caffe2/test/:quantization

Reviewed By: raghuramank100

Differential Revision: D29566512

fbshipit-source-id: 1aefca51f99949da7334bcfe504848275c9f952c
2021-07-10 19:43:02 -07:00
Kimish Patel
3176f16691 [Pytorch benchmark] Add BMM benchmark (#59595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59595

ghstack-source-id: 130946743

Test Plan: bmm_test

Reviewed By: mingzhe09088

Differential Revision: D28873228

fbshipit-source-id: 6e4cb04bb6c63f5f68d8f23c13738e2d58ab499c
2021-06-10 08:24:29 -07:00
Kimish Patel
8b63573c31 [PyTorch Operator Benchmark] gelu benchmark (#59334)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59334

Add gelu op benchmark
ghstack-source-id: 130947172

Test Plan: gelu_test

Reviewed By: hl475

Differential Revision: D28842959

fbshipit-source-id: 93e23e027a488412488ecf22335d7d915f6cc3b4
2021-06-09 16:09:37 -07:00
Rong Rong (AI Infra)
277f587496 rename benchmark_cpp_extension (#58708)
Summary:
Currently the cpp_extension build in benchmarks is misleading, as it has the same name as torch.utils.cpp_extension

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58708

Test Plan:
Run from `./benchmarks/operator_benchmark/pt_extension` folder:
```
python setup.py install
python cpp_extension_test.py
```

Note: CI doesn't matter here, as the benchmarks/ folder is currently not compiled/tested in CI

Reviewed By: robieta

Differential Revision: D28585582

Pulled By: walterddr

fbshipit-source-id: fc071040cf3cb52ee6c9252b2c5a0c3043393f57
2021-05-24 11:04:02 -07:00
Peter Bell
0c2d38264a Improve BatchNorm1d performance (CUDA) (#57786)
Summary:
Part of gh-38915, resubmit of gh-57034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57786

Reviewed By: mruberry

Differential Revision: D28290284

Pulled By: ngimel

fbshipit-source-id: 8768578ba9ace6a948cb8145c0091e0ea49b12da
2021-05-08 19:09:29 -07:00
Sam Estep
2992ff3fb8 Revert D28142447: Improve BatchNorm1d performance (CUDA)
Test Plan: revert-hammer

Differential Revision:
D28142447 (b2936ad8fa)

Original commit changeset: c70109780e20

fbshipit-source-id: e93f6d00d644697b106f5ea8ab79872f353b51c6
2021-05-06 15:01:19 -07:00
Peter Bell
b2936ad8fa Improve BatchNorm1d performance (CUDA) (#57034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57034

Resolves gh-38915

For the example given in the issue, BatchNorm1d on cuDNN is around 12x slower
than BatchNorm2d. Internally, cuDNN expects at least a 4d tensor (N, C, H, W)
so these two modules actually call the same cuDNN code. My assumption is that
cuDNN just isn't optimized for H=W=1.

Instead, this disables cuDNN for 2d batch_norm inputs and improves the CUDA
implementation of `native_batch_norm` to be competitive with cuDNN. For the
example in the issue, `BatchNorm1d` now takes 335 us compared to 6.3 ms before,
an 18x speedup.
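
One hedged way to reproduce the comparison with `torch.utils.benchmark` (the channel count and batch size below are assumptions, not necessarily the shapes from gh-38915):

```python
import torch
import torch.nn as nn
from torch.utils.benchmark import Timer

bn = nn.BatchNorm1d(256).cuda()
x = torch.randn(65536, 256, device='cuda')  # 2d (N, C) input, i.e. H = W = 1 internally

t = Timer(stmt='bn(x)', globals={'bn': bn, 'x': x})
print(t.blocked_autorange())  # compare per-call time before/after this change
```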

Before this change, nvprof shows:
```
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.64%  630.95ms       100  6.3095ms  5.6427ms  8.8800ms  void cudnn::bn_fw_tr_1C11_kernel_NCHW<float, float, int=512, bool=0, int=2>(cudnnTensorStruct, float const *, cudnn::bn_fw_tr_1C11_kernel_NCHW<float, float, int=512, bool=0, int=2>, cudnnTensorStruct*, float const *, float const , cudnnTensorStruct*, cudnnTensorStruct*, cudnnTensorStruct**, float const *, float const *, float const *, cudnnTensorStruct*, cudnnTensorStruct*)
```

But after, it shows:
```
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   54.76%  14.352ms       100  143.52us  123.52us  756.28us  _ZN2at6native27unrolled_elementwise_kernelIZZZNS0_72_GLOBAL__N__48_tmpxft_001e82d0_00000000_7_Normalization_cpp1_ii_db66e07022batch_norm_elementwiseERKNS_6TensorES5_RKN3c108optionalIS3_EESA_S5_S5_ENKUlvE_clEvENKUlvE2_clEvEUlfffffE_NS_6detail5ArrayIPcLi6EEE16OffsetCalculatorILi5EjESI_ILi1EjENS0_6memory15LoadWithoutCastENSL_16StoreWithoutCastEEEviT_T0_T1_T2_T3_T4_
                   35.09%  9.1951ms       100  91.950us  84.415us  362.17us  void at::native::reduce_kernel<int=256, int=2, at::native::ReduceOp<float, at::native::WelfordOps<float, float, int, float, thrust::pair<float, float>>, unsigned int, float, int=2>>(float)
                    0.71%  186.14us       100  1.8610us  1.8240us  1.9840us  _ZN2at6native72_GLOBAL__N__48_tmpxft_001e82d0_00000000_7_Normalization_cpp1_ii_db66e07045unrolled_elementwise_kernel_for_multi_outputsILi3EZZZNS1_34batch_norm_update_stats_and_invertERKNS_6TensorES5_S5_S5_ddlENKUlvE_clEvENKUlvE2_clEvEUlffffE_NS_6detail5ArrayIPcLi7EEE23TrivialOffsetCalculatorILi4EjESD_ILi3EjEEEviT0_T1_T2_T3_
                     0.59%  153.37us       100  1.5330us  1.4720us  2.6240us  void at::native::vectorized_elementwise_kernel<int=4, at::native::BUnaryFunctor<at::native::AddFunctor<long>>, at::detail::Array<char*, int=2>>(int, long, at::native::AddFunctor<long>)
```

I think there is similar scope to improve the backward implementation.

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D28142447

Pulled By: ngimel

fbshipit-source-id: c70109780e206fa85e50a31e90a1cb4c533199da
2021-05-06 12:14:02 -07:00
Sam Estep
e3900d2ba5 Add lint for unqualified noqa (#56272)
Summary:
As this diff shows, currently there are a couple hundred instances of raw `noqa` in the codebase, which just ignore all errors on a given line. That isn't great, so this PR changes all existing instances of that antipattern to qualify the `noqa` with respect to a specific error code, and adds a lint to prevent more of this from happening in the future.

Interestingly, some of the examples the `noqa` lint catches are genuine attempts to qualify the `noqa` with a specific error code, such as these two:
```
test/jit/test_misc.py:27:            print(f"{hello + ' ' + test}, I'm a {test}") # noqa E999
test/jit/test_misc.py:28:            print(f"format blank") # noqa F541
```
However, those are still wrong because they are [missing a colon](https://flake8.pycqa.org/en/3.9.1/user/violations.html#in-line-ignoring-errors), which actually causes the error code to be completely ignored:

- If you change them to anything else, the warnings will still be suppressed.
- If you add the necessary colons then it is revealed that `E261` was also being suppressed, unintentionally:
  ```
  test/jit/test_misc.py:27:57: E261 at least two spaces before inline comment
  test/jit/test_misc.py:28:35: E261 at least two spaces before inline comment
  ```

I did try using [flake8-noqa](https://pypi.org/project/flake8-noqa/) instead of a custom `git grep` lint, but it didn't seem to work. This PR is definitely missing some of the functionality that flake8-noqa is supposed to provide, though, so if someone can figure out how to use it, we should do that instead.
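
In short, the colon is what scopes the suppression; a minimal illustration (not taken from the diff):

```python
import os  # noqa: F401  -- suppresses only F401 ("imported but unused")
import sys  # noqa F401  -- missing colon: flake8 ignores ALL codes on this line
```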

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56272

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI run (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2365189927

Reviewed By: janeyx99

Differential Revision: D27830127

Pulled By: samestep

fbshipit-source-id: d6dcf4f945ebd18cd76c46a07f3b408296864fcb
2021-04-19 13:16:18 -07:00
Sam Estep
cc11aaaa60 Disallow non-breaking spaces (#55465)
Summary:
malfet found a couple of these in https://github.com/pytorch/pytorch/issues/55346; this PR removes the rest and adds a lint that prevents them from being accidentally added again in the future. It also removes the `-o` flag added in https://github.com/pytorch/pytorch/issues/53733 (which was unnecessarily hiding context without reducing the number of lines of output), and updates the lint error messages to reflect that the individual line numbers are shown in the logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55465

Test Plan:
The "Lint / quick-checks" job in GitHub Actions should succeed on this PR. To verify that the lint does correctly find and error on non-breaking spaces, checkout ece075195d and run it locally:
```sh
(! git --no-pager grep -In $'\u00a0' -- . || (echo "The above lines have non-breaking spaces (U+00A0); please convert them to spaces (U+0020)"; false))
```
It should print over a hundred lines of output and exit with status 1.

Reviewed By: janeyx99

Differential Revision: D27622136

Pulled By: samestep

fbshipit-source-id: e7ffd5a9519093e7a0ffdf55e9291f63e21ce841
2021-04-08 15:44:44 -07:00
vfdev-5
2b07bcf9eb [operator benchmarks] Added more interpolation test cases (#54584)
Summary:
Description:
- Added uint8 nearest test case
- Added 3d vectorization test case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54584

Reviewed By: malfet

Differential Revision: D27291303

Pulled By: fmassa

fbshipit-source-id: 236ee5af351c8dc34ec3cdb7dda662c77feb8cf0
2021-03-24 11:46:27 -07:00
Haichuan Yang
25a9f45a5a fix broken quantization_test in operator_benchmark (#53153)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53153

This diff fixes quantization_test in operator_benchmark, which broke when the py_module for learnable fake_quantization was removed.
ghstack-source-id: 123103477

Test Plan: `buck run mode/opt //caffe2/benchmarks/operator_benchmark/pt:quantization_test`

Reviewed By: z-a-f

Differential Revision: D26764881

fbshipit-source-id: 8d40c6eb5e7090ca65f48982c837f7dc87d14378
2021-03-08 12:12:57 -08:00
Sam Estep
8c798e0622 Forbid trailing whitespace (#53406)
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857

These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
  - `GLOSSARY.md`
  - `aten/src/ATen/core/op_registration/README.md`
  - `scripts/README.md`
  - `torch/csrc/jit/codegen/fuser/README.md`

The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```

I looked over the auto-generated changes and didn't see anything that looked problematic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406

Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377

This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348

Reviewed By: walterddr, seemethere

Differential Revision: D26856620

Pulled By: samestep

fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
2021-03-05 17:22:55 -08:00
Nicolas Hug
5095332ab9 Minor cleanup of interpolate microbenchmark
Summary: Minor cleanup, addresses comments from https://www.internalfb.com/diff/D26780116 (1559fa6a5c)

Test Plan:
```
➜  vision buck run //caffe2/benchmarks/operator_benchmark/pt:interpolate_test -- --tag_filter short
Parsing buck files: finished in 0.6 sec
Building: finished in 6.2 sec (100%) 10951/10951 jobs, 0 updated
  Total time: 6.9 sec
/data/users/nicolashug/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/interpolate_test#link-tree/torch/utils/cpp_extension.py:3: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastTrue_modenearest
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: True, mode: nearest
Forward Execution Time (us) : 1346.156

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastTrue_modelinear
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: True, mode: linear
Forward Execution Time (us) : 1283.784

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastTrue_modebicubic
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: True, mode: bicubic
Forward Execution Time (us) : 4769.578

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastFalse_modenearest
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: False, mode: nearest
Forward Execution Time (us) : 982.910

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastFalse_modelinear
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: False, mode: linear
Forward Execution Time (us) : 1182.191

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastFalse_modebicubic
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: False, mode: bicubic
Forward Execution Time (us) : 3545.873

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,600,400)_output_size(240,240)_channels_lastTrue_modenearest
# Input: input_size: (1, 3, 600, 400), output_size: (240, 240), channels_last: True, mode: nearest
Forward Execution Time (us) : 34373.955

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,600,400)_output_size(240,240)_channels_lastTrue_modelinear
# Input: input_size: (1, 3, 600, 400), output_size: (240, 240), channels_last: True, mode: linear
Forward Execution Time (us) : 42248.109

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,600,400)_output_size(240,240)_channels_lastTrue_modebicubic
# Input: input_size: (1, 3, 600, 400), output_size: (240, 240), channels_last: True, mode: bicubic
Forward Execution Time (us) : 405944.286
...
```

Reviewed By: fmassa

Differential Revision: D26782757

fbshipit-source-id: 2039e1e6b4fea2b56bb4bcf2a017476f928e4928
2021-03-04 05:36:28 -08:00
vfdev-5
1559fa6a5c [operator benchmarks] Added more modes to interpolation tests (#53186)
Summary:
Description:
- Added more modes: bicubic and nearest to interpolation tests
- Added a test case for downsampling a small image

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53186

Reviewed By: albanD

Differential Revision: D26780116

Pulled By: fmassa

fbshipit-source-id: f4f498e6e1da1ec131e6d9d9f42dc482135ae9e2
2021-03-03 09:18:38 -08:00
vfdev-5
cb1596a193 [operator_benchmark] Added channels last 3d option to interpolate test (#53117)
Summary:
Description:

- Added channels last 3d option to interpolate test
  - split config non-4d into two : 3d and 5d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53117

Reviewed By: NicolasHug

Differential Revision: D26754243

Pulled By: fmassa

fbshipit-source-id: 49bbab3bb47de27790e39537d0fbeca0f01782c4
2021-03-02 11:54:45 -08:00
Nicolas Hug
9cf6be6b3e Fix torch.nn.functional.interpolate microbenchmark for non-4D inputs
Summary: This diff fixes the `interpolate` microbenchmark for non-4D inputs, which are not supported by the `bilinear` mode
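
For context, `mode` must match the input dimensionality (`linear` for 3D, `bilinear` for 4D, `trilinear` for 5D); a small example using the shapes from the test plan below:

```python
import torch
import torch.nn.functional as F

F.interpolate(torch.randn(1, 3, 16, 320, 320), size=(8, 256, 256), mode='trilinear')  # 5D
F.interpolate(torch.randn(4, 512, 320), size=256, mode='linear')                       # 3D
F.interpolate(torch.randn(1, 3, 60, 40), size=(24, 24), mode='bilinear')               # 4D
```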

Test Plan:
5D and 3D:

```
# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,16,320,320)_output_size(8,256,256)
# Input: input_size: (1, 3, 16, 320, 320), output_size: (8, 256, 256)
Forward Execution Time (us) : 221008.660

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(4,512,320)_output_size(256,)
# Input: input_size: (4, 512, 320), output_size: (256,)
Forward Execution Time (us) : 9727.900

```

4D
```
# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastTrue
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: True
Forward Execution Time (us) : 375.181

```

Reviewed By: fmassa

Differential Revision: D26486678

fbshipit-source-id: 5d476afba3f35da9f8b86db16e21505bdb00888b
2021-02-18 02:07:54 -08:00
Vuk Radovic
4501b52fe5 Benchmark for torch.ops.quantized.linear_prepack_fp16 operator (#52229)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52229

Create benchmarks for the
torch.ops.quantized.linear_prepack_fp16 and torch.ops.quantized.linear_unpack_fp16 operators.

The benchmarks for these operators are written in the same format as the benchmarks for the other operators; a usage sketch of the ops follows below.
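
For reference, the ops under benchmark can be exercised directly; a hedged example (requires a PyTorch build with FBGEMM, and the shape is illustrative):

```python
import torch

W = torch.randn(64, 256)  # (N, K) weight
packed = torch.ops.quantized.linear_prepack_fp16(W)
W_back, bias = torch.ops.quantized.linear_unpack_fp16(packed)
```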

Test Plan:
linear_prepack_fp16 test was successfully run with various parameters:

Sample test run output:
```
 ----------------------------------------
 PyTorch/Caffe2 Operator Micro-benchmarks
 ----------------------------------------
 Tag : long

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M8_N32_K256_cpu
 Input: M: 8, N: 32, K: 256, device: cpu
Forward Execution Time (us) : 14.002

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M8_N32_K512_cpu
 Input: M: 8, N: 32, K: 512, device: cpu
Forward Execution Time (us) : 14.114

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M8_N64_K256_cpu
 Input: M: 8, N: 64, K: 256, device: cpu
Forward Execution Time (us) : 19.355

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M8_N64_K512_cpu
 Input: M: 8, N: 64, K: 512, device: cpu
Forward Execution Time (us) : 19.056

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M128_N32_K256_cpu
 Input: M: 128, N: 32, K: 256, device: cpu
Forward Execution Time (us) : 115.963

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M128_N32_K512_cpu
 Input: M: 128, N: 32, K: 512, device: cpu
Forward Execution Time (us) : 116.259

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M128_N64_K256_cpu
 Input: M: 128, N: 64, K: 256, device: cpu
Forward Execution Time (us) : 229.336

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M128_N64_K512_cpu
 Input: M: 128, N: 64, K: 512, device: cpu
Forward Execution Time (us) : 220.016
```

linear_unpack_fp16 test was successfully run with identical parameters

Reviewed By: b-koopman

Differential Revision: D26403343

fbshipit-source-id: 11a98e56177952b94f291006975b0b719f48d1b9
2021-02-17 08:02:01 -08:00
Nicolas Hug
50e6f0fdb6 Add benchmark for torch.nn.functional.interpolate
Summary:
This diff adds a new microbenchmark for the
`torch.nn.functional.interpolate` operator, using OpBench.
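
The benchmarked call for one of the `short` configs below, as a hedged reconstruction (the mode is an assumption; later commits added explicit mode coverage):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 60, 40).to(memory_format=torch.channels_last)  # channels_last: True
y = F.interpolate(x, size=(24, 24), mode='bilinear', align_corners=False)
```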

Test Plan:
```
[nicolashug@59262.od ~/fbsource/fbcode/caffe2/benchmarks/operator_benchmark/pt (39207820)]$ buck run //caffe2/benchmarks/operator_benchmark/pt:interpolate_test -- --tag_filter short
Starting new Buck daemon...
Buck daemon started.
Parsing buck files: finished in 06:30.7 min
Creating action graph: finished in 33.9 sec
Building: finished in 02:53.4 min (100%) 24224/24224 jobs, 24224 updated
  Total time: 09:58.2 min
/data/sandcastle/boxes/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/interpolate_test#link-tree/torch/utils/cpp_extension.py:3: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastTrue
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: True
Forward Execution Time (us) : 510.818

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastFalse
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: False
Forward Execution Time (us) : 684.324

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,600,400)_output_size(240,240)_channels_lastTrue
# Input: input_size: (1, 3, 600, 400), output_size: (240, 240), channels_last: True
Forward Execution Time (us) : 33791.970

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,600,400)_output_size(240,240)_channels_lastFalse
# Input: input_size: (1, 3, 600, 400), output_size: (240, 240), channels_last: False
Forward Execution Time (us) : 50120.585

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,320,320)_output_size(256,256)_channels_lastTrue
# Input: input_size: (1, 3, 320, 320), output_size: (256, 256), channels_last: True
Forward Execution Time (us) : 37668.089

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,320,320)_output_size(256,256)_channels_lastFalse
# Input: input_size: (1, 3, 320, 320), output_size: (256, 256), channels_last: False
Forward Execution Time (us) : 56869.472
```

Reviewed By: fmassa

Differential Revision: D26225318

fbshipit-source-id: 7757296192e630c42a6e4913c5c1d93af11d286d
2021-02-10 08:28:16 -08:00
Marat Subkhankulov
721ba97eb6 Create op benchmark for stack (#51263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51263

- Add benchmark for stack op

Test Plan:
```
buck build mode/opt //caffe2/benchmarks/operator_benchmark/pt:stack_test --show-output
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/stack_test.par --tag_filter=static_runtime | grep Execution

Forward Execution Time (us) : 6.380
Forward Execution Time (us) : 6.553
Forward Execution Time (us) : 14.904
Forward Execution Time (us) : 5.657
Forward Execution Time (us) : 5.612
Forward Execution Time (us) : 6.051
Forward Execution Time (us) : 4.225
Forward Execution Time (us) : 4.240
Forward Execution Time (us) : 6.280
Forward Execution Time (us) : 6.267
Forward Execution Time (us) : 418.932
Forward Execution Time (us) : 417.694
Forward Execution Time (us) : 1592.455
Forward Execution Time (us) : 2919.261
Forward Execution Time (us) : 211.458
Forward Execution Time (us) : 211.518
Forward Execution Time (us) : 783.953
Forward Execution Time (us) : 1457.823
Forward Execution Time (us) : 2032.816
Forward Execution Time (us) : 2090.662
Forward Execution Time (us) : 6487.098
Forward Execution Time (us) : 11874.702
Forward Execution Time (us) : 2123.830
Forward Execution Time (us) : 2195.453
Forward Execution Time (us) : 6435.978
Forward Execution Time (us) : 11852.205
Forward Execution Time (us) : 2036.526
Forward Execution Time (us) : 2055.618
Forward Execution Time (us) : 6417.192
Forward Execution Time (us) : 12468.744
Forward Execution Time (us) : 4959.704
Forward Execution Time (us) : 5121.823
Forward Execution Time (us) : 5082.105
Forward Execution Time (us) : 5395.936
Forward Execution Time (us) : 5162.756
Forward Execution Time (us) : 23798.080
Forward Execution Time (us) : 4957.921
Forward Execution Time (us) : 4971.234
Forward Execution Time (us) : 5005.909
Forward Execution Time (us) : 5159.614
Forward Execution Time (us) : 5013.221
Forward Execution Time (us) : 20238.741
Forward Execution Time (us) : 7632.439
Forward Execution Time (us) : 7589.376
Forward Execution Time (us) : 7859.937
Forward Execution Time (us) : 8214.213
Forward Execution Time (us) : 11606.562
Forward Execution Time (us) : 34612.919
```

Reviewed By: hlu1

Differential Revision: D25859143

fbshipit-source-id: a1b735ce87f57b5eb67e223e549248a2cd7663c1
2021-01-30 10:32:14 -08:00
Vasiliy Kuznetsov
983b8e6b62 fake_quant: add a more memory efficient version (#50561)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50561

Not for review yet, a bunch of TODOs need finalizing.

tl;dr: add an alternative implementation of `fake_quantize` which saves
a mask during the forward pass and uses it to calculate the backward (see the
sketch after the list below).

There are two benefits:

1. the backward function no longer needs the input Tensor, and it can be
gc'ed earlier by autograd.  On MobileNetV2, this reduces QAT overhead
by ~15% (TODO: link, and absolute numbers).  We add an additional mask Tensor
to pass around, but its size is 4x smaller than the input tensor. A
future optimization would be to pack the mask bitwise and unpack in the
backward.

2. the computation of `qval` can be done only once in the forward and
reused in the backward. No perf change observed, TODO verify with better
metrics.
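
A hedged sketch of the idea as a Python `autograd.Function`; the real implementation is an ATen kernel (`fake_quantize_per_tensor_affine_cachemask`, visible in the profiler tables elsewhere in this log), so the straight-through backward below is a reconstruction:

```python
import torch

class FakeQuantMemEfficient(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale, zero_point, quant_min, quant_max):
        q = torch.round(x / scale) + zero_point
        mask = (q >= quant_min) & (q <= quant_max)   # bool mask, 4x smaller than float x
        y = (torch.clamp(q, quant_min, quant_max) - zero_point) * scale
        ctx.save_for_backward(mask)                  # x itself can be freed by autograd
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        # Straight-through estimator: gradient flows only where x was in range.
        return grad_out * mask, None, None, None, None

y = FakeQuantMemEfficient.apply(torch.randn(8, requires_grad=True), 0.1, 0, 0, 255)
```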

TODO: describe in more detail

Test Plan:
OSS / torchvision / MobileNetV2
```
python references/classification/train_quantization.py
  --print-freq 1
  --data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/
  --output-dir ~/nfs/pytorch_vision_tests/
  --backend qnnpack
  --epochs 5
TODO paste results here
```

TODO more

Imported from OSS

Reviewed By: ngimel

Differential Revision: D25918519

fbshipit-source-id: ec544ca063f984de0f765bf833f205c99d6c18b6
2021-01-27 19:36:04 -08:00
Marat Subkhankulov
dea9af5c06 Cat benchmark: use mobile feed tensor shapes and torch.cat out-variant (#50778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50778

- use tensor shapes from ctr_mobilefeed merge net
- use the pt cat out-variant for a fairer comparison; otherwise the benchmark includes the time to construct the result tensor (illustrated below)
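
A hedged illustration of the out-variant point (shapes taken from the configs in the test plan below):

```python
import torch

a, b = torch.randn(20, 160), torch.randn(20, 14)
out = torch.empty(20, 174)
# With out=, the result buffer is preallocated, so the measured time excludes
# allocating the result tensor -- matching what the C2 Concat benchmark measures.
torch.cat([a, b], dim=1, out=out)
```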

Test Plan:
turbo off, devbig machine
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/c2/concat_test.par --tag_filter=static_runtime
```
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : static_runtime

# Benchmarking Caffe2: concat
# Name: concat_sizes(1,40)_N5_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: (1, 40), N: 5, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 0.619

# Benchmarking Caffe2: concat
# Name: concat_sizes[(1,160),(1,14)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(1, 160), (1, 14)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 0.369

# Benchmarking Caffe2: concat
# Name: concat_sizes[(1,20,40),(1,4,40),(1,5,40)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(1, 20, 40), (1, 4, 40), (1, 5, 40)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 0.590

# Benchmarking Caffe2: concat
# Name: concat_sizes[(1,580),(1,174)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(1, 580), (1, 174)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 0.412

# Benchmarking Caffe2: concat
# Name: concat_sizes(20,40)_N5_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: (20, 40), N: 5, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 2.464

# Benchmarking Caffe2: concat
# Name: concat_sizes[(20,160),(20,14)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(20, 160), (20, 14)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 1.652

# Benchmarking Caffe2: concat
# Name: concat_sizes[(20,20,40),(20,4,40),(20,5,40)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(20, 20, 40), (20, 4, 40), (20, 5, 40)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 9.312

# Benchmarking Caffe2: concat
# Name: concat_sizes[(20,580),(20,174)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(20, 580), (20, 174)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 6.532
```
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/cat_test.par --tag_filter=static_runtime
```
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : static_runtime

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,160),(1,14)]_N-1_dim1_cpu
# Input: sizes: [(1, 160), (1, 14)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 3.313

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,20,40),(1,4,40),(1,5,40)]_N-1_dim1_cpu
# Input: sizes: [(1, 20, 40), (1, 4, 40), (1, 5, 40)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 3.680

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,580),(1,174)]_N-1_dim1_cpu
# Input: sizes: [(1, 580), (1, 174)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 3.452

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,160),(20,14)]_N-1_dim1_cpu
# Input: sizes: [(20, 160), (20, 14)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 4.653

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,20,40),(20,4,40),(20,5,40)]_N-1_dim1_cpu
# Input: sizes: [(20, 20, 40), (20, 4, 40), (20, 5, 40)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 7.364

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,580),(20,174)]_N-1_dim1_cpu
# Input: sizes: [(20, 580), (20, 174)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 7.055
```

Reviewed By: hlu1

Differential Revision: D25839036

fbshipit-source-id: 7a6a234f41dfcc56246a80141fe0c84f769a5a85
2021-01-19 22:50:28 -08:00
Marat Subkhankulov
49896c48e0 Caffe2 Concat operator benchmark (#50449)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50449

Port the torch.cat operator benchmark to the caffe2 Concat op to measure the difference in performance.

The previous diff, D25738076, was abandoned to rerun the GitHub CI tests.

Test Plan:
Tested on devbig by running both pt and c2 benchmarks. Compiled with mode/opt

Inputs:
```
size, number of inputs, cat dimension, device
----------------------------------------------------
(1, 1, 1), N: 2, dim: 0, device: cpu
(512, 512, 2), N: 2, dim: 1, device: cpu
(128, 1024, 2), N: 2, dim: 1, device: cpu
(1024, 1024, 2), N: 2, dim: 0, device: cpu
(1025, 1023, 2), N: 2, dim: 1, device: cpu
(1024, 1024, 2), N: 2, dim: 2, device: cpu
[<function <lambda> at 0x7f922718e8c0>, 111, 65], N: 5, dim: 0, device: cpu
[96, <function <lambda> at 0x7f9226dad710>, 64], N: 5, dim: 1, device: cpu
[128, 64, <function <lambda> at 0x7f91a3625ef0>], N: 5, dim: 2, device: cpu
[<function <lambda> at 0x7f91a3625f80>, 32, 64], N: 50, dim: 0, device: cpu
[32, <function <lambda> at 0x7f91a3621050>, 64], N: 50, dim: 1, device: cpu
[33, 65, <function <lambda> at 0x7f91a36210e0>], N: 50, dim: 2, device: cpu
(64, 32, 4, 16, 32), N: 2, dim: 2, device: cpu
(16, 32, 4, 16, 32), N: 8, dim: 2, device: cpu
(9, 31, 5, 15, 33), N: 17, dim: 4, device: cpu
[<function <lambda> at 0x7f91a3621170>], N: 100, dim: 0, device: cpu
[<function <lambda> at 0x7f91a3621200>], N: 1000, dim: 0, device: cpu
[<function <lambda> at 0x7f91a3621290>], N: 2000, dim: 0, device: cpu
[<function <lambda> at 0x7f91a3621320>], N: 3000, dim: 0, device: cpu
```

```
pytorch: MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/cat_test.par --tag_filter=all
caffe2: MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/c2/concat_test.par --tag_filter=all
```
```
Metric: Forward Execution Time (us)

pytorch             | caffe2
--------------------------------
 4.066              | 0.312
 351.507            | 584.033
 184.649            | 292.157
 9482.895           | 6845.112
 9558.988           | 6847.511
 13730.016          | 14118.505
 6324.371           | 4840.883
 4613.497           | 3702.213
 7504.718           | 7889.751
 9882.978           | 7364.350
 10087.076          | 7483.178
 16849.556          | 18092.295
 19181.075          | 13363.742
 19296.508          | 13466.863
 34157.449          | 56320.073
 176.483            | 267.106
 322.247            | 352.782
 480.064            | 460.214
 607.381            | 476.908
```

Reviewed By: hlu1

Differential Revision: D25890595

fbshipit-source-id: f53e125c0680bc2ebf722d1da5ec964bec585fdd
2021-01-12 18:27:44 -08:00
Shijun Kong
2de345d44d Add op bench for caffe2 quantile op (#49598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49598

Add op bench for caffe2 quantile op

Test Plan: `buck run mode/opt caffe2/benchmarks/operator_benchmark/c2:quantile_op_test -- --warmup_iterations=10000 --iterations=10000`

Reviewed By: radkris-git

Differential Revision: D25590085

fbshipit-source-id: 0db58ac87c595b2bf2958f6299a1bf2ccea019db
2020-12-18 08:32:59 -08:00
Ansha Yu
cb3169d7a8 [aten] index_select dim 1 (#47077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47077

Add benchmarks for PT index_select, batch_index_select, and C2's BatchGather.
Add a batch_index_select implementation based on the C2 BatchGather implementation.

This currently falls back to index_select for the backward and CUDA implementations.

Alternatively, we could look into why index_select is slower and replace the
original implementation instead.
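
As a quick illustration of the operation being benchmarked (sizes here are arbitrary), `index_select` along dim 1 gathers columns per row, which is the batched-gather pattern that C2's BatchGather serves:

```python
import torch

x = torch.randn(256, 512)              # M x N table
idx = torch.randint(0, 512, (128,))    # column indices to gather
out = x.index_select(1, idx)           # gathers 128 columns from every row
assert out.shape == (256, 128)
```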

Test Plan:
./buck-out/opt/gen/caffe2/benchmarks/operator_benchmark/c2/batch_gather_test.par
./buck-out/opt/gen/caffe2/benchmarks/operator_benchmark/pt/index_select_test.par

PT results comparing three variants: no optimization, the block_size-1-only fix, and the full dim=1 implementation
```
# no optimization
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K1_dim1_cpu
# Input: M: 256, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 353.450

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K1_dim1_cpu
# Input: M: 512, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 862.492

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K2_dim1_cpu
# Input: M: 256, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 4555.344

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K2_dim1_cpu
# Input: M: 512, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 11003.279
```
```
# block size 1 only
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K1_dim1_cpu
# Input: M: 256, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 129.240

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K1_dim1_cpu
# Input: M: 512, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 266.776

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K2_dim1_cpu
# Input: M: 256, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 4508.593

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K2_dim1_cpu
# Input: M: 512, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 10391.655
```
```
# dim 1
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M8_N8_K1_dim1_cpu
# Input: M: 8, N: 8, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 3.736

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K1_dim1_cpu
# Input: M: 256, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 130.460

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K1_dim1_cpu
# Input: M: 512, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 267.706

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M8_N8_K2_dim1_cpu
# Input: M: 8, N: 8, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 4.187

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K2_dim1_cpu
# Input: M: 256, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 1739.550

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K2_dim1_cpu
# Input: M: 512, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 3468.332
```
C2 results:

```
# Benchmarking Caffe2: batch_gather
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1203 13:19:35.310904 782584 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: batch_gather_M8_N8_K1_devicecpu
# Input: M: 8, N: 8, K: 1, device: cpu
Forward Execution Time (us) : 0.308

# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M256_N512_K1_devicecpu
# Input: M: 256, N: 512, K: 1, device: cpu
Forward Execution Time (us) : 90.517

# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M512_N512_K1_devicecpu
# Input: M: 512, N: 512, K: 1, device: cpu
Forward Execution Time (us) : 200.009

# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M8_N8_K2_devicecpu
# Input: M: 8, N: 8, K: 2, device: cpu
Forward Execution Time (us) : 0.539

# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M256_N512_K2_devicecpu
# Input: M: 256, N: 512, K: 2, device: cpu
Forward Execution Time (us) : 1001.540

# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M512_N512_K2_devicecpu
# Input: M: 512, N: 512, K: 2, device: cpu
Forward Execution Time (us) : 2005.870
```

buck test dper3/dper3/modules/low_level_modules/tests:single_operators_test -- test_batch_gather

Reviewed By: hlu1

Differential Revision: D24630227

fbshipit-source-id: cd205a30d96a33d239f3266820ada9a90093cf91
2020-12-14 15:39:33 -08:00
Brian Hirsh
c7cc8a48c0 migrating some straggler pytorch ops in fbcode to the new registration API (#48954)
Summary:
I already migrated the majority of fbcode ops to the new registration API, but there are a few stragglers (mostly new files that were created in the last two weeks).

The goal is mostly to stamp out as much of the legacy registration API usage as possible, so that people only see the new API when they look around the code for examples of how to register their own ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48954

ghstack-source-id: 118140663

Test Plan: Ran buck targets for each file that I migrated

Reviewed By: ezyang

Differential Revision: D25380422

fbshipit-source-id: 268139a1d7b9ef14c07befdf9e5a31f15b96a48c
2020-12-09 14:42:29 -08:00
Yang Wang
0125e14c9a [OpBench] change relu entry point after D24747035
Summary: D24747035 (1478e5ec2a) removes the `nnq.functional.relu` entry point. Adjust the op benchmark to use `torch.nn.ReLU` accordingly.
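
A minimal sketch of the replacement (assuming a per-tensor quantized input, as in the qactivation benchmarks):

```python
import torch

relu = torch.nn.ReLU()
x = torch.randn(4, 4)
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)
qy = relu(qx)  # nn.ReLU also accepts quantized tensors, so one module covers both
```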

Test Plan: buck run caffe2/benchmarks/operator_benchmark/pt:qactivation_test -- --use_jit  --iterations 1 --warmup_iterations 1

Reviewed By: mingzhe09088

Differential Revision: D24961625

fbshipit-source-id: 5ed0ec7fa6d8cfefc8e7fc8324cf9a2a3e59de90
2020-11-13 15:38:27 -08:00
Yang Wang
9ee4f499f0 [OpBench] add _consume_op.list for processing input with type of List[Tensor] (#47890)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47890

As titled. Fixes failures when running `chunk_test`, `split_test`, `qobserver`, and `sort` in `qunary` in JIT mode: the output of `chunk_op` is a list of tensors, which the current `_consume_op` cannot handle.
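
Conceptually, the consume helper needs a TorchScript-visible overload typed for lists; a rough sketch under that assumption (the actual helper lives inside the benchmark framework):

```python
from typing import List

import torch

@torch.jit.script
def _consume(v: torch.Tensor) -> torch.Tensor:
    # keeps the op's output alive so the JIT cannot dead-code-eliminate it
    return v

@torch.jit.script
def _consume_op_list(v: List[torch.Tensor]) -> List[torch.Tensor]:
    # list-typed variant for ops like chunk/split that return List[Tensor]
    return v
```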

Test Plan:
OSS:
python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1 --use_jit

Reviewed By: mingzhe09088

Differential Revision: D24774105

fbshipit-source-id: 210a0345b8526ebf3c24f4d0794e20b2ff6cef3d
2020-11-12 23:29:40 -08:00
Yang Wang
8ff0b6fef8 [OpBenchMobile] Enable operator_benchmark to run the benchmark on mobile through AiBench (#47767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47767

This diff implements the functionality of running benchmarks on mobile on top of the operator_benchmark framework. It does so in a few steps:

1. create a scripted module from the existing benchmark case
2. run the mobile-specific optimization pass on the scripted module
3. run the scripted module on AiBench by calling its Python API

A small change in how a benchmark case is written is introduced so that local and mobile runs can share the same interface: inputs become arguments of the `forward` function, so that the mobile optimization pass can run successfully (otherwise everything would be optimized away by constant propagation).
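
Steps 1 and 2 correspond roughly to the standard scripting and mobile-optimization flow; a sketch (the AiBench submission in step 3 is internal and omitted):

```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

class AddBenchmark(torch.nn.Module):
    # inputs are forward() arguments, per the interface change described above
    def forward(self, input_one: torch.Tensor, input_two: torch.Tensor):
        return input_one + input_two

scripted = torch.jit.script(AddBenchmark())    # step 1: script the benchmark
mobile_module = optimize_for_mobile(scripted)  # step 2: mobile optimization pass
```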

Test Plan:
## local op_bench run

buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test --  --iterations 1 --warmup_iterations 1

buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test --  --iterations 1 --warmup_iterations 1 --use_jit

Exceptions: the `py_module` op in `FakeQuantizePerTensorBaseOpBenchmark` and `FakeQuantizePerChannelBaseOpBenchmark` under JIT mode. These tests also failed in the base version:

```
RuntimeError:
Module 'FakeQuantizePerChannelOpBenchmark' has no attribute 'op_func' (This function exists as an attribute on the Python module, but we failed to compile it to a TorchScript function.
The error stack is reproduced here:

Python builtin <built-in method apply of FunctionMeta object at 0x619000c652a0> is currently not supported in Torchscript:
  File "/data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/quantization_test#link-tree/quantization_test.py", line 260
    quant_min: int, quant_max: int
):
    return _LearnableFakeQuantizePerChannelOp.apply(input, scale, zero_point, axis, quant_min, quant_max, 1.0)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
:
  File "/data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/quantization_test#link-tree/quantization_test.py", line 313
        axis: int, quant_min: int, quant_max: int
    ):
        return self.op_func(input, scale, zero_point, axis, quant_min, quant_max)
               ~~~~~~~~~~~~ <--- HERE
```

A `_consume_op` typing mismatch affects chunk, split, qobserver, and sort in qunary; these will be fixed in D24774105.

## OSS test

python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1 --use_jit
python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1

## saved module graph
```
module __torch__.mobile_benchmark_utils.OpBenchmarkMobile {
  parameters {
  }
  attributes {
    training = True
    num_iters = 1
    benchmark = <__torch__.pt.add_test.___torch_mangle_4.AddBenchmark object at 0x6070001b8b50>
  }
  methods {
    method forward {
      graph(%self : __torch__.mobile_benchmark_utils.OpBenchmarkMobile):
        %12 : None = prim::Constant() # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:9:4
        %4 : bool = prim::Constant[value=1]() # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:10:8
        %1 : int = prim::GetAttr[name="num_iters"](%self)
         = prim::Loop(%1, %4) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:10:8
          block0(%i : int):
            %6 : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark = prim::GetAttr[name="benchmark"](%self)
            %7 : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark = prim::GetAttr[name="benchmark"](%self)
            %self.inputs_tuple : (Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu), Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu)) = prim::Constant[value=({0.48884}, {0.809042})]()
            %9 : Tensor, %10 : Tensor = prim::TupleUnpack(%self.inputs_tuple)
            %23 : int = prim::Constant[value=1]()
            %24 : Tensor = aten::add(%9, %10, %23) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/pt/add_test.py:39:15
            -> (%4)
        return (%12)

    }
  }
  submodules {
    module __torch__.pt.add_test.___torch_mangle_4.AddBenchmark {
      parameters {
      }
      attributes {
        mobile_optimized = True
      }
      methods {
        method forward {
          graph(%self : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark,
                %input_one.1 : Tensor,
                %input_two.1 : Tensor):
            %3 : int = prim::Constant[value=1]()
            %4 : Tensor = aten::add(%input_one.1, %input_two.1, %3) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/pt/add_test.py:39:15
            return (%4)

        }
        method get_inputs {
          graph(%self : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark):
            %self.inputs_tuple : (Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu), Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu)) = prim::Constant[value=({0.48884}, {0.809042})]()
            return (%self.inputs_tuple)

        }
      }
      submodules {
      }
    }
  }
}

```

Reviewed By: kimishpatel

Differential Revision: D24322214

fbshipit-source-id: 335317eca4f40c4083883eb41dc47caf25cbdfd1
2020-11-12 17:15:05 -08:00
Meng Wang
f692af209d add unittest for operator benchmark (#47678)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47678

Add a unittest for the operator benchmark framework.
It covers the cases below:
```
generate_c2_test
generate_c2_gradient_test
generate_pt_test
generate_pt_gradient_test
generate_pt_tests_from_op_list
```
Also fixed two issues (incorrect function signatures) found by the unittest in `benchmark_caffe2.py`.
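
For reference, a minimal sketch of the kind of generated PT case these unit tests exercise (illustrative names, assuming the public `operator_benchmark` API):

```python
import operator_benchmark as op_bench
import torch

add_config = op_bench.config_list(attr_names=['M'], attrs=[[8]], tags=['short'])

class AddBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, M):
        self.inputs = {
            'a': torch.rand(M, requires_grad=True),
            'b': torch.rand(M, requires_grad=True),
        }
        self.set_module_name('add')

    def forward(self, a, b):
        return a + b

# registers the forward case and the gradient case, matching the
# "Forward/Backward Execution Time" lines in the output below
op_bench.generate_pt_test(add_config, AddBenchmark)
op_bench.generate_pt_gradient_test(add_config, AddBenchmark)
```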

Test Plan:
arc lint
buck run caffe2/benchmarks/operator_benchmark:operator_benchmark_unittest
```
test_c2_single_op (operator_benchmark_unittest.BenchmarkTest) ... # ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1109 23:08:39.932207 639464 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: add_M8
# Input: M: 8
Forward Execution Time (us) : 36.474

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Backward Execution Time (us) : 42.281

ok
test_pt_list_of_ops (operator_benchmark_unittest.BenchmarkTest) ... # ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Forward Execution Time (us) : 36.579

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Backward Execution Time (us) : 42.734

# Benchmarking PyTorch: abs
# Mode: Eager
# Name: abs_M8
# Input: M: 8
Forward Execution Time (us) : 148.929

# Benchmarking PyTorch: abs_
# Mode: Eager
# Name: abs__M8
# Input: M: 8
Forward Execution Time (us) : 71.909

ok
test_pt_single_op (operator_benchmark_unittest.BenchmarkTest) ... # ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Forward Execution Time (us) : 36.860

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Backward Execution Time (us) : 42.293

# Benchmarking PyTorch: abs
# Mode: Eager
# Name: abs_M8
# Input: M: 8
Forward Execution Time (us) : 148.999

# Benchmarking PyTorch: abs_
# Mode: Eager
# Name: abs__M8
# Input: M: 8
Forward Execution Time (us) : 71.941

# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M8
# Input: M: 8
Forward Execution Time (us) : 179.108

# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M8
# Input: M: 8
Backward Execution Time (us) : 1205.902

ok
```
buck run caffe2/benchmarks/operator_benchmark/c2:add_test
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1109 23:20:11.551795 654290 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: add_M8_N16_K32_dtypeint
# Input: M: 8, N: 16, K: 32, dtype: int
Forward Execution Time (us) : 984.510

# Benchmarking Caffe2: add
# Name: add_M16_N16_K64_dtypefloat
# Input: M: 16, N: 16, K: 64, dtype: float
Forward Execution Time (us) : 68.526

# Benchmarking Caffe2: add
# Name: add_M64_N64_K128_dtypeint
# Input: M: 64, N: 64, K: 128, dtype: int
Forward Execution Time (us) : 101617.076
```

Reviewed By: mingzhe09088

Differential Revision: D24854414

fbshipit-source-id: 6676549909da6700b42f322c4ad6e8e2ef5b86b5
2020-11-10 15:45:36 -08:00
Radhakrishnan Venkataramani
163adb9fa7 Add HalfToFloat + FloatToHalf operators to PyTorch (#45092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45092

Adding two operators:
1. at::float_to_half -> converts an FP32 tensor to an FP16 tensor
2. at::half_to_float -> converts an FP16 tensor to an FP32 tensor

These operators internally use the kernel provided by FBGEMM; both C2 and PT use the same FBGEMM kernel underneath.
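
From the user's point of view this is the ordinary cast path; a sketch of the equivalent conversion (whether a plain cast dispatches to the new FBGEMM-backed kernels depends on the integration):

```python
import torch

x = torch.randn(512, 512)  # FP32 tensor
h = x.half()               # FP32 -> FP16, the conversion at::float_to_half targets
y = h.float()              # FP16 -> FP32, the conversion at::half_to_float targets
```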

Test Plan:
buck test //caffe2/test:torch -- .*test_half_tensor.*

Run benchmark locally using

```
buck run //caffe2/benchmarks/operator_benchmark/pt:tensor_to_test
```

AIBench results are pending; I don't expect them to finish soon, as we have a large queue with jobs pending for 2+ days.

Benchmark for a 512x512 tensor with the FBGEMM implementation:

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 1246.332

# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 1734.304
```

Benchmark for a 512x512 tensor on trunk, with no FBGEMM integration:

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 169045.724

# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 152382.494
```

Reviewed By: ngimel

Differential Revision: D23824869

fbshipit-source-id: ef044459b6c8c6e5ddded72080204c6a0ab4582c
2020-11-10 12:00:53 -08:00
Shijun Kong
220b3bd667 Add op benchmark for batch box cox as baseline (#47275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47275

```
# Benchmarking Caffe2: batch_box_cox
# Name: batch_box_cox_M64_N64_dtypedouble
# Input: M: 64, N: 64, dtype: double
Forward Execution Time (us) : 49.005
```
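
For context, the Box-Cox transform that the C2 op computes column-wise is roughly the following (a reference sketch, not the benchmarked kernel):

```python
import torch

def batch_box_cox(data: torch.Tensor, lambda1: torch.Tensor,
                  lambda2: torch.Tensor) -> torch.Tensor:
    # data: (M, N); lambda1, lambda2: (N,) per-column parameters.
    # y = ((x + l2) ** l1 - 1) / l1 when l1 != 0, else log(x + l2)
    shifted = data + lambda2  # broadcasts per column
    safe_l1 = torch.where(lambda1 == 0, torch.ones_like(lambda1), lambda1)
    power = (shifted.pow(safe_l1) - 1) / safe_l1
    return torch.where(lambda1 == 0, shifted.log(), power)
```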

Test Plan: `buck run mode/opt caffe2/benchmarks/operator_benchmark/c2:batch_box_cox_test -- --iterations=1000  --warmup 100`

Reviewed By: houseroad

Differential Revision: D24675426

fbshipit-source-id: 8bb1f3076dc6b01e7b63468136ddf3d9b6d7e5d2
2020-11-05 07:16:32 -08:00
Supriya Rao
d8c3b2b10c [quant][pyper] Add support for pruned weights in embedding_bag_byte lookup (#47329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47329

Supports pruned weights, along with a mapping for the compressed indices.
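
A rough sketch of the lookup with pruning, assuming the quantized prepack/lookup ops and a mapping whose entry i gives the compressed row for original row i (the identity mapping below is hypothetical, with no rows actually pruned):

```python
import torch

weight = torch.randn(10, 16)
q_rows = torch.ops.quantized.embedding_bag_byte_prepack(weight)  # 8-bit rows
indices = torch.tensor([0, 2, 5])
offsets = torch.tensor([0, 3])
mapping = torch.arange(10, dtype=torch.int32)  # identity: nothing pruned
out = torch.ops.quantized.embedding_bag_byte_rowwise_offsets(
    q_rows, indices, offsets,
    pruned_weights=True,
    compressed_indices_mapping=mapping,
)
```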

Test Plan:
python test/test_quantization.py TestQuantizedEmbeddingOps

Imported from OSS

Reviewed By: qizzzh

Differential Revision: D24719909

fbshipit-source-id: f998f4039e84bbe1886e492a3bff6aa5f56b6b0f
2020-11-04 22:33:33 -08:00
Sheng Qin
c9222b7471 Implement clip_ranges operator for PyTorch
Test Plan:
unit test for correctness
```
buck test caffe2/torch/fb/sparsenn:test -- test_clip_ranges
Parsing buck files: finished in 1.6 sec
Creating action graph: finished in 18.9 sec
Building: finished in 15.0 sec (100%) 9442/9442 jobs, 1 updated
  Total time: 35.6 sec
More details at https://www.internalfb.com/intern/buck/build/66fb17de-859e-4d01-89bf-5c5de2950693
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 80f5e0c2-7db2-48a4-b148-25dd34651682
Trace available for this run at /tmp/tpx-20201026-123217.050766/trace.log
Started reporting to test run: https://our.intern.facebook.com/intern/testinfra/testrun/4503599665041422
    ✓ ListingSuccess: caffe2/torch/fb/sparsenn:test - main (14.912)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_clip_ranges (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (14.098)
Summary
  Pass: 1
  ListingSuccess: 1
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/4503599665041422
```
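
For context, the clipping semantics are roughly the following (a sketch, assuming each range is an (offset, length) pair whose length is truncated to MAX_LENGTH; the real op lives in the sparsenn extension):

```python
import torch

def clip_ranges(ranges: torch.Tensor, max_length: int) -> torch.Tensor:
    # ranges: (..., 2) tensor of (offset, length) pairs
    out = ranges.clone()
    out[..., 1] = out[..., 1].clamp(max=max_length)
    return out

r = torch.tensor([[0, 6], [6, 3]], dtype=torch.int32)
print(clip_ranges(r, 4))  # lengths clipped to at most 4
```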

New benchmark perf test:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypetorch.int32_cpu
# Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 155.765

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypetorch.int32_cpu
# Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 156.248

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypetorch.int32_cpu
# Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 156.634

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypetorch.int32_cpu
# Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 155.408

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypetorch.int32_cpu
# Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 165.168
```

Compared with the old implementation, there is a gain of **around 300us**:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypetorch.int32_cpu
# Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 443.012

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypetorch.int32_cpu
# Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 446.480

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypetorch.int32_cpu
# Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 444.064

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypetorch.int32_cpu
# Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 445.511

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypetorch.int32_cpu
# Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 450.468
```

Reviewed By: MarcioPorto

Differential Revision: D24546110

fbshipit-source-id: e6c9b38e911f177f97961ede5bf375107f240363
2020-10-28 09:46:37 -07:00
Sheng Qin
c6858fd71a Set up benchmarks for ClipRanges operator for Caffe2 and PyTorch
Summary: As titled, adding benchmark tests for the ClipRanges operator in both Caffe2 and PyTorch.

Test Plan:
Benchmark test for Caffe2:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: clip_ranges
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1026 12:30:33.938997 2658759 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypeint32
# Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: int32
Forward Execution Time (us) : 5.805

# Benchmarking Caffe2: clip_ranges
# Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypeint32
# Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: int32
Forward Execution Time (us) : 5.913

# Benchmarking Caffe2: clip_ranges
# Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypeint32
# Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: int32
Forward Execution Time (us) : 5.941

# Benchmarking Caffe2: clip_ranges
# Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypeint32
# Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: int32
Forward Execution Time (us) : 5.868

# Benchmarking Caffe2: clip_ranges
# Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypeint32
# Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: int32
Forward Execution Time (us) : 6.408
```

Benchmark test for PyTorch:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypetorch.int32_cpu
# Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 443.012

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypetorch.int32_cpu
# Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 446.480

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypetorch.int32_cpu
# Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 444.064

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypetorch.int32_cpu
# Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 445.511

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypetorch.int32_cpu
# Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 450.468
```

Reviewed By: MarcioPorto

Differential Revision: D24500468

fbshipit-source-id: a582090a3982005af272cb10cdd257b2b2e787c4
2020-10-28 09:42:10 -07:00
Shijun Kong
d5cd781cd3 Update dper3 to use torch.nan_to_num and nan_to_num_ (#46873)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46873

OSS:
Add an op benchmark for torch.nan_to_num and torch.nan_to_num_.
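
For illustration, these ops replace NaN and infinities with finite values (by default 0 for NaN and the dtype's max/min for +/-inf):

```python
import torch

x = torch.tensor([float('nan'), float('inf'), -float('inf'), 1.0])
print(torch.nan_to_num(x))                                     # default fills
print(torch.nan_to_num(x, nan=0.0, posinf=1e6, neginf=-1e6))   # custom fills
x.nan_to_num_()                                                # in-place variant
```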

Test Plan:
OSS:
`buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:nan_to_num_test`

Reviewed By: qizzzh, houseroad

Differential Revision: D24521835

fbshipit-source-id: 1fd50a99e5329ffec2d470525ce6976d39424958
2020-10-27 06:41:48 -07:00