Commit Graph

246 Commits

Shijun Kong
6ae0a7c919 Add ReplaceNaN benchmark as baseline (#46685)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46685

as title

Test Plan:
caffe2

```
./buck-out/gen/caffe2/benchmarks/operator_benchmark/c2/replace_nan_test.par

# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: replace_nan
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1022 10:09:48.508246 1887813 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: replace_nan_M16_N16_dtypefloat
# Input: M: 16, N: 16, dtype: float
Forward Execution Time (us) : 30.742

# Benchmarking Caffe2: replace_nan
# Name: replace_nan_M16_N16_dtypedouble
# Input: M: 16, N: 16, dtype: double
Forward Execution Time (us) : 29.135

# Benchmarking Caffe2: replace_nan
# Name: replace_nan_M64_N64_dtypefloat
# Input: M: 64, N: 64, dtype: float
Forward Execution Time (us) : 94.059

# Benchmarking Caffe2: replace_nan
# Name: replace_nan_M64_N64_dtypedouble
# Input: M: 64, N: 64, dtype: double
Forward Execution Time (us) : 93.569
```
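
For context, a minimal sketch of what such a baseline benchmark looks like in the operator_benchmark framework (a PyTorch-side equivalent is shown; the class, configs, and NaN-replacement value are illustrative, not the exact Caffe2 test added here):

```python
import operator_benchmark as op_bench
import torch

# Illustrative configs mirroring the M/N/dtype axes in the output above.
replace_nan_configs = op_bench.cross_product_configs(
    M=[16, 64],
    N=[16, 64],
    dtype=[torch.float, torch.double],
    tags=["short"],
)

class ReplaceNaNBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, M, N, dtype):
        input = torch.randn(M, N).to(dtype=dtype)
        input[0][0] = float("nan")
        self.inputs = {"input": input}
        self.set_module_name("replace_nan")

    def forward(self, input):
        # Replace NaNs with a fixed value, as the Caffe2 ReplaceNaN op does.
        return torch.where(torch.isnan(input), torch.zeros_like(input), input)

op_bench.generate_pt_test(replace_nan_configs, ReplaceNaNBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```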

Reviewed By: qizzzh, houseroad

Differential Revision: D24448483

fbshipit-source-id: 51574ca0eca6dba5828dfdc754193dba5a62954f
2020-10-22 19:12:14 -07:00
Yang Wang
920ec6651f [OpBench] fix jit mode run of operator benchmark for ops with parameters (#46694)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46694

For the op with parameters (e.g. conv), the jit mode run currently will raise an error of
`RuntimeError: Cannot insert a Tensor that requires grad as a constant. Consider making it a parameter or input, or detaching the gradient`. After consulting https://www.fburl.com/vtkys6ug, decided to turn-off gradient for the parameters in the forward run. If we want op with parameters to work in backward with jit mode, probably needs to turn `TorchBenchmarkBase` into a sub-class of `nn.Module`

Test Plan: ./buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/conv_test.par  --use_jit

Reviewed By: mingzhe09088

Differential Revision: D24451206

fbshipit-source-id: 784eb60ca155b0152d745c92f6d0ce6b2c9014c6
2020-10-22 11:10:28 -07:00
Shijun Kong
e5a2ba2ea1 Fix benchmark_caffe2
Summary: benchmark_caffe2 is broken due to a refactoring that changed eager test generation to registration-only.

Test Plan:
`buck run caffe2/benchmarks/operator_benchmark/c2:add_test`

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1021 08:07:06.350742 390665 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: add_M8_N16_K32_dtypeint
# Input: M: 8, N: 16, K: 32, dtype: int
Forward Execution Time (us) : 652.748

# Benchmarking Caffe2: add
# Name: add_M16_N16_K64_dtypefloat
# Input: M: 16, N: 16, K: 64, dtype: float
Forward Execution Time (us) : 63.570

# Benchmarking Caffe2: add
# Name: add_M64_N64_K128_dtypeint
# Input: M: 64, N: 64, K: 128, dtype: in
```
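
For reference, a register-only Caffe2 benchmark roughly follows this shape (a sketch, assuming the `op_bench` Caffe2 helpers; the actual `add_test` may differ in details such as dtype handling):

```python
import benchmark_caffe2 as op_bench_c2  # Caffe2 side of the op_bench framework
import operator_benchmark as op_bench
from caffe2.python import core

add_configs = op_bench.cross_product_configs(
    M=[8, 16, 64], N=[16, 64], K=[32, 64, 128],
    dtype=["int", "float"], tags=["short"],
)

class AddBenchmark(op_bench_c2.Caffe2BenchmarkBase):
    def init(self, M, N, K, dtype):
        self.input_one = self.tensor([M, N, K], dtype)
        self.input_two = self.tensor([M, N, K], dtype)
        self.output = self.tensor([M, N, K], dtype)
        self.set_module_name("add")

    def forward(self):
        op = core.CreateOperator("Add", [self.input_one, self.input_two], self.output)
        return op

# Registration only; the runner instantiates and executes the tests later.
op_bench_c2.generate_c2_test(add_configs, AddBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```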

Reviewed By: qizzzh

Differential Revision: D24448374

fbshipit-source-id: 850fd375d194c20c385ea4433aea13066c7476e6
2020-10-22 08:09:06 -07:00
Mingzhe Li
8908f6ad8e [op-bench] modify import path of configs (#46679)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46679

The current way of importing configs causes a runtime error when a single benchmark is launched directly with buck (e.g. `/buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/conv_test.par`). This diff fixes that issue.
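
One common shape for this kind of fix (illustrative only; the module names are hypothetical and not necessarily what the diff does):

```python
# When launched via the package runner, the shared configs are importable
# relatively; when the .par is launched directly, only an absolute import works.
try:
    from pt import configs  # running as part of the benchmark package
except ImportError:
    import configs  # running conv_test.par directly with buck
```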
ghstack-source-id: 114857978

Test Plan: waitforsandcastle

Reviewed By: vkuzo

Differential Revision: D24459631

fbshipit-source-id: 29df17e66962a8604dbb7b8b9106713c3c19bed5
2020-10-21 16:15:11 -07:00
Bugra Akyildiz
03c7d5be6b Add operator benchmark for 4bit/8bit embedding lookups
Summary: Add operator benchmark for 4bit/8bit embedding lookups in `aibench`.

Test Plan:
```
buck build //caffe2/benchmarks/operator_benchmark/pt:qembedding_bag_lookups_test
aibench-cli adhoc -c 'buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_bag_lookups_test'
```

The run was successful in aibench: https://www.internalfb.com/intern/aibench/details/738300474
https://www.internalfb.com/intern/aibench/details/346463246

Reviewed By: radkris-git

Differential Revision: D24268413

fbshipit-source-id: 7fb4ff75da47f8f327edab562c5d29bb69e00b8d
2020-10-15 13:51:32 -07:00
Supriya Rao
31888b2e77 [quant][pyper] Rename the sparse argument for embedding_bag ops (#46003)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46003

`sparse` is confusing because it is used in training for sparse gradients.

Test Plan: Imported from OSS

Reviewed By: radkris-git, qizzzh

Differential Revision: D24178248

fbshipit-source-id: 0a2b595f3873d33b2ce25839b6eee31d2bfd3b0d
2020-10-08 16:15:28 -07:00
Shijun Kong
7d4f5060ad Fix doc about operator benchmark (#45853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45853

The method name in the README is not consistent with the actual implementation.

Reviewed By: qizzzh

Differential Revision: D24114849

fbshipit-source-id: d979e324c768708e99b8cc5b87e261f17c22a883
2020-10-08 09:13:53 -07:00
Mingzhe Li
e829d4fba9 [op-bench] fix jit mode (#45774)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45774

Fix RuntimeError: No such operator operator_benchmark::_consume

Test Plan: waitforsandcastle

Reviewed By: ngimel

Differential Revision: D24064982

fbshipit-source-id: 13160b6d18569e659ca1ab0ca1d444ed9947260c
2020-10-05 09:29:41 -07:00
anjali411
58b6ab69e5 torch.sgn for complex tensors (#39955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39955

resolves https://github.com/pytorch/pytorch/issues/36323 by adding `torch.sgn` for complex tensors.
`torch.sgn` returns `x/abs(x)` for `x != 0` and `0 + 0j` for `x == 0`.
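A quick illustration of that definition:

```python
import torch

z = torch.tensor([3 + 4j, 0j, -2 + 0j])
torch.sgn(z)  # tensor([0.6000+0.8000j, 0.0000+0.0000j, -1.0000+0.0000j])

# Equivalent to z / abs(z) away from zero:
manual = torch.where(z == 0, torch.zeros_like(z), z / z.abs())
assert torch.allclose(torch.sgn(z), manual)
```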

This PR doesn't test the correctness of the gradients. That will be done as part of auditing all the ops in the future, once we decide on the autograd behavior (JAX vs. TF) and add gradcheck.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23460526

Pulled By: anjali411

fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92
2020-09-22 08:24:53 -07:00
Xiang Gao
20ac736200 Remove py2 compatible future imports (#44735)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44735

Reviewed By: mruberry

Differential Revision: D23731306

Pulled By: ezyang

fbshipit-source-id: 0ba009a99e475ddbe22981be8ac636f8a1c8b02f
2020-09-16 12:55:57 -07:00
taivu
8722952dbd Add benchmark for channel_shuffle operator (#43509)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43509
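
For context, the operator being benchmarked computes the usual reshape/transpose shuffle; a manual equivalent (a sketch, not the benchmark code itself):

```python
import torch

def manual_channel_shuffle(x, groups):
    # Split channels into (groups, C // groups), swap the two axes,
    # and flatten back to (N, C, H, W).
    n, c, h, w = x.shape
    return (
        x.reshape(n, groups, c // groups, h, w)
         .transpose(1, 2)
         .reshape(n, c, h, w)
    )

x = torch.arange(16.0).reshape(1, 4, 2, 2)
shuffled = manual_channel_shuffle(x, groups=2)
```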

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D23299972

Pulled By: kimishpatel

fbshipit-source-id: 6189d209859da5a41067eb9e8317e3bf7a0fc754
2020-09-02 08:15:19 -07:00
Supriya Rao
7024ce8a2c [quant] Add benchmarks for quantized embeddingbag module (#43296)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43296

Use common config for float and quantized embedding_bag modules

Test Plan:
```
python -m pt.qembeddingbag_test

 Benchmarking PyTorch: qEmbeddingBag
 Mode: Eager
 Name: qEmbeddingBag_embeddingbags10_dim4_modesum_input_size8_offset0_sparseTrue_include_last_offsetTrue_cpu
 Input: embeddingbags: 10, dim: 4, mode: sum, input_size: 8, offset: 0, sparse: True, include_last_offset: True, device: cpu
Forward Execution Time (us) : 35.738

 Benchmarking PyTorch: qEmbeddingBag
 Mode: Eager
 Name: qEmbeddingBag_embeddingbags10_dim4_modesum_input_size8_offset0_sparseTrue_include_last_offsetFalse_cpu
 Input: embeddingbags: 10, dim: 4, mode: sum, input_size: 8, offset: 0, sparse: True, include_last_offset: False, device: cpu
Forward Execution Time (us) : 62.708

python -m pt.embeddingbag_test

 Benchmarking PyTorch: embeddingbag
 Mode: Eager
 Name: embeddingbag_embeddingbags10_dim4_modesum_input_size8_offset0_sparseTrue_include_last_offsetTrue_cpu
 Input: embeddingbags: 10, dim: 4, mode: sum, input_size: 8, offset: 0, sparse: True, include_last_offset: True, device: cpu
Forward Execution Time (us) : 46.878

 Benchmarking PyTorch: embeddingbag
 Mode: Eager
 Name: embeddingbag_embeddingbags10_dim4_modesum_input_size8_offset0_sparseTrue_include_last_offsetFalse_cpu
 Input: embeddingbags: 10, dim: 4, mode: sum, input_size: 8, offset: 0, sparse: True, include_last_offset: False, device: cpu
Forward Execution Time (us) : 103.904

```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23245531

fbshipit-source-id: 81b44fde522238d3eef469434e93dd7f94b528a8
2020-08-24 09:51:03 -07:00
Supriya Rao
4fc9e958c4 [quant] Add benchmarks for embedding_bag conversion ops (#43291)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43291

Test Float2Fused and Fused2Float conversion operators for embedding_bag byte and 4-bit ops
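
A minimal usage sketch of the byte variant of these conversion ops (assuming the `quantized::embedding_bag_byte_prepack`/`_unpack` op names; the 4-bit variants take additional arguments):

```python
import torch

weight = torch.randn(10, 16)  # one row per embedding

# Float -> fused byte-rowwise-quantized weight (scale/zero_point stored per row)
packed = torch.ops.quantized.embedding_bag_byte_prepack(weight)

# Fused -> float again
unpacked = torch.ops.quantized.embedding_bag_byte_unpack(packed)
assert unpacked.shape == weight.shape
```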

Test Plan:
```
python -m pt.qembedding_pack_test
```

Imported from OSS

Reviewed By: radkris-git

Differential Revision: D23231641

fbshipit-source-id: a2afe51bba52980d2e96dfd7dbc183327e9349fd
2020-08-20 11:26:20 -07:00
Vasiliy Kuznetsov
5aa61afbfb quant bench: update observer configs (#42956)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42956

In preparation for observer perf improvement, cleans up the
micro benchmarks:
* disable CUDA for histogram observers (it's too slow)
* add larger shapes for better representation of real workloads

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qobserver_test
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23093996

fbshipit-source-id: 5dc477c9bd5490d79d85ff8537270cd25aca221a
2020-08-17 17:07:56 -07:00
Paul Shao
8b5642a786 Fix to Learnable Fake Quantization Op Benchmarking (#43018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43018

In this diff, a fix is added: the original non-learnable fake quantize was being given trainable scale and zero point, whereas `requires_grad` should be completely disabled for both parameters.
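
In other words (a sketch of the intent; the argument values are illustrative):

```python
import torch

x = torch.randn(3, 256, 256)
scale, zero_point = torch.tensor([0.1]), torch.tensor([0])

# The non-learnable op must see non-trainable quantization parameters:
scale.requires_grad_(False)
zero_point.requires_grad_(False)

y = torch.fake_quantize_per_tensor_affine(
    x, scale.item(), int(zero_point.item()), quant_min=0, quant_max=255
)
```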

Test Plan:
Use the following command to execute the benchmark test:

`buck test mode/dev-nosan pt:quantization_test`

Reviewed By: vkuzo

Differential Revision: D23107846

fbshipit-source-id: d2213983295f69121e9e6ae37c84d1f37d78ef39
2020-08-13 16:32:13 -07:00
Vasiliy Kuznetsov
57b056b5f2 align qlinear benchmark to linear benchmark (#42767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42767

Same as the previous PR: forces the qlinear benchmark to follow the fp one.

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.linear_test
python -m pt.qlinear_test
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23013937

fbshipit-source-id: fffaa7cfbfb63cea41883fd4d70cd3f08120aaf8
2020-08-11 10:35:16 -07:00
Vasiliy Kuznetsov
a7bdf575cb align qconv benchmark to conv benchmark (#42761)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42761

Makes the qconv benchmark follow the conv benchmark exactly. This way
it will be easy to compare q vs fp with the same settings.

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qconv_test
python -m pt.conv_test
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23012533

fbshipit-source-id: af30ee585389395569a6322f5210828432963077
2020-08-11 10:33:19 -07:00
Paul Shao
d28639a080 Optimization with Backward Implementation of Learnable Fake Quantize Per Channel Kernel (CPU and GPU) (#42810)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42810

In this diff, the original backward pass implementation is sped up by merging the 3 separate iterations that computed dX, dScale, and dZeroPoint. A native loop is used directly at the byte level (via `strides`). In addition, the computation is vectorized: scale and zero point are expanded to share the same shape as X, so their values correspond element-wise along the channel axis.
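
A compact per-tensor sketch of the fused computation (the actual kernel is per-channel and operates at the byte level; the gradient formulas follow the usual learnable fake-quantize definition and are illustrative):

```python
import torch

def fused_fake_quant_backward(x, dy, scale, zero_point, qmin, qmax):
    # One pass produces dX, dScale, and dZeroPoint together instead of
    # three separate iterations over the input.
    xq = torch.round(x / scale) + zero_point
    in_range = (xq >= qmin) & (xq <= qmax)
    dx = dy * in_range
    dscale = torch.where(
        in_range,
        dy * (torch.round(x / scale) - x / scale),       # rounding residual
        dy * (torch.clamp(xq, qmin, qmax) - zero_point), # saturated branch
    ).sum()
    dzero_point = torch.where(in_range, torch.zeros_like(dy), -dy * scale).sum()
    return dx, dscale, dzero_point
```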

In the benchmark test on the operators, for an input of shape `3x3x256x256`, we have observed the following improvement in performance:
**Speedup from python operator**: ~10x
**Speedup from original learnable kernel**: ~5.4x
**Speedup from non-backprop kernel**: ~1.8x

Test Plan:
To assert correctness of the new kernel, on a devvm, enter the command

`buck test //caffe2/test:quantization -- learnable_backward_per_channel`

To benchmark the operators, on a devvm, enter the command
1. Set the input size to 3x3x256x256 or another reasonable shape.
2. Run `buck test //caffe2/benchmarks/operator_benchmark/pt:quantization_test`
3. The relevant outputs for CPU are as follows:

```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 989024.686

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 95654.079

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 176948.970
```
4. The relevant outputs for GPU are as follows:

**Pre-optimization**:

```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 6795.173

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 4321.351

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 1052.066
```

**Post-optimization**:
```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 6737.106

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 2112.484

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 1078.79
```

Reviewed By: vkuzo

Differential Revision: D22946853

fbshipit-source-id: 1a01284641480282b3f57907cc7908d68c68decd
2020-08-11 08:41:53 -07:00
Vasiliy Kuznetsov
faca3c43e6 fix celu in quantized benchmark (#42756)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42756

Similar to ELU, CELU was also broken in the quantized benchmark; this fixes it.

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qactivation_test
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23010863

fbshipit-source-id: 203e63f9cff760af6809f6f345b0d222dc1e9e1b
2020-08-07 15:23:50 -07:00
Presley Graham
5ca08b8891 Add benchmark for calculate_qparams (#42138)
Summary:
Adds a benchmark for `HistogramObserver.calculate_qparams` to the quantized op benchmarks. The next diff in this stack adds a ~15x speedup for this benchmark.
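
The benchmarked call, for reference:

```python
import torch

obs = torch.quantization.HistogramObserver(dtype=torch.quint8)
obs(torch.randn(3, 512, 512))                # accumulate the histogram
scale, zero_point = obs.calculate_qparams()  # the call being benchmarked
```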

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42138

Test Plan:
While in the folder `benchmarks/operator_benchmark`, the benchmark can be run using `python -m benchmark_all_quantized_test --operators HistogramObserverCalculateQparams`.

Benchmark results before speedup:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_affine
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_affine
Forward Execution Time (us) : 185818.566

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_symmetric
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_symmetric
Forward Execution Time (us) : 165325.916
```

Benchmark results after speedup:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_affine
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_affine
Forward Execution Time (us) : 12242.241

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_symmetric
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_symmetric
Forward Execution Time (us) : 12655.354
```

Reviewed By: supriyar

Differential Revision: D22779291

Pulled By: durumu

fbshipit-source-id: 1fe17d20eda5dd99e0e2590480142034c3574d4e
2020-08-06 11:10:12 -07:00
Vasiliy Kuznetsov
50f0d2b97d quant: add q_batchnorm_1d op (#42491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42491

Hooks up quantized batchnorm_1d to the quantized_bn kernel. Eager mode
hookup will be in a future PR, and graph mode should work after this PR.

Note: currently the implementation is ~2x slower on the benchmark than q_batch_norm2d
because we convert back to contiguous memory format at the end, since
channels_last is only defined for rank >= 4. If further optimization is
needed, that can be a separate PR (will need the NHWC folks to see if
there is a workaround).  Meanwhile, having this is better than not having anything.
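
To illustrate the memory-format constraint mentioned above:

```python
import torch

x4d = torch.randn(2, 3, 8, 8)
x4d = x4d.contiguous(memory_format=torch.channels_last)  # fine: rank 4

x3d = torch.randn(2, 3, 8)
# channels_last is only defined for rank >= 4, so a rank-3 input has to be
# converted back to contiguous format instead:
# x3d.contiguous(memory_format=torch.channels_last)  # raises RuntimeError
x3d = x3d.contiguous()
```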

Context: There have been both internal and external requests for various
quantized BN1d use cases.

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_batch_norm_1d_2d_3d
python test/test_quantization.py TestQuantizedOps.test_batch_norm_1d_2d_3d_relu
python test/test_quantization.py TestQuantizeJitOps.test_qbatch_norm

// performance:
// https://gist.github.com/vkuzo/73a07c0f24c05f5804990d9ebfaecf5e

```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D22926254

fbshipit-source-id: 2780e6a81cd13a7455f6ab6e5118c22850a97a12
2020-08-05 17:20:18 -07:00
Vasiliy Kuznetsov
153673c33b fix quantized elu benchmark (#42318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42318

We forgot to update this benchmark when quantized elu's signature
changed to require observation; this fixes it.

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qactivation_test
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D22845251

fbshipit-source-id: 1443f6f0deac695715b1f2bd47f0f22b96dc72ca
2020-07-30 14:57:12 -07:00
Paul Shao
01b794f169 Operator-level Benchmark Test for Per Tensor and Per Channel Fake Quantization (#41974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41974

In this diff, 2 new sets of benchmark tests are added to the `quantization` benchmark suite where operator-level benchmarking is conducted for the learnable Python operators, the learnable c++ kernels, and the original non-backprop c++ kernels.

Test Plan:
Inside the path `torch/benchmarks/operator_benchmark` (The root directory will be `caffe2` inside `fbcode` if working on a devvm):
- On a devvm, run the command `buck run pt:fake_quantize_learnable_test`
- On a personal laptop, run the command `python3 -m pt.fake_quantize_learnable_test`

Benchmark Results (On devGPU with 0% volatile utilization -- all GPUs are free):
Each sample has dimensions **3x256x256**;

### In **microseconds** (`1e-6` second),

|                           | Python Module | C++ Kernel | Non-backprop C++ Kernel |
|---------------------------|---------------|------------|-------------------------|
| Per Tensor CPU Forward    | 3112.666      | 3270.740   | 3596.864                |
| Per Tensor Cuda Forward   | 797.258       | 258.961    | 133.953                 |
| Per Channel CPU Forward   | 6587.693      | 6931.461   | 6352.417                |
| Per Channel Cuda Forward  | 1579.576      | 555.723    | 479.016                 |
| Per Tensor CPU Backward   | 72278.390     | 22466.648  | 12922.195               |
| Per Tensor Cuda Backward  | 6512.280      | 1546.218   | 652.942                 |
| Per Channel CPU Backward  | 74138.545     | 41212.777  | 14131.576               |
| Per Channel Cuda Backward | 6795.173      | 4321.351   | 1052.066                |

Reviewed By: z-a-f

Differential Revision: D22715683

fbshipit-source-id: 8be528b790663413cbeeabd4f68bbca00be052dd
2020-07-29 11:12:17 -07:00
Presley Graham
445e7eb01b Add quantized CELU operator by adding additional parameters to quantized ELU (#39199)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39199

Test Plan: Imported from OSS

Differential Revision: D21771202

Pulled By: durumu

fbshipit-source-id: 910de6202fa3d5780497c5bf85208568a09297dd
2020-07-17 17:56:33 -07:00
Stanislau Hlebik
b774ce54f8 remediation of S205607
fbshipit-source-id: 798decc90db4f13770e97cdce3c0df7d5421b2a3
2020-07-17 17:19:47 -07:00
Stanislau Hlebik
8fdea489af remediation of S205607
fbshipit-source-id: 5113fe0c527595e4227ff827253b7414abbdf7ac
2020-07-17 17:17:03 -07:00
Paul Shao
b7147fe6d7 Learnable Fake Quantizer Benchmark Test (#41429)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41429

This diff contains the benchmark test to evaluate the speed of executing the learnable fake quantization operator, both in the forward path and the backward path, with respect to both per tensor and per channel usages.

Test Plan:
Inside the path `torch/benchmarks/operator_benchmark` (The root directory will be `caffe2` inside `fbcode` if working on a devvm):
- On a devvm, run the command `buck run pt:fake_quantize_learnable_test`
- On a personal laptop, run the command `python3 -m pt.fake_quantize_learnable_test`

Benchmark Results (Locally on CPU):
Each sample has dimensions **3x256x256**; Each batch has 16 samples (`N=16`)
- Per Tensor Forward: 0.023688 sec/sample
- Per Tensor Backward: 0.165926 sec/sample
- Per Channel Forward: 0.040432 sec/sample
- Per Channel Backward: 0.173528 sec/sample

Reviewed By: vkuzo

Differential Revision: D22535252

fbshipit-source-id: e8e953ff2de2107c6f2dde4c8d5627bdea67ef7f
2020-07-15 14:00:20 -07:00
Peter Bell
dddac948a3 Add CUDA to pooling benchmark configs (#41438)
Summary:
Related to https://github.com/pytorch/pytorch/issues/41368

These benchmarks support CUDA already, so there is no reason for CUDA not to be in the benchmark config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41438

Reviewed By: zhangguanheng66

Differential Revision: D22540756

Pulled By: ezyang

fbshipit-source-id: 621eceff37377c1ab06ff7483b39fc00dc34bd46
2020-07-15 10:51:43 -07:00
Wojciech Baranowski
20f3051f7d [adaptive_]max_pool{1,2,3}d: handle edge case when input is filled with -inf (#40665)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40131
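
The edge case in question, as a small repro (a sketch):

```python
import torch

x = torch.full((1, 1, 4), float("-inf"))
out, idx = torch.nn.functional.max_pool1d(x, kernel_size=2, return_indices=True)
# out is all -inf; after the fix, idx holds valid indices into the input.
```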

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40665

Differential Revision: D22463538

Pulled By: ezyang

fbshipit-source-id: 7e08fd0205926911d45aa150012154637e64a8d4
2020-07-14 21:51:40 -07:00
Mingzhe Li
4ddf27ba48 [op-bench] check device attribute in user inputs
Summary: The device attribute in the op benchmark can only be 'cpu' or 'cuda', so this diff adds a check.
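
A minimal sketch of such a check (the real validation lives in the benchmark core):

```python
def validate_device(device):
    # Only 'cpu' and 'cuda' are meaningful for the op benchmarks.
    if device not in ("cpu", "cuda"):
        raise ValueError(f"Invalid device '{device}', expected 'cpu' or 'cuda'")
```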

Test Plan: buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --warmup_iterations 1 --iterations 1

Reviewed By: ngimel

Differential Revision: D22538252

fbshipit-source-id: 3e5af72221fc056b8d867321ad22e35a2557b8c3
2020-07-14 17:17:59 -07:00
Mingzhe Li
144f04e7ef Fix qobserver test
Summary: Change the device config in qobserver test to a string to honor --device flag.

Test Plan: buck run caffe2/benchmarks/operator_benchmark/pt:qobserver_test  -- --iterations 1 --device cpu

Reviewed By: ngimel

Differential Revision: D22536379

fbshipit-source-id: 8926b2393be1f52f9183f8205959a3ff18e3ed2a
2020-07-14 15:47:03 -07:00
Xiaomeng Yang
80d5b3785b Add torch.logit function (#41062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41062

Add torch.logit function
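
Usage, for reference:

```python
import torch

p = torch.tensor([0.25, 0.5, 0.75])
torch.logit(p)            # log(p / (1 - p)) -> tensor([-1.0986, 0.0000, 1.0986])
torch.logit(p, eps=1e-6)  # clamps inputs to [eps, 1 - eps] first
```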

Test Plan: buck test mode/dev-nosan //caffe2/test:torch -- "logit"

Reviewed By: hl475

Differential Revision: D22406912

fbshipit-source-id: b303374f4c68850eb7477eb0645546a24b844606
2020-07-13 19:33:20 -07:00
Peter Bell
3dcc329746 Use tree-based sum for floats to avoid numerical instability (#39516)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38716, fixes https://github.com/pytorch/pytorch/issues/37234

This algorithm does the summation along a single axis with multiple "levels" of accumulator, each of which is designed to hold the sum of an order of magnitude more values than the previous.

e.g. if there are 2^16 elements, the first level will hold the sum of 2^4 elements, and so on in increasing powers of 2: 2^4, 2^8, 2^12 and finally 2^16.

This limits the differences in magnitude of the partial results being added together, and so we don't lose accuracy as the axis length increases.
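
A simplified recursive sketch of the idea (the real implementation uses the fixed accumulator levels described above and is vectorized):

```python
def tree_sum(vals, chunk=8):
    # Summing halves recursively keeps the partial sums that get added
    # together at similar magnitudes, unlike a single running accumulator.
    if len(vals) <= chunk:
        return sum(vals, 0.0)
    mid = len(vals) // 2
    return tree_sum(vals[:mid], chunk) + tree_sum(vals[mid:], chunk)
```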

WIP to write a vectorized version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39516

Reviewed By: ezyang

Differential Revision: D22106251

Pulled By: ngimel

fbshipit-source-id: b56de4773292439dbda62b91f44ff37715850ae9
2020-06-24 17:06:38 -07:00
Wojciech Baranowski
43331609a4 Port addmm, addbmm, addr to ATen (CUDA) (#38421)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24536, fixes https://github.com/pytorch/pytorch/issues/24534 and fixes https://github.com/pytorch/pytorch/issues/24533
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38421

Differential Revision: D22138333

Pulled By: VitalyFedyunin

fbshipit-source-id: f4411d0df0a001bbb95089eb55fdcac3aba86700
2020-06-22 13:02:33 -07:00
Vasiliy Kuznetsov
e35199a691 observer bench: add CUDA (#39360)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39360

Makes the observer microbenchmarks also run on CUDA. This is useful
now that QAT is supported in DDP and is more likely to be run
on GPUs.

Test Plan:
```
python -m pt.qobserver_test
```

Imported from OSS

Differential Revision: D21828985

fbshipit-source-id: 6da4d61f744f7a2ee5e87963b3ec84579128d435
2020-06-05 14:18:32 -07:00
Edward Yang
da2004e132 Upgrade lint. (#39483)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39483

I fixed all of the new errors that occurred because of the upgrade.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21884575

Pulled By: ezyang

fbshipit-source-id: 45c8e1f1ecb410c8d7c46dd3922ad70e982a0685
2020-06-04 12:56:43 -07:00
Nikita Shulga
c02e7c464a Replace import cpp_benchmark with torch.utils.cpp_benchmark (#38832)
Summary:
Otherwise, I don't understand how those could have been invoked

Also, what is the benefit of importing the same module twice?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38832

Differential Revision: D21675081

Pulled By: malfet

fbshipit-source-id: fee5604c4c433161b6b1a999d505b5acbbc3b421
2020-05-20 18:53:09 -07:00
Peter Bell
0a159b0a3a Fix precision issues in CPU remainder (#38293)
Summary:
Together with https://github.com/pytorch/pytorch/issues/37758, this fixes https://github.com/pytorch/pytorch/issues/37743 and fixes https://github.com/pytorch/pytorch/issues/24861.

This follows the CUDA fix in https://github.com/pytorch/pytorch/issues/37758, vectorised using a `blendv` to replace the if conditionals.

Most of the complication is from `remainder` supporting `at::Half`, which `fmod` doesn't. I've now got `fmod` working on `Vec256<at::Half>` as well as enabling half dispatch for `fmod` so it matches `remainder`.

I also added `fmod` support to `Vec256<at::BFloat16>` before realising that `remainder` doesn't support `BFloat16` anyway. I could also enable `BFloat16` if that's desirable. If not, I don't think `Vec256<BFloat16>` should be missing `fmod` anyway.
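
For reference, the two ops differ only in sign convention:

```python
import torch

a = torch.tensor([-3.5, 3.5])
b = torch.tensor([2.0, -2.0])
torch.remainder(a, b)  # sign follows the divisor:  tensor([ 0.5000, -0.5000])
torch.fmod(a, b)       # sign follows the dividend: tensor([-1.5000,  1.5000])
```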
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38293

Differential Revision: D21539801

Pulled By: ezyang

fbshipit-source-id: abac6a3ed2076932adc459174cd3d8d510f3e1d5
2020-05-14 08:54:32 -07:00
Supriya Rao
ae11718c45 [quant] Add quantized::conv1d op benchmarck (#38332)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38332

Test Plan:
python -m pt.qconv_test --test QConv1d_N1_IC128_OC256_L64_G1_kernel3_stride1_pad0
Forward Execution Time (us) : 147.844

python -m pt.conv_test --test Conv1d_IC128_OC256_kernel3_stride1_N1_L64_cpu
Forward Execution Time (us) : 470.750

Imported from OSS

Differential Revision: D21553662

fbshipit-source-id: 9c240a141f9cd3a82a20aa462e8e5577e002a387
2020-05-13 16:59:19 -07:00
Vasiliy Kuznetsov
4fa049c525 add quantized instancenorm operator (#36847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36847

Adds a quantized instancenorm operator, which can reuse most of
groupnorm's logic.

Benchmarking shows that the quantized version is about 10x faster than
floating point for equivalent input sizes
(https://gist.github.com/vkuzo/2f230e84d26f26cc6030afdbfbc8e7f0)

Test Plan:
```
python test/quantization/test_quantized.py TestQuantizedOps.test_instance_norm
```

Imported from OSS

Differential Revision: D21107925

fbshipit-source-id: 6bacda402f0eb9857bc8f9a5cf8ef306150613d4
2020-05-06 19:01:33 -07:00
Vasiliy Kuznetsov
b837d5d418 add quantized groupnorm operator (#36835)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36835

Adds a quantized groupnorm operator.  We reuse most of the layernorm
kernel, modifying it to be able to perform channel-wise scaling.

Benchmark results: the quantized layer is between 6x and 15x faster
than fp, depending on input shapes
(full results:
https://gist.github.com/vkuzo/db67623232415382dabff6c8923124e9)

Test Plan:
```
python test/quantization/test_quantized.py TestQuantizedOps.test_group_norm
python test/quantization/test_quantized.py TestQuantizedOps.test_qlayer_norm
```

Numerics are nearly equivalent, with the only difference documented
in the test case.  The difference is the same type as with quantized
layernorm.  Making numerics equivalent is possible but will sacrifice
speed.

Imported from OSS

Differential Revision: D21107926

fbshipit-source-id: 80e87e9e2c71310bc28c3d114c88de428819cb45
2020-05-06 19:01:26 -07:00
Vasiliy Kuznetsov
2773ed3082 hardswish: remove unnecessary quantize call (#36980)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36980

Missed this in the original diff; fixed by creating the output tensor directly instead of quantizing it.

Test Plan:
tests still pass
microbenchmarks show a 2x performance improvement for int8:
https://gist.github.com/vkuzo/3b321b428e4c38e805000961c263286b (this
will depend on input size)

Imported from OSS

Differential Revision: D21185970

fbshipit-source-id: 5b9e93d9f9ac05a8120532bd03ad347541a132c2
2020-04-22 16:15:54 -07:00
Vasiliy Kuznetsov
13391cebe2 ai-pep: match the qlinear benchmark to linear (#36674)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36674

Slight changes to qlinear benchmark to have it be in the same format
as linear, for fairer comparisons between FP and Q.

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.linear_test
python -m pt.qlinear_test
```

Imported from OSS

Differential Revision: D21102562

fbshipit-source-id: 4f5c693b5de7e26c4326a9ec276560714290f6c6
2020-04-20 09:46:32 -07:00
Vasiliy Kuznetsov
25649684ed ai-pep: align qconv benchmark to conv (#36673)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36673

Slight changes to the qconv benchmark to make it match the floating
point benchmark, so we can compare across the two better.

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qconv_test --tag_filter all
python -m pt.conv_test --tag_filter all
```

Imported from OSS

Differential Revision: D21102563

fbshipit-source-id: d11c1e4c13d4c5fa1f2332c687aee6889c81b659
2020-04-20 09:44:09 -07:00
Vasiliy Kuznetsov
a5d0d762fa redo of add quantized layer norm implementation (#36593)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36593

This is a redo of https://github.com/pytorch/pytorch/pull/35329 with a
better test.

Adds a quantized implementation of LayerNorm for server.

A future PR will add the Python wrapper.

Test Plan:
numerics match the floating point implementation

benchmarks by input size:
v1 (mean+var non-vectorized): https://gist.github.com/vkuzo/f6d72c04742608112f4c2e612c74bd13
v2 (mean+var vectorized in float): https://gist.github.com/vkuzo/4dd95657c5b5f3654e0965db00eff8d2
v3 (mean+var vectorized in int, current): https://gist.github.com/vkuzo/57a75f75629da9f23b64b38ca0e3d34b

Differential Revision: D21030268

Pulled By: vkuzo

fbshipit-source-id: b3594c3393cfce37a881319e2e0560620d51080f
2020-04-15 19:47:18 -07:00
Supriya Rao
73f11a0b23 Update qbatch_norm2d opbenchmark test (#36630)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36630

Test Plan:
OMP_NUM_THREADS=1 python -m pt.qbatchnorm_test

Imported from OSS

Differential Revision: D21030508

fbshipit-source-id: 1ece1bd7429207732eae4dd1982ceddcdc5d3a91
2020-04-14 17:09:18 -07:00
Edward Yang
88c22070fe Revert D20768930: add quantized layer norm implementation
Test Plan: revert-hammer

Differential Revision:
D20768930

Original commit changeset: ddf8727e9840

fbshipit-source-id: a190e1d1e42281eba627b0dbb6de1b3651cd5e97
2020-04-09 14:36:37 -07:00
Vasiliy Kuznetsov
f813e7184e add quantized layer norm implementation (#35329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35329

Adds a quantized implementation of LayerNorm for server.

A future PR will add the Python wrapper.

Test Plan:
numerics match the floating point implementation

benchmarks by input size:
v1 (mean+var non-vectorized): https://gist.github.com/vkuzo/f6d72c04742608112f4c2e612c74bd13
v2 (mean+var vectorized in float): https://gist.github.com/vkuzo/4dd95657c5b5f3654e0965db00eff8d2
v3 (mean+var vectorized in int, current): https://gist.github.com/vkuzo/57a75f75629da9f23b64b38ca0e3d34b

Imported from OSS

Differential Revision: D20768930

fbshipit-source-id: ddf8727e9840c65ead3b890220af0638c5637028
2020-04-09 09:11:41 -07:00
Vasiliy Kuznetsov
cc78914755 qactivation_benchmarks: small bug fix (#35731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35731

Changes relu and relu6 to point to the functional implementations here.
The previous behavior tested the time to create the module, but didn't actually run the
function (I noticed this when adding the new input sizes and seeing that
the measured time did not change).
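
The difference, in miniature:

```python
import torch

x = torch.randn(64, 64)
torch.nn.ReLU()              # only constructs the module (old benchmark body)
torch.nn.functional.relu(x)  # actually executes the op (new benchmark body)
```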

Test Plan:
run the benchmark, the time now changes as expected with input size for
these.

Imported from OSS

Differential Revision: D20875542

fbshipit-source-id: 3a6278a7a861437d613c1e30698a58175a8e8555
2020-04-06 15:02:33 -07:00
Vasiliy Kuznetsov
6405f26a02 add more quantized activation benchmarks and input sizes (#35729)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35729

* a few quantized activations had implementations but no benchmarks; this adds them
* adds the input sizes from `unary_tests.py` here, so we can compare fairly between fp and quantized implementations of activations

Test Plan:
```
python -m pt.qactivation_test
```

Imported from OSS

Differential Revision: D20875544

fbshipit-source-id: f55a66422233b96f0791c85b05476596d5d72b5d
2020-04-06 15:02:29 -07:00