Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46694
For ops with parameters (e.g. conv), running in JIT mode currently raises
`RuntimeError: Cannot insert a Tensor that requires grad as a constant. Consider making it a parameter or input, or detaching the gradient`. After consulting https://www.fburl.com/vtkys6ug, we decided to turn off gradients for the parameters in the forward run. If we want ops with parameters to work in backward with JIT mode, we probably need to turn `TorchBenchmarkBase` into a subclass of `nn.Module`.
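For context, a minimal standalone sketch of why turning gradients off helps (the tensor names and shapes are illustrative, not the benchmark harness itself): tracing a function that closes over a tensor with `requires_grad=True` triggers exactly this error, while disabling gradients first lets the tracer bake the weight in as a constant.
```
import torch
import torch.nn.functional as F

weight = torch.nn.Parameter(torch.randn(16, 3, 3, 3))   # conv weight, requires_grad=True
x = torch.randn(1, 3, 32, 32)

def forward(inp):
    return F.conv2d(inp, weight)

# traced = torch.jit.trace(forward, (x,))    # raises the RuntimeError quoted above
weight.requires_grad_(False)                 # turn gradients off for the forward run
traced = torch.jit.trace(forward, (x,))      # tracing now succeeds
```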
Test Plan: ./buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/conv_test.par --use_jit
Reviewed By: mingzhe09088
Differential Revision: D24451206
fbshipit-source-id: 784eb60ca155b0152d745c92f6d0ce6b2c9014c6
Summary: benchmark_caffe2 is broken due to a refactoring that changed from eager test generation to registration only.
Test Plan:
`buck run caffe2/benchmarks/operator_benchmark/c2:add_test`
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking Caffe2: add
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1021 08:07:06.350742 390665 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: add_M8_N16_K32_dtypeint
# Input: M: 8, N: 16, K: 32, dtype: int
Forward Execution Time (us) : 652.748
# Benchmarking Caffe2: add
# Name: add_M16_N16_K64_dtypefloat
# Input: M: 16, N: 16, K: 64, dtype: float
Forward Execution Time (us) : 63.570
# Benchmarking Caffe2: add
# Name: add_M64_N64_K128_dtypeint
# Input: M: 64, N: 64, K: 128, dtype: in
```
Reviewed By: qizzzh
Differential Revision: D24448374
fbshipit-source-id: 850fd375d194c20c385ea4433aea13066c7476e6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46679
The current way of importing configs hits a runtime error when a single benchmark is launched directly with buck (e.g. `/buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/conv_test.par`). This diff fixes that issue.
ghstack-source-id: 114857978
Test Plan: waitforsandcastle
Reviewed By: vkuzo
Differential Revision: D24459631
fbshipit-source-id: 29df17e66962a8604dbb7b8b9106713c3c19bed5
Summary: Add operator benchmark for 4bit/8bit embedding lookups in `aibench`.
Test Plan:
```
buck build //caffe2/benchmarks/operator_benchmark/pt:qembedding_bag_lookups_test
aibench-cli adhoc -c 'buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_bag_lookups_test'
```
The run was successful in aibench:
https://www.internalfb.com/intern/aibench/details/738300474
https://www.internalfb.com/intern/aibench/details/346463246
Reviewed By: radkris-git
Differential Revision: D24268413
fbshipit-source-id: 7fb4ff75da47f8f327edab562c5d29bb69e00b8d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46003
sparse is confusing because it is used in training for sparse gradients
Test Plan: Imported from OSS
Reviewed By: radkris-git, qizzzh
Differential Revision: D24178248
fbshipit-source-id: 0a2b595f3873d33b2ce25839b6eee31d2bfd3b0d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45853
The method name in the README is not consistent with the actual implementation.
Reviewed By: qizzzh
Differential Revision: D24114849
fbshipit-source-id: d979e324c768708e99b8cc5b87e261f17c22a883
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39955
resolves https://github.com/pytorch/pytorch/issues/36323 by adding `torch.sgn` for complex tensors.
`torch.sgn` returns `x/abs(x)` for `x != 0` and `0 + 0j` for `x == 0`.
This PR doesn't test the correctness of the gradients. That will be done as part of auditing all the ops in the future, once we decide on the autograd behavior (JAX vs TF) and add gradcheck.
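A quick illustration of the definition above (the values in the comments are the mathematically expected results):
```
import torch

z = torch.tensor([3 + 4j, 0j, -2 + 0j])
print(torch.sgn(z))    # 0.6+0.8j, 0+0j, -1+0j
print(z / z.abs())     # matches sgn for nonzero entries; the zero entry becomes nan+nanj
```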
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23460526
Pulled By: anjali411
fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42956
In preparation for observer perf improvement, cleans up the
micro benchmarks:
* disable CUDA for histogram observers (it's too slow)
* add larger shapes for better representation of real workloads
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qobserver_test
```
Imported from OSS
Reviewed By: supriyar
Differential Revision: D23093996
fbshipit-source-id: 5dc477c9bd5490d79d85ff8537270cd25aca221a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43018
This diff fixes an issue where the original non-learnable fake quantize was given a trainable scale and zero point; `requires_grad` should be completely disabled for both parameters.
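As a minimal sketch of the intended behavior (illustrative only, not the benchmark code): with `requires_grad` off, gradients flow through the fake-quantize op to the input only, and scale/zero point stay fixed.
```
import torch

# Non-learnable path: scale and zero_point must not be trainable.
scale = torch.nn.Parameter(torch.tensor([0.1]), requires_grad=False)
zero_point = torch.nn.Parameter(torch.tensor([0.0]), requires_grad=False)

x = torch.randn(4, 8, requires_grad=True)
y = torch.fake_quantize_per_tensor_affine(x, scale.item(), int(zero_point.item()), 0, 255)
y.sum().backward()          # populates x.grad; scale and zero_point receive no gradient
```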
Test Plan:
Use the following command to execute the benchmark test:
`buck test mode/dev-nosan pt:quantization_test`
Reviewed By: vkuzo
Differential Revision: D23107846
fbshipit-source-id: d2213983295f69121e9e6ae37c84d1f37d78ef39
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42767
Same as previous PR, forcing the qlinear benchmark to follow the fp one
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.linear_test
python -m pt.qlinear_test
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23013937
fbshipit-source-id: fffaa7cfbfb63cea41883fd4d70cd3f08120aaf8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42761
Makes the qconv benchmark follow the conv benchmark exactly. This way
it will be easy to compare q vs fp with the same settings.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qconv_test
python -m pt.conv_test
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23012533
fbshipit-source-id: af30ee585389395569a6322f5210828432963077
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42810
In this diff, the original backward-pass implementation is sped up by merging the 3 separate iterations that compute dX, dScale, and dZeroPoint into one. A native loop is used directly at the byte level (via `strides`), and the computation is vectorized by expanding scale and zero point so they share the same shape as X and correspond element-wise to X along the channel axis. A conceptual sketch of the fused computation is included after the speedup numbers below.
In the benchmark test on the operators, for an input of shape `3x3x256x256`, we have observed the following improvement in performance:
**Speedup from python operator**: ~10x
**Speedup from original learnable kernel**: ~5.4x
**Speedup from non-backprop kernel**: ~1.8x
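The sketch below illustrates the fused computation in plain PyTorch. It assumes LSQ-style gradient formulas for the learnable fake-quantize op and omits the real kernel's byte-level strided loop and any gradient scaling factor, so treat it as a conceptual outline rather than the shipped C++ implementation.
```
import torch

def fused_fake_quant_per_channel_backward(dY, X, scale, zero_point, axis, qmin, qmax):
    # Broadcast the per-channel scale/zero_point against X along `axis`.
    shape = [1] * X.dim()
    shape[axis] = X.size(axis)
    s = scale.reshape(shape)
    z = zero_point.reshape(shape).to(X.dtype)

    Xq = torch.round(X / s + z)                      # pre-clamp quantized values
    in_range = (Xq >= qmin) & (Xq <= qmax)
    Xc = Xq.clamp(qmin, qmax)

    # All three gradients come out of one pass over the data.
    dX = dY * in_range.to(dY.dtype)                  # straight-through estimator
    dScale_elem = dY * (Xc - z - torch.where(in_range, X / s, torch.zeros_like(X)))
    dZeroPoint_elem = torch.where(in_range, torch.zeros_like(dY), -dY * s)

    reduce_dims = [d for d in range(X.dim()) if d != axis]
    return dX, dScale_elem.sum(reduce_dims), dZeroPoint_elem.sum(reduce_dims)
```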
Test Plan:
To assert correctness of the new kernel, on a devvm, enter the command
`buck test //caffe2/test:quantization -- learnable_backward_per_channel`
To benchmark the operators, on a devvm:
1. Set the input size to 3x3x256x256 or another reasonable shape.
2. Run `buck test //caffe2/benchmarks/operator_benchmark/pt:quantization_test`
3. The relevant outputs for CPU are as follows:
```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 989024.686
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 95654.079
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 176948.970
```
4. The relevant outputs for GPU are as follows:
**Pre-optimization**:
```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 6795.173
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 4321.351
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 1052.066
```
**Post-optimization**:
```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 6737.106
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 2112.484
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 1078.79
```
Reviewed By: vkuzo
Differential Revision: D22946853
fbshipit-source-id: 1a01284641480282b3f57907cc7908d68c68decd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42756
Similar to ELU, CELU was also broken in the quantized benchmark, fixing.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qactivation_test
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23010863
fbshipit-source-id: 203e63f9cff760af6809f6f345b0d222dc1e9e1b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42491
Hooks up quantized batchnorm_1d to the quantized_bn kernel. Eager mode
hookup will be in a future PR, and graph mode should work after this PR.
Note: currently the implementation is ~2x slower on the benchmark than q_batch_norm2d
because we convert back to contiguous memory format at the end, since
channels_last is only defined for rank >= 4. If further optimization is
needed, that can be a separate PR (will need the NHWC folks to see if
there is a workaround). Meanwhile, having this is better than not having anything.
Context: There have been both internal and external requests for various
quantized BN1d use cases.
Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_batch_norm_1d_2d_3d
python test/test_quantization.py TestQuantizedOps.test_batch_norm_1d_2d_3d_relu
python test/test_quantization.py TestQuantizeJitOps.test_qbatch_norm
// performance:
// https://gist.github.com/vkuzo/73a07c0f24c05f5804990d9ebfaecf5e
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D22926254
fbshipit-source-id: 2780e6a81cd13a7455f6ab6e5118c22850a97a12
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42318
We forgot to update this benchmark when quantized elu's signature
changed to require observation, fixing.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qactivation_test
```
Imported from OSS
Reviewed By: supriyar
Differential Revision: D22845251
fbshipit-source-id: 1443f6f0deac695715b1f2bd47f0f22b96dc72ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41974
In this diff, 2 new sets of benchmark tests are added to the `quantization` benchmark suite where operator-level benchmarking is conducted for the learnable Python operators, the learnable c++ kernels, and the original non-backprop c++ kernels.
Test Plan:
Inside the path `torch/benchmarks/operator_benchmark` (The root directory will be `caffe2` inside `fbcode` if working on a devvm):
- On a devvm, run the command `buck run pt:fake_quantize_learnable_test`
- On a personal laptop, run the command `python3 -m pt.fake_quantize_learnable_test`
Benchmark Results (On devGPU with 0% volatile utilization -- all GPUs are free):
Each sample has dimensions **3x256x256**;
### In **microseconds** (`1e-6` second),
| | Python Module | C++ Kernel | Non-backprop C++ Kernel |
|---------------------------|---------------|------------|-------------------------|
| Per Tensor CPU Forward | 3112.666 | 3270.740 | 3596.864 |
| Per Tensor Cuda Forward | 797.258 | 258.961 | 133.953 |
| Per Channel CPU Forward | 6587.693 | 6931.461 | 6352.417 |
| Per Channel Cuda Forward | 1579.576 | 555.723 | 479.016 |
| Per Tensor CPU Backward | 72278.390 | 22466.648 | 12922.195 |
| Per Tensor Cuda Backward | 6512.280 | 1546.218 | 652.942 |
| Per Channel CPU Backward | 74138.545 | 41212.777 | 14131.576 |
| Per Channel Cuda Backward | 6795.173 | 4321.351 | 1052.066 |
Reviewed By: z-a-f
Differential Revision: D22715683
fbshipit-source-id: 8be528b790663413cbeeabd4f68bbca00be052dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41429
This diff contains the benchmark test to evaluate the speed of executing the learnable fake quantization operator, both in the forward path and the backward path, with respect to both per tensor and per channel usages.
Test Plan:
Inside the path `torch/benchmarks/operator_benchmark` (The root directory will be `caffe2` inside `fbcode` if working on a devvm):
- On a devvm, run the command `buck run pt:fake_quantize_learnable_test`
- On a personal laptop, run the command `python3 -m pt.fake_quantize_learnable_test`
Benchmark Results (Locally on CPU):
Each sample has dimensions **3x256x256**; Each batch has 16 samples (`N=16`)
- Per Tensor Forward: 0.023688 sec/sample
- Per Tensor Backward: 0.165926 sec/sample
- Per Channel Forward: 0.040432 sec/sample
- Per Channel Backward: 0.173528 sec/sample
Reviewed By: vkuzo
Differential Revision: D22535252
fbshipit-source-id: e8e953ff2de2107c6f2dde4c8d5627bdea67ef7f
Summary:
Related to https://github.com/pytorch/pytorch/issues/41368
These benchmarks already support CUDA, so there is no reason for them not to be in the benchmark config.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41438
Reviewed By: zhangguanheng66
Differential Revision: D22540756
Pulled By: ezyang
fbshipit-source-id: 621eceff37377c1ab06ff7483b39fc00dc34bd46
Summary: The device attribute in the op benchmark can only include 'cpu' or 'cuda', so this diff adds a check for that.
Test Plan: buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --warmup_iterations 1 --iterations 1
Reviewed By: ngimel
Differential Revision: D22538252
fbshipit-source-id: 3e5af72221fc056b8d867321ad22e35a2557b8c3
Summary: Change the device config in the qobserver test to a string so it honors the --device flag.
Test Plan: buck run caffe2/benchmarks/operator_benchmark/pt:qobserver_test -- --iterations 1 --device cpu
Reviewed By: ngimel
Differential Revision: D22536379
fbshipit-source-id: 8926b2393be1f52f9183f8205959a3ff18e3ed2a
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38716, fixes https://github.com/pytorch/pytorch/issues/37234
This algorithm does the summation along a single axis with multiple "levels" of accumulators, each designed to hold the sum of an order of magnitude more values than the previous one.
e.g. if there are 2^16 elements, the first level will hold the sum of 2^4 elements, and so on in increasing powers of 2: 2^4, 2^8, 2^12 and finally 2^16.
This limits the differences in magnitude of the partial results being added together, and so we don't lose accuracy as the axis length increases.
WIP to write a vectorized version.
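Below is a small NumPy illustration of the leveled-accumulator idea (the actual change is a C++ kernel and is structured differently; the `block=16` default and the comparison against a float64 reference are just for demonstration):
```
import numpy as np

def naive_sum(values):
    acc = np.float32(0.0)                            # one running accumulator
    for v in values:
        acc += v
    return acc

def cascade_sum(values, block=16):
    level = np.asarray(values, dtype=np.float32)
    while level.size > 1:
        pad = (-level.size) % block
        level = np.pad(level, (0, pad))              # zero-pad to whole blocks
        level = level.reshape(-1, block).sum(axis=1) # one "level" of partial sums
    return level[0]

x = np.full(2 ** 16, 0.1, dtype=np.float32)
ref = x.astype(np.float64).sum()
print(abs(naive_sum(x) - ref))                       # sequential: error grows with length
print(abs(cascade_sum(x) - ref))                     # cascaded: stays close to the reference
```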
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39516
Reviewed By: ezyang
Differential Revision: D22106251
Pulled By: ngimel
fbshipit-source-id: b56de4773292439dbda62b91f44ff37715850ae9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39360
Makes the observer microbenchmarks also run on CUDA. This is useful
now that QAT is supported in DDP and is more likely to be run
on GPUs.
Test Plan:
```
python -m pt.qobserver_test
```
Imported from OSS
Differential Revision: D21828985
fbshipit-source-id: 6da4d61f744f7a2ee5e87963b3ec84579128d435
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39483
I fixed all of the new errors that occurred because of the upgrade.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D21884575
Pulled By: ezyang
fbshipit-source-id: 45c8e1f1ecb410c8d7c46dd3922ad70e982a0685
Summary:
Otherwise, I don't understand how those could have been invoked
Also, what is the benefit of importing the same module twice?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38832
Differential Revision: D21675081
Pulled By: malfet
fbshipit-source-id: fee5604c4c433161b6b1a999d505b5acbbc3b421
Summary:
Together with https://github.com/pytorch/pytorch/issues/37758, this fixes https://github.com/pytorch/pytorch/issues/37743 and fixes https://github.com/pytorch/pytorch/issues/24861.
This follows the CUDA fix in https://github.com/pytorch/pytorch/issues/37758, vectorised using a `blendv` to replace the if conditionals.
Most of the complication is from `remainder` supporting `at::Half` where `fmod` doesn't. I've now got `fmod` working on `Vec256<at::Half>` as well as enabling half dispatch for `fmod` so it matches `remainder`.
I also added `fmod` support to `Vec256<at::BFloat16>` before realising that `remainder` doesn't support `BFloat16` anyway. I could also enable `BFloat16` if that's desirable. If not, I don't think `Vec256<BFloat16>` should be missing `fmod` anyway.
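For reference, a user-level illustration of the semantics the kernel has to preserve (this is not the `Vec256` code): `remainder` follows the divisor's sign, `fmod` the dividend's, and `remainder` can be recovered from `fmod` with a branch-free fixup, which is what the `blendv` expresses in the vectorized path.
```
import torch

a = torch.tensor([-3.0, 3.0, -3.0], dtype=torch.half)
b = torch.tensor([ 2.0, -2.0, -2.0], dtype=torch.half)

print(torch.fmod(a, b))        # -1.,  1., -1.  (sign of the dividend)
print(torch.remainder(a, b))   #  1., -1., -1.  (sign of the divisor)

r = torch.fmod(a, b)
fixup = (r != 0) & ((r < 0) != (b < 0))        # signs differ and fmod is nonzero
print(torch.where(fixup, r + b, r))            # matches torch.remainder(a, b)
```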
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38293
Differential Revision: D21539801
Pulled By: ezyang
fbshipit-source-id: abac6a3ed2076932adc459174cd3d8d510f3e1d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36847
Adds a quantized instancenorm operator, which can reuse most of
groupnorm's logic.
Benchmarking shows that the quantized version is about 10x faster than
floating point for equivalent input sizes
(https://gist.github.com/vkuzo/2f230e84d26f26cc6030afdbfbc8e7f0)
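A float-side sanity check of the relationship being reused (not the quantized kernel itself): instance norm is group norm with one group per channel, which is why most of the groupnorm logic carries over.
```
import torch
import torch.nn.functional as F

x = torch.randn(2, 6, 8, 8)
w, b = torch.randn(6), torch.randn(6)

out_in = F.instance_norm(x, weight=w, bias=b, eps=1e-5)
out_gn = F.group_norm(x, num_groups=6, weight=w, bias=b, eps=1e-5)
print(torch.allclose(out_in, out_gn, atol=1e-5))   # expect True
```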
Test Plan:
```
python test/quantization/test_quantized.py TestQuantizedOps.test_instance_norm
```
Imported from OSS
Differential Revision: D21107925
fbshipit-source-id: 6bacda402f0eb9857bc8f9a5cf8ef306150613d4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36835
Adds a quantized groupnorm operator. We reuse most of the layernorm
kernel, modifying it to be able to perform channel-wise scaling.
Benchmark results: the quantized layer is between 6x and 15x faster than the fp one, depending on input shapes
(full results:
https://gist.github.com/vkuzo/db67623232415382dabff6c8923124e9)
Test Plan:
```
python test/quantization/test_quantized.py TestQuantizedOps.test_group_norm
python test/quantization/test_quantized.py TestQuantizedOps.test_qlayer_norm
```
Numerics are nearly equivalent, with the only difference documented
in the test case. The difference is the same type as with quantized
layernorm. Making numerics equivalent is possible but will sacrifice
speed.
Imported from OSS
Differential Revision: D21107926
fbshipit-source-id: 80e87e9e2c71310bc28c3d114c88de428819cb45
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36980
Missed this in the original diff, fixing: create the output tensor directly instead of quantizing it.
Test Plan:
tests still pass
microbenchmarks show a 2x performance improvement for int8:
https://gist.github.com/vkuzo/3b321b428e4c38e805000961c263286b (this
will depend on input size)
Imported from OSS
Differential Revision: D21185970
fbshipit-source-id: 5b9e93d9f9ac05a8120532bd03ad347541a132c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36674
Slight changes to qlinear benchmark to have it be in the same format
as linear, for fairer comparisons between FP and Q.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.linear_test
python -m pt.qlinear_test
```
Imported from OSS
Differential Revision: D21102562
fbshipit-source-id: 4f5c693b5de7e26c4326a9ec276560714290f6c6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36673
Slight changes to the qconv benchmark to make it match the floating
point benchmark, so we can compare across the two better.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qconv_test --tag_filter all
python -m pt.conv_test --tag_filter all
```
Imported from OSS
Differential Revision: D21102563
fbshipit-source-id: d11c1e4c13d4c5fa1f2332c687aee6889c81b659
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35731
Changes relu and relu6 to point to the functional implementations here.
The previous behavior tested the time to create the module, but didn't actually run the
function (I noticed this when adding the new input sizes and seeing
the measured time not change).
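A minimal reproduction of the pitfall, outside the benchmark harness (timings are illustrative only): constructing `nn.ReLU()` does no tensor work, so its "runtime" is flat regardless of input size, while `F.relu(x)` actually scales with the input.
```
import timeit
import torch
import torch.nn.functional as F

x_small = torch.randn(64)
x_large = torch.randn(1 << 20)

print(timeit.timeit(lambda: torch.nn.ReLU(), number=1000))   # module construction only
print(timeit.timeit(lambda: F.relu(x_small), number=1000))   # tiny input
print(timeit.timeit(lambda: F.relu(x_large), number=1000))   # 1M elements, clearly slower
```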
Test Plan:
Run the benchmark; the time now changes as expected with input size for these ops.
Imported from OSS
Differential Revision: D20875542
fbshipit-source-id: 3a6278a7a861437d613c1e30698a58175a8e8555
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35729
* there were a few quantized activations that had implementations but no benchmarks; this adds them
* adds the input sizes from `unary_tests.py` here, so we can fairly compare fp and quantized implementations of activations
Test Plan:
```
python -m pt.qactivation_test
```
Imported from OSS
Differential Revision: D20875544
fbshipit-source-id: f55a66422233b96f0791c85b05476596d5d72b5d