Commit Graph

23 Commits

Gao, Xiang
45e4b614d1 Per channel quantization performance improvement (#33772)
Summary:
Benchmark:
NVIDIA GTX 1650 + AMD Ryzen Threadripper 3970X
```python
import torch
print(torch.__version__)

# warm up the CUDA context/allocator so startup cost does not skew the timings
for i in range(1000):
    torch.randn(1024 * 128, device='cuda')

def cuda(e):
    a = torch.randn(2 ** e, 32, device='cuda')
    s = torch.randn(32, device='cuda')
    z = torch.randn(32, device='cuda')
    torch.cuda.synchronize()
    %timeit torch.fake_quantize_per_channel_affine(a, s, z, 1, -999, 999); torch.cuda.synchronize()

def cpu(e):
    a = torch.randn(2 ** e, 32, device='cpu')
    s = torch.randn(32, device='cpu')
    z = torch.randn(32, device='cpu')
    %timeit torch.fake_quantize_per_channel_affine(a, s, z, 1, -999, 999);

for i in range(10, 24):
    cuda(i)
print()
for i in range(10, 32):
    cpu(i)
```
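The script above relies on IPython's `%timeit` magic. As a rough plain-Python sketch of the same CUDA measurement (it mirrors the call above; argument handling may differ across PyTorch versions), CUDA event timers can be used instead:

```python
import torch

def time_cuda(e, iters=100):
    # same setup as the cuda() helper above
    a = torch.randn(2 ** e, 32, device='cuda')
    s = torch.randn(32, device='cuda')
    z = torch.randn(32, device='cuda')
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        torch.fake_quantize_per_channel_affine(a, s, z, 1, -999, 999)
    end.record()
    torch.cuda.synchronize()
    print('2**%d: %.3f ms per call' % (e, start.elapsed_time(end) / iters))

for e in range(10, 24):
    time_cuda(e)
```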
Before
```
1.5.0a0+9bc922d
849 µs ± 44.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
817 µs ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
814 µs ± 2.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.11 ms ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.19 ms ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.6 ms ± 5.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.44 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.14 ms ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.41 ms ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.9 ms ± 2.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
26.9 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
52.6 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
104 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
207 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

249 µs ± 158 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
420 µs ± 230 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
766 µs ± 391 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.45 ms ± 574 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.84 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.69 ms ± 83 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.29 ms ± 2.58 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.32 ms ± 13.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
17.4 ms ± 38.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
47.5 ms ± 264 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
187 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
379 ms ± 5.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
652 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.22 s ± 4.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.34 s ± 8.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.56 s ± 7.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
8.97 s ± 33.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
17.8 s ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
35.2 s ± 167 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
After
```
1.5.0a0+a7ec8cc
92.5 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
97.7 µs ± 469 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
109 µs ± 4.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
119 µs ± 6.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
146 µs ± 1.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
211 µs ± 2.45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
347 µs ± 4.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
624 µs ± 14.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.17 ms ± 16.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.25 ms ± 48.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.43 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.51 ms ± 44.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
16.9 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
33.7 ms ± 7.64 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

201 µs ± 234 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
285 µs ± 465 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
287 µs ± 214 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
287 µs ± 221 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
287 µs ± 761 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
347 µs ± 399 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
675 µs ± 213 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.34 ms ± 643 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
4.82 ms ± 34.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.7 ms ± 88.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
20.3 ms ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
39.4 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
78.8 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
153 ms ± 786 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
285 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
541 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.03 s ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.97 s ± 8.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.81 s ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Fixes https://github.com/pytorch/pytorch/issues/33647
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33772

Differential Revision: D20112531

Pulled By: ngimel

fbshipit-source-id: f90e3ef1b5be8276851637f3e1251cb8f1af411f
2020-02-26 10:19:25 -08:00
Pritam Damania
f050b16dd9 Move pytorch distributed tests to separate folder for contbuild. (#30445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445

Create distributed and rpc directories under caffe2/test for better management
of unit tests.

Differential Revision: D18702786

fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606
2020-01-22 21:16:59 -08:00
James Reed
4fd20c0816 Kill hypothesis deadline testing (#30890)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30890

We've received way too many complaints about this functionality making tests flaky, and it's not providing value to us anyway. Let's cut the shit and kill deadline testing

Test Plan: Imported from OSS

Differential Revision: D18857597

Pulled By: jamesr66a

fbshipit-source-id: 67e3412795ef2fb7b7ee896169651084e434d2f6
2019-12-06 13:36:14 -08:00
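For context on what this removes: Hypothesis enforces a per-example time budget (the "deadline") and fails tests that exceed it. A minimal sketch of opting a single test out of that budget follows; whether PyTorch did this per test or via a shared helper is not shown here.

```python
import unittest
from hypothesis import given, settings, strategies as st

class TestExample(unittest.TestCase):
    # deadline=None removes the per-example time limit, the usual source of flaky timeouts
    @settings(deadline=None)
    @given(st.integers(min_value=1, max_value=10))
    def test_square_is_monotone(self, n):
        self.assertGreaterEqual(n * n, n)
```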
Jeremy Lilley
e6000a7c04 Temporarily disable test_numerical_consistency_per_tensor (#30600)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30600

test_numerical_consistency_per_tensor in test_fake_quant is failing on Windows.
ghstack-source-id: 94742124

Test Plan: CircleCI tests

Differential Revision: D18760287

fbshipit-source-id: 7f59355eab74e811bb370ad2836ed2f1def1f621
2019-12-02 06:57:14 -08:00
Jeremy Lilley
c780610f2d Disable test_backward_per_tensor in test_fake_quant (#30594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30594

This test case started breaking; disabling it to clean up the build.
ghstack-source-id: 94736837

Test Plan: Unittest disabling change

Differential Revision: D18758635

fbshipit-source-id: 05df1158ff0ccd75e401f352da529fb663b1cae0
2019-12-01 22:26:28 -08:00
Jianyu Huang
6f90567e0c Add the unittest import for test_fake_quant.py (#28815)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28815

Add the unittest import
ghstack-source-id: 92789329

Test Plan: CI

Differential Revision: D18191989

fbshipit-source-id: c54e0309e21156c33e4fec01bfba17a1c30463c9
2019-10-28 17:52:57 -07:00
Jianyu Huang
02d318461e Temporarily disable test_numerical_consistency_per_channel due to failure (#28807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28807

`FAIL: test_numerical_consistency_per_channel (__main__.TestFakeQuantizePerChannel)`

This test is failing consistently on master, and we can't find a clean blame.
ghstack-source-id: 92763176

Test Plan: CI

Differential Revision: D18181496

fbshipit-source-id: 5948af06c4cb7dea9a8db1366deb7c12f6ec1c72
2019-10-28 13:51:10 -07:00
Raghuraman Krishnamoorthi
9e3ba35500 Add control for observers in Fake-quantize module (#27113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27113

Fix a bug in fake quant control of the observer and fake-quantize operations.
Add a test to ensure that the features work as expected.
ghstack-source-id: 91071181

Test Plan: buck test mode/dev-nosan caffe2/test:fake_quant -- test_fake_quant_control

Differential Revision: D17678875

fbshipit-source-id: 2912ad8b6e674daa1d129f7a7c6f27d8c1b4f93b
2019-09-30 18:23:26 -07:00
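A hedged sketch of the observer/fake-quant control this commit fixes, using the enable/disable helpers on the `FakeQuantize` module as they exist in later public releases (the module has since moved to `torch.ao.quantization`):

```python
import torch
import torch.quantization as tq

fq = tq.FakeQuantize()       # per-tensor fake-quant with a moving-average observer
x = torch.randn(4, 8)

fq.enable_observer()         # collect scale/zero_point statistics from inputs...
fq.disable_fake_quant()      # ...while passing activations through unquantized
_ = fq(x)

fq.disable_observer()        # freeze the collected statistics
fq.enable_fake_quant()       # apply simulated quantization in forward from now on
y = fq(x)
print(fq.scale, fq.zero_point, y.shape)
```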
Raghuraman Krishnamoorthi
7dc7075795 Per channel fake quant (#26623)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26623

Per-channel fake quant CPU and CUDA operators,
per-channel support in the fake quant module, and
tests for per-channel fake quant and for serializability of fake quant modules.

ghstack-source-id: 91008299

Test Plan:
buck test mode/dev caffe2/test:fake_quant  --
 Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/1970324848875929
      ✓ caffe2/test:fake_quant - test_backward_per_tensor (test_fake_quant.TestFakeQuantizePerTensor) 0.242 1/10 (passed)
      ✓ caffe2/test:fake_quant - test_numerical_consistency_per_tensor (test_fake_quant.TestFakeQuantizePerTensor) 0.204 2/10 (passed)
      ✓ caffe2/test:fake_quant - test_fq_serializable (test_fake_quant.TestFakeQuantizePerTensor) 0.174 3/10 (passed)
      ✓ caffe2/test:fake_quant - test_numerical_consistency_per_channel (test_fake_quant.TestFakeQuantizePerChannel) 0.279 4/10 (passed)
      ✓ caffe2/test:fake_quant - test_forward_per_tensor (test_fake_quant.TestFakeQuantizePerTensor) 0.241 5/10 (passed)
      ✓ caffe2/test:fake_quant - test_forward_per_channel (test_fake_quant.TestFakeQuantizePerChannel) 0.353 6/10 (passed)
      ✓ caffe2/test:fake_quant - test_fq_module (test_fake_quant.TestFakeQuantizePerTensor) 0.354 7/10 (passed)
      ✓ caffe2/test:fake_quant - test_backward_per_channel (test_fake_quant.TestFakeQuantizePerChannel) 0.334 8/10 (passed)
      ✓ caffe2/test:fake_quant - test_fq_serializable (test_fake_quant.TestFakeQuantizePerChannel) 0.168 9/10 (passed)
      ✓ caffe2/test:fake_quant - test_fq_module (test_fake_quant.TestFakeQuantizePerChannel) 0.429 10/10 (passed)
      ✓ caffe2/test:fake_quant - main 0.000 (passed)

Differential Revision: D17439406

fbshipit-source-id: 64bfff5e4f40bc2ab8af2b432c7bc33805418077
2019-09-30 00:21:25 -07:00
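A sketch of the per-channel fake-quant module this commit adds, configured weight-style with one scale/zero_point per output channel (constructor arguments mirror the later public API and are assumptions here):

```python
import torch
from torch.quantization import FakeQuantize, MovingAveragePerChannelMinMaxObserver

fq = FakeQuantize(observer=MovingAveragePerChannelMinMaxObserver,
                  quant_min=-128, quant_max=127,
                  dtype=torch.qint8, qscheme=torch.per_channel_symmetric,
                  ch_axis=0)
w = torch.randn(16, 3, 3, 3)    # e.g. a conv weight; channel axis 0
w_fq = fq(w)                    # observe, then fake-quantize per channel
print(fq.scale.shape, fq.zero_point.shape)   # one qparam pair per output channel
```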
Raghuraman Krishnamoorthi
b0a2f6f2f5 Serialization and range reduction support for Fake Quant/Observer (#26519)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26519

ghstack-source-id: 90895631

Test Plan:
buck test caffe2/test:quantization -- 'test_histogram_observer \(test_quantization\.ObserverTest\)' --print-passing-details
and
buck test caffe2/test:fake_quant -- 'test_fq_serializable \(test_fake_quant\.TestFakeQuantizePerTensorAffine\)' --print-passing-details

Differential Revision: D17217408

fbshipit-source-id: 0da7efdcdae0c065dd035c5dd2b6a78231545ece
2019-09-27 10:09:39 -07:00
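What "serializability" buys here, as a minimal sketch: once scale and zero_point are tracked as buffers they round-trip through `state_dict` (exact buffer names and load behavior are assumptions based on later releases):

```python
import torch
from torch.quantization import FakeQuantize

fq = FakeQuantize()
_ = fq(torch.randn(8, 8))        # populate scale / zero_point via the observer
state = fq.state_dict()

fq2 = FakeQuantize()
fq2.load_state_dict(state)       # restores qparams and the enable/disable flags
print(fq2.scale, fq2.zero_point)
```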
Jerry Zhang
254122dd4e quantize_linear -> quantize_per_tensor (#26574)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26574

Since we also have `quantized::linear`, `quantize_linear` sounds
confusing, so we plan to rename it before the branch cut.

Test Plan:
ci

Imported from OSS

Differential Revision: D17514876

fbshipit-source-id: 01d9005e6ec8cb9950b9d8bba122109c389641d3
2019-09-20 21:58:48 -07:00
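For reference, the renamed op converts a float tensor into a quantized tensor, which is distinct from the `quantized::linear` kernel it was being confused with; a small usage sketch:

```python
import torch

x = torch.randn(2, 3)
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=10, dtype=torch.quint8)
print(qx.dtype)            # torch.quint8
print(qx.int_repr())       # underlying uint8 values
print(qx.dequantize())     # back to float for inspection
```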
Jerry Zhang
3c6009e6f1 derandomize hypothesis tests (#25513)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25513

Randomized tests are flaky; this PR derandomizes some of them.

Test Plan:
python test/test_fake_quant.py
python test/test_quantized_nn_mods.py

Imported from OSS

Differential Revision: D17221273

fbshipit-source-id: f6978704ba0139071c26f443e923955a2f849832
2019-09-06 10:53:05 -07:00
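A minimal sketch of derandomizing a Hypothesis test via the `derandomize` setting; whether this PR used that flag or fixed seeds is an assumption here:

```python
import unittest
from hypothesis import given, settings, strategies as st

class TestDeterministic(unittest.TestCase):
    # derandomize=True makes Hypothesis derive examples deterministically from the test
    @settings(derandomize=True)
    @given(st.floats(min_value=-1.0, max_value=1.0))
    def test_square_nonnegative(self, x):
        self.assertGreaterEqual(x * x, 0.0)
```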
Jerry Zhang
76b6b1b1a6 move no_deadline to hypothesis_utils.py (#25598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25598

att

Test Plan:
CI

Imported from OSS

Differential Revision: D17192467

fbshipit-source-id: 9ee93b02cc293bb71ed114534d92eedda3ddee88
2019-09-04 17:06:33 -07:00
Zafar Takhirov
e8acc2ebb1 Removing future imports from the test fixtures.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25296

Test Plan: Imported from OSS

Differential Revision: D17090201

Pulled By: zafartahirov

fbshipit-source-id: 5a4f6ac0ea475b55d2c610e2f9f4f0cef8690e8f
2019-08-29 01:39:59 -07:00
James Reed
40f0b1c844 Enable OSS quantization tests (#23858)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23858

Pull Request resolved: https://github.com/pytorch/pytorch/pull/23718

Changes:

- Enable tests for quantization test files in `run_tests.py`
- Remove `__future__` imports from `torch/nn/qat/modules/__init__.py`, since `unicode_literals` messes up imports on Python 2 because the elements in `__all__` will be Unicode strings rather than byte strings
- Skip PostTrainingQuantTests if the build doesn't have FBGEMM (only a small subset of targets in tests) or if testing under UBSAN (the suppression file doesn't seem to work)

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D16639467

Pulled By: jamesr66a

fbshipit-source-id: 532766797c216976dd7e07d751f768ff8e0fc207
2019-08-06 11:20:30 -07:00
Zafar Takhirov
35b6cdc2eb Rewriting hypothesis_utils (#22830)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22830

Separating the tensor generation and the generation of the quantization parameters

- Introducing hypothesis filter `assume_not_overflowing`, which makes sure that the generated tensor and qparams play well with each other. **Note: This is an expensive filter!**
- `qtensor` -> Renamed to `tensor`
- `qtensor_conv` -> Renamed to `tensor_conv2d`
- The tensors don't return the quantization parameters anymore; use `qparams` for that
- The `dtypes` argument is just a quantized dtype now.
- The enforcement for zero_point is predefined as before: if set to `None`, the zero_point will be sampled. Even when `None`, the sampling can be overridden with `zero_point_min` and `zero_point_max`
- Scale sampling can also be overridden using `scale_min` and `scale_max`

Reviewed By: jerryzh168

Differential Revision: D16234314

fbshipit-source-id: 5b538a5aa9772b7add4f2ce5eff6fd0decd48f8e
2019-07-17 10:16:13 -07:00
Jerry Zhang
f7de9be3c0 Add FakeQuantize Module (#21767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21767

Adding a FakeQuantize module for quantization-aware training.

Reviewed By: dzhulgakov

Differential Revision: D15728503

fbshipit-source-id: 2a9a6a362812ede3deac42b93dddca35987bd8e6
2019-07-15 14:08:55 -07:00
Jerry Zhang
0a0ff83124 replace num_bits with quant_min and quant_max (#21097)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21097

att

Differential Revision: D15547166

fbshipit-source-id: 60bc7f7d82c424558b67881627fb74f1eff515af
2019-05-30 20:57:57 -07:00
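The relationship between the old `num_bits` argument and the new explicit bounds, sketched for the unsigned case:

```python
import torch

num_bits = 8
quant_min, quant_max = 0, 2 ** num_bits - 1   # 0..255 for 8-bit unsigned

x = torch.randn(4, 4)
y = torch.fake_quantize_per_tensor_affine(x, scale=0.05, zero_point=128,
                                          quant_min=quant_min, quant_max=quant_max)
```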
Jerry Zhang
74375299e0 add torch.nn._intrinsic and torch.nn._intrinsic.quantized namespace (#20940)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20940

- `torch.nn._intrinsic` will contain normal (unquantized) fused modules like Conv2DRelu, Conv2DBnRelu, FakeQuantize ops, etc.
- `torch.nn._intrinsic.quantized` will contain fused and quantized modules like Quantized Conv2DRelu, Quantized LinearRelu, etc.
Right now I only added the FakeQuantize op in the `torch.nn._intrinsic` namespace; we'll have more later.

Differential Revision: D15505228

fbshipit-source-id: d380929e38af7a5bcfbea27474d5b80f95d43b03
2019-05-29 14:09:37 -07:00
Jerry Zhang
05543153dd CUDA implementation of fakequant (#20252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20252

Add CUDA implementation for fakequant op for quantization aware training.

Reviewed By: zafartahirov

Differential Revision: D15243386

fbshipit-source-id: 37610ab046786ffc69aaec5235e5df8304c353d6
2019-05-22 14:46:39 -07:00
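A quick sketch of the kind of check the numerical-consistency tests perform, comparing the new CUDA path against the CPU path (assumes a CUDA-capable build):

```python
import torch

x = torch.randn(64, 32)
scale, zero_point = 0.1, 2
cpu_out = torch.fake_quantize_per_tensor_affine(x, scale, zero_point, 0, 255)
cuda_out = torch.fake_quantize_per_tensor_affine(x.cuda(), scale, zero_point, 0, 255)
print(torch.allclose(cpu_out, cuda_out.cpu()))   # the two paths should agree
```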
Jerry Zhang
abb3698976 Add QInt32 ScalarType and qint32 data type (#19816)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19816

We need this for quantizing the bias.
Adds a third argument of ScalarType to `quantize_linear`.

Differential Revision: D15094174

fbshipit-source-id: f19ec8f4716cf5fe0aa21b38d45af6d27c9ab377
2019-05-15 18:50:18 -07:00
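A sketch of why a 32-bit quantized dtype matters for the bias: it is typically quantized with zero_point 0 at the product of the input and weight scales (shown with the op's current name, `quantize_per_tensor`; at the time of this commit the op was still called `quantize_linear`):

```python
import torch

bias = torch.randn(16)
x_scale, w_scale = 0.1, 0.02
qbias = torch.quantize_per_tensor(bias, scale=x_scale * w_scale, zero_point=0,
                                  dtype=torch.qint32)   # 32-bit avoids bias overflow
print(qbias.dtype)   # torch.qint32
```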
Jerry Zhang
176bdc0722 fix lint (#19632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19632

at

Differential Revision: D15052952

fbshipit-source-id: 7c38fad99799e5ac914685c36eadf932afe52b74
2019-04-23 15:29:38 -07:00
Jerry Zhang
f3be2816ae Adds fakeQuantizePerTensorAffineOp to pytorch (#19387)
Summary:
Adding the fakequant op so that we can use it in PyTorch models; the exact implementation might change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/19387

Differential Revision: D13739657

fbshipit-source-id: d5cb084e843d236bb1da9827ac1ba3900ed99786
2019-04-23 11:12:53 -07:00
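The fake-quant forward math this op implements, as a reference sketch (rounding edge cases may differ from the actual kernel):

```python
import torch

def fake_quantize_reference(x, scale, zero_point, quant_min, quant_max):
    # quantize to the integer grid, clamp to the representable range, dequantize back
    q = torch.clamp(torch.round(x / scale + zero_point), quant_min, quant_max)
    return (q - zero_point) * scale

x = torch.randn(8)
ref = fake_quantize_reference(x, 0.1, 2, 0, 255)
out = torch.fake_quantize_per_tensor_affine(x, 0.1, 2, 0, 255)
print(torch.allclose(ref, out))
```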