The term `Sparsity` doesn't reflect the tools being developed in AO. `torch/ao/sparsity` also contains utilities for structured pruning, which we have always referred to internally as just "pruning". To avoid confusion, we renamed `Sparsity` to `Prune`. We will not provide backwards compatibility, as this toolset has so far been under silent development.
The documentation will reflect this change as well.
**TODO:**
- [ ] Change the tutorials
- [ ] Confirm no bc-breakages
- [ ] Reflect the changes in the trackers and RFC docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84867
Approved by: https://github.com/supriyar
Summary: A refactor was needed to add tests for some new layers without copy-pasting the entire code. It's basically just a helper that does exactly what the other tests did, since they were essentially copies of one another. It's possible to do something similar with the quantized kernels test, but it's different enough that it seemed more effort than it was worth. Also a bugfix: I believe the original line 150 was wrong since model.weight is never used, though the only effect was that the specific weight wasn't used.
Test Plan: python test/test_ao_sparsity.py TestQuantizedSparseLayers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81802
Approved by: https://github.com/supriyar
Summary:
This PR adds a new quantization backend, ONEDNN, with quantized conv and linear kernels in the same code path as the FBGEMM backend.
The ONEDNN backend is an alternative to the FBGEMM and QNNPACK backends. It takes advantage of features of the latest Intel® CPU products: it supports VNNI on Cascade Lake and the AMX instruction set, which will be available on Sapphire Rapids and has 8X the int8 peak TOPS of VNNI.
ONEDNN demonstrates better performance than FBGEMM on conv kernels of popular CNN models. It also supports more fused ops, such as convolution-add-ReLU, than FBGEMM and QNNPACK.
To use this backend, users only need to set the quantization backend to 'onednn' before any calculation, without changing their models.
```python
torch.backends.quantized.engine = 'onednn'
```
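Not every build includes every engine, so it can help to check what is available before switching. The following is a minimal sketch, not part of this PR: `pick_engine` and its preference order are hypothetical, and it assumes `torch.backends.quantized.supported_engines` lists the engines compiled into the running build.

```python
def pick_engine(supported, preferred=("onednn", "fbgemm", "qnnpack")):
    """Return the first engine from `preferred` found in `supported`, else None.

    `pick_engine` is a hypothetical helper; the preference order is an assumption.
    """
    for name in preferred:
        if name in supported:
            return name
    return None


try:
    import torch
except ImportError:  # keep the sketch importable without PyTorch
    torch = None

if torch is not None:
    # supported_engines reports the quantized engines available in this build
    engine = pick_engine(torch.backends.quantized.supported_engines)
    if engine is not None:
        torch.backends.quantized.engine = engine
```

Falling back to another engine keeps the same model code working on builds where ONEDNN is unavailable.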
## Design docs
- https://github.com/pytorch/pytorch/issues/21120#issuecomment-562371983
- https://github.com/pytorch/pytorch/pull/67177#issuecomment-963787096
## File changes
**Add ONEDNN to qengine list**
- aten/src/ATen/Context.cpp
- c10/core/QEngine.h
- torch/ao/quantization/qconfig.py
- torch/backends/quantized/\_\_init\_\_.py
**Implement qconv & qlinear for ONEDNN backend**
- aten/src/ATen/native/quantized/cpu/conv_serialization.h
- aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp
- aten/src/ATen/native/quantized/cpu/onednn_utils.h
- aten/src/ATen/native/quantized/cpu/qconv.cpp
- aten/src/ATen/native/quantized/cpu/qconv_dynamic.cpp
- aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp
- aten/src/ATen/native/quantized/cpu/qconv_unpack.cpp
- aten/src/ATen/native/quantized/cpu/qlinear.cpp
- aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp
- aten/src/ATen/native/quantized/cpu/qlinear_prepack.cpp
- aten/src/ATen/native/quantized/cpu/qlinear_unpack.cpp
**Skip tests that are not supported by ONEDNN**
- test/ao/sparsity/test_kernels.py
- test/quantization/core/test_quantized_module.py
- test/quantization/core/test_quantized_op.py
## Validation results
This PR has passed `test_quantization.py` and `test_mkldnn.py`.
Below is performance data for the int8 2D convolution and linear operators on the Cascade Lake Xeon® platform:
(Note: Tested with a single instance on a single core, using the latest oneDNN library.)
**Table 1. Performance comparison of int8 2d convolution operator**
|No.| Shape| FBGEMM| ONEDNN| Gain|
|-|-|-|-|-|
|1| IC=128, OC=128, kernel=3, stride=1, N=4, H=32, W=32, G=1, pad=0| 668.310us| 535.630us| 24.8%|
|2| IC=128, OC=128, kernel=3, stride=2, N=4, H=32, W=32, G=1, pad=0| 290.630us| 281.810us| 3.1%|
|3| IC=128, OC=256, kernel=3, stride=1, N=4, H=32, W=32, G=1, pad=0| 1.045ms| 893.010us| 17.0%|
|4| IC=128, OC=256, kernel=3, stride=2, N=4, H=32, W=32, G=1, pad=0| 385.320us| 373.720us| 3.1%|
|5| IC=256, OC=256, kernel=3, stride=1, N=4, H=32, W=32, G=1, pad=0| 1.876ms| 1.641ms| 14.3%|
|6| IC=256, OC=256, kernel=3, stride=2, N=4, H=32, W=32, G=1, pad=0| 660.460us| 638.470us| 3.4%|
**Table 2. Performance comparison of int8 linear operator**
|No.| Shape (m, n, k)| FBGEMM| ONEDNN| Gap|
|-|-|-|-|-|
|1| 64, 800, 320| 80.550us| 96.770us| 20.10%|
|2| 64, 768, 512| 101.230us| 130.720us| 29.10%|
|3| 16, 256, 512| 30.230us| 51.450us| 70.20%|
|4| 128, 128, 128| 33.810us| 50.480us| 49.30%|
|5| 256, 512, 256| 154.490us| 195.050us| 26.30%|
|6| 1024, 1024, 1024| 3.134ms| 3.514ms| 12.10%|
ONEDNN showed advantages over FBGEMM for convolution. However, it has a performance gap relative to FBGEMM for linear ops. The gap is a known issue, and further optimization is in progress in the oneDNN library. On the latest platforms, ONEDNN achieves better performance for both conv and linear.
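The Gain and Gap columns in Tables 1 and 2 are not defined in the text. They appear to be the relative difference between the slower and the faster timing, expressed as a percentage of the faster one. A small sketch under that assumption (the helper name `percent_slower` is illustrative only):

```python
def percent_slower(slower_us, faster_us):
    """Percent by which `slower_us` exceeds `faster_us`, relative to `faster_us`."""
    return (slower_us / faster_us - 1.0) * 100.0


# Table 1, row 1: FBGEMM 668.310us vs ONEDNN 535.630us (ONEDNN is faster)
conv_gain = percent_slower(668.310, 535.630)
# Table 2, row 1: ONEDNN 96.770us vs FBGEMM 80.550us (FBGEMM is faster)
linear_gap = percent_slower(96.770, 80.550)

print(f"conv gain:  {conv_gain:.1f}%")   # conv gain:  24.8%
print(f"linear gap: {linear_gap:.1f}%")  # linear gap: 20.1%
```

Both values match the first rows of the tables, which suggests the same formula was used for each column with the roles of the two backends swapped.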
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69820
Reviewed By: HDCharles
Differential Revision: D33716039
Pulled By: jerryzh168
fbshipit-source-id: 6f7bb807e85798142dfcffccfca8b8bd652fb3dd
(cherry picked from commit 91526b373560f42ba0ad307f9cccfc0eb5218b1f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66778
This removes the context-manager hack that communicated the zero block shape to the quantization convert step.
The conversion now assumes that the converted modules have `sparse_params` (which is added by the sparsifier).
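The contract described above can be illustrated with a small, framework-free sketch. It is not the code from this PR: `get_sparse_block_shape` is a hypothetical helper, the `"sparse_block_shape"` key and the `(1, 4)` value are assumptions for illustration, and `SimpleNamespace` stands in for a real module.

```python
from types import SimpleNamespace


def get_sparse_block_shape(module):
    """Read the block shape from `module.sparse_params`, failing loudly if absent."""
    if not hasattr(module, "sparse_params"):
        raise AttributeError(
            "expected `sparse_params`; was the module processed by the sparsifier?"
        )
    # "sparse_block_shape" is an assumed key name, used here only for illustration
    return module.sparse_params.get("sparse_block_shape")


# Stand-in for a module that the sparsifier has annotated
sparsified = SimpleNamespace(sparse_params={"sparse_block_shape": (1, 4)})
block_shape = get_sparse_block_shape(sparsified)
```

Reading the shape from the module itself, rather than from an ambient context manager, keeps the information attached to the object being converted.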
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D31835721
Pulled By: z-a-f
fbshipit-source-id: c5fd2da3b09a728a2296765c00ca69275dbca3b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60032
More sparse tests are coming. This PR creates a separate folder for the sparse tests.
Test Plan: `python test/test_ao.py`
Reviewed By: raghuramank100
Differential Revision: D29139265
fbshipit-source-id: d0db915f00e6bc8d89a5651f08f72e362a912a6b