Revert "[Docs] Convert to markdown to fix 155032 (#155520)"

This reverts commit cd66ff8030.

Reverted https://github.com/pytorch/pytorch/pull/155520 on behalf of https://github.com/atalman due to breaks multiple test_quantization.py::TestQuantizationDocs::test_quantization_ ([comment](https://github.com/pytorch/pytorch/pull/155520#issuecomment-2981996091))
This commit is contained in:
PyTorch MergeBot 2025-06-17 22:22:50 +00:00
parent 54998c2daa
commit fa4f07b5b8
5 changed files with 495 additions and 503 deletions

View File

@ -1,4 +1,5 @@
# Quantization Accuracy Debugging
Quantization Accuracy Debugging
-------------------------------
This document provides high level strategies for improving quantization
accuracy. If a quantized model has error compared to the original model,
@ -10,9 +11,11 @@ we can categorize the error into:
portion of input data has large error
3. **implementation error** - quantized kernel is not matching reference implementation
## Data insensitive error
Data insensitive error
~~~~~~~~~~~~~~~~~~~~~~
### General tips
General tips
^^^^^^^^^^^^
1. For PTQ, ensure that the data you are calibrating with is representative
of your dataset. For example, for a classification problem a general
@ -38,7 +41,8 @@ we can categorize the error into:
4. If you are using PTQ, consider using QAT to recover some of the accuracy loss
from quantization.
### Int8 quantization tips
Int8 quantization tips
^^^^^^^^^^^^^^^^^^^^^^
1. If you are using per-tensor weight quantization, consider using per-channel
weight quantization.
@ -48,7 +52,8 @@ we can categorize the error into:
If this variation is high, the layer may be suitable for dynamic quantization
but not static quantization.
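To make the per-channel tip above concrete, here is a minimal sketch of the eager mode setting (the toy model and the specific qconfig constants are illustrative, not part of this guide):
```python
import torch
from torch.ao.quantization.qconfig import default_per_channel_qconfig, default_qconfig

# Hedged sketch: move an eager mode model from the default per-tensor weight
# qconfig to a per-channel weight qconfig before the usual prepare/convert flow.
model_fp32 = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()

model_fp32.qconfig = default_qconfig              # per-tensor weight quantization
model_fp32.qconfig = default_per_channel_qconfig  # per-channel weight quantization
```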
## Data sensitive error
Data sensitive error
~~~~~~~~~~~~~~~~~~~~
If you are using static quantization and a small portion of your input data is
resulting in high quantization error, you can try:
@ -60,7 +65,8 @@ resulting in high quantization error, you can try:
the observer settings to choose a better scale and zero_point.
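For the observer adjustment mentioned above, one possible sketch (the observer choices here are an assumption for illustration, not a recommendation from this guide) is to swap in a `HistogramObserver` for activations, which is less sensitive to a few outlier inputs when picking `scale` and `zero_point`:
```python
import torch
from torch.ao.quantization import QConfig
from torch.ao.quantization.observer import HistogramObserver, PerChannelMinMaxObserver

# Hedged sketch: a qconfig whose activation observer searches for scale/zero_point
# values that minimize quantization error instead of using the raw min/max.
qconfig = QConfig(
    activation=HistogramObserver.with_args(
        dtype=torch.quint8, qscheme=torch.per_tensor_affine
    ),
    weight=PerChannelMinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_channel_symmetric
    ),
)
```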
## Implementation error
Implementation error
~~~~~~~~~~~~~~~~~~~~
If you are using PyTorch quantization with your own backend
you may see differences between the reference implementation of an
@ -74,23 +80,19 @@ operation (such as ``dequant -> op_fp32 -> quant``) and the quantized implementa
2. the kernel on the target hardware has an accuracy issue. In this case, reach
out to the kernel developer.
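A hedged sketch of that comparison (the op, scale, and zero_point below are illustrative assumptions): run the quantized kernel and the `dequant -> op_fp32 -> quant` reference on the same quantized input and look at the difference.
```python
import torch
import torch.ao.nn.quantized.functional as qF

# Hedged sketch: compare a quantized kernel against its fp32 reference path.
scale, zero_point = 0.1, 0
xq = torch.quantize_per_tensor(torch.randn(16), scale, zero_point, torch.quint8)

out_kernel = qF.hardswish(xq, scale=scale, zero_point=zero_point)
out_reference = torch.quantize_per_tensor(
    torch.nn.functional.hardswish(xq.dequantize()), scale, zero_point, torch.quint8
)
print((out_kernel.dequantize() - out_reference.dequantize()).abs().max())
```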
## Numerical Debugging Tooling (prototype)
Numerical Debugging Tooling (prototype)
---------------------------------------
```{eval-rst}
.. toctree::
:hidden:
torch.ao.ns._numeric_suite
torch.ao.ns._numeric_suite_fx
```
```{warning}
.. warning ::
Numerical debugging tooling is early prototype and subject to change.
```
```{eval-rst}
* :ref:`torch_ao_ns_numeric_suite`
Eager mode numeric suite
* :ref:`torch_ao_ns_numeric_suite_fx`
FX numeric suite
```
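As a rough usage sketch (assuming a quantized model has already been produced, here via dynamic quantization of a toy model), the Eager mode numeric suite can compare float and quantized weights layer by layer:
```python
import torch
from torch.ao.ns._numeric_suite import compare_weights
from torch.ao.quantization import quantize_dynamic

# Hedged sketch: dynamically quantize a tiny model, then compare weights.
float_model = torch.nn.Sequential(torch.nn.Linear(4, 4))
quantized_model = quantize_dynamic(float_model, {torch.nn.Linear}, dtype=torch.qint8)

wt_compare = compare_weights(float_model.state_dict(), quantized_model.state_dict())
for key, value in wt_compare.items():
    # each entry holds the float weight and its quantized counterpart
    print(key, value["float"].shape, value["quantized"].shape)
```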

View File

@ -1,4 +1,5 @@
# Quantization Backend Configuration
Quantization Backend Configuration
----------------------------------
FX Graph Mode Quantization allows the user to configure various
quantization behaviors of an op in order to match the expectation
@ -7,13 +8,13 @@ of their backend.
In the future, this document will contain a detailed spec of
these configurations.
## Default values for native configurations
Default values for native configurations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Below is the output of the configuration for quantization of ops
in x86 and qnnpack (PyTorch's default quantized backends).
Results:
```{eval-rst}
.. literalinclude:: scripts/quantization_backend_configs/default_backend_config.txt
```
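The same information can also be inspected programmatically; a hedged sketch (attribute names follow the current `BackendConfig` API and may change, since this is a prototype area):
```python
from torch.ao.quantization.backend_config import get_native_backend_config

# Hedged sketch: walk the native backend configuration and print each
# supported pattern.
backend_config = get_native_backend_config()
for pattern_config in backend_config.configs:
    print(pattern_config.pattern)
```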

View File

@ -1,16 +1,16 @@
# Quantization API Reference
Quantization API Reference
-------------------------------
## torch.ao.quantization
torch.ao.quantization
~~~~~~~~~~~~~~~~~~~~~
This module contains Eager mode quantization APIs.
```{eval-rst}
.. currentmodule:: torch.ao.quantization
```
### Top level APIs
Top level APIs
^^^^^^^^^^^^^^
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -22,11 +22,10 @@ This module contains Eager mode quantization APIs.
prepare
prepare_qat
convert
```
### Preparing model for quantization
Preparing model for quantization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -37,11 +36,10 @@ This module contains Eager mode quantization APIs.
DeQuantStub
QuantWrapper
add_quant_dequant
```
### Utility functions
Utility functions
^^^^^^^^^^^^^^^^^
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -50,17 +48,15 @@ This module contains Eager mode quantization APIs.
swap_module
propagate_qconfig_
default_eval_fn
```
## torch.ao.quantization.quantize_fx
torch.ao.quantization.quantize_fx
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This module contains FX graph mode quantization APIs (prototype).
```{eval-rst}
.. currentmodule:: torch.ao.quantization.quantize_fx
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -70,17 +66,14 @@ This module contains FX graph mode quantization APIs (prototype).
prepare_qat_fx
convert_fx
fuse_fx
```
## torch.ao.quantization.qconfig_mapping
torch.ao.quantization.qconfig_mapping
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This module contains QConfigMapping for configuring FX graph mode quantization.
```{eval-rst}
.. currentmodule:: torch.ao.quantization.qconfig_mapping
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -89,19 +82,16 @@ This module contains QConfigMapping for configuring FX graph mode quantization.
QConfigMapping
get_default_qconfig_mapping
get_default_qat_qconfig_mapping
```
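A brief usage sketch (the qconfig choice and the module name `"classifier"` are illustrative assumptions):
```python
import torch
from torch.ao.quantization import get_default_qconfig
from torch.ao.quantization.qconfig_mapping import QConfigMapping

# Hedged sketch: quantize everything with the default x86 qconfig, override the
# qconfig for Linear modules, and skip quantization for one submodule by name.
qconfig_mapping = (
    QConfigMapping()
    .set_global(get_default_qconfig("x86"))
    .set_object_type(torch.nn.Linear, get_default_qconfig("x86"))
    .set_module_name("classifier", None)
)
```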
## torch.ao.quantization.backend_config
torch.ao.quantization.backend_config
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This module contains BackendConfig, a config object that defines how quantization is supported
in a backend. Currently only used by FX Graph Mode Quantization, but we may extend Eager Mode
Quantization to work with this as well.
```{eval-rst}
.. currentmodule:: torch.ao.quantization.backend_config
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -112,17 +102,15 @@ Quantization to work with this as well.
DTypeConfig
DTypeWithConstraints
ObservationType
```
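A hedged sketch of how these pieces fit together, loosely following the BackendConfig tutorial (the backend name and dtype choices are illustrative assumptions):
```python
import torch
from torch.ao.quantization.backend_config import (
    BackendConfig,
    BackendPatternConfig,
    DTypeConfig,
    ObservationType,
)

# Hedged sketch: describe int8 support for Linear in a custom backend.
weighted_int8_dtype_config = DTypeConfig(
    input_dtype=torch.quint8,
    output_dtype=torch.quint8,
    weight_dtype=torch.qint8,
    bias_dtype=torch.float,
)
linear_config = (
    BackendPatternConfig(torch.nn.Linear)
    .set_observation_type(ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT)
    .add_dtype_config(weighted_int8_dtype_config)
    .set_root_module(torch.nn.Linear)
    .set_reference_quantized_module(torch.ao.nn.quantized.reference.Linear)
)
backend_config = BackendConfig("my_backend").set_backend_pattern_config(linear_config)
```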
## torch.ao.quantization.fx.custom_config
torch.ao.quantization.fx.custom_config
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This module contains a few CustomConfig classes that are used in both eager mode and FX graph mode quantization
```{eval-rst}
.. currentmodule:: torch.ao.quantization.fx.custom_config
```
```{eval-rst}
.. currentmodule:: torch.ao.quantization.fx.custom_config
.. autosummary::
:toctree: generated
:nosignatures:
@ -132,62 +120,48 @@ This module contains a few CustomConfig classes that's used in both eager mode a
PrepareCustomConfig
ConvertCustomConfig
StandaloneModuleConfigEntry
```
## torch.ao.quantization.quantizer
torch.ao.quantization.quantizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```{eval-rst}
.. automodule:: torch.ao.quantization.quantizer
```
## torch.ao.quantization.pt2e (quantization in pytorch 2.0 export implementation)
torch.ao.quantization.pt2e (quantization in pytorch 2.0 export implementation)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```{eval-rst}
.. automodule:: torch.ao.quantization.pt2e
.. automodule:: torch.ao.quantization.pt2e.representation
```
## torch.ao.quantization.pt2e.export_utils
torch.ao.quantization.pt2e.export_utils
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```{eval-rst}
.. currentmodule:: torch.ao.quantization.pt2e.export_utils
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
:template: classtemplate.rst
model_is_exported
```
```{eval-rst}
.. currentmodule:: torch.ao.quantization
```
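A short usage sketch (the toy module is an assumption for illustration):
```python
import torch
from torch.ao.quantization.pt2e.export_utils import model_is_exported

# Hedged sketch: model_is_exported distinguishes a module captured by
# torch.export from a regular eager nn.Module.
class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

eager_model = M()
exported_model = torch.export.export_for_training(M(), (torch.randn(2),)).module()
print(model_is_exported(eager_model))     # False
print(model_is_exported(exported_model))  # True
```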
## torch.ao.quantization.pt2e.lowering
torch.ao.quantization.pt2e.lowering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```{eval-rst}
.. currentmodule:: torch.ao.quantization.pt2e.lowering
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
:template: classtemplate.rst
lower_pt2e_quantized_to_x86
```
```{eval-rst}
.. currentmodule:: torch.ao.quantization
```
## PT2 Export (pt2e) Numeric Debugger
```{eval-rst}
PT2 Export (pt2e) Numeric Debugger
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: generated
:nosignatures:
@ -199,17 +173,14 @@ This module contains a few CustomConfig classes that's used in both eager mode a
prepare_for_propagation_comparison
extract_results_from_loggers
compare_results
```
## torch (quantization related functions)
torch (quantization related functions)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This describes the quantization related functions of the `torch` namespace.
```{eval-rst}
.. currentmodule:: torch
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -218,18 +189,15 @@ This describes the quantization related functions of the `torch` namespace.
quantize_per_tensor
quantize_per_channel
dequantize
```
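A quick hedged example of these functions (the scale and zero_point values are chosen only for illustration):
```python
import torch

# Hedged sketch: quantize a float tensor per tensor, inspect the underlying
# integers, and map it back to float.
x = torch.tensor([-1.0, 0.0, 1.0, 2.0])
xq = torch.quantize_per_tensor(x, scale=0.1, zero_point=10, dtype=torch.quint8)
print(xq.int_repr())         # stored uint8 values
print(torch.dequantize(xq))  # back to float32 (with quantization error)
```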
## torch.Tensor (quantization related methods)
torch.Tensor (quantization related methods)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Quantized Tensors support a limited subset of data manipulation methods of the
regular full-precision tensor.
```{eval-rst}
.. currentmodule:: torch.Tensor
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -262,18 +230,16 @@ regular full-precision tensor.
resize_
sort
topk
```
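A hedged sketch of a few quantization-related methods on a quantized tensor (values are illustrative):
```python
import torch

# Hedged sketch: inspect a quantized tensor with Tensor methods.
xq = torch.quantize_per_tensor(
    torch.randn(2, 3), scale=0.05, zero_point=0, dtype=torch.qint8
)
print(xq.is_quantized)    # True
print(xq.q_scale())       # 0.05
print(xq.q_zero_point())  # 0
print(xq.dequantize())    # regular float32 tensor
```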
## torch.ao.quantization.observer
torch.ao.quantization.observer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This module contains observers which are used to collect statistics about
the values observed during calibration (PTQ) or training (QAT).
```{eval-rst}
.. currentmodule:: torch.ao.quantization.observer
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -310,18 +276,15 @@ the values observed during calibration (PTQ) or training (QAT).
TorchAODType
ZeroPointDomain
get_block_size
```
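As a small usage sketch (data and observer arguments are illustrative):
```python
import torch
from torch.ao.quantization.observer import MinMaxObserver

# Hedged sketch: feed a few batches through an observer, then ask it for
# quantization parameters.
obs = MinMaxObserver(dtype=torch.quint8, qscheme=torch.per_tensor_affine)
for _ in range(3):
    obs(torch.randn(4, 4))  # observers record statistics in forward()
scale, zero_point = obs.calculate_qparams()
print(scale, zero_point)
```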
## torch.ao.quantization.fake_quantize
torch.ao.quantization.fake_quantize
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This module implements modules which are used to perform fake quantization
during QAT.
```{eval-rst}
.. currentmodule:: torch.ao.quantization.fake_quantize
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -342,18 +305,15 @@ during QAT.
enable_fake_quant
disable_observer
enable_observer
```
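A minimal sketch of fake quantization on its own (the arguments mirror common defaults and are shown only for illustration):
```python
import torch
from torch.ao.quantization.fake_quantize import FakeQuantize
from torch.ao.quantization.observer import MovingAverageMinMaxObserver

# Hedged sketch: a FakeQuantize module simulating uint8 activation quantization;
# the output stays in fp32 but carries the rounding error of quantization.
fq = FakeQuantize(
    observer=MovingAverageMinMaxObserver,
    quant_min=0,
    quant_max=255,
    dtype=torch.quint8,
    qscheme=torch.per_tensor_affine,
)
print(fq(torch.randn(4)))
```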
## torch.ao.quantization.qconfig
torch.ao.quantization.qconfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This module defines `QConfig` objects which are used
to configure quantization settings for individual ops.
```{eval-rst}
.. currentmodule:: torch.ao.quantization.qconfig
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -372,23 +332,17 @@ to configure quantization settings for individual ops.
default_weight_only_qconfig
default_activation_only_qconfig
default_qat_qconfig_v2
```
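A hedged sketch of working with qconfigs directly (the backend string and observer/fake-quantize choices are assumptions for illustration):
```python
import torch
from torch.ao.quantization.fake_quantize import (
    default_fake_quant,
    default_per_channel_weight_fake_quant,
)
from torch.ao.quantization.qconfig import QConfig, get_default_qat_qconfig, get_default_qconfig

# Hedged sketch: premade qconfigs for PTQ and QAT, plus one assembled by hand.
ptq_qconfig = get_default_qconfig("x86")      # observers only (post training)
qat_qconfig = get_default_qat_qconfig("x86")  # fake-quant modules (QAT)
custom_qat_qconfig = QConfig(
    activation=default_fake_quant,
    weight=default_per_channel_weight_fake_quant,
)
```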
## torch.ao.nn.intrinsic
```{eval-rst}
torch.ao.nn.intrinsic
~~~~~~~~~~~~~~~~~~~~~
.. automodule:: torch.ao.nn.intrinsic
.. automodule:: torch.ao.nn.intrinsic.modules
```
This module implements the combined (fused) modules conv + relu which can
then be quantized.
```{eval-rst}
.. currentmodule:: torch.ao.nn.intrinsic
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -406,23 +360,18 @@ then be quantized.
ConvBnReLU3d
BNReLU2d
BNReLU3d
```
## torch.ao.nn.intrinsic.qat
```{eval-rst}
torch.ao.nn.intrinsic.qat
~~~~~~~~~~~~~~~~~~~~~~~~~
.. automodule:: torch.ao.nn.intrinsic.qat
.. automodule:: torch.ao.nn.intrinsic.qat.modules
```
This module implements the versions of those fused operations needed for
quantization aware training.
```{eval-rst}
.. currentmodule:: torch.ao.nn.intrinsic.qat
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -439,24 +388,19 @@ quantization aware training.
ConvReLU3d
update_bn_stats
freeze_bn_stats
```
## torch.ao.nn.intrinsic.quantized
```{eval-rst}
torch.ao.nn.intrinsic.quantized
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. automodule:: torch.ao.nn.intrinsic.quantized
.. automodule:: torch.ao.nn.intrinsic.quantized.modules
```
This module implements the quantized implementations of fused operations
like conv + relu. No BatchNorm variants as it's usually folded into convolution
for inference.
```{eval-rst}
.. currentmodule:: torch.ao.nn.intrinsic.quantized
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -468,47 +412,35 @@ for inference.
ConvReLU2d
ConvReLU3d
LinearReLU
```
## torch.ao.nn.intrinsic.quantized.dynamic
```{eval-rst}
torch.ao.nn.intrinsic.quantized.dynamic
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. automodule:: torch.ao.nn.intrinsic.quantized.dynamic
.. automodule:: torch.ao.nn.intrinsic.quantized.dynamic.modules
```
This module implements the quantized dynamic implementations of fused operations
like linear + relu.
```{eval-rst}
.. currentmodule:: torch.ao.nn.intrinsic.quantized.dynamic
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
:template: classtemplate.rst
LinearReLU
```
## torch.ao.nn.qat
```{eval-rst}
torch.ao.nn.qat
~~~~~~~~~~~~~~~~~~~~~~
.. automodule:: torch.ao.nn.qat
.. automodule:: torch.ao.nn.qat.modules
```
This module implements versions of the key nn modules **Conv2d()** and
**Linear()** which run in FP32 but with rounding applied to simulate the
effect of INT8 quantization.
```{eval-rst}
.. currentmodule:: torch.ao.nn.qat
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -517,48 +449,36 @@ effect of INT8 quantization.
Conv2d
Conv3d
Linear
```
## torch.ao.nn.qat.dynamic
```{eval-rst}
torch.ao.nn.qat.dynamic
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. automodule:: torch.ao.nn.qat.dynamic
.. automodule:: torch.ao.nn.qat.dynamic.modules
```
This module implements versions of the key nn modules such as **Linear()**
which run in FP32 but with rounding applied to simulate the effect of INT8
quantization and will be dynamically quantized during inference.
```{eval-rst}
.. currentmodule:: torch.ao.nn.qat.dynamic
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
:template: classtemplate.rst
Linear
```
## torch.ao.nn.quantized
```{eval-rst}
torch.ao.nn.quantized
~~~~~~~~~~~~~~~~~~~~~~
.. automodule:: torch.ao.nn.quantized
:noindex:
.. automodule:: torch.ao.nn.quantized.modules
```
This module implements the quantized versions of the nn layers such as
`~torch.nn.Conv2d` and `torch.nn.ReLU`.
```{eval-rst}
.. currentmodule:: torch.ao.nn.quantized
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -588,25 +508,17 @@ This module implements the quantized versions of the nn layers such as
InstanceNorm1d
InstanceNorm2d
InstanceNorm3d
```
## torch.ao.nn.quantized.functional
```{eval-rst}
torch.ao.nn.quantized.functional
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. automodule:: torch.ao.nn.quantized.functional
```
```{eval-rst}
This module implements the quantized versions of the functional layers such as
`~torch.nn.functional.conv2d` and `torch.nn.functional.relu`. Note:
:math:`~torch.nn.functional.relu` supports quantized inputs.
```
:meth:`~torch.nn.functional.relu` supports quantized inputs.
```{eval-rst}
.. currentmodule:: torch.ao.nn.quantized.functional
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -634,19 +546,16 @@ This module implements the quantized versions of the functional layers such as
upsample
upsample_bilinear
upsample_nearest
```
## torch.ao.nn.quantizable
torch.ao.nn.quantizable
~~~~~~~~~~~~~~~~~~~~~~~
This module implements the quantizable versions of some of the nn layers.
These modules can be used in conjunction with the custom module mechanism,
by providing the ``custom_module_config`` argument to both prepare and convert.
```{eval-rst}
.. currentmodule:: torch.ao.nn.quantizable
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -654,24 +563,19 @@ by providing the ``custom_module_config`` argument to both prepare and convert.
LSTM
MultiheadAttention
```
## torch.ao.nn.quantized.dynamic
```{eval-rst}
torch.ao.nn.quantized.dynamic
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. automodule:: torch.ao.nn.quantized.dynamic
.. automodule:: torch.ao.nn.quantized.dynamic.modules
```
Dynamically quantized {class}`~torch.nn.Linear`, {class}`~torch.nn.LSTM`,
{class}`~torch.nn.LSTMCell`, {class}`~torch.nn.GRUCell`, and
{class}`~torch.nn.RNNCell`.
Dynamically quantized :class:`~torch.nn.Linear`, :class:`~torch.nn.LSTM`,
:class:`~torch.nn.LSTMCell`, :class:`~torch.nn.GRUCell`, and
:class:`~torch.nn.RNNCell`.
```{eval-rst}
.. currentmodule:: torch.ao.nn.quantized.dynamic
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -683,9 +587,9 @@ Dynamically quantized {class}`~torch.nn.Linear`, {class}`~torch.nn.LSTM`,
RNNCell
LSTMCell
GRUCell
```
## Quantized dtypes and quantization schemes
Quantized dtypes and quantization schemes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Note that operator implementations currently only
support per channel quantization for weights of the **conv** and **linear**
@ -693,7 +597,6 @@ operators. Furthermore, the input data is
mapped linearly to the quantized data and vice versa
as follows:
```{eval-rst}
.. math::
\begin{aligned}
@ -702,15 +605,11 @@ as follows:
\text{Dequantization:}&\\
&x_\text{out} = (Q_\text{input}-z)*s
\end{aligned}
```
```{eval-rst}
where :math:`\text{clamp}(.)` is the same as :func:`~torch.clamp` while the
scale :math:`s` and zero point :math:`z` are then computed
as described in :class:`~torch.ao.quantization.observer.MinMaxObserver`, specifically:
```
```{eval-rst}
.. math::
\begin{aligned}
@ -726,7 +625,6 @@ as described in :class:`~torch.ao.quantization.observer.MinMaxObserver`, specifi
\left( Q_\text{max} - Q_\text{min} \right ) \\
&z = Q_\text{min} - \text{round}(x_\text{min} / s)
\end{aligned}
```
where :math:`[x_\text{min}, x_\text{max}]` denotes the range of the input data while
:math:`Q_\text{min}` and :math:`Q_\text{max}` are respectively the minimum and maximum values of the quantized dtype.
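As a quick hedged numeric check of these formulas for `torch.quint8` (so Q_min = 0 and Q_max = 255), mirroring what `MinMaxObserver` does for affine quantization (the input values are arbitrary):
```python
import torch

# Hedged sketch: compute scale and zero_point by hand, then verify with
# torch.quantize_per_tensor.
x = torch.tensor([-1.0, 0.0, 0.5, 2.0])
x_min = min(x.min().item(), 0.0)  # the observer always includes zero in the range
x_max = max(x.max().item(), 0.0)
scale = (x_max - x_min) / (255 - 0)
zero_point = 0 - round(x_min / scale)
xq = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8)
print(scale, zero_point)  # ~0.0118, 85
print(xq.int_repr())      # integers in [0, 255]
```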
@ -737,7 +635,6 @@ the range of the input data or symmetric quantization is being used.
Additional data types and quantization schemes can be implemented through
the `custom operator mechanism <https://pytorch.org/tutorials/advanced/torch_script_custom_ops.html>`_.
```{eval-rst}
* :attr:`torch.qscheme` — Type to describe the quantization scheme of a tensor.
Supported types:
@ -751,9 +648,8 @@ the `custom operator mechanism <https://pytorch.org/tutorials/advanced/torch_scr
* :attr:`torch.quint8` — 8-bit unsigned integer
* :attr:`torch.qint8` — 8-bit signed integer
* :attr:`torch.qint32` — 32-bit signed integer
```
```{eval-rst}
.. These modules are missing docs. Adding them here only for tracking
.. automodule:: torch.ao.nn.quantizable.modules
:noindex:
@ -782,4 +678,3 @@ the `custom operator mechanism <https://pytorch.org/tutorials/advanced/torch_scr
.. automodule:: torch.nn.quantized.dynamic.modules
.. automodule:: torch.quantization
.. automodule:: torch.nn.intrinsic.modules
```

View File

@ -1,17 +1,16 @@
(quantization-doc)=
.. _quantization-doc:
# Quantization
Quantization
============
```{eval-rst}
.. automodule:: torch.ao.quantization
.. automodule:: torch.ao.quantization.fx
```
```{warning}
.. warning ::
Quantization is in beta and subject to change.
```
## Introduction to Quantization
Introduction to Quantization
----------------------------
Quantization refers to techniques for performing computations and storing
tensors at lower bitwidths than floating point precision. A quantized model
@ -39,13 +38,14 @@ that perform all or part of the computation in lower precision. Higher-level
APIs are provided that incorporate typical workflows of converting FP32 model
to lower precision with minimal accuracy loss.
## Quantization API Summary
Quantization API Summary
-----------------------------
PyTorch provides three different modes of quantization: Eager Mode Quantization, FX Graph Mode Quantization (maintenance) and PyTorch 2 Export Quantization.
Eager Mode Quantization is a beta feature. Users need to do fusion manually and specify where quantization and dequantization happen; it also supports only modules and not functionals.
FX Graph Mode Quantization is an automated quantization workflow in PyTorch, and currently it's a prototype feature; it is in maintenance mode since we have PyTorch 2 Export Quantization. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although people might need to refactor the model to make the model compatible with FX Graph Mode Quantization (symbolically traceable with `torch.fx`). Note that FX Graph Mode Quantization is not expected to work on arbitrary models since the model might not be symbolically traceable; we will integrate it into domain libraries like torchvision and users will be able to quantize models similar to the ones in supported domain libraries with FX Graph Mode Quantization. For arbitrary models we'll provide general guidelines, but to actually make it work, users might need to be familiar with `torch.fx`, especially on how to make a model symbolically traceable.
FX Graph Mode Quantization is an automated quantization workflow in PyTorch, and currently it's a prototype feature; it is in maintenance mode since we have PyTorch 2 Export Quantization. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although people might need to refactor the model to make the model compatible with FX Graph Mode Quantization (symbolically traceable with ``torch.fx``). Note that FX Graph Mode Quantization is not expected to work on arbitrary models since the model might not be symbolically traceable; we will integrate it into domain libraries like torchvision and users will be able to quantize models similar to the ones in supported domain libraries with FX Graph Mode Quantization. For arbitrary models we'll provide general guidelines, but to actually make it work, users might need to be familiar with ``torch.fx``, especially on how to make a model symbolically traceable.
PyTorch 2 Export Quantization is the new full graph mode quantization workflow, released as prototype feature in PyTorch 2.1. With PyTorch 2, we are moving to a better solution for full program capture (torch.export) since it can capture a higher percentage (88.8% on 14K models) of models compared to torch.fx.symbolic_trace (72.7% on 14K models), the program capture solution used by FX Graph Mode Quantization. torch.export still has limitations around some python constructs and requires user involvement to support dynamism in the exported model, but overall it is an improvement over the previous program capture solution. PyTorch 2 Export Quantization is built for models captured by torch.export, with flexibility and productivity of both modeling users and backend developers in mind. The main features are
(1). Programmable API for configuring how a model is quantized that can scale to many more use cases
@ -56,17 +56,49 @@ New users of quantization are encouraged to try out PyTorch 2 Export Quantizatio
The following table compares the differences between Eager Mode Quantization, FX Graph Mode Quantization and PyTorch 2 Export Quantization:
| | Eager Mode Quantization | FX Graph Mode Quantization | PyTorch 2 Export Quantization |
|-------------------------|-------------------------|-----------------------------|-------------------------------|
+-----------------+-------------------+-------------------+-------------------------+
| |Eager Mode |FX Graph |PyTorch 2 Export |
| |Quantization |Mode |Quantization |
| | |Quantization | |
+-----------------+-------------------+-------------------+-------------------------+
|Release |beta |prototype |prototype |
|Status | |(maintenance) | |
| Operator Fusion | Manual | Automatic | Automatic |
| Quant/DeQuant Placement | Manual | Automatic | Automatic |
| Quantizing Modules | Supported | Supported | Supported |
| Quantizing Functionals/Torch Ops | Manual | Automatic | Supported |
| Support for Customization | Limited Support | Fully Supported | Fully Supported |
| Quantization Mode Support | Post Training Quantization: Static, Dynamic, Weight Only<br><br>Quantization Aware Training: Static | Post Training Quantization: Static, Dynamic, Weight Only<br><br>Quantization Aware Training: Static | Defined by Backend Specific Quantizer |
| Input/Output Model Type | `torch.nn.Module` | `torch.nn.Module` (May need some refactors to make the model compatible with FX Graph Mode Quantization) | `torch.fx.GraphModule` (captured by `torch.export`) |
+-----------------+-------------------+-------------------+-------------------------+
|Operator |Manual |Automatic |Automatic |
|Fusion | | | |
+-----------------+-------------------+-------------------+-------------------------+
|Quant/DeQuant |Manual |Automatic |Automatic |
|Placement | | | |
+-----------------+-------------------+-------------------+-------------------------+
|Quantizing |Supported |Supported |Supported |
|Modules | | | |
+-----------------+-------------------+-------------------+-------------------------+
|Quantizing |Manual |Automatic |Supported |
|Functionals/Torch| | | |
|Ops | | | |
+-----------------+-------------------+-------------------+-------------------------+
|Support for |Limited Support |Fully |Fully Supported |
|Customization | |Supported | |
+-----------------+-------------------+-------------------+-------------------------+
|Quantization Mode|Post Training |Post Training |Defined by |
|Support |Quantization: |Quantization: |Backend Specific |
| |Static, Dynamic, |Static, Dynamic, |Quantizer |
| |Weight Only |Weight Only | |
| | | | |
| |Quantization Aware |Quantization Aware | |
| |Training: |Training: | |
| |Static |Static | |
+-----------------+-------------------+-------------------+-------------------------+
|Input/Output |``torch.nn.Module``|``torch.nn.Module``|``torch.fx.GraphModule`` |
|Model Type | |(May need some |(captured by |
| | |refactors to make |``torch.export`` |
| | |the model | |
| | |compatible with FX | |
| | |Graph Mode | |
| | |Quantization) | |
+-----------------+-------------------+-------------------+-------------------------+
There are three types of quantization supported:
@ -77,31 +109,48 @@ There are three types of quantization supported:
3. static quantization aware training (weights quantized, activations quantized,
quantization numerics modeled during training)
Please see our [Introduction to Quantization on PyTorch](https://pytorch.org/blog/introduction-to-quantization-on-pytorch/)
blog post for a more comprehensive overview of the tradeoffs between these quantization
Please see our `Introduction to Quantization on PyTorch
<https://pytorch.org/blog/introduction-to-quantization-on-pytorch/>`_ blog post
for a more comprehensive overview of the tradeoffs between these quantization
types.
Operator coverage varies between dynamic and static quantization and is captured in the table below.
| | Static Quantization | Dynamic Quantization |
|---------------------------|-----------------------------|----------------------------|
| nn.Linear | Y | Y |
| nn.Conv1d/2d/3d | Y | N |
| nn.LSTM | Y (through custom modules) | Y |
| nn.GRU | N | Y |
| nn.RNNCell | N | Y |
| nn.GRUCell | N | Y |
| nn.LSTMCell | N | Y |
| nn.EmbeddingBag | Y (activations are in fp32) | Y |
+---------------------------+-------------------+--------------------+
| |Static | Dynamic |
| |Quantization | Quantization |
+---------------------------+-------------------+--------------------+
| | nn.Linear | | Y | | Y |
| | nn.Conv1d/2d/3d | | Y | | N |
+---------------------------+-------------------+--------------------+
| | nn.LSTM | | Y (through | | Y |
| | | | custom modules) | | |
| | nn.GRU | | N | | Y |
+---------------------------+-------------------+--------------------+
| | nn.RNNCell | | N | | Y |
| | nn.GRUCell | | N | | Y |
| | nn.LSTMCell | | N | | Y |
+---------------------------+-------------------+--------------------+
|nn.EmbeddingBag | Y (activations | |
| | are in fp32) | Y |
+---------------------------+-------------------+--------------------+
|nn.Embedding | Y | Y |
| nn.MultiheadAttention | Y (through custom modules) | Not supported |
| Activations | Broadly supported | Un-changed, computations stay in fp32 |
+---------------------------+-------------------+--------------------+
| nn.MultiheadAttention | Y (through | Not supported |
| | custom modules) | |
+---------------------------+-------------------+--------------------+
| Activations | Broadly supported | Un-changed, |
| | | computations |
| | | stay in fp32 |
+---------------------------+-------------------+--------------------+
### Eager Mode Quantization
For a general introduction to the quantization flow, including different types of quantization, please take a look at {ref}`general-quantization-flow`.
Eager Mode Quantization
^^^^^^^^^^^^^^^^^^^^^^^
For a general introduction to the quantization flow, including different types of quantization, please take a look at `General Quantization Flow`_.
#### Post Training Dynamic Quantization
Post Training Dynamic Quantization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is the simplest to apply form of quantization where the weights are
quantized ahead of time but the activations are dynamically quantized
@ -110,9 +159,8 @@ is dominated by loading weights from memory rather than computing the matrix
multiplications. This is true for LSTM and Transformer type models with
small batch size.
Diagram:
Diagram::
```python
# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
@ -124,11 +172,9 @@ Diagram:
previous_layer_fp32 -- linear_int8_w_fp32_inp -- activation_fp32 -- next_layer_fp32
/
linear_weight_int8
```
PTDQ API Example:
PTDQ API Example::
```python
import torch
# define a floating point model
@ -152,11 +198,12 @@ PTDQ API Example:
# run the model
input_fp32 = torch.randn(4, 4, 4, 4)
res = model_int8(input_fp32)
```
To learn more about dynamic quantization please see our [dynamic quantization tutorial](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html).
To learn more about dynamic quantization please see our `dynamic quantization tutorial
<https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html>`_.
#### Post Training Static Quantization
Post Training Static Quantization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Post Training Static Quantization (PTQ static) quantizes the weights and activations of the model. It
fuses activations into preceding layers where possible. It requires
@ -165,11 +212,10 @@ parameters for activations. Post Training Static Quantization is typically used
both memory bandwidth and compute savings are important with CNNs being a
typical use case.
We may need to modify the model before applying post training static quantization. Please see {ref}`model-preparation-for-eager-mode-static-quantization`.
We may need to modify the model before applying post training static quantization. Please see `Model Preparation for Eager Mode Static Quantization`_.
Diagram:
Diagram::
```python
# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
@ -181,11 +227,9 @@ Diagram:
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
/
linear_weight_int8
```
PTSQ API Example:
PTSQ API Example::
```python
import torch
# define a floating point model where some layers could be statically quantized
@ -248,11 +292,12 @@ PTSQ API Example:
# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)
```
To learn more about static quantization, please see the [static quantization tutorial](https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html).
To learn more about static quantization, please see the `static quantization tutorial
<https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html>`_.
#### Quantization Aware Training for Static Quantization
Quantization Aware Training for Static Quantization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Quantization Aware Training (QAT) models the effects of quantization during training
allowing for higher accuracy compared to other quantization methods. We can do QAT for static, dynamic or weight only quantization. During
@ -263,11 +308,10 @@ activations are quantized, and activations are fused into the preceding layer
where possible. It is commonly used with CNNs and yields a higher accuracy
compared to static quantization.
We may need to modify the model before applying post training static quantization. Please see {ref}`model-preparation-for-eager-mode-static-quantization`.
We may need to modify the model before applying post training static quantization. Please see `Model Preparation for Eager Mode Static Quantization`_.
Diagram:
Diagram::
```python
# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
@ -284,11 +328,9 @@ Diagram:
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
/
linear_weight_int8
```
QAT API Example:
QAT API Example::
```python
import torch
# define a floating point model where some layers could benefit from QAT
@ -349,14 +391,13 @@ QAT API Example:
# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)
```
To learn more about quantization aware training, please see the
[QAT tutorial](https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html).
To learn more about quantization aware training, please see the `QAT
tutorial
<https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html>`_.
(model-preparation-for-eager-mode-static-quantization)=
#### Model Preparation for Eager Mode Static Quantization
Model Preparation for Eager Mode Static Quantization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It is currently necessary to make some modifications to the model definition
prior to Eager mode quantization. This is because quantization currently works on a module
@ -364,39 +405,38 @@ by module basis. Specifically, for all quantization techniques, the user needs t
1. Convert any operations that require output requantization (and thus have
additional parameters) from functionals to module form (for example,
using `torch.nn.ReLU` instead of `torch.nn.functional.relu`).
using ``torch.nn.ReLU`` instead of ``torch.nn.functional.relu``).
2. Specify which parts of the model need to be quantized either by assigning
`.qconfig` attributes on submodules or by specifying `qconfig_mapping`.
For example, setting `model.conv1.qconfig = None` means that the
`model.conv` layer will not be quantized, and setting
`model.linear1.qconfig = custom_qconfig` means that the quantization
settings for `model.linear1` will be using `custom_qconfig` instead
``.qconfig`` attributes on submodules or by specifying ``qconfig_mapping``.
For example, setting ``model.conv1.qconfig = None`` means that the
``model.conv`` layer will not be quantized, and setting
``model.linear1.qconfig = custom_qconfig`` means that the quantization
settings for ``model.linear1`` will be using ``custom_qconfig`` instead
of the global qconfig.
For static quantization techniques which quantize activations, the user needs
to do the following in addition:
1. Specify where activations are quantized and de-quantized. This is done using
{class}`~torch.ao.quantization.QuantStub` and
{class}`~torch.ao.quantization.DeQuantStub` modules.
2. Use {class}`~torch.ao.nn.quantized.FloatFunctional` to wrap tensor operations
:class:`~torch.ao.quantization.QuantStub` and
:class:`~torch.ao.quantization.DeQuantStub` modules.
2. Use :class:`~torch.ao.nn.quantized.FloatFunctional` to wrap tensor operations
that require special handling for quantization into modules. Examples
are operations like `add` and `cat` which require special handling to
are operations like ``add`` and ``cat`` which require special handling to
determine output quantization parameters.
3. Fuse modules: combine operations/modules into a single module to obtain
higher accuracy and performance. This is done using the
{func}`~torch.ao.quantization.fuse_modules.fuse_modules` API, which takes in lists of modules
:func:`~torch.ao.quantization.fuse_modules.fuse_modules` API, which takes in lists of modules
to be fused. We currently support the following fusions:
`[Conv, Relu]`, `[Conv, BatchNorm]`, `[Conv, BatchNorm, Relu]`, `[Linear, Relu]`
[Conv, Relu], [Conv, BatchNorm], [Conv, BatchNorm, Relu], [Linear, Relu]
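A hedged sketch that ties these preparation steps together (the toy module and fusion list are illustrative, not taken from this page):
```python
import torch
import torch.ao.nn.quantized as nnq
import torch.ao.quantization as tq

# Hedged sketch: stubs mark where activations are quantized/dequantized,
# FloatFunctional wraps the add, and fuse_modules fuses conv + bn + relu.
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.conv = torch.nn.Conv2d(3, 3, 1)
        self.bn = torch.nn.BatchNorm2d(3)
        self.relu = torch.nn.ReLU()
        self.skip_add = nnq.FloatFunctional()
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        y = self.relu(self.bn(self.conv(x)))
        # add needs output quantization parameters, hence FloatFunctional
        y = self.skip_add.add(x, y)
        return self.dequant(y)

m = M().eval()
m_fused = tq.fuse_modules(m, [["conv", "bn", "relu"]])
```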
(prototype-maintenance-mode-fx-graph-mode-quantization)=
### (Prototype - maintenance mode) FX Graph Mode Quantization
(Prototype - maintenance mode) FX Graph Mode Quantization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are multiple quantization types in post training quantization (weight only, dynamic and static) and the configuration is done through `qconfig_mapping` (an argument of the `prepare_fx` function).
FXPTQ API Example:
FXPTQ API Example::
```python
import torch
from torch.ao.quantization import (
get_default_qconfig_mapping,
@ -455,19 +495,17 @@ FXPTQ API Example:
#
model_to_quantize = copy.deepcopy(model_fp)
model_fused = quantize_fx.fuse_fx(model_to_quantize)
```
Please follow the tutorials below to learn more about FX Graph Mode Quantization:
- [User Guide on Using FX Graph Mode Quantization](https://pytorch.org/tutorials/prototype/fx_graph_mode_quant_guide.html)
- [FX Graph Mode Post Training Static Quantization](https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_static.html)
- [FX Graph Mode Post Training Dynamic Quantization](https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_dynamic.html)
- `User Guide on Using FX Graph Mode Quantization <https://pytorch.org/tutorials/prototype/fx_graph_mode_quant_guide.html>`_
- `FX Graph Mode Post Training Static Quantization <https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_static.html>`_
- `FX Graph Mode Post Training Dynamic Quantization <https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_dynamic.html>`_
### (Prototype) PyTorch 2 Export Quantization
(Prototype) PyTorch 2 Export Quantization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
API Example::
API Example:
```python
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e
from torch.export import export_for_training
@ -514,31 +552,29 @@ API Example:
# Step 3. lowering
# lower to target backend
```
Please follow these tutorials to get started on PyTorch 2 Export Quantization:
Modeling Users:
- [PyTorch 2 Export Post Training Quantization](https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html)
- [PyTorch 2 Export Post Training Quantization with X86 Backend through Inductor](https://pytorch.org/tutorials/prototype/pt2e_quant_x86_inductor.html)
- [PyTorch 2 Export Quantization Aware Training](https://pytorch.org/tutorials/prototype/pt2e_quant_qat.html)
- `PyTorch 2 Export Post Training Quantization <https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html>`_
- `PyTorch 2 Export Post Training Quantization with X86 Backend through Inductor <https://pytorch.org/tutorials/prototype/pt2e_quant_x86_inductor.html>`_
- `PyTorch 2 Export Quantization Aware Training <https://pytorch.org/tutorials/prototype/pt2e_quant_qat.html>`_
Backend Developers (please check out all Modeling Users docs as well):
- [How to Write a Quantizer for PyTorch 2 Export Quantization](https://pytorch.org/tutorials/prototype/pt2e_quantizer.html)
- `How to Write a Quantizer for PyTorch 2 Export Quantization <https://pytorch.org/tutorials/prototype/pt2e_quantizer.html>`_
## Quantization Stack
Quantization is the process to convert a floating point model to a quantized model. So at high level the quantization stack can be split into two parts:
1. The building blocks or abstractions for a quantized model
2. The building blocks or abstractions for the quantization flow that converts a floating point model to a quantized model
### Quantized Model
#### Quantized Tensor
Quantization Stack
------------------------
Quantization is the process to convert a floating point model to a quantized model. So at high level the quantization stack can be split into two parts: 1). The building blocks or abstractions for a quantized model 2). The building blocks or abstractions for the quantization flow that converts a floating point model to a quantized model
Quantized Model
^^^^^^^^^^^^^^^^^^^^^^^
Quantized Tensor
~~~~~~~~~~~~~~~~~
In order to do quantization in PyTorch, we need to be able to represent
quantized data in Tensors. A Quantized Tensor allows for storing
quantized data (represented as int8/uint8/int32) along with quantization
@ -550,10 +586,8 @@ PyTorch supports both per tensor and per channel symmetric and asymmetric quanti
The mapping is performed by converting the floating point tensors using
```{eval-rst}
.. image:: math-quantizer-equation.png
:width: 40%
```
Note that we ensure that zero in floating point is represented with no error
after quantization, thereby ensuring that operations like padding do not cause
@ -587,8 +621,8 @@ Here are a few key attributes for quantized Tensor:
* per_channel_zero_points (list of int)
* axis (int)
#### Quantize and Dequantize
Quantize and Dequantize
~~~~~~~~~~~~~~~~~~~~~~~
The input and output of a model are floating point Tensors, but activations in the quantized model are quantized, so we need operators to convert between floating point and quantized Tensors.
* Quantize (float -> quantized)
@ -603,19 +637,20 @@ The input and output of a model are floating point Tensors, but activations in t
* quantized_tensor.dequantize() - calling dequantize on a torch.float16 Tensor will convert the Tensor back to torch.float
* torch.dequantize(x)
#### Quantized Operators/Modules
Quantized Operators/Modules
~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Quantized Operators are operators that take quantized Tensors as inputs and output a quantized Tensor.
* Quantized Modules are PyTorch Modules that perform quantized operations. They are typically defined for weighted operations like linear and conv.
#### Quantized Engine
Quantized Engine
~~~~~~~~~~~~~~~~~~~~
When a quantized model is executed, the qengine (torch.backends.quantized.engine) specifies which backend is to be used for execution. It is important to ensure that the qengine is compatible with the quantized model in terms of value range of quantized activation and weights.
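A quick hedged check of the qengine on a given build:
```python
import torch

# Hedged sketch: the currently selected qengine and the engines this build supports.
print(torch.backends.quantized.engine)
print(torch.backends.quantized.supported_engines)
```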
### Quantization Flow
#### Observer and FakeQuantize
Quantization Flow
^^^^^^^^^^^^^^^^^^^^^^^
Observer and FakeQuantize
~~~~~~~~~~~~~~~~~~~~~~~~~~
* Observers are PyTorch Modules used to:
* collect tensor statistics like min value and max value of the Tensor passing through the observer
@ -625,8 +660,8 @@ When a quantized model is executed, the qengine (torch.backends.quantized.engine
* simulate quantization (performing quantize/dequantize) for a Tensor in the network
* it can calculate quantization parameters based on the collected statistics from observer, or it can learn the quantization parameters as well
#### QConfig
QConfig
~~~~~~~~~~~
* QConfig is a namedtuple of Observer or FakeQuantize Module classes that are configurable with qscheme, dtype, etc.; it is used to configure how an operator should be observed
* Quantization configuration for an operator/module
@ -638,10 +673,8 @@ When a quantized model is executed, the qengine (torch.backends.quantized.engine
* Currently supports configuration for activation and weight
* We insert input/weight/output observer based on the qconfig that is configured for a given operator or module
(general-quantization-flow)=
#### General Quantization Flow
General Quantization Flow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In general, the flow is the following
* prepare
@ -671,62 +704,131 @@ And in terms of how we quantize the operators, we can have:
We can mix different ways of quantizing operators in the same quantization flow. For example, we can have post training quantization that has both statically and dynamically quantized operators.
## Quantization Support Matrix
Quantization Support Matrix
--------------------------------------
Quantization Mode Support
^^^^^^^^^^^^^^^^^^^^^^^^^^^
+-----------------------------+------------------------------------------------------+----------------+----------------+------------+-----------------+
| |Quantization |Dataset | Works Best For | Accuracy | Notes |
| |Mode |Requirement | | | |
+-----------------------------+---------------------------------+--------------------+----------------+----------------+------------+-----------------+
|Post Training Quantization |Dynamic/Weight Only Quantization |activation |None |LSTM, MLP, |good |Easy to use, |
| | |dynamically | |Embedding, | |close to static |
| | |quantized (fp16, | |Transformer | |quantization when|
| | |int8) or not | | | |performance is |
| | |quantized, weight | | | |compute or memory|
| | |statically quantized| | | |bound due to |
|                             |                                 |(fp16, int8, int4)  |                |                |            |weights          |
| +---------------------------------+--------------------+----------------+----------------+------------+-----------------+
| |Static Quantization |activation and |calibration |CNN |good |Provides best |
| | |weights statically |dataset | | |perf, may have |
| | |quantized (int8) | | | |big impact on |
| | | | | | |accuracy, good |
| | | | | | |for hardwares |
| | | | | | |that only support|
| | | | | | |int8 computation |
+-----------------------------+---------------------------------+--------------------+----------------+----------------+------------+-----------------+
| |Dynamic Quantization |activation and |fine-tuning |MLP, Embedding |best |Limited support |
| | |weight are fake |dataset | | |for now |
| | |quantized | | | | |
| +---------------------------------+--------------------+----------------+----------------+------------+-----------------+
| |Static Quantization |activation and |fine-tuning |CNN, MLP, |best |Typically used |
| | |weight are fake |dataset |Embedding | |when static |
| | |quantized | | | |quantization |
| | | | | | |leads to bad |
| | | | | | |accuracy, and |
| | | | | | |used to close the|
| | | | | | |accuracy gap |
|Quantization Aware Training | | | | | | |
+-----------------------------+---------------------------------+--------------------+----------------+----------------+------------+-----------------+
### Quantization Mode Support
| | Quantization Mode | Dataset Requirement | Works Best For | Accuracy | Notes |
|--|-------------------|---------------------|----------------|----------|-------|
| Post Training Quantization | Dynamic/Weight Only Quantization<br>(activation dynamically quantized - fp16, int8 or not quantized;<br>weight statically quantized - fp16, int8, int4) | None | LSTM, MLP, Embedding, Transformer | good | Easy to use, close to static quantization when performance is compute or memory bound due to weights |
| | Static Quantization<br>(activation and weights statically quantized - int8) | Calibration dataset | CNN | good | Provides best performance, may have big impact on accuracy, good for hardware that only supports int8 computation |
| | Dynamic Quantization<br>(activation and weights are fake quantized) | Fine-tuning dataset | MLP, Embedding | best | Limited support for now |
| Quantization Aware Training | Static Quantization<br>(activation and weights are fake quantized) | Fine-tuning dataset | CNN, MLP, Embedding | best | Typically used when static quantization leads to bad accuracy, helps close the accuracy gap |
Please see our [Introduction to Quantization on Pytorch](https://pytorch.org/blog/introduction-to-quantization-on-pytorch/)
blog post for a more comprehensive overview of the tradeoffs between these quantization
Please see our `Introduction to Quantization on Pytorch
<https://pytorch.org/blog/introduction-to-quantization-on-pytorch/>`_ blog post
for a more comprehensive overview of the tradeoffs between these quantization
types.
### Quantization Flow Support
Quantization Flow Support
^^^^^^^^^^^^^^^^^^^^^^^^^^^
PyTorch provides two modes of quantization: Eager Mode Quantization and FX Graph Mode Quantization.
Eager Mode Quantization is a beta feature. Users need to do fusion manually and specify where quantization and dequantization happen; it also supports only modules and not functionals.
FX Graph Mode Quantization is an automated quantization framework in PyTorch, and currently it's a prototype feature. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although people might need to refactor the model to make the model compatible with FX Graph Mode Quantization (symbolically traceable with `torch.fx`). Note that FX Graph Mode Quantization is not expected to work on arbitrary models since the model might not be symbolically traceable; we will integrate it into domain libraries like torchvision and users will be able to quantize models similar to the ones in supported domain libraries with FX Graph Mode Quantization. For arbitrary models we'll provide general guidelines, but to actually make it work, users might need to be familiar with `torch.fx`, especially on how to make a model symbolically traceable.
FX Graph Mode Quantization is an automated quantization framework in PyTorch, and currently it's a prototype feature. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although people might need to refactor the model to make the model compatible with FX Graph Mode Quantization (symbolically traceable with ``torch.fx``). Note that FX Graph Mode Quantization is not expected to work on arbitrary models since the model might not be symbolically traceable; we will integrate it into domain libraries like torchvision and users will be able to quantize models similar to the ones in supported domain libraries with FX Graph Mode Quantization. For arbitrary models we'll provide general guidelines, but to actually make it work, users might need to be familiar with ``torch.fx``, especially on how to make a model symbolically traceable.
New users of quantization are encouraged to try out FX Graph Mode Quantization first; if it does not work, users may try to follow the guideline of [using FX Graph Mode Quantization](https://pytorch.org/tutorials/prototype/fx_graph_mode_quant_guide.html) or fall back to eager mode quantization.
New users of quantization are encouraged to try out FX Graph Mode Quantization first; if it does not work, users may try to follow the guideline of `using FX Graph Mode Quantization <https://pytorch.org/tutorials/prototype/fx_graph_mode_quant_guide.html>`_ or fall back to eager mode quantization.
The following table compares the differences between Eager Mode Quantization and FX Graph Mode Quantization:
| | Eager Mode Quantization | FX Graph Mode Quantization |
|-------------------------|-------------------------------|-------------------------------------------------------------|
+-----------------+-------------------+-------------------+
| |Eager Mode |FX Graph |
| |Quantization |Mode |
| | |Quantization |
+-----------------+-------------------+-------------------+
|Release |beta |prototype |
|Status | | |
| Operator Fusion | Manual | Automatic |
| Quant/DeQuant Placement | Manual | Automatic |
| Quantizing Modules | Supported | Supported |
| Quantizing Functionals/Torch Ops | Manual | Automatic |
| Support for Customization | Limited Support | Fully Supported |
| Quantization Mode Support | Post Training Quantization: <br>Static, Dynamic, Weight Only <br><br>Quantization Aware Training: <br>Static | Post Training Quantization: <br>Static, Dynamic, Weight Only <br><br>Quantization Aware Training: <br>Static |
| Input/Output Model Type | `torch.nn.Module` | `torch.nn.Module` <br>(May need some refactors to make the model compatible with FX Graph Mode Quantization) |
### Backend/Hardware Support
| Hardware | Kernel Library | Eager Mode Quantization | FX Graph Mode Quantization | Quantization Mode Support |
|-------------|--------------------|-----------------------------------|-----------------------------|----------------------------|
| server CPU | fbgemm/onednn | Supported | All Supported | |
| mobile CPU | qnnpack/xnnpack | | | |
| server GPU | TensorRT (early prototype) | Not supported (requires a graph) | Supported | Static Quantization |
+-----------------+-------------------+-------------------+
|Operator |Manual |Automatic |
|Fusion | | |
+-----------------+-------------------+-------------------+
|Quant/DeQuant |Manual |Automatic |
|Placement | | |
+-----------------+-------------------+-------------------+
|Quantizing |Supported |Supported |
|Modules | | |
+-----------------+-------------------+-------------------+
|Quantizing |Manual |Automatic |
|Functionals/Torch| | |
|Ops | | |
+-----------------+-------------------+-------------------+
|Support for |Limited Support |Fully |
|Customization | |Supported |
+-----------------+-------------------+-------------------+
|Quantization Mode|Post Training |Post Training |
|Support |Quantization: |Quantization: |
| |Static, Dynamic, |Static, Dynamic, |
| |Weight Only |Weight Only |
| | | |
| |Quantization Aware |Quantization Aware |
| |Training: |Training: |
| |Static |Static |
+-----------------+-------------------+-------------------+
|Input/Output |``torch.nn.Module``|``torch.nn.Module``|
|Model Type | |(May need some |
| | |refactors to make |
| | |the model |
| | |compatible with FX |
| | |Graph Mode |
| | |Quantization) |
+-----------------+-------------------+-------------------+
Backend/Hardware Support
^^^^^^^^^^^^^^^^^^^^^^^^^^^
+-----------------+---------------+------------+------------+------------+
|Hardware |Kernel Library |Eager Mode |FX Graph |Quantization|
| | |Quantization|Mode |Mode Support|
| | | |Quantization| |
+-----------------+---------------+------------+------------+------------+
|server CPU       |fbgemm/onednn  |Supported   |All         |            |
|                 |               |            |Supported   |            |
+-----------------+---------------+            |            |            |
|mobile CPU       |qnnpack/xnnpack|            |            |            |
|                 |               |            |            |            |
+-----------------+---------------+------------+------------+------------+
|server GPU       |TensorRT (early|Not         |Supported   |Static      |
|                 |prototype)     |supported   |            |Quantization|
|                 |               |(requires a |            |            |
|                 |               |graph)      |            |            |
+-----------------+---------------+------------+------------+------------+
Today, PyTorch supports the following backends for running quantized operators efficiently:
* x86 CPUs with AVX2 support or higher (without AVX2 some operations have inefficient implementations), via `x86` optimized by [fbgemm](https://github.com/pytorch/FBGEMM) and [onednn](https://github.com/oneapi-src/oneDNN) (see the details at [RFC](https://github.com/pytorch/pytorch/issues/83888))
* ARM CPUs (typically found in mobile/embedded devices), via [qnnpack](https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/native/quantized/cpu/qnnpack)
* (early prototype) support for NVidia GPU via [TensorRT](https://developer.nvidia.com/tensorrt) through `fx2trt` (to be open sourced)
* x86 CPUs with AVX2 support or higher (without AVX2 some operations have inefficient implementations), via `x86` optimized by `fbgemm <https://github.com/pytorch/FBGEMM>`_ and `onednn <https://github.com/oneapi-src/oneDNN>`_ (see the details at `RFC <https://github.com/pytorch/pytorch/issues/83888>`_)
* ARM CPUs (typically found in mobile/embedded devices), via `qnnpack <https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/native/quantized/cpu/qnnpack>`_
* (early prototype) support for NVidia GPU via `TensorRT <https://developer.nvidia.com/tensorrt>`_ through `fx2trt` (to be open sourced)
#### Note for native CPU backends
Note for native CPU backends
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We expose both `x86` and `qnnpack` with the same native PyTorch quantized operators, so we need an additional flag to distinguish between them. The corresponding implementation of `x86` and `qnnpack` is chosen automatically based on the PyTorch build mode, though users have the option to override this by setting `torch.backends.quantized.engine` to `x86` or `qnnpack`.
When preparing a quantized model, it is necessary to ensure that qconfig
@ -736,9 +838,8 @@ during the quantization passes. The qengine controls whether `x86` or `qnnpack`
specific packing function is used when packing weights for
linear and convolution functions and modules. For example:
Default settings for x86::

    # set the qconfig for PTQ
    # Note: the old 'fbgemm' is still available but 'x86' is the recommended default on x86 CPUs
    qconfig = torch.ao.quantization.get_default_qconfig('x86')
    # or, set the qconfig for QAT
    qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')
    # set the qengine to control weight packing
    torch.backends.quantized.engine = 'x86'
Default settings for qnnpack::

    # set the qconfig for PTQ
    qconfig = torch.ao.quantization.get_default_qconfig('qnnpack')
    # or, set the qconfig for QAT
    qconfig = torch.ao.quantization.get_default_qat_qconfig('qnnpack')
    # set the qengine to control weight packing
    torch.backends.quantized.engine = 'qnnpack'
Operator Support
^^^^^^^^^^^^^^^^^^^^
Operator coverage varies between dynamic and static quantization and is captured in the table below.
Note that for FX Graph Mode Quantization, the corresponding functionals are also supported.
+---------------------------+-------------------+--------------------+
|                           |Static             | Dynamic            |
|                           |Quantization       | Quantization       |
+---------------------------+-------------------+--------------------+
|nn.Linear                  | Y                 | Y                  |
+---------------------------+-------------------+--------------------+
|nn.Conv1d/2d/3d            | Y                 | N                  |
+---------------------------+-------------------+--------------------+
|nn.LSTM                    | N                 | Y                  |
+---------------------------+-------------------+--------------------+
|nn.GRU                     | N                 | Y                  |
+---------------------------+-------------------+--------------------+
|nn.RNNCell                 | N                 | Y                  |
+---------------------------+-------------------+--------------------+
|nn.GRUCell                 | N                 | Y                  |
+---------------------------+-------------------+--------------------+
|nn.LSTMCell                | N                 | Y                  |
+---------------------------+-------------------+--------------------+
|nn.EmbeddingBag            | Y (activations    | Y                  |
|                           | are in fp32)      |                    |
+---------------------------+-------------------+--------------------+
|nn.Embedding               | Y                 | Y                  |
+---------------------------+-------------------+--------------------+
|nn.MultiheadAttention      |Not Supported      | Not supported      |
+---------------------------+-------------------+--------------------+
|Activations                |Broadly supported  | Un-changed,        |
|                           |                   | computations       |
|                           |                   | stay in fp32       |
+---------------------------+-------------------+--------------------+
Note: this will be updated with some information generated from native backend_config_dict soon.
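For example, a minimal sketch (the toy module, shapes, and names below are our own illustration, not part of this document) of applying dynamic quantization to ops from the table that support it::

    import torch

    class Tiny(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = torch.nn.Linear(16, 16)   # supports dynamic quantization
            self.lstm = torch.nn.LSTM(16, 16)   # supports dynamic quantization

        def forward(self, x):
            out, _ = self.lstm(self.fc(x))
            return out

    # weights of nn.Linear/nn.LSTM are quantized to int8; activations stay in fp32
    m = Tiny().eval()
    mq = torch.ao.quantization.quantize_dynamic(
        m, {torch.nn.Linear, torch.nn.LSTM}, dtype=torch.qint8)
    out = mq(torch.rand(2, 3, 16))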
Quantization API Reference
---------------------------
The :doc:`Quantization API Reference <quantization-support>` contains documentation
of quantization APIs, such as quantization passes, quantized tensor operations,
and supported quantized modules and functions.
.. toctree::
:hidden:
quantization-support
Quantization Backend Configuration
----------------------------------
The :doc:`Quantization Backend Configuration <quantization-backend-configuration>` contains documentation
on how to configure the quantization workflows for various backends.
.. toctree::
:hidden:
quantization-backend-configuration
Quantization Accuracy Debugging
-------------------------------
The :doc:`Quantization Accuracy Debugging <quantization-accuracy-debugging>` contains documentation
on how to debug quantization accuracy.
.. toctree::
:hidden:
quantization-accuracy-debugging
Quantization Customizations
---------------------------
While default implementations of observers to select the scale factor and bias
based on observed tensor data are provided, developers can provide their own
quantization functions. Quantization can be applied selectively to different
parts of the model or configured differently for different parts of the model.

We also provide support for per channel quantization for **conv1d()**, **conv2d()**,
**conv3d()** and **linear()**.
Quantization workflows work by adding (e.g. adding observers as
``.observer`` submodule) or replacing (e.g. converting ``nn.Conv2d`` to
``nn.quantized.Conv2d``) submodules in the model's module hierarchy. It
means that the model stays a regular ``nn.Module``-based instance throughout the
process and thus can work with the rest of PyTorch APIs.
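For illustration, a minimal sketch of this workflow in Eager Mode post training static quantization (the toy module and the ``'x86'`` qconfig below are our own choices, not requirements)::

    import torch

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = torch.ao.quantization.QuantStub()
            self.conv = torch.nn.Conv2d(1, 1, 1)
            self.dequant = torch.ao.quantization.DeQuantStub()

        def forward(self, x):
            return self.dequant(self.conv(self.quant(x)))

    m = M().eval()
    m.qconfig = torch.ao.quantization.get_default_qconfig('x86')
    prepared = torch.ao.quantization.prepare(m)   # observers attached as submodules
    prepared(torch.rand(1, 1, 4, 4))              # calibration
    converted = torch.ao.quantization.convert(prepared)
    # conv has been swapped for its quantized counterpart, but the model is
    # still a regular nn.Module
    print(type(converted.conv))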
Quantization Custom Module API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Both Eager mode and FX graph mode quantization APIs provide a hook for the user
to specify a module quantized in a custom way, with user defined logic for
observation and quantization. The user needs to specify:

1. The Python type of the source fp32 module (existing in the model).
2. The Python type of the observed module (provided by user). This module needs
   to define a `from_float` function which defines how the observed module is
   created from the original fp32 module.
3. The Python type of the quantized module (provided by user). This module needs
   to define a `from_observed` function which defines how the quantized module is
   created from the observed module.
4. A configuration describing (1), (2), (3) above, passed to the quantization APIs.
The framework will then do the following:
1. during the `prepare` module swaps, it will convert every module of type
   specified in (1) to the type specified in (2), using the `from_float` function of
   the class in (2).
2. during the `convert` module swaps, it will convert every module of type
   specified in (2) to the type specified in (3), using the `from_observed` function
   of the class in (3).

Currently, there is a requirement that the observed custom module will have a single
Tensor output, and an observer will be added by the framework (not by the user)
on that output. The observer will be stored under the `activation_post_process` key
as an attribute of the custom module instance. Relaxing these restrictions may
be done at a future time.
Custom API Example::

    import torch
    import torch.ao.nn.quantized as nnq
    from torch.ao.quantization import QConfigMapping

    # ... (definitions of the custom float/observed/quantized modules and the
    # prepare step are omitted here) ...

    # calibration (not shown)
    mq = torch.ao.quantization.quantize_fx.convert_fx(
        mp, convert_custom_config=convert_custom_config_dict)
Best Practices
--------------
1. If you are using the ``x86`` backend, we need to use 7 bits instead of 8 bits. Make sure you reduce the range for ``quant_min`` and ``quant_max``, e.g.
   if ``dtype`` is ``torch.quint8``, make sure to set a custom ``quant_min`` to be ``0`` and ``quant_max`` to be ``127`` (``255`` / ``2``);
   if ``dtype`` is ``torch.qint8``, make sure to set a custom ``quant_min`` to be ``-64`` (``-128`` / ``2``) and ``quant_max`` to be ``63`` (``127`` / ``2``). We already set this correctly if
   you call the ``torch.ao.quantization.get_default_qconfig(backend)`` or ``torch.ao.quantization.get_default_qat_qconfig(backend)`` function to get the default ``qconfig`` for the
   ``x86`` or ``qnnpack`` backend. A minimal sketch of such a custom ``qconfig`` is shown after this list.
2. If the ``onednn`` backend is selected, 8 bits for activation will be used in the default qconfig mapping ``torch.ao.quantization.get_default_qconfig_mapping('onednn')``
   and default qconfig ``torch.ao.quantization.get_default_qconfig('onednn')``. It is recommended to be used on CPUs with Vector Neural Network Instruction (VNNI)
   support. Otherwise, set ``reduce_range`` to True on the activation's observer to get better accuracy on CPUs without VNNI support.
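For item 1 above, a minimal sketch of such a reduced-range ``qconfig`` (the observer choices below are illustrative assumptions, not the only valid ones) could look like::

    import torch
    from torch.ao.quantization import MinMaxObserver, PerChannelMinMaxObserver, QConfig

    # activations: quint8 limited to the 7-bit range [0, 127]
    act = MinMaxObserver.with_args(
        dtype=torch.quint8, qscheme=torch.per_tensor_affine,
        quant_min=0, quant_max=127)
    # weights: qint8 limited to the 7-bit range [-64, 63]
    wt = PerChannelMinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_channel_symmetric,
        quant_min=-64, quant_max=63)
    custom_qconfig = QConfig(activation=act, weight=wt)

    # the default qconfig for 'x86' already applies these reduced ranges
    default_qconfig = torch.ao.quantization.get_default_qconfig('x86')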
Frequently Asked Questions
--------------------------
1. How can I do quantized inference on GPU?

   We don't have official GPU support yet, but this is an area of active development; you can find more information
   `here <https://github.com/pytorch/pytorch/issues/87395>`_.
2. Where can I get ONNX support for my quantized model?

   If you get errors exporting the model (using APIs under ``torch.onnx``), you may open an issue in the PyTorch repository. Prefix the issue title with ``[ONNX]`` and tag the issue as ``module: onnx``.

   If you encounter issues with ONNX Runtime, open an issue at `GitHub - microsoft/onnxruntime <https://github.com/microsoft/onnxruntime/issues/>`_.
3. How can I use quantization with LSTMs?

   LSTM is supported through our custom module API in both Eager Mode and FX Graph Mode quantization. Examples can be found at:

   * Eager Mode: `pytorch/test_quantized_op.py TestQuantizedOps.test_custom_module_lstm <https://github.com/pytorch/pytorch/blob/9b88dcf248e717ca6c3f8c5e11f600825547a561/test/quantization/core/test_quantized_op.py#L2782>`_
   * FX Graph Mode: `pytorch/test_quantize_fx.py TestQuantizeFx.test_static_lstm <https://github.com/pytorch/pytorch/blob/9b88dcf248e717ca6c3f8c5e11f600825547a561/test/quantization/fx/test_quantize_fx.py#L4116>`_
Common Errors
---------------------------------------
Passing a non-quantized Tensor into a quantized kernel
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you see an error similar to::

    RuntimeError: Could not run 'quantized::some_operator' with arguments from the 'CPU' backend...
This means that you are trying to pass a non-quantized Tensor to a quantized
kernel. A common workaround is to use ``torch.ao.quantization.QuantStub`` to
quantize the tensor. This needs to be done manually in Eager mode quantization.
An e2e example::

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = torch.ao.quantization.QuantStub()
            self.conv = torch.nn.Conv2d(1, 1, 1)

        def forward(self, x):
            # during the convert step, this will be replaced with a
            # `quantize_per_tensor` call
            x = self.quant(x)
            x = self.conv(x)
            return x
Passing a quantized Tensor into a non-quantized kernel
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you see an error similar to::

    RuntimeError: Could not run 'aten::thnn_conv2d_forward' with arguments from the 'QuantizedCPU' backend.
This means that you are trying to pass a quantized Tensor to a non-quantized
kernel. A common workaround is to use ``torch.ao.quantization.DeQuantStub`` to
dequantize the tensor. This needs to be done manually in Eager mode quantization.
An e2e example::

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = torch.nn.Conv2d(1, 1, 1)
            self.dequant = torch.ao.quantization.DeQuantStub()
            # this module will not be quantized (see `qconfig = None` below)
            self.conv2 = torch.nn.Conv2d(1, 1, 1)

        # ... (forward, which calls `self.dequant(x)` before `self.conv2`, omitted) ...

    m = M()
    m.qconfig = some_qconfig
    # turn off quantization for conv2
    m.conv2.qconfig = None
Saving and Loading Quantized models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When calling ``torch.load`` on a quantized model, if you see an error like::

    AttributeError: 'LinearPackedParams' object has no attribute '_modules'
This is because directly saving and loading a quantized model using ``torch.save`` and ``torch.load``
is not supported. To save/load quantized models, the following ways can be used:
1. Saving/Loading the quantized model state_dict

An example::

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            # ... (module definitions omitted) ...

    # ... (quantizing the original model, saving its state_dict to a buffer `b`,
    # and preparing a second copy `prepared` are omitted here) ...
    quantized = convert_fx(prepared)
    b.seek(0)
    quantized.load_state_dict(torch.load(b))
2. Saving/Loading scripted quantized models using ``torch.jit.save`` and ``torch.jit.load``

An example::

    # Note: using the same model M from previous example
    m = M().eval()
    prepare_orig = prepare_fx(m, {'' : default_qconfig})
    # calibration (not shown), then convert and script the quantized model
    quantized_orig = convert_fx(prepare_orig)
    scripted = torch.jit.script(quantized_orig)

    # save/load using torch.jit.save / torch.jit.load
    b = io.BytesIO()
    torch.jit.save(scripted, b)
    b.seek(0)
    scripted_quantized = torch.jit.load(b)
Symbolic Trace Error when using FX Graph Mode Quantization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Symbolic traceability is a requirement for `(Prototype - maintenance mode) FX Graph Mode Quantization`_, so if you pass a PyTorch Model that is not symbolically traceable to `torch.ao.quantization.prepare_fx` or `torch.ao.quantization.prepare_qat_fx`, you might see an error like the following::

    torch.fx.proxy.TraceError: symbolically traced variables cannot be used as inputs to control flow

Please take a look at `Limitations of Symbolic Tracing <https://pytorch.org/docs/2.0/fx.html#limitations-of-symbolic-tracing>`_ and use the `User Guide on Using FX Graph Mode Quantization <https://pytorch.org/tutorials/prototype/fx_graph_mode_quant_guide.html>`_ to work around the problem.
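As a rough sketch (the modules below and the dictionary-style ``prepare_custom_config`` are our own illustrative assumptions, assuming a PyTorch version whose ``prepare_fx`` takes ``example_inputs``), one way to keep a non-traceable submodule out of tracing is::

    import torch
    from torch.ao.quantization import get_default_qconfig_mapping
    from torch.ao.quantization.quantize_fx import prepare_fx

    class NonTraceable(torch.nn.Module):
        def forward(self, x):
            # data-dependent control flow breaks symbolic tracing
            return x.relu() if x.sum() > 0 else x

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(4, 4)
            self.nt = NonTraceable()

        def forward(self, x):
            return self.nt(self.linear(x))

    # skip tracing (and quantizing) the problematic submodule
    prepared = prepare_fx(
        M().eval(),
        get_default_qconfig_mapping('x86'),
        example_inputs=(torch.rand(1, 4),),
        prepare_custom_config={"non_traceable_module_class": [NonTraceable]})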
.. torch.ao is missing documentation. Since part of it is mentioned here, adding them here for now.
.. They are here for tracking purposes until they are more permanently fixed.
.. py:module:: torch.ao
.. py:module:: torch.quantization.quantize_jit
.. py:module:: torch.quantization.stubs
.. py:module:: torch.quantization.utils
View File
@ -1,10 +1,7 @@
torch.random
===================================

.. currentmodule:: torch.random

.. automodule:: torch.random
   :members: