Revert "[Docs] Convert to markdown to fix 155032 (#155520)"

This reverts commit cd66ff8030.

Reverted https://github.com/pytorch/pytorch/pull/155520 on behalf of https://github.com/atalman due to breaks multiple test_quantization.py::TestQuantizationDocs::test_quantization_ ([comment](https://github.com/pytorch/pytorch/pull/155520#issuecomment-2981996091))
This commit is contained in:
PyTorch MergeBot 2025-06-17 22:22:50 +00:00
parent 54998c2daa
commit fa4f07b5b8
5 changed files with 495 additions and 503 deletions

View File

@ -1,4 +1,5 @@
# Quantization Accuracy Debugging
Quantization Accuracy Debugging
-------------------------------
This document provides high level strategies for improving quantization
accuracy. If a quantized model has error compared to the original model,
@ -10,9 +11,11 @@ we can categorize the error into:
portion of input data has large error
3. **implementation error** - quantized kernel is not matching reference implementation
## Data insensitive error
Data insensitive error
~~~~~~~~~~~~~~~~~~~~~~
### General tips
General tips
^^^^^^^^^^^^
1. For PTQ, ensure that the data you are calibrating with is representative
of your dataset. For example, for a classification problem a general
@ -38,7 +41,8 @@ we can categorize the error into:
4. If you are using PTQ, consider using QAT to recover some of the accuracy loss
from quantization.
### Int8 quantization tips
Int8 quantization tips
^^^^^^^^^^^^^^^^^^^^^^
1. If you are using per-tensor weight quantization, consider using per-channel
weight quantization.
@ -48,7 +52,8 @@ we can categorize the error into:
If this variation is high, the layer may be suitable for dynamic quantization
but not static quantization.
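To make the per-channel tip above concrete, here is a minimal sketch of the eager mode setting (the toy model and the specific qconfig constants are illustrative, not part of this guide):
```python
import torch
from torch.ao.quantization.qconfig import default_per_channel_qconfig, default_qconfig

# Hedged sketch: move an eager mode model from the default per-tensor weight
# qconfig to a per-channel weight qconfig before the usual prepare/convert flow.
model_fp32 = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()

model_fp32.qconfig = default_qconfig              # per-tensor weight quantization
model_fp32.qconfig = default_per_channel_qconfig  # per-channel weight quantization
```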
## Data sensitive error
Data sensitive error
~~~~~~~~~~~~~~~~~~~~
If you are using static quantization and a small portion of your input data is
resulting in high quantization error, you can try:
@ -60,7 +65,8 @@ resulting in high quantization error, you can try:
the observer settings to choose a better scale and zero_point.
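For the observer adjustment mentioned above, one possible sketch (the observer choices here are an assumption for illustration, not a recommendation from this guide) is to swap in a `HistogramObserver` for activations, which is less sensitive to a few outlier inputs when picking `scale` and `zero_point`:
```python
import torch
from torch.ao.quantization import QConfig
from torch.ao.quantization.observer import HistogramObserver, PerChannelMinMaxObserver

# Hedged sketch: a qconfig whose activation observer searches for scale/zero_point
# values that minimize quantization error instead of using the raw min/max.
qconfig = QConfig(
    activation=HistogramObserver.with_args(
        dtype=torch.quint8, qscheme=torch.per_tensor_affine
    ),
    weight=PerChannelMinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_channel_symmetric
    ),
)
```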
## Implementation error
Implementation error
~~~~~~~~~~~~~~~~~~~~
If you are using PyTorch quantization with your own backend
you may see differences between the reference implementation of an
@ -74,23 +80,19 @@ operation (such as ``dequant -> op_fp32 -> quant``) and the quantized implementa
2. the kernel on the target hardware has an accuracy issue. In this case, reach
out to the kernel developer.
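A hedged sketch of that comparison (the op, scale, and zero_point below are illustrative assumptions): run the quantized kernel and the `dequant -> op_fp32 -> quant` reference on the same quantized input and look at the difference.
```python
import torch
import torch.ao.nn.quantized.functional as qF

# Hedged sketch: compare a quantized kernel against its fp32 reference path.
scale, zero_point = 0.1, 0
xq = torch.quantize_per_tensor(torch.randn(16), scale, zero_point, torch.quint8)

out_kernel = qF.hardswish(xq, scale=scale, zero_point=zero_point)
out_reference = torch.quantize_per_tensor(
    torch.nn.functional.hardswish(xq.dequantize()), scale, zero_point, torch.quint8
)
print((out_kernel.dequantize() - out_reference.dequantize()).abs().max())
```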
## Numerical Debugging Tooling (prototype)
Numerical Debugging Tooling (prototype)
---------------------------------------
```{eval-rst}
.. toctree::
:hidden:
torch.ao.ns._numeric_suite
torch.ao.ns._numeric_suite_fx
```
```{warning}
.. warning ::
Numerical debugging tooling is early prototype and subject to change.
```
```{eval-rst}
* :ref:`torch_ao_ns_numeric_suite`
Eager mode numeric suite
* :ref:`torch_ao_ns_numeric_suite_fx`
FX numeric suite
```
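As a rough usage sketch (assuming a quantized model has already been produced, here via dynamic quantization of a toy model), the Eager mode numeric suite can compare float and quantized weights layer by layer:
```python
import torch
from torch.ao.ns._numeric_suite import compare_weights
from torch.ao.quantization import quantize_dynamic

# Hedged sketch: dynamically quantize a tiny model, then compare weights.
float_model = torch.nn.Sequential(torch.nn.Linear(4, 4))
quantized_model = quantize_dynamic(float_model, {torch.nn.Linear}, dtype=torch.qint8)

wt_compare = compare_weights(float_model.state_dict(), quantized_model.state_dict())
for key, value in wt_compare.items():
    # each entry holds the float weight and its quantized counterpart
    print(key, value["float"].shape, value["quantized"].shape)
```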

View File

@ -1,4 +1,5 @@
# Quantization Backend Configuration
Quantization Backend Configuration
----------------------------------
FX Graph Mode Quantization allows the user to configure various
quantization behaviors of an op in order to match the expectation
@ -7,13 +8,13 @@ of their backend.
In the future, this document will contain a detailed spec of
these configurations.
## Default values for native configurations
Default values for native configurations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Below is the output of the configuration for quantization of ops
in x86 and qnnpack (PyTorch's default quantized backends).
Results:
```{eval-rst}
.. literalinclude:: scripts/quantization_backend_configs/default_backend_config.txt
```
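The same information can also be inspected programmatically; a hedged sketch (attribute names follow the current `BackendConfig` API and may change, since this is a prototype area):
```python
from torch.ao.quantization.backend_config import get_native_backend_config

# Hedged sketch: walk the native backend configuration and print each
# supported pattern.
backend_config = get_native_backend_config()
for pattern_config in backend_config.configs:
    print(pattern_config.pattern)
```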

View File

@ -1,16 +1,16 @@
# Quantization API Reference
Quantization API Reference
-------------------------------
## torch.ao.quantization
torch.ao.quantization
~~~~~~~~~~~~~~~~~~~~~
This module contains Eager mode quantization APIs.
```{eval-rst}
.. currentmodule:: torch.ao.quantization
```
### Top level APIs
Top level APIs
^^^^^^^^^^^^^^
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -22,11 +22,10 @@ This module contains Eager mode quantization APIs.
prepare
prepare_qat
convert
```
### Preparing model for quantization
Preparing model for quantization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -37,11 +36,10 @@ This module contains Eager mode quantization APIs.
DeQuantStub
QuantWrapper
add_quant_dequant
```
### Utility functions
Utility functions
^^^^^^^^^^^^^^^^^
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -50,17 +48,15 @@ This module contains Eager mode quantization APIs.
swap_module
propagate_qconfig_
default_eval_fn
```
## torch.ao.quantization.quantize_fx
torch.ao.quantization.quantize_fx
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This module contains FX graph mode quantization APIs (prototype).
```{eval-rst}
.. currentmodule:: torch.ao.quantization.quantize_fx
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -70,17 +66,14 @@ This module contains FX graph mode quantization APIs (prototype).
prepare_qat_fx
convert_fx
fuse_fx
```
## torch.ao.quantization.qconfig_mapping
torch.ao.quantization.qconfig_mapping
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This module contains QConfigMapping for configuring FX graph mode quantization.
```{eval-rst}
.. currentmodule:: torch.ao.quantization.qconfig_mapping
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -89,19 +82,16 @@ This module contains QConfigMapping for configuring FX graph mode quantization.
QConfigMapping
get_default_qconfig_mapping
get_default_qat_qconfig_mapping
```
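A brief usage sketch (the qconfig choice and the module name `"classifier"` are illustrative assumptions):
```python
import torch
from torch.ao.quantization import get_default_qconfig
from torch.ao.quantization.qconfig_mapping import QConfigMapping

# Hedged sketch: quantize everything with the default x86 qconfig, override the
# qconfig for Linear modules, and skip quantization for one submodule by name.
qconfig_mapping = (
    QConfigMapping()
    .set_global(get_default_qconfig("x86"))
    .set_object_type(torch.nn.Linear, get_default_qconfig("x86"))
    .set_module_name("classifier", None)
)
```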
## torch.ao.quantization.backend_config
torch.ao.quantization.backend_config
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This module contains BackendConfig, a config object that defines how quantization is supported
in a backend. Currently only used by FX Graph Mode Quantization, but we may extend Eager Mode
Quantization to work with this as well.
```{eval-rst}
.. currentmodule:: torch.ao.quantization.backend_config
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -112,17 +102,15 @@ Quantization to work with this as well.
DTypeConfig
DTypeWithConstraints
ObservationType
```
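A hedged sketch of how these pieces fit together, loosely following the BackendConfig tutorial (the backend name and dtype choices are illustrative assumptions):
```python
import torch
from torch.ao.quantization.backend_config import (
    BackendConfig,
    BackendPatternConfig,
    DTypeConfig,
    ObservationType,
)

# Hedged sketch: describe int8 support for Linear in a custom backend.
weighted_int8_dtype_config = DTypeConfig(
    input_dtype=torch.quint8,
    output_dtype=torch.quint8,
    weight_dtype=torch.qint8,
    bias_dtype=torch.float,
)
linear_config = (
    BackendPatternConfig(torch.nn.Linear)
    .set_observation_type(ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT)
    .add_dtype_config(weighted_int8_dtype_config)
    .set_root_module(torch.nn.Linear)
    .set_reference_quantized_module(torch.ao.nn.quantized.reference.Linear)
)
backend_config = BackendConfig("my_backend").set_backend_pattern_config(linear_config)
```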
## torch.ao.quantization.fx.custom_config
torch.ao.quantization.fx.custom_config
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This module contains a few CustomConfig classes that are used in both eager mode and FX graph mode quantization
```{eval-rst}
.. currentmodule:: torch.ao.quantization.fx.custom_config
```
```{eval-rst}
.. currentmodule:: torch.ao.quantization.fx.custom_config
.. autosummary::
:toctree: generated
:nosignatures:
@ -132,62 +120,48 @@ This module contains a few CustomConfig classes that's used in both eager mode a
PrepareCustomConfig
ConvertCustomConfig
StandaloneModuleConfigEntry
```
## torch.ao.quantization.quantizer
torch.ao.quantization.quantizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```{eval-rst}
.. automodule:: torch.ao.quantization.quantizer
```
## torch.ao.quantization.pt2e (quantization in pytorch 2.0 export implementation)
torch.ao.quantization.pt2e (quantization in pytorch 2.0 export implementation)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```{eval-rst}
.. automodule:: torch.ao.quantization.pt2e
.. automodule:: torch.ao.quantization.pt2e.representation
```
## torch.ao.quantization.pt2e.export_utils
torch.ao.quantization.pt2e.export_utils
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```{eval-rst}
.. currentmodule:: torch.ao.quantization.pt2e.export_utils
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
:template: classtemplate.rst
model_is_exported
```
```{eval-rst}
.. currentmodule:: torch.ao.quantization
```
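A short usage sketch (the toy module is an assumption for illustration):
```python
import torch
from torch.ao.quantization.pt2e.export_utils import model_is_exported

# Hedged sketch: model_is_exported distinguishes a module captured by
# torch.export from a regular eager nn.Module.
class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

eager_model = M()
exported_model = torch.export.export_for_training(M(), (torch.randn(2),)).module()
print(model_is_exported(eager_model))     # False
print(model_is_exported(exported_model))  # True
```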
## torch.ao.quantization.pt2e.lowering
torch.ao.quantization.pt2e.lowering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```{eval-rst}
.. currentmodule:: torch.ao.quantization.pt2e.lowering
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
:template: classtemplate.rst
lower_pt2e_quantized_to_x86
```
```{eval-rst}
.. currentmodule:: torch.ao.quantization
```
## PT2 Export (pt2e) Numeric Debugger
```{eval-rst}
PT2 Export (pt2e) Numeric Debugger
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: generated
:nosignatures:
@ -199,17 +173,14 @@ This module contains a few CustomConfig classes that's used in both eager mode a
prepare_for_propagation_comparison
extract_results_from_loggers
compare_results
```
## torch (quantization related functions)
torch (quantization related functions)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This describes the quantization related functions of the `torch` namespace.
```{eval-rst}
.. currentmodule:: torch
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -218,18 +189,15 @@ This describes the quantization related functions of the `torch` namespace.
quantize_per_tensor
quantize_per_channel
dequantize
```
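A quick hedged example of these functions (the scale and zero_point values are chosen only for illustration):
```python
import torch

# Hedged sketch: quantize a float tensor per tensor, inspect the underlying
# integers, and map it back to float.
x = torch.tensor([-1.0, 0.0, 1.0, 2.0])
xq = torch.quantize_per_tensor(x, scale=0.1, zero_point=10, dtype=torch.quint8)
print(xq.int_repr())         # stored uint8 values
print(torch.dequantize(xq))  # back to float32 (with quantization error)
```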
## torch.Tensor (quantization related methods)
torch.Tensor (quantization related methods)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Quantized Tensors support a limited subset of data manipulation methods of the
regular full-precision tensor.
```{eval-rst}
.. currentmodule:: torch.Tensor
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -262,18 +230,16 @@ regular full-precision tensor.
resize_
sort
topk
```
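A hedged sketch of a few quantization-related methods on a quantized tensor (values are illustrative):
```python
import torch

# Hedged sketch: inspect a quantized tensor with Tensor methods.
xq = torch.quantize_per_tensor(
    torch.randn(2, 3), scale=0.05, zero_point=0, dtype=torch.qint8
)
print(xq.is_quantized)    # True
print(xq.q_scale())       # 0.05
print(xq.q_zero_point())  # 0
print(xq.dequantize())    # regular float32 tensor
```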
## torch.ao.quantization.observer
torch.ao.quantization.observer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This module contains observers which are used to collect statistics about
the values observed during calibration (PTQ) or training (QAT).
```{eval-rst}
.. currentmodule:: torch.ao.quantization.observer
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -310,18 +276,15 @@ the values observed during calibration (PTQ) or training (QAT).
TorchAODType
ZeroPointDomain
get_block_size
```
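As a small usage sketch (data and observer arguments are illustrative):
```python
import torch
from torch.ao.quantization.observer import MinMaxObserver

# Hedged sketch: feed a few batches through an observer, then ask it for
# quantization parameters.
obs = MinMaxObserver(dtype=torch.quint8, qscheme=torch.per_tensor_affine)
for _ in range(3):
    obs(torch.randn(4, 4))  # observers record statistics in forward()
scale, zero_point = obs.calculate_qparams()
print(scale, zero_point)
```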
## torch.ao.quantization.fake_quantize
torch.ao.quantization.fake_quantize
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This module implements modules which are used to perform fake quantization
during QAT.
```{eval-rst}
.. currentmodule:: torch.ao.quantization.fake_quantize
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -342,18 +305,15 @@ during QAT.
enable_fake_quant
disable_observer
enable_observer
```
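A minimal sketch of fake quantization on its own (the arguments mirror common defaults and are shown only for illustration):
```python
import torch
from torch.ao.quantization.fake_quantize import FakeQuantize
from torch.ao.quantization.observer import MovingAverageMinMaxObserver

# Hedged sketch: a FakeQuantize module simulating uint8 activation quantization;
# the output stays in fp32 but carries the rounding error of quantization.
fq = FakeQuantize(
    observer=MovingAverageMinMaxObserver,
    quant_min=0,
    quant_max=255,
    dtype=torch.quint8,
    qscheme=torch.per_tensor_affine,
)
print(fq(torch.randn(4)))
```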
## torch.ao.quantization.qconfig
torch.ao.quantization.qconfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This module defines `QConfig` objects which are used
to configure quantization settings for individual ops.
```{eval-rst}
.. currentmodule:: torch.ao.quantization.qconfig
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -372,23 +332,17 @@ to configure quantization settings for individual ops.
default_weight_only_qconfig
default_activation_only_qconfig
default_qat_qconfig_v2
```
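A hedged sketch of working with qconfigs directly (the backend string and observer/fake-quantize choices are assumptions for illustration):
```python
import torch
from torch.ao.quantization.fake_quantize import (
    default_fake_quant,
    default_per_channel_weight_fake_quant,
)
from torch.ao.quantization.qconfig import QConfig, get_default_qat_qconfig, get_default_qconfig

# Hedged sketch: premade qconfigs for PTQ and QAT, plus one assembled by hand.
ptq_qconfig = get_default_qconfig("x86")      # observers only (post training)
qat_qconfig = get_default_qat_qconfig("x86")  # fake-quant modules (QAT)
custom_qat_qconfig = QConfig(
    activation=default_fake_quant,
    weight=default_per_channel_weight_fake_quant,
)
```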
## torch.ao.nn.intrinsic
```{eval-rst}
torch.ao.nn.intrinsic
~~~~~~~~~~~~~~~~~~~~~
.. automodule:: torch.ao.nn.intrinsic
.. automodule:: torch.ao.nn.intrinsic.modules
```
This module implements the combined (fused) modules conv + relu which can
then be quantized.
```{eval-rst}
.. currentmodule:: torch.ao.nn.intrinsic
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -406,23 +360,18 @@ then be quantized.
ConvBnReLU3d
BNReLU2d
BNReLU3d
```
## torch.ao.nn.intrinsic.qat
```{eval-rst}
torch.ao.nn.intrinsic.qat
~~~~~~~~~~~~~~~~~~~~~~~~~
.. automodule:: torch.ao.nn.intrinsic.qat
.. automodule:: torch.ao.nn.intrinsic.qat.modules
```
This module implements the versions of those fused operations needed for
quantization aware training.
```{eval-rst}
.. currentmodule:: torch.ao.nn.intrinsic.qat
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -439,24 +388,19 @@ quantization aware training.
ConvReLU3d
update_bn_stats
freeze_bn_stats
```
## torch.ao.nn.intrinsic.quantized
```{eval-rst}
torch.ao.nn.intrinsic.quantized
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. automodule:: torch.ao.nn.intrinsic.quantized
.. automodule:: torch.ao.nn.intrinsic.quantized.modules
```
This module implements the quantized implementations of fused operations
like conv + relu. No BatchNorm variants as it's usually folded into convolution
for inference.
```{eval-rst}
.. currentmodule:: torch.ao.nn.intrinsic.quantized
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -468,47 +412,35 @@ for inference.
ConvReLU2d
ConvReLU3d
LinearReLU
```
## torch.ao.nn.intrinsic.quantized.dynamic
```{eval-rst}
torch.ao.nn.intrinsic.quantized.dynamic
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. automodule:: torch.ao.nn.intrinsic.quantized.dynamic
.. automodule:: torch.ao.nn.intrinsic.quantized.dynamic.modules
```
This module implements the quantized dynamic implementations of fused operations
like linear + relu.
```{eval-rst}
.. currentmodule:: torch.ao.nn.intrinsic.quantized.dynamic
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
:template: classtemplate.rst
LinearReLU
```
## torch.ao.nn.qat
```{eval-rst}
torch.ao.nn.qat
~~~~~~~~~~~~~~~~~~~~~~
.. automodule:: torch.ao.nn.qat
.. automodule:: torch.ao.nn.qat.modules
```
This module implements versions of the key nn modules **Conv2d()** and
**Linear()** which run in FP32 but with rounding applied to simulate the
effect of INT8 quantization.
```{eval-rst}
.. currentmodule:: torch.ao.nn.qat
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -517,48 +449,36 @@ effect of INT8 quantization.
Conv2d
Conv3d
Linear
```
## torch.ao.nn.qat.dynamic
```{eval-rst}
torch.ao.nn.qat.dynamic
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. automodule:: torch.ao.nn.qat.dynamic
.. automodule:: torch.ao.nn.qat.dynamic.modules
```
This module implements versions of the key nn modules such as **Linear()**
which run in FP32 but with rounding applied to simulate the effect of INT8
quantization and will be dynamically quantized during inference.
```{eval-rst}
.. currentmodule:: torch.ao.nn.qat.dynamic
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
:template: classtemplate.rst
Linear
```
## torch.ao.nn.quantized
```{eval-rst}
torch.ao.nn.quantized
~~~~~~~~~~~~~~~~~~~~~~
.. automodule:: torch.ao.nn.quantized
:noindex:
.. automodule:: torch.ao.nn.quantized.modules
```
This module implements the quantized versions of the nn layers such as
`~torch.nn.Conv2d` and `torch.nn.ReLU`.
```{eval-rst}
.. currentmodule:: torch.ao.nn.quantized
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -588,25 +508,17 @@ This module implements the quantized versions of the nn layers such as
InstanceNorm1d
InstanceNorm2d
InstanceNorm3d
```
## torch.ao.nn.quantized.functional
```{eval-rst}
torch.ao.nn.quantized.functional
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. automodule:: torch.ao.nn.quantized.functional
```
```{eval-rst}
This module implements the quantized versions of the functional layers such as
`~torch.nn.functional.conv2d` and `torch.nn.functional.relu`. Note:
:math:`~torch.nn.functional.relu` supports quantized inputs.
```
:meth:`~torch.nn.functional.relu` supports quantized inputs.
```{eval-rst}
.. currentmodule:: torch.ao.nn.quantized.functional
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -634,19 +546,16 @@ This module implements the quantized versions of the functional layers such as
upsample
upsample_bilinear
upsample_nearest
```
## torch.ao.nn.quantizable
torch.ao.nn.quantizable
~~~~~~~~~~~~~~~~~~~~~~~
This module implements the quantizable versions of some of the nn layers.
These modules can be used in conjunction with the custom module mechanism,
by providing the ``custom_module_config`` argument to both prepare and convert.
```{eval-rst}
.. currentmodule:: torch.ao.nn.quantizable
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -654,24 +563,19 @@ by providing the ``custom_module_config`` argument to both prepare and convert.
LSTM
MultiheadAttention
```
## torch.ao.nn.quantized.dynamic
```{eval-rst}
torch.ao.nn.quantized.dynamic
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. automodule:: torch.ao.nn.quantized.dynamic
.. automodule:: torch.ao.nn.quantized.dynamic.modules
```
Dynamically quantized {class}`~torch.nn.Linear`, {class}`~torch.nn.LSTM`,
{class}`~torch.nn.LSTMCell`, {class}`~torch.nn.GRUCell`, and
{class}`~torch.nn.RNNCell`.
Dynamically quantized :class:`~torch.nn.Linear`, :class:`~torch.nn.LSTM`,
:class:`~torch.nn.LSTMCell`, :class:`~torch.nn.GRUCell`, and
:class:`~torch.nn.RNNCell`.
```{eval-rst}
.. currentmodule:: torch.ao.nn.quantized.dynamic
```
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
@ -683,9 +587,9 @@ Dynamically quantized {class}`~torch.nn.Linear`, {class}`~torch.nn.LSTM`,
RNNCell
LSTMCell
GRUCell
```
## Quantized dtypes and quantization schemes
Quantized dtypes and quantization schemes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Note that operator implementations currently only
support per channel quantization for weights of the **conv** and **linear**
@ -693,7 +597,6 @@ operators. Furthermore, the input data is
mapped linearly to the quantized data and vice versa
as follows:
```{eval-rst}
.. math::
\begin{aligned}
@ -702,15 +605,11 @@ as follows:
\text{Dequantization:}&\\
&x_\text{out} = (Q_\text{input}-z)*s
\end{aligned}
```
```{eval-rst}
where :math:`\text{clamp}(.)` is the same as :func:`~torch.clamp` while the
scale :math:`s` and zero point :math:`z` are then computed
as described in :class:`~torch.ao.quantization.observer.MinMaxObserver`, specifically:
```
```{eval-rst}
.. math::
\begin{aligned}
@ -726,7 +625,6 @@ as described in :class:`~torch.ao.quantization.observer.MinMaxObserver`, specifi
\left( Q_\text{max} - Q_\text{min} \right ) \\
&z = Q_\text{min} - \text{round}(x_\text{min} / s)
\end{aligned}
```
where :math:`[x_\text{min}, x_\text{max}]` denotes the range of the input data while
:math:`Q_\text{min}` and :math:`Q_\text{max}` are respectively the minimum and maximum values of the quantized dtype.
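As a quick hedged numeric check of these formulas for `torch.quint8` (so Q_min = 0 and Q_max = 255), mirroring what `MinMaxObserver` does for affine quantization (the input values are arbitrary):
```python
import torch

# Hedged sketch: compute scale and zero_point by hand, then verify with
# torch.quantize_per_tensor.
x = torch.tensor([-1.0, 0.0, 0.5, 2.0])
x_min = min(x.min().item(), 0.0)  # the observer always includes zero in the range
x_max = max(x.max().item(), 0.0)
scale = (x_max - x_min) / (255 - 0)
zero_point = 0 - round(x_min / scale)
xq = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8)
print(scale, zero_point)  # ~0.0118, 85
print(xq.int_repr())      # integers in [0, 255]
```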
@ -737,7 +635,6 @@ the range of the input data or symmetric quantization is being used.
Additional data types and quantization schemes can be implemented through
the `custom operator mechanism <https://pytorch.org/tutorials/advanced/torch_script_custom_ops.html>`_.
```{eval-rst}
* :attr:`torch.qscheme` — Type to describe the quantization scheme of a tensor.
Supported types:
@ -751,9 +648,8 @@ the `custom operator mechanism <https://pytorch.org/tutorials/advanced/torch_scr
* :attr:`torch.quint8` — 8-bit unsigned integer
* :attr:`torch.qint8` — 8-bit signed integer
* :attr:`torch.qint32` — 32-bit signed integer
```
```{eval-rst}
.. These modules are missing docs. Adding them here only for tracking
.. automodule:: torch.ao.nn.quantizable.modules
:noindex:
@ -782,4 +678,3 @@ the `custom operator mechanism <https://pytorch.org/tutorials/advanced/torch_scr
.. automodule:: torch.nn.quantized.dynamic.modules
.. automodule:: torch.quantization
.. automodule:: torch.nn.intrinsic.modules
```

View File

@ -1,17 +1,16 @@
(quantization-doc)=
.. _quantization-doc:
# Quantization
Quantization
============
```{eval-rst}
.. automodule:: torch.ao.quantization
.. automodule:: torch.ao.quantization.fx
```
```{warning}
.. warning ::
Quantization is in beta and subject to change.
```
## Introduction to Quantization
Introduction to Quantization
----------------------------
Quantization refers to techniques for performing computations and storing
tensors at lower bitwidths than floating point precision. A quantized model
@ -39,13 +38,14 @@ that perform all or part of the computation in lower precision. Higher-level
APIs are provided that incorporate typical workflows of converting FP32 model
to lower precision with minimal accuracy loss.
## Quantization API Summary
Quantization API Summary
-----------------------------
PyTorch provides three different modes of quantization: Eager Mode Quantization, FX Graph Mode Quantization (maintenance) and PyTorch 2 Export Quantization.
Eager Mode Quantization is a beta feature. Users need to do fusion manually and specify where quantization and dequantization happen; it also supports only modules and not functionals.
FX Graph Mode Quantization is an automated quantization workflow in PyTorch, and currently it's a prototype feature; it is in maintenance mode since we have PyTorch 2 Export Quantization. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although people might need to refactor the model to make the model compatible with FX Graph Mode Quantization (symbolically traceable with `torch.fx`). Note that FX Graph Mode Quantization is not expected to work on arbitrary models since the model might not be symbolically traceable; we will integrate it into domain libraries like torchvision and users will be able to quantize models similar to the ones in supported domain libraries with FX Graph Mode Quantization. For arbitrary models we'll provide general guidelines, but to actually make it work, users might need to be familiar with `torch.fx`, especially on how to make a model symbolically traceable.
FX Graph Mode Quantization is an automated quantization workflow in PyTorch, and currently it's a prototype feature; it is in maintenance mode since we have PyTorch 2 Export Quantization. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although people might need to refactor the model to make the model compatible with FX Graph Mode Quantization (symbolically traceable with ``torch.fx``). Note that FX Graph Mode Quantization is not expected to work on arbitrary models since the model might not be symbolically traceable; we will integrate it into domain libraries like torchvision and users will be able to quantize models similar to the ones in supported domain libraries with FX Graph Mode Quantization. For arbitrary models we'll provide general guidelines, but to actually make it work, users might need to be familiar with ``torch.fx``, especially on how to make a model symbolically traceable.
PyTorch 2 Export Quantization is the new full graph mode quantization workflow, released as prototype feature in PyTorch 2.1. With PyTorch 2, we are moving to a better solution for full program capture (torch.export) since it can capture a higher percentage (88.8% on 14K models) of models compared to torch.fx.symbolic_trace (72.7% on 14K models), the program capture solution used by FX Graph Mode Quantization. torch.export still has limitations around some python constructs and requires user involvement to support dynamism in the exported model, but overall it is an improvement over the previous program capture solution. PyTorch 2 Export Quantization is built for models captured by torch.export, with flexibility and productivity of both modeling users and backend developers in mind. The main features are
(1). Programmable API for configuring how a model is quantized that can scale to many more use cases
@ -56,17 +56,49 @@ New users of quantization are encouraged to try out PyTorch 2 Export Quantizatio
The following table compares the differences between Eager Mode Quantization, FX Graph Mode Quantization and PyTorch 2 Export Quantization:
| | Eager Mode Quantization | FX Graph Mode Quantization | PyTorch 2 Export Quantization |
|-------------------------|-------------------------|-----------------------------|-------------------------------|
+-----------------+-------------------+-------------------+-------------------------+
| |Eager Mode |FX Graph |PyTorch 2 Export |
| |Quantization |Mode |Quantization |
| | |Quantization | |
+-----------------+-------------------+-------------------+-------------------------+
|Release |beta |prototype |prototype |
|Status | |(maintenance) | |
| Operator Fusion | Manual | Automatic | Automatic |
| Quant/DeQuant Placement | Manual | Automatic | Automatic |
| Quantizing Modules | Supported | Supported | Supported |
| Quantizing Functionals/Torch Ops | Manual | Automatic | Supported |
| Support for Customization | Limited Support | Fully Supported | Fully Supported |
| Quantization Mode Support | Post Training Quantization: Static, Dynamic, Weight Only<br><br>Quantization Aware Training: Static | Post Training Quantization: Static, Dynamic, Weight Only<br><br>Quantization Aware Training: Static | Defined by Backend Specific Quantizer |
| Input/Output Model Type | `torch.nn.Module` | `torch.nn.Module` (May need some refactors to make the model compatible with FX Graph Mode Quantization) | `torch.fx.GraphModule` (captured by `torch.export`) |
+-----------------+-------------------+-------------------+-------------------------+
|Operator |Manual |Automatic |Automatic |
|Fusion | | | |
+-----------------+-------------------+-------------------+-------------------------+
|Quant/DeQuant |Manual |Automatic |Automatic |
|Placement | | | |
+-----------------+-------------------+-------------------+-------------------------+
|Quantizing |Supported |Supported |Supported |
|Modules | | | |
+-----------------+-------------------+-------------------+-------------------------+
|Quantizing |Manual |Automatic |Supported |
|Functionals/Torch| | | |
|Ops | | | |
+-----------------+-------------------+-------------------+-------------------------+
|Support for |Limited Support |Fully |Fully Supported |
|Customization | |Supported | |
+-----------------+-------------------+-------------------+-------------------------+
|Quantization Mode|Post Training |Post Training |Defined by |
|Support |Quantization: |Quantization: |Backend Specific |
| |Static, Dynamic, |Static, Dynamic, |Quantizer |
| |Weight Only |Weight Only | |
| | | | |
| |Quantization Aware |Quantization Aware | |
| |Training: |Training: | |
| |Static |Static | |
+-----------------+-------------------+-------------------+-------------------------+
|Input/Output |``torch.nn.Module``|``torch.nn.Module``|``torch.fx.GraphModule`` |
|Model Type | |(May need some |(captured by |
| | |refactors to make |``torch.export`` |
| | |the model | |
| | |compatible with FX | |
| | |Graph Mode | |
| | |Quantization) | |
+-----------------+-------------------+-------------------+-------------------------+
There are three types of quantization supported:
@ -77,31 +109,48 @@ There are three types of quantization supported:
3. static quantization aware training (weights quantized, activations quantized,
quantization numerics modeled during training)
Please see our [Introduction to Quantization on PyTorch](https://pytorch.org/blog/introduction-to-quantization-on-pytorch/)
blog post for a more comprehensive overview of the tradeoffs between these quantization
Please see our `Introduction to Quantization on PyTorch
<https://pytorch.org/blog/introduction-to-quantization-on-pytorch/>`_ blog post
for a more comprehensive overview of the tradeoffs between these quantization
types.
Operator coverage varies between dynamic and static quantization and is captured in the table below.
| | Static Quantization | Dynamic Quantization |
|---------------------------|-----------------------------|----------------------------|
| nn.Linear | Y | Y |
| nn.Conv1d/2d/3d | Y | N |
| nn.LSTM | Y (through custom modules) | Y |
| nn.GRU | N | Y |
| nn.RNNCell | N | Y |
| nn.GRUCell | N | Y |
| nn.LSTMCell | N | Y |
| nn.EmbeddingBag | Y (activations are in fp32) | Y |
+---------------------------+-------------------+--------------------+
| |Static | Dynamic |
| |Quantization | Quantization |
+---------------------------+-------------------+--------------------+
| | nn.Linear | | Y | | Y |
| | nn.Conv1d/2d/3d | | Y | | N |
+---------------------------+-------------------+--------------------+
| | nn.LSTM | | Y (through | | Y |
| | | | custom modules) | | |
| | nn.GRU | | N | | Y |
+---------------------------+-------------------+--------------------+
| | nn.RNNCell | | N | | Y |
| | nn.GRUCell | | N | | Y |
| | nn.LSTMCell | | N | | Y |
+---------------------------+-------------------+--------------------+
|nn.EmbeddingBag | Y (activations | |
| | are in fp32) | Y |
+---------------------------+-------------------+--------------------+
|nn.Embedding | Y | Y |
| nn.MultiheadAttention | Y (through custom modules) | Not supported |
| Activations | Broadly supported | Un-changed, computations stay in fp32 |
+---------------------------+-------------------+--------------------+
| nn.MultiheadAttention | Y (through | Not supported |
| | custom modules) | |
+---------------------------+-------------------+--------------------+
| Activations | Broadly supported | Un-changed, |
| | | computations |
| | | stay in fp32 |
+---------------------------+-------------------+--------------------+
### Eager Mode Quantization
For a general introduction to the quantization flow, including different types of quantization, please take a look at {ref}`general-quantization-flow`.
Eager Mode Quantization
^^^^^^^^^^^^^^^^^^^^^^^
For a general introduction to the quantization flow, including different types of quantization, please take a look at `General Quantization Flow`_.
#### Post Training Dynamic Quantization
Post Training Dynamic Quantization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is the simplest to apply form of quantization where the weights are
quantized ahead of time but the activations are dynamically quantized
@ -110,9 +159,8 @@ is dominated by loading weights from memory rather than computing the matrix
multiplications. This is true for LSTM and Transformer type models with
small batch size.
Diagram:
Diagram::
```python
# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
@ -124,11 +172,9 @@ Diagram:
previous_layer_fp32 -- linear_int8_w_fp32_inp -- activation_fp32 -- next_layer_fp32
/
linear_weight_int8
```
PTDQ API Example:
PTDQ API Example::
```python
import torch
# define a floating point model
@ -152,11 +198,12 @@ PTDQ API Example:
# run the model
input_fp32 = torch.randn(4, 4, 4, 4)
res = model_int8(input_fp32)
```
To learn more about dynamic quantization please see our [dynamic quantization tutorial](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html).
To learn more about dynamic quantization please see our `dynamic quantization tutorial
<https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html>`_.
#### Post Training Static Quantization
Post Training Static Quantization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Post Training Static Quantization (PTQ static) quantizes the weights and activations of the model. It
fuses activations into preceding layers where possible. It requires
@ -165,11 +212,10 @@ parameters for activations. Post Training Static Quantization is typically used
both memory bandwidth and compute savings are important with CNNs being a
typical use case.
We may need to modify the model before applying post training static quantization. Please see {ref}`model-preparation-for-eager-mode-static-quantization`.
We may need to modify the model before applying post training static quantization. Please see `Model Preparation for Eager Mode Static Quantization`_.
Diagram:
Diagram::
```python
# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
@ -181,11 +227,9 @@ Diagram:
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
/
linear_weight_int8
```
PTSQ API Example:
PTSQ API Example::
```python
import torch
# define a floating point model where some layers could be statically quantized
@ -248,11 +292,12 @@ PTSQ API Example:
# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)
```
To learn more about static quantization, please see the [static quantization tutorial](https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html).
To learn more about static quantization, please see the `static quantization tutorial
<https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html>`_.
#### Quantization Aware Training for Static Quantization
Quantization Aware Training for Static Quantization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Quantization Aware Training (QAT) models the effects of quantization during training
allowing for higher accuracy compared to other quantization methods. We can do QAT for static, dynamic or weight only quantization. During
@ -263,11 +308,10 @@ activations are quantized, and activations are fused into the preceding layer
where possible. It is commonly used with CNNs and yields a higher accuracy
compared to static quantization.
We may need to modify the model before applying post training static quantization. Please see {ref}`model-preparation-for-eager-mode-static-quantization`.
We may need to modify the model before applying post training static quantization. Please see `Model Preparation for Eager Mode Static Quantization`_.
Diagram:
Diagram::
```python
# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
@ -284,11 +328,9 @@ Diagram:
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
/
linear_weight_int8
```
QAT API Example:
QAT API Example::
```python
import torch
# define a floating point model where some layers could benefit from QAT
@ -349,14 +391,13 @@ QAT API Example:
# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)
```
To learn more about quantization aware training, please see the
[QAT tutorial](https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html).
To learn more about quantization aware training, please see the `QAT
tutorial
<https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html>`_.
(model-preparation-for-eager-mode-static-quantization)=
#### Model Preparation for Eager Mode Static Quantization
Model Preparation for Eager Mode Static Quantization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It is currently necessary to make some modifications to the model definition
prior to Eager mode quantization. This is because quantization currently works on a module
@ -364,39 +405,38 @@ by module basis. Specifically, for all quantization techniques, the user needs t
1. Convert any operations that require output requantization (and thus have
additional parameters) from functionals to module form (for example,
using `torch.nn.ReLU` instead of `torch.nn.functional.relu`).
using ``torch.nn.ReLU`` instead of ``torch.nn.functional.relu``).
2. Specify which parts of the model need to be quantized either by assigning
`.qconfig` attributes on submodules or by specifying `qconfig_mapping`.
For example, setting `model.conv1.qconfig = None` means that the
`model.conv` layer will not be quantized, and setting
`model.linear1.qconfig = custom_qconfig` means that the quantization
settings for `model.linear1` will be using `custom_qconfig` instead
``.qconfig`` attributes on submodules or by specifying ``qconfig_mapping``.
For example, setting ``model.conv1.qconfig = None`` means that the
``model.conv`` layer will not be quantized, and setting
``model.linear1.qconfig = custom_qconfig`` means that the quantization
settings for ``model.linear1`` will be using ``custom_qconfig`` instead
of the global qconfig.
For static quantization techniques which quantize activations, the user needs
to do the following in addition:
1. Specify where activations are quantized and de-quantized. This is done using
{class}`~torch.ao.quantization.QuantStub` and
{class}`~torch.ao.quantization.DeQuantStub` modules.
2. Use {class}`~torch.ao.nn.quantized.FloatFunctional` to wrap tensor operations
:class:`~torch.ao.quantization.QuantStub` and
:class:`~torch.ao.quantization.DeQuantStub` modules.
2. Use :class:`~torch.ao.nn.quantized.FloatFunctional` to wrap tensor operations
that require special handling for quantization into modules. Examples
are operations like `add` and `cat` which require special handling to
are operations like ``add`` and ``cat`` which require special handling to
determine output quantization parameters.
3. Fuse modules: combine operations/modules into a single module to obtain
higher accuracy and performance. This is done using the
{func}`~torch.ao.quantization.fuse_modules.fuse_modules` API, which takes in lists of modules
:func:`~torch.ao.quantization.fuse_modules.fuse_modules` API, which takes in lists of modules
to be fused. We currently support the following fusions:
`[Conv, Relu]`, `[Conv, BatchNorm]`, `[Conv, BatchNorm, Relu]`, `[Linear, Relu]`
[Conv, Relu], [Conv, BatchNorm], [Conv, BatchNorm, Relu], [Linear, Relu]
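A hedged sketch that ties these preparation steps together (the toy module and fusion list are illustrative, not taken from this page):
```python
import torch
import torch.ao.nn.quantized as nnq
import torch.ao.quantization as tq

# Hedged sketch: stubs mark where activations are quantized/dequantized,
# FloatFunctional wraps the add, and fuse_modules fuses conv + bn + relu.
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.conv = torch.nn.Conv2d(3, 3, 1)
        self.bn = torch.nn.BatchNorm2d(3)
        self.relu = torch.nn.ReLU()
        self.skip_add = nnq.FloatFunctional()
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        y = self.relu(self.bn(self.conv(x)))
        # add needs output quantization parameters, hence FloatFunctional
        y = self.skip_add.add(x, y)
        return self.dequant(y)

m = M().eval()
m_fused = tq.fuse_modules(m, [["conv", "bn", "relu"]])
```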
(prototype-maintenance-mode-fx-graph-mode-quantization)=
### (Prototype - maintenance mode) FX Graph Mode Quantization
(Prototype - maintenance mode) FX Graph Mode Quantization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are multiple quantization types in post training quantization (weight only, dynamic and static) and the configuration is done through `qconfig_mapping` (an argument of the `prepare_fx` function).
FXPTQ API Example:
FXPTQ API Example::
```python
import torch
from torch.ao.quantization import (
get_default_qconfig_mapping,
@ -455,19 +495,17 @@ FXPTQ API Example:
#
model_to_quantize = copy.deepcopy(model_fp)
model_fused = quantize_fx.fuse_fx(model_to_quantize)
```
Please follow the tutorials below to learn more about FX Graph Mode Quantization:
- [User Guide on Using FX Graph Mode Quantization](https://pytorch.org/tutorials/prototype/fx_graph_mode_quant_guide.html)
- [FX Graph Mode Post Training Static Quantization](https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_static.html)
- [FX Graph Mode Post Training Dynamic Quantization](https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_dynamic.html)
- `User Guide on Using FX Graph Mode Quantization <https://pytorch.org/tutorials/prototype/fx_graph_mode_quant_guide.html>`_
- `FX Graph Mode Post Training Static Quantization <https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_static.html>`_
- `FX Graph Mode Post Training Dynamic Quantization <https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_dynamic.html>`_
### (Prototype) PyTorch 2 Export Quantization
(Prototype) PyTorch 2 Export Quantization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
API Example::
API Example:
```python
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e
from torch.export import export_for_training
@ -514,31 +552,29 @@ API Example:
# Step 3. lowering
# lower to target backend
```
Please follow these tutorials to get started on PyTorch 2 Export Quantization:
Modeling Users:
- [PyTorch 2 Export Post Training Quantization](https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html)
- [PyTorch 2 Export Post Training Quantization with X86 Backend through Inductor](https://pytorch.org/tutorials/prototype/pt2e_quant_x86_inductor.html)
- [PyTorch 2 Export Quantization Aware Training](https://pytorch.org/tutorials/prototype/pt2e_quant_qat.html)
- `PyTorch 2 Export Post Training Quantization <https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html>`_
- `PyTorch 2 Export Post Training Quantization with X86 Backend through Inductor <https://pytorch.org/tutorials/prototype/pt2e_quant_x86_inductor.html>`_
- `PyTorch 2 Export Quantization Aware Training <https://pytorch.org/tutorials/prototype/pt2e_quant_qat.html>`_
Backend Developers (please check out all Modeling Users docs as well):
- [How to Write a Quantizer for PyTorch 2 Export Quantization](https://pytorch.org/tutorials/prototype/pt2e_quantizer.html)
- `How to Write a Quantizer for PyTorch 2 Export Quantization <https://pytorch.org/tutorials/prototype/pt2e_quantizer.html>`_
## Quantization Stack
Quantization is the process to convert a floating point model to a quantized model. So at high level the quantization stack can be split into two parts:
1. The building blocks or abstractions for a quantized model
2. The building blocks or abstractions for the quantization flow that converts a floating point model to a quantized model
### Quantized Model
#### Quantized Tensor
Quantization Stack
------------------------
Quantization is the process to convert a floating point model to a quantized model. So at high level the quantization stack can be split into two parts: 1). The building blocks or abstractions for a quantized model 2). The building blocks or abstractions for the quantization flow that converts a floating point model to a quantized model
Quantized Model
^^^^^^^^^^^^^^^^^^^^^^^
Quantized Tensor
~~~~~~~~~~~~~~~~~
In order to do quantization in PyTorch, we need to be able to represent
quantized data in Tensors. A Quantized Tensor allows for storing
quantized data (represented as int8/uint8/int32) along with quantization
@ -550,10 +586,8 @@ PyTorch supports both per tensor and per channel symmetric and asymmetric quanti
The mapping is performed by converting the floating point tensors using
```{eval-rst}
.. image:: math-quantizer-equation.png
:width: 40%
```
Note that we ensure that zero in floating point is represented with no error
after quantization, thereby ensuring that operations like padding do not cause
@ -587,8 +621,8 @@ Here are a few key attributes for quantized Tensor:
* per_channel_zero_points (list of int)
* axis (int)
#### Quantize and Dequantize
Quantize and Dequantize
~~~~~~~~~~~~~~~~~~~~~~~
The input and output of a model are floating point Tensors, but activations in the quantized model are quantized, so we need operators to convert between floating point and quantized Tensors.
* Quantize (float -> quantized)
@ -603,19 +637,20 @@ The input and output of a model are floating point Tensors, but activations in t
* quantized_tensor.dequantize() - calling dequantize on a torch.float16 Tensor will convert the Tensor back to torch.float
* torch.dequantize(x)
#### Quantized Operators/Modules
Quantized Operators/Modules
~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Quantized Operators are operators that take quantized Tensors as inputs and output a quantized Tensor.
* Quantized Modules are PyTorch Modules that perform quantized operations. They are typically defined for weighted operations like linear and conv.
#### Quantized Engine
Quantized Engine
~~~~~~~~~~~~~~~~~~~~
When a quantized model is executed, the qengine (torch.backends.quantized.engine) specifies which backend is to be used for execution. It is important to ensure that the qengine is compatible with the quantized model in terms of value range of quantized activation and weights.
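A quick hedged check of the qengine on a given build:
```python
import torch

# Hedged sketch: the currently selected qengine and the engines this build supports.
print(torch.backends.quantized.engine)
print(torch.backends.quantized.supported_engines)
```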
### Quantization Flow
#### Observer and FakeQuantize
Quantization Flow
^^^^^^^^^^^^^^^^^^^^^^^
Observer and FakeQuantize
~~~~~~~~~~~~~~~~~~~~~~~~~~
* Observers are PyTorch Modules used to:
* collect tensor statistics like min value and max value of the Tensor passing through the observer
@ -625,8 +660,8 @@ When a quantized model is executed, the qengine (torch.backends.quantized.engine
* simulate quantization (performing quantize/dequantize) for a Tensor in the network
* it can calculate quantization parameters based on the collected statistics from observer, or it can learn the quantization parameters as well
#### QConfig
QConfig
~~~~~~~~~~~
* QConfig is a namedtuple of Observer or FakeQuantize Module classes that are configurable with qscheme, dtype, etc.; it is used to configure how an operator should be observed
* Quantization configuration for an operator/module
@ -638,10 +673,8 @@ When a quantized model is executed, the qengine (torch.backends.quantized.engine
* Currently supports configuration for activation and weight
* We insert input/weight/output observer based on the qconfig that is configured for a given operator or module
(general-quantization-flow)=
#### General Quantization Flow
General Quantization Flow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In general, the flow is the following
* prepare
@ -671,62 +704,131 @@ And in terms of how we quantize the operators, we can have:
We can mix different ways of quantizing operators in the same quantization flow. For example, we can have post training quantization that has both statically and dynamically quantized operators.
## Quantization Support Matrix
Quantization Support Matrix
--------------------------------------
Quantization Mode Support
^^^^^^^^^^^^^^^^^^^^^^^^^^^
+-----------------------------+------------------------------------------------------+----------------+----------------+------------+-----------------+
| |Quantization |Dataset | Works Best For | Accuracy | Notes |
| |Mode |Requirement | | | |
+-----------------------------+---------------------------------+--------------------+----------------+----------------+------------+-----------------+
|Post Training Quantization |Dynamic/Weight Only Quantization |activation |None |LSTM, MLP, |good |Easy to use, |
| | |dynamically | |Embedding, | |close to static |
| | |quantized (fp16, | |Transformer | |quantization when|
| | |int8) or not | | | |performance is |
| | |quantized, weight | | | |compute or memory|
| | |statically quantized| | | |bound due to |
|                             |                                 |(fp16, int8, int4)  |                |                |            |weights          |
| +---------------------------------+--------------------+----------------+----------------+------------+-----------------+
| |Static Quantization |activation and |calibration |CNN |good |Provides best |
| | |weights statically |dataset | | |perf, may have |
| | |quantized (int8) | | | |big impact on |
| | | | | | |accuracy, good |
| | | | | | |for hardwares |
| | | | | | |that only support|
| | | | | | |int8 computation |
+-----------------------------+---------------------------------+--------------------+----------------+----------------+------------+-----------------+
| |Dynamic Quantization |activation and |fine-tuning |MLP, Embedding |best |Limited support |
| | |weight are fake |dataset | | |for now |
| | |quantized | | | | |
| +---------------------------------+--------------------+----------------+----------------+------------+-----------------+
| |Static Quantization |activation and |fine-tuning |CNN, MLP, |best |Typically used |
| | |weight are fake |dataset |Embedding | |when static |
| | |quantized | | | |quantization |
| | | | | | |leads to bad |
| | | | | | |accuracy, and |
| | | | | | |used to close the|
| | | | | | |accuracy gap |
|Quantization Aware Training | | | | | | |
+-----------------------------+---------------------------------+--------------------+----------------+----------------+------------+-----------------+
### Quantization Mode Support
| | Quantization Mode | Dataset Requirement | Works Best For | Accuracy | Notes |
|--|-------------------|---------------------|----------------|----------|-------|
| Post Training Quantization | Dynamic/Weight Only Quantization<br>(activation dynamically quantized - fp16, int8 or not quantized;<br>weight statically quantized - fp16, int8, int4) | None | LSTM, MLP, Embedding, Transformer | good | Easy to use, close to static quantization when performance is compute or memory bound due to weights |
| | Static Quantization<br>(activation and weights statically quantized - int8) | Calibration dataset | CNN | good | Provides best performance, may have big impact on accuracy, good for hardware that only supports int8 computation |
| | Dynamic Quantization<br>(activation and weights are fake quantized) | Fine-tuning dataset | MLP, Embedding | best | Limited support for now |
| Quantization Aware Training | Static Quantization<br>(activation and weights are fake quantized) | Fine-tuning dataset | CNN, MLP, Embedding | best | Typically used when static quantization leads to bad accuracy, helps close the accuracy gap |
Please see our [Introduction to Quantization on Pytorch](https://pytorch.org/blog/introduction-to-quantization-on-pytorch/)
blog post for a more comprehensive overview of the tradeoffs between these quantization
Please see our `Introduction to Quantization on Pytorch
<https://pytorch.org/blog/introduction-to-quantization-on-pytorch/>`_ blog post
for a more comprehensive overview of the tradeoffs between these quantization
types.
### Quantization Flow Support
Quantization Flow Support
^^^^^^^^^^^^^^^^^^^^^^^^^^^
PyTorch provides two modes of quantization: Eager Mode Quantization and FX Graph Mode Quantization.
Eager Mode Quantization is a beta feature. Users need to do fusion manually and specify where quantization and dequantization happen; it also supports only modules and not functionals.
FX Graph Mode Quantization is an automated quantization framework in PyTorch, and currently it's a prototype feature. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although people might need to refactor the model to make the model compatible with FX Graph Mode Quantization (symbolically traceable with `torch.fx`). Note that FX Graph Mode Quantization is not expected to work on arbitrary models since the model might not be symbolically traceable; we will integrate it into domain libraries like torchvision and users will be able to quantize models similar to the ones in supported domain libraries with FX Graph Mode Quantization. For arbitrary models we'll provide general guidelines, but to actually make it work, users might need to be familiar with `torch.fx`, especially on how to make a model symbolically traceable.
FX Graph Mode Quantization is an automated quantization framework in PyTorch, and currently it's a prototype feature. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although people might need to refactor the model to make the model compatible with FX Graph Mode Quantization (symbolically traceable with ``torch.fx``). Note that FX Graph Mode Quantization is not expected to work on arbitrary models since the model might not be symbolically traceable; we will integrate it into domain libraries like torchvision and users will be able to quantize models similar to the ones in supported domain libraries with FX Graph Mode Quantization. For arbitrary models we'll provide general guidelines, but to actually make it work, users might need to be familiar with ``torch.fx``, especially on how to make a model symbolically traceable.
New users of quantization are encouraged to try out FX Graph Mode Quantization first; if it does not work, users may try to follow the guideline of [using FX Graph Mode Quantization](https://pytorch.org/tutorials/prototype/fx_graph_mode_quant_guide.html) or fall back to eager mode quantization.
New users of quantization are encouraged to try out FX Graph Mode Quantization first; if it does not work, users may try to follow the guideline of `using FX Graph Mode Quantization <https://pytorch.org/tutorials/prototype/fx_graph_mode_quant_guide.html>`_ or fall back to eager mode quantization.
The following table compares the differences between Eager Mode Quantization and FX Graph Mode Quantization:
| | Eager Mode Quantization | FX Graph Mode Quantization |
|-------------------------|-------------------------------|-------------------------------------------------------------|
+-----------------+-------------------+-------------------+
| |Eager Mode |FX Graph |
| |Quantization |Mode |
| | |Quantization |
+-----------------+-------------------+-------------------+
|Release |beta |prototype |
|Status | | |
| Operator Fusion | Manual | Automatic |
| Quant/DeQuant Placement | Manual | Automatic |
| Quantizing Modules | Supported | Supported |
| Quantizing Functionals/Torch Ops | Manual | Automatic |
| Support for Customization | Limited Support | Fully Supported |
| Quantization Mode Support | Post Training Quantization: <br>Static, Dynamic, Weight Only <br><br>Quantization Aware Training: <br>Static | Post Training Quantization: <br>Static, Dynamic, Weight Only <br><br>Quantization Aware Training: <br>Static |
| Input/Output Model Type | `torch.nn.Module` | `torch.nn.Module` <br>(May need some refactors to make the model compatible with FX Graph Mode Quantization) |
### Backend/Hardware Support
| Hardware | Kernel Library | Eager Mode Quantization | FX Graph Mode Quantization | Quantization Mode Support |
|-------------|--------------------|-----------------------------------|-----------------------------|----------------------------|
| server CPU | fbgemm/onednn | Supported | All Supported | |
| mobile CPU | qnnpack/xnnpack | | | |
| server GPU | TensorRT (early prototype) | Not supported (requires a graph) | Supported | Static Quantization |
+-----------------+-------------------+-------------------+
|Operator |Manual |Automatic |
|Fusion | | |
+-----------------+-------------------+-------------------+
|Quant/DeQuant |Manual |Automatic |
|Placement | | |
+-----------------+-------------------+-------------------+
|Quantizing |Supported |Supported |
|Modules | | |
+-----------------+-------------------+-------------------+
|Quantizing |Manual |Automatic |
|Functionals/Torch| | |
|Ops | | |
+-----------------+-------------------+-------------------+
|Support for |Limited Support |Fully |
|Customization | |Supported |
+-----------------+-------------------+-------------------+
|Quantization Mode|Post Training |Post Training |
|Support |Quantization: |Quantization: |
| |Static, Dynamic, |Static, Dynamic, |
| |Weight Only |Weight Only |
| | | |
| |Quantization Aware |Quantization Aware |
| |Training: |Training: |
| |Static |Static |
+-----------------+-------------------+-------------------+
|Input/Output |``torch.nn.Module``|``torch.nn.Module``|
|Model Type | |(May need some |
| | |refactors to make |
| | |the model |
| | |compatible with FX |
| | |Graph Mode |
| | |Quantization) |
+-----------------+-------------------+-------------------+
Backend/Hardware Support
^^^^^^^^^^^^^^^^^^^^^^^^^^^
+-----------------+---------------+------------+------------+------------+
|Hardware |Kernel Library |Eager Mode |FX Graph |Quantization|
| | |Quantization|Mode |Mode Support|
| | | |Quantization| |
+-----------------+---------------+------------+------------+------------+
|server CPU       |fbgemm/onednn  |Supported   |All         |            |
|                 |               |            |Supported   |            |
+-----------------+---------------+            |            |            |
|mobile CPU       |qnnpack/xnnpack|            |            |            |
|                 |               |            |            |            |
+-----------------+---------------+------------+------------+------------+
|server GPU       |TensorRT (early|Not         |Supported   |Static      |
|                 |prototype)     |supported   |            |Quantization|
|                 |               |(requires a |            |            |
|                 |               |graph)      |            |            |
+-----------------+---------------+------------+------------+------------+
Today, PyTorch supports the following backends for running quantized operators efficiently:
* x86 CPUs with AVX2 support or higher (without AVX2 some operations have inefficient implementations), via `x86` optimized by [fbgemm](https://github.com/pytorch/FBGEMM) and [onednn](https://github.com/oneapi-src/oneDNN) (see the details at [RFC](https://github.com/pytorch/pytorch/issues/83888))
* ARM CPUs (typically found in mobile/embedded devices), via [qnnpack](https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/native/quantized/cpu/qnnpack)
* (early prototype) support for NVidia GPU via [TensorRT](https://developer.nvidia.com/tensorrt) through `fx2trt` (to be open sourced)
* x86 CPUs with AVX2 support or higher (without AVX2 some operations have inefficient implementations), via `x86` optimized by `fbgemm <https://github.com/pytorch/FBGEMM>`_ and `onednn <https://github.com/oneapi-src/oneDNN>`_ (see the details at `RFC <https://github.com/pytorch/pytorch/issues/83888>`_)
* ARM CPUs (typically found in mobile/embedded devices), via `qnnpack <https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/native/quantized/cpu/qnnpack>`_
* (early prototype) support for NVidia GPU via `TensorRT <https://developer.nvidia.com/tensorrt>`_ through `fx2trt` (to be open sourced)
#### Note for native CPU backends
Note for native CPU backends
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We expose both `x86` and `qnnpack` with the same native PyTorch quantized operators, so we need an additional flag to distinguish between them. The corresponding implementation of `x86` and `qnnpack` is chosen automatically based on the PyTorch build mode, though users have the option to override this by setting `torch.backends.quantized.engine` to `x86` or `qnnpack`.
When preparing a quantized model, it is necessary to ensure that qconfig
@ -736,9 +838,8 @@ during the quantization passes. The qengine controls whether `x86` or `qnnpack`
specific packing function is used when packing weights for
linear and convolution functions and modules. For example:
Default settings for x86::

    # set the qconfig for PTQ
    # Note: the old 'fbgemm' is still available but 'x86' is the recommended default on x86 CPUs
    qconfig = torch.ao.quantization.get_default_qconfig('x86')
    # or, set the qconfig for QAT
    qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')
    # set the qengine to control weight packing
    torch.backends.quantized.engine = 'x86'
Default settings for qnnpack::

    # set the qconfig for PTQ
    qconfig = torch.ao.quantization.get_default_qconfig('qnnpack')
    # or, set the qconfig for QAT
    qconfig = torch.ao.quantization.get_default_qat_qconfig('qnnpack')
    # set the qengine to control weight packing
    torch.backends.quantized.engine = 'qnnpack'
Operator Support
^^^^^^^^^^^^^^^^^^^^
Operator coverage varies between dynamic and static quantization and is captured in the table below.
Note that for FX Graph Mode Quantization, the corresponding functionals are also supported.
+---------------------------+-------------------+--------------------+
|                           |Static             | Dynamic            |
|                           |Quantization       | Quantization       |
+---------------------------+-------------------+--------------------+
|nn.Linear                  | Y                 | Y                  |
+---------------------------+-------------------+--------------------+
|nn.Conv1d/2d/3d            | Y                 | N                  |
+---------------------------+-------------------+--------------------+
|nn.LSTM                    | N                 | Y                  |
+---------------------------+-------------------+--------------------+
|nn.GRU                     | N                 | Y                  |
+---------------------------+-------------------+--------------------+
|nn.RNNCell                 | N                 | Y                  |
+---------------------------+-------------------+--------------------+
|nn.GRUCell                 | N                 | Y                  |
+---------------------------+-------------------+--------------------+
|nn.LSTMCell                | N                 | Y                  |
+---------------------------+-------------------+--------------------+
|nn.EmbeddingBag            | Y (activations    | Y                  |
|                           | are in fp32)      |                    |
+---------------------------+-------------------+--------------------+
|nn.Embedding               | Y                 | Y                  |
+---------------------------+-------------------+--------------------+
|nn.MultiheadAttention      |Not Supported      | Not supported      |
+---------------------------+-------------------+--------------------+
|Activations                |Broadly supported  | Un-changed,        |
|                           |                   | computations       |
|                           |                   | stay in fp32       |
+---------------------------+-------------------+--------------------+
Note: this will be updated with some information generated from native backend_config_dict soon.
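For example, a minimal sketch (the toy module, shapes, and names below are our own illustration, not part of this document) of applying dynamic quantization to ops from the table that support it::

    import torch

    class Tiny(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = torch.nn.Linear(16, 16)   # supports dynamic quantization
            self.lstm = torch.nn.LSTM(16, 16)   # supports dynamic quantization

        def forward(self, x):
            out, _ = self.lstm(self.fc(x))
            return out

    # weights of nn.Linear/nn.LSTM are quantized to int8; activations stay in fp32
    m = Tiny().eval()
    mq = torch.ao.quantization.quantize_dynamic(
        m, {torch.nn.Linear, torch.nn.LSTM}, dtype=torch.qint8)
    out = mq(torch.rand(2, 3, 16))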
Quantization API Reference
---------------------------
The :doc:`Quantization API Reference <quantization-support>` contains documentation
of quantization APIs, such as quantization passes, quantized tensor operations,
and supported quantized modules and functions.
.. toctree::
:hidden:
quantization-support
Quantization Backend Configuration
----------------------------------
The :doc:`Quantization Backend Configuration <quantization-backend-configuration>` contains documentation
on how to configure the quantization workflows for various backends.
.. toctree::
:hidden:
quantization-backend-configuration
Quantization Accuracy Debugging
-------------------------------
The :doc:`Quantization Accuracy Debugging <quantization-accuracy-debugging>` contains documentation
on how to debug quantization accuracy.
.. toctree::
:hidden:
quantization-accuracy-debugging
Quantization Customizations
---------------------------
While default implementations of observers to select the scale factor and bias
based on observed tensor data are provided, developers can provide their own
quantization functions. Quantization can be applied selectively to different
parts of the model or configured differently for different parts of the model.

We also provide support for per channel quantization for **conv1d()**, **conv2d()**,
**conv3d()** and **linear()**.
Quantization workflows work by adding (e.g. adding observers as
``.observer`` submodule) or replacing (e.g. converting ``nn.Conv2d`` to
``nn.quantized.Conv2d``) submodules in the model's module hierarchy. It
means that the model stays a regular ``nn.Module``-based instance throughout the
process and thus can work with the rest of PyTorch APIs.
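For illustration, a minimal sketch of this workflow in Eager Mode post training static quantization (the toy module and the ``'x86'`` qconfig below are our own choices, not requirements)::

    import torch

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = torch.ao.quantization.QuantStub()
            self.conv = torch.nn.Conv2d(1, 1, 1)
            self.dequant = torch.ao.quantization.DeQuantStub()

        def forward(self, x):
            return self.dequant(self.conv(self.quant(x)))

    m = M().eval()
    m.qconfig = torch.ao.quantization.get_default_qconfig('x86')
    prepared = torch.ao.quantization.prepare(m)   # observers attached as submodules
    prepared(torch.rand(1, 1, 4, 4))              # calibration
    converted = torch.ao.quantization.convert(prepared)
    # conv has been swapped for its quantized counterpart, but the model is
    # still a regular nn.Module
    print(type(converted.conv))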
Quantization Custom Module API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Both Eager mode and FX graph mode quantization APIs provide a hook for the user
to specify a module quantized in a custom way, with user defined logic for
observation and quantization. The user needs to specify:

1. The Python type of the source fp32 module (existing in the model).
2. The Python type of the observed module (provided by user). This module needs
   to define a `from_float` function which defines how the observed module is
   created from the original fp32 module.
3. The Python type of the quantized module (provided by user). This module needs
   to define a `from_observed` function which defines how the quantized module is
   created from the observed module.
4. A configuration describing (1), (2), (3) above, passed to the quantization APIs.
The framework will then do the following:
1. during the `prepare` module swaps, it will convert every module of type
   specified in (1) to the type specified in (2), using the `from_float` function of
   the class in (2).
2. during the `convert` module swaps, it will convert every module of type
   specified in (2) to the type specified in (3), using the `from_observed` function
   of the class in (3).

Currently, there is a requirement that the observed custom module will have a single
Tensor output, and an observer will be added by the framework (not by the user)
on that output. The observer will be stored under the `activation_post_process` key
as an attribute of the custom module instance. Relaxing these restrictions may
be done at a future time.
Custom API Example::

    import torch
    import torch.ao.nn.quantized as nnq
    from torch.ao.quantization import QConfigMapping

    # ... (definitions of the custom float/observed/quantized modules and the
    # prepare step are omitted here) ...

    # calibration (not shown)
    mq = torch.ao.quantization.quantize_fx.convert_fx(
        mp, convert_custom_config=convert_custom_config_dict)
Best Practices
--------------
1. If you are using the ``x86`` backend, we need to use 7 bits instead of 8 bits. Make sure you reduce the range for ``quant_min`` and ``quant_max``, e.g.
   if ``dtype`` is ``torch.quint8``, make sure to set a custom ``quant_min`` to be ``0`` and ``quant_max`` to be ``127`` (``255`` / ``2``);
   if ``dtype`` is ``torch.qint8``, make sure to set a custom ``quant_min`` to be ``-64`` (``-128`` / ``2``) and ``quant_max`` to be ``63`` (``127`` / ``2``). We already set this correctly if
   you call the ``torch.ao.quantization.get_default_qconfig(backend)`` or ``torch.ao.quantization.get_default_qat_qconfig(backend)`` function to get the default ``qconfig`` for the
   ``x86`` or ``qnnpack`` backend. A minimal sketch of such a custom ``qconfig`` is shown after this list.
2. If the ``onednn`` backend is selected, 8 bits for activation will be used in the default qconfig mapping ``torch.ao.quantization.get_default_qconfig_mapping('onednn')``
   and default qconfig ``torch.ao.quantization.get_default_qconfig('onednn')``. It is recommended to be used on CPUs with Vector Neural Network Instruction (VNNI)
   support. Otherwise, set ``reduce_range`` to True on the activation's observer to get better accuracy on CPUs without VNNI support.
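For item 1 above, a minimal sketch of such a reduced-range ``qconfig`` (the observer choices below are illustrative assumptions, not the only valid ones) could look like::

    import torch
    from torch.ao.quantization import MinMaxObserver, PerChannelMinMaxObserver, QConfig

    # activations: quint8 limited to the 7-bit range [0, 127]
    act = MinMaxObserver.with_args(
        dtype=torch.quint8, qscheme=torch.per_tensor_affine,
        quant_min=0, quant_max=127)
    # weights: qint8 limited to the 7-bit range [-64, 63]
    wt = PerChannelMinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_channel_symmetric,
        quant_min=-64, quant_max=63)
    custom_qconfig = QConfig(activation=act, weight=wt)

    # the default qconfig for 'x86' already applies these reduced ranges
    default_qconfig = torch.ao.quantization.get_default_qconfig('x86')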
Frequently Asked Questions
--------------------------
1. How can I do quantized inference on GPU?

   We don't have official GPU support yet, but this is an area of active development; you can find more information
   `here <https://github.com/pytorch/pytorch/issues/87395>`_.
2. Where can I get ONNX support for my quantized model?

   If you get errors exporting the model (using APIs under ``torch.onnx``), you may open an issue in the PyTorch repository. Prefix the issue title with ``[ONNX]`` and tag the issue as ``module: onnx``.

   If you encounter issues with ONNX Runtime, open an issue at `GitHub - microsoft/onnxruntime <https://github.com/microsoft/onnxruntime/issues/>`_.
3. How can I use quantization with LSTMs?

   LSTM is supported through our custom module API in both Eager Mode and FX Graph Mode quantization. Examples can be found at:

   * Eager Mode: `pytorch/test_quantized_op.py TestQuantizedOps.test_custom_module_lstm <https://github.com/pytorch/pytorch/blob/9b88dcf248e717ca6c3f8c5e11f600825547a561/test/quantization/core/test_quantized_op.py#L2782>`_
   * FX Graph Mode: `pytorch/test_quantize_fx.py TestQuantizeFx.test_static_lstm <https://github.com/pytorch/pytorch/blob/9b88dcf248e717ca6c3f8c5e11f600825547a561/test/quantization/fx/test_quantize_fx.py#L4116>`_
Common Errors
---------------------------------------
Passing a non-quantized Tensor into a quantized kernel
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you see an error similar to::

    RuntimeError: Could not run 'quantized::some_operator' with arguments from the 'CPU' backend...
This means that you are trying to pass a non-quantized Tensor to a quantized
kernel. A common workaround is to use ``torch.ao.quantization.QuantStub`` to
quantize the tensor. This needs to be done manually in Eager mode quantization.
An e2e example::

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = torch.ao.quantization.QuantStub()
            self.conv = torch.nn.Conv2d(1, 1, 1)

        def forward(self, x):
            # during the convert step, this will be replaced with a
            # `quantize_per_tensor` call
            x = self.quant(x)
            x = self.conv(x)
            return x
Passing a quantized Tensor into a non-quantized kernel
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you see an error similar to::

    RuntimeError: Could not run 'aten::thnn_conv2d_forward' with arguments from the 'QuantizedCPU' backend.
This means that you are trying to pass a quantized Tensor to a non-quantized
kernel. A common workaround is to use ``torch.ao.quantization.DeQuantStub`` to
dequantize the tensor. This needs to be done manually in Eager mode quantization.
An e2e example::

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = torch.nn.Conv2d(1, 1, 1)
            self.dequant = torch.ao.quantization.DeQuantStub()
            # this module will not be quantized (see `qconfig = None` below)
            self.conv2 = torch.nn.Conv2d(1, 1, 1)

        # ... (forward, which calls `self.dequant(x)` before `self.conv2`, omitted) ...

    m = M()
    m.qconfig = some_qconfig
    # turn off quantization for conv2
    m.conv2.qconfig = None
Saving and Loading Quantized models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When calling ``torch.load`` on a quantized model, if you see an error like::

    AttributeError: 'LinearPackedParams' object has no attribute '_modules'
This is because directly saving and loading a quantized model using ``torch.save`` and ``torch.load``
is not supported. To save/load quantized models, the following ways can be used:
1. Saving/Loading the quantized model state_dict

An example::

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            # ... (module definitions omitted) ...

    # ... (quantizing the original model, saving its state_dict to a buffer `b`,
    # and preparing a second copy `prepared` are omitted here) ...
    quantized = convert_fx(prepared)
    b.seek(0)
    quantized.load_state_dict(torch.load(b))
2. Saving/Loading scripted quantized models using ``torch.jit.save`` and ``torch.jit.load``

An example::

    # Note: using the same model M from previous example
    m = M().eval()
    prepare_orig = prepare_fx(m, {'' : default_qconfig})
    # calibration (not shown), then convert and script the quantized model
    quantized_orig = convert_fx(prepare_orig)
    scripted = torch.jit.script(quantized_orig)

    # save/load using torch.jit.save / torch.jit.load
    b = io.BytesIO()
    torch.jit.save(scripted, b)
    b.seek(0)
    scripted_quantized = torch.jit.load(b)
Symbolic Trace Error when using FX Graph Mode Quantization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Symbolic traceability is a requirement for `(Prototype - maintenance mode) FX Graph Mode Quantization`_, so if you pass a PyTorch Model that is not symbolically traceable to `torch.ao.quantization.prepare_fx` or `torch.ao.quantization.prepare_qat_fx`, you might see an error like the following::

    torch.fx.proxy.TraceError: symbolically traced variables cannot be used as inputs to control flow

Please take a look at `Limitations of Symbolic Tracing <https://pytorch.org/docs/2.0/fx.html#limitations-of-symbolic-tracing>`_ and use the `User Guide on Using FX Graph Mode Quantization <https://pytorch.org/tutorials/prototype/fx_graph_mode_quant_guide.html>`_ to work around the problem.
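As a rough sketch (the modules below and the dictionary-style ``prepare_custom_config`` are our own illustrative assumptions, assuming a PyTorch version whose ``prepare_fx`` takes ``example_inputs``), one way to keep a non-traceable submodule out of tracing is::

    import torch
    from torch.ao.quantization import get_default_qconfig_mapping
    from torch.ao.quantization.quantize_fx import prepare_fx

    class NonTraceable(torch.nn.Module):
        def forward(self, x):
            # data-dependent control flow breaks symbolic tracing
            return x.relu() if x.sum() > 0 else x

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(4, 4)
            self.nt = NonTraceable()

        def forward(self, x):
            return self.nt(self.linear(x))

    # skip tracing (and quantizing) the problematic submodule
    prepared = prepare_fx(
        M().eval(),
        get_default_qconfig_mapping('x86'),
        example_inputs=(torch.rand(1, 4),),
        prepare_custom_config={"non_traceable_module_class": [NonTraceable]})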
.. torch.ao is missing documentation. Since part of it is mentioned here, adding them here for now.
.. They are here for tracking purposes until they are more permanently fixed.
.. py:module:: torch.ao
.. py:module:: torch.quantization.quantize_jit
.. py:module:: torch.quantization.stubs
.. py:module:: torch.quantization.utils
View File
@ -1,10 +1,7 @@
torch.random
===================================

.. currentmodule:: torch.random

.. automodule:: torch.random
   :members: