Beginning of the Python 3.14 bringup process.
State of things from this PR:
- Nothing too scary-looking on the Dynamo CPython side; nothing we heavily rely on seems to be missing @williamwen42
- The existing check that makes torch.compile() fail gracefully is working as expected, so all these empty functions shouldn't cause any weirdness.
- The `__module__` update changes look suspicious; we should investigate their cause and impact, in particular for our public API checks @jbschlosser
- Leaving the weakref.py thread-safety change as a follow-up to keep this PR a bit simpler. I vendored the whole struct in the meantime, FYI @ezyang
EDIT: The `__module__` change is even more cursed than I thought due to changes to the Union and Optional types, whose `__module__` field can no longer be changed. See https://github.com/python/cpython/issues/132139 for details.
For now, I'm just skipping the `__module__` setting on 3.14, which will trip the public API checks. I'll revisit once I have a final answer on the CPython issue.
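As a rough illustration of the behavior difference (this probe is not part of the PR; the target module name is arbitrary, and the exact exception on 3.14 follows the linked CPython issue):
```
# Minimal probe: try to reassign __module__ on an Optional/Union alias.
# On 3.14 this is rejected (see the CPython issue linked above); on
# earlier versions the assignment may still succeed.
import sys
from typing import Optional

alias = Optional[int]  # i.e. Union[int, None]
try:
    alias.__module__ = "torch.types"  # arbitrary module name for the probe
    print(f"__module__ reassignment allowed on {sys.version_info[:2]}")
except (AttributeError, TypeError) as exc:
    print(f"__module__ reassignment rejected on {sys.version_info[:2]}: {exc}")
```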
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158184
Approved by: https://github.com/msaroufim
Summary:
As titled; the Test Plan below exercises the deprecation warnings for `torch.ao.quantization` and `XNNPACKQuantizer`.
Test Plan:
(ao) $ PYTHONWARNINGS='default' python
Python 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from torch.ao.quantization.quantizer.xnnpack_quantizer import XNNPACKQuantizer
printing warning
*/anaconda3/envs/ao/lib/python3.10/site-packages/torch/ao/quantization/__init__.py:36: DeprecationWarning: torch.ao.quantization is deprecated. Plan is to
1. Remove eager mode quantization (torch.ao.quantization.quantize, torch.ao.quantization.quantize_dynamic), please migrate to use torchao eager mode quantize_ API instead
2. Remove fx graph mode quantization (torch.ao.quantization.quantize_fx.prepare_fx, torch.ao.quantization.quantize_fx.convert_fx, please migrate to use torchao pt2e quantization API instead (prepare_pt2e, convert_pt2e)
3. pt2e quantization has been migrated to torchao (https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e)
see https://dev-discuss.pytorch.org/t/torch-ao-quantization-migration-plan/2810 for more details
warnings.warn(
>>> a = XNNPACKQuantizer()
*/anaconda3/envs/ao/lib/python3.10/site-packages/torch/ao/quantization/quantizer/xnnpack_quantizer.py:281: DeprecationWarning: XNNPACKQuantizer is deprecated! Please use xnnpack quantizer in ExecuTorch (https://github.com/pytorch/executorch/tree/main/backends/xnnpack/quantizer) instead
warnings.warn(f"{self.__class__.__name__} is deprecated! Please use xnnpack quantizer in ExecuTorch (https://github.com/pytorch/executorch/tree/main/backends/xnnpack/quantizer) instead", DeprecationWarning)
>>>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153892
Approved by: https://github.com/Skylion007
MOTIVATION
We recently verified some quantization tests on devices other than CPU (e.g. CUDA, and Intel Gaudi devices identified as 'hpu'). We noticed a device mismatch error because eps is a tensor created on the CPU while the other tensors (min_val_neg, max_val_pos, scale, zero_point) are moved to the target _device_.
CHANGES
Move eps to the _device_ of the other tensors.
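A minimal sketch of the pattern (a hypothetical observer-style helper, not the exact PyTorch code):
```
# Hypothetical sketch of the fix: create eps on the same device as the
# stats tensors so the scale/zero_point computation doesn't mix devices.
import torch

def _calculate_qparams(min_val_neg: torch.Tensor, max_val_pos: torch.Tensor):
    device = min_val_neg.device
    # Previously eps defaulted to a CPU tensor; build it on `device` instead.
    eps = torch.tensor(torch.finfo(torch.float32).eps, device=device)
    scale = (max_val_pos - min_val_neg) / 255.0
    scale = torch.max(scale, eps)  # every operand now lives on `device`
    zero_point = torch.zeros_like(scale, dtype=torch.int64)
    return scale, zero_point
```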
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135204
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Summary: Previously, `move_exported_model_to_train/eval` only
switched the CPU version of the batch norm op. This commit adds
support for the CUDA version of the op as well. Note that this fix
is temporary; we won't need to differentiate between the two
cases once we have batch norm consolidation.
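A simplified sketch of the idea (not the actual utility; the op set and argument index are assumptions based on the aten overload schemas):
```
# Simplified sketch: flip the `training` flag on batch-norm nodes in an
# exported GraphModule, matching both the CPU/generic and CUDA overloads.
import torch

_BN_OPS = {
    torch.ops.aten._native_batch_norm_legit.default,  # CPU/generic variant
    torch.ops.aten.cudnn_batch_norm.default,          # CUDA variant
}

def _set_bn_training(gm: torch.fx.GraphModule, train: bool) -> None:
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target in _BN_OPS:
            args = list(node.args)
            args[5] = train  # `training` is the 6th positional arg in both schemas
            node.args = tuple(args)
    gm.recompile()
```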
Test Plan:
python test/test_quantization.py -k test_move_exported_model_bn
Reviewers: jerryzh168
Subscribers: jerryzh168, leslie-fang-intel, supriyar
Differential Revision: [D56070054](https://our.internmc.facebook.com/intern/diff/D56070054)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123957
Approved by: https://github.com/jerryzh168
This commit enables the float8_e5m2 and float8_e4m3fn dtypes in FX quantization and PT2E.
Motivation for using fp8 quantization instead of int8:
- it works better to run inference with the same datatype the model was trained with,
- fp8 handles outliers better, which is one of the problems with LLM activations.
The numerical recipe we want to use this for is fp8 inference:
- bgemms/gemms running in float8_e4m3fn,
- per-tensor quantization/scaling,
- an amax observer for measurement with input_backoff and weight_backoff (a rough sketch follows below).
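A minimal sketch of the per-tensor recipe (the `backoff` factor and scale formula are illustrative assumptions, not the exact PT2E observer math):
```
# Illustrative per-tensor fp8 (e4m3) quantize/dequantize using an amax-based
# scale with a backoff factor; not the exact observer implementation.
import torch

def quantize_fp8_per_tensor(x: torch.Tensor, backoff: float = 1.0):
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    amax = x.abs().max().clamp(min=1e-12)          # amax "observer"
    scale = (fp8_max * backoff) / amax             # map amax to the fp8 range
    x_scaled = (x * scale).clamp(-fp8_max, fp8_max)
    return x_scaled.to(torch.float8_e4m3fn), scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.float32) / scale

x = torch.randn(4, 8)
x_fp8, scale = quantize_fp8_per_tensor(x)
x_hat = dequantize_fp8(x_fp8, scale)               # roundtrip for sanity
```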
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123161
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Adds a ruff lint rule to ban raising raw exceptions. Most of these should at the very least be RuntimeErrors, ValueErrors, TypeErrors, or some other more specific error type. There are hundreds of instances of these bad exception types already in the codebase, so I have noqa'd most of them. Hopefully this error code will get committers to rethink what exception type they should raise when they submit a PR.
I also encourage people to gradually go and fix all the existing noqas that have been added, so they can be removed over time and our exception typing can be improved.
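For example, the kind of change the rule encourages (illustrative only):
```
# Illustrative before/after for the lint rule: prefer a specific exception
# type over a bare `Exception`.

def load_config_old(path: str) -> None:
    # Flagged: raising a raw Exception hides the failure mode from callers.
    raise Exception(f"config not found: {path}")

def load_config_new(path: str) -> None:
    # Preferred: a specific built-in (or custom) exception type.
    raise FileNotFoundError(f"config not found: {path}")
```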
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124570
Approved by: https://github.com/ezyang
**Description**
Add dynamic quantization config for x86 inductor backend.
To support the QKV structure in self-attention, we removed an assertion in the port-metadata pass that requires a single dequantize node after a quantize node.
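A rough usage sketch of the new config via the PT2E flow (the `is_dynamic` parameter name and the export entry point are assumptions; check the current API):
```
# Rough sketch, API details assumed: prepare/convert a small linear model
# with the x86 inductor quantizer using a dynamic quantization config.
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.x86_inductor_quantizer import (
    X86InductorQuantizer,
    get_default_x86_inductor_quantization_config,
)

class TinyLinear(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 16)

    def forward(self, x):
        return self.fc(x)

model = TinyLinear().eval()
example_inputs = (torch.randn(2, 16),)
exported = torch.export.export(model, example_inputs).module()

quantizer = X86InductorQuantizer()
# `is_dynamic=True` is assumed to select the dynamic config added here.
quantizer.set_global(get_default_x86_inductor_quantization_config(is_dynamic=True))

prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)   # calibration run (a no-op for dynamic quantization)
converted = convert_pt2e(prepared)
```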
**Test plan**
```
python test/test_quantization.py -k TestQuantizePT2EX86Inductor.test_dynamic_quant_linear
python test/test_quantization.py -k TestQuantizePT2EX86Inductor.test_qat_dynamic_quant_linear
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115337
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Summary:
Previously we could only use native PyTorch int dtypes that have corresponding quantized dtypes (e.g. quint8, qint8). This
PR removes that assumption in observers/fake_quants so that users can use all PyTorch native dtypes (except int64, which we can add later if needed);
the main addition here is int16.
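A small usage sketch (the qmin/qmax values below are just the full int16 range; adjust as needed):
```
# Sketch: configure a MinMaxObserver with a plain torch.int16 dtype, which
# this change allows even though int16 has no dedicated quantized dtype.
import torch
from torch.ao.quantization.observer import MinMaxObserver

obs = MinMaxObserver(
    dtype=torch.int16,
    quant_min=-(2 ** 15),
    quant_max=2 ** 15 - 1,
    qscheme=torch.per_tensor_symmetric,
)
obs(torch.randn(8, 8))                    # record min/max statistics
scale, zero_point = obs.calculate_qparams()
```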
Test Plan:
python test/test_quantization.py TestQuantizePT2E
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108453
Approved by: https://github.com/kimishpatel
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.
I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, there seem to be no instances of it in our codebase, so I'm enabling the rule now so that it stays that way. :)
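For reference, the pattern RUF017 flags and the usual linear-time alternative:
```
# sum() over lists builds a new list at every step (quadratic); RUF017 flags
# this, and itertools.chain is the usual linear-time replacement.
import itertools

lists = [[1, 2], [3, 4], [5, 6]]

flat_quadratic = sum(lists, [])                           # flagged by RUF017
flat_linear = list(itertools.chain.from_iterable(lists))  # preferred
```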
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
Summary:
Makes the `nnqr.Linear` module respect the qmin/qmax attributes of the weight observer. This is to unblock some customer teams that depend on non-default values of these attributes.
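A minimal sketch of the kind of qconfig this unblocks (observer choice and ranges are illustrative):
```
# Illustrative qconfig whose weight observer carries non-default qmin/qmax;
# the reference `nnqr.Linear` module should now honor these values.
import torch
from torch.ao.quantization import QConfig, MinMaxObserver, default_observer

custom_weight_observer = MinMaxObserver.with_args(
    dtype=torch.qint8,
    qscheme=torch.per_tensor_symmetric,
    quant_min=-127,   # non-default: symmetric range excluding -128
    quant_max=127,
)
qconfig = QConfig(activation=default_observer, weight=custom_weight_observer)
```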
Test plan:
```
python test/test_quantization.py -k TestReferenceQuantizedModule.test_linear_decomposed
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96232
Approved by: https://github.com/andrewor14
Summary: The previous LSTM reference module implementation did
not handle dtypes other than quint8 correctly. This is because
the internal LSTM custom module quantization used eager mode,
which did not insert the q-dq ops properly. E.g., we want the
following reference quantized model:
```
[dq -> linear1_fp32 -> q_to_qint32] -> dq -> q_to_quint8 ->
[dq -> linear2_fp32 -> q_to_quint8] -> dq -> ...
```
This requires two sets of `q - dq` pairs between two adjacent
ops that have different dtypes (linear1 and linear2). However,
these `q - dq` pairs were not inserted in the old flow, because
eager mode required users to insert Quant/DeQuantStubs manually.
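To make the dtype boundary concrete, a standalone illustration of the double q-dq pattern (scales and zero points are arbitrary):
```
# Standalone illustration of two q-dq pairs at a dtype boundary: linear1's
# output is quantized to qint32, dequantized, then re-quantized to quint8
# before feeding linear2. The qparams here are arbitrary.
import torch

linear1 = torch.nn.Linear(4, 4).eval()
linear2 = torch.nn.Linear(4, 4).eval()

with torch.no_grad():
    x = torch.randn(2, 4)
    out1 = linear1(x)
    q1 = torch.quantize_per_tensor(out1, scale=2.0 ** -16, zero_point=0, dtype=torch.qint32)
    dq1 = q1.dequantize()
    q2 = torch.quantize_per_tensor(dq1, scale=0.1, zero_point=128, dtype=torch.quint8)
    dq2 = q2.dequantize()
    out2 = linear2(dq2)
```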
This commit changes the internal LSTM custom module quantization
to use FX graph mode quantization, which automatically inserts
the `q - dq` ops that convert the dtypes between adjacent ops
correctly. However, using FX graph mode quantization here comes
with its own set of challenges that required some hacks to get
the end-to-end flow to work. These hacks are detailed in the
comments in the util functions.
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_static_lstm_with_custom_fixed_qparams
This commit also updates the corresponding test to verify the
dtypes as well as the qparams in the reference quantized graph.
This test case should serve as an example for users to set up
their own LSTM reference module flows.
Reviewers: vkuzo, supriyar, jcaip
Subscribers: vkuzo, supriyar, jcaip
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96343
Approved by: https://github.com/vkuzo
Summary:
Previously we assumed asymmetric quantization for dynamic quantization; this diff adds support for symmetric quantization
of the input in dynamic quantization.
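A small sketch of the difference (illustrative int8 formulas, not the exact kernel math):
```
# Illustrative only: dynamic input qparams computed asymmetrically (affine)
# vs. symmetrically for an int8 range; the symmetric case pins zero_point to 0.
import torch

def dynamic_qparams_asymmetric(x: torch.Tensor, qmin: int = -128, qmax: int = 127):
    min_val, max_val = x.min(), x.max()
    scale = (max_val - min_val).clamp(min=1e-12) / (qmax - qmin)
    zero_point = (qmin - torch.round(min_val / scale)).clamp(qmin, qmax)
    return scale, zero_point

def dynamic_qparams_symmetric(x: torch.Tensor, qmin: int = -128, qmax: int = 127):
    amax = x.abs().max().clamp(min=1e-12)
    scale = amax / ((qmax - qmin) / 2)
    return scale, torch.tensor(0)
```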
Test Plan: buck run executorch/exir/tests:quant_lowering_custom_backend_pass -- "executorch.exir.tests.test_quant_lowering_custom_backend_pass.TestQuantLoweringCustomBackendPass.test_quantized_linear_dynamic"
Reviewed By: digantdesai
Differential Revision: D43134794
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94854
Approved by: https://github.com/digantdesai
Summary: The existing util function did not quantize all inner
ops in the quantizable LSTM module, resulting in the error
"Could not run X with arguments from the 'QuantizedCPU' backend."
This commit fixes this by ensuring that all the other ops whose
qparams were not specifically configured are still quantized as
before, as in `torch.ao.nn.quantizable.LSTM.from_float`.
Test Plan: This commit also adds a check in the test
to ensure that the final converted model is in fact quantized,
in addition to just checking that the qparams in the observers
have the right values.
python test/test_quantization.py TestQuantizeFx.test_static_lstm_with_custom_fixed_qparams
Reviewers: vkuzo
Subscribers: vkuzo, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95537
Approved by: https://github.com/vkuzo
This reverts commit 59071ab1e7.
The reverted commit breaks `quantization.jit.test_ondevice_quantization.TestOnDeviceDynamicPTQFinalize`, which is not run in OSS but is mandatory for internal CI.
Summary:
This PR deprecates the `compute_dtype` field on observers and replaces
it with an `is_dynamic` field, which is better aligned
with the reference model spec.
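A usage sketch of the new field (the observer choice and keyword arguments are assumptions; check the current observer API):
```
# Sketch: a dynamic-activation observer configured via the new `is_dynamic`
# flag rather than the deprecated `compute_dtype` kwarg.
import torch
from torch.ao.quantization.observer import PlaceholderObserver

# Old style (deprecated), shown for contrast:
# PlaceholderObserver.with_args(dtype=torch.float, compute_dtype=torch.quint8)

dynamic_act_observer = PlaceholderObserver.with_args(
    dtype=torch.quint8, quant_min=0, quant_max=255, is_dynamic=True
)
```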
Test plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85431
Approved by: https://github.com/jerryzh168