Annotate linear node for `linear_dynamic_fp16` with `X86InductorQuantizer`
After `convert_pt2e`, the pattern will be
```
x
|
linear <- to_fp32 <- to_fp16 <- w
```
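A rough numerical picture of that pattern, written in plain Python rather than as the actual graph ops: the weight is rounded to fp16, widened back, and then fed to an ordinary linear, while the activation is left alone. The helper names `to_fp16_and_back`, `linear`, and `linear_dynamic_fp16` are hypothetical, and `struct`'s `"e"` format stands in for the dtype casts.

```python
import struct


def to_fp16_and_back(value):
    """Round a Python float to IEEE half precision, then widen it again."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def linear(x, weight, bias=None):
    """y[j] = sum_i x[i] * weight[j][i] (+ bias[j]); weight is [out, in]."""
    out = []
    for j, row in enumerate(weight):
        acc = sum(xi * wi for xi, wi in zip(x, row))
        if bias is not None:
            acc += bias[j]
        out.append(acc)
    return out


def linear_dynamic_fp16(x, weight, bias=None):
    # w -> to_fp16 -> to_fp32 -> linear; the activation x is not quantized.
    w_rounded = [[to_fp16_and_back(w) for w in row] for row in weight]
    return linear(x, w_rounded, bias)


x = [1.0, 2.0]
weight = [[0.1, 0.2], [0.5, -0.25]]
print(linear_dynamic_fp16(x, weight))
```

Weights that are exactly representable in fp16 (such as 0.5 and -0.25) pass through unchanged; values like 0.1 pick up a small rounding error.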
**Test plan**
```
pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_linear_dynamic_fp16
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141480
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
- also reconciles the scales and zero-point dtypes with the meta impl and with how other
quantized ops represent scales and zero point
- make sure quantize_per_token's output_dtype is respected
There are a few more places where the scale and zero-point dtypes need to be reconciled,
but that will come later. These fixes are mainly being done to enable quantized
kv cache through the ET stack
Differential Revision: [D62301840](https://our.internmc.facebook.com/intern/diff/D62301840/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136807
Approved by: https://github.com/jerryzh168
Summary: Follow-up to https://github.com/pytorch/ao/pull/229.
This resolves the difference between `input.div(scales)` and
`input.mul(1.0 / scales)`, which results in small numerical
discrepancies on some inputs.
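A small plain-Python illustration of why the two forms can disagree. Floating-point division and multiplication by a precomputed reciprocal each round once, but at different points, so the scaled values are not always bit-identical. The functions `quantize_div` and `quantize_mul` are hypothetical scalar stand-ins, not the actual op implementations.

```python
def quantize_div(x, scale, zero_point, qmin, qmax):
    """Quantize by dividing by the scale."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def quantize_mul(x, scale, zero_point, qmin, qmax):
    """Quantize by multiplying with the reciprocal of the scale."""
    inv_scale = 1.0 / scale
    q = round(x * inv_scale) + zero_point
    return max(qmin, min(qmax, q))


x, scale = 5.0, 3.0
# The intermediate scaled values differ in the last bit for this input.
print(repr(x / scale), repr(x * (1.0 / scale)))
# Here both still round to the same integer; a mismatch only shows up when
# the scaled value sits right at a rounding boundary.
print(quantize_div(x, scale, 0, -128, 127), quantize_mul(x, scale, 0, -128, 127))
```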
Test Plan:
python test/test_quantization.py TestQuantizedTensor.test_decomposed_quantize_per_channel_group
python test/test_quantization.py TestQuantizedTensor.test_decomposed_quantize_per_token
Reviewers: jerryzh168
Subscribers: jerryzh168, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125781
Approved by: https://github.com/jerryzh168
This commit enables float8_e5m2 and float8_e4m3fn dtypes in fx quantization and PT2E.
Motivation for using fp8 quantization instead of int8:
- it works better to run inference with the same datatype the model was trained with,
- fp8 handles outliers better, and outliers are one of the known problems in LLM activations.
The numerical recipe we want to use it for is fp8 inference:
- bgemms/gemms running in float8_e4m3fn,
- Per-Tensor-Quantization/Scaling,
- amax observer for measurement with input_backoff and weight_backoff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123161
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Adds a ruff lint rule to ban raising raw exceptions. Most of these should at the very least be runtime errors, value errors, type errors, or some other more specific error. There are already hundreds of instances of these bad exception types in the codebase, so I have noqa'd most of them. Hopefully this error code will get committers to rethink what exception type they should raise when they submit a PR.
I also encourage people to gradually fix the existing noqas that were added, so they can be removed over time and our exception typing can be improved.
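As a sketch of the kind of change the rule nudges toward, the hypothetical helper below raises `TypeError` and `ValueError` where a bare `Exception` might otherwise have been used. The function name and messages are made up for illustration.

```python
def check_quant_range(quant_min, quant_max):
    """Validate a quantization range, raising specific exception types."""
    if not isinstance(quant_min, int) or not isinstance(quant_max, int):
        # Wrong argument type: TypeError rather than a raw Exception.
        raise TypeError("quant_min and quant_max must be ints")
    if quant_min > quant_max:
        # Right type, invalid value: ValueError rather than a raw Exception.
        raise ValueError(
            f"quant_min ({quant_min}) must not exceed quant_max ({quant_max})"
        )
    return quant_min, quant_max
```

Callers can then catch exactly the failure they know how to handle instead of a blanket `except Exception`.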
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124570
Approved by: https://github.com/ezyang
Summary: https://github.com/pytorch/pytorch/pull/123452 added
backward support to this op by turning it into
CompositeImplicitAutograd, which meant it gets decomposed during
export/compile. However, this is not desirable behavior for the
PTQ case when we try to lower the model. This commit enables
QAT without breaking PTQ by refactoring the impl into a separate
op that does have backward support.
Test Plan:
python test/test_quantization.py -k test_decomposed_choose_qparams_per_token_asymmetric_backward
Reviewers: jerryzh168, digantdesai, zou3519
Subscribers: jerryzh168, digantdesai, zou3519, supriyar
Differential Revision: [D56192116](https://our.internmc.facebook.com/intern/diff/D56192116)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124178
Approved by: https://github.com/digantdesai
Summary: When running the backward for this op, we get the error:
```
RuntimeError: derivative for aten::aminmax is not implemented
```
This commit replaces this call with separate amin and amax
calls instead, which do have implemented derivatives.
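A plain-Python sketch of per-token asymmetric qparam selection in which the minimum and maximum come from two separate reductions, mirroring the amin/amax split. It uses the standard affine formula and is an assumption about the general shape of the computation, not a copy of the op (eps handling and dtypes are omitted).

```python
def choose_qparams_per_token_asymmetric(tokens, qmin=-128, qmax=127):
    """Return one (scale, zero_point) pair per token (row)."""
    scales, zero_points = [], []
    for token in tokens:
        # Two separate reductions instead of one fused min-max call.
        # Clamp the range to include 0.0 so zero is exactly representable.
        lo = min(min(token), 0.0)
        hi = max(max(token), 0.0)
        scale = (hi - lo) / (qmax - qmin)
        if scale == 0.0:
            scale = 1.0  # all-zero token: any positive scale works
        zero_point = round(qmin - lo / scale)
        zero_point = max(qmin, min(qmax, zero_point))
        scales.append(scale)
        zero_points.append(zero_point)
    return scales, zero_points


print(choose_qparams_per_token_asymmetric([[0.0, 255.0], [-128.0, 127.0]]))
```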
Test Plan:
python test/test_quantization.py -k test_decomposed_choose_qparams_per_token_asymmetric_backward
Reviewers: jerryzh168, digantdesai
Subscribers: jerryzh168, digantdesai, supriyar
Differential Revision: [D55805170](https://our.internmc.facebook.com/intern/diff/D55805170)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123452
Approved by: https://github.com/digantdesai, https://github.com/jerryzh168
We are taking API feedback. Changes:
- I removed some of the default values (they weren't being used).
- I was unable to convert the last op (which is essentially an
autograd.Function registered as CompositeImplicitAutograd). That one
is "incorrectly registered"; I punt fixing it to the future.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123454
Approved by: https://github.com/andrewor14
ghstack dependencies: #123453, #123578
Summary: We probably don't need
`torch._C._AutoDispatchBelowAutograd()`, which is to prevent
infinite recursion if the implementation calls itself. Let's
remove it and see if anything breaks. The other major change
is registering the op to the more general Autograd dispatch
key so it can be used on cuda as well.
Test Plan:
python test/inductor/test_cpu_repro.py -k test_decomposed_fake_quant_per_channel
Reviewers: zou3519, bdhirsh
Subscribers: zou3519, bdhirsh, jerryzh168, leslie-fang-intel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123186
Approved by: https://github.com/zou3519, https://github.com/leslie-fang-intel
Summary:
X-link: https://github.com/pytorch/executorch/pull/2308
Note: The initial purpose of this PR is to gather suggestions and feedback on better alternatives, if any.
At present, the dequantize ops for the decomposed quantized Tensor representation, e.g. dequantize_per_tensor(), assume the output dtype is torch.float, so the output dtype does not appear in their argument lists. This signature becomes unusable when that assumption breaks: if the output dtype differs from torch.float, there is no way to specify it during dequantization.
This change generalizes the signature of dequantize ops such as dequantize_per_tensor() for use cases where the output dtype can differ from torch.float and needs to be passed during dequantization. The proposal is to add an argument named 'output_dtype'. We would also welcome suggestions and feedback on any better alternative.
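To make the proposal concrete, here is a plain-Python sketch of a dequantize function that takes an `output_dtype` argument. String dtype names and the `struct` round-trip are stand-ins for real tensor dtypes; the function body is an illustration of the idea, not the actual op.

```python
import struct

# Hypothetical mapping from dtype names to struct formats (little-endian).
_FORMATS = {"float16": "<e", "float32": "<f", "float64": "<d"}


def dequantize_per_tensor(values, scale, zero_point, output_dtype="float32"):
    """Dequantize integers and round each result to the requested precision."""
    fmt = _FORMATS[output_dtype]
    out = []
    for q in values:
        real = (q - zero_point) * scale
        # Round-trip through the target format to emulate the output dtype.
        out.append(struct.unpack(fmt, struct.pack(fmt, real))[0])
    return out


print(dequantize_per_tensor([0, 128, 255], 0.1, 128))
print(dequantize_per_tensor([0, 128, 255], 0.1, 128, output_dtype="float16"))
```

Keeping `"float32"` as the default preserves the old behavior for callers that never pass the new argument.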
cc jerryzh168 jianyuh raghuramank100 jamesr66a vkuzo jgong5 Xia-Weiwen leslie-fang-intel
Reviewed By: digantdesai
Differential Revision: D53590486
Pulled By: manuelcandales
Co-authored-by: kausik <kmaiti@habana.ai>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121450
Approved by: https://github.com/jerryzh168
**Summary**
Add the `quantized_decomposed.fake_quant_per_channel` operator and test its forward and backward passes by comparing them against ATen.
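For orientation, a plain-Python sketch of what a per-channel fake quantize computes: quantize, clamp, and dequantize in the forward pass, plus a mask that lets gradients through only where the value was not clamped (a straight-through estimator). This is an assumed reference formulation over nested lists with channels on axis 0, not the operator's code.

```python
def fake_quant_per_channel(x, scales, zero_points, quant_min, quant_max):
    """Return (fake-quantized values, gradient mask) for channels on axis 0."""
    out, mask = [], []
    for channel, scale, zp in zip(x, scales, zero_points):
        out_row, mask_row = [], []
        for value in channel:
            q = round(value / scale) + zp
            clamped = max(quant_min, min(quant_max, q))
            out_row.append((clamped - zp) * scale)
            # Gradient flows only where clamping did not change the value.
            mask_row.append(quant_min <= q <= quant_max)
        out.append(out_row)
        mask.append(mask_row)
    return out, mask


def fake_quant_per_channel_backward(grad_output, mask):
    """Straight-through estimator: pass gradients where the mask is True."""
    return [
        [g if keep else 0.0 for g, keep in zip(g_row, m_row)]
        for g_row, m_row in zip(grad_output, mask)
    ]


x = [[0.3, 100.0], [-1.0, 0.1]]
out, mask = fake_quant_per_channel(x, [0.5, 0.25], [0, 10], -128, 127)
print(out, mask)
```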
**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_decomposed_fake_quant_per_channel
```
**Next Step**
Optimize the performance: the generated code for the forward and backward graphs is not vectorized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121297
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
Summary:
Previously we could only use native PyTorch int dtypes that have corresponding quantized dtypes (e.g. quint8, qint8). This
PR removes that assumption in observers/fake_quants so that users can use all native PyTorch dtypes (except int64, which we can add later if needed).
The main addition here is int16.
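One way to picture dropping that assumption is to derive the quantization range from the integer width instead of looking it up in a table of quantized dtypes. The helper below is hypothetical and only illustrates the arithmetic; it is not the observer code.

```python
def int_quant_range(num_bits, signed=True):
    """Return (quant_min, quant_max) for an integer type of the given width."""
    if signed:
        return -(1 << (num_bits - 1)), (1 << (num_bits - 1)) - 1
    return 0, (1 << num_bits) - 1


print(int_quant_range(8, signed=False))  # uint8-style range
print(int_quant_range(16))               # int16-style range
```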
Test Plan:
python test/test_quantization.py TestQuantizePT2E
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108453
Approved by: https://github.com/kimishpatel
Summary: Similar to quantized add, this PR adds the reference representation for the quantize/dequantize operators.
Test Plan:
buck2 test caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_representation_quantize (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'
buck2 test caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_representation_dequantize (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'
Reviewed By: kimishpatel
Differential Revision: D46959928
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104395
Approved by: https://github.com/andrewor14
Summary:
The planned e2e for quantization in pytorch 2.0 export is the following:
float_model -> prepare_pt2e -> calibration -> convert_pt2e -> ...
Inside convert_pt2e, we will first produce a q/dq representation of the quantized model, similar to the previous output of
convert_to_reference_fx in fx graph mode quantization:
```
torch.ops.quantized_decomposed.dequantize_per_tensor -> torch.ops.aten.add -> torch.ops.quantized_decomposed.quantize_per_tensor
torch.ops.quantized_decomposed.dequantize_per_tensor /
```
Then we'll rewrite the above into a representation that expresses the intent more precisely: since
here we actually want to do int8 addition, rather than simulate int8 addition with fp32 operations, the representation for
quantized add is:
```
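def quantized_add(x_i8, x_scale, x_zero_point, y_i8, y_scale, y_zero_point, out_scale, out_zero_point):
    # Rescale each integer input into the output's quantized domain.
    x = (x_scale / out_scale) * x_i8
    y = (y_scale / out_scale) * y_i8
    out = x + y
    # Remove both input zero-point contributions, then add the output zero point.
    out -= (x_zero_point * x_scale + y_zero_point * y_scale) / out_scale
    out += out_zero_point
    return out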
```
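The algebra can be checked against the dequantize, add, quantize definition. Starting from `out = ((x_i8 - x_zp) * x_scale + (y_i8 - y_zp) * y_scale) / out_scale + out_zp` and expanding, the two zero-point terms combine as a sum, `(x_zp * x_scale + y_zp * y_scale) / out_scale`. The sketch below compares the two paths on scalars; rounding and clamping to int8 are left out, and the function names are hypothetical.

```python
import math


def add_via_fp32(x_i8, x_scale, x_zp, y_i8, y_scale, y_zp, out_scale, out_zp):
    """Dequantize both inputs, add in floating point, requantize (unrounded)."""
    x_fp = (x_i8 - x_zp) * x_scale
    y_fp = (y_i8 - y_zp) * y_scale
    return (x_fp + y_fp) / out_scale + out_zp


def add_rescaled(x_i8, x_scale, x_zp, y_i8, y_scale, y_zp, out_scale, out_zp):
    """Same result, computed by rescaling the integer inputs directly."""
    out = (x_scale / out_scale) * x_i8 + (y_scale / out_scale) * y_i8
    out -= (x_zp * x_scale + y_zp * y_scale) / out_scale
    out += out_zp
    return out


args = (10, 0.5, 2, -20, 0.25, -3, 0.125, 5)
print(add_via_fp32(*args), add_rescaled(*args))
assert math.isclose(add_via_fp32(*args), add_rescaled(*args), abs_tol=1e-9)
```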
Test Plan:
```
buck2 test caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_representation_add (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'
```
Reviewed By: kimishpatel
Differential Revision: D45628032
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104130
Approved by: https://github.com/kimishpatel
Summary:
Previously we assumed asymmetric quantization for dynamic quantization. This diff adds support for symmetric quantization
of the input in dynamic quantization.
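For reference, a plain-Python sketch of symmetric qparam selection for a dynamically quantized input: the scale is taken from the largest absolute value and the zero point is fixed at 0 for a signed range, whereas the asymmetric case derives both from the min and max. This is the common textbook formula, assumed here for illustration rather than taken from the diff.

```python
def choose_qparams_symmetric(values, qmin=-128, qmax=127):
    """Return (scale, zero_point) with the zero point pinned to 0."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / ((qmax - qmin) / 2)
    if scale == 0.0:
        scale = 1.0  # all-zero input: any positive scale works
    return scale, 0


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Affine quantization with clamping to [qmin, qmax]."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


activations = [-1.0, 0.5, 127.5]
scale, zero_point = choose_qparams_symmetric(activations)
print(scale, zero_point, quantize(activations, scale, zero_point))
```

Because the qparams are computed from the runtime input, no calibration statistics for the activation are needed.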
Test Plan: buck run executorch/exir/tests:quant_lowering_custom_backend_pass -- "executorch.exir.tests.test_quant_lowering_custom_backend_pass.TestQuantLoweringCustomBackendPass.test_quantized_linear_dynamic"
Reviewed By: digantdesai
Differential Revision: D43134794
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94854
Approved by: https://github.com/digantdesai
Summary:
This PR tries to decompose the operators in the torch.ops.quantized_decomposed namespace into more
primitive aten operators. This would free us from maintaining the semantics of the quantize/dequantize
operators, which can be expressed more precisely in terms of the underlying aten operators.
Note: this PR just adds them to the decomposition table; we haven't enabled this by default yet.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_q_dq_decomposition
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93312
Approved by: https://github.com/vkuzo, https://github.com/SherlockNoMad
This reverts commit 59071ab1e7.
It breaks `quantization.jit.test_ondevice_quantization.TestOnDeviceDynamicPTQFinalize`, which is not run in OSS, but is mandatory for internal CI.
Summary: Only the pattern part, will leave the delegation example to Chen
Test Plan: buck run executorch/exir/tests:quant_lowering_custom_backend_pass -- "executorch.exir.tests.test_quant_lowering_custom_backend_pass.TestQuantLoweringCustomBackendPass.test_quantized_linear_dynamic"
Reviewed By: cccclai
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90640
Approved by: https://github.com/cccclai
When you are writing a meta function, you cannot call item() on the tensor because there is no real data on the tensor and it will fail. The error message was not very good in this case, see also https://github.com/pytorch/pytorch/issues/89959
This PR takes a brute force approach to resolving the problem: just manually define meta implementations for the naughty functions that are calling item(). However, this results in a lot of code duplication. The easiest way to avoid this situation is to rewrite the decomps so they don't call item. It should not be that difficult to use direct tensors on your operations, as scalar tensors can broadcast too.
I could only test this with `buck test @mode/opt -c python.package_style=inplace //executorch/backends/test:test_backends` in internal with D41555454. Test coverage needs to be improved, otherwise don't blame us when we break you.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89958
Approved by: https://github.com/jerryzh168