This PR enables all PIE rules on ruff, there are already some enabled rules from this family, the new added rules are
```
PIE796 Enum contains duplicate value: {value}
PIE808 Unnecessary start argument in range
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
This PR enables all PIE rules on ruff, there are already some enabled rules from this family, the new added rules are
```
PIE796 Enum contains duplicate value: {value}
PIE808 Unnecessary start argument in range
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
Adding bf16 for the backward pass of `torch._fake_quantize_learnable_per_tensor_affine()`.
Note that for testing, we modified the seed to avoid increasing tolerance due to cases where difference in Python vs CPP downcasting causes tensor mismatches. (e.g. 27.87704 vs 27.8408 before downcasting, 27.7500 vs 27.8750 after downcasting for Python vs CPP op)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165362
Approved by: https://github.com/andrewor14
Follow up to #165098 - adding bf16 support for the backward pass. To avoid BC breaking changes/losing precision, we upcast the parameters to fp32 after the op gets called, and downcast the gradients to bf16 before returning.
For testing, we upcast to fp32 before calling the reference function. We increase the tolerance to 1e-2 for bf16 inputs because of a difference in casting calculations between python's `x.to(torch.bfloat16)` and cpp's `x.to(at::kBFloat16)` (after comparing intermediate tensors, we found that the numerics diverge after the final casting). We don't explicitly cast in the CPP op but rather let autograd/optimizer handle it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165325
Approved by: https://github.com/andrewor14
Adding bf16 support for `torch._fake_quantize_learnable_per_channel_affine()` op by relaxing the type check on scale
TODO: need to add bf16 support to `per_tensor_affine_` as `torch._fake_quantize_learnable_per_tensor_affine_backward` gets called in the backward pass
**Test**
Modified unit test in `test_workflow_ops.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165098
Approved by: https://github.com/jerryzh168, https://github.com/andrewor14
This PR fixes:
- Numpy >= 2.1 version detection (instead of python 3.13 version detection) to skip some tests (numpy 2.1 can be installed for older python versions)
```
test_quantization.py::TestDynamicQuantizedOps::test_qlinear
test_quantization.py::TestDynamicQuantizedOps::test_qlinear_legacy
test_quantization.py::TestQuantizedLinear::test_qlinear
test_quantization.py::TestQuantizedLinear::test_qlinear_leaky_relu
test_quantization.py::TestQuantizedLinear::test_qlinear_relu
test_quantization.py::TestQuantizedLinear::test_qlinear_tanh
test_quantization.py::TestQuantizedLinear::test_qlinear_with_input_q_dq_qweight_dq_output_fp32
```
- A couple of SDPA tests on MI355 by adjusting fudge_factors:
```
test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32
test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161429
Approved by: https://github.com/jeffdaily
**Summary**
When output dtype is fp8, oneDNN does not ensure intermediate results in the range of [-448, 448] before converting to fp8. So, we may get NaN in the output, which is a disaster for inference. This PR fixes this issue by clamping the intermediate results by oneDNN's post-op clip.
**Test plan**
```
pytest -sv test/quantization/core/test_quantized_op.py -k "q and fp8"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160957
Approved by: https://github.com/Valentine233, https://github.com/CaoE
Fixes
`RuntimeError: self and mat2 must have the same dtype, but got BFloat16 and Float`
With bf16 autocast, bias converted into BFloat16, but fp8_qlinear_onednn_ref not support bf16 bias.
In this pr, convert bias into bf16 on fp8_qlinear_onednn_ref.
Add this case into ut and reproduce:
`python test/test_quantization.py -k test_qlinear_fp8`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159125
Approved by: https://github.com/Xia-Weiwen, https://github.com/cyyever, https://github.com/CaoE
**Summary**
Enable fp8 qconv on CPU. It's part of the plan to enable fp8 static quantization on CPU. This PR only adds FP8 support of the existing int8 qconv op. It does not add a new op nor does it affect frontend or quantization flow. The schema of the qconv op is not changed either.
So, the FP8 qconv shares the same op as INT8 qconv and the difference is that src/wei dtype is fp8 instead of int8. The output dtype can be fp8/float32/bfloat16. The implementation uses the oneDNN library.
Note:
OneDNN does not support quantized fp8 convolution until v3.9 but the version used in PyTorch is v3.7.2. So, the op goes to the reference kernel for now. And we have also update the oneDNN path so that it's compatible with the fp8 dtype. Once oneDNN is upgraded to v3.9 or newer, minimum changes are needed to enable the oneDNN path. And we have ensured that the behavior of the reference kernel is the same as the new oneDNN's implementation.
- oneDNN version < 3.9 (now)
- Always go to the reference kernel
- oneDNN version >= 3.9 (future)
- Go to reference kernel on old platforms (without AMX)
- Use oneDNN on new platforms (with AMX)
**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k "qconv and fp8"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157076
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
This PR addresses a minor typo in the file `test/quantization/fx/test_model_report_fx.py`:
- Corrected the word "paramter" to "parameter" for better readability and accuracy.
While it's a small change, correcting such typographical errors contributes to maintaining the overall quality and professionalism of the codebase.
Thank you for your time and consideration in reviewing this PR. I'm happy to make any further adjustments if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157646
Approved by: https://github.com/yewentao256, https://github.com/ezyang
Fixes#154328
**Summary**
Fail reason:
The input value is infinity in float and it has undefined behavior to convert it to int64_t. On X86, it will be converted to the min value of int64_t, which is not expected.
Fix:
Clamping `(input * inv_scale + zero_point)` to `[quant_min, quant_max]` before converting it to int64_t.
**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fake_quantize_per_tensor_affine_inf
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155109
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
**Summary**
Enable fp8 qlinear on CPU. It's part of the plan to enable fp8 static quantization on CPU. This PR only adds FP8 support of the existing int8 qlinear op. It does not add a new op nor does it affect frontend or quantization flow. The schema of the qlinear op is not changed either.
So, the FP8 qlinear shares the same op as INT8 qlinear and the difference is that src/wei dtype is fp8 instead of int8. The output dtype can be fp8/float32/bfloat16. The implementation uses the oneDNN library.
The differences of qlinear from `_scaled_mm` are that
- Qlinear supports post op fusion while `_scaled_mm` does not
- Weights are prepacked for qlinear
**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k "qlinear and fp8"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155678
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
Fixes#154328
**Summary**
Fail reason:
The input value is infinity in float and it has undefined behavior to convert it to int64_t. On X86, it will be converted to the min value of int64_t, which is not expected.
Fix:
Clamping `(input * inv_scale + zero_point)` to `[quant_min, quant_max]` before converting it to int64_t.
**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fake_quantize_per_tensor_affine_inf
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155109
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
Fixes#154328
**Summary**
Fail reason:
The input value is infinity in float and it has undefined behavior to convert it to int64_t. On X86, it will be converted to the min value of int64_t, which is not expected.
Fix:
Clamping `(input * inv_scale + zero_point)` to `[quant_min, quant_max]` before converting it to int64_t.
**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fake_quantize_per_tensor_affine_inf
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155109
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168