This is follow-up of #165037. It generally recommended to use `is/is not` to compare types. Therefore this series of changes apply this suggestion in the code base, and it aims to finally enabling related linter checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165142
Approved by: https://github.com/albanD
**Summary**
For int8 GEMM Template, the micro GEMM will calculate in u8s8s32 and we will do the scale/zp compensation in the epilogue. In general, it will be calculated as:
```
temp = micro_gemm_output * x_scale * w_scale
temp = temp - (x_scale * w_scale * x_zp) * sum(w, 0)
```
For case when `x_scale, w_scale, x_zp` are constant, we can pre-calculate the compensation to save runtime calculation.
**Performance**
Test with 4 cores of XEON-5 and shapes from VIT model
Before
```
GEMM(M=197,N=768,K=768) compile: 0.0939 ms (2.48 TOPS, 18.13 GB/s)
GEMM(M=197,N=3072,K=768) compile: 0.4275 ms (2.17 TOPS, 13.90 GB/s)
GEMM(M=197,N=768,K=3072) compile: 0.2677 ms (3.47 TOPS, 22.20 GB/s)
GEMM(M=1,N=1000,K=768) compile: 0.0148 ms (0.10 TOPS, 99.10 GB/s)
```
After
```
GEMM(M=197,N=768,K=768) compile: 0.0597 ms (3.90 TOPS, 28.53 GB/s)
GEMM(M=197,N=3072,K=768) compile: 0.2126 ms (4.37 TOPS, 27.95 GB/s)
GEMM(M=197,N=768,K=3072) compile: 0.2282 ms (4.07 TOPS, 26.04 GB/s)
GEMM(M=1,N=1000,K=768) compile: 0.0149 ms (0.10 TOPS, 98.71 GB/s)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152000
Approved by: https://github.com/Xia-Weiwen, https://github.com/CaoE, https://github.com/jansel
For int8 dynamically quantized activation & int8 quantized weights, add a workaround for some indexing issue that expected an empty index ( so, was expecting a 0D tensor) in epilogue creator when the activation scale was sized [1, 1] by converting it into a 0D tensor.
The issue was discovered while running LLaMA2 quantized with torchao's `int8_dynamic_activation_int8_weight` quantization on CPU with max-autotune enabled (although this error would've occurred regardless).
The final hidden states tensor that's activation to LM head is of shape `[batch_size, sequence_length, hidden_dim]` during decoding. For decoding one token at a time with batch size 1, sequence length is 1. The activation scale is shaped `[1, 1]` (reshaped from `[1, 1, 1]`). However, Inductor epilogue creator expects a 0D tensor in this case (my guess is that the corresponding logic in Inductor expects a 0D tensor if a tensor has only one element, even if it's 1D?).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147033
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel
# Motivation
The PR is intended to enable `onednn.qlinear` and `onednn.qlinear_unary` at Intel GPU.
We register the qlinear ops at C++ backend via `TORCH_LIBRARY_IMPL`, the op this PR registers includes `onednn::qlinear_pointwise`, `onednn::qlinear_pointwise.tensor`, and `onednn::qlinear_prepack`. The prepack conduct transpose on weight for fitting oneDNN requirement on weight to acquire higher performance.
Also, we remove the limitation of the corresponding annotation method in the `XPUInductorQuantizer` (`torch/ao/quantization/quantizer/xpu_inductor_quantizer.py`) to allow GPU linear conversion.
We add the kChar(`torch.int8`) dtype in the `torch/_inductor/fx_passes/quantization` and `torch/_inductor/mkldnn_ir.py`, as signed int8 is the default INT8 data type at GPU side.
We verified the op through UTs and e2e model testing like ResNet18, ResNet50.
# UT verification
```
DNNL_VERBOSE=0 TORCH_COMPILE_DEBUG=0 python test/inductor/test_mkldnn_pattern_matcher.py -v \
-k test_qlinear_xpu \
-k test_qlinear_relu_xpu \
-k test_qlinear_gelu_xpu
```
# Runtime exemplification
Here is the oneDNN verbose collected through running above UTs
```
//pure int8 gemm
onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 dst_s8::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32+dst:0:s32,,2x4:4x3,0.187988
// post-relu fusion
onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_f32::blocked:ab::f0_mask2 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_relu,,2x4:4x4,0.115234
// post-gelu fusion
onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_gelu_tanh,,2x4:4x4,0.170898
````
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133307
Approved by: https://github.com/liangan1, https://github.com/guangyey, https://github.com/EikanWang, https://github.com/jerryzh168
Co-authored-by: guangyey <guangye.yu@intel.com>
## Summary
Templated `int8xint8->int32` GEMM that uses AMX ISA (present on Intel Xeon Gen 4 & above). Any epilogues such as weight scale, activation scale, and bias are applied per output block in a fused manner .
Performs well for large values of `M` dimension (assuming canonical dimensions [`M, K`] and [`K, N`] for the activation & weight matrices'/tensors' sizes) when the activation is quantized per-token.
Also supports SmoothQuant GEMM pattern when activation is quantized per-tensor (scalar scale) or per-token (vector scale is applied as an epilogue in this case).
Also increased coverage of GEMM template for uint8 activation, int8 weight GEMM UTs for when the activation zero point is a 1D tensor (the existing implementation only accepted 0D tensors). However, some of such UTs would have to be explicitly enabled with `max-autotune` Inductor config.
## Performance data
The templated codegened fused GEMM with M=32, K=4096, N=14336 used in LLaMA3 exhibits more than 2x perf-gain compared to oneDNN qlinear + mul (for activation's scale) with 48 cores of one socket of Xeon SP 4th gen Platinum 8468 when per-token quantization is used.
For M=1, K=4096, N=14336, regardless of whether per-tensor quantization was used for activation or per-token, the perf gain was more than 3x.
Intel OpenMP & libtcmalloc had been preloaded. All cores used by the workload corresponded to distinct physical cores.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143187
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel, https://github.com/jgong5
Co-authored-by: Leslie Fang <leslie.fang@intel.com>
**Summary**
Enable the CPP Grouped GEMM Fusion, lowering and Grouped GEMM Template following the RFC: https://github.com/pytorch/pytorch/issues/144012
- Support flexible number of GEMMs
- Share activation across GEMMs
- The Grouped GEMM Template supports independent activations
- However, the pattern matcher requires an anchor node, which is as the shared activation across GEMMs
- Each GEMM can have a unique weight but same sizes
- Each GEMM can have a unique bias or None
- Current PR does not yet support biases; this will be addressed in a follow-up epilogue fusion PR
- Each GEMM have its own epilogues
- Epilogue fusion is not yet supported in this PR and will be enabled in an upcoming follow-up epilogue fusion PR
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_invalid
python -u -m pytest -s -v test/inductor/test_cpu_cpp_wrapper.py -k test_grouped_linear
```
**Example**
Here is the example and generated code
```
batch_size = 4
in_features = 512
out_features = 1024
dtype = torch.bfloat16
class M(torch.nn.Module):
def __init__(self, bias):
super().__init__()
self.linear0 = torch.nn.Linear(in_features, out_features, bias=False)
self.linear1 = torch.nn.Linear(in_features, out_features, bias=False)
def forward(self, x):
return self.linear0(x), self.linear1(x)
if __name__ == "__main__":
with torch.no_grad():
input = torch.randn(batch_size, in_features, dtype=dtype)
m = M(bias=bias).to(dtype=dtype).eval()
cm = torch.compile(m)
act_res = cm(input)
```
Generated Code: https://gist.github.com/leslie-fang-intel/ed2e8d23aeb3586eb504feeace692e16#file-grouped-gemm-generated-code-py
**Next Step**
- Support Epilogue fusion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143796
Approved by: https://github.com/jgong5, https://github.com/jansel
This PR adds C shim for `QConvPointWisePT2E` and `QConvPointWiseBinaryPT2E` similar to https://github.com/pytorch/pytorch/pull/138439. Besides that, we aligned the implementation of `qconv_pointwise` with `qlinear_pointwise` in the following aspects:
1. The parameter order of `qconv_pointwise` and `qlinear_pointwise` are quite different, we aligned the schema of `qconv_pointwise` to have similar parameter order as `qlinear_pointwise` to make it more consistent.
2. We always converted `x_scale` and `x_zero_point` to Tensors, just like in the lowering of `qlinear_pointwise`. This avoids the need to create two separate C APIs (one for `double x_scale` and `int64_t x_zero_point`, and another for `Tensor` versions). Instead, we only need one API for `Tensor`-based `x_scale` and `x_zero_point`. If we later add dynamic quantization for qconv (which will use `Tensor` for `x_scale` and `x_zero_point`), we can reuse the code from this PR and don't need to change the C shim layer API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138540
Approved by: https://github.com/jgong5, https://github.com/desertfire
ghstack dependencies: #138691, #138806
## Summary
As part of #125683, this PR modifies existing CPU GEMM cpp template & micro-kernel template to enable int8 WoQ GEMM auto-tuning with AVX2, AVX512 & AMX ISAs (the latter is only available on Xeon 4th generation & beyond).
WoQ GEMM takes FP16/BF16 activations, int8 weights, and scale of the same dtype as activations.
The operation is equivalent to `torch.nn.functional.linear(x, w.to(x.dtype)) * scale`, which is essentially what the ATen op `torch.ops.aten._weight_int8pack_mm` currently does (except that weights are not cached by it). Weights will be considered constant & cached, so this implementation is suitable for inference, and not QAT. `scale` is supported as a `mul` epilogue.
Only BF16 activations have been supported in this PR because for FP16 & FP32, weight is dequantized during constant-folding pass of freezing, and then after auto-tuning, performance with a large `M` dimension may be better than either torch.ops.aten._weight_int8pack_mm, or the WoQ micro-kernel support introduced in this PR, which dequantizes `w` within the micro-kernel.
While even BF16 activations with a large `M` dimension may benefit from dequantizing `w` beforehand, for now, they would use WoQ support in GEMM templates for auto-tuning, and then a subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.
### Performance
#### AMX
Op-level speedup due to AMX micro-kernel (selected during auto-tuning) on 32 physical cores of Intel(R) Xeon(R) Platinum 8468H (of Xeon 4th generation series, codenamed Sapphire Rapids) vs. ATen kernel `torch.ops.aten._weight_int8pack_mm`. Intel OpenMP & tcmalloc were preloaded.
In a few cases with an odd `K`, the implementation being added in this PR may not perform as well as the ATen kernel, which is unrelated to this PR, though, since `test_linear_amx` also exhibits similar datapoints. In those cases, the AMX micro-kernel might be slower than AVX512 micro-kernel, so if such sets of shapes are used for auto-tuning, either the AVX512 micro-kernel implementation, or the ATen kernel would be chosen instead.
Benchmarked with unit-tests.
Tabular data at https://gist.github.com/sanchitintel/294811a86c8ff6b867c668ae2107c405?permalink_comment_id=5142442#gistcomment-5142442
The AVX512 micro-kernel was disabled to collect data for AMX micro-kernel.
#### AVX2/AVX512 micro-kernels
Tabular data at at https://gist.github.com/sanchitintel/52b5fa9c66f791be19e48e2aa6423dc4?permalink_comment_id=5142437#gistcomment-5142437
### Follow-up
1. int4 WoQ GEMM micro-kernel will also be added in a separate PR.
2. A subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.
E2E perf measurement should be done with #131310.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131887
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
## Summary
As part of #125683, this PR modifies existing CPU GEMM cpp template & micro-kernel template to enable int8 WoQ GEMM auto-tuning with AVX2, AVX512 & AMX ISAs (the latter is only available on Xeon 4th generation & beyond).
WoQ GEMM takes FP16/BF16 activations, int8 weights, and scale of the same dtype as activations.
The operation is equivalent to `torch.nn.functional.linear(x, w.to(x.dtype)) * scale`, which is essentially what the ATen op `torch.ops.aten._weight_int8pack_mm` currently does (except that weights are not cached by it). Weights will be considered constant & cached, so this implementation is suitable for inference, and not QAT. `scale` is supported as a `mul` epilogue.
Only BF16 activations have been supported in this PR because for FP16 & FP32, weight is dequantized during constant-folding pass of freezing, and then after auto-tuning, performance with a large `M` dimension may be better than either torch.ops.aten._weight_int8pack_mm, or the WoQ micro-kernel support introduced in this PR, which dequantizes `w` within the micro-kernel.
While even BF16 activations with a large `M` dimension may benefit from dequantizing `w` beforehand, for now, they would use WoQ support in GEMM templates for auto-tuning, and then a subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.
### Performance
#### AMX
Op-level speedup due to AMX micro-kernel (selected during auto-tuning) on 32 physical cores of Intel(R) Xeon(R) Platinum 8468H (of Xeon 4th generation series, codenamed Sapphire Rapids) vs. ATen kernel `torch.ops.aten._weight_int8pack_mm`. Intel OpenMP & tcmalloc were preloaded.
In a few cases with an odd `K`, the implementation being added in this PR may not perform as well as the ATen kernel, which is unrelated to this PR, though, since `test_linear_amx` also exhibits similar datapoints. In those cases, the AMX micro-kernel might be slower than AVX512 micro-kernel, so if such sets of shapes are used for auto-tuning, either the AVX512 micro-kernel implementation, or the ATen kernel would be chosen instead.
Benchmarked with unit-tests.
Tabular data at https://gist.github.com/sanchitintel/294811a86c8ff6b867c668ae2107c405?permalink_comment_id=5142442#gistcomment-5142442
The AVX512 micro-kernel was disabled to collect data for AMX micro-kernel.
#### AVX2/AVX512 micro-kernels
Tabular data at at https://gist.github.com/sanchitintel/52b5fa9c66f791be19e48e2aa6423dc4?permalink_comment_id=5142437#gistcomment-5142437
### Follow-up
1. int4 WoQ GEMM micro-kernel will also be added in a separate PR.
2. A subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.
E2E perf measurement should be done with #131310.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131887
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
**Summary**
We change the schema of QLinear Binary, so it will be easier to enable the corresponding gemm template.
- Extra input of binary post-op is a tensor which needs to be an input node of autotuning, we need to move it at front of `output_scale` which is a scalar.
- We also move it at front of `bias`, since `bias` is optional tensor for this fusion, but `other` is a must to have for linear binary fusion.
**Test Plan**
```
python -u -m pytest -s -v test/quantization/core/test_quantized_op.py -k qlinear
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k qlinear
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129049
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825, #129048
As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows:
1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling.
2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops.
3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular, for the reuses for output buffers of GEMM, template and epilogues. This is not correct since the output buffer is an "output" not an "in-place" buffer of the template kernel itself. Now, we use a dedicated "aliases" dict to manage such buffer reuses and the intermediate aliasing buffers are removed after codegen.
4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545
Approved by: https://github.com/jansel
As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Similar template abstractions as the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`. The MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`) requiring `N` to be a multiple of register blocking while allows the static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm` which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static shape and dynamic shapes. Since it is an initial implementation, we are still working on further performance improves with follow-up PRs including the optimizations in kernels as well as fusions. The perf gains are only observed from a selective number of models compared to the ATen kernels which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details.
Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |
Key models being sped up:
drq: 1.14x
soft_act: 1.12
cait_m36_384: 1.18x
Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |
Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x
Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel