**Summary**
In this PR, we enable the epilogues fusion and code generation for Grouped GEMM. Here are the high-level description of how we implement it.
**Fusion**
- The Grouped GEMM Template produces a `Template Buffer` with a `MultiOutputLayout` and a set of `MultiOutput Buffers`, where each buffer corresponds to a specific GEMM.
- During the initial round of fusion, the `Template Buffer` and all associated `MultiOutput Buffers` are fused into a `FusedSchedulerNode` by extending the existing fusion design.
- In subsequent fusion rounds, this `FusedSchedulerNode` can further fuse with its epilogues, following the original fusion design principles.
**Code Gen**
We maintain a list of epilogues and codegen it one by one.
- If any of the GEMM has bias, we create a extra `bias_add` epilogue and prepend it at first of the epilogue list.
- If any of the GEMM has no epilogue, we create a `to_bf16` copy epilogue and append it at last of the epilogue list.
**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_epilogue
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143897
Approved by: https://github.com/jansel, https://github.com/jgong5
ghstack dependencies: #143796
**Summary**
Enable the CPP Grouped GEMM Fusion, lowering and Grouped GEMM Template following the RFC: https://github.com/pytorch/pytorch/issues/144012
- Support flexible number of GEMMs
- Share activation across GEMMs
- The Grouped GEMM Template supports independent activations
- However, the pattern matcher requires an anchor node, which is as the shared activation across GEMMs
- Each GEMM can have a unique weight but same sizes
- Each GEMM can have a unique bias or None
- Current PR does not yet support biases; this will be addressed in a follow-up epilogue fusion PR
- Each GEMM have its own epilogues
- Epilogue fusion is not yet supported in this PR and will be enabled in an upcoming follow-up epilogue fusion PR
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_invalid
python -u -m pytest -s -v test/inductor/test_cpu_cpp_wrapper.py -k test_grouped_linear
```
**Example**
Here is the example and generated code
```
batch_size = 4
in_features = 512
out_features = 1024
dtype = torch.bfloat16
class M(torch.nn.Module):
def __init__(self, bias):
super().__init__()
self.linear0 = torch.nn.Linear(in_features, out_features, bias=False)
self.linear1 = torch.nn.Linear(in_features, out_features, bias=False)
def forward(self, x):
return self.linear0(x), self.linear1(x)
if __name__ == "__main__":
with torch.no_grad():
input = torch.randn(batch_size, in_features, dtype=dtype)
m = M(bias=bias).to(dtype=dtype).eval()
cm = torch.compile(m)
act_res = cm(input)
```
Generated Code: https://gist.github.com/leslie-fang-intel/ed2e8d23aeb3586eb504feeace692e16#file-grouped-gemm-generated-code-py
**Next Step**
- Support Epilogue fusion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143796
Approved by: https://github.com/jgong5, https://github.com/jansel
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/144186. For the test case reported in the issue, we have saw some nodes with `LoopNest`
- `LoopNest(loops=[LoopLevel(var=x0, size=8, offset=0, tiled_size=0, steps=1, parallel=0, simd_omp=False, simd_vec=False, collapsed=False, is_reduction=False), LoopLevel(var=x1, size=8, offset=0, tiled_size=0, steps=1, parallel=0, simd_omp=False, simd_vec=False, collapsed=False, is_reduction=True)], kernel=<torch._inductor.codegen.cpp.CppKernelProxy object at 0x7fc724426680>)`
- `LoopNest(loops=[LoopLevel(var=x0, size=8, offset=0, tiled_size=0, steps=16, parallel=0, simd_omp=False, simd_vec=True, collapsed=False, is_reduction=False), LoopLevel(var=x1, size=8, offset=0, tiled_size=0, steps=16, parallel=0, simd_omp=False, simd_vec=True, collapsed=False, is_reduction=True)], kernel=<torch._inductor.codegen.cpp.CppKernelProxy object at 0x7fc75c2cae60>)`
Although, these 2 `LoopNest` have same `range` and `var`, but different `steps` 1 and 16. So, they will fail to be merged with outer loops. And since when we localize the buffer, we have removed the global buffers. We need to restore the status of `V.graph.removed_buffers` before fallback to codegen without outer loop fusion.
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_outer_loop_fusion_buffer_remove
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144243
Approved by: https://github.com/jgong5
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/143729. `frexp` has 1 input but 2 output tensor with different data type, current `deduce_dtype_for_cpp_cse_variable` can't deduce the data type for each output correctly due to missing of output index. In this PR, we set the data type of cse var in the codegen of `frexp` and avoid it being overridden in the following flow.
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_frexp
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143746
Approved by: https://github.com/jgong5
**Summary**
Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we use `asinh(x) = log(x + sqrt(1 + x**2))` to calculate the result of `asinh`, the issue happens when input with `-10000.1`, which makes `x + sqrt(1 + x**2)` close to 0 and log(0) is invalid. We use the `sleef` implementation in this PR to fix this issue.
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360
Approved by: https://github.com/jgong5
This PR extends our ability to fuse pointwise nodes onto triton templates with the ability to fuse pointwise nodes into triton templates - prologue fusion.
Similar to the store_output api:
`{{store_output(("idx_m", "idx_n"), "acc", "mask")}}`
And the modification api:
```
{{ modification(
subgraph_number=0,
output_name="post_mod_scores",
score="qk",
out="qk"
) | indent_except_first(1) }}
```
We have:
```{{load_input("B", "b", ("idx_m", "idx_n"), mask=None if EVEN_K else "b_mask", indent_width=8)}}```
Because we are now loading the input with explicit indices and mask, I needed to rewrite the mm kernel to no longer update the [pointers by BLOCK_K](bb03ef7aca/torch/_inductor/kernel/mm.py (L110-L111)) on every iteration and instead on each iteration compute indices from the the k_idx of each loop. This did not have any perf difference.
There are a couple main use cases for prologue fusion:
- Fusing dequants into a matmul. particularly for more bandwidth bound scenarios.
- Fusing gather into a matmul. This is useful particularly in MOE. See https://github.com/pytorch/pytorch/issues/134535 for more details.
Prologue fusion is generally much less profitable than epilogue fusion, because it must be applied to an element of an input on each loop of the matmul, compared to only once in the epilogue (gather into matmul is a potential exception). Accordingly, we are much less aggressive in attempting to fuse prologue fusion. We only attempt fusion if it does not increase the number of memory bytes read instead the triton template, multipled by a small factor to allow gathers. This restricts reliably unprofitable fusions like fp32->fp16 inside kernel. In future pr we could potentially have api of being more aggressive if we know we are in a bandwidth bound regime. See: https://github.com/pytorch/pytorch/pull/134532/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R3060-R3066
Other notes:
By default we will upcast to fp32 inside every kernel. This matches eager numerics. This is fine enough for epilogue because it is only done once (although it is probably unnecessary for say a relu) but tanks perf for prologue. I am currently using the `codegen_upcast_to_fp32` option to avoid it, but that will not work for libdevice calls that require fp32. We will need https://github.com/pytorch/pytorch/pull/136778/ and dtype-aware codegen to upcast fp16 ops into libdevice calls.
With prologue fusion, we now have essentially separate kernels for each input, and for the output. I had to increase the number of fields that are swapped out in `set_subgraph_body` by a large number :/ I also update the fusion logic because the inputs will have a different group than the outputs. Maybe as part of enabling multiple outputs, this could get cleaned up a bit so..
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134532
Approved by: https://github.com/jansel
**Summary**
Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we use `asinh(x) = log(x + sqrt(1 + x**2))` to calculate the result of `asinh`, the issue happens when input with `-10000.1`, which makes `x + sqrt(1 + x**2)` close to 0 and log(0) is invalid. We use the `sleef` implementation in this PR to fix this issue.
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360
Approved by: https://github.com/jgong5
This PR extends our ability to fuse pointwise nodes onto triton templates with the ability to fuse pointwise nodes into triton templates - prologue fusion.
Similar to the store_output api:
`{{store_output(("idx_m", "idx_n"), "acc", "mask")}}`
And the modification api:
```
{{ modification(
subgraph_number=0,
output_name="post_mod_scores",
score="qk",
out="qk"
) | indent_except_first(1) }}
```
We have:
```{{load_input("B", "b", ("idx_m", "idx_n"), mask=None if EVEN_K else "b_mask", indent_width=8)}}```
Because we are now loading the input with explicit indices and mask, I needed to rewrite the mm kernel to no longer update the [pointers by BLOCK_K](bb03ef7aca/torch/_inductor/kernel/mm.py (L110-L111)) on every iteration and instead on each iteration compute indices from the the k_idx of each loop. This did not have any perf difference.
There are a couple main use cases for prologue fusion:
- Fusing dequants into a matmul. particularly for more bandwidth bound scenarios.
- Fusing gather into a matmul. This is useful particularly in MOE. See https://github.com/pytorch/pytorch/issues/134535 for more details.
Prologue fusion is generally much less profitable than epilogue fusion, because it must be applied to an element of an input on each loop of the matmul, compared to only once in the epilogue (gather into matmul is a potential exception). Accordingly, we are much less aggressive in attempting to fuse prologue fusion. We only attempt fusion if it does not increase the number of memory bytes read instead the triton template, multipled by a small factor to allow gathers. This restricts reliably unprofitable fusions like fp32->fp16 inside kernel. In future pr we could potentially have api of being more aggressive if we know we are in a bandwidth bound regime. See: https://github.com/pytorch/pytorch/pull/134532/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R3060-R3066
Other notes:
By default we will upcast to fp32 inside every kernel. This matches eager numerics. This is fine enough for epilogue because it is only done once (although it is probably unnecessary for say a relu) but tanks perf for prologue. I am currently using the `codegen_upcast_to_fp32` option to avoid it, but that will not work for libdevice calls that require fp32. We will need https://github.com/pytorch/pytorch/pull/136778/ and dtype-aware codegen to upcast fp16 ops into libdevice calls.
With prologue fusion, we now have essentially separate kernels for each input, and for the output. I had to increase the number of fields that are swapped out in `set_subgraph_body` by a large number :/ I also update the fusion logic because the inputs will have a different group than the outputs. Maybe as part of enabling multiple outputs, this could get cleaned up a bit so..
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134532
Approved by: https://github.com/jansel
Currently constants are not broadcasted on vectorised stores in `CppTile2DKernel`. This leads to errors like the following:
```shell
error:: request for member 'store' in 'tmp1', which is of non-class type 'signed char'
61 | tmp1.store(tmp2 + static_cast<int64_t>(8L*x0_inner), static_cast<int64_t>(8));
| ^~~~~
```
This PR adds the required broadcasting.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140262
Approved by: https://github.com/jgong5
### Summary
1. Improve the heuristic for loop split optimization: The divisor needs to be an integer and cannot be too small (needs to be greater than 8, this threshold has been tuned).
2. Improve the heuristic for disabling vectorization: add quantity_threshold and relax ratio_threshold for the number of non-contiguous load/store/index_expr in the loop body.
This PR will bring performance improvements for two torchbench models(functorch_dp_cifar10, opacus_cifar10) and one timm model(sebotnet33ts_256).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137550
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
* Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables.
* list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
Fixes#138550.
### Description
In the fusion of two nodes, one node with less variables (`node_to_recomp`) would make its variable ranges aligned with the other node (`ref_node`). In detail, `node_to_recomp` would change its variable ranges to the original ranges of `ref_node`. However, if both of the nodes have changed its ranges, i.e., the simplified variable ranges are different from its original ones, the issue comes up.
### Solution
For the case where the `ref_node` also changes its variable ranges, we recompute the size and body for it, to ensure the nodes are simplified to the same size.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138568
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
## Description
Fixes the accuracy failure of FP32 `jx_nest_base` of max-autotune.
The current epilogue fusion implementation in GEMM template assumes that the read of template buffer and the write of epilogue output in the epilogue node have the same index (the layout could be different but the index should be the same).
If the condition is not satisfied, the computation is wrong, leading to correctness issue for FP32 `jx_nest_base`.
This PR disabled the epilogue fusion with GEMM template when the above condition is not satisfied.
### Unsupported epilogue:
`buf1` is the template buffer and `buf2` is the epilogue output buffer.
The store of `buf2`:
401408 * d0 + 100352 * d1 + **7168 * d2** + **1792 * d3** + 128 * d4 + d5
The load of `buf1` in the epilogue node:
401408 * d0 + 100352 * d1 + **1792 * d2** + **25088 * d3** + 128 * d4 + d5
The above two indexes are different.
```
CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[25088, 128], stride=[128, 1]))
ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[8, 4, 14, 4, 14, 128], stride=[401408, 100352, 7168, 1792, 128, 1]), data=Pointwise(
'cpu',
torch.float32,
def inner_fn(index):
i0, i1, i2, i3, i4, i5 = index
tmp0 = ops.load(arg5_1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
tmp1 = ops.load(buf0, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
tmp2 = tmp0 + tmp1
tmp3 = ops.load(buf1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
tmp4 = tmp2 + tmp3
return tmp4
,
ranges=[8, 4, 14, 4, 14, 128],
origin_node=clone,
origins=OrderedSet([clone])
))
```
### Supported epilogue:
`buf1` is the template buffer and `buf2` is the epilogue output buffer.
The store of `buf2`:
d0 + 576 * d1 + 32 * d2
The load of `buf1` in the epilogue node:
d0 + 576 * d1 + 32 * d2
The above two indexes are the same.
The layout of `buf2` and `buf1` are different though which is handled by the reindexer:
`buf1`: `size=[324, 32], stride=[32, 1]`
`buf2`: `size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]`
```
CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.bfloat16, size=[324, 32], stride=[32, 1]))
ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.bfloat16, size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]), data=Pointwise(
'cpu',
torch.bfloat16,
def inner_fn(index):
_, i1, i2, i3 = index
tmp0 = ops.load(buf1, i1 + 32 * i3 + 576 * i2)
tmp1 = ops.to_dtype(tmp0, torch.float32, src_dtype=torch.bfloat16)
tmp2 = ops.load(_frozen_param4, i1)
tmp3 = tmp1 * tmp2
tmp4 = ops.load(arg7_1, i1 + 32 * i3 + 576 * i2)
tmp5 = tmp3 + tmp4
tmp6 = ops.to_dtype(tmp5, torch.bfloat16, src_dtype=torch.float32)
return tmp6
,
ranges=[1, 32, 18, 18],
origin_node=convert_element_type_4,
origins=OrderedSet([add, mul, convert_element_type_4])
))
```
## TODO
Add the support for fusions when the indexes are different in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135661
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5