**Summary**
Fixes [#151290](https://github.com/pytorch/pytorch/issues/151290) and [#151523](https://github.com/pytorch/pytorch/issues/151523), which are regressions introduced by [#144020](https://github.com/pytorch/pytorch/pull/144020). That PR enabled parallelization at the inner loop level.
However, a currently unsupported case arises when parallel reduction occurs under the vectorization loop level, specifically in patterns like:
```
for vec_loop_level:
    do_parallel_reduction
```
In such cases, a temporary buffer `tmp_acc_array` is allocated for tail scalar kernels, and another temporary buffer with the same name, `tmp_acc_array`, is defined for parallel reduction. The two definitions conflict because the temporary buffers overlap. This PR disables the problematic case to avoid the conflict until proper support is implemented.
**Test Plan**
```
python test/inductor/test_flex_attention.py -k test_make_block_mask_cpu
python test/inductor/test_cpu_repro.py -k test_parallel_reduction_vectorization
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151887
Approved by: https://github.com/jansel
Fix https://github.com/pytorch/pytorch/issues/148639.
Summary:
Optimize the heuristics of parallel reduction: when the number of steps of the first inner loop beyond the maximum parallel depth is much larger than the number of steps of all outer loops within the maximum parallel depth, change the starting depth of parallelism to the first inner loop and recalculate the maximum parallel depth. I ran the Inductor benchmark with this PR on CPU. The timm model poolformer_m36 (BF16) improves by about 25%, and no performance regressions were observed.
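The heuristic described above can be sketched roughly as follows. This is an illustration only: `choose_parallel_start`, its arguments, and the `ratio` threshold are hypothetical and are not the names or values used in Inductor.

```python
from math import prod

def choose_parallel_start(steps, max_parallel_depth, ratio=8):
    """Return the loop depth at which parallelism should start.

    steps[i] is the number of iterations of loop level i (outermost first).
    `ratio` is an assumed threshold for "much larger".
    """
    # No inner loop beyond the parallel depth: keep parallelizing from the top.
    if max_parallel_depth >= len(steps):
        return 0
    outer_steps = prod(steps[:max_parallel_depth])
    first_inner_steps = steps[max_parallel_depth]
    # Move parallelism to the first inner loop when it dominates the outer loops.
    if first_inner_steps > ratio * outer_steps:
        return max_parallel_depth
    return 0

print(choose_parallel_start([2, 4096], max_parallel_depth=1))  # 1
print(choose_parallel_start([64, 128], max_parallel_depth=1))  # 0
```

In the first call the outer loop has only 2 steps, so parallelizing it would leave most threads idle; starting at the 4096-step inner loop exposes far more parallel work.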
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149614
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
Instead, the grid computation is now included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```
This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.
It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.
This unification allows this PR to be a net deletion of code.
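As a small illustration of the specialized grid expression in the launcher, `(xnumel + 1023) >> 10` is a ceiling division by 1024. The sketch below assumes the config's block size is 1024 (inferred from the constants, not stated in the PR); `grid_0` here is a hypothetical helper, not part of the generated code.

```python
def grid_0(xnumel, xblock=1024):
    # Ceiling division: number of blocks of size xblock needed to cover xnumel.
    return -(-xnumel // xblock)

# The shift-based form in the launcher matches ceiling division by 1024.
for xnumel in (1, 1023, 1024, 1025, 5000):
    assert ((xnumel + 1023) >> 10) == grid_0(xnumel)

print(grid_0(5000))  # 5
```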
Differential [disconnected] Revision: D70471332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
**Summary**
Fix https://github.com/pytorch/pytorch/issues/148241. The previous vectorized code generation for `tanh` used a decomposed implementation, leading to numerical differences that were further amplified by `atan2`. For example, in the given test case after `tanh`, the eager output at `[0,0,11,47]` was `-5.820766091346741e-10`, while the compiled output was `1.4319084584712982e-08`, resulting in different `atan2` outputs of `-2.3561` and `0.7853`. This issue is fixed by switching to the Sleef implementation.
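To see why a tiny difference near zero is amplified, the following sketch feeds the two reported `tanh` values into `atan2`. It assumes both `atan2` arguments are the `tanh` output, which is consistent with the reported results but is not stated in the PR.

```python
import math

eager = -5.820766091346741e-10
compiled = 1.4319084584712982e-08

# Assumption: both arguments of atan2 take the tanh output.
# A sign flip moves the result from the third quadrant to the first.
print(round(math.atan2(eager, eager), 4))        # -2.3562
print(round(math.atan2(compiled, compiled), 4))  # 0.7854
```

The values quoted above (`-2.3561`, `0.7853`) are the same angles, -3π/4 and π/4, truncated rather than rounded.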
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_tanh_atan2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148254
Approved by: https://github.com/malfet, https://github.com/jgong5
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
Instead, the grid computation is now included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```
This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.
It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.
This unification allows this PR to be a net deletion of code.
Note the attached diff contains some minor fbcode-only changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147583
Approved by: https://github.com/eellison, https://github.com/shunting314
#146843 broke int8 WoQ GEMM's (for BF16 activation) AMX ISA implementation in the main branch.
UT: `python test/inductor/test_cpu_select_algorithm.py -v -k woq`
The issue remained undetected because in case of templated kernel compilation failure, the auto-tuning infra marks its runtime as `inf`, and the op against which it was being benchmarked is used, so UTs didn't fail even on machines that support AMX ISA.
`test/inductor/test_cpu_select_algorithm.py` UTs checked the value of the `select_algorithm_autotune` counter, which only counts how many ops were selected for autotuning against their templated codegened counterparts.
@leslie-fang-intel advised using a new counter. I added `counters["inductor"]["cpp_templated_kernel_counter"]`, which is incremented after a codegened kernel's compilation, so it'd help catch breakage scenarios in which a templated kernel could not be codegened due to a compilation failure.
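The masking effect can be reproduced with a toy model. This is not Inductor's autotuning code: `autotune` and the choice names are hypothetical, and only the two counter names come from the description above.

```python
import math
from collections import Counter

counters = Counter()

def autotune(choices):
    """Pick the fastest choice; a failed compilation is recorded as inf."""
    counters["select_algorithm_autotune"] += 1
    timings = {}
    for name, (compiles, runtime) in choices.items():
        if compiles:
            if name.startswith("cpp_template"):
                # Only incremented after the templated kernel compiled.
                counters["cpp_templated_kernel_counter"] += 1
            timings[name] = runtime
        else:
            timings[name] = math.inf
    return min(timings, key=timings.get)

# The template fails to compile, yet autotuning still "succeeds".
winner = autotune({"aten_fallback": (True, 1.0), "cpp_template": (False, 0.2)})
print(winner, dict(counters))
```

A test that checks only `select_algorithm_autotune` passes here, while a check on `cpp_templated_kernel_counter` would catch the broken template.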
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147895
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
This enforces the invariant that every backend implements the same set of ops and removes a layer of indirection for BasicMathOps.
Interestingly this is a small compile time win:
```
...
WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 30151159301 is -6.13% lower than expected 32120000000 ±1.50% please update the expected results.
please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 44447549162 -1.69% is within expected 45210000000 ±2.50%
WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26743557195 is -2.25% lower than expected 27360000000 ±1.50% please update the expected results.
please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('basic_modules_ListOfLinears_eager', 'compile_time_instruction_count') pass, actual result 945129734 +0.93% is within expected 936400000 ±1.50%
WIN: benchmark ('basic_modules_ListOfLinears_inductor', 'compile_time_instruction_count') failed, actual result 18984384503 is -3.19% lower than expected 19610000000 ±1.50% please update the expected results.
please update all results that changed significantly, and not only the failed ones
WIN: benchmark ('basic_modules_ListOfLinears_inductor_gpu_force_shape_pad', 'compile_time_instruction_count') failed, actual result 17258025389 is -1.94% lower than expected 17600000000 ±1.50% please update the expected results.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146235
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226
**Summary**
In this PR, we enable epilogue fusion and code generation for Grouped GEMM. Here is a high-level description of how we implement it.
**Fusion**
- The Grouped GEMM Template produces a `Template Buffer` with a `MultiOutputLayout` and a set of `MultiOutput Buffers`, where each buffer corresponds to a specific GEMM.
- During the initial round of fusion, the `Template Buffer` and all associated `MultiOutput Buffers` are fused into a `FusedSchedulerNode` by extending the existing fusion design.
- In subsequent fusion rounds, this `FusedSchedulerNode` can further fuse with its epilogues, following the original fusion design principles.
**Code Gen**
We maintain a list of epilogues and generate code for them one by one.
- If any of the GEMMs has a bias, we create an extra `bias_add` epilogue and prepend it to the epilogue list.
- If any of the GEMMs has no epilogue, we create a `to_bf16` copy epilogue and append it to the epilogue list.
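The two rules above can be sketched as a small list-building function. `build_epilogues` and the dictionary layout are hypothetical and only illustrate the ordering; the real implementation works on Inductor IR nodes.

```python
def build_epilogues(gemms):
    """gemms: list of dicts with 'bias' (bool) and 'epilogues' (list of str)."""
    epilogues = []
    for gemm in gemms:
        epilogues.extend(gemm["epilogues"])
    # Any GEMM with a bias: the bias add must run first.
    if any(gemm["bias"] for gemm in gemms):
        epilogues.insert(0, "bias_add")
    # Any GEMM without an epilogue: its output still needs a final copy.
    if any(not gemm["epilogues"] for gemm in gemms):
        epilogues.append("to_bf16")
    return epilogues

gemms = [
    {"bias": True, "epilogues": ["relu"]},
    {"bias": False, "epilogues": []},
]
print(build_epilogues(gemms))  # ['bias_add', 'relu', 'to_bf16']
```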
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_epilogue
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143897
Approved by: https://github.com/jansel, https://github.com/jgong5
ghstack dependencies: #143796
**Summary**
Enable the CPP Grouped GEMM Fusion, lowering and Grouped GEMM Template following the RFC: https://github.com/pytorch/pytorch/issues/144012
- Support a flexible number of GEMMs
- Share activation across GEMMs
  - The Grouped GEMM Template supports independent activations
  - However, the pattern matcher requires an anchor node, which is the activation shared across GEMMs
- Each GEMM can have a unique weight, but all weights must have the same size
- Each GEMM can have a unique bias or None
  - The current PR does not yet support biases; this will be addressed in a follow-up epilogue fusion PR
- Each GEMM has its own epilogues
  - Epilogue fusion is not yet supported in this PR and will be enabled in an upcoming follow-up epilogue fusion PR
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_invalid
python -u -m pytest -s -v test/inductor/test_cpu_cpp_wrapper.py -k test_grouped_linear
```
**Example**
Here is an example module; the generated code is linked after it.
```
import torch

batch_size = 4
in_features = 512
out_features = 1024
dtype = torch.bfloat16
bias = False  # biases are not yet supported by this PR

class M(torch.nn.Module):
    def __init__(self, bias):
        super().__init__()
        self.linear0 = torch.nn.Linear(in_features, out_features, bias=bias)
        self.linear1 = torch.nn.Linear(in_features, out_features, bias=bias)

    def forward(self, x):
        # Both GEMMs share the same activation x.
        return self.linear0(x), self.linear1(x)

if __name__ == "__main__":
    with torch.no_grad():
        input = torch.randn(batch_size, in_features, dtype=dtype)
        m = M(bias=bias).to(dtype=dtype).eval()
        cm = torch.compile(m)
        act_res = cm(input)
```
Generated Code: https://gist.github.com/leslie-fang-intel/ed2e8d23aeb3586eb504feeace692e16#file-grouped-gemm-generated-code-py
**Next Step**
- Support Epilogue fusion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143796
Approved by: https://github.com/jgong5, https://github.com/jansel
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/144186. For the test case reported in the issue, we saw nodes with the following `LoopNest`s:
- `LoopNest(loops=[LoopLevel(var=x0, size=8, offset=0, tiled_size=0, steps=1, parallel=0, simd_omp=False, simd_vec=False, collapsed=False, is_reduction=False), LoopLevel(var=x1, size=8, offset=0, tiled_size=0, steps=1, parallel=0, simd_omp=False, simd_vec=False, collapsed=False, is_reduction=True)], kernel=<torch._inductor.codegen.cpp.CppKernelProxy object at 0x7fc724426680>)`
- `LoopNest(loops=[LoopLevel(var=x0, size=8, offset=0, tiled_size=0, steps=16, parallel=0, simd_omp=False, simd_vec=True, collapsed=False, is_reduction=False), LoopLevel(var=x1, size=8, offset=0, tiled_size=0, steps=16, parallel=0, simd_omp=False, simd_vec=True, collapsed=False, is_reduction=True)], kernel=<torch._inductor.codegen.cpp.CppKernelProxy object at 0x7fc75c2cae60>)`
Although these two `LoopNest`s have the same `range` and `var`, their `steps` differ (1 and 16), so they fail to be merged with outer loops. Because the global buffers have already been removed when localizing the buffer, we need to restore the state of `V.graph.removed_buffers` before falling back to codegen without outer loop fusion.
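The snapshot-and-restore idea can be sketched as below. This is a simplified model under stated assumptions: `Graph`, `codegen_with_fallback`, and the use of an exception to signal a failed merge are all hypothetical and do not mirror the actual control flow in Inductor.

```python
import copy

class Graph:
    """Hypothetical stand-in for the object that owns removed_buffers."""
    def __init__(self):
        self.removed_buffers = set()

def codegen_with_fallback(graph, fused_codegen, plain_codegen):
    # Snapshot the state that the fused path may mutate.
    snapshot = copy.copy(graph.removed_buffers)
    try:
        return fused_codegen(graph)
    except RuntimeError:
        # The fused path removed global buffers before failing; undo that.
        graph.removed_buffers = snapshot
        return plain_codegen(graph)

def fused(graph):
    graph.removed_buffers.add("buf0")  # localized, so removed globally
    raise RuntimeError("loop nests with different steps cannot be merged")

def plain(graph):
    return sorted(graph.removed_buffers)

graph = Graph()
print(codegen_with_fallback(graph, fused, plain))  # []
```

Without the restore step, the fallback path would still see `buf0` as removed and would skip generating it.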
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_outer_loop_fusion_buffer_remove
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144243
Approved by: https://github.com/jgong5
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/143729. `frexp` has one input but two output tensors with different data types. The current `deduce_dtype_for_cpp_cse_variable` can't deduce the data type of each output correctly because the output index is missing. In this PR, we set the data type of the CSE variables in the codegen of `frexp` and prevent it from being overridden in the subsequent flow.
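For reference, the scalar `frexp` in Python's standard library shows the same shape of result: one input, two outputs of different types. This only illustrates the semantics, not the Inductor codegen.

```python
import math

# frexp splits x into a mantissa in [0.5, 1) and an integer exponent,
# such that x == mantissa * 2**exponent.
mantissa, exponent = math.frexp(8.0)
print(mantissa, exponent)                                # 0.5 4
print(type(mantissa).__name__, type(exponent).__name__)  # float int
```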
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_frexp
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143746
Approved by: https://github.com/jgong5
**Summary**
Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we used `asinh(x) = log(x + sqrt(1 + x**2))` to compute `asinh`. The issue occurs with inputs such as `-10000.1`, which make `x + sqrt(1 + x**2)` close to 0, and `log(0)` is invalid. This PR fixes the issue by using the `sleef` implementation.
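The cancellation can be reproduced without PyTorch. The sketch below assumes float32 arithmetic (the dtype is not stated in the PR) and emulates it by rounding every intermediate through `struct`; `f32` is a hypothetical helper.

```python
import math
import struct

def f32(value):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]

x = f32(-10000.1)
# Evaluate x + sqrt(1 + x**2), rounding every intermediate to float32.
inner = f32(x + f32(math.sqrt(f32(1.0 + f32(x * x)))))
print(inner)          # 0.0, so log(inner) is undefined
print(math.asinh(x))  # close to -9.9035
```

In float32, `1 + x**2` is indistinguishable from `x**2`, so the square root rounds to `|x|` and the sum cancels to exactly zero.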
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360
Approved by: https://github.com/jgong5
This PR extends our ability to fuse pointwise nodes onto the outputs of triton templates (epilogue fusion) with the ability to fuse pointwise nodes into their inputs: prologue fusion.
Similar to the store_output api:
`{{store_output(("idx_m", "idx_n"), "acc", "mask")}}`
And the modification api:
```
{{ modification(
subgraph_number=0,
output_name="post_mod_scores",
score="qk",
out="qk"
) | indent_except_first(1) }}
```
We have:
`{{load_input("B", "b", ("idx_m", "idx_n"), mask=None if EVEN_K else "b_mask", indent_width=8)}}`
Because we are now loading the input with explicit indices and mask, I needed to rewrite the mm kernel to no longer update the [pointers by BLOCK_K](bb03ef7aca/torch/_inductor/kernel/mm.py (L110-L111)) on every iteration and instead compute indices from the k_idx of each loop iteration. This did not have any perf difference.
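The two loop styles address the same elements, which the following plain-Python sketch checks. It models offsets along the K dimension only; `BLOCK_K`, `K`, and both function names are illustrative values and helpers, not the Triton kernel itself.

```python
BLOCK_K = 32
K = 100

def offsets_by_pointer_increment():
    # Old style: keep a running base and advance it by BLOCK_K each iteration.
    out, base = [], 0
    for _ in range(0, K, BLOCK_K):
        out.append([base + i for i in range(BLOCK_K)])
        base += BLOCK_K
    return out

def offsets_from_k_idx():
    # New style: derive the offsets of each iteration from its loop index.
    num_iters = -(-K // BLOCK_K)  # ceiling division
    return [[k_idx * BLOCK_K + i for i in range(BLOCK_K)]
            for k_idx in range(num_iters)]

assert offsets_by_pointer_increment() == offsets_from_k_idx()
```

Having explicit per-iteration indices is what lets a fused prologue be applied to exactly the elements being loaded.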
There are a couple main use cases for prologue fusion:
- Fusing dequants into a matmul, particularly for more bandwidth-bound scenarios.
- Fusing gather into a matmul. This is useful particularly in MOE. See https://github.com/pytorch/pytorch/issues/134535 for more details.
Prologue fusion is generally much less profitable than epilogue fusion, because it must be applied to an element of an input on each loop of the matmul, compared to only once in the epilogue (gather into matmul is a potential exception). Accordingly, we are much less aggressive in attempting prologue fusion. We only attempt fusion if it does not increase the number of memory bytes read inside the triton template, multiplied by a small factor to allow gathers. This rules out reliably unprofitable fusions like fp32->fp16 inside the kernel. In a future PR we could potentially add an API for being more aggressive if we know we are in a bandwidth-bound regime. See: https://github.com/pytorch/pytorch/pull/134532/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R3060-R3066
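The byte-count check can be sketched as a simple predicate. `should_fuse_prologue` and the `allowed_factor` value are hypothetical; the actual condition and factor are in the linked diff.

```python
def should_fuse_prologue(template_read_bytes, fused_read_bytes, allowed_factor=1.1):
    """Allow the fusion only if it does not inflate the bytes the template reads.

    allowed_factor is an assumed slack that lets gathers through.
    """
    return fused_read_bytes <= template_read_bytes * allowed_factor

numel = 1024 * 1024
# Unfused, the template reads an fp16 input (2 bytes per element).
# With a fused fp32->fp16 cast it reads the fp32 source (4 bytes per element).
print(should_fuse_prologue(numel * 2, numel * 4))  # False
# With a fused int8 dequant it reads int8 (1 byte per element) instead.
print(should_fuse_prologue(numel * 2, numel * 1))  # True
```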
Other notes:
By default we will upcast to fp32 inside every kernel. This matches eager numerics. This is fine enough for epilogue because it is only done once (although it is probably unnecessary for say a relu) but tanks perf for prologue. I am currently using the `codegen_upcast_to_fp32` option to avoid it, but that will not work for libdevice calls that require fp32. We will need https://github.com/pytorch/pytorch/pull/136778/ and dtype-aware codegen to upcast fp16 ops into libdevice calls.
With prologue fusion, we now have essentially separate kernels for each input, and for the output. I had to increase the number of fields that are swapped out in `set_subgraph_body` by a large number :/ I also updated the fusion logic because the inputs will have a different group than the outputs. Maybe as part of enabling multiple outputs, this could get cleaned up a bit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134532
Approved by: https://github.com/jansel