## Context
Previously, the stride preservation of user-visible nodes worked as follows:
- After joint-graph tracing, we recorded the **names** of user-visible nodes and passed them to GraphLowering.
- In GraphLowering, we determined whether we needed to preserve the striding for a certain node by checking if the node's name was in `user_visible_outputs`.
- We obtained the original strides by checking `node.meta["val"].stride()`.
However, there's a problem with this approach: the nodes in `output_node.args[0]` and their strides could change between the completion of joint-graph tracing and the consumption of `user_visible_outputs` (e.g., during post-grad passes), making it unreliable.
## This PR
- After joint graph tracing:
- Record the original strides for all nodes in `output_node.args[0]` as `output_node.meta["original_output_strides"]` (recording for all nodes in case we need the info for other purposes such as debugging).
- Record the indices of user-visible outputs as `output_node.meta["user_visible_output_idxs"]`.
- Remove the original plumbing of `user_visible_outputs`.
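A minimal sketch of the recording step described above, assuming `gm` is the joint `torch.fx.GraphModule` and the output metas carry fake tensors (the helper name and call site are illustrative, not the PR's code):
```python
import torch

def record_output_metadata(gm, user_visible_idxs):
    # Find the FX output node and snapshot the strides of everything it returns.
    output_node = next(n for n in gm.graph.nodes if n.op == "output")
    strides = []
    for out in output_node.args[0]:
        val = out.meta.get("val") if isinstance(out, torch.fx.Node) else None
        strides.append(val.stride() if isinstance(val, torch.Tensor) else None)
    output_node.meta["original_output_strides"] = strides
    output_node.meta["user_visible_output_idxs"] = list(user_visible_idxs)
```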
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136732
Approved by: https://github.com/Chillee
Type annotations for compile_fx.
- Some of the stuff here is pretty complicated (functions which return functions that take functions), so I bailed on those and used `Any` just to get the rest landed.
- There are also changes to type signatures in other files which I did just to let mypy know more about the types in compile_fx.py.
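For context, a toy illustration of the "function returning a function that takes a function" shape that was left as `Any` (the names here are made up, not from compile_fx.py):
```python
from typing import Any, Callable

def make_backend(config: Any) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    # Returns a compiler: a function that takes a function and returns a function.
    def compile_fn(fn: Callable[..., Any]) -> Callable[..., Any]:
        return fn
    return compile_fn
```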
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138033
Approved by: https://github.com/Skylion007
Add an additional check that scalars wrapped to 0-D tensors by dynamo are actually 0-D. This fixes a bug where a 1-D tensor was mistakenly converted to a scalar value rather than passed as a pointer.
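A minimal sketch of the kind of check being added (hypothetical helper, not the actual dynamo code path):
```python
import torch

def unwrap_wrapped_scalar(t: torch.Tensor):
    # Dynamo wraps python scalars into 0-D tensors; anything with ndim > 0 must be
    # treated as a tensor (passed as a pointer), not collapsed back to a scalar.
    assert t.dim() == 0, f"expected a 0-D scalar tensor, got shape {tuple(t.shape)}"
    return t.item()
```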
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137303
Approved by: https://github.com/eellison
ghstack dependencies: #135701
Earlier, the subgraphs were inlined into the output code. This PR lifts each subgraph into its own function, and the output code simply calls that function.
This is the output code for the test `test_cond_reintepret_view_inputs_outputs`.
Before this PR - https://www.internalfb.com/intern/paste/P1632948905/
With this PR - https://www.internalfb.com/intern/paste/P1632946348/
A relevant snippet from the above paste is:
~~~
def false_graph_0(args):
    false_graph_0_arg0_1, false_graph_0_arg1_1, s0 = args
    args.clear()
    s0 = s0
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        false_graph_0_buf0 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32)
        false_graph_0_buf1 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32)
        # Unsorted Source Nodes: [cond, z1, z2], Original ATen: [aten.sub, aten.add]
        triton_poi_fused_add_sub_1_xnumel = (-20) + (20*s0)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_sub_1.run(false_graph_0_arg0_1, false_graph_0_arg1_1, false_graph_0_buf0, false_graph_0_buf1, triton_poi_fused_add_sub_1_xnumel, grid=grid(triton_poi_fused_add_sub_1_xnumel), stream=stream0)
        del false_graph_0_arg0_1
        del false_graph_0_arg1_1
    return (reinterpret_tensor(false_graph_0_buf0, ((-3) + s0, 20), (20, 1), 40), reinterpret_tensor(false_graph_0_buf1, ((-1) + s0, 16), (20, 1), 4), )


async_compile.wait(globals())
del async_compile


def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1 = args
    args.clear()
    s0 = arg0_1
    assert_size_stride(arg1_1, (s0, 20), (20, 1))
    assert_size_stride(arg2_1, (s0, 20), (20, 1))
    assert_size_stride(arg3_1, (), ())
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = [None] * 2
        buf0 = [None] * 2
        if arg3_1.item():
            # subgraph: true_graph_0
            true_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0)
            true_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0)
            (true_graph_0_buf0, true_graph_0_buf1) = true_graph_0([true_graph_0_arg0_1, true_graph_0_arg1_1, s0])
            buf0[0] = true_graph_0_buf0
            buf0[1] = true_graph_0_buf1
        else:
            # subgraph: false_graph_0
            false_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0)
            false_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0)
            (false_graph_0_buf0, false_graph_0_buf1) = false_graph_0([false_graph_0_arg0_1, false_graph_0_arg1_1, s0])
            buf0[0] = false_graph_0_buf0
            buf0[1] = false_graph_0_buf1
        del arg1_1
        del arg2_1
        del arg3_1
        buf1 = buf0[0]
        buf2 = buf0[1]
        del buf0
    return (buf1, buf2, )
~~~
The key change is to recursively call `codegen` for the subgraph and rely on `SubgraphPythonWrapper` to generate just the subgraph `fn`. The resulting subgraph_code is then inserted into the parent wrapper.
Note that this PR only works for the python wrapper. For the cpp wrapper, we need a lot of refactoring to ensure that we don't duplicate the global variables in the output_code. So, for now, I fall back to the old way of inlining for the cpp wrapper. I am hoping someone with more familiarity with the cpp wrapper can support subgraph lifting (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov).
This work will unblock hierarchical compilation (or cold start compile time work).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137200
Approved by: https://github.com/desertfire, https://github.com/eellison
**Summary**
Previously, we assumed the packed weight for (`MKL/MKLDNN`) linear operations was at `new_input_nodes[1]`. However, this is not the case for `MKL linear`, where `new_input_nodes[1]` contains the original weight instead of the packed weight. To generalize the code, in this PR, we identify nodes that are present in `input_nodes` but not in `new_input_nodes`—indicating they are no longer used by the GEMM template and can be considered candidates for deletion.
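A minimal sketch of the selection described above, assuming `input_nodes` and `new_input_nodes` are lists of `torch.fx.Node`:
```python
# Nodes consumed by the original pattern but not by the GEMM template are
# candidates for deletion.
candidates_for_deletion = [n for n in input_nodes if n not in new_input_nodes]
```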
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135101
Approved by: https://github.com/jgong5, https://github.com/jansel
This is a utility to aid torch.compile debugging. You provide a function that returns True on success and False on failure, or you do the check out of process and run `bisect_helper good|bad`.
The bisector will first go through backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it.
An example of how to hook it up for the `aot_eager_decomp_partition` backend and the `decomposition` subsystem:
```
from torch._inductor.bisect_helper import BisectionManager
if op in CURRENT_DECOMPOSITION_TABLE:
    if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)):
        return NotImplemented
```
Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND`, `TORCH_BISECT_SUBSYSTEM`, and `TORCH_BISECT_MAX`.
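For example, a hypothetical out-of-process run pinned to a previously found culprit might look like this (`repro.py` is a placeholder for your failing script):
```
TORCH_BISECT_BACKEND=aot_eager_decomp_partition TORCH_BISECT_SUBSYSTEM=decomposition TORCH_BISECT_MAX=12 python repro.py
```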
We could add further options as an automated way of going through a checklist for checking divergence, e.g., the mode to emulate amp casts.
Fix for https://github.com/pytorch/pytorch/issues/126546
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131936
Approved by: https://github.com/ezyang
Previously, Inductor was allowed to modify the stride/storage_offset
(layout) for inputs to user-defined triton kernels. This can cause
silent incorrectness because most triton kernels are written for a
specific striding pattern (usually contiguous).
This PR adds a config to allow the user to choose Inductor's behavior on
this. The options are:
- "flexible_layout" (default): Inductor can modify the layout for inputs
to user-defined triton kernels as much as it wants.
- "needs_fixed_stride_order": Inductor must preserve the stride order
(when compared to tracing) for inputs to user-defined triton kernels.
This matches our handling for custom operators. In the future, we'll
want a "needs_exact_strides" option (this is the safest option).
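A minimal sketch of selecting the stricter behavior, assuming the option is exposed through `torch._inductor.config` (the exact config attribute name here is an assumption):
```python
import torch._inductor.config as inductor_config

# Require Inductor to preserve the stride order seen at trace time for inputs
# to user-defined triton kernels (mirrors the handling of custom operators).
inductor_config.triton_kernel_default_layout_constraint = "needs_fixed_stride_order"
```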
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135530
Approved by: https://github.com/FindHao, https://github.com/oulgen
Based on https://github.com/pytorch/pytorch/pull/130956.
Inductor already supports padding through the `config.comprehensive_padding` option, but the padding format involves a few heuristics that are specific to Nvidia GPUs:
- When we pad, it is always aligned to the next multiple of 128 bytes.
- Strides smaller than 1024 are not padded.
- Only intermediate values are padded, not outputs.
The last of these is not really GPU-specific, but there are certain cases where we may want to override it. For example, padding outputs is useful on hardware accelerators with specific memory alignment requirements, or for applications where performance is more important than conformity with eager mode.
This PR surfaces padding parameters up to Inductor's config module, so the user can control them.
- `config.pad_outputs`: choose whether to pad outputs (default: `False`)
- `config.padding_alignment_bytes`: choose the alignment size for padding (default: `128`)
- `config.padding_stride_threshold`: choose the smallest stride that we will pad. For example, setting this to 0 will pad all unaligned strides. (default: `1024`)
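A minimal sketch of how these knobs might be combined, assuming they live alongside the existing `comprehensive_padding` option in `torch._inductor.config`:
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.comprehensive_padding = True   # existing padding switch
inductor_config.pad_outputs = True             # also pad graph outputs
inductor_config.padding_alignment_bytes = 64   # align to 64 bytes instead of 128
inductor_config.padding_stride_threshold = 0   # pad all unaligned strides

@torch.compile
def f(x):
    return x @ x.t() + 1
```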
**Test plan**
Added a new test in `test_padding.py` which tries various combinations of these options, checking that the output strides match our expectations.
These changes should not affect perf, because the defaults are identical to Inductor's current behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133939
Approved by: https://github.com/shunting314
Co-authored-by: Yueming Hao <yhao@meta.com>
Fixes #130394
TorchInductor doesn't respect the original strides of outputs by default; this opens up optimization opportunities such as changing the memory layout. But in some cases, such as the one in https://github.com/pytorch/pytorch/issues/130394, we do need the outputs to match the exact required strides. Correctness is the first-priority goal, so this PR adds a new API `ir.ExternKernel.require_exact_strides(x, exact_strides, allow_padding=False)` to fix the issue. This PR makes both dense and non-dense outputs follow the strides required by the semantics.
The comparison of the generated code before and after this fix for the test is shown below.
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 128
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 8
    x1 = (xindex // 8)
-   x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (16*x1)), xmask)
    tmp1 = tmp0 + tmp0
-   tl.store(out_ptr0 + (x2), tmp1, xmask)
+   tl.store(out_ptr0 + (x0 + (16*x1)), tmp1, xmask)

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (16, 8), (16, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
-       buf1 = empty_strided_cuda((16, 8), (8, 1), torch.float32)
+       buf1 = empty_strided_cuda((16, 8), (16, 1), torch.float32)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_copy_0.run(arg0_1, buf1, 128, grid=grid(128), stream=stream0)
        del arg0_1
    return (buf1, )
```
`buf1` is now created with the exact strides required by the user, and its values are written with the same strides as the input.
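As a hedged illustration of the user-visible effect (a hypothetical function, not the PR's test case): after this fix, a compiled function's output strides should match the eager output strides.
```python
import torch

def f(x):
    return (x + x).t()  # eager output is a non-contiguous transposed view

x = torch.randn(16, 8)
expected = f(x)
actual = torch.compile(f)(x)
assert actual.stride() == expected.stride(), (actual.stride(), expected.stride())
```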
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130956
Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/desertfire
Fixes #130394
TorchInductor doesn't respect the original strides of outputs by default; this opens up optimization opportunities such as changing the memory layout. But in some cases, such as the one in https://github.com/pytorch/pytorch/issues/130394, we do need the outputs to match the exact required strides. Correctness is the first-priority goal, so this PR adds a new API `ir.ExternKernel.require_exact_strides(x, exact_strides, allow_padding=False)` to fix the issue. This PR makes non-dense outputs follow the strides required by the semantics.
The comparison of the generated code before and after this fix for the test is shown below.
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 128
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 8
    x1 = (xindex // 8)
-   x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (16*x1)), xmask)
    tmp1 = tmp0 + tmp0
-   tl.store(out_ptr0 + (x2), tmp1, xmask)
+   tl.store(out_ptr0 + (x0 + (16*x1)), tmp1, xmask)

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (16, 8), (16, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
-       buf1 = empty_strided_cuda((16, 8), (8, 1), torch.float32)
+       buf1 = empty_strided_cuda((16, 8), (16, 1), torch.float32)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_copy_0.run(arg0_1, buf1, 128, grid=grid(128), stream=stream0)
        del arg0_1
    return (buf1, )
```
`buf1` is now created with the exact strides required by the user, and its values are written with the same strides as the input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130956
Approved by: https://github.com/eellison, https://github.com/blaine-rister
Fixes #128084
The approach is option 2 of what Elias suggested in the comment
thread:
- We require tensors to have the correct stride at usage. This may involve a clone; if there was a clone and then a mutation into it, we copy_ back the result of the mutation.
I went with this approach because it was the easiest, and Inductor already works really hard to remove additional clones/copy_.
There are some cases where this doesn't generate efficient code; for example, if the tensor is a view, we don't change the base of the view to have the right stride order, instead we do a clone. The view case isn't very common, so I'm ignoring it for now, but we could improve this in the future.
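A minimal sketch of the clone-and-copy-back pattern described above (not the actual Inductor implementation; `kernel` and the contiguity requirement are illustrative):
```python
import torch

def run_with_required_strides(kernel, arg: torch.Tensor) -> torch.Tensor:
    # Give the kernel an input with the required layout (here: contiguous),
    # cloning if necessary, then write any in-place mutation back to the original.
    needs_clone = not arg.is_contiguous()
    staging = arg.contiguous() if needs_clone else arg
    kernel(staging)            # kernel may mutate `staging` in place
    if needs_clone:
        arg.copy_(staging)     # propagate the mutation back into the original arg
    return arg
```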
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133639
Approved by: https://github.com/eellison
Part of #134054.
This corresponds to the pytorch mypy changes from D61493706. Updating takes so long and touches so many files that it's impossible to land as a whole without conflicting with some other intermediate change. So we're landing these 'type: ignore' comments for pytorch in advance of them actually being needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134202
Approved by: https://github.com/Skylion007
This is a low-risk short-term fix for
https://github.com/pytorch/pytorch/issues/128084, for the purposes of
2.4.1. The actual fix for that issue is more risky and we'll target 2.5.
needs_fixed_stride_order is silently incorrect with args that are
mutable because it creates clones of those args, writes into them, and
doesn't update the original args.
This PR makes it so that needs_fixed_stride_order doesn't apply to
inputs that are being mutated.
This PR doesn't completely fix the problem, but it makes it less
incorrect: most of the time the input already has the correct strides
but inductor fails to recognize it, and in those cases writing directly
to the input is fine.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133452
Approved by: https://github.com/eellison
Summary:
**Context:**
Currently we have a helper to print out an AtenTensor in [shim_common.cpp](https://github.com/pytorch/pytorch/blob/v2.4.0-rc4/torch/csrc/inductor/aoti_torch/shim_common.cpp#L866).
The way we were using this function was a "manual" process: we injected it into the generated output.cpp file, then recompiled and reloaded the file. This diff automates the value-printing process.
**Changes:**
1. Added a simple initial debug printer helper to print out tensor values
2. Added a filter option to selectively dump tensor values.
**Usage:**
Sample cmd:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, +schedule, output_code" python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda
```
Sample outputs:
```
[ before_launch - triton_poi_fused_0 - buf0 ]:
0.6331
1.6358
-0.3459
1.0196
-0.4122
1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0
[ after_launch - triton_poi_fused_0 - buf0 ]:
0.6331
1.6358
-0.3459
1.0196
-0.4122
1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0
[ before_launch - aoti_torch_cuda_addmm_out - buf1 ]:
Min value: -2.25655
Max value: 2.32996
Device: cuda:0
Size: [16, 6]
Stride: [6, 1]
Dtype: float
Layout: Strided
Number of elements: 96
Is contiguous: 1
Requires grad: 0
[ before_launch - aoti_torch_cuda_addmm_out - buf0 ]:
0.6331
1.6358
-0.3459
1.0196
-0.4122
1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0
[ after_launch - aoti_torch_cuda_addmm_out - buf1 ]:
Min value: -12.0839
Max value: 11.6878
Device: cuda:0
Size: [16, 6]
Stride: [6, 1]
Dtype: float
Layout: Strided
Number of elements: 96
Is contiguous: 1
Requires grad: 0
[ after_launch - aoti_torch_cuda_addmm_out - buf0 ]:
0.6331
1.6358
-0.3459
1.0196
-0.4122
1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0
stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('extern_calls', 2)]
.
----------------------------------------------------------------------
Ran 1 test in 10.867s
OK
```
The user can filter which kernels' values are printed by setting the env var `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT`; the available kernel names are shown in a log message like the one below:
```
torch/_inductor/graph.py:1642] Finished codegen for all nodes. The list of kernel names available: ['triton_poi_fused_0', 'aoti_torch_cuda_addmm_out']
```
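For example, a hypothetical filtered run restricted to one of the kernels listed above might look like this (the value format of the env var is an assumption):
```
AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT="aoti_torch_cuda_addmm_out" AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda
```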
In a follow-up diff, we will add `torch.save()` support to dump the intermediate tensors into individual `.pt` files that can later be loaded with `torch.load()`.
Test Plan:
Run unit tests in OSS (similar cmd as mentioned above in the usage section):
`AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, output_code" python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda`
Differential Revision: D60538496
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132323
Approved by: https://github.com/ColinPeppler