Commit Graph

697 Commits

Author SHA1 Message Date
Cynthia Yang
960c4b9937 [inductor] Enable triton kernels with unbacked inputs (#164509)
Summary:
We need to pass in fallback value to avoid converting symbols to int

original failure log in onefeed Slimper MB - P1973406565
`raise TypeError("Cannot convert symbols to int")`

Test Plan:
if not passing in fallback value -
https://www.internalfb.com/intern/everpaste/?handle=GGeAoh_M11kEGOECAFELOaq8ooRCbswMAAAz
`raise TypeError("Cannot convert symbols to int")`

```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:unbacked_symints -- test_triton_kernel_with_unbacked_symint_fallback --print-passing-details --env TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 --env TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(u0, 0)"
```
Buck UI: https://www.internalfb.com/buck2/4d27cd49-770b-40de-8c65-9ee04c5dd687
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9570149324695031
Network: Up: 0B  Down: 16MiB  (reSessionID-8e8b07a2-e31c-402d-bf6a-ebb92253e654)
Executing actions. Remaining     0/6                                                              5.0s exec time total
Command: test.     Finished 2 cache (100% hit)                                                    5.0s exec time cached (100%)
Time elapsed: 33.8s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D83684260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164509
Approved by: https://github.com/ColinPeppler
2025-10-03 21:05:18 +00:00
eellison
86474ce996 Update mask dtype (#164472)
Differential Revision: [D83781684](https://our.internmc.facebook.com/intern/diff/D83781684)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164472
Approved by: https://github.com/bdhirsh
2025-10-03 00:19:36 +00:00
Isuru Fernando
f465ea6752 [inductor] require shape in TritonCSEVariable (#162275)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162275
Approved by: https://github.com/mlazos
ghstack dependencies: #164158
2025-10-02 21:52:09 +00:00
PyTorch MergeBot
20edc5b26a Revert "Add num_store to inductor_meta and use it to scale persistent reduction x block (#162446)"
This reverts commit 22c5e8c17c.

Reverted https://github.com/pytorch/pytorch/pull/162446 on behalf of https://github.com/PaulZhang12 due to perf regression in https://github.com/pytorch/pytorch/issues/164301#issuecomment-3354028620 ([comment](https://github.com/pytorch/pytorch/pull/162446#issuecomment-3357164274))
2025-10-01 16:23:03 +00:00
Pian Pawakapan
d615f6b935 [inductor] use hint_override in kernel benchmark args (#164207)
Summary: forward fix T239259207

Test Plan: test_multi_kernel

Differential Revision: D83539263

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164207
Approved by: https://github.com/bobrenjc93, https://github.com/mlazos
2025-09-30 18:09:29 +00:00
Nick Riasanovsky
719b64ee8b Fix TMA transpose logic to handle 1D shapes + string differences (#163966)
Fixes #163702.

This fixes 2 issues:
1. The value may inconsistently be a shape or string. This normalizes to handle both of these.
2. 1D shapes should not transpose data. This fixes the order of operations to prevent this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163966
Approved by: https://github.com/eellison
2025-09-30 17:51:37 +00:00
Yuanyuan Chen
85012fe167 Remove unnecessary list comprehensions (#164103)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164103
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos
2025-09-30 03:56:54 +00:00
PyTorch MergeBot
6b473c90cf Revert "[inductor] require shape in TritonCSEVariable (#162275)"
This reverts commit c257570e6c.

Reverted https://github.com/pytorch/pytorch/pull/162275 on behalf of https://github.com/jeffdaily due to sorry this broke rocm CI; inductor/test_select_algorithm.py::TestTemplateRender::test_finalized_subclass_hooks [GH job link](https://github.com/pytorch/pytorch/actions/runs/18048893250/job/51366715091) [HUD commit link](c257570e6c) ([comment](https://github.com/pytorch/pytorch/pull/162275#issuecomment-3348159095))
2025-09-29 17:26:54 +00:00
Markus Hoehnerbach
069ccf5f1e [inductor] pdl: enable launch and deduplicate waits (#162014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162014
Approved by: https://github.com/eellison
2025-09-29 16:10:26 +00:00
Isuru Fernando
c257570e6c [inductor] require shape in TritonCSEVariable (#162275)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162275
Approved by: https://github.com/mlazos
2025-09-26 20:41:12 +00:00
Shangdi Yu
520fca82c8 Refactor Provenance Tracking (#163378)
Summary:
- Move the `provenance_level` flag check to inside the `set_kernel_post_grad_provenance_tracing` call to simply the code

- Move the `set_kernel_post_grad_provenance_tracing` call and `write_provenance_debug_handle` call to `codegen_comment`.

- If some `call_kernel` call sites don't have a proceeding `codegen_comment` call, add one. Now all `call_kernel` call sites are accompanied with a  `codegen_comment` call.

- Add a `codegen_comment` method to BaseScheduling and remove the noop `codegen_comment` method in Scheduling

- Remove `debug_handle` from `call_kernel`.

Test Plan:
CI

```
buck run @//mode/opt-split-dwarf fbcode//caffe2/test/inductor:provenance_tracing
```

Differential Revision: D82839271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163378
Approved by: https://github.com/angelayi
2025-09-25 22:55:59 +00:00
karthickai
8c98aee436 [Inductor] Update DeviceAssert op to behave like store (#163696)
Updated the DeviceAssert operation to match the behavior of Store, it will fixes the issue mentioned in [this PR](https://github.com/pytorch/pytorch/pull/163023) and updated testcases as Elias [suggested](https://github.com/pytorch/pytorch/pull/160677#discussion_r2353834646).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163696
Approved by: https://github.com/mlazos
2025-09-24 23:35:56 +00:00
Nick Riasanovsky
0390798dad [Triton] [Inductor] Enable Epilogue Subtiling in the blackwell ws template (#163145)
Summary: Enables support for epilogue subtiling in the blackwell ws template. This requires the ability to call `store_output` twice in the same kernel and reuse the same tensor descriptor across allocations.

Test Plan:
Tested with test_max_autotune.py on a Blackwell server.

Rollback Plan:

Differential Revision: D82610077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163145
Approved by: https://github.com/eellison
2025-09-24 05:38:02 +00:00
Markus Hoehnerbach
eb3fbf5b08 [inductor] in emulate_precision_casts, disable fma fusion in triton (#163073)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163073
Approved by: https://github.com/eellison, https://github.com/jansel
2025-09-23 23:59:17 +00:00
eellison
c63e417c79 use reduction hint for aggressive rblock (#163371)
I had been using tiling scores to essentially check if this is an inner reduction. since that is not fully rolled out for dynamic shapes, use reduction hint when they are not available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163371
Approved by: https://github.com/PaulZhang12
2025-09-23 22:04:22 +00:00
PaulZhang12
22c5e8c17c Add num_store to inductor_meta and use it to scale persistent reduction x block (#162446)
Scale up XBLOCK for contiguous persistent reductions based on rnumel and number of loads + stores

<img width="928" height="656" alt="Screenshot 2025-09-18 at 5 02 57 PM" src="https://github.com/user-attachments/assets/ec3c561f-2a3f-4459-9e14-653715898da3" />

Differential Revision: [](https://our.internmc.facebook.com/intern/diff/)

Differential Revision: [](https://our.internmc.facebook.com/intern/diff/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162446
Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #162296
2025-09-23 20:36:39 +00:00
Jason Ansel
518c320676 [inductor] libdevice.sqrt => tl.sqrt_rn (#163419)
Fixes #163082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163419
Approved by: https://github.com/Skylion007, https://github.com/mlazos
ghstack dependencies: #163386, #163398, #163387, #163414, #163415
2025-09-23 15:37:21 +00:00
PaulZhang12
2b036632ca Allow add_persistent_r_block to scale up rblock up to a limit (#162296)
<img width="654" height="392" alt="Screenshot 2025-09-18 at 4 22 53 PM" src="https://github.com/user-attachments/assets/975650ec-f769-43a6-bdf5-2885a8d40d3c" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162296
Approved by: https://github.com/eellison
2025-09-22 21:41:46 +00:00
Markus Hoehnerbach
c5e7bb08b0 [inductor] pdl inductor option (disabled by default) (#160928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160928
Approved by: https://github.com/eellison
2025-09-18 06:35:28 +00:00
Isuru Fernando
c77726b1d7 [inductor] fix expand_shape when copy_shape is not a string (#162739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162739
Approved by: https://github.com/eellison, https://github.com/mlazos
2025-09-15 23:22:07 +00:00
Nick Riasanovsky
74a35c6344 [Triton] [Inductor] Enable TMA store for TMA mm templates (#160480)
Summary:
Adds support for TMA store in all TMA matmul templates (notably persistent_tma including addmm and scaled_mm). This works by requiring a template be registered with `tma_store=True` and when met constructs indices/range_trees to hook into the existing code base's TMA store support.

This also includes a couple notable changes:
- Adds support in the TMA template support for checking the output layout.
- Adds support for "hoisting" the tensor descriptor to the top of the kernel. This will currently only be used by template code right now, but in principle it can be generalized to other implementation.
- Supports considering multiple indices as the "contiguous" index. This is handled with support for transposing the input data when the alignment is no longer consistent. In general since the TMA support is derived from the index it doesn't seems reasonable that the 1D index math forces a certain alignment depending on index ordering so long as the layout matches.

Test Plan:
Tested with test_max_autotune.py unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160480
Approved by: https://github.com/NikhilAPatel
2025-09-14 04:56:49 +00:00
Isuru Fernando
f654cff566 [inductor] Add shape to load_input in matmul templates (#162513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162513
Approved by: https://github.com/eellison
ghstack dependencies: #162426
2025-09-11 01:51:15 +00:00
eellison
f4aeceaa9d Use upper bound for persistent rblock (#162441)
Previously, we were using 128 and increasing to upper bound. We should be setting at the upper bound and raising to next power of 2.

Differential Revision: [D81984103](https://our.internmc.facebook.com/intern/diff/D81984103)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162441
Approved by: https://github.com/PaulZhang12
2025-09-10 22:29:02 +00:00
Colin Peppler
348303ebd2 [ez] add docstring/typing for codegen_kernel_benchmark (#162609)
```
lintrunner init && lintrunner -m origin/main
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162609
Approved by: https://github.com/coconutruben
ghstack dependencies: #162442
2025-09-10 20:49:38 +00:00
Colin Peppler
94755e81c4 [inductor] Enable combo kernels with unbacked inputs (#162442)
Internal user tried enabling combo kernels, but ran into "Cannot convert symbols to int". This PR is to enable combo kernels on inputs with data-dependent shapes.

### Example exception

```
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 4997, in benchmark_combo_kernel
    kernel_code_list = self.generate_combo_kernel_code(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/simd.py", line 1849, in generate_combo_kernel_code
    src_code = kernel.codegen_kernel()
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 802, in codegen_kernel
    code.splice(self.codegen_kernel_benchmark(num_gb=0))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 852, in codegen_kernel_benchmark
    var_names.extend(self.kernel_benchmark_extra_args())
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 733, in kernel_benchmark_extra_args
    extra_args.append(str(V.graph.sizevars.size_hint(tree.numel)))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/sizevars.py", line 584, in size_hint
    return int(out)
           ^^^^^^^^
  File "/home/colinpeppler/.conda/envs/pytorch/lib/python3.12/site-packages/sympy/core/expr.py", line 307, in __int__
    raise TypeError("Cannot convert symbols to int")
torch._inductor.exc.InductorError: TypeError: Cannot convert symbols to int
```

Differential Revision: [D82042230](https://our.internmc.facebook.com/intern/diff/D82042230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162442
Approved by: https://github.com/jansel
2025-09-10 20:49:38 +00:00
PyTorch MergeBot
ada43ed39c Revert "[inductor] pdl inductor option (disabled by default) (#160928)"
This reverts commit 9458d1ac3b.

Reverted https://github.com/pytorch/pytorch/pull/160928 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/160928#issuecomment-3263560378))
2025-09-07 07:37:37 +00:00
Markus Hoehnerbach
9458d1ac3b [inductor] pdl inductor option (disabled by default) (#160928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160928
Approved by: https://github.com/eellison
2025-09-04 00:35:23 +00:00
Mwiza Kunda
b994f6e3b3 [inductor] check block options after broadcasting and singleton dims have been removed (#161602)
This will allow for some more cases to use tensor descriptors e.g. before the following block params would not match
because the innermost dimension does not have stride 1
```python
block_params=BlockParameters(shape=[64, 4, 1, 1], block_shape=[((XBLOCK + 3)//4), Min(4, XBLOCK), 1, 1], strides=[0, 1, 0, 0], offsets=[(xoffset//4), ModularIndexing(xoffset, 1, 4), 0, 0])
```
After broadcasting dimensions and singleton dimensions are removed:
```python
block_params=BlockParameters(shape=[4], block_shape=[Min(4, XBLOCK)], strides=[1], offsets=[ModularIndexing(xoffset, 1, 4)])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161602
Approved by: https://github.com/jansel
2025-08-30 08:10:51 +00:00
Boyuan Feng
77d8e98e1b [Inductor] update exp codegen for better precision (#161829)
Prior to this PR, we have:
```
[Default Behavior] uses `tl.math.exp({x})`:
eager diff: tensor(2.6935e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(9.2757e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0013996509159580942, compile_latency:0.0013981951951980592

TORCHINDUCTOR_USE_FAST_MATH=1 uses `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)`:
eager diff: tensor(2.2315e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(3.5329e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0013982331859319662, compile_latency:0.0013824134564199367

Update inductor to use `tl.extra.libdevice.exp(tmp0)`:
eager diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0014109122834153282, compile_latency:0.0014062877025520593
```

Since `tl.extra.libdevice.exp` leads to both better precision and on-par latency, we use it by default now.

Note that `tl.extra.libdevice.exp` used to have a perf issue in [January 2025](https://github.com/triton-lang/triton/issues/5735) since it used due to `ex2.approx.f32` instead of `ex2.approx.ftz.f32`. So `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)` was used as a workaround. I double checked that the issue is resolved and `tl.extra.libdevice.exp` also uses [ex2.approx.ftz.f32](https://github.com/triton-lang/triton/issues/5735#issuecomment-3238421293) today.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161829
Approved by: https://github.com/jansel
2025-08-30 04:56:51 +00:00
eellison
ebfee60101 [WIP] more aggressive persistent reduction (#161055)
Gives 18% speedup on rms norm (2048, 32768). And we have seen other instances where inductor is not aggressive enough about codegening persistent reductions - e.g. 39% on [this kernel from torch ao](https://github.com/pytorch/pytorch/issues/159769#issuecomment-3188568335).

Codegen-ing persistent reductions can be risky if you run out of registers. Here, I'm effectively making persistent reductions an option of looped reductions by setting RBLOCK == rnumel, so that we can still fallback to looped reductions as needed.

As criteria:

- there needs to be significant memory savings from doing a persistent reduction (by keeping memory in register and avoiding another iteration over input)
- we should not be coalescing on x dimension, otherwise large rblock will inhibit coalescing
- we should not be especially register or arithmetic intensive (this last part uses mem_ops_per_thread, but could be improved).

Still need to do dashboard run, although I'm not sure we get a lot of large rblock in our benchmarks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161055
Approved by: https://github.com/jansel
2025-08-30 01:08:45 +00:00
Karthick Panner Selvam
130e50afff [Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677)
This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084).

Changes Included

- Implemented device_assert op and overrides has_side_effect to return True to avoid removal by dead code elimination.
- Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor.
- Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler.
- Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code.
- Added test cases to verify both "should throw" and "should not throw" scenarios.

Fixes #147282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677
Approved by: https://github.com/mlazos, https://github.com/atalman
2025-08-28 18:57:34 +00:00
PyTorch MergeBot
c55bdb26e1 Revert "[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677)"
This reverts commit 378edb047f.

Reverted https://github.com/pytorch/pytorch/pull/160677 on behalf of https://github.com/atalman due to new test is failing internally ([comment](https://github.com/pytorch/pytorch/pull/160677#issuecomment-3230152168))
2025-08-27 23:45:12 +00:00
PyTorch MergeBot
014b98dd09 Revert "Add inductor backend to device interface; make minifier_tests more device agnostic (#151314)"
This reverts commit 77bc959fe1.

Reverted https://github.com/pytorch/pytorch/pull/151314 on behalf of https://github.com/atalman due to sorry change is faling internally ([comment](https://github.com/pytorch/pytorch/pull/151314#issuecomment-3229774015))
2025-08-27 21:21:19 +00:00
Karthick Panner Selvam
378edb047f [Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677)
This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084).

Changes Included

- Implemented device_assert op and overrides has_side_effect to return True to avoid removal by dead code elimination.
- Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor.
- Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler.
- Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code.
- Added test cases to verify both "should throw" and "should not throw" scenarios.

Fixes #147282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677
Approved by: https://github.com/mlazos
2025-08-27 14:49:20 +00:00
PyTorch MergeBot
de58505890 Revert "[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677)"
This reverts commit cddcaa1903.

Reverted https://github.com/pytorch/pytorch/pull/160677 on behalf of https://github.com/karthickai due to This is breaking tests on Rocm ([comment](https://github.com/pytorch/pytorch/pull/160677#issuecomment-3226541063))
2025-08-27 02:36:42 +00:00
Karthick Panner Selvam
cddcaa1903 [Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677)
This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084).

Changes Included

- Implemented device_assert op and overrides has_side_effect to return True to avoid removal by dead code elimination.
- Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor.
- Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler.
- Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code.
- Added test cases to verify both "should throw" and "should not throw" scenarios.

Fixes #147282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677
Approved by: https://github.com/mlazos
2025-08-26 22:33:23 +00:00
Charlie West-Taylor
77bc959fe1 Add inductor backend to device interface; make minifier_tests more device agnostic (#151314)
Tried to decouple the always cpu <=> c++, cuda <=> triton assumption. Tried to keep it relatively simple by just guarding things more specifically, at the moment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151314
Approved by: https://github.com/eellison
2025-08-26 19:40:37 +00:00
Mwiza Kunda
5ee464db5c [inductor] Fix descriptor broadcasting for singleton dimensions (#160310)
This fixes the case when an input / output contains both zero strides and singleton dimensions. In this case the broadcasting dimensions generated for the descriptor need to ignore dimensions that have zero strides with size 1, otherwise the determination of which dimensions to broadcast will fail.

As an example, consider the following store instruction:

```
name=buf1
index=x2 + 192*y0 + 64*y1
valule=TritonCSEVariable('tmp7')
params = BlockParameters(
    shape=[3, 4, 1, 1, 64],
    block_shape=[((YBLOCK + 3)//4), Min(4, YBLOCK), 1, 1, XBLOCK],
    strides=[64, 192, 0, 0, 1],
    offsets=[(yoffset//4), ModularIndexing(yoffset, 1, 4), 0, 0, xoffset]
)
broadcasting_dims=[False, False, True, True, False]
broadcast_shape=[((YBLOCK + 3)//4), Min(4, YBLOCK), XBLOCK]
```
Because `len(self.broadcasting_dims) != self.broadcast_shape)`, dim3 is incorrectly
marked as a broadcast dimension when the pre-broadcast shape is computed in `codegen_broadcast_and_reshape`.

```
9             pre_broadcast_shape = [
280                 sympy.S.One if is_broadcasting else dim
281                 for dim, is_broadcasting in zip(
282  ->                 self.broadcast_shape, self.broadcasting_dims
283                 )
284             ]
```

The pre_broadcast_shape is now wrong: `[((YBLOCK + 3)//4), Min(4, YBLOCK), 1]`

Triton throws the following error: `reshape() cannot change total number of elements in tensor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160310
Approved by: https://github.com/blaine-rister
2025-08-20 09:48:58 +00:00
Isuru Fernando
f305019377 [inductor] propagate shapes in CSEVariable (#152198)
Fixes #149905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152198
Approved by: https://github.com/eellison
2025-08-19 16:46:38 +00:00
Nick Riasanovsky
df60736410 [BE] [Inductor] Re-Land Support TMA before strict 3.4 cutoff (#160747)
Summary: Inductor's 3.4 Triton release is the most common used variant of Triton, but if someone is working with an alternative version of Triton this may not match. This moves the version check from 3.4 Triton to any variant that has support for the TMA APIs.

Test Plan:
Testing the previously failing test `inductor/test_torchinductor_strided_blocks.py::TritonTensorDescriptorTestCUDA::test_welford_non_block_pointer_cuda`

Rollback Plan:

Differential Revision: D80348643

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160747
Approved by: https://github.com/NikhilAPatel
2025-08-19 07:32:55 +00:00
PyTorch MergeBot
04c7be903d Revert "[BE] [Inductor] Re-Land Support TMA before strict 3.4 cutoff (#160747)"
This reverts commit 8f434545c2.

Reverted https://github.com/pytorch/pytorch/pull/160747 on behalf of https://github.com/malfet due to Looks like this breaks rocm, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm%20%2F%20linux-jammy-rocm-py3.10 ([comment](https://github.com/pytorch/pytorch/pull/160747#issuecomment-3194417733))
2025-08-17 14:22:48 +00:00
Nick Riasanovsky
8f434545c2 [BE] [Inductor] Re-Land Support TMA before strict 3.4 cutoff (#160747)
Summary: Inductor's 3.4 Triton release is the most common used variant of Triton, but if someone is working with an alternative version of Triton this may not match. This moves the version check from 3.4 Triton to any variant that has support for the TMA APIs.

Test Plan:
Testing the previously failing test `inductor/test_torchinductor_strided_blocks.py::TritonTensorDescriptorTestCUDA::test_welford_non_block_pointer_cuda`

Rollback Plan:

Differential Revision: D80348643

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160747
Approved by: https://github.com/NikhilAPatel
2025-08-17 00:35:12 +00:00
Markus Hoehnerbach
89654db1ab [inductor] fix triton bucketize mask propagation (#159961)
See 6b414f56a4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159961
Approved by: https://github.com/eellison
2025-08-12 19:59:32 +00:00
David Berard
1f4057c11a [inductor] remove no_x_dim (#159810)
no_x_dim is used to indicate that a reduction operates on a single row, and data loaded for the reduction is 1-dimensional.

no_x_dim was introduced in https://github.com/pytorch/pytorch/pull/102444 - in which there was bad perf in some reductions, and using 1D tensors fixed the perf issue.

However, it appears that this perf issue no longer exists in current Triton versions. https://github.com/pytorch/pytorch/pull/118822 checked this, and we can also check this on H100 benchmarks (linked below). And another motivation for removing this behavior is that it enables larger loads, which we observe is necessary for good performance on certain shapes on Blackwell.

H100 inference benchmarks:
https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a

H100 training benchmarks:
https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a

Overall, the benchmarks show minimal change in performance.

Differential Revision: [D79599286](https://our.internmc.facebook.com/intern/diff/D79599286)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159810
Approved by: https://github.com/ngimel, https://github.com/eellison
2025-08-12 17:10:31 +00:00
Tom Ritchford
a5725965ea Remove unnecessary "# noqa: set_linter" comments (#159467)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159467
Approved by: https://github.com/eellison
2025-08-06 21:31:52 +00:00
Bin Bao
a4b07fe8f6 [AOTI] Add more default options to compile_standalone (#158560)
Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560
Approved by: https://github.com/yushangdi
2025-08-06 15:59:27 +00:00
PyTorch MergeBot
410812763b Revert "[Inductor][Triton] Support TMA before strict 3.4 cutoff (#159777)"
This reverts commit bbc0df1094.

Reverted https://github.com/pytorch/pytorch/pull/159777 on behalf of https://github.com/izaitsevfb due to breaking inductor test on ROCm ([comment](https://github.com/pytorch/pytorch/pull/159777#issuecomment-3156770098))
2025-08-05 22:00:24 +00:00
Nick Riasanovsky
bbc0df1094 [Inductor][Triton] Support TMA before strict 3.4 cutoff (#159777)
Summary: Inductor's 3.4 Triton release is the most common used variant of Triton, but if someone is working with an alternative version of Triton this may not match. This moves the version check from 3.4 Triton to any variant that has support for the TMA APIs.

Test Plan:
Relying on CI. Should be a NFC.

Rollback Plan:

Reviewed By: davidberard98

Differential Revision: D79378792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159777
Approved by: https://github.com/davidberard98
2025-08-05 03:29:13 +00:00
Anatoly Myachev
46b925681c [inductor] Update to(tl.int8).to(tl.uint8) workaround from #94717 to handle entire range of torch.uint8 (#158567)
https://github.com/pytorch/pytorch/pull/94717/files#r2210265070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158567
Approved by: https://github.com/ngimel, https://github.com/jansel
2025-07-26 19:11:37 +00:00
PyTorch MergeBot
7d6f340238 Revert "[AOTI] Add more default options to compile_standalone (#158560)"
This reverts commit a991e285ae.

Reverted https://github.com/pytorch/pytorch/pull/158560 on behalf of https://github.com/jeffdaily due to broke rocm CI, no test signal was available from rocm ciflow/trunk, need to add ciflow/rocm to reland ([comment](https://github.com/pytorch/pytorch/pull/158560#issuecomment-3103633964))
2025-07-22 16:20:17 +00:00