183 Commits

Author SHA1 Message Date
Yuanyuan Chen
fc8ac1216c [4/N] Remove unused loop variables in tests (#166690)
This PR removes unused loop variables in tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166690
Approved by: https://github.com/justinchuby, https://github.com/mlazos
2025-10-31 10:20:48 +00:00
eellison
f5543e3741 [wip] fix searchsorted non dense (#165064)
Fix for https://github.com/pytorch/pytorch/issues/163528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165064
Approved by: https://github.com/benjaminglass1, https://github.com/mlazos
2025-10-30 21:21:24 +00:00
PaulZhang12
901bbcba12 Gate division bitwise numerics under a flag (#165566)
https://github.com/pytorch/pytorch/pull/164144 ensures that division under compile is bitwise equivalent with eager. However, as reported in https://github.com/pytorch/pytorch/issues/164301, kernel performance regressed.

On B200:
With standard triton `/`:
6511 GB/s

With triton `div_rn`:
4692 GB/s

Further investigation of the generated PTX is required to see why there is such a large slowdown. For now, enable bitwise equivalent results only under `TORCHINDUCTOR_EMULATE_DIVISION_ROUNDING`, similar to emulate_precision_cast.
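
To see why the choice of division routine can change results at the bit level, here is a minimal pure-Python sketch comparing a correctly rounded division with a reciprocal-multiply approximation. It only illustrates the general floating-point effect on IEEE-754 doubles; it is not Triton code, it makes no claim about what Triton's `/` or `div_rn` do internally, and the helper names are hypothetical.

```python
# Illustration only: plain Python floats (IEEE-754 doubles), not Triton code.

def divide_rounded(x: float, y: float) -> float:
    """Correctly rounded division, as Python's `/` performs on floats."""
    return x / y


def divide_via_reciprocal(x: float, y: float) -> float:
    """Approximate division: multiply by the reciprocal (two roundings)."""
    return x * (1.0 / y)


def count_mismatches(limit: int) -> int:
    """Count n in [1, limit] where n / n and n * (1 / n) differ bitwise."""
    return sum(
        divide_rounded(float(n), float(n)) != divide_via_reciprocal(float(n), float(n))
        for n in range(1, limit + 1)
    )


if __name__ == "__main__":
    print(divide_rounded(49.0, 49.0))         # exact quotient
    print(divide_via_reciprocal(49.0, 49.0))  # differs in the last bit
    print(count_mismatches(1000))
```

A single-bit difference like this is enough to break bitwise equality between two implementations, which is why the stricter behavior is kept available behind a flag.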

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165566
Approved by: https://github.com/ngimel, https://github.com/eellison
2025-10-15 23:41:01 +00:00
PaulZhang12
c8c5187e85 Fix truediv numerics between eager and compile (#164144)
Addresses numeric differences between eager and compile in https://github.com/pytorch/pytorch/issues/141753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164144
Approved by: https://github.com/bobrenjc93
2025-10-10 22:18:11 +00:00
PyTorch MergeBot
abb2f7179e Revert "Fix truediv numerics between eager and compile (#164144)"
This reverts commit 68913d8f2a.

Reverted https://github.com/pytorch/pytorch/pull/164144 on behalf of https://github.com/malfet due to It breaks CI again, why was it landed for 3 times in a row without any changes? ([comment](https://github.com/pytorch/pytorch/pull/164144#issuecomment-3390973016))
2025-10-10 16:10:25 +00:00
PaulZhang12
68913d8f2a Fix truediv numerics between eager and compile (#164144)
Addresses numeric differences between eager and compile in https://github.com/pytorch/pytorch/issues/141753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164144
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/ngimel
2025-10-10 14:00:46 +00:00
eellison
d272ed4b3e Fix identity expansion (#165066)
In some cases, we wrap indexing with `Identity` to prevent expansion from int32 -> int64 range. There are some checks in codegen which intend to check for constants, which did not handle Identity. Update these checks and update Identity so that it recursively prints inputs.

Fix for https://github.com/pytorch/pytorch/issues/164700

Replaces https://github.com/pytorch/pytorch/pull/160190 cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @njriasan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165066
Approved by: https://github.com/njriasan, https://github.com/shunting314, https://github.com/jansel
2025-10-10 13:07:15 +00:00
PyTorch MergeBot
ed2d514ad8 Revert "Fix truediv numerics between eager and compile (#164144)"
This reverts commit 724463d5a2.

Reverted https://github.com/pytorch/pytorch/pull/164144 on behalf of https://github.com/malfet due to Not sure if it's related, but looks it triggered fuzzer compiler test failure, see a2f29bcd63/1 ([comment](https://github.com/pytorch/pytorch/pull/164144#issuecomment-3387288464))
2025-10-09 19:53:38 +00:00
Jithun Nair
ee6a1ecb0a [ROCm] Enable MI355 CI on PRs, and run full set of UTs on PRs (#160215)
Useful to have PR testing for PRs such as https://github.com/pytorch/pytorch/pull/151360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160215
Approved by: https://github.com/malfet, https://github.com/atalman

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-09 18:03:12 +00:00
PaulZhang12
724463d5a2 Fix truediv numerics between eager and compile (#164144)
Addresses numeric differences between eager and compile in https://github.com/pytorch/pytorch/issues/141753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164144
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/ngimel
ghstack dependencies: #164997
2025-10-09 14:31:33 +00:00
PyTorch MergeBot
e09fb44ef1 Revert "Fix truediv numerics between eager and compile (#164144)"
This reverts commit d386325ca9.

Reverted https://github.com/pytorch/pytorch/pull/164144 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/164144#issuecomment-3384769092))
2025-10-09 08:40:52 +00:00
PaulZhang12
d386325ca9 Fix truediv numerics between eager and compile (#164144)
Addresses numeric differences between eager and compile in https://github.com/pytorch/pytorch/issues/141753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164144
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/ngimel
ghstack dependencies: #164997
2025-10-09 04:22:03 +00:00
PaulZhang12
54ae61c573 Change test_emulate_precision_casts_mean_ratio_chain from gelu to relu (#164997)
gelu can be unstable on local builds due to libdevice differences, as we lower to libdevice.erf. That combined with the semantics in the test can lead to catastrophic cancellation. We switch this test from gelu to relu to fix this instability.
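
As a rough illustration of catastrophic cancellation, the sketch below subtracts two nearly equal values after perturbing one of them slightly. The numbers are hypothetical and the perturbation merely stands in for a small difference between two erf implementations; this is not the test's actual computation.

```python
def relative_error(approx: float, exact: float) -> float:
    """Relative error of approx with respect to exact."""
    return abs(approx - exact) / abs(exact)


# Two nearly equal intermediate values (hypothetical numbers).
p = 1.000001
q = 1.000000

# Simulate a tiny implementation difference by perturbing p
# by about one part in a billion.
p_perturbed = p * (1.0 + 1e-9)

input_error = relative_error(p_perturbed, p)            # error going in
output_error = relative_error(p_perturbed - q, p - q)   # error after subtracting
amplification = output_error / input_error

if __name__ == "__main__":
    print(input_error, output_error, amplification)
```

The subtraction cancels the leading digits that the two values share, so the small input error becomes a much larger relative error in the result.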

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164997
Approved by: https://github.com/eellison, https://github.com/jansel
2025-10-09 03:14:05 +00:00
eellison
86474ce996 Update mask dtype (#164472)
Differential Revision: [D83781684](https://our.internmc.facebook.com/intern/diff/D83781684)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164472
Approved by: https://github.com/bdhirsh
2025-10-03 00:19:36 +00:00
Jason Ansel
6fa972796e [inductor] Fix bugs in emulate_precision_casts (#163520)
Fixes #163449
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163520
Approved by: https://github.com/eellison
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434, #163393, #163412, #163422, #163481
2025-09-24 02:52:36 +00:00
Jason Ansel
9c4d9f940b [inductor] Support out_dtype arg to matmul (#163393)
Fixes #163275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163393
Approved by: https://github.com/eellison, https://github.com/coconutruben
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434
2025-09-23 15:37:38 +00:00
Jason Ansel
518c320676 [inductor] libdevice.sqrt => tl.sqrt_rn (#163419)
Fixes #163082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163419
Approved by: https://github.com/Skylion007, https://github.com/mlazos
ghstack dependencies: #163386, #163398, #163387, #163414, #163415
2025-09-23 15:37:21 +00:00
Colin Peppler
3ef1bef36c [sdpa] make sure to recompile if alignment is different than before (#163083)
## Context
An example from Qwen2-7B
- This comes from running torch.compile with a sequence length that is divisible by 8 (no padding needed). Call this `Run1`.
- We then run the compiled model with a different length that isn't divisible by 8 (requires padding). Call this `Run2`.
- Then we'll see this error.
```
File "/var/tmp/torchinductor_nobody/2w/c2wby7ilxbna45xrtrrfjqpeutwouruviu2742ockunnd2bleeiz.py", line 1963, in call
    buf24 = torch.ops.aten._scaled_dot_product_efficient_attention_backward.default(reinterpret_tensor(buf18, (s85, 3584 // s19, s48, 512 // (512 // s19)), (s48*(512 // (512 // s19))*(3584 // s19), 512 // (512 // s19), (512 // (512 // s19))*(3584 // s19), 1), 0), buf20, buf21, buf22, buf23, getitem, getitem_1, getitem_2, getitem_3, 0.0, [True, True, True, False], scale=0.08838834764831845)
File "torch/_ops.py", line 841, in __call__
    return self._op(*args, **kwargs)
RuntimeError: attn_bias is not correctly aligned (strideM). attn_bias.stride(2) = 6102, and should be a multiple of 4.
```
- We only see the error because we did not recompile on `Run2`. Instead we ran the inputs on the same graph as `Run1`.

### A bit more on why.
Here we check whether to realize the unpadded buffer (unwrapped slice) which we want for `Run1` but not for `Run2`.
0897affcd5/torch/_inductor/lowering.py (L2687-L2694)

## Fix
Size hint doesn't guard, so the fix is to use `guard_or*` to guard.
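
The failure mode can be sketched with a toy compile cache in plain Python. Everything here is hypothetical (`ToyCompiler`, the divisible-by-8 check used as the guard key) and only mimics the idea that a decision baked into a compiled graph must also be recorded as a guard; it is not how Inductor implements `guard_or*`.

```python
from typing import Callable, Dict


class ToyCompiler:
    """Caches one 'compiled' function per guard key (hypothetical model)."""

    def __init__(self, use_guard: bool) -> None:
        self.use_guard = use_guard
        self.cache: Dict[object, Callable[[int], str]] = {}
        self.compile_count = 0

    def _compile(self, seq_len: int) -> Callable[[int], str]:
        # The padding decision is baked into the compiled artifact.
        self.compile_count += 1
        if seq_len % 8 != 0:
            return lambda n: f"padded kernel for length {n}"

        def unpadded(n: int) -> str:
            if n % 8 != 0:
                raise RuntimeError("attn_bias is not correctly aligned")
            return f"unpadded kernel for length {n}"

        return unpadded

    def run(self, seq_len: int) -> str:
        # With a guard, the padding decision is part of the cache key, so a
        # different answer forces a recompile. Without it, the first graph
        # is reused for every input.
        key = (seq_len % 8 == 0) if self.use_guard else "any"
        if key not in self.cache:
            self.cache[key] = self._compile(seq_len)
        return self.cache[key](seq_len)
```

With `use_guard=False`, running length 512 and then 6102 reuses the unpadded graph and raises; with `use_guard=True`, the second run compiles a padded variant instead.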

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163083
Approved by: https://github.com/eellison
2025-09-23 01:33:33 +00:00
Jason Ansel
36c2a1325c [inductor] Fix bug where viewed outputs get padded (#163398)
Fixes #163328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163398
Approved by: https://github.com/eellison
ghstack dependencies: #163386
2025-09-22 21:52:45 +00:00
Boyuan Feng
77d8e98e1b [Inductor] update exp codegen for better precision (#161829)
Prior to this PR, we have:
```
[Default Behavior] uses `tl.math.exp({x})`:
eager diff: tensor(2.6935e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(9.2757e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0013996509159580942, compile_latency:0.0013981951951980592

TORCHINDUCTOR_USE_FAST_MATH=1 uses `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)`:
eager diff: tensor(2.2315e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(3.5329e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0013982331859319662, compile_latency:0.0013824134564199367

Update inductor to use `tl.extra.libdevice.exp(tmp0)`:
eager diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0014109122834153282, compile_latency:0.0014062877025520593
```

Since `tl.extra.libdevice.exp` leads to both better precision and on-par latency, we use it by default now.

Note that `tl.extra.libdevice.exp` used to have a perf issue in [January 2025](https://github.com/triton-lang/triton/issues/5735) since it used `ex2.approx.f32` instead of `ex2.approx.ftz.f32`. So `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)` was used as a workaround. I double checked that the issue is resolved and `tl.extra.libdevice.exp` also uses [ex2.approx.ftz.f32](https://github.com/triton-lang/triton/issues/5735#issuecomment-3238421293) today.
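
As a rough illustration of the identity behind the workaround, exp(x) = 2^(x * log2(e)), the following plain-Python sketch compares both forms in double precision. It says nothing about the Triton or libdevice implementations measured above, and the function names are hypothetical.

```python
import math

LOG2_E = 1.4426950408889634  # log2(e), the constant quoted above


def exp_direct(x: float) -> float:
    """Reference: the math library's exp."""
    return math.exp(x)


def exp_via_exp2(x: float) -> float:
    """exp(x) rewritten as 2 ** (x * log2(e))."""
    return 2.0 ** (x * LOG2_E)


def max_relative_gap(xs) -> float:
    """Largest relative difference between the two forms over xs."""
    return max(abs(exp_via_exp2(x) - exp_direct(x)) / exp_direct(x) for x in xs)


samples = [i / 10.0 for i in range(-100, 101)]
gap = max_relative_gap(samples)

if __name__ == "__main__":
    print(gap)
```

The two forms agree mathematically, but the extra multiply adds a rounding step, so they need not match bit for bit.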

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161829
Approved by: https://github.com/jansel
2025-08-30 04:56:51 +00:00
Boyuan Feng
5f1010fbb3 [Graph Partition] Pass all OSS unit tests (#154667)
Graph partition leads to 6.2% speedup on vision_maskrcnn, 5.8% speedup on yolov3. [P1819700563](https://www.internalfb.com/phabricator/paste/view/P1819700563), 39.5% speedup on speech_transformer inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on speech_transformer training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315).

Ran the same diff on two different days; both runs show a speedup on average.

[first TorchInductor Benchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2021%20Jul%202025%2016%3A37%3A55%20GMT&stopTime=Mon%2C%2028%20Jul%202025%2016%3A37%3A55%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=75ef90fe89b82c967362a2d40fdf1af047202bc2&rBranch=main&rCommit=abcb24f4de11f8fedf2c2c9ff53b6092ef42306d)
<img width="1885" height="752" alt="image" src="https://github.com/user-attachments/assets/13bba9fc-5dbf-42ad-8558-d54f7e367b41" />

[second TorchInductorBenchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2023%20Jul%202025%2016%3A38%3A27%20GMT&stopTime=Wed%2C%2030%20Jul%202025%2016%3A38%3A27%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=66de27e29338c26b1be94733049868cb0309ea52&rBranch=main&rCommit=70d2e9ba455c3c910f6f95b24171c8eee7bc00bf)
<img width="2513" height="1030" alt="image" src="https://github.com/user-attachments/assets/3a413dcb-2314-4292-919a-7ca181f9eeac" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154667
Approved by: https://github.com/eellison
2025-08-12 04:37:58 +00:00
PyTorch MergeBot
09381f5dac Revert "[Graph Partition] Pass all OSS unit tests (#154667)"
This reverts commit ca7315c171.

Reverted https://github.com/pytorch/pytorch/pull/154667 on behalf of https://github.com/clee2000 due to broke inductor/test_memory.py::TestOperatorReorderForPeakMemory::test_reorder_peak_memory_lpmf [GH job link](https://github.com/pytorch/pytorch/actions/runs/16885961204/job/47836769279) [HUD commit link](ca7315c171) note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/154667#issuecomment-3176805477))
2025-08-11 20:34:27 +00:00
Boyuan Feng
ca7315c171 [Graph Partition] Pass all OSS unit tests (#154667)
Graph partition leads to 6.2% speedup on vision_maskrcnn, 5.8% speedup on yolov3. [P1819700563](https://www.internalfb.com/phabricator/paste/view/P1819700563), 39.5% speedup on speech_transformer inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on speech_transformer training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315).

Ran the same diff on two different days; both runs show a speedup on average.

[first TorchInductor Benchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2021%20Jul%202025%2016%3A37%3A55%20GMT&stopTime=Mon%2C%2028%20Jul%202025%2016%3A37%3A55%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=75ef90fe89b82c967362a2d40fdf1af047202bc2&rBranch=main&rCommit=abcb24f4de11f8fedf2c2c9ff53b6092ef42306d)
<img width="1885" height="752" alt="image" src="https://github.com/user-attachments/assets/13bba9fc-5dbf-42ad-8558-d54f7e367b41" />

[second TorchInductorBenchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2023%20Jul%202025%2016%3A38%3A27%20GMT&stopTime=Wed%2C%2030%20Jul%202025%2016%3A38%3A27%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=66de27e29338c26b1be94733049868cb0309ea52&rBranch=main&rCommit=70d2e9ba455c3c910f6f95b24171c8eee7bc00bf)
<img width="2513" height="1030" alt="image" src="https://github.com/user-attachments/assets/3a413dcb-2314-4292-919a-7ca181f9eeac" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154667
Approved by: https://github.com/eellison
2025-08-11 16:25:12 +00:00
Eddie Yan
1128f4c2a8 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
cleanup tuple/tensor boilerplate in cuDNN SDPA, preparation for nested/ragged tensor backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-08-08 22:22:48 +00:00
gaoyvfeng
50f23ff6f8 rename-HAS_CUDA-to-HAS_CUDA_AND_TRITON (#159883)
Fixes #159399
"Modified torch.testing._internal.inductor_utils and test/inductor"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159883
Approved by: https://github.com/janeyx99
2025-08-08 15:44:52 +00:00
PyTorch MergeBot
bfe5674e22 Revert "[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)"
This reverts commit 0797b2b6a8.

Reverted https://github.com/pytorch/pytorch/pull/149282 on behalf of https://github.com/wdvr due to reverting as discussed with @drisspg - @eqy please reach out to @drisspg for more info  ([comment](https://github.com/pytorch/pytorch/pull/149282#issuecomment-3084759671))
2025-07-17 16:55:55 +00:00
Sam Larsen
4b11428cb5 [BE][testing] Skip test_repeated_masked_load internally (#158355)
Summary: Test is failing internally because of the import from functorch.einops. _Maybe_ there's a way to get this dependence in the TARGETS file, but the obvious things didn't work. I'm wondering if this test is that important to have running in OSS and internally anyway?

Test Plan:
`buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:cuda_repro -- --exact 'caffe2/test/inductor:cuda_repro - test_repeated_masked_load (caffe2.test.inductor.test_cuda_repro.CudaReproTests)' --run-disabled`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158355
Approved by: https://github.com/eellison
2025-07-16 16:15:44 +00:00
Sam Larsen
a04a13c449 [BE][testing] Skip test_triton_interpret internally (#158260)
Summary: Subprocesses in fbcode are tricky because of .par files. I'm thinking it's not an important enough test to get it running and skipping is fine.

Test Plan: `buck test`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158260
Approved by: https://github.com/eellison
2025-07-16 16:14:44 +00:00
Eddie Yan
0797b2b6a8 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
cleanup tuple/tensor boilerplate in cuDNN SDPA, preparation for nested/ragged tensor backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-07-11 16:07:54 +00:00
Xuehai Pan
17687eb792 [BE][4/6] fix typos in test/ (test/inductor/) (#157638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157638
Approved by: https://github.com/yewentao256, https://github.com/jansel
2025-07-06 06:34:25 +00:00
Jason Ansel
b40981c630 Fix incorrect stride handling in adaptive_avg_pool3d (#157326)
Fixes #157248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157326
Approved by: https://github.com/eqy
ghstack dependencies: #157242
2025-07-01 03:03:48 +00:00
Xuehai Pan
f5e6e52f25 [BE][PYFMT] migrate PYFMT for test/inductor/ to ruff format (#148186)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148186
Approved by: https://github.com/jansel
2025-06-24 11:12:11 +00:00
Oguz Ulgen
a2a75be0f8 Rename inductor cache (#156128)
Requested by Simon on a different PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128
Approved by: https://github.com/xmfan
2025-06-17 03:57:18 +00:00
Austin Wahle
517d2995e0 Add __int__ and __float__ methods to _sympy.functions.Identity (#155873)
Fixes #155688

Root Cause:
in [`torch/_inductor/index_propagation.py`](f151b20123/torch/_inductor/index_propagation.py (L57-L68))
When creating a `TypedExpr` from an `Identity` (a `torch.utils._sympy.functions.Identity`, not a `sympy.matrices.expressions.Identity`) whose inner value, `Identity.args[0]`, is any torch int type, the `TypedExpr.__post_init__` method tries to cast the Identity object to a Python `int`. This is where the `TypeError` from the issue was raised, because Identity does not know how to cast to an `int`.

Fix:
Define `__int__` method for `torch.utils._sympy.functions.Identity`.
The same applies to `float` via `__float__`.
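
A minimal sketch of the mechanism, using hypothetical stand-in classes rather than the real `torch.utils._sympy.functions.Identity`: Python's `int()` and `float()` raise `TypeError` on an object unless it defines the matching conversion method.

```python
class BareWrapper:
    """Hypothetical wrapper with no conversion methods."""

    def __init__(self, value):
        self.value = value


class IdentityWrapper:
    """Hypothetical wrapper that forwards int() and float() to its inner value."""

    def __init__(self, value):
        self.value = value

    def __int__(self) -> int:
        return int(self.value)

    def __float__(self) -> float:
        return float(self.value)


if __name__ == "__main__":
    print(int(IdentityWrapper(3)), float(IdentityWrapper(3)))
    try:
        int(BareWrapper(3))
    except TypeError as exc:
        print("TypeError:", exc)
```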

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155873
Approved by: https://github.com/williamwen42
2025-06-15 04:24:40 +00:00
eellison
f6b83d4cc6 sort iteration over index vars (#154846)
Fix for https://github.com/pytorch/pytorch/issues/154741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154846
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
2025-06-02 22:06:00 +00:00
eellison
ef1d45b12d Cleanup parent fallback logic (#154006)
The `parent` in fallback_node_due_to_unsupported_type duplicates the `unsupported_output_tensor` logic, so remove it. Tested that the tests in test_add_complex give the same codegen. This fixes an issue in mx that @drisspg was running into.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154006
Approved by: https://github.com/drisspg
2025-05-29 13:40:36 +00:00
eellison
d6e29bf875 Reflect back mutation if we clone misaligned tensors (#154442)
Fix for https://github.com/pytorch/pytorch/issues/152425

Inductor specializes on whether or not a tensor is 16-byte aligned on the first invocation. Then, on subsequent invocations, if we inferred alignment but are passed a non-aligned tensor, we clone the tensor.

If we infer alignment, then run with unaligned, and mutate the input, we need to reflect back the mutation to the input. This pr adds back that mutation.
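
The idea can be sketched in plain Python with lists standing in for tensors. The names `kernel_add_one` and `run_compiled` are hypothetical, and the `is_aligned` flag replaces the real alignment check; this is a toy model, not Inductor's implementation.

```python
from typing import List


def kernel_add_one(buf: List[float]) -> None:
    """Stand-in for a compiled kernel that mutates its input in place."""
    for i in range(len(buf)):
        buf[i] += 1.0


def run_compiled(inp: List[float], is_aligned: bool) -> None:
    """Run the kernel, cloning when the input breaks the alignment assumption."""
    if is_aligned:
        kernel_add_one(inp)
        return
    clone = list(inp)       # the kernel assumed alignment, so run it on a copy
    kernel_add_one(clone)
    inp[:] = clone          # reflect the mutation back into the caller's input


if __name__ == "__main__":
    data = [0.0, 1.0]
    run_compiled(data, is_aligned=False)
    print(data)
```

Without the final copy-back, the caller's input would be left unchanged even though the function is supposed to mutate it.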

We could have also been less aggressive about inferring alignment for mutated tensors, but that has a significant perf hit. See the following benchmark:
```
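import torch
import triton

t = torch.rand(4096 * 4096, device="cuda", dtype=torch.float16)

@torch.compile(dynamic=False)
def foo(x):
    return x.add_(1)

# t[:-1] starts at the base address of t, so the input stays aligned
print(triton.testing.do_bench(lambda: foo(t[:-1])))
torch._dynamo.reset()
# t[1:] starts one float16 element (2 bytes) in, so the input is misaligned
print(triton.testing.do_bench(lambda: foo(t[1:])))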
```
gives
```
0.04063070610165596
0.07613472988113162
```
So almost twice as slow for non-aligned tensors. Tensors changing alignment is a relatively rare case.

In the future, we could consider a multi-kernel approach, or codegening a triton kernel that does most of the loads with aligned instructions, and a prologue/epilogue for the unaligned part. But it's yet to be seen whether this is a huge issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154442
Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh
2025-05-29 13:36:48 +00:00
Anthony Shoumikhin
e2f9759bd0 Fix broken URLs (#152237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-04-27 09:56:42 +00:00
henrylhtsang
02cecd1018 [inductor][test] Skip triton tests for MPS as well, also change reason for skipping SM89 to not IS_BIG_GPU (#151506)
Differential Revision:
[D73162091](https://our.internmc.facebook.com/intern/diff/D73162091/)

Combining / improving https://github.com/pytorch/pytorch/pull/150485 and https://github.com/pytorch/pytorch/pull/150343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151506
Approved by: https://github.com/ColinPeppler
2025-04-21 20:14:34 +00:00
PyTorch MergeBot
e434a9152e Revert "[inductor][test] Skip triton tests for MPS as well, also change reason for skipping SM89 to not IS_BIG_GPU (#151506)"
This reverts commit 6246c7d62c.

Reverted https://github.com/pytorch/pytorch/pull/151506 on behalf of https://github.com/henrylhtsang due to seems to be breaking some rocm mi300 run ([comment](https://github.com/pytorch/pytorch/pull/151506#issuecomment-2815999009))
2025-04-18 18:40:17 +00:00
eellison
9ccdeae7db Fix uint view copy (#151598)
Fix for https://github.com/pytorch/pytorch/issues/151156. We have some logic to undo our upcast prior to dtype bitcast. This pr cleans up that logic using dtypes in codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151598
Approved by: https://github.com/zou3519
ghstack dependencies: #151562
2025-04-18 18:13:39 +00:00
henrylhtsang
6246c7d62c [inductor][test] Skip triton tests for MPS as well, also change reason for skipping SM89 to not IS_BIG_GPU (#151506)
Differential Revision:
[D73162091](https://our.internmc.facebook.com/intern/diff/D73162091/)

Combining / improving https://github.com/pytorch/pytorch/pull/150485 and https://github.com/pytorch/pytorch/pull/150343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151506
Approved by: https://github.com/ColinPeppler
2025-04-18 17:26:16 +00:00
eellison
6d46b530fc Remove libdevice ops in inductor (#151562)
Now that we track dtypes during codegen, we can delete all these extra ops that worked around the problem by doing dispatch at lowering time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151562
Approved by: https://github.com/isuruf, https://github.com/jansel
2025-04-17 22:18:00 +00:00
Michael Lazos
fe961679d5 [Inductor] add support for disabling atomic adds (#151033)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151033
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-04-11 18:41:56 +00:00
eellison
27ded359a5 Fix inplacing with multiple, fused uses (#150845)
We had `can_inplace` defined on a single use. When that buffer has multiple uses inside a fused node, we need to check if the other accesses have the same index. Otherwise we may read memory that has already been written to from inplacing.
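
A toy example of the hazard, in plain Python: a fused loop reads the same buffer at two different indices, so reusing that buffer as the output corrupts later reads. The functions are hypothetical and are not Inductor code.

```python
def fused_out_of_place(x):
    """out[i] = x[i] + x[n - 1 - i], written to a separate buffer."""
    n = len(x)
    out = [0] * n
    for i in range(n):
        out[i] = x[i] + x[n - 1 - i]
    return out


def fused_in_place(x):
    """Same computation, but the output reuses the input buffer."""
    n = len(x)
    buf = list(x)  # copy so the caller's list is untouched; buf is both input and output
    for i in range(n):
        # The second read may hit an element that was already overwritten.
        buf[i] = buf[i] + buf[n - 1 - i]
    return buf


if __name__ == "__main__":
    print(fused_out_of_place([1, 2, 3, 4]))  # [5, 5, 5, 5]
    print(fused_in_place([1, 2, 3, 4]))      # [5, 5, 8, 9]
```

Inplacing is only safe here if every access to the buffer inside the fused node uses the same index as the write.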

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150845
Approved by: https://github.com/zou3519, https://github.com/exclamaforte, https://github.com/atalman, https://github.com/jansel
2025-04-09 00:05:07 +00:00
Jack Taylor
49b7d0d84d [ROCm] Enable more inductor UTs (#149513)
Primarily enable inductor fp8 tests, also enable other inductor tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149513
Approved by: https://github.com/jeffdaily
2025-04-01 00:30:36 +00:00
eellison
585fd972b8 Iterate over dense dim first in split reduction reindexing (#147229)
Fix for https://github.com/pytorch/pytorch/issues/144431.

Improves perf from 0.29963893827160504 -> 0.0396331632970453.

In split reductions, we view an input tensor as a single dimension, then reduce over it. When we are reducing over a tensor which has a dimension other than the last dimension as the dense dimension, we should iterate over the dense dimension first in our re-indexing.

This PR also gives evidence for the general need for reduction tiling, e.g. for cooperative reduction handling of this.
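
A small sketch of why iteration order matters, using a hypothetical 4x3 layout whose first dimension is the dense (stride-1) one. It lists the memory offsets visited for each loop order; the helper names are made up for illustration.

```python
from itertools import product


def offsets(shape, strides, order):
    """Memory offsets visited when looping dims in `order` (last entry fastest)."""
    ranges = [range(shape[d]) for d in order]
    result = []
    for idx in product(*ranges):
        result.append(sum(i * strides[d] for i, d in zip(idx, order)))
    return result


def unit_steps(offs):
    """Count consecutive accesses that move exactly one element forward."""
    return sum(1 for a, b in zip(offs, offs[1:]) if b - a == 1)


shape = (4, 3)
strides = (1, 4)  # dim 0 is the dense (stride-1) dimension

last_dim_fastest = offsets(shape, strides, order=(0, 1))
dense_dim_fastest = offsets(shape, strides, order=(1, 0))

if __name__ == "__main__":
    print(last_dim_fastest, unit_steps(last_dim_fastest))
    print(dense_dim_fastest, unit_steps(dense_dim_fastest))
```

Iterating the dense dimension fastest walks memory contiguously, while the other order jumps by the larger stride on every step.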

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147229
Approved by: https://github.com/jansel
2025-03-18 17:35:21 +00:00
Jason Ansel
b040dc3a53 Reland: [inductor] Simplify grid handling (#148305)
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583

Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
Instead, the grid computation is now included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.
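
For reference, the shift expression in the launcher example is a ceiling division of `xnumel` by a block size of 1024. The 1024 is read off that example and other configs may use different block sizes; the helper names below are hypothetical.

```python
def grid_shift(xnumel: int) -> int:
    """Grid size as written in the launcher example: add 1023, shift right by 10."""
    return (xnumel + 1023) >> 10


def grid_ceil_div(xnumel: int, block: int = 1024) -> int:
    """The same quantity written as an explicit ceiling division."""
    return -(-xnumel // block)


if __name__ == "__main__":
    for n in (1, 1024, 1025, 4096):
        print(n, grid_shift(n), grid_ceil_div(n))
```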

Differential [disconnected] Revision: D70471332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-03-12 15:52:16 +00:00
PyTorch MergeBot
5ada4e6a53 Revert "Reland: [inductor] Simplify grid handling (#148305)"
This reverts commit 8d08b49015.

Reverted https://github.com/pytorch/pytorch/pull/148305 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/148305#issuecomment-2718177044))
2025-03-12 14:58:43 +00:00
Jason Ansel
8d08b49015 Reland: [inductor] Simplify grid handling (#148305)
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583

Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
Instead, the grid computation is now included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.

Differential Revision: D70471332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-03-11 18:51:06 +00:00