Commit Graph

64 Commits

Author SHA1 Message Date
Yidi Wu
3413490f53 [scan] materialize combine_fn in forward add more autograd tests (#161732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161732
Approved by: https://github.com/zou3519
ghstack dependencies: #161557, #161664, #161808, #162025
2025-09-27 18:13:15 +00:00
Yidi Wu
8f15d6a0c9 [test][scan] refactor inductor test and prepare for adding bw tests (#161557)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161557
Approved by: https://github.com/zou3519
2025-09-27 18:13:14 +00:00
Yidi Wu
ec2e3687c7 [while_loop][autograd] support autograd_key of while_loop (#160483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160483
Approved by: https://github.com/zou3519
2025-09-07 21:55:29 +00:00
PyTorch MergeBot
7a83cf430e Revert " [while_loop][autograd] support autograd_key of while_loop (#160483)"
This reverts commit 2b8a83901c.

Reverted https://github.com/pytorch/pytorch/pull/160483 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but some trunk tests are failing either from this PR or the previous one in the stack ([comment](https://github.com/pytorch/pytorch/pull/160483#issuecomment-3263597325))
2025-09-07 08:50:49 +00:00
Yidi Wu
2b8a83901c [while_loop][autograd] support autograd_key of while_loop (#160483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160483
Approved by: https://github.com/zou3519
ghstack dependencies: #160548, #160467
2025-09-06 21:26:33 +00:00
Yidi Wu
48e3be3ab6 [while_loop][autograd] add hop while_loop_stack_output (#160467)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160467
Approved by: https://github.com/zou3519
ghstack dependencies: #160548
2025-09-06 21:26:33 +00:00
Zhang, Liangang
3e459491b5 Enable XPU path for FlexAttention (#143553)
[#RFC153024](https://github.com/pytorch/pytorch/issues/153024)

**Motivation**

1. Attention is the critical performance bottleneck in current LLM models, and FlexAttention is a good way to cover the broad range of attention variants used across transformers-series models. With FlexAttention, it becomes easy to enable paged attention and fused SDPA in the transformers repo on the XPU device. It also provides a candidate path for processing attention in LLM ecosystem libraries, e.g., vLLM and SGLang, on the XPU device.
2. FlexAttention is a good starting point for maturing the Intel Triton-based GEMM kernels. FlexAttention provides both a flexattention kernel and a flexdecoding kernel, covering compute-bound and memory-bound GEMM computation, and the different shapes needed to serve LLM inference (e.g., head_dim=64, 96, 128, 256) also need to be supported.

**What does this PR do?**

 1. Enable the XPU device type for the FlexAttention kernel and its unit tests, ensuring all important UTs pass on the XPU device.
 2. For end-to-end model inference, ensure that LLM inference with FlexAttention is functional (a usage sketch follows this list).
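
For context, a minimal usage sketch of FlexAttention on an XPU device is shown below. It assumes an XPU-enabled PyTorch build; the shapes, dtypes, and the causal `score_mod` are illustrative only and are not taken from this PR.
```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Illustrative score_mod: mask out future key positions (causal attention).
def causal(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

# Assumes an XPU-enabled build; swap device="xpu" for "cuda"/"cpu" as needed.
q = torch.randn(1, 8, 128, 64, device="xpu", dtype=torch.bfloat16)
k = torch.randn(1, 8, 128, 64, device="xpu", dtype=torch.bfloat16)
v = torch.randn(1, 8, 128, 64, device="xpu", dtype=torch.bfloat16)

# flex_attention is typically run under torch.compile so Inductor emits the fused Triton kernel.
compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v, score_mod=causal)
```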

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143553
Approved by: https://github.com/EikanWang, https://github.com/drisspg

Co-authored-by: Mao Yunfei <yunfei.mao@intel.com>
Co-authored-by: Xingyuan Li <xingyuan.li@intel.com>
Co-authored-by: majing <jing1.ma@intel.com>
Co-authored-by: Xiao, Wang <wang.xiao@intel.com>
2025-08-29 23:10:58 +00:00
Yidi Wu
bf6aaba0f7 [while_loop] avoid aliasing when body_fn never executes (#160670)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160670
Approved by: https://github.com/zou3519
ghstack dependencies: #160548, #160669
2025-08-29 19:36:37 +00:00
Yidi Wu
6598f00c18 [dynamo] auto lift unbacked symbol in tensor's storage_offset (#161199)
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True

class M(torch.nn.Module):
    def forward(self, idx, x):
        u0 = idx.item()
        x0 = x.select(0, u0)
        def fn():
            return x0.sin()
        return torch.cond(x0.sum() > 0, fn, fn)

m = M()
out = torch.compile(m, fullgraph=True)(torch.tensor(0, dtype=torch.int64, device="cuda"), torch.randn(3, 3, device="cuda"))
print(out)

```

Before this PR, we didn't track the storage_offset symbol of a tensor. After https://github.com/pytorch/pytorch/pull/157605, we create an unbacked symint for the storage_offset of the result of select. So when we try to lift the free basic symbols of x0 while speculating fn, we find a free symbol that's not bound to a proxy.

This PR tracks the storage_offset symbols and associates each with a proxy via torch.ops.aten.storage_offset.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161199
Approved by: https://github.com/zou3519
ghstack dependencies: #161198
2025-08-26 17:06:54 +00:00
Yidi Wu
ba6ce66698 [dynamo] lift backed symint output of item() (#161198)
Before the change in this PR, we have an error for the following code
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True

class M(torch.nn.Module):
    def forward(self, idx, x):
        u0 = idx.item()
        x0 = x.select(0, u0)
        def fn():
            return x0.sin()
        return torch.cond(x0.sum() > 0, fn, fn)

m = M()
out = torch.compile(m, fullgraph=True)(torch.tensor(0, dtype=torch.int64), torch.randn(3, 3))
```

The error occurs while speculating fn: we try to lift the symbol of x0.storage_offset() but find that the symbol doesn't have a source associated with it.

What really happens is that, when the input tensor is a scalar CPU tensor of int type, we have a shortcut that creates a normal (backed) symint when .item() is called; see https://github.com/pytorch/pytorch/pull/126245.

However, previously we only tracked the unbacked symint output of an operation, because we believed every backed symint must have a source associated with it and must already have been lifted as an input at the top level. That invariant no longer holds, so we end up with an error saying the symbol doesn't have a source (only inputs and symbols derived from inputs have sources, and the result of .item() doesn't).

In this PR, we also start tracking the normal symint with the proxy that created it (in this case, the .item() proxy).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161198
Approved by: https://github.com/zou3519
2025-08-26 17:06:54 +00:00
Yidi Wu
ff86509a06 [map] filter none gradients and add autograd inductor tests (#160548)
Filtering of None outputs in the autograd backward for other HOPs will be done in follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160548
Approved by: https://github.com/zou3519
2025-08-15 20:13:12 +00:00
Boyuan Feng
5f1010fbb3 [Graph Partition] Pass all OSS unit tests (#154667)
Graph partition leads to 6.2% speedup on vision_maskrcnn, 5.8% speedup on yolov3. [P1819700563](https://www.internalfb.com/phabricator/paste/view/P1819700563), 39.5% speedup on speech_transformer inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on speech_transformer training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315).

Running the same diff on two different days shows a speedup on average in both runs.

[first TorchInductor Benchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2021%20Jul%202025%2016%3A37%3A55%20GMT&stopTime=Mon%2C%2028%20Jul%202025%2016%3A37%3A55%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=75ef90fe89b82c967362a2d40fdf1af047202bc2&rBranch=main&rCommit=abcb24f4de11f8fedf2c2c9ff53b6092ef42306d)
<img width="1885" height="752" alt="image" src="https://github.com/user-attachments/assets/13bba9fc-5dbf-42ad-8558-d54f7e367b41" />

[second TorchInductorBenchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2023%20Jul%202025%2016%3A38%3A27%20GMT&stopTime=Wed%2C%2030%20Jul%202025%2016%3A38%3A27%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=66de27e29338c26b1be94733049868cb0309ea52&rBranch=main&rCommit=70d2e9ba455c3c910f6f95b24171c8eee7bc00bf)
<img width="2513" height="1030" alt="image" src="https://github.com/user-attachments/assets/3a413dcb-2314-4292-919a-7ca181f9eeac" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154667
Approved by: https://github.com/eellison
2025-08-12 04:37:58 +00:00
PyTorch MergeBot
09381f5dac Revert "[Graph Partition] Pass all OSS unit tests (#154667)"
This reverts commit ca7315c171.

Reverted https://github.com/pytorch/pytorch/pull/154667 on behalf of https://github.com/clee2000 due to broke inductor/test_memory.py::TestOperatorReorderForPeakMemory::test_reorder_peak_memory_lpmf [GH job link](https://github.com/pytorch/pytorch/actions/runs/16885961204/job/47836769279) [HUD commit link](ca7315c171) note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/154667#issuecomment-3176805477))
2025-08-11 20:34:27 +00:00
Boyuan Feng
ca7315c171 [Graph Partition] Pass all OSS unit tests (#154667)
Graph partition leads to 6.2% speedup on vision_maskrcnn, 5.8% speedup on yolov3. [P1819700563](https://www.internalfb.com/phabricator/paste/view/P1819700563), 39.5% speedup on speech_transformer inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on speech_transformer training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315).

Running the same diff on two different days shows a speedup on average in both runs.

[first TorchInductor Benchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2021%20Jul%202025%2016%3A37%3A55%20GMT&stopTime=Mon%2C%2028%20Jul%202025%2016%3A37%3A55%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=75ef90fe89b82c967362a2d40fdf1af047202bc2&rBranch=main&rCommit=abcb24f4de11f8fedf2c2c9ff53b6092ef42306d)
<img width="1885" height="752" alt="image" src="https://github.com/user-attachments/assets/13bba9fc-5dbf-42ad-8558-d54f7e367b41" />

[second TorchInductorBenchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2023%20Jul%202025%2016%3A38%3A27%20GMT&stopTime=Wed%2C%2030%20Jul%202025%2016%3A38%3A27%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=66de27e29338c26b1be94733049868cb0309ea52&rBranch=main&rCommit=70d2e9ba455c3c910f6f95b24171c8eee7bc00bf)
<img width="2513" height="1030" alt="image" src="https://github.com/user-attachments/assets/3a413dcb-2314-4292-919a-7ca181f9eeac" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154667
Approved by: https://github.com/eellison
2025-08-11 16:25:12 +00:00
Xuehai Pan
17687eb792 [BE][4/6] fix typos in test/ (test/inductor/) (#157638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157638
Approved by: https://github.com/yewentao256, https://github.com/jansel
2025-07-06 06:34:25 +00:00
Yidi Wu
836bb1941b [hop] support torch.func.functional_call in hop subgraph (#155886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155886
Approved by: https://github.com/zou3519
2025-06-28 23:47:46 +00:00
Xuehai Pan
f5e6e52f25 [BE][PYFMT] migrate PYFMT for test/inductor/ to ruff format (#148186)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148186
Approved by: https://github.com/jansel
2025-06-24 11:12:11 +00:00
xinan.lin
83259cf7a7 [Inductor][Intel GPU] Support mkldnn Conv post op fusion for XPU. (#150287)
This PR adds support for MKLDNN Conv post-op fusion in the Inductor Intel GPU backend under freezing mode.
The implementation reuses the CPU's MKLDNN pattern fusion mechanism, as well as the corresponding Inductor unit tests for CPU MKLDNN pattern fusion.
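
For orientation, here is a minimal sketch of the compile path this fusion targets: a conv + ReLU module compiled with freezing enabled, under which Inductor can apply the oneDNN conv post-op fusion patterns. The `freezing` flag and the `"xpu"` device string are assumptions based on the standard freezing workflow, not details taken from this PR.
```python
import torch
import torch.nn as nn

# Freezing folds weights into constants, which enables the Conv post-op fusion patterns.
torch._inductor.config.freezing = True

class ConvRelu(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x))  # conv + relu is a classic post-op fusion target

# Assumes an XPU-enabled PyTorch build; freezing applies in inference mode.
model = ConvRelu().eval().to("xpu")
compiled = torch.compile(model)
with torch.no_grad():
    out = compiled(torch.randn(1, 3, 32, 32, device="xpu"))
```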

The performance improvement:

| Suite       | Inductor Speedup (Baseline) | Inductor Speedup (Compared) | Acc Failed | Perf Failed | Inductor Perf Ratio | Speedup  |
|-------------|-----------------------------|------------------------------|------------|--------------|----------------------|----------|
| Huggingface | 2.134838                    | 2.125740314                  | 0          | 0            | 1.001462504          | 100.43%  |
| Torchbench  | 1.808558                    | 1.675100479                  | 0          | 0            | 1.075722187          | 107.97%  |
| Timm        | 2.343893                    | 2.070476653                  | 0          | 0            | 1.131023832          | 113.21%  |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150287
Approved by: https://github.com/ZhiweiYan-96, https://github.com/EikanWang, https://github.com/jansel
2025-06-19 13:17:22 +00:00
Yidi Wu
c8c892b4a5 [scan] disable functionalization key in backward tracing (#154343)
Previously, we didn't disable the functionalization key when materializing the backward graph. This caused the torch.zeros_like call (used when a grad is None) to return a functional tensor that is not tracked by the proxy tensor mode.

This PR fixes it by putting the tracing code under a disable-functionalization context manager.

Fixes https://github.com/pytorch/pytorch/issues/153437.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154343
Approved by: https://github.com/zou3519
2025-06-05 20:06:33 +00:00
Yidi Wu
d1fe198df6 [cond] support output the same unbacked symbol from two branches (#148206)
Previously, we didn't track the unbacked symbols leaking out of true_branch and false_branch if they had the same shape expr. As a result, the fake output of the cond operator itself didn't set up its unbacked_bindings meta properly (because those symbols were ignored).

In this PR, we also check whether unbacked symbols leak out of the branches; if so, we create new unbacked symbols for them and track them as outputs of cond.
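
A minimal sketch of the pattern this PR targets follows: both branches produce an output whose size is a data-dependent (unbacked) symint with the same shape expression. The `capture_dynamic_output_shape_ops` flag is only there so dynamo can capture nonzero's dynamic output shape; it is an assumption for this illustration, not part of the PR.
```python
import torch

torch._dynamo.config.capture_dynamic_output_shape_ops = True

def true_fn(x):
    return x.nonzero()   # output size is an unbacked symint

def false_fn(x):
    return x.nonzero()   # same shape expression leaks out of both branches

def f(pred, x):
    return torch.cond(pred, true_fn, false_fn, (x,))

out = torch.compile(f, fullgraph=True)(torch.tensor(True), torch.randn(8))
```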

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148206
Approved by: https://github.com/zou3519
2025-05-22 03:39:43 +00:00
Yidi Wu
d356ca2466 [map] add inductor support by lowering to while_loop (#150971)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150971
Approved by: https://github.com/zou3519
ghstack dependencies: #151034
2025-05-21 22:19:47 +00:00
Thomas Bohnstingl
68034198e5 [HOP] Mutation and alias rework (#146658)
This PR reworks the way the input mutations and various aliases are checked

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146658
Approved by: https://github.com/ydwu4
2025-05-18 08:05:22 +00:00
Yidi Wu
4a4a71a73c [inductor]lowering scan to while_loop (#148580)
This PR adds a pass in post_grad that lowers scan to while_loop. See the comment before the pass for how it is implemented.
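
For context, a rough sketch of the user-facing path this lowering serves is below. The import path and signature of the scan HOP are assumptions (it is not a stable public API); the combine function returns the pair (next carry, per-step output).
```python
import torch
from torch._higher_order_ops import scan  # assumed import path; not a stable public API

def combine(carry, x):
    new_carry = carry + x
    return new_carry, new_carry  # (next carry, per-step output)

init = torch.zeros(4, device="cuda")
xs = torch.randn(8, 4, device="cuda")

def f(init, xs):
    return scan(combine, init, xs)

# Under torch.compile, the post_grad pass added here lowers the scan HOP to a while_loop.
carry, ys = torch.compile(f, fullgraph=True)(init, xs)
```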

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148580
Approved by: https://github.com/jansel, https://github.com/eellison
2025-03-20 20:21:02 +00:00
Yidi Wu
923ce10f6c [while_loop] require stride to be the same as input for body_fn (#148002)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148002
Approved by: https://github.com/zou3519
2025-03-12 17:15:10 +00:00
Yidi Wu
982d7ba3ef [while_loop][inductor] relax the constraint that all inputs must be on the same device (#148019)
Previously, we required all inputs of while_loop to be on the same device. However, there are use cases where we want to keep some inputs on CPU and others on GPU; for example, keeping a loop index on CPU avoids host-to-device copies. This PR relaxes the constraint and only checks that the carry and the input at the same position have the same device.
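
A minimal sketch of the mixed-device pattern this relaxation enables is shown below. The import path is an assumption (while_loop is not a stable public API); the point is that the loop counter stays on CPU while the payload stays on the GPU, and each body_fn output matches the device of the carry at the same position.
```python
import torch
from torch._higher_order_ops import while_loop  # assumed import path; not a stable public API

def cond_fn(i, x):
    return i < 5           # scalar bool tensor, computed on CPU

def body_fn(i, x):
    return i + 1, x.sin()  # each output keeps the device of the carry at the same position

def f(i, x):
    return while_loop(cond_fn, body_fn, (i, x))

i = torch.tensor(0)                # loop counter kept on CPU
x = torch.randn(4, device="cuda")  # payload stays on the GPU

out_i, out_x = torch.compile(f, fullgraph=True)(i, x)
```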

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148019
Approved by: https://github.com/eellison, https://github.com/jansel
2025-02-28 18:27:03 +00:00
Yidi Wu
2d2f60bdda [cond] support mismatched output in inductor (#147567)
In this PR, we extract `codegen_unbacked_symbol_defs` out of FallbackKernel as a `codegen_unbacked_symbol_defs_for_outputs` method on the wrapper. With it, HOPs can support the case where the subgraph returns a tensor with unbacked symints. This PR only does it for cond; follow-up PRs will cover the others (e.g., while_loop).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147567
Approved by: https://github.com/jansel
2025-02-28 18:26:48 +00:00
Yidi Wu
824474cb35 [cond] support output sizes mismatch in front end (#147130)
This PR finishes https://github.com/pytorch/pytorch/pull/137615 by addressing the TODOs and comments left there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147130
Approved by: https://github.com/zou3519
2025-02-25 20:28:41 +00:00
Yidi Wu
8f073065d5 [while_loop][inductor] support sym expression as cond_fn output (#146222)
As titled. Previously, we only supported a tensor output from cond_fn; this PR also allows a shape expression to be returned from cond_fn.

aoti generated output code looks like:
```
V0203 11:28:05.750000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]     bool buf7_cond_result;
....
(while_loop_cond_graph_0_arg2_1_handle);
V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]         buf7_cond_result = u0 + u1 < 10L;
V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]         if (!buf7_cond_result) break;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146222
Approved by: https://github.com/desertfire
2025-02-10 21:25:40 +00:00
Shunting Zhang
bc0191802f [inductor] add size-asserts for fallback ops (#145904)
Fix https://github.com/pytorch/pytorch/issues/144717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145904
Approved by: https://github.com/jansel
2025-02-07 18:44:32 +00:00
PyTorch MergeBot
076717785c Revert "[while_loop][inductor] support sym expression as cond_fn output (#146222)"
This reverts commit 5ecdc428b2.

Reverted https://github.com/pytorch/pytorch/pull/146222 on behalf of https://github.com/atalman due to Internal failure, please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/146222#issuecomment-2643379933))
2025-02-07 16:19:41 +00:00
Yidi Wu
5ecdc428b2 [while_loop][inductor] support sym expression as cond_fn output (#146222)
As titled. Previously, we only supported a tensor output from cond_fn; this PR also allows a shape expression to be returned from cond_fn.

aoti generated output code looks like:
```
V0203 11:28:05.750000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]     bool buf7_cond_result;
....
(while_loop_cond_graph_0_arg2_1_handle);
V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]         buf7_cond_result = u0 + u1 < 10L;
V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]         if (!buf7_cond_result) break;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146222
Approved by: https://github.com/desertfire
ghstack dependencies: #146194, #146195
2025-02-06 19:39:55 +00:00
Yidi Wu
b0fe975521 [hop][inductor] track the dependency on unbacked symbols correctly with constant_args for hops (#143456)
Before this PR, we got an undefined-symbol error in the output code when an unbacked symint was **only** used inside the hop: the dependency on the unbacked symbols was not recorded correctly for hops, so the symbol definition got DCEed accidentally.

This PR adds the symbol arguments to `constant_args`, so the dependencies are constructed correctly when `get_unbacked_symbol_uses` is called to check constant_args.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143456
Approved by: https://github.com/desertfire
2025-02-04 18:47:34 +00:00
PyTorch MergeBot
c0979d72b5 Revert "[hop][inductor] track the dependency on unbacked symbols correctly with constant_args for hops (#143456)"
This reverts commit 68a3635484.

Reverted https://github.com/pytorch/pytorch/pull/143456 on behalf of https://github.com/atalman due to New tests are failing internally ([comment](https://github.com/pytorch/pytorch/pull/143456#issuecomment-2631475900))
2025-02-03 16:25:58 +00:00
Yidi Wu
68a3635484 [hop][inductor] track the dependency on unbacked symbols correctly with constant_args for hops (#143456)
Before this PR, we got an undefined-symbol error in the output code when an unbacked symint was **only** used inside the hop: the dependency on the unbacked symbols was not recorded correctly for hops, so the symbol definition got DCEed accidentally.

This PR adds the symbol arguments to `constant_args`, so the dependencies are constructed correctly when `get_unbacked_symbol_uses` is called to check constant_args.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143456
Approved by: https://github.com/desertfire
2025-01-31 18:29:27 +00:00
Yidi Wu
a3698ebd5c [while_loop] specialize when cond_fn return constants (#144515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144515
Approved by: https://github.com/zou3519
2025-01-30 19:02:34 +00:00
Yidi Wu
5660709856 [hop][BE] unify meta checking with check_meta_consistency (#143545)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143545
Approved by: https://github.com/zou3519
ghstack dependencies: #143105
2025-01-03 19:01:07 +00:00
Yidi Wu
da29c13693 [while_loop] data-dependent op in body_fn (#142031)
The idea is that the parent hop's fake tensor mode should ignore the unbacked symints newly allocated inside the subgraph, because the bindings of unbacked symbols in the subgraph are already established when we trace the subgraph. For example, if an operator in the subgraph produces unbacked symints, the track_tensor_tree logic for that operator takes care of them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142031
Approved by: https://github.com/zou3519
ghstack dependencies: #142162
2024-12-10 21:54:28 +00:00
Yidi Wu
9cfc9e636d [while_loop] change to guard_equals for checking output and carry (#141734)
Inputs with the same shape can be represented by different symbols, e.g.
```python
def body_fn(a, b):
  return b.sin(), a.sin()
```
, where a = torch.randn(3, 4) and b = torch.randn(3, 4). Up to four symbols could be allocated for a and b. So instead of requiring their shape and stride symbols to be identical, we just use guard_equals to enforce the constraint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141734
Approved by: https://github.com/zou3519, https://github.com/eellison
2024-12-03 04:00:21 +00:00
Yidi Wu
4b3ce62946 [while_loop] support pytree inputs (#140059)
Previously, we only supported carries that are a tuple of tensors. This PR adds support for pytrees of tensors.
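
A minimal sketch of a pytree carry (here a dict of tensors) is shown below. The import path and the exact way the pytree is packed into the carried inputs are assumptions; the intent is only to illustrate that the carry no longer has to be a flat tuple of tensors.
```python
import torch
from torch._higher_order_ops import while_loop  # assumed import path; not a stable public API

def cond_fn(carry):
    return carry["i"] < carry["limit"]

def body_fn(carry):
    return {
        "i": carry["i"] + 1,
        "limit": carry["limit"],
        "acc": carry["acc"] + 1.0,
    }

carry = {
    "i": torch.tensor(0),
    "limit": torch.tensor(5),
    "acc": torch.zeros(3),
}
# Packing the dict as the single carried input is an assumption about the calling convention.
result = while_loop(cond_fn, body_fn, (carry,))
```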

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140059
Approved by: https://github.com/zou3519
2024-11-20 21:12:29 +00:00
Thomas Bohnstingl
d1c26b0781 Improvements for associative_scan - slicing of xs (#138858)
In this PR, the combine_fn is consistently called with a slice along the scan dim. It implements part of https://github.com/pytorch/pytorch/pull/136966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138858
Approved by: https://github.com/ydwu4
2024-11-05 23:38:21 +00:00
Yidi Wu
3087b5e431 [cond] support lifted symint inputs in subgraph (#137519)
As titled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137519
Approved by: https://github.com/eellison
2024-10-17 16:09:06 +00:00
Animesh Jain
cd02c85ba4 [inductor][subgraph][python-wrapper] Lift subgraph code into functions (#137200)
Earlier, the subgraphs were inlined into the output code. This PR lifts each subgraph into its own function, which is then called from the output code.

This is the output code for test `test_cond_reintepret_view_inputs_outputs`

Before this PR - https://www.internalfb.com/intern/paste/P1632948905/
With this PR - https://www.internalfb.com/intern/paste/P1632946348/

A relevant snippet from the above paste is

~~~

def false_graph_0(args):
    false_graph_0_arg0_1, false_graph_0_arg1_1, s0 = args
    args.clear()
    s0 = s0
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        false_graph_0_buf0 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32)
        false_graph_0_buf1 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32)
        # Unsorted Source Nodes: [cond, z1, z2], Original ATen: [aten.sub, aten.add]
        triton_poi_fused_add_sub_1_xnumel = (-20) + (20*s0)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_sub_1.run(false_graph_0_arg0_1, false_graph_0_arg1_1, false_graph_0_buf0, false_graph_0_buf1, triton_poi_fused_add_sub_1_xnumel, grid=grid(triton_poi_fused_add_sub_1_xnumel), stream=stream0)
        del false_graph_0_arg0_1
        del false_graph_0_arg1_1
    return (reinterpret_tensor(false_graph_0_buf0, ((-3) + s0, 20), (20, 1), 40), reinterpret_tensor(false_graph_0_buf1, ((-1) + s0, 16), (20, 1), 4), )

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1 = args
    args.clear()
    s0 = arg0_1
    assert_size_stride(arg1_1, (s0, 20), (20, 1))
    assert_size_stride(arg2_1, (s0, 20), (20, 1))
    assert_size_stride(arg3_1, (), ())
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = [None] * 2
        buf0 = [None] * 2
        if arg3_1.item():
            # subgraph: true_graph_0
            true_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0)
            true_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0)
            (true_graph_0_buf0, true_graph_0_buf1) = true_graph_0([true_graph_0_arg0_1, true_graph_0_arg1_1, s0])
            buf0[0] = true_graph_0_buf0
            buf0[1] = true_graph_0_buf1
        else:
            # subgraph: false_graph_0
            false_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0)
            false_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0)
            (false_graph_0_buf0, false_graph_0_buf1) = false_graph_0([false_graph_0_arg0_1, false_graph_0_arg1_1, s0])
            buf0[0] = false_graph_0_buf0
            buf0[1] = false_graph_0_buf1
        del arg1_1
        del arg2_1
        del arg3_1
        buf1 = buf0[0]
        buf2 = buf0[1]
        del buf0
    return (buf1, buf2, )

~~~

The key change is to recursively call `codegen` for the subgraph and rely on `SubgraphPythonWrapper` to generate just the subgraph `fn`. The resulting subgraph_code is then inserted into the parent wrapper.

Note that this PR only works for the python wrapper. For the cpp wrapper, we need a lot of refactoring to ensure that we don't duplicate the global variables in the output_code. So, for now, I fall back to the old way of inlining for the cpp wrapper. I am hoping someone with more familiarity with the cpp wrapper can support subgraph lifting (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov).

This work will unblock hierarchical compilation (or cold start compile time work).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137200
Approved by: https://github.com/desertfire, https://github.com/eellison
2024-10-11 17:57:10 +00:00
Thomas Bohnstingl
994438040c Improvements for associative_scan - combine_mode (#133012)
This is part of a series of PRs to improve the `associative_scan` functionality. This specific PR introduces a `combine_mode`, which can be either `pointwise` (default) or `generic`. With `generic`, `associative_scan` is more flexible and also allows non-pointwise combine functions. This PR has been derived from https://github.com/pytorch/pytorch/pull/129307.
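
An illustrative sketch of the two modes follows. The import path and exact signature are assumptions (associative_scan is not a stable public API), and in practice the op is typically exercised under torch.compile.
```python
import torch
from torch._higher_order_ops.associative_scan import associative_scan  # assumed import path

# Pointwise combine: the default combine_mode="pointwise" is sufficient.
def add(x, y):
    return x + y

xs = torch.randn(8, 4, device="cuda")
cumsum_like = associative_scan(add, xs, dim=0)

# Non-pointwise combine (a matmul chain over the scan dim): needs combine_mode="generic".
def matmul_combine(a, b):
    return a @ b

mats = torch.randn(8, 3, 3, device="cuda")
prefix_products = associative_scan(matmul_combine, mats, dim=0, combine_mode="generic")
```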

@ydwu4 @Chillee @zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133012
Approved by: https://github.com/ydwu4
2024-08-30 16:09:53 +00:00
Stonepia
15addb00e6 Update test_control_flow.py to device-agnostic. (#133843)
Fixes #133841

This PR makes the `test_pointwise_associative_scan_CUDA_flip` also work on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133843
Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/malfet, https://github.com/jansel, https://github.com/atalman
2024-08-20 05:05:43 +00:00
Thomas Bohnstingl
d04cd7f3ba Improvements for associative_scan - Reverse feature (#133011)
This is part of a series of PRs to improve the `associative_scan` functionality. This specific PR introduces a `reverse` flag for `associative_scan`, establishing an interface similar to `jax.lax.associative_scan`. This PR has been derived from https://github.com/pytorch/pytorch/pull/129307.
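
A small sketch of the `reverse` flag follows; the import path and signature are assumptions, and the commented values assume an inclusive scan.
```python
import torch
from torch._higher_order_ops.associative_scan import associative_scan  # assumed import path

def add(x, y):
    return x + y

xs = torch.arange(5, dtype=torch.float32, device="cuda")

forward = associative_scan(add, xs, dim=0)                 # expected: [0, 1, 3, 6, 10]
backward = associative_scan(add, xs, dim=0, reverse=True)  # expected: [10, 10, 9, 7, 4]
```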

@ydwu4 @Chillee @zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133011
Approved by: https://github.com/ydwu4
2024-08-16 23:06:31 +00:00
Xuehai Pan
134bc4fc34 [BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763
Approved by: https://github.com/jansel
2024-07-18 07:49:19 +00:00
PyTorch MergeBot
b732b52f1e Revert "[BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)"
This reverts commit aecc746fcc.

Reverted https://github.com/pytorch/pytorch/pull/129763 on behalf of https://github.com/XuehaiPan due to need reland after rerunning lintrunner on main ([comment](https://github.com/pytorch/pytorch/pull/129763#issuecomment-2235736732))
2024-07-18 06:39:58 +00:00
Xuehai Pan
aecc746fcc [BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763
Approved by: https://github.com/jansel
2024-07-18 05:13:41 +00:00
Feng Yuan
cf090e222e Update torch-xpu-ops pin (ATen XPU implementation) (#130333)
1. Fix a compilation error caused by a PyTorch update: the prototype of the helper function `checkIndexTensorTypes` changed.
2. Fix a compilation error caused by a PyTorch update: PyTorch now enforces -Werror=unused-function.
3. Fix an inductor test failure caused by a CUDA-biased implementation in the test case. https://github.com/pytorch/pytorch/issues/130426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130333
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-07-10 18:10:53 +00:00
Sijia Chen
31bb65de19 [Inductor] Fix conditional codegen (#129492)
Summary:
We have the cache to guarantee the `sym` is codegen only once, see the following code
```
def ensure_size_computed(self, sym: sympy.Symbol):
    if isinstance(sym, sympy.Symbol) and symbol_is_type(sym, SymT.PRECOMPUTED_SIZE):
        if sym in self.computed_sizes:
            return
        self.computed_sizes.add(sym)
        expr = V.graph.sizevars.inv_precomputed_replacements[sym]
        self.writeline(
            f"{self.declare}{sym} = {self.expr_printer(expr)}{self.ending}"
        )
```
However, we didn't consider the case where the same `sym` needs to be codegen'd in both branches of a conditional (true and false), which caused the `undefined symbols` issue: P1441378833

To fix the issue, we use a stack to capture the state before generating code for a branch and restore the state afterwards.

Test Plan:
TORCH_LOGS="+inductor" buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100 -c fbcode.enable_gpu_sections=true --config 'cxx.extra_cxxflags=-g1' -c fbcode.platform010_cuda_version=12 //scripts/hhh:repro_cond_torch_compile

PYTORCH_TEST_FBCODE=1 TORCH_COMPILE_DEBUG=1 buck2 run  mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true //caffe2/test/inductor:control_flow -- -r test_cond_control_flow_with_precomputed_size

Differential Revision: D58973730

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129492
Approved by: https://github.com/aakhundov
2024-07-08 05:33:47 +00:00