Commit Graph

674 Commits

Author SHA1 Message Date
PyTorch MergeBot
9a7c2f1f64 Revert "Add torch compile force disable caches alias (#158072)"
This reverts commit 2ecf083b72.

Reverted https://github.com/pytorch/pytorch/pull/158072 on behalf of https://github.com/jeffdaily due to fails on rocm, signal ignored while rocm was unstable ([comment](https://github.com/pytorch/pytorch/pull/158072#issuecomment-3086740829))
2025-07-18 04:58:24 +00:00
Jack Taylor
7ebbf2cae7 Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563)" (#158550)
This reverts commit 8554c8007d (#157563) due to it causing a few breakages on ROCm.

Reverted expected_results.csv to 26807dcf27

> @xuanzhang816 Sorry, but I have to revert this PR yet again because it clearly reintroduced failures on ROCm after the remerge: f4d8bc46c7/2
and the failures are still showing up on tip-of-tree on HUD

Context
https://github.com/pytorch/pytorch/pull/157563#issuecomment-3083350857

Needs to be relanded in a non-BC-breaking way, or sanity-checked for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158550
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-07-17 19:47:41 +00:00
Oguz Ulgen
2ecf083b72 Add torch compile force disable caches alias (#158072)
A bunch of people keep thinking the current alias only disables the Inductor cache because it has the name "inductor" in it. Let's globalize the name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158072
Approved by: https://github.com/ezyang
2025-07-17 15:40:36 +00:00
Shangdi Yu
82a1ee1135 Refactor Provenance Tracking (#158399)
Summary:
As inductor provenance tracking is getting more use cases, we want to separate the inductor provenance tracking guard flag from the general `trace.enabled`, so we can enable provenance tracking without all the overhead of `trace.enabled`.

- Change the guard flag from `trace.enabled` to `trace.provenance_tracking`. It is turned on by either `TORCH_COMPILE_DEBUG=1` or `INDUCTOR_PROVENANCE=1`.
- Move the provenance tracking logic and variables out of DebugContext, because DebugContext is only enabled with `trace.enabled`. Since the variables are now global variables, a `reset_provenance_globals()` context manager is added to reset them for each `compile_fx()` call (see the sketch below).
- Move `set_kernel_post_grad_provenance_tracing` from `util.py` to `debug.py`, so all provenance-related logic now lives in `debug.py`.
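
A minimal sketch of what such a reset context manager could look like; the module-level variable names below are hypothetical, only `reset_provenance_globals()` and `compile_fx()` come from this PR:

```python
from contextlib import contextmanager

# Hypothetical module-level provenance state; per this PR the real variables
# now live in torch/_inductor/debug.py and may have different names.
_PROVENANCE_NODE_MAP: dict[str, list[str]] = {}
_PROVENANCE_KERNEL_MAP: dict[str, str] = {}

@contextmanager
def reset_provenance_globals():
    """Snapshot the global provenance maps and restore them on exit, so each
    compile_fx() call starts from a clean slate."""
    saved_node_map = dict(_PROVENANCE_NODE_MAP)
    saved_kernel_map = dict(_PROVENANCE_KERNEL_MAP)
    _PROVENANCE_NODE_MAP.clear()
    _PROVENANCE_KERNEL_MAP.clear()
    try:
        yield
    finally:
        _PROVENANCE_NODE_MAP.clear()
        _PROVENANCE_NODE_MAP.update(saved_node_map)
        _PROVENANCE_KERNEL_MAP.clear()
        _PROVENANCE_KERNEL_MAP.update(saved_kernel_map)
```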

In the future, if we want to enable it further, we can change the provenance tracking flag to be enabled when `TORCH_TRACE` is set. I think we should do that in a separate PR, so it's easier to revert if this flag change creates any problem.

See more motivation in internal Diff

Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer
buck run mode/dev-nosan  fbcode//caffe2/test:fx -- -r graph_provenance
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing
```

Differential Revision: D78287976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158399
Approved by: https://github.com/angelayi
2025-07-17 00:23:00 +00:00
Xuan Zhang
8554c8007d [PT2][fusion] ban fusions with large accumulated reads (#157563)
**Problem:**
Fusion can accumulate a large amount of reads, which leads to a significant increase in peak memory utilization. Imagine we have the following code snippet:
```
total = torch.rand(N, N)
for _ in range(r):
    x = torch.rand(N, N)
    total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N are in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 =  torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.

[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details

**Solution:**
Our proposed solution is to ban fusions in cases where a large amount of reads is accumulated. This is in addition to some existing logic during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which defaults to 8, controls _the number of_ buffers that can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the total size of_ buffers that can be accumulated (see the sketch after this list).
* During scheduling (i.e., `scheduler.py`), additional fusions will be performed and thus we also need to capture such patterns there. The decisions are implemented under `choices.py`.
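
A rough, hypothetical illustration of the size-based decision; only the config names `realize_acc_reads_threshold` and `realize_acc_reads_size_threshold` come from this PR, the helper and its signature are made up:

```python
# Hypothetical sketch of deciding whether an operator has accumulated too
# many (or too large) reads and should be realized instead of fused further.
def should_realize_for_acc_reads(read_buffer_sizes_bytes, num_threshold=8,
                                 size_threshold_bytes=None):
    # Count-based check, mirroring realize_acc_reads_threshold.
    if len(read_buffer_sizes_bytes) > num_threshold:
        return True
    # Size-based check, mirroring realize_acc_reads_size_threshold.
    if size_threshold_bytes is not None:
        return sum(read_buffer_sizes_bytes) > size_threshold_bytes
    return False
```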

**Results:**
For a small example similar to the one in the test case (but with a larger `N` and a higher number of loop repeats), the memory snapshots before and after are shown below. Note that the snapshot on the right is zoomed out so that the y-axes of the two snapshots match.

<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-07-16 01:05:25 +00:00
PyTorch MergeBot
26807dcf27 Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563)"
This reverts commit c062550a35.

Reverted https://github.com/pytorch/pytorch/pull/157563 on behalf of https://github.com/clee2000 due to broke test_linear_and_cel on main c062550a35, caused OOM? Also broken on PR, Dr. CI classification is wrong (claims the test is disabled by an issue but the issue is for a different test).  Also I'm pretty sure the expected results json is supposed to have a ton of empty lines, its to prevent merge conflicts, I will add it to the linter ([comment](https://github.com/pytorch/pytorch/pull/157563#issuecomment-3074355331))
2025-07-15 16:35:55 +00:00
Xiangyang (Mark) Guo
156a377f4c [AOTI][CPP] add flag TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL (#157949)
Summary: Add the flag TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL to force-inline the kernel function when TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL=1. It's disabled by default because force-inlining may increase the build time.

Differential Revision: D77915987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157949
Approved by: https://github.com/desertfire
2025-07-15 10:51:43 +00:00
Xuan Zhang
c062550a35 [PT2][fusion] ban fusions with large accumulated reads (#157563)
**Problem:**
Fusion can accumulate a large amount of reads, which leads to a significant increase in peak memory utilization. Imagine we have the following code snippet:
```
total = torch.rand(N, N)
for _ in range(r):
    x = torch.rand(N, N)
    total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N are in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 =  torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.

[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details

**Solution:**
Our proposed solution is to ban fusions in cases where a large amount of reads is accumulated. This is in addition to some existing logic during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which defaults to 8, controls _the number of_ buffers that can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the total size of_ buffers that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusions will be performed and thus we also need to capture such patterns there. The decisions are implemented under `choices.py`.

**Results:**
For a small example similar to the one in the test case (but with a larger `N` and a higher number of loop repeats), the memory snapshots before and after are shown below. Note that the snapshot on the right is zoomed out so that the y-axes of the two snapshots match.

<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-07-14 22:27:21 +00:00
PyTorch MergeBot
6ea91f0672 Revert "[Inductor] Set the default value of min_chunk_size to 512 (#150762)"
This reverts commit 3321acc92e.

Reverted https://github.com/pytorch/pytorch/pull/150762 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but an inductor compilation error shows up in trunk ([comment](https://github.com/pytorch/pytorch/pull/150762#issuecomment-3070286787))
2025-07-14 16:58:13 +00:00
Sun, Jiayi
3321acc92e [Inductor] Set the default value of min_chunk_size to 512 (#150762)
Change the default value of min_chunk_size from 4096 to 512 to allow more for-loops to be parallelized.
I tested the Inductor benchmark with this PR on CPU, and saw ~10% improvement in torchbench geomean speedup, and no change in huggingface/timm_models. There are about 15 torchbench models with different degrees of performance improvement, among which functorch_dp_cifar10, opacus_cifar10, hf_Reformer, and pyhpc_turbulent_kinetic_energy have more than 50% performance improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150762
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-07-14 01:14:30 +00:00
bobrenjc93
5221448574 multi-kernel matmuls based on varying hint sizes (#156628)
The core idea is to generate multiple matmul kernels using different hints for symbolic variables, then select the most appropriate one at runtime for each unique shape we encounter. You can find some early experimentation details in these posts:

https://fb.workplace.com/groups/8940092306109185/posts/9803850776399996/
https://fb.workplace.com/groups/8940092306109185/posts/9695805170537891/
https://fb.workplace.com/groups/257735836456307/posts/906589324904285/

Here’s a graph illustrating the empirically observed worst-case performance if an oracle always selected the least optimal hint for a given runtime size:

![image](https://github.com/user-attachments/assets/6d90ee06-a572-453e-9cba-03006f343301)

This graph illustrates the performance of a hint size of 64 relative to the worst case. Notice that as the runtime sizes increase, the performance gradually approaches the worst case:

![image](https://github.com/user-attachments/assets/85ad49fe-165a-474c-8d03-db2e57654213)

This graph shows the performance of a hint size of 4096 — very poor for small sizes, and also suboptimal for some mid-sized shapes:

![image](https://github.com/user-attachments/assets/adea1106-3bc8-40f3-97b0-20d940fb74f1)

Finally, here’s the graph that motivated this PR. It illustrates the performance when selecting the best of three kernels generated with three different hints — 64, 256, and 4096:

![image](https://github.com/user-attachments/assets/a7cb0ce5-8139-48b1-b5c9-7670e75cbfce)

## How to review this PR

At a high level, this extends @shunting314's multi-kernel abstraction to support varying GEMM choices driven by different hints. A few key points:

1. Unlike reduction kernels, triton template matmuls pass their grid as arguments to the kernel. This PR updates `MultiKernelCall` to support kernels with varying arguments.
2. The `V.graph.sizevars.size_hints` API is extended to accept a `hint_override`, allowing us to substitute the example input’s size hint with a custom value when generating multiple kernels.
3. The choice generation and benchmarking logic is updated to support multiple hint values. One kernel is generated per value in `torch._inductor.config.multi_kernel_hints`, and at runtime, we select the most suitable kernel for the current shape (a sketch of the selection idea follows this list).
4. This PR does not add support for cpp wrapper codegen to keep it scoped. That will be added in the next PR.
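
A highly simplified sketch of the runtime selection idea; the real `MultiKernelCall` benchmarks candidates and caches the winner per shape, while the helper below is only an assumption-laden stand-in that picks the kernel compiled with the nearest hint:

```python
# Hypothetical sketch: given kernels compiled with different size hints,
# pick the one whose hint is closest to the runtime size (the real
# implementation benchmarks candidates and caches the winner per shape).
def pick_kernel(kernels_by_hint, runtime_size):
    best_hint = min(kernels_by_hint, key=lambda h: abs(h - runtime_size))
    return kernels_by_hint[best_hint]

kernels_by_hint = {64: "kernel_64", 256: "kernel_256", 4096: "kernel_4096"}
assert pick_kernel(kernels_by_hint, 192) == "kernel_256"
```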

## Results

The following is a basic test that shows the multi-kernel dispatch working: we no longer see significant variance based on the original hint size: https://gist.github.com/bobrenjc93/ba711d529e65fd65839b34799f6323ec

Before
```
Hint\Runtime |     64     |    256     |    4096
---------------------------------------------------
     64      |   0.0948   |   0.3124   |   4.9477
    256      |   0.2243   |   0.2256   |   3.3880
    4096     |   0.3384   |   0.3404   |   3.3010
```

After
```
Hint\Runtime |     64     |    256     |    4096
---------------------------------------------------
     64      |   0.0951   |   0.2289   |   3.3013
    256      |   0.0952   |   0.2258   |   3.4045
    4096     |   0.0957   |   0.2231   |   3.3146
```

We also see an average speedup of 5.04% for the matrix of all hint/runtime pairs in [64, 4096] for every increment of 64: https://docs.google.com/spreadsheets/d/12TmYUDrAAFASGuP3POXTKPeAvQWIRzKzdrVSIb3vQkA/edit?gid=480268938#gid=480268938

![Worst Case, multi-kernel](https://github.com/user-attachments/assets/712df23b-87e2-4d9d-95c2-cc25305ba2ed)

NB: This is just the beginning, and I plan on doing more investigation to further improve on this initial result.

For posterity the script used to generate that matrix is here: https://gist.github.com/bobrenjc93/c211fd0bd97fad8f46b91ad9dee76ad0

HUD benchmark runs:
base: https://github.com/pytorch/pytorch/actions/runs/15889871988
head: https://github.com/pytorch/pytorch/actions/runs/15889876842

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156628
Approved by: https://github.com/jansel
2025-07-12 15:08:21 +00:00
Xuehai Pan
7f14b42adf [BE][2/16] fix typos in torch/ (torch/_*/) (#156312)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156312
Approved by: https://github.com/albanD
2025-07-12 05:47:06 +00:00
PyTorch MergeBot
e90148c91d Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563)"
This reverts commit 4b9a6f7211.

Reverted https://github.com/pytorch/pytorch/pull/157563 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I suspect that it might contribute to a string of OOM error in trunk ([comment](https://github.com/pytorch/pytorch/pull/157563#issuecomment-3064678929))
2025-07-12 04:52:11 +00:00
PyTorch MergeBot
e15f4248ad Revert "[BE][2/16] fix typos in torch/ (torch/_*/) (#156312)"
This reverts commit 7a92b51196.

Reverted https://github.com/pytorch/pytorch/pull/156312 on behalf of https://github.com/XuehaiPan due to landrace ([comment](https://github.com/pytorch/pytorch/pull/156312#issuecomment-3064672250))
2025-07-12 04:40:52 +00:00
PyTorch MergeBot
9c189ed29a Revert "multi-kernel matmuls based on varying hint sizes (#156628)"
This reverts commit 6c79530637.

Reverted https://github.com/pytorch/pytorch/pull/156628 on behalf of https://github.com/huydhn due to Sorry for reverting your change but some ROCM jobs went crazy after this lands, so I try to see if reverting helps ([comment](https://github.com/pytorch/pytorch/pull/156628#issuecomment-3064617123))
2025-07-12 03:48:39 +00:00
Xuehai Pan
7a92b51196 [BE][2/16] fix typos in torch/ (torch/_*/) (#156312)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156312
Approved by: https://github.com/albanD
2025-07-12 01:47:22 +00:00
Xuan Zhang
4b9a6f7211 [PT2][fusion] ban fusions with large accumulated reads (#157563)
**Problem:**
Fusion can accumulate a large amount of reads, which leads to a significant increase in peak memory utilization. Imagine we have the following code snippet:
```
total = torch.rand(N, N)
for _ in range(r):
    x = torch.rand(N, N)
    total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N are in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 =  torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.

[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details

**Solution:**
Our proposed solution is to ban fusions in cases where a large amount of reads is accumulated. This is in addition to some existing logic during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which defaults to 8, controls _the number of_ buffers that can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the total size of_ buffers that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusions will be performed and thus we also need to capture such patterns there. The decisions are implemented under `choices.py`.

**Results:**
For a small example similar to the one in the test case (but with a larger `N` and a higher number of loop repeats), the memory snapshots before and after are shown below. Note that the snapshot on the right is zoomed out so that the y-axes of the two snapshots match.

<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-07-11 21:07:57 +00:00
bobrenjc93
6c79530637 multi-kernel matmuls based on varying hint sizes (#156628)
The core idea is to generate multiple matmul kernels using different hints for symbolic variables, then select the most appropriate one at runtime for each unique shape we encounter. You can find some early experimentation details in these posts:

https://fb.workplace.com/groups/8940092306109185/posts/9803850776399996/
https://fb.workplace.com/groups/8940092306109185/posts/9695805170537891/
https://fb.workplace.com/groups/257735836456307/posts/906589324904285/

Here’s a graph illustrating the empirically observed worst-case performance if an oracle always selected the least optimal hint for a given runtime size:

![image](https://github.com/user-attachments/assets/6d90ee06-a572-453e-9cba-03006f343301)

This graph illustrates the performance of a hint size of 64 relative to the worst case. Notice that as the runtime sizes increase, the performance gradually approaches the worst case:

![image](https://github.com/user-attachments/assets/85ad49fe-165a-474c-8d03-db2e57654213)

This graph shows the performance of a hint size of 4096 — very poor for small sizes, and also suboptimal for some mid-sized shapes:

![image](https://github.com/user-attachments/assets/adea1106-3bc8-40f3-97b0-20d940fb74f1)

Finally, here’s the graph that motivated this PR. It illustrates the performance when selecting the best of three kernels generated with three different hints — 64, 256, and 4096:

![image](https://github.com/user-attachments/assets/a7cb0ce5-8139-48b1-b5c9-7670e75cbfce)

## How to review this PR

At a high level, this extends @shunting314's multi-kernel abstraction to support varying GEMM choices driven by different hints. A few key points:

1. Unlike reduction kernels, triton template matmuls pass their grid as arguments to the kernel. This PR updates `MultiKernelCall` to support kernels with varying arguments.
2. The `V.graph.sizevars.size_hints` API is extended to accept a `hint_override`, allowing us to substitute the example input’s size hint with a custom value when generating multiple kernels.
3. The choice generation and benchmarking logic is updated to support multiple hint values. One kernel is generated per value in `torch._inductor.config.multi_kernel_hints`, and at runtime, we select the most suitable kernel for the current shape.
4. This PR does not add support for cpp wrapper codegen to keep it scoped. That will be added in the next PR.

## Results

The following is a basic test that shows the multi-kernel dispatch working: we no longer see significant variance based on the original hint size: https://gist.github.com/bobrenjc93/ba711d529e65fd65839b34799f6323ec

Before
```
Hint\Runtime |     64     |    256     |    4096
---------------------------------------------------
     64      |   0.0948   |   0.3124   |   4.9477
    256      |   0.2243   |   0.2256   |   3.3880
    4096     |   0.3384   |   0.3404   |   3.3010
```

After
```
Hint\Runtime |     64     |    256     |    4096
---------------------------------------------------
     64      |   0.0951   |   0.2289   |   3.3013
    256      |   0.0952   |   0.2258   |   3.4045
    4096     |   0.0957   |   0.2231   |   3.3146
```

We also see an average speedup of 5.04% for the matrix of all hint/runtime pairs in [64, 4096] for every increment of 64: https://docs.google.com/spreadsheets/d/12TmYUDrAAFASGuP3POXTKPeAvQWIRzKzdrVSIb3vQkA/edit?gid=480268938#gid=480268938

![Worst Case, multi-kernel](https://github.com/user-attachments/assets/712df23b-87e2-4d9d-95c2-cc25305ba2ed)

NB: This is just the beginning, and I plan on doing more investigation to further improve on this initial result.

For posterity the script used to generate that matrix is here: https://gist.github.com/bobrenjc93/c211fd0bd97fad8f46b91ad9dee76ad0

HUD benchmark runs:
base: https://github.com/pytorch/pytorch/actions/runs/15889871988
head: https://github.com/pytorch/pytorch/actions/runs/15889876842

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156628
Approved by: https://github.com/jansel
2025-07-11 19:38:10 +00:00
Xu Han
c4cdcda754 [aot] add format_consts_to_cpp function for further development. (#157608)
Changes:
1. Split out the `format_consts_to_asm` function, which is the current way to convert consts to an object file.
2. Add a `format_consts_to_cpp` function, which adds support for more compilers, such as `msvc` and `icx` (see the sketch after this list).
3. Add `config.aot_inductor.use_consts_asm_build` to control the choice between `format_consts_to_asm` and `format_consts_to_cpp`.
4. Add a UT for `format_consts_to_cpp`.
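
A hedged sketch of the general idea behind a C++ consts emitter; the actual `format_consts_to_cpp` signature and output format are not shown in this message, so everything below is illustrative:

```python
def format_consts_to_cpp(symbol_name: str, data: bytes) -> str:
    """Illustrative only: serialize raw constant bytes into a C++ translation
    unit that any C++ compiler (e.g. msvc, icx) can build into an object."""
    elems = ",".join(str(b) for b in data)
    return (
        "#include <cstddef>\n"
        f'extern "C" const unsigned char {symbol_name}[] = {{{elems}}};\n'
        f'extern "C" const std::size_t {symbol_name}_size = {len(data)};\n'
    )

print(format_consts_to_cpp("_binary_constants_bin", b"\x01\x02\x03"))
```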

For `format_consts_to_cpp`, I have tested it locally:
Case: https://docs.pytorch.org/docs/main/torch.compiler_aot_inductor.html
Run it and `cat` the generated cpp code:
<img width="674" alt="image" src="https://github.com/user-attachments/assets/d47ccf84-06d2-47f5-8a0d-9a43a9020aa3" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157608
Approved by: https://github.com/desertfire, https://github.com/jansel
2025-07-11 17:02:41 +00:00
Mwiza Kunda
ed508cc018 [inductor][triton] Add experimental use_tensor_descriptor config option (#157906)
Refactor to allow TMA descriptors to be used in general codegen. TMA descriptors can only be generated if the conditions listed in the triton documentation for [make_tensor_descriptor](https://triton-lang.org/main/python-api/generated/triton.language.make_tensor_descriptor.html) are met.

Some implementation details:
- The `TMACompatibilityChecker` class holds and checks the conditions required for a load / store operation to be represented by a tma descriptor load / store
- The current TMA API requires that the innermost block size loads at least 16 bytes of data, e.g. if the block shape is [YBLOCK, XBLOCK] and the tensor dtype is float32, this requires that XBLOCK >= 4 (see the illustration below). It is therefore required that the triton heuristics are aware of the minimum block sizes for the IO operations in the kernel. The minimum block sizes are determined in the `TMACompatibilityChecker` class and are passed to the triton heuristics when the block sizes are not static. The heuristic config options are then filtered to ensure that the minimum block size restriction is met.
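
For illustration, the 16-byte innermost-dimension requirement translates into a minimum block size roughly like this (the helper name is hypothetical):

```python
import torch

# Hypothetical helper: the innermost block must load at least 16 bytes,
# so the minimum inner block size depends on the element size of the dtype.
def min_inner_block_size(dtype: torch.dtype, min_bytes: int = 16) -> int:
    elem_size = torch.empty((), dtype=dtype).element_size()
    return max(1, min_bytes // elem_size)

assert min_inner_block_size(torch.float32) == 4  # 16 bytes / 4-byte elements
assert min_inner_block_size(torch.float16) == 8  # 16 bytes / 2-byte elements
```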

Testing:
- Refactored test_torchinductor_strided_blocks.py to also test the `use_tensor_descriptor` option.

This requires an upgrade to Triton version 3.4.0: https://github.com/pytorch/pytorch/issues/154206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157906
Approved by: https://github.com/jansel
2025-07-11 09:32:40 +00:00
Sam Larsen
5bd7804be2 Support caching if joint_custom_pre_pass/joint_custom_post_pass implement the proper interface (#157990)
Summary: Essentially, treat joint_custom_pre_pass/joint_custom_post_pass the same as post_grad_custom_post_pass/post_grad_custom_pre_pass.

Test Plan: More unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157990
Approved by: https://github.com/oulgen
2025-07-10 19:17:11 +00:00
Shangdi Yu
4781d72faa [AOTI] codegen for static linkage (#157129)
Design doc: https://docs.google.com/document/d/1ncV7RpJ8xDwy8-_aCBfvZmpTTL824C-aoNPBLLVkOHM/edit?tab=t.0 (internal)

- Add codegen for static linkage
- refactor test code for test_compile_after_package tests

For now, the option `"aot_inductor.package_cpp_only": True` must be used together with `"aot_inductor.compile_standalone": True`.

`"aot_inductor.package_cpp_only"` will be changed to be automatically set to True in a followup PR.

```
python test/inductor/test_aot_inductor_package.py -k test_compile_after_package
python test/inductor/test_aot_inductor_package.py -k test_run_static_linkage_model
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157129
Approved by: https://github.com/desertfire
2025-07-10 16:03:50 +00:00
IvanKobzarev
8dff457f42 [simple_fsdp] Port fx pass to bucket reduce_scatters (#157780)
Porting fx passes for reduce_scatters bucketing (similar to all_gather bucketing) for simple_fsdp and autoparallel testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157780
Approved by: https://github.com/wconstab
2025-07-10 14:04:43 +00:00
Paul Zhang
4cfc0a3208 [Inductor] Introduce Lookup Table for Overriding Triton Kernel autotune configs post fusion (#157924)
Summary:
Introduce a lookup table for kernels post fusion, hashed on the Inductor-generated source code.

Rollback Plan:

Differential Revision: D77866885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157924
Approved by: https://github.com/jansel
2025-07-10 03:23:50 +00:00
Xiangyang (Mark) Guo
b354328ecd [AOTI] add flag AOT_INDUCTOR_ENABLE_LTO (#157773)
Add the env var AOT_INDUCTOR_ENABLE_LTO to enable clang's ThinLTO by setting AOT_INDUCTOR_ENABLE_LTO=1. LTO is disabled by default because it may increase the build time.

Rollback Plan:

Differential Revision: D77899195

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157773
Approved by: https://github.com/desertfire
2025-07-09 16:54:19 +00:00
Shangdi Yu
effe376db0 Adding aoti_standalone config (#157731)
Summary: When `compile_standalone` is True, we set `package_cpp_only` to True as well. We raise an error if  `package_cpp_only` is explicitly set to False in config.

Test Plan:
```
buck2 run  mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r  TestAOTInductorConfig
```

Rollback Plan:

Differential Revision: D77889754

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157731
Approved by: https://github.com/desertfire
2025-07-09 04:30:04 +00:00
Sam Larsen
7a41f20794 [inductor] Quiesce Triton compile worker pool after each dynamo compile (#156187)
For internal usage, keeping the Triton compile worker pool active for the lifetime of the process has caused some challenges, e.g., it slows down and muddies profiling due to the huge number of threads on a box: N threads = 8 ranks * 32 subprocs * M threads started by torch. Also, each subproc can use more than 1GB. This PR adds the functionality to shut down worker subprocs after each dynamo compile when using the SubprocPool implementation. The idea is to leave the main sidecar process running, but signal it to tear down its internal ProcessPoolExecutor when compile is finished. Restarting the ProcessPoolExecutor is relatively fast, e.g., 500ms, because the ProcessPoolExecutor forks from the sidecar. Changes:
* Do not start the ProcessPoolExecutor automatically when compile_fx is imported. Instead, start the sidecar process only. The sidecar process imports torch, so it is still slow to start.
* Introduce wakeup() and quiesce() calls to the implementation to start and stop the ProcessPoolExecutor (see the sketch after this list).
* Add a context manager to automatically quiesce() at the end of dynamo compilation.
* Signal a wakeup() in compile_fx only when we have cuda devices.
* Add a killswitch so we can turn off quiescing.
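
A minimal sketch of the quiesce/wakeup idea using a plain ProcessPoolExecutor; the real SubprocPool keeps a sidecar process and forks the pool from it, so the class below is only illustrative:

```python
from concurrent.futures import ProcessPoolExecutor
from contextlib import contextmanager

class CompileWorkerPool:
    """Illustrative only: keep a handle, but only hold live worker
    subprocesses between wakeup() and quiesce()."""
    def __init__(self, nprocs: int = 32):
        self.nprocs = nprocs
        self._pool = None

    def wakeup(self):
        if self._pool is None:
            self._pool = ProcessPoolExecutor(self.nprocs)

    def quiesce(self):
        if self._pool is not None:
            self._pool.shutdown(wait=True)
            self._pool = None

    def submit(self, fn, *args):
        self.wakeup()
        return self._pool.submit(fn, *args)

@contextmanager
def quiesce_after_compile(pool: CompileWorkerPool):
    # Mirrors the idea of tearing workers down when a dynamo compile ends.
    try:
        yield pool
    finally:
        pool.quiesce()
```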

Testing:
For correctness, the stacked change at https://github.com/pytorch/pytorch/pull/156534 enables the feature for OSS so it's exercised in CI.

For performance, because of recent compile-time variance (see https://github.com/pytorch/pytorch/issues/152566), it's pretty hard to glean whether there's a regression....

* Training: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/masnesral/210/head&lCommit=1b7315031c3bfad66a1a01700167a9ca1a2ae5f1&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801
* Inference: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/masnesral/210/head&lCommit=1b7315031c3bfad66a1a01700167a9ca1a2ae5f1&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801

The wins (mostly for inference) don't make sense, but I'm also skeptical of the losses (mostly for training). I can't repro any of the slowdowns locally. Furthermore, check out the benchmarking results for the stacked diff, which actually enables the quiescing functionality for OSS. That should only slow down compile since there can only be overhead to stop and start the workers. But the results are somehow better:

* Training: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/masnesral/214/head&lCommit=41943253882a019b8ceafcd2bf4cd6acbe0cbca9&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801
* Inference: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/masnesral/214/head&lCommit=41943253882a019b8ceafcd2bf4cd6acbe0cbca9&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156187
Approved by: https://github.com/aorenste, https://github.com/jansel
2025-07-08 22:53:13 +00:00
IvanKobzarev
7b392bac13 all_gather_bucketing fx pass (#157396)
Porting passes to bucket all_gathers

The main logic of the pass is done via
1. Searching for all all_gathers to form the buckets (see the simplified sketch below).
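
A highly simplified, hypothetical sketch of the bucketing search over an FX graph; the op matching and bucket criterion below are placeholders, not the pass's actual logic:

```python
import torch.fx as fx

def find_all_gather_buckets(gm: fx.GraphModule, bucket_bytes: int):
    """Illustrative only: group all_gather call nodes into buckets capped at
    bucket_bytes of gathered input, walking the graph in order."""
    buckets, current, current_bytes = [], [], 0
    for node in gm.graph.nodes:
        if node.op == "call_function" and "all_gather" in str(node.target):
            nbytes = node.meta.get("tensor_bytes", 0)  # placeholder metadata key
            if current and current_bytes + nbytes > bucket_bytes:
                buckets.append(current)
                current, current_bytes = [], 0
            current.append(node)
            current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets
```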

Copying tests from @wconstab's PR to test compatibility with reordering.
The tests check only compatibility because, due to (3), the joint all_gather will already be scheduled as early as possible, leaving no room for reordering.

Pass changes:
Using mutation ops to match the performance of FSDP; in the future, the ideal scenario would be to have only a functional graph, with Inductor doing all memory optimizations on its own without mutable ops.

Inductor changes:
Adding foreach_copy_ lowering

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157396
Approved by: https://github.com/wconstab
2025-07-03 22:07:42 +00:00
Nicolas Macchioni
3bdd5ae334 [PT2] deprecate force_same_precision, guarded by JK (#156789)
Summary:
cuBLAS used to have strict alignment requirements for TF32 usage, even if TF32 was enabled by users; this caused a numeric SEV in the past, when Triton would use TF32 even if cuBLAS could not due to failing the alignment checks

we believe that cuBLAS no longer has alignment requirements for TF32 usage, based on some testing in D77265581; we'd like to deprecate `force_same_precision` since it no longer functions as expected

changing the default to False in fbcode, guarded by a jk so that we can quickly revert to the original behavior if needed

Test Plan:
CI

Rollback Plan:

Differential Revision: D77265930

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156789
Approved by: https://github.com/jhadidjojo, https://github.com/masnesral
2025-06-27 00:43:06 +00:00
Nicolas Macchioni
13efb2c858 [BE] Deprecate search_autotune_cache (#155302)
We haven't had the offline cache populated in over a year, so this *should* be safe; if this passes, we can finally go through and rip out the offline cache logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155302
Approved by: https://github.com/masnesral
2025-06-26 17:30:08 +00:00
James Wu
e581f015ee Bump STATIC_CUDA_LAUNCHER_VERSION to 2 (#156726)
Differential Revision: [D77241813](https://our.internmc.facebook.com/intern/diff/D77241813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156726
Approved by: https://github.com/oulgen
2025-06-26 01:50:51 +00:00
Boyuan Feng
1044934878 [CUDAGraph] add config cudagraph_capture_sizes (#156551)
Users may want CUDAGraph for certain sizes and to fall back for other sizes.

As discussed in Issue #121968, we would like to use cudagraph for [batch size [1,2,3,...,16]](https://github.com/pytorch/pytorch/issues/121968#issuecomment-2259942345) and fall back for others.

Another use case is [vllm](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/cuda_piecewise_backend.py#L114-L119), where 67 batch sizes (i.e., [1,2,4,8,16,24,32,...,512]) are captured and all other sizes fall back.

This PR implements the feature with `torch._inductor.config.triton.cudagraph_capture_sizes`. When it is specified, we only capture cudagraph for these shapes. When it is None (by default), we capture cudagraph for all shapes.

Example:
```python
import torch

torch._inductor.config.triton.cudagraph_capture_sizes = [(2,3), (4,5), (6, 2), (7,3)]

def f(x):
    return x + 1

f = torch.compile(f, mode="reduce-overhead", dynamic=False)

def run(batch_size, seq_len, d):
    x = torch.randn((batch_size, seq_len, d), device="cuda")
    # Need to mark the dimension as dynamic. Automated-dynamic
    # may have some ux issues on matching `cudagraph_capture_sizes`
    # with the actual dynamic shapes, since there are specialization and
    # multiple dynamo graphs.
    torch._dynamo.mark_dynamic(x, 0)
    torch._dynamo.mark_dynamic(x, 1)
    for _ in range(3):
        f(x)

for i in range(2, 10):
    for j in range(2, 10):
        run(i, j, 8)

num_cudagraph = torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id()
assert num_cudagraph.id == 4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156551
Approved by: https://github.com/bobrenjc93
2025-06-24 05:14:49 +00:00
Xuehai Pan
6ff6630375 [BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313
Approved by: https://github.com/jingsh
2025-06-23 02:57:12 +00:00
leslie-fang-intel
c55eef79f8 [Inductor][CPP] Enable a config to use a small dequant buffer for woq int4 (#156395)
**Summary**
Add a configuration option to enable a smaller dequantization buffer for WOQ INT4 CPP GEMM template. This can improve the performance of the WOQ INT4 GEMM template in cases where M is small. In such scenarios, matrix B cannot be effectively reused across matrix A, and we found that reducing the Kc block size can lead to better performance.

**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_with_small_buffer_config
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156395
Approved by: https://github.com/jansel
ghstack dependencies: #156407, #156387
2025-06-23 02:00:42 +00:00
Jack Taylor
03023f178c FlexAttn config refactor + ROCm optimisations (#156307)
This PR primarily unifies the flex attention config logic with the GEMM/Conv config approach (https://github.com/pytorch/pytorch/pull/147452); this will make it much easier to handle optimisation pathways for particular triton backends.

This PR also:
1. Introduces an exhaustive tuning mode for flex attention via TORCHINDUCTOR_MAX_AUTOTUNE_FLEX_SEARCH_SPACE="EXHAUSTIVE" to allow for wide-scale benchmarking for perf-investigation use cases.
2. Updates configs for the ROCm flex autotune path, providing perf optimisations.

AMD perf numbers on score mod benchmark (default inputs)
flex_attn | mode | Speedup (Avg) | Speedup (Max)
-- | -- | -- | --
fwd | autotune before PR | 2.608 | 20.56
fwd | autotune after PR | 2.862 | 22
fwd | exhaustive_autotune | 2.943 | 22.471
bwd | autotune before PR | 2.196 | 9.831
bwd | autotune after PR | 2.423 | 11.331
bwd | exhaustive_autotune | 2.566 | 13.87

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156307
Approved by: https://github.com/drisspg, https://github.com/jansel
2025-06-22 22:27:38 +00:00
PyTorch MergeBot
f1331f3f1b Revert "[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313)"
This reverts commit 3627270bdf.

Reverted https://github.com/pytorch/pytorch/pull/156313 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
Xuehai Pan
3627270bdf [BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313
Approved by: https://github.com/jingsh
2025-06-22 08:43:09 +00:00
Menglu Yu
0ad88a2224 Support environment var for autotune log (#156254)
Summary: As titled.

Test Plan:
See the Sandcastle signal

Rollback Plan:

Differential Revision: D76860928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156254
Approved by: https://github.com/Mingming-Ding
2025-06-20 23:06:33 +00:00
Shunting Zhang
39270430c9 [inductor] force min num-split (off by default) (#155941)
This is a fix for the 10% QPS regression of some internal model (internal doc: [here](https://docs.google.com/document/d/19EiSZSS_SNUNfRg3jmevyrDs9nVpyvyGX_LHfiz-SbU/edit?tab=t.0#heading=h.dim0r28ztzu5) and [here](https://docs.google.com/document/d/1DjRWJPl1cgpceaj8YXTyw6FubGb43Vw-lTAETF9XXnI/edit?tab=t.0#heading=h.ld0vvn8o77sp) ).

The regression is caused by unrepresentative example inputs for compilation with dynamic shapes. While the general problem is hard to solve and requires more work, for this specific one there is a quick fix. When we compile LayerNormBackward with a small xnumel and a large rnumel, we do a split reduction. With unrepresentative inputs, rnumel may be something in the 4K range and we pick a small num-split (9 in this specific case). Later on, when we get inputs with a larger rnumel (100K range; no recompile since dynamic shapes are enabled), the small num-split does not introduce enough parallelism and causes sub-optimal performance.

The quick fix is to force a minimum value for num_split. Let's say we split a reduction [xnumel, rnumel] into two kernels in this order:
- [xnumel * num_split, rnumel / num_split]
- [xnumel, num_split]

A larger num_split always introduces more parallelism for kernel 1. It may result in more work in kernel 2. But if we set the minimum num_split to something not too large (like 256), each row in kernel 2 may still be able to be reduced with a few or even a single warp, so there may be no slowdown for kernel 2. (See the sketch below.)
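
An illustrative sketch of clamping the split count; the concrete heuristic and whether 256 is the exact floor are assumptions based on this description:

```python
# Illustrative sketch of forcing a minimum split count when planning a
# split reduction over [xnumel, rnumel]; the value 256 is an assumed floor.
MIN_NUM_SPLIT = 256

def apply_min_num_split(heuristic_num_split: int, rnumel: int) -> int:
    # Never ask for more splits than there are reduction elements.
    return min(max(heuristic_num_split, MIN_NUM_SPLIT), rnumel)

# With an unrepresentative small hint the heuristic might pick 9 splits;
# forcing a floor of 256 keeps kernel 1 parallel enough for large rnumel.
assert apply_min_num_split(9, 100_000) == 256
```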

Here are some benchmarking results.
```
import torch
from triton.testing import do_bench
import functools
from torch._inductor import config
from torch._dynamo.decorators import mark_dynamic
import os

@torch.compile(dynamic=True)
def f(x):
    return x.sum(dim=0)

N = 512
C = functools.partial(torch.randn, device="cuda")
x_small = C(4096, N)
x_large = C(4096 * 1000, N)

if os.getenv("HINT_WITH_SMALL_INPUT") == "1":
    x = x_small
else:
    x = x_large

mark_dynamic(x, 0)
f(x)

ms = do_bench(lambda: f(x_large))

# 4.03ms if hint with large input. Output code: https://gist.github.com/shunting314/0be562a0c14f8ec0852b12bbf53d7a15
# 8.32ms if hint with small input. Output code: https://gist.github.com/shunting314/79b924c266d5c562703c3bdfb48d8272
# 3.92ms if hint with small input, and force min num split: Output code: https://gist.github.com/shunting314/c82917a1849b698bf4d2be2fde2fd2ba
print(ms)
```
This test mimics what we see in the original problem.

- If we compile with large inputs and benchmark with large inputs, latency is 4.03ms.
- If we compile with small inputs but benchmark with large inputs, we get more than a 2x slowdown; latency is 8.32ms.
- With the fix, even if we compile with small inputs and benchmark with large inputs, latency is 3.92ms. The perf is slightly better than in the first case, so it's possible that the heuristic deciding num-split still has room to improve.

The minimum num-split restriction could be applied to the dynamic-shape case only, but I found it can also help static-shape cases a little bit. So I plan to apply it without checking for dynamic shapes for now, unless I see red signals in a thorough perf test.
- Outer reduction with static shape: https://gist.github.com/shunting314/6a670a818e63533479399c4dbea5b29a . The fix improve perf from 0.01 ms to 0.009 ms
- Inner reduction with static shape: https://gist.github.com/shunting314/f12f20099126130b953e55ad325c0f62  Perf is neutral (0.011 ms v.s. 0.011ms)

A thorough perf test is running here: https://github.com/pytorch/pytorch/actions/runs/15642912325

# Update for not applying the change to static shape:
From the perf test result [here](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2009%20Jun%202025%2020%3A57%3A15%20GMT&stopTime=Mon%2C%2016%20Jun%202025%2020%3A57%3A15%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/210/head&lCommit=62b8e191e027842d402fb046a429732616f87570&rBranch=main&rCommit=5b9db4335e61c1c903cb0769282cbea588e49036), it looks like the change hurts perf for the static-shape case. I think one reason is that the change may increase the number of kernels and lose some fusion opportunities. Check the following code for example:
```
import torch
from torch._inductor import config

aten = torch.ops.aten

def f(x):
    return aten.bernoulli(x).sum()

x = torch.randn(8000 * 3, dtype=torch.bfloat16, device="cuda")
torch.compile(f)(x)
```

With the change, the bernoulli kernel would NOT be able to fuse with the first-layer reduction because 8000 * 3 is not divisible by 256. Potentially we could improve the change to always pick a num-split greater than 256 and divisible by rnumel. But I'll simply apply the change for dynamic shapes for now since that's the original issue.

Another perf test only applying min-num-split to dynamic shape [here](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2011%20Jun%202025%2018%3A14%3A04%20GMT&stopTime=Wed%2C%2018%20Jun%202025%2018%3A14%3A04%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/210/head&lCommit=e7b2cf55f30a585acd4d907fc9127fcb30a256cc&rBranch=main&rCommit=d3d655ad14ee4cd1c135ac57bbf75d5623fc9fa6)

Differential Revision: [D76625617](https://our.internmc.facebook.com/intern/diff/D76625617)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155941
Approved by: https://github.com/jansel, https://github.com/bobrenjc93
2025-06-20 18:01:28 +00:00
CaoE
159a39ad34 Add an option for cpp_wrapper to compile entry and kernel separately (#156050)
Fixes #156037.
Compiling the entry and the kernel separately has a non-negligible impact on performance. This PR adds an option for cpp_wrapper to control whether to compile the entry and the kernel separately, and turns it off by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156050
Approved by: https://github.com/leslie-fang-intel, https://github.com/benjaminglass1, https://github.com/jansel
2025-06-20 01:11:16 +00:00
Shangdi Yu
eaf704914e [aoti] package weights to disk and dedup (#155241)
We package the weights and save them in `data/weights/` (`WEIGHTS_DIR`). In addition, we store a `weights_config.json` in the model folder for each model to specify which weight file corresponds to which weight name.

Models can share weights. We dedup the weights based on their underlying storage (`tensor.untyped_storage()`).
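
A small sketch of the storage-based dedup idea; the helper name and file-path scheme are hypothetical, only `untyped_storage()` comes from this message:

```python
import torch

def dedup_weights_by_storage(named_weights):
    """Illustrative only: map each distinct underlying storage to one
    canonical weight file entry, so models sharing storage share the file."""
    storage_to_file = {}
    weight_to_file = {}
    for name, tensor in named_weights:
        storage_id = tensor.untyped_storage().data_ptr()
        if storage_id not in storage_to_file:
            storage_to_file[storage_id] = f"weights/{name}.bin"  # placeholder path
        weight_to_file[name] = storage_to_file[storage_id]
    return weight_to_file

w = torch.randn(4, 4)
mapping = dedup_weights_by_storage([("a.weight", w), ("b.weight", w)])
assert mapping["a.weight"] == mapping["b.weight"]
```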

- Use `"aot_inductor.package_constants_on_disk": True` config to produce the `Weights` in aot_compile
- If we see `Weights` in aoti_files, we'll automatically package them to disk
- `"aot_inductor.package_constants_on_disk"` config and `"aot_inductor.package_constants_in_so"` config work independently.
- Use `load_pt2(package_path, load_weights_from_disk=True)` to load the weights from disk. `load_weights_from_disk` defaults to False.

Test Plan:
```
buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_shared_weights"
```

Tested with whisper at https://github.com/pytorch-labs/torchnative/pull/7

Rollback Plan:

Differential Revision: D74747190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155241
Approved by: https://github.com/desertfire
2025-06-19 17:17:17 +00:00
Ruben Rodriguez Buchillon
bdb1553b77 [inductor][cutlass] binary remote cache (#156248)
Summary:
# Why

speed up cutlass kernel generation and retrieval

# What

Using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. Only register the handler internally.

This is the OSS-only part of the change, to facilitate integration.

Test Plan:
## prove that we can upload successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
      673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
      649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can download successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can upload errors successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
        4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
        4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## prove that we can download errors successfully

```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
I0611 12:56:14.079000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## showing timing information

```
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s)
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s)
```

Reviewed By: henrylhtsang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156248
Approved by: https://github.com/henrylhtsang
2025-06-18 06:51:22 +00:00
Oguz Ulgen
8e02cd9c5a Skip cache related configs for cache config serialization (#156195)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156195
Approved by: https://github.com/masnesral
2025-06-17 21:24:07 +00:00
PyTorch MergeBot
ec08eb8ba2 Revert "[inductor][cutlass] binary remote cache (#156106)"
This reverts commit 9a2c669425.

Reverted https://github.com/pytorch/pytorch/pull/156106 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/156106#issuecomment-2981533904))
2025-06-17 19:07:49 +00:00
Ruben Rodriguez Buchillon
9a2c669425 [inductor][cutlass] binary remote cache (#156106)
Summary:
# Why

speed up cutlass kernel generation and retrieval

# What

Using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. Only register the handler internally.

Test Plan:
## prove that we can upload successfully
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
      673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
      649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can download successfully
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can upload errors successfully
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
        4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
        4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## prove that we can download errors successfully

```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
I0611 12:56:14.079000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## showing timing information

```
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s)
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s)
```

Rollback Plan:

Reviewed By: henrylhtsang

Differential Revision: D76454741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156106
Approved by: https://github.com/henrylhtsang

Co-authored-by: atalman <atalman@fb.com>
2025-06-17 16:24:10 +00:00
penknife6153
3e38feb05f [inductor] Add configuration control for CUTLASS operation selection. (#155770)
Added a new configuration option `cutlass_enabled_ops` that allows users to control which operations use CUTLASS lowerings. By default, CUTLASS is enabled for all operations (maintaining backward compatibility), but users can now selectively enable it only for specific operations to optimize compilation time.

**Fixes #155718**

## Usage Examples

```bash
# Enable CUTLASS for all operations (default behavior)
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="ALL"

# Enable CUTLASS only for matrix multiplication operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="mm,addmm"

# Enable CUTLASS only for batch operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="bmm,baddbmm"

# Disable CUTLASS for all operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS=""
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155770
Approved by: https://github.com/henrylhtsang
2025-06-14 08:19:54 +00:00
Sean McGovern
297805fd8f Typo fixes for "overridden" in comments and function names (#155944)
This word appears often in class descriptions and is not consistently spelled. Update comments and some function names to use the correct spelling consistently. Facilitates searching the codebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155944
Approved by: https://github.com/Skylion007
2025-06-14 03:37:38 +00:00
Bin Bao
f151b20123 [AOTI] Remove the emit_current_arch_binary option (#155768)
Summary: Remove the option, as generating a fatbin with PTX only doesn't work on H100; switch to always including one PTX and one SASS in the fatbin.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155768
Approved by: https://github.com/angelayi
2025-06-13 02:06:07 +00:00
Brian Hirsh
a2b0b2698d inductor codecache: include private inductor configs in cache key (#153672)
Fixes https://github.com/pytorch/torchtitan/issues/1185

It looks like inductor's logic to include inductor configs in the cache key skips configs with a leading underscore by default. This came up in torchtitan - there's an asyncTP pipelining pass in inductor gated by a private config, and by not caching on the config we were attempting to use asyncTP when we shouldn't be.
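
A toy illustration of the behavior being fixed; the config names below are placeholders, not the actual inductor cache-key code:

```python
# Toy illustration: a cache key built only from non-underscore configs would
# ignore changes to private options like the (placeholder) "_async_tp_pass".
configs = {"max_autotune": True, "_async_tp_pass": True}

old_key_fields = {k: v for k, v in configs.items() if not k.startswith("_")}
new_key_fields = dict(configs)  # after the fix: private configs are included

assert "_async_tp_pass" not in old_key_fields
assert "_async_tp_pass" in new_key_fields
```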

I'm not sure how worried we should be on the blast radius of this change. On the one hand:

(1) it technically fixes any silent correctness issues in the cache around any other private inductor configs (it looks like there are a few)

(2) there is some risk that there are some "harmless" configs that we are now including in the key, which may increase false negatives. I do see that there is an explicit list for "configs we want to ignore for caching" (`_save_config_ignore`), so my hope is that all harmless configs are already encapsulated there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153672
Approved by: https://github.com/oulgen
2025-06-11 01:33:24 +00:00
Michael Lazos
5dfe1787b5 [Inductor] Limit fusions to a node distance of 64 (#154688)
fix for https://github.com/pytorch/pytorch/issues/154652 and https://fb.workplace.com/groups/1075192433118967/permalink/1484799079148049/

[window 128 dashboard run here w/ no regressions](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Sun%2C%2001%20Jun%202025%2006%3A38%3A41%20GMT&stopTime=Sun%2C%2008%20Jun%202025%2006%3A38%3A41%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=mlazos/fuse-window&lCommit=8576f00ebfa53567d7bddc89d9882df9eb990561&rBranch=main&rCommit=9d59b516e9b3026948918e3ff8c2ef55a33d13ad)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154688
Approved by: https://github.com/eellison, https://github.com/Raymo111
2025-06-10 07:32:23 +00:00