Commit Graph

710 Commits

Author SHA1 Message Date
Sam Larsen
af39144a93 Don't use torch.backends.cuda.matmul.allow_tf32 in inductor cache key (#159480)
Summary: According to https://github.com/pytorch/pytorch/pull/158209, the API is deprecated and we should be using torch.backends.cuda.matmul.fp32_precision instead.

Fixes https://github.com/pytorch/pytorch/issues/159440

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159480
Approved by: https://github.com/xmfan, https://github.com/oulgen
2025-07-30 21:29:38 +00:00
PyTorch MergeBot
d987a6f7f0 Revert "[Dynamo][Better Engineering] Add typing annotations to guard and source (#158397)"
This reverts commit abcb24f4de.

Reverted https://github.com/pytorch/pytorch/pull/158397 on behalf of https://github.com/yangw-dev due to Suggested to fix failing internal signals on D78911890 ([comment](https://github.com/pytorch/pytorch/pull/158397#issuecomment-3133823766))
2025-07-29 19:49:40 +00:00
Lucas Kabela
abcb24f4de [Dynamo][Better Engineering] Add typing annotations to guard and source (#158397)
As part of better engineering week, we would like to improve out type support to improve dev experience in dynamo

This PR adds strict typing support to a critical set of files for dynamo, `source.py` and the base `_guards.py`

Running
```
mypy torch/_dynamo/source.py torch/_guards.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  1227 | 2208 | 55.57% | 207 | 362 | 57.18% |
| This PR | 2217 | 2217 | 100.00% | 362 | 362 | 100.00% |
| Delta    | +990 | +9 | +44.43% | +155 | 0 | +42.82% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158397
Approved by: https://github.com/anijain2305
2025-07-24 15:55:18 +00:00
Xu Han
5e386eec94 [AOTI] enable aot inductor on Windows (#158915)
With many PRs landed, we can run the first aot inductor example on Windows.

<img width="640" height="427" alt="image" src="https://github.com/user-attachments/assets/131db159-ce17-4857-a3d5-a4b03638f01d" />

Let's remove the Windows check on `AotCodeCompiler`.

CC: @angelayi , @desertfire , @jansel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158915
Approved by: https://github.com/desertfire
2025-07-23 16:29:15 +00:00
Mwiza Kunda
d3d9bc1c31 [inductor] Allow backends to register their own custom config object (#158254)
An out of tree backend can have its own configuration options that the user can enable to control inductor compilation. These config options need to be taken into account when calculating the key that is used to determine cache miss / hits. This PR allows out of tree backends to specify a custom config module that has the same type as `torch._inductor.config` that can be used to control codegen (in addition to the default config), and will be used when creating the cache key.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158254
Approved by: https://github.com/eellison
2025-07-23 15:56:06 +00:00
PyTorch MergeBot
7d6f340238 Revert "[AOTI] Add more default options to compile_standalone (#158560)"
This reverts commit a991e285ae.

Reverted https://github.com/pytorch/pytorch/pull/158560 on behalf of https://github.com/jeffdaily due to broke rocm CI, no test signal was available from rocm ciflow/trunk, need to add ciflow/rocm to reland ([comment](https://github.com/pytorch/pytorch/pull/158560#issuecomment-3103633964))
2025-07-22 16:20:17 +00:00
Bin Bao
a991e285ae [AOTI] Add more default options to compile_standalone (#158560)
Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560
Approved by: https://github.com/yushangdi
2025-07-21 21:16:48 +00:00
Xu Han
16b21fa8b2 [AOTI] skip ld and objcopy on Windows. (#158545)
Skip `ld` and `objcopy` on Windows. They are not support on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158545
Approved by: https://github.com/desertfire
2025-07-17 15:43:24 +00:00
Xu Han
da4c7b4ced [AOTI] align signature to model_base.h (#158554)
Remove `const` keyword, align its signature to `model_base.h` eeda1a75ac/torch/csrc/inductor/aoti_runtime/model_base.h (L51-L53)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158554
Approved by: https://github.com/desertfire
2025-07-17 14:44:32 +00:00
Xu Han
a04bd11895 [AOTI] Use format_consts_to_cpp on Windows. (#158543)
`format_consts_to_asm` is not supported on Windows, force use `format_consts_to_cpp` on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158543
Approved by: https://github.com/desertfire
2025-07-17 14:40:34 +00:00
Han, Xu
4805a6ead6 [aot][XPU] switch xpu to use consts cpp build. (#158425)
Intel compiler is not support `format_consts_to_asm`, let's use `format_consts_to_cpp`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158425
Approved by: https://github.com/jansel
2025-07-16 16:19:33 +00:00
henrylhtsang
7e433d5f42 [cutlass backend] cache a few things for codegen and properties (#158158)
Differential Revision: [D78193404](https://our.internmc.facebook.com/intern/diff/D78193404/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158158
Approved by: https://github.com/ColinPeppler
2025-07-15 00:18:31 +00:00
bobrenjc93
5221448574 multi-kernel matmuls based on varying hint sizes (#156628)
The core idea is to generate multiple matmul kernels using different hints for symbolic variables, then select the most appropriate one at runtime for each unique shape we encounter. You can find some early experimentation details in these posts:

https://fb.workplace.com/groups/8940092306109185/posts/9803850776399996/
https://fb.workplace.com/groups/8940092306109185/posts/9695805170537891/
https://fb.workplace.com/groups/257735836456307/posts/906589324904285/

Here’s a graph illustrating the empirically observed worst-case performance if an oracle always selected the least optimal hint for a given runtime size:

![image](https://github.com/user-attachments/assets/6d90ee06-a572-453e-9cba-03006f343301)

This graph illustrates the performance of a hint size of 64 relative to the worst case. Notice that as the runtime sizes increase, the performance gradually approaches the worst case:

![image](https://github.com/user-attachments/assets/85ad49fe-165a-474c-8d03-db2e57654213)

This graph shows the performance of a hint size of 4096 — very poor for small sizes, and also suboptimal for some mid-sized shapes:

![image](https://github.com/user-attachments/assets/adea1106-3bc8-40f3-97b0-20d940fb74f1)

Finally, here’s the graph that motivated this PR. It illustrates the performance when selecting the best of three kernels generated with three different hints — 64, 256, and 4096:

![image](https://github.com/user-attachments/assets/a7cb0ce5-8139-48b1-b5c9-7670e75cbfce)

## How to review this PR

At a high level, this extends @shunting314's multi-kernel abstraction to support varying GEMM choices driven by different hints. A few key points:

1. Unlike reduction kernels, triton template matmuls pass their grid as arguments to the kernel. This PR updates `MultiKernelCall` to support kernels with varying arguments.
2. The `V.graph.sizevars.size_hints` API is extended to accept a `hint_override`, allowing us to substitute the example input’s size hint with a custom value when generating multiple kernels.
3. The choice generation and benchmarking logic is updated to support multiple hint values. One kernel is generated per value in `torch._inductor.config.multi_kernel_hints`, and at runtime, we select the most suitable kernel for the current shape.
4. This PR does not add support for cpp wrapper codegen to keep it scoped. That will be added in the next PR.

## Results

The following is a basic test that shows our basic multi kernel working where we no longer show significant variance based on the original hint size: https://gist.github.com/bobrenjc93/ba711d529e65fd65839b34799f6323ec

Before
```
Hint\Runtime |     64     |    256     |    4096
---------------------------------------------------
     64      |   0.0948   |   0.3124   |   4.9477
    256      |   0.2243   |   0.2256   |   3.3880
    4096     |   0.3384   |   0.3404   |   3.3010
```

After
```
Hint\Runtime |     64     |    256     |    4096
---------------------------------------------------
     64      |   0.0951   |   0.2289   |   3.3013
    256      |   0.0952   |   0.2258   |   3.4045
    4096     |   0.0957   |   0.2231   |   3.3146
```

We also see an average speedup of 5.04% for the matrix of all hint/runtime pairs in [64, 4096] for every increment of 64: https://docs.google.com/spreadsheets/d/12TmYUDrAAFASGuP3POXTKPeAvQWIRzKzdrVSIb3vQkA/edit?gid=480268938#gid=480268938

![Worst Case, multi-kernel](https://github.com/user-attachments/assets/712df23b-87e2-4d9d-95c2-cc25305ba2ed)

NB: This is just the beginning and I plan on doing more investigation to see further improve on this initial result.

For posterity the script used to generate that matrix is here: https://gist.github.com/bobrenjc93/c211fd0bd97fad8f46b91ad9dee76ad0

HUD benchmark runs:
base: https://github.com/pytorch/pytorch/actions/runs/15889871988
head: https://github.com/pytorch/pytorch/actions/runs/15889876842

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156628
Approved by: https://github.com/jansel
2025-07-12 15:08:21 +00:00
Xu Han
aacb944079 [aot inductor] fix clang-asan for consts_cpp. (#158175)
From the perivous PR: https://github.com/pytorch/pytorch/pull/157608 , I added `format_consts_to_cpp` to build consts bytes.

But it still raise clang ASAN `stack alloction`, when build large size consts.

This PR:
1. add `test_aot_inductor_consts_cpp_build` to stack allocation skip list.
2. add ATTRIBUTE_NO_SANITIZE_ADDRESS to skip ASAN check, because consts array is locate in global area.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158175
Approved by: https://github.com/jansel
2025-07-12 07:14:05 +00:00
Xuehai Pan
7f14b42adf [BE][2/16] fix typos in torch/ (torch/_*/) (#156312)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156312
Approved by: https://github.com/albanD
2025-07-12 05:47:06 +00:00
PyTorch MergeBot
9c189ed29a Revert "multi-kernel matmuls based on varying hint sizes (#156628)"
This reverts commit 6c79530637.

Reverted https://github.com/pytorch/pytorch/pull/156628 on behalf of https://github.com/huydhn due to Sorry for reverting your change but some ROCM jobs went crazy after this lands, so I try to see if reverting helps ([comment](https://github.com/pytorch/pytorch/pull/156628#issuecomment-3064617123))
2025-07-12 03:48:39 +00:00
bobrenjc93
6c79530637 multi-kernel matmuls based on varying hint sizes (#156628)
The core idea is to generate multiple matmul kernels using different hints for symbolic variables, then select the most appropriate one at runtime for each unique shape we encounter. You can find some early experimentation details in these posts:

https://fb.workplace.com/groups/8940092306109185/posts/9803850776399996/
https://fb.workplace.com/groups/8940092306109185/posts/9695805170537891/
https://fb.workplace.com/groups/257735836456307/posts/906589324904285/

Here’s a graph illustrating the empirically observed worst-case performance if an oracle always selected the least optimal hint for a given runtime size:

![image](https://github.com/user-attachments/assets/6d90ee06-a572-453e-9cba-03006f343301)

This graph illustrates the performance of a hint size of 64 relative to the worst case. Notice that as the runtime sizes increase, the performance gradually approaches the worst case:

![image](https://github.com/user-attachments/assets/85ad49fe-165a-474c-8d03-db2e57654213)

This graph shows the performance of a hint size of 4096 — very poor for small sizes, and also suboptimal for some mid-sized shapes:

![image](https://github.com/user-attachments/assets/adea1106-3bc8-40f3-97b0-20d940fb74f1)

Finally, here’s the graph that motivated this PR. It illustrates the performance when selecting the best of three kernels generated with three different hints — 64, 256, and 4096:

![image](https://github.com/user-attachments/assets/a7cb0ce5-8139-48b1-b5c9-7670e75cbfce)

## How to review this PR

At a high level, this extends @shunting314's multi-kernel abstraction to support varying GEMM choices driven by different hints. A few key points:

1. Unlike reduction kernels, triton template matmuls pass their grid as arguments to the kernel. This PR updates `MultiKernelCall` to support kernels with varying arguments.
2. The `V.graph.sizevars.size_hints` API is extended to accept a `hint_override`, allowing us to substitute the example input’s size hint with a custom value when generating multiple kernels.
3. The choice generation and benchmarking logic is updated to support multiple hint values. One kernel is generated per value in `torch._inductor.config.multi_kernel_hints`, and at runtime, we select the most suitable kernel for the current shape.
4. This PR does not add support for cpp wrapper codegen to keep it scoped. That will be added in the next PR.

## Results

The following is a basic test that shows our basic multi kernel working where we no longer show significant variance based on the original hint size: https://gist.github.com/bobrenjc93/ba711d529e65fd65839b34799f6323ec

Before
```
Hint\Runtime |     64     |    256     |    4096
---------------------------------------------------
     64      |   0.0948   |   0.3124   |   4.9477
    256      |   0.2243   |   0.2256   |   3.3880
    4096     |   0.3384   |   0.3404   |   3.3010
```

After
```
Hint\Runtime |     64     |    256     |    4096
---------------------------------------------------
     64      |   0.0951   |   0.2289   |   3.3013
    256      |   0.0952   |   0.2258   |   3.4045
    4096     |   0.0957   |   0.2231   |   3.3146
```

We also see an average speedup of 5.04% for the matrix of all hint/runtime pairs in [64, 4096] for every increment of 64: https://docs.google.com/spreadsheets/d/12TmYUDrAAFASGuP3POXTKPeAvQWIRzKzdrVSIb3vQkA/edit?gid=480268938#gid=480268938

![Worst Case, multi-kernel](https://github.com/user-attachments/assets/712df23b-87e2-4d9d-95c2-cc25305ba2ed)

NB: This is just the beginning and I plan on doing more investigation to see further improve on this initial result.

For posterity the script used to generate that matrix is here: https://gist.github.com/bobrenjc93/c211fd0bd97fad8f46b91ad9dee76ad0

HUD benchmark runs:
base: https://github.com/pytorch/pytorch/actions/runs/15889871988
head: https://github.com/pytorch/pytorch/actions/runs/15889876842

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156628
Approved by: https://github.com/jansel
2025-07-11 19:38:10 +00:00
Xu Han
c4cdcda754 [aot] add format_consts_to_cpp function for further development. (#157608)
Changes:
1. Split `format_consts_to_asm` function, which is current way to convert consts to object.
2. Add `format_consts_to_cpp` function, which would support for more compiler support, such as `msvc` and `icx`.
3. Add `config.aot_inductor.use_consts_asm_build` for `format_consts_to_asm` and `format_consts_to_cpp` control.
4. Add UT for `format_consts_to_cpp`.

For `format_consts_to_cpp`, I have local tested it:
Case: https://docs.pytorch.org/docs/main/torch.compiler_aot_inductor.html
Run it and `cat` cpp code:
<img width="674" alt="image" src="https://github.com/user-attachments/assets/d47ccf84-06d2-47f5-8a0d-9a43a9020aa3" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157608
Approved by: https://github.com/desertfire, https://github.com/jansel
2025-07-11 17:02:41 +00:00
Sam Larsen
5bd7804be2 Support caching if joint_custom_pre_pass/joint_custom_post_pass implement the proper interface (#157990)
Summary: Essentially, treat joint_custom_pre_pass/joint_custom_post_pass the same as post_grad_custom_post_pass/post_grad_custom_pre_pass.

Test Plan: More unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157990
Approved by: https://github.com/oulgen
2025-07-10 19:17:11 +00:00
Shangdi Yu
4781d72faa [AOTI] codegen for static linkage (#157129)
Design doc: https://docs.google.com/document/d/1ncV7RpJ8xDwy8-_aCBfvZmpTTL824C-aoNPBLLVkOHM/edit?tab=t.0 (internal)

- Add codegen for static linkage
- refactor test code for test_compile_after_package tests

For now,  the following options must be used together with `"aot_inductor.compile_standalone": True`.
"aot_inductor.package_cpp_only": True,

Will change `"aot_inductor.package_cpp_only"` to be automatically set to True in followup PR.

```
python test/inductor/test_aot_inductor_package.py -k test_compile_after_package
python test/inductor/test_aot_inductor_package.py -k test_run_static_linkage_model
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157129
Approved by: https://github.com/desertfire
2025-07-10 16:03:50 +00:00
Henry Tsang
54a4d34d10 [fbcode] switch to cutlass-4 (#157579)
Summary: Update cutlass version to 4. For most use cases.

Test Plan:
testing in progress

Rollback Plan:

Differential Revision: D77605011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157579
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2025-07-07 14:12:33 +00:00
Nicolas Macchioni
94716db222 [BE][DCE] eliminate remnants of global gemm cache (#157327)
Summary: The global gemm cache has not been maintained in ~1 year, and the only entry point (`search_autotune_cache`) was recently deprecated. Meaning, this is now dead code that we can remove.

Test Plan:
CI

Rollback Plan:

Differential Revision: D77520979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157327
Approved by: https://github.com/jansel
2025-07-02 19:52:35 +00:00
Jason Ansel
f8cc4c0af8 [inductor] Update triton_key import to support latest Triton (#157242)
With Triton main things were failing with:
```py
  File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 205, in get_system
    from triton.compiler.compiler import triton_key
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler' (/home/jansel/pytorch/triton/compiler/compiler.py)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157242
Approved by: https://github.com/aorenste
2025-06-30 17:51:43 +00:00
Han, Xu
39b71d11fc [Inductor] add pedantic to limit inductor code follow standard. (#156914)
### Background:

During my development work, I found Windows msvc don't support to compile zero size array, please reference: https://github.com/pytorch/pytorch/issues/153180

As discussed with MSFT engineer, we found zero size array don't align to c++ standard, though gcc/clang can support it. When we add `-pedantic` option to gcc, it should check and raise c++ standard strictly. Reference: https://github.com/pytorch/pytorch/issues/153180#issuecomment-2986676878

So this PR add `-pedantic` to torch inductor build option list to constraint codegen generate c++ standard well code.
Additional, It also fixed a halide zero size array code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156914
Approved by: https://github.com/jansel
2025-06-30 16:29:08 +00:00
Bin Bao
771be85704 [AOTI] Print out error msg when nvcc compiler fails (#157203)
Summary: To debug https://github.com/pytorch/pytorch/issues/156930. Not able to reproduce the problem locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157203
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@meta.com>
2025-06-30 01:30:55 +00:00
James Wu
e7a66166ce [precompile] When using BundledAOTAutogradCache, disable FXGraphCache (#156611)
The goal of this PR is to fix a specific bug when turning precompile on/off between caching runs.

If you try to turn on BundledAOTAutogradCacheEntry today in between local runs, the FXGraphCache may randomly hit *between* the two runs, because FXGraphCache knows nothing about AOTAutogradCache's config. When FXGraphCache hits, it immediately will call make_launchers() immediately on the triton code it launches, which then causes an assertion failure because pickle should not be called after make_launchers.

One way to resolve the bug is just to add whether precompile is enabled to teh FxGraph cache key. But the better fix for this, however, is higher level/philosophical:

When using BundledAOTAutogradCacheEntry, the entire CompiledFxGraph is saved directly to the cache entry, and we expect the two caches to work in sync, i.e. as one cache. So to simplify the programming model, we disable FxGraphCache when BundledAOTAUtogradCache is turned on.

BundledAOTAutogradCacheEntry is only used for precompile use cases now; if we wanted to use BundledAOTAutogradCache for traditional caching use cases, there's a bunch of further work, one of which would be to re-enable FxGraphCache in the event that BundledAOTAutogradCache has to bypass. However, for precompile, this is not a scenario that should happen: we should always expect the entire callable to be saveable, and we should expect to never bypass. So we don't do that change for now.

Added a unit test demonstrating this behavior. Also updated existing unit tests to show that all fx graph cache operations are now 0 (but all tests still pass).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156611
Approved by: https://github.com/zhxchen17
2025-06-25 21:01:42 +00:00
henrylhtsang
4bc3e4b497 [cutlass backend] Move cutlass key to cutlass_library (#156654)
Differential Revision: [D77188311](https://our.internmc.facebook.com/intern/diff/D77188311/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156654
Approved by: https://github.com/ColinPeppler, https://github.com/jingsh
ghstack dependencies: #156651
2025-06-25 17:55:57 +00:00
henrylhtsang
e071837594 [cutlass backend] compile and link for .so files (#155876)
Differential Revision: [D76482736](https://our.internmc.facebook.com/intern/diff/D76482736/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155876
Approved by: https://github.com/coconutruben, https://github.com/ColinPeppler
2025-06-25 17:01:56 +00:00
Xuehai Pan
6ff6630375 [BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313
Approved by: https://github.com/jingsh
2025-06-23 02:57:12 +00:00
PyTorch MergeBot
f1331f3f1b Revert "[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313)"
This reverts commit 3627270bdf.

Reverted https://github.com/pytorch/pytorch/pull/156313 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
Xuehai Pan
3627270bdf [BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313
Approved by: https://github.com/jingsh
2025-06-22 08:43:09 +00:00
Shangdi Yu
eaf704914e [aoti] package weights to disk and dedup (#155241)
We package the weights and save them in `data/weights/` (`WEIGHTS_DIR`). In addition, we store a `weights_config.json` in the model folder for each model to specify which weight file corresponding to which weight name.

Models can share weights. We dedup the weights based on their underlying storage (`tensor.untyped_storate()`).

- Use `"aot_inductor.package_constants_on_disk": True` config to produce the `Weights` in aot_compile
- If we see `Weights` in aoti_files, we'll automatically package them to disk
- `"aot_inductor.package_constants_on_disk"` config and `"aot_inductor.package_constants_in_so"` config work independently.
- Use `load_pt2(package_path, load_weights_from_disk=True)` to load the weights from disk. `load_weights_from_disk` defaults to False.

Test Plan:
```
buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_shared_weights"
```

Tested with whisper at https://github.com/pytorch-labs/torchnative/pull/7

Rollback Plan:

Differential Revision: D74747190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155241
Approved by: https://github.com/desertfire
2025-06-19 17:17:17 +00:00
Ruben Rodriguez Buchillon
bdb1553b77 [inductor][cutlass] binary remote cache (#156248)
Summary:
# Why

speed up cutlass kernel generation and retrieval

# What

using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. only register the handler internally

this is the OSS only part of the change, to facilitate integration

Test Plan:
## prove that we can upload successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
      673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
      649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can download successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can upload errors successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
        4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
        4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## prove that we can download errors successfully

```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
I0611 12:56:14.079000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## showing timing information

```
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s)
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s)
```

Reviewed By:
henrylhtsang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156248
Approved by: https://github.com/henrylhtsang
2025-06-18 06:51:22 +00:00
PyTorch MergeBot
ec08eb8ba2 Revert "[inductor][cutlass] binary remote cache (#156106)"
This reverts commit 9a2c669425.

Reverted https://github.com/pytorch/pytorch/pull/156106 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/156106#issuecomment-2981533904))
2025-06-17 19:07:49 +00:00
Ruben Rodriguez Buchillon
9a2c669425 [inductor][cutlass] binary remote cache (#156106)
Summary:
# Why

speed up cutlass kernel generation and retrieval

# What

using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. only register the handler internally

Test Plan:
## prove that we can upload successfully
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
      673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
      649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can download successfully
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can upload errors successfully
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
        4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
        4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## prove that we can download errors successfully

```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
I0611 12:56:14.079000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## showing timing information

```
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s)
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s)
```

Rollback Plan:

Reviewed By: henrylhtsang

Differential Revision: D76454741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156106
Approved by: https://github.com/henrylhtsang

Co-authored-by: atalman <atalman@fb.com>
2025-06-17 16:24:10 +00:00
Oguz Ulgen
a2a75be0f8 Rename inductor cache (#156128)
Requested by Simon on a different PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128
Approved by: https://github.com/xmfan
2025-06-17 03:57:18 +00:00
henrylhtsang
45382b284d [cutlass backend] changes how gpu_kernels_o are handled for cutlass (#155875)
Currently, we do it a bit hacky: Look at all the .o we have from this session, add them all to AOTI. This for example doesn't work if we do multiple AOTI compilation in one session, without clearing the inductor cache.

Also I want to change how cutlass .so are compiled. Hence this change.

This change is broken down since @coconutruben is trying to make a change to the same files too.

Differential Revision: [D76563003](https://our.internmc.facebook.com/intern/diff/D76563003/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155875
Approved by: https://github.com/ColinPeppler
2025-06-17 02:06:54 +00:00
Henry Tsang
a35b3a9b95 [cutlass backend][forward fix] use _cuda_compiler path to check if nvcc exists (#155939)
Differential Revision: D76571828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155939
Approved by: https://github.com/Skylion007, https://github.com/masnesral
2025-06-15 00:01:57 +00:00
Marcin Pioch
ce79056471 Custom FX pass for inductor's backend registration (#154841)
This PR is related to RFC #153532. It is an extension to Inductor's backend registration interface to allow to register custom FX passes by the backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154841
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-06-14 17:29:54 +00:00
Bin Bao
f151b20123 [AOTI] Remove the emit_current_arch_binary option (#155768)
Summary: Remove the option as generating fatbin with PTX only doesn't work on H100, so switch to always include one PTX and one SASS for fatbin.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155768
Approved by: https://github.com/angelayi
2025-06-13 02:06:07 +00:00
Henry Tsang
6b3eef6d31 [cutlass backend] Only consider to use re worker if nvcc doesn't exist (#155745)
Differential Revision: D76463340

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155745
Approved by: https://github.com/masnesral
2025-06-12 15:23:52 +00:00
Oguz Ulgen
d1947a8707 Migrate from lru_cache to cache (#155613)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155613
Approved by: https://github.com/ezyang
ghstack dependencies: #155612
2025-06-11 19:44:18 +00:00
Brian Hirsh
a2b0b2698d inductor codecache: include private inductor configs in cache key (#153672)
Fixes https://github.com/pytorch/torchtitan/issues/1185

It looks like inductor's logic to include inductor configs in the cache key skips configs with a leading underscore by default. This came up in torchtitan - there's an asyncTP pipelining pass in inductor gated by a private config, and by not caching on the config we were attempting to use asyncTP when we shouldn't be.

I'm not sure how worried we should be on the blast radius of this change. On the one hand:

(1) it technically fixes any silent correctness issues in the cache around any other private inductor configs (it looks like there are a few)

(2) there is some risk that there are some "harmless" configs that we are now including in the key, which may increase false negatives. I do see that there is an explicit list for "configs we want to ignore for caching" (`_save_config_ignore`), so my hope is that all harmless configs are already encapsulated there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153672
Approved by: https://github.com/oulgen
2025-06-11 01:33:24 +00:00
PyTorch MergeBot
79bdafe5b6 Revert "Custom FX pass for inductor's backend registration (#154841)"
This reverts commit e694280d12.

Reverted https://github.com/pytorch/pytorch/pull/154841 on behalf of https://github.com/clee2000 due to failing some tests internally D76135706 ([comment](https://github.com/pytorch/pytorch/pull/154841#issuecomment-2956357711))
2025-06-09 16:56:45 +00:00
Marcin Pioch
e694280d12 Custom FX pass for inductor's backend registration (#154841)
This PR is related to RFC #153532. It is an extension to Inductor's backend registration interface to allow to register custom FX passes by the backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154841
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-06-06 06:49:44 +00:00
Shangdi Yu
26f066bb61 Add AOTI model name config (#154129)
Summary: If a model name is specified in aoti config, the generated files will use that model name as file stem.

Test Plan:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_using_model_name_for_files
```

Bifferential Revision: D75102034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154129
Approved by: https://github.com/desertfire
2025-06-05 23:38:11 +00:00
James Wu
fa63de0866 Handle empty linemaps in PyCodeCache (#155064)
Some functions have empty linemaps, and if you call `PyCodeCache.stack_frames_for_code` on code in the wrong order, you'll end up triggering a too many values to unpack issue: https://github.com/pytorch/pytorch/issues/154536

Specifically, if you populate PyCodeCache's linemap via caching, and then request the stack frames of a inductor generated output file that has an empty linemap, this function will try to unpack too many arguments.

Test plan:
```
import os

os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"
os.environ["TORCHINDUCTOR_AUTOGRAD_CACHE"] = "1"

import torch

@torch.compile
def fn(x: torch.Tensor):
    (x_grad,) = torch.autograd.grad(x.sum(), x)
    return x_grad

x = torch.randn(10, 10, requires_grad=True)
result = fn(x)
```

Run this twice and see that everything works as expected.

It's hard to exactly pinpoint a good unit test for this: it requires a whole lot of moving parts to get the issue to trigger because:

- The callsite in question in dynamo, without caching, will always run before generating the code, so cls.linemaps[path] will be None most of the time
- The inductor generated output needs to call *back* into dynamo via `assert_size_stride`
- In our test case, the CompiledBackward needs to not have linemaps, and also be called in the middle of a graph break while compiling a different cached function. Caching switches the order the PyCodeCache.linemap is populated (i.e. either before or after the graph break is evaluated), which causes the issue.

All these things need to interact together to create the bug, so it's a bit difficult to write a simple unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155064
Approved by: https://github.com/bdhirsh
2025-06-05 03:54:35 +00:00
xinan.lin
cbdacd32fe [AOTI][Intel GPU] Support multi_arch_kernel_binary option for XPU. (#154514)
Following the design of #154413, this PR add XPU support for generating kernel binary files that support multiple archs.

Fixes #154682, Fixes #154683, Fixes 154689, Fixes #154685 , Fixes #154690, Fixes #154681

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154514
Approved by: https://github.com/desertfire, https://github.com/EikanWang
2025-06-03 23:02:00 +00:00
PyTorch MergeBot
1193bf0855 Revert "convert inductor codecache to use getArtifactLogger (#153766)"
This reverts commit 5b6fd277f9.

Reverted https://github.com/pytorch/pytorch/pull/153766 on behalf of https://github.com/malfet due to I want to revert this change as I'm 90+% certain it somehow broke testing ([comment](https://github.com/pytorch/pytorch/pull/153766#issuecomment-2923620806))
2025-05-30 22:20:07 +00:00
PyTorch MergeBot
31f95b5d2e Revert "inductor codecache: include private inductor configs in cache key (#153672)"
This reverts commit 2c1cb38d95.

Reverted https://github.com/pytorch/pytorch/pull/153672 on behalf of https://github.com/malfet due to Looks like it regressed pr_time_benchmarks, see ba3f91af97/1 ([comment](https://github.com/pytorch/pytorch/pull/153672#issuecomment-2922759739))
2025-05-30 15:54:14 +00:00