Commit Graph

593 Commits

Author SHA1 Message Date
James Wu
74d0300804 Change unsafe_marked_cacheable_functions to a dictionary, so that you can specify a static cache key (#152486)
Fixes https://github.com/pytorch/pytorch/issues/152434
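A hedged sketch of the shape change this enables (module path and function names below are illustrative assumptions, not taken from the PR):

```python
# Before: a plain collection of function names unsafely marked cacheable.
unsafe_marked_cacheable_functions = [
    "mypkg.ops.custom_kernel",
]

# After: a dictionary, so each function can also carry a static cache key
# that is folded into the compile cache key.
unsafe_marked_cacheable_functions = {
    "mypkg.ops.custom_kernel": "custom_kernel_cache_key_v1",
}
```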

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152486
Approved by: https://github.com/oulgen
2025-05-19 02:16:33 +00:00
Bin Bao
a2d0ef242d [AOTI] Embed cubin files into .so (#150739)
Summary: Embed cubin files so that AOTI is one step closer to generating a single binary. Controlled by a flag, off by default.
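A hedged sketch of how the opt-in might look; the flag name below is an assumption for illustration, not confirmed by this commit message:

```python
import torch._inductor.config as inductor_config

# Off by default; the exact knob name added by #150739 may differ.
with inductor_config.patch({"aot_inductor.embed_kernel_binary": True}):
    pass  # run the AOTInductor compilation of the model here
```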

Differential Revision: [D72535357](https://our.internmc.facebook.com/intern/diff/D72535357)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150739
Approved by: https://github.com/angelayi
2025-05-19 01:11:46 +00:00
Michael Lazos
f604732e2e [Cutlass] E2E Tests for EVT (#152815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152815
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-05-17 12:29:10 +00:00
Benjamin Glass
cda572b053 codecache: Remove cpp_prefix.h duplication per build, then precompile it (#144293)
Prior to this PR, `_inductor/codegen/cpp_prefix.h` was copied into a new temporary directory on every inductor run utilizing the CPP backend (i.e. CPU-only), then included in the output source code. Instead, this PR puts it in an appropriate place in the torch includes, and includes it from there. This allows us to precompile it in cpp_wrapper and AOT inductor mode, saving significant compilation time.

Due to difficulties getting this to work in FBCode, the precompilation itself is only enabled in OSS PyTorch.

Differential Revision: [D69420620](https://our.internmc.facebook.com/intern/diff/D69420620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144293
Approved by: https://github.com/desertfire
2025-05-16 17:41:36 +00:00
James Wu
5ff2cb8587 Add justknobs for static cuda launcher (#153400)
Summary:
This diff adds a justknobs check for the static cuda launcher. In particular, it supports a fractional rollout where each MAST job/version can be consistently enrolled with the feature either on or off.

It also adds a set_feature_use call so we can track whether the static cuda launcher is enabled on a given dynamo compile.

Test Plan: Existing unit tests. The justknobs in question are set to be disabled right now, so this diff does not launch the feature yet.
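For reference, a hedged sketch of the justknobs check pattern described in the summary (the knob name is illustrative; OSS builds simply return the default):

```python
from torch._utils_internal import justknobs_check

def use_static_cuda_launcher() -> bool:
    # In OSS builds this returns the default unconditionally; internally it
    # consults JustKnobs, so a MAST job/version is consistently enrolled in
    # the fractional rollout either on or off.
    return justknobs_check("pytorch/compiler:static_cuda_launcher")
```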

Differential Revision: D74599203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153400
Approved by: https://github.com/oulgen
2025-05-13 20:10:13 +00:00
Henry Tsang
36722c287f [cutlass backend] make compile name independent of command (#153388)
Differential Revision: D74291603

The goal is to reuse the kernels as much as possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153388
Approved by: https://github.com/ColinPeppler
2025-05-13 03:49:24 +00:00
Aaron Gokaslan
3555ebb63d [BE]: Update ruff to 0.11.8 (#153249)
Fixes a ton of false negatives throughout the codebase. RUFF also properly validates NOQA comments now and most of the changes are fixing typos there or removing filewide flake8 suppressions that were also silencing ruff issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153249
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/seemethere
2025-05-12 18:30:52 +00:00
rzou
89aa6eb19b Stop codegen-ing post_grad_custom_pass in repros (#153243)
When codegen'ed, it looks like:
```py
post_grad_custom_pass = <object at 0x12345678>
```
Which is not runnable at all. Some logic is also trying to deepcopy the
object, and not all of these objects are deepcopy-able.

This PR skips codegenning of these passes.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153243
Approved by: https://github.com/houseroad
2025-05-12 15:21:11 +00:00
Michael Lazos
a3154ca34a [Cutlass] Changes to gemm template for EVT (#150907)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150907
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
ghstack dependencies: #153196
2025-05-09 05:39:05 +00:00
Oguz Ulgen
50fe1b2349 Implement async manifold cache write (#152452)
Summary: This diff implements an AsyncManifoldCache class that performs cache-write and TTL-update operations asynchronously. Essentially, we are OK with a fire-and-forget approach where we don't guarantee that we can observe our writes; this gives us better runtime latency.
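A rough sketch of the fire-and-forget idea (class and method names are illustrative, not the internal Manifold API):

```python
from concurrent.futures import ThreadPoolExecutor

class AsyncCacheWriter:
    def __init__(self, backend):
        self.backend = backend
        self._pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="cache-write")

    def put(self, key: str, value: bytes) -> None:
        # Submit the write (or TTL update) and return immediately; we accept
        # that a subsequent read may not observe this write.
        self._pool.submit(self.backend.put, key, value)
```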

Test Plan: added new unit test

Reviewed By: jamesjwu

Differential Revision: D73867797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152452
Approved by: https://github.com/jamesjwu
2025-05-05 16:45:48 +00:00
rzou
2b37a726e0 Refactor layout constraint selection logic (#148104)
This PR:

- cleans up some existing comments that don't make sense anymore
- hooks the "custom_op_default_layout_constraint" back up (it appears to
have broken)
- cleans up the "lazy registration path", which never seems to get hit
anymore
- adds dislike_padding to nodes that require exact strides

Test Plan:
- tests + CI

disable padding

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148104
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-05-03 00:02:24 +00:00
Huamin Li
6f6acb4128 [AOTI][CPU] Introduce config.cpp.use_decompose_tanh (#152542)
Summary: Previously D70489427 changed the tanh impl to `.tanh()`, and this caused a perf regression in some Meta-internal workloads. This diff introduces a config so we can set it based on need.
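A hedged sketch of toggling the new knob (based only on the name in the title; where it should be set in practice may differ):

```python
import torch._inductor.config as inductor_config

# Opt back into the decomposed tanh lowering for workloads that regressed
# with the plain .tanh() implementation.
with inductor_config.patch({"cpp.use_decompose_tanh": True}):
    pass  # compile the affected model here
```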

Differential Revision: D73909371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152542
Approved by: https://github.com/desertfire
2025-05-01 10:25:31 +00:00
Will Constable
a4a771648a [pt2d] Add reorder_comms_preserving_peak_memory pass (#146562)
This is a new pass to replace the pre-existing passes.  It has the same
basic goal, to achieve communication overlap (latency hiding), but also
constrains the solution to not increase peak memory.

The principles of operation are detailed in code comments, but
summarized here:
- never reorder collectives relative to each other (TBD if we should
  relax this later)
- before performing reordering, push all comm and wait nodes as late as possible, respecting data dependencies
- estimate peak memory and current memory at each scheduler node
- move collective nodes forward one position at a time, if the move does
  not increase current memory beyond peak memory
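A hedged sketch of the greedy move step described by the principles above (data-dependency checks omitted; all names are illustrative, this is not the pass itself):

```python
def reorder_preserving_peak_memory(order, is_collective, mem_at, peak_mem):
    order = list(order)
    for node in [n for n in order if is_collective(n)]:
        i = order.index(node)
        # Move the collective one position earlier at a time, as long as the
        # move does not push memory at that point beyond the estimated peak.
        while i > 0 and mem_at(order, i - 1) <= peak_mem:
            order[i - 1], order[i] = order[i], order[i - 1]
            i -= 1
    return order
```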

The pass logs a summary table for each graph to TORCH_LOGS=overlap.

e.g. (exact format may have been tweaked but this shows the idea).

```
rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] Collective node                                                                                                                                                initial exposed    final exposed    improvement  limiting factor        moves
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] -----------------------------------------------------------------------------------------------------------------------------------------------------------  -----------------  ---------------  -------------  -------------------  -------
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op2')  (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[2256, 256], stride=[256, 1]) (buf2) (12142 ns)               12141.6          6514.53       5627.08   prefetch limit            75
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op6')  (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[282, 256], stride=[256, 1]) (buf7) (32266 ns)                 32265.8         28429.2        3836.61   data dependency           78
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op9')  (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[256], stride=[1]) (buf11) (10801 ns)                         10800.6         10732.3          68.254  peak memory                1
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op14')  (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[32], stride=[1]) (buf17) (10810 ns)                          10809.5         10809.5           0      data dependency            4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146562
Approved by: https://github.com/eellison
ghstack dependencies: #152060, #146561
2025-04-29 22:51:31 +00:00
Anthony Shoumikhin
e2f9759bd0 Fix broken URLs (#152237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-04-27 09:56:42 +00:00
James Wu
0dae27d75b Turn on static cuda launcher in OSS (#151691)
After a few small bugfixes to tests (so that we throw/catch exceptions similar to Triton's), I think we're ready to flip the switch and turn StaticCudaLauncher on by default in OSS.

Initial round of benchmarks look good, with average compilation time going down by a few percent:
<img width="828" alt="image" src="https://github.com/user-attachments/assets/cad03e09-b4d6-49a7-a9e5-6068d1c0bd5c" />

With no changes to runtime perf:
<img width="823" alt="image" src="https://github.com/user-attachments/assets/3fcd435e-1057-43f4-878b-8d66a3812a10" />

There are a few noisy models I want to double-check, though, so I will run some more tests before accepting review.

Full benchmark results, showing a ~5% compile time improvement across the board:
https://hud.pytorch.org/benchmark/huggingface/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Wed%2C%2016%20Apr%202025%2002%3A31%3A12%20GMT&stopTime=Wed%2C%2023%20Apr%202025%2002%3A31%3A12%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/jamesjwu/139/orig&lCommit=cc45c8667fa23dec16ca50002d9504a34688ca5c&rBranch=main&rCommit=2a9afdae81d0dde98e96d7e3c9ca840e241e5405
<img width="1482" alt="image" src="https://github.com/user-attachments/assets/6e6a7f39-7f44-459f-9845-9a37f084ea82" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151691
Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/EikanWang
2025-04-25 17:48:53 +00:00
henrylhtsang
4ea2e093ca [inductor][BE] Clean up use_mixed_mm and mixed_mm_choice usage inside pytorch (#152071)
Differential Revision: [D73551912](https://our.internmc.facebook.com/intern/diff/D73551912/)

Decided to leave the mixed_mm tests alive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152071
Approved by: https://github.com/eellison
2025-04-25 17:25:55 +00:00
PyTorch MergeBot
562328501e Revert "Turn on static cuda launcher in OSS (#151691)"
This reverts commit e31e2d27c6.

Reverted https://github.com/pytorch/pytorch/pull/151691 on behalf of https://github.com/malfet due to This breaks tests, see c1f51cf2c4/1 ([comment](https://github.com/pytorch/pytorch/pull/151691#issuecomment-2825427252))
2025-04-23 20:28:31 +00:00
James Wu
e31e2d27c6 Turn on static cuda launcher in OSS (#151691)
After a few small bugfixes to tests (so that we throw/catch exceptions similar to Triton's), I think we're ready to flip the switch and turn StaticCudaLauncher on by default in OSS.

Initial round of benchmarks look good, with average compilation time going down by a few percent:
<img width="828" alt="image" src="https://github.com/user-attachments/assets/cad03e09-b4d6-49a7-a9e5-6068d1c0bd5c" />

With no changes to runtime perf:
<img width="823" alt="image" src="https://github.com/user-attachments/assets/3fcd435e-1057-43f4-878b-8d66a3812a10" />

There are a few noisy models I want to double-check, though, so I will run some more tests before accepting review.

Full benchmark results, showing a ~5% compile time improvement across the board:
https://hud.pytorch.org/benchmark/huggingface/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Wed%2C%2016%20Apr%202025%2002%3A31%3A12%20GMT&stopTime=Wed%2C%2023%20Apr%202025%2002%3A31%3A12%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/jamesjwu/139/orig&lCommit=cc45c8667fa23dec16ca50002d9504a34688ca5c&rBranch=main&rCommit=2a9afdae81d0dde98e96d7e3c9ca840e241e5405
<img width="1482" alt="image" src="https://github.com/user-attachments/assets/6e6a7f39-7f44-459f-9845-9a37f084ea82" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151691
Approved by: https://github.com/oulgen
2025-04-23 15:43:24 +00:00
henrylhtsang
1f0d764b65 stage 2 of deprecating silent fallback of tuning gemm (#148622)
context: https://github.com/pytorch/pytorch/issues/147479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148622
Approved by: https://github.com/eellison
ghstack dependencies: #151506
2025-04-21 20:14:34 +00:00
Oguz Ulgen
0f8613bf5c Introduce unsafe way to mark functions as cacheable (#151603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151603
Approved by: https://github.com/jamesjwu
ghstack dependencies: #151768, #151609
2025-04-21 17:37:38 +00:00
Camyll Harajli
6b45b6e6c9 run lintrunner for Export d68846308 (#151725)
fixes broken lint tests in https://github.com/pytorch/pytorch/pull/151481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151725
Approved by: https://github.com/exclamaforte, https://github.com/Skylion007

Co-authored-by: Gabriel Ferns <gabeferns@meta.com>
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-20 14:58:17 +00:00
PyTorch MergeBot
6261db7719 Revert "inductor.config.descriptive_names = False is not actually supported (#145523) (#146051) (#151481)"
This reverts commit cfc4d74b0c.

Reverted https://github.com/pytorch/pytorch/pull/151481 on behalf of https://github.com/malfet due to It indeed breaks lint; its followup PR contains its own issues ([comment](https://github.com/pytorch/pytorch/pull/151481#issuecomment-2816490764))
2025-04-19 03:12:56 +00:00
Gabriel Ferns
cfc4d74b0c inductor.config.descriptive_names = False is not actually supported (#145523) (#146051) (#151481)
Summary:

This config is not supported (it throws an error when set), and doesn't really make sense imo.

Approved by: https://github.com/eellison

Test Plan: contbuild & OSS CI, see edf266e9bb

Reviewed By: masnesral

Differential Revision: D68846308

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151481
Approved by: https://github.com/masnesral
2025-04-19 01:13:35 +00:00
PyTorch MergeBot
b7807759de Revert "stage 2 of deprecating silent fallback of tuning gemm (#148622)"
This reverts commit 181b3883e7.

Reverted https://github.com/pytorch/pytorch/pull/148622 on behalf of https://github.com/henrylhtsang due to seems to be breaking some rocm mi300 run ([comment](https://github.com/pytorch/pytorch/pull/148622#issuecomment-2815995105))
2025-04-18 18:37:09 +00:00
Oguz Ulgen
b73606dcc5 Add jk for force_disable_caches (#151621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151621
Approved by: https://github.com/jamesjwu
2025-04-18 18:19:40 +00:00
henrylhtsang
181b3883e7 stage 2 of deprecating silent fallback of tuning gemm (#148622)
context: https://github.com/pytorch/pytorch/issues/147479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148622
Approved by: https://github.com/eellison
ghstack dependencies: #151506
2025-04-18 17:26:16 +00:00
henrylhtsang
dd11613f94 [cutlass backend][experimental] Try out presets for cutlass instead of searching all configs (#151255)
Differential Revision: [D72668861](https://our.internmc.facebook.com/intern/diff/D72668861/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151255
Approved by: https://github.com/mlazos
2025-04-16 01:48:06 +00:00
Oguz Ulgen
3cf0e2d8ec Add inductor standalone_compile API (#150670)
This PR adds a standalone_compile API that does precompilation via caching, to support the vLLM use case in the short term while we work on the longer-term precompilation solution.

```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```
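A hedged usage sketch based on the signature above; the import path, the use of a symbolically traced module, and the keyword names are assumptions, not confirmed by this commit message:

```python
import torch
from torch._inductor import standalone_compile  # assumed import path

def fn(x):
    return x.sin() + 1

gm = torch.fx.symbolic_trace(fn)
example_inputs = [torch.randn(8)]

artifact = standalone_compile(gm, example_inputs)  # -> CompiledArtifact
artifact.save(path="./compiled_artifact", format="unpacked")
# Later, e.g. in another process:
# artifact = CompiledArtifact.load(path="./compiled_artifact", format="unpacked")
```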

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2025-04-15 23:38:15 +00:00
Shunting Zhang
91923f0ee1 [inductor] disable alignment asserts in fbcode (#151274)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151274
Approved by: https://github.com/Mingming-Ding, https://github.com/Microve, https://github.com/eellison
2025-04-15 19:59:54 +00:00
PyTorch MergeBot
74f6bc28a7 Revert "Add inductor standalone_compile API (#150670)"
This reverts commit c9aef50898.

Reverted https://github.com/pytorch/pytorch/pull/150670 on behalf of https://github.com/Camyll due to breaking internal builds with torch module not found error ([comment](https://github.com/pytorch/pytorch/pull/150670#issuecomment-2806975267))
2025-04-15 17:35:59 +00:00
Oguz Ulgen
c9aef50898 Add inductor standalone_compile API (#150670)
This PR adds a standalone_compile API that does precompilation via caching, to support the vLLM use case in the short term while we work on the longer-term precompilation solution.

```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2025-04-14 22:00:09 +00:00
PyTorch MergeBot
24b3ab9255 Revert "Add inductor standalone_compile API (#150670)"
This reverts commit bbc5fe8504.

Reverted https://github.com/pytorch/pytorch/pull/150670 on behalf of https://github.com/albanD due to Broke profiler test ([comment](https://github.com/pytorch/pytorch/pull/150670#issuecomment-2802067144))
2025-04-14 15:22:33 +00:00
Oguz Ulgen
bbc5fe8504 Add inductor standalone_compile API (#150670)
This PR adds a standalone_compile API that does precompilation via caching, to support the vLLM use case in the short term while we work on the longer-term precompilation solution.

```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2025-04-14 07:07:10 +00:00
Michael Lazos
fe961679d5 [Inductor] add support for disabling atomic adds (#151033)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151033
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-04-11 18:41:56 +00:00
Shunting Zhang
252029b294 [Inductor] assert fallback output alignment (#150804)
The previous PR (https://github.com/pytorch/pytorch/pull/150777) fixes the alignment problem for fallback kernels, assuming the meta kernel is correct. This PR handles the case where the meta kernel is incorrect: an assertion is added if the compiler assumes a fallback kernel's output is aligned.
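A hedged illustration of the kind of runtime check being added (the real assertion is emitted into the generated code; the helper below is illustrative):

```python
def assert_fallback_output_aligned(tensor, alignment=16):
    # Only meaningful when the compiler has assumed this buffer is aligned.
    assert tensor.data_ptr() % alignment == 0, (
        f"fallback kernel output expected to be {alignment}-byte aligned"
    )
```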

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150804
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #150777
2025-04-10 20:01:06 +00:00
PyTorch MergeBot
01568cb17a Revert "Refactor layout constraint selection logic (#148104)"
This reverts commit 2e7c9d33e7.

Reverted https://github.com/pytorch/pytorch/pull/148104 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/14357056427/job/40251630946) [HUD commit link](2e7c9d33e7) ([comment](https://github.com/pytorch/pytorch/pull/148104#issuecomment-2790369493))
2025-04-09 16:49:48 +00:00
PyTorch MergeBot
a0e796df03 Revert "Inductor respects exact strides on custom ops by default (#150511)"
This reverts commit a4bb2f106f.

Reverted https://github.com/pytorch/pytorch/pull/150511 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/14357056427/job/40251630946) [HUD commit link](2e7c9d33e7) ([comment](https://github.com/pytorch/pytorch/pull/148104#issuecomment-2790369493))
2025-04-09 16:49:48 +00:00
rzou
a4bb2f106f Inductor respects exact strides on custom ops by default (#150511)
If a tag is not specified on a custom operator, then inductor will
assume that it needs exact strides.
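A hedged illustration of opting a custom op out of this default by tagging it; whether `flexible_layout` is the appropriate opt-out tag here is an assumption, and the op itself is made up:

```python
import torch

# Tag the op as layout-flexible so inductor does not require exact strides.
torch.library.define(
    "mylib::scale",
    "(Tensor x, float factor) -> Tensor",
    tags=(torch.Tag.flexible_layout,),
)

@torch.library.impl("mylib::scale", "CompositeExplicitAutograd")
def scale_impl(x, factor):
    return x * factor
```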

Test Plan:
- tests + CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150511
Approved by: https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #150495, #148104
2025-04-09 16:46:48 +00:00
rzou
2e7c9d33e7 Refactor layout constraint selection logic (#148104)
This PR:

- cleans up some existing comments that don't make sense anymore
- hooks the "custom_op_default_layout_constraint" back up (it appears to
have broken)
- cleans up the "lazy registration path", which never seems to get hit
anymore
- adds dislike_padding to nodes that require exact strides

Test Plan:
- tests + CI

disable padding

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148104
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #150495
2025-04-09 02:09:18 +00:00
Shunting Zhang
901b02cf16 [Inductor] fix alignment assumption for fallback (#150777)
Inductor right now only works properly for fallback kernels producing aligned output.
When Inductor creates the layout for a fallback kernel's output, it does not add the tensor offset to the layout [link](2a1e2b88ed/torch/_inductor/ir.py (L6935-L6941)). Thus an unaligned output will be treated as aligned. Adding the offset to the layout directly does not work, since that changes the index expression in the generated kernel and we may end up applying the offset twice; Triton already accounts for the offset when passing in the data_ptr.

To solve this issue, we track the unaligned buffer names instead.

This potentially can fix the internal issues we are debugging here: https://fb.workplace.com/groups/1075192433118967/permalink/1618308128807392/

Differential Revision: [D72600784](https://our.internmc.facebook.com/intern/diff/D72600784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150777
Approved by: https://github.com/eellison, https://github.com/jansel
2025-04-08 18:49:44 +00:00
PyTorch MergeBot
7c858066ae Revert "Enable TMA persistent GEMM Template by default (#149427)"
This reverts commit b8ef642f04.

Reverted https://github.com/pytorch/pytorch/pull/149427 on behalf of https://github.com/clee2000 due to failing tests internally D72116141 ([comment](https://github.com/pytorch/pytorch/pull/149427#issuecomment-2766672200))
2025-03-31 15:58:34 +00:00
PaulZhang12
b8ef642f04 Enable TMA persistent GEMM Template by default (#149427)
Previously, this could not be landed because H100 capacity for CI testing was limited. Benchmarking on H100 CI looks good now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149427
Approved by: https://github.com/drisspg
2025-03-29 07:32:42 +00:00
Sam Larsen
266bd22b44 Improve subproc autotuning implementation (#149700)
Summary: The primary change is to update the autotune-in-a-subproc implementation to avoid using multiprocessing spawn. Spawn (re)executes the toplevel script in the subproc, which can be problematic. The approach here is similar to Triton parallel compile: we Popen a subproc on a controlled entry point and communicate over pipes. That change drove a lot of refactoring in the TuningProcess class, so I took the opportunity to simplify some things, rename some methods, etc.

One other notable change is around the timeout / kill approach. After a timeout, we were previously attempting to stop the subproc in three steps (graceful shutdown, sigkill if graceful fails, sigterm if sigkill fails). I'd argue that's not useful: 1) The graceful shutdown is never going to work unless the subproc happens to have just completed its task and is ready to receive the next command. 2) If we're going to kill the subproc, let's just take the most aggressive approach and move on as quickly as possible to restarting it, rather than waiting to see if previous shutdown attempts succeeded. The only downside I can find is maybe a little log spew, e.g., ` ResourceWarning: subprocess 2987680 is still running`

List of changes:
* Use Popen instead of spawn for the autotuning subprocess.
* Introduced a new entry point `__autotune_main__.py`
* Renamed some TuningProcess methods. For example `shutdown` makes more sense than `terminate` because the latter implies a forced kill.
* Simplified the implementation around benchmarking timeout and how we kill the subproc after a timeout.
* Deprecated the unused timeout configs in `_inductor/config.py`
* Moved `get_ld_library_path` helper to a common utils file.
* Added more unit tests for subproc crashes / timeouts / exceptions, etc.
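A hedged sketch of the Popen-plus-pipes protocol described above (the entry-point module and framing are illustrative, not the actual implementation):

```python
import pickle
import struct
import subprocess
import sys

# Launch the worker on a dedicated entry point instead of re-executing the
# toplevel script the way multiprocessing spawn does.
proc = subprocess.Popen(
    [sys.executable, "-m", "my_autotune_entry_point"],  # illustrative module
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)

def send(obj):
    payload = pickle.dumps(obj)
    proc.stdin.write(struct.pack("<Q", len(payload)) + payload)
    proc.stdin.flush()

def recv():
    (length,) = struct.unpack("<Q", proc.stdout.read(8))
    return pickle.loads(proc.stdout.read(length))
```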

Test plan:
* New unit tests
* Also ran internally with all combinations of: build mode `opt` and `dev-nosan`, and `buck run` vs. executing the `.par` file directly.
* Made sure the functionality to parallelize autotuning across different GPUs is working (it wasn't clear to me this was behaving the way we wanted it to).

Differential Revision: [D71976971](https://our.internmc.facebook.com/intern/diff/D71976971)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149700
Approved by: https://github.com/aorenste, https://github.com/jansel, https://github.com/eellison
2025-03-28 01:06:39 +00:00
PyTorch MergeBot
185aaaaf8e Revert "Improve subproc autotuning implementation (#149700)"
This reverts commit 8cd6a133f2.

Reverted https://github.com/pytorch/pytorch/pull/149700 on behalf of https://github.com/yangw-dev due to This is breaking servicelab_benchmark_pyper_local_runner internally ([comment](https://github.com/pytorch/pytorch/pull/149700#issuecomment-2755975959))
2025-03-26 23:17:01 +00:00
Mu-Chu Lee
a0253d2840 [Inductor] Use real input to autotune user defined triton kernels (#149553)
Summary:
User-defined Triton kernels sometimes rely on real inputs to determine
the execution path. We need real inputs to invoke the correct
behavior of the user-defined Triton kernels (see the example in the test case,
where we have an early return for random inputs).

Test Plan:
Included in the commit.
python test/inductor/test_aot_inductor.py -k triton_autotuning
python test/inductor/test_aot_inductor.py -k triton_mutated_autotuning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149553
Approved by: https://github.com/davidberard98, https://github.com/eellison
2025-03-26 16:42:48 +00:00
Sam Larsen
8cd6a133f2 Improve subproc autotuning implementation (#149700)
Summary: The primary change is to update the autotune-in-a-subproc implementation to avoid using multiprocessing spawn. Spawn (re)executes the toplevel script in the subproc, which can be problematic. The approach here is similar to Triton parallel compile: we Popen a subproc on a controlled entry point and communicate over pipes. That change drove a lot of refactoring in the TuningProcess class, so I took the opportunity to simplify some things, rename some methods, etc.

One other notable change is around the timeout / kill approach. After a timeout, we were previously attempting to stop the subproc in three steps (graceful shutdown, sigkill if graceful fails, sigterm if sigkill fails). I'd argue that's not useful: 1) The graceful shutdown is never going to work unless the subproc happens to have just completed its task and is ready to receive the next command. 2) If we're going to kill the subproc, let's just take the most aggressive approach and move on as quickly as possible to restarting it, rather than waiting to see if previous shutdown attempts succeeded. The only downside I can find is maybe a little log spew, e.g., ` ResourceWarning: subprocess 2987680 is still running`

List of changes:
* Use Popen instead of spawn for the autotuning subprocess.
* Introduced a new entry point `__autotune_main__.py`
* Renamed some TuningProcess methods. For example `shutdown` makes more sense than `terminate` because the latter implies a forced kill.
* Simplified the implementation around benchmarking timeout and how we kill the subproc after a timeout.
* Deprecated the unused timeout configs in `_inductor/config.py`
* Moved `get_ld_library_path` helper to a common utils file.
* Added more unit tests for subproc crashes / timeouts / exceptions, etc.

Test plan:
* New unit tests
* Also ran internally with all combinations of: build mode `opt` and `dev-nosan`, and `buck run` vs. executing the `.par` file directly.
* Made sure the functionality to parallelize autotuning across different GPUs is working (it wasn't clear to me this was behaving the way we wanted it to).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149700
Approved by: https://github.com/aorenste, https://github.com/jansel, https://github.com/eellison
2025-03-25 20:07:28 +00:00
Ding, Yi1
f7d1b966c2 [Inductor] Unify the data type propagation between Triton and CPP Backend (#146970)
Fixes #144246

Use `DtypePropagationOpsHandler` for CSE variables of the CPP backend. In addition, add static type checking for the generated CPP code, similar to `config.test_configs.runtime_triton_dtype_assert`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146970
Approved by: https://github.com/jgong5, https://github.com/eellison, https://github.com/leslie-fang-intel
2025-03-21 17:52:51 +00:00
James Wu
7bb9c36784 Hook StaticCudaLauncher up to torch.compile (cold start) (#148890)
This hooks the previous PR up to torch.compile. I will add a config flag to hide this behind shortly, but for now it's useful for testing purposes to have it on by default.

Inductor will automatically choose to use StaticCudaLauncher to launch triton kernels if:
- The kernel is a cuda kernel and inductor can find a cubin file associated with it
- The kernel takes less than 50 arguments
- The kernel doesn't use any special features (launch hooks, large amounts of shared memory)
- The kernel is not user defined (to be supported in a later PR)
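A hedged sketch of the eligibility check implied by the criteria above (constants and attribute names are illustrative):

```python
MAX_STATIC_LAUNCH_ARGS = 50

def can_use_static_cuda_launcher(kernel) -> bool:
    return (
        kernel.device_type == "cuda"
        and kernel.cubin_path is not None           # inductor found a cubin file
        and kernel.num_args <= MAX_STATIC_LAUNCH_ARGS
        and not kernel.has_launch_hooks             # no special launch features
        and not kernel.uses_large_shared_memory
        and not kernel.is_user_defined              # user-defined kernels come later
    )
```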

We split CompileResult into TritonCompileResult and StaticTritonCompileResult, but have them share implementations of how they exec a python launcher. StaticTritonCompileResult's python launcher has the benefit of a simpler def_args/call_args setup, since it always filters out all constexprs before running, no matter the triton version.

Some key features of StaticTritonCompileResult:
- It is fully serializable
- It stores the minimum amount of stuff, so that later it can be cached easily
- It does not depend on any triton specific types (though it does have various triton metadata).

For now, both TritonCompileResult and StaticTritonCompileResult still `exec` custom python launchers, and use GridExpr. We can change that in the future to simplify things if we'd like. For now, though, the custom python codegen gives us flexibility in how constexprs are removed, and using it for static launching means we don't have to pay the cost of removing constexprs at kernel runtime.

Hooking everything up to torch.compile lets me run every unit test with StaticCudaLauncher to make sure that we still pass (even if we bypass StaticCudaLauncher itself). It also lets me check for compilation/runtime performance with these changes.

Fixes #149448

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148890
Approved by: https://github.com/jansel
2025-03-20 17:32:20 +00:00
Benjamin Glass
e8dd58b8cf cpp_wrapper: Precompile device-specific header files (#146928)
This saves us about a second per compilation, which is _massive_ for the OpInfo tests.  Total OpInfo test runtime is down about 2x from this change alone.

Relands #144002, with changes needed by fbcode internals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146928
Approved by: https://github.com/desertfire
2025-03-17 20:40:15 +00:00
Shangdi Yu
df60500ab8 Fix too big to optimize in test, actually use O0 when aot_inductor.compile_wrapper_with_O0 is set (#148714)
Summary:
1. Check against the "0" char instead

2. We got the following error when using anything other than the O0 flag: `error: Function ZN5torch12aot_inductorL22__check_inputs_outputsEPP16AtenTensorOpaqueS3 is too big to optimize [-Werror,-Wignored-optimization-argument]`. So we use the O0 flag in the wrapper code when `aot_inductor.compile_wrapper_opt_level` is set to `O0`.

Test Plan:
```
 buck run  'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/cpu/test:ads_second_stage_dsnn_models_aoti_lowering_test -- -r AdsSecondStageDSNNModelsAOTILoweringTest
```

Differential Revision: D70670957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148714
Approved by: https://github.com/desertfire
2025-03-13 10:22:06 +00:00