Commit Graph

887 Commits

Author SHA1 Message Date
Simon Fan
fae6f6c9ca [aot] fix deepcopying of aot bwd containing real tensors (#153999)
Previously, when we lowered the backward ahead of time (AOT) due to symints, the post-grad passes would leave the bw_module in a non-runnable state. This caused issues when compiled autograd tried to trace it at runtime, so we had Inductor operate on a deepcopy of bw_module.

But with https://github.com/pytorch/pytorch/issues/153993, we see that deepcopying real tensors will fail under fake mode due to the device-type mismatch between fake tensors (the "meta" device) and the real tensor. By disabling fake mode for the deepcopy, we avoid these errors. This change is a strict improvement over the current behavior, but it does reveal that this deepcopy can theoretically cause OOMs.
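
A rough illustration of the idea (not the actual PR code; the module below is a stand-in for AOTAutograd's bw_module): take the deepcopy while fake mode is temporarily disabled, so real-tensor attributes don't trip fake mode's device checks.
```
import copy

import torch
from torch._subclasses.fake_tensor import unset_fake_temporarily

# Stand-in for the bw_module produced by AOTAutograd's lowering.
bw_module = torch.fx.symbolic_trace(torch.nn.Linear(4, 4))

# Temporarily disable any active fake mode so deepcopying real tensors works.
with unset_fake_temporarily():
    bw_module_copy = copy.deepcopy(bw_module)
```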

FIXES https://github.com/pytorch/pytorch/issues/153993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153999
Approved by: https://github.com/jamesjwu, https://github.com/bdhirsh
2025-05-21 23:30:02 +00:00
Tomasz Bohutyn
bb7e30c165 [MegaCache] Make MegaCache generic to allow external plugins registration (#152977)
Implements #152976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152977
Approved by: https://github.com/oulgen
2025-05-21 18:18:47 +00:00
James Wu
c31e239910 [precompile] Add BundledAOTAutogradCacheEntry (#152840)
Finally, this PR adds BundledAOTAutogradCacheEntry. A BundledAOTAutogradCacheEntry is an AOTAutogradCacheEntry that saves the entire CompiledFxGraph directly in the entry.

This has some advantages:
- No more dependency on FxGraphCache at all
- Clearing FxGraphCache does not result in AOTAutogradCache miss
- Simpler logic, as BundledAOTAutogradCacheEntry has everything you need to load a full compiled python wrapper from a dynamo output

We plan to use BundledAOTAutogradCacheEntry for precompile. There's also a question of whether we want to use it for regular caching: the main disadvantage is having to save the same CompiledFxGraph twice, once in the Inductor cache and once in AOTAutogradCache. With MegaCaching, this *could* be a regression in total cache size (as well as a minor cold-start regression, since you have to save the same graph twice). I will import this and measure the MegaCache space usage, and if it looks good I'll enable it by default for caching as well.

On warm start, if AOTAutogradCache hits, you won't have to load inductor at all, so warm start overhead should be unaffected.
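
Conceptually, a bundled entry carries the compiled artifacts inline instead of FxGraphCache lookup keys. A rough sketch with hypothetical field names (not the actual class definition):
```
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class BundledEntrySketch:
    # The entire compiled forward/backward artifacts are stored inline,
    # so no FxGraphCache._lookup_graph call is needed on load.
    compiled_fw: Any            # e.g. a serialized CompiledFxGraph
    compiled_bw: Optional[Any]  # None for inference-only graphs
    guards_expr: Optional[str]  # guard expression evaluated at load time
```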

Differential Revision: [D74593304](https://our.internmc.facebook.com/intern/diff/D74593304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152840
Approved by: https://github.com/zhxchen17
2025-05-21 18:08:42 +00:00
PyTorch MergeBot
3eb8fa081a Revert "[3/n][Optimus][Auto-AC] Support float8_e4m3fn quantization type and set scaling as the default (#153802)"
This reverts commit 32b1baa981.

Reverted https://github.com/pytorch/pytorch/pull/153802 on behalf of https://github.com/malfet due to It breaks ROCM testing, see d23762974e/1 ([comment](https://github.com/pytorch/pytorch/pull/153802#issuecomment-2898695702))
2025-05-21 17:20:31 +00:00
Jane Xu
8817e5ac80 Render Example: and not Example:: in docs (#153978)
Everything here is a grep except the changes in tools/autograd/load_derivatives.py which I manually corrected.

The correct notation is:
```
Example::

    >>> ...
```

It is common and wrong to have:
```
Example::
    >>> ...
```

In the wrong example, we get these pesky double colons:
![image](https://github.com/user-attachments/assets/20ffd349-68bb-4552-966c-e23923350476)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153978
Approved by: https://github.com/soulitzer, https://github.com/malfet
2025-05-21 01:03:26 +00:00
Menglu Yu
32b1baa981 [3/n][Optimus][Auto-AC] Support float8_e4m3fn quantization type and set scaling as the default (#153802)
Summary:
1. Customers can now test with float8_e4m3fn.
2. To play it safe, we set the scaling version as the default.

Test Plan:
### unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization
```

Buck UI: https://www.internalfb.com/buck2/f679f362-8bf4-454c-87df-a85cbc2ab2a8
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5066549861047443
Network: Up: 16KiB  Down: 3.9MiB  (reSessionID-98badbfd-76f7-487f-ab1c-1ec4f850614d)
Analyzing targets. Remaining     0/281
Executing actions. Remaining     0/5957                                                                                                   7.3s exec time total
Command: test.     Finished 3 local, 1 remote
Time elapsed: 1:29.7s
Tests finished: Pass 3. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D74910193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153802
Approved by: https://github.com/nareshrajkumar866, https://github.com/Hahu803, https://github.com/Mingming-Ding
2025-05-21 00:21:54 +00:00
PyTorch MergeBot
d81217be2e Revert "Improve torch.ops typing (#153558)"
This reverts commit c5cba39d46.

Reverted https://github.com/pytorch/pytorch/pull/153558 on behalf of https://github.com/yangw-dev due to Your diff will not be landed to fbcode since we suspect it caused the following breakage in an internal test:[D75007157](https://www.internalfb.com/diff/D75007157) for instance: tests_gpu/lookup_gpu_index_test.py:232:8 Undefined attribute [16]: torch._ops._OpNamespace has no attribute simple_index_mm_batch ([comment](https://github.com/pytorch/pytorch/pull/153558#issuecomment-2892506789))
2025-05-19 23:32:36 +00:00
Benjamin Glass
c5cba39d46 Improve torch.ops typing (#153558)
Fixes a longstanding issue where direct references to aten operations are seen as untyped by type checkers. This is accomplished by setting attributes on several classes more consistently, so that `__getattr__` can return a single type in all other cases.

Decisions made along the way:

1. `torch.ops.higher_order` is now implemented by a single-purpose class. This was effectively true before, but the class implementing it attempted to be generalized unnecessarily. Fixing this simplified typing for the `_Ops` class.
2. `__getattr__` is only called when all other lookup methods have failed, so several constant special-cases in the function could be implemented as class variables.

The remainder of this PR is fixing up all the bugs exposed by the updated typing, as well as all the nitpicky typing issues.
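
A minimal sketch of the general pattern (illustrative classes only, not the actual torch.ops code): once the constant special cases live as ordinary attributes, `__getattr__` only handles the remaining case and can have a single concrete return type.
```
class OpSketch:
    def __init__(self, qualified_name: str) -> None:
        self.qualified_name = qualified_name

class OpNamespaceSketch:
    def __init__(self, name: str) -> None:
        # Constant special cases are set as real attributes, so type
        # checkers see them directly and __getattr__ never handles them.
        self.name = name

    def __getattr__(self, op_name: str) -> OpSketch:
        # Only reached when normal attribute lookup fails; a single return
        # type keeps references like ns.some_op well-typed.
        return OpSketch(f"{self.name}::{op_name}")

aten_like = OpNamespaceSketch("aten")
print(aten_like.add.qualified_name)  # "aten::add"
```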

Test plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153558
Approved by: https://github.com/rec, https://github.com/Skylion007, https://github.com/cyyever
2025-05-19 14:52:32 +00:00
Animesh Jain
7fdd754136 [compile-time traces] Profile large missing gaps in compile time (#151256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151256
Approved by: https://github.com/bdhirsh, https://github.com/masnesral, https://github.com/zou3519, https://github.com/jansel
2025-05-13 14:44:51 +00:00
Menglu Yu
88a068f33b [2/n][Optimus][Auto-AC] Support activation quantization with scaling (#151770)
Summary:
Previously, we only supported non-scaling quantization, which may lead to overflow. Here we add support for scaling quantization and set it as the default version.

We quantize activation nodes based on size_in_mb; the default value is 100, i.e., as long as a node is at least 100 MB in size, we will quantize it.
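
As a rough, illustrative sketch of why scaling matters (this is not the pass implementation): a plain cast to float8 can overflow, whereas scaled quantization rescales values into float8's range and keeps the scale around for dequantization.
```
import torch

x = torch.randn(1024, 1024, dtype=torch.bfloat16) * 1e5  # large activations

# Non-scaled cast: values beyond float8_e5m2's representable range can overflow.
x_fp8_unscaled = x.to(torch.float8_e5m2)

# Scaled cast: rescale into range first, keep the scale for dequantization.
scale = torch.finfo(torch.float8_e5m2).max / x.abs().max().to(torch.float32)
x_fp8_scaled = (x.to(torch.float32) * scale).to(torch.float8_e5m2)
x_dequant = x_fp8_scaled.to(torch.float32) / scale
```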

Test Plan:
### how to enable

```
torch._inductor.config.post_grad_fusion_options = {
    "activation_quantization_aten_pass": {
        # dtype used to quantize; this is the default, but you can change it
        "quant_type": "torch.float8_e5m2",
        # default is False; set to True to use the scaling version
        "use_scaling": False,
        # default is 100; tune the value as needed
        "size_in_mb": 0.0,
        # whether to exclude quantizing parameters (primals); default is False
        "exclude_primals": False,
        # dtypes considered for quantization, separated by ";"; default is torch.bfloat16
        "allowed_dtypes": "torch.float16;torch.bfloat16;torch.float32",
    },
}
```

### toy model

```
buck2 run mode/opt //scripts/qyz/autoac:quantization
```

```
Epoch [80/200], Loss: 19227.2109
Epoch [100/200], Loss: 1353.5272
Epoch [120/200], Loss: 38630.6758
Epoch [140/200], Loss: 6239.9155
Epoch [160/200], Loss: 6039.1567
Epoch [180/200], Loss: 3994.3569
Epoch [200/200], Loss: 146.3966
```

Differential Revision: D73015996

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151770
Approved by: https://github.com/Mingming-Ding
2025-05-12 19:43:18 +00:00
James Wu
e21ff9c3be Add logging for guard miss failure (#153125)
Differential Revision: [D74371381](https://our.internmc.facebook.com/intern/diff/D74371381/)

This PR adds some logging for guard misses to tlparse, so that we know when AOTAutogradCache and FxGraphCache miss due to guards.

Example tlparse result:
https://gist.github.com/jamesjwu/afa19335c0aee85b24546b13c1cf6427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153125
Approved by: https://github.com/oulgen, https://github.com/jingsh
2025-05-09 16:51:04 +00:00
Menglu Yu
2d25e4d478 [1/n][Optimus][Auto-AC] Support activation quantization without scaling (#148380)
Summary: We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```

Buck UI: https://www.internalfb.com/buck2/776d3911-bb86-4ac8-a527-540cf1510b9d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4785074873051017
Network: Up: 4.3MiB  Down: 42MiB  (reSessionID-fef7e727-68b1-4645-a519-5652854df38d)
Executing actions. Remaining     0/4                                                                                 6.7s exec time total
Command: test.     Finished 2 local
Time elapsed: 3:11.5s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

### how to enable (you can override the dtype; if nothing is given, the default is fp8)

```
post_grad_fusion_options = {
    "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}
},
```

Differential Revision: D70522237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148380
Approved by: https://github.com/Mingming-Ding, https://github.com/Hahu803
2025-05-08 04:44:15 +00:00
PyTorch MergeBot
a28dcdba2c Revert "[aot][ca] save bw_module in AOTAutogradCache (#151860)"
This reverts commit 613bd46272.

Reverted https://github.com/pytorch/pytorch/pull/151860 on behalf of https://github.com/huydhn due to Chatting with @xmfan and decide to revert and reland this instead ([comment](https://github.com/pytorch/pytorch/pull/151860#issuecomment-2856709646))
2025-05-07 00:56:54 +00:00
Aaron Orenstein
7a0781eaad Improve cache key graph printing performance (#151928)
Teach the graph printer to allow overriding how SymTypes (`SymInt`, `SymFloat`, `SymBool`) are printed, and then use that to reuse the fast SymNode printing from `torch._inductor.utils.sympy_str()` so that computing the cache key is faster.

On my computer the repro from #151823 goes from 480s -> 80s (still terrible... but better).
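
For reference, `torch._inductor.utils.sympy_str()` produces a compact string form of a sympy expression; a tiny illustration (the symbol names here are made up):
```
import sympy
from torch._inductor.utils import sympy_str

s0, s1 = sympy.symbols("s0 s1")
expr = (s0 * 2 + 1) * s1
print(sympy_str(expr))  # compact printing, reused here to speed up cache keys
```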

Fixes #151823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151928
Approved by: https://github.com/laithsakka
2025-05-06 17:39:53 +00:00
James Wu
12a8b70247 [precompile] Refactor AOTAutogradCacheEntry to be generic (#152836)
The purpose of this stack is to create a new BundledAOTAutogradCacheEntry, which is an AOTAutogradCacheEntry that is self-contained, i.e. it contains all of the CompiledFxGraph directly in the entry, instead of relying on FxGraphCache._lookup_graph.

Because doing this would balloon the size of the actual cache entry, our goal is not to use BundledAOTAutogradCacheEntry in caching scenarios, only for precompile use cases. Thus, it's important that we make this whole setup generic, to be able to support these two workflows clearly.

This PR genericizes AOTAutogradCacheEntry considerably, so that it can take in different types of Forwards and Backwards.

Each GenericAOTAutogradCacheEntry is composed of two parts, a TForward and a TBackward. The forward and backward can be loaded in multiple ways, either via FxGraphCache._lookup_graph, or by saving the entire CompiledFxGraph.

For simplicity, this PR only implements the generic code refactors needed, but does not fully implement BundledAOTAutogradCacheEntry, which is an AOTAutogradCacheEntry that takes a full CompiledForward. We'll handle and implement BundledAOTAutogradCacheEntry in the PR above this, for easier review.
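
A loose sketch of the genericization (illustrative types only; the real entry has many more fields): the entry is parameterized over how the forward and backward are stored and loaded.
```
from dataclasses import dataclass
from typing import Generic, TypeVar

TForward = TypeVar("TForward")
TBackward = TypeVar("TBackward")

@dataclass
class GenericEntrySketch(Generic[TForward, TBackward]):
    compiled_fw: TForward
    compiled_bw: TBackward

# One instantiation could hold FxGraphCache lookup keys; another could hold
# the full CompiledFxGraph objects (the "bundled" variant described above).
```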

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152836
Approved by: https://github.com/oulgen
2025-05-06 15:19:17 +00:00
Francisco Massa
199d5a408a [partitioner] Fix argument to _broadcast_on_rank0 (#152846)
Summary:
There was a bug when I refactored my original implementation.

This should fix it

Test Plan: Run on some internal workloads

Differential Revision: D74190485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152846
Approved by: https://github.com/danthe3rd
2025-05-06 13:45:59 +00:00
zhxchen17
ffd58293f7 [dynamo] Guard serialization for FUNCTORCH_STACK_MATCH (#152616)
Make Functorch interpreters serializable most of the time, so that we can save the guards on functorch states.

## Test Cases:

0. torch.compile() without functorch layers present. Guard should fail with any layer being pushed.
1. torch.compile() nested in vmap.
2. torch.compile() nested in grad.
3. torch.compile() nested in jvp + vmap
4. torch.compile() nested functionalize
5. torch.compile() nested in vmap + grad

Differential Revision: [D74008787](https://our.internmc.facebook.com/intern/diff/D74008787/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152616
Approved by: https://github.com/zou3519
ghstack dependencies: #152615
2025-05-05 18:05:56 +00:00
rzou
3d777bae10 Inductor respects exact strides on custom ops by default (#150511)
If a tag is not specified on a custom operator, then inductor will
assume that it needs exact strides.
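
A hedged illustration using the public custom-op API (the namespace and op name below are made up): with no layout/stride tag on the op, Inductor now defaults to giving it inputs with exact strides.
```
import torch

# No stride/layout tag is specified here, so Inductor assumes the op needs
# exact strides on its inputs when it appears in a compiled graph.
@torch.library.custom_op("mylib::scale", mutates_args=())
def scale(x: torch.Tensor, factor: float) -> torch.Tensor:
    return x * factor

@scale.register_fake
def _(x: torch.Tensor, factor: float) -> torch.Tensor:
    return torch.empty_like(x)

@torch.compile
def f(x):
    return scale(x.t(), 2.0)  # the non-contiguous input keeps its exact strides

print(f(torch.randn(4, 8)))
```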

Test Plan:
- tests + CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150511
Approved by: https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #148104
2025-05-03 00:02:24 +00:00
rzou
2b37a726e0 Refactor layout constraint selection logic (#148104)
This PR:

- cleans up some existing comments that don't make sense anymore
- hooks up the "custom_op_default_layout_constraint" back (that seems to
have broken)
- cleans up the "lazy registration path" which seems to never get hit
anymore
- adds dislike_padding to nodes that require exact strides

Test Plan:
- tests + CI

disable padding

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148104
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-05-03 00:02:24 +00:00
Ryan Guo
16153a0f27 [AOTAutogradCache][Easy] Move "einops.einops.rearrange" to SAFE_NON_TORCH_FUNCTIONS (#152640)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152640
Approved by: https://github.com/oulgen, https://github.com/zou3519, https://github.com/bdhirsh
2025-05-02 19:09:30 +00:00
Laith Sakka
376529c78b consolidate guard_or_x and definitely_x (#152463)
definitely_true is almost the same as guard_or_false; the potential differences are not meaningful enough to justify the existence of both. The same goes for definitely_false, which can be expressed with guard_or_true and guard_or_false.
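
A rough equivalence sketch, under the assumption stated above that the differences are not meaningful (this is not the library's exact implementation):
```
from torch.fx.experimental.symbolic_shapes import guard_or_false, guard_or_true

def definitely_true_sketch(expr) -> bool:
    # True only when expr is known to be True; unknown falls back to False.
    return guard_or_false(expr)

def definitely_false_sketch(expr) -> bool:
    # True only when expr is known to be False; unknown falls back to False.
    return not guard_or_true(expr)
```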

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152463
Approved by: https://github.com/bobrenjc93
2025-05-02 18:08:11 +00:00
Animesh Jain
9e3fc41060 [invoke_subgraph] rename identifiers to prevent python mangling (#152581)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152581
Approved by: https://github.com/BoyuanFeng, https://github.com/zou3519
ghstack dependencies: #152547
2025-05-02 06:46:05 +00:00
Animesh Jain
4649fd17b0 [invoke_subgraph] Unpacked operands (#152547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152547
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2025-05-02 05:44:46 +00:00
Simon Fan
613bd46272 [aot][ca] save bw_module in AOTAutogradCache (#151860)
Compiled Autograd retraces AOT's bw_module into a larger graph at backward runtime, and today this runs into an issue on warm cache runs because the bw_module is not restored. This PR adds it to the cache, first stripping it of unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts at AOT compilation with a restored bw_module (which would probably crash).

Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860
Approved by: https://github.com/jamesjwu
ghstack dependencies: #149707
2025-05-01 21:59:43 +00:00
Simon Fan
c461ba6522 [aot] mark dynamic activations as maybe dynamic (#149707)
Today, we mark graph outputs as maybe dynamic; this lets a compilation communicate to future compilations whether certain graph inputs are dynamic. Similarly, we can do this for saved activations, which may be used in future compilations as well. This is especially relevant in compiled autograd, where tensor activations always become graph inputs.

Changes to the tests were mainly cosmetic, with the exception of tests that relied on duck shaping. By annotating tensor dims, we prevent them from reusing pre-existing symbols, so this change will make graphs use duck shapes less than before, which affects some of the caching tests.
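
For context, a minimal sketch of the user-facing counterpart of this annotation, the soft `maybe_mark_dynamic` hint (the tensor here is made up):
```
import torch

saved_activation = torch.randn(8, 128)

# A soft hint that dim 0 may be dynamic in a future compilation; unlike
# mark_dynamic, it does not error if the dim ends up specialized.
torch._dynamo.maybe_mark_dynamic(saved_activation, 0)
```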

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149707
Approved by: https://github.com/bdhirsh
2025-05-01 21:59:36 +00:00
Francisco Massa
e82dc0769c Respect checkpointed boundaries when using knapsack formulation in the partitioner (#141684)
When multiple checkpoint regions are back-to-back with no operations in-between, we enforce the operation at the boundary to be force-saved, see 7ea0da2d57/torch/_functorch/partitioners.py (L772-L807)

When using the `memory_budget` formulation on a graph which already has AC inside, we should respect the boundaries of the AC decision (which is set to `MUST_SAVE`), and thus ban those nodes from possible recomputation.

Adding tests would be nice, but I'm not sure what the best way to test this is right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141684
Approved by: https://github.com/bdhirsh
2025-05-01 15:28:41 +00:00
Francisco Massa
b6f8209f54 Remove redundant line in partitioner (#152517)
Summary: This is a cleanup from https://github.com/pytorch/pytorch/pull/152264, which contained a line which was a vestige from a previous implementation.

Test Plan: Let CI run

Differential Revision: D73904636

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152517
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
2025-04-30 23:17:30 +00:00
Ryan Guo
e4994e2f73 [AOTAutogradCache] Allow torch.Tensor and a non-torch op from einops (#152369)
This addresses part of #150706.

Specifically, it reduces the warm start `torch.compile` overhead by
40~50% for GGUF models on
1. HuggingFace diffusers: [tlparse before, 224s](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpqgbdva/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000) v.s. [tlparse after, 126s](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp950PFy/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000)
2. ComfyUI: [tlparse before, 93s](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp7SeJb4/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000) v.s. [tlparse after, 51s](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpRwGNqA/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000)

The improvements should generalize to all other GGUF models on these
platforms, because the cache miss was induced by framework code, which
will be hit by every GGUF model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152369
Approved by: https://github.com/jamesjwu
2025-04-30 17:34:21 +00:00
Animesh Jain
d620fefb2c [invoke_subgraph] Use backward identifier for min-cut partitioning (#152207)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152207
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2025-04-30 14:34:56 +00:00
Brian Hirsh
4a63cab624 [cudagraphs] Fix issue in collecting static_input_idxs (#152287)
related to https://github.com/pytorch/pytorch/issues/152275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152287
Approved by: https://github.com/bdhirsh, https://github.com/eellison

Co-authored-by: Brian Hirsh <hirsheybar@fb.com>
2025-04-30 03:24:05 +00:00
Francisco Massa
89c0c3ca80 Add private config to broadcast rank0 decision from the partitioner to all ranks (#152264)
Summary: This PR adds a private configuration to the partitioner that ensures the decision taken is the same across all ranks. This is a temporary workaround; once size_hints are also taken into account in compiler collectives, this workaround will not be needed anymore.

Test Plan:
This has been tested on some internal models, but I haven't added any tests in PyTorch (yet?)

Differential Revision: D73666017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152264
Approved by: https://github.com/bdhirsh
2025-04-29 21:27:57 +00:00
PyTorch MergeBot
a6d19fcfac Revert "[cudagraphs] Fix issue in collecting static_input_idxs (#152287)"
This reverts commit 75a564608a.

Reverted https://github.com/pytorch/pytorch/pull/152287 on behalf of https://github.com/wdvr due to causing ao failures - discussed with author ([comment](https://github.com/pytorch/pytorch/pull/152287#issuecomment-2837686127))
2025-04-29 06:57:06 +00:00
Animesh Jain
75a564608a [cudagraphs] Fix issue in collecting static_input_idxs (#152287)
related to https://github.com/pytorch/pytorch/issues/152275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152287
Approved by: https://github.com/bdhirsh, https://github.com/eellison
2025-04-28 23:07:52 +00:00
Yukio Siraichi
ee8166e94f Correctly handle duplicated arguments when merging input views. (#146275)
Fix: #135099

This PR changes how we map the original inputs into the new set of
inputs that take in the tensor input's base instead of their aliases.

**Problem:** in order to create this mapping, we had a dictionary that
mapped the hashed arguments into their respective indices. However, if
there's a group of equal arguments, we will have only one mapping for
such an argument. This breaks the assumption that there will be one
mapping for each argument.

**Solution:** map the hashed arguments into a list of indices. Then, we
will be able to correctly reconstruct the parameters for the new calling
convention.
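
A minimal sketch of this fix (the helper name and argument representation are made up): each hashed argument maps to a list of indices, so duplicated arguments keep every position.
```
from collections import defaultdict

def build_index_map(hashed_args):
    index_map = defaultdict(list)
    for idx, h in enumerate(hashed_args):
        index_map[h].append(idx)
    return index_map

# Two equal arguments hash to the same key, but both positions survive,
# so the parameters for the new calling convention can be reconstructed.
print(build_index_map(["view_of_base", "other", "view_of_base"]))
# defaultdict(<class 'list'>, {'view_of_base': [0, 2], 'other': [1]})
```
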
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146275
Approved by: https://github.com/bdhirsh
2025-04-26 14:50:16 +00:00
Yuanhao Ji
d7eb3a492c [Typing] Enable torch.types.IntLikeType / FloatLikeType / BoolLikeType (#152157)
### Changes

Replace `Union[SymInt, int]` and `Union[int, SymInt]` with `IntLikeType`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152157
Approved by: https://github.com/Skylion007
2025-04-25 19:00:10 +00:00
Animesh Jain
d743a7bd85 [invoke_subgraph] Cache fake tensor if no unbacked symint in the output (#151957)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151957
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
ghstack dependencies: #151409, #151633, #151477
2025-04-24 14:17:22 +00:00
rzou
596296fb0b [standalone_compile] Dynamic shape handling (#151788)
standalone_compile needs to get dynamic shape information from
somewhere. We add a new `dynamic_shapes` argument with three options:

1. from the passed-in graph (dynamic="from_graph"). This is the default.
2. from the example inputs, thereby specializing on them. (dynamic="from_example_inputs")
3. from the current tracing context (dynamic="from_tracing_context")

1 and 3 are not exactly the same. 2 can also be used for more advanced
things... (specialize on one input but not the other).
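
A rough usage sketch; the entry point and exact signature are assumed from the standalone_compile API summary elsewhere in this log (#150670), and the accepted graph types may differ:
```
import torch
import torch._inductor

def f(x):
    return x * 2

gm = torch.fx.symbolic_trace(f)
x = torch.randn(4)

# Default: take dynamic shape information from the passed-in graph.
artifact = torch._inductor.standalone_compile(gm, [x], dynamic_shapes="from_graph")

# Or specialize on the example inputs instead.
artifact = torch._inductor.standalone_compile(
    gm, [x], dynamic_shapes="from_example_inputs"
)
```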

Most of this PR is tests.

Test Plan:
- a lot of new tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151788
Approved by: https://github.com/oulgen
2025-04-22 20:17:24 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar
69ee6a9280 [Sana][HybridCache] Fix bug in detect_attr_assignment (#151824)
Summary: tree_flatten_with_map will internally call the unflatten function with the user-supplied function, but that function was not returning anything, causing the leaves to be None. This is wrong when the constructor is sensitive to this behaviour.

Test Plan: CI

Differential Revision: D73388529

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151824
Approved by: https://github.com/bdhirsh
2025-04-22 19:39:50 +00:00
Sam Larsen
529f698ad4 [logging] Put "everything" WaitCounters in dynamo_timed (#151757)
Summary: The main motivation is to capture the cudagraphs overhead in a WaitCounter. We'll combine that with Triton autotuning, and therefore rename it to "compile_runtime_overheads". Since we have a couple of WaitCounters where we want to capture all runtime and compile overheads, let's put the accounting in dynamo_timed so we'll automatically capture any top-level timed regions that get added in the future. Also, dynamo_timed already has to figure out whether we're timing a runtime vs. compile-time event, so we can reuse some of that logic.

Test Plan:
Ran an internal model with `TORCHINDUCTOR_BENCHMARK_FUSION=1` (to get benchmarking at compile time in addition to runtime).

Overall compile time from various sources matches up:
* tlparse: https://fburl.com/9fgsstkr. Eyeballing, total time should be 32 ranks x 2175 = ~69.6k s
* ods: https://fburl.com/canvas/r4clhnb7. Right on.
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/ax71aqox. Right on.
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/shcjd9ql. Right on.

And the runtime overhead:
* ods: https://fburl.com/canvas/nvgjb282
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/f2dtv0qh

If we compare that to a run of the same model without the changes in this stack, results can mismatch by a lot:
* tlparse: https://fburl.com/cchxwd1s. Eyeballing, total time should be 32 ranks x 2300s = ~73.6k s
* ods: https://fburl.com/canvas/x1i3wvf4. It's kinda close
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/l7sgxdxd. Waaay too high.
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/jb4s9z1u. This is the only one that's actually correct.

The discrepancy is even worse if we focus on the runtime events:
* ods: https://fburl.com/canvas/a4o9f7ou
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/95izaes1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151757
Approved by: https://github.com/ppanchalia
ghstack dependencies: #151749
2025-04-22 03:29:13 +00:00
James Wu
a4fdae5c84 Lift guard checking logic to AOTAutogradCache (#151563)
This somewhat complicated PR does a few things:
- It separates out a lot of the guard checking logic into its own class, GuardedCache[T]
- It adds a new `check_guard_hit` lambda to FXGraphCache._lookup_graph, which allows callers to define their own guard checking logic
- It then uses these two combined parts to lift guard checking to AOTAutogradCache. This means that AOTAutogradCache stores its own guard expressions and evaluates them.
- FXGraphCache's guard checking logic is completely unchanged, just refactored. As part of the work, I'm able to extend a bit of the logging functionality of AOTAutogradCache into FXGraphCache, so that you can know if FXGraphCache missed due to a guard failure or a full cache miss.
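
A loose sketch of the shape of this refactor (the names GuardedCache and check_guard_hit come from the list above; the fields and methods are illustrative):
```
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")

class GuardedCacheSketch(Generic[T]):
    """Stores entries together with the guard expressions they were saved under."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, T]] = []

    def put(self, guards_expr: str, entry: T) -> None:
        self._entries.append((guards_expr, entry))

    def lookup(self, check_guard_hit: Callable[[str], bool]) -> Optional[T]:
        # The caller supplies its own guard-checking logic, e.g. evaluating
        # the expression against current hints or comparing guard strings.
        for guards_expr, entry in self._entries:
            if check_guard_hit(guards_expr):
                return entry
        return None
```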

# Why do this?
Lifting guards to AOTAutogradCache has a few benefits:
- First, it fixes a long-standing bug in guard checking logic. Backward passes can have different symint inputs from forward passes depending on the forward output, if AOTAutograd chooses to store symints for the backward. These symint inputs have the same underlying symbols as the forward, but on an AOTAutogradCache hit, we don't have access to the hints backing these exact symints (we only have hints for the symints on the forward function). By lifting guard checking logic to AOTAutogradCache, we no longer need to check the backward guards, as they'll be included in the AOTAutogradCache guard expression. **I've added a unit test that failed before my diff, and now passes, as an example of this**
- Secondly, this is the first step necessary to bundle CompiledFxGraph into AOTAutogradCache. Doing so will simplify our cache logic significantly, and also make precompile logic simpler, as precompiles will only need to store AOTAutogradCacheEntrys, without needing to match them up with inductor FXGraphCache entries.
- Finally, adding guard checking logic to AOTAutogradCache may allow us in the future to handle more complicated cases like a single forward with multiple backwards, as guard checks are now storable on the cache entry itself.

# Guard checking logic of AOTAutogradCache
When AOTAutogradCache evaluates guard expressions, it no longer needs to evaluate the forward/backward guards in the FXGraphCacheEntry (since the AOTAutogradCache guard expressions will encompass them). Because of this, we still need a way for AOTAutogradCache to distinguish between multiple FXGraphCache local entries. To do so, AOTAutogradCache stores the guard string from FXGraphCache, which it uses as a second "cache key". It doesn't need to **evaluate** these guards, it just needs to find the cache entry from FXGraphCache that had the same guards as when it was stored.

After this, I will work on putting the FXGraphCache entries directly into AOTAutogradCache. If I can put CompiledFxGraphs in the cache directly, I no longer need this complicated `check_guard_hit` overriding logic.

## Test Plan
Added a new unit test. There are comprehensive guard checking unit tests in `test_aot_autograd_cache` already, and those pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151563
Approved by: https://github.com/oulgen
2025-04-22 03:01:08 +00:00
PyTorch MergeBot
fd04c79878 Revert "[aot autograd][logging] Profile large missing gaps in compile time tracing (#151256)"
This reverts commit 8e373592c8.

Reverted https://github.com/pytorch/pytorch/pull/151256 on behalf of https://github.com/Camyll due to breaking internal tests, cannot import ([comment](https://github.com/pytorch/pytorch/pull/151256#issuecomment-2819244186))
2025-04-21 18:49:23 +00:00
Oguz Ulgen
0f8613bf5c Introduce unsafe way to mark functions as cacheable (#151603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151603
Approved by: https://github.com/jamesjwu
ghstack dependencies: #151768, #151609
2025-04-21 17:37:38 +00:00
Animesh Jain
8e373592c8 [aot autograd][logging] Profile large missing gaps in compile time tracing (#151256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151256
Approved by: https://github.com/bdhirsh, https://github.com/masnesral
ghstack dependencies: #151330
2025-04-16 20:37:08 +00:00
Oguz Ulgen
3cf0e2d8ec Add inductor standalone_compile API (#150670)
This PR adds a standalone_compile API that does precompilation via caching to support the vLLM use case in the short term while we work on the longer-term precompilation solution.

```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2025-04-15 23:38:15 +00:00
PyTorch MergeBot
74f6bc28a7 Revert "Add inductor standalone_compile API (#150670)"
This reverts commit c9aef50898.

Reverted https://github.com/pytorch/pytorch/pull/150670 on behalf of https://github.com/Camyll due to breaking internal builds with torch module not found error ([comment](https://github.com/pytorch/pytorch/pull/150670#issuecomment-2806975267))
2025-04-15 17:35:59 +00:00
Oguz Ulgen
c9aef50898 Add inductor standalone_compile API (#150670)
This PR adds a standalone_compile API that does precompilation via caching to support the vLLM use case in the short term while we work on the longer-term precompilation solution.

```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2025-04-14 22:00:09 +00:00
Aaron Orenstein
1f5af12cd9 Using hasattr for _boxed_call is asking for trouble (#151130)
Summary:
There are a number of places in the code checking for the existence of `_boxed_call` instead of checking for a `True` value. This is somewhat dangerous because one would assume that setting it to `None` or `False` would be the same as not setting it (output_code.py does this, for example).

Change `hasattr()` to `getattr(..., False)` for these cases.
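
A small before/after sketch of the pattern (the class below is hypothetical):
```
class CompiledFnSketch:
    _boxed_call = None  # set, but falsy -- e.g. what output_code.py may do

fn = CompiledFnSketch()

# Old pattern: a pure presence check treats the falsy value as "boxed".
print(hasattr(fn, "_boxed_call"))               # True (misleading)

# New pattern: a falsy value behaves the same as the attribute being unset.
print(bool(getattr(fn, "_boxed_call", False)))  # False
```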

Test Plan: unit tests pass

Differential Revision: D72806693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151130
Approved by: https://github.com/Skylion007
2025-04-14 18:36:30 +00:00
PyTorch MergeBot
24b3ab9255 Revert "Add inductor standalone_compile API (#150670)"
This reverts commit bbc5fe8504.

Reverted https://github.com/pytorch/pytorch/pull/150670 on behalf of https://github.com/albanD due to Broke profiler test ([comment](https://github.com/pytorch/pytorch/pull/150670#issuecomment-2802067144))
2025-04-14 15:22:33 +00:00
Oguz Ulgen
bbc5fe8504 Add inductor standalone_compile API (#150670)
This PR adds a standalone_compile API that does precompilation via caching to support the vLLM use case in the short term while we work on the longer-term precompilation solution.

```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2025-04-14 07:07:10 +00:00
James Wu
3dcb46c30e [easy] Add cache bypass traceback information to cache_info on autograd_cache_bypass (#151025)
This will help us better debug pickling errors, etc., in internal models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151025
Approved by: https://github.com/masnesral
2025-04-12 19:56:32 +00:00