Previously, when we lowered the backward ahead of time due to symints, the post-grad passes would leave the bw_module in a non-runnable state. This caused issues when compiled autograd tried to trace it at runtime, so we had Inductor operate on a deepcopy of bw_module.
But as https://github.com/pytorch/pytorch/issues/153993 shows, deepcopying real tensors fails under fake mode due to the device-type mismatch between fake tensors (the "meta" device) and real tensors. By disabling fake mode around the deepcopy, we avoid these errors. This change is a strict improvement over the current behavior, but it does reveal that the deepcopy can theoretically cause OOMs.
FIXES https://github.com/pytorch/pytorch/issues/153993
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153999
Approved by: https://github.com/jamesjwu, https://github.com/bdhirsh
Finally, this PR adds BundledAOTAutogradCacheEntry. A BundledAOTAutogradCacheEntry is an AOTAutogradCacheEntry that saves the entire CompiledFxGraph directly in the entry.
This has some advantages:
- No more dependency on FxGraphCache at all
- Clearing FxGraphCache does not result in an AOTAutogradCache miss
- Simpler logic, as BundledAOTAutogradCacheEntry has everything you need to load a full compiled Python wrapper from a Dynamo output
We plan to use BundledAOTAutogradCacheEntry for precompile. There's also a question of whether we want to use it for regular caching; the main disadvantage is having to save the same CompiledFxGraph twice, once in the Inductor cache and once for AOTAutogradCache. With MegaCaching, this *could* be a regression in total cache size (as well as a minor cold start regression, as you have to save the same graph twice). I will import this and measure the mega-cache space overhead, and if it looks good I'll enable it by default for caching as well.
On warm start, if AOTAutogradCache hits, you won't have to load inductor at all, so warm start overhead should be unaffected.
Differential Revision: [D74593304](https://our.internmc.facebook.com/intern/diff/D74593304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152840
Approved by: https://github.com/zhxchen17
Fixes longstanding issue where direct references to aten operations are seen as untyped by type checkers. This is accomplished by setting attributes on several classes more consistently, so that `__getattr__` can return a single type in all other cases.
Decisions made along the way:
1. `torch.ops.higher_order` is now implemented by a single-purpose class. This was effectively true before, but the class implementing it attempted to be generalized unnecessarily. Fixing this simplified typing for the `_Ops` class.
2. `__getattr__` is only called when all other lookup methods have failed, so several constant special-cases in the function could be implemented as class variables.
The remainder of this PR is fixing up all the bugs exposed by the updated typing, as well as all the nitpicky typing issues.
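As a rough illustration of the attribute chain whose typing this tightens (the snippet below is illustrative, not taken from the PR):
```
import torch

# torch.ops.aten.* is resolved via __getattr__; with this change, type checkers
# see a single, consistent return type instead of an untyped Any.
packet = torch.ops.aten.add          # an OpOverloadPacket
overload = packet.Tensor             # a concrete OpOverload
out = overload(torch.ones(2), torch.ones(2))
```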
Test plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153558
Approved by: https://github.com/rec, https://github.com/Skylion007, https://github.com/cyyever
Summary:
Previously, we only supported non-scaling quantization, which may lead to overflow. Here we add support for scaling quantization and set it as the default version.
We quantize activation nodes based on size_in_mb; the default value is 100, i.e. any node that is at least 100MB in size will be quantized.
Test Plan:
### how to enable
```
torch._inductor.config.post_grad_fusion_options = {
    "activation_quantization_aten_pass": {
        # dtype to quantize to; default is torch.float8_e5m2, change it if needed
        "quant_type": "torch.float8_e5m2",
        # default is False; set to True to use the scaling version
        "use_scaling": False,
        # size threshold in MB; default is 100, tune the value as needed
        "size_in_mb": 0.0,
        # whether to exclude quantizing parameters (primals); default is False
        "exclude_primals": False,
        # dtypes considered for quantization, separated by ";"; default is torch.bfloat16
        "allowed_dtypes": "torch.float16;torch.bfloat16;torch.float32",
    },
}
```
### toy model
```
buck2 run mode/opt //scripts/qyz/autoac:quantization
```
```
Epoch [80/200], Loss: 19227.2109
Epoch [100/200], Loss: 1353.5272
Epoch [120/200], Loss: 38630.6758
Epoch [140/200], Loss: 6239.9155
Epoch [160/200], Loss: 6039.1567
Epoch [180/200], Loss: 3994.3569
Epoch [200/200], Loss: 146.3966
```
Differential Revision: D73015996
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151770
Approved by: https://github.com/Mingming-Ding
Summary: We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.
Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```
Buck UI: https://www.internalfb.com/buck2/776d3911-bb86-4ac8-a527-540cf1510b9d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4785074873051017
Network: Up: 4.3MiB Down: 42MiB (reSessionID-fef7e727-68b1-4645-a519-5652854df38d)
Executing actions. Remaining 0/4 6.7s exec time total
Command: test. Finished 2 local
Time elapsed: 3:11.5s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
# E2E
### how to enable (you can override the dtype; if nothing is given, the default is fp8)
```
post_grad_fusion_options={
"activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}
},
```
Differential Revision: D70522237
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148380
Approved by: https://github.com/Mingming-Ding, https://github.com/Hahu803
Teach the graph printer to allow overriding how SymTypes (`SymInt`, `SymFloat`, `SymBool`) are printed, and then use that to reuse the fast SymNode printing from `torch._inductor.utils.sympy_str()` to make computing the cache key faster.
On my computer the repro from #151823 goes from 480s -> 80s (still terrible... but better).
Fixes #151823
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151928
Approved by: https://github.com/laithsakka
The purpose of this stack is to create a new BundledAOTAutogradCacheEntry, which is an AOTAutogradCacheEntry that is self-contained, i.e. it contains the entire CompiledFxGraph directly in the entry instead of relying on FxGraphCache._lookup_graph.
Because doing this would balloon the size of the actual cache entry, our goal is not to use BundledAOTAutogradCacheEntry in caching scenarios, only for precompile use cases. Thus, it's important that we make this whole setup generic, so it can clearly support both workflows.
This PR genericizes AOTAutogradCacheEntry considerably, so that it can take in different types of Forwards and Backwards.
Each GenericAOTAutogradCacheEntry is composed of two parts, a TForward and a TBackward. The forward and backward can be loaded in multiple ways, either via FxGraphCache._lookup_graph, or by saving the entire CompiledFxGraph.
For simplicity, this PR only implements the generic code refactors needed; it does not fully implement BundledAOTAutogradCacheEntry, which is an AOTAutogradCacheEntry that takes a full CompiledForward. We'll implement BundledAOTAutogradCacheEntry in the PR above this one, for easier review.
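A hypothetical sketch of the generic shape described above (names and fields are illustrative, not the actual PyTorch classes):
```
from dataclasses import dataclass
from typing import Generic, TypeVar

TForward = TypeVar("TForward")
TBackward = TypeVar("TBackward")

@dataclass
class GenericEntrySketch(Generic[TForward, TBackward]):
    """A cache entry parameterized over how its forward and backward are stored/loaded."""
    compiled_fw: TForward
    compiled_bw: TBackward

# One instantiation can hold lookup handles into FxGraphCache (the existing entry);
# a bundled instantiation can hold entire serialized CompiledFxGraphs instead.
```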
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152836
Approved by: https://github.com/oulgen
Make Functorch interpreters serializable most of the time, so that we can save the guards on functorch states.
## Test Cases:
0. torch.compile() without functorch layers present. Guard should fail with any layer being pushed.
1. torch.compile() nested in vmap.
2. torch.compile() nested in grad.
3. torch.compile() nested in jvp + vmap
4. torch.compile() nested in functionalize
5. torch.compile() nested in vmap + grad
Differential Revision: [D74008787](https://our.internmc.facebook.com/intern/diff/D74008787/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152616
Approved by: https://github.com/zou3519
ghstack dependencies: #152615
This PR:
- cleans up some existing comments that don't make sense anymore
- hooks the "custom_op_default_layout_constraint" back up (it seems to have broken)
- cleans up the "lazy registration path", which seems to never get hit anymore
- adds dislike_padding to nodes that require exact strides
Test Plan:
- tests + CI
disable padding
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148104
Approved by: https://github.com/shunting314, https://github.com/eellison
definitely_true is almost the same as guard_or_false; the potential differences are not meaningful enough to justify the existence of both. The same goes for definitely_false, which can be expressed with guard_or_true and guard_or_false.
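Roughly, the equivalence being relied on (a sketch of the semantics, assuming the `guard_or_*` helpers from `torch.fx.experimental.symbolic_shapes`; not the exact code being removed):
```
from torch.fx.experimental.symbolic_shapes import guard_or_false, guard_or_true

def definitely_true_sketch(expr) -> bool:
    # True only when expr can be statically evaluated to True;
    # unbacked/data-dependent expressions fall back to False.
    return guard_or_false(expr)

def definitely_false_sketch(expr) -> bool:
    # True only when expr can be statically evaluated to False;
    # unknown expressions stay True under guard_or_true, so this returns False.
    return not guard_or_true(expr)
```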
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152463
Approved by: https://github.com/bobrenjc93
Compiled Autograd retraces AOT's bw_module into a larger graph at backward runtime, and today this runs into an issue on warm cache runs because the bw_module is not restored. This PR adds it to the cache, after first stripping it of unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts at AOT compilation with a restored bw_module (which would probably crash).
Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860
Approved by: https://github.com/jamesjwu
ghstack dependencies: #149707
Today, we mark graph outputs as maybe dynamic; this lets one compilation communicate to future compilations whether certain graph inputs are dynamic. Similarly, we can do this for saved activations, which may be used in future compilations as well. This is especially relevant in compiled autograd, where tensor activations will always become graph inputs.
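The closest user-facing analogue of this kind of annotation is `torch._dynamo.mark_dynamic`; the PR applies a weaker "maybe dynamic" hint automatically to saved activations. A sketch using only the public API:
```
import torch

@torch.compile
def f(x):
    return x * 2

x = torch.randn(8, 16)
# Hint that dim 0 should be treated as dynamic rather than specialized on 8.
# The PR applies a similar (weaker) annotation to saved activations so that a
# later compilation, e.g. the compiled autograd backward, treats them as dynamic.
torch._dynamo.mark_dynamic(x, 0)
f(x)
```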
Changes to the tests were mainly cosmetic, with the exception of tests that relied on duck shaping. By annotating tensor dims, we prevent them from reusing pre-existing symbols, so this change will make graphs use duck shapes less than before, which affects some of the caching tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149707
Approved by: https://github.com/bdhirsh
When multiple checkpoint regions are back-to-back with no operations in-between, we force the operation at the boundary to be saved; see 7ea0da2d57/torch/_functorch/partitioners.py (L772-L807)
When using the `memory_budget` formulation on a graph which already has AC inside, we should respect the boundaries of the AC decision (which is set to `MUST_SAVE`), and thus ban those nodes from possible recomputation.
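For context, a minimal setup that combines AC with the memory-budget formulation might look like this (a sketch; `activation_memory_budget` is the existing functorch config knob, not something added by this PR):
```
import torch
import torch._functorch.config as functorch_config
from torch.utils.checkpoint import checkpoint

# Ask the min-cut partitioner to target a fraction of the default saved-activation memory.
functorch_config.activation_memory_budget = 0.5

def block(x):
    return torch.nn.functional.relu(x @ x)

@torch.compile
def f(x):
    # Two back-to-back checkpointed regions: the op at their boundary is marked
    # MUST_SAVE, and with this change the memory-budget pass won't consider it
    # for recomputation.
    x = checkpoint(block, x, use_reentrant=False)
    x = checkpoint(block, x, use_reentrant=False)
    return x.sum()

f(torch.randn(64, 64, requires_grad=True)).backward()
```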
Adding tests would be nice, but I'm not sure what the best way to test this is right now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141684
Approved by: https://github.com/bdhirsh
Summary: This PR adds a private configuration to the partitioner that ensures the decision taken is the same across all ranks. This is a temporary workaround: once size_hints are also taken into account in compiler collectives, this workaround will no longer be needed.
Test Plan:
This has been tested on some internal models, but I haven't added any tests in PyTorch (yet?)
Differential Revision: D73666017
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152264
Approved by: https://github.com/bdhirsh
Fix: #135099
This PR changes how we map the original inputs into the new set of inputs that take in the tensor inputs' bases instead of their aliases.
**Problem:** in order to create this mapping, we had a dictionary that mapped the hashed arguments to their respective indices. However, if there's a group of equal arguments, we will have only one mapping for such an argument. This breaks the assumption that there is one mapping for each argument.
**Solution:** map the hashed arguments to a list of indices. Then we can correctly reconstruct the parameters for the new calling convention.
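A minimal sketch of the mapping change (a hypothetical helper, not the PR's actual code):
```
from collections import defaultdict

def build_arg_index_map(hashed_args):
    # Before: {hashed_arg: index}; equal arguments collide and only one index survives.
    # After: {hashed_arg: [indices]}; every occurrence keeps its own position, so the
    # parameters for the new calling convention can be reconstructed correctly.
    index_map = defaultdict(list)
    for i, h in enumerate(hashed_args):
        index_map[h].append(i)
    return index_map
```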
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146275
Approved by: https://github.com/bdhirsh
standalone_compile needs to get dynamic shape information from somewhere. We add a new `dynamic_shapes` argument with three options:
1. from the passed-in graph (`dynamic_shapes="from_graph"`). This is the default.
2. from the example inputs, thereby specializing on them (`dynamic_shapes="from_example_inputs"`).
3. from the current tracing context (`dynamic_shapes="from_tracing_context"`).
Options 1 and 3 are not exactly the same. Option 2 can also be used for more advanced things... (e.g. specializing on one input but not the other).
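A hedged usage sketch (assuming the `torch._inductor.standalone_compile` entry point from the related PRs in this stack; see those PRs for the exact API):
```
import torch
from torch._inductor import standalone_compile

def f(x):
    return x * 2

captured = {}

def capture_backend(gm, example_inputs):
    # Grab the Dynamo-produced graph so we can hand it to standalone_compile.
    captured["gm"], captured["inputs"] = gm, example_inputs
    return gm.forward

torch.compile(f, backend=capture_backend)(torch.randn(4))

# Default: take dynamic-shape information from the passed-in graph.
artifact = standalone_compile(captured["gm"], captured["inputs"], dynamic_shapes="from_graph")

# Or specialize on the example inputs instead.
specialized = standalone_compile(captured["gm"], captured["inputs"], dynamic_shapes="from_example_inputs")
```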
Most of this PR is tests.
Test Plan:
- a lot of new tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151788
Approved by: https://github.com/oulgen
Summary: tree_flatten_with_map internally calls the unflatten function with a user-supplied function. But this function was not returning anything, causing the leaves to be None. This is wrong when the constructor is sensitive to this behavior.
Test Plan: CI
Differential Revision: D73388529
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151824
Approved by: https://github.com/bdhirsh
This somewhat complicated PR does a few things:
- It separates out a lot of the guard checking logic into its own class, GuardedCache[T]
- It adds a new `check_guard_hit` lambda to FXGraphCache._lookup_graph, which allows callers to define their own guard checking logic (a rough sketch of this shape follows the list)
- It then uses these two parts combined to lift guard checking to AOTAutogradCache. This means that AOTAutogradCache stores its own guard expressions and evaluates them.
- FXGraphCache's guard checking logic is completely unchanged, just refactored. As part of the work, I extend a bit of AOTAutogradCache's logging functionality into FXGraphCache, so that you can know whether FXGraphCache missed due to a guard failure or a full cache miss.
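A hypothetical sketch of the `GuardedCache[T]` / `check_guard_hit` shape described above (names are illustrative, not the actual PyTorch internals):
```
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")

class GuardedCacheSketch(Generic[T]):
    """Stores candidate entries under one key and returns the first whose guards pass."""

    def __init__(self) -> None:
        self._entries: dict[str, list[tuple[str, T]]] = {}

    def insert(self, key: str, guard_expr: str, entry: T) -> None:
        self._entries.setdefault(key, []).append((guard_expr, entry))

    def lookup(self, key: str, check_guard_hit: Callable[[str], bool]) -> Optional[T]:
        # The caller supplies its own guard-checking logic, e.g. evaluating the
        # stored guard expression against the current shape hints.
        for guard_expr, entry in self._entries.get(key, []):
            if check_guard_hit(guard_expr):
                return entry
        return None
```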
# Why do this?
Lifting guards to AOTAutogradCache has a few benefits:
- First, it fixes a long-standing bug in the guard checking logic. Backward passes can have different symint inputs than forward passes, depending on the forward's outputs, if AOTAutograd chooses to store symints for the backward. These symint inputs have the same underlying symbols as the forward, but on an AOTAutogradCache hit, we don't have access to the hints backing these exact symints (we only have hints for the symints on the forward function). By lifting the guard checking logic to AOTAutogradCache, we no longer need to check the backward guards, as they'll be included in the AOTAutogradCache guard expression. **I've added a unit test that failed before my diff and now passes, as an example of this.**
- Secondly, this is the first step necessary to bundle CompiledFxGraph into AOTAutogradCache. Doing so will simplify our cache logic significantly, and also make precompile logic simpler, as precompiles will only need to store AOTAutogradCacheEntrys, without needing to match them up with inductor FXGraphCache entries.
- Finally, adding guard checking logic to AOTAutogradCache may allow us in the future to handle more complicated cases, like a single forward with multiple backwards, as guard checks are now storable on the cache entry itself.
# Guard checking logic of AOTAutogradCache
When AOTAutogradCache evaluates guard expressions, it no longer needs to evaluate the forward/backward guards in the FXGraphCacheEntry (since the AOTAutogradCache guard expressions encompass them). Because of this, we still need a way for AOTAutogradCache to distinguish between multiple local FXGraphCache entries. To do so, AOTAutogradCache stores the guard string from FXGraphCache, which it uses as a second "cache key". It doesn't need to **evaluate** these guards; it just needs to find the cache entry from FXGraphCache that had the same guards as when it was stored.
After this, I will work on putting the FXGraphCache entries directly into AOTAutogradCache. If I can put CompiledFxGraphs in the cache directly, I no longer need this complicated `check_guard_hit` overriding logic.
## Test Plan
Added a new unit test. There are comprehensive guard checking unit tests in `test_aot_autograd_cache` already, and those pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151563
Approved by: https://github.com/oulgen
This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution.
```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```
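A hedged usage sketch following the pseudo-signature above (the import path for `CompiledArtifact` and the exact parameter spelling are assumptions):
```
import torch
from torch._inductor import CompiledArtifact, standalone_compile  # import path is an assumption

def f(x):
    return x + 1

captured = {}

def backend(gm, example_inputs):
    # Capture the Dynamo-produced graph and inputs via a custom backend.
    captured["gm"], captured["inputs"] = gm, example_inputs
    return gm.forward

torch.compile(f, backend=backend)(torch.randn(4))

artifact = standalone_compile(captured["gm"], captured["inputs"])
artifact.save(path="/tmp/standalone_artifact", format="unpacked")  # persist to disk
loaded = CompiledArtifact.load(path="/tmp/standalone_artifact", format="unpacked")
```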
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
Summary:
There are a number of places in the code checking for the existence of `_boxed_call` instead of checking for a `True` value. This is somewhat dangerous because one would assume that setting it to `None` or `False` would be the same as not setting it (output_code.py does this, for example).
Change `hasattr()` to `getattr(..., False)` for these cases.
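A minimal before/after illustration of the pattern being fixed (the attribute name is real; the surrounding functions are illustrative):
```
def call_maybe_boxed(fn, args):
    # Before: presence alone decides, so _boxed_call = False or None still
    # takes the boxed path.
    if hasattr(fn, "_boxed_call"):
        return fn(args)   # boxed convention: a single list argument
    return fn(*args)

def call_maybe_boxed_fixed(fn, args):
    # After: only a truthy _boxed_call takes the boxed path.
    if getattr(fn, "_boxed_call", False):
        return fn(args)
    return fn(*args)
```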
Test Plan: unit tests pass
Differential Revision: D72806693
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151130
Approved by: https://github.com/Skylion007