Summary: there is an issue with functionalization V2 in export. This is a quick fix that plumbs `is_export` through to `run_functionalized_fw_and_collect_metadata`.
Test Plan: CI
Differential Revision: D64915263
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138836
Approved by: https://github.com/tugsbayasgalan
I was debugging an internal NE divergence for a while that ended up being caused by a bad meta implementation. I added a config option and an explicit backend `aot_eager_decomp_partition_crossref` to enable FakeCrossRefMode when running the graph. I added an explicit backend because I suspect it will be useful for internal models, but I'm also happy to leave it as just a config option.
It will only test ops that have a meta implementation, to avoid the memory overhead of hitting the fallback path and running in eager.
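A minimal usage sketch of the new backend, assuming it is registered like the other `aot_eager_*` debug backends (the config option's exact spelling is not given here, so only the backend string named above is used):
```
import torch

def f(x):
    # ops with meta implementations get cross-checked under FakeCrossRefMode
    return x.sin().cos()

compiled = torch.compile(f, backend="aot_eager_decomp_partition_crossref")
compiled(torch.randn(8))
```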
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138651
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
Summary: With the fast pickling mode, we don't need the custom hack for replacing device strings in tensors. This was previously needed because, e.g., two strings "cuda" will pickle differently if they are the same object vs. not.
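As a minimal illustration of the underlying pickle behavior (plain stdlib, not PyTorch code): the pickler memoizes objects, so repeated references to one string object serialize differently from two equal-but-distinct strings.
```
import pickle

s1 = "cuda"
s2 = "".join(["cu", "da"])          # equal value, distinct object
assert s1 == s2 and s1 is not s2

same_obj = pickle.dumps([s1, s1])   # second occurrence becomes a memo reference
diff_obj = pickle.dumps([s1, s2])   # both strings are serialized in full
print(same_obj == diff_obj)         # False: same values, different bytes
```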
Test Plan:
The new test fails with fast mode commented out, but succeeds when enabled:
`python test/inductor/test_codecache.py -k test_stable_strings`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138681
Approved by: https://github.com/oulgen
We found that in the `export() -> _inductor.aot_compile()` lowering, 3 different ShapeEnvs get created, leading to errors when one ShapeEnv processes expressions created by another ShapeEnv. This plumbs through the 2 places where ShapeEnv creation happens, detecting the original ShapeEnv from the GraphModule example values, so the original ShapeEnv is simply reused.
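A rough sketch of the detection idea, relying on the standard FakeTensor attributes; the helper name and plumbing are illustrative, not the actual code in this PR:
```
def _detect_existing_shape_env(gm):
    # Look for a fake "val" recorded on the graph nodes and reuse its ShapeEnv
    # instead of constructing a fresh one.
    for node in gm.graph.nodes:
        val = node.meta.get("val")
        fake_mode = getattr(val, "fake_mode", None)
        if fake_mode is not None and fake_mode.shape_env is not None:
            return fake_mode.shape_env
    return None
```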
Differential Revision: D64613290
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138362
Approved by: https://github.com/angelayi
`unwrap_tensor_subclasses` -> `get_plain_tensors` is used at runtime. For small models this overhead is noticeable in comparison with the small compiled kernel.
1/ Remove asserts from the runtime path.
2/ Avoid creating intermediate lists by passing an optional output list that arguments are appended to (see the sketch below).
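An illustrative sketch of the optional-output-list pattern; the helper signature is hypothetical, not the exact code in this PR:
```
from torch.utils._python_dispatch import is_traceable_wrapper_subclass

def get_plain_tensors(subclass, out=None):
    plain = out if out is not None else []
    attrs, _ = subclass.__tensor_flatten__()
    for name in attrs:
        inner = getattr(subclass, name)
        if is_traceable_wrapper_subclass(inner):
            # recurse into nested subclasses, appending into the same list
            get_plain_tensors(inner, out=plain)
        else:
            plain.append(inner)
    return plain
```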
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138498
Approved by: https://github.com/bdhirsh
Summary: same as title. The plan is to pass a callable to the partitioner to perform custom autoAC via an ILP. This is the same as a previous diff, D63714905, which was landed and then subsequently reverted by PyTorch Release Engineering because of a failing unit test (f7b8d36c28). We think the unit test is buggy, and we fix it here as well.
Test Plan: tbd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137785
Approved by: https://github.com/basilwong
Co-authored-by: Huy Do <huydhn@gmail.com>
Type annotations for compile_fx.
- Some of the stuff here is pretty complicated (functions which return functions that take functions) so I bailed on those and used `Any` just to get the rest landed.
- There are also changes to type signatures in other files which I did just to let mypy know more about the types in compile_fx.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138033
Approved by: https://github.com/Skylion007
Summary: same as title. The plan is to pass a callable to the partitioner to perform custom autoAC via an ILP. This is the same as a previous diff, D63714905, which was landed and then subsequently reverted by PyTorch Release Engineering because of a failing unit test (f7b8d36c28). We think the unit test is buggy, and we fix it here as well.
Test Plan: tbd
Differential Revision: D64246495
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137785
Approved by: https://github.com/basilwong
Compiled Autograd uses this AOT inference path, but it shows up as "aot_forward_graph" in tlparse output, which makes it hard to distinguish from the normal "aot_forward_graph"s that are also in the tlparse output. This PR renames it to "aot_inference_graph", which makes it easier to tell which tlparse graph block comes from Compiled Autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137803
Approved by: https://github.com/Microve, https://github.com/bdhirsh, https://github.com/ezyang
Summary: make it so that `config.activation_memory_budget_solver` can be set to a callable, which is then invoked to determine the set of saved/recomputed nodes.
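A hedged sketch of the hookup; the callable's exact signature is not spelled out in this summary, so `my_solver` is only a placeholder showing where the config is set:
```
import torch._functorch.config as functorch_config

def my_solver(*args, **kwargs):
    # custom ILP that decides which nodes to save vs. recompute
    ...

functorch_config.activation_memory_budget_solver = my_solver
```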
Test Plan: tbd
Reviewed By: Chillee, basilwong
Differential Revision: D63714905
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137314
Approved by: https://github.com/eellison, https://github.com/basilwong
Co-authored-by: Parikshit Shah <parikshit@meta.com>
Summary: Implement Remote AOTAutogradCache. It uses all the same tech as Remote FXGraphCache, just with its own name.
Test Plan:
Run benchmark:
TORCHINDUCTOR_AUTOGRAD_REMOTE_CACHE=1 TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=1 TORCHINDUCTOR_AUTOGRAD_CACHE=0 TORCHINDUCTOR_FX_GRAPH_CACHE=0 TORCH_LOGS=+torch._functorch._aot_autograd.autograd_cache buck run mode/opt benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --repeat 5 --performance --cold-start-latency
See that it cache hits even with local cache removed.
Results show up in remote cache logs https://fburl.com/scuba/pt2_remote_cache/5893dbaj
New unit tests
Reviewed By: oulgen
Differential Revision: D63323958
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137278
Approved by: https://github.com/oulgen
Summary: We had attribute assignment detection and handling of registered buffer assignments when using `aot_autograd`, but not when using just `make_fx`. Fixed.
Test Plan: expanded coverage of `test_state_tensors` to use `export` instead of `torch.export.export`
Differential Revision: D63802576
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137240
Approved by: https://github.com/tugsbayasgalan
We didn't support multiple levels of vmap. The main problem is that during
the batching rule, we need to exclude the vmap dispatch key
(FuncTorchBatched), like our C++ batching rules do.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137306
Approved by: https://github.com/Chillee
Summary: This just adds a config option and JK for turning on remote AOTAutogradCache. It does not implement anything with the new options being passed in. That will come next diff.
This PR also changes the flag for turning on the local AOTAutogradCache to be more consistent with that of FXGraphCache: TORCHINDUCTOR_AUTOGRAD_CACHE
Test Plan: Existing tests should pass and should build
Reviewed By: oulgen
Differential Revision: D63321965
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137011
Approved by: https://github.com/oulgen
Previously if we had a graph like:
```
triton_kernel_wrapper_functional_proxy = triton_kernel_wrapper_functional(...)
getitem: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out_ptr']
getitem_1: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out2_ptr']
sigmoid: "f32[3][1]cuda:0" = torch.ops.aten.sigmoid.default(getitem_1)
mul: "f32[3][1]cuda:0" = torch.ops.aten.mul.Tensor(tangents_1, sigmoid)
```
The partitioner would assume that the `sigmoid()` could be fused into either its user (the pointwise mul), or its producer (the user triton kernel). This could lead to a bad partitioning:
(1) If the partitioner thinks we can fuse the sigmoid with its producer triton kernel, we would keep the sigmoid compute in the forward, and have to generate two separate kernels in the forward (user triton kernel, dedicated sigmoid kernel)
(2) if the partitioner puts the sigmoid in the backward instead, we could fuse it with an existing backward kernel (the mul with a tangent)
Reviewed By: embg
Differential Revision: D63551393
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136878
Approved by: https://github.com/zou3519
Summary:
This diff logs the time_taken_ns for the forward and backward graphs in AOTAutogradCache, saving it into the cache entry.
This information is helpful later when I remotify the cache, and also is just useful to have in tlparse and chromium events.
Test Plan: Run benchmark, see that the times are in the chromium events.
Reviewed By: aorenste
Differential Revision: D62590077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136529
Approved by: https://github.com/oulgen
Original Issue: https://github.com/pytorch/pytorch/issues/134644
During the first tracing, we assume trace_tangents have the same memory_format as the corresponding inputs, outputs, and intermediates.
=>
Tracing time:
- Store trace_tangents_memory_formats in metadata
- Coerce tangents to deduced memory_format
Runtime:
- Coerce tangents to tracing memory format from metadata
Subclasses logic:
- Previously the tangent-coercion logic did not handle the nested-subclasses case; this is fixed here.
For subclasses we deduce the memory format of the subclass tensor first, then of each element of the subclass:
[subclass_tensor_memory_format, subclass_tensor_elem0_memory_format, ...]
If a subclass element (one of the `__tensor_flatten__()[0]` tensors) is itself a subclass, its slot holds a nested list of the same structure.
The recursive traversal of the subclass tree is expensive, so we do memory-format deduction and coercion at the same time to keep it to a single traversal (the `coerce_tangent_and_suggest_memory_format` method; see the sketch below). With this approach there is no regression compared with the previous logic, which also did a single traversal.
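A hedged illustration of the single-pass idea and of the nested format structure it produces; names and details are simplified and not the actual implementation:
```
import torch
from torch._prims_common import suggest_memory_format
from torch.utils._python_dispatch import is_traceable_wrapper_subclass

def coerce_and_collect(t):
    fmt = suggest_memory_format(t)
    coerced = t.contiguous(memory_format=fmt)
    if not is_traceable_wrapper_subclass(t):
        return coerced, fmt
    # for subclasses: [subclass_fmt, elem0_fmt_or_nested_list, ...]
    formats = [fmt]
    attrs, _ = t.__tensor_flatten__()
    for name in attrs:
        _, elem_fmt = coerce_and_collect(getattr(t, name))
        formats.append(elem_fmt)
    return coerced, formats
```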
Other small change: removed a duplicated, unrelated comment.
Testing
```
python test/functorch/test_aotdispatch.py -k test_channels_last_grads_no_force_contiguous
```
Benchmarking:
After change:
```
└─ $ PYTORCH_AOTD_DEBUG_PROFILE=1 python test/functorch/test_aotdispatch.py -k test_benchmark_grads_no_force_contiguous
Benchmark SUBCLASS avg_bwd_duration:4.059906005859375 ms
Benchmark NO_SUBCLASS avg_bwd_duration:3.1563830375671387 ms
```
Before change:
```
BEFORE_CHANGE SUBCLASS 4.1194
```
No significant changes in processing time.
(We do single traverse of subclass tree for collecting memory_formats and coercing during tracing.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135225
Approved by: https://github.com/bdhirsh
Cleanup:
1/ We do not need to call unwrap_subclasses() in the freezing wrapper, as it will be wrapped by the AOTD wrappers, which include SubclassesWrapper.
2/ No need to use weak references for the unwrapped list; dynamo optimizers need to clean the unwrapped list along with the original params_flat.
Verified fbcode compiled_optimizers tests.
Differential Revision: [D63393651](https://our.internmc.facebook.com/intern/diff/D63393651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136549
Approved by: https://github.com/bdhirsh
Original issue:
https://github.com/pytorch/ao/issues/890
The problem:
TracingContext.flat_params contains the original params, with subclasses not yet desugared,
while the inductor freezing API works on AOT graphs, where subclasses are already desugared.
flat_params is used only for this logic, so storing the desugared subclasses in it fixes the issue.
Testing:
```
python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses
```
Torch AO original failure:
```
python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265
Approved by: https://github.com/bdhirsh
Summary:
After the previous refactor, we can now call load_with_key directly from AOTAutogradCache to use the remote FXGraphCache.
This does *not* implement a remote AOTAutogradCache. It just allows AOTAutogradCache to work with remote FXGraphCache.
Test Plan: (Meta only tests)
Reviewed By: aorenste
Differential Revision: D62384944
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136173
Approved by: https://github.com/oulgen
Original issue:
https://github.com/pytorch/ao/issues/890
The problem:
TracingContext.flat_params contains the original params, with subclasses not yet desugared,
while the inductor freezing API works on AOT graphs, where subclasses are already desugared.
flat_params is used only for this logic, so storing the desugared subclasses in it fixes the issue.
Testing:
```
python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses
```
Torch AO original failure:
```
python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265
Approved by: https://github.com/bdhirsh
If a node is an AC region output and has a backward hook on it, we intentionally choose to save it.
This is to work around circular dependencies in Traceable FSDP2+AC.
Example:
```
out = fully_shard(utils.checkpoint(module))(x)
norm_out = layer_norm(out)
```
and there is a circular dependency:
1. In backward, the grad_input of layer_norm (aka `out_grad`) actually depends on `out`.
2. `out` depends on `out`'s backward hook created by FSDP2 (which does all-gather for `module` weights) in order to be recomputed.
3. `out`'s FSDP2 backward hook, as is the case for all eager backward hooks, depends on `out_grad` -> circular dependency with (1)!
Solution: check whether `out` has a backward hook, and if so, intentionally save `out` in forward graph outputs. With this, we can break the above circular dependency.
----
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135727
Approved by: https://github.com/Chillee
This PR resolves #134408. An additional test is added and passes locally.
Do you think we should add a post-check to ensure `args` and `kwargs` are not both `None`? It seems to be possible to have modules without inputs.
This PR does not include any such post-check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134643
Approved by: https://github.com/zou3519
Using `fsdp.set_` for unsharded_param in-place updates causes difficult-to-debug errors when enabling Traceable FSDP2 on TorchTune models. In this PR, we change it to use `fsdp.copy_`, which fixes the error and also strictly follows eager semantics (i.e. if the user explicitly stores an alias of the unsharded_param during execution of the user's module code, that alias will get updated correctly when the unsharded_param is copy_ into; whereas if we just swap out the unsharded_param storage via set_, that user-saved alias will not get updated, which is not good).
This PR also implements the graph pass to remove the resizes and copy if there is a resize_(full) -> copy_ -> resize_(0) pattern.
------
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_copy_`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_partitioner_cse_respects_mutation_boundaries`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_fsdp_set_input_mutation_applied_when_input_gets_no_gradients`
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mutation_op_matching`
- `python test/inductor/test_distributed_patterns.py DistributedPatternTests.test_fake_distributed_aot_eager`
- `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py TestEagerFusionOpInfoCPU.test_aot_autograd_exhaustive_norm_cpu_float32`
- `python test/distributed/test_inductor_collectives.py TestCollectivesInductor.test_backwards`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133730
Approved by: https://github.com/bdhirsh
This PR is slightly a revival / update to the discussion from https://github.com/pytorch/pytorch/pull/98960:
Part of FSDP2's tracing strategy right now is that:
(1) it is painful/difficult to handle the case where we have multiple graph input tensors that are aliased to each other and at least one of them is duplicated
(2) we already have longstanding in logic to remove duplicate input tensors from the graph in dynamo. Morally, FSDP2 gives us duplicate input tensors in the backward graph for every `unsharded_param`, because we have (a) the `unsharded_param` being closed over by the backward hook to resize/allgather, and (b) the same `unsharded_param` being saved for backward by autograd (we now guarantee in the partitioner that we will always save the base tensor for backward and recompute views)
(3) However, we were still seeing cases where the `unsharded_param` showed up twice in the backward graph inputs, as distinct tensor objects (with different python ids) instead of being true duplicates that dynamo can de-dup.
It turns out that this was because we were `.detach()`ing the `unsharded_param` in AOTDispatcher before plumbing it through the compiled forward (and so autograd would save a detach'd version of the `unsharded_param`). This is precisely because of the logic from https://github.com/pytorch/pytorch/pull/98960.
However, re-reading the detailed comments, it seems unnecessary to do a detach() on a graph input that is a (leaf) `nn.Parameter`, even if it happens to get no gradients in the backward. Since it is a leaf, we don't have to worry about the autograd engine "continuing to backprop through the graph beyond the current tensor" (the leaf has no other grad_fn for autograd to backprop through).
So this PR makes us a bit less aggressive about calling detach() on inputs: we only do it when:
(1) our graph input statically will get a `None` gradient (and also has no metadata mutations, the existing state)
(2) **and** our graph input is a non-leaf tensor (so detach()ing is actually required to prevent autograd from incorrectly backpropping past the non-leaf); see the sketch below.
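A hedged sketch of the narrowed condition; the helper name and arguments are illustrative, not the actual AOTDispatcher code:
```
def should_detach_graph_input(t, statically_gets_none_grad, has_metadata_mutation):
    # Only detach when skipping it could let autograd backprop past a non-leaf;
    # leaf nn.Parameters have no grad_fn to backprop through, so they are left alone.
    return (
        statically_gets_none_grad
        and not has_metadata_mutation
        and not t.is_leaf
    )
```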
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134193
Approved by: https://github.com/yf225
Co-authored-by: Will Feng <yf225@cornell.edu>
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7705964779531357/
This now also incorporates a test from https://github.com/pytorch/pytorch/pull/133585 (which it fixes) and the prep PR https://github.com/pytorch/pytorch/pull/134407 Including the PR desc from that:
I am trying to fix a problem reported by user in [fb.workplace.com/groups/6829516587176185/permalink/7705964779531357](https://fb.workplace.com/groups/6829516587176185/permalink/7705964779531357/) The summary of this problem is that when we do collect metadata analysis in AOTAutograd, we accumulate pending unbacked symbols which are going to be discarded at the end of the trace. However, if we do a recursive make_fx inside tracing, as occurs with torch.cond, we end up seeing that there are pending unbacked symbols that aren't associated with a binding, even though it's spurious (they've leaked into the inner make_fx call from the outer AOTAutograd analysis).
In https://github.com/pytorch/pytorch/pull/133588 I tried to just prevent adding the symbols to the pending list at all in the first place. But this itself caused some problems which were fixed in https://github.com/pytorch/pytorch/pull/124785 . The problem fixed in that PR is that when we allocate tangents that have unbacked size, something prevented them from having correct unbacked SymInts when ignore fresh unbacked SymInts was enabled. So I had patched it at the time by just not suppressing pending symbols and clearing them out some other way.
I think... I was wrong in that PR? That is to say, it was OK to avoid putting the fresh unbacked symbols in the pending list; the real problem was suppressing unbacked renamings. But there doesn't seem to be a good reason to suppress these; this PR shows that it doesn't actually fail any tests if you do these anyway. Intuitively, this makes sense, because you can't trigger renamings unless you're actually adding unbacked symbols to the pending set.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135053
Approved by: https://github.com/ydwu4
- The new implementation (auto_functionalized_v2) is enabled by default but can be disabled using an inductor flag.
- In export mode the old implementation is used.
**Motivation**
The previous functionalization fails to re-inplace arguments when they are views over other tensors.
see issue https://github.com/pytorch/pytorch/issues/131192
The new functionalization is easier to re-inplace for views.
**A) Functionalization pass**
Consider a program:
```
def func(t):
    x = t[0]
    y = t[1]
    foo(x, y)  # custom operator with x, y mutable
    return (x, y, t)
```
- To functionalize `foo` we generate a function that operates on the base tensors of the inputs (x.base() and y.base()),
and record how to regenerate the views out of the base; for argument x we record ```ViewInfo = (x.base(), x.size(), x.stride(), x.storage_offset())```.
- Due to some limitations on the torch.export arguments format, we have to generate a lot of arguments, but this is something we can simplify in the future. For the example above we get the following function.
```
auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.mylib.foo.default,
_x_base_index = 0, _x_size = (), _x_stride = (), _x_storage_offset = 0 ,
_y_base_index = 0,_y_size = (), _y_stride = (), _y_storage_offset = 1 ,
_all_bases = [arg0_1])
```
- In the code above:
- _all_bases (here [t]) refers to the unique set of bases of all of foo's arguments.
- for each argument x we have _x_base_index, _x_size, _x_stride, _x_storage_offset, which can be used to regenerate x from _all_bases[_x_base_index] (or from a copy of that base).
- the output of auto_functionalized is foo's output, followed by one tensor per base in _all_bases; each is a copy of the base tensor after applying the mutations of all the arguments that are views of that base.
- every use of a base in _all_bases (or of a view of it) that comes after the call to foo is replaced with a view of the corresponding new output.
For the function above, after functionalization we get:
```
def forward(self, arg0_1: "f32[2][1]cpu"):
auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.mylib.foo.default, _x_base_index = 0, _x_size = (), _x_stride = (), _x_storage_offset = 0, _y_base_index = 0, _y_size = (), _y_stride = (), _y_storage_offset = 1, _all_bases = [arg0_1])
getitem_1: "f32[2][1]cpu" = auto_functionalized[1]; auto_functionalized = None
copy_: "f32[2][1]cpu" = torch.ops.aten.copy_.default(arg0_1, getitem_1); arg0_1 = copy_ = None
# No stacktrace found for following nodes
select_2: "f32[][]cpu" = torch.ops.aten.select.int(getitem_1, 0, 0)
select_3: "f32[][]cpu" = torch.ops.aten.select.int(getitem_1, 0, 1); getitem_1 = None
return (select_2, select_3)
```
**B) Semantics of auto_functionalize**
The new semantics of auto_functionalize are the following (see the sketch below):
1. For each base in _all_bases, create a copy of the base (if a base ends up being re-inplaced we do not need to copy it).
2. For each arg, regenerate the arg from the copy of its base using the view information above.
3. Return the original foo output followed by the new bases.
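A hedged Python sketch of these runtime semantics (not the actual implementation; `view_infos` mirrors the per-argument index/size/stride/storage_offset fields above):
```
def auto_functionalized_v2_semantics(op, all_bases, view_infos):
    new_bases = [b.clone() for b in all_bases]           # step 1: copy each base
    args = [
        # step 2: regenerate each view argument on top of its copied base
        new_bases[base_index].as_strided(size, stride, storage_offset)
        for (base_index, size, stride, storage_offset) in view_infos
    ]
    out = op(*args)                                       # mutations land in new_bases
    return out, new_bases                                 # step 3: op output + new bases
```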
**C) Re-inplace pass**
Since auto_functionalize now copies the bases, what we actually re-inplace is the bases
(this runs just like before, but on the bases instead of the args).
1. For each base b in _all_bases, check whether there is any use of the base (or of its aliases/views) after auto_functionalize (before it is overwritten with a copy). If there is none, re-inplace it (i.e. avoid copying it in step 1 above).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134409
Approved by: https://github.com/zou3519
Summary:
The existing RemoteCacheBackend classes were a bit haphazard - some of them accepted bytes only, some accepted objects, some returned different types of objects than were passed in.
Update them to be more consistent:
1. RemoteCacheBackend is an implementation of a backend: Redis, Memcache, Manifold, LocalFile
2. RemoteCacheSerde is an implementation of a serde protocol - to turn structured objects (dict, list, etc) into bytes: RemoteCacheJsonSerde (json encoding), RemoteCachePassthroughSerde (strictly bytes only)
3. RemoteCache is the cache implementation itself, mixing a RemoteCacheBackend along with an RemoteCacheSerde to provide structured caching.
Other than simply reorganizing the existing cache code this also fixes the Redis autotune caching for OSS.
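A purely illustrative layering of the three pieces using toy classes (not the actual `torch._inductor` classes or signatures): the backend moves bytes, the serde converts objects to/from bytes, and the cache composes the two so callers deal only in structured objects.
```
import json

class BytesBackend:
    # Toy byte-oriented backend (stands in for Redis/Memcache/Manifold/LocalFile).
    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def put(self, key, data):
        self._store[key] = data

class JsonSerde:
    # Toy serde: structured objects <-> bytes.
    def encode(self, obj):
        return json.dumps(obj).encode()

    def decode(self, data):
        return json.loads(data)

class StructuredCache:
    # Toy cache composing a backend with a serde.
    def __init__(self, backend, serde):
        self.backend, self.serde = backend, serde

    def get(self, key):
        data = self.backend.get(key)
        return None if data is None else self.serde.decode(data)

    def put(self, key, obj):
        self.backend.put(key, self.serde.encode(obj))

cache = StructuredCache(BytesBackend(), JsonSerde())
cache.put("autotune/conv1", {"best_config": {"BLOCK": 64}})
assert cache.get("autotune/conv1")["best_config"]["BLOCK"] == 64
```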
Test Plan: unit tests
Reviewed By: oulgen
Differential Revision: D61178859
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134032
Approved by: https://github.com/oulgen, https://github.com/bhack
aten.empty is almost always fusible into its consumer, so we never CSE
it. This fixes a bug that looks like the following:
```py
import torch

@torch.library.custom_op("_reinplacing::sin_cos", mutates_args={"out_sin", "out_cos"})
def sin_cos(x: torch.Tensor, out_sin: torch.Tensor, out_cos: torch.Tensor) -> None:
out_sin.copy_(x.sin())
out_cos.copy_(x.cos())
@torch.compile
def f(x):
out0 = torch.empty_like(x)
out1 = torch.empty_like(x)
sin_cos(x, out0, out1)
return x.clone(), out0, out1
x = torch.randn(3, requires_grad=True)
f(x)
```
- cse would de-duplicate the empty nodes
- reinplacing would add an additional clone (because it can't write to
both tensors at the same time)
- the clone lowers into a new buffer + a copy_ kernel
- the copy_ kernel is unnecessary because "empty" is special - all reinplacing needed was an additional
buffer, it doesn't matter what the values are.
We could attempt to fix this on the reinplacing side but this seemed
better as a partitioner heuristic and the reinplacing fix is a bit more
tricky (we'd need to identify that the op never reads from the empty
node).
Test Plan:
- new test (the old number was 27, the new number is 21, so this PR
helped).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134703
Approved by: https://github.com/yf225
ghstack dependencies: #134466, #134490, #134491
Support of effectful operations in backward:
1/ AOTD collects metadata from the forward fn only, so effectful ops may be used in backward that were not used in forward => allow token discovery while tracing the joint function.
FunctionalTensorMode holds _tokens; in the joint function, after tracing forward, we memoize _tokens as `_tokens_forward_output`.
2/ Tokens are added as primal inputs (forward) in EffectTokensWrapper.
Tokens that will be used in backward end up among the partitioner's saved values. We do not control at which positions they are saved among the forward outputs.
If new tokens are discovered in backward after tracing joint_fn, they are manually appended to the end of the primals in the resulting graph (_aot_autograd/utils.py).
3/ All effectful ops in backward are marked with the 'must_be_in_backward' partitioner_tag, to prevent the partitioner from placing them in forward.
For that, functional_tensor_mode gains a new optional state `self._effects_partitioner_tag` for effectful ops, set after tracing forward.
There are additional changes in the partitioner to improve the functionality of 'must_be_in_backward'.
4/ Unlifting tokens now has to run for both forward and backward.
- Since tokens saved for backward land at non-static positions, we identify the input and output tokens to erase from the inputs and outputs of the `with_effects` operations.
- In forward we can have input tokens, discovered in backward, that are not used in any with_effects op in forward but are saved for backward. We identify them by their position among the forward inputs.
5/ Add aot debug logging for graphs before unlifting and before adding the additional primals for backward tokens.
Tests:
```
python test/higher_order_ops/test_with_effects.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132638
Approved by: https://github.com/bdhirsh
Part of #134054.
This corresponds to the pytorch mypy changes from D61493706. Updating takes so
long and touches so many files that it's impossible to land as a whole without conflicting with some other intermediate change.
So landing these 'type: ignore' for pytorch in advance of them actually being needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134202
Approved by: https://github.com/Skylion007
Summary:
This diff implements a bunch of views for internal scuba viewing.
TODOS that I might punt to another diff:
- Saving cache stats via counter is definitely sus here, but there's not really a good way to track "fx graph cache hit for this compile phase" right now. Will think about this more.
- We should definitely log frame id, compile id, etc
- We should definitely be logging configs. That way, we can A/B test based on whether a config is turned on.
- idk what I'm doing with compile_uuid yet, but it's useful when you want to look at samples for a single run. I think if we had mast job info this field is not needed, but it's nice to be able to drill down to a single run and get its chrome trace view or icicle view, so idk
Test Plan:
All of the above views are run with nanogpt benchmark:
```
buck run mode/opt caffe2/benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --performance
```
Differential Revision: D61603243
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134118
Approved by: https://github.com/oulgen
This threads through all of the necessary parts into AOT autograd from the FXGraphCache changes so that we can run cudagraphs properly on an AOTAutograd cache hit.
Specifics:
- AOTAutograd needs access to the `cudagraphs` boxedbool in order to properly set the backward to not use cudagraphs on a cache hit from the forward.
- We have lots of tests that test this already from the previous PR, so I just added an extra test and made the previous test work with both AOTAutogradCache and FXGraphCache at the same time.
```
TORCH_LOGS=torch._functorch._aot_autograd.autograd_cache,cudagraphs ENABLE_AOT_AUTOGRAD_CACHE=1 TORCHINDUCTOR_FX_GRAPH_CACHE=1 tlp python benchmarks/gpt_fast/benchmark.py --output ~/gpt_fast_benchmark.csv
```
I ran this twice: once on a cache miss and once on a cache hit.
Here is the perfetto trace for each(FB only link):
**Cache Miss:**
Logs:
```
Loading model Llama-2-7b-chat-hf
Time to load model: 0.66 seconds
I0813 10:53:34.416000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [0/0] AOTAutograd cache miss for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey
I0813 10:53:51.395000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [0/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey/entry
I0813 10:54:17.579000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [1/0] AOTAutograd cache miss for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt
I0813 10:54:38.636000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [1/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt/entry
I0813 10:54:39.228000 911030 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints
V0813 10:54:39.939000 911030 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0
V0813 10:55:10.615000 911030 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0
Compilation time: 101.24 seconds
Average tokens/sec: 147.96 tokens/sec
Average bandwidth achieved: 1955.22 GB/s
Memory used: 14.51 GB
```
Chromium Event(fb only):
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key

**Cache Hit:**
Logs:
```
Loading model Llama-2-7b-chat-hf
Time to load model: 0.67 seconds
I0813 10:55:51.821000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [0/0] AOTAutograd cache hit for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey
I0813 10:55:55.465000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [1/0] AOTAutograd cache hit for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt
I0813 10:55:56.030000 944420 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints
V0813 10:55:56.192000 944420 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0
V0813 10:55:56.426000 944420 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0
Compilation time: 9.40 seconds
Average tokens/sec: 147.94 tokens/sec
Average bandwidth achieved: 1954.98 GB/s
Memory used: 14.51 GB
```
Chromium Event(fb only):
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json&local_cache_key

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132294
Approved by: https://github.com/eellison
Fixes #124550
Also moves `graph.eliminate_dead_code()` call to a few lines after
`_inline_module(...)` in `const_fold.py`
* Test plan:
Add a new test on `test_eager_transforms.py` to ensure the reported
issue was indeed fixed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133364
Approved by: https://github.com/zou3519
Original issue:
https://github.com/pytorch/pytorch/issues/129486
Previously, subclass_wrapper() received inputs containing additional effect tokens and failed because these did not match the SubclassMeta indexes.
This happened because functionalization was responsible for adding/removing those tokens.
Functionalization cannot be run above subclasses, as args/outputs are duplicated in case of mutations.
The main design idea is for the EffectTokens, Subclasses, and Functionalization logic to know as little as possible about each other's transformations.
To that end, the effect-token manipulation is extracted into a separate wrapper that is processed above SubclassWrapper, while functionalization happens below SubclassWrapper as before.
With that, subclass wrap/unwrap works without any knowledge of the additional token arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131672
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
This PR refactors process_inputs so that it occurs earlier outside of create_aot_dispatcher_function for the purpose of calculating a cache key with the inputs after they have been processed.
This way, if tensors have symint sizes/strides, we successfully factor that into the cache key instead of specializing on every possible size and stride. A test that utilizes this is incoming.
# Guard behavior
Note that it's technically possible for tensors with symint arguments to introduce guards in aot_dispatch, if they trace through decompositions that branch on tensor size/stride. This can result in multiple graph modules with differing guards having the same key in the cache.
FXGraphCache has this same issue, and the remote FXGraphCache intentionally does not handle this: instead it only saves the first result in the cache, and cache misses if guards miss. The local FXGraphCache does handle this by storing multiple files and iterating through them, but we opt not to introduce that complexity just yet for AOTAutogradCache until we deem it necessary (i.e., models appear where saving multiple cache results with different guards but the same cache key becomes important). Instead, AOTAutogradCache will save a single entry per result, overriding it if it cache misses due to guards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130962
Approved by: https://github.com/bdhirsh
The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead.
I annotated the PR with explanation of changes.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #132674, #132675
move benchmarking out of `torch._inductor.runtime.runtime_utils` and into `torch._inductor.runtime.benchmarking`, and prefer this path over directly accessing Triton's benchmarking
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132827
Approved by: https://github.com/eellison
Summary:
Uses the `seq_nr` field (introduced to aot_autograd nodes in
https://github.com/pytorch/pytorch/pull/103129) to map the aot_autograd
fx bw nodes to the corresponding fw nodes, and copy the metadata over.
I am trusting the `seq_nr` mapping in the linked PR here. I did
some validation with a toy LLaMa 3 8b training run and the mapping seemed
correct.
I am also trusting that the forward is single threaded, since `seq_nr` is thread local. If this isn't always true, we'll need to also plumb `thread_id` through the same machinery which is populating `seq_nr`.
I'd like to use this data in a future PR to make inductor kernels easily
attributable to the nn.Module path in modeling land, to make it easier
to do performance debugging.
Test Plan:
```
// 1. unit test
python test/dynamo/test_aot_autograd.py -k test_aot_sequence_nr
// 2. manual test
// run LLaMa 3 8B fw + bw with torch.compile, print out the inductor graphs
// seen in `torch/_inductor/utils.py::get_kernel_metadata`, they seemed
// right to me.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126573
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead.
I annotated the PR with explanation of changes.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #132674, #132675
The regression from https://github.com/pytorch/pytorch/issues/132281 pinpoints e4ace1a396 as the cause. The main delta that commit introduces is that we now manually check `is_inference()` and call `increment_version()` (a pybind call) on every mutated input tensor to the graph.
This PR attempts to reduce overhead a bit by bundling up all of those checks into a single pybind call, by:
(1) updating `torch.autograd.graph.increment_version()` to accept a `Union[Tensor, List[Tensor]]`
(2) updating its semantics to no-op if you pass in a tensor with no version counter, instead of erroring
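A minimal usage sketch of the updated API as described above, bumping the version counters for a batch of mutated inputs in one call:
```
import torch

mutated_inputs = [torch.randn(2), torch.randn(3)]
# One call for the whole list; tensors without a version counter
# (e.g. created under inference_mode) are silently skipped instead of erroring.
torch.autograd.graph.increment_version(mutated_inputs)
```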
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132652
Approved by: https://github.com/albanD
Combines contributions from https://github.com/pytorch/pytorch/pull/130505
Some context can be found in this large comment block:
a5b64d39fd/test/dynamo/test_subclasses.py (L1667-L1681)
Changes in this PR
- For each tensor fakified, check the nested int registry in eager, and eagerly symbolicize if that tensor has already been associated with nested int in eager.
- Adds a separate counter stored on FakeTensorMode as a fake analog to _tensor_id_counter (which keeps track of unique tensors). This counter is initialized to the global eager tensor id counter upon creation of the FakeTensorMode, and needs to be reset when the same FakeTensorMode is reused to trace again (in this PR, we piggyback on the epoch incrementing logic).
- (refactor) Today, we store FakeTensor -> symbolic nested int in the global registry. With this PR, symbolic nested int is stored directly on the FakeTensor. (Eager still caches nested int in the registry, though we should avoid this at some point.)
Basically unchanged, but worth noting:
- `__tensor_unflatten__` is still responsible for determining whether we should cache for now. The logic is somewhat simplified.
- to_copy is still using the trick of updating two different tensors in the registry to point to the same nested int. This is kind of broken, but we try to leave it as is, and plan a better fix with the UnionFind stack.
Differential Revision: [D60406772](https://our.internmc.facebook.com/intern/diff/D60406772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130292
Approved by: https://github.com/bdhirsh
ghstack dependencies: #131916, #131803
Summary:
When using activation_memory_budget for float8 training, two issues were noticed:
- When `aggressive_options` (https://fburl.com/code/m1yoskxw) is called, all fp8 gemms (the scaled_mm op) are saved for recomputation.
- After adding "scaled_mm" to `compute_intensive_ops`, we got the following error from `estimate_runtime`: `mat2 must be col_major` (from `meta_scaled_mm`).
To fix it, modified `materialize_arg` to also include the stride of the original tensor.
Test Plan: Run float8 training with `activation_memory_budget`.
Differential Revision: D60777297
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132687
Approved by: https://github.com/Chillee
Internally there's a model that's using memory_budget with the partitioner, and using custom triton kernels. The partitioner fails when encountering the triton ops because they don't have `meta["val"]`. This PR adds `meta["val"]` to these fx graph nodes and then adds handling for `meta["val"]` being a dict in the partitioner.
Differential Revision: [D60627813](https://our.internmc.facebook.com/intern/diff/D60627813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132466
Approved by: https://github.com/zou3519
ghstack dependencies: #132356
Original issue: https://github.com/pytorch/pytorch/issues/114338
Reland of: https://github.com/pytorch/pytorch/pull/128016
Summary from previous PR:
We assume only two possible, mutually exclusive scenarios:
1. Running the compiled region for training (any of the inputs has requires_grad): produced differentiable outputs should have requires_grad.
2. Running the compiled region for inference (none of the inputs has requires_grad): no outputs have requires_grad.
Even if the user runs the region under no_grad() but has an input Tensor with requires_grad, we take training scenario (1).
With the current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only whether any of the inputs requires_grad
2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under torch.enable_grad()
Changes in the partitioner?
Inference and training graphs differed in their return container (list vs. tuple).
The partitioner changes unify this so it always returns a tuple.
As a result, there are some changes in test_aotdispatch.py for graph contents (list -> tuple).
Why was it reverted?
There was a regression of the hf_Reformer model on inference:
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```
This was because one of the compiled graphs contained outputs that are aliases of inputs which are nn.Parameter(requires_grad=True).
Even though the torchbench inference benchmarks run inside `torch.no_grad()`, alias ops (specifically, for hf_Reformer, expand) preserve requires_grad.
As a result we started compiling a training graph instead of an inference graph.
Fix for view ops:
If we have outputs that are aliases of inputs that require grad, those outputs requiring grad is not by itself a reason to generate a training graph.
This is handled in aot_autograd.py, where output_and_mutation_safe is calculated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
Implements the donated buffer feature and adds unit tests. A donated buffer is a saved tensor that is not aliased with forward inputs, fw_outputs (except saved tensors), or bw_outputs. We detect donated buffers during `aot_dispatch_autograd` and store them in `ViewAndMutationMetadata`, so that they can be accessed in inductor.
Fixes #129496
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130580
Approved by: https://github.com/bdhirsh
Brian debugged the difference in output type between the inference and training graphs.
The partitioner sometimes returns a list output type.
After this PR it will always return a tuple.
Potentially, some new graphs may appear in tests that land between this PR's CI jobs finishing and this PR landing.
This can easily be fixed with a fast-forward fix:
```
EXPECTTEST_ACCEPT=1 python test/test.py
```
Adding ciflows/periodic to minimize this probability
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131759
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
https://github.com/pytorch/pytorch/issues/127561
Mutations of inputs in backward are emitted manually, after joint_fn tracing.
With the default partitioner logic they would be moved to the "forward" graph, as they are operations on forward inputs.
To keep those mutations in backward:
- Introduce a "subgraph" node key that can be specified with a context manager. When we do a manual `copy_` on a forward input in backward, we know this is for backward, so we set subgraph="backward".
In the partitioner:
Introduce an optional argument subgraph, used to filter nodes by their specified subgraph (node_subgraph): a node is not added to a subgraph if its node_subgraph is different.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129130
Approved by: https://github.com/Chillee
Fixes #130284, fixes #130653
- Add `torch.library.register_vmap` to custom ops (see the sketch after this list)
- Add `register_vmap` for operators in ops in custom_op_db.
- Make `torch.autograd.Function` support kwarg-only kwargs for vmap
- test operators in op_db with `tests/test_vmap`.
- change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing.
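A hedged example of registering a vmap rule for a custom op via the new API; the op `mylib::scale` and its rule are made up for illustration, following the (info, in_dims, *args) -> (output, out_dim) pattern:
```
import torch
from torch import Tensor

@torch.library.custom_op("mylib::scale", mutates_args=())
def scale(x: Tensor, factor: float) -> Tensor:
    return x * factor

@scale.register_fake
def _(x, factor):
    return torch.empty_like(x)

def scale_vmap(info, in_dims, x, factor):
    x_bdim, _ = in_dims
    x = x.movedim(x_bdim, 0) if x_bdim is not None else x.unsqueeze(0)
    return scale(x, factor), 0  # output is batched along dim 0

torch.library.register_vmap("mylib::scale", scale_vmap)

x = torch.randn(4, 3)
out = torch.vmap(scale, in_dims=(0, None))(x, 2.0)
```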
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589
Approved by: https://github.com/zou3519
Fixes #130284, fixes #130653
- Add `torch.library.register_vmap` to custom ops
- Add `register_vmap` for operators in ops in custom_op_db.
- Make `torch.autograd.Function` support kwarg-only kwargs for vmap
- test operators in op_db with `tests/test_vmap`.
- change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589
Approved by: https://github.com/zou3519
There's no reason to ban them for vmap or jvp, because without the
{grad, vjp} transforms those just act above PyTorch autograd, which will
end up saving regular Tensors.
Test Plan:
- some tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131191
Approved by: https://github.com/drisspg
Reland of: https://github.com/pytorch/pytorch/pull/128016
Summary from previous PR:
We assume only two possible, mutually exclusive scenarios:
1. Running the compiled region for training (any of the inputs has requires_grad): produced differentiable outputs should have requires_grad.
2. Running the compiled region for inference (none of the inputs has requires_grad): no outputs have requires_grad.
Even if the user runs the region under no_grad() but has an input Tensor with requires_grad, we take training scenario (1).
With the current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only whether any of the inputs requires_grad
2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under torch.enable_grad()
Changes in the partitioner?
Inference and training graphs differed in their return container (list vs. tuple).
The partitioner changes unify this so it always returns a tuple.
As a result, there are some changes in test_aotdispatch.py for graph contents (list -> tuple).
Why was it reverted?
There was a regression of the hf_Reformer model on inference:
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```
This was because one of the compiled graphs contained outputs that are aliases of inputs which are nn.Parameter(requires_grad=True).
Even though the torchbench inference benchmarks run inside `torch.no_grad()`, alias ops (specifically, for hf_Reformer, expand) preserve requires_grad.
As a result we started compiling a training graph instead of an inference graph.
Fix for view ops:
If we have outputs that are aliases of inputs that require grad, those outputs requiring grad is not by itself a reason to generate a training graph.
This is handled in aot_autograd.py, where output_and_mutation_safe is calculated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
Sometimes it can be difficult to write a fake class, e.g. when the original implementation uses third-party libraries, or users are certain that the class is safe to trace with the real object.
This PR allows user to specify their intention by implementing a "safe_to_trace_with_real_obj" method on their script class.
Test Plan:
`pytest test/export/test_torchbind.py -k safe`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129586
Approved by: https://github.com/zou3519
This PR makes it so that all tensors are reduced to their metadata in AOTAutogradCache. Because dynamo always embeds constant tensors into the FXgraph directly, there's no risk of a constant tensor whose values are semantically important being lost here. AOTAutograd itself may take a constant tensor and set it as an attribute on an FXGraph for inductor, but Dynamo never does this.
One other thing that this diff does is add `[pickler.fast](https://docs.python.org/3/library/pickle.html#pickle.Pickler.fast)` to our pickling algorithm for cache key generation. Pickle will often memoize/intern strings when pickling, leading to false cache misses due to inconsistent memoization. Turning on pickler.fast removes this behavior.
Technically `fast` is a "deprecated" feature according to python docs. But it's still supported in py3.8-3.12, and if it ever is removed, the only downside will just be a few more cache misses, so I think it's worth just adding here (and removing later as needed)
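A small stdlib-only sketch of what enabling fast mode looks like on a plain Pickler (not the cache code itself):
```
import io
import pickle

buf = io.BytesIO()
pickler = pickle.Pickler(buf)
pickler.fast = True  # deprecated but supported: disables the memo
pickler.dump(["cuda", "cuda"])
# With the memo disabled, both occurrences of "cuda" are written out in full,
# so the resulting bytes no longer depend on whether the two strings happened
# to be the same object.
```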
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128583
Approved by: https://github.com/oulgen
ghstack dependencies: #128335
Reland of: https://github.com/pytorch/pytorch/pull/128016
Summary from previous PR:
We assume only two possible, mutually exclusive scenarios:
1. Running the compiled region for training (any of the inputs has requires_grad): produced differentiable outputs should have requires_grad.
2. Running the compiled region for inference (none of the inputs has requires_grad): no outputs have requires_grad.
Even if the user runs the region under no_grad() but has an input Tensor with requires_grad, we take training scenario (1).
With the current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only whether any of the inputs requires_grad
2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under torch.enable_grad()
Changes in the partitioner?
Inference and training graphs differed in their return container (list vs. tuple).
The partitioner changes unify this so it always returns a tuple.
As a result, there are some changes in test_aotdispatch.py for graph contents (list -> tuple).
Why was it reverted?
There was a regression of the hf_Reformer model on inference:
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```
This was because one of the compiled graphs contained outputs that are aliases of inputs which are nn.Parameter(requires_grad=True).
Even though the torchbench inference benchmarks run inside `torch.no_grad()`, alias ops (specifically, for hf_Reformer, expand) preserve requires_grad.
As a result we started compiling a training graph instead of an inference graph.
Fix for view ops:
If we have outputs that are aliases of inputs that require grad, those outputs requiring grad is not by itself a reason to generate a training graph.
This is handled in aot_autograd.py, where output_and_mutation_safe is calculated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
Fixes the example in #118304 for `torch._functorch.aot_autograd.aot_export_module` and `torch.export.export`.
At a high level, the issue is caused by not detecting the fake_mode when there are no inputs.
Change plan:
1) we add a `dynamic_shapes: Union[bool, None] = None` arg to `aot_export_module` and `_aot_export_function`.
2) if the input is not a graph module, then we can only rely on this `dynamic_shapes` input arg.
3) If the input is a graph module, then we can traverse the graph and check.
4) So we check if the input mod is a graph module or just a module, and do 2) or 3) depending on the type.
Fixes #129927
Bug source: dynamo's fake_mode is not detected correctly in `_convert_input_to_fake` in `_traced.py` when there's no input to the graph. So in `_strict_export_lower_to_aten_ir`, we create another fake_mode; this `dynamo_fake_mode` is not the same as the fake_mode used by dynamo.
Change plan:
check `gm_torch_level` graph's node meta "example_value" for fake mode in addition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129928
Approved by: https://github.com/angelayi
This PR makes it so that we don't try to serialize FunctionalTensorWrappers. FunctionalTensorWrappers don't pickle well because they have no underlying storage. This should be fixable at a later point, but I might not be the right author for implementing the serialization for it. If there's a way to avoid actually saving the FunctionalTensorWrappers themselves and just saving the ViewMetadata so we can replay it, that would also work.
To do this, we disable view_replay_input_mutations when using AOTAutogradCache, and then only keep the functional tensor in the ViewAndMutationMeta if we need it for view_replay_input_mutations (i.e. the cache is off).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128335
Approved by: https://github.com/bdhirsh
Summary:
Title. This way, both FXGraphCache and AOTAutogradCache use the same torch_key, and we don't need to only hash specific files.
There's an argument to be made to only hash *.py and *.cpp files. Maybe we can fix the glob to do that.
We use a buck_filegroup because otherwise $SRCs gets too large. By using `$(location :torch_sources)`, we make the genrule implicitly depend on all files globbed by torch_sources.
Test Plan:
Unit tests still pass on OSS
For torch_key:
```
buck2 build caffe2:src_hash.txt -v 2 --show-output
```
See the output, then make any change to any torch file. See that the hash changes.
Reviewed By: oulgen
Differential Revision: D58875785
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129250
Approved by: https://github.com/oulgen
This moves a bunch of runtime inspection of the `output_info` for alias handling into the construction of fixed output handlers that are created during compilation and captured by the runtime wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128188
Approved by: https://github.com/bdhirsh
Ops in torch, torch.functional, and torch.nn.functional are cache safe by default (at least, based on my cursory audit of the ops). This fixes a few tests that use these ops with the cache.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128329
Approved by: https://github.com/bdhirsh
I think there is a typo in the first example of the `torch.func.stack_module_state` documentation. The first parameter in the function call in the `wrapper` return is missing an 's'.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129126
Approved by: https://github.com/zou3519
This PR does two things:
1. It duplicates the fake script object, because aot_export traces the program twice and the result of the first trace would otherwise make the second trace's result wrong.
2. It also adds a new test for methods that return constant outputs. Before this PR, there is no meta["val"] for these nodes because fx won't track these constants. We still need to preserve these constant-returning operators in the graph, because torchbind objects are stateful and deleting them would remove the implicit state mutation inside the object.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128844
Approved by: https://github.com/angelayi
This is a copy of Brian's PR https://github.com/pytorch/pytorch/pull/128754, with some changes in the test_distributed_patterns.py unit tests to more closely reflect FSDP2 patterns. Also disabled two tests `test_input_mutation_storage_resize_up_down` and `test_input_mutation_storage_resize_not_supported` in test_aotdispatch.py until we figure out the right behavior for them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129203
Approved by: https://github.com/bdhirsh
This diff introduces AOTAutogradTestWithCache, which runs AOTAutogradTests with both dynamo and AOTAutogradCache.
To do this, for any verify_aot_autograd() calls in the original tests, we run compiled_f an extra time. We also turn on a new strict mode that throws any time the cache is missed for unexpected reasons, like BypassAOTAutogradCache or FxGraphCacheMiss.
We use a mocked version of FXGraphCache to decrease the number of variables for these tests. The normal tests in test_aot_autograd_cache.py will still run with FXGraphCache. I might change my mind and unmock these in the future.
In total, 87 of the tests pass naturally. None of the tests fail in non-strict cache mode, so the cache never crashes; it just misses more often than we'd like. The remaining 27 tests fail due to relatively simple (though not necessarily easy to fix) reasons. I'll fix the remaining test failures in the next few PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128222
Approved by: https://github.com/bdhirsh
The current implementation of compiled_autograd_enabled_count affects the entire region under the context manager, so if the context manager wraps torch.compile calls unrelated to the backward, they are affected too:
- no lazy compile for the compiled forward
- no AOT autograd cache for inference graphs
We instead maintain a flag while executing the compiled backward callable, isolating the special handling to the compiled backward graph (sketched below).
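A minimal sketch of the flag-based approach, with made-up names (`compiled_backward_region`, `in_compiled_backward`); the real implementation lives in compiled autograd:
```python
import contextlib

_in_compiled_backward = False  # flipped only around the compiled backward call

@contextlib.contextmanager
def compiled_backward_region():
    global _in_compiled_backward
    prev = _in_compiled_backward
    _in_compiled_backward = True
    try:
        yield
    finally:
        _in_compiled_backward = prev

def in_compiled_backward() -> bool:
    # Checked instead of a context-manager-wide counter, so torch.compile
    # calls that merely sit under the context manager are unaffected.
    return _in_compiled_backward
```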
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128982
Approved by: https://github.com/jansel
ghstack dependencies: #127960, #128905
FIXES #113263. Same idea as in https://github.com/pytorch/pytorch/pull/113417, but we need a more intrusive C API to silently nop default saved tensor hooks, in order to support user code that uses torch.autograd.disable_saved_tensors_hooks (see test_unpack_hooks_can_be_disabled). We mock the output of get_hooks while leaving push/pop untouched.
For compiled autograd, we're firing pack hooks once and unpack hooks twice right now; I'll look into this separately from this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123196
Approved by: https://github.com/soulitzer
This PR changes the traced_tangents field of ViewAndMutationMeta to be cache safe. Specifically, at runtime, the only time we need the fw_metadata's traced_tangents field is for Tensor subclass metadata from __tensor_flatten__. So instead of storing an entire FakeTensor, which has many fields that can be unserializable, only store the result of __tensor_flatten__() on any FakeTensors representing subclasses.
That said, there's no guarantee that `__tensor_flatten__` is actually serializable: if we fail to pickle the result of __tensor_flatten__ we won't save to the cache.
To do this, we also make a small change to `__coerce_same_metadata_as_tangent__`, so that it takes in the return value of tensor_flatten() instead of an entire FakeTensor. Let me know if we should change the name of the function.
By doing this, we can now run the dynamic shapes cache test with autograd turned on.
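A hedged sketch of what gets stored, with an illustrative helper name (`cache_safe_tangent_info`), not the exact code in the PR:
```python
def cache_safe_tangent_info(t):
    # Subclass tangent: keep only what __tensor_flatten__ reports
    # (inner attribute names plus a small context object), not the FakeTensor.
    if hasattr(t, "__tensor_flatten__"):
        attrs, ctx = t.__tensor_flatten__()
        return (type(t), attrs, ctx)
    # Plain tensors need nothing from traced_tangents at runtime.
    return None
```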
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127618
Approved by: https://github.com/bdhirsh
https://github.com/pytorch/pytorch/issues/127572
Allow mutations in backward on forward inputs, if:
1/ the mutation does not change metadata
Enforced at compilation time.
2/ with create_graph=True: the mutated input does not require grad
Enforced at runtime, where create_graph mode can be detected by checking torch.is_grad_enabled().
Adds input_joint_info to track mutations of inputs during the joint. It is a separate field in ViewAndMutationMeta because it is only filled after tracing the joint fn (see the sketch below).
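A minimal sketch of the two checks, under assumed function names; the real checks live in AOTAutograd's joint tracing and runtime wrappers:
```python
import torch

def check_bw_input_mutation_compile_time(mutates_metadata: bool) -> None:
    # Rule 1/: metadata mutations of forward inputs in the backward are
    # rejected while compiling the joint graph.
    if mutates_metadata:
        raise RuntimeError(
            "mutating the metadata of a forward input in backward is not supported"
        )

def check_bw_input_mutation_runtime(inp: torch.Tensor) -> None:
    # Rule 2/: create_graph=True shows up as grad mode being enabled inside
    # the backward; in that case the mutated input must not require grad.
    if torch.is_grad_enabled() and inp.requires_grad:
        raise RuntimeError(
            "cannot mutate a forward input that requires grad when create_graph=True"
        )
```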
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128409
Approved by: https://github.com/bdhirsh
This PR makes it so we lazily save to the cache on backward call instead of saving ahead of time always. We have to pass a closure to post_compile to prevent cyclic dependencies.
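A rough sketch of the closure-based lazy save (all names here are illustrative):
```python
def wrap_backward_for_lazy_save(compiled_bw, save_entry_to_cache):
    saved = False

    def backward(*args):
        nonlocal saved
        out = compiled_bw(*args)
        if not saved:
            # The save callback is handed in as a closure so this module never
            # imports the cache module directly, avoiding the cycle.
            save_entry_to_cache()
            saved = True
        return out

    return backward
```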
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126999
Approved by: https://github.com/bdhirsh
ghstack dependencies: #126791
This PR implements "V0" of AOTAutogradCache. Given an input to AOTAutograd, we calculate a cache key, then save an AOTAutogradCacheEntry.
Each AOTAutogradCacheEntry has:
- A CompiledForward and optionally a CompiledBackward
- A bunch of metadata.
CompiledForward and CompiledBackward each save the *key* to the FXGraphCache associated with the compiled object. FXGraphCache populates this key field as long as it's able to return a compiled graph given a set of inputs. We then load the same object from the FXGraphCache on an AOTAutogradCache hit.
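The rough shape of the entry described above, with approximated field names (not the exact ones in torch/_functorch):
```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class CompiledForward:
    fx_graph_cache_key: str  # key into FXGraphCache, not the compiled artifact

@dataclass
class CompiledBackward:
    fx_graph_cache_key: str

@dataclass
class AOTAutogradCacheEntry:
    compiled_fw: CompiledForward
    compiled_bw: Optional[CompiledBackward]
    runtime_metadata: Any  # ViewAndMutationMeta plus wrapper state
```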
On cache miss:
- Run AOTAutograd, up to AOTAutogradDispatch.post_compile.
- Save an AOTAutogradCacheEntry to the cache after compiling the necessary portions and receiving a cache key from FXGraphCache. In this PR we *always* compile the backward ahead of time. The PR above this one implements backward lazy caching, so that we only save to the cache after compiling the backward in a lazy backward scenario.
- Return the resulting object
On cache hit:
- Run AOTAutogradCacheEntry.post_compile() on the cache key.
- This attempts to load the forward and backward graphs from FXGraphCache
- As long as we successfully load from FXGraphCache, it's a hit. We then rewrap the callable with post compile wrappers using our saved metadata.
For now, we ignore the fakified out and debug wrappers. We only save to the cache if Fakified out is turned off.
V0 Guards behavior:
FXGraphCache serializes guards that are needed in the shape_env based on the symint inputs to the graph. The invariant that AOTAutograd uses here is that the sources for symints given to it by dynamo are exactly the same as the ones it passes to inductor, for both the forward and backward passes. (This does *not* mean that the tensor values passed in are the same: only that their symints are). That is, AOTAutograd and Inductor never create new guards based on symints with *different sources* than those passed to them by dynamo.
We don't currently store any AOTAutograd specific guards: my hypothesis is that FXGraphCache already stores these, as any guards generated by AOTAutograd should already be in the shape_env before calling into inductor, and we don't generate new guards post inductor. If this is needed, I'll add it in another diff.
Testing:
We'll start with some basic unit tests, but I'll be adding more and more complicated testing as the next step.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126791
Approved by: https://github.com/bdhirsh
Original issue: https://github.com/pytorch/pytorch/issues/114338
We assume only two possible mutually exclusive scenarios:
1. Running compiled region for training (Any of inputs has requires_grad)
- Produced differentiable outputs should have requires_grad.
2. Running compiled region for inference (None of inputs has requires_grad)
- All outputs do not have requires_grad.
Even if the user runs the region under no_grad() but has an input Tensor with requires_grad, we take the training scenario (1).
With the current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only whether any of the inputs requires_grad
2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under torch.enable_grad() (see the sketch below)
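A minimal sketch of the two scenarios; `needs_autograd` and `run_compiled_region` are illustrative names, not the exact AOTAutograd code:
```python
import torch

def needs_autograd(flat_args) -> bool:
    # Deliberately ignores torch.is_grad_enabled(): any requires_grad input
    # puts us in training scenario (1).
    return any(isinstance(a, torch.Tensor) and a.requires_grad for a in flat_args)

def run_compiled_region(compiled_fn, flat_args):
    if needs_autograd(flat_args):
        # Training: run under enable_grad so differentiable outputs get
        # requires_grad, even if the caller wrapped the call in no_grad().
        with torch.enable_grad():
            return compiled_fn(flat_args)
    # Inference: none of the outputs require grad.
    return compiled_fn(flat_args)
```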
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128016
Approved by: https://github.com/bdhirsh
If any inputs are mutated that require grad, even if all the outputs don't require grad, we should still run autograd with a backwards graph. This fixes two tests: test_input_mutation_alias_everything and test_view_detach.
Fixes #128035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128229
Approved by: https://github.com/aorenste
Automated fixes to put imports that are only used in type hints into TYPE_CHECKING imports. This also enables the RUFF TCH rules which will automatically apply autofixes to move imports in and out of TYPE_CHECKING blocks as needed in the future, this will make the initial PyTorch import faster and will reduce cyclic dependencies.
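An example of the pattern the RUFF TCH rules enforce: type-only imports move under TYPE_CHECKING so they are skipped at runtime (the function here is just a toy to show the annotation usage).
```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported only for annotations; never executed at runtime.
    from torch.fx import GraphModule

def node_count(gm: GraphModule) -> int:
    return len(list(gm.graph.nodes))
```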
Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127688
Approved by: https://github.com/XuehaiPan, https://github.com/ezyang, https://github.com/malfet
Collect the indices of the static parameters to pass down to cudagraphs in order to re-record if necessary.
This location was chosen in order to allow us to restrict this (if needed) in the future by setting metadata in dynamo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126820
Approved by: https://github.com/bdhirsh
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.
Note that only warnings that their messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.
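An example of both patterns described above (the function names are just placeholders):
```python
import warnings
from typing_extensions import deprecated

@deprecated("old_api is deprecated, use new_api instead", category=FutureWarning)
def old_api():
    ...

def legacy_path():
    # Where the decorator doesn't apply, the category is passed explicitly.
    warnings.warn("legacy_path is deprecated, use new_path instead", FutureWarning)
```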
Resolves #126888
- #126888
This PR is split from PR #126898.
- #126898
------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689
Approved by: https://github.com/Skylion007
Summary: We observed differences in these fields and inductor does not specialize on them so it is safe to remove them from the key.
Test Plan: CI
Reviewed By: masnesral
Differential Revision: D57871276
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127319
Approved by: https://github.com/masnesral
This adds dumps of MetaTensorDesc and MetaStorageDesc to structured logs
when they are triggered from Dynamo. The logs look like this:
```
V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:195] {"describe_storage": {"id": 0, "describer_id": 0, "size": 32}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:220] {"describe_tensor": {"id": 0, "ndim": 1, "dtype": "torch.float32", "device": "device(type='cpu')", "size": [8], "is_leaf": true, "stride": [1], "storage": 0, "view_func": "<built-in method _view_func_unsafe of Tensor object at 0x7f882959e840>", "describer_id": 0}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
V0522 08:13:25.268000 140224882566144 torch/_subclasses/meta_utils.py:1594] {"describe_source": {"describer_id": 0, "id": 0, "source": "L['x']"}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
```
The `describer_id` is used to disambiguate ids. We expect it to be
unique per frame id, but if there is a bug it possibly is not. Note you will get
redundant dumps when evaluation restarts.
tlparse can use this to give a visualization of the input tensors to a
model; you could also use it to generate example inputs to run graphs
on.
Some care is taken to avoid redumping the tensor metadata multiple
times, which would happen ordinarily because AOTAutograd refakifies
everything after Dynamo, to deal with metadata mutation.
Partially fixes https://github.com/pytorch/pytorch/issues/126644
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126879
Approved by: https://github.com/jamesjwu
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.
Note that only warnings that their messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.
UPDATE: Use `FutureWarning` instead of `DeprecationWarning`.
Resolves #126888
- #126888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898
Approved by: https://github.com/albanD
As discussed, this cleans up the code so that create_aot_dispatcher literally chooses an aot_dispatch function and runs it. Moves wrapper logic to jit_compile_runtime_wrappers, and adds aot_dispatch_export to handle export cases in one place.
This also makes aot_dispatch_* always return the same type: a Callable and the forward metadata, instead of returning a different number of arguments in export cases. Callers that don't care about fw_metadata can just ignore it. Added return type hints to enforce the same exact interface among all the aot_dispatch_* functions.
It'd be nice to move the checks from the synthetic base and dedup wrappers that have to do with export outside of those wrappers, but it's probably fine for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126402
Approved by: https://github.com/oulgen, https://github.com/bdhirsh
ghstack dependencies: #126193
This PR moves the post compile portion of aot_dispatch_autograd into runtime_wrappers.py. Completing this allows us to run the post compile section on its own when warm starting.
I considered leaving this in jit_compile_runtime_wrappers, but we're going to run into circular dependency issues later if we don't move it over.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126193
Approved by: https://github.com/bdhirsh
ghstack dependencies: #126907
This adds a bunch of global configurations to the cache key. There's definitely more I haven't added, but this is just an audit of all of the `torch.*` globals that are used in jit_compile_runtime_wrappers, runtime_wrappers, etc.
It also makes the hash details object subclass FXGraphHashDetails, which implements other hashed data like configs inductor depends on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126907
Approved by: https://github.com/aorenste
This field never changes so pre_compile doesn't need to return it again: remove it just for a cleaner refactor.
As @aorenste points out, the fw_metadata passed to post_compile is actually the fw_metadata after all wrappers' pre_compiles have run. I want to make this clear in the code, so I renamed the arg in post_compile.
Wrappers that need the exact metadata that they were passed in pre_compile need to save that fw_metadata properly themselves.
Currently, wrappers come in two categories:
1. Wrappers that modify fw_metadata, but then never use fw_metadata in post compile
2. Wrappers that never modify fw_metadata, and only consume the "final" fw_metadata.
So none of the behaviors will change for the existing wrappers. That said, it might be useful to define a "SimpleCompilerWrapper" subclass which guarantees it does not modify fw_metadata. I'll do that in a separate PR.
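For context, the wrapper interface being discussed looks roughly like the sketch below; the exact signatures in runtime_wrappers.py may differ, so treat this as an approximation:
```python
class CompilerWrapper:
    def pre_compile(self, flat_fn, flat_args, aot_config, *, fw_metadata):
        # May return updated fw_metadata; a wrapper that needs the metadata
        # it saw here during post_compile must stash it on itself.
        return flat_fn, flat_args, fw_metadata

    def post_compile(self, compiled_fn, aot_config, *, runtime_metadata):
        # runtime_metadata is the final fw_metadata, i.e. after every
        # wrapper's pre_compile has already run.
        return compiled_fn
```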
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125854
Approved by: https://github.com/aorenste, https://github.com/bdhirsh
Fix: #125387
This PR keeps track of whether an instantiated `ViewMeta` has symbolic values as
input or not. This is used for checking whether to use the AOTAutograd `ViewMeta`-replay
execution path, which doesn't support tensors that have `ViewMeta` with symbolic inputs.
In summary, the changes are:
- Add the field `ViewMeta::has_symbolic_inputs` and make it a required constructor
parameter
- Add the field `FunctionalTensorWrapper::is_symbolic_` and the method
`FunctionalTensorWrapper::maybe_mark_symbolic`
- Marks a `FunctionalTensorWrapper` as symbolic iff any of its `ViewMeta` have
symbolic inputs
- Add the plumbing of `FunctionalTensorWrapper::is_symbolic` to the Python API
- Codegen the computation of `ViewMeta::has_symbolic_inputs` for each view operation
- Use the AOTAutograd `ViewMeta`-replay path if:
- `target_functional_tensor` is not `None`; and
- `target_functional_tensor` is not symbolic (instead of using a functorch config)
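On the Python side, the gating then reduces to roughly the following sketch; `is_symbolic_fn` is a stand-in for the new `FunctionalTensorWrapper::is_symbolic` plumbing, and the real check happens inside AOTAutograd:
```python
def should_use_view_replay(target_functional_tensor, is_symbolic_fn) -> bool:
    # Replay ViewMeta iff a target functional tensor exists and it was not
    # marked symbolic (replacing the previous functorch config check).
    return target_functional_tensor is not None and not is_symbolic_fn(
        target_functional_tensor
    )
```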
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125876
Approved by: https://github.com/ezyang