Many users want a config to force all CUDA ops to be captured by cudagraph; when that is not possible, PT2 should raise an error.
This PR adds `torch._inductor.triton.cudagraph_or_error` for that purpose (default: False). It also adds an environment variable `TORCHINDUCTOR_CUDAGRAPH_OR_ERROR` to control it.
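A minimal sketch of opting in (hedged: the attribute path `torch._inductor.config.triton.cudagraph_or_error` is assumed from the description above, and the env var is typically read when the inductor config module is first imported):

```python
import os

# Option 1: environment variable, set before the inductor config is imported.
os.environ["TORCHINDUCTOR_CUDAGRAPH_OR_ERROR"] = "1"

import torch
import torch._inductor.config as inductor_config

# Option 2: set the flag programmatically (assumed attribute path, see above).
inductor_config.triton.cudagraph_or_error = True

def f(x):
    return x.sin() + x.cos()

compiled = torch.compile(f, mode="reduce-overhead")
out = compiled(torch.randn(8, device="cuda"))  # errors instead of silently skipping cudagraph if capture is impossible
```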
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161862
Approved by: https://github.com/ezyang, https://github.com/mlazos
Previously, cudagraph was skipped if the graph contained any meta tensor. However, we should not skip in this case, since meta tensors involve no actual computation. This PR fixes the issue.
### Example
```python
import torch

def foobar(x, y):
    return x * 2, y * 3

foo_c = torch.compile(mode="reduce-overhead")(foobar)
t = torch.empty((1, 16, 128, 128), device="meta")
y = torch.rand([64], device="cuda")
eager_out = foobar(t, y)
for _ in range(3):
    compiled_out = foo_c(t, y)
```
Prior to this PR, the above code leads to
```
skipping cudagraphs due to multiple devices: device(type='cuda', index=0), device(type='meta')
```
With this PR, cudagraph is no longer skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150478
Approved by: https://github.com/eellison
This PR implements cudagraph partition, following the previous PR on inductor graph partition (#147038). Since there are many ops that cudagraph cannot support, this PR focuses on `cpu ops`; more partition rules will be added in the next PR.
## Example
```python
import torch

torch._inductor.config.graph_partition = True

def f(x, y):
    x1 = x + 1
    y1 = y + 1
    y_cpu = y1.cpu() + 1
    z = x @ y
    return x1 + y1 + z + y_cpu.cuda()

x, y = [torch.ones(2, 2, device="cuda") for _ in range(2)]
x_cloned, y_cloned = [tmp.clone() for tmp in [x, y]]
eager_out = f(x, y)

f_compiled = torch.compile(f, mode="reduce-overhead")
for _ in range(5):
    compiled_out = f_compiled(x_cloned, y_cloned)
    assert torch.allclose(eager_out, compiled_out)
```
w/o graph partition, we will skip cudagraph:
```
skipping cudagraphs due to skipping cudagraphs due to cpu device (device_put). Found from :
File "/home/boyuan/playground/cudagraph/graph_partition/graph_partition.py", line 9, in f
y_cpu = y1.cpu() + 1 # 3
```
w/ graph partition, we can see two `cudagraphify` wrappers under the same torch-compiled region:

## Design
PR #147038 splits the `def call(args)` function into multiple `def partition_id(args)` functions. In this PR, we use `recursively_apply_fns()` to wrap each `partition_id()` function with `cudagraphify`. One major design point is that `cudagraphify` takes metadata such as `static_input_idxs`, and we need to provide that metadata for each graph partition. However, we previously only had such metadata for the original graph, not for the graph partitions.
The [idea](https://github.com/pytorch/pytorch/pull/147038#discussion_r1964124800) is:
- compute a mapping from the partition metadata (e.g., input/output idx) to the graph metadata, stored in `GraphPartitionMap`.
- during post_compile, get the `CudagraphMetadata` for each partition based on the graph-level metadata and `GraphPartitionMap`, via `get_partition_cudagraph_metadata()`.
- finally, in `cudagraph_partition_pos_compile`, we compute the `CudagraphMetadata` and apply cudagraphify to each graph partition via `recursively_apply_fns` (see the sketch below).
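A rough sketch of the index remapping idea (hedged: not the actual inductor code; the `GraphPartitionMap` field shown here is hypothetical and only illustrates translating graph-level metadata such as static input indices into partition-level indices):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GraphPartitionMap:
    # For each partition input, the index of the corresponding graph-level input,
    # or None if the partition input is an intermediate produced by an earlier partition.
    input_index_mapping: list[Optional[int]]

def partition_static_input_idxs(
    graph_static_input_idxs: set[int], partition_map: GraphPartitionMap
) -> list[int]:
    """Translate graph-level static input indices into partition-level indices."""
    partition_idxs = []
    for partition_idx, graph_idx in enumerate(partition_map.input_index_mapping):
        if graph_idx is not None and graph_idx in graph_static_input_idxs:
            partition_idxs.append(partition_idx)
    return partition_idxs

# Example: this partition takes graph inputs 2 and 0 (in that order) plus one intermediate.
pmap = GraphPartitionMap(input_index_mapping=[2, None, 0])
print(partition_static_input_idxs({0, 3}, pmap))  # -> [2]
```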
#### Q: How does it work with codecache?
While we have multiple graph partitions, we still have 1 file and 1 `call` function per dynamo graph. The major difference is that we additionally need to load a `recursively_apply_fns()` for graph partition. We also add `partition_maps: Optional[list[GraphPartitionMap]]` to `CompiledFxGraph` so it is serialized and can be deserialized later.
## Edge Case 1
PyTorch makes assumptions about input/output order. For example, backward inputs take saved tensors first and then tangents. In graph partition, we respect this order via `graph_partition_signature_reorder`.
## Edge Case 2
Cudagraphifying the `call` function gives 2 cudagraph-managed tensors, `buf0` and `primals_1`. However, cudagraphifying `partition_0` gives only 1 cudagraph-managed tensor, `buf0`. This leads to a semantic difference between cudagraph with and without graph partition. [full code comparison](https://www.internalfb.com/intern/diffing/?paste_number=1747654420)

To achieve the same semantics, we return an input tensor as an output if it is not freed within a graph partition. This allows more cudagraph-managed tensors and is important for handling saved tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147648
Approved by: https://github.com/eellison
This PR is mostly a refactor that moves code into utils files so it can be shared between codecache.py and compile_fx.py. It then changes compile_fx so that:
- When saving to FXGraphCache, we save onto the CompiledFXGraph all the metadata necessary for running post compile steps (realigning inputs, cudagraphification).
- When loading from FXGraphCache, we use the saved information directly instead of recalculating it from scratch.
What this does is make it so that `FXGraphCache.load()` is a perfect cache on compile_fx_inner, in that it **returns exactly what compile_fx_inner returns**. This also makes it possible for AOTAutogradCache, given a key to the fx graph cache and example inputs, to get back the full return value of compile_fx_inner.
## What's a post compile step?
We define a **post-compile** to be the set of actions that need to run after FXGraphCache either loads from the cache or misses and runs compilation. These steps include:
- Setting the tracing context's output strides
- Running cudagraphs if enabled
- Maybe realign inputs if cudagraphs didn't run
To run these steps, we save all the necessary metadata in CompiledFxGraph and use it on a cache hit to reconstruct the object.
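A hedged sketch of what the cached metadata and the post-compile replay could look like (names such as `PostCompileMetadata` are illustrative, not the actual `CompiledFxGraph` fields):

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class PostCompileMetadata:
    # Saved alongside the compiled graph so a cache hit can replay the same post-compile steps.
    output_strides: list[Optional[tuple[int, ...]]] = field(default_factory=list)
    cudagraph_fail_reasons: list[str] = field(default_factory=list)
    static_input_idxs: tuple[int, ...] = ()

def post_compile(compiled_fn: Callable, meta: PostCompileMetadata, cudagraphs: bool) -> Callable:
    # Step 1: write meta.output_strides back onto the tracing context (elided here).
    # Step 2: wrap with cudagraphs only if nothing disqualified the graph at compile time.
    if cudagraphs and not meta.cudagraph_fail_reasons:
        def cudagraphified(*args):
            # Placeholder for cudagraph capture/replay; the real code invokes the
            # cudagraphify machinery with meta.static_input_idxs.
            return compiled_fn(*args)
        return cudagraphified
    # Step 3: cudagraphs did not run, so report why and leave input realignment to the caller.
    for reason in meta.cudagraph_fail_reasons:
        print(f"skipping cudagraphs due to {reason}")
    return compiled_fn
```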
## Splitting cudagraphs work into pre/post compile
Cudagraphs does a lot of work on the input graph module to determine if cudagraphs can be enabled. This is the code that involves cudagraph_tests and stack traces. This will work in a world where we have access to the input graph module, but with AOTAutograd warm start, we won't have access to that information anymore. Therefore we can split cudagraphs work into two parts: on a cache miss (and therefore a full compile), we do the cudagraphs testing work, and save cudagraph_fail_reasons into the cache. Then on a cache hit, we know whether or not we can run cudagraphs, and if we can't, we can emit the correct error messages.
Implementation notes:
- We save `fx_kwargs` directly onto the CompiledFXGraph. `fx_kwargs` is already, by definition, part of the cache key, so this is safe to do when it comes to cache correctness.
- ^ Why do we do the above even though FXGraphCache.load takes fx_kwargs as an argument? Because AOTAutogradCache **doesn't** have access to fx_kwargs: they're annoyingly encoded in the functools.partial() of the fw_compiler, so *only* inductor knows about these options. They're fully captured by the AOTAutogradCache key (since every key in fx_kwargs is either a global config or a field that's deterministic based on an input graph module), but their values are still needed to run cudagraphs/postprocessing. Therefore, it's easier/safer to store them on the cached result.
- Willing to hear other approaches here if we think saving these extra fields is not reasonable, though I can't think of another way to do this that's less complicated to explain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130572
Approved by: https://github.com/eellison
This PR refactors placeholders in cudagraphs to be serializable. We define a new PlaceholderInfo object which only has the necessary parts of placeholders for logging/debugging, and use that instead of `torch.fx.Node` directly. This allows us to then save PlaceholderInfo into the FXGraphCache/AOTAutogradCache later.
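As a rough sketch of the idea (hedged: field names are illustrative, not the exact `PlaceholderInfo` definition), the serializable object keeps only the pieces of a placeholder node that the cudagraph code needs for messages:

```python
from dataclasses import dataclass
from typing import Optional

import torch.fx

@dataclass(frozen=True)
class PlaceholderInfo:
    name: str
    stack_trace: Optional[str]
    user_names: tuple[str, ...]

def to_placeholder_info(node: torch.fx.Node) -> PlaceholderInfo:
    # Extract just the logging/debugging fields; unlike torch.fx.Node, the result
    # holds no references to the graph and can be pickled into the cache.
    assert node.op == "placeholder"
    return PlaceholderInfo(
        name=node.name,
        stack_trace=node.stack_trace,  # may be None for placeholders
        user_names=tuple(user.name for user in node.users),
    )
```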
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130252
Approved by: https://github.com/eellison, https://github.com/masnesral
ghstack dependencies: #129384
Moves cudagraphs stuff into a post_compile function that I can later call when loading from AOTAutogradCache. On a cache hit, we only need to save any reasons for disabling cudagraphs along with some metadata needed to run cudagraphify. The arguments to cudagraphs_post_compile should be the set of parameters I'll need to reconstruct on a warm start.
No actual behavioral change should result from this: I'm moving the behavior into separate functions, but every operation should be the same pre and post PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129384
Approved by: https://github.com/eellison
Summary:
CUDAGraph Trees previously relied on the assumption that static inputs (parameters and buffers) do not change tensor addresses across multiple function invocations. This assumption can be used to reduce the number of tensor copies and improve performance. We also use `check_static_inputs_are_stable()` to check whether this assumption holds at runtime.
While this assumption is true in most cases, we recently observed a few cases where it does not hold:
- [Inline inbuilt nn modules](https://github.com/pytorch/pytorch/pull/126822): the same function (an nn module) is used in multiple places, and different parameters and buffers are passed to this function with different tensor addresses
- Some user code changes tensor addresses of parameters/buffers. See [internal example]( https://www.internalfb.com/mlhub/pipelines/runs/mast/sw-935450288-OfflineTraining_08ba1cf0?job_attempt=1&version=0&env=PRODUCTION)
- Compiled Autograd may also pass parameters/buffers with different tensor addresses across runs.
Previous PR [#126822](https://github.com/pytorch/pytorch/pull/126822) (by @mlazos) allows detecting static tensor address changes at runtime and re-recording a cudagraph when that happens. However, if the same function is re-recorded too many times, it may introduce large overhead and hurt performance. This PR adds `torch._inductor.config.triton.cudagraph_max_recording` (default: 5) to fall back to eager if a function has been re-recorded more than `cudagraph_max_recording` times for a specific node in the CUDAGraph Trees.
A summary of how static tensor address changes are handled now (sketched below):
- For each child node, check the assumption via `check_invariants`. If it holds, execute that node under the assumption.
- If the assumption does not hold for any child node, re-record if the function_id has not been re-recorded too many times for the current_node.
- If the function_id has been re-recorded too many times, fall back to the eager function and emit a warning.
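A hedged sketch of that fallback policy (simplified bookkeeping, not the actual CUDAGraph Trees code):

```python
import warnings
from collections import Counter

CUDAGRAPH_MAX_RECORDING = 5  # mirrors the new torch._inductor.config.triton.cudagraph_max_recording
recording_counts: Counter = Counter()  # (node_id, func_id) -> number of recordings so far

def run(node_id, func_id, invariants_hold, replay_fn, record_fn, eager_fn, *args):
    if invariants_hold:
        return replay_fn(*args)  # static addresses are stable: replay the existing cudagraph
    if recording_counts[(node_id, func_id)] < CUDAGRAPH_MAX_RECORDING:
        recording_counts[(node_id, func_id)] += 1
        return record_fn(*args)  # static addresses moved: re-record a new cudagraph
    warnings.warn("static input addresses changed too many times; falling back to eager")
    return eager_fn(*args)
```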
Test Plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129349
Approved by: https://github.com/eellison
### Introduction/Problem
Today, when dynamo traces a builtin nn module (nn.Linear, for example), it specially handles the parameters of that module by storing them as constant attributes of the graph. This requires dynamo to guard on the ID of the NNModule, because if the instance of the module changes, we need to retrace and recollect the new parameters as graph attributes. This creates a 1:1 compiled graph to cudagraph relationship.
With hierarchical compilation, dynamo will treat builtin nn modules like any other code. This reduces complexity and, critically, if there are multiple identical layers in a model, we only need to compile one of those layers once and reuse the same compiled artifact for each layer. This introduces a problem for the current approach to parameter handling. Since the parameters could now change across calls to the compiled artifact, they need to be inputs to the graph instead of attributes. This in turn is a problem for cudagraphs: previously cudagraphs was guaranteed that the parameters of builtin NN Modules would be constant across calls, but now, since the compiled artifact needs to be agnostic to the actual instance of the NN module being used, these parameter memory locations may vary. Previously cudagraphs simply copied varying inputs to cudagraph-owned memory, but since the parameters are quite large, this is catastrophic for performance.
### Solution
To avoid this performance cliff, this PR allows cudagraphs to re-record a new cudagraph if only parameters change. Metadata about which arguments are parameters is propagated from AOT Autograd to compile_fx, and these indices are passed to cudagraphs. If these memory locations change, a new graph is recorded, whereas previously this would be an error (because it was not expected to happen). This enables a 1:many compiled graph to cudagraph relationship. Across similar modules we re-record cudagraphs and dispatch to the correct graph when the parameter pointers match at cudagraph execution time.
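A hedged sketch of the dispatch idea (illustrative only; the real bookkeeping lives inside CUDAGraph Trees and tracks much more state):

```python
from typing import Callable

# Recorded cudagraphs keyed by the data pointers of the parameter inputs.
recorded_graphs: dict[tuple[int, ...], Callable] = {}

def run_with_param_dispatch(inputs, parameter_idxs, record_fn):
    key = tuple(inputs[i].data_ptr() for i in parameter_idxs)
    if key not in recorded_graphs:
        # A new set of parameter addresses (e.g. a different instance of the same
        # builtin nn module): record a fresh cudagraph rather than copying the
        # large parameters into cudagraph-owned memory or raising an error.
        recorded_graphs[key] = record_fn(inputs)
    return recorded_graphs[key](inputs)
```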
### Next steps (if needed)
It is theoretically possible that a user passes Parameters that change frequently as inputs to model code. If this becomes a common issue, this design allows dynamo to pass metadata indicating which parameters were created in a builtin NN Module context, so that only those parameters get the multi-cudagraph behavior; this PR does not implement that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126822
Approved by: https://github.com/eellison
ghstack dependencies: #126820, #126821
Summary: This diff adds more logging for cudagraph when a static input tensor mutates. For each placeholder whose static input tensor address mutates, we log its name, the changed data pointer address, and the input stack trace. Since some placeholders may have an empty stack trace, we find the first user with a non-empty stack trace and print that stack trace instead.
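A hedged sketch of the logging described above (illustrative, not the actual cudagraph_trees code):

```python
import logging

import torch.fx

log = logging.getLogger(__name__)

def log_static_input_mutation(node: torch.fx.Node, old_ptr: int, new_ptr: int) -> None:
    stack_trace = node.stack_trace
    if not stack_trace:
        # Placeholders often carry no stack trace; borrow one from the first user that has it.
        stack_trace = next(
            (user.stack_trace for user in node.users if user.stack_trace),
            "<no stack trace available>",
        )
    log.warning(
        "static input %s changed data pointer from 0x%x to 0x%x\n%s",
        node.name, old_ptr, new_ptr, stack_trace,
    )
```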
Test Plan: buck2 run fbcode//caffe2/test/inductor:cudagraph_trees -- --r test_static_inputs_address_mutation_log
Differential Revision: D57805118
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127145
Approved by: https://github.com/eellison
Summary: [#123231](https://github.com/pytorch/pytorch/pull/123231) adds cudagraph support for more types of functions (i.e., cudagraph-managed input mutation). These newly supported functions may have mutated static inputs, leading to assertion errors in some workloads that previously skipped cudagraph. This diff adds a config to opt in to the new feature.
Test Plan: ci
Differential Revision: D56481353
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124754
Approved by: https://github.com/eellison
# PR
This PR supports mutating inputs in cudagraph trees, if these inputs are outputs from a previous cudagraph. Please check #121861 for more details.
# Note on Optimistic Mutation Check
To determine whether to apply cudagraph, we need to check input mutations, which fall into four categories: a) no mutation, b) mutation on parameters/buffers, c) mutation on cudagraph-recorded tensors, d) mutation on non-cudagraph-recorded tensors. We can apply cudagraph for types a, b, and c, but not for type d. The input mutation type depends on the function, the current_node, and the inputs.
Since `check_for_mutation` is slow, there is a trade-off between making type c or type d faster.
- To make type d) faster, we want to `check_for_mutation` and call the eager function early. However, this adds unnecessary overhead to types a, b, and c due to the extra check.
- To make type c) faster, we want to skip `check_for_mutation` at the beginning and only `check_for_mutation` before `record_function` for a new function. This removes the overhead of `check_for_mutation` for types a, b, and c. However, it adds extra overhead to type d due to `check_invariants` on all child nodes.
Instead, we design an optimistic mutation check. The assumption is that, given a function and a node, the input mutation type usually remains the same across inputs. So, if we have ever detected a function on a node as type d, we will never detect it as type c. The detailed design (sketched below) is:
- [Slow Path] On the first invocation of a function on a node, we run `check_for_mutation` once and cache the input mutation type as `non_cudagraph_managed_mutation[node_id][func_id]`.
- [Fast Path] On subsequent invocations of a function on a node, we skip `check_for_mutation`. If `non_cudagraph_managed_mutation[node_id][func_id]` is true, we directly call the eager function. Otherwise, we run `check_invariants` and call the cudagraph function.
- [Slow Path] Before `record_function`, we run `check_for_mutation` again.
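A hedged sketch of this dispatch (data structures and names are illustrative, not the actual CUDAGraph Trees implementation; the re-record path before `record_function` is omitted):

```python
from collections import defaultdict

# node_id -> func_id -> True if a non-cudagraph-managed mutation (type d) was ever detected.
non_cudagraph_managed_mutation: dict[int, dict[int, bool]] = defaultdict(dict)

def run(node_id, func_id, inputs, check_for_mutation, check_invariants, eager_fn, cudagraph_fn):
    cache = non_cudagraph_managed_mutation[node_id]
    if func_id not in cache:
        # Slow path: first invocation of this function on this node.
        cache[func_id] = check_for_mutation(inputs)
    if cache[func_id]:
        return eager_fn(inputs)      # type d: never eligible for cudagraph on this node
    if check_invariants(inputs):
        return cudagraph_fn(inputs)  # fast path: types a/b/c with invariants intact
    return eager_fn(inputs)          # simplified; the real code may re-record instead
```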
**Q1: Would there be overhead for types a, b, c, or d?**
A: No. We only check input mutation types on the first invocation of a function on a node.
**Q2: If a function happens to be type c during the first invocation on a node, could we detect it as type d in the future?**
A: Yes. This is done by `check_invariants` and guarantees correctness.
**Q3: If a function happens to be type d during the first invocation on a node, could it still be recognized as type c in the future?**
A: No. But this should happen rarely according to our assumption, and in the rare case that it does, there are no correctness issues; the performance is simply the same as the eager (or inductor-optimized) function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123231
Approved by: https://github.com/eellison