pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Yuanyuan Chen	f9953e0f61	Enable PLC0414 on ruff (#165828 ) This PR enables `PLC0414` that fixes redundant import aliases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165828 Approved by: https://github.com/albanD	2025-10-22 04:56:52 +00:00
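As an illustration of what a redundant import alias looks like, here is a minimal sketch of a checker in the spirit of `PLC0414`. The `redundant_aliases` helper is hypothetical and is not ruff's implementation; it only flags aliases whose name is identical to the imported name.

```python
import ast


def redundant_aliases(source: str) -> list[str]:
    """Return imported names whose alias is identical to the name itself."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # `import os as os` or `from json import loads as loads`
                if alias.asname is not None and alias.asname == alias.name:
                    found.append(alias.name)
    return found


sample = "import os as os\nimport sys as system\nfrom json import loads as loads\n"
flagged = redundant_aliases(sample)
```

Here `import sys as system` is left alone because the alias actually renames the module.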
Maggie Moss	9944cac6e6	Add suppressions to torch/_inductor (#165062 ) Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283 Split this directory into two PRs to keep them from being too large. Test plan: dmypy restart && python3 scripts/lintrunner.py -a pyrefly check step 1: delete lines in the pyrefly.toml file from the project-excludes field step 2: run pyrefly check step 3: add suppressions, clean up unused suppressions before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199 after: INFO 0 errors (6,884 ignored) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165062 Approved by: https://github.com/oulgen, https://github.com/mlazos	2025-10-09 20:34:20 +00:00
Aaron Orenstein	250ae2531c	Fix types in graphs.py (#158192 ) Added type annotations for torch/cuda/graphs.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/158192 Approved by: https://github.com/oulgen	2025-07-15 19:49:38 +00:00
Boyuan Feng	1044934878	[CUDAGraph] add config `cudagraph_capture_sizes` (#156551 ) Users may want CUDAGraph for certain sizes and fallback for other sizes. As discussed in Issue #121968, we would like to use cudagraph for [batch size [1,2,3,...,16]](https://github.com/pytorch/pytorch/issues/121968#issuecomment-2259942345) and fallback for others. Another use case is [vllm](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/cuda_piecewise_backend.py#L114-L119), where 67 batch sizes (i.e., [1,2,4,8,16,24,32,...,512]) are captured and all other sizes fallback. This PR implements the feature with `torch._inductor.config.triton.cudagraph_capture_sizes`. When it is specified, we only capture cudagraph for these shapes. When it is None (by default), we capture cudagraph for all shapes. Example: ```python import torch torch._inductor.config.triton.cudagraph_capture_sizes = [(2,3), (4,5), (6, 2), (7,3)] def f(x): return x + 1 f = torch.compile(f, mode="reduce-overhead", dynamic=False) def run(batch_size, seq_len, d): x = torch.randn((batch_size, seq_len, d), device="cuda") # Need to mark the dimension as dynamic. Automated-dynamic # may have some ux issues on matching `cudagraph_capture_sizes` # with the actual dynamic shapes, since there are specialization and # multiple dynamo graphs. torch._dynamo.mark_dynamic(x, 0) torch._dynamo.mark_dynamic(x, 1) for _ in range(3): f(x) for i in range(2, 10): for j in range(2, 10): run(i, j, 8) num_cudagraph = torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id() assert num_cudagraph.id == 4 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/156551 Approved by: https://github.com/bobrenjc93	2025-06-24 05:14:49 +00:00
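The gating behaviour described above (capture only for listed sizes, fall back otherwise, capture everything when the config is `None`) can be modelled without a GPU. This is a sketch only: `SizeGatedRunner` is a hypothetical stand-in, and "capturing" is simulated by storing the function rather than recording a real CUDA graph.

```python
from typing import Callable, Optional

Shape = tuple[int, ...]


class SizeGatedRunner:
    """Hypothetical model: 'capture' only for whitelisted shapes."""

    def __init__(self, fn: Callable, capture_sizes: Optional[list[Shape]] = None):
        self.fn = fn
        # None means "capture for every shape", mirroring the default above.
        self.capture_sizes = None if capture_sizes is None else set(capture_sizes)
        self.captured: dict[Shape, Callable] = {}
        self.fallback_calls = 0

    def __call__(self, shape: Shape, *args):
        if self.capture_sizes is not None and shape not in self.capture_sizes:
            self.fallback_calls += 1
            return self.fn(*args)  # fallback path, nothing is recorded
        if shape not in self.captured:
            self.captured[shape] = self.fn  # placeholder for a recorded graph
        return self.captured[shape](*args)


runner = SizeGatedRunner(lambda x: x + 1, capture_sizes=[(2, 3), (4, 5), (6, 2), (7, 3)])
for i in range(2, 10):
    for j in range(2, 10):
        runner((i, j), 0)
```

With the same size list and the same 8 x 8 sweep as the commit's example, this model also ends up with four captured entries.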
Xuehai Pan	6ff6630375	[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313 Approved by: https://github.com/jingsh	2025-06-23 02:57:12 +00:00
PyTorch MergeBot	f1331f3f1b	Revert "[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313 )" This reverts commit `3627270bdf`. Reverted https://github.com/pytorch/pytorch/pull/156313 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](`c95f7fa874`) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))	2025-06-22 12:31:57 +00:00
Xuehai Pan	3627270bdf	[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313 Approved by: https://github.com/jingsh	2025-06-22 08:43:09 +00:00
Simon Fan	28796f71d0	Redo D75092426: [internal] Expose additional metadata to compilation callbacks (#155063 ) Originally https://github.com/pytorch/pytorch/pull/153596 --------------- Summary: via reverting D75708685 gate the ROCm failure Test Plan: Unit tests in OSS, sandcastle Rollback Plan: Differential Revision: D75894349 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155063 Approved by: https://github.com/masnesral	2025-06-05 23:40:31 +00:00
PyTorch MergeBot	35fc5c49b4	Revert "[internal] Expose additional metadata to compilation callbacks (#153596 )" This reverts commit `f889dea97d`. Reverted https://github.com/pytorch/pytorch/pull/153596 on behalf of https://github.com/izaitsevfb due to introduces bunch of callback-related failures on rocm ([comment](https://github.com/pytorch/pytorch/pull/153596#issuecomment-2923139061))	2025-05-30 18:39:27 +00:00
Simon Fan	f889dea97d	[internal] Expose additional metadata to compilation callbacks (#153596 ) These hooks are used by internal stuck job detection to associate compilation events with the compile lease. Previously, we only had events for Dynamo and Inductor compilation. And recently, the callback handler was updated to ignore nested events. So the Inductor event was only really used by lazy backward. Here, I remove the inductor event, and add an explicit lazy backward one. Additionally, I add other runtime compilation events: autotuning and cudagraphs. I also expose the CompileId as a string to avoid imports, this will let internal UIs track each graph's contribution to the timeout. ```python class CallbackTrigger(enum.Enum): # most common case, dynamo attempts to trace a new frame DYNAMO = 1 # backward compilation can be deferred to runtime LAZY_BACKWARD = 2 # some backends autotune at runtime TRITON_AUTOTUNING = 3 # cudagraphs record at runtime CUDAGRAPH_RECORDING = 4 ``` Differential Revision: [D75092426](https://our.internmc.facebook.com/intern/diff/D75092426) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153596 Approved by: https://github.com/masnesral	2025-05-30 08:07:04 +00:00
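A rough sketch of how such triggers might reach a listener, with nested events ignored as the message describes. The enum values are taken from the commit message; `CallbackRegistry`, `register_start`, `begin`, and `end` are hypothetical names for this sketch and are not the `torch._dynamo` callback API.

```python
import enum
from typing import Callable


class CallbackTrigger(enum.Enum):
    DYNAMO = 1               # dynamo attempts to trace a new frame
    LAZY_BACKWARD = 2        # backward compilation deferred to runtime
    TRITON_AUTOTUNING = 3    # some backends autotune at runtime
    CUDAGRAPH_RECORDING = 4  # cudagraphs record at runtime


class CallbackRegistry:
    """Hypothetical registry that only reports outermost events."""

    def __init__(self) -> None:
        self._on_start: list[Callable[[CallbackTrigger, str], None]] = []
        self._depth = 0

    def register_start(self, cb: Callable[[CallbackTrigger, str], None]) -> None:
        self._on_start.append(cb)

    def begin(self, trigger: CallbackTrigger, compile_id: str) -> None:
        self._depth += 1
        if self._depth == 1:  # nested events are ignored
            for cb in self._on_start:
                cb(trigger, compile_id)

    def end(self) -> None:
        self._depth -= 1


events: list[tuple[str, str]] = []
registry = CallbackRegistry()
# The compile id is passed as a plain string, so listeners need no imports.
registry.register_start(lambda trigger, cid: events.append((trigger.name, cid)))

registry.begin(CallbackTrigger.DYNAMO, "0/0")
registry.begin(CallbackTrigger.TRITON_AUTOTUNING, "0/0")  # nested: not reported
registry.end()
registry.end()
registry.begin(CallbackTrigger.CUDAGRAPH_RECORDING, "0/0")
registry.end()
```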
eellison	d6e29bf875	Reflect back mutation if we clone misaligned tensors (#154442 ) Fix for https://github.com/pytorch/pytorch/issues/152425 inductor specializes whether or not a tensor is 16-byte aligned on the first invocation. Then, on subsequent invocations, if we inferred alignment but are passed a non-aligned tensor, we clone the tensor. If we infer alignment, then run with unaligned, and mutate the input, we need to reflect back the mutation to the input. This PR adds back that mutation. We could have also been less aggressive about inferring alignment for mutated tensors, but that has a significant perf hit. See the following benchmark: ``` import torch t = torch.rand(4096 * 4096, device="cuda", dtype=torch.float16) @torch.compile(dynamic=False) def foo(x): return x.add_(1) import triton print(triton.testing.do_bench(lambda: foo(t[:-1]))) torch._dynamo.reset() print(triton.testing.do_bench(lambda: foo(t[1:]))) ``` gives ``` 0.04063070610165596 0.07613472988113162 ``` So almost twice as slow for non-aligned tensors. Tensors changing alignment is a relatively rare case. In the future, we could consider a multi-kernel approach, or codegening a triton kernel that does most of the loads with aligned instructions, and a prologue/epilogue of un-alignment. But, it's yet to be seen whether this is a huge issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154442 Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh	2025-05-29 13:36:48 +00:00
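The clone-then-copy-back idea can be shown on plain byte buffers. This is a sketch under assumptions: `ALIGNMENT = 16` bytes and the helper `run_assuming_aligned` are illustrative choices, not Inductor's code, and the "kernel" is any function that mutates its argument in place.

```python
ALIGNMENT = 16  # bytes; assumed value for this sketch


def add_one(data) -> None:
    """Toy in-place kernel: increment every byte."""
    for i in range(len(data)):
        data[i] = (data[i] + 1) % 256


def run_assuming_aligned(kernel, buf: bytearray, offset: int, length: int) -> None:
    """Run `kernel` on buf[offset:offset+length], cloning if the start is misaligned."""
    view = memoryview(buf)[offset:offset + length]
    if offset % ALIGNMENT == 0:
        kernel(view)  # aligned: mutate the caller's storage directly
        return
    clone = bytearray(view)  # misaligned: the kernel works on a copy
    kernel(clone)
    view[:] = clone  # reflect the mutation back into the original input


buf = bytearray(32)
run_assuming_aligned(add_one, buf, 1, 8)  # offset 1 is not 16-byte aligned
```

Without the final copy-back line, the caller's buffer would silently keep its old contents whenever the clone path is taken.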
Aaron Gokaslan	3555ebb63d	[BE]: Update ruff to 0.11.8 (#153249 ) Fixes a ton of false negatives throughout the codebase. RUFF also properly validates NOQA comments now and most of the changes are fixing typos there or removing filewide flake8 suppressions that were also silencing ruff issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153249 Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/seemethere	2025-05-12 18:30:52 +00:00
Boyuan Feng	d969e2ec33	[CUDAGraph Trees] support memory allocation on side stream (#152472 ) I tried `beginAllocateToPool` instead of `_cuda_beginAllocateCurrentStreamToPool` and the error in #151199 does not happen any more. However, this approach is unsafe for multithreading. When multiple run_eager happens concurrently, we expect memory allocation to different mem_pool. Since beginAllocateToPool does not check stream, these memory allocation may happen on the same mem_pool. So, I use `_cuda_beginAllocateCurrentThreadToPool` to direct all memory allocation on the same thread to a given mem_pool. In particular, `_cuda_beginAllocateCurrentThreadToPool` records the launching thread id, and during runtime checks if the current thread id matches the launching thread id. Fixes #151199 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152472 Approved by: https://github.com/eellison, https://github.com/ngimel	2025-05-02 04:26:35 +00:00
PyTorch MergeBot	56039b5778	Revert "[CUDAGraph Trees] support memory allocation on side stream (#152472 )" This reverts commit `c620763ec2`. Reverted https://github.com/pytorch/pytorch/pull/152472 on behalf of https://github.com/BoyuanFeng due to should use tid instead pid ([comment](https://github.com/pytorch/pytorch/pull/152472#issuecomment-2843491656))	2025-04-30 22:18:10 +00:00
Boyuan Feng	c620763ec2	[CUDAGraph Trees] support memory allocation on side stream (#152472 ) I tried `beginAllocateToPool` instead of `_cuda_beginAllocateCurrentStreamToPool` and the error in #151199 does not happen any more. However, this approach is unsafe for multithreading. When multiple run_eager happens concurrently, we expect memory allocation to different mem_pool. Since beginAllocateToPool does not check stream, these memory allocation may happen on the same mem_pool. So, I use `_cuda_beginAllocateCurrentThreadToPool` to direct all memory allocation on the same thread to a given mem_pool. In particular, `_cuda_beginAllocateCurrentThreadToPool` records the launching thread id, and during runtime checks if the current thread id matches the launching thread id. Fixes #151199 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152472 Approved by: https://github.com/eellison	2025-04-30 17:45:07 +00:00
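To make the thread-scoped routing concrete, here is a small model in plain Python. It assumes nothing about the CUDA caching allocator: `ThreadScopedPoolRouter` and its methods are hypothetical, and pools are just strings. The point is only that the route is keyed on the launching thread's id, so concurrent threads do not share a pool.

```python
import threading


class ThreadScopedPoolRouter:
    """Hypothetical model: route allocations by the id of the launching thread."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._routes: dict[int, str] = {}  # thread id -> pool name

    def begin(self, pool: str) -> None:
        with self._lock:
            self._routes[threading.get_ident()] = pool

    def end(self) -> None:
        with self._lock:
            self._routes.pop(threading.get_ident(), None)

    def pool_for_current_thread(self, default: str = "default") -> str:
        with self._lock:
            return self._routes.get(threading.get_ident(), default)


router = ThreadScopedPoolRouter()
seen: dict[str, str] = {}
barrier = threading.Barrier(2)


def worker(name: str) -> None:
    router.begin(f"pool_{name}")
    barrier.wait()  # both threads are alive here, so their ids are distinct
    seen[name] = router.pool_for_current_thread()
    router.end()


threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```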
Sam Larsen	8542d55f0c	[logging] Clean up dynamo_timed usages in cudagraph_trees (#152136 ) Summary: I'm investigating differences in total torch.compile overhead in our two main internal sources: dynamo_compile and pt2_compile_events. One source of discrepancy is due to cudagraphs overheads. Currently, we have a context manager that optionally attributes a dynamo_timed region to a cudagraph-related column logged to dynamo_compile, but _all_ dynamo_timed regions show up in pt2_compile_events (hence the discrepancy; pt2_compile_events is overcounting). We could filter out these specific events from pt2_compile_events when measuring overall overhead. But I'm going to argue that those timed regions that we DO NOT consider as a compiler-related overhead don't have much value in logging in the first place. So I'm suggesting we just remove those instances. Here's the production job with the discrepancy: * dynamo_compile: https://fburl.com/scuba/dynamo_compile/3604eypl * pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/c2dv8sty Test Plan: torchbench nanogpt: * tlparse: https://fburl.com/h1n2ascc * dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/u37yrynp * pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/s7avd0di Pull Request resolved: https://github.com/pytorch/pytorch/pull/152136 Approved by: https://github.com/BoyuanFeng	2025-04-25 19:18:12 +00:00
Sam Larsen	edba20b853	[logging] Fix duration logging for dynamo_compile (#151749 ) Summary: There are a few issues I'm solving: 1. It's too hard to measure total pt2 overhead using the dynamo_compile table because users need to know the columns representing all the top-level events (dynamo_cumulative_compile_time_us, etc.). Instead, let's populate the existing duration_us field for all top-level events. The complication is that runtime events in particular (Triton autotuning, cudagraphify) can be collapsed into a single row, with gaps in between, so we can't simply use `end_time - start_time` in all cases. Instead, we'll sum durations for all outer events when updating the compile-time or runtime metrics context. Introduce a 'depth' counter in TLS to track the nesting of CompilationMetrics events. 2. The existing implementation relies on callers of dynamo_timed to specify whether the event is a runtime or compile-time event. That doesn't work because some methods can be called in both situations, e.g., `CachingAutotuner.benchmark_all_configs`. For example `TORCHINDUCTOR_BENCHMARK_FUSION=1` enables benchmarking during compile-time. Instead, we can figure out automatically whether we're measuring a compile-time or runtime event and log accordingly. 3. If `log_compilation_events` were to throw an exception, we'd fail to clear the aggregated counters for runtime logs and they could be attributed to the wrong compile ID. I didn't actually find evidence of this in practice, but I added exception handling for extra safety.
Test Plan: Ran internal models and compared dynamo_compile to pt2_compile_events: `TORCHINDUCTOR_BENCHMARK_FUSION=0` * tlparse: https://fburl.com/itciwnxc * dynamo_compile: https://fburl.com/scuba/dynamo_compile/yvkif5vb * pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/segijet7 `TORCHINDUCTOR_BENCHMARK_FUSION=1` * tlparse: https://fburl.com/jgurcvkw * dynamo_compile: https://fburl.com/scuba/dynamo_compile/uum91ceb * pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/x4xnisez Pull Request resolved: https://github.com/pytorch/pytorch/pull/151749 Approved by: https://github.com/Skylion007	2025-04-22 03:29:13 +00:00
eellison	5757aa8773	Cudagraph fix + comment cleanup (#149741 ) Cudagraphs is careful to not allow any memory recorded to escape globally without having a reference to the tensor. This is because we may later reclaim that memory for a cudagraph recording and we need to mark the tensor as erroring on access. Very occasionally, a stray tensor will have been allocated locally but not yet cleaned up. In this case, we enter the slow path and try to gc.collect() to deallocate it. From a hard to repro internal use case, this was fixed by an additional `cuda.synchronize()`. I also snuck in an outdated comment and a duplicate line removal. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149741 Approved by: https://github.com/BoyuanFeng, https://github.com/Skylion007	2025-03-21 21:12:36 +00:00
Sam Larsen	7cdbb913e7	[logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693 ) Summary: This is a simpler alternative to https://github.com/pytorch/pytorch/pull/146455, where we can stick the compileId (and forward/backward bool) in the CachingAutotuner so that we have it for logging `benchmark_all_configs`. Recall that the first attempt put the compileId in the inductor_meta and that interfered with caching. Test Plan: `python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt` * tlparse: https://fburl.com/e71yn6uc * dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/4ageghhv * pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/4fgv1itq Pull Request resolved: https://github.com/pytorch/pytorch/pull/148693 Approved by: https://github.com/eellison	2025-03-13 03:50:58 +00:00
Sam Larsen	187d5c0eb1	[logging] Log cudagraphify timings to dynamo_timed (#143220 ) Summary: this adds some new dynamo_timed calls in cudagraph_trees, primarily with the aim to add cudagraph-related timing to scuba. Things to note: * Uses the changes in https://github.com/pytorch/pytorch/pull/141919 to log "runtime" entries * The logging for chromium/tlparse/scuba relies on us providing a compile_id since it's not available in the environment. A lot of the changes here are just passing around the compile_id * I believe the spirit of the scuba logging is to capture the overheads of `torch.compile`. Therefore, I'm not adding _every_ dynamo_timed to scuba. For example, "run_eager" is the first real execution of the inductor graph -- it's not cudagraph overhead, per se. Watch out for the two instances of `dynamo_compile_runtime_column_us="runtime_cudagraphify_time_us"`. Those are the spots I believe are _extra_ overhead we'd contribute to torch.compile. Test Plan: `python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only dcgan`: * tlparse: https://fburl.com/21yrdn8h * scuba: https://fburl.com/scuba/dynamo_compile/sandbox/wt90wnjz `python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt` * tlparse: https://fburl.com/r9mp7uiv * scuba: https://fburl.com/scuba/dynamo_compile/sandbox/1nvx94re Pull Request resolved: https://github.com/pytorch/pytorch/pull/143220 Approved by: https://github.com/eellison	2025-03-07 23:07:13 +00:00
Xuehai Pan	1cb4e2df65	[BE][PYFMT] migrate PYFMT for `torch._inductor` to `ruff format` (#144550 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144550 Approved by: https://github.com/jansel	2025-02-28 13:33:19 +00:00
eellison	481a57bc37	Support torch.compile rng selective activation checkpointing with cudagraph (#146878 ) TODO: - [x] Add handling for when forward is invoked multiple times without invoking backward, so that the fwd/backward states are out of sync - [x] Update rng state initialization to take from correct device - [x] Tests - [x] handling of retain_graph - [x] respect fallback random Fix for https://github.com/pytorch/pytorch/issues/130123. Updates the aot_eager and cudagraph compilation of `run_and_save_rng_state` to use the new mechanism added by https://github.com/pytorch/pytorch/pull/114068 for CUDAGraph safe rng states. We have a pair of rng states for the fwd and backward respectively. In both forward and backward the rng op will get run with `graphsafe_run_with_rng_state` which takes in RNG state and it hooks onto the current RNG generator before running the operator. The rng states for fwd/backward are initialized with the same value. We ensure that for any given run of the forward, the corresponding backward run will have the same rng states for the op as was observed in the forward. ``` ===== Forward graph 1 ===== /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module): def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", fwd_rng_state_0): sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1) # No stacktrace found for following nodes graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = fwd_rng_state_0); fwd_rng_state_0 = None ... 
===== Backward graph 1 ===== def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", tangents_1: "f32[4, 4][4, 1]cuda:0", bwd_rng_state_0): sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1) # No stacktrace found for following nodes graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = bwd_rng_state_0); bwd_rng_state_0 = None ``` There is some extra complication when a user either calls backward with retain_graph, or calls the backward in a different order than they called the forward. If a user has state fwd_rng_state0, bwd_rng_state0 and calls: - fwd0: fwd_rng_state0 -> fwd_rng_state1 - fwd1: fwd_rng_state1 -> fwd_rng_state2 - bwd1 - bwd0 Then naively, when bwd1 is invoked the bwd rng states would not be equal to the same states that were observed in fwd1. I added handling of this in the aot runtime wrappers to detect pending backward invocations, and the current position of the bwd rng states, and to update when necessary. Other notes: Because nodes which appear later in the forward appear earlier in the backward, we need a separate rng state for each operator. If we reused the rng across ops, the forward and backward would be run with different rng states. I.e., not applied in the same order. Questions for reviewers: This does change numerics, bc the rng of the op is now taken from the input rng state instead of whatever the rng would be midway through running the graph. Technically, we only need this for cuda graph. But, I'd prefer to not have a rng divergence just for cudagraph. I am making it respect `fallback_random`. Edit: decided to apply to non cudagraphs as well, so long as fallback_random is not set I'm initializing the rng states by cloning the current state. 
If you had something like 5 different rands in the model with the same shape, they'd all get the same value. This doesn't seem great. I could use some other initialization scheme like taking seed from graph position, or etc etc. Not sure. Let me know thoughts. Edit: updated to be taken from randint() Update: initializing rng states from torch.randint. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146878 Approved by: https://github.com/anijain2305, https://github.com/bdhirsh	2025-02-28 00:47:03 +00:00
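Why a separate RNG state per operator keeps forward and backward in agreement can be seen with Python's standard `random` module. This is a sketch, not the AOT runtime wrapper: `make_rng_state` and `run_with_rng_state` are hypothetical helpers, and fixed seeds stand in for the `torch.randint`-based initialization mentioned above.

```python
import random


def make_rng_state(seed: int):
    """Create an explicit RNG state (seeded here for illustration only)."""
    return random.Random(seed).getstate()


def run_with_rng_state(state, n: int):
    """Run a random op from an explicit state; return (values, next_state)."""
    rng = random.Random()
    rng.setstate(state)
    values = [rng.random() for _ in range(n)]
    return values, rng.getstate()


# One state per op; forward and backward copies start with the same value.
fwd_states = {"op0": make_rng_state(0), "op1": make_rng_state(1)}
bwd_states = dict(fwd_states)

fwd_values = {}
for op in ("op0", "op1"):  # forward order
    fwd_values[op], fwd_states[op] = run_with_rng_state(fwd_states[op], 3)

bwd_values = {}
for op in ("op1", "op0"):  # backward visits the ops in reverse order
    bwd_values[op], bwd_states[op] = run_with_rng_state(bwd_states[op], 3)
```

Because each op draws from its own state, the reversed visiting order in the backward pass does not change which values an op sees.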
PyTorch MergeBot	17358ce778	Revert "Support torch.compile rng selective activation checkpointing with cudagraph (#146878 )" This reverts commit `ad0c879e22`. Reverted https://github.com/pytorch/pytorch/pull/146878 on behalf of https://github.com/wdvr due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/146878#issuecomment-2686767956))	2025-02-27 03:36:16 +00:00
eellison	ad0c879e22	Support torch.compile rng selective activation checkpointing with cudagraph (#146878 ) TODO: - [x] Add handling for when forward is invoked multiple times without invoking backward, so that the fwd/backward states are out of sync - [x] Update rng state initialization to take from correct device - [x] Tests - [x] handling of retain_graph - [x] respect fallback random Fix for https://github.com/pytorch/pytorch/issues/130123. Updates the aot_eager and cudagraph compilation of `run_and_save_rng_state` to use the new mechanism added by https://github.com/pytorch/pytorch/pull/114068 for CUDAGraph safe rng states. We have a pair of rng states for the fwd and backward respectively. In both forward and backward the rng op will get run with `graphsafe_run_with_rng_state` which takes in RNG state and it hooks onto the current RNG generator before running the operator. The rng states for fwd/backward are initialized with the same value. We ensure that for any given run of the forward, the corresponding backward run will have the same rng states for the op as was observed in the forward. ``` ===== Forward graph 1 ===== /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module): def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", fwd_rng_state_0): sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1) # No stacktrace found for following nodes graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = fwd_rng_state_0); fwd_rng_state_0 = None ... 
===== Backward graph 1 ===== def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", tangents_1: "f32[4, 4][4, 1]cuda:0", bwd_rng_state_0): sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1) # No stacktrace found for following nodes graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = bwd_rng_state_0); bwd_rng_state_0 = None ``` There is some extra complication when a user either calls backward with retain_graph, or calls the backward in a different order than they called the forward. If a user has state fwd_rng_state0, bwd_rng_state0 and calls: - fwd0: fwd_rng_state0 -> fwd_rng_state1 - fwd1: fwd_rng_state1 -> fwd_rng_state2 - bwd1 - bwd0 Then naively, when bwd1 is invoked the bwd rng states would not be equal to the same states that were observed in fwd1. I added handling of this in the aot runtime wrappers to detect pending backward invocations, and the current position of the bwd rng states, and to update when necessary. Other notes: Because nodes which appear later in the forward appear earlier in the backward, we need a separate rng state for each operator. If we reused the rng across ops, the forward and backward would be run with different rng states. I.e., not applied in the same order. Questions for reviewers: This does change numerics, bc the rng of the op is now taken from the input rng state instead of whatever the rng would be midway through running the graph. Technically, we only need this for cuda graph. But, I'd prefer to not have a rng divergence just for cudagraph. I am making it respect `fallback_random`. Edit: decided to apply to non cudagraphs as well, so long as fallback_random is not set I'm initializing the rng states by cloning the current state. 
If you had something like 5 different rands in the model with the same shape, they'd all get the same value. This doesn't seem great. I could use some other initialization scheme like taking seed from graph position, or etc etc. Not sure. Let me know thoughts. Edit: updated to be taken from randint() Update: initializing rng states from torch.randint. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146878 Approved by: https://github.com/anijain2305, https://github.com/bdhirsh	2025-02-27 02:08:29 +00:00
Aaron Orenstein	db4ce78d46	PEP585: More UP006 fixes (#146392 ) This should be the final PR before we can enable RUFF UP006. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146392 Approved by: https://github.com/justinchuby, https://github.com/albanD, https://github.com/Skylion007	2025-02-20 06:18:13 +00:00
PyTorch MergeBot	9a883007a2	Revert "Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979 )" This reverts commit `c7515da7b0`. Reverted https://github.com/pytorch/pytorch/pull/140979 on behalf of https://github.com/huydhn due to This change has been reported to break internal code ([comment](https://github.com/pytorch/pytorch/pull/140979#issuecomment-2657361940))	2025-02-13 18:04:26 +00:00
Daniel Galvez	c7515da7b0	Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979 ) This is a new PR for #130386 , which got stale and was closed. Since I force-pushed to that branch in order to rebase it on top of main, the PR can no longer be reopened, according to https://github.com/isaacs/github/issues/361 I fixed the possibly-not-warmed-up problem described here: https://github.com/pytorch/pytorch/pull/130386/files#r1690856534 Since starting this, torch.cond and torch.while_loop now apparently have support for backward passes. I will look into what it might take to support that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140979 Approved by: https://github.com/eqy, https://github.com/eellison	2025-02-11 18:16:15 +00:00
Harmen Stoppels	01554c7b5a	fix incorrect literal strings / accidental tuples (#146037 ) * `expr,` is short for `(expr,)` * literal strings over multiple lines need to escape the newline `\` or use `(...)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146037 Approved by: https://github.com/Skylion007	2025-02-03 15:08:11 +00:00
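Both pitfalls named in this message are easy to reproduce in a few lines. The variable names below are made up for the illustration and do not come from the PR.

```python
# A trailing comma turns the right-hand side into a one-element tuple.
value = 1,

# A string literal on its own line is a separate expression statement,
# so it is silently discarded instead of being appended to the first one.
message_bad = "first line "
"second line"

# Parentheses let adjacent literals concatenate across lines.
message_ok = (
    "first line "
    "second line"
)
```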
Aaron Orenstein	893ca1dfe1	PEP585 update - torch/_inductor/[_-i]* (#145137 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145137 Approved by: https://github.com/bobrenjc93	2025-01-19 01:22:47 +00:00
bobrenjc93	a3ab27b8e0	Migrate from Tuple -> tuple in torch/_inductor (#144264 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144264 Approved by: https://github.com/eellison	2025-01-07 03:27:27 +00:00
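For readers unfamiliar with the PEP 585 migration, the change amounts to replacing `typing` aliases with the builtin generics available since Python 3.9. The two functions below are illustrative only and are not taken from `torch/_inductor`.

```python
from typing import Tuple  # legacy alias, kept here only for comparison


def swap_old(pair: Tuple[int, str]) -> Tuple[str, int]:
    """Pre-PEP 585 spelling using typing.Tuple."""
    return pair[1], pair[0]


def swap_new(pair: tuple[int, str]) -> tuple[str, int]:
    """PEP 585 spelling using the builtin tuple directly."""
    return pair[1], pair[0]
```

The runtime behaviour is identical; only the annotations differ.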
Tom Ritchford	da67a6a7bb	[inductor] Replace set by OrderedSet (#138466 ) Uses the set_linter from https://github.com/pytorch/pytorch/pull/138454 and considerable manual editing Pull Request resolved: https://github.com/pytorch/pytorch/pull/138466 Approved by: https://github.com/eellison	2024-12-13 16:08:45 +00:00
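The commit message does not spell out the motivation, but the practical difference between a builtin `set` and an ordered set is iteration order. The sketch below is an assumption-laden illustration: `InsertionOrderedSet` is a hypothetical minimal class built on `dict` (which preserves insertion order), not the `OrderedSet` used in the PR.

```python
class InsertionOrderedSet:
    """Hypothetical minimal ordered set backed by a dict."""

    def __init__(self, items=()):
        # dict.fromkeys drops duplicates and keeps first-seen order.
        self._items = dict.fromkeys(items)

    def add(self, item) -> None:
        self._items[item] = None

    def __contains__(self, item) -> bool:
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)


names = InsertionOrderedSet(["buf1", "buf0", "buf1", "buf2"])
names.add("buf0")  # already present, position unchanged
order = list(names)
```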
Tom Ritchford	dc23f1944a	Remove unused Python variables in torch/[_-a]* (#133492 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492 Approved by: https://github.com/albanD	2024-12-12 17:39:14 +00:00
PyTorch MergeBot	5c97ac9721	Revert "Remove unused Python variables in torch/[_-a]* (#133492 )" This reverts commit `fda975a7b3`. Reverted https://github.com/pytorch/pytorch/pull/133492 on behalf of https://github.com/clee2000 due to Sorry, I need to revert this in order to revert something else. The only thing you need to do is rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/133492#issuecomment-2536635516))	2024-12-11 17:29:12 +00:00
Tom Ritchford	fda975a7b3	Remove unused Python variables in torch/[_-a]* (#133492 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492 Approved by: https://github.com/albanD	2024-12-10 21:48:44 +00:00
Boyuan Feng	17fd53d8e5	[Inductor] Inplacing with Donated Buffer (#140113 ) Currently, inductor does not inplace update a buffer if it is an input buffer. Because we don't know if an input will be used by other functions. Donated buffer provides additional information that an input buffer will not be used by other functions. So we can inplace update donated buffer when possible. [Dashboard](https://hud.pytorch.org/benchmark/torchbench/inductor_dynamic?dashboard=torchinductor&startTime=Mon,%2011%20Nov%202024%2018:14:36%20GMT&stopTime=Mon,%2018%20Nov%202024%2018:14:36%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=bf/donated-buffer-inplace&lCommit=5df0769c00e6f9000caeb10fd5cbf0b165f69c2a&rBranch=main&rCommit=2b39a8db7741b816b03677a9c6fec1af05640dee) ![image](https://github.com/user-attachments/assets/f19d961f-7973-418e-9de8-5c2a97950478) ![image](https://github.com/user-attachments/assets/df3bd6a9-58b8-4e8a-8397-9e3b1de9adfe) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140113 Approved by: https://github.com/eellison	2024-11-27 18:51:52 +00:00
PyTorch MergeBot	65dbd5cc2d	Revert "[Inductor] Inplacing with Donated Buffer (#140113 )" This reverts commit `eecc8e362c`. Reverted https://github.com/pytorch/pytorch/pull/140113 on behalf of https://github.com/BoyuanFeng due to break test_donated_buffer_inplace internally since donated_buffer = False if is_fbcode() else True ([comment](https://github.com/pytorch/pytorch/pull/140113#issuecomment-2501954300))	2024-11-26 21:20:59 +00:00
Boyuan Feng	eecc8e362c	[Inductor] Inplacing with Donated Buffer (#140113 ) Currently, inductor does not inplace update a buffer if it is an input buffer. Because we don't know if an input will be used by other functions. Donated buffer provides additional information that an input buffer will not be used by other functions. So we can inplace update donated buffer when possible. [Dashboard](https://hud.pytorch.org/benchmark/torchbench/inductor_dynamic?dashboard=torchinductor&startTime=Mon,%2011%20Nov%202024%2018:14:36%20GMT&stopTime=Mon,%2018%20Nov%202024%2018:14:36%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=bf/donated-buffer-inplace&lCommit=5df0769c00e6f9000caeb10fd5cbf0b165f69c2a&rBranch=main&rCommit=2b39a8db7741b816b03677a9c6fec1af05640dee) ![image](https://github.com/user-attachments/assets/f19d961f-7973-418e-9de8-5c2a97950478) ![image](https://github.com/user-attachments/assets/df3bd6a9-58b8-4e8a-8397-9e3b1de9adfe) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140113 Approved by: https://github.com/eellison	2024-11-26 17:19:50 +00:00
Aaron Gokaslan	12e95aa4ee	[BE]: Apply PERF401 autofixes from ruff (#140980 ) * Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables. * list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize. * Manually went back and made mypy happy after the change. * Also fixed style lints in files covered by flake8 but not by pyfmt Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980 Approved by: https://github.com/justinchuby, https://github.com/malfet	2024-11-20 17:52:07 +00:00
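For illustration, the kind of rewrite that ruff's PERF401 rule suggests looks roughly like the following. The function names are made up for this sketch and do not come from the PyTorch code base.

```python
def even_squares_loop(values):
    # Pattern flagged by PERF401: building a list with append inside a for loop.
    result = []
    for v in values:
        if v % 2 == 0:
            result.append(v * v)
    return result


def even_squares_comprehension(values):
    # Equivalent list comprehension; the loop variable stays local to it.
    return [v * v for v in values if v % 2 == 0]
```

Both functions return the same list for the same input.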
Boyuan Feng	e2e425b4f3	[CUDAGraph] Add dynamo timer to checkpoint, warmup, and record (#139818 ) Summary: Add time log to cudagraph, including `create deferred_cudagraphify wrapper`, `warmup`, `record`, and `checkpoint`. Test Plan: 1. buck2 run fbcode//mode/opt //pytorch/benchmark:run -- resnet50 -d cuda -t train --inductor --pt2-triton-cudagraph 2. Found the result in [scuba table](https://fburl.com/scuba/pt2_compile_events/0oik8nu9). {F1954034920} Differential Revision: D65505659 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139818 Approved by: https://github.com/eellison	2024-11-09 05:27:11 +00:00
Aaron Orenstein	07cc4bd3e2	typing compile_fx.py (#138033 ) Type annotations for compile_fx. - Some of the stuff here is pretty complicated (functions which return functions that take functions) so I bailed on those and used `Any` just to get the rest landed. - There are also changes to type signatures in other files which I did just to let mypy know more about the types in compile_fx.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138033 Approved by: https://github.com/Skylion007	2024-10-21 18:14:59 +00:00
eellison	8893881867	Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264 ) Fixes #104435 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264 Approved by: https://github.com/ezyang Co-authored-by: eellison <elias.ellison@gmail.com>	2024-10-09 00:05:52 +00:00
Edward Z. Yang	175485097a	[EASY] Typofix (#135022 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135022 Approved by: https://github.com/albanD	2024-09-04 01:59:40 +00:00
Xuehai Pan	758a0a88a2	[BE][Easy] enable `ruff` rule `PIE790`: unnecessary `pass` statement (#133200 ) This PR removes unnecessary `pass` statements. This is semantically safe because the bytecode for the Python code does not change. Note that if there is a docstring in the function, an empty function does not need a `pass` statement as a placeholder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133200 Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/kit1980	2024-08-15 15:50:19 +00:00
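A small sketch of the case PIE790 targets: a function whose body is a docstring followed by a redundant `pass`. The function names and the helper `same_bytecode` are hypothetical. The helper compares the raw bytecode of two functions, and its result may depend on the interpreter version, so no expected value is claimed here.

```python
def with_pass():
    """Placeholder hook."""
    pass  # redundant: the docstring already forms a valid function body


def without_pass():
    """Placeholder hook."""


def same_bytecode(f, g) -> bool:
    # Compare the compiled instruction bytes of the two function bodies.
    return f.__code__.co_code == g.__code__.co_code
```

Both functions return `None` and carry the same docstring. Calling `same_bytecode(with_pass, without_pass)` is one way to check the commit's bytecode claim on a given interpreter.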
Boyuan Feng	6a348e5e57	[CUDAGraph] Warn once if too many distinct sizes (#132832 ) Warn once if there are too many distinct sizes for cudagraph, so we can avoid spamming logs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132832 Approved by: https://github.com/eellison	2024-08-07 19:48:06 +00:00
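One common way to get warn-once behavior in Python is to memoize the warning function on its message. This is a sketch under that assumption; the logger name and `warn_once` helper are illustrative and not necessarily what the PR uses.

```python
import functools
import logging

log = logging.getLogger("cudagraph_sketch")  # hypothetical logger name


@functools.lru_cache(maxsize=None)
def warn_once(message: str) -> None:
    # lru_cache remembers each distinct message, so the body (and therefore
    # the log call) runs only the first time a given message is seen.
    log.warning(message)
```

Repeated calls with the same message hit the cache and do not log again; a different message still logs once.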
Oguz Ulgen	09f9c256ad	Add basic mypy annotations to inductor (#132416 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416 Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu ghstack dependencies: #132415	2024-08-04 18:43:37 +00:00
PyTorch MergeBot	f2ddd5e9e0	Revert "Add basic mypy annotations to inductor (#132416 )" This reverts commit `78927d37f6`. Reverted https://github.com/pytorch/pytorch/pull/132416 on behalf of https://github.com/ZainRizvi due to Sorry, this PR has entered a weird state in the diff train. Trying to revert it to skip it, and then we can try relanding it ([comment](https://github.com/pytorch/pytorch/pull/132415#issuecomment-2267631785))	2024-08-04 18:39:29 +00:00
Oguz Ulgen	78927d37f6	Add basic mypy annotations to inductor (#132416 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416 Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu ghstack dependencies: #132415	2024-08-01 20:14:25 +00:00
James Wu	f9e4d05c15	Save and run post compilation steps within FXGraphCache (#130572 ) This PR mostly refactors by putting code into utils files so that they can be shared between codecache.py and compile_fx.py. Afterwards, it then changes compile_fx so that: - When saving to FXGraphCache, we save onto the CompiledFXGraph all the necessary metadata for running post compile steps (realigning inputs, cudagraphification). - When loading from FXGraphCache, we use the saved information directly, instead of calculating them from scratch. What this does is make it so that `FXGraphCache.load()` is a perfect cache on compile_fx_inner, in that it returns exactly what compile_fx_inner returns. This also makes it possible for AOTAutogradCache, given a key to the fx graph cache and example inputs, to get back the full return value of compile_fx_inner. ## What's a post compile step? We define a post-compile to be the set of actions that need to run after FXGraphCache either loads from the cache or misses and runs compilation. These steps include: - Setting the tracing context's output strides - Running cudagraphs if enabled - Maybe realign inputs if cudagraphs didn't run To run these steps, we save all the necessary metadata in CompiledFxGraph, and use them on a cache hit to reconstruct the object. ## Splitting cudagraphs work into pre/post compile Cudagraphs does a lot of work on the input graph module to determine if cudagraphs can be enabled. This is the code that involves cudagraph_tests and stack traces. This will work in a world where we have access to the input graph module, but with AOTAutograd warm start, we won't have access to that information anymore. Therefore we can split cudagraphs work into two parts: on a cache miss (and therefore a full compile), we do the cudagraphs testing work, and save cudagraph_fail_reasons into the cache. Then on a cache hit, we know whether or not we can run cudagraphs, and if we can't, we can emit the correct error messages. 
Implementation notes: - We save `fx_kwargs` directly onto the CompiledFXGraph. `fx_kwargs` is already, by definition, part of the cache key, so this is safe to do when it comes to cache correctness. - ^ Why do we do above even though FXGraphCache.load takes fx_kwargs as an argument? Because AOTAutogradCache doesn't have access to fx_kwargs: they're annoyingly encoded in the functools.partial() of the fw_compiler, so only inductor knows about these options. They're fully captured by the AOTAutogradCache key (since every key to fx_kwargs is either a global config, or a field that's deterministic based on an input graph module), but their values are still needed to run cudagraphs/postprocessing. Therefore, it's easier/safer to store it on the cached result. - Willing to hear other approaches here if we think saving these extra fields is not reasonable, though I can't think of another way to do this that's less complicated to explain. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130572 Approved by: https://github.com/eellison	2024-07-31 18:32:40 +00:00
Boyuan Feng	aebfd3d4de	[CUDAGraph] skip cudagraph if too many distinct sizes (#131387 ) Current implementation records a new cudagraph for every distinct input size. This leads to significant overhead if there are too many distinct input sizes. While we currently hint re-recording cudagraph from dynamic shapes, it is at [info level](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/cudagraph_trees.py#L363-L366) which is easy to overlook and leads to several issues, such as Issue #119640 and Issue #128424. This PR checks the number of cudagraph due to dynamic shapes and warns loudly if #cudagraph exceeds a threshold `cudagraph_dynamic_shape_limit`(=50). Fixes #119640 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131387 Approved by: https://github.com/eellison	2024-07-26 06:17:35 +00:00
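A rough sketch of limiting re-recording by counting distinct input sizes. The class `DistinctSizeTracker` and its method are hypothetical, and the exact comparison against the limit is an assumption. Only the default of 50 comes from the commit message (`cudagraph_dynamic_shape_limit`).

```python
from typing import Set, Tuple


class DistinctSizeTracker:
    def __init__(self, limit: int = 50) -> None:
        self.limit = limit
        self.seen: Set[Tuple[int, ...]] = set()

    def should_record(self, shape: Tuple[int, ...]) -> bool:
        """Return True if a graph for this shape exists or may still be recorded."""
        if shape in self.seen:
            return True
        if len(self.seen) >= self.limit:
            # Too many distinct sizes already: skip recording a new graph.
            return False
        self.seen.add(shape)
        return True
```

Shapes seen before the limit was reached keep returning `True`, so only new sizes fall back once the limit is hit.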
Boyuan Feng	16d7cb5049	[CUDAGraph] Type annotation for cudagraph_trees.py (#131621 ) As a Better Engineering effort, this PR adds type annotations to `cudagraph_trees.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131621 Approved by: https://github.com/eellison	2024-07-26 06:14:06 +00:00


135 Commits