pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	99dac7005f	Revert "[Inductor] FX backend via Wrapper IR (#146942 )" This reverts commit `a7691140a0`. Reverted https://github.com/pytorch/pytorch/pull/146942 on behalf of https://github.com/malfet due to Looks like it indeed breaks lint, see `a7691140a0/1` ([comment](https://github.com/pytorch/pytorch/pull/146942#issuecomment-2852192778))	2025-05-05 20:01:29 +00:00
Blaine Burton Rister	a7691140a0	[Inductor] FX backend via Wrapper IR (#146942 ) # Sub-PRs These PRs contain refactors from the main one. They should be reviewed and merged first. - https://github.com/pytorch/pytorch/pull/150458 - https://github.com/pytorch/pytorch/pull/152391 - https://github.com/pytorch/pytorch/pull/152587 # Feature The goals of this PR are twofold. ## Goal 1: Introduce Wrapper IR as an intermediate step in wrapper codegen. In addition to Triton/C++/Halide kernels, Inductor also generates "wrapper" code which allocates memory and calls the kernels. Originally, this wrapper code was fairly standard Python which resembled a user-written PyTorch program. Over time, various wrapper code generators have been added to accommodate things like AOTInductor, which prefers C++ code for static compilation. This complexity has bled into other parts of the codebase, as we now need if/else statements to choose between Python and C++ macros. (See an example [here](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py#L5515-L5522).) Since most of these code generation steps are conceptually identical across target languages, it seems reasonable to refactor them into some kind of intermediate representation which can be shared between the various backends. This might also make it easier to develop out-of-tree backends which cannot put their own macros in core Inductor components. This PR takes some initial steps to formalize Inductor's wrapper codegen by generalizing the existing Memory Planning IR into a fully fledged Wrapper IR. This is pretty much identical to the existing Memory Planning IR, but it supports a richer set of ops for things like kernel definitions and calls. This refactor could help encapsulate wrapper codegen. 
Ideally, we don't need to worry about direct Python/C++ codegen in the main compiler files such as `ir.py`, and can instead defer these to classes like `PythonWrapperCodegen` and `CppWrapperCpu`, which operate on the Wrapper IR. ## Goal 2: Convert Wrapper IR into FX IR. One of the main benefits of Wrapper IR is to enable more diverse Inductor backends. This PR introduces a converter from Wrapper IR into [FX IR](https://pytorch.org/docs/stable/fx.html), which is the intermediate representation most commonly used in PyTorch graph compilers. The purpose of this is to enable out-of-tree backends to consume Inductor's output in FX IR, which would hopefully make Inductor easier to leverage in novel compilers, hardware accelerators, etc. It's not trivial to generate Python or C++ code which Inductor can compile and run, and doing so may require changes to other core Inductor files, for the reasons outlined in the previous section. The goal of supporting FX output is to enable something like `torch.compile`'s [custom backend](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html) system, in which an out-of-tree backend can receive an optimized FX graph from Inductor, and compile and run it however it likes. The typical users of this feature would likely not be part of PyTorch, and may or may not support running a kernel in eager mode. However, they can understand what `torch.empty_strided` means, compile and run Triton kernels, etc. So we just need to present them with an FX graph saying what code Inductor wants to run, which should be easier to analyze and transform in a third party system than Python or C++ source. Since FX IR is fairly stable, this mechanism should hopefully isolate third-party backends, hardware accelerators, etc. from the implementation details of Inductor, and vice versa. # Current status Things that seem to work: - Converted a lot of the most common Python codegen lines to Wrapper IR lines. 
- Handled the following cases, in addition to what was already in the Memory Planning IR: - Comments - Triton kernels - Extern/fallback kernels - Freeing tensors (`del buf0`) - MultiOutput - Graph outputs - ReinterpretView / StorageBox, for both call args and outputs. - FX conversion asserts that the program only contains Wrapper IR lines, and not strings of Python/C++ code. - Prototype FX converter which can handle some of the most common use cases. - Defining Triton kernels, and putting them in a side table using TorchDynamo's existing [utilities](https://dev-discuss.pytorch.org/t/higher-order-operators-2023-10/1565). - Calling wrapped Triton kernels. - Calling extern kernels and certain types of fallback kernels. - Support both `extern_kernels.` and `aten.`. - Support multi-output kernels like `torch.topk`. - Graphs with multiple inputs/outputs. - Training i.e. calling `Tensor.backward()` in a compiled function. - Graph breaks (training). - Run the `torch.fx.GraphModule` on GPU using the standard `__call__` method. This makes it easy to test the correctness of FX codegen. Things that don't work: - Both Wrapper IR and Wrapper -> FX coverage are currently best effort. There are still features which aren't captured as Wrapper IR lines, and fall back to plain strings. This representation is functionally correct but probably not rich enough to achieve the goals outlined in the previous sections. - Fallback kernels seem like the most difficult thing to fully cover, since they each define their own Python/C++ macros that would need to be converted to FX. - Size/alignment asserts are currently disabled via the config file. It's possible to generate FX IR for these, but it seems reasonable to defer these sanity checks to a later PR. - CommBuffer's and distributed communication are not yet supported. An earlier version of this PR attempted to implement this by calling `empty_strided_p2p`. 
However, building and testing distributed support seems non-trivial, so it's probably better to defer this. # Out-of-tree compilers With this PR, out-of-tree backends will be able to do further compilation on the FX graphs by subclassing `WrapperFxCodegen` and overriding the `compile_graph` function. This follows the same API as torch.compile's [custom backends](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html), where the user simply returns a callable running the graph. The callable need not be a method of `GraphModule` or any other PyTorch class. See an example below. ``` from torch._inductor.codegen.wrapper_fxir import WrapperFxCodegen class MyCustomBackend(WrapperFxCodegen): def compile_graph(self, gm): # Add 1 to the graph's outputs def compiled_fn(args): return [x + 1 for x in gm.forward(*args)] return compiled_fn ``` # Example FX graphs This section contains some example FX graphs generated by Inductor. The correctness of these graphs was verified against eager mode by calling the corresponding `GraphModule`. Here's an FX graph calling a basic Triton kernel. Notice how outputs are allocated with `torch.empty_strided`, and the Triton kernel is called by reference to Dynamo's triton side table. ``` graph(): %arg0_1 : [num_users=1] = placeholder[target=arg0_1] %arg1_1 : [num_users=1] = placeholder[target=arg1_1] %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((8,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0}) %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, in_ptr1: %arg0_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}}) return (buf0,) ``` Here's a more complicated graph that calls a `torch.addmm` extern kernel. 
``` graph(): %arg0_1 : [num_users=1] = placeholder[target=arg0_1] %arg1_1 : [num_users=2] = placeholder[target=arg1_1] %buf0 : [num_users=3] = call_function[target=torch.empty_strided](args = ((), ()), kwargs = {dtype: torch.float32, device: cuda:0}) %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(1,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, out_ptr0: %buf0, xnumel: 1, r0_numel: 129, XBLOCK: 1}}) %buf2 : [num_users=2] = call_function[target=torch.empty_strided](args = ((129, 1), (1, 1)), kwargs = {dtype: torch.float32, device: cuda:0}) %addmm : [num_users=0] = call_function[target=torch.addmm](args = (%buf0, %arg0_1, %arg1_1), kwargs = {alpha: 1, beta: 1, out: %buf2}) %delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {}) return (buf2,) ``` Here's a graph which indexes into a tuple using `operator.getitem`. This is necessary to use the output of the `torch.topk` operation. 
``` graph(): %arg0_1 : [num_users=1] = placeholder[target=arg0_1] %buf0 : [num_users=3] = call_function[target=torch.ops.aten.topk.default](args = (%arg0_1, 2), kwargs = {}) %buf1 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 0), kwargs = {}) %buf2 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 1), kwargs = {}) %delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {}) %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf1, xnumel: 2, XBLOCK: 2}}) %triton_kernel_wrapper_mutation_1 : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 1, constant_args_idx: 1, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf2, xnumel: 2, XBLOCK: 2}}) return (buf1, buf2) ``` Here's a graph that reinterprets an output tensor using `torch.as_strided`. This is one way to handle Inductor's `ReinterpretView` op. ``` graph(): %arg0_1 : [num_users=1] = placeholder[target=arg0_1] %arg1_1 : [num_users=1] = placeholder[target=arg1_1] %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((2, 4), (4, 1)), kwargs = {dtype: torch.float32, device: cuda:0}) %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg0_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}}) %buf0_view_buf0_0 : [num_users=1] = call_function[target=torch.as_strided](args = (%buf0, (8,), (1,), 0), kwargs = {}) return (buf0_view_buf0_0,) ``` Here's a graph with dynamic shapes. This one is a little bit funky. 
Inductor provides a graph input for each shape symbol, which we map to a placeholder, in this example `s6`. Then, shape expressions in the generated code can refer to the symbol `s6`. The size hint for `s6` is stored in `node.meta["val"]` where `node` is the placeholder defining it. This works out in the generated Python code because the placeholder defines a Python variable with the name `s6`. ``` graph(): %s6 : [num_users=0] = placeholder[target=s6] %arg1_1 : [num_users=1] = placeholder[target=arg1_1] %arg2_1 : [num_users=1] = placeholder[target=arg2_1] %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((s6,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0}) %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((-s6)//8)), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s6, XBLOCK: 8}}) return buf0 ``` Here's another graph, this time with dynamic shapes and strides. The grid expression is more complex since the numel is a product of dimensions. ``` graph(): %s10 : [num_users=0] = placeholder[target=s10] %arg1_1 : [num_users=1] = placeholder[target=arg1_1] %arg2_1 : [num_users=1] = placeholder[target=arg2_1] %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ([s10, s10], [s10, 1]), kwargs = {dtype: torch.float32, device: cuda:0}) %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((s10**2)//(-64))), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s10**2, XBLOCK: 64}}) return buf0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146942 Approved by: https://github.com/jansel	2025-05-05 19:34:49 +00:00
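The grid expressions in the dynamic-shape graphs above, such as `-((-s6)//8)`, are ceiling divisions written with Python floor division: they count how many blocks of `XBLOCK` elements are needed to cover `xnumel` elements. The sketch below checks that identity in plain Python. The helper names `ceil_div` and `ceil_div_neg_divisor` are illustrative only and are not Inductor functions.

```python
import math


def ceil_div(numel: int, block: int) -> int:
    """Ceiling division by negating the dividend, as in -((-s6)//8)."""
    return -((-numel) // block)


def ceil_div_neg_divisor(numel: int, block: int) -> int:
    """Same result, negating the divisor instead of the dividend."""
    return -(numel // (-block))


# Both forms agree with math.ceil for positive sizes.
for numel in (1, 8, 9, 64, 65, 1000):
    assert ceil_div(numel, 8) == math.ceil(numel / 8)
    assert ceil_div_neg_divisor(numel, 64) == math.ceil(numel / 64)
```

Because the expression uses only integer arithmetic on the shape symbol, it can be evaluated once the placeholder for that symbol has a concrete value.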
PaulZhang12	84aa0985fb	[Inductor] Add decomposeK as an autotuning choice for mm (#150654 ) As a result of adding subgraph as a choice to inductor https://github.com/pytorch/pytorch/pull/149761 and enabling FP32 output from PyTorch GEMMs from FP16/BF16 inputs: https://github.com/pytorch/pytorch/pull/150812, this PR enables decompose_k as an autotuning choice for Inductor in generating the fastest matmuls with Triton. DecomposeK is currently only enabled for `torch.compile`. Followups: * decompose_k does not currently support epilogue fusion, which will take some work to enable * Enable autotuning the bmm with Triton Templates as well without requiring tons of more compile time, async compilation. Anecdotal evidence shows that Triton BMM performs better usually than aten BMM * Add for addmm * Enable for Inference and AOTI Below are the results of running TritonBench for Split-K shapes, comparing the aten performance versus pt2_triton, which now autotunes on decompose_k, seeing >10% speedup compared to aten on average, and for some shapes over 3x the performance of the best Triton mm previously: <img width="929" alt="Screenshot 2025-04-28 at 9 15 39 PM" src="https://github.com/user-attachments/assets/27d85bbc-4f3a-43a6-a8fa-d4a5bbb8c999" /> TorchInductor Benchmark Dashboard: <img width="1727" alt="Screenshot 2025-04-30 at 2 02 53 PM" src="https://github.com/user-attachments/assets/4acd7ffc-407f-4cfd-98bb-2e3d8b1f00b3" /> We see speedups across all runs for training. Compile time increased as expected, with more `mm` options to tune over. Differential Revision: [D73820115](https://our.internmc.facebook.com/intern/diff/D73820115) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150654 Approved by: https://github.com/eellison	2025-05-03 02:23:54 +00:00
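As a rough illustration of the idea behind a decompose-K style matmul (an assumption about the technique, not a description of the Inductor implementation), the K dimension can be split into chunks, each chunk multiplied independently, and the partial products summed. The sketch uses nested Python lists so it runs without PyTorch; `matmul` and `decompose_k_matmul` are hypothetical names.

```python
def matmul(a, b):
    """Plain (M, K) x (K, N) matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def decompose_k_matmul(a, b, k_splits):
    """Split K into k_splits chunks, multiply each chunk, then sum the partials."""
    k = len(b)
    assert k % k_splits == 0, "this sketch assumes K is divisible by k_splits"
    chunk = k // k_splits
    # One small matmul per K-chunk; conceptually a batched matmul over the chunks.
    partials = [
        matmul([row[s:s + chunk] for row in a], b[s:s + chunk])
        for s in range(0, k, chunk)
    ]
    m, n = len(a), len(b[0])
    # Reduce over the chunk dimension.
    return [[sum(p[i][j] for p in partials) for j in range(n)] for i in range(m)]


a = [[1, 2, 3, 4], [5, 6, 7, 8]]      # M=2, K=4
b = [[1, 0], [0, 1], [2, 0], [0, 2]]  # K=4, N=2
assert decompose_k_matmul(a, b, k_splits=2) == matmul(a, b)
```

Splitting this way exposes extra parallelism when K is much larger than M and N, which is the kind of shape the Split-K benchmarks in the message target.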
PyTorch MergeBot	7c3e679ddd	Revert "[Inductor] Add decomposeK as an autotuning choice for mm (#150654 )" This reverts commit `fdcfc6a61a`. Reverted https://github.com/pytorch/pytorch/pull/150654 on behalf of https://github.com/wdvr due to Failing ROCM tests: inductor/test_subgraph_choice.py::TestSubgraphChoice::test_subgraph_decompose_k [GH job link](https://github.com/pytorch/pytorch/actions/runs/14786111108/job/41515742446) [HUD commit link](`3c54e0c216`) ([comment](https://github.com/pytorch/pytorch/pull/150654#issuecomment-2846470409))	2025-05-02 06:31:38 +00:00
PaulZhang12	fdcfc6a61a	[Inductor] Add decomposeK as an autotuning choice for mm (#150654 ) As a result of adding subgraph as a choice to inductor https://github.com/pytorch/pytorch/pull/149761 and enabling FP32 output from PyTorch GEMMs from FP16/BF16 inputs: https://github.com/pytorch/pytorch/pull/150812, this PR enables decompose_k as an autotuning choice for Inductor in generating the fastest matmuls with Triton. DecomposeK is currently only enabled for `torch.compile`. Followups: * decompose_k does not currently support epilogue fusion, which will take some work to enable * Enable autotuning the bmm with Triton Templates as well without requiring tons of more compile time, async compilation. Anecdotal evidence shows that Triton BMM performs better usually than aten BMM * Add for addmm * Enable for Inference and AOTI Below are the results of running TritonBench for Split-K shapes, comparing the aten performance versus pt2_triton, which now autotunes on decompose_k, seeing >10% speedup compared to aten on average, and for some shapes over 3x the performance of the best Triton mm previously: <img width="929" alt="Screenshot 2025-04-28 at 9 15 39 PM" src="https://github.com/user-attachments/assets/27d85bbc-4f3a-43a6-a8fa-d4a5bbb8c999" /> TorchInductor Benchmark Dashboard: <img width="1727" alt="Screenshot 2025-04-30 at 2 02 53 PM" src="https://github.com/user-attachments/assets/4acd7ffc-407f-4cfd-98bb-2e3d8b1f00b3" /> We see speedups across all runs for training. Compile time increased as expected, with more `mm` options to tune over. Differential Revision: [D73820115](https://our.internmc.facebook.com/intern/diff/D73820115) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150654 Approved by: https://github.com/eellison	2025-05-01 23:01:30 +00:00
Michael Lazos	a1f6d85b36	[Cutlass] Fixes for e2e compilation in arg rendering (#151405 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/151405 Approved by: https://github.com/eellison ghstack dependencies: #152305, #152306, #150905	2025-04-29 23:06:01 +00:00
Aaron Orenstein	c8b4a39d73	Add precedence to the infix printing done by sympy_str. (#151920 ) Add precedence to the infix printing done by sympy_str. Without this change sympy_str will print the same string for both `a+b*(c+d)` and `(a+b)*(c+d)`. While there I also cleaned up the printing for `-a` and `a - b`. Added some tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151920 Approved by: https://github.com/jansel	2025-04-29 00:58:58 +00:00
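A printer that ignores operator precedence cannot distinguish a sum containing a product from a product of sums. The following is a minimal, self-contained sketch of precedence-aware infix printing. It is not the `sympy_str` implementation; the `Sym`, `BinOp`, and `to_str` names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Sym:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str  # "+" or "*"
    lhs: "Expr"
    rhs: "Expr"


Expr = Union[Sym, BinOp]

# Higher number binds tighter.
PRECEDENCE = {"+": 1, "*": 2}


def to_str(expr: Expr, parent_prec: int = 0) -> str:
    """Print infix, adding parentheses only when a child binds looser than its parent."""
    if isinstance(expr, Sym):
        return expr.name
    prec = PRECEDENCE[expr.op]
    text = f"{to_str(expr.lhs, prec)}{expr.op}{to_str(expr.rhs, prec)}"
    return f"({text})" if prec < parent_prec else text


a, b, c, d = (Sym(n) for n in "abcd")
first = BinOp("+", a, BinOp("*", b, BinOp("+", c, d)))
second = BinOp("*", BinOp("+", a, b), BinOp("+", c, d))
print(to_str(first))   # a+b*(c+d)
print(to_str(second))  # (a+b)*(c+d)
```

Passing the parent's precedence down the recursion is what lets the two trees print differently without parenthesizing every subexpression.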
Anthony Shoumikhin	e2f9759bd0	Fix broken URLs (#152237 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237 Approved by: https://github.com/huydhn, https://github.com/malfet	2025-04-27 09:56:42 +00:00
PyTorch MergeBot	72f711e200	Revert "[inductor] Change minimum number of SMs to 60 to let Ada use Triton GEMM backend (#150888 )" This reverts commit `8d81806211`. Reverted https://github.com/pytorch/pytorch/pull/150888 on behalf of https://github.com/henrylhtsang due to Revert because this change isn't needed ([comment](https://github.com/pytorch/pytorch/pull/150888#issuecomment-2822768377))	2025-04-23 00:26:49 +00:00
Rachel Guo	c729f7dbee	[provenance_tracking][reland] Fix UT error and re-land `ExternKernel` support (#151709 ) Summary: ATT. reverted previous diff : D72572050 Test Plan: ``` TORCH_LOGS="+inductor, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_to_post_grad_tracing_extern_kernel ``` Differential Revision: D73281217 Pull Request resolved: https://github.com/pytorch/pytorch/pull/151709 Approved by: https://github.com/jingsh	2025-04-22 15:44:56 +00:00
rzou	29317f8585	[standalone_compile] Some misc fixes (#151502 ) This PR fixes two things. The first problem is that in the vLLM style standalone_compile is called from within a custom torch.compile backend. If there already is a FakeTensorMode (which there is), we shouldn't create a new FakeTensorMode with the same shape_env, instead we should just reuse the same FakeTensorMode. The second thing is that compile_fx can mutate the passed in gm, so we deepcopy (since standalone_compile should be standalone) Test Plan: - new test - updated old tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/151502 Approved by: https://github.com/oulgen ghstack dependencies: #151501, #151551	2025-04-18 12:34:13 +00:00
eellison	6d46b530fc	Remove libdevice ops in inductor (#151562 ) Now that we track dtypes during codegen, we can delete all these extra ops that worked around the problem by doing dispatch at lowering time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151562 Approved by: https://github.com/isuruf, https://github.com/jansel	2025-04-17 22:18:00 +00:00
Chong Gu	a05cc9f494	Remove Clear Cache Time from do_bench_using_profiling (#150696 ) Summary: In most instances, this action would take ~33% of the total run time, which means that our benchmark would previously differ from the end results by a lot. Test Plan: We can compare the benchmark results for ``` CUDA_VISIBLE_DEVICES=4,5 buck run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100a //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-snapshot-id=672308665_0 --lower-backend=AOT_INDUCTOR --node-replacement-dict="{'torch.nn.Linear':{'(autotune)': 'fp8_float_model_dynamic_quantization_rowwise'}}" --trace-aot-inductor-module=True --disable-acc-tracer=False --batch-size=1024 ``` before and after the diff, and notice that on average, the benchmark results decrease by ~0.1ms per iteration, which is more closely aligned with the lowered modules. Differential Revision: D72469845 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150696 Approved by: https://github.com/frank-wei	2025-04-17 07:25:41 +00:00
henrylhtsang	532025fbd0	[cutlass backend][ez] Ban FP32 output dtype from using CUTLASS GEMM backend (#151279 ) FP32 not supported: https://github.com/pytorch/pytorch/issues/145952 Pull Request resolved: https://github.com/pytorch/pytorch/pull/151279 Approved by: https://github.com/ColinPeppler	2025-04-16 01:12:18 +00:00
Oguz Ulgen	3cf0e2d8ec	Add inductor standalone_compile API (#150670 ) This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution. ``` standalone_compile(gm, example_inputs, options) -> CompiledArtifact CompiledArtifact.save(path, format: binary\|unpacked = binary) CompiledArtifact.load(path, format: binary\|unpacked = binary) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670 Approved by: https://github.com/jamesjwu, https://github.com/zou3519	2025-04-15 23:38:15 +00:00
PyTorch MergeBot	74f6bc28a7	Revert "Add inductor standalone_compile API (#150670 )" This reverts commit `c9aef50898`. Reverted https://github.com/pytorch/pytorch/pull/150670 on behalf of https://github.com/Camyll due to breaking internal builds with torch module not found error ([comment](https://github.com/pytorch/pytorch/pull/150670#issuecomment-2806975267))	2025-04-15 17:35:59 +00:00
Oguz Ulgen	c9aef50898	Add inductor standalone_compile API (#150670 ) This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution. ``` standalone_compile(gm, example_inputs, options) -> CompiledArtifact CompiledArtifact.save(path, format: binary\|unpacked = binary) CompiledArtifact.load(path, format: binary\|unpacked = binary) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670 Approved by: https://github.com/jamesjwu, https://github.com/zou3519	2025-04-14 22:00:09 +00:00
PyTorch MergeBot	24b3ab9255	Revert "Add inductor standalone_compile API (#150670 )" This reverts commit `bbc5fe8504`. Reverted https://github.com/pytorch/pytorch/pull/150670 on behalf of https://github.com/albanD due to Broke profiler test ([comment](https://github.com/pytorch/pytorch/pull/150670#issuecomment-2802067144))	2025-04-14 15:22:33 +00:00
Oguz Ulgen	bbc5fe8504	Add inductor standalone_compile API (#150670 ) This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution. ``` standalone_compile(gm, example_inputs, options) -> CompiledArtifact CompiledArtifact.save(path, format: binary\|unpacked = binary) CompiledArtifact.load(path, format: binary\|unpacked = binary) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670 Approved by: https://github.com/jamesjwu, https://github.com/zou3519	2025-04-14 07:07:10 +00:00
Thomas Adams	8494d5582a	Propagate callable parameter types using ParamSpec (#142306 ) (#151014 ) Partially addresses #142306 Pull Request resolved: https://github.com/pytorch/pytorch/pull/151014 Approved by: https://github.com/Skylion007	2025-04-13 20:38:11 +00:00
Michael Lazos	fe961679d5	[Inductor] add support for disabling atomic adds (#151033 ) As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/151033 Approved by: https://github.com/eellison, https://github.com/shunting314	2025-04-11 18:41:56 +00:00
henrylhtsang	8d81806211	[inductor] Change minimum number of SMs to 60 to let Ada use Triton GEMM backend (#150888 ) context: https://github.com/pytorch/pytorch/issues/150390#issuecomment-2790272814 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150888 Approved by: https://github.com/jansel	2025-04-10 22:10:55 +00:00
PyTorch MergeBot	e786b3bf54	Revert "[inductor] Change minimum number of SMs to 60 to let Ada use Triton GEMM backend (#150888 )" This reverts commit `115a165f9b`. Reverted https://github.com/pytorch/pytorch/pull/150888 on behalf of https://github.com/malfet due to This indeed broke all those inductor tests ([comment](https://github.com/pytorch/pytorch/pull/150888#issuecomment-2795231901))	2025-04-10 21:46:23 +00:00
henrylhtsang	115a165f9b	[inductor] Change minimum number of SMs to 60 to let Ada use Triton GEMM backend (#150888 ) context: https://github.com/pytorch/pytorch/issues/150390#issuecomment-2790272814 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150888 Approved by: https://github.com/jansel	2025-04-10 19:46:35 +00:00
PaulZhang12	e62d958f02	[Inductor] Reland Merge Triton ScaledMM as epilogue to MM template #150045 (#150441 ) Merges https://github.com/pytorch/pytorch/pull/150438 and https://github.com/pytorch/pytorch/pull/150045. https://github.com/pytorch/pytorch/pull/150045 was already landed, but did not include a change that makes it unable to land internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150441 Approved by: https://github.com/clee2000	2025-04-02 17:49:32 +00:00
PyTorch MergeBot	f04cf13bdd	Revert "Merge Triton ScaledMM as epilogue to MM template (#150045 )" This reverts commit `981048854d`. Reverted https://github.com/pytorch/pytorch/pull/150045 on behalf of https://github.com/PaulZhang12 due to Need to add PR 150415 fixes for internal merge ([comment](https://github.com/pytorch/pytorch/pull/150045#issuecomment-2770252452))	2025-04-01 17:54:28 +00:00
PaulZhang12	981048854d	Merge Triton ScaledMM as epilogue to MM template (#150045 ) Previously, scaled_mm's (FP8 matmul) Triton lowering for inductor was in a separate template. This PR consolidates that lowering into the mm template, with an added epilogue to deal with multiplying the scales. This paves the way for future scaled variants of BMM, Grouped GEMM in inductor. Currently, there is still a separate template for TMA+persistent version of scaled_mm. The current mm lowering has a separate template for TMA + Persistent version. Will hopefully consolidate the extra scaled_mm TMA+persistent template when the consolidation for the mm template is done. TODO: Consolidate TMA+Persistent logic into 1 template and remove separate scaled_mm TMA template Pull Request resolved: https://github.com/pytorch/pytorch/pull/150045 Approved by: https://github.com/drisspg	2025-03-31 23:20:14 +00:00
Sam Larsen	266bd22b44	Improve subproc autotuning implementation (#149700 ) Summary: The primary change is to update the autotune-in-a-subproc implementation to avoid using multiprocessing spawn. Spawn (re)executes the toplevel script in the subproc, which can be problematic. The approach here is similar to Triton parallel compile: we Popen a subproc on a controlled entry point and communicate over pipes. That change drove a lot of refactoring in the TuningProcess class, so I took the opportunity to simplify some things, rename some methods, etc. One other notable change is around the timeout / kill approach. After a timeout, we were previously attempting to stop the subproc in three steps (graceful shutdown, sigkill if graceful fails, sigterm if sigkill fails). I'd argue that's not useful: 1) The graceful shutdown is never going to work unless the subproc happens to have just completed its task and is ready to receive the next command. 2) If we're going to kill the subproc, let's just take the most aggressive approach and move on as quickly as possible to restarting it rather than waiting to see if previous shutdown attempts succeeded. The only downside that I can find is maybe a little log spew, e.g., `ResourceWarning: subprocess 2987680 is still running` List of changes: * Use Popen instead of spawn for the autotuning subprocess. * Introduced a new entry point `__autotune_main__.py` * Renamed some TuningProcess methods. For example `shutdown` makes more sense than `terminate` because the latter implies a forced kill. * Simplified the implementation around benchmarking timeout and how we kill the subproc after a timeout. * Deprecated the unused timeout configs in `_inductor/config.py` * Moved `get_ld_library_path` helper to a common utils file. * Added more unit tests for subproc crashes / timeouts / exceptions, etc. Test plan: * New unit tests * Also ran internally with all combinations of: build mode `opt` and `dev-nosan`, and `buck run` vs. 
executing the `.par` file directly. * Made sure the functionality to parallelize autotuning across different GPUs is working (it wasn't clear to me this was behaving the way we wanted it to). Differential Revision: [D71976971](https://our.internmc.facebook.com/intern/diff/D71976971) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149700 Approved by: https://github.com/aorenste, https://github.com/jansel, https://github.com/eellison	2025-03-28 01:06:39 +00:00
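The approach described in this commit (Popen a subprocess on a controlled entry point, talk to it over pipes, and kill it outright on timeout) can be sketched with the standard library alone. This is a simplified, one-shot illustration under stated assumptions, not the `TuningProcess` code: `WORKER_SOURCE` stands in for a dedicated entry-point script, and `run_in_subproc` is a hypothetical helper.

```python
import pickle
import subprocess
import sys

# Hypothetical stand-in for a controlled entry point: a tiny worker that reads
# one pickled request from stdin and writes one pickled reply to stdout.
WORKER_SOURCE = """
import pickle, sys, time
request = pickle.load(sys.stdin.buffer)
time.sleep(request["sleep"])
pickle.dump({"result": request["value"] * 2}, sys.stdout.buffer)
sys.stdout.buffer.flush()
"""


def run_in_subproc(value: int, sleep: float, timeout: float):
    """Run one job in a fresh subprocess; kill it immediately on timeout."""
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKER_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    payload = pickle.dumps({"value": value, "sleep": sleep})
    try:
        out, _ = proc.communicate(payload, timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()         # no graceful shutdown attempt
        proc.communicate()  # reap the process and close the pipes
        return None
    return pickle.loads(out)["result"]


print(run_in_subproc(21, sleep=0.0, timeout=30.0))  # 42
print(run_in_subproc(21, sleep=30.0, timeout=0.5))  # None
```

Unlike multiprocessing's spawn start method, the child here never re-imports the caller's top-level script; it only runs the code it is explicitly handed.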
Boyuan Feng	c830d750e6	[graph partition] support splitting on custom ops (#149782 ) This PR adds support for graph partition on custom ops. Land after #149458. ### API This PR provides a new API to register/unregister custom ops for graph partition. ```python def register_custom_op_support_cudagraph( operator: torch._library.custom_ops.CustomOpDef, is_cudagraphable: bool, ) -> None ``` Example usage: ```python from torch._inductor.utils import register_custom_op_support_cudagraph @torch.library.custom_op("mylib::movement", mutates_args=()) def movement(pic: torch.Tensor) -> torch.Tensor: img = pic.cpu() cropped_img = (img + 1) * 2 return cropped_img.cuda() / 255.0 @movement.register_fake def _(pic): return torch.empty_like(pic) register_custom_op_support_cudagraph(movement, is_cudagraphable=False) ``` ### Example In this example, 1 torch-compiled region has 3 cudagraphs after splitting on 2 custom ops. ![image](https://github.com/user-attachments/assets/6d07355b-6690-4cde-89ef-e4aff6b0079c) Code to repro: ```python import torch from torch._inductor.utils import register_custom_op_support_cudagraph torch._inductor.config.graph_partition = True @torch.library.custom_op("mylib::movement", mutates_args=()) def movement(pic: torch.Tensor) -> torch.Tensor: img = pic.cpu() cropped_img = (img + 1) * 2 return cropped_img.cuda() / 255. 
@movement.register_fake def _(pic): return torch.empty_like(pic) @torch.library.custom_op("mylib::modify", mutates_args=()) def modify(pic: torch.Tensor) -> torch.Tensor: pic1 = pic + 1 pic1_cpu = (pic1.cpu() + 1) * 2 return pic1_cpu.cuda() + pic @modify.register_fake def _(pic): return torch.empty_like(pic) @torch.library.custom_op("mylib::transform", mutates_args=()) def transform(pic: torch.Tensor) -> torch.Tensor: return (pic + 1) * 2 @transform.register_fake def _(pic): return torch.empty_like(pic) register_custom_op_support_cudagraph(movement, is_cudagraphable=False) register_custom_op_support_cudagraph(modify, is_cudagraphable=False) img = torch.randn(3, 64, 64, device="cuda") def f(img): x = (img + 10) * 2 y = movement(x) z = y + 1 u = transform(z) v = 2*u + 1 out = modify(v) return out + 1 compiled_f = torch.compile(f, mode="reduce-overhead", fullgraph=True) eager_out = f(img) for _ in range(3): compiled_out = compiled_f(img) assert torch.allclose(eager_out, compiled_out) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149782 Approved by: https://github.com/zou3519	2025-03-27 16:23:07 +00:00
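To see why the repro ends up with 3 cudagraphs after splitting on 2 custom ops, the partitioning can be modeled as grouping consecutive ops by whether they are cudagraphable. The sketch below is a toy model on a flat list of op names, under the assumption that ops not registered as non-cudagraphable can be captured; the real partitioner works on Inductor's scheduler nodes, and the names used here are illustrative.

```python
from itertools import groupby

# Hypothetical registry mirroring the repro: ops marked False must run outside
# any CUDA graph; everything else is assumed to be capturable.
CUDAGRAPHABLE = {"movement": False, "modify": False}


def partition(ops):
    """Group consecutive ops by whether they can be captured in a CUDA graph."""
    return [
        (graphable, list(group))
        for graphable, group in groupby(ops, key=lambda op: CUDAGRAPHABLE.get(op, True))
    ]


# Op sequence loosely following f(img) in the repro.
ops = ["add_mul", "movement", "add", "transform", "mul_add", "modify", "add"]
parts = partition(ops)
num_cudagraphs = sum(1 for graphable, _ in parts if graphable)
print(num_cudagraphs)  # 3
```

Each non-cudagraphable op acts as a boundary, so two such ops in the middle of the function leave three capturable segments. `transform` is a custom op too, but since it was not registered as non-cudagraphable it stays inside the middle segment.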
Rachel Guo	48cff64a54	[pt2_provenance_tracing] add combo kernel nodes post_grad nodes origin info (#149598 ) Summary: found it helpful when running prod model with combo_kernel feature enabled Test Plan: CI Differential Revision: D71513304 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149598 Approved by: https://github.com/yushangdi	2025-03-27 00:26:24 +00:00
PyTorch MergeBot	185aaaaf8e	Revert "Improve subproc autotuning implementation (#149700 )" This reverts commit `8cd6a133f2`. Reverted https://github.com/pytorch/pytorch/pull/149700 on behalf of https://github.com/yangw-dev due to This is breaking servicelab_benchmark_pyper_local_runner internally ([comment](https://github.com/pytorch/pytorch/pull/149700#issuecomment-2755975959))	2025-03-26 23:17:01 +00:00
Sam Larsen	8cd6a133f2	Improve subproc autotuning implementation (#149700 ) Summary: The primary change is to update the autotune-in-a-subproc implementation to avoid using multiprocessing spawn. Spawn (re)executes the toplevel script in the subproc, which can be problematic. The approach here is similar to Triton parallel compile: we Popen a subproc on a controlled entry point and communicate over pipes. That change drove a lot of refactoring in the TuningProcess class, so I took the opportunity to simplify some things, rename some methods, etc. One other notable change is around the timeout / kill approach. After a timeout, we were previously attempting to stop the subproc in three steps (graceful shutdown, sigkill if graceful fails, sigterm if sigkill fails). I'd argue that's not useful: 1) The graceful shutdown is never going to work unless the subproc happens to have just completed its task and is ready to receive the next command. 2) If we're going to kill the subproc, let's just take the most aggressive approach and move on as quickly as possible to restarting it rather than waiting to see if previous shutdown attempts succeeded. The only downside that I can find is maybe a little log spew, e.g., `ResourceWarning: subprocess 2987680 is still running` List of changes: * Use Popen instead of spawn for the autotuning subprocess. * Introduced a new entry point `__autotune_main__.py` * Renamed some TuningProcess methods. For example `shutdown` makes more sense than `terminate` because the latter implies a forced kill. * Simplified the implementation around benchmarking timeout and how we kill the subproc after a timeout. * Deprecated the unused timeout configs in `_inductor/config.py` * Moved `get_ld_library_path` helper to a common utils file. * Added more unit tests for subproc crashes / timeouts / exceptions, etc. Test plan: * New unit tests * Also ran internally with all combinations of: build mode `opt` and `dev-nosan`, and `buck run` vs.
executing the `.par` file directly. * Made sure the functionality to parallelize autotuning across different GPUs is working (it wasn't clear to me this was behaving the way we wanted it to). Pull Request resolved: https://github.com/pytorch/pytorch/pull/149700 Approved by: https://github.com/aorenste, https://github.com/jansel, https://github.com/eellison	2025-03-25 20:07:28 +00:00
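The "Popen a subproc on a controlled entry point and communicate over pipes" approach from the commit above (#149700) can be sketched with the standard library alone. This is a minimal illustration, not the `TuningProcess` implementation: `WORKER_SOURCE`, `start_worker`, `run_job`, and `kill_worker` are hypothetical names, the worker is inlined as a string instead of living in a module such as `__autotune_main__.py`, and timeout handling is omitted.

```python
import pickle
import subprocess
import sys

# Hypothetical worker entry point, inlined so the sketch is self-contained.
# Unlike multiprocessing "spawn", the child never re-executes the parent's
# top-level script; it only runs this controlled loop.
WORKER_SOURCE = r"""
import pickle, sys
inp, out = sys.stdin.buffer, sys.stdout.buffer
while True:
    try:
        job = pickle.load(inp)
    except EOFError:
        break
    if job is None:  # graceful shutdown request
        break
    pickle.dump(job * job, out)  # stand-in for benchmarking one choice
    out.flush()
"""


def start_worker():
    """Launch the worker with pipes for requests and responses."""
    return subprocess.Popen(
        [sys.executable, "-c", WORKER_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )


def run_job(proc, job):
    """Send one pickled job over stdin and block for the pickled result."""
    pickle.dump(job, proc.stdin)
    proc.stdin.flush()
    return pickle.load(proc.stdout)


def kill_worker(proc):
    """Take the most aggressive path: kill immediately, then reap."""
    proc.kill()
    proc.wait()
```

A caller would typically pair `run_job` with a timeout and, when it expires, call `kill_worker` followed by `start_worker` rather than attempting a graceful shutdown first, which is the trade-off the commit message argues for.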
Ding, Yi1	f7d1b966c2	[Inductor] Unify the data type propagation between Triton and CPP Backend (#146970 ) Fixes #144246 Use `DtypePropagationOpsHandler` for CSE variables of CPP backend. In addition, add static type checking for the generated CPP code similar to the `config.test_configs.runtime_triton_dtype_assert`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146970 Approved by: https://github.com/jgong5, https://github.com/eellison, https://github.com/leslie-fang-intel	2025-03-21 17:52:51 +00:00
James Wu	7bb9c36784	Hook StaticCudaLauncher up to torch.compile (cold start) (#148890 ) This hooks up the previous PR to torch.compile. Will add a config flag to hide this behind in a bit, but for now it's useful for testing purposes to have it on by default. Inductor will automatically choose to use StaticCudaLauncher to launch triton kernels if: - The kernel is a cuda kernel and inductor can find a cubin file associated with it - The kernel takes less than 50 arguments - The kernel doesn't use any special features (launch hooks, large amounts of shared memory) - The kernel is not user defined (to be supported in a later PR) We split CompileResult into TritonCompileResult and StaticTritonCompileResult, but have them share implementations of how they exec a python launcher. StaticTritonCompileResult's python launcher has the benefit of a simpler def_args/call_args setup, since it always filters out all constexprs before running, no matter the triton version. Some key features of StaticTritonCompileResult: - It is fully serializable - It stores the minimum amount of stuff, so that later it can be cached easily - It does not depend on any triton specific types (though it does have various triton metadata). For now, both TritonCompileResult and StaticTritonCompileResult still `exec` custom python launchers, and use GridExpr. We can change that in the future to simplify if we'd like. For now though, this custom python codegen is good for flexibility when it comes to supporting removal of constexprs, so using it for static launching is nice to not have to pay the cost of removing constexprs at kernel runtime. Hooking everything up to torch.compile lets me run every unit test with StaticCudaLauncher to make sure that we still pass (even if we bypass StaticCudaLauncher itself). It also lets me check for compilation/runtime performance with these changes. 
Fixes #149448 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148890 Approved by: https://github.com/jansel	2025-03-20 17:32:20 +00:00
Aaron Gokaslan	a0ac63cbd9	[BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257 Approved by: https://github.com/jansel	2025-03-18 00:46:07 +00:00
PyTorch MergeBot	24cfeec2c7	Revert "[BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257 )" This reverts commit `bfee141666`. Reverted https://github.com/pytorch/pytorch/pull/149257 on behalf of https://github.com/malfet due to Let's see if it helps restore compiler benchmark sanity, see `8bc7bd94a5/1` ([comment](https://github.com/pytorch/pytorch/pull/149257#issuecomment-2731133812))	2025-03-17 22:57:00 +00:00
Aaron Gokaslan	bfee141666	[BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257 Approved by: https://github.com/jansel	2025-03-16 23:52:58 +00:00
Boyuan Feng	3e605fe46d	[CUDAGraph] Graph Partition (#147648 ) This PR implements cudagraph partition, following previous PR on inductor graph partition (#147038). Since there are many ops that cudagraph cannot support, this PR focuses on `cpu ops` and will add more partition rules in the next PR. ## Example ```python import torch torch._inductor.config.graph_partition = True def f(x, y): x1 = x + 1 y1 = y + 1 y_cpu = y1.cpu() + 1 z = x @ y return x1 + y1 + z + y_cpu.cuda() x, y = [torch.ones(2, 2, device="cuda") for _ in range(2)] x_cloned, y_cloned = [tmp.clone() for tmp in [x,y]] eager_out = f(x, y) f_compiled = torch.compile(f, mode="reduce-overhead") for _ in range(5): compiled_out = f_compiled(x_cloned, y_cloned) assert torch.allclose(eager_out, compiled_out) ``` w/o graph partition, we will skip cudagraph: ``` skipping cudagraphs due to skipping cudagraphs due to cpu device (device_put). Found from : File "/home/boyuan/playground/cudagraph/graph_partition/graph_partition.py", line 9, in f y_cpu = y1.cpu() + 1 # 3 ``` w/ graph partition, we can see two cudagraphify under the same torch-compiled region: ![image](https://github.com/user-attachments/assets/4e22d428-2687-433d-b92a-0814a2201b25) ## Design PR #147038 splits `def call(args)` function into multiple `def partition_id(args)`. In this PR, we use `recursively_apply_fns()` to wrap each `partition_id()` function with `cudagraphify`. One major design point is, `cudagraphify` takes metadata such as static_input_idxs and we need to provide such metadata for each graph partition. However, we previously only have such metadata for the original graph instead of graph partitions. The [idea](https://github.com/pytorch/pytorch/pull/147038#discussion_r1964124800) is: - compute a mapping from the partition metadata (e.g., input/output idx) to the graph metadata, stored in `GraphPartitionMap`. 
- during post_compile, get the `CudagraphMetadata` for each partition based on the graph-level metadata and `GraphPartitionMap`, via `get_partition_cudagraph_metadata()`. - finally, in `cudagraph_partition_pos_compile`, we compute the `CudagraphMetadata` and apply cudagraphify for each graph via `recursively_apply_fns`. #### Q: How does it work with codecache? While we have multiple graph partitions, we still have 1 file and 1 `call` function for 1 dynamo graph. The major difference is we need to additionally load a `recursively_apply_fns()` for graph partition. We also add `partition_maps: Optional[list[GraphPartitionMap]]` to `CompiledFxGraph` so it will be serialized and could be deserialized later. ## Edge Case 1 PyTorch has an assumption on input/output orders. For example, backward inputs take saved tensors first and then tangents. In graph partition, we respect such orders via `graph_partition_signature_reorder`. ## Edge Case 2 Cudagraphifying `call` function gives 2 cudagraph managed tensors `buf0` and `primals_1`. However, cudagraphifying `partition_0` gives only 1 cudagraph managed tensor `buf0`. This leads to a semantic difference between cudagraph w/ and w/o graph partition. [full code comparison](https://www.internalfb.com/intern/diffing/?paste_number=1747654420) ![image](https://github.com/user-attachments/assets/03d08ce0-f1d1-4d1d-8432-805a07e1dd40) To achieve the same semantics, we return an input tensor as an output if it is not freed in a graph partition. This allows more cudagraph managed tensors and is important for handling saved tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147648 Approved by: https://github.com/eellison	2025-03-13 16:00:21 +00:00
PyTorch MergeBot	c916a8efc5	Revert "Use the device interface for detecting Triton availability (#139171 )" This reverts commit `940b60db97`. Reverted https://github.com/pytorch/pytorch/pull/139171 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @jansel can you please help get these changes working? See D70946254 for more details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/139171#issuecomment-2715392451))	2025-03-11 18:49:21 +00:00
George White	940b60db97	Use the device interface for detecting Triton availability (#139171 ) This allows for each device type to check current devices for Triton compatibility and ensure their Triton backend is present. This PR replaces the `has_triton()` global method which was previously used for this task, and moves the initial check for each Inductor backend on to their associated `BaseScheduler` subclass. This means that other backends, such as Halide, can also implement their own availability checks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139171 Approved by: https://github.com/jansel	2025-03-11 03:56:11 +00:00
drisspg	b215841ebb	[MM] Add sm carevout to lowerings (#148793 ) # Summary See https://github.com/pytorch/pytorch/issues/145115 for more details. I have been using the following to verify; I still need to figure out how to do proper guarding. This does the correct thing if we compile with the SM carveout already set, but since we don't guard on it yet, we don't recompile. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148793 Approved by: https://github.com/lw, https://github.com/eellison	2025-03-10 23:49:26 +00:00
eellison	4c13a859e5	Workaround no triton float8_e8m0fnu support in inductor (#148722 ) Triton doesn't support actual float8_e8m0fnu yet, so we can't currently codegen any arithmetic on them. But we can support bitcasting and view/memory operators, and treat them as uint8 for now. Fix for https://github.com/pytorch/pytorch/issues/147873. The one question I'm not sure of is whether or not we need to explicitly disable triton template fusion, since it would fuse in these dtypes as uint8. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148722 Approved by: https://github.com/vkuzo ghstack dependencies: #148450	2025-03-10 17:37:39 +00:00
xinan.lin	50c9f6d83b	[Windows][Inductor][XPU] Unload triton pyd files to be able to remove them on Windows. (#148323 ) In `fresh_inductor_cache`, removing pyd files raises a permission error on Windows because they are still in use by the process. So we clear the references to the loaded pyd library objects and unload them from the process. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148323 Approved by: https://github.com/jansel ghstack dependencies: #148534, #148538, #147727	2025-03-07 17:19:59 +00:00
Shunting Zhang	262411e48b	[inductor] online softmax (#127011 ) Softmax needs some preparation work that accesses the input tensor in two passes: - compute amax of each row - compute (x - amax).exp.sum for each row When the row size is large, the cache cannot hold all the active data, and accessing the input in multiple passes increases execution time since the kernel is memory-bandwidth bound. Online softmax uses a customized reduction to compute max and sum at the same time by accessing the data in one pass. Check this paper for more details ( https://arxiv.org/abs/1805.02867 ). Also here is an online softmax kernel generated by inductor as a reference: https://gist.github.com/shunting314/67ae4fffd45d4f2753c781780332fa54 ## Microbenchmark - `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=0 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax` : without online softmax - eager_ms=6.671296119689941 - opt_ms=8.06931209564209 - `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=1 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax`: with online softmax - eager_ms=6.634047985076904 - opt_ms=6.230591773986816 Ideally, online softmax should save about 2ms here. We save about 1.84ms in practice. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127011 Approved by: https://github.com/jansel	2025-03-06 21:07:18 +00:00
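The one-pass reduction described in the commit above (#127011) can be sketched in plain Python. This is an illustration of the online-softmax recurrence from the linked paper, not the Triton kernel Inductor generates; the function names are hypothetical and inputs are assumed to be finite floats.

```python
import math


def two_pass_softmax(row):
    """Reference: one pass for the max, a second pass for the sum."""
    amax = max(row)
    total = sum(math.exp(x - amax) for x in row)
    return [math.exp(x - amax) / total for x in row]


def online_softmax(row):
    """Compute the running max and the rescaled running sum in a single pass."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in row:
        new_max = max(running_max, x)
        # Rescale the sum accumulated so far to the new max, then add the
        # current element's contribution.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    # Normalization still reads the row once more, but the max/sum
    # preparation now needs one pass over the input instead of two.
    return [math.exp(x - running_max) / running_sum for x in row]
```

The rescaling factor `exp(running_max - new_max)` is what lets the sum be corrected whenever a larger maximum appears later in the row, so both statistics come out of the same loop.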
Benjamin Glass	d6d670ab4d	[AOTI] build CPU CPP kernels at O3, and all other code at O1 (#148587 ) In the future, we may also want to add LTO linking to further optimize the results (while still hopefully netting compile time benefits). Differential Revision: [D70641543](https://our.internmc.facebook.com/intern/diff/D70641543) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148587 Approved by: https://github.com/desertfire	2025-03-05 22:47:46 +00:00
Xia, Weiwen	ab81ca5053	[Inductor][CPU] Add GEMM templates for _weight_int4pack_mm_for_cpu with AVX512 (#146756 ) Summary It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU. This PR adds GEMM templates for `torch.ops.aten._weight_int4pack_mm_for_cpu`. The micro kernel used for the templates is based on AVX512 and it's a copy of the ATen implementation of `torch.ops.aten._weight_int4pack_mm_for_cpu` with minor changes. Due to better blocking and loop schedule, the GEMM template based implementation outperforms the ATen implementation in all cases we tested. Test plan ``` python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_avx512 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146756 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2025-03-03 00:56:29 +00:00
Xuehai Pan	1cb4e2df65	[BE][PYFMT] migrate PYFMT for `torch._inductor` to `ruff format` (#144550 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144550 Approved by: https://github.com/jansel	2025-02-28 13:33:19 +00:00
Animesh Jain	eb9c127341	[dynamo][optimizers] Install ID_GUARDED tensors into the Fx graph (#147824 ) Earlier, with the inline flag, we were lifting id-guarded tensors to the inputs to the Fx graph. But this offers no benefit. The main idea behind lifting parameters as inputs was to reuse the compilation units across many instances of the nn-module. However, if we are guarding on the `id`, we are explicitly specializing the compiled artifact to the parameter. This PR installs the parameters back into the graph. The benefit is removal of all pre-graph bytecode to extract the id-guarded tensors from locals/globals. This increases speedup from 1.67x to 1.75x for an internal model that has a large number of optimizer parameters. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147824 Approved by: https://github.com/jansel Co-authored-by: Jason Ansel <jansel@meta.com>	2025-02-28 03:22:11 +00:00
eellison	481a57bc37	Support torch.compile rng selective activation checkpointing with cudagraph (#146878 ) TODO: - [x] Add handling for when forward is invoked multiple times without invoking backward, so that the fwd/backward states are out of sync - [x] Update rng state initialization to take from correct device - [x] Tests - [x] handling of retain_graph - [x] respect fallback random Fix for https://github.com/pytorch/pytorch/issues/130123. Updates the aot_eager and cudagraph compilation of `run_and_save_rng_state` to use the new mechanism added by https://github.com/pytorch/pytorch/pull/114068 for CUDAGraph safe rng states. We have a pair of rng states for the fwd and backward respectively. In both forward and backward the rng op will get run with `graphsafe_run_with_rng_state` which takes in RNG state and it hooks onto the current RNG generator before running the operator. The rng states for fwd/backward are initialized with the same value. We ensure that for any given run of the forward, the corresponding backward run will have the same rng states for the op as was observed in the forward. ``` ===== Forward graph 1 ===== /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module): def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", fwd_rng_state_0): sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1) # No stacktrace found for following nodes graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = fwd_rng_state_0); fwd_rng_state_0 = None ... 
===== Backward graph 1 ===== def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", tangents_1: "f32[4, 4][4, 1]cuda:0", bwd_rng_state_0): sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1) # No stacktrace found for following nodes graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = bwd_rng_state_0); bwd_rng_state_0 = None ``` There is some extra complication when a user either calls backward with retain_graph, or calls the backward in a different order than they called the forward. If a user has state fwd_rng_state0, bwd_rng_state0 and calls: - fwd0: fwd_rng_state0 -> fwd_rng_state1 - fwd1: fwd_rng_state1 -> fwd_rng_state2 - bwd1 - bwd0 Then naively, when bwd1 is invoked the bwd rng states would not be equal to the states that were observed in fwd1. I added handling of this in the aot runtime wrappers to detect pending backward invocations and the current position of the bwd rng states, and to update when necessary. Other notes: Because nodes which appear later in the forward appear earlier in the backward, we need a separate rng state for each operator. If we reused the rng across ops, the forward and backward would be run with different rng states, i.e., not applied in the same order. Questions for reviewers: This does change numerics, because the rng of the op is now taken from the input rng state instead of whatever the rng would be midway through running the graph. Technically, we only need this for cuda graph. But I'd prefer not to have an rng divergence just for cudagraph. I am making it respect `fallback_random`. Edit: decided to apply to non cudagraphs as well, so long as fallback_random is not set. I'm initializing the rng states by cloning the current state.
If you had something like 5 different rands in the model with the same shape, they'd all get the same value. This doesn't seem great. I could use some other initialization scheme, like taking the seed from graph position, etc. Not sure. Let me know your thoughts. Edit: updated to be taken from randint(). Update: initializing rng states from torch.randint. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146878 Approved by: https://github.com/anijain2305, https://github.com/bdhirsh	2025-02-28 00:47:03 +00:00
PyTorch MergeBot	17358ce778	Revert "Support torch.compile rng selective activation checkpointing with cudagraph (#146878 )" This reverts commit `ad0c879e22`. Reverted https://github.com/pytorch/pytorch/pull/146878 on behalf of https://github.com/wdvr due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/146878#issuecomment-2686767956))	2025-02-27 03:36:16 +00:00

507 Commits