pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Laith Sakka	128b32f363	cache loaded python modules (#149910 ) I am splitting caching the loading of modules from caching the codegen since it's trivial and much easier. Module loading is 50% of the cost, and codegen is 50% of maybe_append choice on full graph model, which is 40% of total compile time. Running the mm_loop benchmark, before this change: 67947323682, after this change: 25845073249, 2.6X faster. It seems that the cache was there then got dropped. I added a benchmark so it won't be dropped again by mistake. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149910 Approved by: https://github.com/eellison, https://github.com/aorenste ghstack dependencies: #149932	2025-03-27 00:45:09 +00:00
Shangdi Yu	46dd226702	Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind (#149529 ) Summary: We need to properly fakify torchbind objects, including the ones in graph module attributes, so the registered fake implementation works properly. - _fakify_script_objects in `compile_fx` - Allow fake torchbind objects in `torchbind_constants` Remove `node.meta["unbacked_bindings"]` for `aot_compile` in `compile_fx`. Otherwise `ShapeProp` will fail when trying to resolve the `unbacked_bindings` of `with_effect` tokens. Update `sigrid_transforms_test` to use the latest `torch._inductor.aot_compile` API. Add a test for `Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind` in `e2e_test`. Test Plan: ``` buck run //caffe2/torch/fb/sparsenn:sigrid_test -- -r test_transform_torch_bind buck run //sigmoid/inference/test:e2e_test_cpu -- -r SigridTransforms buck2 run mode/dev-nosan sigmoid/inference/ts_migration:pt2i_readiness_main -- --model_id 545017754 --test_suite ads_all --mode test_preproc ``` Differential Revision: D70013257 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149529 Approved by: https://github.com/angelayi	2025-03-21 18:58:28 +00:00
Zhuoran Zhao	a703107f7b	[AOTInductor] Fix skip cpp wrapper unit test (#149606 ) Summary: as title Test Plan: ``` buck2 test 'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test -- --exact 'deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test - test_cpu_lower_aoti_ep_called (deeplearning.aot_inductor.cpu.test.test_lowering_utils.CPULoweringTest)' ``` ``` buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cudagraph_trees_expandable_segments -- --exact 'caffe2/test/inductor:cudagraph_trees_expandable_segments - test_skip_cpp_wrapper (caffe2.test.inductor.test_cudagraph_trees.CudaGraphTreeTests)' ``` https://www.internalfb.com/phabricator/paste/view/P1758059197 Reviewed By: henryoier Differential Revision: D71528281 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149606 Approved by: https://github.com/desertfire	2025-03-20 20:55:33 +00:00
Kai Londenberg	f17ae3f7b7	[Inductor Cutlass backend] Fix imports and compilation of Cutlass SM100 Kernels (#149515 ) Summary: Fixes the import and compilation of Cutlass SM100 Kernels. Test Plan: Cutlass backend unit tests, running benchmarks/inductor_backends/cutlass.py Differential Revision: D71196747 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149515 Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78	2025-03-20 20:35:18 +00:00
William Wen	a66a9581da	[dynamo] support Python 3.13t (#149549 ) A few bug fixes to get Dynamo mostly working with 3.13 nogil. Dynamo encounters internal CPython assert errors in older versions of 3.13. The fix has landed on [CPython's 3.13 branch](https://github.com/python/cpython/tree/3.13) and will be included in 3.13.3 (https://peps.python.org/pep-0719/ - April 8). If you wish to try `torch.compile` on the latest 3.13 branch, you can comment out the error checking (i.e. `70b6cd4e11/torch/__init__.py (L2535)` and `70b6cd4e11/torch/_dynamo/eval_frame.py (L899)`). We will work on getting PyTorch CI up for Dynamo/dynamo-wrapped/inductor once 3.13.3 is available. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149549 Approved by: https://github.com/jansel	2025-03-20 09:49:27 +00:00
Benjamin Glass	e8dd58b8cf	cpp_wrapper: Precompile device-specific header files (#146928 ) This saves us about a second per compilation, which is _massive_ for the OpInfo tests. Total OpInfo test runtime is down about 2x from this change alone. Relands #144002, with changes needed by fbcode internals. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146928 Approved by: https://github.com/desertfire	2025-03-17 20:40:15 +00:00
Sam Larsen	c83c711da8	Remove some memory overhead in parallel compile workers (#149168 ) Summary: The parallel compile workers are holding on to more memory than they need to because they're loading the compiled modules into memory. Update the post-fork initializer to record when in a subprocess and skip some of the unnecessary overhead. Test Plan: Ran a test script to compile 15k Triton kernels and used tracemalloc in the subprocs to investigate the overhead. On my devgpu: * After importing torch in a subproc: 371M * Without this PR, after compiling 15k kernels: 825M * With this PR, after compiling 15k kernels: 531M Pull Request resolved: https://github.com/pytorch/pytorch/pull/149168 Approved by: https://github.com/jansel	2025-03-15 14:20:40 +00:00
Huamin Li	e7e477c1f9	Not generate custom obj json when it's empty (#149246 ) Summary: as title. See internal Diff summary for more context. Test Plan: buck run @fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r config_not_generated Differential Revision: D71241676 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149246 Approved by: https://github.com/houseroad Co-authored-by: Huamin Li <huaminli@meta.com>	2025-03-15 13:00:48 +00:00
Jason Ansel	b040dc3a53	Reland: [inductor] Simplify grid handling (#148305 ) Summary: Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583 Before this PR, calling a triton kernel would look like: ```py kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0) ``` where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg: ```py kernel.run(a, b, xnumel, stream=stream0) ``` instead now the grid computation is included in the kernel launcher, with something like: ```py def launcher(in_ptr0, out_ptr0, xnumel, stream): grid_0 = ((xnumel + 1023) >> 10) grid_1 = 1 grid_2 = 1 runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel) ``` This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`. It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid. This unification allows this PR to be a net deletion of code. Differential [disconnected] Revision: D70471332 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305 Approved by: https://github.com/shunting314, https://github.com/eellison	2025-03-12 15:52:16 +00:00
PyTorch MergeBot	5ada4e6a53	Revert "Reland: [inductor] Simplify grid handling (#148305 )" This reverts commit `8d08b49015`. Reverted https://github.com/pytorch/pytorch/pull/148305 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/148305#issuecomment-2718177044))	2025-03-12 14:58:43 +00:00
Shangdi Yu	cf19efd3d9	Support basic TorchBind in aot_compile and aoti_compile_and_package (#148506 ) Summary: Codegen: - Skip some codegen parts for torchbind (such as arg declaration) because they are loaded in proxy executor, so we do not need to declare torchbind args in cpp code - Added a helper method to get the schema of CallTorchBind HOP. The returned schema is only the schema of `obj.method()`. Serialization: Add support for torchbind object in serialization - For CallTorchBind HOP, we need to handle it specially because of its schema. The output serialized args are in the format of `(obj, method, args, kwargs)`. - it.TorchBindObject inputs are serialized to `as_custom_obj` Argument. Packaging: Add torchbind objects file and `custom_objs_config.json` file to generated files output of `aot_compile`. The json file is stored in the `data/aotinductor/<model_name>` folder in pt2 archive. The torchbind objects are stored in data/constants/ folder in pt2 archive. The format of torchbind objects is `f"{CUSTOM_OBJ_FILENAME_PREFIX}{custom_obj_idx}"`. e.g. `custom_obj_0`. CustomClassHolder objects implement their own pickle methods. Note that this `custom_objs_config.json` file is different from the `model_constants_config.json` file produced in package_sigmoid(). The keys in `custom_objs_config` directly correspond to the arg name in extern nodes json. The key in `model_constants_config.json` produced by `package_sigmoid` is the attribute name in the user mode code. This is required for both internal and OSS torchbind support. For OSS torchbind support, we also need to package torchbind_constants into the .pt2 output. Work Left: We still need to add torchbind support in ProxyExecutor for inductor.aoti_load_package to work. See other diffs in the stack. Test Plan: ``` buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r schema buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile ``` Differential Revision: D69490718 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148506 Approved by: https://github.com/angelayi	2025-03-11 20:55:18 +00:00
Jason Ansel	8d08b49015	Reland: [inductor] Simplify grid handling (#148305 ) Summary: Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583 Before this PR, calling a triton kernel would look like: ```py kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0) ``` where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg: ```py kernel.run(a, b, xnumel, stream=stream0) ``` instead now the grid computation is included in the kernel launcher, with something like: ```py def launcher(in_ptr0, out_ptr0, xnumel, stream): grid_0 = ((xnumel + 1023) >> 10) grid_1 = 1 grid_2 = 1 runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel) ``` This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`. It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid. This unification allows this PR to be a net deletion of code. Differential Revision: D70471332 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305 Approved by: https://github.com/shunting314, https://github.com/eellison	2025-03-11 18:51:06 +00:00
Benjamin Glass	ed7e964f2b	codecache.py: use str.format rather than % formatting (#148691 ) Additionally, swaps over a fixed length `std::vector` used by `cpp_wrapper` for a `std::array`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148691 Approved by: https://github.com/desertfire	2025-03-10 18:33:58 +00:00
Zhuoran Zhao	3745da18f4	[AOTI] Switch to local cpp compile for fbcode (#148592 ) Summary: as title, otherwise we cannot find lamdhip64 Test Plan: https://www.internalfb.com/phabricator/paste/view/P1747104431 Differential Revision: D70637798 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148592 Approved by: https://github.com/hl475	2025-03-08 08:38:26 +00:00
Aaron Orenstein	a3b77d434a	Subprocess compile (attempt 2) (#148635 ) Add a mode to fx_codegen_and_compile() to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer). Added a test which runs the test_torchinductor tests with subprocess compiling turned on. Fixed the test which caused the previous version (#146134) to be reverted: ``` $ PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TEST_WITH_SLOW=1 PYTORCH_TEST_SKIP_FAST=1 python test/inductor/test_compile_subprocess.py CpuTests.test_conv_bn_fuse_cpu ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148635 Approved by: https://github.com/jamesjwu	2025-03-07 17:50:14 +00:00
Benjamin Glass	d6d670ab4d	[AOTI] build CPU CPP kernels at O3, and all other code at O1 (#148587 ) In the future, we may also want to add LTO linking to further optimize the results (while still hopefully netting compile time benefits). Differential Revision: [D70641543](https://our.internmc.facebook.com/intern/diff/D70641543) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148587 Approved by: https://github.com/desertfire	2025-03-05 22:47:46 +00:00
PyTorch MergeBot	897fd9b514	Revert "Subprocess compile (#146134 )" This reverts commit `07f876e960`. Reverted https://github.com/pytorch/pytorch/pull/146134 on behalf of https://github.com/malfet due to looks like it broke slow jobs, see `e1dee4ccb3/3` ([comment](https://github.com/pytorch/pytorch/pull/146134#issuecomment-2702239123))	2025-03-05 22:41:19 +00:00
Bin Bao	df7e43e5d4	[AOTI] Fix aot_inductor_package test errors (#148279 ) Summary: Fix fbcode test failures introduced by https://github.com/pytorch/pytorch/pull/147975. Make sure script.ld is copied to the build-time directory. Differential Revision: D70454149 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148279 Approved by: https://github.com/zoranzhao	2025-03-05 05:22:48 +00:00
Aaron Orenstein	07f876e960	Subprocess compile (#146134 ) Add a mode to `fx_codegen_and_compile()` to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer). Added a test which runs the test_torchinductor tests with subprocess compiling turned on. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146134 Approved by: https://github.com/jamesjwu	2025-03-03 21:10:12 +00:00
PyTorch MergeBot	608377d341	Revert "[import][inductor] Simplify grid handling (#147583 )" This reverts commit `b59776d857`. Reverted https://github.com/pytorch/pytorch/pull/147583 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147583#issuecomment-2693016036))	2025-03-03 00:49:32 +00:00
Jason Ansel	b59776d857	[import][inductor] Simplify grid handling (#147583 ) Before this PR, calling a triton kernel would look like: ```py kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0) ``` where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg: ```py kernel.run(a, b, xnumel, stream=stream0) ``` instead now the grid computation is included in the kernel launcher, with something like: ```py def launcher(in_ptr0, out_ptr0, xnumel, stream): grid_0 = ((xnumel + 1023) >> 10) grid_1 = 1 grid_2 = 1 runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel) ``` This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`. It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid. This unification allows this PR to be a net deletion of code. Note the attached diff contains some minor fbcode-only changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147583 Approved by: https://github.com/eellison, https://github.com/shunting314	2025-03-02 07:31:07 +00:00
Xuehai Pan	1cb4e2df65	[BE][PYFMT] migrate PYFMT for `torch._inductor` to `ruff format` (#144550 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144550 Approved by: https://github.com/jansel	2025-02-28 13:33:19 +00:00
Raymond Li	c5bf9aaf1c	Log graph breaks (#146537 ) Graph breaks currently aren't logged to dynamo_compile and pt2_compile_events. We want to log them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146537 Approved by: https://github.com/c00w	2025-02-27 11:06:33 +00:00
Bin Bao	f104ef1248	[AOTI][refactor] Consolidate CppBuilder.build and CppBuilder.build_fbcode (#147975 ) Summary: Let CppBuilder handle all the cpp build logic Differential Revision: D70141808 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147975 Approved by: https://github.com/angelayi, https://github.com/yushangdi	2025-02-27 00:35:12 +00:00
PyTorch MergeBot	acca9b9cb0	Revert "[AOTI][refactor] Consolidate CppBuilder.build and CppBuilder.build_fbcode_cpu_re (#147803 )" This reverts commit `0b9da1ae0a`. Reverted https://github.com/pytorch/pytorch/pull/147803 on behalf of https://github.com/wdvr due to breaking internal tests, discussed with author ([comment](https://github.com/pytorch/pytorch/pull/147803#issuecomment-2683938121))	2025-02-26 05:32:17 +00:00
Bin Bao	0b9da1ae0a	[AOTI][refactor] Consolidate CppBuilder.build and CppBuilder.build_fbcode_cpu_re (#147803 ) Summary: Let CppBuilder handle all the cpp build logic Differential Revision: [D70146185](https://our.internmc.facebook.com/intern/diff/D70146185) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147803 Approved by: https://github.com/malfet ghstack dependencies: #147805, #147806, #147807	2025-02-25 13:33:12 +00:00
Bin Bao	7ed0670e21	[AOTI][refactor] Replace run_command_and_check with CppBuilder.build (#147806 ) Summary: Consolidate cpp compilation action to CppBuilder. Reland https://github.com/pytorch/pytorch/pull/147680 Differential Revision: [D70146183](https://our.internmc.facebook.com/intern/diff/D70146183) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147806 Approved by: https://github.com/malfet ghstack dependencies: #147805	2025-02-25 13:33:03 +00:00
Bin Bao	2680e835c8	[AOTI][refactor] Rename use_absolute_path to use_relative_path (#147805 ) Summary: The option really means to compile a cpp file using its basename instead of its full path. Reland https://github.com/pytorch/pytorch/pull/147679. Differential Revision: [D70146184](https://our.internmc.facebook.com/intern/diff/D70146184) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147805 Approved by: https://github.com/malfet	2025-02-25 13:32:54 +00:00
PyTorch MergeBot	890213f65f	Revert "[AOTI][refactor] Rename use_absolute_path to use_relative_path (#147679 )" This reverts commit `0b52d801d2`. Reverted https://github.com/pytorch/pytorch/pull/147679 on behalf of https://github.com/desertfire due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147679#issuecomment-2680389225))	2025-02-25 04:11:13 +00:00
PyTorch MergeBot	9b06b30468	Revert "[AOTI][refactor] Replace run_command_and_check with CppBuilder.build (#147680 )" This reverts commit `22fae0d948`. Reverted https://github.com/pytorch/pytorch/pull/147680 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147680#issuecomment-2680383986))	2025-02-25 04:06:40 +00:00
Bin Bao	22fae0d948	[AOTI][refactor] Replace run_command_and_check with CppBuilder.build (#147680 ) Consolidate cpp compilation action to CppBuilder Differential Revision: [D69723632](https://our.internmc.facebook.com/intern/diff/D69723632/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147680 Approved by: https://github.com/yushangdi, https://github.com/angelayi ghstack dependencies: #147679	2025-02-24 21:45:15 +00:00
Bin Bao	0b52d801d2	[AOTI][refactor] Rename use_absolute_path to use_relative_path (#147679 ) The option really means to compile a cpp file using its basename instead of its full path. Differential Revision: [D69722709](https://our.internmc.facebook.com/intern/diff/D69722709/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147679 Approved by: https://github.com/angelayi	2025-02-24 21:44:33 +00:00
James Wu	574371d828	Add current cuda device index to FXGraphCache key (#147464 ) This PR intends to fix the cache related issues from https://github.com/pytorch/pytorch/issues/147405. It does not handle the dynamo recompile case in process, because it does not introduce any extra guards. For FXGraphCache and AOTAutogradCache, we simply have to have the device context in the cache key. Note that for any function that accepts tensor inputs, the device context is naturally already included in the cache key by the metadata of example inputs. However, for functions that return constants or have no arguments, the device context still needs to be in the cache key. A more robust fix for this would be to have inductor generate device guards that are dynamic, instead of specialized. This would also help us share more cache artifacts. I've added unit tests for FXGraphCache and AOTAutogradCache, both of which would fail without this change. Differential Revision: [D69875939](https://our.internmc.facebook.com/intern/diff/D69875939) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147464 Approved by: https://github.com/bdhirsh, https://github.com/anijain2305	2025-02-20 12:38:21 +00:00
Aaron Orenstein	db4ce78d46	PEP585: More UP006 fixes (#146392 ) This should be the final PR before we can enable RUFF UP006. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146392 Approved by: https://github.com/justinchuby, https://github.com/albanD, https://github.com/Skylion007	2025-02-20 06:18:13 +00:00
Henry Tsang	48203bec63	[BE] remove sysconfig.get_config_var("LIBDIR") from cuda lib paths (#147409 ) Summary: I think the path is not needed anymore. It was added in https://github.com/pytorch/pytorch/pull/126408, but it has been a while since then. See if CI complains. Differential Revision: D69573185 See also https://github.com/pytorch/pytorch/pull/147158 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147409 Approved by: https://github.com/chenyang78	2025-02-19 23:04:22 +00:00
Bin Bao	d38db94689	[inductor][refactor] Move _compile_file to cpp_builder (#147202 ) Summary: To further consolidate cpp build logic into cpp_builder Test Plan: CI Differential Revision: D69595327 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147202 Approved by: https://github.com/yushangdi	2025-02-14 21:02:30 +00:00
Bin Bao	7b4efb492b	[inductor][refactor] Make _compile_file only used for fbcode (#147106 ) Summary: _compile_file in codecache.py only handles specific cpp compilation in fbcode. The next step is to consolidate it with cpp_builder. Test Plan: CI Differential Revision: D69592025 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147106 Approved by: https://github.com/yushangdi	2025-02-13 20:22:31 +00:00
James Wu	23524699d5	Only call triton in worker process, kick off worker processes earlier, during inductor codegen (#146417 ) ### Big idea This PR extends https://github.com/pytorch/pytorch/pull/144288 by combining calling triton in worker processes with the future cache: we kick off triton compilation in the worker processes earlier, during inductor codegen. Basically instead of calling async_compile.triton for the first time only after the entire code has been generated, we start compiling as soon as we know we'll need to compile the kernel. Then, when loading the generated inductor code, we can simply read from our in memory future cache, considerably increasing the parallelism. ### Implementation Overview In total, the diff does the following: - Converts TritonFuture to LambdaFuture, only calling triton.compile on worker processes - Now that triton.compile() isn't called on the main process, we call TritonBundler on all compiled kernels when we get them back from workers - Extend @eellison's future cache to a class, mostly as a refactor - Finally, call async_compile.triton ahead of time in Scheduler.codegen if workers are warmed up. This causes the subsequent async_compile.triton call that occurs after codegen to cache hit on cold start. In the diffs after this, I will add more to CompiledTritonKernels so that TritonBundler, on a warm start, automatically populates the in memory cache on warm start with the existing triton kernels, avoiding calling triton altogether on warm starts. Because LambdaFutures are much faster to kick off than TritonFutures, due to not needing to load from TritonCodeCache at all, the time spent kicking off these worker jobs is pretty minimal for inductor codegen. Differential Revision: [D69123174](https://our.internmc.facebook.com/intern/diff/D69123174/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146417 Approved by: https://github.com/jansel	2025-02-11 03:46:16 +00:00
PyTorch MergeBot	2fafcd37c3	Revert "cpp_wrapper: Precompile device-specific header files (#144002 )" This reverts commit `de6efa1feb`. Reverted https://github.com/pytorch/pytorch/pull/144002 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this breaks some inductor tests running internally ([comment](https://github.com/pytorch/pytorch/pull/144002#issuecomment-2649569562))	2025-02-11 00:42:22 +00:00
zeshengzong	da216baaa2	Optimize inductor `Self` typing (#146669 ) Replace method return type with `Self` typing Pull Request resolved: https://github.com/pytorch/pytorch/pull/146669 Approved by: https://github.com/jansel	2025-02-10 20:39:56 +00:00
Henry Tsang	ddcc97bb8c	Make sure cutlass kernel .cu file has configuration name and nvcc compile command (#146668 ) I think it's good to have everything in the .cu file, especially the nvcc compile command. Technically, the configuration name can be found in the template already. So let me know if you think it's not needed. Differential Revision: D69281295 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146668 Approved by: https://github.com/chenyang78	2025-02-10 18:16:44 +00:00
Benjamin Glass	de6efa1feb	cpp_wrapper: Precompile device-specific header files (#144002 ) This saves us about a second per compilation, which is _massive_ for the OpInfo tests. Total OpInfo test runtime is down about 2x from this change alone. Differential Revision: [D69185685](https://our.internmc.facebook.com/intern/diff/D69185685) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144002 Approved by: https://github.com/desertfire	2025-02-10 17:13:09 +00:00
Aaron Orenstein	57d8278ab9	pickler for GraphModule (#141659 ) Pickling GraphModule needs some special handling for wrapping things that normally can't be pickled - but async compile needs to pass them across a wire so we need to be able to serialize it - add some helpers to enable that. Differential Revision: [D68921318](https://our.internmc.facebook.com/intern/diff/D68921318) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141659 Approved by: https://github.com/jamesjwu	2025-01-31 05:34:28 +00:00
Sam Larsen	2811f33d12	Fix code cache + freezing compile-time regression (#145868 ) Summary: The current implementation introduces a compile-time regression due to overhead hashing large constants. To support freezing+caching, we consider only the tensor metadata of frozen params, but we neglect to do the same for any constants created as a result of folding frozen params. This PR Explicitly marks the constants created during freezing (and constant folding during freezing) and uses that info in the inductor cache to determine when to hash a tensor value+metadata vs. metadata only. Test Plan: `python benchmarks/dynamo/torchbench.py --backend inductor --device cuda --only alexnet --bfloat16 --cold-start-latency --print-compilation-time --inference --performance --freezing` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145868 Approved by: https://github.com/eellison	2025-01-31 02:04:15 +00:00
bglass@quansight.com	40ccb7a86d	cpp_wrapper: Move #includes to per-device header files (#145932 ) Summary: This prepares us for the next PR in the stack, where we introduce pre-compiled per-device header files to save compilation time. Reland https://github.com/pytorch/pytorch/pull/143909 after merge conflicts. Co-authored-by: Benjamin Glass <bglass@quansight.com> Differential Revision: D68656960 Pulled By: benjaminglass1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145932 Approved by: https://github.com/yushangdi, https://github.com/benjaminglass1 Co-authored-by: bglass@quansight.com <bglass@quansight.com>	2025-01-29 21:08:45 +00:00
Sam Larsen	cd68d54911	Inductor cache: Revamp how we handle frozen params (#143808 ) Summary: In https://github.com/pytorch/pytorch/pull/143563 we have a report of a problem with the treatment of frozen params in the inductor cache implementation. There seems to be a path where new constants are added in the `GraphLowering`. On a cache hit when we try to find those constant names in the `torch.fx.GraphModule`, they do not exist. The current approach treats all constants differently if the GM has any frozen params. This PR changes the approach to only treat the _frozen_ params specially, but store all other constants in the cache entry (as we do without freezing): 1) When creating a cache entry, store the names of any frozen params, but the values of any other constants. 2) On a cache hit, restore the values of the frozen params by looking up in the current GM. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143808 Approved by: https://github.com/leslie-fang-intel, https://github.com/eellison	2025-01-24 01:20:07 +00:00
Aaron Orenstein	893ca1dfe1	PEP585 update - torch/_inductor/[_-i]* (#145137 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145137 Approved by: https://github.com/bobrenjc93	2025-01-19 01:22:47 +00:00
Jason Ansel	7c1fb9b1ae	[inductor] Refactor CachingAutotuner so that it can pickle (#144044 ) These are refactors needed for #144288 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144044 Approved by: https://github.com/eellison	2025-01-18 01:44:16 +00:00
PyTorch MergeBot	94c0f15302	Revert "cpp_wrapper: Move #includes to per-device header files (#143909 )" This reverts commit `d62b3979da`. Reverted https://github.com/pytorch/pytorch/pull/143909 on behalf of https://github.com/kit1980 due to breaking internal builds because of removal of torch/_inductor/codegen/aoti_runtime/implementation.cpp ([comment](https://github.com/pytorch/pytorch/pull/143909#issuecomment-2597188669))	2025-01-17 00:36:38 +00:00
Benjamin Glass	d62b3979da	cpp_wrapper: Move #includes to per-device header files (#143909 ) This prepares us for the next PR in the stack, where we introduce pre-compiled per-device header files to save compilation time. Differential Revision: [D67938955](https://our.internmc.facebook.com/intern/diff/D67938955) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143909 Approved by: https://github.com/desertfire	2025-01-15 21:14:02 +00:00
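
Note on the "Simplify grid handling" entries (#147583, #148305): the launcher quoted in those messages computes `grid_0 = ((xnumel + 1023) >> 10)`, which is a ceiling division of `xnumel` by 1024. The sketch below only checks that arithmetic identity in plain Python. The block size of 1024 is inferred from the `+ 1023` and `>> 10` in the quoted launcher, and the helper names `grid_shift` and `grid_ceil` are hypothetical, not Inductor functions.

```python
BLOCK = 1024  # assumed block size, inferred from "+ 1023" and ">> 10"


def grid_shift(xnumel: int) -> int:
    """Grid size in the form quoted in the commit message."""
    return (xnumel + 1023) >> 10


def grid_ceil(xnumel: int) -> int:
    """The same quantity written as an integer ceiling division."""
    return -(-xnumel // BLOCK)


# Compare the two forms over a range that crosses several block boundaries.
mismatches = [n for n in range(0, 5000) if grid_shift(n) != grid_ceil(n)]
print("mismatches:", mismatches)
```

Because the grid is now a plain expression in the launcher's arguments, it can be evaluated per call without a separate `grid=` callable, which is the simplification the commit messages describe.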

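Note on "Add current cuda device index to FXGraphCache key" (#147464): the message explains that a function with no tensor inputs carries no device metadata, so the current device must be part of the cache key explicitly. The toy sketch below illustrates that idea only. The function `toy_cache_key` and its parameters are hypothetical and do not reflect how FXGraphCache actually builds its key.

```python
import hashlib


def toy_cache_key(graph_source: str, input_metadata: tuple, device_index: int) -> str:
    """Hash everything that can change the compiled result (illustrative only)."""
    payload = repr((graph_source, input_metadata, device_index)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# A function with no arguments: input metadata is empty, so without the
# device index both compilations would hash to the same key.
src = "def forward(): return 1"
key_dev0 = toy_cache_key(src, (), 0)
key_dev1 = toy_cache_key(src, (), 1)
print(key_dev0 != key_dev1)
```
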

609 Commits