pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Maggie Moss	4fada51ada	Fix existing Pyrefly errors (#166439 ) Trying to keep main as clean of type errors as possible until we are able to swtich to just one checker. This adds suppressions for existing type errors on main. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166439 Approved by: https://github.com/Skylion007	2025-10-29 02:08:02 +00:00
Shunting Zhang	3895ce093f	[inductor] add in-kernel nan-check (#166008 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166008 Approved by: https://github.com/eellison	2025-10-28 20:19:10 +00:00
Shunting Zhang	5d0b3e28dc	[inductor] generate fused rms/layer norm bwd (#165370 ) RMS/Layer norm backward would generated 2 kind of reductions: - the reduction computing dx which reduce across the hidden dimension (in the context of transformer) - the reduction computing dw/db which reduce across the BxT (batch size , sequence length) dimension. These 2 set of reductions have common input buffers but inductor can not fuse them because of different loop orders. There are multiple sources of custom kernels that implement fused version of such kernel (Liger-Kernel, quack, Paul Zhang's internal post). This PR enable Inductor to generate such kernels automatically. The generated kernel is very similar to `33924d20b6/src/liger_kernel/ops/rms_norm.py (L114)` . To make the implementation simple and performing, we enable such fusion only if the inner reduction (computing dx) is a persistent reduction. This should be true for representative inputs. Persistent reduction is critical for the perf here to make sure a loaded tensor does not need to be reload. To make sure the inner reduction (computing dx) and outer reductions (computing dw/db) being fusible, the PR does the following: 1. convert the outer reductions to pointwise by replacing 'reduction' & 'store_reduction' node with a new type of node 'parital_accumulate'. The new node will collect the reduction type, buffer name, input of reduction etc, which is essential for proper codegening. 2. do loop reordering (rely on the earlier loop ordering after fusion work) to reorder the loops of the converted pointwise so it can be fused with the inner reduction 3. there can be epilogues that need to be added in the end. E.g. the outer reduction may be followed by a division for mean , or followed by a down cast if dw/db is in low precision (fp16/bf16). Some early benchmarking on H100 shows about 2X speedup for both RMSNorm and LayerNorm backward for shape (1152 * 500, 384 ) used in some internal model. Note that, I manually disable split reduction in this benchmarking since otherwise the fusion will be skipped right now. The next PR will make the mix-order-reduction compose better with split reduction Pull Request resolved: https://github.com/pytorch/pytorch/pull/165370 Approved by: https://github.com/jansel ghstack dependencies: #166204	2025-10-28 05:53:52 +00:00
Shunting Zhang	a04edcb27a	[inductor] a few workspace api change (#166204 ) A few workspace API changes: 1. return outer name when creating. Usually a use case does not care about outer name. But for mix-order-reduction (stacked PR), we need it to do the next-layer of reduction on the workspace tensor 2. be able to override workspace tensor dtype 3. be able to delay the deallocation of workspace tensors in TritonKernel.call_kernel since they may be used after the call. The lifetime of the workspace tensors are only enlarged a little bit. They would be deallocated once the next layer reduction is done. Test with the stacked PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166204 Approved by: https://github.com/jansel	2025-10-27 18:10:23 +00:00
Maggie Moss	9940e894ea	Fix pyrefly ignore syntax in _inductor (#166247 ) Ensures pyrefly ignores only ignore the intended error code. pyrefly check lintrunner Pull Request resolved: https://github.com/pytorch/pytorch/pull/166247 Approved by: https://github.com/oulgen	2025-10-27 02:48:42 +00:00
Maggie Moss	eb83c3ca23	Clean up unused Pyrefly suppressions (#166178 ) Cleaning up ignores that are no longer needed in the repo and adding select suppressions so the main branch is clean. test plan: `lintrunner -a` Pull Request resolved: https://github.com/pytorch/pytorch/pull/166178 Approved by: https://github.com/oulgen	2025-10-25 05:32:21 +00:00
Yuanyuan Chen	3255e7872b	Enable all flake8-logging-format rules (#164655 ) These rules are enabled by removing existing suppressions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164655 Approved by: https://github.com/janeyx99, https://github.com/mlazos	2025-10-19 00:59:28 +00:00
Maggie Moss	f02e3947f6	Expand type checking to mypy strict files (#165697 ) Expands Pyrefly type checking to check the files outlined in the mypy-strict.ini configuration file: Pull Request resolved: https://github.com/pytorch/pytorch/pull/165697 Approved by: https://github.com/ezyang	2025-10-18 04:34:45 +00:00
PyTorch MergeBot	2928c5c572	Revert "Pyrefly suppressions 2 (#165692 )" This reverts commit `43d78423ac`. Reverted https://github.com/pytorch/pytorch/pull/165692 on behalf of https://github.com/seemethere due to This is causing merge conflicts when attempting to land internally, see D84890919 for more details ([comment](https://github.com/pytorch/pytorch/pull/165692#issuecomment-3416397240))	2025-10-17 17:13:04 +00:00
Maggie Moss	43d78423ac	Pyrefly suppressions 2 (#165692 ) This is the last directory to opt in for the regular mypy.ini file. Will put up a diff to remove unused ignores before making sure we're also type checking all the files in the mypy strict configurations Test plan: dmypy restart && python3 scripts/lintrunner.py -a pyrefly check step 1: delete lines in the pyrefly.toml file from the project-excludes field step 2: run pyrefly check step 3: add suppressions, clean up unused suppressions before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199 after: INFO 0 errors (6,884 ignored) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165692 Approved by: https://github.com/oulgen	2025-10-17 04:15:25 +00:00
Maggie Moss	5641de7b6b	Add suppressions for _inductor/codegen (#165659 ) Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283 Test plan: dmypy restart && python3 scripts/lintrunner.py -a pyrefly check step 1: delete lines in the pyrefly.toml file from the project-excludes field step 2: run pyrefly check step 3: add suppressions, clean up unused suppressions before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199 after: INFO 0 errors (6,884 ignored) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165659 Approved by: https://github.com/oulgen	2025-10-16 21:37:37 +00:00
Yuanyuan Chen	fbe0d20a17	[2/N] More ruff SIM fixes (#165031 ) This is follow-up of #164695 to apply ruff SIM rules to more files. Most changes are about simplifying dict.get because None is already the default value. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165031 Approved by: https://github.com/mlazos	2025-10-14 14:22:54 +00:00
nullplay	ac529df244	Native matmul (#157743 ) ### Implementation of #151705 This PR introduces the initial implementation of native `tl.dot` support in Inductor, with the goal of generating Triton matmul kernels directly—without relying on predefined templates. To avoid complexity and ease the review process, I plan to split this work into two phases as outlined in #151705: 1. Basic support (this PR) 2. Lazy broadcasting for optimal performance (future PR) ### Summary of This PR This PR implements the basic functionality. It does not include lazy broadcasting, so the generated kernels may involve explicit `tl.reshape` and `tl.trans` operations before calling `tl.dot`, which introduces some overhead. ### Notable Changes 1. Adds a new config flag: `config.triton.enable_native_matmul` 2. Introduces a new `ops.dot` IR node in Inductor and lowers `aten.mm` and `aten.bmm` to it when native matmul is enabled 3. Enforces tililng suitable for matmul when the native matmul flag is enabled 4. Implements code generation for `ops.dot` 5. Adds Triton autotuning heuristics: for now, I’ve copied the configuration from the existing matmul templates. However, this may not be optimal—it currently takes a long time to tune, and I think there must be a better way to tackle this. @eellison @jansel @PaulZhang12 @shunting314 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157743 Approved by: https://github.com/jansel	2025-10-14 04:22:30 +00:00
PyTorch MergeBot	b8be796a57	Revert "[2/N] More ruff SIM fixes (#165031 )" This reverts commit `38095fbd13`. Reverted https://github.com/pytorch/pytorch/pull/165031 on behalf of https://github.com/albanD due to One of the changed line started to fail on trunk ([comment](https://github.com/pytorch/pytorch/pull/165031#issuecomment-3390190870))	2025-10-10 13:42:14 +00:00
Yuanyuan Chen	38095fbd13	[2/N] More ruff SIM fixes (#165031 ) This is follow-up of #164695 to apply ruff SIM rules to more files. Most changes are about simplifying dict.get because None is already the default value. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165031 Approved by: https://github.com/mlazos	2025-10-10 05:37:46 +00:00
PaulZhang12	600267ea56	Add num_store to inductor_meta and use it to scale persistent reduction x block (#162446 ) Scale up XBLOCK for contiguous persistent reductions based on rnumel and number of loads + stores <img width="928" height="656" alt="Screenshot 2025-09-18 at 5 02 57 PM" src="https://github.com/user-attachments/assets/ec3c561f-2a3f-4459-9e14-653715898da3" /> Differential Revision: [](https://our.internmc.facebook.com/intern/diff/) Differential Revision: [](https://our.internmc.facebook.com/intern/diff/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162446 Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314 ghstack dependencies: #162296	2025-10-06 14:29:07 +00:00
PyTorch MergeBot	5d7360bb03	Revert "Enable all SIM rules except disabled ones (#164645 )" This reverts commit `321e602692`. Reverted https://github.com/pytorch/pytorch/pull/164645 on behalf of https://github.com/izaitsevfb due to causes lint failures ([comment](https://github.com/pytorch/pytorch/pull/164645#issuecomment-3369274351))	2025-10-05 19:32:21 +00:00
Yuanyuan Chen	321e602692	Enable all SIM rules except disabled ones (#164645 ) `SIM` rules are useful for simplifying boolean expressions and enhances code readability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164645 Approved by: https://github.com/ezyang	2025-10-05 07:38:25 +00:00
Isuru Fernando	8c590cab9d	[inductor] add a runtime assert for triton shapes (#164242 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164242 Approved by: https://github.com/eellison, https://github.com/mlazos ghstack dependencies: #164241	2025-10-01 18:55:33 +00:00
PyTorch MergeBot	20edc5b26a	Revert "Add num_store to inductor_meta and use it to scale persistent reduction x block (#162446 )" This reverts commit `22c5e8c17c`. Reverted https://github.com/pytorch/pytorch/pull/162446 on behalf of https://github.com/PaulZhang12 due to perf regression in https://github.com/pytorch/pytorch/issues/164301#issuecomment-3354028620 ([comment](https://github.com/pytorch/pytorch/pull/162446#issuecomment-3357164274))	2025-10-01 16:23:03 +00:00
karthickai	8c98aee436	[Inductor] Update DeviceAssert op to behave like store (#163696 ) Updated the DeviceAssert operation to match the behavior of Store, it will fixes the issue mentioned in [this PR](https://github.com/pytorch/pytorch/pull/163023) and updated testcases as Elias [suggested](https://github.com/pytorch/pytorch/pull/160677#discussion_r2353834646). Pull Request resolved: https://github.com/pytorch/pytorch/pull/163696 Approved by: https://github.com/mlazos	2025-09-24 23:35:56 +00:00
Jason Ansel	3e1b1a30f2	Revert "[inductor] Fix issue with scalar arg handling" (#163737 ) This reverts commit a8cd437183142e17ba6fc8d7b5e9dcee462d7904. See https://github.com/pytorch/pytorch/pull/163481#issuecomment-3326310774 This PR might also cause issues with cudagraphs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163737 Approved by: https://github.com/ezyang ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434, #163393, #163412, #163422, #163481, #163520, #163482	2025-09-24 07:33:12 +00:00
Jason Ansel	ca512af3e7	[inductor] Fix issue with scalar arg handling (#163481 ) Fixes #163420 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163481 Approved by: https://github.com/eellison ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434, #163393, #163412, #163422	2025-09-24 02:52:36 +00:00
PaulZhang12	22c5e8c17c	Add num_store to inductor_meta and use it to scale persistent reduction x block (#162446 ) Scale up XBLOCK for contiguous persistent reductions based on rnumel and number of loads + stores <img width="928" height="656" alt="Screenshot 2025-09-18 at 5 02 57 PM" src="https://github.com/user-attachments/assets/ec3c561f-2a3f-4459-9e14-653715898da3" /> Differential Revision: [](https://our.internmc.facebook.com/intern/diff/) Differential Revision: [](https://our.internmc.facebook.com/intern/diff/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162446 Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314 ghstack dependencies: #162296	2025-09-23 20:36:39 +00:00
Ruben Rodriguez Buchillon	25f1a5d8d1	[inductor][ez] add src_hash property for Templates (#161468 ) # why enable caching/overriding/filtering based on src hash later # what - KernelTemplate has a src_hash that is None by default - sha256 on TritonTemplate of the template src code - None on ExternKernelChoice to have same API # testing n/a (not in use in this change) Differential Revision: [](https://our.internmc.facebook.com/intern/diff/) Differential Revision: [D81821149](https://our.internmc.facebook.com/intern/diff/D81821149) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161468 Approved by: https://github.com/eellison ghstack dependencies: #161351, #161350, #162293	2025-09-12 21:10:45 +00:00
Blaine Burton Rister	9aedb3cd87	[AOTI-FX] Support registering custom FX backends (#162317 ) # Feature Currently, `torch._inductor.compile_aot` always uses the `WrapperFxCodegen` class. In contrast, Python and C++ codegen allow users to register custom backends. This PR brings that feature to FX codegen. # Test plan Added a CI test registering a custom FX backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162317 Approved by: https://github.com/jansel	2025-09-06 07:32:03 +00:00
Boyuan Feng	771f369448	[Inductor] Improve RoPE (#161420 ) This PR fuses ROPE from 2 kernels into 1 kernel. Shape: ``` q: [B, Hq, S, D] k: [B, Hkv, S, D] ``` `Hq=32, Hkv=8, D=128` following Llama3 setting. <img width="980" height="624" alt="image" src="https://github.com/user-attachments/assets/652a8227-6f1d-465c-97fd-2b0af41f8ed9" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/161420 Approved by: https://github.com/shunting314	2025-09-05 20:55:20 +00:00
Ruben Rodriguez Buchillon	d63ad53a99	[inductor][ez] return choicecallers directly (#161345 ) # why - remove repeat patterns - we have everything to make the choicecallers - templates - input_nodes - layouts - all the kwargs # what - yield a choicecaller directly from V.choices.get_mm_configs # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520577](https://our.internmc.facebook.com/intern/diff/D81520577) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161345 Approved by: https://github.com/jansel ghstack dependencies: #162075, #161340, #161341, #161342, #161343, #161344	2025-09-05 18:02:38 +00:00
Ruben Rodriguez Buchillon	4902c76c65	[inductor][ez] add template/externchoice uid (#161341 ) # why - to have a central registry of templates/externkernelchoice to match them to heuristics etc, they need unique names - mm is both the triton template name and the aten_mm name # what - add a uid() to KernelTemplate/ExternKernelChoice that returns name - override in ExternKernel to prepend "aten::" - override in TritonTemplate to prepend "triton::" This id is just use to find template heuristics, so it has no other impact # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520579](https://our.internmc.facebook.com/intern/diff/D81520579) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161341 Approved by: https://github.com/jansel, https://github.com/eellison ghstack dependencies: #162075, #161340	2025-09-05 18:01:58 +00:00
Isuru Fernando	f305019377	[inductor] propagate shapes in CSEVariable (#152198 ) Fixes #149905 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152198 Approved by: https://github.com/eellison	2025-08-19 16:46:38 +00:00
angelayi	bab79824cb	[aoti-fx] Initial AOTInductor FX (#160765 ) Using the existing WrapperFxCodegen backend, this PR prototypes an AOT version of it which will directly return a graph module. How to use: ```python exported_gm = torch.export.export(model, inp, dynamic_shapes=dynamic_shapes).module() compiled_gm = torch._inductor.aot_compile( exported_gm, inp, options={"fx_wrapper": True, "compile_threads": 1} ) assert torch.allclose(model(inp), compiled_gm(inp)) ``` The motivation behind this is that backends like ExecuTorch/MTIA would like to use inductor's optimization technologies, but might have their own graph lowering pipelines so they might not want to use AOTI (which generates an so). Pull Request resolved: https://github.com/pytorch/pytorch/pull/160765 Approved by: https://github.com/jansel	2025-08-18 18:14:08 +00:00
David Berard	62bac07981	[inductor][triton] support profile_scratch launcher arg (#159772 ) This adds support for Triton after https://github.com/triton-lang/triton/pull/7258 landed. https://github.com/triton-lang/triton/pull/7258 adds a new argument to all the Triton kernels - a profile_scratch argument, similar to global_scratch. This PR updates the static cuda launcher and the AOTI kernel callers to pass in these arguments when calling the Triton kernel. Tests: https://github.com/pytorch/pytorch/pull/159158. I also verified these test locally with triton 3.2, 3.3, and 3.4. Fixes: * static_cuda_launcher (test/repro: `python tools/dynamo/verify_dynamo.py`) * AOTI calling logic (test/repro: `TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_linalg_vander_cuda_float32`) Differential Revision: [D79825121](https://our.internmc.facebook.com/intern/diff/D79825121) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159772 Approved by: https://github.com/NikhilAPatel, https://github.com/eellison	2025-08-08 14:27:38 +00:00
Markus Hoehnerbach	e167c7d0f3	[inductor] allocate non-blocking copy destinations in pinned memory (#155121 ) (#158758 ) Fixes #155121 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158758 Approved by: https://github.com/EikanWang, https://github.com/eellison	2025-08-07 17:07:26 +00:00
PyTorch MergeBot	83ba3f1101	Revert "[inductor] allocate non-blocking copy destinations in pinned memory (#155121 ) (#158758 )" This reverts commit `6085bf7565`. Reverted https://github.com/pytorch/pytorch/pull/158758 on behalf of https://github.com/davidberard98 due to I need to revert #158462 (it causes device-side asserts), and this PR causes a merge conflict in the test file. Sorry about that! ([comment](https://github.com/pytorch/pytorch/pull/158758#issuecomment-3152490371))	2025-08-04 21:47:11 +00:00
Markus Hoehnerbach	6085bf7565	[inductor] allocate non-blocking copy destinations in pinned memory (#155121 ) (#158758 ) Fixes #155121 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158758 Approved by: https://github.com/EikanWang, https://github.com/eellison	2025-08-04 21:22:11 +00:00
anwang	c55e72bea1	[Re-land][Inductor] Support native Inductor as backend for MTIA (#159211 ) The previous [diff/PR] (https://github.com/pytorch/pytorch/pull/158526) was reverted due to this docstring lint error: <img width="1736" height="722" alt="image" src="https://github.com/user-attachments/assets/216b1720-4002-48da-b5f3-32b5d48aaa54" /> I didn't add the docstring cause I thought I'm not supposed to add docstring for an EXISTING function. So this diff/PR is an exactly copy of the previous one, except for adding the docstring. ------------- This diff/PR includes the changes to support native Inductor integration for MTIA. The goal is to support `torch.compile(backend="inductor")` for MTIA. Inductor should generate code(triton kernel + python wrapper code) similar to CUDA. And the triton kernels can be launched eagerly. The changes include: - Add MTIA device interfaces used by Dynamo and Inductor, including APIs on device, stream, event, etc. - Add required torch.mtia APIs, like is_bf16_supported, memory_allocated, set_stream_by_id, etc. - MTIA specific codegen logic, for example, loading MTIA dynamic_library. - Other necessary changes to integrate with Inductor codegn, following other devices like CUDA, XPU. - Integrate with the [empty_strided_mtia](https://www.internalfb.com/code/fbsource/[0d017d3a4a1bdff7253f9c66a9f38e77bd62166b]/fbcode/caffe2/aten/src/ATen/native/mtia/EmptyTensor.cpp?lines=49%2C63%2C71%2C74%2C78) API that we’ve added for the new MTIA ATen backend. - A change in Inductor runtime to avoid re-initialize MTIADriver. - BUCK changes to include ATen-mtia in Inductor, and to use -USE_MTIA preprocessor flag. - Update `test_mnist_e2e.py` to cover native Inductor as backend, using the `--use_native_inductor` flag. - Add a personal script(`scripts/anwang/run_native_inductor_script.py`) for testing purpose. Note: - This approach(option 3) aims to provide a pytorch native approach of Inductor integration for MTIA, minimizing the onboarding overhead. The downside of this approach is that it doesn't leverage MTIA specific graph optimization, and is limited to eagerly launch overhead. - MTIA will support another approach(option 2) to provide best performance, based on WrapperFxCodegen. We should be able to reuse the fundamental changes of this diff for option 2, like the device interfaces, steam/event APIs, etc, especially as WrapperFxCodegen inherits PythonWrapperCodegen. Internal: References: - [post for context](https://fb.workplace.com/groups/mtiasw/permalink/1718377262384606/) - [Inductor integration discussion(option 1/2/3)](https://docs.google.com/document/d/1p6363OXtVIRv1hPoaKlRSK3j-iir3QIbDd5bjyqCNig/edit?tab=t.0#heading=h.7s4ns6wcnhmb) - [Project design doc(option 3)](https://docs.google.com/document/d/1jXUmhgoV9WvkMf-bcY3Od_kK9K_RDOdgHdt1LoQ5Tc4/edit?tab=t.0#heading=h.y43gwdqlv46w) - [early prototying diff](https://www.internalfb.com/diff/D75110196) - [MPS integration PR](https://github.com/pytorch/pytorch/pull/153959) - [empty_strided_xpu PR](https://github.com/pytorch/pytorch/pull/126678) Differential Revision: [D79040806](https://our.internmc.facebook.com/intern/diff/D79040806/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159211 Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/jansel	2025-07-29 17:03:24 +00:00
PyTorch MergeBot	fe0ff12dab	Revert "[Inductor] Support native Inductor as backend for MTIA (#158526 )" This reverts commit `cd68559d04`. Reverted https://github.com/pytorch/pytorch/pull/158526 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158526#issuecomment-3122186057))	2025-07-26 17:58:00 +00:00
anwang	cd68559d04	[Inductor] Support native Inductor as backend for MTIA (#158526 ) This diff/PR includes the changes to support native Inductor integration for MTIA. The goal is to support `torch.compile(backend="inductor")` for MTIA. Inductor should generate code(triton kernel + python wrapper code) similar to CUDA. And the triton kernels can be launched eagerly. The changes include: - Add MTIA device interfaces used by Dynamo and Inductor, including APIs on device, stream, event, etc. - Add required torch.mtia APIs, like is_bf16_supported, memory_allocated, set_stream_by_id, etc. - MTIA specific codegen logic, for example, loading MTIA dynamic_library. - Other necessary changes to integrate with Inductor codegn, following other devices like CUDA, XPU. - Integrate with the [empty_strided_mtia](https://www.internalfb.com/code/fbsource/[0d017d3a4a1bdff7253f9c66a9f38e77bd62166b]/fbcode/caffe2/aten/src/ATen/native/mtia/EmptyTensor.cpp?lines=49%2C63%2C71%2C74%2C78) API that we’ve added for the new MTIA ATen backend. - A change in Inductor runtime to avoid re-initialize MTIADriver. - BUCK changes to include ATen-mtia in Inductor, and to use -USE_MTIA preprocessor flag. - Update `test_mnist_e2e.py` to cover native Inductor as backend, using the `--use_native_inductor` flag. - Add a personal script(`scripts/anwang/run_native_inductor_script.py`) for testing purpose. Note: - This approach(option 3) aims to provide a pytorch native approach of Inductor integration for MTIA, minimizing the onboarding overhead. The downside of this approach is that it doesn't leverage MTIA specific graph optimization, and is limited to eagerly launch overhead. - MTIA will support another approach(option 2) to provide best performance, based on WrapperFxCodegen. We should be able to reuse the fundamental changes of this diff for option 2, like the device interfaces, steam/event APIs, etc, especially as WrapperFxCodegen inherits PythonWrapperCodegen. Internal: References: - [post for context](https://fb.workplace.com/groups/mtiasw/permalink/1718377262384606/) - [Inductor integration discussion(option 1/2/3)](https://docs.google.com/document/d/1p6363OXtVIRv1hPoaKlRSK3j-iir3QIbDd5bjyqCNig/edit?tab=t.0#heading=h.7s4ns6wcnhmb) - [Project design doc(option 3)](https://docs.google.com/document/d/1jXUmhgoV9WvkMf-bcY3Od_kK9K_RDOdgHdt1LoQ5Tc4/edit?tab=t.0#heading=h.y43gwdqlv46w) - [early prototying diff](https://www.internalfb.com/diff/D75110196) - [MPS integration PR](https://github.com/pytorch/pytorch/pull/153959) - [empty_strided_xpu PR](https://github.com/pytorch/pytorch/pull/126678) Differential Revision: [D78458745](https://our.internmc.facebook.com/intern/diff/D78458745/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158526 Approved by: https://github.com/blaine-rister, https://github.com/jansel, https://github.com/eellison	2025-07-26 08:16:34 +00:00
Mwiza Kunda	d3d9bc1c31	[inductor] Allow backends to register their own custom config object (#158254 ) An out of tree backend can have its own configuration options that the user can enable to control inductor compilation. These config options need to be taken into account when calculating the key that is used to determine cache miss / hits. This PR allows out of tree backends to specify a custom config module that has the same type as `torch._inductor.config` that can be used to control codegen (in addition to the default config), and will be used when creating the cache key. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158254 Approved by: https://github.com/eellison	2025-07-23 15:56:06 +00:00
David Berard	b6c00dfe24	[user triton] AOT inductor support for device-side TMA (#155896 ) Tests: `python test/inductor/test_aot_inductor.py -vvv -k device_tma` Device-side TMA in Triton allows the kernel author to construct the TMA descriptor on the device (which composes with things like autotuning much better). However, it also requires a scratch space to be provided into which the TMA descriptor will be constructed. In the new TMA API (tl.make_tensor_descriptor), this is implemented using a "global scratch space" - a tensor which is allocated beforehand and then passed in as an argument for the kernel. To support this in AOTI, this PR: * records the global scratch space needed (triton_heuristics.py), so that it can be used during AOTI codegen * allocates global scratch, if needed (cuda/device_op_overrides.py) * plumbs `device_idx_` into the triton caller function, so that global scratch can be allocated on the right device) * updates tests to verify this works for dynamically shaped inputs This PR should support both inductor-generated device-side TMA (e.g. persistent TMA mm) and user-defined triton kernels that contain device-side TMA (which is the test I ran to verify this works) Note: this overrides any user-provided allocator function (typically with eager triton code, the user must provide their own custom allocator function that is used to allocate scratch space). For Meta reviewers, here is a tlparse from running `python test/inductor/test_aot_inductor.py -vvv -k test_triton_kernel_on_device_tma_dynamic_True_tma_version_new_cuda` https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpFg13g1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 Differential Revision: [D77352139](https://our.internmc.facebook.com/intern/diff/D77352139) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155896 Approved by: https://github.com/desertfire	2025-06-27 04:28:04 +00:00
Xuehai Pan	6ff6630375	[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313 Approved by: https://github.com/jingsh	2025-06-23 02:57:12 +00:00
PyTorch MergeBot	f1331f3f1b	Revert "[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313 )" This reverts commit `3627270bdf`. Reverted https://github.com/pytorch/pytorch/pull/156313 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](`c95f7fa874`) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))	2025-06-22 12:31:57 +00:00
Xuehai Pan	3627270bdf	[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313 Approved by: https://github.com/jingsh	2025-06-22 08:43:09 +00:00
Nikita Shulga	bb1f3d1a55	[MPSInductor] Improve `_default` dtype inference (#156121 ) By just adding 'mps' as one of the backend options and fixing reduction op to actually return tuple of CSEVariable's rather than tuple of strings Test plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/156121 Approved by: https://github.com/dcci	2025-06-16 23:11:53 +00:00
David Berard	bc9b8ea230	[user triton] JIT inductor support for new host-side TMA api (#155814 ) This PR adds JIT inductor support for user-defined triton kernels using the new host-side TMA api. * handle TensorDescriptor.from_tensor in ir.py * codegen TensorDescriptor.from_tensor in wrapper.py * generate the right signature for functions that take TensorDescriptor arguments (i.e. in the @triton_heuristics.user_autotune decorator) AOTI support is not implemented yet. Tests: ran test_triton_kernels.py w/ both Triton 3.3 and 3.4 and there were no failures. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155814 Approved by: https://github.com/aakhundov ghstack dependencies: #155777	2025-06-15 20:24:19 +00:00
Marcin Pioch	ce79056471	Custom FX pass for inductor's backend registration (#154841 ) This PR is related to RFC #153532. It is an extension to Inductor's backend registration interface to allow to register custom FX passes by the backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154841 Approved by: https://github.com/jansel Co-authored-by: Jason Ansel <jansel@jansel.net>	2025-06-14 17:29:54 +00:00
Oguz Ulgen	d1947a8707	Migrate from lru_cache to cache (#155613 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155613 Approved by: https://github.com/ezyang ghstack dependencies: #155612	2025-06-11 19:44:18 +00:00
PyTorch MergeBot	79bdafe5b6	Revert "Custom FX pass for inductor's backend registration (#154841 )" This reverts commit `e694280d12`. Reverted https://github.com/pytorch/pytorch/pull/154841 on behalf of https://github.com/clee2000 due to failing some tests internally D76135706 ([comment](https://github.com/pytorch/pytorch/pull/154841#issuecomment-2956357711))	2025-06-09 16:56:45 +00:00
Marcin Pioch	e694280d12	Custom FX pass for inductor's backend registration (#154841 ) This PR is related to RFC #153532. It is an extension to Inductor's backend registration interface to allow to register custom FX passes by the backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154841 Approved by: https://github.com/jansel Co-authored-by: Jason Ansel <jansel@jansel.net>	2025-06-06 06:49:44 +00:00
eellison	0827464002	Replace runtime type parameterization (#155221 ) See: ``` >>> import timeit; print(f"OrderedSet[str](): {timeit.timeit('OrderedSet[str]()', setup='from torch.utils._ordered_set import OrderedSet', number=1000000):.6f}s, OrderedSet(): {timeit.timeit('OrderedSet()', setup='from torch.utils._ordered_set import OrderedSet', number=1000000):.6f}s") ``` > `OrderedSet[str]()`: 0.354622s, OrderedSet(): 0.095376s Type parameterization should be on type hint, not in runtime. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155221 Approved by: https://github.com/Skylion007, https://github.com/jansel	2025-06-05 21:43:54 +00:00

1 2 3 4 5 ...

451 Commits