Commit Graph

62 Commits

Author SHA1 Message Date
Maggie Moss
0be0de4ffa Add type suppressions to _inductor/runtime (#165918)
Original PR that did this was reverted due to merge conflicts.

Trying it again

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165918
Approved by: https://github.com/oulgen
2025-10-21 02:54:22 +00:00
Nichols A. Romero
0bbdd6b8db [ROCm][inductor] heuristic improvements for pointwise kernels (#163197)
Heuristic improvements for pointwise kernels for MI350.

Contributions from several members of the AMD Inductor and Triton teams:
@jataylo @AmdSampsa @iupaikov-amd @xiaohuguo2023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163197
Approved by: https://github.com/PaulZhang12, https://github.com/eellison, https://github.com/jansel

Co-authored-by: AmdSampsa <sampsa.riikonen@amd.com>
Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
2025-10-18 07:23:41 +00:00
PyTorch MergeBot
2928c5c572 Revert "Pyrefly suppressions 2 (#165692)"
This reverts commit 43d78423ac.

Reverted https://github.com/pytorch/pytorch/pull/165692 on behalf of https://github.com/seemethere due to This is causing merge conflicts when attempting to land internally, see D84890919 for more details ([comment](https://github.com/pytorch/pytorch/pull/165692#issuecomment-3416397240))
2025-10-17 17:13:04 +00:00
Maggie Moss
43d78423ac Pyrefly suppressions 2 (#165692)
This opts in the last directory still covered by the regular mypy.ini file. I will put up a diff to remove unused ignores before making sure we're also type checking all the files in the mypy strict configurations.

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions (see the sketch below)
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:
INFO 0 errors (6,884 ignored)
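
A minimal sketch of what a step-3 suppression might look like; the comment syntax and error code here are assumptions for illustration, not taken from this PR:

```py
# Hypothetical step-3 example: pyrefly flags the bad argument type below, and
# a targeted suppression comment silences just that one diagnostic.

def parse(raw: str) -> int:
    return int(raw)


value = parse(42)  # pyrefly: ignore  -- assumed suppression syntax
print(value)
```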

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165692
Approved by: https://github.com/oulgen
2025-10-17 04:15:25 +00:00
Oguz Ulgen
7d0f872cb3 Use union syntax in torch/_inductor runtime and fx_passes (#165652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165652
Approved by: https://github.com/aorenste
2025-10-16 20:51:59 +00:00
Oguz Ulgen
ab6014a903 Codemod inductor/runtime from Optional to union none (#165605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165605
Approved by: https://github.com/aorenste
ghstack dependencies: #165604
2025-10-16 04:59:47 +00:00
Nikita Shulga
a06ec54d40 [MPS] Add API to query GPU core count (#160414)
Using good old IOKit to get the `gpu-core-count` property from the device implementing the `AGXAccelerator` service.
Expose this as `torch.backends.mps.get_core_count()` and make it accessible via `MpsInterface` to the inductor.

Test Plan: Run `python3 -c "import torch;print(torch.backends.mps.get_name(), torch.backends.mps.get_core_count())"` and compare it to `system_profiler SPDisplaysDataType|head -n10`
```
% python3 -c "import torch;print(torch.backends.mps.get_name(), torch.backends.mps.get_core_count())"
Apple M1 Pro 16
% system_profiler SPDisplaysDataType|head -n10
Graphics/Displays:

    Apple M1 Pro:

      Chipset Model: Apple M1 Pro
      Type: GPU
      Bus: Built-In
      Total Number of Cores: 16
      Vendor: Apple (0x106b)
      Metal Support: Metal 3
```

This would significantly improve occupancy for torch.compile-generated kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160414
Approved by: https://github.com/dcci
2025-08-14 00:05:17 +00:00
anwang
c55e72bea1 [Re-land][Inductor] Support native Inductor as backend for MTIA (#159211)
The previous [diff/PR](https://github.com/pytorch/pytorch/pull/158526) was reverted due to this docstring lint error:
[screenshot of the docstring lint error omitted]
I didn't add the docstring because I thought I wasn't supposed to add a docstring to an EXISTING function.

So this diff/PR is an exact copy of the previous one, except for adding the docstring.

-------------
This diff/PR includes the changes to support native Inductor integration for MTIA. The goal is to support `torch.compile(backend="inductor")` for MTIA. Inductor should generate code (Triton kernel + Python wrapper code) similar to CUDA, and the Triton kernels can be launched eagerly.

The changes include:
- Add MTIA device interfaces used by Dynamo and Inductor, including APIs on device, stream, event, etc.
- Add required torch.mtia APIs, like is_bf16_supported, memory_allocated, set_stream_by_id, etc.
- MTIA specific codegen logic, for example, loading MTIA dynamic_library.
- Other necessary changes to integrate with Inductor codegen, following other devices like CUDA, XPU.
- Integrate with the [empty_strided_mtia](https://www.internalfb.com/code/fbsource/[0d017d3a4a1bdff7253f9c66a9f38e77bd62166b]/fbcode/caffe2/aten/src/ATen/native/mtia/EmptyTensor.cpp?lines=49%2C63%2C71%2C74%2C78) API that we’ve added for the new MTIA ATen backend.
- A change in Inductor runtime to avoid re-initialize MTIADriver.
- BUCK changes to include ATen-mtia in Inductor, and to use -USE_MTIA preprocessor flag.
- Update `test_mnist_e2e.py` to cover native Inductor as backend, using the `--use_native_inductor` flag.
- Add a personal script (`scripts/anwang/run_native_inductor_script.py`) for testing purposes.

Note:
- This approach (option 3) aims to provide a PyTorch-native approach to Inductor integration for MTIA, minimizing the onboarding overhead. The downside of this approach is that it doesn't leverage MTIA-specific graph optimizations and is limited by eager launch overhead.
- MTIA will support another approach (option 2) to provide the best performance, based on WrapperFxCodegen. We should be able to reuse the fundamental changes of this diff for option 2, like the device interfaces, stream/event APIs, etc., especially as WrapperFxCodegen inherits from PythonWrapperCodegen.

Internal:
References:
- [post for context](https://fb.workplace.com/groups/mtiasw/permalink/1718377262384606/)
- [Inductor integration discussion(option 1/2/3)](https://docs.google.com/document/d/1p6363OXtVIRv1hPoaKlRSK3j-iir3QIbDd5bjyqCNig/edit?tab=t.0#heading=h.7s4ns6wcnhmb)
- [Project design doc(option 3)](https://docs.google.com/document/d/1jXUmhgoV9WvkMf-bcY3Od_kK9K_RDOdgHdt1LoQ5Tc4/edit?tab=t.0#heading=h.y43gwdqlv46w)
- [early prototying diff](https://www.internalfb.com/diff/D75110196)
- [MPS integration PR](https://github.com/pytorch/pytorch/pull/153959)
- [empty_strided_xpu PR](https://github.com/pytorch/pytorch/pull/126678)

Differential Revision: [D79040806](https://our.internmc.facebook.com/intern/diff/D79040806/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159211
Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/jansel
2025-07-29 17:03:24 +00:00
PyTorch MergeBot
fe0ff12dab Revert "[Inductor] Support native Inductor as backend for MTIA (#158526)"
This reverts commit cd68559d04.

Reverted https://github.com/pytorch/pytorch/pull/158526 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158526#issuecomment-3122186057))
2025-07-26 17:58:00 +00:00
anwang
cd68559d04 [Inductor] Support native Inductor as backend for MTIA (#158526)
This diff/PR includes the changes to support native Inductor integration for MTIA. The goal is to support `torch.compile(backend="inductor")` for MTIA. Inductor should generate code (Triton kernel + Python wrapper code) similar to CUDA, and the Triton kernels can be launched eagerly.

The changes include:
- Add MTIA device interfaces used by Dynamo and Inductor, including APIs on device, stream, event, etc.
- Add required torch.mtia APIs, like is_bf16_supported, memory_allocated, set_stream_by_id, etc.
- MTIA specific codegen logic, for example, loading MTIA dynamic_library.
- Other necessary changes to integrate with Inductor codegen, following other devices like CUDA, XPU.
- Integrate with the [empty_strided_mtia](https://www.internalfb.com/code/fbsource/[0d017d3a4a1bdff7253f9c66a9f38e77bd62166b]/fbcode/caffe2/aten/src/ATen/native/mtia/EmptyTensor.cpp?lines=49%2C63%2C71%2C74%2C78) API that we’ve added for the new MTIA ATen backend.
- A change in Inductor runtime to avoid re-initialize MTIADriver.
- BUCK changes to include ATen-mtia in Inductor, and to use -USE_MTIA preprocessor flag.
- Update `test_mnist_e2e.py` to cover native Inductor as backend, using the `--use_native_inductor` flag.
- Add a personal script (`scripts/anwang/run_native_inductor_script.py`) for testing purposes.

Note:
- This approach (option 3) aims to provide a PyTorch-native approach to Inductor integration for MTIA, minimizing the onboarding overhead. The downside of this approach is that it doesn't leverage MTIA-specific graph optimizations and is limited by eager launch overhead.
- MTIA will support another approach (option 2) to provide the best performance, based on WrapperFxCodegen. We should be able to reuse the fundamental changes of this diff for option 2, like the device interfaces, stream/event APIs, etc., especially as WrapperFxCodegen inherits from PythonWrapperCodegen.

Internal:
References:
- [post for context](https://fb.workplace.com/groups/mtiasw/permalink/1718377262384606/)
- [Inductor integration discussion(option 1/2/3)](https://docs.google.com/document/d/1p6363OXtVIRv1hPoaKlRSK3j-iir3QIbDd5bjyqCNig/edit?tab=t.0#heading=h.7s4ns6wcnhmb)
- [Project design doc(option 3)](https://docs.google.com/document/d/1jXUmhgoV9WvkMf-bcY3Od_kK9K_RDOdgHdt1LoQ5Tc4/edit?tab=t.0#heading=h.y43gwdqlv46w)
- [early prototying diff](https://www.internalfb.com/diff/D75110196)
- [MPS integration PR](https://github.com/pytorch/pytorch/pull/153959)
- [empty_strided_xpu PR](https://github.com/pytorch/pytorch/pull/126678)

Differential Revision: [D78458745](https://our.internmc.facebook.com/intern/diff/D78458745/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158526
Approved by: https://github.com/blaine-rister, https://github.com/jansel, https://github.com/eellison
2025-07-26 08:16:34 +00:00
Oguz Ulgen
d1947a8707 Migrate from lru_cache to cache (#155613)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155613
Approved by: https://github.com/ezyang
ghstack dependencies: #155612
2025-06-11 19:44:18 +00:00
Bin Bao
33a5179269 [AOTI][reland2] Remove typedef for half and bfloat16 (#153467)
Summary:
Reland https://github.com/pytorch/pytorch/pull/151109 after fixing cutlass AOTI build issues.

typedef is prone to name collision. Explicitly spell out the actual aten types, needed for the standalone AOTI codegen.

Differential Revision: D74398762

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153467
Approved by: https://github.com/jingsh, https://github.com/henrylhtsang, https://github.com/cyyever
2025-05-14 02:37:18 +00:00
PyTorch MergeBot
471025c489 Revert "[AOTI][reland] Remove typedef for half and bfloat16 (#151109)"
This reverts commit a0d440a26a.

Reverted https://github.com/pytorch/pytorch/pull/151109 on behalf of https://github.com/wdvr due to causing AOTI test failures - discussed with author ([comment](https://github.com/pytorch/pytorch/pull/151109#issuecomment-2840386483))
2025-04-29 22:37:16 +00:00
Bin Bao
a0d440a26a [AOTI][reland] Remove typedef for half and bfloat16 (#151109)
Summary: Reland https://github.com/pytorch/pytorch/pull/150657

typedef is prone to name collision. Explicitly spell out the actual aten types, needed for the libtorch-free codegen.

Differential Revision: [D72878456](https://our.internmc.facebook.com/intern/diff/D72878456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151109
Approved by: https://github.com/angelayi
2025-04-26 23:17:35 +00:00
PyTorch MergeBot
31162214d8 Revert "[AOTI] Remove typedef for half and bfloat16 (#150657)"
This reverts commit 357814c85c.

Reverted https://github.com/pytorch/pytorch/pull/150657 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/150657#issuecomment-2795042772))
2025-04-10 20:08:03 +00:00
Bin Bao
357814c85c [AOTI] Remove typedef for half and bfloat16 (#150657)
Summary: typedef is prone to name collision. Explicitly spell out the actual aten types, needed for the libtorch-free codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150657
Approved by: https://github.com/malfet
2025-04-09 21:21:17 +00:00
dilililiwhy
86ae672b6a Use has_triton_package in _inductor.runtime.hints (#147442)
Use existing method for triton check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147442
Approved by: https://github.com/Skylion007
2025-02-21 05:52:00 +00:00
David Berard
69e82d02d3 [inductor][3/N] triton support post-#5512, tt.divisibility format (#145575)
1. Fix the tt.divisibility format in hints.py. Previously, it was `{((0,), (1,)): [["tt.divisibility", 16]]}`. Now it is `{(0,): [["tt.divisibility", 16]], (1,): [["tt.divisibility", 16]]}` (see the sketch below). This was an oversight in my first PR. I've verified that we now get `{ tt.divisibility = 16 }` in the generated TTGIR.
2. Update the test_codegen_triton.py test to work with multiple triton versions (and test this divisibility format in the new triton version)
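
A small sketch of the point-1 format change, splitting the old combined key into per-argument entries (hint values taken from the message):

```py
# Old (incorrect) hint: one key covering both argument indices at once.
old = {((0,), (1,)): [["tt.divisibility", 16]]}

# New format: each argument index gets its own entry.
new = {idx: hints for key, hints in old.items() for idx in key}
assert new == {(0,): [["tt.divisibility", 16]], (1,): [["tt.divisibility", 16]]}
```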

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145575
Approved by: https://github.com/SamGinzburg
2025-01-27 21:48:58 +00:00
David Berard
b963ab5325 [inductor][1/N] triton support post-#5512, main components (#145051)
Triton commit 5220 adds tuple support in Triton (changing the indexing format in AttrsDescriptor) and commit 5512 replaces AttrsDescriptor with raw tuples. This is an initial PR to add support for Triton versions after commit 5512 landed.

The main changes in 5220 and 5512 that need to be supported:
* AttrsDescriptor() gets replaced with a raw dict (sketched after this list). The raw dict has the format `{(TUPLES): [["tt.divisibility", 16]]}`, where `(TUPLES)` is a tuple of indices, e.g. `((0,), (1,), (3,))` to indicate that args 0, 1, and 3 are divisible by 16. These indices are, themselves, represented as tuples to support nested inputs (e.g. an argument that's a tuple), but support for tuples is not implemented right now.
* "signature" changes: the signature now contains _all_ args, including constexpr and constant args.
* ASTSource now takes "constexprs" instead of "constants" - for example, equal-to-1 args are constants but not constexprs so we don't need to pass these args as "constants".

What this PR supports:
* Triton versions before Dec 9, 2024, and (partial support for) Triton versions after Jan 1, 2025
* (triton jan 1+) typical inductor-generated triton: updated AttrsDescriptor, signatures, constexpr/constant handling.

What this PR doesn't support (TODO in follow-up PRs):
* Triton versions between Dec 9, 2024 and Jan 1, 2025
* (triton jan 1+) user-defined triton kernel support (this is implemented already in @anmyachev's patch)
* (triton jan 1+) triton_helper support (failing in triton codegen - needs investigation)
* (triton jan 1+) AOTI / cpp wrapper

thanks to @anmyachev for patches in https://github.com/intel/intel-xpu-backend-for-triton/blob/main/scripts/pytorch.patch, which contains most of these changes already
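
For illustration, the raw attribute dict and the expanded signature described above, with made-up argument names (only the dict shapes come from the message):

```py
# Raw-dict attributes (post-5512): args 0, 1, and 3 are divisible by 16.
# Indices are themselves tuples so nested (tuple) args could be addressed.
attrs = {((0,), (1,), (3,)): [["tt.divisibility", 16]]}

# The signature now lists *all* args, including constexpr and constant args.
signature = {
    "in_ptr0": "*fp32",
    "out_ptr0": "*fp32",
    "xnumel": "i32",
    "XBLOCK": "constexpr",
}
print(attrs, signature)
```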

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145051
Approved by: https://github.com/jansel
2025-01-24 00:34:01 +00:00
Aaron Orenstein
bac62341eb PEP585 update - torch/_inductor (#145198)
See #145101 for details.
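
A typical PEP 585 before/after, on an illustrative function (not code from this PR):

```py
from __future__ import annotations

# Before: from typing import Dict, List
# def bucket(xs: List[int]) -> Dict[int, List[int]]: ...


def bucket(xs: list[int]) -> dict[int, list[int]]:  # after: builtin generics
    out: dict[int, list[int]] = {}
    for x in xs:
        out.setdefault(x % 2, []).append(x)
    return out


print(bucket([1, 2, 3, 4]))  # {1: [1, 3], 0: [2, 4]}
```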

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145198
Approved by: https://github.com/bobrenjc93
2025-01-21 21:04:33 +00:00
David Berard
b90231a189 [inductor][BE] don't try/except ImportError for AttrsDescriptor versions (#144807)
motivation: Ed's advice to avoid `except ImportError`: the target module/class might in fact exist, but you might run into a different ImportError whose stacktrace you then silently ignore.

additional motivation: I'm going to add some more cases to this list, and would like to avoid this pattern:
```
try:
   ...
except ImportError:
    try:
        ...
    except ImportError:
        try:
            ...
```

suggestions on better ways to do this would be appreciated!
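
One possible flatter shape, sketched under the assumption that we can probe candidate module paths with `importlib.util.find_spec` instead of nesting `except ImportError`; only `triton.compiler.compiler` is taken from this message, the other path is hypothetical:

```py
import importlib
import importlib.util

_CANDIDATE_MODULES = [
    "triton.backends.compiler",  # hypothetical newer location
    "triton.compiler.compiler",  # location mentioned above (June-era triton)
]


def find_attrs_descriptor():
    if importlib.util.find_spec("triton") is None:  # no triton installed
        return None
    for name in _CANDIDATE_MODULES:
        if importlib.util.find_spec(name) is None:  # candidate absent: skip
            continue
        module = importlib.import_module(name)  # real import errors propagate
        cls = getattr(module, "AttrsDescriptor", None)
        if cls is not None:
            return cls
    return None
```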

test: ran with triton commit e5be006a (last working commit) and 34a6a2ff8 (in june, when AttrsDescriptor was still in triton.compiler.compiler)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144807
Approved by: https://github.com/ezyang
2025-01-16 00:32:29 +00:00
Nikita Shulga
9157a748a6 [MPSInductor] Add dummy properties (#144509)
For compute capability, return an empty string (same as CPU).
For multicore count, return 8, as this is the smallest number of GPU cores on Apple silicon.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144509
Approved by: https://github.com/jansel
2025-01-14 20:12:38 +00:00
Blaine Burton Rister
520ba556cd [Inductor] Refactor "r" reduction prefix to {"r0_", "r1_"}. (#142020)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243.

# Feature

This PR changes the `RINDEX` / `"r"` symbol type to `(R0_INDEX, R1_INDEX)` and `("r0_", "r1_")`, respectively. This allows the relevant code to support 2D (often ND) reductions. Unlike the parent PR, this one does not change the tiling algorithm, so `"r1_"` is never used. However, it prepares other parts of the system to handle `"r1_"` once we start using it. This should significantly reduce the chances of hitting merge conflicts, making the parent PR much easier to land.

The only change to the generated triton code is to rename `"rindex"` -> `"r0_index"`, `"RBLOCK"` -> `"R0_BLOCK"`, etc. To maintain compatibility with existing codegen, this also generates aliases to the old reduction variables like `rindex = r0_index`. If we generated 2D reductions (which this PR will not do), the aliases would be more complicated and would collapse 2D multi-indices to linear indices. See some example kernels in the parent PR.

These aliases can be eliminated by the Triton compiler, and should not impact the final machine code running on the GPU. See the perf testing in the parent PR which confirms the aliases do not impact perf.
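
A tiny sketch of the aliasing described above, as it might appear near the top of a generated kernel (names from the message; the real index computation is omitted):

```py
# New names are defined first...
R0_BLOCK = 64
r0_index = 0  # stand-in for the real roffset + tl.arange(...) computation

# ...and aliases keep existing codegen that says rindex/RBLOCK working.
RBLOCK = R0_BLOCK
rindex = r0_index
```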

# Test plan

The existing CI provides good coverage. This PR modifies the expected code in a few places, renaming reduction variables from `r.*` to `r0_.*`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142020
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@meta.com>
2024-12-12 17:22:20 +00:00
Jason Ansel
81edca08ab [inductor] Refactor some DeviceProperties usage (#142033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142033
Approved by: https://github.com/eellison
ghstack dependencies: #142219
2024-12-07 17:48:45 +00:00
Jason Ansel
0367a31401 [inductor] Minor typing changes (#142219)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142219
Approved by: https://github.com/Skylion007, https://github.com/yanboliang
2024-12-07 17:48:37 +00:00
Jason Ansel
2c6bd9f6f6 [inductor] Support fixed triton configs defined at compile time (#140217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140217
Approved by: https://github.com/shunting314
ghstack dependencies: #139585
2024-11-17 16:10:37 +00:00
PyTorch MergeBot
c0ddd10f6d Revert "[inductor] Support fixed triton configs defined at compile time (#140217)"
This reverts commit 29114e44fa.

Reverted https://github.com/pytorch/pytorch/pull/140217 on behalf of https://github.com/kit1980 due to breaking internal builds, see D65800124 ([comment](https://github.com/pytorch/pytorch/pull/139585#issuecomment-2471392822))
2024-11-12 19:32:14 +00:00
Jason Ansel
29114e44fa [inductor] Support fixed triton configs defined at compile time (#140217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140217
Approved by: https://github.com/shunting314
ghstack dependencies: #139585
2024-11-12 00:56:02 +00:00
Alex Baden
5c6d35482e [Inductor] Support Triton AttrsDescriptor cls field (#139193)
Fixes #139179

Adding corresponding changes to https://github.com/triton-lang/triton/pull/4888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139193
Approved by: https://github.com/bertmaher
2024-10-30 18:16:38 +00:00
Jason Ansel
2b937e4e6d [inductor] Cooperative reductions (#137756)
Example generated code for `(x+y).sum()`:
```py
@triton.jit
def triton_unk_fused_add_sum_0(in_ptr0, in_ptr1, out_ptr0, ws_ptr, semaphores_ptr, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr, RSPLIT : tl.constexpr):
    xnumel = 1
    rnumel = 1048576
    rsplit_id = tl.program_id(0)
    num_rblocks = (rnumel + RBLOCK - 1) // RBLOCK
    rsplit_chunk = (num_rblocks + RSPLIT - 1) // RSPLIT * RBLOCK
    rsplit_start = rsplit_chunk * rsplit_id
    rsplit_end = rsplit_chunk * (rsplit_id + 1)
    xoffset = tl.program_id(1) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = tl.full([XBLOCK, RBLOCK], True, tl.int1)
    rbase = tl.arange(0, RBLOCK)[None, :]
    _tmp4 = tl.full([XBLOCK, RBLOCK], 0, tl.float32)
    for roffset in range(rsplit_start, rsplit_end, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r0 = rindex
        tmp0 = tl.load(in_ptr0 + (r0), rmask, eviction_policy='evict_first', other=0.0)
        tmp1 = tl.load(in_ptr1 + (r0), rmask, eviction_policy='evict_first', other=0.0)
        tmp2 = tmp0 + tmp1
        tmp3 = tl.broadcast_to(tmp2, [XBLOCK, RBLOCK])
        tmp5 = _tmp4 + tmp3
        _tmp4 = tl.where(rmask, tmp5, _tmp4)
    tmp4 = tl.sum(_tmp4, 1)[:, None]
    if RSPLIT > 1:
        tmp4_ws = (ws_ptr + 0).to(tl.pointer_type(tl.float32))
        tl.store(tmp4_ws + (xindex * RSPLIT + rsplit_id), tmp4, None)
    if RSPLIT > 1:
        triton_helpers.gpu_barrier(semaphores_ptr + (2 * tl.program_id(1) + 0), RSPLIT, True)
    if RSPLIT > 1:
        tmp4_peers = tl.load(tmp4_ws + (xindex * RSPLIT + tl.arange(0, RSPLIT)[None,:]), None, eviction_policy='evict_first')
        tmp4 = tl.sum(tmp4_peers, 1)[:, None]
    if rsplit_id == (0 % RSPLIT):
        tl.store(out_ptr0 + (tl.full([XBLOCK, 1], 0, tl.int32)), tmp4, None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137756
Approved by: https://github.com/eellison
2024-10-29 00:45:53 +00:00
PyTorch MergeBot
60d1c7138d Revert "[inductor] Cooperative reductions (#137756)"
This reverts commit fed37dbfbc.

Reverted https://github.com/pytorch/pytorch/pull/137756 on behalf of https://github.com/jeanschmidt due to ROCM tests are timing out :( ([comment](https://github.com/pytorch/pytorch/pull/137756#issuecomment-2441579322))
2024-10-28 13:24:33 +00:00
Jason Ansel
fed37dbfbc [inductor] Cooperative reductions (#137756)
Example generated code for `(x+y).sum()`:
```py
@triton.jit
def triton_unk_fused_add_sum_0(in_ptr0, in_ptr1, out_ptr0, ws_ptr, semaphores_ptr, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr, RSPLIT : tl.constexpr):
    xnumel = 1
    rnumel = 1048576
    rsplit_id = tl.program_id(0)
    num_rblocks = (rnumel + RBLOCK - 1) // RBLOCK
    rsplit_chunk = (num_rblocks + RSPLIT - 1) // RSPLIT * RBLOCK
    rsplit_start = rsplit_chunk * rsplit_id
    rsplit_end = rsplit_chunk * (rsplit_id + 1)
    xoffset = tl.program_id(1) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = tl.full([XBLOCK, RBLOCK], True, tl.int1)
    rbase = tl.arange(0, RBLOCK)[None, :]
    _tmp4 = tl.full([XBLOCK, RBLOCK], 0, tl.float32)
    for roffset in range(rsplit_start, rsplit_end, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r0 = rindex
        tmp0 = tl.load(in_ptr0 + (r0), rmask, eviction_policy='evict_first', other=0.0)
        tmp1 = tl.load(in_ptr1 + (r0), rmask, eviction_policy='evict_first', other=0.0)
        tmp2 = tmp0 + tmp1
        tmp3 = tl.broadcast_to(tmp2, [XBLOCK, RBLOCK])
        tmp5 = _tmp4 + tmp3
        _tmp4 = tl.where(rmask, tmp5, _tmp4)
    tmp4 = tl.sum(_tmp4, 1)[:, None]
    if RSPLIT > 1:
        tmp4_ws = (ws_ptr + 0).to(tl.pointer_type(tl.float32))
        tl.store(tmp4_ws + (xindex * RSPLIT + rsplit_id), tmp4, None)
    if RSPLIT > 1:
        triton_helpers.gpu_barrier(semaphores_ptr + (2 * tl.program_id(1) + 0), RSPLIT, True)
    if RSPLIT > 1:
        tmp4_peers = tl.load(tmp4_ws + (xindex * RSPLIT + tl.arange(0, RSPLIT)[None,:]), None, eviction_policy='evict_first')
        tmp4 = tl.sum(tmp4_peers, 1)[:, None]
    if rsplit_id == (0 % RSPLIT):
        tl.store(out_ptr0 + (tl.full([XBLOCK, 1], 0, tl.int32)), tmp4, None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137756
Approved by: https://github.com/eellison
ghstack dependencies: #138970
2024-10-27 16:31:38 +00:00
Alex Baden
487873f7ca [Inductor]: Support updated Triton AttrsDescriptor (#137757)
The Triton `AttrsDescriptor` object was refactored in https://github.com/triton-lang/triton/pull/4734. These changes add support for the new `AttrsDescriptor` while maintaining backwards compatibility with the existing version. The main changes are different names for the initialization of the descriptor parameters, and creation via a static method instead of the class constructor (sketched below).

Depends on #137458 which removes some unused logic around the old descriptor. Those changes make this PR cleaner, but if for some reason that old logic is still used I can make adjustments.

Use of the new `AttrsDescriptor` depends on https://github.com/triton-lang/triton/pull/4888
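
A sketch of version-tolerant construction under the description above; `from_dict` and the field names are placeholders for whatever the new static method and parameters are actually called:

```py
def make_attrs_descriptor(cls, **fields):
    # Newer triton: build via a static factory method (name assumed here).
    if hasattr(cls, "from_dict"):
        return cls.from_dict(fields)
    # Older triton: the class constructor takes the fields directly.
    return cls(**fields)
```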

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137757
Approved by: https://github.com/jansel
2024-10-15 19:34:59 +00:00
Alex Baden
39d21ed803 [Inductor] Update AttrsDescriptor instantiation for Triton changes (#137458)
The `AttrsDescriptor` class has been present in Triton for almost a year now (introduced [here](72c9833927)), so we should be able to rely on it existing. I am in the process of supporting the new `AttrsDescriptor` class and @jansel suggested I split changes to the existing class out separately to make sure nothing breaks removing the legacy attribute descriptor attributes.

Initially I attempted to remove the branching around detecting whether `AttrsDescriptor` exists, but that breaks because PyTorch must be able to build without Triton. So, I went back and updated for the naming introduced in the commit linked above, and also removed two unused attributes, `divisible_by_8` and `ids_to_fold`, which were removed in Feb 2024 (https://github.com/triton-lang/triton/pull/3122 and https://github.com/triton-lang/triton/pull/3080 respectively).

With these changes only the internal workings of the `AttrsDescriptor` class will differ between supported Triton versions, but the data stored will remain consistent.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137458
Approved by: https://github.com/jansel
2024-10-14 20:20:29 +00:00
xinan.lin
0a26851601 [Inductor] Handle device property warp_size is None but used on XPU. (#136834)
Fix #136820

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136834
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-09-30 02:08:45 +00:00
David Berard
9c2c61d2dd [inductor] ELEMENTS_PER_WARP_32 -> ONE_ELEMENT_PER_THREAD (#136472)
AMD devices have 64 threads per warp; this PR makes the handling of "ELEMENTS_PER_WARP_32" generic and uses DeviceProperties.warp_size to determine the warp size instead of hard-coding it as 32. It also renames the enum value. Added a unit test for this.

Note: I left the old enum option (ELEMENTS_PER_WARP_32) as is instead of renaming it. I'm not sure whether we should expect caches to get invalidated here; if this concern is valid, then there's a risk that the enum would get updated, but some model could use the cached inductor code, which would reference "ELEMENTS_PER_WARP_32", which would no longer exist.
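
A minimal sketch of the generalization, assuming a `warp_size` field like the one on inductor's DeviceProperties (stand-in dataclass below, not the real class):

```py
from dataclasses import dataclass


@dataclass
class DeviceProps:  # stand-in for inductor's DeviceProperties
    warp_size: int  # 32 on NVIDIA, 64 on AMD


def one_element_per_thread_block(props: DeviceProps, num_warps: int) -> int:
    # Block size giving each thread exactly one element, instead of
    # hard-coding 32 * num_warps.
    return props.warp_size * num_warps


print(one_element_per_thread_block(DeviceProps(warp_size=64), num_warps=4))  # 256
```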

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136472
Approved by: https://github.com/jansel
2024-09-25 18:21:09 +00:00
Jack Taylor
a15774563b [ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling (#129663)
As of ROCm 6.1, [hipDeviceProp_t::regsPerMultiprocessor](https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/structhip_device_prop__t.html#a7390d5b180d63978c81aa971060270b4) is now available, allowing us to enable this attribute on ROCm.
```
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='AMD Instinct MI250X/MI250', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104)
>>> torch.cuda.get_device_properties(0).regs_per_multiprocessor
65536
```

With https://github.com/triton-lang/triton/pull/3962 we can extract n_regs and n_spills from a triton binary with the AMD backend, allowing us to enable inductor's dynamic_rblock_scaling on ROCm, initially implemented in https://github.com/pytorch/pytorch/pull/115094

Leaving this in draft until the following PRs have landed:
- https://github.com/pytorch/pytorch/pull/129361 to bump the triton commit pin
- https://github.com/pytorch/pytorch/pull/128449 to allow us to grab warp_size from device properties instead of hard coding 64 on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129663
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-09-13 16:45:39 +00:00
Yichen Yan
c0d2f991b1 Increase TRITON_MAX_BLOCK['X'] (#135181)
Fixes #135028

As title, increase `TRITON_MAX_BLOCK['X']` to 4096 and fix an error, thanks to @Chillee: https://github.com/pytorch/pytorch/pull/133300/files#r1744706189
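
For context, the shape of the constant being bumped; only the `"X"` value comes from this PR, the other entries are illustrative:

```py
# Per-dimension upper bounds inductor allows for triton block sizes (sketch).
TRITON_MAX_BLOCK = {
    "X": 4096,  # raised to 4096 by this PR
    "Y": 1024,  # illustrative
    "Z": 1024,  # illustrative
    "R": 4096,  # illustrative
}
```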

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135181
Approved by: https://github.com/jansel
2024-09-10 05:54:37 +00:00
PyTorch MergeBot
5f981388ec Revert "[ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling (#129663)"
This reverts commit d7a78ec8b9.

Reverted https://github.com/pytorch/pytorch/pull/129663 on behalf of https://github.com/atalman due to Breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/129663#issuecomment-2240011143))
2024-07-19 19:46:26 +00:00
Jack Taylor
d7a78ec8b9 [ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling (#129663)
As of ROCm 6.1, [hipDeviceProp_t::regsPerMultiprocessor](https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/structhip_device_prop__t.html#a7390d5b180d63978c81aa971060270b4) is now available, allowing us to enable this attribute on ROCm.
```
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='AMD Instinct MI250X/MI250', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104)
>>> torch.cuda.get_device_properties(0).regs_per_multiprocessor
65536
```

With https://github.com/triton-lang/triton/pull/3962 we can extract n_regs and n_spills from a triton binary with the AMD backend, allowing us to enable inductor's dynamic_rblock_scaling on ROCm, initially implemented in https://github.com/pytorch/pytorch/pull/115094

Leaving this in draft until the following PRs have landed:
- https://github.com/pytorch/pytorch/pull/129361 to bump the triton commit pin
- https://github.com/pytorch/pytorch/pull/128449 to allow us to grab warp_size from device properties instead of hard coding 64 on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129663
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-07-19 09:45:03 +00:00
Xuehai Pan
973037be6a [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199)
This PR changes the empty collection factory call to Python literals:

- `list()` -> `[]`
- `tuple()` -> `()`
- `dict()` -> `{}`

The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary:

```bash
$ python3 -m dis - <<EOS
import collections

d1 = {}
d2 = dict()

dict = collections.OrderedDict
d3 = dict()
EOS
```

```text
  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (0)
              4 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (collections)
              8 STORE_NAME               0 (collections)

  3          10 BUILD_MAP                0
             12 STORE_NAME               1 (d1)

  4          14 PUSH_NULL
             16 LOAD_NAME                2 (dict)
             18 CALL                     0
             26 STORE_NAME               3 (d2)

  6          28 LOAD_NAME                0 (collections)
             30 LOAD_ATTR                8 (OrderedDict)
             50 STORE_NAME               2 (dict)

  7          52 PUSH_NULL
             54 LOAD_NAME                2 (dict)
             56 CALL                     0
             64 STORE_NAME               5 (d3)
             66 RETURN_CONST             1 (None)
```

The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199
Approved by: https://github.com/malfet
2024-07-11 17:30:28 +00:00
Jason Ansel
0abcca85b7 [halide-backend] Support manual schedules (#129321)
Currently using this for some by-hand hacking, but might need to implement our own scheduler later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129321
Approved by: https://github.com/shunting314
2024-07-03 05:56:40 +00:00
PyTorch MergeBot
a83eaf1c3a Revert "[halide-backend] Support manual schedules (#129321)"
This reverts commit 9ae78a578c.

Reverted https://github.com/pytorch/pytorch/pull/129321 on behalf of https://github.com/jeanschmidt due to Reverting, as it is required to do so in order to revert #129320 ([comment](https://github.com/pytorch/pytorch/pull/129321#issuecomment-2200345664))
2024-07-01 14:42:33 +00:00
Jason Ansel
9ae78a578c [halide-backend] Support manual schedules (#129321)
Currently using this for some by-hand hacking, but might need to implement our own scheduler later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129321
Approved by: https://github.com/shunting314
ghstack dependencies: #126417, #129025, #129026, #127506, #129036, #129320
2024-06-29 14:06:28 +00:00
Jason Ansel
4cb8cb04a7 [halide-backend] Enable bfloat16 support (#129036)
Requires https://github.com/halide/Halide/pull/8255

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129036
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025, #129026, #127506
2024-06-29 14:06:25 +00:00
Jason Ansel
da5f37515e [halide-backend] Generate standalone runtime (#129025)
This puts the halide runtime in a global shared object, rather than copying it to each kernel. Having many copies of the runtime causes many issues with CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129025
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417
2024-06-29 14:06:12 +00:00
Jason Ansel
e34b7e6af3 [halide-backend] Initial implementation of HalideKernel and HalideScheduling (#126417)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126417
Approved by: https://github.com/shunting314, https://github.com/eellison
2024-06-29 14:06:08 +00:00
PyTorch MergeBot
1a54bb0f96 Revert "[halide-backend] Initial implementation of HalideKernel and HalideScheduling (#126417)"
This reverts commit 4f9399bd0d.

Reverted https://github.com/pytorch/pytorch/pull/126417 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/126417#issuecomment-2186999121))
2024-06-24 16:50:15 +00:00
PyTorch MergeBot
063facf352 Revert "[halide-backend] Generate standalone runtime (#129025)"
This reverts commit 10c64c3b49.

Reverted https://github.com/pytorch/pytorch/pull/129025 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/129025#issuecomment-2186995467))
2024-06-24 16:47:25 +00:00
Jason Ansel
10c64c3b49 [halide-backend] Generate standalone runtime (#129025)
This puts the halide runtime in a global shared object, rather than copying it to each kernel. Having many copies of the runtime causes many issues with CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129025
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417
2024-06-22 17:39:52 +00:00