# Add workspace buffer support for Triton templates
## Summary
Adds support for Triton templates to allocate and use temporary workspace buffers.
## Key Changes
- Add `WorkspaceArg` support in Triton template system
- Automatic workspace allocation/deallocation around kernel execution
- Zero-initialization support for workspace buffers
- Seamless integration with existing tensor management
## Example Usage
```python
def generate(self, ...):
    workspace_arg = WorkspaceArg(
        count=1024 * 1024,  # 1 MB workspace
        zero_fill=True,     # zero-initialized
    )
    return TritonTemplateCaller(..., workspace_arg=workspace_arg)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138050
Approved by: https://github.com/Chillee, https://github.com/eellison
Summary:
- This diff introduces a `dtype` attribute on `TritonCSEVariable` and a dtype propagation helper function that infers the output dtype from the input dtypes for each op.
- A follow-up diff will use this `dtype` information in `TritonCSEVariable` to perform dtype-aware codegen.
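A rough sketch of what such a propagation helper could look like (the helper name and op list are illustrative, not the actual Inductor implementation):
```python
import torch

# Hypothetical sketch of a dtype propagation rule, assuming each input carries
# a .dtype attribute (as TritonCSEVariable would after this diff). This is not
# the actual Inductor helper; names and the op list are illustrative.
def promote_op_dtype(op_name: str, *inputs) -> torch.dtype:
    input_dtypes = [x.dtype for x in inputs if hasattr(x, "dtype")]
    if op_name in ("eq", "ne", "lt", "le", "gt", "ge", "isnan"):
        return torch.bool  # comparison ops always produce booleans
    result = input_dtypes[0]
    for dt in input_dtypes[1:]:
        result = torch.promote_types(result, dt)  # standard type promotion
    return result
```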
Test Plan: CI
Differential Revision: D61815079
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136778
Approved by: https://github.com/eellison, https://github.com/blaine-rister
The Triton `AttrsDescriptor` object was refactored in https://github.com/triton-lang/triton/pull/4734. These changes add support for the new `AttrsDescriptor` while maintaining backwards compatibility with the existing version. The main changes are different names for the initialization of the descriptor parameters, and creation via a static method instead of the class constructor.
Depends on #137458 which removes some unused logic around the old descriptor. Those changes make this PR cleaner, but if for some reason that old logic is still used I can make adjustments.
Use of the new `AttrsDescriptor` depends on https://github.com/triton-lang/triton/pull/4888
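For illustration, the kind of version shim this implies might look like the following (a hedged sketch; the static-method name and keyword names are assumptions, not the exact Triton API):
```python
# Illustrative compatibility shim: prefer a static-method constructor when the
# installed Triton provides one, otherwise fall back to the legacy constructor.
# The method name "from_dict" and the keyword names are placeholders.
def make_attrs_descriptor(attrs_descriptor_cls, divisible_by_16=(), equal_to_1=()):
    if hasattr(attrs_descriptor_cls, "from_dict"):
        # Newer Triton: construction via a static method
        return attrs_descriptor_cls.from_dict(
            {"divisible_by_16": list(divisible_by_16), "equal_to_1": list(equal_to_1)}
        )
    # Older Triton: direct construction with the legacy keyword arguments
    return attrs_descriptor_cls(
        divisible_by_16=tuple(divisible_by_16), equal_to_1=tuple(equal_to_1)
    )
```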
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137757
Approved by: https://github.com/jansel
The `AttrsDescriptor` class has been present in Triton for almost a year now (introduced [here](72c9833927)), so we should be able to rely on it existing. I am in the process of supporting the new `AttrsDescriptor` class, and @jansel suggested I split the changes to the existing class out separately to make sure nothing breaks when removing the legacy attribute descriptor attributes.
Initially I attempted to remove the branching around detecting whether `AttrsDescriptor` exists, but that breaks because PyTorch must still build without Triton. So, I went back and updated to the naming introduced in the commit linked above, and also removed two unused attributes, `divisible_by_8` and `ids_to_fold`, which were dropped in Feb 2024 (https://github.com/triton-lang/triton/pull/3122 and https://github.com/triton-lang/triton/pull/3080 respectively).
With these changes only the internal workings of the `AttrsDescriptor` class will differ between supported Triton versions, but the data stored will remain consistent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137458
Approved by: https://github.com/jansel
Previously, instances of `SchedulerNode` and `FusedSchedulerNode` would explicitly check whether the compilation target is Triton when codegen'ing debug strings. Generating debug Triton code is instead implemented as a callback set on scheduler nodes by `TritonScheduling`. This makes the codegen more device-agnostic and allows schedulers to customise the codegen output, rather than having it closely coupled to the debug string codegen.
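A minimal sketch of the callback pattern described here (class and attribute names are illustrative, not the actual scheduler API):
```python
# Illustrative sketch: move device-specific debug codegen behind a callback
# set by the backend scheduling class, instead of an explicit target check
# inside the scheduler node. Names are placeholders.
class SchedulerNodeSketch:
    def __init__(self):
        self._debug_str_extra = None  # backend-provided callback, if any

    def set_debug_str_callback(self, fn):
        self._debug_str_extra = fn

    def debug_str(self) -> str:
        lines = ["<node metadata would go here>"]
        if self._debug_str_extra is not None:
            # The backend (e.g. TritonScheduling) decides what extra code to show.
            lines.append(self._debug_str_extra())
        return "\n".join(lines)
```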
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135015
Approved by: https://github.com/jansel
Summary: Previously, `triton_reshape` would generate code containing the `Min` keyword, which is incorrect. This diff updates the `triton_reshape` function to properly expand the `Min` keyword into an expression using `<`.
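For illustration, expanding `Min` into a `<`-based expression amounts to something like the following (a hedged sketch operating on expression strings; the real change lives in the `triton_reshape` codegen path):
```python
# Hedged sketch: rewrite min(a, b) using only comparison operators, matching
# the style of expressions seen in Inductor-generated Triton code,
# e.g. (a) * ((a) <= (b)) + (b) * ((b) < (a)).
def expand_min(a: str, b: str) -> str:
    return f"(({a}) * (({a}) <= ({b})) + ({b}) * (({b}) < ({a})))"

print(expand_min("8", "XBLOCK"))
# ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))
```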
Test Plan:
```
buck2 run @//mode/{opt,mtia,inplace} //glow/fb/fx/fba/tests:test_fba_inductor -- -r test_Min_keyword_in_block_shape
```
Differential Revision: D63850158
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137357
Approved by: https://github.com/blaine-rister, https://github.com/eellison
Adds lowering for `aten.searchsorted`. This entails:
1. Adding support for multi-dimensional bucket tensors to `ops.bucketize`.
2. Adding support for striding to `ops.bucketize`.
3. Adding support for sorting tensors to `ops.bucketize`.
4. Adding a lowering for `aten.searchsorted.Tensor`.
5. Adding a basic decomposition for `aten.searchsorted.Scalar` that calls into the lowering for tensors (a sketch follows this list).
6. Updating the meta-function for `aten.searchsorted` to properly check some of the sizing conditions.
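A rough sketch of the scalar decomposition from item 5 (hedged; the actual decomposition lives in Inductor's decomposition table, and the helper name here is made up):
```python
import torch

# Hedged sketch of decomposing searchsorted with a scalar value into the
# tensor overload: wrap the scalar, call the tensor path, return a 0-d result.
def searchsorted_scalar_sketch(sorted_sequence: torch.Tensor, scalar, **kwargs):
    value = torch.tensor([scalar], dtype=sorted_sequence.dtype,
                         device=sorted_sequence.device)
    return torch.searchsorted(sorted_sequence, value, **kwargs).squeeze(0)

boundaries = torch.tensor([1.0, 3.0, 5.0, 7.0])
print(searchsorted_scalar_sketch(boundaries, 4.0))  # tensor(2)
```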
Closes #135873
Differential Revision: [D63766514](https://our.internmc.facebook.com/intern/diff/D63766514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135701
Approved by: https://github.com/amjames, https://github.com/eellison, https://github.com/davidberard98
This is a retry of https://github.com/pytorch/pytorch/pull/136594, which is having trouble landing.
Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:
`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`
https://github.com/pytorch/pytorch/pull/135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?
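For illustration, the codegen change boils down to emitting a 0-d Triton scalar instead of a rank-1, one-element tensor (a hedged sketch; the helper name is made up):
```python
# Hedged sketch of the emitted constant: a 0-d scalar broadcasts cleanly,
# while the rank-1 form can trip tl.broadcast_to's rank check downstream.
def triton_constant(value, triton_dtype: str = "tl.float64") -> str:
    # Before: f"(tl.full([1], {value}, {triton_dtype}))"  -> shape [1] tensor
    # After:  f"(tl.full([], {value}, {triton_dtype}))"   -> 0-d scalar
    return f"(tl.full([], {value}, {triton_dtype}))"

print(triton_constant("1.00000000000000"))
# (tl.full([], 1.00000000000000, tl.float64))
```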
Differential Revision: [D63540693](https://our.internmc.facebook.com/intern/diff/D63540693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136858
Approved by: https://github.com/atalman
Related issue: #125077
### Feature
Inductor tries to remove dimensions with stride 0 from block pointers. Rather than loading with stride 0, it's more efficient to load a smaller block pointer, then use `tl.broadcast_to` to broadcast it up to the desired size. This already worked for simpler block pointers, but it was disabled for more complex block pointers which used `tl.reshape` to change the dimensionality after loading.
This PR generalizes the approach to work for all block pointers. The idea is to first reshape, adding singleton dimensions, then broadcast those singletons up to something larger, then reshape again to the final output shape. For readability, we emit this code only if it actually does something. Simpler loads will just have `tl.load`.
Here's an example of a complicated kernel that uses `reshape` -> `load` -> `reshape`. (The first reshape is actually the slice `[None,None,:]`).
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 64
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x2 = xindex
x1 = (xindex // 8)
tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
tmp1 = tl.reshape(tl.broadcast_to(tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[((7 + XBLOCK) // 8)], order=[0], offsets=[(xoffset // 8)]), boundary_check=[0], eviction_policy='evict_last')[:, None, None], [((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))]), [XBLOCK])
tmp2 = tmp0 + tmp1
tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tmp2.to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```
Before this PR, we would have stride-0 dimensions:
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 64
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x2 = xindex
x1 = (xindex // 8)
tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
tmp1 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr1, shape=[8, 1, 8], strides=[8, 0, 0], block_shape=[((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))], order=[2, 1, 0], offsets=[(xoffset // 8), 0, xoffset % 8]), boundary_check=[0], eviction_policy='evict_last'), [XBLOCK])
tmp2 = tmp0 + tmp1
tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```
Here's a simpler example where we use 2D tiling. In this case we don't actually need the broadcast; it is implied by a slice adding a new singleton dimension. This code is not changed by this PR, but it's important to confirm that we don't accidentally insert unnecessary broadcasts.
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
ynumel = 8
xnumel = 8
yoffset = tl.program_id(1) * YBLOCK
yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
ymask = yindex < ynumel
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
xmask = xindex < xnumel
x1 = xindex
y0 = yindex
tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1])
tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[YBLOCK], order=[0], offsets=[yoffset]), boundary_check=[0], eviction_policy='evict_last')[None, :]
tmp2 = tmp0 + tmp1
tl.store(tl.make_block_ptr(out_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), tmp2.to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
```
### Test Plan
Added a new expecttest to check the emitted code for broadcast addition. Looking at the test, we can see that stride 0 dimensions are removed. (This test generated the example kernels in the previous section.)
This change also removed a stride-0 dimension in an existing block pointer test. I updated the expected code accordingly.
Bonus: I noticed that the test parametrization for `config.prefer_nd_tiling` wasn't working as intended. It ended up always setting this option to `True`. Fixed it so we get the intended test coverage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135557
Approved by: https://github.com/shunting314, https://github.com/jansel
Co-authored-by: Yueming Hao <yhao@meta.com>
Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:
`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`
https://github.com/pytorch/pytorch/pull/135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?
Differential Revision: [D63465169](https://our.internmc.facebook.com/intern/diff/D63465169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136594
Approved by: https://github.com/mengluy0125, https://github.com/jansel
AMD devices have a warp (wavefront) size of 64; this PR makes the handling of "ELEMENTS_PER_WARP_32" generic by using `DeviceProperties.warp_size` to determine the warp size instead of hard-coding it as 32. It also renames the enum value. Added a unit test for this.
Note: I left the old enum option (`ELEMENTS_PER_WARP_32`) in place rather than renaming it. I'm not sure whether we should expect caches to get invalidated here; if that concern is valid, there's a risk that this code gets updated while some model still uses cached Inductor code referencing `ELEMENTS_PER_WARP_32`, which would then no longer exist.
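A hedged sketch of the generic handling (the function name and fallback are placeholders; the real logic reads `DeviceProperties.warp_size` inside Inductor's Triton heuristics):
```python
# Illustrative only: derive the per-warp element heuristic from the device's
# reported warp size instead of hard-coding 32.
def elements_per_warp(device_props) -> int:
    warp_size = getattr(device_props, "warp_size", None)
    if warp_size is None:
        warp_size = 32  # conservative default if the property is unavailable
    return warp_size    # 32 on most NVIDIA GPUs, 64 on AMD (wavefront size)
```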
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136472
Approved by: https://github.com/jansel
Summary: The previous PR forgot to change two other places that also create `constants` and `signature`.
Test Plan:
Imported from GitHub, without a `Test Plan:` line.
Differential Revision: D63027728
Pulled By: Myrthan
Co-authored-by: Jokeren <robinho364@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136514
Approved by: https://github.com/jansel
Co-authored-by: Jokeren <robinho364@gmail.com>
Fixes#131337
- add `arg_type` for `workspace_arg`; the type is consistent with the one used in `generate_workspace_allocation()`.
- do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead.
- add workspace allocation generation code to `kernel_autotune_calls`. e.g.
```python
workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8)
workspace.zero_()
.....
triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0)
del buf2, arg0_1, arg1_1, workspace
```
- add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code.
The generated cpp has lines like the ones below, so we also implement `zero_()` for `AtenTensorHandle`.
```cpp
static constexpr int64_t int_array_0[] = {1280L, };
static constexpr int64_t int_array_1[] = {1L, };
AtenTensorHandle workspace_handle;
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda, 0, &workspace_handle));
RAIIAtenTensorHandle workspace(workspace_handle);
workspace.zero_();
```
- Fix handling of `grid_fn` for grid computation: pass "RBLOCK" to `split_scan_grid`.
- Fix dynamic shapes:
Without the fix, we generate code like `workspace = empty_strided_cuda((32*((255 + s0) // 256), ), (1, ), torch.uint8)` during Triton autotuning, where `s0` is not defined.
The fix is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for the Triton autotune code. Note that we only realize it for the Triton autotune code, not for the cpp CUDA code (a sketch follows the code comparison below).
- We also generate slightly different cpp code depending on whether `abi_compatible` is turned on.
```cpp
RAIIAtenTensorHandle workspace(workspace_handle);
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get()));
```
vs
```cpp
at::Tensor workspace = at::detail::empty_strided_cuda({8L*(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA);
workspace.zero_();
```
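As referenced in the dynamic-shapes item above, the autotune-only realization can be sketched roughly like this (`V.graph.sizevars.size_hint` is the helper named above; everything else is illustrative):
```python
# Rough sketch: a symbolic workspace size (e.g. one containing s0) cannot be
# emitted verbatim into the autotune harness, where s0 is undefined. For the
# autotune code path only, realize a concrete hint; the cpp/CUDA wrapper keeps
# the symbolic expression.
def autotune_workspace_nbytes(nbytes, sizevars):
    if isinstance(nbytes, int):
        return nbytes
    return sizevars.size_hint(nbytes)  # e.g. V.graph.sizevars.size_hint(nbytes)
```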
Test Plan:
```
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper
TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552
Approved by: https://github.com/desertfire