pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Cynthia Yang	960c4b9937	[inductor] Enable triton kernels with unbacked inputs (#164509 ) Summary: We need to pass in fallback value to avoid converting symbols to int original failure log in onefeed Slimper MB - P1973406565 `raise TypeError("Cannot convert symbols to int")` Test Plan: if not passing in fallback value - https://www.internalfb.com/intern/everpaste/?handle=GGeAoh_M11kEGOECAFELOaq8ooRCbswMAAAz `raise TypeError("Cannot convert symbols to int")` ``` buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:unbacked_symints -- test_triton_kernel_with_unbacked_symint_fallback --print-passing-details --env TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 --env TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(u0, 0)" ``` Buck UI: https://www.internalfb.com/buck2/4d27cd49-770b-40de-8c65-9ee04c5dd687 Test UI: https://www.internalfb.com/intern/testinfra/testrun/9570149324695031 Network: Up: 0B Down: 16MiB (reSessionID-8e8b07a2-e31c-402d-bf6a-ebb92253e654) Executing actions. Remaining 0/6 5.0s exec time total Command: test. Finished 2 cache (100% hit) 5.0s exec time cached (100%) Time elapsed: 33.8s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 Differential Revision: D83684260 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164509 Approved by: https://github.com/ColinPeppler	2025-10-03 21:05:18 +00:00
eellison	86474ce996	Update mask dtype (#164472 ) Differential Revision: [D83781684](https://our.internmc.facebook.com/intern/diff/D83781684) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164472 Approved by: https://github.com/bdhirsh	2025-10-03 00:19:36 +00:00
Isuru Fernando	f465ea6752	[inductor] require shape in TritonCSEVariable (#162275 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162275 Approved by: https://github.com/mlazos ghstack dependencies: #164158	2025-10-02 21:52:09 +00:00
PyTorch MergeBot	20edc5b26a	Revert "Add num_store to inductor_meta and use it to scale persistent reduction x block (#162446 )" This reverts commit `22c5e8c17c`. Reverted https://github.com/pytorch/pytorch/pull/162446 on behalf of https://github.com/PaulZhang12 due to perf regression in https://github.com/pytorch/pytorch/issues/164301#issuecomment-3354028620 ([comment](https://github.com/pytorch/pytorch/pull/162446#issuecomment-3357164274))	2025-10-01 16:23:03 +00:00
Pian Pawakapan	d615f6b935	[inductor] use hint_override in kernel benchmark args (#164207 ) Summary: forward fix T239259207 Test Plan: test_multi_kernel Differential Revision: D83539263 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164207 Approved by: https://github.com/bobrenjc93, https://github.com/mlazos	2025-09-30 18:09:29 +00:00
Nick Riasanovsky	719b64ee8b	Fix TMA transpose logic to handle 1D shapes + string differences (#163966 ) Fixes #163702. This fixes 2 issues: 1. The value may inconsistently be a shape or string. This normalizes to handle both of these. 2. 1D shapes should not transpose data. This fixes the order of operations to prevent this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163966 Approved by: https://github.com/eellison	2025-09-30 17:51:37 +00:00
Yuanyuan Chen	85012fe167	Remove unnecessary list comprehensions (#164103 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/164103 Approved by: https://github.com/Lucaskabela, https://github.com/mlazos	2025-09-30 03:56:54 +00:00
PyTorch MergeBot	6b473c90cf	Revert "[inductor] require shape in TritonCSEVariable (#162275 )" This reverts commit `c257570e6c`. Reverted https://github.com/pytorch/pytorch/pull/162275 on behalf of https://github.com/jeffdaily due to sorry this broke rocm CI; inductor/test_select_algorithm.py::TestTemplateRender::test_finalized_subclass_hooks [GH job link](https://github.com/pytorch/pytorch/actions/runs/18048893250/job/51366715091) [HUD commit link](`c257570e6c`) ([comment](https://github.com/pytorch/pytorch/pull/162275#issuecomment-3348159095))	2025-09-29 17:26:54 +00:00
Markus Hoehnerbach	069ccf5f1e	[inductor] pdl: enable launch and deduplicate waits (#162014 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162014 Approved by: https://github.com/eellison	2025-09-29 16:10:26 +00:00
Isuru Fernando	c257570e6c	[inductor] require shape in TritonCSEVariable (#162275 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162275 Approved by: https://github.com/mlazos	2025-09-26 20:41:12 +00:00
Shangdi Yu	520fca82c8	Refactor Provenance Tracking (#163378 ) Summary: - Move the `provenance_level` flag check to inside the `set_kernel_post_grad_provenance_tracing` call to simply the code - Move the `set_kernel_post_grad_provenance_tracing` call and `write_provenance_debug_handle` call to `codegen_comment`. - If some `call_kernel` call sites don't have a proceeding `codegen_comment` call, add one. Now all `call_kernel` call sites are accompanied with a `codegen_comment` call. - Add a `codegen_comment` method to BaseScheduling and remove the noop `codegen_comment` method in Scheduling - Remove `debug_handle` from `call_kernel`. Test Plan: CI ``` buck run @//mode/opt-split-dwarf fbcode//caffe2/test/inductor:provenance_tracing ``` Differential Revision: D82839271 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163378 Approved by: https://github.com/angelayi	2025-09-25 22:55:59 +00:00
karthickai	8c98aee436	[Inductor] Update DeviceAssert op to behave like store (#163696 ) Updated the DeviceAssert operation to match the behavior of Store, it will fixes the issue mentioned in [this PR](https://github.com/pytorch/pytorch/pull/163023) and updated testcases as Elias [suggested](https://github.com/pytorch/pytorch/pull/160677#discussion_r2353834646). Pull Request resolved: https://github.com/pytorch/pytorch/pull/163696 Approved by: https://github.com/mlazos	2025-09-24 23:35:56 +00:00
Nick Riasanovsky	0390798dad	[Triton] [Inductor] Enable Epilogue Subtiling in the blackwell ws template (#163145 ) Summary: Enables support for epilogue subtiling in the blackwell ws template. This requires the ability to call `store_output` twice in the same kernel and reuse the same tensor descriptor across allocations. Test Plan: Tested with test_max_autotune.py on a Blackwell server. Rollback Plan: Differential Revision: D82610077 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163145 Approved by: https://github.com/eellison	2025-09-24 05:38:02 +00:00
Markus Hoehnerbach	eb3fbf5b08	[inductor] in emulate_precision_casts, disable fma fusion in triton (#163073 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163073 Approved by: https://github.com/eellison, https://github.com/jansel	2025-09-23 23:59:17 +00:00
eellison	c63e417c79	use reduction hint for aggressive rblock (#163371 ) I had been using tiling scores to essentially check if this is an inner reduction. since that is not fully rolled out for dynamic shapes, use reduction hint when they are not available. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163371 Approved by: https://github.com/PaulZhang12	2025-09-23 22:04:22 +00:00
PaulZhang12	22c5e8c17c	Add num_store to inductor_meta and use it to scale persistent reduction x block (#162446 ) Scale up XBLOCK for contiguous persistent reductions based on rnumel and number of loads + stores <img width="928" height="656" alt="Screenshot 2025-09-18 at 5 02 57 PM" src="https://github.com/user-attachments/assets/ec3c561f-2a3f-4459-9e14-653715898da3" /> Differential Revision: [](https://our.internmc.facebook.com/intern/diff/) Differential Revision: [](https://our.internmc.facebook.com/intern/diff/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162446 Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314 ghstack dependencies: #162296	2025-09-23 20:36:39 +00:00
Jason Ansel	518c320676	[inductor] libdevice.sqrt => tl.sqrt_rn (#163419 ) Fixes #163082 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163419 Approved by: https://github.com/Skylion007, https://github.com/mlazos ghstack dependencies: #163386, #163398, #163387, #163414, #163415	2025-09-23 15:37:21 +00:00
PaulZhang12	2b036632ca	Allow add_persistent_r_block to scale up rblock up to a limit (#162296 ) <img width="654" height="392" alt="Screenshot 2025-09-18 at 4 22 53 PM" src="https://github.com/user-attachments/assets/975650ec-f769-43a6-bdf5-2885a8d40d3c" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/162296 Approved by: https://github.com/eellison	2025-09-22 21:41:46 +00:00
Markus Hoehnerbach	c5e7bb08b0	[inductor] pdl inductor option (disabled by default) (#160928 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160928 Approved by: https://github.com/eellison	2025-09-18 06:35:28 +00:00
Isuru Fernando	c77726b1d7	[inductor] fix expand_shape when copy_shape is not a string (#162739 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162739 Approved by: https://github.com/eellison, https://github.com/mlazos	2025-09-15 23:22:07 +00:00
Nick Riasanovsky	74a35c6344	[Triton] [Inductor] Enable TMA store for TMA mm templates (#160480 ) Summary: Adds support for TMA store in all TMA matmul templates (notably persistent_tma including addmm and scaled_mm). This works by requiring a template be registered with `tma_store=True` and when met constructs indices/range_trees to hook into the existing code base's TMA store support. This also includes a couple notable changes: - Adds support in the TMA template support for checking the output layout. - Adds support for "hoisting" the tensor descriptor to the top of the kernel. This will currently only be used by template code right now, but in principle it can be generalized to other implementation. - Supports considering multiple indices as the "contiguous" index. This is handled with support for transposing the input data when the alignment is no longer consistent. In general since the TMA support is derived from the index it doesn't seems reasonable that the 1D index math forces a certain alignment depending on index ordering so long as the layout matches. Test Plan: Tested with test_max_autotune.py unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160480 Approved by: https://github.com/NikhilAPatel	2025-09-14 04:56:49 +00:00
Isuru Fernando	f654cff566	[inductor] Add shape to load_input in matmul templates (#162513 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162513 Approved by: https://github.com/eellison ghstack dependencies: #162426	2025-09-11 01:51:15 +00:00
eellison	f4aeceaa9d	Use upper bound for persistent rblock (#162441 ) Previously, we were using 128 and increasing to upper bound. We should be setting at the upper bound and raising to next power of 2. Differential Revision: [D81984103](https://our.internmc.facebook.com/intern/diff/D81984103) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162441 Approved by: https://github.com/PaulZhang12	2025-09-10 22:29:02 +00:00
Colin Peppler	348303ebd2	[ez] add docstring/typing for codegen_kernel_benchmark (#162609 ) ``` lintrunner init && lintrunner -m origin/main ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162609 Approved by: https://github.com/coconutruben ghstack dependencies: #162442	2025-09-10 20:49:38 +00:00
Colin Peppler	94755e81c4	[inductor] Enable combo kernels with unbacked inputs (#162442 ) Internal user tried enabling combo kernels, but ran into "Cannot convert symbols to int". This PR is to enable combo kernels on inputs with data-dependent shapes. ### Example exception ``` File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 4997, in benchmark_combo_kernel kernel_code_list = self.generate_combo_kernel_code( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/simd.py", line 1849, in generate_combo_kernel_code src_code = kernel.codegen_kernel() ^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 802, in codegen_kernel code.splice(self.codegen_kernel_benchmark(num_gb=0)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 852, in codegen_kernel_benchmark var_names.extend(self.kernel_benchmark_extra_args()) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 733, in kernel_benchmark_extra_args extra_args.append(str(V.graph.sizevars.size_hint(tree.numel))) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/colinpeppler/pytorch/torch/_inductor/sizevars.py", line 584, in size_hint return int(out) ^^^^^^^^ File "/home/colinpeppler/.conda/envs/pytorch/lib/python3.12/site-packages/sympy/core/expr.py", line 307, in __int__ raise TypeError("Cannot convert symbols to int") torch._inductor.exc.InductorError: TypeError: Cannot convert symbols to int ``` Differential Revision: [D82042230](https://our.internmc.facebook.com/intern/diff/D82042230) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162442 Approved by: https://github.com/jansel	2025-09-10 20:49:38 +00:00
PyTorch MergeBot	ada43ed39c	Revert "[inductor] pdl inductor option (disabled by default) (#160928 )" This reverts commit `9458d1ac3b`. Reverted https://github.com/pytorch/pytorch/pull/160928 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/160928#issuecomment-3263560378))	2025-09-07 07:37:37 +00:00
Markus Hoehnerbach	9458d1ac3b	[inductor] pdl inductor option (disabled by default) (#160928 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160928 Approved by: https://github.com/eellison	2025-09-04 00:35:23 +00:00
Mwiza Kunda	b994f6e3b3	[inductor] check block options after broadcasting and singleton dims have been removed (#161602 ) This will allow for some more cases to use tensor descriptors e.g. before the following block params would not match because the innermost dimension does not have stride 1 ```python block_params=BlockParameters(shape=[64, 4, 1, 1], block_shape=[((XBLOCK + 3)//4), Min(4, XBLOCK), 1, 1], strides=[0, 1, 0, 0], offsets=[(xoffset//4), ModularIndexing(xoffset, 1, 4), 0, 0]) ``` After broadcasting dimensions and singleton dimensions are removed: ```python block_params=BlockParameters(shape=[4], block_shape=[Min(4, XBLOCK)], strides=[1], offsets=[ModularIndexing(xoffset, 1, 4)]) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161602 Approved by: https://github.com/jansel	2025-08-30 08:10:51 +00:00
Boyuan Feng	77d8e98e1b	[Inductor] update exp codegen for better precision (#161829 ) Prior to this PR, we have: ``` [Default Behavior] uses `tl.math.exp({x})`: eager diff: tensor(2.6935e-06, device='cuda:0', dtype=torch.float64) compile diff: tensor(9.2757e-06, device='cuda:0', dtype=torch.float64) eager_latency:0.0013996509159580942, compile_latency:0.0013981951951980592 TORCHINDUCTOR_USE_FAST_MATH=1 uses `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)`: eager diff: tensor(2.2315e-06, device='cuda:0', dtype=torch.float64) compile diff: tensor(3.5329e-06, device='cuda:0', dtype=torch.float64) eager_latency:0.0013982331859319662, compile_latency:0.0013824134564199367 Update inductor to use `tl.extra.libdevice.exp(tmp0)`: eager diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64) compile diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64) eager_latency:0.0014109122834153282, compile_latency:0.0014062877025520593 ``` Since `tl.extra.libdevice.exp` leads to both better precision and on-par latency, we use it by default now. Note that `tl.extra.libdevice.exp` used to have a perf issue in [January 2025](https://github.com/triton-lang/triton/issues/5735) since it used due to `ex2.approx.f32` instead of `ex2.approx.ftz.f32`. So `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)` was used as a workaround. I double checked that the issue is resolved and `tl.extra.libdevice.exp` also uses [ex2.approx.ftz.f32](https://github.com/triton-lang/triton/issues/5735#issuecomment-3238421293) today. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161829 Approved by: https://github.com/jansel	2025-08-30 04:56:51 +00:00
eellison	ebfee60101	[WIP] more aggressive persistent reduction (#161055 ) Gives 18% speedup on rms norm (2048, 32768). And we have seen other instances where inductor is not aggressive enough about codegening persistent reductions - e.g. 39% on [this kernel from torch ao](https://github.com/pytorch/pytorch/issues/159769#issuecomment-3188568335). Codegen-ing persistent reductions can be risky if you run out of registers. Here, I'm effectively making persistent reductions an option of looped reductions by setting RBLOCK == rnumel, so that we can still fallback to looped reductions as needed. As criteria: - there needs to be significant memory savings from doing a persistent reduction (by keeping memory in register and avoiding another iteration over input) - we should not be coalescing on x dimension, otherwise large rblock will inhibit coalescing - we should not be especially register or arithmetic intensive (this last part uses mem_ops_per_thread, but could be improved). Still need to do dashboard run, although I'm not sure we get a lot of large rblock in our benchmarks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161055 Approved by: https://github.com/jansel	2025-08-30 01:08:45 +00:00
Karthick Panner Selvam	130e50afff	[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677 ) This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084). Changes Included - Implemented device_assert op and overrides has_side_effect to return True to avoid removal by dead code elimination. - Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor. - Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler. - Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code. - Added test cases to verify both "should throw" and "should not throw" scenarios. Fixes #147282 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677 Approved by: https://github.com/mlazos, https://github.com/atalman	2025-08-28 18:57:34 +00:00
PyTorch MergeBot	c55bdb26e1	Revert "[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677 )" This reverts commit `378edb047f`. Reverted https://github.com/pytorch/pytorch/pull/160677 on behalf of https://github.com/atalman due to new test is failing internally ([comment](https://github.com/pytorch/pytorch/pull/160677#issuecomment-3230152168))	2025-08-27 23:45:12 +00:00
PyTorch MergeBot	014b98dd09	Revert "Add inductor backend to device interface; make minifier_tests more device agnostic (#151314 )" This reverts commit `77bc959fe1`. Reverted https://github.com/pytorch/pytorch/pull/151314 on behalf of https://github.com/atalman due to sorry change is faling internally ([comment](https://github.com/pytorch/pytorch/pull/151314#issuecomment-3229774015))	2025-08-27 21:21:19 +00:00
Karthick Panner Selvam	378edb047f	[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677 ) This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084). Changes Included - Implemented device_assert op and overrides has_side_effect to return True to avoid removal by dead code elimination. - Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor. - Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler. - Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code. - Added test cases to verify both "should throw" and "should not throw" scenarios. Fixes #147282 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677 Approved by: https://github.com/mlazos	2025-08-27 14:49:20 +00:00
PyTorch MergeBot	de58505890	Revert "[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677 )" This reverts commit `cddcaa1903`. Reverted https://github.com/pytorch/pytorch/pull/160677 on behalf of https://github.com/karthickai due to This is breaking tests on Rocm ([comment](https://github.com/pytorch/pytorch/pull/160677#issuecomment-3226541063))	2025-08-27 02:36:42 +00:00
Karthick Panner Selvam	cddcaa1903	[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677 ) This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084). Changes Included - Implemented device_assert op and overrides has_side_effect to return True to avoid removal by dead code elimination. - Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor. - Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler. - Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code. - Added test cases to verify both "should throw" and "should not throw" scenarios. Fixes #147282 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677 Approved by: https://github.com/mlazos	2025-08-26 22:33:23 +00:00
Charlie West-Taylor	77bc959fe1	Add inductor backend to device interface; make minifier_tests more device agnostic (#151314 ) Tried to decouple the always cpu <=> c++, cuda <=> triton assumption. Tried to keep it relatively simple by just guarding things more specifically, at the moment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151314 Approved by: https://github.com/eellison	2025-08-26 19:40:37 +00:00
Mwiza Kunda	5ee464db5c	[inductor] Fix descriptor broadcasting for singleton dimensions (#160310 ) This fixes the case when an input / output contains both zero strides and singleton dimensions. In this case the broadcasting dimensions generated for the descriptor need to ignore dimensions that have zero strides with size 1, otherwise the determination of which dimensions to broadcast will fail. As an example, consider the following store instruction: ``` name=buf1 index=x2 + 192y0 + 64y1 valule=TritonCSEVariable('tmp7') params = BlockParameters( shape=[3, 4, 1, 1, 64], block_shape=[((YBLOCK + 3)//4), Min(4, YBLOCK), 1, 1, XBLOCK], strides=[64, 192, 0, 0, 1], offsets=[(yoffset//4), ModularIndexing(yoffset, 1, 4), 0, 0, xoffset] ) broadcasting_dims=[False, False, True, True, False] broadcast_shape=[((YBLOCK + 3)//4), Min(4, YBLOCK), XBLOCK] ``` Because `len(self.broadcasting_dims) != self.broadcast_shape)`, dim3 is incorrectly marked as a broadcast dimension when the pre-broadcast shape is computed in `codegen_broadcast_and_reshape`. ``` 9 pre_broadcast_shape = [ 280 sympy.S.One if is_broadcasting else dim 281 for dim, is_broadcasting in zip( 282 -> self.broadcast_shape, self.broadcasting_dims 283 ) 284 ] ``` The pre_broadcast_shape is now wrong: `[((YBLOCK + 3)//4), Min(4, YBLOCK), 1]` Triton throws the following error: `reshape() cannot change total number of elements in tensor` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160310 Approved by: https://github.com/blaine-rister	2025-08-20 09:48:58 +00:00
Isuru Fernando	f305019377	[inductor] propagate shapes in CSEVariable (#152198 ) Fixes #149905 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152198 Approved by: https://github.com/eellison	2025-08-19 16:46:38 +00:00
Nick Riasanovsky	df60736410	[BE] [Inductor] Re-Land Support TMA before strict 3.4 cutoff (#160747 ) Summary: Inductor's 3.4 Triton release is the most common used variant of Triton, but if someone is working with an alternative version of Triton this may not match. This moves the version check from 3.4 Triton to any variant that has support for the TMA APIs. Test Plan: Testing the previously failing test `inductor/test_torchinductor_strided_blocks.py::TritonTensorDescriptorTestCUDA::test_welford_non_block_pointer_cuda` Rollback Plan: Differential Revision: D80348643 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160747 Approved by: https://github.com/NikhilAPatel	2025-08-19 07:32:55 +00:00
PyTorch MergeBot	04c7be903d	Revert "[BE] [Inductor] Re-Land Support TMA before strict 3.4 cutoff (#160747 )" This reverts commit `8f434545c2`. Reverted https://github.com/pytorch/pytorch/pull/160747 on behalf of https://github.com/malfet due to Looks like this breaks rocm, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm%20%2F%20linux-jammy-rocm-py3.10 ([comment](https://github.com/pytorch/pytorch/pull/160747#issuecomment-3194417733))	2025-08-17 14:22:48 +00:00
Nick Riasanovsky	8f434545c2	[BE] [Inductor] Re-Land Support TMA before strict 3.4 cutoff (#160747 ) Summary: Inductor's 3.4 Triton release is the most common used variant of Triton, but if someone is working with an alternative version of Triton this may not match. This moves the version check from 3.4 Triton to any variant that has support for the TMA APIs. Test Plan: Testing the previously failing test `inductor/test_torchinductor_strided_blocks.py::TritonTensorDescriptorTestCUDA::test_welford_non_block_pointer_cuda` Rollback Plan: Differential Revision: D80348643 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160747 Approved by: https://github.com/NikhilAPatel	2025-08-17 00:35:12 +00:00
Markus Hoehnerbach	89654db1ab	[inductor] fix triton bucketize mask propagation (#159961 ) See `6b414f56a4` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159961 Approved by: https://github.com/eellison	2025-08-12 19:59:32 +00:00
David Berard	1f4057c11a	[inductor] remove no_x_dim (#159810 ) no_x_dim is used to indicate that a reduction operates on a single row, and data loaded for the reduction is 1-dimensional. no_x_dim was introduced in https://github.com/pytorch/pytorch/pull/102444 - in which there was bad perf in some reductions, and using 1D tensors fixed the perf issue. However, it appears that this perf issue no longer exists in current Triton versions. https://github.com/pytorch/pytorch/pull/118822 checked this, and we can also check this on H100 benchmarks (linked below). And another motivation for removing this behavior is that it enables larger loads, which we observe is necessary for good performance on certain shapes on Blackwell. H100 inference benchmarks: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a H100 training benchmarks: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a Overall, the benchmarks show minimal change in performance. Differential Revision: [D79599286](https://our.internmc.facebook.com/intern/diff/D79599286) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159810 Approved by: https://github.com/ngimel, https://github.com/eellison	2025-08-12 17:10:31 +00:00
Tom Ritchford	a5725965ea	Remove unnecessary "# noqa: set_linter" comments (#159467 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159467 Approved by: https://github.com/eellison	2025-08-06 21:31:52 +00:00
Bin Bao	a4b07fe8f6	[AOTI] Add more default options to compile_standalone (#158560 ) Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560 Approved by: https://github.com/yushangdi	2025-08-06 15:59:27 +00:00
PyTorch MergeBot	410812763b	Revert "[Inductor][Triton] Support TMA before strict 3.4 cutoff (#159777 )" This reverts commit `bbc0df1094`. Reverted https://github.com/pytorch/pytorch/pull/159777 on behalf of https://github.com/izaitsevfb due to breaking inductor test on ROCm ([comment](https://github.com/pytorch/pytorch/pull/159777#issuecomment-3156770098))	2025-08-05 22:00:24 +00:00
Nick Riasanovsky	bbc0df1094	[Inductor][Triton] Support TMA before strict 3.4 cutoff (#159777 ) Summary: Inductor's 3.4 Triton release is the most common used variant of Triton, but if someone is working with an alternative version of Triton this may not match. This moves the version check from 3.4 Triton to any variant that has support for the TMA APIs. Test Plan: Relying on CI. Should be a NFC. Rollback Plan: Reviewed By: davidberard98 Differential Revision: D79378792 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159777 Approved by: https://github.com/davidberard98	2025-08-05 03:29:13 +00:00
Anatoly Myachev	46b925681c	[inductor] Update `to(tl.int8).to(tl.uint8)` workaround from #94717 to handle entire range of `torch.uint8` (#158567 ) https://github.com/pytorch/pytorch/pull/94717/files#r2210265070 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158567 Approved by: https://github.com/ngimel, https://github.com/jansel	2025-07-26 19:11:37 +00:00
PyTorch MergeBot	7d6f340238	Revert "[AOTI] Add more default options to compile_standalone (#158560 )" This reverts commit `a991e285ae`. Reverted https://github.com/pytorch/pytorch/pull/158560 on behalf of https://github.com/jeffdaily due to broke rocm CI, no test signal was available from rocm ciflow/trunk, need to add ciflow/rocm to reland ([comment](https://github.com/pytorch/pytorch/pull/158560#issuecomment-3103633964))	2025-07-22 16:20:17 +00:00

1 2 3 4 5 ...

697 Commits