pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Jason Ansel	518c320676	[inductor] libdevice.sqrt => tl.sqrt_rn (#163419 ) Fixes #163082 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163419 Approved by: https://github.com/Skylion007, https://github.com/mlazos ghstack dependencies: #163386, #163398, #163387, #163414, #163415	2025-09-23 15:37:21 +00:00
PaulZhang12	2b036632ca	Allow add_persistent_r_block to scale up rblock up to a limit (#162296 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162296 Approved by: https://github.com/eellison	2025-09-22 21:41:46 +00:00
Markus Hoehnerbach	c5e7bb08b0	[inductor] pdl inductor option (disabled by default) (#160928 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160928 Approved by: https://github.com/eellison	2025-09-18 06:35:28 +00:00
Isuru Fernando	c77726b1d7	[inductor] fix expand_shape when copy_shape is not a string (#162739 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162739 Approved by: https://github.com/eellison, https://github.com/mlazos	2025-09-15 23:22:07 +00:00
Nick Riasanovsky	74a35c6344	[Triton] [Inductor] Enable TMA store for TMA mm templates (#160480 ) Summary: Adds support for TMA store in all TMA matmul templates (notably persistent_tma including addmm and scaled_mm). This works by requiring a template be registered with `tma_store=True` and when met constructs indices/range_trees to hook into the existing code base's TMA store support. This also includes a couple notable changes: - Adds support in the TMA templates for checking the output layout. - Adds support for "hoisting" the tensor descriptor to the top of the kernel. This is currently only used by template code, but in principle it can be generalized to other implementations. - Supports considering multiple indices as the "contiguous" index. This is handled with support for transposing the input data when the alignment is no longer consistent. In general, since the TMA support is derived from the index, it doesn't seem reasonable that the 1D index math forces a certain alignment depending on index ordering so long as the layout matches. Test Plan: Tested with test_max_autotune.py unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160480 Approved by: https://github.com/NikhilAPatel	2025-09-14 04:56:49 +00:00
Isuru Fernando	f654cff566	[inductor] Add shape to load_input in matmul templates (#162513 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162513 Approved by: https://github.com/eellison ghstack dependencies: #162426	2025-09-11 01:51:15 +00:00
eellison	f4aeceaa9d	Use upper bound for persistent rblock (#162441 ) Previously, we were using 128 and increasing to upper bound. We should be setting at the upper bound and raising to next power of 2. Differential Revision: [D81984103](https://our.internmc.facebook.com/intern/diff/D81984103) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162441 Approved by: https://github.com/PaulZhang12	2025-09-10 22:29:02 +00:00
Colin Peppler	348303ebd2	[ez] add docstring/typing for codegen_kernel_benchmark (#162609 ) ``` lintrunner init && lintrunner -m origin/main ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162609 Approved by: https://github.com/coconutruben ghstack dependencies: #162442	2025-09-10 20:49:38 +00:00
Colin Peppler	94755e81c4	[inductor] Enable combo kernels with unbacked inputs (#162442 ) Internal user tried enabling combo kernels, but ran into "Cannot convert symbols to int". This PR is to enable combo kernels on inputs with data-dependent shapes. ### Example exception ``` File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 4997, in benchmark_combo_kernel kernel_code_list = self.generate_combo_kernel_code( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/simd.py", line 1849, in generate_combo_kernel_code src_code = kernel.codegen_kernel() ^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 802, in codegen_kernel code.splice(self.codegen_kernel_benchmark(num_gb=0)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 852, in codegen_kernel_benchmark var_names.extend(self.kernel_benchmark_extra_args()) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 733, in kernel_benchmark_extra_args extra_args.append(str(V.graph.sizevars.size_hint(tree.numel))) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/colinpeppler/pytorch/torch/_inductor/sizevars.py", line 584, in size_hint return int(out) ^^^^^^^^ File "/home/colinpeppler/.conda/envs/pytorch/lib/python3.12/site-packages/sympy/core/expr.py", line 307, in __int__ raise TypeError("Cannot convert symbols to int") torch._inductor.exc.InductorError: TypeError: Cannot convert symbols to int ``` Differential Revision: [D82042230](https://our.internmc.facebook.com/intern/diff/D82042230) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162442 Approved by: https://github.com/jansel	2025-09-10 20:49:38 +00:00
PyTorch MergeBot	ada43ed39c	Revert "[inductor] pdl inductor option (disabled by default) (#160928 )" This reverts commit `9458d1ac3b`. Reverted https://github.com/pytorch/pytorch/pull/160928 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/160928#issuecomment-3263560378))	2025-09-07 07:37:37 +00:00
Markus Hoehnerbach	9458d1ac3b	[inductor] pdl inductor option (disabled by default) (#160928 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160928 Approved by: https://github.com/eellison	2025-09-04 00:35:23 +00:00
Mwiza Kunda	b994f6e3b3	[inductor] check block options after broadcasting and singleton dims have been removed (#161602 ) This will allow for some more cases to use tensor descriptors e.g. before the following block params would not match because the innermost dimension does not have stride 1 ```python block_params=BlockParameters(shape=[64, 4, 1, 1], block_shape=[((XBLOCK + 3)//4), Min(4, XBLOCK), 1, 1], strides=[0, 1, 0, 0], offsets=[(xoffset//4), ModularIndexing(xoffset, 1, 4), 0, 0]) ``` After broadcasting dimensions and singleton dimensions are removed: ```python block_params=BlockParameters(shape=[4], block_shape=[Min(4, XBLOCK)], strides=[1], offsets=[ModularIndexing(xoffset, 1, 4)]) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161602 Approved by: https://github.com/jansel	2025-08-30 08:10:51 +00:00
Boyuan Feng	77d8e98e1b	[Inductor] update exp codegen for better precision (#161829 ) Prior to this PR, we have: ``` [Default Behavior] uses `tl.math.exp({x})`: eager diff: tensor(2.6935e-06, device='cuda:0', dtype=torch.float64) compile diff: tensor(9.2757e-06, device='cuda:0', dtype=torch.float64) eager_latency:0.0013996509159580942, compile_latency:0.0013981951951980592 TORCHINDUCTOR_USE_FAST_MATH=1 uses `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)`: eager diff: tensor(2.2315e-06, device='cuda:0', dtype=torch.float64) compile diff: tensor(3.5329e-06, device='cuda:0', dtype=torch.float64) eager_latency:0.0013982331859319662, compile_latency:0.0013824134564199367 Update inductor to use `tl.extra.libdevice.exp(tmp0)`: eager diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64) compile diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64) eager_latency:0.0014109122834153282, compile_latency:0.0014062877025520593 ``` Since `tl.extra.libdevice.exp` leads to both better precision and on-par latency, we use it by default now. Note that `tl.extra.libdevice.exp` used to have a perf issue in [January 2025](https://github.com/triton-lang/triton/issues/5735) since it used `ex2.approx.f32` instead of `ex2.approx.ftz.f32`. So `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)` was used as a workaround. I double checked that the issue is resolved and `tl.extra.libdevice.exp` also uses [ex2.approx.ftz.f32](https://github.com/triton-lang/triton/issues/5735#issuecomment-3238421293) today. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161829 Approved by: https://github.com/jansel	2025-08-30 04:56:51 +00:00
eellison	ebfee60101	[WIP] more aggressive persistent reduction (#161055 ) Gives 18% speedup on rms norm (2048, 32768). And we have seen other instances where inductor is not aggressive enough about codegening persistent reductions - e.g. 39% on [this kernel from torch ao](https://github.com/pytorch/pytorch/issues/159769#issuecomment-3188568335). Codegen-ing persistent reductions can be risky if you run out of registers. Here, I'm effectively making persistent reductions an option of looped reductions by setting RBLOCK == rnumel, so that we can still fallback to looped reductions as needed. As criteria: - there needs to be significant memory savings from doing a persistent reduction (by keeping memory in register and avoiding another iteration over input) - we should not be coalescing on x dimension, otherwise large rblock will inhibit coalescing - we should not be especially register or arithmetic intensive (this last part uses mem_ops_per_thread, but could be improved). Still need to do dashboard run, although I'm not sure we get a lot of large rblock in our benchmarks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161055 Approved by: https://github.com/jansel	2025-08-30 01:08:45 +00:00
Karthick Panner Selvam	130e50afff	[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677 ) This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084). Changes Included - Implemented device_assert op and overrides has_side_effect to return True to avoid removal by dead code elimination. - Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor. - Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler. - Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code. - Added test cases to verify both "should throw" and "should not throw" scenarios. Fixes #147282 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677 Approved by: https://github.com/mlazos, https://github.com/atalman	2025-08-28 18:57:34 +00:00
PyTorch MergeBot	c55bdb26e1	Revert "[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677 )" This reverts commit `378edb047f`. Reverted https://github.com/pytorch/pytorch/pull/160677 on behalf of https://github.com/atalman due to new test is failing internally ([comment](https://github.com/pytorch/pytorch/pull/160677#issuecomment-3230152168))	2025-08-27 23:45:12 +00:00
PyTorch MergeBot	014b98dd09	Revert "Add inductor backend to device interface; make minifier_tests more device agnostic (#151314 )" This reverts commit `77bc959fe1`. Reverted https://github.com/pytorch/pytorch/pull/151314 on behalf of https://github.com/atalman due to sorry change is failing internally ([comment](https://github.com/pytorch/pytorch/pull/151314#issuecomment-3229774015))	2025-08-27 21:21:19 +00:00
Karthick Panner Selvam	378edb047f	[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677 ) This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084). Changes Included - Implemented device_assert op and overrides has_side_effect to return True to avoid removal by dead code elimination. - Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor. - Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler. - Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code. - Added test cases to verify both "should throw" and "should not throw" scenarios. Fixes #147282 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677 Approved by: https://github.com/mlazos	2025-08-27 14:49:20 +00:00
PyTorch MergeBot	de58505890	Revert "[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677 )" This reverts commit `cddcaa1903`. Reverted https://github.com/pytorch/pytorch/pull/160677 on behalf of https://github.com/karthickai due to This is breaking tests on Rocm ([comment](https://github.com/pytorch/pytorch/pull/160677#issuecomment-3226541063))	2025-08-27 02:36:42 +00:00
Karthick Panner Selvam	cddcaa1903	[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677 ) This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084). Changes Included - Implemented device_assert op and overrides has_side_effect to return True to avoid removal by dead code elimination. - Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor. - Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler. - Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code. - Added test cases to verify both "should throw" and "should not throw" scenarios. Fixes #147282 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677 Approved by: https://github.com/mlazos	2025-08-26 22:33:23 +00:00
Charlie West-Taylor	77bc959fe1	Add inductor backend to device interface; make minifier_tests more device agnostic (#151314 ) Tried to decouple the always cpu <=> c++, cuda <=> triton assumption. Tried to keep it relatively simple by just guarding things more specifically, at the moment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151314 Approved by: https://github.com/eellison	2025-08-26 19:40:37 +00:00
Mwiza Kunda	5ee464db5c	[inductor] Fix descriptor broadcasting for singleton dimensions (#160310 ) This fixes the case when an input / output contains both zero strides and singleton dimensions. In this case the broadcasting dimensions generated for the descriptor need to ignore dimensions that have zero strides with size 1, otherwise the determination of which dimensions to broadcast will fail. As an example, consider the following store instruction: ``` name=buf1 index=x2 + 192y0 + 64y1 value=TritonCSEVariable('tmp7') params = BlockParameters( shape=[3, 4, 1, 1, 64], block_shape=[((YBLOCK + 3)//4), Min(4, YBLOCK), 1, 1, XBLOCK], strides=[64, 192, 0, 0, 1], offsets=[(yoffset//4), ModularIndexing(yoffset, 1, 4), 0, 0, xoffset] ) broadcasting_dims=[False, False, True, True, False] broadcast_shape=[((YBLOCK + 3)//4), Min(4, YBLOCK), XBLOCK] ``` Because `len(self.broadcasting_dims) != len(self.broadcast_shape)`, dim3 is incorrectly marked as a broadcast dimension when the pre-broadcast shape is computed in `codegen_broadcast_and_reshape`. ``` pre_broadcast_shape = [ sympy.S.One if is_broadcasting else dim for dim, is_broadcasting in zip( self.broadcast_shape, self.broadcasting_dims ) ] ``` The pre_broadcast_shape is now wrong: `[((YBLOCK + 3)//4), Min(4, YBLOCK), 1]` Triton throws the following error: `reshape() cannot change total number of elements in tensor` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160310 Approved by: https://github.com/blaine-rister	2025-08-20 09:48:58 +00:00
Isuru Fernando	f305019377	[inductor] propagate shapes in CSEVariable (#152198 ) Fixes #149905 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152198 Approved by: https://github.com/eellison	2025-08-19 16:46:38 +00:00
Nick Riasanovsky	df60736410	[BE] [Inductor] Re-Land Support TMA before strict 3.4 cutoff (#160747 ) Summary: Inductor's 3.4 Triton release is the most commonly used variant of Triton, but if someone is working with an alternative version of Triton this may not match. This moves the version check from 3.4 Triton to any variant that has support for the TMA APIs. Test Plan: Testing the previously failing test `inductor/test_torchinductor_strided_blocks.py::TritonTensorDescriptorTestCUDA::test_welford_non_block_pointer_cuda` Rollback Plan: Differential Revision: D80348643 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160747 Approved by: https://github.com/NikhilAPatel	2025-08-19 07:32:55 +00:00
PyTorch MergeBot	04c7be903d	Revert "[BE] [Inductor] Re-Land Support TMA before strict 3.4 cutoff (#160747 )" This reverts commit `8f434545c2`. Reverted https://github.com/pytorch/pytorch/pull/160747 on behalf of https://github.com/malfet due to Looks like this breaks rocm, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm%20%2F%20linux-jammy-rocm-py3.10 ([comment](https://github.com/pytorch/pytorch/pull/160747#issuecomment-3194417733))	2025-08-17 14:22:48 +00:00
Nick Riasanovsky	8f434545c2	[BE] [Inductor] Re-Land Support TMA before strict 3.4 cutoff (#160747 ) Summary: Inductor's 3.4 Triton release is the most commonly used variant of Triton, but if someone is working with an alternative version of Triton this may not match. This moves the version check from 3.4 Triton to any variant that has support for the TMA APIs. Test Plan: Testing the previously failing test `inductor/test_torchinductor_strided_blocks.py::TritonTensorDescriptorTestCUDA::test_welford_non_block_pointer_cuda` Rollback Plan: Differential Revision: D80348643 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160747 Approved by: https://github.com/NikhilAPatel	2025-08-17 00:35:12 +00:00
Markus Hoehnerbach	89654db1ab	[inductor] fix triton bucketize mask propagation (#159961 ) See `6b414f56a4` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159961 Approved by: https://github.com/eellison	2025-08-12 19:59:32 +00:00
David Berard	1f4057c11a	[inductor] remove no_x_dim (#159810 ) no_x_dim is used to indicate that a reduction operates on a single row, and data loaded for the reduction is 1-dimensional. no_x_dim was introduced in https://github.com/pytorch/pytorch/pull/102444 - in which there was bad perf in some reductions, and using 1D tensors fixed the perf issue. However, it appears that this perf issue no longer exists in current Triton versions. https://github.com/pytorch/pytorch/pull/118822 checked this, and we can also check this on H100 benchmarks (linked below). And another motivation for removing this behavior is that it enables larger loads, which we observe is necessary for good performance on certain shapes on Blackwell. H100 inference benchmarks: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a H100 training benchmarks: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a Overall, the benchmarks show minimal change in performance. Differential Revision: [D79599286](https://our.internmc.facebook.com/intern/diff/D79599286) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159810 Approved by: https://github.com/ngimel, https://github.com/eellison	2025-08-12 17:10:31 +00:00
Tom Ritchford	a5725965ea	Remove unnecessary "# noqa: set_linter" comments (#159467 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159467 Approved by: https://github.com/eellison	2025-08-06 21:31:52 +00:00
Bin Bao	a4b07fe8f6	[AOTI] Add more default options to compile_standalone (#158560 ) Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560 Approved by: https://github.com/yushangdi	2025-08-06 15:59:27 +00:00
PyTorch MergeBot	410812763b	Revert "[Inductor][Triton] Support TMA before strict 3.4 cutoff (#159777 )" This reverts commit `bbc0df1094`. Reverted https://github.com/pytorch/pytorch/pull/159777 on behalf of https://github.com/izaitsevfb due to breaking inductor test on ROCm ([comment](https://github.com/pytorch/pytorch/pull/159777#issuecomment-3156770098))	2025-08-05 22:00:24 +00:00
Nick Riasanovsky	bbc0df1094	[Inductor][Triton] Support TMA before strict 3.4 cutoff (#159777 ) Summary: Inductor's 3.4 Triton release is the most commonly used variant of Triton, but if someone is working with an alternative version of Triton this may not match. This moves the version check from 3.4 Triton to any variant that has support for the TMA APIs. Test Plan: Relying on CI. Should be a NFC. Rollback Plan: Reviewed By: davidberard98 Differential Revision: D79378792 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159777 Approved by: https://github.com/davidberard98	2025-08-05 03:29:13 +00:00
Anatoly Myachev	46b925681c	[inductor] Update `to(tl.int8).to(tl.uint8)` workaround from #94717 to handle entire range of `torch.uint8` (#158567 ) https://github.com/pytorch/pytorch/pull/94717/files#r2210265070 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158567 Approved by: https://github.com/ngimel, https://github.com/jansel	2025-07-26 19:11:37 +00:00
PyTorch MergeBot	7d6f340238	Revert "[AOTI] Add more default options to compile_standalone (#158560 )" This reverts commit `a991e285ae`. Reverted https://github.com/pytorch/pytorch/pull/158560 on behalf of https://github.com/jeffdaily due to broke rocm CI, no test signal was available from rocm ciflow/trunk, need to add ciflow/rocm to reland ([comment](https://github.com/pytorch/pytorch/pull/158560#issuecomment-3103633964))	2025-07-22 16:20:17 +00:00
Bin Bao	a991e285ae	[AOTI] Add more default options to compile_standalone (#158560 ) Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560 Approved by: https://github.com/yushangdi	2025-07-21 21:16:48 +00:00
bobrenjc93	5221448574	multi-kernel matmuls based on varying hint sizes (#156628 ) The core idea is to generate multiple matmul kernels using different hints for symbolic variables, then select the most appropriate one at runtime for each unique shape we encounter. You can find some early experimentation details in these posts: https://fb.workplace.com/groups/8940092306109185/posts/9803850776399996/ https://fb.workplace.com/groups/8940092306109185/posts/9695805170537891/ https://fb.workplace.com/groups/257735836456307/posts/906589324904285/ Here’s a graph illustrating the empirically observed worst-case performance if an oracle always selected the least optimal hint for a given runtime size: ![image](https://github.com/user-attachments/assets/6d90ee06-a572-453e-9cba-03006f343301) This graph illustrates the performance of a hint size of 64 relative to the worst case. Notice that as the runtime sizes increase, the performance gradually approaches the worst case: ![image](https://github.com/user-attachments/assets/85ad49fe-165a-474c-8d03-db2e57654213) This graph shows the performance of a hint size of 4096 — very poor for small sizes, and also suboptimal for some mid-sized shapes: ![image](https://github.com/user-attachments/assets/adea1106-3bc8-40f3-97b0-20d940fb74f1) Finally, here’s the graph that motivated this PR. It illustrates the performance when selecting the best of three kernels generated with three different hints — 64, 256, and 4096: ![image](https://github.com/user-attachments/assets/a7cb0ce5-8139-48b1-b5c9-7670e75cbfce) ## How to review this PR At a high level, this extends @shunting314's multi-kernel abstraction to support varying GEMM choices driven by different hints. A few key points: 1. Unlike reduction kernels, triton template matmuls pass their grid as arguments to the kernel. This PR updates `MultiKernelCall` to support kernels with varying arguments. 2. 
The `V.graph.sizevars.size_hints` API is extended to accept a `hint_override`, allowing us to substitute the example input’s size hint with a custom value when generating multiple kernels. 3. The choice generation and benchmarking logic is updated to support multiple hint values. One kernel is generated per value in `torch._inductor.config.multi_kernel_hints`, and at runtime, we select the most suitable kernel for the current shape. 4. This PR does not add support for cpp wrapper codegen to keep it scoped. That will be added in the next PR. ## Results The following is a basic test that shows our basic multi kernel working where we no longer show significant variance based on the original hint size: https://gist.github.com/bobrenjc93/ba711d529e65fd65839b34799f6323ec Before ``` Hint\Runtime \| 64 \| 256 \| 4096 --------------------------------------------------- 64 \| 0.0948 \| 0.3124 \| 4.9477 256 \| 0.2243 \| 0.2256 \| 3.3880 4096 \| 0.3384 \| 0.3404 \| 3.3010 ``` After ``` Hint\Runtime \| 64 \| 256 \| 4096 --------------------------------------------------- 64 \| 0.0951 \| 0.2289 \| 3.3013 256 \| 0.0952 \| 0.2258 \| 3.4045 4096 \| 0.0957 \| 0.2231 \| 3.3146 ``` We also see an average speedup of 5.04% for the matrix of all hint/runtime pairs in [64, 4096] for every increment of 64: https://docs.google.com/spreadsheets/d/12TmYUDrAAFASGuP3POXTKPeAvQWIRzKzdrVSIb3vQkA/edit?gid=480268938#gid=480268938 ![Worst Case, multi-kernel](https://github.com/user-attachments/assets/712df23b-87e2-4d9d-95c2-cc25305ba2ed) NB: This is just the beginning and I plan on doing more investigation to see further improve on this initial result. 
For posterity the script used to generate that matrix is here: https://gist.github.com/bobrenjc93/c211fd0bd97fad8f46b91ad9dee76ad0 HUD benchmark runs: base: https://github.com/pytorch/pytorch/actions/runs/15889871988 head: https://github.com/pytorch/pytorch/actions/runs/15889876842 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156628 Approved by: https://github.com/jansel	2025-07-12 15:08:21 +00:00
Xuehai Pan	7f14b42adf	[BE][2/16] fix typos in torch/ (torch/_*/) (#156312 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156312 Approved by: https://github.com/albanD	2025-07-12 05:47:06 +00:00
PyTorch MergeBot	e15f4248ad	Revert "[BE][2/16] fix typos in torch/ (torch/_*/) (#156312 )" This reverts commit `7a92b51196`. Reverted https://github.com/pytorch/pytorch/pull/156312 on behalf of https://github.com/XuehaiPan due to landrace ([comment](https://github.com/pytorch/pytorch/pull/156312#issuecomment-3064672250))	2025-07-12 04:40:52 +00:00
PyTorch MergeBot	9c189ed29a	Revert "multi-kernel matmuls based on varying hint sizes (#156628 )" This reverts commit `6c79530637`. Reverted https://github.com/pytorch/pytorch/pull/156628 on behalf of https://github.com/huydhn due to Sorry for reverting your change but some ROCM jobs went crazy after this lands, so I try to see if reverting helps ([comment](https://github.com/pytorch/pytorch/pull/156628#issuecomment-3064617123))	2025-07-12 03:48:39 +00:00
Xuehai Pan	7a92b51196	[BE][2/16] fix typos in torch/ (torch/_*/) (#156312 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156312 Approved by: https://github.com/albanD	2025-07-12 01:47:22 +00:00
bobrenjc93	6c79530637	multi-kernel matmuls based on varying hint sizes (#156628 ) The core idea is to generate multiple matmul kernels using different hints for symbolic variables, then select the most appropriate one at runtime for each unique shape we encounter. You can find some early experimentation details in these posts: https://fb.workplace.com/groups/8940092306109185/posts/9803850776399996/ https://fb.workplace.com/groups/8940092306109185/posts/9695805170537891/ https://fb.workplace.com/groups/257735836456307/posts/906589324904285/ Here’s a graph illustrating the empirically observed worst-case performance if an oracle always selected the least optimal hint for a given runtime size: ![image](https://github.com/user-attachments/assets/6d90ee06-a572-453e-9cba-03006f343301) This graph illustrates the performance of a hint size of 64 relative to the worst case. Notice that as the runtime sizes increase, the performance gradually approaches the worst case: ![image](https://github.com/user-attachments/assets/85ad49fe-165a-474c-8d03-db2e57654213) This graph shows the performance of a hint size of 4096 — very poor for small sizes, and also suboptimal for some mid-sized shapes: ![image](https://github.com/user-attachments/assets/adea1106-3bc8-40f3-97b0-20d940fb74f1) Finally, here’s the graph that motivated this PR. It illustrates the performance when selecting the best of three kernels generated with three different hints — 64, 256, and 4096: ![image](https://github.com/user-attachments/assets/a7cb0ce5-8139-48b1-b5c9-7670e75cbfce) ## How to review this PR At a high level, this extends @shunting314's multi-kernel abstraction to support varying GEMM choices driven by different hints. A few key points: 1. Unlike reduction kernels, triton template matmuls pass their grid as arguments to the kernel. This PR updates `MultiKernelCall` to support kernels with varying arguments. 2. 
The `V.graph.sizevars.size_hints` API is extended to accept a `hint_override`, allowing us to substitute the example input’s size hint with a custom value when generating multiple kernels. 3. The choice generation and benchmarking logic is updated to support multiple hint values. One kernel is generated per value in `torch._inductor.config.multi_kernel_hints`, and at runtime, we select the most suitable kernel for the current shape. 4. This PR does not add support for cpp wrapper codegen to keep it scoped. That will be added in the next PR. ## Results The following is a basic test that shows our basic multi kernel working where we no longer show significant variance based on the original hint size: https://gist.github.com/bobrenjc93/ba711d529e65fd65839b34799f6323ec Before ``` Hint\Runtime \| 64 \| 256 \| 4096 --------------------------------------------------- 64 \| 0.0948 \| 0.3124 \| 4.9477 256 \| 0.2243 \| 0.2256 \| 3.3880 4096 \| 0.3384 \| 0.3404 \| 3.3010 ``` After ``` Hint\Runtime \| 64 \| 256 \| 4096 --------------------------------------------------- 64 \| 0.0951 \| 0.2289 \| 3.3013 256 \| 0.0952 \| 0.2258 \| 3.4045 4096 \| 0.0957 \| 0.2231 \| 3.3146 ``` We also see an average speedup of 5.04% for the matrix of all hint/runtime pairs in [64, 4096] for every increment of 64: https://docs.google.com/spreadsheets/d/12TmYUDrAAFASGuP3POXTKPeAvQWIRzKzdrVSIb3vQkA/edit?gid=480268938#gid=480268938 ![Worst Case, multi-kernel](https://github.com/user-attachments/assets/712df23b-87e2-4d9d-95c2-cc25305ba2ed) NB: This is just the beginning and I plan on doing more investigation to see further improve on this initial result. 
For posterity the script used to generate that matrix is here: https://gist.github.com/bobrenjc93/c211fd0bd97fad8f46b91ad9dee76ad0 HUD benchmark runs: base: https://github.com/pytorch/pytorch/actions/runs/15889871988 head: https://github.com/pytorch/pytorch/actions/runs/15889876842 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156628 Approved by: https://github.com/jansel	2025-07-11 19:38:10 +00:00
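The runtime selection described in #156628 (one kernel per hint, with the best one chosen for each shape encountered) can be sketched roughly as follows. This is a toy illustration only, not Inductor's `MultiKernelCall`: the `ShapeDispatcher` class, its `pick` method, and the `make_bench` helper are hypothetical names, and the costs are simply the "Before" timings quoted in the commit message rather than real measurements.

```python
from typing import Callable, Dict

# Costs copied from the "Before" table in the commit message:
# outer key = compile-time hint, inner key = runtime size.
BEFORE_TABLE: Dict[int, Dict[int, float]] = {
    64: {64: 0.0948, 256: 0.3124, 4096: 4.9477},
    256: {64: 0.2243, 256: 0.2256, 4096: 3.3880},
    4096: {64: 0.3384, 256: 0.3404, 4096: 3.3010},
}


def make_bench(hint: int) -> Callable[[int], float]:
    """Hypothetical benchmark: looks up a cost instead of timing a real kernel."""
    return lambda runtime_size: BEFORE_TABLE[hint][runtime_size]


class ShapeDispatcher:
    """Toy stand-in for per-shape kernel selection (an assumption, not the real API)."""

    def __init__(self, benches: Dict[int, Callable[[int], float]]) -> None:
        self.benches = benches                 # hint -> benchmark callable
        self.choice_cache: Dict[int, int] = {}  # runtime size -> chosen hint

    def pick(self, runtime_size: int) -> int:
        # Benchmark every candidate once per unseen shape, then reuse the result.
        if runtime_size not in self.choice_cache:
            costs = {h: bench(runtime_size) for h, bench in self.benches.items()}
            self.choice_cache[runtime_size] = min(costs, key=costs.get)
        return self.choice_cache[runtime_size]


dispatcher = ShapeDispatcher({h: make_bench(h) for h in BEFORE_TABLE})
chosen = {size: dispatcher.pick(size) for size in (64, 256, 4096)}
print(chosen)
```

With these toy costs, each runtime size ends up mapped to the kernel compiled with the matching hint, which is the per-column minimum of the "Before" table.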
Mwiza Kunda	ed508cc018	[inductor][triton] Add experimental use_tensor_descriptor config option (#157906 ) Refactor to allow TMA descriptors to be used in general codegen. TMA descriptors can only be generated if the conditions listed in the triton documentation for [make_tensor_descriptor](https://triton-lang.org/main/python-api/generated/triton.language.make_tensor_descriptor.html) are met. Some implementation details: - The `TMACompatibilityChecker` class holds and checks the conditions required for a load / store operation to be represented by a TMA descriptor load / store - The current TMA API requires that the innermost block size loads at least 16 bytes of data, e.g. if the block shape is [YBLOCK, XBLOCK] and the tensor dtype is float32, this requires that XBLOCK >= 4. It is therefore required that the triton heuristics are aware of the minimum block sizes for the IO operations in the kernel. The minimum block sizes are determined in the `TMACompatibilityChecker` class and are passed to the triton heuristics when the block sizes are not static. The heuristic config options are then filtered to ensure that the minimum block size restriction is met. Testing: - Refactored test_torchinductor_strided_blocks.py to also test the `use_tensor_descriptor` option. This requires an upgrade to Triton version 3.4.0: https://github.com/pytorch/pytorch/issues/154206 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157906 Approved by: https://github.com/jansel	2025-07-11 09:32:40 +00:00
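The 16-byte rule in #157906 reduces to a small piece of arithmetic: the minimum innermost block size is 16 divided by the element size, rounded up. The sketch below works through that and the config filtering step. It assumes nothing about the real `TMACompatibilityChecker` interface; `min_inner_block`, `filter_configs`, and the candidate configs are hypothetical and exist only for illustration.

```python
import math
from typing import Dict, List

# From the commit message: the innermost block must load at least 16 bytes.
TMA_MIN_INNER_BYTES = 16


def min_inner_block(itemsize_bytes: int) -> int:
    """Smallest innermost block size (in elements) that covers 16 bytes."""
    return math.ceil(TMA_MIN_INNER_BYTES / itemsize_bytes)


def filter_configs(
    configs: List[Dict[str, int]],
    itemsize_bytes: int,
    inner_key: str = "XBLOCK",  # assumed name of the innermost block dimension
) -> List[Dict[str, int]]:
    """Keep only the configs whose innermost block meets the minimum size."""
    limit = min_inner_block(itemsize_bytes)
    return [cfg for cfg in configs if cfg[inner_key] >= limit]


# Hypothetical heuristic candidates for a float32 (4-byte) tensor.
candidates = [
    {"YBLOCK": 32, "XBLOCK": 2},
    {"YBLOCK": 16, "XBLOCK": 4},
    {"YBLOCK": 8, "XBLOCK": 64},
]
valid = filter_configs(candidates, itemsize_bytes=4)
print(min_inner_block(4), valid)
```

For float32 this gives a minimum of 4 elements, matching the `XBLOCK >= 4` example in the commit message, so the first candidate is dropped.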
Gabriel Ferns	7e83d50845	Inductor logging + analysis of torch.profile (#149697 ) Prereqs: - https://github.com/pytorch/pytorch/pull/152708 Features: 1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses. 1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`. 1. New helpers `countable_fx` and `count_flops_fx` helps get the flops of an `fx.Node`. 1. Extends Triton `torch.profiler` logging to `DebugAutotuner`. 1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side: ```python Device(NVIDIA H100, 0): Kernel Name \| resnet Kernel Count \| resnet FLOPS \| resnet bw gbps \| resnet Dur (ms) \| resnet Achieved FLOPS % \| resnet Achieved Bandwidth % \| newresnet Kernel Count \| newresnet FLOPS \| newresnet bw gbps \| newresnet Dur (ms) \| newresnet Achieved FLOPS % \| newresnet Achieved Bandwidth % --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- triton_poi_fused__native_batch_norm_legi \| 24 \| 0 \| 0.11395268248131513 \| 2.5919166666666666 \| 0 \| 0.003401572611382541 \| 24 \| 0 \| 0.11395268248131513 \| 2.5919166666666666 \| 0 \| 0.003401572611382541 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 142 \| 16932673552.422373 \| 0.2585007824198784 \| 12.441619718309857 \| 0.08683422334575583 \| 0.007716441266265022 \| 142 \| 16932673552.422373 \| 0.2585007824198784 \| 12.441619718309857 \| 
0.08683422334575583 \| 0.007716441266265022 triton_red_fused__native_batch_norm_legi \| 39 \| 0 \| 0.13990024992108846 \| 5.752589743589743 \| 0 \| 0.004176126863316074 \| 39 \| 0 \| 0.13990024992108846 \| 5.752589743589743 \| 0 \| 0.004176126863316074 triton_poi_fused__native_batch_norm_legi \| 25 \| 0 \| 0.31824055917536503 \| 2.5291999999999994 \| 0 \| 0.009499718184339253 \| 25 \| 0 \| 0.31824055917536503 \| 2.5291999999999994 \| 0 \| 0.009499718184339253 void cutlass::Kernel2<cutlass_80_tensoro \| 98 \| 16211056473.596165 \| 0.42972434051025826 \| 7.130408163265306 \| 0.08313362294151874 \| 0.012827592254037562 \| 98 \| 16211056473.596165 \| 0.42972434051025826 \| 7.130408163265306 \| 0.08313362294151874 \| 0.012827592254037562 triton_red_fused__native_batch_norm_legi \| 73 \| 0 \| 0.3225381327611705 \| 9.987068493150682 \| 0 \| 0.009628003963020014 \| 73 \| 0 \| 0.3225381327611705 \| 9.987068493150682 \| 0 \| 0.009628003963020014 triton_poi_fused__native_batch_norm_legi \| 15 \| 0 \| 1.4491211346487216 \| 4.439333333333333 \| 0 \| 0.043257347302946926 \| 15 \| 0 \| 1.4491211346487216 \| 4.439333333333333 \| 0 \| 0.043257347302946926 void cutlass::Kernel2<cutlass_80_tensoro \| 186 \| 14501701145.337954 \| 0.2667131401910989 \| 7.873865591397849 \| 0.07436769818122027 \| 0.007961586274361157 \| 186 \| 14501701145.337954 \| 0.2667131401910989 \| 7.873865591397849 \| 0.07436769818122027 \| 0.007961586274361157 triton_poi_fused__native_batch_norm_legi \| 33 \| 0 \| 1.4924556538193923 \| 4.3101515151515155 \| 0 \| 0.044550915039384846 \| 33 \| 0 \| 1.4924556538193923 \| 4.3101515151515155 \| 0 \| 0.044550915039384846 triton_red_fused__native_batch_norm_legi \| 29 \| 0 \| 0.25562590522631107 \| 6.296275862068965 \| 0 \| 0.007630624036606301 \| 29 \| 0 \| 0.25562590522631107 \| 6.296275862068965 \| 0 \| 0.007630624036606301 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.5870562174192726 \| 2.7397692307692307 \| 0 \| 0.01752406619162008 \| 13 \| 0 \| 
0.5870562174192726 \| 2.7397692307692307 \| 0 \| 0.01752406619162008 triton_poi_fused__native_batch_norm_legi \| 34 \| 0 \| 0.41409928846284 \| 2.853588235294117 \| 0 \| 0.012361172789935523 \| 34 \| 0 \| 0.41409928846284 \| 2.853588235294117 \| 0 \| 0.012361172789935523 triton_per_fused__native_batch_norm_legi \| 34 \| 0 \| 0.11705315007018151 \| 3.460647058823529 \| 0 \| 0.0034941238826919864 \| 34 \| 0 \| 0.11705315007018151 \| 3.460647058823529 \| 0 \| 0.0034941238826919864 triton_poi_fused__native_batch_norm_legi \| 16 \| 0 \| 0.17207853197124584 \| 2.3459375000000002 \| 0 \| 0.005136672596156592 \| 16 \| 0 \| 0.17207853197124584 \| 2.3459375000000002 \| 0 \| 0.005136672596156592 triton_per_fused__native_batch_norm_legi \| 30 \| 0 \| 0.2639714322022256 \| 6.131199999999999 \| 0 \| 0.007879744244842555 \| 30 \| 0 \| 0.2639714322022256 \| 6.131199999999999 \| 0 \| 0.007879744244842555 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 100 \| 11875430356.891787 \| 0.19494470869421385 \| 16.36534 \| 0.06089964285585531 \| 0.005819245035648175 \| 100 \| 11875430356.891787 \| 0.19494470869421385 \| 16.36534 \| 0.06089964285585531 \| 0.005819245035648175 triton_poi_fused__native_batch_norm_legi \| 8 \| 0 \| 0.9854096626224687 \| 3.2757500000000004 \| 0 \| 0.029415213809625928 \| 8 \| 0 \| 0.9854096626224687 \| 3.2757500000000004 \| 0 \| 0.029415213809625928 void cublasLt::splitKreduce_kernel<32, 1 \| 56 \| 34377923395.147064 \| 0.8310300045762317 \| 3.4199999999999986 \| 0.17629704305203628 \| 0.024806865808245714 \| 56 \| 34377923395.147064 \| 0.8310300045762317 \| 3.4199999999999986 \| 0.17629704305203628 \| 0.024806865808245714 triton_poi_fused__native_batch_norm_legi \| 23 \| 0 \| 0.9944002965861103 \| 3.2431304347826084 \| 0 \| 0.02968359094286896 \| 23 \| 0 \| 0.9944002965861103 \| 3.2431304347826084 \| 0 \| 0.02968359094286896 triton_per_fused__native_batch_norm_legi \| 10 \| 0 \| 0.1826801058931057 \| 4.428800000000001 \| 0 \| 0.00545313748934644 \| 10 \| 0 \| 
0.1826801058931057 \| 4.428800000000001 \| 0 \| 0.00545313748934644 triton_poi_fused__native_batch_norm_legi \| 10 \| 0 \| 0.3168973585366449 \| 2.5471999999999997 \| 0 \| 0.009459622642884923 \| 10 \| 0 \| 0.3168973585366449 \| 2.5471999999999997 \| 0 \| 0.009459622642884923 triton_poi_fused__native_batch_norm_legi \| 34 \| 0 \| 1.1463614897015777 \| 4.124323529411764 \| 0 \| 0.03421974596124114 \| 34 \| 0 \| 1.1463614897015777 \| 4.124323529411764 \| 0 \| 0.03421974596124114 void cask_plugin_cudnn::xmma_cudnn::init \| 44 \| 44045510816.64277 \| 2.0661232850348643 \| 3.6887499999999993 \| 0.22587441444432194 \| 0.06167532194133924 \| 44 \| 44045510816.64277 \| 2.0661232850348643 \| 3.6887499999999993 \| 0.22587441444432194 \| 0.06167532194133924 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 95 \| 7876855400.165316 \| 0.4694941555946739 \| 18.224315789473682 \| 0.04039413025725802 \| 0.014014750913273854 \| 95 \| 7876855400.165316 \| 0.4694941555946739 \| 18.224315789473682 \| 0.04039413025725802 \| 0.014014750913273854 triton_per_fused__native_batch_norm_legi \| 41 \| 0 \| 0.06825669875995298 \| 3.0384146341463416 \| 0 \| 0.002037513395819492 \| 41 \| 0 \| 0.06825669875995298 \| 3.0384146341463416 \| 0 \| 0.002037513395819492 triton_poi_fused__native_batch_norm_legi \| 23 \| 0 \| 0.08808154712430301 \| 2.3275652173913044 \| 0 \| 0.0026292999141582997 \| 23 \| 0 \| 0.08808154712430301 \| 2.3275652173913044 \| 0 \| 0.0026292999141582997 triton_per_fused__native_batch_norm_legi \| 40 \| 0 \| 0.18179321034952417 \| 4.556825 \| 0 \| 0.005426662995508183 \| 40 \| 0 \| 0.18179321034952417 \| 4.556825 \| 0 \| 0.005426662995508183 triton_poi_fused__native_batch_norm_legi \| 15 \| 0 \| 0.5887415155454232 \| 2.783866666666667 \| 0 \| 0.017574373598370836 \| 15 \| 0 \| 0.5887415155454232 \| 2.783866666666667 \| 0 \| 0.017574373598370836 void cutlass::Kernel2<cutlass_80_tensoro \| 38 \| 14242013806.264643 \| 0.256592404353939 \| 7.217631578947369 \| 0.0730359682372546 \| 
0.007659474756834 \| 38 \| 14242013806.264643 \| 0.256592404353939 \| 7.217631578947369 \| 0.0730359682372546 \| 0.007659474756834 triton_poi_fused__native_batch_norm_legi \| 21 \| 0 \| 0.5842860973430516 \| 2.7779047619047623 \| 0 \| 0.017441376040091088 \| 21 \| 0 \| 0.5842860973430516 \| 2.7779047619047623 \| 0 \| 0.017441376040091088 triton_per_fused__native_batch_norm_legi \| 16 \| 0 \| 0.11509365173486417 \| 3.5959375000000002 \| 0 \| 0.0034356313950705724 \| 16 \| 0 \| 0.11509365173486417 \| 3.5959375000000002 \| 0 \| 0.0034356313950705724 triton_poi_fused__native_batch_norm_legi \| 14 \| 0 \| 0.1704672000243914 \| 2.4044285714285714 \| 0 \| 0.00508857313505646 \| 14 \| 0 \| 0.1704672000243914 \| 2.4044285714285714 \| 0 \| 0.00508857313505646 triton_poi_fused__native_batch_norm_legi \| 58 \| 0 \| 2.307520779930795 \| 8.190706896551722 \| 0 \| 0.06888121731136704 \| 58 \| 0 \| 2.307520779930795 \| 8.190706896551722 \| 0 \| 0.06888121731136704 triton_per_fused__native_batch_norm_legi \| 29 \| 0 \| 0.037243248971881276 \| 3.0277586206896556 \| 0 \| 0.001111738775280038 \| 29 \| 0 \| 0.037243248971881276 \| 3.0277586206896556 \| 0 \| 0.001111738775280038 triton_poi_fused__native_batch_norm_legi \| 20 \| 0 \| 0.04741699795428918 \| 2.2911500000000005 \| 0 \| 0.0014154327747549007 \| 20 \| 0 \| 0.04741699795428918 \| 2.2911500000000005 \| 0 \| 0.0014154327747549007 triton_per_fused__native_batch_norm_legi \| 25 \| 0 \| 0.13357016893727824 \| 3.37536 \| 0 \| 0.003987169222008305 \| 25 \| 0 \| 0.13357016893727824 \| 3.37536 \| 0 \| 0.003987169222008305 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.3089862268300253 \| 2.8111538461538457 \| 0 \| 0.009223469457612694 \| 13 \| 0 \| 0.3089862268300253 \| 2.8111538461538457 \| 0 \| 0.009223469457612694 triton_poi_fused__native_batch_norm_legi \| 17 \| 0 \| 0.3129385387909844 \| 2.673 \| 0 \| 0.009341448919133863 \| 17 \| 0 \| 0.3129385387909844 \| 2.673 \| 0 \| 0.009341448919133863 
triton_per_fused__native_batch_norm_legi \| 19 \| 0 \| 0.2215568162533158 \| 3.8837368421052636 \| 0 \| 0.0066136363060691275 \| 19 \| 0 \| 0.2215568162533158 \| 3.8837368421052636 \| 0 \| 0.0066136363060691275 std::enable_if<!(false), void>::type int \| 23 \| 504916805.19297093 \| 1.0118296096314707 \| 8.113913043478261 \| 0.0025893169497075447 \| 0.030203868944223014 \| 23 \| 504916805.19297093 \| 1.0118296096314707 \| 8.113913043478261 \| 0.0025893169497075447 \| 0.030203868944223014 triton_poi_fused_add_copy__38 \| 56 \| 0 \| 0 \| 2.132482142857143 \| 0 \| 0 \| 56 \| 0 \| 0 \| 2.132482142857143 \| 0 \| 0 triton_poi_fused_convolution_0 \| 18 \| 0 \| 0.43458610794936897 \| 2.773333333333334 \| 0 \| 0.012972719640279667 \| 18 \| 0 \| 0.43458610794936897 \| 2.773333333333334 \| 0 \| 0.012972719640279667 triton_poi_fused_convolution_1 \| 17 \| 0 \| 0.028816312469162712 \| 2.6145882352941174 \| 0 \| 0.0008601884319153051 \| 17 \| 0 \| 0.028816312469162712 \| 2.6145882352941174 \| 0 \| 0.0008601884319153051 void convolve_common_engine_float_NHWC<f \| 44 \| 8641868995.31118 \| 0.024730540008465626 \| 25.87327272727273 \| 0.04431727689903169 \| 0.0007382250748795709 \| 44 \| 8641868995.31118 \| 0.024730540008465626 \| 25.87327272727273 \| 0.04431727689903169 \| 0.0007382250748795709 triton_per_fused__native_batch_norm_legi \| 12 \| 0 \| 0.6809930918986744 \| 4.82675 \| 0 \| 0.020328151996975356 \| 12 \| 0 \| 0.6809930918986744 \| 4.82675 \| 0 \| 0.020328151996975356 triton_per_fused__native_batch_norm_legi \| 14 \| 0 \| 0.02883030597936608 \| 2.6651428571428575 \| 0 \| 0.0008606061486377935 \| 14 \| 0 \| 0.02883030597936608 \| 2.6651428571428575 \| 0 \| 0.0008606061486377935 triton_per_fused__native_batch_norm_legi \| 16 \| 0 \| 0.0014658988233201874 \| 2.098 \| 0 \| 4.375817383045335e-05 \| 16 \| 0 \| 0.0014658988233201874 \| 2.098 \| 0 \| 4.375817383045335e-05 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.9926297180284697 \| 3.2367692307692306 \| 0 \| 
0.02963073785159611 \| 13 \| 0 \| 0.9926297180284697 \| 3.2367692307692306 \| 0 \| 0.02963073785159611 triton_poi_fused__native_batch_norm_legi \| 9 \| 0 \| 1.3008817095666507 \| 3.0863333333333336 \| 0 \| 0.03883228983781048 \| 9 \| 0 \| 1.3008817095666507 \| 3.0863333333333336 \| 0 \| 0.03883228983781048 void at::native::(anonymous namespace):: \| 98 \| 0 \| 0.09174335613709389 \| 4.408520408163265 \| 0 \| 0.0027386076458833994 \| 98 \| 0 \| 0.09174335613709389 \| 4.408520408163265 \| 0 \| 0.0027386076458833994 void at::native::vectorized_elementwise_ \| 7 \| 0 \| 0 \| 1.7278571428571428 \| 0 \| 0 \| 7 \| 0 \| 0 \| 1.7278571428571428 \| 0 \| 0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697 Approved by: https://github.com/eellison, https://github.com/shunting314	2025-07-07 22:13:34 +00:00
PyTorch MergeBot	6ef70edd9a	Revert "Inductor logging + analysis of torch.profile (#149697 )" This reverts commit `47f10d0ad0`. Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/malfet due to Looks like it's breaking ROCM tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm%20%2F%20linux-jammy ([comment](https://github.com/pytorch/pytorch/pull/149697#issuecomment-3025673908))	2025-07-01 22:11:53 +00:00
Gabriel Ferns	47f10d0ad0	Inductor logging + analysis of torch.profile (#149697 ) Prereqs: - https://github.com/pytorch/pytorch/pull/152708 Features: 1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses. 1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`. 1. New helpers `countable_fx` and `count_flops_fx` helps get the flops of an `fx.Node`. 1. Extends Triton `torch.profiler` logging to `DebugAutotuner`. 1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side: ```python Device(NVIDIA H100, 0): Kernel Name \| resnet Kernel Count \| resnet FLOPS \| resnet bw gbps \| resnet Dur (ms) \| resnet Achieved FLOPS % \| resnet Achieved Bandwidth % \| newresnet Kernel Count \| newresnet FLOPS \| newresnet bw gbps \| newresnet Dur (ms) \| newresnet Achieved FLOPS % \| newresnet Achieved Bandwidth % --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- triton_poi_fused__native_batch_norm_legi \| 24 \| 0 \| 0.11395268248131513 \| 2.5919166666666666 \| 0 \| 0.003401572611382541 \| 24 \| 0 \| 0.11395268248131513 \| 2.5919166666666666 \| 0 \| 0.003401572611382541 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 142 \| 16932673552.422373 \| 0.2585007824198784 \| 12.441619718309857 \| 0.08683422334575583 \| 0.007716441266265022 \| 142 \| 16932673552.422373 \| 0.2585007824198784 \| 12.441619718309857 \| 
0.08683422334575583 \| 0.007716441266265022 triton_red_fused__native_batch_norm_legi \| 39 \| 0 \| 0.13990024992108846 \| 5.752589743589743 \| 0 \| 0.004176126863316074 \| 39 \| 0 \| 0.13990024992108846 \| 5.752589743589743 \| 0 \| 0.004176126863316074 triton_poi_fused__native_batch_norm_legi \| 25 \| 0 \| 0.31824055917536503 \| 2.5291999999999994 \| 0 \| 0.009499718184339253 \| 25 \| 0 \| 0.31824055917536503 \| 2.5291999999999994 \| 0 \| 0.009499718184339253 void cutlass::Kernel2<cutlass_80_tensoro \| 98 \| 16211056473.596165 \| 0.42972434051025826 \| 7.130408163265306 \| 0.08313362294151874 \| 0.012827592254037562 \| 98 \| 16211056473.596165 \| 0.42972434051025826 \| 7.130408163265306 \| 0.08313362294151874 \| 0.012827592254037562 triton_red_fused__native_batch_norm_legi \| 73 \| 0 \| 0.3225381327611705 \| 9.987068493150682 \| 0 \| 0.009628003963020014 \| 73 \| 0 \| 0.3225381327611705 \| 9.987068493150682 \| 0 \| 0.009628003963020014 triton_poi_fused__native_batch_norm_legi \| 15 \| 0 \| 1.4491211346487216 \| 4.439333333333333 \| 0 \| 0.043257347302946926 \| 15 \| 0 \| 1.4491211346487216 \| 4.439333333333333 \| 0 \| 0.043257347302946926 void cutlass::Kernel2<cutlass_80_tensoro \| 186 \| 14501701145.337954 \| 0.2667131401910989 \| 7.873865591397849 \| 0.07436769818122027 \| 0.007961586274361157 \| 186 \| 14501701145.337954 \| 0.2667131401910989 \| 7.873865591397849 \| 0.07436769818122027 \| 0.007961586274361157 triton_poi_fused__native_batch_norm_legi \| 33 \| 0 \| 1.4924556538193923 \| 4.3101515151515155 \| 0 \| 0.044550915039384846 \| 33 \| 0 \| 1.4924556538193923 \| 4.3101515151515155 \| 0 \| 0.044550915039384846 triton_red_fused__native_batch_norm_legi \| 29 \| 0 \| 0.25562590522631107 \| 6.296275862068965 \| 0 \| 0.007630624036606301 \| 29 \| 0 \| 0.25562590522631107 \| 6.296275862068965 \| 0 \| 0.007630624036606301 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.5870562174192726 \| 2.7397692307692307 \| 0 \| 0.01752406619162008 \| 13 \| 0 \| 
0.5870562174192726 \| 2.7397692307692307 \| 0 \| 0.01752406619162008 triton_poi_fused__native_batch_norm_legi \| 34 \| 0 \| 0.41409928846284 \| 2.853588235294117 \| 0 \| 0.012361172789935523 \| 34 \| 0 \| 0.41409928846284 \| 2.853588235294117 \| 0 \| 0.012361172789935523 triton_per_fused__native_batch_norm_legi \| 34 \| 0 \| 0.11705315007018151 \| 3.460647058823529 \| 0 \| 0.0034941238826919864 \| 34 \| 0 \| 0.11705315007018151 \| 3.460647058823529 \| 0 \| 0.0034941238826919864 triton_poi_fused__native_batch_norm_legi \| 16 \| 0 \| 0.17207853197124584 \| 2.3459375000000002 \| 0 \| 0.005136672596156592 \| 16 \| 0 \| 0.17207853197124584 \| 2.3459375000000002 \| 0 \| 0.005136672596156592 triton_per_fused__native_batch_norm_legi \| 30 \| 0 \| 0.2639714322022256 \| 6.131199999999999 \| 0 \| 0.007879744244842555 \| 30 \| 0 \| 0.2639714322022256 \| 6.131199999999999 \| 0 \| 0.007879744244842555 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 100 \| 11875430356.891787 \| 0.19494470869421385 \| 16.36534 \| 0.06089964285585531 \| 0.005819245035648175 \| 100 \| 11875430356.891787 \| 0.19494470869421385 \| 16.36534 \| 0.06089964285585531 \| 0.005819245035648175 triton_poi_fused__native_batch_norm_legi \| 8 \| 0 \| 0.9854096626224687 \| 3.2757500000000004 \| 0 \| 0.029415213809625928 \| 8 \| 0 \| 0.9854096626224687 \| 3.2757500000000004 \| 0 \| 0.029415213809625928 void cublasLt::splitKreduce_kernel<32, 1 \| 56 \| 34377923395.147064 \| 0.8310300045762317 \| 3.4199999999999986 \| 0.17629704305203628 \| 0.024806865808245714 \| 56 \| 34377923395.147064 \| 0.8310300045762317 \| 3.4199999999999986 \| 0.17629704305203628 \| 0.024806865808245714 triton_poi_fused__native_batch_norm_legi \| 23 \| 0 \| 0.9944002965861103 \| 3.2431304347826084 \| 0 \| 0.02968359094286896 \| 23 \| 0 \| 0.9944002965861103 \| 3.2431304347826084 \| 0 \| 0.02968359094286896 triton_per_fused__native_batch_norm_legi \| 10 \| 0 \| 0.1826801058931057 \| 4.428800000000001 \| 0 \| 0.00545313748934644 \| 10 \| 0 \| 
0.1826801058931057 \| 4.428800000000001 \| 0 \| 0.00545313748934644 triton_poi_fused__native_batch_norm_legi \| 10 \| 0 \| 0.3168973585366449 \| 2.5471999999999997 \| 0 \| 0.009459622642884923 \| 10 \| 0 \| 0.3168973585366449 \| 2.5471999999999997 \| 0 \| 0.009459622642884923 triton_poi_fused__native_batch_norm_legi \| 34 \| 0 \| 1.1463614897015777 \| 4.124323529411764 \| 0 \| 0.03421974596124114 \| 34 \| 0 \| 1.1463614897015777 \| 4.124323529411764 \| 0 \| 0.03421974596124114 void cask_plugin_cudnn::xmma_cudnn::init \| 44 \| 44045510816.64277 \| 2.0661232850348643 \| 3.6887499999999993 \| 0.22587441444432194 \| 0.06167532194133924 \| 44 \| 44045510816.64277 \| 2.0661232850348643 \| 3.6887499999999993 \| 0.22587441444432194 \| 0.06167532194133924 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 95 \| 7876855400.165316 \| 0.4694941555946739 \| 18.224315789473682 \| 0.04039413025725802 \| 0.014014750913273854 \| 95 \| 7876855400.165316 \| 0.4694941555946739 \| 18.224315789473682 \| 0.04039413025725802 \| 0.014014750913273854 triton_per_fused__native_batch_norm_legi \| 41 \| 0 \| 0.06825669875995298 \| 3.0384146341463416 \| 0 \| 0.002037513395819492 \| 41 \| 0 \| 0.06825669875995298 \| 3.0384146341463416 \| 0 \| 0.002037513395819492 triton_poi_fused__native_batch_norm_legi \| 23 \| 0 \| 0.08808154712430301 \| 2.3275652173913044 \| 0 \| 0.0026292999141582997 \| 23 \| 0 \| 0.08808154712430301 \| 2.3275652173913044 \| 0 \| 0.0026292999141582997 triton_per_fused__native_batch_norm_legi \| 40 \| 0 \| 0.18179321034952417 \| 4.556825 \| 0 \| 0.005426662995508183 \| 40 \| 0 \| 0.18179321034952417 \| 4.556825 \| 0 \| 0.005426662995508183 triton_poi_fused__native_batch_norm_legi \| 15 \| 0 \| 0.5887415155454232 \| 2.783866666666667 \| 0 \| 0.017574373598370836 \| 15 \| 0 \| 0.5887415155454232 \| 2.783866666666667 \| 0 \| 0.017574373598370836 void cutlass::Kernel2<cutlass_80_tensoro \| 38 \| 14242013806.264643 \| 0.256592404353939 \| 7.217631578947369 \| 0.0730359682372546 \| 
0.007659474756834 \| 38 \| 14242013806.264643 \| 0.256592404353939 \| 7.217631578947369 \| 0.0730359682372546 \| 0.007659474756834 triton_poi_fused__native_batch_norm_legi \| 21 \| 0 \| 0.5842860973430516 \| 2.7779047619047623 \| 0 \| 0.017441376040091088 \| 21 \| 0 \| 0.5842860973430516 \| 2.7779047619047623 \| 0 \| 0.017441376040091088 triton_per_fused__native_batch_norm_legi \| 16 \| 0 \| 0.11509365173486417 \| 3.5959375000000002 \| 0 \| 0.0034356313950705724 \| 16 \| 0 \| 0.11509365173486417 \| 3.5959375000000002 \| 0 \| 0.0034356313950705724 triton_poi_fused__native_batch_norm_legi \| 14 \| 0 \| 0.1704672000243914 \| 2.4044285714285714 \| 0 \| 0.00508857313505646 \| 14 \| 0 \| 0.1704672000243914 \| 2.4044285714285714 \| 0 \| 0.00508857313505646 triton_poi_fused__native_batch_norm_legi \| 58 \| 0 \| 2.307520779930795 \| 8.190706896551722 \| 0 \| 0.06888121731136704 \| 58 \| 0 \| 2.307520779930795 \| 8.190706896551722 \| 0 \| 0.06888121731136704 triton_per_fused__native_batch_norm_legi \| 29 \| 0 \| 0.037243248971881276 \| 3.0277586206896556 \| 0 \| 0.001111738775280038 \| 29 \| 0 \| 0.037243248971881276 \| 3.0277586206896556 \| 0 \| 0.001111738775280038 triton_poi_fused__native_batch_norm_legi \| 20 \| 0 \| 0.04741699795428918 \| 2.2911500000000005 \| 0 \| 0.0014154327747549007 \| 20 \| 0 \| 0.04741699795428918 \| 2.2911500000000005 \| 0 \| 0.0014154327747549007 triton_per_fused__native_batch_norm_legi \| 25 \| 0 \| 0.13357016893727824 \| 3.37536 \| 0 \| 0.003987169222008305 \| 25 \| 0 \| 0.13357016893727824 \| 3.37536 \| 0 \| 0.003987169222008305 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.3089862268300253 \| 2.8111538461538457 \| 0 \| 0.009223469457612694 \| 13 \| 0 \| 0.3089862268300253 \| 2.8111538461538457 \| 0 \| 0.009223469457612694 triton_poi_fused__native_batch_norm_legi \| 17 \| 0 \| 0.3129385387909844 \| 2.673 \| 0 \| 0.009341448919133863 \| 17 \| 0 \| 0.3129385387909844 \| 2.673 \| 0 \| 0.009341448919133863 
triton_per_fused__native_batch_norm_legi \| 19 \| 0 \| 0.2215568162533158 \| 3.8837368421052636 \| 0 \| 0.0066136363060691275 \| 19 \| 0 \| 0.2215568162533158 \| 3.8837368421052636 \| 0 \| 0.0066136363060691275 std::enable_if<!(false), void>::type int \| 23 \| 504916805.19297093 \| 1.0118296096314707 \| 8.113913043478261 \| 0.0025893169497075447 \| 0.030203868944223014 \| 23 \| 504916805.19297093 \| 1.0118296096314707 \| 8.113913043478261 \| 0.0025893169497075447 \| 0.030203868944223014 triton_poi_fused_add_copy__38 \| 56 \| 0 \| 0 \| 2.132482142857143 \| 0 \| 0 \| 56 \| 0 \| 0 \| 2.132482142857143 \| 0 \| 0 triton_poi_fused_convolution_0 \| 18 \| 0 \| 0.43458610794936897 \| 2.773333333333334 \| 0 \| 0.012972719640279667 \| 18 \| 0 \| 0.43458610794936897 \| 2.773333333333334 \| 0 \| 0.012972719640279667 triton_poi_fused_convolution_1 \| 17 \| 0 \| 0.028816312469162712 \| 2.6145882352941174 \| 0 \| 0.0008601884319153051 \| 17 \| 0 \| 0.028816312469162712 \| 2.6145882352941174 \| 0 \| 0.0008601884319153051 void convolve_common_engine_float_NHWC<f \| 44 \| 8641868995.31118 \| 0.024730540008465626 \| 25.87327272727273 \| 0.04431727689903169 \| 0.0007382250748795709 \| 44 \| 8641868995.31118 \| 0.024730540008465626 \| 25.87327272727273 \| 0.04431727689903169 \| 0.0007382250748795709 triton_per_fused__native_batch_norm_legi \| 12 \| 0 \| 0.6809930918986744 \| 4.82675 \| 0 \| 0.020328151996975356 \| 12 \| 0 \| 0.6809930918986744 \| 4.82675 \| 0 \| 0.020328151996975356 triton_per_fused__native_batch_norm_legi \| 14 \| 0 \| 0.02883030597936608 \| 2.6651428571428575 \| 0 \| 0.0008606061486377935 \| 14 \| 0 \| 0.02883030597936608 \| 2.6651428571428575 \| 0 \| 0.0008606061486377935 triton_per_fused__native_batch_norm_legi \| 16 \| 0 \| 0.0014658988233201874 \| 2.098 \| 0 \| 4.375817383045335e-05 \| 16 \| 0 \| 0.0014658988233201874 \| 2.098 \| 0 \| 4.375817383045335e-05 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.9926297180284697 \| 3.2367692307692306 \| 0 \| 
0.02963073785159611 \| 13 \| 0 \| 0.9926297180284697 \| 3.2367692307692306 \| 0 \| 0.02963073785159611 triton_poi_fused__native_batch_norm_legi \| 9 \| 0 \| 1.3008817095666507 \| 3.0863333333333336 \| 0 \| 0.03883228983781048 \| 9 \| 0 \| 1.3008817095666507 \| 3.0863333333333336 \| 0 \| 0.03883228983781048 void at::native::(anonymous namespace):: \| 98 \| 0 \| 0.09174335613709389 \| 4.408520408163265 \| 0 \| 0.0027386076458833994 \| 98 \| 0 \| 0.09174335613709389 \| 4.408520408163265 \| 0 \| 0.0027386076458833994 void at::native::vectorized_elementwise_ \| 7 \| 0 \| 0 \| 1.7278571428571428 \| 0 \| 0 \| 7 \| 0 \| 0 \| 1.7278571428571428 \| 0 \| 0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697 Approved by: https://github.com/eellison, https://github.com/shunting314	2025-07-01 16:51:03 +00:00
PyTorch MergeBot	c038719731	Revert "Inductor logging + analysis of torch.profile (#149697 )" This reverts commit `347ace4c7a`. Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail on ROCm ([comment](https://github.com/pytorch/pytorch/pull/149697#issuecomment-3020006655))	2025-06-30 16:58:54 +00:00
Tom Ritchford	e3afbb0362	[inductor] Add typing to _inductor/ir.py (#149958 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149958 Approved by: https://github.com/Skylion007	2025-06-30 15:56:35 +00:00
Gabriel Ferns	347ace4c7a	Inductor logging + analysis of torch.profile (#149697 )

Prereqs:
- https://github.com/pytorch/pytorch/pull/152708

Features:
1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses.
2. Only uses the tflops estimation from Triton if we don't have the info from the datasheet, because Triton's estimates are inaccurate. I have a backlog item to fix Triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`.
3. New helpers `countable_fx` and `count_flops_fx` help get the flops of an `fx.Node`.
4. Extends Triton `torch.profiler` logging to `DebugAutotuner`.
5. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side:

```python
Device(NVIDIA H100, 0):
Kernel Name | resnet Kernel Count | resnet FLOPS | resnet bw gbps | resnet Dur (ms) | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS | newresnet bw gbps | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth %
------------------------------------------------------------------------------------------------------------------------
triton_poi_fused__native_batch_norm_legi | 24 | 0 | 0.11395268248131513 | 2.5919166666666666 | 0 | 0.003401572611382541 | 24 | 0 | 0.11395268248131513 | 2.5919166666666666 | 0 | 0.003401572611382541
sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142 | 16932673552.422373 | 0.2585007824198784 | 12.441619718309857 | 0.08683422334575583 | 0.007716441266265022 | 142 | 16932673552.422373 | 0.2585007824198784 | 12.441619718309857 | 0.08683422334575583 | 0.007716441266265022
triton_red_fused__native_batch_norm_legi | 39 | 0 | 0.13990024992108846 | 5.752589743589743 | 0 | 0.004176126863316074 | 39 | 0 | 0.13990024992108846 | 5.752589743589743 | 0 | 0.004176126863316074
triton_poi_fused__native_batch_norm_legi | 25 | 0 | 0.31824055917536503 | 2.5291999999999994 | 0 | 0.009499718184339253 | 25 | 0 | 0.31824055917536503 | 2.5291999999999994 | 0 | 0.009499718184339253
void cutlass::Kernel2<cutlass_80_tensoro | 98 | 16211056473.596165 | 0.42972434051025826 | 7.130408163265306 | 0.08313362294151874 | 0.012827592254037562 | 98 | 16211056473.596165 | 0.42972434051025826 | 7.130408163265306 | 0.08313362294151874 | 0.012827592254037562
triton_red_fused__native_batch_norm_legi | 73 | 0 | 0.3225381327611705 | 9.987068493150682 | 0 | 0.009628003963020014 | 73 | 0 | 0.3225381327611705 | 9.987068493150682 | 0 | 0.009628003963020014
triton_poi_fused__native_batch_norm_legi | 15 | 0 | 1.4491211346487216 | 4.439333333333333 | 0 | 0.043257347302946926 | 15 | 0 | 1.4491211346487216 | 4.439333333333333 | 0 | 0.043257347302946926
void cutlass::Kernel2<cutlass_80_tensoro | 186 | 14501701145.337954 | 0.2667131401910989 | 7.873865591397849 | 0.07436769818122027 | 0.007961586274361157 | 186 | 14501701145.337954 | 0.2667131401910989 | 7.873865591397849 | 0.07436769818122027 | 0.007961586274361157
triton_poi_fused__native_batch_norm_legi | 33 | 0 | 1.4924556538193923 | 4.3101515151515155 | 0 | 0.044550915039384846 | 33 | 0 | 1.4924556538193923 | 4.3101515151515155 | 0 | 0.044550915039384846
triton_red_fused__native_batch_norm_legi | 29 | 0 | 0.25562590522631107 | 6.296275862068965 | 0 | 0.007630624036606301 | 29 | 0 | 0.25562590522631107 | 6.296275862068965 | 0 | 0.007630624036606301
triton_poi_fused__native_batch_norm_legi | 13 | 0 | 0.5870562174192726 | 2.7397692307692307 | 0 | 0.01752406619162008 | 13 | 0 | 0.5870562174192726 | 2.7397692307692307 | 0 | 0.01752406619162008
triton_poi_fused__native_batch_norm_legi | 34 | 0 | 0.41409928846284 | 2.853588235294117 | 0 | 0.012361172789935523 | 34 | 0 | 0.41409928846284 | 2.853588235294117 | 0 | 0.012361172789935523
triton_per_fused__native_batch_norm_legi | 34 | 0 | 0.11705315007018151 | 3.460647058823529 | 0 | 0.0034941238826919864 | 34 | 0 | 0.11705315007018151 | 3.460647058823529 | 0 | 0.0034941238826919864
triton_poi_fused__native_batch_norm_legi | 16 | 0 | 0.17207853197124584 | 2.3459375000000002 | 0 | 0.005136672596156592 | 16 | 0 | 0.17207853197124584 | 2.3459375000000002 | 0 | 0.005136672596156592
triton_per_fused__native_batch_norm_legi | 30 | 0 | 0.2639714322022256 | 6.131199999999999 | 0 | 0.007879744244842555 | 30 | 0 | 0.2639714322022256 | 6.131199999999999 | 0 | 0.007879744244842555
sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100 | 11875430356.891787 | 0.19494470869421385 | 16.36534 | 0.06089964285585531 | 0.005819245035648175 | 100 | 11875430356.891787 | 0.19494470869421385 | 16.36534 | 0.06089964285585531 | 0.005819245035648175
triton_poi_fused__native_batch_norm_legi | 8 | 0 | 0.9854096626224687 | 3.2757500000000004 | 0 | 0.029415213809625928 | 8 | 0 | 0.9854096626224687 | 3.2757500000000004 | 0 | 0.029415213809625928
void cublasLt::splitKreduce_kernel<32, 1 | 56 | 34377923395.147064 | 0.8310300045762317 | 3.4199999999999986 | 0.17629704305203628 | 0.024806865808245714 | 56 | 34377923395.147064 | 0.8310300045762317 | 3.4199999999999986 | 0.17629704305203628 | 0.024806865808245714
triton_poi_fused__native_batch_norm_legi | 23 | 0 | 0.9944002965861103 | 3.2431304347826084 | 0 | 0.02968359094286896 | 23 | 0 | 0.9944002965861103 | 3.2431304347826084 | 0 | 0.02968359094286896
triton_per_fused__native_batch_norm_legi | 10 | 0 | 0.1826801058931057 | 4.428800000000001 | 0 | 0.00545313748934644 | 10 | 0 | 0.1826801058931057 | 4.428800000000001 | 0 | 0.00545313748934644
triton_poi_fused__native_batch_norm_legi | 10 | 0 | 0.3168973585366449 | 2.5471999999999997 | 0 | 0.009459622642884923 | 10 | 0 | 0.3168973585366449 | 2.5471999999999997 | 0 | 0.009459622642884923
triton_poi_fused__native_batch_norm_legi | 34 | 0 | 1.1463614897015777 | 4.124323529411764 | 0 | 0.03421974596124114 | 34 | 0 | 1.1463614897015777 | 4.124323529411764 | 0 | 0.03421974596124114
void cask_plugin_cudnn::xmma_cudnn::init | 44 | 44045510816.64277 | 2.0661232850348643 | 3.6887499999999993 | 0.22587441444432194 | 0.06167532194133924 | 44 | 44045510816.64277 | 2.0661232850348643 | 3.6887499999999993 | 0.22587441444432194 | 0.06167532194133924
sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95 | 7876855400.165316 | 0.4694941555946739 | 18.224315789473682 | 0.04039413025725802 | 0.014014750913273854 | 95 | 7876855400.165316 | 0.4694941555946739 | 18.224315789473682 | 0.04039413025725802 | 0.014014750913273854
triton_per_fused__native_batch_norm_legi | 41 | 0 | 0.06825669875995298 | 3.0384146341463416 | 0 | 0.002037513395819492 | 41 | 0 | 0.06825669875995298 | 3.0384146341463416 | 0 | 0.002037513395819492
triton_poi_fused__native_batch_norm_legi | 23 | 0 | 0.08808154712430301 | 2.3275652173913044 | 0 | 0.0026292999141582997 | 23 | 0 | 0.08808154712430301 | 2.3275652173913044 | 0 | 0.0026292999141582997
triton_per_fused__native_batch_norm_legi | 40 | 0 | 0.18179321034952417 | 4.556825 | 0 | 0.005426662995508183 | 40 | 0 | 0.18179321034952417 | 4.556825 | 0 | 0.005426662995508183
triton_poi_fused__native_batch_norm_legi | 15 | 0 | 0.5887415155454232 | 2.783866666666667 | 0 | 0.017574373598370836 | 15 | 0 | 0.5887415155454232 | 2.783866666666667 | 0 | 0.017574373598370836
void cutlass::Kernel2<cutlass_80_tensoro | 38 | 14242013806.264643 | 0.256592404353939 | 7.217631578947369 | 0.0730359682372546 | 0.007659474756834 | 38 | 14242013806.264643 | 0.256592404353939 | 7.217631578947369 | 0.0730359682372546 | 0.007659474756834
triton_poi_fused__native_batch_norm_legi | 21 | 0 | 0.5842860973430516 | 2.7779047619047623 | 0 | 0.017441376040091088 | 21 | 0 | 0.5842860973430516 | 2.7779047619047623 | 0 | 0.017441376040091088
triton_per_fused__native_batch_norm_legi | 16 | 0 | 0.11509365173486417 | 3.5959375000000002 | 0 | 0.0034356313950705724 | 16 | 0 | 0.11509365173486417 | 3.5959375000000002 | 0 | 0.0034356313950705724
triton_poi_fused__native_batch_norm_legi | 14 | 0 | 0.1704672000243914 | 2.4044285714285714 | 0 | 0.00508857313505646 | 14 | 0 | 0.1704672000243914 | 2.4044285714285714 | 0 | 0.00508857313505646
triton_poi_fused__native_batch_norm_legi | 58 | 0 | 2.307520779930795 | 8.190706896551722 | 0 | 0.06888121731136704 | 58 | 0 | 2.307520779930795 | 8.190706896551722 | 0 | 0.06888121731136704
triton_per_fused__native_batch_norm_legi | 29 | 0 | 0.037243248971881276 | 3.0277586206896556 | 0 | 0.001111738775280038 | 29 | 0 | 0.037243248971881276 | 3.0277586206896556 | 0 | 0.001111738775280038
triton_poi_fused__native_batch_norm_legi | 20 | 0 | 0.04741699795428918 | 2.2911500000000005 | 0 | 0.0014154327747549007 | 20 | 0 | 0.04741699795428918 | 2.2911500000000005 | 0 | 0.0014154327747549007
triton_per_fused__native_batch_norm_legi | 25 | 0 | 0.13357016893727824 | 3.37536 | 0 | 0.003987169222008305 | 25 | 0 | 0.13357016893727824 | 3.37536 | 0 | 0.003987169222008305
triton_poi_fused__native_batch_norm_legi | 13 | 0 | 0.3089862268300253 | 2.8111538461538457 | 0 | 0.009223469457612694 | 13 | 0 | 0.3089862268300253 | 2.8111538461538457 | 0 | 0.009223469457612694
triton_poi_fused__native_batch_norm_legi | 17 | 0 | 0.3129385387909844 | 2.673 | 0 | 0.009341448919133863 | 17 | 0 | 0.3129385387909844 | 2.673 | 0 | 0.009341448919133863
triton_per_fused__native_batch_norm_legi | 19 | 0 | 0.2215568162533158 | 3.8837368421052636 | 0 | 0.0066136363060691275 | 19 | 0 | 0.2215568162533158 | 3.8837368421052636 | 0 | 0.0066136363060691275
std::enable_if<!(false), void>::type int | 23 | 504916805.19297093 | 1.0118296096314707 | 8.113913043478261 | 0.0025893169497075447 | 0.030203868944223014 | 23 | 504916805.19297093 | 1.0118296096314707 | 8.113913043478261 | 0.0025893169497075447 | 0.030203868944223014
triton_poi_fused_add_copy__38 | 56 | 0 | 0 | 2.132482142857143 | 0 | 0 | 56 | 0 | 0 | 2.132482142857143 | 0 | 0
triton_poi_fused_convolution_0 | 18 | 0 | 0.43458610794936897 | 2.773333333333334 | 0 | 0.012972719640279667 | 18 | 0 | 0.43458610794936897 | 2.773333333333334 | 0 | 0.012972719640279667
triton_poi_fused_convolution_1 | 17 | 0 | 0.028816312469162712 | 2.6145882352941174 | 0 | 0.0008601884319153051 | 17 | 0 | 0.028816312469162712 | 2.6145882352941174 | 0 | 0.0008601884319153051
void convolve_common_engine_float_NHWC<f | 44 | 8641868995.31118 | 0.024730540008465626 | 25.87327272727273 | 0.04431727689903169 | 0.0007382250748795709 | 44 | 8641868995.31118 | 0.024730540008465626 | 25.87327272727273 | 0.04431727689903169 | 0.0007382250748795709
triton_per_fused__native_batch_norm_legi | 12 | 0 | 0.6809930918986744 | 4.82675 | 0 | 0.020328151996975356 | 12 | 0 | 0.6809930918986744 | 4.82675 | 0 | 0.020328151996975356
triton_per_fused__native_batch_norm_legi | 14 | 0 | 0.02883030597936608 | 2.6651428571428575 | 0 | 0.0008606061486377935 | 14 | 0 | 0.02883030597936608 | 2.6651428571428575 | 0 | 0.0008606061486377935
triton_per_fused__native_batch_norm_legi | 16 | 0 | 0.0014658988233201874 | 2.098 | 0 | 4.375817383045335e-05 | 16 | 0 | 0.0014658988233201874 | 2.098 | 0 | 4.375817383045335e-05
triton_poi_fused__native_batch_norm_legi | 13 | 0 | 0.9926297180284697 | 3.2367692307692306 | 0 | 0.02963073785159611 | 13 | 0 | 0.9926297180284697 | 3.2367692307692306 | 0 | 0.02963073785159611
triton_poi_fused__native_batch_norm_legi | 9 | 0 | 1.3008817095666507 | 3.0863333333333336 | 0 | 0.03883228983781048 | 9 | 0 | 1.3008817095666507 | 3.0863333333333336 | 0 | 0.03883228983781048
void at::native::(anonymous namespace):: | 98 | 0 | 0.09174335613709389 | 4.408520408163265 | 0 | 0.0027386076458833994 | 98 | 0 | 0.09174335613709389 | 4.408520408163265 | 0 | 0.0027386076458833994
void at::native::vectorized_elementwise_ | 7 | 0 | 0 | 1.7278571428571428 | 0 | 0 | 7 | 0 | 0 | 1.7278571428571428 | 0 | 0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697 Approved by: https://github.com/eellison, https://github.com/shunting314	2025-06-29 05:00:47 +00:00
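
The "Achieved FLOPS %" and "Achieved Bandwidth %" columns in the #149697 summary presumably relate a measured rate to a device peak taken from a datasheet. A minimal sketch of that arithmetic follows; the function names `rate_per_second` and `achieved_percent` and the peak value in the usage lines are hypothetical illustrations, not code from `profile_analysis.py`.

```python
def rate_per_second(total_work: float, duration_ms: float) -> float:
    """Convert an amount of work done in `duration_ms` into a per-second rate.

    `total_work` can be a FLOP count or a byte count; the unit of the result
    follows the unit of the input (FLOP/s or bytes/s).
    """
    if duration_ms <= 0:
        raise ValueError("duration_ms must be positive")
    return total_work / (duration_ms / 1000.0)


def achieved_percent(achieved_rate: float, peak_rate: float) -> float:
    """Express an achieved rate as a percentage of an assumed peak rate."""
    if peak_rate <= 0:
        raise ValueError("peak_rate must be positive")
    return 100.0 * achieved_rate / peak_rate


if __name__ == "__main__":
    # Hypothetical kernel: 4e9 floating-point operations in 2 ms.
    flops = rate_per_second(total_work=4e9, duration_ms=2.0)
    # Hypothetical device peak of 20 TFLOP/s (an assumption, not a datasheet value).
    print(achieved_percent(flops, peak_rate=20e12))
```

Both rates must be in the same unit before dividing; a kernel with no countable flops gives 0 in either column, as in the rows above that report 0 for FLOPS.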
Xuehai Pan	6ff6630375	[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313 Approved by: https://github.com/jingsh	2025-06-23 02:57:12 +00:00
PyTorch MergeBot	f1331f3f1b	Revert "[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313 )" This reverts commit `3627270bdf`. Reverted https://github.com/pytorch/pytorch/pull/156313 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](`c95f7fa874`) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))	2025-06-22 12:31:57 +00:00

681 Commits