pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
David Berard	cd995bfb2a	[inductor] re-enable TMA templates w/ AOTI (#157819 ) Follow-up from #155896: now that AOTI can codegen non-null TMA workspace args, we can re-enable TMA templates w/ AOTI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157819 Approved by: https://github.com/drisspg	2025-07-10 08:35:29 +00:00
Sheng Fu	3584e84c24	Fixed the function to get the origin nodes of fused triton kernel. (#157578 ) Summary: This diff fixes the following issue: in the Python source code for CompiledFxGraph, the FX graph segment for the Triton kernel is broken. For example, take the following function: def fn(a, b, c): x = torch.nn.functional.linear(a, b) x = x.sin() x = x.t() + c return x Inductor compiles this FX graph into two nodes: the first is mm, and the second is a Triton kernel for sin + transpose + add. The FX graph segment for the Triton kernel looks like the following: Graph fragment: %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %arg2_1), kwargs = {}) That is, only the "add" node of the FX graph appears. The root cause is that the function caffe2/torch/_inductor/utils.py:gather_origins does not detect realized nodes correctly. To fix this, the IRNode is checked against the following types: ir.ComputedBuffer, ir.InputsKernel, ir.InputBuffer, ir.ReinterpretView, ir.TemplateBuffer. If it is one of them, it is realized; otherwise, it is not. Test Plan: buck2 run mode/opt caffe2/test/inductor:provenance_tracing -- caffe2.test.inductor.test_provenance_tracing.TestProvenanceTracingArtifact.test_triton_kernel_to_post_grad_tracing_cuda Rollback Plan: Differential Revision: D77748371 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157578 Approved by: https://github.com/mlazos	2025-07-10 05:34:50 +00:00
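The realized-node check described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the actual change: the classes below are hypothetical stand-ins named after the `ir.*` types listed in the message, and `is_realized` and `UnrealizedOp` are names assumed here for illustration only.

```python
# Hypothetical stand-ins for the Inductor IR classes named in the commit
# message; the real classes live in torch._inductor.ir and are far richer.
class IRNode:
    pass

class ComputedBuffer(IRNode):
    pass

class InputsKernel(IRNode):
    pass

class InputBuffer(IRNode):
    pass

class ReinterpretView(IRNode):
    pass

class TemplateBuffer(IRNode):
    pass

class UnrealizedOp(IRNode):
    """Assumed example of a node type that is not in the realized list."""

# The five types the commit message lists as "realized".
REALIZED_TYPES = (
    ComputedBuffer,
    InputsKernel,
    InputBuffer,
    ReinterpretView,
    TemplateBuffer,
)

def is_realized(node: IRNode) -> bool:
    # A node counts as realized only if it is an instance of a listed type.
    return isinstance(node, REALIZED_TYPES)
```

Under this assumption, a traversal that gathers origin nodes would keep descending through nodes for which `is_realized` is false and stop at those for which it is true.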
Shangdi Yu	effe376db0	Adding aoti_standalone config (#157731 ) Summary: When `compile_standalone` is True, we set `package_cpp_only` to True as well. We raise an error if `package_cpp_only` is explicitly set to False in config. Test Plan: ``` buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r TestAOTInductorConfig ``` Rollback Plan: Differential Revision: D77889754 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157731 Approved by: https://github.com/desertfire	2025-07-09 04:30:04 +00:00
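The config rule in the commit above (standalone compilation implies C++-only packaging, and an explicit `False` is rejected) might look roughly like this sketch. The `AotiConfig` dataclass and `resolve_config` function are hypothetical names for illustration, and using `None` to mean "not explicitly set" is an assumption; the actual handling in Inductor's config may differ.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class AotiConfig:
    # Hypothetical container; field names mirror the options in the commit.
    compile_standalone: bool = False
    # None is assumed here to mean "the user did not set this explicitly".
    package_cpp_only: Optional[bool] = None

def resolve_config(config: AotiConfig) -> AotiConfig:
    """Apply the rule: compile_standalone=True forces package_cpp_only=True."""
    if not config.compile_standalone:
        return config
    if config.package_cpp_only is False:
        # An explicit False contradicts standalone compilation.
        raise ValueError(
            "compile_standalone=True requires package_cpp_only=True"
        )
    return replace(config, package_cpp_only=True)
```

Raising on the contradictory combination, rather than silently overriding it, surfaces the conflict to the user at configuration time.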
Gabriel Ferns	7e83d50845	Inductor logging + analysis of torch.profile (#149697 ) Prereqs: - https://github.com/pytorch/pytorch/pull/152708 Features: 1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses. 1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`. 1. New helpers `countable_fx` and `count_flops_fx` helps get the flops of an `fx.Node`. 1. Extends Triton `torch.profiler` logging to `DebugAutotuner`. 1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side: ```python Device(NVIDIA H100, 0): Kernel Name \| resnet Kernel Count \| resnet FLOPS \| resnet bw gbps \| resnet Dur (ms) \| resnet Achieved FLOPS % \| resnet Achieved Bandwidth % \| newresnet Kernel Count \| newresnet FLOPS \| newresnet bw gbps \| newresnet Dur (ms) \| newresnet Achieved FLOPS % \| newresnet Achieved Bandwidth % --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- triton_poi_fused__native_batch_norm_legi \| 24 \| 0 \| 0.11395268248131513 \| 2.5919166666666666 \| 0 \| 0.003401572611382541 \| 24 \| 0 \| 0.11395268248131513 \| 2.5919166666666666 \| 0 \| 0.003401572611382541 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 142 \| 16932673552.422373 \| 0.2585007824198784 \| 12.441619718309857 \| 0.08683422334575583 \| 0.007716441266265022 \| 142 \| 16932673552.422373 \| 0.2585007824198784 \| 12.441619718309857 \| 
0.08683422334575583 \| 0.007716441266265022 triton_red_fused__native_batch_norm_legi \| 39 \| 0 \| 0.13990024992108846 \| 5.752589743589743 \| 0 \| 0.004176126863316074 \| 39 \| 0 \| 0.13990024992108846 \| 5.752589743589743 \| 0 \| 0.004176126863316074 triton_poi_fused__native_batch_norm_legi \| 25 \| 0 \| 0.31824055917536503 \| 2.5291999999999994 \| 0 \| 0.009499718184339253 \| 25 \| 0 \| 0.31824055917536503 \| 2.5291999999999994 \| 0 \| 0.009499718184339253 void cutlass::Kernel2<cutlass_80_tensoro \| 98 \| 16211056473.596165 \| 0.42972434051025826 \| 7.130408163265306 \| 0.08313362294151874 \| 0.012827592254037562 \| 98 \| 16211056473.596165 \| 0.42972434051025826 \| 7.130408163265306 \| 0.08313362294151874 \| 0.012827592254037562 triton_red_fused__native_batch_norm_legi \| 73 \| 0 \| 0.3225381327611705 \| 9.987068493150682 \| 0 \| 0.009628003963020014 \| 73 \| 0 \| 0.3225381327611705 \| 9.987068493150682 \| 0 \| 0.009628003963020014 triton_poi_fused__native_batch_norm_legi \| 15 \| 0 \| 1.4491211346487216 \| 4.439333333333333 \| 0 \| 0.043257347302946926 \| 15 \| 0 \| 1.4491211346487216 \| 4.439333333333333 \| 0 \| 0.043257347302946926 void cutlass::Kernel2<cutlass_80_tensoro \| 186 \| 14501701145.337954 \| 0.2667131401910989 \| 7.873865591397849 \| 0.07436769818122027 \| 0.007961586274361157 \| 186 \| 14501701145.337954 \| 0.2667131401910989 \| 7.873865591397849 \| 0.07436769818122027 \| 0.007961586274361157 triton_poi_fused__native_batch_norm_legi \| 33 \| 0 \| 1.4924556538193923 \| 4.3101515151515155 \| 0 \| 0.044550915039384846 \| 33 \| 0 \| 1.4924556538193923 \| 4.3101515151515155 \| 0 \| 0.044550915039384846 triton_red_fused__native_batch_norm_legi \| 29 \| 0 \| 0.25562590522631107 \| 6.296275862068965 \| 0 \| 0.007630624036606301 \| 29 \| 0 \| 0.25562590522631107 \| 6.296275862068965 \| 0 \| 0.007630624036606301 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.5870562174192726 \| 2.7397692307692307 \| 0 \| 0.01752406619162008 \| 13 \| 0 \| 
0.5870562174192726 \| 2.7397692307692307 \| 0 \| 0.01752406619162008 triton_poi_fused__native_batch_norm_legi \| 34 \| 0 \| 0.41409928846284 \| 2.853588235294117 \| 0 \| 0.012361172789935523 \| 34 \| 0 \| 0.41409928846284 \| 2.853588235294117 \| 0 \| 0.012361172789935523 triton_per_fused__native_batch_norm_legi \| 34 \| 0 \| 0.11705315007018151 \| 3.460647058823529 \| 0 \| 0.0034941238826919864 \| 34 \| 0 \| 0.11705315007018151 \| 3.460647058823529 \| 0 \| 0.0034941238826919864 triton_poi_fused__native_batch_norm_legi \| 16 \| 0 \| 0.17207853197124584 \| 2.3459375000000002 \| 0 \| 0.005136672596156592 \| 16 \| 0 \| 0.17207853197124584 \| 2.3459375000000002 \| 0 \| 0.005136672596156592 triton_per_fused__native_batch_norm_legi \| 30 \| 0 \| 0.2639714322022256 \| 6.131199999999999 \| 0 \| 0.007879744244842555 \| 30 \| 0 \| 0.2639714322022256 \| 6.131199999999999 \| 0 \| 0.007879744244842555 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 100 \| 11875430356.891787 \| 0.19494470869421385 \| 16.36534 \| 0.06089964285585531 \| 0.005819245035648175 \| 100 \| 11875430356.891787 \| 0.19494470869421385 \| 16.36534 \| 0.06089964285585531 \| 0.005819245035648175 triton_poi_fused__native_batch_norm_legi \| 8 \| 0 \| 0.9854096626224687 \| 3.2757500000000004 \| 0 \| 0.029415213809625928 \| 8 \| 0 \| 0.9854096626224687 \| 3.2757500000000004 \| 0 \| 0.029415213809625928 void cublasLt::splitKreduce_kernel<32, 1 \| 56 \| 34377923395.147064 \| 0.8310300045762317 \| 3.4199999999999986 \| 0.17629704305203628 \| 0.024806865808245714 \| 56 \| 34377923395.147064 \| 0.8310300045762317 \| 3.4199999999999986 \| 0.17629704305203628 \| 0.024806865808245714 triton_poi_fused__native_batch_norm_legi \| 23 \| 0 \| 0.9944002965861103 \| 3.2431304347826084 \| 0 \| 0.02968359094286896 \| 23 \| 0 \| 0.9944002965861103 \| 3.2431304347826084 \| 0 \| 0.02968359094286896 triton_per_fused__native_batch_norm_legi \| 10 \| 0 \| 0.1826801058931057 \| 4.428800000000001 \| 0 \| 0.00545313748934644 \| 10 \| 0 \| 
0.1826801058931057 \| 4.428800000000001 \| 0 \| 0.00545313748934644 triton_poi_fused__native_batch_norm_legi \| 10 \| 0 \| 0.3168973585366449 \| 2.5471999999999997 \| 0 \| 0.009459622642884923 \| 10 \| 0 \| 0.3168973585366449 \| 2.5471999999999997 \| 0 \| 0.009459622642884923 triton_poi_fused__native_batch_norm_legi \| 34 \| 0 \| 1.1463614897015777 \| 4.124323529411764 \| 0 \| 0.03421974596124114 \| 34 \| 0 \| 1.1463614897015777 \| 4.124323529411764 \| 0 \| 0.03421974596124114 void cask_plugin_cudnn::xmma_cudnn::init \| 44 \| 44045510816.64277 \| 2.0661232850348643 \| 3.6887499999999993 \| 0.22587441444432194 \| 0.06167532194133924 \| 44 \| 44045510816.64277 \| 2.0661232850348643 \| 3.6887499999999993 \| 0.22587441444432194 \| 0.06167532194133924 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 95 \| 7876855400.165316 \| 0.4694941555946739 \| 18.224315789473682 \| 0.04039413025725802 \| 0.014014750913273854 \| 95 \| 7876855400.165316 \| 0.4694941555946739 \| 18.224315789473682 \| 0.04039413025725802 \| 0.014014750913273854 triton_per_fused__native_batch_norm_legi \| 41 \| 0 \| 0.06825669875995298 \| 3.0384146341463416 \| 0 \| 0.002037513395819492 \| 41 \| 0 \| 0.06825669875995298 \| 3.0384146341463416 \| 0 \| 0.002037513395819492 triton_poi_fused__native_batch_norm_legi \| 23 \| 0 \| 0.08808154712430301 \| 2.3275652173913044 \| 0 \| 0.0026292999141582997 \| 23 \| 0 \| 0.08808154712430301 \| 2.3275652173913044 \| 0 \| 0.0026292999141582997 triton_per_fused__native_batch_norm_legi \| 40 \| 0 \| 0.18179321034952417 \| 4.556825 \| 0 \| 0.005426662995508183 \| 40 \| 0 \| 0.18179321034952417 \| 4.556825 \| 0 \| 0.005426662995508183 triton_poi_fused__native_batch_norm_legi \| 15 \| 0 \| 0.5887415155454232 \| 2.783866666666667 \| 0 \| 0.017574373598370836 \| 15 \| 0 \| 0.5887415155454232 \| 2.783866666666667 \| 0 \| 0.017574373598370836 void cutlass::Kernel2<cutlass_80_tensoro \| 38 \| 14242013806.264643 \| 0.256592404353939 \| 7.217631578947369 \| 0.0730359682372546 \| 
0.007659474756834 \| 38 \| 14242013806.264643 \| 0.256592404353939 \| 7.217631578947369 \| 0.0730359682372546 \| 0.007659474756834 triton_poi_fused__native_batch_norm_legi \| 21 \| 0 \| 0.5842860973430516 \| 2.7779047619047623 \| 0 \| 0.017441376040091088 \| 21 \| 0 \| 0.5842860973430516 \| 2.7779047619047623 \| 0 \| 0.017441376040091088 triton_per_fused__native_batch_norm_legi \| 16 \| 0 \| 0.11509365173486417 \| 3.5959375000000002 \| 0 \| 0.0034356313950705724 \| 16 \| 0 \| 0.11509365173486417 \| 3.5959375000000002 \| 0 \| 0.0034356313950705724 triton_poi_fused__native_batch_norm_legi \| 14 \| 0 \| 0.1704672000243914 \| 2.4044285714285714 \| 0 \| 0.00508857313505646 \| 14 \| 0 \| 0.1704672000243914 \| 2.4044285714285714 \| 0 \| 0.00508857313505646 triton_poi_fused__native_batch_norm_legi \| 58 \| 0 \| 2.307520779930795 \| 8.190706896551722 \| 0 \| 0.06888121731136704 \| 58 \| 0 \| 2.307520779930795 \| 8.190706896551722 \| 0 \| 0.06888121731136704 triton_per_fused__native_batch_norm_legi \| 29 \| 0 \| 0.037243248971881276 \| 3.0277586206896556 \| 0 \| 0.001111738775280038 \| 29 \| 0 \| 0.037243248971881276 \| 3.0277586206896556 \| 0 \| 0.001111738775280038 triton_poi_fused__native_batch_norm_legi \| 20 \| 0 \| 0.04741699795428918 \| 2.2911500000000005 \| 0 \| 0.0014154327747549007 \| 20 \| 0 \| 0.04741699795428918 \| 2.2911500000000005 \| 0 \| 0.0014154327747549007 triton_per_fused__native_batch_norm_legi \| 25 \| 0 \| 0.13357016893727824 \| 3.37536 \| 0 \| 0.003987169222008305 \| 25 \| 0 \| 0.13357016893727824 \| 3.37536 \| 0 \| 0.003987169222008305 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.3089862268300253 \| 2.8111538461538457 \| 0 \| 0.009223469457612694 \| 13 \| 0 \| 0.3089862268300253 \| 2.8111538461538457 \| 0 \| 0.009223469457612694 triton_poi_fused__native_batch_norm_legi \| 17 \| 0 \| 0.3129385387909844 \| 2.673 \| 0 \| 0.009341448919133863 \| 17 \| 0 \| 0.3129385387909844 \| 2.673 \| 0 \| 0.009341448919133863 
triton_per_fused__native_batch_norm_legi \| 19 \| 0 \| 0.2215568162533158 \| 3.8837368421052636 \| 0 \| 0.0066136363060691275 \| 19 \| 0 \| 0.2215568162533158 \| 3.8837368421052636 \| 0 \| 0.0066136363060691275 std::enable_if<!(false), void>::type int \| 23 \| 504916805.19297093 \| 1.0118296096314707 \| 8.113913043478261 \| 0.0025893169497075447 \| 0.030203868944223014 \| 23 \| 504916805.19297093 \| 1.0118296096314707 \| 8.113913043478261 \| 0.0025893169497075447 \| 0.030203868944223014 triton_poi_fused_add_copy__38 \| 56 \| 0 \| 0 \| 2.132482142857143 \| 0 \| 0 \| 56 \| 0 \| 0 \| 2.132482142857143 \| 0 \| 0 triton_poi_fused_convolution_0 \| 18 \| 0 \| 0.43458610794936897 \| 2.773333333333334 \| 0 \| 0.012972719640279667 \| 18 \| 0 \| 0.43458610794936897 \| 2.773333333333334 \| 0 \| 0.012972719640279667 triton_poi_fused_convolution_1 \| 17 \| 0 \| 0.028816312469162712 \| 2.6145882352941174 \| 0 \| 0.0008601884319153051 \| 17 \| 0 \| 0.028816312469162712 \| 2.6145882352941174 \| 0 \| 0.0008601884319153051 void convolve_common_engine_float_NHWC<f \| 44 \| 8641868995.31118 \| 0.024730540008465626 \| 25.87327272727273 \| 0.04431727689903169 \| 0.0007382250748795709 \| 44 \| 8641868995.31118 \| 0.024730540008465626 \| 25.87327272727273 \| 0.04431727689903169 \| 0.0007382250748795709 triton_per_fused__native_batch_norm_legi \| 12 \| 0 \| 0.6809930918986744 \| 4.82675 \| 0 \| 0.020328151996975356 \| 12 \| 0 \| 0.6809930918986744 \| 4.82675 \| 0 \| 0.020328151996975356 triton_per_fused__native_batch_norm_legi \| 14 \| 0 \| 0.02883030597936608 \| 2.6651428571428575 \| 0 \| 0.0008606061486377935 \| 14 \| 0 \| 0.02883030597936608 \| 2.6651428571428575 \| 0 \| 0.0008606061486377935 triton_per_fused__native_batch_norm_legi \| 16 \| 0 \| 0.0014658988233201874 \| 2.098 \| 0 \| 4.375817383045335e-05 \| 16 \| 0 \| 0.0014658988233201874 \| 2.098 \| 0 \| 4.375817383045335e-05 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.9926297180284697 \| 3.2367692307692306 \| 0 \| 
0.02963073785159611 \| 13 \| 0 \| 0.9926297180284697 \| 3.2367692307692306 \| 0 \| 0.02963073785159611 triton_poi_fused__native_batch_norm_legi \| 9 \| 0 \| 1.3008817095666507 \| 3.0863333333333336 \| 0 \| 0.03883228983781048 \| 9 \| 0 \| 1.3008817095666507 \| 3.0863333333333336 \| 0 \| 0.03883228983781048 void at::native::(anonymous namespace):: \| 98 \| 0 \| 0.09174335613709389 \| 4.408520408163265 \| 0 \| 0.0027386076458833994 \| 98 \| 0 \| 0.09174335613709389 \| 4.408520408163265 \| 0 \| 0.0027386076458833994 void at::native::vectorized_elementwise_ \| 7 \| 0 \| 0 \| 1.7278571428571428 \| 0 \| 0 \| 7 \| 0 \| 0 \| 1.7278571428571428 \| 0 \| 0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697 Approved by: https://github.com/eellison, https://github.com/shunting314	2025-07-07 22:13:34 +00:00
Paul Zhang	0edc1b91f7	[Inductor] Disable decompose_k for AMD (#157283 ) Differential Revision: D77544250 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157283 Approved by: https://github.com/bdhirsh	2025-07-02 15:21:46 +00:00
Chong Gu	617e3f69f8	[FP8] Fix Benchmarking for certain Priors (#155722 ) Summary: For priors like layer norm, the weight quantization kernel may be ordered differently and therefore carry a different suffix, so we match it with a regular expression instead. Test Plan: Trying this on model id 737772166 with ``` buck2 run mode/opt mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --lower-backend=AOT_INDUCTOR --model-snapshot-id=737772166_0 --trace-aot-inductor-module=True --disable-acc-tracer=False --batch-size=1024 --node_replacement_dict "{'(autotune)':{'(1000+,1000+)':'fp8_float_model_dynamic_quantization_rowwise'}" ``` allows more linears to be correctly replaced with fp8. An example of the gpu trace can be found in https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/hpc/new/models/feed/benchmark/libkineto_activities_773108_f58b57e208c04787acd3bcb01a3e8771.json.gz&bucket=gpu_traces. Rollback Plan: Differential Revision: D76092551 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155722 Approved by: https://github.com/Skylion007	2025-07-02 00:01:23 +00:00
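The suffix-tolerant matching mentioned in the commit above could be sketched as below. The kernel names and the `matches_kernel` helper are invented for illustration; the commit does not show the actual names or pattern.

```python
import re

def matches_kernel(base_name: str, kernel_name: str) -> bool:
    """Return True if kernel_name is base_name with an optional _<digits> suffix."""
    # re.escape keeps any special characters in the base name literal.
    pattern = re.escape(base_name) + r"(_\d+)?"
    return re.fullmatch(pattern, kernel_name) is not None

# Hypothetical kernel names; only the numeric suffix varies between runs.
assert matches_kernel("triton_quantize_weight", "triton_quantize_weight_3")
assert matches_kernel("triton_quantize_weight", "triton_quantize_weight_12")
assert not matches_kernel("triton_quantize_weight", "triton_quantize_bias_3")
```

Compared with an exact string comparison, this tolerates a change in kernel ordering that only alters the trailing index.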
PyTorch MergeBot	6ef70edd9a	Revert "Inductor logging + analysis of torch.profile (#149697 )" This reverts commit `47f10d0ad0`. Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/malfet due to Looks like it's breaking ROCM tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm%20%2F%20linux-jammy ([comment](https://github.com/pytorch/pytorch/pull/149697#issuecomment-3025673908))	2025-07-01 22:11:53 +00:00
Gabriel Ferns	47f10d0ad0	Inductor logging + analysis of torch.profile (#149697 ) Prereqs: - https://github.com/pytorch/pytorch/pull/152708 Features: 1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses. 1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`. 1. New helpers `countable_fx` and `count_flops_fx` helps get the flops of an `fx.Node`. 1. Extends Triton `torch.profiler` logging to `DebugAutotuner`. 1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side: ```python Device(NVIDIA H100, 0): Kernel Name \| resnet Kernel Count \| resnet FLOPS \| resnet bw gbps \| resnet Dur (ms) \| resnet Achieved FLOPS % \| resnet Achieved Bandwidth % \| newresnet Kernel Count \| newresnet FLOPS \| newresnet bw gbps \| newresnet Dur (ms) \| newresnet Achieved FLOPS % \| newresnet Achieved Bandwidth % --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- triton_poi_fused__native_batch_norm_legi \| 24 \| 0 \| 0.11395268248131513 \| 2.5919166666666666 \| 0 \| 0.003401572611382541 \| 24 \| 0 \| 0.11395268248131513 \| 2.5919166666666666 \| 0 \| 0.003401572611382541 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 142 \| 16932673552.422373 \| 0.2585007824198784 \| 12.441619718309857 \| 0.08683422334575583 \| 0.007716441266265022 \| 142 \| 16932673552.422373 \| 0.2585007824198784 \| 12.441619718309857 \| 
0.08683422334575583 \| 0.007716441266265022 triton_red_fused__native_batch_norm_legi \| 39 \| 0 \| 0.13990024992108846 \| 5.752589743589743 \| 0 \| 0.004176126863316074 \| 39 \| 0 \| 0.13990024992108846 \| 5.752589743589743 \| 0 \| 0.004176126863316074 triton_poi_fused__native_batch_norm_legi \| 25 \| 0 \| 0.31824055917536503 \| 2.5291999999999994 \| 0 \| 0.009499718184339253 \| 25 \| 0 \| 0.31824055917536503 \| 2.5291999999999994 \| 0 \| 0.009499718184339253 void cutlass::Kernel2<cutlass_80_tensoro \| 98 \| 16211056473.596165 \| 0.42972434051025826 \| 7.130408163265306 \| 0.08313362294151874 \| 0.012827592254037562 \| 98 \| 16211056473.596165 \| 0.42972434051025826 \| 7.130408163265306 \| 0.08313362294151874 \| 0.012827592254037562 triton_red_fused__native_batch_norm_legi \| 73 \| 0 \| 0.3225381327611705 \| 9.987068493150682 \| 0 \| 0.009628003963020014 \| 73 \| 0 \| 0.3225381327611705 \| 9.987068493150682 \| 0 \| 0.009628003963020014 triton_poi_fused__native_batch_norm_legi \| 15 \| 0 \| 1.4491211346487216 \| 4.439333333333333 \| 0 \| 0.043257347302946926 \| 15 \| 0 \| 1.4491211346487216 \| 4.439333333333333 \| 0 \| 0.043257347302946926 void cutlass::Kernel2<cutlass_80_tensoro \| 186 \| 14501701145.337954 \| 0.2667131401910989 \| 7.873865591397849 \| 0.07436769818122027 \| 0.007961586274361157 \| 186 \| 14501701145.337954 \| 0.2667131401910989 \| 7.873865591397849 \| 0.07436769818122027 \| 0.007961586274361157 triton_poi_fused__native_batch_norm_legi \| 33 \| 0 \| 1.4924556538193923 \| 4.3101515151515155 \| 0 \| 0.044550915039384846 \| 33 \| 0 \| 1.4924556538193923 \| 4.3101515151515155 \| 0 \| 0.044550915039384846 triton_red_fused__native_batch_norm_legi \| 29 \| 0 \| 0.25562590522631107 \| 6.296275862068965 \| 0 \| 0.007630624036606301 \| 29 \| 0 \| 0.25562590522631107 \| 6.296275862068965 \| 0 \| 0.007630624036606301 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.5870562174192726 \| 2.7397692307692307 \| 0 \| 0.01752406619162008 \| 13 \| 0 \| 
0.5870562174192726 \| 2.7397692307692307 \| 0 \| 0.01752406619162008 triton_poi_fused__native_batch_norm_legi \| 34 \| 0 \| 0.41409928846284 \| 2.853588235294117 \| 0 \| 0.012361172789935523 \| 34 \| 0 \| 0.41409928846284 \| 2.853588235294117 \| 0 \| 0.012361172789935523 triton_per_fused__native_batch_norm_legi \| 34 \| 0 \| 0.11705315007018151 \| 3.460647058823529 \| 0 \| 0.0034941238826919864 \| 34 \| 0 \| 0.11705315007018151 \| 3.460647058823529 \| 0 \| 0.0034941238826919864 triton_poi_fused__native_batch_norm_legi \| 16 \| 0 \| 0.17207853197124584 \| 2.3459375000000002 \| 0 \| 0.005136672596156592 \| 16 \| 0 \| 0.17207853197124584 \| 2.3459375000000002 \| 0 \| 0.005136672596156592 triton_per_fused__native_batch_norm_legi \| 30 \| 0 \| 0.2639714322022256 \| 6.131199999999999 \| 0 \| 0.007879744244842555 \| 30 \| 0 \| 0.2639714322022256 \| 6.131199999999999 \| 0 \| 0.007879744244842555 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 100 \| 11875430356.891787 \| 0.19494470869421385 \| 16.36534 \| 0.06089964285585531 \| 0.005819245035648175 \| 100 \| 11875430356.891787 \| 0.19494470869421385 \| 16.36534 \| 0.06089964285585531 \| 0.005819245035648175 triton_poi_fused__native_batch_norm_legi \| 8 \| 0 \| 0.9854096626224687 \| 3.2757500000000004 \| 0 \| 0.029415213809625928 \| 8 \| 0 \| 0.9854096626224687 \| 3.2757500000000004 \| 0 \| 0.029415213809625928 void cublasLt::splitKreduce_kernel<32, 1 \| 56 \| 34377923395.147064 \| 0.8310300045762317 \| 3.4199999999999986 \| 0.17629704305203628 \| 0.024806865808245714 \| 56 \| 34377923395.147064 \| 0.8310300045762317 \| 3.4199999999999986 \| 0.17629704305203628 \| 0.024806865808245714 triton_poi_fused__native_batch_norm_legi \| 23 \| 0 \| 0.9944002965861103 \| 3.2431304347826084 \| 0 \| 0.02968359094286896 \| 23 \| 0 \| 0.9944002965861103 \| 3.2431304347826084 \| 0 \| 0.02968359094286896 triton_per_fused__native_batch_norm_legi \| 10 \| 0 \| 0.1826801058931057 \| 4.428800000000001 \| 0 \| 0.00545313748934644 \| 10 \| 0 \| 
0.1826801058931057 \| 4.428800000000001 \| 0 \| 0.00545313748934644 triton_poi_fused__native_batch_norm_legi \| 10 \| 0 \| 0.3168973585366449 \| 2.5471999999999997 \| 0 \| 0.009459622642884923 \| 10 \| 0 \| 0.3168973585366449 \| 2.5471999999999997 \| 0 \| 0.009459622642884923 triton_poi_fused__native_batch_norm_legi \| 34 \| 0 \| 1.1463614897015777 \| 4.124323529411764 \| 0 \| 0.03421974596124114 \| 34 \| 0 \| 1.1463614897015777 \| 4.124323529411764 \| 0 \| 0.03421974596124114 void cask_plugin_cudnn::xmma_cudnn::init \| 44 \| 44045510816.64277 \| 2.0661232850348643 \| 3.6887499999999993 \| 0.22587441444432194 \| 0.06167532194133924 \| 44 \| 44045510816.64277 \| 2.0661232850348643 \| 3.6887499999999993 \| 0.22587441444432194 \| 0.06167532194133924 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 95 \| 7876855400.165316 \| 0.4694941555946739 \| 18.224315789473682 \| 0.04039413025725802 \| 0.014014750913273854 \| 95 \| 7876855400.165316 \| 0.4694941555946739 \| 18.224315789473682 \| 0.04039413025725802 \| 0.014014750913273854 triton_per_fused__native_batch_norm_legi \| 41 \| 0 \| 0.06825669875995298 \| 3.0384146341463416 \| 0 \| 0.002037513395819492 \| 41 \| 0 \| 0.06825669875995298 \| 3.0384146341463416 \| 0 \| 0.002037513395819492 triton_poi_fused__native_batch_norm_legi \| 23 \| 0 \| 0.08808154712430301 \| 2.3275652173913044 \| 0 \| 0.0026292999141582997 \| 23 \| 0 \| 0.08808154712430301 \| 2.3275652173913044 \| 0 \| 0.0026292999141582997 triton_per_fused__native_batch_norm_legi \| 40 \| 0 \| 0.18179321034952417 \| 4.556825 \| 0 \| 0.005426662995508183 \| 40 \| 0 \| 0.18179321034952417 \| 4.556825 \| 0 \| 0.005426662995508183 triton_poi_fused__native_batch_norm_legi \| 15 \| 0 \| 0.5887415155454232 \| 2.783866666666667 \| 0 \| 0.017574373598370836 \| 15 \| 0 \| 0.5887415155454232 \| 2.783866666666667 \| 0 \| 0.017574373598370836 void cutlass::Kernel2<cutlass_80_tensoro \| 38 \| 14242013806.264643 \| 0.256592404353939 \| 7.217631578947369 \| 0.0730359682372546 \| 
0.007659474756834 \| 38 \| 14242013806.264643 \| 0.256592404353939 \| 7.217631578947369 \| 0.0730359682372546 \| 0.007659474756834 triton_poi_fused__native_batch_norm_legi \| 21 \| 0 \| 0.5842860973430516 \| 2.7779047619047623 \| 0 \| 0.017441376040091088 \| 21 \| 0 \| 0.5842860973430516 \| 2.7779047619047623 \| 0 \| 0.017441376040091088 triton_per_fused__native_batch_norm_legi \| 16 \| 0 \| 0.11509365173486417 \| 3.5959375000000002 \| 0 \| 0.0034356313950705724 \| 16 \| 0 \| 0.11509365173486417 \| 3.5959375000000002 \| 0 \| 0.0034356313950705724 triton_poi_fused__native_batch_norm_legi \| 14 \| 0 \| 0.1704672000243914 \| 2.4044285714285714 \| 0 \| 0.00508857313505646 \| 14 \| 0 \| 0.1704672000243914 \| 2.4044285714285714 \| 0 \| 0.00508857313505646 triton_poi_fused__native_batch_norm_legi \| 58 \| 0 \| 2.307520779930795 \| 8.190706896551722 \| 0 \| 0.06888121731136704 \| 58 \| 0 \| 2.307520779930795 \| 8.190706896551722 \| 0 \| 0.06888121731136704 triton_per_fused__native_batch_norm_legi \| 29 \| 0 \| 0.037243248971881276 \| 3.0277586206896556 \| 0 \| 0.001111738775280038 \| 29 \| 0 \| 0.037243248971881276 \| 3.0277586206896556 \| 0 \| 0.001111738775280038 triton_poi_fused__native_batch_norm_legi \| 20 \| 0 \| 0.04741699795428918 \| 2.2911500000000005 \| 0 \| 0.0014154327747549007 \| 20 \| 0 \| 0.04741699795428918 \| 2.2911500000000005 \| 0 \| 0.0014154327747549007 triton_per_fused__native_batch_norm_legi \| 25 \| 0 \| 0.13357016893727824 \| 3.37536 \| 0 \| 0.003987169222008305 \| 25 \| 0 \| 0.13357016893727824 \| 3.37536 \| 0 \| 0.003987169222008305 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.3089862268300253 \| 2.8111538461538457 \| 0 \| 0.009223469457612694 \| 13 \| 0 \| 0.3089862268300253 \| 2.8111538461538457 \| 0 \| 0.009223469457612694 triton_poi_fused__native_batch_norm_legi \| 17 \| 0 \| 0.3129385387909844 \| 2.673 \| 0 \| 0.009341448919133863 \| 17 \| 0 \| 0.3129385387909844 \| 2.673 \| 0 \| 0.009341448919133863 
triton_per_fused__native_batch_norm_legi \| 19 \| 0 \| 0.2215568162533158 \| 3.8837368421052636 \| 0 \| 0.0066136363060691275 \| 19 \| 0 \| 0.2215568162533158 \| 3.8837368421052636 \| 0 \| 0.0066136363060691275 std::enable_if<!(false), void>::type int \| 23 \| 504916805.19297093 \| 1.0118296096314707 \| 8.113913043478261 \| 0.0025893169497075447 \| 0.030203868944223014 \| 23 \| 504916805.19297093 \| 1.0118296096314707 \| 8.113913043478261 \| 0.0025893169497075447 \| 0.030203868944223014 triton_poi_fused_add_copy__38 \| 56 \| 0 \| 0 \| 2.132482142857143 \| 0 \| 0 \| 56 \| 0 \| 0 \| 2.132482142857143 \| 0 \| 0 triton_poi_fused_convolution_0 \| 18 \| 0 \| 0.43458610794936897 \| 2.773333333333334 \| 0 \| 0.012972719640279667 \| 18 \| 0 \| 0.43458610794936897 \| 2.773333333333334 \| 0 \| 0.012972719640279667 triton_poi_fused_convolution_1 \| 17 \| 0 \| 0.028816312469162712 \| 2.6145882352941174 \| 0 \| 0.0008601884319153051 \| 17 \| 0 \| 0.028816312469162712 \| 2.6145882352941174 \| 0 \| 0.0008601884319153051 void convolve_common_engine_float_NHWC<f \| 44 \| 8641868995.31118 \| 0.024730540008465626 \| 25.87327272727273 \| 0.04431727689903169 \| 0.0007382250748795709 \| 44 \| 8641868995.31118 \| 0.024730540008465626 \| 25.87327272727273 \| 0.04431727689903169 \| 0.0007382250748795709 triton_per_fused__native_batch_norm_legi \| 12 \| 0 \| 0.6809930918986744 \| 4.82675 \| 0 \| 0.020328151996975356 \| 12 \| 0 \| 0.6809930918986744 \| 4.82675 \| 0 \| 0.020328151996975356 triton_per_fused__native_batch_norm_legi \| 14 \| 0 \| 0.02883030597936608 \| 2.6651428571428575 \| 0 \| 0.0008606061486377935 \| 14 \| 0 \| 0.02883030597936608 \| 2.6651428571428575 \| 0 \| 0.0008606061486377935 triton_per_fused__native_batch_norm_legi \| 16 \| 0 \| 0.0014658988233201874 \| 2.098 \| 0 \| 4.375817383045335e-05 \| 16 \| 0 \| 0.0014658988233201874 \| 2.098 \| 0 \| 4.375817383045335e-05 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.9926297180284697 \| 3.2367692307692306 \| 0 \| 
0.02963073785159611 \| 13 \| 0 \| 0.9926297180284697 \| 3.2367692307692306 \| 0 \| 0.02963073785159611 triton_poi_fused__native_batch_norm_legi \| 9 \| 0 \| 1.3008817095666507 \| 3.0863333333333336 \| 0 \| 0.03883228983781048 \| 9 \| 0 \| 1.3008817095666507 \| 3.0863333333333336 \| 0 \| 0.03883228983781048 void at::native::(anonymous namespace):: \| 98 \| 0 \| 0.09174335613709389 \| 4.408520408163265 \| 0 \| 0.0027386076458833994 \| 98 \| 0 \| 0.09174335613709389 \| 4.408520408163265 \| 0 \| 0.0027386076458833994 void at::native::vectorized_elementwise_ \| 7 \| 0 \| 0 \| 1.7278571428571428 \| 0 \| 0 \| 7 \| 0 \| 0 \| 1.7278571428571428 \| 0 \| 0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697 Approved by: https://github.com/eellison, https://github.com/shunting314	2025-07-01 16:51:03 +00:00
PyTorch MergeBot	c038719731	Revert "Inductor logging + analysis of torch.profile (#149697 )" This reverts commit `347ace4c7a`. Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail on ROCm ([comment](https://github.com/pytorch/pytorch/pull/149697#issuecomment-3020006655))	2025-06-30 16:58:54 +00:00
Tom Ritchford	e3afbb0362	[inductor] Add typing to _inductor/ir.py (#149958 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149958 Approved by: https://github.com/Skylion007	2025-06-30 15:56:35 +00:00
Gabriel Ferns	347ace4c7a	Inductor logging + analysis of torch.profile (#149697 ) Prereqs: - https://github.com/pytorch/pytorch/pull/152708 Features: 1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses. 1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`. 1. New helpers `countable_fx` and `count_flops_fx` helps get the flops of an `fx.Node`. 1. Extends Triton `torch.profiler` logging to `DebugAutotuner`. 1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side: ```python Device(NVIDIA H100, 0): Kernel Name \| resnet Kernel Count \| resnet FLOPS \| resnet bw gbps \| resnet Dur (ms) \| resnet Achieved FLOPS % \| resnet Achieved Bandwidth % \| newresnet Kernel Count \| newresnet FLOPS \| newresnet bw gbps \| newresnet Dur (ms) \| newresnet Achieved FLOPS % \| newresnet Achieved Bandwidth % --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- triton_poi_fused__native_batch_norm_legi \| 24 \| 0 \| 0.11395268248131513 \| 2.5919166666666666 \| 0 \| 0.003401572611382541 \| 24 \| 0 \| 0.11395268248131513 \| 2.5919166666666666 \| 0 \| 0.003401572611382541 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 142 \| 16932673552.422373 \| 0.2585007824198784 \| 12.441619718309857 \| 0.08683422334575583 \| 0.007716441266265022 \| 142 \| 16932673552.422373 \| 0.2585007824198784 \| 12.441619718309857 \| 
0.08683422334575583 \| 0.007716441266265022 triton_red_fused__native_batch_norm_legi \| 39 \| 0 \| 0.13990024992108846 \| 5.752589743589743 \| 0 \| 0.004176126863316074 \| 39 \| 0 \| 0.13990024992108846 \| 5.752589743589743 \| 0 \| 0.004176126863316074 triton_poi_fused__native_batch_norm_legi \| 25 \| 0 \| 0.31824055917536503 \| 2.5291999999999994 \| 0 \| 0.009499718184339253 \| 25 \| 0 \| 0.31824055917536503 \| 2.5291999999999994 \| 0 \| 0.009499718184339253 void cutlass::Kernel2<cutlass_80_tensoro \| 98 \| 16211056473.596165 \| 0.42972434051025826 \| 7.130408163265306 \| 0.08313362294151874 \| 0.012827592254037562 \| 98 \| 16211056473.596165 \| 0.42972434051025826 \| 7.130408163265306 \| 0.08313362294151874 \| 0.012827592254037562 triton_red_fused__native_batch_norm_legi \| 73 \| 0 \| 0.3225381327611705 \| 9.987068493150682 \| 0 \| 0.009628003963020014 \| 73 \| 0 \| 0.3225381327611705 \| 9.987068493150682 \| 0 \| 0.009628003963020014 triton_poi_fused__native_batch_norm_legi \| 15 \| 0 \| 1.4491211346487216 \| 4.439333333333333 \| 0 \| 0.043257347302946926 \| 15 \| 0 \| 1.4491211346487216 \| 4.439333333333333 \| 0 \| 0.043257347302946926 void cutlass::Kernel2<cutlass_80_tensoro \| 186 \| 14501701145.337954 \| 0.2667131401910989 \| 7.873865591397849 \| 0.07436769818122027 \| 0.007961586274361157 \| 186 \| 14501701145.337954 \| 0.2667131401910989 \| 7.873865591397849 \| 0.07436769818122027 \| 0.007961586274361157 triton_poi_fused__native_batch_norm_legi \| 33 \| 0 \| 1.4924556538193923 \| 4.3101515151515155 \| 0 \| 0.044550915039384846 \| 33 \| 0 \| 1.4924556538193923 \| 4.3101515151515155 \| 0 \| 0.044550915039384846 triton_red_fused__native_batch_norm_legi \| 29 \| 0 \| 0.25562590522631107 \| 6.296275862068965 \| 0 \| 0.007630624036606301 \| 29 \| 0 \| 0.25562590522631107 \| 6.296275862068965 \| 0 \| 0.007630624036606301 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.5870562174192726 \| 2.7397692307692307 \| 0 \| 0.01752406619162008 \| 13 \| 0 \| 
0.5870562174192726 \| 2.7397692307692307 \| 0 \| 0.01752406619162008 triton_poi_fused__native_batch_norm_legi \| 34 \| 0 \| 0.41409928846284 \| 2.853588235294117 \| 0 \| 0.012361172789935523 \| 34 \| 0 \| 0.41409928846284 \| 2.853588235294117 \| 0 \| 0.012361172789935523 triton_per_fused__native_batch_norm_legi \| 34 \| 0 \| 0.11705315007018151 \| 3.460647058823529 \| 0 \| 0.0034941238826919864 \| 34 \| 0 \| 0.11705315007018151 \| 3.460647058823529 \| 0 \| 0.0034941238826919864 triton_poi_fused__native_batch_norm_legi \| 16 \| 0 \| 0.17207853197124584 \| 2.3459375000000002 \| 0 \| 0.005136672596156592 \| 16 \| 0 \| 0.17207853197124584 \| 2.3459375000000002 \| 0 \| 0.005136672596156592 triton_per_fused__native_batch_norm_legi \| 30 \| 0 \| 0.2639714322022256 \| 6.131199999999999 \| 0 \| 0.007879744244842555 \| 30 \| 0 \| 0.2639714322022256 \| 6.131199999999999 \| 0 \| 0.007879744244842555 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 100 \| 11875430356.891787 \| 0.19494470869421385 \| 16.36534 \| 0.06089964285585531 \| 0.005819245035648175 \| 100 \| 11875430356.891787 \| 0.19494470869421385 \| 16.36534 \| 0.06089964285585531 \| 0.005819245035648175 triton_poi_fused__native_batch_norm_legi \| 8 \| 0 \| 0.9854096626224687 \| 3.2757500000000004 \| 0 \| 0.029415213809625928 \| 8 \| 0 \| 0.9854096626224687 \| 3.2757500000000004 \| 0 \| 0.029415213809625928 void cublasLt::splitKreduce_kernel<32, 1 \| 56 \| 34377923395.147064 \| 0.8310300045762317 \| 3.4199999999999986 \| 0.17629704305203628 \| 0.024806865808245714 \| 56 \| 34377923395.147064 \| 0.8310300045762317 \| 3.4199999999999986 \| 0.17629704305203628 \| 0.024806865808245714 triton_poi_fused__native_batch_norm_legi \| 23 \| 0 \| 0.9944002965861103 \| 3.2431304347826084 \| 0 \| 0.02968359094286896 \| 23 \| 0 \| 0.9944002965861103 \| 3.2431304347826084 \| 0 \| 0.02968359094286896 triton_per_fused__native_batch_norm_legi \| 10 \| 0 \| 0.1826801058931057 \| 4.428800000000001 \| 0 \| 0.00545313748934644 \| 10 \| 0 \| 
0.1826801058931057 \| 4.428800000000001 \| 0 \| 0.00545313748934644 triton_poi_fused__native_batch_norm_legi \| 10 \| 0 \| 0.3168973585366449 \| 2.5471999999999997 \| 0 \| 0.009459622642884923 \| 10 \| 0 \| 0.3168973585366449 \| 2.5471999999999997 \| 0 \| 0.009459622642884923 triton_poi_fused__native_batch_norm_legi \| 34 \| 0 \| 1.1463614897015777 \| 4.124323529411764 \| 0 \| 0.03421974596124114 \| 34 \| 0 \| 1.1463614897015777 \| 4.124323529411764 \| 0 \| 0.03421974596124114 void cask_plugin_cudnn::xmma_cudnn::init \| 44 \| 44045510816.64277 \| 2.0661232850348643 \| 3.6887499999999993 \| 0.22587441444432194 \| 0.06167532194133924 \| 44 \| 44045510816.64277 \| 2.0661232850348643 \| 3.6887499999999993 \| 0.22587441444432194 \| 0.06167532194133924 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 95 \| 7876855400.165316 \| 0.4694941555946739 \| 18.224315789473682 \| 0.04039413025725802 \| 0.014014750913273854 \| 95 \| 7876855400.165316 \| 0.4694941555946739 \| 18.224315789473682 \| 0.04039413025725802 \| 0.014014750913273854 triton_per_fused__native_batch_norm_legi \| 41 \| 0 \| 0.06825669875995298 \| 3.0384146341463416 \| 0 \| 0.002037513395819492 \| 41 \| 0 \| 0.06825669875995298 \| 3.0384146341463416 \| 0 \| 0.002037513395819492 triton_poi_fused__native_batch_norm_legi \| 23 \| 0 \| 0.08808154712430301 \| 2.3275652173913044 \| 0 \| 0.0026292999141582997 \| 23 \| 0 \| 0.08808154712430301 \| 2.3275652173913044 \| 0 \| 0.0026292999141582997 triton_per_fused__native_batch_norm_legi \| 40 \| 0 \| 0.18179321034952417 \| 4.556825 \| 0 \| 0.005426662995508183 \| 40 \| 0 \| 0.18179321034952417 \| 4.556825 \| 0 \| 0.005426662995508183 triton_poi_fused__native_batch_norm_legi \| 15 \| 0 \| 0.5887415155454232 \| 2.783866666666667 \| 0 \| 0.017574373598370836 \| 15 \| 0 \| 0.5887415155454232 \| 2.783866666666667 \| 0 \| 0.017574373598370836 void cutlass::Kernel2<cutlass_80_tensoro \| 38 \| 14242013806.264643 \| 0.256592404353939 \| 7.217631578947369 \| 0.0730359682372546 \| 
0.007659474756834 \| 38 \| 14242013806.264643 \| 0.256592404353939 \| 7.217631578947369 \| 0.0730359682372546 \| 0.007659474756834 triton_poi_fused__native_batch_norm_legi \| 21 \| 0 \| 0.5842860973430516 \| 2.7779047619047623 \| 0 \| 0.017441376040091088 \| 21 \| 0 \| 0.5842860973430516 \| 2.7779047619047623 \| 0 \| 0.017441376040091088 triton_per_fused__native_batch_norm_legi \| 16 \| 0 \| 0.11509365173486417 \| 3.5959375000000002 \| 0 \| 0.0034356313950705724 \| 16 \| 0 \| 0.11509365173486417 \| 3.5959375000000002 \| 0 \| 0.0034356313950705724 triton_poi_fused__native_batch_norm_legi \| 14 \| 0 \| 0.1704672000243914 \| 2.4044285714285714 \| 0 \| 0.00508857313505646 \| 14 \| 0 \| 0.1704672000243914 \| 2.4044285714285714 \| 0 \| 0.00508857313505646 triton_poi_fused__native_batch_norm_legi \| 58 \| 0 \| 2.307520779930795 \| 8.190706896551722 \| 0 \| 0.06888121731136704 \| 58 \| 0 \| 2.307520779930795 \| 8.190706896551722 \| 0 \| 0.06888121731136704 triton_per_fused__native_batch_norm_legi \| 29 \| 0 \| 0.037243248971881276 \| 3.0277586206896556 \| 0 \| 0.001111738775280038 \| 29 \| 0 \| 0.037243248971881276 \| 3.0277586206896556 \| 0 \| 0.001111738775280038 triton_poi_fused__native_batch_norm_legi \| 20 \| 0 \| 0.04741699795428918 \| 2.2911500000000005 \| 0 \| 0.0014154327747549007 \| 20 \| 0 \| 0.04741699795428918 \| 2.2911500000000005 \| 0 \| 0.0014154327747549007 triton_per_fused__native_batch_norm_legi \| 25 \| 0 \| 0.13357016893727824 \| 3.37536 \| 0 \| 0.003987169222008305 \| 25 \| 0 \| 0.13357016893727824 \| 3.37536 \| 0 \| 0.003987169222008305 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.3089862268300253 \| 2.8111538461538457 \| 0 \| 0.009223469457612694 \| 13 \| 0 \| 0.3089862268300253 \| 2.8111538461538457 \| 0 \| 0.009223469457612694 triton_poi_fused__native_batch_norm_legi \| 17 \| 0 \| 0.3129385387909844 \| 2.673 \| 0 \| 0.009341448919133863 \| 17 \| 0 \| 0.3129385387909844 \| 2.673 \| 0 \| 0.009341448919133863 
triton_per_fused__native_batch_norm_legi \| 19 \| 0 \| 0.2215568162533158 \| 3.8837368421052636 \| 0 \| 0.0066136363060691275 \| 19 \| 0 \| 0.2215568162533158 \| 3.8837368421052636 \| 0 \| 0.0066136363060691275 std::enable_if<!(false), void>::type int \| 23 \| 504916805.19297093 \| 1.0118296096314707 \| 8.113913043478261 \| 0.0025893169497075447 \| 0.030203868944223014 \| 23 \| 504916805.19297093 \| 1.0118296096314707 \| 8.113913043478261 \| 0.0025893169497075447 \| 0.030203868944223014 triton_poi_fused_add_copy__38 \| 56 \| 0 \| 0 \| 2.132482142857143 \| 0 \| 0 \| 56 \| 0 \| 0 \| 2.132482142857143 \| 0 \| 0 triton_poi_fused_convolution_0 \| 18 \| 0 \| 0.43458610794936897 \| 2.773333333333334 \| 0 \| 0.012972719640279667 \| 18 \| 0 \| 0.43458610794936897 \| 2.773333333333334 \| 0 \| 0.012972719640279667 triton_poi_fused_convolution_1 \| 17 \| 0 \| 0.028816312469162712 \| 2.6145882352941174 \| 0 \| 0.0008601884319153051 \| 17 \| 0 \| 0.028816312469162712 \| 2.6145882352941174 \| 0 \| 0.0008601884319153051 void convolve_common_engine_float_NHWC<f \| 44 \| 8641868995.31118 \| 0.024730540008465626 \| 25.87327272727273 \| 0.04431727689903169 \| 0.0007382250748795709 \| 44 \| 8641868995.31118 \| 0.024730540008465626 \| 25.87327272727273 \| 0.04431727689903169 \| 0.0007382250748795709 triton_per_fused__native_batch_norm_legi \| 12 \| 0 \| 0.6809930918986744 \| 4.82675 \| 0 \| 0.020328151996975356 \| 12 \| 0 \| 0.6809930918986744 \| 4.82675 \| 0 \| 0.020328151996975356 triton_per_fused__native_batch_norm_legi \| 14 \| 0 \| 0.02883030597936608 \| 2.6651428571428575 \| 0 \| 0.0008606061486377935 \| 14 \| 0 \| 0.02883030597936608 \| 2.6651428571428575 \| 0 \| 0.0008606061486377935 triton_per_fused__native_batch_norm_legi \| 16 \| 0 \| 0.0014658988233201874 \| 2.098 \| 0 \| 4.375817383045335e-05 \| 16 \| 0 \| 0.0014658988233201874 \| 2.098 \| 0 \| 4.375817383045335e-05 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.9926297180284697 \| 3.2367692307692306 \| 0 \| 
0.02963073785159611 \| 13 \| 0 \| 0.9926297180284697 \| 3.2367692307692306 \| 0 \| 0.02963073785159611 triton_poi_fused__native_batch_norm_legi \| 9 \| 0 \| 1.3008817095666507 \| 3.0863333333333336 \| 0 \| 0.03883228983781048 \| 9 \| 0 \| 1.3008817095666507 \| 3.0863333333333336 \| 0 \| 0.03883228983781048 void at::native::(anonymous namespace):: \| 98 \| 0 \| 0.09174335613709389 \| 4.408520408163265 \| 0 \| 0.0027386076458833994 \| 98 \| 0 \| 0.09174335613709389 \| 4.408520408163265 \| 0 \| 0.0027386076458833994 void at::native::vectorized_elementwise_ \| 7 \| 0 \| 0 \| 1.7278571428571428 \| 0 \| 0 \| 7 \| 0 \| 0 \| 1.7278571428571428 \| 0 \| 0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697 Approved by: https://github.com/eellison, https://github.com/shunting314	2025-06-29 05:00:47 +00:00
Valentine233	02c7ab2f9b	[cpp wrapper] add AOTI shim for collective ops (#154492 ) Implementations: 1. Move collective ops to c10d namespace, so that we can call them externally. 2. Add AOTI shims for collective ops. Testing 1. Add c10d functional UT for cpu. 2. Include the above one in cpp wrapper UT. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154492 Approved by: https://github.com/desertfire	2025-06-25 01:20:05 +00:00
Paul Zhang	86996c15dc	[Inductor] Allow exhaustive autotuning across all GEMM options (#156610 ) Differential Revision: D76843916 Exhaustive autotuning is meant to autotune GEMM configs across the entire search space of possible configs. Some of these configs can cause extremely long compilation times and OOMs, especially configs with excessive register spillage or configs that use much more shared memory than is available on the hardware. This diff prunes those configs to make exhaustive autotuning more viable, and adds exhaustive autotuning support for the persistent+tma template and decompose_k. Previously, exhaustive autotuning would hang; now we are able to tune shapes in ~5 minutes. Below is a sample log for autotuning with exhaustive: ``` AUTOTUNE mm(1152x21504, 21504x1024) strides: [21504, 1], [1, 21504] dtypes: torch.bfloat16, torch.bfloat16 mm 0.1167 ms 100.0% triton_mm_6270 0.1172 ms 99.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_6522 0.1183 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_persistent_tma_7482 0.1190 ms 98.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_persistent_tma_7483 0.1195 ms 97.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_6523 
0.1274 ms 91.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_6267 0.1285 ms 90.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_6519 0.1287 ms 90.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_persistent_tma_7480 0.1298 ms 89.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_persistent_tma_7312 0.1302 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 SingleProcess AUTOTUNE benchmarking takes 298.7185 seconds and 21.2569 seconds precompiling for 2210 choices INFO:tritonbench.utils.triton_op:Took 333894.46ms to get benchmark function for pt2_matmul_maxautotune ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/156610 Approved by: https://github.com/jansel	2025-06-24 01:42:05 +00:00
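The pruning idea in the exhaustive-autotuning commit above can be pictured with a small sketch. It is not the code from the PR: the `GemmConfig` class, the shared-memory estimate (one A tile plus one B tile per pipeline stage), and the byte limit passed in are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GemmConfig:
    # Hypothetical container; the field names mirror keys from the sample log.
    block_m: int
    block_n: int
    block_k: int
    num_stages: int


def estimated_shared_memory(cfg: GemmConfig, dtype_bytes: int) -> int:
    # Assumed estimate: every pipeline stage buffers one A tile and one B tile.
    tile_elems = cfg.block_m * cfg.block_k + cfg.block_k * cfg.block_n
    return tile_elems * dtype_bytes * cfg.num_stages


def prune_configs(configs, dtype_bytes, shared_memory_limit):
    # Keep only the configs whose estimated footprint fits within the limit.
    return [
        cfg
        for cfg in configs
        if estimated_shared_memory(cfg, dtype_bytes) <= shared_memory_limit
    ]


configs = [
    GemmConfig(block_m=128, block_n=128, block_k=64, num_stages=5),
    GemmConfig(block_m=256, block_n=256, block_k=128, num_stages=5),
]
# The limit is an illustrative number, not a value taken from the commit.
kept = prune_configs(configs, dtype_bytes=2, shared_memory_limit=200_000)
print(kept)
```

With a 2-byte dtype, the first config is estimated at 163840 bytes and survives, while the second is estimated at 655360 bytes and is dropped before it would ever be compiled.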
Simon Fan	6b45af38a5	[easy] better copy_misaligned_inputs assertion failure message (#154472 ) internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/688540560729579/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/154472 Approved by: https://github.com/williamwen42	2025-06-23 15:39:15 +00:00
Xuehai Pan	6ff6630375	[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313 Approved by: https://github.com/jingsh	2025-06-23 02:57:12 +00:00
PyTorch MergeBot	f1331f3f1b	Revert "[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313 )" This reverts commit `3627270bdf`. Reverted https://github.com/pytorch/pytorch/pull/156313 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](`c95f7fa874`) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))	2025-06-22 12:31:57 +00:00
Xuehai Pan	3627270bdf	[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313 Approved by: https://github.com/jingsh	2025-06-22 08:43:09 +00:00
drisspg	88b9c285e0	Workaround for e4m2 dtype (#156461 ) Found in: https://github.com/pytorch/ao/pull/2408 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156461 Approved by: https://github.com/vkuzo	2025-06-21 04:01:44 +00:00
Nicolas Macchioni	6098209bff	[BE][5/X] Phase out usage of use_max_autotune() (#156269 ) These look to be the last call sites using `use_max_autotune(...)`, so remove them along with `use_max_autotune(...)` itself. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156269 Approved by: https://github.com/masnesral	2025-06-20 22:37:45 +00:00
xinan.lin	83259cf7a7	[Inductor][Intel GPU] Support mkldnn Conv post op fusion for XPU. (#150287 ) This PR adds support for MKLDNN Conv post-op fusion in the Inductor Intel GPU backend under freezing mode. The implementation reuses the CPU's MKLDNN pattern fusion mechanism, as well as the corresponding Inductor unit tests for CPU MKLDNN pattern fusion. The performance improvement:

| Suite | Inductor Speedup (Baseline) | Inductor Speedup (Compared) | Acc Failed | Perf Failed | Inductor Perf Ratio | Speedup |
|-------------|-----------------------------|------------------------------|------------|--------------|----------------------|----------|
| Huggingface | 2.134838 | 2.125740314 | 0 | 0 | 1.001462504 | 100.43% |
| Torchbench | 1.808558 | 1.675100479 | 0 | 0 | 1.075722187 | 107.97% |
| Timm | 2.343893 | 2.070476653 | 0 | 0 | 1.131023832 | 113.21% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150287 Approved by: https://github.com/ZhiweiYan-96, https://github.com/EikanWang, https://github.com/jansel	2025-06-19 13:17:22 +00:00
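The commit above does not say how its Speedup column is derived. As an observation only, the figures are consistent with dividing the baseline Inductor speedup by the compared one; the sketch below reproduces that arithmetic from the numbers in the table. The `relative_speedup` helper is a name introduced here for illustration.

```python
# (baseline speedup, compared speedup) copied from the table in the commit message.
rows = {
    "Huggingface": (2.134838, 2.125740314),
    "Torchbench": (1.808558, 1.675100479),
    "Timm": (2.343893, 2.070476653),
}


def relative_speedup(baseline: float, compared: float) -> float:
    # Assumed formula: ratio of the two Inductor speedup figures.
    return baseline / compared


percentages = {
    suite: 100.0 * relative_speedup(baseline, compared)
    for suite, (baseline, compared) in rows.items()
}
for suite, pct in percentages.items():
    print(f"{suite}: {pct:.2f}%")
```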
Laith Sakka	3f69e3b3a0	Add view_simple as meta function for view, and avoid calling reshape_view_helper for unbacked (#154757 ) address https://github.com/pytorch/pytorch/issues/153303 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154757 Approved by: https://github.com/bobrenjc93, https://github.com/leslie-fang-intel	2025-06-19 04:50:18 +00:00
Oguz Ulgen	a2a75be0f8	Rename inductor cache (#156128 ) Requested by Simon on a different PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128 Approved by: https://github.com/xmfan	2025-06-17 03:57:18 +00:00
Marcin Pioch	ce79056471	Custom FX pass for inductor's backend registration (#154841 ) This PR is related to RFC #153532. It extends Inductor's backend registration interface so that a backend can register custom FX passes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154841 Approved by: https://github.com/jansel Co-authored-by: Jason Ansel <jansel@jansel.net>	2025-06-14 17:29:54 +00:00
penknife6153	3e38feb05f	[inductor] Add configuration control for CUTLASS operation selection. (#155770 ) Added a new configuration option `cutlass_enabled_ops` that allows users to control which operations use CUTLASS lowerings. By default, CUTLASS is enabled for all operations (maintaining backward compatibility), but users can now selectively enable it only for specific operations to optimize compilation time. Fixes #155718 ## Usage Examples
```bash
# Enable CUTLASS for all operations (default behavior)
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="ALL"
# Enable CUTLASS only for matrix multiplication operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="mm,addmm"
# Enable CUTLASS only for batch operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="bmm,baddbmm"
# Disable CUTLASS for all operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS=""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155770 Approved by: https://github.com/henrylhtsang	2025-06-14 08:19:54 +00:00
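The four settings listed in the commit above suggest a simple allow-list reading of the variable. The sketch below shows one way such a value could be interpreted. It is an assumption for illustration, not Inductor's implementation: `cutlass_enabled_for` is a hypothetical helper, and only the environment variable name and the example values come from the commit message.

```python
import os


def cutlass_enabled_for(op_name: str, setting: str) -> bool:
    # Hypothetical helper: "ALL" enables every op, an empty string enables none,
    # and anything else is treated as a comma-separated allow-list.
    if setting == "ALL":
        return True
    enabled = {name.strip() for name in setting.split(",") if name.strip()}
    return op_name in enabled


# The commit states that enabling all operations is the default behavior.
setting = os.environ.get("TORCHINDUCTOR_CUTLASS_ENABLED_OPS", "ALL")
for op in ("mm", "addmm", "bmm", "baddbmm"):
    print(op, cutlass_enabled_for(op, setting))
```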
PyTorch MergeBot	06408dae49	Revert "Add view_simple as meta function for view, and avoid calling reshape_view_helper. (#154757 )" This reverts commit `0029259bdf`. Reverted https://github.com/pytorch/pytorch/pull/154757 on behalf of https://github.com/laithsakka due to post land issue ([comment](https://github.com/pytorch/pytorch/pull/154757#issuecomment-2971385787))	2025-06-13 19:11:43 +00:00
Laith Sakka	f4376cac54	unify symbolic_shapes and sizevars dynamic shapes APIs naming 1 (#154774 ) Inductor has a set of APIs that perform symbolic evaluation similar to that of symbolic shapes, but they operate on sympy expressions instead of symnodes. The naming is not consistent; this stack makes it consistent. Step 1: unify the statically_known_true naming for a consistent experience. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154774 Approved by: https://github.com/drisspg, https://github.com/bobrenjc93, https://github.com/eellison	2025-06-12 16:11:55 +00:00
Laith Sakka	0029259bdf	Add view_simple as meta function for view, and avoid calling reshape_view_helper. (#154757 ) address https://github.com/pytorch/pytorch/issues/153303 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154757 Approved by: https://github.com/bobrenjc93, https://github.com/leslie-fang-intel	2025-06-12 09:58:15 +00:00
David Berard	c3ecabf059	[inductor][triton pin] add support for new TMA API for mm.py templates (#155723 ) Triton 3.4 will remove the experimental TMA APIs: https://github.com/triton-lang/triton/pull/6488 For mm.py templates, this PR adds support for using the new APIs when they are available (and otherwise falls back to the experimental APIs). For flex_attention, we'll remove TMA support for Triton 3.2 and 3.3 (versions of triton that don't have the new API). For mm_scaled_grouped.py, https://github.com/pytorch/pytorch/pull/150944 will remove TMA support for Triton 3.2. Note: we attempted this earlier with https://github.com/pytorch/pytorch/pull/154858, but this broke TMA usage in Triton 3.2. Differential Revision: [D76444471](https://our.internmc.facebook.com/intern/diff/D76444471) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155723 Approved by: https://github.com/NikhilAPatel	2025-06-12 06:25:47 +00:00
Shuqi Yang	1b6772a90f	A small fix in do_bench_using_profiling (#155500 ) Summary: Results: https://docs.google.com/document/d/1B_4rtiDFPH_jV3VpnqLPnInwDMpF7yX29G82UoJTcu8/edit?tab=t.0 Test Plan: ``` buck2 run mode/opt -c fbcode.enable_gpu_sections=true ai_acceleration/float8/benchmarks/bench:bench_fp8_shapes_eval 2>&1 | tee output44.txt ``` Rollback Plan: Differential Revision: D76298690 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155500 Approved by: https://github.com/yoyoyocmu, https://github.com/nmacchioni	2025-06-11 20:06:19 +00:00
Oguz Ulgen	d1947a8707	Migrate from lru_cache to cache (#155613 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155613 Approved by: https://github.com/ezyang ghstack dependencies: #155612	2025-06-11 19:44:18 +00:00
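For context on the migration named in the commit above: in the Python standard library (3.9+), `functools.cache` is an unbounded memoizer equivalent to `functools.lru_cache(maxsize=None)`. The sketch below compares the two on toy functions; `square_old` and `square_new` are illustrative names, not code from the PR.

```python
import functools


@functools.lru_cache(maxsize=None)
def square_old(x: int) -> int:
    # Older spelling: an LRU cache with no size bound.
    return x * x


@functools.cache
def square_new(x: int) -> int:
    # Newer spelling with the same unbounded-cache behavior.
    return x * x


for fn in (square_old, square_new):
    fn(3)  # first call computes the value (a cache miss)
    fn(3)  # second call is served from the cache (a cache hit)
    print(fn.__name__, fn.cache_info())
```

Both wrappers expose `cache_info()` and `cache_clear()`, so call sites that inspect or reset the cache keep working after such a change.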
Shunting Zhang	0b677560e6	[inductor] use int64 for large index (#154575 ) Split reduction may need to add an extra mask to avoid an invalid index. Previously we always used the torch.int32 dtype, which causes problems when the tensor numel exceeds 2^31. Fixes https://github.com/pytorch/pytorch/issues/154168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154575 Approved by: https://github.com/ngimel, https://github.com/jansel	2025-06-10 18:30:43 +00:00
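The failure mode behind the int64 commit above comes down to 32-bit wraparound. The sketch below emulates it in plain Python. Both `wrap_to_int32` and `pick_index_dtype` are hypothetical helpers written for illustration; they are not functions from Inductor, and the selection rule is an assumption.

```python
INT32_MAX = 2**31 - 1


def wrap_to_int32(value: int) -> int:
    # Emulate two's-complement wraparound of a signed 32-bit integer.
    return (value + 2**31) % 2**32 - 2**31


def pick_index_dtype(numel: int) -> str:
    # Assumed rule: fall back to int64 once numel no longer fits in int32.
    return "int32" if numel <= INT32_MAX else "int64"


small_numel = 1024
large_numel = 2**31 + 8

print(pick_index_dtype(small_numel), pick_index_dtype(large_numel))
# An index at the end of the large tensor wraps to a negative number in int32,
# so a mask such as `index < numel` would be evaluated on the wrong value.
print(wrap_to_int32(large_numel - 1))
```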
PyTorch MergeBot	eb152ab1dd	Revert "Inductor logging + analysis of torch.profile (#149697 )" This reverts commit `060838c231`. Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/clee2000 due to broke a bunch of tests internally D76299454, probably also broke rocm inductor/test_analysis.py::TestAnalysisCUDA::test_augment_trace_against_flop_counter_maxat0_cuda_float16 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15545277599/job/43766911025) [HUD commit link](`060838c231`) ([comment](https://github.com/pytorch/pytorch/pull/149697#issuecomment-2959747153))	2025-06-10 15:38:40 +00:00
Gabriel Ferns	060838c231	Inductor logging + analysis of torch.profile (#149697 ) Prereqs: - https://github.com/pytorch/pytorch/pull/152708 Features: 1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses. 1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`. 1. New helpers `countable_fx` and `count_flops_fx` helps get the flops of an `fx.Node`. 1. Extends Triton `torch.profiler` logging to `DebugAutotuner`. 1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side: ```python Device(NVIDIA H100, 0): Kernel Name \| resnet Kernel Count \| resnet FLOPS \| resnet bw gbps \| resnet Dur (ms) \| resnet Achieved FLOPS % \| resnet Achieved Bandwidth % \| newresnet Kernel Count \| newresnet FLOPS \| newresnet bw gbps \| newresnet Dur (ms) \| newresnet Achieved FLOPS % \| newresnet Achieved Bandwidth % --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- triton_poi_fused__native_batch_norm_legi \| 24 \| 0 \| 0.11395268248131513 \| 2.5919166666666666 \| 0 \| 0.003401572611382541 \| 24 \| 0 \| 0.11395268248131513 \| 2.5919166666666666 \| 0 \| 0.003401572611382541 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 142 \| 16932673552.422373 \| 0.2585007824198784 \| 12.441619718309857 \| 0.08683422334575583 \| 0.007716441266265022 \| 142 \| 16932673552.422373 \| 0.2585007824198784 \| 12.441619718309857 \| 
0.08683422334575583 \| 0.007716441266265022 triton_red_fused__native_batch_norm_legi \| 39 \| 0 \| 0.13990024992108846 \| 5.752589743589743 \| 0 \| 0.004176126863316074 \| 39 \| 0 \| 0.13990024992108846 \| 5.752589743589743 \| 0 \| 0.004176126863316074 triton_poi_fused__native_batch_norm_legi \| 25 \| 0 \| 0.31824055917536503 \| 2.5291999999999994 \| 0 \| 0.009499718184339253 \| 25 \| 0 \| 0.31824055917536503 \| 2.5291999999999994 \| 0 \| 0.009499718184339253 void cutlass::Kernel2<cutlass_80_tensoro \| 98 \| 16211056473.596165 \| 0.42972434051025826 \| 7.130408163265306 \| 0.08313362294151874 \| 0.012827592254037562 \| 98 \| 16211056473.596165 \| 0.42972434051025826 \| 7.130408163265306 \| 0.08313362294151874 \| 0.012827592254037562 triton_red_fused__native_batch_norm_legi \| 73 \| 0 \| 0.3225381327611705 \| 9.987068493150682 \| 0 \| 0.009628003963020014 \| 73 \| 0 \| 0.3225381327611705 \| 9.987068493150682 \| 0 \| 0.009628003963020014 triton_poi_fused__native_batch_norm_legi \| 15 \| 0 \| 1.4491211346487216 \| 4.439333333333333 \| 0 \| 0.043257347302946926 \| 15 \| 0 \| 1.4491211346487216 \| 4.439333333333333 \| 0 \| 0.043257347302946926 void cutlass::Kernel2<cutlass_80_tensoro \| 186 \| 14501701145.337954 \| 0.2667131401910989 \| 7.873865591397849 \| 0.07436769818122027 \| 0.007961586274361157 \| 186 \| 14501701145.337954 \| 0.2667131401910989 \| 7.873865591397849 \| 0.07436769818122027 \| 0.007961586274361157 triton_poi_fused__native_batch_norm_legi \| 33 \| 0 \| 1.4924556538193923 \| 4.3101515151515155 \| 0 \| 0.044550915039384846 \| 33 \| 0 \| 1.4924556538193923 \| 4.3101515151515155 \| 0 \| 0.044550915039384846 triton_red_fused__native_batch_norm_legi \| 29 \| 0 \| 0.25562590522631107 \| 6.296275862068965 \| 0 \| 0.007630624036606301 \| 29 \| 0 \| 0.25562590522631107 \| 6.296275862068965 \| 0 \| 0.007630624036606301 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.5870562174192726 \| 2.7397692307692307 \| 0 \| 0.01752406619162008 \| 13 \| 0 \| 
0.5870562174192726 \| 2.7397692307692307 \| 0 \| 0.01752406619162008 triton_poi_fused__native_batch_norm_legi \| 34 \| 0 \| 0.41409928846284 \| 2.853588235294117 \| 0 \| 0.012361172789935523 \| 34 \| 0 \| 0.41409928846284 \| 2.853588235294117 \| 0 \| 0.012361172789935523 triton_per_fused__native_batch_norm_legi \| 34 \| 0 \| 0.11705315007018151 \| 3.460647058823529 \| 0 \| 0.0034941238826919864 \| 34 \| 0 \| 0.11705315007018151 \| 3.460647058823529 \| 0 \| 0.0034941238826919864 triton_poi_fused__native_batch_norm_legi \| 16 \| 0 \| 0.17207853197124584 \| 2.3459375000000002 \| 0 \| 0.005136672596156592 \| 16 \| 0 \| 0.17207853197124584 \| 2.3459375000000002 \| 0 \| 0.005136672596156592 triton_per_fused__native_batch_norm_legi \| 30 \| 0 \| 0.2639714322022256 \| 6.131199999999999 \| 0 \| 0.007879744244842555 \| 30 \| 0 \| 0.2639714322022256 \| 6.131199999999999 \| 0 \| 0.007879744244842555 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 100 \| 11875430356.891787 \| 0.19494470869421385 \| 16.36534 \| 0.06089964285585531 \| 0.005819245035648175 \| 100 \| 11875430356.891787 \| 0.19494470869421385 \| 16.36534 \| 0.06089964285585531 \| 0.005819245035648175 triton_poi_fused__native_batch_norm_legi \| 8 \| 0 \| 0.9854096626224687 \| 3.2757500000000004 \| 0 \| 0.029415213809625928 \| 8 \| 0 \| 0.9854096626224687 \| 3.2757500000000004 \| 0 \| 0.029415213809625928 void cublasLt::splitKreduce_kernel<32, 1 \| 56 \| 34377923395.147064 \| 0.8310300045762317 \| 3.4199999999999986 \| 0.17629704305203628 \| 0.024806865808245714 \| 56 \| 34377923395.147064 \| 0.8310300045762317 \| 3.4199999999999986 \| 0.17629704305203628 \| 0.024806865808245714 triton_poi_fused__native_batch_norm_legi \| 23 \| 0 \| 0.9944002965861103 \| 3.2431304347826084 \| 0 \| 0.02968359094286896 \| 23 \| 0 \| 0.9944002965861103 \| 3.2431304347826084 \| 0 \| 0.02968359094286896 triton_per_fused__native_batch_norm_legi \| 10 \| 0 \| 0.1826801058931057 \| 4.428800000000001 \| 0 \| 0.00545313748934644 \| 10 \| 0 \| 
0.1826801058931057 \| 4.428800000000001 \| 0 \| 0.00545313748934644 triton_poi_fused__native_batch_norm_legi \| 10 \| 0 \| 0.3168973585366449 \| 2.5471999999999997 \| 0 \| 0.009459622642884923 \| 10 \| 0 \| 0.3168973585366449 \| 2.5471999999999997 \| 0 \| 0.009459622642884923 triton_poi_fused__native_batch_norm_legi \| 34 \| 0 \| 1.1463614897015777 \| 4.124323529411764 \| 0 \| 0.03421974596124114 \| 34 \| 0 \| 1.1463614897015777 \| 4.124323529411764 \| 0 \| 0.03421974596124114 void cask_plugin_cudnn::xmma_cudnn::init \| 44 \| 44045510816.64277 \| 2.0661232850348643 \| 3.6887499999999993 \| 0.22587441444432194 \| 0.06167532194133924 \| 44 \| 44045510816.64277 \| 2.0661232850348643 \| 3.6887499999999993 \| 0.22587441444432194 \| 0.06167532194133924 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 95 \| 7876855400.165316 \| 0.4694941555946739 \| 18.224315789473682 \| 0.04039413025725802 \| 0.014014750913273854 \| 95 \| 7876855400.165316 \| 0.4694941555946739 \| 18.224315789473682 \| 0.04039413025725802 \| 0.014014750913273854 triton_per_fused__native_batch_norm_legi \| 41 \| 0 \| 0.06825669875995298 \| 3.0384146341463416 \| 0 \| 0.002037513395819492 \| 41 \| 0 \| 0.06825669875995298 \| 3.0384146341463416 \| 0 \| 0.002037513395819492 triton_poi_fused__native_batch_norm_legi \| 23 \| 0 \| 0.08808154712430301 \| 2.3275652173913044 \| 0 \| 0.0026292999141582997 \| 23 \| 0 \| 0.08808154712430301 \| 2.3275652173913044 \| 0 \| 0.0026292999141582997 triton_per_fused__native_batch_norm_legi \| 40 \| 0 \| 0.18179321034952417 \| 4.556825 \| 0 \| 0.005426662995508183 \| 40 \| 0 \| 0.18179321034952417 \| 4.556825 \| 0 \| 0.005426662995508183 triton_poi_fused__native_batch_norm_legi \| 15 \| 0 \| 0.5887415155454232 \| 2.783866666666667 \| 0 \| 0.017574373598370836 \| 15 \| 0 \| 0.5887415155454232 \| 2.783866666666667 \| 0 \| 0.017574373598370836 void cutlass::Kernel2<cutlass_80_tensoro \| 38 \| 14242013806.264643 \| 0.256592404353939 \| 7.217631578947369 \| 0.0730359682372546 \| 
0.007659474756834 \| 38 \| 14242013806.264643 \| 0.256592404353939 \| 7.217631578947369 \| 0.0730359682372546 \| 0.007659474756834 triton_poi_fused__native_batch_norm_legi \| 21 \| 0 \| 0.5842860973430516 \| 2.7779047619047623 \| 0 \| 0.017441376040091088 \| 21 \| 0 \| 0.5842860973430516 \| 2.7779047619047623 \| 0 \| 0.017441376040091088 triton_per_fused__native_batch_norm_legi \| 16 \| 0 \| 0.11509365173486417 \| 3.5959375000000002 \| 0 \| 0.0034356313950705724 \| 16 \| 0 \| 0.11509365173486417 \| 3.5959375000000002 \| 0 \| 0.0034356313950705724 triton_poi_fused__native_batch_norm_legi \| 14 \| 0 \| 0.1704672000243914 \| 2.4044285714285714 \| 0 \| 0.00508857313505646 \| 14 \| 0 \| 0.1704672000243914 \| 2.4044285714285714 \| 0 \| 0.00508857313505646 triton_poi_fused__native_batch_norm_legi \| 58 \| 0 \| 2.307520779930795 \| 8.190706896551722 \| 0 \| 0.06888121731136704 \| 58 \| 0 \| 2.307520779930795 \| 8.190706896551722 \| 0 \| 0.06888121731136704 triton_per_fused__native_batch_norm_legi \| 29 \| 0 \| 0.037243248971881276 \| 3.0277586206896556 \| 0 \| 0.001111738775280038 \| 29 \| 0 \| 0.037243248971881276 \| 3.0277586206896556 \| 0 \| 0.001111738775280038 triton_poi_fused__native_batch_norm_legi \| 20 \| 0 \| 0.04741699795428918 \| 2.2911500000000005 \| 0 \| 0.0014154327747549007 \| 20 \| 0 \| 0.04741699795428918 \| 2.2911500000000005 \| 0 \| 0.0014154327747549007 triton_per_fused__native_batch_norm_legi \| 25 \| 0 \| 0.13357016893727824 \| 3.37536 \| 0 \| 0.003987169222008305 \| 25 \| 0 \| 0.13357016893727824 \| 3.37536 \| 0 \| 0.003987169222008305 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.3089862268300253 \| 2.8111538461538457 \| 0 \| 0.009223469457612694 \| 13 \| 0 \| 0.3089862268300253 \| 2.8111538461538457 \| 0 \| 0.009223469457612694 triton_poi_fused__native_batch_norm_legi \| 17 \| 0 \| 0.3129385387909844 \| 2.673 \| 0 \| 0.009341448919133863 \| 17 \| 0 \| 0.3129385387909844 \| 2.673 \| 0 \| 0.009341448919133863 
triton_per_fused__native_batch_norm_legi \| 19 \| 0 \| 0.2215568162533158 \| 3.8837368421052636 \| 0 \| 0.0066136363060691275 \| 19 \| 0 \| 0.2215568162533158 \| 3.8837368421052636 \| 0 \| 0.0066136363060691275 std::enable_if<!(false), void>::type int \| 23 \| 504916805.19297093 \| 1.0118296096314707 \| 8.113913043478261 \| 0.0025893169497075447 \| 0.030203868944223014 \| 23 \| 504916805.19297093 \| 1.0118296096314707 \| 8.113913043478261 \| 0.0025893169497075447 \| 0.030203868944223014 triton_poi_fused_add_copy__38 \| 56 \| 0 \| 0 \| 2.132482142857143 \| 0 \| 0 \| 56 \| 0 \| 0 \| 2.132482142857143 \| 0 \| 0 triton_poi_fused_convolution_0 \| 18 \| 0 \| 0.43458610794936897 \| 2.773333333333334 \| 0 \| 0.012972719640279667 \| 18 \| 0 \| 0.43458610794936897 \| 2.773333333333334 \| 0 \| 0.012972719640279667 triton_poi_fused_convolution_1 \| 17 \| 0 \| 0.028816312469162712 \| 2.6145882352941174 \| 0 \| 0.0008601884319153051 \| 17 \| 0 \| 0.028816312469162712 \| 2.6145882352941174 \| 0 \| 0.0008601884319153051 void convolve_common_engine_float_NHWC<f \| 44 \| 8641868995.31118 \| 0.024730540008465626 \| 25.87327272727273 \| 0.04431727689903169 \| 0.0007382250748795709 \| 44 \| 8641868995.31118 \| 0.024730540008465626 \| 25.87327272727273 \| 0.04431727689903169 \| 0.0007382250748795709 triton_per_fused__native_batch_norm_legi \| 12 \| 0 \| 0.6809930918986744 \| 4.82675 \| 0 \| 0.020328151996975356 \| 12 \| 0 \| 0.6809930918986744 \| 4.82675 \| 0 \| 0.020328151996975356 triton_per_fused__native_batch_norm_legi \| 14 \| 0 \| 0.02883030597936608 \| 2.6651428571428575 \| 0 \| 0.0008606061486377935 \| 14 \| 0 \| 0.02883030597936608 \| 2.6651428571428575 \| 0 \| 0.0008606061486377935 triton_per_fused__native_batch_norm_legi \| 16 \| 0 \| 0.0014658988233201874 \| 2.098 \| 0 \| 4.375817383045335e-05 \| 16 \| 0 \| 0.0014658988233201874 \| 2.098 \| 0 \| 4.375817383045335e-05 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.9926297180284697 \| 3.2367692307692306 \| 0 \| 
0.02963073785159611 \| 13 \| 0 \| 0.9926297180284697 \| 3.2367692307692306 \| 0 \| 0.02963073785159611 triton_poi_fused__native_batch_norm_legi \| 9 \| 0 \| 1.3008817095666507 \| 3.0863333333333336 \| 0 \| 0.03883228983781048 \| 9 \| 0 \| 1.3008817095666507 \| 3.0863333333333336 \| 0 \| 0.03883228983781048 void at::native::(anonymous namespace):: \| 98 \| 0 \| 0.09174335613709389 \| 4.408520408163265 \| 0 \| 0.0027386076458833994 \| 98 \| 0 \| 0.09174335613709389 \| 4.408520408163265 \| 0 \| 0.0027386076458833994 void at::native::vectorized_elementwise_ \| 7 \| 0 \| 0 \| 1.7278571428571428 \| 0 \| 0 \| 7 \| 0 \| 0 \| 1.7278571428571428 \| 0 \| 0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697 Approved by: https://github.com/eellison, https://github.com/shunting314	2025-06-09 21:43:21 +00:00
Max Podkorytov	1e6a653234	[ROCm][Inductor][CK] Split ck and ck-tile inductor backend(s) (#155294 ) ... and fix ck-tile instances not being generated due to incorrect caching ### Testing Added test cases for CKTILE instances ``` pytest test/inductor/test_ck_backend.py -k gemm_backends_CKTILE ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/155294 Approved by: https://github.com/coconutruben	2025-06-09 20:40:26 +00:00
PyTorch MergeBot	79bdafe5b6	Revert "Custom FX pass for inductor's backend registration (#154841 )" This reverts commit `e694280d12`. Reverted https://github.com/pytorch/pytorch/pull/154841 on behalf of https://github.com/clee2000 due to failing some tests internally D76135706 ([comment](https://github.com/pytorch/pytorch/pull/154841#issuecomment-2956357711))	2025-06-09 16:56:45 +00:00
PyTorch MergeBot	27df0c56b7	Revert "[inductor] use int64 for large index (#154575 )" This reverts commit `2596e3d061`. Reverted https://github.com/pytorch/pytorch/pull/154575 on behalf of https://github.com/clee2000 due to broke inductor/test_op_dtype_prop.py::TestCaseCUDA::test_op_dtype_propagation_add_cuda_int32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15510656657/job/43673763835) [HUD commit link](`2596e3d061`), note for self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/154575#issuecomment-2954175761))	2025-06-08 16:58:59 +00:00
Shunting Zhang	2596e3d061	[inductor] use int64 for large index (#154575 ) Split reduction may need to add an extra mask to avoid an invalid index. Previously we always used the torch.int32 dtype, which causes problems when the tensor numel exceeds 2^31. Fix https://github.com/pytorch/pytorch/issues/154168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154575 Approved by: https://github.com/ngimel, https://github.com/jansel	2025-06-07 18:41:46 +00:00
PyTorch MergeBot	7e4c097b07	Revert "[inductor] Add typing to _inductor/ir.py (#149958 )" This reverts commit `529e0357c6`. Reverted https://github.com/pytorch/pytorch/pull/149958 on behalf of https://github.com/malfet due to Looks like it broke inductor_torchbind tests, due to more graphbreaks, see `b0fbbef136/1` ([comment](https://github.com/pytorch/pytorch/pull/149958#issuecomment-2949583209))	2025-06-06 15:19:16 +00:00
Tom Ritchford	529e0357c6	[inductor] Add typing to _inductor/ir.py (#149958 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149958 Approved by: https://github.com/Skylion007	2025-06-06 14:15:01 +00:00
Marcin Pioch	e694280d12	Custom FX pass for inductor's backend registration (#154841 ) This PR is related to RFC #153532. It extends Inductor's backend registration interface so that a backend can register custom FX passes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154841 Approved by: https://github.com/jansel Co-authored-by: Jason Ansel <jansel@jansel.net>	2025-06-06 06:49:44 +00:00
PyTorch MergeBot	5e03433443	Revert "Inductor logging + analysis of torch.profile (#149697 )" This reverts commit `e5afbe3124`. Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/malfet due to Broke rocm, see `642687af29/1` ([comment](https://github.com/pytorch/pytorch/pull/149697#issuecomment-2942415600))	2025-06-05 01:38:13 +00:00
Gabriel Ferns	e5afbe3124	Inductor logging + analysis of torch.profile (#149697 ) Prereqs: - https://github.com/pytorch/pytorch/pull/152708 Features: 1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses. 1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`. 1. New helpers `countable_fx` and `count_flops_fx` helps get the flops of an `fx.Node`. 1. Extends Triton `torch.profiler` logging to `DebugAutotuner`. 1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side: ```python Device(NVIDIA H100, 0): Kernel Name \| resnet Kernel Count \| resnet FLOPS \| resnet bw gbps \| resnet Dur (ms) \| resnet Achieved FLOPS % \| resnet Achieved Bandwidth % \| newresnet Kernel Count \| newresnet FLOPS \| newresnet bw gbps \| newresnet Dur (ms) \| newresnet Achieved FLOPS % \| newresnet Achieved Bandwidth % --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- triton_poi_fused__native_batch_norm_legi \| 24 \| 0 \| 0.11395268248131513 \| 2.5919166666666666 \| 0 \| 0.003401572611382541 \| 24 \| 0 \| 0.11395268248131513 \| 2.5919166666666666 \| 0 \| 0.003401572611382541 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 142 \| 16932673552.422373 \| 0.2585007824198784 \| 12.441619718309857 \| 0.08683422334575583 \| 0.007716441266265022 \| 142 \| 16932673552.422373 \| 0.2585007824198784 \| 12.441619718309857 \| 
0.08683422334575583 \| 0.007716441266265022 triton_red_fused__native_batch_norm_legi \| 39 \| 0 \| 0.13990024992108846 \| 5.752589743589743 \| 0 \| 0.004176126863316074 \| 39 \| 0 \| 0.13990024992108846 \| 5.752589743589743 \| 0 \| 0.004176126863316074 triton_poi_fused__native_batch_norm_legi \| 25 \| 0 \| 0.31824055917536503 \| 2.5291999999999994 \| 0 \| 0.009499718184339253 \| 25 \| 0 \| 0.31824055917536503 \| 2.5291999999999994 \| 0 \| 0.009499718184339253 void cutlass::Kernel2<cutlass_80_tensoro \| 98 \| 16211056473.596165 \| 0.42972434051025826 \| 7.130408163265306 \| 0.08313362294151874 \| 0.012827592254037562 \| 98 \| 16211056473.596165 \| 0.42972434051025826 \| 7.130408163265306 \| 0.08313362294151874 \| 0.012827592254037562 triton_red_fused__native_batch_norm_legi \| 73 \| 0 \| 0.3225381327611705 \| 9.987068493150682 \| 0 \| 0.009628003963020014 \| 73 \| 0 \| 0.3225381327611705 \| 9.987068493150682 \| 0 \| 0.009628003963020014 triton_poi_fused__native_batch_norm_legi \| 15 \| 0 \| 1.4491211346487216 \| 4.439333333333333 \| 0 \| 0.043257347302946926 \| 15 \| 0 \| 1.4491211346487216 \| 4.439333333333333 \| 0 \| 0.043257347302946926 void cutlass::Kernel2<cutlass_80_tensoro \| 186 \| 14501701145.337954 \| 0.2667131401910989 \| 7.873865591397849 \| 0.07436769818122027 \| 0.007961586274361157 \| 186 \| 14501701145.337954 \| 0.2667131401910989 \| 7.873865591397849 \| 0.07436769818122027 \| 0.007961586274361157 triton_poi_fused__native_batch_norm_legi \| 33 \| 0 \| 1.4924556538193923 \| 4.3101515151515155 \| 0 \| 0.044550915039384846 \| 33 \| 0 \| 1.4924556538193923 \| 4.3101515151515155 \| 0 \| 0.044550915039384846 triton_red_fused__native_batch_norm_legi \| 29 \| 0 \| 0.25562590522631107 \| 6.296275862068965 \| 0 \| 0.007630624036606301 \| 29 \| 0 \| 0.25562590522631107 \| 6.296275862068965 \| 0 \| 0.007630624036606301 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.5870562174192726 \| 2.7397692307692307 \| 0 \| 0.01752406619162008 \| 13 \| 0 \| 
0.5870562174192726 \| 2.7397692307692307 \| 0 \| 0.01752406619162008 triton_poi_fused__native_batch_norm_legi \| 34 \| 0 \| 0.41409928846284 \| 2.853588235294117 \| 0 \| 0.012361172789935523 \| 34 \| 0 \| 0.41409928846284 \| 2.853588235294117 \| 0 \| 0.012361172789935523 triton_per_fused__native_batch_norm_legi \| 34 \| 0 \| 0.11705315007018151 \| 3.460647058823529 \| 0 \| 0.0034941238826919864 \| 34 \| 0 \| 0.11705315007018151 \| 3.460647058823529 \| 0 \| 0.0034941238826919864 triton_poi_fused__native_batch_norm_legi \| 16 \| 0 \| 0.17207853197124584 \| 2.3459375000000002 \| 0 \| 0.005136672596156592 \| 16 \| 0 \| 0.17207853197124584 \| 2.3459375000000002 \| 0 \| 0.005136672596156592 triton_per_fused__native_batch_norm_legi \| 30 \| 0 \| 0.2639714322022256 \| 6.131199999999999 \| 0 \| 0.007879744244842555 \| 30 \| 0 \| 0.2639714322022256 \| 6.131199999999999 \| 0 \| 0.007879744244842555 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 100 \| 11875430356.891787 \| 0.19494470869421385 \| 16.36534 \| 0.06089964285585531 \| 0.005819245035648175 \| 100 \| 11875430356.891787 \| 0.19494470869421385 \| 16.36534 \| 0.06089964285585531 \| 0.005819245035648175 triton_poi_fused__native_batch_norm_legi \| 8 \| 0 \| 0.9854096626224687 \| 3.2757500000000004 \| 0 \| 0.029415213809625928 \| 8 \| 0 \| 0.9854096626224687 \| 3.2757500000000004 \| 0 \| 0.029415213809625928 void cublasLt::splitKreduce_kernel<32, 1 \| 56 \| 34377923395.147064 \| 0.8310300045762317 \| 3.4199999999999986 \| 0.17629704305203628 \| 0.024806865808245714 \| 56 \| 34377923395.147064 \| 0.8310300045762317 \| 3.4199999999999986 \| 0.17629704305203628 \| 0.024806865808245714 triton_poi_fused__native_batch_norm_legi \| 23 \| 0 \| 0.9944002965861103 \| 3.2431304347826084 \| 0 \| 0.02968359094286896 \| 23 \| 0 \| 0.9944002965861103 \| 3.2431304347826084 \| 0 \| 0.02968359094286896 triton_per_fused__native_batch_norm_legi \| 10 \| 0 \| 0.1826801058931057 \| 4.428800000000001 \| 0 \| 0.00545313748934644 \| 10 \| 0 \| 
0.1826801058931057 \| 4.428800000000001 \| 0 \| 0.00545313748934644 triton_poi_fused__native_batch_norm_legi \| 10 \| 0 \| 0.3168973585366449 \| 2.5471999999999997 \| 0 \| 0.009459622642884923 \| 10 \| 0 \| 0.3168973585366449 \| 2.5471999999999997 \| 0 \| 0.009459622642884923 triton_poi_fused__native_batch_norm_legi \| 34 \| 0 \| 1.1463614897015777 \| 4.124323529411764 \| 0 \| 0.03421974596124114 \| 34 \| 0 \| 1.1463614897015777 \| 4.124323529411764 \| 0 \| 0.03421974596124114 void cask_plugin_cudnn::xmma_cudnn::init \| 44 \| 44045510816.64277 \| 2.0661232850348643 \| 3.6887499999999993 \| 0.22587441444432194 \| 0.06167532194133924 \| 44 \| 44045510816.64277 \| 2.0661232850348643 \| 3.6887499999999993 \| 0.22587441444432194 \| 0.06167532194133924 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 \| 95 \| 7876855400.165316 \| 0.4694941555946739 \| 18.224315789473682 \| 0.04039413025725802 \| 0.014014750913273854 \| 95 \| 7876855400.165316 \| 0.4694941555946739 \| 18.224315789473682 \| 0.04039413025725802 \| 0.014014750913273854 triton_per_fused__native_batch_norm_legi \| 41 \| 0 \| 0.06825669875995298 \| 3.0384146341463416 \| 0 \| 0.002037513395819492 \| 41 \| 0 \| 0.06825669875995298 \| 3.0384146341463416 \| 0 \| 0.002037513395819492 triton_poi_fused__native_batch_norm_legi \| 23 \| 0 \| 0.08808154712430301 \| 2.3275652173913044 \| 0 \| 0.0026292999141582997 \| 23 \| 0 \| 0.08808154712430301 \| 2.3275652173913044 \| 0 \| 0.0026292999141582997 triton_per_fused__native_batch_norm_legi \| 40 \| 0 \| 0.18179321034952417 \| 4.556825 \| 0 \| 0.005426662995508183 \| 40 \| 0 \| 0.18179321034952417 \| 4.556825 \| 0 \| 0.005426662995508183 triton_poi_fused__native_batch_norm_legi \| 15 \| 0 \| 0.5887415155454232 \| 2.783866666666667 \| 0 \| 0.017574373598370836 \| 15 \| 0 \| 0.5887415155454232 \| 2.783866666666667 \| 0 \| 0.017574373598370836 void cutlass::Kernel2<cutlass_80_tensoro \| 38 \| 14242013806.264643 \| 0.256592404353939 \| 7.217631578947369 \| 0.0730359682372546 \| 
0.007659474756834 \| 38 \| 14242013806.264643 \| 0.256592404353939 \| 7.217631578947369 \| 0.0730359682372546 \| 0.007659474756834 triton_poi_fused__native_batch_norm_legi \| 21 \| 0 \| 0.5842860973430516 \| 2.7779047619047623 \| 0 \| 0.017441376040091088 \| 21 \| 0 \| 0.5842860973430516 \| 2.7779047619047623 \| 0 \| 0.017441376040091088 triton_per_fused__native_batch_norm_legi \| 16 \| 0 \| 0.11509365173486417 \| 3.5959375000000002 \| 0 \| 0.0034356313950705724 \| 16 \| 0 \| 0.11509365173486417 \| 3.5959375000000002 \| 0 \| 0.0034356313950705724 triton_poi_fused__native_batch_norm_legi \| 14 \| 0 \| 0.1704672000243914 \| 2.4044285714285714 \| 0 \| 0.00508857313505646 \| 14 \| 0 \| 0.1704672000243914 \| 2.4044285714285714 \| 0 \| 0.00508857313505646 triton_poi_fused__native_batch_norm_legi \| 58 \| 0 \| 2.307520779930795 \| 8.190706896551722 \| 0 \| 0.06888121731136704 \| 58 \| 0 \| 2.307520779930795 \| 8.190706896551722 \| 0 \| 0.06888121731136704 triton_per_fused__native_batch_norm_legi \| 29 \| 0 \| 0.037243248971881276 \| 3.0277586206896556 \| 0 \| 0.001111738775280038 \| 29 \| 0 \| 0.037243248971881276 \| 3.0277586206896556 \| 0 \| 0.001111738775280038 triton_poi_fused__native_batch_norm_legi \| 20 \| 0 \| 0.04741699795428918 \| 2.2911500000000005 \| 0 \| 0.0014154327747549007 \| 20 \| 0 \| 0.04741699795428918 \| 2.2911500000000005 \| 0 \| 0.0014154327747549007 triton_per_fused__native_batch_norm_legi \| 25 \| 0 \| 0.13357016893727824 \| 3.37536 \| 0 \| 0.003987169222008305 \| 25 \| 0 \| 0.13357016893727824 \| 3.37536 \| 0 \| 0.003987169222008305 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.3089862268300253 \| 2.8111538461538457 \| 0 \| 0.009223469457612694 \| 13 \| 0 \| 0.3089862268300253 \| 2.8111538461538457 \| 0 \| 0.009223469457612694 triton_poi_fused__native_batch_norm_legi \| 17 \| 0 \| 0.3129385387909844 \| 2.673 \| 0 \| 0.009341448919133863 \| 17 \| 0 \| 0.3129385387909844 \| 2.673 \| 0 \| 0.009341448919133863 
triton_per_fused__native_batch_norm_legi \| 19 \| 0 \| 0.2215568162533158 \| 3.8837368421052636 \| 0 \| 0.0066136363060691275 \| 19 \| 0 \| 0.2215568162533158 \| 3.8837368421052636 \| 0 \| 0.0066136363060691275 std::enable_if<!(false), void>::type int \| 23 \| 504916805.19297093 \| 1.0118296096314707 \| 8.113913043478261 \| 0.0025893169497075447 \| 0.030203868944223014 \| 23 \| 504916805.19297093 \| 1.0118296096314707 \| 8.113913043478261 \| 0.0025893169497075447 \| 0.030203868944223014 triton_poi_fused_add_copy__38 \| 56 \| 0 \| 0 \| 2.132482142857143 \| 0 \| 0 \| 56 \| 0 \| 0 \| 2.132482142857143 \| 0 \| 0 triton_poi_fused_convolution_0 \| 18 \| 0 \| 0.43458610794936897 \| 2.773333333333334 \| 0 \| 0.012972719640279667 \| 18 \| 0 \| 0.43458610794936897 \| 2.773333333333334 \| 0 \| 0.012972719640279667 triton_poi_fused_convolution_1 \| 17 \| 0 \| 0.028816312469162712 \| 2.6145882352941174 \| 0 \| 0.0008601884319153051 \| 17 \| 0 \| 0.028816312469162712 \| 2.6145882352941174 \| 0 \| 0.0008601884319153051 void convolve_common_engine_float_NHWC<f \| 44 \| 8641868995.31118 \| 0.024730540008465626 \| 25.87327272727273 \| 0.04431727689903169 \| 0.0007382250748795709 \| 44 \| 8641868995.31118 \| 0.024730540008465626 \| 25.87327272727273 \| 0.04431727689903169 \| 0.0007382250748795709 triton_per_fused__native_batch_norm_legi \| 12 \| 0 \| 0.6809930918986744 \| 4.82675 \| 0 \| 0.020328151996975356 \| 12 \| 0 \| 0.6809930918986744 \| 4.82675 \| 0 \| 0.020328151996975356 triton_per_fused__native_batch_norm_legi \| 14 \| 0 \| 0.02883030597936608 \| 2.6651428571428575 \| 0 \| 0.0008606061486377935 \| 14 \| 0 \| 0.02883030597936608 \| 2.6651428571428575 \| 0 \| 0.0008606061486377935 triton_per_fused__native_batch_norm_legi \| 16 \| 0 \| 0.0014658988233201874 \| 2.098 \| 0 \| 4.375817383045335e-05 \| 16 \| 0 \| 0.0014658988233201874 \| 2.098 \| 0 \| 4.375817383045335e-05 triton_poi_fused__native_batch_norm_legi \| 13 \| 0 \| 0.9926297180284697 \| 3.2367692307692306 \| 0 \| 
0.02963073785159611 \| 13 \| 0 \| 0.9926297180284697 \| 3.2367692307692306 \| 0 \| 0.02963073785159611 triton_poi_fused__native_batch_norm_legi \| 9 \| 0 \| 1.3008817095666507 \| 3.0863333333333336 \| 0 \| 0.03883228983781048 \| 9 \| 0 \| 1.3008817095666507 \| 3.0863333333333336 \| 0 \| 0.03883228983781048 void at::native::(anonymous namespace):: \| 98 \| 0 \| 0.09174335613709389 \| 4.408520408163265 \| 0 \| 0.0027386076458833994 \| 98 \| 0 \| 0.09174335613709389 \| 4.408520408163265 \| 0 \| 0.0027386076458833994 void at::native::vectorized_elementwise_ \| 7 \| 0 \| 0 \| 1.7278571428571428 \| 0 \| 0 \| 7 \| 0 \| 0 \| 1.7278571428571428 \| 0 \| 0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697 Approved by: https://github.com/eellison, https://github.com/shunting314	2025-06-04 20:03:46 +00:00
Boyuan Feng	a4da1d4a47	[Graph Partition] support standalone_compile (#154698 ) For graph partition, `write_get_raw_stream_header_once` is done once so the autotune code may not have the header. This PR additionally calls `write_get_raw_stream_header` in `codegen_device_guard_enter` before `get_raw_stream` is used. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154698 Approved by: https://github.com/oulgen	2025-06-03 07:40:42 +00:00
Paul Zhang	0c6c7780d9	[Inductor] Add envvar to disable decomposeK (#154421 ) Summary: Add envvar to Inductor config to disable decomposeK autotuning choice Test Plan: `buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -- --exact 'caffe2/test/inductor:max_autotune - test_max_autotune_decompose_k_dynamic_False_sizes2 (caffe2.test.inductor.test_max_autotune.TestMaxAutotune)' --run-disabled` Reviewed By: eellison Differential Revision: D75174823 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154421 Approved by: https://github.com/eellison	2025-05-29 23:34:41 +00:00
eellison	d6e29bf875	Reflect back mutation if we clone misaligned tensors (#154442 ) Fix for https://github.com/pytorch/pytorch/issues/152425 Inductor specializes on whether or not a tensor is 16-byte aligned on the first invocation. Then, on subsequent invocations, if we inferred alignment but are passed a non-aligned tensor, we clone the tensor. If we infer alignment, then run with an unaligned tensor and mutate the input, we need to reflect the mutation back to the input. This PR adds back that mutation. We could have also been less aggressive about inferring alignment for mutated tensors, but that has a pretty large perf hit. See the following benchmark: ``` import torch t = torch.rand(4096 * 4096, device="cuda", dtype=torch.float16) @torch.compile(dynamic=False) def foo(x): return x.add_(1) import triton print(triton.testing.do_bench(lambda: foo(t[:-1]))) torch._dynamo.reset() print(triton.testing.do_bench(lambda: foo(t[1:]))) ``` gives ``` 0.04063070610165596 0.07613472988113162 ``` So almost twice as slow for non-aligned tensors. Tensors changing alignment is a relatively rare case. In the future, we could consider a multi-kernel approach, or codegening a triton kernel that does most of the loads with aligned instructions, and a prologue/epilogue of un-alignment. But it's yet to be seen whether this is a huge issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154442 Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh	2025-05-29 13:36:48 +00:00
angelayi	26471fc203	[aoti] Initial Metal support (#153959 ) An example generated file: P1816629015 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153959 Approved by: https://github.com/malfet, https://github.com/desertfire ghstack dependencies: #153964	2025-05-23 05:45:35 +00:00
PyTorch MergeBot	47a01f3efb	Revert "[aoti] Initial Metal support (#153959 )" This reverts commit `28bcd9eb30`. Reverted https://github.com/pytorch/pytorch/pull/153959 on behalf of https://github.com/angelayi due to previous PR broke frl build ([comment](https://github.com/pytorch/pytorch/pull/153959#issuecomment-2901825315))	2025-05-22 16:17:07 +00:00
angelayi	28bcd9eb30	[aoti] Initial Metal support (#153959 ) An example generated file: P1816629015 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153959 Approved by: https://github.com/malfet, https://github.com/desertfire ghstack dependencies: #153964	2025-05-21 21:55:59 +00:00
PyTorch MergeBot	01bb249978	Revert "`has_triton`: Use the device interface for detecting Triton availability (#139171 )" This reverts commit `48bfe9afc7`. Reverted https://github.com/pytorch/pytorch/pull/139171 on behalf of https://github.com/masnesral due to Performance regression for huggingface ([comment](https://github.com/pytorch/pytorch/pull/139171#issuecomment-2868939790))	2025-05-10 14:46:23 +00:00
Menglu Yu	2d25e4d478	[1/n][Optimus][Auto-AC] Support activation quantization without scaling (#148380 ) Summary: We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize. Test Plan: # unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten ``` Buck UI: https://www.internalfb.com/buck2/776d3911-bb86-4ac8-a527-540cf1510b9d Test UI: https://www.internalfb.com/intern/testinfra/testrun/4785074873051017 Network: Up: 4.3MiB Down: 42MiB (reSessionID-fef7e727-68b1-4645-a519-5652854df38d) Executing actions. Remaining 0/4 6.7s exec time total Command: test. Finished 2 local Time elapsed: 3:11.5s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 # E2E ### how to enable (you can override the dtype; if none is given, the default is fp8) ``` post_grad_fusion_options={ "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"} }, ``` Differential Revision: D70522237 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148380 Approved by: https://github.com/Mingming-Ding, https://github.com/Hahu803	2025-05-08 04:44:15 +00:00


560 Commits