pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	7b7604fdb4	Revert "[inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335 )" This reverts commit `0c04492e3b`. Reverted https://github.com/pytorch/pytorch/pull/153335 on behalf of https://github.com/malfet due to Breaks lint, see `3742b7fb3a/1` ([comment](https://github.com/pytorch/pytorch/pull/153335#issuecomment-2896031661))	2025-05-20 23:12:11 +00:00
Slawomir Siwek	3742b7fb3a	Treat dim=[] same as dim=None (#153570 ) Fixes https://github.com/pytorch/pytorch/issues/153568 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153570 Approved by: https://github.com/ngimel	2025-05-20 22:44:29 +00:00
albanD	f7b8eadd9d	Add codeowner for merge rules (#152354 ) To ensure changes to merge rights are properly reviewed Also make the codeowner file valid by removing invalid users Pull Request resolved: https://github.com/pytorch/pytorch/pull/152354 Approved by: https://github.com/malfet	2025-05-20 22:24:23 +00:00
henrylhtsang	0c04492e3b	[inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335 ) Motivation: By default, we are tuning the cutlass backend kernels on 3 swizzles. There are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive. Observations: Winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a Greedy algorithm to reduce autotuning time. I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations. Logs: Baseline: https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2 ``` AUTOTUNE mm(2048x2048, 2048x2048) strides: [2048, 1], [1, 2048] dtypes: torch.bfloat16, torch.bfloat16 cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_1802 0.0293 ms 99.2% 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_2002 0.0297 ms 97.8% 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_1929 0.0302 ms 96.4% 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_1997 0.0303 ms 
96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_1923 0.0306 ms 95.2% 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 ``` with prescreening: ``` AUTOTUNE mm(147456x6144, 6144x2048) strides: [6144, 1], [2048, 1] dtypes: torch.bfloat16, torch.bfloat16 cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8 cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma 
swizzle=8 cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8 cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cutlass_1a5e81af 5.0225 ms 90.5% 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8 cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8 cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8 cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8 cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8 SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds 
precompiling for 32 choices ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/153335 Approved by: https://github.com/eellison	2025-05-20 22:19:02 +00:00
Bin Bao	2c2524f74b	[AOTI] Generate unique cubin file names when package_cpp_only (#153948 ) Summary: * When package_cpp_only is specified, generate kernel file names with unique kernel names to make the final packaged package files more readable. Assert on unique_kernel_names in case somehow it was explicitly set to False. * Fix a rocm test skip, see https://github.com/pytorch/pytorch/pull/153828 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153948 Approved by: https://github.com/angelayi, https://github.com/yushangdi	2025-05-20 22:07:53 +00:00
Wei Wang	8cabd23b3d	[CI][CUDA] Move cu118 distributed pull jobs to cu126, move cu124-sm75 to cu126-sm75 (#151594 ) This PR moves distributed cuda CI job from cuda 11.8 to cuda 12.6. In doing so, a few unit test failures were exposed, some if not all of which would take a while to root-cause and fix, so temporarily skip them after creating the issues. https://github.com/pytorch/pytorch/issues/153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle, ubuntu 20.04 does not work, ubuntu 22.04 works, Amazon Linux 2023 skip - what is Sandcastle OS?) https://github.com/pytorch/pytorch/issues/153122 CUDA context related https://github.com/pytorch/pytorch/issues/153517 NCCL regression, future NCCL may fix it See: https://github.com/pytorch/pytorch/issues/147383 Pull Request resolved: https://github.com/pytorch/pytorch/pull/151594 Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever	2025-05-20 21:56:47 +00:00
Eddie Yan	2b43d635d3	[cuBLAS][cuBLASLt] Use cuBLAS default workspace size in Lt (#153556 ) Also enables unified workspaces by default for non-FBCODE use cases. Default Lt workspace size is also updated to match cuBLAS logic for default, including for Blackwell (SM 10.0) and GeForce Blackwell (SM 12.0). Recommended defaults are documented here: https://docs.nvidia.com/cuda/cublas/#cublassetworkspace Pull Request resolved: https://github.com/pytorch/pytorch/pull/153556 Approved by: https://github.com/Skylion007, https://github.com/ngimel	2025-05-20 21:51:49 +00:00
Yiming Zhou	aeb734f519	[nativert] Move GraphSignature to pytorch core (#152969 ) Summary: Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72 Added an in-memory representation for input and output specs of a graph. The GraphSignature class models the input and output specs of an exported graph produced by torch.export, which holds the graph information deserialized from the pt2 archive package. Runtime relies on the GraphSignature for weight name lookup and weight loading. The serialization schema is defined in torch/_export/serde/schema.py See more at: https://docs.pytorch.org/docs/stable/export.html#torch.export.ExportGraphSignature Test Plan: Added tests under `test/cpp/nativert/test_graph_signature.cpp` Differential Revision: D73895378 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152969 Approved by: https://github.com/swolchok	2025-05-20 21:49:56 +00:00
Jane Xu	8f943046f8	[BE] light cleanups to linter logic (#153965 ) some BE cleanup on other lint things I saw while doing the top of this stack Pull Request resolved: https://github.com/pytorch/pytorch/pull/153965 Approved by: https://github.com/soulitzer	2025-05-20 21:28:48 +00:00
Grace Cheng	deaf6c2f2f	Address the ignored warnings for `-Wmissing-field-initializers` in the file fbcode/caffe2/aten/src/ATen/native/cuda/RowwiseScaledMM.cu (#153958 ) Summary: the error message https://www.internalfb.com/sandcastle/workflow/698057942249983018/artifact/actionlog.698057942382778255.stderr.1?selectedLines=66-66-70-148 from D74892646 When switching the host compiler to Clang, maybe we should only silence these warnings in this file. Test Plan: sandcastle_green Differential Revision: D75029051 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153958 Approved by: https://github.com/Skylion007, https://github.com/eqy	2025-05-20 21:25:56 +00:00
Nikita Shulga	6cb7e4b5a5	[EZ] Update mps xfail reason (#153971 ) cummin is not implemented Pull Request resolved: https://github.com/pytorch/pytorch/pull/153971 Approved by: https://github.com/dcci, https://github.com/jansel ghstack dependencies: #153970	2025-05-20 21:15:14 +00:00
Nikita Shulga	03859242ce	[Testing] Fix `test_deterministic_`... on MPS (#153970 ) By decorating emitted kernels with `'''` rather than `"""`, to match the regex in `torch._inductor.utils.run_and_get_kernels` This fixes `test_deterministic_codegen_mps`, `test_deterministic_codegen_on_graph_break_mps` and `test_deterministic_codegen_with_suffix_mps` Pull Request resolved: https://github.com/pytorch/pytorch/pull/153970 Approved by: https://github.com/dcci, https://github.com/jansel	2025-05-20 21:15:14 +00:00
soulitzer	3aa95b252a	Fix test_side_stream_backward_overlap flakiness (#153963 ) Fixes https://github.com/pytorch/pytorch/issues/153927 Although the autograd backward should always execute SideBw before MainBw, there is still a small chance the recorded events won't be in that order. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153963 Approved by: https://github.com/janeyx99, https://github.com/Skylion007 ghstack dependencies: #151079, #153412	2025-05-20 21:02:56 +00:00
PyTorch MergeBot	500a710422	Revert "Fixed an issue with XPU skip so the test_decompose_mem_bound_mm.py suite can be ran correctly (#153245 )" This reverts commit `2e56ce097a`. Reverted https://github.com/pytorch/pytorch/pull/153245 on behalf of https://github.com/yangw-dev due to tests failed internally [D75078034](https://www.internalfb.com/diff/D75078034) ([comment](https://github.com/pytorch/pytorch/pull/153245#issuecomment-2895785642))	2025-05-20 20:45:55 +00:00
Xu Han	179e7d8624	Fix vs2022 caused AVX512 illegal instruction issue. (#153480 ) Fixes #145702 Add `/d2implyavx512upperregs-` to disable an over-aggressive compiler optimization that caused AVX512 registers to be used on AVX2 machines. Reference: https://github.com/pytorch/pytorch/issues/145702#issuecomment-2874029459 Local test passed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153480 Approved by: https://github.com/Blackhex, https://github.com/cyyever, https://github.com/atalman	2025-05-20 20:37:00 +00:00
Anita Katahoire	996c4d803d	Removing conda references from PyTorch Docs (#152702 ) Addresses #148339 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152702 Approved by: https://github.com/svekars, https://github.com/albanD, https://github.com/atalman	2025-05-20 20:33:28 +00:00
Gantaphon Chalumporn	05bc78e64f	[submodule] Update fbgemm pinned version (#153950 ) Summary: Update fbgemm pinned version in PyTorch. Related update in fbgemm: D74434751 Included changes: Update fbgemm external dependencies directory in setup.py Add DISABLE_FBGEMM_AUTOVEC flag to disable fbgemm's autovec Test Plan: PyTorch OSS CI Differential Revision: D75073516 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153950 Approved by: https://github.com/Skylion007, https://github.com/ngimel	2025-05-20 20:24:27 +00:00
eqy	823a35807c	[CUDA][CUDNN] Dispatch to cuDNN for non-batch-splittable 64-bit NCHW convolutions (#153101 ) For #152816 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153101 Approved by: https://github.com/Skylion007	2025-05-20 20:19:03 +00:00
pbialecki	e8f8baf71f	set CUDA_MODULE_LOADING for older drivers only (#152695 ) `CUDA_MODULE_LOADING=LAZY` is the default for all drivers shipped with CUDA >=12.2 and we should check the driver version before setting the env variable. (the `LOG(WARNING)` has to be removed before merging) Pull Request resolved: https://github.com/pytorch/pytorch/pull/152695 Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/nWEIdia	2025-05-20 19:34:40 +00:00
Joel Schlosser	7587350458	Make python_agnostic cpp extension tests standalone (#153274 ) Related: #148920 This PR: * Introduces a new file `test/cpp_extensions/python_agnostic_extension/test/test_python_agnostic.py` with testing that follows the usual python testing patterns * This replaces the testing for python_agnostic in `test/test_cpp_extensions_aot.py` After this PR, it is now possible to run: ``` python test/cpp_extensions/python_agnostic_extension/test/test_python_agnostic.py ``` and the test will build the prerequisite wheel before running the tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153274 Approved by: https://github.com/janeyx99, https://github.com/cyyever ghstack dependencies: #153264	2025-05-20 19:18:09 +00:00
Joel Schlosser	3ecd444004	Support independent builds for cpp extension tests + apply to libtorch_agnostic tests (#153264 ) Related: #148920 This PR: * Provides a helper `install_cpp_extension(extension_root)` for building C++ extensions. This is intended to be used in `TestMyCppExtension.setUpClass()` * Updates libtorch_agnostic tests to use this * Deletes preexisting libtorch_agnostic tests from `test/test_cpp_extensions_aot.py` * Fixes `run_test.py` to actually run tests in `test/cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py` to avoid losing coverage. This wasn't being run due to logic excluding tests that start with "cpp"; this is fixed now After this PR, it is now possible to run: ``` python test/cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py ``` and the test will build the `libtorch_agnostic` extension before running the tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153264 Approved by: https://github.com/janeyx99	2025-05-20 19:18:09 +00:00
Tsung-Hsien Lee	f1f54c197d	[c10d] Simplify `new_subgroups()` by using `new_subgroups_by_enumeration()` (#153843 ) Summary: The code changes in each file of the diff include removing the `subgroups` and `cur_subgroup` variables, and replacing the while loop with a call to `new_subgroups_by_enumeration()`. Test Plan: contbuild & OSS CI Differential Revision: D75007368 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153843 Approved by: https://github.com/Skylion007, https://github.com/wz337	2025-05-20 19:15:20 +00:00
Bert Maher	2d20106922	[inductor] Support cutlass backend with remote execution (#153844 ) Meta-internal builds need to use RE to build with nvcc, since the trainers do not have nvcc (and its attendant build toolchain) installed. This diff enables building using an RE service (via the same code path used for Triton) Differential Revision: [D74907192](https://our.internmc.facebook.com/intern/diff/D74907192/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153844 Approved by: https://github.com/henrylhtsang	2025-05-20 19:05:23 +00:00
Dan Zimmerman	e0f8174001	[triton][fb] Move build_paths into triton_utils (#153652 ) Summary: TSA, this is just a small cleanup Test Plan: CI Differential Revision: D74835506 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153652 Approved by: https://github.com/Skylion007	2025-05-20 18:59:50 +00:00
Eddie Yan	f9bb7cf72a	[cuBLASLt] relax `addmm` cuBLASLt constraint (#153675 ) `beta == 1.0` doesn't seem to be required anymore https://github.com/pytorch/pytorch/issues/153590 `self.dim() == 1` restriction seems to still hold but not sure if that's due to a lack of handling on the PyTorch side or the cuBLASLt side, will investigate Pull Request resolved: https://github.com/pytorch/pytorch/pull/153675 Approved by: https://github.com/Skylion007	2025-05-20 18:43:38 +00:00
Svetlana Karslioglu	7c9d94e9bb	Redirect mobile_optimizer.rst to executorch (#153664 ) Redirect mobile_optimizer.rst to executorch Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/153664 Approved by: https://github.com/byjlw, https://github.com/malfet	2025-05-20 18:13:45 +00:00
xinan.lin	0087f5f0af	[AOTI][XPU] Embed SPIR-V files into .so (#153924 ) Following the design of #150739, this PR supports embedding kernel SPIR-V files, so AOTI is one step closer to generating a single binary. Fixes #153829 Fixes #153830 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153924 Approved by: https://github.com/desertfire	2025-05-20 17:38:53 +00:00
Yang Wang	335c89c6f1	[Monitoring] enable local logs and add mac test monitoring (#153454 ) Enable running the upload utilization logic using a local pointer instead of reading from S3; this could be useful for ROCm too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153454 Approved by: https://github.com/huydhn	2025-05-20 17:14:40 +00:00
henrylhtsang	b910d37ec6	[cutlass backend] Reduce log level for cutlass runtime error (#153457 ) Want to make sure we always call self.cleanup_run_fn() even if we crash. I think this is the reason why sometimes we get ``` in _dlclose TypeError: 'NoneType' object is not callable ``` Differential Revision: [D74629230](https://our.internmc.facebook.com/intern/diff/D74629230/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153457 Approved by: https://github.com/ColinPeppler	2025-05-20 17:03:17 +00:00
Shivam Raikundalia	6b5b69a468	[Memory Snapshot] Fix RecordFunction Callback Handling (#153839 ) Fixes #153571 Summary: 1. Set annotation callback to global to include all threads 2. Only init callbacks when enable == true and callbacks are empty under mutex 3. When enable == false, check if callbacks are present and if so remove them and set handle to 0 under mutex We don't expect memory snapshots to be called from several different threads (almost always called just from main) but we make sure to add thread safety in the off case that users do want to call it from different points of entry Test Plan: Ran basic snapshot and saw that the callbacks were registered properly Reviewed By: ngimel Differential Revision: D74771491 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153839 Approved by: https://github.com/ngimel, https://github.com/Skylion007	2025-05-20 17:01:00 +00:00
angelayi	ddfaab3b56	[aoti] Reset expr when generating cpp code (#153898 ) Maybe fixes https://github.com/pytorch/pytorch/issues/153896 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153898 Approved by: https://github.com/desertfire	2025-05-20 16:31:25 +00:00
Eddie Yan	5163bf0069	[CUDA][cuBLAS][cuBLASLt] avoid polluting prefer cuBLAS/Lt setting across tests (#153655 ) Some tests may not set the preferred backend, which leads to unexpected behavior when multiple tests are run vs. standalone Tests that should exercise both backends should explicitly parametrize this setting Pull Request resolved: https://github.com/pytorch/pytorch/pull/153655 Approved by: https://github.com/ngimel	2025-05-20 16:18:35 +00:00
PaulZhang12	a7c01d7f13	[Inductor] Subgraph check output strides (#153755 ) Make sure the output strides of the subgraph are consistent with the original gm. Without checking strides, it was possible for the subgraph to produce NaNs with a reinterpret tensor on the subgraph output, which itself was not contiguous. Differential Revision: [D74691119](https://our.internmc.facebook.com/intern/diff/D74691119/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153755 Approved by: https://github.com/eellison ghstack dependencies: #153754	2025-05-20 16:07:18 +00:00
PaulZhang12	63e5d46478	[Inductor] Subgraph support dynamic input expressions (#153754 ) Support subgraph choice taking in inputs that have dynamic dimensions. Testing with decomposeK subgraph decomp Differential Revision: [D74484741](https://our.internmc.facebook.com/intern/diff/D74484741/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153754 Approved by: https://github.com/eellison	2025-05-20 16:07:18 +00:00
iupaikov-amd	2e56ce097a	Fixed an issue with XPU skip so the test_decompose_mem_bound_mm.py suite can be run correctly (#153245 ) Fixes #153239 Replaced the custom decorator with the common one, although the better way to skip the whole suite would be to add it to the skip list in run_test.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153245 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/jeffdaily	2025-05-20 15:46:21 +00:00
Eddie Yan	ef958fa152	[cuDNN][cuDNN frontend] upgrade cuDNN frontend submodule to 1.12 (#153888 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153888 Approved by: https://github.com/Skylion007	2025-05-20 15:08:37 +00:00
PyTorch MergeBot	3102ae6798	Revert "[AOTI] Add an option to specify custom op C shim (#153851 )" This reverts commit `365ac49840`. Reverted https://github.com/pytorch/pytorch/pull/153851 on behalf of https://github.com/malfet due to Looks like it broke fuzzer test, but I could be wrong, see `c4d1ff02f8/1` ([comment](https://github.com/pytorch/pytorch/pull/153851#issuecomment-2894619773))	2025-05-20 14:23:50 +00:00
Nikita Shulga	c4d1ff02f8	[Lint] Update clang-format to 19.1.4 (#153889 ) All changes other than the one to `tools/linter/adapters/s3_init_config.json` are generated by newer clang-format Pull Request resolved: https://github.com/pytorch/pytorch/pull/153889 Approved by: https://github.com/cyyever, https://github.com/atalman	2025-05-20 14:12:46 +00:00
Aaron Gokaslan	d869ea11e0	[BE]: Update fmtlib submodule to 11.2.0 (#153853 ) Update fmtlib to 11.2.0 with a lot of miscellaneous fixes for various compilers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153853 Approved by: https://github.com/malfet	2025-05-20 14:11:18 +00:00
James Wu	4b759d98f8	Recheck autotune cache on static cuda launcher load (#153565 ) When loading statically launchable triton kernels from FxGraphCache, since we don't instantiate a CachingAutotuner like we do normally, we need to recheck the autotune cache based on the existing compile results. If we get a hit, we take the compile result whose config matches the best config. Sometimes, the best config will have come from coordinate descent tuning. In this case, FxGraphCache today does not cache the resulting triton kernel, either with or without the static cuda launcher. This is because coordinate descent tuning happens at runtime, and the best config may not be one of the precompiled configs. Test Plan: New unit test that failed before. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153565 Approved by: https://github.com/aorenste	2025-05-20 14:00:43 +00:00
Michael Lazos	d68d4d31f4	[Cutlass] EVT tests update (#153926 ) Fixes internal EVT tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/153926 Approved by: https://github.com/williamwen42	2025-05-20 10:03:10 +00:00
Michael Lazos	d44074f01a	[Dynamo] Fix einops regression (#153925 ) Fixes https://github.com/pytorch/pytorch/issues/153476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153925 Approved by: https://github.com/williamwen42	2025-05-20 09:52:42 +00:00
Panagiotis Kourdis	44f19c7179	Record the XPU and XCCL build settings in the compiled binary (#147161 ) Fixes #ISSUE_NUMBER Currently the XPU and XCCL build settings are not recorded in the compiled binary and are not shown using the `torch.__config__.show()` which is a quick way to check if the binary has been built with such support. Below is the output adding them (see end of last line): ``` Python 3.12.8 \| packaged by conda-forge \| (main, Dec 5 2024, 14:24:40) [GCC 13.3.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import torch >>> print(torch.__config__.show()) PyTorch built with: - GCC 13.3 - C++ Version: 201703 - Intel(R) oneAPI Math Kernel Library Version 2025.1-Product Build 20250203 for Intel(R) 64 architecture applications - Intel(R) MKL-DNN v3.5.3 (Git Hash 66f0cb9eb66affd2da3bf5f8d897376f04aae6af) - OpenMP 201511 (a.k.a. OpenMP 4.5) - LAPACK is enabled (usually provided by MKL) - CPU capability usage: AVX512 XPU backend - Build settings: BLAS_INFO=mkl, BUILD_TYPE=RelWithDebInfo, COMMIT_SHA=43eb39d7c832b5560f7bfa8d29cc7919ac21c0ca, CXX_COMPILER=/home/pkourdis/compilers/gcc-13.3.0/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=OFF -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-dangling-reference -Wno-error=dangling-reference -Wno-error=redundant-move -DUSE_XPU -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, 
PERF_WITH_AVX2=1, TORCH_VERSION=2.7.0, USE_CUDA=0, USE_CUDNN=OFF, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=1, USE_MPI=0, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=0, USE_ROCM_KERNEL_ASSERT=OFF, USE_XCCL=1, USE_XPU=1, ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147161 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>	2025-05-20 09:21:39 +00:00
PyTorch MergeBot	1075bb37d3	Revert "Fix fake tensor caching when output has unbacked (#153034 )" This reverts commit `cb5f31a4a1`. Reverted https://github.com/pytorch/pytorch/pull/153034 on behalf of https://github.com/malfet due to Seems to have introduced flakiness in MacOS inductor tests, see https://github.com/pytorch/pytorch/issues/153891 ([comment](https://github.com/pytorch/pytorch/pull/153034#issuecomment-2893059329))	2025-05-20 06:02:38 +00:00
PyTorch MergeBot	9849c79fa2	Revert "FakeTensorMode dispatch shouldn't include bypass in exception context (#153780 )" This reverts commit `aa84c037f0`. Reverted https://github.com/pytorch/pytorch/pull/153780 on behalf of https://github.com/malfet due to Reverting to clearly revert https://github.com/pytorch/pytorch/pull/153034, that seems to have introduced flakiness in MacOS inductor tests, see https://github.com/pytorch/pytorch/issues/153891 ([comment](https://github.com/pytorch/pytorch/pull/153780#issuecomment-2893053304))	2025-05-20 05:59:42 +00:00
Bin Bao	365ac49840	[AOTI] Add an option to specify custom op C shim (#153851 ) Summary: Add an option to tell AOTInductor codegen to generate C shim functions for certain custom ops instead of relying on ProxyExecutor. The lib that defines the custom ops needs to implement the corresponding C shim functions. Differential Revision: [D75014177](https://our.internmc.facebook.com/intern/diff/D75014177) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153851 Approved by: https://github.com/hl475	2025-05-20 05:12:09 +00:00
Sidharth	89ebd29fdc	[Dynamo] added warning message for tracing lru_cache wrapped functions (#153744 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153744 Approved by: https://github.com/williamwen42	2025-05-20 04:08:29 +00:00
angelayi	5ef90e14a3	[export] Remove unused constants (#153800 ) An internal test case ran into a weird issue when exporting, where the model imported a file which creates tensor constants upon importing [(code ptr)](https://fburl.com/code/xwmhxm7n). This causes the tracer to create some tensor constants even though they are not used in the model code. This PR updates the lift_constant_tensors pass to remove constant nodes that are not being used instead of lifting them as tensor constants. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153800 Approved by: https://github.com/dolpm, https://github.com/pianpwk	2025-05-20 03:15:27 +00:00
Yanli Zhao	a79e621c1c	[DDP] rebuilt bucket order when find_unused_parameters=true (#153404 ) Differential Revision: D72437251 Enable rebuilding the bucket order when find_unused_parameters=true. This should always be at least as good as not rebuilding the bucket order when find_unused_parameters=True: 1. When the bucket order in the first iteration is the same as the parameter order, rebuilding the bucket order does not change anything. 2. When the bucket order in the first iteration is not the same as the parameter order, there are two cases: a. The bucket order does not change after the 1st iteration even though the graph is dynamic and there are unused parameters; in this case, rebuilding the bucket order gives a performance gain. b. The bucket order changes after the 1st iteration due to the dynamic graph; in this case, neither the parameter order nor the bucket order in the 1st iteration is ideal, so rebuilding the bucket order or not does not matter. Rebuilding the bucket order when find_unused_parameters=true therefore helps case 2.a and does not hurt cases 1 and 2.b. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153404 Approved by: https://github.com/rohan-varma, https://github.com/fegin	2025-05-20 02:45:01 +00:00
Nikita Shulga	8b94d30b26	[Testing] Benchmark more tests for MPSInductor (#153897 ) And report HF tests as HF tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/153897 Approved by: https://github.com/dcci	2025-05-20 02:41:38 +00:00
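
Commit 44f19c7179 (#147161) above records the XPU and XCCL build settings so that they appear in the output of `torch.__config__.show()`. As a rough illustration of how such output could be checked programmatically, here is a minimal sketch that pulls the `KEY=VALUE` pairs out of the "Build settings:" line. The helper name `parse_build_settings` and the shortened sample text are assumptions made for this example; they are not part of PyTorch, and the parsing relies on the ", " separator seen in the output quoted in that commit message.

```python
import re


def parse_build_settings(config_text: str) -> dict[str, str]:
    """Extract KEY=VALUE pairs from the 'Build settings:' line of the text.

    Hypothetical helper: assumes entries are separated by ", " and that keys
    are upper-case identifiers, as in the output quoted in the commit message.
    """
    marker = "Build settings:"
    start = config_text.find(marker)
    if start == -1:
        return {}
    # Only look at the remainder of the line that holds the build settings.
    settings_line = config_text[start + len(marker):].split("\n", 1)[0]
    settings: dict[str, str] = {}
    for entry in settings_line.split(", "):
        # Split on the first "=" only; values such as CXX_FLAGS contain "=" too.
        key, sep, value = entry.strip().partition("=")
        if sep and re.fullmatch(r"[A-Z0-9_]+", key):
            settings[key] = value.strip().rstrip(",")
    return settings


# Shortened sample modeled on the output quoted in the commit message.
sample = (
    "PyTorch built with:\n"
    "  - GCC 13.3\n"
    "  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=RelWithDebInfo, "
    "CXX_FLAGS= -O2 -fPIC -DUSE_XPU, USE_CUDA=0, USE_XCCL=1, USE_XPU=1, \n"
)
settings = parse_build_settings(sample)
print(settings.get("USE_XPU"), settings.get("USE_XCCL"))
```

With PyTorch installed, the same helper could presumably be applied to the real string via `parse_build_settings(torch.__config__.show())`; whether `USE_XPU` and `USE_XCCL` appear depends on the build containing this change.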


88238 Commits