Commit Graph

88238 Commits

Author SHA1 Message Date
PyTorch MergeBot
7b7604fdb4 Revert "[inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)"
This reverts commit 0c04492e3b.

Reverted https://github.com/pytorch/pytorch/pull/153335 on behalf of https://github.com/malfet due to Breaks lint, see 3742b7fb3a/1 ([comment](https://github.com/pytorch/pytorch/pull/153335#issuecomment-2896031661))
2025-05-20 23:12:11 +00:00
Slawomir Siwek
3742b7fb3a Treat dim=[] same as dim=None (#153570)
Fixes https://github.com/pytorch/pytorch/issues/153568
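A minimal illustration of the intended semantics (a sketch assuming `torch.sum` as the reduction; the exact op set touched by the PR may differ):
```
import torch

x = torch.randn(2, 3)
# After this fix, an empty dim list reduces over all dims, matching dim=None.
assert torch.allclose(x.sum(), x.sum(dim=[]))
```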

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153570
Approved by: https://github.com/ngimel
2025-05-20 22:44:29 +00:00
albanD
f7b8eadd9d Add codeowner for merge rules (#152354)
To ensure changes to merge rights are properly reviewed
Also make the codeowner file valid by removing invalid users
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152354
Approved by: https://github.com/malfet
2025-05-20 22:24:23 +00:00
henrylhtsang
0c04492e3b [inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)
Motivation:
By default, we tune the cutlass backend kernels on 3 swizzles. Swizzles are runtime params, so all swizzle variants share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
The winner of the full {configs} x {swizzles} autotuning is the same as the winner of a greedy search: first find the top X {configs} with swizzle 2 (hardcoded), then autotune over {top X winner configs} x {swizzles}. In other words, a greedy algorithm can reduce autotuning time.

I attach the logs below. The result depends somewhat on what X is, but empirically a value of 5-10 works well.
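A schematic of the greedy two-stage search described above (hypothetical names; the real implementation lives in the inductor cutlass backend):
```
def prescreen_autotune(configs, swizzles, benchmark, top_x=8):
    """Greedy two-stage autotune. `benchmark(config, swizzle)` is assumed
    to return a runtime in ms; `top_x` is the prescreening cutoff."""
    # Stage 1 (prescreening): rank every config at the hardcoded swizzle=2.
    finalists = sorted(configs, key=lambda cfg: benchmark(cfg, 2))[:top_x]
    # Stage 2: exhaustively autotune only {top X configs} x {swizzles}.
    return min(
        ((cfg, sw) for cfg in finalists for sw in swizzles),
        key=lambda pair: benchmark(pair[0], pair[1]),
    )
```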

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

With prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153335
Approved by: https://github.com/eellison
2025-05-20 22:19:02 +00:00
Bin Bao
2c2524f74b [AOTI] Generate unique cubin file names when package_cpp_only (#153948)
Summary:
* When package_cpp_only is specified, generate kernel files with unique kernel names to make the final packaged files more readable. Assert on unique_kernel_names in case it was somehow explicitly set to False.
* Fix a rocm test skip, see https://github.com/pytorch/pytorch/pull/153828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153948
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2025-05-20 22:07:53 +00:00
Wei Wang
8cabd23b3d [CI][CUDA] Move cu118 distributed pull jobs to cu126, move cu124-sm75 to cu126-sm75 (#151594)
This PR moves the distributed CUDA CI jobs from CUDA 11.8 to CUDA 12.6.
The move exposed a few unit test failures, some if not all of which would take a while to root-cause and fix, so they are temporarily skipped after filing the issues below.

https://github.com/pytorch/pytorch/issues/153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle, ubuntu 20.04 does not work, ubuntu 22.04 works, Amazon Linux 2023 skip - what is Sandcastle OS?)
https://github.com/pytorch/pytorch/issues/153122 CUDA context related
https://github.com/pytorch/pytorch/issues/153517  NCCL regression, future NCCL may fix it

See: https://github.com/pytorch/pytorch/issues/147383

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever
2025-05-20 21:56:47 +00:00
Eddie Yan
2b43d635d3 [cuBLAS][cuBLASLt] Use cuBLAS default workspace size in Lt (#153556)
Also enables unified workspaces by default for non-FBCODE use cases.
The default Lt workspace size is also updated to match the cuBLAS default logic, including for Blackwell (SM 10.0) and GeForce Blackwell (SM 12.0).

Recommended defaults are documented here:
https://docs.nvidia.com/cuda/cublas/#cublassetworkspace
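For reference, the effective workspace sizes can still be overridden through environment variables that PyTorch reads at CUDA initialization (an illustrative sketch; the values below are examples, not recommendations):
```
import os

# Must be set before the first cuBLAS handle is created in the process.
os.environ["CUBLASLT_WORKSPACE_SIZE"] = "1024"     # Lt workspace, in KiB
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # cuBLAS workspace config

import torch  # noqa: E402  (import after the env vars take effect)
```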

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153556
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-05-20 21:51:49 +00:00
Yiming Zhou
aeb734f519 [nativert] Move GraphSignature to pytorch core (#152969)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

Added an in-memory representation for input and output specs of a graph. The GraphSignature class models the input and output specs of an exported graph produced by torch.export, which holds the graph information deserialized from the pt2 archive package. Runtime relies on the GraphSignature for weight name lookup and weight loading.

The serialization schema is defined in torch/_export/serde/schema.py
See more at: https://docs.pytorch.org/docs/stable/export.html#torch.export.ExportGraphSignature
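The Python-side spec that this C++ representation mirrors can be inspected via torch.export (illustrative):
```
import torch
from torch.export import export

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.linear(x)

ep = export(M(), (torch.randn(2, 4),))
# Input/output specs drive weight-name lookup and weight loading,
# mirroring what the C++ GraphSignature deserializes from the pt2 archive.
print(ep.graph_signature.input_specs)
print(ep.graph_signature.output_specs)
```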

Test Plan: Added tests under `test/cpp/nativert/test_graph_signature.cpp`

Differential Revision: D73895378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152969
Approved by: https://github.com/swolchok
2025-05-20 21:49:56 +00:00
Jane Xu
8f943046f8 [BE] light cleanups to linter logic (#153965)
Some BE cleanup on other lint things I saw while working on the top of this stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153965
Approved by: https://github.com/soulitzer
2025-05-20 21:28:48 +00:00
Grace Cheng
deaf6c2f2f Address the ignored warnings for -Wmissing-field-initializers in the file fbcode/caffe2/aten/src/ATen/native/cuda/RowwiseScaledMM.cu (#153958)
Summary:
the error message  https://www.internalfb.com/sandcastle/workflow/698057942249983018/artifact/actionlog.698057942382778255.stderr.1?selectedLines=66-66-70-148 from D74892646

When switching the host compiler to Clang, maybe we should only silence these warnings in this file.

Test Plan: sandcastle_green

Differential Revision: D75029051

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153958
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-05-20 21:25:56 +00:00
Nikita Shulga
6cb7e4b5a5 [EZ] Update mps xfail reason (#153971)
cummin is not implemented

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153971
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #153970
2025-05-20 21:15:14 +00:00
Nikita Shulga
03859242ce [Testing] Fix test_deterministic_... on MPS (#153970)
By decorating emitted kernels with `'''` rather than `"""`

To match regex in `torch._inductor.utils.run_and_get_kernels`
This fixes `test_deterministic_codegen_mps`, `test_deterministic_codegen_on_graph_break_mps` and `test_deterministic_codegen_with_suffix_mps`
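A toy illustration of the mismatch (the pattern below is hypothetical; the real regex lives in `torch._inductor.utils.run_and_get_kernels`):
```
import re

# If the extraction pattern only recognizes '''-fenced kernel sources,
# kernels emitted with """-fences are silently skipped.
KERNEL_RE = re.compile(r"'''(.*?)'''", re.DOTALL)

matched = "src = '''@triton.jit ...'''"
missed = 'src = """@triton.jit ..."""'
print(KERNEL_RE.findall(matched))  # one match
print(KERNEL_RE.findall(missed))   # [] -- the pre-fix failure mode
```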

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153970
Approved by: https://github.com/dcci, https://github.com/jansel
2025-05-20 21:15:14 +00:00
soulitzer
3aa95b252a Fix test_side_stream_backward_overlap flakiness (#153963)
Fixes https://github.com/pytorch/pytorch/issues/153927

Although the autograd backward should always execute SideBw before MainBw, there is still a small chance the recorded events won't be in that order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153963
Approved by: https://github.com/janeyx99, https://github.com/Skylion007
ghstack dependencies: #151079, #153412
2025-05-20 21:02:56 +00:00
PyTorch MergeBot
500a710422 Revert "Fixed an issue with XPU skip so the test_decompose_mem_bound_mm.py suite can be ran correctly (#153245)"
This reverts commit 2e56ce097a.

Reverted https://github.com/pytorch/pytorch/pull/153245 on behalf of https://github.com/yangw-dev due to tests failed internally [D75078034](https://www.internalfb.com/diff/D75078034) ([comment](https://github.com/pytorch/pytorch/pull/153245#issuecomment-2895785642))
2025-05-20 20:45:55 +00:00
Xu Han
179e7d8624 Fix vs2022 caused AVX512 illegal instruction issue. (#153480)
Fixes #145702

Add `/d2implyavx512upperregs-` to disable an over-aggressive compiler optimization that emitted AVX512 registers on AVX2 machines.

Reference to: https://github.com/pytorch/pytorch/issues/145702#issuecomment-2874029459

Local test passed:
<img width="1208" alt="image" src="https://github.com/user-attachments/assets/26f4cb91-6bb5-416f-aa35-c899eb1489b2" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153480
Approved by: https://github.com/Blackhex, https://github.com/cyyever, https://github.com/atalman
2025-05-20 20:37:00 +00:00
Anita Katahoire
996c4d803d Removing conda references from PyTorch Docs (#152702)
Addresses #148339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152702
Approved by: https://github.com/svekars, https://github.com/albanD, https://github.com/atalman
2025-05-20 20:33:28 +00:00
Gantaphon Chalumporn
05bc78e64f [submodule] Update fbgemm pinned version (#153950)
Summary:
Update the fbgemm pinned version in PyTorch.
Related update in fbgemm: D74434751

Included changes:
* Update the fbgemm external dependencies directory in setup.py
* Add the DISABLE_FBGEMM_AUTOVEC flag to disable fbgemm's autovec

Test Plan: PyTorch OSS CI

Differential Revision: D75073516

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153950
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-05-20 20:24:27 +00:00
eqy
823a35807c [CUDA][CUDNN] Dispatch to cuDNN for non-batch-splittable 64-bit NCHW convolutions (#153101)
For #152816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153101
Approved by: https://github.com/Skylion007
2025-05-20 20:19:03 +00:00
pbialecki
e8f8baf71f set CUDA_MODULE_LOADING for older drivers only (#152695)
`CUDA_MODULE_LOADING=LAZY` is the default for all drivers shipped with CUDA >=12.2, so we should check the driver version before setting the env variable.

(the `LOG(WARNING)` has to be removed before merging)
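A sketch of the intended gating (illustrative only; the real check is in C++, and the driver-version encoding below is an assumption):
```
import os

def maybe_force_lazy_loading(driver_version: int) -> None:
    # Only set the env var on drivers older than the CUDA 12.2 era;
    # newer drivers already default to lazy module loading.
    # The integer encoding (e.g. 12020 for 12.2) is hypothetical.
    if driver_version < 12020 and "CUDA_MODULE_LOADING" not in os.environ:
        os.environ["CUDA_MODULE_LOADING"] = "LAZY"
```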

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152695
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/nWEIdia
2025-05-20 19:34:40 +00:00
Joel Schlosser
7587350458 Make python_agnostic cpp extension tests standalone (#153274)
Related: #148920

This PR:
* Introduces a new file `test/cpp_extensions/python_agnostic_extension/test/test_python_agnostic.py` with testing that follows the usual python testing patterns
    * This replaces the testing for python_agnostic in `test/test_cpp_extensions_aot.py`

After this PR, it is now possible to run:
```
python test/cpp_extensions/python_agnostic_extension/test/test_python_agnostic.py
```

and the test will build the prerequisite wheel before running the tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153274
Approved by: https://github.com/janeyx99, https://github.com/cyyever
ghstack dependencies: #153264
2025-05-20 19:18:09 +00:00
Joel Schlosser
3ecd444004 Support independent builds for cpp extension tests + apply to libtorch_agnostic tests (#153264)
Related: #148920

This PR:
* Provides a helper `install_cpp_extension(extension_root)` for building C++ extensions. This is intended to be used in `TestMyCppExtension.setUpClass()` (see the sketch after this list)
    * Updates libtorch_agnostic tests to use this
* Deletes preexisting libtorch_agnostic tests from `test/test_cpp_extensions_aot.py`
    * Fixes `run_test.py` to actually run tests in `test/cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py` to avoid losing coverage. This wasn't being run due to logic excluding tests that start with "cpp"; this is fixed now
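A hedged sketch of the `setUpClass()` pattern described above (the helper's import path is an assumption for illustration; the PR only specifies its name):
```
import unittest
from pathlib import Path

class TestMyCppExtension(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Import path is assumed, not confirmed by the PR text.
        from torch.testing._internal.common_utils import install_cpp_extension
        # Build the extension under test once, before any test runs.
        install_cpp_extension(extension_root=Path(__file__).parent.parent)
```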

After this PR, it is now possible to run:
```
python test/cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py
```

and the test will build the `libtorch_agnostic` extension before running the tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153264
Approved by: https://github.com/janeyx99
2025-05-20 19:18:09 +00:00
Tsung-Hsien Lee
f1f54c197d [c10d] Simplify new_subgroups() by using new_subgroups_by_enumeration() (#153843)
Summary: The code changes in each file of the diff include removing the `subgroups` and `cur_subgroup` variables, and replacing the while loop with a call to `new_subgroups_by_enumeration()`.
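An illustrative equivalence on an 8-rank job (assumes `dist.init_process_group(...)` already ran with world_size=8):
```
import torch.distributed as dist

# new_subgroups(group_size=2) now builds its rank lists and defers to the
# enumeration variant; the two calls below create the same subgroups.
my_group_a, all_groups_a = dist.new_subgroups(group_size=2)
my_group_b, all_groups_b = dist.new_subgroups_by_enumeration(
    [[0, 1], [2, 3], [4, 5], [6, 7]]
)
```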

Test Plan: contbuild & OSS CI

Differential Revision: D75007368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153843
Approved by: https://github.com/Skylion007, https://github.com/wz337
2025-05-20 19:15:20 +00:00
Bert Maher
2d20106922 [inductor] Support cutlass backend with remote execution (#153844)
Meta-internal builds need to use RE to build with nvcc, since the
trainers do not have nvcc (and its attendant build toolchain) installed.

This diff enables building using an RE service (via the same code path used for
Triton).

Differential Revision: [D74907192](https://our.internmc.facebook.com/intern/diff/D74907192/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153844
Approved by: https://github.com/henrylhtsang
2025-05-20 19:05:23 +00:00
Dan Zimmerman
e0f8174001 [triton][fb] Move build_paths into triton_utils (#153652)
Summary: TSA, this is just a small cleanup

Test Plan: CI

Differential Revision: D74835506

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153652
Approved by: https://github.com/Skylion007
2025-05-20 18:59:50 +00:00
Eddie Yan
f9bb7cf72a [cuBLASLt] relax addmm cuBLASLt constraint (#153675)
`beta == 1.0` doesn't seem to be required anymore

https://github.com/pytorch/pytorch/issues/153590

`self.dim() == 1` restriction seems to still hold but not sure if that's due to a lack of handling on the PyTorch side or the cuBLASLt side, will investigate
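An illustrative call that the relaxed check now makes Lt-eligible (dispatch details are internal; shapes and dtypes here are arbitrary):
```
import torch

bias = torch.randn(16, device="cuda", dtype=torch.bfloat16)  # 1-D, per the note above
a = torch.randn(32, 64, device="cuda", dtype=torch.bfloat16)
b = torch.randn(64, 16, device="cuda", dtype=torch.bfloat16)
# beta != 1.0 no longer disqualifies the cuBLASLt path.
out = torch.addmm(bias, a, b, beta=0.5, alpha=2.0)
```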

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153675
Approved by: https://github.com/Skylion007
2025-05-20 18:43:38 +00:00
Svetlana Karslioglu
7c9d94e9bb Redirect mobile_optimizer.rst to executorch (#153664)
Redirect mobile_optimizer.rst to executorch

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153664
Approved by: https://github.com/byjlw, https://github.com/malfet
2025-05-20 18:13:45 +00:00
xinan.lin
0087f5f0af [AOTI][XPU] Embed SPIR-V files into .so (#153924)
Following the design of #150739, this PR supports embedding kernel SPIR-V files so AOTI is one step closer to generating a single binary.
Fixes #153829
Fixes #153830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153924
Approved by: https://github.com/desertfire
2025-05-20 17:38:53 +00:00
Yang Wang
335c89c6f1 [Monitoring] enable local logs and add mac test monitoring (#153454)
Enable running the upload utilization logic using a local pointer instead of reading from S3; this could be useful for ROCm too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153454
Approved by: https://github.com/huydhn
2025-05-20 17:14:40 +00:00
henrylhtsang
b910d37ec6 [cutlass backend] Reduce log level for cutlass runtime error (#153457)
Want to make sure we always call self.cleanup_run_fn() even if we crash.

I think this is the reason why sometimes we get
```
in _dlclose
TypeError: 'NoneType' object is not callable
```

Differential Revision: [D74629230](https://our.internmc.facebook.com/intern/diff/D74629230/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153457
Approved by: https://github.com/ColinPeppler
2025-05-20 17:03:17 +00:00
Shivam Raikundalia
6b5b69a468 [Memory Snapshot] Fix RecordFunction Callback Handling (#153839)
Fixes #153571
Summary:
1. Set annotation callback to global to include all threads
2. Only init callbacks when enable == true and callbacks are empty under mutex
3. When enable == false, check if callbacks are present and if so remove them and set handle to 0 under mutex

We don't expect memory snapshots to be called from several different threads (they are almost always called just from main), but we add thread safety on the off chance that users do call the API from different points of entry.

Test Plan: Ran basic snapshot and saw that the callbacks were registered properly

Reviewed By: ngimel

Differential Revision: D74771491

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153839
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2025-05-20 17:01:00 +00:00
angelayi
ddfaab3b56 [aoti] Reset expr when generating cpp code (#153898)
Maybe fixes https://github.com/pytorch/pytorch/issues/153896

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153898
Approved by: https://github.com/desertfire
2025-05-20 16:31:25 +00:00
Eddie Yan
5163bf0069 [CUDA][cuBLAS][cuBLASLt] avoid polluting prefer cuBLAS/Lt setting across tests (#153655)
Some tests may not set the preferred backend, which leads to unexpected behavior when multiple tests are run together vs. standalone.

Tests that should exercise both backends should explicitly parametrize this setting
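One way a test can set and restore the preference explicitly (a sketch using the public knob):
```
import torch

# Save the ambient preference, exercise one backend, then restore it so
# later tests are not polluted.
prev = torch.backends.cuda.preferred_blas_library()
try:
    torch.backends.cuda.preferred_blas_library("cublaslt")
    # ... assertions that exercise the Lt path ...
finally:
    torch.backends.cuda.preferred_blas_library(prev)
```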

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153655
Approved by: https://github.com/ngimel
2025-05-20 16:18:35 +00:00
PaulZhang12
a7c01d7f13 [Inductor] Subgraph check output strides (#153755)
Make sure the output strides of the subgraph are consistent with the original gm. Without checking strides, the subgraph could produce NaNs via a reinterpret tensor on the subgraph's output when that output was not contiguous.

Differential Revision: [D74691119](https://our.internmc.facebook.com/intern/diff/D74691119/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153755
Approved by: https://github.com/eellison
ghstack dependencies: #153754
2025-05-20 16:07:18 +00:00
PaulZhang12
63e5d46478 [Inductor] Subgraph support dynamic input expressions (#153754)
Support subgraph choices taking inputs that have dynamic dimensions. Tested with the decomposeK subgraph decomposition.

Differential Revision: [D74484741](https://our.internmc.facebook.com/intern/diff/D74484741/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153754
Approved by: https://github.com/eellison
2025-05-20 16:07:18 +00:00
iupaikov-amd
2e56ce097a Fixed an issue with XPU skip so the test_decompose_mem_bound_mm.py suite can be run correctly (#153245)
Fixes #153239

Replaced the custom decorator with the common one, although a better way to skip the whole suite would be to add it to the skip list in run_test.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153245
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/jeffdaily
2025-05-20 15:46:21 +00:00
Eddie Yan
ef958fa152 [cuDNN][cuDNN frontend] upgrade cuDNN frontend submodule to 1.12 (#153888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153888
Approved by: https://github.com/Skylion007
2025-05-20 15:08:37 +00:00
PyTorch MergeBot
3102ae6798 Revert "[AOTI] Add an option to specify custom op C shim (#153851)"
This reverts commit 365ac49840.

Reverted https://github.com/pytorch/pytorch/pull/153851 on behalf of https://github.com/malfet due to Looks like it broke fuzzer test, but I could be wrong, see c4d1ff02f8/1 ([comment](https://github.com/pytorch/pytorch/pull/153851#issuecomment-2894619773))
2025-05-20 14:23:50 +00:00
Nikita Shulga
c4d1ff02f8 [Lint] Update clang-format to 19.1.4 (#153889)
All changes other than the one to `tools/linter/adapters/s3_init_config.json` are generated by newer clang-format
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153889
Approved by: https://github.com/cyyever, https://github.com/atalman
2025-05-20 14:12:46 +00:00
Aaron Gokaslan
d869ea11e0 [BE]: Update fmtlib submodule to 11.2.0 (#153853)
Update fmtlib to 11.2.0 with a lot of miscellaneous fixes for various compilers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153853
Approved by: https://github.com/malfet
2025-05-20 14:11:18 +00:00
James Wu
4b759d98f8 Recheck autotune cache on static cuda launcher load (#153565)
When loading statically launchable triton kernels from FxGraphCache, since we don't instantiate a CachingAutotuner like we do normally, we need to recheck the autotune cache based on the existing compile results. If we get a hit, we take the compile result whose config matches the best config.

Sometimes the best config will have come from coordinate descent tuning. In this case, FxGraphCache today does not cache the resulting triton kernel, with or without the static cuda launcher, because coordinate descent tuning happens at runtime and the best config may not be one of the precompiled configs.

Test Plan:
New unit test that failed before

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153565
Approved by: https://github.com/aorenste
2025-05-20 14:00:43 +00:00
Michael Lazos
d68d4d31f4 [Cutlass] EVT tests update (#153926)
Fixes internal EVT tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153926
Approved by: https://github.com/williamwen42
2025-05-20 10:03:10 +00:00
Michael Lazos
d44074f01a [Dynamo] Fix einops regression (#153925)
Fixes https://github.com/pytorch/pytorch/issues/153476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153925
Approved by: https://github.com/williamwen42
2025-05-20 09:52:42 +00:00
Panagiotis Kourdis
44f19c7179 Record the XPU and XCCL build settings in the compiled binary (#147161)
Fixes #ISSUE_NUMBER

Currently the XPU and XCCL build settings are not recorded in the compiled binary and are not shown by `torch.__config__.show()`, which is a quick way to check if the binary has been built with such support.

Below is the output after adding them (see the end of the last line):

```
Python 3.12.8 | packaged by conda-forge | (main, Dec  5 2024, 14:24:40) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__config__.show())
PyTorch built with:
  - GCC 13.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2025.1-Product Build 20250203 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.5.3 (Git Hash 66f0cb9eb66affd2da3bf5f8d897376f04aae6af)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX512
XPU backend  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=RelWithDebInfo, COMMIT_SHA=43eb39d7c832b5560f7bfa8d29cc7919ac21c0ca, CXX_COMPILER=/home/pkourdis/compilers/gcc-13.3.0/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=OFF -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-dangling-reference -Wno-error=dangling-reference -Wno-error=redundant-move -DUSE_XPU -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.7.0, USE_CUDA=0, USE_CUDNN=OFF, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=1, USE_MPI=0, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=0, USE_ROCM_KERNEL_ASSERT=OFF, USE_XCCL=1, USE_XPU=1,
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147161
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-05-20 09:21:39 +00:00
PyTorch MergeBot
1075bb37d3 Revert "Fix fake tensor caching when output has unbacked (#153034)"
This reverts commit cb5f31a4a1.

Reverted https://github.com/pytorch/pytorch/pull/153034 on behalf of https://github.com/malfet due to Seems to have introduced flakiness in MacOS inductor tests, see https://github.com/pytorch/pytorch/issues/153891 ([comment](https://github.com/pytorch/pytorch/pull/153034#issuecomment-2893059329))
2025-05-20 06:02:38 +00:00
PyTorch MergeBot
9849c79fa2 Revert "FakeTensorMode dispatch shouldn't include bypass in exception context (#153780)"
This reverts commit aa84c037f0.

Reverted https://github.com/pytorch/pytorch/pull/153780 on behalf of https://github.com/malfet due to Reverting to clearly revert https://github.com/pytorch/pytorch/pull/153034, that seems to have introduced flakiness in MacOS inductor tests, see https://github.com/pytorch/pytorch/issues/153891 ([comment](https://github.com/pytorch/pytorch/pull/153780#issuecomment-2893053304))
2025-05-20 05:59:42 +00:00
Bin Bao
365ac49840 [AOTI] Add an option to specify custom op C shim (#153851)
Summary: Add an option to tell AOTInductor codegen to generate C shim functions for certain custom ops instead of relying on ProxyExecutor. The lib that defines the custom ops needs to implement the corresponding C shim functions.

Differential Revision: [D75014177](https://our.internmc.facebook.com/intern/diff/D75014177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153851
Approved by: https://github.com/hl475
2025-05-20 05:12:09 +00:00
Sidharth
89ebd29fdc [Dynamo] added warning message for tracing lru_cache wrapped functions (#153744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153744
Approved by: https://github.com/williamwen42
2025-05-20 04:08:29 +00:00
angelayi
5ef90e14a3 [export] Remove unused constants (#153800)
An internal test case ran into a weird issue when exporting: the model imported a file which creates tensor constants upon import [(code ptr)](https://fburl.com/code/xwmhxm7n). This causes the tracer to create tensor constants even though they are not used in the model code. This PR updates the lift_constant_tensors pass to remove unused constant nodes instead of lifting them as tensor constants.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153800
Approved by: https://github.com/dolpm, https://github.com/pianpwk
2025-05-20 03:15:27 +00:00
Yanli Zhao
a79e621c1c [DDP] rebuilt bucket order when find_unused_parameters=true (#153404)
Differential Revision: D72437251

Enable rebuilding the bucket order when find_unused_parameters=True.

It should always be better than not rebuilding the bucket order when find_unused_parameters=True:

1. for cases where bucket order in the first iteration is the same as the parameter order, rebuilding bucket order will not change anything

2. for cases where bucket order in the first iteration is not the same as the parameter order, there could be two cases:
    a. bucket order will not change after 1st iteration even the graph is dynamic and there is unused parameter, in this case, rebuilding bucket order will have performance gain
    b. bucket order change after 1st iteration due to dynamic graph, in this case, both parameter order and bucket order in 1st iteration are not ideal, so rebuilding bucket order or not does not matter

Enabling rebuild of the bucket order when find_unused_parameters=True helps case 2.a, and it will not hurt the other cases in 1 and 2.b.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153404
Approved by: https://github.com/rohan-varma, https://github.com/fegin
2025-05-20 02:45:01 +00:00
Nikita Shulga
8b94d30b26 [Testing] Benchmark more tests for MPSInductor (#153897)
And report HF tests as HF tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153897
Approved by: https://github.com/dcci
2025-05-20 02:41:38 +00:00