Commit Graph

2213 Commits

Author SHA1 Message Date
Laith Sakka
39df901b2a introduce definitely_contiguous and use it for reshape and tensor meta data computation. (#153432)
When a tensor has unbacked symbols, it can be general enough to represent both contiguous and non-contiguous tensors.
In that case we can't really evaluate is_contiguous. In many places in the code base, we check is_contiguous to take a fast path, but the general path usually works for both contiguous and non-contiguous tensors, so in that case we probably want
to use the definitely_contiguous API.

This is applied to reshape in this PR and also to tensor metadata computation: the metadata now has an attribute that says the tensor is contiguous when it is always contiguous. We now store that only if definitely_contiguous is true.
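
To illustrate the idea, here is a minimal sketch of a "definitely contiguous" check. It is not the PyTorch implementation: the function name `definitely_contiguous_sketch` is hypothetical, and `None` is an assumed stand-in for an unbacked symbol, which loses the symbolic identity a real symbolic shape would keep. The check answers True only when row-major contiguity can be proven from the known values.

```python
from typing import Optional, Sequence

# None is a stand-in (an assumption of this sketch) for an unbacked symbol.
Dim = Optional[int]


def definitely_contiguous_sketch(sizes: Sequence[Dim], strides: Sequence[Dim]) -> bool:
    """Return True only if row-major contiguity is provable from known values."""
    # A tensor with a known zero-sized dimension has no elements and is contiguous.
    if any(size == 0 for size in sizes):
        return True
    expected: Dim = 1  # stride a contiguous layout would need for the current dim
    for size, stride in zip(reversed(sizes), reversed(strides)):
        stride_matches = (
            expected is not None and stride is not None and stride == expected
        )
        # A dimension of size 1 never constrains the layout; otherwise the
        # stride must provably match, and an unknown value proves nothing.
        if not stride_matches and size != 1:
            return False
        known = expected is not None and size is not None
        expected = expected * size if known else None
    return True
```

A caller would take the fast path only when this returns True and fall back to the general path otherwise, which is correct for both layouts.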

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153432
Approved by: https://github.com/bobrenjc93
2025-05-28 03:41:26 +00:00
Bin Bao
4d8f3d537a [AOTI][refactor] Rename embed_cubin to embed_kernel_binary (#154412)
Summary: Rename as it is not CUDA specific.

Differential Revision: [D75452095](https://our.internmc.facebook.com/intern/diff/D75452095)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154412
Approved by: https://github.com/angelayi
2025-05-28 01:20:28 +00:00
Shangdi Yu
eaf355cb11 [BE] Clean up unused parameter input in AOTIModel (#154276)
Summary: As title

Test Plan: CI

Differential Revision: D74691763

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154276
Approved by: https://github.com/Skylion007
2025-05-27 22:17:32 +00:00
PyTorch MergeBot
555fc05868 Revert "[Inductor] Improve typing, and prepare for ABI-compatible AOTI C-shim dispatching (#154371)"
This reverts commit 6169ca0b65.

Reverted https://github.com/pytorch/pytorch/pull/154371 on behalf of https://github.com/benjaminglass1 due to Appears to have broken main ([comment](https://github.com/pytorch/pytorch/pull/154371#issuecomment-2913975736))
2025-05-27 20:39:09 +00:00
Benjamin Glass
6169ca0b65 [Inductor] Improve typing, and prepare for ABI-compatible AOTI C-shim dispatching (#154371)
Prepares for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, as well as downstream effects of that change. In particular, this enabled:

1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions

As far as I can tell, this PR should be functionally neutral. One argument was removed from a `cpp_wrapper` public API, but that argument was unused, and only had a single callsite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire
2025-05-27 19:17:41 +00:00
Aaron Orenstein
6503b4a96e Update to using mypy 1.15 (#154054)
The BC break isn't real - mypy decided to start complaining about the way we were typing that function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154054
Approved by: https://github.com/Skylion007
2025-05-24 04:30:57 +00:00
leslie-fang-intel
7ba6fb69e6 [Inductor][CPP] Enable vectorized fp8 E5M2 quant dequant (#153365)
**Summary**
This PR enables the vectorization codegen with Inductor CPP backend for `FP8_E5M2` `quant` from `float32` and `dequant` to `float32`.

**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_dequant_quant_lowering_fp8_e5m2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153365
Approved by: https://github.com/jansel, https://github.com/jgong5
ghstack dependencies: #152417, #152418, #153364
2025-05-23 23:20:02 +00:00
leslie-fang-intel
b77a6504fa [Inductor][CPP] Enable vectorized fp8 quant dequant (#152418)
**Summary**
This PR enables the vectorization codegen with Inductor CPP backend for `FP8_E4M3` `quant` from `float32` and `dequant` to `float32`.

**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_dequant_quant_lowering_fp8_e4m3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152418
Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/CaoE
ghstack dependencies: #152417
2025-05-23 23:05:17 +00:00
henrylhtsang
e927ba6dbd [inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)
Motivation:
By default, we tune the cutlass backend kernels on 3 swizzles. These are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
The winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.
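
The greedy search described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `two_stage_autotune`, `benchmark`, `prescreen_swizzle`, and `top_x` are hypothetical names, and `benchmark` is assumed to return a runtime in milliseconds for a (config, swizzle) pair.

```python
from itertools import product
from typing import Callable, Sequence, Tuple


def two_stage_autotune(
    configs: Sequence[str],
    swizzles: Sequence[int],
    benchmark: Callable[[str, int], float],  # hypothetical: runtime in ms
    prescreen_swizzle: int = 2,
    top_x: int = 5,
) -> Tuple[str, int]:
    """Pick a (config, swizzle) pair without benchmarking the full grid."""
    # Stage 1 (prescreening): rank every config at one fixed swizzle.
    ranked = sorted(configs, key=lambda c: benchmark(c, prescreen_swizzle))
    finalists = ranked[:top_x]
    # Stage 2: benchmark only the finalists across all swizzles.
    return min(product(finalists, swizzles), key=lambda cs: benchmark(cs[0], cs[1]))
```

With N configs and S swizzles this runs N + X * S benchmarks instead of N * S, at the cost of possibly missing a config that only wins at a swizzle other than the prescreening one.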

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153335
Approved by: https://github.com/eellison
2025-05-23 17:12:25 +00:00
angelayi
26471fc203 [aoti] Initial Metal support (#153959)
An example generated file: P1816629015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153959
Approved by: https://github.com/malfet, https://github.com/desertfire
ghstack dependencies: #153964
2025-05-23 05:45:35 +00:00
Benjamin Glass
768cb734ec cpp_wrapper: build non-performance-sensitive code at O1 (#148773)
Builds on #148212, applying the same improvements to `cpp_wrapper` mode.

Benchmark results:

* [A100 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)
* [x86 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(x86)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148773
Approved by: https://github.com/desertfire
2025-05-23 00:51:20 +00:00
PyTorch MergeBot
025c5cc048 Revert "[inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)"
This reverts commit d23762974e.

Reverted https://github.com/pytorch/pytorch/pull/153335 on behalf of https://github.com/yangw-dev due to sorry the pr is failed internally [D75155648](https://www.internalfb.com/diff/D75155648) ([comment](https://github.com/pytorch/pytorch/pull/153335#issuecomment-2901916364))
2025-05-22 16:52:04 +00:00
PyTorch MergeBot
47a01f3efb Revert "[aoti] Initial Metal support (#153959)"
This reverts commit 28bcd9eb30.

Reverted https://github.com/pytorch/pytorch/pull/153959 on behalf of https://github.com/angelayi due to previous PR broke frl build ([comment](https://github.com/pytorch/pytorch/pull/153959#issuecomment-2901825315))
2025-05-22 16:17:07 +00:00
Isuru Fernando
f419373dd3 [inductor] lowering for fractional_max_pool3d (#148630)
Also adds a lowering with a reduction for large window_sizes for fractional_max_pool2d.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148630
Approved by: https://github.com/eellison
2025-05-22 16:06:29 +00:00
Michael Lazos
423fc671e9 [Cutlass] Support float8_e4m3fn GEMM (#153890)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153890
Approved by: https://github.com/drisspg, https://github.com/eellison
2025-05-22 08:37:33 +00:00
Boyuan Feng
669b176d4c [Graph Partition] support removed arguments, NoneLayout, and mutation (#153899)
Graph partition relies on `read_writes` to collect partition inputs and outputs. There are three edge cases:

1. `NoneLayout` is not allocated so it cannot become a partition input or output.
2. Codegen may decide a buffer to be internal to a kernel (e.g., triton kernel). One example is some buffers internal to a FusedSchedulerNode. These buffers are never actually allocated as `buf_id`.
3. We should use mutation_real_name for graph partition inputs and outputs to match the behavior of other codegen.

This PR supports these 3 cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153899
Approved by: https://github.com/eellison
2025-05-22 04:24:31 +00:00
PyTorch MergeBot
261897734a Revert "cpp_wrapper: build non-performance-sensitive code at O1 (#148773)"
This reverts commit 3c89cfd460.

Reverted https://github.com/pytorch/pytorch/pull/148773 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems that pr_time_benchmark is regressed after this land ([comment](https://github.com/pytorch/pytorch/pull/148773#issuecomment-2899545140))
2025-05-22 00:11:14 +00:00
Max Podkorytov
7ef2c62fd3 [ROCm][Inductor][CK] Add ck-tile based universal gemm kernels to torch.mm autotune choices (#152341)
This PR adds code generation for CK-tile based universal gemm kernels to the CK backend for Inductor, and adds these kernels to autotune choices.

Unlike legacy-CK based kernels (which are generated by parsing the CK instances from CK library), we generate the set of instances by manually specifying the tuning parameters.

This PR introduces a new template for code generation, and compilation/autotuning is handled by the existing infrastructure.

Points of discussion:

* For simplicity and reduced coupling with CK, the instance filter checks only data type and layout, and doesn't check the alignment requirement - meaning that more instances will be compiled than necessary - while keeping the code generation independent from internal CK logic which checks the alignment validity at runtime
* CK-tile instances are enabled whenever legacy-CK instances are enabled. A config knob could be introduced to differentiate between the instance types if that's needed
* Whether gemm problem size K is ever dynamic, since whenever it's not a compile-time constant, we need to perform a runtime dispatch between several kernels

**Testing**

Use the existing tests in `test/inductor/test_ck_backend.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152341
Approved by: https://github.com/chenyang78
2025-05-21 23:59:16 +00:00
henrylhtsang
053ca7439a [cutlass backend] Add serializer for cutlass ops (#153894)
Differential Revision: [D74524786](https://our.internmc.facebook.com/intern/diff/D74524786/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153894
Approved by: https://github.com/ColinPeppler, https://github.com/mlazos
2025-05-21 22:01:40 +00:00
angelayi
28bcd9eb30 [aoti] Initial Metal support (#153959)
An example generated file: P1816629015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153959
Approved by: https://github.com/malfet, https://github.com/desertfire
ghstack dependencies: #153964
2025-05-21 21:55:59 +00:00
Benjamin Glass
3c89cfd460 cpp_wrapper: build non-performance-sensitive code at O1 (#148773)
Builds on #148212, applying the same improvements to `cpp_wrapper` mode.

Benchmark results:

* [A100 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)
* [x86 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(x86)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148773
Approved by: https://github.com/desertfire
2025-05-21 20:23:04 +00:00
henrylhtsang
d23762974e [inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)
Motivation:
By default, we tune the cutlass backend kernels on 3 swizzles. These are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
The winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153335
Approved by: https://github.com/eellison
2025-05-21 17:12:05 +00:00
Bin Bao
72a3c8dfa8 [AOTI][reland] Add an option to specify custom op C shim (#153968)
Summary: Reland https://github.com/pytorch/pytorch/pull/153851 after fixing a fuzzer test issue.

Add an option to tell AOTInductor codegen to generate C shim functions for certain custom ops instead of relying on ProxyExecutor. The lib that defines the custom ops needs to implement the corresponding C shim functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153968
Approved by: https://github.com/hl475
2025-05-21 15:57:57 +00:00
Nikita Shulga
58dc80dff6 [MPSInductor] Fix indexing calculation (#153997)
By using the `c10::metal::floor_divide` primitive.

This fixes the `test_flip_cat_mps` test and makes `doctr_reco_predictor` and `doctr_det_predictor` pass accuracy checks (at least locally; a workflow dispatch has been scheduled to validate this in CI).

Before this change, the following script produced different compiled and eager results:
```python
import torch

def foo(unsqueeze, unsqueeze_1):
    cat_1 = torch.ops.aten.cat.default([unsqueeze, unsqueeze_1], 1)
    view = torch.ops.aten.view.default(cat_1, [4])
    slice_5 = torch.ops.aten.slice.Tensor(view, 0, 0, 3)
    rev_1 = torch.ops.aten.flip.default(slice_5, [0])
    return rev_1

if __name__ == "__main__":
    x = torch.arange(1.0, 3.0, device='mps').reshape(2, 1)
    y = torch.arange(5.0, 7.0, device='mps').reshape(2, 1)

    rc, (kernel,) = torch._inductor.utils.run_and_get_kernels(torch.compile(foo), x, y)
    print(kernel)
    print("Compile: ", rc)
    print("Eager: ", foo(x, y))
```
After this change
```
'''
    #include <c10/metal/utils.h>
    kernel void generated_kernel(
        device float* out_ptr0,
        constant float* in_ptr0,
        constant float* in_ptr1,
        uint xindex [[thread_position_in_grid]]
    ) {
        int x0 = xindex;
        auto tmp6 = in_ptr0[1 + (c10::metal::floor_divide((-1)*x0, 2))];
        auto tmp11 = in_ptr1[1 + (c10::metal::floor_divide((-1)*x0, 2))];
        auto tmp0 = (2 + ((-1)*x0)) % (2);
        auto tmp1 = static_cast<long>(tmp0);
        auto tmp2 = 0;
        auto tmp3 = tmp1 >= tmp2;
        auto tmp4 = 1;
        auto tmp5 = tmp1 < tmp4;
        auto tmp7 = tmp5 ? tmp6 : 0.0;
        auto tmp8 = tmp1 >= tmp4;
        auto tmp9 = 2;
        auto tmp10 = tmp1 < tmp9;
        auto tmp12 = tmp8 ? tmp11 : 0.0;
        auto tmp13 = tmp5 ? tmp7 : tmp12;
        out_ptr0[x0] = static_cast<float>(tmp13);
    }
'''
Compile:  tensor([2., 5., 1.], device='mps:0')
Eager:  tensor([2., 5., 1.], device='mps:0')
```
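The index expression in the kernel above divides a negative numerator, `(-1)*x0`, by 2, which is where floor division and C-style truncating division disagree. The sketch below illustrates that difference in plain Python. It assumes the earlier generated code used truncating division, which the commit message does not state explicitly; `trunc_div` and `floor_div` are illustrative helpers, not PyTorch functions.

```python
def trunc_div(a, b):
    """C-style integer division: rounds the quotient toward zero."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def floor_div(a, b):
    """Floor division: rounds the quotient toward negative infinity."""
    return a // b


# Evaluate the index expression 1 + ((-1) * x0) / 2 for the three output elements.
rows = []
for x0 in range(3):
    numerator = -1 * x0
    rows.append((x0, 1 + trunc_div(numerator, 2), 1 + floor_div(numerator, 2)))

for x0, truncated, floored in rows:
    print(f"x0={x0}: truncating index={truncated}, floor index={floored}")
```

The two variants differ only at `x0 = 1`, where truncation yields index 1 and floor division yields index 0.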

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153997
Approved by: https://github.com/dcci
ghstack dependencies: #153970, #153971
2025-05-21 00:03:46 +00:00
PyTorch MergeBot
7b7604fdb4 Revert "[inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)"
This reverts commit 0c04492e3b.

Reverted https://github.com/pytorch/pytorch/pull/153335 on behalf of https://github.com/malfet due to Breaks lint, see 3742b7fb3a/1 ([comment](https://github.com/pytorch/pytorch/pull/153335#issuecomment-2896031661))
2025-05-20 23:12:11 +00:00
henrylhtsang
0c04492e3b [inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)
Motivation:
By default, we tune the cutlass backend kernels on 3 swizzles. Swizzles are runtime params, so all swizzles of a config share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
The winner of the full {configs} x {swizzles} autotuning is the same as the winner of a greedy search: first find the top X configs with the swizzle hardcoded to 2, then autotune over {top X configs} x {swizzles}. In other words, we can use a greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.
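The two-stage search described above can be sketched as follows. This is a minimal illustration, not the Inductor implementation: `toy_benchmark` is a made-up cost model standing in for real kernel timings, and `exhaustive_search`, `prescreened_search`, `top_x`, and `screen_swizzle` are hypothetical names. The toy cost model is constructed so that the greedy search agrees with the exhaustive one; on real kernels that agreement is an empirical observation, not a guarantee.

```python
from itertools import product

calls = {"n": 0}  # counts how many benchmarks were run


def toy_benchmark(config, swizzle):
    """Hypothetical cost model: returns a fake runtime in ms."""
    calls["n"] += 1
    base = 1.0 + 0.05 * ((config * 7) % 20)
    return base + 0.002 * swizzle


def exhaustive_search(configs, swizzles, benchmark):
    """Benchmark every (config, swizzle) pair and return the fastest pair."""
    return min(product(configs, swizzles), key=lambda cs: benchmark(*cs))


def prescreened_search(configs, swizzles, benchmark, top_x=5, screen_swizzle=2):
    """Stage 1: rank configs at one fixed swizzle.
    Stage 2: sweep all swizzles over the top_x configs only."""
    ranked = sorted(configs, key=lambda c: benchmark(c, screen_swizzle))
    finalists = ranked[:top_x]
    return min(product(finalists, swizzles), key=lambda cs: benchmark(*cs))


configs = list(range(20))
swizzles = [1, 2, 4]

calls["n"] = 0
exhaustive_winner = exhaustive_search(configs, swizzles, toy_benchmark)
exhaustive_calls = calls["n"]

calls["n"] = 0
prescreened_winner = prescreened_search(configs, swizzles, toy_benchmark)
prescreened_calls = calls["n"]

print(exhaustive_winner, exhaustive_calls)
print(prescreened_winner, prescreened_calls)
```

With 20 configs, 3 swizzles, and X = 5, the exhaustive search runs 20 x 3 benchmarks, while the prescreened search runs 20 in the first stage plus 5 x 3 in the second.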

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153335
Approved by: https://github.com/eellison
2025-05-20 22:19:02 +00:00
Nikita Shulga
03859242ce [Testing] Fix test_deterministic_... on MPS (#153970)
By decorating emitted kernels with `'''` rather than `"""`

to match the regex in `torch._inductor.utils.run_and_get_kernels`.
This fixes `test_deterministic_codegen_mps`, `test_deterministic_codegen_on_graph_break_mps` and `test_deterministic_codegen_with_suffix_mps`
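To see why the quote style matters, consider a kernel extractor that looks for text between `'''` delimiters. The pattern below is a hypothetical stand-in and is not copied from `torch._inductor.utils`; it only illustrates that a regex keyed on one quote style finds nothing when the generated code uses the other.

```python
import re

# Hypothetical pattern (an assumption, not the real one in torch._inductor.utils):
# capture everything between a pair of ''' delimiters.
KERNEL_RE = re.compile(r"'''(.*?)'''", re.DOTALL)

single_quoted = "kernel = '''\nkernel void generated_kernel() {}\n'''"
double_quoted = 'kernel = """\nkernel void generated_kernel() {}\n"""'

found_single = KERNEL_RE.findall(single_quoted)
found_double = KERNEL_RE.findall(double_quoted)

print(len(found_single), len(found_double))
```

Under this assumption, only the `'''`-delimited source yields a kernel, so tests that compare extracted kernels see an empty list for the `"""` variant.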

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153970
Approved by: https://github.com/dcci, https://github.com/jansel
2025-05-20 21:15:14 +00:00
xinan.lin
0087f5f0af [AOTI][XPU] Embed SPIR-V files into .so (#153924)
Following the design of #150739, this PR supports embedding kernel SPIR-V files, so AOTI is one step closer to generating a single binary.
Fixes #153829
Fixes #153830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153924
Approved by: https://github.com/desertfire
2025-05-20 17:38:53 +00:00
angelayi
ddfaab3b56 [aoti] Reset expr when generating cpp code (#153898)
Maybe fixes https://github.com/pytorch/pytorch/issues/153896

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153898
Approved by: https://github.com/desertfire
2025-05-20 16:31:25 +00:00
PaulZhang12
a7c01d7f13 [Inductor] Subgraph check output strides (#153755)
Make sure the output strides of the subgraph are consistent with the original gm. Without the stride check, the subgraph could produce NaNs when a reinterpret tensor was applied to a subgraph output that was itself not contiguous.
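
A minimal sketch of what such a stride comparison can look like. The helper names are hypothetical, and this is not the check Inductor actually performs; it only shows that two tensors with the same shape can have different strides.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a dense tensor of the given shape."""
    strides = []
    running = 1
    for dim in reversed(shape):
        strides.append(running)
        running *= dim
    return tuple(reversed(strides))


def strides_match(expected, actual):
    """Hypothetical check: the layouts agree only if every stride is equal."""
    return tuple(expected) == tuple(actual)


shape = (4, 8)
original = contiguous_strides(shape)  # row-major layout
transposed = (1, 4)                   # column-major layout of the same shape

print(original, strides_match(original, transposed))
```

Code that reinterprets the output buffer under the row-major assumption would read the wrong elements from the column-major one, which is the kind of mismatch the check guards against.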

Differential Revision: [D74691119](https://our.internmc.facebook.com/intern/diff/D74691119/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153755
Approved by: https://github.com/eellison
ghstack dependencies: #153754
2025-05-20 16:07:18 +00:00
PaulZhang12
63e5d46478 [Inductor] Subgraph support dynamic input expressions (#153754)
Support subgraph choice taking in inputs that have dynamic dimensions. Testing with decomposeK subgraph decomp

Differential Revision: [D74484741](https://our.internmc.facebook.com/intern/diff/D74484741/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153754
Approved by: https://github.com/eellison
2025-05-20 16:07:18 +00:00
PyTorch MergeBot
3102ae6798 Revert "[AOTI] Add an option to specify custom op C shim (#153851)"
This reverts commit 365ac49840.

Reverted https://github.com/pytorch/pytorch/pull/153851 on behalf of https://github.com/malfet due to Looks like it broke fuzzer test, but I could be wrong, see c4d1ff02f8/1 ([comment](https://github.com/pytorch/pytorch/pull/153851#issuecomment-2894619773))
2025-05-20 14:23:50 +00:00
Bin Bao
365ac49840 [AOTI] Add an option to specify custom op C shim (#153851)
Summary: Add an option to tell AOTInductor codegen to generate C shim functions for certain custom ops instead of relying on ProxyExecutor. The lib that defines the custom ops needs to implement the corresponding C shim functions.

Differential Revision: [D75014177](https://our.internmc.facebook.com/intern/diff/D75014177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153851
Approved by: https://github.com/hl475
2025-05-20 05:12:09 +00:00
PaulZhang12
dccd19c2ef [Inductor] Construct subgraph with benchmarking args not example_inputs (#153753)
If the inputs to a subgraph have FlexibleLayout, the subgraph does not currently freeze the layouts here. Therefore, the `example_inputs` generated might not be consistent in layout with the `args` passed in for benchmarking.

Differential Revision: [D74900879](https://our.internmc.facebook.com/intern/diff/D74900879/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153753
Approved by: https://github.com/eellison
2025-05-19 15:58:40 +00:00
Bin Bao
a2d0ef242d [AOTI] Embed cubin files into .so (#150739)
Summary: Embed cubin files so AOTI is one step closer to generating a single binary. Controlled by a flag and off by default.

Differential Revision: [D72535357](https://our.internmc.facebook.com/intern/diff/D72535357)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150739
Approved by: https://github.com/angelayi
2025-05-19 01:11:46 +00:00
Tom Ritchford
8568dbce1d [inductor] Clean typing in codegen/common.py and codecache.py (#150767)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150767
Approved by: https://github.com/aorenste
2025-05-17 13:56:50 +00:00
Michael Lazos
7ebea09986 [Cutlass] Enable fusion with FusedSchedulerNodes (#153588)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153588
Approved by: https://github.com/eellison
ghstack dependencies: #152815
2025-05-17 12:29:10 +00:00
Michael Lazos
f604732e2e [Cutlass] E2E Tests for EVT (#152815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152815
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-05-17 12:29:10 +00:00
Benjamin Glass
cda572b053 codecache: Remove cpp_prefix.h duplication per build, then precompile it (#144293)
Prior to this PR, `_inductor/codegen/cpp_prefix.h` was copied into a new temporary directory on every inductor run utilizing the CPP backend (i.e. CPU-only), then included in the output source code. Instead, this PR puts it in an appropriate place in the torch includes, and includes it from there. This allows us to precompile it in cpp_wrapper and AOT inductor mode, saving significant compilation time.

Due to difficulties getting this to work in FBCode, the precompilation itself is only enabled in OSS PyTorch.

Differential Revision: [D69420620](https://our.internmc.facebook.com/intern/diff/D69420620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144293
Approved by: https://github.com/desertfire
2025-05-16 17:41:36 +00:00
PyTorch MergeBot
3443627e07 Revert "[BE]: Enable RUFF TRY400 rule - log.exception (#153473)"
This reverts commit 4f4ecc583e.

Reverted https://github.com/pytorch/pytorch/pull/153473 on behalf of https://github.com/jeanschmidt due to seems to have broken internal signals, @albanD may I count on you to help the author merge his PR? D74837988 ([comment](https://github.com/pytorch/pytorch/pull/153473#issuecomment-2886017075))
2025-05-16 08:29:26 +00:00
Nikita Shulga
db26aeaec2 [MPSInductor] Support numpy scalars handling (#153598)
By default, numpy computes results in float64, but a numpy scalar passed as an argument to an MPS function must be implicitly converted to float32. This occurs naturally in some networks, for example in speech_transformer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153598
Approved by: https://github.com/cyyever, https://github.com/dcci
ghstack dependencies: #153582
2025-05-15 16:48:25 +00:00
Aaron Gokaslan
4f4ecc583e [BE]: Enable RUFF TRY400 rule - log.exception (#153473)
Change logging.error to logging.exception to log additional information when relevant. A few logging.error calls have slipped into try/except blocks since I last did a cleanup here, and the rule is now stabilized, so I am enabling it codebase-wide. I have NOQA'd much of our custom exception stack trace handling for RPC calls and distributed, and tried to fix a few errors based on whether we immediately reraised the exception or did not print any exception handling where it could be useful.
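
A small standalone sketch of the difference the rule targets, using only the standard library. The logger name and the `parse` helper are made up for illustration.

```python
import io
import logging

# Capture log output in memory so the effect is easy to inspect.
stream = io.StringIO()
logger = logging.getLogger("try400_demo")
logger.setLevel(logging.ERROR)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))


def parse(value):
    try:
        return int(value)
    except ValueError:
        # logging.exception logs at ERROR level and appends the active
        # traceback; logging.error would record only the message.
        logger.exception("failed to parse %r", value)
        return None


result = parse("not-a-number")
output = stream.getvalue()
print(output)
```

The captured output contains the message followed by the `ValueError` traceback, which is the additional information the commit refers to.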

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-05-15 13:36:59 +00:00
sanchitintel
7482eb217c [Inductor-CPU] Faster int8 WoQ GEMM for small M with explicit prefetching and different outer loops (#149373)
### Summary

Fixes #148494

Explicitly prefetch the cache lines of the next `B` block to accelerate int8 WoQ (BF16 activation, int8 statically quantized weights) GEMM for small `M` dimension.

Some of this code (outer loops of the GEMM) is being ported over from Intel Extension for PyTorch. The macro-kernel* and the micro-kernel* are essentially the same, but optionally prefetch a block of B. Templatization is being used to prevent branching causing a slowdown due to unnecessary prefetching.

\* - in [BLIS](https://dl.acm.org/doi/10.1145/2764454) parlance

### Performance data with BS 1

Machine: 32 cores of one socket of an Intel Xeon SP Gen 5 machine

| Model | Input tokens | Output tokens | Next-token latency before this PR | Next-token latency after this PR | Speedup |
|-----------|-------------|-----------------|--------------------------------------|------------------------------------------|-----------|
| GPT-J | 128 | 128 | 42 ms | 38 ms | 9.52% |
| GPT-J | 1024 | 1024 | 48 ms | 45 ms | 6.25% |
| LLaMA 3.1 8B Instruct | 128 | 128 | 52 ms | 47 ms | 9.61% |
| LLaMA 3.1 8B Instruct | 1024 | 1024 | 57 ms | 53 ms | 7.01% |
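
The Speedup column appears to be the relative latency reduction, (before - after) / before. That formula is inferred from the numbers, not stated in the PR. A short sketch that recomputes it from the table's latencies:

```python
# (model, input/output tokens, latency before in ms, latency after in ms),
# copied from the table above.
rows = [
    ("GPT-J", 128, 42, 38),
    ("GPT-J", 1024, 48, 45),
    ("LLaMA 3.1 8B Instruct", 128, 52, 47),
    ("LLaMA 3.1 8B Instruct", 1024, 57, 53),
]


def speedup_percent(before_ms, after_ms):
    """Relative latency reduction, in percent (assumed definition)."""
    return (before_ms - after_ms) / before_ms * 100


for model, tokens, before, after in rows:
    print(f"{model} ({tokens} tokens): {speedup_percent(before, after):.3f} %")
```

Under this definition, the two LLaMA rows come out a hair above the listed values, which suggests the table truncates to two decimals rather than rounding.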

While the input shapes of the GEMMs corresponding to linear layers in next-token computation are the same regardless of the number of input and output tokens, the difference in next-token latency across these cases is due to attention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149373
Approved by: https://github.com/leslie-fang-intel, https://github.com/Xia-Weiwen

Co-authored-by: Xia Weiwen <xia.weiwen@hotmail.com>
2025-05-15 11:55:58 +00:00
henrylhtsang
910d2f96af [cutlass backend] forward fix cutlass backend A100 test (#153428)
Forward fix of https://github.com/pytorch/pytorch/pull/153006, which broke a test.

In the long run, we should get rid of CUDATemplateCaller.category.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153428
Approved by: https://github.com/ColinPeppler
2025-05-15 05:45:38 +00:00
henrylhtsang
f5e0806f34 [cutlass backend] Add back descriptive names for epilogue fusion (#153405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153405
Approved by: https://github.com/mlazos
2025-05-15 01:47:52 +00:00
Bin Bao
33a5179269 [AOTI][reland2] Remove typedef for half and bfloat16 (#153467)
Summary:
Reland https://github.com/pytorch/pytorch/pull/151109 after fixing cutlass AOTI build issues.

typedef is prone to name collision. Explicitly spell out the actual aten types, needed for the standalone AOTI codegen.

Differential Revision: D74398762

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153467
Approved by: https://github.com/jingsh, https://github.com/henrylhtsang, https://github.com/cyyever
2025-05-14 02:37:18 +00:00
Sam Larsen
f1de3f9f07 Rename "output_tensor" -> "out" in autotune_process.py (#153169)
Summary: This change is to support remote autotuning. I want to use all the same benchmarking utilities in select_algorithm.py. For remote autotuning, I'll reuse the TritonBenchmarkRequest class used for subprocess autotuning because it's already serializable. That class is also used in standard, in-process autotuning, but via TritonTemplateCaller.benchmark() which sets the output_tensor param when calling the underlying TritonBenchmarkRequest. For remote, I'll be using the TritonBenchmarkRequest request directly so I want the parameter to be named 'out' to avoid "got an unexpected keyword argument 'out'".
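
A toy reproduction of the error being avoided. The two classes are hypothetical stand-ins and do not mirror the real `TritonBenchmarkRequest` signature; they only show how the keyword name decides whether a caller passing `out=` works.

```python
class OldStyleRequest:
    """Hypothetical stand-in using the old parameter name."""

    def benchmark(self, *input_tensors, output_tensor=None):
        return output_tensor


class RenamedRequest:
    """Hypothetical stand-in using the new parameter name."""

    def benchmark(self, *input_tensors, out=None):
        return out


try:
    OldStyleRequest().benchmark(1, 2, out="buf")
    error_message = None
except TypeError as exc:
    # Python rejects keyword arguments that the signature does not declare.
    error_message = str(exc)

renamed_result = RenamedRequest().benchmark(1, 2, out="buf")
print(error_message)
print(renamed_result)
```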

Test Plan: Existing unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153169
Approved by: https://github.com/aorenste, https://github.com/eellison
2025-05-13 14:18:29 +00:00
Nikita Shulga
a6c5b59067 [MPSInductor] Fix multistage reduction suffixes (#153362)
By invalidating all variables created during the loop except for the context of iterator_cache, as storage can be done inside the reduction loop, and clearing the `IteratorRangeEntry` codegen cache.

This results in the following kernel for `x / x.sum()` when `x` has 2048 elements and the max threadgroup size is 1024:
```metal
[[max_total_threads_per_threadgroup(1024)]]
kernel void generated_kernel(
    device half* out_ptr1,
    constant half* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    threadgroup float tmp_acc_0[32];
    float tmp_acc_1 = 0;
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        auto tmp0 = static_cast<float>(in_ptr0[r0_0]);
        tmp_acc_1 += tmp0;
    }
    auto tmp1 = c10::metal::threadgroup_sum(tmp_acc_0, tmp_acc_1, r0_index * 1, 1024);
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        auto tmp2 = static_cast<float>(in_ptr0[r0_0]);
        auto tmp3 = tmp2 / tmp1;
        out_ptr1[r0_0] = static_cast<half>(tmp3);
    }
}
```
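
For readers less familiar with Metal, the kernel's three stages can be emulated sequentially in plain Python. This is only a sketch of the arithmetic: the function name is made up, the threads run one after another, and the `threadgroup_sum` call is replaced by an ordinary `sum`.

```python
def normalize_by_sum(x, threads=1024):
    """Sequential emulation of the multistage reduction for x / x.sum()."""
    n = len(x)
    per_thread = n // threads  # 2 when n == 2048 and threads == 1024
    assert per_thread * threads == n, "sketch assumes an even split"

    # Stage 1: each thread accumulates its own slice (tmp_acc_1 in the kernel).
    partial = [0.0] * threads
    for t in range(threads):
        for cnt in range(per_thread):
            partial[t] += x[per_thread * t + cnt]

    # Stage 2: threadgroup-wide reduction of the per-thread partial sums.
    total = sum(partial)

    # Stage 3: each thread revisits the same slice and divides by the total.
    out = [0.0] * n
    for t in range(threads):
        for cnt in range(per_thread):
            idx = per_thread * t + cnt
            out[idx] = x[idx] / total
    return out


x = [float(i % 7 + 1) for i in range(2048)]
out = normalize_by_sum(x)
print(sum(out))
```

The second loop reuses the same index expression as the first, which is why the codegen has to keep that iterator context alive while invalidating the other per-loop variables.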

Fixes the compilation error reported while running `GPUTests.test_pattern_matcher_multi_user_mps` and `GPUTests.test_weight_norm_bwd_mps`

Fixes https://github.com/pytorch/pytorch/issues/152155

Inductor tests are still failing, though, so the variable invalidation needs further refinement

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153362
Approved by: https://github.com/manuelcandales, https://github.com/dcci, https://github.com/jansel
2025-05-13 03:07:53 +00:00
Benjamin Glass
b0f2891e43 [AOTInductor] Fix clang-tidy warnings in wrapper (#153197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153197
Approved by: https://github.com/desertfire
2025-05-12 22:35:59 +00:00
Aaron Gokaslan
3ff22fe2df [BE]: Use shutil.which in inductor codegen (#153377)
Use shutil.which instead of subprocess. It is more secure, has better error handling, and is more cross-platform.
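
A minimal sketch of the pattern. The commit does not show the replaced code, so the assumption here is that it located an executable by spawning a lookup command; the helper name and the tool names are made up.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable on PATH, or None if absent.

    shutil.which does the lookup in-process, so there is no child process
    to spawn and no command output to parse.
    """
    return shutil.which(name)


missing = find_tool("definitely-not-a-real-compiler-xyz")
maybe_python = find_tool("python3")  # may be None, depending on the machine

print(missing, maybe_python)
```

A missing tool is reported as `None` rather than as a non-zero exit status or an exception, which keeps the error handling at the call site simple.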

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153377
Approved by: https://github.com/albanD
2025-05-12 22:11:26 +00:00