Commit Graph

88238 Commits

Author SHA1 Message Date
Simon Fan
fae6f6c9ca [aot] fix deepcopying of aot bwd containing real tensors (#153999)
Previously when we lower backward AOT due to symints, the post grad passes would leave the bw_module in a non-runnable state. This caused issues when compiled autograd tried to trace at runtime. So we had inductor operate on a deepcopy of bw_module.

But with https://github.com/pytorch/pytorch/issues/153993, we see that deepcopying real tensors will fail under fake mode due to the device type mismatch between the fake tensors ("meta" device) and the real tensor. So by disabling fake mode, we avoid these errors. This change is a strict improvement over current, but it does reveal that this deepcopy can theoretically cause OOMs.

FIXES https://github.com/pytorch/pytorch/issues/153993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153999
Approved by: https://github.com/jamesjwu, https://github.com/bdhirsh
2025-05-21 23:30:02 +00:00
William Wen
67f9feeee7 remove TestCustomOp.test_impl_device_cpu from dynamo expected failures (#154049)
Fixes https://github.com/pytorch/pytorch/issues/153763 maybe?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154049
Approved by: https://github.com/StrongerXi
2025-05-21 23:20:30 +00:00
Catherine Lee
5ee1242310 Follow up to #152209, remove compat patch after docker image rename (#152958)
Remove compat patch that lets PRs that haven't rebased base #152209 still have docker images.

Merge ~2 weeks after the above PR was merged.  ~80% of PRs have a merge base that is <2 weeks old

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152958
Approved by: https://github.com/huydhn
2025-05-21 23:11:29 +00:00
ishanjmukherjee
d82610c2af docs: fix "should not to be" typo in register_buffer docstring (#153817)
Corrects a small grammatical error in `register_buffer` docstring, from "...  should not to be ..." to "...  should not be ...". Docs-only change, so no runtime behavior, tests, or APIs are affected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153817
Approved by: https://github.com/mikaylagawarecki
2025-05-21 22:46:50 +00:00
yang
b967b7b11e Update rnn.py, fix torch.nn.RNN document error (#153620)
I found the same issue as #147490 (@jibril-b-coulibaly).

There's an equivalent in the [doc-string](https://docs.pytorch.org/docs/stable/generated/torch.nn.RNN.html#rnn) of `torch.nn.RNN`:

```python
# Efficient implementation equivalent to the following with bidirectional=False
def forward(x, hx=None):
    if batch_first:
        x = x.transpose(0, 1)
    seq_len, batch_size, _ = x.size()
    if hx is None:
        hx = torch.zeros(num_layers, batch_size, hidden_size)
    h_t_minus_1 = hx
    h_t = hx
    output = []
    for t in range(seq_len):
        for layer in range(num_layers):
            h_t[layer] = torch.tanh(
                x[t] @ weight_ih[layer].T
                + bias_ih[layer]
                + h_t_minus_1[layer] @ weight_hh[layer].T
                + bias_hh[layer]
            )
        output.append(h_t[-1])
        h_t_minus_1 = h_t
    output = torch.stack(output)
    if batch_first:
        output = output.transpose(0, 1)
    return output, h_t

```

However there's something wrong.

1. Like mentioned in #147490, line 499 is wrong

fb55bac3de/torch/nn/modules/rnn.py (L499)

The **input for RNNCell should be different** for different layers.

2. The code contains several hidden **reference-related issues** that may result in unintended modifications to tensors. For example in line 504, this causes all elements in the final output list to point to the same tensor.

fb55bac3de/torch/nn/modules/rnn.py (L504)

3. Some variable is not **defined**. Despite being a relatively minor issue in annotation, it can lead to significant confusion for those who are new to the concept. For example `weight_ih` in line 499

fb55bac3de/torch/nn/modules/rnn.py (L499)

So, i write a runnable version to make it more clear:

```python
# Efficient implementation equivalent to the following with bidirectional=False
rnn = nn.RNN(input_size, hidden_size, num_layers)
params = dict(rnn.named_parameters())
def forward(x, hx=None, batch_first=False):
    if batch_first:
        x = x.transpose(0, 1)
    seq_len, batch_size, _ = x.size()
    if hx is None:
        hx = torch.zeros(rnn.num_layers, batch_size, rnn.hidden_size)
    h_t_minus_1 = hx.clone()
    h_t = hx.clone()
    output = []
    for t in range(seq_len):
        for layer in range(rnn.num_layers):
            input_t = x[t] if layer == 0 else h_t[layer - 1]
            h_t[layer] = torch.tanh(
                input_t @ params[f"weight_ih_l{layer}"].T
                + h_t_minus_1[layer] @ params[f"weight_hh_l{layer}"].T
                + params[f"bias_hh_l{layer}"]
                + params[f"bias_ih_l{layer}"]
            )
        output.append(h_t[-1].clone())
        h_t_minus_1 = h_t.clone()
    output = torch.stack(output)
    if batch_first:
        output = output.transpose(0, 1)
    return output, h_t
```

This code can reproduce the computation of torch.nn.RNN.

For example:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, num_layers = 3, 5, 2
rnn = nn.RNN(input_size, hidden_size, num_layers)
params = dict(rnn.named_parameters())
x = torch.randn(10, 4, 3)

official_imp = rnn(x)
my_imp = forward(x)

assert torch.allclose(official_imp[0], my_imp[0])
assert torch.allclose(official_imp[1], my_imp[1])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153620
Approved by: https://github.com/mikaylagawarecki
2025-05-21 22:45:28 +00:00
Bin Bao
5b6e551c0f [AOTI][refactor] Fix an anonymous namespace issue (#154033)
Summary: Remove anonymous namespace in model_container.h to fix the following compiler warning,
```
warning: ‘torch::aot_inductor::AOTInductorModelContainer’ has a field ‘torch::aot_inductor::AOTInductorModelContainer::constant_folded_’ whose type uses the anonymous namespace [-Wsubobject-linkage]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154033
Approved by: https://github.com/chenyang78
2025-05-21 22:29:09 +00:00
Yidi Wu
d356ca2466 [map] add inductor support by lowering to while_loop (#150971)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150971
Approved by: https://github.com/zou3519
ghstack dependencies: #151034
2025-05-21 22:19:47 +00:00
Yidi Wu
cf1b38a017 [map] make proxy mode re-dispatch to fake key (#151034)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151034
Approved by: https://github.com/zou3519
2025-05-21 22:19:47 +00:00
Huy Do
c13eeaa718 Move inductor benchmark jobs to CUDA 12.8 (#154004)
For benchmark jobs, we usually want to run with the latest support CUDA version to get the best performance. This is a request coming from NVIDIA where they are running inductor benchmarks on Blackwell with CUDA 12.8 (min support version) and looking for an apples-to-apples comparison.

This also clean up references to CUDA 12.4 which have been sunset in PyTorch CI.

### Testing

- H100 benchmark https://github.com/pytorch/pytorch/actions/runs/15151424588
- Micro benchmark https://github.com/pytorch/pytorch/actions/runs/15151445957 (I just realize that this is still running on A100, @yanboliang Do you want to run on H100 now that we have capacity there?  It would also solve the problem of GPU memory)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154004
Approved by: https://github.com/atalman, https://github.com/ZainRizvi, https://github.com/seemethere
2025-05-21 22:17:10 +00:00
henrylhtsang
053ca7439a [cutlass backend] Add serializer for cutlass ops (#153894)
Differential Revision: [D74524786](https://our.internmc.facebook.com/intern/diff/D74524786/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153894
Approved by: https://github.com/ColinPeppler, https://github.com/mlazos
2025-05-21 22:01:40 +00:00
Natalia Gimelshein
401fa87ace make only current thread allocate to pool in NcclPG (#153990)
follow up to #153356 that fixes nccl allocation to pool

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153990
Approved by: https://github.com/kwen2501
2025-05-21 21:57:37 +00:00
angelayi
28bcd9eb30 [aoti] Initial Metal support (#153959)
An example generated file: P1816629015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153959
Approved by: https://github.com/malfet, https://github.com/desertfire
ghstack dependencies: #153964
2025-05-21 21:55:59 +00:00
angelayi
918ae5d361 [aoti] Add MPS runner and shim (#153964)
Added AOTIModelContainerRunnerMps and a shim for mps fallback ops.
I also added a mps-specific shim which contains one operator, which will be used to set arguments being passed to the Metal kernel:

```
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_mps_set_arg(
    AOTIMetalKernelFunctionHandle func,
    unsigned idx,
    AtenTensorHandle tensor);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153964
Approved by: https://github.com/malfet, https://github.com/desertfire
2025-05-21 21:55:59 +00:00
Sidharth
0b79a8c1a9 [dynamo] renamed _fn for more clarity and put a comment of user compiler user (#154026)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154026
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
2025-05-21 21:12:51 +00:00
Scott Todd
0e5f2339d0 [ROCm][Windows] Run hipcc with compatibility flags. (#153986)
See also https://github.com/ROCm/TheRock/issues/590. Including the `-Wno-ignored-attributes` flag here avoids 700MB of log warning spam while compiling and the `-fms-extensions` seems beneficial to include: https://clang.llvm.org/docs/MSVCCompatibility.html.

Co-authored-by: Aaryaman Vasishta <jem456.vasishta@gmail.com>
Co-authored-by: Scott Todd <scott.todd0@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153986
Approved by: https://github.com/Skylion007, https://github.com/jeffdaily

Co-authored-by: Aaryaman Vasishta <jem456.vasishta@gmail.com>
2025-05-21 20:26:52 +00:00
Benjamin Glass
3c89cfd460 cpp_wrapper: build non-performance-sensitive code at O1 (#148773)
Builds on #148212, applying the same improvements to `cpp_wrapper` mode.

Benchmark results:

* [A100 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)
* [x86 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(x86)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148773
Approved by: https://github.com/desertfire
2025-05-21 20:23:04 +00:00
Ryan Guo
4c6f0fe22f [dynamo] Properly handle torch.script.jit under @staticmethod (#153984)
Fixes #153607.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153984
Approved by: https://github.com/williamwen42
2025-05-21 19:45:06 +00:00
James Wu
b184e3da9c [easy] Fix internal only test (#154035)
Internally static cuda launcher isn't enabled, so we need to always enable it

Differential Revision: [D75146584](https://our.internmc.facebook.com/intern/diff/D75146584/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154035
Approved by: https://github.com/Skylion007
ghstack dependencies: #153565
2025-05-21 19:00:55 +00:00
Yidi Wu
8e6e79fc1b [hop_schema] support gen_schema for invoke_subgraph (#152984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152984
Approved by: https://github.com/zou3519
ghstack dependencies: #151067, #152974
2025-05-21 18:55:46 +00:00
Yidi Wu
9c33899196 [hop_schema] add HopSchemaGenerator to make it easier to create hop schema (#152974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152974
Approved by: https://github.com/zou3519
ghstack dependencies: #151067
2025-05-21 18:55:46 +00:00
Yidi Wu
1e0f19e173 auto functionalize base_hop (#151067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151067
Approved by: https://github.com/zou3519
2025-05-21 18:55:46 +00:00
Laith Sakka
11c0ffefcd Cache code generation during triton template expansion and enable it for mm_template. (#151773)
In a model, we see ~~ 40% of the time in mm/addmm tuning. The model have 2000 mm,
many of which receives the same input shapes.

with autotune enabled, this become expensive, while we already cache auto tuning results, we
did not used to cache the generation of the python code and the loading for each config that we autotune on.

This diff handles the code generation part (template expansions) a previous diff handled the loading part.
This is expected to save 20% of the model I am working on.

How do we do the caching?
For a given configurations and input layout, the generated code is always the same. One caveat is that
some other information collected during code generation are input dependent (namely depends on inputs
names and symbol names in inputs). and not just layout. !
To handle those we use a record and replay approach, where we record the functions that are called during
code generation that effect those outputs and replay them at a cache hit.

Effect on the current benchmark on a local run on dev server.
mm_loop. 24115830838 -> 18362098019
mm_loop_dynamic 30506097176-> 25697270062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151773
Approved by: https://github.com/eellison
2025-05-21 18:55:41 +00:00
Deysha Rivera
aec7cc60d7 add graph_code_verbose_log artifact for fx passes (#153775)
Fixes #153646

This PR refactors the logging behavior in the FX pass insert_deferred_runtime_asserts and runtime_assert.py to separate verbose/intermediate graph logs from the final output graph log. All verbose logs generated during the FX pass are now routed to a new artifact logger, graph_code_verbose, while only the final output graph remains logged to the original graph_code artifact.

Changes

- Added a new artifact logger: [graph_code_log = torch._logging.getArtifactLogger(__name__, "graph_code_verbose")]
- Updated all verbose/intermediate FX pass logs in [insert_deferred_runtime_asserts] to use the new graph_code_verbose artifact.
- Ensured that only the final output graph is logged to the original graph_code artifact.
- No changes to the FX pass logic or output—only logging behavior is affected.

Notes
This change is backward-compatible and does not affect the functional behavior of FX passes.
No changes to user-facing APIs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153775
Approved by: https://github.com/williamwen42
2025-05-21 18:31:59 +00:00
Aleksei Nikiforov
d23f4ae7b5 s390x: use qemu issue workaround for runner registration too (#154030)
s390x: use qemu issue workaround for runner registration too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154030
Approved by: https://github.com/seemethere
2025-05-21 18:30:25 +00:00
Tomasz Bohutyn
bb7e30c165 [MegaCache] Make MegaCache generic to allow external plugins registration (#152977)
Implements #152976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152977
Approved by: https://github.com/oulgen
2025-05-21 18:18:47 +00:00
James Wu
c31e239910 [precompile] Add BundledAOTAutogradCacheEntry (#152840)
Finally, this PR adds BundledAOTAutogradCacheEntry. A BundledAOTAutogradCacheEntry is an AOTAutogradCacheEntry that saves the entire CompiledFxGraph directly in the entry.

This has some advantages:
- No more dependency on FxGraphCache at all
- Clearing FxGraphCache does not result in AOTAutogradCache miss
- Simpler logic, as BundledAOTAutogradCacheEntry has everything you need to load a full compiled python wrapper from a dynamo output

We plan to use BundledAOTAutogradCacheEntry for precompile. There's also a question of whether we want to use it for regular caching — the main disadvantage of this is having to save the same CompiledFxGraph twice, once in Inductor cache and once for AOTAutogradCache. With MegaCaching, this *could* be a regression in total cache size (as well as a minor cold start regression, as you have to save the same graph twice). I will import this and measure the mega cache space complexity, and if it looks good I'll enable it by default for caching as well.

On warm start, if AOTAutogradCache hits, you won't have to load inductor at all, so warm start overhead should be unaffected.

Differential Revision: [D74593304](https://our.internmc.facebook.com/intern/diff/D74593304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152840
Approved by: https://github.com/zhxchen17
2025-05-21 18:08:42 +00:00
PyTorch MergeBot
3eb8fa081a Revert "[3/n][Optimus][Auto-AC] Support float8_e4m3fn quantization type and set scaling as the default (#153802)"
This reverts commit 32b1baa981.

Reverted https://github.com/pytorch/pytorch/pull/153802 on behalf of https://github.com/malfet due to It breaks ROCM testing, see d23762974e/1 ([comment](https://github.com/pytorch/pytorch/pull/153802#issuecomment-2898695702))
2025-05-21 17:20:31 +00:00
henrylhtsang
d23762974e [inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)
Motivation:
By default, we are tuning the cutlass backend kernels on 3 swizzles. There are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
Winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a Greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153335
Approved by: https://github.com/eellison
2025-05-21 17:12:05 +00:00
Bin Bao
72a3c8dfa8 [AOTI][reland] Add an option to specify custom op C shim (#153968)
Summary: Reland https://github.com/pytorch/pytorch/pull/153851 after fixing a fuzzer test issue.

Add an option to tell AOTInductor codegen to generate C shim functions for certain custom ops instead of relying on ProxyExecutor. The lib that defines custom ops need to implement corresponding C shim functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153968
Approved by: https://github.com/hl475
2025-05-21 15:57:57 +00:00
Aaron Gokaslan
b7d08defe9 [BE]: Type previously untyped decorators (#153726)
This fixes decorator typing which unmasks a lot of typing issues in the codebase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153726
Approved by: https://github.com/albanD
2025-05-21 15:56:19 +00:00
Nikita Shulga
6c2c527cd6 [BE] Remove extra semicolons from SymmetricMemory.hpp (#154034)
Fixes
```
In file included from /Users/malfet/git/pytorch/pytorch/torch/csrc/distributed/c10d/SymmetricMemory.cpp:1:
/Users/malfet/git/pytorch/pytorch/torch/csrc/distributed/c10d/SymmetricMemory.hpp:77:4: warning: extra ';' after member function definition [-Wextra-semi]
   77 |   };
      |    ^
/Users/malfet/git/pytorch/pytorch/torch/csrc/distributed/c10d/SymmetricMemory.hpp:81:4: warning: extra ';' after member function definition [-Wextra-semi]
   81 |   };
      |    ^
2 warnings generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154034
Approved by: https://github.com/Skylion007
2025-05-21 14:33:30 +00:00
James Wu
33767eb391 Add option to statically launch user defined triton kernels (#153725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153725
Approved by: https://github.com/oulgen, https://github.com/Mingming-Ding, https://github.com/jansel
ghstack dependencies: #153565
2025-05-21 14:33:15 +00:00
Nikita Shulga
b73d77900e [CI] Run limited h100 tests every 6 hours (#153900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153900
Approved by: https://github.com/Skylion007
2025-05-21 13:40:03 +00:00
Yutao Xu
11f8511455 Update torch-xpu-ops commit pin (#153902)
Update the torch-xpu-ops commit to defce46ae7, includes:

- Resolve the aten::gamma accuracy gap compared to scipy
- Optimize layernom_vectorized_impl by using adaptive wg selection for small shapes
- [Intro async flag and use current stream avoid stream sync](https://github.com/intel/torch-xpu-ops/pull/1546)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153902
Approved by: https://github.com/Skylion007, https://github.com/EikanWang
2025-05-21 13:29:41 +00:00
ZhiweiYan-96
616e20b4f1 [Intel GPU] scalar tensor case handling in addmm, baddmm (#153051)
# Motivation
This PR adds scalar tensor (`t.elem()==1 and t.sizes().empty() == true and t.dim()=0` )handling in addmm, baddmm. The issue is found during the process of oneDNN upgradation. Found that the new version of oneDNN requires the post-binary (self in addmm) has same dimension as the one of output tensor. Now we need explicitly expand the shape of `self` tensor. Former version dnnl may help us do the broadcasting inside.

This PR could fix issues in https://github.com/intel/torch-xpu-ops/issues/1612 and CI error in https://github.com/pytorch/pytorch/pull/151767.

# Implementation
We treat the scalar tensor as normal tensor by `unsqueeze` it as 1 dimension tensor. Accompanied with the existing shape handle logic, it would be further `unsqueeze` to 2D or 3D shape.

UT testing
```
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_addmm_xpu_float32
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_addmv_xpu_float32
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_baddbmm_xpu_float16
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_baddbmm_xpu_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153051
Approved by: https://github.com/EikanWang, https://github.com/guangyey
2025-05-21 12:24:37 +00:00
Irem Yuksel
afd7a13bca Migrate to new Windows Arm64 runners (#152099)
This PR moves the Windows Arm64 nightly jobs to the new runner image, see [arm-windows-11-image](https://github.com/actions/partner-runner-images/blob/main/images/arm-windows-11-image.md )

Fixes #151671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152099
Approved by: https://github.com/seemethere
2025-05-21 09:13:15 +00:00
Aaron Gokaslan
ffd49d538e [BE][Ez]: Improve typing in torch/modules/container.py (#153728)
Adds some missing type annotations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153728
Approved by: https://github.com/albanD
2025-05-21 07:15:00 +00:00
Andy (An) Wang
a636a92ee9 [MTIA ATen Backend] Migrate "_unsafe_view" and "view" ops from out-of-tree to pytorch in-tree (#153670)
Summary:
# Context
The MTIA New Aten Backend work is essentially to move MTIA operators from pytorch out-of-tree to in-tree, with following benefits:
1. Avoid duplicate code copied from pytorch, e.g. view ops implementation, util functions.
2. Utilize TensorIterator and structured kernel codegen, avoid manual implementation of broadcasting, dtype casting, asserting, etc.
3. Eliminate MTIA's own codegen flow, which is unnecessary complexity.
4. Overall make MTIA's aten backend more pytorch native.

Differential Revision: D74672464

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153670
Approved by: https://github.com/albanD, https://github.com/nautsimon
2025-05-21 05:20:45 +00:00
xinan.lin
dcb3edd30d [AOTI][XPU] Refactor AOTInductor runtime API for Intel GPU. (#153929)
Simplify and improve code format for sycl_runtime_wrappers.h

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153929
Approved by: https://github.com/desertfire
ghstack dependencies: #153924
2025-05-21 03:52:54 +00:00
PyTorch MergeBot
531d8f5fb6 Revert "[cuBLAS][cuBLASLt] Use cuBLAS default workspace size in Lt (#153556)"
This reverts commit 2b43d635d3.

Reverted https://github.com/pytorch/pytorch/pull/153556 on behalf of https://github.com/eqy due to reverting, will add disable for reduced precision reduction ([comment](https://github.com/pytorch/pytorch/pull/153556#issuecomment-2896257521))
2025-05-21 02:09:11 +00:00
PyTorch MergeBot
1478d0185c Revert "[CI][CUDA] Move cu118 distributed pull jobs to cu126, move cu124-sm75 to cu126-sm75 (#151594)"
This reverts commit 8cabd23b3d.

Reverted https://github.com/pytorch/pytorch/pull/151594 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail a distributed test in trunk ([comment](https://github.com/pytorch/pytorch/pull/151594#issuecomment-2896230131))
2025-05-21 01:45:20 +00:00
Yu, Guangye
daa68e7a93 Update USE_XCCL option if USE_XPU is OFF (#153936)
# Motivation
Disable `USE_XCCL` when `USE_XPU` is turned `OFF` to ensure configuration consistency. This is required because XCCL depends on XPU functionality.
Especially, ensure that `USE_XCCL` is correctly set to `OFF` when [caffe2_update_option(USE_XPU OFF)](1075bb37d3/cmake/Dependencies.cmake (L97)) is invoked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153936
Approved by: https://github.com/Skylion007
2025-05-21 01:32:41 +00:00
PyTorch MergeBot
cf6e5d1881 Revert "[cuBLASLt] relax addmm cuBLASLt constraint (#153675)"
This reverts commit f9bb7cf72a.

Reverted https://github.com/pytorch/pytorch/pull/153675 on behalf of https://github.com/eqy due to incorrect, cuBLASLt doesnt handle beta != 1.0 but this appears untested ([comment](https://github.com/pytorch/pytorch/pull/153675#issuecomment-2896188784))
2025-05-21 01:20:10 +00:00
Frost Mitchell
fe49b11e09 Add memory reporting for XPU to Memory Profiler (#152842)
Adds support for XPU profile_memory in Pytorch Profiler.

Currently, when `profile_memory=True` is passed to `torch.profiler.profile`, there is no XPU memory reported. For example, the profiling table printed by the code below is missing any `XPU Mem` columns:

<details><summary>profiling.py</summary>
<p>

```python
import torch
import torch.nn as nn
import torch.optim as optim

from torch.profiler import profile, ProfilerActivity

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.conv1 = nn.Conv1d(20,20,15,padding="same")
        self.flatten = nn.Flatten()
        self.net1 = nn.Linear(2048, 4096)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(4096, 5)

    def forward(self, x):
        res = self.conv1(x)
        res = self.flatten(res)
        res = self.net1(res)
        return self.net2(self.relu(res))

def demo_basic():
    model = ToyModel().to("xpu")
    loss_fn = nn.MSELoss().to("xpu")
    optimizer = optim.SGD(model.parameters(), lr=0.001)

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU], profile_memory=True) as prof:
        for epoch in range(10):
            optimizer.zero_grad()
            outputs = model(torch.randn(20, 2048).to("xpu"))
            labels = torch.randn(20, 5).to("xpu")
            loss_fn(outputs, labels).backward()
            optimizer.step()
    print(prof.key_averages().table(max_name_column_width=100, sort_by="xpu_time_total", row_limit=100))

if __name__ == "__main__":
    demo_basic()
```
</p>
</details>

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg      Self XPU    Self XPU %     XPU total  XPU time avg       CPU Mem  Self CPU Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                            gemm_kernel         0.00%       0.000us         0.00%       0.000us       0.000us       1.501ms        44.73%       1.501ms      25.024us           0 b           0 b            60
    autograd::engine::evaluate_function: AddmmBackward0         0.12%       1.067ms        30.47%     260.929ms      13.046ms       0.000us         0.00%       1.009ms      50.448us           0 b           0 b            20
                                         AddmmBackward0         0.09%     744.983us        15.99%     136.944ms       6.847ms       0.000us         0.00%     784.640us      39.232us           0 b           0 b            20
                                               aten::mm        15.41%     131.956ms        15.79%     135.167ms       3.379ms     784.640us        23.37%     784.640us      19.616us           0 b           0 b            40
                                           aten::linear         0.02%     156.361us        20.58%     176.187ms       8.809ms       0.000us         0.00%     741.760us      37.088us           0 b           0 b            20
                                            aten::addmm        20.25%     173.371ms        20.52%     175.723ms       8.786ms     741.760us        22.10%     741.760us      37.088us           0 b           0 b            20
                                Optimizer.step#SGD.step         0.40%       3.429ms         5.55%      47.509ms       4.751ms       0.000us         0.00%     488.960us      48.896us           0 b           0 b            10
                                    aten::_foreach_add_         4.81%      41.162ms         5.15%      44.080ms       4.408ms     488.960us        14.57%     488.960us      48.896us           0 b           0 b            10
at::native::xpu::MultiTensorApplyKernelFunctor<at::n...         0.00%       0.000us         0.00%       0.000us       0.000us     422.880us        12.60%     422.880us      42.288us           0 b           0 b            10
autograd::engine::evaluate_function: ConvolutionBack...         0.03%     280.041us         4.36%      37.328ms       3.733ms       0.000us         0.00%     356.320us      35.632us           0 b           0 b            10
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 856.227ms
Self XPU time total: 3.357ms
```

This PR updates the XPUCachingAllocator.cpp to report allocation events to the Profiler, and causes these to be printed in the table:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg      Self XPU    Self XPU %     XPU total  XPU time avg       CPU Mem  Self CPU Mem       XPU Mem  Self XPU Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                            gemm_kernel         0.00%       0.000us         0.00%       0.000us       0.000us       1.436ms        43.64%       1.436ms      23.939us           0 b           0 b           0 b           0 b            60
    autograd::engine::evaluate_function: AddmmBackward0         0.13%       1.186ms        29.92%     262.875ms      13.144ms       0.000us         0.00%       1.005ms      50.272us           0 b           0 b     320.94 Mb      -4.69 Mb            20
                                         AddmmBackward0         0.09%     815.288us        16.48%     144.802ms       7.240ms       0.000us         0.00%     790.720us      39.536us           0 b           0 b     325.47 Mb           0 b            20
                                               aten::mm        15.86%     139.342ms        16.26%     142.875ms       3.572ms     790.720us        24.03%     790.720us      19.768us           0 b           0 b     325.47 Mb     325.47 Mb            40
                                           aten::linear         0.02%     182.856us        20.46%     179.775ms       8.989ms       0.000us         0.00%     669.440us      33.472us           0 b           0 b       3.13 Mb           0 b            20
                                            aten::addmm        20.10%     176.607ms        20.40%     179.210ms       8.961ms     669.440us        20.34%     669.440us      33.472us           0 b           0 b       3.13 Mb       3.13 Mb            20
                                Optimizer.step#SGD.step         0.42%       3.692ms         5.61%      49.267ms       4.927ms       0.000us         0.00%     486.640us      48.664us           0 b           0 b           0 b           0 b            10
                                    aten::_foreach_add_         4.83%      42.439ms         5.19%      45.574ms       4.557ms     486.640us        14.79%     486.640us      48.664us           0 b           0 b           0 b     -20.00 Kb            10
at::native::xpu::MultiTensorApplyKernelFunctor<at::n...         0.00%       0.000us         0.00%       0.000us       0.000us     420.960us        12.79%     420.960us      42.096us           0 b           0 b           0 b           0 b            10
autograd::engine::evaluate_function: ConvolutionBack...         0.04%     310.719us         4.47%      39.279ms       3.928ms       0.000us         0.00%     339.520us      33.952us           0 b           0 b      -2.89 Mb      -3.12 Mb            10
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 878.627ms
Self XPU time total: 3.291ms
```

These XPU memory numbers match the same profiling results on CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152842
Approved by: https://github.com/guangyey, https://github.com/sraikund16
2025-05-21 01:19:19 +00:00
Jane Xu
8817e5ac80 Render Example: and not Example:: in docs (#153978)
Everything here is a grep except the changes in tools/autograd/load_derivatives.py which I manually corrected.

The correct notation is:
```
Example::

    >>> ...
```

It is common and wrong to have:
```
Example::
    >>> ...
```

In the wrong example, we get these pesky double colons:
![image](https://github.com/user-attachments/assets/20ffd349-68bb-4552-966c-e23923350476)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153978
Approved by: https://github.com/soulitzer, https://github.com/malfet
2025-05-21 01:03:26 +00:00
atalman
0959869683 Bump triton pin for the release 3.3.1 of triton (#153951)
Triton is pointing to latest triton pin : https://github.com/triton-lang/triton/tree/release/3.3.x
XPU pointing to latest XPU pin: https://github.com/intel/intel-xpu-backend-for-triton/commits/release/3.3.x/

This version contains the fix for: Compilation Issue for RTX 5090 GPUs with Compute Capability = 120. https://github.com/triton-lang/triton/pull/6771
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153951
Approved by: https://github.com/davidberard98
2025-05-21 00:27:39 +00:00
Menglu Yu
32b1baa981 [3/n][Optimus][Auto-AC] Support float8_e4m3fn quantization type and set scaling as the default (#153802)
Summary:
1. Customers now can test with float8_e4m3fn.
2. To play safe, we set the scaling version as the default.

Test Plan:
### unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization
```

Buck UI: https://www.internalfb.com/buck2/f679f362-8bf4-454c-87df-a85cbc2ab2a8
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5066549861047443
Network: Up: 16KiB  Down: 3.9MiB  (reSessionID-98badbfd-76f7-487f-ab1c-1ec4f850614d)
Analyzing targets. Remaining     0/281
Executing actions. Remaining     0/5957                                                                                                   7.3s exec time total
Command: test.     Finished 3 local, 1 remote
Time elapsed: 1:29.7s
Tests finished: Pass 3. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D74910193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153802
Approved by: https://github.com/nareshrajkumar866, https://github.com/Hahu803, https://github.com/Mingming-Ding
2025-05-21 00:21:54 +00:00
Nikita Shulga
58dc80dff6 [MPSInductor] Fix indexing calculation (#153997)
By using `c10:🤘:floor_divie` primitive

Which fixes `test_flip_cat_mps` test, and makes `doctr_reco_predictor` and `doctr_det_predictor` pass accuracy checks (at least locally, scheduled a workflow dispatch to validate it in CI)

Before this change following script generated different compile and eager results
```python
import torch

def foo(unsqueeze, unsqueeze_1):
    cat_1 = torch.ops.aten.cat.default([unsqueeze, unsqueeze_1], 1)
    view = torch.ops.aten.view.default(cat_1, [4])
    slice_5 = torch.ops.aten.slice.Tensor(view, 0, 0, 3)
    rev_1 = torch.ops.aten.flip.default(slice_5, [0])
    return rev_1

if __name__ == "__main__":
    x = torch.arange(1.0, 3.0, device='mps').reshape(2, 1)
    y = torch.arange(5.0, 7.0, device='mps').reshape(2, 1)

    rc, (kernel,) = torch._inductor.utils.run_and_get_kernels(torch.compile(foo), x, y)
    print(kernel)
    print("Compile: ", rc)
    print("Eager: ", foo(x, y))
```
After this change
```
'''
    #include <c10/metal/utils.h>
    kernel void generated_kernel(
        device float* out_ptr0,
        constant float* in_ptr0,
        constant float* in_ptr1,
        uint xindex [[thread_position_in_grid]]
    ) {
        int x0 = xindex;
        auto tmp6 = in_ptr0[1 + (c10:🤘:floor_divide((-1)*x0, 2))];
        auto tmp11 = in_ptr1[1 + (c10:🤘:floor_divide((-1)*x0, 2))];
        auto tmp0 = (2 + ((-1)*x0)) % (2);
        auto tmp1 = static_cast<long>(tmp0);
        auto tmp2 = 0;
        auto tmp3 = tmp1 >= tmp2;
        auto tmp4 = 1;
        auto tmp5 = tmp1 < tmp4;
        auto tmp7 = tmp5 ? tmp6 : 0.0;
        auto tmp8 = tmp1 >= tmp4;
        auto tmp9 = 2;
        auto tmp10 = tmp1 < tmp9;
        auto tmp12 = tmp8 ? tmp11 : 0.0;
        auto tmp13 = tmp5 ? tmp7 : tmp12;
        out_ptr0[x0] = static_cast<float>(tmp13);
    }
'''
Compile:  tensor([2., 5., 1.], device='mps:0')
Eager:  tensor([2., 5., 1.], device='mps:0')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153997
Approved by: https://github.com/dcci
ghstack dependencies: #153970, #153971
2025-05-21 00:03:46 +00:00
Jane Xu
fc33da410f Add torch/header_only_apis.txt and enforce they're tested (#153635)
This PR adds enforcement of testing header only APIs.

The benefit of torch/header_only_apis.txt is twofold:
1) this gives us a clear view of what we expect to be header only
2) this allows us to enforce testing

The enforcement added in this PR is very basic--we literally string match that a symbol in `torch/header_only_apis.txt` is in a cpp test. This is meant to be a first step in verifying our APIs are properly tested and can get fancier over time. For now, I've added myself as a codeowner to learn what to look out for in terms of proper tests. Over time, I anticipate we can automate more steps, but right now let's just get something out the door.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153635
Approved by: https://github.com/albanD
ghstack dependencies: #153965
2025-05-20 23:42:24 +00:00
Jane Xu
41a9aa6564 Remove janky (though at times useful) dlclose test (#153975)
This test was never the shining star in class but it helped check that we can properly delete a stable library. But now that we are running it in CI this is not a good test to annoy people with as dlclose + parallelism is likely not the move. I will miss it locally though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153975
Approved by: https://github.com/jbschlosser
2025-05-20 23:26:42 +00:00