Commit Graph

85528 Commits

Author SHA1 Message Date
PyTorch MergeBot
afa1eda901 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit ef6296e7f2.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/izaitsevfb due to reverted internally, see D71292427 ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2731114626))
2025-03-17 22:43:15 +00:00
Yanan Cao (PyTorch)
a16ada41b9 Fix outdated docstring of torch.export.export regarding strict flag (#149077)
Summary: Fix outdated docstring of torch.export.export regarding strict flag

Test Plan: None, doc only change

Differential Revision: D71068215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149077
Approved by: https://github.com/zhxchen17
2025-03-17 22:29:20 +00:00
Sheng Qin
d25617255c Fix AOTI update_constant_buffer issue. (#149243)
Summary:
In D69553929 we changed the logic of constant & buffer updates in AOTI. However, this is incompatible with the current Sigmoid runtime, since we use different logic to pass in buffers, resulting in errors like
```
I0310 17:29:24.456960 3679102 AOTIDelegateExecutor.cpp:89] AOTIDelegateExecutor processing weights
*** Aborted at 1741652964 (Unix time, try 'date -d 1741652964') ***
*** Signal 11 (SIGSEGV) (0x30) received by PID 3679102 (pthread TID 0x7f9933e49000) (linux TID 3679102) (code: address not mapped to object), stack trace: ***
    @ 00000000000040b9 folly::symbolizer::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)
                       ./fbcode/folly/debugging/symbolizer/SignalHandler.cpp:453
    @ 0000000000006c45 folly::fibers::(anonymous namespace)::sigsegvSignalHandler(int, siginfo_t*, void*)
                       ./fbcode/folly/fibers/GuardPageAllocator.cpp:237
    @ 000000000004455f (unknown)
                       /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/libc_sigaction.c:8
                       -> /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c
    @ 00000000001e8164 torch::aot_inductor::AOTInductorModelContainer::update_constant_buffer(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, AtenTensorOpaque*, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, AtenTensorOpaque*> > > const&, bool, bool)
```

Test Plan:
1) Generate lowered merge net
```
CUDA_VISIBLE_DEVICES=0 ../buck-out/v2/gen/fbcode/b5b13003c82cbdec/caffe2/torch/fb/model_transform/fx2trt/packaging/__generate_merge_net_file__/generate_merge_net_file.par  --action=generate --input-file=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_input --output-file=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_sigmoid --lower-backend=aot_inductor  --use_sigmoid=true --aot_inductor_config="{'max_autotune': True, 'comprehensive_padding': False}" --add_passes=use_matmul_lce_replace_normal_LCE,use_triton_dot_compress,use_matmul_fuse_lce_replace_first_LCE,use_contiguous_linear_reduction_replace_linear_reduction --disable_acc_tracer=false
```

2) Load net predictor
```
CUDA_VISIBLE_DEVICES=1 ../buck-out/v2/gen/fbcode/103717df3cc2b97a/caffe2/torch/fb/model_transform/fx2trt/packaging/__load_net_predictor__/load_net_predictor --loadMode=AccuracyAB --inputNetFile=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_ts --otherNetFile=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_sigmoid --moduleName=merge --benchmarkEnableProfiling=false --predictor_hardware_type=1 --disableStaticRuntime=true
```

Reviewed By: hl475

Differential Revision: D71236710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149243
Approved by: https://github.com/hl475, https://github.com/jingsh
2025-03-17 22:10:57 +00:00
Isuru Fernando
a3c6e3139a allow extra args for parameterization of tests in inductor (#149154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149154
Approved by: https://github.com/amjames, https://github.com/eellison
2025-03-17 22:05:06 +00:00
Davide Italiano
e4f6e4ac84 [MPS] Add inductor support for modified_bessel_i0. (#149342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149342
Approved by: https://github.com/malfet
2025-03-17 21:45:51 +00:00
Carlo Bertolli
8bc7bd94a5 [ROCm] Input vectorization in elementwise kernels for tensors with heterogeneous types (#147527)
This patch demonstrates its use for input tensors with types (float, bfloat16) when the functor type is float(float, float).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147527
Approved by: https://github.com/jeffdaily

Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
2025-03-17 20:51:36 +00:00
Benjamin Glass
e8dd58b8cf cpp_wrapper: Precompile device-specific header files (#146928)
This saves us about a second per compilation, which is _massive_ for the OpInfo tests.  Total OpInfo test runtime is down about 2x from this change alone.

Relands #144002, with changes needed by fbcode internals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146928
Approved by: https://github.com/desertfire
2025-03-17 20:40:15 +00:00
Sampsa
5e9f792479 [ROCm] Unskip flex attention UTs after triton 3.3 bump (#148327)
Enable `test_flex_attention.py::TestLearnableBiases` unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148327
Approved by: https://github.com/jeffdaily
2025-03-17 20:15:14 +00:00
Shunting Zhang
6c7d8419e3 fix two accuracy regressions (#149172)
There are two accuracy regressions in the 3/12 nightly perf run. I cannot repro them locally, so there is no effective way to bisect. Raise the tolerance to make them pass the accuracy check.

- error log for HF MegatronBertForQuestionAnswering https://gist.github.com/shunting314/25322b66e15e98feed32e0d9a1e43316
- error log for TIMM gluon_inception_v3 https://gist.github.com/shunting314/df64ce22327df27a7057bbbd19ef5164
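
For context, a minimal sketch of the kind of tolerance-based accuracy check involved (values here are illustrative, not the harness's actual tolerances):

```python
# A borderline comparison that fails at a tight tolerance passes once rtol/atol are raised.
import torch

ref = torch.randn(64, 64, dtype=torch.float64)
res = ref + 5e-3 * torch.randn_like(ref)  # simulated numerical drift between backends

print(torch.allclose(res, ref, rtol=1e-4, atol=1e-4))  # False at the tight tolerance
print(torch.allclose(res, ref, rtol=1e-1, atol=1e-1))  # True once tolerance is raised
```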

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149172
Approved by: https://github.com/jansel, https://github.com/eellison
2025-03-17 19:34:00 +00:00
Pat Vignola
769f19bf95 [MTIA] Add _mtia_exchangeDevice to MTIA module (#149322)
Summary: The FlexAttention path uses `_exchange_device`, so it will be needed eventually for MTIA as well.
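
A hedged sketch of the exchange-device pattern in question; the guard below emulates it in Python using the CUDA APIs, while the real MTIA hook is a C-level binding:

```python
# Exchange-device pattern: set a new device, remember the old one, restore it on exit.
import torch

class ExchangeDeviceGuard:
    def __init__(self, new_index: int):
        self.new_index = new_index
        self.prev_index = -1

    def __enter__(self):
        self.prev_index = torch.cuda.current_device()  # "exchange": save old, set new
        torch.cuda.set_device(self.new_index)
        return self

    def __exit__(self, *exc):
        torch.cuda.set_device(self.prev_index)         # restore the previous device
```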

Test Plan: `buck2 test fbcode//mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- test_exchange_device`

Reviewed By: chaos5958

Differential Revision: D70072059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149322
Approved by: https://github.com/chaos5958
2025-03-17 19:31:10 +00:00
angelayi
8d7c430e84 Symintify transpose_ (#149057)
Fixes https://github.com/pytorch/pytorch/issues/148702
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149057
Approved by: https://github.com/yushangdi
2025-03-17 19:11:54 +00:00
Fadi Arafeh
08a644a4c4 Enable fast qlinear static/dynamic path for AArch64 through ACL directly (#148585)
This enables a fast path for eager mode static/dynamic quantization for AArch64 through Arm Compute Library (ACL) directly.

Context: PRs #126687 and #139887 enabled an optimized implementation for `qlinear` and `qlinear_dynamic` on aarch64 through `ideep → oneDNN → ACL`, which improved performance by ~10x compared to the previous implementation.
However, the current `qlinear` and `qlinear_dynamic` path (`ideep → oneDNN → ACL`) suffers from high overhead due to the API friction between the stateless oneDNN API and the stateful ACL low-precision GEMM (`lowp_gemm`) API: for example, ACL's `lowp_gemm` objects cache information such as the weights' reduction, or the weights in an optimized memory format, which oneDNN does not allow due to its stateless nature.
Hence, ACL currently runs a (redundant) sum of columns and a pre-transposition (to the GEMM kernel's optimal format) for each GEMM operation.
This PR addresses the sub-optimalities above by integrating ACL directly with `qlinear` and `qlinear_dynamic`.

- **For `qlinear_dynamic` (dynamically quantized matmuls):**

This PR yields an **average speedup of ~50%** (averaged over context lengths of 2^3 up to 2^9) for `bert-base-uncased`, `bert-large-uncased`, `roberta-base`, and `distilbert-base-uncased` with 16 threads on a Neoverse-V1 (with transformers==4.48) for the benchmarking script below:
```
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
from transformers import AutoModel, AutoConfig
import time
import numpy as np
from argparse import ArgumentParser

class ModelArgumentParser(ArgumentParser):
    def __init__(self) -> None:
        super().__init__(description="huggingface model")
        self.add_argument("--context_length",
                            help="context length - number of input tokens",
                            type=int,
                            default=64
        )
        self.add_argument("--model",
                            help="model checkpoint - i.e. 'bert-base-uncased'",
                            type=str,
                            default=None)
        self.add_argument("--iters",
                          help="benchmark iterations",
                          default=500)

if __name__ == "__main__":
    parser = ModelArgumentParser()
    args = parser.parse_args()
    model_name = args.model
    config = AutoConfig.from_pretrained(model_name)
    batch_size = 1
    model = AutoModel.from_pretrained(model_name)
    model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    model.eval()
    inputs = torch.randint(config.vocab_size, (batch_size, args.context_length), dtype=torch.long, device="cpu")
    times = []
    with torch.no_grad():
        # warmup
        for _ in range(10):
            model(inputs)
        # benchmark
        for _ in range(args.iters):
            s = time.time_ns()
            model(inputs)
            times.append((time.time_ns() - s) / 1e6)

    print("Model = ", model_name)
    print("Context Length = ", args.context_length)
    print("Min (ms) = ", min(times))
    print("Mean (ms) = ", np.mean(times))
```

- **For `qlinear` (statically quantized matmuls):**

This PR yields an **average speedup of 2x for signed activations (`s8s8s8`) and 95x for unsigned activations (`u8s8u8`)** on a Neoverse-V1 with 16 threads for the benchmarking script below.
The averages are over all combinations of `M = [8, 16, ..., 512]`, `K = [768, 1024, 2048, 4096]`, `N = [768, 1024, 2048, 4096]`.
The astronomical speedup for unsigned activations is because oneDNN v3.7 does not have an optimized implementation for `u8s8u8` on AArch64.

```
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
import torch.nn as nn
from torch.quantization import QConfig
from torch.ao.quantization.observer import HistogramObserver, default_weight_observer
import torch
import torch.nn as nn
import numpy as np
import random
from argparse import ArgumentParser
import time

class ModelArgumentParser(ArgumentParser):
    def __init__(self) -> None:
        super().__init__()
        self.add_argument("--M",
                            help="M dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--K",
                            help="K dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--N",
                            help="N dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--signed_input",
                            help="Use (signed) torch.qint8 for inputs instead of (unsigned) torch.quint8",
                            action="store_true"
        )
        self.add_argument("--seed",
                          help="Random seed",
                          type=int,
                          default=42
        )
        self.add_argument("--iters",
                          help="benchmark iterations",
                          default=500)

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

class LinearModel(nn.Module):
    def __init__(self, K, N):
        super(LinearModel, self).__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(K, N)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.fc(x)
        x = self.dequant(x)
        return x

def quantize_model(model, args):
    qconfig = QConfig(
            activation=HistogramObserver.with_args(reduce_range=False,
            dtype=torch.qint8 if args.signed_input else torch.quint8),
            weight=default_weight_observer,
    )
    # Attach the quantization configuration and prepare the model for static quantization
    model.qconfig = qconfig
    model_prepared = torch.quantization.prepare(model)  # use the passed-in model, not the global

    # Calibrate the model with sample inputs
    # Example input data for calibration
    with torch.no_grad():
        sample_data = torch.randn(args.M, args.K)
        model_prepared(sample_data)
    # Convert the prepared model to a quantized model
    model_quantized = torch.quantization.convert(model_prepared)
    return model_quantized

if __name__ == "__main__":
    parser = ModelArgumentParser()
    args = parser.parse_args()

    set_seed(args.seed)
    model_fp32 = LinearModel(args.K, args.N)
    model_quantized = quantize_model(model_fp32, args)

    inputs = torch.randn(args.M, args.K)
    times = []
    with torch.no_grad():
        # warmup
        for _ in range(10):
            model_quantized(inputs)
        # benchmark
        for _ in range(args.iters):
            s = time.time_ns()
            model_quantized(inputs)
            times.append((time.time_ns() - s) / 1e6)

    print("M,K,N,signed = ", args.M, args.K, args.N, args.signed_input)
    print("Min Times (ms) = ", min(times))
    print("Mean Times (ms) = ", np.mean(times))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148585
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-17 18:21:10 +00:00
Isuru Fernando
c41c2130be Fix printing INT64_MIN (#149148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149148
Approved by: https://github.com/anijain2305
2025-03-17 17:57:18 +00:00
Yichen Yan
8cdb9adc05 do not run test_ck_blas_library on cpu (#148316)
Fixes the following failure on non-ROCm builds:

```
root@e01-tw-ue5g2g3sap6:~/pytorch/test# python test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu
E
======================================================================
ERROR: test_ck_blas_library_cpu (__main__.TestLinalgCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
    method(*args, **kwargs)
  File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 480, in instantiated_test
    raise rte
  File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 460, in instantiated_test
    result = test(self, **param_kwargs)
  File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 1242, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/root/pytorch/torch/testing/_internal/common_utils.py", line 1981, in _fn
    fn(*args, **kwargs)
  File "/root/pytorch/test/test_linalg.py", line 8621, in test_ck_blas_library
    torch.backends.cuda.preferred_blas_library('ck')
  File "/root/pytorch/torch/backends/cuda/__init__.py", line 258, in preferred_blas_library
    torch._C._set_blas_preferred_backend(_BlasBackends[backend])
RuntimeError: Cannot set preferred backend to Ck if PyTorch has not been compiled for ROCm.

To execute this test, run the following from the base repo dir:
    python test/test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.346s

FAILED (errors=1)
```
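
A minimal sketch of the guard this effectively adds (plain `unittest` here; PyTorch's suite uses its own device-type decorators):

```python
# Gate a ROCm-only test so it never runs on CPU/CUDA-only builds.
# torch.version.hip is None unless PyTorch was compiled for ROCm.
import unittest
import torch

class TestLinalg(unittest.TestCase):
    @unittest.skipIf(torch.version.hip is None, "CK requires a ROCm build of PyTorch")
    def test_ck_blas_library(self):
        torch.backends.cuda.preferred_blas_library("ck")
```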
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148316
Approved by: https://github.com/jeffdaily
2025-03-17 17:45:45 +00:00
Catherine Lee
224cd9f055 [ez] Flush trymerge print statements (#149012)
Logs of trymerge don't match up with their timestamps; for example, in
https://github.com/pytorch/pytorch/actions/runs/13766246347/job/38493307591:
```
2025-03-10T14:20:41.4899509Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (0.003460856278737386 minutes elapsed)
...
2025-03-10T14:20:41.4907867Z Merge of https://github.com/pytorch/pytorch/pull/148648 failed due to: Still waiting for 16 jobs to finish, first few of them are: Check Labels / Check labels, trunk / macos-py3-arm64 / build, trunk / win-vs2022-cpu-py3 / build, trunk / cuda12.4-py3.10-gcc9-sm80 / build, trunk / win-vs2022-cuda12.6-py3 / build. Retrying in 5 min
2025-03-10T14:20:41.4909772Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (5.280085611343384 minutes elapsed)
...
2025-03-10T14:20:41.4916812Z Merge of https://github.com/pytorch/pytorch/pull/148648 failed due to: Still waiting for 15 jobs to finish, first few of them are: trunk / macos-py3-arm64 / build, trunk / win-vs2022-cpu-py3 / build, trunk / cuda12.4-py3.10-gcc9-sm80 / build, trunk / win-vs2022-cuda12.6-py3 / build, trunk / linux-focal-cuda12.6-py3.10-gcc11-no-ops / build. Retrying in 5 min
2025-03-10T14:20:41.4918183Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (10.590279157956441 minutes elapsed)
```

Either print buffering or GitHub Actions logging is being weird.

Print with flush=True to see if it helps.
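
For reference, a sketch of the change: when stdout is a pipe (as under GitHub Actions), Python block-buffers output, so the collector's timestamps reflect flush time rather than call time.

```python
# flush=True forces each line out immediately, so CI timestamps line up with the print.
import time

print(f"[{time.strftime('%H:%M:%S')}] Attempting merge...", flush=True)
time.sleep(1)
print(f"[{time.strftime('%H:%M:%S')}] Retrying in 5 min", flush=True)
```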
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149012
Approved by: https://github.com/malfet
2025-03-17 17:04:48 +00:00
Rachel Guo
aaa4c3d60b [mm_logs] make aten mm info readable (#148800)
Summary:
As the title says: render the aten mm info as a readable table, e.g. (see also the screenshot in the test plan):

| Name    | M  | N | K  | Count |
|---------|----|---|----|-------|
| aten.mm | 16 | 6 | 16 | 1     |
| ...     |    |   |    |       |

Test Plan: {F1975907876}
<img width="1090" alt="Screenshot 2025-03-11 at 3 13 00 PM" src="https://github.com/user-attachments/assets/ffae8c56-e32c-49cc-bbfb-5b8d216b8657" />

Differential Revision: D70825664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148800
Approved by: https://github.com/henrylhtsang
2025-03-17 17:00:58 +00:00
Xinya Zhang
2a011ca904 [ROCm] testing: enable MEFF/FA unittests for gfx1100 (#148911)
Include gfx1100, and optionally enable gfx1201/gfx950 according to env var TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148911
Approved by: https://github.com/jeffdaily
2025-03-17 16:41:15 +00:00
PyTorch MergeBot
9d37b501db Revert "[ROCm] enable HIPMallocAsyncAllocator (#149145)"
This reverts commit 2e02c07a5d.

Reverted https://github.com/pytorch/pytorch/pull/149145 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally.  @albanD, might you be able to help get this PR landed? See D71214814 for more details on the failure. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/149145#issuecomment-2730104736))
2025-03-17 16:17:02 +00:00
Yu, Guangye
c7c3e77324 Refine XPU oneDNN context manager API (#147349)
# Motivation
This PR introduces improvements to the XPU oneDNN context manager API:

- `GpuEngineManager::get_engine`: Added a new API that accepts a `DeviceIndex` to simplify code and improve usability - by default, using the current device index.
- `GpuStreamManager::get_stream`: Now explicitly requires a `DeviceIndex` as input to ensure correctness and consistency - by default, using the current device index.

Additionally, it enhances integration with `c10::DeviceGuard`, ensuring correct device management.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147349
Approved by: https://github.com/EikanWang
2025-03-17 14:45:56 +00:00
PyTorch UpdateBot
790f93db3a Update slow tests (#149300)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149300
Approved by: https://github.com/pytorchbot
2025-03-17 11:39:29 +00:00
Sun, Jiayi
b2862f1435 optimize the decomposition of aten.native_group_norm (#144733)
Summary:
Optimize the decomposition of aten.native_group_norm. Reduce unnecessary repeated operations by changing the order of operations for `mean`, `rstd`, `weight`, `bias`, and `input`, which can improve performance when `flattened_inner_size` is large.

The original decomposition:
1. compute `mean` and `rstd`,
2. out = (x - mean) * rstd, computed over the range [N, C, *],
3. out = out * weight + bias, computed over the range [N, C, *].

The new decomposition:
1. compute `mean` and `rstd`,
2. new_weight = rstd * weight, new_bias = -mean * rstd * weight + bias, computed over the range [N, C],
3. out = x * new_weight + new_bias, computed over the range [N, C, *].
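
A quick numeric check (a sketch with made-up shapes and eps) that the rearranged affine form matches the original:

```python
# (x - mean) * rstd * weight + bias == x * new_weight + new_bias
# where new_weight = rstd * weight and new_bias = bias - mean * rstd * weight.
import torch

N, C, HW, groups, eps = 2, 6, 16, 3, 1e-5
x = torch.randn(N, C, HW)
xg = x.view(N, groups, -1)
mean = xg.mean(-1).repeat_interleave(C // groups, dim=1).unsqueeze(-1)   # [N, C, 1]
rstd = (xg.var(-1, unbiased=False) + eps).rsqrt()
rstd = rstd.repeat_interleave(C // groups, dim=1).unsqueeze(-1)          # [N, C, 1]
weight, bias = torch.randn(C, 1), torch.randn(C, 1)

out_old = (x - mean) * rstd * weight + bias   # affine applied over [N, C, *]
new_weight = rstd * weight                    # computed once over [N, C]
new_bias = bias - mean * rstd * weight
out_new = x * new_weight + new_bias

print(torch.allclose(out_old, out_new, atol=1e-5))  # True
```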

I tested the Inductor performance benchmark with this PR on both CPU and A100. On CPU, two torchbench models (functorch_dp_cifar10 and opacus_cifar10) show about a 25% performance improvement, and two diffusion models (Stable Diffusion and Latent Consistency Model (LCM)) show about a 2% performance improvement. On A100, no performance gains or regressions were seen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144733
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-17 09:27:01 +00:00
zeshengzong
1cc5f6b623 Optimize MaxPool1d param ceil_mode description (#148869)
Fixes #148123

Add the output shape formula based on the `ceil_mode` value, according to

00199acdb8/aten/src/ATen/native/Pool.h (L61-L75)
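
As a sketch, the formula being documented (mirroring the referenced Pool.h logic; the final clamp is the detail the `ceil_mode` docs previously glossed over):

```python
# Output length of MaxPool1d, following the referenced Pool.h computation.
import math

def maxpool1d_out_len(L_in, kernel_size, stride, padding=0, dilation=1, ceil_mode=False):
    numer = L_in + 2 * padding - dilation * (kernel_size - 1) - 1
    round_fn = math.ceil if ceil_mode else math.floor
    L_out = round_fn(numer / stride) + 1
    # With ceil_mode, the last window must start inside the input or left padding.
    if ceil_mode and (L_out - 1) * stride >= L_in + padding:
        L_out -= 1
    return L_out

print(maxpool1d_out_len(10, kernel_size=3, stride=2))                  # 4
print(maxpool1d_out_len(10, kernel_size=3, stride=2, ceil_mode=True))  # 5
```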

## Test Result

### Before

![image](https://github.com/user-attachments/assets/0a175178-a104-4348-a14b-516e866d533a)

### After

![image](https://github.com/user-attachments/assets/ce621d4b-1986-41fb-bd71-2b03c0aa996e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148869
Approved by: https://github.com/mikaylagawarecki
2025-03-17 08:50:40 +00:00
soulitzer
916e8979d3 Skip some tests not using gradcheck on slowgradcheck (#149220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149220
Approved by: https://github.com/seemethere
2025-03-17 00:34:52 +00:00
eqy
6048d88afe [ARM64][CUDA] skip string pattern matching in test_workspace_allocation_error (#149236)
`unwind()` on ARM64 seems to elide the strings of interest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149236
Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/BoyuanFeng
2025-03-17 00:30:43 +00:00
Aaron Gokaslan
bfee141666 [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257)
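
An illustrative before/after for the rule (PERF403 flags dicts built element-by-element in a loop):

```python
# Before: manual dict construction in a loop, which PERF403 flags.
squares = {}
for i in range(10):
    squares[i] = i * i

# After: the equivalent dict comprehension, faster and clearer.
squares = {i: i * i for i in range(10)}
```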

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257
Approved by: https://github.com/jansel
2025-03-16 23:52:58 +00:00
Tugsbayasgalan Manlaibaatar
6b1b95ad2a Support subclass constructor capturing in export (#147014)
Notable TODOs:
1. Need to implement AutogradHOP to get rid of subclasses before serializing
2. Need to implement mechanism to figure out what subclasses will be used in export when they are not expressed in the inputs

Differential Revision: [D69640673](https://our.internmc.facebook.com/intern/diff/D69640673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147014
Approved by: https://github.com/bdhirsh
2025-03-16 18:19:19 +00:00
Animesh Jain
5905bbe745 [dynamo][guards][serialization] Don't use ID_MATCH guard for bool and None (#149228)
Doing this removes the need to collect `id`s and therefore facilitates serialization. It also improves readability of recompilation messages: earlier, the recompile message would just show the `id`.
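
A sketch of the contrast (illustrative, not Dynamo's actual guard code): an ID_MATCH guard pins a process-specific `id()`, which is opaque in logs and unserializable, while a constant check is both readable and portable:

```python
flag = True

# ID_MATCH-style guard: depends on a process-specific id(), so it cannot be serialized.
guard_id = lambda x, saved=id(flag): id(x) == saved

# Constant guard for bool/None: stable across processes and self-explanatory in logs.
guard_const = lambda x: x is True
```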

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149228
Approved by: https://github.com/jansel
2025-03-16 15:56:17 +00:00
Davide Italiano
9f33c6f0a0 [MPS] Add support for modified_bessel_i0 in eager. (#149264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149264
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-16 04:45:49 +00:00
Nikita Shulga
f80bee4934 [MPS][BE] Move common binary ops macros to indexing.h (#149263)
And binary op invocation logic to OperationUtils.mm

This is a no-op change, additional sanity checks/logic improvements will be added as followups
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149263
Approved by: https://github.com/dcci
ghstack dependencies: #149262
2025-03-16 02:06:40 +00:00
Davide Italiano
21c2edfec8 [MPS/metal] Add missing inline to function definitions. (#149265)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149265
Approved by: https://github.com/malfet
2025-03-16 00:33:27 +00:00
Nikita Shulga
3e2c4086ad [EZ][BE] Reuse result_of from c10/metal/utils.h (#149262)
No need for one more implementation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149262
Approved by: https://github.com/dcci
2025-03-16 00:21:28 +00:00
Sam Larsen
acf42b0048 Fix memory leak in subproc_pool future (#149259)
Summary: The future holds a reference to the callback, and the callback captures the outer future. Seems to create a cycle that the garbage collector doesn't clean up. Verified by compiling 15k synthetic Triton kernels and observing that subprocess memory overhead improves.
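
A sketch of the reference cycle and one way to break it (the `weakref` approach below is illustrative; the actual fix may simply avoid storing the reference):

```python
# Cycle: outer future -> done-callback -> closure over outer future -> outer future.
import weakref
from concurrent.futures import Future

def chain(inner: Future, outer: Future) -> None:
    outer_ref = weakref.ref(outer)          # weak edge breaks the cycle

    def on_done(fut: Future) -> None:
        target = outer_ref()
        if target is not None:
            target.set_result(fut.result())

    inner.add_done_callback(on_done)

inner, outer = Future(), Future()
chain(inner, outer)
inner.set_result(42)
print(outer.result())  # 42
```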

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149259
Approved by: https://github.com/Skylion007
2025-03-15 20:26:30 +00:00
James Wu
a9c55277d7 [Reland] First version of statically compiled launcher for triton compiled CUDA kernels (#149238)
This is a new version of https://github.com/pytorch/pytorch/pull/148561 fixing the ROCM test failure

Putting this up for a first pass review, though I will likely make a bunch of changes before landing to add more features, etc.

This diff implements a first version of a static CUDA kernel launcher in `torch._C`. The goal here is to take a cubin file and some metadata from a CompiledKernel from `triton`, and launch the cubin file directly.

Background doc: https://docs.google.com/document/d/1rjRcHl6MfauHG30nCoQX-9UKvKyIs4WWMy_GsGyqb9g/edit?tab=t.0#heading=h.ut5lf39lzq66

Normally, using triton's CompiledKernel.make_launcher(), we would pay the cost of codegenning C++ and running it at compile time. With this new approach, we can use one statically compiled library to launch the kernel.

The tradeoff here is that this new kernel launcher will not be able to use codegen to deal with different lengths/types of arguments. So we use templating to handle up to 10 arguments for now. We also allocate 8 bytes on the stack per argument no matter the argument type, which can take more memory than codegenning. On the other hand, we improve compile time on cold and warm start by not having to call the C++ compiler at all.

This diff does not add the launcher to torch, but introduces a basic test suite.

A list of TODOs that are not yet complete:
- Handle `nvTmaDesc` and `cuTensorMap`, which triton handles
- Embed the grid logic instead of passing in gridX,Y,Z
- Handle launch_enter and exit hooks? (Not sure if inductor has these)
- Benchmarking to see if there's runtime performance loss
- Probably lots of features of the triton C++ generated code that I haven't handled yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149238
Approved by: https://github.com/oulgen
2025-03-15 15:06:46 +00:00
Sam Larsen
c83c711da8 Remove some memory overhead in parallel compile workers (#149168)
Summary: The parallel compile workers are holding on to more memory than they need to because they're loading the compiled modules into memory. Update the post-fork initializer to record when in a subprocess and skip some of the unnecessary overhead.

Test Plan: Ran a test script to compile 15k Triton kernels and used tracemalloc in the subprocs to investigate the overhead. On my devgpu:
* After importing torch in a subproc: 371M
* Without this PR, after compiling 15k kernels: 825M
* With this PR, after compiling 15k kernels: 531M

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149168
Approved by: https://github.com/jansel
2025-03-15 14:20:40 +00:00
Huamin Li
e7e477c1f9 Don't generate custom obj json when it's empty (#149246)
Summary: as title.

See internal Diff summary for more context.

Test Plan: buck run @fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r config_not_generated

Differential Revision: D71241676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149246
Approved by: https://github.com/houseroad

Co-authored-by: Huamin Li <huaminli@meta.com>
2025-03-15 13:00:48 +00:00
Lirong
4482a65fef Add side_effect to avoid dce custom op in CA graph (#149181)
We found that in compiled_autograd, when defining a custom op, the custom op would be DCE'd in the backward graph. We added a side-effect condition to the DCE function to prevent eliminating custom ops with side effects in the CA graph.
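
A sketch of the shape of that check (illustrative; the real condition lives in compiled autograd's DCE pass). An unused node is erased only if it is known to be free of side effects:

```python
# Generic DCE over a torch.fx graph: keep unused nodes that have side effects.
import torch

def dce_with_side_effects(graph: torch.fx.Graph, has_side_effect) -> None:
    for node in reversed(list(graph.nodes)):
        if node.op == "call_function" and not node.users and not has_side_effect(node):
            graph.erase_node(node)  # safe to drop: unused and pure
```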

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149181
Approved by: https://github.com/xmfan
2025-03-15 04:15:49 +00:00
Wenjie Yang
115fc98cc0 Migrate aten.split.Tensor from using Sharding Rule to Sharding Strategy (#149106)
Summary:
Use Sharding Strategy for aten.split.Tensor instead of sharding rule

Test Plan:
pytest test/distributed/tensor/test_dtensor_ops.py -s -k split

Reviewers:
xilunwu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149106
Approved by: https://github.com/XilunWu, https://github.com/tianyu-l
2025-03-15 04:03:40 +00:00
Jane Xu
740ce0fa5f op should NOT be static in aoti_torch_call_dispatcher (#149208)
aoti_torch_call_dispatcher is meant to call different ops, so the op must not be static. Otherwise, every call to this API will call the first op that was ever called, which is not the intended behavior of any human being.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149208
Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/malfet
2025-03-15 01:47:11 +00:00
Simon Fan
578160c875 [ca] don't inline accumulate grad op (#149014)
We use dummy tensors in our initial trace, so we should never inline. The subclass dispatch might not support the dummy tensor; e.g., DTensor accumulate grad will check that both param and grad are DTensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149014
Approved by: https://github.com/jansel
ghstack dependencies: #149064
2025-03-15 01:10:54 +00:00
Simon Fan
f4368d8872 [ca] clean up aot node deduping (#149064)
Rename the AOT nodes as we copy-paste them into the CA graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149064
Approved by: https://github.com/jansel
2025-03-15 01:10:54 +00:00
Nikita Shulga
96795e9533 [BE] Parametrize TestMPS.test_binops_dtype_precedence (#149234)
No-op change; just splits a longer test into a series of smaller ones.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149234
Approved by: https://github.com/atalman, https://github.com/dcci
ghstack dependencies: #149216, #149233
2025-03-15 00:37:11 +00:00
Jithun Nair
1c7196f04b Add new GHA workflow to cache ROCm CI docker images on MI300 CI runners periodically (#148394)
Refiling https://github.com/pytorch/pytorch/pull/148387 from pytorch repo branch to get AWS login via OIDC working

Successful docker caching run: https://github.com/pytorch/pytorch/actions/runs/13843689908/job/38737095535
Run without cached docker image: https://github.com/pytorch/pytorch/actions/runs/13843692637/job/38746033460
![image](https://github.com/user-attachments/assets/c410ff35-a150-4885-b904-3a5e1888c032)
Run with cached docker image:
![image](https://github.com/user-attachments/assets/41e417b5-a795-4ed2-a9cd-00151db8f813)
~6 min vs 3 s :)

Thanks @saienduri for the help on the MI300 infra side

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148394
Approved by: https://github.com/jeffdaily
2025-03-15 00:34:04 +00:00
xinan.lin
9ad6265d04 [AOTI][XPU] Fix: model_container_runner_xpu.cpp is not built into libtorch_xpu.so (#149175)
The omission of model_container_runner_xpu.cpp causes a compilation failure when users build a C++ inference application on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149175
Approved by: https://github.com/jansel
2025-03-15 00:30:04 +00:00
yifanmao
7537b19c73 [FSDP2] Update ignored_params docstring and add unit test (#149074)
Fixes https://github.com/pytorch/pytorch/issues/148242

ignored_params won't be moved to devices in fully_shard(); update the docstring.
Add unit test `test_move_states_to_device_ignored_param_device` to show that ignored_params won't be moved during fully_shard(), but will be after `model.cuda()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149074
Approved by: https://github.com/awgu
2025-03-15 00:23:09 +00:00
maajidkhann
09f7f62cfe Fix atomic operation compatibility for ARMv8-A (Raspberry Pi 4) by adjusting compilation flags (#148070)
**Issue:**
* The ldaddal instruction is an AArch64 atomic operation available from ARMv8.1-A onwards.
* Raspberry Pi 4 (Cortex-A72) is ARMv8-A, which does not support ldaddal, leading to failures when running PyTorch built with march=armv8.2-a+sve
* This led to an issue when running PyTorch on ARMv8-A (Raspberry Pi 4), as unsupported atomic operations were generated.

**Fix:**
* Updated the build flags to explicitly use **-march=armv8-a+sve**, ensuring GCC and Clang promote it correctly; this resolves compatibility issues with ARMv8-A while still working for SVE as before.
* This ensures that PyTorch builds correctly for ARMv8-A platforms (e.g., Raspberry Pi 4) while still enabling SVE for supported hardware.

Test plan:
 - Allocate `a1.4xlarge` on AWS
 - Run following script using wheel produced by this PR
 ```python
import torch
def f(x):
    return x.sin() + x.cos()

print(torch.__version__)
f_c = torch.jit.script(f)
```
- Observe no crash
```
$ python3 foo.py
2.7.0.dev20250313+cpu
```
- Observe crash with 2.6.0
```
$ python3 foo.py
2.6.0+cpu
Illegal instruction (core dumped)
```

Fixes #146792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148070
Approved by: https://github.com/malfet
2025-03-15 00:02:38 +00:00
Nikita Shulga
08af311fc2 [MPS] Fix type promotion for torch.floor_divide (#149233)
And delete some duplicated glue code by relying on the stub.
After this change, `torch.arange(10, device='mps') // torch.arange(10., device='mps')` will return a tensor of floats, which is the common dtype for a float + integral operation, rather than a tensor of ints.
Checked by `test_div2` in inductor testing.
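
A small sketch of the promotion behavior in question (shown on CPU, where it is the established behavior):

```python
# int // float follows standard type promotion: the result is floating-point.
import torch

a = torch.arange(1, 5)     # torch.int64
b = torch.arange(1., 5.)   # torch.float32
print((a // b).dtype)      # torch.float32, the common dtype of int64 and float32
```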

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149233
Approved by: https://github.com/atalman
ghstack dependencies: #149216
2025-03-15 00:00:42 +00:00
bobrenjc93
eb7bf4202d Make dynamism code robust to NotImplementedException (#148823)
In prod many models have `@property` methods that raise
NotImplementedError. This PR updates our dynamism code to be more robust
to these types of models.
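
A sketch of the defensive pattern (the helper name is hypothetical): attribute probes on user models must tolerate `@property` getters that raise.

```python
# Hypothetical helper: treat a NotImplementedError from a @property getter
# the same as the attribute being absent.
def safe_getattr(obj, name, default=None):
    try:
        return getattr(obj, name, default)
    except NotImplementedError:
        return default

class Model:
    @property
    def config(self):
        raise NotImplementedError  # common in prod model wrappers

print(safe_getattr(Model(), "config", default={}))  # {} instead of a crash
```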

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148823
Approved by: https://github.com/laithsakka
2025-03-14 23:38:19 +00:00
Stephen Jia
ff58ccec6c [ATen-CPU] Add math.h for Gelu (#149164)
Summary:
## Context

This PR is mostly to enable ExecuTorch build for Windows: https://github.com/pytorch/executorch/pull/9198

In ExecuTorch, the optimized GeLU kernel calls the ATen implementation. However, on Windows `math.h` needs to be included with `#define _USE_MATH_DEFINES` in order for math constants to be defined.

Test Plan:
Rely on CI to make sure existing tests do not break. Tested separately with ExecuTorch to make sure Windows build is successful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149164
Approved by: https://github.com/swolchok
2025-03-14 23:37:25 +00:00
PyTorch MergeBot
f9b4856989 Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)"
This reverts commit c95a6b416b.

Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @zou3519 can you please help land this internally? See the sigmoid tests in D71198793 for details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2725982539))
2025-03-14 23:13:34 +00:00
PyTorch MergeBot
643aaea133 Revert "[RFC] First version of statically compiled launcher for triton compiled CUDA kernels (#148561)"
This reverts commit 5a843f8973.

Reverted https://github.com/pytorch/pytorch/pull/148561 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/148561#issuecomment-2725969268))
2025-03-14 23:01:26 +00:00