Summary: The FlexAttention path uses `_exchange_device`, so it will be needed eventually for MTIA as well.
Test Plan: `buck2 test fbcode//mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- test_exchange_device`
Reviewed By: chaos5958
Differential Revision: D70072059
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149322
Approved by: https://github.com/chaos5958
This enables a fast path for eager mode static/dynamic quantization for AArch64 through Arm Compute Library (ACL) directly.
Context: PRs #126687, #139887 enabled an optimized implementation for `qlinear` and `qlinear_dynamic` for aarch64 through `ideep → oneDNN → ACL` which improved performance by ~10x compared to the previous implementation.
However, the current `qlinear` and `qlinear_dynamic` path (`ideep → oneDNN → ACL`) suffers from high overhead due to the API friction between the stateless oneDNN API and the stateful ACL low-precision GEMM (`lowp_gemm`) API - for example, ACL's `lowp_gemm` objects cache information like weights reduction or weights in optimized memory format which oneDNN does not allow due to its stateless nature.
Hence, ACL currently runs a (redundant) sum of columns and pre-transposition (to the gemm kerne's optimal format) for each GEMM operation.
This PR addresses the sub-optimalities above by integrating ACL directly with `qlinear` and `qlinear_dynamic`.
- **For `qlinear_dynamic` (dynamically quantized matmuls):**
This PR yields an ****average speedup** (averaged over context_lengths of 2^3 up to 2^9) of ~ **50%** for `bert-base-uncased`, `bert-large-uncased`, `roberta-base`, `distilbert-base-uncased`** with 16 threads on a Neoverse-V1 (with transformers==4.48) for the benchmarking script below:
```
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
from transformers import AutoModel, AutoConfig
import time
import numpy as np
from argparse import ArgumentParser
class ModelArgumentParser(ArgumentParser):
def __init__(self) -> None:
super().__init__(description="huggingface model")
self.add_argument("--context_length",
help="context length - number of input tokens",
type=int,
default=64
)
self.add_argument("--model",
help="model checkpoint - i.e. 'bert-base-uncased'",
type=str,
default=None)
self.add_argument("--iters",
help="benchmark iterations",
default=500)
if __name__ == "__main__":
parser = ModelArgumentParser()
args = parser.parse_args()
model_name = args.model
config = AutoConfig.from_pretrained(model_name)
batch_size = 1
model = AutoModel.from_pretrained(model_name)
model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
model.eval()
inputs = torch.randint(config.vocab_size, (batch_size, args.context_length), dtype=torch.long, device="cpu")
times = []
with torch.no_grad():
# warmup
for _ in range(10):
model(inputs)
# benchmark
for _ in range(args.iters):
s = time.time_ns()
model(inputs)
times.append((time.time_ns() - s) / 1e6)
print("Model = ", model_name)
print("Context Length = ", args.context_length)
print("Min (ms) = ", min(times))
print("Mean (ms) = ", np.mean(times))
```
- **For `qlinear` (statically quantized matmuls):**
This PR yields an **average speedup of 2x for signed activations (`s8s8s8`) and 95x for unsigned activations (u8s8u8)** on a Neoverse-V1 with 16 threads for the benchmarking script below.
The averages are over for all combinations of `M = [8, 16, ..., 512]`, `K = [768, 1024, 2048, 4096]`, `N = [768, 1024, 2048, 4096]`.
The astronomical speedup for unsigned activation is because oneDNN v3.7 does not have an optimized implementation for `u8s8u8` on AArch64.
```
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
import torch.nn as nn
from torch.quantization import QConfig
from torch.ao.quantization.observer import HistogramObserver, default_weight_observer
import torch
import torch.nn as nn
import numpy as np
import random
from argparse import ArgumentParser
import time
class ModelArgumentParser(ArgumentParser):
def __init__(self) -> None:
super().__init__()
self.add_argument("--M",
help="M dimension",
type=int,
default=64
)
self.add_argument("--K",
help="K dimension",
type=int,
default=64
)
self.add_argument("--N",
help="N dimension",
type=int,
default=64
)
self.add_argument("--signed_input",
help="Use (signed) torch.qint8 for inputs instead of (unsigned) torch.quint8",
action="store_true"
)
self.add_argument("--seed",
help="Random seed",
type=int,
default=42
)
self.add_argument("--iters",
help="benchmark iterations",
default=500)
def set_seed(seed):
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
class LinearModel(nn.Module):
def __init__(self, K, N):
super(LinearModel, self).__init__()
self.quant = torch.quantization.QuantStub()
self.fc = nn.Linear(K, N)
self.dequant = torch.quantization.DeQuantStub()
def forward(self, x):
x = self.quant(x)
x = self.fc(x)
x = self.dequant(x)
return x
def quantize_model(model, args):
qconfig = QConfig(
activation=HistogramObserver.with_args(reduce_range=False,
dtype=torch.qint8 if args.signed_input else torch.quint8),
weight=default_weight_observer,
)
# Prepare the model for static quantization
# Specify quantization configurations
model.qconfig = qconfig
model_prepared = torch.quantization.prepare(model_fp32)
# Calibrate the model with sample inputs
# Example input data for calibration
with torch.no_grad():
sample_data = torch.randn(args.M, args.K)
model_prepared(sample_data)
# Convert the prepared model to a quantized model
model_quantized = torch.quantization.convert(model_prepared)
return model_quantized
if __name__ == "__main__":
parser = ModelArgumentParser()
args = parser.parse_args()
set_seed(args.seed)
model_fp32 = LinearModel(args.K, args.N)
model_quantized = quantize_model(model_fp32, args)
inputs = torch.randn(args.M, args.K)
times = []
with torch.no_grad():
# warmup
for _ in range(10):
model_quantized(inputs)
# benchmark
for _ in range(args.iters):
s = time.time_ns()
model_quantized(inputs)
times.append((time.time_ns() - s) / 1e6)
print("M,K,N,signed = ", args.M, args.K, args.N, args.signed_input)
print("Min Times (ms) = ", min(times))
print("Mean Times (ms) = ", np.mean(times))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148585
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Fix on non-rocm:
```
root@e01-tw-ue5g2g3sap6:~/pytorch/test# python test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu
E
======================================================================
ERROR: test_ck_blas_library_cpu (__main__.TestLinalgCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/root/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
method(*args, **kwargs)
File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 480, in instantiated_test
raise rte
File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 460, in instantiated_test
result = test(self, **param_kwargs)
File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 1242, in dep_fn
return fn(slf, *args, **kwargs)
File "/root/pytorch/torch/testing/_internal/common_utils.py", line 1981, in _fn
fn(*args, **kwargs)
File "/root/pytorch/test/test_linalg.py", line 8621, in test_ck_blas_library
torch.backends.cuda.preferred_blas_library('ck')
File "/root/pytorch/torch/backends/cuda/__init__.py", line 258, in preferred_blas_library
torch._C._set_blas_preferred_backend(_BlasBackends[backend])
RuntimeError: Cannot set preferred backend to Ck if PyTorch has not been compiled for ROCm.
To execute this test, run the following from the base repo dir:
python test/test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.346s
FAILED (errors=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148316
Approved by: https://github.com/jeffdaily
Logs of trymerge don't match up with timestamps, ex
https://github.com/pytorch/pytorch/actions/runs/13766246347/job/38493307591
Ex:
```
2025-03-10T14:20:41.4899509Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (0.003460856278737386 minutes elapsed)
...
2025-03-10T14:20:41.4907867Z Merge of https://github.com/pytorch/pytorch/pull/148648 failed due to: Still waiting for 16 jobs to finish, first few of them are: Check Labels / Check labels, trunk / macos-py3-arm64 / build, trunk / win-vs2022-cpu-py3 / build, trunk / cuda12.4-py3.10-gcc9-sm80 / build, trunk / win-vs2022-cuda12.6-py3 / build. Retrying in 5 min
2025-03-10T14:20:41.4909772Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (5.280085611343384 minutes elapsed)
...
2025-03-10T14:20:41.4916812Z Merge of https://github.com/pytorch/pytorch/pull/148648 failed due to: Still waiting for 15 jobs to finish, first few of them are: trunk / macos-py3-arm64 / build, trunk / win-vs2022-cpu-py3 / build, trunk / cuda12.4-py3.10-gcc9-sm80 / build, trunk / win-vs2022-cuda12.6-py3 / build, trunk / linux-focal-cuda12.6-py3.10-gcc11-no-ops / build. Retrying in 5 min
2025-03-10T14:20:41.4918183Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (10.590279157956441 minutes elapsed)
```
Either buffering prints or github actions logs are being weird?
Print with flush to see if it helps
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149012
Approved by: https://github.com/malfet
# Motivation
This PR introduces improvements to the XPU oneDNN context manager API:
- `GpuEngineManager::get_engine`: Added a new API that accepts a `DeviceIndex` to simplify code and improve usability - by default, using the current device index.
- `GpuStreamManager::get_stream`: Now explicitly requires a `DeviceIndex` as input to ensure correctness and consistency - by default, using the current device index.
Additionally, it enhances integration with `c10::DeviceGuard`, ensuring correct device management.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147349
Approved by: https://github.com/EikanWang
Summary:
Optimize the decomposition of aten.native_group_norm. Reduce unnecessary repeated operations by changing the order of operations for `mean`, `rstd`, `weight`, `bias `and `input`, which can improve performance when `flattened_inner_size `is large.
The original decomposition:
1. compute `mean `and `rstd`,
2. out = (x - mean) * rstd, compute in the range [N, C, *],
3. out = out * weight + bias, compute in the range [N, C, *],
The new decomposition:
1. compute `mean `and `rstd`,
2. new_weight = rstd * weight, new_bias = - mean * rstd * weight + bias, compute in the range [N, C],
3. out = out * new_weight + new_bias, compute in the range [N, C, *],
I tested the Inductor performance benchmark with this PR on both CPU and A100. On CPU, two torchbench models(functorch_dp_cifar10 and opacus_cifar10) have about 25% performance improvement, and two diffusion models(Stable Diffusion and Latent Consistency Model(LCM)) have about 2% performance improvement. On A100, no performance gains or regressions were seen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144733
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
Doing this removes the need of collecting `id` and therefore facilitates serialization. It also improves readability with recompilations. Earlier, recompile message will just show the `id`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149228
Approved by: https://github.com/jansel
Summary: The future holds a reference to the callback, and the callback captures the outer future. Seems to create a cycle that the garbage collector doesn't clean up. Verified by compiling 15k synthetic Triton kernels and observing that subprocess memory overhead improves.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149259
Approved by: https://github.com/Skylion007
This is a new version of https://github.com/pytorch/pytorch/pull/148561 fixing the ROCM test failure
Putting this up for a first pass review, though I will likely make a bunch of changes before landing to add more features, etc.
This diff implements a first version of a static CUDA kernel launcher in `torch._C`. The goal here is to take a cubin file and some metadata from a CompiledKernel from `triton`, and launch the cubin file directly.
Background doc: https://docs.google.com/document/d/1rjRcHl6MfauHG30nCoQX-9UKvKyIs4WWMy_GsGyqb9g/edit?tab=t.0#heading=h.ut5lf39lzq66
Normally, using triton's CompiledKernel.make_launcher(), we would pay the cost of codegenning C++ and running it at compile time. With this new approach, we can use one statically compiled library to launch the kernel.
The tradeoff here is that this new kernel launcher will not be able to use codegen to deal with different lengths/types of arguments. So we use templating to handle up to 10 arguments for now. We also allocate 8 bytes on the stack per argument no matter the argument type, which can take more memory than codegenning. On the other hand, we improve compile time on cold and warm start by not having to call the C++ compiler at all.
This diff does not add the launcher to torch, but introduces a basic test suite.
A list of TODOs that are not yet complete:
- Handle `nvTmaDesc` and `cuTensorMap`, which triton handles
- Embed the grid logic instead of passing in gridX,Y,Z
- Handle launch_enter and exit hooks? (Not sure if inductor has these)
- Benchmarking to see if there's runtime performance loss
- Probably lots of features of the triton C++ generated code that I haven't handled yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149238
Approved by: https://github.com/oulgen
Summary: The parallel compile workers are holding on to more memory than they need to because they're loading the compiled modules into memory. Update the post-fork initializer to record when in a subprocess and skip some of the unnecessary overhead.
Test Plan: Ran a test script to compile 15k Triton kernels and used tracemalloc in the subprocs to investigate the overhead. On my devgpu:
* After importing torch in a subproc: 371M
* Without this PR, after compiling 15k kernels: 825M
* With this PR, after compiling 15k kernels: 531M
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149168
Approved by: https://github.com/jansel
Summary: as title.
See internal Diff summary for more context.
Test Plan: buck run @fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r config_not_generated
Differential Revision: D71241676
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149246
Approved by: https://github.com/houseroad
Co-authored-by: Huamin Li <huaminli@meta.com>
We found that in compiled_autograd, when defining custom op, the custom op will be dce in the backward graph. We added a side effect condition in the dce function to prevent eliminating custom op with side effect in CA graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149181
Approved by: https://github.com/xmfan
we use dummy tensors in our initial trace, so we should never inline. the subclass dispatch might not support the dummy tensor, e.g. DTensor accumulate grad will check that both param and grad are DTensors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149014
Approved by: https://github.com/jansel
ghstack dependencies: #149064
**Issue:**
* The ldaddal instruction is an AArch64 atomic operation available from ARMv8.1-A onwards.
* Raspberry Pi 4 (Cortex-A72) is ARMv8-A, which does not support ldaddal, leading to failures when running PyTorch built with march=armv8.2-a+sve
* This led to an issue when running PyTorch on ARMv8-A (Raspberry Pi 4), as unsupported atomic operations were generated.
**Fix:**
* Updated the build flags to explicitly use **-march=armv8-a+sve**, ensuring GCC and clang promotes it correctly and resolves compatibility issues with armv8 and still work correctly for SVE like before.
* This ensures that PyTorch builds correctly for ARMv8-A platforms (e.g., Raspberry Pi 4) while still enabling SVE for supported hardware.
Test plan:
- Allocate `a1.4xlarge` on AWS
- Run following script using wheel produced by this PR
```python
import torch
def f(x):
return x.sin() + x.cos()
print(torch.__version__)
f_c = torch.jit.script(f)
```
- Observe no crash
```
$ python3 foo.py
2.7.0.dev20250313+cpu
```
- Observe crash with 2.6.0
```
$ python3 foo.py
2.6.0+cpu
Illegal instruction (core dumped)
```
Fixes#146792
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148070
Approved by: https://github.com/malfet
And delete some duplicating glue code by relying on the stub
After this change `torch.arange(10, device = 'mps') // torch.arange(10., device='mps')` will return tensor of floats, which is a common dtype for float + integral operation, rather than tensor of ints
Checked by `test_div2` inductor testing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149233
Approved by: https://github.com/atalman
ghstack dependencies: #149216
Summary:
## Context
This PR is mostly to enable ExecuTorch build for Windows: https://github.com/pytorch/executorch/pull/9198
In ExecuTorch, the optimized GeLU kernel calls the ATen implementation. However, on Windows `math.h` needs to be included with `#define _USE_MATH_DEFINES` in order for math constants to be defined.
Test Plan:
Rely on CI to make sure existing tests do not break. Tested separately with ExecuTorch to make sure Windows build is successful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149164
Approved by: https://github.com/swolchok