1800 Commits

Author SHA1 Message Date
Zizeng Meng
b8452e55bc [Kineto x Insight] Update Kineto submodule (#154426)
Summary: We add a new ActivityType::MTIA_INSIGHT in 20f652846f

Test Plan: CI

Differential Revision: D75454945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154426
Approved by: https://github.com/Skylion007
2025-05-27 18:29:29 +00:00
Yu, Guangye
a664cfdf95 Add C10_NODEPRECATED check for xpu (#153935)
# Motivation
Add a `C10_NODEPRECATED` check for XPU. This prevents the XPU codebase from using `c10::optional`.

What does the torch-xpu-ops commit update change?
It deprecates `c10::optional`, `c10::nullopt`, and `c10::make_optional`; use their counterparts in `std` instead.

# Additional Context
This PR depends on
https://github.com/intel/torch-xpu-ops/pull/1683
https://github.com/intel/torch-xpu-ops/pull/1690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153935
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-22 06:44:04 +00:00
Yutao Xu
11f8511455 Update torch-xpu-ops commit pin (#153902)
Update the torch-xpu-ops commit to defce46ae7, includes:

- Resolve the aten::gamma accuracy gap compared to scipy
- Optimize layernom_vectorized_impl by using adaptive wg selection for small shapes
- [Intro async flag and use current stream avoid stream sync](https://github.com/intel/torch-xpu-ops/pull/1546)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153902
Approved by: https://github.com/Skylion007, https://github.com/EikanWang
2025-05-21 13:29:41 +00:00
Gantaphon Chalumporn
05bc78e64f [submodule] Update fbgemm pinned version (#153950)
Summary:
Update fbgemm pinned version in PyTorch.
Related update in fbgemm: D74434751

Included changes:
Update fbgemm external dependencies directory in setup.py
Add DISABLE_FBGEMM_AUTOVEC flag to disable fbgemm's autovec

Test Plan: PyTorch OSS CI

Differential Revision: D75073516

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153950
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-05-20 20:24:27 +00:00
Eddie Yan
ef958fa152 [cuDNN][cuDNN frontend] upgrade cuDNN frontend submodule to 1.12 (#153888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153888
Approved by: https://github.com/Skylion007
2025-05-20 15:08:37 +00:00
Aaron Gokaslan
d869ea11e0 [BE]: Update fmtlib submodule to 11.2.0 (#153853)
Update fmtlib to 11.2.0 with a lot of miscellaneous fixes for various compilers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153853
Approved by: https://github.com/malfet
2025-05-20 14:11:18 +00:00
cyy
7ae7324ac4 [submodule] Update google benchmark to v1.9.3 (#153676)
And remove `include_directories`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153676
Approved by: https://github.com/Skylion007
2025-05-16 23:31:53 +00:00
cyy
9d3b6ee4c1 [submodule] Update gtest to v1.17.0 (#153618)
And remove some outdated CMake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153618
Approved by: https://github.com/malfet
2025-05-16 01:24:19 +00:00
Tristan Rice
d1dd2c1fc8 gloo: cuda (#153406)
This enables Gloo CUDA when used with a backend that supports GPUDirect which currently is only the IBVERBS backend.

This requires some changes to Gloo which are in https://github.com/pytorch/gloo/pull/441

Since we're now depending on gloo_cuda we need to split ProcessGroupGloo into two pieces, one with the CPU bits (libtorch_cpu) and one with CUDA kernels in libtorch_cuda. This unfortunately requires some major refactoring as some CPU code is shared across both.

The gloo submodule is updated to depend on the new Gloo changes

Test plan:

```py
import os
import time

transport = "TCP"
#transport = "IBVERBS"

os.environ["GLOO_DEVICE_TRANSPORT"] = transport
rank = int(os.environ["RANK"])
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)

ibv = "mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1".split(",")[rank]
ibv_name, ibv_port = ibv.split(":")
os.environ["TORCH_GLOO_IBV_NAME"] = ibv_name
os.environ["TORCH_GLOO_IBV_PORT"] = ibv_port
os.environ["TORCH_GLOO_IBV_INDEX"] = "3"

import torch
import torch.distributed as dist

dist.init_process_group("gloo")

rank = dist.get_rank()

# initial sanity check
#device = "cpu"
#t = torch.zeros(10, device=device)
#dist.all_reduce(t)
#print("sanity complete")

device = "cpu"

iters = 10
warmup_iters = 2

for nelem in [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000]:
    t = torch.zeros(nelem, device=device)

    torch.cuda.current_stream().synchronize()
    for i in range(warmup_iters):
        dist.all_reduce(t)

    torch.cuda.current_stream().synchronize()

    start = time.perf_counter()

    for i in range(iters):
        dist.all_reduce(t)

    torch.cuda.current_stream().synchronize()

    dur = (time.perf_counter() - start)
    qps = iters/dur

    bandwidth_gb = t.nbytes * iters / dur / 1e9

    gb = t.nbytes / 1e9

    if rank == 0:
        print(f"{transport=} {device=} {iters=} {nelem=} {qps=} {gb=} {bandwidth_gb=}\n", end="")
```
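The throughput figures printed by the script above follow from simple arithmetic on the tensor size and the elapsed time. Below is a minimal sketch of that calculation in isolation, with no dependency on torch. The function name `allreduce_stats` and the sample numbers are illustrative assumptions, not part of the PR.

```python
def allreduce_stats(nbytes: int, iters: int, dur_s: float) -> dict:
    """Mirror the metrics the benchmark prints for one tensor size.

    nbytes: size of the tensor in bytes (t.nbytes in the script)
    iters:  number of timed all_reduce calls
    dur_s:  wall-clock seconds spent on those calls
    """
    if dur_s <= 0:
        raise ValueError("duration must be positive")
    return {
        "qps": iters / dur_s,                          # all_reduce calls per second
        "gb": nbytes / 1e9,                            # tensor size in decimal GB
        "bandwidth_gb": nbytes * iters / dur_s / 1e9,  # GB reduced per second
    }


# Hypothetical numbers: 1,000,000 float32 elements (4 bytes each),
# 10 timed iterations completed in 0.5 seconds.
stats = allreduce_stats(nbytes=4 * 1_000_000, iters=10, dur_s=0.5)
print(stats)
```

Note that `bandwidth_gb` counts only the bytes of the local tensor per call; it does not account for the extra traffic an all-reduce algorithm moves between ranks.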

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153406
Approved by: https://github.com/fduwjj
2025-05-16 01:13:13 +00:00
cyy
e5e06d9cab [submodule] Update kleidiai to v1.8.0 (#153592)
And cleans up some CMake instructions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153592
Approved by: https://github.com/malfet
2025-05-15 10:14:05 +00:00
chunhuanMeng
1f48bab377 Update torch-xpu-ops commit pin (#153445)
Update the torch-xpu-ops commit to [207105038963e5f9f012f1a0cfd3b9f57b2ab5b0](2071050389), includes:

- Improve the accuracy of `upsample_bilinear2d_backward`
- Enhance the performance of `avg_pool2d`
- Update the implementation of scatter-gather and indexing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153445
Approved by: https://github.com/guangyey, https://github.com/EikanWang
2025-05-14 15:34:47 +00:00
Tristan Rice
9c3cef437c gloo: support ibverbs in cmake (#153425)
This updates the gloo submodule in PyTorch to a version that supports the new ibverbs backend that can be used with PyTorch.

Test plan:

```
sudo dnf install rdma-core-devel
USE_GLOO_IBVERBS=ON python setup.py develop
torchrun --nproc_per_node 2 ~/scripts/gloo_ibverbs_test.py
```

```py
"""
run with:

torchrun --nproc_per_node 2 ~/scripts/gloo_ibverbs_test.py
"""

import os

os.environ["GLOO_DEVICE_TRANSPORT"] = "IBVERBS"

import torch
import torch.distributed as dist

dist.init_process_group("gloo")

rank = dist.get_rank()

if rank == 0:
    device = "cpu"
else:
    device = "cuda"

print(device)

t = torch.full((10, 100), fill_value=(rank+1), device=device)
target = torch.full((10, 100), fill_value=3, device=device)

dist.all_reduce(t)

torch.testing.assert_close(t, target)

t = torch.full((10, 100), fill_value=(rank+1), device=device)

if rank == 0:
    dist.send(t, dst=1)
else:
    dist.recv(t, src=0)
    torch.testing.assert_close(t, torch.full_like(t, 1))
```
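The `target` value of 3 in the script above comes from summing `rank + 1` over the two ranks started by `torchrun --nproc_per_node 2`, since `dist.all_reduce` sums by default. A small torch-free sketch of that arithmetic follows; `expected_allreduce_value` is a hypothetical helper name used only for illustration.

```python
def expected_allreduce_value(world_size: int) -> int:
    """Value each element holds after a sum all-reduce when rank r contributes r + 1."""
    return sum(rank + 1 for rank in range(world_size))


# Two ranks contribute 1 and 2 respectively.
two_rank_target = expected_allreduce_value(2)
print(two_rank_target)
```

If the script were launched with a different number of processes, the hard-coded `fill_value=3` would need to change accordingly (for example, four ranks would give 10).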

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153425
Approved by: https://github.com/fduwjj
2025-05-13 17:09:00 +00:00
cyy
15e08f9571 [submodule] Update ONNX to 1.18 (#152200)
Update ONNX to 1.18.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152200
Approved by: https://github.com/justinchuby, https://github.com/malfet
2025-05-13 04:18:45 +00:00
Chen, Zejun
76e34e3850 [Kineto] Upgrade the kineto commit to fb36cce (#152007)
XPU intends to upgrade the oneAPI version (https://github.com/pytorch/pytorch/issues/151097) to support torch Distributed. However, the PTI shipped with the new oneAPI introduces breaking changes: it changes the signatures of the following APIs.
- ptiViewEnableRuntimeApi
- ptiViewGetApiIdName

To avoid breakage from these upcoming non-backward-compatible PTI changes, we refined the XPU PTI integration in Kineto: it checks the PTI version and then invokes the matching PTI API. The Kineto version in this PR therefore handles the non-backward-compatible change ahead of the upcoming oneAPI 2025.1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152007
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/sraikund16, https://github.com/malfet
2025-05-09 18:38:41 +00:00
Aaron Gokaslan
07a29dbe81 [BE]: Update cutlass submodule to 3.9.2 (#152779)
A lot of last minute bugfixes for CUTLASS blackwell that we should upstream. It's a header only library and a minor release so this should strictly improve compiler support and fix some bugs. Needed to update some instruction numbers in torch compile baselines for the new kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152779
Approved by: https://github.com/henrylhtsang
2025-05-06 16:08:24 +00:00
cyy
ac792a0dca [submodule] Bump ITTAPI to 3.25.5 (#150263)
It hasn't been updated for 3 years. This also removes the CMake 4 workaround.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150263
Approved by: https://github.com/sraikund16
2025-05-06 01:02:18 +00:00
Zhengxu Chen
361bf056a7 [nativert] Add moodycamel/concurrentqueue as third-party dependency (#152033)
nativert RFC:  https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

moodycamel/concurrentqueue is a high-performance, single-header MPMC queue implementation. We want to add it to third_party for use with the upcoming Torch Native Runtime.

The source code is imported from commit hash 2f09da73d22a47dc8a89cdd4fc4c3bfae07f4284 from https://github.com/cameron314/concurrentqueue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152033
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-04-30 21:37:20 +00:00
xinan.lin
f05d3e5019 [torch-xpu-ops] Update torch-xpu-ops commit pin. (#152321)
Update the torch-xpu-ops commit to [655fa9bc7f88ab5bd3766b5f2fd5b43989c2caca](655fa9bc7f), including:

- Fixes batch_norm numeric error by adding additional boundary check
- Enable two operators: fft & jagged_to_padded_dense
- XCCL relevant changes:
  - Cache cclStream to improve performance.
  - Add support for complex datatypes in allgather and broadcast.
  - Support coalescing operations and batch_isend_irecv.
  - Introduce additional logging; use export TORCH_CPP_LOG_LEVEL=INFO.
- Fix #152296
- Fix #152020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152321
Approved by: https://github.com/EikanWang, https://github.com/Skylion007
2025-04-29 04:00:09 +00:00
Eddie Yan
a6d38051ee [CUDA][CUTLASS] CUTLASS 3.9 submodule upgrade (#151253)
Originally authored by Jack Kosaian, likely needs #ifdefs if we want to preserve compat with 3.8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151253
Approved by: https://github.com/Skylion007, https://github.com/henrylhtsang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-28 23:10:14 +00:00
PyTorch MergeBot
8172397025 Revert "Update torch-xpu-ops commit pin (#150827)"
This reverts commit 776aa68221.

Reverted https://github.com/pytorch/pytorch/pull/150827 on behalf of https://github.com/etaf due to Inductor UT regression ([comment](https://github.com/pytorch/pytorch/pull/150827#issuecomment-2825857903))
2025-04-24 00:41:06 +00:00
Nikita Shulga
4d2d833976 [CI] Update sleef submodule to v3.8 (#151955)
Should help with RISC-V cross-compilation.
3.9.0 migration is blocked by sleef project switching to C++20
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151955
Approved by: https://github.com/atalman, https://github.com/wdvr, https://github.com/Skylion007
2025-04-23 23:56:05 +00:00
Shivam Raikundalia
4bf09562e4 [EZ/Profiler] Update Submodule (#151843)
Summary: Update to d82680bbd4

Test Plan: CI

Differential Revision: D73397323

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151843
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2025-04-22 18:19:43 +00:00
Yutao Xu
776aa68221 Update torch-xpu-ops commit pin (#150827)
Update the torch-xpu-ops commit to [b51dd3ef4f4d0f6b44c59e61431c5d29354dcaf6](b51dd3ef4f), including:
- Update commit pin to xpu-ops main branch
- Fixes batch_norm numeric error by adding additional boundary check
- Enable two operators: fft & jagged_to_padded_dense
- XCCL relevant changes:
  1. Cache `cclStream` to improve performance.
  2. Add support for complex datatypes in `allgather` and `broadcast`.
  3. Support `coalescing` operations and `batch_isend_irecv`.
  4. Introduce additional logging; use `export TORCH_CPP_LOG_LEVEL=INFO`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150827
Approved by: https://github.com/EikanWang

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-04-18 10:12:59 +00:00
Shivam Raikundalia
ad5e9065ac [Profiler/Easy] Remove temp flag for on-demand Memory Snapshot (#151068)
Summary: Now that the profiler implementation has landed, we no longer need the temporary flag. This also updates the submodule.

Test Plan: CI

Reviewed By: sanrise

Differential Revision: D72672186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151068
Approved by: https://github.com/davidberard98
2025-04-11 18:50:25 +00:00
Tristan Rice
df4e5294a6 Reapply "ProcessGroupGloo: support lazy_init (#150801)" (#151031)
This reverts commit 73f3d6d9aa.

Reapplies #150801

Test plan:

See #150801

submodule

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151031
Approved by: https://github.com/fduwjj
2025-04-11 01:58:35 +00:00
PyTorch MergeBot
73f3d6d9aa Revert "ProcessGroupGloo: support lazy_init (#150801)"
This reverts commit f237ee54bf.

Reverted https://github.com/pytorch/pytorch/pull/150801 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/150801#issuecomment-2793161239))
2025-04-10 13:44:31 +00:00
Tristan Rice
f237ee54bf ProcessGroupGloo: support lazy_init (#150801)
This adds lazy initialization support to ProcessGroupGloo via `TORCH_GLOO_LAZY_INIT` or via `create_device(..., lazy_init=True)`

This is still a draft PR as there's one race condition when doing coalesced operations that needs to be fixed upstream in Gloo first. Depends on https://github.com/facebookincubator/gloo/pull/427 landing first

This also updates the gloo submodule to include the required changes.

Test plan:

added lazy init test variants

```
pytest -v test/distributed/test_c10d_gloo.py -k Lazy
```
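The environment-variable route described above can be sketched as a small helper that prepares the environment for a test or launcher process. This is only an illustration: the helper name `gloo_lazy_init_env` is hypothetical, and the value `"1"` is an assumption, since the commit message names `TORCH_GLOO_LAZY_INIT` but does not say which values it accepts.

```python
import os


def gloo_lazy_init_env(base_env=None, enabled=True):
    """Return a copy of the environment with the lazy-init switch applied.

    ASSUMPTION: the value "1" turns the feature on; the commit message names
    the variable but does not document its accepted values.
    """
    env = dict(os.environ if base_env is None else base_env)
    if enabled:
        env["TORCH_GLOO_LAZY_INIT"] = "1"
    else:
        # Remove the switch so the child process uses the default behavior.
        env.pop("TORCH_GLOO_LAZY_INIT", None)
    return env


env = gloo_lazy_init_env({"PATH": "/usr/bin"})
print(env["TORCH_GLOO_LAZY_INIT"])
```

The returned mapping could then be passed as the `env` argument of `subprocess.run` when invoking the pytest command from the test plan.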

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150801
Approved by: https://github.com/fduwjj
2025-04-09 19:29:50 +00:00
Shivam Raikundalia
99c9a31386 [submodule] [Snapshot/Profiler] Memory Snapshot On Demand (#150559)
Summary:
Profiler side of memory snapshot.

1. Add API to actually do snapshot when client interface is called
2. Add ifdefs to builds so that kineto hooks snapshot correctly.

Design Philosophy: There is one interesting part of this implementation, and it is during export. For export we are calling the Python impl of the export rather than the CPP one, even though we are already in CPP. This is because it is better to have one export path rather than two. Personally, I want there to be parity between auto-trace and on-demand, so if we can limit the side paths we will have an easier time maintaining this relationship.

Test Plan: {F1976563426}

Reviewed By: sanrise

Differential Revision: D70733247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150559
Approved by: https://github.com/sanrise
2025-04-07 13:04:38 +00:00
Stepan Hruda
2e23768d25 Expose symbols on macos in the xplat pytorch stack (#150487)
Summary:
X-link: https://github.com/pytorch/executorch/pull/9819

Had to revert D71321310 because it affected way too many targets and build sizes.

These changes should expose just enough symbols to be buildable in arvr mode on macOS. We could potentially narrow it down even more by avoiding e.g. `get_pt_compiler_flags`.

Differential Revision: D72255474

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150487
Approved by: https://github.com/drisspg
2025-04-04 23:03:16 +00:00
Wang, Chuanqi
0198e44f37 Update torch-xpu-ops commit pin to 98c808d (#150554)
Update the torch-xpu-ops commit to [98c808dea6de7330c415aa777d6921944cf79887](98c808dea6), include

- Fixes #150001 by removing pre-CXX11 ABI logic from build script for XPU
- Fixes #150430
- Fixes XCCL build issue caused by PR #150398

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150554
Approved by: https://github.com/EikanWang, https://github.com/malfet
2025-04-02 22:42:18 +00:00
Nikita Shulga
91666eef60 Update gloo submodule (#150320)
This updates Gloo's minimum CMake version (via https://github.com/facebookincubator/gloo/pull/424) and removes the cmake-4.0.0 workarounds for gloo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150320
Approved by: https://github.com/atalman
2025-03-31 22:40:27 +00:00
Wang, Chuanqi
f74d5d576a Update torch-xpu-ops commit pin to 3ee2bd2 (#150300)
Update the torch-xpu-ops commit to [3ee2bd2f13e1ed17a685986ff667a58bed5f2aa5](3ee2bd2f13)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150300
Approved by: https://github.com/EikanWang
2025-03-31 13:36:11 +00:00
Aaron Gokaslan
e91f84c87d [BE]: Update cudnn frontend submodule to 1.11.0 (#149759)
Update the cuDNN frontend submodule to 1.11.0. This adds some new features, like score_mod from flex_attention, along with a lot of bugfixes and new feature knobs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149759
Approved by: https://github.com/jansel
2025-03-30 17:14:26 +00:00
Tristan Rice
87bfd66c3c gloo: update to latest version (#149985)
This updates submodule Gloo to the latest version and brings a number of benefits:

* connection retries d2609ab5e8
* better error messages 5ca057d6cc
* multi_get support for larger scale jobs 4ff6edf45f
* metadata exchange optimizations  20dc202dd8
* miscellaneous other fixes

Old commit: 5354032ea0

Test plan:

This is already being used in production environments at scale.

PyTorch CI

```
pytest -v test/distributed/test_c10d_gloo.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149985
Approved by: https://github.com/fduwjj, https://github.com/malfet
2025-03-26 19:19:31 +00:00
Ozan Aydin
ce54c430c0 [Submodule] [cpuinfo] cpuinfo update (#149305)
Updating `cpuinfo` module.

Relevant:
https://github.com/pytorch/cpuinfo/issues/270
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149305
Approved by: https://github.com/malfet
2025-03-25 22:44:50 +00:00
Stepan Hruda
6bcf9c6ce3 [xnnpack] Expose subgraph symbols (#149397)
Summary: Main XNNPack target code uses symbols from subgraph so they need to be exported - this gets uncovered on macos where symbols were not visible after linking

Test Plan: CI / used for a macOS build on top of the stack.

Differential Revision: D71315023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149397
Approved by: https://github.com/digantdesai
2025-03-19 01:14:46 +00:00
Shivam Raikundalia
e84cc4c052 Update Kineto Submodule (#149089)
Summary: We have made a lot of changes in Kineto this month. It is a good idea to update the submodule now, especially since the roctracer-sdk change will be very large.

Test Plan: CI

Differential Revision: D71082829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149089
Approved by: https://github.com/Skylion007
2025-03-13 17:18:16 +00:00
chunhuanMeng
e9c12e819d Update torch-xpu-ops commit pin (#148881)
Update the torch-xpu-ops commit to [026b2c8c7c92a7b2cec5d26334006e3423251cc6](026b2c8c7c), includes:

- Enable AOT for LNL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148881
Approved by: https://github.com/EikanWang
2025-03-10 20:31:51 +00:00
Jiang, Yanbing
f2f25a5444 Upgrade submodule oneDNN to v3.7.1 (#148293)
This PR is to upgrade submodule oneDNN to v3.7.1.

## Improvements

- Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
- Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
- Added bfloat16 support for SDPA, implemented fp16 and bf16 gemm kernel in SDPA.
- Fixed f16 matmul accuracy, SDPA failing to dispatch to ukernel, bf16/fp16/fp32 conv performance, an INT8 kernel triggering a page fault, a deconvolution precision issue on complex128 and fp64, and a gemm correctness issue in float16.
- Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL).
- Improved bf16 to fp32 reorder performance.
- Improved bf16 reorder performance.
- Improved bf16 convolution with ACL.

Fixes https://github.com/pytorch/pytorch/issues/136348.

## Validation results on CPU

1. NLP models accuracy/inference/training
![image](https://github.com/user-attachments/assets/859279b8-1631-4268-b226-7de9ac5870d8)

![image](https://github.com/user-attachments/assets/30ec7151-41ca-482a-9d2d-0c4850e75bab)

2. Torchbench cpu userbenchmark inference & training

![image](https://github.com/user-attachments/assets/71c9807c-caf9-4385-9990-d2ab637031cd)

3. Inductor quantization

![image](https://github.com/user-attachments/assets/3d2a3bd3-82fa-4566-8050-7ea5d6b61675)

4. Dynamo benchmarks
![image](https://github.com/user-attachments/assets/554ecce3-c85c-4a0e-88f1-2e73983c5dcd)
![image](https://github.com/user-attachments/assets/148c88f8-4367-4428-bb54-ce8a4deefd1b)
![image](https://github.com/user-attachments/assets/f2e744f4-d710-4699-acf4-1f130ecfadf1)
![image](https://github.com/user-attachments/assets/97128b80-4d0e-495a-aeda-dde3e70c96fd)
![image](https://github.com/user-attachments/assets/a9afce37-684c-45c0-b938-6dd7e0383805)
![image](https://github.com/user-attachments/assets/b8714236-9681-4fbe-8d98-be93deedab88)
![image](https://github.com/user-attachments/assets/4423061f-d133-45ba-98bd-d2f739e50431)
![image](https://github.com/user-attachments/assets/7955da10-3d23-493e-99fa-658f7f40035b)

## Validation results on XPU
Accuracy is same as baseline. Performance is shown below.
![image](https://github.com/user-attachments/assets/7645304d-5b1d-43f9-b840-9f846ed380a0)

## Validation results on ARM
![image](https://github.com/user-attachments/assets/080f7c02-0238-436f-ad20-5a9e3f6aafbb)
![image](https://github.com/user-attachments/assets/443742aa-ca61-41de-ae80-5d4c65cd0c87)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148293
Approved by: https://github.com/mingfeima, https://github.com/atalman
2025-03-04 13:56:45 +00:00
Aaron Gokaslan
6d70b42810 [BE][Ez]: Update fmt submodule to 11.1.4 (#148264)
This minor release is mostly bugfixes, ABI fixes, and compiler support fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148264
Approved by: https://github.com/jansel, https://github.com/cyyever
2025-03-02 19:00:00 +00:00
drisspg
3a69dee955 [Submodule][FlashAttention] Bump to 2.7.4 (#148147)
# Summary
This makes me happy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148147
Approved by: https://github.com/Skylion007
2025-02-28 22:40:02 +00:00
Yutao Xu
21bd5fe203 Update torch-xpu-ops commit pin (#147968)
Update the torch-xpu-ops commit to [86aaaf8a9dd6932c088b7afcac0c0856b23d341a](86aaaf8a9d), includes:

- Bugfix (PT2E/BatchNorm)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147968
Approved by: https://github.com/Skylion007
2025-02-27 05:01:12 +00:00
Yutao Xu
7bd2e3bca1 Update torch-xpu-ops commit pin (#147743)
Update the torch-xpu-ops commit to [306a0ffb6e0cae27c5bd9a3b9cd378048c8e00e7](306a0ffb6e), includes:

- Bugfix (LayerNorm/Nonzeros)
- Update AOT target

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147743
Approved by: https://github.com/EikanWang
2025-02-25 08:06:35 +00:00
PyTorch MergeBot
e72b4c61bf Revert "Upgrade submodule oneDNN to v3.7 (#147498)"
This reverts commit 576ed1e400.

Reverted https://github.com/pytorch/pytorch/pull/147498 on behalf of https://github.com/wdvr due to failing some tests on trunk - see below ([comment](https://github.com/pytorch/pytorch/pull/147498#issuecomment-2679867286))
2025-02-24 22:57:39 +00:00
Jiang, Yanbing
576ed1e400 Upgrade submodule oneDNN to v3.7 (#147498)
This PR is to upgrade submodule oneDNN to v3.7.

## Improvements

- Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
- Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
- Added bfloat16 support for SDPA, implemented fp16 and bf16 gemm kernel in SDPA.
- Fixed f16 matmul accuracy, SDPA failing to dispatch to ukernel, bf16/fp16/fp32 conv performance, an INT8 kernel triggering a page fault, a deconvolution precision issue on complex128 and fp64, and a gemm correctness issue in float16.
- Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL).
- Improved bf16 to fp32 reorder performance.
- Improved bf16 reorder performance.
- Improved bf16 convolution with ACL.

Fixes https://github.com/pytorch/pytorch/issues/136348.

## Validation results on CPU

1. NLP models accuracy/inference/training
![image](https://github.com/user-attachments/assets/859279b8-1631-4268-b226-7de9ac5870d8)

![image](https://github.com/user-attachments/assets/30ec7151-41ca-482a-9d2d-0c4850e75bab)

2. Torchbench cpu userbenchmark inference & training

![image](https://github.com/user-attachments/assets/71c9807c-caf9-4385-9990-d2ab637031cd)

3. Inductor quantization

![image](https://github.com/user-attachments/assets/3d2a3bd3-82fa-4566-8050-7ea5d6b61675)

4. Dynamo benchmarks
![image](https://github.com/user-attachments/assets/554ecce3-c85c-4a0e-88f1-2e73983c5dcd)
![image](https://github.com/user-attachments/assets/148c88f8-4367-4428-bb54-ce8a4deefd1b)
![image](https://github.com/user-attachments/assets/f2e744f4-d710-4699-acf4-1f130ecfadf1)
![image](https://github.com/user-attachments/assets/97128b80-4d0e-495a-aeda-dde3e70c96fd)
![image](https://github.com/user-attachments/assets/a9afce37-684c-45c0-b938-6dd7e0383805)
![image](https://github.com/user-attachments/assets/b8714236-9681-4fbe-8d98-be93deedab88)
![image](https://github.com/user-attachments/assets/4423061f-d133-45ba-98bd-d2f739e50431)
![image](https://github.com/user-attachments/assets/7955da10-3d23-493e-99fa-658f7f40035b)

## Validation results on XPU
Accuracy is same as baseline. Performance is shown below.
![image](https://github.com/user-attachments/assets/7645304d-5b1d-43f9-b840-9f846ed380a0)

## Validation results on ARM
![image](https://github.com/user-attachments/assets/080f7c02-0238-436f-ad20-5a9e3f6aafbb)
![image](https://github.com/user-attachments/assets/443742aa-ca61-41de-ae80-5d4c65cd0c87)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147498
Approved by: https://github.com/fadara01, https://github.com/mingfeima, https://github.com/atalman
2025-02-24 14:32:51 +00:00
drisspg
db15cb0988 [Submodule] [Cutlass] Update to 3.8.0 tag (#147655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147655
Approved by: https://github.com/henrylhtsang, https://github.com/eqy
2025-02-22 20:05:31 +00:00
atalman
4ece056791 Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1``
For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1``
We use a pinned version of NCCL rather than a submodule.
Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl``
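The version split described above can be written as a small lookup. This is a sketch only: `nccl_pin_for_cuda` is a hypothetical helper, not a function from the PyTorch build scripts, and it covers just the CUDA versions this commit message mentions.

```python
def nccl_pin_for_cuda(cuda_version: str) -> str:
    """Pick the NCCL tag named in the commit message for a given CUDA version."""
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    if (major, minor) == (11, 8):
        return "v2.21.1-1"  # legacy pin for CUDA 11.8
    if major == 12 and 4 <= minor <= 8:
        return "v2.25.1-1"  # common pin for CUDA 12.4 through 12.8
    # Versions outside the ranges in the commit message are deliberately rejected.
    raise ValueError(f"no NCCL pin listed for CUDA {cuda_version}")


print(nccl_pin_for_cuda("12.6"))
```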

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj
2025-02-19 03:52:26 +00:00
PyTorch MergeBot
7622e29a37 Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)"
This reverts commit eecee5863e.

Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks Locally building benchmarks ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2667054179))
2025-02-18 22:23:35 +00:00
Andy Lugo
5d675de754 Update ck (#144799)
Updates the CK version and re-implements kernel generation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144799
Approved by: https://github.com/jianyuh
2025-02-18 17:00:27 +00:00
Yutao Xu
6edc419d69 Update torch-xpu-ops commit pin (#147358)
Update the torch-xpu-ops commit to [a14d1eaa834a616705068103dc8129319087e864](a14d1eaa83), includes:

- SparseCSR XPU support
- Refine build system

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147358
Approved by: https://github.com/EikanWang
2025-02-18 16:05:25 +00:00