744 Commits

Author SHA1 Message Date
Aidyn-A
ce048de608 [ATen][CPU][Sparse] Use Third-Party Eigen for sparse add and addmm (#155357)
This pull request adds the following ops for sparse matrices using Eigen library:
```python
    add(a_csr, b_csr)
    add(a_csc, b_csc)

    addmm(c_csr, a_csr, b_csr)
    addmm(c_csr, a_csr, b_csc)
    addmm(c_csr, a_csc, b_csc)
    addmm(c_csr, a_csc, b_csr)

    addmm(c_csc, a_csr, b_csr)
    addmm(c_csc, a_csr, b_csc)
    addmm(c_csc, a_csc, b_csc)
    addmm(c_csc, a_csc, b_csr)
```

Currently, the operations for sparse matrices on CPU are available through MKL only. Because MKL is not available on `aarch64`, these ops are unavailable on any machine with an ARM-based CPU, including Apple Silicon, AWS Graviton and NVIDIA Grace. This PR addresses this issue by using Eigen as a backend for the above ops.
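
As a rough illustration of what `add(a_csr, b_csr)` computes, here is a pure-Python reference sketch. It is not the Eigen code path from this PR; the helper name `csr_add` and the tuple layout `(crow_indices, col_indices, values)` are assumptions made for the example, chosen to mirror the CSR layout.

```python
# Reference semantics only: a hypothetical helper, not the Eigen-backed kernel.
# Each CSR matrix is a tuple (crow_indices, col_indices, values).

def csr_add(a, b, n_rows):
    """Add two CSR matrices of the same shape; output columns are sorted per row."""
    crow, col, val = [0], [], []
    for r in range(n_rows):
        row = {}
        # Accumulate the entries of row r from both operands, keyed by column.
        for x_crow, x_col, x_val in (a, b):
            for k in range(x_crow[r], x_crow[r + 1]):
                row[x_col[k]] = row.get(x_col[k], 0.0) + x_val[k]
        for c in sorted(row):
            col.append(c)
            val.append(row[c])
        # Row pointer: number of stored entries so far.
        crow.append(len(col))
    return crow, col, val


# [[1, 0], [0, 2]] + [[0, 3], [0, 4]] = [[1, 3], [0, 6]]
a = ([0, 1, 2], [0, 1], [1.0, 2.0])
b = ([0, 1, 2], [1, 1], [3.0, 4.0])
result = csr_add(a, b, n_rows=2)
```

The CSC variants follow the same idea with the roles of rows and columns swapped.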

This is a refactored version of my previous PR #101814. The main difference from the old one is that this does not enable Eigen by default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155357
Approved by: https://github.com/pearu, https://github.com/eqy
2025-08-20 15:44:54 +00:00
Nikita Shulga
a06ec54d40 [MPS] Add API to query GPU core count (#160414)
Using good old IOKit to get `gpu-core-count` property from device implementing `AGXAccelerator` service
Expose this one as `torch.backends.mps.get_core_count()` and make it accessible via `MpsInterface` to the inductor

Test Plan: Run `python3 -c "import torch;print(torch.backends.mps.get_name(), torch.backends.mps.get_core_count())"` and compare it to `system_profiler SPDisplaysDataType|head -n10`
```
% python3 -c "import torch;print(torch.backends.mps.get_name(), torch.backends.mps.get_core_count())"
Apple M1 Pro 16
% system_profiler SPDisplaysDataType|head -n10
Graphics/Displays:

    Apple M1 Pro:

      Chipset Model: Apple M1 Pro
      Type: GPU
      Bus: Built-In
      Total Number of Cores: 16
      Vendor: Apple (0x106b)
      Metal Support: Metal 3
```
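
The comparison in the test plan can also be scripted. The sketch below only parses `system_profiler` text such as the output above; the helper name `parse_gpu_core_count` is hypothetical, and it assumes the `Total Number of Cores:` label shown in that output.

```python
import re
from typing import Optional


def parse_gpu_core_count(profiler_output: str) -> Optional[int]:
    """Return the GPU core count from `system_profiler SPDisplaysDataType` text, if present."""
    match = re.search(r"Total Number of Cores:\s*(\d+)", profiler_output)
    return int(match.group(1)) if match else None


# Sample text copied from the test plan output above.
sample = """Graphics/Displays:

    Apple M1 Pro:

      Chipset Model: Apple M1 Pro
      Type: GPU
      Bus: Built-In
      Total Number of Cores: 16
"""
cores = parse_gpu_core_count(sample)
```

On a real machine, the parsed value would be compared against the number reported by `torch.backends.mps.get_core_count()`.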

This would significantly improve occupancy for torch.compile generated kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160414
Approved by: https://github.com/dcci
2025-08-14 00:05:17 +00:00
Scott Todd
cae2b5e3d2 [ROCm][Windows] Enable USE_ROCM, disable USE_RCCL on Windows. (#159079)
This allows setting `USE_ROCM` on Windows. A few other patches are still required to build (see https://github.com/ROCm/TheRock/issues/589), but we have instructions using open source code and rocm python packages available at https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#build-pytorch-with-rocm-support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159079
Approved by: https://github.com/jeffdaily
2025-08-12 01:28:20 +00:00
cyy
c184cb3852 [submodule] Bump fbgemm to latest (#158210)
Merge the recent commits of FBGEMM and remove unnecessary CMake code.
Specifically, we
1. enable `fbgemm_autovec` since the target is now correctly handled.
2. remove option `USE_FAKELOWP` which is not used.
3. remove `CAFFE2_COMPILER_SUPPORTS_AVX512_EXTENSIONS` check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158210
Approved by: https://github.com/q10
2025-08-11 13:48:02 +00:00
Andres Lugo
5f5f508aa8 [ROCm] Ck backend UX refactor (#152951)
Refactors how the enablement/disablement of CK Gemms and SDPA works.

- Adds USE_ROCM_CK_GEMM compile flag for enabling CK gemms.
- USE_ROCM_CK_GEMM is set to True by default on Linux
- Updates USE_CK_FLASH_ATTENTION to USE_ROCM_CK_SDPA.
- USE_ROCM_CK_SDPA is set to False by default
- (USE_CK_FLASH_ATTENTION still works for now, but will be deprecated in a future release)
- Prevents these CK libraries from being used unless PyTorch has been built specifically with the functionality AND is running on a system architecture that supports it.
- The getters for these library backends also perform validity checks in case the user changed the backend through an environment variable. If the selection is invalid (i.e. one of the conditions above is not met), the backend is set to the current non-CK default.
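
The fallback rule described in the last two bullets can be sketched as follows. This is an illustration of the described behavior only; the function `resolve_backend`, its parameters, and the backend names `"ck"` and `"default"` are hypothetical and do not correspond to the actual getters.

```python
def resolve_backend(requested: str, built_with_ck: bool, arch_supports_ck: bool,
                    default: str = "default") -> str:
    """Return the backend to use, falling back to the non-CK default when CK is not usable."""
    # CK is only valid if the build includes it AND the hardware supports it.
    if requested == "ck" and not (built_with_ck and arch_supports_ck):
        return default
    return requested


# A CK request on an unsupported architecture falls back to the default backend.
chosen = resolve_backend("ck", built_with_ck=True, arch_supports_ck=False)
```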

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152951
Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/m-gallus

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-08-08 18:40:17 +00:00
albanD
c5ec5458a5 Don't build nccl when distributed is disabled (#160086)
Because distributed doesn't build on recent compilers, I have to disable distributed, but the build still fails because nccl is still built
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160086
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
2025-08-08 17:19:16 +00:00
cyy
72c69e731f set MSVC debug information only on debug builds (#159533)
Fixes: https://github.com/pytorch/pytorch/issues/159515
Reduces the binary size increase in release builds by removing debug information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159533
Approved by: https://github.com/atalman
2025-07-31 12:57:33 +00:00
Chris Thi
c400c8e2e0 [ROCm] Add FP8 rowwise support to _scaled_grouped_mm + Submodule update (#159075)
Summary:

In this PR we integrate the [FBGEMM AMD FP8 rowwise scaling grouped GEMM kernel](https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped) to add support for the `_scaled_grouped_mm` API on AMD. `_scaled_grouped_mm` is [currently supported on NVIDIA](9faef3d17c/aten/src/ATen/native/cuda/Blas.cpp (L1614)); this PR aims to bring parity to AMD. Related: [[RFC]: PyTorch Low-Precision GEMMs Public API](https://github.com/pytorch/pytorch/issues/157950#top) #157950.

The kernel is developed using the Composable Kernel framework. Only MI300X is currently supported. In the near future we plan to add support for MI350X as well. For data types we support FP8 e4m3.

The kernel support will be gated with the `USE_FBGEMM_GENAI` flag. We hope to enable this by default for relevant AMD builds.

Note we also update submodule `third_party/fbgemm` to 0adf62831 for the required updates from fbgemm.

Test Plan:

**Hipify & build**
```
python tools/amd_build/build_amd.py
USE_FBGEMM_GENAI=1 python setup.py develop
```

**Unit tests**
```
python test/test_matmul_cuda.py -- TestFP8MatmulCUDA
Ran 488 tests in 32.969s
OK (skipped=454)
```

**Performance Sample**
| G | M | N | K | Runtime (ms) | GB/s | TFLOPS |
| --  | -- | -- | -- | -- | -- | -- |
| 128 | 1 | 2048 | 5120 | 0.37| 3590 | 7.17 |
| 128 | 64 | 2048 | 5120 | 0.51| 2792 | 338.34 |
| 128 | 128 | 2048 | 5120 | 0.66| 2272 | 522.72 |
| 128 | 1 | 5120 | 1024 | 0.21| 3224 | 6.43 |
| 128 | 64 | 5120 | 1024 | 0.29| 2590 | 291.40 |
| 128 | 128 | 5120 | 1024 | 0.40| 2165 | 434.76 |
| 128 | 1 | 4096 | 4096 | 0.69| 3126 | 6.25 |
| 128 | 64 | 4096 | 4096 | 0.85| 2655 | 324.66 |
| 128 | 128 | 4096 | 4096 | 1.10| 2142 | 501.40 |
| 128 | 1 | 8192 | 8192 | 2.45| 3508 | 7.01 |
| 128 | 64 | 8192 | 8192 | 3.27| 2692 | 336.74 |
| 128 | 128 | 8192 | 8192 | 4.04| 2224 | 543.76 |
| 16 | 1 | 2048 | 5120 | 0.04| 3928 | 7.85 |
| 16 | 64 | 2048 | 5120 | 0.05| 3295 | 399.29 |
| 16 | 128 | 2048 | 5120 | 0.07| 2558 | 588.69 |
| 16 | 1 | 5120 | 1024 | 0.03| 3119 | 6.23 |
| 16 | 64 | 5120 | 1024 | 0.03| 2849 | 320.62 |
| 16 | 128 | 5120 | 1024 | 0.05| 2013 | 404.11 |
| 16 | 1 | 4096 | 4096 | 0.06| 4512 | 9.02 |
| 16 | 64 | 4096 | 4096 | 0.09| 3124 | 381.95 |
| 16 | 128 | 4096 | 4096 | 0.13| 2340 | 547.67 |
| 16 | 1 | 8192 | 8192 | 0.32| 3374 | 6.75 |
| 16 | 64 | 8192 | 8192 | 0.42| 2593 | 324.28 |
| 16 | 128 | 8192 | 8192 | 0.53| 2120 | 518.36 |

- Using ROCm 6.4.1
- Collected through `triton.testing.do_bench_cudagraph`
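
The TFLOPS column appears consistent with counting `2 * G * M * N * K` floating-point operations per grouped GEMM and dividing by the runtime. That formula is an assumption (the PR does not state it), and the helper name `grouped_gemm_tflops` is hypothetical; small deviations from the table are expected because the runtimes are rounded.

```python
def grouped_gemm_tflops(g: int, m: int, n: int, k: int, runtime_ms: float) -> float:
    """Estimate TFLOPS assuming 2*G*M*N*K floating-point operations (assumed formula)."""
    flops = 2 * g * m * n * k
    seconds = runtime_ms * 1e-3
    return flops / seconds / 1e12


# Row G=128, M=128, N=8192, K=8192 with a reported runtime of 4.04 ms.
estimate = grouped_gemm_tflops(128, 128, 8192, 8192, 4.04)
```

For that row the estimate lands within about one TFLOPS of the reported 543.76.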

**Binary size with gfx942 arch**
Before: 116103856 Jul 23 14:12 build/lib/libtorch_hip.so
After:  118860960 Jul 23 14:29 build/lib/libtorch_hip.so
The difference is 2757104 bytes (~2.6 MiB).
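
The size difference can be re-derived from the two byte counts above; a minimal check:

```python
# Byte counts of build/lib/libtorch_hip.so copied from the lines above.
before_bytes = 116_103_856
after_bytes = 118_860_960

delta_bytes = after_bytes - before_bytes
delta_mib = delta_bytes / 2**20  # 1 MiB = 2**20 bytes
```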

Reviewers: @drisspg @ngimel @jwfromm @jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159075
Approved by: https://github.com/drisspg
2025-07-30 23:53:58 +00:00
Yu, Guangye
cbe1cb7018 [CMake] Move xpu flag to xpu.cmake (#158542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158542
Approved by: https://github.com/gujinghui, https://github.com/ezyang
2025-07-21 17:19:59 +00:00
Jane Xu
30587195d3 Migrate c10/macros/cmake_macros.h.in to torch/headeronly (#158035)
Summary: As above, also changes a bunch of the build files to be better

Test Plan:
internal and external CI

did run buck2 build fbcode//caffe2:torch and it succeeded

Rollback Plan:

Reviewed By: swolchok

Differential Revision: D78016591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158035
Approved by: https://github.com/swolchok
2025-07-15 19:52:59 +00:00
PyTorch MergeBot
b1d62febd0 Revert "Use official CUDAToolkit module in CMake (#154595)"
This reverts commit 08dae945ae.

Reverted https://github.com/pytorch/pytorch/pull/154595 on behalf of https://github.com/malfet due to It breaks on some local setup with no clear diagnostic, but looks like it fails to find cuFile ([comment](https://github.com/pytorch/pytorch/pull/154595#issuecomment-2997959344))
2025-06-23 21:15:31 +00:00
cyy
08dae945ae Use official CUDAToolkit module in CMake (#154595)
Use CUDA language in CMake and remove forked FindCUDAToolkit.cmake.
Some CUDA targets are also renamed with `torch::` prefix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154595
Approved by: https://github.com/albanD
2025-06-22 05:44:29 +00:00
cyy
95cb42c45d Use CMAKE_COLOR_DIAGNOSTICS (#154583)
`CMAKE_COLOR_DIAGNOSTICS` was introduced in CMake 3.24. Use it to simplify CMake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154583
Approved by: https://github.com/ezyang
2025-06-17 04:52:31 +00:00
Xuehai Pan
013dfeabb4 [BE] fix typos in top-level files (#156067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156067
Approved by: https://github.com/malfet
ghstack dependencies: #156066
2025-06-16 14:56:07 +00:00
cyy
c2beeadeb4 [Reland] Use 3.27 as the minimum CMake version (#154783)
Reland of #153153, which was accidentally closed.
Update the minimum CMake version to 3.27 because it provides more CUDA targets such as `CUDA::nvperf_host`, which makes it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It also facilitates future third-party updates such as FBGEMM (its current shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154783
Approved by: https://github.com/ezyang
2025-06-14 16:37:51 +00:00
Han, Chao1
cb9b479f4f XPU enable XCCL by default (#154963)
Enable USE_XCCL=ON by default when building PyTorch XPU binary, which is on par with NCCL for PyTorch CUDA binary build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154963
Approved by: https://github.com/cyyever, https://github.com/guangyey, https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/malfet

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-10 17:56:13 +00:00
Mengwei Liu
386aa72003 [BE] Cleanup old ExecuTorch codegen and runtime code (#154165)
Summary: These files were added to pytorch/pytorch before ExecuTorch was
open-sourced. Now is a good time to remove them from pytorch/pytorch, since
the code has already been moved to pytorch/executorch.

Test Plan: Rely on CI jobs.

Differential Revision: [D75985423](https://our.internmc.facebook.com/intern/diff/D75985423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154165
Approved by: https://github.com/kimishpatel, https://github.com/Skylion007, https://github.com/cyyever
2025-06-07 06:54:12 +00:00
Ke Wen
3685b10170 Turn on compile with NVSHMEM (#154538)
Before:
`USE_NVSHMEM=1` needs to be explicitly set in the build environment.

After:
`USE_NVSHMEM=1` is the default for CUDA/ROCm on Linux.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154538
Approved by: https://github.com/ngimel
2025-06-03 15:24:24 +00:00
PyTorch MergeBot
67067512a1 Revert "[BE] Cleanup old ExecuTorch codegen and runtime code (#154165)"
This reverts commit 515c19a385.

Reverted https://github.com/pytorch/pytorch/pull/154165 on behalf of https://github.com/seemethere due to This is failing when attempting to test against executorch main internally, author has acknowledged that this should be reverted ([comment](https://github.com/pytorch/pytorch/pull/154165#issuecomment-2931489616))
2025-06-02 16:28:46 +00:00
Mengwei Liu
515c19a385 [BE] Cleanup old ExecuTorch codegen and runtime code (#154165)
Summary: These files were added to pytorch/pytorch before ExecuTorch was
open-sourced. Now is a good time to remove them from pytorch/pytorch, since
the code has already been moved to pytorch/executorch.

Test Plan: Rely on CI jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154165
Approved by: https://github.com/kimishpatel, https://github.com/Skylion007, https://github.com/cyyever
2025-06-02 01:47:02 +00:00
PyTorch MergeBot
bd10ea4e6c Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit ad26ec6abe.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/cyyever due to It still breaks windows debug builds ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2923997777))
2025-05-31 02:14:24 +00:00
cyy
ad26ec6abe Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because it provides more CUDA targets such as `CUDA::nvperf_host`, which makes it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It also facilitates future third-party updates such as FBGEMM (its current shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-31 01:54:35 +00:00
PyTorch MergeBot
108422ac26 Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit 78624679a8.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/cyyever due to It still breaks windows debug builds ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2923785799))
2025-05-31 00:28:03 +00:00
cyy
78624679a8 Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because it provides more CUDA targets such as `CUDA::nvperf_host`, which makes it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It also facilitates future third-party updates such as FBGEMM (its current shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-31 00:01:52 +00:00
PyTorch MergeBot
7e8532077f Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit 1ece53b157.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/cyyever due to It still breaks windows debug builds ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2922369830))
2025-05-30 13:16:33 +00:00
cyy
1ece53b157 Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because it provides more CUDA targets such as `CUDA::nvperf_host`, which makes it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It also facilitates future third-party updates such as FBGEMM (its current shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-30 11:25:30 +00:00
PyTorch MergeBot
53b0f6f543 Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit 4613081b72.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/malfet due to It broke windows debug builds, see ef1d45b12d/1 ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2919897160))
2025-05-29 16:14:28 +00:00
cyy
4613081b72 Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because it provides more CUDA targets such as `CUDA::nvperf_host`, which makes it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It also facilitates future third-party updates such as FBGEMM (its current shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-29 00:52:44 +00:00
Yu, Guangye
a664cfdf95 Add C10_NODEPRECATED check for xpu (#153935)
# Motivation
Add `C10_NODEPRECATED` check for XPU. This doesn't allow xpu codebase to use `c10::optional`.

What does the torch-xpu-ops commit update change?
It deprecates `c10::optional`, `c10::nullopt` and `c10::make_optional`; use the counterparts in `std` instead.

# Additional Context
This PR depends on
https://github.com/intel/torch-xpu-ops/pull/1683
https://github.com/intel/torch-xpu-ops/pull/1690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153935
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-22 06:44:04 +00:00
PyTorch MergeBot
084c4aa614 Revert "Reapply "Delete TorchScript based Android demo app and point to ExecuTorch (#153633)" (#153656)"
This reverts commit 7ed377f577.

Reverted https://github.com/pytorch/pytorch/pull/153656 on behalf of https://github.com/larryliu0820 due to Still being used internally so can't remove ([comment](https://github.com/pytorch/pytorch/pull/153656#issuecomment-2887665403))
2025-05-16 21:00:11 +00:00
Mengwei Liu
7ed377f577 Reapply "Delete TorchScript based Android demo app and point to ExecuTorch (#153633)" (#153656)
This reverts commit ae0e8f0c73.

Keep android/libs/fbjni because it's being used by other components of
PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153656
Approved by: https://github.com/malfet
2025-05-16 04:35:42 +00:00
hanchao
0ca91af6b8 Define USE_C10D_XCCL and USE_XCCL in pytorch (#147593)
### Motivation:

Add `USE_XCCL` and `USE_C10D_XCCL` to enable support of XCCL backend building in stock PyTorch, similar to `USE_NCCL` and `USE_C10D_NCCL`.
By default, `USE_XCCL` is OFF and can be set to ON explicitly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147593
Approved by: https://github.com/guangyey, https://github.com/malfet, https://github.com/albanD, https://github.com/cyyever
2025-05-15 05:39:00 +00:00
Tristan Rice
9c3cef437c gloo: support ibverbs in cmake (#153425)
This updates the gloo submodule in PyTorch to a version that supports the new ibverbs backend that can be used with PyTorch.

Test plan:

```
sudo dnf install rdma-core-devel
USE_GLOO_IBVERBS=ON python setup.py develop
torchrun --nproc_per_node 2 ~/scripts/gloo_ibverbs_test.py
```

```py
"""
run with:

torchrun --nproc_per_node 2 ~/scripts/gloo_ibverbs_test.py
"""

import os

os.environ["GLOO_DEVICE_TRANSPORT"] = "IBVERBS"

import torch
import torch.distributed as dist

dist.init_process_group("gloo")

rank = dist.get_rank()

if rank == 0:
    device = "cpu"
else:
    device = "cuda"

print(device)

t = torch.full((10, 100), fill_value=(rank+1), device=device)
target = torch.full((10, 100), fill_value=3, device=device)

dist.all_reduce(t)

torch.testing.assert_close(t, target)

t = torch.full((10, 100), fill_value=(rank+1), device=device)

if rank == 0:
    dist.send(t, dst=1)
else:
    dist.recv(t, src=0)
    torch.testing.assert_close(t, torch.full_like(t, 1))
```
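
The `target` value of 3 in the script above follows from the two-rank launch: each rank fills its tensor with `rank + 1`, and `dist.all_reduce` sums across ranks by default. A minimal arithmetic check, assuming the world size of 2 implied by `--nproc_per_node 2`:

```python
# Assumes 2 ranks, matching `torchrun --nproc_per_node 2` in the test plan.
world_size = 2

# Each rank contributes a tensor filled with rank + 1, i.e. 1 and 2.
per_rank_fill = [rank + 1 for rank in range(world_size)]

# all_reduce with the default SUM op yields the elementwise sum on every rank.
expected_after_all_reduce = sum(per_rank_fill)
```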

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153425
Approved by: https://github.com/fduwjj
2025-05-13 17:09:00 +00:00
Anthony Shoumikhin
e2f9759bd0 Fix broken URLs (#152237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-04-27 09:56:42 +00:00
Nikita Shulga
73f11e3365 [BE] Do not allow PyTorch codebase to use c10::optional (#150464)
Extensions can still rely on it, and we should decorate it with deprecated, but it is a C++20 feature.
XPU still uses it, so exclude XPU builds until https://github.com/intel/torch-xpu-ops/pull/1615 is merged.

Test plan:
 - 0def9b4acc should fail MPS builds
 ```
/Users/ec2-user/runner/_work/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:975:44: error: no template named 'optional' in namespace 'c10'; did you mean 'std::optional'?
                                           c10::optional<int64_t> extra) {
                                           ^~~~~~~~~~~~~
                                           std::optional
```
 - a769759dd4 should fail CUDA builds
 ```
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/CUDASymmetricMemoryOps.cu(530): error: namespace "c10" has no member "nullopt"
        input, c10::nullopt, reduce_op, group_name, out);
                    ^

1 error detected in the compilation of
```

Fixes https://github.com/pytorch/pytorch/issues/150313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150464
Approved by: https://github.com/atalman
2025-04-26 01:15:53 +00:00
PyTorch MergeBot
0f765773e3 Revert "[BE] Do not allow PyTorch codebase to use c10::optional (#150464)"
This reverts commit 490ef768cf.

Reverted https://github.com/pytorch/pytorch/pull/150464 on behalf of https://github.com/clee2000 due to broke xpu [GH job link](https://github.com/pytorch/pytorch/actions/runs/14674243034/job/41187443432) [HUD commit link](490ef768cf)? ([comment](https://github.com/pytorch/pytorch/pull/150464#issuecomment-2831608162))
2025-04-25 23:34:56 +00:00
Nikita Shulga
490ef768cf [BE] Do not allow PyTorch codebase to use c10::optional (#150464)
Extensions can still rely on it, and we should decorate it with deprecated, but it is a C++20 feature

Test plan:
 - 0def9b4acc should fail MPS builds
 ```
/Users/ec2-user/runner/_work/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:975:44: error: no template named 'optional' in namespace 'c10'; did you mean 'std::optional'?
                                           c10::optional<int64_t> extra) {
                                           ^~~~~~~~~~~~~
                                           std::optional
```
 - a769759dd4 should fail CUDA builds
 ```
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/CUDASymmetricMemoryOps.cu(530): error: namespace "c10" has no member "nullopt"
        input, c10::nullopt, reduce_op, group_name, out);
                    ^

1 error detected in the compilation of
```

Fixes https://github.com/pytorch/pytorch/issues/150313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150464
Approved by: https://github.com/atalman
2025-04-25 22:03:48 +00:00
cyy
79e8a69257 Enable move warnings for torch targets (#149923)
This PR enables more move warnings for torch targets and fixes some code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149923
Approved by: https://github.com/malfet
2025-03-26 08:38:13 +00:00
Nikita Shulga
5a7588f183 [Build] Remove pre-CXX11 ABI logic from build script (#149888)
Only keep one in check_binary_symbols to make sure there are no pre-CXX11 ABI symbols in the library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149888
Approved by: https://github.com/atalman, https://github.com/seemethere
ghstack dependencies: #149887
2025-03-25 03:17:16 +00:00
Avanish.Tiwari
26f8d81037 Enable onednn in pytorch for ppc64le architecture (#143743)
This PR enables oneDNN on the PowerPC (ppc64le) architecture, which allows models to be quantized via oneDNN on PowerPC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143743
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-07 23:35:47 +00:00
PyTorch MergeBot
cf9efbdf16 Revert "Enable onednn in pytorch for ppc64le architecture (#143743)"
This reverts commit d4cf0e5af4.

Reverted https://github.com/pytorch/pytorch/pull/143743 on behalf of https://github.com/davidberard98 due to windows build failures look related [GH job link](https://github.com/pytorch/pytorch/actions/runs/13705127978/job/38329845095) [HUD commit link](d4cf0e5af4) ([comment](https://github.com/pytorch/pytorch/pull/143743#issuecomment-2704903253))
2025-03-06 20:47:57 +00:00
Tiwari-Avanish
d4cf0e5af4 Enable onednn in pytorch for ppc64le architecture (#143743)
This PR enables oneDNN on the PowerPC (ppc64le) architecture, which allows models to be quantized via oneDNN on PowerPC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143743
Approved by: https://github.com/malfet, https://github.com/albanD
2025-03-06 18:00:55 +00:00
Xiao Wang
976ff5cf01 Add cmake hints to USE_SYSTEM_NVTX for nvtx3 include dir (#147418)
per title

sometimes, it's hard for cmake to find NVTX3 without the cuda include path hint
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147418
Approved by: https://github.com/nWEIdia, https://github.com/malfet
2025-02-26 20:52:28 +00:00
drisspg
3ecfe6be25 [Submodule] Turning flash-attention integration into 3rd party submod (#144120) (#146372)
Summary:

# Summary

### Sticky points

Cuda-graph rng handling has changed / deviated from original implementation. We will be left with a dangling 'offset' val and confusing naming due to BC

## Dependencies
- Flash PR: https://github.com/Dao-AILab/flash-attention/pull/1419

### Other Points
- The BC linter is complaining about losing generate.py and its functions which is not real BC surface
cc albanD

imported-using-ghimport

Test Plan:
Imported from OSS

Building in dev
`buck build @//mode/dev-nosan -c fbcode.nvcc_arch=h100a  //caffe2:ATen-cu --show-full-output    `

Running `nm` on the .so, I do see that the flash symbols are correctly named:
```
0000000001c3dfb0 t pytorch_flash::run_mha_bwd(pytorch_flash::Flash_bwd_params&, CUstream_st*)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c36080 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c360e0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c35fc0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c36020 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
```

Reviewed By: vkuzo

Differential Revision: D68502879

Pulled By: drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146372
Approved by: https://github.com/jbschlosser
2025-02-26 00:10:59 +00:00
Nikhil Gupta
1f8ff6812d [Fix]: Disable KleidiAI if unsupported gcc/clang compiler is detected (#146836)
Fixes: https://github.com/pytorch/pytorch/issues/146740

Description:
1. KleidiAI officially supports GCC>=11 and Clang>=11. Certain hardware features are tied to compiler version and KleidiAI compilation will fail in such cases.
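
The gating rule can be sketched as a small version check. This is only an illustration of the stated minimums (GCC>=11, Clang>=11), not the CMake logic from the PR; the function name `kleidiai_supported` is hypothetical, and the compiler IDs `"GNU"` and `"Clang"` follow CMake's `CMAKE_CXX_COMPILER_ID` naming. Any other compiler ID is treated as unsupported here, which is a simplification.

```python
# Minimum major versions taken from the description above.
MIN_MAJOR = {"GNU": 11, "Clang": 11}


def kleidiai_supported(compiler_id: str, version: str) -> bool:
    """Return True if the compiler meets the minimum major version for KleidiAI."""
    major = int(version.split(".")[0])
    return compiler_id in MIN_MAJOR and major >= MIN_MAJOR[compiler_id]


# An older GCC would cause KleidiAI to be disabled rather than fail the build.
enable_kleidiai = kleidiai_supported("GNU", "9.4.0")
```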

Change-Id: Ib43d6b5bf66ef5ea48c481a2774801c573ec205c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146836
Approved by: https://github.com/malfet
2025-02-13 17:49:26 +00:00
Mikayla Gawarecki
861bf892fb Set USE_CUFILE=1 by default and add pypi package to binary build matrix (#145748)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145748
Approved by: https://github.com/atalman
2025-02-11 15:49:01 +00:00
Nikhil Gupta
41b38f755c Revert "Reverting the PR adding Kleidiai-based int4 kernels (#145392)" (#145505)
https://github.com/pytorch/pytorch/pull/134124 was reverted by https://github.com/pytorch/pytorch/pull/145392 due to KleidiAI clone issue.

1. This reverts commit 0940eb6d44 (https://github.com/pytorch/pytorch/pull/145392) and fixes the KleidiAI mirror issue.
2. KleidiAI is now cloned from the GitHub mirror instead of the Arm GitLab.

Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2

Fixes https://github.com/pytorch/pytorch/issues/145273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145505
Approved by: https://github.com/malfet
2025-01-23 18:50:59 +00:00
albanD
0940eb6d44 Reverting the PR adding Kleidiai-based int4 kernels (#145392)
Mitigation for https://github.com/pytorch/pytorch/issues/145273
Reverting https://github.com/pytorch/pytorch/pull/134124 and https://github.com/pytorch/pytorch/pull/144074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145392
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/atalman, https://github.com/digantdesai
2025-01-22 20:11:49 +00:00
Xu Han
2645fc45b1 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-15 23:43:41 +00:00
PyTorch MergeBot
aa14fcd96c Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit e141cb9c34.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/clee2000 due to still failing internally D67556174, see D67866123 for link to error ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2573652459))
2025-01-06 18:15:52 +00:00