Commit Graph

786 Commits

Author SHA1 Message Date
Yu, Guangye
cbe1cb7018 [CMake] Move xpu flag to xpu.cmake (#158542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158542
Approved by: https://github.com/gujinghui, https://github.com/ezyang
2025-07-21 17:19:59 +00:00
Jane Xu
30587195d3 Migrate c10/macros/cmake_macros.h.in to torch/headeronly (#158035)
Summary: As above; also updates a number of the build files accordingly

Test Plan:
internal and external CI

Ran `buck2 build fbcode//caffe2:torch` and it succeeded

Rollback Plan:

Reviewed By: swolchok

Differential Revision: D78016591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158035
Approved by: https://github.com/swolchok
2025-07-15 19:52:59 +00:00
PyTorch MergeBot
b1d62febd0 Revert "Use official CUDAToolkit module in CMake (#154595)"
This reverts commit 08dae945ae.

Reverted https://github.com/pytorch/pytorch/pull/154595 on behalf of https://github.com/malfet due to It breaks on some local setup with no clear diagnostic, but looks like it fails to find cuFile ([comment](https://github.com/pytorch/pytorch/pull/154595#issuecomment-2997959344))
2025-06-23 21:15:31 +00:00
cyy
08dae945ae Use official CUDAToolkit module in CMake (#154595)
Use CUDA language in CMake and remove forked FindCUDAToolkit.cmake.
Some CUDA targets are also renamed with `torch::` prefix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154595
Approved by: https://github.com/albanD
2025-06-22 05:44:29 +00:00
cyy
95cb42c45d Use CMAKE_COLOR_DIAGNOSTICS (#154583)
`CMAKE_COLOR_DIAGNOSTICS` was introduced in CMake 3.24. Use it to simplify CMake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154583
Approved by: https://github.com/ezyang
2025-06-17 04:52:31 +00:00
Xuehai Pan
013dfeabb4 [BE] fix typos in top-level files (#156067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156067
Approved by: https://github.com/malfet
ghstack dependencies: #156066
2025-06-16 14:56:07 +00:00
cyy
c2beeadeb4 [Reland] Use 3.27 as the minimum CMake version (#154783)
Reland of #153153, which was accidentally closed.
Update the minimum CMake version to 3.27 because it provides more CUDA targets such as `CUDA::nvperf_host`, making it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
This also facilitates future third-party updates such as FBGEMM (its currently shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154783
Approved by: https://github.com/ezyang
2025-06-14 16:37:51 +00:00
Han, Chao1
cb9b479f4f XPU enable XCCL by default (#154963)
Enable USE_XCCL=ON by default when building the PyTorch XPU binary, on par with NCCL for the PyTorch CUDA binary build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154963
Approved by: https://github.com/cyyever, https://github.com/guangyey, https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/malfet

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-10 17:56:13 +00:00
Mengwei Liu
386aa72003 [BE] Cleanup old ExecuTorch codegen and runtime code (#154165)
Summary: These files were added to pytorch/pytorch before ExecuTorch was
open-sourced. Now is a good time to remove them from pytorch/pytorch, since
the code has already moved to pytorch/executorch.

Test Plan: Rely on CI jobs.

Differential Revision: [D75985423](https://our.internmc.facebook.com/intern/diff/D75985423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154165
Approved by: https://github.com/kimishpatel, https://github.com/Skylion007, https://github.com/cyyever
2025-06-07 06:54:12 +00:00
Ke Wen
3685b10170 Turn on compile with NVSHMEM (#154538)
Before:
`USE_NVSHMEM=1` needed to be explicitly set in the build environment.

After:
`USE_NVSHMEM=1` is the default for CUDA/ROCm on Linux.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154538
Approved by: https://github.com/ngimel
2025-06-03 15:24:24 +00:00
PyTorch MergeBot
67067512a1 Revert "[BE] Cleanup old ExecuTorch codegen and runtime code (#154165)"
This reverts commit 515c19a385.

Reverted https://github.com/pytorch/pytorch/pull/154165 on behalf of https://github.com/seemethere due to This is failing when attempting to test against executorch main internally, author has acknowledged that this should be reverted ([comment](https://github.com/pytorch/pytorch/pull/154165#issuecomment-2931489616))
2025-06-02 16:28:46 +00:00
Mengwei Liu
515c19a385 [BE] Cleanup old ExecuTorch codegen and runtime code (#154165)
Summary: These files were added to pytorch/pytorch before ExecuTorch was
open-sourced. Now is a good time to remove them from pytorch/pytorch, since
the code has already moved to pytorch/executorch.

Test Plan: Rely on CI jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154165
Approved by: https://github.com/kimishpatel, https://github.com/Skylion007, https://github.com/cyyever
2025-06-02 01:47:02 +00:00
PyTorch MergeBot
bd10ea4e6c Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit ad26ec6abe.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/cyyever due to It still breaks windows debug builds ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2923997777))
2025-05-31 02:14:24 +00:00
cyy
ad26ec6abe Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because it provides more CUDA targets such as `CUDA::nvperf_host`, making it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
This also facilitates future third-party updates such as FBGEMM (its currently shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-31 01:54:35 +00:00
PyTorch MergeBot
108422ac26 Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit 78624679a8.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/cyyever due to It still breaks windows debug builds ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2923785799))
2025-05-31 00:28:03 +00:00
cyy
78624679a8 Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because it provides more CUDA targets such as `CUDA::nvperf_host`, making it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
This also facilitates future third-party updates such as FBGEMM (its currently shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-31 00:01:52 +00:00
PyTorch MergeBot
7e8532077f Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit 1ece53b157.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/cyyever due to It still breaks windows debug builds ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2922369830))
2025-05-30 13:16:33 +00:00
cyy
1ece53b157 Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because it provides more CUDA targets such as `CUDA::nvperf_host`, making it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
This also facilitates future third-party updates such as FBGEMM (its currently shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-30 11:25:30 +00:00
PyTorch MergeBot
53b0f6f543 Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit 4613081b72.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/malfet due to It broke windows debug builds, see ef1d45b12d/1 ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2919897160))
2025-05-29 16:14:28 +00:00
cyy
4613081b72 Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because it provides more CUDA targets such as `CUDA::nvperf_host`, making it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
This also facilitates future third-party updates such as FBGEMM (its currently shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-29 00:52:44 +00:00
Yu, Guangye
a664cfdf95 Add C10_NODEPRECATED check for xpu (#153935)
# Motivation
Add a `C10_NODEPRECATED` check for XPU. This prevents the XPU codebase from using `c10::optional`.

What does the torch-xpu-ops commit update change?
It deprecates `c10::optional`, `c10::nullopt`, and `c10::make_optional`, using the std counterparts instead.

# Additional Context
This PR depends on
https://github.com/intel/torch-xpu-ops/pull/1683
https://github.com/intel/torch-xpu-ops/pull/1690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153935
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-22 06:44:04 +00:00
PyTorch MergeBot
084c4aa614 Revert "Reapply "Delete TorchScript based Android demo app and point to ExecuTorch (#153633)" (#153656)"
This reverts commit 7ed377f577.

Reverted https://github.com/pytorch/pytorch/pull/153656 on behalf of https://github.com/larryliu0820 due to Still being used internally so can't remove ([comment](https://github.com/pytorch/pytorch/pull/153656#issuecomment-2887665403))
2025-05-16 21:00:11 +00:00
Mengwei Liu
7ed377f577 Reapply "Delete TorchScript based Android demo app and point to ExecuTorch (#153633)" (#153656)
This reverts commit ae0e8f0c73.

Keep android/libs/fbjni because it's being used by other components of
PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153656
Approved by: https://github.com/malfet
2025-05-16 04:35:42 +00:00
hanchao
0ca91af6b8 Define USE_C10D_XCCL and USE_XCCL in pytorch (#147593)
### Motivation:

Add `USE_XCCL` and `USE_C10D_XCCL` to enable building the XCCL backend in stock PyTorch, similar to `USE_NCCL` and `USE_C10D_NCCL`.
By default, `USE_XCCL` is OFF and can be explicitly set to ON.
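
For illustration (not part of this PR), a minimal sketch of what using the backend could look like once PyTorch is built with `USE_XCCL=ON`; the `xccl` backend name and the `xpu` device below are assumptions based on the NCCL/CUDA analogy above:

```python
# Hypothetical two-process all-reduce over the XCCL backend on Intel GPUs.
# Assumes a PyTorch build with USE_XCCL=ON / USE_C10D_XCCL=ON and available XPU devices.
# Launch with: torchrun --nproc_per_node 2 xccl_allreduce.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="xccl")
rank = dist.get_rank()
device = torch.device("xpu", rank)

t = torch.ones(4, device=device) * (rank + 1)
dist.all_reduce(t)  # with 2 ranks, every element becomes 1 + 2 = 3
print(rank, t.cpu())
dist.destroy_process_group()
```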

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147593
Approved by: https://github.com/guangyey, https://github.com/malfet, https://github.com/albanD, https://github.com/cyyever
2025-05-15 05:39:00 +00:00
Tristan Rice
9c3cef437c gloo: support ibverbs in cmake (#153425)
This updates the gloo submodule in PyTorch to a version that supports the new ibverbs backend that can be used with PyTorch.

Test plan:

```
sudo dnf install rdma-core-devel
USE_GLOO_IBVERBS=ON python setup.py develop
torchrun --nproc_per_node 2 ~/scripts/gloo_ibverbs_test.py
```

```py
"""
run with:

torchrun --nproc_per_node 2 ~/scripts/gloo_ibverbs_test.py
"""

import os

os.environ["GLOO_DEVICE_TRANSPORT"] = "IBVERBS"

import torch
import torch.distributed as dist

dist.init_process_group("gloo")

rank = dist.get_rank()

if rank == 0:
    device = "cpu"
else:
    device = "cuda"

print(device)

t = torch.full((10, 100), fill_value=(rank+1), device=device)
target = torch.full((10, 100), fill_value=3, device=device)

dist.all_reduce(t)

torch.testing.assert_close(t, target)

t = torch.full((10, 100), fill_value=(rank+1), device=device)

if rank == 0:
    dist.send(t, dst=1)
else:
    dist.recv(t, src=0)
    torch.testing.assert_close(t, torch.full_like(t, 1))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153425
Approved by: https://github.com/fduwjj
2025-05-13 17:09:00 +00:00
Anthony Shoumikhin
e2f9759bd0 Fix broken URLs (#152237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-04-27 09:56:42 +00:00
Nikita Shulga
73f11e3365 [BE] Do not allow PyTorch codebase to use c10::optional (#150464)
Extensions can still rely on it, and we should decorate it with deprecated, but it is a C++20 feature.
XPU still uses it, so XPU builds are excluded until https://github.com/intel/torch-xpu-ops/pull/1615 is merged

Test plan:
 - 0def9b4acc should fail MPS builds
 ```
/Users/ec2-user/runner/_work/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:975:44: error: no template named 'optional' in namespace 'c10'; did you mean 'std::optional'?
                                           c10::optional<int64_t> extra) {
                                           ^~~~~~~~~~~~~
                                           std::optional
```
 - a769759dd4 should fail CUDA builds
 ```
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/CUDASymmetricMemoryOps.cu(530): error: namespace "c10" has no member "nullopt"
        input, c10::nullopt, reduce_op, group_name, out);
                    ^

1 error detected in the compilation of
```

Fixes https://github.com/pytorch/pytorch/issues/150313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150464
Approved by: https://github.com/atalman
2025-04-26 01:15:53 +00:00
PyTorch MergeBot
0f765773e3 Revert "[BE] Do not allow PyTorch codebase to use c10::optional (#150464)"
This reverts commit 490ef768cf.

Reverted https://github.com/pytorch/pytorch/pull/150464 on behalf of https://github.com/clee2000 due to broke xpu [GH job link](https://github.com/pytorch/pytorch/actions/runs/14674243034/job/41187443432) [HUD commit link](490ef768cf)? ([comment](https://github.com/pytorch/pytorch/pull/150464#issuecomment-2831608162))
2025-04-25 23:34:56 +00:00
Nikita Shulga
490ef768cf [BE] Do not allow PyTorch codebase to use c10::optional (#150464)
Extensions can still rely on it, and we should decorate it with deprecated, but it is a C++20 feature

Test plan:
 - 0def9b4acc should fail MPS builds
 ```
/Users/ec2-user/runner/_work/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:975:44: error: no template named 'optional' in namespace 'c10'; did you mean 'std::optional'?
                                           c10::optional<int64_t> extra) {
                                           ^~~~~~~~~~~~~
                                           std::optional
```
 - a769759dd4 should fail CUDA builds
 ```
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/CUDASymmetricMemoryOps.cu(530): error: namespace "c10" has no member "nullopt"
        input, c10::nullopt, reduce_op, group_name, out);
                    ^

1 error detected in the compilation of
```

Fixes https://github.com/pytorch/pytorch/issues/150313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150464
Approved by: https://github.com/atalman
2025-04-25 22:03:48 +00:00
cyy
79e8a69257 Enable move warnings for torch targets (#149923)
This PR enables more move warnings for torch targets and fixes some code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149923
Approved by: https://github.com/malfet
2025-03-26 08:38:13 +00:00
Nikita Shulga
5a7588f183 [Build] Remove pre-CXX11 ABI logic from build script (#149888)
Only keep one in check_binary_symbols to make sure there are no pre-CXX11 ABI symbols in the library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149888
Approved by: https://github.com/atalman, https://github.com/seemethere
ghstack dependencies: #149887
2025-03-25 03:17:16 +00:00
Avanish.Tiwari
26f8d81037 Enable onednn in pytorch for ppc64le architecture (#143743)
This PR enables oneDNN for the PowerPC architecture, which helps with model quantization via oneDNN on PowerPC.
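
For illustration (not part of this PR), a rough sketch of the kind of oneDNN-backed dynamic quantization this enables; selecting the `onednn` engine assumes a build with oneDNN support compiled in:

```python
# Dynamically quantize the Linear layers of a small model using the oneDNN
# quantized engine; on builds without oneDNN, setting the engine raises an error.
import torch
import torch.nn as nn

torch.backends.quantized.engine = "onednn"

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(4, 128)
print(qmodel(x).shape)  # torch.Size([4, 10])
```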

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143743
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-07 23:35:47 +00:00
PyTorch MergeBot
cf9efbdf16 Revert "Enable onednn in pytorch for ppc64le architecture (#143743)"
This reverts commit d4cf0e5af4.

Reverted https://github.com/pytorch/pytorch/pull/143743 on behalf of https://github.com/davidberard98 due to windows build failures look related [GH job link](https://github.com/pytorch/pytorch/actions/runs/13705127978/job/38329845095) [HUD commit link](d4cf0e5af4) ([comment](https://github.com/pytorch/pytorch/pull/143743#issuecomment-2704903253))
2025-03-06 20:47:57 +00:00
Tiwari-Avanish
d4cf0e5af4 Enable onednn in pytorch for ppc64le architecture (#143743)
This PR enables oneDNN for the PowerPC architecture, which helps with model quantization via oneDNN on PowerPC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143743
Approved by: https://github.com/malfet, https://github.com/albanD
2025-03-06 18:00:55 +00:00
Xiao Wang
976ff5cf01 Add cmake hints to USE_SYSTEM_NVTX for nvtx3 include dir (#147418)
Per title.

Sometimes it's hard for CMake to find NVTX3 without the CUDA include path hint.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147418
Approved by: https://github.com/nWEIdia, https://github.com/malfet
2025-02-26 20:52:28 +00:00
drisspg
3ecfe6be25 [Submodule] Turning flash-attention integration into 3rd party submod (#144120) (#146372)
Summary:

# Summary

### Sticky points

CUDA-graph RNG handling has changed/deviated from the original implementation. We will be left with a dangling 'offset' value and confusing naming due to BC.

## Dependencies
- Flash PR: https://github.com/Dao-AILab/flash-attention/pull/1419

### Other Points
- The BC linter is complaining about losing generate.py and its functions, which is not a real BC surface
cc albanD

imported-using-ghimport

Test Plan:
Imported from OSS

Building in dev
`buck build @//mode/dev-nosan -c fbcode.nvcc_arch=h100a  //caffe2:ATen-cu --show-full-output    `

Running `nm` on the .so, I do see that the flash symbols are correctly named:
```
0000000001c3dfb0 t pytorch_flash::run_mha_bwd(pytorch_flash::Flash_bwd_params&, CUstream_st*)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c36080 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c360e0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c35fc0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c36020 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
```

Reviewed By: vkuzo

Differential Revision: D68502879

Pulled By: drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146372
Approved by: https://github.com/jbschlosser
2025-02-26 00:10:59 +00:00
Nikhil Gupta
1f8ff6812d [Fix]: Disable KleidiAI if unsupported gcc/clang compiler is detected (#146836)
Fixes: https://github.com/pytorch/pytorch/issues/146740

Description:
1. KleidiAI officially supports GCC>=11 and Clang>=11. Certain hardware features are tied to compiler version and KleidiAI compilation will fail in such cases.

Change-Id: Ib43d6b5bf66ef5ea48c481a2774801c573ec205c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146836
Approved by: https://github.com/malfet
2025-02-13 17:49:26 +00:00
Mikayla Gawarecki
861bf892fb Set USE_CUFILE=1 by default and add pypi package to binary build matrix (#145748)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145748
Approved by: https://github.com/atalman
2025-02-11 15:49:01 +00:00
Nikhil Gupta
41b38f755c Revert "Reverting the PR adding Kleidiai-based int4 kernels (#145392)" (#145505)
https://github.com/pytorch/pytorch/pull/134124 was reverted by https://github.com/pytorch/pytorch/pull/145392 due to a KleidiAI clone issue.

1. This reverts commit 0940eb6d44 (https://github.com/pytorch/pytorch/pull/145392) and fixes the KleidiAI mirror issue.
2. KleidiAI is now cloned from the GitHub mirror instead of Arm's GitLab.

Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2

Fixes https://github.com/pytorch/pytorch/issues/145273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145505
Approved by: https://github.com/malfet
2025-01-23 18:50:59 +00:00
albanD
0940eb6d44 Reverting the PR adding Kleidiai-based int4 kernels (#145392)
Mitigation for https://github.com/pytorch/pytorch/issues/145273
Reverting https://github.com/pytorch/pytorch/pull/134124 and https://github.com/pytorch/pytorch/pull/144074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145392
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/atalman, https://github.com/digantdesai
2025-01-22 20:11:49 +00:00
Xu Han
2645fc45b1 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-15 23:43:41 +00:00
PyTorch MergeBot
aa14fcd96c Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit e141cb9c34.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/clee2000 due to still failing internally D67556174, see D67866123 for link to error ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2573652459))
2025-01-06 18:15:52 +00:00
Xu Han
e141cb9c34 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2025-01-03 05:41:06 +00:00
cyy
379bbef23c Enable more C++ warnings (#143355)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143355
Approved by: https://github.com/albanD
2024-12-27 05:46:57 +00:00
PyTorch MergeBot
9255ffc841 Revert "Enable more C++ warnings (#143355)"
This reverts commit daa3ffe0eb.

Reverted https://github.com/pytorch/pytorch/pull/143355 on behalf of https://github.com/malfet due to It fails internal build system as it kind of breaks separation between native and native/cpu ([comment](https://github.com/pytorch/pytorch/pull/143355#issuecomment-2562961546))
2024-12-26 17:13:10 +00:00
cyy
daa3ffe0eb Enable more C++ warnings (#143355)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143355
Approved by: https://github.com/albanD
2024-12-21 09:19:02 +00:00
PyTorch MergeBot
e15442a9b2 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit 6733045a4a.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but my first attempt to fix internal build does not fix all the cases, so let us try again ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2558043056))
2024-12-21 08:06:19 +00:00
Nikhil Gupta
94737e8a2a [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights , groupsize, and the in_features and out_features.
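
For illustration, a self-contained sketch of the symmetric group-wise 4-bit quantization and nibble packing described in step 1 above; the exact tensor layout expected by `torch.ops.aten._dyn_quant_pack_4bit_weight` may differ from this simplified version:

```python
# Simplified symmetric group-wise int4 quantization of a Linear weight,
# packing two 4-bit values into one uint8 (layout here is illustrative only).
import torch

def quantize_int4_groupwise(weight: torch.Tensor, groupsize: int = 32):
    out_features, in_features = weight.shape
    assert in_features % groupsize == 0
    w = weight.reshape(out_features, in_features // groupsize, groupsize)
    # Symmetric scheme: map each group onto the signed 4-bit range [-8, 7].
    scales = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)
    q = q.reshape(out_features, in_features)
    # Store values offset by +8 so both nibbles are non-negative, then pack.
    q_u = (q + 8).to(torch.uint8)
    packed = (q_u[:, 0::2] << 4) | q_u[:, 1::2]
    return packed, scales.reshape(out_features, in_features // groupsize)

weight = torch.randn(64, 128)
packed, scales = quantize_int4_groupwise(weight)
print(packed.shape, packed.dtype, scales.shape)
# torch.Size([64, 64]) torch.uint8 torch.Size([64, 4])
```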

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-20 19:32:03 +00:00
Xu Han
6733045a4a export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-20 11:42:09 +00:00
PyTorch MergeBot
8136daff5a Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)"
This reverts commit 4b82251011.

Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks lots of internal build ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2555953189))
2024-12-19 23:33:17 +00:00
Nikhil Gupta
4b82251011 [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights , groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-19 18:51:26 +00:00
PyTorch MergeBot
14fe1f7190 Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)"
This reverts commit d3ff2d42c2.

Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/malfet due to This broke S390 builds, includes cpuinfo unconditionally ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2552560208))
2024-12-19 01:05:11 +00:00
Nikhil Gupta
d3ff2d42c2 [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights , groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-18 22:30:07 +00:00
cyy
201cb8834f Enable more C++ warnings (#143099)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143099
Approved by: https://github.com/albanD
2024-12-17 02:03:39 +00:00
cyy
2903cf0ad8 Re-enable some C++ warnings (#142332)
This enables some C++ warnings since the code base is fairly clean. Meanwhile, -Wextra-semi is disabled on CUDA-generated code since there is no way to fix those warnings without the cooperation of the CUDA team.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142332
Approved by: https://github.com/albanD, https://github.com/eqy
2024-12-12 04:02:12 +00:00
PyTorch MergeBot
90fc2b42e3 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit 82544bd3a2.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/clee2000 due to still has failures internally when building, D66923759 ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2528760716))
2024-12-09 17:04:20 +00:00
Xu Han
82544bd3a2 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-07 15:23:38 +00:00
Michal Gallus
4cbb3b4bd2 [ROCm] Enable finding HIP and ROCm libraries on Windows (#137279)
This PR introduces support for finding HIP-SDK Libraries on Windows.

Since reading the code changes in the diff view is a bit cumbersome due to the introduced if branch, let me explain what was changed:
- The Linux-specific steps to find HIP packages have been moved into the `if(UNIX)` block
- Windows steps follow in the `else()` clause

The separation was needed, because of several factors:
- HIP SDK for Windows typically names its components using `hip` (for example: `hip_version.h` instead of `rocm_version.h`, `HIP_VERSION_DEV_MAJOR` instead of `ROCM_VERSION_DEV_MAJOR`, etc.),
- The libraries included in HIP SDK are only a subset of what is available in Linux ROCm (missing hsa-rt, rccl, roctx)
- MIOpen isn't a part of the HIP SDK, but can be built separately; as of now it requires an additional path to be defined using an env var.
- Windows can only find the hip package (in a version greater than 1.0) and its libraries if the lowercase `find_package(hip ...)` is invoked first. This is because the lowercase `hip` name will cause the mechanism to find hip's packages using [config mode](https://cmake.org/cmake/help/latest/command/find_package.html#search-modes), which is the only one supported on Windows, assuming we also want to [include its libraries](https://rocm.docs.amd.com/en/latest/conceptual/cmake-packages.html#consuming-the-hip-api-in-c-code). The upper-case, module-mode-searched `find_package(HIP)` is used later for the inclusion of macros such as `hip_add_library` and related macros.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137279
Approved by: https://github.com/jeffdaily
2024-12-03 03:26:01 +00:00
atalman
c17ba69ba5 [submodule] Revert "Adds support for accelerated sorting with x86-simd-sort (#127936) (#141901)
Looks like the original PR caused: https://github.com/pytorch/pytorch/issues/140590

Please see comment: https://github.com/pytorch/pytorch/issues/140590#issuecomment-2508704480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141901
Approved by: https://github.com/andrewor14, https://github.com/malfet
2024-12-03 00:16:35 +00:00
Matthew Sterrett
7e65060410 Adds support for accelerated sorting with x86-simd-sort (#127936)
Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available.

For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads.

<details>
<summary><b>Contiguous Benchmarks</b></summary>

```
float32, normally distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.150844336    6.886271477    7.132277489    1.038420335    1.002603214
128            9.208030939    8.478154898    7.846915245    1.086089019    1.173458697
1024           37.79037627    23.60707456    16.44122627    1.600807257    2.298513241
10000          714.7355628    203.9921844    105.5683001    3.503739934    6.770361577
100000         8383.074408    721.6333354    465.3709247    11.61680593    18.01374766
1000000        97124.31945    5632.054572    3920.148401    17.24491803    24.77567416
10000000       1161974.907    86070.48988    71533.82301    13.50027063    16.24371323

int32_t, uniformly distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.203208685    6.92212224     7.014458179    1.040606975    1.026908779
128            8.972388983    8.195516348    7.592543125    1.094792396    1.18173698
1024           32.77489477    23.6874548     15.36617105    1.383639359    2.132925285
10000          607.8824128    193.3402024    99.25090471    3.144107667    6.124703997
100000         523.9384684    608.1836536    442.3166784    0.861480682    1.184532472
1000000        5211.348627    5271.598405    3518.861883    0.988570871    1.480975611
10000000       133853.6263    81463.05084    67852.97394    1.643120714    1.972700952
```

</details>

Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction.

<details>
<summary><b>Discontiguous Benchmarks</b></summary>

```
float, normal distributed, discontiguous in sorted dimension (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.836543679    4.011214256    3.84376061     0.956454439    0.99812243
128            5.755310194    5.755723127    4.820394962    0.999928257    1.193949923
1024           49.46946019    24.78790785    15.47874362    1.995709379    3.195960952
10000          665.2505291    236.6165959    143.9490662    2.811512551    4.621429974
100000         4328.002203    1329.001212    818.3516414    3.256582586    5.288682743
1000000        47651.5018     16693.72045    11827.39551    2.854456677    4.028909133
10000000       556655.1288    236252.6258    184215.9828    2.356185998    3.021752621

int32_t, uniformly distributed, discontiguous in sorted dimension  (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.817994356    3.878117442    3.770039797    0.984496837    1.012719908
128            5.578731397    5.577152082    4.716770534    1.000283176    1.182743862
1024           43.3412619     23.61275801    14.55446819    1.835501887    2.977866408
10000          634.3997478    224.4322851    133.9518324    2.826686667    4.736028889
100000         4084.358152    1292.363303    781.7867576    3.16037924     5.22438902
1000000        46262.20465    16608.35284    11367.51817    2.785478192    4.06968381
10000000       541231.9104    235185.1861    180249.9294    2.301301028    3.002674742
```

</details>
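
For illustration, a minimal sketch of how `torch.sort` can be timed on contiguous vs. discontiguous data as in the tables above; the original numbers compare default, AVX2, and AVX512 builds, which this sketch does not attempt:

```python
# Time torch.sort on a contiguous 1-D tensor and on a strided (discontiguous)
# view of the same length; sizes and iteration counts are illustrative.
import time
import torch

def bench(t, iters=20):
    start = time.time()
    for _ in range(iters):
        torch.sort(t)
    return (time.time() - start) / iters * 1e6  # microseconds per sort

n = 100_000
contiguous = torch.randn(n)
discontiguous = torch.randn(n, 2)[:, 0]  # stride 2 in the sorted dimension

print("contiguous    (us):", bench(contiguous))
print("discontiguous (us):", bench(discontiguous))
```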

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127936
Approved by: https://github.com/jgong5, https://github.com/peterbell10, https://github.com/sanchitintel
2024-11-02 02:14:01 +00:00
Xu Han
6d5944c9f1 turn off USE_MIMALLOC_ON_MKL temporary. (#139204)
Fixes #138994

We can turn off `USE_MIMALLOC_ON_MKL` temporarily, because it caused https://github.com/pytorch/pytorch/issues/138994

For a complete fix, we need to fix the `USE_STATIC_MKL` lost-functionality issue: https://github.com/pytorch/pytorch/pull/138996, and then get the correct MKL linking type (shared/static). It will still need some time to pass all CI and builder scripts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139204
Approved by: https://github.com/ezyang
2024-10-30 21:09:21 +00:00
Xu Han
96b30dcb25 [Windows][cpu] mkl use mimalloc as allocator on Windows (#138419)
We did a lot of optimization for PyTorch Windows and made good progress, but some models still have a performance gap between PyTorch Windows and PyTorch Linux. Ref: https://pytorch.org/blog/performance-boost-windows/#conclusion
From the blog's conclusion, we found that `ResNet50` is a typical case.

Let's focus on `ResNet50` and collect a profiling log:
```cmd
(nightly) D:\xu_git\dnnl_cb>python test_script_resnet50.py
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  model_inference         3.91%     682.427ms       100.00%       17.448s       17.448s             1
                     aten::conv2d         0.18%      30.906ms        64.79%       11.305s       2.133ms          5300
                aten::convolution         0.45%      78.031ms        64.62%       11.275s       2.127ms          5300
               aten::_convolution         0.30%      51.670ms        64.17%       11.196s       2.113ms          5300
         aten::mkldnn_convolution        63.58%       11.093s        63.87%       11.145s       2.103ms          5300
                 aten::batch_norm         0.13%      23.536ms        20.10%        3.506s     661.580us          5300
     aten::_batch_norm_impl_index         0.28%      49.486ms        19.96%        3.483s     657.139us          5300
          aten::native_batch_norm        19.26%        3.360s        19.64%        3.427s     646.615us          5300
                 aten::max_pool2d         0.01%       1.038ms         5.84%        1.018s      10.181ms           100
    aten::max_pool2d_with_indices         5.83%        1.017s         5.83%        1.017s      10.171ms           100
                       aten::add_         3.38%     588.907ms         3.38%     588.907ms      85.349us          6900
                      aten::relu_         0.35%      60.358ms         1.67%     292.155ms      59.624us          4900
                 aten::clamp_min_         1.33%     231.797ms         1.33%     231.797ms      47.306us          4900
                      aten::empty         0.46%      80.195ms         0.46%      80.195ms       1.513us         53000
                     aten::linear         0.01%     927.300us         0.23%      39.353ms     393.532us           100
                      aten::addmm         0.20%      35.379ms         0.21%      37.016ms     370.155us           100
                 aten::empty_like         0.12%      20.455ms         0.17%      29.976ms       5.656us          5300
                aten::as_strided_         0.11%      18.830ms         0.11%      18.830ms       3.553us          5300
        aten::adaptive_avg_pool2d         0.00%     419.900us         0.08%      14.265ms     142.647us           100
                       aten::mean         0.01%       1.737ms         0.08%      13.845ms     138.448us           100
                        aten::sum         0.05%       8.113ms         0.05%       8.648ms      86.479us           100
                    aten::resize_         0.03%       5.182ms         0.03%       5.182ms       0.978us          5300
                       aten::div_         0.01%       1.445ms         0.02%       3.460ms      34.600us           100
                         aten::to         0.00%     337.000us         0.01%       2.015ms      20.154us           100
                   aten::_to_copy         0.01%     977.500us         0.01%       1.678ms      16.784us           100
                      aten::copy_         0.01%       1.474ms         0.01%       1.474ms       7.371us           200
                          aten::t         0.00%     775.900us         0.01%       1.410ms      14.104us           100
                    aten::flatten         0.00%     420.900us         0.01%       1.311ms      13.106us           100
                       aten::view         0.01%     889.700us         0.01%     889.700us       8.897us           100
                  aten::transpose         0.00%     410.700us         0.00%     634.500us       6.345us           100
                     aten::expand         0.00%     496.800us         0.00%     566.800us       5.668us           100
                      aten::fill_         0.00%     534.800us         0.00%     534.800us       5.348us           100
                 aten::as_strided         0.00%     293.800us         0.00%     293.800us       1.469us           200
              aten::empty_strided         0.00%     241.700us         0.00%     241.700us       2.417us           100
               aten::resolve_conj         0.00%      54.800us         0.00%      54.800us       0.274us           200
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 17.448s

Execution time: 20.02380895614624
```
We found that the major kernel consuming CPU resources is `aten::mkldnn_convolution`. It was dispatched to `MKLDNN`.
Actually, we had already optimized memory allocation by integrating mimalloc into the PyTorch C10 module. It helps PyTorch Windows a lot, but it does not cover `MKL` and `MKLDNN`'s intermediary temporary memory.
We still have potential to improve PyTorch Windows performance by optimizing `MKL` and `MKLDNN`'s intermediary temporary memory.
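
For reference, a rough reconstruction of the kind of script referenced above as `test_script_resnet50.py` (the actual script is not part of this PR; `torchvision` is assumed to be available):

```python
# Profile 100 CPU inference iterations of ResNet50 and print the per-operator
# table, similar to the log above (5300 conv calls = 53 convs x 100 iterations).
import time
import torch
import torchvision

model = torchvision.models.resnet50().eval()
x = torch.randn(1, 3, 224, 224)

start = time.time()
with torch.no_grad(), torch.autograd.profiler.profile() as prof:
    for _ in range(100):
        model(x)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
print("Execution time:", time.time() - start)
```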

So, I discussed with the Intel MKL team and got a method to register a high-performance memory allocation API with MKL, which helps MKL boost memory performance. Please check the online document: https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-windows/2023-0/redefining-memory-functions.html

This PR optimizes MKL memory allocation performance on Windows by registering mi_malloc with MKL. PR changes:
1. Add a CMake option, `USE_MIMALLOC_ON_MKL`; it is a sub-option of `USE_MIMALLOC`.
2. Wrap and export the mi_malloc APIs in C10 when `USE_MIMALLOC_ON_MKL` is `ON`.
3. Add MklAllocationHelp.cpp to register the allocation APIs with MKL when `USE_MIMALLOC_ON_MKL` is `ON`.

For `oneDNN`, this is still being tracked in this proposal: https://github.com/oneapi-src/oneDNN/issues/1898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138419
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-10-24 05:29:47 +00:00
Jeongseok Lee
3cfd244495 Add USE_SYSTEM_NVTX option (#138287)
## Summary

We are currently [updating](https://github.com/conda-forge/pytorch-cpu-feedstock/pull/277) the [`conda-forge::pytorch`](https://anaconda.org/conda-forge/pytorch) package to version 2.5.0. This update includes a new dependency, the third_party/NVTX submodule. However, like other package management frameworks (e.g., apt), conda-forge prefers using system-installed packages instead of vendor-provided third-party packages.

This pull request aims to add an option, `USE_SYSTEM_NVTX`, to select whether to use the vendored nvtx or the system-installed one, with the default being the vendored one (which is the current behavior).
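
For context (not part of this PR), the choice between the vendored and the system NVTX3 is purely a build-time detail; the Python-level markers look the same either way. A minimal sketch, assuming a CUDA-enabled build:

```python
# Wrap a region of GPU work in an NVTX range so it shows up in Nsight/nvprof
# timelines; which NVTX3 (vendored or system) the build linked against does not
# change this API.
import torch

x = torch.randn(1024, 1024, device="cuda")
torch.cuda.nvtx.range_push("matmul_block")
for _ in range(10):
    y = x @ x
torch.cuda.synchronize()
torch.cuda.nvtx.range_pop()
```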

## Test Plan

The `USE_SYSTEM_NVTX` option is tested by building the `conda-forge::pytorch` package with the change applied as a [patch](cd1d2464dd/recipe/patches/0005-Use-system-nvtx3.patch).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138287
Approved by: https://github.com/albanD
2024-10-19 04:26:01 +00:00
Aleksei Nikiforov
949b6f685d Enable -Werror on s390x (#136527)
Enable -Werror on s390x

Example of original issue on s390x:
https://github.com/pytorch/pytorch/actions/runs/11014606340/job/30585632704

Most of the warnings are not specific to s390x but to gcc-13 or gcc-14. To test them on s390x, an image with gcc-13 is needed. On s390x, new regressions are tested on every merge via the trunk workflow.

`-Wdangling-reference` produces either obviously false warnings or suspicious warnings, which on closer inspection look plausibly safe.

`-Wredundant-move` with newer gcc complains about `std::move(...)` disabling copy elision. But removing `std::move(...)` makes the clang versions in use complain about copying objects when they could be moved. For now, also disable it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136527
Approved by: https://github.com/malfet
2024-10-19 01:18:42 +00:00
nihui
b3ae1b1b73 [CMake] remove duplicated cmake options for Gloo and C10D (#138318)
Just a trivial fix :P
The CMake options from line 345 to line 357 are identical to those from line 358 to line 369; remove the duplicated lines.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138318
Approved by: https://github.com/janeyx99
2024-10-19 00:26:25 +00:00
abhishek-fujitsu
b825848d85 Fix aarch64 debug build with GCC (#136990)
Fixes #136440

**Issue:**
When building PyTorch in debug mode on aarch64 architecture using GCC, we encounter relocation errors due to the R_AARCH64_CALL26 relocation limit. This occurs because debug builds with -O0 optimization generate larger code sizes, potentially exceeding the range limit for these relocations.

**Fix:**
Apply -Og optimization instead of -O0 for aarch64 GCC debug builds. This slightly reduces code size while maintaining debuggability, bringing function calls back within the range of R_AARCH64_CALL26 relocations.

The fix is implemented by conditionally setting compiler and linker flags in CMakeLists.txt:
- For aarch64 GCC debug builds: use -Og
- For all other debug builds: retain -O0

This change affects only debug builds on aarch64 with GCC, leaving other configurations unchanged.

**Testing:**
Verified that the build succeeds without relocation errors on aarch64 systems with GCC in debug mode. Ensured that debugging information is still available and useful for debugging purposes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136990
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-30 21:11:50 +00:00
PyTorch MergeBot
0e19522122 Revert "Adds support for accelerated sorting with x86-simd-sort (#127936)"
This reverts commit 239a9ad65e.

Reverted https://github.com/pytorch/pytorch/pull/127936 on behalf of https://github.com/atalman due to test/test_sort_and_select.py::TestSortAndSelectCPU::test_sort_discontiguous_slow_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10994904767/job/30525578456) [HUD commit link](239a9ad65e) ([comment](https://github.com/pytorch/pytorch/pull/127936#issuecomment-2368522316))
2024-09-23 14:52:23 +00:00
cyy
75f141be62 Avoid unnecessary CMake warnings on Windows (#136393)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136393
Approved by: https://github.com/ezyang
2024-09-23 06:42:59 +00:00
Matthew Sterrett
239a9ad65e Adds support for accelerated sorting with x86-simd-sort (#127936)
Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available.

For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads.

<details>
<summary><b>Contiguous Benchmarks</b></summary>

```
float32, normally distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.150844336    6.886271477    7.132277489    1.038420335    1.002603214
128            9.208030939    8.478154898    7.846915245    1.086089019    1.173458697
1024           37.79037627    23.60707456    16.44122627    1.600807257    2.298513241
10000          714.7355628    203.9921844    105.5683001    3.503739934    6.770361577
100000         8383.074408    721.6333354    465.3709247    11.61680593    18.01374766
1000000        97124.31945    5632.054572    3920.148401    17.24491803    24.77567416
10000000       1161974.907    86070.48988    71533.82301    13.50027063    16.24371323

int32_t, uniformly distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.203208685    6.92212224     7.014458179    1.040606975    1.026908779
128            8.972388983    8.195516348    7.592543125    1.094792396    1.18173698
1024           32.77489477    23.6874548     15.36617105    1.383639359    2.132925285
10000          607.8824128    193.3402024    99.25090471    3.144107667    6.124703997
100000         523.9384684    608.1836536    442.3166784    0.861480682    1.184532472
1000000        5211.348627    5271.598405    3518.861883    0.988570871    1.480975611
10000000       133853.6263    81463.05084    67852.97394    1.643120714    1.972700952
```

</details>

Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction.

<details>
<summary><b>Discontiguous Benchmarks</b></summary>

```
float, normal distributed, discontiguous in sorted dimension (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.836543679    4.011214256    3.84376061     0.956454439    0.99812243
128            5.755310194    5.755723127    4.820394962    0.999928257    1.193949923
1024           49.46946019    24.78790785    15.47874362    1.995709379    3.195960952
10000          665.2505291    236.6165959    143.9490662    2.811512551    4.621429974
100000         4328.002203    1329.001212    818.3516414    3.256582586    5.288682743
1000000        47651.5018     16693.72045    11827.39551    2.854456677    4.028909133
10000000       556655.1288    236252.6258    184215.9828    2.356185998    3.021752621

int32_t, uniformly distributed, discontiguous in sorted dimension  (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.817994356    3.878117442    3.770039797    0.984496837    1.012719908
128            5.578731397    5.577152082    4.716770534    1.000283176    1.182743862
1024           43.3412619     23.61275801    14.55446819    1.835501887    2.977866408
10000          634.3997478    224.4322851    133.9518324    2.826686667    4.736028889
100000         4084.358152    1292.363303    781.7867576    3.16037924     5.22438902
1000000        46262.20465    16608.35284    11367.51817    2.785478192    4.06968381
10000000       541231.9104    235185.1861    180249.9294    2.301301028    3.002674742
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127936
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-20 21:19:33 +00:00
Aditya Tewari
4842f0fac6 Enable torch build with SLEEF on ARM by default (#133339)
**Scope:** Enable PyTorch build with SLEEF on Arm by default. Enable codegen kernels compilation with SLEEF on ARM platform.

Enabling the build with SLEEF by default and setting `AT_BUILD_ARM_VEC256_WITH_SLEEF` as the default for Arm improves performance for some models. I have benchmarked several networks on `Neoverse-V1` using `torch.compile` with the `inductor` backend.
On models like `hf_Bert_Large`, `hf_GPT_fast`, we're seeing a **~1.2x speedup** (with 16 threads).

The below results are run with `Batch_Size=1` and `Cores=8, 16`

![Screenshot 2024-08-27 at 17 04 23](https://github.com/user-attachments/assets/319c7ef7-1202-4145-a51a-7a80dfd5f1f6)
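
For illustration (not from this PR), a minimal sketch of the kind of `torch.compile`/inductor benchmark referenced above; the model and shapes are placeholders rather than the `hf_Bert_Large`/`hf_GPT_fast` configurations used for the reported numbers:

```python
# Time compiled inference for a small MLP; the first compiled call triggers
# compilation, so it is excluded from the timed loop.
import time
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).eval()
compiled = torch.compile(model, backend="inductor")
x = torch.randn(32, 1024)

with torch.no_grad():
    compiled(x)  # warm-up / compile
    start = time.time()
    for _ in range(100):
        compiled(x)
print("avg latency (ms):", (time.time() - start) / 100 * 1000)
```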

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133339
Approved by: https://github.com/malfet, https://github.com/kimishpatel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-20 16:02:32 +00:00
Xinya Zhang
5fd670e0ef [ROCM] Properly disable Flash Attention/Efficient Attention with environment variables (#133866)
Now `USE_FLASH_ATTENTION=0 USE_MEM_EFF_ATTENTION=0 python setup.py` can compile correctly
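
As a side note (not part of this PR), the build-time switches above decide whether the flash/memory-efficient kernels are compiled at all; runtime backend selection is controlled separately. A minimal sketch of those runtime toggles, assuming a recent PyTorch:

```python
# Query and toggle which scaled-dot-product-attention backends the dispatcher
# may use at runtime; these flags are independent of the compile-time
# USE_FLASH_ATTENTION / USE_MEM_EFF_ATTENTION switches.
import torch

print("flash SDP enabled:        ", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient SDP enabled:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math SDP enabled:         ", torch.backends.cuda.math_sdp_enabled())

torch.backends.cuda.enable_flash_sdp(False)  # force-disable the flash path at runtime
```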

Fixes #125230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133866
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily, https://github.com/malfet
2024-08-27 18:24:29 +00:00
cyy
2ee201a7d0 [CMake] Remove BUILDING_WITH_TORCH_LIBS (#134434)
BUILDING_WITH_TORCH_LIBS is no longer used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134434
Approved by: https://github.com/ezyang
2024-08-27 01:48:21 +00:00
Zitong Zhan
90c821814e SparseCsrCUDA: cuDSS backend for linalg.solve (#129856)
This PR switches to the cuDSS library and has the same purpose as #127692: adding Sparse CSR tensor support to linalg.solve.
Fixes #69538

Minimal example of usage:
```
import torch

if __name__ == '__main__':
    # Build a small symmetric positive-definite matrix and a right-hand side,
    # then move A to CUDA as a float64 sparse CSR tensor.
    spd = torch.rand(4, 3)
    A = spd.T @ spd
    b = torch.rand(3).to(torch.float64).cuda()
    A = A.to_sparse_csr().to(torch.float64).cuda()

    # Solve A x = b with the cuDSS-backed solver and check the residual.
    x = torch.linalg.solve(A, b)
    print((A @ x - b).norm())
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129856
Approved by: https://github.com/amjames, https://github.com/lezcano, https://github.com/huydhn

Co-authored-by: Zihang Fang <zhfang1108@gmail.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2024-08-22 07:57:30 +00:00
cyy
c3d02fa390 [Reland2] Update NVTX to NVTX3 (#109843)
Another attempt to update NVTX to NVTX3. We now avoid changing the NVTX header inclusion of existing code. The advantage of NVTX3 over NVTX is that it is a header-only library, so linking against NVTX3 greatly simplifies our CMake and other build scripts that locate libraries in user environments. NVTX is indeed still present in the latest CUDA versions, but it is no longer a compiled library: it is header-only, which is why there is no .lib file anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109843
Approved by: https://github.com/peterbell10, https://github.com/eqy

Co-authored-by: Ivan Zaitsev <108101595+izaitsevfb@users.noreply.github.com>
2024-08-20 16:33:26 +00:00
cyy
fb9d2dc641 Remove Wno-invalid-partial-specialization from CMake (#133398)
The code base is clean enough that `-Winvalid-partial-specialization` can be enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133398
Approved by: https://github.com/ezyang
2024-08-18 04:06:21 +00:00
Mikayla Gawarecki
018e48c337 [Reland] Add wrappers for synchronous GPUDirect Storage APIs (#133489)
Reland #130633

USE_CUFILE turned off by default in this version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133489
Approved by: https://github.com/albanD
2024-08-15 17:11:52 +00:00
cyy
32be3e942c Remove -Wno-error=pedantic from CMake (#133074)
The codebase is largely clean, so we can turn it on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133074
Approved by: https://github.com/ezyang
2024-08-10 13:11:21 +00:00
Aidyn-A
f217b470cc [CMAKE] Avoid double setting of LDFLAGS (#130370)
It was observed that in some environments `LDFLAGS` gets appended directly to `CMAKE_SHARED_LINKER_FLAGS`. As a result, the same linker flag can appear twice in `CMAKE_SHARED_LINKER_FLAGS` because of the manual set at:
1bf4a44b33/CMakeLists.txt (L541-L542)
This flag collision causes build failures at the `cmake` stage.
This PR adds an instruction to `CMakeLists.txt` to avoid double-setting `LDFLAGS` into `CMAKE_SHARED_LINKER_FLAGS`.

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130370
Approved by: https://github.com/atalman, https://github.com/tinglvv, https://github.com/malfet
2024-07-30 18:16:04 +00:00
PyTorch MergeBot
6cf493158e Revert "Enable FlashAttention on Windows (#131906)"
This reverts commit b90bc66766.

Reverted https://github.com/pytorch/pytorch/pull/131906 on behalf of https://github.com/atalman due to Windows nightly failures ([comment](https://github.com/pytorch/pytorch/pull/131906#issuecomment-2256421183))
2024-07-29 16:49:23 +00:00
Yan Zhiwei
2a02b5cd22 [Intel GPU] Dispatch Stub support (#130019)
# Motivation
Structured codegen makes it easier to decouple tensor meta computation from the kernel implementation. At present, XPU operators have to handle tensor metas in a hand-written way.

We plan to leverage the codegen system to auto-generate structured operators. This PR adds `DispatchStub` support for Intel GPUs, so that XPU operators can register kernel functors to operator stubs.

This is a prerequisite of PR #130082, where we will modify the codegen system to generate the source files and headers needed for XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130019
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
2024-07-29 02:18:52 +00:00
Luca Wehrstedt
b90bc66766 Enable FlashAttention on Windows (#131906)
Let's just give this a try.

Reland of https://github.com/pytorch/pytorch/pull/131875.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131906
Approved by: https://github.com/drisspg
2024-07-26 21:41:56 +00:00
PyTorch MergeBot
e191b83462 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 709ddf7a9d.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607))
2024-07-26 18:08:20 +00:00
cyy
eac83479cc Enable Wunused-function and Wunused-result globally (#131596)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131596
Approved by: https://github.com/zou3519
2024-07-25 23:50:12 +00:00
Mikayla Gawarecki
709ddf7a9d Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-25 22:23:38 +00:00
Nikita Shulga
62e566b345 [BE] Remove suppression of inconsistent missing overrides (#131524)
This should prevent regressions like the ones fixed by https://github.com/pytorch/pytorch/pull/131204

- Remove global `-Wno-error=inconsistent-missing-override`
- Wrap offending includes (protobuf and asmjit) with `C10_DIAGNOSTIC_PUSH_AND_IGNORE` and `C10_DIAGNOSTIC_POP_AND_IGNORED`
- Add `override` keyword to `at::namespace::tunable::StreamTimer` and `LLVMCodeGenImpl`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131524
Approved by: https://github.com/atalman
2024-07-24 10:07:36 +00:00
PyTorch MergeBot
e4b5645f83 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 5b5e0698a5.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738))
2024-07-23 17:19:34 +00:00
Mikayla Gawarecki
5b5e0698a5 Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-22 14:51:24 +00:00
cyy
85b8503621 [Caffe2] Remove Caffe2 documentation (#130089)
This follows the removal of the Caffe2 code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130089
Approved by: https://github.com/r-barnes, https://github.com/albanD
2024-07-10 00:52:16 +00:00
cyy
a6345d3477 [CMake] [3/N] Remove unused code (#130322)
Some functions used by Caffe2 were removed along with some outdated checks. Follows #130006.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130322
Approved by: https://github.com/r-barnes
2024-07-09 19:33:33 +00:00
Xinya Zhang
d34075e0bd Add Efficient Attention support on ROCM (#124885)
This patch implements `with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):` by reusing AOTriton's accelerated SDPA implementation.
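
A minimal usage sketch (assuming a ROCm build with AOTriton available, an MI200/MI300X GPU, and a PyTorch version that exposes `torch.nn.attention.sdpa_kernel`):

```
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# ROCm devices are exposed through the "cuda" device type in PyTorch.
q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16) for _ in range(3))

# Force the memory-efficient backend, which this patch backs with AOTriton on ROCm.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)

print(out.shape)  # torch.Size([2, 8, 128, 64])
```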

Known limitations:
- Only supports MI200/MI300X GPUs
- Does not support varlen
- Does not support `CausalVariant`
- Optional arguments `causal_diagonal` and `seqlen_k` in `_efficient_attention_forward/backward` must be null
- Does not work well with inductor's SDPA rewriter. The rewriter has been updated to only use math and flash attention on ROCM.

This PR also uses a different approach, installing the AOTriton binary instead of building it from source in the base docker image. More details on the motivation: https://github.com/pytorch/pytorch/pull/124885#issuecomment-2153229129

`PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" python test/test_transformers.py` yields "55028 passed, 20784 skipped" with this change. The [previous result](https://hud.pytorch.org/pr/127528) of `test_transformers.py` was 0 errors, 0 failures, and 55229 skipped out of 75517 tests in total (the XML report does not include the total number of passed tests).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124885
Approved by: https://github.com/malfet
2024-06-08 22:41:05 +00:00
sdp
b4a0161449 Build SYCL kernels for ATen XPU ops on Native Windows (take 2) (#127390)
Original PR https://github.com/pytorch/pytorch/pull/126725 was closed due to a bad rebase.

-------
As proposed in https://github.com/pytorch/pytorch/issues/126719, we are enabling PyTorch XPU on Native Windows on Intel GPU.

This PR enables the XPU build on Windows as the first step of #126719:

- Enable the `USE_XPU` build on Windows using MSVC as the host compiler. Using MSVC as the host compiler aligns seamlessly with the existing PyTorch build on Windows.
- Build the oneDNN GPU library on Windows.

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127390
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/ezyang
2024-06-06 01:41:06 +00:00
cyy
ce4436944c Fix IOS builds (#127985)
iOS builds have been failing recently; this fixes them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127985
Approved by: https://github.com/ezyang
2024-06-05 02:14:43 +00:00
Ting Lu
1b704a160f Add linker script optimization flag to CMAKE rule for CUDA ARM wheel (#127514)
Original PR - https://github.com/pytorch/pytorch/pull/127220

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127514
Approved by: https://github.com/Aidyn-A, https://github.com/atalman
2024-06-04 20:51:44 +00:00
Xiangyang (Mark) Guo
db9d457a3f Use sleef on macOS Apple silicon by default (#126509)
Use sleef ~~for aarch64~~ on macOS Apple silicon by default.
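
One way to eyeball the build configuration at runtime (whether the SLEEF/Vec256 flag shows up in the output depends on the build and PyTorch version):

```
import torch

# Prints the compile-time configuration the current wheel was built with.
print(torch.__config__.show())
```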

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126509
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-06-03 19:33:06 +00:00
cyy
d44daebdbc [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-05-31 01:20:45 +00:00
PyTorch MergeBot
67739d8c6f Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)"
This reverts commit 699db7988d.

Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2138496995))
2024-05-30 01:16:57 +00:00
cyy
699db7988d [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-05-29 11:58:03 +00:00
PaliC
a25b28a753 [Split Build] Add option to create libtorch wheel and use it to build pytorch as a separate wheel (#126328)
Creates an option to build just the libtorch portion of PyTorch so that we have the necessary .so files, and then builds a torch package using the libtorch wheel. These options are enabled via `BUILD_LIBTORCH_WHL` and `BUILD_PYTHON_ONLY`.

We run

```
BUILD_LIBTORCH_WHL=1 python setup.py install
python setup.py clean
BUILD_PYTHON_ONLY=1 python setup.py install
```

to produce

```
sahanp@devgpu086 ~/pytorch (detached HEAD|REBASE-i 3/5)> ls /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/torch/lib/                                                                                                                (pytorch-3.10)
libshm.so*  libtorch_global_deps.so*  libtorch_python.so*
sahanp@devgpu086 ~/pytorch (detached HEAD|REBASE-i 3/5)> ldd build/lib/libtorch_python.so                                                                                                                                                                (pytorch-3.10)
        linux-vdso.so.1 (0x00007ffdc2d37000)
        libtorch.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libtorch.so (0x00007f539fe99000)
        libshm.so => /home/sahanp/pytorch/build/lib/libshm.so (0x00007f539fe90000)
        libcudnn.so.8 => /usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn.so.8 (0x00007f539e800000)
        libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007f539e400000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f539e000000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f539fda5000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f539ebe5000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f539dc00000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f539fea0000)
        libtorch_cpu.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libtorch_cpu.so (0x00007f5392400000)
        libtorch_cuda.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libtorch_cuda.so (0x00007f5380000000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f539fd9e000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f539fd99000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f539fd94000)
        libc10.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libc10.so (0x00007f539eb07000)
        libmkl_intel_lp64.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_intel_lp64.so.2 (0x00007f537ec00000)
        libmkl_gnu_thread.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_gnu_thread.so.2 (0x00007f537ce00000)
        libmkl_core.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_core.so.2 (0x00007f5378800000)
        libomp.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/libomp.so (0x00007f539e707000)
        libcupti.so.12 => /usr/local/cuda/lib64/libcupti.so.12 (0x00007f5377e00000)
        libcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x00007f5377a00000)
        libc10_cuda.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libc10_cuda.so (0x00007f539ea6a000)
        libcusparse.so.12 => /usr/local/cuda/lib64/libcusparse.so.12 (0x00007f5368400000)
        libcufft.so.11 => /usr/local/cuda/lib64/libcufft.so.11 (0x00007f535ee00000)
        libcusolver.so.11 => /usr/local/cuda/lib64/libcusolver.so.11 (0x00007f534c800000)
        libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10 (0x00007f5346200000)
        libcublas.so.12 => /usr/local/cuda/lib64/libcublas.so.12 (0x00007f533f800000)
        libcublasLt.so.12 => /usr/local/cuda/lib64/libcublasLt.so.12 (0x00007f531e800000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007f539ea63000)
        libnvJitLink.so.12 => /usr/local/cuda/lib64/libnvJitLink.so.12 (0x00007f531b800000)
sahanp@devgpu086 ~/pytorch (detached HEAD|REBASE-i 3/5)> ldd build/lib/libtorch_global_deps.so                                                                                                                                                           (pytorch-3.10)
        linux-vdso.so.1 (0x00007ffc265df000)
        libmkl_intel_lp64.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_intel_lp64.so.2 (0x00007fa93fc00000)
        libmkl_gnu_thread.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_gnu_thread.so.2 (0x00007fa93de00000)
        libmkl_core.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_core.so.2 (0x00007fa939800000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fa940f05000)
        libcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x00007fa939400000)
        libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007fa939000000)
        libgomp.so.1 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libgomp.so.1 (0x00007fa93fb07000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fa938c00000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fa940efe000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa940ef9000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fa940ff5000)
        librt.so.1 => /lib64/librt.so.1 (0x00007fa940ef2000)
        libstdc++.so.6 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libstdc++.so.6 (0x00007fa93921d000)
        libgcc_s.so.1 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libgcc_s.so.1 (0x00007fa93faec000)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126328
Approved by: https://github.com/atalman
2024-05-29 04:33:56 +00:00
PyTorch MergeBot
cdbb2c9acc Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)"
This reverts commit 4fdbaa794f.

Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2136428735))
2024-05-29 03:02:35 +00:00
cyy
4fdbaa794f [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-05-27 03:54:03 +00:00