Commit Graph

1398 Commits

Author SHA1 Message Date
Su, Tong
60523540f1 Force build to conform C++ standard on windows by adding /permissive- flag (#149035)
Fixes #147366

1. Add `/permissive-` to the `torch_compile_options` for the build to conform to the C++ standard.
2. Fix the error when trying to assign a string literal to a non-const ptr.

The `/permissive-` flag can be found at https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170

From the above [doc](https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170#remarks),
>  By default, the /permissive- option is set in new projects created by Visual Studio 2017 version 15.5 and later versions.
> The /permissive- option is implicitly set by the /std:c++latest option starting in Visual Studio 2019 version 16.8, and in version 16.11 by the /std:c++20 option.

Thus, it is reasonable to add this flag to the existing project.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149035
Approved by: https://github.com/guangyey, https://github.com/malfet
2025-03-18 01:51:46 +00:00
maajidkhann
09f7f62cfe Fix atomic operation compatibility for ARMv8-A (Raspberry Pi 4) by adjusting compilation flags (#148070)
**Issue:**
* The ldaddal instruction is an AArch64 atomic operation available from ARMv8.1-A onwards.
* Raspberry Pi 4 (Cortex-A72) is ARMv8-A, which does not support ldaddal, leading to failures when running PyTorch built with `-march=armv8.2-a+sve`
* This led to an issue when running PyTorch on ARMv8-A (Raspberry Pi 4), as unsupported atomic operations were generated.

**Fix:**
* Updated the build flags to explicitly use **-march=armv8-a+sve**, ensuring that GCC and Clang promote it correctly. This resolves the compatibility issues with ARMv8-A while SVE still works as before.
* This ensures that PyTorch builds correctly for ARMv8-A platforms (e.g., Raspberry Pi 4) while still enabling SVE for supported hardware.

Test plan:
 - Allocate `a1.4xlarge` on AWS
 - Run the following script using the wheel produced by this PR
 ```python
import torch
def f(x):
    return x.sin() + x.cos()

print(torch.__version__)
f_c = torch.jit.script(f)
```
- Observe no crash
```
$ python3 foo.py
2.7.0.dev20250313+cpu
```
- Observe crash with 2.6.0
```
$ python3 foo.py
2.6.0+cpu
Illegal instruction (core dumped)
```
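
To check in advance whether a given AArch64 Linux machine would hit this problem, one option is to look at the CPU feature flags. The sketch below is not part of the PR; it assumes the Linux convention that LSE atomics (the ARMv8.1-A group that includes `ldaddal`) appear as `atomics` in the `Features` line of `/proc/cpuinfo`. The helper names and both sample strings are hypothetical and were not captured from real hardware.

```python
def cpu_features(cpuinfo_text):
    """Return the flags from the first 'Features' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            return set(value.split())
    return set()


def supports_lse_atomics(cpuinfo_text):
    # Assumption: the kernel reports ARMv8.1-A LSE atomics as "atomics".
    return "atomics" in cpu_features(cpuinfo_text)


# Illustrative samples only (hypothetical, not taken from a real device).
SAMPLE_NO_LSE = "processor\t: 0\nFeatures\t: fp asimd evtstrm crc32 cpuid\n"
SAMPLE_LSE_SVE = "processor\t: 0\nFeatures\t: fp asimd evtstrm atomics sve cpuid\n"

print(supports_lse_atomics(SAMPLE_NO_LSE))
print(supports_lse_atomics(SAMPLE_LSE_SVE))
```

On a real machine the text would come from `open("/proc/cpuinfo").read()`.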

Fixes #146792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148070
Approved by: https://github.com/malfet
2025-03-15 00:02:38 +00:00
Natalia Gimelshein
53a1a022a9 [WIP] Initial implementation of Grouped Gemm API (#148531)
This PR provides an initial CUTLASS implementation of the grouped GEMM API as described in this [document](https://docs.google.com/document/d/1985La6wUUVH1AGBkNhaGKUXzx-9ybtbUp567-vYVOM4/edit?tab=t.0#heading=h.g8lzbjnyzzx9). Any combination of 2D and 3D inputs is supported, with the 2D input being jagged and the offsets of the jagged input given by the device tensor `offs`. Only H100 is supported, and only fp8_e4m3 with bf16 output and rowwise scaling. All dimensions of each individual GEMM must be multiples of 16, which is a CUTLASS limitation.
I'll need to add those checks; for dynamic dimensions, unfortunately, the checks will have to be device asserts.
I had to copy-paste CUTLASS's `Sm90RowBroadcast` and `Sm90ColBroadcast` structs with minor changes to enable scales given as pointer arrays; ideally those should be part of CUTLASS itself.
I copied the schedules from the similar grouped GEMM in FBGEMM, but there's a lot of room to improve perf, especially for `fast_accum=False`.
Next steps would be perf tuning and increasing coverage to B100; I don't know how the CUTLASS grouped GEMM example handles blockwise scaling on B100.
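
For readers unfamiliar with the jagged layout, the following pure-Python sketch shows what a grouped GEMM with a jagged 2D left operand and a 3D right operand could compute. It is reference semantics only, not the CUTLASS kernel: it uses plain floats with no fp8 or scaling, the function names are hypothetical, and it assumes that `offs[g]` is the exclusive end row of group `g`, which the description above does not spell out.

```python
def matmul(a, b):
    """Dense matmul on lists of lists: (M x K) @ (K x N)."""
    n = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(len(b))) for j in range(n)]
            for row in a]


def grouped_mm_2d_3d(a, b, offs, multiple=16, check_dims=True):
    """Multiply each row group of jagged `a` by its own matrix b[g].

    Assumption: offs[g] is the exclusive end row of group g inside `a`.
    """
    out = []
    start = 0
    for g, end in enumerate(offs):
        rows = a[start:end]
        if check_dims:
            # Mirrors the stated limitation: every GEMM dimension must be
            # a multiple of 16.
            dims = (len(rows), len(b[g]), len(b[g][0]))
            if any(d % multiple for d in dims):
                raise ValueError(f"group {g} has dims {dims}, "
                                 f"not all multiples of {multiple}")
        out.extend(matmul(rows, b[g]))
        start = end
    return out


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 rows in total, K = 2
b = [[[1.0], [1.0]], [[2.0], [0.0]]]       # 2 groups, each K x N = 2 x 1
# The dimension check is disabled because this toy input is tiny.
print(grouped_mm_2d_3d(a, b, offs=[1, 3], check_dims=False))
# -> [[3.0], [6.0], [10.0]]
```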

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148531
Approved by: https://github.com/drisspg
2025-03-11 21:49:46 +00:00
PyTorch MergeBot
c983e1124c Revert "[WIP] Initial implementation of Grouped Gemm API (#148531)"
This reverts commit ff29791ed8.

Reverted https://github.com/pytorch/pytorch/pull/148531 on behalf of https://github.com/janeyx99 due to Sorry but this broke ROCm jobs on trunk ([comment](https://github.com/pytorch/pytorch/pull/148531#issuecomment-2714577498))
2025-03-11 14:40:58 +00:00
Natalia Gimelshein
ff29791ed8 [WIP] Initial implementation of Grouped Gemm API (#148531)
This PR provides an initial CUTLASS implementation of the grouped GEMM API as described in this [document](https://docs.google.com/document/d/1985La6wUUVH1AGBkNhaGKUXzx-9ybtbUp567-vYVOM4/edit?tab=t.0#heading=h.g8lzbjnyzzx9). Any combination of 2D and 3D inputs is supported, with the 2D input being jagged and the offsets of the jagged input given by the device tensor `offs`. Only H100 is supported, and only fp8_e4m3 with bf16 output and rowwise scaling. All dimensions of each individual GEMM must be multiples of 16, which is a CUTLASS limitation.
I'll need to add those checks; for dynamic dimensions, unfortunately, the checks will have to be device asserts.
I had to copy-paste CUTLASS's `Sm90RowBroadcast` and `Sm90ColBroadcast` structs with minor changes to enable scales given as pointer arrays; ideally those should be part of CUTLASS itself.
I copied the schedules from the similar grouped GEMM in FBGEMM, but there's a lot of room to improve perf, especially for `fast_accum=False`.
Next steps would be perf tuning and increasing coverage to B100; I don't know how the CUTLASS grouped GEMM example handles blockwise scaling on B100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148531
Approved by: https://github.com/drisspg
2025-03-11 02:41:09 +00:00
Michal Gallus
5bbca7d328 [ROCm][Windows] Fix OpenMP Flags for clang-cl (#148097)
When clang-cl parses its command line arguments, it expects MSVC-style arguments (beginning with `/`, such as `/WX`, `/MD`, etc.) and expects clang-style arguments to be preceded by `-Xclang`; otherwise, the clang-style parameters are ignored because they are interpreted as unrecognized compiler options.
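
The rule described above can be sketched as a small argument rewriter. This is a deliberately simplistic illustration, not the CMake change made by the PR: the function name is hypothetical, and it assumes every flag that does not start with `/` should be wrapped, which is broader than what a real build would need.

```python
def to_clang_cl_args(flags):
    """Keep MSVC-style flags as-is and prefix every other flag with -Xclang."""
    out = []
    for flag in flags:
        if flag.startswith("/"):
            out.append(flag)                 # MSVC-style, passed through
        else:
            out.extend(["-Xclang", flag])    # clang-style, needs the prefix
    return out


print(to_clang_cl_args(["/MD", "-fopenmp"]))
# -> ['/MD', '-Xclang', '-fopenmp']
```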

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148097
Approved by: https://github.com/jeffdaily
2025-03-10 22:47:15 +00:00
Michal Gallus
b706044cca [ROCm][Windows] Enable hipblaslt for Windows (#148563)
This PR adds hipblaslt library as one of the Windows' dependencies. `rocBLAS` is added too, since certain symbols aren't detected with `hipblas` alone on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148563
Approved by: https://github.com/jeffdaily
2025-03-10 21:07:16 +00:00
Fadi Arafeh
d1f21d8ec3 Enable Direct Use of Arm Compute Library (ACL) in ATen (#148584)
ACL is already built with PyTorch as a shared library when USE_MKLDNN_ACL is set.
Currently, it is only used indirectly in ATen via oneDNN for AArch64 targets. However, there are cases where it makes sense to use ACL directly, without oneDNN as an intermediary, e.g. quantization. See #145942, #147337, #146620.
This patch enables such use cases by exposing ACL to ATen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148584
Approved by: https://github.com/malfet
2025-03-10 18:29:51 +00:00
Daniel Vega-Myhre
148eb735ee Change nvcc arch flags for sm100 (#148774)
### Summary
- Addressing this comment https://github.com/pytorch/pytorch/pull/148274#discussion_r1984944012

### Test plan
- Verified building from source w/ B200s is successful
- Verified B200 tensorcores are still being utilized properly via benchmarking script

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148774
Approved by: https://github.com/Skylion007
2025-03-08 19:05:53 +00:00
cyy
f7c0c230b0 Fix compile errors (#148758)
Fix
```
  /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:91:16: error: invalid application of 'sizeof' to an incomplete type 'torch::jit::AliasDb::WriteRegistry'
     91 |         static_assert(sizeof(_Tp)>0,
        |                       ^~~~~~~~~~~
  /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:399:4: note: in instantiation of member function 'std::default_delete<torch::jit::AliasDb::WriteRegistry>::operator()' requested here
    399 |           get_deleter()(std::move(__ptr));
        |           ^
  ../torch/csrc/jit/ir/alias_analysis.cpp:200:10: note: in instantiation of member function 'std::unique_ptr<torch::jit::AliasDb::WriteRegistry>::~unique_ptr' requested here
    200 | AliasDb::~AliasDb() = default;
        |          ^
  ../torch/csrc/jit/ir/alias_analysis.cpp:200:23: note: in defaulted destructor for 'torch::jit::AliasDb' first required here
    200 | AliasDb::~AliasDb() = default;
        |                       ^
  ../torch/csrc/jit/ir/alias_analysis.h:298:10: note: forward declaration of 'torch::jit::AliasDb::WriteRegistry'
    298 |   struct WriteRegistry;
        |          ^
  1 error generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148758
Approved by: https://github.com/Skylion007
2025-03-08 04:56:42 +00:00
Xinya Zhang
67742128b7 [ROCm] Bump AOTriton to 0.9.2b (#148433)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b:

* Optimize these Non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions do not need padding to power-of-two anymore.
* `is_causal=True` cases are now supported with persistent dynamic algorithm, which requires an atomic tensor but does load balance between different CTAs
* `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs
* The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so`
  + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten.
* The newly added fused backward kernel will be used for smaller workloads, due to less kernel invocation overhead.
* Support gfx1201 (RX 9070XT). This needs to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`
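
To give a sense of what the first item saves, the sketch below computes how much padding each listed head dimension would need if it had to be rounded up to a power of two. It assumes padding goes to the next power of two. The helper name is hypothetical and the numbers are plain arithmetic, not measurements.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


HEAD_DIMS = [48, 80, 96, 160, 192, 224]

for d in HEAD_DIMS:
    padded = next_power_of_two(d)
    print(f"head_dim={d:3d} -> padded={padded:3d} ({padded - d} padded columns)")
```

For example, a head dimension of 80 would be padded to 128, so 48 of 128 columns would carry no data.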

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433
Approved by: https://github.com/jeffdaily
2025-03-07 22:10:07 +00:00
ZhiweiYan-96
4075646bd8 Use oneDNN v3.7.1 for Intel GPU (#148403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148403
Approved by: https://github.com/EikanWang

Co-authored-by: majing <jing1.ma@intel.com>
Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
2025-03-07 08:03:49 +00:00
PyTorch MergeBot
96176e32a9 Revert "[ROCm] Bump AOTriton to 0.9.1b (#148433)"
This reverts commit 8af79b7ec8.

Reverted https://github.com/pytorch/pytorch/pull/148433 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/148433#issuecomment-2704638858))
2025-03-06 18:32:48 +00:00
cyy
1433bc1455 Remove CAFFE2_USE_EXCEPTION_PTR (#147247)
The check is for older compilers and is now always true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147247
Approved by: https://github.com/janeyx99
2025-03-06 02:56:23 +00:00
Xinya Zhang
8af79b7ec8 [ROCm] Bump AOTriton to 0.9.1b (#148433)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b:

* Optimize these Non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions do not need padding to power-of-two anymore.
* `is_causal=True` cases are now supported with persistent dynamic algorithm, which requires an atomic tensor but does load balance between different CTAs
* `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs
* The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so`
  + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten.
* The newly added fused backward kernel will be used for smaller workloads, due to less kernel invocation overhead.
* Support gfx1201 (RX 9070XT). This needs to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433
Approved by: https://github.com/jeffdaily
2025-03-05 19:11:57 +00:00
Daniel Vega-Myhre
ac99fc7e57 Updates to build rowwise scaled mm kernel on SM10.0a (#148274)
## Summary
Update cmake files and RowwiseScaledMM.cu to build on SM10.0a arch.

**NOTE**: performance optimization will be done in separate follow up PRs

## Steps to verify build
1. Access devgpu/machine with B200 GPUs, verify B200s are visible w/ `nvidia-smi`
2. Install CUDA toolkit 12.8
    - e.g. see [Nvidia docs](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Rocky&target_version=9&target_type=rpm_local)
3. Verify CUDA toolkit installation
    - e.g. `nvcc --version` should have `... Cuda compilation tools, release 12.8 ... ` in output
4. Set env var `TORCH_CUDA_ARCH_LIST=10.0a`
5. Build pytorch from source with this PR ([steps](https://github.com/pytorch/pytorch#from-source))
6. Uninstall `pytorch-triton` with `pip uninstall pytorch-triton`
7. Build and install triton from source: https://github.com/triton-lang/triton?tab=readme-ov-file#install-from-source
8. Run tests shown in test plan below
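
The toolkit check in step 3 can also be scripted. The sketch below is one possible way to do it and is not part of the PR: the function names are hypothetical, and the sample line (including its `V12.8.61` build suffix) is made up to match the `release 12.8` pattern quoted above.

```python
import re


def parse_nvcc_release(version_output):
    """Extract (major, minor) from the 'release X.Y' part of nvcc output."""
    match = re.search(r"release (\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def toolkit_release_is(version_output, expected=(12, 8)):
    """True when the reported release equals the expected (major, minor)."""
    return parse_nvcc_release(version_output) == expected


# Hypothetical sample line; real output spans several lines.
sample = "Cuda compilation tools, release 12.8, V12.8.61"
print(parse_nvcc_release(sample))   # -> (12, 8)
print(toolkit_release_is(sample))   # -> True
```

On a real machine, the input could come from `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`.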

**NOTE**: performance optimization will be done in a separate PR. The goal of this PR is just to ensure it builds correctly.

## Test plan
- `python test/distributed/tensor/test_matrix_ops.py  -k scaled_mm`: OK
- `python test/test_matmul_cuda.py -k rowwise`: OK
- `python test/test_flop_counter.py -k scaled_mm`: OK
- `python test/inductor/test_aot_inductor.py -k fp8`: OK
- `python test/inductor/test_fp8.py`: OK

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148274
Approved by: https://github.com/drisspg
2025-03-04 05:23:41 +00:00
ZhiweiYan-96
af720cd5a7 [Intel GPU] Decouple Intel GPU oneDNN from other backends (#147926)
# Motivation
Currently, Intel GPU is moving forward rapidly with feature development. We (Intel GPU) want independent version control over the oneDNN component so that we can quickly adopt optimizations and bug fixes provided by the oneDNN team.

This PR does not change the behaviors of other backends like Intel CPU, ARM. They can keep using the stable version contained in `third_party/ideep`.

# Detail

At compilation time, we `git clone` oneDNN from `https://github.com/oneapi-src/oneDNN` and check out the tag/commit that the Intel GPU backend prefers. This is supported by the CMake `ExternalProject_Add` command.
The following is a build log example:
```bash
[11/60] Performing download step (git clone) for 'xpu_mkldnn_proj'
Cloning into 'xpu_mkldnn_proj'...
HEAD is now at 5e92240360 meta: updated citation file
[12/60] Performing update step for 'xpu_mkldnn_proj'
-- Already at requested tag: v3.7
[13/60] No patch step for 'xpu_mkldnn_proj'
```
The log demonstrates that we explicitly download the source files and check out a specific tag. The oneDNN source files are located at `build/xpu_mkldnn_proj-prefix/src/xpu_mkldnn_proj`

# Runtime verification
Running UT for CPU
```bash
onednn_verbose,v1,info,oneDNN v3.7.0 (commit fc3f17ad469b8a6da7192ae12d32625faa509f1e)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:24
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine
```

Running UT for Intel GPU
```bash
onednn_verbose,v1,info,oneDNN v3.7.0 (commit 5e9224036021433d2577548ed0539fe9a53256bc)
onednn_verbose,v1,info,cpu,runtime:threadpool,nthr:24
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,v1,info,gpu,runtime:DPC++
onednn_verbose,v1,info,gpu,engine,sycl gpu device count:2
```

We can see that Intel GPU uses commit `5e922` (tag v3.7), while CPU uses `fc3f17`.
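
The same comparison can be automated by parsing the first `onednn_verbose` line of each run. The sketch below is illustrative and not part of the PR: the helper name is hypothetical, and the regular expression assumes the header keeps the exact shape shown in the two logs above.

```python
import re

# Matches e.g. "onednn_verbose,v1,info,oneDNN v3.7.0 (commit <hex>)".
_VERSION_LINE = re.compile(
    r"onednn_verbose,v1,info,oneDNN v(?P<version>[\d.]+) "
    r"\(commit (?P<commit>[0-9a-f]+)\)"
)


def onednn_build_info(log_text):
    """Return (version, commit) from the first matching verbose line, else None."""
    for line in log_text.splitlines():
        match = _VERSION_LINE.search(line)
        if match:
            return match.group("version"), match.group("commit")
    return None


cpu_log = (
    "onednn_verbose,v1,info,oneDNN v3.7.0 "
    "(commit fc3f17ad469b8a6da7192ae12d32625faa509f1e)\n"
    "onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:24"
)
xpu_log = (
    "onednn_verbose,v1,info,oneDNN v3.7.0 "
    "(commit 5e9224036021433d2577548ed0539fe9a53256bc)\n"
    "onednn_verbose,v1,info,gpu,runtime:DPC++"
)

cpu_info = onednn_build_info(cpu_log)
xpu_info = onednn_build_info(xpu_log)
# Same version string, different commits: the two builds are decoupled.
print(cpu_info[0] == xpu_info[0], cpu_info[1] != xpu_info[1])
```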

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147926
Approved by: https://github.com/EikanWang

Co-authored-by: leizhenyuan <zhenyuan.lei@intel.com>
2025-02-28 07:42:06 +00:00
Wang, Eikan
2c35af4def [Intel GPU] Avoid including CPU oneDNN header files for Intel GPU (#147969)
XPU builds oneDNN in another folder. The XPU oneDNN header files are in the XPU-specific folder - `${__XPU_MKLDNN_BUILD_DIR}`.
f522d899fb/cmake/Modules/FindMKLDNN.cmake (L73)

So `${PROJECT_SOURCE_DIR}/third_party/ideep/mkl-dnn/include` is not needed for XPU; `XPU_MKLDNN_INCLUDE` is sufficient. Moreover, including it may mix up header files if the XPU oneDNN version differs from that of other backends.

* __->__ #147969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147969
Approved by: https://github.com/ZhiweiYan-96, https://github.com/liangan1, https://github.com/atalman
2025-02-27 14:22:17 +00:00
Xiao Wang
976ff5cf01 Add cmake hints to USE_SYSTEM_NVTX for nvtx3 include dir (#147418)
per title

sometimes, it's hard for cmake to find NVTX3 without the cuda include path hint
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147418
Approved by: https://github.com/nWEIdia, https://github.com/malfet
2025-02-26 20:52:28 +00:00
Peter Yeh
81dccd706b [ROCm] OCP FP8 Support for new GPUs (#146632)
TLDR: Follow up/ Build on top of https://github.com/pytorch/pytorch/pull/144476. add OCP FP8 support for gfx950
refer to https://github.com/pytorch/ao/pull/1677

This pull request includes several changes to improve compatibility and support for new GPU architectures and data types, particularly for ROCm. The key updates involve adding support for new ROCm versions and GPU architectures, updating data type handling, and removing outdated checks.

### Improvements to GPU Architecture and ROCm Version Support:
* [`aten/src/ATen/Context.cpp`](diffhunk://#diff-33de472d304acbe57d693c8567370c638068bedc1aa0ce8e9dc115dad05a7810L323-R326): Added support for new GPU architectures `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks.
* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199): Updated architecture support in multiple functions to include `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL865-R876)

### Updates to Data Type Handling:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L81-L98): Enhanced data type conversion to include new float8 types for both CUDA and ROCm environments.
* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fL29-R80): Updated `HipDataTypeFor` template to handle new float8 types and added hard-coded enum values for ROCm versions prior to 6.3.

### Removal of Outdated Checks:
* [`cmake/public/LoadHIP.cmake`](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197): Removed the check for `HIP_NEW_TYPE_ENUMS` as it is no longer necessary with the updated ROCm versions. [[1]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197) [[2]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L211-R182)

These changes ensure better compatibility and performance on newer hardware and software environments, particularly for users leveraging ROCm and CUDA for deep learning and scientific computing tasks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146632
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-24 22:47:52 +00:00
PyTorch MergeBot
3e2d9d079e Revert "[ROCm] OCP FP8 Support for new GPUs (#146632)"
This reverts commit f95ab46797.

Reverted https://github.com/pytorch/pytorch/pull/146632 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, I'll find someone to help merge this PR back to main ([comment](https://github.com/pytorch/pytorch/pull/146632#issuecomment-2676823614))
2025-02-23 12:04:50 +00:00
Peter Yeh
f95ab46797 [ROCm] OCP FP8 Support for new GPUs (#146632)
TLDR: Follow up/ Build on top of https://github.com/pytorch/pytorch/pull/144476. add OCP FP8 support for gfx950
refer to https://github.com/pytorch/ao/pull/1677

This pull request includes several changes to improve compatibility and support for new GPU architectures and data types, particularly for ROCm. The key updates involve adding support for new ROCm versions and GPU architectures, updating data type handling, and removing outdated checks.

### Improvements to GPU Architecture and ROCm Version Support:
* [`aten/src/ATen/Context.cpp`](diffhunk://#diff-33de472d304acbe57d693c8567370c638068bedc1aa0ce8e9dc115dad05a7810L323-R326): Added support for new GPU architectures `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks.
* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199): Updated architecture support in multiple functions to include `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL865-R876)

### Updates to Data Type Handling:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L81-L98): Enhanced data type conversion to include new float8 types for both CUDA and ROCm environments.
* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fL29-R80): Updated `HipDataTypeFor` template to handle new float8 types and added hard-coded enum values for ROCm versions prior to 6.3.

### Removal of Outdated Checks:
* [`cmake/public/LoadHIP.cmake`](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197): Removed the check for `HIP_NEW_TYPE_ENUMS` as it is no longer necessary with the updated ROCm versions. [[1]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197) [[2]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L211-R182)

These changes ensure better compatibility and performance on newer hardware and software environments, particularly for users leveraging ROCm and CUDA for deep learning and scientific computing tasks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146632
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-21 23:44:08 +00:00
Ding, Yi1
af1072ffb6 [Intel GPU] Enable BUILD_GRAPH for xpu_mkldnn (#147608)
In preparation for enabling oneDNN-based XPU SDPA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147608
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-02-21 16:12:30 +00:00
atalman
4ece056791 Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1``
For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1``
We use a pinned version of NCCL rather than a submodule.
Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl``
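
The version split described above amounts to a small lookup. The sketch below is only an illustration of that rule, not the actual build script: the function name is hypothetical, the two version strings are copied from the message, and any CUDA version the message does not mention is rejected instead of guessed.

```python
def nccl_version_for_cuda(cuda_version):
    """Map a CUDA version string such as '12.6' to the pinned NCCL tag."""
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    if (major, minor) == (11, 8):
        return "v2.21.1-1"          # legacy pin for CUDA 11.8
    if (12, 4) <= (major, minor) <= (12, 8):
        return "v2.25.1-1"          # common pin for CUDA 12.4 through 12.8
    raise ValueError(f"no NCCL pin stated for CUDA {cuda_version}")


print(nccl_version_for_cuda("12.6"))   # -> v2.25.1-1
print(nccl_version_for_cuda("11.8"))   # -> v2.21.1-1
```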

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj
2025-02-19 03:52:26 +00:00
PyTorch MergeBot
7622e29a37 Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)"
This reverts commit eecee5863e.

Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks Locally building benchmarks ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2667054179))
2025-02-18 22:23:35 +00:00
cyy
8daa742e8b Remove code for Python < 3.9 (#147181)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147181
Approved by: https://github.com/albanD
2025-02-15 06:43:26 +00:00
atalman
eecee5863e Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1``
For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1``
We use a pinned version of NCCL rather than a submodule.
Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj
2025-02-14 21:23:19 +00:00
PyTorch MergeBot
e06ee4aa9f Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)"
This reverts commit 06f4a5c0e5.

Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks macos builds: ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2659802389))
2025-02-14 16:44:46 +00:00
atalman
06f4a5c0e5 Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1``
For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1``
We use a pinned version of NCCL rather than a submodule.
Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj
2025-02-14 15:29:59 +00:00
Nikita Shulga
df5e232563 [BE] Delete NCCL slimming (#146943)
It was added by https://github.com/pytorch/pytorch/pull/35843 and served its purpose when everything was linked statically in libtorch_cuda.so, but for all our releases it's no longer relevant as nccl is now a dynamic dependency of libtorch_cuda.so

Besides, it does not work with the CXX11 ABI anyway, and it creates problems with newer versions of NCCL, where two `collectvies.o` are packaged into the library archive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146943
Approved by: https://github.com/Skylion007, https://github.com/atalman
2025-02-12 00:35:55 +00:00
Xu Han
b1ff90ae8a remove Windows XPU build workaround. (#144644)
From the RFC: https://github.com/pytorch/pytorch/issues/141946
Fixes https://github.com/pytorch/pytorch/issues/134989

After these fix PRs land:
1. https://github.com/pytorch/pytorch/pull/142245
2. https://github.com/pytorch/pytorch/pull/141943

We can remove the Windows XPU workaround.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144644
Approved by: https://github.com/EikanWang, https://github.com/chuanqi129, https://github.com/gujinghui, https://github.com/atalman
2025-02-11 20:39:51 +00:00
Michal Gallus
3f5ed05688 [Windows][ROCm] Fix c10 hip tests (#146599)
- Solves a problem related to .hip source files being ignored by the build system when HIP language is not enabled in CMake.
- Also ensures that the test executables link to an appropriate CRT Runtime Library and hence have access to all the necessary symbols. Previously, there were many problems related to linkage errors.
- Moves part of Linux-related hipBLASLt changes in `LoadHIP.cmake` under the UNIX conditional branch, as these aren't supported on Windows yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146599
Approved by: https://github.com/jeffdaily
2025-02-06 23:41:25 +00:00
Ryo Suzuki
49082f9dba parallelize sort (#142391)
- use __gnu_parallel::sort for gcc compilations
- add a parallelized version of std::sort and std::stable_sort for non gcc compilations

Using `__gnu_parallel::sort` provides a ~3.7x speedup for sorts of length 50000 with NUM_THREADS=16 and NUM_THREADS=4 on aarch64.

The performance is measured using the following script:
```python
import torch
import torch.autograd.profiler as profiler

torch.manual_seed(0)

N = 50000
x = torch.randn(N, dtype=torch.float)

with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof:
    for i in range(1000):
        _, _ = torch.sort(x)

print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=10))

```
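
For readers who want the general idea behind a parallelized sort, here is a generic sketch of the split, sort and merge pattern in Python. It is not the C++ implementation added by this PR, whose details are not described here, and the function name is hypothetical. Because of CPython's global interpreter lock, this version shows the structure only and should not be expected to run faster than `sorted`.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge


def parallel_sorted(values, num_threads=4):
    """Sort by splitting into chunks, sorting each chunk, then merging."""
    values = list(values)
    if num_threads <= 1 or len(values) < 2:
        return sorted(values)
    chunk = -(-len(values) // num_threads)   # ceiling division
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        sorted_chunks = list(pool.map(sorted, chunks))
    # k-way merge of the already sorted chunks
    return list(merge(*sorted_chunks))


data = [(i * 7919) % 1000 for i in range(5000)]
print(parallel_sorted(data, num_threads=4) == sorted(data))
```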

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142391
Approved by: https://github.com/malfet
2025-02-06 18:06:40 +00:00
Taras
6ff3383157 Enable CUPTI on Windows (#141454)
Fixes:
- https://github.com/pytorch/pytorch/issues/93855

The PR enables CUPTI on Windows and enables unit tests to check CUDA profiling events.
Additionally, the changes can be verified using the following script:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def check_cupti_enabled():
    # Check if CUDA is available
    if not torch.cuda.is_available():
        print("CUDA is not available on this system.")
        return False

    # Create a simple CUDA tensor
    x = torch.randn(1000, 1000, device="cuda")
    y = torch.randn(1000, 1000, device="cuda")

    try:
        # Use PyTorch profiler to perform a basic check
        with profile(activities=[ProfilerActivity.CUDA]) as prof:
            z = x @ y  # Simple CUDA operation

        # Print profiling results
        print("CUPTI is enabled and profiling works.")
        print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
        return True
    except RuntimeError as e:
        # If profiling fails, CUPTI is likely not set up correctly
        print("Error: CUPTI might not be enabled or accessible.")
        print(f"Details: {e}")
        return False

if __name__ == "__main__":
    if check_cupti_enabled():
        print("CUPTI is properly configured in PyTorch.")
    else:
        print("CUPTI is not configured correctly. Check your CUDA installation.")
```

Sample output:
```
CUPTI is enabled and profiling works.
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
     sgemm_128x128x8_NN_vec         0.00%       0.000us         0.00%       0.000us       0.000us       2.086ms       100.00%       2.086ms       2.086ms             1
                   cudaFree         9.67%       9.816ms         9.67%       9.816ms       9.816ms       0.000us         0.00%       0.000us       0.000us             1
     cudaDeviceGetAttribute         0.01%      10.000us         0.01%      10.000us       0.476us       0.000us         0.00%       0.000us       0.000us            21
    cudaGetDriverEntryPoint         0.00%       1.700us         0.00%       1.700us       0.850us       0.000us         0.00%       0.000us       0.000us             2
       cudaGetSymbolAddress        85.15%      86.438ms        85.15%      86.438ms      86.438ms       0.000us         0.00%       0.000us       0.000us             1
                 cudaMalloc         0.43%     433.300us         0.43%     433.300us     144.433us       0.000us         0.00%       0.000us       0.000us             3
           cudaLaunchKernel         2.61%       2.648ms         2.61%       2.648ms       2.648ms       0.000us         0.00%       0.000us       0.000us             1
      cudaDeviceSynchronize         2.13%       2.163ms         2.13%       2.163ms       2.163ms       0.000us         0.00%       0.000us       0.000us             1
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 101.511ms
Self CUDA time total: 2.086ms

CUPTI is properly configured in PyTorch.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141454
Approved by: https://github.com/malfet
2025-02-06 15:58:20 +00:00
Aleksandar Samardžić
2b00d211f0 Build RowwiseScaledMM.cu for SM89 (#145676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145676
Approved by: https://github.com/drisspg, https://github.com/malfet, https://github.com/eqy
2025-02-01 11:44:58 +00:00
Nikita Shulga
0d5f0a81c5 [CMake] Find HomeBrew OpenMP on MacOS (#145870)
Either via `OMP_PREFIX` envvar or by searching in `/opt/homebrew/opt/libomp` folder

Modify libomp bundling logic in setup.py to change absolute path to libomp.dylib to a relative one if necessary
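
The lookup order described above can be sketched in Python. This is a minimal illustration, not the CMake logic from the PR: `find_libomp_prefix` is a hypothetical helper, and treating a prefix as valid when it contains `include/omp.h` is an assumption made for the example.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default Homebrew location named in the commit message.
HOMEBREW_LIBOMP = "/opt/homebrew/opt/libomp"


def find_libomp_prefix(
    env: Optional[Mapping[str, str]] = None,
    default_prefix: str = HOMEBREW_LIBOMP,
) -> Optional[str]:
    """Hypothetical helper: return the first prefix that holds include/omp.h."""
    env = os.environ if env is None else env
    candidates = []
    if env.get("OMP_PREFIX"):
        # An explicit OMP_PREFIX is tried before the Homebrew default.
        candidates.append(env["OMP_PREFIX"])
    candidates.append(default_prefix)
    for prefix in candidates:
        # Assumption: a usable prefix is one that contains include/omp.h.
        if (Path(prefix) / "include" / "omp.h").is_file():
            return str(Path(prefix))
    return None
```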
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145870
Approved by: https://github.com/Skylion007, https://github.com/atalman
ghstack dependencies: #145871
2025-01-30 03:19:51 +00:00
PyTorch MergeBot
b80482988f Revert "[CMake] Find HomeBrew OpenMP on MacOS (#145870)"
This reverts commit c26bb9ba5b.

Reverted https://github.com/pytorch/pytorch/pull/145870 on behalf of https://github.com/malfet due to Want to refine it a bit ([comment](https://github.com/pytorch/pytorch/pull/145870#issuecomment-2622659614))
2025-01-29 19:34:27 +00:00
Nikita Shulga
c26bb9ba5b [CMake] Find HomeBrew OpenMP on MacOS (#145870)
Either via `OMP_PREFIX` envvar or just searching in that folder
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145870
Approved by: https://github.com/Skylion007
2025-01-28 23:09:37 +00:00
Nikita Shulga
8d91bfd965 [BE] Include CheckFunctionExists in FindBLAS.cmake (#145849)
It's used in the script, so it must be included
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145849
Approved by: https://github.com/Skylion007
2025-01-28 19:47:05 +00:00
Xinya Zhang
c32bafeb0b [ROCm] Bump AOTriton to 0.8.2b (#145508)
We received reports that AOTriton kernels mishandle the bias pointer, which causes NaNs when fine-tuning the llama3.2-11b vision model. This PR fixes the problem.

Note: AOTriton 0.8.1b adds support for head dimension 512, which increases the binary size, but this support is considered experimental and is not enabled for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145508
Approved by: https://github.com/jeffdaily
2025-01-28 18:34:25 +00:00
Stefan-Alin Pahontu
0674ab7e33 solve apl dependency issue (#145215)
According to the [APL documentation](https://developer.arm.com/documentation/101004/2404/General-information/Arm-Performance-Libraries-example-programs), libraries ending with _mp are OpenMP multi-threaded libraries.

When a project is compiled with MSVC and the -openmp flag, the vcomp library (Visual C++ implementation of OpenMP) is used for runtime calls.

However, the current APL implementation uses the libomp.dll (LLVM) variant.

As a result, there are unexpected behaviors at runtime.

---

For example:

```python
import torch

# Create a sparse tensor
# Input (Sparse Tensor):
# [[0, 1],
#  [1, 0]]
indices = torch.tensor([[0, 1], [1, 0]])
values = torch.tensor([1, 1], dtype=torch.float32)
size = torch.Size([2, 2])

sparse_tensor = torch.sparse_coo_tensor(indices, values, size)

# Convert sparse tensor to dense tensor
dense_tensor = sparse_tensor.to_dense()

# Expected Output (Dense Tensor):
# [[0, 1],
#  [1, 0]]
print("\nDense Tensor:")
print(dense_tensor)
```

However, it prints unexpected outputs such as:

```python
# [[0, 11],
#  [10, 0]]
```

The issue arises because the following code does not function as expected at runtime:

https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/ParallelOpenMP.h#L30

```c++
// Returns 1; however, since OpenMP is enabled, it should return the total number of threads.
int64_t num_threads = omp_get_num_threads();
```

---

At runtime, loading multiple OpenMP libraries (in this case `libomp` and `vcomp`) causes unexpected behavior.

To fix this, we switched from the `_mp` libraries to the non-`_mp` versions and use `vcomp` for OpenMP calls.
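
As a rough illustration of the conflict described above, the following Python sketch flags a list of loaded library names that contains more than one OpenMP runtime. The helper names are hypothetical and the matching rule (file name starts with `libomp` or `vcomp`) is an assumption; it is not part of the PR and does not enumerate loaded modules itself.

```python
from typing import Iterable, Set

# Runtime name prefixes taken from the commit message:
# libomp is the LLVM runtime, vcomp is the MSVC runtime.
OPENMP_RUNTIMES = {"libomp": "LLVM", "vcomp": "MSVC"}


def openmp_runtimes_in(library_names: Iterable[str]) -> Set[str]:
    """Hypothetical helper: return the OpenMP vendors found among library names."""
    found = set()
    for name in library_names:
        # Strip any directory part and compare case-insensitively.
        base = name.replace("\\", "/").rsplit("/", 1)[-1].lower()
        for prefix, vendor in OPENMP_RUNTIMES.items():
            if base.startswith(prefix):
                found.add(vendor)
    return found


def has_conflicting_openmp(library_names: Iterable[str]) -> bool:
    """True when more than one distinct OpenMP runtime appears in the list."""
    return len(openmp_runtimes_in(library_names)) > 1
```

The list of names would have to come from a platform-specific source (for example, a process module listing), which is left out here.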
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145215
Approved by: https://github.com/ozanMSFT, https://github.com/malfet

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2025-01-27 13:02:16 +00:00
Johnny
732c4998f3 [NVIDIA] Full Family Blackwell Support codegen (#145436)
More references:
https://github.com/NVIDIA/nccl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145436
Approved by: https://github.com/ezyang, https://github.com/drisspg
2025-01-24 04:36:00 +00:00
Nikhil Gupta
41b38f755c Revert "Reverting the PR adding Kleidiai-based int4 kernels (#145392)" (#145505)
https://github.com/pytorch/pytorch/pull/134124 was reverted by https://github.com/pytorch/pytorch/pull/145392 due to a KleidiAI clone issue.

1. This reverts commit 0940eb6d44 (https://github.com/pytorch/pytorch/pull/145392) and fixes the KleidiAI mirror issue.
2. KleidiAI is now cloned from the GitHub mirror instead of Arm's GitLab.

Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2

Fixes https://github.com/pytorch/pytorch/issues/145273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145505
Approved by: https://github.com/malfet
2025-01-23 18:50:59 +00:00
Johnny
a57133e3c7 [NVIDIA] Jetson Thor Blackwell Support codegen (#145395)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145395
Approved by: https://github.com/eqy, https://github.com/malfet
2025-01-22 20:13:19 +00:00
albanD
0940eb6d44 Reverting the PR adding Kleidiai-based int4 kernels (#145392)
Mitigation for https://github.com/pytorch/pytorch/issues/145273
Reverting https://github.com/pytorch/pytorch/pull/134124 and https://github.com/pytorch/pytorch/pull/144074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145392
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/atalman, https://github.com/digantdesai
2025-01-22 20:11:49 +00:00
johnnynunez
35f5668f7e [NVIDIA] RTX50 Blackwell Support codegen (#145270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145270
Approved by: https://github.com/ezyang
2025-01-21 21:10:05 +00:00
Nikita Shulga
dc9b77cc55 [MPS] Support includes in metal objects (#145087)
Useful for reusing code in Metal shaders built for both eager mode and MPSInductor, but it requires implementing a `_cpp_embed_headers` tool that, as the name suggests, preprocesses the shader source and embeds its headers so that the result can be used in dynamic compilation.
Test using:
 -  `TestMetalLibrary.test_metal_include`
 - Moving the `i0`/`i1` implementations to `c10/util/metal_special_math.h` and calling them from the `SpecialOps.metal` shader, which now looks much more compact:
```metal
template <typename T, typename Tout = T>
void kernel
i0(constant T* input,
   device Tout* output,
   uint index [[thread_position_in_grid]]) {
  output[index] = c10::i0(static_cast<Tout>(input[index]));
}
```
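
The header-embedding idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual `_cpp_embed_headers` implementation: `embed_headers` is a hypothetical function, it only handles `#include <...>` lines, and it resolves them against a single include root.

```python
import re
from pathlib import Path
from typing import Optional, Set

# Matches lines such as: #include <c10/util/metal_special_math.h>
INCLUDE_RE = re.compile(r"^\s*#include\s+<([^>]+)>\s*$")


def embed_headers(path, include_root, seen: Optional[Set[Path]] = None) -> str:
    """Hypothetical sketch: inline headers found under include_root, recursively."""
    seen = set() if seen is None else seen
    path = Path(path).resolve()
    if path in seen:
        return ""  # each header is embedded at most once
    seen.add(path)
    out = []
    for line in path.read_text().splitlines():
        match = INCLUDE_RE.match(line)
        candidate = Path(include_root) / match.group(1) if match else None
        if candidate is not None and candidate.is_file():
            # Replace the include line with the header's (expanded) contents.
            out.append(embed_headers(candidate, include_root, seen))
        else:
            # Keep everything else, including system headers such as <metal_stdlib>.
            out.append(line)
    return "\n".join(out)
```

The resulting single string is the kind of self-contained source that could be handed to a runtime shader compiler.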
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145087
Approved by: https://github.com/dcci
ghstack dependencies: #145023
2025-01-18 05:35:22 +00:00
Jeff Daily
6ac0616504 [ROCm] hipblaslt rowwise f8 gemm (#144432)
hipblaslt added rowwise f8 gemm support.  Integrate with scaled_mm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144432
Approved by: https://github.com/drisspg
2025-01-15 18:23:44 +00:00
Xu Han
bd1f5d1c32 update xnnpack for disable libm on Windows [submodule XNNPACK] (#141943)
This PR implements the RFC: https://github.com/pytorch/pytorch/issues/141946
Changes:
1. Update `XNNPACK` to include its PRs https://github.com/google/XNNPACK/pull/7456 and https://github.com/google/XNNPACK/pull/7535, along with other build-fix PRs.
2. Set `XNNPACK_BUILD_WITH_LIBM` to `OFF`, which turns off `XNNPACK`'s CMake `find_library(libm)` lookup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141943
Approved by: https://github.com/atalman
2025-01-10 00:47:41 +00:00
Xinya Zhang
bc576355a2 Let aotriton.cmake detect the best binary package to use, and deprecate aotriton_version.txt (#137443)
We do not need `install_aotriton.sh` and `aotriton_version.txt` any more since `aotriton.cmake` now installs the best binary release package as the default option when building pytorch.

This should resolve the issue of needing a pre-installed aotriton package when building PyTorch for ROCm from source, which is not feasible if building PyTorch *outside* a CI docker image. With this change, a user can have a pre-installed AOTriton in their environment, if desired, and have the build pick it up by specifying the `AOTRITON_INSTALLED_PREFIX` env var, or have the build automatically detect and install the compatible version. As a third option, the user can also force AOTriton to build from source instead, using the `AOTRITON_INSTALL_FROM_SOURCE` env var.
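
The three options above can be summarized in a short Python sketch. It is illustrative only: `aotriton_install_strategy` is a hypothetical function, and both the precedence between the two environment variables and the treatment of any non-empty value as "set" are assumptions rather than behavior confirmed by `aotriton.cmake`.

```python
import os
from typing import Mapping, Optional, Tuple


def aotriton_install_strategy(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Hypothetical sketch: pick how AOTriton is obtained for the build."""
    env = os.environ if env is None else env
    prefix = env.get("AOTRITON_INSTALLED_PREFIX")
    if prefix:
        # Use the pre-installed AOTriton found under this prefix.
        return ("preinstalled", prefix)
    if env.get("AOTRITON_INSTALL_FROM_SOURCE"):
        # Assumption: any non-empty value requests a source build.
        return ("source", None)
    # Default: detect and install a compatible binary release package.
    return ("download", None)
```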

Also, with the changes in this PR, the cmake build process handles the tasks of copying aotriton .so and images directory from `torch/lib` to the installation path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137443
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2025-01-09 00:00:02 +00:00