- Only search for magma if it is used (GPU builds)
- Don't report it was not found when it isn't searched for
- Don't report if magma is disabled (currently: "MAGMA not found. Compiling without MAGMA support" is reported)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117858
Approved by: https://github.com/malfet
Fixes#125272
## About
(This is a re-spin of PR #106617)
Kineto introduced a new profiler to read performance counters from NVIDIA GPUs (CUPTI Range Profiler API) added in PR[75616](https://github.com/pytorch/pytorch/pull/75616). Support for the range profiler mode was disabled as we had to link with a NV PerfWorks library (`libnvperf_host.so`). This PR adds that link.
The change includes-
* Updates cmake build files to find `libnvperf_host.so` and set `CUDA_nvperf_host_LIBRARY`
* WIP use the above cmake variable in kineto, will update this PR after kineto PR has landed
See https://github.com/pytorch/kineto/pull/724
## Example usage of CUPTI profiler
The code snippet below shows how to configure pytorch profiler in CUPTI Profiler mode. Any code included in profiling window with be profiler by CUPTI/Kineto. Note how the `_ExperimentalConfig` struct is used to configure profiler metrics
```
with torch.profiler.profile(
activities=[torch.profiler.ProfilerActivity.CUDA],
record_shapes=True,
on_trace_ready=trace_handler,
experimental_config=torch.profiler._ExperimentalConfig(
profiler_metrics=[
"kineto__tensor_core_insts",
"dram__bytes_read.sum",
"dram__bytes_write.sum"],
profiler_measure_per_kernel=False),
) as prof:
res = train_batch(modeldef)
prof.step()
```
For a full example see this [xor.py](https://gist.github.com/briancoutinho/b1ec7919d8ea2bf1f019b4f4cd50ea80) gist.
### Details of how to configure CUPTI profielr
The` _Experimental` config structure can be used to pass metrics to profiler
```
profiler_metrics : a list of CUPTI profiler metrics used
to measure GPU performance events. Any metric supported by CUPTI can be used, see here=
https://docs.nvidia.com/cupti/r_main.html#r_profiler
There are two special alias metrics `kineto__tensor_core_insts` and `kineto__cuda_core_flops` for FLOPS counting.
profiler_measure_per_kernel (bool) : whether to profile metrics per kernel
or for the entire measurement duration.
```
## Testing
Built from source with kineto [PR](https://github.com/pytorch/kineto/pull/724)
```
$> USE_CUDA=1 python setup.py install
-- CUDA_cupti_LIBRARY = /public/apps/cuda/11.6/extras/CUPTI/lib64/libcupti.so
-- CUDA_nvperf_host_LIBRARY = /public/apps/cuda/11.6/extras/CUPTI/lib64/libnvperf_host.so
```
Then run example [xor.py](https://gist.github.com/briancoutinho/b1ec7919d8ea2bf1f019b4f4cd50ea80). This only works on V100+ GPUs only. Adding logs for debugging etc.
```
>$ export KINETO_LOG_LEVEL=1
>$ python xor.py
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:167] CUDA versions. CUPTI: 16; Runtime: 11060; Driver: 11040
Log file: /tmp/libkineto_activities_1683060.json
Trace start time: 2023-02-11 19:11:47 Trace duration: 500ms
Warmup duration: 0s
Max GPU buffer size: 128MB
Enabled activities: cuda_profiler_range
Cupti Profiler metrics : kineto__tensor_core_insts, dram__bytes_read.sum, dram__bytes_write.sum
Cupti Profiler measure per kernel : 0
Cupti Profiler max ranges : 10
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:638] Enabling GPU tracing
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:567] Running child profiler CuptiRangeProfiler for 500 ms
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:104] Configuring 3 CUPTI metrics
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109] sm__inst_executed_pipe_tensor.sum
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109] dram__bytes_read.sum
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109] dram__bytes_write.sum
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:575] Running child profiler CuptiRangeProfiler for 500 ms
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:672] Tracing starting in 9s
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:677] Tracing will end in 10s
STAGE:2023-02-11 19:11:37 1683060:1683060 ActivityProfilerController.cpp:310] Completed Stage: Warm Up
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:693] Starting child profiler session
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125685
Approved by: https://github.com/sraikund16
Fixes two build problems on ROCM 6.1 + Ubuntu 22.04
### Inconsistency value of CMAKE_PREFIX_PATH between `.ci/pytorch/build.sh` and Build Instructions
Current `CMAKE_PREFIX_PATH` points to the base environment of the conda (commonly `/opt/conda`). However the conda environment used in the CI should be `/opt/conda/envs/py_<VRESION>`, which is supplied by `$CONDA_PREFIX`.
This divergence may cause libstdc++ version conflicts because the base conda environment may ship a different libstdc++ than the `pv_<VERSION>`, and/or the system default environment. One notable issue is on our internal CI system this script failed to build AOTriton library on Ubuntu 22.04 due to libstdc++ version conflicts between HIP compiler and conda base environment.
This PR fixes this and make sure the CI script follows the official build instruction.
### Incorrect `tinfo` was linked on Ubuntu 22.04 due to flaws in parsing of `os-release`
The code to parse /etc/os-release is incorrect and the distribution info was parsed as `PRETTY_Ubuntu` instead of `Ubuntu`. `libtinfo` will not be linked into the binary due to this flaw. Thus, cpp unit tests failed to build because of missing symbols from `libtinfo`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118216
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet, https://github.com/atalman
Summary: Now that we have reached nanosecond granularity, we can now remove the temporary guards that were previously required for nanosecond precision.
Test Plan: Regression should cover this change
Reviewed By: aaronenyeshi
Differential Revision: D56444570
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124734
Approved by: https://github.com/aaronenyeshi
Summary:
Kineto traces use microsecond level granularity because of chrome tracing defaults to that precision. Fix by adding preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns to us conversions made in the profiler itself.
This diff contains profiler changes only. Libkineto changes found in D54964435.
Test Plan:
Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master.
Zoomer: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=796886748550189
Ran key_averages() to make sure FunctionEvent code working as expected:
-- ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
ProfilerStep* 0.74% 3.976ms 64.40% 346.613ms 69.323ms 0.000us 0.00% 61.710ms 12.342ms 5
Optimizer.zero_grad#SGD.zero_grad 0.76% 4.109ms 0.76% 4.109ms 821.743us 0.000us 0.00% 0.000us 0.000us 5
## forward ## 6.89% 37.057ms 27.19% 146.320ms 29.264ms 0.000us 0.00% 58.708ms 11.742ms 5
aten::conv2d 0.22% 1.176ms 7.74% 41.658ms 157.199us 0.000us 0.00% 27.550ms 103.962us 265
aten::convolution 0.79% 4.273ms 7.52% 40.482ms 152.762us 0.000us 0.00% 27.550ms 103.962us 265
aten::_convolution 0.69% 3.688ms 6.73% 36.209ms 136.637us 0.000us 0.00% 27.550ms 103.962us 265
aten::cudnn_convolution 6.04% 32.520ms 6.04% 32.520ms 122.719us 27.550ms 8.44% 27.550ms 103.962us 265
aten::add_ 2.42% 13.045ms 2.42% 13.045ms 30.694us 12.700ms 3.89% 12.700ms 29.882us 425
aten::batch_norm 0.19% 1.027ms 8.12% 43.717ms 164.971us 0.000us 0.00% 16.744ms 63.185us 265
aten::_batch_norm_impl_index 0.31% 1.646ms 7.93% 42.691ms 161.096us 0.000us 0.00% 16.744ms 63.185us 265
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Differential Revision: D55925068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123650
Approved by: https://github.com/aaronenyeshi
Summary:
Kineto traces use microsecond level granularity because of chrome tracing defaults to that precision. Fix by adding preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns to us conversions made in the profiler itself.
This diff contains profiler changes only. Libkineto changes found in D54964435.
Test Plan:
Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master.
Tracing with flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_37_22.4155151.pt.trace.json.gz&bucket=gpu_traces
Tracing without flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_39_15.4166047.pt.trace.json.gz&bucket=gpu_traces
Tracing on main: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_42_43.4177559.pt.trace.json.gz&bucket=gpu_traces
Ran key_averages() to make sure FunctionEvent code working as expected:
-- ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
ProfilerStep* 0.74% 3.976ms 64.40% 346.613ms 69.323ms 0.000us 0.00% 61.710ms 12.342ms 5
Optimizer.zero_grad#SGD.zero_grad 0.76% 4.109ms 0.76% 4.109ms 821.743us 0.000us 0.00% 0.000us 0.000us 5
## forward ## 6.89% 37.057ms 27.19% 146.320ms 29.264ms 0.000us 0.00% 58.708ms 11.742ms 5
aten::conv2d 0.22% 1.176ms 7.74% 41.658ms 157.199us 0.000us 0.00% 27.550ms 103.962us 265
aten::convolution 0.79% 4.273ms 7.52% 40.482ms 152.762us 0.000us 0.00% 27.550ms 103.962us 265
aten::_convolution 0.69% 3.688ms 6.73% 36.209ms 136.637us 0.000us 0.00% 27.550ms 103.962us 265
aten::cudnn_convolution 6.04% 32.520ms 6.04% 32.520ms 122.719us 27.550ms 8.44% 27.550ms 103.962us 265
aten::add_ 2.42% 13.045ms 2.42% 13.045ms 30.694us 12.700ms 3.89% 12.700ms 29.882us 425
aten::batch_norm 0.19% 1.027ms 8.12% 43.717ms 164.971us 0.000us 0.00% 16.744ms 63.185us 265
aten::_batch_norm_impl_index 0.31% 1.646ms 7.93% 42.691ms 161.096us 0.000us 0.00% 16.744ms 63.185us 265
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Differential Revision: D55087993
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122425
Approved by: https://github.com/aaronenyeshi
This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton)
- [x] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
* MI300X is supported. More architectures will be added once Triton support them.
- [x] Only supports power of two sequence lengths.
* Now it support arbitrary sequence length
- [ ] No support for varlen APIs.
* varlen API will be supported in future release of AOTriton
- [x] Only support head dimension 16,32,64,128.
* Now it support arbitrary head dimension <= 256
- [x] Performance is still being optimized.
* Kernel is selected according to autotune information from Triton.
Other improvements from AOTriton include
* Allow more flexible Tensor storage layout
* More flexible API
This is a more extensive fix to #112997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561
Approved by: https://github.com/huydhn
This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton)
- [x] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
* MI300X is supported. More architectures will be added once Triton support them.
- [x] Only supports power of two sequence lengths.
* Now it support arbitrary sequence length
- [ ] No support for varlen APIs.
* varlen API will be supported in the next release of AOTriton
- [x] Only support head dimension 16,32,64,128.
* Now it support arbitrary head dimension <= 256
- [x] Performance is still being optimized.
* Kernel is selected according to autotune information from Triton.
Other improvements from AOTriton include
* Allow more flexible Tensor storage layout
* More flexible API
This is a more extensive fix to #112997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561
Approved by: https://github.com/malfet, https://github.com/atalman
The existing use of "if(NOT ENV{ROCM_SOURCE_DIR})" seems to be
not working correctly, e.g.
```
$ cmake --version
cmake version 3.26.4
$ cat CMakeList.txt
cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
project(FOO)
if(NOT ENV{ROCM_SOURCE_DIR})
message(INFO ": not defined 1")
else()
message(INFO ": defined 1: $ENV{ROCM_SOURCE_DIR}")
endif()
if("$ENV{ROCM_SOURCE_DIR}" STREQUAL "")
message(INFO ": not defined 2")
else()
message(INFO ": defined 2: $ENV{ROCM_SOURCE_DIR}")
endif()
$ ROCM_SOURCE_DIR=/tmp cmake .
INFO: not defined 1
INFO: defined 2: /tmp
-- Configuring done (0.0s)
-- Generating done (0.0s)
-- Build files have been written to: /home/yangche/tmp/tmp
```
This PR replace it with a STREQUAL check. Note that the choice
of STREQUAL is to avoid cases like:
```
$ ROCM_SOURCE_DIR= cmake .
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120858
Approved by: https://github.com/jianyuh, https://github.com/jeffdaily
Some operations, such as GEMMs, could be implemented using more than one library or more than one technique. For example, a GEMM could be implemented for CUDA or ROCm using either the blas or blasLt libraries. Further, ROCm's rocblas and hipblaslt libraries allow the user to query for all possible algorithms and then choose one. How does one know which implementation is the fastest and should be chosen? That's what TunableOp provides.
See the README.md for additional details.
TunableOp was ported from onnxruntime starting from commit 08dce54266. The content was significantly modified and reorganized for use within PyTorch. The files copied and their approximate new names or source content location within aten/src/ATen/cuda/tunable include the following:
- onnxruntime/core/framework/tunable.h -> Tunable.h
- onnxruntime/core/framework/tuning_context.h -> Tunable.h
- onnxruntime/core/framework/tuning_context_impl.h -> Tunable.cpp
- onnxruntime/core/providers/rocm/tunable/gemm_common.h -> GemmCommon.h
- onnxruntime/core/providers/rocm/tunable/gemm_hipblaslt.h -> GemmHipblaslt.h
- onnxruntime/core/providers/rocm/tunable/gemm_rocblas.h -> GemmRocblas.h
- onnxruntime/core/providers/rocm/tunable/gemm_tunable.cuh -> TunableGemm.h
- onnxruntime/core/providers/rocm/tunable/rocm_tuning_context.cc -> Tunable.cpp
- onnxruntime/core/providers/rocm/tunable/util.h -> StreamTimer.h
- onnxruntime/core/providers/rocm/tunable/util.cc -> StreamTimer.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114894
Approved by: https://github.com/xw285cornell, https://github.com/jianyuh
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), The first runtime component we would like to upstream is `Device` which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 4 PRs. This is one of the 4 PRs and covers the changes under `c10`.
# Design
Intel GPU device is a wrapper of sycl device on which kernels can be executed. In our design, we will maintain a sycl device pool containing all the GPU devices of the current machine, and manage the status of the device pool by PyTorch. The thread local safe is considered in this design. The corresponding C++ files related to `Device` will be placed in c10/xpu folder. And we provide the c10 device runtime APIs, like
- `c10::xpu::device_count`
- `c10::xpu::set_device`
- ...
# Additional Context
In our plan, 4 PRs should be submitted to PyTorch for `Device`:
1. for c10
2. for aten
3. for python frontend
4. for lazy initialization shared with CUDA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116019
Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
Note about the Updates:
This PR:
1. skips more flash attention related UTs on MI200
2. Fix additional ATen compiling errors after hipification
3. Fix the author "root" of a specific commit
4. Includes the patch from Nikita in favor of block level static initialization.
CAVEAT: This revised PR has a commit that modifies the CI to force its running on MI200 nodes. That specific commit must be reverted before merge.
Original PR (https://github.com/pytorch/pytorch/pull/114309) Note:
This pull requests add initial Flash Attention support for AMD/ROCM platform. It added a specialized Triton repository/branch as a compile-time dependency for Flash Attention math library on AMD/ROCM. This triton submodule is not used at runtime and will not be shipped to the final pytorch package. We have the plan to release this specialized Triton as a separate project.
Know limitations:
- Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`.
- Only supports power of two sequence lengths.
- No support for varlen APIs.
- Only support head dimension 16,32,64,128.
- Performance is still being optimized.
Fixes#112997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115981
Approved by: https://github.com/malfet
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), The first runtime component we would like to upstream is `Device` which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 4 PRs. This is one of the 4 PRs and covers the changes under `c10`.
# Design
Intel GPU device is a wrapper of sycl device on which kernels can be executed. In our design, we will maintain a sycl device pool containing all the GPU devices of the current machine, and manage the status of the device pool by PyTorch. The thread local safe is considered in this design. The corresponding C++ files related to `Device` will be placed in c10/xpu folder. And we provide the c10 device runtime APIs, like
- `c10::xpu::device_count`
- `c10::xpu::set_device`
- ...
# Additional Context
In our plan, 4 PRs should be submitted to PyTorch for `Device`:
1. for c10
2. for aten
3. for python frontend
4. for lazy initialization shared with CUDA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116019
Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
> **__Note:__** XNNPACK Upgrade is too large in the range of **40k** files and **10m** Lines of code, Thus we break the update of the library into multiple parts. All Parts [1 - n] Must be landed together for it to work. ***This also means If there is a revert. Please revert the Entire Stack.***
This change is everything remaining requiring XNNPACK version to work.
@allow-large-files
Differential Revision: [D52099769](https://our.internmc.facebook.com/intern/diff/D52099769/)
---
submodule
(unblock merge to make ShipIt happy)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115714
Approved by: https://github.com/digantdesai
This pull requests add initial Flash Attention support for AMD/ROCM platform. It added a specialized Triton repository/branch as a compile-time dependency for Flash Attention math library on AMD/ROCM. This triton submodule is not used at runtime and will not be shipped to the final pytorch package. We have the plan to release this specialized Triton as a separate project.
Know limitations:
- [ ] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`.
- [ ] Only supports power of two sequence lengths.
- [ ] No support for varlen APIs.
- [ ] Only support head dimension 16,32,64,128.
- [ ] Performance is still being optimized.
Fixes https://github.com/pytorch/pytorch/issues/112997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114309
Approved by: https://github.com/jeffdaily, https://github.com/malfet
---------
Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com>
Related to #103973#110532#108404#94891
**Context:**
As commented in 6ae0554d11/cmake/Dependencies.cmake (L1198)
Kernel asserts are enabled by default for CUDA and disabled for ROCm.
However it is somewhat broken, and Kernel assert was still enabled for ROCm.
Disabling kernel assert is also needed for users who do not have PCIe atomics support. These community users have verified that disabling the kernel assert in PyTorch/ROCm platform fixed their pytorch workflow, like torch.sum script, stable-diffusion. (see the related issues)
**Changes:**
This pull request serves the following purposes:
* Refactor and clean up the logic, make it simpler for ROCm to enable and disable Kernel Asserts
* Fix the bug that Kernel Asserts for ROCm was not disabled by default.
Specifically,
- Renamed `TORCH_DISABLE_GPU_ASSERTS` to `C10_USE_ROCM_KERNEL_ASSERT` for the following reasons:
(1) This variable only applies to ROCm.
(2) The new name is more align with #define CUDA_KERNEL_ASSERT function.
(3) With USE_ in front of the name, we can easily control it with environment variable to turn on and off this feature during build (e.g. `USE_ROCM_KERNEL_ASSERT=1 python setup.py develop` will enable kernel assert for ROCm build).
- Get rid of the `ROCM_FORCE_ENABLE_GPU_ASSERTS' to simplify the logic and make it easier to understand and maintain
- Added `#cmakedefine` to carry over the CMake variable to C++
**Tests:**
(1) build with default mode and verify that USE_ROCM_KERNEL_ASSERT is OFF(0), and kernel assert is disabled:
```
python setup.py develop
```
Verify CMakeCache.txt has correct value.
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=0
```
Tested the following code in ROCm build and CUDA build, and expected the return code differently.
```
subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
```
This piece of code is adapted from below unit test to get around the limitation that this unit test now was skipped for ROCm. (We will check to enable this unit test in the future)
```
python test/test_cuda_expandable_segments.py -k test_fixed_cuda_assert_async
```
Ran the following script, expecting r ==0 since the CUDA_KERNEL_ASSERT is defined as nothing:
```
>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>> r
0
```
(2) Enable the kernel assert by building with USE_ROCM_KERNEL_ASSERT=1, or USE_ROCM_KERNEL_ASSERT=ON
```
USE_ROCM_KERNEL_ASSERT=1 python setup.py develop
```
Verify `USE_ROCM_KERNEL_ASSERT` is `1`
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=1
```
Run the assert test, and expected return code not equal to 0.
```
>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>>/xxxx/pytorch/aten/src/ATen/native/hip/TensorCompare.hip:108: _assert_async_cuda_kernel: Device-side assertion `input[0] != 0' failed.
:0:rocdevice.cpp :2690: 2435301199202 us: [pid:206019 tid:0x7f6cf0a77700] Callback: Queue 0x7f64e8400000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
>>> r
-6
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114660
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/jithunnair-amd
> **__Note:__** XNNPACK Upgrade is too large in the range of **40k** files and **10m** Lines of code, Thus we break the update of the library into multiple parts. All Parts [1 - 6/n] Must be landed together for it to work. ***This also means If there is a revert. Please revert the Entire Stack.***
This change is everything remaining requiring XNNPACK version to work.
Differential Revision: [D52044420](https://our.internmc.facebook.com/intern/diff/D52044420/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115587
Approved by: https://github.com/digantdesai
As this is the oldest gcc that is fully compatible with C++17 standard.
- Replace number of conditional version with simpler `if(CMAKE_COMPILER_IS_GNUCXX)` or `append_cxx_flag_if_supported`.
- As `-Wsuggest-override` condition was hidden before incorrect guard, add missing `override` keywords to `torch::autograd::PyFunctionTensorPostAccGradHooks::apply_with_saved` , `caffe2::python::TensorFeeder::Feed` and `cafee2::NetObserverReporterPrint::report```
Fixes https://github.com/pytorch/pytorch/issues/101839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112858
Approved by: https://github.com/Skylion007, https://github.com/albanD
As this is the oldest gcc that is fully compatible with C++17 standard.
- Replace number of conditional version with simpler `if(CMAKE_COMPILER_IS_GNUCXX)` or `append_cxx_flag_if_supported`.
- As `-Wsuggest-override` condition was hidden before incorrect guard, add missing `override` keywords to `torch::autograd::PyFunctionTensorPostAccGradHooks::apply_with_saved` , `caffe2::python::TensorFeeder::Feed` and `cafee2::NetObserverReporterPrint::report```
Fixes https://github.com/pytorch/pytorch/issues/101839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112858
Approved by: https://github.com/Skylion007, https://github.com/albanD
Previously a crash in PyTorch on power systems was fixed with #110708.
Even with the fix, the torch_test.py test throws the following error
for one of the tests.
"Error in cpuinfo: processor architecture is not supported in cpuinfo"
This is a follow up patch to fix this error.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112707
Approved by: https://github.com/albanD
- rename `__HIP_PLATFORM_HCC__` to `__HIP_PLATFORM_AMD__`
- rename `HIP_HCC_FLAGS` to `HIP_CLANG_FLAGS`
- rename `PYTORCH_HIP_HCC_LIBRARIES` to `PYTORCH_HIP_LIBRARIES`
- workaround in tools/amd_build/build_amd.py until submodules are updated
These symbols have had a long deprecation cycle and will finally be removed in ROCm 6.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111975
Approved by: https://github.com/ezyang, https://github.com/hongxiayang
Summary:
This stack of PR's integrates cuSPARSELt into PyTorch.
This PR adds support for cuSPARSELt into the build process.
It adds in a new flag, USE_CUSPARSELT that defaults to false.
When USE_CUSPASRELT=1 is specified, the user can also specify
CUSPASRELT_ROOT, which defines the path to the library.
Compiling pytorch with cusparselt support can be done as follows:
``
USE_CUSPARSELT=1
CUSPARSELT_ROOT=/path/to/cusparselt
python setup.py develop
```
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103700
Approved by: https://github.com/albanD
- BatchLinearAlgebraLib.cpp is now split into one additional file
- BatchLinearAlgebraLib.cpp uses only cusolver APIs
- BatchLinearAlgebraLibBlas.cpp uses only cublas APIs
- hipify operates at the file level and cannot mix cusolver and cublas APIs within the same file
- cmake changes to link against hipblas instead of rocblas
- hipify mappings changes to map cublas -> hipblas instead of rocblas
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105881
Approved by: https://github.com/albanD
Removes the outdated HIP flags appended to HIP_CXX_FLAGS
The will help remove the following warnings in the pytorch build log
```
[6238/6889] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/cudnn/hip/Conv_v8.cpp.o
cc1plus: warning: command line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
cc1plus: warning: unrecognized command line option ‘-Wno-unused-command-line-argument’
cc1plus: warning: unrecognized command line option ‘-Wno-exceptions’
cc1plus: warning: unrecognized command line option ‘-Wno-inconsistent-missing-override’
cc1plus: warning: unrecognized command line option ‘-Wno-macro-redefined’
```
This also updates the gloo submodule commit to include the similar change made to gloo.
597accfd79
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102271
Approved by: https://github.com/malfet
Enables the hipSolver backend for ROCm builds
--------------------------------------------------------------------------
- Minimum ROCm version requirement - 5.3
- Introduces new macro USE_LINALG_SOLVER the controls enablement of both cuSOLVER and hipSOLVER
- Adds hipSOLVER API to hipification process
- combines hipSOLVER and hipSPARSE mappings into single SPECIAL map that takes priority among normal mappings
- Torch api to be moved to hipsolver backend (as opposed to magma) include: torch.svd(), torch.geqrf(), torch.orgqr(), torch.ormqr()
- Will enable 100+ linalg unit tests for ROCm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97370
Approved by: https://github.com/malfet
We have noticed that on BERT_pytorch in torchbenchmark majority of time is spent in running GEMM in aten:addmm. At the moment this calls into BLAS routine, but on AArch64 it will be faster if it calls into mkldnn_matmul. Performance wise compared to build with OpenBLAS it runs faster 1.2x faster on 16 cores with batch size of 8 on Graviton3, while if fast math mode (mkldnn_matmul exposes through oneDNN and Arm Compute Library option to run GEMM with FP32 inputs using BBF16 operations) is enabled then it is 2.3x
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91763
Approved by: https://github.com/jgong5, https://github.com/ngimel, https://github.com/malfet
This is reland of PR #94402 that tries to solve the additional link issues.
The PR #94402 failed because caffe2::mkl had been converted to private dependency while libtorch_cuda_linalg hadn't linked to it explicitly. This is fixed in commit 4373bf0ae3dee32afc178f9d51a4154d6c5904c6
We also replace more references of MKL_LIBRARIES by caffe2::mkl in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94924
Approved by: https://github.com/malfet
If python development library is missing when building pytorch from source, cmake will raise the error like:
```
CMake Error at cmake/Dependencies.cmake:1079 (if):
if given arguments:
"VERSION_LESS" "3"
Unknown arguments specified
```
it's quite a misleading information that user would consider it's a syntax error or cmake version problem.
This PR add a check to ensure `PYTHONLIBS_VERSION_STRING` exist before using.
Related #87993
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96642
Approved by: https://github.com/kit1980
This PR do two things:
1. It moves some Windows warning suppression from various CMake files into the main CMakeList.txt, following the conventions of gcc and clang.
2. It fixes some Windows warnings in the source code. Most importantly, it fixes lots of dll warnings by adjusting C10_API to TORCH_API or TORCH_PYTHON_API. There are still some dll warnings because some TORCH_API functions are actually built as part of libtorch_python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94927
Approved by: https://github.com/malfet
With the release of ROCm 5.3 hip now supports a hipGraph implementation.
All necessary backend work and hipification is done to support the same functionality as cudaGraph.
Unit tests are modified to support a new TEST_GRAPH feature which allows us to create a single check for graph support instead of attempted to gather the CUDA level in annotations for every graph test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88202
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet
We greatly simplify the handing of OpenMP in CMake by using caffe2::openmp target thoroughly. We follow the old behavior by defaulting to MKL OMP library and detecting OMP flags otherwise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91576
Approved by: https://github.com/malfet
The main changes are:
1. Remove outdated checks for old compiler versions because they can't support C++17.
2. Remove outdated CMake checks because it now requires 3.18.
3. Remove outdated CUDA checks because we are moving to CUDA 11.
Almost all changes are in CMake files for easy audition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90599
Approved by: https://github.com/soumith
Continuation of https://github.com/pytorch/pytorch/pull/88207
A compile time guard was preventing ActivityType::CUDA from being available on rocm. This caused both the GPU_FALLBACK and CUDA modes to be active at the same time. So operators were being charged gpu time for the hipEventRecord ranges and the actual kernel execution times. This caused incorrect (and often negative) cuda times, in e.g. table().
Previously a cmake variable was not being propagated to a '-D', causing an issue on Windows, which uses cuda but not cupti.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89785
Approved by: https://github.com/jeffdaily, https://github.com/malfet
We're no longer building Caffe2 mobile as part of our CI, and it adds a lot of clutter to our make files. Any lingering internal dependencies will use the buck build and so wont be effected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84338
Approved by: https://github.com/dreiss
And use it throughout the CMakeLists and rectify `IF(APPLE)`/`IF(GNU_CXX_VERSION VERSION_GREATER A.B)` and so on
Also, add `target_compile_options_if_supported` and use it in `Dependencies.cmake` as well as in test's `CMakeListst.txt`
Delete `-Wno-unknown-warning-option` to test that conditions indeed working as expected
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82883
Approved by: https://github.com/seemethere