Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b:
* Optimized non-power-of-two head dimensions 48, 80, 96, 160, 192, 224. Inputs with these head dimensions no longer need to be padded to a power of two.
* `is_causal=True` cases are now supported with a persistent dynamic algorithm, which requires an atomic tensor but load-balances across CTAs.
* `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs
* The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so`
+ However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten.
* The newly added fused backward kernel will be used for smaller workloads, due to its lower kernel invocation overhead.
* Support for gfx1201 (RX 9070 XT). This needs to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` (see the usage sketch below).
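A rough sketch of how these features surface through the public SDPA API; the shapes, the dropout setting, and the placement of the gfx1201 opt-in are illustrative assumptions, not taken from the PR:
```python
import os

# Assumed opt-in: only needed on gfx1201, and must be set before the first ROCm call.
os.environ.setdefault("TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL", "1")

import torch
import torch.nn.functional as F

# Head dimension 96 is one of the newly optimized non-power-of-two sizes.
q, k, v = (torch.randn(2, 8, 1000, 96, device="cuda", dtype=torch.float16)
           for _ in range(3))

# is_causal=True and dropout_p > 0.0 both exercise code paths updated in 0.9b.
out = F.scaled_dot_product_attention(q, k, v, dropout_p=0.1, is_causal=True)
```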
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433
Approved by: https://github.com/jeffdaily
We received reports that AOTriton kernels mishandle the bias pointer, causing NaNs while fine-tuning the llama3.2-11b vision model. This PR fixes the problem.
Note: AOTriton 0.8.1b adds head dimension 512 support, which increases the binary size, but this support is considered experimental and will not be enabled right now.
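For context, the affected code path is the one taken when an additive attention bias is passed to SDPA. A minimal sketch of that path; shapes and dtype are illustrative and not taken from the original report:
```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(1, 16, 512, 64, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

# A floating-point attn_mask acts as an additive bias and goes through the
# bias pointer that this fix corrects.
bias = torch.randn(1, 16, 512, 512, device="cuda", dtype=torch.bfloat16)

out = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)
assert not torch.isnan(out).any()
```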
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145508
Approved by: https://github.com/jeffdaily
We no longer need `install_aotriton.sh` and `aotriton_version.txt`, since `aotriton.cmake` now installs the best binary release package as the default option when building PyTorch.
This should resolve the issue of needing a pre-installed aotriton package when building PyTorch for ROCm from source, which is not feasible when building PyTorch *outside* a CI docker image. With this change, a user can keep a pre-installed AOTriton in their environment, if desired, and have the build pick it up by specifying the `AOTRITON_INSTALLED_PREFIX` env var, or have the build automatically detect and install a compatible version. As a third option, the user can force AOTriton to be built from source instead, using the `AOTRITON_INSTALL_FROM_SOURCE` env var.
Also, with the changes in this PR, the cmake build process handles copying the aotriton .so and images directory from `torch/lib` to the installation path.
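A hedged sketch of how the three options above might be selected when driving the build from a script; the install path is made up, and `python setup.py develop` stands in for whichever build entry point is actually used:
```python
import os
import subprocess

env = os.environ.copy()

# Option 1: point the build at a pre-installed AOTriton (path is hypothetical).
env["AOTRITON_INSTALLED_PREFIX"] = "/opt/aotriton"

# Option 2: force AOTriton to be built from source instead.
# env["AOTRITON_INSTALL_FROM_SOURCE"] = "1"

# Option 3 (default): set neither variable and let aotriton.cmake download a
# compatible binary release automatically.
subprocess.run(["python", "setup.py", "develop"], env=env, check=True)
```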
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137443
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
PyTorch community members have reported issues with building PyTorch from source for ROCm in an environment that doesn't have aotriton pre-installed, because aotriton is only installed in the [CI](a8ed873ba2/.ci/docker/manywheel/Dockerfile (L197)) docker images. Building aotriton from source can take ~45 minutes.
This PR fixes the issue by downloading the aotriton tarball in such scenarios, *unless the user explicitly wants to build aotriton from source using the AOTRITON_INSTALL_FROM_SOURCE=1 env var*
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136603
Approved by: https://github.com/atalman
Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
This PR enables using AOTriton as a shared library dependency instead of a static one.
Resolves linker errors when trying to build PyTorch for many (more than ~7) gfx archs, caused by the huge size of the AOTriton static library.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129094
Approved by: https://github.com/malfet
This patch implements `with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):` by reusing AOTriton's accelerated SDPA implementation (see the usage sketch after the list of limitations).
Known limitations:
- Only supports MI200/MI300X GPUs
- Does not support varlen
- Does not support `CausalVariant`
- Optional arguments `causal_diagonal` and `seqlen_k` in `_efficient_attention_forward/backward` must be null
- Does not work well with inductor's SDPA rewriter. The rewriter has been updated to only use math and flash attention on ROCM.
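A minimal usage sketch of the context manager this patch enables on ROCm; the shapes are arbitrary and chosen to stay within the limitations listed above:
```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

q, k, v = (torch.randn(4, 8, 256, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

# Restrict dispatch to the (now AOTriton-backed) efficient attention kernel.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```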
This PR also uses a different approach: installing the AOTriton binary instead of building it from source in the base docker image. More details on the motivation: https://github.com/pytorch/pytorch/pull/124885#issuecomment-2153229129
`PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" python test/test_transformers.py` yields "55028 passed, 20784 skipped" results with this change. [Previous result](https://hud.pytorch.org/pr/127528) of `test_transformers.py` was 0 error, 0 failure, 55229 skipped out of 75517 tests in total (the XML report does not contain total number of passed tests).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124885
Approved by: https://github.com/malfet
Mitigates #126111
AOTriton, as a math library, takes a long time to build. However, the library itself does not move as fast as PyTorch, and it is not cost-efficient to build it for every CI check.
This PR moves the build of AOTriton from PyTorch into its base docker image, avoiding the duplicated and long build time.
Pre-this-PR:
* PyTorch base docker build job duration: 1.1-1.3 hr
* PyTorch build job duration: 1.4-1.5 hr (includes AOTriton build time of 1 hr 6 min on a linux.2xlarge node)
Post-this-PR:
* PyTorch base docker build job duration: 1.3 hr (includes AOTriton build time of 20 min on a linux.12xlarge node)
* PyTorch build job duration: <20 min
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127012
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/huydhn
Downloading CUDA packages sometimes fails and breaks the build process, and AOTriton's own Triton fork does not need these packages. This commit comments out the related download scripts.
The actual changes from Triton can be found at: 9b73a543a5
Fixes the following build error:
```
[2/6] cd /var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python && /opt/conda/envs/py_3.8/bin/cmake -E env VIRTUAL_ENV=/var/lib/jenkins/workspace/build/aotriton/build/venv PATH="/var/lib/jenkins/workspace/build/aotriton/build/venv/bin:/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.8/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" TRITON_BUILD_DIR=/var/lib/jenkins/workspace/build/aotriton/build/triton_build python setup.py develop
FAILED: CMakeFiles/aotriton_venv_triton /var/lib/jenkins/.local/lib/python3.8/site-packages/triton/_C/libtriton.so /var/lib/jenkins/workspace/build/aotriton/build/CMakeFiles/aotriton_venv_triton
cd /var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python && /opt/conda/envs/py_3.8/bin/cmake -E env VIRTUAL_ENV=/var/lib/jenkins/workspace/build/aotriton/build/venv PATH="/var/lib/jenkins/workspace/build/aotriton/build/venv/bin:/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.8/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" TRITON_BUILD_DIR=/var/lib/jenkins/workspace/build/aotriton/build/triton_build python setup.py develop
downloading and extracting https://conda.anaconda.org/nvidia/label/cuda-12.1.1/linux-64/cuda-nvcc-12.1.105-0.tar.bz2 ...
downloading and extracting https://conda.anaconda.org/nvidia/label/cuda-12.1.1/linux-64/cuda-cuobjdump-12.1.111-0.tar.bz2 ...
Traceback (most recent call last):
File "/var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python/setup.py", line 325, in <module>
download_and_copy(
File "/var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python/setup.py", line 151, in download_and_copy
ftpstream = urllib.request.urlopen(url)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/urllib/request.py", line 215, in urlopen
return opener.open(url, data, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/urllib/request.py", line 521, in open
response = meth(req, response)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/urllib/request.py", line 630, in http_response
response = self.parent.error(
^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/urllib/request.py", line 559, in error
return self._call_chain(*args)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/urllib/request.py", line 492, in _call_chain
result = func(*args)
^^^^^^^^^^^
File "/opt/conda/lib/python3.12/urllib/request.py", line 639, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 524:
ninja: build stopped: subcommand failed.
```
Example of failed build log: https://github.com/pytorch/pytorch/actions/runs/8483953034/job/23245996425
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122982
Approved by: https://github.com/jansel
This PR fixes two major issues that were discovered after the initial merge of PR #121561:
1. The Flash Attention support added there has severe performance regressions on regular shapes (power-of-two head dimensions and sequence lengths) compared with PR #115981; its performance is worse than the math backend and it only has numerical-stability advantages. This PR fixes that problem.
2. PR #121561 has a flaw in its memory-storage handling: it does not copy the gradients back to the designated output tensor. This PR removes the deprecated `TensorStorageSanitizer` class, which is unnecessary given the more flexible backward kernel shipped by PR #121561.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122857
Approved by: https://github.com/jeffdaily, https://github.com/drisspg
This patch addresses the major limitations of our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton):
- [x] Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
  * MI300X is now supported. More architectures will be added once Triton supports them.
- [x] Only supports power-of-two sequence lengths.
  * Now it supports arbitrary sequence lengths.
- [ ] No support for varlen APIs.
  * The varlen API will be supported in a future release of AOTriton.
- [x] Only supports head dimensions 16, 32, 64, 128.
  * Now it supports arbitrary head dimensions <= 256.
- [x] Performance is still being optimized.
  * Kernels are now selected according to autotune information from Triton.
Other improvements from AOTriton include:
* More flexible Tensor storage layouts
* A more flexible API
This is a more extensive fix to #112997
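To make the lifted restrictions concrete, here is a hedged sketch (shapes chosen for illustration, not taken from the PR) that runs a non-power-of-two problem size through both the flash and math backends and compares the results:
```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Non-power-of-two sequence length (1000) and head dimension (80) <= 256,
# both of which required padding or were unsupported before this patch.
q, k, v = (torch.randn(2, 4, 1000, 80, device="cuda", dtype=torch.float16)
           for _ in range(3))

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out_flash = F.scaled_dot_product_attention(q, k, v)
with sdpa_kernel(SDPBackend.MATH):
    out_math = F.scaled_dot_product_attention(q, k, v)

# Loose tolerances: fp16 accumulation order differs between the two backends.
torch.testing.assert_close(out_flash, out_math, atol=2e-2, rtol=2e-2)
```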
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561
Approved by: https://github.com/huydhn
This patch addresses the major limitations of our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton):
- [x] Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
  * MI300X is now supported. More architectures will be added once Triton supports them.
- [x] Only supports power-of-two sequence lengths.
  * Now it supports arbitrary sequence lengths.
- [ ] No support for varlen APIs.
  * The varlen API will be supported in the next release of AOTriton.
- [x] Only supports head dimensions 16, 32, 64, 128.
  * Now it supports arbitrary head dimensions <= 256.
- [x] Performance is still being optimized.
  * Kernels are now selected according to autotune information from Triton.
Other improvements from AOTriton include:
* More flexible Tensor storage layouts
* A more flexible API
This is a more extensive fix to #112997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561
Approved by: https://github.com/malfet, https://github.com/atalman
Note about the updates:
This PR:
1. Skips more flash-attention-related unit tests on MI200
2. Fixes additional ATen compilation errors after hipification
3. Fixes the author "root" of a specific commit
4. Includes the patch from Nikita in favor of block-level static initialization.
CAVEAT: This revised PR has a commit that modifies the CI to force it to run on MI200 nodes. That specific commit must be reverted before merging.
Original PR (https://github.com/pytorch/pytorch/pull/114309) Note:
This pull request adds initial Flash Attention support for the AMD/ROCm platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCm. This Triton submodule is not used at runtime and will not be shipped with the final PyTorch package. We plan to release this specialized Triton as a separate project.
Known limitations:
- Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
- Only supports power-of-two sequence lengths.
- No support for varlen APIs.
- Only supports head dimensions 16, 32, 64, 128.
- Performance is still being optimized.
Fixes #112997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115981
Approved by: https://github.com/malfet
This pull request adds initial Flash Attention support for the AMD/ROCm platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCm. This Triton submodule is not used at runtime and will not be shipped with the final PyTorch package. We plan to release this specialized Triton as a separate project.
Known limitations:
- [ ] Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
- [ ] Only supports power-of-two sequence lengths.
- [ ] No support for varlen APIs.
- [ ] Only supports head dimensions 16, 32, 64, 128.
- [ ] Performance is still being optimized.
Fixes https://github.com/pytorch/pytorch/issues/112997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114309
Approved by: https://github.com/jeffdaily, https://github.com/malfet
---------
Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com>
Should fix #13362 and fix #83790
I think I've discovered the root cause of the intermittent nccl link
failures. If we look at the variable name in the redefinition error:
```
_02021d91_11_sendrecv_cu_0bc7b9c8_11152
```
this is the name of the file being compiled + some form of unique ID.
As part of NCCL's build process, the same file is compiled multiple
times with different macro definitions depending on which operator and
dtype are being compiled, e.g.
```
nvcc -DNCCL_OP=0 -DNCCL_TYPE=0 -dc sendrecv.cu -o sendrecv_sum_i8.o
```
Since the filename parts are the same, then if the unique IDs also
happen to collide then the entire identifier will collide and the link
fails. So the fix here is to generate a unique `.cu` file for each
object file. I've implemented this as a `.patch` file that gets
applied from our cmake code, but if we instead fork nccl that would be
cleaner.
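To illustrate the shape of the fix (a schematic sketch only, not the actual `.patch` that cmake applies): instead of compiling the same `sendrecv.cu` repeatedly with different `-D` flags, one thin wrapper `.cu` is generated per object file, so the identifier nvcc derives from the file name is unique. The variant list below is made up.
```python
import pathlib

# Made-up subset of (op, dtype, NCCL_OP, NCCL_TYPE) variants for illustration.
variants = [("sum", "i8", 0, 0), ("sum", "f32", 0, 7)]

out_dir = pathlib.Path("generated")
out_dir.mkdir(exist_ok=True)

for op, dtype, op_id, type_id in variants:
    # One uniquely named wrapper per object file; each just sets the macros and
    # includes the shared source, so the compiled code is unchanged.
    wrapper = out_dir / f"sendrecv_{op}_{dtype}.cu"
    wrapper.write_text(
        f"#define NCCL_OP {op_id}\n"
        f"#define NCCL_TYPE {type_id}\n"
        '#include "sendrecv.cu"\n'
    )

# Each wrapper is then compiled on its own, e.g.:
#   nvcc -dc generated/sendrecv_sum_i8.cu -o sendrecv_sum_i8.o
```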
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84245
Approved by: https://github.com/janeyx99, https://github.com/malfet
Since #83173 was merged I have noticed some CI jobs being slowed down by
the nccl build step: if there are no C++ changes, sccache compiles
everything else very quickly and nccl becomes the limiting factor.
This re-enables parallel builds with some safeguards to protect
against oversubscription. When `make` is the parent build system, we
can use `$(MAKE)` and the `make` jobserver will coordinate job
allocation with the sub-process. For other build systems, this calls
`make` with the `-l` flag which should prevent it launching jobs when
the system load average is already too high.
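Schematically, the non-make path amounts to something like the following; the directory and job count are placeholders and this is only a model of what the build does:
```python
import multiprocessing
import subprocess

nproc = multiprocessing.cpu_count()

# -j allows up to nproc parallel jobs, while -l tells make not to start new
# jobs once the system load average reaches nproc, guarding against
# oversubscription when other builds are already running.
subprocess.check_call(["make", f"-j{nproc}", f"-l{nproc}", "-C", "third_party/nccl"])
```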
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83696
Approved by: https://github.com/malfet
- Modifies the current cmake build definitions to use `find_package` to find UCX and UCC installed on the system
- Installs UCX and UCC in the CUDA docker images
- Builds PyTorch with `USE_UCC=1` in pipelines
- Currently, we are not running unit tests with the UCC PG; those tests will be added in future PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81583
Approved by: https://github.com/vtlam, https://github.com/malfet
By extending the regex to match any characters before the version number, not just one fixed vendor string.
On Ubuntu the version string looks as follows:
```
$ objcopy --version
GNU objcopy (GNU Binutils for Ubuntu) 2.30
```
And on some CentOS releases it looks like:
```
$ objcopy --version
GNU objcopy (GNU Binutils) 2.37
```
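The intent can be modeled with an equivalent Python regex (the real detection lives in cmake; this pattern is only an illustration of the idea):
```python
import re

# Accept any vendor text inside the parentheses and capture only the version.
pattern = re.compile(r"GNU objcopy \(GNU Binutils[^)]*\) (\d+\.\d+)")

for line in ("GNU objcopy (GNU Binutils for Ubuntu) 2.30",
             "GNU objcopy (GNU Binutils) 2.37"):
    print(pattern.search(line).group(1))  # -> 2.30, then 2.37
```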
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82774
Approved by: https://github.com/ngimel
Summary:
This diff integrates the UCC process group as a native component of PyTorch Distributed core. It is based on the existing torch-ucc (https://github.com/facebookresearch/torch_ucc) as the wrapper for the UCC collective communication library.
The environment and cmake variables are named to mirror the existing process groups such as NCCL and Gloo. Specifically:
- USE_UCC: enables the UCC PG. This defaults to OFF, so there is no breakage of existing builds that do not have the UCX/UCC external libraries.
- USE_SYSTEM_UCC: uses external UCX and UCC shared libraries whose locations are set with UCX_HOME and UCC_HOME.
Currently, this diff only supports USE_SYSTEM_UCC=ON, i.e., it requires users to supply external UCX and UCC libraries. In subsequent diffs, we will add the UCX and UCC repos as third-party dependencies in pytorch/third_party.
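Assuming the new process group is selected by the backend name `"ucc"` (mirroring `"nccl"` and `"gloo"`), end-user code would look roughly like the sketch below; the single-process rendezvous settings are placeholders.
```python
import os
import torch
import torch.distributed as dist

# Placeholder rendezvous for a single-process demonstration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="ucc", rank=0, world_size=1)

t = torch.ones(4)
dist.all_reduce(t)  # dispatched to the UCC process group
dist.destroy_process_group()
```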
Test Plan:
Passed Torch-UCC tests that invoke UCC process group. For example:
$ sh test/start_test.sh test/torch_allreduce_test.py --backend gloo --use-cuda
...
Test allreduce: succeeded
Differential Revision: D36973688
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79918
Approved by: https://github.com/kwen2501, https://github.com/kingchc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55814
I don't really know if the original issue is resolved, but let's just
check and see if this passes CI so that we can potentially get some
speed-up on our builds.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: walterddr
Differential Revision: D27715734
Pulled By: seemethere
fbshipit-source-id: a8f90774dfd25b0abf8e57283fe3591a8d8f3c4b
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857
These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
- `GLOSSARY.md`
- `aten/src/ATen/core/op_registration/README.md`
- `scripts/README.md`
- `torch/csrc/jit/codegen/fuser/README.md`
The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```
I looked over the auto-generated changes and didn't see anything that looked problematic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406
Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377
This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348
Reviewed By: walterddr, seemethere
Differential Revision: D26856620
Pulled By: samestep
fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
Summary:
Solves `the '-j' option requires a positive integer argument` error on some systems when MAX_JOBS is not defined
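The underlying idea of the fix can be sketched as follows; this is only a model of the fallback behaviour, not the actual build-script change:
```python
import multiprocessing
import os

# If MAX_JOBS is unset or empty, fall back to the CPU count so that make is
# never invoked with a bare "-j" and no argument.
max_jobs = os.environ.get("MAX_JOBS") or str(multiprocessing.cpu_count())
print(["make", f"-j{max_jobs}"])
```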
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44557
Reviewed By: vkuzo
Differential Revision: D23653511
Pulled By: malfet
fbshipit-source-id: 7d86fb7fb6c946c34afdc81bf2c3168a74d00a1f