pytorch/cmake/External
Xinya Zhang 67742128b7 [ROCm] Bump AOTriton to 0.9.2b (#148433)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b:

* Optimized the following non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions no longer need padding to a power of two.
* `is_causal=True` cases are now supported by a persistent dynamic algorithm, which requires an atomic tensor but balances load across different CTAs.
* `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs
* The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so`
  + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten.
* The newly added fused backward kernel will be used for smaller workloads, since it has lower kernel invocation overhead.
* Support for gfx1201 (RX 9070XT). It must be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`.
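To put the head-dimension change in numbers, here is a minimal sketch of how much padding those sizes would carry, assuming the padded size is the next power of two (the message only says "padding to power-of-two"). `next_pow2` and `padding_overhead` are hypothetical helpers, not AOTriton or PyTorch APIs:

```python
# Hypothetical helpers (not part of AOTriton or PyTorch): estimate the extra
# work a head dimension would carry if it were padded up to a power of two.

def next_pow2(n: int) -> int:
    """Smallest power of two >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()

def padding_overhead(head_dim: int) -> float:
    """Fraction of extra elements per row introduced by padding."""
    padded = next_pow2(head_dim)
    return (padded - head_dim) / head_dim

# Head dimensions listed as newly optimized in AOTriton 0.9b.
HEAD_DIMS = [48, 80, 96, 160, 192, 224]

for d in HEAD_DIMS:
    print(f"head_dim={d:3d} -> padded={next_pow2(d):3d}, "
          f"overhead={padding_overhead(d):.0%}")
```

Under this assumption, 80 and 160 would be padded to 128 and 256 respectively, i.e. 60% more elements per row than the real data.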
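For intuition about why a persistent dynamic algorithm for `is_causal=True` needs an atomic tensor, here is a toy scheduling simulation of the general technique. It is not AOTriton's kernel. It assumes that a causal tile's cost grows linearly with its query block index and that the shared counter hands out the costliest tiles first. All function names are hypothetical:

```python
import heapq
import math

def causal_tile_cost(row_block: int) -> int:
    """Hypothetical cost model: with causal masking, query block i attends
    to key blocks 0..i, so its work grows linearly with i."""
    return row_block + 1

def static_makespan(num_tiles: int, num_ctas: int) -> int:
    """Static partition: each CTA owns one contiguous chunk of tiles."""
    chunk = math.ceil(num_tiles / num_ctas)
    loads = [0] * num_ctas
    for tile in range(num_tiles):
        loads[tile // chunk] += causal_tile_cost(tile)
    return max(loads)

def dynamic_makespan(num_tiles: int, num_ctas: int) -> int:
    """Persistent dynamic schedule: whenever a CTA goes idle it claims the
    next tile from a shared counter (an atomic fetch-and-add on a GPU)."""
    # Min-heap of (time at which the CTA becomes idle, CTA id).
    idle_at = [(0, cta) for cta in range(num_ctas)]
    heapq.heapify(idle_at)
    counter = 0  # stands in for the atomic tensor
    # Assumption of this sketch: tiles are handed out costliest first.
    order = list(range(num_tiles - 1, -1, -1))
    while counter < num_tiles:
        t, cta = heapq.heappop(idle_at)  # earliest-idle CTA does the fetch-add
        tile = order[counter]
        counter += 1
        heapq.heappush(idle_at, (t + causal_tile_cost(tile), cta))
    return max(t for t, _ in idle_at)

print(static_makespan(8, 4), dynamic_makespan(8, 4))  # 15 9
```

With 8 tiles and 4 CTAs, the contiguous static split leaves the last CTA with 15 units of work. The dynamic schedule finishes in 9, which is the ideal for 36 total units spread over 4 CTAs.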

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433
Approved by: https://github.com/jeffdaily
2025-03-07 22:10:07 +00:00
aotriton.cmake [ROCm] Bump AOTriton to 0.9.2b (#148433) 2025-03-07 22:10:07 +00:00
EigenBLAS.cmake
nccl.cmake Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073) 2025-02-19 03:52:26 +00:00
nnpack.cmake Remove legacy Caffe2 pthreadpool from CMake (#134936) 2024-10-17 05:22:08 +00:00
rccl.cmake [ROCm] LoadHIP CMake cleanup (#137112) 2024-10-13 00:06:41 +00:00
ucc.cmake