Commit Graph

300 Commits

Author SHA1 Message Date
PyTorch MergeBot
2699f5410b Revert "[xpu][feature] Integrate OneDNN SDPA training forward/backward into XPU OVERRIDEABLE Backend (#162454)"
This reverts commit fd68d409ad.

Reverted https://github.com/pytorch/pytorch/pull/162454 on behalf of https://github.com/atalman due to internal build failure ([comment](https://github.com/pytorch/pytorch/pull/162454#issuecomment-3475009089))
2025-10-31 21:58:52 +00:00
Jeff Daily
c3b71d5499 [ROCm][CI] remove relaxed tolerance for tf32 tests (#166478)
Instead of relaxing tolerances for certain unit tests that exercise TF32 on MI300, skip the tests until hipblaslt accuracy is improved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166478
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
2025-10-31 16:15:42 +00:00
fengqing.lu
fd68d409ad [xpu][feature] Integrate OneDNN SDPA training forward/backward into XPU OVERRIDEABLE Backend (#162454)
This is the second PR split from https://github.com/pytorch/pytorch/pull/156272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162454
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/drisspg
2025-10-31 11:20:38 +00:00
Yuanyuan Chen
0d50e5d8d4 [3/N] Fix unused loop variables (#166509)
This PR removes unused loop variables in tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166509
Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007
2025-10-30 20:13:51 +00:00
linhaifeng
695cb0d342 [2/N][Fix] Fix typo in test folder (#166374)
Fix typo in test folder.

_typos.toml
```toml
[default.extend-words]
nd = "nd"
arange = "arange"
Nd = "Nd"
GLOBALs = "GLOBALs"
hte = "hte"
iy = "iy"
PN = "PN"
Dout = "Dout"
optin = "optin"
gam = "gam"
PTD = "PTD"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166374
Approved by: https://github.com/cyyever, https://github.com/ezyang
2025-10-29 03:02:07 +00:00
Angel Li
08ae55021e support batch size=0 for flash attention (#166318)
Fixes #165944

**Summary**

Today, if we attempt to run flash attention with batch_size 0, we get the error `Runtime Error: batch size must be positive`. This PR fixes this by returning early with empty tensors in both the forward and backward passes.
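A minimal sketch of the new behavior, assuming illustrative shapes and a CUDA device; the expected empty output shape follows the usual SDPA layout:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Hedged repro-style sketch: with this fix, flash attention accepts batch_size=0
# and returns empty outputs instead of raising "batch size must be positive".
q = torch.empty(0, 8, 128, 64, device="cuda", dtype=torch.float16)  # (B=0, H, L, E)
k = torch.empty_like(q)
v = torch.empty_like(q)
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # expected: torch.Size([0, 8, 128, 64])
```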

**Test plan**
`python test/test_transformers.py -k test_scaled_dot_product_attention` - added case for batch_size=0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166318
Approved by: https://github.com/drisspg
2025-10-28 21:53:48 +00:00
Deng, Daisy
81fa4a204c Enable Intel GPU on 4 unit test cases (#165405)
For https://github.com/pytorch/pytorch/issues/114850, we will port some ATen unit tests to Intel GPU. We enable Intel GPU with the following methods while trying to keep the original code style:

1. Replaced onlyCUDA with onlyOn(['cuda', 'xpu']) for supported tests
2. Added allow_xpu=True for supported test classes in test parameterization.
3. Used torch.accelerator to extend CUDA-specific tests to XPU where needed (see the sketch below).
4. Enabled 'xpu' for some test paths
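A hedged sketch of the device-generic pattern in point 3; the shapes and the softmax check are illustrative, and only `torch.accelerator.current_accelerator()` is assumed available:

```python
import torch

# torch.accelerator resolves to CUDA or XPU (or another accelerator) at runtime,
# so the same test body can run on either backend; falls back to CPU if none.
acc = torch.accelerator.current_accelerator()
device = acc if acc is not None else torch.device("cpu")

x = torch.randn(4, 4, device=device)
y = torch.nn.functional.softmax(x, dim=-1)
# Each softmax row should sum to 1 regardless of the backend.
assert torch.allclose(y.sum(dim=-1), torch.ones(4, device=device), atol=1e-5)
```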

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165405
Approved by: https://github.com/guangyey, https://github.com/ezyang
2025-10-27 06:06:07 +00:00
Nikita Shulga
5211f4c108 [MPS] Fix SDPA fp16 overflow (#165961)
Do not cast the intermediate result back to the lower-precision dtype until
softmax is finished; otherwise it can produce NaN.

Adjust the test to use 256 as the filler value rather than 64.
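A regression-style sketch of the scenario, assuming illustrative shapes on an `mps` device; the filler value 256 is large enough that fp16 intermediates overflow if downcast before softmax completes:

```python
import torch
import torch.nn.functional as F

# Large fp16 filler values: q.k dot products exceed the fp16 range, so keeping
# intermediates in higher precision through softmax is what avoids NaN here.
q = torch.full((1, 1, 32, 64), 256.0, dtype=torch.half, device="mps")
k = torch.full((1, 1, 32, 64), 256.0, dtype=torch.half, device="mps")
v = torch.full((1, 1, 32, 64), 256.0, dtype=torch.half, device="mps")
out = F.scaled_dot_product_attention(q, k, v)
assert not out.isnan().any()  # with the fix, the output stays finite
```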

Fixes https://github.com/pytorch/pytorch/issues/160841
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165961
Approved by: https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #165960
2025-10-22 01:29:42 +00:00
Nikita Shulga
0bd12c1168 [CI] Extend test_transfomers to MPS (#165960)
Just skip grad_checks as they need float64
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165960
Approved by: https://github.com/Skylion007
2025-10-21 19:27:44 +00:00
Eddie Yan
cd62a73dcb [cuDNN][SDPA] Handle noncontig nested tensors in cuDNN SDPA (#164958)
Previously we hardcoded the assumption in cuDNN that the inputs would be dense, which breaks when, e.g., the user chunks tensors, yielding non-contiguous inputs.

A new test checks this when `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1` is set in `test/test_transformers.py`.

One issue I noticed is that the old gating of nested tensors in `sdp_utils.cpp` seems to be a no-op: all of the inputs are reported as "dense" by the time that function is called in the nested tensor tests in `test/test_nestedtensor.py -k sdpa`.
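A hedged illustration of how chunking yields non-contiguous SDPA inputs; shapes and device are illustrative:

```python
import torch
import torch.nn.functional as F

# Chunking a packed QKV tensor returns views, so each of q, k, v is non-dense
# (non-contiguous) -- the case the cuDNN path now handles instead of assuming
# packed strides.
qkv = torch.randn(2, 8, 128, 3 * 64, device="cuda", dtype=torch.float16)
q, k, v = qkv.chunk(3, dim=-1)
print(q.is_contiguous())  # False
out = F.scaled_dot_product_attention(q, k, v)
```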

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164958
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-10-09 21:58:54 +00:00
Eddie Yan
ed3085814a [cuDNN][SDPA] Disable dropout for cuDNN SDPA on 9.11 - 9.13 (#163903)
cuDNN introduced some broken heuristics for these versions, so we need to disable dropout to avoid unexpected crashes caused by the heuristics refusing to proceed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163903
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/atalman
2025-09-26 23:50:09 +00:00
Xinya Zhang
3cbfbbd691 [ROCm] Transformer/SDPA unit test parity (#163745)
## Major Changes

* Efficient Attention on ROCm requires the last dimension of input tensors to be aligned to 16 bytes.
  - Unlike FA, ME does not pad input tensors in `scaled_dot_product_attention`, hence this requirement.
* Fix `atomic_counter` handling in the varlen FA API
* Unskip a few unit tests.

Fixes #157120
Fixes #157121
Fixes #157122
Fixes #157167
Fixes #155217
Fixes #157043
Fixes #157060

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163745
Approved by: https://github.com/jeffdaily
2025-09-25 17:14:19 +00:00
Xinya Zhang
e769026bcb [ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)
A few UT failures are caused by `HIPBLASLT_ALLOW_TF32`

Fixes #157094
Fixes #157093
Fixes #157092
Fixes #157091
Fixes #157064
Fixes #157063
Fixes #157062
Fixes #157061
Fixes #157042
Fixes #157041
Fixes #157039
Fixes #157004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162998
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-18 13:53:48 +00:00
Eddie Yan
9b7a8c4d05 [cuDNN][SDPA][submodule] Roll-back cuDNN frontend upgrade, update Meta registration (#163104)
For https://github.com/pytorch/torchtitan/issues/1713

Also note that we will need to roll back the cuDNN frontend upgrade in 2.9, as it currently introduces a segmentation fault by assuming tensors have their strides and sizes populated at graph creation time: 1a7b4b78db/include/cudnn_frontend/node/sdpa_support_surface.h (L447)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163104
Approved by: https://github.com/drisspg
2025-09-17 15:48:54 +00:00
PyTorch MergeBot
66308fb470 Revert "[ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)"
This reverts commit cef815dc2c.

Reverted https://github.com/pytorch/pytorch/pull/162998 on behalf of https://github.com/huydhn due to Sorry for reverting this, but it seems to break a test in trunk ([comment](https://github.com/pytorch/pytorch/pull/162998#issuecomment-3300280242))
2025-09-16 20:39:41 +00:00
Xinya Zhang
cef815dc2c [ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)
A few UT failures are caused by `HIPBLASLT_ALLOW_TF32`

Fixes #157094, #157093, #157092, #157091, #157064, #157063, #157062, #157061, #157042, #157041, #157039, #157004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162998
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-16 12:48:45 +00:00
Nikita Shulga
015423bef8 Add fp16-overflow regression test (#162401)
Discovered while debugging https://github.com/pytorch/pytorch/issues/160841, where SDPA returned NaNs because intermediate values were cast back to fp16 before normalization during the computation (fixed by https://github.com/pytorch/pytorch/pull/161999).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162401
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-09-08 20:33:23 +00:00
eqy
6f7608d603 [cuDNN][SDPA] Enable cuDNN SDPA by default for SM 9.0, SM 10.0 (#162073)
for 2.9
🙏

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162073
Approved by: https://github.com/drisspg
2025-09-04 18:46:28 +00:00
Xinya Zhang
98efc9e93d [ROCm] Bump AOTriton to 0.11b (#161754)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.11b:

* Invoke AITER Assembly kernels on gfx942/gfx950 when inputs meet requirements
  - AITER ASM kernels deliver over 500TFLOPS training performance. See
    [AOTriton 0.11b Release Page](https://github.com/ROCm/aotriton/releases/tag/0.11b) for more
    details.
* Now returns a natural-log-based `logsumexp` tensor, matching CUDA's behavior
  - PR #156903 is reverted in this PR as well since it is not needed anymore.
* Enables `CausalVariant.LOWER_RIGHT`

The build system changes drastically along with the new packaging scheme of
AOTriton 0.11:

* AOTriton 0.11 packs GPU images separately from AOTriton runtime
* `aotriton.cmake` now selectively downloads image packs according to
  `PYTORCH_ROCM_ARCH`
* `aotriton.cmake` now only uses a pre-compiled runtime library that exactly
  matches the ROCm version in the build environment. For PyTorch builds with ROCm
  versions not listed in the file, the build process will build the AOTriton
  runtime from source without GPU images.
  - This avoids any further ABI breaks like ROCm 6.4 -> 7.0.
  - Recursive git clone is disabled since building the AOTriton runtime does not
    require submodules.

Bug fixes:

* Fix a kernel bug introduced when implementing SWA

Known Problems:

* gfx1100 target (Radeon RX 7000 Series) is moved back to experimental status
  due to accuracy issues. Triton compiler fixes are needed to restore the
  support status.
* Enabling TF32 tests affects accuracy for later non-TF32 tests on ROCM 7.0.
  This issue is under investigation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161754
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-09-03 20:45:44 +00:00
fengqing.lu
50fc22dedf [Intel GPU] Fix XPU SDPA default priority_order UT fail (#161690)
Fixes #161483

When the whole `test/test_transformers.py` file is run, the case `test_default_priority_order` passes because other XPU cases call SDPA, so the priority order is set by eec876deb6/aten/src/ATen/native/mkldnn/xpu/Attention.cpp (L98-L112)

However, when the case `test_default_priority_order` is run separately, the priority order is left unset and the case fails. This PR fixes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161690
Approved by: https://github.com/guangyey, https://github.com/drisspg
2025-09-03 04:43:27 +00:00
Dmitry Nikolaev
b76f6d117a [ROCm] fix numpy version detection and adjust fudge_factors for MI355 (#161429)
This PR fixes:

- NumPy >= 2.1 version detection (instead of Python 3.13 version detection) to skip some tests (NumPy 2.1 can be installed on older Python versions):
```
test_quantization.py::TestDynamicQuantizedOps::test_qlinear
test_quantization.py::TestDynamicQuantizedOps::test_qlinear_legacy
test_quantization.py::TestQuantizedLinear::test_qlinear
test_quantization.py::TestQuantizedLinear::test_qlinear_leaky_relu
test_quantization.py::TestQuantizedLinear::test_qlinear_relu
test_quantization.py::TestQuantizedLinear::test_qlinear_tanh
test_quantization.py::TestQuantizedLinear::test_qlinear_with_input_q_dq_qweight_dq_output_fp32
```
- A couple of SDPA tests on MI355 by adjusting fudge_factors:

```
test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32
test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161429
Approved by: https://github.com/jeffdaily
2025-08-28 19:32:09 +00:00
PyTorch MergeBot
9f6e1b8730 Revert "[ROCm] SDPA fix mem fault when dropout is enabled (#154864)"
This reverts commit 3caddd4daa.

Reverted https://github.com/pytorch/pytorch/pull/154864 on behalf of https://github.com/atalman due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/154864#issuecomment-3225554119))
2025-08-26 20:03:59 +00:00
Eddie Yan
2835cc5e91 [cuDNN] head dim > 128 works on H100 again in cuDNN SDPA? (#161210)
reference: https://github.com/pytorch/torchtitan/pull/1610

9.10 only for now, we would want to hold off on upgrading to either cuDNN frontend 1.14+/cuDNN 9.11+ due to some head-dim > 128 handling issues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161210
Approved by: https://github.com/Skylion007
2025-08-22 21:21:53 +00:00
Andy Lugo
3caddd4daa [ROCm] SDPA fix mem fault when dropout is enabled (#154864)
Fixes an issue that exhibited a device-side memory access fault due to incorrect tensor lifetime management.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154864
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-21 14:23:13 +00:00
fengqing.lu
db763b1717 [Intel GPU] Support SDPA backend selection and priority setting on XPU (#159464)
Currently, SDPA on XPU uses its own `priority_order` instead of the one from the global context. Hence it does not support `with sdpa_kernel(order, set_priority=True)`.

This PR enables this feature (see the usage sketch below). To make the default `priority_order` from the global context work for XPU, the MATH backend is also moved to the lowest priority; otherwise `cudnn attention` and `overrideable attention` would never be selected.
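A hedged usage sketch of the feature, assuming an available `xpu` device and illustrative shapes; the `sdpa_kernel(order, set_priority=True)` form is the one quoted above:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Illustrative XPU inputs ("xpu" device assumed available).
q = torch.randn(1, 8, 128, 64, device="xpu", dtype=torch.float16)
k = torch.randn(1, 8, 128, 64, device="xpu", dtype=torch.float16)
v = torch.randn(1, 8, 128, 64, device="xpu", dtype=torch.float16)

# With set_priority=True the backend list expresses a priority order rather
# than a strict allow-list.
order = [SDPBackend.OVERRIDEABLE, SDPBackend.MATH]
with sdpa_kernel(order, set_priority=True):
    out = F.scaled_dot_product_attention(q, k, v)
```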

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159464
Approved by: https://github.com/guangyey, https://github.com/drisspg

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: mayuyuace <qiming1.zhang@intel.com>
2025-08-14 08:55:31 +00:00
Eddie Yan
1128f4c2a8 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
Clean up tuple/tensor boilerplate in cuDNN SDPA, in preparation for nested/ragged tensor backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-08-08 22:22:48 +00:00
CaoE
ef38edb284 Add stride check for attn_mask on non-cpu device (#158424)
Fixes #158374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158424
Approved by: https://github.com/Valentine233, https://github.com/drisspg, https://github.com/atalman
2025-07-18 01:10:58 +00:00
PyTorch MergeBot
bfe5674e22 Revert "[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)"
This reverts commit 0797b2b6a8.

Reverted https://github.com/pytorch/pytorch/pull/149282 on behalf of https://github.com/wdvr due to reverting as discussed with @drisspg - @eqy please reach out to @drisspg for more info  ([comment](https://github.com/pytorch/pytorch/pull/149282#issuecomment-3084759671))
2025-07-17 16:55:55 +00:00
Valentine233
1f57e0e04d [CPU] Support GQA for flash attention (#157893)
As many models require GQA, we support it in flash attention for the CPU path (see the sketch below).
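A hedged sketch of a GQA-shaped call on the CPU path, using the public `enable_gqa` flag of `scaled_dot_product_attention`; shapes are illustrative:

```python
import torch
import torch.nn.functional as F

# GQA: 8 query heads share 2 key/value heads (H_q must be divisible by H_kv).
q = torch.randn(1, 8, 128, 64)   # (B, H_q, L, E)
k = torch.randn(1, 2, 128, 64)   # (B, H_kv, L, E)
v = torch.randn(1, 2, 128, 64)
out = F.scaled_dot_product_attention(q, k, v, enable_gqa=True)
print(out.shape)  # torch.Size([1, 8, 128, 64])
```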

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157893
Approved by: https://github.com/mingfeima, https://github.com/jansel
2025-07-13 09:49:02 +00:00
Eddie Yan
0797b2b6a8 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
Clean up tuple/tensor boilerplate in cuDNN SDPA, in preparation for nested/ragged tensor backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-07-11 16:07:54 +00:00
Meng, Chunhuan
ba0d0de5e6 Enable set SDPA backend by torch.nn.attention.sdpa_kernel on XPU (#156669)
Introduces support for a new `OVERRIDEABLE` backend in the SDPA module, improves backend selection logic, and adds corresponding tests. In addition, a fallback mechanism is added for when a specific backend is unavailable, enhancing user configurability.

### Backend Support and Selection Enhancements:
* Added `at::SDPBackend::overrideable` to the list of available SDPA backends in the `Context` class (`aten/src/ATen/Context.h`).
* Updated the backend selection logic in `select_sdp_backend_xpu` to include the `OVERRIDEABLE` backend and added a fallback mechanism for unsupported `FLASH_ATTENTION` on XPU.
* Adjusted error messaging in `_fused_sdp_choice_xpu` to reflect the inclusion of the `OVERRIDEABLE` backend. (`aten/src/ATen/native/mkldnn/xpu/Attention.cpp`)

### Test Additions for Backend Fallback and Selection:
* Added new unit tests to validate fallback behavior from `FLASH_ATTENTION` to `OVERRIDEABLE` and to verify correct backend selection when `MATH` is enabled. (`test/test_transformers.py`)

### Codebase Updates for Backend Integration:
* Introduced `OVERRIDEABLE` as a new member of the `_SDPBackend` enum. (`torch/_C/__init__.pyi.in`)
* Extended `_backend_names` and updated related methods to handle the `OVERRIDEABLE` backend. (`torch/nn/attention/__init__.py`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156669
Approved by: https://github.com/guangyey, https://github.com/drisspg
2025-07-10 06:52:22 +00:00
Xuehai Pan
fc0376e8b1 [BE][2/6] fix typos in test/ (test/test_*.py) (#157636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157636
Approved by: https://github.com/yewentao256, https://github.com/mlazos
ghstack dependencies: #156311, #156609
2025-07-09 11:02:23 +00:00
Valentine233
f56bfb3030 [CPU] Fix memory access for sbgemm bf16 (#156585)
Fixes #156022.

1. The original dtype conversion overwrites the whole `n_ * ldc_` region instead of an `n_ * m_` region with stride `ldc_`, causing a potential memory issue.
2. Fix the None-value issue in the attention backward UT, as sbgemm bf16 may be used there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156585
Approved by: https://github.com/mingfeima, https://github.com/aditew01, https://github.com/ezyang
2025-07-08 02:36:28 +00:00
Abhishek Nandy
b221be9140 Fix typo: 'intial_query_grad' → 'initial_query_grad' in test_transformers.py (#157306)
This is a minor typo fix in `test/test_transformers.py`:

- Renamed `intial_query_grad` to `initial_query_grad` for improved clarity and correctness in test variable naming.

There are **no functional or logic changes** — this PR is aimed purely at improving readability and maintaining code quality.

Thanks to the PyTorch team for their work and review time
Please feel free to suggest if this needs any adjustment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157306
Approved by: https://github.com/Skylion007
2025-07-03 14:08:12 +00:00
Xinya Zhang
d9577df312 [ROCm] Bump AOTriton to 0.10b (#156499)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:

* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
  + Without this optimization the binary size of `libaotriton.so` could be
    over 100MiB due to 2x more supported architectures compared with 0.9b.
    Now it is only about 11MiB.
* Support sliding window attention (SWA) in
  `_flash_attention_forward/backward`. Should fix #154582

See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.

Notable changes to SDPA backend:

* `std::optional<int64_t>` `window_size_left/right` are directly passed to
  ROCM's SDPA backend, because the default value `-1` is meaningful to
  AOTriton's backend and bottom-right aligned causal mask is implemented with
  negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156499
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
2025-06-25 07:09:03 +00:00
fengqing.lu
04178d347c [Reland] [Intel GPU] Make SDPA output has the same stride as Query. (#154340)
Fixes [#153903](https://github.com/pytorch/pytorch/issues/153903).

Currently, the output tensor of SDPA on XPU is always allocated with contiguous strides, while CPU/CUDA flash_attention and cudnn_attention allocate the output tensor with the same strides as the query.

This PR aligns XPU's behavior with CUDA/CPU to make XPU compatible with CPU/CUDA modeling code (see the sketch below).

The function `alloc_with_matching_layout` is copied from the cuDNN implementation at 8c16d0e404/aten/src/ATen/native/cudnn/MHA.cpp (L874)
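A hedged sketch of the layout behavior being aligned, with an illustrative non-contiguous (transposed) query on CUDA; per this PR, XPU now follows the same convention:

```python
import torch
import torch.nn.functional as F

# A transposed query is non-contiguous; CPU/CUDA flash and cuDNN attention
# allocate the output following the query's layout, and XPU now matches this.
q = torch.randn(2, 128, 8, 64, device="cuda", dtype=torch.float16).transpose(1, 2)
k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
out = F.scaled_dot_product_attention(q, k, v)
print(q.stride(), out.stride())  # expected to share the query's (non-contiguous) layout
```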

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154340
Approved by: https://github.com/guangyey, https://github.com/drisspg
2025-06-24 06:09:59 +00:00
Isalia20
9e132b770e [CUDA] Skip test on low vram machines (#156548)
I noticed some jobs error out after merging #155397 because the test requires >15GB of GPU memory to execute and some of the machines it runs on have 8GB GPUs. This PR adds the skip option on those machines.

CC: @eqy @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156548
Approved by: https://github.com/eqy, https://github.com/malfet
2025-06-21 22:32:57 +00:00
Isalia20
1cfdcb975a [CUDA] fix illegal memory access in attention (#155397)
Fixes https://github.com/pytorch/pytorch/issues/150054

CI seemed to be messed up in the old one, old PR:
https://github.com/pytorch/pytorch/pull/155145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155397
Approved by: https://github.com/ngimel
2025-06-21 12:32:00 +00:00
PyTorch MergeBot
1036f6d114 Revert "[ROCm] Bump AOTriton to 0.10b (#156290)"
This reverts commit 34d8e64ef6.

Reverted https://github.com/pytorch/pytorch/pull/156290 on behalf of https://github.com/atalman due to failing multiple internal tests ([comment](https://github.com/pytorch/pytorch/pull/156290#issuecomment-2992072727))
2025-06-20 15:35:25 +00:00
Xinya Zhang
34d8e64ef6 [ROCm] Bump AOTriton to 0.10b (#156290)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:

* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
  + Without this optimization the binary size of `libaotriton.so` could be
    over 100MiB due to 2x more supported architectures compared with 0.9b.
    Now it is only about 11MiB.
* Support sliding window attention (SWA) in
  `_flash_attention_forward/backward`. Should fix #154582

See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.

Notable changes to SDPA backend:

* `std::optional<int64_t>` `window_size_left/right` are directly passed to
  ROCM's SDPA backend, because the default value `-1` is meaningful to
  AOTriton's backend and bottom-right aligned causal mask is implemented with
  negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156290
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-06-19 21:13:58 +00:00
LuFengqing
092aed1b18 [Intel GPU] Enable GQA and different head_dim of value for SDPA (#150992)
In OneDNN v3.7, SDPA doesn't support num_head_q != num_head_kv (i.e., GQA) or head_dim_qk != head_dim_v.
In OneDNN v3.8, SDPA supports these two scenarios; this PR enables them. SDPA UTs pass in local testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150992
Approved by: https://github.com/guangyey, https://github.com/drisspg, https://github.com/EikanWang

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-17 11:09:51 +00:00
fengqing.lu@intel.com
65b9c13cce [Intel GPU] Enable safe softmax for XPU SDPA (#151999)
Fix https://github.com/intel/torch-xpu-ops/issues/1432#event-16899653975

When one row of the Q*K attention scores is fully masked with `-inf`, `softmax(score)` outputs `NaN` for the whole row, which corrupts the model output.

With this new flag, it outputs `0` for the whole row, which is aligned with PyTorch's CPU/CUDA behavior (see the illustration below).
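A minimal illustration of the failure mode and the safe-softmax result; the masking logic here is a plain-PyTorch stand-in, not the XPU implementation:

```python
import torch

# A fully masked row of attention scores is all -inf; a plain softmax over it
# yields NaN for the entire row.
scores = torch.full((1, 4), float("-inf"))
print(torch.softmax(scores, dim=-1))   # tensor([[nan, nan, nan, nan]])

# "Safe softmax" instead returns zeros for such rows, matching CPU/CUDA behavior.
safe = torch.where(
    scores.isinf().all(dim=-1, keepdim=True),
    torch.zeros_like(scores),
    torch.softmax(scores, dim=-1),
)
print(safe)                            # tensor([[0., 0., 0., 0.]])
```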

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151999
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/drisspg

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-13 08:53:47 +00:00
PyTorch MergeBot
d3c8f36ba0 Revert "[Intel GPU] Make SDPA output has the same stride as Query. (#154340)"
This reverts commit 0f10df71a6.

Reverted https://github.com/pytorch/pytorch/pull/154340 on behalf of https://github.com/etaf due to This PR breaks hugging face E2E run on XPU. ([comment](https://github.com/pytorch/pytorch/pull/154340#issuecomment-2942954192))
2025-06-05 06:46:24 +00:00
fengqing.lu@intel.com
0f10df71a6 [Intel GPU] Make SDPA output has the same stride as Query. (#154340)
Fixes [#153903](https://github.com/pytorch/pytorch/issues/153903).

Currently, the output tensor of SDPA on XPU is always allocated with contiguous strides, while CPU/CUDA flash_attention and cudnn_attention allocate the output tensor with the same strides as the query.

This PR aligns XPU's behavior with CUDA/CPU to make XPU compatible with CPU/CUDA modeling code.

The function `alloc_with_matching_layout` is copied from the cuDNN implementation at 8c16d0e404/aten/src/ATen/native/cudnn/MHA.cpp (L874)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154340
Approved by: https://github.com/Skylion007, https://github.com/EikanWang, https://github.com/guangyey
2025-06-04 07:16:56 +00:00
fengqing.lu@intel.com
7b074346e0 [Intel GPU] Support f32 intermediate dtype, headdim size <=576 and f32 causal mask for SDPA (#152091)
In OneDNN v3.7, SDPA has the following defects:

1. The dtype of intermediate values is the same as QKV, while PyTorch uses FP32 for intermediate values to ensure better accuracy.
2. Only head-dim sizes <= 256 are supported.
3. Implicit causal mask is not supported when QKV is FP32; an attention mask must be built explicitly with aten ops (see the sketch below).

In OneDNN v3.8, these defects are addressed. Since these are tiny changes, they are combined into a single PR.
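A hedged sketch of the explicit causal-mask workaround mentioned in item 3, built with aten ops; shapes are illustrative:

```python
import torch
import torch.nn.functional as F

# Build a lower-triangular (causal) boolean mask, then convert it to an additive
# -inf mask and pass it explicitly instead of relying on is_causal=True.
L = 128
causal = torch.tril(torch.ones(L, L, dtype=torch.bool))
attn_mask = torch.zeros(L, L).masked_fill(~causal, float("-inf"))

q = torch.randn(1, 8, L, 64)
k = torch.randn(1, 8, L, 64)
v = torch.randn(1, 8, L, 64)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```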

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152091
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/drisspg
2025-06-04 05:18:36 +00:00
Isalia20
d6edefefbf [CUDA] Fixes for backwards in memefficient attn for large tensors (#154663)
followup to #154029.

@ngimel Backward had the same problem as well, so this PR fixes it and adds support for logsumexp computation in the forward pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154663
Approved by: https://github.com/ngimel
2025-05-30 19:30:07 +00:00
Isalia20
e313152a33 SDPA fix memory efficient attention for large batch dim (#154029)
Fixes #146704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154029
Approved by: https://github.com/ngimel
2025-05-28 16:53:53 +00:00
PyTorch MergeBot
f363a3f51a Revert "[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)"
This reverts commit 9386701b51.

Reverted https://github.com/pytorch/pytorch/pull/149282 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, see [D74729259](https://www.internalfb.com/diff/D74729259). @drisspg may you help out the author have their PR merged? ([comment](https://github.com/pytorch/pytorch/pull/149282#issuecomment-2881546951))
2025-05-14 20:53:49 +00:00
fengqing.lu
de92296bbb [Intel GPU] undo broadcast on zero stride tensor for SDPA (#151976)
Fix https://github.com/pytorch/pytorch/issues/152290.

The model **hubert** uses aten::expand to build the attention mask by broadcasting. PyTorch uses strides[d]=0 to represent broadcasting, which is not supported by oneDNN. This PR handles this scenario (see the illustration below).
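A minimal illustration of the zero-stride pattern produced by `aten::expand`; shapes are illustrative:

```python
import torch

# expand() broadcasts without copying, so the broadcast dimensions carry stride 0,
# which oneDNN does not accept for SDPA attention masks.
mask = torch.zeros(1, 1, 1, 128)
expanded = mask.expand(2, 8, 128, 128)
print(expanded.stride())   # (0, 0, 0, 1): broadcast dims have stride 0
print(expanded.shape)      # torch.Size([2, 8, 128, 128])
```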

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151976
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/drisspg
2025-05-14 16:09:03 +00:00
eqy
9386701b51 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
Clean up tuple/tensor boilerplate in cuDNN SDPA, in preparation for nested/ragged tensor backward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg
2025-05-14 01:39:24 +00:00