pytorch/torch/csrc/distributed/c10d/cuda
Chien-Chin Huang 5b90e85112 [AsyncTP] Fixes AsyncMM (#162040)
The original implementation set beta to 1, which causes the out (C) to be added to the output. Thus, if the output is not initialized to zero beforehand, the result can be incorrect.

Removing alpha and beta fixes the issue.
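The bug follows from the standard GEMM epilogue, D = alpha * (A @ B) + beta * C: with beta = 1 the prior contents of the output buffer are accumulated into the result. A minimal NumPy sketch (not the CUTLASS kernel itself; `gemm` here is a hypothetical stand-in) illustrates why an uninitialized output buffer corrupts the result:

```python
import numpy as np

def gemm(a, b, c, alpha=1.0, beta=0.0):
    # Standard GEMM epilogue semantics: D = alpha * (A @ B) + beta * C.
    # beta = 1 accumulates whatever is already in the output buffer C;
    # beta = 0 ignores it.
    return alpha * (a @ b) + beta * c

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 3))
b = rng.standard_normal((3, 2))
stale = rng.standard_normal((4, 2))  # stands in for an uninitialized output buffer

buggy = gemm(a, b, stale, beta=1.0)  # stale data leaks into the result
fixed = gemm(a, b, stale, beta=0.0)  # output buffer contents are ignored

assert np.allclose(fixed, a @ b)
assert not np.allclose(buggy, a @ b)
```

Dropping the alpha/beta terms entirely, as the fix does, is equivalent to beta = 0: the kernel writes a pure product and no longer depends on the output buffer being zeroed first.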

Thanks @ngimel for figuring out the root cause.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162040
Approved by: https://github.com/danielvegamyhre
2025-09-08 10:53:59 +00:00
cutlass/gemm/kernel [BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321) 2025-06-23 02:57:50 +00:00
AsyncMM.cu [AsyncTP] Fixes AsyncMM (#162040) 2025-09-08 10:53:59 +00:00
AsyncMM.cuh
CUDAEventCache.cpp [cca] [c10d] Refactor CUDAEventCache into separate files (#158616) 2025-07-19 02:51:28 +00:00
CUDAEventCache.hpp [cca] [c10d] Refactor CUDAEventCache into separate files (#158616) 2025-07-19 02:51:28 +00:00
StreamBlock.cpp Work: block_current_stream API (#156883) 2025-07-08 23:55:46 +00:00
StreamBlock.cu [c10d] block_current_stream: correctness fixes (#158757) 2025-07-21 22:23:44 +00:00
StreamBlock.cuh [c10d] block_current_stream: correctness fixes (#158757) 2025-07-21 22:23:44 +00:00
StreamBlock.hpp Work: block_current_stream API (#156883) 2025-07-08 23:55:46 +00:00
utils.cpp [SymmMem] Find NVSHMEM from system installation (#157513) 2025-07-04 03:34:44 +00:00
utils.hpp