pytorch/torch/csrc/distributed/c10d/cuda
Chien-Chin Huang 5b90e85112 [AsyncTP] Fixes AsyncMM (#162040)
The original implementation set beta to 1, which caused the existing contents of the output buffer (C) to be added to the result. If the output was not zero-initialized beforehand, the result could therefore be incorrect.

Removing the alpha and beta fixes the issue.
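
For illustration only, the effect of beta can be sketched with the conventional GEMM epilogue D = alpha * (A @ B) + beta * C. This is a minimal pure-Python sketch under that assumption, not the CUTLASS code in AsyncMM.cu; the function name `gemm` and the `stale_out` buffer are hypothetical and stand in for the kernel and an output tensor that was not zeroed.

```python
def gemm(a, b, c, alpha=1.0, beta=0.0):
    """Return alpha * (a @ b) + beta * c for matrices stored as nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [
            alpha * sum(a[i][k] * b[k][j] for k in range(inner)) + beta * c[i][j]
            for j in range(cols)
        ]
        for i in range(rows)
    ]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

# Hypothetical stand-in for an output buffer that was never zero-initialized.
stale_out = [[100.0, 100.0], [100.0, 100.0]]

# beta = 0: the old contents of the output buffer are ignored.
clean = gemm(a, b, stale_out, beta=0.0)

# beta = 1: the old contents are accumulated into the product.
polluted = gemm(a, b, stale_out, beta=1.0)

print(clean)     # [[19.0, 22.0], [43.0, 50.0]]
print(polluted)  # [[119.0, 122.0], [143.0, 150.0]]
```

With beta = 1 the result is only correct when the output buffer already holds zeros, which matches the failure described above.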

Thanks to @ngimel for figuring out the root cause.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162040
Approved by: https://github.com/danielvegamyhre
2025-09-08 10:53:59 +00:00
..
cutlass/gemm/kernel
AsyncMM.cu [AsyncTP] Fixes AsyncMM (#162040) 2025-09-08 10:53:59 +00:00
AsyncMM.cuh
CUDAEventCache.cpp
CUDAEventCache.hpp
StreamBlock.cpp
StreamBlock.cu [c10d] block_current_stream: correctness fixes (#158757) 2025-07-21 22:23:44 +00:00
StreamBlock.cuh [c10d] block_current_stream: correctness fixes (#158757) 2025-07-21 22:23:44 +00:00
StreamBlock.hpp
utils.cpp
utils.hpp