pytorch/test/cpp/c10d
Nariaki Tateiwa 23a3cef5d9 [c10d] Add _allgather_base , reduce_scatter , and _reduce_scatter_base into ProcessGroupMPI to enable FSDP with MPI backend (#150162)
This PR implements _allgather_base, reduce_scatter, and _reduce_scatter_base in the MPI backend (ProcessGroupMPI), enabling support for Fully Sharded Data Parallel (FSDP) in environments that use MPI for distributed communication.

### Context

As noted in https://github.com/pytorch/pytorch/issues/85628, FSDP currently supports only the NCCL backend. Due to this limitation, FSDP cannot run on legacy HPC environments or clusters that rely on MPI.

By implementing just these three collective operations, we can enable FSDP to work with the MPI backend. These collectives are implemented in a similar manner to existing operations such as allgather.
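
The semantics of the three collectives can be sketched without MPI or PyTorch. The following is a minimal pure-Python simulation, not the ProcessGroupMPI implementation: each "rank" is an entry in a list, the function names `allgather_base`, `reduce_scatter_base`, and `reduce_scatter` are illustrative stand-ins for the c10d operations, and the reduction is assumed to be a sum. FSDP relies on this pair of patterns: all-gather to rebuild full parameters from shards, and reduce-scatter to leave each rank with its shard of the reduced gradients.

```python
from typing import List


def allgather_base(inputs: List[List[int]]) -> List[List[int]]:
    """Flat all-gather: every rank receives all buffers concatenated in rank order."""
    gathered = [x for rank_buf in inputs for x in rank_buf]
    # Each rank gets its own copy of the concatenated buffer.
    return [list(gathered) for _ in inputs]


def reduce_scatter_base(inputs: List[List[int]]) -> List[List[int]]:
    """Flat reduce-scatter: sum buffers elementwise, then rank r keeps chunk r."""
    world_size = len(inputs)
    length = len(inputs[0])
    if any(len(buf) != length for buf in inputs):
        raise ValueError("all ranks must contribute buffers of equal length")
    if length % world_size != 0:
        raise ValueError("buffer length must be divisible by the world size")
    chunk = length // world_size
    # Elementwise sum across ranks (assumed reduction op: SUM).
    reduced = [sum(vals) for vals in zip(*inputs)]
    return [reduced[r * chunk:(r + 1) * chunk] for r in range(world_size)]


def reduce_scatter(input_lists: List[List[List[int]]]) -> List[List[int]]:
    """List-based reduce-scatter: rank i supplies one buffer per destination rank;
    rank r receives the elementwise sum of input_lists[i][r] over all i."""
    world_size = len(input_lists)
    return [
        [sum(vals) for vals in zip(*(input_lists[i][r] for i in range(world_size)))]
        for r in range(world_size)
    ]


if __name__ == "__main__":
    # Two simulated ranks.
    print(allgather_base([[1, 2], [3, 4]]))
    print(reduce_scatter_base([[1, 2, 3, 4], [10, 20, 30, 40]]))
    print(reduce_scatter([[[1], [2]], [[10], [20]]]))
```

The `_base` variants operate on one flat buffer per rank, while the list-based `reduce_scatter` takes one buffer per destination rank. The simulation keeps that distinction but ignores asynchronous work handles and tensor dtypes.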

### Testing

We validated this PR by running pytorch/build/bin/ProcessGroupMPITest under OpenMPI, and all tests passed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150162
Approved by: https://github.com/H-Huang
2025-04-14 19:31:38 +00:00
| Name | Last commit | Date |
| --- | --- | --- |
| example | | |
| BackoffTest.cpp | [Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404) | 2024-10-10 18:05:34 +00:00 |
| CMakeLists.txt | [CMake] Remove pthread linking (#134436) | 2024-10-29 23:14:40 +00:00 |
| CUDATest.cu | | |
| CUDATest.hpp | | |
| FileStoreTest.cpp | C10_UNUSED to [[maybe_unused]] (#6357) (#138364) | 2024-10-19 13:17:43 +00:00 |
| HashStoreTest.cpp | C10_UNUSED to [[maybe_unused]] (#6357) (#138364) | 2024-10-19 13:17:43 +00:00 |
| ProcessGroupGlooAsyncTest.cpp | C10_UNUSED to [[maybe_unused]] (#6357) (#138364) | 2024-10-19 13:17:43 +00:00 |
| ProcessGroupGlooTest.cpp | C10_UNUSED to [[maybe_unused]] (#6357) (#138364) | 2024-10-19 13:17:43 +00:00 |
| ProcessGroupMPITest.cpp | [c10d] Add _allgather_base , reduce_scatter , and _reduce_scatter_base into ProcessGroupMPI to enable FSDP with MPI backend (#150162) | 2025-04-14 19:31:38 +00:00 |
| ProcessGroupNCCLErrorsTest.cpp | [Reland] Launch kernel on current stream & remove record_stream entirely (#150398) | 2025-04-01 16:46:07 +00:00 |
| ProcessGroupNCCLTest.cpp | [c10d] Fix CudaEventCache for dangling references (#144496) | 2025-01-15 05:11:48 +00:00 |
| ProcessGroupUCCTest.cpp | [Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404) | 2024-10-10 18:05:34 +00:00 |
| StoreTestCommon.hpp | | |
| TCPStoreTest.cpp | [c10d][tcp_store] Fix connection reset caused by wrong socket close (#150987) | 2025-04-10 18:48:57 +00:00 |
| TestUtils.hpp | | |