pytorch/test/cpp/c10d
fduwjj 5b4c864672 [c10d] Enable CudaEventCache by default and add multi device support (#140975)
We added `CudaEventCache` in https://github.com/pytorch/pytorch/pull/133727. It reuses CUDA events so that we avoid destroying them, which has caused hangs in the past. The feature has already been exercised by tests and by runs on TorchTitan and internal workloads, and no errors or crashes have been found so far, so we are rolling it out to all OSS users. Internal workloads are not affected by this PR because of internal gating.

We have also observed multi-device use cases in OSS, so this PR brings back the multi-device support originally proposed in https://github.com/pytorch/pytorch/pull/122732/files.
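
To illustrate the idea of reusing events instead of destroying them, with one cache per device, here is a minimal sketch. It is not the `CudaEventCache` implementation from the PR: `FakeEvent`, `EventPool`, `poolFor`, `acquireEvent`, and `kMaxDevices` are hypothetical names, and `FakeEvent` is a plain struct standing in for a real CUDA event so that the example runs without a GPU.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Stand-in for a CUDA event (assumption: a real cache would wrap an event handle).
struct FakeEvent {
  explicit FakeEvent(int d) : device(d) {}
  int device;
};

// Hypothetical pool: hands out events and takes them back instead of destroying them.
class EventPool {
 public:
  std::shared_ptr<FakeEvent> acquire(int device) {
    std::unique_ptr<FakeEvent> event;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (!free_.empty()) {
        // Reuse a previously released event.
        event = std::move(free_.back());
        free_.pop_back();
      } else {
        // Nothing to reuse, so create a new event.
        event = std::make_unique<FakeEvent>(device);
        ++created_;
      }
    }
    // The custom deleter returns the event to the pool rather than freeing it.
    return std::shared_ptr<FakeEvent>(event.release(), [this](FakeEvent* e) {
      assert(e != nullptr);
      std::lock_guard<std::mutex> lock(mutex_);
      free_.emplace_back(e);
    });
  }

  // Number of events this pool has ever created.
  std::size_t created() {
    std::lock_guard<std::mutex> lock(mutex_);
    return created_;
  }

 private:
  std::mutex mutex_;
  std::vector<std::unique_ptr<FakeEvent>> free_;
  std::size_t created_ = 0;
};

constexpr std::size_t kMaxDevices = 8;  // hypothetical upper bound on devices

// One pool per device index; the pools live for the rest of the program.
inline EventPool& poolFor(int device) {
  static std::array<EventPool, kMaxDevices> pools;
  return pools.at(static_cast<std::size_t>(device));
}

inline std::shared_ptr<FakeEvent> acquireEvent(int device) {
  return poolFor(device).acquire(device);
}
```

In this sketch, releasing the last `shared_ptr` to an event puts it back on its device's free list, so a later `acquireEvent` call for the same device gets the same object back and no destroy call is made on the hot path. Events are only freed when the pools themselves are torn down at program exit.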

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140975
Approved by: https://github.com/eqy, https://github.com/kwen2501
2024-11-26 18:42:45 +00:00
..
example Refactor distribuetd to use absolute header path (#85780) 2022-09-30 05:13:50 +00:00
BackoffTest.cpp [Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404) 2024-10-10 18:05:34 +00:00
CMakeLists.txt [CMake] Remove pthread linking (#134436) 2024-10-29 23:14:40 +00:00
CUDATest.cu [ROCm] Update clock intrinsic handling for AMD gfx11 family (#97005) 2023-03-24 18:29:49 +00:00
CUDATest.hpp
FileStoreTest.cpp C10_UNUSED to [[maybe_unused]] (#6357) (#138364) 2024-10-19 13:17:43 +00:00
HashStoreTest.cpp C10_UNUSED to [[maybe_unused]] (#6357) (#138364) 2024-10-19 13:17:43 +00:00
ProcessGroupGlooAsyncTest.cpp C10_UNUSED to [[maybe_unused]] (#6357) (#138364) 2024-10-19 13:17:43 +00:00
ProcessGroupGlooTest.cpp C10_UNUSED to [[maybe_unused]] (#6357) (#138364) 2024-10-19 13:17:43 +00:00
ProcessGroupMPITest.cpp [Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404) 2024-10-10 18:05:34 +00:00
ProcessGroupNCCLErrorsTest.cpp [PGNCCL] Slimming watchdog loop (#139834) 2024-11-07 17:22:44 +00:00
ProcessGroupNCCLTest.cpp [c10d] Enable CudaEventCache by default and add multi device support (#140975) 2024-11-26 18:42:45 +00:00
ProcessGroupUCCTest.cpp [Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404) 2024-10-10 18:05:34 +00:00
StoreTestCommon.hpp Refactor distribuetd to use absolute header path (#85780) 2022-09-30 05:13:50 +00:00
TCPStoreTest.cpp C10_UNUSED to [[maybe_unused]] (#6357) (#138364) 2024-10-19 13:17:43 +00:00
TestUtils.hpp