pytorch/test/cpp/c10d
Hongyi Jia 146a7f68e2 Enable desync root cause analysis for NCCL (#68310)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68310

Enable desync root cause analysis by recording the last footprint of collective calls. On timeout, we parse the store trace to determine the root cause of the desync. This feature builds on async error handling.
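
As a rough illustration of the footprint idea described above, the sketch below has each rank record the sequence number of the last collective it started, and a timeout handler then scans the trace for ranks that fell behind. It is only a sketch under stated assumptions: `TraceStore`, `recordLastStarted`, and `findLaggingRanks` are hypothetical names, a `std::map` stands in for the shared store, and none of this is the actual ProcessGroupNCCL code from this commit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a shared key-value store; NOT the c10d::Store API.
using TraceStore = std::map<std::string, std::uint64_t>;

// Hypothetical key layout: one "last started" entry per rank.
std::string startKey(int rank) {
  return "rank" + std::to_string(rank) + ":last_started";
}

// Each rank overwrites its own entry with the sequence number of the
// collective it most recently started (its "last footprint").
void recordLastStarted(TraceStore& store, int rank, std::uint64_t seq) {
  store[startKey(rank)] = seq;
}

// On timeout, read every rank's footprint and return the ranks whose last
// started collective is behind the furthest rank. A rank with no entry is
// treated as never having started a collective (sequence number 0).
std::vector<int> findLaggingRanks(const TraceStore& store, int worldSize) {
  assert(worldSize > 0);
  std::vector<std::uint64_t> started(worldSize, 0);
  std::uint64_t maxStarted = 0;
  for (int rank = 0; rank < worldSize; ++rank) {
    auto it = store.find(startKey(rank));
    if (it != store.end()) {
      started[rank] = it->second;
    }
    maxStarted = std::max(maxStarted, started[rank]);
  }
  std::vector<int> lagging;
  for (int rank = 0; rank < worldSize; ++rank) {
    if (started[rank] < maxStarted) {
      lagging.push_back(rank);
    }
  }
  return lagging;
}
```

For example, if ranks 0 and 1 record sequence number 5 while rank 2 records 4, `findLaggingRanks(store, 3)` reports rank 2 as the likely source of the desync. A real implementation would also need to compare the collective type and tensor sizes to catch the mismatched-collective and mismatched-broadcast-size cases listed in the test plan.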

Test Plan:
Standalone test
* Typical desync - P467288969
* Mismatched collectives - P467288916
* Mismatched broadcast size - P467288873

DDP benchmark
* DDP benchmark desync - P467433483, P467520195

No perf regression:
* w/o this diff https://www.internalfb.com/intern/fblearner/details/308379789?tab=Outputs
* w/ this diff https://www.internalfb.com/intern/fblearner/details/308534088?tab=Outputs

Reviewed By: mingzhe09088

Differential Revision: D32348647

fbshipit-source-id: 43e7e96e3fa2be0ac66c1325bceb639b461a8b3a
2021-11-17 20:29:03 -08:00
..
example
CMakeLists.txt Update CMake and use native CUDA language support (#62445) 2021-10-11 09:05:48 -07:00
CUDATest.cu
CUDATest.hpp
FileStoreTest.cpp
HashStoreTest.cpp
ProcessGroupGlooAsyncTest.cpp
ProcessGroupGlooTest.cpp use irange for loops 5 (#66744) 2021-10-18 21:59:50 -07:00
ProcessGroupMPITest.cpp
ProcessGroupNCCLErrorsTest.cpp Enable desync root cause analysis for NCCL (#68310) 2021-11-17 20:29:03 -08:00
ProcessGroupNCCLTest.cpp [PG NCCL] Disable NCCL health check (#67668) 2021-11-02 16:21:59 -07:00
StoreTestCommon.hpp
TCPStoreTest.cpp
TestUtils.hpp