Mirror of https://github.com/zebrajr/pytorch.git, synced 2025-12-07 12:21:27 +01:00
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68310

Enable desync root cause analysis by recording the last footprint of collective calls. On timeout, we parse the store trace to determine the root cause of the desync issue. This feature is built on top of async error handling.

Test Plan:
Standalone test
* Typical desync - P467288969
* Mismatched collectives - P467288916
* Mismatched broadcast size - P467288873

DDP benchmark
* DDP benchmark desync - P467433483, P467520195

No perf regression:
* w/o this diff https://www.internalfb.com/intern/fblearner/details/308379789?tab=Outputs
* w/ this diff https://www.internalfb.com/intern/fblearner/details/308534088?tab=Outputs

Reviewed By: mingzhe09088

Differential Revision: D32348647

fbshipit-source-id: 43e7e96e3fa2be0ac66c1325bceb639b461a8b3a

| Name |
|---|
| example |
| CMakeLists.txt |
| CUDATest.cu |
| CUDATest.hpp |
| FileStoreTest.cpp |
| HashStoreTest.cpp |
| ProcessGroupGlooAsyncTest.cpp |
| ProcessGroupGlooTest.cpp |
| ProcessGroupMPITest.cpp |
| ProcessGroupNCCLErrorsTest.cpp |
| ProcessGroupNCCLTest.cpp |
| StoreTestCommon.hpp |
| TCPStoreTest.cpp |
| TestUtils.hpp |
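
The commit summary describes recording each rank's last collective "footprint" and parsing the store trace on timeout. The sketch below illustrates that general idea only; it is not the code from the pull request. It assumes a plain `dict` standing in for the distributed store, and the names `Footprint`, `record_footprint`, and `analyze_desync` are hypothetical. The three outcomes it distinguishes loosely mirror the standalone test cases listed in the test plan (typical desync, mismatched collectives, mismatched size).

```python
from dataclasses import dataclass


@dataclass
class Footprint:
    """Hypothetical record of the last collective a rank started."""
    seq: int   # sequence number of the collective
    op: str    # collective name, e.g. "allreduce"
    size: int  # payload size in elements


def record_footprint(store, rank, seq, op, size):
    # Each rank overwrites its own entry, so the store only ever
    # holds the most recent footprint per rank.
    store[rank] = Footprint(seq, op, size)


def analyze_desync(store, world_size):
    """Return (kind, details) describing the most likely desync cause."""
    prints = {rank: store.get(rank) for rank in range(world_size)}

    # Ranks that never recorded anything cannot be compared.
    missing = [rank for rank, fp in prints.items() if fp is None]
    if missing:
        return ("no_footprint", missing)

    # Typical desync: some ranks are behind the furthest-ahead rank.
    latest = max(fp.seq for fp in prints.values())
    lagging = sorted(r for r, fp in prints.items() if fp.seq < latest)
    if lagging:
        return ("lagging_ranks", lagging)

    # Same sequence number everywhere, but different collectives.
    ops = {fp.op for fp in prints.values()}
    if len(ops) > 1:
        return ("mismatched_collectives", sorted(ops))

    # Same collective everywhere, but different payload sizes.
    sizes = {fp.size for fp in prints.values()}
    if len(sizes) > 1:
        return ("mismatched_sizes", sorted(sizes))

    return ("no_desync_detected", [])


# Usage: rank 1 is one collective behind rank 0.
store = {}
record_footprint(store, rank=0, seq=5, op="allreduce", size=1024)
record_footprint(store, rank=1, seq=4, op="allreduce", size=1024)
result = analyze_desync(store, world_size=2)
```

Checking sequence numbers before operation names is a deliberate choice in this sketch: when ranks are at different sequence numbers, differing operation names are expected and would not by themselves indicate a mismatch.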