pytorch/test/cpp/c10d
Will Constable 7562b45454 Reland "[C10D] Use future for flight recorder dump (#115176)" (#115332)
Replaces the "always sleep 30 sec before abort" behavior with "wait up to
30 sec for the future to complete, then abort". The difference is that
the abort now happens as soon as the dump finishes, bounded by the 30 sec
maximum, instead of always waiting the full maximum.
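
As a rough illustration of the bounded wait described above, the sketch below uses `std::future::wait_for` with an upper limit. The names `writeDebugDump` and `waitForDump` are hypothetical and are not taken from the PyTorch sources; the actual implementation may differ.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for the debug dump: simulates `work` of dumping,
// then reports success.
bool writeDebugDump(std::chrono::milliseconds work) {
  std::this_thread::sleep_for(work);
  return true;
}

// Waits at most `maxWait` for the dump to finish. Returns true if the dump
// completed within the limit, false if the limit expired first. Either way
// the caller can proceed to abort right after this returns.
bool waitForDump(std::future<bool>& fut, std::chrono::milliseconds maxWait) {
  return fut.wait_for(maxWait) == std::future_status::ready;
}
```

With this shape, a dump that finishes in a few milliseconds lets the caller abort almost immediately, while a stuck dump delays the abort by no more than `maxWait`.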

Allows multiple calls to dump, which will be serialized.
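
One way to picture "multiple calls, serialized" is a launcher that starts a new asynchronous task on every call while a mutex keeps the dump bodies from overlapping. This is only a sketch under that assumption: the `DebugDumper` class, the `tag` parameter, and the in-memory log are hypothetical, and the signature shown for `launchAsyncDebugDump` is not the one in PyTorch.

```cpp
#include <future>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical dumper: every call launches a new async dump, and a mutex
// serializes the dump bodies so that two dumps never interleave.
class DebugDumper {
 public:
  // Assumed signature, for illustration only.
  std::future<bool> launchAsyncDebugDump(const std::string& tag) {
    return std::async(std::launch::async, [this, tag] {
      std::lock_guard<std::mutex> guard(dumpMutex_);
      log_.push_back(tag + ":begin");  // stand-in for the real dump work
      log_.push_back(tag + ":end");
      return true;
    });
  }

  // Returns a copy of the log, taken under the same mutex.
  std::vector<std::string> log() {
    std::lock_guard<std::mutex> guard(dumpMutex_);
    return log_;
  }

 private:
  std::mutex dumpMutex_;
  std::vector<std::string> log_;
};
```

Because each dump holds the mutex for its whole body, the begin and end entries of one dump are always adjacent in the log, whichever dump happens to run first.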

Renames tryWriteDebugInfo to launchAsyncDebugDump, in the spirit of the
change: it now supports more than one launch and always launches, rather
than launching only on the first call.

Adds a test for dumping on timeout.

This reverts commit ac7d14baad.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115332
Approved by: https://github.com/fduwjj
2023-12-07 21:20:58 +00:00
..
example
CMakeLists.txt Clean up CMake target linking (#109959) 2023-09-25 01:37:14 +00:00
CUDATest.cu [ROCm] Update clock intrinsic handling for AMD gfx11 family (#97005) 2023-03-24 18:29:49 +00:00
CUDATest.hpp
FileStoreTest.cpp
HashStoreTest.cpp [RESUBMIT] Standardize on error types for distributed errors. (#108191) 2023-08-30 21:47:39 +00:00
ProcessGroupGlooAsyncTest.cpp
ProcessGroupGlooTest.cpp Add Bfloat16 scalar support to gloo backend (#113557) 2023-11-17 21:16:54 +00:00
ProcessGroupMPITest.cpp
ProcessGroupNCCLErrorsTest.cpp Reland "[C10D] Use future for flight recorder dump (#115176)" (#115332) 2023-12-07 21:20:58 +00:00
ProcessGroupNCCLTest.cpp Switch env variable use in test harnesses to the non-deprecated names to fix warnings (#114880) 2023-12-01 20:08:23 +00:00
ProcessGroupUCCTest.cpp
StoreTestCommon.hpp
TCPStoreTest.cpp [RESUBMIT] Standardize on error types for distributed errors. (#108191) 2023-08-30 21:47:39 +00:00
TestUtils.hpp