pytorch/caffe2
Tristan Rice 4c28a0eb0b c10d/logging: add C10D_LOCK_GUARD (#134131)
This adds logs if we can't acquire locks in NCCLUtils and ProcessGroupNCCL for 30s.

This is motivated by some deadlocks we're seeing, where it's unclear whether the problem is in NCCL or on the PyTorch side of things.

This required replacing most `std::mutex` with `std::timed_mutex` and `std::condition_variable_any` as appropriate.
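
As a rough illustration of why the mutex type has to change, here is a minimal sketch of a lock helper that logs when acquisition takes too long. It is not the actual `C10D_LOCK_GUARD` implementation: `lockWithLogging`, `Flag`, `setReady`, `waitReady`, and the `timeouts` counter are hypothetical names used only for this example, and the log goes to `std::cerr` rather than the c10d logger. The sketch relies on `try_lock_for`, which `std::timed_mutex` provides and `std::mutex` does not, and on `std::condition_variable_any`, since `std::condition_variable` only accepts `std::unique_lock<std::mutex>`.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

// Hypothetical helper (an assumption for this sketch, not PyTorch's code).
// Keeps trying to acquire `mutex`; each time `logInterval` elapses without
// success it logs a message and tries again. `timeouts`, if given, counts
// how many times the interval elapsed.
template <typename Mutex>
std::unique_lock<Mutex> lockWithLogging(
    Mutex& mutex,
    std::chrono::milliseconds logInterval,
    const char* what,
    int* timeouts = nullptr) {
  std::unique_lock<Mutex> lock(mutex, std::defer_lock);
  // try_lock_for requires a timed lockable type such as std::timed_mutex.
  while (!lock.try_lock_for(logInterval)) {
    if (timeouts != nullptr) {
      ++*timeouts;
    }
    std::cerr << "could not acquire " << what << " within "
              << logInterval.count() << "ms, still waiting\n";
  }
  assert(lock.owns_lock());
  return lock;
}

// A flag guarded by a timed_mutex. Waiting on it needs
// std::condition_variable_any because the lock is not a
// std::unique_lock<std::mutex>.
struct Flag {
  std::timed_mutex mutex;
  std::condition_variable_any cv;
  bool ready = false;
};

void setReady(Flag& flag, std::chrono::milliseconds logInterval) {
  {
    auto lock = lockWithLogging(flag.mutex, logInterval, "Flag::mutex");
    flag.ready = true;
  }
  flag.cv.notify_all();
}

void waitReady(Flag& flag, std::chrono::milliseconds logInterval) {
  auto lock = lockWithLogging(flag.mutex, logInterval, "Flag::mutex");
  // wait() releases the lock while blocked and reacquires it before returning.
  flag.cv.wait(lock, [&flag] { return flag.ready; });
}
```

With a 30s interval, as in the description above, a caller would write something like `auto lock = lockWithLogging(mutex_, std::chrono::seconds(30), "mutex_");`. The helper still blocks until the lock is acquired; the timeout only controls when a log line is emitted, so behavior is unchanged apart from the added diagnostics.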

Test plan:

existing CI for regressions

will add unit tests on `C10D_LOCK_GUARD`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134131
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-08-24 00:27:39 +00:00
core [Caffe2] Remove unused AVX512 code (#133160) 2024-08-23 23:16:16 +00:00
perfkernels [Caffe2] Remove unused AVX512 code (#133160) 2024-08-23 23:16:16 +00:00
serialize Revert "Make c10::string_view an alias of std::string_view (#130417)" 2024-07-12 00:37:04 +00:00
utils [BE] Remove suppression of inconsistent missing overrides (#131524) 2024-07-24 10:07:36 +00:00
.clang-format
CMakeLists.txt c10d/logging: add C10D_LOCK_GUARD (#134131) 2024-08-24 00:27:39 +00:00
unexported_symbols.lds
version_script.lds