This adds logs if we can't acquire locks in NCCLUtils and ProcessGroupNCCL for 30s.
This is motivated by some deadlocks were seeing and it's unclear if it's in NCCL or on the PyTorch side of things.
This required replacing most `std::mutex` with `std::timed_mutex` and `std::condition_variable_any` as appropriate.
Test plan:
existing CI for regressions
will add unit tests on `C10D_LOCK_GUARD`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134131
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
Headers under torch/csrc/distributed may be referened with relative path, e.g., "<c10d/...>". However, relative path cannot be gracefully handled by Meta internal build when the NCCL PG is hipified to support AMD/RCCL because the "hipified" header files are generated in other directories. Moreover, using absolute path for header inclusion is the state-of-the-art in most components in Pytorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.
See D39835774 for more details about Meta internal complication.
**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed in compile options. Thus use it to verify we don't miss any relative path use of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera, https://github.com/huydhn
Headers under torch/csrc/distributed may be referened with relative path, e.g., "<c10d/...>". However, relative path cannot be gracefully handled by Meta internal build when the NCCL PG is hipified to support AMD/RCCL because the "hipified" header files are generated in other directories. Moreover, using absolute path for header inclusion is the state-of-the-art in most components in Pytorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.
See D39835774 for more details about Meta internal complication.
**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed in compile options. Thus use it to verify we don't miss any relative path use of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73167
This PR adds `C10D_DEBUG` and `C10D_TRACE` macros to enable fine grained logging in c10d. It also updates some log statements of `socket` to make its output less noisy.
ghstack-source-id: 149778567
Test Plan: Manual testing with different socket conditions.
Reviewed By: rohan-varma
Differential Revision: D34371426
fbshipit-source-id: a852b05ec353b18b0540ce5f803666c3da21ddd7
(cherry picked from commit 4519b06ac57f177dfc086bc10e8e1a746ba0870d)