pytorch/torch/csrc/distributed/c10d
Tristan Rice ba214ab56c TCPStore: soft fail bind when agent store active (#147465)
This makes it easier to roll out `TORCHELASTIC_USE_AGENT_STORE` by opportunistically swallowing bind errors when the agent store is enabled and the port matches `MASTER_PORT`.

This should be very safe: if the agent store is somehow not up while the environment variables are set, the TCPStore client connections will fail to connect. The error message ends up slightly different, but the success/failure behavior is identical.
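
The soft-fail rule described above can be sketched in a few lines of plain Python. This is an illustration only, not the actual C++ change in `TCPStore.cpp` / `TCPStoreLibUvBackend.cpp`. The helper name `bind_or_soft_fail`, the `env` parameter, and the assumption that the flag is enabled by the literal value `"True"` are all hypothetical.

```python
import os
import socket
from typing import Mapping, Optional


def bind_or_soft_fail(
    port: int, env: Mapping[str, str] = os.environ
) -> Optional[socket.socket]:
    """Try to bind a listening socket; swallow the bind error only when the
    agent store is enabled and the port matches MASTER_PORT.

    Hypothetical helper for illustration; the real logic lives in C++.
    """
    # Assumption: the flag counts as enabled when set to the string "True".
    agent_store_active = env.get("TORCHELASTIC_USE_AGENT_STORE", "") == "True"
    is_master_port = env.get("MASTER_PORT") == str(port)

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen()
        return sock
    except OSError:
        sock.close()
        if agent_store_active and is_master_port:
            # Soft fail: assume the agent already hosts the store on this
            # port, so no local server is started and clients connect to it.
            return None
        # Any other case keeps the original hard failure.
        raise
```

If the agent store is not actually running, nothing is listening on the port and later client connections fail, which is why the overall success/failure behavior stays the same.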

This also pybinds `c10d::SocketError` into Python so we can assert on the error type in tests.

https://docs.google.com/document/d/1CzOn_N53AiFxWGgbyMWSnd2elCJd4lZ-ajPg2lzcxoM/edit?tab=t.0#heading=h.2j2f5dimrdau

Test plan:

```
pytest test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147465
Approved by: https://github.com/fduwjj
2025-02-21 03:02:26 +00:00
control_collectives
control_plane [19/N] Fix extra warnings brought by clang-tidy-17 (#144448) 2025-01-09 15:58:05 +00:00
cuda [AsyncMM] re-enable and adapt to cutlass 3.6.0 (#144011) 2025-01-31 00:48:51 +00:00
quantization
Backend.cpp
Backend.hpp [DDP] Use NCCL allocated memory for gradient bucket (#146589) 2025-02-10 05:23:11 +00:00
Backoff.cpp
Backoff.hpp
c10d.h
comm.cpp
comm.hpp
CudaDMAConnectivity.cpp
CUDASymmetricMemory-inl.h
CUDASymmetricMemory.cu Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211)" 2025-02-03 22:04:28 +00:00
CUDASymmetricMemory.hpp
CUDASymmetricMemoryOps.cu [SymmetricMemory] introduce multimem_all_gather (#142810) 2024-12-17 01:07:27 +00:00
debug.cpp
debug.h
default_comm_hooks.cpp
default_comm_hooks.hpp
DMAConnectivity.cpp [19/N] Fix extra warnings brought by clang-tidy-17 (#144448) 2025-01-09 15:58:05 +00:00
DMAConnectivity.hpp
error.h
exception.h Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806) 2025-01-24 12:22:13 +00:00
FakeProcessGroup.hpp
FileStore.cpp Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806) 2025-01-24 12:22:13 +00:00
FileStore.hpp
FlightRecorder.cpp [c10d] Flush file in file recorder (#145458) 2025-01-27 23:15:52 +00:00
FlightRecorder.hpp
Functional.cpp Fix an issue where functional collectives don't force fx stride on inputs when compiled (#146467) 2025-02-10 19:15:49 +00:00
Functional.hpp
GlooDeviceFactory.cpp [Reland][Environment Variable][4/N] Use thread-safe getenv functions (#140593) 2025-01-28 20:51:49 +00:00
GlooDeviceFactory.hpp
GroupRegistry.cpp Remove some NOLINT (#146610) 2025-02-07 01:50:06 +00:00
GroupRegistry.hpp
HashStore.cpp
HashStore.hpp
init.cpp PyWork: preserve Python reference counting when used in functional collectives (#146376) 2025-02-07 18:07:53 +00:00
intra_node_comm.cpp [codemod] Remove unused-variable in caffe2/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp +2 (#144371) 2025-01-09 21:49:17 +00:00
intra_node_comm.cu
intra_node_comm.hpp
logger.cpp [4/N] Remove unnecessary once flag usage (#146783) 2025-02-11 13:55:06 +00:00
logger.hpp [fr][c10d] log trace capture enabled or not in flight recorder (#143865) 2024-12-27 03:07:55 +00:00
logging.cpp
logging.h Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806) 2025-01-24 12:22:13 +00:00
NanCheck.cu
NanCheck.hpp
NCCLUtils.cpp [DDP] Use NCCL allocated memory for gradient bucket (#146589) 2025-02-10 05:23:11 +00:00
NCCLUtils.hpp [DDP] Use NCCL allocated memory for gradient bucket (#146589) 2025-02-10 05:23:11 +00:00
Ops.cpp
ParamCommsUtils.cpp
ParamCommsUtils.hpp
PrefixStore.cpp
PrefixStore.hpp
ProcessGroup.cpp
ProcessGroup.hpp Enable coalescing path on XPU and dispatch to XPU tensor barrier if XCCL backend is specified. (#143735) 2025-01-14 08:37:48 +00:00
ProcessGroupGloo.cpp Use task submitter TLS in gloo working threads (#142184) 2024-12-06 17:03:17 +00:00
ProcessGroupGloo.hpp Use task submitter TLS in gloo working threads (#142184) 2024-12-06 17:03:17 +00:00
ProcessGroupMPI.cpp [2/N] Remove unnecessary once flag usage (#145057) 2025-01-23 09:48:46 +00:00
ProcessGroupMPI.hpp Cleanup CallOnce.h (#146700) 2025-02-07 16:44:45 +00:00
ProcessGroupNCCL.cpp [PGNCCL] Associate tensor allocation support with NCCL version (#146842) 2025-02-11 02:52:52 +00:00
ProcessGroupNCCL.hpp [PGNCCL] Associate tensor allocation support with NCCL version (#146842) 2025-02-11 02:52:52 +00:00
ProcessGroupUCC.cpp Cleanup CallOnce.h (#146700) 2025-02-07 16:44:45 +00:00
ProcessGroupUCC.hpp [c10d][UCC] Add _reduce_scatter_base to c10d::ProcessGroupUCC (#138021) 2024-12-09 16:02:24 +00:00
ProcessGroupWrapper.cpp
ProcessGroupWrapper.hpp
PyProcessGroup.hpp PyWork: preserve Python reference counting when used in functional collectives (#146376) 2025-02-07 18:07:53 +00:00
python_comm_hook.cpp
python_comm_hook.h
RankLocal.hpp
reducer_cuda.cpp
reducer_timer.hpp
reducer.cpp [DDP] Use NCCL allocated memory for gradient bucket (#146589) 2025-02-10 05:23:11 +00:00
reducer.hpp Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806) 2025-01-24 12:22:13 +00:00
sequence_num.cpp [4/N] Apply bugprone-unchecked-optional-access (#142832) 2024-12-12 04:33:32 +00:00
sequence_num.hpp
socket_fmt.h
socket.cpp Remove unnecessary once flag usage (#143255) 2025-01-16 02:36:11 +00:00
socket.h
Store.cpp
Store.hpp
SymmetricMemory.cpp [SymmetricMemory] introduce multimem_all_gather (#142810) 2024-12-17 01:07:27 +00:00
SymmetricMemory.hpp
TCPStore.cpp TCPStore: soft fail bind when agent store active (#147465) 2025-02-21 03:02:26 +00:00
TCPStore.hpp
TCPStoreBackend.cpp [19/N] Fix extra warnings brought by clang-tidy-17 (#144448) 2025-01-09 15:58:05 +00:00
TCPStoreBackend.hpp
TCPStoreLibUvBackend.cpp TCPStore: soft fail bind when agent store active (#147465) 2025-02-21 03:02:26 +00:00
TraceUtils.h
Types.hpp Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806) 2025-01-24 12:22:13 +00:00
UCCTracing.cpp
UCCTracing.hpp
UCCUtils.cpp
UCCUtils.hpp
UnixSockUtils.hpp
Utils.cpp
Utils.hpp [Reland][Environment Variable][4/N] Use thread-safe getenv functions (#140593) 2025-01-28 20:51:49 +00:00
WinSockUtils.hpp
Work.cpp Enable more readability-redundant checks (#143963) 2024-12-30 14:49:33 +00:00
Work.hpp