pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Tristan Rice ba214ab56c TCPStore: soft fail bind when agent store active (#147465)

This makes it easier to roll out `TORCHELASTIC_USE_AGENT_STORE` by opportunistically swallowing bind errors when the agent store is enabled and the port matches `MASTER_PORT`.

This should be very safe as if the store is somehow not up and the envs are set, the TCPStore client connections will fail to connect so we end up with a slightly different error message but success/failure behavior is identical.

This also pybinds `c10d::SocketError` into Python so we can assert on the error type in tests.

https://docs.google.com/document/d/1CzOn_N53AiFxWGgbyMWSnd2elCJd4lZ-ajPg2lzcxoM/edit?tab=t.0#heading=h.2j2f5dimrdau

Test plan:
```
pytest test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147465
Approved by: https://github.com/fduwjj
2025-02-21 03:02:26 +00:00
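The soft-fail rule in the commit message above can be illustrated with a minimal sketch using plain Python sockets. This is not the TCPStore implementation: `agent_store_active` and `try_bind` are hypothetical helper names, the environment is passed in as a dict, and treating the string `"True"` as "enabled" is an assumption. Only the variable names `TORCHELASTIC_USE_AGENT_STORE` and `MASTER_PORT` come from the commit message.

```python
import socket


def agent_store_active(env, port):
    """Hypothetical predicate: the agent store flag is set and the port is MASTER_PORT.

    Assumption: the flag counts as enabled when it equals the string "True".
    """
    return (
        env.get("TORCHELASTIC_USE_AGENT_STORE") == "True"
        and env.get("MASTER_PORT") == str(port)
    )


def try_bind(port, env):
    """Return a listening socket, or None when a bind failure is tolerated.

    A bind error is swallowed only when the agent store is expected to own
    the port already; otherwise the error propagates to the caller.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen()
        return sock
    except OSError:
        sock.close()
        if agent_store_active(env, port):
            # Soft fail: assume another process (the agent) serves this port.
            return None
        raise
```

In this sketch a caller that receives `None` would go on to connect as a client. If nothing is actually listening on the port, that connection attempt fails, which matches the commit's argument that only the error message changes, not the success/failure outcome.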
control_collectives
control_plane	[19/N] Fix extra warnings brought by clang-tidy-17 (#144448)	2025-01-09 15:58:05 +00:00
cuda	[AsyncMM] re-enable and adapt to cutlass 3.6.0 (#144011)	2025-01-31 00:48:51 +00:00
quantization
Backend.cpp
Backend.hpp	[DDP] Use NCCL allocated memory for gradient bucket (#146589)	2025-02-10 05:23:11 +00:00
Backoff.cpp
Backoff.hpp
c10d.h
comm.cpp
comm.hpp
CudaDMAConnectivity.cpp
CUDASymmetricMemory-inl.h
CUDASymmetricMemory.cu	Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211)"	2025-02-03 22:04:28 +00:00
CUDASymmetricMemory.hpp
CUDASymmetricMemoryOps.cu	[SymmetricMemory] introduce multimem_all_gather (#142810)	2024-12-17 01:07:27 +00:00
debug.cpp
debug.h
default_comm_hooks.cpp
default_comm_hooks.hpp
DMAConnectivity.cpp	[19/N] Fix extra warnings brought by clang-tidy-17 (#144448)	2025-01-09 15:58:05 +00:00
DMAConnectivity.hpp
error.h
exception.h	Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)	2025-01-24 12:22:13 +00:00
FakeProcessGroup.hpp
FileStore.cpp	Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)	2025-01-24 12:22:13 +00:00
FileStore.hpp
FlightRecorder.cpp	[c10d] Flush file in file recorder (#145458)	2025-01-27 23:15:52 +00:00
FlightRecorder.hpp
Functional.cpp	Fix an issue where functional collectives don't force fx stride on inputs when compiled (#146467)	2025-02-10 19:15:49 +00:00
Functional.hpp
GlooDeviceFactory.cpp	[Reland][Environment Variable][4/N] Use thread-safe getenv functions (#140593)	2025-01-28 20:51:49 +00:00
GlooDeviceFactory.hpp
GroupRegistry.cpp	Remove some NOLINT (#146610)	2025-02-07 01:50:06 +00:00
GroupRegistry.hpp
HashStore.cpp
HashStore.hpp
init.cpp	PyWork: preserve Python reference counting when used in functional collectives (#146376)	2025-02-07 18:07:53 +00:00
intra_node_comm.cpp	[codemod] Remove unused-variable in caffe2/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp +2 (#144371)	2025-01-09 21:49:17 +00:00
intra_node_comm.cu
intra_node_comm.hpp
logger.cpp	[4/N] Remove unnecessary once flag usage (#146783)	2025-02-11 13:55:06 +00:00
logger.hpp	[fr][c10d] log trace capture enabled or not in flight recorder (#143865)	2024-12-27 03:07:55 +00:00
logging.cpp
logging.h	Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)	2025-01-24 12:22:13 +00:00
NanCheck.cu
NanCheck.hpp
NCCLUtils.cpp	[DDP] Use NCCL allocated memory for gradient bucket (#146589)	2025-02-10 05:23:11 +00:00
NCCLUtils.hpp	[DDP] Use NCCL allocated memory for gradient bucket (#146589)	2025-02-10 05:23:11 +00:00
Ops.cpp
ParamCommsUtils.cpp
ParamCommsUtils.hpp
PrefixStore.cpp
PrefixStore.hpp
ProcessGroup.cpp
ProcessGroup.hpp	Enable coalescing path on XPU and dispatch to XPU tensor barrier if XCCL backend is specified. (#143735)	2025-01-14 08:37:48 +00:00
ProcessGroupGloo.cpp	Use task submitter TLS in gloo working threads (#142184)	2024-12-06 17:03:17 +00:00
ProcessGroupGloo.hpp	Use task submitter TLS in gloo working threads (#142184)	2024-12-06 17:03:17 +00:00
ProcessGroupMPI.cpp	[2/N] Remove unnecessary once flag usage (#145057)	2025-01-23 09:48:46 +00:00
ProcessGroupMPI.hpp	Cleanup CallOnce.h (#146700)	2025-02-07 16:44:45 +00:00
ProcessGroupNCCL.cpp	[PGNCCL] Associate tensor allocation support with NCCL version (#146842)	2025-02-11 02:52:52 +00:00
ProcessGroupNCCL.hpp	[PGNCCL] Associate tensor allocation support with NCCL version (#146842)	2025-02-11 02:52:52 +00:00
ProcessGroupUCC.cpp	Cleanup CallOnce.h (#146700)	2025-02-07 16:44:45 +00:00
ProcessGroupUCC.hpp	[c10d][UCC] Add `_reduce_scatter_base` to `c10d::ProcessGroupUCC` (#138021)	2024-12-09 16:02:24 +00:00
ProcessGroupWrapper.cpp
ProcessGroupWrapper.hpp
PyProcessGroup.hpp	PyWork: preserve Python reference counting when used in functional collectives (#146376)	2025-02-07 18:07:53 +00:00
python_comm_hook.cpp
python_comm_hook.h
RankLocal.hpp
reducer_cuda.cpp
reducer_timer.hpp
reducer.cpp	[DDP] Use NCCL allocated memory for gradient bucket (#146589)	2025-02-10 05:23:11 +00:00
reducer.hpp	Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)	2025-01-24 12:22:13 +00:00
sequence_num.cpp	[4/N] Apply bugprone-unchecked-optional-access (#142832)	2024-12-12 04:33:32 +00:00
sequence_num.hpp
socket_fmt.h
socket.cpp	Remove unnecessary once flag usage (#143255)	2025-01-16 02:36:11 +00:00
socket.h
Store.cpp
Store.hpp
SymmetricMemory.cpp	[SymmetricMemory] introduce multimem_all_gather (#142810)	2024-12-17 01:07:27 +00:00
SymmetricMemory.hpp
TCPStore.cpp	TCPStore: soft fail bind when agent store active (#147465)	2025-02-21 03:02:26 +00:00
TCPStore.hpp
TCPStoreBackend.cpp	[19/N] Fix extra warnings brought by clang-tidy-17 (#144448)	2025-01-09 15:58:05 +00:00
TCPStoreBackend.hpp
TCPStoreLibUvBackend.cpp	TCPStore: soft fail bind when agent store active (#147465)	2025-02-21 03:02:26 +00:00
TraceUtils.h
Types.hpp	Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)	2025-01-24 12:22:13 +00:00
UCCTracing.cpp
UCCTracing.hpp
UCCUtils.cpp
UCCUtils.hpp
UnixSockUtils.hpp
Utils.cpp
Utils.hpp	[Reland][Environment Variable][4/N] Use thread-safe getenv functions (#140593)	2025-01-28 20:51:49 +00:00
WinSockUtils.hpp
Work.cpp	Enable more readability-redundant checks (#143963)	2024-12-30 14:49:33 +00:00
Work.hpp