pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Kurman Karabukaev d62b025efc [TorchElastic] Option for sharing TCPStore created by rdzv handlers (#125743)	2024-05-22 18:24:11 +00:00

Summary:

1. Define an explicit `use_agent_store` on rdzv handlers. Handlers that set it to true can share the store.
2. Instead of the agent coordinating master_addr/master_port values, the logic is now encapsulated by the rdzv_handler: `RendezvousInfo` carries a `RendezvousStoreInfo` object that handlers must return.
   - Depending on the implementation, handlers can either:
     - point to an existing store (in which case they are expected to set `use_agent_store` to true, see point 1). Client code will rely on the `TORCHELASTIC_USE_AGENT_STORE` env variable to know whether the store is shared.
     - build args that `torch.distributed.init_process_group` can use to bootstrap by creating a new store.

Additional points:

- When the TCPStore is shared, it should be wrapped in a PrefixStore to qualify/scope the namespace for other use cases.
- The `next_rendezvous` signature changed to return an instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple, for extensibility purposes.

Why:

- Reduce moving parts
  - easier to swap implementation
  - improve tractability
  - addressing perf/debug-ability will benefit all use cases

Test Plan: CI

Differential Revision: D57055235

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125743
Approved by: https://github.com/d4l3k
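
The store-sharing decision described in the commit message can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the actual torch.distributed.elastic code: the two dataclasses are hypothetical stand-ins that only borrow the names `RendezvousInfo` and `RendezvousStoreInfo`, their field names are assumptions, and so are the helpers `build_worker_env` and `store_is_shared` and the convention that the env variable holds the string `"True"`.

```python
from dataclasses import dataclass
from typing import Dict, Mapping


@dataclass(frozen=True)
class RendezvousStoreInfo:
    """Hypothetical stand-in: address of the store workers bootstrap from."""
    master_addr: str
    master_port: int


@dataclass(frozen=True)
class RendezvousInfo:
    """Hypothetical stand-in for what `next_rendezvous` returns."""
    rank: int
    world_size: int
    bootstrap_store_info: RendezvousStoreInfo


def build_worker_env(info: RendezvousInfo, use_agent_store: bool) -> Dict[str, str]:
    """Agent side: export store coordinates and the sharing flag to workers."""
    return {
        "MASTER_ADDR": info.bootstrap_store_info.master_addr,
        "MASTER_PORT": str(info.bootstrap_store_info.master_port),
        # Assumed encoding: the flag is serialized as "True" / "False".
        "TORCHELASTIC_USE_AGENT_STORE": str(use_agent_store),
    }


def store_is_shared(env: Mapping[str, str]) -> bool:
    """Client side: decide whether to connect to the agent's existing store
    (shared) or to create a new store from MASTER_ADDR / MASTER_PORT."""
    return env.get("TORCHELASTIC_USE_AGENT_STORE") == str(True)


if __name__ == "__main__":
    info = RendezvousInfo(
        rank=0,
        world_size=2,
        bootstrap_store_info=RendezvousStoreInfo("localhost", 29500),
    )
    env = build_worker_env(info, use_agent_store=True)
    mode = "shared agent store" if store_is_shared(env) else "new store"
    print(f"rank {info.rank}/{info.world_size} uses: {mode}")
```

In this sketch a missing variable is treated as "not shared", so clients that never see the flag fall back to creating their own store.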
agent_diagram.jpg
agent.rst	Adding health check server hook in torch elastic (#122750) (#123504)	2024-04-11 19:10:56 +00:00
customization.rst
errors.rst
etcd_rdzv_diagram.png
events.rst
examples.rst
kubernetes.rst
metrics.rst
multiprocessing.rst
quickstart.rst
rendezvous.rst	[TorchElastic] Option for sharing TCPStore created by rdzv handlers (#125743)	2024-05-22 18:24:11 +00:00
run.rst
subprocess_handler.rst	[Torch Elastic][Draft] Refactor SubprocessHandler to separate module for easier subclass (#120373)	2024-03-08 01:37:34 +00:00
timer.rst	[Torch][Timer] Adding debug info logging interface for expired timers (#123883)	2024-04-25 01:15:52 +00:00
train_script.rst