mirror of
https://github.com/zebrajr/pytorch.git
synced 2025-12-07 12:21:27 +01:00
Summary:
1. Define explicit `use_agent_store` on rdzv handlers. Handlers that set is true can share the store.
2. Instead of agent coordinating master_add/master_port values, the logic is now encapsulated by a *rdzv_handler* where `RendezvousInfo` will have `RendezvousStoreInfo` object that handlers must return.
- Depending on the implementation they can either:
- point to existing store (and expected to `use_agent_store` as true - point 1). Client code will rely on `TORCHELASTIC_USE_AGENT_STORE` env variable to know if the store is shared.
- build args that `torch.distributed.init_process_group` can bootstrap by creating new store.
Additional points:
- When TCPStore is shared, it should be wrapped in PrefixStore to qualify/scope namespace for other usecases.
- `next_rendezvous` signature changed to return instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple for extensibility purposes.
Why:
- Reduce moving parts
- easier to swap implementation
- improve tractability
- addressing perf/debug-ability will benefit all usecases
-
Test Plan: CI
Differential Revision: D57055235
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125743
Approved by: https://github.com/d4l3k
|
||
|---|---|---|
| .. | ||
| agent_diagram.jpg | ||
| agent.rst | ||
| customization.rst | ||
| errors.rst | ||
| etcd_rdzv_diagram.png | ||
| events.rst | ||
| examples.rst | ||
| kubernetes.rst | ||
| metrics.rst | ||
| multiprocessing.rst | ||
| quickstart.rst | ||
| rendezvous.rst | ||
| run.rst | ||
| subprocess_handler.rst | ||
| timer.rst | ||
| train_script.rst | ||