Summary:
1. Define an explicit `use_agent_store` on rdzv handlers. Handlers that set it to true can share the store.
2. Instead of the agent coordinating master_addr/master_port values, the logic is now encapsulated by the *rdzv_handler*: `RendezvousInfo` carries a `RendezvousStoreInfo` object that handlers must return.
   - Depending on the implementation, handlers can either:
     - point to the existing store (and are then expected to set `use_agent_store` to true, see point 1). Client code relies on the `TORCHELASTIC_USE_AGENT_STORE` environment variable to know whether the store is shared.
     - build args that `torch.distributed.init_process_group` can use to bootstrap by creating a new store.
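
The client-side check described in the first option could look roughly like the sketch below. It assumes the agent exports the flag as the literal string `True`; that encoding, and the helper name `store_is_shared`, are illustrative assumptions rather than a documented PyTorch API.

```python
import os


def store_is_shared(env=None):
    """Return True when the environment signals that the agent store is shared.

    Assumption: the flag is exported as the literal string "True";
    any other value, or a missing variable, is treated as "not shared".
    """
    env = os.environ if env is None else env
    return env.get("TORCHELASTIC_USE_AGENT_STORE") == "True"


def describe_bootstrap(env=None):
    """Describe which of the two paths from the summary a client would take."""
    if store_is_shared(env):
        return "reuse the store provided by the agent"
    return "create a new store from the handler-provided args"
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise in isolation, for example `describe_bootstrap({"TORCHELASTIC_USE_AGENT_STORE": "True"})`.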
Additional points:
- When the TCPStore is shared, it should be wrapped in a PrefixStore to scope its key namespace for other use cases.
- The `next_rendezvous` signature changed to return a `RendezvousInfo` instance instead of a `(store, rank, world_size)` tuple, for extensibility.
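
Both points can be illustrated with a toy, standard-library-only sketch. `DictStore`, `PrefixedStore`, `RendezvousInfoSketch`, and `next_rendezvous_sketch` are hypothetical stand-ins, not the real `TCPStore`, `PrefixStore`, or `RendezvousInfo`; the `prefix/key` layout and the `extra` field are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


class DictStore:
    """Toy in-memory key-value store, standing in for a shared store."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


class PrefixedStore:
    """Toy wrapper that namespaces every key (assumed layout: 'prefix/key')."""

    def __init__(self, prefix, store):
        self._prefix = prefix
        self._store = store

    def set(self, key, value):
        self._store.set(f"{self._prefix}/{key}", value)

    def get(self, key):
        return self._store.get(f"{self._prefix}/{key}")


@dataclass
class RendezvousInfoSketch:
    """Named fields instead of a positional (store, rank, world_size) tuple."""

    store: PrefixedStore
    rank: int
    world_size: int
    # Hypothetical later addition: callers that read fields by name keep
    # working, whereas callers unpacking a 3-tuple would break.
    extra: dict = field(default_factory=dict)


def next_rendezvous_sketch(shared_store, run_id, rank, world_size):
    """Return a result object whose store is scoped to this run."""
    scoped = PrefixedStore(f"rdzv/{run_id}", shared_store)
    return RendezvousInfoSketch(store=scoped, rank=rank, world_size=world_size)


shared = DictStore()
info = next_rendezvous_sketch(shared, run_id="job-a", rank=0, world_size=2)
other = PrefixedStore("other", shared)

# The same logical key lives in two separate namespaces of one shared store.
info.store.set("status", b"ready")
other.set("status", b"idle")
```

Because each consumer writes under its own prefix, the two `status` keys do not collide even though they share one backing store, and code that reads `info.rank` or `info.world_size` is unaffected if more fields are added later.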
Why:
- Reduce moving parts
- Make it easier to swap implementations
- Improve tractability
- Addressing perf/debuggability will benefit all use cases
Test Plan: CI
Differential Revision: D57055235
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125743
Approved by: https://github.com/d4l3k
126 lines
2.8 KiB
ReStructuredText
.. _rendezvous-api:

Rendezvous
==========

.. automodule:: torch.distributed.elastic.rendezvous

Below is a state diagram describing how rendezvous works.

.. image:: etcd_rdzv_diagram.png

Registry
--------

.. autoclass:: RendezvousParameters
   :members:

.. autoclass:: RendezvousHandlerRegistry

.. automodule:: torch.distributed.elastic.rendezvous.registry

Handler
-------

.. currentmodule:: torch.distributed.elastic.rendezvous

.. autoclass:: RendezvousHandler
   :members:

Dataclasses
-----------

.. autoclass:: RendezvousInfo

.. currentmodule:: torch.distributed.elastic.rendezvous.api

.. autoclass:: RendezvousStoreInfo

   .. automethod:: build(rank, store)

Exceptions
----------

.. autoclass:: RendezvousError
.. autoclass:: RendezvousClosedError
.. autoclass:: RendezvousTimeoutError
.. autoclass:: RendezvousConnectionError
.. autoclass:: RendezvousStateError
.. autoclass:: RendezvousGracefulExitError

Implementations
---------------

Dynamic Rendezvous
******************

.. currentmodule:: torch.distributed.elastic.rendezvous.dynamic_rendezvous

.. autofunction:: create_handler

.. autoclass:: DynamicRendezvousHandler()
   :members: from_backend

.. autoclass:: RendezvousBackend
   :members:

.. autoclass:: RendezvousTimeout
   :members:

C10d Backend
^^^^^^^^^^^^

.. currentmodule:: torch.distributed.elastic.rendezvous.c10d_rendezvous_backend

.. autofunction:: create_backend

.. autoclass:: C10dRendezvousBackend
   :members:

Etcd Backend
^^^^^^^^^^^^

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_rendezvous_backend

.. autofunction:: create_backend

.. autoclass:: EtcdRendezvousBackend
   :members:

Etcd Rendezvous (Legacy)
************************

.. warning::
    The ``DynamicRendezvousHandler`` class supersedes the ``EtcdRendezvousHandler``
    class, and is recommended for most users. ``EtcdRendezvousHandler`` is in
    maintenance mode and will be deprecated in the future.

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_rendezvous

.. autoclass:: EtcdRendezvousHandler

Etcd Store
**********

The ``EtcdStore`` is the C10d ``Store`` instance type returned by
``next_rendezvous()`` when etcd is used as the rendezvous backend.

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_store

.. autoclass:: EtcdStore
   :members:

Etcd Server
***********

The ``EtcdServer`` is a convenience class that makes it easy for you to
start and stop an etcd server on a subprocess. This is useful for testing
or single-node (multi-worker) deployments where manually setting up an
etcd server on the side is cumbersome.

.. warning:: For production and multi-node deployments please consider
             properly deploying a highly available etcd server as this is
             the single point of failure for your distributed jobs.

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_server

.. autoclass:: EtcdServer