pytorch/docs/source/elastic/rendezvous.rst
Kurman Karabukaev d62b025efc [TorchElastic] Option for sharing TCPStore created by rdzv handlers (#125743)
Summary:

1. Define an explicit `use_agent_store` on rdzv handlers. Handlers that set it to true can share the store.
2. Instead of the agent coordinating the master_addr/master_port values, the logic is now encapsulated by the *rdzv_handler*: the `RendezvousInfo` that a handler returns must carry a `RendezvousStoreInfo` object.
    - Depending on the implementation, a handler can either:
         - point to an existing store (and is then expected to set `use_agent_store` to true, see point 1). Client code relies on the `TORCHELASTIC_USE_AGENT_STORE` env variable to know whether the store is shared.
         - build args from which `torch.distributed.init_process_group` can bootstrap by creating a new store.
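
The two points above can be sketched with plain Python stand-ins. This is only an illustration under stated assumptions, not the actual TorchElastic code: the `Sketch*` classes, the `worker_env` helper, and the exact string written to `TORCHELASTIC_USE_AGENT_STORE` are hypothetical, and a dict plays the role of the store.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SketchStoreInfo:
    """Hypothetical stand-in for the store bootstrap data a handler returns."""
    master_addr: str
    master_port: int


@dataclass
class SketchRendezvousInfo:
    """Hypothetical stand-in for the object returned by next_rendezvous."""
    store: Dict[str, bytes]  # a plain dict plays the role of the store
    rank: int
    world_size: int
    bootstrap_store_info: SketchStoreInfo


class SketchHandler:
    """Hypothetical handler that owns the address/port decision itself."""

    def __init__(self, use_agent_store: bool, addr: str, port: int) -> None:
        self.use_agent_store = use_agent_store
        self._store: Dict[str, bytes] = {}
        self._addr = addr
        self._port = port

    def next_rendezvous(self) -> SketchRendezvousInfo:
        # Single-node placeholder values: rank 0 in a world of size 1.
        store_info = SketchStoreInfo(self._addr, self._port)
        return SketchRendezvousInfo(self._store, 0, 1, store_info)


def worker_env(handler: SketchHandler) -> Dict[str, str]:
    """Derive the worker environment from what the handler returned."""
    info = handler.next_rendezvous()
    return {
        "MASTER_ADDR": info.bootstrap_store_info.master_addr,
        "MASTER_PORT": str(info.bootstrap_store_info.master_port),
        # Assumed encoding of the flag; the real value format may differ.
        "TORCHELASTIC_USE_AGENT_STORE": str(handler.use_agent_store),
    }
```

In this sketch the agent side only reads values out of the returned object, so swapping the handler implementation does not require touching the code that prepares the worker environment.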

Additional points:

- When the TCPStore is shared, it should be wrapped in a PrefixStore to scope its key namespace for other use cases.
- The `next_rendezvous` signature changed to return an instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple, for extensibility.
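
Why a shared store needs prefix scoping can be shown with a small stdlib-only sketch. The `DictStore` and `PrefixedStore` classes below are hypothetical stand-ins, not the c10d `TCPStore` or `PrefixStore`, and the `prefix/key` joining scheme is an assumption made for illustration.

```python
from typing import Dict


class DictStore:
    """Hypothetical in-memory stand-in for a shared key-value store."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data[key]


class PrefixedStore:
    """Hypothetical wrapper that namespaces every key with a prefix."""

    def __init__(self, prefix: str, store: DictStore) -> None:
        self._prefix = prefix
        self._store = store

    def _qualify(self, key: str) -> str:
        # Assumed joining scheme: "<prefix>/<key>".
        return f"{self._prefix}/{key}"

    def set(self, key: str, value: bytes) -> None:
        self._store.set(self._qualify(key), value)

    def get(self, key: str) -> bytes:
        return self._store.get(self._qualify(key))


# Two consumers share one backing store without clobbering each other.
shared = DictStore()
agent_view = PrefixedStore("agent", shared)
worker_view = PrefixedStore("worker", shared)
agent_view.set("status", b"ready")
worker_view.set("status", b"training")
```

Both views write the key `status`, yet each reads back its own value because the wrapper qualifies the key before it reaches the shared store.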

Why:
- Reduce moving parts
   - easier to swap implementation
   - improve tractability
   - addressing perf/debug-ability will benefit all use cases
Test Plan: CI

Differential Revision: D57055235

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125743
Approved by: https://github.com/d4l3k
2024-05-22 18:24:11 +00:00


.. _rendezvous-api:

Rendezvous
==========

.. automodule:: torch.distributed.elastic.rendezvous

Below is a state diagram describing how rendezvous works.

.. image:: etcd_rdzv_diagram.png

Registry
--------

.. autoclass:: RendezvousParameters
   :members:

.. autoclass:: RendezvousHandlerRegistry

.. automodule:: torch.distributed.elastic.rendezvous.registry

Handler
-------

.. currentmodule:: torch.distributed.elastic.rendezvous

.. autoclass:: RendezvousHandler
   :members:

Dataclasses
-----------

.. autoclass:: RendezvousInfo

.. currentmodule:: torch.distributed.elastic.rendezvous.api

.. autoclass:: RendezvousStoreInfo

   .. automethod:: build(rank, store)

Exceptions
----------

.. autoclass:: RendezvousError
.. autoclass:: RendezvousClosedError
.. autoclass:: RendezvousTimeoutError
.. autoclass:: RendezvousConnectionError
.. autoclass:: RendezvousStateError
.. autoclass:: RendezvousGracefulExitError

Implementations
---------------

Dynamic Rendezvous
******************

.. currentmodule:: torch.distributed.elastic.rendezvous.dynamic_rendezvous

.. autofunction:: create_handler

.. autoclass:: DynamicRendezvousHandler()
   :members: from_backend

.. autoclass:: RendezvousBackend
   :members:

.. autoclass:: RendezvousTimeout
   :members:

C10d Backend
^^^^^^^^^^^^

.. currentmodule:: torch.distributed.elastic.rendezvous.c10d_rendezvous_backend

.. autofunction:: create_backend

.. autoclass:: C10dRendezvousBackend
   :members:

Etcd Backend
^^^^^^^^^^^^

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_rendezvous_backend

.. autofunction:: create_backend

.. autoclass:: EtcdRendezvousBackend
   :members:

Etcd Rendezvous (Legacy)
************************

.. warning::
    The ``DynamicRendezvousHandler`` class supersedes the ``EtcdRendezvousHandler``
    class, and is recommended for most users. ``EtcdRendezvousHandler`` is in
    maintenance mode and will be deprecated in the future.

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_rendezvous

.. autoclass:: EtcdRendezvousHandler

Etcd Store
**********

The ``EtcdStore`` is the C10d ``Store`` instance type returned by
``next_rendezvous()`` when etcd is used as the rendezvous backend.

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_store

.. autoclass:: EtcdStore
   :members:

Etcd Server
***********

The ``EtcdServer`` is a convenience class that makes it easy for you to
start and stop an etcd server on a subprocess. This is useful for testing
or single-node (multi-worker) deployments where manually setting up an
etcd server on the side is cumbersome.

.. warning:: For production and multi-node deployments please consider
             properly deploying a highly available etcd server as this is
             the single point of failure for your distributed jobs.

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_server

.. autoclass:: EtcdServer