.. _rendezvous-api:

Rendezvous
==========

.. automodule:: torch.distributed.elastic.rendezvous

Below is a state diagram describing how rendezvous works.

.. image:: etcd_rdzv_diagram.png

Registry
--------

.. autoclass:: RendezvousParameters
   :members:

.. autoclass:: RendezvousHandlerRegistry
   :members:

.. automodule:: torch.distributed.elastic.rendezvous.registry

Handler
-------

.. currentmodule:: torch.distributed.elastic.rendezvous

.. autoclass:: RendezvousHandler
   :members:

Exceptions
----------
.. autoclass:: RendezvousError
.. autoclass:: RendezvousClosedError
.. autoclass:: RendezvousTimeoutError
.. autoclass:: RendezvousConnectionError
.. autoclass:: RendezvousStateError

Implementations
---------------

Dynamic Rendezvous
******************

.. currentmodule:: torch.distributed.elastic.rendezvous.dynamic_rendezvous

.. autofunction:: create_handler

.. autoclass:: DynamicRendezvousHandler()
   :members: from_backend

.. autoclass:: RendezvousBackend
   :members:

.. autoclass:: RendezvousTimeout
   :members:

C10d Backend
^^^^^^^^^^^^

.. currentmodule:: torch.distributed.elastic.rendezvous.c10d_rendezvous_backend

.. autofunction:: create_backend

.. autoclass:: C10dRendezvousBackend
   :members:

Etcd Backend
^^^^^^^^^^^^

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_rendezvous_backend

.. autofunction:: create_backend

.. autoclass:: EtcdRendezvousBackend
   :members:

Etcd Rendezvous (Legacy)
************************

.. warning::
   The ``DynamicRendezvousHandler`` class supersedes the ``EtcdRendezvousHandler``
   class, and is recommended for most users. ``EtcdRendezvousHandler`` is in
   maintenance mode and will be deprecated in the future.

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_rendezvous

.. autoclass:: EtcdRendezvousHandler

Etcd Store
**********

The ``EtcdStore`` is the C10d ``Store`` instance type returned by
``next_rendezvous()`` when etcd is used as the rendezvous backend.

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_store

.. autoclass:: EtcdStore
   :members:

Etcd Server
***********

The ``EtcdServer`` is a convenience class that makes it easy to start and
stop an etcd server in a subprocess. This is useful for testing or for
single-node (multi-worker) deployments where manually setting up a separate
etcd server is cumbersome.

.. warning:: For production and multi-node deployments, consider deploying
             a highly available etcd server, because the etcd server is a
             single point of failure for your distributed jobs.

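
The start/stop cycle described above can be sketched roughly as follows. This
is a minimal sketch, not a definitive recipe: it assumes an ``etcd`` binary is
available on ``PATH`` and that the etcd rendezvous dependencies are installed.
The helper name ``run_with_local_etcd`` is hypothetical, and the sketch returns
``None`` instead of failing when either prerequisite is missing. Stopping the
server in a ``finally`` clause keeps the subprocess from outliving the caller.

.. code-block:: python

   import shutil


   def run_with_local_etcd():
       """Start a throwaway etcd server, return its endpoint, then stop it.

       Returns None when etcd or the etcd rendezvous modules are unavailable.
       """
       # Assumption: EtcdServer launches the ``etcd`` binary found on PATH,
       # so skip the demonstration when no such binary is installed.
       if shutil.which("etcd") is None:
           return None
       try:
           from torch.distributed.elastic.rendezvous.etcd_server import EtcdServer
       except ImportError:
           return None

       server = EtcdServer()  # uses a temporary data directory by default
       server.start()         # spawns etcd in a subprocess
       try:
           # "host:port" string that an etcd rendezvous endpoint can point at
           return server.get_endpoint()
       finally:
           server.stop()      # always terminate the subprocess


   endpoint = run_with_local_etcd()
   print("etcd endpoint:", endpoint)
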
.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_server

.. autoclass:: EtcdServer