Quickstart
===========

To launch a **fault-tolerant** job, run the following on all nodes.

.. code-block:: bash

    python -m torch.distributed.run \
        --nnodes=NUM_NODES \
        --nproc_per_node=TRAINERS_PER_NODE \
        --rdzv_id=JOB_ID \
        --rdzv_backend=c10d \
        --rdzv_endpoint=HOST_NODE_ADDR \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
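
For instance, a hypothetical two-node job with eight trainers per node might
look like the sketch below; the job id, host name, script name, and script
arguments are all illustrative placeholders, not defaults.

.. code-block:: bash

    # Hypothetical values: 2 nodes, 8 trainers each, an arbitrary job id,
    # and node1.example.com hosting the rendezvous endpoint on port 29400.
    python -m torch.distributed.run \
        --nnodes=2 \
        --nproc_per_node=8 \
        --rdzv_id=job_42 \
        --rdzv_backend=c10d \
        --rdzv_endpoint=node1.example.com:29400 \
        train.py --batch-size 32

The same command is run on every node; the rendezvous backend coordinates the
nodes into a single job.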

To launch an **elastic** job, run the following on at least ``MIN_SIZE`` nodes
and at most ``MAX_SIZE`` nodes.

.. code-block:: bash

    python -m torch.distributed.run \
        --nnodes=MIN_SIZE:MAX_SIZE \
        --nproc_per_node=TRAINERS_PER_NODE \
        --rdzv_id=JOB_ID \
        --rdzv_backend=c10d \
        --rdzv_endpoint=HOST_NODE_ADDR \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
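
As a concrete sketch, the same hypothetical job launched elastically, allowed
to shrink to one node or grow to four (all values below are illustrative):

.. code-block:: bash

    # Hypothetical values: the job stays alive as long as between
    # 1 and 4 nodes are participating.
    python -m torch.distributed.run \
        --nnodes=1:4 \
        --nproc_per_node=8 \
        --rdzv_id=job_42 \
        --rdzv_backend=c10d \
        --rdzv_endpoint=node1.example.com:29400 \
        train.py --batch-size 32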

``HOST_NODE_ADDR``, in the form ``<host>[:<port>]`` (e.g. ``node1.example.com:29400``),
specifies the node and the port on which the C10d rendezvous backend should be
instantiated and hosted. It can be any node in your training cluster, but
ideally you should pick a node with high bandwidth.

.. note::
   If no port number is specified, the port of ``HOST_NODE_ADDR`` defaults to
   29400.
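
For illustration, the two (hypothetical) endpoint values below are therefore
equivalent:

.. code-block:: bash

    # node1.example.com is an illustrative host name; the settings are
    # equivalent because 29400 is the default rendezvous port.
    --rdzv_endpoint=node1.example.com:29400
    --rdzv_endpoint=node1.example.com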

.. note::
   The ``--standalone`` option can be passed to launch a single-node job with
   a sidecar rendezvous backend. You don't have to pass ``--rdzv_id``,
   ``--rdzv_endpoint``, and ``--rdzv_backend`` when the ``--standalone``
   option is used.
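
A minimal single-node launch using ``--standalone`` might therefore look like
this sketch (the trainer count and script name are illustrative):

.. code-block:: bash

    # Hypothetical single-node job with 4 trainers; no rendezvous flags needed.
    python -m torch.distributed.run \
        --standalone \
        --nnodes=1 \
        --nproc_per_node=4 \
        train.py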

.. note::
   Learn more about writing your distributed training script
   `here <train_script.html>`_.

If ``torch.distributed.run`` does not meet your requirements, you may use our
APIs directly for more powerful customization. Start by taking a look at the
`elastic agent <agent.html>`_ API.