pytorch/docs/source/elastic/quickstart.rst
Kiuk Chung 3900509b7d (torchelastic) make --max_restarts explicit in the quickstart and runner docs (#65838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65838

closes https://github.com/pytorch/pytorch/pull/65675

The default `--max_restarts` for `torch.distributed.run` was changed to `0` from `3` to make things backwards compatible with `torch.distributed.launch`. Since the default `--max_restarts` used to be greater than `0` we never documented passing `--max_restarts` explicitly in any of our example code.

Test Plan: N/A doc change only

Reviewed By: d4l3k

Differential Revision: D31279544

fbshipit-source-id: 98b31e6a158371bc56907552c5c13958446716f9
2021-09-29 19:29:01 -07:00


Quickstart
===========
To launch a **fault-tolerant** job, run the following on all nodes.

.. code-block:: bash

    torchrun \
        --nnodes=NUM_NODES \
        --nproc_per_node=TRAINERS_PER_NODE \
        --max_restarts=NUM_ALLOWED_FAILURES \
        --rdzv_id=JOB_ID \
        --rdzv_backend=c10d \
        --rdzv_endpoint=HOST_NODE_ADDR \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

To launch an **elastic** job, run the following on at least ``MIN_SIZE`` nodes
and at most ``MAX_SIZE`` nodes.

.. code-block:: bash

    torchrun \
        --nnodes=MIN_SIZE:MAX_SIZE \
        --nproc_per_node=TRAINERS_PER_NODE \
        --max_restarts=NUM_ALLOWED_FAILURES_OR_MEMBERSHIP_CHANGES \
        --rdzv_id=JOB_ID \
        --rdzv_backend=c10d \
        --rdzv_endpoint=HOST_NODE_ADDR \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

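The two commands above differ only in the ``--nnodes`` value (a fixed count
versus a ``MIN_SIZE:MAX_SIZE`` range). The following is a minimal sketch of how
a launcher script might assemble either form. The helper name
``build_torchrun_command`` and its parameters are hypothetical and not part of
PyTorch; only the ``torchrun`` flags themselves come from the commands above.

```python
import shlex
from typing import List, Optional


def build_torchrun_command(
    script: str,
    nproc_per_node: int,
    max_restarts: int,
    rdzv_id: str,
    rdzv_endpoint: str,
    min_nodes: int,
    max_nodes: Optional[int] = None,
    script_args: Optional[List[str]] = None,
) -> List[str]:
    """Assemble a torchrun argv list (hypothetical helper, not a PyTorch API)."""
    if max_nodes is not None and max_nodes < min_nodes:
        raise ValueError("max_nodes must be greater than or equal to min_nodes")
    if max_nodes is None or max_nodes == min_nodes:
        nnodes = str(min_nodes)  # fault-tolerant job: fixed number of nodes
    else:
        nnodes = f"{min_nodes}:{max_nodes}"  # elastic job: MIN_SIZE:MAX_SIZE
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--max_restarts={max_restarts}",
        f"--rdzv_id={rdzv_id}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={rdzv_endpoint}",
        script,
        *(script_args or []),
    ]


if __name__ == "__main__":
    # Example values are placeholders chosen for illustration only.
    cmd = build_torchrun_command(
        script="YOUR_TRAINING_SCRIPT.py",
        nproc_per_node=8,
        max_restarts=3,
        rdzv_id="JOB_ID",
        rdzv_endpoint="node1.example.com:29400",
        min_nodes=2,
        max_nodes=4,
    )
    print(shlex.join(cmd))
```

Returning a list rather than a single string lets the result be passed directly
to ``subprocess.run`` without shell quoting concerns.
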
.. note::
   TorchElastic models failures as membership changes. When a node fails,
   this is treated as a "scale down" event. When the failed node is replaced by
   the scheduler, it is a "scale up" event. Hence, for both fault-tolerant
   and elastic jobs, ``--max_restarts`` controls the total number of
   restarts before giving up, regardless of whether a restart was caused
   by a failure or a scaling event.

``HOST_NODE_ADDR``, in the form ``<host>[:<port>]`` (e.g. ``node1.example.com:29400``),
specifies the node and the port on which the C10d rendezvous backend should be
instantiated and hosted. It can be any node in your training cluster, but
ideally you should pick a node that has high bandwidth.

.. note::
   If no port number is specified in ``HOST_NODE_ADDR``, the port defaults to ``29400``.

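As a rough illustration of the ``<host>[:<port>]`` form and the default port
described above, the sketch below splits an address into its two parts. The
function name ``parse_host_node_addr`` is hypothetical, and this is not how
``torchrun`` itself parses the endpoint. It also assumes a hostname or IPv4
address and does not handle bracketed IPv6 literals.

```python
from typing import Tuple

DEFAULT_RDZV_PORT = 29400  # default port stated in this quickstart


def parse_host_node_addr(addr: str) -> Tuple[str, int]:
    """Split '<host>[:<port>]' into (host, port); hypothetical helper."""
    host, sep, port = addr.strip().partition(":")
    if not host:
        raise ValueError(f"missing host in {addr!r}")
    if not sep:
        # No ':' present, so fall back to the documented default port.
        return host, DEFAULT_RDZV_PORT
    if not port.isdigit():
        raise ValueError(f"invalid port in {addr!r}")
    return host, int(port)


if __name__ == "__main__":
    print(parse_host_node_addr("node1.example.com:29400"))
    print(parse_host_node_addr("node1.example.com"))
```
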
.. note::
   The ``--standalone`` option can be passed to launch a single-node job with a
   sidecar rendezvous backend. You don't have to pass ``--rdzv_id``,
   ``--rdzv_endpoint``, and ``--rdzv_backend`` when the ``--standalone`` option
   is used.

.. note::
   Learn more about writing your distributed training script
   `here <train_script.html>`_.

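For orientation, a training script launched by ``torchrun`` can discover its
place in the job from the ``RANK``, ``LOCAL_RANK``, and ``WORLD_SIZE``
environment variables that ``torchrun`` sets for each worker. The sketch below
only reads those variables; the ``WorkerInfo`` type and ``read_worker_info``
function are hypothetical names, and the fallback values for a run outside
``torchrun`` are an assumption made for illustration.

```python
import os
from typing import Mapping, NamedTuple


class WorkerInfo(NamedTuple):
    rank: int        # global rank of this worker across all nodes
    local_rank: int  # rank of this worker on its own node
    world_size: int  # total number of workers in the job


def read_worker_info(environ: Mapping[str, str] = os.environ) -> WorkerInfo:
    """Read the worker identity from the environment (hypothetical helper)."""
    return WorkerInfo(
        rank=int(environ.get("RANK", "0")),
        local_rank=int(environ.get("LOCAL_RANK", "0")),
        # Assumed fallback: behave like a single-worker job when unset.
        world_size=int(environ.get("WORLD_SIZE", "1")),
    )


if __name__ == "__main__":
    info = read_worker_info()
    print(f"worker {info.rank} of {info.world_size} (local rank {info.local_rank})")
```
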
If ``torchrun`` does not meet your requirements, you may use our APIs directly
for more powerful customization. Start by taking a look at the
`elastic agent <agent.html>`_ API.