Quickstart
==========

To launch a **fault-tolerant** job, run the following on all nodes.

.. code-block:: bash

    torchrun
        --nnodes=NUM_NODES
        --nproc_per_node=TRAINERS_PER_NODE
        --max_restarts=NUM_ALLOWED_FAILURES
        --rdzv_id=JOB_ID
        --rdzv_backend=c10d
        --rdzv_endpoint=HOST_NODE_ADDR
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
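
For illustration only, a filled-in version of the command above for a
hypothetical job with 4 nodes and 8 workers per node (the job id, script name,
and script arguments below are placeholders) might look like:

.. code-block:: bash

    torchrun
        --nnodes=4
        --nproc_per_node=8
        --max_restarts=3
        --rdzv_id=my_training_job
        --rdzv_backend=c10d
        --rdzv_endpoint=node1.example.com:29400
        train.py --arg1 --arg2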

To launch an **elastic** job, run the following on at least ``MIN_SIZE`` nodes
and at most ``MAX_SIZE`` nodes.

.. code-block:: bash

    torchrun
        --nnodes=MIN_SIZE:MAX_SIZE
        --nproc_per_node=TRAINERS_PER_NODE
        --max_restarts=NUM_ALLOWED_FAILURES_OR_MEMBERSHIP_CHANGES
        --rdzv_id=JOB_ID
        --rdzv_backend=c10d
        --rdzv_endpoint=HOST_NODE_ADDR
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
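
Similarly, an illustrative elastic launch that can shrink to 1 node and grow to
4 nodes (again with placeholder values for the job id, endpoint, script, and
arguments) might look like:

.. code-block:: bash

    torchrun
        --nnodes=1:4
        --nproc_per_node=8
        --max_restarts=3
        --rdzv_id=my_training_job
        --rdzv_backend=c10d
        --rdzv_endpoint=node1.example.com:29400
        train.py --arg1 --arg2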

.. note::
   TorchElastic models failures as membership changes. When a node fails, this
   is treated as a "scale down" event. When the failed node is replaced by the
   scheduler, it is a "scale up" event. Hence, for both fault-tolerant and
   elastic jobs, ``--max_restarts`` controls the total number of restarts before
   giving up, regardless of whether the restart was caused by a failure or by a
   scaling event.

``HOST_NODE_ADDR``, in the form ``<host>[:<port>]`` (e.g. ``node1.example.com:29400``),
specifies the node and the port on which the C10d rendezvous backend should be
instantiated and hosted. It can be any node in your training cluster, but ideally
you should pick a node that has high bandwidth.

.. note::
   If no port number is specified, ``HOST_NODE_ADDR`` defaults to port 29400.
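
For instance, per the note above, the following two (fragmentary) endpoint
settings are equivalent, since the port falls back to 29400 when omitted:

.. code-block:: bash

    --rdzv_endpoint=node1.example.com          # default port 29400 is used
    --rdzv_endpoint=node1.example.com:29400    # port given explicitly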

.. note::
   The ``--standalone`` option can be passed to launch a single-node job with a
   sidecar rendezvous backend. You don’t have to pass ``--rdzv_id``,
   ``--rdzv_endpoint``, and ``--rdzv_backend`` when the ``--standalone`` option
   is used.
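
A minimal single-node sketch using ``--standalone`` (the worker count here is a
placeholder) might look like:

.. code-block:: bash

    torchrun
        --standalone
        --nnodes=1
        --nproc_per_node=8
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)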

.. note::
   Learn more about writing your distributed training script
   `here <train_script.html>`_.

If ``torchrun`` does not meet your requirements, you may use our APIs directly
for more powerful customization. Start by taking a look at the
`elastic agent <agent.html>`_ API.