pytorch/docs/source/elastic/quickstart.rst
Can Balioglu 65e6194aeb Introduce the torchrun entrypoint (#64049)
Summary:
This PR introduces a new `torchrun` entrypoint that simply "points" to `python -m torch.distributed.run`. It is shorter and less error-prone to type and gives a nicer syntax than a rather cryptic `python -m ...` command line. Along with the new entrypoint the documentation is also updated and places where `torch.distributed.run` are mentioned are replaced with `torchrun`.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64049

Reviewed By: cbalioglu

Differential Revision: D30584041

Pulled By: kiukchung

fbshipit-source-id: d99db3b5d12e7bf9676bab70e680d4b88031ae2d
2021-08-26 20:17:48 -07:00

52 lines
1.6 KiB
ReStructuredText
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

Quickstart
===========
To launch a **fault-tolerant** job, run the following on all nodes.
.. code-block:: bash
torchrun
--nnodes=NUM_NODES
--nproc_per_node=TRAINERS_PER_NODE
--rdzv_id=JOB_ID
--rdzv_backend=c10d
--rdzv_endpoint=HOST_NODE_ADDR
YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
To launch an **elastic** job, run the following on at least ``MIN_SIZE`` nodes
and at most ``MAX_SIZE`` nodes.
.. code-block:: bash
torchrun
--nnodes=MIN_SIZE:MAX_SIZE
--nproc_per_node=TRAINERS_PER_NODE
--rdzv_id=JOB_ID
--rdzv_backend=c10d
--rdzv_endpoint=HOST_NODE_ADDR
YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
``HOST_NODE_ADDR``, in form <host>[:<port>] (e.g. node1.example.com:29400),
specifies the node and the port on which the C10d rendezvous backend should be
instantiated and hosted. It can be any node in your training cluster, but
ideally you should pick a node that has a high bandwidth.
.. note::
If no port number is specified ``HOST_NODE_ADDR`` defaults to 29400.
.. note::
The ``--standalone`` option can be passed to launch a single node job with a
sidecar rendezvous backend. You dont have to pass ``--rdzv_id``,
``--rdzv_endpoint``, and ``--rdzv_backend`` when the ``--standalone`` option
is used.
.. note::
Learn more about writing your distributed training script
`here <train_script.html>`_.
If ``torchrun`` does not meet your requirements you may use our APIs directly
for more powerful customization. Start by taking a look at the
`elastic agent <agent.html>`_ API.