Introduce the torchrun entrypoint (#64049)

Summary:
This PR introduces a new `torchrun` entrypoint that simply "points" to `python -m torch.distributed.run`. It is shorter and less error-prone to type, and reads better than the rather cryptic `python -m ...` command line. Along with the new entrypoint, the documentation is updated to replace mentions of `torch.distributed.run` with `torchrun`.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64049

Reviewed By: cbalioglu

Differential Revision: D30584041

Pulled By: kiukchung

fbshipit-source-id: d99db3b5d12e7bf9676bab70e680d4b88031ae2d
Authored by Can Balioglu on 2021-08-26 20:16:10 -07:00, committed by Facebook GitHub Bot
parent 510d2ece81
commit 65e6194aeb
6 changed files with 45 additions and 44 deletions


@@ -5,13 +5,13 @@ To launch a **fault-tolerant** job, run the following on all nodes.

 .. code-block:: bash

-    python -m torch.distributed.run
+    torchrun
         --nnodes=NUM_NODES
         --nproc_per_node=TRAINERS_PER_NODE
         --rdzv_id=JOB_ID
         --rdzv_backend=c10d
         --rdzv_endpoint=HOST_NODE_ADDR
         YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

 To launch an **elastic** job, run the following on at least ``MIN_SIZE`` nodes
@@ -19,13 +19,13 @@ and at most ``MAX_SIZE`` nodes.

 .. code-block:: bash

-    python -m torch.distributed.run
+    torchrun
         --nnodes=MIN_SIZE:MAX_SIZE
         --nproc_per_node=TRAINERS_PER_NODE
         --rdzv_id=JOB_ID
         --rdzv_backend=c10d
         --rdzv_endpoint=HOST_NODE_ADDR
         YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

 ``HOST_NODE_ADDR``, in form <host>[:<port>] (e.g. node1.example.com:29400),
 specifies the node and the port on which the C10d rendezvous backend should be
@@ -46,6 +46,6 @@ ideally you should pick a node that has a high bandwidth.

 Learn more about writing your distributed training script
 `here <train_script.html>`_.

-If ``torch.distributed.run`` does not meet your requirements you may use our
-APIs directly for more powerful customization. Start by taking a look at the
-`elastic agent <agent.html>`_ API).
+If ``torchrun`` does not meet your requirements you may use our APIs directly
+for more powerful customization. Start by taking a look at the
+`elastic agent <agent.html>`_ API.
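For context on what these launch commands expect of the user script: a script started by either spelling of the launcher receives its rendezvous results through environment variables. Below is a minimal, illustrative sketch of such a train script; it is not part of this diff and assumes only the standard variables the launcher exports (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`, `LOCAL_RANK`).

```python
# minimal_train_script.py -- illustrative sketch, not code from this PR.
import os

import torch.distributed as dist


def main():
    # init_method defaults to "env://", so the MASTER_ADDR/MASTER_PORT/
    # RANK/WORLD_SIZE values set by the launcher are picked up automatically.
    # "gloo" is used here only so the sketch runs on CPU-only machines.
    dist.init_process_group(backend="gloo")
    local_rank = int(os.environ["LOCAL_RANK"])
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} "
          f"(local_rank={local_rank}) initialized")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```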


@@ -1,6 +1,6 @@
 .. _launcher-api:

-torch.distributed.run (Elastic Launch)
+torchrun (Elastic Launch)
 ======================================

 .. automodule:: torch.distributed.run


@@ -4,7 +4,7 @@ Train script
 -------------

 If your train script works with ``torch.distributed.launch`` it will continue
-working with ``torch.distributed.run`` with these differences:
+working with ``torchrun`` with these differences:

 1. No need to manually pass ``RANK``, ``WORLD_SIZE``,
    ``MASTER_ADDR``, and ``MASTER_PORT``.


@@ -854,6 +854,7 @@ def configure_extension_build():
         'console_scripts': [
             'convert-caffe2-to-onnx = caffe2.python.onnx.bin.conversion:caffe2_to_onnx',
             'convert-onnx-to-caffe2 = caffe2.python.onnx.bin.conversion:onnx_to_caffe2',
+            'torchrun = torch.distributed.run:main',
         ]
     }
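This one-line `console_scripts` entry is the entire mechanism behind the new command: at install time setuptools generates a small executable wrapper that imports and calls `torch.distributed.run:main`, which is why `torchrun` and `python -m torch.distributed.run` behave identically. The generated wrapper is roughly equivalent to the sketch below (illustrative; the file setuptools actually writes differs cosmetically).

```python
# Rough sketch of the console script generated for
# 'torchrun = torch.distributed.run:main'.
import sys

from torch.distributed.run import main

if __name__ == "__main__":
    sys.exit(main())
```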


@@ -4,7 +4,7 @@ training processes on each of the training nodes.

 .. warning::

-    This module is going to be deprecated in favor of :ref:`torch.distributed.run <launcher-api>`.
+    This module is going to be deprecated in favor of :ref:`torchrun <launcher-api>`.

 The utility can be used for single-node distributed training, in which one or
 more processes per node will be spawned. The utility can be used for either
@@ -177,8 +177,8 @@ def launch(args):
 def main(args=None):
     warnings.warn(
         "The module torch.distributed.launch is deprecated\n"
-        "and will be removed in future. Use torch.distributed.run.\n"
-        "Note that --use_env is set by default in torch.distributed.run.\n"
+        "and will be removed in future. Use torchrun.\n"
+        "Note that --use_env is set by default in torchrun.\n"
         "If your script expects `--local_rank` argument to be set, please\n"
         "change it to read from `os.environ['LOCAL_RANK']` instead. See \n"
         "https://pytorch.org/docs/stable/distributed.html#launch-utility for \n"


@@ -7,7 +7,7 @@
 # LICENSE file in the root directory of this source tree.
 """
-``torch.distributed.run`` provides a superset of the functionality as ``torch.distributed.launch``
+``torchrun`` provides a superset of the functionality as ``torch.distributed.launch``
 with the following additional functionalities:

 1. Worker failures are handled gracefully by restarting all workers.
@@ -18,33 +18,33 @@ with the following additional functionalities:

-Transitioning from torch.distributed.launch to torch.distributed.run
+Transitioning from torch.distributed.launch to torchrun
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-``torch.distributed.run`` supports the same arguments as ``torch.distributed.launch`` **except**
+``torchrun`` supports the same arguments as ``torch.distributed.launch`` **except**
 for ``--use_env`` which is now deprecated. To migrate from ``torch.distributed.launch``
-to ``torch.distributed.run`` follow these steps:
+to ``torchrun`` follow these steps:

 1. If your training script is already reading ``local_rank`` from the ``LOCAL_RANK`` environment variable.
    Then you need simply omit the ``--use_env`` flag, e.g.:

-   +--------------------------------------------------------------------+-------------------------------------------------------+
-   | ``torch.distributed.launch``                                       | ``torch.distributed.run``                             |
-   +====================================================================+=======================================================+
-   |                                                                    |                                                       |
-   | .. code-block:: shell-session                                      | .. code-block:: shell-session                         |
-   |                                                                    |                                                       |
-   |     $ python -m torch.distributed.launch --use_env train_script.py |     $ python -m torch.distributed.run train_script.py |
-   |                                                                    |                                                       |
-   +--------------------------------------------------------------------+-------------------------------------------------------+
+   +--------------------------------------------------------------------+---------------------------------+
+   | ``torch.distributed.launch``                                       | ``torchrun``                    |
+   +====================================================================+=================================+
+   |                                                                    |                                 |
+   | .. code-block:: shell-session                                      | .. code-block:: shell-session   |
+   |                                                                    |                                 |
+   |     $ python -m torch.distributed.launch --use_env train_script.py |     $ torchrun train_script.py  |
+   |                                                                    |                                 |
+   +--------------------------------------------------------------------+---------------------------------+

 2. If your training script reads local rank from a ``--local_rank`` cmd argument.
    Change your training script to read from the ``LOCAL_RANK`` environment variable as
    demonstrated by the following code snippet:

    +-------------------------------------------------------+----------------------------------------------------+
-   | ``torch.distributed.launch``                          | ``torch.distributed.run``                          |
+   | ``torch.distributed.launch``                          | ``torchrun``                                       |
    +=======================================================+====================================================+
    |                                                       |                                                    |
    | .. code-block:: python                                | .. code-block:: python                             |
@@ -59,12 +59,12 @@ to ``torch.distributed.run`` follow these steps:
    |                                                       |                                                    |
    +-------------------------------------------------------+----------------------------------------------------+

-The aformentioned changes suffice to migrate from ``torch.distributed.launch`` to ``torch.distributed.run``.
-To take advantage of new features such as elasticity, fault-tolerance, and error reporting of ``torch.distributed.run``
+The aformentioned changes suffice to migrate from ``torch.distributed.launch`` to ``torchrun``.
+To take advantage of new features such as elasticity, fault-tolerance, and error reporting of ``torchrun``
 please refer to:

-* :ref:`elastic_train_script` for more information on authoring training scripts that are ``torch.distributed.run`` compliant.
-* the rest of this page for more information on the features of ``torch.distributed.run``.
+* :ref:`elastic_train_script` for more information on authoring training scripts that are ``torchrun`` compliant.
+* the rest of this page for more information on the features of ``torchrun``.
@@ -75,7 +75,7 @@ Usage

 ::

-    >>> python -m torch.distributed.run
+    >>> torchrun
         --standalone
         --nnodes=1
         --nproc_per_node=$NUM_TRAINERS
@@ -85,7 +85,7 @@ Usage

 ::

-    >>> python -m torch.distributed.run
+    >>> torchrun
         --nnodes=$NUM_NODES
         --nproc_per_node=$NUM_TRAINERS
         --rdzv_id=$JOB_ID
@@ -104,7 +104,7 @@ node in your training cluster, but ideally you should pick a node that has a high

 ::

-    >>> python -m torch.distributed.run
+    >>> torchrun
         --nnodes=1:4
         --nproc_per_node=$NUM_TRAINERS
         --rdzv_id=$JOB_ID
@@ -186,7 +186,7 @@ The following environment variables are made available to you in your script:
    of the worker is specified in the ``WorkerSpec``.

 5. ``LOCAL_WORLD_SIZE`` - The local world size (e.g. number of workers running locally); equals to
-   ``--nproc_per_node`` specified on ``torch.distributed.run``.
+   ``--nproc_per_node`` specified on ``torchrun``.

 6. ``WORLD_SIZE`` - The world size (total number of workers in the job).
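Taken together with the variables named earlier in this PR (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `LOCAL_RANK`), these items describe the per-worker environment `torchrun` constructs. A small sketch for inspecting it at worker startup (illustrative only; it uses just the variable names that appear in this diff):

```python
# Dump the launcher-provided environment described in the docs above.
import os

for name in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "LOCAL_WORLD_SIZE",
             "MASTER_ADDR", "MASTER_PORT"):
    print(f"{name}={os.environ.get(name, '<unset>')}")
```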