Prefer dashes over underscores in command-line options. Add `--command-arg-name` to the argument parser. The old underscore arguments (`--command_arg_name`) are kept for backward compatibility.
Both dashes and underscores are used in the PyTorch codebase: some argument parsers accept only dashed arguments, others only underscored ones. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they appear to be the default choice in the Python standard library:
`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)
```python
class BooleanOptionalAction(Action):
    def __init__(...):
        if option_string.startswith('--'):
            option_string = '--no-' + option_string[2:]
            _option_strings.append(option_string)
```
It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the Shift (or Caps Lock) key, unlike `-`.
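As a rough sketch of the backward-compatible pattern (the `--master-port` option below is only an illustration, not the exact code from this PR), both spellings can be registered on the same `argparse` argument so that existing scripts keep working:

```python
import argparse

parser = argparse.ArgumentParser()

# The dashed form is the documented one; the underscored form is kept as an
# alias for backward compatibility. Both map to the same destination.
parser.add_argument(
    "--master-port", "--master_port",
    type=int,
    default=29400,
    dest="master_port",
    help="Port used by the master node.",
)

args = parser.parse_args(["--master_port", "29500"])
print(args.master_port)  # -> 29500
```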
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
Quickstart
===========

To launch a **fault-tolerant** job, run the following on all nodes.

.. code-block:: bash

    torchrun
        --nnodes=NUM_NODES
        --nproc-per-node=TRAINERS_PER_NODE
        --max-restarts=NUM_ALLOWED_FAILURES
        --rdzv-id=JOB_ID
        --rdzv-backend=c10d
        --rdzv-endpoint=HOST_NODE_ADDR
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

To launch an **elastic** job, run the following on at least ``MIN_SIZE`` nodes
and at most ``MAX_SIZE`` nodes.

.. code-block:: bash

    torchrun
        --nnodes=MIN_SIZE:MAX_SIZE
        --nproc-per-node=TRAINERS_PER_NODE
        --max-restarts=NUM_ALLOWED_FAILURES_OR_MEMBERSHIP_CHANGES
        --rdzv-id=JOB_ID
        --rdzv-backend=c10d
        --rdzv-endpoint=HOST_NODE_ADDR
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

.. note::
   TorchElastic models failures as membership changes. When a node fails,
   this is treated as a "scale down" event. When the failed node is replaced by
   the scheduler, it is a "scale up" event. Hence, for both fault-tolerant
   and elastic jobs, ``--max-restarts`` controls the total number of
   restarts before giving up, regardless of whether the restart was caused
   by a failure or a scaling event.

``HOST_NODE_ADDR``, in the form <host>[:<port>] (e.g. node1.example.com:29400),
specifies the node and the port on which the C10d rendezvous backend should be
instantiated and hosted. It can be any node in your training cluster, but
ideally you should pick a node that has high bandwidth.

.. note::
   If no port number is specified, ``HOST_NODE_ADDR`` defaults to 29400.

.. note::
   The ``--standalone`` option can be passed to launch a single node job with a
   sidecar rendezvous backend. You don't have to pass ``--rdzv-id``,
   ``--rdzv-endpoint``, and ``--rdzv-backend`` when the ``--standalone`` option
   is used.
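
For example, a minimal single-node launch with ``--standalone`` could look like
the sketch below (``TRAINERS_PER_NODE`` is the same placeholder used in the
commands above):

.. code-block:: bash

    torchrun
        --standalone
        --nnodes=1
        --nproc-per-node=TRAINERS_PER_NODE
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)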

.. note::
   Learn more about writing your distributed training script
   `here <train_script.html>`_.

If ``torchrun`` does not meet your requirements, you may use our APIs directly
for more powerful customization. Start by taking a look at the
`elastic agent <agent.html>`_ API.
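
As a purely illustrative sketch of what launching a job from Python (rather
than the ``torchrun`` CLI) can look like, the snippet below uses the launcher
API in ``torch.distributed.launcher.api``. The ``trainer`` function and every
configuration value here are made-up placeholders, and the exact
``LaunchConfig`` fields and the return value of ``elastic_launch`` should be
verified against the API documentation.

.. code-block:: python

    # Illustrative only: a programmatic equivalent of a small torchrun launch.
    from torch.distributed.launcher.api import LaunchConfig, elastic_launch


    def trainer(lr: float) -> float:
        # Per-worker entrypoint; runs once in every spawned worker process.
        return lr


    if __name__ == "__main__":
        config = LaunchConfig(
            min_nodes=1,
            max_nodes=1,
            nproc_per_node=2,                 # workers on this node
            run_id="quickstart_job",          # analogous to --rdzv-id
            rdzv_backend="c10d",              # analogous to --rdzv-backend
            rdzv_endpoint="localhost:29400",  # analogous to --rdzv-endpoint
            max_restarts=3,                   # analogous to --max-restarts
        )
        # Maps each worker's rank to the value returned by trainer().
        results = elastic_launch(config, trainer)(0.01)
        print(results)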