Quickstart
===========

.. code-block:: bash

   pip install torch

   # start a single-node etcd server on ONE host
   etcd --enable-v2 \
        --listen-client-urls http://0.0.0.0:2379,http://127.0.0.1:4001 \
        --advertise-client-urls PUBLIC_HOSTNAME:2379

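Before launching any jobs, you may want to confirm that the etcd endpoint is
reachable from every node. A minimal sketch, assuming the client URL
configured above (``PUBLIC_HOSTNAME`` and the helper script itself are
placeholders, not part of torch):

.. code-block:: python

   # check_etcd.py -- hypothetical connectivity check
   import json
   import urllib.request

   ETCD_URL = "http://PUBLIC_HOSTNAME:2379"  # your etcd client URL

   # etcd serves its version info over plain HTTP at /version
   with urllib.request.urlopen(ETCD_URL + "/version", timeout=5) as resp:
       print(json.load(resp))
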
To launch a **fault-tolerant** job, run the following on all nodes.

.. code-block:: bash

   python -m torch.distributed.run \
        --nnodes=NUM_NODES \
        --nproc_per_node=TRAINERS_PER_NODE \
        --rdzv_id=JOB_ID \
        --rdzv_backend=etcd \
        --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

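Here ``YOUR_TRAINING_SCRIPT.py`` stands in for your own entry point. As a
minimal sketch (the script below is hypothetical): ``torch.distributed.run``
exports ``RANK``, ``WORLD_SIZE``, ``LOCAL_RANK``, ``MASTER_ADDR``, and
``MASTER_PORT`` to each worker, so the default ``env://`` initialization of
``init_process_group`` needs no extra arguments:

.. code-block:: python

   # your_training_script.py -- hypothetical stand-in for YOUR_TRAINING_SCRIPT.py
   import torch
   import torch.distributed as dist

   def main():
       # env:// picks up MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE
       # set by torch.distributed.run
       dist.init_process_group(backend="gloo")  # "nccl" for GPU training

       # all-reduce a tensor as a smoke test of the process group
       t = torch.ones(1)
       dist.all_reduce(t)
       print(f"rank {dist.get_rank()}/{dist.get_world_size()}: sum = {t.item()}")

       dist.destroy_process_group()

   if __name__ == "__main__":
       main()
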
To launch an **elastic** job, run the following on at least ``MIN_SIZE`` nodes
and at most ``MAX_SIZE`` nodes.

.. code-block:: bash

   python -m torch.distributed.run \
        --nnodes=MIN_SIZE:MAX_SIZE \
        --nproc_per_node=TRAINERS_PER_NODE \
        --rdzv_id=JOB_ID \
        --rdzv_backend=etcd \
        --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

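When membership changes in an elastic job (a node joins or leaves), all
workers are restarted, so the training script should save checkpoints
periodically and resume from the latest one on startup. A minimal sketch of
that pattern (the path and state layout are hypothetical; see the training
script notes below):

.. code-block:: python

   import os

   import torch

   CKPT_PATH = "/shared/checkpoint.pt"  # hypothetical path on shared storage

   def load_checkpoint(model, optimizer):
       # on an elastic restart the script re-executes from the top,
       # so resume from the latest checkpoint if one exists
       if os.path.exists(CKPT_PATH):
           state = torch.load(CKPT_PATH, map_location="cpu")
           model.load_state_dict(state["model"])
           optimizer.load_state_dict(state["optim"])
           return state["epoch"] + 1  # next epoch to run
       return 0  # fresh start

   def save_checkpoint(model, optimizer, epoch):
       torch.save(
           {"model": model.state_dict(),
            "optim": optimizer.state_dict(),
            "epoch": epoch},
           CKPT_PATH,
       )
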
.. note:: The ``--standalone`` option can be passed to launch a single-node job with
   a sidecar rendezvous server. You do not have to pass ``--rdzv_id``, ``--rdzv_endpoint``,
   and ``--rdzv_backend`` when the ``--standalone`` option is used.

.. note:: Learn more about writing your distributed training script
   `here <train_script.html>`_.

If ``torch.distributed.run`` does not meet your requirements,
you may use our APIs directly for more powerful customization. Start by
taking a look at the `elastic agent <agent.html>`_ API.

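For example, the launcher behind ``torch.distributed.run`` can be driven
programmatically. A minimal sketch, assuming the ``LaunchConfig`` and
``elastic_launch`` entry points in ``torch.distributed.launcher.api``
(the values mirror the CLI flags above; ``trainer`` is a hypothetical
entry point):

.. code-block:: python

   from torch.distributed.launcher.api import LaunchConfig, elastic_launch

   def trainer(msg):
       # runs once per worker process; the agent sets RANK and friends
       import os
       return f"{msg} from rank {os.environ['RANK']}"

   if __name__ == "__main__":
       config = LaunchConfig(
           min_nodes=1,
           max_nodes=1,
           nproc_per_node=2,
           run_id="quickstart_job",              # analogous to --rdzv_id
           rdzv_backend="etcd",                  # analogous to --rdzv_backend
           rdzv_endpoint="ETCD_HOST:ETCD_PORT",  # analogous to --rdzv_endpoint
       )
       # returns a dict mapping each worker's rank to trainer's return value
       print(elastic_launch(config, trainer)("hello"))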