pytorch/docs/source/elastic
Aliaksandr Ivanou 4e181dfc35 [torch] Various improvements to torch.distributed.launch and torch.distributed.run (#60925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925

* Set the default number of restarts for `torch.distributed.launch` to 0
* Remove the unnecessary `--use_env` warning
* Move the `--use_env` warnings to `torch.distributed.launch`
* Make default log level WARNING
* Add new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error-propagation
* Set the default events handler to `null`, which does not print events to the console
* Add reference from `torch.distributed.launch` to `torch.distributed.run`
* Set correct preexec function that sends SIGTERM to child processes when parent dies
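
The last item above concerns a preexec function that asks the kernel to send SIGTERM to a child process when its parent dies. A minimal sketch of that general technique on Linux follows; it is not taken from this change. The helper names `set_parent_death_signal` and `spawn_worker` are hypothetical, and the sketch assumes a Linux host where `prctl` is reachable through the process's own symbol table.

```python
import ctypes
import signal
import subprocess
import sys

# Value of PR_SET_PDEATHSIG in <linux/prctl.h>.
PR_SET_PDEATHSIG = 1


def set_parent_death_signal():
    """Hypothetical preexec hook: request SIGTERM when the parent dies.

    Runs in the child between fork() and exec(). Does nothing on
    platforms other than Linux, where prctl is unavailable.
    """
    if not sys.platform.startswith("linux"):
        return
    # Assumption: loading the main program's symbols exposes libc's prctl.
    libc = ctypes.CDLL(None, use_errno=True)
    libc.prctl(PR_SET_PDEATHSIG, int(signal.SIGTERM))


def spawn_worker(cmd, **popen_kwargs):
    """Hypothetical helper: start `cmd` with the parent-death hook installed."""
    return subprocess.Popen(
        cmd, preexec_fn=set_parent_death_signal, **popen_kwargs
    )


if __name__ == "__main__":
    worker = spawn_worker([sys.executable, "-c", "print('worker running')"])
    worker.wait()
```

Because the hook is passed as `preexec_fn`, it takes effect in the child only, so the launcher process itself is unaffected.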

Issues resolved:

https://github.com/pytorch/pytorch/issues/60716
https://github.com/pytorch/pytorch/issues/60754

Test Plan:
sandcastle

    python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
    python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts

    python -m torch.distributed.launch --nproc_per_node=4 --use_env --no_python main.py -> produces error
    python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py -> no warning
    python -m torch.distributed.launch --nproc_per_node=4 --no_python main.py -> warning

Output of running torch.distributed.launch without --use_env:

    $path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated
    and will be removed in future. Use torch.distributed.run.
    Note that --use_env is set by default in torch.distributed.run.
    If your script expects `--local_rank` argument to be set, please
    change it to read from `os.environ('LOCAL_RANK')` instead.
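
The warning above asks scripts to read the local rank from the environment instead of a `--local_rank` argument. A small sketch of one way a training script might handle both launchers during the transition is shown below. The function name `resolve_local_rank` and the fallback order are assumptions for illustration, not part of this change. Note that `os.environ` is a mapping, so the lookup uses square brackets.

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Hypothetical helper: prefer LOCAL_RANK, fall back to --local_rank.

    argv and environ are injectable for testing; by default the real
    command line and os.environ are used.
    """
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser()
    # Legacy argument passed by torch.distributed.launch without --use_env.
    parser.add_argument("--local_rank", type=int, default=None)
    # parse_known_args leaves the script's other arguments untouched.
    args, _unknown = parser.parse_known_args(argv)

    if "LOCAL_RANK" in environ:
        return int(environ["LOCAL_RANK"])
    if args.local_rank is not None:
        return args.local_rank
    # Assumed default for a single-process run with neither source set.
    return 0


if __name__ == "__main__":
    print(f"local rank: {resolve_local_rank()}")
```

Once a script is launched only through `torch.distributed.run`, the argparse fallback can be dropped and the environment lookup kept on its own.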

Reviewed By: kiukchung, cbalioglu

Differential Revision: D29413019

fbshipit-source-id: 323bfbad9d0e4aba3b10ddd7a243ca6e48169630
2021-06-30 23:31:02 -07:00
agent_diagram.jpg
agent.rst
customization.rst
errors.rst [torch] Various improvements to torch.distributed.launch and torch.distributed.run (#60925) 2021-06-30 23:31:02 -07:00
etcd_rdzv_diagram.png
events.rst
examples.rst
kubernetes.rst
metrics.rst
multiprocessing.rst
quickstart.rst [torch/elastic] Update the rendezvous docs (#58160) 2021-05-12 16:54:28 -07:00
rendezvous.rst [torch/elastic] Update the rendezvous docs (#58160) 2021-05-12 16:54:28 -07:00
run.rst [torch] Various improvements to torch.distributed.launch and torch.distributed.run (#60925) 2021-06-30 23:31:02 -07:00
timer.rst
train_script.rst [torch] Various improvements to torch.distributed.launch and torch.distributed.run (#60925) 2021-06-30 23:31:02 -07:00