pytorch/torch/distributed
Can Balioglu ae63b1d1c6 [torch/elastic] Revise distributed run script (#58159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58159

This PR includes the following changes:

- The `--standalone` option of `torch.distributed.run` now uses the `c10d` backend instead of the `etcd` backend.

- The `import` statement for `EtcdServer` has been removed from the run script.

- The docstrings and parameter descriptions of the run script have been revised and improved.

- The default port number of `EtcdRendezvousBackend` has been changed from 29500 to 29400. This avoids a conflict with the run script, which by default uses port 29500 for the distributed job store (a.k.a. `MASTER_PORT`).
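As a usage sketch of how these changes fit together (the script name `train.py` and the worker count are hypothetical, not part of this PR), a single-node launch via `torch.distributed.run` would look like:

```shell
# Single-node standalone launch: --standalone now spins up a local
# c10d rendezvous backend instead of an etcd server.
# train.py is a hypothetical training script.
python -m torch.distributed.run --standalone --nproc_per_node=4 train.py

# Explicit c10d rendezvous: the rendezvous endpoint defaults to port
# 29400, leaving port 29500 free for the job store (MASTER_PORT).
python -m torch.distributed.run --rdzv_backend=c10d \
    --rdzv_endpoint=localhost:29400 --nproc_per_node=4 train.py
```

With the old etcd default of 29500, running the rendezvous backend and the job store on the same host required overriding one of the two ports; the 29400 default removes that collision.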
ghstack-source-id: 128782267

Test Plan:
- Run existing tests.
- Visually verified the correct rendering of the docs.

Reviewed By: tierex

Differential Revision: D28383681

fbshipit-source-id: a4098f7c23c97a2376a9c4023e81f82fedd04b10
2021-05-12 16:53:31 -07:00
algorithms [Gradient Compression] Update the docstring of fp16_compress_hook (#58168) 2021-05-12 14:28:41 -07:00
autograd Add Python declaration of torch._C and torch._C._autograd modules. (#46622) 2020-11-06 01:25:47 -08:00
benchmarks Add lint for unqualified type: ignore (#56290) 2021-04-21 08:07:23 -07:00
elastic [torch/elastic] Revise distributed run script (#58159) 2021-05-12 16:53:31 -07:00
launcher Add lint for unqualified type: ignore (#56290) 2021-04-21 08:07:23 -07:00
nn make remote model instantiation async when possible (#58052) 2021-05-12 13:48:09 -07:00
optim Add lint for unqualified type: ignore (#56290) 2021-04-21 08:07:23 -07:00
pipeline Convert assert -> cast. (#57458) 2021-05-12 13:54:16 -07:00
rpc Allow passing cpu to CUDA RPC device maps (#57019) 2021-05-04 04:14:27 -07:00
__init__.py [torch distributed] Implementing all_gather_base (#56315) 2021-04-23 14:16:47 -07:00
argparse_util.py [19/n][torch/elastic][upstream] Replace pytorch.distributed.launch with torchelastic launcher (#56214) 2021-04-16 13:38:23 -07:00
constants.py make ProcessGroupDefaultTimeout the same as python (#56549) 2021-04-21 17:56:05 -07:00
CONTRIBUTING.md Split test_c10d.py to test_c10d_common.py, test_c10d_gloo.py, test_c10d_nccl.py (#56598) 2021-04-21 22:10:41 -07:00
distributed_c10d.py [c10d] Log when store based barrier succeeds (#57711) 2021-05-10 21:09:40 -07:00
launch.py [23/n][torch/elastic][upstream] Rename torch.distributed.elastic_launch to torch.distributed.run (#56831) 2021-04-29 11:06:20 -07:00
rendezvous.py Fix path handling on Win32 in rendezvous.py (#57000) 2021-04-29 13:55:11 -07:00
run.py [torch/elastic] Revise distributed run script (#58159) 2021-05-12 16:53:31 -07:00