Commit Graph

29 Commits

Author SHA1 Message Date
Kunal Bhalla
af229ecd34 [RFC] Change --standalone to bind to a random port (#107734)
Given that standalone generates the args anyway, it seems like it would be more convenient if it explicitly used a random port by default instead of trying to use 29400.

That way users can directly go with `--standalone` instead of having to spell out `--rdzv-backend=c10d --rdzv-endpoint=localhost:0`
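
For illustration, a minimal sketch (not the torchrun implementation) of why an endpoint such as `localhost:0` yields a random free port: binding to port 0 asks the OS for an ephemeral port.

```python
import socket

# Binding to port 0 lets the OS pick a free ephemeral port; this is the
# mechanism an endpoint such as localhost:0 relies on.
def find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("localhost", 0))
        return s.getsockname()[1]

print(find_free_port())
```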

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107734
Approved by: https://github.com/H-Huang
2023-08-25 22:13:44 +00:00
shibo19
0af3203c72 fix torchrun script for custom device (#105443)
Fixes #ISSUE_NUMBER
As the title says, add torchrun support for custom devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105443
Approved by: https://github.com/kumpera
2023-07-31 05:46:23 +00:00
Edward Z. Yang
5a7aad9681 Convert logging f-strings to use % format, part four (#98705)
This part handles multi-line concatenated string literals.
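
For reference, the point of the conversion: %-style arguments let the logging module format the message lazily, only when the record is actually emitted, while an f-string is built eagerly even if the log level filters the message out. A small illustration with made-up values:

```python
import logging

log = logging.getLogger(__name__)
name, attempts = "worker0", 3

# f-string: the message is formatted eagerly, even if INFO is disabled.
log.info(f"restarting {name} ({attempts} attempts left)")

# %-format: logging interpolates the args only when the record is emitted.
log.info("restarting %s (%d attempts left)", name, attempts)
```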

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98705
Approved by: https://github.com/voznesenskym
2023-04-11 13:17:59 +00:00
Edward Z. Yang
9a8f71f23e Convert logging f-strings to use % format (#98697)
Codemod done with
https://gist.github.com/ezyang/2e8b0463cdc6be278478495b23ff0530 with
assistance from ChatGPT.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98697
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
Kazuaki Ishizaki
6514d71add Fix typos under torch/distributed directory (#98225)
This PR fixes typos in comments and messages of `.py` files under the `torch/distributed` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98225
Approved by: https://github.com/soulitzer, https://github.com/kit1980
2023-04-05 00:21:33 +00:00
Kazuaki Ishizaki
35fd5c548e Fix typos under torch/distributed directory (#95638)
This PR fixes typos in comments and messages of `.py` files under the torch/distributed directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638
Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980
2023-03-27 21:13:44 +00:00
Jeffrey Dunn
d779dadda1 Remove stack trace captures from import (#97274)
Summary:
Calls to this function without an argument capture a stack trace at
import time. This is expensive; we can skip it by passing in a value.
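
A hypothetical sketch of the pattern (the actual function touched by this PR is not named here): a parameter whose fallback captures the call stack, which can be skipped by passing a value explicitly.

```python
import traceback

def register(source=None):
    if source is None:
        # Capturing the call stack is relatively expensive; at import time
        # this runs for every caller that omits the argument.
        source = "".join(traceback.format_stack())
    return source

register()           # captures a stack trace to describe the call site
register(__file__)   # passing a value skips the capture entirely
```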

Test Plan: Wait for tests

Differential Revision: D44244345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97274
Approved by: https://github.com/kiukchung
2023-03-22 18:34:13 +00:00
Xuehai Pan
a229b4526f [BE] Prefer dash over underscore in command-line options (#94505)
Prefer dashes over underscores in command-line options. Add `--command-arg-name` to the argument parser. The old arguments with underscores (`--command_arg_name`) are kept for backward compatibility.

Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in their arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they appear to be the default choice in the Python standard library:

`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)

```python
class BooleanOptionalAction(Action):
    def __init__(...):
            if option_string.startswith('--'):
                option_string = '--no-' + option_string[2:]
                _option_strings.append(option_string)
```

It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the Shift or Caps Lock key, whereas `-` does not.
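
As a rough sketch (not necessarily the exact mechanism used in this PR), argparse can register both spellings on one argument so the underscored form stays as a backward-compatible alias:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--master-port", "--master_port",  # both spellings are accepted
    dest="master_port",
    type=int,
    default=29500,
)
args = parser.parse_args(["--master_port", "29400"])
print(args.master_port)  # 29400
```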

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-09 20:16:49 +00:00
Chris Zheng
5d37890b8e Update torchrun and TorchElastic to take optional local_addr param to allow skip local IP lookup if specified (#88922)
Summary:
Update dynamic rendezvous nodes to use the rendezvous hostname if provided.
For PR: https://github.com/pytorch/pytorch/issues/85300

Before:
For dynamic rendezvous, it always grabs the `fqdn` from socket for each node, even if the user specified the address.
For example,
https://github.com/pytorch/pytorch/blob/master/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py#L248-L256
```
return _NodeDesc(socket.getfqdn(), os.getpid(), local_id)
```

Now:
If the user specifies the hostname, each node will respect it.
For example, `socket.getfqdn(<hostname>)`
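
A rough sketch of that behaviour (names are illustrative, not the exact PyTorch internals):

```python
import socket

def node_hostname(local_addr=None):
    # Respect the user-supplied address; fall back to the local FQDN lookup.
    return socket.getfqdn(local_addr) if local_addr else socket.getfqdn()

print(node_hostname())             # FQDN of the local machine
print(node_hostname("trainer-0"))  # resolves the user-provided hostname
```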

Test Plan: Unit tests.

Differential Revision: D41204028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88922
Approved by: https://github.com/d4l3k
2022-12-21 03:55:01 +00:00
Ram Rachum
351d73b97f Fix exception causes all over the codebase (#90271)
This is the continuation to #90134 and hopefully the final PR in this series.
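
For context, the "exception causes" in question are Python's explicit exception chaining; a small made-up example of the pattern being enforced:

```python
def load_config(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError as err:
        # "from err" keeps the original exception attached as __cause__, so
        # the traceback shows the real root cause.
        raise RuntimeError(f"could not load config from {path}") from err
```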

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90271
Approved by: https://github.com/kit1980
2022-12-07 04:29:00 +00:00
joncrall
4618371da5 Integrate xdoctest - Rebased (#82797)
This is a new version of #15648 based on the latest master branch.

Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.

In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)
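
For illustration, this is roughly what the directive looks like inside a docstring (the function and example are made up):

```python
def all_reduce_demo(tensor):
    """
    Example:
        >>> # xdoctest: +SKIP
        >>> import torch.distributed as dist
        >>> dist.all_reduce(tensor)
    """
```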

Fixes https://github.com/pytorch/pytorch/issues/71105

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
2022-08-12 02:08:01 +00:00
Kiuk Chung
1a8bd1a7eb (torch/elastic) add documentation clarifying that torchrun is a console script to torch.distributed.run (#73598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73598

resolves https://github.com/pytorch/pytorch/issues/73319

Simply clarifies that `torchrun` is a console script that invokes `python -m torch.distributed.run`.
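
Roughly, a console script is just a setuptools entry point; a hedged sketch of how such a mapping is declared (the authoritative entry lives in PyTorch's own setup.py):

```python
from setuptools import setup

setup(
    name="example-package",
    version="0.0.1",
    entry_points={
        "console_scripts": [
            # maps the `torchrun` command to torch.distributed.run's main()
            "torchrun = torch.distributed.run:main",
        ],
    },
)
```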

Test Plan: N/A doc change only, letting github CI validate that the docs build correctly.

Reviewed By: sinannasir, d4l3k

Differential Revision: D34558538

fbshipit-source-id: 70332c7efc57164a15eda6621575a7c6f14120c8
(cherry picked from commit a349c048c788ece514658a0c94dc0c87c9644e71)
2022-03-03 08:35:50 +00:00
Kiuk Chung
df11e2d6f9 (torch/elastic) add fqdn hostname to error printout (#66182)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66182

closes https://github.com/pytorch/pytorch/issues/63174

Does a few things:

1. adds hostname to the error report
2. moves the "root cause" section to the end (presumably since the logs are being "tailed" we want the root cause to appear at the end)
3. moves redundant error info logging to debug
4. makes the border max 60 char in length and justifies left for the header

NOTE: you HAVE TO annotate your main function with `torch.distributed.elastic.multiprocessing.errors.record`, otherwise no traceback is printed (this is because Python exception propagation does NOT work out of the box for IPC; hence the extra `record` annotation).
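
For example, a minimal annotated entrypoint:

```python
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    # training entrypoint; exceptions raised here are written to the error
    # file and summarized in the report shown below
    ...

if __name__ == "__main__":
    main()
```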

Test Plan:
Sample

```
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2021-10-05_17:37:22
  host      : devvm4955.prn0.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3296201)
  error_file: /home/kiuk/tmp/elastic/none_3_lsytqe/attempt_0/0/error.json
  traceback :
  Traceback (most recent call last):
    File "/tmp/jetter.xr3_x6qq/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 372, in wrapper
      return f(*args, **kwargs)
    File "main.py", line 28, in main
      raise RuntimeError(args.throws)
  RuntimeError: foobar

============================================================
```

Reviewed By: cbalioglu, aivanou

Differential Revision: D31416492

fbshipit-source-id: 0aeaf6e634e23ce0ea7f6a03b12c8a9ac57246e9
2021-10-07 01:40:02 -07:00
Aliaksandr Ivanou
4937218611 [torch][launch] Add ability to override sys.executable for torch.distributed.run (#66179)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66179

The diff adds a check for the `PYTHON_EXEC` environment variable. If the variable is set, it will override `sys.executable` for `torch.distributed.run`.
This means that if `PYTHON_EXEC` is set, user scripts executed via `torch.distributed.run` will be started with the value of `os.environ["PYTHON_EXEC"]`.
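
A hedged sketch of the described behaviour (the script name is a placeholder):

```python
import os
import sys

# Prefer PYTHON_EXEC over sys.executable when building the launch command.
python_exec = os.environ.get("PYTHON_EXEC", sys.executable)
cmd = [python_exec, "-u", "train.py"]
print(cmd)
```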

Test Plan: unittest

Reviewed By: kiukchung

Differential Revision: D31329003

fbshipit-source-id: b9d0167d99bbf463a6390f508324883ca4a1e439
2021-10-06 17:33:19 -07:00
Kiuk Chung
3900509b7d (torchelastic) make --max_restarts explicit in the quickstart and runner docs (#65838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65838

closes https://github.com/pytorch/pytorch/pull/65675

The default `--max_restarts` for `torch.distributed.run` was changed from `3` to `0` to make things backwards compatible with `torch.distributed.launch`. Since the default `--max_restarts` used to be greater than `0`, we never documented passing `--max_restarts` explicitly in any of our example code.

Test Plan: N/A doc change only

Reviewed By: d4l3k

Differential Revision: D31279544

fbshipit-source-id: 98b31e6a158371bc56907552c5c13958446716f9
2021-09-29 19:29:01 -07:00
Can Balioglu
65e6194aeb Introduce the torchrun entrypoint (#64049)
Summary:
This PR introduces a new `torchrun` entrypoint that simply "points" to `python -m torch.distributed.run`. It is shorter and less error-prone to type, and gives a nicer syntax than the rather cryptic `python -m ...` command line. Along with the new entrypoint, the documentation is also updated so that places where `torch.distributed.run` is mentioned are replaced with `torchrun`.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64049

Reviewed By: cbalioglu

Differential Revision: D30584041

Pulled By: kiukchung

fbshipit-source-id: d99db3b5d12e7bf9676bab70e680d4b88031ae2d
2021-08-26 20:17:48 -07:00
Kiuk Chung
9d95d48567 (torch.distributed) Add torch.distributed.is_torchelastic_launched() util method + make init_method=tcp:// compatible with torchelastic (#63910)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63910

Addresses the current issue that `init_method=tcp://` is not compatible with `torch.distributed.run` and `torch.distributed.launch`. When running with a training script that initializes the process group with `init_method=tcp://localhost:$port` as such:

```
$ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py
```

An `Address in use` error is raised since the training script tries to create a TCPStore on port 6000, which is already taken since the elastic agent is already running a TCPStore on that port.

For details see: https://github.com/pytorch/pytorch/issues/63874.

This change does a couple of things:

1. Adds `is_torchelastic_launched()` check function that users can use in the training scripts to see whether the script is launched via torchelastic.
1. Update the `torch.distributed` docs page to include the new `is_torchelastic_launched()` function.
1. Makes `init_method=tcp://` torchelastic compatible by modifying `_tcp_rendezvous_handler` in `torch.distributed.rendezvous` (this is NOT the elastic rendezvous, it is the old rendezvous module which is slotted for deprecation in future releases) to check `is_torchelastic_launched()` AND `torchelastic_use_agent_store()` and if so, only create TCPStore clients (no daemons, not even for rank 0).
1. Adds a bunch of unittests to cover the different code paths

NOTE: the issue mentions that we should fail fast with an assertion on `init_method!=env://` when `is_torchelastic_launched()` is `True`. There are three registered init_methods in PyTorch: env://, tcp://, file://. Since this diff makes tcp:// compatible with torchelastic, and I've validated that file:// is also compatible, there is no need to add assertions. I did update the docs to point out that env:// is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future, but that is out of scope for this issue.
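
For illustration, a training script can branch on the new check like this (backend and addresses are arbitrary):

```python
import torch.distributed as dist

if dist.is_torchelastic_launched():
    # the elastic agent already set MASTER_ADDR/MASTER_PORT, RANK, WORLD_SIZE
    dist.init_process_group(backend="gloo", init_method="env://")
else:
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://localhost:29500",
        rank=0,
        world_size=1,
    )
```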

Test Plan: Unittests.

Reviewed By: cbalioglu

Differential Revision: D30529984

fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5
2021-08-25 22:57:43 -07:00
Howard Huang
7299565768 Update torch.distributed.run OMP_NUM_THREADS message to log.warning (#63953)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63953

Closes #61138

Test:
`python -m torch.distributed.run --nproc_per_node 2 test.py`
Still outputs message

`LOGLEVEL=ERROR python -m torch.distributed.run --nproc_per_node 2 test.py`
Does not output message anymore

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D30542997

Pulled By: H-Huang

fbshipit-source-id: e7da30dcda51516abf4e56f1f510132e44397027
2021-08-25 11:55:06 -07:00
Aliaksandr Ivanou
60382de455 [torch] Set nproc_per_node to 1 (#61552)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61552

Set `nproc_per_node` to 1

Test Plan: unittests

Reviewed By: cbalioglu

Differential Revision: D29667056

fbshipit-source-id: 6601f66fec5e018c7737d909f8c71642451abb29
2021-07-13 13:35:25 -07:00
Aliaksandr Ivanou
13658b10bb [torch] Various improvements to torch.distributed.launch and torch.distributed.run (#61294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61294

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925

* Set the default number of restarts for `torch.distributed.launch` to 0
* Remove unnecessary `-use_env` warning, move `-use_env` warnings
* Move `-use_env` warnings to `torch.distributed.launch`
* Make default log level WARNING
* Add new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error-propagation
* Set default events handler to `null` that does not print events to console
* Add reference from `torch.distributed.launch` to `torch.distributed.run`
* Set correct preexec function that sends SIGTERM to child processes when parent dies

Issues resolved:

https://github.com/pytorch/pytorch/issues/60716
https://github.com/pytorch/pytorch/issues/60754

Test Plan:
sandcastle

    python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
    python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts

    python -m torch.distributed.launch --nproc_per_node=4  --use_env --no_python  main.py -> produces error
    python -m torch.distributed.launch --nproc_per_node=4  --use_env main.py -> no warning
    python -m torch.distributed.launch --nproc_per_node=4  --no_python  main.py -> warning

Output of running torch.distributed.launch without --use_env:

    $path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated
    and will be removed in future. Use torch.distributed.run.
    Note that --use_env is set by default in torch.distributed.run.
    If your script expects `--local_rank` argument to be set, please
    change it to read from `os.environ('LOCAL_RANK')` instead.

New section:

{F628923078}

{F628974089}

Reviewed By: cbalioglu

Differential Revision: D29559553

fbshipit-source-id: 03ed9ba638bf154354e1530ffc964688431edf6b
2021-07-08 16:28:06 -07:00
Kento Nozawa
376dc500a9 Minor bug fix in the warning message (#61127)
Summary:
The current example code does not work. The correct one is like this: cb7d813275/torch/distributed/run.py (L266)
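
The working form of the quoted snippet indexes `os.environ` instead of calling it:

```python
import os

local_rank = int(os.environ["LOCAL_RANK"])
```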

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61127

Reviewed By: cbalioglu

Differential Revision: D29572003

Pulled By: mrshenli

fbshipit-source-id: 05b470230f3d70f8a6164edb5f92894a1112069f
2021-07-07 11:42:51 -07:00
Vitaly Fedyunin
ccfdb30644 Revert D29413019: [torch] Various improvements to torch.distributed.launch and torch.distributed.run
Test Plan: revert-hammer

Differential Revision:
D29413019 (4e181dfc35)

Original commit changeset: 323bfbad9d0e

fbshipit-source-id: 1f8ae4b3d0a23f3eaff28c37e9148efff25fafe2
2021-07-01 08:44:51 -07:00
Aliaksandr Ivanou
4e181dfc35 [torch] Various improvements to torch.distributed.launch and torch.distributed.run (#60925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925

* Set the default number of restarts for `torch.distributed.launch` to 0
* Remove unnecessary `-use_env` warning, move `-use_env` warnings
* Move `-use_env` warnings to `torch.distributed.launch`
* Make default log level WARNING
* Add new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error-propagation
* Set default events handler to `null` that does not print events to console
* Add reference from `torch.distributed.launch` to `torch.distributed.run`
* Set correct preexec function that sends SIGTERM to child processes when parent dies

Issues resolved:

https://github.com/pytorch/pytorch/issues/60716
https://github.com/pytorch/pytorch/issues/60754

Test Plan:
sandcastle

    python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
    python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts

    python -m torch.distributed.launch --nproc_per_node=4  --use_env --no_python  main.py -> produces error
    python -m torch.distributed.launch --nproc_per_node=4  --use_env main.py -> no warning
    python -m torch.distributed.launch --nproc_per_node=4  --no_python  main.py -> warning

Output of running torch.distributed.launch without --use_env:

    $path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated
    and will be removed in future. Use torch.distributed.run.
    Note that --use_env is set by default in torch.distributed.run.
    If your script expects `--local_rank` argument to be set, please
    change it to read from `os.environ('LOCAL_RANK')` instead.

New section:

{F628923078}

{F628974089}

Reviewed By: kiukchung, cbalioglu

Differential Revision: D29413019

fbshipit-source-id: 323bfbad9d0e4aba3b10ddd7a243ca6e48169630
2021-06-30 23:31:02 -07:00
Aliaksandr Ivanou
b99523832b Remove use_env from torch.distributed.run, clarify bc around that parameter in comment. (#59409)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59409

Remove use_env from torch.distributed.run, and clarify backward compatibility around that parameter in a comment.

Test Plan: n/a

Reviewed By: cbalioglu

Differential Revision: D28876485

fbshipit-source-id: 5f10365968d204985ce517b83c392c688995d76e
2021-06-04 09:02:47 -07:00
Kiuk Chung
998374a702 [tsm] add support for jetter to Role (base_image) for mast launches (#58252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58252

Pull Request resolved: https://github.com/pytorch/elastic/pull/149

1. Adds `ml_image` buck macro
2. Adds `--run_path` option to `torch.distributed.run`
3. Adds `tsm/driver/fb/test/patched/foo` (for unittesting)
4. Changes to `distributed_sum` to use `ml_image` (see Test plan for how this was tested in local and mast)

NOTE: need to enable jetter for flow and local schedulers (will do this on a separate diff since this diff is already really big)

Test Plan:
## Local Testing
```
# build the two fbpkgs (base and main)
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum.base
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum

# fetch the fbpkgs
cd ~/tmp

fbpkg fetch --symlink-tags  -o -d . jetter:prod
fbpkg fetch --symlink-tags  -o -d . torchx.examples.dist_sum.base
fbpkg fetch --symlink-tags  -o -d . torchx.examples.dist_sum

jetter/LAST/jetter apply-and-run \
  torchx.examples.dist_sum.base/LAST/torchrun \
  torchx.examples.dist_sum/LAST \
  -- \
  --as_function \
  --rdzv_id foobar \
  --nnodes 1 \
  --nproc_per_node 2 \
  --max_restarts 0 \
  --role worker \
  --no_python \
~/torchx.examples.dist_sum/LAST/pytorch/elastic/examples/distributed_sum/fb/main.py
```

## Mast Testing
```
buck-out/gen/pytorch/elastic/torchelastic/tsm/fb/cli/tsm.par run_ddp \
  --scheduler mast \
  --base_fbpkg torchx.examples.dist_sum.base:78f01b5 \
  --fbpkg torchx.examples.dist_sum:f38ab46 \
  --run_cfg hpcClusterUuid=MastNaoTestCluster,hpcIdentity=pytorch_r2p,hpcJobOncall=pytorch_r2p \
  --nnodes 2 \
  --resource T1 \
  --nproc_per_node 4 \
  --name kiuk_jetter_test \
 pytorch/elastic/examples/distributed_sum/fb/main.py
```
Runs successfully: https://www.internalfb.com/mast/job/tsm_kiuk-kiuk_jetter_test_34c9f0fa?

Reviewed By: tierex

Differential Revision: D28421033

fbshipit-source-id: 96edcecf639143e31ec6c86ec713a2e2d7790f3d
2021-05-14 17:39:18 -07:00
Mike Ruberry
c8644326a7 Revert D28177553: [tsm] add support for jetter to Role (base_image) for mast launches
Test Plan: revert-hammer

Differential Revision:
D28177553 (8a1dab3d26)

Original commit changeset: 29daada4bc26

fbshipit-source-id: 28132684dfdc28915d5fa5217a4591fec8d880fe
2021-05-12 23:21:59 -07:00
Kiuk Chung
8a1dab3d26 [tsm] add support for jetter to Role (base_image) for mast launches
Summary:
1. Adds `ml_image` buck macro
2. Adds `--run_path` option to `torch.distributed.run`
3. Adds `tsm/driver/fb/test/patched/foo` (for unittesting)
4. Changes to `distributed_sum` to use `ml_image` (see Test plan for how this was tested in local and mast)

NOTE: need to enable jetter for flow and local schedulers (will do this on a separate diff since this diff is already really big)

Test Plan:
## Local Testing
```
# build the two fbpkgs (base and main)
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum.base
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum

# fetch the fbpkgs
cd ~/tmp

fbpkg fetch --symlink-tags  -o -d . jetter:prod
fbpkg fetch --symlink-tags  -o -d . torchx.examples.dist_sum.base
fbpkg fetch --symlink-tags  -o -d . torchx.examples.dist_sum

jetter/LAST/jetter apply-and-run \
  torchx.examples.dist_sum.base/LAST/torchrun \
  torchx.examples.dist_sum/LAST \
  -- \
  --as_function \
  --rdzv_id foobar \
  --nnodes 1 \
  --nproc_per_node 2 \
  --max_restarts 0 \
  --role worker \
  --no_python \
~/torchx.examples.dist_sum/LAST/pytorch/elastic/examples/distributed_sum/fb/main.py
```

## Mast Testing
```
buck-out/gen/pytorch/elastic/torchelastic/tsm/fb/cli/tsm.par run_ddp \
  --scheduler mast \
  --base_fbpkg torchx.examples.dist_sum.base:78f01b5 \
  --fbpkg torchx.examples.dist_sum:f38ab46 \
  --run_cfg hpcClusterUuid=MastNaoTestCluster,hpcIdentity=pytorch_r2p,hpcJobOncall=pytorch_r2p \
  --nnodes 2 \
  --resource T1 \
  --nproc_per_node 4 \
  --name kiuk_jetter_test \
 pytorch/elastic/examples/distributed_sum/fb/main.py
```
Runs successfully: https://www.internalfb.com/mast/job/tsm_kiuk-kiuk_jetter_test_34c9f0fa?

Reviewed By: tierex, yifuwang

Differential Revision: D28177553

fbshipit-source-id: 29daada4bc26e5ef0949bf75215f35e557bd35b8
2021-05-12 22:10:15 -07:00
Can Balioglu
ae63b1d1c6 [torch/elastic] Revise distributed run script (#58159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58159

This PR includes the following changes:

- The `--standalone` option of `torch.distributed.run` now uses the `c10d` backend instead of `etcd` backend.

- The `import` statement for `EtcdServer` has been removed from the run script.

- The docstrings and parameter descriptions of the run script have been revised and improved.

- The default port number of `EtcdRendezvousBackend` has been changed from 29500 to 29400 to improve the user experience when it is used along with the run script, which by default uses port 29500 for the distributed job store (a.k.a. `MASTER_PORT`).
ghstack-source-id: 128782267

Test Plan:
- Run existing tests.
- Visually verified the correct rendering of the docs.

Reviewed By: tierex

Differential Revision: D28383681

fbshipit-source-id: a4098f7c23c97a2376a9c4023e81f82fedd04b10
2021-05-12 16:53:31 -07:00
Aliaksandr Ivanou
8a949f9e51 [23/n][torch/elastic][upstream] Rename torch.distributed.elastic_launch to torch.distributed.run (#56831)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56831

Rename torch.distributed.elastic_launch to torch.distributed.run

Test Plan:
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
  buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test/...
  flow-cli canary  pytorch.elastic.examples.classy_vision.main --entitlement gpu_prod --run-as-secure-group oncall_dai_pet --buck-target //fblearner/flow/projects/pytorch/elastic/examples:workflow

Reviewed By: kiukchung

Differential Revision: D27921159

fbshipit-source-id: cc7f2f035223b2d4abd7373af298998887e14c12
2021-04-29 11:06:20 -07:00