55 Commits

Author SHA1 Message Date
Kiuk Chung
998374a702 [tsm] add support for jetter to Role (base_image) for mast launches (#58252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58252

Pull Request resolved: https://github.com/pytorch/elastic/pull/149

1. Adds the `ml_image` buck macro
2. Adds the `--run_path` option to `torch.distributed.run`
3. Adds `tsm/driver/fb/test/patched/foo` (for unit testing)
4. Changes `distributed_sum` to use `ml_image` (see the Test Plan for how this was tested locally and on mast)

NOTE: jetter still needs to be enabled for the flow and local schedulers; that will be done in a separate diff since this one is already large.

Test Plan:
## Local Testing
```
# build the two fbpkgs (base and main)
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum.base
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum

# fetch the fbpkgs
cd ~/tmp

fbpkg fetch --symlink-tags  -o -d . jetter:prod
fbpkg fetch --symlink-tags  -o -d . torchx.examples.dist_sum.base
fbpkg fetch --symlink-tags  -o -d . torchx.examples.dist_sum

jetter/LAST/jetter apply-and-run \
  torchx.examples.dist_sum.base/LAST/torchrun \
  torchx.examples.dist_sum/LAST \
  -- \
  --as_function \
  --rdzv_id foobar \
  --nnodes 1 \
  --nproc_per_node 2 \
  --max_restarts 0 \
  --role worker \
  --no_python \
~/torchx.examples.dist_sum/LAST/pytorch/elastic/examples/distributed_sum/fb/main.py
```

## Mast Testing
```
buck-out/gen/pytorch/elastic/torchelastic/tsm/fb/cli/tsm.par run_ddp \
  --scheduler mast \
  --base_fbpkg torchx.examples.dist_sum.base:78f01b5 \
  --fbpkg torchx.examples.dist_sum:f38ab46 \
  --run_cfg hpcClusterUuid=MastNaoTestCluster,hpcIdentity=pytorch_r2p,hpcJobOncall=pytorch_r2p \
  --nnodes 2 \
  --resource T1 \
  --nproc_per_node 4 \
  --name kiuk_jetter_test \
 pytorch/elastic/examples/distributed_sum/fb/main.py
```
Runs successfully: https://www.internalfb.com/mast/job/tsm_kiuk-kiuk_jetter_test_34c9f0fa?

Reviewed By: tierex

Differential Revision: D28421033

fbshipit-source-id: 96edcecf639143e31ec6c86ec713a2e2d7790f3d
2021-05-14 17:39:18 -07:00
Mike Ruberry
c8644326a7 Revert D28177553: [tsm] add support for jetter to Role (base_image) for mast launches
Test Plan: revert-hammer

Differential Revision:
D28177553 (8a1dab3d26)

Original commit changeset: 29daada4bc26

fbshipit-source-id: 28132684dfdc28915d5fa5217a4591fec8d880fe
2021-05-12 23:21:59 -07:00
Kiuk Chung
8a1dab3d26 [tsm] add support for jetter to Role (base_image) for mast launches
Summary:
1. Adds the `ml_image` buck macro
2. Adds the `--run_path` option to `torch.distributed.run`
3. Adds `tsm/driver/fb/test/patched/foo` (for unit testing)
4. Changes `distributed_sum` to use `ml_image` (see the Test Plan for how this was tested locally and on mast)

NOTE: jetter still needs to be enabled for the flow and local schedulers; that will be done in a separate diff since this one is already large.

Test Plan:
## Local Testing
```
# build the two fbpkgs (base and main)
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum.base
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum

# fetch the fbpkgs
cd ~/tmp

fbpkg fetch --symlink-tags  -o -d . jetter:prod
fbpkg fetch --symlink-tags  -o -d . torchx.examples.dist_sum.base
fbpkg fetch --symlink-tags  -o -d . torchx.examples.dist_sum

jetter/LAST/jetter apply-and-run \
  torchx.examples.dist_sum.base/LAST/torchrun \
  torchx.examples.dist_sum/LAST \
  -- \
  --as_function \
  --rdzv_id foobar \
  --nnodes 1 \
  --nproc_per_node 2 \
  --max_restarts 0 \
  --role worker \
  --no_python \
~/torchx.examples.dist_sum/LAST/pytorch/elastic/examples/distributed_sum/fb/main.py
```

## Mast Testing
```
buck-out/gen/pytorch/elastic/torchelastic/tsm/fb/cli/tsm.par run_ddp \
  --scheduler mast \
  --base_fbpkg torchx.examples.dist_sum.base:78f01b5 \
  --fbpkg torchx.examples.dist_sum:f38ab46 \
  --run_cfg hpcClusterUuid=MastNaoTestCluster,hpcIdentity=pytorch_r2p,hpcJobOncall=pytorch_r2p \
  --nnodes 2 \
  --resource T1 \
  --nproc_per_node 4 \
  --name kiuk_jetter_test \
 pytorch/elastic/examples/distributed_sum/fb/main.py
```
Runs successfully: https://www.internalfb.com/mast/job/tsm_kiuk-kiuk_jetter_test_34c9f0fa?

Reviewed By: tierex, yifuwang

Differential Revision: D28177553

fbshipit-source-id: 29daada4bc26e5ef0949bf75215f35e557bd35b8
2021-05-12 22:10:15 -07:00
Can Balioglu
ae63b1d1c6 [torch/elastic] Revise distributed run script (#58159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58159

This PR includes the following changes:

- The `--standalone` option of `torch.distributed.run` now uses the `c10d` backend instead of `etcd` backend.

- The `import` statement for `EtcdServer` has been removed from the run script.

- The docstrings and parameter descriptions of the run script have been revised and improved.

- The default port number of `EtcdRendezvousBackend` has been changed from 29500 to 29400. The run script uses port 29500 for the distributed job store (a.k.a. `MASTER_PORT`) by default, so the two defaults no longer collide when the backend is used along with the run script.
ghstack-source-id: 128782267
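
The port change above can be illustrated with a minimal sketch, assuming two services on one host that each try to listen on the same TCP port. It uses only the standard `socket` module and does not touch `EtcdRendezvousBackend` or the run script; the helper `try_bind` and the variable names are hypothetical, and the port is chosen by the OS rather than hard-coded to 29500 or 29400.

```python
import socket


def try_bind(port, host="127.0.0.1"):
    """Return a listening socket on (host, port), or None if the port is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock
    except OSError:
        # Typically EADDRINUSE: another listener already owns this port.
        sock.close()
        return None


# Stand-in for the job store listening on MASTER_PORT.
# Port 0 asks the OS for any free port, so the sketch runs anywhere.
store = try_bind(0)
store_port = store.getsockname()[1]

# A second service that defaults to the same port cannot start.
clash = try_bind(store_port)

# A second service with a different default port starts normally.
other = try_bind(0)

print("same port bound twice:", clash is not None)
print("different port bound:", other is not None)

# Release the sockets that were actually opened.
for s in (store, other):
    if s is not None:
        s.close()
```

Giving each component its own default port means a user who accepts every default does not hit this failure.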

Test Plan:
- Run existing tests.
- Visually verified the correct rendering of the docs.

Reviewed By: tierex

Differential Revision: D28383681

fbshipit-source-id: a4098f7c23c97a2376a9c4023e81f82fedd04b10
2021-05-12 16:53:31 -07:00
Aliaksandr Ivanou
8a949f9e51 [23/n][torch/elastic][upstream] Rename torch.distributed.elastic_launch to torch.distributed.run (#56831)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56831

Renames `torch.distributed.elastic_launch` to `torch.distributed.run`.
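
As a rough sketch of what the rename means for callers, the snippet below builds the argument list for launching a script through the new module name. The helper `build_run_command` and the script name `train.py` are hypothetical; the flags `--standalone`, `--nnodes`, and `--nproc_per_node` are the ones that appear elsewhere in this log. The command is only constructed, not executed, so PyTorch does not need to be installed to run the sketch.

```python
import sys


def build_run_command(script, nnodes=1, nproc_per_node=2, standalone=True):
    """Build an argv list that launches `script` via torch.distributed.run."""
    cmd = [sys.executable, "-m", "torch.distributed.run"]
    if standalone:
        # Single-node mode; no external rendezvous endpoint is passed.
        cmd.append("--standalone")
    cmd += [
        "--nnodes", str(nnodes),
        "--nproc_per_node", str(nproc_per_node),
        script,
    ]
    return cmd


cmd = build_run_command("train.py")  # "train.py" is a placeholder script name
print(" ".join(cmd[1:]))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where PyTorch is installed.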

Test Plan:
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test/...
flow-cli canary pytorch.elastic.examples.classy_vision.main --entitlement gpu_prod --run-as-secure-group oncall_dai_pet --buck-target //fblearner/flow/projects/pytorch/elastic/examples:workflow

Reviewed By: kiukchung

Differential Revision: D27921159

fbshipit-source-id: cc7f2f035223b2d4abd7373af298998887e14c12
2021-04-29 11:06:20 -07:00