Kiuk Chung
998374a702
[tsm] add support for jetter to Role (base_image) for mast launches (#58252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58252
Pull Request resolved: https://github.com/pytorch/elastic/pull/149
1. Adds `ml_image` buck macro
2. Adds `--run_path` option to `torch.distributed.run`
3. Adds `tsm/driver/fb/test/patched/foo` (for unittesting)
4. Changes to `distributed_sum` to use `ml_image` (see Test plan for how this was tested in local and mast)
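The `--run_path` option in item 2 can be pictured with the sketch below. It assumes the option runs the training script inside the launcher's own interpreter via the standard-library `runpy.run_path`, rather than spawning `python script.py`; the helper names `run_script_in_process` and `demo` are hypothetical and are not taken from this diff.

```python
import runpy
import sys
import tempfile
from pathlib import Path


def run_script_in_process(script_path, *script_args):
    """Run a script as __main__ in the current interpreter and return its globals.

    Hypothetical helper: an assumption about how an in-process launch could
    work, not the code added by this diff.
    """
    saved_argv = sys.argv
    # The script sees its own path and arguments, as if launched directly.
    sys.argv = [script_path, *script_args]
    try:
        return runpy.run_path(script_path, run_name="__main__")
    finally:
        # Restore the caller's argv even if the script raises.
        sys.argv = saved_argv


def demo():
    """Write a tiny script to a temp dir, run it in-process, return its result."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "main.py"
        script.write_text(
            "import sys\n"
            "if __name__ == '__main__':\n"
            "    total = sum(int(arg) for arg in sys.argv[1:])\n"
        )
        script_globals = run_script_in_process(str(script), "1", "2", "3")
        return script_globals["total"]


if __name__ == "__main__":
    print(demo())
```

Running in-process avoids needing a separate Python executable on the path, which is one plausible reason to want this when the interpreter ships inside a packaged image.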
NOTE: need to enable jetter for flow and local schedulers (will do this on a separate diff since this diff is already really big)
Test Plan:
## Local Testing
```
# build the two fbpkgs (base and main)
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum.base
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum
# fetch the fbpkgs
cd ~/tmp
fbpkg fetch --symlink-tags -o -d . jetter:prod
fbpkg fetch --symlink-tags -o -d . torchx.examples.dist_sum.base
fbpkg fetch --symlink-tags -o -d . torchx.examples.dist_sum
jetter/LAST/jetter apply-and-run \
torchx.examples.dist_sum.base/LAST/torchrun \
torchx.examples.dist_sum/LAST \
-- \
--as_function \
--rdzv_id foobar \
--nnodes 1 \
--nproc_per_node 2 \
--max_restarts 0 \
--role worker \
--no_python \
~/torchx.examples.dist_sum/LAST/pytorch/elastic/examples/distributed_sum/fb/main.py
```
## Mast Testing
```
buck-out/gen/pytorch/elastic/torchelastic/tsm/fb/cli/tsm.par run_ddp \
--scheduler mast \
--base_fbpkg torchx.examples.dist_sum.base:78f01b5 \
--fbpkg torchx.examples.dist_sum:f38ab46 \
--run_cfg hpcClusterUuid=MastNaoTestCluster,hpcIdentity=pytorch_r2p,hpcJobOncall=pytorch_r2p \
--nnodes 2 \
--resource T1 \
--nproc_per_node 4 \
--name kiuk_jetter_test \
pytorch/elastic/examples/distributed_sum/fb/main.py
```
Runs successfully: https://www.internalfb.com/mast/job/tsm_kiuk-kiuk_jetter_test_34c9f0fa
Reviewed By: tierex
Differential Revision: D28421033
fbshipit-source-id: 96edcecf639143e31ec6c86ec713a2e2d7790f3d
2021-05-14 17:39:18 -07:00
Mike Ruberry
c8644326a7
Revert D28177553: [tsm] add support for jetter to Role (base_image) for mast launches
Test Plan: revert-hammer
Differential Revision:
D28177553 (8a1dab3d26)
Original commit changeset: 29daada4bc26
fbshipit-source-id: 28132684dfdc28915d5fa5217a4591fec8d880fe
2021-05-12 23:21:59 -07:00
Kiuk Chung
8a1dab3d26
[tsm] add support for jetter to Role (base_image) for mast launches
Summary:
1. Adds `ml_image` buck macro
2. Adds `--run_path` option to `torch.distributed.run`
3. Adds `tsm/driver/fb/test/patched/foo` (for unittesting)
4. Changes to `distributed_sum` to use `ml_image` (see Test plan for how this was tested in local and mast)
NOTE: need to enable jetter for flow and local schedulers (will do this on a separate diff since this diff is already really big)
Test Plan:
## Local Testing
```
# build the two fbpkgs (base and main)
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum.base
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum
# fetch the fbpkgs
cd ~/tmp
fbpkg fetch --symlink-tags -o -d . jetter:prod
fbpkg fetch --symlink-tags -o -d . torchx.examples.dist_sum.base
fbpkg fetch --symlink-tags -o -d . torchx.examples.dist_sum
jetter/LAST/jetter apply-and-run \
torchx.examples.dist_sum.base/LAST/torchrun \
torchx.examples.dist_sum/LAST \
-- \
--as_function \
--rdzv_id foobar \
--nnodes 1 \
--nproc_per_node 2 \
--max_restarts 0 \
--role worker \
--no_python \
~/torchx.examples.dist_sum/LAST/pytorch/elastic/examples/distributed_sum/fb/main.py
```
## Mast Testing
```
buck-out/gen/pytorch/elastic/torchelastic/tsm/fb/cli/tsm.par run_ddp \
--scheduler mast \
--base_fbpkg torchx.examples.dist_sum.base:78f01b5 \
--fbpkg torchx.examples.dist_sum:f38ab46 \
--run_cfg hpcClusterUuid=MastNaoTestCluster,hpcIdentity=pytorch_r2p,hpcJobOncall=pytorch_r2p \
--nnodes 2 \
--resource T1 \
--nproc_per_node 4 \
--name kiuk_jetter_test \
pytorch/elastic/examples/distributed_sum/fb/main.py
```
Runs successfully: https://www.internalfb.com/mast/job/tsm_kiuk-kiuk_jetter_test_34c9f0fa
Reviewed By: tierex, yifuwang
Differential Revision: D28177553
fbshipit-source-id: 29daada4bc26e5ef0949bf75215f35e557bd35b8
2021-05-12 22:10:15 -07:00
Can Balioglu
ae63b1d1c6
[torch/elastic] Revise distributed run script (#58159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58159
This PR includes the following changes:
- The `--standalone` option of `torch.distributed.run` now uses the `c10d` backend instead of the `etcd` backend.
- The `import` statement for `EtcdServer` has been removed from the run script.
- The docstrings and parameter descriptions of the run script have been revised and improved.
- The default port number of `EtcdRendezvousBackend` has been changed from 29500 to 29400 so that it no longer collides with the run script, which by default uses port 29500 for the distributed job store (a.k.a. `MASTER_PORT`).
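The reason for the port change in the last item can be illustrated with a minimal sketch: two listeners on one host cannot both hold the same TCP port, so two services that share a default port will collide. The code uses only the standard `socket` module and OS-assigned ports rather than 29500/29400, so it does not depend on those ports being free; the function name `second_listener_starts` is hypothetical and this is not code from the PR.

```python
import socket


def second_listener_starts(same_port: bool, host: str = "127.0.0.1") -> bool:
    """Return True if a second listener can bind while a first one is running.

    The first socket stands in for the distributed job store and the second
    for a rendezvous backend (illustrative roles only).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as first:
        # Port 0 asks the OS for any free port, keeping the demo portable.
        first.bind((host, 0))
        first.listen()
        taken_port = first.getsockname()[1]

        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as second:
            try:
                # Either reuse the port that is already taken, or ask the OS
                # for a different free one.
                second.bind((host, taken_port if same_port else 0))
            except OSError:
                # Address already in use: the two defaults collide.
                return False
            return True


if __name__ == "__main__":
    print("shared default port works:", second_listener_starts(same_port=True))
    print("distinct default ports work:", second_listener_starts(same_port=False))
```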
ghstack-source-id: 128782267
Test Plan:
- Run existing tests.
- Visually verified the correct rendering of the docs.
Reviewed By: tierex
Differential Revision: D28383681
fbshipit-source-id: a4098f7c23c97a2376a9c4023e81f82fedd04b10
2021-05-12 16:53:31 -07:00
Aliaksandr Ivanou
8a949f9e51
[23/n][torch/elastic][upstream] Rename torch.distributed.elastic_launch to torch.distributed.run (#56831)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56831
Rename torch.distributed.elastic_launch to torch.distributed.run
Test Plan:
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test/...
flow-cli canary pytorch.elastic.examples.classy_vision.main --entitlement gpu_prod --run-as-secure-group oncall_dai_pet --buck-target //fblearner/flow/projects/pytorch/elastic/examples:workflow
Reviewed By: kiukchung
Differential Revision: D27921159
fbshipit-source-id: cc7f2f035223b2d4abd7373af298998887e14c12
2021-04-29 11:06:20 -07:00