Commit Graph

72 Commits

Author SHA1 Message Date
Aliaksandr Ivanou
e54c1f6c90 [torch][elastic] Make final agent barrier to shutdown properly
Summary:
When workers finish their work, the TE agent starts the `synchronize_barrier` procedure. The barrier waits for the other agents at the end of execution.

A race condition may occur: the barrier uses a TCPStore that is located on Rank0. When Rank0 finishes its work, other ranks may still be in the process of executing the `get_all` method. Some of them will then fail because the TCPStore has been destroyed.

The fix adds an additional check on the Rank0 process: Rank0 now waits for all other ranks to finish before terminating.
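
A minimal sketch of the idea, not the agent's actual code: threads stand in for agents, and a hypothetical `InMemoryStore` stands in for the TCPStore hosted by Rank0. The key names and the `finish` helper are assumptions made for illustration.

```python
import threading


class InMemoryStore:
    """Hypothetical stand-in for a key-value store hosted by rank 0."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()
        self.closed = False

    def set(self, key, value):
        with self._cond:
            if self.closed:
                raise RuntimeError("store already destroyed")
            self._data[key] = value
            self._cond.notify_all()

    def wait_for(self, keys, timeout):
        with self._cond:
            ready = self._cond.wait_for(
                lambda: all(k in self._data for k in keys), timeout
            )
            if not ready:
                raise TimeoutError(f"timed out waiting for {keys}")

    def close(self):
        with self._cond:
            self.closed = True


def finish(store, rank, world_size, timeout=5.0):
    # Phase 1: the usual barrier. Announce arrival, then wait for everyone.
    store.set(f"arrived/{rank}", b"1")
    store.wait_for([f"arrived/{r}" for r in range(world_size)], timeout)
    # Phase 2: acknowledge that this rank no longer needs the store.
    store.set(f"done/{rank}", b"1")
    if rank == 0:
        # Rank 0 owns the store, so it leaves last.
        store.wait_for([f"done/{r}" for r in range(world_size)], timeout)
        store.close()
```

Without the second phase, rank 0 could close the store while another rank is still reading the arrival keys.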

Test Plan: unit tests

Differential Revision: D35227180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74931
Approved by: https://github.com/kiukchung
2022-04-15 20:29:05 +00:00
Kiuk Chung
766eba60f7 (torchx/elastic) honor NCCL_ASYNC_ERROR_HANDLING set from the env var (#73982)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73982

Currently there is no way for torchelastic users to override the setting with `NCCL_ASYNC_ERROR_HANDLING=0`. This PR makes that possible.
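
One way to honor a user-supplied value is sketched below. The helper name `worker_env` and the fallback value `"1"` are assumptions for illustration, not taken from the PR.

```python
import os


def worker_env(base_env=None, default="1"):
    """Build the environment for a worker process (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    # setdefault keeps a value the user already exported (including "0")
    # and only fills in the default when the variable is absent.
    env.setdefault("NCCL_ASYNC_ERROR_HANDLING", default)
    return env
```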

Test Plan:
Added unittests

Manual testing
```
$ torchx run fb.dist.ddp -- --img torchx_examples -m print_env_vars.py --env NCCL_ASYNC_ERROR_HANDLING=0
```

Validated that `NCCL_ASYNC_ERROR_HANDLING` in the process running `print_env_vars.py` is indeed `0`.

Reviewed By: mannatsingh, aivanou

Differential Revision: D34765786

fbshipit-source-id: 3f9f6d3b61e7d265adf689d387e020ab534c9259
(cherry picked from commit 2b787b46c6d37f049fe39eb64eecedf68799e75c)
2022-03-11 01:03:54 +00:00
Kiuk Chung
b08309ee0a (torch/elastic) skip logging structured error info if error_file is not set (#73477)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73477

resolves https://github.com/pytorch/pytorch/issues/73465

This `log.error` is not necessary (and it is also not formatted in a human-friendly way) because we end up re-raising the same exception after recording it into an error_file (if present). Eventually Python handles this error the way it handles any other error and writes the trace info to the console. The additional logging produces duplicate error prints on the console, which affects all users whose schedulers do not set the `TORCHELASTIC_ERROR_FILE` env var when calling `torch.distributed.run`.
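
The behavior described above can be sketched with a small decorator. This is an illustration only: `record_errors` and the JSON layout it writes are hypothetical, and only the `TORCHELASTIC_ERROR_FILE` variable name comes from the text.

```python
import functools
import json
import os
import traceback


def record_errors(fn):
    """Record an exception to the error file if one is configured, then re-raise."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            error_file = os.environ.get("TORCHELASTIC_ERROR_FILE")
            if error_file:
                with open(error_file, "w") as f:
                    json.dump(
                        {"message": str(exc), "traceback": traceback.format_exc()}, f
                    )
            # No extra log.error here: the re-raised exception is printed
            # once by the interpreter, so logging it again would duplicate it.
            raise

    return wrapper
```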

Test Plan:
Induce an error on the agent process by `kill -15 $AGENT_PID`
```
python -m torch.distributed.run \
  --nproc_per_node 2 \
  --nnodes 1:1 \
  --rdzv_backend c10d \
  --rdzv_endpoint localhost:29500 \
  --monitor_interval 3 test.py
```

Produces

{F704936697}

In contrast to the duplicated error before:

{F704936729}

Reviewed By: d4l3k

Differential Revision: D34501852

fbshipit-source-id: 14fed18a9664130980205007ff104ff15a5fd4f8
(cherry picked from commit 0b7c51ba8834f4a4a5376f585c0795cb43be6521)
2022-03-01 19:31:44 +00:00
Can Balioglu
6e640a0acf Revise the socket implementation of c10d (#68226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68226

**Note that this PR is unusually big due to the urgency of the changes. Please reach out to me in case you wish to have a "pair" review.**

This PR introduces a major refactoring of the socket implementation of the C10d library. A big portion of the logic is now contained in the `Socket` class and a follow-up PR will further consolidate the remaining parts. As of today the changes in this PR offer:

 - significantly better error handling and much more verbose logging (see the example output below)
 - explicit support for IPv6 and dual-stack sockets
 - correct handling of signal interrupts
 - better Windows support

A follow-up PR will consolidate `send`/`recv` logic into `Socket` and fully migrate to non-blocking sockets.

## Example Output

```
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[W logging.h:28] The server socket on [localhost]:29501 is not yet listening (Error: 111 - Connection refused), retrying...
[I logging.h:21] The server socket will attempt to listen on an IPv6 address.
[I logging.h:21] The server socket is attempting to listen on [::]:29501.
[I logging.h:21] The server socket has started to listen on [::]:29501.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42650.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42650.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42722.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42722.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42724.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42724.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42726.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42726.
```
ghstack-source-id: 143501987

Test Plan: Run existing unit and integration tests on devserver, Fedora, Ubuntu, macOS Big Sur, Windows 10.

Reviewed By: Babar, wilson100hong, mrshenli

Differential Revision: D32372333

fbshipit-source-id: 2204ffa28ed0d3683a9cb3ebe1ea8d92a831325a
2021-11-16 20:49:25 -08:00
Jane Xu
36d9a74bc6 Enforce that test cases extend from correct TestCase (#67819)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/66903

The main change is in `torch/testing/_internal/common_utils.py`; everything else fixes lint errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67819

Reviewed By: H-Huang

Differential Revision: D32259978

Pulled By: janeyx99

fbshipit-source-id: 39c5ffbaa510e1e533d6bdacf5c6158a3dd9885d
2021-11-08 18:28:36 -08:00
Kiuk Chung
f6402c469e (torch/elastic) fix scale down bug caused by calling rdzv_handler.shutdown() on premature agent failures (#67749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67749

Fixes: https://github.com/pytorch/pytorch/issues/67742

Test Plan:
Added unittests.

Validated manually:

```
# start agent 0
$ torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py

# start agent 1
torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py

# kill agent 0
CTRL+C (SIGINT) or kill -15 (SIGTERM)

# restart it
torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py
```

Reviewed By: cbalioglu

Differential Revision: D32129005

fbshipit-source-id: db292268250ef6f1e06f5b4c5bd67124d8dfd325
2021-11-05 12:18:46 -07:00
Jane Xu
a23814577b Overload TestCase not vanilla TestCase for some elastic tests (#67700)
Summary:
Addresses a bit of https://github.com/pytorch/pytorch/issues/66903

Fixes it so that https://github.com/pytorch/pytorch/issues/66207 can be properly disabled

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67700

Reviewed By: H-Huang

Differential Revision: D32116908

Pulled By: janeyx99

fbshipit-source-id: 205ff68a7408609cfced2357fd99f41949ef6390
2021-11-03 11:14:52 -07:00
Jane Xu
251278d385 [skip ci] set more tests with owners for distributed and elastic (#67583)
Summary:
It turns out my lint doesn't work on CI all the time because of shell differences. I'm working on a new more comprehensive lint in https://github.com/pytorch/pytorch/pull/66826 and it'd be nice if these could be cleared first.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67583

Reviewed By: H-Huang, mruberry

Differential Revision: D32045155

Pulled By: janeyx99

fbshipit-source-id: ecfe9f008310c28e3b731e246c2b2ed0106d03b1
2021-11-01 12:26:03 -07:00
Jane Xu
eb8b80b76f Add test owners for elastic tests (#67293)
Summary:
Follow-up to a discussion with the distributed and r2p teams: the tests under elastic in distributed should be owned by oncall: r2p, not distributed.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67293

Reviewed By: jbschlosser

Differential Revision: D31973779

Pulled By: janeyx99

fbshipit-source-id: 05875a7600c6eb1da1310a48e1e32a1a69461c55
2021-10-28 08:32:50 -07:00
Howard Huang
d7ac6e977a Fix test_create_store_multi flaky test (#66953)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66953

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: kiukchung

Differential Revision: D31802767

Pulled By: H-Huang

fbshipit-source-id: a430e242788aac164496d4e65b85bf326537d019
2021-10-26 11:08:51 -07:00
Aliaksandr Ivanou
018e06edca [torchelastic] Skip tests in tsan mode (#67103)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67103

Skip tests in tsan mode for now. More info: T104010063

Test Plan: sandcastle + running tests in mode/dev-tsan

Reviewed By: d4l3k

Differential Revision: D31861426

fbshipit-source-id: d50e5d06afbc82ccce6d102e52f72b5b01f6f41a
2021-10-22 15:55:18 -07:00
Kiuk Chung
df11e2d6f9 (torch/elastic) add fqdn hostname to error printout (#66182)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66182

closes https://github.com/pytorch/pytorch/issues/63174

Does a few things:

1. adds hostname to the error report
2. moves the "root cause" section to the end (presumably since the logs are being "tailed" we want the root cause to appear at the end)
3. moves redundant error info logging to debug
4. makes the border max 60 char in length and justifies left for the header

NOTE: YOU HAVE TO annotate your main function with `torch.distributed.elastic.multiprocessing.errors.record`; otherwise no traceback is printed (Python exception propagation does NOT work out of the box across IPC, hence the extra `record` annotation).

Test Plan:
Sample

```
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2021-10-05_17:37:22
  host      : devvm4955.prn0.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3296201)
  error_file: /home/kiuk/tmp/elastic/none_3_lsytqe/attempt_0/0/error.json
  traceback :
  Traceback (most recent call last):
    File "/tmp/jetter.xr3_x6qq/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 372, in wrapper
      return f(*args, **kwargs)
    File "main.py", line 28, in main
      raise RuntimeError(args.throws)
  RuntimeError: foobar

============================================================
```

Reviewed By: cbalioglu, aivanou

Differential Revision: D31416492

fbshipit-source-id: 0aeaf6e634e23ce0ea7f6a03b12c8a9ac57246e9
2021-10-07 01:40:02 -07:00
Howard Huang
a95fabfecb Fix port allocation race condition for elastic test (#65149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65149

Fixes #64789

There is a race condition between the moment a free port is acquired and the moment it is used to create the store: in that window, another process may have taken the port. Since this test only checks that the timeout is triggered for TCPStore, we can bind to any port on TCPStore creation.

This only affects the test on the server (since that is where the port is used), but I changed both tests for clarity.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang cbalioglu gcramer23

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30993166

Pulled By: H-Huang

fbshipit-source-id: eac4f28d641ac87c4ebee89df83f90955144f2f1
2021-09-17 08:32:47 -07:00
Pritam Damania
2d671ca41b [8/N] Remove c10d/ddp fork tests. (#63454)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63454

Continuation of https://github.com/pytorch/pytorch/pull/63443, this
PR removes all fork tests from torch.distributed.
ghstack-source-id: 136285511

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D30387872

fbshipit-source-id: f6d6313db126ae7b95b86f78a1e0726887c5c513
2021-08-20 12:23:18 -07:00
Pritam Damania
d565a7bd68 [6/N] Enable opt-asan for elastic and launcher tests. (#63442)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63442

Continuation of https://github.com/pytorch/pytorch/pull/62051, I've
enabled elastic and launcher tests to run in opt-asan mode which is supported
with spawn multiprocessing.

This allows us to completely get rid of fork based tests from torch.distributed
and have all tests run in spawn mode.
ghstack-source-id: 136057123

Test Plan: waitforbuildbot

Reviewed By: cbalioglu

Differential Revision: D30384267

fbshipit-source-id: ad3447cfb9d6e31e7ec8332d64c8ff1054858dcb
2021-08-18 10:48:49 -07:00
Pritam Damania
82d81455ae [2/N] Remove unittest.skip across all of torch.distributed. (#61887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887

1) Introduced a `sandcastle_skip_if` decorator that ensures these
tests are reported as passed on sandcastle instead of skipped.
2) Fixed all test files under `test/distributed` to not use `unittest.skip`.

The overall goal is to avoid using skips, since sandcastle tags these tests as
continuously skipping.
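
A rough sketch of such a decorator, under stated assumptions: the name `pass_instead_of_skip_if`, the `on_ci` parameter, and the `SANDCASTLE` environment variable are hypothetical, and this is not the implementation of `sandcastle_skip_if`.

```python
import functools
import os
import unittest


def pass_instead_of_skip_if(condition, reason, on_ci=None):
    """On the CI system, return early (test passes); elsewhere, skip as usual."""
    if on_ci is None:
        # Assumed detection mechanism; the real one may differ.
        on_ci = os.environ.get("SANDCASTLE") == "1"

    def decorator(fn):
        if not on_ci:
            return unittest.skipIf(condition, reason)(fn)

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if condition:
                print(f"Skipping {fn.__name__}: {reason}")
                return None
            return fn(*args, **kwargs)

        return wrapper

    return decorator
```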
ghstack-source-id: 134382237

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D29784152

fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d
2021-07-27 10:53:23 -07:00
Neel Pragnesh Gandhi
f2369f12f9 Add logging for dynamic rendezvous (#61822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61822

Added scuba logging to the following files:
- dynamic_rendezvous.py
- c10d_rendezvous_backend.py

NOTE: This diff introduces the use of python's inspect module to easily allow for obtaining the calling method name and filename when logging. This module can mess with python's garbage collector, so special care was taken to never store references to results from inspect.stack() longer than absolutely needed.
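
The precaution mentioned in the note can be illustrated as follows. The helper `caller_info` is hypothetical; it only shows the pattern of copying out the needed strings and dropping the `inspect.stack()` result right away.

```python
import inspect
import os


def caller_info():
    """Return (function name, file name) of whoever called this helper."""
    stack = inspect.stack()
    try:
        # Copy out plain strings only; index 1 is the caller's frame record.
        name = stack[1].function
        filename = os.path.basename(stack[1].filename)
    finally:
        # Frame objects can form reference cycles, so release them promptly.
        del stack
    return name, filename
```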

Test Plan:
The following tests can be run.
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/rendezvous:c10d_rendezvous_backend_test
```
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/rendezvous:dynamic_rendezvous_test
```
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/events:lib_test
```

Reviewed By: aivanou

Differential Revision: D29643774

fbshipit-source-id: f10cd5ebf8f6860856267bc2483c0b85faacb0fd
2021-07-26 09:39:09 -07:00
Aliaksandr Ivanou
0c55f1bdec [torchelastic] Improve process termination logic (#61602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61602

The diff introduces signal handlers and a `SignalException` that is raised when the agent process receives SIGTERM or SIGINT.

When either of these signals is received, the termination handler raises the `SignalException`. The exception is then processed by the main agent loop, which invokes `shutdown(signum)` to propagate the received signal to the child processes. A default timeout of 30 seconds is introduced: if the child processes cannot terminate gracefully within this timeout, the agent process kills them via SIGKILL.

Test Plan: unittests, sandcastle

Reviewed By: cbalioglu

Differential Revision: D29671783

fbshipit-source-id: 3dbca2125676dc18d417cc3e3bb0301fdd42737a
2021-07-23 11:00:15 -07:00
Kiuk Chung
5a2f41a2db [torch/distributed.elastic] Fix utils.distributed_test.test_create_store_timeout_on_server to be dual-stack ip compatible (#60558)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60558

Fixes 1/2 flaky tests as described in: https://github.com/pytorch/pytorch/issues/60260

`test_create_store_timeout_on_server` tests whether trying to create a `c10d::TCPStore` server on an already taken port actually fails with an `IOError`. Prior to this change the `utils.get_socket_with_port()` util method was used to synthetically reserve a port, then try creating the `TCPStore` on that port to validate the `IOError`. The issue with this is that on a dual stack ip setup, `get_socket_with_port()` (since it uses `socket.AF_UNSPEC`) reserves an ipv6 port, while `TCPStore` will try binding to an ipv4 port, so an `IOError` is not observed.

This change makes the test create two `TCPStore` servers. The first chooses a free port (by passing `server_port=0`) while the second tries to create a `TCPStore` server on the port that the first store is already running on. This induces an `IOError` in the second store's constructor.

NOTE: this change does not solve another, broader issue with `TCPStore` where the server and workers can listen and connect on ipv4 vs ipv6 when they are running on dual-stack ip hosts without an ipv4 DNS entry and/or an `/etc/gai.conf` specifying the preferred bind ordering. See: https://github.com/pytorch/pytorch/pull/49124
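
The two-server strategy can be illustrated with plain sockets instead of `TCPStore`. The helper `bind_server` is hypothetical; in Python 3, `IOError` is an alias of `OSError`, which is what a second bind on a taken port raises here.

```python
import socket


def bind_server(port, host="127.0.0.1"):
    """Bind and listen on (host, port); port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock
```

Because both servers use the same address family, the conflict is deterministic: the first socket still holds the port when the second one tries to bind it.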

Test Plan:
```
buck test //caffe2/test/distributed/elastic/utils:distributed_test
```

Reviewed By: cbalioglu

Differential Revision: D29334947

fbshipit-source-id: 76b998c59082cb04c0e86b7a1f3b509367fa0136
2021-06-23 17:12:18 -07:00
Aliaksandr Ivanou
8f03018980 [pytorch] Move signal handler test to internal codebase (#60394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60394

Move signal handler test to internal codebase

Github issue: https://github.com/pytorch/pytorch/issues/60260

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing:api_test

    buck test mode/dev-nosan //caffe2/torch/distributed/elastic/multiprocessing/fb/test:api_test

Reviewed By: cbalioglu

Differential Revision: D29273160

fbshipit-source-id: e4ae72f7f6d54cbba324119fce7446a30a6c37c9
2021-06-21 18:26:41 -07:00
Rong Rong (AI Infra)
510334f34b [BE] clean up IS_PYTORCH_CI and IN_CI (#60279)
Summary:
`IS_PYTORCH_CI` and `IN_CI` are used inconsistently, and in some cases `IN_CI` is not set at all because it only exists in .circleci/scripts/setup_ci_environment.sh. This PR cleans up the two flags and uses only `IN_CI`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60279

Test Plan: CI

Reviewed By: seemethere

Differential Revision: D29239545

Pulled By: walterddr

fbshipit-source-id: a069424a2bb8790a3adfdaf0dc460301026bf8c7
2021-06-20 19:45:07 -07:00
Neel Pragnesh Gandhi
2c5db9a40a Add c10d filestore functionality to the current c10d_rendezvous_backend (#59719)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59719

Added filestore functionality to the c10d backend. If FileStore is selected as the store type, it creates a temporary file in the /tmp directory to use. Appropriate tests were added as well.
FileStore was modified to expose the path field for testing. It was also modified so that the numWorkers field in the constructor is optional (defaulting to -1). A negative value indicates that there is no fixed number of workers. In this case, no attempt is made to clean up the file at the end.
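
The cleanup rule can be sketched with a toy class. `TinyFileStore` and its `release` method are hypothetical stand-ins that only mirror the rule described above; they are not c10d's `FileStore`.

```python
import os
import tempfile


class TinyFileStore:
    """Toy file-backed store illustrating the optional worker count."""

    def __init__(self, path=None, num_workers=-1):
        if path is None:
            # Create the backing file in the system temp directory.
            fd, path = tempfile.mkstemp(prefix="store_")
            os.close(fd)
        self.path = path  # exposed so tests can inspect it
        self.num_workers = num_workers
        self._released = 0

    def release(self):
        """Called once per worker when it is done with the store."""
        self._released += 1
        # With a negative worker count nobody knows who is last,
        # so the file is left in place.
        if self.num_workers >= 0 and self._released >= self.num_workers:
            if os.path.exists(self.path):
                os.remove(self.path)
```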

Test Plan: Unit tests for creating a c10d backend with filestore and simple error handling.

Reviewed By: cbalioglu, H-Huang

Differential Revision: D28997436

fbshipit-source-id: 24c9b2c9b13ea6c947e8b1207beda892bdca2217
2021-06-16 12:13:36 -07:00
Aliaksandr Ivanou
1735775662 [Torch] Cast timestamp type to int (#59712)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59712

When a worker process fails in fb due to a signal failure, the TerminationHandler writes an error reply file. Recently the error reply file format was changed for mast jobs. The JSON value of ``timestamp`` is a string, even though it is an int in the thrift struct: https://fburl.com/diffusion/upa228u5

This diff adds support for casting a str timestamp to int.
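
A minimal sketch of the cast, assuming a reply file with a top-level `timestamp` key. The function name `read_timestamp` is hypothetical.

```python
import json


def read_timestamp(reply_json):
    """Return the timestamp as an int, whether it was stored as 12 or "12"."""
    data = json.loads(reply_json)
    # int() accepts both an int and a decimal string.
    return int(data["timestamp"])
```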

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test

Reviewed By: suphoff

Differential Revision: D28995827

fbshipit-source-id: 333448cfb4d062dc7fe751ef5839e66bfcb3ba00
2021-06-09 15:56:37 -07:00
Can Balioglu
44c442293f [torch/elastic] Fix the edge case when no node is alive (#59663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59663

This PR fixes an edge case bug in `DynamicRendezvousHandler` where the state of the rendezvous is not always entirely updated when one or more nodes are not alive anymore.

Test Plan: Run the existing and newly-introduced unit tests.

Reviewed By: tierex

Differential Revision: D28971809

fbshipit-source-id: ebbb6a5f2b04f045c3732d6cf0f8fdc7c2381a7c
2021-06-09 15:31:50 -07:00
Neel Pragnesh Gandhi
47e286d024 Merge c10d elastic agent tests into local_elastic_agent_test.py file (#59657)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59657

Introduces tests that exercise the elastic agent with the c10d and etcd-v2 rendezvous backends.
Added a port allocation method that uses sockets to find an available port for the c10d backend. This way, all agents that are created share the specified address/port and can communicate.
Added a method that abstracts the backend to use when running a test. This way, any test can quickly be switched to run on the backend of choice (c10d, etcd, or etcd-v2).

Test Plan: Tests various components of the elastic agent with 3 different backends: etcd, etcd-v2, and c10d.

Reviewed By: tierex

Differential Revision: D28972604

fbshipit-source-id: fd4cff6417fefdf0de9d7a114820914b968006a8
2021-06-09 14:28:59 -07:00
Can Balioglu
ae63b1d1c6 [torch/elastic] Revise distributed run script (#58159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58159

This PR includes the following changes:

- The `--standalone` option of `torch.distributed.run` now uses the `c10d` backend instead of `etcd` backend.

- The `import` statement for `EtcdServer` has been removed from the run script.

- The docstrings and parameter descriptions of the run script have been revised and improved.

- The default port number of `EtcdRendezvousBackend` has been changed from 29500 to 29400 to improve the user experience when used along with the run script which uses the port 29500 for the distributed job store (a.k.a. `MASTER_PORT`) by default.
ghstack-source-id: 128782267

Test Plan:
- Run existing tests.
- Visually verified the correct rendering of the docs.

Reviewed By: tierex

Differential Revision: D28383681

fbshipit-source-id: a4098f7c23c97a2376a9c4023e81f82fedd04b10
2021-05-12 16:53:31 -07:00
Can Balioglu
1d4d9ffca0 [torch/elastic] Refactor rendezvous store initialization logic (#58057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58057

This PR refactors the store initialization logic and moves it to the `create_backend` function for both C10d and etcd backends.
ghstack-source-id: 128671579

Test Plan: Run the existing and revised tests.

Reviewed By: tierex

Differential Revision: D28356587

fbshipit-source-id: caf9416ab811eefe4834268d8a11a48f2236ed5b
2021-05-11 13:46:07 -07:00
Can Balioglu
e5e095cbe4 [torch/elastic] Rename etcd-/c10d-experimental to etcd-v2 and c10d (#57764)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57764

As discussed offline this PR renames etcd-experimental backend to etcd-v2 and c10d-experimental backend to c10d.
ghstack-source-id: 128342523

Test Plan: Run the existing unit tests.

Reviewed By: kiukchung

Differential Revision: D28263739

fbshipit-source-id: c3409037ecea5a8ff6daadeeb1f2fb4205cc3852
2021-05-06 19:51:53 -07:00
Can Balioglu
bf6e3425b0 [23/n] [torch/elastic] Introduce the implementation of DynamicRendezvousHandler (#57151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57151

This PR introduces the implementation of `DynamicRendezvousHandler`, which mostly builds on the types introduced in previous PRs.
ghstack-source-id: 127685212

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28060531

fbshipit-source-id: 844ff0e9c869f2bbb85fba05a16002d00eae130f
2021-05-03 18:32:43 -07:00
Can Balioglu
a357fc8a4b [22/n] [torch/elastic] Introduce a new from_backend static constructor for DynamicRendezvousHandler (#57150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57150

This PR refactors the `__init__` method of `DynamicRendezvousHandler` to a `from_backend` static constructor for easier testing and future extensibility.
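
The general pattern can be sketched as follows. The `Handler` class, its parameters, and the injected `clock` are hypothetical and do not reflect the real signature of `DynamicRendezvousHandler`.

```python
import time


class Handler:
    """Hypothetical handler; __init__ takes every collaborator explicitly."""

    def __init__(self, backend, min_nodes, max_nodes, clock):
        self.backend = backend
        self.min_nodes = min_nodes
        self.max_nodes = max_nodes
        self.clock = clock

    @classmethod
    def from_backend(cls, backend, min_nodes, max_nodes):
        """Validate user input and fill in production defaults."""
        if min_nodes < 1:
            raise ValueError("min_nodes must be at least 1")
        if max_nodes < min_nodes:
            raise ValueError("max_nodes must be >= min_nodes")
        return cls(backend, min_nodes, max_nodes, clock=time.monotonic)
```

Tests can call `Handler(...)` directly with fakes (for example a fixed clock), while production code goes through `from_backend`.
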
ghstack-source-id: 127685183

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28060336

fbshipit-source-id: b07dcbb61e8ff5a536b7b021cd50438010c648dd
2021-05-03 18:32:42 -07:00
Can Balioglu
4a10bd3b58 [21/n] [torch/elastic] Introduce _RendezvousJoinOp (#57149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57149

This PR introduces the `_RendezvousJoinOp` type that represents a rendezvous join operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685142

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059785

fbshipit-source-id: 6e67a54289eef1a2349fcc52f8841e49c139459a
2021-05-03 18:32:40 -07:00
Can Balioglu
81ef683cb3 [20/n] [torch/elastic] Introduce _RendezvousExitOp (#57148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57148

This PR introduces the `_RendezvousExitOp` type that represents a rendezvous exit operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685094

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059764

fbshipit-source-id: 2da428885f1390957242fdd82d68cee2ac273c71
2021-05-03 18:32:38 -07:00
Can Balioglu
baf8f4c0a6 [19/n] [torch/elastic] Introduce _RendezvousKeepAliveOp (#57147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57147

This PR introduces the `_RendezvousKeepAliveOp` type that represents a rendezvous keep-alive heartbeat operation to be executed via a `_RendezvousOpExecutor`.
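
As a rough illustration of what a keep-alive heartbeat enables, the sketch below records per-node heartbeats and reports nodes whose last heartbeat is older than a timeout. The function names and the plain-dict state are assumptions, not the op's actual implementation.

```python
from datetime import datetime, timedelta


def record_heartbeat(last_heartbeats, node, now):
    """Store the time at which `node` last reported that it is alive."""
    last_heartbeats[node] = now


def dead_nodes(last_heartbeats, now, timeout):
    """Return the nodes whose last heartbeat is older than `timeout`."""
    return sorted(
        node for node, seen in last_heartbeats.items() if now - seen > timeout
    )
```
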
ghstack-source-id: 127685037

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059733

fbshipit-source-id: 31fd8fc06f03d8f9cd21558b15a06dea7ad85bc6
2021-05-03 18:32:37 -07:00
Can Balioglu
3e024fcfc9 [18/n] [torch/elastic] Introduce _RendezvousCloseOp (#57146)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57146

This PR introduces the `_RendezvousCloseOp` type that represents a rendezvous close operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127684991

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059693

fbshipit-source-id: 6c944d3b4f6a6ed2057ea2921ae8a42609998dd2
2021-05-03 18:32:35 -07:00
Can Balioglu
aa5d35e1d7 [17/n] [torch/elastic] Introduce _DistributedRendezvousOpExecutor (#57145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57145

This PR introduces the `_DistributedRendezvousOpExecutor` type that implements the `_RendezvousOpExecutor` interface for rendezvous shared via a `_RendezvousStateHolder`.
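
The op/executor split can be pictured with a local, non-distributed toy: an op inspects the state and returns the next action, and the executor applies it until the op reports completion. All names here (`Action`, `State`, `close_op`, `LocalExecutor`) are hypothetical, and the real executor additionally synchronizes state across nodes.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    FINISH = 1
    MARK_CLOSED = 2


@dataclass
class State:
    closed: bool = False


def close_op(state):
    """Decide the next action; the op itself never mutates the state."""
    return Action.FINISH if state.closed else Action.MARK_CLOSED


class LocalExecutor:
    """Applies the actions requested by an op until it reports FINISH."""

    def __init__(self, state):
        self.state = state

    def run(self, op, max_steps=10):
        for _ in range(max_steps):
            action = op(self.state)
            if action is Action.FINISH:
                return
            if action is Action.MARK_CLOSED:
                self.state.closed = True
        raise RuntimeError("op did not finish")
```
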
ghstack-source-id: 127684945

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059417

fbshipit-source-id: 7ef72ea16b54eaaa11a6ece7459d385d49692a84
2021-05-03 18:31:23 -07:00
Can Balioglu
76bccfb2e0 [15/n] [torch/elastic] Introduce _RendezvousStateHolder (#56538)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56538

This PR introduces the `_RendezvousStateHolder` interface and its accompanying `_BackendRendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state with the other nodes.
ghstack-source-id: 127684796

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D27892600

fbshipit-source-id: a55d884a1f9b0d742787be4dff4271e076c08962
2021-05-03 12:17:18 -07:00
Can Balioglu
1b745efbe8 [14/n] Introduce a name attribute to _PeriodicTimer (#57143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57143

This PR introduces a `name` attribute in `_PeriodicTimer` for testing and debugging purposes.
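
A minimal sketch of a named periodic timer, assuming a thread-based design. `PeriodicTimer` below is a hypothetical stand-in, not the `_PeriodicTimer` class itself; the name is passed to the worker thread so it shows up in debuggers and thread dumps.

```python
import threading


class PeriodicTimer:
    """Calls `function` every `interval` seconds on a named daemon thread."""

    def __init__(self, interval, function, name=None):
        self._stop = threading.Event()
        self._thread = threading.Thread(
            target=self._run, args=(interval, function), name=name, daemon=True
        )

    @property
    def name(self):
        return self._thread.name

    def start(self):
        self._thread.start()

    def cancel(self):
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()

    def _run(self, interval, function):
        # Event.wait returns False on timeout and True once cancel() is called.
        while not self._stop.wait(interval):
            function()
```
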
ghstack-source-id: 127684751

Test Plan: Run the new and updated unit tests.

Reviewed By: tierex

Differential Revision: D28059045

fbshipit-source-id: 9eb067300aea21a99577e6cd8a354f7eb749f4a6
2021-05-03 11:37:05 -07:00
Can Balioglu
233004b4c8 [13/n] Extend the return type of RendezvousBackend's set_state method (#57142)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57142

This PR extends the return type of `RendezvousBackend`'s `set_state` method with an additional boolean flag that specifies whether the write attempt has succeeded.
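
The benefit of such a flag can be shown with an in-memory compare-and-set backend. `InMemoryBackend`, its integer tokens, and the exact tuple layout are assumptions for illustration rather than the actual `RendezvousBackend` interface.

```python
from typing import Optional, Tuple


class InMemoryBackend:
    """Toy backend: a write succeeds only if the caller's token is current."""

    def __init__(self):
        self._state: Optional[bytes] = None
        self._token = 0

    def get_state(self) -> Optional[Tuple[bytes, int]]:
        if self._state is None:
            return None
        return self._state, self._token

    def set_state(
        self, state: bytes, token: Optional[int] = None
    ) -> Tuple[Optional[bytes], int, bool]:
        expected = 0 if token is None else token
        if expected == self._token:
            self._state = state
            self._token += 1
            return self._state, self._token, True
        # Stale token: hand back the current state and report the failure,
        # so the caller does not have to compare payloads to find out.
        return self._state, self._token, False
```
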
ghstack-source-id: 127629538

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28058980

fbshipit-source-id: 26333790c39386891beb155b20ba1291d2cbdd03
2021-05-03 11:37:03 -07:00
Can Balioglu
a6f60cf4f0 [12/n] Rename last_keep_alives to last_heartbeats in _RendezvousState (#57141)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57141

Per feedback this PR renames `last_keep_alives` to `last_heartbeats` in `_RendezvousState`.
ghstack-source-id: 127629442

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28058948

fbshipit-source-id: 0db12eac56a47a426a7a48fb5c93ac6a08b0d22e
2021-05-03 11:37:01 -07:00
Can Balioglu
3209364724 [11/n] [torch/elastic] Add heartbeat timeout to RendezvousTimeout (#57140)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57140

This PR introduces a new `heartbeat` attribute in `RendezvousTimeout`.
ghstack-source-id: 127626815

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28058908

fbshipit-source-id: c6f8b3a06210cc59714fa841d9387eeb028dc02f
2021-05-03 11:37:00 -07:00
Can Balioglu
6bf8df6b3b [9/n] [torch/elastic] Introduce RendezvousSettings (#56537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56537

This PR introduces the `RendezvousSettings` type to consolidate the arguments passed to `DynamicRendezvousHandler`.
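
Consolidating arguments into one immutable value might look like the sketch below. The field names and default durations are hypothetical and not the actual definition of `RendezvousSettings`.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Settings:
    """Groups what would otherwise be a long list of handler arguments."""

    run_id: str
    min_nodes: int
    max_nodes: int
    join_timeout: timedelta = timedelta(seconds=600)  # assumed default
    heartbeat_timeout: timedelta = timedelta(seconds=5)  # assumed default
```
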
ghstack-source-id: 127626738

Test Plan: Run the existing unit tests.

Reviewed By: tierex

Differential Revision: D27890155

fbshipit-source-id: 22060c25b6927cc832f18ae6c5f7ba0f7a9ef3cf
2021-05-03 11:36:04 -07:00
Aliaksandr Ivanou
7fe4c1d0e7 Torchelastic: add multiprocessing tests to ci/cd (#56842)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56842

Add elastic multiprocessing test to ci/cd

Test Plan: buck test mode/opt-tsan //caffe2/test/distributed/elastic/multiprocessing/... -- --run-disabled

Reviewed By: wilson100hong

Differential Revision: D27982226

fbshipit-source-id: 1b4e6f1a20867a6aa7ca409e280fdb04e8db198b
2021-05-02 14:03:47 -07:00
Can Balioglu
72b1faa2d2 [8/n] [torch/elastic] Add unit tests for _RendezvousState (#56536)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56536

This PR adds unit tests to ensure that the encoded byte length of `_RendezvousState` stays under a certain limit.
ghstack-source-id: 127626622
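
A small sketch of the kind of size check such a test might perform. `StateSketch`, its fields, the JSON encoding, and the node-name pattern are all assumptions made for illustration; the actual `_RendezvousState` encoding and the limit used by the tests are not stated here.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class StateSketch:
    # Hypothetical fields; the real state type is not reproduced here.
    round_no: int = 0
    complete: bool = False
    participants: Dict[str, int] = field(default_factory=dict)


def encoded_size(state: StateSketch) -> int:
    """Return the length in bytes of the encoded state (JSON in this sketch)."""
    return len(json.dumps(asdict(state)).encode("utf-8"))


def fits_within(num_nodes: int, max_bytes: int) -> bool:
    """Build a state with num_nodes participants and compare its size to a limit."""
    state = StateSketch()
    for rank in range(num_nodes):
        state.participants[f"node{rank}.example.com"] = rank
    return encoded_size(state) <= max_bytes
```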

Test Plan: Run the newly-introduced unit tests.

Reviewed By: tierex

Differential Revision: D27890704

fbshipit-source-id: 24905c8bc9d985d5ee90d370f28739eb137ce0f0
2021-04-30 13:14:52 -07:00
Aliaksandr Ivanou
5c8ceefe46 Pytorch add agent api tests (#56985)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56985

Add agent API tests to PyTorch.

Test Plan: ci/cd

Reviewed By: cbalioglu

Differential Revision: D28020485

fbshipit-source-id: e6acf095f26ce4b99cddfbf7641fb4fa885b0c86
2021-04-29 06:14:39 -07:00
Aliaksandr Ivanou
6ff0002b12 Pytorch: enable many torchelastic tests (#56970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56970

The diff enables the metrics, events, utils, and timer tests on the CI/CD pipeline.

Test Plan: ci/cd

Reviewed By: cbalioglu

Differential Revision: D28015200

fbshipit-source-id: 6b419aaf9e62a10a747b6511bff90c82cfb7bcd6
2021-04-28 17:05:09 -07:00
Aliaksandr Ivanou
0df574017d Torchelastic: add support for the new error file format (#57084)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57084

The diff adds support for the new error message file format:

    {
        "message":"test",
        "timestamp": 12
    }
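
As an illustration, a file in the format above could be read as follows. `parse_error_file` is a hypothetical helper, not a torchelastic function, and it handles only the flat layout shown here.

```python
import json
from typing import Tuple


def parse_error_file(text: str) -> Tuple[str, int]:
    """Extract (message, timestamp) from the flat error file layout."""
    data = json.loads(text)
    # Both keys sit at the top level of the JSON object in this format.
    return str(data["message"]), int(data["timestamp"])
```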

Test Plan:
fbcode buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test

example job: tsm_aivanou-torchelastic_distributed_sum_77c0b147

Reviewed By: borovsky-d, wilson100hong

Differential Revision: D28042764

fbshipit-source-id: 4d21c2319654f3460d551d91cbf48568356cf4e8
2021-04-28 00:04:45 -07:00
Aliaksandr Ivanou
0a72904ab4 Torchelastic: make process failure init error non-fatal (#56739)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56739

The diff makes several tiny changes:
* Add logs for each worker error file destination
* Make sure log_dir is propagated from the launcher
* Make ProcessFailure initialization error non-fatal.
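
One way to read "non-fatal" in the last item is sketched below: if the error file cannot be read or parsed while the failure record is being built, the record falls back to a placeholder message instead of raising. `FailureSketch` and its fields are hypothetical and do not reproduce the real `ProcessFailure` class.

```python
import json
from dataclasses import dataclass, field


@dataclass
class FailureSketch:
    # Hypothetical record of a failed worker; field names are assumptions.
    local_rank: int
    exitcode: int
    error_file: str
    message: str = field(init=False)

    def __post_init__(self) -> None:
        try:
            with open(self.error_file) as fp:
                self.message = str(json.load(fp)["message"])
        except (OSError, ValueError, KeyError, TypeError) as exc:
            # Missing, unreadable, or malformed file: keep going with a placeholder.
            self.message = f"<could not read error file: {exc!r}>"
```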

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test

    https://fburl.com/tupperware/0nizb9z8

Reviewed By: borovsky-d, wilson100hong

Differential Revision: D27952596

fbshipit-source-id: 69582bf4be47758def4008f2abf82d123294cd1a
2021-04-23 00:49:47 -07:00
Can Balioglu
21d9bc246b [6/n] [torch/elastic] Reorder type definitions in dynamic_rendezvous.py (#56534)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56534

This PR reorders the type definitions in dynamic_rendezvous.py to improve readability.
ghstack-source-id: 126979087

Test Plan: Run the existing unit tests.

Reviewed By: H-Huang

Differential Revision: D27889817

fbshipit-source-id: 04291af9b8f3170e4b33cb4f33e0dff0d2d3fb23
2021-04-21 16:01:02 -07:00
Can Balioglu
df91eb924c [5/n] [torch/elastic] Introduce the delay utility function (#56533)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56533

This PR introduces a small utility function to delay the execution of the current thread.
ghstack-source-id: 126979035
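
A minimal sketch of such a helper. The name `delay`, the optional `(low, high)` range for a randomized sleep, and the returned value are assumptions for illustration; the signature of the function added by this PR is not shown here.

```python
import random
import time
from typing import Tuple, Union


def delay(seconds: Union[float, Tuple[float, float]]) -> float:
    """Suspend the current thread and return the number of seconds requested.

    A (low, high) tuple picks a random duration in that range, which helps
    spread out retries from many nodes (an assumption of this sketch).
    """
    if isinstance(seconds, tuple):
        seconds = random.uniform(*seconds)
    if seconds > 0:
        time.sleep(seconds)
    return seconds
```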

Test Plan: Run the associated unit tests.

Reviewed By: H-Huang

Differential Revision: D27889671

fbshipit-source-id: aae93b624bd4704da7a48004f50d130cec64969d
2021-04-21 16:01:00 -07:00
Can Balioglu
76ca1eeeb8 [4/n] [torch/elastic] Fix the finalizer of PeriodicTimer (#56532)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56532

This PR fixes a subtle issue with the finalizer implementation of `_PeriodicTimer`.

We avoid using a regular finalizer (a.k.a. `__del__`) to stop the timer, because joining a daemon thread during interpreter shutdown can cause deadlocks. `weakref.finalize` is a superior alternative that behaves consistently regardless of the GC implementation.
ghstack-source-id: 126978904
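
The pattern can be sketched as follows. `PeriodicTimerSketch` is a hypothetical simplification, not the actual `_PeriodicTimer` code. The key point is that the callback registered with `weakref.finalize` must not hold a reference to the timer object itself, so it receives only the thread and the stop event.

```python
import threading
import weakref
from typing import Callable


class PeriodicTimerSketch:
    """Calls `function` every `interval` seconds on a daemon thread."""

    def __init__(self, interval: float, function: Callable[[], None]) -> None:
        self._stop_event = threading.Event()
        self._thread = threading.Thread(
            target=self._run, args=(interval, function, self._stop_event), daemon=True
        )
        self._thread.start()
        # Static callback plus explicit arguments: nothing here keeps `self` alive.
        self._finalizer = weakref.finalize(
            self, self._stop_thread, self._thread, self._stop_event
        )

    @staticmethod
    def _run(interval: float, function: Callable[[], None], stop_event: threading.Event) -> None:
        # Event.wait returns False on timeout, so the loop runs until the event is set.
        while not stop_event.wait(interval):
            function()

    @staticmethod
    def _stop_thread(thread: threading.Thread, stop_event: threading.Event) -> None:
        stop_event.set()
        thread.join()

    def cancel(self) -> None:
        # A finalizer runs at most once, so calling cancel() repeatedly is safe.
        self._finalizer()
```

Because the finalizer is registered explicitly, the same cleanup path serves both an explicit `cancel()` and garbage collection of the timer.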

Test Plan: Run the existing unit tests as there is no behavioral change.

Reviewed By: H-Huang

Differential Revision: D27889289

fbshipit-source-id: a248cf6fd1abc4da8bef90e160fa9669a4961fa5
2021-04-21 15:59:19 -07:00