155 Commits

Author SHA1 Message Date
Can Balioglu
bf6e3425b0 [23/n] [torch/elastic] Introduce the implementation of DynamicRendezvousHandler (#57151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57151

This PR introduces the implementation of `DynamicRendezvousHandler`, which mostly makes use of the types introduced in previous PRs.
ghstack-source-id: 127685212

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28060531

fbshipit-source-id: 844ff0e9c869f2bbb85fba05a16002d00eae130f
2021-05-03 18:32:43 -07:00
Can Balioglu
a357fc8a4b [22/n] [torch/elastic] Introduce a new from_backend static constructor for DynamicRendezvousHandler (#57150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57150

This PR refactors the `__init__` method of `DynamicRendezvousHandler` to a `from_backend` static constructor for easier testing and future extensibility.
ghstack-source-id: 127685183
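
The pattern can be sketched roughly as follows. This is only an illustration, not the PyTorch code: `Handler`, `FakeSettings`, and every parameter name other than `from_backend` are hypothetical, and a `classmethod` is used as the factory.

```python
from dataclasses import dataclass


@dataclass
class FakeSettings:
    """Hypothetical settings bundle; the field names are assumptions."""
    run_id: str
    min_nodes: int
    max_nodes: int


class Handler:
    """Hypothetical stand-in for a rendezvous handler."""

    def __init__(self, settings, backend):
        # __init__ only stores fully built collaborators, which keeps it
        # trivial to call from tests with fakes.
        self.settings = settings
        self.backend = backend

    @classmethod
    def from_backend(cls, run_id, backend, min_nodes, max_nodes):
        # Validation and object construction live in the factory.
        if min_nodes < 1 or max_nodes < min_nodes:
            raise ValueError("expected 1 <= min_nodes <= max_nodes")
        return cls(FakeSettings(run_id, min_nodes, max_nodes), backend)
```

A test can then call `Handler(settings, fake_backend)` directly, while production code goes through `Handler.from_backend(...)`.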

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28060336

fbshipit-source-id: b07dcbb61e8ff5a536b7b021cd50438010c648dd
2021-05-03 18:32:42 -07:00
Can Balioglu
4a10bd3b58 [21/n] [torch/elastic] Introduce _RendezvousJoinOp (#57149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57149

This PR introduces the `_RendezvousJoinOp` type that represents a rendezvous join operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685142

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059785

fbshipit-source-id: 6e67a54289eef1a2349fcc52f8841e49c139459a
2021-05-03 18:32:40 -07:00
Can Balioglu
81ef683cb3 [20/n] [torch/elastic] Introduce _RendezvousExitOp (#57148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57148

This PR introduces the `_RendezvousExitOp` type that represents a rendezvous exit operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685094

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059764

fbshipit-source-id: 2da428885f1390957242fdd82d68cee2ac273c71
2021-05-03 18:32:38 -07:00
Can Balioglu
baf8f4c0a6 [19/n] [torch/elastic] Introduce _RendezvousKeepAliveOp (#57147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57147

This PR introduces the `_RendezvousKeepAliveOp` type that represents a rendezvous keep-alive heartbeat operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685037

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059733

fbshipit-source-id: 31fd8fc06f03d8f9cd21558b15a06dea7ad85bc6
2021-05-03 18:32:37 -07:00
Can Balioglu
3e024fcfc9 [18/n] [torch/elastic] Introduce _RendezvousCloseOp (#57146)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57146

This PR introduces the `_RendezvousCloseOp` type that represents a rendezvous close operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127684991

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059693

fbshipit-source-id: 6c944d3b4f6a6ed2057ea2921ae8a42609998dd2
2021-05-03 18:32:35 -07:00
Can Balioglu
aa5d35e1d7 [17/n] [torch/elastic] Introduce _DistributedRendezvousOpExecutor (#57145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57145

This PR introduces the `_DistributedRendezvousOpExecutor` type that implements the `_RendezvousOpExecutor` interface for rendezvous shared via a `_RendezvousStateHolder`.
ghstack-source-id: 127684945

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059417

fbshipit-source-id: 7ef72ea16b54eaaa11a6ece7459d385d49692a84
2021-05-03 18:31:23 -07:00
Can Balioglu
1a6f827ae6 [16/n] [torch/elastic] Introduce _RendezvousOpExecutor (#57144)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57144

This PR introduces the `_RendezvousOpExecutor` interface. Implementers of this interface are responsible for executing rendezvous operations in a state machine that outputs actions based on the current state of the rendezvous.
ghstack-source-id: 127684898
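
To make the idea of "a state machine that outputs actions based on the current state" concrete, here is a minimal sketch. The `Action` values, the dictionary-based state, and the `join_op` and `run` helpers are all hypothetical and are not taken from the PR.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical actions an operation can request."""
    ADD_SELF = 1
    FINISH = 2


def join_op(state, node):
    # Decide the next action purely from the current state.
    return Action.FINISH if node in state["participants"] else Action.ADD_SELF


def run(state, node, op):
    # Keep asking the operation what to do until it reports completion.
    while True:
        action = op(state, node)
        if action is Action.FINISH:
            return state
        if action is Action.ADD_SELF:
            state["participants"].add(node)
```

The executor owns the loop and the state changes, while each operation stays a small function of the state, which makes it easy to test in isolation.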

Test Plan: None beyond `flake8` and `mypy` as this is solely an interface definition.

Reviewed By: tierex

Differential Revision: D28059159

fbshipit-source-id: 8e7da33e02336206cddbe76d773681e98c28a98f
2021-05-03 12:18:27 -07:00
Can Balioglu
76bccfb2e0 [15/n] [torch/elastic] Introduce _RendezvousStateHolder (#56538)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56538

This PR introduces the `_RendezvousStateHolder` interface and its accompanying `_BackendRendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state with the other nodes.
ghstack-source-id: 127684796

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D27892600

fbshipit-source-id: a55d884a1f9b0d742787be4dff4271e076c08962
2021-05-03 12:17:18 -07:00
Can Balioglu
1b745efbe8 [14/n] Introduce a name attribute to _PeriodicTimer (#57143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57143

This PR introduces a `name` attribute in `_PeriodicTimer` for testing and debugging purposes.
ghstack-source-id: 127684751

Test Plan: Run the new and updated unit tests.

Reviewed By: tierex

Differential Revision: D28059045

fbshipit-source-id: 9eb067300aea21a99577e6cd8a354f7eb749f4a6
2021-05-03 11:37:05 -07:00
Can Balioglu
233004b4c8 [13/n] Extend the return type of RendezvousBackend's set_state method (#57142)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57142

This PR extends the return type of `RendezvousBackend`'s `set_state` method with an additional boolean flag that specifies whether the write attempt has succeeded.
ghstack-source-id: 127629538
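
One way such a return value could look is sketched below with an in-memory, token-based backend. The class, the token scheme, and the exact tuple layout are assumptions made for illustration; the real `RendezvousBackend` interface may differ.

```python
from typing import Optional, Tuple


class InMemoryBackend:
    """Hypothetical backend; not the PyTorch implementation."""

    def __init__(self) -> None:
        self._state = b""
        self._token = 0

    def get_state(self) -> Tuple[bytes, int]:
        return self._state, self._token

    def set_state(
        self, state: bytes, token: Optional[int] = None
    ) -> Tuple[bytes, int, bool]:
        # Write only if the caller saw the latest version (or passed no token).
        if token is not None and token != self._token:
            return self._state, self._token, False
        self._state = state
        self._token += 1
        return self._state, self._token, True
```

With the extra flag, a caller can tell a rejected write apart from a successful one without comparing the returned state to what it tried to write.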

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28058980

fbshipit-source-id: 26333790c39386891beb155b20ba1291d2cbdd03
2021-05-03 11:37:03 -07:00
Can Balioglu
a6f60cf4f0 [12/n] Rename last_keep_alives to last_heartbeats in _RendezvousState (#57141)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57141

Per feedback this PR renames `last_keep_alives` to `last_heartbeats` in `_RendezvousState`.
ghstack-source-id: 127629442

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28058948

fbshipit-source-id: 0db12eac56a47a426a7a48fb5c93ac6a08b0d22e
2021-05-03 11:37:01 -07:00
Can Balioglu
3209364724 [11/n] [torch/elastic] Add heartbeat timeout to RendezvousTimeout (#57140)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57140

This PR introduces a new `heartbeat` attribute in `RendezvousTimeout`.
ghstack-source-id: 127626815

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28058908

fbshipit-source-id: c6f8b3a06210cc59714fa841d9387eeb028dc02f
2021-05-03 11:37:00 -07:00
Can Balioglu
6876e15dbe [10/n] [torch/elastic] Add comparison operators to _NodeDesc (#57139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57139

This PR sets the `order` attribute of the `dataclass` annotation to `True` in order to introduce comparison operators for `_NodeDesc`.
ghstack-source-id: 127626783
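
As a small illustration of what `order=True` provides, the sketch below uses a stand-in class. The field names are assumptions and are not taken from the PR.

```python
from dataclasses import dataclass


@dataclass(order=True)
class NodeDesc:
    # Field names are illustrative assumptions.
    fqdn: str
    pid: int
    local_id: int


nodes = [
    NodeDesc("b.example", 2, 0),
    NodeDesc("a.example", 9, 0),
    NodeDesc("a.example", 3, 1),
]
# Instances compare like tuples of their fields, in declaration order.
ordered = sorted(nodes)
```

Because the generated `__lt__` and related methods compare fields in declaration order, the field order also defines the sort order.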

Test Plan: Run the existing unit tests.

Reviewed By: tierex

Differential Revision: D28058851

fbshipit-source-id: 66313f84f507100e20acb687a3427b3dd51a6310
2021-05-03 11:36:58 -07:00
Can Balioglu
6bf8df6b3b [9/n] [torch/elastic] Introduce RendezvousSettings (#56537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56537

This PR introduces the `RendezvousSettings` type to consolidate the arguments passed to `DynamicRendezvousHandler`.
ghstack-source-id: 127626738

Test Plan: Run the existing unit tests.

Reviewed By: tierex

Differential Revision: D27890155

fbshipit-source-id: 22060c25b6927cc832f18ae6c5f7ba0f7a9ef3cf
2021-05-03 11:36:04 -07:00
Atul Jangra
ca814904b4 Handle error reporting when reply file already exists (#57217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57217

In the torch multiprocessing error handler, we try to remove the reply file if it already exists. Before removing it, we try to log its contents, assuming that they are valid JSON.
However, in some cases they are not, and we then end up not clearing the file.
Let's handle this error and make sure that the file is removed irrespective of its contents.
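
The intended behavior could be sketched as follows. The function name and return value are hypothetical; the actual error handler code is not shown in this commit message.

```python
import json
import os


def remove_stale_error_file(path):
    """Return the old contents if they parse as JSON, then always delete the file."""
    if not os.path.isfile(path):
        return None
    logged = None
    try:
        with open(path) as fp:
            logged = json.load(fp)
    except ValueError:
        # Not valid JSON: nothing to log, but the file must still go.
        logged = None
    finally:
        os.remove(path)
    return logged
```

`json.JSONDecodeError` is a subclass of `ValueError`, so malformed contents no longer prevent the cleanup in the `finally` clause.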

Reviewed By: devashisht

Differential Revision: D28041470

fbshipit-source-id: da96d11b8f7091715cf0152cccd3ecc08b688eae
2021-04-29 04:57:35 -07:00
Aliaksandr Ivanou
6ff0002b12 Pytorch: enable many torchelastic tests (#56970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56970

The diff enables the metrics, events, utils and timer tests on the CI/CD pipeline.

Test Plan: ci/cd

Reviewed By: cbalioglu

Differential Revision: D28015200

fbshipit-source-id: 6b419aaf9e62a10a747b6511bff90c82cfb7bcd6
2021-04-28 17:05:09 -07:00
Aliaksandr Ivanou
0df574017d Torchelastic: add support for the new error file format (#57084)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57084

The diff adds support for the new error message file format:

    {
        "message":"test",
        "timestamp": 12
    }
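
Reading a file in this format might look like the sketch below. The helper name is hypothetical, and treating `timestamp` as an integer is an assumption based on the sample above.

```python
import json


def read_error_file(text):
    """Return (message, timestamp) from the format shown above."""
    data = json.loads(text)
    return data["message"], int(data["timestamp"])


message, timestamp = read_error_file('{"message": "test", "timestamp": 12}')
```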

Test Plan:
fbcode buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test

example job: tsm_aivanou-torchelastic_distributed_sum_77c0b147

Reviewed By: borovsky-d, wilson100hong

Differential Revision: D28042764

fbshipit-source-id: 4d21c2319654f3460d551d91cbf48568356cf4e8
2021-04-28 00:04:45 -07:00
Aliaksandr Ivanou
0a72904ab4 Torchelastic: make process failure init error non-fatal (#56739)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56739

The diff makes several tiny changes:
* Add logs for each worker error file destination
* Make sure log_dir is propagated from the launcher
* Make ProcessFailure initialization error non-fatal.

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test

    https://fburl.com/tupperware/0nizb9z8

Reviewed By: borovsky-d, wilson100hong

Differential Revision: D27952596

fbshipit-source-id: 69582bf4be47758def4008f2abf82d123294cd1a
2021-04-23 00:49:47 -07:00
Can Balioglu
853112bbfc [7/n] [torch/elastic] Rename _Rendezvous to _RendezvousState (#56535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56535

This PR renames the `_Rendezvous` class to `_RendezvousState` in preparation for the upcoming changes.
ghstack-source-id: 126979138

Test Plan: Run the existing unit tests.

Reviewed By: H-Huang

Differential Revision: D27889894

fbshipit-source-id: 027d26aa5e1acd5bba3ad2e58b140428a4a176b2
2021-04-21 16:01:03 -07:00
Can Balioglu
21d9bc246b [6/n] [torch/elastic] Reorder type definitions in dynamic_rendezvous.py (#56534)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56534

This PR reorders the type definitions in dynamic_rendezvous.py to increase the readability.
ghstack-source-id: 126979087

Test Plan: Run the existing unit tests.

Reviewed By: H-Huang

Differential Revision: D27889817

fbshipit-source-id: 04291af9b8f3170e4b33cb4f33e0dff0d2d3fb23
2021-04-21 16:01:02 -07:00
Can Balioglu
df91eb924c [5/n] [torch/elastic] Introduce the delay utility function (#56533)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56533

This PR introduces a small utility function to delay the execution of the current thread.
ghstack-source-id: 126979035
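
Such a helper could be as small as the sketch below. The name `delay` and the option of passing a `(low, high)` range for a randomized delay are assumptions, not details stated in this PR.

```python
import random
import time


def delay(seconds):
    """Sleep for a fixed time or, given a (low, high) pair, a random time in that range."""
    if isinstance(seconds, tuple):
        seconds = random.uniform(*seconds)
    time.sleep(seconds)
```

A randomized range is a common way to keep several nodes from retrying in lockstep.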

Test Plan: Run the associated unit tests.

Reviewed By: H-Huang

Differential Revision: D27889671

fbshipit-source-id: aae93b624bd4704da7a48004f50d130cec64969d
2021-04-21 16:01:00 -07:00
Can Balioglu
76ca1eeeb8 [4/n] [torch/elastic] Fix the finalizer of PeriodicTimer (#56532)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56532

This PR fixes a subtle issue with the finalizer implementation of `_PeriodicTimer`.

We avoid using a regular finalizer (a.k.a. `__del__`) for stopping the timer, as joining a daemon thread during interpreter shutdown can cause deadlocks. `weakref.finalize` is a better alternative that provides consistent behavior regardless of the GC implementation.
ghstack-source-id: 126978904
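
A minimal sketch of the `weakref.finalize` approach is shown below. The `Timer` class is a hypothetical stand-in and is not the `_PeriodicTimer` implementation from this PR.

```python
import threading
import weakref


class Timer:
    """Hypothetical periodic timer that stops itself when garbage collected."""

    def __init__(self, interval, function):
        self._stop = threading.Event()
        # The thread must not hold a reference to `self`, or the timer
        # could never be garbage collected.
        self._thread = threading.Thread(
            target=Timer._run, args=(interval, function, self._stop), daemon=True
        )
        # The callback is a static method for the same reason.
        self._finalizer = weakref.finalize(
            self, Timer._shutdown, self._thread, self._stop
        )
        self._thread.start()

    @staticmethod
    def _run(interval, function, stop):
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval):
            function()

    @staticmethod
    def _shutdown(thread, stop):
        stop.set()
        thread.join()

    def cancel(self):
        # Calling the finalizer runs _shutdown at most once.
        self._finalizer()
```

Because neither the thread nor the finalizer callback references the `Timer` instance, the finalizer can fire as soon as the last user reference is dropped, and an explicit `cancel()` simply triggers it early.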

Test Plan: Run the existing unit tests as there is no behavioral change.

Reviewed By: H-Huang

Differential Revision: D27889289

fbshipit-source-id: a248cf6fd1abc4da8bef90e160fa9669a4961fa5
2021-04-21 15:59:19 -07:00
Sam Estep
75024e228c Add lint for unqualified type: ignore (#56290)
Summary:
The other half of https://github.com/pytorch/pytorch/issues/56272.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56290

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI runs (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2384511062
- https://github.com/pytorch/pytorch/actions/runs/765036024

Reviewed By: seemethere

Differential Revision: D27867219

Pulled By: samestep

fbshipit-source-id: e648f07b6822867e70833e23ddafe7fb7eaca235
2021-04-21 08:07:23 -07:00
Aliaksandr Ivanou
c5c5230890 Pytorch resolve bug around incorrect rdzv handler resolution (#56386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56386

The diff fixes a bug in handler resolution:
_create_static_handler pointed towards etcd, and _create_etcd_handler pointed towards static.

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:test_launcher

Added test_launcher to the ci/cd tests

Reviewed By: cbalioglu

Differential Revision: D27858897

fbshipit-source-id: 440155789958c091ce5755e7c9524e4bb704203a
2021-04-19 23:50:28 -07:00
Kiuk Chung
023231a2ac [torch/distributed] Fix pydoc for torch.distributed.elastic.multiprocessing (replace Redirect with Std)
Summary: `Redirects` was renamed to `Std` in `torch.distributed.elastic.multiprocessing.api`. Pointed out by a user in https://github.com/pytorch/elastic/issues/147.

Test Plan: N/A just doc change

Reviewed By: tierex

Differential Revision: D27866614

fbshipit-source-id: 9fb901aae7ebe11cde13000a1c118de527f34400
2021-04-19 21:40:16 -07:00
Sam Estep
e3900d2ba5 Add lint for unqualified noqa (#56272)
Summary:
As this diff shows, currently there are a couple hundred instances of raw `noqa` in the codebase, which just ignore all errors on a given line. That isn't great, so this PR changes all existing instances of that antipattern to qualify the `noqa` with respect to a specific error code, and adds a lint to prevent more of this from happening in the future.

Interestingly, some of the examples the `noqa` lint catches are genuine attempts to qualify the `noqa` with a specific error code, such as these two:
```
test/jit/test_misc.py:27:            print(f"{hello + ' ' + test}, I'm a {test}") # noqa E999
test/jit/test_misc.py:28:            print(f"format blank") # noqa F541
```
However, those are still wrong because they are [missing a colon](https://flake8.pycqa.org/en/3.9.1/user/violations.html#in-line-ignoring-errors), which actually causes the error code to be completely ignored:

- If you change them to anything else, the warnings will still be suppressed.
- If you add the necessary colons then it is revealed that `E261` was also being suppressed, unintentionally:
  ```
  test/jit/test_misc.py:27:57: E261 at least two spaces before inline comment
  test/jit/test_misc.py:28:35: E261 at least two spaces before inline comment
  ```

I did try using [flake8-noqa](https://pypi.org/project/flake8-noqa/) instead of a custom `git grep` lint, but it didn't seem to work. This PR is definitely missing some of the functionality that flake8-noqa is supposed to provide, though, so if someone can figure out how to use it, we should do that instead.
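
For illustration only, the check can be approximated in Python as below. This is not the `git grep` lint from the PR, and the regular expression assumes that error codes consist of letters followed by digits.

```python
import re

# Matches "# noqa" that is not followed by ": <code>".
# Assumption: flake8 codes look like letters followed by digits.
UNQUALIFIED_NOQA = re.compile(r"#\s*noqa(?!\s*:\s*[A-Z]+[0-9]+)", re.IGNORECASE)


def unqualified_noqa_lines(source):
    """Return 1-based line numbers whose noqa lacks a specific error code."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if UNQUALIFIED_NOQA.search(line)
    ]
```

Note that a comment such as `# noqa E999` (missing the colon) is reported as well, which matches the problem described above.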

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56272

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI run (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2365189927

Reviewed By: janeyx99

Differential Revision: D27830127

Pulled By: samestep

fbshipit-source-id: d6dcf4f945ebd18cd76c46a07f3b408296864fcb
2021-04-19 13:16:18 -07:00
Aliaksandr Ivanou
a6940aae37 [19/n][torch/elastic][upstream] Replace pytorch.distributed.launch with torchelastic launcher (#56214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56037

The diff introduces the new `torch.distributed.elastic_launch` and removes the internals of `torch.distributed.launch` while keeping backwards compatibility.

Since torchelastic and torch.launch are not fully compatible due to the `--use_env` arg, the `torch.distributed.launch` deprecation is going to be iterative: as part of PyTorch 1.9 we are going to deprecate it, and in the following releases we will remove `torch.distributed.launch`.

The diff leaves the `torchelastic.distributed.launch` module in place, and follow-up diffs will migrate users from `torchelastic.distributed.launch` to `torch.distributed.elastic_launch`.

Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/...

Reviewed By: H-Huang

Differential Revision: D27805799

fbshipit-source-id: 599a4c0592fbc7a1bc1953040626dd6b72bac907
2021-04-16 13:38:23 -07:00
Can Balioglu
512c744f2e [torch/elastic] Introduce PeriodicTimer (#55919)
Summary:
This PR introduces a basic timer type that periodically calls a specified function. Its main use in the upcoming `DynamicRendezvousHandler` implementation will be to send periodic keep-alive updates in a background thread.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55919

Reviewed By: tierex

Differential Revision: D27740823

Pulled By: cbalioglu

fbshipit-source-id: e46fc848ab033995946a38a29c01d67d387a4cf5
2021-04-15 14:51:14 -07:00
Can Balioglu
71f9e99e29 [torch/elastic] Introduce aux types required by DynamicRendezvousHandler (#55932)
Summary:
This PR includes the auxiliary types used by the upcoming implementation of the `DynamicRendezvousHandler`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55932

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27742329

Pulled By: cbalioglu

fbshipit-source-id: cf2e0d88042909739e7c37c25b4b90192c26e198
2021-04-15 11:12:20 -07:00
Aliaksandr Ivanou
8f663170bd [17/n][torch/elastic] Make torchelastic launcher compatible with the caffe2.distributed.launch (#55687)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55687

The diff makes sure that users can transfer the following parameters:
* master_addr
* master_port
* node_rank
* use_env

The diff implements StaticTCPRendezvous, which creates a store with a listener on agent rank #0.

The diff modifies caffe2/rendezvous: if the worker processes are launched by the torchelastic agent, they will create a PrefixStore("worker/") from a TCPStore without a listener.

The diff adds macros functionality to torch/distributed/elastic/utils that helps to resolve the local_rank parameter.

Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/test:launch_test

Reviewed By: cbalioglu, wilson100hong

Differential Revision: D27643206

fbshipit-source-id: 540fb26feac322cc3ec0a989fe53324755ccc4ea
2021-04-14 19:33:26 -07:00
Richard Barnes
6269efde91 Add stricter typing to caffe2/torch/distributed/elastic/multiprocessing/errors/__init__.py (#55848)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55848

Test Plan: Sandcastle

Reviewed By: xush6528

Differential Revision: D27714781

fbshipit-source-id: cff651e04c1e8363a249c7de9de01c33db47f003
2021-04-13 10:47:08 -07:00
Sam Estep
4753100a3b Un-ignore F403 in .flake8 (#55838)
Summary:
Generally wildcard imports are bad for the reasons described here: https://www.flake8rules.com/rules/F403.html

This PR replaces wildcard imports with an explicit list of imported items where possible, and adds a `# noqa: F403` comment in the other cases (mostly re-exports in `__init__.py` files).

This is a prerequisite for https://github.com/pytorch/pytorch/issues/55816, because currently [`tools/codegen/dest/register_dispatch_key.py` simply fails if you sort its imports](https://github.com/pytorch/pytorch/actions/runs/742505908).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55838

Test Plan: CI. You can also run `flake8` locally.

Reviewed By: jbschlosser

Differential Revision: D27724232

Pulled By: samestep

fbshipit-source-id: 269fb09cb4168f8a51fd65bfaacc6cda7fb87c34
2021-04-13 09:24:07 -07:00
Can Balioglu
e61b4fa691 [3/n] [torch/elastic] Introduce EtcdRendezvousBackend. (#55637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55637

This diff introduces the `EtcdRendezvousBackend` type that will serve as an experimental alternative to the existing `EtcdRendezvousHandler`.

The major advantage of `EtcdRendezvousBackend` is that it delegates the bulk of the rendezvous handling logic to `DynamicRendezvousHandler` which is shared with `C10dRendezvousBackend` (see D27654492) and any other potential future rendezvous backend (e.g. Amazon S3).
ghstack-source-id: 126312209

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27654498

fbshipit-source-id: f3259adfc9068b7e323b947a7d8d52fcd0b8ada1
2021-04-12 22:20:29 -07:00
Can Balioglu
339d3bf394 [2/n] [torch/elastic] Introduce C10dRendezvousBackend. (#55636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55636

This diff introduces:

- The `C10dRendezvousBackend` type to support C10d stores as rendezvous backends.
- A fix to the `TCPStore.compare_set()` function to support non-existent keys.
- A placeholder `c10d-experimental` registry to instantiate C10d-backed rendezvous backends via `get_rendezvous_handler()`.
ghstack-source-id: 126312162

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27654492

fbshipit-source-id: 09f498138b35186de4b0e174adb33fb5b5aa4b52
2021-04-12 22:20:27 -07:00
Can Balioglu
b3dd8cde61 [1/n] [torch/elastic] Introduce DynamicRendezvousHandler and RendezvousBackend. (#55635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55635

This diff introduces the `DynamicRendezvousHandler` type as a stub implementation and its accompanying `RendezvousBackend` interface.

`DynamicRendezvousHandler` is intended to be a backend-agnostic type that will contain the core (bulk) logic of rendezvous handling. Any backend specific operation will be delegated to a concrete subclass of `RendezvousBackend` (e.g. `C10dRendezvousBackend` - see D27654492) that is passed as a constructor argument to `DynamicRendezvousHandler`.
ghstack-source-id: 126304697

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27654478

fbshipit-source-id: 9fc89a6e4cb308971c65b29a7c5af7ae191f70c5
2021-04-12 22:18:49 -07:00
Ralf Gommers
48ddc9762b Upgrade mypy to version 0.812 (#55712)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54211

This was a little more annoying than expected, because the `exclude = ` key in `mypy.ini` is weird. I'll file an upstream issue about that.

I ignored one file, `torch/distributed/elastic/agent/server/api.py` that had ~8 errors that were hard to figure out. This can be done in a follow-up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55712

Reviewed By: walterddr

Differential Revision: D27694976

Pulled By: malfet

fbshipit-source-id: 228d8be6af040343ce46595dabaca212e69ccc68
2021-04-12 18:08:28 -07:00
Sam Estep
cc11aaaa60 Disallow non-breaking spaces (#55465)
Summary:
malfet found a couple of these in https://github.com/pytorch/pytorch/issues/55346; this PR removes the rest and adds a lint that prevents them from being accidentally added again in the future. It also removes the `-o` flag added in https://github.com/pytorch/pytorch/issues/53733 (which was unnecessarily hiding context without reducing the number of lines of output), and updates the lint error messages to reflect that the individual line numbers are shown in the logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55465

Test Plan:
The "Lint / quick-checks" job in GitHub Actions should succeed on this PR. To verify that the lint does correctly find and error on non-breaking spaces, checkout ece075195d and run it locally:
```sh
(! git --no-pager grep -In $'\u00a0' -- . || (echo "The above lines have non-breaking spaces (U+00A0); please convert them to spaces (U+0020)"; false))
```
It should print over a hundred lines of output and exit with status 1.
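
The same check can be sketched in Python for a single string. This is an illustrative equivalent, not part of the lint added by the PR, and the function name is hypothetical.

```python
def lines_with_nbsp(text):
    """Return 1-based numbers of lines that contain U+00A0."""
    return [
        number
        for number, line in enumerate(text.split("\n"), start=1)
        if "\u00a0" in line
    ]
```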

Reviewed By: janeyx99

Differential Revision: D27622136

Pulled By: samestep

fbshipit-source-id: e7ffd5a9519093e7a0ffdf55e9291f63e21ce841
2021-04-08 15:44:44 -07:00
Wilson Hong
f665a7f8a1 [pet] Set error code in reply file when child process is terminated by signals.
Summary: Fill the reply file's error code with ProcessFailure's exitcode. This is necessary when the child process is terminated by a signal (e.g. SIGSEGV).
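
For context, Python's `multiprocessing` and `subprocess` modules report a child killed by signal N as exit code -N. A hypothetical helper that turns such a code into a readable label might look like this:

```python
import signal


def describe_exitcode(exitcode):
    """Translate a multiprocessing-style exit code into a readable string."""
    if exitcode < 0:
        # A negative value -N means the child was killed by signal N.
        return signal.Signals(-exitcode).name
    return f"exit {exitcode}"
```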

Test Plan:
- Buck test
```
buck test mode/dev-nosan pytorch/elastic/torchelastic/distributed/fb/test:launch_test
buck test mode/dev-nosan caffe2/torch/distributed/elastic/multiprocessing/errors/fb/test:error_handler_fb_test_needed_coverage
```

- TSM
```
fbpkg build -E torchelastic_distributed_sum

buck run mode/dev-nosan //pytorch/elastic/torchelastic/tsm/fb/cli:tsm -- run_ddp --scheduler mast --fbpkg torchelastic_distributed_sum:ecdf31f --nnodes 2 --nproc_per_node 2 --resource T1  --run_cfg hpcIdentity=oncall_dai_pet,hpcClusterUuid=MastNaoTestCluster main.pa
```
https://www.internalfb.com/mast/job/tsm_wilsonhong-torchelastic_distributed_sum_ef3fd8d3

- classy_vision
```
flow-cli canary  pytorch.elastic.examples.classy_vision.main --entitlement gpu_prod --run-as-secure-group oncall_dai_pet --buck-target //fblearner/flow/projects/pytorch/elastic/examples:workflow
```
https://our.intern.facebook.com/intern/fblearner/details/263970380/?notif_channel=cli

Reviewed By: tierex

Differential Revision: D27512554

fbshipit-source-id: 903d25d96655085685f874113826d4627d9a79e4
2021-04-08 09:58:20 -07:00
Can Balioglu
493a233c04 [torch/elastic] Revise the rendezvous handler registry logic. (#55466)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55466

Improve the implementation and the unit test coverage of `RendezvousHandlerRegistry`.

### Note
See the original diff (D27442325 (df299dbd7d)) that had to be reverted due to an unexpected Python version incompatibility between the internal and external PyTorch CI tests.

Test Plan: Run the existing and newly-introduced unit tests.

Reviewed By: tierex

Differential Revision: D27623215

fbshipit-source-id: 51538d0f154f64e04f685a95d40d805b478c93f9
2021-04-07 20:43:20 -07:00
Aliaksandr Ivanou
f5675f8306 [torchelastic] Make sure torchelastic mp wait for queue to be drained before finishing the process (#55412)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55412

The diff fixes a bug where worker processes could exit before the torchelastic process had read their return values. This is a rare event, but it can still happen, e.g. https://fb.workplace.com/groups/319878845696681/permalink/512409069776990/

When users want to return a torch.Tensor object from a worker process, torchelastic multiprocessing can fail. Currently a worker process finishes as soon as it writes its output to the IPC queue, without waiting for confirmation from the receiving process. When this happens, the underlying channel between the worker and the torchelastic process may already be closed. In the case of mp.SimpleQueue the channel consists of file descriptors, which is why we see a FileNotFoundError: once the worker process has finished, its file descriptor is deleted and the torchelastic process cannot find it.

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test

User workflow: f263531643

Reviewed By: cbalioglu

Differential Revision: D27602838

fbshipit-source-id: 29871178232e3af4ad3dec406c234aba9c5faba1
2021-04-07 09:39:24 -07:00
Nikita Shulga
add49e7e4e Enforce PEP263 for PyTorch python codebase (#55346)
Summary:
All Python files containing non-ASCII characters should be correctly annotated with a `# -*- coding: utf-8 -*-` comment.

Delete a number of superfluous non-ASCII characters, most commonly the right single quotation mark U+2019 (’) used instead of the ASCII apostrophe ', for example `Module’s`->`Module's`
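
A check along these lines could be sketched as below. It is an illustration rather than the lint used in the PR: the function name is hypothetical, and the pattern is the encoding-declaration expression given in PEP 263, applied to the first two lines of a file.

```python
import re

# PEP 263 declaration pattern, applied to the first two lines of a file.
CODING = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def needs_coding_comment(data):
    """True if `data` (raw file bytes) has non-ASCII bytes but no declaration."""
    if data.isascii():
        return False
    return not any(CODING.match(line) for line in data.splitlines()[:2])
```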

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55346

Reviewed By: samestep

Differential Revision: D27582044

Pulled By: malfet

fbshipit-source-id: c1cd89655915858ff3a41f675cdfffff795a8e44
2021-04-06 18:31:38 -07:00
Brian Hirsh
ae3a876c9c Revert D27572158: [torchelastic] Make sure torchelastic mp wait for queue to be drained before finishing the process
Test Plan: revert-hammer

Differential Revision:
D27572158 (e9c6a51100)

Original commit changeset: 9a360468acc9

fbshipit-source-id: 29f7e2cba3e134bc81fb31b7e1dfceb7c1f9d734
2021-04-06 11:41:55 -07:00
Aliaksandr Ivanou
e9c6a51100 [torchelastic] Make sure torchelastic mp wait for queue to be drained before finishing the process
Summary:
The diff fixes a bug where worker processes could exit before the torchelastic process had read their return values. This is a rare event, but it can still happen, e.g. https://fb.workplace.com/groups/319878845696681/permalink/512409069776990/

When users want to return a torch.Tensor object from a worker process, torchelastic multiprocessing can fail. Currently a worker process finishes as soon as it writes its output to the IPC queue, without waiting for confirmation from the receiving process. When this happens, the underlying channel between the worker and the torchelastic process may already be closed. In the case of mp.SimpleQueue the channel consists of file descriptors, which is why we see a FileNotFoundError: once the worker process has finished, its file descriptor is deleted and the torchelastic process cannot find it.

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test

User workflow: f263531643

Reviewed By: cbalioglu, wilson100hong

Differential Revision: D27572158

fbshipit-source-id: 9a360468acc98d85d587ebf223e7e96d4b43fe4b
2021-04-06 11:03:00 -07:00
Brian Hirsh
bf70fe69ae Revert D27442325: [torch/elastic] Revise the rendezvous handler registry logic.
Test Plan: revert-hammer

Differential Revision:
D27442325 (df299dbd7d)

Original commit changeset: 8519a2caacbe

fbshipit-source-id: f10452567f592c23ae79ca31556a2a77546726b1
2021-04-06 06:17:14 -07:00
Can Balioglu
df299dbd7d [torch/elastic] Revise the rendezvous handler registry logic.
Summary: Improve the implementation and the unit test coverage of `RendezvousHandlerRegistry`.

Test Plan: Run the existing and newly-introduced unit tests.

Reviewed By: tierex

Differential Revision: D27442325

fbshipit-source-id: 8519a2caacbe2e3ce5d9a02e87a910503dea27d7
2021-04-05 23:38:29 -07:00
Can Balioglu
359d0a0205 [torch/elastic] Improve the implementation of RendezvousParameters and add its unit tests. (#146)
Summary:
Pull Request resolved: https://github.com/pytorch/elastic/pull/146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54807

Improve the implementation and the unit test coverage of `RendezvousParameters`.

Test Plan: Run the existing and newly-introduced unit tests.

Reviewed By: kiukchung

Differential Revision: D27342444

fbshipit-source-id: 88de356c0a799844a739eb9105185bb8c1acf11f
2021-04-05 23:38:27 -07:00
Can Balioglu
7f06c65a4c [torch/elastic] Improve the implementation of the utility functions and add their unit tests. (#54804)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54804

Improve the implementation of the utility functions to handle more edge cases and also have a new set of unit tests to cover their usage.

Test Plan: Run the existing and newly introduced unit tests.

Reviewed By: kiukchung

Differential Revision: D27327898

fbshipit-source-id: 96b6fe2d910e3de69f44947a0e8a9f687ab50633
2021-04-05 23:38:25 -07:00
Can Balioglu
de7f05b9eb [torch/elastic] Expose a stderr parameter in EtcdServer. (#54805)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54805

Expose a `stderr` parameter in `EtcdServer` to keep the unit test output clean.

Test Plan: Run the existing test suite.

Reviewed By: kiukchung

Differential Revision: D27327495

fbshipit-source-id: 0a342aeda0ff4d85d809aab1cbf155d3fafd4fa1
2021-04-05 23:38:22 -07:00
Can Balioglu
bad8d34780 [torch/elastic] Revise the rendezvous exception types. (#54803)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54803

Revise the rendezvous exception types to align their naming convention more closely with the standard Python exception types.

Test Plan: Run the existing test suite.

Reviewed By: H-Huang

Differential Revision: D27327505

fbshipit-source-id: 862c59222f9ca61a0e5afde89ae8f226090b4f92
2021-04-05 23:36:50 -07:00
Aliaksandr Ivanou
77ccd4f9a3 [5/n][torch/elastic][upstream] Move torchelastic/agent to torch/distributed/elastic/agent (#54343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54343

Move torchelastic/agent to torch/distributed/elastic/agent

Test Plan:
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test/...

Reviewed By: kiukchung, wilson100hong

Differential Revision: D27173271

fbshipit-source-id: 26761acc3f962af2afffcc3c7a237f3b6d65e531
2021-03-22 23:15:37 -07:00
Aliaksandr Ivanou
e91aeb0470 [4/n][torch/elastic][upstream] Move torchelastic/metrics to torch/distributed/elastic/metrics (#53870)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53870

Move torchelastic/metrics to torch/distributed/elastic/metrics

Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/...

Reviewed By: kiukchung

Differential Revision: D26970901

fbshipit-source-id: 0e0a211fe509b7bc3ab10adfefba81cd71b6db37
2021-03-15 16:07:18 -07:00
Aliaksandr Ivanou
ec484981c6 [3/n][torch/elastic][upstream] Move torchelastic/events to torch/distributed/events (#53760)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53760

Pull Request resolved: https://github.com/pytorch/elastic/pull/143

This diff upstreams torchelastic/events to torch.

Test Plan:
buck test mode/dev-nosan //pytorch/elastic/torchelastic/agent/...
buck test mode/dev-nosan //caffe2/test/distributed/elastic/events/fb/...

Reviewed By: kiukchung

Differential Revision: D26932830

fbshipit-source-id: 23fc10d2ead5af7f7ed510ae0d2581cc2421cf76
2021-03-11 11:25:24 -08:00
Kiuk Chung
b03c92a9c5 [2/n][torch/elastic][upstream] Move torchelastic/timer torchelastic/multiprocessing to torch/distributed/elastic (#53574)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53574

Upstreams `torchelastic/timer|multiprocessing` to `torch/distributed/elastic/timer|multiprocessing`

Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/...
buck test mode/dev-nosan //caffe2/test/distributed/elastic/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
buck test mode/dev-nosan //hpc/...
buck test mode/dev-nosan //caffe2/torch/fb/training_toolkit/...
```

Reviewed By: borovsky-d, wilson100hong

Differential Revision: D26899809

fbshipit-source-id: e6dbc2a78282eac296c262b3206a979e3ef1ff53
2021-03-10 12:32:53 -08:00
Kiuk Chung
ba75cedfc5 [1/n][torch/elastic][upstream] Move torchelastic/rendezvous to torch/distributed/rendezvous (#53172)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53172

Pull Request resolved: https://github.com/pytorch/elastic/pull/141

Upstreams two modules to torch:

1. `torchelastic.rendezvous`
2. `torchelastic.utils`

These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic.

==== NOTES: ====
1. I'm disabling the etcd_rendezvous and etcd_server tests in CIRCLECI for the moment since I need to edit the test dockers to contain the etcd server binary (there are 4-5 test dockers, one for each platform, so it will take me some time to set up the environments and test) - T85992919.

2. I've fixed all lint errors in the Python files, but some remain in the cpp files of ZeusRendezvous. I took a look at them, and I don't want to fix them right now for 2 major reasons:
     1. Some of them are more than formatting changes (e.g. std::move vs pass by value) and I don't want to bundle such changes with the move.
     2. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems, and I think it's better for us to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic - T86012579.

Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/data/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/fb/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
```
+ Sandcastle

Reviewed By: H-Huang

Differential Revision: D26718746

fbshipit-source-id: 67cc0350c3d847221cb3c3038f98f47915362f51
2021-03-05 11:27:57 -08:00