Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57151
This PR introduces the implementation of `DynamicRendezvousHandler`, which mostly builds on the types introduced in previous PRs.
ghstack-source-id: 127685212
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28060531
fbshipit-source-id: 844ff0e9c869f2bbb85fba05a16002d00eae130f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57150
This PR refactors the `__init__` method of `DynamicRendezvousHandler` to a `from_backend` static constructor for easier testing and future extensibility.
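As a rough illustration of the pattern (not the actual `DynamicRendezvousHandler` code), a factory method can validate and package its arguments while `__init__` stays trivial, so tests can inject pre-built collaborators. `Settings`, `Handler`, and all parameter names below are hypothetical.

```python
from dataclasses import dataclass
from typing import Any


# Hypothetical settings type; the field names are assumptions.
@dataclass(frozen=True)
class Settings:
    run_id: str
    min_nodes: int
    max_nodes: int


class Handler:
    """Hypothetical handler, used only to illustrate the factory pattern."""

    def __init__(self, settings: Settings, backend: Any) -> None:
        # Plain assignment only, so tests can construct a handler directly.
        self.settings = settings
        self.backend = backend

    @classmethod
    def from_backend(
        cls, run_id: str, backend: Any, min_nodes: int, max_nodes: int
    ) -> "Handler":
        # Argument validation lives in the factory, not in __init__.
        if min_nodes < 1:
            raise ValueError("min_nodes must be at least 1")
        if max_nodes < min_nodes:
            raise ValueError("max_nodes must be >= min_nodes")
        return cls(Settings(run_id, min_nodes, max_nodes), backend)


handler = Handler.from_backend("job-1", backend=object(), min_nodes=1, max_nodes=4)
```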
ghstack-source-id: 127685183
Test Plan: Run the updated unit tests.
Reviewed By: tierex
Differential Revision: D28060336
fbshipit-source-id: b07dcbb61e8ff5a536b7b021cd50438010c648dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57149
This PR introduces the `_RendezvousJoinOp` type that represents a rendezvous join operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685142
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059785
fbshipit-source-id: 6e67a54289eef1a2349fcc52f8841e49c139459a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57148
This PR introduces the `_RendezvousExitOp` type that represents a rendezvous exit operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685094
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059764
fbshipit-source-id: 2da428885f1390957242fdd82d68cee2ac273c71
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57147
This PR introduces the `_RendezvousKeepAliveOp` type that represents a rendezvous keep-alive heartbeat operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685037
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059733
fbshipit-source-id: 31fd8fc06f03d8f9cd21558b15a06dea7ad85bc6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57146
This PR introduces the `_RendezvousCloseOp` type that represents a rendezvous close operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127684991
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059693
fbshipit-source-id: 6c944d3b4f6a6ed2057ea2921ae8a42609998dd2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57145
This PR introduces the `_DistributedRendezvousOpExecutor` type that implements the `_RendezvousOpExecutor` interface for rendezvous shared via a `_RendezvousStateHolder`.
ghstack-source-id: 127684945
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059417
fbshipit-source-id: 7ef72ea16b54eaaa11a6ece7459d385d49692a84
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57144
This PR introduces the `_RendezvousOpExecutor` interface. Implementers of this interface are responsible for executing rendezvous operations in a state machine that outputs actions based on the current state of the rendezvous.
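A minimal sketch of the idea, assuming a single local node and no shared state: an op inspects the current state and returns the next action, and the executor applies actions until the op reports that it is finished. `Action`, `State`, `join_op`, `exit_op`, and `LocalExecutor` are hypothetical names, not the types added by this PR.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Set


class Action(Enum):
    ADD_NODE = 1
    REMOVE_NODE = 2
    FINISH = 3


@dataclass
class State:
    participants: Set[str] = field(default_factory=set)


# An op maps (current state, node name) to the next action to take.
Op = Callable[[State, str], Action]


def join_op(state: State, node: str) -> Action:
    return Action.FINISH if node in state.participants else Action.ADD_NODE


def exit_op(state: State, node: str) -> Action:
    return Action.REMOVE_NODE if node in state.participants else Action.FINISH


class LocalExecutor:
    """Hypothetical executor that applies actions to an in-memory state."""

    def __init__(self, node: str, state: State) -> None:
        self._node = node
        self._state = state

    def run(self, op: Op) -> None:
        # Ask the op for an action, apply it, and repeat until it finishes.
        while True:
            action = op(self._state, self._node)
            if action is Action.FINISH:
                return
            if action is Action.ADD_NODE:
                self._state.participants.add(self._node)
            elif action is Action.REMOVE_NODE:
                self._state.participants.discard(self._node)


state = State()
executor = LocalExecutor("node0", state)
executor.run(join_op)
```

Keeping the decision logic in the op and the side effects in the executor means each op can be unit tested against a hand-built state.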
ghstack-source-id: 127684898
Test Plan: None beyond `flake8` and `mypy` as this is solely an interface definition.
Reviewed By: tierex
Differential Revision: D28059159
fbshipit-source-id: 8e7da33e02336206cddbe76d773681e98c28a98f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56538
This PR introduces the `_RendezvousStateHolder` interface and its accompanying `_BackendRendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state with the other nodes.
ghstack-source-id: 127684796
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D27892600
fbshipit-source-id: a55d884a1f9b0d742787be4dff4271e076c08962
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57143
This PR introduces a `name` attribute in `_PeriodicTimer` for testing and debugging purposes.
ghstack-source-id: 127684751
Test Plan: Run the new and updated unit tests.
Reviewed By: tierex
Differential Revision: D28059045
fbshipit-source-id: 9eb067300aea21a99577e6cd8a354f7eb749f4a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57142
This PR extends the return type of `RendezvousBackend`'s `set_state` method with an additional boolean flag that specifies whether the write attempt has succeeded.
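To illustrate why the flag is useful, here is a sketch of a compare-and-set style write against a toy in-memory backend. `InMemoryBackend` and its integer token scheme are assumptions made for this example and do not mirror the real `RendezvousBackend` implementations.

```python
from typing import Optional, Tuple


class InMemoryBackend:
    """Hypothetical backend; the token is a simple version counter."""

    def __init__(self) -> None:
        self._state = b""
        self._token = 0

    def set_state(
        self, state: bytes, token: Optional[int] = None
    ) -> Tuple[bytes, int, bool]:
        # A missing token means "I expect that nothing has been written yet".
        expected = 0 if token is None else token
        if expected != self._token:
            # Stale token: report the current state and that the write failed.
            return self._state, self._token, False
        self._state = state
        self._token += 1
        return self._state, self._token, True


backend = InMemoryBackend()
state, token, ok = backend.set_state(b"v1")
stale_state, stale_token, stale_ok = backend.set_state(b"v2", token=0)
```

With the flag, a caller can tell "my write landed" apart from "someone else wrote first and this is their state" without comparing payloads.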
ghstack-source-id: 127629538
Test Plan: Run the updated unit tests.
Reviewed By: tierex
Differential Revision: D28058980
fbshipit-source-id: 26333790c39386891beb155b20ba1291d2cbdd03
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57141
Per feedback this PR renames `last_keep_alives` to `last_heartbeats` in `_RendezvousState`.
ghstack-source-id: 127629442
Test Plan: Run the updated unit tests.
Reviewed By: tierex
Differential Revision: D28058948
fbshipit-source-id: 0db12eac56a47a426a7a48fb5c93ac6a08b0d22e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57140
This PR introduces a new `heartbeat` attribute in `RendezvousTimeout`.
ghstack-source-id: 127626815
Test Plan: Run the updated unit tests.
Reviewed By: tierex
Differential Revision: D28058908
fbshipit-source-id: c6f8b3a06210cc59714fa841d9387eeb028dc02f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57139
This PR sets the `order` attribute of the `dataclass` annotation to `True` in order to introduce comparison operators for `_NodeDesc`.
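For reference, a small sketch of what `order=True` provides. `NodeDesc` and its fields are hypothetical stand-ins and not the actual `_NodeDesc` definition.

```python
from dataclasses import dataclass


# Hypothetical stand-in for `_NodeDesc`; the field names are assumptions.
@dataclass(frozen=True, order=True)
class NodeDesc:
    fqdn: str
    pid: int
    local_id: int


a = NodeDesc("host-a", 100, 0)
b = NodeDesc("host-b", 50, 0)

# With order=True, instances compare like tuples of their fields in
# declaration order, so a collection of nodes sorts deterministically.
nodes = sorted([b, a])
```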
ghstack-source-id: 127626783
Test Plan: Run the existing unit tests.
Reviewed By: tierex
Differential Revision: D28058851
fbshipit-source-id: 66313f84f507100e20acb687a3427b3dd51a6310
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56537
This PR introduces the `RendezvousSettings` type to consolidate the arguments passed to `DynamicRendezvousHandler`.
ghstack-source-id: 127626738
Test Plan: Run the existing unit tests.
Reviewed By: tierex
Differential Revision: D27890155
fbshipit-source-id: 22060c25b6927cc832f18ae6c5f7ba0f7a9ef3cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57217
In the torch multiprocessing error handler, we try to remove the error file if it already exists. Before removing it, we try to log its contents, assuming that they are valid JSON.
However, in some cases, it isn't and then we end up not clearing the file.
Let's handle this error and make sure that the file is cleaned irrespective of the contents of the file.
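The intended behavior can be sketched roughly as follows; `remove_stale_error_file` is a hypothetical helper written for illustration and is not the code changed by this diff.

```python
import json
import logging
import os

log = logging.getLogger(__name__)


def remove_stale_error_file(path: str) -> None:
    """Log the contents of an existing error file, then always delete it."""
    if not os.path.isfile(path):
        return
    try:
        with open(path) as fp:
            contents = json.load(fp)
        log.warning("%s already exists and will be removed: %s", path, contents)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        log.warning("%s exists but does not contain valid JSON", path)
    finally:
        # Clean up no matter what the file contained.
        os.remove(path)
```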
Reviewed By: devashisht
Differential Revision: D28041470
fbshipit-source-id: da96d11b8f7091715cf0152cccd3ecc08b688eae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56739
The diff makes several tiny changes:
* Add logs for each worker error file destination
* Make sure log_dir is propagated from the launcher
* Make ProcessFailure initialization error non-fatal.
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test
https://fburl.com/tupperware/0nizb9z8
Reviewed By: borovsky-d, wilson100hong
Differential Revision: D27952596
fbshipit-source-id: 69582bf4be47758def4008f2abf82d123294cd1a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56535
This PR renames the `_Rendezvous` class to `_RendezvousState` in preparation for the upcoming changes.
ghstack-source-id: 126979138
Test Plan: Run the existing unit tests.
Reviewed By: H-Huang
Differential Revision: D27889894
fbshipit-source-id: 027d26aa5e1acd5bba3ad2e58b140428a4a176b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56534
This PR reorders the type definitions in dynamic_rendezvous.py to improve readability.
ghstack-source-id: 126979087
Test Plan: Run the existing unit tests.
Reviewed By: H-Huang
Differential Revision: D27889817
fbshipit-source-id: 04291af9b8f3170e4b33cb4f33e0dff0d2d3fb23
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56533
This PR introduces a small utility function to delay the execution of the current thread.
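A sketch of what such a helper might look like. The name `delay` and the option to pass a `(low, high)` range for a randomized delay are assumptions made for this example.

```python
import random
import time
from typing import Tuple, Union


def delay(seconds: Union[float, Tuple[float, float]]) -> None:
    """Suspend the current thread for a fixed or randomly chosen duration."""
    if isinstance(seconds, tuple):
        # Assumed behavior: pick a random duration within the given range.
        seconds = random.uniform(*seconds)
    if seconds > 0:
        time.sleep(seconds)
```

A randomized range is handy in retry loops, where it keeps several nodes from retrying in lockstep.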
ghstack-source-id: 126979035
Test Plan: Run the associated unit tests.
Reviewed By: H-Huang
Differential Revision: D27889671
fbshipit-source-id: aae93b624bd4704da7a48004f50d130cec64969d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56532
This PR fixes a subtle issue with the finalizer implementation of `_PeriodicTimer`.
We avoid using a regular finalizer (a.k.a. `__del__`) for stopping the timer as joining a daemon thread during the interpreter shutdown can cause deadlocks. The `weakref.finalize` is a superior alternative that provides a consistent behavior regardless of the GC implementation.
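A minimal sketch of the approach, assuming a timer backed by a daemon thread and a `threading.Event`. `PeriodicTimer` below is a simplified, hypothetical stand-in for `_PeriodicTimer`.

```python
import threading
import weakref
from typing import Callable


class PeriodicTimer:
    """Hypothetical timer that calls `function` every `interval` seconds."""

    def __init__(self, interval: float, function: Callable[[], None]) -> None:
        self._stop_event = threading.Event()
        self._thread = threading.Thread(
            target=self._run,
            args=(interval, function, self._stop_event),
            daemon=True,
        )
        # The callback and its arguments must not reference `self`;
        # otherwise the timer object could never be garbage collected.
        self._finalizer = weakref.finalize(
            self, self._stop_thread, self._thread, self._stop_event
        )
        self._thread.start()

    @staticmethod
    def _run(interval: float, function: Callable[[], None], stop_event: threading.Event) -> None:
        # Event.wait() returns False on timeout and True once the event is set.
        while not stop_event.wait(interval):
            function()

    @staticmethod
    def _stop_thread(thread: threading.Thread, stop_event: threading.Event) -> None:
        stop_event.set()
        thread.join()

    def cancel(self) -> None:
        # A finalizer runs at most once, so calling cancel() twice is safe.
        self._finalizer()
```

The finalizer runs when the timer is garbage collected, when `cancel()` is called, or at interpreter exit, whichever happens first.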
ghstack-source-id: 126978904
Test Plan: Run the existing unit tests as there is no behavioral change.
Reviewed By: H-Huang
Differential Revision: D27889289
fbshipit-source-id: a248cf6fd1abc4da8bef90e160fa9669a4961fa5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56386
The diff fixes a bug in handler resolution:
`_create_static_handler` pointed to etcd, and `_create_etcd_handler` pointed to static.
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:test_launcher
Added test_launcher to the ci/cd tests
Reviewed By: cbalioglu
Differential Revision: D27858897
fbshipit-source-id: 440155789958c091ce5755e7c9524e4bb704203a
Summary: `Redirects` was renamed to `Std` in `torch.distributed.elastic.multiprocessing.api`. Pointed out by a user in https://github.com/pytorch/elastic/issues/147.
Test Plan: N/A just doc change
Reviewed By: tierex
Differential Revision: D27866614
fbshipit-source-id: 9fb901aae7ebe11cde13000a1c118de527f34400
Summary:
As this diff shows, currently there are a couple hundred instances of raw `noqa` in the codebase, which just ignore all errors on a given line. That isn't great, so this PR changes all existing instances of that antipattern to qualify the `noqa` with respect to a specific error code, and adds a lint to prevent more of this from happening in the future.
Interestingly, some of the examples the `noqa` lint catches are genuine attempts to qualify the `noqa` with a specific error code, such as these two:
```
test/jit/test_misc.py:27: print(f"{hello + ' ' + test}, I'm a {test}") # noqa E999
test/jit/test_misc.py:28: print(f"format blank") # noqa F541
```
However, those are still wrong because they are [missing a colon](https://flake8.pycqa.org/en/3.9.1/user/violations.html#in-line-ignoring-errors), which actually causes the error code to be completely ignored:
- If you change them to anything else, the warnings will still be suppressed.
- If you add the necessary colons then it is revealed that `E261` was also being suppressed, unintentionally:
```
test/jit/test_misc.py:27:57: E261 at least two spaces before inline comment
test/jit/test_misc.py:28:35: E261 at least two spaces before inline comment
```
I did try using [flake8-noqa](https://pypi.org/project/flake8-noqa/) instead of a custom `git grep` lint, but it didn't seem to work. This PR is definitely missing some of the functionality that flake8-noqa is supposed to provide, though, so if someone can figure out how to use it, we should do that instead.
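For illustration only, the check can be approximated in Python as below. This is not the `git grep` lint added by this PR; the regular expression and the `find_unqualified_noqa` helper are assumptions made for this sketch.

```python
import re
from typing import Iterable, List, Tuple

# `# noqa` that is not immediately followed by `:` and an error code such as F403.
UNQUALIFIED_NOQA = re.compile(r"#\s*noqa(?!:\s?[A-Z]+[0-9]+)")


def find_unqualified_noqa(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs whose `noqa` lacks a specific error code."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if UNQUALIFIED_NOQA.search(line)
    ]


sample = [
    "import os  # noqa",
    'print(f"format blank") # noqa F541',
    "from .api import *  # noqa: F403",
]
flagged = find_unqualified_noqa(sample)
```

In this sample the first two lines are flagged: the first has no error code at all, and the second is missing the colon, so flake8 would treat it as a blanket `noqa`.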
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56272
Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI run (before this PR was finished) failed:
- https://github.com/pytorch/pytorch/runs/2365189927
Reviewed By: janeyx99
Differential Revision: D27830127
Pulled By: samestep
fbshipit-source-id: d6dcf4f945ebd18cd76c46a07f3b408296864fcb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56214
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56037
The diff introduces the new `torch.distributed.elastic_launch` and removes the internals of `torch.distributed.launch` while keeping backwards compatibility.
Since torchelastic and torch.launch are not fully compatible due to the `--use_env` arg, the `torch.distributed.launch` deprecation is going to be iterative: as part of pytorch 1.9 we are going to deprecate it, and in the following releases we will remove `torch.distributed.launch`.
The diff leaves the `torchelastic.distributed.launch` module in place, and follow-up diffs will migrate users from `torchelastic.distributed.launch` to `torch.distributed.elastic_launch`.
Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/...
Reviewed By: H-Huang
Differential Revision: D27805799
fbshipit-source-id: 599a4c0592fbc7a1bc1953040626dd6b72bac907
Summary:
This PR introduces a basic timer type that periodically calls a specified function. Its main use in the upcoming `DynamicRendezvousHandler` implementation will be to send periodic keep-alive updates in a background thread.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55919
Reviewed By: tierex
Differential Revision: D27740823
Pulled By: cbalioglu
fbshipit-source-id: e46fc848ab033995946a38a29c01d67d387a4cf5
Summary:
This PR includes the auxiliary types used by the upcoming implementation of the `DynamicRendezvousHandler`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55932
Test Plan: Run the existing and newly-introduced unit/integration tests.
Reviewed By: tierex
Differential Revision: D27742329
Pulled By: cbalioglu
fbshipit-source-id: cf2e0d88042909739e7c37c25b4b90192c26e198
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55687
The diff makes sure that users can transfer the following parameters:
* master_addr
* master_port
* node_rank
* use_env
The diff implements StaticTCPRendezvous, which creates a store with a listener on agent rank #0.
The diff modifies caffe2/rendezvous: if the worker processes are launched by the torchelastic agent, they create a PrefixStore("worker/") from a TCPStore without a listener.
The diff adds macros functionality to torch/distributed/elastic/utils that helps to resolve the local_rank parameter.
Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/test:launch_test
Reviewed By: cbalioglu, wilson100hong
Differential Revision: D27643206
fbshipit-source-id: 540fb26feac322cc3ec0a989fe53324755ccc4ea
Summary:
Generally wildcard imports are bad for the reasons described here: https://www.flake8rules.com/rules/F403.html
This PR replaces wildcard imports with an explicit list of imported items where possible, and adds a `# noqa: F403` comment in the other cases (mostly re-exports in `__init__.py` files).
This is a prerequisite for https://github.com/pytorch/pytorch/issues/55816, because currently [`tools/codegen/dest/register_dispatch_key.py` simply fails if you sort its imports](https://github.com/pytorch/pytorch/actions/runs/742505908).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55838
Test Plan: CI. You can also run `flake8` locally.
Reviewed By: jbschlosser
Differential Revision: D27724232
Pulled By: samestep
fbshipit-source-id: 269fb09cb4168f8a51fd65bfaacc6cda7fb87c34
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55637
This diff introduces the `EtcdRendezvousBackend` type that will serve as an experimental alternative to the existing `EtcdRendezvousHandler`.
The major advantage of `EtcdRendezvousBackend` is that it delegates the bulk of the rendezvous handling logic to `DynamicRendezvousHandler` which is shared with `C10dRendezvousBackend` (see D27654492) and any other potential future rendezvous backend (e.g. Amazon S3).
ghstack-source-id: 126312209
Test Plan: Run the existing and newly-introduced unit/integration tests.
Reviewed By: tierex
Differential Revision: D27654498
fbshipit-source-id: f3259adfc9068b7e323b947a7d8d52fcd0b8ada1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55636
This diff introduces:
- The `C10dRendezvousBackend` type to support C10d stores as rendezvous backends.
- A fix to the `TCPStore.compare_set()` function to support non-existent keys.
- A placeholder `c10d-experimental` registry to instantiate C10d-backed rendezvous backends via `get_rendezvous_handler()`.
ghstack-source-id: 126312162
Test Plan: Run the existing and newly-introduced unit/integration tests.
Reviewed By: tierex
Differential Revision: D27654492
fbshipit-source-id: 09f498138b35186de4b0e174adb33fb5b5aa4b52
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55635
This diff introduces the `DynamicRendezvousHandler` type as a stub implementation and its accompanying `RendezvousBackend` interface.
`DynamicRendezvousHandler` is intended to be a backend-agnostic type that will contain the core (bulk) logic of rendezvous handling. Any backend specific operation will be delegated to a concrete subclass of `RendezvousBackend` (e.g. `C10dRendezvousBackend` - see D27654492) that is passed as a constructor argument to `DynamicRendezvousHandler`.
ghstack-source-id: 126304697
Test Plan: Run the existing and newly-introduced unit/integration tests.
Reviewed By: tierex
Differential Revision: D27654478
fbshipit-source-id: 9fc89a6e4cb308971c65b29a7c5af7ae191f70c5
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54211
This was a little more annoying than expected, because the `exclude = ` key in `mypy.ini` is weird. I'll file an upstream issue about that.
I ignored one file, `torch/distributed/elastic/agent/server/api.py` that had ~8 errors that were hard to figure out. This can be done in a follow-up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55712
Reviewed By: walterddr
Differential Revision: D27694976
Pulled By: malfet
fbshipit-source-id: 228d8be6af040343ce46595dabaca212e69ccc68
Summary:
malfet found a couple of these in https://github.com/pytorch/pytorch/issues/55346; this PR removes the rest and adds a lint that prevents them from being accidentally added again in the future. It also removes the `-o` flag added in https://github.com/pytorch/pytorch/issues/53733 (which was unnecessarily hiding context without reducing the number of lines of output), and updates the lint error messages to reflect that the individual line numbers are shown in the logs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55465
Test Plan:
The "Lint / quick-checks" job in GitHub Actions should succeed on this PR. To verify that the lint does correctly find and error on non-breaking spaces, checkout ece075195d and run it locally:
```sh
(! git --no-pager grep -In $'\u00a0' -- . || (echo "The above lines have non-breaking spaces (U+00A0); please convert them to spaces (U+0020)"; false))
```
It should print over a hundred lines of output and exit with status 1.
Reviewed By: janeyx99
Differential Revision: D27622136
Pulled By: samestep
fbshipit-source-id: e7ffd5a9519093e7a0ffdf55e9291f63e21ce841
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55466
Improve the implementation and the unit test coverage of `RendezvousHandlerRegistry`.
### Note
See the original diff (D27442325 (df299dbd7d)) that had to be reverted due to an unexpected Python version incompatibility between the internal and external PyTorch CI tests.
Test Plan: Run the existing and newly-introduced unit tests.
Reviewed By: tierex
Differential Revision: D27623215
fbshipit-source-id: 51538d0f154f64e04f685a95d40d805b478c93f9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55412
The diff resolves a bug where worker processes could exit before the torchelastic process had read their return values. This is a rare event, but it can still happen, e.g. https://fb.workplace.com/groups/319878845696681/permalink/512409069776990/
When users want to return a torch.Tensor object from a worker process, torchelastic multiprocessing will fail. Currently a worker process finishes its job after it writes its output to the IPC queue, without waiting for confirmation from the receiver process. When this happens, the underlying channel between the worker and the torchelastic process can be closed. In the case of mp.SimpleQueue the channel is a set of file descriptors, which is why we see FileNotFoundException: since the worker process finished execution, the file descriptor was deleted, and the torchelastic process cannot find it.
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test
User workflow: f263531643
Reviewed By: cbalioglu
Differential Revision: D27602838
fbshipit-source-id: 29871178232e3af4ad3dec406c234aba9c5faba1
Summary:
The diff resolves a bug where worker processes could exit before the torchelastic process had read their return values. This is a rare event, but it can still happen, e.g. https://fb.workplace.com/groups/319878845696681/permalink/512409069776990/
When users want to return a torch.Tensor object from a worker process, torchelastic multiprocessing will fail. Currently a worker process finishes its job after it writes its output to the IPC queue, without waiting for confirmation from the receiver process. When this happens, the underlying channel between the worker and the torchelastic process can be closed. In the case of mp.SimpleQueue the channel is a set of file descriptors, which is why we see FileNotFoundException: since the worker process finished execution, the file descriptor was deleted, and the torchelastic process cannot find it.
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test
User workflow: f263531643
Reviewed By: cbalioglu, wilson100hong
Differential Revision: D27572158
fbshipit-source-id: 9a360468acc98d85d587ebf223e7e96d4b43fe4b
Summary: Improve the implementation and the unit test coverage of `RendezvousHandlerRegistry`.
Test Plan: Run the existing and newly-introduced unit tests.
Reviewed By: tierex
Differential Revision: D27442325
fbshipit-source-id: 8519a2caacbe2e3ce5d9a02e87a910503dea27d7
Summary:
Pull Request resolved: https://github.com/pytorch/elastic/pull/146
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54807
Improve the implementation and the unit test coverage of `RendezvousParameters`.
Test Plan: Run the existing and newly-introduced unit tests.
Reviewed By: kiukchung
Differential Revision: D27342444
fbshipit-source-id: 88de356c0a799844a739eb9105185bb8c1acf11f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54804
Improve the implementation of the utility functions to handle more edge cases and also have a new set of unit tests to cover their usage.
Test Plan: Run the existing and newly introduced unit tests.
Reviewed By: kiukchung
Differential Revision: D27327898
fbshipit-source-id: 96b6fe2d910e3de69f44947a0e8a9f687ab50633
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54805
Expose a `stderr` parameter on `EtcdServer` to keep the unit test output clean.
Test Plan: Run the existing test suite.
Reviewed By: kiukchung
Differential Revision: D27327495
fbshipit-source-id: 0a342aeda0ff4d85d809aab1cbf155d3fafd4fa1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54803
Revise the rendezvous exception types to align their naming convention more closely with the standard Python exception types.
Test Plan: Run the existing test suite.
Reviewed By: H-Huang
Differential Revision: D27327505
fbshipit-source-id: 862c59222f9ca61a0e5afde89ae8f226090b4f92
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53172
Pull Request resolved: https://github.com/pytorch/elastic/pull/141
Upstreams two modules to torch:
1. `torchelastic.rendezvous`
2. `torchelastic.utils`
These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic.
==== NOTES: ====
1. I'm disabling etcd_rendezvous and etcd_server tests in CIRCLECI for the moment since I need to edit the test dockers to contain the etcd server binary (there's 4-5 test dockers - one for each platform so this is going to take some time for me to set up the environments and test) - T85992919.
2. I've fixed all lint errors in the Python files, but there are still some in the cpp files for ZeusRendezvous. I took a look at them, and I don't want to fix the linter errors right now for 2 major reasons:
   1. Some of them are more than formatting changes (e.g. std::move vs pass by value) and I don't want to bundle such changes with the move.
   2. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems and I think it's better for us to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic - T86012579.
Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/data/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/fb/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
```
+ Sandcastle
Reviewed By: H-Huang
Differential Revision: D26718746
fbshipit-source-id: 67cc0350c3d847221cb3c3038f98f47915362f51