Summary:
Follow-up from a discussion with the distributed and r2p teams: the tests under elastic in distributed should be owned by oncall: r2p, not distributed.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67293
Reviewed By: jbschlosser
Differential Revision: D31973779
Pulled By: janeyx99
fbshipit-source-id: 05875a7600c6eb1da1310a48e1e32a1a69461c55
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55636
This diff introduces:
- The `C10dRendezvousBackend` type to support C10d stores as rendezvous backends.
- A fix to the `TCPStore.compare_set()` function to support non-existent keys.
- A placeholder `c10d-experimental` registry to instantiate C10d-backed rendezvous backends via `get_rendezvous_handler()`.
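
For reviewers unfamiliar with compare-and-set on a key that does not exist yet, here is a minimal sketch of the intended semantics. It is a toy in-memory model, not the `TCPStore` implementation; the `InMemoryStore` class is hypothetical, and the rule that an empty expected value creates a missing key is an assumption based on the documented behavior of `compare_set()`.

```python
from typing import Dict


class InMemoryStore:
    """Hypothetical stand-in for a C10d store, for illustration only."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def compare_set(self, key: str, expected: str, desired: str) -> bytes:
        """Set `key` to `desired` if its current value equals `expected`.

        Returns the value associated with the key after the call, or
        `expected` if the key is missing and was not created.
        """
        current = self._data.get(key)
        if current is None:
            # Assumed rule for a non-existent key: an empty expected
            # value means "create the key with the desired value".
            if expected == "":
                self._data[key] = desired.encode()
                return desired.encode()
            # The key is missing and the caller expected a value: no write.
            return expected.encode()
        if current == expected.encode():
            # The expected value matches, so the swap succeeds.
            self._data[key] = desired.encode()
            return desired.encode()
        # Mismatch: leave the stored value untouched and report it.
        return current
```

A rendezvous backend can use this pattern to publish its state only when no other node has updated it in the meantime, retrying with the returned value on a mismatch.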
ghstack-source-id: 126312162
Test Plan: Run the existing and newly-introduced unit/integration tests.
Reviewed By: tierex
Differential Revision: D27654492
fbshipit-source-id: 09f498138b35186de4b0e174adb33fb5b5aa4b52
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55466
Improve the implementation and the unit test coverage of `RendezvousHandlerRegistry`.
### Note
See the original diff (D27442325 (df299dbd7d)) that had to be reverted due to an unexpected Python version incompatibility between the internal and external PyTorch CI tests.
Test Plan: Run the existing and newly-introduced unit tests.
Reviewed By: tierex
Differential Revision: D27623215
fbshipit-source-id: 51538d0f154f64e04f685a95d40d805b478c93f9
Summary: Improve the implementation and the unit test coverage of `RendezvousHandlerRegistry`.
Test Plan: Run the existing and newly-introduced unit tests.
Reviewed By: tierex
Differential Revision: D27442325
fbshipit-source-id: 8519a2caacbe2e3ce5d9a02e87a910503dea27d7
Summary:
Pull Request resolved: https://github.com/pytorch/elastic/pull/146
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54807
Improve the implementation and the unit test coverage of `RendezvousParameters`.
Test Plan: Run the existing and newly-introduced unit tests.
Reviewed By: kiukchung
Differential Revision: D27342444
fbshipit-source-id: 88de356c0a799844a739eb9105185bb8c1acf11f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54803
Revise the rendezvous exception types to align their naming convention more closely with the standard Python exception types.
Test Plan: Run the existing test suite.
Reviewed By: H-Huang
Differential Revision: D27327505
fbshipit-source-id: 862c59222f9ca61a0e5afde89ae8f226090b4f92
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53172
Pull Request resolved: https://github.com/pytorch/elastic/pull/141
Upstreams two modules to torch:
1. `torchelastic.rendezvous`
2. `torchelastic.utils`
These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic.
==== NOTES: ====
1. I'm disabling the etcd_rendezvous and etcd_server tests in CircleCI for the moment, since I need to edit the test dockers to contain the etcd server binary. There are 4-5 test dockers, one for each platform, so it will take some time to set up the environments and test (T85992919).
2. I've fixed all lint errors in the Python files, but some remain in the cpp files for ZeusRendezvous. I took a look at them and don't want to fix them right now, for two major reasons:
    1. Some of them are more than formatting changes (e.g. std::move vs. pass by value), and I don't want to bundle such changes with the move.
    2. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems, and I think it's better to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic (T86012579).
Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/data/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/fb/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
```
\+ Sandcastle
Reviewed By: H-Huang
Differential Revision: D26718746
fbshipit-source-id: 67cc0350c3d847221cb3c3038f98f47915362f51