Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55635
This diff introduces a stub implementation of the `DynamicRendezvousHandler` type along with its accompanying `RendezvousBackend` interface.
`DynamicRendezvousHandler` is intended to be a backend-agnostic type that will contain the bulk of the rendezvous handling logic. Any backend-specific operation will be delegated to a concrete subclass of `RendezvousBackend` (e.g. `C10dRendezvousBackend`, see D27654492) that is passed as a constructor argument to `DynamicRendezvousHandler`.
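The delegation described above can be illustrated with a minimal sketch. All names here (`BackendSketch`, `InMemoryBackend`, `HandlerSketch`, `record_participant`) are hypothetical and chosen for illustration only; they are not the interface introduced by this diff, and the real `RendezvousBackend` may expose different methods.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class BackendSketch(ABC):
    """Hypothetical stand-in for a backend interface (not the real torch API)."""

    @property
    @abstractmethod
    def name(self) -> str:
        """Return a human-readable backend name."""

    @abstractmethod
    def get_state(self) -> Optional[bytes]:
        """Return the stored state, or None if nothing has been stored yet."""

    @abstractmethod
    def set_state(self, state: bytes) -> None:
        """Store the given state."""


class InMemoryBackend(BackendSketch):
    """Toy backend that keeps the state in process memory (assumption)."""

    def __init__(self) -> None:
        self._state: Optional[bytes] = None

    @property
    def name(self) -> str:
        return "in-memory"

    def get_state(self) -> Optional[bytes]:
        return self._state

    def set_state(self, state: bytes) -> None:
        self._state = state


class HandlerSketch:
    """Backend-agnostic handler: owns the logic, delegates storage to the backend."""

    def __init__(self, backend: BackendSketch) -> None:
        # The backend is injected through the constructor.
        self._backend = backend

    def get_backend(self) -> str:
        return self._backend.name

    def record_participant(self, participant: str) -> int:
        # Read the current participant list through the backend.
        raw = self._backend.get_state()
        participants: List[str] = raw.decode().split(",") if raw else []
        if participant not in participants:
            participants.append(participant)
        # Write the updated list back through the backend.
        self._backend.set_state(",".join(participants).encode())
        return len(participants)
```

With this shape, swapping the storage mechanism only requires passing a different `BackendSketch` subclass to the handler; the handler logic itself stays unchanged.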
ghstack-source-id: 126304697
Test Plan: Run the existing and newly introduced unit/integration tests.
Reviewed By: tierex
Differential Revision: D27654478
fbshipit-source-id: 9fc89a6e4cb308971c65b29a7c5af7ae191f70c5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54804
Improve the implementation of the utility functions to handle more edge cases, and add a new set of unit tests covering their usage.
Test Plan: Run the existing and newly introduced unit tests.
Reviewed By: kiukchung
Differential Revision: D27327898
fbshipit-source-id: 96b6fe2d910e3de69f44947a0e8a9f687ab50633
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53172
Pull Request resolved: https://github.com/pytorch/elastic/pull/141
Upstreams two modules to torch:
1. `torchelastic.rendezvous`
2. `torchelastic.utils`
These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic.
==== NOTES: ====
1. I'm disabling the etcd_rendezvous and etcd_server tests in CircleCI for the moment since I need to edit the test Docker images to include the etcd server binary. There are 4-5 test images, one for each platform, so it will take some time to set up and test the environments (T85992919).
2. I've fixed all lint errors in the Python files, but some remain in the C++ files of ZeusRendezvous. I took a look at them, and I don't want to fix them right now for two main reasons:
   1. Some of them are more than formatting changes (e.g. `std::move` vs. pass by value), and I don't want to bundle such changes with the move.
   2. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems, and I think it's better to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic (T86012579).
Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/data/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/fb/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
```
+ Sandcastle
Reviewed By: H-Huang
Differential Revision: D26718746
fbshipit-source-id: 67cc0350c3d847221cb3c3038f98f47915362f51