Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65149Fixes#64789
There is a race condition between when the free port is acquired to when it is used to create the store in which it may have been used. Since this test only tests that timeout is triggered for tcpstore, we can bind to any port on tcpstore creation.
This only affects the test on the server (since that is where the port is used), but I changed both tests for clarity
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang cbalioglu gcramer23
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D30993166
Pulled By: H-Huang
fbshipit-source-id: eac4f28d641ac87c4ebee89df83f90955144f2f1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887
1) Introduced a `sandcastle_skip_if` decorator that ensures these
tests just get passed on sandcastle.
2) Fixed all test files under `test/distributed` to not use `unittest.skip`
Overall goal is to avoid using skips since sandcastle tags these tests as
continuously skipping.
ghstack-source-id: 134382237
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D29784152
fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60558
Fixes 1/2 flaky tests as described in: https://github.com/pytorch/pytorch/issues/60260
`test_create_store_timeout_on_server` tests whether trying to create a `c10d::TCPStore` server on an already taken port actually fails with an `IOError`. Prior to this change the `utils.get_socket_with_port()` util method was used to synthetically reserve a port, then try creating the `TCPStore` on that port to validate the `IOError`. The issue with this is that on a dual stack ip setup, `get_socket_with_port()` (since it uses `socket.AF_UNSPEC`) reserves an ipv6 port, while `TCPStore` will try binding to an ipv4 port, so an `IOError` is not observed.
Changing the logic of the test to create two `TCPStore` servers. The first chooses a free port (by passing `server_port=0`) while the second tries to create a `TCPStore` server on the port that the first store is already running on. This would induce an `IOError` on the second store's constructor.
NOTE: this change does not solve another broader issue with `TCPStore` where the server and workers can listen and connect on ipv4 vs ipv6 when they are running on dual-stak ip hosts without ipv4 DNS entry and/or a `/etc/gai.conf` specifying the preferred bind ordering. See: https://github.com/pytorch/pytorch/pull/49124
Test Plan:
```
buck test //caffe2/test/distributed/elastic/utils:distributed_test
```
Reviewed By: cbalioglu
Differential Revision: D29334947
fbshipit-source-id: 76b998c59082cb04c0e86b7a1f3b509367fa0136
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53172
Pull Request resolved: https://github.com/pytorch/elastic/pull/141
Upstreams two modules to torch:
1. `torchelastic.rendezvous`
2. `torchelastic.utils`
These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic.
==== NOTES: ====
1. I'm disabling etcd_rendezvous and etcd_server tests in CIRCLECI for the moment since I need to edit the test dockers to contain the etcd server binary (there's 4-5 test dockers - one for each platform so this is going to take some time for me to set up the environments and test) - T85992919.
2. I've fixed all lint errors on python files but there are ones on the cpp files on the ZeusRendezvous. I took a look at them, and I don't want to fix the linter errors right now for 2 major reasons:
1. Some of them are more than formatting changes (e.g. std::move vs pass by value) and I don't want to introduce bundled changes with the move
1. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems and I think its better for us to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic -T86012579.
Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/data/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/fb/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
```
\+ Sandcastle
Reviewed By: H-Huang
Differential Revision: D26718746
fbshipit-source-id: 67cc0350c3d847221cb3c3038f98f47915362f51