Commit Graph

9 Commits

Author SHA1 Message Date
Jane Xu
a23814577b Overload TestCase not vanilla TestCase for some elastic tests (#67700)
Summary:
Addresses a bit of https://github.com/pytorch/pytorch/issues/66903

Fixes it so that https://github.com/pytorch/pytorch/issues/66207 can be properly disabled

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67700

Reviewed By: H-Huang

Differential Revision: D32116908

Pulled By: janeyx99

fbshipit-source-id: 205ff68a7408609cfced2357fd99f41949ef6390
2021-11-03 11:14:52 -07:00
Jane Xu
eb8b80b76f Add test owners for elastic tests (#67293)
Summary:
Action following discussion with distributed and r2p team--the tests under elastic in distributed should be owned by oncall: r2p and not distributed.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67293

Reviewed By: jbschlosser

Differential Revision: D31973779

Pulled By: janeyx99

fbshipit-source-id: 05875a7600c6eb1da1310a48e1e32a1a69461c55
2021-10-28 08:32:50 -07:00
Howard Huang
d7ac6e977a Fix test_create_store_multi flaky test (#66953)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66953

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: kiukchung

Differential Revision: D31802767

Pulled By: H-Huang

fbshipit-source-id: a430e242788aac164496d4e65b85bf326537d019
2021-10-26 11:08:51 -07:00
Aliaksandr Ivanou
018e06edca [torchelastic] Skip tests in tsan mode (#67103)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67103

Skip tests in tsan mode for now. More info: T104010063

Test Plan: sandcastle + running tests in mode/dev-tsan

Reviewed By: d4l3k

Differential Revision: D31861426

fbshipit-source-id: d50e5d06afbc82ccce6d102e52f72b5b01f6f41a
2021-10-22 15:55:18 -07:00
Howard Huang
a95fabfecb Fix port allocation race condition for elastic test (#65149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65149

Fixes #64789

There is a race condition between when the free port is acquired to when it is used to create the store in which it may have been used. Since this test only tests that timeout is triggered for tcpstore, we can bind to any port on tcpstore creation.

This only affects the test on the server (since that is where the port is used), but I changed both tests for clarity

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang cbalioglu gcramer23

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30993166

Pulled By: H-Huang

fbshipit-source-id: eac4f28d641ac87c4ebee89df83f90955144f2f1
2021-09-17 08:32:47 -07:00
Pritam Damania
82d81455ae [2/N] Remove unittest.skip across all of torch.distributed. (#61887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887

1) Introduced a `sandcastle_skip_if` decorator that ensures these
tests just get passed on sandcastle.
2) Fixed all test files under `test/distributed` to not use `unittest.skip`

Overall goal is to avoid using skips since sandcastle tags these tests as
continuously skipping.
ghstack-source-id: 134382237

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D29784152

fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d
2021-07-27 10:53:23 -07:00
Kiuk Chung
5a2f41a2db [torch/distributed.elastic] Fix utils.distributed_test.test_create_store_timeout_on_server to be dual-stack ip compatible (#60558)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60558

Fixes 1/2 flaky tests as described in: https://github.com/pytorch/pytorch/issues/60260

`test_create_store_timeout_on_server` tests whether trying to create a `c10d::TCPStore` server on an already taken port actually fails with an `IOError`. Prior to this change the `utils.get_socket_with_port()` util method was used to synthetically reserve a port, then try creating the `TCPStore` on that port to validate the `IOError`. The issue with this is that on a dual stack ip setup, `get_socket_with_port()` (since it uses `socket.AF_UNSPEC`) reserves an ipv6 port, while `TCPStore` will try binding to an ipv4 port, so an `IOError` is not observed.

Changing the logic of the test to create two `TCPStore` servers. The first chooses a free port (by passing `server_port=0`) while the second tries to create a `TCPStore` server on the port that the first store is already running on. This would induce an `IOError` on the second store's constructor.

NOTE: this change does not solve another broader issue with `TCPStore` where the server and workers can listen and connect on ipv4 vs ipv6 when they are running on dual-stak ip hosts without ipv4 DNS entry and/or a `/etc/gai.conf` specifying the preferred bind ordering. See: https://github.com/pytorch/pytorch/pull/49124

Test Plan:
```
buck test //caffe2/test/distributed/elastic/utils:distributed_test
```

Reviewed By: cbalioglu

Differential Revision: D29334947

fbshipit-source-id: 76b998c59082cb04c0e86b7a1f3b509367fa0136
2021-06-23 17:12:18 -07:00
Aliaksandr Ivanou
6ff0002b12 Pytorch: enable many torchelastic tests (#56970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56970

The diff enables metrics, events, utils and timer tests on ci/cd pipeline

Test Plan: ci/cd

Reviewed By: cbalioglu

Differential Revision: D28015200

fbshipit-source-id: 6b419aaf9e62a10a747b6511bff90c82cfb7bcd6
2021-04-28 17:05:09 -07:00
Kiuk Chung
ba75cedfc5 [1/n][torch/elastic][upstream] Move torchelastic/rendezvous to torch/distributed/rendezvous (#53172)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53172

Pull Request resolved: https://github.com/pytorch/elastic/pull/141

Upstreams two modules to torch:

1. `torchelastic.rendezvous`
2. `torchelastic.utils`

These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic.

==== NOTES: ====
1. I'm disabling etcd_rendezvous and etcd_server tests in CIRCLECI for the moment since I need to edit the test dockers to contain the etcd server binary (there's 4-5 test dockers - one for each platform so this is going to take some time for me to set up the environments and test) - T85992919.

2. I've fixed all lint errors on python files but there are ones on the cpp files on the ZeusRendezvous. I took a look at them, and I don't want to fix the linter errors right now for 2 major reasons:
     1. Some of them are more than formatting changes (e.g. std::move vs pass by value) and I don't want to introduce bundled changes with the move
     1. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems and I think its better for us to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic -T86012579.

Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/data/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/fb/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
```
\+ Sandcastle

Reviewed By: H-Huang

Differential Revision: D26718746

fbshipit-source-id: 67cc0350c3d847221cb3c3038f98f47915362f51
2021-03-05 11:27:57 -08:00