Commit Graph

12 Commits

Author SHA1 Message Date
Ryan Chun
96fb63bc75 [Bootcamp] Set default value of TCPStore world_size to None in pybind definition (#77277)
- Set the default value of TCPStore optional world_size arg to None
- Updated the corresponding test case

Fixes #74488
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77277
Approved by: https://github.com/H-Huang
2022-05-12 18:48:48 +00:00
Tristan Rice
5b915e844c c10d: retry dns lookup failures (#74641)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74641

This makes dns hostname lookup failures retryable since in some environments such as Kubernetes they're not guaranteed to be resolvable until the job starts. Retrying this eliminates the race condition.

This also fixes `sandcastle_skip_if` when used on the class instead of the method. Previously they wouldn't inherit from TestCase so just wouldn't run under buck at all.

Fixes https://github.com/pytorch/pytorch/issues/73682

Test Plan:
Added a unit test

```
buck test //caffe2/test/distributed:test_store
```

Reviewed By: aivanou

Differential Revision: D35092284

fbshipit-source-id: d40bf187e52c41f551e4fe41c536b2b0015588ee
(cherry picked from commit f8908309d8ee64c25ee466a6b4922f34f2b7618e)
2022-03-24 19:51:09 +00:00
Xiang Gao
6e16c9bb1d Add support for deleteKey for FileStore (#69953)
Summary:
torch_ucc uses `deleteKey`, and trying to run PyTorch tests with torch_ucc leads to failure about `deleteKey not implemented for FileStore`.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69953

Reviewed By: ngimel

Differential Revision: D33458457

Pulled By: H-Huang

fbshipit-source-id: f46afd59f950722ae594d9aafb8843f14019e930
2022-01-07 06:20:59 -08:00
Can Balioglu
6e640a0acf Revise the socket implementation of c10d (#68226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68226

**Note that this PR is unusually big due to the urgency of the changes. Please reach out to me in case you wish to have a "pair" review.**

This PR introduces a major refactoring of the socket implementation of the C10d library. A big portion of the logic is now contained in the `Socket` class and a follow-up PR will further consolidate the remaining parts. As of today the changes in this PR offer:

 - significantly better error handling and much more verbose logging (see the example output below)
 - explicit support for IPv6 and dual-stack sockets
 - correct handling of signal interrupts
 - better Windows support

A follow-up PR will consolidate `send`/`recv` logic into `Socket` and fully migrate to non-blocking sockets.

## Example Output

```
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[W logging.h:28] The server socket on [localhost]:29501 is not yet listening (Error: 111 - Connection refused), retrying...
[I logging.h:21] The server socket will attempt to listen on an IPv6 address.
[I logging.h:21] The server socket is attempting to listen on [::]:29501.
[I logging.h:21] The server socket has started to listen on [::]:29501.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42650.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42650.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42722.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42722.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42724.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42724.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42726.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42726.
```
ghstack-source-id: 143501987

Test Plan: Run existing unit and integration tests on devserver, Fedora, Ubuntu, macOS Big Sur, Windows 10.

Reviewed By: Babar, wilson100hong, mrshenli

Differential Revision: D32372333

fbshipit-source-id: 2204ffa28ed0d3683a9cb3ebe1ea8d92a831325a
2021-11-16 20:49:25 -08:00
Howard Huang
13a69d23b1 Add retry logic for test_multitenancy and documentation for find_free_port (#67775)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67775

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D32142749

Pulled By: H-Huang

fbshipit-source-id: 67ab4ede4f4bff96a1ffd41d55b3be0edc82b1ce
2021-11-05 09:05:12 -07:00
Howard Huang
9e71ea292d Fix test_init_pg_and_rpc_with_same_socket by retrying on addr in use error (#67638)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67638

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D32074698

Pulled By: H-Huang

fbshipit-source-id: 6b980fcdac4b0f1edfe086d0deb99be371a73900
2021-11-02 09:42:47 -07:00
Jane Xu
34051d74da Add test owner to distributed files starting with test_ (#66797)
Summary:
Action based on https://github.com/pytorch/pytorch/issues/66232

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66797

Reviewed By: gchanan

Differential Revision: D31761389

Pulled By: janeyx99

fbshipit-source-id: c27c9ab4acec1eb71d5edd4538cd113b770dfc6c
2021-10-19 10:55:20 -07:00
Howard Huang
dc1bd6acee Remove PROCESS GROUP rpc backend (#62411)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62411

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29990408

Pulled By: H-Huang

fbshipit-source-id: 183d3b316767b12993cebbe32b73c2850fd1cc42
2021-08-02 12:26:22 -07:00
Howard Huang
43274ca145 test_store multiworker remove multiprocessing (#59599)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59599

This will fix the flakiness for these tests internally when running under TSAN. We don't need multiprocessing since we should restrict the testing to the `wait_for_workers` and `world_size` parameters of the tcp store master store.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28947838

Pulled By: H-Huang

fbshipit-source-id: d3e3904aa7ac81ae4c744a193a3b7167c2227bc8
2021-06-08 14:38:42 -07:00
Can Balioglu
1d9c1cc00a [4/n] [c10d] Introduce the multi-tenancy feature in TCPStore (#58331)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58331

This PR is the final part of a stack that addresses the GitHub issue #41614; it introduces the multi-tenancy feature to the `TCPStore` class allowing two server stores to be instantiated with the same host:port pair.
ghstack-source-id: 130676394

Test Plan:
- Run the existing and newly-introduced tests.
- Run several smoke tests including the short code snippet referred in GitHub issue #41614.

Reviewed By: H-Huang

Differential Revision: D28453850

fbshipit-source-id: f9066b164305de0f8c257e9d5736e93fd7e21ec6
2021-06-05 07:50:07 -07:00
Andrew Gu
bbf7eceaf0 Refactor c10d and dist aliases for torch.distributed (#59456)
Summary:
**Overview:**
This consolidates `c10d` and `dist` to only `dist` as the alias for `torch.distributed` in `test_store.py`. Both aliases were used most likely due to incremental additions to the test file and not intentional.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59456

Test Plan:
```
python test/distributed/test_store.py
```

Reviewed By: agolynski

Differential Revision: D28910169

Pulled By: andwgu

fbshipit-source-id: f830dead29e9de48aaf2845dfa5861c9cccec15d
2021-06-04 16:07:44 -07:00
Andrew Gu
2ad4b8e58c Extract c10d Store tests to dedicated test file (#59271)
Summary:
Partially addresses https://github.com/pytorch/pytorch/issues/55340

**Overview**
This factors out `FileStoreTest`, `HashStoreTest`, `PrefixFileStoreTest`, `TCPStoreTest`, `PrefixTCPStoreTest`, `PythonStoreTest`, `RendezvousTest`, `RendezvousEnvTest`, `RendezvousFileTest`, and `RendezvousTCPTest` from `test_c10d_common.py` to a new file `test_store.py`.

Additionally, unused import/initialization statements are removed from `test_c10d_common.py`, and the minimal set of import/initialization statements are used for `test_store.py`.

Also, this changes `.jenkins/pytorch/multigpu-test.sh`, `.jenkins/pytorch/win-test-helpers/test_distributed.bat`, and `test/run_test.py` to include the new `test_store.py`.

**Testing**
All commands shown are run on an AI AWS cluster.

I check the Store tests:
```
python test/distributed/test_store.py
```

I also check `test_c10d_common.py` since it is the source of the refactored code. In addition, I check `test_c10d_nccl.py` and `test_c10d_gloo.py` since they import from `test_c10d_common.py`; those two should be the only test files depending on `test_c10d_common.py`.
```
python test/distributed/test_c10d_common.py
python test/distributed/test_c10d_nccl.py
python test/distributed/test_c10d_gloo.py
```
`test_c10d_gloo.py` produces warnings about how using sparse tensors in TorchScript is experimental, but the warnings do not result from this PR's changes.

**Testing Issues** (To Be Revisited)
```
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py
```
Running the above command fails three tests (written as `[Test]`: `[Error]`):
- `ProcessGroupGlooWrapperTest.test_collective_hang`: `RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.200.24.101]:15580`
- `CommTest.test_broadcast_coalesced_gloo_cuda`: `RuntimeError: cuda runtime error (3) : initialization error at ../aten/src/THC/THCGeneral.cpp:54`
- `CommTest.test_sequence_num_incremented_gloo_default`: `RuntimeError: cuda runtime error (3) : initialization error at ../aten/src/THC/THCGeneral.cpp:54`
However, running each of the following yields no errors:
```
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py -k test_collective_hang
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py -k test_broadcast_coalesced_gloo_cuda
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py -k test_sequence_num_incremented_gloo_default
```
This suggests the existence of some inadvertent state dependency between tests (e.g. improper cleanup). I have not explored this further yet. In particular, I do not have a solid understanding of the tests to be able to explain why using `pytest` and `gpurun` induces the failure (since notably, running the `.py` directly shows no issue).

Similarly, running the following yields 47 errors:
```
WORLD_SIZE=4 BACKEND=nccl gpurun pytest test/distributed/test_c10d_nccl.py
```
The errors seem to all be simply complaining about the usage of `fork()` instead of `spawn()` for CUDA multiprocessing. Though, most of the tests in `test_c10d_nccl.py` ask for at least 2 CUDA devices, so I think that the `gpurun` is warranted (assuming that the test file does not need to be run partially on different machines).

Both `test_c10d_common.py` and `test_store.py` work fine with `pytest`.

**Other Notes**
I noticed that `torch.distributed` is imported both as `dist` and as `c10d` and that `c10d` is used throughout the Store tests. I was curious if this is intentional (as opposed to using `dist` to refer to `torch.distributed`). Also, the original [issue](https://github.com/pytorch/pytorch/issues/55340) suggests that the Store tests do not use multiprocessing, but I saw that `torch.multiprocessing` is still used in `TCPStoreTest`.

The links for the Store files in the `CONTRIBUTING.md` [file](https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md) are broken. This can fixed in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59271

Reviewed By: jbschlosser, mrshenli

Differential Revision: D28856920

Pulled By: andwgu

fbshipit-source-id: 630950cba18d34e6b5de661f5a748f2cddc1b446
2021-06-03 10:53:33 -07:00