Commit Graph

15 Commits

nariaki3551
6d6e77eb6b Fix some links in torch/distributed/CONTRIBUTING.md (#79855)
Fix some invalid links in torch/distributed/CONTRIBUTING.md

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79855
Approved by: https://github.com/H-Huang
2022-06-21 00:48:30 +00:00
Howard Huang
24b7142d7a Update distributed/CONTRIBUTING.md to remove ProcessGroupAgent references and add test instructions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78625

Approved by: https://github.com/mrshenli, https://github.com/albanD
2022-06-01 21:31:12 +00:00
Rohan Varma
bd8feb33d4 Update distributed contributing guide to show how to run one test in test_distributed_spawn (#67801)
Summary:
Running one test in test_distributed_spawn is a bit confusing but possible. This adds documentation for it to CONTRIBUTING.md.
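
For reference, a minimal sketch of the kind of single-test invocation being documented (the backend/world-size values, the test class `TestDistBackendWithSpawn`, and the test name are illustrative assumptions; the exact command and any additional environment variables live in CONTRIBUTING.md):
```
# Illustrative sketch only; the test class/name and env values are assumptions.
BACKEND=gloo WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_reduce_sum
```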

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67801

Reviewed By: mrshenli

Differential Revision: D32157700

Pulled By: rohan-varma

fbshipit-source-id: a1d10f2fb5f169b46c6d15149bf949082d9bd200
2021-11-04 08:54:31 -07:00
Pritam Damania
2d671ca41b [8/N] Remove c10d/ddp fork tests. (#63454)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63454

Continuation of https://github.com/pytorch/pytorch/pull/63443, this
PR removes all fork tests from torch.distributed.
ghstack-source-id: 136285511

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D30387872

fbshipit-source-id: f6d6313db126ae7b95b86f78a1e0726887c5c513
2021-08-20 12:23:18 -07:00
Howard Huang
aa5141f204 Update CONTRIBUTING.md to remove ProcessGroupAgent (#63160)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63160

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30284439

Pulled By: H-Huang

fbshipit-source-id: 53c31b6917ef5e2125e146fb0ed73ae3d76a8cf9
2021-08-12 12:26:12 -07:00
Rohan Varma
c2098487e8 [c10d] Move pg wrapper tests to their own file. (#59840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59840

Moves these tests to their own standalone file. No meaningful code changes.
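
The relocated tests can then be run on their own, roughly as follows (the new file name is an assumption based on the PR title):
```
# Assumed file name for the standalone pg wrapper tests.
python test/distributed/test_pg_wrapper.py
```
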
ghstack-source-id: 131359162

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D29012664

fbshipit-source-id: 348870016509a6ed7e69240fa82bccef4a12d674
2021-06-14 15:05:55 -07:00
Andrew Gu
6d51a89778 Fix broken hyperlinks (#59425)
Summary:
**Overview:**
A number of the hyperlinks in the [`CONTRIBUTING.md` file](https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md) are broken since they include an extraneous `/torch/`. This PR fixes those links.

The files whose links are broken are
- `ProcessGroupNCCL.hpp`
- `Store.hpp`
- `FileStore.hpp`
- `TCPStore.hpp`
- `PrefixStore.hpp`
- `rref_impl.h`
- `rref_context.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59425

Test Plan:
The `CONTRIBUTING.md` file is at https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md.

`ProcessGroupNCCL.hpp` should have link https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroupNCCL.hpp, which is equivalent to `../lib/c10d/ProcessGroupNCCL.hpp`.

`Store.hpp` should have link https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/Store.hpp, which is equivalent to `../lib/c10d/Store.hpp`.

`FileStore.hpp` should have link https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/FileStore.hpp, which is equivalent to `../lib/c10d/FileStore.hpp`.

`PrefixStore.hpp` should have link https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/PrefixStore.hpp, which is equivalent to `../lib/c10d/PrefixStore.hpp`.

`rref_interface.h` should have link https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/core/rref_interface.h, which is equivalent to `../../aten/src/ATen/core/rref_interface.h`.

`rref_context.h` should have link https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/rpc/rref_context.h, which is equivalent to `../csrc/distributed/rpc/rref_context.h`.

Reviewed By: mruberry

Differential Revision: D28888188

Pulled By: andwgu

fbshipit-source-id: 023219184d42284ea1cbfcf519c1b4277dd5a02b
2021-06-04 08:27:26 -07:00
Andrew Gu
2ad4b8e58c Extract c10d Store tests to dedicated test file (#59271)
Summary:
Partially addresses https://github.com/pytorch/pytorch/issues/55340

**Overview**
This factors out `FileStoreTest`, `HashStoreTest`, `PrefixFileStoreTest`, `TCPStoreTest`, `PrefixTCPStoreTest`, `PythonStoreTest`, `RendezvousTest`, `RendezvousEnvTest`, `RendezvousFileTest`, and `RendezvousTCPTest` from `test_c10d_common.py` to a new file `test_store.py`.

Additionally, unused import/initialization statements are removed from `test_c10d_common.py`, and the minimal set of import/initialization statements is used for `test_store.py`.

Also, this changes `.jenkins/pytorch/multigpu-test.sh`, `.jenkins/pytorch/win-test-helpers/test_distributed.bat`, and `test/run_test.py` to include the new `test_store.py`.
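
As a quick check that the `test/run_test.py` registration is picked up, something like the following should work (the `-i`/`--include` selector is assumed to accept the new module path):
```
# Assumes run_test.py's include selector accepts the distributed/test_store path.
python test/run_test.py -i distributed/test_store
```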

**Testing**
All commands shown are run on an AI AWS cluster.

I check the Store tests:
```
python test/distributed/test_store.py
```

I also check `test_c10d_common.py` since it is the source of the refactored code. In addition, I check `test_c10d_nccl.py` and `test_c10d_gloo.py` since they import from `test_c10d_common.py`; those two should be the only test files depending on `test_c10d_common.py`.
```
python test/distributed/test_c10d_common.py
python test/distributed/test_c10d_nccl.py
python test/distributed/test_c10d_gloo.py
```
`test_c10d_gloo.py` produces warnings about how using sparse tensors in TorchScript is experimental, but the warnings do not result from this PR's changes.

**Testing Issues** (To Be Revisited)
```
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py
```
Running the above command causes three tests to fail (listed as `[Test]`: `[Error]`):
- `ProcessGroupGlooWrapperTest.test_collective_hang`: `RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.200.24.101]:15580`
- `CommTest.test_broadcast_coalesced_gloo_cuda`: `RuntimeError: cuda runtime error (3) : initialization error at ../aten/src/THC/THCGeneral.cpp:54`
- `CommTest.test_sequence_num_incremented_gloo_default`: `RuntimeError: cuda runtime error (3) : initialization error at ../aten/src/THC/THCGeneral.cpp:54`
However, running each of the following yields no errors:
```
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py -k test_collective_hang
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py -k test_broadcast_coalesced_gloo_cuda
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py -k test_sequence_num_incremented_gloo_default
```
This suggests some inadvertent state dependency between tests (e.g. improper cleanup). I have not explored this further yet; in particular, I do not understand the tests well enough to explain why using `pytest` and `gpurun` induces the failure (notably, running the `.py` directly shows no issue).

Similarly, running the following yields 47 errors:
```
WORLD_SIZE=4 BACKEND=nccl gpurun pytest test/distributed/test_c10d_nccl.py
```
The errors all seem to complain about using `fork()` instead of `spawn()` for CUDA multiprocessing. However, most of the tests in `test_c10d_nccl.py` require at least 2 CUDA devices, so `gpurun` seems warranted (assuming the test file does not need to be split across different machines).

Both `test_c10d_common.py` and `test_store.py` work fine with `pytest`.

**Other Notes**
I noticed that `torch.distributed` is imported both as `dist` and as `c10d` and that `c10d` is used throughout the Store tests. I was curious if this is intentional (as opposed to using `dist` to refer to `torch.distributed`). Also, the original [issue](https://github.com/pytorch/pytorch/issues/55340) suggests that the Store tests do not use multiprocessing, but I saw that `torch.multiprocessing` is still used in `TCPStoreTest`.

The links for the Store files in the `CONTRIBUTING.md` [file](https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md) are broken. This can be fixed in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59271

Reviewed By: jbschlosser, mrshenli

Differential Revision: D28856920

Pulled By: andwgu

fbshipit-source-id: 630950cba18d34e6b5de661f5a748f2cddc1b446
2021-06-03 10:53:33 -07:00
Pavel Belevich
5cc75e46fa Split test_c10d.py to test_c10d_common.py, test_c10d_gloo.py, test_c10d_nccl.py (#56598)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56598

Test Plan: NA

Reviewed By: SciPioneer

Differential Revision: D27913170

fbshipit-source-id: 3439d18141131b02d55f2ca399a4c795cba2b04b
2021-04-21 22:10:41 -07:00
Luca Wehrstedt
3f8d476857 Split out CUDA RPC tests (#55695)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55695

The goal is to be able to run CUDA tests on their own (e.g., to avoid running CPU tests on GPU machines).

Done by moving test methods to a separate class (and sometimes introducing a "common" base class for utils), and then providing new entry points inside a `cuda/` subdirectory.
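
One of the new CUDA-only entry points can then be run directly, roughly like this (the exact file name under `cuda/` is an assumption):
```
# Assumed path of a CUDA-only RPC test entry point under the new cuda/ subdirectory.
python test/distributed/rpc/cuda/test_tensorpipe_agent.py
```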

Test Plan: Checked they are run on Sandcastle.

Reviewed By: mrshenli

Differential Revision: D27618198

fbshipit-source-id: 8f671657f79c8ae115748ab7752fe0066705893b
2021-04-12 07:48:08 -07:00
Rohan Varma
7e10fbfb71 Add note about TCP init in RPC tests to contributing doc. (#50861)
Summary:
We added the option to use TCP initialization for RPC tests in https://github.com/pytorch/pytorch/pull/48248, but it would be good to document it somewhere as well, so this adds it to the contributing doc.
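
A hedged sketch of how the option is typically toggled (the environment-variable name and test file path are assumptions; the contributing doc has the authoritative command):
```
# Assumption: RPC tests switch to TCP initialization via this env var.
RPC_INIT_WITH_TCP=1 python test/distributed/rpc/test_tensorpipe_agent.py
```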

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50861

Reviewed By: mrshenli

Differential Revision: D26014505

Pulled By: rohan-varma

fbshipit-source-id: c1321679f01dd52038131ff571362ad36884510a
2021-01-22 13:28:03 -08:00
Rohan Varma
f0217e2f52 Fix link in distributed contributing doc and add link (#49141)
Summary:
One of the links for ramp-up tasks wasn't showing any results, and the other showed only RPC results. I replaced them with a single link that uses `pt_distributed_rampup`, which seems reasonable since developers will be able to see both RPC and distributed tasks.

Also added a test command for DDP tests.
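
A hedged sketch of such a DDP test command (the backend and world size are illustrative; see CONTRIBUTING.md for the exact invocation and any extra environment variables):
```
# Illustrative values only; runs the spawn-based distributed/DDP suite.
BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py
```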

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49141

Reviewed By: ezyang

Differential Revision: D25597560

Pulled By: rohan-varma

fbshipit-source-id: 85d7d2964a19ea69fe149c017cf88dff835b164a
2020-12-16 14:38:56 -08:00
Yanan Cao
5c4bd9a38f Move python-independent c10d implementations to torch/lib (#47309)
Summary:
* This is a preliminary step toward building c10d into libtorch
* Includes a minor cleanup in c10d/CMakeLists.txt

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47309

Reviewed By: wanchaol

Differential Revision: D24711768

Pulled By: gmagogsfm

fbshipit-source-id: 6f9e0a6a73c30f5ac7dafde9082efcc4b725dde1
2020-11-03 23:39:54 -08:00
Pritam Damania
dbf17a1d4c Fixing a few links in distributed CONTRIBUTING.md (#44753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44753

ghstack-source-id: 112132781

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D23719077

fbshipit-source-id: 3d943dfde100d175f417554fc7fca1fdb295129f
2020-09-16 10:14:19 -07:00
Pritam Damania
a2a81e1335 Add a CONTRIBUTING.md for the distributed package. (#44224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44224

The purpose of this file is to help developers working on PT Distributed get up to speed on its code structure and layout.
ghstack-source-id: 111644842

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D23548377

fbshipit-source-id: 561d5b8e257642de172def8fdcc1311fae20690b
2020-09-10 14:58:00 -07:00