Summary:
Running a single test in `test_distributed_spawn` is a bit confusing but possible. This adds documentation for it to CONTRIBUTING.md.
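For reference, the kind of invocation being documented looks roughly like this (a sketch only; the backend, world size, and test name are illustrative placeholders rather than text quoted from the added docs):
```
# Sketch only -- the exact environment variables and test name may differ from the docs.
TEMP_DIR="/tmp" BACKEND="gloo" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -- TestDistBackendWithSpawn.test_broadcast
```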
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67801
Reviewed By: mrshenli
Differential Revision: D32157700
Pulled By: rohan-varma
fbshipit-source-id: a1d10f2fb5f169b46c6d15149bf949082d9bd200
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59840
Moves these tests into their own standalone file. No meaningful code changes.
ghstack-source-id: 131359162
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D29012664
fbshipit-source-id: 348870016509a6ed7e69240fa82bccef4a12d674
Summary:
Partially addresses https://github.com/pytorch/pytorch/issues/55340
**Overview**
This factors out `FileStoreTest`, `HashStoreTest`, `PrefixFileStoreTest`, `TCPStoreTest`, `PrefixTCPStoreTest`, `PythonStoreTest`, `RendezvousTest`, `RendezvousEnvTest`, `RendezvousFileTest`, and `RendezvousTCPTest` from `test_c10d_common.py` to a new file `test_store.py`.
Additionally, unused import/initialization statements are removed from `test_c10d_common.py`, and the minimal set of import/initialization statements is used for `test_store.py`.
Also, this changes `.jenkins/pytorch/multigpu-test.sh`, `.jenkins/pytorch/win-test-helpers/test_distributed.bat`, and `test/run_test.py` to include the new `test_store.py`.
**Testing**
All commands shown are run on an AI AWS cluster.
I check the Store tests:
```
python test/distributed/test_store.py
```
I also check `test_c10d_common.py` since it is the source of the refactored code. In addition, I check `test_c10d_nccl.py` and `test_c10d_gloo.py` since they import from `test_c10d_common.py`; those two should be the only test files depending on `test_c10d_common.py`.
```
python test/distributed/test_c10d_common.py
python test/distributed/test_c10d_nccl.py
python test/distributed/test_c10d_gloo.py
```
`test_c10d_gloo.py` produces warnings about how using sparse tensors in TorchScript is experimental, but the warnings do not result from this PR's changes.
**Testing Issues** (To Be Revisited)
```
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py
```
Running the above command causes three tests to fail (listed as `[Test]`: `[Error]`):
- `ProcessGroupGlooWrapperTest.test_collective_hang`: `RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.200.24.101]:15580`
- `CommTest.test_broadcast_coalesced_gloo_cuda`: `RuntimeError: cuda runtime error (3) : initialization error at ../aten/src/THC/THCGeneral.cpp:54`
- `CommTest.test_sequence_num_incremented_gloo_default`: `RuntimeError: cuda runtime error (3) : initialization error at ../aten/src/THC/THCGeneral.cpp:54`
However, running each of the following yields no errors:
```
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py -k test_collective_hang
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py -k test_broadcast_coalesced_gloo_cuda
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py -k test_sequence_num_incremented_gloo_default
```
This suggests some inadvertent state dependency between tests (e.g. improper cleanup). I have not explored this further yet; in particular, I do not understand the tests well enough to explain why running them under `pytest` and `gpurun` induces the failures, when running the `.py` file directly shows no issue.
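One way to narrow this down (not attempted here; the command below is only a sketch) would be to run just the affected tests together under `pytest` and progressively widen the `-k` selection until the failure reproduces:
```
# Sketch of a bisection run: start with only the failing tests selected,
# then widen the -k expression until the cross-test failure reappears.
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py \
  -k "test_collective_hang or test_broadcast_coalesced_gloo_cuda or test_sequence_num_incremented_gloo_default"
```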
Similarly, running the following yields 47 errors:
```
WORLD_SIZE=4 BACKEND=nccl gpurun pytest test/distributed/test_c10d_nccl.py
```
The errors all seem to complain about using `fork()` instead of `spawn()` for CUDA multiprocessing. That said, most of the tests in `test_c10d_nccl.py` require at least 2 CUDA devices, so I think `gpurun` is warranted (assuming the test file does not need to be run partially on different machines).
Both `test_c10d_common.py` and `test_store.py` work fine with `pytest`.
**Other Notes**
I noticed that `torch.distributed` is imported both as `dist` and as `c10d` and that `c10d` is used throughout the Store tests. I was curious whether this is intentional (as opposed to consistently using `dist` to refer to `torch.distributed`). Also, the original [issue](https://github.com/pytorch/pytorch/issues/55340) suggests that the Store tests do not use multiprocessing, but I saw that `torch.multiprocessing` is still used in `TCPStoreTest`.
The links for the Store files in the `CONTRIBUTING.md` [file](https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md) are broken. This can be fixed in a separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59271
Reviewed By: jbschlosser, mrshenli
Differential Revision: D28856920
Pulled By: andwgu
fbshipit-source-id: 630950cba18d34e6b5de661f5a748f2cddc1b446
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55695
This makes it possible to run CUDA tests on their own (e.g., to avoid running CPU tests on GPU machines).
It is done by moving test methods to a separate class (sometimes introducing a "common" base class for shared utilities) and then providing new entry points inside a `cuda/` subdirectory.
Test Plan: Checked they are run on Sandcastle.
Reviewed By: mrshenli
Differential Revision: D27618198
fbshipit-source-id: 8f671657f79c8ae115748ab7752fe0066705893b
Summary:
We added this option in https://github.com/pytorch/pytorch/pull/48248, but it should also be documented somewhere, so this adds it to the contributing doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50861
Reviewed By: mrshenli
Differential Revision: D26014505
Pulled By: rohan-varma
fbshipit-source-id: c1321679f01dd52038131ff571362ad36884510a
Summary:
One of the links for ramp-up tasks wasn't showing any results, and the other showed only RPC results. I replaced them with a single link that uses `pt_distributed_rampup`, which seems reasonable since developers will be able to see both RPC and distributed tasks.
Also added a test command for DDP tests.
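For context, the added DDP test command is along these lines (a sketch; the backend, world size, and test file are illustrative and not quoted from the doc):
```
# Illustrative sketch; the actual documented command may differ.
BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py
```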
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49141
Reviewed By: ezyang
Differential Revision: D25597560
Pulled By: rohan-varma
fbshipit-source-id: 85d7d2964a19ea69fe149c017cf88dff835b164a
Summary:
* This is a preliminary step toward building c10d into libtorch
* Includes a minor cleanup in c10d/CMakeLists.txt
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47309
Reviewed By: wanchaol
Differential Revision: D24711768
Pulled By: gmagogsfm
fbshipit-source-id: 6f9e0a6a73c30f5ac7dafde9082efcc4b725dde1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44224
The purpose of this file is to help developers on PT Distributed get
up to speed on the code structure and layout for PT Distributed.
ghstack-source-id: 111644842
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D23548377
fbshipit-source-id: 561d5b8e257642de172def8fdcc1311fae20690b