Summary:
MultiProcessTestCase will be useful for both c10d and rpc tests, so this diff extracts that class and some common decorators into a separate file.
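For illustration, a minimal sketch of how a test might subclass the extracted class (the import path and the `_spawn_processes`/`rank` helpers are assumptions based on later versions of the file):
```python
from torch.testing._internal.common_distributed import MultiProcessTestCase

class MyDistributedTest(MultiProcessTestCase):
    @property
    def world_size(self):
        return 2

    def setUp(self):
        super().setUp()
        self._spawn_processes()  # run each test method once per rank

    def test_ranks(self):
        # executes in every spawned process; self.rank identifies it
        self.assertTrue(0 <= self.rank < self.world_size)
```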
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23660
Reviewed By: pietern
Differential Revision: D16602865
Pulled By: mrshenli
fbshipit-source-id: 85ad47dfb8ba187b7debeb3edeea5df08ef690c7
Summary:
With this change you can now list multiple interfaces separated by
commas. ProcessGroupGloo creates a single Gloo context for every device
in the list (a context represents a connection to every other
rank). For every collective that is called, it will select the context
in a round robin fashion. The number of worker threads responsible for
executing the collectives is set to be twice the number of devices.
If you have a single physical interface, and wish to employ increased
parallelism, you can also specify
`GLOO_SOCKET_IFNAME=eth0,eth0,eth0,eth0`. This makes ProcessGroupGloo
use 4 connections per rank, 4 I/O threads, and 8 worker threads
responsible for executing the collectives.
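A sketch of this from Python (the interface names and rendezvous address are illustrative):
```python
import os
import torch.distributed as dist

# One physical interface listed twice -> two Gloo contexts per rank,
# selected round robin per collective, and four worker threads.
os.environ["GLOO_SOCKET_IFNAME"] = "eth0,eth0"
dist.init_process_group("gloo", init_method="tcp://10.0.0.1:23456",
                        rank=0, world_size=2)
```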
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22978
ghstack-source-id: 87006270
Differential Revision: D16339962
fbshipit-source-id: 9aa1dc93d8e131c1714db349b0cbe57e9e7266f1
Summary:
The CMake modifications include removal of some unnecessary paths
(e.g. find_package(CUDA) and friends) that are no longer used since
c10d is always part of the larger torch build. The macro
`C10D_USE_...` was ambiguous and is now removed in favor of only
having top level `USE_...`. The c10d test suite is changed to include
skip annotations for the tests that depend on Gloo as well.
Now, if you compile with `USE_DISTRIBUTED=1` and `USE_GLOO=0` you get
a functioning build for which the tests actually pass.
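A hedged sketch of such a skip annotation (assuming `torch.distributed.is_gloo_available()`, which exists in current releases):
```python
import unittest
import torch.distributed as dist

def skip_if_no_gloo(func):
    # Skip tests that depend on Gloo when built with USE_GLOO=0.
    return unittest.skipUnless(dist.is_gloo_available(),
                               "c10d was built without Gloo")(func)
```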
Closes https://github.com/pytorch/pytorch/issues/18851.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22257
Differential Revision: D16087993
Pulled By: pietern
fbshipit-source-id: 0cea66bd5cbd9736b06fa1d45ee13a18cab88adb
Summary:
Reduction of gradients for unused parameters should happen as soon as
possible, because they potentially block reduction of gradients for
used parameters. This used to happen instantly when
`prepare_for_backward` was called and it found parameters that didn't
contribute. This meant that if you have a model with unused
parameters, and you want to discard the model output (i.e. not call
backward on some loss), reduction of the gradients of those unused
parameters would have been kicked off, and you'd see an error the next
time you called `forward`.
In this commit, this original approach is slightly changed to delay
reduction of the gradients of those unused parameters until the first
autograd hook is called. This means that you can now discard the model
output regardless of the model having unused parameters or not.
This is a prerequisite for making the `find_unused_parameters`
argument to DDP default to `True`.
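In user terms, the following sketch now works (assuming `ddp_model` is a DDP-wrapped module with unused parameters and an initialized process group):
```python
output = ddp_model(batch)       # forward marks the unused parameters
# Decide to discard this output: no backward() call at all.
output = ddp_model(next_batch)  # previously this forward raised an error
output.sum().backward()         # reduction now proceeds normally
```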
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22219
Differential Revision: D16028698
Pulled By: pietern
fbshipit-source-id: c6aec2cd39c4a77746495d9cb1c9fb9c5ac61983
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22037
This adds support for sparse gradients to the reducer as well as to
the DistributedDataParallel wrapper. Note that an out-of-band signal
is needed to indicate whether a dense parameter (e.g. an embedding) is
expected to receive a sparse gradient. This information is
passed to the bucket assignment computation routine and the reducer as
a vector of booleans. Every parameter for which we expect a sparse
gradient is assigned its own bucket, as we cannot easily group
multiple unrelated sparse tensors.
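For example (a sketch; process group setup omitted), an embedding with `sparse=True` is the canonical case that gets its own bucket:
```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

# The embedding weight is expected to receive a sparse gradient,
# so the reducer assigns it a bucket of its own.
model = nn.Embedding(num_embeddings=1000, embedding_dim=16, sparse=True)
ddp_model = DistributedDataParallel(model)
```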
Reviewed By: mrshenli
Differential Revision: D15926383
fbshipit-source-id: 39c0d5dbd95bf0534314fdf4d44b2385d5321aaf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22036
Implemented only on ProcessGroupGloo, as an allgather of metadata
(sparse_dim, dense_dim, and nnz), followed by an allgather of indices,
followed by an allgather of values. Once these operations have
finished, all ranks locally compute a reduction over these sparse
tensors. Works for both CPU and CUDA tensors.
This surfaced a problem with the existing assumption of only modifying
tensors that are passed at the call site, because for sparse tensors
we don't know the dimensions of the output tensors before we run the
collective. To deal with this unknown, this commit adds a `result`
function to the `c10d::ProcessGroup::Work` class that returns a vector
of tensors.
It's a bit odd to have to retrieve the result through this function
only for operations on sparse tensors. To make this work irrespective
of tensor layout, a follow-up commit can make all in-place operations
expose their results through this function as well. This doesn't
break any existing contracts but does have the
potential to add interface ambiguity.
This is a resubmission of #19146.
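A sketch of retrieving the output through the new function (assuming the Python bindings for `Work`, with `pg` an initialized ProcessGroupGloo):
```python
import torch

i = torch.tensor([[0, 1], [0, 1]])
v = torch.tensor([3.0, 4.0])
t = torch.sparse_coo_tensor(i, v, (2, 2))

work = pg.allreduce([t])
work.wait()
reduced = work.result()  # list of sparse tensors whose shapes were
                         # unknown before the collective ran
```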
Reviewed By: mrshenli
Differential Revision: D15926384
fbshipit-source-id: b6ee5d81606bfa8ed63c3d63a9e307613491e0ae
Summary:
The first attempt and more discussions are available in https://github.com/pytorch/pytorch/issues/19577
#### Goal
Allow toggling DDP gradient synchronization across iterations. With this feature, users may accumulate gradients in module variables and only kick off the expensive gradient synchronization every few iterations.
#### Concerns
Our first attempt in https://github.com/pytorch/pytorch/issues/19577 tried to do it using a variable or a function, but apaszke made a good point that this would be error prone, and favored a context manager instead.
#### Proposed Solution
Instead of providing an `accumulate_grads` variable/function/context, we provide a `DistributedDataParallel.no_sync()` context manager. It does exactly what the name suggests: it disables DDP gradient synchronization within the context. Note that `accumulate_grads` means `no_sync` + no optimizer step, where the latter is not controlled by DDP.
It is true that users need to call another `model(input).backward()` after exiting the context, and this is indeed more verbose. But I think this is OK, as one major concern in the previous discussion was to prevent users from running into errors without knowing it. This API should reaffirm the expected behavior, and it does not interfere with other use cases where accumulating gradients is not required.
The application would then look like:
```python
with ddp.no_sync():
    for input in inputs:
        ddp(input).backward()
ddp(one_more_input).backward()
optimizer.step()
```
chenyangyu1988 myleott
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21736
Differential Revision: D15805215
Pulled By: mrshenli
fbshipit-source-id: 73405797d1e39965c52016af5cf45b15525ce21c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19443
This adds support for sparse gradients to the reducer as well as to
the DistributedDataParallel wrapper. Note that an out-of-band signal
is needed to indicate whether a dense parameter (e.g. an embedding) is
expected to receive a sparse gradient. This information is
passed to the bucket assignment computation routine and the reducer as
a vector of booleans. Every parameter for which we expect a sparse
gradient is assigned its own bucket, as we cannot easily group
multiple unrelated sparse tensors.
Reviewed By: mrshenli
Differential Revision: D15007365
fbshipit-source-id: f298e83fd3ca828fae9e80739e1db89d045c99ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19146
Implemented only on ProcessGroupGloo, as an allgather of metadata
(sparse_dim, dense_dim, and nnz), followed by an allgather of indices,
followed by an allgather of values. Once these operations have
finished, all ranks locally compute a reduction over these sparse
tensors. Works for both CPU and CUDA tensors.
This surfaced a problem with the existing assumption of only modifying
tensors that are passed at the call site, because for sparse tensors
we don't know the dimensions of the output tensors before we run the
collective. To deal with this unknown, this commit adds a `result`
function to the `c10d::ProcessGroup::Work` class that returns a vector
of tensors.
It's a bit odd to have to retrieve the result through this function
only for operations on sparse tensors. To make this work irrespective
of tensor layout, a follow-up commit can make all in-place operations
expose their results through this function as well. This doesn't
break any existing contracts but does have the
potential to add interface ambiguity.
Reviewed By: mrshenli
Differential Revision: D14889547
fbshipit-source-id: 34f3de4d6a2e09c9eba368df47daad0dc11b333e
Summary:
Closes https://github.com/pytorch/pytorch/issues/21344
DDP assigns the original module to the first module replica instead of creating a new one. Then, it creates a new Reducer to add post hooks to sync gradients. However, because every reconstructed DDP instance wraps the same original module, all of their reducers will add hooks to the same set of variables. This PR deletes the DDP hooks from the variables when the Reducer is destructed, making DDP failures recoverable.
pietern kuttas and I discussed the following solutions:
#### Solution 1
Keep the `add_post_hook` API intact, and do a `dynamic_cast` in `del_post_hook` to check the hook type. If the type matches the Reducer's hook, delete it. As pietern mentioned, this will not work if we create multiple DDP instances from the same original model.
#### Solution 2
Use a counter to generate a unique key for every hook in `Function`, and keep them in a map. Return the key to the caller of `add_post_hook`, and ask the caller to provide the key if it needs to delete the hook.
Con: this would add extra overhead to `add_post_hook` and every `Function` object.
#### Solution 3 [Current implementation]
kuttas suggests that, instead of generating a unique key, it is better to directly use the address of the hook pointer. To avoid accidental dereferencing, `add_post_hook` returns it as a `uintptr_t`.
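The user-visible effect, sketched (`MyModel` is hypothetical):
```python
model = MyModel()
ddp = DistributedDataParallel(model)
del ddp  # the Reducer destructor now removes its post hooks

# Re-wrapping the same module no longer leaves stale hooks behind.
ddp = DistributedDataParallel(model)
```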
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21591
Differential Revision: D15745706
Pulled By: mrshenli
fbshipit-source-id: e56d2d48de0c65f6667790ab16337eac7f7d8b76
Summary:
Fixes #20651
Communication collectives in `torch.distributed` call `CUDACachingAllocator::recordStream()` on input and output tensors to prevent their memory blocks being freed too early. `CUDACachingAllocator` uses tensor's data pointer to track memory blocks, which does not accept null pointers. However, empty tensor's `storage().data()` might be null. In this case, as there is no associated memory block for the empty tensor, it should be fine to make `recordStream()` a no-op.
Tests only cover `broadcast` with empty tensors for the GLOO backend, because GLOO does not support empty inputs (facebookincubator/gloo/issues/179). This can be addressed in either `ProcessGroupGloo` or GLOO itself. More tests will be added when that gap is filled.
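A minimal sketch of the fixed case (assuming an initialized NCCL process group):
```python
import torch
import torch.distributed as dist

t = torch.empty(0, device="cuda")  # storage data pointer may be null
dist.broadcast(t, src=0)           # recordStream() is now a no-op for it
```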
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20658
Differential Revision: D15399371
Pulled By: mrshenli
fbshipit-source-id: d29ebd1c72fddae49531f32695f81b89e42e5a4d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20351
This was broken because of a merge race between #20282 and the stack in #20236.
Cleaned up the test and comments a bit as well.
Differential Revision: D15292786
fbshipit-source-id: a4379ea700cad959d3a6921fc5ddf9384fb8f228
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20236
Use the new version of broadcast_coalesced that deals with both CPU
and CUDA models. Add tests that evaluate correctness of
DistributedDataParallel for CPU models.
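From the user's perspective (sketch, assuming an initialized process group):
```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

model = nn.Linear(10, 10)  # a CPU module; no device_ids needed
ddp_model = DistributedDataParallel(model)
```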
Closes #17757.
Reviewed By: mrshenli
Differential Revision: D15245428
fbshipit-source-id: d2fa09f68593b3cd1b72efeb13f5af23ebd5c80a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20235
The tests expected to run only for CUDA models. In a future commit we
need to update this to work for CPU models as well. Therefore, we can
no longer rely on only integers being passed for device identifiers.
With this change we pass both the materialized list of devices to use
(as `torch.device` objects), as well as an optional list of integers.
The latter is specified to exercise the code in the
DistributedDataParallel constructor that turns a list of integers into
CUDA devices, if and only if it is used to wrap a single-device CUDA module.
This commit also groups together the 'str' and non-'str' tests. These
used to test passing the list of devices as integers or as
`torch.device` instances. These are now executed from the same test.
Reviewed By: mrshenli
Differential Revision: D15245429
fbshipit-source-id: 5797ba9db33d2c26db8e7493c91bb52f694285ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20234
The differences with the existing function _dist_broadcast_coalesced
is that this one works for both CPU and CUDA tensors and that it has a
maximum number of in flight operations.
This should be the final change needed to have only a single version
of DistributedDataParallel that both supports CPU and CUDA models, or
even a mix of both.
See #17757 for more information.
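A hedged usage sketch, assuming the new helper is exposed as `torch.distributed._broadcast_coalesced` and `pg` is an initialized ProcessGroup:
```python
import torch.distributed as dist

tensors = [p.data for p in model.parameters()]  # CPU and/or CUDA tensors
# Broadcast from rank 0 in coalesced buckets, with a bounded number of
# in-flight operations.
dist._broadcast_coalesced(pg, tensors, 256 * 1024 * 1024)
```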
Reviewed By: mrshenli
Differential Revision: D15228099
fbshipit-source-id: a2113ba6b09b68cb5328f49f4c1960031eb43c93
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20282
Add a unit test to ensure that no gradient synchronization happens when calling ddp.module(input), i.e. without invoking prepare_for_backward.
PyText depends on DDP for data parallel distributed training. To support accumulating gradients locally before syncing them, we call orig_model.forward instead of ddp_model.forward. Add a unit test so that future changes do not break this assumption.
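The pattern being tested, sketched:
```python
# Calling the wrapped module directly bypasses DDP.forward, so
# prepare_for_backward is never invoked and no gradients are synced.
output = ddp_model.module(batch)
output.sum().backward()  # gradients accumulate locally on this rank
```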
Reviewed By: pietern, mrshenli
Differential Revision: D15263155
fbshipit-source-id: 7734e174f507690fb23ea6c52dffff4a93f9b151
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19901
The existing code used `expect_autograd_hooks_` as a proxy for the
situation where finalization of the previous iteration is needed. This
is not correct, however, since you may decide to completely ignore the
output of a DDP wrapped module. If this is the case, and no gradients
have been passed to the reducer, it is fine to keep going. This commit
adds a new variable `require_finalize_` that tracks whether the
finalization is really needed.
Reviewed By: mrshenli
Differential Revision: D15118871
fbshipit-source-id: 25938eaf1fe13e2940feae1312892b9d3da8a67d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19897
During validation, gradient reduction is not needed, and autograd is
never called. The model output will always be a detached tensor. After
the new reducer was merged, this meant that it would find all model
parameters unused and kick off reduction for them. With #19799 merged,
this breaks for a model output where no parameters are used, as it
tries to kick off reduction of zeroed gradients. The fix is to test for
`torch.is_grad_enabled()` and `self.training` before calling into the
reducer.
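In other words, a standard validation loop now skips the reducer entirely (sketch; `ddp_model` and `val_loader` assumed from the surrounding training script):
```python
ddp_model.eval()                   # self.training is False
with torch.no_grad():              # torch.is_grad_enabled() is False
    for batch in val_loader:
        output = ddp_model(batch)  # forward never calls into the reducer
```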
Reviewed By: mrshenli
Differential Revision: D15118726
fbshipit-source-id: b0208f632a61cbe8110fa626fa427937b7f05924
Summary:
Since DDP in previous releases did not support unused parameters, turn off `find_unused_parameters` by default to de-risk the new reducer.
CC pietern soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19895
Reviewed By: pietern
Differential Revision: D15118563
Pulled By: mrshenli
fbshipit-source-id: 6215c486e1dae3387b36011d8e64a2721ac85f58
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19821
It is possible that not a single parameter is used during an
iteration. If this is the case, the `prepare_for_backward` function
marks all parameters as unused, kicks off reduction of all buckets,
*and* finalizes the reduction.
This is different from the prior implementation where we assumed that
autograd would produce a gradient for at least a single parameter.
We then used the autograd callback mechanism to queue a finalizer
callback. Now, this finalizer may be executed inline.
Reviewed By: mrshenli
Differential Revision: D15113272
fbshipit-source-id: dc91458b569cd8c106ddaeea558464b515683550
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19799
A module that returns multiple outputs, where the caller may end up
making multiple calls to torch.autograd.backward, did not work with
DistributedDataParallel. It expected the first call to
torch.autograd.backward to provide gradients for ALL parameters that
expect gradients and were used in computing the module output. If you
have outputs with disjoint autograd graphs it is fine to call
torch.autograd.backward on both and fill in the module's parameter
gradients in separate chunks.
With this change we delay queuing the finalizer callback until we have
marked all buckets as ready, instead of queueing it the first time we
receive an autograd hook. This returns the current implementation to
be functionally equivalent to the DistributedDataParallel
implementation before #18953 was merged.
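A sketch of the now-supported pattern (assuming the two outputs have disjoint autograd graphs):
```python
out_a, out_b = ddp_model(batch)
torch.autograd.backward([out_a.sum()])  # fills one subset of gradients
torch.autograd.backward([out_b.sum()])  # fills the rest; the finalizer is
                                        # queued once all buckets are ready
```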
Reviewed By: mrshenli
Differential Revision: D15097045
fbshipit-source-id: 2df023319713bc31e29a8b45108c78e6593fccd4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19515
This is still done by default, but can now be disabled by specifying
`find_unused_parameters=False`. There are use cases where finding
unused parameters results in erroneous behavior, because a subset of
model parameters is used *outside* the `forward` function. One can
argue that doing this is not a good idea, but we should not break
existing use cases without an escape hatch. This configuration
parameter is that escape hatch.
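Using the escape hatch (sketch):
```python
ddp_model = DistributedDataParallel(
    model,
    find_unused_parameters=False,  # skip unused-parameter detection
)
```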
Reviewed By: bddppq
Differential Revision: D15016381
fbshipit-source-id: f2f86b60771b3801ab52776e62b5fd6748ddeed0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19360
We'll return the output object verbatim since it is a freeform object.
We need to find any tensors in this object, though, because we need to
figure out which parameters were used during this forward pass, to
ensure we short circuit reduction for any unused parameters.
Before this commit only lists were handled and the functionality went
untested. This commit adds support for dicts and recursive structures,
and also adds a test case.
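For example, a module output like the following is now traversed correctly (sketch):
```python
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 8)

    def forward(self, x):
        y = self.fc(x)
        # Nested dict/list structures are now searched for tensors.
        return {"logits": y, "extras": [y * 2]}
```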
Closes #19354.
Reviewed By: mrshenli
Differential Revision: D14978016
fbshipit-source-id: 4bb6999520871fb6a9e4561608afa64d55f4f3a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18953
This removes Python side bucketing code from DistributedDataParallel
and replaces it with calls to the new C++ based bucketing and reducing
code. To confirm this is working well, we ran a test with both the
previous implementation and the new implementation, and confirmed they
are numerically equivalent.
Performance is improved by a couple percent or more, including the
single machine multiple GPU runs.
Closes #13273.
Reviewed By: mrshenli
Differential Revision: D14580911
fbshipit-source-id: 44e76f8b0b7e58dd6c91644e3df4660ca2ee4ae2
Summary:
~Sometimes, `init_process_group()`, `store.get()`, and `destroy_process_group()` can take more than a few seconds. Hence, removing thread join timeout.~
The error was due to `Address already in use` when starting the TCP backend. The solution is to catch the error and report it to the `retry_on_address_already_in_use_error` decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19114
Reviewed By: ezyang
Differential Revision: D14872680
Pulled By: mrshenli
fbshipit-source-id: fc504d02853ca73f76288c0ade564ab20bc01f7e
Summary:
Closes #16520
Hi pietern, I am not sure if this is the expected way to pass the timeout to `Store`; could you please help take a look? Thanks! A sketch of the user-facing timeout argument follows the questions below.
Questions:
1. How do I write tests for this? I wanted to do something like `test_barrier_timeout_global`, but it seems I need to set the pg's timeout larger than the `Store`'s default timeout (3 min) to see a difference, which is too long for a unit test. And I do not want to change the `Store`'s default timeout either. Any suggestion?
2. Should I also propagate timeout configuration down to `PrefixStore` in `_new_process_group_helper`?
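For reference, a sketch of the user-facing side of this change, via the `timeout` argument as it exists in current releases:
```python
import os
from datetime import timedelta
import torch.distributed as dist

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
dist.init_process_group("gloo", init_method="env://", rank=0, world_size=1,
                        timeout=timedelta(seconds=60))  # reaches the Store
```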
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16571
Differential Revision: D13954527
Pulled By: mrshenli
fbshipit-source-id: 77f2653903f24255207233eb298f7c0321119a87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18845
This adds a few CPU-only test cases for the reducer class.
Reviewed By: mrshenli
Differential Revision: D14768432
fbshipit-source-id: c008a52206826304e634a95bc14167ed94c97662
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18291
ghimport-source-id: d6e95e899bd320407967df41435801e54864ba62
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18292 Add test for #17271 (torch.exp incorrect for 2**31 size tensor)
* **#18291 Correctly call superclass setUp in TestCase subclasses.**
This makes PYTORCH_TEST_SKIP_FAST work correctly for more
tests, reducing the wasted testing effort on our slow_test job.
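The pattern, sketched (import path as in later releases):
```python
from torch.testing._internal.common_utils import TestCase

class MyTest(TestCase):
    def setUp(self):
        super().setUp()  # lets the PYTORCH_TEST_SKIP_FAST machinery run
        # test-specific setup goes here
```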
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14567643
fbshipit-source-id: 40cf1d6556e0dd0a0550ff3d9ffed8b6000f8191
Summary:
This PR fixes a race condition in the TCP init method, where the master rank can exit earlier than the slave ranks, causing the TCP daemon thread to be shut down before the other slaves are able to access it.
The fix lets every rank (process) write a special key to the store to mark that it has completed (and is thus about to exit). The master rank (which is the server) always waits for all ranks to complete before completing itself.
This should fix: https://github.com/pytorch/pytorch/issues/15638
Tested using the repro from https://github.com/pytorch/pytorch/issues/15638 and it works fine. test_distributed and test_c10d should already have coverage for this.
I had to give the rendezvous test in c10d a world size of 1, since it is single-process code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15684
Differential Revision: D13570904
Pulled By: teng-li
fbshipit-source-id: 34f3bc471204bbd29320df359347ad5561c6b589
Summary:
Otherwise, these tests will fail, even though they were never meant to run on single-GPU machines.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14860
Differential Revision: D13369060
Pulled By: teng-li
fbshipit-source-id: 8a637a6d57335491ba8602cd09927700b2bbf8a0
Summary:
It is possible that some sort of contention causes process scheduling
delays which in turn cause the timeout to *not* be hit.
Increased sleep here will decrease the probability of this happening.
Fixes #14555.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14814
Differential Revision: D13351924
Pulled By: pietern
fbshipit-source-id: 1222cf0855408dfcb79f30f94694c790ee998cf9
Summary:
If multiple arguments are specified to c10d allreduce, they are
interpreted as if they are expanding the ranks in the process group.
Therefore, not only is every argument to allreduce an input that must
be considered, it is also an output. The problem that this commit
fixes is that they were not correctly considered as outputs.
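A sketch of the multi-tensor case (`pg` assumed to be an initialized ProcessGroupGloo):
```python
import torch

t0 = torch.ones(4)
t1 = torch.ones(4)
work = pg.allreduce([t0, t1])  # treated as if the rank count were doubled
work.wait()
# Both t0 and t1 are outputs and must hold the reduced result.
assert torch.equal(t0, t1)
```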
The upstream problem is tracked in facebookincubator/gloo#152. Once
this is fixed there we can remove the copies that this commit adds.
This fixes #14676.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14688
Differential Revision: D13294405
Pulled By: pietern
fbshipit-source-id: 078a2a0a0ff12d051392461438f1496201ec3cb9
Summary:
Fixing: https://github.com/pytorch/pytorch/issues/14446
This was a supported behavior in old torch.distributed. We want to support it in the new release.
Tests should cover all combinations where rank, world size, or both are provided via environment variables, via arguments, or a mix of the two.
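One covered combination, sketched (env vars for the address, explicit args for rank and size):
```python
import os
import torch.distributed as dist

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
# rank and world_size passed as args instead of RANK/WORLD_SIZE env vars
dist.init_process_group("gloo", init_method="env://", rank=0, world_size=1)
```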
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14494
Differential Revision: D13253433
Pulled By: teng-li
fbshipit-source-id: c05974d84f1bdf969f74ec45763e11a841fe4848
Summary:
Fixed: https://github.com/pytorch/pytorch/issues/14445
Also bumped the timeout up to 30 seconds, since on 8-GPU machines the DDP test sometimes takes more than 15 seconds.
Tested on 8 GPU machines:
```
tengli@learnfair062:~/pytorch/test$ python test_c10d.py --verbose
test_dist_broadcast_coalesced_gloo (__main__.DistributedDataParallelTest) ... ok
test_dist_broadcast_coalesced_nccl (__main__.DistributedDataParallelTest) ... skipped 'Test skipped due to known issues'
test_fp16 (__main__.DistributedDataParallelTest) ... ok
test_gloo_backend (__main__.DistributedDataParallelTest) ... ok
test_nccl_backend (__main__.DistributedDataParallelTest) ... ok
test_queue_reduction (__main__.DistributedDataParallelTest) ... ok
test_sync_params_no_buffers (__main__.DistributedDataParallelTest) ... ok
test_sync_params_with_buffers (__main__.DistributedDataParallelTest) ... ok
test_sync_reduction (__main__.DistributedDataParallelTest) ... ok
test_set_get (__main__.FileStoreTest) ... ok
test_set_get (__main__.PrefixFileStoreTest) ... ok
test_set_get (__main__.PrefixTCPStoreTest) ... ok
test_allgather_basics (__main__.ProcessGroupGlooTest) ... ok
test_allgather_checks (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_basics (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_basics_cuda (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_checks (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_stress (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_stress_cuda (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_basics (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_basics_cuda (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_checks (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_stress (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_stress_cuda (__main__.ProcessGroupGlooTest) ... ok
test_gather_basics (__main__.ProcessGroupGlooTest) ... ok
test_gather_checks (__main__.ProcessGroupGlooTest) ... ok
test_reduce_basics (__main__.ProcessGroupGlooTest) ... ok
test_reduce_checks (__main__.ProcessGroupGlooTest) ... ok
test_scatter_basics (__main__.ProcessGroupGlooTest) ... ok
test_scatter_checks (__main__.ProcessGroupGlooTest) ... ok
test_send_recv_all_to_all (__main__.ProcessGroupGlooTest) ... ok
test_timeout_kwarg (__main__.ProcessGroupGlooTest) ... ok
test_allgather_ops (__main__.ProcessGroupNCCLTest) ... ok
test_allreduce_ops (__main__.ProcessGroupNCCLTest) ... ok
test_barrier (__main__.ProcessGroupNCCLTest) ... ok
test_broadcast_ops (__main__.ProcessGroupNCCLTest) ... ok
test_reduce_ops (__main__.ProcessGroupNCCLTest) ... ok
test_common_errors (__main__.RendezvousEnvTest) ... ok
test_nominal (__main__.RendezvousEnvTest) ... ok
test_common_errors (__main__.RendezvousFileTest) ... ok
test_nominal (__main__.RendezvousFileTest) ... ok
test_common_errors (__main__.RendezvousTCPTest) ... ok
test_nominal (__main__.RendezvousTCPTest) ... ok
test_unknown_handler (__main__.RendezvousTest) ... ok
test_address_already_in_use (__main__.TCPStoreTest) ... ok
test_set_get (__main__.TCPStoreTest) ... ok
----------------------------------------------------------------------
Ran 46 tests in 162.980s
OK (skipped=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14452
Differential Revision: D13230652
Pulled By: teng-li
fbshipit-source-id: 88580fe55b3a4fbc7a499ca3b591958f11623bf8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14386
See #13573, #14142, and #14271 for discussion.
This change updates ProcessGroupGloo to ensure that all prior
operations have completed before executing the barrier.
Reviewed By: manojkris
Differential Revision: D13205022
fbshipit-source-id: 673e7e6ca357dc843874d6dd8da590832e1de7fa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14298
This is a breaking API change for users of the C++ c10d API. The work
object defined wait() to return a boolean. If the work completed
successfully it would return true, if it didn't it would return false.
It was then up to the user to call the exception() function to figure
out what went wrong. This has proven suboptimal as it allows users to
forget about failure handling and errors may be ignored.
The work class is semantically very similar to std::future, where a
call to get() may throw if the underlying std::promise has set an
exception. This commit changes the semantic of the work class to be
similar to this and turns wait() into a void function that throws if
the work completes with an exception.
The exception() function can still be used to retrieve the exception
if isSuccess() returns false, but now returns an std::exception_ptr
instead of a reference to a std::exception.
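Under the new semantics, user code looks like this sketch (`pg` and `handle_failure` are illustrative):
```python
work = pg.allreduce([tensor])
try:
    work.wait()  # returns nothing; raises if the work failed
except RuntimeError as exc:
    handle_failure(exc)  # hypothetical error handling
```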
Reviewed By: manojkris
Differential Revision: D13158475
fbshipit-source-id: 9cd8569b9e7cbddc867a5f34c6fd0b7be85581b8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14294
This is the final collective to be ported to the new style where there
is no longer a need to keep a cached algorithm instance around. There
is a follow-up change incoming to remove the algorithm caching
functionality in ProcessGroupGloo.
Reviewed By: manojkris
Differential Revision: D13111509
fbshipit-source-id: f3ea0d955a62029fc4e7cfc09055e4957e0943ac
Summary:
Most likely a typo.
Tested on 8-GPU machine
```
tengli@learnfair062:~/pytorch/test$ python test_c10d.py ProcessGroupNCCLTest.test_barrier
.
----------------------------------------------------------------------
Ran 1 test in 29.341s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14389
Differential Revision: D13207207
Pulled By: teng-li
fbshipit-source-id: aaffe14237076fe19d94e2fa4d9c093397f07bb9
Summary:
This covers an edge case where we run the same NCCL process group with multiple GPU combinations, rather than only the last GPU combination. We now keep track of which GPUs have previously been used in the NCCL process group, and barrier() itself synchronizes on each GPU's NCCL stream.
A test is included as well. Tested on an 8-GPU machine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14271
Differential Revision: D13164993
Pulled By: teng-li
fbshipit-source-id: 81e04352740ea50b5e943369e74cfcba40bb61c1