Summary:
MultiProcessTestCase will be useful for both c10d and rpc tests, so this diff extracts that class and some common decorators into a separate file.
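For illustration, a minimal sketch of how a test might subclass the extracted class (the import path and the `_spawn_processes`/`rank` helpers are assumptions based on later versions of the file):
```python
from torch.testing._internal.common_distributed import MultiProcessTestCase

class MyDistributedTest(MultiProcessTestCase):
    @property
    def world_size(self):
        return 2

    def setUp(self):
        super().setUp()
        self._spawn_processes()  # run each test method once per rank

    def test_ranks(self):
        # executes in every spawned process; self.rank identifies it
        self.assertTrue(0 <= self.rank < self.world_size)
```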
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23660
Reviewed By: pietern
Differential Revision: D16602865
Pulled By: mrshenli
fbshipit-source-id: 85ad47dfb8ba187b7debeb3edeea5df08ef690c7
Summary:
With this change you can now list multiple interfaces separated by
commas. ProcessGroupGloo creates a single Gloo context for every device
in the list (a context represents a connection to every other
rank). For every collective that is called, it will select the context
in a round robin fashion. The number of worker threads responsible for
executing the collectives is set to be twice the number of devices.
If you have a single physical interface, and wish to employ increased
parallelism, you can also specify
`GLOO_SOCKET_IFNAME=eth0,eth0,eth0,eth0`. This makes ProcessGroupGloo
use 4 connections per rank, 4 I/O threads, and 8 worker threads
responsible for executing the collectives.
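A sketch of this from Python (the interface names and rendezvous address are illustrative):
```python
import os
import torch.distributed as dist

# One physical interface listed twice -> two Gloo contexts per rank,
# selected round robin per collective, and four worker threads.
os.environ["GLOO_SOCKET_IFNAME"] = "eth0,eth0"
dist.init_process_group("gloo", init_method="tcp://10.0.0.1:23456",
                        rank=0, world_size=2)
```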
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22978
ghstack-source-id: 87006270
Differential Revision: D16339962
fbshipit-source-id: 9aa1dc93d8e131c1714db349b0cbe57e9e7266f1
Summary:
The CMake modifications include removal of some unnecessary paths
(e.g. find_package(CUDA) and friends) that are no longer used since
c10d is always part of the larger torch build. The macro
`C10D_USE_...` was ambiguous and is now removed in favor of only
having top level `USE_...`. The c10d test suite is changed to include
skip annotations for the tests that depend on Gloo as well.
Now, if you compile with `USE_DISTRIBUTED=1` and `USE_GLOO=0` you get
a functioning build for which the tests actually pass.
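A hedged sketch of such a skip annotation (assuming `torch.distributed.is_gloo_available()`, which exists in current releases):
```python
import unittest
import torch.distributed as dist

def skip_if_no_gloo(func):
    # Skip tests that depend on Gloo when built with USE_GLOO=0.
    return unittest.skipUnless(dist.is_gloo_available(),
                               "c10d was built without Gloo")(func)
```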
Closes https://github.com/pytorch/pytorch/issues/18851.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22257
Differential Revision: D16087993
Pulled By: pietern
fbshipit-source-id: 0cea66bd5cbd9736b06fa1d45ee13a18cab88adb
Summary:
Reduction of gradients for unused parameters should happen as soon as
possible, because they potentially block reduction of gradients for
used parameters. This used to happen instantly when
`prepare_for_backward` was called and it found parameters that didn't
contribute. This meant that if you have a model with unused
parameters, and you want to discard the model output (i.e. not call
backward on some loss), reduction of the gradients of those unused
parameters would have been kicked off, and you'd see an error the next
time you called `forward`.
In this commit, this original approach is slightly changed to delay
reduction of the gradients of those unused parameters until the first
autograd hook is called. This means that you can now discard the model
output regardless of the model having unused parameters or not.
This is a prerequisite for making the `find_unused_parameters`
argument to DDP default to `True`.
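In user terms, the following sketch now works (assuming `ddp_model` is a DDP-wrapped module with unused parameters and an initialized process group):
```python
output = ddp_model(batch)       # forward marks the unused parameters
# Decide to discard this output: no backward() call at all.
output = ddp_model(next_batch)  # previously this forward raised an error
output.sum().backward()         # reduction now proceeds normally
```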
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22219
Differential Revision: D16028698
Pulled By: pietern
fbshipit-source-id: c6aec2cd39c4a77746495d9cb1c9fb9c5ac61983
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22037
This adds support for sparse gradients to the reducer as well as to
the DistributedDataParallel wrapper. Note that an out-of-band signal
is needed to indicate whether a dense parameter (e.g. an embedding) is
expected to receive a sparse gradient. This information is
passed to the bucket assignment computation routine and the reducer as
a vector of booleans. Every parameter for which we expect a sparse
gradient is assigned its own bucket, as we cannot easily group
multiple unrelated sparse tensors.
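For example (a sketch; process group setup omitted), an embedding with `sparse=True` is the canonical case that gets its own bucket:
```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

# The embedding weight is expected to receive a sparse gradient,
# so the reducer assigns it a bucket of its own.
model = nn.Embedding(num_embeddings=1000, embedding_dim=16, sparse=True)
ddp_model = DistributedDataParallel(model)
```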
Reviewed By: mrshenli
Differential Revision: D15926383
fbshipit-source-id: 39c0d5dbd95bf0534314fdf4d44b2385d5321aaf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22036
Implemented only on ProcessGroupGloo, as an allgather of metadata
(sparse_dim, dense_dim, and nnz), followed by an allgather of indices,
followed by an allgather of values. Once these operations have
finished, all ranks locally compute a reduction over these sparse
tensors. Works for both CPU and CUDA tensors.
This surfaced a problem with the existing assumption of only modifying
tensors that are passed at the call site, because for sparse tensors
we don't know the dimensions of the output tensors before we run the
collective. To deal with this unknown, this commit adds a `result`
function to the `c10d::ProcessGroup::Work` class that returns a vector
of tensors.
It's a bit odd to have to retrieve the result through this function
only for operations on sparse tensors. To make this work irrespective
of tensor layout, a follow-up commit can make all in-place operations
expose their results through this function as well. This doesn't
break any existing contracts but does have the
potential to add interface ambiguity.
This is a resubmission of #19146.
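A sketch of retrieving the output through the new function (assuming the Python bindings for `Work`, with `pg` an initialized ProcessGroupGloo):
```python
import torch

i = torch.tensor([[0, 1], [0, 1]])
v = torch.tensor([3.0, 4.0])
t = torch.sparse_coo_tensor(i, v, (2, 2))

work = pg.allreduce([t])
work.wait()
reduced = work.result()  # list of sparse tensors whose shapes were
                         # unknown before the collective ran
```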
Reviewed By: mrshenli
Differential Revision: D15926384
fbshipit-source-id: b6ee5d81606bfa8ed63c3d63a9e307613491e0ae
Summary:
The first attempt and more discussions are available in https://github.com/pytorch/pytorch/issues/19577
#### Goal
Allow toggling DDP gradient synchronization across iterations. With this feature, users may accumulate gradients in module variables and only kick off the expensive gradient synchronization every few iterations.
#### Concerns
Our first attempt in https://github.com/pytorch/pytorch/issues/19577 tried to do it using a variable or a function, but apaszke made a good point that this would be error prone, and favored a context manager instead.
#### Proposed Solution
Instead of providing an `accumulate_grads` variable/function/context, we provide a `DistributedDataParallel.no_sync()` context manager. It does exactly what the name suggests: it disables DDP gradient synchronization within the context. Note that `accumulate_grads` means `no_sync` + no optimizer step, where the latter is not controlled by DDP.
It is true that users need to call another `model(input).backward()` after exiting the context, and this is indeed more verbose. But I think this is OK, as one major concern in the previous discussion was to prevent users from running into errors without knowing it. This API should reaffirm the expected behavior, and it does not interfere with other use cases where accumulating gradients is not required.
The application would then look like:
```python
with ddp.no_sync():
    for input in inputs:
        ddp(input).backward()
ddp(one_more_input).backward()
optimizer.step()
```
chenyangyu1988 myleott
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21736
Differential Revision: D15805215
Pulled By: mrshenli
fbshipit-source-id: 73405797d1e39965c52016af5cf45b15525ce21c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19443
This adds support for sparse gradients to the reducer as well as to
the DistributedDataParallel wrapper. Note that an out-of-band signal
is needed to indicate whether a dense parameter (e.g. an embedding) is
expected to receive a sparse gradient. This information is
passed to the bucket assignment computation routine and the reducer as
a vector of booleans. Every parameter for which we expect a sparse
gradient is assigned its own bucket, as we cannot easily group
multiple unrelated sparse tensors.
Reviewed By: mrshenli
Differential Revision: D15007365
fbshipit-source-id: f298e83fd3ca828fae9e80739e1db89d045c99ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19146
Implemented only on ProcessGroupGloo, as an allgather of metadata
(sparse_dim, dense_dim, and nnz), followed by an allgather of indices,
followed by an allgather of values. Once these operations have
finished, all ranks locally compute a reduction over these sparse
tensors. Works for both CPU and CUDA tensors.
This surfaced a problem with the existing assumption of only modifying
tensors that are passed at the call site, because for sparse tensors
we don't know the dimensions of the output tensors before we run the
collective. To deal with this unknown, this commit adds a `result`
function to the `c10d::ProcessGroup::Work` class that returns a vector
of tensors.
It's a bit odd to have to retrieve the result through this function
only for operations on sparse tensors. To make this work irrespective
of tensor layout, a follow-up commit can make all in-place operations
expose their results through this function as well. This doesn't
break any existing contracts but does have the
potential to add interface ambiguity.
Reviewed By: mrshenli
Differential Revision: D14889547
fbshipit-source-id: 34f3de4d6a2e09c9eba368df47daad0dc11b333e
Summary:
Closes https://github.com/pytorch/pytorch/issues/21344
DDP assigns the original module to the first module replica instead of creating a new one. Then, it creates a new Reducer to add post hooks to sync gradients. However, because every reconstructed DDP instance wraps the same original module, all of their reducers will add hooks to the same set of variables. This PR deletes the DDP hooks from the variables when the Reducer is destructed, making DDP failures recoverable.
pietern kuttas and I discussed the following solutions:
#### Solution 1
Keep the `add_post_hook` API intact, and do a `dynamic_cast` in `del_post_hook` to check the hook type. If the type matches the Reducer's hook, delete it. As pietern mentioned, this will not work if we create multiple DDP instances from the same original model.
#### Solution 2
Use a counter to generate a unique key for every hook in `Function`, and keep them in a map. Return the key to the caller of `add_post_hook`, and ask the caller to provide the key if it needs to delete the hook.
Con: this would add extra overhead to `add_post_hook` and every `Function` object.
#### Solution 3 [Current implementation]
kuttas suggests that, instead of generating a unique key, it is better to directly use the address of the hook pointer. To avoid accidental dereferencing, `add_post_hook` returns it as a `uintptr_t`.
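The user-visible effect, sketched (`MyModel` is hypothetical):
```python
model = MyModel()
ddp = DistributedDataParallel(model)
del ddp  # the Reducer destructor now removes its post hooks

# Re-wrapping the same module no longer leaves stale hooks behind.
ddp = DistributedDataParallel(model)
```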
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21591
Differential Revision: D15745706
Pulled By: mrshenli
fbshipit-source-id: e56d2d48de0c65f6667790ab16337eac7f7d8b76
Summary:
Fixes #20651
Communication collectives in `torch.distributed` call `CUDACachingAllocator::recordStream()` on input and output tensors to prevent their memory blocks being freed too early. `CUDACachingAllocator` uses tensor's data pointer to track memory blocks, which does not accept null pointers. However, empty tensor's `storage().data()` might be null. In this case, as there is no associated memory block for the empty tensor, it should be fine to make `recordStream()` a no-op.
Tests only cover `broadcast` with empty tensors for the GLOO backend, because GLOO does not support empty inputs (facebookincubator/gloo/issues/179). This can be addressed in either `ProcessGroupGloo` or GLOO itself. More tests will be added when that gap is filled.
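A minimal sketch of the fixed case (assuming an initialized NCCL process group):
```python
import torch
import torch.distributed as dist

t = torch.empty(0, device="cuda")  # storage data pointer may be null
dist.broadcast(t, src=0)           # recordStream() is now a no-op for it
```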
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20658
Differential Revision: D15399371
Pulled By: mrshenli
fbshipit-source-id: d29ebd1c72fddae49531f32695f81b89e42e5a4d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20351
This was broken because of a merge race between #20282 and the stack in #20236.
Cleaned up the test and comments a bit as well.
Differential Revision: D15292786
fbshipit-source-id: a4379ea700cad959d3a6921fc5ddf9384fb8f228
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20236
Use the new version of broadcast_coalesced that deals with both CPU
and CUDA models. Add tests that evaluate correctness of
DistributedDataParallel for CPU models.
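From the user's perspective (sketch, assuming an initialized process group):
```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

model = nn.Linear(10, 10)  # a CPU module; no device_ids needed
ddp_model = DistributedDataParallel(model)
```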
Closes #17757.
Reviewed By: mrshenli
Differential Revision: D15245428
fbshipit-source-id: d2fa09f68593b3cd1b72efeb13f5af23ebd5c80a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20235
The tests expected to run only for CUDA models. In a future commit we
need to update this to work for CPU models as well. Therefore, we can
no longer rely on only integers being passed for device identifiers.
With this change we pass both the materialized list of devices to use
(as `torch.device` objects), as well as an optional list of integers.
The latter is specified to exercise the code in the
DistributedDataParallel constructor that turns a list of integers into
CUDA devices, if and only if it is used to wrap a single-device CUDA module.
This commit also groups together the 'str' and non-'str' tests. These
used to test passing the list of devices as integers or as
`torch.device` instances. These are now executed from the same test.
Reviewed By: mrshenli
Differential Revision: D15245429
fbshipit-source-id: 5797ba9db33d2c26db8e7493c91bb52f694285ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20234
The differences with the existing function _dist_broadcast_coalesced
is that this one works for both CPU and CUDA tensors and that it has a
maximum number of in flight operations.
This should be the final change needed to have only a single version
of DistributedDataParallel that both supports CPU and CUDA models, or
even a mix of both.
See #17757 for more information.
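A hedged usage sketch, assuming the new helper is exposed as `torch.distributed._broadcast_coalesced` and `pg` is an initialized ProcessGroup:
```python
import torch.distributed as dist

tensors = [p.data for p in model.parameters()]  # CPU and/or CUDA tensors
# Broadcast from rank 0 in coalesced buckets, with a bounded number of
# in-flight operations.
dist._broadcast_coalesced(pg, tensors, 256 * 1024 * 1024)
```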
Reviewed By: mrshenli
Differential Revision: D15228099
fbshipit-source-id: a2113ba6b09b68cb5328f49f4c1960031eb43c93
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20282
Add a unit test to ensure that no gradient synchronization happens when calling ddp.module(input), i.e. without invoking prepare_for_backward.
PyText depends on DDP for data parallel distributed training. To support accumulating gradients locally before syncing them, we call orig_model.forward instead of ddp_model.forward. Add a unit test so that future changes do not break this assumption.
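The pattern being tested, sketched:
```python
# Calling the wrapped module directly bypasses DDP.forward, so
# prepare_for_backward is never invoked and no gradients are synced.
output = ddp_model.module(batch)
output.sum().backward()  # gradients accumulate locally on this rank
```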
Reviewed By: pietern, mrshenli
Differential Revision: D15263155
fbshipit-source-id: 7734e174f507690fb23ea6c52dffff4a93f9b151
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19901
The existing code used `expect_autograd_hooks_` as a proxy for the
situation where finalization of the previous iteration is needed. This
is not correct, however, since you may decide to completely ignore the
output of a DDP wrapped module. If this is the case, and no gradients
have been passed to the reducer, it is fine to keep going. This commit
adds a new variable `require_finalize_` that tracks whether the
finalization is really needed.
Reviewed By: mrshenli
Differential Revision: D15118871
fbshipit-source-id: 25938eaf1fe13e2940feae1312892b9d3da8a67d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19897
During validation, gradient reduction is not needed, and autograd is
never called. The model output will always be a detached tensor. After
the new reducer was merged, this meant that it would find all model
parameters unused and kick off reduction for them. With #19799 merged,
this breaks for a model output where no parameters are used, as it
tries to kick off reduction of zeroed gradients. The fix is to test for
`torch.is_grad_enabled()` and `self.training` before calling into the
reducer.
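In other words, a standard validation loop now skips the reducer entirely (sketch; `ddp_model` and `val_loader` assumed from the surrounding training script):
```python
ddp_model.eval()                   # self.training is False
with torch.no_grad():              # torch.is_grad_enabled() is False
    for batch in val_loader:
        output = ddp_model(batch)  # forward never calls into the reducer
```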
Reviewed By: mrshenli
Differential Revision: D15118726
fbshipit-source-id: b0208f632a61cbe8110fa626fa427937b7f05924
Summary:
Since DDP in previous releases did not support unused parameters, turn off `find_unused_parameters` by default to de-risk the new reducer.
CC pietern soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19895
Reviewed By: pietern
Differential Revision: D15118563
Pulled By: mrshenli
fbshipit-source-id: 6215c486e1dae3387b36011d8e64a2721ac85f58
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19821
It is possible that not a single parameter is used during an
iteration. If this is the case, the `prepare_for_backward` function
marks all parameters as unused, kicks off reduction of all buckets,
*and* finalizes the reduction.
This is different from the prior implementation where we assumed that
autograd would produce a gradient for at least a single parameter.
We then used the autograd callback mechanism to queue a finalizer
callback. Now, this finalizer may be executed inline.
Reviewed By: mrshenli
Differential Revision: D15113272
fbshipit-source-id: dc91458b569cd8c106ddaeea558464b515683550
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19799
A module that returns multiple outputs, where the caller may end up
making multiple calls to torch.autograd.backward, did not work with
DistributedDataParallel. It expected the first call to
torch.autograd.backward to provide gradients for ALL parameters that
expect gradients and were used in computing the module output. If you
have outputs with disjoint autograd graphs it is fine to call
torch.autograd.backward on both and fill in the module's parameter
gradients in separate chunks.
With this change we delay queuing the finalizer callback until we have
marked all buckets as ready, instead of queueing it the first time we
receive an autograd hook. This returns the current implementation to
be functionally equivalent to the DistributedDataParallel
implementation before #18953 was merged.
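A sketch of the now-supported pattern (assuming the two outputs have disjoint autograd graphs):
```python
out_a, out_b = ddp_model(batch)
torch.autograd.backward([out_a.sum()])  # fills one subset of gradients
torch.autograd.backward([out_b.sum()])  # fills the rest; the finalizer is
                                        # queued once all buckets are ready
```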
Reviewed By: mrshenli
Differential Revision: D15097045
fbshipit-source-id: 2df023319713bc31e29a8b45108c78e6593fccd4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19515
This is still done by default, but can now be disabled by specifying
`find_unused_parameters=False`. There are use cases where finding
unused parameters results in erroneous behavior, because a subset of
model parameters is used *outside* the `forward` function. One can
argue that doing this is not a good idea, but we should not break
existing use cases without an escape hatch. This configuration
parameter is that escape hatch.
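Using the escape hatch (sketch):
```python
ddp_model = DistributedDataParallel(
    model,
    find_unused_parameters=False,  # skip unused-parameter detection
)
```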
Reviewed By: bddppq
Differential Revision: D15016381
fbshipit-source-id: f2f86b60771b3801ab52776e62b5fd6748ddeed0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19360
We'll return the output object verbatim since it is a freeform object.
We need to find any tensors in this object, though, because we need to
figure out which parameters were used during this forward pass, to
ensure we short circuit reduction for any unused parameters.
Before this commit only lists were handled and the functionality went
untested. This commit adds support for dicts and recursive structures,
and also adds a test case.
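For example, a module output like the following is now traversed correctly (sketch):
```python
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 8)

    def forward(self, x):
        y = self.fc(x)
        # Nested dict/list structures are now searched for tensors.
        return {"logits": y, "extras": [y * 2]}
```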
Closes #19354.
Reviewed By: mrshenli
Differential Revision: D14978016
fbshipit-source-id: 4bb6999520871fb6a9e4561608afa64d55f4f3a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18953
This removes Python side bucketing code from DistributedDataParallel
and replaces it with calls to the new C++ based bucketing and reducing
code. To confirm this is working well, we ran a test with both the
previous implementation and the new implementation, and confirmed they
are numerically equivalent.
Performance is improved by a couple percent or more, including the
single machine multiple GPU runs.
Closes #13273.
Reviewed By: mrshenli
Differential Revision: D14580911
fbshipit-source-id: 44e76f8b0b7e58dd6c91644e3df4660ca2ee4ae2
Summary:
~Sometimes, `init_process_group()`, `store.get()`, and `destroy_process_group()` can take more than a few seconds. Hence, removing thread join timeout.~
The error was due to `Address already in use` when starting the TCP backend. The solution is to catch the error and report it to the `retry_on_address_already_in_use_error` decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19114
Reviewed By: ezyang
Differential Revision: D14872680
Pulled By: mrshenli
fbshipit-source-id: fc504d02853ca73f76288c0ade564ab20bc01f7e
Summary:
Closes #16520
Hi pietern, I am not sure if this is the expected way to pass the timeout to `Store`; could you please help take a look? Thanks! A sketch of the user-facing timeout argument follows the questions below.
Questions:
1. How do I write tests for this? I wanted to do something like `test_barrier_timeout_global`, but it seems I need to set the pg's timeout larger than the `Store`'s default timeout (3 min) to see a difference, which is too long for a unit test. And I do not want to change the `Store`'s default timeout either. Any suggestion?
2. Should I also propagate timeout configuration down to `PrefixStore` in `_new_process_group_helper`?
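For reference, a sketch of the user-facing side of this change, via the `timeout` argument as it exists in current releases:
```python
import os
from datetime import timedelta
import torch.distributed as dist

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
dist.init_process_group("gloo", init_method="env://", rank=0, world_size=1,
                        timeout=timedelta(seconds=60))  # reaches the Store
```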
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16571
Differential Revision: D13954527
Pulled By: mrshenli
fbshipit-source-id: 77f2653903f24255207233eb298f7c0321119a87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18845
This adds a few CPU-only test cases for the reducer class.
Reviewed By: mrshenli
Differential Revision: D14768432
fbshipit-source-id: c008a52206826304e634a95bc14167ed94c97662
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18291
ghimport-source-id: d6e95e899bd320407967df41435801e54864ba62
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18292 Add test for #17271 (torch.exp incorrect for 2**31 size tensor)
* **#18291 Correctly call superclass setUp in TestCase subclasses.**
This makes PYTORCH_TEST_SKIP_FAST work correctly for more
tests, reducing the wasted testing effort on our slow_test job.
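The pattern, sketched (import path as in later releases):
```python
from torch.testing._internal.common_utils import TestCase

class MyTest(TestCase):
    def setUp(self):
        super().setUp()  # lets the PYTORCH_TEST_SKIP_FAST machinery run
        # test-specific setup goes here
```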
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14567643
fbshipit-source-id: 40cf1d6556e0dd0a0550ff3d9ffed8b6000f8191
Summary:
This PR fixes a race condition in the TCP init method, where the master rank can exit earlier than the slave ranks, causing the TCP daemon thread to be shut down before the other slaves are able to access it.
The fix lets every rank (process) write a special key to the store to mark that it has completed (and is thus about to exit). The master rank (which is the server) always waits for all ranks to complete before completing itself.
This should fix: https://github.com/pytorch/pytorch/issues/15638
Tested using the repro from https://github.com/pytorch/pytorch/issues/15638 and it works fine. test_distributed and test_c10d should already have coverage for this.
I had to give the rendezvous test in c10d a world size of 1, since it is single-process code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15684
Differential Revision: D13570904
Pulled By: teng-li
fbshipit-source-id: 34f3bc471204bbd29320df359347ad5561c6b589
Summary:
Otherwise, these tests will fail, even though they were never meant to run on single-GPU machines.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14860
Differential Revision: D13369060
Pulled By: teng-li
fbshipit-source-id: 8a637a6d57335491ba8602cd09927700b2bbf8a0
Summary:
It is possible that some sort of contention causes process scheduling
delays which in turn cause the timeout to *not* be hit.
Increased sleep here will decrease the probability of this happening.
Fixes #14555.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14814
Differential Revision: D13351924
Pulled By: pietern
fbshipit-source-id: 1222cf0855408dfcb79f30f94694c790ee998cf9
Summary:
If multiple arguments are specified to c10d allreduce, they are
interpreted as if they are expanding the ranks in the process group.
Therefore, not only is every argument to allreduce an input that must
be considered, it is also an output. The problem that this commit
fixes is that they were not correctly considered as outputs.
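A sketch of the multi-tensor case (`pg` assumed to be an initialized ProcessGroupGloo):
```python
import torch

t0 = torch.ones(4)
t1 = torch.ones(4)
work = pg.allreduce([t0, t1])  # treated as if the rank count were doubled
work.wait()
# Both t0 and t1 are outputs and must hold the reduced result.
assert torch.equal(t0, t1)
```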
The upstream problem is tracked in facebookincubator/gloo#152. Once
this is fixed there we can remove the copies that this commit adds.
This fixes #14676.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14688
Differential Revision: D13294405
Pulled By: pietern
fbshipit-source-id: 078a2a0a0ff12d051392461438f1496201ec3cb9
Summary:
Fixing: https://github.com/pytorch/pytorch/issues/14446
This was a supported behavior in old torch.distributed. We want to support it in the new release.
Tests should cover all combinations where rank, world size, or both are provided via environment variables, via arguments, or a mix of the two.
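One covered combination, sketched (env vars for the address, explicit args for rank and size):
```python
import os
import torch.distributed as dist

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
# rank and world_size passed as args instead of RANK/WORLD_SIZE env vars
dist.init_process_group("gloo", init_method="env://", rank=0, world_size=1)
```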
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14494
Differential Revision: D13253433
Pulled By: teng-li
fbshipit-source-id: c05974d84f1bdf969f74ec45763e11a841fe4848
Summary:
Fixed: https://github.com/pytorch/pytorch/issues/14445
Also bumped the timeout up to 30 seconds, since on 8-GPU machines the DDP test sometimes takes more than 15 seconds.
Tested on 8 GPU machines:
```
tengli@learnfair062:~/pytorch/test$ python test_c10d.py --verbose
test_dist_broadcast_coalesced_gloo (__main__.DistributedDataParallelTest) ... ok
test_dist_broadcast_coalesced_nccl (__main__.DistributedDataParallelTest) ... skipped 'Test skipped due to known issues'
test_fp16 (__main__.DistributedDataParallelTest) ... ok
test_gloo_backend (__main__.DistributedDataParallelTest) ... ok
test_nccl_backend (__main__.DistributedDataParallelTest) ... ok
test_queue_reduction (__main__.DistributedDataParallelTest) ... ok
test_sync_params_no_buffers (__main__.DistributedDataParallelTest) ... ok
test_sync_params_with_buffers (__main__.DistributedDataParallelTest) ... ok
test_sync_reduction (__main__.DistributedDataParallelTest) ... ok
test_set_get (__main__.FileStoreTest) ... ok
test_set_get (__main__.PrefixFileStoreTest) ... ok
test_set_get (__main__.PrefixTCPStoreTest) ... ok
test_allgather_basics (__main__.ProcessGroupGlooTest) ... ok
test_allgather_checks (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_basics (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_basics_cuda (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_checks (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_stress (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_stress_cuda (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_basics (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_basics_cuda (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_checks (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_stress (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_stress_cuda (__main__.ProcessGroupGlooTest) ... ok
test_gather_basics (__main__.ProcessGroupGlooTest) ... ok
test_gather_checks (__main__.ProcessGroupGlooTest) ... ok
test_reduce_basics (__main__.ProcessGroupGlooTest) ... ok
test_reduce_checks (__main__.ProcessGroupGlooTest) ... ok
test_scatter_basics (__main__.ProcessGroupGlooTest) ... ok
test_scatter_checks (__main__.ProcessGroupGlooTest) ... ok
test_send_recv_all_to_all (__main__.ProcessGroupGlooTest) ... ok
test_timeout_kwarg (__main__.ProcessGroupGlooTest) ... ok
test_allgather_ops (__main__.ProcessGroupNCCLTest) ... ok
test_allreduce_ops (__main__.ProcessGroupNCCLTest) ... ok
test_barrier (__main__.ProcessGroupNCCLTest) ... ok
test_broadcast_ops (__main__.ProcessGroupNCCLTest) ... ok
test_reduce_ops (__main__.ProcessGroupNCCLTest) ... ok
test_common_errors (__main__.RendezvousEnvTest) ... ok
test_nominal (__main__.RendezvousEnvTest) ... ok
test_common_errors (__main__.RendezvousFileTest) ... ok
test_nominal (__main__.RendezvousFileTest) ... ok
test_common_errors (__main__.RendezvousTCPTest) ... ok
test_nominal (__main__.RendezvousTCPTest) ... ok
test_unknown_handler (__main__.RendezvousTest) ... ok
test_address_already_in_use (__main__.TCPStoreTest) ... ok
test_set_get (__main__.TCPStoreTest) ... ok
----------------------------------------------------------------------
Ran 46 tests in 162.980s
OK (skipped=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14452
Differential Revision: D13230652
Pulled By: teng-li
fbshipit-source-id: 88580fe55b3a4fbc7a499ca3b591958f11623bf8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14386
See #13573, #14142, and #14271 for discussion.
This change updates ProcessGroupGloo to ensure that all prior
operations have completed before executing the barrier.
Reviewed By: manojkris
Differential Revision: D13205022
fbshipit-source-id: 673e7e6ca357dc843874d6dd8da590832e1de7fa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14298
This is a breaking API change for users of the C++ c10d API. The work
object defined wait() to return a boolean. If the work completed
successfully it would return true, if it didn't it would return false.
It was then up to the user to call the exception() function to figure
out what went wrong. This has proven suboptimal as it allows users to
forget about failure handling and errors may be ignored.
The work class is semantically very similar to std::future, where a
call to get() may throw if the underlying std::promise has set an
exception. This commit changes the semantic of the work class to be
similar to this and turns wait() into a void function that throws if
the work completes with an exception.
The exception() function can still be used to retrieve the exception
if isSuccess() returns false, but now returns an std::exception_ptr
instead of a reference to a std::exception.
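Under the new semantics, user code looks like this sketch (`pg` and `handle_failure` are illustrative):
```python
work = pg.allreduce([tensor])
try:
    work.wait()  # returns nothing; raises if the work failed
except RuntimeError as exc:
    handle_failure(exc)  # hypothetical error handling
```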
Reviewed By: manojkris
Differential Revision: D13158475
fbshipit-source-id: 9cd8569b9e7cbddc867a5f34c6fd0b7be85581b8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14294
This is the final collective to be ported to the new style where there
is no longer a need to keep a cached algorithm instance around. There
is a follow-up change incoming to remove the algorithm caching
functionality in ProcessGroupGloo.
Reviewed By: manojkris
Differential Revision: D13111509
fbshipit-source-id: f3ea0d955a62029fc4e7cfc09055e4957e0943ac
Summary:
Most likely a typo.
Tested on 8-GPU machine
```
tengli@learnfair062:~/pytorch/test$ python test_c10d.py ProcessGroupNCCLTest.test_barrier
.
----------------------------------------------------------------------
Ran 1 test in 29.341s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14389
Differential Revision: D13207207
Pulled By: teng-li
fbshipit-source-id: aaffe14237076fe19d94e2fa4d9c093397f07bb9
Summary:
This covers an edge case where we run the same NCCL process group with multiple GPU combinations, rather than only the last GPU combination. We now keep track of which GPUs have previously been used in the NCCL process group, and barrier() itself synchronizes on each GPU's NCCL stream.
A test is included as well. Tested on an 8-GPU machine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14271
Differential Revision: D13164993
Pulled By: teng-li
fbshipit-source-id: 81e04352740ea50b5e943369e74cfcba40bb61c1