Commit Graph

278 Commits

Author SHA1 Message Date
Alban Desmaison
f1e575a0bf Revert D20496044: [pytorch][PR] Change AccumulateGrad to yield .grads that match weights' memory layout
Test Plan: revert-hammer

Differential Revision:
D20496044

Original commit changeset: 248d680f4b1b

fbshipit-source-id: 6462b25e3fb9c8596c1da443389089f09c32df4d
2020-06-16 10:38:40 -07:00
Michael Carilli
2beb9690c3 Change AccumulateGrad to yield .grads that match weights' memory layout (#34904)
Summary:
Currently, whether `AccumulateGrad`  [steals](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L42)) or [clones](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L80)) an incoming gradient, the gradient ends up rowmajor contiguous, regardless of its param's layout.  If the param's layout is channels last, or otherwise not rowmajor contiguous, later kernels that apply gradients to params are forced into an uncoalesced memory access pattern for either the param or the gradient.  This may not sound like a big deal but for any binary op on large tensors it's a >3X increase in gmem traffic => 3X slowdown.

The present PR changes `AccumulateGrad` to prefer, where possible, stashing gradients that match their params' layouts (["Gradient Layout Contract"](https://github.com/pytorch/pytorch/pull/34904/files#diff-ef1a56d24f66b280dcdb401502d6a796R29-R38)).

Allowing `AccumulateGrad` to stash non-rowmajor-contiguous grads means DDP allreduces and DP reduces must allow non-rowmajor-contiguous grads.  This PR extends DDP and DP to allow gradients with non-rowmajor-contiguous strides as long as their layout is nonoverlapping and dense.

For good measure, I include changes that allow all five nccl primitives (allreduce, reduce, broadcast, allgather, reducescatter) to act on non-rowmajor-contiguous tensors (again as long as each input's layout is nonoverlapping and dense, and as long as all tensors participating in a given collective have the same layout).  The primitive comm changes aren't necessary to enable the DDP changes, but I wasn't sure this would end up true until I had written both sets of changes.  I think primitive comm enablement is reasonable to keep in the PR, especially since the code for it is simple.

Channels last params will be a major beneficiary of this PR, but I don't see it as channels-last-specific fix.  The spirit is layout matching in general:
- Grads should be stashed with memory layouts matching their params.
- Src and dst tensors on opposite ends of collectives should have matching dense layouts.
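
For illustration (not part of this commit's diff), a minimal sketch of the layout contract described above, assuming the param is non-overlapping and dense: a channels-last conv weight should end up with a `.grad` whose strides match the weight's.

```python
import torch

# Minimal sketch of the "Gradient Layout Contract" described above (illustrative only).
conv = torch.nn.Conv2d(8, 8, kernel_size=3, padding=1)
conv = conv.to(memory_format=torch.channels_last)  # the 4D weight becomes channels-last

x = torch.randn(4, 8, 16, 16).to(memory_format=torch.channels_last)
conv(x).sum().backward()

# Under the contract, the stashed grad's strides should match the param's strides.
print(conv.weight.stride())
print(conv.weight.grad.stride())  # expected to match the line above
```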

This PR also updates autograd docs to describe potential BC-breaking changes below.

## BC notes
ngimel albanD gchanan

#### BC-breaking
In the common case where the user lets AccumulateGrad decide grad layouts, strides for grads of dense but non-rowmajor-contiguous params will change.  Any user code that was accustomed to `view(-1)`ing these grads will break.
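
As an illustration of that breakage (using a hypothetical stand-in tensor, not an actual `.grad`): `view(-1)` requires row-major contiguity, while `reshape(-1)` or `.contiguous().view(-1)` keeps working.

```python
import torch

# Stand-in for a channels-last grad; illustrates the BC note above.
grad = torch.randn(8, 8, 3, 3).to(memory_format=torch.channels_last)
try:
    grad.view(-1)               # fails: strides are not row-major contiguous
except RuntimeError as err:
    print("view(-1) failed:", err)

flat = grad.reshape(-1)         # copies if necessary; works for any dense strides
print(flat.shape)               # torch.Size([576])
```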

Also, the circumstances under which a grad can be stolen directly from the backward function that created it, as opposed to deep-copied by AccumulateGrad, have changed.  In most cases we expect silent performance improvement, because we expect channels-last-aware backward kernels will create channels last gradients for channels last params.  Now those can be stolen, whereas before this PR they were cloned and made rowmajor contiguous.  IMO this is a mild BC breakage.  Param backward hooks still see grads come in with whatever format the backward kernel gave them.  The only BC breakage potential I see is if user code relies somehow on a grad in a hook having or not having the same deep memory as the eventual `param.grad`.  Any such users hopefully know they're off the edge of the map and understand how to update their expectations.

#### BC escape hatches
At alband's recommendation, this PR's changes to AccumulateGrad do not alter the pre-PR code's decisions about whether grad is accumulated in or out of place.  Accumulations of new grads onto an existing `.grad` attribute were (usually) in-place before this PR and remain in-place after this PR, keeping the existing `.grad`'s layout.  After this PR, if the user wants to force accumulation into a grad with a particular layout, they can preset `param.grad` to a zeroed tensor with the desired strides or call `grad.contiguous(memory_format=desired_format)`.  This likely won't be as performant as letting AccumulateGrad establish grad layouts by cloning or stealing grads with contract-compliant strides, but at least users have a control point.

One limitation (present before this PR and unchanged by this PR):  Presetting `param.grad` does not ensure in-place accumulation all the time.  For example, if `create_graph=True`, or if incoming `new_grad` is dense and existing `variable_grad` is sparse, accumulation occurs out of place, and the out-of-place result may not match the existing grad's strides.
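
A small sketch of the escape hatch described above (illustrative; assumes the usual dense, non-`create_graph` case where accumulation stays in place): preset `.grad` to a zeroed tensor with the strides you want.

```python
import torch

# Sketch of presetting param.grad to force a layout (see caveats above).
conv = torch.nn.Conv2d(8, 8, kernel_size=3, padding=1)
conv.weight.grad = torch.zeros_like(conv.weight, memory_format=torch.channels_last)

x = torch.randn(2, 8, 16, 16)
conv(x).sum().backward()

# Accumulation into the preset grad is (usually) in place, so its strides are kept.
print(conv.weight.grad.stride())
```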

----------------------------
I also noticed some potential DDP improvements that I considered out of scope but want to mention for visibility:
1. make sure Reducer's ops sync with AccumulateGrad streams
2. ~to reduce CPU overhead and incur fewer kernel launches, lazily create flat `contents` tensors by a single `cat` kernel only when a bucket is full, instead of `copy_`ing grads into `contents` individually as soon as they are received.~  PR includes a [minor change](https://github.com/pytorch/pytorch/pull/34904/files#diff-c269190a925a4b0df49eda8a8f6c5bd3R312-R315) to divide grads while copying them into flat buffers, instead of copying them in, then dividing separately.  Without cat+div fusion, div-while-copying is the best we can do.
3. https://github.com/pytorch/pytorch/issues/38942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34904

Differential Revision: D20496044

Pulled By: albanD

fbshipit-source-id: 248d680f4b1bf77b0a986451844ec6e254469217
2020-06-16 08:43:31 -07:00
Yanli Zhao
b98948e6dd implement dynamic bucket order in DDP (#35137)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35137

bucket order is rebuilt dynamically during the first backward pass that performs reduction, when find_unused_parameters=False
ghstack-source-id: 104794018

Test Plan: unit test

Differential Revision: D20128537

fbshipit-source-id: fad73de965cdcb59a51c0a12b248271344584b9f
2020-05-28 12:59:52 -07:00
Shen Li
8d6a8d2b3f Fix DDP bug in single process multiple device use cases (#36503)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36503

Test Plan: Imported from OSS

Differential Revision: D21179274

Pulled By: mrshenli

fbshipit-source-id: 0afce30ae0ddda753d1e240584a0f80df9aec4c2
2020-04-22 15:06:28 -07:00
Shen Li
5afd816793 Add a warning for Single-Process Multi-GPU DDP (#36656)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36656

Test Plan: Imported from OSS

Differential Revision: D21042537

Pulled By: mrshenli

fbshipit-source-id: fa3501dc2bba14550ec4f254612a80f61fe86a4a
2020-04-15 12:43:50 -07:00
Xiang Gao
df8d6eeb19 Update docs about DP and DDP for CUDA (#35063)
Summary:
We should recommend DDP instead of DP. Hope we can also cherry-pick this for 1.5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35063

Differential Revision: D20549621

Pulled By: ngimel

fbshipit-source-id: 86b1b2134664065cc6070ea4212895f993eaf543
2020-03-20 20:06:37 -07:00
danthe3rd
46539eee03 Ensure that DDP wrapped module has parameters that require gradients (#25858)
Summary:
…ent - see https://github.com/pytorch/pytorch/issues/25552

**TEST PLAN**
```
python test/run_test.py -f distributed
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25858

Differential Revision: D17687542

Pulled By: danthe3rd

fbshipit-source-id: 11bfe4142e72bb21382b30379fe10e60418c7ec9
2019-10-01 09:03:52 -07:00
Karl Ostmo
ef6356133e Revert D16428208: [pytorch][PR] only scatter in forward if multi-device per process
Differential Revision:
D16428208

Original commit changeset: eaa3876b2b95

fbshipit-source-id: 9db3bc86bf419dd06fdaaff434f72b92ecb5a427
2019-07-27 22:41:20 -07:00
Adam Stooke
d6d7a5f075 only scatter in forward if multi-device per process (#22384)
Summary:
Scatter is unnecessary if only using one device, and it breaks on some custom data structures like namedtuple, so we would like to avoid it :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22384

Differential Revision: D16428208

Pulled By: soumith

fbshipit-source-id: eaa3876b2b95c1006ccaaacdb62f54c5280e730c
2019-07-26 17:30:34 -07:00
Adam Paszke
f1775796dd Fix minor issues with #21736 (#22074)
Summary:
cc mrshenli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22074

Differential Revision: D15965376

Pulled By: mrshenli

fbshipit-source-id: 50ff96de6390817d8ea52c04322c6bee3d649b32
2019-06-24 15:18:26 -07:00
Pieter Noordhuis
77eda8de8e Support sparse gradients in DistributedDataParallel (#22037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22037

This adds support for sparse gradients to the reducer as well as to
the DistributedDataParallel wrapper. Note that an out-of-band signal
is needed to indicate whether a dense parameter (e.g. an embedding) is
expected to receive a sparse gradient or not. This information is
passed to the bucket assignment computation routine and the reducer as
a vector of booleans. Every parameter for which we expect a sparse
gradient is assigned its own bucket, as we cannot easily group
multiple unrelated sparse tensors.
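
For reference, a hypothetical single-process sketch (gloo, world_size=1, made-up rendezvous address) of a model whose embedding produces sparse gradients under this change:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical single-process setup; real jobs get rank/world_size from a launcher.
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1)

model = torch.nn.Sequential(
    torch.nn.Embedding(1000, 16, sparse=True),  # this parameter receives sparse grads
    torch.nn.Linear(16, 4),
)
ddp = DDP(model)

ddp(torch.randint(0, 1000, (8,))).sum().backward()
print(model[0].weight.grad.is_sparse)  # True: the embedding grad stays sparse
dist.destroy_process_group()
```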

Reviewed By: mrshenli

Differential Revision: D15926383

fbshipit-source-id: 39c0d5dbd95bf0534314fdf4d44b2385d5321aaf
2019-06-24 07:34:12 -07:00
Shen Li
08facca1a1 Support accumulating DDP grads using a context manager (#21736)
Summary:
The first attempt and more discussions are available in https://github.com/pytorch/pytorch/issues/19577

#### Goal

Allow toggling DDP gradient synchronization across iterations. With this feature, users may accumulate grads in module variables, and only kick off the expensive grad synchronization every few iterations.

#### Concerns

Our first attempt in https://github.com/pytorch/pytorch/issues/19577 tries to do it using a variable or a function. But apaszke made a good point that it would be error-prone, and favors a context manager instead.

#### Proposed Solution

Instead of providing a `accumulate_grads` variable/function/context, we provide a `DistributedDataParallel.no_sync()` context manager. And it does exactly what the name suggests, i.e., disable DDP grad synchronization within the context. Note that `accumulate_grads` means `no_sync` + no optimizer step, where the latter is not controlled by DDP.

It is true that users need to call another `model(input).backward()` after exiting the context, and this is indeed more verbose. But I think it is OK, as one major concern in the previous discussion was to prevent users from running into errors without knowing it. This API should reaffirm the expected behavior, and does not interfere with other use cases where accumulating grads is not required.

The application would then look like:

```python
with ddp.no_sync():
  for input in inputs:
    ddp(input).backward()

ddp(one_more_input).backward()
optimizer.step()
```

chenyangyu1988 myleott
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21736

Differential Revision: D15805215

Pulled By: mrshenli

fbshipit-source-id: 73405797d1e39965c52016af5cf45b15525ce21c
2019-06-20 12:23:52 -07:00
Edward Yang
cb4c213f55 Revert D15007365: Support sparse gradients in DistributedDataParallel
Differential Revision:
D15007365

Original commit changeset: f298e83fd3ca

fbshipit-source-id: ef5e556d2df37f0c64652bd3563956afd8d9fd7f
2019-06-20 10:07:22 -07:00
Pieter Noordhuis
365de7bda1 Support sparse gradients in DistributedDataParallel (#19443)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19443

This adds support for sparse gradients to the reducer as well as to
the DistributedDataParallel wrapper. Note that an out-of-band signal
is needed to indicate whether a dense parameter (e.g. an embedding) is
expected to receive a sparse gradient or not. This information is
passed to the bucket assignment computation routine and the reducer as
a vector of booleans. Every parameter for which we expect a sparse
gradient is assigned its own bucket, as we cannot easily group
multiple unrelated sparse tensors.

Reviewed By: mrshenli

Differential Revision: D15007365

fbshipit-source-id: f298e83fd3ca828fae9e80739e1db89d045c99ac
2019-06-20 07:06:28 -07:00
Shen Li
fa4ca4e70e Emphasize all DDP forward() outputs must participate in computing loss (#20586)
Summary:
CC borguz chenyangyu1988
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20586

Reviewed By: ezyang

Differential Revision: D15373674

Pulled By: mrshenli

fbshipit-source-id: b986918b3592616a9bcc88fba1b8fd53016f68d7
2019-05-17 07:35:49 -07:00
Pieter Noordhuis
558c6c4d8a Make DistributedDataParallel usable with CPU models (#20236)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20236

Use the new version of broadcast_coalesced that deals with both CPU
and CUDA models. Add tests that evaluate correctness of
DistributedDataParallel for CPU models.

Closes #17757.
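
A minimal sketch of the newly supported CPU use case (single-process gloo group for illustration; the rendezvous address is made up):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical single-process setup; real jobs get rank/world_size from a launcher.
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29501", rank=0, world_size=1)

model = torch.nn.Linear(10, 10)   # plain CPU module
ddp = DDP(model)                  # note: no device_ids / output_device

ddp(torch.randn(20, 10)).sum().backward()   # grads are reduced over the (trivial) group
dist.destroy_process_group()
```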

Reviewed By: mrshenli

Differential Revision: D15245428

fbshipit-source-id: d2fa09f68593b3cd1b72efeb13f5af23ebd5c80a
2019-05-09 14:11:17 -07:00
Pieter Noordhuis
5525c419fc Only call into reducer if torch.is_grad_enabled() (#19897)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19897

During validation, gradient reduction is not needed, and autograd is
never called. The model output will always be a detached tensor. After
the new reducer was merged, this meant that it would find all model
parameters unused, and kick off reduction for them. When #19799 lands, the
reducer would be called with an output where no parameters are used, and it
would try to kick off reduction of zeroed gradients. Test for `torch.is_grad_enabled()` and
`self.training` before calling into the reducer.
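
A sketch of the validation path this commit addresses (single-process gloo group, made-up address): with grad mode off and the module in eval mode, no reduction should be kicked off.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29502", rank=0, world_size=1)
ddp = DDP(torch.nn.Linear(10, 2))

ddp.eval()
with torch.no_grad():                    # validation: autograd is never invoked
    preds = ddp(torch.randn(4, 10))      # output is a plain (detached) tensor
print(preds.requires_grad)               # False; the reducer is not called
dist.destroy_process_group()
```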

Reviewed By: mrshenli

Differential Revision: D15118726

fbshipit-source-id: b0208f632a61cbe8110fa626fa427937b7f05924
2019-04-28 23:12:16 -07:00
Shen Li
b695e562e5 Make find_unused_parameters in DDP default to False (#19895)
Summary:
As DDP in previous releases did not support unused params, turn off `find_unused_parameters` by default to derisk the new reducer.

CC pietern soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19895

Reviewed By: pietern

Differential Revision: D15118563

Pulled By: mrshenli

fbshipit-source-id: 6215c486e1dae3387b36011d8e64a2721ac85f58
2019-04-28 21:22:26 -07:00
Pieter Noordhuis
6325b6e44e Make finding unused model parameters optional (#19515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19515

This is still done by default, but can now be disabled by specifying
`find_unused_parameters=False`. There are use cases where finding
unused parameters results in erroneous behavior, because a subset of
model parameters is used *outside* the `forward` function. One can
argue that doing this is not a good idea, but we should not break
existing use cases without an escape hatch. This configuration
parameter is that escape hatch.
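
A sketch of that escape hatch (single-process gloo group, made-up address); passing `find_unused_parameters=False` skips the unused-parameter detection:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29503", rank=0, world_size=1)

model = torch.nn.Linear(8, 8)
ddp = DDP(model, find_unused_parameters=False)  # every parameter must contribute to the loss

ddp(torch.randn(2, 8)).sum().backward()
dist.destroy_process_group()
```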

Reviewed By: bddppq

Differential Revision: D15016381

fbshipit-source-id: f2f86b60771b3801ab52776e62b5fd6748ddeed0
2019-04-19 17:23:36 -07:00
Pieter Noordhuis
a5c4348d54 Recursively find tensors in DDP module output (#19360)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19360

We'll return the output object verbatim since it is a freeform object.
We need to find any tensors in this object, though, because we need to
figure out which parameters were used during this forward pass, to
ensure we short circuit reduction for any unused parameters.

Before this commit only lists were handled and the functionality went
untested. This commit adds support for dicts and recursive structures,
and also adds a test case.

Closes #19354.
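
Not the actual DDP code, but a small sketch in the same spirit: recursively walk lists, tuples, and dicts to collect any tensors in a freeform output.

```python
import torch

def find_tensors(obj):
    """Recursively collect tensors from a freeform output object (sketch only)."""
    if isinstance(obj, torch.Tensor):
        return [obj]
    if isinstance(obj, (list, tuple)):
        return [t for item in obj for t in find_tensors(item)]
    if isinstance(obj, dict):
        return [t for value in obj.values() for t in find_tensors(value)]
    return []

output = {"logits": torch.randn(2, 3), "aux": [torch.randn(2), {"extra": torch.randn(1)}]}
print(len(find_tensors(output)))  # 3
```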

Reviewed By: mrshenli

Differential Revision: D14978016

fbshipit-source-id: 4bb6999520871fb6a9e4561608afa64d55f4f3a8
2019-04-18 14:57:09 -07:00
Shen Li
6732358bf9 Allow DDP to wrap multi-GPU modules (#19271)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19271

allow DDP to take multi-gpu models
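
A hypothetical sketch of the newly allowed use case (assumes at least 2 GPUs and a single process; NCCL group with a made-up address): a module that already spans multiple devices, wrapped without `device_ids`.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

class TwoGpuModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.seq1 = torch.nn.Linear(10, 10).to("cuda:0")
        self.seq2 = torch.nn.Linear(10, 5).to("cuda:1")

    def forward(self, x):
        x = self.seq1(x.to("cuda:0"))
        return self.seq2(x.to("cuda:1"))

dist.init_process_group("nccl", init_method="tcp://127.0.0.1:29504", rank=0, world_size=1)
ddp = DDP(TwoGpuModel())        # no device_ids: DDP leaves the module's placement alone
out = ddp(torch.randn(4, 10))
dist.destroy_process_group()
```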

Reviewed By: pietern

Differential Revision: D14822375

fbshipit-source-id: 1eebfaa33371766d3129f0ac6f63a573332b2f1c
2019-04-17 21:21:54 -07:00
Pieter Noordhuis
a0263ec047 Make DistributedDataParallel use new reducer (#18953)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18953

This removes Python side bucketing code from DistributedDataParallel
and replaces it with calls to the new C++ based bucketing and reducing
code. To confirm this is working well, we ran a test with both the
previous implementation and the new implementation, and confirmed they
are numerically equivalent.

Performance is improved by a couple percent or more, including the
single machine multiple GPU runs.

Closes #13273.

Reviewed By: mrshenli

Differential Revision: D14580911

fbshipit-source-id: 44e76f8b0b7e58dd6c91644e3df4660ca2ee4ae2
2019-04-15 12:44:38 -07:00
Shen Li
168c0797c4 Remind users to set map_location properly when using DDP
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/19084
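
The usual pattern being documented, as a hypothetical example (path and rank are made up): map a checkpoint saved from `cuda:0` onto the local rank's own device.

```python
import torch

local_rank = 1  # normally taken from the launcher / environment
state = torch.load(
    "checkpoint.pt",                                # hypothetical path
    map_location={"cuda:0": f"cuda:{local_rank}"},  # remap tensors saved on cuda:0
)
```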

Differential Revision: D14861702

Pulled By: mrshenli

fbshipit-source-id: 10ca4a9b41e707050a6bce228ccca4177c9fa4a6
2019-04-09 16:29:38 -07:00
Shen Li
5eb6a2be41 Avoid calling tensor.data.set_() in DDP
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/18961

Differential Revision: D14811208

Pulled By: mrshenli

fbshipit-source-id: c1c46dfa13e0a6ec83aefd35696ee31a7ea3d810
2019-04-09 14:18:24 -07:00
Edward Yang
173f224570 Turn on F401: Unused import warning. (#18598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**

This was requested by someone at Facebook; this lint is turned
on for Facebook by default.  "Sure, why not."

I had to noqa a number of imports in __init__.  Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it.  Left for future work.

Be careful!  flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments.  flake8-3 will
report an import unused; flake8-2 will not.  For now, I just
noqa'd all these sites.

All the changes were done by hand.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D14687478

fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
2019-03-30 09:01:17 -07:00
Elliot Waite
1e42720a77 Fix some typos in distributed.py.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17959

Differential Revision: D14437347

Pulled By: soumith

fbshipit-source-id: 4c33571f56e9da687666516a310f91924cddd4d9
2019-03-13 09:28:03 -07:00
jiej
39669316a6 (#14267)
Summary:
- Summary:

Added synchronized batch normalization, which allows synchronization of stats across mini-batches between processes within a process group.
The current implementation uses a mixture of extended ATen native functions (cpp cuda extension) + torch.nn.modules (c10d python API).

- User-facing api:

1. torch.nn.utils.convert_sync_batchnorm(modules, process_group=None)

2. torch.nn.SyncBatchNorm(num_features, eps=1e-5, momentum=0.1, affine=True, track_running_stats=True, ***process_group=None***)

- supported use case:
DistributedDataParallel with ***single-gpu multi-process***

a. User creates model containing `torch.nn.SyncBatchNorm` layers through one of the ways listed below:

  1. use layers directly:

     torch.nn.SyncBatchNorm(...)

     similar API as with torch.nn.BatchNormXd(...)
     with added argument `process_group` which is used to limit the scope of
     synchronization within each process group. Default value is None, which
     implies synchronization across all GPUs

  2. use torch.nn.utils.convert_sync_batchnorm(modules, process_group)

     recursively convert all `torch.nn.BatchNormXd` into `torch.nn.SyncBatchNorm`
     preserving values of parameters/buffers.
     the utility function also allows user to specify process_group value to all
     converted layers.

b. user wraps their model with
   `torch.nn.parallel.DistributedDataParallel`; from this point, the user
   should follow the general guidelines in the DDP usage guide
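
A compact sketch of this supported use case (single GPU per process; single-process NCCL group with a made-up address; in released PyTorch the conversion helper is exposed as `torch.nn.SyncBatchNorm.convert_sync_batchnorm`):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl", init_method="tcp://127.0.0.1:29505", rank=0, world_size=1)
local_rank = 0  # normally taken from the launcher

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.BatchNorm2d(8),        # converted below
    torch.nn.ReLU(),
)
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).cuda(local_rank)
ddp = DDP(model, device_ids=[local_rank], output_device=local_rank)
dist.destroy_process_group()
```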

- Error checking

For use cases not supported, we error out:

1. Application launched without ddp:
   > import torch
   > sbn = torch.nn.SyncBatchNorm(10).cuda()
   > inp = torch.randn(5, 10, 3, 3).cuda()
   > sbn(inp) --> Error!
   > AttributeError: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel

2. Application launched using DDP with multi-GPU per-process:
   > ddp_module = nn.parallel.DistributedDataParallel(module, device_ids=device_ids, output_device=args.local_rank)
   > ValueError: SyncBatchNorm is only supported for DDP with single GPU per process
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14267

Differential Revision: D14270035

Pulled By: ezyang

fbshipit-source-id: 4956d8fa565c32e9df5408d53719ff9f945f4d6d
2019-03-06 13:39:11 -08:00
ZhuBaohe
19a6de328f Correct docstring of vision/init functions
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17351

Differential Revision: D14276355

Pulled By: soumith

fbshipit-source-id: 9b572b6a04eeb1e44cd93961edac76ed10f7b24e
2019-03-01 11:40:23 -08:00
Derek Kim
4171ef3728 Enhance the documentation for DistributedDataParallel from torch.nn.parallel.distributed (#16010)
Summary:
- a typo fixed
- made the docs consistent with #5108

And maybe one more change is needed. According to the current docs
> The batch size should be larger than the number of GPUs used **locally**.

But shouldn't the batch size be larger than the number of GPUs used **either locally or remotely**? Sadly, I couldn't experiment with this on my single GPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16010

Differential Revision: D13709516

Pulled By: ezyang

fbshipit-source-id: e44459a602a8a834fd365fe46e4063e9e045d5ce
2019-01-17 01:02:44 -08:00
Teng Li
f56217af3b Doc improvement on DDP (#15440)
Summary:
I noticed that some users don't even know we have this support. Adding it to the docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15440

Differential Revision: D13531045

Pulled By: teng-li

fbshipit-source-id: 9757c400c0010608758c754df04e603b36035a10
2018-12-20 14:51:57 -08:00
Teng Li
2d3cf98b49 Making dist.get_default_group private for PT1 release (#14767)
Summary:
When I wrote the frontend API, it was designed so that users would not use the default_group directly in any functions. It should really be private.

All collectives are supposed to either use group.WORLD, or anything that comes out of new_group. That was the initial design.

We need to make a TODO on removing group.WORLD one day. It exists for backward compatibility reasons and adds lots of complexity.
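
For illustration (single-process gloo group, made-up address), collectives either act on the implicit world group or take a group returned by `new_group`; nothing needs the default group object directly:

```python
import torch
import torch.distributed as dist

dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29506", rank=0, world_size=1)

t = torch.ones(4)
dist.all_reduce(t)                      # acts on the default (world) group implicitly
subgroup = dist.new_group(ranks=[0])    # explicit group for a subset of ranks
dist.all_reduce(t, group=subgroup)
dist.destroy_process_group()
```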
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14767

Reviewed By: pietern

Differential Revision: D13330655

Pulled By: teng-li

fbshipit-source-id: ace107e1c3a9b3910a300b22815a9e8096fafb1c
2018-12-04 19:22:24 -08:00
Teng Li
cac03280f9 Fixed DistributedDataParallel state pickling for multi-gpus (#14690)
Summary:
Fixed: https://github.com/pytorch/pytorch/issues/14678

This PR fixes DDP not working after save() and load() for multiple GPUs, because all the replication and bucketing logic lives in the constructor.

So I refactored some of that logic in the constructor into a helper function, which will also be used for load().

Added test too. Tested on 8 GPU machines.

```
tengli@learnfair062:~/pytorch/test$ python run_test.py -i distributed --verbose
Test executor: ['/private/home/tengli/miniconda3/bin/python']
Selected tests: distributed
Running test_distributed ... [2018-12-02 18:33:55.833580]
/public/apps/openmpi/2.1.1/gcc.5.4.0/bin/mpiexec
Running distributed tests for the mpi backend
test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok

----------------------------------------------------------------------
Ran 68 tests in 6.315s

OK (skipped=15)
ok

----------------------------------------------------------------------
Ran 68 tests in 6.315s

OK (skipped=15)
ok

----------------------------------------------------------------------
Ran 68 tests in 6.315s

OK (skipped=15)
Running distributed tests for the mpi backend with file init_method
test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok

----------------------------------------------------------------------
Ran 68 tests in 6.415s

OK (skipped=15)
ok

----------------------------------------------------------------------
Ran 68 tests in 6.415s

OK (skipped=15)
ok

----------------------------------------------------------------------
Ran 68 tests in 6.415s

OK (skipped=15)
Running distributed tests for the nccl backend
test_Backend_enum_class (__main__.TestDistBackend) ... ok
test_DistributedDataParallel (__main__.TestDistBackend) ... ok
test_DistributedDataParallelCPU (__main__.TestDistBackend) ... skipped 'nccl does not support DistributedDataParallelCPU'
test_all_gather (__main__.TestDistBackend) ... skipped 'Only MPI supports CPU all gather'
test_all_gather_cuda (__main__.TestDistBackend) ... skipped 'CUDA all gather skipped for NCCL'
test_all_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_multigpu (__main__.TestDistBackend) ... ok
test_all_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_multigpu (__main__.TestDistBackend) ... skipped 'CUDA all_reduce multigpu skipped for NCCL'
test_all_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum_cuda (__main__.TestDistBackend) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_cuda (__main__.TestDistBackend) ... ok
test_barrier_full_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_full_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_timeout_full_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_cuda (__main__.TestDistBackend) ... ok
test_broadcast_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_multigpu (__main__.TestDistBackend) ... skipped 'NCCL broadcast multigpu skipped'
test_destroy_full_group (__main__.TestDistBackend) ... ok
test_destroy_group (__main__.TestDistBackend) ... ok
test_gather (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_get_backend (__main__.TestDistBackend) ... ok
test_get_default_group (__main__.TestDistBackend) ... ok
test_get_rank (__main__.TestDistBackend) ... ok
test_get_rank_size_full_group (__main__.TestDistBackend) ... ok
test_get_rank_size_group (__main__.TestDistBackend) ... ok
test_irecv (__main__.TestDistBackend) ... skipped 'Nccl does not support irecv'
test_isend (__main__.TestDistBackend) ... skipped 'Nccl does not support isend'
test_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_multigpu (__main__.TestDistBackend) ... ok
test_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum_cuda (__main__.TestDistBackend) ... ok
test_scatter (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_send_recv (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'
test_send_recv_any_source (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv from any source'
test_send_recv_with_tag (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'

----------------------------------------------------------------------
Ran 68 tests in 69.549s

OK (skipped=52)
Running distributed tests for the nccl backend with file init_method
test_Backend_enum_class (__main__.TestDistBackend) ... ok
test_DistributedDataParallel (__main__.TestDistBackend) ... ok
test_DistributedDataParallelCPU (__main__.TestDistBackend) ... skipped 'nccl does not support DistributedDataParallelCPU'
test_all_gather (__main__.TestDistBackend) ... skipped 'Only MPI supports CPU all gather'
test_all_gather_cuda (__main__.TestDistBackend) ... skipped 'CUDA all gather skipped for NCCL'
test_all_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_multigpu (__main__.TestDistBackend) ... ok
test_all_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_multigpu (__main__.TestDistBackend) ... skipped 'CUDA all_reduce multigpu skipped for NCCL'
test_all_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum_cuda (__main__.TestDistBackend) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_cuda (__main__.TestDistBackend) ... ok
test_barrier_full_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_full_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_timeout_full_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_cuda (__main__.TestDistBackend) ... ok
test_broadcast_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_multigpu (__main__.TestDistBackend) ... skipped 'NCCL broadcast multigpu skipped'
test_destroy_full_group (__main__.TestDistBackend) ... ok
test_destroy_group (__main__.TestDistBackend) ... ok
test_gather (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_get_backend (__main__.TestDistBackend) ... ok
test_get_default_group (__main__.TestDistBackend) ... ok
test_get_rank (__main__.TestDistBackend) ... ok
test_get_rank_size_full_group (__main__.TestDistBackend) ... ok
test_get_rank_size_group (__main__.TestDistBackend) ... ok
test_irecv (__main__.TestDistBackend) ... skipped 'Nccl does not support irecv'
test_isend (__main__.TestDistBackend) ... skipped 'Nccl does not support isend'
test_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_multigpu (__main__.TestDistBackend) ... ok
test_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum_cuda (__main__.TestDistBackend) ... ok
test_scatter (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_send_recv (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'
test_send_recv_any_source (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv from any source'
test_send_recv_with_tag (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'

----------------------------------------------------------------------
Ran 68 tests in 70.381s

OK (skipped=52)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14690

Differential Revision: D13294169

Pulled By: teng-li

fbshipit-source-id: 69ccac34c6c016899bfe8fbc50b48d4bfd1d3876
2018-12-03 12:04:26 -08:00
Teng Li
5268dd468c Fixed DistributedDataParallel cannot kick off all-reduce in a corner case (#14675)
Summary:
Ok, this corner case comes up in translation workloads, and it only happens when all of the following hold:

(1) when the module is registered a parameter that does not requires grad

and

(2) this registered parameter has a unique type (say, double or half), and it is the only parameter of that type, so it alone is put into a separate bucket.

and

(3) it is the last parameter that got registered in the module, such that its bucket reduction is the first to be kicked off.

When this corner case happens, the backward hook for this parameter never fires, since it does not require grad. Because all other buckets wait for its bucket's reduction to be kicked off first, no bucket gets reduced at all: everything is blocked on that first bucket (the unique-type parameter).

This PR fixes three things:
(1) Make sure that we only bucket parameters that require grad (a sketch of this fix appears below)
(2) Check all-reductions in the next iteration: if we detect that the previous iteration's all-reduction was not fully kicked off, we issue an error in the next iteration
(3) Also removed some unused variables

With this bug fixed, the only case where this error can still happen is when the user changes parameters after wrapping the module with DDP, as in:
https://github.com/pytorch/pytorch/issues/12603

A test covering this is included as well.
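
A minimal sketch of fix (1), using a hypothetical `bucket_parameters` helper (the real bucketing lives in c10d's C++ code): parameters with `requires_grad=False` never fire a backward hook, so they must not occupy a reduction bucket.

```
import torch.nn as nn

def bucket_parameters(module):
    # Group parameters by dtype, skipping any that don't require grad,
    # since no backward hook will ever fire for them.
    buckets = {}
    for param in module.parameters():
        if not param.requires_grad:
            continue
        buckets.setdefault(param.dtype, []).append(param)
    return list(buckets.values())

model = nn.Linear(4, 4)
model.bias.requires_grad_(False)  # would previously get its own, never-reduced bucket
print([len(b) for b in bucket_parameters(model)])  # [1]
```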

Without the first fix, I verified that the repro in fbcode hits this error message:

```
result = self.forward(*input, **kwargs)
  File "/data/users/tengli/fbsource/fbcode/buck-out/dev/gen/language_technology/neural_mt/os/pytorch_translate/train#link-tree/torch/nn/parallel/distributed.py", line 312, in forward
    raise RuntimeError("Not all gradients are all-reduced from "
RuntimeError: Not all gradients are all-reduced from the backward of the previous iteration. This is unexpected and fatal error. Please check and ensure that the model's parameters are not changed after you wrap up the model with DistributedDataParallel.

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14675

Differential Revision: D13291083

Pulled By: teng-li

fbshipit-source-id: 2539b699fae843f104b4b8d22721ae82502ba684
2018-12-02 17:13:07 -08:00
Teng Li
85d3fccee7 Removed redundant allreduce options in DDP (#14208)
Summary:
These options were somehow not cleaned up after the C++ migration; they are unused and can be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14208

Differential Revision: D13132492

Pulled By: teng-li

fbshipit-source-id: 0f05b6368174664ebb2560c037347c8eb45f7c38
2018-11-21 16:56:46 -08:00
Teng Li
4983397c02 Better documentation and warning (#13946)
Summary:
This is to address https://github.com/pytorch/pytorch/issues/12603
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13946

Differential Revision: D13055254

Pulled By: teng-li

fbshipit-source-id: 20a206ebd3456eac9dc50584664c4bca3ee955d1
2018-11-14 10:41:46 -08:00
Teng Li
dceec1de30 Distributed Data Parallel documentation for PT1 release (#13657)
Summary:
This should fix https://github.com/pytorch/pytorch/issues/12604

Ran `make html` and looked through the generated pages to make sure that everything looks good
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13657

Reviewed By: calebho

Differential Revision: D12954250

Pulled By: teng-li

fbshipit-source-id: 40e1925ec0cdce5e6a1d8ba29537937da8ef9194
2018-11-07 12:11:57 -08:00
Teng Li
1413dd4bfc Added the finer bucketing option for DDP (#13607)
Summary:
We only need this for the backward pass; for the forward broadcast, the non-fine-grained bucketing should be better since it's sequential anyway.

Testing is fully covered by the c10d test; the bucket size was reduced so that bucketing actually happens in the c10d test.
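
A minimal sketch of what finer bucketing means here, with an illustrative `take_tensors_fine_grained` helper (the name and size limit are assumptions, not the actual c10d API): tensors are split into buckets capped at a byte budget, so each backward reduction can launch as soon as its bucket fills.

```
import torch

def take_tensors_fine_grained(tensors, size_limit_bytes):
    buckets, current, current_bytes = [], [], 0
    for t in tensors:
        nbytes = t.numel() * t.element_size()
        if current and current_bytes + nbytes > size_limit_bytes:
            buckets.append(current)  # close the bucket; its reduction can start now
            current, current_bytes = [], 0
        current.append(t)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets

grads = [torch.randn(1024) for _ in range(10)]  # 4 KB each
print([len(b) for b in take_tensors_fine_grained(grads, 10 * 1024)])  # [2, 2, 2, 2, 2]
```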
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13607

Differential Revision: D12944515

Pulled By: teng-li

fbshipit-source-id: d982e8dca2874c91d39b30b73a85bfbeb768c508
2018-11-07 12:00:55 -08:00
Teng Li
74819087de Mixed precision DDP hang fix and fine-grained option for DDP perf (#13496)
Summary:
When switching to mixed-precision fp16 training, DDP randomly hangs. Initially, I thought this smelled like a similar NCCL bug I filed a while ago. It turns out it's not. Once again, different rank processes see different bucket sizes. How could this even happen?

It turns out that take_tensors generates the list of bucketed tensors in a non-deterministic order, because the map is keyed on a pointer. An interesting bug to dig into and fix (see the sketch below).

fp16 DDP training should now be fully working.

Also added another fine-grained take_tensors helper that aims to improve DDP performance, with a TODO to replace DDP's take_tensors with it.
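
A minimal sketch of the bug and the spirit of the fix (illustrative Python, not the real C++ `take_tensors`): keying buckets on a raw type pointer gives each process its own iteration order, while keying on an insertion-ordered structure makes every rank produce identical buckets.

```
import torch
from collections import OrderedDict

def take_tensors_deterministic(tensors):
    buckets = OrderedDict()  # insertion order is the same on every rank
    for t in tensors:
        buckets.setdefault(t.dtype, []).append(t)
    return list(buckets.values())

mixed = [torch.randn(2, dtype=torch.float16),
         torch.randn(2),
         torch.randn(2, dtype=torch.float16)]
print([b[0].dtype for b in take_tensors_deterministic(mixed)])
# [torch.float16, torch.float32] on every rank, so bucket sizes always match.
```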

Fixed: https://github.com/pytorch/pytorch/issues/12150
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13496

Differential Revision: D12920985

Pulled By: teng-li

fbshipit-source-id: 26f3edae7be45a80fa7b2410a2e5a1baab212d9c
2018-11-05 16:22:15 -08:00
Teng Li
e475d3ede3 DDP multi-GPU segfault fix (#13291)
Summary:
Fix https://github.com/pytorch/pytorch/issues/13200

Tested on an 8-GPU machine, since CI doesn't have this many GPUs and the multi-GPU test won't be triggered there.

```
tengli@learnfair096:~/pytorch/test$ python run_test.py -i distributed --verbose
Selected tests: distributed
Running test_distributed ... [2018-10-29 20:32:46.355858]
/public/apps/openmpi/2.1.1/gcc.5.4.0/bin/mpiexec
Running distributed tests for the gloo backend
test_DistBackend (__main__.TestDistBackend) ... ok
test_DistributedDataParallel (__main__.TestDistBackend) ... ok
test_DistributedDataParallelCPU (__main__.TestDistBackend) ... ok
```

I would also like to bump up the broadcast bucket size for performance reasons
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13291

Differential Revision: D12842840

Pulled By: teng-li

fbshipit-source-id: e8c50f15ebf2ab3e2cd1b51d365e41a6106b98fe
2018-10-31 00:43:42 -07:00
sli
9d9e5f8d1e Solve bug of DistributedDataParallel (#13248)
Summary:
Fixed bug [https://github.com/facebookresearch/maskrcnn-benchmark/issues/52](https://github.com/facebookresearch/maskrcnn-benchmark/issues/52)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13248

Reviewed By: pietern

Differential Revision: D12830451

Pulled By: teng-li

fbshipit-source-id: ab33faf3f6f4545f8fe07da7ecbeb2f0a2ea23f0
2018-10-29 15:19:55 -07:00
Teng Li
c250f6f3d5 DDP perf improvement: move sync_reduction to C++, dedicated CUDA streams for memcpy (#12954)
Summary:
- Moved sync_reduction to C++
- Use a dedicated CUDA stream for memcpy
- Also use a dedicated CUDA stream for memcpy in queue_reduction

Added test as well.

CI should cover both the DDP test and the unit test.
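
A minimal sketch of the dedicated-copy-stream idea, assuming a single CUDA device (the real change lives in the C++ code): copies run on a side stream so they can overlap with compute, and the default stream waits on them before consuming the result.

```
import torch

if torch.cuda.is_available():
    copy_stream = torch.cuda.Stream()
    src = torch.randn(1 << 20, device="cuda")
    dst = torch.empty_like(src)
    with torch.cuda.stream(copy_stream):
        dst.copy_(src, non_blocking=True)  # memcpy on the dedicated stream
    # Make the default stream wait for the copy before anything uses `dst`.
    torch.cuda.current_stream().wait_stream(copy_stream)
    print(dst.sum().item())
```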
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12954

Differential Revision: D10520069

Pulled By: teng-li

fbshipit-source-id: 64348e4e43c15f9695a4c28b036c232587ecfb65
2018-10-24 21:37:13 -07:00
Teng Li
8d3e7e2fcb Move DDP queue_reduction to C++ (#12852)
Summary:
Fully working version, continuing from goldsborough's initial version.

Waiting on the stream guard to be merged before adding more stream perf logic to the C++ version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12852

Differential Revision: D10468696

Pulled By: teng-li

fbshipit-source-id: 8e46d408796973817abfd9dbd6566e0ca5b7a13f
2018-10-22 16:07:46 -07:00
Teng Li
d120b9af5a Make c10d pickling/unpickling work (#12694)
Summary:
This fixes the issue for https://github.com/pytorch/pytorch/issues/12168
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12694

Differential Revision: D10468717

Pulled By: teng-li

fbshipit-source-id: 3df31d75eea19d6085af665f5350d3cb667a5048
2018-10-19 16:42:36 -07:00
Wei Yang
54107ae8cf convert output_device at data_parallel from torch.device to index (#10189)
Summary:
- fixes #9984
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10189

Differential Revision: D9545390

Pulled By: weiyangfb

fbshipit-source-id: 3a6a705437553ba319e9fd4b7f676ff73857a27e
2018-09-11 20:27:07 -07:00
Teng Li
0988bbad2d C10d release to torch.distributed for PT1 (#11405)
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`
The old DDP will go to `torch.nn.parallel.deprecated`

Now `torch.nn.parallel.DDP` will use c10d DDP
Now `torch.distributed` will use C10d frontend API
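
A minimal usage sketch of the frontend after this release (the rendezvous settings are illustrative; a single-process gloo group keeps it self-contained):

```
import os
import torch.distributed as dist
import torch.nn as nn

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# torch.distributed now routes to the c10d frontend...
dist.init_process_group(backend="gloo", init_method="env://",
                        world_size=1, rank=0)
# ...and torch.nn.parallel.DistributedDataParallel now wraps c10d DDP.
model = nn.parallel.DistributedDataParallel(nn.Linear(4, 4))
print(dist.get_backend(), dist.get_rank())
dist.destroy_process_group()
```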
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405

Reviewed By: pietern

Differential Revision: D9733733

Pulled By: teng-li

fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08
2018-09-10 23:27:22 -07:00
Jerry Ma
afd7477eaa Add `buffers(), named_buffers()` methods. (#10554)
Summary:
This commit adds the ``buffers()`` and ``named_buffers()`` methods as
analogues of ``parameters()`` and ``named_parameters()``.
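
A short usage sketch: the new methods iterate over registered buffers (for example, BatchNorm's running statistics) exactly the way ``parameters()`` and ``named_parameters()`` iterate over parameters.

```
import torch.nn as nn

bn = nn.BatchNorm1d(8)
print([name for name, _ in bn.named_buffers()])
# e.g. ['running_mean', 'running_var', 'num_batches_tracked']
print(len(list(bn.parameters())), len(list(bn.buffers())))  # 2 3
```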
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10554

Reviewed By: SsnL

Differential Revision: D9367762

Pulled By: jma127

fbshipit-source-id: f2042e46a7e833dce40cb41681dbd80d7885c74e
2018-08-16 16:26:48 -07:00
Tongzhou Wang
a77b391de7 [SpectralNorm] don't register original weight as buffer (#8170)
* don't register original weight as buffer; fixes for buffers that require grad

* add test
2018-06-12 14:42:05 -04:00
Ailing
52e4d3c4a2 add error when backend is not supported by DDP (#8325) 2018-06-11 02:18:30 -04:00
Isaac Ge
537cb10525 improve DataParallel/DistributedDataParallel docs (#7407) 2018-05-09 10:30:42 +02:00
Jon Malmaud
5463a4a319 Fix typo. (#6609) 2018-04-15 11:43:10 +02:00
Ailing
1499a604cf fix assertion error when input size smaller than number of module_copies (#6252) 2018-04-04 12:05:34 +02:00
Ailing
f5aa8d55ad fix detach in place error in DDP (#5829)
* fix detach in DDP

* fix typo

* make lint happy
2018-03-16 09:22:04 -04:00
Teng Li
579de82bcf DDP: 10% of NCCL backend perf improvements with mixed-prec support (#5064) 2018-02-21 23:59:52 +01:00
Teng Li
4b8f4fc259 Added mixed-precision support in distributed training (#4891) 2018-02-21 14:29:39 +01:00
Richard Zou
cac3026b35 Fix typo in DataParallel docs (#5268) 2018-02-15 23:02:26 +01:00
Teng Li
d7b6a61a54 DDP: coalescing many little broadcasts to improve performance (#4978) 2018-02-12 16:41:33 +01:00
Tongzhou Wang
805639906a Broadcast output requires_grad only if the corresponding input requires_grad (#5061) 2018-02-05 23:38:35 -05:00
Teng Li
ae28411af8 Slightly improve DDP single GPU multi-process dist training performance 2018-01-27 12:15:44 +01:00
Teng Li
154038e318 Removing NCCL clear_group_cache workaround with one more check in new_group (#4766) 2018-01-23 11:03:52 +01:00
Sam Gross
d605058212
Replace Variable.volatile with torch.no_grad() (#3970)
This removes volatile from Variable. The functionality is mostly
replaced by a global (thread-local) flag, which is controlled by
torch.set_grad_enabled() and the context manager torch.no_grad().

In C++, the flag is exposed through GradMode::is_enabled() and GradMode::set_enabled()
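
A short usage sketch of the replacement API:

```
import torch

x = torch.ones(3, requires_grad=True)

with torch.no_grad():            # context-manager form
    y = x * 2
print(y.requires_grad)           # False

torch.set_grad_enabled(False)    # thread-local flag form
z = x * 2
torch.set_grad_enabled(True)
print(z.requires_grad, (x * 2).requires_grad)  # False True
```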

Fixes #3627
2017-12-18 15:46:13 -05:00
ngimel
7f41149e14 handle requires_grad when creating buckets for distributed (#4044) 2017-12-18 02:13:53 -05:00
Teng Li
926ed2b280 Implemented NCCL Distributed Backend for PyTorch with new dist APIs (#3435)
* Implemented NCCL Distributed Backend for PyTorch with new dist APIs

* Let FindNCCL determine the NCCL version

* Let the NCCL2 backend use ATen instead of the deprecated THPP

* Let distributed parallel model use a single reduction thread for NCCL backend

* Caching the sockets, bug fix, refactoring, and addressed Adam's comments

* Make BcastNcclID take a single param and bug fix for all_gather

* Removed the barrier function, added a warning for users, and stopped exposing the experimental function to users

* Use the simplest working single-bucket solution for the distributed data parallel model, with rebase

* Cleanup, fixes and further addressed Adam's comments

* Used PySequence_Fast in distributed csrc

* Removed the limitation that each group is only bound to a given device sequence

* Used THPObjectPtr for PySequence_Fast
2017-11-29 15:57:02 -05:00
SsnL
01be4d6b20 sparse broadcast_coalesce and reduce_add_coalesced 2017-10-28 18:52:35 -04:00
SsnL
de1f4e69dd raw text (#3327) 2017-10-28 01:24:02 +05:30
Luca Antiga
6743d59513 Add missing import. Add return to __getstate__ 2017-10-08 11:07:10 -04:00
Sergey Kolesnikov
5f8bab47c8 bugfix for issue 2428 (#3000) 2017-10-06 09:20:12 -04:00
jekbradbury
7aa6bc516f add "Basics" section to distributed docs (#2433) 2017-08-24 17:07:20 -04:00
Robert Kirby
5d09fcd028 Make DistributedDataParallel threads Daemon threads to allow clean process exit (#2524) 2017-08-24 06:32:29 -04:00
Christian Sarofeen
4c69697d2a Distributed bug fixes. (#2434) 2017-08-23 14:46:52 -04:00
LuoweiZhou
5c43fcda8d Support params that don’t require grad in DistributedDataParallel (#2464) 2017-08-19 11:22:20 -04:00
Robert Kirby
9199c954f1 Fix typo in DistributedDataParallel (#2320) 2017-08-08 21:53:42 -04:00
Adam Paszke
dc17fb68e4 Fix minor bug in parallel_apply (#2193) 2017-07-25 03:45:00 +05:30
Adam Paszke
8ab3d214d5 Fixes for DistributedDataParallel (#2168) 2017-07-21 16:00:46 -04:00
Adam Paszke
4af40e3471 Let parallel_apply accept arbitrary inputs 2017-07-20 01:45:57 -04:00
Sam Gross
10e23943b3 Fix missing _forward_pre_hooks in serialized modules (#2057) 2017-07-11 18:23:35 -04:00
Leonid Vlasenkov
46a868dab7 [Ready] Limit docs line length (#1900)
* some docs are ready

* docs

* docs

* fix some more

* fix some more
2017-07-10 10:24:54 -04:00
Adam Paszke
d9d50f80c7 Rename arguments to distributed collectives 2017-06-12 22:02:11 -04:00
Adam Paszke
12813b88f6 Add DistributedDataParallel 2017-06-12 22:00:22 -04:00