Summary:
As DDP in previous releases does not support unused params, turning off `find_unused_parameters` by default to derisk new reducer.
CC pietern soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19895
Reviewed By: pietern
Differential Revision: D15118563
Pulled By: mrshenli
fbshipit-source-id: 6215c486e1dae3387b36011d8e64a2721ac85f58
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19515
This is still done by default, but can now be disabled by specifying
`find_unused_parameters=False`. There are use cases where finding
unused parameters results in erroneous behavior, because a subset of
model parameters is used *outside* the `forward` function. One can
argue that doing this is not a good idea, but we should not break
existing use cases without an escape hatch. This configuration
parameter is that escape hatch.
Reviewed By: bddppq
Differential Revision: D15016381
fbshipit-source-id: f2f86b60771b3801ab52776e62b5fd6748ddeed0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19360
We'll return the output object verbatim since it is a freeform object.
We need to find any tensors in this object, though, because we need to
figure out which parameters were used during this forward pass, to
ensure we short circuit reduction for any unused parameters.
Before this commit only lists were handled and the functionality went
untested. This commit adds support for dicts and recursive structures,
and also adds a test case.
Closes#19354.
Reviewed By: mrshenli
Differential Revision: D14978016
fbshipit-source-id: 4bb6999520871fb6a9e4561608afa64d55f4f3a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18953
This removes Python side bucketing code from DistributedDataParallel
and replaces it with calls to the new C++ based bucketing and reducing
code. To confirm this is working well, we ran a test with both the
previous implementation and the new implementation, and confirmed they
are numerically equivalent.
Performance is improved by a couple percent or more, including the
single machine multiple GPU runs.
Closes#13273.
Reviewed By: mrshenli
Differential Revision: D14580911
fbshipit-source-id: 44e76f8b0b7e58dd6c91644e3df4660ca2ee4ae2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**
This was requested by someone at Facebook; this lint is turned
on for Facebook by default. "Sure, why not."
I had to noqa a number of imports in __init__. Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it. Left for future work.
Be careful! flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments. flake8-3 will
report an import unused; flake8-2 will not. For now, I just
noqa'd all these sites.
All the changes were done by hand.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14687478
fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
Summary:
- Summary:
Added synchronized batch normalization, allows synchronization of stats across mini-batches between processes within a process group.
Current implementation uses a mixture of extended ATen native functions (cpp cuda extension) + torch.nn.modules (c10d python API)
- User-facing api:
1. torch.nn.utils.convert_sync_batchnorm(modules, process_group=None)
2. torch.nn.SyncBatchNorm(num_features, eps=1e-5, momentum=0.1, affine=True, track_running_stats=True, ***process_group=None***)
- supported use case:
DistributedDataParallel with ***single-gpu multi-process***
a. User creates model containing `torch.nn.SyncBatchNorm` layers through one of the ways listed below:
1. use layers directly:
torch.nn.SyncBatchNorm(...)
similar API as with torch.nn.BatchNormXd(...)
with added argument `process_group` which is used to limit the scope of
synchronization within each process group. Default value is None, which
implies synchronization across all GPUs
2. use torch.nn.utils.convert_sync_batchnorm(modules, process_group)
recursively convert all `torch.nn.BatchNormXd` into `torch.nn.SyncBatchNorm`
preserving values of parameters/buffers.
the utility function also allows user to specify process_group value to all
converted layers.
b. user wraps their model with
`torch.distributed.parallel.DataParallelDistributed`, from this point, user
should follow the general guidelines for DDP use guide
- Error checking
For use cases not supported, we error out:
1. Application launched without ddp:
> import torch
> sbn = torch.nn.SyncBatchNorm(10).cuda()
> inp = torch.randn(5, 10, 3, 3).cuda()
> sbn(inp) --> Error!
> AttributeError: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel
2. Application launched using DDP with multi-GPU per-process:
> ddp_module = nn.parallel.DistributedDataParallel(module, device_ids=device_ids, output_device=args.local_rank)
> ValueError: SyncBatchNorm is only supported for DDP with single GPU per process
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14267
Differential Revision: D14270035
Pulled By: ezyang
fbshipit-source-id: 4956d8fa565c32e9df5408d53719ff9f945f4d6d
Summary:
- a typo fixed
- made the docs consistent with #5108
And maybe one more change is needed. According to the current docs
> The batch size should be larger than the number of GPUs used **locally**.
But shouldn't the batch size be larger than the number of GPUs used **either locally or remotely**? Sadly, I couldn't experiment this with my single GPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16010
Differential Revision: D13709516
Pulled By: ezyang
fbshipit-source-id: e44459a602a8a834fd365fe46e4063e9e045d5ce
Summary:
I noticed that some users don't even know we have this support. Adding into the doc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15440
Differential Revision: D13531045
Pulled By: teng-li
fbshipit-source-id: 9757c400c0010608758c754df04e603b36035a10
Summary:
When I wrote the frontend API, it is designed on not letting users use the default_group directly on any functions. It should really be private.
All collectives are supposed to either use group.WORLD, or anything that comes out of new_group. That was the initial design.
We need to make a TODO on removing group.WORLD one day. It exists for backward compatibility reasons and adds lots of complexity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14767
Reviewed By: pietern
Differential Revision: D13330655
Pulled By: teng-li
fbshipit-source-id: ace107e1c3a9b3910a300b22815a9e8096fafb1c
Summary:
Fixed: https://github.com/pytorch/pytorch/issues/14678
This PR fixed DDP doesn't work after save() and load() for multiple GPUs, because, it needs all these replicating logics and bucketing in the constructor.
So I refactored some of the logics in the constructor to a helper function. And this will be used for load().
Added test too. Tested on 8 GPU machines.
```
tengli@learnfair062:~/pytorch/test$ python run_test.py -i distributed --verbose
Test executor: ['/private/home/tengli/miniconda3/bin/python']
Selected tests: distributed
Running test_distributed ... [2018-12-02 18:33:55.833580]
/public/apps/openmpi/2.1.1/gcc.5.4.0/bin/mpiexec
Running distributed tests for the mpi backend
test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
----------------------------------------------------------------------
Ran 68 tests in 6.315s
OK (skipped=15)
ok
----------------------------------------------------------------------
Ran 68 tests in 6.315s
OK (skipped=15)
ok
----------------------------------------------------------------------
Ran 68 tests in 6.315s
OK (skipped=15)
Running distributed tests for the mpi backend with file init_method
test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
----------------------------------------------------------------------
Ran 68 tests in 6.415s
OK (skipped=15)
ok
----------------------------------------------------------------------
Ran 68 tests in 6.415s
OK (skipped=15)
ok
----------------------------------------------------------------------
Ran 68 tests in 6.415s
OK (skipped=15)
Running distributed tests for the nccl backend
test_Backend_enum_class (__main__.TestDistBackend) ... ok
test_DistributedDataParallel (__main__.TestDistBackend) ... ok
test_DistributedDataParallelCPU (__main__.TestDistBackend) ... skipped 'nccl does not support DistributedDataParallelCPU'
test_all_gather (__main__.TestDistBackend) ... skipped 'Only MPI supports CPU all gather'
test_all_gather_cuda (__main__.TestDistBackend) ... skipped 'CUDA all gather skipped for NCCL'
test_all_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_multigpu (__main__.TestDistBackend) ... ok
test_all_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_multigpu (__main__.TestDistBackend) ... skipped 'CUDA all_reduce multigpu skipped for NCCL'
test_all_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum_cuda (__main__.TestDistBackend) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_cuda (__main__.TestDistBackend) ... ok
test_barrier_full_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_full_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_timeout_full_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_cuda (__main__.TestDistBackend) ... ok
test_broadcast_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_multigpu (__main__.TestDistBackend) ... skipped 'NCCL broadcast multigpu skipped'
test_destroy_full_group (__main__.TestDistBackend) ... ok
test_destroy_group (__main__.TestDistBackend) ... ok
test_gather (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_get_backend (__main__.TestDistBackend) ... ok
test_get_default_group (__main__.TestDistBackend) ... ok
test_get_rank (__main__.TestDistBackend) ... ok
test_get_rank_size_full_group (__main__.TestDistBackend) ... ok
test_get_rank_size_group (__main__.TestDistBackend) ... ok
test_irecv (__main__.TestDistBackend) ... skipped 'Nccl does not support irecv'
test_isend (__main__.TestDistBackend) ... skipped 'Nccl does not support isend'
test_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_multigpu (__main__.TestDistBackend) ... ok
test_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum_cuda (__main__.TestDistBackend) ... ok
test_scatter (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_send_recv (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'
test_send_recv_any_source (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv from any source'
test_send_recv_with_tag (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'
----------------------------------------------------------------------
Ran 68 tests in 69.549s
OK (skipped=52)
Running distributed tests for the nccl backend with file init_method
test_Backend_enum_class (__main__.TestDistBackend) ... ok
test_DistributedDataParallel (__main__.TestDistBackend) ... ok
test_DistributedDataParallelCPU (__main__.TestDistBackend) ... skipped 'nccl does not support DistributedDataParallelCPU'
test_all_gather (__main__.TestDistBackend) ... skipped 'Only MPI supports CPU all gather'
test_all_gather_cuda (__main__.TestDistBackend) ... skipped 'CUDA all gather skipped for NCCL'
test_all_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_multigpu (__main__.TestDistBackend) ... ok
test_all_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_multigpu (__main__.TestDistBackend) ... skipped 'CUDA all_reduce multigpu skipped for NCCL'
test_all_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum_cuda (__main__.TestDistBackend) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_cuda (__main__.TestDistBackend) ... ok
test_barrier_full_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_full_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_timeout_full_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_cuda (__main__.TestDistBackend) ... ok
test_broadcast_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_multigpu (__main__.TestDistBackend) ... skipped 'NCCL broadcast multigpu skipped'
test_destroy_full_group (__main__.TestDistBackend) ... ok
test_destroy_group (__main__.TestDistBackend) ... ok
test_gather (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_get_backend (__main__.TestDistBackend) ... ok
test_get_default_group (__main__.TestDistBackend) ... ok
test_get_rank (__main__.TestDistBackend) ... ok
test_get_rank_size_full_group (__main__.TestDistBackend) ... ok
test_get_rank_size_group (__main__.TestDistBackend) ... ok
test_irecv (__main__.TestDistBackend) ... skipped 'Nccl does not support irecv'
test_isend (__main__.TestDistBackend) ... skipped 'Nccl does not support isend'
test_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_multigpu (__main__.TestDistBackend) ... ok
test_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum_cuda (__main__.TestDistBackend) ... ok
test_scatter (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_send_recv (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'
test_send_recv_any_source (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv from any source'
test_send_recv_with_tag (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'
----------------------------------------------------------------------
Ran 68 tests in 70.381s
OK (skipped=52)
``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14690
Differential Revision: D13294169
Pulled By: teng-li
fbshipit-source-id: 69ccac34c6c016899bfe8fbc50b48d4bfd1d3876
Summary:
Ok, this corner happens for translation guys, and it only happens in the following corner case:
(1) when the module is registered a parameter that does not requires grad
and
(2) this registered parameter has a unique type (say, double, or half) and it's the only unique type such that itself alone will be put into a separate bucket.
and
(3) it is the last parameter that got registered in the module, such that its bucket reduction is the first to be kicked off.
Once this corner case happens, since it does not require grad, the backward hook won't be kicked off. Now that all other buckets are waiting for its bucket to be kicked off, in this case, no bucket will be kicked off since it's blocked by the first bucket (the unique type parameter).
This PR fixes two things:
(1) Make sure that we will only bucket parameters that requires_grad
(2) Make all-reduction checks in the next iteration. As long as we detect the previous iteration's all-reduction has not been fully kicked off, we will issue an error in the next iteration.
(3) Also removed some unused variables
With this bug fixed, the only case when this error can happen is when the user changed parameters later after wrapping up the module with DDP, like the case in:
https://github.com/pytorch/pytorch/issues/12603
Test covered as well
Without the first fix, I varied that the repro in fbcode hit this error message:
```
result = self.forward(*input, **kwargs)
File "/data/users/tengli/fbsource/fbcode/buck-out/dev/gen/language_technology/neural_mt/os/pytorch_translate/train#link-tree/torch/nn/parallel/distributed.py", line 312, in forward
raise RuntimeError("Not all gradients are all-reduced from "
RuntimeError: Not all gradients are all-reduced from the backward of the previous iteration. This is unexpected and fatal error. Please check and ensure that the model's parameters are not changed after you wrap up the model with DistributedDataParallel.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14675
Differential Revision: D13291083
Pulled By: teng-li
fbshipit-source-id: 2539b699fae843f104b4b8d22721ae82502ba684
Summary:
This somehow is not cleaned up after the C++ migration. Unused and can be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14208
Differential Revision: D13132492
Pulled By: teng-li
fbshipit-source-id: 0f05b6368174664ebb2560c037347c8eb45f7c38
Summary:
We only need this for backward, for FWD cast, the non-fine-grained bucketing should be better since it's sequential anyway.
Test should be covered all by c10d test, reduced bucket size to make bucketing happen in c10d test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13607
Differential Revision: D12944515
Pulled By: teng-li
fbshipit-source-id: d982e8dca2874c91d39b30b73a85bfbeb768c508
Summary:
When go to mixed precision fp16 training, DDP randomly hangs. Initially, I thought this smells like a similar NCCL bug I filed a while ago. It turns out it's not. Again, I am seeing different rank process has different size. How could this even happen?
It turns out that take_tensors will generate a list of bucketed tensors in an un deterministic order, because, the key to the map is a pointer. An interesting bug digging and fix.
Now fp16 DDP training should be fully working now.
Also, added another take_tensor fine grained helper that aims to improve the performance of DDP, making it a TODO to replace the DDP take_tensors with that.
Fixed: https://github.com/pytorch/pytorch/issues/12150
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13496
Differential Revision: D12920985
Pulled By: teng-li
fbshipit-source-id: 26f3edae7be45a80fa7b2410a2e5a1baab212d9c
Summary:
Fix https://github.com/pytorch/pytorch/issues/13200
Tested on 8 GPU machines since CI doesn't have this many GPUs, so multi-GPU test won't be triggered
```
tengli@learnfair096:~/pytorch/test$ python run_test.py -i distributed --verbose
Selected tests: distributed
Running test_distributed ... [2018-10-29 20:32:46.355858]
/public/apps/openmpi/2.1.1/gcc.5.4.0/bin/mpiexec
Running distributed tests for the gloo backend
test_DistBackend (__main__.TestDistBackend) ... ok
test_DistributedDataParallel (__main__.TestDistBackend) ... ok
test_DistributedDataParallelCPU (__main__.TestDistBackend) ... ok
```
Also I would like to bump up the bucket size of broadcast to higher for performance reasons
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13291
Differential Revision: D12842840
Pulled By: teng-li
fbshipit-source-id: e8c50f15ebf2ab3e2cd1b51d365e41a6106b98fe
Summary:
- Moved sync_reduction to C++
- Use a dedicated CUDA stream for memcpy
- Also use a dedicated CUDA stream for memcpy in queue_reduction
Added test as well.
CI should cover both DDP and unittest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12954
Differential Revision: D10520069
Pulled By: teng-li
fbshipit-source-id: 64348e4e43c15f9695a4c28b036c232587ecfb65
Summary:
fully working version by using continuing on goldsborough 's initial version.
waiting on the stream guard to be merged before adding more stream perf logics into the c++ version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12852
Differential Revision: D10468696
Pulled By: teng-li
fbshipit-source-id: 8e46d408796973817abfd9dbd6566e0ca5b7a13f
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`
The old DDP will go to `torch.nn.parallel.deprecated`
Now `torch.nn.parallel.DDP` will use c10d DDP
Now `torch.distributed` will use C10d frontend API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405
Reviewed By: pietern
Differential Revision: D9733733
Pulled By: teng-li
fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08
Summary:
This commit adds the ``buffers()`` and ``named_buffers()`` methods as
analogues of ``parameters()`` and ``named_parameters()``.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10554
Reviewed By: SsnL
Differential Revision: D9367762
Pulled By: jma127
fbshipit-source-id: f2042e46a7e833dce40cb41681dbd80d7885c74e
This removes volatile from Variable. The functionality is mostly
replaced by a global (thread-local) flag, which is controlled by
torch.set_grad_enabled() and the context manager torch.no_grad().
In C++, the flag is exposed through GradMode::is_enabled() and GradMode::set_enabled()
Fixes#3627
* Implemented NCCL Distributed Backend for PyTorch with new dist APIs
* Let FindNCCL to determine the NCCL version
* Let NCCL2 Backend use ATEN instead deprecated THPP
* Let distributed parallel model use a single reduction thread for NCCL backend
* Caching the sockets, bug fix, refactoring, and addressed Adam's comments
* Make BcastNcclID take a single param and bug fix for all_gather
* Removed barrier function, added warning for users, and not exposing experimental func to users
* Use the simplest single bucket working solution for distriubted data parallel model with rebase
* Cleanup, fixes and further addressed Adam's comments
* Used PySequence_Fast in distributed csrc
* Removed the limitation that each group is only bound to a given device sequence
* Used THPObjectPtr for PySequence_Fast