Commit Graph

350 Commits

Author SHA1 Message Date
Derek Kim
4171ef3728 Enhance the documentation for DistributedDataParallel from torch.nn.parallel.distributed (#16010)
Summary:
- fixed a typo
- made the docs consistent with #5108

And maybe one more change is needed. According to the current docs
> The batch size should be larger than the number of GPUs used **locally**.

But shouldn't the batch size be larger than the number of GPUs used **either locally or remotely**? Sadly, I couldn't experiment with this on my single GPU.
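For context, "locally" here means the GPUs a single process drives through `device_ids`. A minimal sketch of that setup, assuming a process group is already initialized; the model and shapes are made up for illustration:

```
import torch
from torch.nn.parallel import DistributedDataParallel

# Hedged sketch: the batch handed to forward() is scattered across device_ids,
# so it must contain at least len(device_ids) samples -- the "locally" in the docs.
model = torch.nn.Linear(10, 10).cuda(0)
ddp = DistributedDataParallel(model, device_ids=[0, 1], output_device=0)
inputs = torch.randn(4, 10).cuda(0)   # 4 samples >= 2 local GPUs
outputs = ddp(inputs)
```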
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16010

Differential Revision: D13709516

Pulled By: ezyang

fbshipit-source-id: e44459a602a8a834fd365fe46e4063e9e045d5ce
2019-01-17 01:02:44 -08:00
Teng Li
f56217af3b Doc improvement on DDP (#15440)
Summary:
I noticed that some users don't even know we have this support, so I'm adding it to the docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15440

Differential Revision: D13531045

Pulled By: teng-li

fbshipit-source-id: 9757c400c0010608758c754df04e603b36035a10
2018-12-20 14:51:57 -08:00
Teng Li
2d3cf98b49 Making dist.get_default_group private for PT1 release (#14767)
Summary:
When I wrote the frontend API, it was designed so that users never use the default group directly in any functions. It should really be private.

All collectives are supposed to use either group.WORLD or anything that comes out of new_group. That was the initial design.

We should add a TODO to remove group.WORLD one day. It exists for backward-compatibility reasons and adds a lot of complexity.
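A minimal sketch of the intended usage, assuming `init_process_group()` has already been called on every rank; the tensor and ranks are illustrative:

```
import torch
import torch.distributed as dist

# Hedged sketch: collectives either omit the group argument (which means
# group.WORLD) or pass a handle returned by new_group(); the default group
# object itself stays private.
tensor = torch.ones(1)
dist.all_reduce(tensor)                    # implicit group.WORLD
subgroup = dist.new_group(ranks=[0, 1])    # must be called by all ranks
if dist.get_rank() in (0, 1):
    dist.all_reduce(tensor, group=subgroup)
```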
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14767

Reviewed By: pietern

Differential Revision: D13330655

Pulled By: teng-li

fbshipit-source-id: ace107e1c3a9b3910a300b22815a9e8096fafb1c
2018-12-04 19:22:24 -08:00
Teng Li
cac03280f9 Fixed DistributedDataParallel state pickling for multi-gpus (#14690)
Summary:
Fixed: https://github.com/pytorch/pytorch/issues/14678

This PR fixes DDP not working after save() and load() for multiple GPUs, because all of the replication and bucketing logic lives in the constructor.

So I refactored some of that logic from the constructor into a helper function, which is now also used by load().

Added a test too. Tested on 8-GPU machines.
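A hedged sketch of the save()/load() round trip this targets, assuming a process group is already initialized and this process drives two GPUs (the filename and shapes are made up):

```
import torch
from torch.nn.parallel import DistributedDataParallel

# Hedged sketch of the failure mode being fixed: pickling the DDP wrapper
# itself, then loading it and continuing to run forward/backward.
model = torch.nn.Linear(8, 8).cuda(0)
ddp = DistributedDataParallel(model, device_ids=[0, 1])

torch.save(ddp, "ddp.pt")            # pickles the DDP wrapper itself
restored = torch.load("ddp.pt")      # __setstate__ must rebuild replicas/buckets
out = restored(torch.randn(4, 8).cuda(0))
```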

```
tengli@learnfair062:~/pytorch/test$ python run_test.py -i distributed --verbose
Test executor: ['/private/home/tengli/miniconda3/bin/python']
Selected tests: distributed
Running test_distributed ... [2018-12-02 18:33:55.833580]
/public/apps/openmpi/2.1.1/gcc.5.4.0/bin/mpiexec
Running distributed tests for the mpi backend
test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok

----------------------------------------------------------------------
Ran 68 tests in 6.315s

OK (skipped=15)
ok

----------------------------------------------------------------------
Ran 68 tests in 6.315s

OK (skipped=15)
ok

----------------------------------------------------------------------
Ran 68 tests in 6.315s

OK (skipped=15)
Running distributed tests for the mpi backend with file init_method
test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok

----------------------------------------------------------------------
Ran 68 tests in 6.415s

OK (skipped=15)
ok

----------------------------------------------------------------------
Ran 68 tests in 6.415s

OK (skipped=15)
ok

----------------------------------------------------------------------
Ran 68 tests in 6.415s

OK (skipped=15)
Running distributed tests for the nccl backend
test_Backend_enum_class (__main__.TestDistBackend) ... ok
test_DistributedDataParallel (__main__.TestDistBackend) ... ok
test_DistributedDataParallelCPU (__main__.TestDistBackend) ... skipped 'nccl does not support DistributedDataParallelCPU'
test_all_gather (__main__.TestDistBackend) ... skipped 'Only MPI supports CPU all gather'
test_all_gather_cuda (__main__.TestDistBackend) ... skipped 'CUDA all gather skipped for NCCL'
test_all_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_multigpu (__main__.TestDistBackend) ... ok
test_all_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_multigpu (__main__.TestDistBackend) ... skipped 'CUDA all_reduce multigpu skipped for NCCL'
test_all_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum_cuda (__main__.TestDistBackend) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_cuda (__main__.TestDistBackend) ... ok
test_barrier_full_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_full_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_timeout_full_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_cuda (__main__.TestDistBackend) ... ok
test_broadcast_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_multigpu (__main__.TestDistBackend) ... skipped 'NCCL broadcast multigpu skipped'
test_destroy_full_group (__main__.TestDistBackend) ... ok
test_destroy_group (__main__.TestDistBackend) ... ok
test_gather (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_get_backend (__main__.TestDistBackend) ... ok
test_get_default_group (__main__.TestDistBackend) ... ok
test_get_rank (__main__.TestDistBackend) ... ok
test_get_rank_size_full_group (__main__.TestDistBackend) ... ok
test_get_rank_size_group (__main__.TestDistBackend) ... ok
test_irecv (__main__.TestDistBackend) ... skipped 'Nccl does not support irecv'
test_isend (__main__.TestDistBackend) ... skipped 'Nccl does not support isend'
test_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_multigpu (__main__.TestDistBackend) ... ok
test_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum_cuda (__main__.TestDistBackend) ... ok
test_scatter (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_send_recv (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'
test_send_recv_any_source (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv from any source'
test_send_recv_with_tag (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'

----------------------------------------------------------------------
Ran 68 tests in 69.549s

OK (skipped=52)
Running distributed tests for the nccl backend with file init_method
test_Backend_enum_class (__main__.TestDistBackend) ... ok
test_DistributedDataParallel (__main__.TestDistBackend) ... ok
test_DistributedDataParallelCPU (__main__.TestDistBackend) ... skipped 'nccl does not support DistributedDataParallelCPU'
test_all_gather (__main__.TestDistBackend) ... skipped 'Only MPI supports CPU all gather'
test_all_gather_cuda (__main__.TestDistBackend) ... skipped 'CUDA all gather skipped for NCCL'
test_all_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_multigpu (__main__.TestDistBackend) ... ok
test_all_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_multigpu (__main__.TestDistBackend) ... skipped 'CUDA all_reduce multigpu skipped for NCCL'
test_all_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum_cuda (__main__.TestDistBackend) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_cuda (__main__.TestDistBackend) ... ok
test_barrier_full_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_full_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_timeout_full_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_cuda (__main__.TestDistBackend) ... ok
test_broadcast_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_multigpu (__main__.TestDistBackend) ... skipped 'NCCL broadcast multigpu skipped'
test_destroy_full_group (__main__.TestDistBackend) ... ok
test_destroy_group (__main__.TestDistBackend) ... ok
test_gather (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_get_backend (__main__.TestDistBackend) ... ok
test_get_default_group (__main__.TestDistBackend) ... ok
test_get_rank (__main__.TestDistBackend) ... ok
test_get_rank_size_full_group (__main__.TestDistBackend) ... ok
test_get_rank_size_group (__main__.TestDistBackend) ... ok
test_irecv (__main__.TestDistBackend) ... skipped 'Nccl does not support irecv'
test_isend (__main__.TestDistBackend) ... skipped 'Nccl does not support isend'
test_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_multigpu (__main__.TestDistBackend) ... ok
test_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum_cuda (__main__.TestDistBackend) ... ok
test_scatter (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_send_recv (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'
test_send_recv_any_source (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv from any source'
test_send_recv_with_tag (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'

----------------------------------------------------------------------
Ran 68 tests in 70.381s

OK (skipped=52)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14690

Differential Revision: D13294169

Pulled By: teng-li

fbshipit-source-id: 69ccac34c6c016899bfe8fbc50b48d4bfd1d3876
2018-12-03 12:04:26 -08:00
Teng Li
5268dd468c Fixed a corner case where DistributedDataParallel could not kick off all-reduce (#14675)
Summary:
OK, this corner case happens for translation workloads, and it only happens in the following situation:

(1) the module has a registered parameter that does not require grad

and

(2) this registered parameter has a unique type (say, double or half), and it is the only parameter of that type, so it alone gets put into a separate bucket

and

(3) it is the last parameter registered in the module, so its bucket's reduction is the first one to be kicked off.

When this corner case happens, the parameter's backward hook never fires because it does not require grad. All other buckets wait for its bucket to be kicked off first, so no bucket gets reduced at all: everything is blocked by that first bucket (the unique-type parameter).
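A hedged sketch of such a module, assuming a NCCL/Gloo process group is already initialized; the `Net` class and its sizes are made up for illustration:

```
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc = nn.Linear(16, 16)  # float32, requires grad
        # Registered last, unique dtype, and frozen: before the fix it landed
        # alone in the bucket whose reduction must fire first, but its backward
        # hook never runs, so every other bucket waits forever.
        self.scale = nn.Parameter(torch.zeros(16, dtype=torch.double),
                                  requires_grad=False)

    def forward(self, x):
        return self.fc(x)

ddp = DistributedDataParallel(Net().cuda(0), device_ids=[0])
```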

This PR does three things:
(1) Make sure that we only bucket parameters that require grad.
(2) Check all-reduction in the next iteration: if we detect that the previous iteration's all-reduction was not fully kicked off, we issue an error in the next iteration.
(3) Remove some unused variables.

With this bug fixed, the only case in which this error can happen is when the user changes parameters after wrapping the module with DDP, as in:
https://github.com/pytorch/pytorch/issues/12603

Test coverage added as well.

Without the first fix, I verified that the repro in fbcode hit this error message:

```
result = self.forward(*input, **kwargs)
  File "/data/users/tengli/fbsource/fbcode/buck-out/dev/gen/language_technology/neural_mt/os/pytorch_translate/train#link-tree/torch/nn/parallel/distributed.py", line 312, in forward
    raise RuntimeError("Not all gradients are all-reduced from "
RuntimeError: Not all gradients are all-reduced from the backward of the previous iteration. This is unexpected and fatal error. Please check and ensure that the model's parameters are not changed after you wrap up the model with DistributedDataParallel.

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14675

Differential Revision: D13291083

Pulled By: teng-li

fbshipit-source-id: 2539b699fae843f104b4b8d22721ae82502ba684
2018-12-02 17:13:07 -08:00
Teng Li
85d3fccee7 Removed redundant allreduce options in DDP (#14208)
Summary:
This somehow was not cleaned up after the C++ migration. It is unused and can be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14208

Differential Revision: D13132492

Pulled By: teng-li

fbshipit-source-id: 0f05b6368174664ebb2560c037347c8eb45f7c38
2018-11-21 16:56:46 -08:00
Teng Li
4983397c02 Better documentation and warning (#13946)
Summary:
This is to address https://github.com/pytorch/pytorch/issues/12603
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13946

Differential Revision: D13055254

Pulled By: teng-li

fbshipit-source-id: 20a206ebd3456eac9dc50584664c4bca3ee955d1
2018-11-14 10:41:46 -08:00
Teng Li
dceec1de30 Distributed Data Parallel documentation for PT1 release (#13657)
Summary:
This should fix https://github.com/pytorch/pytorch/issues/12604

Ran `make html` and looked through the HTML pages to make sure that everything looks good.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13657

Reviewed By: calebho

Differential Revision: D12954250

Pulled By: teng-li

fbshipit-source-id: 40e1925ec0cdce5e6a1d8ba29537937da8ef9194
2018-11-07 12:11:57 -08:00
Teng Li
1413dd4bfc Added the finer bucketing option for DDP (#13607)
Summary:
We only need this for the backward pass; for the forward broadcast, the non-fine-grained bucketing should be better since it is sequential anyway.

The tests should all be covered by the c10d test; I reduced the bucket size so that bucketing actually happens in the c10d test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13607

Differential Revision: D12944515

Pulled By: teng-li

fbshipit-source-id: d982e8dca2874c91d39b30b73a85bfbeb768c508
2018-11-07 12:00:55 -08:00
Teng Li
74819087de Mixed precision DDP hang fix and fine-grained option for DDP perf (#13496)
Summary:
When going to mixed-precision fp16 training, DDP randomly hangs. Initially, I thought this smelled like a similar NCCL bug I filed a while ago. It turns out it's not. Again, I am seeing that different rank processes end up with different bucket sizes. How could this even happen?

It turns out that take_tensors generates the list of bucketed tensors in a non-deterministic order, because the key of the map is a pointer. An interesting bug to dig into and fix.

fp16 DDP training should be fully working now.

Also added another fine-grained take_tensors helper that aims to improve DDP performance; there is a TODO to replace DDP's take_tensors with it.

Fixed: https://github.com/pytorch/pytorch/issues/12150
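A hedged sketch of the kind of fp16 setup that used to hang, assuming a NCCL process group is already initialized; the model and shapes are illustrative:

```
import torch
from torch.nn.parallel import DistributedDataParallel

# Hedged sketch: half() gives the model fp16 parameters, and buckets are grouped
# by tensor type/device, so take_tensors must return them in the same order on
# every rank for the all-reduces to line up.
model = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.ReLU()).half().cuda(0)
ddp = DistributedDataParallel(model, device_ids=[0])
out = ddp(torch.randn(8, 32).half().cuda(0))
out.float().sum().backward()
```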
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13496

Differential Revision: D12920985

Pulled By: teng-li

fbshipit-source-id: 26f3edae7be45a80fa7b2410a2e5a1baab212d9c
2018-11-05 16:22:15 -08:00
Teng Li
e475d3ede3 DDP multi-GPU segfault fix (#13291)
Summary:
Fix https://github.com/pytorch/pytorch/issues/13200

Tested on an 8-GPU machine, since CI doesn't have this many GPUs and the multi-GPU tests won't be triggered there.

```
tengli@learnfair096:~/pytorch/test$ python run_test.py -i distributed --verbose
Selected tests: distributed
Running test_distributed ... [2018-10-29 20:32:46.355858]
/public/apps/openmpi/2.1.1/gcc.5.4.0/bin/mpiexec
Running distributed tests for the gloo backend
test_DistBackend (__main__.TestDistBackend) ... ok
test_DistributedDataParallel (__main__.TestDistBackend) ... ok
test_DistributedDataParallelCPU (__main__.TestDistBackend) ... ok
```

Also, I would like to bump up the broadcast bucket size for performance reasons.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13291

Differential Revision: D12842840

Pulled By: teng-li

fbshipit-source-id: e8c50f15ebf2ab3e2cd1b51d365e41a6106b98fe
2018-10-31 00:43:42 -07:00
sli
9d9e5f8d1e Fix a bug in DistributedDataParallel (#13248)
Summary:
Fixed bug [https://github.com/facebookresearch/maskrcnn-benchmark/issues/52](https://github.com/facebookresearch/maskrcnn-benchmark/issues/52)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13248

Reviewed By: pietern

Differential Revision: D12830451

Pulled By: teng-li

fbshipit-source-id: ab33faf3f6f4545f8fe07da7ecbeb2f0a2ea23f0
2018-10-29 15:19:55 -07:00
Teng Li
c250f6f3d5 DDP perf improvement: move sync_reduction to C++, dedicated CUDA streams for memcpy (#12954)
Summary:
- Moved sync_reduction to C++
- Use a dedicated CUDA stream for memcpy
- Also use a dedicated CUDA stream for memcpy in queue_reduction

Added test as well.

CI should cover both the DDP change and the unit tests.
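Regarding the dedicated memcpy stream above: a hedged sketch of the general idiom (not the internal DDP C++ code) is to run the copy on a side stream and make the default stream wait on it before consuming the result.

```
import torch

# Hedged sketch of the side-stream copy pattern.
copy_stream = torch.cuda.Stream()
src = torch.randn(1024, 1024, pin_memory=True)   # pinned host memory
with torch.cuda.stream(copy_stream):
    dst = src.cuda(non_blocking=True)            # copy runs on copy_stream
# Make the default stream wait for the copy before consuming dst.
torch.cuda.current_stream().wait_stream(copy_stream)
result = dst.sum()
```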
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12954

Differential Revision: D10520069

Pulled By: teng-li

fbshipit-source-id: 64348e4e43c15f9695a4c28b036c232587ecfb65
2018-10-24 21:37:13 -07:00
Teng Li
8d3e7e2fcb Move DDP queue_reduction to C++ (#12852)
Summary:
Fully working version, continuing from goldsborough's initial version.

Waiting on the stream guard to be merged before adding more stream-related performance logic to the C++ version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12852

Differential Revision: D10468696

Pulled By: teng-li

fbshipit-source-id: 8e46d408796973817abfd9dbd6566e0ca5b7a13f
2018-10-22 16:07:46 -07:00
Teng Li
d120b9af5a Make c10d pickling/unpickling work (#12694)
Summary:
This fixes the issue for https://github.com/pytorch/pytorch/issues/12168
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12694

Differential Revision: D10468717

Pulled By: teng-li

fbshipit-source-id: 3df31d75eea19d6085af665f5350d3cb667a5048
2018-10-19 16:42:36 -07:00
Wei Yang
54107ae8cf convert output_device at data_parallel from torch.device to index (#10189)
Summary:
- fixes #9984
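A hedged sketch of the call this affects, assuming two visible GPUs; after the change an `output_device` given as a `torch.device` is converted to its index internally (the model and shapes are illustrative):

```
import torch
from torch.nn.parallel import data_parallel

# Hedged sketch: output_device may be passed as a torch.device.
model = torch.nn.Linear(8, 8).cuda(0)
inputs = torch.randn(4, 8).cuda(0)
out = data_parallel(model, inputs, device_ids=[0, 1],
                    output_device=torch.device("cuda:0"))
```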
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10189

Differential Revision: D9545390

Pulled By: weiyangfb

fbshipit-source-id: 3a6a705437553ba319e9fd4b7f676ff73857a27e
2018-09-11 20:27:07 -07:00
Teng Li
0988bbad2d C10d release to torch.distributed for PT1 (#11405)
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`
The old DDP will go to `torch.nn.parallel.deprecated`

Now `torch.nn.parallel.DDP` will use the c10d DDP
Now `torch.distributed` will use the c10d frontend API
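A minimal sketch of the PT1 frontend this points users at, assuming `RANK`/`WORLD_SIZE`/`MASTER_ADDR`/`MASTER_PORT` are provided by a launcher; the old API lives on under `torch.distributed.deprecated`.

```
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Hedged sketch of the c10d-backed frontend.
dist.init_process_group(backend="nccl", init_method="env://")
model = torch.nn.Linear(4, 4).cuda(0)
ddp = DistributedDataParallel(model, device_ids=[0])  # now backed by c10d
```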
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405

Reviewed By: pietern

Differential Revision: D9733733

Pulled By: teng-li

fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08
2018-09-10 23:27:22 -07:00
Jerry Ma
afd7477eaa Add `buffers(), named_buffers()` methods. (#10554)
Summary:
This commit adds the ``buffers()`` and ``named_buffers()`` methods as
analogues of ``parameters()`` and ``named_parameters()``.
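A minimal sketch of the new accessors, using BatchNorm's running statistics as the example buffers:

```
import torch
import torch.nn as nn

# buffers() mirrors parameters(), named_buffers() mirrors named_parameters().
bn = nn.BatchNorm1d(4)
for name, buf in bn.named_buffers():
    print(name, tuple(buf.shape))     # e.g. running_mean, running_var
assert all(isinstance(b, torch.Tensor) for b in bn.buffers())
```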
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10554

Reviewed By: SsnL

Differential Revision: D9367762

Pulled By: jma127

fbshipit-source-id: f2042e46a7e833dce40cb41681dbd80d7885c74e
2018-08-16 16:26:48 -07:00
Tongzhou Wang
a77b391de7 [SpectralNorm] don't register original weight as buffer (#8170)
* don't register original weight as buffer; fixes for buffers that require grad

* add test
2018-06-12 14:42:05 -04:00
Ailing
52e4d3c4a2 add error when backend is not supported by DDP (#8325) 2018-06-11 02:18:30 -04:00
Isaac Ge
537cb10525 improve DataParallel/DistributedDataParallel docs (#7407) 2018-05-09 10:30:42 +02:00
Jon Malmaud
5463a4a319 Fix typo. (#6609) 2018-04-15 11:43:10 +02:00
Ailing
1499a604cf fix assertion error when input size is smaller than the number of module_copies (#6252) 2018-04-04 12:05:34 +02:00
Ailing
f5aa8d55ad fix detach in place error in DDP (#5829)
* fix detach in DDP

* fix typo

* make lint happy
2018-03-16 09:22:04 -04:00
Teng Li
579de82bcf DDP: 10% of NCCL backend perf improvements with mixed-prec support (#5064) 2018-02-21 23:59:52 +01:00
Teng Li
4b8f4fc259 Added mixed-precision support in distributed training (#4891) 2018-02-21 14:29:39 +01:00
Richard Zou
cac3026b35 Fix typo in DataParallel docs (#5268) 2018-02-15 23:02:26 +01:00
Teng Li
d7b6a61a54 DDP: coalescing many little broadcasts to improve performance (#4978) 2018-02-12 16:41:33 +01:00
Tongzhou Wang
805639906a Broadcast output requires_grad only if corresponding input requires_grad (#5061) 2018-02-05 23:38:35 -05:00
Teng Li
ae28411af8 Slightly improve DDP single GPU multi-process dist training performance 2018-01-27 12:15:44 +01:00
Teng Li
154038e318 Removing NCCL clear_group_cache workaround with one more check in new_group (#4766) 2018-01-23 11:03:52 +01:00
Sam Gross
d605058212
Replace Variable.volatile with torch.no_grad() (#3970)
This removes volatile from Variable. The functionality is mostly
replaced by a global (thread-local) flag, which is controlled by
torch.set_grad_enabled() and the context manager torch.no_grad().

In C++, the flag is exposed through GradMode::is_enabled() and GradMode::set_enabled()

Fixes #3627
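A minimal sketch of the replacement pattern:

```
import torch

# Grad tracking is now governed by a thread-local flag, toggled either with the
# no_grad() context manager or with set_grad_enabled().
x = torch.randn(3, requires_grad=True)

with torch.no_grad():
    y = x * 2                      # recorded with requires_grad=False

torch.set_grad_enabled(False)      # same flag, toggled imperatively
z = x + 1
torch.set_grad_enabled(True)
assert not y.requires_grad and not z.requires_grad
```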
2017-12-18 15:46:13 -05:00
ngimel
7f41149e14 handle requires_grad when creating buckets for distributed (#4044) 2017-12-18 02:13:53 -05:00
Teng Li
926ed2b280 Implemented NCCL Distributed Backend for PyTorch with new dist APIs (#3435)
* Implemented NCCL Distributed Backend for PyTorch with new dist APIs

* Let FindNCCL determine the NCCL version

* Let the NCCL2 backend use ATen instead of the deprecated THPP

* Let distributed parallel model use a single reduction thread for NCCL backend

* Caching the sockets, bug fix, refactoring, and addressed Adam's comments

* Make BcastNcclID take a single param and fix a bug in all_gather

* Removed the barrier function, added a warning for users, and stopped exposing experimental functions to users

* Use the simplest single-bucket working solution for the distributed data parallel model, with rebase

* Cleanup, fixes and further addressed Adam's comments

* Used PySequence_Fast in distributed csrc

* Removed the limitation that each group is only bound to a given device sequence

* Used THPObjectPtr for PySequence_Fast
2017-11-29 15:57:02 -05:00
SsnL
01be4d6b20 sparse broadcast_coalesce and reduce_add_coalesced 2017-10-28 18:52:35 -04:00
SsnL
de1f4e69dd raw text (#3327) 2017-10-28 01:24:02 +05:30
Luca Antiga
6743d59513 Add missing import. Add return to __getstate__ 2017-10-08 11:07:10 -04:00
Sergey Kolesnikov
5f8bab47c8 bugfix for issue 2428 (#3000) 2017-10-06 09:20:12 -04:00
jekbradbury
7aa6bc516f add "Basics" section to distributed docs (#2433) 2017-08-24 17:07:20 -04:00
Robert Kirby
5d09fcd028 Make DistributedDataParallel threads Daemon threads to allow clean process exit (#2524) 2017-08-24 06:32:29 -04:00
Christian Sarofeen
4c69697d2a Distributed bug fixes. (#2434) 2017-08-23 14:46:52 -04:00
LuoweiZhou
5c43fcda8d Support params that don’t require grad in DistributedDataParallel (#2464) 2017-08-19 11:22:20 -04:00
Robert Kirby
9199c954f1 Fix typo in DistributedDataParallel (#2320) 2017-08-08 21:53:42 -04:00
Adam Paszke
dc17fb68e4 Fix minor bug in parallel_apply (#2193) 2017-07-25 03:45:00 +05:30
Adam Paszke
8ab3d214d5 Fixes for DistributedDataParallel (#2168) 2017-07-21 16:00:46 -04:00
Adam Paszke
4af40e3471 Let parallel_apply accept arbitrary inputs 2017-07-20 01:45:57 -04:00
Sam Gross
10e23943b3 Fix missing _forward_pre_hooks in serialized modules (#2057) 2017-07-11 18:23:35 -04:00
Leonid Vlasenkov
46a868dab7 [Ready] Limit docs line length (#1900)
* some docs are ready

* docs

* docs

* fix some more

* fix some more
2017-07-10 10:24:54 -04:00
Adam Paszke
d9d50f80c7 Rename arguments to distributed collectives 2017-06-12 22:02:11 -04:00
Adam Paszke
12813b88f6 Add DistributedDataParallel 2017-06-12 22:00:22 -04:00