Commit Graph

350 Commits

Author SHA1 Message Date
Derek Kim
4171ef3728 Enhance the documentation for DistributedDataParallel from torch.nn.parallel.distributed (#16010)
Summary:
- fixed a typo
- made the docs consistent with #5108

And maybe one more change is needed. According to the current docs
> The batch size should be larger than the number of GPUs used **locally**.

But shouldn't the batch size be larger than the number of GPUs used **either locally or remotely**? Sadly, I couldn't experiment with this on my single GPU.
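For context, "locally" here means the GPUs a single process drives through `device_ids`. A minimal sketch of that setup, assuming a process group is already initialized; the model and shapes are made up for illustration:

```
import torch
from torch.nn.parallel import DistributedDataParallel

# Hedged sketch: the batch handed to forward() is scattered across device_ids,
# so it must contain at least len(device_ids) samples -- the "locally" in the docs.
model = torch.nn.Linear(10, 10).cuda(0)
ddp = DistributedDataParallel(model, device_ids=[0, 1], output_device=0)
inputs = torch.randn(4, 10).cuda(0)   # 4 samples >= 2 local GPUs
outputs = ddp(inputs)
```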
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16010

Differential Revision: D13709516

Pulled By: ezyang

fbshipit-source-id: e44459a602a8a834fd365fe46e4063e9e045d5ce
2019-01-17 01:02:44 -08:00
Teng Li
f56217af3b Doc improvement on DDP (#15440)
Summary:
I noticed that some users don't even know we have this support, so I'm adding it to the docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15440

Differential Revision: D13531045

Pulled By: teng-li

fbshipit-source-id: 9757c400c0010608758c754df04e603b36035a10
2018-12-20 14:51:57 -08:00
Teng Li
2d3cf98b49 Making dist.get_default_group private for PT1 release (#14767)
Summary:
When I wrote the frontend API, it was designed so that users never use the default group directly in any functions. It should really be private.

All collectives are supposed to use either group.WORLD or anything that comes out of new_group. That was the initial design.

We should add a TODO to remove group.WORLD one day. It exists for backward-compatibility reasons and adds a lot of complexity.
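A minimal sketch of the intended usage, assuming `init_process_group()` has already been called on every rank; the tensor and ranks are illustrative:

```
import torch
import torch.distributed as dist

# Hedged sketch: collectives either omit the group argument (which means
# group.WORLD) or pass a handle returned by new_group(); the default group
# object itself stays private.
tensor = torch.ones(1)
dist.all_reduce(tensor)                    # implicit group.WORLD
subgroup = dist.new_group(ranks=[0, 1])    # must be called by all ranks
if dist.get_rank() in (0, 1):
    dist.all_reduce(tensor, group=subgroup)
```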
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14767

Reviewed By: pietern

Differential Revision: D13330655

Pulled By: teng-li

fbshipit-source-id: ace107e1c3a9b3910a300b22815a9e8096fafb1c
2018-12-04 19:22:24 -08:00
Teng Li
cac03280f9 Fixed DistributedDataParallel state pickling for multi-gpus (#14690)
Summary:
Fixed: https://github.com/pytorch/pytorch/issues/14678

This PR fixes DDP not working after save() and load() for multiple GPUs, because all of the replication and bucketing logic lives in the constructor.

So I refactored some of that logic from the constructor into a helper function, which is now also used by load().

Added a test too. Tested on 8-GPU machines.
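A hedged sketch of the save()/load() round trip this targets, assuming a process group is already initialized and this process drives two GPUs (the filename and shapes are made up):

```
import torch
from torch.nn.parallel import DistributedDataParallel

# Hedged sketch of the failure mode being fixed: pickling the DDP wrapper
# itself, then loading it and continuing to run forward/backward.
model = torch.nn.Linear(8, 8).cuda(0)
ddp = DistributedDataParallel(model, device_ids=[0, 1])

torch.save(ddp, "ddp.pt")            # pickles the DDP wrapper itself
restored = torch.load("ddp.pt")      # __setstate__ must rebuild replicas/buckets
out = restored(torch.randn(4, 8).cuda(0))
```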

```
tengli@learnfair062:~/pytorch/test$ python run_test.py -i distributed --verbose
Test executor: ['/private/home/tengli/miniconda3/bin/python']
Selected tests: distributed
Running test_distributed ... [2018-12-02 18:33:55.833580]
/public/apps/openmpi/2.1.1/gcc.5.4.0/bin/mpiexec
Running distributed tests for the mpi backend
test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok

----------------------------------------------------------------------
Ran 68 tests in 6.315s

OK (skipped=15)
ok

----------------------------------------------------------------------
Ran 68 tests in 6.315s

OK (skipped=15)
ok

----------------------------------------------------------------------
Ran 68 tests in 6.315s

OK (skipped=15)
Running distributed tests for the mpi backend with file init_method
test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok

----------------------------------------------------------------------
Ran 68 tests in 6.415s

OK (skipped=15)
ok

----------------------------------------------------------------------
Ran 68 tests in 6.415s

OK (skipped=15)
ok

----------------------------------------------------------------------
Ran 68 tests in 6.415s

OK (skipped=15)
Running distributed tests for the nccl backend
test_Backend_enum_class (__main__.TestDistBackend) ... ok
test_DistributedDataParallel (__main__.TestDistBackend) ... ok
test_DistributedDataParallelCPU (__main__.TestDistBackend) ... skipped 'nccl does not support DistributedDataParallelCPU'
test_all_gather (__main__.TestDistBackend) ... skipped 'Only MPI supports CPU all gather'
test_all_gather_cuda (__main__.TestDistBackend) ... skipped 'CUDA all gather skipped for NCCL'
test_all_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_multigpu (__main__.TestDistBackend) ... ok
test_all_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_multigpu (__main__.TestDistBackend) ... skipped 'CUDA all_reduce multigpu skipped for NCCL'
test_all_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum_cuda (__main__.TestDistBackend) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_cuda (__main__.TestDistBackend) ... ok
test_barrier_full_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_full_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_timeout_full_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_cuda (__main__.TestDistBackend) ... ok
test_broadcast_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_multigpu (__main__.TestDistBackend) ... skipped 'NCCL broadcast multigpu skipped'
test_destroy_full_group (__main__.TestDistBackend) ... ok
test_destroy_group (__main__.TestDistBackend) ... ok
test_gather (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_get_backend (__main__.TestDistBackend) ... ok
test_get_default_group (__main__.TestDistBackend) ... ok
test_get_rank (__main__.TestDistBackend) ... ok
test_get_rank_size_full_group (__main__.TestDistBackend) ... ok
test_get_rank_size_group (__main__.TestDistBackend) ... ok
test_irecv (__main__.TestDistBackend) ... skipped 'Nccl does not support irecv'
test_isend (__main__.TestDistBackend) ... skipped 'Nccl does not support isend'
test_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_multigpu (__main__.TestDistBackend) ... ok
test_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum_cuda (__main__.TestDistBackend) ... ok
test_scatter (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_send_recv (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'
test_send_recv_any_source (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv from any source'
test_send_recv_with_tag (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'

----------------------------------------------------------------------
Ran 68 tests in 69.549s

OK (skipped=52)
Running distributed tests for the nccl backend with file init_method
test_Backend_enum_class (__main__.TestDistBackend) ... ok
test_DistributedDataParallel (__main__.TestDistBackend) ... ok
test_DistributedDataParallelCPU (__main__.TestDistBackend) ... skipped 'nccl does not support DistributedDataParallelCPU'
test_all_gather (__main__.TestDistBackend) ... skipped 'Only MPI supports CPU all gather'
test_all_gather_cuda (__main__.TestDistBackend) ... skipped 'CUDA all gather skipped for NCCL'
test_all_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_multigpu (__main__.TestDistBackend) ... ok
test_all_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_multigpu (__main__.TestDistBackend) ... skipped 'CUDA all_reduce multigpu skipped for NCCL'
test_all_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum_cuda (__main__.TestDistBackend) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_cuda (__main__.TestDistBackend) ... ok
test_barrier_full_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_full_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_timeout_full_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_cuda (__main__.TestDistBackend) ... ok
test_broadcast_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_multigpu (__main__.TestDistBackend) ... skipped 'NCCL broadcast multigpu skipped'
test_destroy_full_group (__main__.TestDistBackend) ... ok
test_destroy_group (__main__.TestDistBackend) ... ok
test_gather (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_get_backend (__main__.TestDistBackend) ... ok
test_get_default_group (__main__.TestDistBackend) ... ok
test_get_rank (__main__.TestDistBackend) ... ok
test_get_rank_size_full_group (__main__.TestDistBackend) ... ok
test_get_rank_size_group (__main__.TestDistBackend) ... ok
test_irecv (__main__.TestDistBackend) ... skipped 'Nccl does not support irecv'
test_isend (__main__.TestDistBackend) ... skipped 'Nccl does not support isend'
test_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_multigpu (__main__.TestDistBackend) ... ok
test_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum_cuda (__main__.TestDistBackend) ... ok
test_scatter (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_send_recv (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'
test_send_recv_any_source (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv from any source'
test_send_recv_with_tag (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'

----------------------------------------------------------------------
Ran 68 tests in 70.381s

OK (skipped=52)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14690

Differential Revision: D13294169

Pulled By: teng-li

fbshipit-source-id: 69ccac34c6c016899bfe8fbc50b48d4bfd1d3876
2018-12-03 12:04:26 -08:00
Teng Li
5268dd468c Fixed a corner case where DistributedDataParallel could not kick off all-reduce (#14675)
Summary:
OK, this corner case happens for translation workloads, and it only happens in the following situation:

(1) the module has a registered parameter that does not require grad

and

(2) this registered parameter has a unique type (say, double or half), and it is the only parameter of that type, so it alone gets put into a separate bucket

and

(3) it is the last parameter registered in the module, so its bucket's reduction is the first one to be kicked off.

When this corner case happens, the parameter's backward hook never fires because it does not require grad. All other buckets wait for its bucket to be kicked off first, so no bucket gets reduced at all: everything is blocked by that first bucket (the unique-type parameter).
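A hedged sketch of such a module, assuming a NCCL/Gloo process group is already initialized; the `Net` class and its sizes are made up for illustration:

```
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc = nn.Linear(16, 16)  # float32, requires grad
        # Registered last, unique dtype, and frozen: before the fix it landed
        # alone in the bucket whose reduction must fire first, but its backward
        # hook never runs, so every other bucket waits forever.
        self.scale = nn.Parameter(torch.zeros(16, dtype=torch.double),
                                  requires_grad=False)

    def forward(self, x):
        return self.fc(x)

ddp = DistributedDataParallel(Net().cuda(0), device_ids=[0])
```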

This PR does three things:
(1) Make sure that we only bucket parameters that require grad.
(2) Check all-reduction in the next iteration: if we detect that the previous iteration's all-reduction was not fully kicked off, we issue an error in the next iteration.
(3) Remove some unused variables.

With this bug fixed, the only case in which this error can happen is when the user changes parameters after wrapping the module with DDP, as in:
https://github.com/pytorch/pytorch/issues/12603

Test coverage added as well.

Without the first fix, I verified that the repro in fbcode hit this error message:

```
result = self.forward(*input, **kwargs)
  File "/data/users/tengli/fbsource/fbcode/buck-out/dev/gen/language_technology/neural_mt/os/pytorch_translate/train#link-tree/torch/nn/parallel/distributed.py", line 312, in forward
    raise RuntimeError("Not all gradients are all-reduced from "
RuntimeError: Not all gradients are all-reduced from the backward of the previous iteration. This is unexpected and fatal error. Please check and ensure that the model's parameters are not changed after you wrap up the model with DistributedDataParallel.

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14675

Differential Revision: D13291083

Pulled By: teng-li

fbshipit-source-id: 2539b699fae843f104b4b8d22721ae82502ba684
2018-12-02 17:13:07 -08:00
Teng Li
85d3fccee7 Removed redundant allreduce options in DDP (#14208)
Summary:
This somehow was not cleaned up after the C++ migration. It is unused and can be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14208

Differential Revision: D13132492

Pulled By: teng-li

fbshipit-source-id: 0f05b6368174664ebb2560c037347c8eb45f7c38
2018-11-21 16:56:46 -08:00
Teng Li
4983397c02 Better documentation and warning (#13946)
Summary:
This is to address https://github.com/pytorch/pytorch/issues/12603
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13946

Differential Revision: D13055254

Pulled By: teng-li

fbshipit-source-id: 20a206ebd3456eac9dc50584664c4bca3ee955d1
2018-11-14 10:41:46 -08:00
Teng Li
dceec1de30 Distributed Data Parallel documentation for PT1 release (#13657)
Summary:
This should fix https://github.com/pytorch/pytorch/issues/12604

Ran `make html` and looked through the HTML pages to make sure that everything looks good.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13657

Reviewed By: calebho

Differential Revision: D12954250

Pulled By: teng-li

fbshipit-source-id: 40e1925ec0cdce5e6a1d8ba29537937da8ef9194
2018-11-07 12:11:57 -08:00
Teng Li
1413dd4bfc Added the finer bucketing option for DDP (#13607)
Summary:
We only need this for the backward pass; for the forward broadcast, the non-fine-grained bucketing should be better since it is sequential anyway.

The tests should all be covered by the c10d test; I reduced the bucket size so that bucketing actually happens in the c10d test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13607

Differential Revision: D12944515

Pulled By: teng-li

fbshipit-source-id: d982e8dca2874c91d39b30b73a85bfbeb768c508
2018-11-07 12:00:55 -08:00
Teng Li
74819087de Mixed precision DDP hang fix and fine-grained option for DDP perf (#13496)
Summary:
When going to mixed-precision fp16 training, DDP randomly hangs. Initially, I thought this smelled like a similar NCCL bug I filed a while ago. It turns out it's not. Again, I am seeing that different rank processes end up with different bucket sizes. How could this even happen?

It turns out that take_tensors generates the list of bucketed tensors in a non-deterministic order, because the key of the map is a pointer. An interesting bug to dig into and fix.

fp16 DDP training should be fully working now.

Also added another fine-grained take_tensors helper that aims to improve DDP performance; there is a TODO to replace DDP's take_tensors with it.

Fixed: https://github.com/pytorch/pytorch/issues/12150
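A hedged sketch of the kind of fp16 setup that used to hang, assuming a NCCL process group is already initialized; the model and shapes are illustrative:

```
import torch
from torch.nn.parallel import DistributedDataParallel

# Hedged sketch: half() gives the model fp16 parameters, and buckets are grouped
# by tensor type/device, so take_tensors must return them in the same order on
# every rank for the all-reduces to line up.
model = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.ReLU()).half().cuda(0)
ddp = DistributedDataParallel(model, device_ids=[0])
out = ddp(torch.randn(8, 32).half().cuda(0))
out.float().sum().backward()
```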
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13496

Differential Revision: D12920985

Pulled By: teng-li

fbshipit-source-id: 26f3edae7be45a80fa7b2410a2e5a1baab212d9c
2018-11-05 16:22:15 -08:00
Teng Li
e475d3ede3 DDP multi-GPU segfault fix (#13291)
Summary:
Fix https://github.com/pytorch/pytorch/issues/13200

Tested on an 8-GPU machine, since CI doesn't have this many GPUs and the multi-GPU tests won't be triggered there.

```
tengli@learnfair096:~/pytorch/test$ python run_test.py -i distributed --verbose
Selected tests: distributed
Running test_distributed ... [2018-10-29 20:32:46.355858]
/public/apps/openmpi/2.1.1/gcc.5.4.0/bin/mpiexec
Running distributed tests for the gloo backend
test_DistBackend (__main__.TestDistBackend) ... ok
test_DistributedDataParallel (__main__.TestDistBackend) ... ok
test_DistributedDataParallelCPU (__main__.TestDistBackend) ... ok
```

Also, I would like to bump up the broadcast bucket size for performance reasons.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13291

Differential Revision: D12842840

Pulled By: teng-li

fbshipit-source-id: e8c50f15ebf2ab3e2cd1b51d365e41a6106b98fe
2018-10-31 00:43:42 -07:00
sli
9d9e5f8d1e Fix a bug in DistributedDataParallel (#13248)
Summary:
Fixed bug [https://github.com/facebookresearch/maskrcnn-benchmark/issues/52](https://github.com/facebookresearch/maskrcnn-benchmark/issues/52)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13248

Reviewed By: pietern

Differential Revision: D12830451

Pulled By: teng-li

fbshipit-source-id: ab33faf3f6f4545f8fe07da7ecbeb2f0a2ea23f0
2018-10-29 15:19:55 -07:00
Teng Li
c250f6f3d5 DDP perf improvement: move sync_reduction to C++, dedicated CUDA streams for memcpy (#12954)
Summary:
- Moved sync_reduction to C++
- Use a dedicated CUDA stream for memcpy
- Also use a dedicated CUDA stream for memcpy in queue_reduction

Added test as well.

CI should cover both the DDP change and the unit tests.
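Regarding the dedicated memcpy stream above: a hedged sketch of the general idiom (not the internal DDP C++ code) is to run the copy on a side stream and make the default stream wait on it before consuming the result.

```
import torch

# Hedged sketch of the side-stream copy pattern.
copy_stream = torch.cuda.Stream()
src = torch.randn(1024, 1024, pin_memory=True)   # pinned host memory
with torch.cuda.stream(copy_stream):
    dst = src.cuda(non_blocking=True)            # copy runs on copy_stream
# Make the default stream wait for the copy before consuming dst.
torch.cuda.current_stream().wait_stream(copy_stream)
result = dst.sum()
```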
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12954

Differential Revision: D10520069

Pulled By: teng-li

fbshipit-source-id: 64348e4e43c15f9695a4c28b036c232587ecfb65
2018-10-24 21:37:13 -07:00
Teng Li
8d3e7e2fcb Move DDP queue_reduction to C++ (#12852)
Summary:
Fully working version, continuing from goldsborough's initial version.

Waiting on the stream guard to be merged before adding more stream-related performance logic to the C++ version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12852

Differential Revision: D10468696

Pulled By: teng-li

fbshipit-source-id: 8e46d408796973817abfd9dbd6566e0ca5b7a13f
2018-10-22 16:07:46 -07:00
Teng Li
d120b9af5a Make c10d pickling/unpickling work (#12694)
Summary:
This fixes the issue for https://github.com/pytorch/pytorch/issues/12168
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12694

Differential Revision: D10468717

Pulled By: teng-li

fbshipit-source-id: 3df31d75eea19d6085af665f5350d3cb667a5048
2018-10-19 16:42:36 -07:00
Wei Yang
54107ae8cf convert output_device at data_parallel from torch.device to index (#10189)
Summary:
- fixes #9984
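A hedged sketch of the call this affects, assuming two visible GPUs; after the change an `output_device` given as a `torch.device` is converted to its index internally (the model and shapes are illustrative):

```
import torch
from torch.nn.parallel import data_parallel

# Hedged sketch: output_device may be passed as a torch.device.
model = torch.nn.Linear(8, 8).cuda(0)
inputs = torch.randn(4, 8).cuda(0)
out = data_parallel(model, inputs, device_ids=[0, 1],
                    output_device=torch.device("cuda:0"))
```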
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10189

Differential Revision: D9545390

Pulled By: weiyangfb

fbshipit-source-id: 3a6a705437553ba319e9fd4b7f676ff73857a27e
2018-09-11 20:27:07 -07:00
Teng Li
0988bbad2d C10d release to torch.distributed for PT1 (#11405)
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`
The old DDP will go to `torch.nn.parallel.deprecated`

Now `torch.nn.parallel.DDP` will use the c10d DDP
Now `torch.distributed` will use the c10d frontend API
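A minimal sketch of the PT1 frontend this points users at, assuming `RANK`/`WORLD_SIZE`/`MASTER_ADDR`/`MASTER_PORT` are provided by a launcher; the old API lives on under `torch.distributed.deprecated`.

```
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Hedged sketch of the c10d-backed frontend.
dist.init_process_group(backend="nccl", init_method="env://")
model = torch.nn.Linear(4, 4).cuda(0)
ddp = DistributedDataParallel(model, device_ids=[0])  # now backed by c10d
```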
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405

Reviewed By: pietern

Differential Revision: D9733733

Pulled By: teng-li

fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08
2018-09-10 23:27:22 -07:00
Jerry Ma
afd7477eaa Add `buffers(), named_buffers()` methods. (#10554)
Summary:
This commit adds the ``buffers()`` and ``named_buffers()`` methods as
analogues of ``parameters()`` and ``named_parameters()``.
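A minimal sketch of the new accessors, using BatchNorm's running statistics as the example buffers:

```
import torch
import torch.nn as nn

# buffers() mirrors parameters(), named_buffers() mirrors named_parameters().
bn = nn.BatchNorm1d(4)
for name, buf in bn.named_buffers():
    print(name, tuple(buf.shape))     # e.g. running_mean, running_var
assert all(isinstance(b, torch.Tensor) for b in bn.buffers())
```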
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10554

Reviewed By: SsnL

Differential Revision: D9367762

Pulled By: jma127

fbshipit-source-id: f2042e46a7e833dce40cb41681dbd80d7885c74e
2018-08-16 16:26:48 -07:00
Tongzhou Wang
a77b391de7 [SpectralNorm] don't register original weight as buffer (#8170)
* don't register original weight as buffer; fixes for buffers that require grad

* add test
2018-06-12 14:42:05 -04:00
Ailing
52e4d3c4a2 add error when backend is not supported by DDP (#8325) 2018-06-11 02:18:30 -04:00
Isaac Ge
537cb10525 improve DataParallel/DistributedDataParallel docs (#7407) 2018-05-09 10:30:42 +02:00
Jon Malmaud
5463a4a319 Fix typo. (#6609) 2018-04-15 11:43:10 +02:00
Ailing
1499a604cf fix assertion error when input size is smaller than the number of module_copies (#6252) 2018-04-04 12:05:34 +02:00
Ailing
f5aa8d55ad fix detach in place error in DDP (#5829)
* fix detach in DDP

* fix typo

* make lint happy
2018-03-16 09:22:04 -04:00
Teng Li
579de82bcf DDP: 10% of NCCL backend perf improvements with mixed-prec support (#5064) 2018-02-21 23:59:52 +01:00
Teng Li
4b8f4fc259 Added mixed-precision support in distributed training (#4891) 2018-02-21 14:29:39 +01:00
Richard Zou
cac3026b35 Fix typo in DataParallel docs (#5268) 2018-02-15 23:02:26 +01:00
Teng Li
d7b6a61a54 DDP: coalescing many little broadcasts to improve performance (#4978) 2018-02-12 16:41:33 +01:00
Tongzhou Wang
805639906a Broadcast output requires_grad only if corresponding input requires_grad (#5061) 2018-02-05 23:38:35 -05:00
Teng Li
ae28411af8 Slightly improve DDP single GPU multi-process dist training performance 2018-01-27 12:15:44 +01:00
Teng Li
154038e318 Removing NCCL clear_group_cache workaround with one more check in new_group (#4766) 2018-01-23 11:03:52 +01:00
Sam Gross
d605058212
Replace Variable.volatile with torch.no_grad() (#3970)
This removes volatile from Variable. The functionality is mostly
replaced by a global (thread-local) flag, which is controlled by
torch.set_grad_enabled() and the context manager torch.no_grad().

In C++, the flag is exposed through GradMode::is_enabled() and GradMode::set_enabled()

Fixes #3627
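A minimal sketch of the replacement pattern:

```
import torch

# Grad tracking is now governed by a thread-local flag, toggled either with the
# no_grad() context manager or with set_grad_enabled().
x = torch.randn(3, requires_grad=True)

with torch.no_grad():
    y = x * 2                      # recorded with requires_grad=False

torch.set_grad_enabled(False)      # same flag, toggled imperatively
z = x + 1
torch.set_grad_enabled(True)
assert not y.requires_grad and not z.requires_grad
```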
2017-12-18 15:46:13 -05:00
ngimel
7f41149e14 handle requires_grad when creating buckets for distributed (#4044) 2017-12-18 02:13:53 -05:00
Teng Li
926ed2b280 Implemented NCCL Distributed Backend for PyTorch with new dist APIs (#3435)
* Implemented NCCL Distributed Backend for PyTorch with new dist APIs

* Let FindNCCL determine the NCCL version

* Let the NCCL2 backend use ATen instead of the deprecated THPP

* Let distributed parallel model use a single reduction thread for NCCL backend

* Caching the sockets, bug fix, refactoring, and addressed Adam's comments

* Make BcastNcclID take a single param and fix a bug in all_gather

* Removed the barrier function, added a warning for users, and stopped exposing experimental functions to users

* Use the simplest single-bucket working solution for the distributed data parallel model, with rebase

* Cleanup, fixes and further addressed Adam's comments

* Used PySequence_Fast in distributed csrc

* Removed the limitation that each group is only bound to a given device sequence

* Used THPObjectPtr for PySequence_Fast
2017-11-29 15:57:02 -05:00
SsnL
01be4d6b20 sparse broadcast_coalesce and reduce_add_coalesced 2017-10-28 18:52:35 -04:00
SsnL
de1f4e69dd raw text (#3327) 2017-10-28 01:24:02 +05:30
Luca Antiga
6743d59513 Add missing import. Add return to __getstate__ 2017-10-08 11:07:10 -04:00
Sergey Kolesnikov
5f8bab47c8 bugfix for issue 2428 (#3000) 2017-10-06 09:20:12 -04:00
jekbradbury
7aa6bc516f add "Basics" section to distributed docs (#2433) 2017-08-24 17:07:20 -04:00
Robert Kirby
5d09fcd028 Make DistributedDataParallel threads Daemon threads to allow clean process exit (#2524) 2017-08-24 06:32:29 -04:00
Christian Sarofeen
4c69697d2a Distributed bug fixes. (#2434) 2017-08-23 14:46:52 -04:00
LuoweiZhou
5c43fcda8d Support params that don’t require grad in DistributedDataParallel (#2464) 2017-08-19 11:22:20 -04:00
Robert Kirby
9199c954f1 Fix typo in DistributedDataParallel (#2320) 2017-08-08 21:53:42 -04:00
Adam Paszke
dc17fb68e4 Fix minor bug in parallel_apply (#2193) 2017-07-25 03:45:00 +05:30
Adam Paszke
8ab3d214d5 Fixes for DistributedDataParallel (#2168) 2017-07-21 16:00:46 -04:00
Adam Paszke
4af40e3471 Let parallel_apply accept arbitrary inputs 2017-07-20 01:45:57 -04:00
Sam Gross
10e23943b3 Fix missing _forward_pre_hooks in serialized modules (#2057) 2017-07-11 18:23:35 -04:00
Leonid Vlasenkov
46a868dab7 [Ready] Limit docs line length (#1900)
* some docs are ready

* docs

* docs

* fix some more

* fix some more
2017-07-10 10:24:54 -04:00
Adam Paszke
d9d50f80c7 Rename arguments to distributed collectives 2017-06-12 22:02:11 -04:00
Adam Paszke
12813b88f6 Add DistributedDataParallel 2017-06-12 22:00:22 -04:00