pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Edward Yang	173f224570	Turn on F401: Unused import warning. (#18598 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598 ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a Stack from [ghstack](https://github.com/ezyang/ghstack): * #18598 Turn on F401: Unused import warning. This was requested by someone at Facebook; this lint is turned on for Facebook by default. "Sure, why not." I had to noqa a number of imports in __init__. Hypothetically we're supposed to use __all__ in this case, but I was too lazy to fix it. Left for future work. Be careful! flake8-2 and flake8-3 behave differently with respect to import resolution for # type: comments. flake8-3 will report an import unused; flake8-2 will not. For now, I just noqa'd all these sites. All the changes were done by hand. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: D14687478 fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3	2019-03-30 09:01:17 -07:00
Edward Yang	2934153f35	Correctly call superclass setUp in TestCase subclasses. (#18291 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/18291 ghimport-source-id: d6e95e899bd320407967df41435801e54864ba62 Stack from [ghstack](https://github.com/ezyang/ghstack): * #18292 Add test for #17271 (torch.exp incorrect for 2*31 size tensor) #18291 Correctly call superclass setUp in TestCase subclasses. This makes PYTORCH_TEST_SKIP_FAST work correctly for more tests, reducing the wasted testing effort on our slow_test job. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: D14567643 fbshipit-source-id: 40cf1d6556e0dd0a0550ff3d9ffed8b6000f8191	2019-03-22 07:46:44 -07:00
jiej	39669316a6	(#14267 ) Summary: - Summary: Added synchronized batch normalization, allows synchronization of stats across mini-batches between processes within a process group. Current implementation uses a mixture of extended ATen native functions (cpp cuda extension) + torch.nn.modules (c10d python API) - User-facing api: 1. torch.nn.utils.convert_sync_batchnorm(modules, process_group=None) 2. torch.nn.SyncBatchNorm(num_features, eps=1e-5, momentum=0.1, affine=True, track_running_stats=True, *process_group=None) - supported use case: DistributedDataParallel with single-gpu multi-process* a. User creates model containing `torch.nn.SyncBatchNorm` layers through one of the ways listed below: 1. use layers directly: torch.nn.SyncBatchNorm(...) similar API as with torch.nn.BatchNormXd(...) with added argument `process_group` which is used to limit the scope of synchronization within each process group. Default value is None, which implies synchronization across all GPUs 2. use torch.nn.utils.convert_sync_batchnorm(modules, process_group) recursively convert all `torch.nn.BatchNormXd` into `torch.nn.SyncBatchNorm` preserving values of parameters/buffers. the utility function also allows user to specify process_group value to all converted layers. b. user wraps their model with `torch.distributed.parallel.DataParallelDistributed`, from this point, user should follow the general guidelines for DDP use guide - Error checking For use cases not supported, we error out: 1. Application launched without ddp: > import torch > sbn = torch.nn.SyncBatchNorm(10).cuda() > inp = torch.randn(5, 10, 3, 3).cuda() > sbn(inp) --> Error! > AttributeError: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel 2. Application launched using DDP with multi-GPU per-process: > ddp_module = nn.parallel.DistributedDataParallel(module, device_ids=device_ids, output_device=args.local_rank) > ValueError: SyncBatchNorm is only supported for DDP with single GPU per process Pull Request resolved: https://github.com/pytorch/pytorch/pull/14267 Differential Revision: D14270035 Pulled By: ezyang fbshipit-source-id: 4956d8fa565c32e9df5408d53719ff9f945f4d6d	2019-03-06 13:39:11 -08:00
Pieter Noordhuis	252e9058d4	Improve assertion failure message (#14813 ) Summary: See #14554. I can't figure out how the reported issue can happen. The best next thing is have more information when this happens again. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14813 Differential Revision: D13351908 Pulled By: pietern fbshipit-source-id: 61b30fcae2e34da54329d0893ca4921b6ad60f0d	2018-12-05 17:20:25 -08:00
Teng Li	2d3cf98b49	Making dist.get_default_group private for PT1 release (#14767 ) Summary: When I wrote the frontend API, it is designed on not letting users use the default_group directly on any functions. It should really be private. All collectives are supposed to either use group.WORLD, or anything that comes out of new_group. That was the initial design. We need to make a TODO on removing group.WORLD one day. It exists for backward compatibility reasons and adds lots of complexity. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14767 Reviewed By: pietern Differential Revision: D13330655 Pulled By: teng-li fbshipit-source-id: ace107e1c3a9b3910a300b22815a9e8096fafb1c	2018-12-04 19:22:24 -08:00
Teng Li	cac03280f9	Fixed DistributedDataParallel state pickling for multi-gpus (#14690 ) Summary: Fixed: https://github.com/pytorch/pytorch/issues/14678 This PR fixed DDP doesn't work after save() and load() for multiple GPUs, because, it needs all these replicating logics and bucketing in the constructor. So I refactored some of the logics in the constructor to a helper function. And this will be used for load(). Added test too. Tested on 8 GPU machines. ``` tengli@learnfair062:~/pytorch/test$ python run_test.py -i distributed --verbose Test executor: ['/private/home/tengli/miniconda3/bin/python'] Selected tests: distributed Running test_distributed ... [2018-12-02 18:33:55.833580] /public/apps/openmpi/2.1.1/gcc.5.4.0/bin/mpiexec Running distributed tests for the mpi backend test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... ok test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel' test_DistributedDataParallelCPU (__main__.TestMPI) ... ok test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel' test_DistributedDataParallelCPU (__main__.TestMPI) ... ok test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel' test_DistributedDataParallelCPU (__main__.TestMPI) ... ok test_all_gather (__main__.TestMPI) ... ok test_all_gather (__main__.TestMPI) ... ok test_all_gather (__main__.TestMPI) ... ok test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather' test_all_gather_full_group (__main__.TestMPI) ... ok test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather' test_all_gather_full_group (__main__.TestMPI) ... ok test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather' test_all_gather_full_group (__main__.TestMPI) ... ok test_all_gather_group (__main__.TestMPI) ... ok test_all_gather_group (__main__.TestMPI) ... ok test_all_gather_group (__main__.TestMPI) ... ok test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu' test_all_reduce_full_group_max (__main__.TestMPI) ... ok test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu' test_all_reduce_full_group_max (__main__.TestMPI) ... ok test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu' test_all_reduce_full_group_max (__main__.TestMPI) ... ok test_all_reduce_full_group_min (__main__.TestMPI) ... ok test_all_reduce_full_group_min (__main__.TestMPI) ... ok test_all_reduce_full_group_min (__main__.TestMPI) ... ok test_all_reduce_full_group_product (__main__.TestMPI) ... ok test_all_reduce_full_group_product (__main__.TestMPI) ... ok test_all_reduce_full_group_product (__main__.TestMPI) ... ok test_all_reduce_full_group_sum (__main__.TestMPI) ... ok test_all_reduce_full_group_sum (__main__.TestMPI) ... ok test_all_reduce_full_group_sum (__main__.TestMPI) ... ok test_all_reduce_group_max (__main__.TestMPI) ... ok test_all_reduce_group_max (__main__.TestMPI) ... ok test_all_reduce_group_max (__main__.TestMPI) ... ok test_all_reduce_group_min (__main__.TestMPI) ... ok test_all_reduce_group_min (__main__.TestMPI) ... ok test_all_reduce_group_min (__main__.TestMPI) ... ok test_all_reduce_group_product (__main__.TestMPI) ... ok test_all_reduce_group_product (__main__.TestMPI) ... ok test_all_reduce_group_product (__main__.TestMPI) ... ok test_all_reduce_group_sum (__main__.TestMPI) ... ok test_all_reduce_group_sum (__main__.TestMPI) ... ok test_all_reduce_group_sum (__main__.TestMPI) ... ok test_all_reduce_max (__main__.TestMPI) ... ok test_all_reduce_max (__main__.TestMPI) ... ok test_all_reduce_max (__main__.TestMPI) ... ok test_all_reduce_min (__main__.TestMPI) ... ok test_all_reduce_min (__main__.TestMPI) ... ok test_all_reduce_min (__main__.TestMPI) ... ok test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu" test_all_reduce_product (__main__.TestMPI) ... ok test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu" test_all_reduce_product (__main__.TestMPI) ... ok test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu" test_all_reduce_product (__main__.TestMPI) ... ok test_all_reduce_sum (__main__.TestMPI) ... ok test_all_reduce_sum (__main__.TestMPI) ... ok test_all_reduce_sum (__main__.TestMPI) ... ok test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested' test_barrier (__main__.TestMPI) ... ok test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested' test_barrier (__main__.TestMPI) ... ok test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested' test_barrier (__main__.TestMPI) ... ok test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier" test_barrier_full_group (__main__.TestMPI) ... ok test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier" test_barrier_full_group (__main__.TestMPI) ... ok test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier" test_barrier_full_group (__main__.TestMPI) ... ok test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier" test_barrier_group (__main__.TestMPI) ... ok test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier" test_barrier_group (__main__.TestMPI) ... ok test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier" test_barrier_group (__main__.TestMPI) ... ok test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier" test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts' test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts' test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts' test_broadcast (__main__.TestMPI) ... ok test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier" test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts' test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts' test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts' test_broadcast (__main__.TestMPI) ... ok test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier" test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts' test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts' test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts' test_broadcast (__main__.TestMPI) ... ok test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce' test_broadcast_full_group (__main__.TestMPI) ... ok test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce' test_broadcast_full_group (__main__.TestMPI) ... ok test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce' test_broadcast_full_group (__main__.TestMPI) ... ok test_broadcast_group (__main__.TestMPI) ... ok test_broadcast_group (__main__.TestMPI) ... ok test_broadcast_group (__main__.TestMPI) ... ok test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu" test_destroy_full_group (__main__.TestMPI) ... ok test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu" test_destroy_full_group (__main__.TestMPI) ... ok test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu" test_destroy_full_group (__main__.TestMPI) ... ok test_destroy_group (__main__.TestMPI) ... ok test_destroy_group (__main__.TestMPI) ... ok test_destroy_group (__main__.TestMPI) ... ok test_gather (__main__.TestMPI) ... ok test_gather (__main__.TestMPI) ... ok test_gather (__main__.TestMPI) ... ok test_gather_full_group (__main__.TestMPI) ... ok test_gather_full_group (__main__.TestMPI) ... ok test_gather_full_group (__main__.TestMPI) ... ok test_gather_group (__main__.TestMPI) ... ok test_gather_group (__main__.TestMPI) ... ok test_gather_group (__main__.TestMPI) ... ok test_get_backend (__main__.TestMPI) ... ok test_get_backend (__main__.TestMPI) ... ok test_get_backend (__main__.TestMPI) ... ok test_get_default_group (__main__.TestMPI) ... ok test_get_default_group (__main__.TestMPI) ... ok test_get_default_group (__main__.TestMPI) ... ok test_get_rank (__main__.TestMPI) ... ok test_get_rank (__main__.TestMPI) ... ok test_get_rank (__main__.TestMPI) ... ok test_get_rank_size_full_group (__main__.TestMPI) ... ok test_get_rank_size_full_group (__main__.TestMPI) ... ok test_get_rank_size_full_group (__main__.TestMPI) ... ok test_get_rank_size_group (__main__.TestMPI) ... ok test_get_rank_size_group (__main__.TestMPI) ... ok test_get_rank_size_group (__main__.TestMPI) ... ok test_irecv (__main__.TestMPI) ... ok test_irecv (__main__.TestMPI) ... ok test_irecv (__main__.TestMPI) ... ok test_isend (__main__.TestMPI) ... ok test_isend (__main__.TestMPI) ... ok test_isend (__main__.TestMPI) ... ok test_reduce_full_group_max (__main__.TestMPI) ... ok test_reduce_full_group_max (__main__.TestMPI) ... ok test_reduce_full_group_max (__main__.TestMPI) ... ok test_reduce_full_group_min (__main__.TestMPI) ... ok test_reduce_full_group_min (__main__.TestMPI) ... ok test_reduce_full_group_min (__main__.TestMPI) ... ok test_reduce_full_group_product (__main__.TestMPI) ... ok test_reduce_full_group_product (__main__.TestMPI) ... ok test_reduce_full_group_product (__main__.TestMPI) ... ok test_reduce_full_group_sum (__main__.TestMPI) ... ok test_reduce_full_group_sum (__main__.TestMPI) ... ok test_reduce_full_group_sum (__main__.TestMPI) ... ok test_reduce_group_max (__main__.TestMPI) ... ok test_reduce_group_max (__main__.TestMPI) ... ok test_reduce_group_max (__main__.TestMPI) ... ok test_reduce_group_min (__main__.TestMPI) ... ok test_reduce_group_min (__main__.TestMPI) ... ok test_reduce_group_min (__main__.TestMPI) ... ok test_reduce_group_product (__main__.TestMPI) ... ok test_reduce_group_product (__main__.TestMPI) ... ok test_reduce_group_product (__main__.TestMPI) ... ok test_reduce_group_sum (__main__.TestMPI) ... ok test_reduce_group_sum (__main__.TestMPI) ... ok test_reduce_group_sum (__main__.TestMPI) ... ok test_reduce_max (__main__.TestMPI) ... ok test_reduce_max (__main__.TestMPI) ... ok test_reduce_max (__main__.TestMPI) ... ok test_reduce_min (__main__.TestMPI) ... ok test_reduce_min (__main__.TestMPI) ... ok test_reduce_min (__main__.TestMPI) ... ok test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu' test_reduce_product (__main__.TestMPI) ... ok test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu' test_reduce_product (__main__.TestMPI) ... ok test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu' test_reduce_product (__main__.TestMPI) ... ok test_reduce_sum (__main__.TestMPI) ... ok test_reduce_sum (__main__.TestMPI) ... ok test_reduce_sum (__main__.TestMPI) ... ok test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce' test_scatter (__main__.TestMPI) ... ok test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce' test_scatter (__main__.TestMPI) ... ok test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce' test_scatter (__main__.TestMPI) ... ok test_scatter_full_group (__main__.TestMPI) ... ok test_scatter_full_group (__main__.TestMPI) ... ok test_scatter_full_group (__main__.TestMPI) ... ok test_scatter_group (__main__.TestMPI) ... ok test_scatter_group (__main__.TestMPI) ... ok test_scatter_group (__main__.TestMPI) ... ok test_send_recv (__main__.TestMPI) ... ok test_send_recv (__main__.TestMPI) ... ok test_send_recv (__main__.TestMPI) ... ok test_send_recv_any_source (__main__.TestMPI) ... ok test_send_recv_any_source (__main__.TestMPI) ... ok test_send_recv_any_source (__main__.TestMPI) ... ok test_send_recv_with_tag (__main__.TestMPI) ... ok test_send_recv_with_tag (__main__.TestMPI) ... ok test_send_recv_with_tag (__main__.TestMPI) ... ok ---------------------------------------------------------------------- Ran 68 tests in 6.315s OK (skipped=15) ok ---------------------------------------------------------------------- Ran 68 tests in 6.315s OK (skipped=15) ok ---------------------------------------------------------------------- Ran 68 tests in 6.315s OK (skipped=15) Running distributed tests for the mpi backend with file init_method test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... ok test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel' test_DistributedDataParallelCPU (__main__.TestMPI) ... ok test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel' test_DistributedDataParallelCPU (__main__.TestMPI) ... ok test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel' test_DistributedDataParallelCPU (__main__.TestMPI) ... ok test_all_gather (__main__.TestMPI) ... ok test_all_gather (__main__.TestMPI) ... ok test_all_gather (__main__.TestMPI) ... ok test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather' test_all_gather_full_group (__main__.TestMPI) ... ok test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather' test_all_gather_full_group (__main__.TestMPI) ... ok test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather' test_all_gather_full_group (__main__.TestMPI) ... ok test_all_gather_group (__main__.TestMPI) ... ok test_all_gather_group (__main__.TestMPI) ... ok test_all_gather_group (__main__.TestMPI) ... ok test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu' test_all_reduce_full_group_max (__main__.TestMPI) ... ok test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu' test_all_reduce_full_group_max (__main__.TestMPI) ... ok test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu' test_all_reduce_full_group_max (__main__.TestMPI) ... ok test_all_reduce_full_group_min (__main__.TestMPI) ... ok test_all_reduce_full_group_min (__main__.TestMPI) ... ok test_all_reduce_full_group_min (__main__.TestMPI) ... ok test_all_reduce_full_group_product (__main__.TestMPI) ... ok test_all_reduce_full_group_product (__main__.TestMPI) ... ok test_all_reduce_full_group_product (__main__.TestMPI) ... ok test_all_reduce_full_group_sum (__main__.TestMPI) ... ok test_all_reduce_full_group_sum (__main__.TestMPI) ... ok test_all_reduce_full_group_sum (__main__.TestMPI) ... ok test_all_reduce_group_max (__main__.TestMPI) ... ok test_all_reduce_group_max (__main__.TestMPI) ... ok test_all_reduce_group_max (__main__.TestMPI) ... ok test_all_reduce_group_min (__main__.TestMPI) ... ok test_all_reduce_group_min (__main__.TestMPI) ... ok test_all_reduce_group_min (__main__.TestMPI) ... ok test_all_reduce_group_product (__main__.TestMPI) ... ok test_all_reduce_group_product (__main__.TestMPI) ... ok test_all_reduce_group_product (__main__.TestMPI) ... ok test_all_reduce_group_sum (__main__.TestMPI) ... ok test_all_reduce_group_sum (__main__.TestMPI) ... ok test_all_reduce_group_sum (__main__.TestMPI) ... ok test_all_reduce_max (__main__.TestMPI) ... ok test_all_reduce_max (__main__.TestMPI) ... ok test_all_reduce_max (__main__.TestMPI) ... ok test_all_reduce_min (__main__.TestMPI) ... ok test_all_reduce_min (__main__.TestMPI) ... ok test_all_reduce_min (__main__.TestMPI) ... ok test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu" test_all_reduce_product (__main__.TestMPI) ... ok test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu" test_all_reduce_product (__main__.TestMPI) ... ok test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu" test_all_reduce_product (__main__.TestMPI) ... ok test_all_reduce_sum (__main__.TestMPI) ... ok test_all_reduce_sum (__main__.TestMPI) ... ok test_all_reduce_sum (__main__.TestMPI) ... ok test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested' test_barrier (__main__.TestMPI) ... ok test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested' test_barrier (__main__.TestMPI) ... ok test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested' test_barrier (__main__.TestMPI) ... ok test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier" test_barrier_full_group (__main__.TestMPI) ... ok test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier" test_barrier_full_group (__main__.TestMPI) ... ok test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier" test_barrier_full_group (__main__.TestMPI) ... ok test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier" test_barrier_group (__main__.TestMPI) ... ok test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier" test_barrier_group (__main__.TestMPI) ... ok test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier" test_barrier_group (__main__.TestMPI) ... ok test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier" test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts' test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts' test_barrier_timeout_group (__main__.TestMPI) ... ok test_barrier_group_cuda (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts' test_broadcast (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier" test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts' test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts' test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts' test_broadcast (__main__.TestMPI) ... ok test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier" test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts' test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts' test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts' test_broadcast (__main__.TestMPI) ... ok test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce' test_broadcast_full_group (__main__.TestMPI) ... ok test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce' test_broadcast_full_group (__main__.TestMPI) ... ok test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce' test_broadcast_full_group (__main__.TestMPI) ... ok test_broadcast_group (__main__.TestMPI) ... ok test_broadcast_group (__main__.TestMPI) ... ok test_broadcast_group (__main__.TestMPI) ... ok test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu" test_destroy_full_group (__main__.TestMPI) ... ok test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu" test_destroy_full_group (__main__.TestMPI) ... ok test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu" test_destroy_full_group (__main__.TestMPI) ... ok test_destroy_group (__main__.TestMPI) ... ok test_destroy_group (__main__.TestMPI) ... ok test_destroy_group (__main__.TestMPI) ... ok test_gather (__main__.TestMPI) ... ok test_gather (__main__.TestMPI) ... ok test_gather (__main__.TestMPI) ... ok test_gather_full_group (__main__.TestMPI) ... ok test_gather_full_group (__main__.TestMPI) ... ok test_gather_full_group (__main__.TestMPI) ... ok test_gather_group (__main__.TestMPI) ... ok test_gather_group (__main__.TestMPI) ... ok test_gather_group (__main__.TestMPI) ... ok test_get_backend (__main__.TestMPI) ... ok test_get_backend (__main__.TestMPI) ... ok test_get_backend (__main__.TestMPI) ... ok test_get_default_group (__main__.TestMPI) ... ok test_get_default_group (__main__.TestMPI) ... ok test_get_default_group (__main__.TestMPI) ... ok test_get_rank (__main__.TestMPI) ... ok test_get_rank (__main__.TestMPI) ... ok test_get_rank (__main__.TestMPI) ... ok test_get_rank_size_full_group (__main__.TestMPI) ... ok test_get_rank_size_full_group (__main__.TestMPI) ... ok test_get_rank_size_full_group (__main__.TestMPI) ... ok test_get_rank_size_group (__main__.TestMPI) ... ok test_get_rank_size_group (__main__.TestMPI) ... ok test_get_rank_size_group (__main__.TestMPI) ... ok test_irecv (__main__.TestMPI) ... ok test_irecv (__main__.TestMPI) ... ok test_irecv (__main__.TestMPI) ... ok test_isend (__main__.TestMPI) ... ok test_isend (__main__.TestMPI) ... ok test_isend (__main__.TestMPI) ... ok test_reduce_full_group_max (__main__.TestMPI) ... ok test_reduce_full_group_max (__main__.TestMPI) ... ok test_reduce_full_group_max (__main__.TestMPI) ... ok test_reduce_full_group_min (__main__.TestMPI) ... ok test_reduce_full_group_min (__main__.TestMPI) ... ok test_reduce_full_group_min (__main__.TestMPI) ... ok test_reduce_full_group_product (__main__.TestMPI) ... ok test_reduce_full_group_product (__main__.TestMPI) ... ok test_reduce_full_group_product (__main__.TestMPI) ... ok test_reduce_full_group_sum (__main__.TestMPI) ... ok test_reduce_full_group_sum (__main__.TestMPI) ... ok test_reduce_full_group_sum (__main__.TestMPI) ... ok test_reduce_group_max (__main__.TestMPI) ... ok test_reduce_group_max (__main__.TestMPI) ... ok test_reduce_group_max (__main__.TestMPI) ... ok test_reduce_group_min (__main__.TestMPI) ... ok test_reduce_group_min (__main__.TestMPI) ... ok test_reduce_group_min (__main__.TestMPI) ... ok test_reduce_group_product (__main__.TestMPI) ... ok test_reduce_group_product (__main__.TestMPI) ... ok test_reduce_group_product (__main__.TestMPI) ... ok test_reduce_group_sum (__main__.TestMPI) ... ok test_reduce_group_sum (__main__.TestMPI) ... ok test_reduce_group_sum (__main__.TestMPI) ... ok test_reduce_max (__main__.TestMPI) ... ok test_reduce_max (__main__.TestMPI) ... ok test_reduce_max (__main__.TestMPI) ... ok test_reduce_min (__main__.TestMPI) ... ok test_reduce_min (__main__.TestMPI) ... ok test_reduce_min (__main__.TestMPI) ... ok test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu' test_reduce_product (__main__.TestMPI) ... ok test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu' test_reduce_product (__main__.TestMPI) ... ok test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu' test_reduce_product (__main__.TestMPI) ... ok test_reduce_sum (__main__.TestMPI) ... ok test_reduce_sum (__main__.TestMPI) ... ok test_reduce_sum (__main__.TestMPI) ... ok test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce' test_scatter (__main__.TestMPI) ... ok test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce' test_scatter (__main__.TestMPI) ... ok test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce' test_scatter (__main__.TestMPI) ... ok test_scatter_full_group (__main__.TestMPI) ... ok test_scatter_full_group (__main__.TestMPI) ... ok test_scatter_full_group (__main__.TestMPI) ... ok test_scatter_group (__main__.TestMPI) ... ok test_scatter_group (__main__.TestMPI) ... ok test_scatter_group (__main__.TestMPI) ... ok test_send_recv (__main__.TestMPI) ... ok test_send_recv (__main__.TestMPI) ... ok test_send_recv (__main__.TestMPI) ... ok test_send_recv_any_source (__main__.TestMPI) ... ok test_send_recv_any_source (__main__.TestMPI) ... ok test_send_recv_any_source (__main__.TestMPI) ... ok test_send_recv_with_tag (__main__.TestMPI) ... ok test_send_recv_with_tag (__main__.TestMPI) ... ok test_send_recv_with_tag (__main__.TestMPI) ... ok ---------------------------------------------------------------------- Ran 68 tests in 6.415s OK (skipped=15) ok ---------------------------------------------------------------------- Ran 68 tests in 6.415s OK (skipped=15) ok ---------------------------------------------------------------------- Ran 68 tests in 6.415s OK (skipped=15) Running distributed tests for the nccl backend test_Backend_enum_class (__main__.TestDistBackend) ... ok test_DistributedDataParallel (__main__.TestDistBackend) ... ok test_DistributedDataParallelCPU (__main__.TestDistBackend) ... skipped 'nccl does not support DistributedDataParallelCPU' test_all_gather (__main__.TestDistBackend) ... skipped 'Only MPI supports CPU all gather' test_all_gather_cuda (__main__.TestDistBackend) ... skipped 'CUDA all gather skipped for NCCL' test_all_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_gather_multigpu (__main__.TestDistBackend) ... ok test_all_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_multigpu (__main__.TestDistBackend) ... skipped 'CUDA all_reduce multigpu skipped for NCCL' test_all_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_sum_cuda (__main__.TestDistBackend) ... skipped 'Only Gloo backend will have CUDA allReduce tested' test_barrier (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier' test_barrier_cuda (__main__.TestDistBackend) ... ok test_barrier_full_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier' test_barrier_full_group_cuda (__main__.TestDistBackend) ... ok test_barrier_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier' test_barrier_group_cuda (__main__.TestDistBackend) ... ok test_barrier_timeout_full_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts' test_barrier_timeout_global (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts' test_barrier_timeout_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts' test_broadcast (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_broadcast_cuda (__main__.TestDistBackend) ... ok test_broadcast_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_broadcast_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_broadcast_multigpu (__main__.TestDistBackend) ... skipped 'NCCL broadcast multigpu skipped' test_destroy_full_group (__main__.TestDistBackend) ... ok test_destroy_group (__main__.TestDistBackend) ... ok test_gather (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_get_backend (__main__.TestDistBackend) ... ok test_get_default_group (__main__.TestDistBackend) ... ok test_get_rank (__main__.TestDistBackend) ... ok test_get_rank_size_full_group (__main__.TestDistBackend) ... ok test_get_rank_size_group (__main__.TestDistBackend) ... ok test_irecv (__main__.TestDistBackend) ... skipped 'Nccl does not support irecv' test_isend (__main__.TestDistBackend) ... skipped 'Nccl does not support isend' test_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_multigpu (__main__.TestDistBackend) ... ok test_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_sum_cuda (__main__.TestDistBackend) ... ok test_scatter (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter' test_scatter_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter' test_scatter_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter' test_send_recv (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv' test_send_recv_any_source (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv from any source' test_send_recv_with_tag (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv' ---------------------------------------------------------------------- Ran 68 tests in 69.549s OK (skipped=52) Running distributed tests for the nccl backend with file init_method test_Backend_enum_class (__main__.TestDistBackend) ... ok test_DistributedDataParallel (__main__.TestDistBackend) ... ok test_DistributedDataParallelCPU (__main__.TestDistBackend) ... skipped 'nccl does not support DistributedDataParallelCPU' test_all_gather (__main__.TestDistBackend) ... skipped 'Only MPI supports CPU all gather' test_all_gather_cuda (__main__.TestDistBackend) ... skipped 'CUDA all gather skipped for NCCL' test_all_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_gather_multigpu (__main__.TestDistBackend) ... ok test_all_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_multigpu (__main__.TestDistBackend) ... skipped 'CUDA all_reduce multigpu skipped for NCCL' test_all_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_all_reduce_sum_cuda (__main__.TestDistBackend) ... skipped 'Only Gloo backend will have CUDA allReduce tested' test_barrier (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier' test_barrier_cuda (__main__.TestDistBackend) ... ok test_barrier_full_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier' test_barrier_full_group_cuda (__main__.TestDistBackend) ... ok test_barrier_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier' test_barrier_group_cuda (__main__.TestDistBackend) ... ok test_barrier_timeout_full_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts' test_barrier_timeout_global (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts' test_barrier_timeout_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts' test_broadcast (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_broadcast_cuda (__main__.TestDistBackend) ... ok test_broadcast_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_broadcast_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_broadcast_multigpu (__main__.TestDistBackend) ... skipped 'NCCL broadcast multigpu skipped' test_destroy_full_group (__main__.TestDistBackend) ... ok test_destroy_group (__main__.TestDistBackend) ... ok test_gather (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_get_backend (__main__.TestDistBackend) ... ok test_get_default_group (__main__.TestDistBackend) ... ok test_get_rank (__main__.TestDistBackend) ... ok test_get_rank_size_full_group (__main__.TestDistBackend) ... ok test_get_rank_size_group (__main__.TestDistBackend) ... ok test_irecv (__main__.TestDistBackend) ... skipped 'Nccl does not support irecv' test_isend (__main__.TestDistBackend) ... skipped 'Nccl does not support isend' test_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_multigpu (__main__.TestDistBackend) ... ok test_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors' test_reduce_sum_cuda (__main__.TestDistBackend) ... ok test_scatter (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter' test_scatter_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter' test_scatter_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter' test_send_recv (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv' test_send_recv_any_source (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv from any source' test_send_recv_with_tag (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv' ---------------------------------------------------------------------- Ran 68 tests in 70.381s OK (skipped=52) `` Pull Request resolved: https://github.com/pytorch/pytorch/pull/14690 Differential Revision: D13294169 Pulled By: teng-li fbshipit-source-id: 69ccac34c6c016899bfe8fbc50b48d4bfd1d3876	2018-12-03 12:04:26 -08:00
Pieter Noordhuis	74c3cbc013	Increase test barrier timeout for barrier test (#14689 ) Summary: The CUDA initialization for the participating processes can take long enough for the barrier timeout to trigger on the process that doesn't participate in the group. See #14676. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14689 Reviewed By: teng-li Differential Revision: D13293695 Pulled By: pietern fbshipit-source-id: 6268dc9acfdb22f70c027e5e4be082f7127c0db4	2018-12-02 17:46:17 -08:00
Teng Li	5268dd468c	Fixed DistributedDataParallel cannot kick off all-reduce in a corner case (#14675 ) Summary: Ok, this corner happens for translation guys, and it only happens in the following corner case: (1) when the module is registered a parameter that does not requires grad and (2) this registered parameter has a unique type (say, double, or half) and it's the only unique type such that itself alone will be put into a separate bucket. and (3) it is the last parameter that got registered in the module, such that its bucket reduction is the first to be kicked off. Once this corner case happens, since it does not require grad, the backward hook won't be kicked off. Now that all other buckets are waiting for its bucket to be kicked off, in this case, no bucket will be kicked off since it's blocked by the first bucket (the unique type parameter). This PR fixes two things: (1) Make sure that we will only bucket parameters that requires_grad (2) Make all-reduction checks in the next iteration. As long as we detect the previous iteration's all-reduction has not been fully kicked off, we will issue an error in the next iteration. (3) Also removed some unused variables With this bug fixed, the only case when this error can happen is when the user changed parameters later after wrapping up the module with DDP, like the case in: https://github.com/pytorch/pytorch/issues/12603 Test covered as well Without the first fix, I varied that the repro in fbcode hit this error message: ``` result = self.forward(input, *kwargs) File "/data/users/tengli/fbsource/fbcode/buck-out/dev/gen/language_technology/neural_mt/os/pytorch_translate/train#link-tree/torch/nn/parallel/distributed.py", line 312, in forward raise RuntimeError("Not all gradients are all-reduced from " RuntimeError: Not all gradients are all-reduced from the backward of the previous iteration. This is unexpected and fatal error. Please check and ensure that the model's parameters are not changed after you wrap up the model with DistributedDataParallel. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/14675 Differential Revision: D13291083 Pulled By: teng-li fbshipit-source-id: 2539b699fae843f104b4b8d22721ae82502ba684	2018-12-02 17:13:07 -08:00
Pieter Noordhuis	11ef5191ff	Enable tests for CPU tensors in test_distributed.py (#14572 ) Summary: These were not enabled after adding support in the Gloo backend. The argument checks in ProcessGroupGloo raised an error in two cases: * If the input tensor list to scatter was ``[None]`` on processes other than the source process. * If the output tensor list to gather was ``[None]`` on processes other than the destination process. This commit prepares these arguments explicitly instead of boxing them at the process group call site. This fixes #14536. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14572 Differential Revision: D13272812 Pulled By: pietern fbshipit-source-id: 12cb0d85ec92f175365cbada585260f89330aad8	2018-11-29 21:39:02 -08:00
Pieter Noordhuis	0f62af4ab1	Add timeout kwarg to init_process_group (#14435 ) Summary: This applies to the gloo backend only. Timeout support for the NCCL and MPI backends is tracked in issues #14371 and #14372 respectively. When creating a new process group (either the global one or any subgroup created through `new_group`) you can specify a timeout keyword argument (of type datetime.timedelta). This timeout applies to all collective operations executed against that process group, such that any operation taking longer than the timeout will throw a runtime error. Using a different, better catchable error type is tracked in #14433. This fixes #14376. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14435 Differential Revision: D13234317 Pulled By: pietern fbshipit-source-id: 973993b67994dc64861c0977cbb6f051ec9d87f6	2018-11-28 11:35:01 -08:00
Teng Li	bb301a431d	Make NCCL backend support barrier op (#14142 ) Summary: This is a feature request from: https://github.com/pytorch/pytorch/issues/13573 As the title says, this PR makes NCCL backend support barrier op. There are a couple scenarios that need to be addressed: (1) When there is already a NCCL op happened, we need to record what GPU device(s) the previous op happened and queue the allreduce barrier op on the same GPU device (2) When there is no NCCL op yet, we will try to use a single GPU and separate each process from a single GPU as the best effort. As for the async work, during wait, we would like not just wait on the NCCL kernel to be completed, but also block the thread until the current stream and nccl stream return. `test_distributed` should cover the test. I also manually tested both scenarios. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14142 Differential Revision: D13113391 Pulled By: teng-li fbshipit-source-id: 96c33d4d129e2977e6892d85d0fc449424c35499	2018-11-20 21:12:22 -08:00
Tongzhou Wang	044d00516c	Rename DistBackend -> Backend (#11830 ) Summary: Also add docs for get_backend, Backend, and reduce_op fixes #11803 cc The controller you requested could not be found. pietern apaszke Pull Request resolved: https://github.com/pytorch/pytorch/pull/11830 Differential Revision: D9927991 Pulled By: SsnL fbshipit-source-id: a2ffb70826241ba84264f36f2cb173e00b19af48	2018-11-07 11:58:12 -08:00
sli	9d9e5f8d1e	Solve bug of DistributedDataParallel (#13248 ) Summary: Fixed bug [https://github.com/facebookresearch/maskrcnn-benchmark/issues/52](https://github.com/facebookresearch/maskrcnn-benchmark/issues/52) Pull Request resolved: https://github.com/pytorch/pytorch/pull/13248 Reviewed By: pietern Differential Revision: D12830451 Pulled By: teng-li fbshipit-source-id: ab33faf3f6f4545f8fe07da7ecbeb2f0a2ea23f0	2018-10-29 15:19:55 -07:00
Tongzhou Wang	8ad69a80e3	Test scripts only run cases defined in the running script (#13250 ) Summary: 1. Refactors `TestTorch` into `TestTorchMixin` (subclass of `object`) and `TestTorch` (subclass of `TestCase`, MRO `(TestCase, TestTorchMixin)`, only defined if `__name__ == '__main__'`). So other scripts won't accidentally run it. 2. Adds an assertion in `load_tests` that each script only runs cases defined in itself. cc yf225 ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/13250 Differential Revision: D12823734 Pulled By: SsnL fbshipit-source-id: 7a169f35fe0794ce76e310d8a137d9a3265c012b	2018-10-29 13:57:40 -07:00
Pieter Noordhuis	2a6431ba2d	Use fixed MASTER_PORT in test_distributed (#13109 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13109 The "right" strategy of creating a socket, binding to an undefined port, closing the socket, and reusing the port it was bound to, was subject to a race condition. Another process could bind to that same port sooner than the tests would, causing an "Address already in use" failure when rank 0 would try and bind to that same port. The THD tests have been using a fixed port since forever. Time will tell if this fixes #12876. Differential Revision: D10850614 fbshipit-source-id: c19f12bb4916141187ee8ddb52880f5f418310dc	2018-10-25 08:51:34 -07:00
Pieter Noordhuis	917b203b01	Assert spawned processes terminating in distributed tests (#13071 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13071 In the case where a process got stuck and timed out on joining, we would see a None != 1 assertion error in the code path where the exit statuses are compared. This implies that the first process exited with exit code 1 and another one didn't exit at all. With this commit the error message is more descriptive. Differential Revision: D10785266 fbshipit-source-id: c8cc02d07ea4fdc6f5374afd9a0aac72218fe61d	2018-10-24 16:03:36 -07:00
Teng Li	d120b9af5a	Make c10d pickling/unpickling work (#12694 ) Summary: This fixes the issue for https://github.com/pytorch/pytorch/issues/12168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/12694 Differential Revision: D10468717 Pulled By: teng-li fbshipit-source-id: 3df31d75eea19d6085af665f5350d3cb667a5048	2018-10-19 16:42:36 -07:00
James Sun	f4944f0f8a	Rename test/common.py to test/common_utils.py (#12794 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12794 common.py is used in base_module for almost all tests in test/. The name of this file is so common that can easily conflict with other dependencies if they happen to have another common.py in the base module. Rename the file to avoid conflict. Reviewed By: orionr Differential Revision: D10438204 fbshipit-source-id: 6a996c14980722330be0a9fd3a54c20af4b3d380	2018-10-17 23:04:29 -07:00
Edward Yang	3bfa7258b3	Don't serialize hooks (#11705 ) Summary: Fixes #11683. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/11705 Differential Revision: D9833057 Pulled By: ezyang fbshipit-source-id: 18af9bcd77b088326738d567100fbe4a4c869dd6	2018-10-16 20:11:03 -07:00
Pieter Noordhuis	1c77f9e543	Support torch.distributed.barrier in gloo backend Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11844 Reviewed By: colesbury, SsnL Differential Revision: D9929055 Pulled By: pietern fbshipit-source-id: 3a34a179cb80f495f18aa926c0f9513924737d8e	2018-09-20 09:25:59 -07:00
Tongzhou Wang	540ef9b1fc	Add distributed get_backend (#11715 ) Summary: I have no idea how to run distributed tests locally so I'll let CI do this. Hopefully everything still works with `IntEnum`. cc mcarilli Pull Request resolved: https://github.com/pytorch/pytorch/pull/11715 Reviewed By: pietern Differential Revision: D9889646 Pulled By: SsnL fbshipit-source-id: 1e2a487cb6fe0bd4cc67501c9d72a295c35693e2	2018-09-18 10:56:24 -07:00
Pieter Noordhuis	24a8c13f36	Add barrier to fix distributed test flakiness (#11775 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11775 This should fix #11582. Reviewed By: ezyang Differential Revision: D9885546 fbshipit-source-id: 3544f42ebe8b595cdf6941859c67484d3ea9b3f8	2018-09-17 17:31:45 -07:00
Pieter Noordhuis	7535d98ec4	Add message tag parameter to send/recv Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11490 Reviewed By: teng-li Differential Revision: D9828116 Pulled By: pietern fbshipit-source-id: 98be1ae84b6763ffb329e63c030c5e3ec0e748b7	2018-09-14 10:55:37 -07:00
Tongzhou Wang	02c4cd3c8a	Skip flaky distributed tests (#11594 ) Summary: context: https://github.com/pytorch/pytorch/issues/11582 cc pietern The controller you requested could not be found. Pull Request resolved: https://github.com/pytorch/pytorch/pull/11594 Differential Revision: D9798871 Pulled By: SsnL fbshipit-source-id: 9f9e1871c7fd9505ca898865eb8068fab4d3416d	2018-09-12 14:57:57 -07:00
Wei Yang	54107ae8cf	convert output_device at data_parallel from torch.device to index (#10189 ) Summary: - fixes #9984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/10189 Differential Revision: D9545390 Pulled By: weiyangfb fbshipit-source-id: 3a6a705437553ba319e9fd4b7f676ff73857a27e	2018-09-11 20:27:07 -07:00
Teng Li	0988bbad2d	C10d release to torch.distributed for PT1 (#11405 ) Summary: The old `torch.distributed` will go to `torch.distributed.deprecated` The old DDP will go to `torch.nn.parallel.deprecated` Now `torch.nn.parallel.DDP` will use c10d DDP Now `torch.distributed` will use C10d frontend API Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405 Reviewed By: pietern Differential Revision: D9733733 Pulled By: teng-li fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08	2018-09-10 23:27:22 -07:00
Pieter Noordhuis	3d2862526b	Support send/recv for the gloo process group (#11387 ) Summary: This change removes the skips for the existing send/recv tests in the backwards compatibility layer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/11387 Reviewed By: teng-li Differential Revision: D9729330 Pulled By: pietern fbshipit-source-id: f8899219a94d806386d03e9ef53bff622d8658a3	2018-09-07 20:25:18 -07:00
Teng Li	576807ce1a	flaky test fix trial (#11391 ) Summary: Add a barrier() to wait for all PG created before destroy Pull Request resolved: https://github.com/pytorch/pytorch/pull/11391 Differential Revision: D9727383 Pulled By: teng-li fbshipit-source-id: 689d62c978e642b68f4949dcf29982e34869ada4	2018-09-07 14:10:06 -07:00
Teng Li	7726b36489	Full-fledged group testings and fixes for c10d frontend APIs (#11318 ) Summary: Fixed a few bugs that were not tested in the c10d frontend APIs, including get_rank, get_world_size, and destroy_process_group of a given group. These APIs are added to the CI tests. Also added all the group related tests, including full-group, and partial groups (existing ones), since both will hit different code paths. Also removed experimental APIs for c10d initially used in DDP, now we don't use it anyway. Pull Request resolved: https://github.com/pytorch/pytorch/pull/11318 Reviewed By: pietern Differential Revision: D9675896 Pulled By: teng-li fbshipit-source-id: a2eac2c57933effa2d139855f786e64919a95bfc	2018-09-06 18:26:11 -07:00
Teng Li	220c9e52b9	Distributed Data Parallel CPU module for C10D (#11168 ) Summary: Distributed Data Parallel CPU module for c10d. This is basically the same code as Distributed Data Parallel CPU module for THD, since c10d now has the exact same front-end interface as torch.distributed. We will keep both in the first release and remove the THD one once c10d is stable enough. Test fully covered just as THD too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/11168 Differential Revision: D9674963 Pulled By: teng-li fbshipit-source-id: ecf52a7189374ca7930c2be305218167fdd822a7	2018-09-05 21:59:31 -07:00
Pieter Noordhuis	9a0effb92c	Update send/recv tests to reflect intended use (#11275 ) Summary: The existing tests had every rank run send to every other rank and only then switch to recv mode. This only works if the send operations are non-blocking and the passed tensors are immediately copied to some kind of send buffer. Instead, every send must be matched with a recv on the other side, because from the API perspective they may block. E.g. imagine a 1GB tensor being sent to every other rank. It can only go through if there is a recv on the other side, or it will deadlock. This change reflects this in the send/recv unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/11275 Differential Revision: D9658197 Pulled By: pietern fbshipit-source-id: fb6a3fc03b42343a9dfeed0def30d94914e76974	2018-09-05 14:40:04 -07:00
Teng Li	3791bd12c8	PT1 Release Milestone No.2 MPI Group Support with all tests passed (#11128 ) Summary: Added MPI group support. And this will make all previous group test cases of MPI passed. Also, release the MPI thread level support by serializing different PG's MPI ops. This is required. The build is fixed too Pull Request resolved: https://github.com/pytorch/pytorch/pull/11128 Differential Revision: D9602188 Pulled By: teng-li fbshipit-source-id: 1d618925ae5fb7b47259b23051cc181535aa7497	2018-08-31 12:39:56 -07:00
Teng Li	56539f5fe1	PT1 Distributed Release MileStone No.1 - Completed Distributed Package and CI tests (#10871 ) Summary: The PR includes: (1) torch.distributed.c10d, which now includes the complete backward compatible frontend API for `torch.distributed` (2) `env://` init method functionality (3) Minor change to `test_distributed.py`, which is now a test for `torch.distributed.c10d`. (4) The old `test_distributed.py' is now moved to `test_distributed_thd` (5) Miscellaneous bug fixes. (6) DDP CPU test is removed since c10d doesn't have this support yet, but this is a very easy test after moving DDP CPU's dependency to torch.distributed.c10d. (7) CI config to test MPI, NCCL, and Gloo backend of c10d Now all the distributed test including c10d DDP can pass with the c10d frontend API TODO: (in a separate PR) MPI subgroup support, once this is added, CI group test will be enabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/10871 Differential Revision: D9554514 Pulled By: teng-li fbshipit-source-id: fb686ad42258526c8b4372148e82969fac4f42dd	2018-08-29 12:55:57 -07:00
Edward Yang	51f154e072	Fix Python lint errors. (#10441 ) Summary: Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/10441 Reviewed By: Yangqing Differential Revision: D9285502 Pulled By: ezyang fbshipit-source-id: 12c94b28bee9cade930c8f260577e81ea1915269	2018-08-11 21:08:50 -07:00
Jason Gauci	31646edfff	Increase GLOO rendevous timeout Summary: Increase GLOO rendevous timeout Reviewed By: teng-li Differential Revision: D9273544 fbshipit-source-id: 5c22c1d18df3032f019ff12e2a720aea7c390f15	2018-08-10 18:40:18 -07:00
anderspapitto	48e90e3339	Build system changes (#8627 ) * All changes needed to get rid of process_github.sh * allow thnn_h_path	2018-06-20 17:45:26 -04:00
Tongzhou Wang	85ee94b7be	Add memory leak check in CUDA tests (#7270 ) * Add memory leak check in CUDA tests * Tracking multi-GPU too * fix run_test.py not running __name__ == '__main__' content; add test for make_cuda_memory_checked_test * add a comment * skip if cuda * 1. Change the wrapper to a method in common.py:TestCase 2. Refactor common constants/method that initialize CUDA context into common_cuda.py 3. Update some test files to use TEST_CUDA and TEST_MULTIGPU * Fix MaxUnpool3d forward memory leak * Fix MultiLabelMarginCriterion forward memory leak * Fix MultiMarginLoss backward memory leak * default doCUDAMemoryCheck to False * make the wrapper skip-able * use TEST_MULTIGPU * add align_corners=True/False tests for Upsample; fix TEST_CUDNN * finalize interface * VolumetricMaxUnpooling_updateOutput * fix test_nccl * rename THC caching allocator methods to be clearer * make the wrapped function a method * address comments; revert changes to aten/src/THC/THCCachingAllocator.cpp * fix renamed var	2018-05-31 15:09:54 -04:00
Adam Paszke	3238db6247	Show skipped distributed tests as skipped (#7624 ) Previously, tests that have been skipped because their backend was missing would show up as succeeded, which has been very confusing.	2018-05-17 00:23:46 +02:00
xhzhao	f2c9975378	Add DistributedDataParallelCPU (#5919 )	2018-04-17 15:36:47 +02:00
Ailing	30157971f0	Update dist test to use multi gpus (#6337 ) * update dist test to use multi gpus * add nccl to jenkins * address comment * make lint happy * convert range object to list	2018-04-16 14:10:27 -04:00
Simeon Monov	9b111f1a88	Fix worldsize use in test_distributed with MPI backend (#6301 ) WORLD_SIZE is not used for MPI tests and the check fails for the group tests	2018-04-05 09:28:53 -04:00
Ailing	2f64e1cdf6	Add second iteration in test_DistributedDataParallel (#5830 )	2018-03-18 00:27:45 +01:00
Myle Ott	f5f6258288	Enable additional tensor types in Gloo backend (#5483 )	2018-03-15 14:53:24 +01:00
Ailing	92596197fc	add end to end test for DistributedDataParallel (#5182 ) * add end to end test for DistributedDataParallel * address comments * skip subgroup tests when less than 3 processes * set process number based on available gpus * add single gpu;cleanup WORLD_SIZE * fix comments	2018-03-08 22:07:34 -05:00
Ailing	ff3f689239	Add mote tests for Nccl backend (#4796 )	2018-01-28 12:36:59 +01:00
Teng Li	926ed2b280	Implemented NCCL Distributed Backend for PyTorch with new dist APIs (#3435 ) * Implemented NCCL Distributed Backend for PyTorch with new dist APIs * Let FindNCCL to determine the NCCL version * Let NCCL2 Backend use ATEN instead deprecated THPP * Let distributed parallel model use a single reduction thread for NCCL backend * Caching the sockets, bug fix, refactoring, and addressed Adam's comments * Make BcastNcclID take a single param and bug fix for all_gather * Removed barrier function, added warning for users, and not exposing experimental func to users * Use the simplest single bucket working solution for distriubted data parallel model with rebase * Cleanup, fixes and further addressed Adam's comments * Used PySequence_Fast in distributed csrc * Removed the limitation that each group is only bound to a given device sequence * Used THPObjectPtr for PySequence_Fast	2017-11-29 15:57:02 -05:00
Adam Paszke	2a8603c5e1	Make distributed recv return sender rank	2017-09-25 12:11:52 -04:00
Soumith Chintala	674e1f2ba1	increase test subprocess timeout	2017-08-27 21:11:08 -04:00
gchanan	5b8e2ad2a6	test_distributed cuda tests don't skip if cuda not available. (#2476 ) test_distributed cuda tests don't skip if cuda not available.	2017-08-17 17:45:32 -04:00
gchanan	0985eaf373	Add ability to specify init_method for test_distributed. (#2465 ) * Add ability to specify init_method for test_distributed. * Move init_method specification to test run line. * Run for gloo tests as well. * Better status message for gloo test.	2017-08-16 17:04:21 -04:00

1 2

63 Commits