Summary:
The PR includes:
(1) `torch.distributed.c10d`, which now includes the complete backward-compatible frontend API for `torch.distributed` (a minimal usage sketch follows the commit trailers below)
(2) `env://` init method functionality
(3) Minor change to `test_distributed.py`, which is now a test for `torch.distributed.c10d`.
(4) The old `test_distributed.py` has been moved to `test_distributed_thd`.
(5) Miscellaneous bug fixes.
(6) The DDP CPU test is removed since c10d doesn't support it yet; it will be easy to re-add once DDP CPU's dependency is moved to `torch.distributed.c10d`.
(7) CI config to test the MPI, NCCL, and Gloo backends of c10d.
**All distributed tests, including c10d DDP, now pass with the c10d frontend API.**
TODO (in a separate PR):
MPI subgroup support; once this is added, the CI group test will be enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10871
Differential Revision: D9554514
Pulled By: teng-li
fbshipit-source-id: fb686ad42258526c8b4372148e82969fac4f42dd
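For reference, here is a minimal sketch (not part of this diff) of driving the backward-compatible frontend API with the `env://` init method; the single-process rendezvous values below are illustrative assumptions, not values taken from the PR:

```python
import os
import torch
import torch.distributed as dist

# Illustrative single-process rendezvous settings (assumptions for this sketch).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# env:// reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the environment.
dist.init_process_group(backend="gloo", init_method="env://")

t = torch.ones(4)
dist.all_reduce(t)  # sums the tensor across all ranks (a no-op reduction with one rank)
print(dist.get_rank(), dist.get_world_size(), t)

dist.destroy_process_group()
```

Swapping `"gloo"` for `"nccl"` or `"mpi"` exercises the other backends covered by the CI config, subject to the usual GPU and launcher requirements.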
* Add memory leak check in CUDA tests (a rough sketch of this kind of check appears after this group of commits)
* Tracking multi-GPU too
* fix run_test.py not running __name__ == '__main__' content; add test for make_cuda_memory_checked_test
* add a comment
* skip if cuda
* 1. Change the wrapper to a method in common.py:TestCase
2. Refactor common constants/methods that initialize the CUDA context into common_cuda.py
3. Update some test files to use TEST_CUDA and TEST_MULTIGPU
* Fix MaxUnpool3d forward memory leak
* Fix MultiLabelMarginCriterion forward memory leak
* Fix MultiMarginLoss backward memory leak
* default doCUDAMemoryCheck to False
* make the wrapper skippable
* use TEST_MULTIGPU
* add align_corners=True/False tests for Upsample; fix TEST_CUDNN
* finalize interface
* VolumetricMaxUnpooling_updateOutput
* fix test_nccl
* rename THC caching allocator methods to be clearer
* make the wrapped function a method
* address comments; revert changes to aten/src/THC/THCCachingAllocator.cpp
* fix renamed var
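The commits above describe adding a CUDA memory leak check to the test harness: a wrapper on the common TestCase that compares per-device allocated memory before and after a test, guarded by a `doCUDAMemoryCheck` flag and the TEST_CUDA / TEST_MULTIGPU constants. The following is a rough sketch of what such a check could look like; the method name and details are assumptions, not the actual implementation in common.py:

```python
import unittest
import torch

# Mirrors the constants mentioned above; the exact definitions in
# common_cuda.py may differ -- these are assumptions for the sketch.
TEST_CUDA = torch.cuda.is_available()
TEST_MULTIGPU = TEST_CUDA and torch.cuda.device_count() >= 2


class TestCase(unittest.TestCase):
    # Off by default, matching the "default doCUDAMemoryCheck to False" commit.
    doCUDAMemoryCheck = False

    def checkCUDAMemoryLeak(self, fn):
        """Hypothetical helper: run fn and verify per-device allocated memory is unchanged."""
        if not (self.doCUDAMemoryCheck and TEST_CUDA):
            fn()  # skippable: just run the test when the check is disabled
            return
        devices = range(torch.cuda.device_count())  # track multi-GPU too
        before = [torch.cuda.memory_allocated(d) for d in devices]
        fn()
        torch.cuda.synchronize()
        after = [torch.cuda.memory_allocated(d) for d in devices]
        self.assertEqual(before, after, "CUDA memory leaked by the wrapped test")
```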
* add end to end test for DistributedDataParallel
* address comments
* skip subgroup tests when less than 3 processes
* set process number based on available gpus
* add single GPU; clean up WORLD_SIZE
* fix comments
* Implemented NCCL Distributed Backend for PyTorch with new dist APIs
* Let FindNCCL determine the NCCL version
* Let the NCCL2 backend use ATen instead of the deprecated THPP
* Let distributed parallel model use a single reduction thread for NCCL backend
* Cached the sockets, fixed a bug, refactored, and addressed Adam's comments
* Make BcastNcclID take a single param and fix a bug in all_gather
* Removed the barrier function, added a warning for users, and stopped exposing the experimental function to users
* Use the simplest working single-bucket solution for the distributed data parallel model, with rebase
* Cleanup, fixes, and further addressing of Adam's comments
* Used PySequence_Fast in distributed csrc
* Removed the limitation that each group is only bound to a given device sequence
* Used THPObjectPtr for PySequence_Fast
* Add ability to specify init_method for test_distributed (see the sketch at the end of this list).
* Move init_method specification to test run line.
* Run for Gloo tests as well.
* Better status message for Gloo test.
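The last four commits make the rendezvous configurable from the test run line rather than hard-coded in the test. Below is a minimal sketch of how a distributed test might pick up the backend and init_method from its environment; the variable names are assumptions for illustration, not necessarily the exact ones used by `test_distributed.py`:

```python
import os
import torch.distributed as dist

# Hypothetical environment-driven configuration (names are assumptions).
BACKEND = os.environ.get("BACKEND", "gloo")
INIT_METHOD = os.environ.get("INIT_METHOD", "env://")
WORLD_SIZE = int(os.environ.get("WORLD_SIZE", "1"))
RANK = int(os.environ.get("RANK", "0"))

# Note: env:// additionally needs MASTER_ADDR and MASTER_PORT in the environment;
# a file:// or tcp:// init_method carries the rendezvous address itself.
dist.init_process_group(backend=BACKEND,
                        init_method=INIT_METHOD,
                        world_size=WORLD_SIZE,
                        rank=RANK)
```

With this shape, an illustrative run line such as `BACKEND=gloo WORLD_SIZE=3 INIT_METHOD=file:///tmp/shared_init python test_distributed.py` selects both the backend and the init method without touching the test code.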