* add end to end test for DistributedDataParallel
* address comments
* skip subgroup tests when less than 3 processes
* set process number based on available gpus
* add single gpu; clean up WORLD_SIZE
* fix comments
* Implemented NCCL Distributed Backend for PyTorch with new dist APIs
* Let FindNCCL determine the NCCL version
* Let NCCL2 Backend use ATen instead of the deprecated THPP
* Let distributed parallel model use a single reduction thread for NCCL backend
* Cached the sockets, fixed bugs, refactored, and addressed Adam's comments
* Make BcastNcclID take a single param and fix a bug in all_gather
* Removed the barrier function, added a warning for users, and stopped exposing experimental functions to users
* Use the simplest working single-bucket solution for the distributed data parallel model, with rebase
* Cleanup, fixes and further addressed Adam's comments
* Used PySequence_Fast in distributed csrc
* Removed the limitation that each group is only bound to a given device sequence
* Used THPObjectPtr for PySequence_Fast
* Add ability to specify init_method for test_distributed.
* Move init_method specification to test run line.
* Run for gloo tests as well.
* Better status message for gloo test.
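
For reference, the sketch below shows roughly how the NCCL backend and DistributedDataParallel added by the commits above might be exercised end to end. It is an illustrative sketch only, not code from this PR: the TCP address and port are placeholders, and reading `RANK`/`WORLD_SIZE` from environment variables is an assumption about how the processes are launched.

```python
# Minimal sketch of one process per GPU using the NCCL backend.
# Placeholder assumptions: RANK/WORLD_SIZE env vars and the tcp:// address below.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

# init_method can also be supplied on the test run line, as in the test commits above.
dist.init_process_group(
    backend="nccl",
    init_method="tcp://127.0.0.1:23456",  # placeholder address/port
    world_size=world_size,
    rank=rank,
)

# Bind this process to a single GPU and wrap the model for gradient reduction.
torch.cuda.set_device(rank % torch.cuda.device_count())
model = torch.nn.Linear(10, 10).cuda()
ddp_model = DistributedDataParallel(model, device_ids=[torch.cuda.current_device()])
```

The same `init_method` string is what the test commits move onto the test run line, so both the gloo and NCCL variants of test_distributed can be pointed at a rendezvous of the caller's choosing.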