* Implemented NCCL Distributed Backend for PyTorch with new dist APIs (a usage sketch follows this list)
* Let FindNCCL determine the NCCL version
* Let the NCCL2 backend use ATen instead of the deprecated THPP
* Let the distributed data parallel model use a single reduction thread for the NCCL backend
* Cached the sockets, fixed a bug, refactored, and addressed Adam's comments
* Made BcastNcclID take a single parameter and fixed a bug in all_gather
* Removed the barrier function, added a warning for users, and stopped exposing experimental functions
* Used the simplest working single-bucket solution for the distributed data parallel model, with a rebase (see the DistributedDataParallel sketch after this list)
* Cleaned up, applied fixes, and further addressed Adam's comments
* Used PySequence_Fast in distributed csrc
* Removed the limitation that each group is only bound to a given device sequence
* Used THPObjectPtr for PySequence_Fast
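
For context, a minimal usage sketch (not code from this PR) of selecting the NCCL backend through the `torch.distributed` APIs and running an `all_reduce` across ranks; the rendezvous address, world size, and rank below are placeholder values:

```python
# Minimal sketch, assuming a multi-process launch with one GPU per process.
import torch
import torch.distributed as dist

dist.init_process_group(
    backend='nccl',                       # select the NCCL backend
    init_method='tcp://127.0.0.1:23456',  # placeholder rendezvous address
    world_size=2,                         # placeholder number of processes
    rank=0,                               # placeholder rank of this process
)

tensor = torch.ones(10).cuda()  # NCCL operates on CUDA tensors
dist.all_reduce(tensor)         # in-place sum of `tensor` across all ranks
```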
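
And a minimal sketch, assuming `init_process_group` was already called as above, of wrapping a single-GPU model in `DistributedDataParallel` so gradients are reduced across processes by the NCCL backend; the model and tensor shapes are arbitrary placeholders:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

torch.cuda.set_device(0)           # placeholder: each process should pick a distinct GPU
model = nn.Linear(10, 10).cuda()   # any nn.Module on a single device works the same way
ddp_model = DistributedDataParallel(model)

inputs = torch.randn(4, 10).cuda()
loss = ddp_model(inputs).sum()
loss.backward()                    # gradients are reduced across ranks during backward
```

The single reduction thread and single-bucket scheme mentioned in the list refer to how these backward-pass reductions are scheduled internally when the NCCL backend is in use.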