* Implemented NCCL Distributed Backend for PyTorch with new dist APIs
* Let FindNCCL to determine the NCCL version
* Let NCCL2 Backend use ATEN instead deprecated THPP
* Let distributed parallel model use a single reduction thread for NCCL backend
* Caching the sockets, bug fix, refactoring, and addressed Adam's comments
* Make BcastNcclID take a single param and bug fix for all_gather
* Removed barrier function, added warning for users, and not exposing experimental func to users
* Use the simplest single bucket working solution for distriubted data parallel model with rebase
* Cleanup, fixes and further addressed Adam's comments
* Used PySequence_Fast in distributed csrc
* Removed the limitation that each group is only bound to a given device sequence
* Used THPObjectPtr for PySequence_Fast
* enable size from ATen type
* temp commit aten thd
* port copy, math
* port random
* changes after rebase
* lapack bind
* thd and csrc compile
* fix min/max reductions in DataChannelTCP
* clean up changes
* re-enable tensor constructors
* port MPI to at::Tensor
* fix storage methods to not cast to thpp storage ptrs
In many "non-Python" headers, we include Python.h because we need
to declare a pointer to PyObject, and solely because of that. It
would be a lot better if we had a simpler version of Python.h that
just declared PyObject available for pointers, without anything
else. This is what torch/csrc/utils/python_stub.h does.
The good thing about not including Python.h is that it is easy to
be warning-less; no more ugly insertions of Python.h on headers
where it has no good reason to be.
This makes PyTorch warning clean again.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>