Commit Graph

26 Commits

Author SHA1 Message Date
Christian Sarofeen
6db9f6dc78 Enable half communication for distributed (#4091) 2017-12-13 13:00:12 +01:00
Teng Li
926ed2b280 Implemented NCCL Distributed Backend for PyTorch with new dist APIs (#3435)
* Implemented NCCL Distributed Backend for PyTorch with new dist APIs

* Let FindNCCL to determine the NCCL version

* Let NCCL2 Backend use ATEN instead deprecated THPP

* Let distributed parallel model use a single reduction thread for NCCL backend

* Caching the sockets, bug fix, refactoring, and addressed Adam's comments

* Make BcastNcclID take a single param and bug fix for all_gather

* Removed barrier function, added warning for users, and not exposing experimental func to users

* Use the simplest single bucket working solution for distriubted data parallel model with rebase

* Cleanup, fixes and further addressed Adam's comments

* Used PySequence_Fast in distributed csrc

* Removed the limitation that each group is only bound to a given device sequence

* Used THPObjectPtr for PySequence_Fast
2017-11-29 15:57:02 -05:00
Trevor Killeen
b544882335 ATen in THD (Part I) (#2288)
* enable size from ATen type

* temp commit aten thd

* port copy, math

* port random

* changes after rebase

* lapack bind

* thd and csrc compile

* fix min/max reductions in DataChannelTCP

* clean up changes

* re-enable tensor constructors

* port MPI to at::Tensor

* fix storage methods to not cast to thpp storage ptrs
2017-11-01 09:59:02 -04:00
Edward Z. Yang
3696300fcf Include Python.h less using a new stub header.
In many "non-Python" headers, we include Python.h because we need
to declare a pointer to PyObject, and solely because of that.  It
would be a lot better if we had a simpler version of Python.h that
just declared PyObject available for pointers, without anything
else.  This is what torch/csrc/utils/python_stub.h does.

The good thing about not including Python.h is that it is easy to
be warning-less; no more ugly insertions of Python.h on headers
where it has no good reason to be.

This makes PyTorch warning clean again.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-19 23:04:19 -04:00
Adam Paszke
2a8603c5e1 Make distributed recv return sender rank 2017-09-25 12:11:52 -04:00
Adam Paszke
714351ff39 Officially enable process-group mode 2017-06-12 22:02:11 -04:00
Adam Paszke
b37f18be53 Free GIL when entering THD functions 2017-06-12 21:58:38 -04:00
Adam Paszke
095ddc7d08 THD updates and bug fixes
* Add keepdim
* Fix DataChannel signature
* Fix incorrect locking
* Use current stream in DataChannelGloo
2017-06-12 21:58:38 -04:00
Adam Paszke
c6c9e61169 Implement THD tensor copies 2017-06-02 23:42:11 +02:00
Janusz Marcinkiewicz
c41555fb0a Add rank parameter; Fix MW mode initalization 2017-06-02 23:42:11 +02:00
Adam Paszke
447d9287bf Refactor multicast and change env init method 2017-06-02 23:42:11 +02:00
Janusz Marcinkiewicz
e685277299 Add address discovery; Bug fixes; 2017-06-02 23:42:11 +02:00
Adam Paszke
8ea7c87c29 Improve init methods 2017-06-02 23:42:11 +02:00
Janusz Marcinkiewicz
09c0d9c51c Add multiple initalization methods for DataChannels 2017-06-02 23:42:11 +02:00
Adam Paszke
181d2f41bd Add initial Python wrappers for THDTensors 2017-06-02 23:42:11 +02:00
Adam Paszke
4ebf3ff46d Add base for CUDA allReduce and broadcast in DataChannelGloo 2017-05-01 01:49:10 -07:00
Adam Paszke
7e8830c3d5 Initial gloo bindings 2017-05-01 01:49:09 -07:00
Sam Gross
4c1cdb6148 Refactor Python string utility function 2017-04-28 21:25:26 +02:00
Adam Paszke
79232c24e2 Fixes after rebase 2017-01-31 01:58:09 +01:00
Janusz Marcinkiewicz
962084c8e8 Add Data Channel receive from any source (#52) 2017-01-31 01:58:09 +01:00
Janusz Marcinkiewicz
76520512e7 DataChannel tests rewrite (#42); DataChannel isend and irecv implementation (#44) 2017-01-31 01:58:09 +01:00
Janusz Marcinkiewicz
ac1f68127a Add barrier, scatter, gather and allGather implementations + groups (#34) 2017-01-31 01:58:09 +01:00
Adam Paszke
60d1852c7b Major improvements to master-worker mode
* Fixed all undefined symbol errors
* Implemented storage interface and THStorage class
* RPC improvements
* Code refactor
2017-01-31 01:58:09 +01:00
Adam Paszke
ea876eb6d5 Add initial bindings for master-worker mode 2017-01-31 01:58:09 +01:00
Janusz Marcinkiewicz
5e6fcd02b5 Implement data channel groups (#25) 2017-01-31 01:58:09 +01:00
Adam Paszke
55632d81d2 Add Python wrappers for process group mode 2017-01-31 01:58:09 +01:00