Commit Graph

71 Commits

Author SHA1 Message Date
Myle Ott
f5f6258288 Enable additional tensor types in Gloo backend (#5483) 2018-03-15 14:53:24 +01:00
Ailing
92596197fc add end-to-end test for DistributedDataParallel (#5182)
* add end-to-end test for DistributedDataParallel

* address comments

* skip subgroup tests when less than 3 processes

* set process count based on available GPUs

* add single-GPU case; clean up WORLD_SIZE

* fix comments
2018-03-08 22:07:34 -05:00
Ailing
ff3f689239 Add more tests for NCCL backend (#4796) 2018-01-28 12:36:59 +01:00
Teng Li
926ed2b280 Implemented NCCL Distributed Backend for PyTorch with new dist APIs (#3435)
* Implemented NCCL Distributed Backend for PyTorch with new dist APIs

* Let FindNCCL determine the NCCL version

* Let NCCL2 backend use ATen instead of the deprecated THPP

* Let distributed parallel model use a single reduction thread for NCCL backend

* Caching the sockets, bug fix, refactoring, and addressed Adam's comments

* Make BcastNcclID take a single param and fix a bug in all_gather

* Removed the barrier function, added a warning for users, and stopped exposing experimental functions to users

* Use the simplest working single-bucket solution for the distributed data parallel model, with rebase

* Cleanup, fixes and further addressed Adam's comments

* Used PySequence_Fast in distributed csrc

* Removed the limitation that each group is only bound to a given device sequence

* Used THPObjectPtr for PySequence_Fast
2017-11-29 15:57:02 -05:00
Adam Paszke
2a8603c5e1 Make distributed recv return sender rank 2017-09-25 12:11:52 -04:00
Soumith Chintala
674e1f2ba1 increase test subprocess timeout 2017-08-27 21:11:08 -04:00
gchanan
5b8e2ad2a6 test_distributed cuda tests don't skip if cuda not available. (#2476)
2017-08-17 17:45:32 -04:00
gchanan
0985eaf373 Add ability to specify init_method for test_distributed. (#2465)
* Add ability to specify init_method for test_distributed.

* Move init_method specification to test run line.

* Run for gloo tests as well.

* Better status message for gloo test.
2017-08-16 17:04:21 -04:00
Adam Paszke
8915e2710c Refactor scatter/gather and add distributed docs 2017-07-12 14:47:36 -04:00
lynic
ebdec9a837 Skip distributed tests if not supported (#2004) 2017-07-07 11:06:56 -04:00
Adam Paszke
d9d50f80c7 Rename arguments to distributed collectives 2017-06-12 22:02:11 -04:00
Adam Paszke
714351ff39 Officially enable process-group mode 2017-06-12 22:02:11 -04:00
Adam Paszke
4ebf3ff46d Add base for CUDA allReduce and broadcast in DataChannelGloo 2017-05-01 01:49:10 -07:00
Janusz Marcinkiewicz
f07f13c6e9 Change Store exception handling 2017-05-01 01:49:10 -07:00
Janusz Marcinkiewicz
310d08c37b Fix store and all operations 2017-05-01 01:49:10 -07:00
Janusz Marcinkiewicz
2b340e7d50 Add python tests; Remove broken prefix store creation 2017-05-01 01:49:09 -07:00
Adam Paszke
a3e11d606b Fix linter errors 2017-01-31 01:58:09 +01:00
Janusz Marcinkiewicz
962084c8e8 Add Data Channel receive from any source (#52) 2017-01-31 01:58:09 +01:00
Janusz Marcinkiewicz
76520512e7 DataChannel tests rewrite (#42); DataChannel isend and irecv implementation (#44) 2017-01-31 01:58:09 +01:00
Janusz Marcinkiewicz
ac1f68127a Add barrier, scatter, gather and allGather implementations + groups (#34) 2017-01-31 01:58:09 +01:00
Mateusz Piotrowski
3e3501c98d Integration tests of the THD Python interface (#28) 2017-01-31 01:58:09 +01:00