Commit Graph

34 Commits

Teng Li
220c9e52b9 Distributed Data Parallel CPU module for C10D (#11168)
Summary:
Distributed Data Parallel CPU module for c10d. This is basically the same code as the Distributed Data Parallel CPU module for THD, since c10d now has the exact same front-end interface as torch.distributed.

We will keep both in the first release and remove the THD one once c10d is stable enough.

Test coverage is the same as for the THD version.
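
A minimal usage sketch of the front end this refers to; it uses the existing THD module path `torch.nn.parallel.DistributedDataParallelCPU` as the reference interface, since the c10d module is described as sharing the same front end (the exact import path of the c10d version is not shown here):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
# Sketch: uses the THD-era wrapper as the reference interface; the c10d
# module is described above as exposing the same front end.
from torch.nn.parallel import DistributedDataParallelCPU

dist.init_process_group(backend="gloo", init_method="env://")

model = nn.Linear(10, 10)                      # plain CPU model
ddp_model = DistributedDataParallelCPU(model)  # gradients synchronized across ranks

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
criterion = nn.MSELoss()

inputs, targets = torch.randn(4, 10), torch.randn(4, 10)
loss = criterion(ddp_model(inputs), targets)
loss.backward()   # gradient synchronization is tied to the backward pass
optimizer.step()
```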
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11168

Differential Revision: D9674963

Pulled By: teng-li

fbshipit-source-id: ecf52a7189374ca7930c2be305218167fdd822a7
2018-09-05 21:59:31 -07:00
Pieter Noordhuis
9a0effb92c Update send/recv tests to reflect intended use (#11275)
Summary:
The existing tests had every rank send to every other rank and only
then switch to receive mode. This only works if the send operations are
non-blocking and the passed tensors are immediately copied to some kind
of send buffer. Instead, every send must be matched with a recv on the
other side, because from the API perspective they may block.

For example, imagine a 1GB tensor being sent to every other rank. The send can
only go through if there is a matching recv on the other side; otherwise it deadlocks.

This change updates the send/recv unit tests to reflect that.
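
A minimal sketch of the matched pattern this implies, using the `torch.distributed` send/recv API in a ring exchange (assumes the process group is already initialized):

```python
import torch
import torch.distributed as dist

# Each rank sends to its successor and receives from its predecessor.
# Alternating who sends first guarantees every (possibly blocking) send
# has a matching recv posted on the peer, so nothing deadlocks.
rank = dist.get_rank()
world_size = dist.get_world_size()

send_buf = torch.full((1024,), float(rank))
recv_buf = torch.zeros(1024)

next_rank = (rank + 1) % world_size
prev_rank = (rank - 1) % world_size

if rank % 2 == 0:
    dist.send(send_buf, dst=next_rank)
    dist.recv(recv_buf, src=prev_rank)
else:
    dist.recv(recv_buf, src=prev_rank)
    dist.send(send_buf, dst=next_rank)
```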
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11275

Differential Revision: D9658197

Pulled By: pietern

fbshipit-source-id: fb6a3fc03b42343a9dfeed0def30d94914e76974
2018-09-05 14:40:04 -07:00
Teng Li
3791bd12c8 PT1 Release Milestone No.2 MPI Group Support with all tests passed (#11128)
Summary:
Added MPI group support, which makes all previously failing MPI group test cases pass.

Also, the required MPI thread support level is relaxed by serializing MPI ops across different process groups; this is required.

The build is fixed as well.
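
A minimal sketch of the group usage this enables with the MPI backend (assumes the processes are launched with `mpirun`, so rank and world size come from MPI):

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")  # rank/world size come from mpirun

# Sub-group collectives now also work with the MPI backend.
group = dist.new_group(ranks=[0, 1])

t = torch.ones(4) * dist.get_rank()
if dist.get_rank() in (0, 1):
    dist.all_reduce(t, group=group)  # sums t across ranks 0 and 1 only
```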
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11128

Differential Revision: D9602188

Pulled By: teng-li

fbshipit-source-id: 1d618925ae5fb7b47259b23051cc181535aa7497
2018-08-31 12:39:56 -07:00
Teng Li
56539f5fe1 PT1 Distributed Release Milestone No.1 - Completed Distributed Package and CI tests (#10871)
Summary:
The PR includes:
(1) torch.distributed.c10d, which now includes the complete backward compatible frontend API for `torch.distributed`
(2) `env://` init method functionality (a sketch follows at the end of this summary)
(3) Minor change to `test_distributed.py`, which is now a test for `torch.distributed.c10d`.
(4) The old `test_distributed.py` is now moved to `test_distributed_thd`
(5) Miscellaneous bug fixes.
(6) The DDP CPU test is removed since c10d doesn't have this support yet, but it will be easy to re-add once DDP CPU's dependency moves to torch.distributed.c10d.
(7) CI config to test MPI, NCCL, and Gloo backend of c10d

**Now all the distributed test including c10d DDP can pass with the c10d frontend API**

TODO (in a separate PR):
MPI subgroup support; once this is added, the CI group tests will be enabled.
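
A minimal sketch of the `env://` init method from item (2): rank, world size, and the master address/port are taken from environment variables instead of being passed explicitly (the values below are placeholders for a single local process):

```python
import os
import torch.distributed as dist

# With init_method="env://", these variables are normally set by the launcher.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

dist.init_process_group(backend="gloo", init_method="env://")
print(dist.get_rank(), dist.get_world_size())
```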
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10871

Differential Revision: D9554514

Pulled By: teng-li

fbshipit-source-id: fb686ad42258526c8b4372148e82969fac4f42dd
2018-08-29 12:55:57 -07:00
Edward Yang
51f154e072 Fix Python lint errors. (#10441)
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10441

Reviewed By: Yangqing

Differential Revision: D9285502

Pulled By: ezyang

fbshipit-source-id: 12c94b28bee9cade930c8f260577e81ea1915269
2018-08-11 21:08:50 -07:00
Jason Gauci
31646edfff Increase GLOO rendezvous timeout
Summary: Increase GLOO rendezvous timeout

Reviewed By: teng-li

Differential Revision: D9273544

fbshipit-source-id: 5c22c1d18df3032f019ff12e2a720aea7c390f15
2018-08-10 18:40:18 -07:00
anderspapitto
48e90e3339 Build system changes (#8627)
* All changes needed to get rid of process_github.sh

* allow thnn_h_path
2018-06-20 17:45:26 -04:00
Tongzhou Wang
85ee94b7be Add memory leak check in CUDA tests (#7270)
* Add memory leak check in CUDA tests (a rough sketch follows this change list)

* Tracking multi-GPU too

* fix run_test.py not running __name__ == '__main__' content; add test for make_cuda_memory_checked_test

* add a comment

* skip if cuda

* 1. Change the wrapper to a method in common.py:TestCase
2. Refactor common constants/method that initialize CUDA context into common_cuda.py
3. Update some test files to use TEST_CUDA and TEST_MULTIGPU

* Fix MaxUnpool3d forward memory leak

* Fix MultiLabelMarginCriterion forward memory leak

* Fix MultiMarginLoss backward memory leak

* default doCUDAMemoryCheck to False

* make the wrapper skip-able

* use TEST_MULTIGPU

* add align_corners=True/False tests for Upsample; fix TEST_CUDNN

* finalize interface

* VolumetricMaxUnpooling_updateOutput

* fix test_nccl

* rename THC caching allocator methods to be clearer

* make the wrapped function a method

* address comments; revert changes to aten/src/THC/THCCachingAllocator.cpp

* fix renamed var
2018-05-31 15:09:54 -04:00
Adam Paszke
3238db6247 Show skipped distributed tests as skipped (#7624)
Previously, tests that were skipped because their backend was missing
would show up as succeeded, which was very confusing.
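
A small sketch of the pattern this implies; the availability flag here is hypothetical:

```python
import unittest

BACKEND_AVAILABLE = False  # hypothetical flag set by the test harness

class TestDistributed(unittest.TestCase):
    # skipIf marks the test as "skipped" in the report instead of letting
    # it silently pass as "succeeded".
    @unittest.skipIf(not BACKEND_AVAILABLE, "distributed backend not available")
    def test_all_reduce(self):
        pass
```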
2018-05-17 00:23:46 +02:00
xhzhao
f2c9975378 Add DistributedDataParallelCPU (#5919) 2018-04-17 15:36:47 +02:00
Ailing
30157971f0 Update dist test to use multi gpus (#6337)
* update dist test to use multi gpus

* add nccl to jenkins

* address comment

* make lint happy

* convert range object to list
2018-04-16 14:10:27 -04:00
Simeon Monov
9b111f1a88 Fix world size use in test_distributed with MPI backend (#6301)
WORLD_SIZE is not used for MPI tests and the check fails for
the group tests
2018-04-05 09:28:53 -04:00
Ailing
2f64e1cdf6 Add second iteration in test_DistributedDataParallel (#5830) 2018-03-18 00:27:45 +01:00
Myle Ott
f5f6258288 Enable additional tensor types in Gloo backend (#5483) 2018-03-15 14:53:24 +01:00
Ailing
92596197fc add end to end test for DistributedDataParallel (#5182)
* add end to end test for DistributedDataParallel

* address comments

* skip subgroup tests when less than 3 processes

* set process number based on available gpus

* add single gpu; cleanup WORLD_SIZE

* fix comments
2018-03-08 22:07:34 -05:00
Ailing
ff3f689239 Add more tests for NCCL backend (#4796) 2018-01-28 12:36:59 +01:00
Teng Li
926ed2b280 Implemented NCCL Distributed Backend for PyTorch with new dist APIs (#3435)
* Implemented NCCL Distributed Backend for PyTorch with new dist APIs

* Let FindNCCL determine the NCCL version

* Let the NCCL2 backend use ATen instead of the deprecated THPP

* Let distributed parallel model use a single reduction thread for NCCL backend

* Caching the sockets, bug fix, refactoring, and addressed Adam's comments

* Make BcastNcclID take a single param and bug fix for all_gather

* Removed the barrier function, added a warning for users, and stopped exposing the experimental function to users

* Use the simplest working single-bucket solution for the distributed data parallel model, with rebase

* Cleanup, fixes and further addressed Adam's comments

* Used PySequence_Fast in distributed csrc

* Removed the limitation that each group is only bound to a given device sequence

* Used THPObjectPtr for PySequence_Fast
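
A minimal sketch of using the NCCL backend this adds, assuming one GPU per process and the usual environment-based initialization:

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="env://")

# NCCL collectives operate directly on CUDA tensors.
device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
t = torch.ones(8, device=device) * dist.get_rank()

dist.all_reduce(t)        # sum across all ranks, GPU to GPU
dist.broadcast(t, src=0)  # broadcast rank 0's tensor to everyone
```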
2017-11-29 15:57:02 -05:00
Adam Paszke
2a8603c5e1 Make distributed recv return sender rank 2017-09-25 12:11:52 -04:00
Soumith Chintala
674e1f2ba1 increase test subprocess timeout 2017-08-27 21:11:08 -04:00
gchanan
5b8e2ad2a6 test_distributed cuda tests don't skip if cuda not available. (#2476)
2017-08-17 17:45:32 -04:00
gchanan
0985eaf373 Add ability to specify init_method for test_distributed. (#2465)
* Add ability to specify init_method for test_distributed.

* Move init_method specification to test run line.

* Run for gloo tests as well.

* Better status message for gloo test.
2017-08-16 17:04:21 -04:00
Adam Paszke
8915e2710c Refactor scatter/gather and add distributed docs 2017-07-12 14:47:36 -04:00
lynic
ebdec9a837 Skip distributed tests if not supported (#2004) 2017-07-07 11:06:56 -04:00
Adam Paszke
d9d50f80c7 Rename arguments to distributed collectives 2017-06-12 22:02:11 -04:00
Adam Paszke
714351ff39 Officially enable process-group mode 2017-06-12 22:02:11 -04:00
Adam Paszke
4ebf3ff46d Add base for CUDA allReduce and broadcast in DataChannelGloo 2017-05-01 01:49:10 -07:00
Janusz Marcinkiewicz
f07f13c6e9 Change Store exception handling 2017-05-01 01:49:10 -07:00
Janusz Marcinkiewicz
310d08c37b Fix store and all operations 2017-05-01 01:49:10 -07:00
Janusz Marcinkiewicz
2b340e7d50 Add python tests; Remove broken prefix store creation 2017-05-01 01:49:09 -07:00
Adam Paszke
a3e11d606b Fix linter errors 2017-01-31 01:58:09 +01:00
Janusz Marcinkiewicz
962084c8e8 Add Data Channel receive from any source (#52) 2017-01-31 01:58:09 +01:00
Janusz Marcinkiewicz
76520512e7 DataChannel tests rewrite (#42); DataChannel isend and irecv implementation (#44) 2017-01-31 01:58:09 +01:00
Janusz Marcinkiewicz
ac1f68127a Add barrier, scatter, gather and allGather implementations + groups (#34) 2017-01-31 01:58:09 +01:00
Mateusz Piotrowski
3e3501c98d Integration tests of the THD Python interface (#28) 2017-01-31 01:58:09 +01:00