Commit Graph

31 Commits

Teng Li
56539f5fe1 PT1 Distributed Release MileStone No.1 - Completed Distributed Package and CI tests (#10871)
Summary:
The PR includes:
(1) torch.distributed.c10d, which now includes the complete backward compatible frontend API for `torch.distributed`
(2) `env://` init method functionality (see the sketch below)
(3) Minor change to `test_distributed.py`, which is now a test for `torch.distributed.c10d`.
(4) The old `test_distributed.py` is now moved to `test_distributed_thd`
(5) Miscellaneous bug fixes.
(6) DDP CPU test is removed since c10d doesn't have this support yet, but this is a very easy test after moving DDP CPU's dependency to torch.distributed.c10d.
(7) CI config to test MPI, NCCL, and Gloo backend of c10d

**All the distributed tests, including c10d DDP, now pass with the c10d frontend API**

TODO (in a separate PR):
MPI subgroup support; once this is added, the CI group test will be enabled.
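The `env://` init method from item (2) is typically exercised with the `torch.distributed` frontend as in this minimal sketch (host, port, rank, and world size below are single-process placeholders):

```python
import os
import torch.distributed as dist

# env:// reads rendezvous information from these environment variables
# (placeholder values for a single-process run).
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

dist.init_process_group(backend="gloo", init_method="env://")
print(dist.get_rank(), dist.get_world_size())  # 0 1
```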
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10871

Differential Revision: D9554514

Pulled By: teng-li

fbshipit-source-id: fb686ad42258526c8b4372148e82969fac4f42dd
2018-08-29 12:55:57 -07:00
Edward Yang
51f154e072 Fix Python lint errors. (#10441)
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10441

Reviewed By: Yangqing

Differential Revision: D9285502

Pulled By: ezyang

fbshipit-source-id: 12c94b28bee9cade930c8f260577e81ea1915269
2018-08-11 21:08:50 -07:00
Jason Gauci
31646edfff Increase GLOO rendezvous timeout
Summary: Increase GLOO rendezvous timeout

Reviewed By: teng-li

Differential Revision: D9273544

fbshipit-source-id: 5c22c1d18df3032f019ff12e2a720aea7c390f15
2018-08-10 18:40:18 -07:00
anderspapitto
48e90e3339 Build system changes (#8627)
* All changes needed to get rid of process_github.sh

* allow thnn_h_path
2018-06-20 17:45:26 -04:00
Tongzhou Wang
85ee94b7be
Add memory leak check in CUDA tests (#7270)
* Add memory leak check in CUDA tests (see the sketch after this list)

* Tracking multi-GPU too

* fix run_test.py not running __name__ == '__main__' content; add test for make_cuda_memory_checked_test

* add a comment

* skip if cuda

* 1. Change the wrapper to a method in common.py:TestCase
2. Refactor common constants/method that initialize CUDA context into common_cuda.py
3. Update some test files to use TEST_CUDA and TEST_MULTIGPU

* Fix MaxUnpool3d forward memory leak

* Fix MultiLabelMarginCriterion forward memory leak

* Fix MultiMarginLoss backward memory leak

* default doCUDAMemoryCheck to False

* make the wrapper skip-able

* use TEST_MULTIGPU

* add align_corners=True/False tests for Upsample; fix TEST_CUDNN

* finalize interface

* VolumetricMaxUnpooling_updateOutput

* fix test_nccl

* rename THC caching allocator methods to be clearer

* make the wrapped function a method

* address comments; revert changes to aten/src/THC/THCCachingAllocator.cpp

* fix renamed var
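The leak check these commits add compares CUDA caching-allocator usage before and after each test; a simplified standalone sketch of that idea (not the actual `common.py` wrapper) is:

```python
import torch

def assert_no_cuda_memory_leak(fn):
    """Fail if running fn grows the allocated CUDA memory on any device.

    Sketch only: the real TestCase wrapper also handles the skipping and
    multi-GPU bookkeeping described in the commit above.
    """
    if not torch.cuda.is_available():
        return fn()
    torch.cuda.synchronize()
    before = [torch.cuda.memory_allocated(d)
              for d in range(torch.cuda.device_count())]
    result = fn()
    torch.cuda.synchronize()
    after = [torch.cuda.memory_allocated(d)
             for d in range(torch.cuda.device_count())]
    assert after == before, f"CUDA memory leaked: {before} -> {after}"
    return result
```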
2018-05-31 15:09:54 -04:00
Adam Paszke
3238db6247
Show skipped distributed tests as skipped (#7624)
Previously, tests that were skipped because their backend was
missing would show up as succeeded, which was very confusing.
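A hedged sketch of the fix's pattern: report missing-backend tests through `unittest`'s skip machinery instead of letting them pass silently (`BACKEND` here is a stand-in for however the suite selects its backend):

```python
import os
import unittest

BACKEND = os.environ.get("BACKEND")  # stand-in for the suite's backend selection

class TestDistributed(unittest.TestCase):
    @unittest.skipUnless(BACKEND == "nccl", "NCCL backend not available")
    def test_nccl_collective(self):
        pass  # would exercise an NCCL-specific collective

if __name__ == "__main__":
    unittest.main()
```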
2018-05-17 00:23:46 +02:00
xhzhao
f2c9975378 Add DistributedDataParallelCPU (#5919) 2018-04-17 15:36:47 +02:00
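A hedged usage sketch for the class this commit adds (import path as it existed at the time; the model and rendezvous settings are placeholders):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallelCPU

dist.init_process_group(backend="gloo", init_method="env://")  # placeholder rendezvous
model = torch.nn.Linear(10, 10)  # placeholder model
ddp_cpu = DistributedDataParallelCPU(model)  # synchronizes gradients across ranks
loss = ddp_cpu(torch.randn(8, 10)).sum()
loss.backward()  # gradients averaged over the process group
```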
Ailing
30157971f0 Update dist test to use multiple GPUs (#6337)
* update dist test to use multiple GPUs

* add nccl to jenkins

* address comment

* make lint happy

* convert range object to list
2018-04-16 14:10:27 -04:00
Simeon Monov
9b111f1a88 Fix worldsize use in test_distributed with MPI backend (#6301)
WORLD_SIZE is not used for MPI tests, and the check fails for
the group tests.
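With the MPI backend the rank and world size come from the MPI launcher itself, so (as a sketch) a test can query them after init rather than requiring the WORLD_SIZE variable:

```python
import torch.distributed as dist

# Launched as e.g. `mpirun -n 4 python this_script.py`; rank and world
# size are supplied by MPI, so WORLD_SIZE need not be set.
dist.init_process_group(backend="mpi")
print(dist.get_rank(), dist.get_world_size())
```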
2018-04-05 09:28:53 -04:00
Ailing
2f64e1cdf6 Add second iteration in test_DistributedDataParallel (#5830) 2018-03-18 00:27:45 +01:00
Myle Ott
f5f6258288 Enable additional tensor types in Gloo backend (#5483) 2018-03-15 14:53:24 +01:00
Ailing
92596197fc add end to end test for DistributedDataParallel (#5182)
* add end to end test for DistributedDataParallel (see the sketch after this list)

* address comments

* skip subgroup tests when less than 3 processes

* set process number based on available gpus

* add single GPU; clean up WORLD_SIZE

* fix comments
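The end-to-end pattern such a test exercises looks roughly like the following (rendezvous settings and model are placeholders; one GPU per process is assumed):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend="nccl", init_method="env://")  # placeholder rendezvous
gpu = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(gpu)
model = torch.nn.Linear(10, 10).cuda(gpu)  # placeholder model
ddp = DistributedDataParallel(model, device_ids=[gpu])
out = ddp(torch.randn(8, 10, device=f"cuda:{gpu}"))
out.sum().backward()  # gradients synchronized across ranks
```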
2018-03-08 22:07:34 -05:00
Ailing
ff3f689239 Add more tests for NCCL backend (#4796) 2018-01-28 12:36:59 +01:00
Teng Li
926ed2b280 Implemented NCCL Distributed Backend for PyTorch with new dist APIs (#3435)
* Implemented NCCL Distributed Backend for PyTorch with new dist APIs (see the usage sketch after this list)

* Let FindNCCL determine the NCCL version

* Let the NCCL2 backend use ATen instead of the deprecated THPP

* Let distributed parallel model use a single reduction thread for NCCL backend

* Cached the sockets, fixed bugs, refactored, and addressed Adam's comments

* Make BcastNcclID take a single param and fix a bug in all_gather

* Removed the barrier function, added a warning for users, and stopped exposing experimental functions to users

* Use the simplest single-bucket working solution for the distributed data parallel model, with rebase

* Cleanup, fixes and further addressed Adam's comments

* Used PySequence_Fast in distributed csrc

* Removed the limitation that each group is only bound to a given device sequence

* Used THPObjectPtr for PySequence_Fast
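A minimal usage sketch for the backend this commit introduces, assuming one GPU per process and placeholder rendezvous settings:

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="env://")  # placeholder rendezvous
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
t = torch.ones(4, device="cuda")
dist.all_reduce(t)  # summed across all ranks by NCCL
print(t)  # each element equals the world size
```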
2017-11-29 15:57:02 -05:00
Adam Paszke
2a8603c5e1 Make distributed recv return sender rank 2017-09-25 12:11:52 -04:00
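A sketch of what this change enables: `recv` without an explicit source returns the sender's rank (a two-rank group is assumed to be initialized already):

```python
import torch
import torch.distributed as dist

# Assumes a two-process group is already initialized.
t = torch.zeros(1)
if dist.get_rank() == 0:
    sender = dist.recv(t)  # no src given: receive from any rank, return its rank
    print(f"got {t.item()} from rank {sender}")
else:
    dist.send(t.fill_(42.0), dst=0)
```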
Soumith Chintala
674e1f2ba1 increase test subprocess timeout 2017-08-27 21:11:08 -04:00
gchanan
5b8e2ad2a6 test_distributed cuda tests don't skip if cuda not available. (#2476)
test_distributed cuda tests don't skip if cuda not available.
2017-08-17 17:45:32 -04:00
gchanan
0985eaf373 Add ability to specify init_method for test_distributed. (#2465)
* Add ability to specify init_method for test_distributed.

* Move init_method specification to test run line.

* Run for gloo tests as well.

* Better status message for gloo test.
2017-08-16 17:04:21 -04:00
Adam Paszke
8915e2710c Refactor scatter/gather and add distributed docs 2017-07-12 14:47:36 -04:00
lynic
ebdec9a837 Skip distributed tests if not supported (#2004) 2017-07-07 11:06:56 -04:00
Adam Paszke
d9d50f80c7 Rename arguments to distributed collectives 2017-06-12 22:02:11 -04:00
Adam Paszke
714351ff39 Officially enable process-group mode 2017-06-12 22:02:11 -04:00
Adam Paszke
4ebf3ff46d Add base for CUDA allReduce and broadcast in DataChannelGloo 2017-05-01 01:49:10 -07:00
Janusz Marcinkiewicz
f07f13c6e9 Change Store exception handling 2017-05-01 01:49:10 -07:00
Janusz Marcinkiewicz
310d08c37b Fix store and all operations 2017-05-01 01:49:10 -07:00
Janusz Marcinkiewicz
2b340e7d50 Add python tests; Remove broken prefix store creation 2017-05-01 01:49:09 -07:00
Adam Paszke
a3e11d606b Fix linter errors 2017-01-31 01:58:09 +01:00
Janusz Marcinkiewicz
962084c8e8 Add Data Channel receive from any source (#52) 2017-01-31 01:58:09 +01:00
Janusz Marcinkiewicz
76520512e7 DataChannel tests rewrite (#42); DataChannel isend and irecv implementation (#44) 2017-01-31 01:58:09 +01:00
Janusz Marcinkiewicz
ac1f68127a Add barrier, scatter, gather and allGather implementations + groups (#34) 2017-01-31 01:58:09 +01:00
Mateusz Piotrowski
3e3501c98d Integration tests of the THD Python interface (#28) 2017-01-31 01:58:09 +01:00