Summary:
Distributed Data Parallel CPU module for c10d. This is essentially the same code as the Distributed Data Parallel CPU module for THD, since c10d now has the exact same front-end interface as torch.distributed.
We will keep both in the first release and remove the THD one once c10d is stable enough.
Tests cover this module as fully as the THD version.
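A minimal usage sketch of the module, assuming it keeps the THD-era entry point `torch.nn.parallel.DistributedDataParallelCPU` and that the process group is set up through the usual `torch.distributed` front end (the launcher/env details below are illustrative, not part of this PR):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallelCPU  # entry point assumed from the THD-era module

# Assumes MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are provided by the launcher.
dist.init_process_group(backend="gloo", init_method="env://")

model = nn.Linear(10, 10)                      # plain CPU model
ddp_model = DistributedDataParallelCPU(model)  # gradients are all-reduced across ranks

criterion = nn.MSELoss()
inputs, targets = torch.randn(4, 10), torch.randn(4, 10)
loss = criterion(ddp_model(inputs), targets)
loss.backward()                                # backward hooks average gradients over the group
```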
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11168
Differential Revision: D9674963
Pulled By: teng-li
fbshipit-source-id: ecf52a7189374ca7930c2be305218167fdd822a7
Summary:
The existing tests had every rank send to every other rank and only
then switch to receive mode. This only works if the send operations are
non-blocking and the passed tensors are immediately copied to some kind
of send buffer. Instead, every send must be matched with a recv on the
other side, because from the API perspective either call may block.
For example, imagine a 1GB tensor being sent to every other rank: it can only go
through if there is a matching recv on the other side; otherwise it will deadlock.
This change reflects this in the send/recv unit tests.
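A minimal sketch of the matched pattern described above, written against the `torch.distributed` send/recv API (the tensor size and loop structure are illustrative; the actual unit tests in the PR may be organized differently):

```python
import torch
import torch.distributed as dist

def exchange_all_pairs(rank, world_size):
    # All ranks walk the same (src, dst) sequence in lockstep, so every send
    # posted here has a matching recv on the other side and nothing deadlocks,
    # even if send blocks until the peer is ready.
    for src in range(world_size):
        for dst in range(world_size):
            if src == dst:
                continue
            if rank == src:
                dist.send(torch.full((1024,), float(rank)), dst=dst)
            elif rank == dst:
                buf = torch.zeros(1024)
                dist.recv(buf, src=src)
```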
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11275
Differential Revision: D9658197
Pulled By: pietern
fbshipit-source-id: fb6a3fc03b42343a9dfeed0def30d94914e76974
Summary:
Added MPI group support. This makes all of the previous MPI group test cases pass.
Also, relaxed the required MPI thread-level support by serializing the MPI ops of different process groups; this serialization is required.
The build is fixed as well.
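A small sketch of what the group support enables with the MPI backend; the specific subgroup of ranks 0 and 1 is an illustrative choice:

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")   # rank / world size come from mpirun

# new_group must be called by every process, even those not in the subgroup.
group = dist.new_group(ranks=[0, 1])

t = torch.ones(4)
if dist.get_rank() in (0, 1):
    # The collective only runs over the subgroup's communicator.
    dist.all_reduce(t, group=group)
```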
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11128
Differential Revision: D9602188
Pulled By: teng-li
fbshipit-source-id: 1d618925ae5fb7b47259b23051cc181535aa7497
Summary:
The PR includes:
(1) torch.distributed.c10d, which now includes the complete backward-compatible frontend API for `torch.distributed`
(2) `env://` init method functionality (see the sketch after this list)
(3) Minor changes to `test_distributed.py`, which is now a test for `torch.distributed.c10d`.
(4) The old `test_distributed.py` is moved to `test_distributed_thd`
(5) Miscellaneous bug fixes.
(6) The DDP CPU test is removed since c10d doesn't support it yet, but it will be an easy test to re-add once DDP CPU's dependency is moved to torch.distributed.c10d.
(7) CI config to test the MPI, NCCL, and Gloo backends of c10d
**All distributed tests, including c10d DDP, now pass with the c10d frontend API.**
TODO (in a separate PR):
MPI subgroup support; once this is added, the CI group tests will be enabled.
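A minimal sketch of the `env://` init method from item (2), with placeholder address and port values; a real launcher would set these environment variables itself:

```python
import os
import torch.distributed as dist

# env:// reads all rendezvous information from environment variables.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")   # placeholder values
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

dist.init_process_group(backend="gloo", init_method="env://")
print(dist.get_rank(), dist.get_world_size())
```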
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10871
Differential Revision: D9554514
Pulled By: teng-li
fbshipit-source-id: fb686ad42258526c8b4372148e82969fac4f42dd
* Add memory leak check in CUDA tests
* Tracking multi-GPU too
* fix run_test.py not running __name__ == '__main__' content; add test for make_cuda_memory_checked_test
* add a comment
* skip if cuda
* 1. Change the wrapper to a method in common.py:TestCase
2. Refactor common constants/method that initialize CUDA context into common_cuda.py
3. Update some test files to use TEST_CUDA and TEST_MULTIGPU
* Fix MaxUnpool3d forward memory leak
* Fix MultiLabelMarginCriterion forward memory leak
* Fix MultiMarginLoss backward memory leak
* default doCUDAMemoryCheck to False
* make the wrapper skip-able
* use TEST_MULTIGPU
* add align_corners=True/False tests for Upsample; fix TEST_CUDNN
* finalize interface
* VolumetricMaxUnpooling_updateOutput
* fix test_nccl
* rename THC caching allocator methods to be clearer
* make the wrapped function a method
* address comments; revert changes to aten/src/THC/THCCachingAllocator.cpp
* fix renamed var
* add end to end test for DistributedDataParallel
* address comments
* skip subgroup tests when less than 3 processes
* set process number based on available gpus
* add single gpu; cleanup WORLD_SIZE
* fix comments
* Implemented NCCL Distributed Backend for PyTorch with new dist APIs
* Let FindNCCL determine the NCCL version
* Let NCCL2 backend use ATen instead of the deprecated THPP
* Let distributed parallel model use a single reduction thread for NCCL backend
* Caching the sockets, bug fix, refactoring, and addressed Adam's comments
* Make BcastNcclID take a single param and bug fix for all_gather
* Removed barrier function, added warning for users, and not exposing experimental func to users
* Use the simplest single-bucket working solution for distributed data parallel model with rebase
* Cleanup, fixes and further addressed Adam's comments
* Used PySequence_Fast in distributed csrc
* Removed the limitation that each group is only bound to a given device sequence
* Used THPObjectPtr for PySequence_Fast
* Add ability to specify init_method for test_distributed.
* Move init_method specification to test run line.
* Run for gloo tests as well.
* Better status message for gloo test.
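A hedged sketch of how an init method can be threaded from the test run line into `init_process_group`, along the lines of the last few items; the `INIT_METHOD` variable name and the file-based fallback are assumptions for illustration, not necessarily what the test suite uses:

```python
import os
import tempfile
import torch.distributed as dist

# Let the test run line override the rendezvous, e.g.
#   INIT_METHOD=tcp://127.0.0.1:29500 python test_distributed.py
init_method = os.environ.get("INIT_METHOD")
if init_method is None:
    # Fall back to a shared-file rendezvous for single-machine runs.
    init_method = "file://" + tempfile.NamedTemporaryFile(delete=False).name

dist.init_process_group(backend="gloo", init_method=init_method,
                        rank=0, world_size=1)
```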