Summary:
Also add docs for get_backend, Backend, and reduce_op
fixes #11803
cc pietern apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11830
Differential Revision: D9927991
Pulled By: SsnL
fbshipit-source-id: a2ffb70826241ba84264f36f2cb173e00b19af48
Summary:
1. Refactors `TestTorch` into `TestTorchMixin` (subclass of `object`) and `TestTorch` (subclass of `TestCase`, MRO `(TestCase, TestTorchMixin)`, only defined if `__name__ == '__main__'`). So other scripts won't accidentally run it.
2. Adds an assertion in `load_tests` that each script only runs cases defined in itself.
cc yf225 ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13250
Differential Revision: D12823734
Pulled By: SsnL
fbshipit-source-id: 7a169f35fe0794ce76e310d8a137d9a3265c012b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13109
The "right" strategy of creating a socket, binding to an undefined port, closing the socket, and reusing the port it was bound to was subject to a race condition. Another process could bind to that same port sooner than the tests would, causing an "Address already in use" failure when rank 0 would try to bind to that same port. The THD tests have been using a fixed port since forever. Time will tell if this fixes #12876.
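For reference, the racy ephemeral-port trick looks like this (a sketch of the pattern being removed, not the fix):

```python
import socket

# Bind to port 0 so the kernel assigns a free port, record it, then
# close the socket. Any other process can claim the port between
# close() and the later bind by rank 0 -- that is the race.
def find_free_port():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(('127.0.0.1', 0))   # port 0 -> kernel picks an unused port
    port = s.getsockname()[1]
    s.close()                  # the race window opens here
    return port

port = find_free_port()
```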
Differential Revision: D10850614
fbshipit-source-id: c19f12bb4916141187ee8ddb52880f5f418310dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13071
In the case where a process got stuck and timed out on joining, we would see a `None != 1` assertion error in the code path where the exit statuses are compared. This implies that the first process exited with exit code 1 and another one did not exit at all. With this commit the error message is more descriptive.
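The improved check can be sketched as follows (a hypothetical helper; the function name and message wording are illustrative, not the actual test code):

```python
# Distinguish "process never exited" from "exit codes differ" instead
# of failing with a bare `None != 1` comparison.
def check_exit_codes(exit_codes):
    first = exit_codes[0]
    for rank, code in enumerate(exit_codes):
        if code is None:
            raise AssertionError(
                "process %d timed out and never exited "
                "(process 0 exited with code %s)" % (rank, first))
        if code != first:
            raise AssertionError(
                "expected process %d to exit with code %s, got %s"
                % (rank, first, code))

check_exit_codes([0, 0, 0])  # uniform exit codes pass silently
```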
Differential Revision: D10785266
fbshipit-source-id: c8cc02d07ea4fdc6f5374afd9a0aac72218fe61d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12794
common.py is used in base_module for almost all tests in test/. The
name of this file is so common that it can easily conflict with other
dependencies if they happen to have another common.py in their base
module. Rename the file to avoid the conflict.
Reviewed By: orionr
Differential Revision: D10438204
fbshipit-source-id: 6a996c14980722330be0a9fd3a54c20af4b3d380
Summary:
I have no idea how to run distributed tests locally so I'll let CI do this. Hopefully everything still works with `IntEnum`.
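The `IntEnum` property being relied on here can be sketched in isolation (member names mirror `torch.distributed.reduce_op`; the numeric values are illustrative, not the library's actual codes):

```python
from enum import IntEnum

# With IntEnum, members still compare equal to plain ints, so existing
# call sites that pass integer reduce-op codes keep working.
class reduce_op(IntEnum):
    SUM = 0
    PRODUCT = 1
    MAX = 2
    MIN = 3

# IntEnum members interoperate with ints, unlike a plain Enum.
assert reduce_op.SUM == 0
assert reduce_op.MAX > reduce_op.PRODUCT
```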
cc mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11715
Reviewed By: pietern
Differential Revision: D9889646
Pulled By: SsnL
fbshipit-source-id: 1e2a487cb6fe0bd4cc67501c9d72a295c35693e2
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`
The old DDP will go to `torch.nn.parallel.deprecated`
Now `torch.nn.parallel.DDP` will use c10d DDP
Now `torch.distributed` will use C10d frontend API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405
Reviewed By: pietern
Differential Revision: D9733733
Pulled By: teng-li
fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08
Summary:
Add a barrier() to wait for all process groups to be created before destroying them
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11391
Differential Revision: D9727383
Pulled By: teng-li
fbshipit-source-id: 689d62c978e642b68f4949dcf29982e34869ada4
Summary:
Fixed a few previously untested bugs in the c10d frontend APIs, including
get_rank, get_world_size, and destroy_process_group for a given group.
These APIs are now covered by the CI tests.
Also added all the group-related tests, including full groups and partial groups (the existing ones), since the two hit different code paths.
Also removed the experimental c10d APIs initially used in DDP, since we no longer use them anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11318
Reviewed By: pietern
Differential Revision: D9675896
Pulled By: teng-li
fbshipit-source-id: a2eac2c57933effa2d139855f786e64919a95bfc
Summary:
Distributed Data Parallel CPU module for c10d. This is basically the same code as Distributed Data Parallel CPU module for THD, since c10d now has the exact same front-end interface as torch.distributed.
We will keep both in the first release and remove the THD one once c10d is stable enough.
Test coverage is as complete as it was for THD.
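Conceptually, the core of a CPU DDP step is averaging gradients across workers after the backward pass. A minimal sketch, with the collective simulated in-process (the real module issues a `torch.distributed` allreduce):

```python
# Average each gradient entry across all workers, as an allreduce with
# division by world size would.
def allreduce_average(grads_per_worker):
    world_size = len(grads_per_worker)
    length = len(grads_per_worker[0])
    return [
        sum(worker[i] for worker in grads_per_worker) / world_size
        for i in range(length)
    ]

avg = allreduce_average([[1.0, 2.0], [3.0, 4.0]])  # -> [2.0, 3.0]
```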
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11168
Differential Revision: D9674963
Pulled By: teng-li
fbshipit-source-id: ecf52a7189374ca7930c2be305218167fdd822a7
Summary:
The existing tests had every rank run send to every other rank and only
then switch to recv mode. This only works if the send operations are
non-blocking and the passed tensors are immediately copied to some kind
of send buffer. Instead, every send must be matched with a recv on the
other side, because from the API perspective they may block.
E.g. imagine a 1GB tensor being sent to every other rank. It can only go
through if there is a recv on the other side, or it will deadlock.
This change reflects this in the send/recv unit tests.
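One way to produce such a matched ordering (an assumption for illustration; the actual tests may pair ranks differently) is to have the lower rank of each pair send first while the higher rank receives first, so every potentially blocking send has a concurrent recv on the peer:

```python
# Build the per-rank op schedule: for each peer, the lower rank does
# (send, recv) while the higher rank does (recv, send), so the two
# sides always mirror each other and neither blocks forever.
def schedule(rank, world_size):
    ops = []
    for peer in range(world_size):
        if peer == rank:
            continue
        if rank < peer:
            ops.append(('send', peer))
            ops.append(('recv', peer))
        else:
            ops.append(('recv', peer))
            ops.append(('send', peer))
    return ops

# Ranks 0 and 1 mirror each other.
assert schedule(0, 2) == [('send', 1), ('recv', 1)]
assert schedule(1, 2) == [('recv', 0), ('send', 0)]
```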
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11275
Differential Revision: D9658197
Pulled By: pietern
fbshipit-source-id: fb6a3fc03b42343a9dfeed0def30d94914e76974
Summary:
Added MPI group support, which makes all previous MPI group test cases pass.
Also, the required MPI thread level is relaxed by serializing different process groups' MPI ops; this serialization is required for correctness.
The build is fixed too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11128
Differential Revision: D9602188
Pulled By: teng-li
fbshipit-source-id: 1d618925ae5fb7b47259b23051cc181535aa7497
Summary:
The PR includes:
(1) torch.distributed.c10d, which now includes the complete backward compatible frontend API for `torch.distributed`
(2) `env://` init method functionality
(3) Minor change to `test_distributed.py`, which is now a test for `torch.distributed.c10d`.
(4) The old `test_distributed.py` is now moved to `test_distributed_thd`
(5) Miscellaneous bug fixes.
(6) DDP CPU test is removed since c10d doesn't have this support yet, but this is a very easy test after moving DDP CPU's dependency to torch.distributed.c10d.
(7) CI config to test MPI, NCCL, and Gloo backend of c10d
**Now all the distributed tests, including c10d DDP, pass with the c10d frontend API**
TODO: (in a separate PR)
MPI subgroup support, once this is added, CI group test will be enabled.
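The `env://` init method of item (2) rendezvouses entirely through environment variables. A minimal sketch of what it consumes (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` are the variable names `torch.distributed` reads; the parsing helper itself is illustrative):

```python
import os

# Defaults so the sketch is runnable standalone; a real launcher sets
# these before starting each worker process.
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')
os.environ.setdefault('RANK', '0')
os.environ.setdefault('WORLD_SIZE', '1')

# Gather the rendezvous address and this process's identity from the
# environment, as an env:// init method would.
def env_rendezvous():
    return {
        'addr': os.environ['MASTER_ADDR'],
        'port': int(os.environ['MASTER_PORT']),
        'rank': int(os.environ['RANK']),
        'world_size': int(os.environ['WORLD_SIZE']),
    }

cfg = env_rendezvous()
```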
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10871
Differential Revision: D9554514
Pulled By: teng-li
fbshipit-source-id: fb686ad42258526c8b4372148e82969fac4f42dd
* Add memory leak check in CUDA tests
* Tracking multi-GPU too
* fix run_test.py not running __name__ == '__main__' content; add test for make_cuda_memory_checked_test
* add a comment
* skip if cuda
* 1. Change the wrapper to a method in common.py:TestCase
2. Refactor common constants/method that initialize CUDA context into common_cuda.py
3. Update some test files to use TEST_CUDA and TEST_MULTIGPU
* Fix MaxUnpool3d forward memory leak
* Fix MultiLabelMarginCriterion forward memory leak
* Fix MultiMarginLoss backward memory leak
* default doCUDAMemoryCheck to False
* make the wrapper skip-able
* use TEST_MULTIGPU
* add align_corners=True/False tests for Upsample; fix TEST_CUDNN
* finalize interface
* VolumetricMaxUnpooling_updateOutput
* fix test_nccl
* rename THC caching allocator methods to be clearer
* make the wrapped function a method
* address comments; revert changes to aten/src/THC/THCCachingAllocator.cpp
* fix renamed var
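The constants factored into common_cuda.py can be sketched as follows (guarded with a try/except so the snippet also runs where torch is not installed; the real file assumes torch is importable):

```python
# TEST_CUDA: a CUDA device is usable at all.
# TEST_MULTIGPU: at least two CUDA devices are available, for the
# multi-GPU memory-leak tracking mentioned above.
try:
    import torch
    TEST_CUDA = torch.cuda.is_available()
    TEST_MULTIGPU = TEST_CUDA and torch.cuda.device_count() >= 2
except ImportError:
    TEST_CUDA = False
    TEST_MULTIGPU = False
```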
* add end to end test for DistributedDataParallel
* address comments
* skip subgroup tests when less than 3 processes
* set process number based on available gpus
* add single GPU; clean up WORLD_SIZE
* fix comments
* Implemented NCCL Distributed Backend for PyTorch with new dist APIs
* Let FindNCCL determine the NCCL version
* Let the NCCL2 backend use ATen instead of the deprecated THPP
* Let distributed parallel model use a single reduction thread for NCCL backend
* Caching the sockets, bug fix, refactoring, and addressed Adam's comments
* Make BcastNcclID take a single param and bug fix for all_gather
* Removed barrier function, added warning for users, and not exposing experimental func to users
* Use the simplest single-bucket working solution for the distributed data parallel model, with rebase
* Cleanup, fixes and further addressed Adam's comments
* Used PySequence_Fast in distributed csrc
* Removed the limitation that each group is only bound to a given device sequence
* Used THPObjectPtr for PySequence_Fast
* Add ability to specify init_method for test_distributed.
* Move init_method specification to test run line.
* Run for gloo tests as well.
* Better status message for gloo test.