Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19821
It is possible that not a single parameter is used during an
iteration. If this is the case, the `prepare_for_backward` function
marks all parameters as unused, kicks off reduction of all buckets,
*and* finalizes the reduction.
This is different from the prior implementation where we assumed that
autograd would produce a gradient for at least a single parameter.
We then used the autograd callback mechanism to queue a finalizer
callback. Now, this finalizer may be executed inline.
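For illustration only (not part of this commit): a minimal sketch of the scenario being handled, a forward pass in which no parameter participates, so no autograd hook ever fires and the reducer must finalize on its own. The module and flag names are made up.
```
import torch
import torch.nn as nn

class MaybeEmptyForward(nn.Module):
    def __init__(self):
        super(MaybeEmptyForward, self).__init__()
        self.fc = nn.Linear(4, 4)

    def forward(self, x, use_params):
        # When use_params is False, the output depends on no parameter, so a
        # subsequent backward() produces no parameter gradients at all.
        return self.fc(x) if use_params else x * 2.0

model = MaybeEmptyForward()
out = model(torch.randn(2, 4, requires_grad=True), use_params=False)
out.sum().backward()
print(model.fc.weight.grad)  # None: not a single parameter was used
```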
Reviewed By: mrshenli
Differential Revision: D15113272
fbshipit-source-id: dc91458b569cd8c106ddaeea558464b515683550
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19799
A module that returns multiple outputs, where the caller may end up
making multiple calls to torch.autograd.backward, did not work with
DistributedDataParallel. It expected the first call to
torch.autograd.backward to provide gradients for ALL parameters that
expect gradients and were used in computing the module output. If you
have outputs with disjoint autograd graphs it is fine to call
torch.autograd.backward on both and fill in the module's parameter
gradients in separate chunks.
With this change we delay queuing the finalizer callback until we have
marked all buckets as ready, instead of queueing it the first time we
receive an autograd hook. This returns the current implementation to
be functionally equivalent to the DistributedDataParallel
implementation before #18953 was merged.
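A hedged sketch of the usage pattern this supports (module and tensor names are illustrative; wrapping in DistributedDataParallel and process-group setup are omitted): two outputs with disjoint autograd graphs, each reduced by its own torch.autograd.backward call.
```
import torch
import torch.nn as nn

class TwoHeads(nn.Module):
    def __init__(self):
        super(TwoHeads, self).__init__()
        self.head_a = nn.Linear(8, 1)
        self.head_b = nn.Linear(8, 1)

    def forward(self, x):
        # The two outputs share no parameters, so their autograd graphs are disjoint.
        return self.head_a(x), self.head_b(x)

model = TwoHeads()
out_a, out_b = model(torch.randn(2, 8))
# Two separate backward calls, each filling a disjoint chunk of parameter gradients.
torch.autograd.backward(out_a.sum())
torch.autograd.backward(out_b.sum())
```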
Reviewed By: mrshenli
Differential Revision: D15097045
fbshipit-source-id: 2df023319713bc31e29a8b45108c78e6593fccd4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19515
This is still done by default, but can now be disabled by specifying
`find_unused_parameters=False`. There are use cases where finding
unused parameters results in erroneous behavior, because a subset of
model parameters is used *outside* the `forward` function. One can
argue that doing this is not a good idea, but we should not break
existing use cases without an escape hatch. This configuration
parameter is that escape hatch.
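A hedged example of the escape hatch (process-group initialization, e.g. via torch.distributed.init_process_group, is assumed to have happened and is not shown):
```
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

model = nn.Linear(16, 16)
# Opt out of unused-parameter detection when parameters are intentionally
# used outside forward(); detection remains enabled by default.
ddp_model = DistributedDataParallel(model, find_unused_parameters=False)
```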
Reviewed By: bddppq
Differential Revision: D15016381
fbshipit-source-id: f2f86b60771b3801ab52776e62b5fd6748ddeed0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19360
We'll return the output object verbatim since it is a freeform object.
We need to find any tensors in this object, though, because we need to
figure out which parameters were used during this forward pass, to
ensure we short circuit reduction for any unused parameters.
Before this commit only lists were handled and the functionality went
untested. This commit adds support for dicts and recursive structures,
and also adds a test case.
Closes #19354.
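A minimal Python sketch (the actual implementation is in C++) of the kind of recursive traversal this adds: walk a freeform output object and collect every tensor found in lists, tuples, and dicts.
```
import torch

def find_tensors(obj):
    if isinstance(obj, torch.Tensor):
        return [obj]
    if isinstance(obj, (list, tuple)):
        return [t for item in obj for t in find_tensors(item)]
    if isinstance(obj, dict):
        return [t for value in obj.values() for t in find_tensors(value)]
    return []

# A nested structure mixing dicts, lists, and tensors.
output = {"logits": torch.randn(2, 4), "aux": [torch.randn(2), {"extra": torch.randn(1)}]}
assert len(find_tensors(output)) == 3
```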
Reviewed By: mrshenli
Differential Revision: D14978016
fbshipit-source-id: 4bb6999520871fb6a9e4561608afa64d55f4f3a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18953
This removes Python side bucketing code from DistributedDataParallel
and replaces it with calls to the new C++ based bucketing and reducing
code. To confirm this is working well, we ran a test with both the
previous implementation and the new implementation, and confirmed they
are numerically equivalent.
Performance is improved by a couple of percent or more, including in the
single-machine, multi-GPU runs.
Closes #13273.
Reviewed By: mrshenli
Differential Revision: D14580911
fbshipit-source-id: 44e76f8b0b7e58dd6c91644e3df4660ca2ee4ae2
Summary:
~Sometimes, `init_process_group()`, `store.get()`, and `destroy_process_group()` can take more than a few seconds. Hence, removing thread join timeout.~
The error was due to `Address already in use` when starting the TCP backend. The solution is to catch the error and report it to the `retry_on_address_already_in_use_error` decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19114
Reviewed By: ezyang
Differential Revision: D14872680
Pulled By: mrshenli
fbshipit-source-id: fc504d02853ca73f76288c0ade564ab20bc01f7e
Summary:
closes#16520
Hi pietern, I am not sure if this is the expected way to pass timeout to `Store`, could you please help take a look? Thanks!
Questions:
1. How do I write tests for this? I wanted to do something like `test_barrier_timeout_global`, but it seems I need to set the pg's timeout larger than the `Store`'s default timeout (3 min) to see a difference, which is too long for a unit test. And I do not want to change the `Store`'s default timeout either. Any suggestion?
2. Should I also propagate timeout configuration down to `PrefixStore` in `_new_process_group_helper`?
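For reference, a hedged example of the user-facing side of this change: passing a timeout to init_process_group, which should then propagate down to the underlying Store. The address and sizes are placeholders for a single-process run.
```
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:23456",  # placeholder rendezvous address
    rank=0,
    world_size=1,
    timeout=timedelta(minutes=5),
)
dist.barrier()  # should now respect the configured timeout
dist.destroy_process_group()
```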
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16571
Differential Revision: D13954527
Pulled By: mrshenli
fbshipit-source-id: 77f2653903f24255207233eb298f7c0321119a87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18845
This adds a few CPU only test cases for the reducer class.
Reviewed By: mrshenli
Differential Revision: D14768432
fbshipit-source-id: c008a52206826304e634a95bc14167ed94c97662
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18291
ghimport-source-id: d6e95e899bd320407967df41435801e54864ba62
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18292 Add test for #17271 (torch.exp incorrect for 2**31 size tensor)
* **#18291 Correctly call superclass setUp in TestCase subclasses.**
This makes PYTORCH_TEST_SKIP_FAST work correctly for more
tests, reducing the wasted testing effort on our slow_test job.
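A hedged sketch of the pattern being enforced (unittest.TestCase stands in here for PyTorch's own TestCase; names are illustrative):
```
import unittest

class TestExample(unittest.TestCase):
    def setUp(self):
        # The call this change makes sure happens: without it, suite-level
        # machinery in the superclass setUp (such as the PYTORCH_TEST_SKIP_FAST
        # handling in PyTorch's TestCase) never runs.
        super(TestExample, self).setUp()
        self.data = list(range(4))

    def test_length(self):
        self.assertEqual(len(self.data), 4)

if __name__ == '__main__':
    unittest.main()
```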
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14567643
fbshipit-source-id: 40cf1d6556e0dd0a0550ff3d9ffed8b6000f8191
Summary:
This PR fixes a race condition in the TCP init method, where the master rank can exit earlier than the slave ranks, causing the TCP daemon thread to shut down before the other slaves are able to access it.
With this change, every rank (process) writes a special key to the store to mark that it has completed (and is thus about to exit). The master rank (which is the server) always waits for all ranks to complete before completing itself.
This should fix: https://github.com/pytorch/pytorch/issues/15638
Tested using the repro from https://github.com/pytorch/pytorch/issues/15638 and it works fine. test_distributed and test_c10d should already have this coverage.
I had to make the rendezvous test in c10d use a world size of 1, since it is single-process code.
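A hedged Python-level sketch of the completion handshake described above (the actual fix lives inside the TCP init method; the key names and helper are made up for illustration):
```
import tempfile
import torch.distributed as dist

def mark_done_and_wait(store, rank, world_size):
    # Every rank announces that it has finished...
    store.set("done_rank_{}".format(rank), "1")
    # ...and the server rank waits for all of them, keeping the daemon alive
    # until every client is done, instead of exiting early.
    if rank == 0:
        for r in range(world_size):
            store.get("done_rank_{}".format(r))  # blocks until the key exists

# Single-process demonstration with a throwaway FileStore.
store = dist.FileStore(tempfile.NamedTemporaryFile(delete=False).name, 1)
mark_done_and_wait(store, rank=0, world_size=1)
```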
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15684
Differential Revision: D13570904
Pulled By: teng-li
fbshipit-source-id: 34f3bc471204bbd29320df359347ad5561c6b589
Summary:
Otherwise, these tests will fail, even though they are never meant to run on single-GPU machines.
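A hedged example of the kind of guard this adds (plain unittest shown; PyTorch's test suite has its own skip helpers):
```
import unittest
import torch

@unittest.skipIf(torch.cuda.device_count() < 2, "requires at least 2 GPUs")
class MultiGpuSmokeTest(unittest.TestCase):
    def test_two_devices_visible(self):
        self.assertGreaterEqual(torch.cuda.device_count(), 2)

if __name__ == '__main__':
    unittest.main()
```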
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14860
Differential Revision: D13369060
Pulled By: teng-li
fbshipit-source-id: 8a637a6d57335491ba8602cd09927700b2bbf8a0
Summary:
It is possible that some sort of contention causes process scheduling
delays which in turn cause the timeout to *not* be hit.
Increased sleep here will decrease the probability of this happening.
Fixes #14555.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14814
Differential Revision: D13351924
Pulled By: pietern
fbshipit-source-id: 1222cf0855408dfcb79f30f94694c790ee998cf9
Summary:
If multiple arguments are specified to c10d allreduce, they are
interpreted as if they are expanding the ranks in the process group.
Therefore, not only is every argument to allreduce an input that must
be considered, it is also an output. The problem that this commit
fixes is that they were not correctly considered as outputs.
The upstream problem is tracked in facebookincubator/gloo#152. Once
this is fixed there we can remove the copies that this commit adds.
This fixes #14676.
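A hedged illustration of the semantics, using the low-level ProcessGroup binding the way test_c10d does and assuming a gloo-enabled build: with a world size of 1 and two tensors passed to a single allreduce call, both tensors are outputs and both must end up holding the combined result. The FileStore path is a throwaway temp file.
```
import tempfile
import torch
import torch.distributed as dist

store = dist.FileStore(tempfile.NamedTemporaryFile(delete=False).name, 1)
pg = dist.ProcessGroupGloo(store, 0, 1)  # rank 0 of a world of size 1

tensors = [torch.ones(4), torch.full((4,), 2.0)]
pg.allreduce(tensors).wait()
# Both tensors are outputs of the collective: each now holds 1 + 2 = 3.
print(tensors[0], tensors[1])
```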
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14688
Differential Revision: D13294405
Pulled By: pietern
fbshipit-source-id: 078a2a0a0ff12d051392461438f1496201ec3cb9
Summary:
Fixing: https://github.com/pytorch/pytorch/issues/14446
This was a supported behavior in old torch.distributed. We want to support it in the new release.
Tests should cover all combinations of scenarios where rank, world size, or both are provided via environment variables or arguments.
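A hedged example of the restored behavior, with placeholder values: world size comes from the environment while the rank is passed explicitly (the default env:// init method picks up whatever is not given as an argument).
```
import os
import torch.distributed as dist

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
os.environ["WORLD_SIZE"] = "1"           # size comes from the environment...
dist.init_process_group("gloo", rank=0)  # ...rank is passed as an argument
dist.destroy_process_group()
```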
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14494
Differential Revision: D13253433
Pulled By: teng-li
fbshipit-source-id: c05974d84f1bdf969f74ec45763e11a841fe4848
Summary:
Fixed: https://github.com/pytorch/pytorch/issues/14445
Also bumped the timeout up to 30 seconds, since on 8-GPU machines the DDP test can sometimes take more than 15 seconds.
Tested on 8 GPU machines:
```
tengli@learnfair062:~/pytorch/test$ python test_c10d.py --verbose
test_dist_broadcast_coalesced_gloo (__main__.DistributedDataParallelTest) ... ok
test_dist_broadcast_coalesced_nccl (__main__.DistributedDataParallelTest) ... skipped 'Test skipped due to known issues'
test_fp16 (__main__.DistributedDataParallelTest) ... ok
test_gloo_backend (__main__.DistributedDataParallelTest) ... ok
test_nccl_backend (__main__.DistributedDataParallelTest) ... ok
test_queue_reduction (__main__.DistributedDataParallelTest) ... ok
test_sync_params_no_buffers (__main__.DistributedDataParallelTest) ... ok
test_sync_params_with_buffers (__main__.DistributedDataParallelTest) ... ok
test_sync_reduction (__main__.DistributedDataParallelTest) ... ok
test_set_get (__main__.FileStoreTest) ... ok
test_set_get (__main__.PrefixFileStoreTest) ... ok
test_set_get (__main__.PrefixTCPStoreTest) ... ok
test_allgather_basics (__main__.ProcessGroupGlooTest) ... ok
test_allgather_checks (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_basics (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_basics_cuda (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_checks (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_stress (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_stress_cuda (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_basics (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_basics_cuda (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_checks (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_stress (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_stress_cuda (__main__.ProcessGroupGlooTest) ... ok
test_gather_basics (__main__.ProcessGroupGlooTest) ... ok
test_gather_checks (__main__.ProcessGroupGlooTest) ... ok
test_reduce_basics (__main__.ProcessGroupGlooTest) ... ok
test_reduce_checks (__main__.ProcessGroupGlooTest) ... ok
test_scatter_basics (__main__.ProcessGroupGlooTest) ... ok
test_scatter_checks (__main__.ProcessGroupGlooTest) ... ok
test_send_recv_all_to_all (__main__.ProcessGroupGlooTest) ... ok
test_timeout_kwarg (__main__.ProcessGroupGlooTest) ... ok
test_allgather_ops (__main__.ProcessGroupNCCLTest) ... ok
test_allreduce_ops (__main__.ProcessGroupNCCLTest) ... ok
test_barrier (__main__.ProcessGroupNCCLTest) ... ok
test_broadcast_ops (__main__.ProcessGroupNCCLTest) ... ok
test_reduce_ops (__main__.ProcessGroupNCCLTest) ... ok
test_common_errors (__main__.RendezvousEnvTest) ... ok
test_nominal (__main__.RendezvousEnvTest) ... ok
test_common_errors (__main__.RendezvousFileTest) ... ok
test_nominal (__main__.RendezvousFileTest) ... ok
test_common_errors (__main__.RendezvousTCPTest) ... ok
test_nominal (__main__.RendezvousTCPTest) ... ok
test_unknown_handler (__main__.RendezvousTest) ... ok
test_address_already_in_use (__main__.TCPStoreTest) ... ok
test_set_get (__main__.TCPStoreTest) ... ok
----------------------------------------------------------------------
Ran 46 tests in 162.980s
OK (skipped=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14452
Differential Revision: D13230652
Pulled By: teng-li
fbshipit-source-id: 88580fe55b3a4fbc7a499ca3b591958f11623bf8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14386
See #13573, #14142, and #14271 for discussion.
This change updates ProcessGroupGloo to ensure that all prior
operations have completed before executing the barrier.
Reviewed By: manojkris
Differential Revision: D13205022
fbshipit-source-id: 673e7e6ca357dc843874d6dd8da590832e1de7fa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14298
This is a breaking API change for users of the C++ c10d API. The work
object defined wait() to return a boolean. If the work completed
successfully it would return true, if it didn't it would return false.
It was then up to the user to call the exception() function to figure
out what went wrong. This has proven suboptimal as it allows users to
forget about failure handling and errors may be ignored.
The work class is semantically very similar to std::future, where a
call to get() may throw if the underlying std::promise has set an
exception. This commit changes the semantics of the work class to be
similar to this and turns wait() into a void function that throws if
the work completes with an exception.
The exception() function can still be used to retrieve the exception
if isSuccess() returns false, but now returns an std::exception_ptr
instead of a reference to a std::exception.
Reviewed By: manojkris
Differential Revision: D13158475
fbshipit-source-id: 9cd8569b9e7cbddc867a5f34c6fd0b7be85581b8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14294
This is the final collective to be ported to the new style where there
is no longer a need to keep a cached algorithm instance around. There
is a follow up change incoming to remove the algorithm caching
functionality in ProcessGroupGloo.
Reviewed By: manojkris
Differential Revision: D13111509
fbshipit-source-id: f3ea0d955a62029fc4e7cfc09055e4957e0943ac
Summary:
Most likely a typo.
Tested on 8-GPU machine
```
tengli@learnfair062:~/pytorch/test$ python test_c10d.py ProcessGroupNCCLTest.test_barrier
.
----------------------------------------------------------------------
Ran 1 test in 29.341s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14389
Differential Revision: D13207207
Pulled By: teng-li
fbshipit-source-id: aaffe14237076fe19d94e2fa4d9c093397f07bb9
Summary:
This covers the edge case where we run the same NCCL process group with multiple GPU combinations instead of only the last GPU combination. We always keep track of which GPUs have been used previously in the NCCL process group, and barrier() itself will synchronize on each GPU's NCCL stream.
A test is included as well. Tested on an 8-GPU machine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14271
Differential Revision: D13164993
Pulled By: teng-li
fbshipit-source-id: 81e04352740ea50b5e943369e74cfcba40bb61c1
Summary:
This addresses: https://github.com/pytorch/pytorch/issues/11874
and gives us identical file init_method behavior to the previous THD file init.
The FileStore::add bug was also pretty annoying. Two bugs:
Two bugs:
(1) Add doesn't append to the end of the file.
(2) Cache doesn't get updated.
Both are fixed and covered by tests.
I examined /tmp to ensure that all temp files are automatically deleted after test_c10d.py runs.
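A hedged single-process example exercising the repaired FileStore behavior from Python (the path is a throwaway temp file): add() must accumulate correctly and reads must not hit a stale cache.
```
import tempfile
import torch.distributed as dist

path = tempfile.NamedTemporaryFile(delete=False).name
store = dist.FileStore(path, 1)   # world_size=1 for a single-process example
store.add("counter", 3)
store.add("counter", 4)
store.set("key", "value")
print(store.add("counter", 0))    # 7, if add appends and the cache is refreshed
print(store.get("key"))           # b'value'
```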
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13708
Reviewed By: pietern
Differential Revision: D12972810
Pulled By: teng-li
fbshipit-source-id: 917255390aa52845f6b0ad0f283875a7a704da48
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13816
If common.find_free_port() returns the same port over and over again,
and the TCPStore fails to bind to it over and over again, this
function has the potential to loop forever. If we can't find a free
port after 10 tries, we are safe to assume something is wrong...
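A hedged Python sketch of the bounded retry described above (the real change is in the test helper; find_free_port here is a local stand-in for common.find_free_port):
```
import socket
from datetime import timedelta
import torch.distributed as dist

def find_free_port():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

def create_tcp_store(addr="127.0.0.1", tries=10):
    for _ in range(tries):
        port = find_free_port()
        try:
            # is_master=True: this process binds and owns the listening socket.
            return dist.TCPStore(addr, port, 1, True, timedelta(seconds=30))
        except RuntimeError:
            continue  # someone else grabbed the port in the meantime; retry
    raise RuntimeError("could not bind a TCPStore after {} tries".format(tries))

store = create_tcp_store()
```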
Differential Revision: D13017700
fbshipit-source-id: 2139a0ea0f30ce08b5571f80ae0551f1fa7ba4a2
Summary:
We only need this for the backward pass; for the forward cast, the non-fine-grained bucketing should be better since it's sequential anyway.
Testing is fully covered by the c10d test; the bucket size was reduced so that bucketing actually happens in the c10d test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13607
Differential Revision: D12944515
Pulled By: teng-li
fbshipit-source-id: d982e8dca2874c91d39b30b73a85bfbeb768c508
Summary:
Functionality testing shouldn't be affected since both backends test the same thing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13606
Differential Revision: D12937185
Pulled By: teng-li
fbshipit-source-id: 03d897b6690f7932654fdb7d11a07016dfffa751
Summary:
When going to mixed-precision fp16 training, DDP randomly hangs. Initially, I thought this smelled like a similar NCCL bug I filed a while ago. It turns out it's not. Again, I was seeing different rank processes with different sizes. How could this even happen?
It turns out that take_tensors will generate the list of bucketed tensors in a nondeterministic order, because the key to the map is a pointer. An interesting bug to dig into and fix.
fp16 DDP training should now be fully working.
Also added another fine-grained take_tensors helper that aims to improve the performance of DDP, with a TODO to replace the DDP take_tensors with it.
Fixed: https://github.com/pytorch/pytorch/issues/12150
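A hedged Python sketch (not the actual C++ take_tensors) of the underlying issue and fix: keying buckets on a pointer gives a process-dependent iteration order, while keying on a stable value such as the dtype, in first-appearance order, yields the same bucket order on every rank.
```
from collections import OrderedDict
import torch

def bucket_by_dtype(tensors):
    buckets = OrderedDict()  # deterministic: ordered by first appearance, not by pointer
    for t in tensors:
        buckets.setdefault(t.dtype, []).append(t)
    return list(buckets.values())

mixed = [torch.randn(2), torch.randn(2).half(), torch.randn(3)]
print([[t.dtype for t in b] for b in bucket_by_dtype(mixed)])
```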
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13496
Differential Revision: D12920985
Pulled By: teng-li
fbshipit-source-id: 26f3edae7be45a80fa7b2410a2e5a1baab212d9c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13497
This replaces the existing broadcast implementation with the new style collective call in the gloo backend. The CUDA path copies CUDA tensors to CPU tensors and then runs the CPU broadcast implementation.
Reviewed By: teng-li
Differential Revision: D12890013
fbshipit-source-id: 43f346fb2814f421bedc7babf89169703a46bb9c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13426
This replaces the existing allreduce implementation with the new style collective call in the gloo backend. This is the first one to include both a CPU and a CUDA path. The CUDA path copies CUDA tensors to CPU tensors and then runs the CPU allreduce implementation. This is not much different from the current situation in the case where there is a single input tensor per call (which is the case when called from DistributedDataParallel).
Reviewed By: teng-li
Differential Revision: D12855689
fbshipit-source-id: 574281d762dd29149fa7f634fb71f8f6a9787598
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13425
This adds support for the new style reduce collective call in the gloo backend.
Reviewed By: teng-li
Differential Revision: D12869404
fbshipit-source-id: 93c641e6aba3b03c796bda80737547c565cfa571
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13424
This adds support for the allgather collective call in the gloo backend. The gloo implementation does not support multiple inputs per rank (nor one or more outputs per rank), so we use a temporary flattened buffer and unflatten once the collective finishes.
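A hedged Python sketch of the flatten/unflatten workaround described above, using the helpers the Python side of DDP also relies on (the commit itself does this in C++ inside ProcessGroupGloo):
```
import torch
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

outputs = [torch.zeros(3), torch.zeros(2, 2)]
flat = _flatten_dense_tensors(outputs)   # single contiguous buffer for gloo
flat += 1                                # stand-in for the collective filling in results
for buf, synced in zip(outputs, _unflatten_dense_tensors(flat, outputs)):
    buf.copy_(synced)                    # copy back into the original tensors
print(outputs)
```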
Reviewed By: teng-li
Differential Revision: D12832009
fbshipit-source-id: 2f5c1934a338589cef1d3192bd92ada135fecd7a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13423
This adds support for the gather collective call in the gloo backend. The gloo implementation does not yet support the mode where the root has multiple output tensors (one per rank), so we use a temporary flattened buffer and unflatten on the root once the collective finishes.
Reviewed By: teng-li
Differential Revision: D12811647
fbshipit-source-id: 90fe8af8c390090b7d4ef43aa74f4e3e67ab9d0b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13422
This adds support for the scatter collective call in the gloo backend. This is the first of the new style collectives that do not expect to be created once and used many times. This commit contains some shortcuts to make this new style work side by side with the existing implementations (such as the std::tuple with nullptr's). These shortcuts are temporary until we have moved over all collectives to this new style.
Reviewed By: teng-li
Differential Revision: D12310219
fbshipit-source-id: 32e68717f819d5980f0e469d297204948351cefc
Summary:
1. Refactors `TestTorch` into `TestTorchMixin` (subclass of `object`) and `TestTorch` (subclass of `TestCase`, MRO `(TestCase, TestTorchMixin)`, only defined if `__name__ == '__main__'`). So other scripts won't accidentally run it.
2. Adds an assertion in `load_tests` that each script only runs cases defined in itself.
cc yf225 ezyang
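A hedged sketch of the pattern in item 1 (unittest.TestCase stands in for PyTorch's TestCase): test bodies live in a plain-object mixin, and the runnable TestCase is only defined when the script is executed directly, so other scripts that import it won't accidentally run the tests.
```
import unittest

class TestTorchMixin(object):
    def test_addition(self):
        self.assertEqual(1 + 1, 2)

if __name__ == '__main__':
    class TestTorch(unittest.TestCase, TestTorchMixin):
        pass
    unittest.main()
```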
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13250
Differential Revision: D12823734
Pulled By: SsnL
fbshipit-source-id: 7a169f35fe0794ce76e310d8a137d9a3265c012b
Summary:
The existing default timeout was set at 10 seconds, which is too low
for asynchronous tasks that depend on a barrier to resynchronize.
Having a single timeout for all operations is not ideal and this will
be addressed in future commits.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13056
Reviewed By: teng-li
Differential Revision: D10558746
Pulled By: pietern
fbshipit-source-id: d857ea55b1776fc7d0baf2efd77951b5d98beabb
Summary:
- Moved sync_reduction to C++
- Use a dedicated CUDA stream for memcpy
- Also use a dedicated CUDA stream for memcpy in queue_reduction
Added a test as well.
CI should cover both DDP and the unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12954
Differential Revision: D10520069
Pulled By: teng-li
fbshipit-source-id: 64348e4e43c15f9695a4c28b036c232587ecfb65
Summary:
Fully working version, continuing on goldsborough's initial version.
Waiting on the stream guard to be merged before adding more stream performance logic into the C++ version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12852
Differential Revision: D10468696
Pulled By: teng-li
fbshipit-source-id: 8e46d408796973817abfd9dbd6566e0ca5b7a13f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12782
We have seen the "Address already in use" error popup a few times when instantiating the TCPStore. The port that it uses is dynamically generated through common.find_free_port(), which binds a new socket to a random port, closes the socket, and returns the port that the OS had assigned. If some other process grabs that port in the time between closing the socket and the TCPStore binding to it, the bind error shows up. This commit changes most tests to use the FileStore instead and includes a retry when testing the TCPStore.
Differential Revision: D10433401
fbshipit-source-id: 8dd575ac91a3cddd1cc41ddb0ff4311ddc58c813
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12794
common.py is used as a base module for almost all tests in test/. The
name of this file is so common that it can easily conflict with other dependencies
if they happen to have another common.py in the base module. Rename the file to
avoid the conflict.
Reviewed By: orionr
Differential Revision: D10438204
fbshipit-source-id: 6a996c14980722330be0a9fd3a54c20af4b3d380