Commit Graph

385 Commits

Author SHA1 Message Date
Rohan Varma
1c8fcc44cb [Opt Overlap] Support optimizing partial set of parameters (#71608)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71608

Per title
ghstack-source-id: 147577178

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33696382

fbshipit-source-id: 5b638d3edf5f03ba476356d61e96ca604de18c8f
(cherry picked from commit 436b547fb0)
2022-01-26 19:33:49 +00:00
Rohan Varma
d3354602fc [Easy] DDP typo fix (#71607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71607

Per title
ghstack-source-id: 147577177

Test Plan: N/a

Reviewed By: cbalioglu

Differential Revision: D33694038

fbshipit-source-id: 5a5a618f13bc8b91127169efcebb90b5a36474a1
(cherry picked from commit 62f17f116d)
2022-01-26 07:32:04 +00:00
Rohan Varma
10ca760c0a [Opt Overlap] Implement register_fused_optim in DDP (#71606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71606

Per title
ghstack-source-id: 147577172

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33694037

fbshipit-source-id: a148d5ce6031f0cc20f33785cfe2c27d1fc2d682
(cherry picked from commit ace3261e0c)
2022-01-26 07:32:04 +00:00
Yanli Zhao
4b3cf1eaf7 [BE]Clarify how to check memory saving if using gradient_as_bucket_view (#71483)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71483

Clarify that peak memory saving should be checked after the first iteration when using gradient_as_bucket_view.
ghstack-source-id: 147271113

Test Plan: unit test

Reviewed By: rohan-varma

Differential Revision: D33662424

fbshipit-source-id: f760da38e166ae85234e526ddf1526269ea25d42
(cherry picked from commit a40dda20da)
2022-01-20 19:38:41 +00:00
Yanli Zhao
1c61d8c43f [PT1.11] make static graph to be stable (#71459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71459

1. add the static_graph feature to the DDP constructor;
2. still keep the _set_static_graph() API so that existing use cases are not affected; it can also be called internally by the DDP constructor;
3. four cases are covered:
    static_graph = False, _set_static_graph() is called;
    static_graph = False, _set_static_graph() is not called;
    static_graph = True, _set_static_graph() is not called;
    static_graph = True, _set_static_graph() is called;
ghstack-source-id: 147263797

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D33646738

fbshipit-source-id: 8c1730591152aab91afce7133d2adf1efd723855
(cherry picked from commit dc246a1129)
2022-01-20 19:38:41 +00:00
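The four combinations above reduce to the constructor flag and the legacy setter funneling into the same internal state. A toy Python sketch (not the real torch API; `MiniDDP` and its behavior are assumptions for illustration only):

```python
class MiniDDP:
    """Toy stand-in (not torch.nn.parallel.DistributedDataParallel)
    showing how a static_graph constructor argument can wrap a
    pre-existing private setter so that all four listed combinations
    behave consistently."""

    def __init__(self, static_graph=False):
        self.static_graph = False
        if static_graph:
            # the constructor delegates to the legacy API internally
            self._set_static_graph()

    def _set_static_graph(self):
        # calling the setter twice is harmless (idempotent)
        self.static_graph = True
```

With this shape, `MiniDDP(static_graph=True)` and `MiniDDP()._set_static_graph()` end in the same state, which is why existing callers of the private API are unaffected.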
Rohan Varma
fcd1375b2b [DDP][BE][Docs] Clarify checkpoint support (#68827)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68827

Add a note about current checkpoint support with DDP. Note that this
does not yet include the features enabled with _set_static_graph, as it is an
undocumented private API. Once we support static graph as a beta feature in OSS,
we can extend the note here.
ghstack-source-id: 144285041

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D32624957

fbshipit-source-id: e21d156a1c4744b6e2a807b5b5289ed26701886f
2021-11-30 12:37:37 -08:00
Santiago Castro
f776f30780 Keep the sequence or mapping type in default_collate (#68779)
Summary:
`default_collate`, `default_convert`, and `pin_memory` convert sequences into lists. I believe they should keep the original type when possible (e.g., I have a class that inherits from `list`, which comes from a 3rd party library that I can't change, and provides extra functionality).

Note it's easy to do when the type supports an iterable in its creation but it's not always the case (e.g., `range`).

Even though this can be accomplished if using a custom `default_collate`/`default_convert`, 1) this is behavior they should support out-of-the-box IMHO, and 2) `pin_memory` still does it.

cc VitalyFedyunin ejguan NivekT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68779

Reviewed By: wenleix

Differential Revision: D32651129

Pulled By: ejguan

fbshipit-source-id: 17c390934bacc0e4ead060469cf15dde815550b4
2021-11-29 13:14:20 -08:00
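The type-preserving behavior described above can be sketched in plain Python (a simplified stand-in for `default_convert`, including the fallback for types like `range` that the message mentions):

```python
def convert_keep_type(seq):
    """Rebuild the input's own sequence type when possible (e.g. a
    list subclass from a third-party library); fall back to a plain
    list for types like range whose constructor does not accept an
    iterable of elements."""
    converted = [x for x in seq]  # per-element conversion would go here
    try:
        return type(seq)(converted)
    except TypeError:
        # e.g. range(...) cannot be rebuilt from its elements
        return converted

class TaggedList(list):
    """Example third-party list subclass carrying extra functionality."""
    extra = "metadata"
```

Here `convert_keep_type(TaggedList([1, 2]))` stays a `TaggedList`, while `convert_keep_type(range(3))` degrades gracefully to `[0, 1, 2]`.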
Yifan Xiong
c7eaec86f0 [NCCL] Patch bfloat16 support (#67843)
Summary:
Patch bfloat16 support in NCCL. PR https://github.com/pytorch/pytorch/issues/63260 added bfloat16 support but was
still not complete enough to enable bfloat16 for allreduce in end-to-end training.

This patch does the following:
* fix the minimum NCCL version from 2.9.7 to 2.10, since NCCL added bf16 support in
  v2.10.3-1 (commit 7e51592)
* update the bfloat16 datatype flag in `csrc/cuda/nccl.cpp` so that NCCL
  operations like allreduce can use it
* enable unit tests for the bfloat16 datatype where possible

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67843

Reviewed By: H-Huang

Differential Revision: D32248132

Pulled By: mrshenli

fbshipit-source-id: 081e96e725af3b933dd65ec157c5ad11c6873525
2021-11-09 13:46:13 -08:00
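The reason bf16 is attractive for allreduce is that a bfloat16 value is just the top 16 bits of an IEEE-754 float32, halving communication volume at reduced precision. A pure-Python sketch of the conversion (illustrative only; NCCL and CUDA handle this natively):

```python
import struct

def to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits (sign, 8-bit exponent, 7-bit mantissa)
    of the float32 representation; truncation, no rounding."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def from_bf16_bits(b: int) -> float:
    """Widen back to float32 by zero-filling the low 16 bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x
```

Round-tripping 1.0 is exact, while 3.14159 comes back only to roughly two decimal places; that precision/bandwidth trade-off is what a bf16 allreduce accepts.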
James Reed
80178d6152 [DDP] Fix some issues with code example in DDP docstring (#67883)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67883

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: zhaojuanmao

Differential Revision: D32190946

Pulled By: jamesr66a

fbshipit-source-id: a376324b95cbe833ffa606ecdfc6156432880f70
2021-11-05 17:32:45 -07:00
Rohan Varma
bff64e84cd [DDP] Track models with sync bn (#66680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66680

Closes https://github.com/pytorch/pytorch/issues/66215. Tracks models
with sync BN so we can find workflows that use them and target them for perf
optimization.
ghstack-source-id: 140875182

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D31679477

fbshipit-source-id: 0e68cd1a7aabbc5b26227895c53d33b8e98bfb8e
2021-10-18 22:31:52 -07:00
Rohan Varma
38f5144eae Fix https://github.com/pytorch/pytorch/issues/61982 (#66015)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66015

Fixes https://github.com/pytorch/pytorch/issues/61982 by cloning
tensors in DDPSink. The clone only applies once for static_graph, and in general for unused
params, which already have overhead, so the perf hit should not be an issue. Will
verify with a benchmark.

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D31346633

fbshipit-source-id: 5b9245ade628565cffe01731f6a0dcbb6126029b
2021-10-07 18:11:18 -07:00
Rohan Varma
71704349aa [DDP] Allow await of custom buffer reduction in backward (#64515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64515

For performance reasons, we would like to ensure that we can await
user collectives as part of custom buffer reduction in parallel with other work.
As a result, add support for returning futures from custom buffer hooks and awaiting
those futures at the end of the backward pass.

Also added some docs to clarify how to use these APIs.
ghstack-source-id: 138793803

Test Plan: I

Reviewed By: zhaojuanmao

Differential Revision: D30757761

fbshipit-source-id: e1a2ead9ca850cb345fbee079cf0614e91bece44
2021-09-23 13:02:53 -07:00
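The hook contract described above can be mimicked with standard-library futures (a hedged sketch: the hook name, state shape, and the averaging step are assumptions, and real DDP uses `torch.futures.Future` rather than `concurrent.futures`):

```python
from concurrent.futures import Future

def buffer_reduce_hook(state, buffers):
    """Sketch of a custom buffer hook: kick off an asynchronous
    reduction and hand DDP back a future to await at the end of
    backward. Here the 'collective' completes immediately by
    averaging over a pretend world size."""
    fut = Future()
    fut.set_result([b / state["world_size"] for b in buffers])
    return fut

def end_of_backward(pending):
    """DDP-side sketch: the future collected during forward is only
    awaited at the end of the backward pass, letting the buffer
    reduction overlap with other work."""
    return pending.result()
```

The key point is that the hook returns immediately; blocking is deferred until the end of backward.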
Wanchao Liang
2f67579864 [ddp] use named_params and named_buffers explicitly (#65181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65181

This PR changes the sync to use `named_parameters` and `named_buffers` explicitly instead of `state_dict()`. The underlying motivation is that `state_dict()` does not necessarily equal "params + buffers" in all cases. state_dict is used mainly for checkpointing, while params/buffers are used for training, and the two can take different forms (i.e., we might want to save the state_dict as small pieces of tensors while concatenating the tensors together during training for performance reasons).
ghstack-source-id: 138701159

Test Plan: wait for ci

Reviewed By: divchenko, rohan-varma

Differential Revision: D31007085

fbshipit-source-id: 4e1c4fbc07110163fb9b09b043ef7b4b75150f18
2021-09-22 17:32:54 -07:00
Rohan Varma
5739f77775 [DDP] Refactor and remove sync_params (#64514)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64514

`sync_params` is a misnomer since we don't actually synchronize
parameters. While removing it, I realized
`self._check_and_sync_module_buffers` does almost everything we need it to, so
I just refactored that and made the DDP forward call into it.
ghstack-source-id: 138684982

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D30751231

fbshipit-source-id: add7c684f5c6c71dad9e9597c7759849fa74f47a
2021-09-22 14:12:51 -07:00
Rohan Varma
ce5981e431 [DDP] Custom buffer reduction (#64513)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64513

Proposal: https://github.com/pytorch/pytorch/issues/63041
Support custom buffer reduction in DDP via hook
ghstack-source-id: 138655663

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D30751152

fbshipit-source-id: 257a9d46bb178d8812d4ea5a4d9c6140b8a1791f
2021-09-22 14:11:35 -07:00
Jessica Choi
f24bd43375 Changing type and name of local_used_maps to reflect that it is only one map (#65380)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65380

Fixing bugs that arise when running setup.py develop

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang gcramer23

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D31104844

Pulled By: jaceyca

fbshipit-source-id: acfd4cf316c71177df758ca55b470f51a17f776b
2021-09-22 11:35:33 -07:00
Jessica Choi
158b8bdc8a Cleaning up DDP SPMD in reducer.cpp (#64113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64113

Since there is only one model replica per process, `replicas`
can be simplified from `std::vector<std::vector<at::Tensor>>` to
`std::vector<at::Tensor>` in the Reducer class.

Test Plan:
All tests are passing
`pytest test/distributed/test_c10d_gloo.py -vs`

Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30615965

fbshipit-source-id: d2ec809d99b788c200b01411333e7dbad1269b51
2021-09-21 16:13:18 -07:00
Rohan Varma
45bd0f6181 Back out "Revert D30745960: [DDP] Remove SPMD from self.modules_buffers" (#64778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64778

Original commit changeset: d3f3fb813c45
ghstack-source-id: 138326910

Test Plan: ci

Reviewed By: H-Huang

Differential Revision: D30849443

fbshipit-source-id: 15dab8a959a29d2e2fefac6ad52b8d8168eacc02
2021-09-17 12:28:36 -07:00
Rohan Varma
70f286c1e2 Back out "Revert D30745961: [DDP] Remove self.modules_params" (#64777)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64777

Original commit changeset: 59f7cc50d369
ghstack-source-id: 138326909

Test Plan: ci

Reviewed By: H-Huang

Differential Revision: D30849442

fbshipit-source-id: bb87ba83935374d8a3ebbc29365df1417dd4f26f
2021-09-17 12:28:34 -07:00
Rohan Varma
61dfcbf4bc Back out "Revert D30745921: [DDP] Fix when buffers are reassigned in module" (#64776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64776

Original commit changeset: 343ead86bf1e
ghstack-source-id: 138326914

Test Plan: ci

Reviewed By: H-Huang

Differential Revision: D30849444

fbshipit-source-id: 9a72805416fe7d6c68e51bdcdb88f6e1fecb614d
2021-09-17 12:28:32 -07:00
Howard Huang
459653a0f6 Revert D30745921: [DDP] Fix when buffers are reassigned in module
Test Plan: revert-hammer

Differential Revision:
D30745921 (d59ecc02df)

Original commit changeset: 25eb1edbf445

fbshipit-source-id: 343ead86bf1e2d0b2d4124be331ea2fa437303ad
2021-09-09 08:23:16 -07:00
Howard Huang
5bc53ac5ef Revert D30745961: [DDP] Remove self.modules_params
Test Plan: revert-hammer

Differential Revision:
D30745961 (8c09510294)

Original commit changeset: 32d102502570

fbshipit-source-id: 59f7cc50d369b6cc2856cf4ebd0f58b96202336d
2021-09-09 08:23:14 -07:00
Howard Huang
f1aaf8afcd Revert D30745960: [DDP] Remove SPMD from self.modules_buffers
Test Plan: revert-hammer

Differential Revision:
D30745960 (1553259520)

Original commit changeset: 66a8f9847e9f

fbshipit-source-id: d3f3fb813c45ac1b0ff15c6154b2e99e5dbab433
2021-09-09 08:22:12 -07:00
Rohan Varma
1553259520 [DDP] Remove SPMD from self.modules_buffers (#64474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64474

No need for a nested list here.
ghstack-source-id: 137526312

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D30745960

fbshipit-source-id: 66a8f9847e9fe1e02c51b79647e93bf7665cf4d9
2021-09-08 19:16:15 -07:00
Rohan Varma
8c09510294 [DDP] Remove self.modules_params (#64473)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64473

Unused after SPMD deprecated.
ghstack-source-id: 137526305

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D30745961

fbshipit-source-id: 32d102502570291e01579e5b47a6d74dc71013bb
2021-09-08 19:16:13 -07:00
Rohan Varma
d59ecc02df [DDP] Fix when buffers are reassigned in module (#64472)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64472

Sometimes, a user module can reassign a tensor buffer, as in:

```
self.buffer = torch.randn(1, 2) # in init
self.buffer += 1 # in forward
```

In this case, `self.modules_buffers` will become outdated, and we should
repopulate `self.modules_buffers` if we need to sync module buffers.

See https://github.com/pytorch/pytorch/issues/63916 for full description of the
issue.
ghstack-source-id: 137526309

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D30745921

fbshipit-source-id: 25eb1edbf445703a481802e07f3058d38ea6fc64
2021-09-08 19:14:55 -07:00
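The staleness problem is ordinary Python attribute rebinding: a cached list of buffer references no longer points at the module's current buffer once the attribute is rebound (whether by outright replacement or an operation that produces a new tensor). A minimal illustration with a plain object standing in for a tensor:

```python
class ToyModule:
    def __init__(self):
        self.buffer = (0.0, 0.0)  # immutable stand-in for a tensor buffer

m = ToyModule()
modules_buffers = [m.buffer]   # DDP-style cached reference list

m.buffer = (1.0, 1.0)          # buffer reassigned in forward

stale = modules_buffers[0] is not m.buffer   # cache is now outdated
modules_buffers = [m.buffer]                 # repopulate before syncing
fresh = modules_buffers[0] is m.buffer       # cache is valid again
```

Repopulating right before the buffer sync, as the fix does, guarantees the reducer always broadcasts the live buffers.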
Yinbin Ma
0d437fe6d0 BF16 allreduce hook (#63260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63260

Add BF16 all-reduce communication hook. Skip if CUDA version < 11 or NCCL version < 2.9.7.

Reviewed By: SciPioneer

Differential Revision: D30238317

fbshipit-source-id: bad35bf7d43f10f1c40997a282b831b61ef592bb
2021-08-18 20:53:49 -07:00
Rohan Varma
5fb79f61a8 [DDP] Dont set thread local state in reducer autograd hook. (#62996)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62996

No need to set this because autograd engine already propagates TLS
states.
ghstack-source-id: 135438220

Test Plan: CI

Reviewed By: albanD

Differential Revision: D30202078

fbshipit-source-id: e5e917269a03afd7a6b8e61f28b45cdb71ac3e64
2021-08-10 10:50:16 -07:00
Rohan Varma
3df4870343 [Reland][DDP] Support not all outputs used in loss calculation (#61753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61753

Reland of https://github.com/pytorch/pytorch/pull/57081.
Main difference is that the former diff moved `prepare_for_backward` check into `DDPSink` backward, but that resulted in issues due to potential autograd engine races. The original diff moved `prepare_for_backward` into `DDPSink` as part of a long-term plan to always call it within `DDPSink`.

In particular this doesn't work because `prepare_for_backward` sets `expect_autograd_hooks=true` which enables autograd hooks to fire, but there were several use cases internally where autograd hooks were called before DDPSink called `prepare_for_backward`, resulting in errors/regression.

We instead keep the call to `prepare_for_backward` in the forward pass, but still run outputs through `DDPSink` when find_unused_parameters=True. As a result, outputs that are not used when computing loss have `None` gradients and we don't touch them if they are globally `None`. Note that the hooks still fire with a undefined gradient which is how we avoid the Reducer erroring out with the message that some hooks did not fire.

Added the unittests that were part of the reverted diff.
ghstack-source-id: 135388925

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29726179

fbshipit-source-id: 54c8819e0aa72c61554104723a5b9c936501e719
2021-08-09 22:29:11 -07:00
Rohan Varma
80091cb0f7 [DDP] Allow tuning of first bucket (#62748)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62748

Previously, after buckets were rebuilt, the first bucket size always
defaulted to 1MB. This diff allows the first bucket to be tuned like the rest of
the bucket sizes.

Setting `dist._DEFAULT_FIRST_BUCKET_BYTES = 1` results in the following logs as
expected:
I0804 12:31:47.592272 246736 reducer.cpp:1694] 3 buckets rebuilt with size
limits: 1, 1048, 1048 bytes.
ghstack-source-id: 135074696

Test Plan: CI

Reviewed By: SciPioneer, wanchaol

Differential Revision: D30110041

fbshipit-source-id: 96f76bec012de129d1645e7f50e266d4b255ec66
2021-08-05 16:35:07 -07:00
Sean Lawlor
34c9f5a8da [DDP Communication Hook] Update get_tensor and set_tensor to be cleaner naming conventions (buffer() and set_buffer()) (#62662)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662

Replaced the methods `set_tensor(.)` and `get_tensor()` in the Python-exposed API from the C++ logic with `buffer()` and `set_buffer(.)` for a cleaner interface.

Reviewed By: SciPioneer

Differential Revision: D30012869

fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482
2021-08-04 09:27:31 -07:00
Andrew Gu
62a90c227f Make _Join, _Joinable, _JoinHook public (#62605)
Summary:
**Overview:**
This removes the preceding `_` from `_Join`, `_Joinable`, and `_JoinHook` in preparation for adding the generic join context manager tutorial (see [here](https://github.com/pytorch/tutorials/pull/1610)). This also adds a docs page, which can be linked from the tutorial. [Here](https://github.com/pytorch/pytorch/files/6919475/render.pdf) is a render of the docs page.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62605

Test Plan:
`DistributedDataParallel.join()`:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```

`ZeroRedundancyOptimizer`:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
NOTE: DDP overlap tests are failing due to a landing race. See https://github.com/pytorch/pytorch/pull/62592. Once the fix is landed, I will rebase, and tests should be passing.

`Join`:
```
gpurun4 python test/distributed/algorithms/test_join.py
```

Reviewed By: mrshenli

Differential Revision: D30055544

Pulled By: andwgu

fbshipit-source-id: a5ce1f1d9f1904de3bdd4edd0b31b0a612d87026
2021-08-03 12:20:11 -07:00
Rohan Varma
4d5607bb25 [Reland][DDP] log bucket sizes (#62625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62625

reland of https://github.com/pytorch/pytorch/pull/62232 which ran into a land race.

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D30058217

fbshipit-source-id: 1454dd481e630f3de9ec6111b3f2e18cd8976091
2021-08-03 10:55:46 -07:00
Andrew Gu
51f687fd4b Add overlap with DDP to ZeRO (two approaches) (#62157)
Summary:
**Overview:**
This adds two approaches to overlapping `DistributedDataParallel.backward()` with `ZeroRedundancyOptimizer.step()` by providing two hook constructors: `hook_with_zero_step()` and `hook_with_zero_step_interleaved()`. The former waits for all backward computation to finish before starting optimizer computation, while the latter launches a partial optimizer computation using the contents of a gradient bucket once that bucket's all-reduce completes. The two approaches each suffer from their own weaknesses, and which one to use depends on the specific hardware configuration.

Both approaches can share changes to `ZeroRedundancyOptimizer`. A user should pass `overlap_with_ddp=True` to `ZeroRedundancyOptimizer`, construct a DDP communication hook using either `hook_with_zero_step()` or `hook_with_zero_step_interleaved()`, and register that communication hook. `ZeroRedundancyOptimizer.step()` should still be called in the training loop, though the optimizer computation and communication will be offloaded to originate from the communication hook. Currently, the first two iterations are vacuous, meaning they do not result in parameter updates and the inputs are ignored. This is required to finalize the DDP bucket strategy and to then initialize the `ZeroRedundancyOptimizer`'s local optimizer based on that bucketing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62157

Test Plan:
The existing `ZeroRedundancyOptimizer` tests pass, and new unit tests for both hooks pass:
- ~~`test_ddp_with_zero_step_parity_cpu`~~ (removed for now due to flakiness in CI -- under investigation, could possibly be similar Gloo issue as with `hook_with_zero_step_interleaved()`)
- `test_ddp_with_zero_step_parity_gpu`
- `test_ddp_with_zero_step_interleaved_parity_gpu`

These were tested on the AI AWS cluster.

An analogous `test_ddp_with_zero_step_interleaved_parity_cpu` is missing due to existing bugs with Gloo. See https://github.com/pytorch/pytorch/pull/62302.

Both approaches have been verified using an internal accuracy benchmark.

Reviewed By: mrshenli

Differential Revision: D29971046

Pulled By: andwgu

fbshipit-source-id: a7234c23c7ea253f144a698fd7e3c0fe039de5e8
2021-08-02 08:33:34 -07:00
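The interleaved approach can be sketched as a per-bucket callback that runs a partial optimizer step as soon as that bucket's reduction finishes (toy code; the names and the plain SGD update are assumptions, not the real `hook_with_zero_step_interleaved` internals):

```python
def make_interleaved_hook(params, lr):
    """Return a per-bucket hook: when a bucket's (already 'reduced')
    gradients arrive, immediately update just that bucket's params
    instead of waiting for the whole backward pass to finish."""
    def hook(bucket):
        # bucket: list of (param_name, averaged_grad) pairs whose
        # all-reduce is modeled here as already complete
        for name, grad in bucket:
            params[name] -= lr * grad   # partial shard-local SGD step
        return bucket
    return hook
```

Each bucket triggers its own slice of optimizer work, which is exactly why the interleaved variant overlaps better on some hardware while the wait-for-all variant avoids many small kernel launches.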
Yi Wang
32b37ba246 [DDP Communication Hook] Update the typing info of comm hook output as well as some docstring (#62457)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62457

Specify `Future[torch.Tensor]` as the DDP communication hook return type, which is now explicitly a single tensor. The previous API took a list containing a single tensor.

Note that now the typing info no longer accepts the internal type of `torch._C.Future`, which does not support torchscript and hence cannot support `Future[torch.Tensor]`.
ghstack-source-id: 134771419

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_invalid_comm_hook_return_type

Reviewed By: rohan-varma

Differential Revision: D30007390

fbshipit-source-id: 246667c9b575b4c6e617b0a5b373151f1bd81e7f
2021-07-30 20:51:34 -07:00
Yi Wang
72295da6c3 Reformat (#62456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62456

as title
ghstack-source-id: 134771417

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D30006493

fbshipit-source-id: 1d1dc9cfff69a9b4fa31470177c1f4fa206a94ef
2021-07-30 20:50:19 -07:00
Eli Uriegas
bd9f35313a Revert D29922299: [DDP] log bucket sizes
Test Plan: revert-hammer

Differential Revision:
D29922299 (5429f68f00)

Original commit changeset: 538b331c96e7

fbshipit-source-id: 3595fe04e8dea38bc9d05e8c70f2dcd2ad496ced
2021-07-30 20:27:36 -07:00
Rohan Varma
5429f68f00 [DDP] log bucket sizes (#62232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62232

Logs the bucket sizes in DDP logging so that we know which workflow ran with what bucket size config. Will be used to verify how changing bucket sizes in DDP affects perf.

Based on the test, we can see an inconsistency in where the "first" bucket actually is (last before buckets are rebuilt, first after).
ghstack-source-id: 134663867

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29922299

fbshipit-source-id: 538b331c96e77048164ad130b377433be100a761
2021-07-30 18:07:04 -07:00
Rohan Varma
1f2b96e7c4 [DDP] Make compute_bucket_assignment_by_size return per bucket sizes (#62231)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62231

`compute_bucket_assignment_by_size` is responsible for setting per-bucket size limits; return this information from the function so that we are aware of the size limit for each bucket.

This is currently not being consumed, but will be in the next diff when we log bucket size limits to DDP logging. This will help us run experiments under different bucket size configs and analyze the impact.
ghstack-source-id: 134480575

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D29919056

fbshipit-source-id: dd5a096fa23d22e5d9dc1602899270a110db4a19
2021-07-28 20:21:01 -07:00
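A greedy sketch of the bucketing-with-limits idea, now also returning the per-bucket limit alongside the assignments (a simplified assumed model of the C++ logic, not the actual implementation):

```python
def compute_bucket_assignment_by_size(sizes, limits):
    """Pack tensor indices into buckets by byte size and also return
    the size limit that governed each bucket. `limits` lets the first
    bucket(s) use a different cap than later ones; the last limit is
    reused once the list is exhausted."""
    buckets, bucket_limits = [], []
    cur, cur_bytes = [], 0
    limit_idx = 0
    for i, s in enumerate(sizes):
        limit = limits[min(limit_idx, len(limits) - 1)]
        if cur and cur_bytes + s > limit:
            buckets.append(cur)
            bucket_limits.append(limit)
            cur, cur_bytes = [], 0
            limit_idx += 1
            limit = limits[min(limit_idx, len(limits) - 1)]
        cur.append(i)
        cur_bytes += s
    if cur:
        buckets.append(cur)
        bucket_limits.append(limits[min(limit_idx, len(limits) - 1)])
    return buckets, bucket_limits
```

Returning `bucket_limits` is what lets a caller log, per bucket, which size cap was in effect.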
Rohan Varma
10c6811a6b [DDP] Run test_ddp_new_tensor_in_fwd with static graph (#61992)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61992

This test was previously not enabled for static graph. To ensure
this feature is supported with DDPSink, enable it for static graph, which
currently passes outputs through DDPSink.
ghstack-source-id: 134471406

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29830887

fbshipit-source-id: 2d3f750d9eb4289558ed21acccd172d83d9b82cc
2021-07-28 09:49:12 -07:00
Andrew Gu
3e3acf8a9a Minor documentation fixes (#61785)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61785

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29746648

Pulled By: andwgu

fbshipit-source-id: 435bbd8894f2ae5c814b9acd562673affea1daf6
2021-07-19 09:01:29 -07:00
Andrew Gu
813b887dad Fix indent (#61784)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61784

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29746647

Pulled By: andwgu

fbshipit-source-id: f42d3a0864a8291941d695a0cf575a5737cbb35c
2021-07-19 09:00:25 -07:00
Rohan Varma
f1114364ad [DDP] Enhance comm hook docs (#61677)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61677

1) Specify the return type more clearly, 2) misc fixes
ghstack-source-id: 133657895

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29701384

fbshipit-source-id: 7f77b99065bd2977153f397745e07b75bbdd7a94
2021-07-16 08:35:49 -07:00
Rohan Varma
7177509380 Revert [DDP] Support not all outputs used in loss calculation (#61497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61497

Reverts [DDP] Support not all outputs used in loss calculation
ghstack-source-id: 133589153

Test Plan: CI, ping authors to run their workflow on this diff

Reviewed By: zhaojuanmao

Differential Revision: D29642892

fbshipit-source-id: 81a15b9ab3329602f34d3758bb0799005a053d4f
2021-07-15 10:28:14 -07:00
Rohan Varma
25f9c35dd7 Revert [DDP] Support for multiple backwards (#61401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61401

Reverts https://github.com/pytorch/pytorch/pull/59359, which is causing a few internal issues in DDP training. We will evaluate the internal use cases and reland it after reconsidering the design.

Also moves `prepare_for_backward` back into forward pass instead of DDP Sink for `find_unused_parameters`. This ensures that hooks will always fire in the backwards pass, which is behavior that internal training workloads rely on. Calling `prepare_for_backward` in DDPSink autograd function is not the best solution since other autograd threads may have been executing which can cause races.

ghstack-source-id: 133589152

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D29608948

fbshipit-source-id: f060f41cd103573ddff8da50cdbb6c56768dab46
2021-07-15 10:28:13 -07:00
Andrew Gu
57feb35474 Refactor non-joined process computation (#61555)
Summary:
**Overview:**
This refactors the computation on non-joined processes relating to the join context manager. The concept was inspired by a comment from pritamdamania.

**Changes:**
This introduces a `_Joinable` abstract base class, which requires a `_join_hook()` method and `_join_device()` and `_join_process_group()` property methods. Any class that we want to be compatible with the generic join context manager should inherit from `_Joinable` and implement `_join_hook()`, `_join_device()`, and `_join_process_group()`. (The `device` and `process_group` information has been moved from `_JoinHook` to `_Joinable`.)

The generic join context manager now takes in a `List[_Joinable]` instead of `List[_JoinHook]`. The motivation for this is that previously, by passing the `_JoinHook`s into the context manager, the class providing a `_JoinHook` can modify the context manager's behavior, but the context manager cannot modify the class's behavior. This is solved by giving the context manager a reference to the class's instance.

This implementation reserves the field `_join_config` in every `_Joinable` to store a `_JoinConfig` instance, which holds all dynamic fields needed from the `_Joinable` for the join context manager: `enable`, `throw_on_early_termination`, and `is_first_joinable`. ("dynamic" here means that for a given `_Joinable` instance, the values for those fields may change across different join context usages.) In particular, these fields are needed to implement a method `notify_join_context()`, which encapsulates the computation performed on non-joined processes relating to the join context manager --- (1) the all-reduce to indicate that the process has not yet joined and (2) the all-reduce to check whether to throw an exception if `throw_on_uneven_inputs=True`. The idea is that every `_Joinable` class only needs to make a call to `notify_join_context()` before its per-iteration collective communications; it is a simple one-line addition.

Only the first `_Joinable` instance passed into the context manager actually performs the collective communications in `notify_join_context()`. In that case, the method returns an async work handle for the initial all-reduce indicating that the process not yet joined. Otherwise, the method returns `None`. This conditional logic is handled internally without additional input from the user.

**New API:**
Now, the example usage would look like:
```
ddp_model = DistributedDataParallel(...)
zero_optim = ZeroRedundancyOptimizer(ddp_model.parameters(), ...)
with _Join([ddp_model, zero_optim]):
    ...
```
Any arguments meant for a join hook (e.g. `divide_by_initial_world_size`) must be specified as keyword arguments. For example:
```
with _Join([ddp_model, zero_optim], divide_by_initial_world_size=False):
    ...
```
They will be forwarded to every `_join_hook()` function via `**kwargs`. This creates a clear separation between the variables needed by the context manager (`enable` and `throw_on_early_termination`) and those needed by the `_Joinable` class (e.g. `divide_by_initial_world_size`).

**Recap:**
After this change, the relevant information to use the generic join context manager looks like the following (omitting prefix `_` from names):
- Suppose we have a class `C` (e.g. `DistributedDataParallel`) that we want to be able to use the `Join` context.
- We make `C` inherit from `Joinable` and implement `join_hook() -> JoinHook`, `join_device()`, and `join_process_group()`.
- To implement `join_hook()`, we define a `CJoinHook` class inheriting from `JoinHook` and implement `main_hook()` and `post_hook()` as needed.
- We locate a place before `C`'s per-iteration collective communications and add a call to `Join.notify_join_context()`.
- We call `Joinable.__init__(self)` in `C`'s constructor.
- The `C.join_config` field will be used internally by the context manager. This does not affect `C`'s serializability.
- Run time arguments for `C`'s join hook can be passed in as keyword arguments to the context manager: `with Join([C()], arg1=..., arg2=...):`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61555

Test Plan:
I ran the existing DDP join tests:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```
I ran the ZeRO join tests:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py TestZeroRedundancyOptimizerDistributed.test_zero_join_gpu TestZeroRedundancyOptimizerDistributed.test_zero_join_cpu
```

Reviewed By: zou3519

Differential Revision: D29690359

Pulled By: andwgu

fbshipit-source-id: 2950f78de755eb5fb13b95b803dd7c705879a9c7
2021-07-14 08:20:40 -07:00
Andrew Gu
179249084b Refactor DDP join() API, adding hooks (#60757)
Summary:
Targets https://github.com/pytorch/pytorch/issues/54318.

**Overview:**
DDP offers a `join()` context manager to accommodate training on uneven inputs. This creates a new generic `_Join()` API permitting custom hooks, refactors DDP `join()` to call this generic `_Join()`, and implements a hook for ZeRO. (For now, the generic `_Join()` is implemented as private, but this may change after design discussions are cleared.)

There are two classes introduced: `_JoinHook`, the class defining the customizable join hook, and `_Join`, the generic join context manager.

The `_JoinHook` provides two entry points: `main_hook()`, which is called repeatedly while there exists a non-joined process, and `post_hook()`, which is called once all processes have joined with the additional `bool` argument `is_last_joiner`. The class also requires `process_group` and `device` information by defining corresponding abstract property methods. Thus, to implement a join hook, (1) inherit from `_JoinHook`, (2) override `main_hook()` and `post_hook()` as appropriate, and (3) override `process_group()` and `device()` to provide process group and device information to be used by the join context manager implementation for collective communications.

The `_Join` constructor requires `join_hooks: List[_JoinHook]` and optionally `enable: bool = True` and `throw_on_early_termination: bool = False`. A training loop only needs to be wrapped with `with _Join(join_hooks):` (using the appropriate `join_hooks`) to be able to train on uneven inputs without hanging/erroring. The context manager requires a `dist.all_reduce(torch.ones(1))` to be called on every non-joined process each time before it performs its collective communications in order to indicate that the process has not yet joined. It also requires that all `process_group` attributes in the `_JoinHook` objects are the same.
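The per-iteration "I have not yet joined" signal can be illustrated with a torch-free, single-process simulation. Ranks and the all-reduce are modeled with plain Python here; the real manager uses `dist.all_reduce(torch.ones(1))` on every non-joined process:
```python
def simulate_join(inputs_per_rank):
    """Each rank all-reduces 1 while it still has data, 0 once joined.
    Joined ranks keep shadowing the collective until everyone joins.
    Returns the number of shadow (main-hook) calls per rank."""
    remaining = list(inputs_per_rank)
    shadow_calls = [0] * len(remaining)
    while True:
        # The "all-reduce": sum of per-rank notify flags.
        flags = [1 if r > 0 else 0 for r in remaining]
        if sum(flags) == 0:          # every rank has joined -> post-hooks run
            return shadow_calls
        for rank, flag in enumerate(flags):
            if flag:
                remaining[rank] -= 1      # consumed one input
            else:
                shadow_calls[rank] += 1   # joined: run main_hook()

print(simulate_join([3, 1, 2]))  # -> [0, 2, 1]
```
The rank with the most inputs never shadows; the others shadow until it finishes.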

**Notes:**
- The argument `is_last_joiner` to `post_hook()` may be useful for finding an authoritative rank when synchronizing.
- `enable` is a flag that can be set to `False` if the user knows the current training loop will not have uneven inputs. This may be used to disable join-related computation in the classes providing join hooks.
- `throw_on_early_termination` is a flag that can be set to `True` to notify processes to terminate upon detecting uneven inputs (i.e. upon the first process joining when there exists a non-joined process). Notably, the notification requires an all-reduce, so to prevent hanging/erroring, non-joined processes must participate in the all-reduce. The first-joining process raises a `RuntimeError`, and the other processes are expected (but not required) to do the same. This may be used to implement training on uneven inputs in cases that do not conform to the generic join context manager (e.g. `SyncBatchNorm`).
- Classes providing a join hook should do so via a `_join_hook()` method that returns a `_JoinHook` instance with the methods appropriately overridden.
- If there are multiple join hooks, the device specified by the first is used by the join context manager implementation to perform its collective communications.
- If there are multiple join hooks, both the main and post-hooks are iterated in the order in which the `_JoinHook` objects are passed into the context manager constructor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60757

Test Plan:
The current implementation preserves backward compatibility by not changing the existing DDP `join()` API at all. To check this, I ran through the uneven input tests (`test_ddp_grad_div_uneven_inputs`, `test_ddp_uneven_inputs_stop_iteration_sync_bn`, `test_ddp_uneven_inputs`, `test_ddp_uneven_input_join_disable`, `test_ddp_uneven_input_exception`) on the AI AWS cluster:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py --
```

Because the existing DDP join logic does not provide correct gradients to the joined processes if `gradient_as_bucket_view=False` and a joined process requires those gradients to correctly update its shard of the parameters in `ZeroRedundancyOptimizer.step()`, DDP and ZeRO are not fully compatible at the moment. To work around this and to test ZeRO's join hook separately, I added a test `_test_zero_join()` (with `test_zero_join_gpu()` and `test_zero_join_cpu()` flavors), which compares DDP with a local optimizer on uneven inputs against ZeRO on uneven inputs with the gradients set manually.

Reviewed By: iramazanli, mrshenli

Differential Revision: D29624636

Pulled By: andwgu

fbshipit-source-id: ec70a290e02518b0d8b683f9fed2126705b896c7
2021-07-09 08:29:20 -07:00
Yi Wang
4beb5f9ad6 [DDP Comm Hook] Fix some comments (#61376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61376

After SPMD is retired, the API of `get_tensors` becomes `get_tensor`. Fix some comments that refer to the obsolete API.

The `allreduce` hook example does not do the division by world size inside the hook, which is actually incorrect.
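A torch-free sketch of the fix being described — average (sum, then divide by the world size) inside the hook rather than returning the raw sum. The real hook operates on a `GradBucket`'s flat tensor and returns a future; per-rank gradients are modeled as plain lists here:
```python
def allreduce_avg_hook(world_size, per_rank_grads):
    """Model of an allreduce comm hook that divides inside the hook.
    per_rank_grads: one flat gradient list per rank (equal lengths)."""
    assert all(len(g) == len(per_rank_grads[0]) for g in per_rank_grads)
    summed = [sum(vals) for vals in zip(*per_rank_grads)]  # the "allreduce"
    return [v / world_size for v in summed]  # the division the comment fixes

print(allreduce_avg_hook(2, [[2.0, 4.0], [0.0, 4.0]]))  # -> [1.0, 4.0]
```
Without the final division, every rank would end up with the sum of gradients instead of their average.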
ghstack-source-id: 133174272

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D29596857

fbshipit-source-id: 2046b185225cd6d1d104907b5f9b4009b6e87c99
2021-07-08 12:30:24 -07:00
Rohan Varma
43fb39c3eb [DDP] Make uneven inputs work with comm. hook (#61020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61020

Makes uneven input support with `join` context manager work with
custom communication hooks. This will ensure that the two features can work
well together. Added relevant unittests to test allreduce and powerSGD hooks.

Instead of calling `allreduce`, the join manager now calls into `_run_reduction_hook` which will automatically run whatever hook is installed.
ghstack-source-id: 132950108

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29480028

fbshipit-source-id: c91dc467a62c5f1e0ec702a2944ae3deb10f93f4
2021-07-02 18:48:21 -07:00
Rohan Varma
94b730681f [DDP] Refactor uneven inputs to take GradBucket (#61019)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61019

Changes uneven input logic of running allreduce to using `GradBucket` structure. This is to enable support for comm. hook with join in the next diff.
ghstack-source-id: 132950107

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D29480027

fbshipit-source-id: 7c42c53653052f71b86a75e14a5fc7ae656433f7
2021-07-02 18:47:23 -07:00
Rohan Varma
b21df03f3b [DDP] Remove SPMD from get_bucket_tensors (#61017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61017

Removes SPMD nested vector logic from this codepath. This is mostly in preparation for the next diffs in this stack which enable support for join with comm. hook.
ghstack-source-id: 132924223

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29477360

fbshipit-source-id: f8132a94b1abfe28586aa78ac47e13a7ce6bb137
2021-07-01 20:40:53 -07:00
Rohan Varma
60509f8921 Update DDP documentation to mention outputs not used in loss is supported (#60275)
Summary:
We recently landed a change to ensure that when running under ``find_unused_parameters=True``, not all module outputs have to be used in loss computation and DDP will work as expected. Mention this update in the documentation and add some additional clarification.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60275

Reviewed By: SciPioneer

Differential Revision: D29502609

Pulled By: rohan-varma

fbshipit-source-id: ddb3129cff9492018e61813413b30711af212309
2021-07-01 11:56:53 -07:00
Rohan Varma
12b63f4046 [DDP] Fix case where new tensors with no grad_fn are returned in DDP forward. (#60882)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60882

Fixes https://github.com/pytorch/pytorch/issues/60733, which
identified an issue with a previous PR that resulted in DDP no longer
supporting cases where newly created tensors are returned that don't have a
grad_fn. The result of this is the grad_fn is set to that of the `DDPSink`
custom backward which results in errors during the backwards pass.

This PR fixes the issue by ensuring we don't touch the `grad_fn` of the tensors
if it is `None`. Added relevant tests as well.
ghstack-source-id: 132632515

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29423822

fbshipit-source-id: a9e01046c7be50aa43ffb955f6e0f48fef4bc881
2021-06-29 12:50:48 -07:00
Rohan Varma
d5df274ea5 [DDP] Support for multiple backwards (#59359)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59359

Move `prepare_for_backward` into `_DDPSink` backward instead of calling it in DDP forward pass so that we can run multiple backwards in DDP with `retain_graph=True`.

ghstack-source-id: 131774159

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28855226

fbshipit-source-id: 6b7b25d75b7696f5b5629078233433f97663d61c
2021-06-18 09:23:57 -07:00
Rohan Varma
acd914f039 Fix Pipe + DDP for unused parameters, static graph (#60118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60118

Pipe + DDP has a few issues:

1) With static graph, gradients are not synchronized on the first backward pass (i.e. the delayed allreduce is not run). Broken since https://github.com/pytorch/pytorch/pull/55248
2) When find_unused_parameters=True, gradient synchronization also does not occur. Broken since https://github.com/pytorch/pytorch/pull/57081

The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording.

To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in https://github.com/pytorch/pytorch/pull/49908.

to test:
All tests in pipe_with_ddp_test pass.
The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks.
ghstack-source-id: 131688187

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D29167283

fbshipit-source-id: fe62310db2dc6de8519eb361b1df8ae4dfce3ab8
2021-06-17 13:41:51 -07:00
Mike Ruberry
5686fe5817 Revert D29154971: Training resnext with msuru_suru_union and ig_msuru_suru_union datasets
Test Plan: revert-hammer

Differential Revision:
D29154971 (9f68f93aca)

Original commit changeset: d534d830020f

fbshipit-source-id: a3d16acc8e6b66a6010b501c28dbe295f573bc86
2021-06-16 15:33:14 -07:00
Zhuangzhuang Zhang
9f68f93aca Training resnext with msuru_suru_union and ig_msuru_suru_union datasets
Summary: We updated the training scripts and re-trained the Resnext model with msuru_suru_union and ig_msuru_suru_union datasets

Test Plan:
Main command line to run:
*./deeplearning/projects/classy_vision/fb/projects/msuru_suru/scripts/train_cluster.sh*

Config we used is *msuru_suru_config.json*, which is "Normal ResNeXt101 with finetunable head".

Experiments:
- msuru_suru_union f279939874
    - Train/test split
        - msuru_suru_union_dataset_train_w_shard: 143,632,674 rows
        - msuru_suru_union_dataset_test_w_shard: 1,831,236  rows
    - Results
       {F625232741}
       {F625232819}
- ig_msuru_suru_union f279964200
    - Train/test split
        - ig_msuru_suru_union_dataset_train_w_shard: 241,884,760 rows
        - ig_msuru_suru_union_dataset_test_w_shard: 3,477,181 rows
    - Results
{F625234126}
{F625234457}

Differential Revision: D29154971

fbshipit-source-id: d534d830020f4f8e596bb6b941966eb84a1e8adb
2021-06-16 11:22:50 -07:00
Rohan Varma
eb55b086b7 [DDP] Log some python-side errors (#59284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59284

Logs a few python-side errors to DDP logging.

TODO: Most python errors actually have to do with user input correctness, so they throw before reducer is constructed and thus there is no logger. For this case, should we allow `logger` to be created optionally without a reducer, just for the purpose of logging errors, so that we can gain insight into these errors in scuba?
ghstack-source-id: 130412973

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28820290

fbshipit-source-id: 610e5dba885b173c52351f7ab25c923edce639e0
2021-06-02 19:49:26 -07:00
Rohan Varma
79aeca0b00 [DDP] Log when errors happen (#59281)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59281

Adds ability to log when reducer/ddp encounters an error. We add fields "has_error" and "error" to indicate that an error has
occurred in this iteration, and the other fields (performance stats) are not
guaranteed to be updated.

Errors encountered in python-side DDP will be added in the next diff.
ghstack-source-id: 130412974

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28652717

fbshipit-source-id: 9772abc2647a92dac6a325da6976ef5eb877c589
2021-06-02 19:48:26 -07:00
Rohan Varma
2a78e896a0 [DDP] use work.result() in _check_global_requires_backward_grad_sync (#59065)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59065

Cleaner to use work.result() instead of sending back the tensor from
this function.
ghstack-source-id: 130338813

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28551203

fbshipit-source-id: d871fed78be91f0647687ea9d6fc86e576dc53a6
2021-06-02 17:19:07 -07:00
Andrew Gu
5a42a97c49 Add NCCL_ASYNC_ERROR_HANDLING as an environment variable (#59109)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57878.

This adds `NCCL_ASYNC_ERROR_HANDLING` as a DDP relevant environment variable and includes a check for that variable in the test `test_dump_DDP_relevant_env_vars()`. Notably, the modified test now checks for the new variable but does not check for any of the other previously-existing relevant environment variables that were not already tested for (e.g. `NCCL_BLOCKING_WAIT`).

The change was tested via the following on an AI AWS cluster:
`WORLD_SIZE=2 BACKEND=nccl gpurun pytest test/distributed/test_distributed_spawn.py -k test_dump_DDP_relevant_env_vars -vs`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59109

Reviewed By: H-Huang, SciPioneer

Differential Revision: D28761148

Pulled By: andwgu

fbshipit-source-id: 7be4820e61a670b001408d0dd273f65029b1d2fe
2021-06-01 14:02:41 -07:00
Sureyya Emre Kurt
3d2b55553b Retiring _module_copies field in DDP reducer. (#59094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59094

Commented out _module_copies fields and changed for loops accordingly

Test Plan: Test cases mentioned in T91292908 passed successfully

Reviewed By: SciPioneer

Differential Revision: D28736135

fbshipit-source-id: 1857102f0c57a734026f3025e9653d8fad57d0b6
2021-05-27 15:09:14 -07:00
Rohan Varma
1d67c6d639 [DDP] Remove train call to module copies (#58595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58595

No longer needed since this list is always of size 1.
ghstack-source-id: 129498229

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28548426

fbshipit-source-id: 7d6dba92fff685ec7f52ba7a3d350e36405e2578
2021-05-20 22:34:20 -07:00
Rohan Varma
faa7d3793d [DDP] Support not all outputs used in loss calculation (#57081)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57081

Changes in this diff:

Enable passthrough autograd function when find_unused_parameters=True.
With above, move prepare_for_backward which does unused parameter checking logic to beginning of backwards pass, only when find_unused_parameters=True.
Enhance process of unused parameter checking to account for outputs not being used in loss.
The way (3) is implemented is by triggering the autograd hook corresponding to parameters that did not participate in loss computation. Since they did not participate, the autograd hook is triggered with a gradient of None, and the reducer handles this appropriately to ensure that the gradient is not touched.
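The mechanism in (3) can be sketched torch-free: hooks for parameters that did not contribute to the loss fire with `None`, and the reducer leaves those gradients untouched. `ToyReducer` below is a hypothetical stand-in for the real C++ reducer:
```python
class ToyReducer:
    """Models DDP's handling of autograd hooks fired with a None gradient."""
    def __init__(self, num_params):
        self.grads = [None] * num_params   # accumulated grads
        self.fired = set()

    def autograd_hook(self, index, grad):
        self.fired.add(index)
        if grad is not None:               # real gradient: record it
            self.grads[index] = grad
        # grad is None: param did not participate -> do not touch its grad

    def finish_backward(self, num_params):
        # Fire hooks with None for params that never received a gradient.
        for i in range(num_params):
            if i not in self.fired:
                self.autograd_hook(i, None)

r = ToyReducer(3)
r.autograd_hook(0, 0.5)    # only param 0 was used in the loss
r.finish_backward(3)       # params 1 and 2 fire with None
print(r.grads)             # -> [0.5, None, None]
```
A `None` grad stays `None` (not zero) after the backward pass, matching the behavior verified in the test plan below.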

Tested by ensuring that when a model output is not used in loss, the corresponding grad is not modified. Also verified that the grads are the same in local vs DDP training case. Also verified that gradients are not touched in this case, i.e. if grad is originally None, it stays as None, not zero, after.

Note that in this diff we are not enabling the pass through autograd function for regular case find_unused_parameters=False because that has a much bigger blast radius and needs additional careful analysis especially with regard to the performance.
ghstack-source-id: 129425139

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28048628

fbshipit-source-id: 71d7b6af8626804710017a4edd753787aa9bba61
2021-05-20 08:34:33 -07:00
Alexander Golynski
bc30c3165c Update docs for get_future support (#58107)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58107

Test Plan: Imported from OSS

Reviewed By: SciPioneer

Differential Revision: D28387374

Pulled By: agolynski

fbshipit-source-id: 70052afbb0b07ba341ea55f7ec30f7d9759b7bd4
2021-05-12 18:29:28 -07:00
Yanli Zhao
166a8df65f [reland] make ddp logging api to be private (#58089)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58089

make ddp logging api to be private
ghstack-source-id: 128796419

Test Plan: unit test

Reviewed By: rohan-varma

Differential Revision: D28365412

fbshipit-source-id: 374c01d443ffb47a3706f59e296d6e47eb5f4c85
2021-05-12 16:45:13 -07:00
Rohan Varma
a0ac80ec76 [DDP] Don't find tensors if static graph (#58105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58105

When find_unused_parameters=True but static_graph is also set, static graph handles unused parameter accounting, so this code path is not needed
ghstack-source-id: 128736289

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28371954

fbshipit-source-id: 0b42a9c0fd2bba26a0de288436e9c7139e292578
2021-05-12 11:40:18 -07:00
Rohan Varma
c52700dbcd [wip] enhance DDPSink to work for general outputs (#57073)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57073

Enhances use of DDPSink to work for all output types DDP supports as per https://github.com/pytorch/pytorch/issues/55876.

TODO: Add additional testing for tuple, list, dict return types
ghstack-source-id: 128726768

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27756985

fbshipit-source-id: 2e0408649fb2d6a46d6c33155a24c4c1723dd799
2021-05-12 09:45:10 -07:00
Kimish Patel
ad4cd6ef89 Revert D28338485: make ddp logging api to be private
Test Plan: revert-hammer

Differential Revision:
D28338485 (ac44569b0d)

Original commit changeset: bd2ae7c78904

fbshipit-source-id: d383f42a2051457147dec42ea273ed4fa82ffa1f
2021-05-11 12:12:51 -07:00
Yanli Zhao
ac44569b0d make ddp logging api to be private (#57999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57999

make ddp logging api to be private
ghstack-source-id: 128607185

Test Plan: unit test

Reviewed By: rohan-varma

Differential Revision: D28338485

fbshipit-source-id: bd2ae7c78904e93eed88be91876f5a832b5b7886
2021-05-11 10:37:03 -07:00
Yanli Zhao
ea421fb249 enable static graph training in DDP (#55248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55248

This PR enables static graph training when users call _set_static_graph(). This can support more use cases in DDP without performance regression, and can potentially improve performance when there are unused parameters in the graph.
1. The first iteration records graph states such as how many times a grad is calculated and whether the grad is used; it then queues a delay_all_reduce callback to all-reduce the grads.
2. Since an autograd callback is associated with the current graph task, the delay_all_reduce callback should be associated with the outermost backward graph task. A DDP sink layer is added in the DDP forward loop so that the delay_all_reduce callback can be queued in the sink layer.
3. After the first iteration, DDP uses the saved graph states to determine whether a grad is used and whether a grad is ready for communication.
4. Bucket rebuilding happens in the second iteration, after graph states are recorded in the first.
5. If the graph states change, DDP throws an error.
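Items 1, 3, and 5 above can be sketched with a torch-free toy (the real bookkeeping lives in the C++ reducer; `StaticGraphState` is a hypothetical stand-in):
```python
class StaticGraphState:
    """Record which grads appear in iteration 1, then enforce that set."""
    def __init__(self):
        self.recorded = None

    def end_iteration(self, used_grads):
        used = frozenset(used_grads)
        if self.recorded is None:
            self.recorded = used        # first iteration: record graph state
        elif used != self.recorded:     # later iterations: graph must not change
            raise RuntimeError("graph states changed under static_graph")

s = StaticGraphState()
s.end_iteration({"w", "b"})   # iteration 1 records the used-grad set
s.end_iteration({"w", "b"})   # subsequent identical iterations are fine
```
Once recorded, any change in which parameters receive gradients raises, which is why static graph requires the graph to be truly static across iterations.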
ghstack-source-id: 128599464

Test Plan: unit tests. adding more tests

Reviewed By: rohan-varma

Differential Revision: D27539964

fbshipit-source-id: 74de1ad2719465be67bab8688d6e293cd6e3a246
2021-05-11 10:23:25 -07:00
Rohan Varma
fe3c63d9d3 [DDP] fix param to name mapping (#57771)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57771

This mapping didn't work properly when certain parameters didn't
require grad. Fixed that and added a test.
ghstack-source-id: 128527537

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28265636

fbshipit-source-id: 7b342ce012b2b7e33058b4c619ffb98992ed05b7
2021-05-10 11:47:46 -07:00
Rohan Varma
d115e81a32 Fix document around DDP uneven inputs (#57448)
Summary:
Typo fix and additional clarifications on the API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57448

Reviewed By: SciPioneer

Differential Revision: D28153264

Pulled By: rohan-varma

fbshipit-source-id: 9bd35d918299ad7e080785d755f97b966f826615
2021-05-10 09:33:59 -07:00
Rohan Varma
57f72b8433 [DDP] Uneven inputs: option to throw early (#56755)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56755

Rehash of https://github.com/pytorch/pytorch/pull/47488

Adds a flag to ddp join() context manager that enables throwing a
StopIteration across all ranks when this flag is specified.

To do this, we implement the design in #47250. When running with this flag, we schedule an additional allreduce in the case that a joined rank needs to throw a StopIteration. In non-joined ranks' forward pass, we match this allreduce, and if at least one rank tells us to throw, we raise a StopIteration.

Tested by modifying existing tests, as well as adding additional tests validating that this works with SyncBatchNorm models and a model with custom collectives in the forward pass.

Currently running perf benchmarks, will post when those are available, but we expect a small (~2%) perf reduction when enabling this feature due to the blocking allreduce. Hence we will only recommend it for models with collective comm.
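The extra per-iteration all-reduce can be modeled the same torch-free way (ranks are simulated in one process; in the real API every rank raises `StopIteration` at the step this toy returns):
```python
def train_with_early_throw(inputs_per_rank):
    """All ranks stop at the first iteration where any rank is exhausted.
    Returns the number of iterations completed before the collective
    StopIteration would be raised."""
    remaining = list(inputs_per_rank)
    steps = 0
    while True:
        # Extra blocking all-reduce: flag is 1 if some rank must join now.
        if any(r == 0 for r in remaining):
            return steps              # every rank raises StopIteration here
        remaining = [r - 1 for r in remaining]
        steps += 1

print(train_with_early_throw([3, 5, 4]))  # -> 3
```
Every rank processes exactly `min(inputs_per_rank)` batches, which is the lockstep-termination behavior the flag provides.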
ghstack-source-id: 127883115

Test Plan: Ci

Reviewed By: SciPioneer

Differential Revision: D27958369

fbshipit-source-id: c26f7d315d95f17bbdc28b4a0561916fcbafb7ca
2021-05-02 15:41:50 -07:00
Yanli Zhao
3f81912885 static graph api skeleton (#54995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54995

Provide a private DDP API to explicitly declare that the training graph is static, and also set this flag in the logger
ghstack-source-id: 127755713

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D27444965

fbshipit-source-id: 06ef1c372296815944b2adb33fbdf4e1217c1359
2021-04-30 11:07:26 -07:00
Yanli Zhao
2c8ea63cbb add a test for grad view with torch amp (#56730)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56730

Add a test to verify that DDP with torch amp produces the same results when using gradient_as_bucket_view=True and False.

The torch.amp scale factor has no dependency on old gradients, so it is not affected by gradient_as_bucket_view; see how torch.amp is implemented here https://github.com/pytorch/pytorch/pull/33366/files.

This diff verified DDP works as expected with amp.GradScaler and amp.autocast when using gradient_as_bucket_view=True and False.
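Why the bucket-view mode cannot change amp's results can be seen from the aliasing alone: a grad that is a view writes straight through to the bucket, so there is no stale copy for the scaler to depend on. A stdlib sketch of that aliasing, with `memoryview` slices standing in for tensor views of a flat communication bucket:
```python
import array

bucket = array.array('d', [0.0, 0.0, 0.0, 0.0])  # flat comm bucket
mv = memoryview(bucket)
grad_w = mv[0:2]   # "param grads" are views into the bucket storage
grad_b = mv[2:4]

grad_w[0] = 1.5    # autograd writing a grad...
grad_b[1] = -2.0
print(list(bucket))  # -> [1.5, 0.0, 0.0, -2.0]  ...lands in the bucket
```
With views there is a single copy of each gradient, which is also where the memory saving of the bucket-view mode comes from.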
ghstack-source-id: 127526358

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D27950132

fbshipit-source-id: 8ed26935fdcb4514fccf01bb510e31bf6aedac69
2021-04-29 10:06:07 -07:00
Yanli Zhao
1e77ba36db change ddpLoggingData struct to map or dict (#56641)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56641

Currently ddpLoggingData is a flat struct, which requires internal DDP developers and external users to know the struct field names. This makes it inflexible to add or delete fields in the future, and also makes ddpLoggingData hard to access.

With maps/dicts, developers and users can access the fields without hard-coding field names, and fields are easier to add or remove.

Since a C++ map cannot hold values of different types, ddpLoggingData currently contains two maps (one per value type).
ghstack-source-id: 127482694

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D27923723

fbshipit-source-id: c90199c14925fc50ef219000e2f809dc7601cce1
2021-04-28 06:43:25 -07:00
Yi Wang
07653b7fe0 [SPMD] Remove ddp_gpu_size field from SyncBatchNorm (#55946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55946

As `ddp_gpu_size` field of `SyncBatchNorm` will always be 1 for GPU modules, remove this field and the relevant code.
ghstack-source-id: 126883498

Test Plan: waitforbuildbot

Reviewed By: zhaojuanmao

Differential Revision: D27746021

fbshipit-source-id: b4518c07e6f0c6943fbd7a7548500a7d4337126c
2021-04-19 21:41:29 -07:00
Mike Guo
5b4c3a9da1 record Torch DP and DDP modules forward (#55578)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55578

Reviewed By: gdankel

Differential Revision: D27862392

Pulled By: ilia-cher

fbshipit-source-id: 18545d23e35a97c8f760707fecb696a24d47dc0a
2021-04-19 17:52:59 -07:00
Michael Carilli
a24b17248f Short circuits DistributedDataParallel._recursive_to's copy and stream syncs if input is already on the right device (#55624)
Summary:
^

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55624

Reviewed By: pbelevich, agolynski

Differential Revision: D27836170

Pulled By: rohan-varma

fbshipit-source-id: 954bf336d70f9e80c045a6a96c1d8843c7f1cf2c
2021-04-18 14:08:08 -07:00
Rohan Varma
51e7a371f5 [DDP] Param to name mapping in Reducer (#55075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55075

Constructs and passes in a mapping with parameter names to Reducer to log information about unused parameters in error messages about unused parameters/not all parameters getting gradient.

Use case:
1) User runs DDP forward + bwd, and it has some unused parameters that will result in ddp error in next iteration
2) Next forward pass calls `Reducer::ensure_prior_reduction_finished()` where we check all params got gradient from the previous bwd pass. DDP would throw here in this case.
3) Reducer maintains mapping and tracks used parameters, and computes which parameters did not get gradient and logs this as part of the error.

Implementation details:
0) The following is only enabled for debug modes of INFO or DETAIL.
1) To save memory, we don't map param -> param name so that we don't have to copy the entire tensor, instead we map param_index -> param_name and use the existing concept of variable_index in Reducer to look up parameter names.
2) DDP constructs param index -> param name mapping. The name is the fully qualified name: f"{module_name}:{param_name}" and passes it into Reducer
3) Reducer maintains per-iteration std::set<int> of variable indices that have had `mark_variable_ready` called.
4) When some params go unused, we take a set difference to detect the unused params.
5) Unittests to test the logged unused params, as well as for nested modules, are added
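Steps 2-4 amount to the following torch-free bookkeeping. The real mapping is built from the module hierarchy and passed into the C++ Reducer; the `module:param` naming follows the description above, and the helper names are hypothetical:
```python
def build_param_names(modules):
    """modules: {module_name: [param_name, ...]} -> {index: qualified name}."""
    mapping = {}
    idx = 0
    for module_name, params in modules.items():
        for param_name in params:
            mapping[idx] = f"{module_name}:{param_name}"
            idx += 1
    return mapping

def unused_param_names(mapping, ready_indices):
    """Set difference of all indices vs. those marked ready (step 4)."""
    return sorted(mapping[i] for i in set(mapping) - set(ready_indices))

names = build_param_names({"net.fc1": ["weight", "bias"],
                           "net.fc2": ["weight"]})
print(unused_param_names(names, {0, 2}))  # -> ['net.fc1:bias']
```
Mapping indices rather than tensors is what keeps the memory cost negligible, as noted in point 1.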
ghstack-source-id: 126581051

Test Plan: CI, UT

Reviewed By: zhaojuanmao

Differential Revision: D27356394

fbshipit-source-id: 89f436af4e74145b0a8eda92b3c4e2af8e747332
2021-04-15 09:19:50 -07:00
Yi Wang
d398a705c6 Clang-format batchnorm.py and distributed.py (#55971)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55971

Per title
ghstack-source-id: 126454339

Test Plan: N/A

Reviewed By: zhaojuanmao

Differential Revision: D27752315

fbshipit-source-id: 64ca5dea7b2689037594a6bd9a75641a9bb817c1
2021-04-13 18:40:23 -07:00
Yi Wang
4b09756d26 [SPMD] Move a comment (#55877)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55877

Address a comment in: 10bc1dae40 (r610930244)
ghstack-source-id: 126369525

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D27729567

fbshipit-source-id: 5509ebfba2b741cd3532c69044227e5af0fb54fc
2021-04-13 05:53:31 -07:00
Yi Wang
3e9cbe5ef7 [SPMD] Remove the code branches only used in SPMD mode from distributed.py (#55353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55353

Remove all the code branches that will only be executed when `device_ids > 1`.

Some helper functions are also removed:
1.  `_verify_replicas_within_process` and `verify_replicas_within_process`
2. `_replicate_modules_within_process`
3. `parallel_apply`

The next step is deprecating `_module_copies` field.
ghstack-source-id: 126201121

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27552201

fbshipit-source-id: 128d0216a202f5b1ba4279517d68c3badba92a6c
2021-04-09 17:27:56 -07:00
Yi Wang
b986a76d91 Clang-format distributed.py (#55254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55254

ghstack-source-id: 125680320

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D27542846

fbshipit-source-id: 700c3e59a9df98233fdb27054b472f5cb33eb604
2021-04-05 16:48:22 -07:00
Yi Wang
e589247a19 [SPMD] Change assertions to raising value errors in distributed.py (#54825)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54825

These assertions are tested in test_c10d.py

Context: https://github.com/pytorch/pytorch/pull/54454#discussion_r602657818
ghstack-source-id: 125602462

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_multi_device_module_config

Reviewed By: rohan-varma

Differential Revision: D27381649

fbshipit-source-id: 9b994e9c2acf796770c2f2af2cebdd5561834d14
2021-04-02 15:13:45 -07:00
Yi Wang
6a40339920 [SPMD] Error out SPMD mode (#54454)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54454

According to the pitch in https://github.com/pytorch/pytorch/issues/47012

1. Let DDP error out if `device_ids` contains multiple devices.
2. If device_ids is not specified, DDP will use the provided model (the module argument in the DDP constructor) as-is, regardless of whether the model is on one GPU, multiple GPUs, or CPU.
3. Remove the assertion that prevents SPMD in DDP `join()` method, because now SPMD is already forbidden by the constructor. Also remove the relevant unit test `test_ddp_uneven_inputs_replicated_error`.
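The new constructor-time check can be sketched as a standalone validator (hypothetical function; the real check lives inside the DDP constructor):
```python
def validate_device_ids(device_ids):
    """Reject SPMD-style configs: at most one device per process."""
    if device_ids is not None and len(device_ids) > 1:
        raise ValueError(
            "device_ids can only be None or contain a single element, "
            f"got {device_ids}"
        )

validate_device_ids([0])    # OK: single-device GPU module
validate_device_ids(None)   # OK: module used as-is (CPU or multi-GPU)
```
Per point 1, a multi-element list such as `[0, 1]` now errors out instead of silently replicating the module within the process.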

#Closes: https://github.com/pytorch/pytorch/issues/47012

ghstack-source-id: 125644392

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_spawn -- test_cuda
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_spawn -- test_rnn

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_nccl_backend_multi_device_ids_not_allowed
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_nccl_backend_single_device_module_device_ids_None
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_nccl_backend_multi_device_module_device_ids_None

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_multi_device_module_config

waitforbuildbot

Reviewed By: pritamdamania87

Differential Revision: D27226092

fbshipit-source-id: 3ee1e4bc46e5e362fc82cf7a24b2fafb34fcf1b9
2021-04-02 15:11:59 -07:00
Rohan Varma
3575e71be8 [DDP Logging] Log use of uneven inputs API (#54919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54919

Log the use of uneven inputs API for better tracking and use case
detection.
ghstack-source-id: 125446499

Test Plan: CI, added ut

Reviewed By: zhaojuanmao, SciPioneer

Differential Revision: D27410764

fbshipit-source-id: abc8055a2e15a3ee087d9959f8881b05a0ea933e
2021-04-01 16:22:32 -07:00
Rohan Varma
8c13dde458 [DDP] Remove redundant pass statement (#54219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54219

There is no need for this ``pass``.
ghstack-source-id: 125124311

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27105234

fbshipit-source-id: 95496fa785fdc66a6c3c8ceaa14af565588325df
2021-03-29 14:15:39 -07:00
Yi Wang
6e7a3c1fdd Clang-format distributed.py (#54402)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54402

ghstack-source-id: 124497872

Test Plan: N/A

Reviewed By: zhaojuanmao

Differential Revision: D27225942

fbshipit-source-id: 277f466554fbc034fb76de161bf4b3b7c431daf7
2021-03-22 11:39:58 -07:00
Shen Li
ef9ee46756 Avoid modifying rebuild buckets state in no_grad context (#54159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54159

See https://github.com/pytorch/pytorch/issues/54059 for discussion.

In short, users might want to run evaluation on a single rank
in `torch.no_grad()` mode. When this happens, we need to make
sure that we skip all rebuild-bucket logic, as the forward only
runs on one rank and not all peers can participate in the bucket
configuration sync communication.
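
A minimal sketch of the guard this commit describes (the function name and flags are illustrative, not the actual Reducer API): bucket rebuilding must be skipped when the forward runs with grads disabled, since rebuilding requires a bucket-configuration sync that not all peers would reach.

```python
def should_rebuild_buckets(grad_enabled: bool,
                           buckets_already_rebuilt: bool) -> bool:
    """Illustrative decision logic for skipping bucket rebuilds."""
    if not grad_enabled:
        # Evaluation-only forward (e.g. inside torch.no_grad()) may run on
        # a single rank; peers would never reach the sync, so skip.
        return False
    # Buckets are only rebuilt once, on the first grad-enabled iteration.
    return not buckets_already_rebuilt

first_training_iter = should_rebuild_buckets(True, False)
```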

Test Plan: Imported from OSS

Reviewed By: zhaojuanmao

Differential Revision: D27119666

Pulled By: mrshenli

fbshipit-source-id: 4b2f8cce937cdd893e89d8d10c9267d255ba52ea
2021-03-17 19:50:29 -07:00
Rohan Varma
e09e97ebf9 [DDP] add _distributed_rank helper function (#53795)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53795

There are 4 calls in ddp implementation to dist.get_rank(), move these
to a helper property to ensure that users don't actually call `dist.get_rank()`
instead of `dist.get_rank(self.process_group)`.

Keeping API private for now because not sure if there is a user need to call `model.distributed_rank`, but can make it public if we think it's a useful api.
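
An illustrative sketch of the helper property (stub classes stand in for DDP and its process group; this is not the real implementation): the point is that every call site goes through the property, which always queries DDP's own process group rather than the default world group.

```python
class _StubGroup:
    """Stand-in for a subgroup whose local rank differs from the world rank."""
    def rank(self):
        return 3

class _DDPSketch:
    def __init__(self, process_group):
        self.process_group = process_group

    @property
    def _distributed_rank(self):
        # Equivalent in spirit to dist.get_rank(self.process_group) --
        # never the bare dist.get_rank(), which would return the world rank.
        return self.process_group.rank()

ddp = _DDPSketch(_StubGroup())
```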
ghstack-source-id: 123640713

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D26972368

fbshipit-source-id: a5f1cac243bca5c6f90a44f74d39cfffcc2b9a5a
2021-03-11 21:20:05 -08:00
Rohan Varma
0c2fe02ec1 [DDP] Fix wrong call to dist.get_rank() (#53793)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53793

This call should pass in the process group so it works appropriately
for subgroups instead of whole world being passed into DDP.

Aside: This wasn't caught by tests since we don't have good testing around
passing subgroups into DDP, I believe nearly all tests use the entire world.
Should we add better testing for subgroups which may potentially bring up more
subtle bugs?
ghstack-source-id: 123640712

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D26972367

fbshipit-source-id: 8330bd51e2ad66841e4c12e96b67d3e78581ec74
2021-03-11 21:18:31 -08:00
Yi Wang
d726ce6668 Support loading a non-DP/DDP model from a DP/DDP state_dict (#53224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53224

Loading a DP/DDP dict just needs to strip the module prefix from all items in the state dict and the metadata.

One existing example is here: https://github.com/facebookresearch/fvcore/blob/master/fvcore/common/checkpoint.py#L239.

#Closes: https://github.com/pytorch/pytorch/issues/41048/
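
A minimal sketch of the prefix stripping described above (the helper name is hypothetical, but `"module."` is the prefix DP/DDP actually prepends to state-dict keys):

```python
def strip_ddp_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with the DP/DDP wrapper prefix removed."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# Keys as saved from a DDP-wrapped model:
ddp_state = {"module.linear.weight": 1, "module.linear.bias": 2}
plain_state = strip_ddp_prefix(ddp_state)
```

The same transformation is applied to the metadata dict, so a plain (non-wrapped) model can load the result directly.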
ghstack-source-id: 123722976

Test Plan:
buck test mode/dev-nosan caffe2/test:nn -- test_load_state_dict
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_save_load_checkpoint

Reviewed By: rohan-varma, mrshenli

Differential Revision: D26798495

fbshipit-source-id: 035c7d0907d7ae8f0d7ca21ec71f7f96ef8df6c8
2021-03-11 18:43:33 -08:00
Yanli Zhao
a08fc1a7fc allow users to set sample rate and add per iteration latency breakdowns (#53145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53145

add a new API to allow users to set the sample rate for runtime stats, and add per-iteration latency breakdowns to the DDPLoggingData struct. E.g.,
if users set the sample rate to 1, they can analyze per-iteration latency change over time (not averaged)
ghstack-source-id: 123443369

Test Plan: unit test

Reviewed By: SciPioneer

Differential Revision: D26763957

fbshipit-source-id: baff6a09c2a590e6eb91362ca6f47ae8fa6ddb0e
2021-03-10 11:35:18 -08:00
Michael Carilli
e787872a47 [RELAND] Deduplicate shared params before constructing Reducer in DDP (#53279)
Summary:
Original PR https://github.com/pytorch/pytorch/pull/51929 seemed to trigger failures in `pytorch_linux_xenial_py3_clang5_asan_test2`. Resubmitting to figure out why, and hopefully reland.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53279

Reviewed By: mrshenli

Differential Revision: D26916701

Pulled By: zhaojuanmao

fbshipit-source-id: 75c74c8ad8ad24154eb59eddb2b222da0a09897e
2021-03-10 07:56:20 -08:00
Rohan Varma
14fa47631b [DDP Logging] Log comm. hook in ddp logging (#52966)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52966

Logs the registered comm hook if there is one, else logs
"builtin_allreduce"
ghstack-source-id: 123174803

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D26709388

fbshipit-source-id: 484fdbbd6643ec261b3797bd8d9824b2b6a1a490
2021-03-05 11:23:26 -08:00
Rohan Varma
68134374cb Refactor/fix DDP model check during init (#52887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52887

This diff changes the way to do model consistency check (i.e. `_verify_replicas_across_processes`) in DDP.

There were a few things that could be improved with the way we verify model across processes in DDP initialization:

1. We should do this check before syncing module states in DDP init, otherwise with Gloo backend this will throw but we would like to throw the error corresponding to different models on different ranks. To do this, we move the methods to be standalone C++ functions (not part of reducer) and move this check to before synchronizing parameters.
2. Refactor DDP init in the following ways:
- Run model consistency check before creating the reducer
- add helper functions to build params to pass into reducer
- add helper function to call `_verify_model_across_ranks`
- move `def parameters` to a helper function `_get_parameters` to be used more broadly within DDP

In follow up changes we will add the ability to detect which rank had inconsistent model (https://github.com/pytorch/pytorch/issues/52876 would be useful for this to determine which ranks(s) had errors).
ghstack-source-id: 123171877

Test Plan:
CI/unittest
buck test mode/dev-nosan //caffe2/test/distributed:c10d
BACKEND="nccl" WORLD_SIZE="2" ~/fbcode/buck-out/dev/gen/caffe2/test/distributed/distributed_nccl_fork#binary.par -r test_ddp_model_diff_across_ranks

Reviewed By: zhaojuanmao

Differential Revision: D26565290

fbshipit-source-id: f0e1709585b53730e86915e768448f5b8817a608
2021-03-05 11:21:45 -08:00
Mike Ruberry
30a8a13a7d Revert D26625807: [pytorch][PR] Deduplicate shared params before constructing Reducer in DDP
Test Plan: revert-hammer

Differential Revision:
D26625807 (5c15a5bb46)

Original commit changeset: f5f5959fef90

fbshipit-source-id: c875cc86b8fd21d9d64f934559f8e3126ed1d23d
2021-03-03 20:05:47 -08:00
Yi Wang
68b62493b8 [Gradient Compression] Make GradBucket class public (#53099)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53099

Make the GradBucket APIs public for use when writing DDP communication hooks.

s/_GradBucket/GradBucket
ghstack-source-id: 123030921

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26721121

fbshipit-source-id: ee5f68e33095b9965b51937b86cdeb331fd2419a
2021-03-03 19:22:15 -08:00
Michael Carilli
5c15a5bb46 Deduplicate shared params before constructing Reducer in DDP (#51929)
Summary:
Currently, `torch.nn.parallel.DistributedDataParallel(model...)` doesn't deduplicate params shared across `model`'s child Modules before calling Reducer with the param list. This can cause Reducer to register more than one hook on the shared param(s), at which point who knows what happens.

We ran into this in mlperf BERT, which has at least one param shared across submodules (an embedding weight iirc, not 100% sure). Running with `gradient_as_bucket_view = False` produced different numerics from running with `gradient_as_bucket_view = True` (which i guess is one potential consequence of multiple DDP hooks on a given param, not sure why, i'd have to dig further).

This PR changes DDP to deduplicate shared params (a small diff), and adds some tests (right now just `test_ddp_weight_sharing`, but I'll add more). `test_ddp_weight_sharing` fails with bad numerics on current master (proving the shared param issue is real) and passes with the deduplication diff.
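
A sketch of the deduplication idea (illustrative, not the exact DDP code): keep only the first occurrence of each parameter, comparing by object identity, so a weight shared by two submodules gets exactly one reducer hook.

```python
def dedup_shared_params(params):
    """Drop repeated parameter objects, preserving first-seen order."""
    seen, unique = set(), []
    for p in params:
        if id(p) not in seen:   # identity, not equality: same tensor object
            seen.add(id(p))
            unique.append(p)
    return unique

shared = object()   # stands in for an embedding weight shared by two modules
other = object()
params = [shared, other, shared]
```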

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51929

Reviewed By: zou3519

Differential Revision: D26625807

Pulled By: zhaojuanmao

fbshipit-source-id: f5f5959fef90dfe2c55812d79fa88b877f22ecc3
2021-03-03 10:13:24 -08:00
Shen Li
d697090260 Add a note in DDP doc to point to ZeroRedundancyOptimizer (#53113)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53113

Test Plan: Imported from OSS

Reviewed By: blefaudeux

Differential Revision: D26752339

Pulled By: mrshenli

fbshipit-source-id: 7a082f1007bc550eabb82b559d020bbe717fa497
2021-03-02 14:18:06 -08:00
Yanli Zhao
d0795ab358 log newly added construction and runtime stats at randomly selected iterations (#51394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51394

log newly added construction and runtime stats at randomly selected iterations
ghstack-source-id: 121934040

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D26161885

fbshipit-source-id: add6e02c1a03e6f74f08b9a9aecf90fa81631d60
2021-02-19 00:15:04 -08:00
Yanli Zhao
c75fa39b6c add stats that can only be collected at runtime (#51386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51386

add stats such as rebuilt bucket stats, unused parameter stats and performance stats to ddp logging data

1. GPU time stats are not collected for single-process multiple-devices in this diff, as that requires events to be created and recorded on multiple devices
2. use the at::cuda::event API for safer calls
3. events may not be created in the autograd hook if the hook is not triggered by the user's code, e.g., the user runs in non-sync mode in some iterations. So we check whether events were created before synchronizing, and also skip invalid results.
4. users may not set the device upfront, so we explicitly set the proper device before creating events in our prepare_forward() and prepare_backward() calls

ghstack-source-id: 121933566

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D26158645

fbshipit-source-id: ce5f15187802eba76accb980449be68902c10178
2021-02-19 00:13:11 -08:00
Rohan Varma
6dabe0b291 [Dist Profiling] Enable dist profiling for DDP (gloo only) (#52031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52031

Closes https://github.com/pytorch/pytorch/issues/52020
Ensures that we can profile collectives in DDP by propagating the profiler threadLocalState appropriately. As described in the above issue, before this wouldn't work as the profiler would only be enabled on the main thread.
ghstack-source-id: 121818080

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26356192

fbshipit-source-id: 0158b5833a3f857a0b4b2943ae3037e9d998dfd1
2021-02-17 12:21:37 -08:00
Rohan Varma
a86027ded3 Use side-stream in CPU to GPU copies in DDP (#50180)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50180

Resolves the regression in
https://github.com/pytorch/pytorch/issues/49819 by adding copy over background
stream similar to scatter. For internal use cases, this is gated with an env var that maintains the previous behavior when it is off.

Test Plan: CI

Reviewed By: mrshenli, ngimel

Differential Revision: D25818170

fbshipit-source-id: e50c76c035504b2a44e2be084701cee45c90df75
2021-02-13 00:57:32 -08:00
Yanli Zhao
18e0a61388 add more logging fields that can be set in construction time (#51260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51260

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type
ghstack-source-id: 121260224

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D26118245

fbshipit-source-id: ba48b7a11340bda1f5f3b24c8603545d346361e9
2021-02-09 21:58:58 -08:00
Yi Wang
4b3c99ce4a [Resubmission] Add a documentation page for DDP communication hooks (#51773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51773

Resubmission of #51715.

Minor changes:
1) Removed "Note [Guidance to Tune ``matrix_approximation_rank`` And ``start_powerSGD_iter``]" in powerSGD_hook.py.

2) Removed the duplicate description of `torch.nn.parallel.DistributedDataParallel.register_comm_hook` in ddp_comm_hooks.rst, because it is already covered by distributed.rst.

Also updated the doc based on the comments from PowerSGD paper author Thijs Vogels .

It seems that `python_doc_test` was flaky. The previous error message was not informative:
https://app.circleci.com/pipelines/github/pytorch/pytorch/270682/workflows/8d186a3c-d682-46bf-b617-ad4eef5991e2/jobs/10739143, and all the warnings did also appear on the master branch.

Rebasing to a new master branch seems to get this fixed:
https://app.circleci.com/pipelines/github/pytorch/pytorch/270696/workflows/1a3adbea-6443-4876-b87b-e17d90d41428/jobs/10740021/steps

Screenshot:

{F369899792}
ghstack-source-id: 121199613

Test Plan: View locally

Reviewed By: mingzhe09088

Differential Revision: D26272687

fbshipit-source-id: 6677db496a68171798940a80343f4d9a508e15db
2021-02-06 21:22:04 -08:00
Natalia Gimelshein
d3023d86ba Revert D26249330: [Gradient Compression] Add a documentation page for DDP communication hooks
Test Plan: revert-hammer

Differential Revision:
D26249330 (e62aabac43)

Original commit changeset: ab973390ddb7

fbshipit-source-id: d508daed76219e7ca588cf7fb38aeaaffc61acfd
2021-02-04 22:38:06 -08:00
Yi Wang
e62aabac43 [Gradient Compression] Add a documentation page for DDP communication hooks (#51715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51715

Add a documentation page for DDP communication hooks.

Screenshot:

{F369781049}

Test Plan: View locally

Reviewed By: pritamdamania87

Differential Revision: D26249330

fbshipit-source-id: ab973390ddb785c5191f587a1b2b6de7d229e50e
2021-02-04 18:53:53 -08:00
Yanli Zhao
250c71121b Create a DDPLoggingData and expose it to python interface (#50622)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50622

1. Define a DDPLoggingData struct that is the placeholder for all the ddp related logging fields
2. Put the DDPLoggingData struct in the C10 directory so that it can be easily imported by c10 and torch files
3. Expose get_ddp_logging_data() method in python so that users can get the logging data and dump in their applications
4. Unit test tested the logging data can be set and got as expected
5. Follow up will add more logging fields such as perf stats, internal states, env variables and etc
ghstack-source-id: 120275870

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D25930527

fbshipit-source-id: 290c200161019c58e28eed9a5a2a7a8153113f99
2021-01-25 15:23:07 -08:00
Pritam Damania
f39f258dfd Ensure DDP + Pipe works with find_unused_parameters. (#49908)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49908

As described in https://github.com/pytorch/pytorch/issues/49891, DDP +
Pipe doesn't work with find_unused_parameters.

This PR adds a simple fix to enable this functionality. This only currently
works for Pipe within a single host and needs to be re-worked once we support
cross host Pipe.
ghstack-source-id: 119573413

Test Plan:
1) unit tests added.
2) waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25719922

fbshipit-source-id: 948bcc758d96f6b3c591182f1ec631830db1b15c
2021-01-11 16:52:37 -08:00
Samuel Marks
e6779d4357 [*.py] Rename "Arguments:" to "Args:" (#49736)
Summary:
I've written custom parsers and emitters for everything from docstrings to classes and functions. However, I recently came across an issue when I was parsing/generating from the TensorFlow codebase: inconsistent use of `Args:` and `Arguments:` in its docstrings.

```sh
(pytorch#c348fae)$ for name in 'Args:' 'Arguments:'; do
    printf '%-10s %04d\n' "$name" "$(rg -IFtpy --count-matches "$name" | paste -s -d+ -- | bc)"; done
Args:      1095
Arguments: 0336
```

It is easy enough to extend my parsers to support both variants, however it looks like `Arguments:` is wrong anyway, as per:

  - https://google.github.io/styleguide/pyguide.html#doc-function-args @ [`ddccc0f`](https://github.com/google/styleguide/blob/ddccc0f/pyguide.md)

  - https://chromium.googlesource.com/chromiumos/docs/+/master/styleguide/python.md#describing-arguments-in-docstrings @ [`9fc0fc0`](https://chromium.googlesource.com/chromiumos/docs/+/9fc0fc0/styleguide/python.md)

  - https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html @ [`c0ae8e3`](https://github.com/sphinx-contrib/napoleon/blob/c0ae8e3/docs/source/example_google.rst)

Therefore, only `Args:` is valid. This PR replaces them throughout the codebase.

PS: For related PRs, see tensorflow/tensorflow/pull/45420

PPS: The trackbacks automatically appearing below are sending the same changes to other repositories in the [PyTorch](https://github.com/pytorch) organisation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49736

Reviewed By: albanD

Differential Revision: D25710534

Pulled By: soumith

fbshipit-source-id: 61e8ff01abb433e9f78185c2d1d0cbd7c22c1619
2020-12-28 09:34:47 -08:00
Rohan Varma
c9f6e70c09 Refactor DDP uneven inputs control flags (#47394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47394

This is a preliminary refactor for the next diff that will add an
additional flag to control whether we throw a StopIteration or not. We
basically move the flags for ddp uneven inputs to a simple class.
ghstack-source-id: 116428177

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D24739509

fbshipit-source-id: 96bf41bd1c02dd27e68f6f37d08e22f33129b319
2020-11-11 16:51:56 -08:00
Zhicheng Chen
3dd266304c Fix inaccurate note in DistributedDataParallel (#47156)
Summary:
Sorry for my previous inaccurate [PR](https://github.com/pytorch/pytorch/pull/42471#issue-462329192 ).

Here are some toy code to illustrate my point:

* non-DistributedDataParallel version

```python
import torch

if __name__ == "__main__":
    torch.manual_seed(0)
    inp = torch.randn(1,16)
    inp = torch.cat([inp, inp], dim=0)
    model = torch.nn.Linear(16, 2)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()
    loss = loss_func(model(inp), torch.tensor([0, 0]))
    loss.backward()
    opti.step()

    print("grad:", model.weight.grad)
    print("updated weight:\n", model.weight)
```

* DistributedDataParallel version

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    torch.manual_seed(0)
    x = torch.randn(1,16)

    model = torch.nn.Linear(16, 2)
    model = torch.nn.parallel.DistributedDataParallel(model)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()

    y = model(x)

    label = torch.tensor([0])
    loss = loss_func(y, label)

    loss.backward()
    opti.step()

    if rank == 0:
        print("grad:", model.module.weight.grad)
        print("updated weight:\n", model.module.weight)

def init_process(rank, size, fn, backend="gloo"):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 2
    process = []
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run))
        p.start()
        process.append(p)

    for p in process:
        p.join()
```

Both of these two pieces of code have the same output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47156

Reviewed By: mruberry

Differential Revision: D24675199

Pulled By: mrshenli

fbshipit-source-id: 1238a63350a32a824b4b8c0018dc80454ea502bb
2020-11-09 17:42:57 -08:00
Yi Wang
fccfe7bd1a [Gradient Compression] Add unit tests that test default Python comm hook implementations (#47158)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47158

1. Test the default Python comm hook implementations ALLREDUCE and FP16_COMPRESS, besides an ad-hoc all-reduce implementation.
2. Typo fix.
3. Reformat default_hooks.py.
4. Publish register_comm_hook API for DDP module (This should be done in a separate diff, but got merged unintentionally.)

The new style can be used for testing any new comm hook like PowerSGD easily.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

ghstack-source-id: 116012600

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D24669639

fbshipit-source-id: 048c87084234edc2398f0ea6f01f2f083a707939
2020-11-06 00:28:09 -08:00
Yi Wang
f91fcefc81 [Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks (#47270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47270

This is almost the same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType should be imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/csrc/distributed/c10d/init.cpp cannot be imported. See https://github.com/pytorch/pytorch/issues/47153

I tried to follow another enum type, ReduceOp, defined in the same file, but that did not work, because the C++ enum class is defined in the torch/lib/c10d library, while BuiltinCommHookType is defined in the torch/csrc/distributed library. These two libraries are compiled in two different ways.

To avoid adding typing to the distributed package, which could be a project of its own, I simply removed the BuiltinCommHookType arg type annotation in this file.

To review the diff on top of #46959, compare V1 vs Latest:
https://www.internalfb.com/diff/D24700959?src_version_fbid=270445741055617

Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the built-in comm hooks.

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115783237

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl

//arvr/projects/eye_tracking/Masquerade:python_test

USE_DISTRIBUTED=0 USE_GLOO=0 BUILD_TEST=0 USE_CUDA=1 USE_MKLDNN=0 DEBUG=0 python setup.py install

Reviewed By: mrshenli

Differential Revision: D24700959

fbshipit-source-id: 69f303a48ae275aa856e6e9b50e12ad8602e1c7a
2020-11-03 18:33:50 -08:00
Yi Wang
b1b77148ac Back out "[Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks" (#47234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47234

Revert the diff because of https://github.com/pytorch/pytorch/issues/47153

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115720415

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D24691866

fbshipit-source-id: 58fe0c45943a2ae2a09fe5d5eac4a4d947586539
2020-11-02 20:51:18 -08:00
Yi Wang
ee0033af9b [Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks (#46959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46959

1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the built-in comm hooks.

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115629230

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl

Reviewed By: pritamdamania87

Differential Revision: D24471910

fbshipit-source-id: f96b752298549ea2067e2568189f1b394abcd99a
2020-10-30 23:19:42 -07:00
Rohan Varma
ecdbea77bc Fix DDP documentation (#46861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46861

Noticed that in the DDP documentation:
https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html?highlight=distributeddataparallel
there were some examples with `torch.nn.DistributedDataParallel`, fix this to
read `torch.nn.parallel.DistributedDataParallel`.
ghstack-source-id: 115453703

Test Plan: ci

Reviewed By: pritamdamania87, SciPioneer

Differential Revision: D24534486

fbshipit-source-id: 64b92dc8a55136c23313f7926251fe825a2cb7d5
2020-10-29 09:13:47 -07:00
Rohan Varma
7245d2c939 Avoid scatter for single-device case in DDP (#46304)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46304

In the case that a single process operates only on one GPU, we can
avoid this scatter and instead replace it with a recursive version of `to`
which transfers the input tensors to the correct device.

The implementation of `_recursive_to` is modeled after `scatter` in https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/scatter_gather.py, in order to keep parity with the previous conventions (i.e. custom types not having their tensors moved).
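
A minimal sketch of a recursive `to` following the convention described above (`FakeTensor` is a stand-in for `torch.Tensor`; this is not the real `_recursive_to`): tensors are moved, lists/tuples/dicts are rebuilt, and anything else passes through unchanged.

```python
class FakeTensor:
    """Stand-in for torch.Tensor with just a device and a .to()."""
    def __init__(self, device="cpu"):
        self.device = device
    def to(self, device):
        return FakeTensor(device)

def recursive_to(obj, device):
    if hasattr(obj, "to"):                       # tensor-like: move it
        return obj.to(device)
    if isinstance(obj, (list, tuple)):           # rebuild container type
        return type(obj)(recursive_to(o, device) for o in obj)
    if isinstance(obj, dict):
        return {k: recursive_to(v, device) for k, v in obj.items()}
    return obj                                   # custom types: untouched

moved = recursive_to([FakeTensor(), {"x": FakeTensor()}], "cuda:0")
```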
ghstack-source-id: 114896677

Test Plan: Added unittest, and CI

Reviewed By: pritamdamania87

Differential Revision: D24296377

fbshipit-source-id: 536242da05ecabfcd36dffe14168b1f2cf58ca1d
2020-10-22 08:29:37 -07:00
Alexander Grund
5b0f400488 Replace list(map(...)) constructs by list comprehensions (#46461)
Summary:
As discussed in https://github.com/pytorch/pytorch/issues/46392 this makes the code more readable and possibly more performant.

It also fixes a bug detected by this where the argument order of `map` was confused: 030a24906e (diff-5bb26bd3a23ee3bb540aeadcc0385df2a4e48de39f87ed9ea76b21990738fe98L1537-R1537)

Fixes https://github.com/pytorch/pytorch/issues/46392
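
The readability point above, sketched: a comprehension fixes the positions of the expression and the iterable, whereas `map()` makes it easy to swap the arguments, the class of bug the linked commit fixed (swapping them raises a `TypeError`).

```python
xs = [1, 2, 3]
mapped = list(map(str, xs))          # correct, but order-sensitive
comprehended = [str(x) for x in xs]  # reads left to right, harder to get wrong

swapped_failed = False
try:
    list(map(xs, str))               # arguments confused: str (a type) is
except TypeError:                    # not iterable, so map() raises
    swapped_failed = True
```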

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46461

Reviewed By: ailzhang

Differential Revision: D24367015

Pulled By: ezyang

fbshipit-source-id: d55a67933cc22346b00544c9671f09982ad920e7
2020-10-19 18:42:49 -07:00
Emilio Castillo
d38a71d579 torch.nn.modules.LazyModuleMixin and torch.nn.LazyLinear (Shape Inference II) (#44538)
Summary:
Retake on https://github.com/pytorch/pytorch/issues/40493 after all the feedback from albanD

This PR implements the generic Lazy mechanism and a sample `LazyLinear` layer with the `UninitializedParameter`.

The main differences with the previous PR are two;
Now `torch.nn.Module` remains untouched.
We don't require an explicit initialization or a dummy forward pass before starting the training or inference of the actual module. Making this much simpler to use from the user side.

As we discussed offline, there was the suggestion of not using a mixin, but changing the `__class__` attribute of `LazyLinear` to become `Linear` once it's completely initialized. While this can be useful, for the time being we need `LazyLinear` to be a `torch.nn.Module` subclass since there are many checks that rely on the modules being instances of `torch.nn.Module`.
This can cause problems when we create complex modules such as
```
class MyNetwork(torch.nn.Module):
    def __init__(self):
        super(MyNetwork, self).__init__()
        self.conv = torch.nn.Conv2d(20, 4, 2)
        self.linear = torch.nn.LazyLinear(10)
    def forward(self, x):
        y = self.conv(x).clamp(min=0)
        return self.linear(y)
```
Here, when the __setattr__ function is called at the time LazyLinear is registered, it won't be added to the child modules of `MyNetwork`, so we have to do it manually later, but currently there is no way to do such a thing as we can't access the parent module from LazyLinear once it becomes the Linear module. (We can add a workaround to this if needed.)

TODO:

Add convolutions once the design is OK
Fix docstrings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44538

Reviewed By: ngimel

Differential Revision: D24162854

Pulled By: albanD

fbshipit-source-id: 6d58dfe5d43bfb05b6ee506e266db3cf4b885f0c
2020-10-19 13:13:54 -07:00
Rohan Varma
181afd5220 Add an option to DDP to take a list of parameters to ignore upfront. (#44826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44826

As described in https://github.com/pytorch/pytorch/issues/43690, there
is a need for DDP to be able to ignore certain parameters in the module (not
install allreduce hooks) for certain use cases. `find_unused_parameters` is
sufficient from a correctness perspective, but we can get better performance
with this upfront list if users know which params are unused, since we won't
have to traverse the autograd graph every iteration.

To enable this, we add a field `parameters_to_ignore` to DDP init and don't
pass in that parameter to reducer if that parameter is in the given list.
ghstack-source-id: 113210109

Test Plan: Added unittest

Reviewed By: xw285cornell, mrshenli

Differential Revision: D23740639

fbshipit-source-id: a0411712a8b0b809b9c9e6da04bef2b955ba5314
2020-09-30 11:52:50 -07:00
Shen Li
c5ade5f698 Fix no_sync docs (#45455)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45455

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23973365

Pulled By: mrshenli

fbshipit-source-id: 87c9878cdc7310754670b83efa65ae6f877f86fb
2020-09-28 20:48:09 -07:00
Shen Li
6967e6295e Fix DDP docs (#45454)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45454

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23973367

Pulled By: mrshenli

fbshipit-source-id: 11f20d51d0d0f92f199e4023f02b86623867bae0
2020-09-28 20:43:22 -07:00
Yanli Zhao
c6500bcf14 [reland] Make grad point to bucket buffer in DDP to save memory usage (#44344)
Summary:
[test all]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44344

reland #41954

Add one argument to the DDP API to enable/disable letting grads point to views. When it is disabled, the behavior is the same as DDP today; when it is enabled, both variable.grad() and the grad in the dist autograd context point to the bucket buffer in DDP to save memory usage.
In this case, grad will be a view of the bucket buffer tensors; in order to make this compatible with optimizer.zero_grad(), we
made changes in #41283.

Also note that we cannot make variable.grad() point to the bucket buffer at construction time, because we want to
keep grad undefined for unused parameters.
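The bucket-view idea can be sketched in plain Python (no torch; `memoryview` slices stand in for tensor views into one flat bucket buffer):

```python
from array import array

# Sketch: each parameter's "grad" is a view into a contiguous bucket buffer,
# so writing a grad fills the communication buffer with no extra copy.
bucket = array("d", [0.0] * 5)   # flat bucket buffer (5 doubles)
buf = memoryview(bucket)
grad_a = buf[0:2]                # grad of param a: a view, not a copy
grad_b = buf[2:5]                # grad of param b

grad_a[0] = 1.5                  # autograd would write grads here
grad_b[2] = -3.0

# the bucket sees the writes directly, ready to be allreduced as one chunk
assert list(bucket) == [1.5, 0.0, 0.0, 0.0, -3.0]
```

This also shows why zero_grad() needed care (#41283): zeroing a grad must not detach or reallocate it, or the view into the bucket would be lost.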
ghstack-source-id: 112845787

Test Plan:
1. When grad_is_view=false:
a. roberta_base, peak memory usage 8250MB, p50 per iteration latency 0.923second, https://www.internalfb.com/intern/fblearner/details/218029699/?notif_channel=cli
b. resnet, peak memory usage 3089MB, p50 per iteration latency 0.120second, https://www.internalfb.com/intern/fblearner/details/218029035/?notif_channel=cli
c. accuracy benchmark, distributed=false, .accuracy 40.914535522461, .loss: 1.6370717287064; distributed=true, .accuracy: 39.966053009033, .loss: 1.6849111318588
https://www.internalfb.com/intern/fblearner/details/218035688/?notif_channel=cli
d. classy vision uru production flow, https://www.internalfb.com/intern/fblearner/details/219065811/?notif_channel=cli
e. pytext flow, https://www.internalfb.com/intern/fblearner/details/219137458/?notif_channel=cli

2. When grad_is_view=true:
a. roberta_base, peak memory usage 7183MB, p50 per iteration latency 0.908second, https://www.internalfb.com/intern/fblearner/details/217882539?tab=operator_details
b. resnet, peak memory usage 2988 MB, p50 per iteration latency 0.119second, https://www.internalfb.com/intern/fblearner/details/218028479/?notif_channel=cli
c. accuracy benchmark, distributed=false, .accuracy 41.713260650635, .loss: 1.69939661026; distributed=true, .accuracy: 39.966053009033, .loss: 1.6849111318588, https://www.internalfb.com/intern/fblearner/details/218037058/?notif_channel=cli
d. classy vision uru production flow, expected, can not work well with apex.amp https://www.internalfb.com/intern/fblearner/details/219205218/?notif_channel=cli
e. pytext flow, detach_() related error, expected, as pytext zero_grad depends on apex repo where detach_() is called. also seeing the warning in finalize_bucket_dense due to tied weights, which is expected. https://www.internalfb.com/intern/fblearner/details/219150229/?notif_channel=cli

Reviewed By: mrshenli

Differential Revision: D23588186

fbshipit-source-id: f724d325b954ef6f06ede31759bf01dd29a6f5e5
2020-09-24 20:54:51 -07:00
Rohan Varma
e57a08119b Add a warning log when there is high skew of uneven inputs in DDP training (#45238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45238

Adds a warning when there is a much higher than expected discrepancy in the number of
inputs across different processes when running with uneven
inputs. This is because a skew in the thousands can reduce performance by a
nontrivial amount, as shown in benchmarks, and it was proposed to add this
warning as a result. Tested by running the tests so that the threshold is hit and
observing the output.
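The skew check can be sketched as follows (the threshold and message are illustrative, not the exact values DDP uses):

```python
import warnings

SKEW_WARN_THRESHOLD = 1000  # hypothetical iteration-count threshold

def maybe_warn_uneven_inputs(my_iters, max_iters_seen):
    """Warn if this rank has far fewer inputs than the slowest-to-join rank."""
    skew = max_iters_seen - my_iters
    if skew > SKEW_WARN_THRESHOLD:
        warnings.warn(
            f"Detected input skew of {skew} iterations across ranks; "
            "this can noticeably reduce DDP throughput."
        )
        return True
    return False

assert maybe_warn_uneven_inputs(my_iters=500, max_iters_seen=5000) is True
assert maybe_warn_uneven_inputs(my_iters=4900, max_iters_seen=5000) is False
```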
ghstack-source-id: 112773552

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23719270

fbshipit-source-id: 306264f62c1de65e733696a912bdb6e9376d5622
2020-09-24 09:50:44 -07:00
Bugra Akyildiz
1b059f2c6d Directly use work.result() to retrieve tensor rather than passing as a separate argument (#44914)
Summary:
We currently fetch an allreduced tensor from Python in C++ and store the resulting tensor in a struct's member. This PR removes the extra tensor parameter from the function signature and fetches the result from a single place.

Fixes https://github.com/pytorch/pytorch/issues/43960

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44914

Reviewed By: rohan-varma

Differential Revision: D23798888

Pulled By: bugra

fbshipit-source-id: ad1b8c31c15e3758a57b17218bbb9dc1f61f1577
2020-09-22 06:28:47 -07:00
Yanli Zhao
e14b2080be [reland] move rebuild buckets from end of first iteration to beginning of second iteration (#44798)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44798

[test all]

Update for relanding: in ddp.join(), moved _rebuild_buckets from the end of backward to the beginning of forward as well.

Part of relanding PR #41954, this refactoring moves the rebuild_buckets call from the end of the first iteration to the beginning of the second iteration
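The timing change can be sketched with a stand-in reducer (illustrative names; the real logic lives in C++):

```python
# Sketch: rebuild buckets lazily at the start of the second forward pass
# instead of at the end of the first backward pass.
class Reducer:
    def __init__(self):
        self.iteration = 0
        self.buckets_rebuilt = False

    def forward(self):
        # beginning of the second iteration: rebuild once, before any use
        if self.iteration == 1 and not self.buckets_rebuilt:
            self.buckets_rebuilt = True
        self.iteration += 1

r = Reducer()
r.forward()                      # iteration 1: no rebuild yet
assert not r.buckets_rebuilt
r.forward()                      # iteration 2: buckets rebuilt up front
assert r.buckets_rebuilt
```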
ghstack-source-id: 112279261

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D23735185

fbshipit-source-id: c26e0efeecb3511640120faa1122a2c856cd694e
2020-09-17 17:10:21 -07:00
Ailing Zhang
fb085d90e3 Revert D23583017: move rebuild buckets from end of first iteration to beginning of second iteration
Test Plan: revert-hammer

Differential Revision:
D23583017 (f5d231d593)

Original commit changeset: ef67f79437a8

fbshipit-source-id: fd914b7565aba6a5574a32b31403525abb80ff07
2020-09-15 15:10:52 -07:00
Yanli Zhao
f5d231d593 move rebuild buckets from end of first iteration to beginning of second iteration (#44326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44326

Part of relanding PR #41954, this refactoring moves the rebuild_buckets call from the end of the first iteration to the beginning of the second iteration
ghstack-source-id: 112011490

Test Plan: unit tests

Reviewed By: mrshenli

Differential Revision: D23583017

fbshipit-source-id: ef67f79437a820d9b5699b651803622418499a83
2020-09-15 09:51:33 -07:00
Yi Wang
ace81b6794 Remove an extra empty line in the warning comments. (#44622)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44622

Remove an extra empty line in the warning comments.

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D23674070

fbshipit-source-id: 4ee570590c66a72fb808e9ee034fb773b833efcd
2020-09-14 11:15:35 -07:00
Rohan Varma
41f62b17e7 Fix DDP join() API in the case of model.no_sync() (#44427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44427

Closes https://github.com/pytorch/pytorch/issues/44425

DDP join API currently does not work properly with `model.no_sync()`, see https://github.com/pytorch/pytorch/issues/44425 for details. This PR fixes the problem via the approach mentioned in the issue, namely scheduling an allreduce that tells joined ranks whether to sync in the backwards pass or not. Tests are added for skipping gradient synchronization for various `sync_interval`s.
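The decision logic of the fix can be sketched in plain Python (the boolean flag stands in for the extra allreduce; names are illustrative):

```python
# Active ranks allreduce a "require sync" flag each iteration; joined ranks
# consume the result to decide whether to shadow the gradient allreduce.
def should_sync(require_sync_flags):
    """True if any active rank is outside no_sync() this iteration."""
    return any(require_sync_flags)

# every active rank is inside no_sync(): joined ranks skip shadowing too
assert should_sync([False, False]) is False
# at least one active rank will sync: joined ranks must shadow the allreduce
assert should_sync([False, True]) is True
```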
ghstack-source-id: 111786479

Reviewed By: pritamdamania87

Differential Revision: D23609070

fbshipit-source-id: e8716b7881f8eee95e3e3499283e716bd3d7fe76
2020-09-10 18:31:40 -07:00
Rohan Varma
3806c939bd Polish DDP join API docstrings (#43973)
Summary:
Polishes DDP join api docstrings and makes a few minor cosmetic changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43973

Reviewed By: zou3519

Differential Revision: D23467238

Pulled By: rohan-varma

fbshipit-source-id: faf0ee56585fca5cc16f6891ea88032336b3be56
2020-09-03 13:39:45 -07:00
Rohan Varma
4e4626a23d Join-based API to support DDP uneven inputs (#42577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42577

Closes https://github.com/pytorch/pytorch/issues/38174. Implements a join-based API to support training with the DDP module in the scenario where different processes have different no. of inputs. The implementation follows the description in https://github.com/pytorch/pytorch/issues/38174. Details are available in the RFC, but as a summary, we make the following changes:

#### Approach
1) Add a context manager `torch.nn.parallel.distributed.join`
2) In the forward pass, we schedule a "present" allreduce where non-joined processes contribute 1 and joined processes contribute 0. This lets us keep track of joined processes and know when all procs are joined.
3) When a process depletes its input and exits the context manager, it enters "joining" mode and attempts to "shadow" the collective comm. calls made in the model's forward and backward pass. For example, we schedule the same allreduces in the same order as in the backward pass, but with zeros.
4) We adjust the allreduce division logic to divide by the effective world size (no. of non-joined procs) rather than the absolute world size to maintain correctness.
5) At the end of training, the last joined process is selected to be the "authoritative" model copy

We also make some misc. changes such as adding a `rank` argument to `_distributed_broadcast_coalesced` and exposing some getters/setters on `Reducer` to support the above changes.
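Steps 2 and 4 above can be sketched in plain Python (no torch; the lists stand in for per-rank allreduce contributions):

```python
# A "present" allreduce counts non-joined ranks; gradient averaging then
# divides by that effective world size, not the absolute world size.
def presence_allreduce(joined_flags):
    """Each rank contributes 1 if still active, 0 if already joined."""
    return sum(0 if joined else 1 for joined in joined_flags)

def average_grads(per_rank_grads, joined_flags):
    effective_world_size = presence_allreduce(joined_flags)
    # joined ranks "shadow" the allreduce by contributing zeros
    total = sum(
        0.0 if joined else g
        for g, joined in zip(per_rank_grads, joined_flags)
    )
    return total / effective_world_size

# 4 ranks, one already joined (its contribution is zero)
avg = average_grads([2.0, 4.0, 6.0, 0.0], [False, False, False, True])
assert avg == 4.0   # (2 + 4 + 6) / 3, not divided by 4
```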

#### How is it tested?
We have tests covering the following models/scenarios:
- [x] Simple linear model
- [x] Large convolutional model
- [x] Large model with module buffers that are broadcast in the forward pass (resnet). We verify this with a helper function `will_sync_module_buffers` and ensure this is true for ResNet (due to batchnorm)
- [x] Scenario where a rank calls join() without iterating at all, so without rebuilding buckets (which requires collective comm)
- [x] Model with unused params (with find unused parameters=True)
- [x] Scenarios where different processes iterate for a varying number of different iterations.
- [x] Test consistency in tie-breaking when multiple ranks are the last ones to join
- [x] Test that we divide by the effective world_size (no. of unjoined processes)

#### Performance implications

###### Trunk vs PR patched, 32 GPUs, batch size = 32
P50, forward + backward + optimizer batch latency & total QPS: 0.121 264/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.087 369/s vs 0.087 368/s

###### join(enable=True) vs without join, 32 GPUs, batch size = 32, even inputs
P50, forward + backward + optimizer batch latency & total QPS: 0.120 265/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.088 364/s vs 0.087 368/s

###### join(enable=False) vs without join, 32 GPUs, batch size = 32, even inputs
P50 forward + backward + optimizer batch latency & total QPS: 0.121 264/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.087 368/s vs 0.087 368/s

###### join(enable=True) with uneven inputs (offset = 2000), 32 GPUs, batch size = 32
P50 forward + backward + optimizer batch latency & total QPS: 0.183 174/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.150 213/s vs 0.087 368/s

###### join(enable=True) with uneven inputs ((offset = 2000)), 8 GPUs, batch size = 32
P50 forward + backward + optimizer batch latency & total QPS: 0.104 308/s vs 0.104 308/s
P50 backwards only batch latency & total QPS: 0.070 454/s vs 0.070 459/s

The two uneven-inputs benchmarks above were conducted with 4 GPUs immediately depleting their inputs and entering "join" mode (i.e. not iterating at all), while the remaining GPUs (e.g. the other 28 in the 32-GPU run) iterated as normal. It looks like there is a pretty significant perf hit for this case when there are uneven inputs and multi-node training. Strangely, with a single node (8 GPUs), this does not reproduce.

#### Limitations
1) This is only implemented for MPSD, not SPMD. Per a discussion with mrshenli we want to encourage the use of MPSD over SPMD for DDP.
2) This does not currently work with SyncBN or custom collective calls made in the model's forward pass. This is because the `join` class only shadows the `broadcast` for buffers in the forward pass, the gradient allreduces in the bwd pass, unused parameters reduction, and (optionally) the rebuild buckets broadcasting in the backwards pass. Supporting this will require additional design thought.
3) Has not been tested with the [DDP comm. hook](https://github.com/pytorch/pytorch/issues/39272) as this feature is still being finalized/in progress. We will add support for this in follow up PRs.
ghstack-source-id: 111033819

Reviewed By: mrshenli

Differential Revision: D22893859

fbshipit-source-id: dd02a7aac6c6cd968db882c62892ee1c48817fbe
2020-08-31 13:29:03 -07:00
Haoran Li
f35e069622 Back out "Make grad point to bucket buffer in DDP to save memory usage" (#43557)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43557

Back out the diff that caused some errors in pytext distributed training

Test Plan: Tested by rayhou, who verified that reverting the diff works

Differential Revision: D23320238

fbshipit-source-id: caa0fe74404059e336cd95fdb41373f58ecf486e
2020-08-25 18:04:39 -07:00
Yanli Zhao
97d594b9f7 Make grad point to bucket buffer in DDP to save memory usage (#41954)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41954
Make both variable.grad() and the grad in the dist autograd context point to the bucket buffer in DDP to save memory usage.
In this case, grad will be a view of the bucket buffer tensors; in order to make this compatible with optimizer.zero_grad(), we
made changes in https://github.com/pytorch/pytorch/pull/41283.

Also note that we cannot make variable.grad() point to the bucket buffer at construction time, because we want to
keep grad undefined for unused parameters.
ghstack-source-id: 110260297

Test Plan:
unit tests,

For roberta_base model with ~1GB parameters, peak memory dropped ~1GB (8250MB-7183MB).  Per iteration latency (0.982s ->0.909s), 8% speed up
https://www.internalfb.com/intern/fblearner/details/211713882?tab=operator_details
https://www.internalfb.com/intern/fblearner/details/211772923?tab=operator_details

For resnet model with ~97M parameters, peak memory dropped ~100MB (3089MB -> 2988MB). Per iteration latency has no change (0.122s -> 0.123s)
https://www.internalfb.com/intern/fblearner/details/211713577?tab=operator_details
https://www.internalfb.com/intern/fblearner/details/211712582?tab=operator_details

accuracy benchmark is expected as well
https://www.internalfb.com/intern/fblearner/details/213237067?tab=Outputs

Reviewed By: mrshenli

Differential Revision: D22707857

fbshipit-source-id: b5e767cfb34ccb3d067db2735482a86d59aea7a4
2020-08-20 15:33:44 -07:00
Sinan Nasir
6e1127ea3f [NCCL] Changed FutureNCCL's then callback logic for better efficiency. (#42869)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42869

We realized that when we invoke a simple callback that divides the tensors by `world_size` after `allreduce`, the performance was almost 50% lower in terms of QPS compared to the case where a simple `allreduce` hook is used with no `then` callback.

The main problem was that, because we call `work.wait()` before invoking the `then` callback, we were synchronizing `work`'s stream with the default PyTorch stream inside [`runHook`](https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp#L609) and stalling the backward computation.

In this PR, we ensure that FutureNCCL's `then` callback does not stall the backward computation. Assuming single-process single-device, `FutureNCCL` gets a new stream from the device's pool using `at::cuda::getStreamFromPool` to run the `callback`, and before invoking the `callback` inline it synchronizes `WorkNCCL`'s stream with the callback's stream, not the default stream.
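The `then`-chaining shape can be sketched with `concurrent.futures` (the real FutureNCCL is C++/CUDA and the stream handling above is the actual fix; this only mirrors the callback API surface):

```python
from concurrent.futures import Future

def then(fut, callback):
    """Return a new Future holding callback(result) once `fut` completes."""
    chained = Future()
    fut.add_done_callback(lambda f: chained.set_result(callback(f.result())))
    return chained

world_size = 4
allreduce_done = Future()
# register the post-allreduce divide without blocking the caller
averaged = then(allreduce_done, lambda t: t / world_size)

allreduce_done.set_result(8.0)   # the allreduce produced the summed grads
assert averaged.result() == 2.0
```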

ghstack-source-id: 110208431

Test Plan: Run performance benchmark tests to validate performance issue is resolved. Also, `python test/distributed/test_c10d.py` to avoid any odd issues.

Reviewed By: pritamdamania87

Differential Revision: D23055807

fbshipit-source-id: 60e50993f1ed97497514eac5cb1018579ed2a4c5
2020-08-19 19:42:22 -07:00
Sinan Nasir
752f433a24 DDP communication hook: skip dividing grads by world_size if hook registered. (#42400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42400

mcarilli spotted that in the original DDP communication hook design described in [39272](https://github.com/pytorch/pytorch/issues/39272), the hooks receive grads that are already predivided by world size.

It makes sense to skip the divide completely if a hook is registered. The hook is meant to let the user completely override DDP communication. For example, if the user would like to implement something like GossipGrad, always dividing by the world_size would not be a good idea.

We also included a warning in the register_comm_hook API as:
> GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce.
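The contract can be sketched in plain Python (hypothetical names; the real hook receives a GradBucket and returns a future):

```python
# With a hook registered, DDP hands the hook the raw (undivided) allreduced
# values; the hook itself owns any division by world_size.
def allreduce_hook(bucket_per_rank, world_size):
    summed = [sum(vals) for vals in zip(*bucket_per_rank)]
    # DDP no longer predivides -- the hook must do it explicitly
    return [v / world_size for v in summed]

grads = allreduce_hook([[1.0, 2.0], [3.0, 6.0]], world_size=2)
assert grads == [2.0, 4.0]
```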
ghstack-source-id: 109548696

**Update:** We discovered and fixed a bug with the sparse tensors case. See new unit test called `test_ddp_comm_hook_sparse_gradients` and changes in `reducer.cpp`.

Test Plan: python test/distributed/test_c10d.py and perf benchmark tests.

Reviewed By: ezyang

Differential Revision: D22883905

fbshipit-source-id: 3277323fe9bd7eb6e638b7ef0535cab1fc72f89e
2020-08-10 13:55:42 -07:00
Sinan Nasir
0a804be47d [NCCL] DDP communication hook: getFuture() without cudaStreamAddCallback (#42335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42335

**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version; the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback, which was potentially causing the performance regression in [#41596](https://github.com/pytorch/pytorch/pull/41596).
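The API shape can be sketched in Python (class names mirror the C++ ones but are stand-ins; the real future completes when the NCCL op finishes):

```python
from concurrent.futures import Future

class Work:
    def get_future(self):
        raise NotImplementedError("getFuture is only supported by NCCL")

class WorkNCCL(Work):
    def __init__(self):
        self._future = Future()   # created alongside the work object

    def get_future(self):
        return self._future

w = WorkNCCL()
fut = w.get_future()
fut.set_result("allreduced tensor")   # marked complete when the op finishes
assert fut.result() == "allreduced tensor"

try:
    Work().get_future()
except NotImplementedError:
    pass
else:
    raise AssertionError("base Work should not support get_future")
```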

ghstack-source-id: 109461507

Test Plan:
```(pytorch) [sinannasir@devgpu017.ash6 ~/local/pytorch] python test/distributed/test_c10d.py
Couldn't download test skip set, leaving all tests enabled...
..............................s.....................................................s................................
----------------------------------------------------------------------
Ran 117 tests in 298.042s

OK (skipped=2)
```
### Facebook Internal:
2\. HPC PT trainer run to validate no regression. Check the QPS number:
**Master:** QPS after 1000 iters: around ~34100
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"testvideo_master" --trainers 16 --trainer-version 1c53912
```
```
[0] I0806 142048.682 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963963 0.950479 0.953704], lifetime NE: [0.963963 0.950479 0.953704], loss: [0.243456 0.235225 0.248375], QPS: 34199
```
[detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtestvideo_mastwarm.trainer.trainer%2F0&ta_tab=logs)

**getFuture/new design:** QPS after 1000 iters: around ~34030
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"testvideo_getFutureCyclicFix" --trainers 16 --trainer-version 8553aee
```
```
[0] I0806 160149.197 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963959 0.950477 0.953704], lifetime NE: [0.963959 0.950477 0.953704], loss: [0.243456 0.235225 0.248375], QPS: 34018
```
[detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtestvideo_getFutureCyclicFix.trainer.trainer%2F0&ta_tab=logs)
**getFuture/new design Run 2:** QPS after 1000 iters: around ~34200
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"test2video_getFutureCyclicFix" --trainers 16 --trainer-version 8553aee
```
```
[0] I0806 160444.650 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963963 0.950482 0.953706], lifetime NE: [0.963963 0.950482 0.953706], loss: [0.243456 0.235225 0.248375], QPS: 34201
```
[detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtest2video_getFutureCyclicFix.trainer.trainer%2F0&ta_tab=logs)
**getFuture/old design (Regression):** QPS after 1000 iters: around ~31150
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER”testvideo_OLDgetFutureD22583690 (d904ea5972)" --trainers 16 --trainer-version 1cb5cbb
```
```
priv3_global/mast_hpc/hpc.sinannasirtestvideo_OLDgetFutureD22583690 (d904ea5972).trainer.trainer/0 [0] I0805 101320.407 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963964 0.950482 0.953703], lifetime NE: [0.963964 0.950482 0.953703], loss: [0.243456 0.235225 0.248375], QPS: 31159
```
3\. `flow-cli` tests; roberta_base; world_size=4:
**Master:** f210039922
```
total:
  32 GPUs -- 32 GPUs: p25:  0.908    35/s  p50:  1.002    31/s  p75:  1.035    30/s  p90:  1.051    30/s  p95:  1.063    30/s
forward:
  32 GPUs -- 32 GPUs: p25:  0.071   452/s  p50:  0.071   449/s  p75:  0.072   446/s  p90:  0.072   445/s  p95:  0.072   444/s
backward:
  32 GPUs -- 32 GPUs: p25:  0.821    38/s  p50:  0.915    34/s  p75:  0.948    33/s  p90:  0.964    33/s  p95:  0.976    32/s
optimizer:
  32 GPUs -- 32 GPUs: p25:  0.016  2037/s  p50:  0.016  2035/s  p75:  0.016  2027/s  p90:  0.016  2019/s  p95:  0.016  2017/s
```
**getFuture new design:** f210285797
```
total:
  32 GPUs -- 32 GPUs: p25:  0.952    33/s  p50:  1.031    31/s  p75:  1.046    30/s  p90:  1.055    30/s  p95:  1.070    29/s
forward:
  32 GPUs -- 32 GPUs: p25:  0.071   449/s  p50:  0.072   446/s  p75:  0.072   445/s  p90:  0.072   444/s  p95:  0.072   443/s
backward:
  32 GPUs -- 32 GPUs: p25:  0.865    37/s  p50:  0.943    33/s  p75:  0.958    33/s  p90:  0.968    33/s  p95:  0.982    32/s
optimizer:
  32 GPUs -- 32 GPUs: p25:  0.016  2037/s  p50:  0.016  2033/s  p75:  0.016  2022/s  p90:  0.016  2018/s  p95:  0.016  2017/s

```

Reviewed By: ezyang

Differential Revision: D22833298

fbshipit-source-id: 1bb268d3b00335b42ee235c112f93ebe2f25b208
2020-08-07 18:48:35 -07:00
Nikita Shulga
56fc7d0345 Fix doc build (#42559)
Summary:
Add a space between the double backquotes and the left curly bracket

Otherwise doc generation failed with `Inline literal start-string without end-string.`

This regression was introduced by b56db305cf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42559

Reviewed By: glaringlee

Differential Revision: D22931527

Pulled By: malfet

fbshipit-source-id: 11c04a92dbba48592505f704d77222cf92a81055
2020-08-04 15:15:15 -07:00
Zhicheng Chen
b56db305cf Improve the documentation of DistributedDataParallel (#42471)
Summary:

The statement 'gradients from each node are averaged' in the documentation of DistributedDataParallel is not clear. Many people, including me, have had a completely wrong understanding of this part. I add a note to the documentation to make it more straightforward and more user friendly.

Here is some toy code to illustrate my point:

* non-DistributedDataParallel version
    ```python
    import torch
    import torch.nn as nn

    x = torch.tensor([-1, 2, -3, 4], dtype=torch.float).view(-1, 1)
    print("input:", x)

    model = nn.Linear(in_features=1, out_features=1, bias=False)
    model.weight.data.zero_()
    model.weight.data.add_(1.0)

    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()

    y = model(x)

    label = torch.zeros(4, 1, dtype=torch.float)
    loss = torch.sum((y - label)**2)

    loss.backward()
    opti.step()

    print("grad:", model.weight.grad)
    print("updated weight:\n", model.weight)

    # OUTPUT
    # $ python test.py
    # input: tensor([[-1.],
    #         [ 2.],
    #         [-3.],
    #         [ 4.]])
    # grad: tensor([[60.]])
    # updated weight:
    #  Parameter containing:
    # tensor([[0.9400]], requires_grad=True)
    ```

* DistributedDataParallel version
    ```python
    import os
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    from torch.multiprocessing import Process

    def run(rank, size):
        x = torch.tensor([-(1 + 2 * rank), 2 + 2 * rank], dtype=torch.float).view(-1, 1)
        print("input:", x)

        model = nn.Linear(in_features=1, out_features=1, bias=False)
        model.weight.data.zero_()
        model.weight.data.add_(1.0)
        model = torch.nn.parallel.DistributedDataParallel(model)

        opti = torch.optim.SGD(model.parameters(), lr=0.001)
        opti.zero_grad()

        y = model(x)

        label = torch.zeros(2, 1, dtype=torch.float)
        loss = torch.sum((y.view(-1, 1) - label)**2)

        loss.backward()
        opti.step()

        if rank == 0:
            print("grad:", model.module.weight.grad)
            print("updated weight:\n", model.module.weight)

    def init_process(rank, size, fn, backend="gloo"):
        os.environ['MASTER_ADDR'] = '127.0.0.1'
        os.environ['MASTER_PORT'] = '29500'
        dist.init_process_group(backend, rank=rank, world_size=size)
        fn(rank, size)

    if __name__ == "__main__":
        size = 2
        process = []
        for rank in range(size):
            p = Process(target=init_process, args=(rank, size, run))
            p.start()
            process.append(p)

        for p in process:
            p.join()

    # OUTPUT
    # $ python test_d.py
    # input: tensor([[-3.],
    #         [ 4.]])input: tensor([[-1.],
    #         [ 2.]])

    # grad: tensor([[30.]])
    # updated weight:
    #  Parameter containing:
    # tensor([[0.9700]], requires_grad=True)
    ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42471

Reviewed By: glaringlee

Differential Revision: D22923340

Pulled By: mrshenli

fbshipit-source-id: 40b8c8ba63a243f857cd5976badbf7377253ba82
2020-08-04 08:36:42 -07:00
Yanli Zhao
79cfd85987 grad detach_ only when it has grad_fn in zero_grad call (#41283)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41283

In optimizer.zero_grad(), detach_ is useful for avoiding a memory leak only when the grad has a grad_fn, so add a check so that grad.detach_ is called only when the grad has a grad_fn in the zero_grad() function
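The check can be sketched with stand-in objects (no torch; `FakeGrad` is purely illustrative):

```python
# detach_ is only called when the grad actually has a grad_fn,
# i.e. when it is still attached to an autograd graph.
class FakeGrad:
    def __init__(self, grad_fn):
        self.grad_fn = grad_fn
        self.detached = False

    def detach_(self):
        self.detached = True

def zero_grad_detach(grad):
    if grad.grad_fn is not None:
        grad.detach_()   # break the graph reference only when needed

leafy = FakeGrad(grad_fn=None)       # plain leaf grad: left alone
viewy = FakeGrad(grad_fn=object())   # graph-attached grad: detached
zero_grad_detach(leafy)
zero_grad_detach(viewy)
assert not leafy.detached and viewy.detached
```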
ghstack-source-id: 108702289

Test Plan: unit test

Reviewed By: mrshenli

Differential Revision: D22487315

fbshipit-source-id: 861909b15c8497f1da57f092d8963d4920c85e38
2020-07-29 11:40:13 -07:00
Jongsoo Park
73ff252913 Back out "[NCCL] DDP communication hook: getFuture()" (#42152)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42152

Original commit changeset: 8c059745261d

Test Plan: .

Reviewed By: ajtulloch, jianyuh

Differential Revision: D22786183

fbshipit-source-id: 51155389d37dc82ccb4d2fa20d350f9d14abeaca
2020-07-28 10:05:35 -07:00
Shen Li
c76fada4a8 Let DDP.train() return self to stay consistent with nn.Module (#42131)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42131

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D22775311

Pulled By: mrshenli

fbshipit-source-id: ac9e6cf8b2381036a2b6064bd029dca361a81777
2020-07-27 18:22:13 -07:00
Sinan Nasir
d904ea5972 [NCCL] DDP communication hook: getFuture() (#41596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41596

We've modified the previous design of the `convert_dist_work_to_future` API from GH issue [#39272](https://github.com/pytorch/pytorch/issues/39272).

1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object.
2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`.
3. This API will only be supported by NCCL in the first version; the default implementation will throw UnsupportedOperation.
4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function.

`cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr).
ghstack-source-id: 108409748

Test Plan:
Run old  python test/distributed/test_c10d.py.
Some additional tests:
`test_ddp_comm_hook_allreduce_hook_nccl`: This unit test verifies whether a DDP communication hook that just calls allreduce gives the same result as the case with no hook registered.  Without the then callback, the future_value in the reducer is no longer a PyObject, and this unit test verifies that future_value is properly checked.
`test_ddp_comm_hook_allreduce_then_mult_ten_hook_nccl`: This unit test verifies whether a DDP communication hook that calls allreduce and then multiplies the result by ten gives the expected result.

As of v10:
```
........................s.....s.....................................................s...............................
----------------------------------------------------------------------
Ran 116 tests

OK (skipped=3)
```
`flow-cli` performance validation using a stacked diff where `bucket.work` is completely replaced with `bucket.future_work` in `reducer`. See PR [#41840](https://github.com/pytorch/pytorch/pull/41840) [D22660198](https://www.internalfb.com/intern/diff/D22660198/).

Reviewed By: izdeby

Differential Revision: D22583690

fbshipit-source-id: 8c059745261d68d543eaf21a5700e64826e8d94a
2020-07-24 11:22:44 -07:00
Sinan Nasir
d5ae4a07ef DDP Communication Hook Main Structure (#40848)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40848

Sub-tasks 1 and 2 of [39272](https://github.com/pytorch/pytorch/issues/39272)
ghstack-source-id: 107787878

Test Plan:
1\. Perf tests to validate that the new code (if-conditions before `allreduce`) doesn't slow down today's DDP. Execute the following command with the diff patched/unpatched (with V25):

* **Unpatched Runs:**
```
hg checkout D22514243
flow-cli canary pytorch.benchmark.main.workflow --parameters-json '{"model_arch": "resnet50", "batch_size": 32, "world_size": 1, "use_fp16": false, "print_percentile": true, "backend": "gloo"}' --entitlement pytorch_ftw_gpu --name test_torchelastic_gloo_masterD22514243 --run-as-secure-group pytorch_distributed
```
* **Run 1 (unpatched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 2 mins 59 s
f204539235
```
sum:
8 GPUs: p25:  0.156   205/s  p50:  0.160   200/s  p75:  0.164   194/s  p90:  0.169   189/s  p95:  0.173   185/s
fwds:
8 GPUs: p25:  0.032  1011/s  p50:  0.032  1006/s  p75:  0.032  1000/s  p90:  0.032   992/s  p95:  0.033   984/s
bwds:
8 GPUs: p25:  0.121   265/s  p50:  0.125   256/s  p75:  0.129   248/s  p90:  0.134   239/s  p95:  0.137   232/s
opts:
8 GPUs: p25:  0.003  11840/s  p50:  0.003  11550/s  p75:  0.004  8037/s  p90:  0.006  5633/s  p95:  0.007  4631/s
```
* **Run 2 (unpatched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 3 mins 1 s
f204683840
```
sum:
8 GPUs: p25:  0.145   220/s  p50:  0.147   217/s  p75:  0.150   213/s  p90:  0.154   207/s  p95:  0.157   204/s
fwds:
8 GPUs: p25:  0.032  1015/s  p50:  0.032  1009/s  p75:  0.032  1002/s  p90:  0.032   994/s  p95:  0.032   990/s
bwds:
8 GPUs: p25:  0.107   297/s  p50:  0.111   288/s  p75:  0.115   278/s  p90:  0.119   268/s  p95:  0.122   262/s
opts:
8 GPUs: p25:  0.003  11719/s  p50:  0.004  9026/s  p75:  0.006  5160/s  p90:  0.009  3700/s  p95:  0.010  3184/s
```

* **Patched Runs:**
```
hg checkout D22328310
flow-cli canary pytorch.benchmark.main.workflow --parameters-json '{"model_arch": "resnet50", "batch_size": 32, "world_size": 1, "use_fp16": false, "print_percentile": true, "backend": "gloo"}' --entitlement pytorch_ftw_gpu --name test_torchelastic_gloo_localD22328310 --run-as-secure-group pytorch_distributed
```
* **Run 1 (patched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 3 mins 30 s
f204544541
```
sum:
8 GPUs: p25:  0.148   216/s  p50:  0.152   210/s  p75:  0.156   205/s  p90:  0.160   200/s  p95:  0.163   196/s
fwds:
8 GPUs: p25:  0.032  1011/s  p50:  0.032  1005/s  p75:  0.032   999/s  p90:  0.032   991/s  p95:  0.033   984/s
bwds:
8 GPUs: p25:  0.112   286/s  p50:  0.116   275/s  p75:  0.120   265/s  p90:  0.125   256/s  p95:  0.128   250/s
opts:
8 GPUs: p25:  0.003  11823/s  p50:  0.003  10948/s  p75:  0.004  7225/s  p90:  0.007  4905/s  p95:  0.008  3873/s
```
* **Run 2 (patched):** `elastic_gang:benchmark_single.elastic_operator`
Ran for 3 mins 14 s
f204684520
```
sum:
8 GPUs: p25:  0.146   219/s  p50:  0.147   217/s  p75:  0.150   214/s  p90:  0.152   210/s  p95:  0.153   208/s
fwds:
8 GPUs: p25:  0.032  1013/s  p50:  0.032  1008/s  p75:  0.032  1002/s  p90:  0.032   996/s  p95:  0.032   990/s
bwds:
8 GPUs: p25:  0.107   299/s  p50:  0.110   290/s  p75:  0.114   280/s  p90:  0.117   274/s  p95:  0.119   269/s
opts:
8 GPUs: p25:  0.003  11057/s  p50:  0.005  6490/s  p75:  0.008  4110/s  p90:  0.010  3309/s  p95:  0.010  3103/s
```
* **Run 3 (patched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 2 mins 54 s
f204692872
```
sum:
8 GPUs: p25:  0.145   220/s  p50:  0.147   217/s  p75:  0.150   213/s  p90:  0.154   207/s  p95:  0.156   204/s
fwds:
8 GPUs: p25:  0.032  1001/s  p50:  0.032   995/s  p75:  0.032   988/s  p90:  0.033   980/s  p95:  0.033   973/s
bwds:
8 GPUs: p25:  0.108   295/s  p50:  0.111   287/s  p75:  0.114   280/s  p90:  0.119   269/s  p95:  0.121   264/s
opts:
8 GPUs: p25:  0.003  11706/s  p50:  0.003  9257/s  p75:  0.005  6333/s  p90:  0.008  4242/s  p95:  0.009  3554/s
```

* **Memory:**
   * Unpatched:
```
CUDA Memory Summary After first iteration:
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| Active memory         |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    3490 MB |    3490 MB |    3490 MB |       0 B  |
|       from large pool |    3434 MB |    3434 MB |    3434 MB |       0 B  |
|       from small pool |      56 MB |      56 MB |      56 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |  315332 KB |  343472 KB |    2295 MB |    1987 MB |
|       from large pool |  311166 KB |  340158 KB |    2239 MB |    1935 MB |
|       from small pool |    4166 KB |    4334 KB |      56 MB |      52 MB |
|---------------------------------------------------------------------------|
| Allocations           |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| Active allocs         |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| GPU reserved segments |     102    |     102    |     102    |       0    |
|       from large pool |      74    |      74    |      74    |       0    |
|       from small pool |      28    |      28    |      28    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      34    |      54    |     430    |     396    |
|       from large pool |      15    |      48    |     208    |     193    |
|       from small pool |      19    |      19    |     222    |     203    |
|===========================================================================|

```
   * Patched:
```
CUDA Memory Summary After first iteration:
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| Active memory         |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    3490 MB |    3490 MB |    3490 MB |       0 B  |
|       from large pool |    3434 MB |    3434 MB |    3434 MB |       0 B  |
|       from small pool |      56 MB |      56 MB |      56 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |  315332 KB |  343472 KB |    2295 MB |    1987 MB |
|       from large pool |  311166 KB |  340158 KB |    2239 MB |    1935 MB |
|       from small pool |    4166 KB |    4334 KB |      56 MB |      52 MB |
|---------------------------------------------------------------------------|
| Allocations           |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| Active allocs         |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| GPU reserved segments |     102    |     102    |     102    |       0    |
|       from large pool |      74    |      74    |      74    |       0    |
|       from small pool |      28    |      28    |      28    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      34    |      54    |     431    |     397    |
|       from large pool |      15    |      48    |     208    |     193    |
|       from small pool |      19    |      19    |     223    |     204    |
|===========================================================================|

```

2\. As of v18: `python test/distributed/test_c10d.py`
```
....................s.....s.....................................................s................................
----------------------------------------------------------------------
Ran 114 tests in 215.983s

OK (skipped=3)

```

3\. Additional tests in `python test/distributed/test_c10d.py`:
* `test_ddp_comm_hook_future_passing_cpu`: This unit test verifies whether the Future object is passed properly. The callback function creates a Future object and sets a value to it.
* `_test_ddp_comm_hook_future_passing_gpu`: This unit test verifies whether the Future object is passed properly. The callback function creates a Future object and sets a value to it.
* `test_ddp_comm_hook_future_passing_gpu_gloo`: This unit test executes _test_ddp_comm_hook_future_passing_gpu using gloo backend.
* `test_ddp_comm_hook_future_passing_gpu_nccl`: This unit test executes _test_ddp_comm_hook_future_passing_gpu using nccl backend.
* `test_ddp_invalid_comm_hook_init`: This unit test makes sure that register_comm_hook properly checks the format of the hook defined by the user. The Python hook must be callable. This test also checks whether the bucket annotation is validated properly if defined.
* `test_ddp_invalid_comm_hook_return_type`: This test checks whether the return annotation is validated properly if defined. It also checks whether an internal error is thrown if the return type is incorrect and the user hasn't specified any return type annotation.
* `test_ddp_comm_hook_register_just_once`: DDP communication hook can only be registered once. This test validates whether the error is thrown properly when register_comm_hook is called more than once.
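The validation the last three tests exercise can be sketched in plain Python. This is a hypothetical helper (not DDP's actual C++/Python implementation), with stand-in `_FakeGradBucket`/`_FakeFuture` types, that enforces callability and checks any annotations the user declared:

```python
import inspect

class _FakeGradBucket:
    """Stand-in for the bucket type a DDP comm hook receives (hypothetical)."""
    pass

class _FakeFuture:
    """Stand-in for the Future a DDP comm hook must return (hypothetical)."""
    pass

def check_comm_hook(hook):
    """Validate a user-supplied comm hook: it must be callable, and any
    type annotations it declares must match the expected bucket/Future types."""
    if not callable(hook):
        raise TypeError("Communication hook must be callable.")
    sig = inspect.signature(hook)
    params = list(sig.parameters.values())
    # Check the bucket parameter annotation, but only if the user provided one.
    if len(params) >= 2 and params[1].annotation not in (
        inspect.Parameter.empty, _FakeGradBucket
    ):
        raise ValueError("Comm hook: bucket annotation should be _FakeGradBucket.")
    # Check the return annotation, but only if the user provided one.
    if sig.return_annotation not in (inspect.Signature.empty, _FakeFuture):
        raise ValueError("Comm hook: return annotation should be _FakeFuture.")

def good_hook(state, bucket: _FakeGradBucket) -> _FakeFuture:
    return _FakeFuture()

def bad_hook(state, bucket: int) -> _FakeFuture:
    return _FakeFuture()

check_comm_hook(good_hook)       # passes silently
try:
    check_comm_hook(bad_hook)    # wrong bucket annotation
except ValueError as e:
    print("rejected:", e)
try:
    check_comm_hook("not callable")
except TypeError as e:
    print("rejected:", e)
```

A hook with no annotations at all also passes, matching the "checked properly if defined" behavior the tests describe.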

Reviewed By: ezyang

Differential Revision: D22328310

fbshipit-source-id: 77a6a71808e7b6e947795cb3fcc68c8c8f024549
2020-07-15 11:25:29 -07:00
Yi Huang (PyTorch)
4196605776 helper function to print out all DDP-relevant env vars (#41297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41297

GH issue: https://github.com/pytorch/pytorch/issues/40105

Add a helper function to DDP to print out all relevant env vars for debugging

Test Plan:
test through unittest, example output:
 ---
env:RANK=3
env:LOCAL_RANK=N/A
env:WORLD_SIZE=N/A
env:MASTER_PORT=N/A
env:MASTER_ADDR=N/A
env:CUDA_VISIBLE_DEVICES=N/A
env:GLOO_SOCKET_IFNAME=N/A
env:GLOO_DEVICE_TRANSPORT=N/A
env:NCCL_SOCKET_IFNAME=N/A
env:NCCL_BLOCKING_WAIT=N/A
...
 ---
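A minimal sketch of such a helper (function and list names are illustrative, not DDP's actual internals) that produces the `env:NAME=value` lines shown above, with `N/A` for anything unset:

```python
import os

# Env vars that commonly matter for DDP debugging (subset, for illustration).
_RELEVANT_ENV_VARS = [
    "RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_PORT", "MASTER_ADDR",
    "CUDA_VISIBLE_DEVICES", "GLOO_SOCKET_IFNAME", "GLOO_DEVICE_TRANSPORT",
    "NCCL_SOCKET_IFNAME", "NCCL_BLOCKING_WAIT",
]

def dump_ddp_env_vars(env=os.environ):
    """Return 'env:NAME=value' lines, substituting N/A for unset variables."""
    return ["env:%s=%s" % (var, env.get(var, "N/A")) for var in _RELEVANT_ENV_VARS]

print("\n".join(dump_ddp_env_vars({"RANK": "3"})))
```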

Reviewed By: mrshenli

Differential Revision: D22490486

fbshipit-source-id: 5dc7d2a18111e5a5a12a1b724d90eda5d35acd1c
2020-07-13 14:03:04 -07:00
Shen Li
0edbe6b063 Add a link in RPC doc page to point to PT Distributed overview (#41108)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41108

Test Plan: Imported from OSS

Differential Revision: D22440751

Pulled By: mrshenli

fbshipit-source-id: 9e7b002091a3161ae385fdfcc26484ae8fc243bb
2020-07-08 14:00:05 -07:00
chengjun
8d570bc708 Decouple DataParallel/DistributedDataParallel from CUDA (#38454)
Summary:
Decouple DataParallel/DistributedDataParallel from CUDA to support more device types.
- Move torch/cuda/comm.py to torch/nn/parallel/comm.py with minor changes for common devices support. Torch.cuda.comm is kept as is for backward compatibility
- Provide common APIs to arbitrary device types without changing existing CUDA APIs in torch.cuda space.
- Replace the torch.cuda calls in DataParellel/DistributedDataParallel with the new APIs.

Related RFC: [https://github.com/pytorch/pytorch/issues/36160](https://github.com/pytorch/pytorch/issues/36160)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38454

Differential Revision: D22051557

Pulled By: mrshenli

fbshipit-source-id: 7842dad0e5d3ca0f6fb760bda49182dcf6653af8
2020-07-07 12:48:16 -07:00
Sinan Nasir
15864d1703 Skip allreducing local_used_maps_dev_ when find_unused_param=False
Summary:
1. In reducer.cpp, we have a new boolean `find_unused_param_` and its value is set in `Reducer::prepare_for_backward`.
If `!find_unused_param_`, then it avoids `allreduce(local_used_maps_dev_)`.
2. Solves issue [38942](https://github.com/pytorch/pytorch/issues/38942).
3. Fixes incorrect `find_unused_parameters_` checks, such as relying on `outputs.empty()` or `unused_parameters_.empty()`.
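The control flow can be sketched in pure Python (a toy model of the reducer, not the actual C++ code): the flag recorded in `prepare_for_backward` gates the extra collective.

```python
class ToyReducer:
    """Toy sketch of the used-map logic: the allreduce of the locally-used
    map only happens when unused-parameter detection is enabled."""
    def __init__(self, find_unused_parameters):
        self.find_unused_parameters = find_unused_parameters
        self.allreduce_calls = 0  # stands in for allreduce(local_used_maps_dev_)

    def prepare_for_backward(self):
        # In the real reducer, find_unused_param_ is set here.
        self.find_unused_param_ = self.find_unused_parameters

    def finalize_backward(self):
        if self.find_unused_param_:
            self.allreduce_calls += 1  # allreduce(local_used_maps_dev_)

r = ToyReducer(find_unused_parameters=False)
r.prepare_for_backward(); r.finalize_backward()
print(r.allreduce_calls)  # → 0: the extra allreduce is skipped
```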

ghstack-source-id: 106693089

Test Plan:
1. Run `test/distributed/test_c10d.py` and make sure all tests pass.
2. A new test case `test_find_unused_parameters_when_unused_parameters_empty` is included. Old `reducer.cpp` was failing in that unit test because it was checking `find_unused_parameters_` by `unused_parameters_.empty()`. Current `reducer.cpp` passes this unit test.
3. Two test cases, `test_forward_backward_unused_parameters` and `test_forward_backward_optimizer`, were failing because `find_unused_parameter_` of their `reducer` object was not set properly. I fixed that as well.

Imported from OSS

**Output of version 14:**
```
................s.....s...............................................test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
.test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
.....s...............................
----------------------------------------------------------------------
Ran 108 tests in 214.210s

OK (skipped=3)
```

Differential Revision: D22176231

fbshipit-source-id: b5d15f034e13a0915a474737779cc5aa8e068836
2020-06-26 19:20:59 -07:00
Michael Carilli
8066fba226 [RELAND2] Change AccumulateGrad to yield .grads that match weights' memory layout (#40358)
Summary:
https://github.com/pytorch/pytorch/pull/40129 fixed the error responsible for the first revert, but exposed another error in the same test.

This PR is intended as the "master copy" for merge, and it runs on full CI.
Two other PRs (restricted to run on a small subset of CI) support debugging DDP failures/hangs with multiple devices per process (`test_c10d.py:DistributedDataParallelTest.test_grad_layout_1devicemodule_2replicaperprocess`):
- https://github.com/pytorch/pytorch/pull/40290 tries the test with purely rowmajor contiguous params on an untouched master.  In other words https://github.com/pytorch/pytorch/pull/40290 contains none of this PR's diffs aside from the test itself.
- https://github.com/pytorch/pytorch/pull/40178, for comparison, tries the test with this PR's diffs.

Both fail the same way, indicating failure is unrelated to this PR's other diffs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40358

Differential Revision: D22165785

Pulled By: albanD

fbshipit-source-id: ac7cdd79af5c080ab74341671392dca8e717554e
2020-06-22 17:13:21 -07:00
Shen Li
30364f0b01 Remove obsolete warning message from DDP (#40190)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40190

Fixed by #36503

Test Plan: Imported from OSS

Differential Revision: D22101516

Pulled By: mrshenli

fbshipit-source-id: 9abd6dce602530c11b7fe623ac0f4d556dccc961
2020-06-17 17:58:21 -07:00
Alban Desmaison
08227fea4f Revert D22079377: [pytorch][PR] [RELAND] Change AccumulateGrad to yield .grads that match weights' memory layout
Test Plan: revert-hammer

Differential Revision:
D22079377

Original commit changeset: 9bd2b7e0c34f

fbshipit-source-id: c22cc349d790caa574eace0d63980854c33e5a59
2020-06-17 10:17:27 -07:00
Michael Carilli
1ec8ece2b9 [RELAND] Change AccumulateGrad to yield .grads that match weights' memory layout (#40129)
Summary:
https://github.com/pytorch/pytorch/pull/34904 was reverted because it had a misconfigured 4 GPU test that for some reason wasn't caught by external CI ([example failure](https://app.circleci.com/pipelines/github/pytorch/pytorch/181719/workflows/cfb37cd9-9a0c-4738-898b-d683934cd308/jobs/5868948/steps)).

This PR reverts the revert, and adds diffs that should repair the misconfigured test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40129

Differential Revision: D22079377

Pulled By: albanD

fbshipit-source-id: 9bd2b7e0c34fdaf887497b52037cfe82cba709c1
2020-06-17 09:02:54 -07:00
Pritam Damania
15823ac6d5 Enhance DDP docstrings for DDP + RPC support. (#39916)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39916

ghstack-source-id: 105999275

Test Plan: waitforbuildbot

Differential Revision: D22013190

fbshipit-source-id: be3bb12b2281579610581b809c822ab6b027fa71
2020-06-16 20:05:13 -07:00
Alban Desmaison
f1e575a0bf Revert D20496044: [pytorch][PR] Change AccumulateGrad to yield .grads that match weights' memory layout
Test Plan: revert-hammer

Differential Revision:
D20496044

Original commit changeset: 248d680f4b1b

fbshipit-source-id: 6462b25e3fb9c8596c1da443389089f09c32df4d
2020-06-16 10:38:40 -07:00
Michael Carilli
2beb9690c3 Change AccumulateGrad to yield .grads that match weights' memory layout (#34904)
Summary:
Currently, whether `AccumulateGrad`  [steals](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L42)) or [clones](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L80)) an incoming gradient, the gradient ends up rowmajor contiguous, regardless of its param's layout.  If the param's layout is channels last, or otherwise not rowmajor contiguous, later kernels that apply gradients to params are forced into an uncoalesced memory access pattern for either the param or the gradient.  This may not sound like a big deal but for any binary op on large tensors it's a >3X increase in gmem traffic => 3X slowdown.

The present PR changes `AccumulateGrad` to prefer, where possible, stashing gradients that match their params' layouts (["Gradient Layout Contract"](https://github.com/pytorch/pytorch/pull/34904/files#diff-ef1a56d24f66b280dcdb401502d6a796R29-R38)).

Allowing `AccumulateGrad` to stash non-rowmajor-contiguous grads means DDP allreduces and DP reduces must allow non-rowmajor-contiguous grads.  This PR extends DDP and DP to allow gradients with non-rowmajor-contiguous strides as long as their layout is nonoverlapping and dense.

For good measure, I include changes that allow all five nccl primitives (allreduce, reduce, broadcast, allgather, reducescatter) to act on non-rowmajor-contiguous tensors (again as long as each input's layout is nonoverlapping and dense, and as long as all tensors participating in a given collective have the same layout).  The primitive comm changes aren't necessary to enable the DDP changes, but I wasn't sure this would end up true until I had written both sets of changes.  I think primitive comm enablement is reasonable to keep in the PR, especially since the code for it is simple.

Channels last params will be a major beneficiary of this PR, but I don't see it as channels-last-specific fix.  The spirit is layout matching in general:
- Grads should be stashed with memory layouts matching their params.
- Src and dst tensors on opposite ends of collectives should have matching dense layouts.

This PR also updates autograd docs to describe potential BC-breaking changes below.

## BC notes
ngimel albanD gchanan

#### BC-breaking
In the common case where the user lets AccumulateGrad decide grad layouts, strides for grads of dense but non-rowmajor-contiguous params will change.  Any user code that was accustomed to `view(-1)`ing these grads will break.

Also, the circumstances under which a grad can be stolen directly from the backward function that created it, as opposed to deep-copied by AccumulateGrad, have changed.  In most cases we expect silent performance improvement, because we expect channels-last-aware backward kernels will create channels last gradients for channels last params.  Now those can be stolen, whereas before this PR they were cloned and made rowmajor contiguous.  IMO this is a mild BC breakage.  Param backward hooks still see grads come in with whatever format the backward kernel gave them.  The only BC breakage potential I see is if user code relies somehow on a grad in a hook having or not having the same deep memory as the eventual `param.grad`.  Any such users hopefully know they're off the edge of the map and understand how to update their expectations.

#### BC escape hatches
At alband's recommendation, this PR's changes to AccumulateGrad do not alter the pre-PR code's decisions about whether grad is accumulated in or out of place.  Accumulations of new grads onto an existing `.grad` attribute were (usually) in-place before this PR and remain in-place after this PR, keeping the existing `.grad`'s layout.  After this PR, if the user wants to force accumulation into a grad with a particular layout, they can preset `param.grad` to a zeroed tensor with the desired strides or call `grad.contiguous(desired format)`.  This likely won't be as performant as letting AccumulateGrad establish grad layouts by cloning or stealing grads with contract-compliant strides, but at least users have a control point.

One limitation (present before this PR and unchanged by this PR):  Presetting `param.grad` does not ensure in-place accumulation all the time.  For example, if `create_graph=True`, or if incoming `new_grad` is dense and existing `variable_grad` is sparse, accumulation occurs out of place, and the out-of-place result may not match the existing grad's strides.
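The layout-matching decision can be illustrated with plain stride arithmetic (a sketch, not the actual AccumulateGrad code): a grad complies with the contract when its strides equal the param's strides.

```python
def contiguous_strides(shape):
    """Row-major (NCHW) strides for a given shape."""
    strides, acc = [], 1
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return tuple(reversed(strides))

def channels_last_strides(shape):
    """Channels-last (NHWC-in-memory) strides for a 4-d NCHW shape."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)

def grad_matches_param_layout(param_strides, grad_strides):
    """Gradient Layout Contract (sketch): stash grads whose strides match the param."""
    return param_strides == grad_strides

shape = (2, 3, 4, 5)
cl = channels_last_strides(shape)  # (60, 1, 15, 3)
rm = contiguous_strides(shape)     # (60, 20, 5, 1)
# A channels-last param wants a channels-last grad, not a rowmajor one:
print(grad_matches_param_layout(cl, cl), grad_matches_param_layout(cl, rm))  # → True False
```

Before this PR, the rowmajor-contiguous strides on the right were forced on every grad; after it, the matching strides on the left are preferred.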

----------------------------
I also noticed some potential DDP improvements that I considered out of scope but want to mention for visibility:
1. make sure Reducer's ops sync with AccumulateGrad streams
2. ~to reduce CPU overhead and incur fewer kernel launches, lazily create flat `contents` tensors by a single `cat` kernel only when a bucket is full, instead of `copy_`ing grads into `contents` individually as soon as they are received.~  PR includes a [minor change](https://github.com/pytorch/pytorch/pull/34904/files#diff-c269190a925a4b0df49eda8a8f6c5bd3R312-R315) to divide grads while copying them into flat buffers, instead of copying them in, then dividing separately.  Without cat+div fusion, div-while-copying is the best we can do.
3. https://github.com/pytorch/pytorch/issues/38942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34904

Differential Revision: D20496044

Pulled By: albanD

fbshipit-source-id: 248d680f4b1bf77b0a986451844ec6e254469217
2020-06-16 08:43:31 -07:00
Yanli Zhao
b98948e6dd implement dynamic bucket order in DDP (#35137)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35137

bucket order is rebuilt dynamically in the first reduction backward pass when find_unused_parameters = false
ghstack-source-id: 104794018

Test Plan: unit test

Differential Revision: D20128537

fbshipit-source-id: fad73de965cdcb59a51c0a12b248271344584b9f
2020-05-28 12:59:52 -07:00
Shen Li
8d6a8d2b3f Fix DDP bug in single process multiple device use cases (#36503)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36503

Test Plan: Imported from OSS

Differential Revision: D21179274

Pulled By: mrshenli

fbshipit-source-id: 0afce30ae0ddda753d1e240584a0f80df9aec4c2
2020-04-22 15:06:28 -07:00
Shen Li
5afd816793 Add a warning for Single-Process Multi-GPU DDP (#36656)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36656

Test Plan: Imported from OSS

Differential Revision: D21042537

Pulled By: mrshenli

fbshipit-source-id: fa3501dc2bba14550ec4f254612a80f61fe86a4a
2020-04-15 12:43:50 -07:00
Xiang Gao
df8d6eeb19 Update docs about DP and DDP for CUDA (#35063)
Summary:
We should recommend DDP instead of DP. Hope we can also cherry-pick this for 1.5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35063

Differential Revision: D20549621

Pulled By: ngimel

fbshipit-source-id: 86b1b2134664065cc6070ea4212895f993eaf543
2020-03-20 20:06:37 -07:00
danthe3rd
46539eee03 Ensure that DDP wrapped module has parameters that require gradients (#25858)
Summary:
…ent - see https://github.com/pytorch/pytorch/issues/25552

**TEST PLAN**
```
python test/run_test.py -f distributed
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25858

Differential Revision: D17687542

Pulled By: danthe3rd

fbshipit-source-id: 11bfe4142e72bb21382b30379fe10e60418c7ec9
2019-10-01 09:03:52 -07:00
Karl Ostmo
ef6356133e Revert D16428208: [pytorch][PR] only scatter in forward if multi-device per process
Differential Revision:
D16428208

Original commit changeset: eaa3876b2b95

fbshipit-source-id: 9db3bc86bf419dd06fdaaff434f72b92ecb5a427
2019-07-27 22:41:20 -07:00
Adam Stooke
d6d7a5f075 only scatter in forward if multi-device per process (#22384)
Summary:
Scatter is unnecessary if only using one device, and it breaks on some custom data structures like namedtuple, so we'd like to avoid it :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22384

Differential Revision: D16428208

Pulled By: soumith

fbshipit-source-id: eaa3876b2b95c1006ccaaacdb62f54c5280e730c
2019-07-26 17:30:34 -07:00
Adam Paszke
f1775796dd Fix minor issues with #21736 (#22074)
Summary:
cc mrshenli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22074

Differential Revision: D15965376

Pulled By: mrshenli

fbshipit-source-id: 50ff96de6390817d8ea52c04322c6bee3d649b32
2019-06-24 15:18:26 -07:00
Pieter Noordhuis
77eda8de8e Support sparse gradients in DistributedDataParallel (#22037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22037

This adds support for sparse gradients to the reducer as well as to
the DistributedDataParallel wrapper. Note that an out of band signal
is needed whether or not a dense parameter (e.g. an embedding) is
expected to receive a sparse gradient or not. This information is
passed to the bucket assignment computation routine and the reducer as
a vector of booleans. Every parameter for which we expect a sparse
gradient is assigned its own bucket, as we cannot easily group
multiple unrelated sparse tensors.
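The assignment policy described above can be sketched in pure Python (names hypothetical; the real routine lives in C++): dense params are packed greedily into size-capped buckets, while every sparse-grad param gets a bucket of its own.

```python
def compute_bucket_assignment(param_sizes, expect_sparse, bucket_cap):
    """Group parameter indices into buckets. Dense params are packed greedily
    up to bucket_cap bytes; a param expecting a sparse gradient always gets
    its own bucket, since unrelated sparse tensors cannot be flattened
    together."""
    buckets, dense_bucket, dense_bytes = [], [], 0
    for i, (size, sparse) in enumerate(zip(param_sizes, expect_sparse)):
        if sparse:
            if dense_bucket:             # close out the pending dense bucket
                buckets.append(dense_bucket)
                dense_bucket, dense_bytes = [], 0
            buckets.append([i])          # sparse: a bucket of its own
            continue
        if dense_bucket and dense_bytes + size > bucket_cap:
            buckets.append(dense_bucket)
            dense_bucket, dense_bytes = [], 0
        dense_bucket.append(i)
        dense_bytes += size
    if dense_bucket:
        buckets.append(dense_bucket)
    return buckets

# Params 0, 1, 3 are dense; param 2 (an embedding, say) expects a sparse grad.
print(compute_bucket_assignment([100, 100, 400, 100],
                                [False, False, True, False],
                                bucket_cap=250))  # → [[0, 1], [2], [3]]
```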

Reviewed By: mrshenli

Differential Revision: D15926383

fbshipit-source-id: 39c0d5dbd95bf0534314fdf4d44b2385d5321aaf
2019-06-24 07:34:12 -07:00
Shen Li
08facca1a1 Support accumulating DDP grads using a context manager (#21736)
Summary:
The first attempt and more discussions are available in https://github.com/pytorch/pytorch/issues/19577

#### Goal

Allow toggling DDP gradient synchronization across iterations. With this feature, users may accumulate grads in module variables, and only kick off an expensive grad synchronization every few iterations.

#### Concerns

Our first attempt in https://github.com/pytorch/pytorch/issues/19577 tried to do it using a variable or a function. But apaszke made a good point that it would be error prone, and favored a context manager instead.

#### Proposed Solution

Instead of providing an `accumulate_grads` variable/function/context, we provide a `DistributedDataParallel.no_sync()` context manager. And it does exactly what the name suggests, i.e., it disables DDP grad synchronization within the context. Note that `accumulate_grads` means `no_sync` + no optimizer step, where the latter is not controlled by DDP.

It is true that users need to call another `model(input).backward()` after exiting the context, and this is indeed more verbose. But I think it is OK, as one major concern in the previous discussion was to prevent users from running into errors without knowing it. This API should reaffirm the expected behavior, and does not interfere with other use cases when accumulating grads is not required.

The application would then look like:

```python
with ddp.no_sync():
  for input in inputs:
    ddp(input).backward()

ddp(one_more_input).backward()
optimizer.step()
```
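The context manager itself can be sketched in a few lines of plain Python (a toy stand-in, not DDP's implementation): it just toggles a flag that the backward-time reduction logic consults.

```python
from contextlib import contextmanager

class ToyDDP:
    """Toy stand-in for DDP's sync toggle: require_backward_grad_sync gates
    whether a backward pass triggers gradient allreduce."""
    def __init__(self):
        self.require_backward_grad_sync = True
        self.syncs = 0  # counts simulated allreduces

    @contextmanager
    def no_sync(self):
        old = self.require_backward_grad_sync
        self.require_backward_grad_sync = False
        try:
            yield
        finally:
            self.require_backward_grad_sync = old

    def backward(self):
        if self.require_backward_grad_sync:
            self.syncs += 1  # gradient allreduce would fire here

ddp = ToyDDP()
with ddp.no_sync():
    for _ in range(3):      # three accumulation steps, no communication
        ddp.backward()
ddp.backward()              # one more step outside the context: syncs once
print(ddp.syncs)  # → 1
```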

chenyangyu1988 myleott
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21736

Differential Revision: D15805215

Pulled By: mrshenli

fbshipit-source-id: 73405797d1e39965c52016af5cf45b15525ce21c
2019-06-20 12:23:52 -07:00
Edward Yang
cb4c213f55 Revert D15007365: Support sparse gradients in DistributedDataParallel
Differential Revision:
D15007365

Original commit changeset: f298e83fd3ca

fbshipit-source-id: ef5e556d2df37f0c64652bd3563956afd8d9fd7f
2019-06-20 10:07:22 -07:00
Pieter Noordhuis
365de7bda1 Support sparse gradients in DistributedDataParallel (#19443)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19443

This adds support for sparse gradients to the reducer as well as to
the DistributedDataParallel wrapper. Note that an out of band signal
is needed whether or not a dense parameter (e.g. an embedding) is
expected to receive a sparse gradient or not. This information is
passed to the bucket assignment computation routine and the reducer as
a vector of booleans. Every parameter for which we expect a sparse
gradient is assigned its own bucket, as we cannot easily group
multiple unrelated sparse tensors.

Reviewed By: mrshenli

Differential Revision: D15007365

fbshipit-source-id: f298e83fd3ca828fae9e80739e1db89d045c99ac
2019-06-20 07:06:28 -07:00
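The bucketing rule described above (every parameter expecting a sparse gradient gets its own bucket; dense parameters are grouped) can be sketched in plain Python. `assign_buckets` is a hypothetical helper for illustration, not PyTorch's actual bucket-assignment routine:

```python
def assign_buckets(param_sizes, expect_sparse, bucket_cap):
    """Sketch: parameters expecting sparse gradients are isolated in
    their own bucket; dense parameters are greedily grouped under a
    size cap. Returns a list of buckets of parameter indices."""
    buckets = []
    current, current_size = [], 0
    for i, (size, sparse) in enumerate(zip(param_sizes, expect_sparse)):
        if sparse:
            buckets.append([i])  # sparse param: cannot be coalesced
            continue
        if current and current_size + size > bucket_cap:
            buckets.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        buckets.append(current)
    return buckets

# The embedding-like parameter at index 1 is isolated; dense params
# at indices 0 and 2 share a bucket until the cap is exceeded.
assert assign_buckets([4, 10, 4, 4],
                      [False, True, False, False],
                      bucket_cap=8) == [[1], [0, 2], [3]]
```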
Shen Li
fa4ca4e70e Emphasize all DDP forward() outputs must participate in computing loss (#20586)
Summary:
CC borguz chenyangyu1988
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20586

Reviewed By: ezyang

Differential Revision: D15373674

Pulled By: mrshenli

fbshipit-source-id: b986918b3592616a9bcc88fba1b8fd53016f68d7
2019-05-17 07:35:49 -07:00
Pieter Noordhuis
558c6c4d8a Make DistributedDataParallel usable with CPU models (#20236)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20236

Use the new version of broadcast_coalesced that deals with both CPU
and CUDA models. Add tests that evaluate correctness of
DistributedDataParallel for CPU models.

Closes #17757.

Reviewed By: mrshenli

Differential Revision: D15245428

fbshipit-source-id: d2fa09f68593b3cd1b72efeb13f5af23ebd5c80a
2019-05-09 14:11:17 -07:00
Pieter Noordhuis
5525c419fc Only call into reducer if torch.is_grad_enabled() (#19897)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19897

During validation, gradient reduction is not needed, and autograd is
never called. The model output will always be a detached tensor. After
the new reducer was merged, this meant that it would find all model
parameters unused and kick off reduction for them, including the case
(see #19799) where the output uses no parameters and it tries to kick
off reduction of zeroed gradients. Test for `torch.is_grad_enabled()`
and `self.training` before calling into the reducer.

Reviewed By: mrshenli

Differential Revision: D15118726

fbshipit-source-id: b0208f632a61cbe8110fa626fa427937b7f05924
2019-04-28 23:12:16 -07:00
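The guard described above — only call into the reducer when gradients are enabled and the module is training — can be sketched with hypothetical stand-in classes (`ToyReducer` and `GuardedModule` are illustrations, not PyTorch internals):

```python
class ToyReducer:
    """Counts how many times reduction was prepared."""
    def __init__(self):
        self.calls = 0

    def prepare_for_backward(self):
        self.calls += 1

class GuardedModule:
    """Sketch of the guard: skip the reducer during validation,
    i.e. when autograd is disabled or the module is in eval mode."""
    def __init__(self, grad_enabled=True, training=True):
        self.reducer = ToyReducer()
        self.grad_enabled = grad_enabled  # stand-in for torch.is_grad_enabled()
        self.training = training

    def forward(self):
        if self.grad_enabled and self.training:
            self.reducer.prepare_for_backward()

m = GuardedModule(grad_enabled=False)  # e.g. validation under no_grad
m.forward()
assert m.reducer.calls == 0            # reducer never invoked
```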
Shen Li
b695e562e5 Make find_unused_parameters in DDP default to False (#19895)
Summary:
As DDP in previous releases did not support unused params, we are turning off `find_unused_parameters` by default to derisk the new reducer.

CC pietern soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19895

Reviewed By: pietern

Differential Revision: D15118563

Pulled By: mrshenli

fbshipit-source-id: 6215c486e1dae3387b36011d8e64a2721ac85f58
2019-04-28 21:22:26 -07:00
Pieter Noordhuis
6325b6e44e Make finding unused model parameters optional (#19515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19515

This is still done by default, but can now be disabled by specifying
`find_unused_parameters=False`. There are use cases where finding
unused parameters results in erroneous behavior, because a subset of
model parameters is used *outside* the `forward` function. One can
argue that doing this is not a good idea, but we should not break
existing use cases without an escape hatch. This configuration
parameter is that escape hatch.

Reviewed By: bddppq

Differential Revision: D15016381

fbshipit-source-id: f2f86b60771b3801ab52776e62b5fd6748ddeed0
2019-04-19 17:23:36 -07:00
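A minimal sketch of the escape hatch described above: with `find_unused_parameters=False`, no parameter is treated as unused, even if it did not participate in `forward`. `unused_params` is a hypothetical helper for illustration only:

```python
def unused_params(all_params, used_params, find_unused):
    """Sketch: when the flag is off, DDP assumes every parameter
    participates and short-circuits reduction for none of them."""
    if not find_unused:
        return set()
    return set(all_params) - set(used_params)

# 'head' is used outside forward(); with the flag on it would be
# (wrongly) marked unused, the escape hatch avoids that.
assert unused_params({'w', 'b', 'head'}, {'w', 'b'}, True) == {'head'}
assert unused_params({'w', 'b', 'head'}, {'w', 'b'}, False) == set()
```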
Pieter Noordhuis
a5c4348d54 Recursively find tensors in DDP module output (#19360)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19360

We'll return the output object verbatim since it is a freeform object.
We need to find any tensors in this object, though, because we need to
figure out which parameters were used during this forward pass, to
ensure we short circuit reduction for any unused parameters.

Before this commit only lists were handled and the functionality went
untested. This commit adds support for dicts and recursive structures,
and also adds a test case.

Closes #19354.

Reviewed By: mrshenli

Differential Revision: D14978016

fbshipit-source-id: 4bb6999520871fb6a9e4561608afa64d55f4f3a8
2019-04-18 14:57:09 -07:00
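The recursive output traversal described above (find tensors in lists, dicts, and nested structures) can be sketched in plain Python. `find_tensors` is a hypothetical helper, and the `is_tensor` predicate stands in for the real tensor check:

```python
def find_tensors(obj, is_tensor):
    """Sketch: collect all 'tensors' from a freeform forward-pass
    output object, recursing into lists, tuples, and dicts."""
    if is_tensor(obj):
        return [obj]
    if isinstance(obj, (list, tuple)):
        return [t for item in obj for t in find_tensors(item, is_tensor)]
    if isinstance(obj, dict):
        return [t for v in obj.values() for t in find_tensors(v, is_tensor)]
    return []  # strings, ints, None, ... contain no tensors

# Floats stand in for tensors; the string value is ignored.
out = {"loss": 1.0, "aux": [2.0, {"feat": 3.0}], "name": "x"}
found = find_tensors(out, lambda o: isinstance(o, float))
assert sorted(found) == [1.0, 2.0, 3.0]
```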
Shen Li
6732358bf9 Allow DDP to wrap multi-GPU modules (#19271)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19271

Allow DDP to take multi-GPU models.

Reviewed By: pietern

Differential Revision: D14822375

fbshipit-source-id: 1eebfaa33371766d3129f0ac6f63a573332b2f1c
2019-04-17 21:21:54 -07:00
Pieter Noordhuis
a0263ec047 Make DistributedDataParallel use new reducer (#18953)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18953

This removes Python side bucketing code from DistributedDataParallel
and replaces it with calls to the new C++ based bucketing and reducing
code. To confirm this is working well, we ran a test with both the
previous implementation and the new implementation, and confirmed they
are numerically equivalent.

Performance is improved by a couple percent or more, including the
single machine multiple GPU runs.

Closes #13273.

Reviewed By: mrshenli

Differential Revision: D14580911

fbshipit-source-id: 44e76f8b0b7e58dd6c91644e3df4660ca2ee4ae2
2019-04-15 12:44:38 -07:00
Shen Li
168c0797c4 Remind users to set map_location properly when using DDP
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/19084

Differential Revision: D14861702

Pulled By: mrshenli

fbshipit-source-id: 10ca4a9b41e707050a6bce228ccca4177c9fa4a6
2019-04-09 16:29:38 -07:00
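The reminder above concerns the common pattern of remapping checkpoint storages to each process's own device when loading under DDP. A minimal sketch of building that mapping, where `map_location_for_rank` is a hypothetical helper and the returned dict is the kind of value one would pass as the `map_location` argument of `torch.load`:

```python
def map_location_for_rank(local_rank):
    """Sketch: a checkpoint saved on cuda:0 should be remapped so
    that each process loads it onto its own local device, e.g.
    torch.load(path, map_location=map_location_for_rank(rank))."""
    return {'cuda:0': f'cuda:{local_rank}'}

assert map_location_for_rank(2) == {'cuda:0': 'cuda:2'}
```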
Shen Li
5eb6a2be41 Avoid calling tensor.data.set_() in DDP
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/18961

Differential Revision: D14811208

Pulled By: mrshenli

fbshipit-source-id: c1c46dfa13e0a6ec83aefd35696ee31a7ea3d810
2019-04-09 14:18:24 -07:00
Edward Yang
173f224570 Turn on F401: Unused import warning. (#18598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**

This was requested by someone at Facebook; this lint is turned
on for Facebook by default.  "Sure, why not."

I had to noqa a number of imports in __init__.  Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it.  Left for future work.

Be careful!  flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments.  flake8-3 will
report an import unused; flake8-2 will not.  For now, I just
noqa'd all these sites.

All the changes were done by hand.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D14687478

fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
2019-03-30 09:01:17 -07:00
Elliot Waite
1e42720a77 Fix some typos in distributed.py.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17959

Differential Revision: D14437347

Pulled By: soumith

fbshipit-source-id: 4c33571f56e9da687666516a310f91924cddd4d9
2019-03-13 09:28:03 -07:00
jiej
39669316a6 (#14267)
Summary:
- Summary:

Added synchronized batch normalization, allows synchronization of stats across mini-batches between processes within a process group.
Current implementation uses a mixture of extended ATen native functions (cpp cuda extension) + torch.nn.modules (c10d python API)

- User-facing api:

1. torch.nn.utils.convert_sync_batchnorm(modules, process_group=None)

2. torch.nn.SyncBatchNorm(num_features, eps=1e-5, momentum=0.1, affine=True, track_running_stats=True, ***process_group=None***)

- supported use case:
DistributedDataParallel with ***single-gpu multi-process***

a. User creates model containing `torch.nn.SyncBatchNorm` layers through one of the ways listed below:

  1. use layers directly:

     torch.nn.SyncBatchNorm(...)

     similar API as with torch.nn.BatchNormXd(...)
     with added argument `process_group` which is used to limit the scope of
     synchronization within each process group. Default value is None, which
     implies synchronization across all GPUs

  2. use torch.nn.utils.convert_sync_batchnorm(modules, process_group)

     recursively convert all `torch.nn.BatchNormXd` into `torch.nn.SyncBatchNorm`
     preserving values of parameters/buffers.
     the utility function also allows user to specify process_group value to all
     converted layers.

b. user wraps their model with
   `torch.nn.parallel.DistributedDataParallel`; from this point, the user
   should follow the general guidelines in the DDP usage guide

- Error checking

For use cases not supported, we error out:

1. Application launched without ddp:
   > import torch
   > sbn = torch.nn.SyncBatchNorm(10).cuda()
   > inp = torch.randn(5, 10, 3, 3).cuda()
   > sbn(inp) --> Error!
   > AttributeError: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel

2. Application launched using DDP with multi-GPU per-process:
   > ddp_module = nn.parallel.DistributedDataParallel(module, device_ids=device_ids, output_device=args.local_rank)
   > ValueError: SyncBatchNorm is only supported for DDP with single GPU per process
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14267

Differential Revision: D14270035

Pulled By: ezyang

fbshipit-source-id: 4956d8fa565c32e9df5408d53719ff9f945f4d6d
2019-03-06 13:39:11 -08:00
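The recursive conversion utility described above can be sketched in plain Python with stand-in module classes. `BatchNorm2d`, `SyncBatchNorm`, and `Sequential` here are hypothetical stubs, and the copying of parameters/buffers that the real utility performs is omitted:

```python
class BatchNorm2d:       # stub for torch.nn.BatchNorm2d
    pass

class SyncBatchNorm:     # stub for torch.nn.SyncBatchNorm
    pass

class Sequential:        # stub container with child modules
    def __init__(self, *children):
        self.children = list(children)

def convert_sync_batchnorm(module):
    """Sketch of the recursive conversion: replace every BatchNorm
    layer with a SyncBatchNorm, descending into containers."""
    if isinstance(module, BatchNorm2d):
        return SyncBatchNorm()
    if isinstance(module, Sequential):
        module.children = [convert_sync_batchnorm(c) for c in module.children]
    return module

model = Sequential(BatchNorm2d(), Sequential(BatchNorm2d()))
model = convert_sync_batchnorm(model)
assert isinstance(model.children[0], SyncBatchNorm)
assert isinstance(model.children[1].children[0], SyncBatchNorm)
```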
ZhuBaohe
19a6de328f Correct docstring of vision/init functions
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17351

Differential Revision: D14276355

Pulled By: soumith

fbshipit-source-id: 9b572b6a04eeb1e44cd93961edac76ed10f7b24e
2019-03-01 11:40:23 -08:00
Derek Kim
4171ef3728 Enhance the documentation for DistributedDataParallel from torch.nn.parallel.distributed (#16010)
Summary:
- a typo fixed
- made the docs consistent with #5108

And maybe one more change is needed. According to the current docs
> The batch size should be larger than the number of GPUs used **locally**.

But shouldn't the batch size be larger than the number of GPUs used **either locally or remotely**? Sadly, I couldn't experiment with this on my single GPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16010

Differential Revision: D13709516

Pulled By: ezyang

fbshipit-source-id: e44459a602a8a834fd365fe46e4063e9e045d5ce
2019-01-17 01:02:44 -08:00
Teng Li
f56217af3b Doc improvement on DDP (#15440)
Summary:
I noticed that some users don't even know we have this support. Adding it to the doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15440

Differential Revision: D13531045

Pulled By: teng-li

fbshipit-source-id: 9757c400c0010608758c754df04e603b36035a10
2018-12-20 14:51:57 -08:00
Teng Li
2d3cf98b49 Making dist.get_default_group private for PT1 release (#14767)
Summary:
When I wrote the frontend API, it was designed so that users would not use the default group directly in any functions. It should really be private.

All collectives are supposed to either use group.WORLD, or anything that comes out of new_group. That was the initial design.

We need to make a TODO on removing group.WORLD one day. It exists for backward compatibility reasons and adds lots of complexity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14767

Reviewed By: pietern

Differential Revision: D13330655

Pulled By: teng-li

fbshipit-source-id: ace107e1c3a9b3910a300b22815a9e8096fafb1c
2018-12-04 19:22:24 -08:00
Teng Li
cac03280f9 Fixed DistributedDataParallel state pickling for multi-gpus (#14690)
Summary:
Fixed: https://github.com/pytorch/pytorch/issues/14678

This PR fixes the issue where DDP doesn't work after save() and load() with multiple GPUs, because all the replication logic and bucketing live in the constructor.

So I refactored some of the logic in the constructor into a helper function, which is also used for load().

Added test too. Tested on 8 GPU machines.

```
tengli@learnfair062:~/pytorch/test$ python run_test.py -i distributed --verbose
Test executor: ['/private/home/tengli/miniconda3/bin/python']
Selected tests: distributed
Running test_distributed ... [2018-12-02 18:33:55.833580]
/public/apps/openmpi/2.1.1/gcc.5.4.0/bin/mpiexec
Running distributed tests for the mpi backend
test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok

----------------------------------------------------------------------
Ran 68 tests in 6.315s

OK (skipped=15)
ok

----------------------------------------------------------------------
Ran 68 tests in 6.315s

OK (skipped=15)
ok

----------------------------------------------------------------------
Ran 68 tests in 6.315s

OK (skipped=15)
Running distributed tests for the mpi backend with file init_method
test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... test_Backend_enum_class (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_DistributedDataParallel (__main__.TestMPI) ... skipped 'Only Nccl & Gloo backend support DistributedDataParallel'
test_DistributedDataParallelCPU (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA all gather'
test_all_gather_full_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_group (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_gather_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports allgather multigpu'
test_all_reduce_full_group_max (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_min (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_product (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_full_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_max (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_min (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_product (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_group_sum (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_max (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_min (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_all_reduce_product (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_all_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_full_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_full_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... ok
test_barrier_group_cuda (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_group_cuda (__main__.TestMPI) ... skipped "MPI doesn't supports GPU barrier"
test_barrier_timeout_full_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestMPI) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestMPI) ... ok
test_broadcast_cuda (__main__.TestMPI) ... skipped 'Only Gloo and Nccl backend supports CUDA allReduce'
test_broadcast_full_group (__main__.TestMPI) ... ok
test_broadcast_group (__main__.TestMPI) ... ok
test_broadcast_multigpu (__main__.TestMPI) ... skipped "MPI doesn't support broadcast multigpu"
test_destroy_full_group (__main__.TestMPI) ... ok
test_destroy_group (__main__.TestMPI) ... ok
test_gather (__main__.TestMPI) ... ok
test_gather_full_group (__main__.TestMPI) ... ok
test_gather_group (__main__.TestMPI) ... ok
test_get_backend (__main__.TestMPI) ... ok
test_get_default_group (__main__.TestMPI) ... ok
test_get_rank (__main__.TestMPI) ... ok
test_get_rank_size_full_group (__main__.TestMPI) ... ok
test_get_rank_size_group (__main__.TestMPI) ... ok
test_irecv (__main__.TestMPI) ... ok
test_isend (__main__.TestMPI) ... ok
test_reduce_full_group_max (__main__.TestMPI) ... ok
test_reduce_full_group_min (__main__.TestMPI) ... ok
test_reduce_full_group_product (__main__.TestMPI) ... ok
test_reduce_full_group_sum (__main__.TestMPI) ... ok
test_reduce_group_max (__main__.TestMPI) ... ok
test_reduce_group_min (__main__.TestMPI) ... ok
test_reduce_group_product (__main__.TestMPI) ... ok
test_reduce_group_sum (__main__.TestMPI) ... ok
test_reduce_max (__main__.TestMPI) ... ok
test_reduce_min (__main__.TestMPI) ... ok
test_reduce_multigpu (__main__.TestMPI) ... skipped 'Only Nccl backend supports reduce multigpu'
test_reduce_product (__main__.TestMPI) ... ok
test_reduce_sum (__main__.TestMPI) ... ok
test_reduce_sum_cuda (__main__.TestMPI) ... skipped 'Only Nccl supports CUDA reduce'
test_scatter (__main__.TestMPI) ... ok
test_scatter_full_group (__main__.TestMPI) ... ok
test_scatter_group (__main__.TestMPI) ... ok
test_send_recv (__main__.TestMPI) ... ok
test_send_recv_any_source (__main__.TestMPI) ... ok
test_send_recv_with_tag (__main__.TestMPI) ... ok

----------------------------------------------------------------------
Ran 68 tests in 6.415s

OK (skipped=15)
Running distributed tests for the nccl backend
test_Backend_enum_class (__main__.TestDistBackend) ... ok
test_DistributedDataParallel (__main__.TestDistBackend) ... ok
test_DistributedDataParallelCPU (__main__.TestDistBackend) ... skipped 'nccl does not support DistributedDataParallelCPU'
test_all_gather (__main__.TestDistBackend) ... skipped 'Only MPI supports CPU all gather'
test_all_gather_cuda (__main__.TestDistBackend) ... skipped 'CUDA all gather skipped for NCCL'
test_all_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_multigpu (__main__.TestDistBackend) ... ok
test_all_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_multigpu (__main__.TestDistBackend) ... skipped 'CUDA all_reduce multigpu skipped for NCCL'
test_all_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum_cuda (__main__.TestDistBackend) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_cuda (__main__.TestDistBackend) ... ok
test_barrier_full_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_full_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_timeout_full_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_cuda (__main__.TestDistBackend) ... ok
test_broadcast_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_multigpu (__main__.TestDistBackend) ... skipped 'NCCL broadcast multigpu skipped'
test_destroy_full_group (__main__.TestDistBackend) ... ok
test_destroy_group (__main__.TestDistBackend) ... ok
test_gather (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_get_backend (__main__.TestDistBackend) ... ok
test_get_default_group (__main__.TestDistBackend) ... ok
test_get_rank (__main__.TestDistBackend) ... ok
test_get_rank_size_full_group (__main__.TestDistBackend) ... ok
test_get_rank_size_group (__main__.TestDistBackend) ... ok
test_irecv (__main__.TestDistBackend) ... skipped 'Nccl does not support irecv'
test_isend (__main__.TestDistBackend) ... skipped 'Nccl does not support isend'
test_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_multigpu (__main__.TestDistBackend) ... ok
test_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum_cuda (__main__.TestDistBackend) ... ok
test_scatter (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_send_recv (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'
test_send_recv_any_source (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv from any source'
test_send_recv_with_tag (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'

----------------------------------------------------------------------
Ran 68 tests in 69.549s

OK (skipped=52)
Running distributed tests for the nccl backend with file init_method
test_Backend_enum_class (__main__.TestDistBackend) ... ok
test_DistributedDataParallel (__main__.TestDistBackend) ... ok
test_DistributedDataParallelCPU (__main__.TestDistBackend) ... skipped 'nccl does not support DistributedDataParallelCPU'
test_all_gather (__main__.TestDistBackend) ... skipped 'Only MPI supports CPU all gather'
test_all_gather_cuda (__main__.TestDistBackend) ... skipped 'CUDA all gather skipped for NCCL'
test_all_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_gather_multigpu (__main__.TestDistBackend) ... ok
test_all_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_multigpu (__main__.TestDistBackend) ... skipped 'CUDA all_reduce multigpu skipped for NCCL'
test_all_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_all_reduce_sum_cuda (__main__.TestDistBackend) ... skipped 'Only Gloo backend will have CUDA allReduce tested'
test_barrier (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_cuda (__main__.TestDistBackend) ... ok
test_barrier_full_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_full_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_group (__main__.TestDistBackend) ... skipped 'NCCL does not support CPU barrier'
test_barrier_group_cuda (__main__.TestDistBackend) ... ok
test_barrier_timeout_full_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_global (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_barrier_timeout_group (__main__.TestDistBackend) ... skipped 'Only gloo backend supports timeouts'
test_broadcast (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_cuda (__main__.TestDistBackend) ... ok
test_broadcast_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_broadcast_multigpu (__main__.TestDistBackend) ... skipped 'NCCL broadcast multigpu skipped'
test_destroy_full_group (__main__.TestDistBackend) ... ok
test_destroy_group (__main__.TestDistBackend) ... ok
test_gather (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_gather_group (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_get_backend (__main__.TestDistBackend) ... ok
test_get_default_group (__main__.TestDistBackend) ... ok
test_get_rank (__main__.TestDistBackend) ... ok
test_get_rank_size_full_group (__main__.TestDistBackend) ... ok
test_get_rank_size_group (__main__.TestDistBackend) ... ok
test_irecv (__main__.TestDistBackend) ... skipped 'Nccl does not support irecv'
test_isend (__main__.TestDistBackend) ... skipped 'Nccl does not support isend'
test_reduce_full_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_full_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_group_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_max (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_min (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_multigpu (__main__.TestDistBackend) ... ok
test_reduce_product (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum (__main__.TestDistBackend) ... skipped 'Nccl does not support CPU tensors'
test_reduce_sum_cuda (__main__.TestDistBackend) ... ok
test_scatter (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_full_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_scatter_group (__main__.TestDistBackend) ... skipped 'Nccl does not support scatter'
test_send_recv (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'
test_send_recv_any_source (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv from any source'
test_send_recv_with_tag (__main__.TestDistBackend) ... skipped 'Nccl does not support send/recv'

----------------------------------------------------------------------
Ran 68 tests in 70.381s

OK (skipped=52)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14690

Differential Revision: D13294169

Pulled By: teng-li

fbshipit-source-id: 69ccac34c6c016899bfe8fbc50b48d4bfd1d3876
2018-12-03 12:04:26 -08:00
Teng Li
5268dd468c Fixed DistributedDataParallel cannot kick off all-reduce in a corner case (#14675)
Summary:
Ok, this corner case shows up for translation workloads, and it only happens when all of the following hold:

(1) the module has a registered parameter that does not require grad,

and

(2) this registered parameter has a unique type (say, double or half) and is the only parameter of that type, so it alone gets put into a separate bucket,

and

(3) it is the last parameter registered in the module, so its bucket's reduction is the first one to be kicked off.

When this corner case happens, the parameter's backward hook is never fired, since it does not require grad. All other buckets wait for its bucket to be kicked off first, so no bucket is ever reduced: everything is blocked by the first bucket (the unique-type parameter).

This PR makes three changes:
(1) Make sure that we only bucket parameters that require grad.
(2) Check all-reductions in the next iteration: as long as we detect that the previous iteration's all-reduction was not fully kicked off, we issue an error in the next iteration.
(3) Remove some unused variables.
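A minimal Python sketch of fix (1), with invented names (`Param`, `build_buckets`) standing in for the real C++ bucketing code: parameters that do not require grad are skipped up front, so no bucket ever waits on a gradient hook that will never fire.

```python
# Hedged sketch of fix (1); Param/build_buckets are illustrative stand-ins.
from dataclasses import dataclass

@dataclass
class Param:
    name: str
    dtype: str
    requires_grad: bool

def build_buckets(params):
    # Group parameters of the same dtype into one bucket, but only those
    # that will actually produce gradients during backward.
    buckets = {}
    for p in params:
        if not p.requires_grad:
            continue  # the fix: never bucket a no-grad parameter
        buckets.setdefault(p.dtype, []).append(p)
    return buckets

params = [
    Param("w", "float32", True),
    Param("b", "float32", True),
    # Unique dtype and no grad: before the fix this got its own bucket,
    # whose reduction was never triggered, blocking all other buckets.
    Param("stats", "float64", False),
]
buckets = build_buckets(params)  # only a 'float32' bucket remains
```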

With this bug fixed, the only way this error can happen is if the user changes parameters after wrapping the module with DDP, as in:
https://github.com/pytorch/pytorch/issues/12603

Test covered as well

Without the first fix, I verified that the repro in fbcode hit this error message:

```
result = self.forward(*input, **kwargs)
  File "/data/users/tengli/fbsource/fbcode/buck-out/dev/gen/language_technology/neural_mt/os/pytorch_translate/train#link-tree/torch/nn/parallel/distributed.py", line 312, in forward
    raise RuntimeError("Not all gradients are all-reduced from "
RuntimeError: Not all gradients are all-reduced from the backward of the previous iteration. This is unexpected and fatal error. Please check and ensure that the model's parameters are not changed after you wrap up the model with DistributedDataParallel.

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14675

Differential Revision: D13291083

Pulled By: teng-li

fbshipit-source-id: 2539b699fae843f104b4b8d22721ae82502ba684
2018-12-02 17:13:07 -08:00
Teng Li
85d3fccee7 Removed redundant allreduce options in DDP (#14208)
Summary:
These were somehow not cleaned up after the C++ migration; they are unused and can be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14208

Differential Revision: D13132492

Pulled By: teng-li

fbshipit-source-id: 0f05b6368174664ebb2560c037347c8eb45f7c38
2018-11-21 16:56:46 -08:00
Teng Li
4983397c02 Better documentation and warning (#13946)
Summary:
This is to address https://github.com/pytorch/pytorch/issues/12603
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13946

Differential Revision: D13055254

Pulled By: teng-li

fbshipit-source-id: 20a206ebd3456eac9dc50584664c4bca3ee955d1
2018-11-14 10:41:46 -08:00
Teng Li
dceec1de30 Distributed Data Parallel documentation for PT1 release (#13657)
Summary:
This should fix https://github.com/pytorch/pytorch/issues/12604

Make html and look through the html pages to make sure that everything looks good
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13657

Reviewed By: calebho

Differential Revision: D12954250

Pulled By: teng-li

fbshipit-source-id: 40e1925ec0cdce5e6a1d8ba29537937da8ef9194
2018-11-07 12:11:57 -08:00
Teng Li
1413dd4bfc Added the finer bucketing option for DDP (#13607)
Summary:
We only need this for the backward pass; for the forward cast, the non-fine-grained bucketing should be better since it is sequential anyway.

Tests are all covered by the c10d test; the bucket size was reduced so that bucketing actually happens in the c10d test.
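A minimal sketch of the size-capped ("finer") bucketing idea for backward, with invented names and byte sizes (the real helper lives in C++): tensors are packed greedily into buckets no larger than a cap, so each bucket's all-reduce can start as soon as its gradients are ready.

```python
# Illustrative greedy size-capped bucketing; indices stand in for tensors.
def take_tensors_fine(sizes_bytes, bucket_cap_bytes):
    buckets, cur, cur_bytes = [], [], 0
    for i, size in enumerate(sizes_bytes):
        if cur and cur_bytes + size > bucket_cap_bytes:
            buckets.append(cur)     # close the full bucket
            cur, cur_bytes = [], 0
        cur.append(i)
        cur_bytes += size
    if cur:
        buckets.append(cur)
    return buckets

# Gradients of 3, 4, 2 and 6 bytes with a 6-byte cap:
print(take_tensors_fine([3, 4, 2, 6], 6))  # → [[0], [1, 2], [3]]
```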
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13607

Differential Revision: D12944515

Pulled By: teng-li

fbshipit-source-id: d982e8dca2874c91d39b30b73a85bfbeb768c508
2018-11-07 12:00:55 -08:00
Teng Li
74819087de Mixed precision DDP hang fix and fine-grained option for DDP perf (#13496)
Summary:
When going to mixed-precision fp16 training, DDP randomly hangs. Initially, I thought this smelled like a similar NCCL bug I filed a while ago. It turns out it's not. Again, I am seeing a different size on each rank process. How could this even happen?

It turns out that take_tensors generates the list of bucketed tensors in a nondeterministic order, because the key of the map is a pointer. An interesting bug to dig into and fix.

fp16 DDP training should be fully working now.

Also added another fine-grained take_tensors helper that aims to improve DDP performance, with a TODO to replace DDP's take_tensors with it.
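As an illustration of the ordering fix (not PyTorch's actual C++ take_tensors): keying the bucket map by a deterministic value such as the dtype name, with insertion-ordered grouping, yields the same bucket order on every rank, so the ranks' collective calls line up on matching buckets.

```python
# Illustrative sketch: a deterministic key (the dtype name) instead of a raw
# pointer makes bucket order identical on every rank.
from collections import OrderedDict

def take_tensors(tensors):
    buckets = OrderedDict()  # deterministic: insertion order, stable keys
    for t in tensors:
        buckets.setdefault(t["dtype"], []).append(t)
    return list(buckets.values())

# Two ranks holding the same parameters (dicts stand in for tensors):
rank0 = [{"dtype": "float16"}, {"dtype": "float32"}, {"dtype": "float32"}]
rank1 = [{"dtype": "float16"}, {"dtype": "float32"}, {"dtype": "float32"}]
order0 = [b[0]["dtype"] for b in take_tensors(rank0)]
order1 = [b[0]["dtype"] for b in take_tensors(rank1)]
# Both ranks see the buckets in the same order, so all-reduce cannot mismatch.
```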

Fixed: https://github.com/pytorch/pytorch/issues/12150
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13496

Differential Revision: D12920985

Pulled By: teng-li

fbshipit-source-id: 26f3edae7be45a80fa7b2410a2e5a1baab212d9c
2018-11-05 16:22:15 -08:00
Teng Li
e475d3ede3 DDP multi-GPU segfault fix (#13291)
Summary:
Fix https://github.com/pytorch/pytorch/issues/13200

Tested on an 8-GPU machine, since CI doesn't have this many GPUs and the multi-GPU test won't be triggered there.

```
tengli@learnfair096:~/pytorch/test$ python run_test.py -i distributed --verbose
Selected tests: distributed
Running test_distributed ... [2018-10-29 20:32:46.355858]
/public/apps/openmpi/2.1.1/gcc.5.4.0/bin/mpiexec
Running distributed tests for the gloo backend
test_DistBackend (__main__.TestDistBackend) ... ok
test_DistributedDataParallel (__main__.TestDistBackend) ... ok
test_DistributedDataParallelCPU (__main__.TestDistBackend) ... ok
```

Also, I would like to bump the broadcast bucket size up for performance reasons.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13291

Differential Revision: D12842840

Pulled By: teng-li

fbshipit-source-id: e8c50f15ebf2ab3e2cd1b51d365e41a6106b98fe
2018-10-31 00:43:42 -07:00
sli
9d9e5f8d1e Solve bug of DistributedDataParallel (#13248)
Summary:
Fixed bug [https://github.com/facebookresearch/maskrcnn-benchmark/issues/52](https://github.com/facebookresearch/maskrcnn-benchmark/issues/52)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13248

Reviewed By: pietern

Differential Revision: D12830451

Pulled By: teng-li

fbshipit-source-id: ab33faf3f6f4545f8fe07da7ecbeb2f0a2ea23f0
2018-10-29 15:19:55 -07:00
Teng Li
c250f6f3d5 DDP perf improvement: move sync_reduction to C++, dedicated CUDA streams for memcpy (#12954)
Summary:
- Moved sync_reduction to C++
- Use a dedicated CUDA stream for memcpy
- Also use a dedicated CUDA stream for memcpy in queue_reduction

Added test as well.

CI should cover both DDP and unittest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12954

Differential Revision: D10520069

Pulled By: teng-li

fbshipit-source-id: 64348e4e43c15f9695a4c28b036c232587ecfb65
2018-10-24 21:37:13 -07:00
Teng Li
8d3e7e2fcb Move DDP queue_reduction to C++ (#12852)
Summary:
Fully working version, continuing on goldsborough's initial version.

Waiting on the stream guard to be merged before adding more stream perf logic into the C++ version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12852

Differential Revision: D10468696

Pulled By: teng-li

fbshipit-source-id: 8e46d408796973817abfd9dbd6566e0ca5b7a13f
2018-10-22 16:07:46 -07:00
Teng Li
d120b9af5a Make c10d pickling/unpickling work (#12694)
Summary:
This fixes the issue for https://github.com/pytorch/pytorch/issues/12168
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12694

Differential Revision: D10468717

Pulled By: teng-li

fbshipit-source-id: 3df31d75eea19d6085af665f5350d3cb667a5048
2018-10-19 16:42:36 -07:00