Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71604
Implement two helper functions:
- as_functional_optim which takes in a torch.optim class type and arguments and
creates the corresponding functional optimizer.
- create_functional_optim which takes in the functional optimizer class type
and constructs it. Note that as_functional_optim calls into
create_functional_optim.
The first will be used in future PRs, as described in
https://github.com/pytorch/pytorch/issues/67570, to create a functional
optimizer from a traditional optimizer. The second is used in
_OptimizerHookState to create a functional optimizer.
Both new helper functions are covered by unittests.
ghstack-source-id: 147577170
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D33688995
fbshipit-source-id: 8b2daafd1b914efa90877cc4313aa9a428546fc1
(cherry picked from commit 42fdae2991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71602
The design in https://github.com/pytorch/pytorch/issues/67570 requires
`_OptimizerHookState` to be created directly from a functional optimizer. Add
support and tests for this. Also refactor a few tests.
ghstack-source-id: 147577175
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D33687477
fbshipit-source-id: f3c789aa77773f918e01a8d0cf08739b2edf07b3
(cherry picked from commit 4851e1c6d4)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71601
Moves current prototype optimizer overlap to its own file for a better
namespace. No code changes besides a few comment fixes. Note that this code is
still prototype and not expected to be used by an end user.
ghstack-source-id: 147458826
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D33662678
fbshipit-source-id: 3cc931323230a4b66c02b9e6f744aaf5c48d4d34
(cherry picked from commit 5070595c7f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71621
Moves this feature to beta as discussed, and cleans up some docs.
Synced offline with wayi1 who mentioned that the current names are preferred
as he works to prototype hierarchical allreduce as discussed in this RFC: https://github.com/pytorch/pytorch/issues/71325.
ghstack-source-id: 147382940
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D33700444
fbshipit-source-id: 8eb543f5b02a119d0790a5c0919e6def6383a067
(cherry picked from commit 656e9809b2)
Summary:
Here 20 is a bad example, since the warmup step is set to 100; 200 iterations makes much more sense.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70974
Reviewed By: dagitses
Differential Revision: D33474576
Pulled By: rohan-varma
fbshipit-source-id: 4c7043108897848bde9503d77999971ad5567aa6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70165
Implements activation offload support in the checkpoint_wrapper API via
save_on_cpu hooks. We avoid modifying the torch.utils.checkpoint implementation
and instead compose offload + checkpoint, using the save_on_cpu hook for the
former.
ghstack-source-id: 146078900
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D33228820
fbshipit-source-id: 98b4da0828462c41c381689ee07360ad014e808a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70164
Implement Alban's suggestion to make checkpoint_wrapper an nn.Module
instead of patching the forward pass, which is too hacky.
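The module-wrapper pattern can be sketched roughly as follows (plain-Python stand-ins; the real API wraps `nn.Module` and calls `torch.utils.checkpoint.checkpoint` inside `forward` rather than monkey-patching the wrapped module):

```python
# Hypothetical sketch of wrapping a module instead of patching its
# forward. `Module` and `checkpoint` are simplified stand-ins.
def checkpoint(fn, *args):
    # Stand-in for torch.utils.checkpoint.checkpoint: just calls fn here;
    # the real version drops activations and recomputes them in backward.
    return fn(*args)

class Module:  # stand-in for nn.Module
    def __call__(self, *args):
        return self.forward(*args)

class CheckpointWrapper(Module):
    # A module that owns the wrapped module and checkpoints its forward,
    # rather than monkey-patching the wrapped module's forward method.
    def __init__(self, mod):
        self.mod = mod

    def forward(self, *args):
        return checkpoint(self.mod, *args)

def checkpoint_wrapper(mod):
    return CheckpointWrapper(mod)

class Double(Module):
    def forward(self, x):
        return 2 * x

wrapped = checkpoint_wrapper(Double())
result = wrapped(3)  # forwards through the checkpoint stand-in
```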
ghstack-source-id: 146011215
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D33214696
fbshipit-source-id: dc4b3e928d66fbde828ab60d90b314a8048ff7a2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69955
Implements a checkpoint_wrapper function, which wraps an nn.Module with checkpointing so users won't have to call checkpoint() every time they want to checkpoint the module.
Currently, only support for reentrant-based checkpointing is added, and it is only tested with FSDP to unblock a use case.
Future work is to add support for new checkpointing API, add more tests, upstream to torch.utils.checkpoint.
ghstack-source-id: 145811242
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D33107276
fbshipit-source-id: c4a1c68d71d65713a929994940a8750f73fbdbdb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67401
Some minor changes to dist quantization: mainly change the namespace and add some notes for future code dedup.
ghstack-source-id: 143910067
Test Plan: wait for CI
Reviewed By: mrshenli
Differential Revision: D31979269
fbshipit-source-id: 85a2f395e6a3487dd0b9d1fde886eccab106e289
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66649
Some minor changes to dist quantization: mainly change the namespace and add some notes for future code dedup.
ghstack-source-id: 141336191
Test Plan: wait for CI
Reviewed By: cbalioglu
Differential Revision: D31663043
fbshipit-source-id: 2f96b7346e9c90df5ab2536767f8301eb86a9c79
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65197
1. The constructor accepts a local optimizer instance instead of the inputs of the local optimizer constructor and the class type.
2. The parameters are read from local optimizer's param_groups instead of a separate input.
Proposal: https://github.com/pytorch/pytorch/issues/59699
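The new constructor shape could be sketched like this (illustrative stand-ins; the actual wrapper's name and signature may differ):

```python
# Hypothetical sketch: the wrapper takes a constructed local optimizer
# and reads parameters from its param_groups, instead of taking the
# optimizer class plus constructor inputs and a separate params arg.
class LocalSGD:  # stand-in for a torch.optim optimizer instance
    def __init__(self, params, lr):
        self.param_groups = [{"params": list(params), "lr": lr}]

class PostLocalSGDOptimizer:
    def __init__(self, optim, averager=None):
        self.optim = optim
        # Read parameters from the local optimizer's param_groups.
        self.params = [
            p for group in optim.param_groups for p in group["params"]
        ]
        self.averager = averager

local_optim = LocalSGD(params=[1.0, 2.0, 3.0], lr=0.05)
wrapper = PostLocalSGDOptimizer(local_optim)
```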
ghstack-source-id: 138307226
Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity
Reviewed By: rohan-varma
Differential Revision: D31007439
fbshipit-source-id: bbb0526e6763ef76775b85088571506b3942c722
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63154
The collective quantization API now supports alltoall, alltoall_single, and allscatter. The test is also included.
ghstack-source-id: 136856877
Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/algorithms/quantization:DistQuantizationTests_nccl -- test_all_to_all_single_bfp16
Reviewed By: wanchaol
Differential Revision: D30255251
fbshipit-source-id: 856f4fa12de104689a03a0c8dc9e3ecfd41cad29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63059
Add support for the BFP16 quantization method in OSS. Currently only CPU is supported.
ghstack-source-id: 136639528
Test Plan: Imported from OSS
Reviewed By: wanchaol
Differential Revision: D30194538
fbshipit-source-id: ac248567ad8028457c2a91b77ef2ce81709fce53
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63895
When updating model parameters, writing to `parameter.data` is no longer recommended, because the `data` field will be deprecated in the future.
The replacement is `tensor.set_`.
ghstack-source-id: 136593433
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity
Reviewed By: SciPioneer
Differential Revision: D30526178
fbshipit-source-id: a1ac0ec3665d8623edd5bf94f01c1132daff5c00
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63867
When updating model parameters, writing to `parameter.data` is no longer recommended, because the `data` field will be deprecated in the future.
The replacement is `tensor.set_`.
ghstack-source-id: 136531233
Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager
Reviewed By: SciPioneer
Differential Revision: D30513613
fbshipit-source-id: 402efb9c30fafc3f285bebc631639f656ceae585
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63260
Add BF16 all-reduce communication hook. Skip if CUDA version < 11 or NCCL version < 2.9.7.
Reviewed By: SciPioneer
Differential Revision: D30238317
fbshipit-source-id: bad35bf7d43f10f1c40997a282b831b61ef592bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63277
`PostLocalSGDState` requires a subgroup. To initialize this subgroup, a global process group must be initialized. However, this imposes a restriction that a hook state can only be provided after distributed environment initialization, which is not compatible with lightning DDP plugin setup where hook state should be provided before distributed environment initialization.
Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 135848575
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_hook_parity_post_localSGD
Reviewed By: cbalioglu
Differential Revision: D30325041
fbshipit-source-id: 7b870166d096d306c3f2f7c69816a705cec0bebd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62142
Created a wrapper that takes the collective op and a quantization type as arguments. It quantizes the input, performs the collective op, and then dequantizes the output.
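The wrapper's control flow might look roughly like this (pure-Python stand-ins; the toy integer rounding stands in for the real BFP16/FP16 conversion, and the fake "allreduce" stands in for the distributed collective):

```python
# Hypothetical sketch of a quantize -> collective -> dequantize wrapper.
def quantize(t):
    return [round(x) for x in t]  # toy stand-in for FP16/BFP16 conversion

def dequantize(t):
    return [float(x) for x in t]  # toy stand-in for the reverse conversion

def allreduce_sum(tensors):
    # Stand-in for the real collective: elementwise sum across "ranks".
    return [sum(col) for col in zip(*tensors)]

def auto_quantize(collective, *rank_inputs):
    # Quantize each rank's input, run the collective, dequantize the result.
    quantized = [quantize(t) for t in rank_inputs]
    out = collective(quantized)
    return dequantize(out)

result = auto_quantize(allreduce_sum, [1.2, 2.0], [0.9, 3.0])
```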
Test Plan:
Tested through distributed_gloo_fork.
e.g., buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_all_to_all_quantized
Reviewed By: wanchaol
Differential Revision: D29682812
fbshipit-source-id: 79c39105ff11270008caa9f566361452fe82a92e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662
Replaced the methods `set_tensor(.)` and `get_tensor()` in the Python-exposed API from the C++ logic with `buffer()` and `set_buffer(.)` for a cleaner interface.
Reviewed By: SciPioneer
Differential Revision: D30012869
fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482
Summary:
**Overview:**
This removes the preceding `_` from `_Join`, `_Joinable`, and `_JoinHook` in preparation for adding the generic join context manager tutorial (see [here](https://github.com/pytorch/tutorials/pull/1610)). This also adds a docs page, which can be linked from the tutorial. [Here](https://github.com/pytorch/pytorch/files/6919475/render.pdf) is a render of the docs page.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62605
Test Plan:
`DistributedDataParallel.join()`:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```
`ZeroRedundancyOptimizer`:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
NOTE: DDP overlap tests are failing due to a landing race. See https://github.com/pytorch/pytorch/pull/62592. Once the fix is landed, I will rebase, and tests should be passing.
`Join`:
```
gpurun4 python test/distributed/algorithms/test_join.py
```
Reviewed By: mrshenli
Differential Revision: D30055544
Pulled By: andwgu
fbshipit-source-id: a5ce1f1d9f1904de3bdd4edd0b31b0a612d87026
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62592
Reland #62510
`GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity:
1) get_index -> index
2) is_the_last_bucket_to_allreduce -> is_last
3) get_per_parameter_tensors -> gradients
4) get_model_params_for_bucket -> parameters
ghstack-source-id: 134848352
Test Plan: unit test
Reviewed By: andwgu
Differential Revision: D30049431
fbshipit-source-id: 1bcac331aa30e529b7230e3891bc811c531b0ea9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62532
This method is not stable at this time, so avoid releasing it when DDP communication hook feature is released as a stable feature.
ghstack-source-id: 134787831
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_hook_with_optimizer_parity
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_hook_then_optimizer_nccl
Reviewed By: rohan-varma
Differential Revision: D30031222
fbshipit-source-id: e03a8e13fee5116a5ddd724eb76316ee98f2a676
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62510
`GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity:
1) get_index -> index
2) is_the_last_bucket_to_allreduce -> is_last
3) get_per_parameter_tensors -> gradients
4) get_model_params_for_bucket -> parameters
Test Plan:
Local run comprehensive test with following results:
https://pxl.cl/1Ml8b
The two timeout test failures are most likely environment-related and also fail on my devserver.
Reviewed By: SciPioneer
Differential Revision: D30024161
fbshipit-source-id: 07e6072a2f7b81f731425d9b71f8c8b60d383b0f
Summary:
**Overview:**
This adds two approaches to overlapping `DistributedDataParallel.backward()` with `ZeroRedundancyOptimizer.step()` by providing two hook constructors: `hook_with_zero_step()` and `hook_with_zero_step_interleaved()`. The former waits for all backward computation to finish before starting optimizer computation, while the latter launches a partial optimizer computation using the contents of a gradient bucket once that bucket's all-reduce completes. The two approaches each suffer from their own weaknesses, and which one to use depends on the specific hardware configuration.
Both approaches can share changes to `ZeroRedundancyOptimizer`. A user should pass `overlap_with_ddp=True` to `ZeroRedundancyOptimizer`, construct a DDP communication hook using either `hook_with_zero_step()` or `hook_with_zero_step_interleaved()`, and register that communication hook. `ZeroRedundancyOptimizer.step()` should still be called in the training loop, though the optimizer computation and communication will be offloaded to originate from the communication hook. Currently, the first two iterations are vacuous, meaning they do not result in parameter updates and the inputs are ignored. This is required to finalize the DDP bucket strategy and to then initialize the `ZeroRedundancyOptimizer`'s local optimizer based on that bucketing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62157
Test Plan:
The existing `ZeroRedundancyOptimizer` tests pass, and new unit tests for both hooks pass:
- ~~`test_ddp_with_zero_step_parity_cpu`~~ (removed for now due to flakiness in CI -- under investigation, could possibly be similar Gloo issue as with `hook_with_zero_step_interleaved()`)
- `test_ddp_with_zero_step_parity_gpu`
- `test_ddp_with_zero_step_interleaved_parity_gpu`
These were tested on the AI AWS cluster.
An analogous `test_ddp_with_zero_step_interleaved_parity_cpu` is missing due to existing bugs with Gloo. See https://github.com/pytorch/pytorch/pull/62302.
Both approaches have been verified using an internal accuracy benchmark.
Reviewed By: mrshenli
Differential Revision: D29971046
Pulled By: andwgu
fbshipit-source-id: a7234c23c7ea253f144a698fd7e3c0fe039de5e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62457
Specify `Future[torch.Tensor]` as DDP communication hook return type, which should be explicitly a single tensor. The previous API takes a list that has a single tensor.
Note that now the typing info no longer accepts the internal type of `torch._C.Future`, which does not support torchscript and hence cannot support `Future[torch.Tensor]`.
ghstack-source-id: 134771419
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_invalid_comm_hook_return_type
Reviewed By: rohan-varma
Differential Revision: D30007390
fbshipit-source-id: 246667c9b575b4c6e617b0a5b373151f1bd81e7f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62389
Simplify the implementation of `parseHookResult` of `PythonCommHook`, by not directly accepting the output of allreduce, which is a tensor list.
Address the comment on https://github.com/pytorch/pytorch/pull/62074#discussion_r675303280
Additionally, the formatter is also applied to `OptimizerHookState` and `hook_then_optimizer`.
ghstack-source-id: 134626246
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork
Reviewed By: rohan-varma
Differential Revision: D29982485
fbshipit-source-id: 5b27cc5ef09d2f87c1ade4c0feef7eacc1af3a9a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62392
The constructor of `PeriodicModelAverager` does not need to accept parameters.
ghstack-source-id: 134626245
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager
Reviewed By: rohan-varma
Differential Revision: D29986446
fbshipit-source-id: 6a8b709e4383a3c44b9e60955fbb067cd2868e76
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62111
This base class will be passed to the post-localSGD optimizer in the next PR. This way, the same post-localSGD optimizer can choose different model averaging algorithms.
Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 134489187
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager
Reviewed By: rohan-varma
Differential Revision: D29884954
fbshipit-source-id: 1dc5e35c58895902991567f633afd621c7108938
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62079
Adds support for keyword arguments in functional optimizers running as a
hook.
ghstack-source-id: 134330379
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29838127
fbshipit-source-id: 2ab051ef5f0dff19c145ebe2260668b927ba47b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62177
Reland of https://github.com/pytorch/pytorch/pull/61678
Fix CI failure by gating the inclusion of the torchvision model on whether torchvision is available.
ghstack-source-id: 134282165
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29904101
fbshipit-source-id: 47e799eb4a90acbbda91c5857ea00de3045d49f5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61678
This diff makes the following changes:
- Add a `step_param` method to the `_FunctionalSGD` class, written similarly to `step` but for a single param
- Implement a communication hook wrapper that runs a given comm. hook and then applies a functional SGD step
- Verify that this is equal to regular allreduce + SGD optimizer
ghstack-source-id: 133567598
ghstack-source-id: 134263399
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29701447
fbshipit-source-id: 183954593b82a092414623292f9b10e675fef96e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62105
This is for the preparation of wrapping the averager as an optimizer, which can only accept parameters rather than a module.
Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 134213572
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_average_parameters
Reviewed By: rohan-varma
Differential Revision: D29883693
fbshipit-source-id: 474ba924a0b05068b12f163fb74582bccf314964
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62074
Since SPMD mode is retired, the comm hook result will always be a single tensor.
This can improve the comm hook developer experience, as there is no need to add an extra `[0]` on the precursor future result.
Closes: https://github.com/pytorch/pytorch/issues/61914
ghstack-source-id: 134164593
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork
Reviewed By: rohan-varma
Differential Revision: D29864732
fbshipit-source-id: 59fe6dd78b66214b1788514ad4d236039d9bda31
Summary:
**Overview:**
This refactors the computation on non-joined processes relating to the join context manager. The concept was inspired by a comment from pritamdamania.
**Changes:**
This introduces a `_Joinable` abstract base class, which requires a `_join_hook()` method and `_join_device()` and `_join_process_group()` property methods. Any class that we want to be compatible with the generic join context manager should inherit from `_Joinable` and implement `_join_hook()`, `_join_device()`, and `_join_process_group()`. (The `device` and `process_group` information has been moved from `_JoinHook` to `_Joinable`.)
The generic join context manager now takes in a `List[_Joinable]` instead of `List[_JoinHook]`. The motivation for this is that previously, by passing the `_JoinHook`s into the context manager, the class providing a `_JoinHook` can modify the context manager's behavior, but the context manager cannot modify the class's behavior. This is solved by giving the context manager a reference to the class's instance.
This implementation reserves the field `_join_config` in every `_Joinable` to store a `_JoinConfig` instance, which holds all dynamic fields needed from the `_Joinable` for the join context manager: `enable`, `throw_on_early_termination`, and `is_first_joinable`. ("dynamic" here means that for a given `_Joinable` instance, the values for those fields may change across different join context usages.) In particular, these fields are needed to implement a method `notify_join_context()`, which encapsulates the computation performed on non-joined processes relating to the join context manager --- (1) the all-reduce to indicate that the process has not yet joined and (2) the all-reduce to check whether to throw an exception if `throw_on_uneven_inputs=True`. The idea is that every `_Joinable` class only needs to make a call to `notify_join_context()` before its per-iteration collective communications; it is a simple one-line addition.
Only the first `_Joinable` instance passed into the context manager actually performs the collective communications in `notify_join_context()`. In that case, the method returns an async work handle for the initial all-reduce indicating that the process not yet joined. Otherwise, the method returns `None`. This conditional logic is handled internally without additional input from the user.
**New API:**
Now, the example usage would look like:
```
ddp_model = DistributedDataParallel(...)
zero_optim = ZeroRedundancyOptimizer(ddp_model.parameters(), ...)
with _Join([ddp_model, zero_optim]):
...
```
Any arguments meant for a join hook (e.g. `divide_by_initial_world_size`) must be specified as keyword arguments. For example:
```
with _Join([ddp_model, zero_optim], divide_by_initial_world_size=False):
...
```
They will be forwarded to every `_join_hook()` function via `**kwargs`. This creates a clear separation between the variables needed by the context manager (`enable` and `throw_on_early_termination`) and those needed by the `_Joinable` class (e.g. `divide_by_initial_world_size`).
**Recap:**
After this change, the relevant information to use the generic join context manager looks like the following (omitting prefix `_` from names):
- Suppose we have a class `C` (e.g. `DistributedDataParallel`) that we want to be able to use the `Join` context.
- We make `C` inherit from `Joinable` and implement `join_hook() -> JoinHook`, `join_device()`, and `join_process_group()`.
- To implement `join_hook()`, we define a `CJoinHook` class inheriting from `JoinHook` and implement `main_hook()` and `post_hook()` as needed.
- We locate a place before `C`'s per-iteration collective communications and add a call to `Join.notify_join_context()`.
- We call `Joinable.__init__(self)` in `C`'s constructor.
- The `C.join_config` field will be used internally by the context manager. This does not affect `C`'s serializability.
- Run time arguments for `C`'s join hook can be passed in as keyword arguments to the context manager: `with Join([C()], arg1=..., arg2=...):`.
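Under these conventions, a minimal skeleton for a `Join`-compatible class might look like the following (a simplified, single-process sketch; device and process-group plumbing is omitted, and the class names mirror the recap above rather than the exact library API):

```python
import abc

# Simplified sketch of the Joinable/JoinHook pattern described above.
class JoinHook(abc.ABC):
    def main_hook(self):
        pass  # called repeatedly while some process has not yet joined

    def post_hook(self, is_last_joiner):
        pass  # called once all processes have joined

class Joinable(abc.ABC):
    def __init__(self):
        self._join_config = None  # reserved for the context manager

    @abc.abstractmethod
    def join_hook(self, **kwargs) -> JoinHook:
        ...

class CJoinHook(JoinHook):
    # The hook holds a reference to its owning Joinable so the context
    # manager can modify the owner's behavior through it.
    def __init__(self, owner):
        self.owner = owner

    def main_hook(self):
        self.owner.shadow_steps += 1  # e.g. shadow an all-reduce per step

class C(Joinable):
    def __init__(self):
        super().__init__()
        self.shadow_steps = 0

    def join_hook(self, **kwargs):
        return CJoinHook(self)

c = C()
hook = c.join_hook()
hook.main_hook()
```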
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61555
Test Plan:
I ran the existing DDP join tests:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```
I ran the ZeRO join tests:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py TestZeroRedundancyOptimizerDistributed.test_zero_join_gpu TestZeroRedundancyOptimizerDistributed.test_zero_join_cpu
```
Reviewed By: zou3519
Differential Revision: D29690359
Pulled By: andwgu
fbshipit-source-id: 2950f78de755eb5fb13b95b803dd7c705879a9c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61207
Model averager now must be combined with post-localSGD DDP communication hook. It will skip model averaging for the first K steps, because post-localSGD communication hook will run global gradient averaging during this phase.
Proposal: https://github.com/pytorch/pytorch/issues/59699
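The skip-first-K-steps logic could be sketched as follows (a toy single-process stand-in; the real averager performs a parameter allreduce where this sketch only counts rounds):

```python
# Hypothetical sketch: the averager only averages once the post-localSGD
# phase begins at `warmup_steps`; before that, the comm hook still runs
# global gradient averaging, so model averaging is skipped.
class PeriodicModelAverager:
    def __init__(self, period, warmup_steps):
        self.period = period
        self.warmup_steps = warmup_steps
        self.step = 0
        self.num_averages = 0  # counts actual averaging rounds

    def average_parameters(self, params):
        # Skip averaging during the first `warmup_steps` iterations.
        if self.step >= self.warmup_steps and self.step % self.period == 0:
            self.num_averages += 1  # real impl: allreduce of parameters
        self.step += 1

avg = PeriodicModelAverager(period=4, warmup_steps=100)
for _ in range(110):
    avg.average_parameters(params=[])
```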
ghstack-source-id: 133371335
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager
Reviewed By: pritamdamania87
Differential Revision: D29523738
fbshipit-source-id: 3fa9611046e1c0afa4bda78aa3ba200fa2a5fa4b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61206
Create a communication hook to run post-local SGD. This will be combined with model averager component to better support local SGD.
In contrast to the previous approach that runs local gradient averaging + global model averaging at each step for the first K steps, we now plan to run only global gradient averaging at each step for the first K steps, just like normal DDP. This can give us two advantages:
1) For some optimizers, model averaging can cause discrepancy in optimizer states. If we still do global gradient averaging for the first K steps, we can defer such discrepancy until we actually start local SGD.
2) Gradient averaging in the first K steps runs only one allreduce that overlaps with the backward pass, so it should also be more efficient.
Proposal: https://github.com/pytorch/pytorch/issues/59699
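The two-phase behavior can be sketched as (a toy single-process stand-in; the strings stand in for which process group the allreduce actually runs over):

```python
# Hypothetical sketch: for the first `start_localSGD_iter` steps the hook
# averages gradients globally (like normal DDP); afterwards it averages
# gradients only within the per-machine subgroup.
class PostLocalSGDState:
    def __init__(self, start_localSGD_iter):
        self.start_localSGD_iter = start_localSGD_iter
        self.iter = 0

def post_localSGD_hook(state, bucket):
    if state.iter < state.start_localSGD_iter:
        scope = "global"    # phase 1: global gradient averaging
    else:
        scope = "subgroup"  # phase 2: local gradient averaging
    state.iter += 1
    return scope

state = PostLocalSGDState(start_localSGD_iter=2)
scopes = [post_localSGD_hook(state, bucket=None) for _ in range(4)]
```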
ghstack-source-id: 133371322
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_hook_parity_post_localSGD
Reviewed By: pritamdamania87
Differential Revision: D29523292
fbshipit-source-id: 3f215f7150f2917c2781278fad759530c685ea2c
Summary:
Targets https://github.com/pytorch/pytorch/issues/54318.
**Overview:**
DDP offers a `join()` context manager to accommodate training on uneven inputs. This creates a new generic `_Join()` API permitting custom hooks, refactors DDP `join()` to call this generic `_Join()`, and implements a hook for ZeRO. (For now, the generic `_Join()` is implemented as private, but this may change after design discussions are cleared.)
There are two classes introduced: `_JoinHook`, the class defining the customizable join hook, and `_Join`, the generic join context manager.
The `_JoinHook` provides two entry points: `main_hook()`, which is called repeatedly while there exists a non-joined process, and `post_hook()`, which is called once all processes have joined, with the additional `bool` argument `is_last_joiner`. The class also requires `process_group` and `device` information by defining corresponding abstract property methods. Thus, to implement a join hook, (1) inherit from `_JoinHook`, (2) override `main_hook()` and `post_hook()` as appropriate, and (3) override `process_group()` and `device()` to provide process group and device information to be used by the join context manager implementation for collective communications.
The `_Join` constructor requires `join_hooks: List[_JoinHook]` and optionally `enable: bool = True` and `throw_on_early_termination: bool = False`. A training loop only needs to be wrapped with `with _Join(join_hooks):` (using the appropriate `join_hooks`) to be able to train on uneven inputs without hanging/erroring. The context manager requires a `dist.all_reduce(torch.ones(1))` to be called on every non-joined process each time before it performs its collective communications in order to indicate that the process has not yet joined. It also requires that all `process_group` attributes in the `_JoinHook` objects are the same.
**Notes:**
- The argument `is_last_joiner` to `post_hook()` may be useful for finding an authoritative rank when synchronizing.
- `enable` is a flag that can be set to `False` if the user knows the current training loop will not have uneven inputs. This may be used to disable join-related computation in the classes providing join hooks.
- `throw_on_early_termination` is a flag that can be set to `True` to notify processes to terminate upon detecting uneven inputs (i.e. upon the first process joining when there exists a non-joined process). Notably, the notification requires an all-reduce, so to prevent hanging/erroring, non-joined processes must participate in the all-reduce. The first-joining process raises a `RuntimeError`, and the other processes are expected (but not required) to do the same. This may be used to implement training on uneven inputs in cases that do not conform to the generic join context manager (e.g. `SyncBatchNorm`).
- Classes providing a join hook should do so via a `_join_hook()` method that returns a `_JoinHook` instance with the methods appropriately overridden.
- If there are multiple join hooks, the device specified by the first is used by the join context manager implementation to perform its collective communications.
- If there are multiple join hooks, both the main and post-hooks are iterated in the order in which the `_JoinHook` objects are passed into the context manager constructor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60757
Test Plan:
The current implementation preserves backward compatibility by not changing the existing DDP `join()` API at all. To check this, I ran through the uneven input tests (`test_ddp_grad_div_uneven_inputs`, `test_ddp_uneven_inputs_stop_iteration_sync_bn`, `test_ddp_uneven_inputs`, `test_ddp_uneven_input_join_disable`, `test_ddp_uneven_input_exception`) on the AI AWS cluster:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py --
```
Because the existing DDP join logic does not provide correct gradients to the joined processes if `gradient_as_bucket_view=False` and a joined process requires those gradients to correctly update its shard of the parameters in `ZeroRedundancyOptimizer.step()`, DDP and ZeRO are not fully compatible at the moment. To work around this and to test ZeRO's join hook separately, I added a test `_test_zero_join()` (with `test_zero_join_gpu()` and `test_zero_join_cpu()` flavors), which compares DDP with a local optimizer on uneven inputs against ZeRO on uneven inputs with the gradients set manually.
Reviewed By: iramazanli, mrshenli
Differential Revision: D29624636
Pulled By: andwgu
fbshipit-source-id: ec70a290e02518b0d8b683f9fed2126705b896c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60891
This fix is particularly useful for local SGD when the averaging period is very small, which may cause a conflict between the gradient allreduce within the per-machine subgroup and the global parameter allreduce across the communication world.
ghstack-source-id: 132564252
Test Plan:
f281873295 (#Try1) failed due to the conflict between global process group and subgroup.
```
<Thread(configerator-monitor-singleton, started 139839806633728)>
File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 890, in _bootstrap
self._bootstrap_inner()
File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/tmp/jetter.gson7tr3/configerator/client.py", line 348, in _monitor_loop
self._parent_thread.join(self._interval_ms / 1000)
File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 1015, in join
self._wait_for_tstate_lock(timeout=max(timeout, 0))
File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
```
Fixed after adding an explicit sync: f282044866, f282241800
Reviewed By: rohan-varma
Differential Revision: D29434597
fbshipit-source-id: a4f777fc26f379639f85fda32de425cd3b337b33
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60320
This averager can be used for post-local SGD.
ghstack-source-id: 131908011
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager
Reviewed By: rohan-varma
Differential Revision: D29249850
fbshipit-source-id: 09675d6bb1edfb8ffbeb94510d91962532d8ca3e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60303
The util function can be used for averaging parameters.
More optimizations can be done in the future.
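The parameter-averaging util can be sketched in a few lines (a minimal single-process sketch: `allreduce_sum` is an injected stand-in for `dist.all_reduce`, and the function name is illustrative, not the actual API):

```python
import torch

def average_parameters(params, world_size, allreduce_sum):
    """Average parameters in place: allreduce(SUM), then divide by world size.

    `allreduce_sum` stands in for dist.all_reduce so the sketch runs locally.
    """
    # Flatten all parameters into one tensor so a single allreduce suffices.
    flat = torch.cat([p.detach().reshape(-1) for p in params])
    flat = allreduce_sum(flat)
    flat.div_(world_size)
    # Copy the averaged values back into each parameter.
    offset = 0
    for p in params:
        n = p.numel()
        p.detach().copy_(flat[offset:offset + n].view_as(p))
        offset += n

# Simulate two replicas whose parameters should average to 2.0.
replica_values = [torch.full((3,), 1.0), torch.full((3,), 3.0)]
summed = sum(replica_values)  # what allreduce(SUM) would produce
params = [torch.full((3,), 1.0)]
average_parameters(params, world_size=2, allreduce_sum=lambda t: summed.clone())
print(params[0])  # tensor([2., 2., 2.])
```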
ghstack-source-id: 132214212
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_average_parameters
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_average_parameters
Reviewed By: rohan-varma
Differential Revision: D29242806
fbshipit-source-id: 76fb5a92adb4bdc6151a9f411e366a0ed2a31f47
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59576
If the gradients before allreduce are large, then the sum after allreduce may overflow, especially for FP16. Therefore, apply the division before allreduce.
This fix is applied to both C++ and Python comm hooks.
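The overflow this fix addresses can be demonstrated in a few lines (a single-process sketch; `allreduce_sum` stands in for the real collective):

```python
import torch

def allreduce_sum(tensors):
    # Stand-in for dist.all_reduce(SUM) across ranks: just sum the list.
    return sum(tensors)

world_size = 4
# Large FP16 gradients: each value is near the FP16 max (~65504).
grads = [torch.full((3,), 60000.0, dtype=torch.float16) for _ in range(world_size)]

# Dividing after the sum overflows: 2 * 60000 already exceeds 65504.
after = allreduce_sum(grads) / world_size
# Dividing each rank's gradient before the sum stays in range.
before = allreduce_sum([g / world_size for g in grads])

print(after)   # inf everywhere: the sum overflowed in FP16
print(before)  # tensor([60000., 60000., 60000.], dtype=torch.float16)
```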
ghstack-source-id: 130754510
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl_grad_is_view
Reviewed By: rohan-varma
Differential Revision: D28941327
fbshipit-source-id: 932e8ddbdb2bfd609a78943f6dc390d3d6ca333f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59522
If the gradients before allreduce are large, then the sum after allreduce may overflow, especially for FP16. Therefore, apply the division before allreduce.
This fix is applied to both C++ and Python comm hooks.
ghstack-source-id: 130686229
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl_grad_is_view
Reviewed By: rohan-varma
Differential Revision: D28922548
fbshipit-source-id: 442bd3cc7a35a8b948f626062fa7ad2e3704c5be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57410
FP16 gradient compression may run into an 'inf' issue. Switching to division before allreduce can avoid this problem.
ghstack-source-id: 127877083
Test Plan:
before change:
f268909897
after change:
f270950609
If you still see 'grad_norm = inf' after enabling the FP16 hook, you can resume the training with the hook turned off.
Reviewed By: SciPioneer
Differential Revision: D28128628
fbshipit-source-id: 0b6648637713e4f321e39c9ccb645a6b6f1750a0
Summary:
As this diff shows, currently there are a couple hundred instances of raw `noqa` in the codebase, which just ignore all errors on a given line. That isn't great, so this PR changes all existing instances of that antipattern to qualify the `noqa` with respect to a specific error code, and adds a lint to prevent more of this from happening in the future.
Interestingly, some of the examples the `noqa` lint catches are genuine attempts to qualify the `noqa` with a specific error code, such as these two:
```
test/jit/test_misc.py:27: print(f"{hello + ' ' + test}, I'm a {test}") # noqa E999
test/jit/test_misc.py:28: print(f"format blank") # noqa F541
```
However, those are still wrong because they are [missing a colon](https://flake8.pycqa.org/en/3.9.1/user/violations.html#in-line-ignoring-errors), which actually causes the error code to be completely ignored:
- If you change them to anything else, the warnings will still be suppressed.
- If you add the necessary colons then it is revealed that `E261` was also being suppressed, unintentionally:
```
test/jit/test_misc.py:27:57: E261 at least two spaces before inline comment
test/jit/test_misc.py:28:35: E261 at least two spaces before inline comment
```
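For reference, the qualified form that flake8 actually honors requires the colon; a minimal sketch (the specific code `F401` is just an example):

```python
# Correct qualified form: the colon makes flake8 suppress only F401
# ("imported but unused") on this line.
import os  # noqa: F401

# Without the colon, `# noqa F401` is parsed as a bare `# noqa` and
# suppresses EVERY error on the line, which is the antipattern this
# lint catches.
print("qualified noqa needs a colon")
```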
I did try using [flake8-noqa](https://pypi.org/project/flake8-noqa/) instead of a custom `git grep` lint, but it didn't seem to work. This PR is definitely missing some of the functionality that flake8-noqa is supposed to provide, though, so if someone can figure out how to use it, we should do that instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56272
Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI run (before this PR was finished) failed:
- https://github.com/pytorch/pytorch/runs/2365189927
Reviewed By: janeyx99
Differential Revision: D27830127
Pulled By: samestep
fbshipit-source-id: d6dcf4f945ebd18cd76c46a07f3b408296864fcb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55738
Per title, and use 0 as the default value.
It turns out that setting this epsilon as 0 can accelerate convergence and improve accuracy for some use cases.
Test Plan:
unit tests
f264687105
f264675194
Reviewed By: shuyingsunshine21
Differential Revision: D27694971
fbshipit-source-id: b61528c6c817127974acdc4635bccf607532287f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55666
{F590513307}
Some code is not properly displayed due to an extra whitespace ahead of `(num_rows + num_cols)`.
ghstack-source-id: 126148569
Test Plan: Locally viewed
Reviewed By: rohan-varma
Differential Revision: D27673663
fbshipit-source-id: 603ae4ddbe86ceaefc311885b82b0f6b48b57b27
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55295
Update `_powerSGD_comm_hook_wrapper` to expose only the 2 most critical hyperparameters, to make this API clearer to any future user (although the second hyperparameter `start_powerSGD_iter` is not in use yet).
Test Plan: waitforbuildbot
Reviewed By: shuyingsunshine21
Differential Revision: D27561734
fbshipit-source-id: b661981cc033b109f4f2fc92b435567a184a7fb5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55272
1. Set 1K as the default value of `start_powerSGD_iter` for practicality. The original default value 10 is usually too small for real use cases. The new default value 1K is also consistent with PyTorch Lightning.
2. Update the docstring of `start_powerSGD_iter` to remind the users to set a value no less than the warm-up steps if any.
3. Update some unit tests to start PowerSGD early.
ghstack-source-id: 125707662
Test Plan: waitforbuildbot
Reviewed By: shuyingsunshine21
Differential Revision: D27553388
fbshipit-source-id: 40076419bc85755c0c0b64b79ba914b241085fcc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55253
Previously DDP communication hooks took a tensor list as the input. Now they take only a single tensor, in preparation for retiring SPMD and providing only a single model replica to DDP communication hooks.
The next step is limiting Reducer to only 1 model replica.
ghstack-source-id: 125677637
Test Plan: waitforbuildbot
Reviewed By: zhaojuanmao
Differential Revision: D27533898
fbshipit-source-id: 5db92549c440f33662cf4edf8e0a0fd024101eae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55103
Previously the compression rate was reported only in the PowerSGD hook. Also report this metric here for comprehensive experimentation.
It is very easy to compute the sizes before and after compression, because there is only one matrix factorization per bucket, and no accumulation within the bucket is needed.
1) The size before compression is the input tensor size.
2) The size after compression is the size of P + Q, where each has a size of `square_side_length * state.matrix_approximation_rank`.
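The two sizes can be sketched with hypothetical numbers (the padding of the flattened bucket to a square side is an assumption for illustration):

```python
import math

# Hypothetical bucket: a flat tensor of `numel` elements, compressed as
# P and Q, each of shape (square_side_length, matrix_approximation_rank).
numel = 2_000_000                    # size before compression (elements)
rank = 8                             # matrix_approximation_rank
side = math.ceil(math.sqrt(numel))   # square_side_length for the padded matrix
compressed = 2 * side * rank         # elements in P plus Q
compression_rate = numel / compressed
print(side, compressed, round(compression_rate, 1))  # 1415 22640 88.3
```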
ghstack-source-id: 125399028
Test Plan: Tested by running scripts/wayi/torch/power_sgd.py locally.
Reviewed By: deadlybulb
Differential Revision: D27474295
fbshipit-source-id: a2225e85be03ab20238f01014d5ec9ae1787c4fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54838
Realize that an explicit sync is somehow still needed for the batched PowerSGD hook. I find that a job failure can be fixed by this change.
The sync was once removed by #54482.
Test Plan:
f260900882
f260899693
Reviewed By: rohan-varma
Differential Revision: D27384738
fbshipit-source-id: 3efd738b9fd375e2ceb36ed3a6bf99cd8ce8ff95
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54647
Regularly log stats showing effect of gradient compression when using the PowerSGD DDP communication hook.
Test Plan:
buck run mode/dev-nosan scripts/wayi/torch:power_sgd
Play with the layer sizes of the input model (you can just use linear layers for convenience), and check the log that shows compression stats. For convenience, you can change `logging.info` to `print` locally.
You can create some test diffs on top of this diff, to show that the compression stats are correct in different cases.
Run with power_sgd script:
{F537381542}
Diff with example using a simple linear model: D27299934
sample output:
{F538486535}
Reviewed By: SciPioneer
Differential Revision: D27240254
fbshipit-source-id: 9e142b2f7957cc874804f799b7bb3bffdf824858
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53780
Update the comment, because the input data type of `fp16_compress_hook` does not have to be FP32. For example, the input dtype can also be FP64, as long as it can be cast to FP16.
ghstack-source-id: 123680621
Test Plan: N/A
Reviewed By: iseessel
Differential Revision: D26967224
fbshipit-source-id: 26d79a3629a597e6335b6f59c97d25a764a8ed80
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52979
Compression rate = uncompressed size / compressed size, so the compression rate is usually greater than 1.
Previously the compression rate was perceived as compressed size / uncompressed size, which can be very confusing.
ghstack-source-id: 122996272
Test Plan: unit tests
Reviewed By: zhaojuanmao
Differential Revision: D26713349
fbshipit-source-id: 83b7f8908c101954cf01f56a22161047fbfeaa53
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53010
To determine the boundary between different iterations in a DDP communication hook, currently the user code needs to check `bucket.get_index() == 0`, which involves internal bucketization implementation details and undermines the usability of DDP communication hooks.
Create an API to hide the details and improve the usability before publishing GradBucket APIs.
ghstack-source-id: 122723081
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
Reviewed By: rohan-varma
Differential Revision: D26720813
fbshipit-source-id: f4a3147382c1f970534d7f0dee0cd599156c8b8c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53009
It can be a common operation to apply layer-wise operations over per-parameter tensors in a DDP communication hook.
Create a util method in GradBucket class before publishing GradBucket APIs.
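A hedged sketch of what such a per-parameter utility could look like: splitting the bucket's flat tensor into per-parameter views by offsets (the actual GradBucket method name is not shown here; `per_parameter_tensors` is illustrative):

```python
import torch

def per_parameter_tensors(flat, shapes):
    """Split a flat bucket tensor into per-parameter views (no copies)."""
    views, offset = [], 0
    for shape in shapes:
        n = 1
        for d in shape:
            n *= d
        views.append(flat[offset:offset + n].view(shape))
        offset += n
    return views

flat = torch.arange(12.0)
views = per_parameter_tensors(flat, [(2, 3), (6,)])
views[0].mul_(0)   # a layer-wise op writes through to the bucket storage
print(flat[:6])    # tensor([0., 0., 0., 0., 0., 0.])
```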
ghstack-source-id: 122833594
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
f254364097
Reviewed By: rohan-varma
Differential Revision: D26717893
fbshipit-source-id: 916db319de8b85dd22bc4e35db5671bf4e34740f
Summary:
Fixes #52034
- Add a minimum compression rate threshold to `PowerSGDState`
- Use the threshold to determine whether to compress high-rank tensors or not
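The threshold check can be sketched as follows (names and the default threshold value are illustrative, not the actual API):

```python
def should_compress(num_rows, num_cols, rank, min_compression_rate=2.0):
    """Compress a (num_rows, num_cols) tensor only if PowerSGD actually
    shrinks it by at least `min_compression_rate` (uncompressed elements
    over compressed elements)."""
    uncompressed = num_rows * num_cols
    compressed = (num_rows + num_cols) * rank   # elements in P plus Q
    return uncompressed >= min_compression_rate * compressed

print(should_compress(1024, 1024, rank=8))  # True: ~1M vs ~16K elements
print(should_compress(10, 10, rank=8))      # False: 100 vs 160 elements
```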
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52541
Test Plan:
No performance regression using rank-8 compression:
baseline: f253000411
updated one: f253010955
Reviewed By: rohan-varma
Differential Revision: D26594862
Pulled By: SciPioneer
fbshipit-source-id: 2859a91b4ca6bd1862bf6cd6441dc2a89badb2d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52593
This hook is not used at all, and it probably can only be used for demonstrating that allgather is slower than allreduce, so it should never be used in practice.
However, this hook and its helper function stay with the communication hook public APIs in the same file. It will be better to make the public API file as concise as possible.
Since I don't think we will use this hook in the future, prefer deleting it to moving it to a separate file.
ghstack-source-id: 122180969
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D26575318
fbshipit-source-id: b258154a7c92e33236c34104bd79bc244ecdb158
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51427
A user reported that `start_PowerSGD_iter` failed when it's set as 1. This is because allocating memory for error tensors somehow overlaps with the bucket rebuilding process at iteration 1.
Check `start_PowerSGD_iter > 1` instead of `start_PowerSGD_iter >= 1`.
Also add a unit test of `test_invalid_powerSGD_state` and some guidance on tuning PowerSGD configs.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120834126
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_invalid_powerSGD_state
Reviewed By: rohan-varma
Differential Revision: D26166897
fbshipit-source-id: 34d5b64bb3dd43acb61d792626c70e6c8bb44a5d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51270
Similar to #50973, allow the batched version to run vanilla allreduce for the first K iterations.
This may be useful if the batched version can be applied to some use cases where the accuracy requirement is not very strict.
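The hybrid gating boils down to running vanilla allreduce until `start_powerSGD_iter` is reached; a minimal sketch (`State` and `hook` are illustrative stand-ins, not the real hook signature):

```python
class State:
    def __init__(self, start_powerSGD_iter):
        self.start_powerSGD_iter = start_powerSGD_iter
        self.iter = 0

def hook(state, grad, allreduce, compress):
    """Run vanilla allreduce for the first K iterations, PowerSGD afterwards."""
    out = allreduce(grad) if state.iter < state.start_powerSGD_iter else compress(grad)
    state.iter += 1
    return out

state = State(start_powerSGD_iter=2)
log = [hook(state, g, lambda g: "allreduce", lambda g: "powerSGD") for g in range(4)]
print(log)  # ['allreduce', 'allreduce', 'powerSGD', 'powerSGD']
```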
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725858
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
baseline: f248001754
batched PowerSGD: f246960752
The training time was reduced from 54m48s to 30m33s, and the accuracy is approximately the same: 44.21 vs 44.35
Reviewed By: rohan-varma
Differential Revision: D26077709
fbshipit-source-id: 6afeefad7a3fbdd7da2cbffb56dfbad855a96cb5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50985
Explicitly specify the dtype of error tensor when it is initialized by zeros.
Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, although it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).
This change will make the dtype of error tensor look more clear.
Additionally, also explicitly specify the dtype if the rank-1 tensor buffer is empty.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120377786
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: rohan-varma
Differential Revision: D26034988
fbshipit-source-id: e0d323d0b77c6a2478cdbe8b31a1946ffd1a07da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50981
Since vanilla allreduce will be applied in the first few iterations, the bucket rebuilding process will not affect the caching of per-variable tensors.
Previously the cached tensors used for error feedback and warm-up need to be rebuilt later, because their corresponding input tensors' shape will be changed after the bucket rebuild process.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120617971
Test Plan: real run
Reviewed By: rohan-varma
Differential Revision: D26034418
fbshipit-source-id: e8744431c7f3142d75b77b60110e6861c2ff5c14
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50973
This can extend the original PowerSGD method to a hybrid approach: vanilla allreduce + PowerSGD. This can help further improve the accuracy, at the cost of a lower speedup.
Also add more comments on the fields in `PowerSGDState`.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120257202
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: rohan-varma
Differential Revision: D26031478
fbshipit-source-id: d72e70bb28ba018f53223c2a4345306980b3084e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50283
Realize that for the layerwise compression, the previous warm-start implementation only skips memory allocations, but does not skip filling random values for Qs.
Also fix the unit test in distributed_test.py. Previously the process group was not created correctly, so no communication occurred in `test_DistributedDataParallel_powerSGD_ddp_comm_hook`.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120101220
Test Plan:
Verified the fix by adding some logging locally.
Also verified no NE diff on Ads 1x.
Reviewed By: rohan-varma
Differential Revision: D25846222
fbshipit-source-id: 1ebeeb55ceba64d4d904ea6ac1bb42b1b2241520
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49711
`torch.cuda.synchronize` uses the current device by default. Explicitly specify this device for better readability.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 119017654
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: rohan-varma
Differential Revision: D25672267
fbshipit-source-id: 62a2266727a2ea76175f3c438daf20951091c771
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49709
Since wait() has already been called in the return statements of the precursor callbacks, there is no need to wait again.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 119015237
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: rohan-varma
Differential Revision: D25672068
fbshipit-source-id: da136327db4c4c0e3b846ba8d6885629f1044374
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49451
Reuse the low-rank tensors P(s) and Q(s) from the previous iteration if possible.
This can give a better compression performance in terms of both accuracy and speed.
Also add a unit test for batched PowerSGD to test_c10d.py.
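Warm start amounts to caching the low-rank factors per bucket and reusing them; a minimal sketch (names are illustrative, not the actual state layout):

```python
import torch

def get_q(state_cache, bucket_index, shape, warm_start=True):
    """Reuse Q from the previous iteration when warm start is enabled."""
    if warm_start and bucket_index in state_cache:
        return state_cache[bucket_index]   # previous iteration's Q
    q = torch.randn(shape)
    state_cache[bucket_index] = q
    return q

cache = {}
q1 = get_q(cache, 0, (100, 4))
q2 = get_q(cache, 0, (100, 4))
print(q1 is q2)  # True: the same tensor object is reused across iterations
```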
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 119014132
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: rohan-varma
Differential Revision: D25583086
fbshipit-source-id: a757df3c4cfcc0ead4647f7de2f43198f1e063ee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49435
Previously, the assertion prevented the illegal memory access because `torch.any` returns a boolean value, which initiates a data transfer from the device to the host and forces a synchronization.
An explicit synchronization is more to the point.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118664204
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: rohan-varma
Differential Revision: D25573484
fbshipit-source-id: 516d0d502da2863b516c15332702335ee662f072
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49418
Add error feedback to the original implementation of PowerSGD.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118670930
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: rohan-varma
Differential Revision: D25555538
fbshipit-source-id: c01145cc9acf574a4c6aa337dbbba0ba7d9350b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49639
Resubmit #49417 with a fix for distributed_test.
The previous submission broke a multi-gpu test that runs on 4 GPUs. Since this test only runs on master, the breakage couldn't be detected before the submission.
The real diff is:
4ca1014bb5
This time I have verified that the previous failed test `pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test` could pass after creating a PR (#49651) from a separate branch:
https://app.circleci.com/pipelines/github/pytorch/pytorch/253644/workflows/c1c02b70-0877-40e6-8b4c-61f60f6b70ed/jobs/9768079
ghstack-source-id: 118969912
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: mrshenli
Differential Revision: D25654961
fbshipit-source-id: 2a45c8ceb9bdb54ff7309a8b66ec87e913e0150e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49417
The existing implementation applies PowerSGD to a batch of flattened tensors, which is a coarse-grained compression. That hook is now renamed to "batched_powerSGD_hook".
Now implement the original algorithm from the paper, which applies PowerSGD to each per-parameter tensor. This is a layerwise, fine-grained compression. Although this original implementation is slower, it is expected to achieve a higher accuracy, especially when the shapes of per-param tensors cannot be aligned.
Also add a test in distributed_test.py.
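The layerwise variant applies one rank-r power-iteration round per tensor; a hedged sketch of the core math (the function name is illustrative, and the real hook also allreduces P and Q between the two steps, which is omitted here):

```python
import torch

def powersgd_step(m, q):
    """One PowerSGD round on matrix `m` with warm-started `q` of shape (cols, rank)."""
    p = m @ q                     # (rows, rank): first allreduce payload
    p, _ = torch.linalg.qr(p)     # orthogonalize P (the paper uses Gram-Schmidt)
    q = m.t() @ p                 # (cols, rank): second allreduce payload
    return p @ q.t(), q           # decompressed low-rank approximation, new Q

torch.manual_seed(0)
m = torch.randn(8, 2) @ torch.randn(2, 6)   # an exactly rank-2 "gradient" matrix
q = torch.randn(6, 2)
approx, q = powersgd_step(m, q)
print(torch.allclose(approx, m, atol=1e-4))  # True: rank-2 input is recovered
```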
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118921275
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: rohan-varma
Differential Revision: D25511543
fbshipit-source-id: 19ef188bc2d4c7406443c8fa233c1f2c2f27d93c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49374
After the assertion is added, the NaN error on certain training runs disappears.
It seems that the real error is caused by the underlying illegal memory access. This is a temporary workaround.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118572471
Test Plan:
Real run on Ads 10X model: scripts/wayi/mast_prof_gradient_compression.sh POWER_SGD 8
To reproduce the error, just comment out the assertion.
Reviewed By: rohan-varma
Differential Revision: D25548299
fbshipit-source-id: 039af7d94a27e0f47ef647c6163fd0e5064951d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49246
Previously the comment on `matrix_approximation_rank` was in the `powerSGD_hook` function. Now move it into `PowerSGDState`, because the function arg has already been moved into this state as an attribute.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118414247
Test Plan: N/A
Reviewed By: rohan-varma
Differential Revision: D25501091
fbshipit-source-id: 701e3109a9a3f2a5f9d18d5bf6d0a266518ee8ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48505
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
FutureNCCL isn't just adding CUDA support to ivalue::Future, it's also reimplementing a lot of the latter's logic (by overriding plenty of its methods). That's brittle, as whenever a new method is added to ivalue::Future there's a risk of forgetting to add it to FutureNCCL, and in such a case calling this method on FutureNCCL would defer to the base class and give inconsistent results (e.g., the future not being completed when it actually is). This _is already happening_, for example with waitAndThrow or hasError, which are not implemented by FutureNCCL. In addition, this creates duplication between the two classes, which could lead to inconsistencies of behavior, bugs, missing features, ...
The best solution would be to keep the core future logic in ivalue::Future, and have _only_ the CUDA additions in FutureNCCL. That's what we're going to do, in two steps. In the previous commit, I split the CUDA features into separate hooks, which are called by FutureNCCL's other methods. In this commit, I'm removing these latter methods, and invoke the hooks directly from ivalue::Future.
ghstack-source-id: 118180032
Test Plan: Unit tests
Reviewed By: wanchaol
Differential Revision: D25180535
fbshipit-source-id: 19181fe133152044eb677062a9e31e5e4ad3c03c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48902
Previously, if the dtype of the input gradients was FP16, matrix multiplications would fail, because the created low-rank tensors P and Q used the FP32 dtype.
Now let the dtype of P and Q be the same as the input tensor.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117962078
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
Reviewed By: rohan-varma
Differential Revision: D25362071
fbshipit-source-id: e68753ff23bb480605b02891e128202ed0f8a587
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48867
Previously the key of error_dict was the hashcode of the tensor. Now it is replaced with the bucket index.
The bucket index has a few advantages over the tensor hashcode.
1) Error dict in the state never removes any key. If the bucket rebuild process occurs frequently, the size of error dict can increase. For now, such rebuild process is infrequent, so it is probably fine.
2) Integer index has a better readability than hashcode, and it can facilitate debugging.
If the user wants to debug the tensor values, usually only a specific bucket needs to be targeted. It's easy to specify such a condition (e.g., `bucket_index == 0`), but it's hard to specify a hashcode in advance, as it can only be determined at runtime.
Note that sometimes the buckets can be rebuilt in the forward pass. In this case, the shape of the bucket with the same index will not be consistent with the one in the previous iteration, and hence the error tensor will be re-initialized as a zero tensor of the new shape. Therefore, `and state.error_dict[bucket_index].shape[0] == padded_total_length` is added to the condition of applying the local error from the previous iteration.
Deleted the arg type of `dist._GradBucket` in powerSGD_hook.py, because somehow test_run_mypy - TestTypeHints failed:
AssertionError: mypy failed: torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py:128: error: "_GradBucket" has no attribute "get_index" [attr-defined]
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117951402
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
Reviewed By: rohan-varma
Differential Revision: D25346347
fbshipit-source-id: 8348aa103002ec1c69e3ae759504b431140b3b0d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48670
Support an optional error feedback for PowerSGD -- storing the difference (i.e., the local error caused by compression) between the input gradient (adjusted by the existing error) and the gradient after decompression, and reinserting it at the next iteration.
Still need to add an index field to GradBucket as the key of error_dict. This is because the current key, input tensor of the bucket, can change across steps, as the buckets may be rebuilt in forward pass in order to save peak memory usage.
This is halfway of error feedback. Plan to add the new index field in a separate PR.
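The error-feedback loop described above can be sketched as follows (the sign-based toy compressor is purely illustrative; the real hook compresses with PowerSGD and keys the error dict by bucket):

```python
import torch

def compress_with_error_feedback(grad, error_dict, key, compress, decompress):
    """Add last iteration's compression error, compress, and store the new error."""
    if key in error_dict:
        grad = grad + error_dict[key]                    # reinsert the local error
    compressed = compress(grad)
    error_dict[key] = grad - decompress(compressed)      # what compression lost
    return compressed

# Toy "compression": keep only the sign scaled by the mean magnitude.
compress = lambda g: g.abs().mean() * g.sign()
decompress = lambda c: c

errors = {}
g = torch.tensor([1.0, -3.0])
c1 = compress_with_error_feedback(g, errors, 0, compress, decompress)
print(c1, errors[0])  # tensor([ 2., -2.]) and residual tensor([-1., -1.])
```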
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117636492
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
Reviewed By: rohan-varma
Differential Revision: D25240290
fbshipit-source-id: 5b6e11e711caccfb8984ac2767dd107dbf4c9b3b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48507
Previously the random seed was the length of the input tensor, which is not guaranteed to be different for different batches. Now initialize a random generator in the PowerSGD state, and use this generator to create a random seed to randomize the low-rank tensor Q at every step.
Therefore, the initial tensor Q should be the same across all the replicas at the same step, but different at different steps.
'torch.manual_seed' is used in the same way as https://github.com/epfml/powersgd/blob/master/gradient_reducers.py#L675
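A minimal sketch of seeding Q from a value shared across ranks (here a local `torch.Generator` is used for isolation, whereas the commit uses `torch.manual_seed` with a seed produced by the state's generator):

```python
import torch

def init_q(step, shape):
    """Every rank seeds from the shared step value, so Q starts identical
    on all replicas at the same step."""
    gen = torch.Generator().manual_seed(step)
    return torch.randn(shape, generator=gen)

q_rank0 = init_q(step=7, shape=(5, 2))
q_rank1 = init_q(step=7, shape=(5, 2))
print(torch.equal(q_rank0, q_rank1))  # True: same step -> same initial Q
```

A different step value yields a different seed, which varies Q across steps as described above.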
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117483639
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d --
test_powerSGD_ddp_comm_hook_nccl_grad_is_view
Also checked the initial Qs and the random seeds passed to torch.manual_seed() on different ranks for a few steps in real runs.
Example logs:
Exactly the same random seed on different ranks at the same step on two nodes, and the random seed varies at each step.
{F346971916}
Reviewed By: rohan-varma
Differential Revision: D25191589
fbshipit-source-id: f7f17df3ad2075ecae1a2a56ca082160f7c5fcfc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48348
To support features like error feedback and warm start, the PowerSGD comm hook needs to maintain a state beyond the process group. Currently this state only includes a process group and a matrix approximation rank config.
This diff is a pure refactoring. Plan to add more state fields later.
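Roughly, the state at this point can be pictured as follows (an illustrative sketch, not the actual class; later diffs add more fields):

```python
import dataclasses

@dataclasses.dataclass
class PowerSGDState:
    # only these two fields exist after this refactoring diff
    process_group: object          # a dist.ProcessGroup in real use; None here
    matrix_approximation_rank: int = 1

state = PowerSGDState(process_group=None, matrix_approximation_rank=2)
```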
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117305280
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d --
test_powerSGD_ddp_comm_hook_nccl_grad_is_view
Reviewed By: rohan-varma
Differential Revision: D25137962
fbshipit-source-id: cd72b8b01e20f80a92c7577d22f2c96e9eebdc52
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48253
Explained why a hand-crafted orthogonalize function is used instead of `torch.qr`.
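For context, such a hand-crafted orthogonalization is typically a per-column Gram-Schmidt pass; the sketch below is an illustrative stand-in, not the exact PyTorch implementation:

```python
import torch

def orthogonalize(matrix, eps=1e-8):
    # modified Gram-Schmidt, in place: normalize column i, then remove its
    # projection from every later column
    num_cols = matrix.shape[1]
    for i in range(num_cols):
        col = matrix[:, i : i + 1]
        col /= torch.norm(col) + eps
        if i + 1 < num_cols:
            rest = matrix[:, i + 1 :]
            rest -= col @ (col.t() @ rest)

torch.manual_seed(0)
m = torch.randn(6, 3)
orthogonalize(m)  # columns of m are now (nearly) orthonormal
```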
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117132622
Test Plan: N/A
Reviewed By: rohan-varma
Differential Revision: D25088607
fbshipit-source-id: ebc228afcb4737bb8529e7143ea170086730520e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48060
Implement a PowerSGD variant that applies to a batched flattened tensor with zero padding.
This version does not require handling 1D tensors and multi-dimensional tensors in the input separately, and hence it does not need to create two asynchronous future chains.
Potential optimizations:
1) Consider FP16 compression throughout PowerSGD.
2) Warm start and save one matrix multiplication per iteration.
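The flatten-and-pad step this variant relies on can be illustrated as follows (shapes and names are hypothetical):

```python
import math
import torch

grads = [torch.randn(3, 4), torch.randn(5)]     # mixed 2D and 1D gradients
flat = torch.cat([g.flatten() for g in grads])  # 17 elements in one buffer
side = math.isqrt(flat.numel() - 1) + 1         # smallest n with n * n >= numel
padded = torch.zeros(side * side)
padded[: flat.numel()] = flat                   # zero padding at the tail
matrix = padded.view(side, side)                # one matrix -> one future chain
```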
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117105938
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
Reviewed By: jiayisuse
Differential Revision: D24843692
fbshipit-source-id: f44200b1fd6e12e829fc543d21ab7ae086769561
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47158
1. Test the default Python comm hook implementations ALLREDUCE and FP16_COMPRESS, besides an ad-hoc all-reduce implementation.
2. Typo fix.
3. Reformat default_hooks.py.
4. Publish register_comm_hook API for DDP module (This should be done in a separate diff, but got merged unintentionally.)
The new style makes it easy to test any new comm hook, such as PowerSGD.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 116012600
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
Reviewed By: rohan-varma
Differential Revision: D24669639
fbshipit-source-id: 048c87084234edc2398f0ea6f01f2f083a707939
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47270
This is almost the same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType should be imported conditionally, only when dist.is_available(). Otherwise, this Python enum type, defined in caffe2/torch/csrc/distributed/c10d/init.cpp, cannot be imported. See https://github.com/pytorch/pytorch/issues/47153
I tried to follow another enum type, ReduceOp, defined in the same file, but that did not work, because the C++ enum class ReduceOp is defined in the torch/lib/c10d library, while BuiltinCommHookType is defined in the torch/csrc/distributed library. These two libraries are compiled in two different ways.
To avoid adding typing to the distributed package, which could be a project of its own, I simply removed the BuiltinCommHookType arg type in this file.
To review the diff on top of #46959, compare V1 vs Latest:
https://www.internalfb.com/diff/D24700959?src_version_fbid=270445741055617
Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set, a C++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the built-in comm hooks.
Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115783237
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
//arvr/projects/eye_tracking/Masquerade:python_test
USE_DISTRIBUTED=0 USE_GLOO=0 BUILD_TEST=0 USE_CUDA=1 USE_MKLDNN=0 DEBUG=0 python setup.py install
Reviewed By: mrshenli
Differential Revision: D24700959
fbshipit-source-id: 69f303a48ae275aa856e6e9b50e12ad8602e1c7a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46959
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set, a C++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the built-in comm hooks.
Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115629230
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
Reviewed By: pritamdamania87
Differential Revision: D24471910
fbshipit-source-id: f96b752298549ea2067e2568189f1b394abcd99a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46078
The peak memory usage of the DDP comm hook has increased due to an extra copy of the gradient tensors. To reduce memory usage, decompress the fp16 tensor in place, into the tensor stored in the gradient bucket.
#Closes: https://github.com/pytorch/pytorch/issues/45968
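The in-place decompression can be illustrated as follows, with `bucket_tensor` standing in for the tensor stored in the gradient bucket:

```python
import torch

bucket_tensor = torch.randn(8)                # fp32 gradients in the bucket
compressed = bucket_tensor.to(torch.float16)  # fp16 copy sent over the wire
# decompress directly into the bucket's own storage instead of allocating a
# second fp32 tensor; copy_ casts fp16 back to fp32 on the fly
bucket_tensor.copy_(compressed)
```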
ghstack-source-id: 113996453
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_accumulate_gradients_no_sync_allreduce_hook
Also verified the decrease in memory consumption with some toy modeling examples.
Reviewed By: pritamdamania87
Differential Revision: D24178118
fbshipit-source-id: 453d0b52930809bd836172936b77abd69610237a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44643
This method is not used anywhere else.
Also formatted the file.
Test Plan: buck test caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks
Reviewed By: pritamdamania87
Differential Revision: D23675945
fbshipit-source-id: 2d04f94589a20913e46b8d71e6a39b70940c1461
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43310
In this diff, we prepared some example DDP communication hooks [#40848](https://github.com/pytorch/pytorch/pull/40848):
1\. `allreduce_hook`: This DDP communication hook just calls ``allreduce`` using ``GradBucket`` tensors. Once gradient tensors are aggregated across all workers, its ``then`` callback takes the mean and returns the result. If a user registers this hook, the DDP results are expected to be the same as when no hook was registered. Hence, this won't change the behavior of DDP, and users can use it as a reference, or modify it to log useful information or for any other purpose, without affecting DDP behavior.
2\. `allgather_then_aggregate_hook`: Similar to ``allreduce_hook``, this hook first gathers ``GradBucket`` tensors, and its ``then`` callback aggregates the gathered gradient tensors and takes the mean. Instead of ``allreduce``, this hook uses ``allgather``. Note that with W workers, both the computation and communication time scale as O(W) for allgather, compared to O(logW) for allreduce. Therefore, this hook is expected to be much slower than ``allreduce_hook``, although both essentially do the same thing with the gradients.
3\. `fp16_compress_hook`: This DDP communication hook implements a simple gradient compression approach that converts ``GradBucket`` tensors, whose type is assumed to be ``torch.float32``, to half-precision floating point format (``torch.float16``). It allreduces those ``float16`` gradient tensors. Once the compressed gradient tensors are allreduced, its ``then`` callback, ``decompress``, converts the aggregated result back to ``float32`` and takes the mean.
4\. `quantization_pertensor_hook`: Does per-tensor quantization, using the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html. Note that we separately send the scale and zero_point (two floats per rank) before the quantized tensors.
5\. `quantization_perchannel_hook`: Does per-channel quantization, similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that, after the initial QSGD study diff, we realized that for considerably large gradient tensors (e.g., a tensor containing 6 million floats), dividing them into smaller channels (512-float chunks) and quantizing each channel independently may significantly increase the resolution and result in lower error.
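For example, the compress/decompress arithmetic at the heart of hook 3 can be sketched with the allreduce simulated locally (`world_size` and the summation are mocked; the real hook performs an async allreduce and takes the mean in its `then` callback):

```python
import torch

world_size = 4
grad = torch.randn(10)               # fp32 bucket tensor on this rank
compressed = grad.to(torch.float16)  # halve the bytes on the wire
# an allreduce(compressed) would run here; pretend all ranks sent this grad
summed = compressed * world_size
decompressed = summed.to(torch.float32) / world_size  # back to fp32, take mean
```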
ghstack-source-id: 110923269
Test Plan:
python torch/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py
Couldn't download test skip set, leaving all tests enabled...
.....
----------------------------------------------------------------------
Ran 4 tests in 26.724s
OK
Internal testing:
```
buck run mode/dev-nosan //caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks
```
Reviewed By: malfet
Differential Revision: D22937999
fbshipit-source-id: 274452e7932414570999cb978ae77a97eb3fb0ec