Commit Graph

39 Commits

Author SHA1 Message Date
Yi Wang
c08078031f [Gradient Compression] Allow BatchedPowerSGD to run vanilla allreduce for the first K iterations (#51270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51270

Similar to #50973, allow the batched version to run vanilla allreduce for the first K iterations.

This may make the batched version applicable to some use cases where the accuracy requirement is not very strict.
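Purely as an illustration (not part of this PR's diff), a minimal registration sketch; `start_powerSGD_iter` is assumed here to be the state field that controls how many initial iterations fall back to vanilla allreduce:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

# Assumes the default NCCL process group is already initialized, e.g. via
# dist.init_process_group("nccl", ...), with one GPU per process.
model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[torch.cuda.current_device()])

state = powerSGD.PowerSGDState(
    process_group=None,           # use the default process group
    matrix_approximation_rank=1,
    start_powerSGD_iter=1000,     # vanilla allreduce for the first 1,000 iterations
)
model.register_comm_hook(state, powerSGD.batched_powerSGD_hook)
```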

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725858

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

baseline: f248001754
batched PowerSGD: f246960752

The training time was reduced from 54m48s to 30m33s, and the accuracy is approximately the same: 44.21 vs 44.35

Reviewed By: rohan-varma

Differential Revision: D26077709

fbshipit-source-id: 6afeefad7a3fbdd7da2cbffb56dfbad855a96cb5
2021-02-01 15:26:29 -08:00
Yi Wang
0831984ed5 [Resubmission][Gradient Compression] Refactor default_hooks.py and powerSGD_hook.py by creating a util function that make a vanilla allreduce future (#51400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51400

Resubmission of #51094

Address https://github.com/pytorch/pytorch/pull/50973#discussion_r564229818

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725690

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D26162333

fbshipit-source-id: ccc2eae5383a23673e00d61cb5570fb8bf749cd0
2021-02-01 11:34:41 -08:00
Iurii Zdebskyi
5a406c023e Revert D26070147: [Gradient Compression] Refactor default_hooks.py and powerSGD_hook.py by creating a util function that make a vanilla allreduce future
Test Plan: revert-hammer

Differential Revision:
D26070147 (e7b3496232)

Original commit changeset: 8c9339f1511e

fbshipit-source-id: fa1e9582baec9759a73b3004be9bb19bdeb6cd34
2021-01-29 09:06:24 -08:00
Yi Wang
e7b3496232 [Gradient Compression] Refactor default_hooks.py and powerSGD_hook.py by creating a util function that make a vanilla allreduce future (#51094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51094

Address https://github.com/pytorch/pytorch/pull/50973#discussion_r564229818

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120619680

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D26070147

fbshipit-source-id: 8c9339f1511e8f24cc906b9411cfe4850a5a6d81
2021-01-28 19:03:18 -08:00
Yi Wang
9d731e87de [Gradient Compression] Explicitly specify the dtype of the error tensor (#50985)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50985

Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned from another FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Additionally, explicitly specify the dtype when the rank-1 tensor buffer is empty.
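A minimal sketch of the dtype fix described above (variable names are illustrative, not the ones in the hook):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
input_tensor = torch.randn(1024, device=device).to(torch.float16)

# Before: the error buffer silently defaulted to float32 even for FP16 gradients.
error_before = torch.zeros(input_tensor.shape, device=device)

# After: the dtype of the error tensor follows the input gradient explicitly.
error_after = torch.zeros(
    input_tensor.shape, device=device, dtype=input_tensor.dtype
)
```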

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120377786

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D26034988

fbshipit-source-id: e0d323d0b77c6a2478cdbe8b31a1946ffd1a07da
2021-01-28 19:03:14 -08:00
Yi Wang
b619d37bb4 [Gradient Compression] Simplify the implementation of error feedback and warm-start (#50981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50981

Since vanilla allreduce will be applied in the first few iterations, the bucket rebuilding process will not affect the caching of per-variable tensors.

Previously, the cached tensors used for error feedback and warm-up needed to be rebuilt later, because the shapes of their corresponding input tensors would change after the bucket rebuild process.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120617971

Test Plan: real run

Reviewed By: rohan-varma

Differential Revision: D26034418

fbshipit-source-id: e8744431c7f3142d75b77b60110e6861c2ff5c14
2021-01-28 18:59:40 -08:00
Yi Wang
9f19843d19 [Gradient Compression] Typo fixes in PowerSGD (#50974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50974

Typo fixes.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120257221

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D26031679

fbshipit-source-id: 9d049b50419a3e40e53f7f1275a441e31b87717b
2021-01-25 22:55:54 -08:00
Yi Wang
ffaae32d60 [Gradient Compression] Allow PowerSGD to run vanilla allreduce for the first K iterations (#50973)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50973

This extends the original PowerSGD method into a hybrid approach: vanilla allreduce + PowerSGD. It can help further improve the accuracy, at the cost of a lower speedup.

Also add more comments on the fields in `PowerSGDState`.
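A sketch of the hybrid dispatch this adds; the `iter`/`start_powerSGD_iter` field names and the bucket-free signature are assumptions for illustration, not the exact hook code:

```python
import torch.distributed as dist

def hybrid_allreduce_or_powerSGD(state, grad_tensor, compress_fn):
    """Run vanilla allreduce for the first K iterations, then compress."""
    if state.iter < state.start_powerSGD_iter:
        state.iter += 1
        world_size = dist.get_world_size()
        fut = dist.all_reduce(grad_tensor, async_op=True).get_future()
        # Average once the allreduce completes.
        return fut.then(lambda f: f.value()[0].div_(world_size))
    state.iter += 1
    return compress_fn(state, grad_tensor)
```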

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120257202

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D26031478

fbshipit-source-id: d72e70bb28ba018f53223c2a4345306980b3084e
2021-01-25 22:38:39 -08:00
Yi Wang
439afda090 [Gradient Compression] Fix warm-start for PowerSGD layerwise compression (#50283)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50283

Realized that for the layerwise compression, the previous warm-start implementation only skips memory allocations, but does not skip filling the Qs with random values.

Also fix the unit test in distributed_test.py. Previously, the process group was not created correctly, so no communication occurred in test_DistributedDataParallel_powerSGD_ddp_comm_hook.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120101220

Test Plan:
Verified the fix by adding some logging locally.

Also verified no NE diff on Ads 1x.

Reviewed By: rohan-varma

Differential Revision: D25846222

fbshipit-source-id: 1ebeeb55ceba64d4d904ea6ac1bb42b1b2241520
2021-01-20 22:31:44 -08:00
Yi Wang
ce370398cc [Gradient Compression] Remove the extra comma after "bucket" in PowerSGD hook signatures (#50197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50197

Remove the extra comma after "bucket".
ghstack-source-id: 119513484

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25823117

fbshipit-source-id: acf048f7cb732c23cba3a81ccce1e70f6b9f4299
2021-01-07 15:56:20 -08:00
Ansley Ussery
c619892482 Fix errata (#49903)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49903

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25718411

Pulled By: ansley

fbshipit-source-id: 0cc365c5a53077752dc1c5a5c4a65b873baa3604
2020-12-28 20:40:41 -08:00
Samuel Marks
e6779d4357 [*.py] Rename "Arguments:" to "Args:" (#49736)
Summary:
I've written custom parsers and emitters for everything from docstrings to classes and functions. However, I recently came across an issue when I was parsing/generating from the TensorFlow codebase: inconsistent use of `Args:` and `Arguments:` in its docstrings.

```sh
(pytorch#c348fae)$ for name in 'Args:' 'Arguments:'; do
    printf '%-10s %04d\n' "$name" "$(rg -IFtpy --count-matches "$name" | paste -s -d+ -- | bc)"; done
Args:      1095
Arguments: 0336
```

It is easy enough to extend my parsers to support both variants; however, it looks like `Arguments:` is wrong anyway, as per:

  - https://google.github.io/styleguide/pyguide.html#doc-function-args @ [`ddccc0f`](https://github.com/google/styleguide/blob/ddccc0f/pyguide.md)

  - https://chromium.googlesource.com/chromiumos/docs/+/master/styleguide/python.md#describing-arguments-in-docstrings @ [`9fc0fc0`](https://chromium.googlesource.com/chromiumos/docs/+/9fc0fc0/styleguide/python.md)

  - https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html @ [`c0ae8e3`](https://github.com/sphinx-contrib/napoleon/blob/c0ae8e3/docs/source/example_google.rst)

Therefore, only `Args:` is valid. This PR replaces them throughout the codebase.
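For reference, the Google-style form this PR standardizes on (an illustrative function, not one from the diff):

```python
def scale(tensor, factor):
    """Multiplies a tensor by a scalar.

    Args:
        tensor: The input tensor.
        factor: The scalar multiplier.

    Returns:
        The scaled tensor.
    """
    return tensor * factor
```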

PS: For related PRs, see tensorflow/tensorflow/pull/45420

PPS: The trackbacks automatically appearing below are sending the same changes to other repositories in the [PyTorch](https://github.com/pytorch) organisation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49736

Reviewed By: albanD

Differential Revision: D25710534

Pulled By: soumith

fbshipit-source-id: 61e8ff01abb433e9f78185c2d1d0cbd7c22c1619
2020-12-28 09:34:47 -08:00
Yi Wang
55b431b17a [Gradient Compression] Directly let world_size = group_to_use.size() (#49715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49715

Address the comment on https://github.com/pytorch/pytorch/pull/49417#discussion_r545388351
ghstack-source-id: 119049598

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25673997

fbshipit-source-id: 44eb2540e5a77331c34ba503285cbd0bd63c2c0a
2020-12-22 23:24:54 -08:00
Yi Wang
88c33ff8ab [Gradient Compression] Explicitly restrict the scope of torch.cuda.synchronize to the current device (#49711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49711

`torch.cuda.synchronize` uses the current device by default. Explicitly specify this device for better readability.
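A sketch of the change (illustrative, not the exact diff):

```python
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    # Before: relies on the implicit default (the current device).
    torch.cuda.synchronize()
    # After: the target device is spelled out, which reads more clearly.
    torch.cuda.synchronize(device=device)
```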

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 119017654

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D25672267

fbshipit-source-id: 62a2266727a2ea76175f3c438daf20951091c771
2020-12-22 23:21:45 -08:00
Yi Wang
af1b636b89 [Gradient Compression] Change wait() to value() in some callbacks of PowerSGD communication hook (#49709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49709

Since wait() has already been called in the return statements of the precursor callbacks, no need to wait again.
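A minimal, self-contained sketch of the pattern (hypothetical callback, not the hook's own code):

```python
import torch

fut = torch.futures.Future()
fut.set_result([torch.ones(4)])

def chained_callback(completed_fut):
    # Inside a then() callback the precursor future is already complete,
    # so value() is sufficient; calling wait() again would be redundant.
    return completed_fut.value()[0].div_(2)

print(fut.then(chained_callback).wait())
```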

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 119015237

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D25672068

fbshipit-source-id: da136327db4c4c0e3b846ba8d6885629f1044374
2020-12-22 21:37:04 -08:00
Yi Wang
c348faedc4 [Gradient Compression] Warm-start of PowerSGD (#49451)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49451

Reuse the low-rank tensors P(s) and Q(s) from the previous iteration if possible.

This can give a better compression performance in terms of both accuracy and speed.

Also add a unit test for batched PowerSGD to test_c10d.py.
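A sketch of the warm-start idea above; the `p_memory_dict`/`q_memory_dict` field names are assumptions for illustration:

```python
import torch

class _StateSketch:
    """Illustrative container; the real PowerSGDState holds more fields."""
    def __init__(self):
        self.p_memory_dict = {}
        self.q_memory_dict = {}

def get_low_rank_factors(state, key, m, n, rank, device="cpu", dtype=torch.float32):
    """Reuse P and Q from the previous iteration when the shapes still match."""
    p = state.p_memory_dict.get(key)
    if p is None or p.shape != (m, rank):
        # Cold start: allocate fresh factors; only Q needs random initialization,
        # since P is recomputed from the gradient every step anyway.
        state.p_memory_dict[key] = torch.empty(m, rank, device=device, dtype=dtype)
        state.q_memory_dict[key] = torch.randn(n, rank, device=device, dtype=dtype)
    return state.p_memory_dict[key], state.q_memory_dict[key]
```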

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 119014132

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D25583086

fbshipit-source-id: a757df3c4cfcc0ead4647f7de2f43198f1e063ee
2020-12-22 01:19:14 -08:00
Yi Wang
96aed203bf [Gradient Compression] Replace the assertions in PowerSGD comm hook by stream synchronization (#49435)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49435

Previously, the assertion prevented illegal memory access only as a side effect: `torch.any` returns a boolean value, which initiates a data transfer from the device to the host and forces a synchronization.

An explicit synchronization is more to the point.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118664204

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D25573484

fbshipit-source-id: 516d0d502da2863b516c15332702335ee662f072
2020-12-20 17:24:06 -08:00
Yi Wang
342bfd892f [Gradient Compression] Add error feedback to layerwise PowerSGD (#49418)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49418

Add error feedback to the original implementation of PowerSGD.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118670930

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D25555538

fbshipit-source-id: c01145cc9acf574a4c6aa337dbbba0ba7d9350b2
2020-12-20 17:22:39 -08:00
Yi Wang
8b61fbdac9 Resubmit: [Gradient Compression] Implement the original layerwise PowerSGD (#49639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49639

Resubmit #49417 with a fix for distributed_test.

The previous submission broke a multi-GPU test that runs on 4 GPUs. Since this test only runs on master, it could not be detected before the submission.

The real diff is:
4ca1014bb5

This time I have verified that the previous failed test `pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test` could pass after creating a PR (#49651) from a separate branch:
https://app.circleci.com/pipelines/github/pytorch/pytorch/253644/workflows/c1c02b70-0877-40e6-8b4c-61f60f6b70ed/jobs/9768079

ghstack-source-id: 118969912

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: mrshenli

Differential Revision: D25654961

fbshipit-source-id: 2a45c8ceb9bdb54ff7309a8b66ec87e913e0150e
2020-12-20 13:02:52 -08:00
Shen Li
ad9923e5d5 Revert D25511543: [Gradient Compression] Implement the original layerwise PowerSGD
Test Plan: revert-hammer

Differential Revision:
D25511543 (71f3399e19)

Original commit changeset: 19ef188bc2d4

fbshipit-source-id: a363641a059aeacc57684884998cf8fb7363d748
2020-12-18 20:30:29 -08:00
Yi Wang
71f3399e19 [Gradient Compression] Implement the original layerwise PowerSGD (#49417)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49417

The existing implementation applies PowerSGD to a batch of flattened tensors, which is a coarse-grained compression. That hook is now renamed "batched_powerSGD_hook".

This PR implements the original algorithm from the paper, which applies PowerSGD to each per-parameter tensor. This is a layerwise, fine-grained compression. Although this original implementation is slower, it is expected to achieve a higher accuracy, especially when the shapes of the per-param tensors cannot be aligned.

Also add a test in distributed_test.py.
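A rough sketch of the layerwise idea (one low-rank factorization per parameter tensor rather than one for the whole flattened bucket); shapes, names, and the omitted communication steps are all simplifications:

```python
import torch

def compress_layerwise(grads, rank=1):
    """Approximate each gradient matrix M as P @ Q.T, tensor by tensor."""
    factors = []
    for g in grads:
        m = g.reshape(g.shape[0], -1)
        q = torch.randn(m.shape[1], rank)
        p = m @ q                 # allreduced (and orthogonalized) in the real hook
        q = m.t() @ p             # allreduced in the real hook
        factors.append((p, q))
    return factors

def decompress_layerwise(factors):
    return [p @ q.t() for p, q in factors]
```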

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118921275

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D25511543

fbshipit-source-id: 19ef188bc2d4c7406443c8fa233c1f2c2f27d93c
2020-12-18 18:02:15 -08:00
Yi Wang
a419a3e25d Add assertion on any NaN error on the error feedback (#49374)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49374

After the assertion is added, the NaN error in certain training runs disappears.

It seems that the real error is caused by an underlying illegal memory access. This is a temporary workaround.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118572471

Test Plan:
Real run on Ads 10X model: scripts/wayi/mast_prof_gradient_compression.sh POWER_SGD 8

To reproduce the error, just comment out the assertion.

Reviewed By: rohan-varma

Differential Revision: D25548299

fbshipit-source-id: 039af7d94a27e0f47ef647c6163fd0e5064951d5
2020-12-14 20:15:39 -08:00
Yi Wang
29f0fa36b1 [Gradient Compression] Minor update of the comments on PowerSGD. (#49246)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49246

Previously, the comment on matrix_approximation_rank was in the powerSGD_hook function. Now it is moved into PowerSGDState, because the function arg has already been moved into this state as an attribute.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118414247

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D25501091

fbshipit-source-id: 701e3109a9a3f2a5f9d18d5bf6d0a266518ee8ea
2020-12-11 17:45:53 -08:00
Luca Wehrstedt
4c425e8da0 Merge common parts of FutureNCCL into at::ivalue::Future (#48505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48505

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

FutureNCCL isn't just adding CUDA support to ivalue::Future; it's also reimplementing a lot of the latter's logic (by overriding plenty of its methods). That's brittle, as whenever a new method is added to ivalue::Future there's a risk of forgetting to add it to FutureNCCL, and in such a case calling this method on FutureNCCL would defer to the base class and give inconsistent results (e.g., the future not being completed when it actually is). This _is already happening_, for example with waitAndThrow or hasError, which are not implemented by FutureNCCL. In addition, this creates duplication between the two classes, which could lead to inconsistencies of behavior, bugs, missing features, ...

The best solution would be to keep the core future logic in ivalue::Future, and have _only_ the CUDA additions in FutureNCCL. That's what we're going to do, in two steps. In the previous commit, I split the CUDA features into separate hooks, which are called by FutureNCCL's other methods. In this commit, I'm removing these latter methods and invoking the hooks directly from ivalue::Future.
ghstack-source-id: 118180032

Test Plan: Unit tests

Reviewed By: wanchaol

Differential Revision: D25180535

fbshipit-source-id: 19181fe133152044eb677062a9e31e5e4ad3c03c
2020-12-10 03:54:22 -08:00
Yi Wang
c876d4f477 [Gradient Compression] Let the dtype of created low-rank tensors P and Q be the same type as the input tensor (#48902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48902

Previously, if the dtype of the input gradients was FP16, matrix multiplications would fail, because the created low-rank tensors P and Q used FP32.

Now the dtype of P and Q is the same as that of the input tensor.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117962078

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D25362071

fbshipit-source-id: e68753ff23bb480605b02891e128202ed0f8a587
2020-12-07 17:40:06 -08:00
Yi Wang
17f53bffef [Gradient Compression] Replace the key of error_dict in PowerSGD state with bucket index (#48867)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48867

Previously the key of error_dict was the hashcode of the tensor. It is now replaced with the bucket index.

The bucket index can have a few advantages over the tensor hashcode.
1) The error dict in the state never removes any key. If the bucket rebuild process occurs frequently, the size of the error dict can grow. For now, such rebuilds are infrequent, so this is probably fine.

2) An integer index is more readable than a hashcode, and it can facilitate debugging.
If the user wants to debug the tensor values, usually only a specific bucket needs to be targeted. It is easy to specify such a condition (e.g., bucket_index = 0), but it is hard to specify a hashcode in advance, as it can only be determined at runtime.

Note that sometimes the buckets can be rebuilt in the forward pass. In this case, the shape of the bucket with the same index will not be consistent with the one in the previous iteration, and hence the error tensor will be re-initialized as a zero tensor of the new shape. Therefore, `and state.error_dict[bucket_index].shape[0] == padded_total_length` is added to the condition for applying the local error from the previous iteration.
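A sketch of the lookup condition described above (the `state.error_dict` field is the one named in the message; everything else is illustrative):

```python
import torch

def get_local_error(state, bucket_index, padded_total_length, device, dtype):
    """Reuse the stored error only if the rebuilt bucket kept the same length."""
    err = state.error_dict.get(bucket_index)
    if err is not None and err.shape[0] == padded_total_length:
        return err
    # The bucket was rebuilt with a different shape (or seen for the first time):
    # start over from a zero error tensor of the new shape.
    state.error_dict[bucket_index] = torch.zeros(
        padded_total_length, device=device, dtype=dtype
    )
    return state.error_dict[bucket_index]
```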

Deleted the arg type of `dist._GradBucket` in powerSGD_hook.py, because somehow test_run_mypy - TestTypeHints failed:
AssertionError: mypy failed: torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py:128: error: "_GradBucket" has no attribute "get_index"  [attr-defined]

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117951402

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D25346347

fbshipit-source-id: 8348aa103002ec1c69e3ae759504b431140b3b0d
2020-12-05 23:53:27 -08:00
Yi Wang
9c6979a266 [Gradient Compression] Error feedback for PowerSGD (still need to fix the key in error_dict) (#48670)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48670

Support an optional error feedback for PowerSGD -- storing the difference (i.e., the local error caused by compression) between the input gradient (adjusted by the existing error) and the gradient after decompression, and reinserting it at the next iteration.

Still need to add an index field to GradBucket as the key of error_dict. This is because the current key, the input tensor of the bucket, can change across steps, as the buckets may be rebuilt in the forward pass in order to save peak memory usage.

This is only the first half of error feedback. The plan is to add the new index field in a separate PR.
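A sketch of the error-feedback bookkeeping described above (illustrative helper, not the hook itself):

```python
import torch

def apply_error_feedback(grad, error, compress, decompress):
    """Compress the error-adjusted gradient and remember what compression lost."""
    adjusted = grad + error            # re-insert the residual from the last step
    compressed = compress(adjusted)
    restored = decompress(compressed)
    new_error = adjusted - restored    # local error caused by compression
    return compressed, new_error
```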

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117636492

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D25240290

fbshipit-source-id: 5b6e11e711caccfb8984ac2767dd107dbf4c9b3b
2020-12-02 06:39:30 -08:00
Yi Wang
ddb6594971 [Gradient Compression] Add a random generator to PowerSGD state for initializing low-rank matrix Q (#48507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48507

Previously the random seed was the length of the input tensor, which is not guaranteed to be different for different batches. Now a random generator is initialized in the PowerSGD state, and this generator is used to create a random seed that randomizes the low-rank tensor Q at every step.

Therefore, the initial tensor Q should be the same across all the replicas at the same step, but different at different steps.

'torch.manual_seed' is used in the same way as https://github.com/epfml/powersgd/blob/master/gradient_reducers.py#L675
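A sketch of the seeding scheme (the generator placement and constants are assumptions for illustration):

```python
import torch

# Every replica constructs its state with the same seed, so the generators
# stay in lockstep across ranks.
rng = torch.Generator()
rng.manual_seed(0)

def fresh_q(n, rank):
    # One seed per step, identical on all replicas but different across steps,
    # so the initial Q matches across ranks yet varies over time.
    seed = int(torch.randint(1_000_000_000, (1,), generator=rng))
    torch.manual_seed(seed)
    return torch.randn(n, rank)
```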

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117483639

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d --
test_powerSGD_ddp_comm_hook_nccl_grad_is_view

Also checked the initial Qs and input random seeds of torch.manual_seed() of different ranks for a few steps in real runs.

Example logs:
Exactly same random seed of different ranks at the same step on two nodes, and the random seed varies at each step.

{F346971916}

Reviewed By: rohan-varma

Differential Revision: D25191589

fbshipit-source-id: f7f17df3ad2075ecae1a2a56ca082160f7c5fcfc
2020-11-30 18:46:45 -08:00
Yi Wang
6400d27bbb [Gradient Compression] Define a customized state for PowerSGD comm hook (#48348)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48348

To support features like error feedback and warm start, the PowerSGD comm hook needs to maintain a state beyond the process group. Currently this state only includes a process group and a matrix approximation rank config.

This diff is a pure refactoring. More state fields will be added later.
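A minimal sketch of such a state object at this point in the stack (only the two fields mentioned here; later diffs add more):

```python
class PowerSGDStateSketch:
    """Illustrative only: bundles hook configuration so it can grow over time."""

    def __init__(self, process_group=None, matrix_approximation_rank=1):
        self.process_group = process_group
        self.matrix_approximation_rank = matrix_approximation_rank

# The hook then receives this object instead of a bare process group, e.g.:
# model.register_comm_hook(PowerSGDStateSketch(None, 1), powerSGD_hook)
```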

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117305280

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d --
test_powerSGD_ddp_comm_hook_nccl_grad_is_view

Reviewed By: rohan-varma

Differential Revision: D25137962

fbshipit-source-id: cd72b8b01e20f80a92c7577d22f2c96e9eebdc52
2020-11-21 09:25:35 -08:00
Yi Wang
1a6666c967 [Gradient Compression] Add a comment on _orthogonalize. (#48253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48253

Explained why a hand-crafted orthogonalize function is used instead of `torch.qr`.
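For context, a sketch of the kind of hand-crafted (modified Gram-Schmidt) orthogonalization referred to here, which avoids `torch.qr`:

```python
import torch

def orthogonalize(matrix, eps=1e-8):
    """Orthogonalize the columns of `matrix` in place, column by column."""
    num_cols = matrix.shape[1]
    for i in range(num_cols):
        col = matrix[:, i : i + 1]
        col.div_(torch.norm(col) + eps)        # normalize the current column
        if i + 1 < num_cols:
            rest = matrix[:, i + 1 :]
            rest.sub_(col @ (col.t() @ rest))  # remove its component from the rest
    return matrix
```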

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117132622

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D25088607

fbshipit-source-id: ebc228afcb4737bb8529e7143ea170086730520e
2020-11-19 19:22:04 -08:00
Yi Wang
daff3a81a1 [Gradient Compression] PowerSGD comm hook (#48060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48060

Implement a PowerSGD variant that applies to a batched, flattened tensor with zero padding.

This version does not require handling 1D tensors and multi-dimensional tensors in the input separately, and hence it does not need to create two asynchronous future chains.

Potential optimizations:
1) Consider FP16 compression throughout PowerSGD.
2) Warm start and save one matrix multiplication per iteration.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117105938

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: jiayisuse

Differential Revision: D24843692

fbshipit-source-id: f44200b1fd6e12e829fc543d21ab7ae086769561
2020-11-19 02:59:11 -08:00
Xu Zhao
49f0e5dfeb Fix typing errors in torch.distributed.*, close issue #42967. (#47534)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47534

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D24952497

Pulled By: xuzhao9

fbshipit-source-id: 063bfd0707198436fcfd9431f72f9a392bc0017e
2020-11-16 23:27:59 -08:00
Yi Wang
fccfe7bd1a [Gradient Compression] Add unit tests that test default Python comm hook implementations (#47158)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47158

1. Test the default Python comm hook implementations ALLREDUCE and FP16_COMPRESS, besides an ad-hoc all-reduce implementation.
2. Typo fix.
3. Reformat default_hooks.py.
4. Publish register_comm_hook API for DDP module (This should be done in a separate diff, but got merged unintentionally.)

The new style can be used for testing any new comm hook like PowerSGD easily.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

ghstack-source-id: 116012600

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D24669639

fbshipit-source-id: 048c87084234edc2398f0ea6f01f2f083a707939
2020-11-06 00:28:09 -08:00
Yi Wang
f91fcefc81 [Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks (#47270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47270

This is almost the same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType should be imported conditionally, only when dist.is_available(). Otherwise, this Python enum type, defined in caffe2/torch/csrc/distributed/c10d/init.cpp, cannot be imported. See https://github.com/pytorch/pytorch/issues/47153

I tried to follow another enum type, ReduceOp, defined in the same file, but that did not work, because the C++ enum class is defined in the torch/lib/c10d library, while BuiltinCommHookType is defined in the torch/csrc/distributed library. These two libraries are compiled in two different ways.

To avoid adding typing to the distributed package, which could be a project of its own, I simply removed the BuiltinCommHookType arg type in this file.

To review the diff on top of #46959, compare V1 vs Latest:
https://www.internalfb.com/diff/D24700959?src_version_fbid=270445741055617

Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set, a C++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the built-in comm hooks.

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115783237

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl

//arvr/projects/eye_tracking/Masquerade:python_test

USE_DISTRIBUTED=0 USE_GLOO=0 BUILD_TEST=0 USE_CUDA=1 USE_MKLDNN=0 DEBUG=0 python setup.py install

Reviewed By: mrshenli

Differential Revision: D24700959

fbshipit-source-id: 69f303a48ae275aa856e6e9b50e12ad8602e1c7a
2020-11-03 18:33:50 -08:00
Yi Wang
b1b77148ac Back out "[Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks" (#47234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47234

Revert the diff because of https://github.com/pytorch/pytorch/issues/47153

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115720415

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D24691866

fbshipit-source-id: 58fe0c45943a2ae2a09fe5d5eac4a4d947586539
2020-11-02 20:51:18 -08:00
Yi Wang
ee0033af9b [Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks (#46959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46959

1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set, a C++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the built-in comm hooks.

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115629230

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl

Reviewed By: pritamdamania87

Differential Revision: D24471910

fbshipit-source-id: f96b752298549ea2067e2568189f1b394abcd99a
2020-10-30 23:19:42 -07:00
Yi Wang
ee3d3e6dba [pytorch][PR][Gradient Compression] Reduce the peak memory of fp16 compression provided by ddp comm hook (#46078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46078

The peak memory usage of the DDP comm hook has increased due to an extra copy of the gradient tensors. To reduce the memory usage, decompress the fp16 tensor directly into the tensor stored in the gradient bucket.
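A sketch of the in-place idea (illustrative; the real hook operates on the bucket's own tensor):

```python
import torch

def decompress_into_bucket(bucket_tensor, aggregated_fp16):
    """copy_ converts FP16 back to FP32 directly into the bucket's storage,
    avoiding a second full-precision copy of the gradients."""
    bucket_tensor.copy_(aggregated_fp16)
    return bucket_tensor

bucket = torch.zeros(8, dtype=torch.float32)
aggregated = torch.ones(8, dtype=torch.float16)
decompress_into_bucket(bucket, aggregated)
```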

#Closes: https://github.com/pytorch/pytorch/issues/45968
ghstack-source-id: 113996453

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d  -- test_accumulate_gradients_no_sync_allreduce_hook

Also verified the decrease in memory consumption with some toy modeling examples.

Reviewed By: pritamdamania87

Differential Revision: D24178118

fbshipit-source-id: 453d0b52930809bd836172936b77abd69610237a
2020-10-12 16:15:38 -07:00
Yi Wang
022ba5a78b Make ddp_comm_hook_wrapper a private method. (#44643)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44643

This method is not used anywhere else.

Also formatted the file.

Test Plan: buck test caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks

Reviewed By: pritamdamania87

Differential Revision: D23675945

fbshipit-source-id: 2d04f94589a20913e46b8d71e6a39b70940c1461
2020-09-24 13:29:48 -07:00
Sinan Nasir
1a79d7bb28 DDP communication hook examples (#43310)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43310

In this diff, we prepared some example DDP communication hooks [#40848](https://github.com/pytorch/pytorch/pull/40848):

1\. `allreduce_hook`: This DDP communication hook just calls ``allreduce`` using ``GradBucket`` tensors. Once gradient tensors are aggregated across all workers, its ``then`` callback takes the mean and returns the result. If the user registers this hook, DDP results are expected to be the same as in the case where no hook was registered. Hence, this won't change the behavior of DDP, and the user can use it as a reference or modify it to log useful information or for any other purpose without affecting DDP behavior. (A minimal sketch of this hook appears after this list.)

2\. `allgather_then_aggregate_hook`: Similar to ``allreduce_hook``, this hook first gathers ``GradBucket`` tensors, and its ``then`` callback aggregates the gathered gradient tensors and takes the mean. Instead of ``allreduce``, this hook uses ``allgather``. Note that with W workers, both the computation and communication time scale as O(W) for allgather compared to O(logW) for allreduce. Therefore, this hook is expected to be much slower than ``allreduce_hook``, although both essentially do the same thing with the gradients.

3\. `fp16_compress_hook`: This DDP communication hook implements a simple gradient compression approach that converts ``GradBucket`` tensors, whose type is assumed to be ``torch.float32``, to half-precision floating point format (``torch.float16``). It allreduces those ``float16`` gradient tensors. Once the compressed gradient tensors are allreduced, its ``then`` callback, called ``decompress``, converts the aggregated result back to ``float32`` and takes the mean.

4\. `quantization_pertensor_hook` does quantization per tensor and uses the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html. Note that we separately send scale and zero_point (two floats per rank) before the quantized tensors.

5\. `quantization_perchannel_hook` does quantization per channel, similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that after the initial QSGD study diff, we realized that for considerably large gradient tensors, such as a tensor that contains 6 million floats, dividing it into smaller channels (512-float chunks) and quantizing them independently may significantly increase the resolution and result in lower error.
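A minimal sketch in the spirit of the `allreduce_hook` in item 1 (the `GradBucket` accessor name and the shape of the returned value are assumptions for this snapshot of the API):

```python
import torch.distributed as dist

def allreduce_hook_sketch(process_group, bucket):
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = group.size()
    tensor = bucket.get_tensors()[0]  # accessor name assumed
    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()
    # Take the mean once all workers have contributed their gradients.
    return fut.then(lambda f: [f.value()[0].div_(world_size)])
```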
ghstack-source-id: 110923269

Test Plan:
python torch/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py
Couldn't download test skip set, leaving all tests enabled...
.....
----------------------------------------------------------------------
Ran 4 tests in 26.724s

OK

Internal testing:
```
buck run mode/dev-nosan //caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks
```

Reviewed By: malfet

Differential Revision: D22937999

fbshipit-source-id: 274452e7932414570999cb978ae77a97eb3fb0ec
2020-08-28 18:59:14 -07:00