Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54647
Regularly log stats showing effect of gradient compression when using the PowerSGD DDP communication hook.
Test Plan:
buck run mode/dev-nosan scripts/wayi/torch:power_sgd
Play with the layer sizes of the input model (you can just use linear layers for convenience), and check the log that shows compression stats. For convenience, you can change `logging.info` to `print` locally.
You can create some test diffs on top of this diff, to show that the compression stats are correct in different cases.
Run with power_sgd script:
{F537381542}
Diff with example using a simple linear model: D27299934
sample output:
{F538486535}
Reviewed By: SciPioneer
Differential Revision: D27240254
fbshipit-source-id: 9e142b2f7957cc874804f799b7bb3bffdf824858
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53780
Update the comment, because the input data type of `fp16_compress_hook` does not have to be FP32. For example, the input dtype can also be FP64, as long as it can be cast to FP16.
ghstack-source-id: 123680621
Test Plan: N/A
Reviewed By: iseessel
Differential Revision: D26967224
fbshipit-source-id: 26d79a3629a597e6335b6f59c97d25a764a8ed80
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52979
Compression rate = uncompressed size / compressed size, so the compression rate is usually greater than 1.
Previously the compression rate was perceived as compressed size / uncompressed size, which can be very confusing.
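A minimal sketch of the corrected ratio, assuming a `num_rows x num_cols` gradient matrix is replaced by two rank-`rank` factors. The function name and the size formula for the factors are illustrative assumptions, not code from this diff.

```python
def compression_rate(num_rows: int, num_cols: int, rank: int) -> float:
    """Return uncompressed size / compressed size for a low-rank factorization."""
    uncompressed_size = num_rows * num_cols
    # Assumed layout: one (num_rows x rank) factor and one (num_cols x rank) factor.
    compressed_size = (num_rows + num_cols) * rank
    return uncompressed_size / compressed_size


# A 1024 x 1024 matrix at rank 8 shrinks by a factor well above 1.
print(compression_rate(1024, 1024, 8))
```

With this definition a larger value means stronger compression, which matches the intuitive reading of "rate".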
ghstack-source-id: 122996272
Test Plan: unit tests
Reviewed By: zhaojuanmao
Differential Revision: D26713349
fbshipit-source-id: 83b7f8908c101954cf01f56a22161047fbfeaa53
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53010
To determine the boundary between different iterations in a DDP communication hook, currently the user code needs `bucket.get_index() == 0`, which involves internal bucketization implementation details and undermines the usability of DDP communication hook.
Create an API to hide the details and improve the usability before publishing GradBucket APIs.
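A rough sketch of the idea, using a toy bucket class rather than the real GradBucket. The method name `starts_new_iteration` and the reverse delivery order of buckets are assumptions made for illustration only.

```python
class ToyBucket:
    """Stand-in for a gradient bucket; not the real GradBucket class."""

    def __init__(self, index: int):
        self._index = index

    def get_index(self) -> int:
        return self._index

    def starts_new_iteration(self) -> bool:
        # Hypothetical helper name. It wraps the index comparison so that
        # hook code does not depend on how buckets are numbered.
        return self._index == 0


def count_iterations(buckets) -> int:
    """Count iteration boundaries seen by a hook called once per bucket."""
    return sum(1 for bucket in buckets if bucket.starts_new_iteration())


# Two iterations over three buckets, delivered in reverse index order (assumed).
stream = [ToyBucket(i) for i in (2, 1, 0, 2, 1, 0)]
print(count_iterations(stream))
```

Hook code written against the helper keeps working if the bucket numbering scheme changes, which is the usability point made above.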
ghstack-source-id: 122723081
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
Reviewed By: rohan-varma
Differential Revision: D26720813
fbshipit-source-id: f4a3147382c1f970534d7f0dee0cd599156c8b8c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53009
It can be a common operation to apply layer-wise operations over per-parameter tensors in a DDP communication hook.
Create a util method in GradBucket class before publishing GradBucket APIs.
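As an illustration of what such a util does, the sketch below splits a flattened bucket buffer back into per-parameter chunks. It uses plain lists instead of tensors, and `split_per_parameter` is a hypothetical name, not the method added in this diff.

```python
def split_per_parameter(flat_buffer, sizes):
    """Split a flattened bucket buffer into one chunk per parameter."""
    if sum(sizes) != len(flat_buffer):
        raise ValueError("parameter sizes must add up to the buffer length")
    chunks, offset = [], 0
    for size in sizes:
        chunks.append(flat_buffer[offset:offset + size])
        offset += size
    return chunks


# A bucket holding a 2-element bias followed by a 4-element weight.
chunks = split_per_parameter([1, 2, 3, 4, 5, 6], sizes=[2, 4])
print(chunks)
```

A layer-wise operation can then loop over `chunks` without knowing how the bucket was packed.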
ghstack-source-id: 122833594
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
f254364097
Reviewed By: rohan-varma
Differential Revision: D26717893
fbshipit-source-id: 916db319de8b85dd22bc4e35db5671bf4e34740f
Summary:
Fixes #52034
- Add a minimum compression rate threshold to `PowerSGDState`
- Use the threshold to determine whether to compress high-rank tensors or not
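One way such a threshold check could look, as a sketch only. The function name, the argument names, and the size formula for the low-rank factors are assumptions and may differ from the code in this PR.

```python
def should_compress(num_rows: int, num_cols: int, rank: int,
                    min_compression_rate: float) -> bool:
    """Compress only if the low-rank form is small enough to be worth it."""
    uncompressed_size = num_rows * num_cols
    # Assumed layout: two factors of shapes (num_rows x rank) and (num_cols x rank).
    compressed_size = (num_rows + num_cols) * rank
    return compressed_size * min_compression_rate < uncompressed_size


print(should_compress(1024, 1024, 8, min_compression_rate=2))  # large layer
print(should_compress(16, 16, 8, min_compression_rate=2))      # small layer
```

Tensors that fail the check would be sent uncompressed, so small layers avoid paying the compression cost for little or no saving.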
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52541
Test Plan:
No performance regression using rank-8 compression:
baseline: f253000411
updated one: f253010955
Reviewed By: rohan-varma
Differential Revision: D26594862
Pulled By: SciPioneer
fbshipit-source-id: 2859a91b4ca6bd1862bf6cd6441dc2a89badb2d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52593
This hook is not used at all, and it probably can only be used for demonstrating that allgather is slower than allreduce, so it should never be used in practice.
However, this hook and its helper function stay with the communication hook public APIs in the same file. It will be better to make the public API file as concise as possible.
Since I don't think we will use this hook in the future, prefer deleting it to moving it to a separate file.
ghstack-source-id: 122180969
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D26575318
fbshipit-source-id: b258154a7c92e33236c34104bd79bc244ecdb158
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51427
A user reported a failure when `start_PowerSGD_iter` is set to 1. This is because allocating memory for error tensors somehow overlaps with the bucket rebuilding process at iteration 1.
Check `start_PowerSGD_iter > 1` instead of `start_PowerSGD_iter >= 1`.
Also add a unit test of `test_invalid_powerSGD_state` and some guidance on tuning PowerSGD configs.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120834126
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_invalid_powerSGD_state
Reviewed By: rohan-varma
Differential Revision: D26166897
fbshipit-source-id: 34d5b64bb3dd43acb61d792626c70e6c8bb44a5d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51270
Similar to #50973, allow the batched version to run vanilla allreduce for the first K iterations.
This may be useful if the batched version can be applied to some use cases where the accuracy requirement is not very strict.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725858
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
baseline: f248001754
batched PowerSGD: f246960752
The training time was reduced from 54m48s to 30m33s, and the accuracy is approximately the same: 44.21 vs 44.35
Reviewed By: rohan-varma
Differential Revision: D26077709
fbshipit-source-id: 6afeefad7a3fbdd7da2cbffb56dfbad855a96cb5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50985
Explicitly specify the dtype of error tensor when it is initialized by zeros.
Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, although it was later assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).
This change makes the dtype of the error tensor clearer.
Additionally, also explicitly specify the dtype if rank-1 tensor buffer is empty.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120377786
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: rohan-varma
Differential Revision: D26034988
fbshipit-source-id: e0d323d0b77c6a2478cdbe8b31a1946ffd1a07da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50981
Since vanilla allreduce will be applied in the first few iterations, the bucket rebuilding process will not affect caching per-variable tensors.
Previously the cached tensors used for error feedback and warm-up need to be rebuilt later, because their corresponding input tensors' shape will be changed after the bucket rebuild process.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120617971
Test Plan: real run
Reviewed By: rohan-varma
Differential Revision: D26034418
fbshipit-source-id: e8744431c7f3142d75b77b60110e6861c2ff5c14
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50973
This can extend the original PowerSGD method to a hybrid approach: vanilla allreduce + PowerSGD. This can help further improve the accuracy, at the cost of a lower speedup.
Also add more comments on the fields in `PowerSGDState`.
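The hybrid schedule can be pictured with the sketch below. The 0-based iteration counter and the helper name `choose_reducer` are assumptions; only the idea of running vanilla allreduce before switching to PowerSGD comes from the summary above.

```python
def choose_reducer(iteration: int, start_PowerSGD_iter: int) -> str:
    """Pick the gradient reduction path for a given iteration (0-based, assumed)."""
    if iteration < start_PowerSGD_iter:
        return "allreduce"  # uncompressed warm-up phase
    return "powerSGD"       # compressed phase


schedule = [choose_reducer(i, start_PowerSGD_iter=3) for i in range(5)]
print(schedule)
```

A later switch point spends more iterations on exact gradients, which is where the accuracy gain and the lower speedup both come from.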
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120257202
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: rohan-varma
Differential Revision: D26031478
fbshipit-source-id: d72e70bb28ba018f53223c2a4345306980b3084e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50283
Realize that for the layerwise compression, the previous warm-start implementation only skips memory allocations, but does not skip filling random values for Qs.
Also fix the unit test in distributed_test.py. Previously the process group was not created correctly, and no communication occurred in test_DistributedDataParallel_powerSGD_ddp_comm_hook.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120101220
Test Plan:
Verified the fix by adding some logging locally.
Also verified no NE diff on Ads 1x.
Reviewed By: rohan-varma
Differential Revision: D25846222
fbshipit-source-id: 1ebeeb55ceba64d4d904ea6ac1bb42b1b2241520
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49711
`torch.cuda.synchronize` uses the current device by default. Explicitly specify this device for better readability.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 119017654
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: rohan-varma
Differential Revision: D25672267
fbshipit-source-id: 62a2266727a2ea76175f3c438daf20951091c771
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49709
Since wait() has already been called in the return statements of the precursor callbacks, no need to wait again.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 119015237
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: rohan-varma
Differential Revision: D25672068
fbshipit-source-id: da136327db4c4c0e3b846ba8d6885629f1044374
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49451
Reuse the low-rank tensors P(s) and Q(s) from the previous iteration if possible.
This can give a better compression performance in terms of both accuracy and speed.
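A minimal sketch of the reuse, assuming the low-rank factors are cached in a dict keyed by bucket index. The cache layout, the helper name `get_q`, and the use of Python's `random` module in place of tensor initialization are all illustrative assumptions.

```python
import random


def get_q(q_cache, bucket_index, num_cols, rank, warm_start=True):
    """Return the cached Q for this bucket, or create a fresh random one."""
    if warm_start and bucket_index in q_cache:
        # Reuse the factor from the previous iteration without re-randomizing.
        return q_cache[bucket_index]
    rng = random.Random(bucket_index)
    q = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(num_cols)]
    q_cache[bucket_index] = q
    return q


cache = {}
q_first = get_q(cache, bucket_index=0, num_cols=3, rank=2)
q_first[0][0] = 42.0  # pretend the power iteration refined Q in place
q_second = get_q(cache, bucket_index=0, num_cols=3, rank=2)
print(q_second[0][0])
```

With `warm_start=False` the same call would overwrite the cached factor with new random values, which is the behavior the reuse is meant to avoid.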
Also add a unit test for batched PowerSGD to test_c10d.py.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 119014132
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: rohan-varma
Differential Revision: D25583086
fbshipit-source-id: a757df3c4cfcc0ead4647f7de2f43198f1e063ee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49435
Previously, the assertion prevented illegal memory access only as a side effect: `torch.any` returns a boolean value, which initiates a data transfer from the device to the host and forces a synchronization.
An explicit synchronization is more to the point.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118664204
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: rohan-varma
Differential Revision: D25573484
fbshipit-source-id: 516d0d502da2863b516c15332702335ee662f072
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49418
Add error feedback to the original implementation of PowerSGD.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118670930
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: rohan-varma
Differential Revision: D25555538
fbshipit-source-id: c01145cc9acf574a4c6aa337dbbba0ba7d9350b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49639
Resubmit #49417 with a fix for distributed_test.
The previous submission broke a multi-GPU test that runs on 4 GPUs. Since this test only runs on master, the failure could not be detected before submission.
The real diff is:
4ca1014bb5
This time I have verified that the previously failing test `pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test` passes after creating a PR (#49651) from a separate branch:
https://app.circleci.com/pipelines/github/pytorch/pytorch/253644/workflows/c1c02b70-0877-40e6-8b4c-61f60f6b70ed/jobs/9768079
ghstack-source-id: 118969912
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: mrshenli
Differential Revision: D25654961
fbshipit-source-id: 2a45c8ceb9bdb54ff7309a8b66ec87e913e0150e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49417
The existing implementation applies PowerSGD to a batch of flattened tensors, which is a coarse-grained compression. This hook is now renamed to "batched_powerSGD_hook".
Now add the original algorithm from the paper, which applies PowerSGD to each per-parameter tensor. This is a layerwise fine-grained compression. Although this original implementation is slower, it is expected to achieve a higher accuracy, especially when the shapes of per-param tensors cannot be aligned.
Also add a test in distributed_test.py.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118921275
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: rohan-varma
Differential Revision: D25511543
fbshipit-source-id: 19ef188bc2d4c7406443c8fa233c1f2c2f27d93c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49374
After the assertion is added, the NaN error on certain trainings disappears.
It seems that the real error is caused by the underlying illegal memory access. This is a temporary workaround.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118572471
Test Plan:
Real run on Ads 10X model: scripts/wayi/mast_prof_gradient_compression.sh POWER_SGD 8
To reproduce the error, just comment out the assertion.
Reviewed By: rohan-varma
Differential Revision: D25548299
fbshipit-source-id: 039af7d94a27e0f47ef647c6163fd0e5064951d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49246
Previously the comment on matrix_approximation_rank was in PowerSGD_hook function. Now move it into PowerSGDState, because the function arg is already moved to this state as an attribute.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118414247
Test Plan: N/A
Reviewed By: rohan-varma
Differential Revision: D25501091
fbshipit-source-id: 701e3109a9a3f2a5f9d18d5bf6d0a266518ee8ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48505
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
FutureNCCL isn't just adding CUDA support to ivalue::Future, it's also reimplementing a lot of the latter's logic (by overriding plenty of its methods). That's brittle, as whenever a new method is added to ivalue::Future there's a risk of forgetting to add it to FutureNCCL, and in such a case calling this method on FutureNCCL would defer to the base class and give inconsistent results (e.g., future not being completed when it actually is). This _is already happening_, for example with the waitAndThrow or hasError, which are not implemented by FutureNCCL. In addition, this creates duplication between the two classes, which could lead to inconsistencies of behavior, bugs, missing features, ...
The best solution would be to keep the core future logic in ivalue::Future, and have _only_ the CUDA additions in FutureNCCL. That's what we're going to do, in two steps. In the previous commit, I split the CUDA features into separate hooks, which are called by FutureNCCL's other methods. In this commit, I'm removing these latter methods, and invoke the hooks directly from ivalue::Future.
ghstack-source-id: 118180032
Test Plan: Unit tests
Reviewed By: wanchaol
Differential Revision: D25180535
fbshipit-source-id: 19181fe133152044eb677062a9e31e5e4ad3c03c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48902
Previously if the dtype of input gradients is FP16, matrix multiplications will fail, because the created low-rank tensors P and Q use FP32 dtype.
Now let the dtype of P and Q be the same as the input tensor.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117962078
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
Reviewed By: rohan-varma
Differential Revision: D25362071
fbshipit-source-id: e68753ff23bb480605b02891e128202ed0f8a587
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48867
Previously the key of error_dict is the hashcode of tensor. Now replaced with bucket index.
Bucket index can have a few advantages over the hashcode of tensor.
1) Error dict in the state never removes any key. If the bucket rebuild process occurs frequently, the size of error dict can increase. For now, such rebuild process is infrequent, so it is probably fine.
2) Integer index has a better readability than hashcode, and it can facilitate debugging.
If the user wants to debug the tensor values, usually only a specific bucket needs to be targeted. It's easy to specify such a condition (e.g., bucket_index = 0), but it's hard to specify a hashcode in advance, as it can only be determined at runtime.
Note that sometimes the buckets can be rebuilt in the forward pass. In this case, the shape of the bucket with the same index will not be consistent with the one in the previous iteration, and hence the error tensor will be re-initialized as a zero tensor of the new shape. Therefore, `and state.error_dict[bucket_index].shape[0] == padded_total_length` is added to the condition of applying the local error from the previous iteration.
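The lookup-with-shape-check can be sketched as follows, using plain lists in place of tensors. `load_error` is a hypothetical helper; only the dict keyed by bucket index and the length comparison against `padded_total_length` come from the description above.

```python
def load_error(error_dict, bucket_index, padded_total_length):
    """Return the cached local error, or a zero buffer if the shape changed."""
    cached = error_dict.get(bucket_index)
    if cached is not None and len(cached) == padded_total_length:
        return cached
    # Missing entry, or the bucket was rebuilt with a new length: start from zeros.
    fresh = [0.0] * padded_total_length
    error_dict[bucket_index] = fresh
    return fresh


error_dict = {0: [1.0, 2.0]}
same_shape = load_error(error_dict, bucket_index=0, padded_total_length=2)
rebuilt = load_error(error_dict, bucket_index=0, padded_total_length=3)
print(same_shape, rebuilt)
```

Because the key is a small integer, a rebuilt bucket overwrites its old entry instead of adding a new one.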
Deleted the arg type of `dist._GradBucket` in powerSGD_hook.py, because somehow test_run_mypy - TestTypeHints failed:
AssertionError: mypy failed: torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py:128: error: "_GradBucket" has no attribute "get_index" [attr-defined]
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117951402
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
Reviewed By: rohan-varma
Differential Revision: D25346347
fbshipit-source-id: 8348aa103002ec1c69e3ae759504b431140b3b0d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48670
Support an optional error feedback for PowerSGD -- storing the difference (i.e., the local error caused by compression) between the input gradient (adjusted by the existing error) and the gradient after decompression, and reinserting it at the next iteration.
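The error-feedback loop can be illustrated with a toy example. To stay self-contained it swaps PowerSGD's low-rank compression for a much simpler lossy compressor (keep only the largest-magnitude entry), so it demonstrates the bookkeeping only; all names are illustrative.

```python
def top1_compress(grad):
    """Toy lossy compressor: keep only the largest-magnitude entry."""
    keep = max(range(len(grad)), key=lambda i: abs(grad[i]))
    return [g if i == keep else 0.0 for i, g in enumerate(grad)]


def step_with_error_feedback(grad, error):
    # Adjust the incoming gradient by the error left over from the last step.
    adjusted = [g + e for g, e in zip(grad, error)]
    decompressed = top1_compress(adjusted)
    # Record what compression lost so it can be reinserted next iteration.
    new_error = [a - d for a, d in zip(adjusted, decompressed)]
    return decompressed, new_error


error = [0.0, 0.0, 0.0]
sent_total = [0.0, 0.0, 0.0]
for _ in range(3):
    sent, error = step_with_error_feedback([3.0, 2.0, 1.0], error)
    sent_total = [s + x for s, x in zip(sent_total, sent)]

# Everything not yet sent is still held in `error`, so nothing is lost.
print([s + e for s, e in zip(sent_total, error)])
```

The sum of what was sent and what is still stored equals the sum of the raw gradients, which is why error feedback helps accuracy even with an aggressive compressor.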
Still need to add an index field to GradBucket as the key of error_dict. This is because the current key, input tensor of the bucket, can change across steps, as the buckets may be rebuilt in forward pass in order to save peak memory usage.
This is only part of the error feedback work. Plan to add the new index field in a separate PR.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117636492
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
Reviewed By: rohan-varma
Differential Revision: D25240290
fbshipit-source-id: 5b6e11e711caccfb8984ac2767dd107dbf4c9b3b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48507
Previously the random seed was the length of the input tensor, which is not guaranteed to be different for different batches. Now initialize a random generator in the PowerSGD state, and use this generator to create a random seed to randomize the low-rank tensor Q at every step.
Therefore, the initial tensor Q should be the same across all the replicas at the same step, but different at different steps.
'torch.manual_seed' is used in the same way as https://github.com/epfml/powersgd/blob/master/gradient_reducers.py#L675
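A sketch of the seeding scheme using Python's `random` module as a stand-in for the torch generator. The class and function names are hypothetical; the point is that replicas with identically seeded state draw the same per-step seed.

```python
import random


class ToyState:
    """Stand-in for the hook state; holds one generator per replica."""

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)


def random_q(state, num_cols, rank):
    # Draw a fresh seed for this step from the state's generator ...
    step_seed = state.rng.randrange(1_000_000_000)
    # ... and use it to fill Q, so Q depends only on the step count.
    q_rng = random.Random(step_seed)
    return [[q_rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(num_cols)]


replica_a, replica_b = ToyState(), ToyState()
q_a_step0 = random_q(replica_a, num_cols=4, rank=2)
q_b_step0 = random_q(replica_b, num_cols=4, rank=2)
q_a_step1 = random_q(replica_a, num_cols=4, rank=2)
print(q_a_step0 == q_b_step0, q_a_step0 == q_a_step1)
```

No communication is needed to agree on Q, as long as every replica constructs its state with the same initial seed and calls the hook the same number of times.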
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117483639
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl_grad_is_view
Also checked the initial Qs and input random seeds of torch.manual_seed() of different ranks for a few steps in real runs.
Example logs:
Exactly same random seed of different ranks at the same step on two nodes, and the random seed varies at each step.
{F346971916}
Reviewed By: rohan-varma
Differential Revision: D25191589
fbshipit-source-id: f7f17df3ad2075ecae1a2a56ca082160f7c5fcfc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48348
To support features like error feedback and warm start, the PowerSGD comm hook needs to maintain state beyond the process group. Currently this state only includes a process group and a matrix approximation rank config.
This diff is a pure refactoring. Plan to add more state fields later.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117305280
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl_grad_is_view
Reviewed By: rohan-varma
Differential Revision: D25137962
fbshipit-source-id: cd72b8b01e20f80a92c7577d22f2c96e9eebdc52
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48253
Explained why a hand-crafted orthogonalize function is used instead of `torch.qr`.
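For readers unfamiliar with hand-written orthogonalization, one common approach is Gram-Schmidt, sketched below on plain Python lists. Whether this matches the hook's function in detail is an assumption, and the `epsilon` guard is an illustrative choice.

```python
import math


def orthogonalize(columns, epsilon=1e-8):
    """Gram-Schmidt: turn a list of column vectors into orthonormal columns."""
    result = []
    for col in columns:
        v = list(col)
        # Remove the components along every column already processed.
        for q in result:
            dot = sum(a * b for a, b in zip(q, v))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        # epsilon avoids division by zero for (near) linearly dependent columns.
        result.append([a / (norm + epsilon) for a in v])
    return result


q1, q2 = orthogonalize([[3.0, 4.0], [1.0, 0.0]])
print(sum(a * b for a, b in zip(q1, q2)))  # close to zero
```

The loop only needs dot products, subtractions, and a norm, which keeps it simple to apply to a matrix with very few columns.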
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117132622
Test Plan: N/A
Reviewed By: rohan-varma
Differential Revision: D25088607
fbshipit-source-id: ebc228afcb4737bb8529e7143ea170086730520e