Commit Graph

56 Commits

Author SHA1 Message Date
Yi Wang
ca4aae85fa [Gradient Compression] Update the docstring of fp16_compress_wrapper (#53955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53955

Per title
ghstack-source-id: 123852836

Test Plan: N/A

Reviewed By: iseessel

Differential Revision: D27032700

fbshipit-source-id: 6f9bbc028efe6cc9b54f4ec729fea745368efb2e
2021-03-13 10:39:40 -08:00
Isaac Seessel
3078233e9a [Gradient Compression] Make FP16 compression as a wrapper that can be combined with other communication hooks (#53808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53808

Create a FP16 wrapper that can combine FP16 gradient compression with any gradient compression algorithm.

Test Plan:
Unit test:
```
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper
```

Performance Test on DDP QPS Benchmark: Check if AllReduce + FP16 Wrapper = FP16 Compression
1) FP16 Compression:
f256897690

2) FP16 Wrapper + AllReduce (after patching D26960986):
f256897289

Reviewed By: SciPioneer

Differential Revision: D26978832

fbshipit-source-id: 0dcd18b050c02f5e9f3cff56344d1f39a04e20c0
2021-03-12 17:31:07 -08:00
Yi Wang
8016d28c0b [Gradient Compression] Update the comment on fp16_compress_hook (#53780)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53780

Update the comment, because the input data type of `fp16_compress_hook` does not have to be FP32. For example, the input dtype can also be FP64, as long as it can be casted into FP16.
ghstack-source-id: 123680621

Test Plan: N/A

Reviewed By: iseessel

Differential Revision: D26967224

fbshipit-source-id: 26d79a3629a597e6335b6f59c97d25a764a8ed80
2021-03-11 13:40:32 -08:00
Yi Wang
68b62493b8 [Gradient Compression] Make GradBucket class public (#53099)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53099

Publish GradBucket APIs for publishing DDP communication hooks.

s/_GradBucket/GradBucket
ghstack-source-id: 123030921

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26721121

fbshipit-source-id: ee5f68e33095b9965b51937b86cdeb331fd2419a
2021-03-03 19:22:15 -08:00
Yi Wang
b59075eced [Gradient Compression] Refactor tensor grouping in PowerSGD (#52981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52981

No need to create a hard boundary between rank-1 tensors and high-rank tensors, since some high-rank tensors will not be compressed if the compression cannot save enough bandwidth, according to `_should_compress` function.

Therefore, refactor and simplify the tensor grouping logic, which addresses the comment in https://github.com/pytorch/pytorch/pull/52541#discussion_r580867311
ghstack-source-id: 122997032

Test Plan:
waitforbuildbot

Already LGTMed by PowerSGD paper author.

Ads1x (completed):
https://www.internalfb.com/intern/tupperware/details/job/?handle=priv3_global%2Fmast_hpc%2Ftsm_hpc-wayi_ads_10x_POWER_SGD_gpu8_2021-02-28_15-29.trainer&tatwTabs=tasks&task_id=0&task_tab=TASK_LOGS

Detectron2:
1) Before refactoring:
f254353864
Accuracy: 39.972
Overall training speed: 67498 iterations in 6:15:42 (0.3340 s / it)

2) After refactoring:
f254353380
Accuracy: 39.944
Overall training speed: 67498 iterations in 6:09:41 (0.3286 s / it)

Reviewed By: rohan-varma

Differential Revision: D26713689

fbshipit-source-id: 12cfcb65feaa2a2d94e3c7793073031f13828305
2021-03-03 19:20:41 -08:00
Yi Wang
ba36e32406 [Gradient Compression] Correct the usage of min_compression_rate (#52979)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52979

Compression rate = uncompressed size / compressed size, so the compression rate is usually greater than 1.

Previously the compression rate was perceived as compressed size / uncompressed size, which can be very confusing.
ghstack-source-id: 122996272

Test Plan: unit tests

Reviewed By: zhaojuanmao

Differential Revision: D26713349

fbshipit-source-id: 83b7f8908c101954cf01f56a22161047fbfeaa53
2021-03-03 15:35:40 -08:00
Yi Wang
b05dd931ee [Gradient Compression] Add is_the_last_bucket_to_allreduce method to GradBucket class (#53010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53010

To determine the boundary between different iterations in a DDP communication hook, currently the user code needs `bucket.get_index() == 0`, which involves internal bucketization implementation details and undermines the usability of DDP communication hook.

Create an API to hide the details and improve the usability before publishing GradBucket APIs.
ghstack-source-id: 122723081

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D26720813

fbshipit-source-id: f4a3147382c1f970534d7f0dee0cd599156c8b8c
2021-03-02 14:39:12 -08:00
Yi Wang
ecb5ac90ed [Gradient Compression] Add get_per_parameter_tensors method to GradBucket class (#53009)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53009

It can be a common operation to apply layer-wise operations over per-parameter tensors in a DDP communication hook.

Create a util method in GradBucket class before publishing GradBucket APIs.
ghstack-source-id: 122833594

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

f254364097

Reviewed By: rohan-varma

Differential Revision: D26717893

fbshipit-source-id: 916db319de8b85dd22bc4e35db5671bf4e34740f
2021-03-02 14:39:03 -08:00
Yi Wang
890e051047 Clang-format quantization_hooks.py (#53100)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53100

ghstack-source-id: 122723751

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D26721146

fbshipit-source-id: 985057fc02c997124b676854eb0a55e569971a3f
2021-03-02 12:48:43 -08:00
Shen Li
729d88119a Fix GradBucket Typing (#52943)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52943

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D26699759

Pulled By: mrshenli

fbshipit-source-id: 712165a29d114da761ef4f161096ca46a958df03
2021-02-27 20:04:38 -08:00
Seung-Jae Bang
2d75346c25 [Gradient Compression] Add a minimum compression rate threshold for PowerSGD communication hook (#52541)
Summary:
Fixes #{52034}
- Add a minimum compression rate threshold to `PowerSGDState`
- Use the threshold to determine whether to compress high-rank tensors or not

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52541

Test Plan:
No performance regression using rank-8 compression:
baseline: f253000411
updated one: f253010955

Reviewed By: rohan-varma

Differential Revision: D26594862

Pulled By: SciPioneer

fbshipit-source-id: 2859a91b4ca6bd1862bf6cd6441dc2a89badb2d5
2021-02-23 22:03:02 -08:00
Yi Wang
03ae6d9903 Remove useless _allgather_then_aggregate_hook (#52593)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52593

This hook is not used at all, and it probably can only be used for demonstrating that allgather is slower than allreduce, so it should never be used in practice.

However, this hook and its helper function stay with the communication hook public APIs in the same file. It will be better to make the public API file as concise as possible.

Since I don't think we will use this hook in the future, prefer deleting it to moving it to a separate file.
ghstack-source-id: 122180969

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26575318

fbshipit-source-id: b258154a7c92e33236c34104bd79bc244ecdb158
2021-02-22 12:12:53 -08:00
Yi Wang
4b3c99ce4a [Resubmission] Add a documentation page for DDP communication hooks (#51773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51773

Resubmission of #51715.

Minor changes:
1) Removed "Note [Guidance to Tune ``matrix_approximation_rank`` And ``start_powerSGD_iter``]" in powerSGD_hook.py.

2) Removed the duplicate description of `torch.nn.parallel.DistributedDataParallel.register_comm_hook` in ddp_comm_hooks.rst, because it is already covered by distributed.rst.

Also updated the doc based on the comments from PowerSGD paper author Thijs Vogels .

It seems that `python_doc_test` was flaky. The previous error message was not informative:
https://app.circleci.com/pipelines/github/pytorch/pytorch/270682/workflows/8d186a3c-d682-46bf-b617-ad4eef5991e2/jobs/10739143, and all the warnings did also appear on the master branch.

Rebasing to a new master branch seems to get this fixed:
https://app.circleci.com/pipelines/github/pytorch/pytorch/270696/workflows/1a3adbea-6443-4876-b87b-e17d90d41428/jobs/10740021/steps

Screenshot:

{F369899792}
ghstack-source-id: 121199613

Test Plan: View locally

Reviewed By: mingzhe09088

Differential Revision: D26272687

fbshipit-source-id: 6677db496a68171798940a80343f4d9a508e15db
2021-02-06 21:22:04 -08:00
Natalia Gimelshein
d3023d86ba Revert D26249330: [Gradient Compression] Add a documentation page for DDP communication hooks
Test Plan: revert-hammer

Differential Revision:
D26249330 (e62aabac43)

Original commit changeset: ab973390ddb7

fbshipit-source-id: d508daed76219e7ca588cf7fb38aeaaffc61acfd
2021-02-04 22:38:06 -08:00
Yi Wang
e62aabac43 [Gradient Compression] Add a documentation page for DDP communication hooks (#51715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51715

Add a documentation page for DDP communication hooks.

Screenshot:

{F369781049}

Test Plan: View locally

Reviewed By: pritamdamania87

Differential Revision: D26249330

fbshipit-source-id: ab973390ddb785c5191f587a1b2b6de7d229e50e
2021-02-04 18:53:53 -08:00
Yi Wang
43df03de13 [Gradient Compression] Replace torch.sqrt(torch.sum(col ** 2)) by torch.norm() (#51629)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51629

Leverage the existing util functions as much as possible for potential performance gain.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120919883

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

No performance regression:
f248664994 uses `torch.norm()`
```
total:
  32 GPUs -- 32 GPUs: p25:  1.050    30/s  (batch size 32)
p50:  1.230    26/s  (batch size 32)
p75:  1.449    22/s  (batch size 32)
p90:  1.611    19/s  (batch size 32)
p95:  1.702    18/s  (batch size 32)

backward:
  32 GPUs -- 32 GPUs: p25:  0.769    41/s  (batch size 32)
p50:  0.920    34/s  (batch size 32)
p75:  1.139    28/s  (batch size 32)
p90:  1.322    24/s  (batch size 32)
p95:  1.440    22/s  (batch size 32)
```

f248678690 does not use `torch.norm()`
```
total:
  32 GPUs -- 32 GPUs: p25:  1.056    30/s  (batch size 32)
p50:  1.249    25/s  (batch size 32)
p75:  1.443    22/s  (batch size 32)
p90:  1.608    19/s  (batch size 32)
p95:  1.711    18/s  (batch size 32)

backward:
  32 GPUs -- 32 GPUs: p25:  0.777    41/s  (batch size 32)
p50:  0.939    34/s  (batch size 32)
p75:  1.127    28/s  (batch size 32)
p90:  1.322    24/s  (batch size 32)
p95:  1.448    22/s  (batch size 32)
```

Reviewed By: pritamdamania87

Differential Revision: D26219835

fbshipit-source-id: 31d8ad3401d4efced4a6069f4f1e169ea3372697
2021-02-03 13:39:11 -08:00
Yi Wang
79e7544cb4 [Gradient Compression] Check start_PowerSGD_iter > 1 and add guidance on tuning PowerSGD configs. (#51427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51427

A user reported that `start_PowerSGD_iter` failed when it's set as 1. This is because allocating memory for error tensors somehow overlap with bucket rebuilding process at iteration 1.

Check `start_PowerSGD_iter > 1` instead of `start_PowerSGD_iter >= 1`.

Also add a unit test of `test_invalid_powerSGD_state` and some guidance on tuning PowerSGD configs.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120834126

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_invalid_powerSGD_state

Reviewed By: rohan-varma

Differential Revision: D26166897

fbshipit-source-id: 34d5b64bb3dd43acb61d792626c70e6c8bb44a5d
2021-02-02 04:30:24 -08:00
Yi Wang
c08078031f [Gradient Compression] Allow BatchedPowerSGD to run vanilla allreduce for the first K iterations (#51270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51270

Similar to #50973, allow the batched version to run vanilla allreduce for the first K iterations.

This may be useful if the batched version can be applied to some use cases where the accuracy requirement is not very strict.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725858

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

baseline: f248001754
batched PowerSGD: f246960752

The training time was reduced from 54m48s to 30m33s, and the accuracy is approximately the same: 44.21 vs 44.35

Reviewed By: rohan-varma

Differential Revision: D26077709

fbshipit-source-id: 6afeefad7a3fbdd7da2cbffb56dfbad855a96cb5
2021-02-01 15:26:29 -08:00
Yi Wang
0831984ed5 [Resubmission][Gradient Compression] Refactor default_hooks.py and powerSGD_hook.py by creating a util function that make a vanilla allreduce future (#51400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51400

Resubmission of #51094

Address https://github.com/pytorch/pytorch/pull/50973#discussion_r564229818

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725690

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D26162333

fbshipit-source-id: ccc2eae5383a23673e00d61cb5570fb8bf749cd0
2021-02-01 11:34:41 -08:00
Iurii Zdebskyi
5a406c023e Revert D26070147: [Gradient Compression] Refactor default_hooks.py and powerSGD_hook.py by creating a util function that make a vanilla allreduce future
Test Plan: revert-hammer

Differential Revision:
D26070147 (e7b3496232)

Original commit changeset: 8c9339f1511e

fbshipit-source-id: fa1e9582baec9759a73b3004be9bb19bdeb6cd34
2021-01-29 09:06:24 -08:00
Yi Wang
e7b3496232 [Gradient Compression] Refactor default_hooks.py and powerSGD_hook.py by creating a util function that make a vanilla allreduce future (#51094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51094

Address https://github.com/pytorch/pytorch/pull/50973#discussion_r564229818

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120619680

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D26070147

fbshipit-source-id: 8c9339f1511e8f24cc906b9411cfe4850a5a6d81
2021-01-28 19:03:18 -08:00
Yi Wang
9d731e87de [Gradient Compression] Explicitly specify the dtype of the error tensor (#50985)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50985

Explicitly specify the dtype of error tensor when it is initialized by zeros.

Previously if the dtype of input tensor is FP16, the error tensor is still created in FP32, although later it will be assigned by another FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change will make the dtype of error tensor look more clear.

Additionally, also explicitly specify the dtype if rank-1 tensor buffer is empty.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120377786

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D26034988

fbshipit-source-id: e0d323d0b77c6a2478cdbe8b31a1946ffd1a07da
2021-01-28 19:03:14 -08:00
Yi Wang
b619d37bb4 [Gradient Compression] Simplify the implementation of error feedback and warm-start (#50981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50981

Since vanilla allreduce will to be applied in the first few iterations, bucket rebuilding process will not affect caching per-variable tensors.

Previously the cached tensors used for error feedback and warm-up need to be rebuilt later, because their corresponding input tensors' shape will be changed after the bucket rebuild process.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120617971

Test Plan: real run

Reviewed By: rohan-varma

Differential Revision: D26034418

fbshipit-source-id: e8744431c7f3142d75b77b60110e6861c2ff5c14
2021-01-28 18:59:40 -08:00
Yi Wang
9f19843d19 [Gradient Compression] Typo fixes in PowerSGD (#50974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50974

Typo fixes.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120257221

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D26031679

fbshipit-source-id: 9d049b50419a3e40e53f7f1275a441e31b87717b
2021-01-25 22:55:54 -08:00
Yi Wang
ffaae32d60 [Gradient Compression] Allow PowerSGD to run vallina allreduce for the first K iterations (#50973)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50973

This can extend the original PowerSGD method to a hybrid approach: vanilla allreduce + PowerSGD. This can help further improve the accuracy, at the cost of a lower speedup.

Also add more comments on the fields in `PowerSGDState`.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120257202

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D26031478

fbshipit-source-id: d72e70bb28ba018f53223c2a4345306980b3084e
2021-01-25 22:38:39 -08:00
Yi Wang
439afda090 [Gradient Compression] Fix warm-start for PowerSGD laywerwise compression (#50283)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50283

Realize that for the layerwise compression, the previous warm-start implementation only skips memory allocations, but does not skip filling random values for Qs.

Also fix the unit test in distributed_test.py. Previously the process group was not created correctly, and not communication occurred in the test_DistributedDataParallel_powerSGD_ddp_comm_hook.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120101220

Test Plan:
Verified the fix by adding added some loggings locally.

Also verified no NE diff on Ads 1x.

Reviewed By: rohan-varma

Differential Revision: D25846222

fbshipit-source-id: 1ebeeb55ceba64d4d904ea6ac1bb42b1b2241520
2021-01-20 22:31:44 -08:00
Yi Wang
ce370398cc [Gradient Compression] Remove the extra comma after "bucket" in PowerSGD hook signatures (#50197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50197

Remove the extra comma after "bucket".
ghstack-source-id: 119513484

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25823117

fbshipit-source-id: acf048f7cb732c23cba3a81ccce1e70f6b9f4299
2021-01-07 15:56:20 -08:00
Ansley Ussery
c619892482 Fix errata (#49903)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49903

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25718411

Pulled By: ansley

fbshipit-source-id: 0cc365c5a53077752dc1c5a5c4a65b873baa3604
2020-12-28 20:40:41 -08:00
Samuel Marks
e6779d4357 [*.py] Rename "Arguments:" to "Args:" (#49736)
Summary:
I've written custom parsers and emitters for everything from docstrings to classes and functions. However, I recently came across an issue when I was parsing/generating from the TensorFlow codebase: inconsistent use of `Args:` and `Arguments:` in its docstrings.

```sh
(pytorch#c348fae)$ for name in 'Args:' 'Arguments:'; do
    printf '%-10s %04d\n' "$name" "$(rg -IFtpy --count-matches "$name" | paste -s -d+ -- | bc)"; done
Args:      1095
Arguments: 0336
```

It is easy enough to extend my parsers to support both variants, however it looks like `Arguments:` is wrong anyway, as per:

  - https://google.github.io/styleguide/pyguide.html#doc-function-args @ [`ddccc0f`](https://github.com/google/styleguide/blob/ddccc0f/pyguide.md)

  - https://chromium.googlesource.com/chromiumos/docs/+/master/styleguide/python.md#describing-arguments-in-docstrings @ [`9fc0fc0`](https://chromium.googlesource.com/chromiumos/docs/+/9fc0fc0/styleguide/python.md)

  - https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html @ [`c0ae8e3`](https://github.com/sphinx-contrib/napoleon/blob/c0ae8e3/docs/source/example_google.rst)

Therefore, only `Args:` is valid. This PR replaces them throughout the codebase.

PS: For related PRs, see tensorflow/tensorflow/pull/45420

PPS: The trackbacks automatically appearing below are sending the same changes to other repositories in the [PyTorch](https://github.com/pytorch) organisation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49736

Reviewed By: albanD

Differential Revision: D25710534

Pulled By: soumith

fbshipit-source-id: 61e8ff01abb433e9f78185c2d1d0cbd7c22c1619
2020-12-28 09:34:47 -08:00
Yi Wang
55b431b17a [Gradient Compression] Directly let world_size = group_to_use.size() (#49715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49715

Address the comment on https://github.com/pytorch/pytorch/pull/49417#discussion_r545388351
ghstack-source-id: 119049598

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25673997

fbshipit-source-id: 44eb2540e5a77331c34ba503285cbd0bd63c2c0a
2020-12-22 23:24:54 -08:00
Yi Wang
88c33ff8ab [Gradient Compression] Explicitly restrict the scope of torch.cuda.synchronize to the current device (#49711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49711

`torch.cuda.synchronize` uses the current device by default. Explicitly specify this device for better readability.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 119017654

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D25672267

fbshipit-source-id: 62a2266727a2ea76175f3c438daf20951091c771
2020-12-22 23:21:45 -08:00
Yi Wang
af1b636b89 [Gradient Compression] Change wait() to value() in some callbacks of PowerSGD communication hook (#49709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49709

Since wait() has already been called in the return statements of the precursor callbacks, no need to wait again.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 119015237

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D25672068

fbshipit-source-id: da136327db4c4c0e3b846ba8d6885629f1044374
2020-12-22 21:37:04 -08:00
Yi Wang
c348faedc4 [Gradient Compression] Warm-start of PowerSGD (#49451)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49451

Reuse the low-rank tensors P(s) and Q(s) from the previous iteration if possible.

This can give a better compression performance in terms of both accuracy and speed.

Also add a unit test for batched PowerSGD to test_c10d.py.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 119014132

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D25583086

fbshipit-source-id: a757df3c4cfcc0ead4647f7de2f43198f1e063ee
2020-12-22 01:19:14 -08:00
Yi Wang
96aed203bf [Gradient Compression] Replace the assertions in PowerSGD comm hook by stream syncrhonization (#49435)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49435

Previously the assertion that prevents illegal memory access is because of the torch.any that returns a boolean value, which initiates a data transfer from the device to the host and forces a synchronization.

An explicit synchronization is more to the point.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118664204

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D25573484

fbshipit-source-id: 516d0d502da2863b516c15332702335ee662f072
2020-12-20 17:24:06 -08:00
Yi Wang
342bfd892f [Gradient Compression] Add error feedback to layerwise PowerSGD (#49418)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49418

Add error feedback to the original implementation of PowerSGD.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118670930

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D25555538

fbshipit-source-id: c01145cc9acf574a4c6aa337dbbba0ba7d9350b2
2020-12-20 17:22:39 -08:00
Yi Wang
8b61fbdac9 Resubmit: [Gradient Compression] Implement the original layerwise PowerSGD (#49639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49639

Resubmit #49417 with a fix for distributed_test.

The previous submission broke a multi-gpu test that runs on 4 GPUs. Since this test only runs on master, couldn't detect it before the submission.

The real diff is:
4ca1014bb5

This time I have verified that the previous failed test `pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test` could pass after creating a PR (#49651) from a separate branch:
https://app.circleci.com/pipelines/github/pytorch/pytorch/253644/workflows/c1c02b70-0877-40e6-8b4c-61f60f6b70ed/jobs/9768079

ghstack-source-id: 118969912

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook、

Reviewed By: mrshenli

Differential Revision: D25654961

fbshipit-source-id: 2a45c8ceb9bdb54ff7309a8b66ec87e913e0150e
2020-12-20 13:02:52 -08:00
Shen Li
ad9923e5d5 Revert D25511543: [Gradient Compression] Implement the original layerwise PowerSGD
Test Plan: revert-hammer

Differential Revision:
D25511543 (71f3399e19)

Original commit changeset: 19ef188bc2d4

fbshipit-source-id: a363641a059aeacc57684884998cf8fb7363d748
2020-12-18 20:30:29 -08:00
Yi Wang
71f3399e19 [Gradient Compression] Implement the original layerwise PowerSGD (#49417)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49417

The existing implementation applies PowerSGD to a batch of flattened tensors, which is a coarse-grained compression. This hook now is renamed as "batched_powerSGD_hook".

Now implement the original implementation in the paper, which applies PowerSGD to each per-parameter tensor. This is a layerwise fine-grained compression. Although this original implementation is slower, it is expected to achieve a higher accuracy, especially when the shapes of per-param tensors cannot be aligned.

Also add a test in distributed_test.py.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118921275

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D25511543

fbshipit-source-id: 19ef188bc2d4c7406443c8fa233c1f2c2f27d93c
2020-12-18 18:02:15 -08:00
Yi Wang
a419a3e25d Add assertion on any NaN error on the error feedback (#49374)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49374

After the assertion is added, the NaN error on certain trainings disappears.

It seems that the real error is caused by the underlying illegal memory access. This is a temporary workaround.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118572471

Test Plan:
Real run on Ads 10X model: scripts/wayi/mast_prof_gradient_compression.sh POWER_SGD 8

To reproduce the error, just comment out the assertion.

Reviewed By: rohan-varma

Differential Revision: D25548299

fbshipit-source-id: 039af7d94a27e0f47ef647c6163fd0e5064951d5
2020-12-14 20:15:39 -08:00
Yi Wang
29f0fa36b1 [Gradient Compression] Minor update of the comments on PowerSGD. (#49246)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49246

Previously the comment on matrix_approximation_rank was in PowerSGD_hook function. Now move it into PowerSGDState, because the function arg is already moved to this state as an attribute.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118414247

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D25501091

fbshipit-source-id: 701e3109a9a3f2a5f9d18d5bf6d0a266518ee8ea
2020-12-11 17:45:53 -08:00
Luca Wehrstedt
4c425e8da0 Merge common parts of FutureNCCL into at::ivalue::Future (#48505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48505

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

FutureNCCL isn't just adding CUDA support to ivalue::Future, it's also reimplementing a lot of the latter's logic (by overriding plenty of its methods). That's brittle, as whenever a new method is added to ivalue::Future there's a risk of forgetting to add it to FutureNCCL, and in such a case calling this method on FutureNCCL would defer to the base class and give inconsistent results (e.g., future not being completed when it actually is). This _is already happening_, for example with the waitAndThrow or hasError, which are not implemented by FutureNCCL. In addition, this creates duplication between the two classes, which could lead to inconsistencies of behavior, bugs, missing features, ...

The best solution would be to keep the core future logic in ivalue::Future, and have _only_ the CUDA additions in FutureNCCL. That's what we're going to do, in two steps. In the previous commit, I split the CUDA features into separate hooks, which are called by FutureNCCL's other methods. In this commit, I'm removing these latter methods, and invoke the hooks directly from ivalue::Future.
ghstack-source-id: 118180032

Test Plan: Unit tests

Reviewed By: wanchaol

Differential Revision: D25180535

fbshipit-source-id: 19181fe133152044eb677062a9e31e5e4ad3c03c
2020-12-10 03:54:22 -08:00
Yi Wang
c876d4f477 [Gradient Compression] Let the dtype of created low-rank tensors P and Q be the same type as the input tensor (#48902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48902

Previously if the dtype of input gradients is FP16, matrix multiplications will fail, because the created low-rank tensors P and Q use FP32 dtype.

Now let the dtype of P and Q be the same as the input tensor.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117962078

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D25362071

fbshipit-source-id: e68753ff23bb480605b02891e128202ed0f8a587
2020-12-07 17:40:06 -08:00
Yi Wang
17f53bffef [Gradient Compression] Replace the key of error_dict in PowerSGD state with bucket index (#48867)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48867

Previously the key of error_dict is the hashcode of tensor. Now replaced with bucket index.

Bucket index can have a few advantages over the hashcode of tensor.
1) Error dict in the state never removes any key. If the bucket rebuild process occurs frequently, the size of error dict can increase. For now, such rebuild process is infrequent, so it is probably fine.

2) Integer index has a better readability than hashcode, and it can facilitate debugging.
If the user wants to debug the tensor values, usually only a specific bucket needs to be targeted. It's easy to specify such condition (e..g, bucket_index = 0), but it's hard to specify a hashcode in advance, as it can only be determined at runtime.

Note that sometimes the buckets can be rebuilt in the forward pass. In this case, the shape of the bucket with the same index will not be consistent with the one in the previous iteration, and hence the error tensor will be re--initialized as a zero tensor of the new shape. Therefore, `and state.error_dict[bucket_index].shape[0] == padded_total_length` is added to the condition of applying the local error from the previous iteration.

Deleted the arg type of `dist._GradBucket` in powerSGD_hook.py, because somehow test_run_mypy - TestTypeHints failed:
AssertionError: mypy failed: torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py:128: error: "_GradBucket" has no attribute "get_index"  [attr-defined]

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117951402

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D25346347

fbshipit-source-id: 8348aa103002ec1c69e3ae759504b431140b3b0d
2020-12-05 23:53:27 -08:00
Yi Wang
9c6979a266 [Gradient Compression] Error feedback for PowerSGD (still need to fix the key in error_dict) (#48670)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48670

Support an optional error feedback for PowerSGD -- storing the difference (i.e., the local error caused by compression) between the input gradient (adjusted by the existing error) and the gradient after decompression, and reinserting it at the next iteration.

Still need to add an index field to GradBucket as the key of error_dict. This is because the current key, input tensor of the bucket, can change across steps, as the buckets may be rebuilt in forward pass in order to save peak memory usage.

This is halfway of error feedback. Plan to add the new index field in a separate PR.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117636492

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D25240290

fbshipit-source-id: 5b6e11e711caccfb8984ac2767dd107dbf4c9b3b
2020-12-02 06:39:30 -08:00
Yi Wang
ddb6594971 [Gradient Compression] Add a random generator to PowerSGD state for initializing low-rank matrix Q (#48507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48507

Previously the random seed is the length of input tensor, which is not guaranteed to be the different for different batches. Now initialize a random generator in PowerSGD state, and use this generator to create a random seed to randomize the low-rank tensor Q at every step.

Therefore, the initial tensor Q should be the same across all the replicas at the same step, but different at different steps.

'torch.manual_seed' is used in the same way as https://github.com/epfml/powersgd/blob/master/gradient_reducers.py#L675

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117483639

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d --
test_powerSGD_ddp_comm_hook_nccl_grad_is_view

Also checked the initial Qs and input random seeds of torch.manual_seed() of different ranks for a few steps in real runs.

Example logs:
Exactly same random seed of different ranks at the same step on two nodes, and the random seed varies at each step.

{F346971916}

Reviewed By: rohan-varma

Differential Revision: D25191589

fbshipit-source-id: f7f17df3ad2075ecae1a2a56ca082160f7c5fcfc
2020-11-30 18:46:45 -08:00
Yi Wang
6400d27bbb [Gradient Compression] Define a customized state for PowerSGD comm hook (#48348)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48348

To support the features like error feedback, warm start, PowerSGD comm hook needs to maintain a state besides process group. Currently this state only includes a process group and a matrix approximation rank config.

This diff is a pure refactoring. Plan to add more state fields later.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117305280

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d --
test_powerSGD_ddp_comm_hook_nccl_grad_is_view

Reviewed By: rohan-varma

Differential Revision: D25137962

fbshipit-source-id: cd72b8b01e20f80a92c7577d22f2c96e9eebdc52
2020-11-21 09:25:35 -08:00
Yi Wang
1a6666c967 [Gradient Compression] Add a comment on _orthogonalize. (#48253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48253

Explained why a hand-crafted orthogonalize function is used instead of `torch.qr`.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117132622

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D25088607

fbshipit-source-id: ebc228afcb4737bb8529e7143ea170086730520e
2020-11-19 19:22:04 -08:00
Yi Wang
daff3a81a1 [Gradient Compression] PowerSGD comm hook (#48060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48060

Implement a PowerSGD variant that applies to a batched flattened tensor with zero paddings.

This version does not require handling 1D tensors and multi-dimenionsal tensors in the input separately, and hence it does not need to create two asyncrhonous future chains.

Potential optimizations:
1) Consider FP16 compression throughout PowerSGD.
2) Warm start and save one matrix multiplication per ieration.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117105938

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: jiayisuse

Differential Revision: D24843692

fbshipit-source-id: f44200b1fd6e12e829fc543d21ab7ae086769561
2020-11-19 02:59:11 -08:00
Xu Zhao
49f0e5dfeb Fix typing errors in torch.distributed.*, close issue #42967. (#47534)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47534

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D24952497

Pulled By: xuzhao9

fbshipit-source-id: 063bfd0707198436fcfd9431f72f9a392bc0017e
2020-11-16 23:27:59 -08:00
Yi Wang
fccfe7bd1a [Gradient Compression] Add unit tests that test default Python comm hook implementations (#47158)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47158

1. Test the default Python comm hook implementations ALLREDUCE and FP16_COMPRESS, besides an ad-hoc all-reduce implementation.
2. Typo fix.
3. Reformat default_hooks.py.
4. Publish register_comm_hook API for DDP module (This should be done in a separate diff, but got merged unintentionally.)

The new style can be used for testing any new comm hook like PowerSGD easily.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

ghstack-source-id: 116012600

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D24669639

fbshipit-source-id: 048c87084234edc2398f0ea6f01f2f083a707939
2020-11-06 00:28:09 -08:00