Commit Graph

151 Commits

PyTorch MergeBot
c5d57e7be9 Revert "Use batched operations for PowerSGD"
This reverts commit 5654e63398.

Reverted https://github.com/pytorch/pytorch/pull/75157 on behalf of https://github.com/albanD
2022-04-18 13:10:29 +00:00
magialiao
5654e63398 Use batched operations for PowerSGD
This implements the method proposed in #74907

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75157
Approved by: https://github.com/wayi1, https://github.com/rohan-varma
2022-04-18 04:34:17 +00:00
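A minimal sketch of the batching idea, under the assumption that same-shaped low-rank factors are stacked so a single batched `torch.linalg.qr` call replaces a Python loop of per-matrix calls (function and variable names here are illustrative, not the PR's):

```python
import torch

def orthogonalize_batched(matrices: torch.Tensor) -> torch.Tensor:
    # matrices has shape [batch, m, r]; torch.linalg.qr accepts batched
    # inputs, so one call orthogonalizes every low-rank factor at once.
    q, _ = torch.linalg.qr(matrices)
    return q

# Loop version (one QR per factor) vs. batched version (single call):
ps = [torch.randn(64, 4) for _ in range(8)]
loop_qs = [torch.linalg.qr(p)[0] for p in ps]
batched_qs = orthogonalize_batched(torch.stack(ps))
max_diff = max((a - b).abs().max().item() for a, b in zip(loop_qs, batched_qs))
print(f"max |loop - batched| = {max_diff:.2e}")
```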
Haijunlv
08f3b95857 fix PostLocalSGDOptimizer and ModelAverager average bug
Fixes #74157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74894
Approved by: https://github.com/rohan-varma, https://github.com/wayi1
2022-04-13 11:41:27 +00:00
wayi1
4fb7fa081e [Model Averaging] Code simplification for _find_process_group function (#75007)
Summary:
Previously, the highest-level process group in `period_process_group_dict` could be `None`, indicating the global group. Since `period_process_group_dict` can no longer contain `None` as a process group, `_find_process_group` can return a process group directly instead of a tuple, returning `None` only when no matching group is found.

Proposal: https://github.com/pytorch/pytorch/issues/71325

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75007

Reviewed By: awgu

Differential Revision: D35357816

Pulled By: rohan-varma

fbshipit-source-id: 4522dba49797df7140227bfd822d668b7e118a66
(cherry picked from commit 77ca01b555d52685283c969176b08de4ff46c32d)
2022-04-04 20:31:22 +00:00
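A hypothetical sketch of the simplified lookup; the body and dict semantics are inferred from the summary, not copied from the PR:

```python
from typing import Dict, Optional

import torch.distributed as dist

def _find_process_group(
    step: int, period_process_group_dict: Dict[int, dist.ProcessGroup]
) -> Optional[dist.ProcessGroup]:
    # The dict now maps period -> concrete process group (never None), so a
    # plain Optional return replaces the old tuple: pick the group with the
    # largest period dividing `step`, or None when no period divides it.
    for period in sorted(period_process_group_dict.keys(), reverse=True):
        if step % period == 0:
            return period_process_group_dict[period]
    return None
```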
Yi Wang
2aebece625 [Model Averaging] Remove unused variable world_size in post_localSGD_hook.py (#74803)
Summary:
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74803

Reviewed By: albanD

Differential Revision: D35175613

Pulled By: mrshenli

fbshipit-source-id: 881933656ed214554b8acb4c5756349cea0af51d
(cherry picked from commit 033efb2eea856d00d5e78c8a99d726c6cf69d714)
2022-03-28 17:41:26 +00:00
wayi1
5fbe8b1966 [Model Averaging] Make HierarchicalModelAverager a subclass of averagers.ModelAverager
Making `HierarchicalModelAverager` a subclass of `averagers.ModelAverager` is a preparation step for incorporating hierarchical SGD into `PostLocalSGDOptimizer`.

Proposal: https://github.com/pytorch/pytorch/issues/73382
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74564
Approved by: https://github.com/rohan-varma
2022-03-24 21:52:00 +00:00
wayi1
5993f48711 [Model Averaging] Add a reference to hierarchical SGD (#73823)
Summary:
Add a reference.

Also fix the comment: unlike `averagers.py`, this is currently not a base class that many subclasses can inherit from.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73823

Reviewed By: ejguan

Differential Revision: D34684366

Pulled By: rohan-varma

fbshipit-source-id: e253ed39ba0783ad73bfd889e9a2e7d0c9214a3a
(cherry picked from commit a9fec3585078881ccd5886ebb27e52b15f7181b1)
2022-03-08 05:56:17 +00:00
wayi1
0bb3b0652c [Model Averaging] Support hierarchical model averaging (#73285)
Summary:
Implement hierarchical model averaging proposed in https://github.com/pytorch/pytorch/issues/71325.

Unit tests are added. Since I don't have access to 4-GPU machines in the open-source environment, the expectation is that a branch with the `ci-all` prefix can run the test that requires 4 GPUs.

In the future, the internals of `PeriodicModelAverager` can be simplified into a specialized hierarchical model averager whose `period_group_size_dict` holds only a single pair of period and world size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73285

Reviewed By: mrshenli

Differential Revision: D34457792

Pulled By: rohan-varma

fbshipit-source-id: 39a6c5bf8a2852b6394a56abbad17b8a909b9fba
(cherry picked from commit 5f543d46103edb515db199dbb80db43c85665f29)
2022-03-04 18:29:36 +00:00
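A hedged usage sketch of the new averager, assuming 8 ranks and an already-initialized process group (the periods and group sizes below are illustrative):

```python
from collections import OrderedDict

import torch.distributed.algorithms.model_averaging.hierarchical_model_averager as hma

# With world_size=8: average within 2-rank subgroups every 4 steps, within
# 4-rank subgroups every 8 steps, and across all 8 ranks every 16 steps,
# once 100 warmup steps have passed.
averager = hma.HierarchicalModelAverager(
    period_group_size_dict=OrderedDict([(4, 2), (8, 4), (16, 8)]),
    warmup_steps=100,
)
# In the training loop, after optimizer.step():
#     averager.average_parameters(model.parameters())
```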
Andrew Gu
59dd84cab6 [Join][BE] Fix typo; remove obsolete method (#72886)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72886

**Test Plan**
Searching for `_schedule_shadow_all_reduce_for_fwd_pass` shows that it is defined but never used.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D34255651

Pulled By: awgu

fbshipit-source-id: 205a0325c2cdc05e127a183cb86fa2fc2e0db99d
(cherry picked from commit 4492f03a3f)
2022-02-16 15:03:09 +00:00
Rohan Varma
aeacf910b5 [Checkpoint] Rename file (#72748)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72748

Removes the underscore from the file/class name, as the directory is already private
ghstack-source-id: 149109295

Test Plan: CI

Reviewed By: samdow

Differential Revision: D34179308

fbshipit-source-id: 8e956f3c83f21159c5e0fcdce09624ecb8a73ac0
(cherry picked from commit adfd8bc357)
2022-02-16 00:08:23 +00:00
wayi1
8b08478115 Fix the doc of PostLocalSGDState (#72792)
Summary:
The first arg of the `PostLocalSGDState` ctor, `process_group`, cannot be empty. To simplify usage, the example does not even create a subgroup explicitly.

See the example in the unit test: 4feef6c970/torch/testing/_internal/distributed/distributed_test.py (L4260)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72792

Reviewed By: samdow

Differential Revision: D34213221

Pulled By: rohan-varma

fbshipit-source-id: 078343f3ee138e175bf835897f190032eb970662
(cherry picked from commit bf90af704f)
2022-02-15 23:47:12 +00:00
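A hedged usage sketch consistent with the linked unit-test example (values are illustrative; assumes an initialized default process group):

```python
from torch.distributed.algorithms.ddp_comm_hooks import post_localSGD_hook as post_localSGD

# process_group=None selects the default (global) process group, and
# subgroup=None lets the state build a default intra-machine subgroup, so
# no subgroup is created explicitly.
state = post_localSGD.PostLocalSGDState(
    process_group=None,
    subgroup=None,
    start_localSGD_iter=100,
)
# ddp_model.register_comm_hook(state, post_localSGD.post_localSGD_hook)
```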
Yuxin Wu
1ed4653e89 Stop writing logs to root logger (#72649)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/72648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72649

Reviewed By: soulitzer

Differential Revision: D34172113

Pulled By: mrshenli

fbshipit-source-id: 98cb4140b978a0d9fa53876e427ea3b8bbe884cf
(cherry picked from commit c14297cee6)
2022-02-11 21:30:53 +00:00
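The fix amounts to the standard module-level logger pattern; a minimal sketch:

```python
import logging

# A named (non-root) logger keeps library messages out of the root logger,
# so applications retain control over their own handlers and levels.
logger = logging.getLogger(__name__)

def average_parameters():
    logger.info("averaging parameters")  # instead of logging.info(...)
```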
Brian Muse
8bf3179f6e #71946 Remove Python 3.6 references (#72211)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/71946

This commit removes some bits of code that were hard coded for Python 3.6 support from the `.circleci` and `torch` folders. It should only be merged if https://github.com/pytorch/pytorch/issues/66462 is complete.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72211

Reviewed By: dagitses, seemethere

Differential Revision: D33982604

Pulled By: musebc

fbshipit-source-id: 8f453bf9909df615addd59538adb369c65484044
(cherry picked from commit 944a9970fe)
2022-02-08 03:46:20 +00:00
Omar
25f9fe22a9 [PowerSGD] Add orthogonalization with QR factorization (#72043)
Summary:
### 🚀 The feature, motivation and pitch
Following the discussion in https://github.com/pytorch/pytorch/issues/65813, I added the QR factorization to powerSGD_hook.py.
Gram-Schmidt orthogonalization can't be fully replaced because _torch.linalg.qr_ doesn't work with half precision. Moreover, in my tests, Gram-Schmidt is faster for ranks lower than 3.

This is one sample experiment timing powerSGD_hook on ResNext101 with the two different methods:
![Screenshot from 2022-01-31 18-14-00](https://user-images.githubusercontent.com/42100908/151840929-270c67dd-9fe7-4f11-8e70-8bf2d0ba678d.png)

### Alternatives
Use _torch.orgqr(*torch.geqrf(matrix))_. In my tests, its performance is similar to _torch.linalg.qr_.

### Additional context
_No response_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72043

Reviewed By: albanD

Differential Revision: D34042781

Pulled By: cbalioglu

fbshipit-source-id: e331179d3b7ac40d445b651fc473b16ae4ead462
(cherry picked from commit f64bf3839a)
2022-02-07 21:15:40 +00:00
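A hedged sketch of the dispatch the summary describes: QR for full-precision matrices of rank 3 or more, Gram-Schmidt otherwise. The threshold and function names are assumptions, not the hook's exact code:

```python
import torch

def _gram_schmidt(matrix: torch.Tensor) -> None:
    # Classic in-place Gram-Schmidt over columns; also covers half precision,
    # which torch.linalg.qr does not support.
    for i in range(matrix.shape[1]):
        col = matrix[:, i : i + 1]
        col /= torch.norm(col)
        if i + 1 < matrix.shape[1]:
            rest = matrix[:, i + 1 :]
            rest -= (col.t() @ rest) * col

def orthogonalize(matrix: torch.Tensor) -> None:
    if matrix.dtype in (torch.float16, torch.bfloat16) or matrix.shape[1] < 3:
        _gram_schmidt(matrix)  # faster at very low rank, works in half
    else:
        matrix.copy_(torch.linalg.qr(matrix).Q)
```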
Yanli Zhao
2336571cb7 make fsdp folder to be public (#72084)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72084

Make the fsdp folder public
ghstack-source-id: 148173447

Test Plan: unit tests

Reviewed By: mrshenli

Differential Revision: D33903417

fbshipit-source-id: 7852a2adc4af09af48a5ffa52ebf210489f834d5
(cherry picked from commit bd06513cfe)
2022-02-02 15:50:14 +00:00
Rohan Varma
8fa5cde3a9 Fix hooks (#71970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71970

- Provide default arg for power SGD convenience wrapper that matches the main API default

Test Plan: CI

Reviewed By: H-Huang

Differential Revision: D33837457

fbshipit-source-id: 8f4efab4992b3fff09456a18db2c83e087c25bdf
(cherry picked from commit 83f52fb3c7)
2022-01-28 23:07:33 +00:00
Rohan Varma
bdcdf94bdd [Opt Overlap] Clean up code in _OptimizerHookState (#71620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71620

Remove from_functional_optim and make it the default constructor, since
that is the only way _OptimizerHookState is now being built. Also, the
create_functional_optim helper function no longer needs to be exposed.
ghstack-source-id: 147577174

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33700593

fbshipit-source-id: ba089ce3bf66ccf8f71cffdd0f4d4bddc03e8b14
(cherry picked from commit a50b2caf0e)
2022-01-26 19:33:49 +00:00
Rohan Varma
1c8fcc44cb [Opt Overlap] Support optimizing partial set of parameters (#71608)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71608

Per title
ghstack-source-id: 147577178

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33696382

fbshipit-source-id: 5b638d3edf5f03ba476356d61e96ca604de18c8f
(cherry picked from commit 436b547fb0)
2022-01-26 19:33:49 +00:00
Rohan Varma
8273912a8c [Opt Overlap] Implement _OverlappedOptimizer (#71605)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71605

ghstack-source-id: 147577173

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33692686

fbshipit-source-id: b0fdb45245d923e1de8fef4431d3e235ac57dcbf
(cherry picked from commit 8b83dbf690)
2022-01-26 07:32:04 +00:00
Rohan Varma
f5a71ec2d6 [Opt Overlap] Implement as_functional_optim and create_functional_optim (#71604)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71604

Implement 2 helper functions:
- as_functional_optim, which takes a torch.optim class type and arguments and creates the corresponding functional optimizer.
- create_functional_optim, which takes the functional optimizer class type and constructs it. Note that as_functional_optim calls into create_functional_optim.

The first will be used in future PRs, as described in https://github.com/pytorch/pytorch/issues/67570, to create a functional optimizer from a traditional optimizer. The latter is used in _OptimizerHookState to create a functional optimizer.

Both new helper functions are covered by unit tests.
ghstack-source-id: 147577170

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33688995

fbshipit-source-id: 8b2daafd1b914efa90877cc4313aa9a428546fc1
(cherry picked from commit 42fdae2991)
2022-01-25 18:32:13 +00:00
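A loose sketch of the two helpers. The mapping dict is illustrative, and the `_allow_empty_param_list` keyword is an assumption about the functional optimizers' constructors (see the `allow_empty_param_list` PR further down this log):

```python
import torch
from torch.distributed.optim import _FunctionalSGD

# Illustrative mapping; the real one covers more optimizer classes.
_OPTIM_MAP = {torch.optim.SGD: _FunctionalSGD}

def create_functional_optim(functional_cls, *args, **kwargs):
    # Functional optimizers start with an empty param list; parameters are
    # fed in later (e.g. one DDP bucket at a time by a comm hook).
    return functional_cls([], *args, _allow_empty_param_list=True, **kwargs)

def as_functional_optim(optim_cls, *args, **kwargs):
    functional_cls = _OPTIM_MAP[optim_cls]
    return create_functional_optim(functional_cls, *args, **kwargs)

# e.g. as_functional_optim(torch.optim.SGD, 0.01) builds _FunctionalSGD with lr=0.01
```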
Rohan Varma
281663955f [Opt Overlap] Create Optimizer Hook State directly from functional optim (#71602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71602

The design in https://github.com/pytorch/pytorch/issues/67570 requires
`_OptimizerHookState` to be created directly from a functional optimizer. Add
support and tests for this. Also refactor a few tests.
ghstack-source-id: 147577175

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33687477

fbshipit-source-id: f3c789aa77773f918e01a8d0cf08739b2edf07b3
(cherry picked from commit 4851e1c6d4)
2022-01-25 18:32:13 +00:00
Rohan Varma
9b3a56eecf [Optimizer Overlap] Move hooks to own file (#71601)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71601

Moves the current prototype optimizer overlap to its own file for a better
namespace. No code changes besides a few comment fixes. Note that this code is
still prototype and not expected to be used by an end user.
ghstack-source-id: 147458826

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33662678

fbshipit-source-id: 3cc931323230a4b66c02b9e6f744aaf5c48d4d34
(cherry picked from commit 5070595c7f)
2022-01-23 00:04:32 +00:00
Rohan Varma
d8abe813bc [LocalSGD] Move feature to Beta, clean up some docs (#71621)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71621

Moves this feature to beta as discussed, and cleans up some docs.
Synced offline with wayi1, who mentioned that the current names are preferred
as he works to prototype hierarchical allreduce, as discussed in this RFC: https://github.com/pytorch/pytorch/issues/71325.
ghstack-source-id: 147382940

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D33700444

fbshipit-source-id: 8eb543f5b02a119d0790a5c0919e6def6383a067
(cherry picked from commit 656e9809b2)
2022-01-21 21:10:42 +00:00
Omar Younis
569aeec1bc fix typo in debugging_hooks.py (#70956)
Summary:
I just fixed a small typo in the debugging_hooks documentation

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70956

Reviewed By: jbschlosser

Differential Revision: D33508898

Pulled By: dagitses

fbshipit-source-id: fc5935e5a2e2ddc45657a22d3b33a11aba378d9b
2022-01-10 12:59:42 -08:00
Yi Wang
ed50a35cf8 [Model Averaging] Update the documentation of PeriodicModelAverager (#70974)
Summary:
Here 20 is a bad example, since the warmup step is set to 100; 200 iterations makes much more sense.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70974

Reviewed By: dagitses

Differential Revision: D33474576

Pulled By: rohan-varma

fbshipit-source-id: 4c7043108897848bde9503d77999971ad5567aa6
2022-01-07 13:20:42 -08:00
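A usage sketch matching the corrected example (assumes an initialized default process group; the period is illustrative):

```python
import torch.distributed.algorithms.model_averaging.averagers as averagers

# With warmup_steps=100, averaging only begins after step 100, so a loop of
# 200 iterations actually exercises it; a loop of 20 would never average.
averager = averagers.PeriodicModelAverager(period=4, warmup_steps=100)
# for step in range(200):
#     optimizer.step()
#     averager.average_parameters(model.parameters())
```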
Rohan Varma
a197f3fe52 [FSDP/Checkpoint] Activation offload support in checkpoint_wrapper (#70165)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70165

Implements activation offload support in the checkpoint_wrapper API via
save_on_cpu hooks. We avoid modifying the torch.utils.checkpoint implementation
and instead compose offload + checkpoint by using the save_on_cpu hook for the
former.
ghstack-source-id: 146078900

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D33228820

fbshipit-source-id: 98b4da0828462c41c381689ee07360ad014e808a
2021-12-21 10:08:18 -08:00
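A minimal sketch of the composition described above, pairing `torch.autograd.graph.save_on_cpu` (offload) with `torch.utils.checkpoint.checkpoint` (recompute); the wrapper function is illustrative:

```python
import torch
from torch.utils.checkpoint import checkpoint

def offloaded_checkpoint_forward(module: torch.nn.Module, *inputs):
    # Tensors that checkpointing saves for backward are packed into pinned
    # CPU memory by the save_on_cpu hooks instead of staying on GPU.
    with torch.autograd.graph.save_on_cpu(pin_memory=True):
        return checkpoint(module, *inputs, use_reentrant=True)
```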
Rohan Varma
79a40b22aa [Checkpoint] Make checkpoint_wrapper an nn.Module (#70164)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70164

Implement Alban's suggestion to make checkpoint_wrapper an nn.Module
instead of patching the forward pass, which is too hacky.
ghstack-source-id: 146011215

Test Plan: IC

Reviewed By: mrshenli

Differential Revision: D33214696

fbshipit-source-id: dc4b3e928d66fbde828ab60d90b314a8048ff7a2
2021-12-20 13:22:28 -08:00
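A minimal sketch of the nn.Module approach (the class name and details are illustrative, not the real wrapper's code):

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointWrapper(nn.Module):
    # Wrapping beats monkey-patching forward: the wrapper behaves like any
    # other module for .to(), hooks, and state_dict traversal.
    def __init__(self, module: nn.Module):
        super().__init__()
        self._module = module

    def forward(self, *args):
        return checkpoint(self._module, *args, use_reentrant=True)
```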
Rohan Varma
c4281cc92d Prototype checkpoint_wrapper (#69955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69955

Implements a checkpoint_wrapper function, which wraps an nn.Module with checkpointing so the user won't have to call checkpoint() every time they want to checkpoint the module.

Currently, only support for reentrant-based checkpointing is added, and it is only tested with FSDP to unblock a use case.

Future work is to add support for the new checkpointing API, add more tests, and upstream to torch.utils.checkpoint.
ghstack-source-id: 145811242

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D33107276

fbshipit-source-id: c4a1c68d71d65713a929994940a8750f73fbdbdb
2021-12-16 09:59:19 -08:00
Wanchao Liang
7c6a8a47db [BE] minor improvement to dist quantization (#67401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67401

Some minor changes to dist quantization: mainly change the namespace and add some notes for future code dedup.
ghstack-source-id: 143910067

Test Plan: wait for ci

Reviewed By: mrshenli

Differential Revision: D31979269

fbshipit-source-id: 85a2f395e6a3487dd0b9d1fde886eccab106e289
2021-11-21 23:31:59 -08:00
Michael Suo
f50bf16c04 Revert D31663043: [BE] minor improvement to dist quantization
Test Plan: revert-hammer

Differential Revision:
D31663043

Original commit changeset: 2f96b7346e9c

fbshipit-source-id: d38684dfe79ca335fbbe624496ad4c86c29d1570
2021-10-22 16:37:41 -07:00
Wanchao Liang
7379d4db20 [BE] minor improvement to dist quantization (#66649)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66649

Some minor changes to dist quantization: mainly change the namespace and add some notes for future code dedup.
ghstack-source-id: 141336191

Test Plan: wait for ci

Reviewed By: cbalioglu

Differential Revision: D31663043

fbshipit-source-id: 2f96b7346e9c90df5ab2536767f8301eb86a9c79
2021-10-22 13:46:28 -07:00
Yi Wang
c1415a0a72 [Reland] [Model Averaging] Simplify PostLocalSGD Optimizer API (#65197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65197

1. The constructor accepts a local optimizer instance instead of the class type and the inputs of the local optimizer constructor.
2. The parameters are read from the local optimizer's param_groups instead of a separate input.

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 138307226

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity

Reviewed By: rohan-varma

Differential Revision: D31007439

fbshipit-source-id: bbb0526e6763ef76775b85088571506b3942c722
2021-09-17 10:31:58 -07:00
Yi Wang
00e6e0c593 [Model Averaging] Revert #63895 (#64903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64903

Fix the accuracy regression caused by https://github.com/pytorch/pytorch/pull/63895.

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity

Reviewed By: rohan-varma

Differential Revision: D30894688

fbshipit-source-id: fe00b8b23b860d9f806f87c1b6caba1d0b807485
2021-09-14 09:45:42 -07:00
Yi Wang
bf9d66586c [DDP Comm Hook] Create a noop hook for performance debugging (#64344)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64344

As title.

Additionally, avoid using a numpy array in test_ddp_hooks.py.
ghstack-source-id: 137170449

Test Plan: buck test mode/dev-nosan caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks -- test_ddp_comm_hook_noop_hook

Reviewed By: rohan-varma

Differential Revision: D30693220

fbshipit-source-id: e17f0d1c6198863cf20a53566f586a6bff602522
2021-09-01 17:36:22 -07:00
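A sketch of what such a noop hook can look like, following the standard DDP comm-hook signature (details are assumptions, not the PR's exact code):

```python
import torch
import torch.distributed as dist

def noop_hook(state, bucket: dist.GradBucket) -> torch.futures.Future[torch.Tensor]:
    # Return the bucket's gradients untouched in an already-completed future:
    # no communication happens, which isolates allreduce cost when profiling.
    fut: torch.futures.Future[torch.Tensor] = torch.futures.Future()
    fut.set_result(bucket.buffer())
    return fut

# ddp_model.register_comm_hook(state=None, hook=noop_hook)
```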
Marjan Fariborz
6a76ee04de Adding alltoall_single collective to collective quantization API (#63154)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63154

The collective quantization API now supports alltoall, alltoall_single, and allscatter. The test is also included.
ghstack-source-id: 136856877

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/algorithms/quantization:DistQuantizationTests_nccl -- test_all_to_all_single_bfp16

Reviewed By: wanchaol

Differential Revision: D30255251

fbshipit-source-id: 856f4fa12de104689a03a0c8dc9e3ecfd41cad29
2021-08-27 12:46:31 -07:00
Marjan Fariborz
3b284ab024 Adding BFP16 quantization/dequantization support to OSS (#63059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63059

Adds support for the BFP16 quantization method to OSS. Currently only CPU is supported.
ghstack-source-id: 136639528

Test Plan: Imported from OSS

Reviewed By: wanchaol

Differential Revision: D30194538

fbshipit-source-id: ac248567ad8028457c2a91b77ef2ce81709fce53
2021-08-25 23:41:34 -07:00
Yi Wang
7edeead796 Add a comment on the potential implicit type up-casting (#63905)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63905

as title
ghstack-source-id: 136590703

Test Plan: N/A

Reviewed By: mrshenli

Differential Revision: D30527929

fbshipit-source-id: 69402bbfa87cfd8fc166ce313cde9736ee072589
2021-08-25 12:47:45 -07:00
Aayush Prakash
8a22d4fa5c [Reland] Replacing the p.data access in utils with tensor.set_ . Passes both test_post_localSGD_optimizer_parity and test_periodic_model_averager tests (#63895)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63895

When updating the model parameter, updating `parameter.data` is no longer recommended, because this `data` field will be deprecated in the future.

The replacement is `tensor.set_`.
ghstack-source-id: 136593433

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity

Reviewed By: SciPioneer

Differential Revision: D30526178

fbshipit-source-id: a1ac0ec3665d8623edd5bf94f01c1132daff5c00
2021-08-25 11:12:55 -07:00
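A minimal sketch of the replacement, writing an averaged value back into a parameter without touching `.data`:

```python
import torch

param = torch.nn.Parameter(torch.ones(3))
averaged = torch.full_like(param, 0.5)  # stand-in for an averaged value

with torch.no_grad():
    param.set_(averaged)  # instead of: param.data = averaged

print(param)  # tensor([0.5000, 0.5000, 0.5000], requires_grad=True)
```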
Edward Yang
699c764d2e Revert D30513613: Removing tensor.data usage in utils with tensor set_ method
Test Plan: revert-hammer

Differential Revision:
D30513613 (d08a36f831)

Original commit changeset: 402efb9c30fa

fbshipit-source-id: 911c66a9852de77dc5274b5fb373258c0c97739a
2021-08-24 12:20:37 -07:00
Aayush Prakash
d08a36f831 Removing tensor.data usage in utils with tensor set_ method (#63867)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63867

When updating the model parameter, updating `parameter.data` is no longer recommended, because this `data` field will be deprecated in the future.

The replacement is `tensor.set_`.

ghstack-source-id: 136531233

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager

Reviewed By: SciPioneer

Differential Revision: D30513613

fbshipit-source-id: 402efb9c30fafc3f285bebc631639f656ceae585
2021-08-24 11:20:44 -07:00
Marjan Fariborz
c545b099aa Separating quantization test from distributed_test (#63058)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63058

Dedicating separate tests to different quantization methods. Currently supports the FP16 method.
ghstack-source-id: 136499767

Test Plan: buck test mode/dev //caffe2/test/distributed/algorithms/quantization:quantization_gloo_fork -- name_of_the_test

Reviewed By: wanchaol

Differential Revision: D30142580

fbshipit-source-id: 3aacec1a231a662067d2b48c001f0c69fefcdd60
2021-08-24 01:44:55 -07:00
Yinbin Ma
0d437fe6d0 BF16 allreduce hook (#63260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63260

Add a BF16 all-reduce communication hook. Skip if CUDA version < 11 or NCCL version < 2.9.7.

Reviewed By: SciPioneer

Differential Revision: D30238317

fbshipit-source-id: bad35bf7d43f10f1c40997a282b831b61ef592bb
2021-08-18 20:53:49 -07:00
Yi Wang
979180cd01 [Model Averaging] Allow subgroup to be None in PostLocalSGDState (#63277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63277

`PostLocalSGDState` requires a subgroup. To initialize this subgroup, a global process group must be initialized. However, this imposes the restriction that a hook state can only be provided after distributed environment initialization, which is not compatible with the Lightning DDP plugin setup, where the hook state should be provided before distributed environment initialization.

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 135848575

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_hook_parity_post_localSGD

Reviewed By: cbalioglu

Differential Revision: D30325041

fbshipit-source-id: 7b870166d096d306c3f2f7c69816a705cec0bebd
2021-08-16 10:07:41 -07:00
Andrew Gu
2d75703c6a Remove req to call step() in training loop (#63164)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63164

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30284616

Pulled By: andwgu

fbshipit-source-id: afdb677fb08851b139178a9f6d782196f26773e1
2021-08-13 08:22:44 -07:00
Andrew Gu
bd81c9178a Simplify data structures, add uniform approximation, fix mem leak (#63162)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63162

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30284617

Pulled By: andwgu

fbshipit-source-id: 9bd9e5f89abcc0d3dac56b85d55cc88e843baa9f
2021-08-13 08:20:59 -07:00
Andrew Gu
1b1f1e36b4 Add `allow_empty_param_list` to functional optimizers (#62522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62522

Addresses https://github.com/pytorch/pytorch/issues/62481

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D30072074

Pulled By: andwgu

fbshipit-source-id: 1a5da21f9636b8d74a6b00c0f029427f0edff0e3
2021-08-09 11:18:56 -07:00
Marjan Fariborz
c7db642a72 Adding collective quantization API (#62142)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62142

Created a wrapper that takes the collective op and a quantization type as arguments. It quantizes the input, performs the collective op, and then performs dequantization.

Test Plan:
Tested through distributed_gloo_fork.
e.g., buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_all_to_all_quantized

Reviewed By: wanchaol

Differential Revision: D29682812

fbshipit-source-id: 79c39105ff11270008caa9f566361452fe82a92e
2021-08-09 08:11:22 -07:00
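A hedged sketch of the quantize -> collective -> dequantize flow the wrapper implements, using an fp16 cast as the quantization step (the standalone function is illustrative, not the API's code):

```python
import torch
import torch.distributed as dist

def quantized_all_gather(output_list, tensor, group=None):
    # Quantize the input, run the collective on the smaller payload, then
    # dequantize the gathered results back into the original dtype.
    quantized = tensor.to(torch.float16)
    gathered = [torch.empty_like(quantized) for _ in output_list]
    dist.all_gather(gathered, quantized, group=group)
    for out, q in zip(output_list, gathered):
        out.copy_(q.to(out.dtype))
```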
Sean Lawlor
34c9f5a8da [DDP Communication Hook] Update get_tensor and set_tensor to be cleaner naming conventions (buffer() and set_buffer()) (#62662)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662

Replaced the methods set_tensor(.) and get_tensor() in the Python-exposed API from the C++ logic with buffer() and set_buffer(.) for a cleaner interface.

Reviewed By: SciPioneer

Differential Revision: D30012869

fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482
2021-08-04 09:27:31 -07:00
Andrew Gu
62a90c227f Make _Join, _Joinable, _JoinHook public (#62605)
Summary:
**Overview:**
This removes the leading `_` from `_Join`, `_Joinable`, and `_JoinHook` in preparation for adding the generic join context manager tutorial (see [here](https://github.com/pytorch/tutorials/pull/1610)). This also adds a docs page, which can be linked from the tutorial. [Here](https://github.com/pytorch/pytorch/files/6919475/render.pdf) is a render of the docs page.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62605

Test Plan:
`DistributedDataParallel.join()`:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```

`ZeroRedundancyOptimizer`:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
NOTE: DDP overlap tests are failing due to a landing race. See https://github.com/pytorch/pytorch/pull/62592. Once the fix is landed, I will rebase, and tests should be passing.

`Join`:
```
gpurun4 python test/distributed/algorithms/test_join.py
```

Reviewed By: mrshenli

Differential Revision: D30055544

Pulled By: andwgu

fbshipit-source-id: a5ce1f1d9f1904de3bdd4edd0b31b0a612d87026
2021-08-03 12:20:11 -07:00
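A usage sketch of the now-public API; the training loop is illustrative, and `DistributedDataParallel` is assumed to be the `Joinable` being joined:

```python
from torch.distributed.algorithms.join import Join

def train_uneven(ddp_model, local_loader):
    # Assumes an initialized process group; loaders may have different
    # lengths per rank, and Join shadows collectives for exhausted ranks.
    with Join([ddp_model]):
        for inputs in local_loader:
            loss = ddp_model(inputs).sum()
            loss.backward()
```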
Andrew Gu
43327cc197 Refactor commonalities between two approaches (#62624)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62624

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30058543

Pulled By: andwgu

fbshipit-source-id: 73c794062b75e011868fae264f592549eed67482
2021-08-03 08:43:14 -07:00