Commit Graph

280 Commits

Author SHA1 Message Date
Olga Andreeva
f204afc2bb Added communication hook for sharded cases (#83254)
Fixes https://github.com/pytorch/pytorch/issues/79114

An implementation of an FSDP communication hook interface for sharded strategies:

- Added `reduce_scatter_hook` to the default hooks (see the sketch after this list). Note that, unlike `all_reduce`, `reduce_scatter` requires two tensors, `input_gradient` and `output`; it stores the result in `output`, which is further used as the summed gradient shard.
- Adjusted the FSDP logic to return `reduce_scatter_hook` as the default communication hook for sharded strategies; `DefaultState` is the same for sharded and non-sharded strategies.
- Adjusted the low-precision hooks to work with both `all_reduce` and `reduce_scatter`, depending on whether an `output` tensor is provided.
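For illustration, here is a minimal sketch of the reduce-scatter idea; the `state` fields and the even-divisibility assumption are mine and only mirror the description above, not the exact FSDP internals.

```python
import torch
import torch.distributed as dist

def reduce_scatter_hook(state, grad: torch.Tensor, output: torch.Tensor) -> None:
    """Reduce-scatter `grad` across ranks; `output` receives this rank's summed
    shard. `state` is assumed to carry the process group and the gradient
    pre-/post-division factors, mirroring `DefaultState`."""
    if state.gradient_predivide_factor > 1:
        grad.div_(state.gradient_predivide_factor)
    world_size = dist.get_world_size(state.process_group)
    # Assumes grad.numel() is evenly divisible by world_size (FSDP pads shards).
    chunks = [c.contiguous() for c in grad.chunk(world_size)]
    dist.reduce_scatter(output, chunks, group=state.process_group)
    if state.gradient_postdivide_factor > 1:
        output.div_(state.gradient_postdivide_factor)
```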

Test plan:

Added all existing sharded strategies as input parameters to the existing tests.
For `test_default_communication_hook_behaviour`, double-checked how a linear layer is sharded across workers. This test creates a simple ``1 x N`` net, where ``N`` is the number of workers. For sharded cases, the ``N`` parameters are sharded across the ``N`` workers. The test checks that after backward, each worker has the proper value in its chunk of the gradient, or that the whole gradient on every worker is equal to the expected value.

Checked that low-precision tests work for sharded cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83254
Approved by: https://github.com/rohan-varma, https://github.com/awgu
2022-08-18 18:41:14 +00:00
joncrall
4618371da5 Integrate xdoctest - Rebased (#82797)
This is a new version of #15648 based on the latest master branch.

Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.

In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed and 201 passed tests. The next commits will disable those failing tests. (Unfortunately, I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)
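For reference, the skip directive sits inline in a docstring's example block, roughly like this (illustrative function):

```python
def double(x):
    """Return twice the input.

    Example:
        >>> # xdoctest: +SKIP("segfaults on the dashboard; to be fixed in a follow-up")
        >>> double(3)
        6
    """
    return 2 * x
```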

Fixes https://github.com/pytorch/pytorch/issues/71105

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
2022-08-12 02:08:01 +00:00
Rohan Varma
5b2c03823d Generalize CheckpointWrapper (#83035)
Allow checkpoint_wrapper to take in the checkpoint functional impl. This decouples it from torch.utils.checkpoint and allows other checkpoint implementations to be used.
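A rough usage sketch; the keyword name `checkpoint_fn` and the import path are assumptions based on this description rather than a confirmed API:

```python
import torch
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import checkpoint_wrapper
from torch.utils.checkpoint import checkpoint

def my_checkpoint(fn, *args, **kwargs):
    # Stand-in for an alternative checkpoint implementation; here it simply
    # delegates to torch.utils.checkpoint.
    return checkpoint(fn, *args, **kwargs)

block = nn.Sequential(nn.Linear(8, 8), nn.ReLU())
wrapped = checkpoint_wrapper(block, checkpoint_fn=my_checkpoint)
out = wrapped(torch.randn(2, 8, requires_grad=True))
```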
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83035
Approved by: https://github.com/awgu
2022-08-09 23:35:39 +00:00
ProGamerGov
71d50f4f89 Change docstring type callable to Callable for consistency (#82487)
### Description

Across PyTorch's docstrings, both `callable` and `Callable` are used for variable types. `Callable` should be capitalized, as we are referring to the `Callable` type and not the Python `callable()` function.
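For example, an argument annotation in a docstring would read (illustrative):

```python
def register_hook(hook):
    r"""Registers a hook.

    Args:
        hook (Callable): the function to run. ``Callable`` names the type;
            the lowercase ``callable()`` is the built-in predicate.
    """
    return hook
```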

### Testing

There shouldn't be any testing required.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82487
Approved by: https://github.com/albanD
2022-08-01 17:26:09 +00:00
Olga Andreeva
a60907ec11 Adding fsdp fp16 and bf16 hooks (#81711)
Recently, `register_comm_hook` was introduced to `FSDP`, which at the moment supports only the `NO_SHARD` strategy and has a default `all_reduce` hook implemented. This PR adds two lower-precision hooks alongside the existing default hook.

I've also made slight adjustments to the existing implementation of the `all_reduce` hook, including:

- `AllReduceState` -> `DefaultState`; motivation: `AllReduceState` is not specific to `all_reduce`. Gradients' pre- and post-division factors are also useful for other hooks that require pre- and post-division, e.g. `fp16_hook` and `bf16_hook`.
- I've put all 3 hooks into `default_hooks.py`

Additionally, `FSDP` supports `MixedPrecision`, and it is theoretically possible to specify `MixedPrecision` for gradients and attach a lower-precision hook to the model. To avoid double-casting, I've added a couple of checks to `fully_sharded_data_parallel`, i.e. casting to the lower precision and back is performed by the lower-precision hook only. As a next step, it would be nice to ensure that a user can't have both a lower-precision hook and `MixedPrecision(reduce_dtype=<precision>)` specified, but I am happy to discuss this and adjust the current implementation.

As a test, I create two models: one with a lower-precision hook and one with `MixedPrecision(reduce_dtype=<precision>)` specified, perform one forward/backward pass and optimizer step, and compare gradients. A sketch of the low-precision hook idea follows below.

P.S. The first version of this PR was reverted because the added unit tests didn't include NCCL version checks for `bf16_hook` (and thus failed on trunk). In this version, I've added the appropriate checks to the tests.
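A minimal sketch of the fp16 idea, wrapping the default all-reduce; the names and `state` fields are illustrative, not the exact FSDP code:

```python
import torch
import torch.distributed as dist

def fp16_compress_hook(state, grad: torch.Tensor) -> None:
    """Cast the gradient to fp16, all-reduce it, then copy it back in place
    at the original precision. `state` mirrors `DefaultState` (process group
    plus gradient pre-/post-division factors)."""
    compressed = grad.to(torch.float16)
    if state.gradient_predivide_factor > 1:
        compressed.div_(state.gradient_predivide_factor)
    dist.all_reduce(compressed, group=state.process_group)
    if state.gradient_postdivide_factor > 1:
        compressed.div_(state.gradient_postdivide_factor)
    grad.copy_(compressed)  # copy_ casts back to grad's original dtype
```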

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81711
Approved by: https://github.com/rohan-varma
2022-07-19 23:54:51 +00:00
PyTorch MergeBot
a8f4011e90 Revert "Adding fsdp fp16 and bf16 hooks (#80557)"
This reverts commit f7d6828467.

Reverted https://github.com/pytorch/pytorch/pull/80557 on behalf of https://github.com/aovladi due to breaking distributed tests on trunk
2022-07-19 03:11:19 +00:00
Olga Andreeva
f7d6828467 Adding fsdp fp16 and bf16 hooks (#80557)
Recently, `register_comm_hook` was introduced to `FSDP`, which at the moment supports only the `NO_SHARD` strategy and has a default `all_reduce` hook implemented. This PR adds two lower-precision hooks alongside the existing default hook.

I've also made slight adjustments to the existing implementation of the `all_reduce` hook, including:

- `AllReduceState` -> `DefaultState`; motivation: `AllReduceState` is not specific to `all_reduce`. Gradients' pre- and post-division factors are also useful for other hooks that require pre- and post-division, e.g. `fp16_hook` and `bf16_hook`.
- I've put all 3 hooks into `default_hooks.py`

Additionally, `FSDP` supports `MixedPrecision`, and it is theoretically possible to specify `MixedPrecision` for gradients and attach a lower-precision hook to the model. To avoid double-casting, I've added a couple of checks to `fully_sharded_data_parallel`, i.e. casting to the lower precision and back is performed by the lower-precision hook only. As a next step, it would be nice to ensure that a user can't have both a lower-precision hook and `MixedPrecision(reduce_dtype=<precision>)` specified, but I am happy to discuss this and adjust the current implementation.

As a test, I create two models: one with a lower-precision hook and one with `MixedPrecision(reduce_dtype=<precision>)` specified, perform one forward/backward pass and optimizer step, and compare gradients.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80557
Approved by: https://github.com/rohan-varma
2022-07-18 22:40:56 +00:00
Jerome
547e499731 Enable Zero1's ddp_with_overlap for hpu backend (#80438)
Enables ZeRO with the DDP overlap feature, along with a simple interface for inserting a functional optimizer into the map.

Signed-off-by: Jerome <janand@habana.ai>

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80438
Approved by: https://github.com/rohan-varma, https://github.com/awgu
2022-07-18 15:05:27 +00:00
Rohan Varma
0c5fdfd95f Revert "Revert "[FSDP Optim State] Remove checkpoint prefix (#80480)"" (#80936)
This reverts commit fe361dede4.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80936
Approved by: https://github.com/awgu
2022-07-06 22:21:07 +00:00
PyTorch MergeBot
fe361dede4 Revert "[FSDP Optim State] Remove checkpoint prefix (#80480)"
This reverts commit 04c50fec1c.

Reverted https://github.com/pytorch/pytorch/pull/80480 on behalf of https://github.com/suo due to breaking master (04c50fec1c); the test failures were related to this PR
2022-07-06 02:43:27 +00:00
Rohan Varma
04c50fec1c [FSDP Optim State] Remove checkpoint prefix (#80480)
Remove `_checkpoint_wrapped_module` prefixes when creating keys for optimizer state_dict.

Having these prefixes does not actually create an issue for optim_state_dict save / load, but we'd like to strip them out for downstream code that consumes these APIs and typically expects checkpointing prefixes not to exist (checkpointing should be a transparent operation that does not change module / parameter names).
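Conceptually, the cleanup is just a key rename over the optimizer state_dict's fully qualified names; the helper below is illustrative:

```python
CHECKPOINT_PREFIX = "_checkpoint_wrapped_module."

def clean_key(fqn: str) -> str:
    """Strip every occurrence of the checkpoint wrapper prefix from a fully
    qualified parameter name used as an optimizer state_dict key."""
    return fqn.replace(CHECKPOINT_PREFIX, "")

assert clean_key("layer1._checkpoint_wrapped_module.weight") == "layer1.weight"
```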
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80480
Approved by: https://github.com/awgu, https://github.com/fegin
2022-07-06 01:17:58 +00:00
Chien-Chin Huang
e0eeb06ec6 Consolidate the naming of named_parameter and state_dict for CheckpointWrapper (#80089)
named_parameters() should return the same parameter names as state_dict(), but the current CheckpointWrapper does not enforce this naming rule. This PR resolves that issue.

Differential Revision: [D37344200](https://our.internmc.facebook.com/intern/diff/D37344200/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80089
Approved by: https://github.com/rohan-varma
2022-07-05 22:11:59 +00:00
Charlie Yan
ffae7308c9 Enable test: distributed/algorithms/quantization/test_quantization (#80097)
Fixes https://github.com/pytorch/pytorch/issues/69017
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80097
Approved by: https://github.com/wanchaol
2022-07-01 01:32:33 +00:00
PyTorch MergeBot
f667aaed1d Revert "Added serialization to postlocal_SGD. (#80435)"
This reverts commit dfdf4e79df.

Reverted https://github.com/pytorch/pytorch/pull/80435 on behalf of https://github.com/suo due to breaking distributed tests on trunk, see: dfdf4e79df
2022-06-30 01:34:10 +00:00
Olga Andreeva
dfdf4e79df Added serialization to postlocal_SGD. (#80435)
Fixes #75666

This PR adds the functionality for the `PostLocalSGD` communication hook and tests that the communication hook can be properly saved and restored. It is similar to https://github.com/pytorch/pytorch/pull/79334, where serialization was added to `PowerSGD`.

``__getstate__``
Returns a ``Dict[str, Any]`` which will be pickled and saved. ``process_group`` and ``subgroup`` are not serializable and are excluded from the returned state.

``__setstate__``
Takes the provided ``state`` and retrieves ``PostLocalSGDState``. ``process_group`` and ``subgroup`` are set to the default process group and subgroup, respectively. The default subgroup is equivalent to the subgroup on each node.

Small adjustment to `PowerSGD`'s warning message.

Refactored the unit test, i.e. separated the parity and log checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80435
Approved by: https://github.com/awgu
2022-06-29 23:59:46 +00:00
Rohan Varma
5fc2d45a3a Remove unneeded TODO (#80453)
This TODO is no longer needed, as we use `_register_fused_optim` to register the overlapped optimizer in DDP. Also, remove the comment about the API being experimental, as this API is no longer going to be used by end users.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80453
Approved by: https://github.com/awgu
2022-06-29 01:19:48 +00:00
Olga Andreeva
5fc209ed11 FSDP communication hook interface for NO_SHARD strategy (#79833)
Fixes #79114

An implementation of an FSDP communication hook interface for the NO_SHARD strategy:
- `FullyShardedDataParallel.register_comm_hook(self, state: object, hook: callable)` checks the current sharding strategy. If it is anything other than NO_SHARD, it raises a runtime error. Otherwise, it sets and shares the specified hook and its state with all submodules.
- When FSDP is ready to communicate a gradient, it checks whether there is a registered hook and calls it instead of all_reduce. Additionally, gradient pre- and post-division are not performed if a hook is registered.

To test the interface, I've implemented a communication hook that calls `all_reduce`. A usage sketch is shown below.
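A rough sketch of that setup (assumes an initialized process group and an available GPU; the hook state class is illustrative):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

class HookState:
    def __init__(self, process_group=None):
        self.process_group = process_group  # None means the default group

def allreduce_hook(state: HookState, grad: torch.Tensor) -> None:
    # Replicate the default behavior: sum across ranks, then average.
    dist.all_reduce(grad, group=state.process_group)
    grad.div_(dist.get_world_size(state.process_group))

model = FSDP(nn.Linear(8, 8).cuda(), sharding_strategy=ShardingStrategy.NO_SHARD)
model.register_comm_hook(HookState(), allreduce_hook)
```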

A unit test:
- checks that if the sharding strategy is anything but NO_SHARD, a runtime error is raised
- checks that for the NO_SHARD case, a model with a registered all_reduce hook and one without a hook behave the same
- checks both types of FSDP models, with and without the first layer wrapped (to make sure submodules have the hook registered)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79833
Approved by: https://github.com/rohan-varma, https://github.com/awgu
2022-06-28 08:03:11 +00:00
anjali411
3bcc19b29a Add __all__ to various submodules in torch.fx, distributions, distributed, package (#80367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80367
Approved by: https://github.com/albanD
2022-06-27 21:27:30 +00:00
Rohan Varma
2ede28724d [CheckpointWrapper] Replace generic mod prefix (#79830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79830
Approved by: https://github.com/awgu, https://github.com/zhaojuanmao
2022-06-21 16:01:59 +00:00
Olga Andreeva
8a6d83079c Functionality/pickling for commhooks (#79334)
This PR addresses issue #75666.
A stateful communication hook can now be saved and reloaded to resume training.

This PR adds the functionality for the PowerSGD communication hook and tests that the hook can be properly saved and restored.

The PowerSGD implementation uses ``__slots__``; as a result, the introduced `__getstate__` and `__setstate__` methods are implemented to work with `__slots__` rather than `__dict__`.

`__getstate__`
Returns a dictionary that represents a ``PowerSGDState``, which will be pickled and saved. ``process_group`` is non-serializable and excluded from the returned state.

`__setstate__`
Takes the provided ``state`` and retrieves ``PowerSGDState``. ``process_group`` is set to the default, with a proper warning issued to the user.
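The `__slots__` handling boils down to building the state dict by hand; a minimal sketch with a stand-in class (not the real `PowerSGDState`):

```python
import warnings
import torch.distributed as dist

class SlottedHookState:
    __slots__ = ["process_group", "matrix_approximation_rank"]

    def __init__(self, process_group=None, matrix_approximation_rank=1):
        self.process_group = process_group
        self.matrix_approximation_rank = matrix_approximation_rank

    def __getstate__(self):
        # There is no __dict__ with __slots__, so collect the slots explicitly
        # and drop the non-serializable process group.
        return {
            slot: getattr(self, slot)
            for slot in self.__slots__
            if slot != "process_group"
        }

    def __setstate__(self, state):
        warnings.warn("process_group is reset to the default group on reload.")
        self.process_group = dist.group.WORLD  # default group sentinel
        for slot, value in state.items():
            setattr(self, slot, value)

# pickle.dumps(SlottedHookState()) now works; on load, process_group
# falls back to the default group.
```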

Unit test

A hook-independent `_test_hook_pickling` is added with this PR, as well as `test_ddp_hook_pickling_powerSGD`, which tests `powerSGD`’s ability to be saved and reloaded.

Currently, the test creates a DDP model with the provided hook, trains it for 10 epochs, and saves the model's state and the hook's state.
During reloading, the unit test makes sure that a warning was logged (only one warning, and the proper one). It then proceeds to check that the reloaded hook and the original hook are the same. Finally, it checks that the hook's state was properly initialized:
- it compares slot values (all but 2: `process_group` and `rng`) for the original and reloaded state
- it checks that the process group was set to the default group
- it checks that the random state was restored properly with `np.testing.assert_array_equal`, because `rng` is an instance of `np.random.RandomState`, represented by a tuple, and one of its entries is an `ndarray` of dtype `uint32`

Future To-Do:
- Implement similar `__getstate__` and `__setstate__` methods for other stateful communication hooks
- Add appropriate tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79334
Approved by: https://github.com/rohan-varma, https://github.com/awgu
2022-06-16 23:15:34 +00:00
Rohan Varma
543919cfc8 Forward attributes to wrapped module
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78854

Approved by: https://github.com/albanD
2022-06-14 01:13:33 +00:00
Rohan Varma
44fe851feb [WIP] Fix non-reentrant hooks based checkpointing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78752

Approved by: https://github.com/albanD
2022-06-14 01:13:33 +00:00
Rohan Varma
ec86070922 Checkpoint util
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78704

Approved by: https://github.com/zhaojuanmao
2022-06-10 18:37:36 +00:00
Rohan Varma
f9f8127414 CheckpointWrapper state_dict fix (#77224)
- Uses state dict / load state dict hooks to ensure that modules wrapped with `CheckpointWrapper` can be loaded into a non-checkpoint-wrapped module.

This is because a training run can use activation checkpointing and then save a `state_dict`, while a future run may not want to wrap modules with activation checkpointing, or may decide to change the activation checkpoint wrapping structure. To support this, we add hooks to remove / add the relevant prefix as needed.

Tests are added to ensure we can load a `CheckpointWrapper`-wrapped module's state dict into both a `CheckpointWrapper` module and a local module. `state_dict` with FSDP is also verified.
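A minimal sketch of the save-side hook (the load-side hook adds the prefix back); the wrapper class is an illustrative stand-in, not the actual `CheckpointWrapper`:

```python
import torch.nn as nn

PREFIX = "_checkpoint_wrapped_module."

class WrapperWithCleanKeys(nn.Module):
    def __init__(self, module: nn.Module):
        super().__init__()
        self._checkpoint_wrapped_module = module
        # Rewrite keys on save so consumers never see the wrapper prefix.
        self._register_state_dict_hook(self._strip_prefix)

    def forward(self, *args, **kwargs):
        return self._checkpoint_wrapped_module(*args, **kwargs)

    @staticmethod
    def _strip_prefix(module, state_dict, prefix, local_metadata):
        for key in list(state_dict.keys()):
            state_dict[key.replace(PREFIX, "")] = state_dict.pop(key)
        return state_dict
```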
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77224
Approved by: https://github.com/zhaojuanmao
2022-05-17 03:39:31 +00:00
wayi1
5ab8afe487 [Model Averaging] Support disabling post-local gradient sync (#76723)
I find that disabling the intra-subgroup gradient allreduce can still give satisfying accuracy in some cases, so it is better to make such gradient averaging configurable. This is before even taking into account the communication savings from skipping the gradient allreduce.
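An illustrative configuration, assuming the toggle is exposed as `post_local_gradient_allreduce` on `PostLocalSGDState` and that `torch.distributed` is already initialized:

```python
from torch.distributed.algorithms.ddp_comm_hooks.post_localSGD_hook import PostLocalSGDState

# Keep gradients local within each subgroup once local SGD starts, relying on
# periodic model averaging instead of an intra-subgroup all-reduce.
state = PostLocalSGDState(
    process_group=None,                 # default group
    subgroup=None,                      # default: one subgroup per node
    start_localSGD_iter=100,
    post_local_gradient_allreduce=False,
)
```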

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76723
Approved by: https://github.com/rohan-varma
2022-05-16 18:09:09 +00:00
Yi Wang
25fa6235f4 [Model Averaging] Make an error message more clear in hierarchical_model_averager.py
As title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75832
Approved by: https://github.com/mrshenli
2022-04-26 15:20:51 +00:00
wayi1
e90580390d [Model Averaging] Make the error message more informative in hierarchical_model_averager.py
As title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76277
Approved by: https://github.com/rohan-varma
2022-04-24 15:10:19 +00:00
magialiao
7c8c8cc248 Use batched operations for PowerSGD
This PR is a rebased version of #75157 which fixes CI issues
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76041
Approved by: https://github.com/albanD, https://github.com/rohan-varma
2022-04-21 03:25:09 +00:00
Alban Desmaison
da3c848dfa Make distributed raise ImportError when not available
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75975

Approved by: https://github.com/mrshenli
2022-04-20 13:05:18 +00:00
PyTorch MergeBot
c5d57e7be9 Revert "Use batched operations for PowerSGD"
This reverts commit 5654e63398.

Reverted https://github.com/pytorch/pytorch/pull/75157 on behalf of https://github.com/albanD
2022-04-18 13:10:29 +00:00
magialiao
5654e63398 Use batched operations for PowerSGD
This implements method proposed in #74907

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75157
Approved by: https://github.com/wayi1, https://github.com/rohan-varma
2022-04-18 04:34:17 +00:00
Haijunlv
08f3b95857 fix PostLocalSGDOptimizer and ModelAverager average bug
Fixes #74157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74894
Approved by: https://github.com/rohan-varma, https://github.com/wayi1
2022-04-13 11:41:27 +00:00
wayi1
4fb7fa081e [Model Averaging] Code simplification for _find_process_group function (#75007)
Summary:
Previously the highest-level process group in `period_process_group_dict` could be `None`, indicating the global group. Now `period_process_group_dict` cannot contain `None` as a process group, so the function `_find_process_group` can just return a process group instead of a tuple -- when not found, just return `None`, because now the returned process group cannot be `None`.

Proposal: https://github.com/pytorch/pytorch/issues/71325

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75007

Reviewed By: awgu

Differential Revision: D35357816

Pulled By: rohan-varma

fbshipit-source-id: 4522dba49797df7140227bfd822d668b7e118a66
(cherry picked from commit 77ca01b555d52685283c969176b08de4ff46c32d)
2022-04-04 20:31:22 +00:00
Yi Wang
2aebece625 [Model Averaging] Remove unused variable world_size in post_localSGD_hook.py (#74803)
Summary:
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74803

Reviewed By: albanD

Differential Revision: D35175613

Pulled By: mrshenli

fbshipit-source-id: 881933656ed214554b8acb4c5756349cea0af51d
(cherry picked from commit 033efb2eea856d00d5e78c8a99d726c6cf69d714)
2022-03-28 17:41:26 +00:00
wayi1
5fbe8b1966 [Model Averaging] Make HierarchicalModelAverager a subclass of averagers.ModelAverager
Making `HierarchicalModelAverager` a subclass of `averagers.ModelAverager` is a preparation step for incorporating hierarchical SGD into `PostLocalSGDOptimizer`.

Proposal: https://github.com/pytorch/pytorch/issues/73382
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74564
Approved by: https://github.com/rohan-varma
2022-03-24 21:52:00 +00:00
wayi1
5993f48711 [Model Averaging] Add a reference to hierarchical SGD (#73823)
Summary:
Add a reference.

Also fix the comment: unlike `averagers.py`, this is currently not a base class that can be inherited by many subclasses.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73823

Reviewed By: ejguan

Differential Revision: D34684366

Pulled By: rohan-varma

fbshipit-source-id: e253ed39ba0783ad73bfd889e9a2e7d0c9214a3a
(cherry picked from commit a9fec3585078881ccd5886ebb27e52b15f7181b1)
2022-03-08 05:56:17 +00:00
wayi1
0bb3b0652c [Model Averaging] Support hierarchical model averaging (#73285)
Summary:
Implement hierarchical model averaging proposed in https://github.com/pytorch/pytorch/issues/71325.

Unit tests are added. Since I don't have access to 4-GPU machines in the open-source environment, I expect that the branch with the `ci-all` prefix can run the test that requires 4 GPUs.

In the future, the internals of `PeriodicModelAverager` can be simplified to a specialized case of hierarchical model averaging, where `period_group_size_dict` only has a single pair of period and world size. A rough usage sketch of the hierarchical averager follows below.
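A rough usage sketch under the proposal (assumes 4 nodes of 8 GPUs each and an initialized process group; the argument names follow the proposal and may differ slightly from the final API):

```python
from collections import OrderedDict
from torch.distributed.algorithms.model_averaging.hierarchical_model_averager import (
    HierarchicalModelAverager,
)

averager = HierarchicalModelAverager(
    # Average within each 8-rank node every 4 steps, and across all 32 ranks
    # every 16 steps.
    period_group_size_dict=OrderedDict([(4, 8), (16, 32)]),
    warmup_steps=100,
)
# In the training loop, after each optimizer.step():
#     averager.average_parameters(model.parameters())
```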

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73285

Reviewed By: mrshenli

Differential Revision: D34457792

Pulled By: rohan-varma

fbshipit-source-id: 39a6c5bf8a2852b6394a56abbad17b8a909b9fba
(cherry picked from commit 5f543d46103edb515db199dbb80db43c85665f29)
2022-03-04 18:29:36 +00:00
Andrew Gu
59dd84cab6 [Join][BE] Fix typo; remove obsolete method (#72886)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72886

**Test Plan**
Searching for `_schedule_shadow_all_reduce_for_fwd_pass` shows that it is defined but never used.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D34255651

Pulled By: awgu

fbshipit-source-id: 205a0325c2cdc05e127a183cb86fa2fc2e0db99d
(cherry picked from commit 4492f03a3f)
2022-02-16 15:03:09 +00:00
Rohan Varma
aeacf910b5 [Checkpoint] Rename file (#72748)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72748

Removes underscore from file/class as directory is already private
ghstack-source-id: 149109295

Test Plan: Ci

Reviewed By: samdow

Differential Revision: D34179308

fbshipit-source-id: 8e956f3c83f21159c5e0fcdce09624ecb8a73ac0
(cherry picked from commit adfd8bc357)
2022-02-16 00:08:23 +00:00
wayi1
8b08478115 Fix the doc of PostLocalSGDState (#72792)
Summary:
The first arg of the `PostLocalSGDState` ctor, `process_group`, cannot be empty. Here, to simplify the usage, the example does not even create a subgroup explicitly.

See the example in unit test: 4feef6c970/torch/testing/_internal/distributed/distributed_test.py (L4260)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72792

Reviewed By: samdow

Differential Revision: D34213221

Pulled By: rohan-varma

fbshipit-source-id: 078343f3ee138e175bf835897f190032eb970662
(cherry picked from commit bf90af704f)
2022-02-15 23:47:12 +00:00
Yuxin Wu
1ed4653e89 Stop writing logs to root logger (#72649)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/72648
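The usual fix for this class of issue is to log through a module-level logger rather than the root logger (illustrative):

```python
import logging

logger = logging.getLogger(__name__)  # module-level logger, not the root logger

def warn_once(msg: str) -> None:
    logger.warning(msg)  # handlers and levels stay under the application's control
```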

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72649

Reviewed By: soulitzer

Differential Revision: D34172113

Pulled By: mrshenli

fbshipit-source-id: 98cb4140b978a0d9fa53876e427ea3b8bbe884cf
(cherry picked from commit c14297cee6)
2022-02-11 21:30:53 +00:00
Brian Muse
8bf3179f6e #71946 Remove Python 3.6 references (#72211)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/71946

This commit removes some bits of code that were hard coded for Python 3.6 support from the `.circleci` and `torch` folders. It should only be merged if https://github.com/pytorch/pytorch/issues/66462 is complete.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72211

Reviewed By: dagitses, seemethere

Differential Revision: D33982604

Pulled By: musebc

fbshipit-source-id: 8f453bf9909df615addd59538adb369c65484044
(cherry picked from commit 944a9970fe)
2022-02-08 03:46:20 +00:00
Omar
25f9fe22a9 [PowerSGD] Add orthogonalization with QR factorization (#72043)
Summary:
### 🚀 The feature, motivation and pitch
Following the discussion in https://github.com/pytorch/pytorch/issues/65813, I added the QR factorization to powerSGD_hook.py
Gram-Schmidt orthogonalization can't be fully replaced because _torch.linalg.qr_ doesn't work with half precision. Moreover, in my tests, Gram-Schmidt is faster for ranks smaller than 3.

This is one sample experiment timing powerSGD_hook on ResNext101 with the two different methods:
![Screenshot from 2022-01-31 18-14-00](https://user-images.githubusercontent.com/42100908/151840929-270c67dd-9fe7-4f11-8e70-8bf2d0ba678d.png)
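A condensed sketch of the dtype dispatch described above (simplified from the idea, not the exact hook code):

```python
import torch

def orthogonalize(matrix: torch.Tensor, eps: float = 1e-8) -> None:
    """Orthogonalize the columns of a tall `matrix` in place."""
    if matrix.dtype in (torch.float16, torch.bfloat16):
        # torch.linalg.qr does not support half precision, so fall back to
        # (modified) Gram-Schmidt, which also tends to win for rank < 3.
        num_cols = matrix.shape[1]
        for i in range(num_cols):
            col = matrix[:, i : i + 1]
            col /= torch.norm(col) + eps
            if i + 1 < num_cols:
                rest = matrix[:, i + 1 :]
                rest -= col @ (col.t() @ rest)
    else:
        q, _ = torch.linalg.qr(matrix)   # reduced QR; Q matches matrix's shape
        matrix.copy_(q)
```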

### Alternatives
Use _torch.orgqr(*torch.geqrf(matrix))_. From my tests, its performance is similar to _torch.linalg.qr_.

### Additional context
_No response_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72043

Reviewed By: albanD

Differential Revision: D34042781

Pulled By: cbalioglu

fbshipit-source-id: e331179d3b7ac40d445b651fc473b16ae4ead462
(cherry picked from commit f64bf3839a)
2022-02-07 21:15:40 +00:00
Yanli Zhao
2336571cb7 make fsdp folder to be public (#72084)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72084

make fsdp folder to be public
ghstack-source-id: 148173447

Test Plan: unit tests

Reviewed By: mrshenli

Differential Revision: D33903417

fbshipit-source-id: 7852a2adc4af09af48a5ffa52ebf210489f834d5
(cherry picked from commit bd06513cfe)
2022-02-02 15:50:14 +00:00
Rohan Varma
8fa5cde3a9 Fix hooks (#71970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71970

- Provide default arg for power SGD convenience wrapper that matches the main API default

Test Plan: CI

Reviewed By: H-Huang

Differential Revision: D33837457

fbshipit-source-id: 8f4efab4992b3fff09456a18db2c83e087c25bdf
(cherry picked from commit 83f52fb3c7)
2022-01-28 23:07:33 +00:00
Rohan Varma
bdcdf94bdd [Opt Overlap] Clean up code in _OptimizerHookState (#71620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71620

Remove from_functional_optim and make it the default constructor, since that is the only way _OptimizerHookState is now being built. Also, we no longer need to expose the create_functional_optim helper function.
ghstack-source-id: 147577174

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33700593

fbshipit-source-id: ba089ce3bf66ccf8f71cffdd0f4d4bddc03e8b14
(cherry picked from commit a50b2caf0e)
2022-01-26 19:33:49 +00:00
Rohan Varma
1c8fcc44cb [Opt Overlap] Support optimizing partial set of parameters (#71608)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71608

Per title
ghstack-source-id: 147577178

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33696382

fbshipit-source-id: 5b638d3edf5f03ba476356d61e96ca604de18c8f
(cherry picked from commit 436b547fb0)
2022-01-26 19:33:49 +00:00
Rohan Varma
8273912a8c [Opt Overlap] Implement _OverlappedOptimizer (#71605)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71605

ghstack-source-id: 147577173

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33692686

fbshipit-source-id: b0fdb45245d923e1de8fef4431d3e235ac57dcbf
(cherry picked from commit 8b83dbf690)
2022-01-26 07:32:04 +00:00
Rohan Varma
f5a71ec2d6 [Opt Overlap] Implement as_functional_optim and create_functional_optim (#71604)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71604

Implement 2 helper functions:
- as_functional_optim which takes in a torch.optim class type and arguments and
  creates the corresponding functional optimizer.
- create_functional_optim which takes in the functional optimizer class type
  and constructs it. Note that as_functional_optim calls into
  create_functional_optim.

  The first will be used in future PRs as described in
  https://github.com/pytorch/pytorch/issues/67570 to create a functional
  optimizer from a traditional optimizer. The latter is used in
  _OptimizerHookState to create a functional optimizer.

  Both new helper functions are covered by unittests.
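A hypothetical sketch of how the two helpers fit together; the stub class and mapping below are placeholders, not the real functional optimizers:

```python
import torch

class _FunctionalSGDStub:
    """Placeholder for a functional optimizer class."""
    def __init__(self, params, lr=0.01, allow_empty_param_list=False):
        self.params, self.lr = list(params), lr

_functional_optim_map = {torch.optim.SGD: _FunctionalSGDStub}

def create_functional_optim(functional_optim_cls, *args, **kwargs):
    # Construct with an empty param list; parameters are fed in later
    # (e.g., per gradient bucket) by the communication hook.
    return functional_optim_cls([], *args, allow_empty_param_list=True, **kwargs)

def as_functional_optim(optim_cls, *args, **kwargs):
    # Map the torch.optim class to its functional counterpart, then build it.
    return create_functional_optim(_functional_optim_map[optim_cls], *args, **kwargs)

opt = as_functional_optim(torch.optim.SGD, lr=0.1)
```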
ghstack-source-id: 147577170

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33688995

fbshipit-source-id: 8b2daafd1b914efa90877cc4313aa9a428546fc1
(cherry picked from commit 42fdae2991)
2022-01-25 18:32:13 +00:00
Rohan Varma
281663955f [Opt Overlap] Create Optimizer Hook State directly from functional optim (#71602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71602

The design in https://github.com/pytorch/pytorch/issues/67570 requires
`_OptimizerHookState` to be created directly from a functional optimizer. Add
support and tests for this. Also refactor a few tests.
ghstack-source-id: 147577175

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33687477

fbshipit-source-id: f3c789aa77773f918e01a8d0cf08739b2edf07b3
(cherry picked from commit 4851e1c6d4)
2022-01-25 18:32:13 +00:00
Rohan Varma
9b3a56eecf [Optimizer Overlap] Move hooks to own file (#71601)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71601

Moves current prototype optimizer overlap to its own file for a better
namespace. No code changes besides a few comment fixes. Note that this code is
still prototype and not expected to be used by an end user.
ghstack-source-id: 147458826

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33662678

fbshipit-source-id: 3cc931323230a4b66c02b9e6f744aaf5c48d4d34
(cherry picked from commit 5070595c7f)
2022-01-23 00:04:32 +00:00
Rohan Varma
d8abe813bc [LocalSGD] Move feature to Beta, clean up some docs (#71621)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71621

Moves this feature to beta as discussed, and cleans up some docs.
Synced offline with wayi1 who mentioned that the current names are preferred
as he works to prototype hierarchical allreduce as discussed in this RFC: https://github.com/pytorch/pytorch/issues/71325.
ghstack-source-id: 147382940

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D33700444

fbshipit-source-id: 8eb543f5b02a119d0790a5c0919e6def6383a067
(cherry picked from commit 656e9809b2)
2022-01-21 21:10:42 +00:00
Omar Younis
569aeec1bc fix typo in debugging_hooks.py (#70956)
Summary:
I just fixed a small typo in the debugging_hooks documentation

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70956

Reviewed By: jbschlosser

Differential Revision: D33508898

Pulled By: dagitses

fbshipit-source-id: fc5935e5a2e2ddc45657a22d3b33a11aba378d9b
2022-01-10 12:59:42 -08:00
Yi Wang
ed50a35cf8 [Model Averaging] Update the documentation of PeriodicModelAverager (#70974)
Summary:
Here, 20 is a bad example, since the warmup step is set to 100; 200 iterations make much more sense.
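For context, a sketch of the corrected numbers in use (assumes an initialized process group; the model and loop are toy stand-ins):

```python
import torch
import torch.nn as nn
from torch.distributed.algorithms.model_averaging.averagers import PeriodicModelAverager

model = nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
averager = PeriodicModelAverager(period=4, warmup_steps=100)

for step in range(200):                       # 200 steps so averaging actually kicks in
    loss = model(torch.randn(2, 8)).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # After the 100 warmup steps, parameters are averaged every 4 steps.
    averager.average_parameters(model.parameters())
```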

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70974

Reviewed By: dagitses

Differential Revision: D33474576

Pulled By: rohan-varma

fbshipit-source-id: 4c7043108897848bde9503d77999971ad5567aa6
2022-01-07 13:20:42 -08:00
Rohan Varma
a197f3fe52 [FSDP/Checkpoint] Activation offload support in checkpoint_wrapper (#70165)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70165

Implements activation offload support in checkpoint_wrapper API via
save_on_cpu hooks. We avoid modifying the torch.utils.checkpoint implementation
and instead compose offload + checkpoint using the save_on_cpu hook for the
former.
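A conceptual sketch of the composition, using a stand-in wrapper rather than the actual checkpoint_wrapper code:

```python
import torch
import torch.nn as nn
from torch.autograd.graph import save_on_cpu
from torch.utils.checkpoint import checkpoint

class OffloadCheckpointWrapper(nn.Module):
    def __init__(self, module: nn.Module, offload_to_cpu: bool = True):
        super().__init__()
        self.module = module
        self.offload_to_cpu = offload_to_cpu

    def forward(self, *args):
        if self.offload_to_cpu:
            # Tensors saved for backward are parked in (pinned) CPU memory and
            # copied back during backward; checkpointing recomputes the rest.
            with save_on_cpu(pin_memory=True):
                return checkpoint(self.module, *args)
        return checkpoint(self.module, *args)
```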
ghstack-source-id: 146078900

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D33228820

fbshipit-source-id: 98b4da0828462c41c381689ee07360ad014e808a
2021-12-21 10:08:18 -08:00
Rohan Varma
79a40b22aa [Checkpoint] Make checkpoint_wrapper an nn.Module (#70164)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70164

Implement Alban's suggestion to make checkpoint_wrapper an nn.Module
instead of patching the forward pass, which is too hacky.
ghstack-source-id: 146011215

Test Plan: IC

Reviewed By: mrshenli

Differential Revision: D33214696

fbshipit-source-id: dc4b3e928d66fbde828ab60d90b314a8048ff7a2
2021-12-20 13:22:28 -08:00
Rohan Varma
c4281cc92d Prototype checkpoint_wrapper (#69955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69955

Implements a checkpoint_wrapper function, which wraps an nn.Module with checkpointing so users won't have to call checkpoint() every time they want to checkpoint the module.

Currently only support for reentrant-based checkpointing is added and only tested with FSDP to unblock a use case.

Future work is to add support for new checkpointing API, add more tests, upstream to torch.utils.checkpoint.
ghstack-source-id: 145811242

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D33107276

fbshipit-source-id: c4a1c68d71d65713a929994940a8750f73fbdbdb
2021-12-16 09:59:19 -08:00
Wanchao Liang
7c6a8a47db [BE] minor improvement to dist quantization (#67401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67401

some minor changes to dist quantization, mainly change the namespace and add some notes for future code dedup
ghstack-source-id: 143910067
ghstack-source-id: 143910067

Test Plan: wait for ci

Reviewed By: mrshenli

Differential Revision: D31979269

fbshipit-source-id: 85a2f395e6a3487dd0b9d1fde886eccab106e289
2021-11-21 23:31:59 -08:00
Michael Suo
f50bf16c04 Revert D31663043: [BE] minor improvement to dist quantization
Test Plan: revert-hammer

Differential Revision:
D31663043

Original commit changeset: 2f96b7346e9c

fbshipit-source-id: d38684dfe79ca335fbbe624496ad4c86c29d1570
2021-10-22 16:37:41 -07:00
Wanchao Liang
7379d4db20 [BE] minor improvement to dist quantization (#66649)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66649

some minor changes to dist quantization, mainly change the namespace and add some notes for future code dedup
ghstack-source-id: 141336191

Test Plan: wait for ci

Reviewed By: cbalioglu

Differential Revision: D31663043

fbshipit-source-id: 2f96b7346e9c90df5ab2536767f8301eb86a9c79
2021-10-22 13:46:28 -07:00
Yi Wang
c1415a0a72 [Reland] [Model Averaging] Simplify PostLocalSGD Optimizer API (#65197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65197

1. The constructor accepts a local optimizer instance instead of the local optimizer's class type and constructor inputs.
2. The parameters are read from the local optimizer's param_groups instead of a separate input. (A usage sketch follows below.)
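Roughly, the simplified construction looks like this (assumes an initialized process group; in practice the module would be DDP-wrapped, and the argument names here follow the current public API, so treat this as a sketch):

```python
import torch
import torch.nn as nn
from torch.distributed.optim import PostLocalSGDOptimizer
from torch.distributed.algorithms.model_averaging.averagers import PeriodicModelAverager

model = nn.Linear(8, 8)                       # in practice, a DDP-wrapped module
local_optim = torch.optim.SGD(model.parameters(), lr=0.1)

opt = PostLocalSGDOptimizer(
    optim=local_optim,                        # a constructed instance, not class + args
    averager=PeriodicModelAverager(period=4, warmup_steps=100),
)
```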

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 138307226

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity

Reviewed By: rohan-varma

Differential Revision: D31007439

fbshipit-source-id: bbb0526e6763ef76775b85088571506b3942c722
2021-09-17 10:31:58 -07:00
Yi Wang
00e6e0c593 [Model Averaging] Revert #63895 (#64903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64903

Fix the accuracy regression caused by https://github.com/pytorch/pytorch/pull/63895.

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity

Reviewed By: rohan-varma

Differential Revision: D30894688

fbshipit-source-id: fe00b8b23b860d9f806f87c1b6caba1d0b807485
2021-09-14 09:45:42 -07:00
Yi Wang
bf9d66586c [DDP Comm Hook] Create a noop hook for performance debugging (#64344)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64344

As title.

Additionally, avoid using numpy array in test_ddp_hooks.py.
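For reference, a no-op hook in the DDP comm-hook signature looks roughly like this; it skips the all-reduce entirely, so any speedup versus the default hook approximates the communication overhead:

```python
import torch
from torch.futures import Future

def noop_hook(_, bucket) -> "Future[torch.Tensor]":
    # `bucket` is the GradBucket passed by DDP; hand its local gradients
    # straight back without any communication.
    fut: "Future[torch.Tensor]" = Future()
    fut.set_result(bucket.buffer())
    return fut

# Usage: ddp_model.register_comm_hook(state=None, hook=noop_hook)
```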
ghstack-source-id: 137170449

Test Plan: buck test mode/dev-nosan caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks -- test_ddp_comm_hook_noop_hook

Reviewed By: rohan-varma

Differential Revision: D30693220

fbshipit-source-id: e17f0d1c6198863cf20a53566f586a6bff602522
2021-09-01 17:36:22 -07:00
Marjan Fariborz
6a76ee04de Adding alltoall_single collective to collective quantization API (#63154)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63154

The collective quantization API now supports alltoall, alltoall_single, and allscatter. The test is also included.
ghstack-source-id: 136856877

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/algorithms/quantization:DistQuantizationTests_nccl -- test_all_to_all_single_bfp16

Reviewed By: wanchaol

Differential Revision: D30255251

fbshipit-source-id: 856f4fa12de104689a03a0c8dc9e3ecfd41cad29
2021-08-27 12:46:31 -07:00
Marjan Fariborz
3b284ab024 Adding BFP16 quantization/dequantization support to OSS (#63059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63059

Adds support for the BFP16 quantization method to OSS. Currently only CPU is supported.
ghstack-source-id: 136639528

Test Plan: Imported from OSS

Reviewed By: wanchaol

Differential Revision: D30194538

fbshipit-source-id: ac248567ad8028457c2a91b77ef2ce81709fce53
2021-08-25 23:41:34 -07:00
Yi Wang
7edeead796 Add a comment on the potential implicit type up-casting (#63905)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63905

as title
ghstack-source-id: 136590703

Test Plan: N/A

Reviewed By: mrshenli

Differential Revision: D30527929

fbshipit-source-id: 69402bbfa87cfd8fc166ce313cde9736ee072589
2021-08-25 12:47:45 -07:00
Aayush Prakash
8a22d4fa5c [Reland] Replacing the p.data acccess in utils with tensor.set_ . Passes both test_post_localSGD_optimizer_pari and test_periodic_model_averager tests (#63895)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63895

When updating the model parameter, updating `parameter.data` is no longer recommended, because this `data` field will be deprecated in the future.

The replacement is `tensor.set_`.
ghstack-source-id: 136593433

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity

Reviewed By: SciPioneer

Differential Revision: D30526178

fbshipit-source-id: a1ac0ec3665d8623edd5bf94f01c1132daff5c00
2021-08-25 11:12:55 -07:00
Edward Yang
699c764d2e Revert D30513613: Removing tensor.data usage in utils with tensor set_ method
Test Plan: revert-hammer

Differential Revision:
D30513613 (d08a36f831)

Original commit changeset: 402efb9c30fa

fbshipit-source-id: 911c66a9852de77dc5274b5fb373258c0c97739a
2021-08-24 12:20:37 -07:00
Aayush Prakash
d08a36f831 Removing tensor.data usage in utils with tensor set_ method (#63867)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63867

When updating the model parameter, updating `parameter.data` is no longer recommended, because this `data` field will be deprecated in the future.

The replacement is `tensor.set_`.

ghstack-source-id: 136531233

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager

Reviewed By: SciPioneer

Differential Revision: D30513613

fbshipit-source-id: 402efb9c30fafc3f285bebc631639f656ceae585
2021-08-24 11:20:44 -07:00
Marjan Fariborz
c545b099aa Separating quantization test from distributed_test (#63058)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63058

Dedicating separate tests for different quantization methods. Currently supporting FP16 method.
ghstack-source-id: 136499767

Test Plan: buck test mode/dev //caffe2/test/distributed/algorithms/quantization:quantization_gloo_fork -- name_of_the_test

Reviewed By: wanchaol

Differential Revision: D30142580

fbshipit-source-id: 3aacec1a231a662067d2b48c001f0c69fefcdd60
2021-08-24 01:44:55 -07:00
Yinbin Ma
0d437fe6d0 BF16 allreduce hook (#63260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63260

Add BF16 all-reduce communication hook. Skip if CUDA version < 11 or NCCL version < 2.9.7.

Reviewed By: SciPioneer

Differential Revision: D30238317

fbshipit-source-id: bad35bf7d43f10f1c40997a282b831b61ef592bb
2021-08-18 20:53:49 -07:00
Yi Wang
979180cd01 [Model Averaging] Allow subgroup to be None in PostLocalSGDState (#63277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63277

`PostLocalSGDState` requires a subgroup. To initialize this subgroup, a global process group must be initialized. However, this imposes a restriction that a hook state can only be provided after distributed environment initialization, which is not compatible with lightning DDP plugin setup where hook state should be provided before distributed environment initialization.

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 135848575

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_hook_parity_post_localSGD

Reviewed By: cbalioglu

Differential Revision: D30325041

fbshipit-source-id: 7b870166d096d306c3f2f7c69816a705cec0bebd
2021-08-16 10:07:41 -07:00
Andrew Gu
2d75703c6a Remove req to call step() in training loop (#63164)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63164

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30284616

Pulled By: andwgu

fbshipit-source-id: afdb677fb08851b139178a9f6d782196f26773e1
2021-08-13 08:22:44 -07:00
Andrew Gu
bd81c9178a Simplify data structures, add uniform approximation, fix mem leak (#63162)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63162

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30284617

Pulled By: andwgu

fbshipit-source-id: 9bd9e5f89abcc0d3dac56b85d55cc88e843baa9f
2021-08-13 08:20:59 -07:00
Andrew Gu
1b1f1e36b4 Add `allow_empty_param_list` to functional optimizers (#62522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62522

Addresses https://github.com/pytorch/pytorch/issues/62481

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D30072074

Pulled By: andwgu

fbshipit-source-id: 1a5da21f9636b8d74a6b00c0f029427f0edff0e3
2021-08-09 11:18:56 -07:00
Marjan Fariborz
c7db642a72 Adding collective quantization API (#62142)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62142

Created a wrapper that takes the collective op and a quantization type as arguments. It quantizes the input, performs the collective op, and then performs dequantization.
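Conceptually the wrapper does something like the following (illustrative, with an fp16 cast standing in for the supported quantization types):

```python
import torch
import torch.distributed as dist

def quantized_all_to_all_single(output: torch.Tensor, input: torch.Tensor, group=None):
    # 1) Quantize the input tensor.
    q_input = input.to(torch.float16)
    q_output = torch.empty_like(output, dtype=torch.float16)
    # 2) Run the underlying collective on the quantized tensors.
    dist.all_to_all_single(q_output, q_input, group=group)
    # 3) Dequantize into the caller-provided output buffer.
    output.copy_(q_output)
```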

Test Plan:
Tested through distributed_gloo_fork.
e.g., buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_all_to_all_quantized

Reviewed By: wanchaol

Differential Revision: D29682812

fbshipit-source-id: 79c39105ff11270008caa9f566361452fe82a92e
2021-08-09 08:11:22 -07:00
Sean Lawlor
34c9f5a8da [DDP Communication Hook] Update get_tensor and set_tensor to be cleaner naming conventions (buffer() and set_buffer()) (#62662)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662

Replaced the methods set_tensor(.) and get_tensor() in the Python-exposed API from the C++ logic with buffer() and set_buffer(.) for a cleaner interface.

Reviewed By: SciPioneer

Differential Revision: D30012869

fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482
2021-08-04 09:27:31 -07:00
Andrew Gu
62a90c227f Make _Join, _Joinable, _JoinHook public (#62605)
Summary:
**Overview:**
This removes the preceding `_` from `_Join`, `_Joinable`, and `_JoinHook` in preparation for adding the generic join context manager tutorial (see [here](https://github.com/pytorch/tutorials/pull/1610)). This also adds a docs page, which can be linked from the tutorial. [Here](https://github.com/pytorch/pytorch/files/6919475/render.pdf) is a render of the docs page.
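For orientation, the now-public context manager is used like this (assumes an initialized process group and a DDP-wrapped `model`; per-rank loaders may have different lengths):

```python
import torch
from torch.distributed.algorithms.join import Join

def train(model, optimizer, loader):
    with Join([model]):                  # model is a DistributedDataParallel instance
        for batch in loader:             # ranks may run different numbers of iterations
            loss = model(batch).sum()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```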

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62605

Test Plan:
`DistributedDataParallel.join()`:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```

`ZeroRedundancyOptimizer`:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
NOTE: DDP overlap tests are failing due to a landing race. See https://github.com/pytorch/pytorch/pull/62592. Once the fix is landed, I will rebase, and tests should be passing.

`Join`:
```
gpurun4 python test/distributed/algorithms/test_join.py
```

Reviewed By: mrshenli

Differential Revision: D30055544

Pulled By: andwgu

fbshipit-source-id: a5ce1f1d9f1904de3bdd4edd0b31b0a612d87026
2021-08-03 12:20:11 -07:00
Andrew Gu
43327cc197 Refactor commonalities between two approaches (#62624)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62624

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30058543

Pulled By: andwgu

fbshipit-source-id: 73c794062b75e011868fae264f592549eed67482
2021-08-03 08:43:14 -07:00
Andrew Gu
e6a3967c2a Add invariant check (bucket indices: 0, 1, ..., k-1) (#62623)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62623

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30058544

Pulled By: andwgu

fbshipit-source-id: a56910f294c6a40118751eebe255b62700f42be9
2021-08-03 08:13:52 -07:00
Yi Wang
db071ef005 [Reland][DDP Communication Hook] Rename 4 Methods of GradBucket Class (#62592)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62592

Reland #62510

`GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity:
1) get_index -> index
2) is_the_last_bucket_to_allreduce -> is_last,
3) get_per_parameter_tensors -> gradients,
4) get_model_params_for_bucket -> parameters.
ghstack-source-id: 134848352

Test Plan: unit test

Reviewed By: andwgu

Differential Revision: D30049431

fbshipit-source-id: 1bcac331aa30e529b7230e3891bc811c531b0ea9
2021-08-02 16:38:09 -07:00
Yi Wang
2ec4f69b48 [DDP Comm Hook] Do not expose hook_then_optimizer as a public method (#62532)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62532

This method is not stable at this time, so avoid releasing it when DDP communication hook feature is released as a stable feature.
ghstack-source-id: 134787831

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_hook_with_optimizer_parity
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_hook_then_optimizer_nccl

Reviewed By: rohan-varma

Differential Revision: D30031222

fbshipit-source-id: e03a8e13fee5116a5ddd724eb76316ee98f2a676
2021-08-02 12:25:01 -07:00
Eli Uriegas
6f95850127 Revert D30024161: [DDP Communication Hook] Rename 4 Methods of GradBucket Class
Test Plan: revert-hammer

Differential Revision:
D30024161 (29c8b1db57)

Original commit changeset: 07e6072a2f7b

fbshipit-source-id: d571c2caadaf7b71fe2aba3c0597bd8074d153de
2021-08-02 10:26:54 -07:00
Qing Hu
29c8b1db57 [DDP Communication Hook] Rename 4 Methods of GradBucket Class (#62510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62510

`GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity:
1) get_index -> index
2) is_the_last_bucket_to_allreduce -> is_last,
3) get_per_parameter_tensors -> gradients,
4) get_model_params_for_bucket -> parameters.

Test Plan:
Ran the comprehensive test locally with the following results:
https://pxl.cl/1Ml8b
The two timeout test-case failures are most likely environment-related and fail on my devserver.

Reviewed By: SciPioneer

Differential Revision: D30024161

fbshipit-source-id: 07e6072a2f7b81f731425d9b71f8c8b60d383b0f
2021-08-02 09:33:32 -07:00
Andrew Gu
51f687fd4b Add overlap with DDP to ZeRO (two approaches) (#62157)
Summary:
**Overview:**
This adds two approaches to overlapping `DistributedDataParallel.backward()` with `ZeroRedundancyOptimizer.step()` by providing two hook constructors: `hook_with_zero_step()` and `hook_with_zero_step_interleaved()`. The former waits for all backward computation to finish before starting optimizer computation, while the latter launches a partial optimizer computation using the contents of a gradient bucket once that bucket's all-reduce completes. The two approaches each suffer from their own weaknesses, and which one to use depends on the specific hardware configuration.

Both approaches can share changes to `ZeroRedundancyOptimizer`. A user should pass `overlap_with_ddp=True` to `ZeroRedundancyOptimizer`, construct a DDP communication hook using either `hook_with_zero_step()` or `hook_with_zero_step_interleaved()`, and register that communication hook. `ZeroRedundancyOptimizer.step()` should still be called in the training loop, though the optimizer computation and communication will be offloaded to originate from the communication hook. Currently, the first two iterations are vacuous, meaning they do not result in parameter updates and the inputs are ignored. This is required to finalize the DDP bucket strategy and to then initialize the `ZeroRedundancyOptimizer`'s local optimizer based on that bucketing.
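A rough wiring sketch following that description; the module paths match the current API, but treat this as illustrative, and `ddp_model` is assumed to be an existing DistributedDataParallel instance:

```python
import torch
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.distributed.algorithms.ddp_comm_hooks.default_hooks import allreduce_hook
from torch.distributed.algorithms.ddp_comm_hooks.ddp_zero_hook import hook_with_zero_step

def attach_overlapped_zero(ddp_model):
    zero = ZeroRedundancyOptimizer(
        ddp_model.parameters(),
        optimizer_class=torch.optim.Adam,
        overlap_with_ddp=True,
        lr=1e-3,
    )
    # Optimizer work is launched from the comm hook once gradients are reduced.
    ddp_model.register_comm_hook(None, hook_with_zero_step(allreduce_hook, ddp_model, zero))
    return zero  # zero.step() is still called each iteration; the first two are vacuous
```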

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62157

Test Plan:
The existing `ZeroRedundancyOptimizer` tests pass, and new unit tests for both hooks pass:
- ~~`test_ddp_with_zero_step_parity_cpu`~~ (removed for now due to flakiness in CI -- under investigation, could possibly be similar Gloo issue as with `hook_with_zero_step_interleaved()`)
- `test_ddp_with_zero_step_parity_gpu`
- `test_ddp_with_zero_step_interleaved_parity_gpu`

These were tested on the AI AWS cluster.

An analogous `test_ddp_with_zero_step_interleaved_parity_cpu` is missing due to existing bugs with Gloo. See https://github.com/pytorch/pytorch/pull/62302.

Both approaches have been verified using an internal accuracy benchmark.

Reviewed By: mrshenli

Differential Revision: D29971046

Pulled By: andwgu

fbshipit-source-id: a7234c23c7ea253f144a698fd7e3c0fe039de5e8
2021-08-02 08:33:34 -07:00
Yi Wang
32b37ba246 [DDP Communication Hook] Update the typing info of comm hook output as well as some docstring (#62457)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62457

Specify `Future[torch.Tensor]` as the DDP communication hook return type, which should explicitly be a single tensor. The previous API took a list containing a single tensor.

Note that now the typing info no longer accepts the internal type of `torch._C.Future`, which does not support torchscript and hence cannot support `Future[torch.Tensor]`.
ghstack-source-id: 134771419

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_invalid_comm_hook_return_type

Reviewed By: rohan-varma

Differential Revision: D30007390

fbshipit-source-id: 246667c9b575b4c6e617b0a5b373151f1bd81e7f
2021-07-30 20:51:34 -07:00
Yi Wang
acba9b3104 [DDP Communication Hook] Simplify the implementation of parseHookResult of PythonCommHook (#62389)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62389

Simplify the implementation of `parseHookResult` of `PythonCommHook`, by not directly accepting the output of allreduce, which is a tensor list.

Address the comment on https://github.com/pytorch/pytorch/pull/62074#discussion_r675303280

Additionally, formatter is also applied to `OptimizerHookState` and `hook_then_optimizer`.
ghstack-source-id: 134626246

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork

Reviewed By: rohan-varma

Differential Revision: D29982485

fbshipit-source-id: 5b27cc5ef09d2f87c1ade4c0feef7eacc1af3a9a
2021-07-29 17:27:35 -07:00
Yi Wang
9fee176be3 [Model Averaging] Fix docstring of PeriodicModelAverager (#62392)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62392

The constructor of `PeriodicModelAverager` does not need to accept parameters.
ghstack-source-id: 134626245

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork --  test_periodic_model_averager

Reviewed By: rohan-varma

Differential Revision: D29986446

fbshipit-source-id: 6a8b709e4383a3c44b9e60955fbb067cd2868e76
2021-07-29 17:26:27 -07:00
Yi Wang
2eaf71d749 [Model Averaging] Update model averager API to avoid the redundant params arg needed by post-localSGD optimizer (#62132)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62132

as title

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 134560541

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_post_localSGD_optimizer_parity

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager

Reviewed By: rohan-varma

Differential Revision: D29887751

fbshipit-source-id: 60dadb04790d800fdcc7cb8a08d060e411718739
2021-07-28 18:43:09 -07:00
Yi Wang
2581dfc249 [Model Averaging] Create a base class for model averaging (#62111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62111

This base class will be passed to the post-localSGD optimizer in the next PR. This way, the same post-localSGD optimizer can choose different model averaging algorithms.

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 134489187

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager

Reviewed By: rohan-varma

Differential Revision: D29884954

fbshipit-source-id: 1dc5e35c58895902991567f633afd621c7108938
2021-07-28 10:15:36 -07:00
Rohan Varma
64283fe146 [DDP/Functional Optim] Support kwarg arguments (#62079)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62079

Adds support for kwarg arguments in the functional optimizer running as a hook.
ghstack-source-id: 134330379

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29838127

fbshipit-source-id: 2ab051ef5f0dff19c145ebe2260668b927ba47b2
2021-07-26 22:12:50 -07:00
Rohan Varma
6dc2c07304 [Reland] [DDP] Implement a hook which performs FunctionalSGD step. (#62177)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62177

Reland of https://github.com/pytorch/pytorch/pull/61678
Fix CI failure by gating including torchvision model on whether torchvision is available or not.
ghstack-source-id: 134282165

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29904101

fbshipit-source-id: 47e799eb4a90acbbda91c5857ea00de3045d49f5
2021-07-26 11:56:56 -07:00
Rohan Varma
2299d6a013 Revert D29701447: [DDP] Implement a hook which performs FunctionalSGD step.
Test Plan: revert-hammer

Differential Revision:
D29701447 (bd95cf4473)

Original commit changeset: 183954593b82

fbshipit-source-id: 714e6a2b698147db9533a67783aed2a65d9d5bfe
2021-07-25 22:23:30 -07:00
Rohan Varma
bd95cf4473 [DDP] Implement a hook which performs FunctionalSGD step. (#61678)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61678

This diff makes the following changes:
- Add a `step_param` method to the `_FunctionalSGD` class, written similarly to `step` but for a single param
- Implement a communication hook wrapper that runs a given comm. hook and then applies the functional SGD step
- Verify that this is equal to a regular allreduce + SGD optimizer
ghstack-source-id: 133567598
ghstack-source-id: 134263399

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29701447

fbshipit-source-id: 183954593b82a092414623292f9b10e675fef96e
2021-07-25 13:36:47 -07:00
Yi Wang
e856a45283 [Model Averaging] Refactor averagers to accept parameters instead of a module (#62105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62105

This is for the preparation of wrapping the averager as an optimizer, which can only accept parameters rather than a module.

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 134213572

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_average_parameters

Reviewed By: rohan-varma

Differential Revision: D29883693

fbshipit-source-id: 474ba924a0b05068b12f163fb74582bccf314964
2021-07-23 18:39:45 -07:00
Yi Wang
b03b45afd9 [DDP Comm Hook] Use a single tensor instead of a tensor list as the comm hook result (#62074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62074

Since SPMD mode is retired, the comm hook result will always be a single tensor.

This can improve the comm hook developer experience, since there is no longer a need to add an extra `[0]` to the result of the preceding future.

Closes: https://github.com/pytorch/pytorch/issues/61914
ghstack-source-id: 134164593
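
Illustrative sketch of the difference for a hook wrapper that post-processes another hook's future (the wrapper itself is hypothetical; only the `fut.value()` change is the point here):
```
def passthrough_wrapper(hook):
    def wrapped(state, bucket):
        fut = hook(state, bucket)

        def postprocess(fut):
            # Before this change the precursor hook's result was a one-element tensor list:
            #     tensor = fut.value()[0]
            tensor = fut.value()  # now always a single tensor
            return tensor

        return fut.then(postprocess)

    return wrapped
```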

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork

Reviewed By: rohan-varma

Differential Revision: D29864732

fbshipit-source-id: 59fe6dd78b66214b1788514ad4d236039d9bda31
2021-07-23 03:32:05 -07:00
Yi Wang
53222c59f0 Reformat (#62073)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62073

as title
ghstack-source-id: 134159445

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D29869185

fbshipit-source-id: 17a32d56860e9469bd26c4eb4ca2d483827d946e
2021-07-22 23:36:22 -07:00
Andrew Gu
3e3acf8a9a Minor documentation fixes (#61785)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61785

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29746648

Pulled By: andwgu

fbshipit-source-id: 435bbd8894f2ae5c814b9acd562673affea1daf6
2021-07-19 09:01:29 -07:00
Andrew Gu
57feb35474 Refactor non-joined process computation (#61555)
Summary:
**Overview:**
This refactors the computation on non-joined processes relating to the join context manager. The concept was inspired by a comment from pritamdamania.

**Changes:**
This introduces a `_Joinable` abstract base class, which requires a `_join_hook()` method and `_join_device()` and `_join_process_group()` property methods. Any class that we want to be compatible with the generic join context manager should inherit from `_Joinable` and implement `_join_hook()`, `_join_device()`, and `_join_process_group()`. (The `device` and `process_group` information has been moved from `_JoinHook` to `_Joinable`.)

The generic join context manager now takes in a `List[_Joinable]` instead of `List[_JoinHook]`. The motivation for this is that previously, by passing the `_JoinHook`s into the context manager, the class providing a `_JoinHook` can modify the context manager's behavior, but the context manager cannot modify the class's behavior. This is solved by giving the context manager a reference to the class's instance.

This implementation reserves the field `_join_config` in every `_Joinable` to store a `_JoinConfig` instance, which holds all dynamic fields needed from the `_Joinable` for the join context manager: `enable`, `throw_on_early_termination`, and `is_first_joinable`. ("dynamic" here means that for a given `_Joinable` instance, the values for those fields may change across different join context usages.) In particular, these fields are needed to implement a method `notify_join_context()`, which encapsulates the computation performed on non-joined processes relating to the join context manager --- (1) the all-reduce to indicate that the process has not yet joined and (2) the all-reduce to check whether to throw an exception if `throw_on_early_termination=True`. The idea is that every `_Joinable` class only needs to make a call to `notify_join_context()` before its per-iteration collective communications; it is a simple one-line addition.

Only the first `_Joinable` instance passed into the context manager actually performs the collective communications in `notify_join_context()`. In that case, the method returns an async work handle for the initial all-reduce indicating that the process not yet joined. Otherwise, the method returns `None`. This conditional logic is handled internally without additional input from the user.

**New API:**
Now, the example usage would look like:
```
ddp_model = DistributedDataParallel(...)
zero_optim = ZeroRedundancyOptimizer(ddp_model.parameters(), ...)
with _Join([ddp_model, zero_optim]):
    ...
```
Any arguments meant for a join hook (e.g. `divide_by_initial_world_size`) must be specified as keyword arguments. For example:
```
with _Join([ddp_model, zero_optim], divide_by_initial_world_size=False):
    ...
```
They will be forwarded to every `_join_hook()` function via `**kwargs`. This creates a clear separation between the variables needed by the context manager (`enable` and `throw_on_early_termination`) and those needed by the `_Joinable` class (e.g. `divide_by_initial_world_size`).

**Recap:**
After this change, the relevant information to use the generic join context manager looks like the following (omitting prefix `_` from names; a minimal code sketch follows the list):
- Suppose we have a class `C` (e.g. `DistributedDataParallel`) that we want to be able to use the `Join` context.
- We make `C` inherit from `Joinable` and implement `join_hook() -> JoinHook`, `join_device()`, and `join_process_group()`.
- To implement `join_hook()`, we define a `CJoinHook` class inheriting from `JoinHook` and implement `main_hook()` and `post_hook()` as needed.
- We locate a place before `C`'s per-iteration collective communications and add a call to `Join.notify_join_context()`.
- We call `Joinable.__init__(self)` in `C`'s constructor.
- The `C.join_config` field will be used internally by the context manager. This does not affect `C`'s serializability.
- Run time arguments for `C`'s join hook can be passed in as keyword arguments to the context manager: `with Join([C()], arg1=..., arg2=...):`.
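
A minimal sketch following the recipe above (using the underscore-free names from the recap and the later-published import path; the toy `Counter` class and its all-reduce are illustrative, not part of this PR):
```
import torch
import torch.distributed as dist
from torch.distributed.algorithms.join import Join, Joinable, JoinHook


class CounterJoinHook(JoinHook):
    def __init__(self, counter):
        self.counter = counter

    def main_hook(self):
        # Shadow the collective that non-joined processes run each iteration.
        t = torch.zeros(1, device=self.counter.device)
        dist.all_reduce(t, group=self.counter.process_group)

    def post_hook(self, is_last_joiner: bool):
        pass  # e.g. broadcast authoritative state from the last joiner


class Counter(Joinable):
    def __init__(self, device, process_group):
        super().__init__()  # i.e. Joinable.__init__(self)
        self.device = device
        self.process_group = process_group
        self.total = torch.zeros(1, device=device)

    def __call__(self, batch_size):
        Join.notify_join_context(self)  # before the per-iteration collective
        t = torch.ones(1, device=self.device) * batch_size
        dist.all_reduce(t, group=self.process_group)
        self.total += t

    def join_hook(self, **kwargs) -> JoinHook:
        return CounterJoinHook(self)

    @property
    def join_device(self) -> torch.device:
        return self.device

    @property
    def join_process_group(self):
        return self.process_group
```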

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61555

Test Plan:
I ran the existing DDP join tests:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```
I ran the ZeRO join tests:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py TestZeroRedundancyOptimizerDistributed.test_zero_join_gpu TestZeroRedundancyOptimizerDistributed.test_zero_join_cpu
```

Reviewed By: zou3519

Differential Revision: D29690359

Pulled By: andwgu

fbshipit-source-id: 2950f78de755eb5fb13b95b803dd7c705879a9c7
2021-07-14 08:20:40 -07:00
Yi Wang
df00c636d2 [Model Averaging] Skip model averaging for the first K steps (#61207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61207

The model averager now must be combined with the post-localSGD DDP communication hook. It skips model averaging for the first K steps, because the post-localSGD communication hook runs global gradient averaging during this phase.

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 133371335
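
A rough usage sketch of the intended combination (module paths and signatures follow the later-published public API and may differ from this diff; `ddp_model`, `optimizer`, `loss_fn`, and `loader` are assumed to exist):
```
import torch.distributed.algorithms.model_averaging.averagers as averagers
from torch.distributed.algorithms.ddp_comm_hooks.post_localSGD_hook import (
    PostLocalSGDState,
    post_localSGD_hook,
)

state = PostLocalSGDState(process_group=None, subgroup=None, start_localSGD_iter=100)
ddp_model.register_comm_hook(state, post_localSGD_hook)
# Use the same K for the averager's warm-up so the two phases line up.
averager = averagers.PeriodicModelAverager(period=4, warmup_steps=100)

for inputs, labels in loader:
    optimizer.zero_grad()
    loss_fn(ddp_model(inputs), labels).backward()
    optimizer.step()
    # No-op for the first `warmup_steps` iterations, since gradients are still
    # globally all-reduced by the hook during that phase.
    averager.average_parameters(ddp_model.parameters())
```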

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager

Reviewed By: pritamdamania87

Differential Revision: D29523738

fbshipit-source-id: 3fa9611046e1c0afa4bda78aa3ba200fa2a5fa4b
2021-07-10 17:12:16 -07:00
Yi Wang
0f6876d721 [Model Averaging] Create a post-localSGD communication hook (#61206)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61206

Create a communication hook to run post-local SGD. This will be combined with model averager component to better support local SGD.

In contrast to the previous approach, which ran local gradient averaging + global model averaging at each step for the first K steps, the plan now is to run only global gradient averaging at each step during the first K steps, just like normal DDP. This gives us two advantages:
1) For some optimizers, model averaging can cause a discrepancy in optimizer states. If we still do global gradient averaging for the first K steps, we can defer such a discrepancy until we actually start local SGD.
2) Gradient averaging during the first K steps runs only one allreduce that overlaps with the backward pass, so it should also be more efficient.

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 133371322

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_hook_parity_post_localSGD

Reviewed By: pritamdamania87

Differential Revision: D29523292

fbshipit-source-id: 3f215f7150f2917c2781278fad759530c685ea2c
2021-07-10 17:11:10 -07:00
Andrew Gu
179249084b Refactor DDP join() API, adding hooks (#60757)
Summary:
Targets https://github.com/pytorch/pytorch/issues/54318.

**Overview:**
DDP offers a `join()` context manager to accommodate training on uneven inputs. This creates a new generic `_Join()` API permitting custom hooks, refactors DDP `join()` to call this generic `_Join()`, and implements a hook for ZeRO. (For now, the generic `_Join()` is implemented as private, but this may change after design discussions are cleared.)

There are two classes introduced: `_JoinHook`, the class defining the customizable join hook, and `_Join`, the generic join context manager.

The `_JoinHook` provides two entry points: `main_hook()`, which is called repeatedly while there exists a non-joined process, and `post_hook()`, which is called once all process have joined with the additional `bool` argument `is_last_joiner`. The class also requires `process_group` and `device` information by defining corresponding abstract property methods. Thus, to implement a join hook, (1) inherit from `_JoinHook`, (2) override `main_hook()` and `post_hook()` as appropriate, and (3) override `process_group()` and `device()` to provide process group and device information to be used by the join context manager implementation for collective communications.

The `_Join` constructor requires `join_hooks: List[_JoinHook]` and optionally `enable: bool = True` and `throw_on_early_termination: bool = False`. A training loop only needs to be wrapped with `with _Join(join_hooks):` (using the appropriate `join_hooks`) to be able to train on uneven inputs without hanging/erroring. The context manager requires a `dist.all_reduce(torch.ones(1))` to be called on every non-joined process each time before it performs its collective communications in order to indicate that the process has not yet joined. It also requires that all `process_group` attributes in the `_JoinHook` objects are the same.

**Notes:**
- The argument `is_last_joiner` to `post_hook()` may be useful for finding an authoritative rank when synchronizing.
- `enable` is a flag that can be set to `False` if the user knows the current training loop will not have uneven inputs. This may be used to disable join-related computation in  the classes providing join hooks.
- `throw_on_early_termination` is a flag that can be set to `True` to notify processes to terminate upon detecting uneven inputs (i.e. upon the first process joining when there exists a non-joined process). Notably, the notification requires an all-reduce, so to prevent hanging/erroring, non-joined processes must participate in the all-reduce. The first-joining process raises a `RuntimeError`, and the other processes are expected (but not required) to do the same. This may be used to implement training on uneven inputs in cases that do not conform to the generic join context manager (e.g. `SyncBatchNorm`).
- Classes providing a join hook should do so via a `_join_hook()` method that returns a `_JoinHook` instance with the methods appropriately overridden.
- If there are multiple join hooks, the device specified by the first is used by the join context manager implementation to perform its collective communications.
- If there are multiple join hooks, both the main and post-hooks are iterated in the order in which the `_JoinHook` objects are passed into the context manager constructor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60757

Test Plan:
The current implementation preserves backward compatibility by not changing the existing DDP `join()` API at all. To check this, I ran through the uneven input tests (`test_ddp_grad_div_uneven_inputs`, `test_ddp_uneven_inputs_stop_iteration_sync_bn`, `test_ddp_uneven_inputs`, `test_ddp_uneven_input_join_disable`, `test_ddp_uneven_input_exception`) on the AI AWS cluster:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py --
```

Because the existing DDP join logic does not provide correct gradients to the joined processes if `gradient_as_bucket_view=False` and a joined process requires those gradients to correctly update its shard of the parameters in `ZeroRedundancyOptimizer.step()`, DDP and ZeRO are not fully compatible at the moment. To work around this and to test ZeRO's join hook separately, I added a test `_test_zero_join()` (with `test_zero_join_gpu()` and `test_zero_join_cpu()` flavors), which compares DDP with a local optimizer on uneven inputs against ZeRO on uneven inputs with the gradients set manually.

Reviewed By: iramazanli, mrshenli

Differential Revision: D29624636

Pulled By: andwgu

fbshipit-source-id: ec70a290e02518b0d8b683f9fed2126705b896c7
2021-07-09 08:29:20 -07:00
Yi Wang
5b6818f08a [Model Averaging] Enforce a synchronization before allreduce parameters (#60891)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60891

This fix is particularly useful for local SGD when the averaging period is very small, which may cause a conflict between the gradient allreduce within the per-machine subgroup and the global parameter allreduce across the communication world.
ghstack-source-id: 132564252

Test Plan:
f281873295 (#Try1) failed due to the conflict between global process group and subgroup.
```
<Thread(configerator-monitor-singleton, started 139839806633728)>
  File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/tmp/jetter.gson7tr3/configerator/client.py", line 348, in _monitor_loop
    self._parent_thread.join(self._interval_ms / 1000)
  File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 1015, in join
    self._wait_for_tstate_lock(timeout=max(timeout, 0))
  File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
```

Fixed after adding an explicit sync: f282044866, f282241800

Reviewed By: rohan-varma

Differential Revision: D29434597

fbshipit-source-id: a4f777fc26f379639f85fda32de425cd3b337b33
2021-06-29 01:39:40 -07:00
Yi Wang
f262217101 [Model Averaging] Move step out of model averaging API (#60632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60632

Address the comment https://github.com/pytorch/pytorch/pull/60320#discussion_r654845062
ghstack-source-id: 132340278

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager

Reviewed By: rohan-varma

Differential Revision: D29355609

fbshipit-source-id: 50a6f13ed70b5a5b5b92ead2f3d7082c11277af5
2021-06-25 17:20:52 -07:00
Yi Wang
80f40b172f [Model Averaging] Periodic model averager (#60320)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60320

This averager can be used for post-local SGD.
ghstack-source-id: 131908011

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager

Reviewed By: rohan-varma

Differential Revision: D29249850

fbshipit-source-id: 09675d6bb1edfb8ffbeb94510d91962532d8ca3e
2021-06-23 20:23:04 -07:00
Yi Wang
aeea5bf4a1 [Model Averaging] Provide a util function for model averaging (#60303)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60303

The util function can be used for averaging parameters.

More optimizations can be done in the future.
ghstack-source-id: 132214212
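
A hypothetical sketch of the concept (not the actual util, which likely performs the communication more efficiently): allreduce each parameter and divide by the process group size.
```
import torch.distributed as dist


def average_parameters_sketch(params, process_group=None):
    # Average each parameter in-place across all ranks in `process_group`.
    world_size = dist.get_world_size(process_group)
    for p in params:
        dist.all_reduce(p.data, group=process_group)
        p.data.div_(world_size)
```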

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_average_parameters
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_average_parameters

Reviewed By: rohan-varma

Differential Revision: D29242806

fbshipit-source-id: 76fb5a92adb4bdc6151a9f411e366a0ed2a31f47
2021-06-23 15:41:15 -07:00
Yi Wang
2b398d0537 [Reland][Gradient Compression] Apply division first to avoid overflow (#59576)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59576

If the gradients before allreduce are large, then the sum after allreduce may overflow, especially for FP16. Therefore, apply the division before allreduce.

This fix is applied to both C++ and Python comm hooks.
ghstack-source-id: 130754510
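
An illustrative stand-alone sketch of the order-of-operations change (not the hook code itself): divide by the world size before the allreduce so the FP16 sum cannot overflow.
```
import torch
import torch.distributed as dist


def fp16_allreduce(grad, group=None):
    world_size = dist.get_world_size(group)
    compressed = grad.to(torch.float16).div_(world_size)  # divide first...
    dist.all_reduce(compressed, group=group)              # ...then sum
    return compressed.to(grad.dtype)
```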

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl_grad_is_view

Reviewed By: rohan-varma

Differential Revision: D28941327

fbshipit-source-id: 932e8ddbdb2bfd609a78943f6dc390d3d6ca333f
2021-06-08 10:03:21 -07:00
Mike Ruberry
f998e63dca Revert D28922548: [Gradient Compression] Apply division first to avoid overflow
Test Plan: revert-hammer

Differential Revision:
D28922548 (459270ac01)

Original commit changeset: 442bd3cc7a35

fbshipit-source-id: 7e4361b4eb283cdb21f15a36d6eebf558dd7386f
2021-06-07 03:57:10 -07:00
Yi Wang
459270ac01 [Gradient Compression] Apply division first to avoid overflow (#59522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59522

If the gradients before allreduce are large, then the sum after allreduce may overflow, especially for FP16. Therefore, apply the division before allreduce.

This fix is applied to both C++ and Python comm hooks.
ghstack-source-id: 130686229

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl_grad_is_view

Reviewed By: rohan-varma

Differential Revision: D28922548

fbshipit-source-id: 442bd3cc7a35a8b948f626062fa7ad2e3704c5be
2021-06-07 01:43:10 -07:00
Yi Wang
9bfc1c4e0e [Gradient Compression] Update the docstring of fp16_compress_hook (#58168)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58168

Update the documentation to be consistent to https://github.com/pytorch/pytorch/pull/57410.
ghstack-source-id: 128797174

Test Plan: N/A

Reviewed By: agolynski, zhengwy888

Differential Revision: D28388160

fbshipit-source-id: 6ba13ad9f9d7b4d003cdc112545573e452df8b65
2021-05-12 14:28:41 -07:00
lezcano
24087d07ca Deprecate QR (#57745)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57745

Reviewed By: bdhirsh

Differential Revision: D28318164

Pulled By: mruberry

fbshipit-source-id: b8e3cb9d7ab33f30c8653ec39f932a8af8bd2a50
2021-05-10 22:56:37 -07:00
Weiyi Zheng
c07babbcf1 [Gradient Compression] Divide by world size before all_reduce to avoid overflow (#57410)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57410

FP16 gradient compression may run into an 'inf' issue. Switching to division before allreduce can avoid this problem.
ghstack-source-id: 127877083

Test Plan:
before change:

f268909897

after change:
f270950609

If you still see 'grad_norm = inf' after enabling the fp16 hook, you can resume the training with the hook turned off.

Reviewed By: SciPioneer

Differential Revision: D28128628

fbshipit-source-id: 0b6648637713e4f321e39c9ccb645a6b6f1750a0
2021-05-07 12:23:21 -07:00
Sam Estep
e3900d2ba5 Add lint for unqualified noqa (#56272)
Summary:
As this diff shows, currently there are a couple hundred instances of raw `noqa` in the codebase, which just ignore all errors on a given line. That isn't great, so this PR changes all existing instances of that antipattern to qualify the `noqa` with respect to a specific error code, and adds a lint to prevent more of this from happening in the future.

Interestingly, some of the examples the `noqa` lint catches are genuine attempts to qualify the `noqa` with a specific error code, such as these two:
```
test/jit/test_misc.py:27:            print(f"{hello + ' ' + test}, I'm a {test}") # noqa E999
test/jit/test_misc.py:28:            print(f"format blank") # noqa F541
```
However, those are still wrong because they are [missing a colon](https://flake8.pycqa.org/en/3.9.1/user/violations.html#in-line-ignoring-errors), which actually causes the error code to be completely ignored:

- If you change them to anything else, the warnings will still be suppressed.
- If you add the necessary colons then it is revealed that `E261` was also being suppressed, unintentionally:
  ```
  test/jit/test_misc.py:27:57: E261 at least two spaces before inline comment
  test/jit/test_misc.py:28:35: E261 at least two spaces before inline comment
  ```

I did try using [flake8-noqa](https://pypi.org/project/flake8-noqa/) instead of a custom `git grep` lint, but it didn't seem to work. This PR is definitely missing some of the functionality that flake8-noqa is supposed to provide, though, so if someone can figure out how to use it, we should do that instead.
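
For reference, the difference the colon makes (reusing the `F541` example above):
```
print(f"format blank")  # noqa F541
# ^ Missing colon: the code is silently ignored and *all* errors on this line are suppressed.

print(f"format blank")  # noqa: F541
# ^ Qualified correctly: only F541 is suppressed; other errors (e.g. E261) still surface.
```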

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56272

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI run (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2365189927

Reviewed By: janeyx99

Differential Revision: D27830127

Pulled By: samestep

fbshipit-source-id: d6dcf4f945ebd18cd76c46a07f3b408296864fcb
2021-04-19 13:16:18 -07:00
Yi Wang
b4cb020c0f [Gradient Compression] Make orthogonalization_epsilon configurable in PowerSGDState (#55738)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55738

Per title, and use 0 as the default value.

It turns out that setting this epsilon as 0 can accelerate convergence and improve accuracy for some use cases.

Test Plan:
unit tests
f264687105
f264675194

Reviewed By: shuyingsunshine21

Differential Revision: D27694971

fbshipit-source-id: b61528c6c817127974acdc4635bccf607532287f
2021-04-13 02:52:56 -07:00
Yi Wang
2496a09314 [Gradient Compression] Fix PowerSGD docstring by removing an extra whitespace (#55666)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55666

{F590513307}

Some code is not properly displayed due to an extra whitespace ahead of `(num_rows + num_cols)`.
ghstack-source-id: 126148569

Test Plan: Locally viewed

Reviewed By: rohan-varma

Differential Revision: D27673663

fbshipit-source-id: 603ae4ddbe86ceaefc311885b82b0f6b48b57b27
2021-04-09 21:11:40 -07:00
Yi Wang
1b4bb3691c [Gradient Compression] Update _powerSGD_comm_hook_wrapper to only expose 2 most critical hyperparameters (#55295)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55295

Update `_powerSGD_comm_hook_wrapper` to expose only the 2 most critical hyperparameters, to make this API clearer to any future user (although the second hyperparameter `start_powerSGD_iter` is not in use yet).

Test Plan: waitforbuildbot

Reviewed By: shuyingsunshine21

Differential Revision: D27561734

fbshipit-source-id: b661981cc033b109f4f2fc92b435567a184a7fb5
2021-04-06 01:29:10 -07:00
Yi Wang
cc4036905c [Gradient Compression] Update the default value of start_powerSGD_iter and update the docstring (#55272)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55272

1. Set 1K as the default value of `start_powerSGD_iter` for practicability. The original default value 10 is usually too small for real use cases. The new default value 1K is also consistent with PyTorch Lightning.
2. Update the docstring of `start_powerSGD_iter` to remind the users to set a value no less than the warm-up steps if any.
3. Update some unit tests to start PowerSGD early.

ghstack-source-id: 125707662

Test Plan: waitforbuildbot

Reviewed By: shuyingsunshine21

Differential Revision: D27553388

fbshipit-source-id: 40076419bc85755c0c0b64b79ba914b241085fcc
2021-04-06 01:27:29 -07:00
Yi Wang
6a2f046504 [SPMD] Restrict DDP communication hooks to SPSD mode (#55253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55253

Previously DDP communication hooks took a tensor list as the input. Now they take only a single tensor, in preparation for retiring SPMD and providing only a single model replica to DDP communication hooks.

The next step is limiting the Reducer to only 1 model replica.
ghstack-source-id: 125677637

Test Plan: waitforbuildbot

Reviewed By: zhaojuanmao

Differential Revision: D27533898

fbshipit-source-id: 5db92549c440f33662cf4edf8e0a0fd024101eae
2021-04-05 16:46:47 -07:00
Yi Wang
058357a439 [Gradient Compression] Report compression rate for batched PowerSGD hook (#55103)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55103

Previously the compression rate was only reported in the PowerSGD hook. Also report this metric here for comprehensive experimentation.

It is very easy to compute the sizes before and after compression, because there is only one matrix factorization per bucket, and no accumulation within the bucket is needed.
1) The size before compression is the input tensor size.
2) The size after compression is the size of P + Q, where each has a size of `square_side_length * state.matrix_approximation_rank`.
ghstack-source-id: 125399028

Test Plan: Tested by running scripts/wayi/torch/power_sgd.py locally.

Reviewed By: deadlybulb

Differential Revision: D27474295

fbshipit-source-id: a2225e85be03ab20238f01014d5ec9ae1787c4fb
2021-03-31 22:17:05 -07:00
Yi Wang
7c0941ee63 Clang-format powerSGD_hook.py (#54839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54839

ghstack-source-id: 125089465

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D27384796

fbshipit-source-id: 8312059f6a47d60ca29f75041141bb88804e1b32
2021-03-30 09:28:45 -07:00
Yi Wang
6c31f56bf4 [Gradient Compression] Add cuda.synchronize back to batched powerSGD (#54838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54838

Realized that an explicit sync is somehow still needed for the batched PowerSGD hook. I found that a job failure can be fixed by this change.

The sync was once removed by #54482.

Test Plan:
f260900882
f260899693

Reviewed By: rohan-varma

Differential Revision: D27384738

fbshipit-source-id: 3efd738b9fd375e2ceb36ed3a6bf99cd8ce8ff95
2021-03-30 09:27:11 -07:00
Mark Astley
4bf90558e0 [Gradient Compression] Add logging for gradient compression stats. (#54647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54647

Regularly log stats showing effect of gradient compression when using the PowerSGD DDP communication hook.

Test Plan:
buck run mode/dev-nosan scripts/wayi/torch:power_sgd

Play with the layer sizes of the input model (you can just use linear layers for convenience), and check the log that shows compression stats. For convenience, you can change `logging.info` to `print` locally.

You can create some test diffs on top of this diff, to show that the compression stats are correct in different cases.

Run with power_sgd script:
{F537381542}

Diff with example using a simple linear model: D27299934
sample output:
{F538486535}

Reviewed By: SciPioneer

Differential Revision: D27240254

fbshipit-source-id: 9e142b2f7957cc874804f799b7bb3bffdf824858
2021-03-25 07:44:17 -07:00
Yi Wang
c22fc448cd [Gradient Compression] Remove cuda.synchronize in batched powerSGD (#54482)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54482

`cuda.synchronize` is unnecessary for `batched_powerSGD_hook`.
ghstack-source-id: 124607761

Test Plan:
f259607860
f259563921

Reviewed By: rohan-varma

Differential Revision: D27254314

fbshipit-source-id: 4744c07a6f0c8939e766ffa935ddbf3c47e85d18
2021-03-23 00:55:53 -07:00
Yi Wang
de70cdb66b Clang format default_hooks.py (#53956)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53956

ghstack-source-id: 123852987

Test Plan: N/A

Reviewed By: iseessel

Differential Revision: D27032713

fbshipit-source-id: 11d831fa0f08b1c8bc2e44acd144bf85a69a1211
2021-03-13 10:41:11 -08:00
Yi Wang
ca4aae85fa [Gradient Compression] Update the docstring of fp16_compress_wrapper (#53955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53955

Per title
ghstack-source-id: 123852836

Test Plan: N/A

Reviewed By: iseessel

Differential Revision: D27032700

fbshipit-source-id: 6f9bbc028efe6cc9b54f4ec729fea745368efb2e
2021-03-13 10:39:40 -08:00
Isaac Seessel
3078233e9a [Gradient Compression] Make FP16 compression as a wrapper that can be combined with other communication hooks (#53808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53808

Create an FP16 wrapper that can combine FP16 gradient compression with any gradient compression algorithm.
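
A rough usage sketch (module paths follow the later public API; `ddp_model` is assumed to be a `DistributedDataParallel` instance):
```
import torch.distributed.algorithms.ddp_comm_hooks.default_hooks as default_hooks
import torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook as powerSGD

state = powerSGD.PowerSGDState(process_group=None)
# Cast the PowerSGD payload to FP16 before communication, then decompress after.
ddp_model.register_comm_hook(
    state, default_hooks.fp16_compress_wrapper(powerSGD.powerSGD_hook)
)
```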

Test Plan:
Unit test:
```
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper
```

Performance Test on DDP QPS Benchmark: Check if AllReduce + FP16 Wrapper = FP16 Compression
1) FP16 Compression:
f256897690

2) FP16 Wrapper + AllReduce (after patching D26960986):
f256897289

Reviewed By: SciPioneer

Differential Revision: D26978832

fbshipit-source-id: 0dcd18b050c02f5e9f3cff56344d1f39a04e20c0
2021-03-12 17:31:07 -08:00
Yi Wang
8016d28c0b [Gradient Compression] Update the comment on fp16_compress_hook (#53780)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53780

Update the comment, because the input data type of `fp16_compress_hook` does not have to be FP32. For example, the input dtype can also be FP64, as long as it can be cast to FP16.
ghstack-source-id: 123680621

Test Plan: N/A

Reviewed By: iseessel

Differential Revision: D26967224

fbshipit-source-id: 26d79a3629a597e6335b6f59c97d25a764a8ed80
2021-03-11 13:40:32 -08:00
Yi Wang
68b62493b8 [Gradient Compression] Make GradBucket class public (#53099)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53099

Publish the GradBucket APIs as part of publishing the DDP communication hooks.

s/_GradBucket/GradBucket
ghstack-source-id: 123030921

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26721121

fbshipit-source-id: ee5f68e33095b9965b51937b86cdeb331fd2419a
2021-03-03 19:22:15 -08:00
Yi Wang
b59075eced [Gradient Compression] Refactor tensor grouping in PowerSGD (#52981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52981

No need to create a hard boundary between rank-1 tensors and high-rank tensors, since some high-rank tensors will not be compressed if the compression cannot save enough bandwidth, according to the `_should_compress` function.

Therefore, refactor and simplify the tensor grouping logic, which addresses the comment in https://github.com/pytorch/pytorch/pull/52541#discussion_r580867311
ghstack-source-id: 122997032

Test Plan:
waitforbuildbot

Already LGTMed by PowerSGD paper author.

Ads1x (completed):
https://www.internalfb.com/intern/tupperware/details/job/?handle=priv3_global%2Fmast_hpc%2Ftsm_hpc-wayi_ads_10x_POWER_SGD_gpu8_2021-02-28_15-29.trainer&tatwTabs=tasks&task_id=0&task_tab=TASK_LOGS

Detectron2:
1) Before refactoring:
f254353864
Accuracy: 39.972
Overall training speed: 67498 iterations in 6:15:42 (0.3340 s / it)

2) After refactoring:
f254353380
Accuracy: 39.944
Overall training speed: 67498 iterations in 6:09:41 (0.3286 s / it)

Reviewed By: rohan-varma

Differential Revision: D26713689

fbshipit-source-id: 12cfcb65feaa2a2d94e3c7793073031f13828305
2021-03-03 19:20:41 -08:00
Yi Wang
ba36e32406 [Gradient Compression] Correct the usage of min_compression_rate (#52979)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52979

Compression rate = uncompressed size / compressed size, so the compression rate is usually greater than 1.

Previously the compression rate was perceived as compressed size / uncompressed size, which can be very confusing.
ghstack-source-id: 122996272
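
A small numeric sketch of the corrected definition, assuming PowerSGD's rank-r factorization of an n x m tensor into P (n x r) and Q (m x r):
```
n, m, r = 1024, 512, 4
uncompressed = n * m                          # 524288 elements
compressed = (n + m) * r                      # 6144 elements (P plus Q)
compression_rate = uncompressed / compressed  # ~85.3, i.e. greater than 1 as expected
```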

Test Plan: unit tests

Reviewed By: zhaojuanmao

Differential Revision: D26713349

fbshipit-source-id: 83b7f8908c101954cf01f56a22161047fbfeaa53
2021-03-03 15:35:40 -08:00
Yi Wang
b05dd931ee [Gradient Compression] Add is_the_last_bucket_to_allreduce method to GradBucket class (#53010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53010

To determine the boundary between different iterations in a DDP communication hook, the user code currently needs to check `bucket.get_index() == 0`, which involves internal bucketization implementation details and undermines the usability of DDP communication hooks.

Create an API to hide the details and improve the usability before publishing GradBucket APIs.
ghstack-source-id: 122723081
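
A hypothetical sketch of the usability improvement (the `state` object and its fields are illustrative; `allreduce_hook` stands in for whatever reduction the hook performs):
```
from torch.distributed.algorithms.ddp_comm_hooks.default_hooks import allreduce_hook


def counting_hook(state, bucket):
    # Before this change, user code had to rely on `bucket.get_index() == 0`.
    if bucket.is_the_last_bucket_to_allreduce():
        state.iter += 1
    return allreduce_hook(state.process_group, bucket)
```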

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D26720813

fbshipit-source-id: f4a3147382c1f970534d7f0dee0cd599156c8b8c
2021-03-02 14:39:12 -08:00
Yi Wang
ecb5ac90ed [Gradient Compression] Add get_per_parameter_tensors method to GradBucket class (#53009)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53009

It is common to apply layer-wise operations over per-parameter tensors in a DDP communication hook.

Create a util method in GradBucket class before publishing GradBucket APIs.
ghstack-source-id: 122833594
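
A hypothetical sketch of such a layer-wise operation using the new accessor (`bucket` is the `GradBucket` passed to a comm hook):
```
import torch


def per_layer_grad_norms(bucket):
    # One L2 norm per parameter whose gradient lives in this bucket.
    return [torch.norm(grad) for grad in bucket.get_per_parameter_tensors()]
```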

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

f254364097

Reviewed By: rohan-varma

Differential Revision: D26717893

fbshipit-source-id: 916db319de8b85dd22bc4e35db5671bf4e34740f
2021-03-02 14:39:03 -08:00
Yi Wang
890e051047 Clang-format quantization_hooks.py (#53100)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53100

ghstack-source-id: 122723751

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D26721146

fbshipit-source-id: 985057fc02c997124b676854eb0a55e569971a3f
2021-03-02 12:48:43 -08:00
Shen Li
729d88119a Fix GradBucket Typing (#52943)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52943

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D26699759

Pulled By: mrshenli

fbshipit-source-id: 712165a29d114da761ef4f161096ca46a958df03
2021-02-27 20:04:38 -08:00
Seung-Jae Bang
2d75346c25 [Gradient Compression] Add a minimum compression rate threshold for PowerSGD communication hook (#52541)
Summary:
Fixes #52034
- Add a minimum compression rate threshold to `PowerSGDState`
- Use the threshold to determine whether to compress high-rank tensors or not

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52541

Test Plan:
No performance regression using rank-8 compression:
baseline: f253000411
updated one: f253010955

Reviewed By: rohan-varma

Differential Revision: D26594862

Pulled By: SciPioneer

fbshipit-source-id: 2859a91b4ca6bd1862bf6cd6441dc2a89badb2d5
2021-02-23 22:03:02 -08:00
Yi Wang
03ae6d9903 Remove useless _allgather_then_aggregate_hook (#52593)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52593

This hook is not used at all, and it probably can only be used for demonstrating that allgather is slower than allreduce, so it should never be used in practice.

However, this hook and its helper function live in the same file as the public communication hook APIs. It would be better to keep the public API file as concise as possible.

Since I don't think we will use this hook in the future, I prefer deleting it to moving it to a separate file.
ghstack-source-id: 122180969

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26575318

fbshipit-source-id: b258154a7c92e33236c34104bd79bc244ecdb158
2021-02-22 12:12:53 -08:00
Yi Wang
4b3c99ce4a [Resubmission] Add a documentation page for DDP communication hooks (#51773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51773

Resubmission of #51715.

Minor changes:
1) Removed "Note [Guidance to Tune ``matrix_approximation_rank`` And ``start_powerSGD_iter``]" in powerSGD_hook.py.

2) Removed the duplicate description of `torch.nn.parallel.DistributedDataParallel.register_comm_hook` in ddp_comm_hooks.rst, because it is already covered by distributed.rst.

Also updated the doc based on the comments from PowerSGD paper author Thijs Vogels .

It seems that `python_doc_test` was flaky. The previous error message was not informative:
https://app.circleci.com/pipelines/github/pytorch/pytorch/270682/workflows/8d186a3c-d682-46bf-b617-ad4eef5991e2/jobs/10739143, and all the warnings did also appear on the master branch.

Rebasing to a new master branch seems to get this fixed:
https://app.circleci.com/pipelines/github/pytorch/pytorch/270696/workflows/1a3adbea-6443-4876-b87b-e17d90d41428/jobs/10740021/steps

Screenshot:

{F369899792}
ghstack-source-id: 121199613

Test Plan: View locally

Reviewed By: mingzhe09088

Differential Revision: D26272687

fbshipit-source-id: 6677db496a68171798940a80343f4d9a508e15db
2021-02-06 21:22:04 -08:00
Natalia Gimelshein
d3023d86ba Revert D26249330: [Gradient Compression] Add a documentation page for DDP communication hooks
Test Plan: revert-hammer

Differential Revision:
D26249330 (e62aabac43)

Original commit changeset: ab973390ddb7

fbshipit-source-id: d508daed76219e7ca588cf7fb38aeaaffc61acfd
2021-02-04 22:38:06 -08:00
Yi Wang
e62aabac43 [Gradient Compression] Add a documentation page for DDP communication hooks (#51715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51715

Add a documentation page for DDP communication hooks.

Screenshot:

{F369781049}

Test Plan: View locally

Reviewed By: pritamdamania87

Differential Revision: D26249330

fbshipit-source-id: ab973390ddb785c5191f587a1b2b6de7d229e50e
2021-02-04 18:53:53 -08:00
Yi Wang
43df03de13 [Gradient Compression] Replace torch.sqrt(torch.sum(col ** 2)) by torch.norm() (#51629)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51629

Leverage the existing util functions as much as possible for potential performance gain.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120919883
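
A tiny check of the equivalence relied on here: for a 1-D tensor, the default `torch.norm` is the 2-norm, so the two expressions agree.
```
import torch

col = torch.randn(128)
assert torch.allclose(torch.sqrt(torch.sum(col ** 2)), torch.norm(col))
```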

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

No performance regression:
f248664994 uses `torch.norm()`
```
total:
  32 GPUs -- 32 GPUs: p25:  1.050    30/s  (batch size 32)
p50:  1.230    26/s  (batch size 32)
p75:  1.449    22/s  (batch size 32)
p90:  1.611    19/s  (batch size 32)
p95:  1.702    18/s  (batch size 32)

backward:
  32 GPUs -- 32 GPUs: p25:  0.769    41/s  (batch size 32)
p50:  0.920    34/s  (batch size 32)
p75:  1.139    28/s  (batch size 32)
p90:  1.322    24/s  (batch size 32)
p95:  1.440    22/s  (batch size 32)
```

f248678690 does not use `torch.norm()`
```
total:
  32 GPUs -- 32 GPUs: p25:  1.056    30/s  (batch size 32)
p50:  1.249    25/s  (batch size 32)
p75:  1.443    22/s  (batch size 32)
p90:  1.608    19/s  (batch size 32)
p95:  1.711    18/s  (batch size 32)

backward:
  32 GPUs -- 32 GPUs: p25:  0.777    41/s  (batch size 32)
p50:  0.939    34/s  (batch size 32)
p75:  1.127    28/s  (batch size 32)
p90:  1.322    24/s  (batch size 32)
p95:  1.448    22/s  (batch size 32)
```

Reviewed By: pritamdamania87

Differential Revision: D26219835

fbshipit-source-id: 31d8ad3401d4efced4a6069f4f1e169ea3372697
2021-02-03 13:39:11 -08:00
Yi Wang
79e7544cb4 [Gradient Compression] Check start_PowerSGD_iter > 1 and add guidance on tuning PowerSGD configs. (#51427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51427

A user reported that `start_PowerSGD_iter` failed when it was set to 1. This is because allocating memory for error tensors somehow overlaps with the bucket rebuilding process at iteration 1.

Check `start_PowerSGD_iter > 1` instead of `start_PowerSGD_iter >= 1`.

Also add a unit test of `test_invalid_powerSGD_state` and some guidance on tuning PowerSGD configs.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120834126

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_invalid_powerSGD_state

Reviewed By: rohan-varma

Differential Revision: D26166897

fbshipit-source-id: 34d5b64bb3dd43acb61d792626c70e6c8bb44a5d
2021-02-02 04:30:24 -08:00
Yi Wang
c08078031f [Gradient Compression] Allow BatchedPowerSGD to run vanilla allreduce for the first K iterations (#51270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51270

Similar to #50973, allow the batched version to run vanilla allreduce for the first K iterations.

This may be useful if the batched version can be applied to some use cases where the accuracy requirement is not very strict.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725858

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

baseline: f248001754
batched PowerSGD: f246960752

The training time was reduced from 54m48s to 30m33s, and the accuracy is approximately the same: 44.21 vs 44.35

Reviewed By: rohan-varma

Differential Revision: D26077709

fbshipit-source-id: 6afeefad7a3fbdd7da2cbffb56dfbad855a96cb5
2021-02-01 15:26:29 -08:00
Yi Wang
0831984ed5 [Resubmission][Gradient Compression] Refactor default_hooks.py and powerSGD_hook.py by creating a util function that make a vanilla allreduce future (#51400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51400

Resubmission of #51094

Address https://github.com/pytorch/pytorch/pull/50973#discussion_r564229818

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725690

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D26162333

fbshipit-source-id: ccc2eae5383a23673e00d61cb5570fb8bf749cd0
2021-02-01 11:34:41 -08:00
Iurii Zdebskyi
5a406c023e Revert D26070147: [Gradient Compression] Refactor default_hooks.py and powerSGD_hook.py by creating a util function that make a vanilla allreduce future
Test Plan: revert-hammer

Differential Revision:
D26070147 (e7b3496232)

Original commit changeset: 8c9339f1511e

fbshipit-source-id: fa1e9582baec9759a73b3004be9bb19bdeb6cd34
2021-01-29 09:06:24 -08:00
Yi Wang
e7b3496232 [Gradient Compression] Refactor default_hooks.py and powerSGD_hook.py by creating a util function that make a vanilla allreduce future (#51094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51094

Address https://github.com/pytorch/pytorch/pull/50973#discussion_r564229818

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120619680

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D26070147

fbshipit-source-id: 8c9339f1511e8f24cc906b9411cfe4850a5a6d81
2021-01-28 19:03:18 -08:00
Yi Wang
9d731e87de [Gradient Compression] Explicitly specify the dtype of the error tensor (#50985)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50985

Explicitly specify the dtype of error tensor when it is initialized by zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, although later it would be assigned from another FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change will make the dtype of error tensor look more clear.

Additionally, also explicitly specify the dtype if rank-1 tensor buffer is empty.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120377786
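
An illustrative one-liner of the change (the stand-in `input_tensor` is hypothetical; the surrounding hook code is omitted):
```
import torch

input_tensor = torch.randn(8, dtype=torch.float16)  # stand-in for the bucket tensor
# Before: created with the default dtype (FP32) regardless of the input's dtype.
error = torch.zeros(input_tensor.shape, device=input_tensor.device)
# After: created with the input's dtype explicitly.
error = torch.zeros(input_tensor.shape, device=input_tensor.device, dtype=input_tensor.dtype)
```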

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D26034988

fbshipit-source-id: e0d323d0b77c6a2478cdbe8b31a1946ffd1a07da
2021-01-28 19:03:14 -08:00
Yi Wang
b619d37bb4 [Gradient Compression] Simplify the implementation of error feedback and warm-start (#50981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50981

Since vanilla allreduce will be applied in the first few iterations, the bucket rebuilding process will not affect the caching of per-variable tensors.

Previously the cached tensors used for error feedback and warm-up needed to be rebuilt later, because their corresponding input tensors' shapes would change after the bucket rebuild process.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120617971

Test Plan: real run

Reviewed By: rohan-varma

Differential Revision: D26034418

fbshipit-source-id: e8744431c7f3142d75b77b60110e6861c2ff5c14
2021-01-28 18:59:40 -08:00
Yi Wang
9f19843d19 [Gradient Compression] Typo fixes in PowerSGD (#50974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50974

Typo fixes.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120257221

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D26031679

fbshipit-source-id: 9d049b50419a3e40e53f7f1275a441e31b87717b
2021-01-25 22:55:54 -08:00
Yi Wang
ffaae32d60 [Gradient Compression] Allow PowerSGD to run vanilla allreduce for the first K iterations (#50973)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50973

This can extend the original PowerSGD method to a hybrid approach: vanilla allreduce + PowerSGD. This can help further improve the accuracy, at the cost of a lower speedup.

Also add more comments on the fields in `PowerSGDState`.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120257202
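
A rough usage sketch of the hybrid approach (module path follows the later public API; `ddp_model` is assumed to exist):
```
import torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook as powerSGD

state = powerSGD.PowerSGDState(
    process_group=None,
    matrix_approximation_rank=1,
    start_powerSGD_iter=1000,  # the "K": vanilla allreduce runs until this iteration
)
ddp_model.register_comm_hook(state, powerSGD.powerSGD_hook)
```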

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D26031478

fbshipit-source-id: d72e70bb28ba018f53223c2a4345306980b3084e
2021-01-25 22:38:39 -08:00
Yi Wang
439afda090 [Gradient Compression] Fix warm-start for PowerSGD laywerwise compression (#50283)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50283

Realized that for the layerwise compression, the previous warm-start implementation only skips memory allocations, but does not skip filling random values for the Qs.

Also fix the unit test in distributed_test.py. Previously the process group was not created correctly, and no communication occurred in test_DistributedDataParallel_powerSGD_ddp_comm_hook.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120101220

Test Plan:
Verified the fix by adding some logging locally.

Also verified no NE diff on Ads 1x.

Reviewed By: rohan-varma

Differential Revision: D25846222

fbshipit-source-id: 1ebeeb55ceba64d4d904ea6ac1bb42b1b2241520
2021-01-20 22:31:44 -08:00