Remove `_checkpoint_wrapped_module` prefixes when creating keys for optimizer state_dict.
Having these prefixes does not actually break optim_state_dict save / load, but we would like to strip them out for downstream code that consumes these APIs and typically expects checkpointing prefixes not to exist (checkpointing should be a transparent operation that does not change module / parameter names).
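A minimal sketch of the kind of key cleanup this refers to, assuming the optimizer state is keyed by fully qualified parameter names; the helper names here are illustrative, not the code added by this PR:
```python
# Illustrative: strip the activation-checkpoint wrapper prefix from the
# parameter-name keys of an optimizer state_dict.
_CHECKPOINT_PREFIX = "_checkpoint_wrapped_module."

def _clean_param_name(fqn: str) -> str:
    # e.g. "layer1._checkpoint_wrapped_module.weight" -> "layer1.weight"
    return fqn.replace(_CHECKPOINT_PREFIX, "")

def _clean_optim_state_dict_keys(osd: dict) -> dict:
    # Only the "state" mapping is assumed to be keyed by parameter FQNs here;
    # "param_groups" is left untouched in this sketch.
    cleaned = dict(osd)
    cleaned["state"] = {_clean_param_name(k): v for k, v in osd["state"].items()}
    return cleaned
```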
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80480
Approved by: https://github.com/awgu, https://github.com/fegin
Fixes #75666
This PR adds the functionality for the `PostLocalSGD` communication hook and tests that the communication hook can be properly saved and restored, similar to https://github.com/pytorch/pytorch/pull/79334, where serialization was added to `PowerSGD`.
``__getstate__``
Returns:
```
``Dict[str, Any]`` which will be pickled and saved.
``process_group`` and ``subgroup`` are not serializable and are excluded from
the returned state.
```
``__setstate__``
```
Takes the provided ``state`` and retrieves ``PostLocalSGDState``.
``process_group`` and ``subgroup`` are set to the default process group and subgroup, respectively.
The default subgroup is equivalent to the subgroup on each node.
```
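A hedged usage sketch of what saving and restoring the stateful hook can look like; it assumes `ddp_model` is an already-constructed `DistributedDataParallel` instance and that the default process group is initialized, and the argument names should be checked against the installed `post_localSGD_hook` version:
```python
import torch
import torch.distributed.algorithms.ddp_comm_hooks.post_localSGD_hook as post_localSGD

# Register the stateful hook on a DDP model (process_group=None -> default group,
# subgroup=None -> default per-node subgroup).
state = post_localSGD.PostLocalSGDState(
    process_group=None, subgroup=None, start_localSGD_iter=100
)
ddp_model.register_comm_hook(state, post_localSGD.post_localSGD_hook)

# Checkpoint: __getstate__ drops the non-serializable process_group / subgroup.
torch.save({"model": ddp_model.state_dict(), "hook_state": state}, "checkpoint.pt")

# Resume: __setstate__ restores the state and falls back to the default process
# group and the per-node default subgroup. (On newer PyTorch, torch.load may need
# weights_only=False to unpickle the hook state.)
ckpt = torch.load("checkpoint.pt")
ddp_model.load_state_dict(ckpt["model"])
ddp_model.register_comm_hook(ckpt["hook_state"], post_localSGD.post_localSGD_hook)
```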
Small adjustment to `PowerSGD`'s warning message.
Refactored the unit test, i.e., separated the parity and log checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80435
Approved by: https://github.com/awgu
This TODO is no longer needed, as we use `_register_fused_optim` to register the overlapped optimizer in DDP. Also, remove the comment about the API being experimental, as this API is no longer going to be used by end users.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80453
Approved by: https://github.com/awgu
Fixes #79114
An implementation of an FSDP communication hook interface for the NO_SHARD strategy:
- `FullyShardedDataParallel.register_comm_hook(self, state: object, hook: callable)` checks the current sharding strategy. If it is anything other than NO_SHARD, it raises a runtime error. Otherwise, it sets and shares the specified hook and its state with all submodules (see the registration sketch after this list).
- When FSDP is ready to communicate a gradient, it checks whether a hook is registered and calls it instead of `all_reduce`. Additionally, gradient pre- and post-division are not performed when a hook is registered.
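A minimal registration sketch under the NO_SHARD strategy; `nn.Linear(8, 8)` stands in for a real model, the process group is assumed to be initialized, and the `(state, grad)` hook signature follows this PR's test but may differ across versions:
```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

def allreduce_hook(state, grad: torch.Tensor):
    # All-reduce the gradient and average it, mirroring default DDP behavior.
    dist.all_reduce(grad, group=state)
    grad.div_(dist.get_world_size())

# Assumes dist.init_process_group(...) has already been called.
fsdp_model = FSDP(nn.Linear(8, 8), sharding_strategy=ShardingStrategy.NO_SHARD)
# Raises a runtime error for any sharding strategy other than NO_SHARD.
fsdp_model.register_comm_hook(dist.group.WORLD, allreduce_hook)
```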
To test the interface, I've implemented a communication hook that calls `all_reduce`.
The unit test:
- checks that if the sharding strategy is anything but NO_SHARD, a runtime error is raised
- checks that in the NO_SHARD case, a model with a registered all_reduce hook and a model without a hook behave the same
- checks two types of FSDP models, with and without the first layer wrapped, to make sure submodules have the hook registered
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79833
Approved by: https://github.com/rohan-varma, https://github.com/awgu
This PR addresses issue #75666.
Stateful communication hooks can now be saved and reloaded to resume training.
This PR adds the functionality for the PowerSGD communication hook and tests that the communication hook can be properly saved and restored.
The PowerSGD implementation uses ``__slots__``; as a result, the introduced `__getstate__` and `__setstate__` methods are implemented to work with `__slots__` and not `__dict__`.
`__getstate__`
Returns:
A dictionary that represents a ``PowerSGDState`` which will be pickled and saved.
``process_group`` is not serializable and is excluded from the returned state.
`__setstate__`
Takes the provided ``state`` and retrieves ``PowerSGDState``.
``process_group`` is set to the default, with a proper warning issued to the user.
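A minimal sketch of the `__slots__`-based pattern described above; the class and the chosen slots are illustrative, not the actual `PowerSGDState` fields:
```python
import warnings
import torch.distributed as dist

class _SlottedHookState:
    # Stand-in for a hook state that uses __slots__ (and therefore has no __dict__).
    __slots__ = ["process_group", "matrix_approximation_rank", "start_powerSGD_iter"]

    def __getstate__(self):
        # Return a plain dict of slot values, skipping the non-serializable process group.
        return {
            slot: getattr(self, slot)
            for slot in self.__slots__
            if slot != "process_group"
        }

    def __setstate__(self, state):
        # Fall back to the default (global) process group and warn the user about it.
        warnings.warn(
            "process_group is not serializable and is reset to the default group."
        )
        self.process_group = dist.group.WORLD
        for slot, value in state.items():
            setattr(self, slot, value)
```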
Unit test
A hook-independent `_test_hook_pickling` is added with this PR, as well as `test_ddp_hook_pickling_powerSGD`, which tests `powerSGD`’s ability to be saved and reloaded.
Currently, the test creates a DDP model with a provided hook, trains it for 10 epochs, and saves the model's state and the hook's state.
During reloading, the unit test makes sure that a warning was logged (only one warning, and the proper one). It then checks that the reloaded hook and the original hook are the same. Finally, it checks that the hook's state was properly initialized:
- it compares slot values (all but two: `process_group` and `rng`) between the original and reloaded states
- it checks that the process group was set to the default group
- it checks that the random state was restored properly: `rng` is an instance of `np.random.RandomState`, represented by a tuple, one of whose entries is an `ndarray` of `uint32` dtype, so `np.testing.assert_array_equal` is used for the assertion (see the comparison sketch after this list)
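A small self-contained sketch of the array-based comparison in the last bullet; the two `RandomState` objects here are illustrative stand-ins for the original and reloaded `rng` slots:
```python
import numpy as np

# Stand-ins for the `rng` attributes of the original and reloaded hook states.
original_rng = np.random.RandomState(seed=123)
reloaded_rng = np.random.RandomState(seed=123)

# get_state() returns a tuple whose second entry is a uint32 ndarray, so a plain
# `==` check is not sufficient; compare entry by entry instead.
for original_entry, reloaded_entry in zip(original_rng.get_state(), reloaded_rng.get_state()):
    np.testing.assert_array_equal(original_entry, reloaded_entry)
```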
Future To-Do:
- Implement similar `__getstate__` and `__setstate__` methods for other stateful communication hooks
- Add appropriate tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79334
Approved by: https://github.com/rohan-varma, https://github.com/awgu
- Uses state_dict / load_state_dict hooks to ensure that modules wrapped with `CheckpointWrapper` can be loaded into a non-checkpoint-wrapped module.
This is because a training run can use activation checkpointing and then recover the `state_dict`, while a future run may not want to wrap modules with activation checkpointing, or may decide to change the activation checkpoint wrapping structure. To support this, we add hooks that remove / add the relevant prefix as needed.
Tests are added to ensure that a CheckpointWrapper-wrapped module's state_dict can be loaded into both a CheckpointWrapper module and a local module. state_dict with FSDP is also verified.
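A hedged sketch of the hook-based prefix handling described here, using the module's private state_dict / load_state_dict hook registration helpers; the real hooks live inside `checkpoint_wrapper`, and `nn.Linear` stands in for a wrapped module:
```python
import torch.nn as nn

_PREFIX = "_checkpoint_wrapped_module."

def _strip_prefix_post_hook(module, state_dict, prefix, local_metadata):
    # Runs after state_dict() is collected: drop the wrapper prefix from every key
    # so the saved keys look like the unwrapped module's keys.
    for key in list(state_dict.keys()):
        if _PREFIX in key:
            state_dict[key.replace(_PREFIX, "")] = state_dict.pop(key)
    return state_dict

def _add_prefix_pre_hook(state_dict, prefix, *args):
    # Runs before load_state_dict(): put the wrapper prefix back so that a "clean"
    # state_dict can be loaded into a CheckpointWrapper-wrapped module.
    for key in list(state_dict.keys()):
        if key.startswith(prefix) and _PREFIX not in key:
            new_key = prefix + _PREFIX + key[len(prefix):]
            state_dict[new_key] = state_dict.pop(key)

wrapped = nn.Linear(4, 4)  # stands in for a CheckpointWrapper instance
wrapped._register_state_dict_hook(_strip_prefix_post_hook)
wrapped._register_load_state_dict_pre_hook(_add_prefix_pre_hook)
```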
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77224
Approved by: https://github.com/zhaojuanmao
I find that disabling intra-subgroup gradient allreduce can sometimes still give satisfying accuracy, so it is better to make such gradient averaging configurable. This does not even take into account the communication savings from skipping the gradient allreduce.
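If this lands as a constructor flag on `PostLocalSGDState`, usage could look roughly like the sketch below; the flag name `post_local_gradient_allreduce` is my reading of this change and should be verified against the merged API, and `ddp_model` is assumed to be an existing DDP instance:
```python
import torch.distributed.algorithms.ddp_comm_hooks.post_localSGD_hook as post_localSGD

# Skip the intra-subgroup gradient allreduce once local SGD starts, so that only
# the periodic model averaging keeps replicas in sync (flag name may differ).
state = post_localSGD.PostLocalSGDState(
    process_group=None,
    subgroup=None,
    start_localSGD_iter=100,
    post_local_gradient_allreduce=False,
)
ddp_model.register_comm_hook(state, post_localSGD.post_localSGD_hook)
```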
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76723
Approved by: https://github.com/rohan-varma
Summary:
Previously, the highest-level process group in `period_process_group_dict` could be `None`, indicating the global group. Now `period_process_group_dict` cannot contain `None` as a process group, so `_find_process_group` can return a process group directly instead of a tuple: when no process group is found, it simply returns `None`, since a found process group can no longer be `None` itself.
Proposal: https://github.com/pytorch/pytorch/issues/71325
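A rough sketch of the simplified lookup this describes; the signature and iteration order are assumptions based on the summary, not the actual helper:
```python
from typing import Optional
import torch.distributed as dist

def _find_process_group(step: int, period_process_group_dict) -> Optional[dist.ProcessGroup]:
    # Assumes the dict is ordered by increasing period; return the process group
    # for the largest period that divides `step`, or None when no period matches.
    found = None
    for period, group in period_process_group_dict.items():
        if step % period == 0:
            found = group
    return found
```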
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75007
Reviewed By: awgu
Differential Revision: D35357816
Pulled By: rohan-varma
fbshipit-source-id: 4522dba49797df7140227bfd822d668b7e118a66
(cherry picked from commit 77ca01b555d52685283c969176b08de4ff46c32d)
Summary:
Add a reference.
Also fix the comment: unlike `averagers.py`, this is currently not a base class from which many subclasses can inherit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73823
Reviewed By: ejguan
Differential Revision: D34684366
Pulled By: rohan-varma
fbshipit-source-id: e253ed39ba0783ad73bfd889e9a2e7d0c9214a3a
(cherry picked from commit a9fec3585078881ccd5886ebb27e52b15f7181b1)
Summary:
Implement hierarchical model averaging proposed in https://github.com/pytorch/pytorch/issues/71325.
Unit tests are added. Since I don't have access to 4-GPU machines in the open-source environment, I expect that a branch with the `ci-all` prefix can run the test that requires 4 GPUs.
In the future, the internals of `PeriodicModelAverager` can be simplified into an implementation of a specialized hierarchical model averager whose `period_group_size_dict` has only a single pair of period and world size.
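A hedged usage sketch of hierarchical model averaging as proposed in the RFC; the class and argument names follow the proposal and should be verified against the merged module under `torch.distributed.algorithms.model_averaging`:
```python
from collections import OrderedDict
import torch.distributed.algorithms.model_averaging.hierarchical_model_averager as hma

# Assumes an initialized 16-process group: average within 4-GPU subgroups every
# 2 steps and across the whole 16-GPU world every 8 steps, after a 100-step warmup.
averager = hma.HierarchicalModelAverager(
    period_group_size_dict=OrderedDict([(2, 4), (8, 16)]),
    warmup_steps=100,
)

# Inside the training loop, after optimizer.step():
#     averager.average_parameters(model.parameters())
```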
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73285
Reviewed By: mrshenli
Differential Revision: D34457792
Pulled By: rohan-varma
fbshipit-source-id: 39a6c5bf8a2852b6394a56abbad17b8a909b9fba
(cherry picked from commit 5f543d46103edb515db199dbb80db43c85665f29)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72886
**Test Plan**
Searching for `_schedule_shadow_all_reduce_for_fwd_pass` shows that it is defined but never used.
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D34255651
Pulled By: awgu
fbshipit-source-id: 205a0325c2cdc05e127a183cb86fa2fc2e0db99d
(cherry picked from commit 4492f03a3f)
Summary:
### 🚀 The feature, motivation and pitch
Following the discussion in https://github.com/pytorch/pytorch/issues/65813, I added QR factorization to powerSGD_hook.py.
Gram-Schmidt orthogonalization can't be fully replaced, because _torch.linalg.qr_ doesn't work with half precision. Moreover, in my tests, Gram-Schmidt is faster for ranks lower than 3.
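A condensed sketch of the selection logic described above; it mirrors, but is not, the actual `_orthogonalize` in `powerSGD_hook.py`:
```python
import torch

def gram_schmidt(matrix: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Minimal in-place column-wise Gram-Schmidt orthogonalization.
    for i in range(matrix.shape[1]):
        col = matrix[:, i : i + 1]
        col /= torch.norm(col) + eps
        if i + 1 < matrix.shape[1]:
            rest = matrix[:, i + 1 :]
            rest -= torch.sum(col * rest, dim=0, keepdim=True) * col
    return matrix

def orthogonalize(matrix: torch.Tensor, rank: int) -> torch.Tensor:
    # torch.linalg.qr does not support half precision, and Gram-Schmidt tends to
    # be faster for very low ranks, so only use QR for fp32/fp64 and rank >= 3.
    if rank >= 3 and matrix.dtype in (torch.float32, torch.float64):
        q, _ = torch.linalg.qr(matrix)
        return q
    return gram_schmidt(matrix)
```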
One sample experiment timed powerSGD_hook on ResNeXt101 with the two different methods (timing figure omitted).
### Alternatives
Use _torch.orgqr(*torch.geqrf(matrix))_. From my tests, its performance is similar to _torch.linalg.qr_.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72043
Reviewed By: albanD
Differential Revision: D34042781
Pulled By: cbalioglu
fbshipit-source-id: e331179d3b7ac40d445b651fc473b16ae4ead462
(cherry picked from commit f64bf3839a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72084
Make the fsdp folder public.
ghstack-source-id: 148173447
Test Plan: unit tests
Reviewed By: mrshenli
Differential Revision: D33903417
fbshipit-source-id: 7852a2adc4af09af48a5ffa52ebf210489f834d5
(cherry picked from commit bd06513cfe)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71970
- Provide a default argument for the PowerSGD convenience wrapper that matches the main API default.
Test Plan: CI
Reviewed By: H-Huang
Differential Revision: D33837457
fbshipit-source-id: 8f4efab4992b3fff09456a18db2c83e087c25bdf
(cherry picked from commit 83f52fb3c7)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71620
Remove `from_functional_optim` and make it the default constructor, since
that is the only way `_OptimizerHookState` is now being built. Also, we no longer
need to expose the `create_functional_optim` helper function.
ghstack-source-id: 147577174
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D33700593
fbshipit-source-id: ba089ce3bf66ccf8f71cffdd0f4d4bddc03e8b14
(cherry picked from commit a50b2caf0e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71608
Per title
ghstack-source-id: 147577178
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D33696382
fbshipit-source-id: 5b638d3edf5f03ba476356d61e96ca604de18c8f
(cherry picked from commit 436b547fb0)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71604
Implement two helper functions:
- `as_functional_optim`, which takes in a `torch.optim` class type and arguments and creates the corresponding functional optimizer.
- `create_functional_optim`, which takes in the functional optimizer class type and constructs it. Note that `as_functional_optim` calls into `create_functional_optim`.
The first will be used in future PRs, as described in
https://github.com/pytorch/pytorch/issues/67570, to create a functional
optimizer from a traditional optimizer. The latter is used in
`_OptimizerHookState` to create a functional optimizer.
Both new helper functions are covered by unit tests.
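A minimal sketch of what these two helpers could look like, assuming a mapping from standard optimizer classes to their functional counterparts; the mapping and the stand-in functional optimizer class are illustrative, not the real ones in `torch.distributed.optim`:
```python
import torch

class _IllustrativeFunctionalSGD:
    # Stand-in for a functional optimizer class such as _FunctionalSGD.
    def __init__(self, lr: float = 0.01):
        self.lr = lr

_functional_optim_map = {torch.optim.SGD: _IllustrativeFunctionalSGD}

def create_functional_optim(functional_optim_cls, *args, **kwargs):
    # Construct the functional optimizer directly from its class and arguments.
    return functional_optim_cls(*args, **kwargs)

def as_functional_optim(optim_cls, *args, **kwargs):
    # Map a torch.optim class to its functional counterpart, then construct it.
    try:
        functional_cls = _functional_optim_map[optim_cls]
    except KeyError:
        raise ValueError(f"No functional optimizer registered for {optim_cls}")
    return create_functional_optim(functional_cls, *args, **kwargs)

# Usage: functional_sgd = as_functional_optim(torch.optim.SGD, lr=0.1)
```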
ghstack-source-id: 147577170
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D33688995
fbshipit-source-id: 8b2daafd1b914efa90877cc4313aa9a428546fc1
(cherry picked from commit 42fdae2991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71602
The design in https://github.com/pytorch/pytorch/issues/67570 requires
`_OptimizerHookState` to be created directly from a functional optimizer. Add
support and tests for this. Also refactor a few tests.
ghstack-source-id: 147577175
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D33687477
fbshipit-source-id: f3c789aa77773f918e01a8d0cf08739b2edf07b3
(cherry picked from commit 4851e1c6d4)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71601
Moves the current prototype optimizer overlap to its own file for a better
namespace. No code changes besides a few comment fixes. Note that this code is
still a prototype and not expected to be used by an end user.
ghstack-source-id: 147458826
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D33662678
fbshipit-source-id: 3cc931323230a4b66c02b9e6f744aaf5c48d4d34
(cherry picked from commit 5070595c7f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71621
Moves this feature to beta as discussed, and cleans up some docs.
Synced offline with wayi1, who mentioned that the current names are preferred
as he works to prototype hierarchical allreduce, as discussed in this RFC: https://github.com/pytorch/pytorch/issues/71325.
ghstack-source-id: 147382940
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D33700444
fbshipit-source-id: 8eb543f5b02a119d0790a5c0919e6def6383a067
(cherry picked from commit 656e9809b2)
Summary:
Here 20 is a bad example, since the warmup step count is set to 100; 200 iterations makes much more sense.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70974
Reviewed By: dagitses
Differential Revision: D33474576
Pulled By: rohan-varma
fbshipit-source-id: 4c7043108897848bde9503d77999971ad5567aa6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70165
Implements activation offload support in the checkpoint_wrapper API via
save_on_cpu hooks. We avoid modifying the torch.utils.checkpoint implementation
and instead compose offload + checkpoint, using the save_on_cpu hook for the
former.
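A rough illustration of the composition idea, keeping `torch.utils.checkpoint` untouched and adding offload purely through the `save_on_cpu` saved-tensor hooks; this is a sketch of the approach, not the wrapper's actual code:
```python
import torch
from torch.autograd.graph import save_on_cpu
from torch.utils.checkpoint import checkpoint

def checkpoint_and_offload(module: torch.nn.Module, *inputs):
    # Recompute the forward via checkpointing, and move whatever it does save for
    # backward to pinned CPU memory via the save_on_cpu saved-tensor hooks.
    with save_on_cpu(pin_memory=True):
        return checkpoint(module, *inputs, use_reentrant=True)
```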
ghstack-source-id: 146078900
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D33228820
fbshipit-source-id: 98b4da0828462c41c381689ee07360ad014e808a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70164
Implement Alban's suggestion to make checkpoint_wrapper an nn.Module
instead of patching the forward pass, which is too hacky.
ghstack-source-id: 146011215
Test Plan: IC
Reviewed By: mrshenli
Differential Revision: D33214696
fbshipit-source-id: dc4b3e928d66fbde828ab60d90b314a8048ff7a2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69955
Implements a checkpoint_wrapper function, which wraps an nn.Module with checkpointing so the user won't have to call checkpoint() every time they want to checkpoint the module.
Currently, only support for reentrant-based checkpointing is added, and it is only tested with FSDP to unblock a use case.
Future work is to add support for the new checkpointing API, add more tests, and upstream to torch.utils.checkpoint.
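A minimal sketch of the wrapper-as-`nn.Module` design from the previous entry; it is illustrative, whereas the real `checkpoint_wrapper` lives under `torch.distributed.algorithms._checkpoint` and handles more cases:
```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class _CheckpointWrapper(nn.Module):
    # Holds the original module and checkpoints its forward, so callers never have
    # to invoke torch.utils.checkpoint.checkpoint themselves.
    def __init__(self, module: nn.Module):
        super().__init__()
        self._checkpoint_wrapped_module = module

    def forward(self, *args):
        # Reentrant-based checkpointing, matching the only mode supported here.
        return checkpoint(self._checkpoint_wrapped_module, *args, use_reentrant=True)

def checkpoint_wrapper(module: nn.Module) -> nn.Module:
    return _CheckpointWrapper(module)

# Usage: wrapped = checkpoint_wrapper(nn.Linear(16, 16))
```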
ghstack-source-id: 145811242
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D33107276
fbshipit-source-id: c4a1c68d71d65713a929994940a8750f73fbdbdb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67401
Some minor changes to distributed quantization: mainly change the namespace and add some notes for future code dedup.
ghstack-source-id: 143910067
Test Plan: wait for ci
Reviewed By: mrshenli
Differential Revision: D31979269
fbshipit-source-id: 85a2f395e6a3487dd0b9d1fde886eccab106e289