In pytorch, the optim state_dict will always use number to index optimizer state_dict for parameters.
Now composability workstream need a FQN based way to index optimizer state_dict for parameters..
For example, SGD optimizer might have something in its `state_dict` like:
```
{'state':
{0:
{'momentum_buffer': tensor(...)},
{1:
{'momentum_buffer': tensor(...)},
...
}
'param_groups':
[{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': [0, 1, 2, 3, 4, 5, 6, 7]}]
}
```
And in NamedOptimizer we want the `state_dict` can be:
```
{'state':
{'net1.0.weight':
{'momentum_buffer': tensor(...)},
{'net1.0.bias':
{'momentum_buffer': tensor(...)},
...
}
'param_groups':
[{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': ['net1.0.weight', 'net1.0.bias', 'net2.0.weight', 'net2.0.bias', 'net3.weight', 'net3.bias', 'net4.1.weight', 'net4.1.bias']}]
}
```
We also want to support load_state_dict to enable optim `state_dict` override for NameOptimizer.
For the next couple PR/diffs, we also need to:
1. To make `NamedOptimizer` working with FSDP (like registering a hook for model wrapped with FSDP) and other PTD/PT components.
2. Make `NamedOptimizer` works well with apply_optim_in_backward
3. Upstream also `CombinedOptimizer`.
Differential Revision: [D41432088](https://our.internmc.facebook.com/intern/diff/D41432088/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41432088/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89480
Approved by: https://github.com/rohan-varma
Summary:
Upstreaming this as part of sharing common APIs. This is just a plain
move, any changes needed to support DDP / FSDP will come in follow up diffs.
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D40564646
fbshipit-source-id: 619c434e02196812f8d4db1e40d07290e08b18f9
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88539
Approved by: https://github.com/awgu
Those functions enable membership introspection into a ProcessGroup. A common scenario
that needs this is library code that consumes a PG but doesn't create it, which means
it likely doesn't know the global ranks used to create it.
Translating from local to global is necessary when using c10d collectives like broadcast
so if your library code adopts the convention of using local rank 0, it needs
to the following:
```python
import torch.distributed as dist
my_pg: dist.ProcessGroup = ...
def my_library_bcast(tensor)
dist.broadcast(tensor, src=dist.get_global_rank(my_pg, local_rank=0), my_pg)
```
This implements some of the helpers needed to implement the `clone` API from: https://github.com/pytorch/pytorch/issues/81291
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82134
Approved by: https://github.com/rohan-varma
This is a new version of #15648 based on the latest master branch.
Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.
In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)
Fixes https://github.com/pytorch/pytorch/issues/71105
@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
### Description
Across PyTorch's docstrings, both `callable` and `Callable` for variable types. The Callable should be capitalized as we are referring to the `Callable` type, and not the Python `callable()` function.
### Testing
There shouldn't be any testing required.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82487
Approved by: https://github.com/albanD
Near term fix for https://github.com/pytorch/pytorch/issues/76368.
Q. Why does the user need to request `capturable=True` in the optimizer constructor? Why can't capture safety be completely automatic?
A. We need to set up capture-safe (device-side) state variables before capture. If we don't, and step() internally detects capture is underway, it's too late: the best we could do is create a device state variable and copy the current CPU value into it, which is not something we want baked into the graph.
Q. Ok, why not just do the capture-safe approach with device-side state variables all the time?
A. It incurs several more kernel launches per parameter, which could really add up and regress cpu overhead for ungraphed step()s. If the optimizer won't be captured, we should allow step() to stick with its current cpu-side state handling.
Q. But cuda RNG is a stateful thing that maintains its state on the cpu outside of capture and replay, and we capture it automatically. Why can't we do the same thing here?
A. The graph object can handle RNG generator increments because its capture_begin, capture_end, and replay() methods can see and access generator object. But the graph object has no explicit knowledge of or access to optimizer steps in its capture scope. We could let the user tell the graph object what optimizers will be stepped in its scope, ie something like
```python
graph.will_use_optimizer(opt)
graph.capture_begin()
...
```
but that seems clunkier than an optimizer constructor arg.
I'm open to other ideas, but right now I think constructor arg is necessary and the least bad approach.
Long term, https://github.com/pytorch/pytorch/issues/71274 is a better fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77862
Approved by: https://github.com/ezyang
I find that the original implementation of `post_localSGD_optimizer.step()` is incorrect:
Whenever `averager.average_parameters()` is called, the built-in step counter will be increased. Therefore, this should only be called exactly once per `optimizer.step()`. However, if a model has multiple param groups or params, the current implementation will call `averager.average_parameters()` multiple times and over-increase the step counter.
Relevant proposals since hierarchical SGD can be supported on `post_localSGD_optimizer`: https://github.com/pytorch/pytorch/issues/73382, https://github.com/pytorch/pytorch/issues/71325
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74737
Approved by: https://github.com/mrshenli
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74215
### Overview of API
This PR introduces full optimizer state dict checkpointing.
- This allows users to save the optimizer state for a `torch.nn.Module` (not necessarily a `FullyShardedDataParallel` instance) that contains `FullyShardedDataParallel` instances and later load that optimizer state.
- This supports loading to a module with a different world size, but the `FSDP` wrapping scheme must be the same.
To **save** the optimizer state, run the following (on all ranks):
```
model: torch.nn.Module = ...
optim = torch.optim.Adam(model.parameters(), ...)
# Train for some steps...
full_osd = FSDP.full_optim_state_dict(model, optim) # returns non-empty dict only on rank 0
if rank == 0:
torch.save(full_osd, ...)
```
To **load** the optimizer state, run the following (on all ranks):
```
new_model: torch.nn.Module = ... # may use different world size
full_osd = torch.load(...)
sharded_osd = FSDP.shard_full_optim_state_dict(full_osd, new_model)
optim = torch.optim.Adam(new_model.parameters(), ...)
optim.load_state_dict(sharded_osd)
```
To support **multiple parameter groups**, we require using an additional argument `optim_input`, which is the first argument that the user passes into the optimizer constructor.
```
optim_input = ...
optim = torch.optim.Adam(optim_input, ...)
FSDP.full_optim_state_dict(model, optim, optim_input) # one more argument
...
new_optim_input = ...
new_optim = torch.optim.Adam(new_optim_input, ...)
FSDP.shard_full_optim_state_dict(full_osd, new_model, new_optim_input) # one more argument
```
One caveat is that the user should be careful of generators, which are exhausted after their first use. The `optim_input` passed into the `FSDP` APIs should be refreshed version of the generator if using generators.
### Test Plan
**`full_optim_state_dict()`**
- [x] `full_optim_state_dict()` for a non-`FSDP` root model matches that of an equivalent local model, up to parameter IDs being rearranged, when optimizer input is `model.parameters()`.
- [x] `full_optim_state_dict()` for a non-`FSDP` root model matches that of an equivalent local model, up to parameter IDs being rearranged, when optimizer input is multiple parameter groups (changing parameter order).
**`shard_full_optim_state_dict()`**
- [x] `shard_full_optim_state_dict()` for a non-`FSDP` root model matches the local `optim.state_dict()` of the same model with halved world size, when optimizer input is `model.parameters()`.
- [x] `shard_full_optim_state_dict()` for a non-`FSDP` root model matches the local `optim.state_dict()` of the same model with halved world size, when optimizer input is multiple parameter groups (changing parameter order).
- [x] `shard_full_optim_state_dict()` raises a `ValueError` when changing the `FSDP` wrapping scheme.
On the AWS cluster, the TTS contribution for these tests is ~45 seconds.
### Developer Notes
**Relaxing the Problem**
For optimizer state checkpointing, we have relaxed the problem to **not support changing the `FSDP` wrapping scheme** between save and load time. It is unclear how to solve without this relaxation. This was the least restrictive way to relax the problem since it does not affect most expected use cases. Rather, the expected change between save and load time is the **world size**, which this implementation **does support**.
Even with the relaxation, the `optim_input` argument is necessary to determine the `flat_param_id_to_param` mapping, which is important to know which parameter IDs in the flattened space correspond to `FlatParameter`s that hence need to be unflattened.
**Differences with Local Equivalent**
Suppose `full_osd = full_optim_state_dict()` and `local_osd = state_dict()` for a purely local equivalent. The difference between `full_osd` and `local_osd` is that the parameter IDs of unflattened parameters comprising a single flattened parameter are always consecutive in `full_osd`, while they may be non-consecutive in `local_osd`. Suppose in the following that each layer has 1 parameter `param`:
```
FSDP(model)
layer1
FSDP(layer2)
layer3
```
`layer1.param` and `layer3.param` are flattened and attributed to `model`. `layer2.param` is flattened and attributed to itself.
- In `local_osd`, the parameter IDs would be `0: layer1.param`, `1: layer2.param`, and `2: layer3.param`.
- In `full_osd`, the parameter IDs would be `0: layer1.param`, `1: layer3.param`, and `2: layer2.param`. (Parameter IDs of unflattened parameters sharing a flattened parameter are consecutive.)
The idea is that as long as `full_optim_state_dict()` and `shard_full_optim_state_dict()` are internally consistent, then there is no need to match the local equivalent (assuming no change in `FSDP` wrapping).
### Follow-Ups
**API**
- If needed, we can follow-up this PR by adding an argument `key_by_name: bool = False` to both methods that may be set to `True` to key parameters by `str` names instead of `int` parameter IDs. We still need to investigate if keying by name enables changing the `FSDP` wrapping scheme.
**Refactoring**
- In this optimizer state checkpointing, all optimizer state is saved to CPU on rank 0 (set as `OPTIM_TARGET_RANK`). We should unify and refactor these assumptions with model state checkpointing.
**Testing**
- The code path for unused parameters is not tested. The testing and any needed implementation fixes can be done in a follow-up.
- The code path for non-tensor states (e.g. `Adam` `"step"` as `float` instead of as zero-dimension `FloatTensor`) is not tested. However, it is identical to that of zero-dimension tensor states, so I have some confidence. If needed, I can add tests for it in a follow-up.
- Would I have to write my own optimizer? I do not want to introduce dependencies on third party libraries like Nvidia `apex`.
- We may want to add end-to-end checkpointing tests that include both model state dict and optimizer state dict.
Test Plan: Imported from OSS
Reviewed By: zhaojuanmao
Differential Revision: D35045121
Pulled By: awgu
fbshipit-source-id: 33c650dc960acbd7613d4f444a852b9f76ca4a9b
(cherry picked from commit 2bbc2e344296dc455cf686f3a9b097989504be81)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73842
**Overview**
This cleans up the `ZeroRedundancyOptimizer` tests. I apologize for strong formatting changes mixed in with actually-beneficial changes. It was convenient to unify the formatting while doing a deep comb through the full test file.
The main non-formatting changes include:
- Using `parametrize` instead of manually including `for` loops over possible argument values
- Removing the `DEVICE` global variable, which was used only for the `TestZeroRedundancyOptimizerSingleRank` tests, in favor of consistent usage of `self.device` in both `TestZeroRedundancyOptimizerSingleRank` and `TestZeroRedundancyOptimizerDistributed`
- Moving `assert ... == ...` to `self.assertEqual(..., ...)` when the assert is part of the test's correctness
- Removing the `if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2):` conditional guards in favor of `common_distributed.skip_if_no_gpu` for `TestZeroRedundancyOptimizerDistributed`
- For `TestZeroRedundancyOptimizerDistributed`, `self.device` is `torch.device(self.rank)` if CUDA is available, while `self.world_size` is at least 2, even if `torch.cuda.device_count() == 1`.
- The problematic case is exactly when `torch.cuda.device_count() == 1` but `self.world_size == 2` since then calling `self.device` on rank 1 will error. The existing conditional guard prevented this case for some tests, but it was not used consistently (e.g. `test_multiple_groups()`), which is most likely the reason for the hangs and resulting test flakiness. (From my experience landing the recent ZeRO constructor changes, the Windows environment uses a world size of 2 but only has 1 device available.)
- A more robust solution is to always use the `skip_if_no_gpu` decorator as long as the test uses `self.device` and CUDA is available. This is in line with the recommended SPSD usage of ZeRO.
- Renaming `test_multiple_groups()` to `test_nondefault_process_group()`
- The existing `test_multiple_groups()` was slightly misnamed. Also, it is only nontrivial for a world size of (at least) 4 since it tests using a process group including only even ranks. It was marked as flaky on Windows, and I believe this is because of the world size and `torch.cuda.device_count()` mismatch. Now, the test only uses GPU if there are enough available and falls back to CPU otherwise, which is safe since the test uses Gloo backend.
- There was also a duplicated section, which I was unsure how to non-naively de-duplicate. The top half and bottom half are identical even though they claim to target fitting into the broadcast bucket and not fitting into the broadcast bucket:
1d497114e7/test/distributed/optim/test_zero_redundancy_optimizer.py (L658-L684)
- Changing `_test_zero_model_parallel()` to not use CPU
- This is my own fault, having introduced this inefficiency last summer. It makes more sense to simply designate one of the two GPUs for a process to be its default device rather than routing through CPU.
**Questions**
- How might we limit the runs for `test_ddp_zero_overlap()`? Because it parameterizes over many values, it contributes significantly to the time-to-signal. However, it is an experimental feature, so it is not critical that the tests run every time.
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D34675709
Pulled By: awgu
fbshipit-source-id: 71ce9ac968fb34415cd65206855b4bb5e67754fb
(cherry picked from commit 34e3dd0a184318ea9f63a1ee20cd14b111af3501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166
This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
ghstack-source-id: 149778566
Test Plan: Run the existing unit tests.
Reviewed By: rohan-varma
Differential Revision: D34371226
fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/72578.
**Overview**
Windows CI was failing due to the multi-rank single-GPU case (see [here](https://github.com/pytorch/pytorch/runs/5204906995?check_suite_focus=true)).
To address this, I
- added `common_distributed.skip_if_no_gpu` for `test_multiple_param_groups()` to ensure that each rank can safely call `to(self.device)` -- this targets the expected SPSD use case where each rank has its own GPU;
- moved `test_constructor()` back to `TestZeroRedundancyOptimizerSingleRank` to check that the multiple parameter group method for construction works even on a single rank.
**Test Plan**
- I checked both tests for CPU, 1 GPU, 2 GPUs, 4 GPUs, and 8 GPUs.
- I added the `ciflow/win` label to run the failing Windows CI test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72932
Reviewed By: rohan-varma
Differential Revision: D34281482
Pulled By: awgu
fbshipit-source-id: c4fe604ddd9d2c123c3071249741e6b8a6454b6e
(cherry picked from commit 6bea9bcc63)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72578
**Overview**
This adds `ZeroRedundancyOptimizer` constructor support for multiple parameter groups (i.e. passing an `iterable` of `dict`s instead of an `iterable` of `torch.Tensor` as the `parameters` argument) to mirror the API for non-sharded optimizers.
Fixes https://github.com/pytorch/pytorch/issues/71347 and https://github.com/pytorch/pytorch/issues/59973.
This modifies `test_collect_shards()` to skip if ROCm.
**Test Plan**
I adjusted the existing constructor test, and I added a test for parity between constructing with two parameter groups up front versus constructor with one parameter group and adding the second parameter group after (via `add_param_group()`) versus a non-sharded optimizer.
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D34106940
Pulled By: awgu
fbshipit-source-id: 7e70fc0b3cec891646e0698eaedf02ff4354c128
(cherry picked from commit 40f2d45172)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69980
- Merged `torch/optim/adadelta.py` and `torch/optim/_multitensor/adadelta.py` into `torch/optim/adadelta.py`
- Moved adadelta functional forms from `torch/optim/_functional.py` and `torch/optim/_multi_tensor/_functional.py` to `torch/optim/adadelta.py`
- `torch/optim/_functional.py` just imports from `torch/optim/adadelta.py`
- Added a test `test_optimizers_foreach_flag` which replicates `test_multi_tensor_optimizers` in `test/test_optim.py`
- Add a test `test_adadelta_new` that replicates the behavior of `test_adadelta` but with `foreach` flag instead of using the multitensor adadleta class. If we delete `_multitensor/` we could replace `test_adadelta` with this
Remaining TODO:
- [ ] single_tensor adadelta supports complex but multitensor does not, need to integrate the singletensor logic in multitensor and switch the `test_adadelta_complex` to test for foreach in [True, False]
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin, albanD
Differential Revision: D33413059
Pulled By: mikaylagawarecki
fbshipit-source-id: 92a9fa98705762bb6bd464261671e49aef40070e
(cherry picked from commit a008227d22)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71333
Updated
- Adagrad
- Adamax
- Adam
- AdamW
- RAdam
make multi_tensor functionals take `state_steps: List[Tensor]` instead of taking `states: List[Dict]`
make `state_steps: List[int]s -> state_steps:List[Tensor]` where each is a Singleton tensor so step can be updated within the functional
(NAdam and ASGD) were updated in separate diffs to fold their handling of state into the functionals
Test Plan: Imported from OSS
Reviewed By: anjali411
Differential Revision: D33767872
Pulled By: mikaylagawarecki
fbshipit-source-id: 9baa7cafb6375eab839917df9287c65a437891f2
(cherry picked from commit 831c02b3d0)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71620
Remove from_functional_optim and make it the default constructor since
that is the only way _OptimizerHookState is now being built. Also, no longer
need to expose create_functional_optim helper function
ghstack-source-id: 147577174
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D33700593
fbshipit-source-id: ba089ce3bf66ccf8f71cffdd0f4d4bddc03e8b14
(cherry picked from commit a50b2caf0e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71604
Implement 2 helper functions:
- as_functional_optim which takes in a torch.optim class type and arguments and
creates the corresponding functional optimizer.
- create_functional_optim which takes in the functional optimizer class type
and constructs it. Note that as_functional_optim calls into
create_functional_optim.
The first will be used in future PRs as described in
https://github.com/pytorch/pytorch/issues/67570 to create a functional
optimizer from a traditional optimizer. The latter is used in
_OptimizerHookState to create a functional optimizer.
Both new helper functions are covered by unittests.
ghstack-source-id: 147577170
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D33688995
fbshipit-source-id: 8b2daafd1b914efa90877cc4313aa9a428546fc1
(cherry picked from commit 42fdae2991)