Commit Graph

126 Commits

Author SHA1 Message Date
fduwjj
1a48ae96ba [PT-D][Easy] Reformat the optim code within PTD code base (#90399)
Just run two commands:
```
ufmt format torch/distributed/optim/
ufmt format test/distributed/optim/
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90399
Approved by: https://github.com/awgu
2022-12-08 06:38:59 +00:00
fduwjj
85ae28b454 Reformat optim import (#90294)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90294
Approved by: https://github.com/awgu
2022-12-07 07:11:12 +00:00
Ram Rachum
351d73b97f Fix exception causes all over the codebase (#90271)
This is the continuation of #90134 and hopefully the final PR in this series.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90271
Approved by: https://github.com/kit1980
2022-12-07 04:29:00 +00:00
fduwjj
1abe264ef0 [Upstream _NamedOptimizer] Reland PR (89480) (#90293)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Reland https://github.com/pytorch/pytorch/pull/89480/
* #90294
* __->__ #90293

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90293
Approved by: https://github.com/awgu
2022-12-06 21:47:12 +00:00
PyTorch MergeBot
176b962f4b Revert "[PT-D][Composability][1/N] Upstream NamedOptimizer from TorchRec (KeyedOptimizer in TR) (#89480)"
This reverts commit 31ec1a1ef7.

Reverted https://github.com/pytorch/pytorch/pull/89480 on behalf of https://github.com/kit1980 due to Broke test_correct_module_names
2022-12-06 07:22:37 +00:00
fduwjj
31ec1a1ef7 [PT-D][Composability][1/N] Upstream NamedOptimizer from TorchRec (KeyedOptimizer in TR) (#89480)
In PyTorch, the optimizer state_dict always uses numeric indices to reference parameters.

The composability workstream now needs an FQN-based way to index the optimizer state_dict for parameters.

For example, the SGD optimizer might have something like this in its `state_dict`:

```
{'state':
  {0:
    {'momentum_buffer': tensor(...)},
   1:
    {'momentum_buffer': tensor(...)},
   ...
  },
 'param_groups':
    [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': [0, 1, 2, 3, 4, 5, 6, 7]}]
}
```

And in NamedOptimizer we want the `state_dict` to be:

```
{'state':
  {'net1.0.weight':
    {'momentum_buffer': tensor(...)},
   'net1.0.bias':
    {'momentum_buffer': tensor(...)},
   ...
  },
 'param_groups':
    [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': ['net1.0.weight', 'net1.0.bias', 'net2.0.weight', 'net2.0.bias', 'net3.weight', 'net3.bias', 'net4.1.weight', 'net4.1.bias']}]
}
```
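
To make the FQN-keyed form concrete, here is a minimal, hypothetical sketch of the rekeying idea (the helper below is not the actual `NamedOptimizer` implementation; it assumes the optimizer was constructed from `model.parameters()` so the integer indices follow `model.named_parameters()` order):

```python
import torch

# Hypothetical helper: rekey an optimizer state_dict from integer indices to FQNs.
def rekey_state_dict_by_fqn(model: torch.nn.Module, optim: torch.optim.Optimizer):
    osd = optim.state_dict()
    # Assumes the optimizer was built from model.parameters(), so the integer
    # indices follow the order of model.named_parameters().
    fqns = [name for name, _ in model.named_parameters()]
    osd["state"] = {fqns[idx]: state for idx, state in osd["state"].items()}
    for group in osd["param_groups"]:
        group["params"] = [fqns[idx] for idx in group["params"]]
    return osd

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 2))
optim = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model(torch.randn(8, 4)).sum().backward()
optim.step()
print(rekey_state_dict_by_fqn(model, optim)["state"].keys())
# dict_keys(['0.weight', '0.bias', '1.weight', '1.bias'])
```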

We also want to support `load_state_dict` to enable optimizer `state_dict` overrides for NamedOptimizer.

For the next couple of PRs/diffs, we also need to:
1. Make `NamedOptimizer` work with FSDP (e.g., registering a hook for a model wrapped with FSDP) and other PTD/PT components.
2. Make `NamedOptimizer` work well with apply_optim_in_backward.
3. Also upstream `CombinedOptimizer`.

Differential Revision: [D41432088](https://our.internmc.facebook.com/intern/diff/D41432088/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41432088/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89480
Approved by: https://github.com/rohan-varma
2022-12-06 04:34:19 +00:00
Rohan Varma
404f254e20 Upstream apply_optim_in_backward from TorchRec (#87397) (#88539)
Summary:

Upstreaming this as part of sharing common APIs. This is just a plain
move, any changes needed to support DDP / FSDP will come in follow up diffs.

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D40564646

fbshipit-source-id: 619c434e02196812f8d4db1e40d07290e08b18f9
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88539
Approved by: https://github.com/awgu
2022-11-05 18:28:07 +00:00
Rohan Varma
bd5b4e6504 [Easy] Unused var in functional_adam (#88292)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88292
Approved by: https://github.com/awgu
2022-11-02 16:31:16 +00:00
Masaki Kozuki
5f26df0345 resubmit: "resubmit: [mta] APEX style Fused Adam (#81705) (#85507)" (#85739)
Move the pow implementations at [ATen/native/cuda/PowKernel.cu#L21-L66](849b08f14b/aten/src/ATen/native/cuda/PowKernel.cu (L21-L66)) into a new header file and let FusedAdam use them, hopefully taming MSVC.

cc @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85739
Approved by: https://github.com/ngimel
2022-09-29 16:58:59 +00:00
PyTorch MergeBot
7167996346 Revert "resubmit: [mta] APEX style Fused Adam (#81705) (#85507)"
This reverts commit 4615d1bcfa.

Reverted https://github.com/pytorch/pytorch/pull/85507 on behalf of https://github.com/atalman due to Break internal windows builds
2022-09-27 16:59:35 +00:00
Masaki Kozuki
4615d1bcfa resubmit: [mta] APEX style Fused Adam (#81705) (#85507)
This PR implements an APEX style FusedAdam in PyTorch. This is different from the APEX one in that this is compatible with `torch.cuda.amp.GradScaler` by setting `_step_supports_amp_scaling` to `True` and unscales gradients inside its CUDA kernel.

related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167 possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436
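
A minimal usage sketch (not code from this PR; it assumes a CUDA device and that the fused implementation is selected via the `fused=True` constructor flag):

```python
import torch

model = torch.nn.Linear(32, 32, device="cuda")
# Assumes the fused kernel is selected with fused=True.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    opt.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = model(torch.randn(8, 32, device="cuda")).sum()
    scaler.scale(loss).backward()
    # Because the fused step supports AMP scaling, the scale/inf-check tensors
    # are passed into the kernel and gradients are unscaled there.
    scaler.step(opt)
    scaler.update()
```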

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81705
Approved by: https://github.com/ngimel

cc @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85507
Approved by: https://github.com/ngimel
2022-09-23 18:56:00 +00:00
PyTorch MergeBot
e505360eb8 Revert "[mta] APEX style Fused Adam (#81705)"
This reverts commit 7a6c4d0c50.

Reverted https://github.com/pytorch/pytorch/pull/81705 on behalf of https://github.com/dagitses due to broke internal builds, details to come
2022-09-22 19:37:29 +00:00
Masaki Kozuki
7a6c4d0c50 [mta] APEX style Fused Adam (#81705)
This PR implements an APEX style FusedAdam in PyTorch.
This is different from the APEX one in that this is compatible with `torch.cuda.amp.GradScaler` by setting `_step_supports_amp_scaling` to `True` and unscales gradients inside its CUDA kernel.

related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167
possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436

cc @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81705
Approved by: https://github.com/ngimel
2022-09-20 17:18:33 +00:00
Rodrigo Kumpera
65dc5dd3f3 [c10d] Introduce dist.get_local_rank, dist.get_global_rank and dist.get_global_ranks (#82134)
Those functions enable membership introspection into a ProcessGroup. A common scenario
that needs this is library code that consumes a PG but doesn't create it, which means
it likely doesn't know the global ranks used to create it.

Translating from local to global rank is necessary when using c10d collectives like broadcast,
so if your library code adopts the convention of using local rank 0, it needs
to do the following:

```python
import torch.distributed as dist

my_pg: dist.ProcessGroup = ...

def my_library_bcast(tensor):
    dist.broadcast(tensor, src=dist.get_global_rank(my_pg, local_rank=0), group=my_pg)

```

This implements some of the helpers needed to implement the `clone` API from: https://github.com/pytorch/pytorch/issues/81291
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82134
Approved by: https://github.com/rohan-varma
2022-08-30 17:45:00 +00:00
joncrall
b136f3f310 More doctest refinements. (#83317)
Follow up to #82797

Now that the doctests themselves are in a better state, we should be able to enable xdoctest on the CI so they stay that way.

@ezyang @vadimkantorov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83317
Approved by: https://github.com/ezyang
2022-08-22 20:07:26 +00:00
Rob Zinkov
ff75562cff Adding maximize to rprop (#81864)
Added the maximize flag (#68052) to the Rprop optimizer and updated the respective tests.
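
A small usage sketch of the flag (assuming `maximize` is a plain constructor kwarg, as on other `torch.optim` optimizers):

```python
import torch

param = torch.nn.Parameter(torch.zeros(3))
opt = torch.optim.Rprop([param], lr=0.01, maximize=True)

objective = param.sum()
objective.backward()
opt.step()  # with maximize=True the parameters move to increase the objective
print(param.data)  # all entries are now positive
```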
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81864
Approved by: https://github.com/albanD
2022-08-16 08:19:46 +00:00
joncrall
4618371da5 Integrate xdoctest - Rebased (#82797)
This is a new version of #15648 based on the latest master branch.

Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.

In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)

Fixes https://github.com/pytorch/pytorch/issues/71105

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
2022-08-12 02:08:01 +00:00
ProGamerGov
71d50f4f89 Change docstring type callable to Callable for consistency (#82487)
### Description

Across PyTorch's docstrings, both `callable` and `Callable` are used for variable types. `Callable` should be capitalized, as we are referring to the `Callable` type and not the Python `callable()` function.

### Testing

There shouldn't be any testing required.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82487
Approved by: https://github.com/albanD
2022-08-01 17:26:09 +00:00
Jerome
547e499731 Enable Zero1's ddp_with_overlap for hpu backend (#80438)
Enable ZeRO with the DDP overlap feature, along with a simple interface for inserting a functional optimizer into the map.

Signed-off-by: Jerome <janand@habana.ai>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80438
Approved by: https://github.com/rohan-varma, https://github.com/awgu
2022-07-18 15:05:27 +00:00
anjali411
93912b1a73 Add __all__ to torch.distributed submodules (#80523)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80523
Approved by: https://github.com/rohan-varma
2022-07-11 06:54:24 +00:00
PyTorch MergeBot
0b8a5ca01b Revert "Adding maximize to rprop (#80335)"
This reverts commit 495aa9bc3a.

Reverted https://github.com/pytorch/pytorch/pull/80335 on behalf of https://github.com/albanD due to Broke rocm and windows test
2022-07-08 13:34:02 +00:00
Rob Zinkov
495aa9bc3a Adding maximize to rprop (#80335)
Added the maximize flag (#68052) to the Rprop optimizer and updated the respective tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80335
Approved by: https://github.com/albanD
2022-07-08 08:04:38 +00:00
Rob Zinkov
a1fd5b4273 Adding maximize to RMSprop (#80326)
Added the maximize flag (#68052) to the RMSprop optimizer and updated the respective tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80326
Approved by: https://github.com/albanD
2022-07-08 08:04:26 +00:00
wayi1
f76bb88205 fix docstring of PostLocalSGDOptimizer (#80855)
As title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80855
Approved by: https://github.com/awgu, https://github.com/rohan-varma
2022-07-05 14:58:35 +00:00
Michael Carilli
ba27ee9e8f [CUDA graphs] Allows Adam and AdamW to be capture-safe (#77862)
Near term fix for https://github.com/pytorch/pytorch/issues/76368.

Q. Why does the user need to request `capturable=True` in the optimizer constructor? Why can't capture safety be completely automatic?
A. We need to set up capture-safe (device-side) state variables before capture. If we don't, and step() internally detects capture is underway, it's too late: the best we could do is create a device state variable and copy the current CPU value into it, which is not something we want baked into the graph.

Q. Ok, why not just do the capture-safe approach with device-side state variables all the time?
A. It incurs several more kernel launches per parameter, which could really add up and regress cpu overhead for ungraphed step()s. If the optimizer won't be captured, we should allow step() to stick with its current cpu-side state handling.

Q. But cuda RNG is a stateful thing that maintains its state on the cpu outside of capture and replay, and we capture it automatically. Why can't we do the same thing here?
A. The graph object can handle RNG generator increments because its capture_begin, capture_end, and replay() methods can see and access the generator object. But the graph object has no explicit knowledge of or access to optimizer steps in its capture scope. We could let the user tell the graph object what optimizers will be stepped in its scope, i.e. something like
```python
graph.will_use_optimizer(opt)
graph.capture_begin()
...
```
but that seems clunkier than an optimizer constructor arg.

I'm open to other ideas, but right now I think constructor arg is necessary and the least bad approach.

Long term, https://github.com/pytorch/pytorch/issues/71274 is a better fix.
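
A minimal capture sketch (not from this PR; it assumes a CUDA device and `capturable=True` as a constructor kwarg as described above, and follows the usual whole-iteration graph-capture recipe with warmup on a side stream):

```python
import torch

model = torch.nn.Linear(16, 16, device="cuda")
opt = torch.optim.Adam(model.parameters(), lr=1e-3, capturable=True)
static_input = torch.randn(8, 16, device="cuda")

# Warm up on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        opt.zero_grad(set_to_none=True)
        model(static_input).sum().backward()
        opt.step()
torch.cuda.current_stream().wait_stream(s)

# Capture one full forward/backward/step iteration.
g = torch.cuda.CUDAGraph()
opt.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_loss = model(static_input).sum()
    static_loss.backward()
    opt.step()

# Replay with new data copied into the static input.
static_input.copy_(torch.randn(8, 16, device="cuda"))
g.replay()
```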
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77862
Approved by: https://github.com/ezyang
2022-06-13 01:56:47 +00:00
Olga Andreeva
b1ae519df9 Added functionality for post_local SGD (#78988)
Fixes #74556

Added functionality to save and restore the step counter for the model averager.
Added a unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78988
Approved by: https://github.com/rohan-varma, https://github.com/awgu
2022-06-09 17:47:04 +00:00
Rob Zinkov
2a496e2f80 Adding maximize to Adamax (#77409)
Added the maximize flag (#68052) to the Adamax optimizer and updated the respective tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77409
Approved by: https://github.com/albanD
2022-05-16 17:34:44 +00:00
Rob Zinkov
6642e88ad2 Adding maximize flag to Adagrad
This adds maximize to Adagrad (#68052) along with updates to the respective tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75968
Approved by: https://github.com/albanD
2022-04-20 08:29:03 +00:00
Haijunlv
08f3b95857 fix PostLocalSGDOptimizer and ModelAverager average bug
Fixes #74157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74894
Approved by: https://github.com/rohan-varma, https://github.com/wayi1
2022-04-13 11:41:27 +00:00
francescocastelli
58a44523c1 Add maximize flag to Adadelta
Added the maximize flag to Adadelta optimizer (#68052) and adjusted tests to take maximize into account.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75330
Approved by: https://github.com/cpuhrsch
2022-04-08 20:32:35 +00:00
wayi1
189e72babe [Model Averaging] Fix post_localSGD_optimizer
I find that the original implementation of `post_localSGD_optimizer.step()` is incorrect:

Whenever `averager.average_parameters()` is called, the built-in step counter will be increased. Therefore, this should only be called exactly once per `optimizer.step()`. However, if a model has multiple param groups or params, the current implementation will call `averager.average_parameters()` multiple times and over-increase the step counter.

Relevant proposals since hierarchical SGD can be supported on `post_localSGD_optimizer`: https://github.com/pytorch/pytorch/issues/73382, https://github.com/pytorch/pytorch/issues/71325
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74737
Approved by: https://github.com/mrshenli
2022-04-05 21:10:24 +00:00
Andrew Gu
522041a0fd [FSDP] Add full optim state dict (#74215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74215

###  Overview of API
This PR introduces full optimizer state dict checkpointing.
- This allows users to save the optimizer state for a `torch.nn.Module` (not necessarily a `FullyShardedDataParallel` instance) that contains `FullyShardedDataParallel` instances and later load that optimizer state.
- This supports loading to a module with a different world size, but the `FSDP` wrapping scheme must be the same.

To **save** the optimizer state, run the following (on all ranks):
```
model: torch.nn.Module = ...
optim = torch.optim.Adam(model.parameters(), ...)
# Train for some steps...
full_osd = FSDP.full_optim_state_dict(model, optim)  # returns non-empty dict only on rank 0
if rank == 0:
    torch.save(full_osd, ...)
```
To **load** the optimizer state, run the following (on all ranks):
```
new_model: torch.nn.Module = ...  # may use different world size
full_osd = torch.load(...)
sharded_osd = FSDP.shard_full_optim_state_dict(full_osd, new_model)
optim = torch.optim.Adam(new_model.parameters(), ...)
optim.load_state_dict(sharded_osd)
```

To support **multiple parameter groups**, we require using an additional argument `optim_input`, which is the first argument that the user passes into the optimizer constructor.
```
optim_input = ...
optim = torch.optim.Adam(optim_input, ...)
FSDP.full_optim_state_dict(model, optim, optim_input)  # one more argument
...
new_optim_input = ...
new_optim = torch.optim.Adam(new_optim_input, ...)
FSDP.shard_full_optim_state_dict(full_osd, new_model, new_optim_input)  # one more argument
```
One caveat is that the user should be careful of generators, which are exhausted after their first use. The `optim_input` passed into the `FSDP` APIs should be a refreshed version of the generator if generators are used.

### Test Plan
**`full_optim_state_dict()`**
- [x] `full_optim_state_dict()` for a non-`FSDP` root model matches that of an equivalent local model, up to parameter IDs being rearranged, when optimizer input is `model.parameters()`.
- [x] `full_optim_state_dict()` for a non-`FSDP` root model matches that of an equivalent local model, up to parameter IDs being rearranged, when optimizer input is multiple parameter groups (changing parameter order).

**`shard_full_optim_state_dict()`**
- [x] `shard_full_optim_state_dict()` for a non-`FSDP` root model matches the local `optim.state_dict()` of the same model with halved world size, when optimizer input is `model.parameters()`.
- [x] `shard_full_optim_state_dict()` for a non-`FSDP` root model matches the local `optim.state_dict()` of the same model with halved world size, when optimizer input is multiple parameter groups (changing parameter order).
- [x] `shard_full_optim_state_dict()` raises a `ValueError` when changing the `FSDP` wrapping scheme.

On the AWS cluster, the TTS contribution for these tests is ~45 seconds.

###  Developer Notes
**Relaxing the Problem**
For optimizer state checkpointing, we have relaxed the problem to **not support changing the `FSDP` wrapping scheme** between save and load time. It is unclear how to solve without this relaxation. This was the least restrictive way to relax the problem since it does not affect most expected use cases. Rather, the expected change between save and load time is the **world size**, which this implementation **does support**.

Even with the relaxation, the `optim_input` argument is necessary to determine the `flat_param_id_to_param` mapping, which is important to know which parameter IDs in the flattened space correspond to `FlatParameter`s that hence need to be unflattened.

**Differences with Local Equivalent**
Suppose `full_osd = full_optim_state_dict()` and `local_osd = state_dict()` for a purely local equivalent. The difference between `full_osd` and `local_osd` is that the parameter IDs of unflattened parameters comprising a single flattened parameter are always consecutive in `full_osd`, while they may be non-consecutive in `local_osd`. Suppose in the following that each layer has 1 parameter `param`:
```
FSDP(model)
    layer1
    FSDP(layer2)
    layer3
```
`layer1.param` and `layer3.param` are flattened and attributed to `model`. `layer2.param` is flattened and attributed to itself.
- In `local_osd`, the parameter IDs would be `0: layer1.param`, `1: layer2.param`, and `2: layer3.param`.
- In `full_osd`, the parameter IDs would be `0: layer1.param`, `1: layer3.param`, and `2: layer2.param`. (Parameter IDs of unflattened parameters sharing a flattened parameter are consecutive.)

The idea is that as long as `full_optim_state_dict()` and `shard_full_optim_state_dict()` are internally consistent, then there is no need to match the local equivalent (assuming no change in `FSDP` wrapping).

### Follow-Ups
**API**
- If needed, we can follow up this PR by adding an argument `key_by_name: bool = False` to both methods that may be set to `True` to key parameters by `str` names instead of `int` parameter IDs. We still need to investigate if keying by name enables changing the `FSDP` wrapping scheme.

**Refactoring**
- In this optimizer state checkpointing, all optimizer state is saved to CPU on rank 0 (set as `OPTIM_TARGET_RANK`). We should unify and refactor these assumptions with model state checkpointing.

**Testing**
- The code path for unused parameters is not tested. The testing and any needed implementation fixes can be done in a follow-up.
- The code path for non-tensor states (e.g. `Adam` `"step"` as `float` instead of as zero-dimension `FloatTensor`) is not tested. However, it is identical to that of zero-dimension tensor states, so I have some confidence. If needed, I can add tests for it in a follow-up.
    - Would I have to write my own optimizer? I do not want to introduce dependencies on third party libraries like Nvidia `apex`.
- We may want to add end-to-end checkpointing tests that include both model state dict and optimizer state dict.

Test Plan: Imported from OSS

Reviewed By: zhaojuanmao

Differential Revision: D35045121

Pulled By: awgu

fbshipit-source-id: 33c650dc960acbd7613d4f444a852b9f76ca4a9b
(cherry picked from commit 2bbc2e344296dc455cf686f3a9b097989504be81)
2022-03-30 14:15:23 +00:00
Andrew Gu
9012e8d65a [ZeRO][BE] Clean up ZeRO tests (#73842)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73842

**Overview**
This cleans up the `ZeroRedundancyOptimizer` tests. I apologize for strong formatting changes mixed in with actually-beneficial changes. It was convenient to unify the formatting while doing a deep comb through the full test file.

The main non-formatting changes include:
- Using `parametrize` instead of manually including `for` loops over possible argument values (see the sketch after this list)
- Removing the `DEVICE` global variable, which was used only for the `TestZeroRedundancyOptimizerSingleRank` tests, in favor of consistent usage of `self.device` in both `TestZeroRedundancyOptimizerSingleRank` and `TestZeroRedundancyOptimizerDistributed`
- Moving `assert ... == ...` to `self.assertEqual(..., ...)` when the assert is part of the test's correctness
- Removing the `if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2):` conditional guards in favor of `common_distributed.skip_if_no_gpu` for `TestZeroRedundancyOptimizerDistributed`
    - For `TestZeroRedundancyOptimizerDistributed`, `self.device` is `torch.device(self.rank)` if CUDA is available, while `self.world_size` is at least 2, even if `torch.cuda.device_count() == 1`.
    - The problematic case is exactly when `torch.cuda.device_count() == 1` but `self.world_size == 2` since then calling `self.device` on rank 1 will error. The existing conditional guard prevented this case for some tests, but it was not used consistently (e.g. `test_multiple_groups()`), which is most likely the reason for the hangs and resulting test flakiness. (From my experience landing the recent ZeRO constructor changes, the Windows environment uses a world size of 2 but only has 1 device available.)
    - A more robust solution is to always use the `skip_if_no_gpu` decorator as long as the test uses `self.device` and CUDA is available. This is in line with the recommended SPSD usage of ZeRO.
- Renaming `test_multiple_groups()` to `test_nondefault_process_group()`
    - The existing `test_multiple_groups()` was slightly misnamed. Also, it is only nontrivial for a world size of (at least) 4 since it tests using a process group including only even ranks. It was marked as flaky on Windows, and I believe this is because of the world size and `torch.cuda.device_count()` mismatch. Now, the test only uses GPU if there are enough available and falls back to CPU otherwise, which is safe since the test uses Gloo backend.
    - There was also a duplicated section, which I was unsure how to non-naively de-duplicate. The top half and bottom half are identical even though they claim to target fitting into the broadcast bucket and not fitting into the broadcast bucket:
1d497114e7/test/distributed/optim/test_zero_redundancy_optimizer.py (L658-L684)
- Changing `_test_zero_model_parallel()` to not use CPU
    - This is my own fault, having introduced this inefficiency last summer. It makes more sense to simply designate one of the two GPUs for a process to be its default device rather than routing through CPU.
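
A standalone sketch of the `parametrize` pattern referenced above (a minimal example, not taken from the ZeRO test file):

```python
from torch.testing._internal.common_utils import (
    TestCase,
    instantiate_parametrized_tests,
    parametrize,
    run_tests,
)

class TestExample(TestCase):
    @parametrize("overlap", [True, False])
    @parametrize("lr", [0.01, 0.1])
    def test_configs(self, overlap: bool, lr: float):
        # One test instance is generated per (overlap, lr) combination,
        # replacing nested for-loops inside a single test body.
        self.assertIsInstance(overlap, bool)
        self.assertGreater(lr, 0)

instantiate_parametrized_tests(TestExample)

if __name__ == "__main__":
    run_tests()
```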

**Questions**
- How might we limit the runs for `test_ddp_zero_overlap()`? Because it parameterizes over many values, it contributes significantly to the time-to-signal. However, it is an experimental feature, so it is not critical that the tests run every time.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D34675709

Pulled By: awgu

fbshipit-source-id: 71ce9ac968fb34415cd65206855b4bb5e67754fb
(cherry picked from commit 34e3dd0a184318ea9f63a1ee20cd14b111af3501)
2022-03-08 13:15:20 +00:00
Can Balioglu
e1db2f13ce Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166

This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
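
A minimal usage sketch of the new APIs (assuming they are exposed under `torch.distributed`, as the summary suggests):

```python
import torch.distributed as dist

print(dist.get_debug_level())               # current level, e.g. DebugLevel.OFF
dist.set_debug_level(dist.DebugLevel.INFO)  # raise verbosity at runtime
dist.set_debug_level_from_env()             # re-read TORCH_DISTRIBUTED_DEBUG
```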
ghstack-source-id: 149778566

Test Plan: Run the existing unit tests.

Reviewed By: rohan-varma

Differential Revision: D34371226

fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
2022-02-24 02:33:05 +00:00
Andrew Gu
c30659ffcc [ZeRO] (Reland) Add ctor support for multiple param groups (#72932)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/72578.

**Overview**
Windows CI was failing due to the multi-rank single-GPU case (see [here](https://github.com/pytorch/pytorch/runs/5204906995?check_suite_focus=true)).

To address this, I
- added `common_distributed.skip_if_no_gpu` for `test_multiple_param_groups()` to ensure that each rank can safely call `to(self.device)` -- this targets the expected SPSD use case where each rank has its own GPU;
- moved `test_constructor()` back to `TestZeroRedundancyOptimizerSingleRank` to check that the multiple parameter group method for construction works even on a single rank.

**Test Plan**
- I checked both tests for CPU, 1 GPU, 2 GPUs, 4 GPUs, and 8 GPUs.
- I added the `ciflow/win` label to run the failing Windows CI test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72932

Reviewed By: rohan-varma

Differential Revision: D34281482

Pulled By: awgu

fbshipit-source-id: c4fe604ddd9d2c123c3071249741e6b8a6454b6e
(cherry picked from commit 6bea9bcc63)
2022-02-22 16:29:55 +00:00
Nikita Shulga
84cb810b3f Revert D34106940: [ZeRO] Add ctor support for multiple param groups
Test Plan: revert-hammer

Differential Revision:
D34106940 (5dd0732457)

Original commit changeset: 7e70fc0b3cec

Original Phabricator Diff: D34106940 (5dd0732457)

fbshipit-source-id: 08f846c9c02be8756475f4e0b57eb381f10c27bd
(cherry picked from commit 7675497d83)
2022-02-16 03:45:15 +00:00
wayi1
8b08478115 Fix the doc of PostLocalSGDState (#72792)
Summary:
The first arg of the `PostLocalSGDState` ctor, `process_group`, cannot be empty. To simplify the usage here, the example does not even create a subgroup explicitly.

See the example in unit test: 4feef6c970/torch/testing/_internal/distributed/distributed_test.py (L4260)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72792

Reviewed By: samdow

Differential Revision: D34213221

Pulled By: rohan-varma

fbshipit-source-id: 078343f3ee138e175bf835897f190032eb970662
(cherry picked from commit bf90af704f)
2022-02-15 23:47:12 +00:00
Mikayla Gawarecki
2a5aaf1c49 Optim foreach cleanup for AdamW (#70484)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70484

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33767869

Pulled By: mikaylagawarecki

fbshipit-source-id: 2f5273bbfeea3ed502c5d77da4bebe1674243e86
(cherry picked from commit 2dd9b77917)
2022-02-15 18:02:08 +00:00
Mikayla Gawarecki
dff58d519f Optim foreach cleanup for Rprop (#70483)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70483

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33767866

Pulled By: mikaylagawarecki

fbshipit-source-id: ffc5ae68eeea8fa09385862b853b731554b77bcb
(cherry picked from commit 3a0fe29580)
2022-02-15 18:02:08 +00:00
Mikayla Gawarecki
ce3094f5f6 Optim foreach cleanup for Rmsprop (#70482)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70482

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33767862

Pulled By: mikaylagawarecki

fbshipit-source-id: 8e2e9c986d5a3774093a79755940372945f1b3a9
(cherry picked from commit baea537277)
2022-02-15 18:02:08 +00:00
Mikayla Gawarecki
2cb03e926f Optim foreach cleanup for SGD (#70481)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70481

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33767868

Pulled By: mikaylagawarecki

fbshipit-source-id: 89b9227a4ddf99602855973cbc343c58ae3d5328
(cherry picked from commit ffea8ddcfd)
2022-02-15 18:02:08 +00:00
Mikayla Gawarecki
5f9590681d Optim foreach cleanup for Adam (#70295)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70295

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33767870

Pulled By: mikaylagawarecki

fbshipit-source-id: f922f15ecb0307458c8ecee737325c42c4f3ce8b
(cherry picked from commit 66233a8a3e)
2022-02-15 18:02:08 +00:00
Andrew Gu
5dd0732457 [ZeRO] Add ctor support for multiple param groups (#72578)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72578

**Overview**
This adds `ZeroRedundancyOptimizer` constructor support for multiple parameter groups (i.e. passing an `iterable` of `dict`s instead of an `iterable` of `torch.Tensor` as the `parameters` argument) to mirror the API for non-sharded optimizers.

Fixes https://github.com/pytorch/pytorch/issues/71347 and https://github.com/pytorch/pytorch/issues/59973.

This modifies `test_collect_shards()` to skip if ROCm.
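
A minimal sketch of the new constructor form (not taken from the PR; it assumes the default process group has already been initialized via `dist.init_process_group`):

```python
import torch
from torch.distributed.optim import ZeroRedundancyOptimizer

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 2))
param_groups = [
    {"params": model[0].parameters(), "lr": 1e-2},
    {"params": model[1].parameters(), "lr": 1e-3},
]
# Parameter-group dicts are now accepted, mirroring non-sharded optimizers.
opt = ZeroRedundancyOptimizer(param_groups, optimizer_class=torch.optim.Adam)
```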

**Test Plan**
I adjusted the existing constructor test, and I added a test for parity between constructing with two parameter groups up front, versus constructing with one parameter group and adding the second afterward (via `add_param_group()`), versus a non-sharded optimizer.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D34106940

Pulled By: awgu

fbshipit-source-id: 7e70fc0b3cec891646e0698eaedf02ff4354c128
(cherry picked from commit 40f2d45172)
2022-02-15 16:51:30 +00:00
Yuxin Wu
1ed4653e89 Stop writing logs to root logger (#72649)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/72648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72649

Reviewed By: soulitzer

Differential Revision: D34172113

Pulled By: mrshenli

fbshipit-source-id: 98cb4140b978a0d9fa53876e427ea3b8bbe884cf
(cherry picked from commit c14297cee6)
2022-02-11 21:30:53 +00:00
Mikayla Gawarecki
d9acfef831 Optim foreach cleanup for Adamax (#69982)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69982

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33767865

Pulled By: mikaylagawarecki

fbshipit-source-id: c5efd351e359825d38b71f57a2c61a2055c3c114
(cherry picked from commit 37bb80c2d7)
2022-02-09 16:52:13 +00:00
Mikayla Gawarecki
dabfea8363 Optim foreach cleanup for Adagrad (#69981)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69981

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33767863

Pulled By: mikaylagawarecki

fbshipit-source-id: 1c99abe4ac4eb2a9eb896dff4837b539b94f68e7
(cherry picked from commit 61c28d0645)
2022-02-09 16:52:12 +00:00
Mikayla Gawarecki
8e8d170674 Optim foreach cleanup for Adadelta (#69980)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69980

- Merged `torch/optim/adadelta.py` and `torch/optim/_multi_tensor/adadelta.py` into `torch/optim/adadelta.py`
- Moved adadelta functional forms from `torch/optim/_functional.py` and `torch/optim/_multi_tensor/_functional.py` to `torch/optim/adadelta.py`
- `torch/optim/_functional.py` just imports from `torch/optim/adadelta.py`
- Added a test `test_optimizers_foreach_flag` which replicates `test_multi_tensor_optimizers` in `test/test_optim.py`
- Added a test `test_adadelta_new` that replicates the behavior of `test_adadelta` but with the `foreach` flag instead of using the multi-tensor adadelta class. If we delete `_multi_tensor/` we could replace `test_adadelta` with this (see the usage sketch of the flag below)

Remaining TODO:

- [ ] single-tensor adadelta supports complex but multi-tensor does not; we need to integrate the single-tensor logic into the multi-tensor path and switch `test_adadelta_complex` to test for foreach in [True, False]
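
A short usage sketch of the `foreach` flag introduced by this consolidation (a sketch, assuming the flag is a plain constructor kwarg selecting the multi-tensor implementation):

```python
import torch

params = [torch.nn.Parameter(torch.randn(4)) for _ in range(3)]
opt = torch.optim.Adadelta(params, lr=1.0, foreach=True)

loss = sum((p ** 2).sum() for p in params)
loss.backward()
opt.step()  # runs the multi-tensor (foreach) code path
```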

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin, albanD

Differential Revision: D33413059

Pulled By: mikaylagawarecki

fbshipit-source-id: 92a9fa98705762bb6bd464261671e49aef40070e
(cherry picked from commit a008227d22)
2022-02-09 16:52:12 +00:00
Mikayla Gawarecki
7176c92687 [optim] update step in functional and pass state_steps instead of state (#71333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71333

Updated
- Adagrad
- Adamax
- Adam
- AdamW
- RAdam
Make the multi_tensor functionals take `state_steps: List[Tensor]` instead of `states: List[Dict]`.
Change `state_steps: List[int]` -> `state_steps: List[Tensor]`, where each entry is a singleton tensor so the step can be updated within the functional.

(NAdam and ASGD were updated in separate diffs to fold their handling of state into the functionals.)
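
A minimal sketch of the pattern (a hypothetical functional, not the actual `torch.optim._functional` signatures): step counts arrive as singleton tensors so the functional itself can bump them in place.

```python
from typing import List
import torch

def sketch_functional_sgd(
    params: List[torch.Tensor],
    grads: List[torch.Tensor],
    state_steps: List[torch.Tensor],
    *,
    lr: float,
) -> None:
    for param, grad, step_t in zip(params, grads, state_steps):
        step_t.add_(1)               # step is updated inside the functional
        param.add_(grad, alpha=-lr)  # plain SGD update for illustration

params = [torch.zeros(3)]
grads = [torch.ones(3)]
state_steps = [torch.tensor(0.0)]
sketch_functional_sgd(params, grads, state_steps, lr=0.1)
print(state_steps[0].item())  # 1.0
```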

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33767872

Pulled By: mikaylagawarecki

fbshipit-source-id: 9baa7cafb6375eab839917df9287c65a437891f2
(cherry picked from commit 831c02b3d0)
2022-02-08 16:51:19 +00:00
Rohan Varma
bdcdf94bdd [Opt Overlap] Clean up code in _OptimizerHookState (#71620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71620

Remove from_functional_optim and make it the default constructor, since that is the
only way _OptimizerHookState is now being built. Also, we no longer need to expose
the create_functional_optim helper function.
ghstack-source-id: 147577174

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33700593

fbshipit-source-id: ba089ce3bf66ccf8f71cffdd0f4d4bddc03e8b14
(cherry picked from commit a50b2caf0e)
2022-01-26 19:33:49 +00:00
Rohan Varma
f5a71ec2d6 [Opt Overlap] Implement as_functional_optim and create_functional_optim (#71604)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71604

Implement 2 helper functions:
- as_functional_optim which takes in a torch.optim class type and arguments and
  creates the corresponding functional optimizer.
- create_functional_optim which takes in the functional optimizer class type
  and constructs it. Note that as_functional_optim calls into
  create_functional_optim.

  The first will be used in future PRs as described in
  https://github.com/pytorch/pytorch/issues/67570 to create a functional
  optimizer from a traditional optimizer. The latter is used in
  _OptimizerHookState to create a functional optimizer.

  Both new helper functions are covered by unittests.
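
A hypothetical, self-contained sketch of the two helpers (names taken from the summary; the stand-in functional class and the map are assumptions, not the real torch.distributed.optim code):

```python
from typing import Dict, List, Type
import torch

class _SketchFunctionalSGD:
    # Stand-in for a functional optimizer: no param_groups, explicit gradients.
    def __init__(self, params: List[torch.Tensor], lr: float = 0.01):
        self.params, self.lr = params, lr

    def step(self, gradients: List[torch.Tensor]) -> None:
        for p, g in zip(self.params, gradients):
            p.add_(g, alpha=-self.lr)

_FUNCTIONAL_OPTIM_MAP: Dict[Type, Type] = {torch.optim.SGD: _SketchFunctionalSGD}

def create_functional_optim(functional_cls: Type, *args, **kwargs):
    # Construct the functional optimizer directly from its class type.
    return functional_cls(*args, **kwargs)

def as_functional_optim(optim_cls: Type[torch.optim.Optimizer], *args, **kwargs):
    # Translate a torch.optim class into its functional counterpart,
    # then delegate construction to create_functional_optim.
    functional_cls = _FUNCTIONAL_OPTIM_MAP[optim_cls]
    return create_functional_optim(functional_cls, *args, **kwargs)

func_sgd = as_functional_optim(torch.optim.SGD, [torch.zeros(3)], lr=0.1)
func_sgd.step([torch.ones(3)])
```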
ghstack-source-id: 147577170

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33688995

fbshipit-source-id: 8b2daafd1b914efa90877cc4313aa9a428546fc1
(cherry picked from commit 42fdae2991)
2022-01-25 18:32:13 +00:00