Commit Graph

66 Commits

Author SHA1 Message Date
Jason Lu
bc88028e8e Back out "Reland "Make adding buffers more like adding parameters (#104069)" (#106224)" (#106743)
Summary:
Original commit changeset: 81319beb97f3

Original Phabricator Diff: D47961182

Test Plan: revert to maintain backward compat with legacy ads_dper3 production package. Read details in: S357822

Reviewed By: atuljangra

Differential Revision: D48131623

@diff-train-skip-merge
(D48131623 landed internally)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106743
Approved by: https://github.com/malfet
2023-08-08 15:27:34 +00:00
Mikayla Gawarecki
d8e5f2aa6d Reland "Make adding buffers more like adding parameters (#104069)" (#106224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106224
Approved by: https://github.com/atalman, https://github.com/albanD
2023-07-31 17:18:56 +00:00
Rohan Varma
5d4e170d58 [Optim in backward] API to retrieve in-backward optimizers (#105991)
API to retrieve in-backward optimizers for checkpointing purposes.

Differential Revision: [D47782225](https://our.internmc.facebook.com/intern/diff/D47782225/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105991
Approved by: https://github.com/awgu
2023-07-29 01:36:25 +00:00
Andrey Talman
c6653b65d8 Back out "Make adding buffers more like adding parameters (#104069)" (#105581)
Summary:
D47537831 is breaking pyper tests: https://fb.workplace.com/groups/802176577445480/posts/1018902842439518/

with `TypeError: register_buffer() takes 3 positional arguments but 4 were given`

Original commit changeset: d4b4069fbd38

Original Phabricator Diff: D47537831

Test Plan:
```
buck2 run //caffe2/torch/fb/training_toolkit/integration_tests/training_lifecycle/cogwheel_tests/pyper_release_v2:cogwheel_smallworld_inline_cvr_infer_pyper_pyper__canary_offline_training-launcher -- --run-harness-in-tupperware --build-fbpkg ads_dper3 --build-fbpkg training_platform
```

Reviewed By: atalman

Differential Revision: D47600140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105581
Approved by: https://github.com/mikaylagawarecki
2023-07-20 03:39:53 +00:00
ekamiti
32d422f335 Make adding buffers more like adding parameters (#104069)
Add semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new `Buffer` class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same, as the `register_buffer` method has not been changed. The `persistent` parameter on the `Buffer` type indicates whether the buffer should be persistent or not. The other non-test changes get the new `Buffer` type recognized by inductor and dynamo. The remaining changes are test changes to make sure that the `Buffer` type can be used as a drop-in replacement for `register_buffer`, as it just leads to `register_buffer` being called. Normal tensors can still be used as buffers, so these changes are intended to be backwards compatible.
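As an illustrative sketch of the intended usage (assuming the new class is exposed as `nn.Buffer`, which may differ from where it ultimately lands):

```python
import torch
import torch.nn as nn

class Normalize(nn.Module):
    def __init__(self):
        super().__init__()
        # Assigning a Buffer in __init__ registers it on the module,
        # mirroring how nn.Parameter assignment works.
        self.running_mean = nn.Buffer(torch.zeros(16))
        # persistent=False keeps the buffer out of the state_dict.
        self.scratch = nn.Buffer(torch.empty(16), persistent=False)
        # The existing API is unchanged and stays interchangeable.
        self.register_buffer("running_var", torch.ones(16))
```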

Fixes #35735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104069
Approved by: https://github.com/mikaylagawarecki
2023-07-17 17:59:05 +00:00
Justin Chu
01abbfbaae [BE] Fix all B022 useless-contextlib-suppress (#100335)
When no arguments are passed to `contextlib.suppress`, no exceptions will be suppressed, so the context manager is redundant.
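For context, a minimal sketch of the pattern B022 flags (illustrative, not taken from the diff):

```python
import contextlib

def read_config(path: str) -> str:
    with open(path) as f:
        return f.read()

# Flagged by B022: suppress() with no arguments suppresses nothing,
# so the context manager is a redundant wrapper.
with contextlib.suppress():
    pass

# Intended usage: name the exception types that should be swallowed.
with contextlib.suppress(FileNotFoundError):
    read_config("/nonexistent/path")
```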

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100335
Approved by: https://github.com/Skylion007
2023-04-30 18:47:40 +00:00
Edward Z. Yang
5a7aad9681 Convert logging f-strings to use % format, part four (#98705)
This does multi-line concatenated string literals.
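An illustrative before/after for the multi-line concatenated case (not taken from the diff):

```python
import logging

log = logging.getLogger(__name__)
step, elapsed, budget = 7, 1.84, 1.50

# Before: the f-string is formatted eagerly, even if the message is filtered out.
log.warning(
    f"step {step} took {elapsed:.2f}s, "
    f"exceeding the {budget:.2f}s budget"
)

# After: %-style placeholders let the logging framework defer formatting.
log.warning(
    "step %d took %.2fs, "
    "exceeding the %.2fs budget",
    step, elapsed, budget,
)
```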

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98705
Approved by: https://github.com/voznesenskym
2023-04-11 13:17:59 +00:00
Rohan Varma
8e6287264d [Optim in backward] register_hook=False API (#95096)
Use this API to avoid registering hooks for applications that do their
own custom logic. This eliminates the need for DDP to de-register these
hooks.
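A hedged sketch of how the flag might be used (the private entry point and its exact signature are assumptions based on this PR's description):

```python
import torch
from torch.distributed.optim import _apply_optimizer_in_backward

model = torch.nn.Linear(8, 4)

# register_hook=False attaches the in-backward optimizers without installing
# the gradient hooks, so a framework such as DDP can drive the updates itself.
_apply_optimizer_in_backward(
    optimizer_class=torch.optim.SGD,
    params=model.parameters(),
    optimizer_kwargs={"lr": 0.01},
    register_hook=False,
)
```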

Differential Revision: [D43383794](https://our.internmc.facebook.com/intern/diff/D43383794/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D43383794/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95096
Approved by: https://github.com/zhaojuanmao
2023-03-15 14:33:13 +00:00
Sergii Dymchenko
35bf5bac26 Fix "sandcastle_skip_if decorator name is confusing" (#95649)
Fixes https://github.com/pytorch/pytorch/issues/89473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95649
Approved by: https://github.com/atalman, https://github.com/malfet
2023-03-03 09:29:40 +00:00
fduwjj
a88bfc60c7 [2/N][ST deprecate][BE] Remove Replicate Tensor convert from DDP and PTD (#95450)
No use was found for this ST/ReplicatedTensor-based DDP. As part of the ShardedTensor migration, let's remove this logic, undoing everything in https://github.com/pytorch/pytorch/pull/75753.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95450
Approved by: https://github.com/wanchaol
2023-02-26 03:03:37 +00:00
Xuehai Pan
046e88a291 [BE] [3/3] Rewrite super() calls in test (#94592)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592
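For reference, the rewrite converts the legacy two-argument form to the zero-argument form (an illustrative sketch, not taken from the diff):

```python
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        # Before: Python-2-style explicit arguments.
        # super(MyModule, self).__init__()
        # After: the equivalent zero-argument form.
        super().__init__()
```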

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Some cases that change the semantics should be kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94592
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-12 22:20:53 +00:00
fduwjj
e7ace1ff93 [PT-D][NamedOptimizer][6/N] Upstream init_state from keyed to NamedOptimizer (#93887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93887
Approved by: https://github.com/rohan-varma
2023-02-02 07:14:49 +00:00
Jane Xu
b90496eef5 [nn] zero_grad() set_to_none default True (#92731)
Attempts to fix #92656

BC-breaking! This changes the default of `zero_grad()` in optim and in nn so that gradients are set to None instead of zero tensors. We are changing the default because there are proven perf wins and existing code has typically not regressed due to this change. (This note will probably need to be fleshed out more.)
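A small sketch of the behavior change (illustrative):

```python
import torch

p = torch.nn.Parameter(torch.randn(3))
opt = torch.optim.SGD([p], lr=0.1)

p.sum().backward()
opt.zero_grad()                   # new default: set_to_none=True
assert p.grad is None             # gradients are freed rather than zero-filled

p.sum().backward()
opt.zero_grad(set_to_none=False)  # opt into the old behavior
assert torch.equal(p.grad, torch.zeros(3))
```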

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92731
Approved by: https://github.com/ngimel
2023-01-26 01:04:28 +00:00
fduwjj
368c737603 [PT-D][5/N] Enable add_param_group for named optimizer (#91928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91928
Approved by: https://github.com/rohan-varma
2023-01-18 10:53:31 +00:00
fduwjj
32356aaee6 [4/N] Add test for partial training for NamedOptimizer (#91344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91344
Approved by: https://github.com/rohan-varma
2023-01-09 22:19:49 +00:00
fduwjj
5fabd96f3c [PT-D][3/N] Add FSDP hook with Named Optimizer (#91321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91321
Approved by: https://github.com/fegin
2023-01-06 23:51:33 +00:00
fduwjj
c7e7ea92e2 [NamedOptimizer][2/N] Prepare the enablement of state_dict for FSDP (#91147)
1. Add param_group check logic and a unit test.
2. Remove an unnecessary check for conditional param update.
3. Return the param_group from the inner optimizer so that, when param_group is None or not all params are specified, we still return the expected result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91147
Approved by: https://github.com/fegin
2022-12-20 23:23:04 +00:00
fduwjj
1a48ae96ba [PT-D][Easy] Reformat the optim code within PTD code base (#90399)
Just run two commands:
```
ufmt format torch/distributed/optim/
ufmt format test/distributed/optim/
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90399
Approved by: https://github.com/awgu
2022-12-08 06:38:59 +00:00
fduwjj
1abe264ef0 [Upstream _NamedOptimizer] Reland PR (89480) (#90293)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Reland https://github.com/pytorch/pytorch/pull/89480/
* #90294
* __->__ #90293

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90293
Approved by: https://github.com/awgu
2022-12-06 21:47:12 +00:00
PyTorch MergeBot
176b962f4b Revert "[PT-D][Composability][1/N] Upstream NamedOptimizer from TorchRec (KeyedOptimizer in TR) (#89480)"
This reverts commit 31ec1a1ef7.

Reverted https://github.com/pytorch/pytorch/pull/89480 on behalf of https://github.com/kit1980 due to Broke test_correct_module_names
2022-12-06 07:22:37 +00:00
fduwjj
31ec1a1ef7 [PT-D][Composability][1/N] Upstream NamedOptimizer from TorchRec (KeyedOptimizer in TR) (#89480)
In PyTorch, the optimizer `state_dict` always uses numeric indices to refer to parameters.

The composability workstream now needs an FQN-based way to index the optimizer `state_dict` for parameters.

For example, SGD optimizer might have something in its `state_dict` like:

```
{'state':
    {0: {'momentum_buffer': tensor(...)},
     1: {'momentum_buffer': tensor(...)},
     ...},
 'param_groups':
    [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': [0, 1, 2, 3, 4, 5, 6, 7]}]
}
```

And in NamedOptimizer we want the `state_dict` can be:

```
{'state':
    {'net1.0.weight': {'momentum_buffer': tensor(...)},
     'net1.0.bias': {'momentum_buffer': tensor(...)},
     ...},
 'param_groups':
    [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': ['net1.0.weight', 'net1.0.bias', 'net2.0.weight', 'net2.0.bias', 'net3.weight', 'net3.bias', 'net4.1.weight', 'net4.1.bias']}]
}
```

We also want to support `load_state_dict` to enable optim `state_dict` override for `NamedOptimizer`.
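A hedged sketch of the FQN-keyed usage (the class name follows this PR; the exact constructor signature is an assumption):

```python
import torch
from torch.distributed.optim import _NamedOptimizer  # name assumed from this PR

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU(), torch.nn.Linear(4, 2))
named_opt = _NamedOptimizer(
    dict(model.named_parameters()),  # parameters keyed by FQN
    torch.optim.SGD,
    lr=0.01,
    momentum=0.9,
)

model(torch.randn(2, 4)).sum().backward()
named_opt.step()

sd = named_opt.state_dict()
print(list(sd["state"].keys()))  # e.g. ['0.weight', '0.bias', '2.weight', '2.bias']
named_opt.load_state_dict(sd)    # state_dict override keyed by FQN
```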

For the next couple of PRs/diffs, we also need to:
1. Make `NamedOptimizer` work with FSDP (e.g., registering a hook for a model wrapped with FSDP) and other PTD/PT components.
2. Make `NamedOptimizer` work well with `apply_optim_in_backward`.
3. Also upstream `CombinedOptimizer`.

Differential Revision: [D41432088](https://our.internmc.facebook.com/intern/diff/D41432088/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41432088/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89480
Approved by: https://github.com/rohan-varma
2022-12-06 04:34:19 +00:00
Rohan Varma
404f254e20 Upstream apply_optim_in_backward from TorchRec (#87397) (#88539)
Summary:

Upstreaming this as part of sharing common APIs. This is just a plain
move; any changes needed to support DDP / FSDP will come in follow-up diffs.

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D40564646

fbshipit-source-id: 619c434e02196812f8d4db1e40d07290e08b18f9
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88539
Approved by: https://github.com/awgu
2022-11-05 18:28:07 +00:00
Michael Carilli
ba27ee9e8f [CUDA graphs] Allows Adam and AdamW to be capture-safe (#77862)
Near term fix for https://github.com/pytorch/pytorch/issues/76368.

Q. Why does the user need to request `capturable=True` in the optimizer constructor? Why can't capture safety be completely automatic?
A. We need to set up capture-safe (device-side) state variables before capture. If we don't, and step() internally detects capture is underway, it's too late: the best we could do is create a device state variable and copy the current CPU value into it, which is not something we want baked into the graph.

Q. Ok, why not just do the capture-safe approach with device-side state variables all the time?
A. It incurs several more kernel launches per parameter, which could really add up and regress cpu overhead for ungraphed step()s. If the optimizer won't be captured, we should allow step() to stick with its current cpu-side state handling.

Q. But cuda RNG is a stateful thing that maintains its state on the cpu outside of capture and replay, and we capture it automatically. Why can't we do the same thing here?
A. The graph object can handle RNG generator increments because its capture_begin, capture_end, and replay() methods can see and access the generator object. But the graph object has no explicit knowledge of or access to optimizer steps in its capture scope. We could let the user tell the graph object which optimizers will be stepped in its scope, i.e., something like
```python
graph.will_use_optimizer(opt)
graph.capture_begin()
...
```
but that seems clunkier than an optimizer constructor arg.

I'm open to other ideas, but right now I think constructor arg is necessary and the least bad approach.

Long term, https://github.com/pytorch/pytorch/issues/71274 is a better fix.
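A minimal sketch of the intended pattern, assuming a CUDA device and following the documented whole-network capture recipe (illustrative only):

```python
import torch

model = torch.nn.Linear(64, 64).cuda()
opt = torch.optim.Adam(model.parameters(), lr=1e-3, capturable=True)
static_input = torch.randn(8, 64, device="cuda")

# Warm up on a side stream so workspaces and optimizer state exist before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        opt.zero_grad(set_to_none=True)
        model(static_input).sum().backward()
        opt.step()
torch.cuda.current_stream().wait_stream(s)

# Capture one full forward/backward/step iteration.
g = torch.cuda.CUDAGraph()
opt.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_loss = model(static_input).sum()
    static_loss.backward()
    opt.step()

# Each training step then copies new data into the static input and replays.
static_input.copy_(torch.randn(8, 64, device="cuda"))
g.replay()
```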
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77862
Approved by: https://github.com/ezyang
2022-06-13 01:56:47 +00:00
pritam
3a38f175dd Convert DDP parameters to ReplicatedTensor during forward pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75753

As per the design in https://github.com/pytorch/pytorch/issues/72138,
convert DDP parameters to ReplicatedTensor during its forward pass. Concretely,
this is done as follows:

1) Create a separate `_replicated_tensor_module` which is a copy of self.module
without creating copies of the Tensors themselves.
2) Use `_replicated_tensor_module` instead of `self.module` during the forward
pass.
3) Have a context manager `_ddp_replicated_tensor` to enable this, since
certain edge cases can fail where self.module is changed out of band resulting
in discrepancy between self.module and `_replicated_tensor_module`.

Differential Revision: [D35533736](https://our.internmc.facebook.com/intern/diff/D35533736/)

Approved by: https://github.com/wanchaol, https://github.com/rohan-varma
2022-04-18 03:27:23 +00:00
Andrew Gu
9012e8d65a [ZeRO][BE] Clean up ZeRO tests (#73842)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73842

**Overview**
This cleans up the `ZeroRedundancyOptimizer` tests. I apologize for strong formatting changes mixed in with actually-beneficial changes. It was convenient to unify the formatting while doing a deep comb through the full test file.

The main non-formatting changes include:
- Using `parametrize` instead of manually including `for` loops over possible argument values
- Removing the `DEVICE` global variable, which was used only for the `TestZeroRedundancyOptimizerSingleRank` tests, in favor of consistent usage of `self.device` in both `TestZeroRedundancyOptimizerSingleRank` and `TestZeroRedundancyOptimizerDistributed`
- Moving `assert ... == ...` to `self.assertEqual(..., ...)` when the assert is part of the test's correctness
- Removing the `if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2):` conditional guards in favor of `common_distributed.skip_if_no_gpu` for `TestZeroRedundancyOptimizerDistributed`
    - For `TestZeroRedundancyOptimizerDistributed`, `self.device` is `torch.device(self.rank)` if CUDA is available, while `self.world_size` is at least 2, even if `torch.cuda.device_count() == 1`.
    - The problematic case is exactly when `torch.cuda.device_count() == 1` but `self.world_size == 2` since then calling `self.device` on rank 1 will error. The existing conditional guard prevented this case for some tests, but it was not used consistently (e.g. `test_multiple_groups()`), which is most likely the reason for the hangs and resulting test flakiness. (From my experience landing the recent ZeRO constructor changes, the Windows environment uses a world size of 2 but only has 1 device available.)
    - A more robust solution is to always use the `skip_if_no_gpu` decorator as long as the test uses `self.device` and CUDA is available. This is in line with the recommended SPSD usage of ZeRO.
- Renaming `test_multiple_groups()` to `test_nondefault_process_group()`
    - The existing `test_multiple_groups()` was slightly misnamed. Also, it is only nontrivial for a world size of (at least) 4 since it tests using a process group including only even ranks. It was marked as flaky on Windows, and I believe this is because of the world size and `torch.cuda.device_count()` mismatch. Now, the test only uses GPU if there are enough available and falls back to CPU otherwise, which is safe since the test uses Gloo backend.
    - There was also a duplicated section, which I was unsure how to non-naively de-duplicate. The top half and bottom half are identical even though they claim to target fitting into the broadcast bucket and not fitting into the broadcast bucket:
1d497114e7/test/distributed/optim/test_zero_redundancy_optimizer.py (L658-L684)
- Changing `_test_zero_model_parallel()` to not use CPU
    - This is my own fault, having introduced this inefficiency last summer. It makes more sense to simply designate one of the two GPUs for a process to be its default device rather than routing through CPU.

**Questions**
- How might we limit the runs for `test_ddp_zero_overlap()`? Because it parameterizes over many values, it contributes significantly to the time-to-signal. However, it is an experimental feature, so it is not critical that the tests run every time.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D34675709

Pulled By: awgu

fbshipit-source-id: 71ce9ac968fb34415cd65206855b4bb5e67754fb
(cherry picked from commit 34e3dd0a184318ea9f63a1ee20cd14b111af3501)
2022-03-08 13:15:20 +00:00
Andrew Gu
c30659ffcc [ZeRO] (Reland) Add ctor support for multiple param groups (#72932)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/72578.

**Overview**
Windows CI was failing due to the multi-rank single-GPU case (see [here](https://github.com/pytorch/pytorch/runs/5204906995?check_suite_focus=true)).

To address this, I
- added `common_distributed.skip_if_no_gpu` for `test_multiple_param_groups()` to ensure that each rank can safely call `to(self.device)` -- this targets the expected SPSD use case where each rank has its own GPU;
- moved `test_constructor()` back to `TestZeroRedundancyOptimizerSingleRank` to check that the multiple parameter group method for construction works even on a single rank.

**Test Plan**
- I checked both tests for CPU, 1 GPU, 2 GPUs, 4 GPUs, and 8 GPUs.
- I added the `ciflow/win` label to run the failing Windows CI test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72932

Reviewed By: rohan-varma

Differential Revision: D34281482

Pulled By: awgu

fbshipit-source-id: c4fe604ddd9d2c123c3071249741e6b8a6454b6e
(cherry picked from commit 6bea9bcc63)
2022-02-22 16:29:55 +00:00
Nikita Shulga
84cb810b3f Revert D34106940: [ZeRO] Add ctor support for multiple param groups
Test Plan: revert-hammer

Differential Revision:
D34106940 (5dd0732457)

Original commit changeset: 7e70fc0b3cec

Original Phabricator Diff: D34106940 (5dd0732457)

fbshipit-source-id: 08f846c9c02be8756475f4e0b57eb381f10c27bd
(cherry picked from commit 7675497d83)
2022-02-16 03:45:15 +00:00
Andrew Gu
5dd0732457 [ZeRO] Add ctor support for multiple param groups (#72578)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72578

**Overview**
This adds `ZeroRedundancyOptimizer` constructor support for multiple parameter groups (i.e. passing an `iterable` of `dict`s instead of an `iterable` of `torch.Tensor` as the `parameters` argument) to mirror the API for non-sharded optimizers.
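A hedged sketch of the new constructor usage (assumes an initialized process group; illustrative only):

```python
import torch
from torch.distributed.optim import ZeroRedundancyOptimizer

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Linear(16, 4))

# Multiple parameter groups can now be passed up front, mirroring the
# non-sharded optimizer API, instead of only an iterable of tensors.
param_groups = [
    {"params": model[0].parameters(), "lr": 1e-2},
    {"params": model[1].parameters(), "lr": 1e-3, "weight_decay": 1e-4},
]
optimizer = ZeroRedundancyOptimizer(
    param_groups,
    optimizer_class=torch.optim.SGD,
    momentum=0.9,
)
```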

Fixes https://github.com/pytorch/pytorch/issues/71347 and https://github.com/pytorch/pytorch/issues/59973.

This modifies `test_collect_shards()` to skip if ROCm.

**Test Plan**
I adjusted the existing constructor test, and I added a test for parity between constructing with two parameter groups up front, versus constructing with one parameter group and adding the second afterward (via `add_param_group()`), versus a non-sharded optimizer.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D34106940

Pulled By: awgu

fbshipit-source-id: 7e70fc0b3cec891646e0698eaedf02ff4354c128
(cherry picked from commit 40f2d45172)
2022-02-15 16:51:30 +00:00
Adnios
a9c7d626e1 Add the maximize flag to AdamW (#70146)
Summary:
Related issue: https://github.com/pytorch/pytorch/issues/68052

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70146

Reviewed By: malfet

Differential Revision: D33254561

Pulled By: albanD

fbshipit-source-id: f190c836a4162f936c5953e076747c345df21421
2021-12-23 09:20:29 -08:00
oliver
3d358a7678 Adds a maximize flag to Adam (#68164)
Summary:
Solves the next most important use case in https://github.com/pytorch/pytorch/issues/68052.

I have kept the style as close to that in SGD as seemed reasonable, given the slight differences in their internal implementations.

All feedback welcome!

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68164

Reviewed By: VitalyFedyunin

Differential Revision: D32994129

Pulled By: albanD

fbshipit-source-id: 65c57c3f3dbbd3e3e5338d51def54482503e8850
2021-12-13 05:53:53 -08:00
oliver
f8297d40fc Adds a maximize flag to SGD. (#67847)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46480 -- for SGD.
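For reference, a minimal sketch of what the flag does (illustrative, not from the diff):

```python
import torch

# With maximize=True, SGD ascends the objective instead of descending it.
w = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([w], lr=0.1, maximize=True)

objective = w.sum()   # we want this to grow
objective.backward()
opt.step()
assert w.item() > 0   # the parameter moved uphill
```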

## Notes:
- I have modified the existing tests to take a new `constructor_accepts_maximize` flag. When this is set to `True`, the `_test_basic_cases_template` function will test both maximizing and minimizing the sample function.
- This was the clearest way I could think of testing the changes -- I would appreciate feedback on this strategy.

## Work to be done:
- [ ] I need to update the docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67847

Reviewed By: H-Huang

Differential Revision: D32252631

Pulled By: albanD

fbshipit-source-id: 27915a3cc2d18b7e4d17bfc2d666fe7d2cfdf9a4
2021-11-09 00:43:07 -08:00
Caspar van Leeuwen
a20a64af4e Increased tolerance for test_zero_model_parallel tests (#67765)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67764

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67765

Reviewed By: malfet

Differential Revision: D32171621

Pulled By: mrshenli

fbshipit-source-id: 8c34f4714289cb41824f3a18822a28ed670fa0a6
2021-11-04 15:17:45 -07:00
Jane Xu
34051d74da Add test owner to distributed files starting with test_ (#66797)
Summary:
Action based on https://github.com/pytorch/pytorch/issues/66232

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66797

Reviewed By: gchanan

Differential Revision: D31761389

Pulled By: janeyx99

fbshipit-source-id: c27c9ab4acec1eb71d5edd4538cd113b770dfc6c
2021-10-19 10:55:20 -07:00
Rohan Varma
3bd26792c0 Skip test_multiple_groups on windows (#66154)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66154

Skips as the test is flaky:
https://github.com/pytorch/pytorch/issues/66059
ghstack-source-id: 139763149

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D31403153

fbshipit-source-id: 7f47f17cee148a708346d6d9454c44a194d13a78
2021-10-05 18:33:23 -07:00
Rohan Varma
1c8949c51a [BE] Run Zero test internally (#65519)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65519

Adds buck target so we can run this internally.
ghstack-source-id: 139009957

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D31072784

fbshipit-source-id: 7185cc1e6f9df3d79251eb017270471942a9d7dd
2021-09-25 13:26:50 -07:00
Rohan Varma
f70147b426 [BE] Enable ZeRO test on windows (#65385)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65385

Enables the ZeRO tests to run on Windows. Closes
https://github.com/pytorch/pytorch/issues/63086.

Backend == NCCL was used as a proxy for whether we were running under CUDA, but Windows GPU tests use Gloo. In this case, use Gloo on GPU.

For some reason these tests don't seem to test Gloo on GPU with ZeRO in general (they pick the NCCL backend when a GPU is available), so that behavior is kept for now.
ghstack-source-id: 139003920

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D31071181

fbshipit-source-id: 45a76309ac5e882f5aa6c4b130118a68800754bb
2021-09-25 13:25:40 -07:00
Andrew Gu
2d75703c6a Remove req to call step() in training loop (#63164)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63164

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30284616

Pulled By: andwgu

fbshipit-source-id: afdb677fb08851b139178a9f6d782196f26773e1
2021-08-13 08:22:44 -07:00
Andrew Gu
28f9e108b1 Pass _allow_empty_param_list into func opt ctor (#63163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63163

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30284615

Pulled By: andwgu

fbshipit-source-id: 4857f5b618ec5b007648737ab532ce605e5d70dc
2021-08-13 08:22:42 -07:00
Andrew Gu
bd81c9178a Simplify data structures, add uniform approximation, fix mem leak (#63162)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63162

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30284617

Pulled By: andwgu

fbshipit-source-id: 9bd9e5f89abcc0d3dac56b85d55cc88e843baa9f
2021-08-13 08:20:59 -07:00
Rohan Varma
eea52b7d47 Skip zero test on windows (#63087)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63087

Test failed on Windows unexpectedly; see
https://github.com/pytorch/pytorch/issues/63086. Skipping for now while we
investigate.
ghstack-source-id: 135631811

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D30251300

fbshipit-source-id: 8acb1ea8863c654c171fe989ac24446c321c085d
2021-08-12 00:38:42 -07:00
Rohan Varma
39ec1da935 [reland] Gate DistributedOptimizers on RPC availability (#62937)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62937

Reland due to a Windows + CUDA failure; fixed by running it on Gloo on Windows even with CUDA.
ghstack-source-id: 135306176

Test Plan: ci

Reviewed By: mrshenli

Differential Revision: D30177734

fbshipit-source-id: 7625746984c8f858648c1b3632394b98bd4518d2
2021-08-09 14:41:06 -07:00
Andrew Gu
1b1f1e36b4 Add `allow_empty_param_list` to functional optimizers (#62522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62522

Addresses https://github.com/pytorch/pytorch/issues/62481

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D30072074

Pulled By: andwgu

fbshipit-source-id: 1a5da21f9636b8d74a6b00c0f029427f0edff0e3
2021-08-09 11:18:56 -07:00
Andrew Gu
62a90c227f Make _Join, _Joinable, _JoinHook public (#62605)
Summary:
**Overview:**
This removes the preceding `_` from `_Join`, `_Joinable`, and `_JoinHook` in preparation for adding the generic join context manager tutorial (see [here](https://github.com/pytorch/tutorials/pull/1610)). This also adds a docs page, which can be linked from the tutorial. [Here](https://github.com/pytorch/pytorch/files/6919475/render.pdf) is a render of the docs page.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62605

Test Plan:
`DistributedDataParallel.join()`:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```

`ZeroRedundancyOptimizer`:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
NOTE: DDP overlap tests are failing due to a landing race. See https://github.com/pytorch/pytorch/pull/62592. Once the fix is landed, I will rebase, and tests should be passing.

`Join`:
```
gpurun4 python test/distributed/algorithms/test_join.py
```

Reviewed By: mrshenli

Differential Revision: D30055544

Pulled By: andwgu

fbshipit-source-id: a5ce1f1d9f1904de3bdd4edd0b31b0a612d87026
2021-08-03 12:20:11 -07:00
Andrew Gu
51f687fd4b Add overlap with DDP to ZeRO (two approaches) (#62157)
Summary:
**Overview:**
This adds two approaches to overlapping `DistributedDataParallel.backward()` with `ZeroRedundancyOptimizer.step()` by providing two hook constructors: `hook_with_zero_step()` and `hook_with_zero_step_interleaved()`. The former waits for all backward computation to finish before starting optimizer computation, while the latter launches a partial optimizer computation using the contents of a gradient bucket once that bucket's all-reduce completes. The two approaches each suffer from their own weaknesses, and which one to use depends on the specific hardware configuration.

Both approaches can share changes to `ZeroRedundancyOptimizer`. A user should pass `overlap_with_ddp=True` to `ZeroRedundancyOptimizer`, construct a DDP communication hook using either `hook_with_zero_step()` or `hook_with_zero_step_interleaved()`, and register that communication hook. `ZeroRedundancyOptimizer.step()` should still be called in the training loop, though the optimizer computation and communication will be offloaded to originate from the communication hook. Currently, the first two iterations are vacuous, meaning they do not result in parameter updates and the inputs are ignored. This is required to finalize the DDP bucket strategy and to then initialize the `ZeroRedundancyOptimizer`'s local optimizer based on that bucketing.
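A hedged sketch of wiring this up (import paths and exact signatures are assumptions based on this description; requires an initialized NCCL process group):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.distributed.algorithms.ddp_comm_hooks.default_hooks import allreduce_hook
from torch.distributed.algorithms.ddp_comm_hooks.ddp_zero_hook import hook_with_zero_step

model = DDP(torch.nn.Linear(32, 32).cuda())
zero = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.SGD,
    lr=0.01,
    overlap_with_ddp=True,
)
# Offload the optimizer computation/communication into DDP's communication hook.
model.register_comm_hook(None, hook_with_zero_step(allreduce_hook, model, zero))
# zero.step() is still called in the training loop; per the description above,
# the first two iterations are vacuous while bucketing is finalized.
```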

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62157

Test Plan:
The existing `ZeroRedundancyOptimizer` tests pass, and new unit tests for both hooks pass:
- ~~`test_ddp_with_zero_step_parity_cpu`~~ (removed for now due to flakiness in CI -- under investigation, could possibly be similar Gloo issue as with `hook_with_zero_step_interleaved()`)
- `test_ddp_with_zero_step_parity_gpu`
- `test_ddp_with_zero_step_interleaved_parity_gpu`

These were tested on the AI AWS cluster.

An analogous `test_ddp_with_zero_step_interleaved_parity_cpu` is missing due to existing bugs with Gloo. See https://github.com/pytorch/pytorch/pull/62302.

Both approaches have been verified using an internal accuracy benchmark.

Reviewed By: mrshenli

Differential Revision: D29971046

Pulled By: andwgu

fbshipit-source-id: a7234c23c7ea253f144a698fd7e3c0fe039de5e8
2021-08-02 08:33:34 -07:00
Yu Guo
a50a389ca6 Revert D29701479: [pytorch][PR] Remove _broadcast_object() from ZeroRedundancyOptimizer
Test Plan: revert-hammer

Differential Revision:
D29701479 (9b5d9b4049)

Original commit changeset: c8d5f9057b32

fbshipit-source-id: 35ab1f399513fb9d1c4e73b1fa906e559d2a6994
2021-07-15 10:03:08 -07:00
Andrew Gu
9b5d9b4049 Remove _broadcast_object() from ZeroRedundancyOptimizer (#61539)
Summary:
Revised version of https://github.com/pytorch/pytorch/issues/60573.

**Overview:**
This makes two changes:
- It introduces a `map_location` argument to `broadcast_object_list()`. The argument specifies the device onto which to load tensors contained in objects received from the broadcast. This change requires modifying the implementation of `_object_to_tensor()` and `_tensor_to_object()` to use `torch.save()` and `torch.load()` respectively.
- It removes all calls to `_broadcast_object()` in `ZeroRedundancyOptimizer` and the corresponding test file in favor of `broadcast_object_list()`.

The default value of `map_location` is `None`, in which case `_object_to_tensor()` and hence `broadcast_object_list()` preserve their original behavior. Namely, contained tensors are loaded to their original device.

In `consolidate_state_dict()`, I specify `map_location=torch.device("cpu")` instead of `self._default_device`. This slightly changes the behavior from before when using `_broadcast_object()`. The reason I do so is that it saves one GPU to CPU data transfer since the action immediately after receiving the broadcasted `local_state_dict` is to copy it to CPU.

Explicitly, if `map_location=self._default_device`, then the data transfer path assuming NCCL backend is as follows:
`source GPU --[before serialize]--> source CPU --[before broadcast]--> source GPU --[broadcast]--> destination GPU --[before deserialize]--> destination CPU --[deserialize]--> destination GPU --[copy]--> destination CPU`
Hence, by setting `map_location=torch.device("cpu")` instead, the suffix becomes:
`destination CPU --[deserialize]--> destination CPU --[copy]--> destination CPU`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61539

Test Plan:
I added a test `test_broadcast_object_list_map_location()` that checks for both `map_location` as CPU and GPU that (1) tensors contained in broadcasted objects are appropriately loaded onto the specified device and (2) that the contents of the tensors are correct.

The existing `ZeroRedundancyOptimizer` tests pass.
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```

The existing `broadcast_object_list()` test passes:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_broadcast_object_list
```

Reviewed By: zou3519

Differential Revision: D29701479

Pulled By: andwgu

fbshipit-source-id: c8d5f9057b32e5e9f40e8edc5b2cc25fb21414a9
2021-07-14 17:36:30 -07:00
Andrew Gu
57feb35474 Refactor non-joined process computation (#61555)
Summary:
**Overview:**
This refactors the computation on non-joined processes relating to the join context manager. The concept was inspired by a comment from pritamdamania.

**Changes:**
This introduces a `_Joinable` abstract base class, which requires a `_join_hook()` method and `_join_device()` and `_join_process_group()` property methods. Any class that we want to be compatible with the generic join context manager should inherit from `_Joinable` and implement `_join_hook()`, `_join_device()`, and `_join_process_group()`. (The `device` and `process_group` information has been moved from `_JoinHook` to `_Joinable`.)

The generic join context manager now takes in a `List[_Joinable]` instead of `List[_JoinHook]`. The motivation for this is that previously, by passing the `_JoinHook`s into the context manager, the class providing a `_JoinHook` can modify the context manager's behavior, but the context manager cannot modify the class's behavior. This is solved by giving the context manager a reference to the class's instance.

This implementation reserves the field `_join_config` in every `_Joinable` to store a `_JoinConfig` instance, which holds all dynamic fields needed from the `_Joinable` for the join context manager: `enable`, `throw_on_early_termination`, and `is_first_joinable`. ("dynamic" here means that for a given `_Joinable` instance, the values for those fields may change across different join context usages.) In particular, these fields are needed to implement a method `notify_join_context()`, which encapsulates the computation performed on non-joined processes relating to the join context manager --- (1) the all-reduce to indicate that the process has not yet joined and (2) the all-reduce to check whether to throw an exception if `throw_on_early_termination=True`. The idea is that every `_Joinable` class only needs to make a call to `notify_join_context()` before its per-iteration collective communications; it is a simple one-line addition.

Only the first `_Joinable` instance passed into the context manager actually performs the collective communications in `notify_join_context()`. In that case, the method returns an async work handle for the initial all-reduce indicating that the process not yet joined. Otherwise, the method returns `None`. This conditional logic is handled internally without additional input from the user.

**New API:**
Now, the example usage would look like:
```
ddp_model = DistributedDataParallel(...)
zero_optim = ZeroRedundancyOptimizer(ddp_model.parameters(), ...)
with _Join([ddp_model, zero_optim]):
    ...
```
Any arguments meant for a join hook (e.g. `divide_by_initial_world_size`) must be specified as keyword arguments. For example:
```
with _Join([ddp_model, zero_optim], divide_by_initial_world_size=False):
    ...
```
They will be forwarded to every `_join_hook()` function via `**kwargs`. This creates a clear separation between the variables needed by the context manager (`enable` and `throw_on_early_termination`) and those needed by the `_Joinable` class (e.g. `divide_by_initial_world_size`).

**Recap:**
After this change, the relevant information to use the generic join context manager looks like the following (omitting prefix `_` from names):
- Suppose we have a class `C` (e.g. `DistributedDataParallel`) that we want to be able to use the `Join` context.
- We make `C` inherit from `Joinable` and implement `join_hook() -> JoinHook`, `join_device()`, and `join_process_group()`.
- To implement `join_hook()`, we define a `CJoinHook` class inheriting from `JoinHook` and implement `main_hook()` and `post_hook()` as needed.
- We locate a place before `C`'s per-iteration collective communications and add a call to `Join.notify_join_context()`.
- We call `Joinable.__init__(self)` in `C`'s constructor.
- The `C.join_config` field will be used internally by the context manager. This does not affect `C`'s serializability.
- Run time arguments for `C`'s join hook can be passed in as keyword arguments to the context manager: `with Join([C()], arg1=..., arg2=...):`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61555

Test Plan:
I ran the existing DDP join tests:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```
I ran the ZeRO join tests:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py TestZeroRedundancyOptimizerDistributed.test_zero_join_gpu TestZeroRedundancyOptimizerDistributed.test_zero_join_cpu
```

Reviewed By: zou3519

Differential Revision: D29690359

Pulled By: andwgu

fbshipit-source-id: 2950f78de755eb5fb13b95b803dd7c705879a9c7
2021-07-14 08:20:40 -07:00
Andrew Gu
4f4beb8286 Add Model Parallel Support to ZeRO (#61370)
Summary:
**Overview:**
The existing `ZeroRedundancyOptimizer` implementation assumes that all model parameters are stored on the same device (due to the recent [refactor](https://github.com/pytorch/pytorch/pull/59834)). This change allows model parameters to be sharded across multiple devices, as in the DDP with Model Parallelism example [here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).

The only logic affected is the bucketing strategy used when `parameters_as_bucket_view=True`. Let `n` denote the world size and `k` denote the number of devices per process.
- Previously, `k = 1`, and `self._buckets` was a `List[torch.Tensor]`, where `self._buckets[j]` is a tensor (i.e. bucket) containing the parameters assigned to rank `j` for `j = 0, ..., n - 1`.
- Now, `self._buckets` is a `List[List[torch.Tensor]]`, where `self._buckets[i][j]` is a tensor containing the parameters stored on device `i` assigned to rank `j` for `i = 0, ..., k - 1` and `j = 0, ..., n - 1`.

This bucket construction uses an auxiliary data structure `self._device_to_per_rank_params`, which is a `Dict[torch.device, List[List[torch.Tensor]]]`. It maps:
- `dev_0` to `[rank 0's assigned parameters on dev_0, rank 1's assigned parameters on dev_0, ...]`,
- `...`
- `dev_{k-1}` to `[rank 0's assigned parameters on dev_{k-1}, rank 1's assigned parameters on dev_{k-1}, ...]`

I removed the invariant checker `_verify_same_param_device()` and its corresponding test since it is no longer an invariant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61370

Test Plan: I added a new test `test_zero_model_parallel()` that checks for parity between a DDP model with model parallelism using `ZeroRedundancyOptimizer` and a local model with the same architecture using a local optimizer. I also verified that the existing tests still pass.

Reviewed By: soulitzer

Differential Revision: D29637132

Pulled By: andwgu

fbshipit-source-id: 07112959fa4e94a3f40e67e88cbb58ce3cd1e033
2021-07-09 14:27:47 -07:00
Andrew Gu
179249084b Refactor DDP join() API, adding hooks (#60757)
Summary:
Targets https://github.com/pytorch/pytorch/issues/54318.

**Overview:**
DDP offers a `join()` context manager to accommodate training on uneven inputs. This creates a new generic `_Join()` API permitting custom hooks, refactors DDP `join()` to call this generic `_Join()`, and implements a hook for ZeRO. (For now, the generic `_Join()` is implemented as private, but this may change after design discussions are cleared.)

There are two classes introduced: `_JoinHook`, the class defining the customizable join hook, and `_Join`, the generic join context manager.

The `_JoinHook` provides two entry points: `main_hook()`, which is called repeatedly while there exists a non-joined process, and `post_hook()`, which is called once all processes have joined, with the additional `bool` argument `is_last_joiner`. The class also requires `process_group` and `device` information by defining corresponding abstract property methods. Thus, to implement a join hook, (1) inherit from `_JoinHook`, (2) override `main_hook()` and `post_hook()` as appropriate, and (3) override `process_group()` and `device()` to provide process group and device information to be used by the join context manager implementation for collective communications.

The `_Join` constructor requires `join_hooks: List[_JoinHook]` and optionally `enable: bool = True` and `throw_on_early_termination: bool = False`. A training loop only needs to be wrapped with `with _Join(join_hooks):` (using the appropriate `join_hooks`) to be able to train on uneven inputs without hanging/erroring. The context manager requires a `dist.all_reduce(torch.ones(1))` to be called on every non-joined process each time before it performs its collective communications in order to indicate that the process has not yet joined. It also requires that all `process_group` attributes in the `_JoinHook` objects are the same.

**Notes:**
- The argument `is_last_joiner` to `post_hook()` may be useful for finding an authoritative rank when synchronizing.
- `enable` is a flag that can be set to `False` if the user knows the current training loop will not have uneven inputs. This may be used to disable join-related computation in  the classes providing join hooks.
- `throw_on_early_termination` is a flag that can be set to `True` to notify processes to terminate upon detecting uneven inputs (i.e. upon the first process joining when there exists a non-joined process). Notably, the notification requires an all-reduce, so to prevent hanging/erroring, non-joined processes must participate in the all-reduce. The first-joining process raises a `RuntimeError`, and the other processes are expected (but not required) to do the same. This may be used to implement training on uneven inputs in cases that do not conform to the generic join context manager (e.g. `SyncBatchNorm`).
- Classes providing a join hook should do so via a `_join_hook()` method that returns a `_JoinHook` instance with the methods appropriately overridden.
- If there are multiple join hooks, the device specified by the first is used by the join context manager implementation to perform its collective communications.
- If there are multiple join hooks, both the main and post-hooks are iterated in the order in which the `_JoinHook` objects are passed into the context manager constructor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60757

Test Plan:
The current implementation preserves backward compatibility by not changing the existing DDP `join()` API at all. To check this, I ran through the uneven input tests (`test_ddp_grad_div_uneven_inputs`, `test_ddp_uneven_inputs_stop_iteration_sync_bn`, `test_ddp_uneven_inputs`, `test_ddp_uneven_input_join_disable`, `test_ddp_uneven_input_exception`) on the AI AWS cluster:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py --
```

Because the existing DDP join logic does not provide correct gradients to the joined processes if `gradient_as_bucket_view=False` and a joined process requires those gradients to correctly update its shard of the parameters in `ZeroRedundancyOptimizer.step()`, DDP and ZeRO are not fully compatible at the moment. To work around this and to test ZeRO's join hook separately, I added a test `_test_zero_join()` (with `test_zero_join_gpu()` and `test_zero_join_cpu()` flavors), which compares DDP with a local optimizer on uneven inputs against ZeRO on uneven inputs with the gradients set manually.

Reviewed By: iramazanli, mrshenli

Differential Revision: D29624636

Pulled By: andwgu

fbshipit-source-id: ec70a290e02518b0d8b683f9fed2126705b896c7
2021-07-09 08:29:20 -07:00
Andrew Gu
f0e4e4be72 Clean Up ZeRO (#60285)
Summary:
**Overview:**
Being relatively new to PyTorch and ZeRO, I found parts of the code slightly hard to follow. This change strives to clean up the `ZeroRedundancyOptimizer` code in `zero_redundancy_optimizer.py` by reorganizing some computations, making variable names more explicit and consistent, and unifying terminology in the documentation. The goal is for the code to be easier to extend afterwards.

**Changes:**
1) `state_dict()`: The [logic](85517a2b70/torch/distributed/optim/zero_redundancy_optimizer.py (L510)) for updating the global `state_dict` with each rank's local `state_dict` is simplified and made more explicit. Notably, the `dict` [`local_index_to_param_id`](85517a2b70/torch/distributed/optim/zero_redundancy_optimizer.py (L513)) is unneeded. It maps `local_pg["params"][i]` to `id(global_pg["params"][i])`, so it is equivalent to make a single pass over both lists in tandem, effectively iterating over `i`, without a need for the explicit `dict`.
2) `_update_trainable()`: The function [initializes](85517a2b70/torch/distributed/optim/zero_redundancy_optimizer.py (L597)) the local optimizer if it does not exist. I am unaware of any reason for the local optimizer to be destroyed after initialization, so I moved that logic to its own function `_init_local_optimizer()`, which is called once in the constructor.
After [discussion](https://github.com/pytorch/pytorch/pull/60285#discussion_r654706728), I removed the function `_update_trainable()` itself in favor of adding a check for `parameters_as_bucket_view` in `build_param_buckets()` directly.
3) `rank_local_state_dict()`: This [function](85517a2b70/torch/distributed/optim/zero_redundancy_optimizer.py (L528)) is currently broken. It appears to be legacy and relies on the input `state_dict` to have the key `"partitions"`. For now, I have removed it and added an [issue](https://github.com/pytorch/pytorch/issues/60284). Is it a notable use case to want to access another rank's `state_dict` in particular (as opposed to consolidating the entire state and then accessing)?
4) `local_state_dict():` After [discussion](https://github.com/pytorch/pytorch/pull/60285#discussion_r655571043), I removed the function.
5) `partition_parameters()`: After [discussion](https://github.com/pytorch/pytorch/pull/60285#discussion_r654708183), I renamed the function to `_partition_parameters()` to mark it as private.
6) `_param_to_index`: After [discussion](https://github.com/pytorch/pytorch/pull/60285#discussion_r654828100), I changed the key to be the parameter itself rather than its integer ID.
7) `buckets`: I renamed the data structure to `_buckets` to mark it as private.
8) Terminology: I tried to reduce the set of terms being used instead of juggling a number of synonyms. In particular, I made an effort to distinguish between "local" and "global" and to make names more indicative of typing.
9) Style: Per the [PyTorch contributing guide](https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md#writing-documentation), I made all docstrings abide by the 80 character limit, except for the one [line](554891f6fa/torch/distributed/optim/zero_redundancy_optimizer.py (L142)) showing the example ZeRO usage. Some code lines violate the limit for readability. Also, I unified some of the minor stylistic usages out of habit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60285

Test Plan:
The test suite passes as expected (on the AI AWS cluster):
```
gpurun python test/distributed/optim/test_zero_redundancy_optimizer.py
```
I visually inspected the generated HTML doc (as generated following [this](https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md#writing-documentation)).

Reviewed By: mrshenli

Differential Revision: D29320726

Pulled By: andwgu

fbshipit-source-id: 23f69a19ecc5e877a38fe1df0da11329428311dd
2021-06-23 07:21:40 -07:00