Commit Graph

352 Commits

Author SHA1 Message Date
Hubert Lu
cd18b78daa [ROCm] Enable bf16-related tests in test_c10d_nccl.py and test_grad_layout_1devicemodule_1replicaperprocess (#82020)
### Description
Enable bf16-related unit tests in test_c10d_nccl.py and test_grad_layout_1devicemodule_1replicaperprocess as follows:

- distributed/test_c10d_nccl test_bf16_compress_wrapper_is_view (main.DistributedDataParallelTest)
- distributed/test_c10d_nccl test_bf16_compress_wrapper_nccl (main.DistributedDataParallelTest)
- distributed/test_c10d_nccl test_grad_layout_1devicemodule_1replicaperprocess (main.DistributedDataParallelTest)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82020
Approved by: https://github.com/ezyang
2022-08-11 21:16:33 +00:00
Yi Wang
08d54b5cd5 Correct DDP example (#83034)
remove undefined `pg` from DDP example code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83034
Approved by: https://github.com/mrshenli
2022-08-09 18:58:33 +00:00
ProGamerGov
71d50f4f89 Change docstring type callable to Callable for consistency (#82487)
### Description

Across PyTorch's docstrings, both `callable` and `Callable` for variable types. The Callable should be capitalized as we are referring to the `Callable` type, and not the Python `callable()` function.

### Testing

There shouldn't be any testing required.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82487
Approved by: https://github.com/albanD
2022-08-01 17:26:09 +00:00
anjali411
3bcc19b29a Add __all__ to various submodules in torch.fx, distributions, distributed, package (#80367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80367
Approved by: https://github.com/albanD
2022-06-27 21:27:30 +00:00
Rohan Varma
e7cb44b6c4 Guard distributed imports (#77727)
Move distributed import after dist.is_avail check to fix builds with USE_DISTRIBUTED=0. Although, note that this issue is not caught by any CI at the moment.

Closes https://github.com/pytorch/pytorch/issues/77704
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77727
Approved by: https://github.com/malfet
2022-05-18 11:27:52 +00:00
Rohan Varma
6f954d7bbb FSDP parameter sync
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77492

Approved by: https://github.com/zhaojuanmao
2022-05-17 19:58:49 +00:00
Rohan Varma
bbb1f106c7 Separate input moving to utils file
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77187

Test fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77235

Lint fix

Approved by: https://github.com/awgu
2022-05-11 21:55:38 +00:00
Rohan Varma
ffb0946504 Generalize param verification and broadcast
New PR for https://github.com/pytorch/pytorch/pull/75970 to be compatible with GHF.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76374
Approved by: https://github.com/awgu
2022-04-26 22:25:53 +00:00
pritam
b26df43f15 Fix bug where __getstate__ of DDP looks for self._replicated_tensor_module
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76349

When we are not using ReplicatedTensor in DDP and try to save a DDP
module it will error out since it tries to delete the _replicated_tensor_module
attribute.

Fixing this by checking if this mode is enabled before triggering the delete.

Differential Revision: [D35875167](https://our.internmc.facebook.com/intern/diff/D35875167/)

Approved by: https://github.com/mrshenli, https://github.com/zhaojuanmao
2022-04-26 02:49:49 +00:00
pritam
3a38f175dd Convert DDP parameters to ReplicatedTensor during forward pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75753

As per the design in https://github.com/pytorch/pytorch/issues/72138,
convert DDP parameters to ReplicatedTensor during its forward pass. Concretely,
this is done as follows:

1) Create a separate `_replicated_tensor_module` which is a copy of self.module
without creating copies of the Tensors themselves.
2) Use `_replicated_tensor_module` instead of `self.module` during the forward
pass.
3) Have a context manager `_ddp_replicated_tensor` to enable this, since
certain edge cases can fail where self.module is changed out of band resulting
in discrepancy between self.module and `_replicated_tensor_module`.

Differential Revision: [D35533736](https://our.internmc.facebook.com/intern/diff/D35533736/)

Approved by: https://github.com/wanchaol, https://github.com/rohan-varma
2022-04-18 03:27:23 +00:00
Junjie Wang (PyTorch)
0a6ac31797 [PT-D][DDP][BE] Add unit tests for Forward and Backward Hook (#74063)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74063

Address the issue https://github.com/pytorch/pytorch/issues/66229 as part of BE effort.

Basically:
1. We remove the stale comment which confuses users.
2. Add more unit tests to test the forward/backward hook working for DDP.
ghstack-source-id: 151463380

Test Plan: CI

Reviewed By: rohan-varma

Differential Revision: D34800830

fbshipit-source-id: 21133209323b2b5eda0dd6472f6309d4fb779b97
(cherry picked from commit b9b165c8305572128395daffafc13fcac8b85e29)
2022-03-16 23:18:28 +00:00
Shihao Xu
bcd0843bec [torch.distributed][DDP] Disable DDP bucketing for the first iteration (#72843)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72843

# [Debug Story] Training Hanging and DDP Bucketing

**What are the characteristics of the hanging training instance?**

The model uses TorchRec `PooledEmbeddingArch` and corresponding sharding solution.

The model config difference to trigger this hanging issue is turning on position weighted embedding tables.

A feature processor module, `GroupedPositionWeightedModule`, is constructed on all ranks, but `GroupedPositionWeightedModule.foward(...)` is only [called on subset ranks of the whole world](https://fburl.com/code/yqrmtvli).

**What was the initial manifested error?**

The training was stuck in the first iteration.

**What are useful debugging tools this time?**

After turning off [static_graph in DDP](https://fburl.com/code/4io81p5i), we saw there were sparse feature lengths becoming negative values after all-to-all collectives. Hanging becomes fatal failure.

After turning on [torch.distributed DETAIL debugging mode](https://fburl.com/code/cp8e28mm), we saw 2 trainers sent out mismatched collectives, one doing all-to-all, the other doing all-reduce. So we know the negative values comes from all-to-all being matched with all-reduce. the error had happened ahead, which is the wrong timing of either doing all-reduce or all-to-all.

With more added loggings inside of DDP, it turned out the DDP decided to do all-reduce at different timings across different ranks.

**What is DDP bucketing?**

Once a gradient is ready on a rank, DDP uses all-reduce to synchronize the average of this gradient across all ranks.

Say we have 4 tensor ops. A, B, C, D.

In the most naive version, we could do one synchronization when all gradients in the full backward graph are ready.

The time sequence would be,

* D.grad
* C.grad
* B.grad
* A.grad
* All reduce on [D.grad, C.grad, B.grad, A.grad].

But that would be a huge waste of communication channel bandwidth.

With DDP bucketing, we could put ahead some gradient synchronization batch by batch. The above time sequence now becomes,

* D.grad
* C.grad
* All reduce on [D.grad, C.grad].
* B.grad
* A.grad
* All reduce on [B.grad, A.grad].

With gradient computation overlaps with communication, bucketing technique brings better DDP execution performance.

**What exactly went wrong in this case?**

1. The bucketing doesn’t honor backward graph execution order.
2. There are other collectives comm ops in backward graph.
3. There are unused parameters (i.e unused sub-module) in subset ranks of the whole world.

Using the above example again, we have 4 tensor ops. A, B, C, D.

Say we have 2 trainers,

B is the feature processor module.

B only runs on trainer 0 (both forward and backward), but not on trainer1.

C is the All-to-all (Pooled embeddings distribution).

C sends out all-to-all collective in both its forward and backward pass.

Keep assuming all other ops run on both trainers.

trainer_0 op sequence is,

A, B (feature preproc), C (all-to-all), D | D.grad, C.grad (reverse all-to-all), B.grad (feature proc grads), A.grad

trainer_1 op sequence is,

A, C (all-to-all), D | D.grad, C.grad (reverse all-to-all), A.grad

Even though the correct bucketing should be (same bucketing for both ranks),

* bucket_0, [D.grad, C.grad]
* bucket_1, [B.grad, A.grad]

but because of 1), they end up like,

* bucket_0, [B.grad, D.grad]
* bucket_1, [C.grad, A.grad]

Plus 2) and 3), the time sequence could like,

(check mark represents the gradient is ready)

(bucket is ready to do synchronization if all its enclosing gradients are ready)

* trainer_0
   * t0,
      * D.grad
      * bucket_0, [B.grad, D.grad ✓]
   * t1,
      * **C.grad all-to-all**
      * C.grad ✓
      * bucket_1, [C.grad ✓, A.grad]
   * t2
      * B.grad
      * bucket_0, [B.grad ✓, D.grad ✓] ✓
   * t3
      * All-reduce for bucket_0
   * t4
      * A.grad
      * bucket_1, [C.grad ✓, A.grad ✓] ✓
* trainer_1
   * t0,
      * D.grad
      * bucket_0, [B.grad ✓, D.grad ✓] ✓. (Because B is not used on trainer_1, DDP marks its gradient as ready immediately.)
   * t1,
      * **All-reduce for bucket_0**
   * t2
      * C.grad all-to-all
      * bucket_1, [C.grad ✓, A.grad]
   * t3
      * A.grad
      * bucket_1, [C.grad ✓, A.grad ✓] ✓

This is why trainer_0 all-to-all is matched up with trainer_1 all-reduce.

**What is the solution for fixing DDP?**

Disable DDP bucketing for the first iteration. D34051938

This is because after the first iteration, buckets will be built again based on real backward graph execution order.

So the slow gradient synchronization only affects the first iteration.

Test Plan:
buck build mode/dev-nosan caffe2/test/distributed:distributed_gloo_spawn
BACKEND=gloo WORLD_SIZE=3 buck-out/gen/caffe2/test/distributed/distributed_gloo_spawn\#binary.par -r test_ddp_logging_data_cpu

P484179296

buck build mode/dev-nosan caffe2/test/distributed:distributed_nccl_spawn
BACKEND=nccl WORLD_SIZE=2 buck-out/gen/caffe2/test/distributed/distributed_nccl_spawn\#binary.par -r test_ddp_logging_data_cpu -r test_ddp_get_bucket_sizes
P484177200

Reviewed By: zhaojuanmao

Differential Revision: D34051938

fbshipit-source-id: 0c7f35875687095c3199f19990e73a8349b6e5b9
(cherry picked from commit bb8f11306ea51c2bd3ffd3ab001d62ce369a08ee)
2022-03-04 18:29:36 +00:00
Can Balioglu
e1db2f13ce Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166

This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
ghstack-source-id: 149778566

Test Plan: Run the existing unit tests.

Reviewed By: rohan-varma

Differential Revision: D34371226

fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
2022-02-24 02:33:05 +00:00
Andrew Gu
59dd84cab6 [Join][BE] Fix typo; remove obsolete method (#72886)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72886

**Test Plan**
Searching for `_schedule_shadow_all_reduce_for_fwd_pass` shows that it is defined but never used.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D34255651

Pulled By: awgu

fbshipit-source-id: 205a0325c2cdc05e127a183cb86fa2fc2e0db99d
(cherry picked from commit 4492f03a3f)
2022-02-16 15:03:09 +00:00
Yuxin Wu
1ed4653e89 Stop writing logs to root logger (#72649)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/72648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72649

Reviewed By: soulitzer

Differential Revision: D34172113

Pulled By: mrshenli

fbshipit-source-id: 98cb4140b978a0d9fa53876e427ea3b8bbe884cf
(cherry picked from commit c14297cee6)
2022-02-11 21:30:53 +00:00
Rohan Varma
4feef6c970 Log static graph in constructor if it is set (#72456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72456

It is easier to log if static graph is set at construction time now that it is natively supported in DDP constructor, as opposed to waiting for the first iteration to finish. In some failure cases we're seeing the first iteration does not finish and thus we don't have this data which is vaulable to debug.
ghstack-source-id: 148840679

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D34045204

fbshipit-source-id: 72a187c1ce031db217de4b3ad20a64f2a74995bc
(cherry picked from commit 1d622c88f3)
2022-02-11 15:55:09 +00:00
Rohan Varma
37651894f9 [Easy] Small DDP fixes (#72455)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72455

- Improve helper function
- Improve/fix some logging
ghstack-source-id: 148840678

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D34044865

fbshipit-source-id: d2ae820effaaaecdd7155ffa8d3a1d8ebbd9f39e
(cherry picked from commit 3efbea8f41)
2022-02-11 15:55:09 +00:00
Rohan Varma
1c8fcc44cb [Opt Overlap] Support optimizing partial set of parameters (#71608)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71608

Per title
ghstack-source-id: 147577178

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33696382

fbshipit-source-id: 5b638d3edf5f03ba476356d61e96ca604de18c8f
(cherry picked from commit 436b547fb0)
2022-01-26 19:33:49 +00:00
Rohan Varma
d3354602fc [Easy] DDP typo fix (#71607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71607

Per title
ghstack-source-id: 147577177

Test Plan: N/a

Reviewed By: cbalioglu

Differential Revision: D33694038

fbshipit-source-id: 5a5a618f13bc8b91127169efcebb90b5a36474a1
(cherry picked from commit 62f17f116d)
2022-01-26 07:32:04 +00:00
Rohan Varma
10ca760c0a [Opt Overlap] Implement register_fused_optim in DDP (#71606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71606

Per title
ghstack-source-id: 147577172

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33694037

fbshipit-source-id: a148d5ce6031f0cc20f33785cfe2c27d1fc2d682
(cherry picked from commit ace3261e0c)
2022-01-26 07:32:04 +00:00
Yanli Zhao
4b3cf1eaf7 [BE]Clarify how to check memory saving if using gradient_as_bucket_view (#71483)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71483

claify that peak memory saving should be checked after first iteration when using gradient_as_bucket_view
ghstack-source-id: 147271113

Test Plan: unit test

Reviewed By: rohan-varma

Differential Revision: D33662424

fbshipit-source-id: f760da38e166ae85234e526ddf1526269ea25d42
(cherry picked from commit a40dda20da)
2022-01-20 19:38:41 +00:00
Yanli Zhao
1c61d8c43f [PT1.11] make static graph to be stable (#71459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71459

1. add static_graph feature to DDP constructor;
2. still keep _set_static_graph() API, so that existing use cases are not affected, also it can be called internally by DDP constructor
3. four cases are covered:
    static_graph = False, _set_static_graph() is called;
    static_graph = False, _set_static_graph() is not called;
    static_graph = True, _set_static_graph() is not called;
    static_graph = True, _set_static_graph() is called;
ghstack-source-id: 147263797

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D33646738

fbshipit-source-id: 8c1730591152aab91afce7133d2adf1efd723855
(cherry picked from commit dc246a1129)
2022-01-20 19:38:41 +00:00
Rohan Varma
fcd1375b2b [DDP][BE][Docs] Clarify checkpoint support (#68827)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68827

Add a note about current checkpoint support with DDP. Note that this
does not include the features enabled with _set_static_graph yet, as it is an
undocumented private API. Once we support static graph as beta feature in OSS
we can add to the note here.
ghstack-source-id: 144285041

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D32624957

fbshipit-source-id: e21d156a1c4744b6e2a807b5b5289ed26701886f
2021-11-30 12:37:37 -08:00
Santiago Castro
f776f30780 Keep the sequence or mapping type in default_collate (#68779)
Summary:
`default_collate`, `default_convert`, and `pin_memory` convert sequences into lists. I believe they should keep the original type when possible (e.g., I have a class that inherits from `list`, which comes from a 3rd party library that I can't change, and provides extra functionality).

Note it's easy to do when the type supports an iterable in its creation but it's not always the case (e.g., `range`).

Even though this can be accomplished if using a custom `default_collate`/`default_convert`, 1) this is behavior they should support out-of-the-box IMHO, and 2) `pin_memory` still does it.

cc VitalyFedyunin ejguan NivekT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68779

Reviewed By: wenleix

Differential Revision: D32651129

Pulled By: ejguan

fbshipit-source-id: 17c390934bacc0e4ead060469cf15dde815550b4
2021-11-29 13:14:20 -08:00
Yifan Xiong
c7eaec86f0 [NCCL] Patch bfloat16 support (#67843)
Summary:
Patch bfloat16 support in NCCL, PR https://github.com/pytorch/pytorch/issues/63260 adds bfloat16 support but is
still not complete to enable bfloat16 for allreduce in end-to-end training.

This patch does the followings:
* fix minimum NCCL version from 2.9.7 to 2.10, NCCL adds bf16 support in
  v2.10.3-1 (commit 7e51592)
* update bfloat16 datatype flag in `csrc/cuda/nccl.cpp` so that NCCL
  operations like all reduce can use it
* enable unit tests for bfloat16 datatype if possible

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67843

Reviewed By: H-Huang

Differential Revision: D32248132

Pulled By: mrshenli

fbshipit-source-id: 081e96e725af3b933dd65ec157c5ad11c6873525
2021-11-09 13:46:13 -08:00
James Reed
80178d6152 [DDP] Fix some issues with code example in DDP docstring (#67883)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67883

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: zhaojuanmao

Differential Revision: D32190946

Pulled By: jamesr66a

fbshipit-source-id: a376324b95cbe833ffa606ecdfc6156432880f70
2021-11-05 17:32:45 -07:00
Rohan Varma
bff64e84cd [DDP] Track models with sync bn (#66680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66680

Closes https://github.com/pytorch/pytorch/issues/66215. Tracks models
with sync BN so we can find workflows that use them and target for perf
optimization.
ghstack-source-id: 140875182

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D31679477

fbshipit-source-id: 0e68cd1a7aabbc5b26227895c53d33b8e98bfb8e
2021-10-18 22:31:52 -07:00
Rohan Varma
38f5144eae Fix https://github.com/pytorch/pytorch/issues/61982 (#66015)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66015

Fixes https://github.com/pytorch/pytorch/issues/61982 by clone of
tensors in DDPSink. Only applies once for static_graph and generally for unused
params which already has overhead, so perf hit should not be an issue. Will
verify with benchmark.

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D31346633

fbshipit-source-id: 5b9245ade628565cffe01731f6a0dcbb6126029b
2021-10-07 18:11:18 -07:00
Rohan Varma
71704349aa [DDP] Allow await of custom buffer reduction in backward (#64515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64515

For performance reasons, we would like to ensure that we can await
user collectives as part of custom buffer reduction in parallel to other work.
As a result, add support to return futures from custom buffer hooks and await
those futures at end of backwards pass.

Also added some docs to clarify how to use these APIs.
ghstack-source-id: 138793803

Test Plan: I

Reviewed By: zhaojuanmao

Differential Revision: D30757761

fbshipit-source-id: e1a2ead9ca850cb345fbee079cf0614e91bece44
2021-09-23 13:02:53 -07:00
Wanchao Liang
2f67579864 [ddp] use named_params and named_buffers explicitly (#65181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65181

This PR changes `state_dict()` during sync to `named_parameters` and `named_buffers` explicitly. the underlying motivation is that, `state_dict()` doesn't necessarily equals to "params + buffers" for all cases, state_dict is used for checkpoint purpose mainly, and params/buffers are used for training, we might have cases that params/buffers be in different forms with state_dict (i.e. state_dict we might want to save in small pieces of tensors while in training we want to concat the tensors together for performance reasons).
ghstack-source-id: 138701159

Test Plan: wait for ci

Reviewed By: divchenko, rohan-varma

Differential Revision: D31007085

fbshipit-source-id: 4e1c4fbc07110163fb9b09b043ef7b4b75150f18
2021-09-22 17:32:54 -07:00
Rohan Varma
5739f77775 [DDP] Refactor and remove sync_params (#64514)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64514

sync_params is a misnomer since we don't actually synchroniz
parameters. While removing this I realized
`self._check_and_sync_module_buffers` does almost everything we need it to, so
just refactored that and made DDP forward call into it.
ghstack-source-id: 138684982

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D30751231

fbshipit-source-id: add7c684f5c6c71dad9e9597c7759849fa74f47a
2021-09-22 14:12:51 -07:00
Rohan Varma
ce5981e431 [DDP] Custom buffer reduction (#64513)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64513

Proposal: https://github.com/pytorch/pytorch/issues/63041
Support custom buffer reduction in DDP via hook
ghstack-source-id: 138655663

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D30751152

fbshipit-source-id: 257a9d46bb178d8812d4ea5a4d9c6140b8a1791f
2021-09-22 14:11:35 -07:00
Jessica Choi
f24bd43375 Changing type and name of local_used_maps to reflect that it is only one map (#65380)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65380

Fixing bugs that arise when running setup.py develop

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang gcramer23

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D31104844

Pulled By: jaceyca

fbshipit-source-id: acfd4cf316c71177df758ca55b470f51a17f776b
2021-09-22 11:35:33 -07:00
Jessica Choi
158b8bdc8a Cleaning up DDP SPMD in reducer.cpp (#64113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64113

Since there is only one model replica per process, `replicas`
can be simplified from `std::vector<std::vector<at::Tensor>>` to
`std::vector<at::Tensor>` in the Reducer class.

Test Plan:
All tests are passing
`pytest test/distributed/test_c10d_gloo.py -vs`

Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30615965

fbshipit-source-id: d2ec809d99b788c200b01411333e7dbad1269b51
2021-09-21 16:13:18 -07:00
Rohan Varma
45bd0f6181 Back out "Revert D30745960: [DDP] Remove SPMD from self.modules_buffers" (#64778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64778

Original commit changeset: d3f3fb813c45
ghstack-source-id: 138326910

Test Plan: ci

Reviewed By: H-Huang

Differential Revision: D30849443

fbshipit-source-id: 15dab8a959a29d2e2fefac6ad52b8d8168eacc02
2021-09-17 12:28:36 -07:00
Rohan Varma
70f286c1e2 Back out "Revert D30745961: [DDP] Remove self.modules_params" (#64777)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64777

Original commit changeset: 59f7cc50d369
ghstack-source-id: 138326909

Test Plan: ci

Reviewed By: H-Huang

Differential Revision: D30849442

fbshipit-source-id: bb87ba83935374d8a3ebbc29365df1417dd4f26f
2021-09-17 12:28:34 -07:00
Rohan Varma
61dfcbf4bc Back out "Revert D30745921: [DDP] Fix when buffers are reassigned in module" (#64776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64776

Original commit changeset: 343ead86bf1e
ghstack-source-id: 138326914

Test Plan: ci

Reviewed By: H-Huang

Differential Revision: D30849444

fbshipit-source-id: 9a72805416fe7d6c68e51bdcdb88f6e1fecb614d
2021-09-17 12:28:32 -07:00
Howard Huang
459653a0f6 Revert D30745921: [DDP] Fix when buffers are reassigned in module
Test Plan: revert-hammer

Differential Revision:
D30745921 (d59ecc02df)

Original commit changeset: 25eb1edbf445

fbshipit-source-id: 343ead86bf1e2d0b2d4124be331ea2fa437303ad
2021-09-09 08:23:16 -07:00
Howard Huang
5bc53ac5ef Revert D30745961: [DDP] Remove self.modules_params
Test Plan: revert-hammer

Differential Revision:
D30745961 (8c09510294)

Original commit changeset: 32d102502570

fbshipit-source-id: 59f7cc50d369b6cc2856cf4ebd0f58b96202336d
2021-09-09 08:23:14 -07:00
Howard Huang
f1aaf8afcd Revert D30745960: [DDP] Remove SPMD from self.modules_buffers
Test Plan: revert-hammer

Differential Revision:
D30745960 (1553259520)

Original commit changeset: 66a8f9847e9f

fbshipit-source-id: d3f3fb813c45ac1b0ff15c6154b2e99e5dbab433
2021-09-09 08:22:12 -07:00
Rohan Varma
1553259520 [DDP] Remove SPMD from self.modules_buffers (#64474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64474

No need for a nested list here.
ghstack-source-id: 137526312

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D30745960

fbshipit-source-id: 66a8f9847e9fe1e02c51b79647e93bf7665cf4d9
2021-09-08 19:16:15 -07:00
Rohan Varma
8c09510294 [DDP] Remove self.modules_params (#64473)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64473

Unused after SPMD deprecated.
ghstack-source-id: 137526305

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D30745961

fbshipit-source-id: 32d102502570291e01579e5b47a6d74dc71013bb
2021-09-08 19:16:13 -07:00
Rohan Varma
d59ecc02df [DDP] Fix when buffers are reassigned in module (#64472)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64472

Sometimes, user module can reassign tensor buffer, as in:

```
self.buffer = torch.randn(1, 2) # in init
self.buffer += 1 # in forward
```

in this case, `self.modules_buffers` will become outdated and we should
repopulate self.modules_buffers if we need to sync module buffers.

See https://github.com/pytorch/pytorch/issues/63916 for full description of the
issue.
ghstack-source-id: 137526309

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D30745921

fbshipit-source-id: 25eb1edbf445703a481802e07f3058d38ea6fc64
2021-09-08 19:14:55 -07:00
Yinbin Ma
0d437fe6d0 BF16 allreduce hook (#63260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63260

Add BF16 all-reduce communication hook. Skip if CUDA version < 11 or NCCL version < 2.9.7.

Reviewed By: SciPioneer

Differential Revision: D30238317

fbshipit-source-id: bad35bf7d43f10f1c40997a282b831b61ef592bb
2021-08-18 20:53:49 -07:00
Rohan Varma
5fb79f61a8 [DDP] Dont set thread local state in reducer autograd hook. (#62996)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62996

No need to set this because autograd engine already propagates TLS
states.
ghstack-source-id: 135438220

Test Plan: CI

Reviewed By: albanD

Differential Revision: D30202078

fbshipit-source-id: e5e917269a03afd7a6b8e61f28b45cdb71ac3e64
2021-08-10 10:50:16 -07:00
Rohan Varma
3df4870343 [Reland][DDP] Support not all outputs used in loss calculation (#61753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61753

Reland of https://github.com/pytorch/pytorch/pull/57081.
Main difference is that the former diff moved `prepare_for_backward` check into `DDPSink` backward, but that resulted in issues due to potential autograd engine races. The original diff moved `prepare_for_backward` into `DDPSink` as part of a long-term plan to always call it within `DDPSink`.

In particular this doesn't work because `prepare_for_backward` sets `expect_autograd_hooks=true` which enables autograd hooks to fire, but there were several use cases internally where autograd hooks were called before DDPSink called `prepare_for_backward`, resulting in errors/regression.

We instead keep the call to `prepare_for_backward` in the forward pass, but still run outputs through `DDPSink` when find_unused_parameters=True. As a result, outputs that are not used when computing loss have `None` gradients and we don't touch them if they are globally `None`. Note that the hooks still fire with a undefined gradient which is how we avoid the Reducer erroring out with the message that some hooks did not fire.

Added the unittests that were part of the reverted diff.
ghstack-source-id: 135388925

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29726179

fbshipit-source-id: 54c8819e0aa72c61554104723a5b9c936501e719
2021-08-09 22:29:11 -07:00
Rohan Varma
80091cb0f7 [DDP] Allow tuning of first bucket (#62748)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62748

Previously after buckets were rebuilt the first bucket size was always
defaulted to 1MB, this diff allows first bucket to be tuned like the rest of
the bucket sizes can.

Setting `dist._DEFAULT_FIRST_BUCKET_BYTES = 1` results in the following logs as
expected:
I0804 12:31:47.592272 246736 reducer.cpp:1694] 3 buckets rebuilt with size
limits: 1, 1048, 1048 bytes.
ghstack-source-id: 135074696

Test Plan: CI

Reviewed By: SciPioneer, wanchaol

Differential Revision: D30110041

fbshipit-source-id: 96f76bec012de129d1645e7f50e266d4b255ec66
2021-08-05 16:35:07 -07:00
Sean Lawlor
34c9f5a8da [DDP Communication Hook] Update get_tensor and set_tensor to be cleaner naming conventions (buffer() and set_buffer()) (#62662)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662

Replaced the methods set_tensor(.) and get_tensor() in the python exposed API from the C++ logic with buffer() and set_buffer(.) to be a cleaner interface.

Reviewed By: SciPioneer

Differential Revision: D30012869

fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482
2021-08-04 09:27:31 -07:00
Andrew Gu
62a90c227f Make _Join, _Joinable, _JoinHook public (#62605)
Summary:
**Overview:**
This removes the preceding `_` from `_Join`, `_Joinable`, and `_JoinHook` in preparation for adding the generic join context manager tutorial (see [here](https://github.com/pytorch/tutorials/pull/1610)). This also adds a docs page, which can be linked from the tutorial. [Here](https://github.com/pytorch/pytorch/files/6919475/render.pdf) is a render of the docs page.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62605

Test Plan:
`DistributedDataParallel.join()`:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```

`ZeroRedundancyOptimizer`:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
NOTE: DDP overlap tests are failing due to a landing race. See https://github.com/pytorch/pytorch/pull/62592. Once the fix is landed, I will rebase, and tests should be passing.

`Join`:
```
gpurun4 python test/distributed/algorithms/test_join.py
```

Reviewed By: mrshenli

Differential Revision: D30055544

Pulled By: andwgu

fbshipit-source-id: a5ce1f1d9f1904de3bdd4edd0b31b0a612d87026
2021-08-03 12:20:11 -07:00
Rohan Varma
4d5607bb25 [Reland][DDP] log bucket sizes (#62625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62625

reland of https://github.com/pytorch/pytorch/pull/62232 which ran into a land race.

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D30058217

fbshipit-source-id: 1454dd481e630f3de9ec6111b3f2e18cd8976091
2021-08-03 10:55:46 -07:00
Andrew Gu
51f687fd4b Add overlap with DDP to ZeRO (two approaches) (#62157)
Summary:
**Overview:**
This adds two approaches to overlapping `DistributedDataParallel.backward()` with `ZeroRedundancyOptimizer.step()` by providing two hook constructors: `hook_with_zero_step()` and `hook_with_zero_step_interleaved()`. The former waits for all backward computation to finish before starting optimizer computation, while the latter launches a partial optimizer computation using the contents of a gradient bucket once that bucket's all-reduce completes. The two approaches each suffer from their own weaknesses, and which one to use depends on the specific hardware configuration.

Both approaches can share changes to `ZeroRedundancyOptimizer`. A user should pass `overlap_with_ddp=True` to `ZeroRedundancyOptimizer`, construct a DDP communication hook using either `hook_with_zero_step()` or `hook_with_zero_step_interleaved()`, and register that communication hook. `ZeroRedundancyOptimizer.step()` should still be called in the training loop, though the optimizer computation and communication will be offloaded to originate from the communication hook. Currently, the first two iterations are vacuous, meaning they do not result in parameter updates and the inputs are ignored. This is required to finalize the DDP bucket strategy and to then initialize the `ZeroRedundancyOptimizer`'s local optimizer based on that bucketing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62157

Test Plan:
The existing `ZeroRedundancyOptimizer` tests pass, and new unit tests for both hooks pass:
- ~~`test_ddp_with_zero_step_parity_cpu`~~ (removed for now due to flakiness in CI -- under investigation, could possibly be similar Gloo issue as with `hook_with_zero_step_interleaved()`)
- `test_ddp_with_zero_step_parity_gpu`
- `test_ddp_with_zero_step_interleaved_parity_gpu`

These were tested on the AI AWS cluster.

An analogous `test_ddp_with_zero_step_interleaved_parity_cpu` is missing due to existing bugs with Gloo. See https://github.com/pytorch/pytorch/pull/62302.

Both approaches have been verified using an internal accuracy benchmark.

Reviewed By: mrshenli

Differential Revision: D29971046

Pulled By: andwgu

fbshipit-source-id: a7234c23c7ea253f144a698fd7e3c0fe039de5e8
2021-08-02 08:33:34 -07:00
Yi Wang
32b37ba246 [DDP Communication Hook] Update the typing info of comm hook output as well as some docstring (#62457)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62457

Specify `Future[torch.Tensor]` as DDP communication hook return type, which should be explicitly a single tensor. The previous API takes a list that has a single tensor.

Note that now the typing info no longer accepts the internal type of `torch._C.Future`, which does not support torchscript and hence cannot support `Future[torch.Tensor]`.
ghstack-source-id: 134771419

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_invalid_comm_hook_return_type

Reviewed By: rohan-varma

Differential Revision: D30007390

fbshipit-source-id: 246667c9b575b4c6e617b0a5b373151f1bd81e7f
2021-07-30 20:51:34 -07:00
Yi Wang
72295da6c3 Reformat (#62456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62456

as title
ghstack-source-id: 134771417

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D30006493

fbshipit-source-id: 1d1dc9cfff69a9b4fa31470177c1f4fa206a94ef
2021-07-30 20:50:19 -07:00
Eli Uriegas
bd9f35313a Revert D29922299: [DDP] log bucket sizes
Test Plan: revert-hammer

Differential Revision:
D29922299 (5429f68f00)

Original commit changeset: 538b331c96e7

fbshipit-source-id: 3595fe04e8dea38bc9d05e8c70f2dcd2ad496ced
2021-07-30 20:27:36 -07:00
Rohan Varma
5429f68f00 [DDP] log bucket sizes (#62232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62232

Logs the bucket sizes in DDP logging so that we know which workflow ran with what bucket size config. Will be used to verify how changing bucket sizes in DDP affects perf.

Based on the test, we can see inconsistency where the "first" bucket size actually is (last before rebuild buckets, first after).
ghstack-source-id: 134663867

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29922299

fbshipit-source-id: 538b331c96e77048164ad130b377433be100a761
2021-07-30 18:07:04 -07:00
Rohan Varma
1f2b96e7c4 [DDP] Make compute_bucket_assignment_by_size return per bucket sizes (#62231)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62231

`compute_bucket_assignment_by_size` is responsible for setting per-bucket size limits, return this information from the function so that we are aware of size limits for each bucket.

This is currently not being consumed, but will be in the next diff when we log bucket size limits to DDP logging. This will help us run experiments under different bucket size configs and analyze the impact.
ghstack-source-id: 134480575

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D29919056

fbshipit-source-id: dd5a096fa23d22e5d9dc1602899270a110db4a19
2021-07-28 20:21:01 -07:00
Rohan Varma
10c6811a6b [DDP] Run test_ddp_new_tensor_in_fwd with static graph (#61992)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61992

This test previously was not enabled for static graph but to ensure
this feature is supported with DDPSink, enable it for static graph which
currently passes outputs to DDPSink.
ghstack-source-id: 134471406

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29830887

fbshipit-source-id: 2d3f750d9eb4289558ed21acccd172d83d9b82cc
2021-07-28 09:49:12 -07:00
Andrew Gu
3e3acf8a9a Minor documentation fixes (#61785)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61785

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29746648

Pulled By: andwgu

fbshipit-source-id: 435bbd8894f2ae5c814b9acd562673affea1daf6
2021-07-19 09:01:29 -07:00
Andrew Gu
813b887dad Fix indent (#61784)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61784

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29746647

Pulled By: andwgu

fbshipit-source-id: f42d3a0864a8291941d695a0cf575a5737cbb35c
2021-07-19 09:00:25 -07:00
Rohan Varma
f1114364ad [DDP] Enhance comm hook docs (#61677)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61677

Specify return type more clearly, 2) Misc fixes
ghstack-source-id: 133657895

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29701384

fbshipit-source-id: 7f77b99065bd2977153f397745e07b75bbdd7a94
2021-07-16 08:35:49 -07:00
Rohan Varma
7177509380 Revert [DDP] Support not all outputs used in loss calculation (#61497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61497

Reverts [DDP] Support not all outputs used in loss calculation
ghstack-source-id: 133589153

Test Plan: CI, ping authors to run their workflow on this diff

Reviewed By: zhaojuanmao

Differential Revision: D29642892

fbshipit-source-id: 81a15b9ab3329602f34d3758bb0799005a053d4f
2021-07-15 10:28:14 -07:00
Rohan Varma
25f9c35dd7 Revert [DDP] Support for multiple backwards (#61401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61401

Reverts https://github.com/pytorch/pytorch/pull/59359, which is causing a few internal issues in DDP training. We will evaluate the internal use cases and reland it after reconsidering the design.

Also moves `prepare_for_backward` back into forward pass instead of DDP Sink for `find_unused_parameters`. This ensures that hooks will always fire in the backwards pass, which is behavior that internal training workloads rely on. Calling `prepare_for_backward` in DDPSink autograd function is not the best solution since other autograd threads may have been executing which can cause races.

ghstack-source-id: 133589152

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D29608948

fbshipit-source-id: f060f41cd103573ddff8da50cdbb6c56768dab46
2021-07-15 10:28:13 -07:00
Andrew Gu
57feb35474 Refactor non-joined process computation (#61555)
Summary:
**Overview:**
This refactors the computation on non-joined processes relating to the join context manager. The concept was inspired by a comment from pritamdamania.

**Changes:**
This introduces a `_Joinable` abstract base class, which requires a `_join_hook()` method and `_join_device()` and `_join_process_group()` property methods. Any class that we want to be compatible with the generic join context manager should inherit from `_Joinable` and implement `_join_hook()`, `_join_device()`, and `_join_process_group()`. (The `device` and `process_group` information has been moved from `_JoinHook` to `_Joinable`.)

The generic join context manager now takes in a `List[_Joinable]` instead of `List[_JoinHook]`. The motivation for this is that previously, by passing the `_JoinHook`s into the context manager, the class providing a `_JoinHook` can modify the context manager's behavior, but the context manager cannot modify the class's behavior. This is solved by giving the context manager a reference to the class's instance.

This implementation reserves the field `_join_config` in every `_Joinable` to store a `_JoinConfig` instance, which holds all dynamic fields needed from the `_Joinable` for the join context manager: `enable`, `throw_on_early_termination`, and `is_first_joinable`. ("dynamic" here means that for a given `_Joinable` instance, the values for those fields may change across different join context usages.) In particular, these fields are needed to implement a method `notify_join_context()`, which encapsulates the computation performed on non-joined processes relating to the join context manager --- (1) the all-reduce to indicate that the process has not yet joined and (2) the all-reduce to check whether to throw an exception if `throw_on_uneven_inputs=True`. The idea is that every `_Joinable` class only needs to make a call to `notify_join_context()` before its per-iteration collective communications; it is a simple one-line addition.

Only the first `_Joinable` instance passed into the context manager actually performs the collective communications in `notify_join_context()`. In that case, the method returns an async work handle for the initial all-reduce indicating that the process not yet joined. Otherwise, the method returns `None`. This conditional logic is handled internally without additional input from the user.

**New API:**
Now, the example usage would look like:
```
ddp_model = DistributedDataParallel(...)
zero_optim = ZeroRedundancyOptimizer(ddp_model.parameters(), ...)
with _Join([ddp_model, zero_optim]):
    ...
```
Any arguments meant for a join hook (e.g. `divide_by_initial_world_size`) must be specified as keyword arguments. For example:
```
with _Join([ddp_model, zero_optim], divide_by_initial_world_size=False):
    ...
```
They will be forwarded to every `_join_hook()` function via `**kwargs`. This creates a clear separation between the variables needed by the context manager (`enable` and `throw_on_early_termination`) and those needed by the `_Joinable` class (e.g. `divide_by_initial_world_size`).

**Recap:**
After this change, the relevant information to use the generic join context manager looks like the following (omitting prefix `_` from names):
- Suppose we have a class `C` (e.g. `DistributedDataParallel`) that we want to be able to use the `Join` context.
- We make `C` inherit from `Joinable` and implement `join_hook() -> JoinHook`, `join_device()`, and `join_process_group()`.
- To implement `join_hook()`, we define a `CJoinHook` class inheriting from `JoinHook` and implement `main_hook()` and `post_hook()` as needed.
- We locate a place before `C`'s per-iteration collective communications and add a call to `Join.notify_join_context()`.
- We call `Joinable.__init__(self)` in `C`'s constructor.
- The `C.join_config` field will be used internally by the context manager. This does not affect `C`'s serializability.
- Run time arguments for `C`'s join hook can be passed in as keyword arguments to the context manager: `with Join([C()], arg1=..., arg2=...):`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61555

Test Plan:
I ran the existing DDP join tests:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```
I ran the ZeRO join tests:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py TestZeroRedundancyOptimizerDistributed.test_zero_join_gpu TestZeroRedundancyOptimizerDistributed.test_zero_join_cpu
```

Reviewed By: zou3519

Differential Revision: D29690359

Pulled By: andwgu

fbshipit-source-id: 2950f78de755eb5fb13b95b803dd7c705879a9c7
2021-07-14 08:20:40 -07:00
Andrew Gu
179249084b Refactor DDP join() API, adding hooks (#60757)
Summary:
Targets https://github.com/pytorch/pytorch/issues/54318.

**Overview:**
DDP offers a `join()` context manager to accommodate training on uneven inputs. This creates a new generic `_Join()` API permitting custom hooks, refactors DDP `join()` to call this generic `_Join()`, and implements a hook for ZeRO. (For now, the generic `_Join()` is implemented as private, but this may change after design discussions are cleared.)

There are two classes introduced: `_JoinHook`, the class defining the customizable join hook, and `_Join`, the generic join context manager.

The `_JoinHook` provides two entry points: `main_hook()`, which is called repeatedly while there exists a non-joined process, and `post_hook()`, which is called once all process have joined with the additional `bool` argument `is_last_joiner`. The class also requires `process_group` and `device` information by defining corresponding abstract property methods. Thus, to implement a join hook, (1) inherit from `_JoinHook`, (2) override `main_hook()` and `post_hook()` as appropriate, and (3) override `process_group()` and `device()` to provide process group and device information to be used by the join context manager implementation for collective communications.

The `_Join` constructor requires `join_hooks: List[_JoinHook]` and optionally `enable: bool = True` and `throw_on_early_termination: bool = False`. A training loop only needs to be wrapped with `with _Join(join_hooks):` (using the appropriate `join_hooks`) to be able to train on uneven inputs without hanging/erroring. The context manager requires a `dist.all_reduce(torch.ones(1))` to be called on every non-joined process each time before it performs its collective communications in order to indicate that the process has not yet joined. It also requires that all `process_group` attributes in the `_JoinHook` objects are the same.

**Notes:**
- The argument `is_last_joiner` to `post_hook()` may be useful for finding an authoritative rank when synchronizing.
- `enable` is a flag that can be set to `False` if the user knows the current training loop will not have uneven inputs. This may be used to disable join-related computation in  the classes providing join hooks.
- `throw_on_early_termination` is a flag that can be set to `True` to notify processes to terminate upon detecting uneven inputs (i.e. upon the first process joining when there exists a non-joined process). Notably, the notification requires an all-reduce, so to prevent hanging/erroring, non-joined process must participate in the all-reduce. The first-joining process raises a `RuntimeError`, and the other processes are expected (but not required) to do the same. This may be used to implement training on uneven inputs in cases that do not conform to the generic join context manager (e.g. `SyncBatchNorm`).
- Classes providing a join hook should do so via a `_join_hook()` method that returns a `_JoinHook` instance with the methods appropriately overridden.
- If there are multiple join hooks, the device specified by the first is used by the join context manager implementation to perform its collective communications.
- If there are multiple join hooks, both the main and post-hooks are iterated in the order in which the `_JoinHook` objects are passed into the context manager constructor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60757

Test Plan:
The current implementation preserves backward compatibility by not changing the existing DDP `join()` API at all. To check this, I ran through the uneven input tests (`test_ddp_grad_div_uneven_inputs`, `test_ddp_uneven_inputs_stop_iteration_sync_bn`, `test_ddp_uneven_inputs`, `test_ddp_uneven_input_join_disable`, `test_ddp_uneven_input_exception`) on the AI AWS cluster:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py --
```

Because the existing DDP join logic does not provide correct gradients to the joined processes if `gradient_as_bucket_view=False` and a joined process requires those gradients to correctly update its shard of the parameters in `ZeroRedundancyOptimizer.step()`, DDP and ZeRO are not fully compatible at the moment. To work around this and to test ZeRO's join hook separately, I added a test `_test_zero_join()` (with `test_zero_join_gpu()` and `test_zero_join_cpu()` flavors), which compares DDP with a local optimizer on uneven inputs against ZeRO on uneven inputs with the gradients set manually.

Reviewed By: iramazanli, mrshenli

Differential Revision: D29624636

Pulled By: andwgu

fbshipit-source-id: ec70a290e02518b0d8b683f9fed2126705b896c7
2021-07-09 08:29:20 -07:00
Yi Wang
4beb5f9ad6 [DDP Comm Hook] Fix some comments (#61376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61376

After SPMD is retired, the API of `get_tensors` becomes `get_tensor`. Fix some comments that refer to the obsolete API.

The `allreduce` hook example does not do division inside, which actually is incorrect.
ghstack-source-id: 133174272

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D29596857

fbshipit-source-id: 2046b185225cd6d1d104907b5f9b4009b6e87c99
2021-07-08 12:30:24 -07:00
Rohan Varma
43fb39c3eb [DDP] Make uneven inputs work with comm. hook (#61020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61020

Makes uneven input support with `join` context manager work with
custom communication hooks. This will ensure that the two features can work
well together. Added relevant unittests to test allreduce and powerSGD hooks.

Instead of calling `allreduce`, the join manager now calls into `_run_reduction_hook` which will automatically run whatever hook is installed.
ghstack-source-id: 132950108

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29480028

fbshipit-source-id: c91dc467a62c5f1e0ec702a2944ae3deb10f93f4
2021-07-02 18:48:21 -07:00
Rohan Varma
94b730681f [DDP] Refactor uneven inputs to take GradBucket (#61019)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61019

Changes uneven input logic of running allreduce to using `GradBucket` structure. This is to enable support for comm. hook with join in the next diff.
ghstack-source-id: 132950107

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D29480027

fbshipit-source-id: 7c42c53653052f71b86a75e14a5fc7ae656433f7
2021-07-02 18:47:23 -07:00
Rohan Varma
b21df03f3b [DDP] Remove SPMD from get_bucket_tensors (#61017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61017

Removes SPMD nested vector logic from this codepath. This is mostly in preparation for the next diffs in this stack which enable support for join with comm. hook.
ghstack-source-id: 132924223

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29477360

fbshipit-source-id: f8132a94b1abfe28586aa78ac47e13a7ce6bb137
2021-07-01 20:40:53 -07:00
Rohan Varma
60509f8921 Update DDP documentation to mention outputs not used in loss is supported (#60275)
Summary:
We recently landed a change to ensure that when running under ``find_unused_parameters=True``, not all module outputs have to be used in loss computation and DDP will work as expected. Mention this update in the documentation and add some additional clarification.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60275

Reviewed By: SciPioneer

Differential Revision: D29502609

Pulled By: rohan-varma

fbshipit-source-id: ddb3129cff9492018e61813413b30711af212309
2021-07-01 11:56:53 -07:00
Rohan Varma
12b63f4046 [DDP] Fix case where new tensors with no grad_fn are returned in DDP forward. (#60882)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60882

Fixes https://github.com/pytorch/pytorch/issues/60733, which
identified an issue with a previous PR that resulted in DDP no longer
supporting cases where newly created tensors are returned that don't have a
grad_fn. The result of this is the grad_fn is set to that of the `DDPSink`
custom backward which results in errors during the backwards pass.

This PR fixes the issue by ensuring we don't touch the `grad_fn` of the tensors
if it is `None`. Added relevant tests as well.
ghstack-source-id: 132632515

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29423822

fbshipit-source-id: a9e01046c7be50aa43ffb955f6e0f48fef4bc881
2021-06-29 12:50:48 -07:00
Rohan Varma
d5df274ea5 [DDP] Support for multiple backwards (#59359)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59359

Move `prepare_for_backward` into `_DDPSink` backward instead of calling it in DDP forward pass so that we can run multiple backwards in DDP with `retain_graph=True`.

ghstack-source-id: 131774159

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28855226

fbshipit-source-id: 6b7b25d75b7696f5b5629078233433f97663d61c
2021-06-18 09:23:57 -07:00
Rohan Varma
acd914f039 Fix Pipe + DDP for unused parameters, static graph (#60118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60118

Pipe + DDP has a few issues:

1) with static graph, does not synchronize gradients on first backward pass (i.e. delay allreduce is not run). does not work since https://github.com/pytorch/pytorch/pull/55248
2) when find_unused_parameters=True, also does not results in gradient synchronization. does not work since https://github.com/pytorch/pytorch/pull/57081

The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording.

To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in https://github.com/pytorch/pytorch/pull/49908.

to test:
All tests in pipe_with_ddp_test pass.
The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks.
ghstack-source-id: 131688187

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D29167283

fbshipit-source-id: fe62310db2dc6de8519eb361b1df8ae4dfce3ab8
2021-06-17 13:41:51 -07:00
Mike Ruberry
5686fe5817 Revert D29154971: Training resnext with msuru_suru_union and ig_msuru_suru_union datasets
Test Plan: revert-hammer

Differential Revision:
D29154971 (9f68f93aca)

Original commit changeset: d534d830020f

fbshipit-source-id: a3d16acc8e6b66a6010b501c28dbe295f573bc86
2021-06-16 15:33:14 -07:00
Zhuangzhuang Zhang
9f68f93aca Training resnext with msuru_suru_union and ig_msuru_suru_union datasets
Summary: We updated the training scripts and re-trained the Resnext model with msuru_suru_union and ig_msuru_suru_union datasets

Test Plan:
Main command line to run:
*./deeplearning/projects/classy_vision/fb/projects/msuru_suru/scripts/train_cluster.sh*

Config we used is *msuru_suru_config.json*, which is "Normal ResNeXt101 with finetunable head".

Experiments:
- msuru_suru_union f279939874
    - Train/test split
        - msuru_suru_union_dataset_train_w_shard: 143,632,674 rows
        - msuru_suru_union_dataset_test_w_shard: 1,831,236  rows
    - Results
       {F625232741}
       {F625232819}
- ig_msuru_suru_union f279964200
    - Train/test split
        - ig_msuru_suru_union_dataset_train_w_shard: 241,884,760 rows
        - ig_msuru_suru_union_dataset_test_w_shard: 3,477,181 rows
    - Results
{F625234126}
{F625234457}

Differential Revision: D29154971

fbshipit-source-id: d534d830020f4f8e596bb6b941966eb84a1e8adb
2021-06-16 11:22:50 -07:00
Rohan Varma
eb55b086b7 [DDP] Log some python-side errors (#59284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59284

Logs a few python-side errors to DDP logging.

TODO: Most python errors actually have to do with user input correctness, so they throw before reducer is constructed and thus there is no logger. For this case, should we allow `logger` to be created optionally without a reducer, just for the purpose of logging errors, so that we can gain insight into these errors in scuba?
ghstack-source-id: 130412973

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28820290

fbshipit-source-id: 610e5dba885b173c52351f7ab25c923edce639e0
2021-06-02 19:49:26 -07:00
Rohan Varma
79aeca0b00 [DDP] Log when errors happen (#59281)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59281

Adds ability to log when reducer/ddp encounters an error. We add fields "has_error" and "error" to indicate that an error has
occured in this iteration, and the other fields (performance stats) are not
guaranteed to be updated.

Errors encountered in python-side DDP will be added in the next diff.
ghstack-source-id: 130412974

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28652717

fbshipit-source-id: 9772abc2647a92dac6a325da6976ef5eb877c589
2021-06-02 19:48:26 -07:00
Rohan Varma
2a78e896a0 [DDP] use work.result() in _check_global_requires_backward_grad_sync (#59065)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59065

Cleaner to use work.result() instead of sending back the tensor from
this function.
ghstack-source-id: 130338813

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28551203

fbshipit-source-id: d871fed78be91f0647687ea9d6fc86e576dc53a6
2021-06-02 17:19:07 -07:00
Andrew Gu
5a42a97c49 Add NCCL_ASYNC_ERROR_HANDLING as an environment variable (#59109)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57878.

This adds `NCCL_ASYNC_ERROR_HANDLING` as a DDP relevant environment variable and includes a check for that variable in the test `test_dump_DDP_relevant_env_vars()`. Notably, the modified test now checks for the new variable but does not check for any of the other previously-existing relevant environment variables that were not already tested for (e.g. `NCCL_BLOCKING_WAIT`).

The change was tested via the following on an AI AWS cluster:
`WORLD_SIZE=2 BACKEND=nccl gpurun pytest test/distributed/test_distributed_spawn.py -k test_dump_DDP_relevant_env_vars -vs`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59109

Reviewed By: H-Huang, SciPioneer

Differential Revision: D28761148

Pulled By: andwgu

fbshipit-source-id: 7be4820e61a670b001408d0dd273f65029b1d2fe
2021-06-01 14:02:41 -07:00
Sureyya Emre Kurt
3d2b55553b Retiring _module_copies field in DDP reducer. (#59094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59094

Commented out _module_copies fields and changed for loops accordingly

Test Plan: Test cases mentioned in T91292908 passed succesfully

Reviewed By: SciPioneer

Differential Revision: D28736135

fbshipit-source-id: 1857102f0c57a734026f3025e9653d8fad57d0b6
2021-05-27 15:09:14 -07:00
Rohan Varma
1d67c6d639 [DDP] Remove train call to module copies (#58595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58595

No longer needed since this list is always of size 1.
ghstack-source-id: 129498229

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28548426

fbshipit-source-id: 7d6dba92fff685ec7f52ba7a3d350e36405e2578
2021-05-20 22:34:20 -07:00
Rohan Varma
faa7d3793d [DDP] Support not all outputs used in loss calculation (#57081)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57081

Changes in this diff:

Enable passthrough autograd function when find_unused_parameters=True.
With above, move prepare_for_backward which does unused parameter checking logic to beginning of backwards pass, only when find_unused_parameters=True.
Enhance process of unused parameter checking to account for outputs not being used in loss.
The way (3) is implemented is by triggering the autograd hook corresponding to parameters that did not participate in loss computation. Since they did not participate, the autograd hook is triggered with a gradient of None, and the reducer handles this appropriately to ensure that the gradient is not touched.

Tested by ensuring that when a model output is not used in loss, the corresponding grad is not modified. Also verified that the grads are the same in local vs DDP training case. Also verified that gradients are not touched in this case, i.e. if grad is originally None, it stays as None, not zero, after.

Note that in this diff we are not enabling the pass through autograd function for regular case find_unused_parameters=False because that has a much bigger blast radius and needs additional careful analysis especially with regard to the performance.
ghstack-source-id: 129425139

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28048628

fbshipit-source-id: 71d7b6af8626804710017a4edd753787aa9bba61
2021-05-20 08:34:33 -07:00
Alexander Golynski
bc30c3165c Update docs for get_future support (#58107)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58107

Test Plan: Imported from OSS

Reviewed By: SciPioneer

Differential Revision: D28387374

Pulled By: agolynski

fbshipit-source-id: 70052afbb0b07ba341ea55f7ec30f7d9759b7bd4
2021-05-12 18:29:28 -07:00
Yanli Zhao
166a8df65f [reland] make ddp logging api to be private (#58089)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58089

make ddp logging api to be private
ghstack-source-id: 128796419

Test Plan: unit test

Reviewed By: rohan-varma

Differential Revision: D28365412

fbshipit-source-id: 374c01d443ffb47a3706f59e296d6e47eb5f4c85
2021-05-12 16:45:13 -07:00
Rohan Varma
a0ac80ec76 [DDP] Don't find tensors if static graph (#58105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58105

When find_unused_parameters=True but static_graph is also set, static graph handles unused parameter accounting, so this code path is not needed
ghstack-source-id: 128736289

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28371954

fbshipit-source-id: 0b42a9c0fd2bba26a0de288436e9c7139e292578
2021-05-12 11:40:18 -07:00
Rohan Varma
c52700dbcd [wip] enhance DDPSink to work for general outputs (#57073)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57073

Enhances use of DDPSink to work for all output types DDP supports as per https://github.com/pytorch/pytorch/issues/55876.

TODO: Add additional testing for tuple, list, dict return types
ghstack-source-id: 128726768

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27756985

fbshipit-source-id: 2e0408649fb2d6a46d6c33155a24c4c1723dd799
2021-05-12 09:45:10 -07:00
Kimish Patel
ad4cd6ef89 Revert D28338485: make ddp logging api to be private
Test Plan: revert-hammer

Differential Revision:
D28338485 (ac44569b0d)

Original commit changeset: bd2ae7c78904

fbshipit-source-id: d383f42a2051457147dec42ea273ed4fa82ffa1f
2021-05-11 12:12:51 -07:00
Yanli Zhao
ac44569b0d make ddp logging api to be private (#57999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57999

make ddp logging api to be private
ghstack-source-id: 128607185

Test Plan: unit test

Reviewed By: rohan-varma

Differential Revision: D28338485

fbshipit-source-id: bd2ae7c78904e93eed88be91876f5a832b5b7886
2021-05-11 10:37:03 -07:00
Yanli Zhao
ea421fb249 enable static graph training in DDP (#55248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55248

This PR provides enable static graph training when users call _set_static_graph(). This can help support more use cases in DDP without performance regression, also can potentially improve performance when there are unused parameters in the graph.
1. first iteration records graph states like how many times a grad is calculated, whether the grad is used or not. then first iteration queues a delay_all_reduce call back to all reduce grads.
2. Since autograd call back is associated with current target graph task, the delay_all_all call back should be associated with out-most backward graph task. A DDP sink layer is added in DDP forward loop so that we can queue the delay_all_reduce call back in the sink layer.
3. after first iterations, DDP will use the saved graph states to determine whether a grad is used or not. whether a grad is ready for communication.
4. rebuilt bucket is called in second iteration, after graph states are recorded in first iteration.
5. if the graph states change, DDP will throw errors
ghstack-source-id: 128599464

Test Plan: unit tests. adding more tests

Reviewed By: rohan-varma

Differential Revision: D27539964

fbshipit-source-id: 74de1ad2719465be67bab8688d6e293cd6e3a246
2021-05-11 10:23:25 -07:00
Rohan Varma
fe3c63d9d3 [DDP] fix param to name mapping (#57771)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57771

This mapping didn't work properly when certain parameters didn't
require grad. Fixed that and added a test.
ghstack-source-id: 128527537

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28265636

fbshipit-source-id: 7b342ce012b2b7e33058b4c619ffb98992ed05b7
2021-05-10 11:47:46 -07:00
Rohan Varma
d115e81a32 Fix document around DDP uneven inputs (#57448)
Summary:
Typo fix and additional clarifications on the API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57448

Reviewed By: SciPioneer

Differential Revision: D28153264

Pulled By: rohan-varma

fbshipit-source-id: 9bd35d918299ad7e080785d755f97b966f826615
2021-05-10 09:33:59 -07:00
Rohan Varma
57f72b8433 [DDP] Uneven inputs: option to throw early (#56755)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56755

Rehash of https://github.com/pytorch/pytorch/pull/47488

Adds a flag to ddp join() context manager that enables throwing a
StopIteration across all ranks when this flag is specified.

To do this, we implement the design in #47250. When running with this flag, we schedule an additional allreduce in the case that a joined rank needs to throw a StopIteration. In non-joined ranks forward pass, we match this allreduce and if at least one rank tells us to throw, we raise a StopIteration.

Tested by modifying existing tests, as well as adding additional tests validating that this works with SyncBatchNorm models and a model with custom collectives in the forward pass.

Currently running perf benchmarks, will post when those are available, but we expect a small (~2%) perf reduction when enabling this feature due to the blocking allreduce. Hence we will only recommend it for models with collective comm.
ghstack-source-id: 127883115

Test Plan: Ci

Reviewed By: SciPioneer

Differential Revision: D27958369

fbshipit-source-id: c26f7d315d95f17bbdc28b4a0561916fcbafb7ca
2021-05-02 15:41:50 -07:00
Yanli Zhao
3f81912885 static graph api skeleton (#54995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54995

provide an DDP private API to explicitly set the training is static, also set this flag in logger
ghstack-source-id: 127755713

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D27444965

fbshipit-source-id: 06ef1c372296815944b2adb33fbdf4e1217c1359
2021-04-30 11:07:26 -07:00
Yanli Zhao
2c8ea63cbb add a test for grad view with torch amp (#56730)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56730

add a test to verify DDP with torch map will result in the same results when using grad_as_bucket_view=true and false.

torch.amp scale factor does not have dependencies on old gradients, thus it is not affected by grad_as_bucket_view=true or false, see
how torch.amp is implemeted here https://github.com/pytorch/pytorch/pull/33366/files.

This diff verified ddp can work as expected with amp.GradScaler and amp.autocast when when using grad_as_bucket_view=true and false.
ghstack-source-id: 127526358

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D27950132

fbshipit-source-id: 8ed26935fdcb4514fccf01bb510e31bf6aedac69
2021-04-29 10:06:07 -07:00
Yanli Zhao
1e77ba36db change ddpLoggingData struct to map or dict (#56641)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56641

currently ddpLoggingData is flat struct, which requires internal DDP developers and external users to know about the struct field names. This is not flexible to delete or add new fields in the future. also it is hard to access ddpLoggingData.

With maps/dict, developers and users can easily access the fields without knowing the field names, also easier to add/remove a new/old field.

Since C++ does not support map values to be different types, right now ddpLoggingData containes two types of maps.
ghstack-source-id: 127482694

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D27923723

fbshipit-source-id: c90199c14925fc50ef219000e2f809dc7601cce1
2021-04-28 06:43:25 -07:00
Yi Wang
07653b7fe0 [SPMD] Remove ddp_gpu_size field from SyncBatchNorm (#55946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55946

As `ddp_gpu_size` field of `SyncBatchNorm` will always be 1 for GPU modules, remove this field and the relevant code.
ghstack-source-id: 126883498

Test Plan: waitforbuildbot

Reviewed By: zhaojuanmao

Differential Revision: D27746021

fbshipit-source-id: b4518c07e6f0c6943fbd7a7548500a7d4337126c
2021-04-19 21:41:29 -07:00
Mike Guo
5b4c3a9da1 record Torch DP and DDP modules forward (#55578)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55578

Reviewed By: gdankel

Differential Revision: D27862392

Pulled By: ilia-cher

fbshipit-source-id: 18545d23e35a97c8f760707fecb696a24d47dc0a
2021-04-19 17:52:59 -07:00
Michael Carilli
a24b17248f Short circuits DistributedDataParallel._recursive_to's copy and stream syncs if input is already on the right device (#55624)
Summary:
^

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55624

Reviewed By: pbelevich, agolynski

Differential Revision: D27836170

Pulled By: rohan-varma

fbshipit-source-id: 954bf336d70f9e80c045a6a96c1d8843c7f1cf2c
2021-04-18 14:08:08 -07:00
Rohan Varma
51e7a371f5 [DDP] Param to name mapping in Reducer (#55075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55075

Constructs and passes in a mapping with parameter names to Reducer to log information about unused parameters in error messages about unused parameters/not all parameters getting gradient.

Use case:
1) User runs DDP forward + bwd, and it has some unused parameters that will result in ddp error in next iteration
2) Next forward pass calls `Reducer::ensure_prior_reduction_finished()` where we check all params got gradient from the previous bwd pass. DDP would throw here in this case.
3) Reducer maintains mapping and tracks used parameters, and computes which parameters did not get gradient and logs this as part of the error.

Implementation details:
0) The following is only enabled for debug modes of INFO or DETAIL.
1) To save memory, we don't map param -> param name so that we don't have to copy the entire tensor, instead we map param_index -> param_name and use the existing concept of variable_index in Reducer to look up parameter names.
2) DDP constructs param index -> param name mapping. The name is the fully qualified name: f"{module_name}:{param_name}" and passes it into Reducer
3) Reducer maintains per-iteration std::set<int> of variable indices that have had `mark_variable_ready` called.
4) When some params go unused, we take a set difference to detect the unused params.
5) Unittests to test the logged unused params, as well as for nested modules, are added
ghstack-source-id: 126581051

Test Plan: CI, UT

Reviewed By: zhaojuanmao

Differential Revision: D27356394

fbshipit-source-id: 89f436af4e74145b0a8eda92b3c4e2af8e747332
2021-04-15 09:19:50 -07:00
Yi Wang
d398a705c6 Clang-format batchnorm.py and distributed.py (#55971)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55971

Per title
ghstack-source-id: 126454339

Test Plan: N/A

Reviewed By: zhaojuanmao

Differential Revision: D27752315

fbshipit-source-id: 64ca5dea7b2689037594a6bd9a75641a9bb817c1
2021-04-13 18:40:23 -07:00
Yi Wang
4b09756d26 [SPMD] Move a comment (#55877)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55877

Address a comment in: 10bc1dae40 (r610930244)
ghstack-source-id: 126369525

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D27729567

fbshipit-source-id: 5509ebfba2b741cd3532c69044227e5af0fb54fc
2021-04-13 05:53:31 -07:00
Yi Wang
3e9cbe5ef7 [SPMD] Remove the code branches only used in SPMD mode from distributed.py (#55353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55353

Remove all the code branches that will only be executed when `device_ids > 1`.

Some helper functions are also removed:
1.  `_verify_replicas_within_process` and `verify_replicas_within_process`
2. `_replicate_modules_within_process`
3. `parallel_apply`

The next step is deprecating `_module_copies` field.
ghstack-source-id: 126201121

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27552201

fbshipit-source-id: 128d0216a202f5b1ba4279517d68c3badba92a6c
2021-04-09 17:27:56 -07:00
Yi Wang
b986a76d91 Clang-format distributed.py (#55254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55254

ghstack-source-id: 125680320

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D27542846

fbshipit-source-id: 700c3e59a9df98233fdb27054b472f5cb33eb604
2021-04-05 16:48:22 -07:00
Yi Wang
e589247a19 [SPMD] Change assertions to raising value errors in distributed.py (#54825)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54825

These assertions are tested in test_c10d.py

Context: https://github.com/pytorch/pytorch/pull/54454#discussion_r602657818
ghstack-source-id: 125602462

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_multi_device_module_config

Reviewed By: rohan-varma

Differential Revision: D27381649

fbshipit-source-id: 9b994e9c2acf796770c2f2af2cebdd5561834d14
2021-04-02 15:13:45 -07:00
Yi Wang
6a40339920 [SPMD] Error out SPMD mode (#54454)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54454

According to the pitch in https://github.com/pytorch/pytorch/issues/47012

1. Let DDP error out if `device_ids` contains multiple devices.
2. If device_ids is not specified, DDP will use the provided model (module argument in DDP constructor) as-is, regardless if the model is on one GPU or multiple GPUs or on CPU.
3. Remove the assertion that prevents SPMD in DDP `join()` method, because now SPMD is already forbidden by the constructor. Also remove the relevant unit test `test_ddp_uneven_inputs_replicated_error`.

#Closes: https://github.com/pytorch/pytorch/issues/47012

ghstack-source-id: 125644392

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_spawn -- test_cuda
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_spawn -- test_rnn

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_nccl_backend_multi_device_ids_not_allowed
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_nccl_backend_single_device_module_device_ids_None
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_nccl_backend_multi_device_module_device_ids_None

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_multi_device_module_config

waitforbuildbot

Reviewed By: pritamdamania87

Differential Revision: D27226092

fbshipit-source-id: 3ee1e4bc46e5e362fc82cf7a24b2fafb34fcf1b9
2021-04-02 15:11:59 -07:00
Rohan Varma
3575e71be8 [DDP Logging] Log use of uneven inputs API (#54919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54919

Log the use of uneven inputs API for better tracking and use case
detection.
ghstack-source-id: 125446499

Test Plan: CI, added ut

Reviewed By: zhaojuanmao, SciPioneer

Differential Revision: D27410764

fbshipit-source-id: abc8055a2e15a3ee087d9959f8881b05a0ea933e
2021-04-01 16:22:32 -07:00
Rohan Varma
8c13dde458 [DDP] Remove redundant pass statement (#54219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54219

There is no need for this ``pass``.
ghstack-source-id: 125124311

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27105234

fbshipit-source-id: 95496fa785fdc66a6c3c8ceaa14af565588325df
2021-03-29 14:15:39 -07:00
Yi Wang
6e7a3c1fdd Clang-format distributed.py (#54402)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54402

ghstack-source-id: 124497872

Test Plan: N/A

Reviewed By: zhaojuanmao

Differential Revision: D27225942

fbshipit-source-id: 277f466554fbc034fb76de161bf4b3b7c431daf7
2021-03-22 11:39:58 -07:00
Shen Li
ef9ee46756 Avoid modifying rebuild buckets state in no_grad context (#54159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54159

See https://github.com/pytorch/pytorch/issues/54059 for discussion.

In short, users might want to run evaluation on a single rank
in `torch.no_grad()` mode. When this happens, we need to make
sure that we skip all rebuild bucket logics, as the forward only
runs on one rank and not all peers can sure the bucket configuration
sync communication.

Test Plan: Imported from OSS

Reviewed By: zhaojuanmao

Differential Revision: D27119666

Pulled By: mrshenli

fbshipit-source-id: 4b2f8cce937cdd893e89d8d10c9267d255ba52ea
2021-03-17 19:50:29 -07:00
Rohan Varma
e09e97ebf9 [DDP] add _distributed_rank helper function (#53795)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53795

There are 4 calls in ddp implementation to dist.get_rank(), move these
to a helper property to ensure that users don't actually call `dist.get_rank()`
instead of `dist.get_rank(self.process_group)`.

Keeping API private for now because not sure if there is a user need to call `model.distributed_rank`, but can make it public if we think it's a useful api.
ghstack-source-id: 123640713

Test Plan: Ci

Reviewed By: mrshenli

Differential Revision: D26972368

fbshipit-source-id: a5f1cac243bca5c6f90a44f74d39cfffcc2b9a5a
2021-03-11 21:20:05 -08:00
Rohan Varma
0c2fe02ec1 [DDP] Fix wrong call to dist.get_rank() (#53793)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53793

This call should pass in the process group so it works appropriately
for subgroups instead of whole world being passed into DDP.

Aside: This wasn't caught by tests since we don't have good testing around
passing subgroups into DDP, I believe nearly all tests use the entire world.
Should we add better testing for subgroups which may potentially bring up more
subtle bugs?
ghstack-source-id: 123640712

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D26972367

fbshipit-source-id: 8330bd51e2ad66841e4c12e96b67d3e78581ec74
2021-03-11 21:18:31 -08:00
Yi Wang
d726ce6668 Support loading a non-DP/DDP model from a DP/DDP state_dict (#53224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53224

Loading a DP/DDP dict just needs to strip the module prefix from all items in the state dict and the metadata.

One existing example is here: https://github.com/facebookresearch/fvcore/blob/master/fvcore/common/checkpoint.py#L239.

#Closes: https://github.com/pytorch/pytorch/issues/41048/
ghstack-source-id: 123722976

Test Plan:
buck test mode/dev-nosan caffe2/test:nn -- test_load_state_dict
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_save_load_checkpoint

Reviewed By: rohan-varma, mrshenli

Differential Revision: D26798495

fbshipit-source-id: 035c7d0907d7ae8f0d7ca21ec71f7f96ef8df6c8
2021-03-11 18:43:33 -08:00
Yanli Zhao
a08fc1a7fc allow users to set sample rate and add per iteration latency breakdowns (#53145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53145

add a new API to allow users to set sample rate for runtime stats, also add per iteration latency breakdowns to DDPLoggingData struct. e.g.
if users set sample rate to be 1, they can analyze per iteration latency change over time (not avged)
ghstack-source-id: 123443369

Test Plan: unit test

Reviewed By: SciPioneer

Differential Revision: D26763957

fbshipit-source-id: baff6a09c2a590e6eb91362ca6f47ae8fa6ddb0e
2021-03-10 11:35:18 -08:00
Michael Carilli
e787872a47 [RELAND] Deduplicate shared params before constructing Reducer in DDP (#53279)
Summary:
Original PR https://github.com/pytorch/pytorch/pull/51929 seemed to trigger failures in `pytorch_linux_xenial_py3_clang5_asan_test2`. Resubmitting to figure out why, and hopefully reland.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53279

Reviewed By: mrshenli

Differential Revision: D26916701

Pulled By: zhaojuanmao

fbshipit-source-id: 75c74c8ad8ad24154eb59eddb2b222da0a09897e
2021-03-10 07:56:20 -08:00
Rohan Varma
14fa47631b [DDP Logging] Log comm. hook in ddp logging (#52966)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52966

Logs registerd comm hook if there is one, else logs
"builtin_allreduce"
ghstack-source-id: 123174803

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D26709388

fbshipit-source-id: 484fdbbd6643ec261b3797bd8d9824b2b6a1a490
2021-03-05 11:23:26 -08:00
Rohan Varma
68134374cb Refactor/fix DDP model check during init (#52887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52887

This diff changes the way to do model consistency check (i.e. `_verify_replicas_across_processes`) in DDP.

There were a few things that could be improved with the way we verify model across processes in DDP initialization:

1. We should do this check before syncing module states in DDP init, otherwise with Gloo backend this will throw but we would like to throw the error corresponding to different models on different ranks. To do this, we move the methods to be standalone C++ functions (not part of reducer) and move this check to before synchronizing parameters.
2. Refactor DDP init in the following ways:
- Run model consistency check before creating reducer, 2
- add helper functions to build params to pass into reducer
- add helper function to call `_verify_model_across_ranks`
- move `def parameters` to a helper function `_get_parameters` to be used more broadly within DDP

In follow up changes we will add the ability to detect which rank had inconsistent model (https://github.com/pytorch/pytorch/issues/52876 would be useful for this to determine which ranks(s) had errors).
ghstack-source-id: 123171877

Test Plan:
CI/unittest
buck test mode/dev-nosan //caffe2/test/distributed:c10d
BACKEND="nccl" WORLD_SIZE="2" ~/fbcode/buck-out/dev/gen/caffe2/test/distributed/distributed_nccl_fork#binary.par -r test_ddp_model_diff_across_ranks

Reviewed By: zhaojuanmao

Differential Revision: D26565290

fbshipit-source-id: f0e1709585b53730e86915e768448f5b8817a608
2021-03-05 11:21:45 -08:00
Mike Ruberry
30a8a13a7d Revert D26625807: [pytorch][PR] Deduplicate shared params before constructing Reducer in DDP
Test Plan: revert-hammer

Differential Revision:
D26625807 (5c15a5bb46)

Original commit changeset: f5f5959fef90

fbshipit-source-id: c875cc86b8fd21d9d64f934559f8e3126ed1d23d
2021-03-03 20:05:47 -08:00
Yi Wang
68b62493b8 [Gradient Compression] Make GradBucket class public (#53099)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53099

Publish GradBucket APIs for publishing DDP communication hooks.

s/_GradBucket/GradBucket
ghstack-source-id: 123030921

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26721121

fbshipit-source-id: ee5f68e33095b9965b51937b86cdeb331fd2419a
2021-03-03 19:22:15 -08:00
Michael Carilli
5c15a5bb46 Deduplicate shared params before constructing Reducer in DDP (#51929)
Summary:
Currently, `torch.nn.parallel.DistributedDataParallel(model...)` doesn't deduplicate params shared across `model`'s child Modules before calling Reducer with the param list. This can cause Reducer to register more than one hook on the shared param(s), at which point who knows what happens.

We ran into this in mlperf BERT, which has at least one param shared across submodules (an embedding weight iirc, not 100% sure). Running with `gradient_as_bucket_view = False` produced different numerics from running with `gradient_as_bucket_view = True` (which i guess is one potential consequence of multiple DDP hooks on a given param, not sure why, i'd have to dig further).

This PR changes DDP to deduplicate shared params (a small diff), and adds some tests (right now just `test_ddp_weight_sharing`, but I'll add more). `test_ddp_weight_sharing` fails with bad numerics on current master (proving the shared param issue is real) and passes with the deduplication diff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51929

Reviewed By: zou3519

Differential Revision: D26625807

Pulled By: zhaojuanmao

fbshipit-source-id: f5f5959fef90dfe2c55812d79fa88b877f22ecc3
2021-03-03 10:13:24 -08:00
Shen Li
d697090260 Add a note in DDP doc to point to ZeroRedundancyOptimizer (#53113)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53113

Test Plan: Imported from OSS

Reviewed By: blefaudeux

Differential Revision: D26752339

Pulled By: mrshenli

fbshipit-source-id: 7a082f1007bc550eabb82b559d020bbe717fa497
2021-03-02 14:18:06 -08:00
Yanli Zhao
d0795ab358 log newly added construction and runtime stats at randomly selected iterations (#51394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51394

log newly added construction and runtime stats at randomly selected iterations
ghstack-source-id: 121934040

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D26161885

fbshipit-source-id: add6e02c1a03e6f74f08b9a9aecf90fa81631d60
2021-02-19 00:15:04 -08:00
Yanli Zhao
c75fa39b6c add stats that can only be collected at runtime (#51386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51386

add stats such as rebuilt bucket stats, unused parameter stats and performance stats to ddp logging data

1. gpu time stats are not collected for single process multiple devices in this diff, as that requires events are created and recorded on multiple devices
2. use at::cuda::event API for safer calls
3. events may not be created in autograd hook if hook is not triggered in user's codes, e.g., users runs in non-sync mode in some iterations. So we checked events are created or not before synchronizing, also skipped invalid results.
4. users may not set device upfront, so explicitly set proper device before creating events in our prepare_forward() and prepare_backward() calls

ghstack-source-id: 121933566

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D26158645

fbshipit-source-id: ce5f15187802eba76accb980449be68902c10178
2021-02-19 00:13:11 -08:00
Rohan Varma
6dabe0b291 [Dist Profiling] Enable dist profiling for DDP (gloo only) (#52031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52031

Closes https://github.com/pytorch/pytorch/issues/52020
Ensures that we can profile collectives in DDP by propagating the profiler threadLocalState appropriately. As described in the above issue, before this wouldn't work as the profiler would only be enabled on the main thread.
ghstack-source-id: 121818080

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26356192

fbshipit-source-id: 0158b5833a3f857a0b4b2943ae3037e9d998dfd1
2021-02-17 12:21:37 -08:00
Rohan Varma
a86027ded3 Use side-stream in CPU to GPU copies in DDP (#50180)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50180

Resolves the regression in
https://github.com/pytorch/pytorch/issues/49819 by adding copy over background
stream similar to scatter. For internal use cases, this is gated with an env var that maintains the previous behavior when it is off.

Test Plan: CI

Reviewed By: mrshenli, ngimel

Differential Revision: D25818170

fbshipit-source-id: e50c76c035504b2a44e2be084701cee45c90df75
2021-02-13 00:57:32 -08:00
Yanli Zhao
18e0a61388 add more logging fields that can be set in construction time (#51260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51260

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type
ghstack-source-id: 121260224

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D26118245

fbshipit-source-id: ba48b7a11340bda1f5f3b24c8603545d346361e9
2021-02-09 21:58:58 -08:00
Yi Wang
4b3c99ce4a [Resubmission] Add a documentation page for DDP communication hooks (#51773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51773

Resubmission of #51715.

Minor changes:
1) Removed "Note [Guidance to Tune ``matrix_approximation_rank`` And ``start_powerSGD_iter``]" in powerSGD_hook.py.

2) Removed the duplicate description of `torch.nn.parallel.DistributedDataParallel.register_comm_hook` in ddp_comm_hooks.rst, because it is already covered by distributed.rst.

Also updated the doc based on the comments from PowerSGD paper author Thijs Vogels .

It seems that `python_doc_test` was flaky. The previous error message was not informative:
https://app.circleci.com/pipelines/github/pytorch/pytorch/270682/workflows/8d186a3c-d682-46bf-b617-ad4eef5991e2/jobs/10739143, and all the warnings did also appear on the master branch.

Rebasing to a new master branch seems to get this fixed:
https://app.circleci.com/pipelines/github/pytorch/pytorch/270696/workflows/1a3adbea-6443-4876-b87b-e17d90d41428/jobs/10740021/steps

Screenshot:

{F369899792}
ghstack-source-id: 121199613

Test Plan: View locally

Reviewed By: mingzhe09088

Differential Revision: D26272687

fbshipit-source-id: 6677db496a68171798940a80343f4d9a508e15db
2021-02-06 21:22:04 -08:00
Natalia Gimelshein
d3023d86ba Revert D26249330: [Gradient Compression] Add a documentation page for DDP communication hooks
Test Plan: revert-hammer

Differential Revision:
D26249330 (e62aabac43)

Original commit changeset: ab973390ddb7

fbshipit-source-id: d508daed76219e7ca588cf7fb38aeaaffc61acfd
2021-02-04 22:38:06 -08:00
Yi Wang
e62aabac43 [Gradient Compression] Add a documentation page for DDP communication hooks (#51715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51715

Add a documentation page for DDP communication hooks.

Screenshot:

{F369781049}

Test Plan: View locally

Reviewed By: pritamdamania87

Differential Revision: D26249330

fbshipit-source-id: ab973390ddb785c5191f587a1b2b6de7d229e50e
2021-02-04 18:53:53 -08:00
Yanli Zhao
250c71121b Create a DDPLoggingData and expose it to python interface (#50622)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50622

1. Define a DDPLoggingData struct that is the placeholder for all the ddp related logging fields
2. Put the DDPLoggingData struct in the C10 directory so that it can be easily imported by c10 and torch files
3. Expose get_ddp_logging_data() method in python so that users can get the logging data and dump in their applications
4. Unit test tested the logging data can be set and got as expected
5. Follow up will add more logging fields such as perf stats, internal states, env variables and etc
ghstack-source-id: 120275870

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D25930527

fbshipit-source-id: 290c200161019c58e28eed9a5a2a7a8153113f99
2021-01-25 15:23:07 -08:00
Pritam Damania
f39f258dfd Ensure DDP + Pipe works with find_unused_parameters. (#49908)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49908

As described in https://github.com/pytorch/pytorch/issues/49891, DDP +
Pipe doesn't work with find_unused_parameters.

This PR adds a simple fix to enable this functionality. This only currently
works for Pipe within a single host and needs to be re-worked once we support
cross host Pipe.
ghstack-source-id: 119573413

Test Plan:
1) unit tests added.
2) waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25719922

fbshipit-source-id: 948bcc758d96f6b3c591182f1ec631830db1b15c
2021-01-11 16:52:37 -08:00
Samuel Marks
e6779d4357 [*.py] Rename "Arguments:" to "Args:" (#49736)
Summary:
I've written custom parsers and emitters for everything from docstrings to classes and functions. However, I recently came across an issue when I was parsing/generating from the TensorFlow codebase: inconsistent use of `Args:` and `Arguments:` in its docstrings.

```sh
(pytorch#c348fae)$ for name in 'Args:' 'Arguments:'; do
    printf '%-10s %04d\n' "$name" "$(rg -IFtpy --count-matches "$name" | paste -s -d+ -- | bc)"; done
Args:      1095
Arguments: 0336
```

It is easy enough to extend my parsers to support both variants, however it looks like `Arguments:` is wrong anyway, as per:

  - https://google.github.io/styleguide/pyguide.html#doc-function-args @ [`ddccc0f`](https://github.com/google/styleguide/blob/ddccc0f/pyguide.md)

  - https://chromium.googlesource.com/chromiumos/docs/+/master/styleguide/python.md#describing-arguments-in-docstrings @ [`9fc0fc0`](https://chromium.googlesource.com/chromiumos/docs/+/9fc0fc0/styleguide/python.md)

  - https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html @ [`c0ae8e3`](https://github.com/sphinx-contrib/napoleon/blob/c0ae8e3/docs/source/example_google.rst)

Therefore, only `Args:` is valid. This PR replaces them throughout the codebase.

PS: For related PRs, see tensorflow/tensorflow/pull/45420

PPS: The trackbacks automatically appearing below are sending the same changes to other repositories in the [PyTorch](https://github.com/pytorch) organisation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49736

Reviewed By: albanD

Differential Revision: D25710534

Pulled By: soumith

fbshipit-source-id: 61e8ff01abb433e9f78185c2d1d0cbd7c22c1619
2020-12-28 09:34:47 -08:00
Rohan Varma
c9f6e70c09 Refactor DDP uneven inputs control flags (#47394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47394

This is a preliminary refactor for the next diff that will add an
additional flag to control whether we throw a StopIteration or not. We
basically move the flags for ddp uneven inputs to a simple class.
ghstack-source-id: 116428177

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D24739509

fbshipit-source-id: 96bf41bd1c02dd27e68f6f37d08e22f33129b319
2020-11-11 16:51:56 -08:00
Zhicheng Chen
3dd266304c Fix inaccurate note in DistributedDataParallel (#47156)
Summary:
Sorry for my previous inaccurate [PR](https://github.com/pytorch/pytorch/pull/42471#issue-462329192 ).

Here are some toy code to illustrate my point:

* non-DistributedDataParallel version

```python
import torch

if __name__ == "__main__":
    torch.manual_seed(0)
    inp = torch.randn(1,16)
    inp = torch.cat([inp, inp], dim=0)
    model = torch.nn.Linear(16, 2)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()
    loss = loss_func(model(inp), torch.tensor([0, 0]))
    loss.backward()
    opti.step()

    print("grad:", model.weight.grad)
    print("updated weight:\n", model.weight)
```

* DistributedDataParallel version

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    torch.manual_seed(0)
    x = torch.randn(1,16)

    model = torch.nn.Linear(16, 2)
    model = torch.nn.parallel.DistributedDataParallel(model)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()

    y = model(x)

    label = torch.tensor([0])
    loss = loss_func(y, label)

    loss.backward()
    opti.step()

    if rank == 0:
        print("grad:", model.module.weight.grad)
        print("updated weight:\n", model.module.weight)

def init_process(rank, size, fn, backend="gloo"):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 2
    process = []
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run))
        p.start()
        process.append(p)

    for p in process:
        p.join()
```

Both of these two pieces of code have the same output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47156

Reviewed By: mruberry

Differential Revision: D24675199

Pulled By: mrshenli

fbshipit-source-id: 1238a63350a32a824b4b8c0018dc80454ea502bb
2020-11-09 17:42:57 -08:00
Yi Wang
fccfe7bd1a [Gradient Compression] Add unit tests that test default Python comm hook implementations (#47158)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47158

1. Test the default Python comm hook implementations ALLREDUCE and FP16_COMPRESS, besides an ad-hoc all-reduce implementation.
2. Typo fix.
3. Reformat default_hooks.py.
4. Publish register_comm_hook API for DDP module (This should be done in a separate diff, but got merged unintentionally.)

The new style can be used for testing any new comm hook like PowerSGD easily.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

ghstack-source-id: 116012600

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D24669639

fbshipit-source-id: 048c87084234edc2398f0ea6f01f2f083a707939
2020-11-06 00:28:09 -08:00
Yi Wang
f91fcefc81 [Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks (#47270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47270

This is almost same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType should be imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/scrc/distributed/c10d/init.cpp cannot be imported. See https://github.com/pytorch/pytorch/issues/47153

I tried to follow another enum type enum type ReduceOp defined in the same file, but did not work, because the C++ enum class is defined torch/lib/c10d library, but BuiltinCommHookType is defined in torch/csrc/distributed library. These two libraries are compiled in two different ways.

To avoid adding typing to distributed package, which can be a new project, I simply removed the arg type of BuiltinCommHookType in this file.

To review the diff on top of #46959, compare V1 vs Latest:
https://www.internalfb.com/diff/D24700959?src_version_fbid=270445741055617

Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115783237

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl

//arvr/projects/eye_tracking/Masquerade:python_test

USE_DISTRIBUTED=0 USE_GLOO=0 BUILD_TEST=0 USE_CUDA=1 USE_MKLDNN=0 DEBUG=0 python setup.py install

Reviewed By: mrshenli

Differential Revision: D24700959

fbshipit-source-id: 69f303a48ae275aa856e6e9b50e12ad8602e1c7a
2020-11-03 18:33:50 -08:00
Yi Wang
b1b77148ac Back out "[Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks" (#47234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47234

Revert the diff because of https://github.com/pytorch/pytorch/issues/47153

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115720415

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D24691866

fbshipit-source-id: 58fe0c45943a2ae2a09fe5d5eac4a4d947586539
2020-11-02 20:51:18 -08:00
Yi Wang
ee0033af9b [Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks (#46959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46959

1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115629230

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl

Reviewed By: pritamdamania87

Differential Revision: D24471910

fbshipit-source-id: f96b752298549ea2067e2568189f1b394abcd99a
2020-10-30 23:19:42 -07:00
Rohan Varma
ecdbea77bc Fix DDP documentation (#46861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46861

Noticed that in the DDP documentation:
https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html?highlight=distributeddataparallel
there were some examples with `torch.nn.DistributedDataParallel`, fix this to
read `torch.nn.parallel.DistributedDataParallel`.
ghstack-source-id: 115453703

Test Plan: ci

Reviewed By: pritamdamania87, SciPioneer

Differential Revision: D24534486

fbshipit-source-id: 64b92dc8a55136c23313f7926251fe825a2cb7d5
2020-10-29 09:13:47 -07:00
Rohan Varma
7245d2c939 Avoid scatter for single-device case in DDP (#46304)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46304

In the case that a single process operates only on one GPU, we can
avoid this scatter and instead replace it with a recursive version of `to`
which transfers the input tensors to the correct device.

The implementation of `_recursive_to` is modeled after `scatter` in https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/scatter_gather.py, in order to keep parity with the previous conventions (i.e. custom types not having their tensors moved).
ghstack-source-id: 114896677

Test Plan: Added unittest, and CI

Reviewed By: pritamdamania87

Differential Revision: D24296377

fbshipit-source-id: 536242da05ecabfcd36dffe14168b1f2cf58ca1d
2020-10-22 08:29:37 -07:00
Alexander Grund
5b0f400488 Replace list(map(...)) constructs by list comprehensions (#46461)
Summary:
As discussed in https://github.com/pytorch/pytorch/issues/46392 this makes the code more readable and possibly more performant.

It also fixes a bug detected by this where the argument order of `map` was confused: 030a24906e (diff-5bb26bd3a23ee3bb540aeadcc0385df2a4e48de39f87ed9ea76b21990738fe98L1537-R1537)

Fixes https://github.com/pytorch/pytorch/issues/46392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46461

Reviewed By: ailzhang

Differential Revision: D24367015

Pulled By: ezyang

fbshipit-source-id: d55a67933cc22346b00544c9671f09982ad920e7
2020-10-19 18:42:49 -07:00
Emilio Castillo
d38a71d579 torch.nn.modules.LazyModuleMixin and torch.nn.LazyLinear (Shape Inference II) (#44538)
Summary:
Retake on https://github.com/pytorch/pytorch/issues/40493 after all the feedback from albanD

This PR implements the generic Lazy mechanism and a sample `LazyLinear` layer with the `UninitializedParameter`.

The main differences with the previous PR are two;
Now `torch.nn.Module` remains untouched.
We don't require an explicit initialization or a dummy forward pass before starting the training or inference of the actual module. Making this much simpler to use from the user side.

As we discussed offline, there was the suggestion of not using a mixin, but changing the `__class__` attribute of `LazyLinear` to become `Linear` once it's completely initialized. While this can be useful, by the time being we need `LazyLinear` to be a `torch.nn.Module` subclass since there are many checks that rely on the modules being instances of `torch.nn.Module`.
This can cause problems when we create complex modules such as
```
class MyNetwork(torch.nn.Module):
    def __init__(self):
        super(MyNetwork, self).__init__()
        self.conv = torch.nn.Conv2d(20, 4, 2)
        self.linear = torch.nn.LazyLinear(10)
    def forward(self, x):
        y = self.conv(x).clamp(min=0)
        return self.linear(y)
```
Here, when the __setattr__ function is called at the time LazyLinear is registered, it won't be added to the child modules of `MyNetwork`, so we have to manually do it later, but currently there is no way to do such thing as we can't access the parent module from LazyLinear once it becomes the Linear module. (We can add a workaround to this if needed).

TODO:

Add convolutions once the design is OK
Fix docstrings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44538

Reviewed By: ngimel

Differential Revision: D24162854

Pulled By: albanD

fbshipit-source-id: 6d58dfe5d43bfb05b6ee506e266db3cf4b885f0c
2020-10-19 13:13:54 -07:00
Rohan Varma
181afd5220 Add an option to DDP to take a list of parameters to ignore upfront. (#44826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44826

As described in https://github.com/pytorch/pytorch/issues/43690, there
is a need for DDP to be able to ignore certain parameters in the module (not
install allreduce hooks) for certain use cases. `find_unused_parameters` is
sufficient from a correctness perspective, but we can get better performance
with this upfront list if users know which params are unused, since we won't
have to traverse the autograd graph every iteration.

To enable this, we add a field `parameters_to_ignore` to DDP init and don't
pass in that parameter to reducer if that parameter is in the given list.
ghstack-source-id: 113210109

Test Plan: Added unittest

Reviewed By: xw285cornell, mrshenli

Differential Revision: D23740639

fbshipit-source-id: a0411712a8b0b809b9c9e6da04bef2b955ba5314
2020-09-30 11:52:50 -07:00
Shen Li
c5ade5f698 Fix no_sync docs (#45455)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45455

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23973365

Pulled By: mrshenli

fbshipit-source-id: 87c9878cdc7310754670b83efa65ae6f877f86fb
2020-09-28 20:48:09 -07:00
Shen Li
6967e6295e Fix DDP docs (#45454)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45454

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23973367

Pulled By: mrshenli

fbshipit-source-id: 11f20d51d0d0f92f199e4023f02b86623867bae0
2020-09-28 20:43:22 -07:00
Yanli Zhao
c6500bcf14 [reland] Make grad point to bucket buffer in DDP to save memory usage (#44344)
Summary:
[test all]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44344

reland #41954

Add one argument in DDP API to enable/disable letting grads pointing  to views. When it is disabled, behavior is the same as DDP right now; when it is enabled, Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage.
In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we
made changes in #41283.

Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to
keep grad undefined for unused parameters.
ghstack-source-id: 112845787

Test Plan:
1. When grad_is_view=false:
a. roberta_base, peak memory usage 8250MB, p50 per iteration latency 0.923second, https://www.internalfb.com/intern/fblearner/details/218029699/?notif_channel=cli
b. resnet, peak memory usage 3089MB, p50 per iteration latency 0.120second, https://www.internalfb.com/intern/fblearner/details/218029035/?notif_channel=cli
c. accuracy benchmark, distributed=false, .accuracy 40.914535522461, .loss: 1.6370717287064; distributed=true, .accuracy: 39.966053009033, .loss: 1.6849111318588
https://www.internalfb.com/intern/fblearner/details/218035688/?notif_channel=cli
d. classy vision uru production flow, https://www.internalfb.com/intern/fblearner/details/219065811/?notif_channel=cli
e. pytext flow, https://www.internalfb.com/intern/fblearner/details/219137458/?notif_channel=cli

2. When grad_is_view=true:
a. roberta_base, peak memory usage 7183MB, p50 per iteration latency 0.908second, https://www.internalfb.com/intern/fblearner/details/217882539?tab=operator_details
b. resnet, peak memory usage 2988 MB, p50 per iteration latency 0.119second, https://www.internalfb.com/intern/fblearner/details/218028479/?notif_channel=cli
c. accuracy benchmark, distributed=false, .accuracy 41.713260650635, .loss: 1.69939661026; distributed=true, .accuracy: 39.966053009033, .loss: 1.6849111318588, https://www.internalfb.com/intern/fblearner/details/218037058/?notif_channel=cli
d. classy vision uru production flow, expected, can not work well with apex.amp https://www.internalfb.com/intern/fblearner/details/219205218/?notif_channel=cli
e. pytext flow, detach_() related error, expected, as pytext zero_grad depends on apex repo where detach_() is called. also seeing the warning in finalize_bucket_dense due to tied weights, which is expected. https://www.internalfb.com/intern/fblearner/details/219150229/?notif_channel=cli

Reviewed By: mrshenli

Differential Revision: D23588186

fbshipit-source-id: f724d325b954ef6f06ede31759bf01dd29a6f5e5
2020-09-24 20:54:51 -07:00
Rohan Varma
e57a08119b Add a warning log when there is high skew of uneven inputs in DDP training (#45238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45238

Adds a warning when there is much higher than expected amount of
discrepancy of inputs across different processes when running with uneven
inputs. This is because a skew in the thousands can reduce performance a
nontrivial amount as shown in benchmarks, and it was proposed to add this
warning as a result. Tested by running the tests so the threshold is hit and
observing the output.
ghstack-source-id: 112773552

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23719270

fbshipit-source-id: 306264f62c1de65e733696a912bdb6e9376d5622
2020-09-24 09:50:44 -07:00
Bugra Akyildiz
1b059f2c6d Directly use work.result() to retrieve tensor rather than passing as a separate argument (#44914)
Summary:
We currently are fetching an allreduced tensor from Python in C++ in, where we are storing the resulting tensor in a struct's parameter. This PR removes extra tensor paratemeter in the function parameter and fetch from a single place.

Fixes https://github.com/pytorch/pytorch/issues/43960

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44914

Reviewed By: rohan-varma

Differential Revision: D23798888

Pulled By: bugra

fbshipit-source-id: ad1b8c31c15e3758a57b17218bbb9dc1f61f1577
2020-09-22 06:28:47 -07:00
Yanli Zhao
e14b2080be [reland] move rebuild buckets from end of first iteration to beginning of second iteration (#44798)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44798

[test all]

Update for relanding: in ddp.join(), moved _rebuild_buckets from end of backward to beginning of forward as well.

Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration
ghstack-source-id: 112279261
ghstack-source-id: 112279261

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D23735185

fbshipit-source-id: c26e0efeecb3511640120faa1122a2c856cd694e
2020-09-17 17:10:21 -07:00
Ailing Zhang
fb085d90e3 Revert D23583017: move rebuild buckets from end of first iteration to beginning of second iteration
Test Plan: revert-hammer

Differential Revision:
D23583017 (f5d231d593)

Original commit changeset: ef67f79437a8

fbshipit-source-id: fd914b7565aba6a5574a32b31403525abb80ff07
2020-09-15 15:10:52 -07:00
Yanli Zhao
f5d231d593 move rebuild buckets from end of first iteration to beginning of second iteration (#44326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44326

Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration
ghstack-source-id: 112011490

Test Plan: unit tests

Reviewed By: mrshenli

Differential Revision: D23583017

fbshipit-source-id: ef67f79437a820d9b5699b651803622418499a83
2020-09-15 09:51:33 -07:00
Yi Wang
ace81b6794 Remove an extra empty line in the warning comments. (#44622)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44622

Remove an extra empty line in the warning comments.Remove an extra empty line.

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D23674070

fbshipit-source-id: 4ee570590c66a72fb808e9ee034fb773b833efcd
2020-09-14 11:15:35 -07:00