Commit Graph

250 Commits

Author SHA1 Message Date
ProGamerGov
71d50f4f89 Change docstring type callable to Callable for consistency (#82487)
### Description

Across PyTorch's docstrings, both `callable` and `Callable` are used for variable types. `Callable` should be capitalized, as we are referring to the `Callable` type and not the Python `callable()` function.
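A hedged sketch of the convention this standardizes on (the function below is hypothetical; only the docstring style matters):

```
from typing import Callable, Optional

def register_hook(fn: Callable, cleanup: Optional[Callable] = None) -> None:
    """Register a hook.

    Args:
        fn (Callable): hook invoked on every call.
        cleanup (Callable, optional): hook invoked on removal. Default: ``None``.
    """
    ...
```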

### Testing

There shouldn't be any testing required.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82487
Approved by: https://github.com/albanD
2022-08-01 17:26:09 +00:00
anjali411
3bcc19b29a Add __all__ to various submodules in torch.fx, distributions, distributed, package (#80367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80367
Approved by: https://github.com/albanD
2022-06-27 21:27:30 +00:00
Rohan Varma
e7cb44b6c4 Guard distributed imports (#77727)
Move the distributed import after the `dist.is_available()` check to fix builds with `USE_DISTRIBUTED=0`. Note that this issue is not caught by any CI at the moment.
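A minimal sketch of the pattern (the guarded symbol is just an example):

```
import torch.distributed as dist

# defer distributed-only imports until after the availability check so the
# module still imports when PyTorch is built with USE_DISTRIBUTED=0
if dist.is_available():
    from torch.distributed import ReduceOp
```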

Closes https://github.com/pytorch/pytorch/issues/77704
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77727
Approved by: https://github.com/malfet
2022-05-18 11:27:52 +00:00
Rohan Varma
6f954d7bbb FSDP parameter sync
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77492

Approved by: https://github.com/zhaojuanmao
2022-05-17 19:58:49 +00:00
Rohan Varma
bbb1f106c7 Separate input moving to utils file
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77187

Test fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77235

Lint fix

Approved by: https://github.com/awgu
2022-05-11 21:55:38 +00:00
Rohan Varma
ffb0946504 Generalize param verification and broadcast
New PR for https://github.com/pytorch/pytorch/pull/75970 to be compatible with GHF.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76374
Approved by: https://github.com/awgu
2022-04-26 22:25:53 +00:00
pritam
b26df43f15 Fix bug where __getstate__ of DDP looks for self._replicated_tensor_module
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76349

When we are not using ReplicatedTensor in DDP and try to save a DDP module, it errors out because `__getstate__` tries to delete the `_replicated_tensor_module` attribute.

Fix this by checking whether the mode is enabled before triggering the delete.
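A sketch of the guard (attribute name per the description above; the exact condition used inside DDP may differ):

```
def __getstate__(self):
    state = self.__dict__.copy()
    # only drop the replica module if the replicated-tensor mode actually created it
    if state.get("_replicated_tensor_module") is not None:
        del state["_replicated_tensor_module"]
    return state
```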

Differential Revision: [D35875167](https://our.internmc.facebook.com/intern/diff/D35875167/)

Approved by: https://github.com/mrshenli, https://github.com/zhaojuanmao
2022-04-26 02:49:49 +00:00
pritam
3a38f175dd Convert DDP parameters to ReplicatedTensor during forward pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75753

As per the design in https://github.com/pytorch/pytorch/issues/72138,
convert DDP parameters to ReplicatedTensor during its forward pass. Concretely,
this is done as follows:

1) Create a separate `_replicated_tensor_module` which is a copy of self.module
without creating copies of the Tensors themselves.
2) Use `_replicated_tensor_module` instead of `self.module` during the forward
pass.
3) Have a context manager `_ddp_replicated_tensor` to enable this, since certain edge cases can fail where self.module is changed out of band, resulting in a discrepancy between self.module and `_replicated_tensor_module`.
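Step 1 above (copying the module without copying its tensors) can be illustrated with a small sketch: deep-copy the module structure while seeding the deepcopy memo so parameter and buffer tensors are shared rather than cloned. This is an illustration of the idea, not DDP's actual code:

```
import copy
import torch.nn as nn

def copy_module_sharing_tensors(module: nn.Module) -> nn.Module:
    # Pre-populate the memo with id(tensor) -> tensor so deepcopy reuses the
    # original parameter/buffer storage instead of cloning it.
    memo = {id(t): t for t in list(module.parameters()) + list(module.buffers())}
    return copy.deepcopy(module, memo)
```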

Differential Revision: [D35533736](https://our.internmc.facebook.com/intern/diff/D35533736/)

Approved by: https://github.com/wanchaol, https://github.com/rohan-varma
2022-04-18 03:27:23 +00:00
Junjie Wang (PyTorch)
0a6ac31797 [PT-D][DDP][BE] Add unit tests for Forward and Backward Hook (#74063)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74063

Address issue https://github.com/pytorch/pytorch/issues/66229 as part of the BE effort.

Basically:
1. Remove the stale comment that confuses users.
2. Add more unit tests verifying that forward/backward hooks work with DDP.
ghstack-source-id: 151463380

Test Plan: CI

Reviewed By: rohan-varma

Differential Revision: D34800830

fbshipit-source-id: 21133209323b2b5eda0dd6472f6309d4fb779b97
(cherry picked from commit b9b165c8305572128395daffafc13fcac8b85e29)
2022-03-16 23:18:28 +00:00
Shihao Xu
bcd0843bec [torch.distributed][DDP] Disable DDP bucketing for the first iteration (#72843)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72843

# [Debug Story] Training Hanging and DDP Bucketing

**What are the characteristics of the hanging training instance?**

The model uses TorchRec `PooledEmbeddingArch` and corresponding sharding solution.

The model config difference to trigger this hanging issue is turning on position weighted embedding tables.

A feature processor module, `GroupedPositionWeightedModule`, is constructed on all ranks, but `GroupedPositionWeightedModule.forward(...)` is only [called on a subset of ranks in the whole world](https://fburl.com/code/yqrmtvli).

**What was the initial manifested error?**

The training was stuck in the first iteration.

**What are useful debugging tools this time?**

After turning off [static_graph in DDP](https://fburl.com/code/4io81p5i), we saw sparse feature lengths becoming negative after all-to-all collectives; the hang turned into a fatal failure.

After turning on [torch.distributed DETAIL debugging mode](https://fburl.com/code/cp8e28mm), we saw 2 trainers sending out mismatched collectives, one doing an all-to-all and the other an all-reduce. So we knew the negative values came from an all-to-all being matched with an all-reduce; the real error had happened earlier, namely the wrong timing of the all-reduce or the all-to-all.

With more logging added inside DDP, it turned out that DDP decided to do the all-reduce at different times on different ranks.

**What is DDP bucketing?**

Once a gradient is ready on a rank, DDP uses all-reduce to synchronize the average of this gradient across all ranks.

Say we have 4 tensor ops. A, B, C, D.

In the most naive version, we could do one synchronization when all gradients in the full backward graph are ready.

The time sequence would be,

* D.grad
* C.grad
* B.grad
* A.grad
* All reduce on [D.grad, C.grad, B.grad, A.grad].

But that would be a huge waste of communication channel bandwidth.

With DDP bucketing, we can start some gradient synchronization earlier, bucket by bucket. The above time sequence now becomes,

* D.grad
* C.grad
* All reduce on [D.grad, C.grad].
* B.grad
* A.grad
* All reduce on [B.grad, A.grad].

With gradient computation overlapping communication, the bucketing technique brings better DDP execution performance.
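For reference, the bucket size that controls this batching is exposed on the DDP constructor (a minimal sketch; `model` and `rank` are placeholders):

```
from torch.nn.parallel import DistributedDataParallel as DDP

# Gradients are grouped into ~25 MB buckets; each bucket is all-reduced as soon
# as every gradient inside it is ready, overlapping communication with the
# remaining backward computation.
ddp_model = DDP(model.cuda(rank), device_ids=[rank], bucket_cap_mb=25)
```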

**What exactly went wrong in this case?**

1. The bucketing doesn’t honor backward graph execution order.
2. There are other collectives comm ops in backward graph.
3. There are unused parameters (i.e. an unused sub-module) on a subset of ranks in the whole world.

Using the above example again, we have 4 tensor ops. A, B, C, D.

Say we have 2 trainers,

B is the feature processor module.

B only runs on trainer_0 (both forward and backward), but not on trainer_1.

C is the All-to-all (Pooled embeddings distribution).

C sends out all-to-all collective in both its forward and backward pass.

Keep assuming all other ops run on both trainers.

trainer_0 op sequence is,

A, B (feature preproc), C (all-to-all), D | D.grad, C.grad (reverse all-to-all), B.grad (feature proc grads), A.grad

trainer_1 op sequence is,

A, C (all-to-all), D | D.grad, C.grad (reverse all-to-all), A.grad

Even though the correct bucketing should be (same bucketing for both ranks),

* bucket_0, [D.grad, C.grad]
* bucket_1, [B.grad, A.grad]

but because of 1), they actually end up as,

* bucket_0, [B.grad, D.grad]
* bucket_1, [C.grad, A.grad]

Add 2) and 3), and the time sequence could look like,

(check mark represents the gradient is ready)

(bucket is ready to do synchronization if all its enclosing gradients are ready)

* trainer_0
   * t0,
      * D.grad
      * bucket_0, [B.grad, D.grad ✓]
   * t1,
      * **C.grad all-to-all**
      * C.grad ✓
      * bucket_1, [C.grad ✓, A.grad]
   * t2
      * B.grad
      * bucket_0, [B.grad ✓, D.grad ✓] ✓
   * t3
      * All-reduce for bucket_0
   * t4
      * A.grad
      * bucket_1, [C.grad ✓, A.grad ✓] ✓
* trainer_1
   * t0,
      * D.grad
      * bucket_0, [B.grad ✓, D.grad ✓] ✓. (Because B is not used on trainer_1, DDP marks its gradient as ready immediately.)
   * t1,
      * **All-reduce for bucket_0**
   * t2
      * C.grad all-to-all
      * bucket_1, [C.grad ✓, A.grad]
   * t3
      * A.grad
      * bucket_1, [C.grad ✓, A.grad ✓] ✓

This is why trainer_0 all-to-all is matched up with trainer_1 all-reduce.

**What is the solution for fixing DDP?**

Disable DDP bucketing for the first iteration. D34051938

This is because after the first iteration, buckets will be built again based on real backward graph execution order.

So the slow gradient synchronization only affects the first iteration.

Test Plan:
buck build mode/dev-nosan caffe2/test/distributed:distributed_gloo_spawn
BACKEND=gloo WORLD_SIZE=3 buck-out/gen/caffe2/test/distributed/distributed_gloo_spawn\#binary.par -r test_ddp_logging_data_cpu

P484179296

buck build mode/dev-nosan caffe2/test/distributed:distributed_nccl_spawn
BACKEND=nccl WORLD_SIZE=2 buck-out/gen/caffe2/test/distributed/distributed_nccl_spawn\#binary.par -r test_ddp_logging_data_cpu -r test_ddp_get_bucket_sizes
P484177200

Reviewed By: zhaojuanmao

Differential Revision: D34051938

fbshipit-source-id: 0c7f35875687095c3199f19990e73a8349b6e5b9
(cherry picked from commit bb8f11306ea51c2bd3ffd3ab001d62ce369a08ee)
2022-03-04 18:29:36 +00:00
Can Balioglu
e1db2f13ce Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166

This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
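A usage sketch of the new APIs (assuming they are exposed under `torch.distributed`, per the description above):

```
import os
import torch.distributed as dist

os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
dist.set_debug_level_from_env()              # re-read the env var after process start
print(dist.get_debug_level())                # e.g. DebugLevel.DETAIL
dist.set_debug_level(dist.DebugLevel.INFO)   # lower the verbosity at runtime
```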
ghstack-source-id: 149778566

Test Plan: Run the existing unit tests.

Reviewed By: rohan-varma

Differential Revision: D34371226

fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
2022-02-24 02:33:05 +00:00
Andrew Gu
59dd84cab6 [Join][BE] Fix typo; remove obsolete method (#72886)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72886

**Test Plan**
Searching for `_schedule_shadow_all_reduce_for_fwd_pass` shows that it is defined but never used.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D34255651

Pulled By: awgu

fbshipit-source-id: 205a0325c2cdc05e127a183cb86fa2fc2e0db99d
(cherry picked from commit 4492f03a3f)
2022-02-16 15:03:09 +00:00
Yuxin Wu
1ed4653e89 Stop writing logs to root logger (#72649)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/72648
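The fix pattern, roughly: emit through a module-level named logger instead of the root logger, so library logging does not pollute the application's root handlers (a sketch; the function is hypothetical):

```
import logging

logger = logging.getLogger(__name__)  # named, module-level logger

def rebuild_buckets():
    logger.info("Rebuilding DDP buckets")  # not logging.info(...), which targets the root logger
```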

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72649

Reviewed By: soulitzer

Differential Revision: D34172113

Pulled By: mrshenli

fbshipit-source-id: 98cb4140b978a0d9fa53876e427ea3b8bbe884cf
(cherry picked from commit c14297cee6)
2022-02-11 21:30:53 +00:00
Rohan Varma
4feef6c970 Log static graph in constructor if it is set (#72456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72456

It is easier to log whether static graph is set at construction time, now that it is natively supported in the DDP constructor, as opposed to waiting for the first iteration to finish. In some failure cases we're seeing that the first iteration does not finish, and thus we don't have this data, which is valuable for debugging.
ghstack-source-id: 148840679

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D34045204

fbshipit-source-id: 72a187c1ce031db217de4b3ad20a64f2a74995bc
(cherry picked from commit 1d622c88f3)
2022-02-11 15:55:09 +00:00
Rohan Varma
37651894f9 [Easy] Small DDP fixes (#72455)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72455

- Improve helper function
- Improve/fix some logging
ghstack-source-id: 148840678

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D34044865

fbshipit-source-id: d2ae820effaaaecdd7155ffa8d3a1d8ebbd9f39e
(cherry picked from commit 3efbea8f41)
2022-02-11 15:55:09 +00:00
Rohan Varma
1c8fcc44cb [Opt Overlap] Support optimizing partial set of parameters (#71608)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71608

Per title
ghstack-source-id: 147577178

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33696382

fbshipit-source-id: 5b638d3edf5f03ba476356d61e96ca604de18c8f
(cherry picked from commit 436b547fb0)
2022-01-26 19:33:49 +00:00
Rohan Varma
d3354602fc [Easy] DDP typo fix (#71607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71607

Per title
ghstack-source-id: 147577177

Test Plan: N/a

Reviewed By: cbalioglu

Differential Revision: D33694038

fbshipit-source-id: 5a5a618f13bc8b91127169efcebb90b5a36474a1
(cherry picked from commit 62f17f116d)
2022-01-26 07:32:04 +00:00
Rohan Varma
10ca760c0a [Opt Overlap] Implement register_fused_optim in DDP (#71606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71606

Per title
ghstack-source-id: 147577172

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33694037

fbshipit-source-id: a148d5ce6031f0cc20f33785cfe2c27d1fc2d682
(cherry picked from commit ace3261e0c)
2022-01-26 07:32:04 +00:00
Yanli Zhao
4b3cf1eaf7 [BE]Clarify how to check memory saving if using gradient_as_bucket_view (#71483)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71483

Clarify that peak memory savings should be checked after the first iteration when using gradient_as_bucket_view
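A sketch of where to measure (`model`, `loader`, `criterion`, `opt`, and `rank` are placeholders):

```
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

ddp = DDP(model.cuda(rank), device_ids=[rank], gradient_as_bucket_view=True)
for step, (x, y) in enumerate(loader):
    loss = criterion(ddp(x), y)
    loss.backward()
    opt.step()
    opt.zero_grad()
    if step == 1:
        # gradients only become views into the communication buckets after the
        # first iteration, so check peak memory from here on
        print(torch.cuda.max_memory_allocated() / 2**20, "MiB")
```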
ghstack-source-id: 147271113

Test Plan: unit test

Reviewed By: rohan-varma

Differential Revision: D33662424

fbshipit-source-id: f760da38e166ae85234e526ddf1526269ea25d42
(cherry picked from commit a40dda20da)
2022-01-20 19:38:41 +00:00
Yanli Zhao
1c61d8c43f [PT1.11] make static graph to be stable (#71459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71459

1. add the static_graph feature to the DDP constructor (usage sketched after this list);
2. still keep the _set_static_graph() API so that existing use cases are not affected; it can also be called internally by the DDP constructor
3. four cases are covered:
    static_graph = False, _set_static_graph() is called;
    static_graph = False, _set_static_graph() is not called;
    static_graph = True, _set_static_graph() is not called;
    static_graph = True, _set_static_graph() is called;
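A minimal sketch of the two spellings (`model` and `rank` are placeholders):

```
from torch.nn.parallel import DistributedDataParallel as DDP

# new constructor argument
ddp = DDP(model, device_ids=[rank], static_graph=True)

# existing private API, still accepted for backward compatibility
ddp_legacy = DDP(model, device_ids=[rank])
ddp_legacy._set_static_graph()
```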
ghstack-source-id: 147263797

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D33646738

fbshipit-source-id: 8c1730591152aab91afce7133d2adf1efd723855
(cherry picked from commit dc246a1129)
2022-01-20 19:38:41 +00:00
Rohan Varma
fcd1375b2b [DDP][BE][Docs] Clarify checkpoint support (#68827)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68827

Add a note about current checkpoint support with DDP. Note that this does not include the features enabled with _set_static_graph yet, as it is an undocumented private API. Once we support static graph as a beta feature in OSS, we can add to the note here.
ghstack-source-id: 144285041

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D32624957

fbshipit-source-id: e21d156a1c4744b6e2a807b5b5289ed26701886f
2021-11-30 12:37:37 -08:00
Santiago Castro
f776f30780 Keep the sequence or mapping type in default_collate (#68779)
Summary:
`default_collate`, `default_convert`, and `pin_memory` convert sequences into lists. I believe they should keep the original type when possible (e.g., I have a class that inherits from `list`, which comes from a 3rd-party library that I can't change and provides extra functionality).

Note it's easy to do when the type can be constructed from an iterable, but that's not always the case (e.g., `range`).

Even though this can be accomplished with a custom `default_collate`/`default_convert`, 1) this is behavior they should support out of the box IMHO, and 2) `pin_memory` still does it.
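A rough illustration of the intended behavior (`TaggedList` is a hypothetical subclass introduced for the example, and the import path of `default_collate` varies between releases):

```
import torch
from torch.utils.data._utils.collate import default_collate  # import path may differ by version

class TaggedList(list):  # hypothetical list subclass from a 3rd-party library
    pass

batch = [TaggedList([torch.tensor(1), torch.tensor(2)]),
         TaggedList([torch.tensor(3), torch.tensor(4)])]

out = default_collate(batch)
# With this change, the element-wise collated result is expected to keep the
# original sequence type (TaggedList) instead of degrading to a plain list,
# whenever the type can be constructed from an iterable.
print(type(out), out)
```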

cc VitalyFedyunin ejguan NivekT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68779

Reviewed By: wenleix

Differential Revision: D32651129

Pulled By: ejguan

fbshipit-source-id: 17c390934bacc0e4ead060469cf15dde815550b4
2021-11-29 13:14:20 -08:00
Yifan Xiong
c7eaec86f0 [NCCL] Patch bfloat16 support (#67843)
Summary:
Patch bfloat16 support in NCCL. PR https://github.com/pytorch/pytorch/issues/63260 adds bfloat16 support but is still not complete enough to enable bfloat16 for allreduce in end-to-end training.

This patch does the following:
* fix the minimum NCCL version from 2.9.7 to 2.10; NCCL adds bf16 support in
  v2.10.3-1 (commit 7e51592)
* update the bfloat16 datatype flag in `csrc/cuda/nccl.cpp` so that NCCL
  operations like all-reduce can use it
* enable unit tests for the bfloat16 datatype where possible
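With the patch, a bf16 all-reduce goes through NCCL's native bfloat16 path. A minimal sketch (assumes an initialized NCCL process group and a recent enough NCCL/CUDA stack, per the notes above):

```
import torch
import torch.distributed as dist

# requires NCCL >= 2.10 and an initialized NCCL process group
t = torch.ones(1024, dtype=torch.bfloat16, device="cuda")
dist.all_reduce(t)  # summed across ranks using NCCL's native bf16 datatype
```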

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67843

Reviewed By: H-Huang

Differential Revision: D32248132

Pulled By: mrshenli

fbshipit-source-id: 081e96e725af3b933dd65ec157c5ad11c6873525
2021-11-09 13:46:13 -08:00
James Reed
80178d6152 [DDP] Fix some issues with code example in DDP docstring (#67883)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67883

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: zhaojuanmao

Differential Revision: D32190946

Pulled By: jamesr66a

fbshipit-source-id: a376324b95cbe833ffa606ecdfc6156432880f70
2021-11-05 17:32:45 -07:00
Rohan Varma
bff64e84cd [DDP] Track models with sync bn (#66680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66680

Closes https://github.com/pytorch/pytorch/issues/66215. Tracks models with sync BN so we can find workflows that use them and target them for perf optimization.
ghstack-source-id: 140875182

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D31679477

fbshipit-source-id: 0e68cd1a7aabbc5b26227895c53d33b8e98bfb8e
2021-10-18 22:31:52 -07:00
Rohan Varma
38f5144eae Fix https://github.com/pytorch/pytorch/issues/61982 (#66015)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66015

Fixes https://github.com/pytorch/pytorch/issues/61982 by cloning tensors in DDPSink. This only applies once for static_graph and, in general, for unused params, which already carry overhead, so the perf hit should not be an issue. Will verify with a benchmark.

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D31346633

fbshipit-source-id: 5b9245ade628565cffe01731f6a0dcbb6126029b
2021-10-07 18:11:18 -07:00
Rohan Varma
71704349aa [DDP] Allow await of custom buffer reduction in backward (#64515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64515

For performance reasons, we would like to ensure that we can await user collectives issued as part of custom buffer reduction in parallel with other work. As a result, add support for returning futures from custom buffer hooks and awaiting those futures at the end of the backward pass.

Also added some docs to clarify how to use these APIs.
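A rough sketch of a buffer hook that returns a future, so DDP can await the reduction at the end of backward. The registration entry point (`_register_buffer_comm_hook`) is a private API, and the hook input shape shown here (a name-to-tensor mapping) is an assumption for illustration only:

```
import torch
import torch.distributed as dist

def buffer_allreduce_hook(state, named_buffers):
    # named_buffers: assumed mapping of buffer name -> tensor
    futs = [dist.all_reduce(buf, async_op=True).get_future()
            for buf in named_buffers.values()]
    # DDP awaits the returned future at the end of the backward pass
    return torch.futures.collect_all(futs)

# assumed private registration API; the real signature may differ
ddp_model._register_buffer_comm_hook(state=None, hook=buffer_allreduce_hook)
```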
ghstack-source-id: 138793803

Test Plan: I

Reviewed By: zhaojuanmao

Differential Revision: D30757761

fbshipit-source-id: e1a2ead9ca850cb345fbee079cf0614e91bece44
2021-09-23 13:02:53 -07:00
Wanchao Liang
2f67579864 [ddp] use named_params and named_buffers explicitly (#65181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65181

This PR changes the sync path from `state_dict()` to explicit `named_parameters` and `named_buffers`. The underlying motivation is that `state_dict()` does not necessarily equal "params + buffers" in all cases: state_dict is used mainly for checkpointing, while params/buffers are used for training, and the two can take different forms (i.e., we might want to save the state_dict as small pieces of tensors while concatenating the tensors together during training for performance reasons).
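A minimal sketch of the idea (illustrative only; DDP's real sync uses coalesced broadcasts rather than per-tensor calls):

```
import torch.distributed as dist

def broadcast_module_state(module, src=0):
    # Broadcast the live training tensors (parameters + buffers) explicitly,
    # instead of iterating over state_dict(), which is a checkpointing view
    # and may be organized differently.
    for _, param in module.named_parameters():
        dist.broadcast(param.data, src=src)
    for _, buf in module.named_buffers():
        dist.broadcast(buf, src=src)
```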
ghstack-source-id: 138701159

Test Plan: wait for ci

Reviewed By: divchenko, rohan-varma

Differential Revision: D31007085

fbshipit-source-id: 4e1c4fbc07110163fb9b09b043ef7b4b75150f18
2021-09-22 17:32:54 -07:00
Rohan Varma
5739f77775 [DDP] Refactor and remove sync_params (#64514)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64514

sync_params is a misnomer since we don't actually synchronize parameters. While removing it, I realized `self._check_and_sync_module_buffers` does almost everything we need it to, so I just refactored that and made the DDP forward call into it.
ghstack-source-id: 138684982

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D30751231

fbshipit-source-id: add7c684f5c6c71dad9e9597c7759849fa74f47a
2021-09-22 14:12:51 -07:00
Rohan Varma
ce5981e431 [DDP] Custom buffer reduction (#64513)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64513

Proposal: https://github.com/pytorch/pytorch/issues/63041
Support custom buffer reduction in DDP via hook
ghstack-source-id: 138655663

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D30751152

fbshipit-source-id: 257a9d46bb178d8812d4ea5a4d9c6140b8a1791f
2021-09-22 14:11:35 -07:00
Jessica Choi
f24bd43375 Changing type and name of local_used_maps to reflect that it is only one map (#65380)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65380

Fixing bugs that arise when running setup.py develop

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang gcramer23

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D31104844

Pulled By: jaceyca

fbshipit-source-id: acfd4cf316c71177df758ca55b470f51a17f776b
2021-09-22 11:35:33 -07:00
Jessica Choi
158b8bdc8a Cleaning up DDP SPMD in reducer.cpp (#64113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64113

Since there is only one model replica per process, `replicas`
can be simplified from `std::vector<std::vector<at::Tensor>>` to
`std::vector<at::Tensor>` in the Reducer class.

Test Plan:
All tests are passing
`pytest test/distributed/test_c10d_gloo.py -vs`

Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30615965

fbshipit-source-id: d2ec809d99b788c200b01411333e7dbad1269b51
2021-09-21 16:13:18 -07:00
Rohan Varma
45bd0f6181 Back out "Revert D30745960: [DDP] Remove SPMD from self.modules_buffers" (#64778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64778

Original commit changeset: d3f3fb813c45
ghstack-source-id: 138326910

Test Plan: ci

Reviewed By: H-Huang

Differential Revision: D30849443

fbshipit-source-id: 15dab8a959a29d2e2fefac6ad52b8d8168eacc02
2021-09-17 12:28:36 -07:00
Rohan Varma
70f286c1e2 Back out "Revert D30745961: [DDP] Remove self.modules_params" (#64777)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64777

Original commit changeset: 59f7cc50d369
ghstack-source-id: 138326909

Test Plan: ci

Reviewed By: H-Huang

Differential Revision: D30849442

fbshipit-source-id: bb87ba83935374d8a3ebbc29365df1417dd4f26f
2021-09-17 12:28:34 -07:00
Rohan Varma
61dfcbf4bc Back out "Revert D30745921: [DDP] Fix when buffers are reassigned in module" (#64776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64776

Original commit changeset: 343ead86bf1e
ghstack-source-id: 138326914

Test Plan: ci

Reviewed By: H-Huang

Differential Revision: D30849444

fbshipit-source-id: 9a72805416fe7d6c68e51bdcdb88f6e1fecb614d
2021-09-17 12:28:32 -07:00
Howard Huang
459653a0f6 Revert D30745921: [DDP] Fix when buffers are reassigned in module
Test Plan: revert-hammer

Differential Revision:
D30745921 (d59ecc02df)

Original commit changeset: 25eb1edbf445

fbshipit-source-id: 343ead86bf1e2d0b2d4124be331ea2fa437303ad
2021-09-09 08:23:16 -07:00
Howard Huang
5bc53ac5ef Revert D30745961: [DDP] Remove self.modules_params
Test Plan: revert-hammer

Differential Revision:
D30745961 (8c09510294)

Original commit changeset: 32d102502570

fbshipit-source-id: 59f7cc50d369b6cc2856cf4ebd0f58b96202336d
2021-09-09 08:23:14 -07:00
Howard Huang
f1aaf8afcd Revert D30745960: [DDP] Remove SPMD from self.modules_buffers
Test Plan: revert-hammer

Differential Revision:
D30745960 (1553259520)

Original commit changeset: 66a8f9847e9f

fbshipit-source-id: d3f3fb813c45ac1b0ff15c6154b2e99e5dbab433
2021-09-09 08:22:12 -07:00
Rohan Varma
1553259520 [DDP] Remove SPMD from self.modules_buffers (#64474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64474

No need for a nested list here.
ghstack-source-id: 137526312

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D30745960

fbshipit-source-id: 66a8f9847e9fe1e02c51b79647e93bf7665cf4d9
2021-09-08 19:16:15 -07:00
Rohan Varma
8c09510294 [DDP] Remove self.modules_params (#64473)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64473

Unused after SPMD was deprecated.
ghstack-source-id: 137526305

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D30745961

fbshipit-source-id: 32d102502570291e01579e5b47a6d74dc71013bb
2021-09-08 19:16:13 -07:00
Rohan Varma
d59ecc02df [DDP] Fix when buffers are reassigned in module (#64472)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64472

Sometimes, a user module can reassign a tensor buffer, as in:

```
self.buffer = torch.randn(1, 2) # in init
self.buffer += 1 # in forward
```

In this case, `self.modules_buffers` will become outdated, and we should repopulate self.modules_buffers if we need to sync module buffers.

See https://github.com/pytorch/pytorch/issues/63916 for full description of the
issue.
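A self-contained sketch of a module that triggers the scenario (using an explicit out-of-place reassignment, which rebinds the buffer name to a brand-new tensor):

```
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("buffer", torch.randn(1, 2))  # registered in __init__

    def forward(self, x):
        # Out-of-place reassignment rebinds `buffer` to a new tensor object, so
        # any cached list of buffer tensors inside DDP becomes stale and must be
        # repopulated before the next buffer sync.
        self.buffer = self.buffer + 1
        return x + self.buffer
```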
ghstack-source-id: 137526309

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D30745921

fbshipit-source-id: 25eb1edbf445703a481802e07f3058d38ea6fc64
2021-09-08 19:14:55 -07:00
Yinbin Ma
0d437fe6d0 BF16 allreduce hook (#63260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63260

Add BF16 all-reduce communication hook. Skip if CUDA version < 11 or NCCL version < 2.9.7.
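Registering the new hook on a DDP-wrapped model might look like the following sketch (`ddp_model` is a placeholder; use only when the guard above, CUDA >= 11 and NCCL >= 2.9.7, is satisfied):

```
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

# compresses gradients to bfloat16 before all-reduce, then decompresses
ddp_model.register_comm_hook(state=None, hook=default_hooks.bf16_compress_hook)
```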

Reviewed By: SciPioneer

Differential Revision: D30238317

fbshipit-source-id: bad35bf7d43f10f1c40997a282b831b61ef592bb
2021-08-18 20:53:49 -07:00
Rohan Varma
5fb79f61a8 [DDP] Dont set thread local state in reducer autograd hook. (#62996)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62996

No need to set this because the autograd engine already propagates TLS state.
ghstack-source-id: 135438220

Test Plan: CI

Reviewed By: albanD

Differential Revision: D30202078

fbshipit-source-id: e5e917269a03afd7a6b8e61f28b45cdb71ac3e64
2021-08-10 10:50:16 -07:00
Rohan Varma
3df4870343 [Reland][DDP] Support not all outputs used in loss calculation (#61753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61753

Reland of https://github.com/pytorch/pytorch/pull/57081.
Main difference is that the former diff moved `prepare_for_backward` check into `DDPSink` backward, but that resulted in issues due to potential autograd engine races. The original diff moved `prepare_for_backward` into `DDPSink` as part of a long-term plan to always call it within `DDPSink`.

In particular, this doesn't work because `prepare_for_backward` sets `expect_autograd_hooks=true`, which enables autograd hooks to fire, but there were several use cases internally where autograd hooks were called before DDPSink called `prepare_for_backward`, resulting in errors/regressions.

We instead keep the call to `prepare_for_backward` in the forward pass, but still run outputs through `DDPSink` when find_unused_parameters=True. As a result, outputs that are not used when computing the loss have `None` gradients, and we don't touch them if they are globally `None`. Note that the hooks still fire with an undefined gradient, which is how we avoid the Reducer erroring out with the message that some hooks did not fire.

Added the unittests that were part of the reverted diff.
ghstack-source-id: 135388925

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29726179

fbshipit-source-id: 54c8819e0aa72c61554104723a5b9c936501e719
2021-08-09 22:29:11 -07:00
Rohan Varma
80091cb0f7 [DDP] Allow tuning of first bucket (#62748)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62748

Previously, after buckets were rebuilt, the first bucket size always defaulted to 1 MB; this diff allows the first bucket to be tuned like the rest of the bucket sizes.

Setting `dist._DEFAULT_FIRST_BUCKET_BYTES = 1` results in the following logs as
expected:
I0804 12:31:47.592272 246736 reducer.cpp:1694] 3 buckets rebuilt with size
limits: 1, 1048, 1048 bytes.
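A sketch of how the knob is used (it is a private setting and subject to change; `model` and `rank` are placeholders, and the override should be in place before DDP builds its buckets):

```
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist._DEFAULT_FIRST_BUCKET_BYTES = 4 * 1024 * 1024  # 4 MiB instead of the 1 MB default
ddp = DDP(model, device_ids=[rank], bucket_cap_mb=25)
```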
ghstack-source-id: 135074696

Test Plan: CI

Reviewed By: SciPioneer, wanchaol

Differential Revision: D30110041

fbshipit-source-id: 96f76bec012de129d1645e7f50e266d4b255ec66
2021-08-05 16:35:07 -07:00
Sean Lawlor
34c9f5a8da [DDP Communication Hook] Update get_tensor and set_tensor to be cleaner naming conventions (buffer() and set_buffer()) (#62662)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662

Replaced the methods `set_tensor(.)` and `get_tensor()` in the Python-exposed API from the C++ logic with `buffer()` and `set_buffer(.)` for a cleaner interface.
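Under the new names, a communication hook reads and writes the bucket's flattened gradient as in this sketch (a no-op hook that skips communication, shown only to illustrate the accessors):

```
import torch

def noop_hook(state, bucket):
    # bucket.buffer() / bucket.set_buffer(t) replace bucket.get_tensor() / bucket.set_tensor(t)
    bucket.set_buffer(bucket.buffer())  # write the (unchanged) flat gradient back
    fut = torch.futures.Future()
    fut.set_result(bucket.buffer())     # comm hooks must return a future
    return fut
```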

Reviewed By: SciPioneer

Differential Revision: D30012869

fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482
2021-08-04 09:27:31 -07:00
Andrew Gu
62a90c227f Make _Join, _Joinable, _JoinHook public (#62605)
Summary:
**Overview:**
This removes the preceding `_` from `_Join`, `_Joinable`, and `_JoinHook` in preparation for adding the generic join context manager tutorial (see [here](https://github.com/pytorch/tutorials/pull/1610)). This also adds a docs page, which can be linked from the tutorial. [Here](https://github.com/pytorch/pytorch/files/6919475/render.pdf) is a render of the docs page.
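A minimal usage sketch of the now-public API (the model, optimizer, and input names are placeholders):

```
import torch
from torch.distributed.algorithms.join import Join
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

ddp = DDP(model.cuda(rank), device_ids=[rank])
zero = ZeroRedundancyOptimizer(ddp.parameters(), optimizer_class=torch.optim.SGD, lr=0.01)

with Join([ddp, zero]):                # both participate as Joinables
    for inputs in local_batches:       # ranks may have unequal numbers of batches
        loss = ddp(inputs).sum()
        loss.backward()
        zero.step()
        zero.zero_grad()
```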

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62605

Test Plan:
`DistributedDataParallel.join()`:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```

`ZeroRedundancyOptimizer`:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
NOTE: DDP overlap tests are failing due to a landing race. See https://github.com/pytorch/pytorch/pull/62592. Once the fix is landed, I will rebase, and tests should be passing.

`Join`:
```
gpurun4 python test/distributed/algorithms/test_join.py
```

Reviewed By: mrshenli

Differential Revision: D30055544

Pulled By: andwgu

fbshipit-source-id: a5ce1f1d9f1904de3bdd4edd0b31b0a612d87026
2021-08-03 12:20:11 -07:00
Rohan Varma
4d5607bb25 [Reland][DDP] log bucket sizes (#62625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62625

reland of https://github.com/pytorch/pytorch/pull/62232 which ran into a land race.

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D30058217

fbshipit-source-id: 1454dd481e630f3de9ec6111b3f2e18cd8976091
2021-08-03 10:55:46 -07:00
Andrew Gu
51f687fd4b Add overlap with DDP to ZeRO (two approaches) (#62157)
Summary:
**Overview:**
This adds two approaches to overlapping `DistributedDataParallel.backward()` with `ZeroRedundancyOptimizer.step()` by providing two hook constructors: `hook_with_zero_step()` and `hook_with_zero_step_interleaved()`. The former waits for all backward computation to finish before starting optimizer computation, while the latter launches a partial optimizer computation using the contents of a gradient bucket once that bucket's all-reduce completes. The two approaches each suffer from their own weaknesses, and which one to use depends on the specific hardware configuration.

Both approaches can share changes to `ZeroRedundancyOptimizer`. A user should pass `overlap_with_ddp=True` to `ZeroRedundancyOptimizer`, construct a DDP communication hook using either `hook_with_zero_step()` or `hook_with_zero_step_interleaved()`, and register that communication hook. `ZeroRedundancyOptimizer.step()` should still be called in the training loop, though the optimizer computation and communication will be offloaded to originate from the communication hook. Currently, the first two iterations are vacuous, meaning they do not result in parameter updates and the inputs are ignored. This is required to finalize the DDP bucket strategy and to then initialize the `ZeroRedundancyOptimizer`'s local optimizer based on that bucketing.
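A rough sketch of wiring the pieces together, based on the description above (import paths and the exact `hook_with_zero_step` signature are assumptions; `model`, `rank`, `loader`, and `loss_fn` are placeholders):

```
import torch
from torch.distributed.algorithms.ddp_comm_hooks.default_hooks import allreduce_hook
from torch.distributed.algorithms.ddp_comm_hooks.ddp_zero_hook import hook_with_zero_step
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

ddp = DDP(model.cuda(rank), device_ids=[rank])
zero = ZeroRedundancyOptimizer(ddp.parameters(), optimizer_class=torch.optim.SGD,
                               overlap_with_ddp=True, lr=0.01)
ddp.register_comm_hook(None, hook_with_zero_step(allreduce_hook, ddp, zero))

for step, (x, y) in enumerate(loader):
    loss_fn(ddp(x), y).backward()
    zero.step()  # per the note above, the first two iterations are vacuous
```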

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62157

Test Plan:
The existing `ZeroRedundancyOptimizer` tests pass, and new unit tests for both hooks pass:
- ~~`test_ddp_with_zero_step_parity_cpu`~~ (removed for now due to flakiness in CI -- under investigation, could possibly be similar Gloo issue as with `hook_with_zero_step_interleaved()`)
- `test_ddp_with_zero_step_parity_gpu`
- `test_ddp_with_zero_step_interleaved_parity_gpu`

These were tested on the AI AWS cluster.

An analogous `test_ddp_with_zero_step_interleaved_parity_cpu` is missing due to existing bugs with Gloo. See https://github.com/pytorch/pytorch/pull/62302.

Both approaches have been verified using an internal accuracy benchmark.

Reviewed By: mrshenli

Differential Revision: D29971046

Pulled By: andwgu

fbshipit-source-id: a7234c23c7ea253f144a698fd7e3c0fe039de5e8
2021-08-02 08:33:34 -07:00
Yi Wang
32b37ba246 [DDP Communication Hook] Update the typing info of comm hook output as well as some docstring (#62457)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62457

Specify `Future[torch.Tensor]` as the DDP communication hook return type, which is explicitly a single tensor. The previous API took a list containing a single tensor.

Note that now the typing info no longer accepts the internal type of `torch._C.Future`, which does not support torchscript and hence cannot support `Future[torch.Tensor]`.
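A sketch of a hook matching the new typing (`ddp_model` is a placeholder):

```
import torch
import torch.distributed as dist

def allreduce_hook(process_group, bucket):
    group = process_group if process_group is not None else dist.group.WORLD
    tensor = bucket.buffer().div_(dist.get_world_size(group))
    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()
    # resolve to a single torch.Tensor (Future[torch.Tensor]),
    # not a one-element list as in the old API
    return fut.then(lambda f: f.value()[0])

ddp_model.register_comm_hook(state=None, hook=allreduce_hook)
```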
ghstack-source-id: 134771419

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_invalid_comm_hook_return_type

Reviewed By: rohan-varma

Differential Revision: D30007390

fbshipit-source-id: 246667c9b575b4c6e617b0a5b373151f1bd81e7f
2021-07-30 20:51:34 -07:00