Commit Graph

71 Commits

Author SHA1 Message Date
Eddie Yan
7710d872fc [DDP] Fix broadcast for channels-last tensors (#79060)
#79043

CC @pritamdamania87 @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79060
Approved by: https://github.com/pritamdamania87
2022-06-08 21:52:58 +00:00
pritam
b333a752c0 Validate that tensors are contiguous in ProcessGroupNCCL
Fixes https://github.com/pytorch/pytorch/issues/77554 by ensuring we
require contiguous tensors for send/recv in ProcessGroupNCCL.
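
A minimal user-side sketch of what this requirement means in practice (illustrative only; the validation itself lives inside ProcessGroupNCCL):

```python
import torch
import torch.distributed as dist

def send_dense(tensor: torch.Tensor, dst: int) -> None:
    # ProcessGroupNCCL now rejects non-contiguous tensors for send/recv,
    # so materialize a dense copy first (e.g. for channels-last or sliced views).
    if not tensor.is_contiguous():
        tensor = tensor.contiguous()
    dist.send(tensor, dst=dst)
```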

Differential Revision: [D36500769](https://our.internmc.facebook.com/intern/diff/D36500769/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77809

Approved by: https://github.com/rohan-varma, https://github.com/wanchaol
2022-05-19 17:48:22 +00:00
magialiao
7c8c8cc248 Use batched operations for PowerSGD
This PR is a rebased version of #75157 which fixes CI issues
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76041
Approved by: https://github.com/albanD, https://github.com/rohan-varma
2022-04-21 03:25:09 +00:00
PyTorch MergeBot
c5d57e7be9 Revert "Use batched operations for PowerSGD"
This reverts commit 5654e63398.

Reverted https://github.com/pytorch/pytorch/pull/75157 on behalf of https://github.com/albanD
2022-04-18 13:10:29 +00:00
magialiao
5654e63398 Use batched operations for PowerSGD
This implements the method proposed in #74907

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75157
Approved by: https://github.com/wayi1, https://github.com/rohan-varma
2022-04-18 04:34:17 +00:00
Rohan Varma
a5ea3b7fd9 [DDP] Generalize activation checkpoint tests (#74130)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74130

Enable these tests to run for all dist backends, not just NCCL.
ghstack-source-id: 151429410

Test Plan: CI

Reviewed By: awgu

Differential Revision: D34281684

fbshipit-source-id: 956c1b0cafe0502b593dd42b157d518e89a47d8e
(cherry picked from commit 15d58b88362c49565123823f24ca122d5344acc9)
2022-03-16 17:04:30 +00:00
Can Balioglu
e1db2f13ce Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166

This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
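
A hedged usage sketch of the new APIs, assuming they are exposed under `torch.distributed` with the names given above:

```python
import torch.distributed as dist

dist.set_debug_level_from_env()               # pick up TORCH_DISTRIBUTED_DEBUG
print(dist.get_debug_level())                 # e.g. DebugLevel.OFF / INFO / DETAIL
dist.set_debug_level(dist.DebugLevel.DETAIL)  # raise verbosity at runtime
```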
ghstack-source-id: 149778566

Test Plan: Run the existing unit tests.

Reviewed By: rohan-varma

Differential Revision: D34371226

fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
2022-02-24 02:33:05 +00:00
Wanchao Liang
6feba4bc7e Implement scatter primitive for ProcessGroupNCCL (#70029)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70029

This PR implements NCCL scatter and adds scatter to ProcessGroupNCCL.

NCCL doesn't directly provide a scatter primitive, so it is implemented on top of NCCL's send/recv API.

1. In ProcessGroupNCCL.cpp, the inputTensors are first flattened, then outputTensors and inputFlattened are passed by the collective class to the scatter() function in nccl.cpp.
2. In nccl.cpp, scatter is implemented using ncclSend/ncclRecv: the root rank sends (distributes) the inputTensors to each rank in a loop, and all ranks receive their inputTensor from the root rank.
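
A rough Python-level sketch of the same send/recv pattern (illustrative only; the actual implementation is the C++ code in nccl.cpp described above, and here the root simply copies its own slice locally instead of sending to itself):

```python
import torch
import torch.distributed as dist

def scatter_via_p2p(output, inputs, root):
    rank, world_size = dist.get_rank(), dist.get_world_size()
    if rank == root:
        for dst in range(world_size):
            if dst == root:
                output.copy_(inputs[dst])        # root keeps its own slice
            else:
                dist.send(inputs[dst], dst=dst)  # root distributes the rest
    else:
        dist.recv(output, src=root)              # every other rank receives its slice
```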
ghstack-source-id: 147754837

Test Plan:
test_scatter_ops
test_scatter_stress
test_scatter_checks

Reviewed By: pritamdamania87

Differential Revision: D33154823

fbshipit-source-id: 4513e7eaf7d47a60eb67da99dc6c2e9a2882f3fd
(cherry picked from commit 93201f9d4a)
2022-01-27 19:37:55 +00:00
Wanchao Liang
9b53d3194c Implement gather primitive for ProcessGroupNCCL (#66745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66745

This PR implements NCCL gather and adds gather to ProcessGroupNCCL using the NCCL send/recv API.

NCCL doesn't directly provide a gather primitive, so it is implemented on top of NCCL's send/recv API.
1. In ProcessGroupNCCL.cpp, the outputTensors are first flattened, then inputTensors and outputFlattened are passed by the collective class to the gather() function in nccl.cpp.
2. In nccl.cpp, gather is implemented using ncclSend/ncclRecv: all ranks send their inputTensor to the root rank, and the root rank receives these inputTensors in a loop.
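
The mirrored sketch for gather (again illustrative only; the real code is the C++ described above):

```python
import torch
import torch.distributed as dist

def gather_via_p2p(outputs, input, root):
    rank, world_size = dist.get_rank(), dist.get_world_size()
    if rank == root:
        for src in range(world_size):
            if src == root:
                outputs[src].copy_(input)         # root keeps its own tensor
            else:
                dist.recv(outputs[src], src=src)  # root collects from the rest
    else:
        dist.send(input, dst=root)                # every other rank sends to the root
```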
ghstack-source-id: 147754838

Test Plan:
test_gather_ops
test_gather_checks
test_gather_stress

Reviewed By: pritamdamania87

Differential Revision: D29616361

fbshipit-source-id: b500d9b8e67113194c5cc6575fb0e5d806dc7782
(cherry picked from commit d560ee732e)
2022-01-27 19:37:55 +00:00
Michael Carilli
f37d2046f8 Implements allreduce_coalesced for ProcessGroupNCCL (#62140)
Summary:
Implements allreduce_coalesced for ProcessGroupNCCL as an NCCL group of allreduces on separate tensors, as proposed in https://github.com/pytorch/pytorch/issues/38995#issuecomment-882804595. In recent versions of NCCL, the performance of grouped comms has improved significantly. A group can execute with just one kernel, so a grouped comm on a set of unflattened tensors can be more performant than flattening + a single flat NCCL call.
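
A hedged usage sketch, assuming the Python-level `all_reduce_coalesced` wrapper dispatches to this grouped implementation on the NCCL backend:

```python
import torch
import torch.distributed as dist

# tensors of different shapes, all on the same CUDA device
tensors = [torch.ones(4, device="cuda"), torch.ones(2, 3, device="cuda")]
dist.all_reduce_coalesced(tensors, op=dist.ReduceOp.SUM)  # one grouped NCCL call
```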

The same approach can easily extend to broadcast_coalesced and reduce_coalesced.

I'm still not sure how (hypothetical) all_gather_coalesced and reduce_scatter_coalesced ops should be exposed or implemented, because we need to consider "_base" variants where the output or input tensor is pre-flattened. For example, https://github.com/pytorch/pytorch/issues/61781 effectively wants "allgather_base_coalesced".

I'm also not sure how the _multigpu variants should enter the picture. With the approach I've written here, ProcessGroupNCCL::allreduce accepts a vector of tensors that are either all on the same device (in which case it'll do an allreduce_coalesced) or all on different devices (in which case it'll do an allreduce_multigpu). In other words it can do _coalesced or _multigpu but not both at once.

For some reason GitHub won't let me add agolynski to the reviewers.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62140

Reviewed By: fduwjj

Differential Revision: D33781010

Pulled By: cbalioglu

fbshipit-source-id: f0c233da9ebae57d7ccecf6d8dc432d936d4d3ce
(cherry picked from commit e43cb81d30)
2022-01-26 13:31:30 +00:00
Rohan Varma
ba08440e88 [Opt Overlap] Remove redundant tests (#71600)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71600

These tests in test_c10d_nccl cover a subset of functionality that's
already covered by distributed_test.py, so there is no need for these additional tests.
ghstack-source-id: 147458823

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33662679

fbshipit-source-id: 2d1c1223fdd72a851c537b4793a71d65190d2553
(cherry picked from commit 14565ac5a6)
2022-01-23 00:04:32 +00:00
Yanli Zhao
1c61d8c43f [PT1.11] make static graph to be stable (#71459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71459

1. Add the static_graph argument to the DDP constructor (see the usage sketch below).
2. Still keep the _set_static_graph() API so that existing use cases are not affected; it can also be called internally by the DDP constructor.
3. Cover four cases:
    static_graph = False, _set_static_graph() is called;
    static_graph = False, _set_static_graph() is not called;
    static_graph = True, _set_static_graph() is not called;
    static_graph = True, _set_static_graph() is called.
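
A hedged usage sketch of the two spellings covered above (assumes the default process group is already initialized and a single visible GPU):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

model = nn.Linear(10, 10).cuda()
ddp = DDP(model, device_ids=[0], static_graph=True)  # new constructor argument

# legacy spelling, still supported:
# ddp = DDP(model, device_ids=[0]); ddp._set_static_graph()
```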
ghstack-source-id: 147263797

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D33646738

fbshipit-source-id: 8c1730591152aab91afce7133d2adf1efd723855
(cherry picked from commit dc246a1129)
2022-01-20 19:38:41 +00:00
Rohan Varma
3b589c3497 [DDP Checkpointing] non-reentrant checkpoint tests (#69060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69060

Saved variable hooks checkpointing was added in https://github.com/pytorch/pytorch/pull/69508; this PR adds some tests for DDP.

Specifically, we can support almost all DDP use cases with this new API, such as dynamic module with find_unused_parameters=True. One case remains to be supported, which is static_graph + non-reentrant based checkpointing. The underlying reason this does not work is https://github.com/pytorch/pytorch/issues/58111.
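
A hedged sketch of one newly supported combination: non-reentrant (saved-variable-hooks) checkpointing inside a DDP module with find_unused_parameters=True (assumes an initialized default process group):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(10, 10)

    def forward(self, x):
        # use_reentrant=False selects the saved-variable-hooks implementation
        return checkpoint(self.lin, x, use_reentrant=False)

ddp = DDP(Net().cuda(), device_ids=[0], find_unused_parameters=True)
```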
ghstack-source-id: 147219887

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D32712126

fbshipit-source-id: ba5ae9ca77fd8929ee020c7dc97838bae9a1931b
(cherry picked from commit 9c7f93e217)
2022-01-19 18:09:41 +00:00
Junjie Wang
7c2489bdae [PyTorch][Distributed] Enable Reduce Scatter and modify all_to_all for sharded linear with more test cases. (#68786)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68786

To enable autograd for the sharded linear, we found we need to make some changes to the current nn function API (the c10d API with autograd enabled). So we made the following changes:

1. Add a new API `reduce_scatter`, since we need it for row-wise sharding (see the sketch below).
2. Modify the `all_to_all` API to make sure it is consistent with the ones in distributed_c10d.py.
3. Found that the C++ signature of `reduce_scatter` was missing an input parameter; added more unit tests to cover these cases.
4. Sync the NN tests from Gloo to NCCL.
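
A hedged sketch of the autograd-enabled reduce_scatter, assuming it lives in torch.distributed.nn.functional and mirrors the distributed_c10d signature reduce_scatter(output, input_list, ...):

```python
import torch
import torch.distributed as dist
from torch.distributed.nn.functional import reduce_scatter

world_size = dist.get_world_size()
inputs = [torch.randn(4, device="cuda", requires_grad=True) for _ in range(world_size)]
output = torch.empty(4, device="cuda")
# each rank gets the reduction of one input slot; gradients flow back through inputs
result = reduce_scatter(output, inputs)
```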
ghstack-source-id: 144860208

Test Plan: CI + Unit Test

Reviewed By: pritamdamania87

Differential Revision: D32569674

fbshipit-source-id: 9bd613f91bbf7a39eede0af32a5a5db0f2ade43b
2021-12-06 13:38:58 -08:00
Rohan Varma
c95277e92a [FSDP] Remove auto_wrap() (#69356)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69356

Per title
ghstack-source-id: 144807949

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D32816150

fbshipit-source-id: 6b4eacc63edd267bc1eb8a1c1d6c753bc581d63a
2021-12-06 12:11:14 -08:00
Wanchao Liang
ff3fc37267 [BE] rewrite ProcessGroupNCCLTest to be MultiProcess (#67705)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67705

This PR rewrites ProcessGroupNCCLTest to be a MultiProcessTestCase. It was originally written in a single-process multi-GPU fashion; we change it to multi-process to align with the other c10d tests.
ghstack-source-id: 144555092

Test Plan: wait for CI

Reviewed By: pritamdamania87, fduwjj

Differential Revision: D32113626

fbshipit-source-id: 613d36aeae36bf441de1c2c83aa4755f4d33df4d
2021-12-02 10:12:05 -08:00
Rohan Varma
994f110a6f Refactor DDP checkpoint tests (#68792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68792

Refactor tests to make it clearer which features are supported and
unsupported under certain DDP configs.
ghstack-source-id: 144285040

Test Plan: CI

Reviewed By: pbelevich

Differential Revision: D32609498

fbshipit-source-id: 5231242054d4ff6cd8e7acc4a50b096771ef23d1
2021-11-30 12:36:14 -08:00
Rohan Varma
183dcdf551 [reland] Fix flaky test_nccl_timeout (#68544)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/66882

In addition to the changes in https://github.com/pytorch/pytorch/pull/68403, add a check for one more error that can be raised when a collective times out

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68544

Reviewed By: albanD

Differential Revision: D32508706

Pulled By: rohan-varma

fbshipit-source-id: 7d41b91f547d4ad763c44cd11e7b9914b452b617
2021-11-19 13:25:24 -08:00
Mike Ruberry
3e3bf40b0a Revert D32452012: [pytorch][PR] Fix flaky test_nccl_timeout
Test Plan: revert-hammer

Differential Revision:
D32452012 (faa1e8b7cf)

Original commit changeset: c959b25957f2

fbshipit-source-id: a2786744b12ceed350eec0ca2834f5176a4e21ee
2021-11-17 06:08:53 -08:00
Rohan Varma
faa1e8b7cf Fix flaky test_nccl_timeout (#68403)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/66882

- Remove time.sleep call
- Use gloo barrier to enforce rank synchronization
- Reduce timeouts for allreduce
- Pass in timeout and call wait() in _check_for_nccl_abort()

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68403

Reviewed By: H-Huang

Differential Revision: D32452012

Pulled By: rohan-varma

fbshipit-source-id: c959b25957f2eb8d59c506075da6023d25bbcfd9
2021-11-16 23:43:23 -08:00
Yifan Xiong
c7eaec86f0 [NCCL] Patch bfloat16 support (#67843)
Summary:
Patch bfloat16 support in NCCL. PR https://github.com/pytorch/pytorch/issues/63260 adds bfloat16 support but is
still not complete enough to enable bfloat16 for allreduce in end-to-end training.

This patch does the following (see the gating sketch below):
* fix the minimum NCCL version from 2.9.7 to 2.10; NCCL added bf16 support in
  v2.10.3-1 (commit 7e51592)
* update the bfloat16 datatype flag in `csrc/cuda/nccl.cpp` so that NCCL
  operations like allreduce can use it
* enable unit tests for the bfloat16 datatype where possible
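
A hedged gating sketch of when the bf16 path can be exercised after this patch:

```python
import torch
import torch.distributed as dist

cuda_ok = torch.version.cuda is not None and int(torch.version.cuda.split(".")[0]) >= 11
if cuda_ok and torch.cuda.nccl.version() >= (2, 10):
    t = torch.ones(8, dtype=torch.bfloat16, device="cuda")
    dist.all_reduce(t)  # maps to the NCCL bfloat16 datatype after this change
```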

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67843

Reviewed By: H-Huang

Differential Revision: D32248132

Pulled By: mrshenli

fbshipit-source-id: 081e96e725af3b933dd65ec157c5ad11c6873525
2021-11-09 13:46:13 -08:00
Shen Li
18955d3564 Raise warning when calling collectives on non-member group objects (#67639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67639

Due to BC considerations, we cannot directly error out, as that
might break existing applications. Raise warnings first to improve
debuggability.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D32075151

Pulled By: mrshenli

fbshipit-source-id: 5680d420f5f6cd3f74a36616c03350e8a976b363
2021-11-02 20:04:07 -07:00
Rohan Varma
885da61d7d [PG NCCL] Disable NCCL health check (#67668)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67668

This adds an env var to enable the NCCL health check which, when left unspecified, results in the check not being run. Unit tests that need to test this functionality have the env variable set. Please see the internal diff for more details.

Test Plan: CI

Reviewed By: yuguo68, mrshenli

Differential Revision: D32089763

fbshipit-source-id: dff5664a5e607f711515cd1042089ca769914fbb
2021-11-02 16:21:59 -07:00
Jane Xu
34051d74da Add test owner to distributed files starting with test_ (#66797)
Summary:
Action based on https://github.com/pytorch/pytorch/issues/66232

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66797

Reviewed By: gchanan

Differential Revision: D31761389

Pulled By: janeyx99

fbshipit-source-id: c27c9ab4acec1eb71d5edd4538cd113b770dfc6c
2021-10-19 10:55:20 -07:00
Rohan Varma
06fa6c15c0 Back out "Revert D31299350: Back out "Revert D31005792: [NCCL] Init dummy NCCL comms in constructor"" (#66393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66393

Third try!

Fixes:
- test_nccl_timeout can be flaky because of its 1s timeout; bump up the timeout to resolve the flakiness. In general we should not have been relying on time.sleep for this test; filed https://github.com/pytorch/pytorch/issues/66354 to track that.
- ciflow/all did not actually run tests due to a bug causing multigpu tests to not be run. This has since been fixed.
ghstack-source-id: 140560113

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D31534735

fbshipit-source-id: 8b7e0f4fed3972b7a77cbcda28876c9eefb0c7e2
2021-10-14 22:23:22 -07:00
Rohan Varma
901df0cc22 Skip test_nccl_errors_nonblocking (#66394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66394

Skips this test as it currently does not seem to pass after several
internal local runs.
ghstack-source-id: 140210583

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D31534806

fbshipit-source-id: 799849a6a715506a85c9697b46f7098d9b71b32e
2021-10-11 10:08:31 -07:00
Jane Xu
0a48f56318 Revert D31299350: Back out "Revert D31005792: [NCCL] Init dummy NCCL comms in constructor"
Test Plan: revert-hammer

Differential Revision:
D31299350 (f1f3bd8c36)

Original commit changeset: 9ad5c8fa17f7

fbshipit-source-id: d63d889922f507a4a0e2e042e451b95b9591c317
2021-10-08 17:55:28 -07:00
Rohan Varma
f1f3bd8c36 Back out "Revert D31005792: [NCCL] Init dummy NCCL comms in constructor" (#65883)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65883

Original commit changeset: d8e962b8aab6
ghstack-source-id: 139836954

Test Plan: ci

Reviewed By: zhaojuanmao

Differential Revision: D31299350

fbshipit-source-id: 9ad5c8fa17f7038ba579cb1eda6d9271ac07a130
2021-10-08 16:04:20 -07:00
Mike Ruberry
91f8755b0e Revert D31005792: [NCCL] Init dummy NCCL comms in constructor
Test Plan: revert-hammer

Differential Revision:
D31005792 (2b22a5dde2)

Original commit changeset: c2c582dee25a

fbshipit-source-id: d8e962b8aab6fda8a6c013e8577492dff9568c27
2021-09-29 20:46:38 -07:00
Rohan Varma
2b22a5dde2 [NCCL] Init dummy NCCL comms in constructor (#65173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65173

Initializes dummy NCCL communicators in constructor for a basic health
check that communicators can be initialized prior to launching the first
collective.

After successful init, we immediately use `ncclCommAbort` to destroy these
communicators to ensure they don't interfere with regular communicator creation
during collectives.

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D31005792

fbshipit-source-id: c2c582dee25a098361ead6ef03f541e7833c606b
2021-09-29 15:36:54 -07:00
Tingting Markstrum
2a0208f4dc fixed comments referring fairscale master branch (#65531)
Summary:
Replace comments referring to the fairscale master branch with the main branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65531

Test Plan:
buck build

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang gcramer23

Reviewed By: H-Huang, anj-s

Differential Revision: D31132552

Pulled By: tmarkstrum

fbshipit-source-id: d3ee8920ab5cccad99f640934c21e8eee022e9b9
2021-09-23 14:37:58 -07:00
Michael Carilli
64d3c7388f [RELAND] Enable ncclAvg for reductions (#62835)
Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/62303.

Reverts the revert, and restores some diffs that were mysteriously missing from the reverted revert. I think some of the diffs I pushed to the original PR raced with its import or landing, such that the original PR's merge didn't pick up all the diffs I wanted. I don't know enough about the landing process to do more than speculate wildly, but hopefully this resubmit sorts things out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62835

Reviewed By: zhouzhuojie, seemethere, janeyx99, heitorschueroff

Differential Revision: D30999982

Pulled By: malfet

fbshipit-source-id: 1f70ab4055208f1c6a80c9fc9fbe292ce68ecaa9
2021-09-21 18:09:45 -07:00
Rohan Varma
e0e832c2ba [c10d] Provide failure reason from ProcessGroup when aborting NCCL comm (#64241)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64241

When things go wrong, PG NCCL aborts NCCL communicators via `ncclCommAbort`, but one issue is that the error is often set to `ncclSystemError` (see  https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/NCCLUtils.hpp#L176) when that might not be the true cause; the actual issue may be that some prior work timed out, the communicator was aborted on another rank, etc.

This results in a lot of confusion when debugging jobs with a large number of processes, as the current message for ncclSystemError is not very informative: https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/NCCLUtils.hpp#L22

The fix here is to pass in a string exception message from PG NCCL down to `NCCLUtils` which will aim to raise that as the actual issue and not the confusing `ncclSystemError` message.

Test Plan: CI

Reviewed By: pallab-zz, cbalioglu

Differential Revision: D30658855

fbshipit-source-id: 17661dbe0a1bb8cc5b87b637c47634b1f52f54e1
2021-09-08 09:19:24 -07:00
Rohan Varma
16a4434422 [BE] Enable functional optim tests for windows (#63462)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63462

Now that `torch.distributed.optim` gates DistributedOptimizer on RPC availability, these tests can be run on windows.
ghstack-source-id: 136437635

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D30358923

fbshipit-source-id: 36739bdfe7214789f17de652d30c62c2bc124c73
2021-08-23 17:49:01 -07:00
Pritam Damania
2d671ca41b [8/N] Remove c10d/ddp fork tests. (#63454)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63454

Continuation of https://github.com/pytorch/pytorch/pull/63443, this
PR removes all fork tests from torch.distributed.
ghstack-source-id: 136285511

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D30387872

fbshipit-source-id: f6d6313db126ae7b95b86f78a1e0726887c5c513
2021-08-20 12:23:18 -07:00
Yinbin Ma
0d437fe6d0 BF16 allreduce hook (#63260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63260

Add BF16 all-reduce communication hook. Skip if CUDA version < 11 or NCCL version < 2.9.7.
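
A hedged usage sketch, assuming the hook is exposed as bf16_compress_hook under ddp_comm_hooks.default_hooks (and an initialized default process group):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

ddp_model = DDP(nn.Linear(10, 10).cuda(), device_ids=[0])
# compress gradients to bfloat16 for the allreduce, then decompress on the way out
ddp_model.register_comm_hook(state=None, hook=default_hooks.bf16_compress_hook)
```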

Reviewed By: SciPioneer

Differential Revision: D30238317

fbshipit-source-id: bad35bf7d43f10f1c40997a282b831b61ef592bb
2021-08-18 20:53:49 -07:00
Rohan Varma
dcf90b797c [BE] remove _SUPPORTED_OPTIM_MAP from tests (#63383)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63383

Per title
ghstack-source-id: 135966157

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D30358921

fbshipit-source-id: 965e054e525194b1ee55980340df275bab355c9b
2021-08-17 17:17:25 -07:00
Rohan Varma
5b8862abf1 [DDP] Support step_param for AdamW (#63382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63382

Per title
ghstack-source-id: 135966156

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D30255446

fbshipit-source-id: e6ffbf339db0bc5b4702d02b74a462309df07c75
2021-08-17 17:16:11 -07:00
Pritam Damania
f7611b31aa [4/N] Enable opt-asan for distributed unit tests. (#62051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62051

The goal here is to enable opt-asan for "spawn" based unit tests since
this works for "spawn" unlike "dev-asan". As a result, we can run ASAN for
"spawn" unit tests as well.

This means we can completely remove fork unit tests from the code base since
the only purpose for these tests was to run ASAN.
ghstack-source-id: 135523770

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D29854514

fbshipit-source-id: 02a5bfcfae2afc21badecff77082c7a6ad83636b
2021-08-10 22:38:31 -07:00
Eddie Yan
d893b44cd8 change nccl version reporting (#62916)
Summary:
https://github.com/pytorch/pytorch/issues/62295

Previously the packing and unpacking of the NCCL version "integer" was done to have parity with the upstream NCCL version encoding. However, there doesn't seem to be any place where this integer is directly compared with a version integer sourced from upstream NCCL, and syncing the encoding seems to be error-prone (e.g., a recent change where a special case was added for minor versions >= 10 7e51592129/src/nccl.h.in (L22)).

This patch changes the reporting to return a tuple of version numbers instead (to preserve ease-of-use for comparisons) and tweaks the passing between C/Python to avoid the digit overflow problem.
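
A small sketch of what the tuple form looks like to callers after this change:

```python
import torch

version = torch.cuda.nccl.version()  # e.g. (2, 10, 3) instead of a packed integer
if version >= (2, 10):
    print("NCCL is new enough for bf16 reductions")
```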

CC ngimel mcarilli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62916

Reviewed By: anjali411

Differential Revision: D30201069

Pulled By: mrshenli

fbshipit-source-id: 2e4e7c69f001c3f22bd04aa6df6a992e538bea45
2021-08-10 17:46:27 -07:00
Rohan Varma
1dba329d20 Enable step_param for Adam functional optimizer (#62611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62611

Enables optimizer overlap with backwards in DDP for Adam. Additional optimizers, especially Adagrad, will be done in follow-up diffs.

1. Implement a `step_param` method based on `step` in _FunctionalAdam (perf permitting, we can later dedupe `step` to call `step_param`); see the sketch below.
2. Modify tests to test all current functional optimizers.
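
A rough sketch of the overlap idea: step a single parameter as soon as its gradient is ready instead of waiting for a full optimizer.step(). The constructor arguments and the step_param(param, grad) signature here are assumptions based on the description above.

```python
import torch
from torch.distributed.optim.functional_adam import _FunctionalAdam

param = torch.nn.Parameter(torch.randn(4, device="cuda"))
opt = _FunctionalAdam([param], lr=1e-3)

grad = torch.randn_like(param)  # e.g. produced by a DDP allreduce hook
opt.step_param(param, grad)     # update just this parameter
```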
ghstack-source-id: 135207143

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29891783

fbshipit-source-id: 321915982afd5cb0a9c2e43d27550f433bff00d1
2021-08-06 10:53:55 -07:00
Pavel Belevich
0c8ed042f2 Revert D30095246: [pytorch][PR] Enable ncclAvg for reductions
Test Plan: revert-hammer

Differential Revision:
D30095246 (a749180e4e)

Original commit changeset: d3a3475345fa

fbshipit-source-id: 34b5100b925859461296cae5a717a70e5eca6af6
2021-08-05 07:56:40 -07:00
Michael Carilli
a749180e4e Enable ncclAvg for reductions (#62303)
Summary:
[ncclAvg](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html?highlight=ncclavg#c.ncclAvg) is a new `ncclRedOp_t` that fuses a div-by-world-size with ncclAllReduce, Reduce, or ReduceScatter. This PR adds support.
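
A hedged sketch of what the fused op enables at the Python level, assuming it is surfaced as ReduceOp.AVG (NCCL-only):

```python
import torch
import torch.distributed as dist

grad = torch.ones(8, device="cuda")
# sum + divide-by-world-size fused into a single NCCL call
dist.all_reduce(grad, op=dist.ReduceOp.AVG)
```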

This PR and https://github.com/pytorch/pytorch/pull/62140 lay the foundation for DDP to allreduce+average grad tensors in place with a single NCCL call, without additional memory passes to flatten, average, or unflatten. I'll write the necessary DDP changes once this PR and https://github.com/pytorch/pytorch/pull/62140 land.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62303

Reviewed By: soulitzer

Differential Revision: D30095246

Pulled By: rohan-varma

fbshipit-source-id: d3a3475345fafb0ab265c11d36db74d7c5613a0a
2021-08-04 19:43:50 -07:00
Sean Lawlor
34c9f5a8da [DDP Communication Hook] Update get_tensor and set_tensor to be cleaner naming conventions (buffer() and set_buffer()) (#62662)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662

Replaced the methods set_tensor(.) and get_tensor(), exposed in the Python API from the C++ logic, with buffer() and set_buffer(.) for a cleaner interface.
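
A hedged sketch of a custom DDP comm hook written against the renamed API, reading the flat gradient tensor with bucket.buffer():

```python
import torch
import torch.distributed as dist

def noop_hook(state, bucket: dist.GradBucket) -> torch.futures.Future[torch.Tensor]:
    fut = torch.futures.Future()
    fut.set_result(bucket.buffer())  # return the bucket's gradients unchanged
    return fut

# ddp_model.register_comm_hook(state=None, hook=noop_hook)
```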

Reviewed By: SciPioneer

Differential Revision: D30012869

fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482
2021-08-04 09:27:31 -07:00
Yi Wang
2ec4f69b48 [DDP Comm Hook] Do not expose hook_then_optimizer as a public method (#62532)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62532

This method is not stable at this time, so avoid releasing it when DDP communication hook feature is released as a stable feature.
ghstack-source-id: 134787831

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_hook_with_optimizer_parity
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_hook_then_optimizer_nccl

Reviewed By: rohan-varma

Differential Revision: D30031222

fbshipit-source-id: e03a8e13fee5116a5ddd724eb76316ee98f2a676
2021-08-02 12:25:01 -07:00
Yi Wang
32b37ba246 [DDP Communication Hook] Update the typing info of comm hook output as well as some docstring (#62457)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62457

Specify `Future[torch.Tensor]` as the DDP communication hook return type, which should explicitly be a single tensor. The previous API took a list containing a single tensor.

Note that the typing info no longer accepts the internal type `torch._C.Future`, which does not support TorchScript and hence cannot support `Future[torch.Tensor]`.
ghstack-source-id: 134771419

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_invalid_comm_hook_return_type

Reviewed By: rohan-varma

Differential Revision: D30007390

fbshipit-source-id: 246667c9b575b4c6e617b0a5b373151f1bd81e7f
2021-07-30 20:51:34 -07:00
Yi Wang
acba9b3104 [DDP Communication Hook] Simplify the implementation of parseHookResult of PythonCommHook (#62389)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62389

Simplify the implementation of `parseHookResult` of `PythonCommHook`, by not directly accepting the output of allreduce, which is a tensor list.

Address the comment on https://github.com/pytorch/pytorch/pull/62074#discussion_r675303280

Additionally, formatter is also applied to `OptimizerHookState` and `hook_then_optimizer`.
ghstack-source-id: 134626246

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork

Reviewed By: rohan-varma

Differential Revision: D29982485

fbshipit-source-id: 5b27cc5ef09d2f87c1ade4c0feef7eacc1af3a9a
2021-07-29 17:27:35 -07:00
Yi Wang
554daef820 Reformat test_c10d_nccl.py and distributed_test.py (#62388)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62388

as title
ghstack-source-id: 134626247

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D29984086

fbshipit-source-id: 0960e5acc379ccdf08813387e11d3fb0a5f0e4b0
2021-07-29 17:27:33 -07:00
Pritam Damania
82d81455ae [2/N] Remove unittest.skip across all of torch.distributed. (#61887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887

1) Introduced a `sandcastle_skip_if` decorator that ensures these
tests just get passed on sandcastle.
2) Fixed all test files under `test/distributed` to not use `unittest.skip`

The overall goal is to avoid using skips, since Sandcastle tags these tests as
continuously skipping.
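
A rough, illustrative sketch of the idea behind such a decorator (the name and signature here are assumptions, not the actual helper):

```python
import functools

def pass_instead_of_skip(condition: bool, reason: str):
    """Decorator: if condition holds, report a pass without running the test body."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if condition:
                print(f"passing instead of skipping: {reason}")
                return None  # test is recorded as passed, not skipped
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```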
ghstack-source-id: 134382237

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D29784152

fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d
2021-07-27 10:53:23 -07:00
Rohan Varma
64283fe146 [DDP/Functional Optim] Support kwarg arguments (#62079)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62079

Adds support for kwarg arguments in the functional optimizer running as a
hook.
ghstack-source-id: 134330379

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29838127

fbshipit-source-id: 2ab051ef5f0dff19c145ebe2260668b927ba47b2
2021-07-26 22:12:50 -07:00