Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74130
Enable these tests to run for all distributed backends, not just NCCL.
ghstack-source-id: 151429410
Test Plan: CI
Reviewed By: awgu
Differential Revision: D34281684
fbshipit-source-id: 956c1b0cafe0502b593dd42b157d518e89a47d8e
(cherry picked from commit 15d58b88362c49565123823f24ca122d5344acc9)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166
This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
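A minimal usage sketch (assuming the level enum is exposed as `torch.distributed.DebugLevel` alongside the three APIs above):

```python
import torch.distributed as dist

# Hedged sketch: inspect and adjust the distributed debug level at runtime.
print(dist.get_debug_level())                 # e.g. DebugLevel.OFF

dist.set_debug_level(dist.DebugLevel.DETAIL)  # raise verbosity for debugging

# Re-read TORCH_DISTRIBUTED_DEBUG (OFF / INFO / DETAIL) from the environment.
dist.set_debug_level_from_env()
```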
ghstack-source-id: 149778566
Test Plan: Run the existing unit tests.
Reviewed By: rohan-varma
Differential Revision: D34371226
fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70029
This PR implements scatter for NCCL and adds it to ProcessGroupNCCL.
NCCL does not directly provide a scatter primitive, so it has to be implemented on top of NCCL's send/recv API.
1. In ProcessGroupNCCL.cpp, the inputTensors are first flattened; then outputTensors and inputFlattened are passed by the collective class to the scatter() function in nccl.cpp.
2. In nccl.cpp, scatter is implemented using ncclSend/ncclRecv: the root rank uses a for loop to send (distribute) the inputTensors to each rank, and every other rank receives its inputTensor from the root rank, as sketched below.
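A rough Python sketch of the same send/recv pattern (point-to-point ops used for illustration only; the function name is hypothetical and the actual implementation is the C++ code in nccl.cpp):

```python
import torch.distributed as dist

def scatter_via_send_recv(output, scatter_list, src):
    # Hypothetical helper mirroring the logic above: the root loops over ranks
    # and sends each one its slice; every other rank receives its slice from the root.
    rank = dist.get_rank()
    if rank == src:
        for dst, t in enumerate(scatter_list):
            if dst == src:
                output.copy_(t)      # the root keeps its own slice locally
            else:
                dist.send(t, dst)    # distribute slice `dst` to rank `dst`
    else:
        dist.recv(output, src)       # receive this rank's slice from the root
```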
ghstack-source-id: 147754837
Test Plan:
test_scatter_ops
test_scatter_stress
test_scatter_checks
Reviewed By: pritamdamania87
Differential Revision: D33154823
fbshipit-source-id: 4513e7eaf7d47a60eb67da99dc6c2e9a2882f3fd
(cherry picked from commit 93201f9d4a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66745
This PR implements gather for NCCL and adds it to ProcessGroupNCCL using the NCCL send/recv API.
NCCL does not directly provide a gather primitive, so it has to be implemented on top of NCCL's send/recv API.
1. In ProcessGroupNCCL.cpp, the outputTensors are first flattened; then inputTensors and outputFlattened are passed by the collective class to the gather() function in nccl.cpp.
2. In nccl.cpp, gather is implemented using ncclSend/ncclRecv: all ranks send their inputTensor to the root rank, and the root rank uses a for loop to receive these inputTensors, as sketched below.
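A rough Python sketch of the same pattern (point-to-point ops for illustration only; the helper name is hypothetical and the real implementation is the C++ code described above):

```python
import torch.distributed as dist

def gather_via_send_recv(input, gather_list, dst):
    # Hypothetical helper mirroring the logic above: every rank sends its input
    # to the root, and the root loops over ranks to receive each contribution.
    rank = dist.get_rank()
    if rank == dst:
        for src, slot in enumerate(gather_list):
            if src == dst:
                slot.copy_(input)     # the root's own contribution
            else:
                dist.recv(slot, src)  # receive rank `src`'s tensor into its slot
    else:
        dist.send(input, dst)
```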
ghstack-source-id: 147754838
Test Plan:
test_gather_ops
test_gather_checks
test_gather_stress
Reviewed By: pritamdamania87
Differential Revision: D29616361
fbshipit-source-id: b500d9b8e67113194c5cc6575fb0e5d806dc7782
(cherry picked from commit d560ee732e)
Summary:
Implements allreduce_coalesced for ProcessGroupNCCL as an NCCL group of allreduces on separate tensors, as proposed in https://github.com/pytorch/pytorch/issues/38995#issuecomment-882804595. In recent versions of NCCL, the performance of grouped comms has improved significantly. A group can execute with just one kernel, so a grouped comm on a set of unflattened tensors can be more performant than flattening + a single flat NCCL call.
The same approach can easily extend to broadcast_coalesced and reduce_coalesced.
I'm still not sure how (hypothetical) all_gather_coalesced and reduce_scatter_coalesced ops should be exposed or implemented, because we need to consider "_base" variants where the output or input tensor is pre-flattened. For example, https://github.com/pytorch/pytorch/issues/61781 effectively wants "allgather_base_coalesced".
I'm also not sure how the _multigpu variants should enter the picture. With the approach I've written here, ProcessGroupNCCL::allreduce accepts a vector of tensors that are either all on the same device (in which case it'll do an allreduce_coalesced) or all on different devices (in which case it'll do an allreduce_multigpu). In other words it can do _coalesced or _multigpu but not both at once.
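For reference, a hedged usage sketch from the Python side (assuming the op is surfaced as `dist.all_reduce_coalesced`; with the NCCL backend this should map onto a single grouped NCCL call as described):

```python
import torch
import torch.distributed as dist

# Coalesced allreduce over several unflattened tensors on the same device.
tensors = [torch.ones(n, device="cuda") for n in (4, 8, 16)]
dist.all_reduce_coalesced(tensors, op=dist.ReduceOp.SUM)
# Each tensor now holds the element-wise sum across all ranks, without an
# explicit flatten + single flat allreduce on the caller's side.
```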
For some reason GitHub won't let me add agolynski to the reviewers.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62140
Reviewed By: fduwjj
Differential Revision: D33781010
Pulled By: cbalioglu
fbshipit-source-id: f0c233da9ebae57d7ccecf6d8dc432d936d4d3ce
(cherry picked from commit e43cb81d30)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71600
These tests in test_c10d_nccl cover a subset of functionality that is
already exercised by distributed_test.py, so there is no need for these additional tests.
ghstack-source-id: 147458823
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D33662679
fbshipit-source-id: 2d1c1223fdd72a851c537b4793a71d65190d2553
(cherry picked from commit 14565ac5a6)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71459
1. Add the static_graph argument to the DDP constructor.
2. Keep the _set_static_graph() API so that existing use cases are not affected; it can also be called internally by the DDP constructor.
3. Four cases are covered (see the sketch after this list):
static_graph = False, _set_static_graph() is called;
static_graph = False, _set_static_graph() is not called;
static_graph = True, _set_static_graph() is not called;
static_graph = True, _set_static_graph() is called.
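A minimal sketch of the two equivalent spellings (assuming an already-initialized process group and one GPU per process):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = dist.get_rank() % torch.cuda.device_count()

# New constructor argument:
model = DDP(nn.Linear(10, 10).cuda(local_rank),
            device_ids=[local_rank],
            static_graph=True)

# Legacy private API, kept for existing callers:
# model = DDP(nn.Linear(10, 10).cuda(local_rank), device_ids=[local_rank])
# model._set_static_graph()
```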
ghstack-source-id: 147263797
Test Plan: unit tests
Reviewed By: rohan-varma
Differential Revision: D33646738
fbshipit-source-id: 8c1730591152aab91afce7133d2adf1efd723855
(cherry picked from commit dc246a1129)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69060
Saved-variable-hooks checkpointing was added in https://github.com/pytorch/pytorch/pull/69508; this PR adds some tests for DDP.
Specifically, we can support almost all DDP use cases with this new API, such as a dynamic module with find_unused_parameters=True. One case remains to be supported: static_graph + non-reentrant-based checkpointing. The underlying reason this does not work is https://github.com/pytorch/pytorch/issues/58111.
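A hedged sketch of a supported combination (assuming `use_reentrant=False` selects the saved-variable-hooks implementation and a process group is already initialized):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(16, 16)
        self.b = nn.Linear(16, 16)

    def forward(self, x):
        # Non-reentrant checkpointing of the second layer.
        return checkpoint(self.b, self.a(x), use_reentrant=False)

model = DDP(Net().cuda(), find_unused_parameters=True)
out = model(torch.randn(4, 16, device="cuda"))
out.sum().backward()
```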
ghstack-source-id: 147219887
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D32712126
fbshipit-source-id: ba5ae9ca77fd8929ee020c7dc97838bae9a1931b
(cherry picked from commit 9c7f93e217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68786
To enable autograd for the sharded linear, we found we needed to make some changes to the current nn function API (the c10d API with autograd enabled). We made the following changes:
1. Add a new `reduce_scatter` API, since we need it for row-wise sharding (a usage sketch follows this list).
2. Modify the `all_to_all` API to make it consistent with the one in distributed_c10d.py.
3. Fix the C++ signature of `reduce_scatter`, which was missing an input parameter, and add more unit tests to cover these cases.
4. Sync the NN tests from Gloo to NCCL.
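A hedged usage sketch of the new autograd-enabled collective (module path and signature assumed to mirror the non-autograd `torch.distributed.reduce_scatter`):

```python
import torch
import torch.distributed as dist
import torch.distributed.nn.functional as dist_nn

world_size = dist.get_world_size()
inputs = [torch.randn(8, device="cuda", requires_grad=True) for _ in range(world_size)]
output = torch.empty(8, device="cuda")

# Each rank receives the reduction of its own slot; gradients flow back
# through the collective to the corresponding entries of `inputs`.
out = dist_nn.reduce_scatter(output, inputs, op=dist.ReduceOp.SUM)
out.sum().backward()
```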
ghstack-source-id: 144860208
Test Plan: CI + Unit Test
Reviewed By: pritamdamania87
Differential Revision: D32569674
fbshipit-source-id: 9bd613f91bbf7a39eede0af32a5a5db0f2ade43b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69356
Per title
ghstack-source-id: 144807949
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D32816150
fbshipit-source-id: 6b4eacc63edd267bc1eb8a1c1d6c753bc581d63a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67705
This PR rewrites ProcessGroupNCCLTest as a MultiProcessTestCase. It was originally written in a single-process, multi-GPU fashion; we change it to multi-process to align with the other c10d tests.
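Roughly, the test structure moves to something like the following (a sketch based on my understanding of the common_distributed harness, not the actual test code):

```python
import torch
import torch.distributed as dist
from torch.testing._internal.common_distributed import MultiProcessTestCase

class ProcessGroupNCCLTest(MultiProcessTestCase):
    def setUp(self):
        super().setUp()
        self._spawn_processes()  # one subprocess per rank, instead of one process driving all GPUs

    def test_allreduce(self):
        store = dist.FileStore(self.file_name, self.world_size)
        pg = dist.ProcessGroupNCCL(store, self.rank, self.world_size)
        t = torch.ones(4, device=f"cuda:{self.rank}")
        pg.allreduce([t]).wait()
        self.assertEqual(t, torch.full_like(t, self.world_size))
```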
ghstack-source-id: 144555092
Test Plan: wait for CI
Reviewed By: pritamdamania87, fduwjj
Differential Revision: D32113626
fbshipit-source-id: 613d36aeae36bf441de1c2c83aa4755f4d33df4d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68792
Refactor tests to make it clearer which features are supported and
unsupported under certain DDP configs.
ghstack-source-id: 144285040
Test Plan: Ci
Reviewed By: pbelevich
Differential Revision: D32609498
fbshipit-source-id: 5231242054d4ff6cd8e7acc4a50b096771ef23d1
Summary:
Patch bfloat16 support in NCCL. PR https://github.com/pytorch/pytorch/issues/63260 adds bfloat16 support but is
not yet complete enough to enable bfloat16 allreduce in end-to-end training.
This patch does the following:
* fix the minimum required NCCL version from 2.9.7 to 2.10, since NCCL added bf16 support in
v2.10.3-1 (commit 7e51592)
* update the bfloat16 datatype flag in `csrc/cuda/nccl.cpp` so that NCCL
operations such as allreduce can use it
* enable unit tests for the bfloat16 datatype where possible
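With those pieces in place, a bf16 allreduce should work end to end, e.g. (assuming NCCL >= 2.10 and an initialized NCCL process group):

```python
import torch
import torch.distributed as dist

t = torch.ones(1024, dtype=torch.bfloat16, device="cuda")
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # each element becomes world_size
```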
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67843
Reviewed By: H-Huang
Differential Revision: D32248132
Pulled By: mrshenli
fbshipit-source-id: 081e96e725af3b933dd65ec157c5ad11c6873525
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67639
Due to BC considerations, we cannot directly error out, as that
might break existing applications. Raise warnings first to improve
debuggability.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D32075151
Pulled By: mrshenli
fbshipit-source-id: 5680d420f5f6cd3f74a36616c03350e8a976b363
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67668
This adds an env var to enable NCCL health check, which when left unspecified, results in the check not being run. Unit tests that need to test this functionality have the env variable set. Please see internal diff for more details.
Test Plan: CI
Reviewed By: yuguo68, mrshenli
Differential Revision: D32089763
fbshipit-source-id: dff5664a5e607f711515cd1042089ca769914fbb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66393
Third try!
Fixes:
- test_nccl_timeout could be flaky because of its 1s timeout; bump up the timeout to resolve the flakiness. In general, though, we should not have been relying on time.sleep for this test; filed https://github.com/pytorch/pytorch/issues/66354 to track that.
- ciflow/all did not actually run the tests due to a bug that caused multigpu tests not to be run. This has since been fixed.
ghstack-source-id: 140560113
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D31534735
fbshipit-source-id: 8b7e0f4fed3972b7a77cbcda28876c9eefb0c7e2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66394
Skips this test as it currently does not seem to pass after several
internal local runs.
ghstack-source-id: 140210583
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D31534806
fbshipit-source-id: 799849a6a715506a85c9697b46f7098d9b71b32e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65173
Initializes dummy NCCL communicators in the constructor as a basic health
check that communicators can be initialized before the first collective
is launched.
After successful init, we immediately use `ncclCommAbort` to destroy these
communicators to ensure they don't interfere with regular communicator creation
during collectives.
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D31005792
fbshipit-source-id: c2c582dee25a098361ead6ef03f541e7833c606b
Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/62303.
Reverts the revert, and restores some diffs that were mysteriously missing from the reverted revert. I think some of the diffs I pushed to the original PR raced with its import or landing, such that the original PR's merge didn't pick up all the diffs I wanted. I don't know enough about the landing process to do more than speculate wildly, but hopefully this resubmit sorts things out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62835
Reviewed By: zhouzhuojie, seemethere, janeyx99, heitorschueroff
Differential Revision: D30999982
Pulled By: malfet
fbshipit-source-id: 1f70ab4055208f1c6a80c9fc9fbe292ce68ecaa9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64241
When things go wrong, PG NCCL aborts NCCL communicators via `ncclCommAbort`, but one issue is that the error is often set to `ncclSystemError` (see https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/NCCLUtils.hpp#L176) when that might not be the true cause; the actual issue may be that some prior work timed out, the communicator was aborted on another rank, etc.
This results in a lot of confusion when debugging jobs with a large number of processes, as the current message for ncclSystemError is not very informative: https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/NCCLUtils.hpp#L22
The fix here is to pass a string exception message from PG NCCL down to `NCCLUtils`, which will aim to raise that as the actual issue instead of the confusing `ncclSystemError` message.
Test Plan: CI
Reviewed By: pallab-zz, cbalioglu
Differential Revision: D30658855
fbshipit-source-id: 17661dbe0a1bb8cc5b87b637c47634b1f52f54e1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63462
Now that `torch.distributed.optim` gates DistributedOptimizer on RPC availability, these tests can be run on windows.
ghstack-source-id: 136437635
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D30358923
fbshipit-source-id: 36739bdfe7214789f17de652d30c62c2bc124c73
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63260
Add BF16 all-reduce communication hook. Skip if CUDA version < 11 or NCCL version < 2.9.7.
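A hedged registration sketch (assuming the hook lands in `torch.distributed.algorithms.ddp_comm_hooks.default_hooks` next to the FP16 variant, and that a process group is already initialized):

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks as default
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = dist.get_rank() % torch.cuda.device_count()
model = DDP(nn.Linear(32, 32).cuda(local_rank), device_ids=[local_rank])

# Compress gradients to bfloat16 for the allreduce, decompress afterwards.
model.register_comm_hook(state=None, hook=default.bf16_compress_hook)
```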
Reviewed By: SciPioneer
Differential Revision: D30238317
fbshipit-source-id: bad35bf7d43f10f1c40997a282b831b61ef592bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63383
Per title
ghstack-source-id: 135966157
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D30358921
fbshipit-source-id: 965e054e525194b1ee55980340df275bab355c9b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63382
Per title
ghstack-source-id: 135966156
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D30255446
fbshipit-source-id: e6ffbf339db0bc5b4702d02b74a462309df07c75
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62051
The goal here is to enable opt-asan for "spawn" based unit tests since
this works for "spawn" unlike "dev-asan". As a result, we can run ASAN for
"spawn" unit tests as well.
This means we can completely remove fork unit tests from the code base since
the only purpose for these tests was to run ASAN.
ghstack-source-id: 135523770
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D29854514
fbshipit-source-id: 02a5bfcfae2afc21badecff77082c7a6ad83636b
Summary:
https://github.com/pytorch/pytorch/issues/62295
Previously, the packing and unpacking of the NCCL version "integer" was done to have parity with the upstream NCCL version encoding. However, there doesn't seem to be any place where this integer is directly compared with a version integer sourced from upstream NCCL, and syncing the encoding is error-prone (e.g., a recent change added a special case for minor versions >= 10: 7e51592129/src/nccl.h.in (L22)).
This patch changes the reporting to return a tuple of version numbers instead (to preserve ease-of-use for comparisons) and tweaks the passing between C/Python to avoid the digit overflow problem.
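With the tuple form, version checks become straightforward, e.g.:

```python
import torch

# Hedged sketch: torch.cuda.nccl.version() now returns a (major, minor, patch)
# tuple, so ordinary tuple comparison works and no digits can overflow.
if torch.cuda.nccl.version() >= (2, 10, 3):
    print("NCCL has bf16 support")
```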
CC ngimel mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62916
Reviewed By: anjali411
Differential Revision: D30201069
Pulled By: mrshenli
fbshipit-source-id: 2e4e7c69f001c3f22bd04aa6df6a992e538bea45
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62611
Enables optimizer overlap with the backward pass in DDP for Adam. Additional optimizers, especially Adagrad, will be done in follow-up diffs.
1. Implement a `step_param` method based on `step` in _FunctionalAdam (perf permitting, we can later dedupe `step` to call `step_param`); a conceptual sketch follows this list.
2. Modify tests to cover all current functional optimizers.
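Conceptually, `step_param` applies the optimizer update to a single parameter as soon as its gradient is available, which is what lets DDP overlap the step with the remaining backward work. A hypothetical single-parameter Adam step for illustration (names and signature are not the actual `_FunctionalAdam` API):

```python
import torch

def step_param(param, grad, exp_avg, exp_avg_sq, step, lr=1e-3,
               betas=(0.9, 0.999), eps=1e-8):
    # Standard Adam update, but scoped to a single parameter tensor.
    exp_avg.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    exp_avg_sq.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])
    bias_c1 = 1 - betas[0] ** step
    bias_c2 = 1 - betas[1] ** step
    denom = (exp_avg_sq / bias_c2).sqrt_().add_(eps)
    param.addcdiv_(exp_avg / bias_c1, denom, value=-lr)
```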
ghstack-source-id: 135207143
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29891783
fbshipit-source-id: 321915982afd5cb0a9c2e43d27550f433bff00d1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662
Replaced the methods set_tensor(.) and get_tensor() in the Python-exposed API from the C++ logic with buffer() and set_buffer(.), for a cleaner interface.
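A hedged sketch of a communication hook written against the renamed accessors (assuming they are exposed on `torch.distributed.GradBucket`):

```python
import torch
import torch.distributed as dist

def allreduce_hook(state, bucket):
    # Read the flat gradient with buffer(), write it back with set_buffer(.).
    bucket.set_buffer(bucket.buffer() / dist.get_world_size())
    fut = dist.all_reduce(bucket.buffer(), async_op=True).get_future()
    return fut.then(lambda f: f.value()[0])
```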
Reviewed By: SciPioneer
Differential Revision: D30012869
fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62532
This method is not stable at this time, so avoid releasing it when the DDP communication hook feature is released as a stable feature.
ghstack-source-id: 134787831
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_hook_with_optimizer_parity
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_hook_then_optimizer_nccl
Reviewed By: rohan-varma
Differential Revision: D30031222
fbshipit-source-id: e03a8e13fee5116a5ddd724eb76316ee98f2a676
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62457
Specify `Future[torch.Tensor]` as the DDP communication hook return type, which must now be explicitly a single tensor. The previous API took a list containing a single tensor.
Note that the typing info no longer accepts the internal type `torch._C.Future`, which does not support TorchScript and hence cannot express `Future[torch.Tensor]`.
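For example, a minimal hook under the new typing (a sketch; the hook's future must now resolve to the flat bucket tensor itself rather than a one-element list):

```python
import torch
import torch.distributed as dist

def noop_hook(state, bucket: dist.GradBucket) -> torch.futures.Future[torch.Tensor]:
    fut: torch.futures.Future[torch.Tensor] = torch.futures.Future()
    fut.set_result(bucket.buffer())   # previously: fut.set_result([bucket.buffer()])
    return fut
```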
ghstack-source-id: 134771419
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_invalid_comm_hook_return_type
Reviewed By: rohan-varma
Differential Revision: D30007390
fbshipit-source-id: 246667c9b575b4c6e617b0a5b373151f1bd81e7f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62389
Simplify the implementation of `parseHookResult` of `PythonCommHook`, by not directly accepting the output of allreduce, which is a tensor list.
Address the comment on https://github.com/pytorch/pytorch/pull/62074#discussion_r675303280
Additionally, the formatter is also applied to `OptimizerHookState` and `hook_then_optimizer`.
ghstack-source-id: 134626246
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork
Reviewed By: rohan-varma
Differential Revision: D29982485
fbshipit-source-id: 5b27cc5ef09d2f87c1ade4c0feef7eacc1af3a9a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887
1) Introduced a `sandcastle_skip_if` decorator that ensures these
tests are simply reported as passing on Sandcastle instead of skipped (see the sketch below).
2) Fixed all test files under `test/distributed` to not use `unittest.skip`.
The overall goal is to avoid skips, since Sandcastle flags these tests as
continuously skipping.
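A rough sketch of the idea behind the decorator (illustrative only, not the exact in-tree helper; the Sandcastle sentinel is an assumption):

```python
import functools
import os
import unittest

def sandcastle_skip_if(condition, reason):
    def decorator(func):
        if not condition:
            return func
        if os.getenv("SANDCASTLE") == "1":  # assumed sentinel for the internal CI
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                # Report the test as trivially passing instead of skipping it.
                print(f"Passing {func.__name__} on Sandcastle: {reason}")
            return wrapper
        return unittest.skip(reason)(func)  # normal skip elsewhere
    return decorator
```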
ghstack-source-id: 134382237
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D29784152
fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62079
Adds support for passing kwargs to the functional optimizer running as a
hook.
ghstack-source-id: 134330379
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29838127
fbshipit-source-id: 2ab051ef5f0dff19c145ebe2260668b927ba47b2