Commit Graph

95 Commits

Author SHA1 Message Date
Tristan Rice
2673a440d0 [distributed] add PG APIs and general doc cleanups (#140853)
Doc updates:

* This adds documentation for the object oriented ProcessGroup APIs that are being used in torchft as well as https://github.com/pytorch/rfcs/pull/71 .
* It also does some general cleanups to simplify the distributed.rst by using `:methods`.
* It adds `__init__` definitions for the Stores
* I've reordered things so the collective APIs are before the Store/PG apis

Test plan:

```
lintrunner -a
cd docs && sphinx-autobuild source build/ -j auto -WT --keep-going
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140853
Approved by: https://github.com/kwen2501
2024-11-19 02:06:32 +00:00
Will Constable
3d93caf664 [c10d] Add thread-safety initialization warning (#139638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139638
Approved by: https://github.com/kwen2501, https://github.com/c-p-i-o, https://github.com/XilunWu
2024-11-04 21:38:47 +00:00
Wanchao Liang
cfc227ad43 [reland][dtensor] move DTensor to public namespace (#134203)
reland of https://github.com/pytorch/pytorch/pull/133113

I have to create a new PR because the previous reverted PR could not either be rebased, or imported successfully :(

----

Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next PRs)
* To preserve the BC for users still using the torch.distributed._tensor, I added a shim script to redirect old path calls to the new module

The BC preserving is evidented by the fact that all DTensor tests are still working without changing the public imports. So it's safe to land the changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
2024-09-08 17:08:40 +00:00
PyTorch MergeBot
35f36363ec Revert "[dtensor] move DTensor to public namespace (#133113)"
This reverts commit 2ee6b97464.

Reverted https://github.com/pytorch/pytorch/pull/133113 on behalf of https://github.com/wanchaol due to looks like it break some internal type imports ([comment](https://github.com/pytorch/pytorch/pull/133113#issuecomment-2295670911))
2024-08-19 05:00:19 +00:00
Wanchao Liang
2ee6b97464 [dtensor] move DTensor to public namespace (#133113)
Moving DTensor to be in the public namespace, to formally add the
documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next
  PRs)
* To preserve the BC for users still using the `torch.distributed._tensor`,
  I added a shim script to redirect old path calls to the new module

The BC preserving is evidented by the fact that all DTensor tests are still
working without changing the public imports. So it's safe to land the
changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
2024-08-17 05:09:52 +00:00
Ke Wen
01601ebd41 Retire torch.distributed.pipeline (#127354)
Actually retiring module after deprecation warning for a while.
The new supported module is: torch.distributed.pipelining.
Please migrate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354
Approved by: https://github.com/wconstab
2024-06-07 08:11:58 +00:00
PyTorch MergeBot
0ff60236ab Revert "Retire torch.distributed.pipeline (#127354)"
This reverts commit b9c058c203.

Reverted https://github.com/pytorch/pytorch/pull/127354 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the doc build failure looks legit b9c058c203 ([comment](https://github.com/pytorch/pytorch/pull/127354#issuecomment-2148133982))
2024-06-04 18:19:31 +00:00
Ke Wen
b9c058c203 Retire torch.distributed.pipeline (#127354)
Actually retiring module after deprecation warning for a while.
The new supported module is: torch.distributed.pipelining.
Please migrate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354
Approved by: https://github.com/wconstab
2024-06-04 07:03:26 +00:00
Will Constable
26b942c4fc [C10D] Document destroy_process_group usage (#122358)
This API was not documented. It has already been a source of confusion,
but recently has become more urgent as improper destruction can lead to
hangs due to ncclCommAbort's requirement of being called collectively.
<img width="888" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/9e16342d-1108-4d7d-95c8-b8753661b8e9">

Fixes #48203
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122358
Approved by: https://github.com/shuqiangzhang
2024-05-09 16:51:31 +00:00
Muralidhar Andoorveedu
b96b1e8cff [Distributed] Add P2P versions of *object_list operations (#124379)
This PR adds `send_object_list` and `recv_object_list` to `distributed_c10d.py`. This is extending functionality already present in PyTorch with `broadcast_object_list` that I noticed was missing and decided to upstream.

With this change, sending and receiving arbitrary picklable python objects is possible.

Relevant issue: https://github.com/pytorch/pytorch/issues/3473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124379
Approved by: https://github.com/kwen2501, https://github.com/wconstab
2024-05-03 23:22:58 +00:00
Tianyu Liu
af5376c444 [dtensor] add support for loss parallel (#119877)
Loss parallel is the last piece of sequence parallelism to enable. It enables efficient distributed cross entropy computation when the input is sharded on the class dimension (in a classification problem with many classes). The implementation is via a context manager `loss_parallel`, after enabling which users can directly use `torch.nn.functional.cross_entropy` or `torch.nn.CrossEntropyLoss` without modifying other parts of their code.

Here are the underlying rationales why we are going through these op replacements:

1. `nn.functional.cross_entropy` is the common method that OSS user is using for things like transformer training, to avoid changing user code, we want user to still use this function for loss calculation if they are already using it.
2. `nn.functional.cross_entropy` boils down into `aten.log_softmax` and `aten.nll_loss_foward/backward`, and DTensor now supports those ops already (#117723 #119255 #118917 #119256). They are doing computation with input *replicated* on the class dimension.
3. However when the input of this loss calculation is **sharded on the class dimension**, to run sharded computation efficiently, we need to run both `aten.log_softmax` and `aten.nll_loss_foward` with multiple all-reduce collectives **in the middle of** those aten ops. This is not possible if we are just overriding these two ops, so we need to have some way to **decompose** these two ops into smaller ops to have collectives run in the middle of these two ops.
4. We explored the existing decompositions (#118950). It seems working, except that `log_softmax_backward` and `nll_loss_backward` combined together in aten are implemented in a inefficient way, which would trigger an additional expensive collective. Recently some user also reported similar issues https://github.com/pytorch/pytorch/issues/119261.
5. Therefore, currently we are doing our own decomposition inside a context manager for sequence parallelism specifically. Once we have a better decomposition in core, we can possibly take that instead of reinventing the wheels here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119877
Approved by: https://github.com/wanchaol
2024-03-02 05:06:26 +00:00
Lucas Pasqualin
b342286646 adds async save, makes checkpointer private (#116293)
Adds Async Save and also makes `Checkpointer` classes private.

The original PR was here: https://github.com/pytorch/pytorch/pull/115864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116293
Approved by: https://github.com/fegin
2023-12-22 05:22:39 +00:00
Will Constable
28e4004286 Add doc for torch.distributed.breakpoint (#115656)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115656
Approved by: https://github.com/wanchaol, https://github.com/fegin
ghstack dependencies: #115705
2023-12-14 14:45:36 +00:00
Iris Zhang (PyTorch)
23fa9621e4 [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099) (#115193)
Summary:

Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation.
We created stubs for public class and methods in torch.distributed.device_mesh so that torch.distributed.device_mesh can be imported with or without distributed is available().

Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/115099
Prior to landing, CI signals are all passed. Shipit added the "ci/trunk" label to the PR and DID NOT wait for it and went ahead committing. More context can be found in the reverted PR above.

Test Plan: CI.

Differential Revision: D51861018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193
Approved by: https://github.com/fegin
2023-12-08 08:44:32 +00:00
Lucas Pasqualin
5432088098 Adds Checkpointer Wrapper for DCP [3/N] (#114603)
Adds a useful high level wrapper for calling `dist.save/load` with the correct storage readers and writers.

Instead of doing:

```
DCP.save(
    state_dict={...},
    storage_writer=StorageWriter(...)
)

DCP.load(
    state_dict={...},
    storage_reader=StorageReader(...)
)
```

We can now do:

```
checkpointer = Checkpointer(...)

checkpointer.save(state_dict={...})
checkpointer.load(state_dict={...})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114603
Approved by: https://github.com/fegin, https://github.com/wz337
2023-12-08 01:03:21 +00:00
Howard Huang
3e66385ddd Add Work to distributed docs (#115172)
Summary:
Documenting the `Work` object

For a collective (broadcast, all_reduce, etc.) when async_op=True we return a `Work` object to which users can call `.wait()`, `.is_success()`, among other things but this class is not documented

Test Plan: Preview the docs build in OSS

Differential Revision: D51854974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115172
Approved by: https://github.com/wconstab
2023-12-07 18:12:10 +00:00
Nikita Shulga
a827ac71f2 Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099)"
This reverts commit eaa64339d6.
2023-12-05 08:59:36 -08:00
Iris Zhang (PyTorch)
eaa64339d6 [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099)
Summary:
Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation.

Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/114991
It was failing because failing a public module binding tests in MacOS, and this is due to the change in import order for torch/distributed/fsdp/_common_utils.py. Since this original import would still work, we remove the changes in this file.

Test Plan: CI.

Differential Revision: D51825114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115099
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-12-05 05:44:52 +00:00
PyTorch MergeBot
3a2e2044cd Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710) (#114991)"
This reverts commit 729ac7317a.

Reverted https://github.com/pytorch/pytorch/pull/114991 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114991#issuecomment-1837214567))
2023-12-02 17:55:51 +00:00
Iris Zhang (PyTorch)
729ac7317a [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710) (#114991)
Summary:

Same content of changes as https://github.com/pytorch/pytorch/pull/114710

Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation.
ghstack-source-id: 208980207
exported-using-ghexport

Test Plan: CI.

Reviewed By: wanchaol

Differential Revision: D51629761

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114991
Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/fegin
2023-12-02 04:39:41 +00:00
Lucas Pasqualin
f073dcd4f7 Stateful Checkpointing for Distributed [1/N] (#113867)
First pass at adding a save/load API, as well as definition of Stateful objects.

Amongst a couple todo's, we still need to explore adding an `all_gather` & potentially a `barrier` while iterating through state keys.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113867
Approved by: https://github.com/fegin, https://github.com/wz337
2023-12-01 19:21:03 +00:00
Ke Wen
dc65f6c601 [c10d] Remove deprecated multi-gpu-per-thread APIs (#114156)
As of today, PyTorch Distributed's preferred programming model is one device per thread, as exemplified by the APIs in its document.  The multi-GPU functions (which stand for multiple GPUs per CPU thread) have been deprecated for three versions. Removing them now before 2.2 release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114156
Approved by: https://github.com/albanD, https://github.com/fduwjj, https://github.com/H-Huang
2023-11-21 03:50:23 +00:00
Edward Z. Yang
3a3a979984 Add torch.distributed.breakpoint (#113775)
I tested it works by patching

```
diff --git a/test/distributed/test_dynamo_distributed.py b/test/distributed/test_dynamo_distributed.py
index 96b3a82bdfa..dea9bac9302 100644
--- a/test/distributed/test_dynamo_distributed.py
+++ b/test/distributed/test_dynamo_distributed.py
@@ -18,6 +18,7 @@ from torch._dynamo import config
 from torch._dynamo.utils import same
 from torch._dynamo.testing import collect_results
 from torch.utils._triton import has_triton
+import torch.distributed as dist
 from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy, lambda_auto_wrap_policy
 from torch._higher_order_ops.wrap import tag_activation_checkpoint
 from torch.nn.parallel import DistributedDataParallel as DDP
@@ -398,6 +399,7 @@ class TestMultiProc(DynamoDistributedMultiProcTestCase):
     @unittest.skipIf(not has_triton(), "Inductor+gpu needs triton and recent GPU arch")
     def test_fsdp_activation_checkpointing(self):
         with _dynamo_dist_per_rank_init(self.rank, self.world_size):
+            dist.breakpoint()
             model, inputs = get_toy_model_for_activation_checkpointing(f"cuda:{self.rank}")
             is_inner = lambda module: isinstance(module, ToyInnerModel)  # noqa: E731
             wrap_policy = functools.partial(lambda_auto_wrap_policy, lambda_fn=is_inner)
```

and then running `python test/distributed/test_dynamo_distributed.py -k test_fsdp_activation_checkpointing`

It prints:

```
ATTENTION!!!

Type 'up' to get to the frame that called dist.breakpoint(rank=0)

> /data/users/ezyang/c/pytorch/torch/distributed/__init__.py(71)breakpoint()
-> barrier()
(Pdb) up
> /data/users/ezyang/c/pytorch/test/distributed/test_dynamo_distributed.py(402)test_fsdp_activation_checkpointing()
-> dist.breakpoint()
(Pdb) list
397
398         @skip_if_lt_x_gpu(1)
399         @unittest.skipIf(not has_triton(), "Inductor+gpu needs triton and recent GPU arch")
400         def test_fsdp_activation_checkpointing(self):
401             with _dynamo_dist_per_rank_init(self.rank, self.world_size):
402  ->             dist.breakpoint()
403                 model, inputs = get_toy_model_for_activation_checkpointing(f"cuda:{self.rank}")
404                 is_inner = lambda module: isinstance(module, ToyInnerModel)  # noqa: E731
405                 wrap_policy = functools.partial(lambda_auto_wrap_policy, lambda_fn=is_inner)
406                 model = apply_fsdp_with_checkpointing(model, wrap_policy, is_inner)
407                 correct_outputs = model(inputs)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113775
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2023-11-16 19:30:57 +00:00
albanD
c4db607607 Doc test non packages (#110568)
Add non-package python modules to the public API checks.
The original change is to remove the `ispkg` check in this line
https://github.com/pytorch/pytorch/blob/main/docs/source/conf.py#L518

Everything else is to add the appropriate modules to the rst files, make sure every module we provide can be imported (fixed by either making optional dependencies optional or just deleting files that have been un-importable for 3 years), make API that are both modules and functions (like torch.autograd.gradcheck) properly rendered on the docs website without confusion and add every non-documented API to the allow list (~3k of them).

Next steps will be to try and fix these missing docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110568
Approved by: https://github.com/zou3519
2023-10-06 14:16:01 +00:00
Howard Huang
1ca68c971c distributed doc fix (#110157)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110157
Approved by: https://github.com/awgu
2023-09-28 01:34:02 +00:00
Pritam Damania
704b0b3c67 [RESUBMIT] Standardize on error types for distributed errors. (#108191)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
2023-08-30 21:47:39 +00:00
PyTorch MergeBot
d4ff06ec84 Revert "Standardize on error types for distributed errors. (#107651)"
This reverts commit 0e2317479b.

Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor test in trunk for one of its model moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))
2023-08-28 23:58:33 +00:00
Pritam Damania
0e2317479b Standardize on error types for distributed errors. (#107651)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
2023-08-28 21:58:15 +00:00
Svetlana Karslioglu
d425da8bf3 Replace master with main in links and docs/conf.py (#100176)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100176
Approved by: https://github.com/albanD, https://github.com/malfet
2023-05-02 18:20:32 +00:00
xiny
57bb4cd046 [Doc][Distributed] Add missing functions to distributed.rst (#89905)
Add missing documents for `torch.distributed.all_to_all_single` and other functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89905
Approved by: https://github.com/kit1980
2022-12-04 07:22:54 +00:00
Wanchao Liang
4451eb24e6 Move tensor_parallel out to distributed.tensor folder (#89878)
This PR moves tensor parallel from torch.distributed._tensor.parallel
to torch.distributed.tensor.parallel, to prepare for beta release
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89878
Approved by: https://github.com/fduwjj
2022-11-30 22:13:10 +00:00
Howard Huang
bc66ddb5cb Add torch.distributed.DistBackendError exception type, thrown from C10D_NCCL_CHECK (#88134)
Currently all of the distributed errors are thrown from the `TORCH_CHECK` macro which throws a generic `RuntimeError`. This change introduced a new error type `DistBackendError` which derives from `RuntimeError` to signify there was an error with the backend communication library. This allows for better error handling and analysis at higher levels in the stack. Motivation: https://docs.google.com/document/d/1j6VPOkC6znscliFuiDWMuMV1_fH4Abgdq7TCHMcXai4/edit#heading=h.a9rc38misyx8

Changes:
- introduce new error type
- Update `C10D_NCCL_CHECK`

Sample script to demonstrate new error type

```python
# python -m torch.distributed.run --nproc_per_node=2 <script>.py

import torch
import torch.distributed as dist

if __name__ == "__main__":
    dist.init_process_group("nccl")
    dist.broadcast(torch.tensor([1, 2, 3]).cuda(), 0)
```

Differential Revision: [D40998803](https://our.internmc.facebook.com/intern/diff/D40998803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88134
Approved by: https://github.com/rohan-varma
2022-11-08 13:26:42 +00:00
Masaki Kozuki
28593a8339 [docs] batch_isend_irecv and P2POp of torch.distributed (#86438)
Reopening https://github.com/pytorch/pytorch/pull/79722

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86438
Approved by: https://github.com/kit1980
2022-10-25 00:11:50 +00:00
Howard Huang
cc9183eb4c Update distributed.rst backend collective support chart (#86406)
NCCL `scatter` was added by Wanchao in https://github.com/pytorch/pytorch/pull/70029

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86406
Approved by: https://github.com/wanchaol
2022-10-07 12:59:09 +00:00
Ke Wen
05d1128106 [c10d] Start deprecating *_multigpu APIs (#85961)
### Deprecation reasons:
- For most users training is on one GPU per process so these APIs are rarely used
- They added one more API dimension
- They can be expressed in a composed manner
- They are not abstracted – specific to GPU
- They caused backend APIs and implementations to have nested `std::vector<std::vector<Tensor>>`, which is hard to read or maintain

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85961
Approved by: https://github.com/XilunWu, https://github.com/H-Huang
2022-10-01 00:59:39 +00:00
Ke Wen
ade1c19612 Add reduce_scatter_tensor in place of _reduce_scatter_base (#85867)
This is a twin PR similar to the one for `all_gather_into_tensor` (#85686).
The philosophy for renaming `_reduce_scatter_base` instead of merging it is described in #85686.

Cc @rohan-varma @H-Huang @crcrpar @ptrblck @mrshenli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85867
Approved by: https://github.com/crcrpar, https://github.com/H-Huang
2022-09-30 05:48:16 +00:00
Ke Wen
775a22c7c6 Add all_gather_into_tensor in place of _all_gather_base (#85686)
### Description
- This PR renames `_all_gather_base` to `all_gather_into_tensor` so that it is clearer in meaning.
- The `all_gather_into_tensor` API differs from the `all_gather` API in the output it accepts -- a single, large tensor instead of a list of tensors.
- This PR also adds deprecation warning to `_all_gather_base`.

### Issue
`_all_gather_base` was implemented in https://github.com/pytorch/pytorch/pull/33924 to avoid unnecessary flattening. There was previous effort (#82639) to merge `_all_gather_base` with the existing `all_gather` API by detecting the parameter type passed in for the output.

There are, however, two "blockers" that make the merge difficult:
(i) The merge leads to backward compatibility break. We would need to change the parameter name `tensor_list` in `all_gather` to a general name `output` that can cover both tensor and tensor list.
(ii) Recently, the `all_gather` API has added uneven tensor support, utilizing the tensor boundaries implied by the list. We are, however, not sure to add such support to the `_all_gather_base` function, because that would require users to pass in additional tensor boundary information.

In view of the above, we decided to productize `_all_gather_base` as a separate function, but with a clearer name.

### Testing
Added tests:
- `test_all_gather_into_cat_tensor_cuda` -- output form as with `torch.cat`. For example:
```
        >>> tensor_in
        tensor([1, 2], device='cuda:0') # Rank 0
        tensor([3, 4], device='cuda:1') # Rank 1
        >>> tensor_out
        tensor([1, 2, 3, 4], device='cuda:0') # Rank 0
        tensor([1, 2, 3, 4], device='cuda:1') # Rank 1
```
- `test_all_gather_into_stack_tensor_cuda` -- output form as with `torch.stack`. For example:
```
        >>> tensor_out2
        tensor([[1, 2],
                [3, 4]], device='cuda:0') # Rank 0
        tensor([[1, 2],
                [3, 4]], device='cuda:1') # Rank 1
```
The output form is determined by the shape of the output tensor passed by the user, no flag used.

Cc @rohan-varma @mrshenli @crcrpar @ptrblck @H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85686
Approved by: https://github.com/rohan-varma, https://github.com/crcrpar
2022-09-27 22:50:22 +00:00
Shawn Zhong
9c902f4749 Add TORCH_CPP_LOG_LEVEL to the docs
Fixes #70667

`TORCH_CPP_LOG_LEVEL=INFO` is needed for `TORCH_DISTRIBUTED_DEBUG` to be effective.

For reference, https://github.com/pytorch/pytorch/pull/71746 introduced the environment variable `TORCH_CPP_LOG_LEVEL` and https://github.com/pytorch/pytorch/pull/73361 documented it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76625
Approved by: https://github.com/rohan-varma
2022-05-03 17:01:11 +00:00
Ke Wen
1f04a00ccf [PyTorch Distributed] Update documentation about NCCL environment variables (#74006)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74006

updated recommendations about environment variables to use during debug
and performance tuning

Test Plan: `make html`

Reviewed By: rohan-varma

Differential Revision: D34767454

fbshipit-source-id: 08cd58469bf72b58702e50e82020fa19b43b5911
(cherry picked from commit ac7e6630f8043f85d3d16be17c6a8ad1ebb2990c)
2022-03-11 23:57:17 +00:00
Alban Desmaison
734281c3d6 Cleanup all module references in doc (#73983)
Summary:
Working towards https://docs.google.com/document/d/10yx2-4gs0gTMOimVS403MnoAWkqitS8TUHX73PN8EjE/edit?pli=1#

This PR:
- Ensure that all the submodules are listed in a rst file (that ensure they are considered by the coverage tool)
- Remove some long deprecated code that just error out on import
- Remove the allow list altogether to ensure nothing gets added back there

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73983

Reviewed By: anjali411

Differential Revision: D34787908

Pulled By: albanD

fbshipit-source-id: 163ce61e133b12b2f2e1cbe374f979e3d6858db7
(cherry picked from commit c9edfead7a01dc45bfc24eaf7220d2a84ab1f62e)
2022-03-10 22:26:29 +00:00
Can Balioglu
0e7a7a5fe7 Add documentation for c10d log levels (#73361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73361

This PR adds the documentation for the newly introduced `TORCH_CPP_LOG_LEVEL` and how it can be used along with `TORCH_DISTRIBUTED_DEBUG` to adjust the log level of c10d.
ghstack-source-id: 149874995

Test Plan: Locally rendered and checked the documentation.

Reviewed By: rohan-varma

Differential Revision: D34452352

fbshipit-source-id: ecb54590f3030ddef9921a7152ca9f7fc9438345
(cherry picked from commit f4c7c6f3b27dbd3006686cf26a6e9e53cd2c8f09)
2022-02-24 20:38:15 +00:00
Can Balioglu
e1db2f13ce Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166

This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
ghstack-source-id: 149778566

Test Plan: Run the existing unit tests.

Reviewed By: rohan-varma

Differential Revision: D34371226

fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
2022-02-24 02:33:05 +00:00
Wanchao Liang
9b53d3194c Implement gather primitive for ProcessGroupNCCL (#66745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66745

This PR implement NCCL gather and add gather to ProcessGroupNCCL using nccl send/recv api.

NCCL doesn’t directly provide primitives for gather, so we need to be implemented on top of NCCL’s send/recv API.
1. In ProcessGroupNCCL.cpp, the outputTensors are first flattened, then inputTensors and outputFlattened are passed by the collective class to gather() function in nccl.cpp.
1. In nccl.cpp, gather is implemented using ncclSend/ncclRecv: all the ranks send inputTensor to the root rank, and the root rank uses a for loop to receive these inputTensors.
ghstack-source-id: 147754838

Test Plan:
test_gather_ops
test_gather_checks
test_gather_stress

Reviewed By: pritamdamania87

Differential Revision: D29616361

fbshipit-source-id: b500d9b8e67113194c5cc6575fb0e5d806dc7782
(cherry picked from commit d560ee732e)
2022-01-27 19:37:55 +00:00
Shen Li
7bc220e060 Update distributed.rst for ProcessGroup Extensions (#71482)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71482

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D33745986

Pulled By: mrshenli

fbshipit-source-id: fe2d0491901bf00be09deb5c556bc1e1d359b725
(cherry picked from commit be5104bfd7)
2022-01-25 00:30:08 +00:00
mrshenli
b8c3693281 Remove autograd-enabled collective APIs from distributed docs (#69011)
Summary:
These APIs are not yet officially released and are still under discussion. Hence, this commit removes those APIs from docs and will add them back when ready.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69011

Reviewed By: fduwjj

Differential Revision: D32703124

Pulled By: mrshenli

fbshipit-source-id: ea049fc7ab6b0015d38cc40c5b5daf47803b7ea0
2021-11-29 18:14:50 -08:00
Yi Wang
7f25c3e666 Update distributed.rst to show that CUDA send/recv on GPU is supported (#65601)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65601

I believe this feature was supported one year ago:
https://github.com/pytorch/pytorch/pull/44921

#Closes: https://github.com/pytorch/pytorch/issues/65525
ghstack-source-id: 138918961

Test Plan: N/A

Reviewed By: pritamdamania87, mingzhe09088

Differential Revision: D31163535

fbshipit-source-id: 9321a0a5137a3e265e2b54bd78730ac28c7acd55
2021-09-24 12:30:10 -07:00
Kiuk Chung
9d95d48567 (torch.distributed) Add torch.distributed.is_torchelastic_launched() util method + make init_method=tcp:// compatible with torchelastic (#63910)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63910

Addresses the current issue that `init_method=tcp://` is not compatible with `torch.distributed.run` and `torch.distributed.launch`. When running with a training script that initializes the process group with `init_method=tcp://localhost:$port` as such:

```
$ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py
```

An `Address in use` error is raised since the training script tries to create a TCPStore on port 6000, which is already taken since the elastic agent is already running a TCPStore on that port.

For details see: https://github.com/pytorch/pytorch/issues/63874.

This change does a couple of things:

1. Adds `is_torchelastic_launched()` check function that users can use in the training scripts to see whether the script is launched via torchelastic.
1. Update the `torch.distributed` docs page to include the new `is_torchelastic_launched()` function.
1. Makes `init_method=tcp://` torchelastic compatible by modifying `_tcp_rendezvous_handler` in `torch.distributed.rendezvous` (this is NOT the elastic rendezvous, it is the old rendezvous module which is slotted for deprecation in future releases) to check `is_torchelastic_launched()` AND `torchelastic_use_agent_store()` and if so, only create TCPStore clients (no daemons, not even for rank 0).
1. Adds a bunch of unittests to cover the different code paths

NOTE: the issue mentions that we should fail-fast with an assertion on `init_method!=env://` when `is_torchelastic_launched()` is `True`. There are three registered init_methods in pytorch: env://, tcp://, file://. Since this diff makes tcp:// compatible with torchelastic and I've validated that file is compatible with torchelastic. There is no need to add assertions. I did update the docs to point out that env:// is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future but this is out of scope for this issue.

Test Plan: Unittests.

Reviewed By: cbalioglu

Differential Revision: D30529984

fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5
2021-08-25 22:57:43 -07:00
Howard Huang
cdc027679b Add compare_set in distributed docs (#61351)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61351

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29588206

Pulled By: H-Huang

fbshipit-source-id: 9db48e7b6de29503275f10616470ad2d66b075f9
2021-07-08 12:30:32 -07:00
Rohan Varma
2f395f3b54 [reland] Document debugability features in torch.distributed (#59726)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59726

Reland of https://github.com/pytorch/pytorch/pull/59604 with indentation fix
ghstack-source-id: 130979356

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D29001923

fbshipit-source-id: 225d9dc5054c223b453f3b39749e2b62f61b9a2c
2021-06-09 16:40:11 -07:00
Luca Wehrstedt
f1786b293d Revert D28972444: [pytorch][PR] Document debugability features in torch.distributed
Test Plan: revert-hammer

Differential Revision:
D28972444 (a9d2810817)

Original commit changeset: da5e8ee84f0d

fbshipit-source-id: 94d3b3b75ddec74ea5b2b76f6a7519dc921ee2a7
2021-06-09 03:04:36 -07:00