This PR adds support for the following use cases:
- Sync style:
```python
with dist._coalescing_manager():
for i in range(num_coll):
dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])
```
- Async style:
```python
with dist._coalescing_manager(async_ops=True) as cm:
for i in range(num_coll):
dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])
# do a bunch of other things
cm.wait()
# do things that depend on the all-gather's
```
Each `all_gather_into_tensor` is independent in terms of its data and buffer location, but supported backends (such as NCCL) can execute them in parallel.
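For reference, here is a fuller end-to-end sketch of the async style (a minimal sketch only: it assumes two ranks launched via `torchrun` with one CUDA device per rank, and the tensor shapes and `num_coll` value are illustrative):
```python
# Hedged sketch: RANK/WORLD_SIZE are assumed to be set by torchrun and NCCL to be available.
import os
import torch
import torch.distributed as dist

if __name__ == "__main__":
    rank = int(os.environ["RANK"])
    torch.cuda.set_device(rank)
    dist.init_process_group(backend="nccl")

    num_coll = 4
    world_size = dist.get_world_size()
    # each collective uses its own independent input and output buffers
    input_tensors = [torch.full((8,), float(rank), device="cuda") for _ in range(num_coll)]
    output_tensors = [torch.empty(8 * world_size, device="cuda") for _ in range(num_coll)]

    with dist._coalescing_manager(async_ops=True) as cm:
        for i in range(num_coll):
            dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])
    # overlap other work here, then block before consuming the gathered results
    cm.wait()

    dist.destroy_process_group()
```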
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101157
Approved by: https://github.com/kumpera, https://github.com/wanchaol
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330
### Implementation
Move backend-specific (NCCL, Gloo, etc.) collective implementations into the corresponding `Backend` class. Update `ProcessGroup` to support multiple backends and use the dispatcher to call into a backend based on the tensor's device type.
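Conceptually, the dispatching looks like the following Python-level sketch (illustration only, with made-up names; the real routing happens in C++ through the dispatcher in `Ops.cpp`):
```python
import torch

class ProcessGroupSketch:
    """Toy stand-in for the wrapper ProcessGroup; not the actual implementation."""

    def __init__(self, backends):
        # e.g. {"cpu": <Gloo backend>, "cuda": <NCCL backend>}
        self._backends = backends

    def _backend_for(self, tensor: torch.Tensor):
        # pick the backend registered for the tensor's device type
        return self._backends[tensor.device.type]

    def allreduce(self, tensor, *args, **kwargs):
        # a CPU tensor routes to Gloo, a CUDA tensor routes to NCCL
        return self._backend_for(tensor).allreduce(tensor, *args, **kwargs)
```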
### Changes
#### c++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`)
- Update pybind definitions for the new process group base class and the new backend class
- Update the pybind-ed backend class with collective definitions to keep backward compatibility with the Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`) used in tests
- Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, and `ProcessGroupUCC` to derive from the `Backend` class
- Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by looking up the backend using the device type
- Update the internal dispatched implementation of `barrier` to use a tensor, which allows the operation to be dispatched (see the sketch after this list)
- Update the `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching correctly; I still don't understand why and originally filed this as issue #85122
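A Python-level sketch of the barrier-as-dispatched-op idea from the bullet above (the helper name is made up; the actual change is in the C++ implementation):
```python
import torch
import torch.distributed as dist

def barrier_sketch(device: str = "cpu") -> None:
    # a dummy tensor carries the device type, so the call takes the same
    # device-keyed dispatch path as the other collectives
    t = torch.zeros(1, device=device)
    dist.all_reduce(t)
    if device.startswith("cuda"):
        torch.cuda.synchronize()  # make the barrier effective for in-flight CUDA work
```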
#### python changes (`distributed_c10d.py`, test files)
- Add a `BackendConfig` class to specify the default configuration of backends, and a `get_backend_config()` API
- Add a deprecation warning to `get_backend()`
- `init_process_group` now returns a generic `ProcessGroup` object; it contains a list of backends (the ones stated above) to which it will dispatch operations
- `new_group` is updated to return the same
- Update `test_c10d_gloo.py`: update `DistributedDataParallelTest` to use `init_process_group`, update `ReducerTest`, and update `test_broadcast_coalesced_gloo` to move away from the PG instance and Gloo options
- Update `test_c10d_nccl.py`: update `DistributedDataParallelTest` to use `init_process_group`
- Specific tests updated: `test_Backend_enum_class`
### Changes missing
- lazy initialization of backends
- support parsing of BackendConfig
### open questions
- Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338)
# Example
This is a basic script that uses two backends (Gloo and NCCL) within one process group:
```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py
import torch.distributed as dist
import torch
import os
if __name__ == "__main__":
rank = os.environ.get("RANK")
# initialize with both gloo and nccl
dist.init_process_group()
# with gloo
dist.all_reduce(torch.tensor([1.0]))
print(f"Rank {rank} finished")
# with nccl
dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}"))
```
Test Plan: Imported from OSS
Differential Revision: D42069829
Pulled By: H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997
Approved by: https://github.com/awgu, https://github.com/fduwjj
### About this PR
* Update the broadcast op to dispatch to CPU and CUDA implementations. Right now both perform the same logic, so this is essentially a no-op.
* Add a test to validate that a separate device implementation is not supported.
### About this stack
In the future we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor device type. The CPU and CUDA implementations will be updated so that the process group selects its CPU and CUDA backend, respectively.
Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83735
Approved by: https://github.com/kwen2501
### Changes
- Move ProcessGroup::Work into its own class and update all the references to it / header includes.
#### Motivation
In future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor device type. This change prevents a circular dependency, where ProcessGroup would depend on Backend and Backend would depend on ProcessGroup::Work.
Differential Revision: [D38839212](https://our.internmc.facebook.com/intern/diff/D38839212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83680
Approved by: https://github.com/kwen2501
land after #83122
This PR explores solutions for 2 issues:
1. Collective comm ops are in-place ops and do not return a tensor.
As a result, `make_fx` cannot include comm ops in the traced graph.
The current solution is to make comm ops return a tuple of
`(output_tensors, work_handle)`, so that
[`proxy_call`](90821aab10/torch/fx/experimental/proxy_tensor.py (L170-L172))
can handle that. It won't change the behavior of existing c10d
Python/C++ APIs, so I directly added the code to `Ops.cpp`.
2. `make_fx` does not recognize `ProcessGroup::Work` and will ignore
the `wait()` call on the work when tracing the graph. However, this
might break correctness, as the traced function could consume a
tensor before it is ready. The current solution is to create a
`CommTensor` tensor subclass that explicitly calls `wait()`. In this
PR, I am only doing this in the test, as we will need more discussion
to see if we can add this to the c10d Python implementations.
Kudos to @Chillee @wanchaol.
Edit: `print_tabular` breaks CI; removing it from the tests.
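For intuition, a simplified sketch of the wrapper-subclass idea (this is not the actual `CommTensor` used in the tests; the class and attribute names are made up):
```python
import torch
from torch.utils._pytree import tree_map

class WaitOnReadTensor(torch.Tensor):
    """Toy illustration: wait on the attached work handle before any op reads the data."""

    @staticmethod
    def __new__(cls, tensor, work):
        t = tensor.as_subclass(cls)
        t._work = work
        return t

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        def unwrap(x):
            if isinstance(x, WaitOnReadTensor):
                if x._work is not None:
                    x._work.wait()      # ensure the collective has finished producing x
                    x._work = None
                return x.as_subclass(torch.Tensor)
            return x

        args = tree_map(unwrap, args)
        kwargs = tree_map(unwrap, kwargs or {})
        return func(*args, **kwargs)
```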
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84221
Approved by: https://github.com/wanchaol
land after #83122
This PR explores solutions for 2 issues:
1. Collective comm ops are in-place ops and do not return a tensor.
As a result, `make_fx` cannot include comm ops in the traced graph.
The current solution is to make comm ops return a tuple of
`(output_tensors, work_handle)`, so that
[`proxy_call`](90821aab10/torch/fx/experimental/proxy_tensor.py (L170-L172))
can handle that. It won't change the behavior of existing c10d
Python/C++ APIs, so I directly added the code to `Ops.cpp`.
2. `make_fx` does not recognize `ProcessGroup::Work` and will ignore
the `wait()` call on the work when tracing the graph. However, this
might break correctness, as the traced function could consume a
tensor before it is ready. The current solution is to create a
`CommTensor` tensor subclass that explicitly calls `wait()`. In this
PR, I am only doing this in the test, as we will need more discussion
to see if we can add this to the c10d Python implementations.
Kudos to @Chillee @wanchaol.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84126
Approved by: https://github.com/wanchaol
Summary:
This patch makes send/recv custom ops such that they are
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
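For context, here is a toy illustration of the mechanism these patches rely on (the `mylib::noisy_add` op is made up; the real c10d ops are defined and registered in C++): an op defined through the dispatcher gets per-device kernels and shows up as a first-class node to dispatcher-based tracers.
```python
import torch
from torch.library import Library

lib = Library("mylib", "DEF")  # hypothetical namespace, for illustration only
lib.define("noisy_add(Tensor a, Tensor b) -> Tensor")

# per-device kernels registered through the dispatcher
lib.impl("noisy_add", lambda a, b: a + b, "CPU")
lib.impl("noisy_add", lambda a, b: a + b, "CUDA")

x = torch.ones(2)
print(torch.ops.mylib.noisy_add(x, x))  # the dispatcher selects the CPU kernel here
```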
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_send_recv
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79779
Approved by: https://github.com/mrshenli, https://github.com/wanchaol
Summary:
This patch makes barrier a custom op such that it is
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_nccl_barrier
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79777
Approved by: https://github.com/mrshenli, https://github.com/wanchaol
Summary:
This patch makes alltoall a custom op such that it is
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
Test Plan:
BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_cuda
BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_cuda_complex
BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_full_group_cuda
and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79691
Approved by: https://github.com/mrshenli, https://github.com/wanchaol
Summary:
This patch makes scatter a custom op such that it is
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_scatter_ops
python test/distributed/test_c10d_gloo.py -k test_scatter_basics
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79688
Approved by: https://github.com/mrshenli
Summary:
This patch makes gather a custom op such that it is
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_gather_ops
python test/distributed/test_c10d_gloo.py -k test_gather_basics
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79687
Approved by: https://github.com/mrshenli
Summary:
This patch makes reduce a custom op such that it is
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_reduce_ops
python test/distributed/test_c10d_gloo.py -k test_reduce_basics
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79686
Approved by: https://github.com/mrshenli
Summary:
This patch makes reduce_scatter a custom op such that it is
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_reduce_scatter_ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79683
Approved by: https://github.com/mrshenli
Summary:
This patch makes allgather a custom op such that it is
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_allgather_ops
python test/distributed/test_c10d_gloo.py -k test_allgather_basics
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79669
Approved by: https://github.com/mrshenli
Summary:
This patch makes allreduce a custom op such that it is
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_allreduce_ops
python test/distributed/test_c10d_gloo.py -k test_allreduce_basics
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79582
Approved by: https://github.com/wanchaol
Summary:
This patch makes broadcast a custom op such that it is
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_broadcast_ops
python test/distributed/test_c10d_gloo.py -k test_broadcast_basics
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76722
Approved by: https://github.com/pritamdamania87