Commit Graph

42 Commits

Author SHA1 Message Date
cyy
f2900420da fix missing-prototypes warnings in torch_cpu (Part 6) (#101845)
This PR fixes more missing-prototypes violations in the torch_cpu source, following PRs #100053, #100147, #100245, #100849, and #101788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101845
Approved by: https://github.com/albanD
2023-06-15 16:48:28 +00:00
Ke Wen
4dbab17edb [c10d] Use macro to deduplicate codes (#101243)
Ops.cpp duplicates code for each of the three device keys (CPU, CUDA, PrivateUse1).
Use a macro to deduplicate it.
No logic change.
Cc @kumpera @H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101243
Approved by: https://github.com/H-Huang
2023-05-12 22:12:28 +00:00
Ke Wen
daed3bf8f9 Implement coalesced all_gather_into_tensor (#101157)
This PR adds support for the following use cases:
- Sync style:
```python
with dist._coalescing_manager():
    for i in range(num_coll):
        dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])
```
- Async style:
```python
with dist._coalescing_manager(async_ops=True) as cm:
    for i in range(num_coll):
        dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])

# do a bunch of other things
cm.wait()
# do things that depend on the all-gathers
```
Each `all_gather_into_tensor` is independent in terms of data and buffer location, but they can be executed in parallel by supported backends (such as NCCL).
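
For concreteness, here is the async pattern above as a self-contained sketch; the launch command, tensor shapes, and world size are illustrative assumptions, while the `_coalescing_manager` usage follows the snippets in this description.
```python
# torchrun --nproc_per_node=2 coalesced_all_gather.py  (assumed launch command)
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
torch.cuda.set_device(rank)

num_coll = 4
input_tensors = [torch.full((8,), float(rank), device="cuda") for _ in range(num_coll)]
output_tensors = [torch.empty(8 * world_size, device="cuda") for _ in range(num_coll)]

# Issue the all-gathers inside the coalescing manager so the backend
# (e.g. NCCL) can execute them in parallel / as one grouped launch.
with dist._coalescing_manager(async_ops=True) as cm:
    for i in range(num_coll):
        dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])

# ... overlap other work here ...
cm.wait()  # all coalesced all-gathers are complete after this point
dist.destroy_process_group()
```
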
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101157
Approved by: https://github.com/kumpera, https://github.com/wanchaol
2023-05-11 20:58:47 +00:00
shaoyf42
cc5f64957b Add PrivateUse1 for dispatching PyTorch Distributed Collectives. (#98137)
Add PrivateUse1 for dispatching PyTorch Distributed Collectives to support custom devices. This PR fixes https://github.com/pytorch/pytorch/issues/97938#issue-1646833919
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98137
Approved by: https://github.com/kumpera
2023-04-06 01:41:43 +00:00
Howard Huang
aa6f0ace2f Remove API declarations in Ops.hpp (#94532)
In #91257, we removed direct calls to methods in ops.cpp, so this PR also removes ops.hpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94532
Approved by: https://github.com/kwen2501
2023-02-11 18:13:09 +00:00
Howard Huang
b2ea1d06aa Collective dispatching from Process Group (#91257)
Fixes https://github.com/pytorch/pytorch/issues/90932
Fixes https://github.com/pytorch/pytorch/issues/90659

Remove redundant collective operation definitions by calling the ops directly from `ProcessGroup`

Context:
https://github.com/pytorch/pytorch/issues/86225

Differential Revision: [D42854676](https://our.internmc.facebook.com/intern/diff/D42854676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91257
Approved by: https://github.com/kwen2501
2023-02-09 18:31:28 +00:00
Howard Huang
7a0f29b776 Allow Process Group to support multiple backends (#88330) (#90997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330

### Implementation
Move backend-specific (NCCL, Gloo, etc.) collective implementations to the corresponding `Backend` class. Update ProcessGroup to support multiple backends and use the dispatcher to call backends based on tensor device type.

### Changes

#### c++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`)
- Update pybind definitions for new process group base class and new backend class
- Update the pybind-exposed backend class with collective definitions to keep BC with Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`), which are used in tests
- Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, `ProcessGroupUCC` to derive from the `Backend` class.
- Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by querying the backend using the device type
- Update the internal dispatched implementation of `barrier` to use a tensor, which allows the operation to be dispatched.
- Update the `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching it correctly. I still don't understand why and had originally filed an issue (#85122).

#### python changes (`distributed_c10d.py`, test files)
- Add a BackendConfig class to specify the default configurations of backends and a `get_backend_config()` API (see the snippet after this list)
- `get_backend()` deprecation warning
- `init_process_group` now returns a generic `ProcessGroup` object; it contains a list of backends (the ones stated above) to which it will dispatch operations.
- `new_group` updated to return the same as above
- Update `test_c10d_gloo.py`: update `DistributedDataParallelTest` to use `init_process_group`, update `ReducerTest`, and update `test_broadcast_coalesced_gloo` to move away from the PG instance and gloo options
- Update `test_c10d_nccl.py`: update `DistributedDataParallelTest` to use `init_process_group`
- Specific tests updated: `test_Backend_enum_class`
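
A hedged snippet for the `get_backend_config()` item above; the exposure point on `torch.distributed` and the exact return format in the comment are assumptions rather than something confirmed by this description.
```python
import torch.distributed as dist

# Assumes a process group was initialized as in the example script further below.
if dist.is_initialized():
    # Expected to describe the per-device-type backend mapping,
    # e.g. something like "cpu:gloo,cuda:nccl" (format is an assumption).
    print(dist.get_backend_config())
```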

### Changes missing
- lazy initialization of backends
- support parsing of BackendConfig

### open questions
- Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338)

# Example

This is a basic script (using 2 backends within a process group)

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py
import torch.distributed as dist
import torch
import os

if __name__ == "__main__":
    rank = os.environ.get("RANK")
    # initialize with both gloo and nccl
    dist.init_process_group()
    # with gloo
    dist.all_reduce(torch.tensor([1.0]))
    print(f"Rank {rank} finished")
    # with nccl
    dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}"))
```

Test Plan: Imported from OSS

Differential Revision: D42069829

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997
Approved by: https://github.com/awgu, https://github.com/fduwjj
2022-12-16 23:15:00 +00:00
Wanchao Liang
f30694c700 Add allgather_into_tensor to CommTensor (#90565)
This PR adds _all_gather_base_ to CommTensor to support allgather_base
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90565
Approved by: https://github.com/mrshenli
2022-12-13 04:18:02 +00:00
Wanchao Liang
b782927ed4 Add reduce_scatter_tensor to CommTensor (#90564)
This PR adds reduce_scatter_base to the CommTensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90564
Approved by: https://github.com/mrshenli
2022-12-13 04:18:02 +00:00
Wanchao Liang
3ba9e4cd55 Add alltoall_ to CommTensor (#90512)
This PR adds alltoall_ to the CommTensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90512
Approved by: https://github.com/mrshenli
2022-12-13 04:18:02 +00:00
Yuxin Wu
c8ed84ad06 Fix a static initialization order fiasco in c10d (#90149)
The `TORCH_LIBRARY_IMPL` registrations in `OpsImpl.cpp` need to happen after `ProcessGroup` is registered as a torch class -- which happens in `Ops.cpp`. However, the order of the registrations is undefined between the two files.

If the registration in `OpsImpl.cpp` runs before the one in `Ops.cpp`, we get a crash at program launch similar to #83255. This happens in our internal build.

This PR moves `OpsImpl.cpp` to the end of `Ops.cpp`, because according to the omniscient lord of ChatGPT:
<img width="600" alt="2022-12-04_19-25" src="https://user-images.githubusercontent.com/1381301/205542847-3535b319-3c2a-4e8e-bc11-27913f6afb39.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90149
Approved by: https://github.com/kwen2501, https://github.com/H-Huang, https://github.com/soumith
2022-12-12 08:21:54 +00:00
Howard Huang
80150788bc [21/N] Add alltoall_base custom op with CPU/CUDA implementations (#89813)
Differential Revision: [D41812670](https://our.internmc.facebook.com/intern/diff/D41812670)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89813
Approved by: https://github.com/kwen2501
2022-12-08 23:39:26 +00:00
Howard Huang
e65ee3975f [20/N] Add recv_any_source custom op with CPU/CUDA implementations (#89505)
Differential Revision: [D41812671](https://our.internmc.facebook.com/intern/diff/D41812671)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89505
Approved by: https://github.com/kwen2501
2022-12-08 23:39:26 +00:00
Howard Huang
5797f74924 [19/N] Add monitored_barrier custom op with CPU implementation (#89318)
Differential Revision: [D41415324](https://our.internmc.facebook.com/intern/diff/D41415324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89318
Approved by: https://github.com/kwen2501
2022-11-22 14:18:40 +00:00
Howard Huang
be22b5d39f [18/N] Add allgather_coalesced custom op with CPU/CUDA implementations (#89317)
Differential Revision: [D41415321](https://our.internmc.facebook.com/intern/diff/D41415321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89317
Approved by: https://github.com/kwen2501
2022-11-22 14:14:17 +00:00
Howard Huang
58a74f34f9 [17/N] Add _reduce_scatter_base custom op with CPU/CUDA implementation (#88903)
Differential Revision: [D41415325](https://our.internmc.facebook.com/intern/diff/D41415325)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88903
Approved by: https://github.com/kwen2501
2022-11-22 00:42:11 +00:00
Howard Huang
df1df9d10a [16/N] Add _allgather_base custom op with CPU/CUDA implementation (#88889)
Differential Revision: [D41227739](https://our.internmc.facebook.com/intern/diff/D41227739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88889
Approved by: https://github.com/kwen2501
2022-11-12 22:31:07 +00:00
Howard Huang
6e5f736d86 [15/N] Add allreduce_coalesced custom op with CPU/CUDA implementations (#88846)
Differential Revision: [D41227740](https://our.internmc.facebook.com/intern/diff/D41227740)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88846
Approved by: https://github.com/kwen2501
2022-11-12 14:23:45 +00:00
Howard Huang
8a1fc5d2f8 [7/N] [Dispatchable Collectives] Update reduce with CPU / CUDA implementations (#83916)
### Changes
- Updates for the reduce collective

### Context
https://github.com/pytorch/pytorch/issues/86225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83916
Approved by: https://github.com/kwen2501
2022-10-10 15:58:37 +00:00
Howard Huang
ccac8d13d5 [3/N] [Dispatchable Collectives] Update broadcast_ with CPU and CUDA implementations (#83735)
### About this PR
* Update the broadcast op to dispatch to CPU and CUDA implementations. Right now they both perform the same logic, so this is essentially a no-op.
* Add a test to validate that a separate device implementation is not supported.

### About this stack
In the future we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have the process group select its CPU and CUDA backends, respectively.
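
As an illustration of the dispatching described above, a minimal sketch, assuming two ranks launched with torchrun, a build with both gloo and nccl, and that `init_process_group()` without arguments initializes both (as in the multi-backend example earlier in this log): the same `dist.broadcast` call is routed to the CPU or CUDA implementation based on the tensor's device.
```python
# torchrun --nproc_per_node=2 broadcast_dispatch.py  (assumed launch command)
import os
import torch
import torch.distributed as dist

dist.init_process_group()          # assumed defaults: cpu -> gloo, cuda -> nccl
rank = int(os.environ["RANK"])

cpu_t = torch.full((4,), float(rank))
dist.broadcast(cpu_t, src=0)       # dispatched to the CPU implementation

cuda_t = torch.full((4,), float(rank), device=f"cuda:{rank}")
dist.broadcast(cuda_t, src=0)      # dispatched to the CUDA implementation

print(f"rank {rank}: cpu={cpu_t.tolist()} cuda={cuda_t.tolist()}")
dist.destroy_process_group()
```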

Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83735
Approved by: https://github.com/kwen2501
2022-09-28 03:24:06 +00:00
Howard Huang
74ead61944 [2/N] [Dispatchable Collectives] Extract ProcessGroup::Work into a separate class and update references (#83680)
### Changes
- Move ProcessGroup::Work into its own class and update all the references to it / header includes.

#### Motivation
In future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. This change prevents a circular dependency, with ProcessGroup depending on Backend and Backend depending on ProcessGroup::Work.
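
For context on how `ProcessGroup::Work` surfaces in the Python API, a minimal sketch assuming an already-initialized process group: asynchronous collectives return a `Work` handle that the caller waits on.
```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called.
t = torch.ones(4)
work = dist.all_reduce(t, async_op=True)  # returns a Work handle
# ... overlap other computation here ...
work.wait()  # block until the collective has completed and t is safe to read
```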

Differential Revision: [D38839212](https://our.internmc.facebook.com/intern/diff/D38839212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83680
Approved by: https://github.com/kwen2501
2022-09-14 13:05:58 +00:00
Shen Li
89c4654ba9 Add scatter_ to CommTensor (#84606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84606
Approved by: https://github.com/wanchaol
2022-09-07 14:00:20 +00:00
Shen Li
f43c38bdc8 Add broadcast_ to CommTensor (#84604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84604
Approved by: https://github.com/wanchaol
2022-09-07 14:00:20 +00:00
Shen Li
a24d7a8565 Add reduce_scatter_ to CommTensor (#84592)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84592
Approved by: https://github.com/wanchaol
2022-09-07 14:00:18 +00:00
Shen Li
e4519548a5 Support nested lists in CommTensor and enable tracing allgather_ (#84585)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84585
Approved by: https://github.com/wanchaol
2022-09-07 14:00:16 +00:00
Masaki Kozuki
ab6c57217a Add NCCL PreMul Sum to c10d reduce ops (#84243)
This is based on #81272, but this version conforms to the TorchScript compiler.

## TODO
- [ ] Update abaf8112e6/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp (L64-L73) to use `ReduceOp::RedOpType`. In my first try with `USE_SYSTEM_UCC=1`, this change wasn't necessary (I think) because of the `ReduceOp::RedOpType` operator. That said, I want to make it more explicit.

cc @ptrblck @kwen2501 @aazzolini
cc @zasdfgbnm for visibility to the TODO above
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84243
Approved by: https://github.com/kwen2501
2022-09-02 21:57:45 +00:00
Shen Li
34e5b0997e [reland] Make allreduce compatible with make_fx (#84221)
land after #83122

This PR explores solutions for 2 issues:

1. Collective comm ops are in-place ops and do not return a tensor.
   As a result, `make_fx` cannot include comm ops in the traced graph.
   The current solution is to make comm ops return a tuple of
   `(output_tensors, work_handle)`, so that
   [`proxy_call`](90821aab10/torch/fx/experimental/proxy_tensor.py (L170-L172))
   can handle that. It won't change the behavior of existing c10d
   Python/C++ APIs, so I directly added the code to `Ops.cpp`.
2. `make_fx` does not recognize `ProcessGroup::Work` and will ignore
   the `wait()` call on the work when tracing the graph. However, this
   might break correctness, as when running the traced function, it
   could consume a tensor before it is ready. The current solution
   is to create a `CommTensor` tensor subclass to explicitly call
   `wait()` (see the sketch below). In this PR, I am only doing this in
   the test, as we will need more discussion to see if we can add this
   to the c10d Python implementations. Kudos to @Chillee @wanchaol

Edit: `print_tabular` breaks CI; removing that from the tests.
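
A minimal sketch of the idea in point 2 above, not the real `CommTensor`: a `__torch_dispatch__` wrapper that holds a tensor together with its `Work` handle and calls `wait()` the first time any op consumes it. The class name and structure are illustrative assumptions.
```python
import torch
from torch.utils._pytree import tree_map

class CommTensorSketch(torch.Tensor):
    # Wraps a tensor plus the Work handle returned by an async collective.
    @staticmethod
    def __new__(cls, elem, work=None):
        t = torch.Tensor._make_wrapper_subclass(
            cls, elem.size(), strides=elem.stride(), dtype=elem.dtype,
            device=elem.device, requires_grad=elem.requires_grad)
        t._elem = elem
        t._work = work
        return t

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        def unwrap(x):
            if isinstance(x, CommTensorSketch):
                if x._work is not None:
                    x._work.wait()   # ensure the collective has finished
                    x._work = None
                return x._elem
            return x
        args = tree_map(unwrap, args)
        kwargs = tree_map(unwrap, kwargs or {})
        return func(*args, **kwargs)
```
Wrapping the output of an async collective in such a subclass makes the `wait()` happen implicitly when the traced function later uses the tensor.
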
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84221
Approved by: https://github.com/wanchaol
2022-08-30 02:13:00 +00:00
PyTorch MergeBot
b8fe0edcf5 Revert "Make allreduce compatible with fx ProxyTensor (#84126)"
This reverts commit ec5b83f768.

Reverted https://github.com/pytorch/pytorch/pull/84126 on behalf of https://github.com/malfet due to Likely broke multigpu periodic jobs, see https://github.com/pytorch/pytorch/runs/8044611438?check_suite_focus=true
2022-08-27 14:14:58 +00:00
Shen Li
ec5b83f768 Make allreduce compatible with fx ProxyTensor (#84126)
land after #83122

This PR explores solutions for 2 issues:

1. Collective comm ops are in-place ops and do not return a tensor.
   As a result, `make_fx` cannot include comm ops in the traced graph.
   The current solution is to make comm ops return a tuple of
   `(output_tensors, work_handle)`, so that
   [`proxy_call`](90821aab10/torch/fx/experimental/proxy_tensor.py (L170-L172))
   can handle that. It won't change the behavior of existing c10d
   Python/C++ APIs, so I directly added the code to `Ops.cpp`.
2. `make_fx` does not recognize `ProcessGroup::Work` and will ignore
   the `wait()` call on the work when tracing the graph. However, this
   might break correctness, as when running the traced function, it
   could consume a tensor before it is ready. The current solution
   is to create a `CommTensor` tensor subclass to explicitly call
   `wait()`. In this PR, I am only doing this in the test, as we
   will need more discussion to see if we can add this to the c10d Python
   implementations. Kudos to @Chillee @wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84126
Approved by: https://github.com/wanchaol
2022-08-26 19:10:04 +00:00
Shen Li
527a160169 Expose ProcessGroup::Work.wait() API to TorchScript (#83303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83303
Approved by: https://github.com/rohan-varma
2022-08-26 19:10:04 +00:00
PyTorch MergeBot
1f61c39ac4 Revert "Support NCCL Premul Sum (#81272)"
This reverts commit 432c508e71.

Reverted https://github.com/pytorch/pytorch/pull/81272 on behalf of https://github.com/weiwangmeta due to breaking internal builds
2022-08-25 05:01:37 +00:00
Masaki Kozuki
432c508e71 Support NCCL Premul Sum (#81272)
This PR adds support for https://docs.nvidia.com/deeplearning/nccl/archives/nccl_21212/user-guide/docs/api/ops.html?highlight=premul#c.ncclRedOpCreatePreMulSum (a usage sketch follows below).

The major changes include
- convert enum ReduceOp to struct
- add premul sum specific paths to init.cpp and Ops.cpp.
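
A hedged usage sketch from the Python side; the helper name `torch.distributed._make_nccl_premul_sum` and its signature are assumptions (the text above only describes the C++ changes), and an initialized NCCL process group is assumed.
```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl") has been called and this rank owns
# one CUDA device. The helper name and signature below are assumptions.
t = torch.ones(4, device="cuda")
premul_sum = dist._make_nccl_premul_sum(2.0)  # PreMulSum with scalar factor 2.0
dist.all_reduce(t, op=premul_sum)             # inputs are pre-multiplied, then summed
```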

note:
- For pip wheels / conda binaries to support this, ~~I think https://github.com/pytorch/pytorch/pull/79132 would be needed~~ https://github.com/pytorch/pytorch/pull/82775 landed

The commit titled "add nccl premul", whose current hash is cb99ad6744, was authored by @mcarilli and @ptrblck.

cc @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81272
Approved by: https://github.com/kwen2501
2022-08-24 04:53:25 +00:00
Jiewen Tan
79c2dfcd8e [c10d] Make send/recv as custom ops (#79779)
Summary:
This patch makes send/recv custom ops so that they are dispatcher-passable.
It is one part of the effort to route comm ops to the dispatcher so that
tracing mechanisms that rely on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.
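
To illustrate what "dispatcher passable" means here, a minimal sketch using `torch.library`; the `demo_c10d` namespace and op are hypothetical stand-ins rather than the real c10d op schema, and the body is an identity placeholder rather than an actual collective.
```python
import torch

# Hypothetical namespace and op, purely for illustration.
lib = torch.library.Library("demo_c10d", "DEF")
lib.define("my_send(Tensor t) -> Tensor")

def my_send_cpu(t):
    # Stand-in body; a real implementation would hand the tensor to a backend.
    return t.clone()

lib.impl("my_send", my_send_cpu, "CPU")

x = torch.ones(3)
# The call now goes through the dispatcher, so dispatcher-based tracing
# mechanisms (e.g. make_fx / __torch_dispatch__) can observe it.
y = torch.ops.demo_c10d.my_send(x)
```
Registering each collective this way is what lets dispatcher-based tracing mechanisms such as LazyTensor and AOTAutograd see the comm ops, as described above.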

Test Plan:
python test/distributed/test_c10d_nccl.py -k test_send_recv
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79779
Approved by: https://github.com/mrshenli, https://github.com/wanchaol
2022-06-27 22:07:17 +00:00
Jiewen Tan
238eaf2094 [c10d] Make barrier as a custom op (#79777)
Summary:
This patch makes barrier a custom op so that it is dispatcher-passable.
It is one part of the effort to route comm ops to the dispatcher so that
tracing mechanisms that rely on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.

Test Plan:
python test/distributed/test_c10d_nccl.py -k test_nccl_barrier
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79777
Approved by: https://github.com/mrshenli, https://github.com/wanchaol
2022-06-27 10:57:36 +00:00
Jiewen Tan
dcd17357a4 [c10d] Make alltoall as a custom op (#79691)
Summary:
This patch makes alltoall a custom op so that it is dispatcher-passable.
It is one part of the effort to route comm ops to the dispatcher so that
tracing mechanisms that rely on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.

Test Plan:
BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_cuda
BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_cuda_complex
BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_full_group_cuda
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79691
Approved by: https://github.com/mrshenli, https://github.com/wanchaol
2022-06-27 10:57:36 +00:00
Jiewen Tan
84c0a308a1 [c10d] Make scatter as a custom op (#79688)
Summary:
This patch makes scatter a custom op so that it is dispatcher-passable.
It is one part of the effort to route comm ops to the dispatcher so that
tracing mechanisms that rely on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.

Test Plan:
python test/distributed/test_c10d_nccl.py -k test_scatter_ops
python test/distributed/test_c10d_gloo.py -k test_scatter_basics
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79688
Approved by: https://github.com/mrshenli
2022-06-24 21:22:58 +00:00
Jiewen Tan
9d6e3f81d4 [c10d] Make gather as a custom op (#79687)
Summary:
This patch makes gather a custom op so that it is dispatcher-passable.
It is one part of the effort to route comm ops to the dispatcher so that
tracing mechanisms that rely on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.

Test Plan:
python test/distributed/test_c10d_nccl.py -k test_gather_ops
python test/distributed/test_c10d_gloo.py -k test_gather_basics
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79687
Approved by: https://github.com/mrshenli
2022-06-24 21:19:56 +00:00
Jiewen Tan
3367e632b2 [c10d] Make reduce as a custom op (#79686)
Summary:
This patch makes reduce a custom op so that it is dispatcher-passable.
It is one part of the effort to route comm ops to the dispatcher so that
tracing mechanisms that rely on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.

Test Plan:
python test/distributed/test_c10d_nccl.py -k test_reduce_ops
python test/distributed/test_c10d_gloo.py -k test_reduce_basics
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79686
Approved by: https://github.com/mrshenli
2022-06-24 21:16:18 +00:00
Jiewen Tan
80b50dfa3a [c10d] Make reduce_scatter as a custom op (#79683)
Summary:
This patch makes reduce_scatter a custom op so that it is dispatcher-passable.
It is one part of the effort to route comm ops to the dispatcher so that
tracing mechanisms that rely on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.

Test Plan:
python test/distributed/test_c10d_nccl.py -k test_reduce_scatter_ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79683
Approved by: https://github.com/mrshenli
2022-06-24 20:58:04 +00:00
Jiewen Tan
3359af9390 [c10d] Make allgather as a custom op (#79669)
Summary:
This patch makes allgather a custom op so that it is dispatcher-passable.
It is one part of the effort to route comm ops to the dispatcher so that
tracing mechanisms that rely on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.

Test Plan:
python test/distributed/test_c10d_nccl.py -k test_allgather_ops
python test/distributed/test_c10d_gloo.py -k test_allgather_basics
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79669
Approved by: https://github.com/mrshenli
2022-06-24 20:29:29 +00:00
Jiewen Tan
e5841bafbd [c10d] Make allreduce as a custom op
Summary:
This patch makes allreduce a custom op so that it is dispatcher-passable.
It is one part of the effort to route comm ops to the dispatcher so that
tracing mechanisms that rely on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.

Test Plan:
python test/distributed/test_c10d_nccl.py -k test_allreduce_ops
python test/distributed/test_c10d_gloo.py -k test_allreduce_basics
...and other existing distributed tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79582

Approved by: https://github.com/wanchaol
2022-06-23 08:34:29 +00:00
Jiewen Tan
e757cf40cc [c10d] Make broadcast as a custom op
Summary:
This patch makes broadcast a custom op so that it is dispatcher-passable.
It is one part of the effort to route comm ops to the dispatcher so that
tracing mechanisms that rely on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.

Test Plan:
python test/distributed/test_c10d_nccl.py -k test_broadcast_ops
python test/distributed/test_c10d_gloo.py -k test_broadcast_basics
...and other existing distributed tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76722

Approved by: https://github.com/pritamdamania87
2022-06-14 01:54:29 +00:00