This PR adds support for the following use cases:
- Sync style:
```python
with dist._coalescing_manager():
for i in range(num_coll):
dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])
```
- Async style:
```python
with dist._coalescing_manager(async_ops=True) as cm:
for i in range(num_coll):
dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])
# do a bunch of other things
cm.wait()
# do things that depend on the all-gather's
```
Each `all_gather_into_tensor` is independent in terms of its data and buffer location, but supported backends (such as NCCL) can execute them in parallel.
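For reference, here is a fuller end-to-end sketch of the async style (a minimal sketch only: it assumes two ranks launched via `torchrun` with one CUDA device per rank, and the tensor shapes and `num_coll` value are illustrative):
```python
# Hedged sketch: RANK/WORLD_SIZE are assumed to be set by torchrun and NCCL to be available.
import os
import torch
import torch.distributed as dist

if __name__ == "__main__":
    rank = int(os.environ["RANK"])
    torch.cuda.set_device(rank)
    dist.init_process_group(backend="nccl")

    num_coll = 4
    world_size = dist.get_world_size()
    # each collective uses its own independent input and output buffers
    input_tensors = [torch.full((8,), float(rank), device="cuda") for _ in range(num_coll)]
    output_tensors = [torch.empty(8 * world_size, device="cuda") for _ in range(num_coll)]

    with dist._coalescing_manager(async_ops=True) as cm:
        for i in range(num_coll):
            dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])
    # overlap other work here, then block before consuming the gathered results
    cm.wait()

    dist.destroy_process_group()
```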
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101157
Approved by: https://github.com/kumpera, https://github.com/wanchaol
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330
### Implementation
Move backend-specific (NCCL, Gloo, etc.) collective implementations into the corresponding `Backend` class. Update `ProcessGroup` to support multiple backends and use the dispatcher to call into a backend based on the tensor's device type.
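Conceptually, the dispatching looks like the following Python-level sketch (illustration only, with made-up names; the real routing happens in C++ through the dispatcher in `Ops.cpp`):
```python
import torch

class ProcessGroupSketch:
    """Toy stand-in for the wrapper ProcessGroup; not the actual implementation."""

    def __init__(self, backends):
        # e.g. {"cpu": <Gloo backend>, "cuda": <NCCL backend>}
        self._backends = backends

    def _backend_for(self, tensor: torch.Tensor):
        # pick the backend registered for the tensor's device type
        return self._backends[tensor.device.type]

    def allreduce(self, tensor, *args, **kwargs):
        # a CPU tensor routes to Gloo, a CUDA tensor routes to NCCL
        return self._backend_for(tensor).allreduce(tensor, *args, **kwargs)
```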
### Changes
#### c++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`)
- Update pybind definitions for the new process group base class and the new backend class
- Update the pybind-ed backend class with collective definitions to keep backward compatibility with the Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`) used in tests
- Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, and `ProcessGroupUCC` to derive from the `Backend` class
- Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by looking up the backend using the device type
- Update the internal dispatched implementation of `barrier` to use a tensor, which allows the operation to be dispatched (see the sketch after this list)
- Update the `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching correctly; I still don't understand why and originally filed this as issue #85122
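A Python-level sketch of the barrier-as-dispatched-op idea from the bullet above (the helper name is made up; the actual change is in the C++ implementation):
```python
import torch
import torch.distributed as dist

def barrier_sketch(device: str = "cpu") -> None:
    # a dummy tensor carries the device type, so the call takes the same
    # device-keyed dispatch path as the other collectives
    t = torch.zeros(1, device=device)
    dist.all_reduce(t)
    if device.startswith("cuda"):
        torch.cuda.synchronize()  # make the barrier effective for in-flight CUDA work
```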
#### python changes (`distributed_c10d.py`, test files)
- Add a `BackendConfig` class to specify the default configuration of backends, and a `get_backend_config()` API
- Add a deprecation warning to `get_backend()`
- `init_process_group` now returns a generic `ProcessGroup` object; it contains a list of backends (the ones stated above) to which it will dispatch operations
- `new_group` is updated to return the same
- Update `test_c10d_gloo.py`: update `DistributedDataParallelTest` to use `init_process_group`, update `ReducerTest`, and update `test_broadcast_coalesced_gloo` to move away from the PG instance and Gloo options
- Update `test_c10d_nccl.py`: update `DistributedDataParallelTest` to use `init_process_group`
- Specific tests updated: `test_Backend_enum_class`
### Changes missing
- lazy initialization of backends
- support parsing of BackendConfig
### open questions
- Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338)
# Example
This is a basic script that uses two backends (Gloo and NCCL) within one process group:
```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py
import torch.distributed as dist
import torch
import os
if __name__ == "__main__":
rank = os.environ.get("RANK")
# initialize with both gloo and nccl
dist.init_process_group()
# with gloo
dist.all_reduce(torch.tensor([1.0]))
print(f"Rank {rank} finished")
# with nccl
dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}"))
```
Test Plan: Imported from OSS
Differential Revision: D42069829
Pulled By: H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997
Approved by: https://github.com/awgu, https://github.com/fduwjj
### About this PR
* Update the broadcast op to dispatch to CPU and CUDA implementations. Right now both perform the same logic, so this is essentially a no-op.
* Add a test to validate that a separate device implementation is not supported.
### About this stack
In the future we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor device type. The CPU and CUDA implementations will be updated so that the process group selects its CPU and CUDA backend, respectively.
Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83735
Approved by: https://github.com/kwen2501
### Changes
- Move ProcessGroup::Work into its own class and update all the references to it / header includes.
#### Motivation
In future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor device type. This change prevents a circular dependency, where ProcessGroup would depend on Backend and Backend would depend on ProcessGroup::Work.
Differential Revision: [D38839212](https://our.internmc.facebook.com/intern/diff/D38839212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83680
Approved by: https://github.com/kwen2501
land after #83122
This PR explores solutions for 2 issues:
1. Collective comm ops are in-place ops and do not return a tensor.
As a result, `make_fx` cannot include comm ops in the traced graph.
The current solution is to make comm ops return a tuple of
`(output_tensors, work_handle)`, so that
[`proxy_call`](90821aab10/torch/fx/experimental/proxy_tensor.py (L170-L172))
can handle that. It won't change the behavior of existing c10d
Python/C++ APIs, so I directly added the code to `Ops.cpp`.
2. `make_fx` does not recognize `ProcessGroup::Work` and will ignore
the `wait()` call on the work when tracing the graph. However, this
might break correctness, as the traced function could consume a
tensor before it is ready. The current solution is to create a
`CommTensor` tensor subclass that explicitly calls `wait()`. In this
PR, I am only doing this in the test, as we will need more discussion
to see if we can add this to the c10d Python implementations.
Kudos to @Chillee @wanchaol.
Edit: `print_tabular` breaks CI; removing it from the tests.
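For intuition, a simplified sketch of the wrapper-subclass idea (this is not the actual `CommTensor` used in the tests; the class and attribute names are made up):
```python
import torch
from torch.utils._pytree import tree_map

class WaitOnReadTensor(torch.Tensor):
    """Toy illustration: wait on the attached work handle before any op reads the data."""

    @staticmethod
    def __new__(cls, tensor, work):
        t = tensor.as_subclass(cls)
        t._work = work
        return t

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        def unwrap(x):
            if isinstance(x, WaitOnReadTensor):
                if x._work is not None:
                    x._work.wait()      # ensure the collective has finished producing x
                    x._work = None
                return x.as_subclass(torch.Tensor)
            return x

        args = tree_map(unwrap, args)
        kwargs = tree_map(unwrap, kwargs or {})
        return func(*args, **kwargs)
```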
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84221
Approved by: https://github.com/wanchaol
land after #83122
This PR explores solutions for 2 issues:
1. Collective comm ops are in-place ops and do not return a tensor.
As a result, `make_fx` cannot include comm ops in the traced graph.
The current solution is to make comm ops return a tuple of
`(output_tensors, work_handle)`, so that
[`proxy_call`](90821aab10/torch/fx/experimental/proxy_tensor.py (L170-L172))
can handle that. It won't change the behavior of existing c10d
Python/C++ APIs, so I directly added the code to `Ops.cpp`.
2. `make_fx` does not recognize `ProcessGroup::Work` and will ignore
the `wait()` call on the work when tracing the graph. However, this
might break correctness, as the traced function could consume a
tensor before it is ready. The current solution is to create a
`CommTensor` tensor subclass that explicitly calls `wait()`. In this
PR, I am only doing this in the test, as we will need more discussion
to see if we can add this to the c10d Python implementations.
Kudos to @Chillee @wanchaol.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84126
Approved by: https://github.com/wanchaol
Summary:
This patch makes send/recv custom ops such that they are
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
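For context, here is a toy illustration of the mechanism these patches rely on (the `mylib::noisy_add` op is made up; the real c10d ops are defined and registered in C++): an op defined through the dispatcher gets per-device kernels and shows up as a first-class node to dispatcher-based tracers.
```python
import torch
from torch.library import Library

lib = Library("mylib", "DEF")  # hypothetical namespace, for illustration only
lib.define("noisy_add(Tensor a, Tensor b) -> Tensor")

# per-device kernels registered through the dispatcher
lib.impl("noisy_add", lambda a, b: a + b, "CPU")
lib.impl("noisy_add", lambda a, b: a + b, "CUDA")

x = torch.ones(2)
print(torch.ops.mylib.noisy_add(x, x))  # the dispatcher selects the CPU kernel here
```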
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_send_recv
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79779
Approved by: https://github.com/mrshenli, https://github.com/wanchaol
Summary:
This patch makes barrier a custom op such that it is
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_nccl_barrier
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79777
Approved by: https://github.com/mrshenli, https://github.com/wanchaol
Summary:
This patch makes alltoall a custom op such that it is
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
Test Plan:
BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_cuda
BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_cuda_complex
BACKEND=nccl WORLD_SIZE=2 python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_all_to_all_full_group_cuda
and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79691
Approved by: https://github.com/mrshenli, https://github.com/wanchaol
Summary:
This patch makes scatter a custom op such that it is
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_scatter_ops
python test/distributed/test_c10d_gloo.py -k test_scatter_basics
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79688
Approved by: https://github.com/mrshenli
Summary:
This patch makes gather a custom op such that it is
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_gather_ops
python test/distributed/test_c10d_gloo.py -k test_gather_basics
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79687
Approved by: https://github.com/mrshenli
Summary:
This patch makes reduce a custom op such that it is
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_reduce_ops
python test/distributed/test_c10d_gloo.py -k test_reduce_basics
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79686
Approved by: https://github.com/mrshenli
Summary:
This patch makes reduce_scatter a custom op such that it is
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_reduce_scatter_ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79683
Approved by: https://github.com/mrshenli
Summary:
This patch makes allgather a custom op such that it is
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_allgather_ops
python test/distributed/test_c10d_gloo.py -k test_allgather_basics
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79669
Approved by: https://github.com/mrshenli
Summary:
This patch makes allreduce a custom op such that it is
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_allreduce_ops
python test/distributed/test_c10d_gloo.py -k test_allreduce_basics
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79582
Approved by: https://github.com/wanchaol
Summary:
This patch makes broadcast a custom op such that it is
dispatcher-passable. It is one part of the effort to route comm ops
through the dispatcher so that tracing mechanisms that rely on the
dispatcher, e.g. LazyTensor and AOTAutograd, can trace them.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_broadcast_ops
python test/distributed/test_c10d_gloo.py -k test_broadcast_basics
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76722
Approved by: https://github.com/pritamdamania87