pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00


Howard Huang 7a0f29b776 Allow Process Group to support multiple backends (#88330) (#90997)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330

### Implementation

Move backend-specific (NCCL, Gloo, etc.) collective implementations to the corresponding `Backend` class. Update ProcessGroup to support multiple backends and use the dispatcher to call backends based on tensor device type.

### Changes

#### c++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`)

- Update pybind definitions for the new process group base class and the new backend class
- Update the pybinded backend class with collective definitions to keep BC with Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`), which are used in tests
- Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, `ProcessGroupUCC` to derive from the `Backend` class
- Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by querying the backend using the device type
- Update the internal dispatched implementation of `barrier` to use a tensor, which allows the operation to be dispatched
- Update the `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching it correctly. I still don't understand why and had originally filed an issue in 85122.

#### python changes (`distributed_c10d.py`, test files)

- Add a `BackendConfig` class to specify the default configurations of backends, and a `get_backend_config()` API
- Add a deprecation warning to `get_backend()`
- `init_process_group` now returns a generic `ProcessGroup` object that contains a list of backends (the ones stated above) to which it dispatches operations
- `new_group` updated to return the same as above
- Update `test_c10d_gloo.py`: update `DistributedDataParallelTest` to use `init_process_group`, update `ReducerTest`, update `test_broadcast_coalesced_gloo` to move from PG instance and gloo options
- Update `test_c10d_nccl.py`: update `DistributedDataParallelTest` to use `init_process_group`
- Specific tests updated: `test_Backend_enum_class`

### Changes missing

- lazy initialization of backends
- support parsing of BackendConfig

### open questions

- Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338)

# Example

This is a basic script (using 2 backends within a process group):

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py
import os

import torch
import torch.distributed as dist

if __name__ == "__main__":
    # RANK is set by torch.distributed.run for each launched process
    rank = os.environ.get("RANK")
    # initialize with both gloo and nccl
    dist.init_process_group()
    # with gloo (CPU tensor)
    dist.all_reduce(torch.tensor([1.0]))
    print(f"Rank {rank} finished")
    # with nccl (CUDA tensor)
    dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}"))
```

Test Plan: Imported from OSS
Differential Revision: D42069829
Pulled By: H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997
Approved by: https://github.com/awgu, https://github.com/fduwjj

2022-12-16 23:15:00 +00:00
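The device-type dispatch described in the commit message can be pictured with a small pure-Python sketch. This is only an illustration of the idea, not the c10d implementation: `ToyBackend`, `ToyProcessGroup`, `register_backend`, and the string device keys are all hypothetical names invented for this example, and the "collective" is a local sum rather than real communication.

```python
from typing import Dict, List


class ToyBackend:
    """Hypothetical stand-in for a backend such as Gloo or NCCL."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.calls: List[str] = []  # records which collectives were routed here

    def all_reduce(self, values: List[float]) -> float:
        # A real backend would communicate across ranks; this just sums locally.
        self.calls.append("all_reduce")
        return sum(values)


class ToyProcessGroup:
    """Hypothetical process group that owns one backend per device type."""

    def __init__(self) -> None:
        self._backends: Dict[str, ToyBackend] = {}

    def register_backend(self, device_type: str, backend: ToyBackend) -> None:
        self._backends[device_type] = backend

    def all_reduce(self, values: List[float], device_type: str) -> float:
        # Pick the backend from the device type of the input, then delegate.
        try:
            backend = self._backends[device_type]
        except KeyError:
            raise RuntimeError(f"no backend registered for {device_type!r}") from None
        return backend.all_reduce(values)


if __name__ == "__main__":
    pg = ToyProcessGroup()
    pg.register_backend("cpu", ToyBackend("gloo"))
    pg.register_backend("cuda", ToyBackend("nccl"))
    print(pg.all_reduce([1.0, 2.0], device_type="cpu"))
```

The point of the design is that callers issue one `all_reduce` on the group and never name a backend; the group resolves it from where the data lives.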
..
api	Changing the use from ASSERT_EQ to ASSERT_FLOAT_EQ on nn_utils test. (#83693)	2022-11-15 04:10:52 +00:00
c10d	Allow Process Group to support multiple backends (#88330) (#90997)	2022-12-16 23:15:00 +00:00
common
dist_autograd	[lint] autoformat test/cpp and torch/csrc	2022-06-11 21:11:16 +00:00
jit	Clean up dependancy for flatbuffer_loader (#86041)	2022-12-08 03:48:04 +00:00
lazy	[LTC] Make ComputePostOrder accept const T pointers (#88773)	2022-11-10 18:34:19 +00:00
lite_interpreter_runtime	reland "support running test_mobile_profiler with buck1/buck2 and OSS (#89001)" (#89091)	2022-11-17 21:04:23 +00:00
monitor
profiler	Nested profiling support for Linux-perf Profiler (#87904)	2022-11-02 14:51:53 +00:00
rpc	Refactor distribuetd to use absolute header path (#85780)	2022-09-30 05:13:50 +00:00
tensorexpr	Fix the performance issue that the for-loop before ExternallCall could not be parallelized. (#85056)	2022-10-07 07:36:28 +00:00
__init__.py