mirror of
https://github.com/zebrajr/pytorch.git
synced 2025-12-06 12:20:52 +01:00
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330

### Implementation

Move backend-specific (NCCL, Gloo, etc.) collective implementations to the corresponding `Backend` class. Update `ProcessGroup` to support multiple backends and use the dispatcher to route calls to backends based on tensor device type.

### Changes

#### C++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`)

- Update pybind definitions for the new process group base class and the new backend class.
- Update the pybind backend class with collective definitions to keep backward compatibility with Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`), which are used in tests.
- Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, and `ProcessGroupUCC` to derive from the `Backend` class.
- Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by querying the backend using the device type.
- Update the internal dispatched implementation of `barrier` to use a tensor, which allows the operation to be dispatched.
- Update the `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching correctly. I still don't understand why and had originally filed an issue in #85122.

#### Python changes (`distributed_c10d.py`, test files)

- Add a `BackendConfig` class to specify the default configurations of backends, and a `get_backend_config()` API.
- Add a deprecation warning to `get_backend()`.
- `init_process_group` now returns a generic `ProcessGroup` object; it contains a list of backends (the ones stated above) to which it will dispatch operations.
- `new_group` updated to return the same as above.
- Update `test_c10d_gloo.py`: update `DistributedDataParallelTest` to use `init_process_group`, update `ReducerTest`, and update `test_broadcast_coalesced_gloo` to move from a PG instance to gloo options.
- Update `test_c10d_nccl.py`: update `DistributedDataParallelTest` to use `init_process_group`.
- Specific tests updated: `test_Backend_enum_class`.

### Changes missing

- lazy initialization of backends
- support for parsing of `BackendConfig`

### Open questions

- Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338)

# Example

This is a basic script (using 2 backends within a process group):

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py
import torch.distributed as dist
import torch
import os

if __name__ == "__main__":
    rank = os.environ.get("RANK")
    # initialize with both gloo and nccl
    dist.init_process_group()
    # with gloo
    dist.all_reduce(torch.tensor([1.0]))
    print(f"Rank {rank} finished")
    # with nccl
    dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}"))
```

Test Plan: Imported from OSS

Differential Revision: D42069829

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997

Approved by: https://github.com/awgu, https://github.com/fduwjj
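As a supplementary illustration, the per-device-type dispatch described in the C++ changes can be sketched in plain Python. The class and method names below are toy placeholders, not the actual C++ implementation in `Ops.cpp`/`OpsImpl.cpp`:

```python
# Toy sketch of device-type dispatch: a ProcessGroup holds one backend
# per device type and routes each collective to the matching backend.
# ToyBackend / ToyProcessGroup are illustrative names only.

class ToyBackend:
    def __init__(self, name):
        self.name = name

    def allreduce(self, device_type):
        return f"{self.name} allreduce on {device_type}"


class ToyProcessGroup:
    """Holds one backend per device type and dispatches on it."""

    def __init__(self):
        self._backends = {}

    def register_backend(self, device_type, backend):
        self._backends[device_type] = backend

    def allreduce(self, device_type):
        # Dispatch to the backend registered for this tensor's device type.
        return self._backends[device_type].allreduce(device_type)


pg = ToyProcessGroup()
pg.register_backend("cpu", ToyBackend("gloo"))
pg.register_backend("cuda", ToyBackend("nccl"))
print(pg.allreduce("cpu"))   # gloo handles CPU tensors
print(pg.allreduce("cuda"))  # nccl handles CUDA tensors
```

In the real change this lookup happens in C++ via the dispatcher, keyed on the device type of the tensor argument, which is why `barrier` was changed to take a tensor.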
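The PR lists parsing of `BackendConfig` as still missing. One plausible configuration format, assumed here purely for illustration, is comma-separated `device:backend` pairs; a parser for that assumed format might look like:

```python
# Hypothetical parser for a backend configuration string such as
# "cpu:gloo,cuda:nccl". Both the string format and the helper name
# are assumptions for illustration, not code from this PR.

def parse_backend_config(config_str):
    """Parse 'device:backend' pairs into a dict, e.g. {'cpu': 'gloo'}."""
    backends = {}
    for entry in config_str.split(","):
        device, _, backend = entry.strip().partition(":")
        if not backend:
            raise ValueError(f"expected 'device:backend', got {entry!r}")
        backends[device] = backend
    return backends

print(parse_backend_config("cpu:gloo,cuda:nccl"))
# {'cpu': 'gloo', 'cuda': 'nccl'}
```

A mapping like this would let the generic `ProcessGroup` know which backend to construct for each device type before any collective is issued.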