Commit Graph

31 Commits

Author SHA1 Message Date
Dmitry Rogozhkin
5872a8c6b0 Use task submitter TLS in gloo working threads (#142184)
Fixes: #86830

CC: @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142184
Approved by: https://github.com/albanD
2024-12-06 17:03:17 +00:00
cyy
f9ae3fac8c [Distributed] [19/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138903)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138903
Approved by: https://github.com/ezyang
2024-10-28 05:29:25 +00:00
Gufan Yin
fba2c0a23a Fix comment in ProcessGroupGloo (#137746)
Summary: Algorithm caching was removed in 2018 in D13111781

Test Plan: CI

Differential Revision: D64214527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137746
Approved by: https://github.com/Skylion007, https://github.com/wz337
2024-10-12 01:04:41 +00:00
cyy
f4dcf2ae93 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-07-08 07:03:53 +00:00
PyTorch MergeBot
846bb30e13 Revert "[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)"
This reverts commit bd72e28314.

Reverted https://github.com/pytorch/pytorch/pull/128301 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails XLA build bd72e28314. Please rebase your PR before relanding because I think the failure is hidden by an unrelated broken trunk XLA failure from your current base commit ([comment](https://github.com/pytorch/pytorch/pull/128301#issuecomment-2169035822))
2024-06-15 01:58:20 +00:00
cyy
bd72e28314 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang
2024-06-14 23:21:01 +00:00
Richard Barnes
ed327876f5 [codemod] c10:optional -> std::optional (#126135)
Generated by running the following from PyTorch root:
```
find . -regex ".*\.\(cpp\|h\|cu\|hpp\|cc\|cxx\)$" | grep -v "build/" | xargs -n 50 -P 4 perl -pi -e 's/c10::optional/std::optional/'
```

`c10::optional` is just an alias for `std::optional`. This removes usages of that alias in preparation for eliminating it entirely.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126135
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/albanD, https://github.com/aaronenyeshi
2024-05-14 19:35:51 +00:00
cyy
1ac402a96c [Distributed] [6/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124701)
This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/124043.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124701
Approved by: https://github.com/ezyang
2024-04-25 11:39:23 +00:00
cyy
ea61c9cb29 [Distributed] [5/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124043)
This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/124032.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124043
Approved by: https://github.com/ezyang
2024-04-23 00:43:50 +00:00
cyy
c2596fd3e0 [Distributed] [4/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124032)
This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/123312.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124032
Approved by: https://github.com/Skylion007
2024-04-16 00:42:18 +00:00
Yifu Wang
372e9550bd ProcessGroupGloo::reduce_scatter_tensor_coalesced (#118911)
### Motivation
Despite our plan to reduce gloo usage, it is still widely used as a testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real-world scenarios. There are some coverage issues around all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)) and [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272))). For native funcol I ran into the same issues, but I'd rather just fix the coverage.

### This PR
We already have a fallback impl for `_reduce_scatter_base`, composed of all-reduce + scatter. The scatter was not necessary: it introduced extra communication and an extra sync point, and forced the impl to fail on `asyncOp=True`. This PR does the following:
- Simulate reduce-scatter with `allreduce(inp).chunk(world_size)[rank]` (see the sketch below). This is still 2x the communication of a real reduce-scatter (since all-reduce = reduce-scatter + all-gather), but it's strictly better than what we have now.
- By doing the above, the comm becomes async and we no longer have to fail on `asyncOp=True`.
- The general logic is implemented in `reduce_scatter_tensor_coalesced`; `_reduce_scatter_base` just calls it with a single input/output.
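
A minimal Python sketch of the fallback idea (illustrative only; the actual change is in the C++ `ProcessGroupGloo`, and the function name here is made up):

```python
import torch
import torch.distributed as dist

def reduce_scatter_tensor_via_allreduce(out, inp, group=None):
    # Emulate reduce-scatter on a backend that only has all-reduce (e.g. gloo):
    # all-reduce the full input, then keep only this rank's chunk.
    world_size = dist.get_world_size(group)
    rank = dist.get_rank(group)
    buf = inp.clone()                   # don't clobber the caller's input
    dist.all_reduce(buf, group=group)   # every rank now holds the full reduction
    out.copy_(buf.chunk(world_size)[rank])
```

Since all-reduce = reduce-scatter + all-gather, this costs roughly 2x the communication of a native reduce-scatter, which is the trade-off described above.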

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118911
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #118910
2024-02-03 02:42:47 +00:00
Yifu Wang
fd000340fd ProcessGroupGloo::allgather_into_tensor_coalesced (#118910)
### Motivation
Despite our plan to reduce gloo usage, it is still widely used as a testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real-world scenarios. There are some coverage issues around all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)) and [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272))). For native funcol I ran into the same issues, but I'd rather just fix the coverage.

**I think it's reasonable to think of this as a fix rather than adding new features. This is orthogonal to the potential reduction of gloo usage**.

### This PR

This PR adds `ProcessGroupGloo::allgather_into_tensor_coalesced`. This is very straightforward: `ProcessGroupGloo` already supports `allgather_coalesced`, into which we can funnel `allgather_into_tensor_coalesced` (see the sketch below).
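
A minimal Python sketch of the funneling idea, assuming an already-initialized gloo group (the real change lives in the C++ backend; the function name is illustrative, and 1-D tensors are assumed for simplicity):

```python
import torch
import torch.distributed as dist

def all_gather_into_tensor_via_list(out, inp, group=None):
    # The "into tensor" variant is just the list variant writing into
    # per-rank chunks (views) of one contiguous output buffer.
    world_size = dist.get_world_size(group)
    assert out.numel() == inp.numel() * world_size
    out_list = list(out.chunk(world_size))  # views into `out`
    dist.all_gather(out_list, inp, group=group)
```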

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118910
Approved by: https://github.com/shuqiangzhang
2024-02-02 17:53:28 +00:00
Rohan Varma
5ffa98f7ba [Dist] Add fallback reduce_scatter_base, allgather_base APIs to Gloo (#112144)
Per Ke's suggestion, adding these APIs directly in ProcessGroupGloo to
enable FSDP on CPUs.
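
With these fallbacks in place, the tensor-based collectives FSDP relies on can be exercised on CPU through the public API with a gloo group. A hedged sketch (shapes are illustrative, and the routing comments reflect a reading of the PR rather than guaranteed behavior):

```python
import torch
import torch.distributed as dist

# Assumes a torchrun-style environment init on CPU, e.g.:
#   dist.init_process_group("gloo")
world_size = dist.get_world_size()

inp = torch.ones(4 * world_size)
out = torch.empty(4)
dist.reduce_scatter_tensor(out, inp)        # served by gloo's fallback reduce_scatter_base

gathered = torch.empty(4 * world_size)
dist.all_gather_into_tensor(gathered, out)  # served by gloo's fallback allgather_base
```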

Differential Revision: [D50636382](https://our.internmc.facebook.com/intern/diff/D50636382/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112144
Approved by: https://github.com/wz337, https://github.com/fegin, https://github.com/wanchaol, https://github.com/XilunWu
2023-11-07 01:37:02 +00:00
Howard Huang
7a3c3d63bf fix gloo cuda sparse_allreduce dispatch (#111485)
Fixes #111422

allreduce_sparse_cuda gets dispatched to allreduce_sparse, which doesn't exist for gloo. However, gloo has an existing implementation, so this just fixes the dispatching to point to it.

The reason CI didn't catch this is that we were calling the backend directly. Added a test which calls the public API (dist.XYZ) and goes through the dispatcher.
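
A hedged sketch of that kind of test: exercise the public `dist.all_reduce` API (and hence the dispatcher) with a sparse CUDA tensor on a gloo group, rather than calling the backend object directly. Shapes and values are illustrative:

```python
import torch
import torch.distributed as dist

def sparse_allreduce_via_public_api():
    device = f"cuda:{dist.get_rank()}"
    indices = torch.tensor([[0, 2]], device=device)   # COO indices, shape (ndim, nnz)
    values = torch.tensor([1.0, 1.0], device=device)
    sparse = torch.sparse_coo_tensor(indices, values, size=(4,), device=device)
    # Before the fix this dispatched to a non-existent allreduce_sparse overload
    # for gloo; with the fix it reaches gloo's existing sparse implementation.
    dist.all_reduce(sparse)
```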

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111485
Approved by: https://github.com/fduwjj
2023-10-19 21:15:45 +00:00
Rodrigo Kumpera
317e39a8ad [C10d] Cleanup collective sequence number. (#109136)
Sequence numbers must be associated with a Work object
if we want to use them as a way to report collective progress.

The API surface change is introducing Work::getSequenceNumber, which
should eventually be exposed to Python.

The bulk of this change is changing gloo to make the sequence number
always in use and weave it through the dozens of subclasses of Work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109136
Approved by: https://github.com/fduwjj
2023-09-26 17:17:04 +00:00
Rodrigo Kumpera
9a1b6d44bb [C10d] Add PG::enableCollectivesTiming to make it dynamically enabled. (#108814)
Collectives timing gates the tracking of when a collective starts on the device.

Currently it's enabled by setting the NCCL_ENABLE_TIMING env var.

The goal of this PR is to make it possible to enable that flag dynamically, so users of the PG hooks don't have to set the env var in order for their hooks to work.

The design is that, once set, all new collectives will have this behavior, so we track it on each Work object.

We make enableTiming_ atomic in PGNCCL to avoid races on non-TSO hardware.

To ensure consistency, we copy its value during Work construction and replace all previous usage of enableTiming_ from the PG with usages from the Work, which now has an immutable value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108814
Approved by: https://github.com/wconstab, https://github.com/fduwjj
ghstack dependencies: #108813
2023-09-20 19:47:41 +00:00
Rodrigo Kumpera
bde75eb9a8 [Gloo] Properly pass op type to Work (#108812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108812
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2023-09-12 18:21:09 +00:00
Shen Li
dd6319198d Apply clang-format to distributed/c10d folder (#107140)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107140
Approved by: https://github.com/H-Huang
2023-08-14 23:16:38 +00:00
Rodrigo Kumpera
5be1088ed6 [c10d] Bridge c10d and gloo stores. (#102641)
This relands #100633 with fixes for internal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102641
Approved by: https://github.com/rohan-varma, https://github.com/fduwjj
2023-06-02 00:07:18 +00:00
PyTorch MergeBot
13383f45c5 Revert "[c10d] Bridge c10d and gloo stores. (#100384)"
This reverts commit 74b2c04aa1.

Reverted https://github.com/pytorch/pytorch/pull/100384 on behalf of https://github.com/jeanschmidt due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/100384#issuecomment-1548279946))
2023-05-15 17:44:54 +00:00
Rodrigo Kumpera
74b2c04aa1 [c10d] Bridge c10d and gloo stores. (#100384)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100384
Approved by: https://github.com/fduwjj
2023-05-12 13:55:31 +00:00
sclarkson
9a18968253 Fix kDefaultTimeout multiple definition build failure (#97270)
Make the namespace explicit to avoid the constexpr conflict on GCC 11.

Fixes #90448

@ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97270
Approved by: https://github.com/ezyang
2023-03-21 21:44:53 +00:00
cyy
f172feae0d More tidy fixes (#93069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93069
Approved by: https://github.com/Skylion007
2023-01-27 06:40:50 +00:00
Howard Huang
7a0f29b776 Allow Process Group to support multiple backends (#88330) (#90997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330

### Implementation
Move backend-specific (NCCL, Gloo, etc) collective implementations to the corresponding `Backend` class. Update ProcessGroup to support multiple backends and use the dispatcher to call backends based on tensor device type.

### Changes

#### c++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`)
- Update pybind definitions for new process group base class and new backend class
- Update pybinded backend class with collective definitions to keep BC with Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`) which are used in tests
- Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, `ProcessGroupUCC` to derive from the `Backend` class.
- Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by querying the backend using the device type
- Update the internal dispatched implementation of `barrier` to use a tensor, which allows the operation to be dispatched.
- Update the `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching it correctly. I still don't understand why and had originally filed an issue in #85122.

#### python changes (`distributed_c10d.py`, test files)
- Add BackendConfig class to specify the default configurations of backends and `get_backend_config()` API
- `get_backend()` deprecation warning
- `init_process_group` now returns a generic `ProcessGroup` object; it contains a list of backends (the ones stated above) to which it will dispatch operations.
- `new_group` updated to return the same as above
- Update `test_c10d_gloo.py`, Update `DistributedDataParallelTest` to use `init_process_group`, Update `ReducerTest`, update `test_broadcast_coalesced_gloo` to move from PG instance and gloo options
- Update `test_c10d_nccl.py`, Update `DistributedDataParallelTest` to use `init_process_group`
- Specific tests updated: `test_Backend_enum_class`

### Changes missing
- lazy initialization of backends
- support parsing of BackendConfig

### open questions
- Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338)

# Example

This is a basic script (using 2 backends within a process group)

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py
import torch.distributed as dist
import torch
import os

if __name__ == "__main__":
    rank = os.environ.get("RANK")
    # initialize with both gloo and nccl
    dist.init_process_group()
    # with gloo
    dist.all_reduce(torch.tensor([1.0]))
    print(f"Rank {rank} finished")
    # with nccl
    dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}"))
```

Test Plan: Imported from OSS

Differential Revision: D42069829

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997
Approved by: https://github.com/awgu, https://github.com/fduwjj
2022-12-16 23:15:00 +00:00
Min Si
1ad0048b64 Refactor distribuetd to use absolute header path (#85780)
Headers under torch/csrc/distributed may be referenced with a relative path, e.g., "<c10d/...>". However, relative paths cannot be handled gracefully by the Meta internal build when the NCCL PG is hipified to support AMD/RCCL, because the "hipified" header files are generated in other directories. Moreover, using absolute paths for header inclusion is the established practice in most components of PyTorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.

See D39835774 for more details about Meta internal complication.

**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options. Use it to verify that we don't miss any relative-path uses of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera, https://github.com/huydhn
2022-09-30 05:13:50 +00:00
PyTorch MergeBot
a50d8864fc Revert "Refactor distribuetd to use absolute header path (#85780)"
This reverts commit 668082718a.

Reverted https://github.com/pytorch/pytorch/pull/85780 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks build due to a missing file <c10d/Store.hpp>
2022-09-30 02:04:29 +00:00
Min Si
668082718a Refactor distribuetd to use absolute header path (#85780)
Headers under torch/csrc/distributed may be referenced with a relative path, e.g., "<c10d/...>". However, relative paths cannot be handled gracefully by the Meta internal build when the NCCL PG is hipified to support AMD/RCCL, because the "hipified" header files are generated in other directories. Moreover, using absolute paths for header inclusion is the established practice in most components of PyTorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.

See D39835774 for more details about Meta internal complication.

**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options. Use it to verify that we don't miss any relative-path uses of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera
2022-09-30 00:27:24 +00:00
Howard Huang
74ead61944 [2/N] [Dispatchable Collectives] Extract ProcessGroup::Work into a separate class and update references (#83680)
### Changes
- Move ProcessGroup::Work into its own class and update all the references to it / header includes.

#### Motivation
In future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. This change prevents a circular dependency, with ProcessGroup depending on Backend and Backend depending on ProcessGroup::Work.

Differential Revision: [D38839212](https://our.internmc.facebook.com/intern/diff/D38839212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83680
Approved by: https://github.com/kwen2501
2022-09-14 13:05:58 +00:00
Rohan Varma
4d6314a16e [DDP] Log num threads (#64072)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64072

Log the number of gloo threads to DDP logging.
ghstack-source-id: 137119480

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D30596083

fbshipit-source-id: 2b4f6e762cb5d850be6056bcc5922029a1af3c91
2021-09-01 18:36:15 -07:00
guyang3532
4ed8858817 Exclude time of waiting in queue from gloo communication prof… (#61342)
Summary:
Background:
    The gloo communication implementation is as follows:
        1. Construct communication workers and push them into a queue.
        2. Initialize a thread pool; each thread runs a loop that takes workers from the queue and executes them.
Issue:
    The recorded profiling time span starts at worker construction and ends at completion, so it includes the time the worker spends waiting in the queue. This causes multiple gloo communication spans to overlap with each other on the same thread in the timeline:
![image](https://user-images.githubusercontent.com/62738430/124867273-5bc95b80-dff0-11eb-8664-6e5d4166fc39.png)
This happens because the next work is already waiting in the queue while the previous work has not yet finished.

Solution:
    This PR delays the profiling start time of a gloo communication from worker construction to when the worker is actually executed, so the profiling span no longer includes the time spent waiting in the queue. Implementation:
        1. Disable the original record function by passing 'nullptr' as the 'profilingTitle' argument of ProcessGroup::Work.
        2. Construct a 'recordFunctionBeforeCallback_' and a 'recordFunctionEndCallback_' and save them as members of the worker.
        3. When the worker is executed, invoke 'recordFunctionBeforeCallback_'.
        4. 'recordFunctionEndCallback_' is invoked at completion, as before.
    After this modification, the gloo profiling spans in the timeline no longer overlap with each other:
![image](https://user-images.githubusercontent.com/62738430/124868716-bb286b00-dff2-11eb-9cf0-d0494a356d0c.png)
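
A minimal sketch, assuming an initialized gloo process group, of how the collective spans can be inspected with the PyTorch profiler (the event names gloo emits are not guaranteed to match any particular string here):

```python
import torch
import torch.distributed as dist
from torch.profiler import profile

with profile() as prof:
    for _ in range(4):
        dist.all_reduce(torch.ones(1024))

# With this change, each gloo span starts when a worker thread actually executes
# the work, so spans on the same thread no longer include queue-wait time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```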

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61342

Reviewed By: albanD

Differential Revision: D29811656

Pulled By: gdankel

fbshipit-source-id: ff07e8906d90f21a072049998400b4a48791e441
2021-07-28 22:24:26 -07:00
Luca Wehrstedt
a016150163 Move torch/lib/c10d to torch/csrc/distributed/c10d (#60543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60543

Since c10d is now part of libtorch, it would also be nice if the sources all lived in one place.
ghstack-source-id: 132306292

Test Plan: It builds

Reviewed By: cbalioglu

Differential Revision: D29062002

fbshipit-source-id: d9e1301e9d73e1643fa0f0119cd2d618f1ad52e6
2021-06-24 12:38:51 -07:00