Commit Graph

6 Commits

Author SHA1 Message Date
Yuanyuan Chen
5103ecc5d8 [1/N] Fix clang-tidy readability checks (#164561)
Check all `.cpp` files except `jit` files for readability thoroughly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164561
Approved by: https://github.com/Skylion007
2025-10-04 09:40:38 +00:00
Yuanyuan Chen
115af42e9d Fix readability checks in TIDY and apply them (#164475)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164475
Approved by: https://github.com/albanD, https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-10-02 20:34:49 +00:00
Lakshay Garg
2d31c3d99d Pass shared_ptr by value (#161834)
The way AsyncAllreduceCUDADeviceWork is currently implemented,
using it will force a copy of `shared_ptr<gloo::Context>`
because `std::move` does nothing for a const ref.

This PR changes the param type to shared_ptr<> instead of the
const ref. This allows more efficient parameter passing.

Here's an example that demonstrates the issue:

```cpp
#include <memory>
#include <iostream>

struct Foo {};

void useFoo_ref(const std::shared_ptr<Foo>& f) {
    std::shared_ptr<Foo> internal = std::move(f);
    std::cout << "use_count: " << internal.use_count() << '\n';
}

void useFoo_val(std::shared_ptr<Foo> f) {
    std::shared_ptr<Foo> internal = std::move(f);
    std::cout << "use_count: " << internal.use_count() << '\n';
}

int main() {
    std::shared_ptr<Foo> f1 = std::make_shared<Foo>();
    useFoo_ref(std::move(f1)); // prints "use_count: 2"

    std::shared_ptr<Foo> f2 = std::make_shared<Foo>();
    useFoo_val(std::move(f2)); // prints "use_count: 1"
}
```

This also aligns well with [C++ Core Guidelines][1] for handling
smart pointers.

[1]: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines?utm_source=chatgpt.com#Rr-summary-smartptrs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161834
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/kwen2501
2025-08-30 18:00:37 +00:00
Tristan Rice
2a8795a981 [c10d] ProcessGroupGloo: support per operation timeouts (#158128)
This updates ProcessGroupGloo to support per operation timeouts. Previously the timeouts were ignored even if they were set.

* This checks if the timeout is `kUnsetTimeout` and conditionally uses the provided timeout or the default timeout from the context.
* This exposes `set_timeout` as a standard method on ProcessGroup/Backend so we can test the global timeout.
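
The fallback rule in the first bullet can be sketched in plain Python as follows. This is only an illustration, not the code from this PR: `UNSET_TIMEOUT`, `resolve_timeout`, and the sentinel value of -1 ms are hypothetical names and values chosen for the sketch.

```python
from datetime import timedelta

# Hypothetical sentinel standing in for `kUnsetTimeout`; the value is an
# assumption made for this sketch.
UNSET_TIMEOUT = timedelta(milliseconds=-1)


def resolve_timeout(op_timeout: timedelta, context_default: timedelta) -> timedelta:
    """Return the per-operation timeout if one was set, else the context default."""
    if op_timeout == UNSET_TIMEOUT:
        # No per-operation timeout was provided: fall back to the default.
        return context_default
    return op_timeout


default = timedelta(seconds=30)
print(resolve_timeout(UNSET_TIMEOUT, default))         # falls back to the default
print(resolve_timeout(timedelta(seconds=5), default))  # per-operation value wins
```

Using a sentinel rather than an optional keeps the options struct trivially copyable on the C++ side, at the cost of reserving one value that callers cannot use as a real timeout.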

Test plan:

```
pytest test/distributed/test_c10d_gloo.py -v -k allreduce_timeout
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158128
Approved by: https://github.com/H-Huang, https://github.com/fduwjj
2025-07-11 23:09:50 +00:00
fduwjj
ff92b42fc3 [c10d][gloo] Integrate vendor generic FR into gloo (#152614)
This is a first quick prototype of FR integration for gloo. A few feature gaps remain:
- Input/output numels for each collective.
- Whether to use c10::Event and, if so, where.
- Where to dump the FR traces. (The dump API is provided in this PR.)

Differential Revision: [D75803601](https://our.internmc.facebook.com/intern/diff/D75803601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152614
Approved by: https://github.com/d4l3k
ghstack dependencies: #154929
2025-06-03 16:12:54 +00:00
Tristan Rice
d1dd2c1fc8 gloo: cuda (#153406)
This enables Gloo CUDA when used with a backend that supports GPUDirect, which is currently only the IBVERBS backend.

This requires some changes to Gloo which are in https://github.com/pytorch/gloo/pull/441

Since we're now depending on gloo_cuda, we need to split ProcessGroupGloo into two pieces: one with the CPU bits (in libtorch_cpu) and one with the CUDA kernels (in libtorch_cuda). This unfortunately requires some major refactoring, as some CPU code is shared across both.

The gloo submodule is updated to include the new Gloo changes.

Test plan:

```py
import os
import time

transport = "TCP"
#transport = "IBVERBS"

os.environ["GLOO_DEVICE_TRANSPORT"] = transport
rank = int(os.environ["RANK"])
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)

ibv = "mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1".split(",")[rank]
ibv_name, ibv_port = ibv.split(":")
os.environ["TORCH_GLOO_IBV_NAME"] = ibv_name
os.environ["TORCH_GLOO_IBV_PORT"] = ibv_port
os.environ["TORCH_GLOO_IBV_INDEX"] = "3"

import torch
import torch.distributed as dist

dist.init_process_group("gloo")

rank = dist.get_rank()

# initial sanity check
#device = "cpu"
#t = torch.zeros(10, device=device)
#dist.all_reduce(t)
#print("sanity complete")

device = "cpu"

iters = 10
warmup_iters = 2

for nelem in [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000]:
    t = torch.zeros(nelem, device=device)

    torch.cuda.current_stream().synchronize()
    for i in range(warmup_iters):
        dist.all_reduce(t)

    torch.cuda.current_stream().synchronize()

    start = time.perf_counter()

    for i in range(iters):
        dist.all_reduce(t)

    torch.cuda.current_stream().synchronize()

    dur = (time.perf_counter() - start)
    qps = iters/dur

    bandwidth_gb = t.nbytes * iters / dur / 1e9

    gb = t.nbytes / 1e9

    if rank == 0:
        print(f"{transport=} {device=} {iters=} {nelem=} {qps=} {gb=} {bandwidth_gb=}\n", end="")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153406
Approved by: https://github.com/fduwjj
2025-05-16 01:13:13 +00:00