As currently implemented, AsyncAllreduceCUDADeviceWork forces a copy of the `shared_ptr<gloo::Context>` it receives: calling `std::move` on a const reference yields a const rvalue, which binds to the copy constructor rather than the move constructor. This PR changes the parameter type from a const reference to a by-value `shared_ptr<>`, which lets callers transfer ownership with a move instead of paying for a copy.
Here's an example that demonstrates the issue:
```cpp
#include <memory>
#include <iostream>
struct Foo {};
void useFoo_ref(const std::shared_ptr<Foo>& f) {
  // std::move on a const reference yields a const rvalue, which binds
  // to the copy constructor, so this copies and bumps the use count.
  std::shared_ptr<Foo> internal = std::move(f);
  std::cout << "use_count: " << internal.use_count() << '\n';
}
void useFoo_val(std::shared_ptr<Foo> f) {
  // The by-value parameter already owns its reference, so this move
  // genuinely transfers ownership with no copy.
  std::shared_ptr<Foo> internal = std::move(f);
  std::cout << "use_count: " << internal.use_count() << '\n';
}
int main() {
  std::shared_ptr<Foo> f1 = std::make_shared<Foo>();
  useFoo_ref(std::move(f1)); // prints "use_count: 2" (f1 still owns a reference)
  std::shared_ptr<Foo> f2 = std::make_shared<Foo>();
  useFoo_val(std::move(f2)); // prints "use_count: 1" (ownership moved through)
}
```
This also aligns with the [C++ Core Guidelines][1] recommendation to take a smart pointer by value when the function participates in ownership transfer.
[1]: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rr-summary-smartptrs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161834
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/kwen2501
This updates ProcessGroupGloo to support per-operation timeouts. Previously, these timeouts were ignored even when set.
* This checks whether the timeout is `kUnsetTimeout` and conditionally uses the provided timeout or the default timeout from the context (see the sketch after this list).
* This exposes `set_timeout` as a standard method on ProcessGroup/Backend so we can test the global timeout.
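A minimal sketch of that selection logic, assuming `kUnsetTimeout` is a `std::chrono::milliseconds` sentinel and that the context's default timeout is passed in; the actual names and signatures in ProcessGroupGloo may differ:
```cpp
#include <chrono>

// Assumed sentinel meaning "no per-operation timeout was provided".
constexpr std::chrono::milliseconds kUnsetTimeout{-1};

// Use the per-operation timeout when the caller set one; otherwise
// fall back to the default timeout from the gloo context.
std::chrono::milliseconds resolveTimeout(
    std::chrono::milliseconds opTimeout,
    std::chrono::milliseconds contextDefault) {
  return opTimeout == kUnsetTimeout ? contextDefault : opTimeout;
}
```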
Test plan:
```
pytest test/distributed/test_c10d_gloo.py -v -k allreduce_timeout
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158128
Approved by: https://github.com/H-Huang, https://github.com/fduwjj
This enables Gloo CUDA when used with a backend that supports GPUDirect, which is currently only the IBVERBS backend.
This requires some changes to Gloo which are in https://github.com/pytorch/gloo/pull/441
Since we now depend on gloo_cuda, we need to split ProcessGroupGloo into two pieces: one with the CPU bits (libtorch_cpu) and one with the CUDA kernels (libtorch_cuda). This unfortunately requires some major refactoring, as some CPU code is shared across both.
The gloo submodule is updated to depend on the new Gloo changes.
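For illustration, one common way to structure this kind of CPU/CUDA split, so the CPU library never links CUDA symbols directly, is a registration hook that the CUDA library fills in at load time. This is only a sketch of the pattern; all names below are hypothetical and not the actual PyTorch code:
```cpp
#include <functional>

// In the CPU library: a hook that stays empty until the CUDA library
// is loaded. The signature is a placeholder, not the real c10d one.
using CudaAllreduceFn = std::function<void(/* tensors, context, ... */)>;

CudaAllreduceFn& cudaAllreduceHook() {
  static CudaAllreduceFn fn;
  return fn;
}

// In the CUDA library: a static registrar installs the implementation
// at load time, so the CPU-side code can dispatch to it if present.
struct RegisterCudaAllreduce {
  RegisterCudaAllreduce() {
    cudaAllreduceHook() = [] { /* launch CUDA allreduce kernels */ };
  }
};
static RegisterCudaAllreduce gRegisterCudaAllreduce;
```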
Test plan:
```py
import os
import time
transport = "TCP"
#transport = "IBVERBS"
os.environ["GLOO_DEVICE_TRANSPORT"] = transport
rank = int(os.environ["RANK"])
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)
ibv = "mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1".split(",")[rank]
ibv_name, ibv_port = ibv.split(":")
os.environ["TORCH_GLOO_IBV_NAME"] = ibv_name
os.environ["TORCH_GLOO_IBV_PORT"] = ibv_port
os.environ["TORCH_GLOO_IBV_INDEX"] = "3"
import torch
import torch.distributed as dist
dist.init_process_group("gloo")
rank = dist.get_rank()
# initial sanity check
#device = "cpu"
#t = torch.zeros(10, device=device)
#dist.all_reduce(t)
#print("sanity complete")
device = "cpu"
iters = 10
warmup_iters = 2
for nelem in [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000]:
    t = torch.zeros(nelem, device=device)
    torch.cuda.current_stream().synchronize()
    for i in range(warmup_iters):
        dist.all_reduce(t)
    torch.cuda.current_stream().synchronize()
    start = time.perf_counter()
    for i in range(iters):
        dist.all_reduce(t)
    torch.cuda.current_stream().synchronize()
    dur = (time.perf_counter() - start)
    qps = iters / dur
    bandwidth_gb = t.nbytes * iters / dur / 1e9
    gb = t.nbytes / 1e9
    if rank == 0:
        print(f"{transport=} {device=} {iters=} {nelem=} {qps=} {gb=} {bandwidth_gb=}\n", end="")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153406
Approved by: https://github.com/fduwjj