As currently implemented, AsyncAllreduceCUDADeviceWork forces a copy of the `shared_ptr<gloo::Context>` it receives: calling `std::move` on a const reference yields a const rvalue, which binds to the copy constructor rather than the move constructor. This PR changes the parameter type from a const reference to a by-value `shared_ptr<>`, which lets callers transfer ownership with a move instead of paying for a copy.
Here's an example that demonstrates the issue:
```cpp
#include <memory>
#include <iostream>
struct Foo {};
void useFoo_ref(const std::shared_ptr<Foo>& f) {
  // std::move on a const reference yields a const rvalue, which binds
  // to the copy constructor, so this copies and bumps the use count.
  std::shared_ptr<Foo> internal = std::move(f);
  std::cout << "use_count: " << internal.use_count() << '\n';
}
void useFoo_val(std::shared_ptr<Foo> f) {
  // The by-value parameter already owns its reference, so this move
  // genuinely transfers ownership with no copy.
  std::shared_ptr<Foo> internal = std::move(f);
  std::cout << "use_count: " << internal.use_count() << '\n';
}
int main() {
  std::shared_ptr<Foo> f1 = std::make_shared<Foo>();
  useFoo_ref(std::move(f1)); // prints "use_count: 2" (f1 still owns a reference)
  std::shared_ptr<Foo> f2 = std::make_shared<Foo>();
  useFoo_val(std::move(f2)); // prints "use_count: 1" (ownership moved through)
}
```
This also aligns with the [C++ Core Guidelines][1] recommendation to take a smart pointer by value when the function participates in ownership transfer.
[1]: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rr-summary-smartptrs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161834
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/kwen2501
This updates ProcessGroupGloo to support per-operation timeouts. Previously, these timeouts were ignored even when set.
* This checks whether the timeout is `kUnsetTimeout` and conditionally uses the provided timeout or the default timeout from the context (see the sketch after this list).
* This exposes `set_timeout` as a standard method on ProcessGroup/Backend so we can test the global timeout.
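A minimal sketch of that selection logic, assuming `kUnsetTimeout` is a `std::chrono::milliseconds` sentinel and that the context's default timeout is passed in; the actual names and signatures in ProcessGroupGloo may differ:
```cpp
#include <chrono>

// Assumed sentinel meaning "no per-operation timeout was provided".
constexpr std::chrono::milliseconds kUnsetTimeout{-1};

// Use the per-operation timeout when the caller set one; otherwise
// fall back to the default timeout from the gloo context.
std::chrono::milliseconds resolveTimeout(
    std::chrono::milliseconds opTimeout,
    std::chrono::milliseconds contextDefault) {
  return opTimeout == kUnsetTimeout ? contextDefault : opTimeout;
}
```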
Test plan:
```
pytest test/distributed/test_c10d_gloo.py -v -k allreduce_timeout
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158128
Approved by: https://github.com/H-Huang, https://github.com/fduwjj
This enables Gloo CUDA when used with a backend that supports GPUDirect, which is currently only the IBVERBS backend.
This requires some changes to Gloo which are in https://github.com/pytorch/gloo/pull/441
Since we now depend on gloo_cuda, we need to split ProcessGroupGloo into two pieces: one with the CPU bits (libtorch_cpu) and one with the CUDA kernels (libtorch_cuda). This unfortunately requires some major refactoring, as some CPU code is shared across both.
The gloo submodule is updated to depend on the new Gloo changes.
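For illustration, one common way to structure this kind of CPU/CUDA split, so the CPU library never links CUDA symbols directly, is a registration hook that the CUDA library fills in at load time. This is only a sketch of the pattern; all names below are hypothetical and not the actual PyTorch code:
```cpp
#include <functional>

// In the CPU library: a hook that stays empty until the CUDA library
// is loaded. The signature is a placeholder, not the real c10d one.
using CudaAllreduceFn = std::function<void(/* tensors, context, ... */)>;

CudaAllreduceFn& cudaAllreduceHook() {
  static CudaAllreduceFn fn;
  return fn;
}

// In the CUDA library: a static registrar installs the implementation
// at load time, so the CPU-side code can dispatch to it if present.
struct RegisterCudaAllreduce {
  RegisterCudaAllreduce() {
    cudaAllreduceHook() = [] { /* launch CUDA allreduce kernels */ };
  }
};
static RegisterCudaAllreduce gRegisterCudaAllreduce;
```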
Test plan:
```py
import os
import time
transport = "TCP"
#transport = "IBVERBS"
os.environ["GLOO_DEVICE_TRANSPORT"] = transport
rank = int(os.environ["RANK"])
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)
ibv = "mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1".split(",")[rank]
ibv_name, ibv_port = ibv.split(":")
os.environ["TORCH_GLOO_IBV_NAME"] = ibv_name
os.environ["TORCH_GLOO_IBV_PORT"] = ibv_port
os.environ["TORCH_GLOO_IBV_INDEX"] = "3"
import torch
import torch.distributed as dist
dist.init_process_group("gloo")
rank = dist.get_rank()
# initial sanity check
#device = "cpu"
#t = torch.zeros(10, device=device)
#dist.all_reduce(t)
#print("sanity complete")
device = "cpu"
iters = 10
warmup_iters = 2
for nelem in [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000]:
    t = torch.zeros(nelem, device=device)
    torch.cuda.current_stream().synchronize()
    for i in range(warmup_iters):
        dist.all_reduce(t)
    torch.cuda.current_stream().synchronize()
    start = time.perf_counter()
    for i in range(iters):
        dist.all_reduce(t)
    torch.cuda.current_stream().synchronize()
    dur = (time.perf_counter() - start)
    qps = iters / dur
    bandwidth_gb = t.nbytes * iters / dur / 1e9
    gb = t.nbytes / 1e9
    if rank == 0:
        print(f"{transport=} {device=} {iters=} {nelem=} {qps=} {gb=} {bandwidth_gb=}\n", end="")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153406
Approved by: https://github.com/fduwjj