pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Luca Wehrstedt	0a0023d984	Enable NCCL zero-copy (user buffer registration) for FSDP2 (#150564 ) Recent versions of NCCL introduced support for "user buffer registration", i.e., allowing user-owned memory (such as regular PyTorch tensors) to be "registered" (pinned, page-locked, etc.) with all the various hardware (NVLink, InfiniBand, ...) in order to support zero-copy transfers and thus accelerate communication and reduce the resource footprint of NCCL's kernels (which reduces contention). This was already exposed in PyTorch through a custom allocator provided by the NCCL process group. DDP already uses this, via a memory pool to allow caching and reusing. FSDP2 is also particularly suited to leverage user buffer registration because the buffers it passes to NCCL are allocated by FSDP2 itself, since it needs to (de)interleave the parameters to/from these private buffers anyway. This PR adds an extra flag to FSDP2 that tells it to use the ProcessGroup allocator for these private buffers, thus allowing it to leverage NCCL zero-copy (when supported). Pull Request resolved: https://github.com/pytorch/pytorch/pull/150564 Approved by: https://github.com/kwen2501, https://github.com/weifengpy, https://github.com/syed-ahmed	2025-06-17 12:54:58 +00:00
Ke Wen	9e9484d022	[SymmMem] Enable NVSHMEM for Triton (#155506 ) (This is an Experimental feature) Allow Triton kernels to invoke NVSHMEM device functions. ### Example Triton program Key parts: - Call `nvshmem.enable_triton()` to initialize; - Call `nvshmem.putmem_block` in Triton kernel; - Add `extern_libs` kwarg at kernel invocation. ``` import torch.distributed._symmetric_memory._nvshmem_triton as nvshmem @triton.jit def put_kernel( dst_ptr, src_ptr, numel: tl.constexpr, peer: tl.constexpr, BLOCK_SIZE: tl.constexpr, ): nvshmem.putmem_block(dst_ptr, src_ptr, numel, peer) if __name__ == "__main__": # Enable NVSHMEM for Triton nvshmem_lib = nvshmem.enable_triton() # Use torch Symmetric Memory to allocate Symmetric tensors ... peer = 1 - rank if rank == 0: kernel = put_kernel[(1, 1, 1)]( dst_ptr, src_ptr, numel=numel, peer=peer, BLOCK_SIZE=BLOCK_SIZE, extern_libs=nvshmem_lib, ) dist.barrier() if rank == 1: print(f"Rank {rank}: received {out=}") ``` ### Test output: ``` $ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_put Rank 0: writing value 5 to Peer 1 Rank 1: received out=tensor([5, 5, 5, 5, 5, 5, 5, 5], device='cuda:1', dtype=torch.int8) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/155506 Approved by: https://github.com/ngimel, https://github.com/fegin, https://github.com/fduwjj	2025-06-12 00:22:49 +00:00
Eddie Yan	4d9d884c3f	[NCCL] Expose new `ncclConfig_t` flags in 2.27 (#155379 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155379 Approved by: https://github.com/Skylion007	2025-06-11 18:26:55 +00:00
fduwjj	ffc6cbfaf7	[symm_mem] Move all symm mem code into a dedicated folder (#155573 ) So many files now relate to symmetric memory, and they are scattered around the C++ side. As a first step, put all symmetric-memory-related code into a separate folder. We can do further refactoring later if needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155573 Approved by: https://github.com/fegin, https://github.com/d4l3k	2025-06-10 22:30:11 +00:00
Tushar Jain	7cf5b36ec2	Release GIL in PG destructor (#154976 ) Summary: Gloo PG doesn't release GIL, which results in python code hanging until the destructor completes. The destructor waits for all work on the PG to complete which can take a long time. Test Plan: Ran ``` $ pytest --log-cli-level=INFO -vs torchft/local_sgd_integ_test.py ``` with a large timeout on the async work. Call to `gil_scoped_release` doesn't show up in the gdb stack trace. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154976 Approved by: https://github.com/d4l3k, https://github.com/dcci, https://github.com/fduwjj	2025-06-04 23:10:55 +00:00
fduwjj	ff92b42fc3	[c10d][gloo] Integrate vendor generic FR into gloo (#152614 ) This is a first quick prototype of FR (Flight Recorder) integration for gloo. A few feature gaps remain: - Input/Output numels for each collective - Whether to use c10::Event or where to use it. - Where to dump the FR traces. (The dump api is provided in this PR) Differential Revision: [D75803601](https://our.internmc.facebook.com/intern/diff/D75803601) Pull Request resolved: https://github.com/pytorch/pytorch/pull/152614 Approved by: https://github.com/d4l3k ghstack dependencies: #154929	2025-06-03 16:12:54 +00:00
Simon Fan	d1f1ff8610	[ddp] propagate use_python_reducer to C++ reducer (#152735 ) The C++ Reducer is silently incorrect under compiled autograd (CA): its implementation no-ops the collective. I'm guessing that it was no-op'd because in DDP + python reducer, the C++ reducer is still being initialized. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152735 Approved by: https://github.com/fegin ghstack dependencies: #153300, #152689	2025-05-16 01:38:03 +00:00
Xinfeng Xie	8136e0d3b7	Expose NCCL communicator from ProcessGroupNCCL via an unsafe API (#152496 ) Differential Revision: D73892691 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152496 Approved by: https://github.com/ngimel	2025-04-30 23:51:34 +00:00
cyy	41bd0c900a	[1/N] Deprecate c10::string_view and at::string (#151972 ) The calls of `c10::string_view` in the code base are replaced by `std::string_view`. The calls of `at::string` are replaced by `std::string` Pull Request resolved: https://github.com/pytorch/pytorch/pull/151972 Approved by: https://github.com/malfet	2025-04-29 07:23:52 +00:00
cyyever	f2cfeb23e5	[Environment Variable][7/N] Use thread-safe getenv functions (#140211 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211 Approved by: https://github.com/ezyang, https://github.com/eqy	2025-04-24 01:06:29 +00:00
Syed Tousif Ahmed	334aab0dea	Updates NCCLConfig with QOS variable (#151821 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/151821 Approved by: https://github.com/kwen2501	2025-04-23 00:03:49 +00:00
Will Constable	2673ea4131	Add api to enable/disable NaN detector per-PG (#151723 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/151723 Approved by: https://github.com/kwen2501, https://github.com/fduwjj	2025-04-19 03:55:25 +00:00
Tristan Rice	98c892749b	c10d/Store: add nonblocking mode to queue_pop (#151485 ) This adds a non-blocking mode to queue_pop. This allows workers to poll whether work is ready without blocking the main loop. This is useful when you want to keep a GPU at maximum utilization while items are only periodically sent on the queue. We also expose a `torch.distributed.QueueEmptyError` so users can catch the error and handle it accordingly. Test plan: ``` pytest test/distributed/test_store.py -k queue -v -s -x ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/151485 Approved by: https://github.com/fduwjj, https://github.com/tianfengfrank	2025-04-18 02:14:50 +00:00
Yanli Zhao	d8bafd23ab	[DDP] add one option to allow skipping all reduce unused parameters (#151503 ) Summary: Add an option to skip all-reducing unused parameters. This can improve training throughput significantly when the model has a large number of unused parameters. Test Plan: unit tests, CI Differential Revision: D72282069 Pull Request resolved: https://github.com/pytorch/pytorch/pull/151503 Approved by: https://github.com/mrshenli	2025-04-17 23:30:19 +00:00
Tristan Rice	bb60e82672	c10d/Store: add queues (#150969 ) This adds queue operations as described in https://github.com/pytorch/pytorch/issues/150943. This works by adding two new operations `queue_push` and `queue_pop`. The semantics are designed to be blocking with a timeout. Pushing will always succeed as the queue is infinite size. Popping will first call `wait` until the key is ready and then pop the value from the queue. This implements queues for only: HashStore, TCPStore w/ libuv. FileStore and the legacy backends are not supported. `wait` and `check` work for queue operations though queue_push will only wake up the first waiter rather than all of them. This also has a few cleanups to error types/documentation in related code. Example trace: ``` [I409 16:51:43.963833529 TCPStoreLibUvBackend.cpp:829] [c10d - trace] validate magic:1015412686 address:[localhost]:55816 [I409 16:51:43.963845838 TCPStoreLibUvBackend.cpp:842] [c10d - trace] ping nonce:2840795 address:[localhost]:55816 [I409 16:51:43.963902914 TCPStoreLibUvBackend.cpp:911] [c10d - trace] add key:init/ val:1 address:[localhost]:55816 [I409 16:51:43.963939389 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:init/ address:[localhost]:55816 [I409 16:51:43.963974842 TCPStoreLibUvBackend.cpp:893] [c10d - trace] get key:init/ address:[localhost]:55816 [I409 16:51:43.964071909 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/test_queue_support address:[localhost]:55816 [I409 16:51:43.964080221 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964108584 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964123207 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964128194 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 
keys[0]:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964156347 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964187493 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964217709 TCPStoreLibUvBackend.cpp:1133] [c10d - trace] queue_pop key:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964324300 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964354495 TCPStoreLibUvBackend.cpp:1133] [c10d - trace] queue_pop key:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964416299 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816 [I409 16:51:43.964458733 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/non_existant address:[localhost]:55816 [W409 16:51:43.974516585 socket.cpp:460] [c10d] waitForInput: poll for socket SocketImpl(fd=75, addr=[localhost]:55816, remote=[localhost]:46641) returned 0, likely a timeout [W409 16:51:43.974559169 socket.cpp:485] [c10d] waitForInput: socket SocketImpl(fd=75, addr=[localhost]:55816, remote=[localhost]:46641) timed out after 10ms [I409 16:51:43.974600451 TCPStoreLibUvBackend.cpp:1101] [c10d - trace] cancel_wait address:[localhost]:55816 ``` Test plan: ``` $ pytest test/distributed/test_store.py -k queue -v -s test/distributed/test_store.py::FileStoreTest::test_queues SKIPPED [0.4351s] (Store does not support queues) test/distributed/test_store.py::HashStoreTest::test_queues PASSED [0.0009s] test/distributed/test_store.py::PrefixFileStoreTest::test_queues SKIPPED [0.0006s] (Store does not support queues) test/distributed/test_store.py::TCPStoreTest::test_queues SKIPPED [0.0012s] (Store does not support queues) test/distributed/test_store.py::LibUvTCPStoreTest::test_queues 
PASSED [0.0014s] test/distributed/test_store.py::PrefixTCPStoreTest::test_queues PASSED [0.0014s] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150969 Approved by: https://github.com/XilunWu, https://github.com/fduwjj	2025-04-11 19:24:17 +00:00
Tristan Rice	8b5e717601	c10d/Store: add clone feature (#150966 ) (#150966 ) (#151045 ) Summary: This adds a new `clone()` method to Store which will return a new Store instance that can be used from a different thread. This is intended to better support multiple threads with stores such as when ProcessGroupNCCL needs a store to do error propagation. Related issue: https://github.com/pytorch/pytorch/issues/150943 Approved by: https://github.com/fduwjj Test Plan: contbuild & OSS CI, see `205881ea4a` Test plan from GitHub: ``` pytest test/distributed/test_store.py -k PythonStore pytest test/distributed/test_store.py -k clone ``` Differential Revision: D72789690 Pull Request resolved: https://github.com/pytorch/pytorch/pull/151045 Approved by: https://github.com/XilunWu, https://github.com/fduwjj	2025-04-11 04:00:23 +00:00
Tristan Rice	df4e5294a6	Reapply "ProcessGroupGloo: support lazy_init (#150801 )" (#151031 ) This reverts commit `73f3d6d9aa`. Reapplies #150801 Test plan: See #150801 submodule Pull Request resolved: https://github.com/pytorch/pytorch/pull/151031 Approved by: https://github.com/fduwjj	2025-04-11 01:58:35 +00:00
PyTorch MergeBot	abe41c5c9c	Revert "c10d/Store: add clone feature (#150966 )" This reverts commit `205881ea4a`. Reverted https://github.com/pytorch/pytorch/pull/150966 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/150966#issuecomment-2795063574))	2025-04-10 20:17:53 +00:00
PyTorch MergeBot	73f3d6d9aa	Revert "ProcessGroupGloo: support lazy_init (#150801 )" This reverts commit `f237ee54bf`. Reverted https://github.com/pytorch/pytorch/pull/150801 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/150801#issuecomment-2793161239))	2025-04-10 13:44:31 +00:00
Tristan Rice	205881ea4a	c10d/Store: add clone feature (#150966 ) This adds a new `clone()` method to Store which will return a new Store instance that can be used from a different thread. This is intended to better support multiple threads with stores such as when ProcessGroupNCCL needs a store to do error propagation. Related issue: https://github.com/pytorch/pytorch/issues/150943 Test plan: ``` pytest test/distributed/test_store.py -k PythonStore pytest test/distributed/test_store.py -k clone ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150966 Approved by: https://github.com/fduwjj	2025-04-10 01:41:50 +00:00
Tristan Rice	f237ee54bf	ProcessGroupGloo: support lazy_init (#150801 ) This adds lazy initialization support to ProcessGroupGloo via `TORCH_GLOO_LAZY_INIT` or via `create_device(..., lazy_init=True)` This is still a draft PR as there's one race condition when doing coalesced operations that needs to be fixed upstream in Gloo first. Depends on https://github.com/facebookincubator/gloo/pull/427 landing first This also updates the gloo submodule to include the required changes. Test plan: added lazy init test variants ``` pytest -v test/distributed/test_c10d_gloo.py -k Lazy ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150801 Approved by: https://github.com/fduwjj	2025-04-09 19:29:50 +00:00
Ke Wen	35c45a4a31	[Reland] Launch kernel on current stream & remove `record_stream` entirely (#150398 ) Relanding #148590 due to merge conflict. This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line as compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Squashed contents: * [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820) PTD current workflow: - PTD creates its own dedicated `ncclStream` for comm operation - it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective. Such stream synchronization becomes expensive in the inference world (cpu overhead: 70us vs GPU kernel time: 160us). 
This diff: - async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead - async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready - pass down async from c10d down to NCCL-PG this helps shave off 50% CPU overhead (70us -> 35us), which reduce total CPU/GPU from 230us to 195us by 15% * [PGNCCL] Make avoid-record-stream default * [c10d] Add asyncOp argument to Ops * Change python side wait * Pass asyncOp at ProcessGroup level * Watchdog unstashing tensors as a safety net * Stash tensors for reduce_scatter_v and all_gather_v Pull Request approved: https://github.com/pytorch/pytorch/pull/149753 * [c10d] Move unstashing from watchdog to main thread Pull Request approved: https://github.com/pytorch/pytorch/pull/150079 * [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation Pull Request approved: https://github.com/pytorch/pytorch/pull/150130 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150398 Approved by: https://github.com/atalman	2025-04-01 16:46:07 +00:00
Tristan Rice	29b3fdab01	TCPStoreLibUvBackend: support masterListenFd (#150215 ) This supports `masterListenFd` which is required for full compatibility with the non-libuv TCPStore. The code was just missing a `uv_listen` call and now it works just fine. This is required to migrate the last remaining uses of TCPStore off of the non-libuv backend. Test plan: ``` pytest -v test/distributed/test_store.py -k test_take_over_listen_socket ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150215 Approved by: https://github.com/fduwjj	2025-03-29 01:58:07 +00:00
fduwjj	8bf3f3fc43	[c10d] Add a collective time estimator for NCCL comms (#149343 ) This upstreams the feature from newer NCCL versions that lets users estimate communication time. Resolves #147753 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149343 Approved by: https://github.com/kwen2501	2025-03-19 07:54:02 +00:00
PyTorch MergeBot	afa1eda901	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit `ef6296e7f2`. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/izaitsevfb due to reverted internally, see D71292427 ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2731114626))	2025-03-17 22:43:15 +00:00
Ke Wen	ef6296e7f2	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line as compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-11 18:36:12 +00:00
PyTorch MergeBot	a95eb0c0a7	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit `2149f6c684`. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/ZainRizvi due to Breaking internally, see D70873275. Discussed reverting this with Ke. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2712001270))	2025-03-10 22:38:40 +00:00
Tristan Rice	494abeff8a	CUDACachingAllocator,c10d: fixes for IPC release performance (#148805 ) This has two fixes to improve IPC tensor release performance when using torchft's BabyProcessGroupNCCL. 1. release the IpcMutex when deleting the `ExpandableSegements` object to avoid synchronizing under the lock 2. release the GIL in WorkNCCL destructor since the shared tensor will be destructed there Test plan: Run with torchft + torchtitan ``` REPLICA_GROUP_ID=0 NGPU=2 CUDA_VISIBLE_DEVICES=0,1 CONFIG_FILE=./torchtitan/models/llama/train_configs/llama3_8b.toml ./run_train.sh --training.data_parallel_shard_degree=2 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=0 --metrics.log_freq=1 --training.seq_len 4096 ... [rank0]:[titan] 2025-03-07 17:51:31,387 - root - INFO - step: 61 loss: 7.4825 memory: 79.73GiB(83.89%) tps: 317 tflops: 16.34 mfu: 1.65% ``` Check py-spy to verify no bottleneck on IPC lock when creating new shared tensors ![20250307_17h50m10s_grim](https://github.com/user-attachments/assets/fa8b359f-e337-4ed5-be22-a42ab2bee03d) ![20250307_17h50m00s_grim](https://github.com/user-attachments/assets/206f869a-f07e-4fbd-9e28-89b3da95ef6e) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148805 Approved by: https://github.com/Skylion007, https://github.com/fegin, https://github.com/zdevito	2025-03-10 19:47:04 +00:00
Ke Wen	2149f6c684	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line as compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-09 07:32:23 +00:00
PyTorch MergeBot	9cb25f0ea2	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit `17dbeb11db`. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/janeyx99 due to PR break backward compat test ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2708641172))	2025-03-09 03:01:55 +00:00
Ke Wen	17dbeb11db	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line as compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-08 20:00:12 +00:00
Tristan Rice	7ffadff286	c10d/ProcessGroup: cleanup abort and shutdown (#148798 ) This adds `abort` and `shutdown` to `Backend` and `ProcessGroup` objects. This simplifies the logic in `distributed_c10d.py` by having a default noop implementation for all PGs. This will be useful for torchft and upcoming versions of NCCL which will handle abort correctly. Currently `torchft` would have to call internal methods `_abort` on the PGNCCL object directly but with this change we can now just call `.abort()` and have it work for any PG implementation. Test plan: ``` pytest distributed/test_backends.py distributed/test_c10d_common.py distributed/test_c10d_pypg.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148798 Approved by: https://github.com/kwen2501	2025-03-08 18:33:18 +00:00
Sanket Purandare	9841f0ddcf	Add support for non functional collectives under FakeTensorMode and fake_pg for memory tracking (#147566 ) This PR adds support for non-functional collectives under `FakeTensorMode` and `fake_pg`. It helps eliminate the patching of collectives for memory and runtime estimation. It also modifies the `ModTracker` to enable the post-backward hook call for modules whose inputs don't require gradients but parameters do. For the memory tracking, we now enable tracking DTensor dispatcher for custom dispatch functions like `entropy_loss`. Dispatcher is only enabled for the memory tracking part and disabled as soon as it is done. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147566 Approved by: https://github.com/weifengpy	2025-03-08 18:00:49 +00:00
taozhiwei	16d07988fc	add supports_coalescing property in c10d::Backend to determine whether backend supports coalescing (#135338 ) 1. My company uses privateuseone to connect a new hardware device and requires the `batch_isend_irecv` function. However, `batch_isend_irecv` is currently only available for CUDA, so I add a `supports_coalescing` property in `c10d::Backend` to determine whether the backend supports coalescing. 2. If `pg._has_hooks` returns True, we don't need to determine whether the current device is CUDA, so privateuseone can also support `pg._wait_for_pending_works`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338 Approved by: https://github.com/kwen2501, https://github.com/albanD	2025-03-04 12:37:06 +00:00
Tristan Rice	68631f6e87	PyWork: preserve Python reference counting when used in functional collectives (#146376 ) @fegin found an issue where torchft is not compatible with functional collectives. Found in https://github.com/pytorch/torchtitan/pull/806 The root cause is that PyProcessGroup/PyWork are not compatible with functional collectives due to a nasty ownership bug. PyWork relies on a pybind trampoline to propagate requests to Python. Unfortunately, the way Pybind works is that the Python object owns the C++ object rather than some form of shared ownership. Thus what happens is that the PyWork Python object will be collected when returned to C++ from the PyProcessGroup but the C++ PyWork object still exists. When the PyWork object is used, this causes a deadlock as the corresponding Python object no longer exists. To solve this, we introduce a new `PyWorkHolder` class which holds a reference to the `py::object` as well as the trampoline class. This resolves any dependency issues since we can now hold ownership in C++ to both the Python and C++ objects. To make this cleaner we introduce a `WORK_OVERRIDE` macro which is a patched version of `PYBIND11_OVERRIDE` that returns a `PyWorkHolder` rather than just `PyWork` and use for all collectives in PyProcessGroup. Test plan: ``` cd pytorch pytest test/distributed/test_c10d_functional_native.py ``` ``` cd torchft pytest torchft/process_group_test.py -k functional -v -x -s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146376 Approved by: https://github.com/yifuwang	2025-02-07 18:07:53 +00:00
PyTorch MergeBot	00dc5b10f6	Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211 )" This reverts commit `2fd1b6b361`. Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/atalman due to Breaks executorch tests ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2632202864))	2025-02-03 22:04:28 +00:00
cyy	2fd1b6b361	[Environment Variable][7/N] Use thread-safe getenv functions (#140211 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211 Approved by: https://github.com/ezyang, https://github.com/eqy	2025-02-01 12:33:41 +00:00
Ke Wen	51ee9b154e	[c10d] Add NCCL memory allocator (#145675 ) This PR implements a small UI improvement over #133603. It prepares a NCCL memory allocator in torch cpp and then exposes it through pybind, so that users can use it directly. UI: ``` pool = torch.cuda.MemPool(backend.mem_allocator) with torch.cuda.use_mem_pool(pool): tensor = torch.arange(1024 * 1024 * 2, device=device) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675 Approved by: https://github.com/syed-ahmed, https://github.com/wconstab	2025-01-30 18:19:00 +00:00
PyTorch MergeBot	5fa28bbe40	Revert "[c10d] Add NCCL memory allocator (#145675 )" This reverts commit `18a7a04c4a`. Reverted https://github.com/pytorch/pytorch/pull/145675 on behalf of https://github.com/ZainRizvi due to Sorry but this still fails internally. See D68866823 for details ([comment](https://github.com/pytorch/pytorch/pull/145675#issuecomment-2624900562))	2025-01-30 16:01:52 +00:00
Ke Wen	25ca05eebf	[PGNCCL] Correct some ifdef's (#145893 ) `create` function supporting `ncclConfig_t` should be wrapped inside `NCCL_HAS_CONFIG` instead of `NCCL_HAS_COMM_NONBLOCKING` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145893 Approved by: https://github.com/c-p-i-o	2025-01-30 01:05:21 +00:00
Ke Wen	18a7a04c4a	[c10d] Add NCCL memory allocator (#145675 ) This PR implements a small UI improvement over #133603. It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it. UI: ``` pool = torch.cuda.MemPool(backend.mem_allocator) with torch.cuda.use_mem_pool(pool): tensor = torch.arange(1024 * 1024 * 2, device=device) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675 Approved by: https://github.com/syed-ahmed, https://github.com/wconstab	2025-01-29 23:20:22 +00:00
PyTorch MergeBot	6371c25b91	Revert "[c10d] Add NCCL memory allocator (#145675 )" This reverts commit `9fd6722fc9`. Reverted https://github.com/pytorch/pytorch/pull/145675 on behalf of https://github.com/ZainRizvi due to This fails to build internally, can you please take a look at D68831004 for more details? ([comment](https://github.com/pytorch/pytorch/pull/145675#issuecomment-2622515425))	2025-01-29 18:30:30 +00:00
PyTorch MergeBot	284f217011	Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211 )" This reverts commit `97b3b73f3e`. Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @eqy @ezyang can you please help this get remerged? See D68779772. ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2622504898))	2025-01-29 18:24:29 +00:00
Ke Wen	9fd6722fc9	[c10d] Add NCCL memory allocator (#145675 ) This PR implements a small UI improvement over #133603. It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it. UI: ``` pool = torch.cuda.MemPool(backend.mem_allocator) with torch.cuda.use_mem_pool(pool): tensor = torch.arange(1024 * 1024 * 2, device=device) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675 Approved by: https://github.com/syed-ahmed, https://github.com/wconstab	2025-01-29 02:48:56 +00:00
cyyever	97b3b73f3e	[Environment Variable][7/N] Use thread-safe getenv functions (#140211 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211 Approved by: https://github.com/ezyang, https://github.com/eqy	2025-01-28 15:21:12 +00:00
Shuqiang Zhang	c0861d092c	[PGNCCL] Add an API to get the status/error code at the PG level (#144498) Summary: This PR is basically a replacement of https://github.com/pytorch/pytorch/pull/140087, which caused some perf drop due to frequent TCPStore checks in the watchdog thread. The fix is to move the TCPStore check into the monitoring thread. If the PG is unhealthy, the user should be able to get the type of error, e.g., timeout, NCCL error or remote error. This API is applied at the PG level, compared to the work.get_future_result() API, which is applied at the Work level. Error detection at the PG level is much more convenient for users who want to handle the PG failure as a whole, e.g., by restarting the PG. Error handling at the Work level is still useful for users to attach work-specific context and debug the RC of the specific failing work/collective. Note it is critical for all ranks in the PG to be notified about an error as soon as it occurs, so we introduce an errorType of REMOTE_ERROR, which is 'broadcast' from a src rank (which detects a local error) to all other ranks in the PG. The broadcast is currently done through TCPStore. Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/144498 Approved by: https://github.com/kwen2501	2025-01-24 16:47:32 +00:00
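The REMOTE_ERROR propagation described in the commit above can be sketched in plain Python. Everything here is an assumption made for illustration: `ToyProcessGroup`, `ERROR_KEY`, `monitor_tick`, and the enum members are hypothetical names, and a plain `dict` stands in for the shared TCPStore. It is not the ProcessGroupNCCL implementation, only a toy model of "one rank detects a local error, the others learn about it through a shared store".

```python
from enum import Enum
from typing import Dict


class ErrorType(Enum):
    # Hypothetical error kinds, loosely following the commit message.
    SUCCESS = 0
    TIMEOUT = 1
    COMM_ERROR = 2
    REMOTE_ERROR = 3


ERROR_KEY = "pg_error_src_rank"  # hypothetical key name in the shared store


class ToyProcessGroup:
    """Toy model of PG-level error state shared through a key-value store."""

    def __init__(self, rank: int, store: Dict[str, int]) -> None:
        self.rank = rank
        self.store = store  # a dict standing in for a TCPStore
        self.error = ErrorType.SUCCESS

    def report_local_error(self, error: ErrorType) -> None:
        # The detecting rank records its own error and publishes its rank.
        self.error = error
        self.store.setdefault(ERROR_KEY, self.rank)

    def monitor_tick(self) -> None:
        # One iteration of a monitoring loop: check whether a peer failed.
        src = self.store.get(ERROR_KEY)
        if src is not None and src != self.rank and self.error is ErrorType.SUCCESS:
            self.error = ErrorType.REMOTE_ERROR

    def get_error(self) -> ErrorType:
        return self.error


if __name__ == "__main__":
    store: Dict[str, int] = {}
    groups = [ToyProcessGroup(rank, store) for rank in range(3)]
    groups[1].report_local_error(ErrorType.TIMEOUT)
    for pg in groups:
        pg.monitor_tick()
    print([pg.get_error().name for pg in groups])
```

Keeping the store check in a separate monitoring step, rather than in the hot path, mirrors the commit's motivation of avoiding frequent store checks in the watchdog thread.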
cyy	6a35d9aaa4	Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806 Approved by: https://github.com/kwen2501	2025-01-24 12:22:13 +00:00
PyTorch MergeBot	6a2b4db0a1	Revert "Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806 )" This reverts commit `42f4fda2eb`. Reverted https://github.com/pytorch/pytorch/pull/143806 on behalf of https://github.com/huydhn due to Lots of builds fail after this land, so maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/143806#issuecomment-2611275836))	2025-01-24 00:17:34 +00:00
cyy	42f4fda2eb	Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806 Approved by: https://github.com/kwen2501	2025-01-23 22:47:18 +00:00
Tristan Rice	6e58c37542	c10d: no call_guard in init (#143598 ) `py::call_guard<py::gil_scoped_release>` is not safe when using multiple threads. This instead moves it into the init function which is safe. For more details see #143593 https://github.com/pybind/pybind11/issues/5473 Test plan: ``` python setup.py develop ``` CI ```py import time from concurrent.futures import ThreadPoolExecutor from torch import distributed as dist def run(): store = dist.TCPStore( host_name="localhost", port=0, is_master=True, wait_for_workers=False, ) # this sleep is required to trigger the crash time.sleep(0.1) del store futures = [] with ThreadPoolExecutor( max_workers=100, ) as executor: for i in range(100000): print(i) futures.append(executor.submit(run)) if len(futures) > 100: futures.pop(0).result() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143598 Approved by: https://github.com/c-p-i-o	2024-12-20 22:23:36 +00:00


532 Commits