pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	694d205143	Revert "shrink_group implementation to expose ncclCommShrink API (#164518 )" This reverts commit `311ea0dec0`. Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/atalman due to breaks internal builds Error: from logging_utils import ( ModuleNotFoundError: No module named 'logging_utils' ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3469308568))	2025-10-30 17:52:29 +00:00
Bruce Chang	311ea0dec0	shrink_group implementation to expose ncclCommShrink API (#164518 ) Closes #164529 To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch. This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization. For more info: [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518 Approved by: https://github.com/kwen2501	2025-10-30 01:50:54 +00:00
PyTorch MergeBot	ad4dc52bf6	Revert "shrink_group implementation to expose ncclCommShrink API (#164518 )" This reverts commit `4e643422f6`. Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/albanD due to Breaks lint ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3429426503))	2025-10-21 20:24:14 +00:00
Bruce Chang	4e643422f6	shrink_group implementation to expose ncclCommShrink API (#164518 ) Closes #164529 To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch. This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization. For more info: [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518 Approved by: https://github.com/kwen2501	2025-10-21 19:47:33 +00:00
PyTorch MergeBot	633a3b7f67	Revert "shrink_group implementation to expose ncclCommShrink API (#164518 )" This reverts commit `fa0db212e7`. Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3419893217))	2025-10-19 19:20:45 +00:00
Bruce Chang	fa0db212e7	shrink_group implementation to expose ncclCommShrink API (#164518 ) Closes #164529 To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch. This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization. For more info: [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518 Approved by: https://github.com/kwen2501	2025-10-19 18:00:08 +00:00
PyTorch MergeBot	fae74cd52f	Revert "shrink_group implementation to expose ncclCommShrink API (#164518 )" This reverts commit `a032510db3`. Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3416718767))	2025-10-17 18:55:53 +00:00
Bruce Chang	a032510db3	shrink_group implementation to expose ncclCommShrink API (#164518 ) Closes #164529 To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch. This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization. For more info: [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518 Approved by: https://github.com/Skylion007, https://github.com/syed-ahmed, https://github.com/kwen2501	2025-10-17 17:55:03 +00:00
fduwjj	6b2bef10af	[c10d] Prototype of `group_split` for dist2 work (#157716 ) This is to implement group_split as proposed in [docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89](https://docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157716 Approved by: https://github.com/d4l3k	2025-07-14 21:04:12 +00:00
fduwjj	1d0f45d5d1	[c10d][PGNCCL] Cleanup unused params for nccl comm split (#157978 ) Previously we add global ranks as a input params for nccl comm. Now this is not needed, let's clean that up. Differential Revision: [D78051047](https://our.internmc.facebook.com/intern/diff/D78051047) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157978 Approved by: https://github.com/Skylion007	2025-07-10 17:36:23 +00:00
fduwjj	d0cfa3e5bf	[c10d] Move the include of header file of TraceUtils.h into NCCLUtil.cpp instead of keeping in hpp (#156909 ) We have seen complaint about compilation failure of `NCCLSymmetricMemory.cu` and the reason is because we include <torch/csrc/distributed/c10d/TraceUtils.h> inside NCCLUtil.hpp this is not necessary so we want to move the include to cpp. Differential Revision: [D77346675](https://our.internmc.facebook.com/intern/diff/D77346675) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156909 Approved by: https://github.com/kwen2501	2025-06-27 16:30:49 +00:00
Pavan Balaji	e99a2a2dba	[PG/nccl] Simplify uniqueHash management (#156790 ) Summary: ncclUniqueID is only relevant when a comm is created using ncclCommCreate or ncclCommCreateConfig. If a comm is created with ncclCommSplit, this field is unset, causing its usage to create unexpected behavior. This patch creates a unique hash key for each comm, irrespective of how the comm is created. Test Plan: CI Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/156790 Approved by: https://github.com/fduwjj, https://github.com/kwen2501	2025-06-25 20:06:08 +00:00
Syed Tousif Ahmed	f70c80105e	Enables NCCL symmetric memory kernels through mempool registration (#155134 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155134 Approved by: https://github.com/kwen2501 Co-authored-by: Ke Wen <kw2501@meta.com>	2025-06-21 23:24:04 +00:00
Luca Wehrstedt	d4ad280429	Enable querying the build and runtime NCCL versions (#156305 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156305 Approved by: https://github.com/wconstab, https://github.com/Skylion007, https://github.com/fegin	2025-06-19 02:00:08 +00:00
Xinfeng Xie	bfc0920d95	[C10D] Move getNcclDataType into NCCLUtils (#153113 ) Differential Revision: D74365214 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153113 Approved by: https://github.com/ngimel	2025-05-08 08:54:05 +00:00
Hexin Wang	6720d23969	Fixing NCCL abort hang issue when a ProcessGroupNCCL manages multiple ncclComms (#150690 ) Detail of the issue: If PyTorch issues send/recv to each 2 rank comm, and these comms are managed by a single ProcessGroupNCCL instance, then comms need to abort either in sequence or in group. I.e. the following sequential abort will cause hang in NCCL. recv(..., comm0, stream); send(..., comm1, stream); abort(comm1); abort(comm0); Fixes #119797 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150690 Approved by: https://github.com/kwen2501, https://github.com/eqy, https://github.com/atalman	2025-04-10 17:33:26 +00:00
Luca Wehrstedt	3649e2e7bd	Safer bookkeeping of NCCL communicators (#150681 ) This consists mainly in two changes: - ensure we can reliably obtain the device from a `NCCLComm` object (there was one constructor which didn't set the device) - use a RAII pattern for acquiring the lock to the global dictionary of `NCCLComms` (which ensures the lock is released in case of exceptions) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150681 Approved by: https://github.com/kwen2501	2025-04-08 11:12:37 +00:00
Richard Barnes	5301710b15	[codemod] Fix unused-value issue in caffe2/aten/src/ATen/cuda/detail/CUDAHooks.cpp +4 (#147555 ) Summary: LLVM has a warning `-Wunused-value` which we treat as an error because it's so often diagnostic of a code issue. Unused values often indicate a programming mistake, but can also just be unnecessary cruft that harms readability and performance. For questions/comments, contact r-barnes. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Differential Revision: D69945678 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147555 Approved by: https://github.com/Skylion007, https://github.com/eqy	2025-03-01 19:46:13 +00:00
Ke Wen	effc545274	[DDP] Use NCCL allocated memory for gradient bucket (#146589 ) So that NVLink SHARP comes with zero-copy on H100+ platforms, for DDP applications. Less SM usage, less memory contention between NCCL kernel and compute kernels. Added env `DDP_DISABLE_COMM_MEM` as a back-out option: ``` An environment variable to disable comm-optimized memory pool. Default is 0, which means comm-optimized memory pool is enabled. Users can set it to 1 in case of seeing regression or OOM (because this comm MemPool may not share space with regular compute MemPool). ``` Differential Revision: [D69297766](https://our.internmc.facebook.com/intern/diff/D69297766) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146589 Approved by: https://github.com/syed-ahmed, https://github.com/c-p-i-o, https://github.com/fduwjj	2025-02-10 05:23:11 +00:00
Dingming Wu	fa34128435	revert PTD's change that leads to signature mismatch of printNcclCommProxyTrace (#146453 ) Summary: D68801098 introduced this function signature mismatch issue for printNcclCommProxyTrace. Revert it so that trunk build can pass. Test Plan: With the change, build of APS model using rcclexp can now pass: `sh scripts/ltian/run_jobs/fb_fm_v2/run_fb_fm_v2_job.sh -h T20_GTT_MI300X -n 16 -b 1024 -t [2024-12-06] -d ai_infra_ngs -e ai_infra_training_rnd_tc -x 0` Reviewed By: c-p-i-o Differential Revision: D69149588 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146453 Approved by: https://github.com/c-p-i-o	2025-02-07 22:43:52 +00:00
fduwjj	eb029fba13	[c10d][NCCL] Implement ncclCommInitRankScalable (merging #136789 ) (#144794 ) Try to land https://github.com/pytorch/pytorch/pull/136789/files on our end and fix any remaining issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144794 Approved by: https://github.com/kwen2501, https://github.com/eqy, https://github.com/atalman	2025-01-31 22:39:56 +00:00
Ke Wen	25ca05eebf	[PGNCCL] Correct some ifdef's (#145893 ) `create` function supporting `ncclConfig_t` should be wrapped inside `NCCL_HAS_CONFIG` instead of `NCCL_HAS_COMM_NONBLOCKING` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145893 Approved by: https://github.com/c-p-i-o	2025-01-30 01:05:21 +00:00
fduwjj	4f949f282d	[c10d][ez] Remove goto in PGNCCL and make linter happy for PGNCCL and NCCLUtils (#145855 ) While working on PGNCCL I found that the code triggers some lint warnings so this PR is to address them or add lint suppressor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145855 Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501	2025-01-28 21:19:49 +00:00
cyy	29f52e3972	[2/N] Remove unnecessary once flag usage (#145057 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/145057 Approved by: https://github.com/albanD	2025-01-23 09:48:46 +00:00
Dingming Wu	a881954b0c	[PTD] Dump rcclexp proxy trace in pytorch (#143678 ) Summary: Dump the active proxyOp status per rank and per communicator when WatchDog timeout or aborts. Added `#if defined(USE_ROCM) && defined(NCCL_COMM_DUMP)` guard in the print function, so only rcclexp users will see this dump in console. This is the changes of the PTD. Test Plan: Job with A2A hang due to receiver failing to post receive operations https://fburl.com/mlhub/95vg12r3 {F1971449692} Reviewed By: c-p-i-o Differential Revision: D67036093 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143678 Approved by: https://github.com/c-p-i-o	2025-01-04 10:20:47 +00:00
Ke Wen	cb354f8b47	[PGNCCL] Move NCCLComm impl to cpp (#142826 ) BE as titled. No behavior change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142826 Approved by: https://github.com/wconstab, https://github.com/c-p-i-o	2024-12-12 02:45:52 +00:00
Ke Wen	cca33d50b9	[PGNCCL] Use long/short wait for different non-blocking calls (#142291 ) In nonblocking mode, we always check if the NCCL communicator is ready between issuing commands to it. Today this is done by the `waitReady()` function. Unfortunately, the `waitReady()` function is burned with `C10D_NCCL_CHECK_TIMEOUT_SLEEP` which would sleep for an interval between two consecutive checks. While this is nice when waiting for comm init or finalize, it degrades performance of collective calls (which would almost certainly return success immediately.) This PR adds a `bool longInterval` argument to `waitReady` and let call site determine whether long wait is likely; if not, `waitReady` would use `sched_yield()` to more eagerly check for readiness. Thanks @eqy for reporting an issue that small collectives has perf impact in nonblocking mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142291 Approved by: https://github.com/eqy, https://github.com/fduwjj	2024-12-09 22:19:58 +00:00
Ke Wen	f24a9d0755	[PGNCCL] Fix behavior of destroy_process_group (#141510 ) Today `destroy_process_group()` is implemented via `ncclCommAbort`. When user call it in CPU, risk is that a healthy NCCL kernel gets preempted, which causes data corruption. Instead of aborting kernels, we should flush collectives in `destroy_process_group`, i.e. let them complete normally, before we tear down resources. This PR implements such "flushing" behavior using `ncclCommFinalize`, then reclaims resources via `ncclCommDestroy`. Expected behaviors: For a bad program, a hang is expected at `destroy_process_group()`. If the PG uses non-blocking communicators, such hang is recoverable, because we attaches a timeout to the flush behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141510 Approved by: https://github.com/wconstab	2024-12-04 20:30:47 +00:00
Ke Wen	ad39a2fc46	[1/N] Decouple Flight Recorder from NCCL utils (#141648 ) Part of the effort to make Flight Recorder device agnostic. Step 1: Move it out of NCCLUtils. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141648 Approved by: https://github.com/fduwjj	2024-11-27 18:29:42 +00:00
Ke Wen	915625307e	[PGNCCL] Record device index for GPU guarding during NCCLComm method calls (#141270 ) ### Motivation `ncclCommInitRank` needs GPU guard (documented in NCCL). `ncclCommAbort`, `ncclCommFinalize` and `ncclCommDestroy` may also need GPU guard (undocumented in NCCL); otherwise, extra CUDA context may be created (or worse, hang); both effects have been seen before in our tests. ### Solution This PR records a device index during `NCCLComm` object creation, so that we can add GPU guard in `NCCLComm`'s method calling which direct to the above NCCL APIs. ### Note This is not a bug fix. Just a safety improvement. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141270 Approved by: https://github.com/eqy ghstack dependencies: #141374	2024-11-25 21:31:21 +00:00
Chirag Pandya	48a55b8623	[c10d][fr] wait counter for dump function (#140823 ) Summary: Add a wait counter for the dump function. This is useful to see if we get stuck in the dump function and never return for a particular job. Test Plan: Tested locally I and see `pytorch.wait_counter.NCCLTraceBuffer__dump.busy_time_us.sum.60` in ODS. Differential Revision: D65823433 Pull Request resolved: https://github.com/pytorch/pytorch/pull/140823 Approved by: https://github.com/fduwjj	2024-11-16 02:22:08 +00:00
Ke Wen	5f2ed505eb	[PGNCCL] Watchdog prints call-time traceback when reporting timeout (#139659 ) ### Motivation Today, watchdog only reports that it found a collective timeout: ``` [rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out. ``` While this is nice, it is hard to associate the error with user's program or library stack. ### This PR This PR gives watchdog the ability to report the call-time stack of the collective, so that it would be easier to track the error back to the program's behavior. The call-time stack was recorded by Flight Recorder with minimal overhead (for details, please read this [doc](https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158) written by @zdevito ). In `ProcessGroupNCCL`, we are only tracking / reporting the python part so that it fits most PyTorch users. ### Demo [stack_demo.py](https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09). ``` TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py ``` `TORCH_NCCL_TRACE_BUFFER_SIZE` is for turning on the Flight Recorder. Output: ``` [rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation: #0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696 #1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83 #2 bar from /data/users/kw2501/sync_async/repro.py:15 #3 foo from /data/users/kw2501/sync_async/repro.py:24 #4 main from /data/users/kw2501/sync_async/repro.py:34 #5 <module> from /data/users/kw2501/sync_async/repro.py:40 [rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation: #0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630 #1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83 #2 baz from /data/users/kw2501/sync_async/repro.py:20 #3 foo from /data/users/kw2501/sync_async/repro.py:26 #4 main from /data/users/kw2501/sync_async/repro.py:34 #5 <module> from /data/users/kw2501/sync_async/repro.py:40 ``` From the log above, we can tell that `bar()` and `baz()` are the places where the two ranks divert. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139659 Approved by: https://github.com/wconstab, https://github.com/fduwjj	2024-11-05 19:07:17 +00:00
Ke Wen	ee11e2da1e	[PGNCCL] Use non-blocking mode by default in eager init (#138527 ) ### Why use non-blocking mode in eager init? For overlapping comm init and model init, etc. ![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd) ### Why can we set non-blocking as default? If the setting is dangling -- i.e. not passed in by user nor set via env -- `ProcessGroupNCCL` can have some preferred logic. And torch-level API semantics does not change whether the NCCL comm is blocking or non-blocking (handled within `ProcessGroupNCCL`). ### Why not make non-blocking default for lazy mode as well? PR https://github.com/pytorch/pytorch/pull/137544 tried it. Two reasons why that's not preferred today: 1. It is hard -- too big a blast. 2. There is no gain by doing lazy init in non-blocking mode, because the right next CPU call is a collective, and we will block there waiting for comm to be ready, so same effect as blocked init, no "opening" compared to eager mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527 Approved by: https://github.com/wconstab ghstack dependencies: #138860	2024-10-27 17:40:43 +00:00
PyTorch MergeBot	144d75d934	Revert "[PGNCCL] Use non-blocking mode by default in eager init (#138527 )" This reverts commit `07e30eae2a`. Reverted https://github.com/pytorch/pytorch/pull/138527 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/138527#issuecomment-2440070035))	2024-10-27 15:39:33 +00:00
Ke Wen	1152726feb	[PGNCCL] Use recursive mutex in NCCLComm (#138997 ) Fixes #138995: [PGNCCL][BUG] mutex acquired in recursive way may deadlock The fix: use `std::recursive_mutex` to replace `std::mutex`. Found and proposed by @dsjohns2. Thanks! Pull Request resolved: https://github.com/pytorch/pytorch/pull/138997 Approved by: https://github.com/dsjohns2	2024-10-27 08:58:47 +00:00
Ke Wen	07e30eae2a	[PGNCCL] Use non-blocking mode by default in eager init (#138527 ) ### Why use non-blocking mode in eager init? For overlapping comm init and model init, etc. ![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd) ### Why can we set non-blocking as default? If the setting is dangling -- i.e. not passed in by user nor set via env -- `ProcessGroupNCCL` can have some preferred logic. And torch-level API semantics does not change whether the NCCL comm is blocking or non-blocking (handled within `ProcessGroupNCCL`). ### Why not make non-blocking default for lazy mode as well? PR https://github.com/pytorch/pytorch/pull/137544 tried it. Two reasons why that's not preferred today: 1. It is hard -- too big a blast. 2. There is no gain by doing lazy init in non-blocking mode, because the right next CPU call is a collective, and we will block there waiting for comm to be ready, so same effect as blocked init, no "opening" compared to eager mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527 Approved by: https://github.com/wconstab ghstack dependencies: #138860	2024-10-26 06:53:15 +00:00
cyyever	ce631939f0	[Distributed] [18/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138692 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138692 Approved by: https://github.com/ezyang	2024-10-25 05:32:38 +00:00
PyTorch MergeBot	cdfe1bffd1	Revert "[PGNCCL] Use non-blocking mode by default in eager init (#138527 )" This reverts commit `8fbf866904`. Reverted https://github.com/pytorch/pytorch/pull/138527 on behalf of https://github.com/jeanschmidt due to Seems to have introduce regressions on main, pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 2, 3, linux.g4dn.12xlarge.nvidia.gpu) checking if revert will do ([comment](https://github.com/pytorch/pytorch/pull/138527#issuecomment-2432479338))	2024-10-23 14:49:49 +00:00
Ke Wen	8fbf866904	[PGNCCL] Use non-blocking mode by default in eager init (#138527 ) ### Why use non-blocking mode in eager init? For overlapping comm init and model init, etc. ![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd) ### Why can we set non-blocking as default? If the setting is dangling -- i.e. not passed in by user nor set via env -- `ProcessGroupNCCL` can have some preferred logic. And torch-level API semantics does not change whether the NCCL comm is blocking or non-blocking (handled within `ProcessGroupNCCL`). ### Why not make non-blocking default for lazy mode as well? PR https://github.com/pytorch/pytorch/pull/137544 tried it. Two reasons why that's not preferred today: 1. It is hard -- too big a blast. 2. There is no gain by doing lazy init in non-blocking mode, because the right next CPU call is a collective, and we will block there waiting for comm to be ready, so same effect as blocked init, no "opening" compared to eager mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527 Approved by: https://github.com/wconstab ghstack dependencies: #137855, #138488, #138374, #138384	2024-10-23 08:51:54 +00:00
Ke Wen	f2ebf6d94a	[PGNCCL] Ensure comm is ready before all accesses (#138384 ) Previously we only wait for comm to become ready after its initialization. That's not enough. There are other NCCL APIs that can cause the comm to be InProgress, e.g. P2P calls, commSplit, commFinalize, etc. Therefore, we just ensure comm is ready every "next time" we need to access ncclComm. The place to add such gate keeper is `getNcclComm`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138384 Approved by: https://github.com/shuqiangzhang, https://github.com/fduwjj ghstack dependencies: #137855, #138488, #138374	2024-10-23 01:36:58 +00:00
Ke Wen	6b29d40e9b	[PGNCCL] Add default value for `nccl_nonblocking_timeout` (#138374 ) - Added default value for `nccl_nonblocking_timeout` (30 mins, previous: -1). - Reuse C10D_CHECK_TIMEOUT in other CHECK macros Pull Request resolved: https://github.com/pytorch/pytorch/pull/138374 Approved by: https://github.com/eqy ghstack dependencies: #137855, #138488	2024-10-22 05:06:18 +00:00
Richard Barnes	fddabc6e0b	C10_UNUSED to [[maybe_unused]] (#6357 ) (#138364 ) Summary: Pull Request resolved: https://github.com/pytorch/executorch/pull/6357 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138364 Approved by: https://github.com/Skylion007, https://github.com/eqy	2024-10-19 13:17:43 +00:00
Ke Wen	fecd370ea1	[c10d] Fix color value for comm split being negative (#137855 ) Fixes https://github.com/pytorch/pytorch/issues/137856. ### Issue 1 Today under `ProcessGroupNCCL::Options`, color is declared as: ``` int64_t split_color{0}; ``` When passing this variable to `ncclCommSplit` which accepts `int`, the value may overflow and become negative, as in #137856. But NCCL API only accepts non-negative colors (or `NCCL_SPLIT_NOCOLOR`). But that's not all. ### Issue 2 `split_color` is pybind'ed to python frontend. If we just change from `int64_t` to `int` in C++, pybind will complain: ``` [rank0]: TypeError: (): incompatible function arguments. The following argument types are supported: [rank0]: 1. (self: torch._C._distributed_c10d.ProcessGroupNCCL.Options, arg0: int) -> None ``` This is because python `int` represents a wider range than C++ `int`. So we cannot pass hash values -- which are potentially big ints -- from python to C++. The PR modulo the hash value with `c_int`'s max value. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137855 Approved by: https://github.com/wconstab	2024-10-19 03:17:19 +00:00
Ke Wen	35fc24fbed	[PGNCCL] Fix bugs in non-blocking mode (#137741 ) ### Fix 1: Throw async error during init wait Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`. ### Fix 2: Add wait after comm split ``` // After calling ncclCommSplit in non-blocking mode, we should wait for the // source communicator to be out of ncclInProgress state. // Reason 1: // it's unsafe to call new operations on the parent comm while it's in // ncclInProgress state. // Reason 2: // as of NCCL 2.23, the ptr value of child comm will not be filled until the // state of parent comm is ncclSuccess. This may change in the future. See: // https://github.com/NVIDIA/nccl/issues/1472 ``` This wait does not mean the child comm is ready for use, neither does it block till that point. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137741 Approved by: https://github.com/shuqiangzhang	2024-10-15 20:35:39 +00:00
cyy	94e12f97dc	[Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404 ) Follows #137072 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137404 Approved by: https://github.com/Skylion007	2024-10-10 18:05:34 +00:00
Ke Wen	4f45c76806	[PGNCCL] Limit access to ncclComm_ (#137573 ) When non-blocking mode is enabled, we need to make sure `ncclComm_` is ready before calling NCCL APIs on it. `NCCLComm::getNcclComm` help us do that (thanks to a wait function inside), thus is preferred than directly using `ncclComm_`. To prevent `ncclComm_` from being directly used outside, e.g. in `ProcessGroupNCCL`, we also move it as a private member of `NCCLComm` class -- the external-facing wrapper. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137573 Approved by: https://github.com/Skylion007, https://github.com/shuqiangzhang, https://github.com/c-p-i-o ghstack dependencies: #137572	2024-10-10 00:34:05 +00:00
eqy	e3aa5e2f64	[NCCL] Don't override `waitUntilInitialized`'s setting of `comm->initialized_` (#136155 ) #133630 sets `initialized_` to `true` which causes previous wait codepaths to skip necessary waits, see also #https://github.com/pytorch/pytorch/issues/136151 CC @shuqiangzhang @wconstab Pull Request resolved: https://github.com/pytorch/pytorch/pull/136155 Approved by: https://github.com/fduwjj, https://github.com/kwen2501, https://github.com/c-p-i-o, https://github.com/shuqiangzhang	2024-09-17 18:50:12 +00:00
PyTorch MergeBot	c044deb9ce	Revert "c10d/logging: add C10D_LOCK_GUARD (#134131 )" This reverts commit `f33bcbe5fd`. Reverted https://github.com/pytorch/pytorch/pull/134131 on behalf of https://github.com/kit1980 due to See D61985186 ([comment](https://github.com/pytorch/pytorch/pull/134131#issuecomment-2327556381))	2024-09-03 22:35:14 +00:00
fduwjj	4c16797e71	[c10d FR analyzer] Output a meaningful debug report for users (#134528 ) - This PR generates a more useful output log for users: P1552399180. - It also fixes the logic when we check the all-gather size mismatch. - Add dtype check for collective input/output - We store more context information for error match_state so that we can report them in the file. - Disable the size match for alltoall because we don't log the size for all inputs/outputs. - Correct some types for func args specification. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134528 Approved by: https://github.com/c-p-i-o	2024-08-28 21:22:47 +00:00
Tristan Rice	f33bcbe5fd	c10d/logging: add C10D_LOCK_GUARD (#134131 ) This adds logs if we can't acquire locks in NCCLUtils and ProcessGroupNCCL for 30s. This is motivated by some deadlocks were seeing and it's unclear if it's in NCCL or on the PyTorch side of things. This required replacing most `std::mutex` with `std::timed_mutex` and `std::condition_variable_any` as appropriate. Test plan: existing CI for regressions will add unit tests on `C10D_LOCK_GUARD` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134131 Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj	2024-08-28 01:40:42 +00:00

1 2

90 Commits