pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	afa1eda901	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit `ef6296e7f2`. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/izaitsevfb due to reverted internally, see D71292427 ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2731114626))	2025-03-17 22:43:15 +00:00
fduwjj	aed0b7a742	[c10d] Add param recording for uniqueID broadcasting and allgather (#149166 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149166 Approved by: https://github.com/kwen2501	2025-03-14 03:51:30 +00:00
Ke Wen	ef6296e7f2	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-11 18:36:12 +00:00
PyTorch MergeBot	a95eb0c0a7	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit `2149f6c684`. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/ZainRizvi due to Breaking internally, see D70873275. Discussed reverting this with Ke. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2712001270))	2025-03-10 22:38:40 +00:00
Ke Wen	2149f6c684	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-09 07:32:23 +00:00
PyTorch MergeBot	9cb25f0ea2	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit `17dbeb11db`. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/janeyx99 due to PR break backward compat test ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2708641172))	2025-03-09 03:01:55 +00:00
Ke Wen	17dbeb11db	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-08 20:00:12 +00:00
Tristan Rice	7ffadff286	c10d/ProcessGroup: cleanup abort and shutdown (#148798 ) This adds `abort` and `shutdown` to `Backend` and `ProcessGroup` objects. This simplifies the logic in `distributed_c10d.py` by having a default noop implementation for all PGs. This will be useful for torchft and upcoming versions of NCCL which will handle abort correctly. Currently `torchft` would have to call internal methods `_abort` on the PGNCCL object directly but with this change we can now just call `.abort()` and have it work for any PG implementation. Test plan: ``` pytest distributed/test_backends.py distributed/test_c10d_common.py distributed/test_c10d_pypg.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148798 Approved by: https://github.com/kwen2501	2025-03-08 18:33:18 +00:00
taozhiwei	16d07988fc	add supports_coalescing property in c10d::Backend to determine whether backend supports coalescing (#135338 ) 1. My company is using privateuseone to connect new hardware device and requires the use of `batch_isend_irecv` function. However, `batch_isend_irecv` is currently only open to CUDA, so I add `supports_coalescing` property in `c10d::Backend` to determine whether backend supports coalescing. 2. If `pg._has_hooks` return True, We don't need to determine if the current device is CUDA. So privateuseone can also support `pg._wait_for_pending_works` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338 Approved by: https://github.com/kwen2501, https://github.com/albanD	2025-03-04 12:37:06 +00:00
Ke Wen	f211818bc0	[c10d] Restrict use condition of NCCL mem pool (#147764 ) Add check to see if CUDA driver support multicast, as does in Symmetric Memory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147764 Approved by: https://github.com/syed-ahmed, https://github.com/yifuwang	2025-02-26 03:40:00 +00:00
Ke Wen	30cbf13544	[PGNCCL] Associate tensor allocation support with NCCL version (#146842 ) This is a forward fix to #146589. For NCCL version lower than 2.19, previous PR would see `RuntimeError: NCCL mem allocator is not supported in this NCCL version`. This PR gates the support by checking link-time NCCL version via `ncclGetVersion`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146842 Approved by: https://github.com/XilunWu, https://github.com/wconstab, https://github.com/fduwjj ghstack dependencies: #146589	2025-02-11 02:52:52 +00:00
Ke Wen	effc545274	[DDP] Use NCCL allocated memory for gradient bucket (#146589 ) So that NVLink SHARP comes with zero-copy on H100+ platforms, for DDP applications. Less SM usage, less memory contention between NCCL kernel and compute kernels. Added env `DDP_DISABLE_COMM_MEM` as a back-out option: ``` An environment variable to disable comm-optimized memory pool. Default is 0, which means comm-optimized memory pool is enabled. Users can set it to 1 in case of seeing regression or OOM (because this comm MemPool may not share space with regular compute MemPool). ``` Differential Revision: [D69297766](https://our.internmc.facebook.com/intern/diff/D69297766) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146589 Approved by: https://github.com/syed-ahmed, https://github.com/c-p-i-o, https://github.com/fduwjj	2025-02-10 05:23:11 +00:00
fduwjj	eb029fba13	[c10d][NCCL] Implement ncclCommInitRankScalable (merging #136789 ) (#144794 ) Try to land https://github.com/pytorch/pytorch/pull/136789/files on our end and fix any remaining issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144794 Approved by: https://github.com/kwen2501, https://github.com/eqy, https://github.com/atalman	2025-01-31 22:39:56 +00:00
Ke Wen	51ee9b154e	[c10d] Add NCCL memory allocator (#145675 ) This PR implements a small UI improvement over #133603. It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it. UI: ``` pool = torch.cuda.MemPool(backend.mem_allocator) with torch.cuda.use_mem_pool(pool): tensor = torch.arange(1024 * 1024 * 2, device=device) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675 Approved by: https://github.com/syed-ahmed, https://github.com/wconstab	2025-01-30 18:19:00 +00:00
PyTorch MergeBot	5fa28bbe40	Revert "[c10d] Add NCCL memory allocator (#145675 )" This reverts commit `18a7a04c4a`. Reverted https://github.com/pytorch/pytorch/pull/145675 on behalf of https://github.com/ZainRizvi due to Sorry but this still fails internally. See D68866823 for details ([comment](https://github.com/pytorch/pytorch/pull/145675#issuecomment-2624900562))	2025-01-30 16:01:52 +00:00
Ke Wen	25ca05eebf	[PGNCCL] Correct some ifdef's (#145893 ) `create` function supporting `ncclConfig_t` should be wrapped inside `NCCL_HAS_CONFIG` instead of `NCCL_HAS_COMM_NONBLOCKING` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145893 Approved by: https://github.com/c-p-i-o	2025-01-30 01:05:21 +00:00
Ke Wen	18a7a04c4a	[c10d] Add NCCL memory allocator (#145675 ) This PR implements a small UI improvement over #133603. It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it. UI: ``` pool = torch.cuda.MemPool(backend.mem_allocator) with torch.cuda.use_mem_pool(pool): tensor = torch.arange(1024 * 1024 * 2, device=device) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675 Approved by: https://github.com/syed-ahmed, https://github.com/wconstab	2025-01-29 23:20:22 +00:00
PyTorch MergeBot	6371c25b91	Revert "[c10d] Add NCCL memory allocator (#145675 )" This reverts commit `9fd6722fc9`. Reverted https://github.com/pytorch/pytorch/pull/145675 on behalf of https://github.com/ZainRizvi due to This fails to build internally, can you please take a look at D68831004 for more details? ([comment](https://github.com/pytorch/pytorch/pull/145675#issuecomment-2622515425))	2025-01-29 18:30:30 +00:00
Ke Wen	9fd6722fc9	[c10d] Add NCCL memory allocator (#145675 ) This PR implements a small UI improvement over #133603. It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it. UI: ``` pool = torch.cuda.MemPool(backend.mem_allocator) with torch.cuda.use_mem_pool(pool): tensor = torch.arange(1024 * 1024 * 2, device=device) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675 Approved by: https://github.com/syed-ahmed, https://github.com/wconstab	2025-01-29 02:48:56 +00:00
Shuqiang Zhang	c0861d092c	[PGNCCL] Add an API to get the status/error code at the PG level (#144498 ) Summary: This PR is basically a replacement of https://github.com/pytorch/pytorch/pull/140087, which caused some perf drop due to frequent TCPStore check in watchdog thread. The fix is to move the tcpstore check in monitoring thread If unhealthy, the user should be able to get the type of errors, e.g., timeout,nccl error or remote error. This API is applied to PG level, compared to the work.get_future_result() API which is applied to Work Level. Error detection at PG level is much more convenient for users to handle the PG failure as a whole, e.g, restarting the PG. Error handling at the work level is still useful for users to attach work specific context and debug the RC of the specific failing work/collective Note it is critical for all ranks in the PG to be notified about an error as soon as it occurs, so we introduce an errorType of REMOTE_ERROR, which is 'broadcasted' from a src rank (which detects a local error) to all other ranks in the PG, the broadcast is done through TCPStore currently Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/144498 Approved by: https://github.com/kwen2501	2025-01-24 16:47:32 +00:00
cyy	6a35d9aaa4	Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806 Approved by: https://github.com/kwen2501	2025-01-24 12:22:13 +00:00
PyTorch MergeBot	6a2b4db0a1	Revert "Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806 )" This reverts commit `42f4fda2eb`. Reverted https://github.com/pytorch/pytorch/pull/143806 on behalf of https://github.com/huydhn due to Lots of builds fail after this land, so maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/143806#issuecomment-2611275836))	2025-01-24 00:17:34 +00:00
cyy	42f4fda2eb	Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806 Approved by: https://github.com/kwen2501	2025-01-23 22:47:18 +00:00
fduwjj	ae7df51232	[c10d] Fix CudaEventCache for dangling references (#144496 ) Reported in https://github.com/pytorch/pytorch/issues/143470, we have a dangling references in `CudaEventCache`. So we want to fix it. 1. We add a unit test to repro the issue mentioned in the issue. 2. Instead of converting variables to shared pointers as suggested in the issue, we then make the cache itself a shared pointer. So if the thread creates the cache dies before all events get recycled, the cache is still there until the last CudaEvent get deleted. (thanks for the suggestion from @kwen2501 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144496 Approved by: https://github.com/kwen2501	2025-01-15 05:11:48 +00:00
Chirag Pandya	1cd70e7e23	[fr][c10d] log trace capture enabled or not in flight recorder (#143865 ) Summary: Refactor logging for flight recorder so we can log if the capture was with or without stack trace capture enabled. We introduce a new column ('trace_enabled') in the logger. Test Plan: Tested on local job and noted that correct output was produced. Internal link: https://fburl.com/scuba/c10d_flight_recorder/ulhqnmhg Pull Request resolved: https://github.com/pytorch/pytorch/pull/143865 Approved by: https://github.com/fduwjj	2024-12-27 03:07:55 +00:00
mori360	4a16a60052	[C10D] Add better profiling title for NCCL barrier, nccl:all_reduce to nccl:all_reduce_barrier (#140785 ) Fixes [issue](https://github.com/pytorch/pytorch/issues/140257) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140785 Approved by: https://github.com/wconstab	2024-12-10 23:08:15 +00:00
Chirag Pandya	3b531f18c7	[BE] Improve Flight Recorder efficacy (#142178 ) Summary: This is an attempt to improve the flight recorder efficacy. We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`. There are two changes in here. 1. We attempt to write out FR buffer with stack traces. If this fails, we attempt to capture FR buffer again - but this time without stack traces. The assumption here is that FR could be locking up when unwinding stack. Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace. 2. Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing. TODO: - there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not. This info might be useful for the analyzer - so I'll add this in a follow on diff. Test Plan: Unit tests. Differential Revision: D66843194 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142178 Approved by: https://github.com/kwen2501	2024-12-07 19:32:28 +00:00
Ke Wen	ed092e2161	[2/N] Rename NCCLTraceBuffer to FlightRecorder (#141712 ) Just name change. No behavior change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141712 Approved by: https://github.com/wconstab, https://github.com/fduwjj ghstack dependencies: #141648	2024-11-29 21:15:31 +00:00
fduwjj	5b4c864672	[c10d] Enable CudaEventCache by default and add multi device support (#140975 ) We added `CudaEventCache` in https://github.com/pytorch/pytorch/pull/133727 and this is a feature which tries to reuse CudaEvent so that we don't call destroy of CudaEvent which causes hang in the past. We had a bunch of tests and testing on TorchTitan and internal workload already. So far no errors or crash are found at the moment so we decide to roll out to all OSS users. For internal workload, this PR would not affect it because of some internal gating. Also we observed some multi-device use cases in OSS, so that we want to bring back multi-device support originally proposed in https://github.com/pytorch/pytorch/pull/122732/files. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140975 Approved by: https://github.com/eqy, https://github.com/kwen2501	2024-11-26 18:42:45 +00:00
Syed Tousif Ahmed	e0482fdf95	Implements user buffer registration using MemPool (#133603 ) This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocation special memory using MemPool and registering it with the nccl buffer registration APIs. Part of https://github.com/pytorch/pytorch/issues/124807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603 Approved by: https://github.com/kwen2501, https://github.com/eqy	2024-11-21 01:40:11 +00:00
PyTorch MergeBot	9fac5a16fd	Revert "[PGNCCL] Add an API to get the status/error code of each PG (#140087 )" This reverts commit `80aa19a622`. Reverted https://github.com/pytorch/pytorch/pull/140087 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/140087#issuecomment-2486912231))	2024-11-19 22:53:46 +00:00
PyTorch MergeBot	496c1e78c5	Revert "Implements user buffer registration using MemPool (#133603 )" This reverts commit `25d9be37be`. Reverted https://github.com/pytorch/pytorch/pull/133603 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/133603#issuecomment-2486897708))	2024-11-19 22:42:26 +00:00
Syed Tousif Ahmed	25d9be37be	Implements user buffer registration using MemPool (#133603 ) This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocation special memory using MemPool and registering it with the nccl buffer registration APIs. Part of https://github.com/pytorch/pytorch/issues/124807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603 Approved by: https://github.com/kwen2501, https://github.com/eqy	2024-11-15 12:47:49 +00:00
Shuqiang Zhang	80aa19a622	[PGNCCL] Add an API to get the status/error code of each PG (#140087 ) Summary: If unhealthy, the user should be able to get the type of errors, e.g., timeout,nccl error or remote error. This API is applied to PG level, compared to the work.get_future_result() API which is applied to Work Level. Error detection at PG level is much more convenient for users to handle the PG failure as a whole, e.g, restarting the PG. Error handling at the work level is still useful for users to attach work specific context and debug the RC of the specific failing work/collective Note it is critical for all ranks in the PG to be notified about an error as soon as it occurs, so we introduce an errorType of REMOTE_ERROR, which is 'broadcasted' from a src rank (which detects a local error) to all other ranks in the PG, the broadcast is done through TCPStore currently Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/140087 Approved by: https://github.com/kwen2501	2024-11-15 04:11:00 +00:00
PyTorch MergeBot	1400fedf76	Revert "add supports_coalescing property in c10d::Backend to determine whether backend supports coalescing (#135338 )" This reverts commit `e5574445b0`. Reverted https://github.com/pytorch/pytorch/pull/135338 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. Please see D65663382 for more details ([comment](https://github.com/pytorch/pytorch/pull/135338#issuecomment-2465911854))	2024-11-08 23:52:49 +00:00
taozhiwei	e5574445b0	add supports_coalescing property in c10d::Backend to determine whether backend supports coalescing (#135338 ) 1. My company is using privateuseone to connect new hardware device and requires the use of `batch_isend_irecv` function. However, `batch_isend_irecv` is currently only open to CUDA, so I add `supports_coalescing` property in `c10d::Backend` to determine whether backend supports coalescing. 2. If `pg._has_hooks` return True, We don't need to determine if the current device is CUDA. So privateuseone can also support `pg._wait_for_pending_works` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338 Approved by: https://github.com/kwen2501	2024-11-08 11:08:45 +00:00
Ke Wen	e474f0de82	[PGNCCL] Slimming watchdog loop (#139834 ) - Refactored traceback code into `work.printTraceback()`. cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @shuqiangzhang - Refactored desync debug code into `class DesyncDebugger`. - Moved occurrences of `futureWorkResult_->markCompleted` into `checkAndSetException` and `checkTimeout`, respectively. cc @shuqiangzhang - Modularized dump signal broadcast code into `ProcessGroupNCCL::broadcastDumpSignal`. cc @fduwjj @c-p-i-o Pull Request resolved: https://github.com/pytorch/pytorch/pull/139834 Approved by: https://github.com/shuqiangzhang	2024-11-07 17:22:44 +00:00
Shuqiang Zhang	df0c1eceb9	[pgnccl][simple] clean up unused members of PGNCCL (#139436 ) Summary: Found those unused members when prototying something else. Better remove unused members Test Plan: CI Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/139436 Approved by: https://github.com/Skylion007	2024-11-01 16:25:04 +00:00
Shuqiang Zhang	5075046db2	[c10d] separate comm init from getNCClComm (#139362 ) Summary: This PR is a non op. But it clearly separate the init logic from the getNCCLCOMM. getNCClComm is now a purely a 'read' only function Test Plan: existing CI Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/139362 Approved by: https://github.com/wconstab	2024-10-31 16:58:20 +00:00
cyy	f9ae3fac8c	[Distributed] [19/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138903 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138903 Approved by: https://github.com/ezyang	2024-10-28 05:29:25 +00:00
Ke Wen	ee11e2da1e	[PGNCCL] Use non-blocking mode by default in eager init (#138527 ) ### Why use non-blocking mode in eager init? For overlapping comm init and model init, etc. ![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd) ### Why can we set non-blocking as default? If the setting is dangling -- i.e. not passed in by user nor set via env -- `ProcessGroupNCCL` can have some preferred logic. And torch-level API semantics does not change whether the NCCL comm is blocking or non-blocking (handled within `ProcessGroupNCCL`). ### Why not make non-blocking default for lazy mode as well? PR https://github.com/pytorch/pytorch/pull/137544 tried it. Two reasons why that's not preferred today: 1. It is hard -- too big a blast. 2. There is no gain by doing lazy init in non-blocking mode, because the right next CPU call is a collective, and we will block there waiting for comm to be ready, so same effect as blocked init, no "opening" compared to eager mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527 Approved by: https://github.com/wconstab ghstack dependencies: #138860	2024-10-27 17:40:43 +00:00
PyTorch MergeBot	144d75d934	Revert "[PGNCCL] Use non-blocking mode by default in eager init (#138527 )" This reverts commit `07e30eae2a`. Reverted https://github.com/pytorch/pytorch/pull/138527 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/138527#issuecomment-2440070035))	2024-10-27 15:39:33 +00:00
Ke Wen	07e30eae2a	[PGNCCL] Use non-blocking mode by default in eager init (#138527 ) ### Why use non-blocking mode in eager init? For overlapping comm init and model init, etc. ![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd) ### Why can we set non-blocking as default? If the setting is dangling -- i.e. not passed in by user nor set via env -- `ProcessGroupNCCL` can have some preferred logic. And torch-level API semantics does not change whether the NCCL comm is blocking or non-blocking (handled within `ProcessGroupNCCL`). ### Why not make non-blocking default for lazy mode as well? PR https://github.com/pytorch/pytorch/pull/137544 tried it. Two reasons why that's not preferred today: 1. It is hard -- too big a blast. 2. There is no gain by doing lazy init in non-blocking mode, because the right next CPU call is a collective, and we will block there waiting for comm to be ready, so same effect as blocked init, no "opening" compared to eager mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527 Approved by: https://github.com/wconstab ghstack dependencies: #138860	2024-10-26 06:53:15 +00:00
cyyever	ce631939f0	[Distributed] [18/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138692 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138692 Approved by: https://github.com/ezyang	2024-10-25 05:32:38 +00:00
Shuqiang Zhang	4c91481656	[c10d] allow sub group to be eagerly inited even if default one is not (#138665 ) Summary: Currently, eager mode is applied either to all PGs or NONE of them. There are cases where we don't want to initialize the comms for default PG, but we still want to initialize the comms for sub PG. Now with a device_id passed to new group, we can achieve this case Test Plan: newly added UT Tags: Resolves https://github.com/pytorch/pytorch/issues/137018 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138665 Approved by: https://github.com/kwen2501 ghstack dependencies: #138781	2024-10-24 23:51:28 +00:00
Chirag Pandya	425ce2a7ee	[c10d] use a promise to delay watchdog shutdown (#138828 ) Summary: We always need to give the heartbeat monitor thread time to write out flight recorder dumps. Otherwise, the watchdog thread kills the heartbeat monitor thread too fast before it has time to write out the Flight Recorder logs. This change: 1. Removes the "sleep after exception" JK. We don't need to sleep for 8 minutes. 2. Use a promise between watchdog thread and heartbeat monitor thread to delay, at most, one minute to give Flight Recorder time to write out it's log on timeout. Test Plan: Tested on my local job and flight recorder successfully executed for the job. https://fburl.com/mlhub/38fj5yne The watchdog thread gives heartbeat thread time to write out the logs. In the logs we see: ``` [trainer4]:I1023 17:39:29.755507 12592 ProcessGroupNCCL.cpp:1950] [PG ID 0 PG GUID 0(precheck) Rank 12] slept for 1647ms giving time for flight recorder dumps to finish. ``` Differential Revision: D64857928 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138828 Approved by: https://github.com/d4l3k, https://github.com/fduwjj	2024-10-24 23:42:29 +00:00
PyTorch MergeBot	cdfe1bffd1	Revert "[PGNCCL] Use non-blocking mode by default in eager init (#138527 )" This reverts commit `8fbf866904`. Reverted https://github.com/pytorch/pytorch/pull/138527 on behalf of https://github.com/jeanschmidt due to Seems to have introduce regressions on main, pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 2, 3, linux.g4dn.12xlarge.nvidia.gpu) checking if revert will do ([comment](https://github.com/pytorch/pytorch/pull/138527#issuecomment-2432479338))	2024-10-23 14:49:49 +00:00
Ke Wen	8fbf866904	[PGNCCL] Use non-blocking mode by default in eager init (#138527 ) ### Why use non-blocking mode in eager init? For overlapping comm init and model init, etc. ![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd) ### Why can we set non-blocking as default? If the setting is dangling -- i.e. not passed in by user nor set via env -- `ProcessGroupNCCL` can have some preferred logic. And torch-level API semantics does not change whether the NCCL comm is blocking or non-blocking (handled within `ProcessGroupNCCL`). ### Why not make non-blocking default for lazy mode as well? PR https://github.com/pytorch/pytorch/pull/137544 tried it. Two reasons why that's not preferred today: 1. It is hard -- too big a blast. 2. There is no gain by doing lazy init in non-blocking mode, because the right next CPU call is a collective, and we will block there waiting for comm to be ready, so same effect as blocked init, no "opening" compared to eager mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527 Approved by: https://github.com/wconstab ghstack dependencies: #137855, #138488, #138374, #138384	2024-10-23 08:51:54 +00:00
PyTorch MergeBot	fc9093c3d2	Revert "Remove C10_DEPRECATED (#138406 )" This reverts commit `70ec86d754`. Reverted https://github.com/pytorch/pytorch/pull/138406 on behalf of https://github.com/wdvr due to failing internal tests - see D64714374 ([comment](https://github.com/pytorch/pytorch/pull/138406#issuecomment-2429912896))	2024-10-22 18:00:41 +00:00
Ke Wen	6b29d40e9b	[PGNCCL] Add default value for `nccl_nonblocking_timeout` (#138374 ) - Added default value for `nccl_nonblocking_timeout` (30 mins, previous: -1). - Reuse C10D_CHECK_TIMEOUT in other CHECK macros Pull Request resolved: https://github.com/pytorch/pytorch/pull/138374 Approved by: https://github.com/eqy ghstack dependencies: #137855, #138488	2024-10-22 05:06:18 +00:00

1 2 3 4 5

247 Commits