Summary:
For the motivation behind the overall stack of diffs, see the D56218385 summary.
This particular diff makes cpp_dumper take a custom printer function that logs call stacks one group at a time, so it no longer runs into the 30K-character limit of `LOG(INFO)`.
Test Plan:
```
[romanmal@46150.od /data/sandcastle/boxes/fbsource/fbcode (520a7b7b5)]$ buck2 test //caffe2/torch/csrc/distributed/c10d/...
File changed: fbcode//common/base/ThreadStackTrace.cpp
File changed: fbsource//xplat/caffe2/torch/csrc/distributed/c10d/fb/TraceUtils.cpp
File changed: fbcode//caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp
4 additional file change events
Buck UI: https://www.internalfb.com/buck2/d8ceae86-7d6f-4779-ad0c-8e37eddcff98
Network: Up: 0B Down: 0B
Jobs completed: 2. Time elapsed: 1.5s.
Tests finished: Pass 0. Fail 0. Fatal 0. Skip 0. Build failure 0
NO TESTS RAN
[romanmal@46150.od /data/sandcastle/boxes/fbsource/fbcode (520a7b7b5)]$
```
Tested to print the stack trace:
P1220109730
Differential Revision: D56218360
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124628
Approved by: https://github.com/wconstab
Summary:
```
ncclGroupStart()
ncclCommInit(..)
ncclGroupEnd()
```
The pattern above is only needed when a *single thread* manages multiple GPUs.
In our case we always have one process managing one GPU, so we don't need the group operation.
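For context, a minimal sketch of the one-process-per-GPU setup assumed here (standard torchrun-style usage, not code from this diff):
```python
import os

import torch
import torch.distributed as dist

# each process binds to exactly one GPU and creates its own communicator,
# so there is no multi-GPU loop that would need ncclGroupStart/End bracketing
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")
```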
Test Plan: CI
Differential Revision: D56274975
Co-authored-by: Cen Zhao <cenzhao@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124416
Approved by: https://github.com/shuqiangzhang
This adds a templated version of the ring attention forward function and tests it with memory efficient attention. This doesn't add support for memory efficient attention in DTensor; that will be added in a follow-up PR.
The templating is also a proof of concept for how to support other attention ops, such as jagged/nested tensors, as well as how to implement striped attention in a scalable way.
Misc changes:
* Fixes all_to_all_single autograd implementation with CUDA + adds NCCL test
* Adds compile support to the ring attention implementations (required some tweaks to process groups)
Test plan:
```
pytest test/distributed/_tensor/test_attention.py
pytest test/distributed/test_functional_api.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124215
Approved by: https://github.com/wanchaol
Summary:
This env var was introduced to safely roll out the behavior change in destroy
process group (e.g., calling ncclCommsAbort). Now that this behavior change
has been rolled out, we no longer need the env var, and we should remove it
to keep our code cleaner.
Test Plan:
Modified/existing unit tests pass
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124334
Approved by: https://github.com/wconstab
Summary:
Pass the process group name and description to the NCCL communicator so that pg information can be accessed in the NCCL layer.
The information is passed as the commDesc string (i.e., "<pg_desc>:<pg_name>").
This only takes effect when NCCL_COMM_DESCRIPTION is defined.
Differential Revision: D55703310
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124149
Approved by: https://github.com/shuqiangzhang
Summary:
As part of the work to unify process group identifiers, log <group_name, group_desc> instead of the pg uid in the profiler.
- group_name remains the unique identifier, e.g. "0", "1"
- group_desc is the user-specified name, e.g. "fsdp"
Reviewed By: aaronenyeshi, kwen2501
Differential Revision: D55610682
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124035
Approved by: https://github.com/aaronenyeshi
Summary:
We need a way to let users set a custom description for a process group, e.g., FSDP, PP.
Here are several use cases for a user-specified group_desc:
- Logging: we can easily match a log line to the collective/pg and understand what it is used for.
- PyTorch traces (e.g., Kineto, Execution Trace): trace analyses and benchmarks can easily differentiate PG purposes like FSDP and PP.
- Lower-layer collective (e.g., NCCL) debugging: we can expose the PG desc to the NCCL communicator so that NCCL-layer operations can be easily correlated to a PG.
Solution: Add a group_desc field to c10d
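For illustration, a minimal sketch of setting the description, assuming the group_desc keyword is exposed through dist.new_group as the Python surface of this change:
```python
import torch.distributed as dist

# assumes an already-initialized default process group;
# group_desc is the user-specified description added by this change
fsdp_pg = dist.new_group(ranks=list(range(dist.get_world_size())),
                         group_desc="fsdp")
```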
Differential Revision: D55781850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123472
Approved by: https://github.com/kwen2501
This adds the differentiable collective -- all_to_all_single_grad. This is the initial proof-of-concept PR; I will be adding the remaining collectives in follow-up PRs.
This adds a new function called `all_to_all_single_autograd`, which is the autograd variant of `all_to_all_single`. For backwards compatibility and initial testing, we wanted to keep the autograd variant separate to avoid regressions.
This uses `autograd::Function` to register an autograd op that calls the original `_c10d_functional::all_to_all_single` via the dispatcher. This works with compile and inductor, as opposed to the previous Python implementation, which had issues. Since this reuses the existing `_c10d_functional` ops, we don't need to register any meta functions or lowerings.
To avoid CUDA stream issues, the backward method explicitly calls `wait_tensor` to ensure the wait runs on the same stream as the async operation. This hurts performance but can potentially be alleviated using `compile`.
Related work: https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/comm_ops.py
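A minimal usage sketch, assuming the op is exposed as `torch.distributed._functional_collectives.all_to_all_single_autograd` (exercised by the test plan below):
```python
import torch
import torch.distributed as dist
from torch.distributed._functional_collectives import all_to_all_single_autograd

# assumes torchrun with an initialized NCCL process group, one GPU per rank
world = dist.get_world_size()
inp = torch.ones(world, device="cuda", requires_grad=True)
splits = [1] * world  # even split across ranks
out = all_to_all_single_autograd(inp, splits, splits, dist.group.WORLD)
out.sum().backward()  # backward issues the reverse all_to_all and wait_tensor
print(inp.grad)
```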
Test plan:
```
pytest test/distributed/test_functional_api.py -k test_all_to_all_single_compile
pytest test/distributed/test_functional_api.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123599
Approved by: https://github.com/yifuwang
Summary:
We separated the flight recorder (FR) dump logic from the desync debug logic,
so we no longer set collectiveDebugInfoMode_ to true when we just need an FR
dump. That's why the monitor thread did not sleep and tried to kill the
process without waiting for the dump.
The fix is simple: we should sleep whenever shouldDump_ is true.
Test Plan:
Existing unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123788
Approved by: https://github.com/wconstab
There are many existing ProcessGroupNCCL features controlled by env vars. This PR adds TORCH_NCCL_HIGH_PRIORITY to force the use of high-priority CUDA or HIP streams for the NCCL or RCCL kernels, respectively.
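A usage sketch (assuming the usual truthy env-var convention; the flag just needs to be set before the process group creates its streams):
```python
import os

# opt in to high-priority CUDA/HIP streams for NCCL/RCCL kernels;
# set before ProcessGroupNCCL is constructed
os.environ["TORCH_NCCL_HIGH_PRIORITY"] = "1"

import torch.distributed as dist

dist.init_process_group(backend="nccl")
```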
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122830
Approved by: https://github.com/kwen2501
Summary:
Pass the Python c10d group_name to the C++ ProcessGroupNCCL so that the pg name is consistent across layers.
Also record pg_name in flight recorder entries.
Differential Revision: D55597200
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123117
Approved by: https://github.com/wconstab
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.
This:
```
try {
...
} catch (exception& e) {
// no use of e
}
```
should instead be written as
```
} catch (exception&) {
```
If the code compiles, this is safe to land.
Test Plan: Sandcastle
Reviewed By: palmje
Differential Revision: D55548497
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123056
Approved by: https://github.com/Skylion007
Fixes #122807
The work handle of the coalesced job is now populated:
```python
with dist._coalescing_manager(group=pg_nccl, device=device, async_ops=True) as cm:
    dist.all_reduce(a)
    dist.all_reduce(b)
print(len(cm.works))  # prints 1
cm.wait()  # actually waits
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122849
Approved by: https://github.com/kwen2501
Summary:
The existing flight recorder dump logic dumps only on timeout, not on NCCL
error. This resulted in the faulty ranks missing dumps when a NCCL error
happens.
So in this PR, we revise the dump logic such that records are dumped whenever
any exception is detected. An exception could be: 1. a NCCL async error;
2. a watchdog timeout.
Also, the existing code tends to mix the flight recorder dump logic with the
desync debug logic, which is not desirable. We now dump the desync debug
report only when a timeout is detected.
Test Plan:
Added a new unit test that triggers a NCCL error and a dump, and verifies the
dump is triggered by the error.
Existing dump-on-timeout tests should still pass.
```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (84bf9d4c)]$ python
test/distributed/test_c10d_nccl.py NcclErrorDumpTest
NCCL version 2.19.3+cuda12.0
[E329 19:15:11.775879730 ProcessGroupNCCL.cpp:565] [Rank 0] Watchdog
caught collective operation timeout: WorkNCCL(SeqNum=2,
OpType=ALLREDUCE, NumelIn=10, NumelOut=10, Timeout(ms)=10000) ran for
10028 milliseconds before timing out.
[E329 19:15:11.777459894 ProcessGroupNCCL.cpp:1561] [PG 0 Rank 0]
Exception hit in NCCL work: 2
[E329 19:15:12.660717323 ProcessGroupNCCL.cpp:1332] [PG 0 Rank 0]
Received a timeout signal from this local rank and will start to dump
the debug info. Last enqueued NCCL work: 2, last completed NCCL work: 1.
[E329 19:15:12.660932242 ProcessGroupNCCL.cpp:1167] [PG 0 Rank 0]
ProcessGroupNCCL preparing to dump debug info.
[E329 19:15:12.661192990 ProcessGroupNCCL.cpp:1174] [PG 0 Rank 0]
ProcessGroupNCCL dumping nccl trace to /tmp/tmp06psqil3/trace_0
[F329 19:15:12.661485601 ProcessGroupNCCL.cpp:1185] [PG 0 Rank 0] [PG 0
Rank 0] ProcessGroupNCCL's watchdog detected a collective timeout from
the local rank. This is most likely caused by incorrect usages of
collectives, e.g., wrong sizes used across ranks, the order of
collectives is not same for all ranks or the scheduled collective, for
some reason, didn't run. Additionally, this can be caused by GIL
deadlock or other reasons such as network errors or bugs in the
communications library (e.g. NCCL), etc. We tried our best to dump the
debug info into the storage to help you debug the issue.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123023
Approved by: https://github.com/wconstab
Summary:
When a rank detects a timeout from the TCPStore and triggers the dump, it's
good to have more info about the source rank that detected the collective
timeout locally. We just need to put the source rank as the value in the
kvstore.
Test Plan:
In the unit test, we trigger the timeout on rank 0; rank 1 should get the
timeout signal from the store and log the correct source rank:
```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (34d27652)]$ python
test/distributed/test_c10d_nccl.py NCCLTraceTestTimeoutDumpOnStuckRanks
NCCL version 2.19.3+cuda12.0
[rank0]:[E327 17:04:16.986381360 ProcessGroupNCCL.cpp:565] [Rank 0]
Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2,
OpType=ALLREDUCE, NumelIn=12, NumelOut=12, Timeout(ms)=1000) ran for
1099 milliseconds before timing out.
[rank0]:[E327 17:04:16.988036373 ProcessGroupNCCL.cpp:1582] [PG 0 Rank
0] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed
NCCL work: 1.
[rank0]:[E327 17:04:16.182548526 ProcessGroupNCCL.cpp:1346] [PG 0
Rank 0] Received a timeout signal from this local rank and will start
to dump the debug info. Last enqueued NCCL work: 2, last completed
NCCL work: 1.
[rank0]:[E327 17:04:16.247574460 ProcessGroupNCCL.cpp:1167] [PG 0
Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank1]:[E327 17:04:16.273332178 ProcessGroupNCCL.cpp:1346] [PG 0
Rank 1] Received a global timeout from another rank 0, and will start
to dump the debug info. Last enqueued NCCL work: 1, last completed
NCCL work: 1.
[rank1]:[E327 17:04:16.273565177 ProcessGroupNCCL.cpp:1167] [PG 0
Rank 1] ProcessGroupNCCL preparing to dump debug info.
[rank1]:[F327 17:04:16.274256512 ProcessGroupNCCL.cpp:1185] [PG 0
Rank 1] [PG 0 Rank 1] ProcessGroupNCCL's watchdog detected a
collective timeout from another rank 0 and notified the current rank.
This is most likely caused by incorrect usages of collectives, e.g.,
wrong sizes used across ranks, the order of collectives is not same
for all ranks or the scheduled collective, for some reason, didn't
run. Additionally, this can be caused by GIL deadlock or other
reasons such as network errors or bugs in the communications library
(e.g. NCCL), etc. We tried our best to dump the debug info into the
storage to help you debug the issue.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122850
Approved by: https://github.com/wconstab
While testing all-reduce with an alternative RCCL replacement backend, my test script crashed. After debugging, I found that `ncclGetLastError(NULL)` returned null; constructing a std::string from that return value then seg-faults with the exception `basic_string::_M_construct null not valid`.
This pull request fixes this edge condition so that the program exits gracefully with useful information.
**Test:**
Before the fix, my test script exited like below:
```
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2051, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: basic_string::_M_construct null not valid
```
After this fix, my test script exited with a useful message like:
```
[rank0]: File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
[rank0]: work = group.allreduce([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:272, internal error - please report this issue to the NCCL developers, NCCL version 0.4.2
[rank0]: ncclInternalError: Internal check failed.
[rank0]: Last error: Unknown NCCL Error
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121905
Approved by: https://github.com/wconstab
Differential Revision: D54993977
### Summary
The initial purpose of ncclCommDevIdxMap was to support NCCL zero-copy algorithms, so it was only enabled (with its values filled) if useTensorRegisterAllocatorHook_ was set to true. However, we now rely on it to support dumping NCCL information in a single PG, so we need it to always be available, regardless of whether useTensorRegisterAllocatorHook_ is enabled.
Move the code that fills ncclCommDevIdxMap out of the `if (useTensorRegisterAllocatorHook_)` statement.
### Test Plan
See diff
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122049
Approved by: https://github.com/shuqiangzhang
Summary: It would be useful to log the destination of the trace dump (either Manifold or a local file) so users can quickly locate the dump.
Test Plan: Modified unit tests
Differential Revision: D54972069
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122345
Approved by: https://github.com/wconstab
Summary:
Computing durations incurs additional CUDA overhead, possible GPU memory
increase, and possible hangs, so we want to disable it by default and enable
it only when needed, or at least only when timing is enabled.
Test Plan:
Test with existing unit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122138
Approved by: https://github.com/wconstab
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, e.g., in an `assert` statement that isn't present in production code.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: palmje
Differential Revision: D54931224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121995
Approved by: https://github.com/Skylion007
Summary:
This diff periodically (e.g., every 30s) logs critical collective progress
status to a Scuba table, starting with a few metrics such as the last
enqueued seq id.
With the Scuba table, we hope to easily detect the straggler of a PG,
e.g., a rank that has not progressed its seq_ for X seconds while other ranks in the same PG have a larger seq_.
The implementation makes sure that Scuba is used only for FB-internal use
cases.
For OSS, we still provide a generic logger data struct and a logger that can be
easily extended. If users do not register a logger, nothing is logged.
Test Plan:
Reused the existing unit tests for the FB side of operations, such as
test_register_and_dump in test_c10d_manifold, changed the dump period to a
very small number (e.g., 1ms), and verified that the logs show up correctly in the Scuba table:
https://fburl.com/scuba/c10d_work_update/9trhwnmy
Reviewed By: wconstab
Differential Revision: D54556219
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121859
Approved by: https://github.com/wconstab
Summary:
We don't want people to move to NCCL exp without explicit opt-in. It seems that sparse allreduce was accidentally called and people were confused about whether they should use NCCL exp instead.
Update the error message to explicitly say that sparse_allreduce is not supported.
Test Plan: sandcastle
Differential Revision: D54759307
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121644
Approved by: https://github.com/awgu
`hybridCubeMeshAllReduceKernel` uses the latter half of the p2p buffers as relay buffers. The relay buffer address is calculated from a bf16 base pointer and the buffer size in bytes. The breakage was caused by not taking the element size into account: advancing a bf16 pointer by a byte count moves twice as far as intended, so the byte offset must be divided by the element size.
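A sketch of the offset arithmetic with made-up sizes (illustration only; the real computation lives in the CUDA kernel):
```python
# hypothetical sizes, for illustration
p2p_buffer_bytes = 8192  # relay space is the latter half
bf16_itemsize = 2        # bytes per bf16 element

# buggy: using the byte offset as an element offset on a bf16* base
# lands 2x too far (4096 elements == 8192 bytes)
relay_offset_buggy = p2p_buffer_bytes // 2

# fixed: convert the byte offset to an element offset
relay_offset_elems = (p2p_buffer_bytes // 2) // bf16_itemsize  # 2048 elements
```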
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121575
Approved by: https://github.com/Chillee
The necessity of this PR lies in the fact that the autograd engine + DDP call `all_reduce` from C++, so the changes must be made in C++.
```
[rank0]: Traceback (most recent call last):
[rank0]: File "~/complex_ddp.py", line 72, in <module>
[rank0]: main()
[rank0]: File "~/complex_ddp.py", line 64, in main
[rank0]: loss.backward()
[rank0]: File "/home/usr/pytorch/torch/_tensor.py", line 525, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/home/usr/pytorch/torch/autograd/__init__.py", line 267, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/home/usr/pytorch/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: TypeError: Input tensor data type is not supported for NCCL process group: ComplexFloat
```
I believe that, to minimize the Python overhead, the same could be done for the rest of the ops; what do you think, @kwen2501?
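A sketch of the failing scenario this PR fixes (hypothetical script mirroring the traceback above):
```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes torchrun with an initialized NCCL process group, one GPU per rank
dist.init_process_group(backend="nccl")
model = DDP(torch.nn.Linear(4, 4, dtype=torch.cfloat, device="cuda"))
out = model(torch.randn(2, 4, dtype=torch.cfloat, device="cuda"))
# backward triggers DDP's C++ all_reduce on complex gradients; before this
# change it raised: "Input tensor data type is not supported for NCCL
# process group: ComplexFloat"
out.abs().sum().backward()
```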
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121045
Approved by: https://github.com/eqy, https://github.com/kwen2501
Today `GroupRegistry` employs thread isolation by default, i.e. every thread sees its own process group registry. This is intended to work for one-device-per-process (for python use cases) and one-device-per-thread case (for custom native runtimes).
However, there's a problem: there are Python use cases that initialize/register process groups in one thread and run collectives in another. This use case should be supported, but since `GroupRegistry` employs thread isolation by default, collectives in different threads can't find the registered process groups.
This PR fixes the issue by:
- Making `GroupRegistry` work in non-thread-isolation mode by default. This matches the behavior without the native process group registry.
- Introducing `set_thread_isolation_mode` so one-device-per-thread runtimes can enable thread isolation explicitly.
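A sketch of how a one-device-per-thread runtime could opt back in (the Python binding name here is an assumption; the C++ entry point added is `set_thread_isolation_mode`):
```python
# hypothetical binding; the C++ API added here is set_thread_isolation_mode(bool)
from torch._C._distributed_c10d import _set_thread_isolation_mode

_set_thread_isolation_mode(True)  # each thread sees its own PG registry again
```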
Differential Revision: [D54658515](https://our.internmc.facebook.com/intern/diff/D54658515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121457
Approved by: https://github.com/wanchaol
Summary:
When there are multiple PGs in a process and a hardware failure happens,
we found that multiple PGs/threads in the same process compete to dump the
same records at the same time, which hurts the reliability of the dumps.
In this PR, we make the change such that only one thread/PG dumps: PG0's
monitor thread. We use a static variable to indicate that something (e.g., a
collective timeout) has triggered the dump locally.
The monitor thread dumps debug info under any one of these 3 conditions:
1. the static variable is set to true by the watchdog thread when it detects
a timeout or pipe-dump signal
2. a timeout signal is received from other ranks through the TCPStore
3. the watchdog heartbeat stops
Test Plan:
```
python test/distributed/test_c10d_nccl.py -k test_timeout_dumps_on_stuck_ranks
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120893
Approved by: https://github.com/wconstab
Summary:
https://github.com/pytorch/pytorch/pull/104373 introduced backend_id
> an unique ID for the actual backend object, this is also exposed in record_param_comms, so we can correlate these collectives with the right backend object.
However, it is inconvenient to correlate collectives with the backend id. Using the pg id (uid) to correlate directly is a better solution.
This PR changes the ID information exposed in record_param_comms from backend_id to pg_id.
Differential Revision: D53558257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120475
Approved by: https://github.com/aaronenyeshi