Summary:
For the motivation behind the overall stack of diffs, see the D56218385 summary.
This particular diff makes cpp_dumper take a custom printer function that logs call stacks one group at a time, so it no longer runs into the 30K-character limit of `LOG(INFO)`.
Test Plan:
```
[romanmal@46150.od /data/sandcastle/boxes/fbsource/fbcode (520a7b7b5)]$ buck2 test //caffe2/torch/csrc/distributed/c10d/...
File changed: fbcode//common/base/ThreadStackTrace.cpp
File changed: fbsource//xplat/caffe2/torch/csrc/distributed/c10d/fb/TraceUtils.cpp
File changed: fbcode//caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp
4 additional file change events
Buck UI: https://www.internalfb.com/buck2/d8ceae86-7d6f-4779-ad0c-8e37eddcff98
Network: Up: 0B Down: 0B
Jobs completed: 2. Time elapsed: 1.5s.
Tests finished: Pass 0. Fail 0. Fatal 0. Skip 0. Build failure 0
NO TESTS RAN
[romanmal@46150.od /data/sandcastle/boxes/fbsource/fbcode (520a7b7b5)]$
```
Tested to print the stack trace:
P1220109730
Differential Revision: D56218360
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124628
Approved by: https://github.com/wconstab
Summary:
```
ncclGroupStart()
ncclCommInit(..)
ncclGroupEnd()
```
The pattern above is only needed when a *single thread* manages multiple GPUs.
In our case we always have one process managing one GPU, so we don't need the group operation.
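For context, a minimal sketch of the one-process-per-GPU setup assumed here (standard torchrun-style usage, not code from this diff):
```python
import os

import torch
import torch.distributed as dist

# each process binds to exactly one GPU and creates its own communicator,
# so there is no multi-GPU loop that would need ncclGroupStart/End bracketing
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")
```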
Test Plan: CI
Differential Revision: D56274975
Co-authored-by: Cen Zhao <cenzhao@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124416
Approved by: https://github.com/shuqiangzhang
This adds a templated version of the ring attention forward function and tests it with memory efficient attention. This doesn't add support for memory efficient attention in DTensor; that will be added in a follow-up PR.
The templating is also a proof of concept for how to support other attention ops, such as jagged/nested tensors, as well as how to implement striped attention in a scalable way.
Misc changes:
* Fixes all_to_all_single autograd implementation with CUDA + adds NCCL test
* Adds compile support to the ring attention implementations (required some tweaks to process groups)
Test plan:
```
pytest test/distributed/_tensor/test_attention.py
pytest test/distributed/test_functional_api.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124215
Approved by: https://github.com/wanchaol
Summary:
This env var was introduced to safely roll out the behavior change in destroy
process group (e.g., calling ncclCommsAbort). Now that this behavior change
has been rolled out, we no longer need the env var, and we should remove it
to keep our code cleaner.
Test Plan:
Modified/existing unit tests pass
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124334
Approved by: https://github.com/wconstab
Summary:
Pass the process group name and description to the NCCL communicator so that pg information can be accessed in the NCCL layer.
The information is passed as the commDesc string (i.e., "<pg_desc>:<pg_name>").
This only takes effect when NCCL_COMM_DESCRIPTION is defined.
Differential Revision: D55703310
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124149
Approved by: https://github.com/shuqiangzhang
Summary:
As part of the work to unify process group identifiers, log <group_name, group_desc> instead of the pg uid in the profiler.
- group_name remains the unique identifier, e.g. "0", "1"
- group_desc is the user-specified name, e.g. "fsdp"
Reviewed By: aaronenyeshi, kwen2501
Differential Revision: D55610682
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124035
Approved by: https://github.com/aaronenyeshi
Summary:
We need a way to let users set a custom description for a process group, e.g., FSDP, PP.
Here are several use cases for a user-specified group_desc:
- Logging: we can easily match a log line to the collective/pg and understand what it is used for.
- PyTorch traces (e.g., Kineto, Execution Trace): trace analyses and benchmarks can easily differentiate PG purposes like FSDP and PP.
- Lower-layer collective (e.g., NCCL) debugging: we can expose the PG desc to the NCCL communicator so that NCCL-layer operations can be easily correlated to a PG.
Solution: Add a group_desc field to c10d
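For illustration, a minimal sketch of setting the description, assuming the group_desc keyword is exposed through dist.new_group as the Python surface of this change:
```python
import torch.distributed as dist

# assumes an already-initialized default process group;
# group_desc is the user-specified description added by this change
fsdp_pg = dist.new_group(ranks=list(range(dist.get_world_size())),
                         group_desc="fsdp")
```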
Differential Revision: D55781850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123472
Approved by: https://github.com/kwen2501
This adds the differentiable collective -- all_to_all_single_grad. This is the initial proof-of-concept PR; I will be adding the remaining collectives in follow-up PRs.
This adds a new function called `all_to_all_single_autograd`, which is the autograd variant of `all_to_all_single`. For backwards compatibility and initial testing, we wanted to keep the autograd variant separate to avoid regressions.
This uses `autograd::Function` to register an autograd op that calls the original `_c10d_functional::all_to_all_single` via the dispatcher. This works with compile and inductor, as opposed to the previous Python implementation, which had issues. Since this reuses the existing `_c10d_functional` ops, we don't need to register any meta functions or lowerings.
To avoid CUDA stream issues, the backward method explicitly calls `wait_tensor` to ensure the wait runs on the same stream as the async operation. This hurts performance but can potentially be alleviated using `compile`.
Related work: https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/comm_ops.py
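A minimal usage sketch, assuming the op is exposed as `torch.distributed._functional_collectives.all_to_all_single_autograd` (exercised by the test plan below):
```python
import torch
import torch.distributed as dist
from torch.distributed._functional_collectives import all_to_all_single_autograd

# assumes torchrun with an initialized NCCL process group, one GPU per rank
world = dist.get_world_size()
inp = torch.ones(world, device="cuda", requires_grad=True)
splits = [1] * world  # even split across ranks
out = all_to_all_single_autograd(inp, splits, splits, dist.group.WORLD)
out.sum().backward()  # backward issues the reverse all_to_all and wait_tensor
print(inp.grad)
```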
Test plan:
```
pytest test/distributed/test_functional_api.py -k test_all_to_all_single_compile
pytest test/distributed/test_functional_api.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123599
Approved by: https://github.com/yifuwang
Summary:
We separated the flight recorder (FR) dump logic from the desync debug logic,
so we no longer set collectiveDebugInfoMode_ to true when we just need an FR
dump. That's why the monitor thread did not sleep and tried to kill the
process without waiting for the dump.
The fix is simple: we should sleep whenever shouldDump_ is true.
Test Plan:
Existing unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123788
Approved by: https://github.com/wconstab
There are many existing ProcessGroupNCCL features controlled by env vars. This PR adds TORCH_NCCL_HIGH_PRIORITY to force the use of high-priority CUDA or HIP streams for the NCCL or RCCL kernels, respectively.
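A usage sketch (assuming the usual truthy env-var convention; the flag just needs to be set before the process group creates its streams):
```python
import os

# opt in to high-priority CUDA/HIP streams for NCCL/RCCL kernels;
# set before ProcessGroupNCCL is constructed
os.environ["TORCH_NCCL_HIGH_PRIORITY"] = "1"

import torch.distributed as dist

dist.init_process_group(backend="nccl")
```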
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122830
Approved by: https://github.com/kwen2501
Summary:
Pass the Python c10d group_name to the C++ ProcessGroupNCCL so that the pg name is consistent across layers.
Also record pg_name in flight recorder entries.
Differential Revision: D55597200
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123117
Approved by: https://github.com/wconstab
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.
This:
```
try {
...
} catch (exception& e) {
// no use of e
}
```
should instead be written as
```
} catch (exception&) {
```
If the code compiles, this is safe to land.
Test Plan: Sandcastle
Reviewed By: palmje
Differential Revision: D55548497
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123056
Approved by: https://github.com/Skylion007
Fixes #122807
The work handle of the coalesced job is now populated:
```python
with dist._coalescing_manager(group=pg_nccl, device=device, async_ops=True) as cm:
    dist.all_reduce(a)
    dist.all_reduce(b)
print(len(cm.works))  # prints 1
cm.wait()  # actually waits
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122849
Approved by: https://github.com/kwen2501
Summary:
The existing flight recorder dump logic dumps only on timeout, not on NCCL
error. This resulted in the faulty ranks missing dumps when a NCCL error
happens.
So in this PR, we revise the dump logic such that records are dumped whenever
any exception is detected. An exception could be: 1. a NCCL async error;
2. a watchdog timeout.
Also, the existing code tends to mix the flight recorder dump logic with the
desync debug logic, which is not desirable. We now dump the desync debug
report only when a timeout is detected.
Test Plan:
Added a new unit test that triggers a NCCL error and a dump, and verifies the
dump is triggered by the error.
Existing dump-on-timeout tests should still pass.
```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (84bf9d4c)]$ python
test/distributed/test_c10d_nccl.py NcclErrorDumpTest
NCCL version 2.19.3+cuda12.0
[E329 19:15:11.775879730 ProcessGroupNCCL.cpp:565] [Rank 0] Watchdog
caught collective operation timeout: WorkNCCL(SeqNum=2,
OpType=ALLREDUCE, NumelIn=10, NumelOut=10, Timeout(ms)=10000) ran for
10028 milliseconds before timing out.
[E329 19:15:11.777459894 ProcessGroupNCCL.cpp:1561] [PG 0 Rank 0]
Exception hit in NCCL work: 2
[E329 19:15:12.660717323 ProcessGroupNCCL.cpp:1332] [PG 0 Rank 0]
Received a timeout signal from this local rank and will start to dump
the debug info. Last enqueued NCCL work: 2, last completed NCCL work: 1.
[E329 19:15:12.660932242 ProcessGroupNCCL.cpp:1167] [PG 0 Rank 0]
ProcessGroupNCCL preparing to dump debug info.
[E329 19:15:12.661192990 ProcessGroupNCCL.cpp:1174] [PG 0 Rank 0]
ProcessGroupNCCL dumping nccl trace to /tmp/tmp06psqil3/trace_0
[F329 19:15:12.661485601 ProcessGroupNCCL.cpp:1185] [PG 0 Rank 0] [PG 0
Rank 0] ProcessGroupNCCL's watchdog detected a collective timeout from
the local rank. This is most likely caused by incorrect usages of
collectives, e.g., wrong sizes used across ranks, the order of
collectives is not same for all ranks or the scheduled collective, for
some reason, didn't run. Additionally, this can be caused by GIL
deadlock or other reasons such as network errors or bugs in the
communications library (e.g. NCCL), etc. We tried our best to dump the
debug info into the storage to help you debug the issue.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123023
Approved by: https://github.com/wconstab
Summary:
When a rank detects a timeout from the TCPStore and triggers the dump, it's
good to have more info about the source rank that detected the collective
timeout locally. We just need to put the source rank as the value in the
kvstore.
Test Plan:
In the unit test, we trigger the timeout on rank 0; rank 1 should get the
timeout signal from the store and log the correct source rank:
```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (34d27652)]$ python
test/distributed/test_c10d_nccl.py NCCLTraceTestTimeoutDumpOnStuckRanks
NCCL version 2.19.3+cuda12.0
[rank0]:[E327 17:04:16.986381360 ProcessGroupNCCL.cpp:565] [Rank 0]
Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2,
OpType=ALLREDUCE, NumelIn=12, NumelOut=12, Timeout(ms)=1000) ran for
1099 milliseconds before timing out.
[rank0]:[E327 17:04:16.988036373 ProcessGroupNCCL.cpp:1582] [PG 0 Rank
0] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed
NCCL work: 1.
[rank0]:[E327 17:04:16.182548526 ProcessGroupNCCL.cpp:1346] [PG 0
Rank 0] Received a timeout signal from this local rank and will start
to dump the debug info. Last enqueued NCCL work: 2, last completed
NCCL work: 1.
[rank0]:[E327 17:04:16.247574460 ProcessGroupNCCL.cpp:1167] [PG 0
Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank1]:[E327 17:04:16.273332178 ProcessGroupNCCL.cpp:1346] [PG 0
Rank 1] Received a global timeout from another rank 0, and will start
to dump the debug info. Last enqueued NCCL work: 1, last completed
NCCL work: 1.
[rank1]:[E327 17:04:16.273565177 ProcessGroupNCCL.cpp:1167] [PG 0
Rank 1] ProcessGroupNCCL preparing to dump debug info.
[rank1]:[F327 17:04:16.274256512 ProcessGroupNCCL.cpp:1185] [PG 0
Rank 1] [PG 0 Rank 1] ProcessGroupNCCL's watchdog detected a
collective timeout from another rank 0 and notified the current rank.
This is most likely caused by incorrect usages of collectives, e.g.,
wrong sizes used across ranks, the order of collectives is not same
for all ranks or the scheduled collective, for some reason, didn't
run. Additionally, this can be caused by GIL deadlock or other
reasons such as network errors or bugs in the communications library
(e.g. NCCL), etc. We tried our best to dump the debug info into the
storage to help you debug the issue.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122850
Approved by: https://github.com/wconstab
While testing all-reduce with an alternative RCCL replacement backend, my test script crashed. After debugging, I found that `ncclGetLastError(NULL)` returned null; constructing a std::string from that return value then seg-faults with the exception `basic_string::_M_construct null not valid`.
This pull request fixes this edge condition so that the program exits gracefully with useful information.
**Test:**
Before the fix, my test script exited like below:
```
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2051, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: basic_string::_M_construct null not valid
```
After this fix, my test script exited with a useful message like:
```
[rank0]: File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
[rank0]: work = group.allreduce([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:272, internal error - please report this issue to the NCCL developers, NCCL version 0.4.2
[rank0]: ncclInternalError: Internal check failed.
[rank0]: Last error: Unknown NCCL Error
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121905
Approved by: https://github.com/wconstab
Differential Revision: D54993977
### Summary
The initial purpose of ncclCommDevIdxMap was to support NCCL zero-copy algorithms, so it was only enabled (with its values filled) if useTensorRegisterAllocatorHook_ was set to true. However, we now rely on it to support dumping NCCL information in a single PG, so we need it to always be available, regardless of whether useTensorRegisterAllocatorHook_ is enabled.
Move the code that fills ncclCommDevIdxMap out of the `if (useTensorRegisterAllocatorHook_)` statement.
### Test Plan
See diff
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122049
Approved by: https://github.com/shuqiangzhang
Summary: It would be useful to log the destination of the trace dump (either Manifold or a local file) so users can quickly locate the dump.
Test Plan: Modified unit tests
Differential Revision: D54972069
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122345
Approved by: https://github.com/wconstab
Summary:
Computing durations incurs additional CUDA overhead, possible GPU memory
increase, and possible hangs, so we want to disable it by default and enable
it only when needed, or at least only when timing is enabled.
Test Plan:
Test with existing unit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122138
Approved by: https://github.com/wconstab
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, e.g., in an `assert` statement that isn't present in production code.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: palmje
Differential Revision: D54931224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121995
Approved by: https://github.com/Skylion007
Summary:
This diff periodically (e.g., every 30s) logs critical collective progress
status to a Scuba table, starting with a few metrics such as the last
enqueued seq id.
With the Scuba table, we hope to easily detect the straggler of a PG,
e.g., a rank that has not progressed its seq_ for X seconds while other ranks in the same PG have a larger seq_.
The implementation makes sure that Scuba is used only for FB-internal use
cases.
For OSS, we still provide a generic logger data struct and a logger that can be
easily extended. If users do not register a logger, nothing is logged.
Test Plan:
Reused the existing unit tests for the FB side of operations, such as
test_register_and_dump in test_c10d_manifold, changed the dump period to a
very small number (e.g., 1ms), and verified that the logs show up correctly in the Scuba table:
https://fburl.com/scuba/c10d_work_update/9trhwnmy
Reviewed By: wconstab
Differential Revision: D54556219
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121859
Approved by: https://github.com/wconstab
Summary:
We don't want people to move to NCCL exp without explicit opt-in. It seems that sparse allreduce was accidentally called and people were confused about whether they should use NCCL exp instead.
Update the error message to explicitly say that sparse_allreduce is not supported.
Test Plan: sandcastle
Differential Revision: D54759307
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121644
Approved by: https://github.com/awgu
`hybridCubeMeshAllReduceKernel` uses the latter half of the p2p buffers as relay buffers. The relay buffer address is calculated from a bf16 base pointer and the buffer size in bytes. The breakage was caused by not taking the element size into account: advancing a bf16 pointer by a byte count moves twice as far as intended, so the byte offset must be divided by the element size.
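A sketch of the offset arithmetic with made-up sizes (illustration only; the real computation lives in the CUDA kernel):
```python
# hypothetical sizes, for illustration
p2p_buffer_bytes = 8192  # relay space is the latter half
bf16_itemsize = 2        # bytes per bf16 element

# buggy: using the byte offset as an element offset on a bf16* base
# lands 2x too far (4096 elements == 8192 bytes)
relay_offset_buggy = p2p_buffer_bytes // 2

# fixed: convert the byte offset to an element offset
relay_offset_elems = (p2p_buffer_bytes // 2) // bf16_itemsize  # 2048 elements
```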
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121575
Approved by: https://github.com/Chillee
The necessity of this PR lies in the fact that the autograd engine + DDP call `all_reduce` from C++, so the changes must be made in C++.
```
[rank0]: Traceback (most recent call last):
[rank0]: File "~/complex_ddp.py", line 72, in <module>
[rank0]: main()
[rank0]: File "~/complex_ddp.py", line 64, in main
[rank0]: loss.backward()
[rank0]: File "/home/usr/pytorch/torch/_tensor.py", line 525, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/home/usr/pytorch/torch/autograd/__init__.py", line 267, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/home/usr/pytorch/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: TypeError: Input tensor data type is not supported for NCCL process group: ComplexFloat
```
I believe that, to minimize the Python overhead, the same could be done for the rest of the ops; what do you think, @kwen2501?
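A sketch of the failing scenario this PR fixes (hypothetical script mirroring the traceback above):
```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes torchrun with an initialized NCCL process group, one GPU per rank
dist.init_process_group(backend="nccl")
model = DDP(torch.nn.Linear(4, 4, dtype=torch.cfloat, device="cuda"))
out = model(torch.randn(2, 4, dtype=torch.cfloat, device="cuda"))
# backward triggers DDP's C++ all_reduce on complex gradients; before this
# change it raised: "Input tensor data type is not supported for NCCL
# process group: ComplexFloat"
out.abs().sum().backward()
```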
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121045
Approved by: https://github.com/eqy, https://github.com/kwen2501
Today `GroupRegistry` employs thread isolation by default, i.e. every thread sees its own process group registry. This is intended to work for one-device-per-process (for python use cases) and one-device-per-thread case (for custom native runtimes).
However, there's a problem: there are Python use cases that initialize/register process groups in one thread and run collectives in another. This use case should be supported, but since `GroupRegistry` employs thread isolation by default, collectives in different threads can't find the registered process groups.
This PR fixes the issue by:
- Making `GroupRegistry` work in non-thread-isolation mode by default. This matches the behavior without the native process group registry.
- Introducing `set_thread_isolation_mode` so one-device-per-thread runtimes can enable thread isolation explicitly.
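A sketch of how a one-device-per-thread runtime could opt back in (the Python binding name here is an assumption; the C++ entry point added is `set_thread_isolation_mode`):
```python
# hypothetical binding; the C++ API added here is set_thread_isolation_mode(bool)
from torch._C._distributed_c10d import _set_thread_isolation_mode

_set_thread_isolation_mode(True)  # each thread sees its own PG registry again
```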
Differential Revision: [D54658515](https://our.internmc.facebook.com/intern/diff/D54658515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121457
Approved by: https://github.com/wanchaol
Summary:
When there are multiple PGs in a process and a hardware failure happens,
we found that multiple PGs/threads in the same process compete to dump the
same records at the same time, which hurts the reliability of the dumps.
In this PR, we make the change such that only one thread/PG dumps: PG0's
monitor thread. We use a static variable to indicate that something (e.g., a
collective timeout) has triggered the dump locally.
The monitor thread dumps debug info under any one of these 3 conditions:
1. the static variable is set to true by the watchdog thread when it detects
a timeout or pipe-dump signal
2. a timeout signal is received from other ranks through the TCPStore
3. the watchdog heartbeat stops
Test Plan:
```
python test/distributed/test_c10d_nccl.py -k test_timeout_dumps_on_stuck_ranks
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120893
Approved by: https://github.com/wconstab
Summary:
https://github.com/pytorch/pytorch/pull/104373 introduced backend_id
> an unique ID for the actual backend object, this is also exposed in record_param_comms, so we can correlate these collectives with the right backend object.
However, it is inconvenient to correlate collectives with the backend id. Using the pg id (uid) to correlate directly is a better solution.
This PR changes the ID information exposed in record_param_comms from backend_id to pg_id.
Differential Revision: D53558257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120475
Approved by: https://github.com/aaronenyeshi