This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves#147729
- Resolves#146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves#147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels.
Joint work with @cenzhaometa who wants to remove the event sync overhead.
Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj
Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
# Fix typo errors across PyTorch codebase
This PR fixes various spelling errors throughout the PyTorch codebase to improve documentation quality and code readability.
## Changes Made
### Documentation Fixes
- Changed "seperate" to "separate" in multiple files:
- `setup.py`: Build system documentation
- `torch/_library/triton.py`: AOT compilation comments
- `torch/csrc/dynamo/compiled_autograd.h`: Node compilation documentation
- `torch/export/_unlift.py`: Pass population comments
- `torch/export/exported_program.py`: Decomposition table notes
### Code Comments and Error Messages
- Changed "occured" to "occurred" in:
- `test/mobile/test_lite_script_module.py`: Exception handling comments
- `torch/export/_draft_export.py`: Error message text
- `aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp`: MAGMA bug comment
- `torch/csrc/utils/python_numbers.h`: Overflow handling comment
- `torch/csrc/jit/OVERVIEW.md`: Graph compilation documentation
- `torch/_dynamo/symbolic_convert.py`: Error explanation
### API Documentation
- Changed "fullfill" to "fulfill" in `torch/distributed/checkpoint/state_dict_loader.py`
- Changed "accross" to "across" in:
- `torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp`
- `torch/distributed/distributed_c10d.py`
## Motivation
These changes improve code readability and maintain consistent spelling throughout the codebase. No functional changes were made; this is purely a documentation and comment improvement PR.
## Test Plan
No testing required as these changes only affect comments and documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148262
Approved by: https://github.com/janeyx99
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves#147729
- Resolves#146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves#147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels.
Joint work with @cenzhaometa who wants to remove the event sync overhead.
Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj
Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves#147729
- Resolves#146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves#147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels.
Joint work with @cenzhaometa who wants to remove the event sync overhead.
Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj
Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
This PR fixes an issue of inability to capture `isend`/`irecv` ops in `async` mode.
<details>
<summary>The repro code</summary>
```Python
import os
import torch
import torch.distributed as dist
USE_ASYNC = True
def test_func(x, rank):
if rank == 0:
x += 1
# Send the tensor to process 1
if USE_ASYNC:
a = dist.isend(tensor=x, dst=1)
else:
dist.send(tensor=x, dst=1)
else:
# Receive tensor from process 0
if USE_ASYNC:
a = dist.irecv(tensor=x, src=0)
else:
dist.recv(tensor=x, src=0)
if USE_ASYNC:
a.wait()
return x + 2
def run(rank):
torch.cuda.set_device(rank)
x = torch.ones(1, device='cuda')
with torch.cuda.stream(torch.cuda.Stream()):
for i in range(11):
x.copy_(torch.ones(1, device='cuda'))
y = test_func(x, rank)
print(f"Rank{rank} has data {y} in warmup")
torch.cuda.synchronize()
graph = torch.cuda.CUDAGraph()
x.copy_(torch.ones(1, device='cuda'))
with torch.cuda.graph(graph):
y = test_func(x, rank)
for i in range(1):
x.copy_(torch.ones(1, device='cuda'))
graph.replay()
print(f"Rank{rank} has data {y} after graph replay")
def main():
rank = int(os.environ['RANK'])
local_rank = int(os.environ['LOCAL_RANK'])
world_size = int(os.environ['WORLD_SIZE'])
dist.init_process_group('nccl', rank=rank, world_size=world_size)
run(local_rank)
if __name__ == "__main__":
main()
```
</details>
Fails with an error stating that work handle is of a NoneType:
```
[rank1]: Traceback (most recent call last):
[rank1]: File "/workspace/repro.py", line 54, in <module>
[rank1]: main()
[rank1]: File "/workspace/repro.py", line 51, in main
[rank1]: run(local_rank)
[rank1]: File "/workspace/repro.py", line 38, in run
[rank1]: y = test_func(x, rank)
[rank1]: ^^^^^^^^^^^^^^^^^^
[rank1]: File "/workspace/repro.py", line 22, in test_func
[rank1]: a.wait()
[rank1]: ^^^^^^
[rank1]: AttributeError: 'NoneType' object has no attribute 'wait'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148462
Approved by: https://github.com/kwen2501
So that NVLink SHARP comes with zero-copy on H100+ platforms, for DDP applications.
Less SM usage, less memory contention between NCCL kernel and compute kernels.
Added env `DDP_DISABLE_COMM_MEM` as a back-out option:
```
An environment variable to disable comm-optimized memory pool.
Default is 0, which means comm-optimized memory pool is enabled.
Users can set it to 1 in case of seeing regression or OOM (because this
comm MemPool may not share space with regular compute MemPool).
```
Differential Revision: [D69297766](https://our.internmc.facebook.com/intern/diff/D69297766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146589
Approved by: https://github.com/syed-ahmed, https://github.com/c-p-i-o, https://github.com/fduwjj
This PR implements a small UI improvement over #133603.
It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it.
UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
tensor = torch.arange(1024 * 1024 * 2, device=device)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab
This PR implements a small UI improvement over #133603.
It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it.
UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
tensor = torch.arange(1024 * 1024 * 2, device=device)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab
This PR implements a small UI improvement over #133603.
It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it.
UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
tensor = torch.arange(1024 * 1024 * 2, device=device)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab
Summary:
This PR is basically a replacement of
https://github.com/pytorch/pytorch/pull/140087, which caused some perf
drop due to frequent TCPStore check in watchdog thread. The fix is to move the
tcpstore check in monitoring thread
If unhealthy, the user should be able to get the type of errors, e.g.,
timeout,nccl error or remote error.
This API is applied to PG level, compared to the
work.get_future_result() API which is applied to Work Level.
Error detection at PG level is much more convenient for users to handle
the PG failure as a whole, e.g, restarting the PG.
Error handling at the work level is still useful for users to attach
work specific context and debug the RC of the specific failing
work/collective
Note it is critical for all ranks in the PG to be notified about an
error as soon as it occurs, so we introduce an errorType of
REMOTE_ERROR, which is 'broadcasted' from a src rank (which detects a
local error) to all other ranks in the PG, the broadcast is done through
TCPStore currently
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144498
Approved by: https://github.com/kwen2501
Summary:
Explicitly catch c10 error and log the error message only.
The standard exception `e.what()` below ends up logging the stack trace that is confusing users.
See S477887 for details.
Test Plan:
tested locally.
```
buck test caffe2/test/cpp/c10d:TCPStoreTest
buck2 daemon constraint mismatch: Version mismatch; killing daemon...
Starting new buck2 daemon...
Connected to new buck2 daemon.
File changed: fbcode//caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
File changed: fbsource//xplat/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
Watchman fresh instance: new mergebase, cleared graph state, cleared dep files
Soft Error: source_directory_includes_subpackage: Directory `v2.17.1-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.17.1-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.18.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.18.3-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.19.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.19.3-1/src/tests`.
Buck UI: https://www.internalfb.com/buck2/dbd34fa4-50ed-4eeb-800d-688f5a7bec68
Test UI: https://www.internalfb.com/intern/testinfra/testrun/281475375994918
Network: Up: 1.5GiB Down: 4.7GiB (reSessionID-d6b0568e-2347-4375-a2d9-2d03ca0c2161)
Loading targets. Remaining 0/3024 69199 dirs read, 687558 targets declared
Analyzing targets. Remaining 0/31483 1481904 actions, 1719048 artifacts declared
Executing actions. Remaining 0/250391 77:11:29.7s exec time total
Command: test. Finished 2031 local, 45445 remote, 51473 cache (52% hit) 20:16:36.9s exec time cached (26%)
Time elapsed: 7:32.7s
Tests finished: Pass 8. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D68516080
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145413
Approved by: https://github.com/fduwjj
Reported in https://github.com/pytorch/pytorch/issues/143470, we have a dangling references in `CudaEventCache`. So we want to fix it.
1. We add a unit test to repro the issue mentioned in the issue.
2. Instead of converting variables to shared pointers as suggested in the issue, we then make the cache itself a shared pointer. So if the thread creates the cache dies before all events get recycled, the cache is still there until the last CudaEvent get deleted. (thanks for the suggestion from @kwen2501 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144496
Approved by: https://github.com/kwen2501
Summary:
Dump the active proxyOp status per rank and per communicator when WatchDog timeout or aborts.
Added
`#if defined(USE_ROCM) && defined(NCCL_COMM_DUMP)` guard in the print function, so only rcclexp users will see this dump in console.
This is the changes of the PTD.
Test Plan:
Job with A2A hang due to receiver failing to post receive operations https://fburl.com/mlhub/95vg12r3
{F1971449692}
Reviewed By: c-p-i-o
Differential Revision: D67036093
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143678
Approved by: https://github.com/c-p-i-o
Summary:
Change back log to VLOG(2) in waitForFutureOrTimeout. Instead, print a more user friendly message - if FR completes successfully.
This message is meant for developers only - so don't default to `INFO` in this function.
Also, change one more message from LOG(ERROR) to LOG(INFO).
Tested locally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142505
Approved by: https://github.com/kwen2501
Summary:
Change log message for future execution back from VLOG(2) to LOG(INFO).
This message is useful for Flight Recorder to verify that flight recorder dumps completed successfully (or not).
Test Plan: Tested manually on a mast job and noted that the INFO message was as expected.
(meta only link: https://fburl.com/mlhub/iui2tpc9)
```
[trainer5]:I1208 10:21:00.772841 7528 ProcessGroupNCCL.cpp:1294] [PG ID 0 PG GUID 0(precheck) Rank 21] future is successfully executed for: Flight recorder dump in heartbeatMonitor
```
Differential Revision: D66996439
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142441
Approved by: https://github.com/fduwjj
In nonblocking mode, we always check if the NCCL communicator is ready between issuing commands to it.
Today this is done by the `waitReady()` function.
Unfortunately, the `waitReady()` function is burned with `C10D_NCCL_CHECK_TIMEOUT_SLEEP` which would sleep for an interval between two consecutive checks.
While this is nice when waiting for comm init or finalize, it degrades performance of collective calls (which would almost certainly return success immediately.)
This PR adds a `bool longInterval` argument to `waitReady` and let call site determine whether long wait is likely; if not, `waitReady` would use `sched_yield()` to more eagerly check for readiness.
Thanks @eqy for reporting an issue that small collectives has perf impact in nonblocking mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142291
Approved by: https://github.com/eqy, https://github.com/fduwjj
The usage of `onCompletionHook` is mostly similar to what Flight Recorder does today -- for example, measuring how long a collective takes and put it into a profiler's "database".
Since FR already records and can dump info like this, we are considering deprecating the onCompletionHook support to save a side thread. (Each PG runs 3 side threads today, which is resource consuming and complicates the code)
User can file an issue if additional information needs to be recorded.
They can also file an RFC if Flight Recorder needs to accept plugins that customize the recording.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142390
Approved by: https://github.com/fduwjj, https://github.com/fegin
Summary:
This is an attempt to improve the flight recorder efficacy.
We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`.
There are two changes in here.
1. We attempt to write out FR buffer with stack traces. If this fails, we attempt to capture FR buffer again - but this time without stack traces. The assumption here is that FR could be locking up when unwinding stack.
Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace.
2. Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing.
TODO:
- there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not.
This info might be useful for the analyzer - so I'll add this in a follow on diff.
Test Plan: Unit tests.
Differential Revision: D66843194
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142178
Approved by: https://github.com/kwen2501
Today `destroy_process_group()` is implemented via `ncclCommAbort`.
When user call it in CPU, risk is that a healthy NCCL kernel gets preempted, which causes data corruption.
Instead of aborting kernels, we should flush collectives in `destroy_process_group`, i.e. let them complete normally, before we tear down resources.
This PR implements such "flushing" behavior using `ncclCommFinalize`, then reclaims resources via `ncclCommDestroy`.
Expected behaviors:
For a bad program, a hang is expected at `destroy_process_group()`. If the PG uses non-blocking communicators, such hang is recoverable, because we attaches a timeout to the flush behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141510
Approved by: https://github.com/wconstab