Summary: no-except builds terminate when this exception is thrown. We should proactively check whether a backend is available before calling `_has_hooks`, instead of trying and failing.
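A hedged sketch of the intended guard; the probe used here (`_device_types`) is just one way to check for a bound backend and is an assumption, not necessarily the call used in the actual fix:
```python
from torch.distributed import ProcessGroup

def _safe_has_hooks(pg: ProcessGroup) -> bool:
    # Only query hooks when the PG actually has a backend bound to it, so
    # no-except builds don't terminate on the "could not find default
    # backend" exception. The _device_types probe is an assumption.
    if type(pg) is ProcessGroup and pg._device_types:
        return pg._has_hooks()
    return False
```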
Test Plan: CI
Differential Revision: D71144456
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149152
Approved by: https://github.com/kwen2501
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we launch the collective directly on the "current" stream, instead of launching it on a trampoline stream and joining back (a usage sketch follows this list).
   - Resolves #147729
   - Resolves #146881
   - Also saves two event syncs (which have overhead in the case of HIP) and one pybind call when we call `work.wait()` in distributed_c10d.py on behalf of the user.
2. Entirely remove `record_stream` and use CPU-side stashing to manage tensor lifetimes against allocator recycling.
   - Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against the user not calling `work.wait()`, we ask the watchdog to unstash the tensors once it detects completion of the collectives, so that we do not hold references to the tensors forever. This is a safety net rather than a service guarantee; see the discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode will look different: collective kernels will show up on the same line as compute kernels.
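For illustration, a minimal usage sketch (not code from this PR; assumes a torchrun-style environment) of what items 1 and 3 mean for callers:
```python
# Sketch only: with async_op=False the collective is launched on the caller's
# current stream, so later kernels on that stream are already ordered after it
# and no CPU-side tensor stashing is needed.
import torch
import torch.distributed as dist

dist.init_process_group("nccl")  # assumes RANK/WORLD_SIZE/MASTER_* env vars
t = torch.ones(1024, device="cuda")

dist.all_reduce(t, async_op=False)  # runs on the current stream
t.mul_(2)                           # same stream: ordered after the all_reduce

work = dist.all_reduce(t, async_op=True)
work.wait()  # async_op=True still stashes tensors until wait() (or until the watchdog unstashes them)
```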
Joint work with @cenzhaometa who wants to remove the event sync overhead.
Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj
Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
This has two fixes to improve IPC tensor release performance when using torchft's BabyProcessGroupNCCL.
1. release the IpcMutex when deleting the `ExpandableSegment` object, to avoid synchronizing under the lock
2. release the GIL in the WorkNCCL destructor, since the shared tensor will be destructed there
Test plan:
Run with torchft + torchtitan
```
REPLICA_GROUP_ID=0 NGPU=2 CUDA_VISIBLE_DEVICES=0,1 CONFIG_FILE=./torchtitan/models/llama/train_configs/llama3_8b.toml ./run_train.sh --training.data_parallel_shard_degree=2 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=0 --metrics.log_freq=1 --training.seq_len 4096
...
[rank0]:[titan] 2025-03-07 17:51:31,387 - root - INFO - step: 61 loss: 7.4825 memory: 79.73GiB(83.89%) tps: 317 tflops: 16.34 mfu: 1.65%
```
Check py-spy to verify there is no bottleneck on the IPC lock when creating new shared tensors.


Pull Request resolved: https://github.com/pytorch/pytorch/pull/148805
Approved by: https://github.com/Skylion007, https://github.com/fegin, https://github.com/zdevito
As titled: previously `shard_dim_alltoall` used `all_to_all`, which could incur many copies if the tensor becomes non-contiguous during the splits, and `all_to_all` itself also incurs copies.
This PR uses `alltoall_single` instead, so that we minimize tensor copies.
Tested on all the shard-dim-change tests; it works properly:
```
pytest test/distributed/tensor/test_redistribute.py -s -k shard_dim_alltoall
```
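For context, a minimal sketch of the `all_to_all_single` pattern (illustrative only, not the DTensor implementation): it exchanges equal slices of one contiguous buffer instead of a list of per-rank tensors, which is where the extra copies came from.
```python
import torch
import torch.distributed as dist

def exchange_shard(x: torch.Tensor, group=None) -> torch.Tensor:
    # Single contiguous send/recv buffers; the first dim must be divisible by
    # the group size. Avoids the per-chunk tensor list that dist.all_to_all needs.
    inp = x.contiguous()
    out = torch.empty_like(inp)
    dist.all_to_all_single(out, inp, group=group)
    return out
```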
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148868
Approved by: https://github.com/tianyu-l
# Fix typo errors across PyTorch codebase
This PR fixes various spelling errors throughout the PyTorch codebase to improve documentation quality and code readability.
## Changes Made
### Documentation Fixes
- Changed "seperate" to "separate" in multiple files:
- `setup.py`: Build system documentation
- `torch/_library/triton.py`: AOT compilation comments
- `torch/csrc/dynamo/compiled_autograd.h`: Node compilation documentation
- `torch/export/_unlift.py`: Pass population comments
- `torch/export/exported_program.py`: Decomposition table notes
### Code Comments and Error Messages
- Changed "occured" to "occurred" in:
- `test/mobile/test_lite_script_module.py`: Exception handling comments
- `torch/export/_draft_export.py`: Error message text
- `aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp`: MAGMA bug comment
- `torch/csrc/utils/python_numbers.h`: Overflow handling comment
- `torch/csrc/jit/OVERVIEW.md`: Graph compilation documentation
- `torch/_dynamo/symbolic_convert.py`: Error explanation
### API Documentation
- Changed "fullfill" to "fulfill" in `torch/distributed/checkpoint/state_dict_loader.py`
- Changed "accross" to "across" in:
- `torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp`
- `torch/distributed/distributed_c10d.py`
## Motivation
These changes improve code readability and maintain consistent spelling throughout the codebase. No functional changes were made; this is purely a documentation and comment improvement PR.
## Test Plan
No testing required as these changes only affect comments and documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148262
Approved by: https://github.com/janeyx99
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we launch the collective directly on the "current" stream, instead of launching it on a trampoline stream and joining back.
   - Resolves #147729
   - Resolves #146881
   - Also saves two event syncs (which have overhead in the case of HIP) and one pybind call when we call `work.wait()` in distributed_c10d.py on behalf of the user.
2. Entirely remove `record_stream` and use CPU-side stashing to manage tensor lifetimes against allocator recycling.
   - Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against the user not calling `work.wait()`, we ask the watchdog to unstash the tensors once it detects completion of the collectives, so that we do not hold references to the tensors forever. This is a safety net rather than a service guarantee; see the discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode will look different: collective kernels will show up on the same line as compute kernels.
Joint work with @cenzhaometa who wants to remove the event sync overhead.
Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj
Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
This adds `abort` and `shutdown` to `Backend` and `ProcessGroup` objects. This simplifies the logic in `distributed_c10d.py` by having a default no-op implementation for all PGs.
This will be useful for torchft and for upcoming versions of NCCL, which will handle abort correctly. Currently `torchft` has to call the internal `_abort` method on the PGNCCL object directly, but with this change it can just call `.abort()` and have it work for any PG implementation.
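A usage sketch based on this description (assumes the new methods are exposed as `abort()`/`shutdown()` on the process group object):
```python
import torch.distributed as dist

# Single-rank gloo group, for illustration only.
dist.init_process_group(
    "gloo", rank=0, world_size=1, init_method="tcp://127.0.0.1:29500"
)
pg = dist.group.WORLD

pg.shutdown()  # default implementation is a no-op for backends without special teardown
# pg.abort()   # likewise generic; no need to reach into PGNCCL._abort()
```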
Test plan:
```
pytest distributed/test_backends.py distributed/test_c10d_common.py distributed/test_c10d_pypg.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148798
Approved by: https://github.com/kwen2501
This PR adds support for non-functional collectives under `FakeTensorMode` and `fake_pg`. It helps eliminate the patching of collectives for memory and runtime estimation.
It also modifies `ModTracker` to enable the post-backward hook call for modules whose inputs don't require gradients but whose parameters do.
For memory tracking, we now enable tracking of the DTensor dispatcher for custom dispatch functions like `entropy_loss`.
The dispatcher is only enabled for the memory-tracking part and is disabled as soon as that is done.
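A hedged sketch of what this enables; the `FakeStore`-based setup of the `fake` backend follows the pattern used in PyTorch tests and is an assumption here:
```python
import torch
import torch.distributed as dist
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.testing._internal.distributed.fake_pg import FakeStore

# "fake" process group: no real communication, useful for estimation passes.
dist.init_process_group("fake", rank=0, world_size=8, store=FakeStore())

with FakeTensorMode():
    t = torch.randn(1024)
    dist.all_reduce(t)  # non-functional collective, no patching needed for estimation
```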
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147566
Approved by: https://github.com/weifengpy
Fix the following compilation error:
```
/usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:91:16: error: invalid application of 'sizeof' to an incomplete type 'torch::jit::AliasDb::WriteRegistry'
91 | static_assert(sizeof(_Tp)>0,
| ^~~~~~~~~~~
/usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:399:4: note: in instantiation of member function 'std::default_delete<torch::jit::AliasDb::WriteRegistry>::operator()' requested here
399 | get_deleter()(std::move(__ptr));
| ^
../torch/csrc/jit/ir/alias_analysis.cpp:200:10: note: in instantiation of member function 'std::unique_ptr<torch::jit::AliasDb::WriteRegistry>::~unique_ptr' requested here
200 | AliasDb::~AliasDb() = default;
| ^
../torch/csrc/jit/ir/alias_analysis.cpp:200:23: note: in defaulted destructor for 'torch::jit::AliasDb' first required here
200 | AliasDb::~AliasDb() = default;
| ^
../torch/csrc/jit/ir/alias_analysis.h:298:10: note: forward declaration of 'torch::jit::AliasDb::WriteRegistry'
298 | struct WriteRegistry;
| ^
1 error generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148758
Approved by: https://github.com/Skylion007
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: dtolnay
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148501
Approved by: https://github.com/Skylion007
This is a forward fix for #135338.
It hits an error like this:
```
"distributed_c10d.py", line 2156, in destroy_process_group
if type(pg) == ProcessGroup and pg._has_hooks():
RuntimeError: Could not find the default backend type 0 for Process Group with name undefined.
```
When users call `init_process_group()` without specifying a backend, the default backend is not set (or is set to `undefined`), hence the error above, which is triggered by the `_has_hooks()` call.
The fix wraps `getDefaultBackend` in a try-catch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148596
Approved by: https://github.com/LucasLLC, https://github.com/fduwjj
This PR fixes an issue of inability to capture `isend`/`irecv` ops in `async` mode.
<details>
<summary>The repro code</summary>
```Python
import os

import torch
import torch.distributed as dist

USE_ASYNC = True


def test_func(x, rank):
    if rank == 0:
        x += 1
        # Send the tensor to process 1
        if USE_ASYNC:
            a = dist.isend(tensor=x, dst=1)
        else:
            dist.send(tensor=x, dst=1)
    else:
        # Receive tensor from process 0
        if USE_ASYNC:
            a = dist.irecv(tensor=x, src=0)
        else:
            dist.recv(tensor=x, src=0)
    if USE_ASYNC:
        a.wait()
    return x + 2


def run(rank):
    torch.cuda.set_device(rank)
    x = torch.ones(1, device='cuda')
    with torch.cuda.stream(torch.cuda.Stream()):
        for i in range(11):
            x.copy_(torch.ones(1, device='cuda'))
            y = test_func(x, rank)
            print(f"Rank{rank} has data {y} in warmup")
    torch.cuda.synchronize()

    graph = torch.cuda.CUDAGraph()
    x.copy_(torch.ones(1, device='cuda'))
    with torch.cuda.graph(graph):
        y = test_func(x, rank)

    for i in range(1):
        x.copy_(torch.ones(1, device='cuda'))
        graph.replay()
        print(f"Rank{rank} has data {y} after graph replay")


def main():
    rank = int(os.environ['RANK'])
    local_rank = int(os.environ['LOCAL_RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    run(local_rank)


if __name__ == "__main__":
    main()
```
</details>
It fails with an error stating that the work handle is of `NoneType`:
```
[rank1]: Traceback (most recent call last):
[rank1]: File "/workspace/repro.py", line 54, in <module>
[rank1]: main()
[rank1]: File "/workspace/repro.py", line 51, in main
[rank1]: run(local_rank)
[rank1]: File "/workspace/repro.py", line 38, in run
[rank1]: y = test_func(x, rank)
[rank1]: ^^^^^^^^^^^^^^^^^^
[rank1]: File "/workspace/repro.py", line 22, in test_func
[rank1]: a.wait()
[rank1]: ^^^^^^
[rank1]: AttributeError: 'NoneType' object has no attribute 'wait'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148462
Approved by: https://github.com/kwen2501
1. My company uses privateuseone to integrate a new hardware device and needs the `batch_isend_irecv` function. However, `batch_isend_irecv` is currently only available for CUDA, so this PR adds a `supports_coalescing` property to `c10d::Backend` to determine whether a backend supports coalescing (see the usage sketch after this list).
2. If `pg._has_hooks` returns True, we don't need to check whether the current device is CUDA, so privateuseone can also support `pg._wait_for_pending_works`.
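A usage sketch of `batch_isend_irecv` (standard public API; tensor device depends on the backend, and a non-CUDA backend can now take this path if it reports `supports_coalescing`):
```python
import torch
import torch.distributed as dist

# Assumes an initialized process group on a backend that supports coalescing.
rank = dist.get_rank()
world_size = dist.get_world_size()

send_buf = torch.ones(4)   # move to the backend's device as appropriate
recv_buf = torch.empty(4)
ops = [
    dist.P2POp(dist.isend, send_buf, (rank + 1) % world_size),
    dist.P2POp(dist.irecv, recv_buf, (rank - 1) % world_size),
]
for req in dist.batch_isend_irecv(ops):
    req.wait()
```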
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338
Approved by: https://github.com/kwen2501, https://github.com/albanD
Summary:
LLVM has a warning `-Wunused-value` which we treat as an error because it's so often diagnostic of a code issue. Unused values often indicate a programming mistake, but can also just be unnecessary cruft that harms readability and performance.
For questions/comments, contact r-barnes.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Differential Revision: D69945678
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147555
Approved by: https://github.com/Skylion007, https://github.com/eqy
This is a follow-up to #147465 that changes most `TORCH_CHECK` calls in TCPStore and TCPStoreLibUvBackend to throw typed exceptions instead of generic `TORCH_CHECK` failures, which surface as `RuntimeError`s in Python.
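A hedged illustration (assumes the typed c10d exception classes such as `DistNetworkError`/`DistStoreError` are exposed on `torch.distributed`):
```python
from datetime import timedelta

import torch.distributed as dist

try:
    # Client-side connect to a store that may not be up.
    store = dist.TCPStore(
        "localhost", 29500, world_size=2, is_master=False,
        timeout=timedelta(seconds=5),
    )
except (dist.DistNetworkError, dist.DistStoreError) as e:
    print(f"typed store error instead of a bare RuntimeError: {e}")
```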
Test plan:
```
pytest test/distributed/test_store.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147647
Approved by: https://github.com/fduwjj
This makes it easier to roll out `TORCHELASTIC_USE_AGENT_STORE` by opportunistically swallowing bind errors when the agent store is enabled and the port matches `MASTER_PORT`.
This should be very safe: if the store is somehow not up while those env vars are set, the TCPStore client connections will fail to connect, so we end up with a slightly different error message but identical success/failure behavior.
This also pybinds `c10d::SocketError` into Python so we can assert on the error type in tests.
https://docs.google.com/document/d/1CzOn_N53AiFxWGgbyMWSnd2elCJd4lZ-ajPg2lzcxoM/edit?tab=t.0#heading=h.2j2f5dimrdau
Test plan:
```
pytest test/distributed/test_store.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147465
Approved by: https://github.com/fduwjj
This lets DDP use a comm-optimized memory pool, so that NVLink SHARP comes with zero-copy on H100+ platforms. This means less SM usage and less memory contention between NCCL kernels and compute kernels.
Added env `DDP_DISABLE_COMM_MEM` as a back-out option:
```
An environment variable to disable comm-optimized memory pool.
Default is 0, which means comm-optimized memory pool is enabled.
Users can set it to 1 in case of seeing regression or OOM (because this
comm MemPool may not share space with regular compute MemPool).
```
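For example, backing out simply means setting the variable before initializing the process group / DDP:
```python
import os

# Disable the comm-optimized memory pool if you see regressions or OOM
# (it may not share space with the regular compute MemPool).
os.environ["DDP_DISABLE_COMM_MEM"] = "1"
```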
Differential Revision: [D69297766](https://our.internmc.facebook.com/intern/diff/D69297766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146589
Approved by: https://github.com/syed-ahmed, https://github.com/c-p-i-o, https://github.com/fduwjj
Summary: D68801098 introduced a function signature mismatch for `printNcclCommProxyTrace`. Revert it so that the trunk build can pass.
Test Plan:
With this change, the build of the APS model using rcclexp now passes:
`sh scripts/ltian/run_jobs/fb_fm_v2/run_fb_fm_v2_job.sh -h T20_GTT_MI300X -n 16 -b 1024 -t [2024-12-06] -d ai_infra_ngs -e ai_infra_training_rnd_tc -x 0`
Reviewed By: c-p-i-o
Differential Revision: D69149588
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146453
Approved by: https://github.com/c-p-i-o
@fegin found an issue where torchft is not compatible with functional collectives.
Found in https://github.com/pytorch/torchtitan/pull/806
The root cause is because PyProcessGroup/PyWork are not compatible with functional collectives due to a nasty ownership bug.
PyWork relies on a pybind trampoline to propagate requests to Python. Unfortunately, the way pybind works, the Python object owns the C++ object rather than there being some form of shared ownership. Thus, the PyWork Python object gets collected when it is returned to C++ from the PyProcessGroup, while the C++ PyWork object still exists. When that PyWork object is later used, this causes a deadlock because the corresponding Python object no longer exists.
To solve this, we introduce a new `PyWorkHolder` class which holds a reference to the `py::object` as well as to the trampoline class. This resolves the dependency issue, since we can now hold ownership in C++ of both the Python and C++ objects.
To make this cleaner, we introduce a `WORK_OVERRIDE` macro, a patched version of `PYBIND11_OVERRIDE` that returns a `PyWorkHolder` rather than just a `PyWork`, and use it for all collectives in `PyProcessGroup`.
Test plan:
```
cd pytorch
pytest test/distributed/test_c10d_functional_native.py
```
```
cd torchft
pytest torchft/process_group_test.py -k functional -v -x -s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146376
Approved by: https://github.com/yifuwang
This PR implements a small UI improvement over #133603.
It prepares an NCCL memory allocator in the torch C++ layer and then pybinds it out, so that users can use it directly.
UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab