Fixes #ISSUE_NUMBER
When attempting to reconfigure the environment without properly handling the PyTorch-related settings, you may encounter the following message.
```
│ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/distributed/distribut │
│ ed_c10d.py:1215 in get_backend │
│ │
│ 1212 │ if _rank_not_in_group(pg): │
│ 1213 │ │ raise ValueError("Invalid process group specified") │
│ 1214 │ pg_store = _world.pg_map[pg] if pg in _world.pg_map else None │
│ ❱ 1215 │ return Backend(not_none(pg_store)[0]) │
│ 1216 │
│ 1217 │
│ 1218 def _get_process_group_uid(pg: ProcessGroup) -> int: │
│ │
│ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/utils/_typing_utils.p │
│ y:13 in not_none │
│ │
│ 10 │
│ 11 def not_none(obj: Optional[T]) -> T: │
│ 12 │ if obj is None: │
│ ❱ 13 │ │ raise TypeError("Invariant encountered: value was None when it should not be") │
│ 14 │ return obj │
│ 15 │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: Invariant encountered: value was None when it should not be
Exception ignored in: <function Vllm.__del__ at 0x7f35f96b6dd0>
```
Since this message can confuse developers, this PR suggests adding details to help clarify the situation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141796
Approved by: https://github.com/kwen2501
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on the "current" stream, instead of launching on a trampoline stream and joining back (see the sketch after this list).
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in the case of HIP) and one pybind call when we call `work.wait()` in distributed_c10d.py on behalf of the user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against the user not calling `work.wait()`, we ask the watchdog to unstash tensors after detecting completion of collectives, so we don't hold references to tensors forever. This is a safety net rather than a service guarantee; see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode will look different: collective kernels now show up on the same line as compute kernels.
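For reference, a minimal usage sketch of the two modes (not part of this PR's diff; assumes an initialized NCCL process group and one GPU per rank):
```
# async_op=False: the collective launches on the caller's current stream, so no
# extra event syncs or stream joins are needed.
import torch
import torch.distributed as dist

x = torch.ones(4, device="cuda")
dist.all_reduce(x)

# async_op=True: the tensors are stashed CPU-side until wait() is called.
work = dist.all_reduce(x, async_op=True)
work.wait()  # releases the stashed references
```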
Joint work with @cenzhaometa who wants to remove the event sync overhead.
Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj
Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
# Fix typo errors across PyTorch codebase
This PR fixes various spelling errors throughout the PyTorch codebase to improve documentation quality and code readability.
## Changes Made
### Documentation Fixes
- Changed "seperate" to "separate" in multiple files:
- `setup.py`: Build system documentation
- `torch/_library/triton.py`: AOT compilation comments
- `torch/csrc/dynamo/compiled_autograd.h`: Node compilation documentation
- `torch/export/_unlift.py`: Pass population comments
- `torch/export/exported_program.py`: Decomposition table notes
### Code Comments and Error Messages
- Changed "occured" to "occurred" in:
- `test/mobile/test_lite_script_module.py`: Exception handling comments
- `torch/export/_draft_export.py`: Error message text
- `aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp`: MAGMA bug comment
- `torch/csrc/utils/python_numbers.h`: Overflow handling comment
- `torch/csrc/jit/OVERVIEW.md`: Graph compilation documentation
- `torch/_dynamo/symbolic_convert.py`: Error explanation
### API Documentation
- Changed "fullfill" to "fulfill" in `torch/distributed/checkpoint/state_dict_loader.py`
- Changed "accross" to "across" in:
- `torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp`
- `torch/distributed/distributed_c10d.py`
## Motivation
These changes improve code readability and maintain consistent spelling throughout the codebase. No functional changes were made; this is purely a documentation and comment improvement PR.
## Test Plan
No testing required as these changes only affect comments and documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148262
Approved by: https://github.com/janeyx99
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on the "current" stream, instead of launching on a trampoline stream and joining back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in the case of HIP) and one pybind call when we call `work.wait()` in distributed_c10d.py on behalf of the user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against the user not calling `work.wait()`, we ask the watchdog to unstash tensors after detecting completion of collectives, so we don't hold references to tensors forever. This is a safety net rather than a service guarantee; see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode will look different: collective kernels now show up on the same line as compute kernels.
Joint work with @cenzhaometa who wants to remove the event sync overhead.
Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj
Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
This adds `abort` and `shutdown` to `Backend` and `ProcessGroup` objects. This simplifies the logic in `distributed_c10d.py` by having a default noop implementation for all PGs.
This will be useful for torchft and upcoming versions of NCCL, which will handle abort correctly. Currently `torchft` would have to call the internal method `_abort` on the PGNCCL object directly, but with this change we can now just call `.abort()` and have it work for any PG implementation.
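A minimal usage sketch (not from this PR; a single-process gloo group is used here only to obtain a `ProcessGroup` handle):
```
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

pg = dist.group.WORLD  # handle to the default ProcessGroup
pg.abort()             # uniform entry point; defaults to a no-op for backends
                       # that need no special abort handling
```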
Test plan:
```
pytest distributed/test_backends.py distributed/test_c10d_common.py distributed/test_c10d_pypg.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148798
Approved by: https://github.com/kwen2501
1. My company is using privateuseone to connect a new hardware device and requires the `batch_isend_irecv` function. However, `batch_isend_irecv` is currently only open to CUDA, so I add a `supports_coalescing` property in `c10d::Backend` to determine whether a backend supports coalescing (see the sketch after this list).
2. If `pg._has_hooks` returns True, we don't need to check whether the current device is CUDA, so privateuseone can also support `pg._wait_for_pending_works`.
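For illustration only, a hedged sketch of the coalesced point-to-point pattern this property gates (assumes an initialized process group on a backend that reports coalescing support):
```
import torch
import torch.distributed as dist

rank, world = dist.get_rank(), dist.get_world_size()
send_t = torch.ones(4)
recv_t = torch.empty(4)
ops = [
    dist.P2POp(dist.isend, send_t, (rank + 1) % world),
    dist.P2POp(dist.irecv, recv_t, (rank - 1) % world),
]
for work in dist.batch_isend_irecv(ops):
    work.wait()
```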
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338
Approved by: https://github.com/kwen2501, https://github.com/albanD
# MOTIVATION
Intel Gaudi is an out-of-tree PyTorch accelerator with its own device/dispatch key `hpu`.
With this change we add entries for Gaudi's distributed backend `hccl` to the c10d Backend data structures.
This ensures there is no naming conflict in case a new in-tree accelerator is introduced with the same backend name.
Out-of-tree backends are registered by calling fd0cd6a08f/torch/distributed/distributed_c10d.py (L302)
Successful registration adds the backend name to the list:
fd0cd6a08f/torch/distributed/distributed_c10d.py (L265)
We bind the process-group creator constructs at run time, so if there are other distributed backends with the same name they can safely add their device type to the dictionary
fd0cd6a08f/torch/distributed/distributed_c10d.py (L274)
and add another entry to the dictionary with the same backend name (but a different device name):
fd0cd6a08f/torch/distributed/distributed_c10d.py (L268)
In addition, out-of-tree devices can use `backend_list` to check for successful backend registration, e.g. in APIs like `is_hccl_available`.
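A hedged sketch of the registration path described above (the creator function is hypothetical; the `hccl`/`hpu` names follow this PR):
```
import torch.distributed as dist

def _create_hccl_backend(store, rank, world_size, timeout):
    # hypothetical creator: construct and return the out-of-tree backend object
    raise NotImplementedError

dist.Backend.register_backend("hccl", _create_hccl_backend, devices=["hpu"])

# After registration, "hccl" appears in dist.Backend.backend_list, which helpers
# such as is_hccl_available() can consult.
```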
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146478
Approved by: https://github.com/H-Huang
Summary: Previously we added support for `all_reduce` to non-strict export. This PR extends that support to other non-functional collectives that are remapped in Dynamo: `all_gather`, `all_gather_into_tensor`, `all_to_all_single`, `reduce_scatter_tensor`.
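As an illustration only (not from this PR's diff), a module of the kind this support targets, exported non-strictly; assumes an initialized process group:
```
import torch
import torch.distributed as dist

class Gather(torch.nn.Module):
    def forward(self, x):
        out = torch.empty(dist.get_world_size() * x.numel(), dtype=x.dtype)
        dist.all_gather_into_tensor(out, x)  # non-functional collective, remapped during tracing
        return out

# ep = torch.export.export(Gather(), (torch.randn(4),), strict=False)
```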
Test Plan: added unit tests
Differential Revision: D69813991
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147417
Approved by: https://github.com/angelayi
Summary:
Some distributed collectives like `all_reduce` have special handling in Dynamo, where they are mapped to functional collectives. Non-strict was previously blind to such mappings, which means using them would fail to trace. Here we show how intercepting them in non-strict's torch function mode can mimic this remapping logic. More ops to follow.
Side note: a recently added distributed test was in the wrong place, making the expected failures for non-strict not fire because we weren't actually generating those tests to begin with! Now fixed.
Test Plan: moved and updated test
Differential Revision: D69607140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147133
Approved by: https://github.com/tugsbayasgalan
This enables a check that a class which only inherits from immutable classes like str, tuple, and NamedTuple also defines `__slots__`, so its instances don't allocate memory unnecessarily. This also ensures contributors think about how they define classes that subclass NamedTuple and str, of which we have many in our codebase.
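An illustrative (hypothetical) example of the pattern the check enforces:
```
from typing import NamedTuple

class Point(NamedTuple):
    x: int
    y: int

class LabeledPoint(Point):
    __slots__ = ()  # required by the check: no per-instance __dict__ is allocated

    def label(self) -> str:
        return f"({self.x}, {self.y})"
```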
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146276
Approved by: https://github.com/aorenste
### Motivation:
As illustrated in the Intel distributed support RFC https://github.com/pytorch/pytorch/issues/141741, two sections of work are needed to enable Intel distributed backend (`XCCL`) support in PyTorch.
1. Intel GPU distributed backend integration in PyTorch `torch-xpu-ops`.
2. **Intel distributed backend registration in the PyTorch distributed package**. This PR contributes the section 2 change.
### Example:
Here is a simple example of using spawn to launch XCCL backend and perform allreduce on XPU tensors.
```
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def run_allreduce(rank, world_size):
    setup(rank, world_size)
    device = torch.device('xpu:{}'.format(rank))
    x = torch.randn([2, 2], device=device)
    dist.all_reduce(x)
    cleanup()

if __name__ == '__main__':
    world_size = 2
    mp.spawn(run_allreduce, args=(world_size,), nprocs=world_size, join=True)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141856
Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD
Today, when a user calls `init_process_group()` without specifying `backend` or `device_id`, we auto-translate it into `cuda:nccl,cpu:gloo`. The idea was to initialize all **default** backends to cover what the user may do later.
A side effect is increased initialization time and resource usage.
This PR changes it to detect the accelerator type on the machine and initialize only the backend for that accelerator.
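A minimal sketch of the resulting behavior (not from this PR; assumes the rendezvous env vars are set):
```
import torch.distributed as dist

dist.init_process_group(rank=0, world_size=1)  # no backend, no device_id
print(dist.get_backend())  # only the backend for the detected accelerator is initialized
```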
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142216
Approved by: https://github.com/wconstab, https://github.com/XilunWu
# Motivation
The distributed APIs rely on backend names for the creation of a process group.
To abstract references to these names out of PG creation, an API is added to get the default distributed backend for a device.
The device code needs to register its device and backend via `torch.distributed.Backend.register_backend` or update the map `torch.distributed.Backend.default_device_backend_map["device"] = "distributed_backend"` prior to using the API.
An example of its use is added in the test file (which can be used to check the abstracted APIs).
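A hedged sketch of the registration step described above (device and backend names are placeholders):
```
import torch.distributed as dist

# Placeholder names for an out-of-tree device and its distributed backend.
dist.Backend.default_device_backend_map["my_device"] = "my_backend"

# Alternatively, register a creator function (hypothetical) along with its device:
# dist.Backend.register_backend("my_backend", _create_my_backend, devices=["my_device"])
```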
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140536
Approved by: https://github.com/kwen2501
Changes the semantics of `__repr__` of `P2POp`: s, d are now group ranks instead of global ranks. I think this is OK since I also updated the field names to make this obvious.
Also add mypy annotations
Partially addresses RFC 0042 (pytorch/rfcs#71)
See more details/motivation in #140460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141054
Approved by: https://github.com/kwen2501
* Automatically applies ruff rule 401, turning loops into equivalent list comprehensions, which are faster and do not leak the loop variables into the enclosing scope (illustrated after this list).
* List comprehensions not only often have better typing, but are also 50+% faster than for loops in terms of overhead. They also preserve length information, etc., and are better for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt
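An illustrative (hypothetical) before/after of the kind of rewrite applied:
```
values = range(10)

# before: explicit loop appending to a list (the loop variable leaks into the scope)
squares = []
for v in values:
    squares.append(v * v)

# after: equivalent list comprehension
squares = [v * v for v in values]
```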
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
Citing @malfet's [comment](https://github.com/pytorch/pytorch/pull/136343#pullrequestreview-2318792396) in https://github.com/pytorch/pytorch/pull/136343
> It would be great, if users do not have to modify their programs for every new backend, but rather use with torch.device('xpu'): and keep rest of the code unchanged.
This PR makes the backend specification ("nccl", "gloo") optional when the user provides a `device_id` to `init_process_group` (acceptance of `device_id` was previously added for the purpose of eager init).
New user experience:
```
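# device_type is e.g. "cuda" or "xpu"; rank and device_count come from the launcher/runtime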
device = torch.device(device_type, rank % device_count)
dist.init_process_group(device_id=device)
```
The `device = torch.device(...)` line is needed anyway, because the user would use it for tensor creation etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140963
Approved by: https://github.com/wconstab
Also add missing mypy typing and a few asserts to make mypy happy
Partially addresses RFC 0042 (pytorch/rfcs#71)
See more details/motivation in #140460
Note: the object collective version canonicalizes to global instead of group rank, simply because this left more of the original code intact and required fewer conversions overall.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140827
Approved by: https://github.com/kwen2501
Avoid copy-pasting the send/isend and recv/irecv implementations.
This does change the warning issued from send to include the identifier "isend" instead of "send", but I think that's not a big deal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140815
Approved by: https://github.com/fegin
ghstack dependencies: #140460
Summary:
Currently, whether split should be used depends on the size of the subgroup.
It's possible that the default PG is not yet eagerly initialized, but split is still called.
This PR fixes the issue by removing split's dependency on the subgroup size.
Test Plan:
Modified UT
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138781
Approved by: https://github.com/kwen2501