Commit Graph

440 Commits

Author SHA1 Message Date
Tsung-Hsien Lee
f1f54c197d [c10d] Simplify new_subgroups() by using new_subgroups_by_enumeration() (#153843)
Summary: The code changes in each file of the diff include removing the `subgroups` and `cur_subgroup` variables, and replacing the while loop with a call to `new_subgroups_by_enumeration()`.
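
As a rough illustration of the enumeration this change delegates to, the sketch below builds the rank lists for contiguous subgroups of a fixed size. It is plain Python rather than the PyTorch implementation; the helper name `contiguous_subgroup_ranks` is hypothetical, and the divisibility check is an assumption about the expected input.

```python
def contiguous_subgroup_ranks(world_size: int, group_size: int) -> list[list[int]]:
    """Return rank lists for consecutive subgroups of `group_size` ranks."""
    if group_size <= 0 or group_size > world_size:
        raise ValueError("group_size must be in [1, world_size]")
    if world_size % group_size != 0:
        raise ValueError("world_size must be divisible by group_size")
    return [
        list(range(start, start + group_size))
        for start in range(0, world_size, group_size)
    ]


# Eight ranks split into subgroups of four ranks each.
ranks_per_subgroup = contiguous_subgroup_ranks(world_size=8, group_size=4)
print(ranks_per_subgroup)
```

A list of this shape is presumably what gets handed to `new_subgroups_by_enumeration()` in place of the removed while loop.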

Test Plan: contbuild & OSS CI

Differential Revision: D75007368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153843
Approved by: https://github.com/Skylion007, https://github.com/wz337
2025-05-20 19:15:20 +00:00
Tsung-Hsien Lee
6487ea30b3 [c10d] Fix new_subgroups(group=) bug (#153798)
Summary: The bug, introduced in https://github.com/pytorch/pytorch/pull/152765, was caused by passing the `group` parameter to `get_rank()`, which made the call return the process's rank within `group` instead of its global rank. The fix removes the `group` argument from the `get_rank()` call.
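
To see why the rank's frame of reference matters, here is a small pure-Python sketch, assuming that subgroups are described with global ranks and that a rank queried relative to a group is the process's index within that group. The rank values and helper names are made up for illustration.

```python
from typing import Optional

# Global ranks that make up a hypothetical parent group.
parent_group_ranks = [4, 5, 6, 7]

# Subgroups of that parent group, described with global ranks.
subgroups = [[4, 5], [6, 7]]


def group_rank_of(global_rank: int, group_ranks: list[int]) -> int:
    """Rank of a process relative to a group, i.e. its index in the group."""
    return group_ranks.index(global_rank)


def find_subgroup(rank: int) -> Optional[list[int]]:
    """Return the subgroup whose rank list contains `rank`, if any."""
    for ranks in subgroups:
        if rank in ranks:
            return ranks
    return None


global_rank = 6
group_rank = group_rank_of(global_rank, parent_group_ranks)

# Membership tests against global rank lists only work with the global rank.
print(find_subgroup(global_rank))  # the subgroup [6, 7]
print(find_subgroup(group_rank))   # None: group rank 2 is not a global rank in any subgroup
```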

Test Plan: contbuild & OSS CI

Differential Revision: D74964213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153798
Approved by: https://github.com/Skylion007
2025-05-19 17:01:10 +00:00
Deep Shah
2489b6470b [c10d] Allow split_group to work with non nccl backends (#152175)
Summary:
Currently `split_group` is hardcoded to work only with the NCCL backend. Extend it
to support NCCL plus custom plugin backends.

The split-specific methods/attributes have not been added to the base
Backend and Options as some of them are specific to backend implementations.
Instead, explicit checks have been added to the split_group method for the
expected methods and attributes.
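
The explicit checks described above amount to duck typing. A minimal sketch of that idea follows; the member names `do_split` and `split_counter` and both backend classes are hypothetical and are not the names used in `split_group`.

```python
# Hypothetical member names; the real ones depend on the backend implementation.
REQUIRED_FOR_SPLIT = ("do_split", "split_counter")


def missing_members(backend: object, required: tuple[str, ...]) -> list[str]:
    """Return the required attribute/method names that `backend` lacks."""
    return [name for name in required if not hasattr(backend, name)]


def check_can_split(backend: object) -> None:
    """Raise if `backend` does not expose everything a split would need."""
    missing = missing_members(backend, REQUIRED_FOR_SPLIT)
    if missing:
        raise RuntimeError(
            f"{type(backend).__name__} cannot be split; missing: {', '.join(missing)}"
        )


class FullBackend:
    """Exposes both hypothetical members."""

    split_counter = 0

    def do_split(self) -> None:
        self.split_counter += 1


class PartialBackend:
    """Exposes the method but not the counter attribute."""

    def do_split(self) -> None:
        pass


check_can_split(FullBackend())  # passes silently
print(missing_members(PartialBackend(), REQUIRED_FOR_SPLIT))
```

Checking at the call site keeps the base interface small, at the cost of discovering a missing member only when a split is attempted.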

I am open to making them part of the base Backend if folks prefer.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152175
Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501
2025-05-16 00:15:29 +00:00
Tsung-Hsien Lee
dfcfad2112 [c10d] Fix unused group input argument in new_subgroups() (#152765)
Summary: This diff fixes an unused input argument [`group`](8faa225695/torch/distributed/distributed_c10d.py (L5341)) in the `new_subgroups()` function.

Test Plan: contbuild & OSS CI, see

Differential Revision: D74132537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152765
Approved by: https://github.com/wz337
2025-05-07 02:37:51 +00:00
Ke Wen
a8f727c439 [c10d] Fix extra CUDA context created by barrier (#149144)
Fixes #149119.

In ProcessGroup.hpp, we create a dummy tensor for dispatching. This
requires a correct device index. This PR uses `device_id` given by user
when calling `init_process_group`.

This PR also uses `torch._C._get_accelerator()` to determine the device
type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149144
Approved by: https://github.com/XilunWu, https://github.com/fduwjj, https://github.com/cyyever
2025-05-06 15:27:30 +00:00
PyTorch MergeBot
cc954848d4 Revert "[c10d] Fix extra CUDA context created by barrier (#149144)"
This reverts commit 457fa820ad.

Reverted https://github.com/pytorch/pytorch/pull/149144 on behalf of https://github.com/huydhn due to Internal failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/149144#issuecomment-2852564660))
2025-05-05 22:56:50 +00:00
Ke Wen
457fa820ad [c10d] Fix extra CUDA context created by barrier (#149144)
Fixes #149119.

In ProcessGroup.hpp, we create a dummy tensor for dispatching. This
requires a correct device index. This PR uses `device_id` given by user
when calling `init_process_group`.

This PR also uses `torch._C._get_accelerator()` to determine the device
type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149144
Approved by: https://github.com/XilunWu, https://github.com/fduwjj, https://github.com/cyyever
2025-05-03 03:13:34 +00:00
Anthony Shoumikhin
e2f9759bd0 Fix broken URLs (#152237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-04-27 09:56:42 +00:00
Will Constable
9e235c549c [C10D] avoid computing global_rank when group_rank is used (#151373)
collective APIs accept either group or global rank for src/dst rank.

We provide a helper `_canonicalize_group_rank` which converts from maybe
group or maybe global to one particular format (defined by the kwarg
return_global: bool=False).

In this PR we stop performing the mapping lookup that converts group to
global or global to group in the case that the caller wants us to return
the same value that was passed in.  The PR should be functionally
equivalent, except in cases where the mapping itself would raise an
exception but the mapping was not necessary in the first place.

This has come up in cases where people create new process groups outside
of 'init_process_group' APIs and group-specific ranks may not have a
valid mapping to the 'global' rank.
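
The short-circuit can be sketched in plain Python as follows. This is a simplified stand-in, not the real `_canonicalize_group_rank`: the function name, the dict-based mapping, and the error messages are assumptions made for illustration.

```python
from typing import Optional


def canonicalize_rank(
    group_to_global: dict[int, int],
    *,
    global_rank: Optional[int] = None,
    group_rank: Optional[int] = None,
    return_global: bool = False,
) -> int:
    """Convert a maybe-group, maybe-global rank into the requested format."""
    if (global_rank is None) == (group_rank is None):
        raise ValueError("pass exactly one of global_rank or group_rank")

    if return_global:
        if global_rank is not None:
            return global_rank  # already global: no mapping lookup needed
        return group_to_global[group_rank]  # group -> global lookup

    if group_rank is not None:
        return group_rank  # already group-relative: no mapping lookup needed
    for candidate_group_rank, candidate_global in group_to_global.items():
        if candidate_global == global_rank:
            return candidate_group_rank  # global -> group lookup
    raise ValueError(f"global rank {global_rank} is not part of the group")


# With no usable mapping, a pass-through request still succeeds.
print(canonicalize_rank({}, group_rank=3))
```

Only the two branches that actually convert touch the mapping, so a group without a valid mapping fails only when a conversion is requested.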

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151373
Approved by: https://github.com/xunnanxu, https://github.com/d4l3k
2025-04-17 23:53:50 +00:00
Will Constable
c9a35c2a6e [C10D] Document object collectives limitations (#150815)
Adds louder warning labels in the doc page and docstring for object
collectives in hopes of raising awareness of several footgun issues
including accidental creation of cuda contexts by serializing and
sending 'device-local' gpu tensors over the object-* apis.

addresses #150798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150815
Approved by: https://github.com/kwen2501
2025-04-10 22:48:39 +00:00
Ke Wen
35c45a4a31 [Reland] Launch kernel on current stream & remove record_stream entirely (#150398)
Relanding #148590 due to merge conflict.

This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line as compute kernels.
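
The CPU-side stashing idea in points 2 to 4 can be pictured with a toy model: hold a reference to each tensor while an asynchronous collective may still be using it, and drop the references on `wait()`. The classes below are a hypothetical sketch in plain Python, not the `ProcessGroupNCCL` implementation; `TensorShelfSketch`, `WorkSketch`, and `launch_collective` are invented names.

```python
import threading


class TensorShelfSketch:
    """Keeps references alive until they are explicitly released."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._items: list[object] = []

    def stash(self, item: object) -> None:
        with self._lock:
            self._items.append(item)

    def unstash(self) -> None:
        with self._lock:
            self._items.clear()

    def __len__(self) -> int:
        with self._lock:
            return len(self._items)


class WorkSketch:
    """Handle returned to the caller; wait() releases the stashed references."""

    def __init__(self, shelf: TensorShelfSketch) -> None:
        self.shelf = shelf

    def wait(self) -> None:
        self.shelf.unstash()


def launch_collective(tensors: list[object], async_op: bool) -> WorkSketch:
    shelf = TensorShelfSketch()
    if async_op:
        # Only the asynchronous path needs lifetime management.
        for tensor in tensors:
            shelf.stash(tensor)
    return WorkSketch(shelf)


work = launch_collective([object(), object()], async_op=True)
held_before_wait = len(work.shelf)
work.wait()
held_after_wait = len(work.shelf)
print(held_before_wait, held_after_wait)
```

In this model a caller that never calls `wait()` would keep the references alive indefinitely, which is the situation the safety net in point 4 is meant to cover.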

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Squashed contents:

* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective
Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs. GPU kernel time: 160us).
This diff:
- async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead
- async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready
- pass down async from c10d down to NCCL-PG
This shaves off 50% of the CPU overhead **(70us -> 35us)**, which reduces total CPU/GPU time by 15%, **from 230us to 195us**.

* [PGNCCL] Make avoid-record-stream default

* [c10d] Add asyncOp argument to Ops

* Change python side wait

* Pass asyncOp at ProcessGroup level

* Watchdog unstashing tensors as a safety net

* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: https://github.com/pytorch/pytorch/pull/149753

* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: https://github.com/pytorch/pytorch/pull/150079

* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: https://github.com/pytorch/pytorch/pull/150130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150398
Approved by: https://github.com/atalman
2025-04-01 16:46:07 +00:00
Tristan Rice
159e97cbcf ProcessGroupGloo: support reduce_scatter + update support chart (#149869)
This adds a `reduce_scatter` implementation for ProcessGroupGloo. It is a fairly naive implementation, as it does one allreduce per rank, but it may be useful for testing FSDP etc. There was an existing implementation of reduce_scatter_tensor/reduce_scatter_tensor_coalesced that is very similar but requires a fixed tensor size per rank.
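
The "one allreduce per rank" approach can be simulated in a single process with plain lists. The sketch below only illustrates the data flow under that assumption; it is not the Gloo code, and the function names are hypothetical.

```python
def allreduce_sum(chunks_per_rank: list[list[float]]) -> list[float]:
    """Element-wise sum across ranks; in a real allreduce every rank gets this."""
    return [sum(column) for column in zip(*chunks_per_rank)]


def naive_reduce_scatter(inputs: list[list[list[float]]]) -> list[list[float]]:
    """inputs[src][dst] is the chunk that rank `src` contributes for rank `dst`.

    Runs one allreduce per destination rank; only that rank keeps the result.
    """
    world_size = len(inputs)
    outputs = []
    for dst in range(world_size):
        reduced = allreduce_sum([inputs[src][dst] for src in range(world_size)])
        outputs.append(reduced)
    return outputs


inputs = [
    [[1.0, 2.0], [3.0, 4.0]],      # rank 0: chunk for rank 0, chunk for rank 1
    [[10.0, 20.0], [30.0, 40.0]],  # rank 1: chunk for rank 0, chunk for rank 1
]
outputs = naive_reduce_scatter(inputs)
print(outputs)
```

Because every destination gets its own allreduce, the chunks for different ranks do not need to share a size, which is the flexibility the fixed-size variant lacks.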

If users find these functions to be too slow we can address them as issues arise.

Gloo now supports all major distributed operations. Quite a few of these were added by @rohan-varma and @yifuwang but they didn't update the support chart. We also have `CUDAWork` variants of most operations so those were also added to the chart.

Test plan:

```
pytest -v test/distributed/test_c10d_gloo.py -k reduce_scatter
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149869
Approved by: https://github.com/fduwjj
2025-03-25 01:16:12 +00:00
zhc7
a268c29b9f [distributed] fix: use group rank instead of global rank when possible (#149488)
Fixes #149200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149488
Approved by: https://github.com/wconstab
2025-03-20 21:47:03 +00:00
Yi Wang
ffa085334c Specify the default PyTorch Distributed backend for MPS (#149538)
Fixes #149537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149538
Approved by: https://github.com/d4l3k, https://github.com/malfet
2025-03-20 18:54:03 +00:00
fduwjj
8bf3f3fc43 [c10d] Add a collective time estimator for NCCL comms (#149343)
We want to upstream the feature from new nccl for users to estimate comm time.

Resolves #147753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149343
Approved by: https://github.com/kwen2501
2025-03-19 07:54:02 +00:00
PyTorch MergeBot
afa1eda901 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit ef6296e7f2.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/izaitsevfb due to reverted internally, see D71292427 ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2731114626))
2025-03-17 22:43:15 +00:00
Um Changyong
69aeb87eca update error message in get_backend() more detail_ (#141796)
When attempting to reconfigure the environment without properly handling the PyTorch-related settings, you may encounter the following message.
```
/root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1215 in get_backend

  1212 │   if _rank_not_in_group(pg):
  1213 │   │   raise ValueError("Invalid process group specified")
  1214 │   pg_store = _world.pg_map[pg] if pg in _world.pg_map else None
❱ 1215 │   return Backend(not_none(pg_store)[0])
  1216
  1217
  1218 def _get_process_group_uid(pg: ProcessGroup) -> int:

/root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/utils/_typing_utils.py:13 in not_none

  10
  11 def not_none(obj: Optional[T]) -> T:
  12 │   if obj is None:
❱ 13 │   │   raise TypeError("Invariant encountered: value was None when it should not be")
  14 │   return obj
  15

TypeError: Invariant encountered: value was None when it should not be
Exception ignored in: <function Vllm.__del__ at 0x7f35f96b6dd0>
```
Since this message can cause confusion for multiple developers, the purpose of this PR is to suggest additional details to help clarify the situation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141796
Approved by: https://github.com/kwen2501
2025-03-14 19:42:42 +00:00
Ke Wen
ef6296e7f2 [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line as compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-11 18:36:12 +00:00
PyTorch MergeBot
a95eb0c0a7 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit 2149f6c684.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/ZainRizvi due to Breaking internally, see D70873275. Discussed reverting this with Ke. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2712001270))
2025-03-10 22:38:40 +00:00
Aditya Tiwari
bb9c426024 Typo Errors fixed in multiple files (#148262)
# Fix typo errors across PyTorch codebase

This PR fixes various spelling errors throughout the PyTorch codebase to improve documentation quality and code readability.

## Changes Made

### Documentation Fixes
- Changed "seperate" to "separate" in multiple files:
  - `setup.py`: Build system documentation
  - `torch/_library/triton.py`: AOT compilation comments
  - `torch/csrc/dynamo/compiled_autograd.h`: Node compilation documentation
  - `torch/export/_unlift.py`: Pass population comments
  - `torch/export/exported_program.py`: Decomposition table notes

### Code Comments and Error Messages
- Changed "occured" to "occurred" in:
  - `test/mobile/test_lite_script_module.py`: Exception handling comments
  - `torch/export/_draft_export.py`: Error message text
  - `aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp`: MAGMA bug comment
  - `torch/csrc/utils/python_numbers.h`: Overflow handling comment
  - `torch/csrc/jit/OVERVIEW.md`: Graph compilation documentation
  - `torch/_dynamo/symbolic_convert.py`: Error explanation

### API Documentation
- Changed "fullfill" to "fulfill" in `torch/distributed/checkpoint/state_dict_loader.py`
- Changed "accross" to "across" in:
  - `torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp`
  - `torch/distributed/distributed_c10d.py`

## Motivation
These changes improve code readability and maintain consistent spelling throughout the codebase. No functional changes were made; this is purely a documentation and comment improvement PR.

## Test Plan
No testing required as these changes only affect comments and documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148262
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-09 12:21:40 +00:00
Ke Wen
2149f6c684 [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line as compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-09 07:32:23 +00:00
PyTorch MergeBot
9cb25f0ea2 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit 17dbeb11db.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/janeyx99 due to PR break backward compat test ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2708641172))
2025-03-09 03:01:55 +00:00
Ke Wen
17dbeb11db [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line as compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-08 20:00:12 +00:00
Tristan Rice
7ffadff286 c10d/ProcessGroup: cleanup abort and shutdown (#148798)
This adds `abort` and `shutdown` to `Backend` and `ProcessGroup` objects. This simplifies the logic in `distributed_c10d.py` by having a default noop implementation for all PGs.

This will be useful for torchft and upcoming versions of NCCL which will handle abort correctly. Currently `torchft` would have to call internal methods `_abort` on the PGNCCL object directly but with this change we can now just call `.abort()` and have it work for any PG implementation.
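
The benefit of a default no-op can be shown with a small sketch. The classes here are hypothetical stand-ins rather than the c10d `Backend` or `ProcessGroup` types.

```python
class BackendSketch:
    """Base class: abort() and shutdown() exist everywhere and default to no-ops."""

    def abort(self) -> None:
        pass

    def shutdown(self) -> None:
        pass


class AbortableBackendSketch(BackendSketch):
    """A backend that actually has something to tear down."""

    def __init__(self) -> None:
        self.aborted = False

    def abort(self) -> None:
        self.aborted = True


def abort_all(backends: list[BackendSketch]) -> None:
    # No isinstance checks and no private methods: every backend accepts abort().
    for backend in backends:
        backend.abort()


plain = BackendSketch()
abortable = AbortableBackendSketch()
abort_all([plain, abortable])
print(abortable.aborted)
```

Callers can treat all process groups uniformly, and backends that support a real abort override the method.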

Test plan:

```
pytest distributed/test_backends.py distributed/test_c10d_common.py distributed/test_c10d_pypg.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148798
Approved by: https://github.com/kwen2501
2025-03-08 18:33:18 +00:00
taozhiwei
16d07988fc add supports_coalescing property in c10d::Backend to determine whether backend supports coalescing (#135338)
1. My company uses privateuseone to connect a new hardware device and needs the `batch_isend_irecv` function. However, `batch_isend_irecv` is currently only available for CUDA, so I add a `supports_coalescing` property to `c10d::Backend` to determine whether a backend supports coalescing.
2. If `pg._has_hooks` returns True, we don't need to check whether the current device is CUDA, so privateuseone can also support `pg._wait_for_pending_works`.
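
A capability property of this kind lets callers branch on what the backend reports instead of on the device type. The following is a hypothetical Python sketch of that pattern; the classes and the `run_batch` helper are invented for illustration and do not mirror the C++ `c10d::Backend`.

```python
class BackendSketch:
    """Base backend: coalescing is opt-in."""

    @property
    def supports_coalescing(self) -> bool:
        return False


class CoalescingBackendSketch(BackendSketch):
    """A backend (for any device type) that can batch operations."""

    @property
    def supports_coalescing(self) -> bool:
        return True


def run_batch(backend: BackendSketch, ops: list[str]) -> list[str]:
    """Return the launches that would be issued for `ops`."""
    if backend.supports_coalescing:
        # One combined launch for the whole batch.
        return ["coalesced(" + ", ".join(ops) + ")"]
    # Fallback: one launch per operation.
    return list(ops)


ops = ["isend->1", "irecv<-1"]
print(run_batch(BackendSketch(), ops))
print(run_batch(CoalescingBackendSketch(), ops))
```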

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338
Approved by: https://github.com/kwen2501, https://github.com/albanD
2025-03-04 12:37:06 +00:00
ankurneog
e45040b1d3 [c10d] Add hccl distributed backend to c10d data structures (#146478)
# MOTIVATION
Intel Gaudi is an out-of-tree PyTorch accelerator with its own device/dispatch key, `hpu`.
This change adds entries for Gaudi's distributed backend, `hccl`, to the c10d Backend data structures.
This ensures that there is no naming conflict if a new in-tree accelerator is introduced with the same backend name.

Out-of-tree backends are registered by calling fd0cd6a08f/torch/distributed/distributed_c10d.py (L302)

Successful registration adds the backend name to the list:
fd0cd6a08f/torch/distributed/distributed_c10d.py (L265)

The process group creator constructs are bound at run time, so if another distributed backend has the same device name, it can safely add the device type to the dictionary

fd0cd6a08f/torch/distributed/distributed_c10d.py (L274)

and add another entry to the dictionary with the same backend name (but a different device name):
fd0cd6a08f/torch/distributed/distributed_c10d.py (L268)

In addition, out-of-tree devices can use `backend_list` to check for successful backend registration, e.g. in APIs like `is_hccl_available`.
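
The bookkeeping described above can be modelled with a toy registry. The class below is a hypothetical sketch loosely inspired by the structures the message mentions; the attribute names other than `backend_list`, the initial contents, and the `register` and `is_available` methods are assumptions, not the c10d API.

```python
class BackendRegistrySketch:
    """Tracks known backend names and which devices each one serves."""

    def __init__(self) -> None:
        self.backend_list: list[str] = ["gloo", "nccl"]
        self.device_to_backend: dict[str, str] = {"cpu": "gloo", "cuda": "nccl"}
        self.backend_to_devices: dict[str, list[str]] = {
            "gloo": ["cpu", "cuda"],
            "nccl": ["cuda"],
        }

    def register(self, name: str, devices: list[str]) -> None:
        """Record a backend name and the device types it handles."""
        name = name.lower()
        if name not in self.backend_list:
            self.backend_list.append(name)
        supported = self.backend_to_devices.setdefault(name, [])
        for device in devices:
            self.device_to_backend[device] = name
            if device not in supported:
                supported.append(device)

    def is_available(self, name: str) -> bool:
        """Availability check in the spirit of an `is_<backend>_available` API."""
        return name.lower() in self.backend_list


registry = BackendRegistrySketch()
hccl_available_before = registry.is_available("hccl")
registry.register("hccl", devices=["hpu"])
print(hccl_available_before, registry.is_available("hccl"))
```

Registering the same name again with a different device only extends the device entries, which matches the idea of adding another dictionary entry under the same backend name.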

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146478
Approved by: https://github.com/H-Huang
2025-03-03 21:32:21 +00:00
PyTorch MergeBot
94afb165d9 Revert "[c10d] Add hccl distributed backend to c10d data structures (#146478)"
This reverts commit dae3fbfe97.

Reverted https://github.com/pytorch/pytorch/pull/146478 on behalf of https://github.com/malfet due to This seems to break ROCM tests, see dae3fbfe97 ([comment](https://github.com/pytorch/pytorch/pull/146478#issuecomment-2692913573))
2025-03-02 21:22:04 +00:00
ankurneog
dae3fbfe97 [c10d] Add hccl distributed backend to c10d data structures (#146478)
# MOTIVATION
Intel Gaudi is an out-of-tree PyTorch accelerator with its own device/dispatch key, `hpu`.
This change adds entries for Gaudi's distributed backend, `hccl`, to the c10d Backend data structures.
This ensures that there is no naming conflict if a new in-tree accelerator is introduced with the same backend name.

Out-of-tree backends are registered by calling fd0cd6a08f/torch/distributed/distributed_c10d.py (L302)

Successful registration adds the backend name to the list:
fd0cd6a08f/torch/distributed/distributed_c10d.py (L265)

The process group creator constructs are bound at run time, so if another distributed backend has the same device name, it can safely add the device type to the dictionary

fd0cd6a08f/torch/distributed/distributed_c10d.py (L274)

and add another entry to the dictionary with the same backend name (but a different device name):
fd0cd6a08f/torch/distributed/distributed_c10d.py (L268)

In addition, out-of-tree devices can use `backend_list` to check for successful backend registration, e.g. in APIs like `is_hccl_available`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146478
Approved by: https://github.com/H-Huang, https://github.com/guangyey
2025-03-02 05:13:48 +00:00
Xuehai Pan
995df34b19 [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547
Approved by: https://github.com/kwen2501
2025-02-28 07:35:56 +00:00
Zesheng Zong
580f1183b4 Enable ruff rule S324 (#147665)
Fixes #147627

- Add `S324` to `pyproject.toml`
- Run the check and clean up the warnings

```bash
lintrunner --take RUFF --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147665
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-25 18:27:34 +00:00
Avik Chaudhuri
24738768a8 more dist ops in non strict (#147417)
Summary: Previously we added support for `all_reduce` to non strict. This PR extends this support to other non-functional collectives that are remapped in Dynamo: `all_gather`, `all_gather_into_tensor`, `all_to_all_single`, `reduce_scatter_tensor`.

Test Plan: added unit tests

Differential Revision: D69813991

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147417
Approved by: https://github.com/angelayi
2025-02-19 21:29:16 +00:00
Avik Chaudhuri
4ab967c44d all reduce non strict (#147133)
Summary:
Some distributed collectives like `all_reduce` have special handling in Dynamo, where they are mapped to functional collectives. Non-strict was previously blind to such mappings, which means using them would fail to trace. Here we show how intercepting them in non-strict's torch function mode can mimic this remapping logic. More ops to follow.

Side note: a recently added distributed test was in the wrong place, making the expected failures for non-strict not fire because we weren't actually generating those tests to begin with! Now fixed.

Test Plan: moved and updated test

Differential Revision: D69607140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147133
Approved by: https://github.com/tugsbayasgalan
2025-02-15 19:37:08 +00:00
Aaron Gokaslan
6344ca1dd4 [BE][Ez]: Apply FURB188: use str remove(pre|suf)fix (#146997)
Since we are on 3.9, we can use this nice str builtin which is more readable and more efficient.
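
For reference, this is the kind of rewrite FURB188 suggests. The strings are made-up examples; `str.removeprefix` and `str.removesuffix` are the Python 3.9+ builtins in question.

```python
name = "module.layer1.weight"

# Manual slicing, the pattern the lint rule flags.
if name.startswith("module."):
    old_style = name[len("module."):]
else:
    old_style = name

# Equivalent builtin: strips the prefix once, or returns the string unchanged.
new_style = name.removeprefix("module.")

print(old_style == new_style)
print("checkpoint.pt".removesuffix(".pt"))
print("checkpoint.pt".removeprefix("model."))  # no match, so unchanged
```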

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146997
Approved by: https://github.com/XuehaiPan, https://github.com/cyyever, https://github.com/jansel
2025-02-14 03:38:07 +00:00
ankurneog
f50d359ce2 [ c10d ] modify API to get device string from device with torch.device (#146290)
Modify the `get_default_backend_for_device()` API to extract the device string using `torch.device()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146290
Approved by: https://github.com/guangyey, https://github.com/H-Huang
2025-02-11 23:30:57 +00:00
Aaron Gokaslan
7f65a20884 [BE]: Enable ruff SLOT checks (#146276)
This enables a check that a class which inherits only from immutable classes like str, tuple, and NamedTuple also defines `__slots__`, so that its instances don't allocate memory unnecessarily. This also ensures contributors think about how they define classes that subclass NamedTuple and str, of which we have many in our codebase.
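
A short demonstration of what the SLOT checks guard against, using made-up class names: a subclass of an immutable builtin gets a per-instance `__dict__` unless it declares empty `__slots__`.

```python
from typing import NamedTuple


class Tag(str):
    """str subclass without __slots__: each instance carries a __dict__."""


class SlottedTag(str):
    """Same subclass with empty __slots__: no per-instance __dict__."""

    __slots__ = ()


class Point(NamedTuple):
    x: int
    y: int


class LabeledPoint(Point):
    """NamedTuple subclass that adds behaviour but no instance state."""

    __slots__ = ()

    def label(self) -> str:
        return f"({self.x}, {self.y})"


print(hasattr(Tag("a"), "__dict__"))
print(hasattr(SlottedTag("a"), "__dict__"))
print(hasattr(LabeledPoint(1, 2), "__dict__"))
print(LabeledPoint(1, 2).label())
```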

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146276
Approved by: https://github.com/aorenste
2025-02-04 19:18:23 +00:00
c8ef
a989a0b13a [NFC] Fix some minor typos. (#145599)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145599
Approved by: https://github.com/Skylion007
2025-01-24 18:58:59 +00:00
Aaron Orenstein
00ffeca1b1 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-21 04:23:29 +00:00
PyTorch MergeBot
6374332d33 Revert "PEP585 update - torch/distributed (#145164)"
This reverts commit 6cb186e279.

Reverted https://github.com/pytorch/pytorch/pull/145164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing an inductor test ([comment](https://github.com/pytorch/pytorch/pull/145164#issuecomment-2602875679))
2025-01-20 16:46:46 +00:00
Aaron Orenstein
6cb186e279 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-20 00:19:01 +00:00
lzhang2
1800f5f461 Enable coalescing path on XPU and dispatch to XPU tensor barrier if XCCL backend is specified. (#143735)
**Motivation:**

- Enable the coalescing path on XPU for `batch_isend_irecv`.
- If the XCCL backend is specified, construct an XPU tensor to ensure that `barrier` dispatches to the XCCL backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143735
Approved by: https://github.com/kwen2501
2025-01-14 08:37:48 +00:00
bobrenjc93
08be9ec312 Migrate from Tuple -> tuple in torch/distributed (#144258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258
Approved by: https://github.com/aorenste
2025-01-10 08:34:54 +00:00
lzhang2
5d6acd5a31 Register Intel distributed Backend (XCCL) in PyTorch distributed package (#141856)
### Motivation:

As the design in the Intel distributed support RFC https://github.com/pytorch/pytorch/issues/141741 illustrates, two sections are needed to enable Intel distributed backend (`XCCL`) support in PyTorch.
1. Intel GPU distributed backend integration in PyTorch `torch-xpu-ops`.
2. **Intel distributed backend registration in the PyTorch distributed package**. This PR contributes the section 2 change.

### Example:
Here is a simple example that uses spawn to launch the XCCL backend and perform an allreduce on XPU tensors.
```
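import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    # Rendezvous address and port shared by all spawned processes.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    # No backend is passed, so the default backend selection applies.
    dist.init_process_group(rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def run_allreduce(rank, world_size):
    setup(rank, world_size)
    # Each process uses the XPU device whose index matches its rank.
    device = torch.device(f'xpu:{rank}')
    x = torch.randn([2, 2], device=device)
    # Reduces x in place across all ranks (sum by default).
    dist.all_reduce(x)
    cleanup()

if __name__ == '__main__':
    world_size = 2
    # mp.spawn passes the process index as the first argument (rank).
    mp.spawn(run_allreduce, args=(world_size,), nprocs=world_size, join=True)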
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141856
Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD
2024-12-10 01:58:06 +00:00
Ke Wen
452e1a7840 [c10d] Update backend arg documentation (#142404)
Update doc to reflect change brought by https://github.com/pytorch/pytorch/pull/142216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142404
Approved by: https://github.com/XilunWu
2024-12-09 21:53:44 +00:00
Ke Wen
cc64ad659d Detect accelerator type when backend is not specified (#142216)
Today, when a user calls `init_process_group()` without specifying `backend` or `device_id`, we auto-translate it into `cuda:nccl,cpu:gloo`. The idea was to initialize all **default** backends to cover whatever the user may do later.

A side effect is increased initialization time and resource usage.

This PR instead detects the accelerator type on the machine and initializes only the backend for that accelerator.
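
The difference can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: both function names are hypothetical, the device-to-backend table simply reuses backend names that appear in this log, and falling back to CPU when no known accelerator is found is an assumption.

```python
from typing import Optional

# Assumed defaults, for illustration only.
DEFAULT_BACKENDS = {"cuda": "nccl", "xpu": "xccl", "cpu": "gloo"}


def legacy_backend_spec() -> str:
    """Old behavior as described above: cover every default backend."""
    return "cuda:nccl,cpu:gloo"


def detected_backend_spec(accelerator: Optional[str]) -> str:
    """Sketch of the new behavior: pick only the detected accelerator's backend.

    `accelerator` is the detected device type, or None on a CPU-only machine.
    Unknown device types fall back to CPU here (an assumption of this sketch).
    """
    device = accelerator if accelerator in DEFAULT_BACKENDS else "cpu"
    return f"{device}:{DEFAULT_BACKENDS[device]}"


spec_on_gpu = detected_backend_spec("cuda")
spec_on_cpu = detected_backend_spec(None)
```

With this sketch, a CUDA machine would initialize a single backend instead of the two listed in the legacy spec.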

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142216
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2024-12-06 10:55:56 +00:00
Aaron Gokaslan
08db735629 [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0. Should hopefully reduce linting time. Adds support for orjson cache serialization, which should improve mypy cache performance if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-03 02:50:10 +00:00
PyTorch MergeBot
daa77f3d9f Revert "[BE]: Update mypy to 1.13.0 (#140808)"
This reverts commit 00134d68af.

Reverted https://github.com/pytorch/pytorch/pull/140808 on behalf of https://github.com/huydhn due to This is failing a distributed test in trunk, target determination missed this test and did not run it on PR ([comment](https://github.com/pytorch/pytorch/pull/140808#issuecomment-2512788426))
2024-12-02 20:47:43 +00:00
Aaron Gokaslan
00134d68af [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0. Should hopefully reduce linting time. Adds support for orjson cache serialization, which should improve mypy cache performance if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-02 18:47:54 +00:00
ankurneog
f497a0039c API to retrieve default distributed backend from device (#140536)
# Motivation
The distributed APIs rely on backend names to create process groups.
To abstract references to these names out of process group creation, an API is added to get the default distributed backend for a device.
The device code needs to register its device and backend via `torch.distributed.Backend.register_backend`, or update the map with `torch.distributed.Backend.default_device_backend_map["device"] = "distributed_backend"`, before using the API.

An example of its use is added in the test file (which can also be used to check the abstracted APIs).
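
The register-then-look-up pattern can be sketched in plain Python. `BackendRegistry` and its methods are hypothetical stand-ins written for this illustration: the real `register_backend` takes more arguments than shown here, and the lookup method name is an assumption.

```python
class BackendRegistry:
    """Hypothetical stand-in for a device -> default backend map."""

    def __init__(self):
        # Assumed starting entries, for illustration only.
        self.default_device_backend_map = {"cpu": "gloo", "cuda": "nccl"}

    def register_backend(self, name, devices):
        """Record `name` as the default backend for each listed device type."""
        for device in devices:
            self.default_device_backend_map[device] = name.lower()

    def get_default_backend_for_device(self, device):
        """Look up the backend for a device string such as 'xpu' or 'xpu:0'."""
        device_type = device.split(":", 1)[0]
        try:
            return self.default_device_backend_map[device_type]
        except KeyError:
            raise ValueError(
                f"no default backend registered for {device_type!r}"
            ) from None


registry = BackendRegistry()
registry.register_backend("XCCL", devices=["xpu"])
backend_name = registry.get_default_backend_for_device("xpu:0")
```

Callers can then create a process group from a device alone, without hardcoding the backend name.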

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140536
Approved by: https://github.com/kwen2501
2024-11-22 11:01:53 +00:00
Will Constable
b25c291563 [C10D] Support group ranks in P2POp and batch_isend_irecv (#141054)
Changes the semantics of `__repr__` of P2POp: s, d are now group ranks
instead of global ranks. I think this is OK since I also updated the field
names to make this obvious.

Also adds mypy annotations.

Partially addresses RFC 0042 (pytorch/rfcs#71)
See more details/motivation in #140460
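
To make the group-rank versus global-rank distinction concrete, here is a small sketch that models a process group as the ordered list of its members' global ranks. The helper names are hypothetical and only illustrate the mapping; they are not the c10d API.

```python
def to_group_rank(group_ranks, global_rank):
    """Position of a global rank inside the group (its group rank)."""
    # list.index raises ValueError when the rank is not a group member.
    return group_ranks.index(global_rank)


def to_global_rank(group_ranks, group_rank):
    """Global rank of the member sitting at a given group rank."""
    return group_ranks[group_rank]


# A subgroup made of global ranks 2, 5, and 7.
subgroup = [2, 5, 7]

peer_group_rank = to_group_rank(subgroup, 5)
peer_global_rank = to_global_rank(subgroup, peer_group_rank)
```

In this subgroup, global rank 5 is group rank 1, so a peer described by group rank would be reported as 1 rather than 5.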

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141054
Approved by: https://github.com/kwen2501
2024-11-21 14:51:56 +00:00
Junjie Wang (PyTorch)
b44ecd91ba [c10d] Switch all timer logging in c10d to wait_counter (#141154)
Summary: The original decorator-based time logger performs poorly and has limited capacity, so we are replacing it with PyTorch's `_WaitCounter`.
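
As a rough illustration of the general idea of a named counter that guards a block of code, here is a pure-Python sketch. `WaitCounterSketch` is hypothetical and written for this note; it does not reproduce how `_WaitCounter` is implemented, and the key name is made up.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class WaitCounterSketch:
    """Hypothetical named counter that accumulates waits per key."""

    # Shared totals: key -> number of waits and total seconds spent waiting.
    _totals = defaultdict(lambda: {"count": 0, "seconds": 0.0})

    def __init__(self, key):
        self.key = key

    @contextmanager
    def guard(self):
        """Time the enclosed block and add it to this key's totals."""
        start = time.perf_counter()
        try:
            yield
        finally:
            entry = self._totals[self.key]
            entry["count"] += 1
            entry["seconds"] += time.perf_counter() - start

    @classmethod
    def snapshot(cls):
        """Return a copy of the totals collected so far."""
        return {key: dict(value) for key, value in cls._totals.items()}


# Usage: wrap the region to be measured instead of decorating a function.
with WaitCounterSketch("example.barrier").guard():
    time.sleep(0.001)
```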

Test Plan: Tested on a workload; no regression was seen: https://fburl.com/scuba/aps_instrumentation_components/mskj73ea

Differential Revision: D66218675

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141154
Approved by: https://github.com/wz337
2024-11-21 01:10:11 +00:00