Commit Graph

103 Commits

Author SHA1 Message Date
PyTorch MergeBot
420c52ecf3 Revert "Make distributed modules importable even when backend not built (#159889)"
This reverts commit 626cb7df81.

Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, can't be landed with forward fix due to internal tooling problems ([comment](https://github.com/pytorch/pytorch/pull/159889#issuecomment-3246677982))
2025-09-02 20:24:01 +00:00
Edward Z. Yang
626cb7df81 Make distributed modules importable even when backend not built (#159889)
This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined because a backend is not enabled.
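A minimal, hypothetical sketch of the stubbing pattern described above; the actual symbols stubbed by the PR may differ:

```
# If the NCCL backend was not compiled in, expose a placeholder so importing
# the module still succeeds and failure only happens at use time.
try:
    from torch._C._distributed_c10d import ProcessGroupNCCL
except ImportError:

    class ProcessGroupNCCL:  # placeholder stub
        def __init__(self, *args, **kwargs):
            raise RuntimeError(
                "ProcessGroupNCCL is not available: PyTorch was built without NCCL."
            )
```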

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-01 23:00:21 +00:00
Scott Wolchok
c35538d3c5 Minor cleanup of DeviceMesh.__eq__ (#161235)
`self is other` means the same thing as `id(self) == id(other)`, but it's one operator instead of 3.
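A minimal illustrative sketch (not the actual DeviceMesh code) of the identity fast path in `__eq__`:

```
class Mesh:
    def __init__(self, mesh, dim_names):
        self.mesh = mesh
        self.dim_names = dim_names

    def __eq__(self, other):
        if self is other:
            # Identity check: one operator instead of id(self) == id(other).
            return True
        if not isinstance(other, Mesh):
            return False
        return self.mesh == other.mesh and self.dim_names == other.dim_names
```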

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161235
Approved by: https://github.com/wconstab, https://github.com/zpcore, https://github.com/fduwjj
ghstack dependencies: #161231, #161234
2025-08-25 18:35:21 +00:00
Edward Yang
838f22c57d Do not incorrectly chain each of the strings as iterables (#160709)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160709
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
2025-08-15 23:22:24 +00:00
Luca Wehrstedt
aeb5321b63 Allow controlling PG backend and options via init_device_mesh (#159371)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159371
Approved by: https://github.com/wconstab, https://github.com/fduwjj, https://github.com/wanchaol
2025-08-05 12:44:14 +00:00
fduwjj
fd48681b6a [DeviceMesh][ez] Make the logic within flatten simpler (#158999)
While looking at the DeviceMesh code I found that this logic can be simplified. The naming also needs to be corrected: because this mesh is not "flattened" yet, we can just call it `flatten`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158999
Approved by: https://github.com/wz337, https://github.com/wconstab
ghstack dependencies: #158900
2025-07-24 15:40:13 +00:00
fduwjj
633d5faf3f [DeviceMesh] Enable slicing a submesh with warnings (#158899)
We don't create new PGs when slicing in DeviceMesh, so it is relatively safe to relax the requirement that one can only slice from the root mesh. But this does come with a caveat when the setup is asymmetric, e.g. when only some ranks have the sliced-out submesh. So aside from removing the requirement, we also add a warning here.
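A hedged illustration of the relaxed rule, assuming an 8-rank CUDA job:

```
from torch.distributed.device_mesh import init_device_mesh

mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))
mesh_2d = mesh_3d["dp", "cp"]   # slicing from the root mesh, as before
dp_mesh = mesh_2d["dp"]         # slicing from a submesh: now allowed, with a warning
```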

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158899
Approved by: https://github.com/wz337
2025-07-23 21:13:41 +00:00
Panagiotis Kourdis
fd47401536 [doc] Updates to distributed.md for XCCL backend (#155834)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155834
Approved by: https://github.com/guangyey, https://github.com/AlannaBurke, https://github.com/d4l3k

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-07-22 21:01:43 +00:00
fduwjj
6499420e45 [DeviceMesh] Make the repr shorter when debug ENV not set (#158822)
Users want a shorter repr, so this PR addresses that when TORCH_DISTRIBUTED_DEBUG is not set to DETAIL. Feedback and discussion are welcome. I found that torch.set_printoptions is global, so I am hesitant to use it.

Now the print looks like one of the following:

![image](https://github.com/user-attachments/assets/8f173287-7138-4fbe-a4a3-8483523b21e4)
![image](https://github.com/user-attachments/assets/21e34db9-56b5-47e2-9767-750d6105a273)
![image](https://github.com/user-attachments/assets/53aa763e-7edd-4622-9cdb-37e2af8ec11f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158822
Approved by: https://github.com/wz337, https://github.com/wconstab, https://github.com/xmfan
2025-07-22 20:31:44 +00:00
fduwjj
63a96eaeb8 [DeviceMesh] Add error when users try to slice non contiguous flattened dim submesh (#157523)
With https://github.com/pytorch/pytorch/issues/157393, we want to first throw a clearer error for users and then fix it properly in the long term.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157523
Approved by: https://github.com/fegin
ghstack dependencies: #157501
2025-07-07 19:43:51 +00:00
fduwjj
2b8d3b1b2b [DeviceMesh] Use user set backend and pg option even for the global mesh (#157501)
Short term solution to https://github.com/pytorch/pytorch/issues/156593.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157501
Approved by: https://github.com/fegin, https://github.com/lw
2025-07-07 19:43:51 +00:00
Tom Ritchford
e3afbb0362 [inductor] Add typing to _inductor/ir.py (#149958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149958
Approved by: https://github.com/Skylion007
2025-06-30 15:56:35 +00:00
vfdev
456b7451c7 Minor error message fix in device_mesh.py (#157096)
Fixed error message:
On main:
```
KeyError: ("Invalid mesh_dim_names ('dp_shard', 'dp_shard') specified. ", 'Found mesh dim indices to slice: [(1,), (1,)]. ', 'Mesh dim indices should be in ascending order.')
```
On PR:
```
KeyError: Invalid mesh_dim_names ('dp_shard', 'dp_shard') specified. Found mesh dim indices to slice: [(1,), (1,)]. Mesh dim indices should be in ascending order.'
```
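A minimal sketch of the kind of fix described: pass KeyError a single formatted string instead of a tuple of strings (the variable names below are illustrative):

```
raise KeyError(
    f"Invalid mesh_dim_names {mesh_dim_names} specified. "
    f"Found mesh dim indices to slice: {mesh_dim_indices}. "
    "Mesh dim indices should be in ascending order."
)
```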

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157096
Approved by: https://github.com/Skylion007
2025-06-27 17:42:29 +00:00
Xuehai Pan
4ccc0381de [BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #156313, #156314
2025-06-23 02:57:28 +00:00
PyTorch MergeBot
145d4cdc11 Revert "[BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)"
This reverts commit c2f0292bd5.

Reverted https://github.com/pytorch/pytorch/pull/156315 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
Xuehai Pan
c2f0292bd5 [BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #156313, #156314
2025-06-22 08:43:26 +00:00
Aaron Orenstein
e95e8eed0a mypy 1.16.0 (#155821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155821
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-06-14 18:18:43 +00:00
PyTorch MergeBot
7e4c097b07 Revert "[inductor] Add typing to _inductor/ir.py (#149958)"
This reverts commit 529e0357c6.

Reverted https://github.com/pytorch/pytorch/pull/149958 on behalf of https://github.com/malfet due to Looks like it broke inductor_torchbind tests, due to more graphbreaks, see b0fbbef136/1 ([comment](https://github.com/pytorch/pytorch/pull/149958#issuecomment-2949583209))
2025-06-06 15:19:16 +00:00
Tom Ritchford
529e0357c6 [inductor] Add typing to _inductor/ir.py (#149958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149958
Approved by: https://github.com/Skylion007
2025-06-06 14:15:01 +00:00
Jane Xu
8817e5ac80 Render Example: and not Example:: in docs (#153978)
Everything here is a grep except the changes in tools/autograd/load_derivatives.py which I manually corrected.

The correct notation is:
```
Example::

    >>> ...
```

It is common and wrong to have:
```
Example::
    >>> ...
```

In the wrong example, we get these pesky double colons:
![image](https://github.com/user-attachments/assets/20ffd349-68bb-4552-966c-e23923350476)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153978
Approved by: https://github.com/soulitzer, https://github.com/malfet
2025-05-21 01:03:26 +00:00
Wanchao Liang
4c5cf18ee0 [device_mesh] improve device selection logic (#150897)
As titled, this PR improves the device selection logic when the user did not set the device before calling the DeviceMesh constructor. As a device manager, DeviceMesh should try to set the device for users in a sensible way.

The set_device behavior before:

* If the user calls init_process_group to init a world process group, we assume the user already called set_device and we don't set the device for them.
* If the user does not init a world process group themselves, we init one for them and follow a heuristic to set the device. This is OK, but sometimes the set_device heuristic doesn't work well (e.g. if the user relies on CUDA_VISIBLE_DEVICES).

So this PR improves the device selection logic to the following (a rough sketch follows the list):

* If the default CUDA context is already initialized by the time we init DeviceMesh, we assume the user must have run some CUDA operation before and therefore already selected the device themselves.
* Otherwise, we check whether the launcher (i.e. torchrun) set the "LOCAL_RANK" and "WORLD_SIZE" env vars; if so, we use "LOCAL_RANK" to set the device for the current process, which is a very standard practice. (This solves the CUDA_VISIBLE_DEVICES issue.)
* Otherwise, we warn the user about the situation and fall back to the old heuristic.
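A rough, hedged sketch of that selection order; the helper name and the final fallback are assumptions, not the PR's actual code:

```
import os
import warnings

import torch
import torch.distributed as dist


def _maybe_set_device(device_type: str) -> None:
    if device_type != "cuda":
        return
    if torch.cuda.is_initialized():
        # The user already ran a CUDA op, so assume they selected a device themselves.
        return
    local_rank = os.environ.get("LOCAL_RANK")
    if local_rank is not None and "WORLD_SIZE" in os.environ:
        # Standard torchrun practice: one process per local device.
        torch.cuda.set_device(int(local_rank))
        return
    warnings.warn("LOCAL_RANK not set; falling back to the old heuristic.")
    # Old heuristic, shown here only as a guess: rank modulo local device count.
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
```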
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150897
Approved by: https://github.com/tianyu-l
ghstack dependencies: #150898
2025-05-14 06:29:16 +00:00
Wanchao Liang
9df9d9ded0 [device_mesh] replace dim_group_info with group_name (#150898)
As titled, there's no need to maintain a dim_group_info anymore; we can simply maintain a list of group_names instead. This will simplify the logic.
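A hedged sketch of the simplification, assuming the internal `_resolve_process_group` helper in c10d:

```
from torch.distributed.distributed_c10d import _resolve_process_group


def get_group_for_dim(group_names, mesh_dim):
    # With only group names stored per mesh dim, the ProcessGroup object
    # can be looked up on demand instead of keeping a dim_group_info record.
    return _resolve_process_group(group_names[mesh_dim])
```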

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150898
Approved by: https://github.com/tianyu-l, https://github.com/fegin
2025-05-13 17:16:45 +00:00
Iris Z
2544afaa1a [DeviceMesh] Add some documentation for from_group API and add a 2D test (#146364)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146364
Approved by: https://github.com/fduwjj
2025-03-01 00:57:37 +00:00
Xuehai Pan
995df34b19 [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547
Approved by: https://github.com/kwen2501
2025-02-28 07:35:56 +00:00
Aaron Orenstein
00ffeca1b1 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-21 04:23:29 +00:00
PyTorch MergeBot
6374332d33 Revert "PEP585 update - torch/distributed (#145164)"
This reverts commit 6cb186e279.

Reverted https://github.com/pytorch/pytorch/pull/145164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing an inductor test ([comment](https://github.com/pytorch/pytorch/pull/145164#issuecomment-2602875679))
2025-01-20 16:46:46 +00:00
Aaron Orenstein
6cb186e279 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-20 00:19:01 +00:00
bobrenjc93
08be9ec312 Migrate from Tuple -> tuple in torch/distributed (#144258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258
Approved by: https://github.com/aorenste
2025-01-10 08:34:54 +00:00
lzhang2
b7ad52abb0 Use new group instead of split group on non-CUDA device (#141469)
Motivation:

Currently, `split_group` only works for the NCCL backend: https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L4745. So we need to use `new_group` on other, non-CUDA devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141469
Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD
2024-12-13 05:11:33 +00:00
Chien-Chin Huang
daa27fe59d [DeviceMesh] Call no_dispatch before doing tensor slicing in DeviceMesh (#142287)
Summary:
DeviceMesh's tensor operations are control-plane operations, not data-plane ones, and should not be affected by FakeTensorMode.
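A hedged sketch of the described guard; the function and argument names are illustrative:

```
import torch
from torch.utils._mode_utils import no_dispatch


def _index_mesh_tensor(mesh_tensor: torch.Tensor, index) -> torch.Tensor:
    # Control-plane indexing of the mesh tensor; running it under no_dispatch
    # keeps modes such as FakeTensorMode from intercepting the operation.
    with no_dispatch():
        return mesh_tensor[index]
```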

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142287
Approved by: https://github.com/XilunWu
2024-12-10 06:33:01 +00:00
SCh-zx
c8c669ce74 When using a third-party device to test DeviceMesh, the error check for 'test_raises_invalid_device_type' can only print 'GPU' (#142038)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142038
Approved by: https://github.com/kwen2501
2024-12-06 11:14:00 +00:00
Xilun Wu
7964bcc3dc [DeviceMesh] fix sub mesh size calculation in create_sub_mesh() (#138945)
**Summary**
This PR fixes a calculation error in DeviceMesh's create_sub_mesh().

**Error Description**
When users call `device_mesh["dim0", "dim1", "dim2", "dim3"]`, it creates a slice of the mesh, which we call a "submesh". Users can also slice a submesh from a flattened mesh. For example:
```
flattened_mesh = device_mesh["dim0", "dim1", "dim2"]._flatten("dim0-2")
alias_flattened_mesh = device_mesh["dim0-2"]  # this mesh slice leads to error in current impl
```

It triggers an error in the size calculation `reduce(lambda, mesh_dim)` in `create_sub_mesh`:
```
IndexError: Dimension out of range (expected to be in range of [-4, 3], but got 4)
```

**Fix**
The usage of the lambda is wrong: for `lambda x, y`, `x` is the accumulated value while `y` is the iterated value.
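A hedged sketch of the corrected accumulation (the helper name is illustrative):

```
from functools import reduce

import torch


def _submesh_numel(mesh_tensor: torch.Tensor, mesh_dims: tuple) -> int:
    # `x` carries the running product; `y` is the next mesh dim index to multiply in.
    return reduce(lambda x, y: x * mesh_tensor.size(y), mesh_dims, 1)
```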

**Test**
`pytest test/distributed/test_device_mesh.py -s -k test_flatten_mesh_4d`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138945
Approved by: https://github.com/wz337
2024-10-29 17:56:56 +00:00
wz337
d7e0e1dbc4 [DeviceMesh] Use split_group to create sub_groups for nccl backend if the default pg is eagerly initialized (#138129)
Use `split_group()` to create subgroups for the NCCL backend if the default PG is eagerly initialized. Otherwise, it will still go through the normal lazy-init process and call `new_group()` instead.
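A hedged sketch of the dispatch; the `split_group` arguments shown are assumptions, not the exact call made by DeviceMesh:

```
import torch.distributed as dist


def _make_subgroup(ranks, backend, default_pg_is_eager):
    if backend == "nccl" and default_pg_is_eager:
        # The default PG exists eagerly, so NCCL can split its communicator.
        return dist.split_group(split_ranks=[ranks])
    # Lazy init or non-NCCL backend: fall back to creating a fresh group.
    return dist.new_group(ranks=ranks)
```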

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138129
Approved by: https://github.com/kwen2501
2024-10-22 00:00:05 +00:00
wz337
fb0da32377 [DeviceMesh] Small refactor to optimize DeviceMesh subgroup creation (#138117)
As `backend`, `pg_options`, and `group_desc` are the same for each mesh dimension, we don't need to get or create these args for `new_group` multiple times. This PR moves it from the inner loop of the subgroup creation (each subgroup ranks of each mesh dimension) to the outer loop (each mesh_dimension).

For example, given we have a 2 * 4 DeviceMesh, we are re-creating the variables `backend`, `pg_options`, and `group_desc` 2*4 = 8 times. After the change, we only create these variables once per mesh dimension, which is 2 times.
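An illustrative sketch of the hoisting; the helper callable is an assumption, not the actual DeviceMesh internals:

```
import torch
from torch.distributed import new_group


def _init_subgroups(mesh: torch.Tensor, get_backend_and_options):
    for dim in range(mesh.ndim):
        # Computed once per mesh dimension instead of once per subgroup.
        backend, pg_options = get_backend_and_options(dim)
        group_desc = f"mesh_dim_{dim}"
        for subgroup_ranks in mesh.swapdims(-1, dim).reshape(-1, mesh.size(dim)).tolist():
            new_group(ranks=subgroup_ranks, backend=backend,
                      pg_options=pg_options, group_desc=group_desc)
```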

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138117
Approved by: https://github.com/kwen2501
2024-10-21 03:04:24 +00:00
Andrew Gu
39c5048549 [DeviceMesh] Fixed from_group when passing tensor mesh (#137713)
This fixes https://github.com/pytorch/pytorch/issues/137676. (sorry for messing this up in the original PR 😓 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137713
Approved by: https://github.com/wz337
2024-10-11 14:53:51 +00:00
wz337
93dcb92bae [DeviceMesh][EZ] Add group description to new group (#136558)
Add group description to new_group in device_mesh to help with debuggability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136558
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
2024-09-28 03:09:41 +00:00
fduwjj
a0c7029a75 [c10d][Reland] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931) (#135653)
We introduced the dispatchable backend for a ProcessGroup and collectives in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup of a ProcessGroup's Options: users should either set the timeout or backend later on, or directly create the backend after creating a PG.

Also, PGNCCL is using the Options class from ProcessGroup, but we actually should use the Options from the backend class. So this PR aligns the types and names with what we are doing on the cpp side. I don't change the signature of the public API, so it still uses args named "pg_options".

We need to make changes to the tests to align them with this change.

This is an attempt to reland D62008954 by fixing internal errors.

Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135653
Approved by: https://github.com/wz337, https://github.com/H-Huang
2024-09-16 19:56:42 +00:00
wz337
67f98a99a4 [DeviceMesh][Easy] Make RuntimeError a bit more descriptive by including the actual world_size (#135271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135271
Approved by: https://github.com/fduwjj
2024-09-06 06:23:20 +00:00
PyTorch MergeBot
351ba3e67c Revert "[c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)"
This reverts commit 65864d0134.

Reverted https://github.com/pytorch/pytorch/pull/132931 on behalf of https://github.com/ZainRizvi due to This PR is breaking builds internally due to the removal of ProcessGroup::Options ([comment](https://github.com/pytorch/pytorch/pull/132931#issuecomment-2321862402))
2024-08-30 16:27:40 +00:00
fduwjj
65864d0134 [c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)
We introduced the dispatchable backend for a ProcessGroup and collectives in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup of a ProcessGroup's Options: users should either set the timeout or backend later on, or directly create the backend after creating a PG.

Also, PGNCCL is using the Options class from ProcessGroup, but we actually should use the Options from the backend class. So this PR aligns the types and names with what we are doing on the cpp side. I don't change the signature of the public API, so it still uses args named "pg_options".

We need to make changes to the tests to align them with this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132931
Approved by: https://github.com/H-Huang
2024-08-29 22:40:12 +00:00
wz337
761cf91e3c [DeviceMesh] Add get_all_submeshes in _MeshEnv (#134275)
Adding a private helper method for Shampoo HSDP use cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134275
Approved by: https://github.com/XilunWu
2024-08-27 14:51:19 +00:00
wz337
268092db83 [DeviceMesh] Allow _flatten() to take in an optional mesh_dim_name (#134048)
If a mesh_dim_name is given, we will use it to name the new flattened dim.
Otherwise, the default is a string concatenating the mesh_dim_names of the given submesh, with each mesh_dim_name separated by "_".

For example, if we have a 3D mesh DeviceMesh([[[0, 1], [2, 3]], [[4, 5], [6, 7]]], mesh_dim_names=("dp", "cp", "tp")), calling mesh_3d["dp", "cp"]._flatten() will create a 1D submesh DeviceMesh([0, 1, 2, 3], mesh_dim_names=("dp_cp",)) on rank 0, 1, 2, 3 and a 1D submesh DeviceMesh([4, 5, 6, 7], mesh_dim_names=("dp_cp",)) on rank 4, 5, 6, 7.
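A short sketch of the described behavior, assuming an 8-rank job (the keyword name follows the commit title):

```
from torch.distributed.device_mesh import init_device_mesh

mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))
# Without an argument the flattened dim would be named "dp_cp";
# here an explicit name is passed instead.
dp_cp = mesh_3d["dp", "cp"]._flatten(mesh_dim_name="dp_cp_custom")
```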

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134048
Approved by: https://github.com/fegin
ghstack dependencies: #133838, #133839
2024-08-25 10:36:01 +00:00
wz337
5d39b14b68 [DeviceMesh] Add DeviceMesh slicing support for flatten mesh dim (#133839)
Add DeviceMesh slicing support such that we could do the following:
```
mesh_3d = init_device_mesh(
    self.device_type, (2, 2, 2), mesh_dim_names=("replicate", "shard", "cp")
)
shard_cp_mesh = mesh_3d["shard", "cp"]._flatten()
hsdp_mesh = mesh_3d["replicate", "shard_cp"]
# we can get the corresponding group of the flattened mesh through

group = shard_cp_mesh.get_group()
# or
group = mesh_3d["shard_cp"].get_group()
# or
mesh_3d.get_group(mesh_dim="shard_cp")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133839
Approved by: https://github.com/fegin
ghstack dependencies: #133838
2024-08-24 03:49:29 +00:00
wz337
49430bfd5c [DeviceMesh] Add a _MeshEnv attr to record the mapping of flatten mesh_dim_name to its mesh dim index in root mesh (#133838)
```
# supposed we have a 3d mesh
mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()

"""
then we would have
flatten_name_to_root_dims[mesh_3d]: {
    "dp_cp": (0, 1)
}
"""
```

We need this information to validate the ordering of a mesh slice that includes a flattened mesh dim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133838
Approved by: https://github.com/fegin
2024-08-20 19:43:45 +00:00
wz337
4bae7ae3d9 [DeviceMesh][Easy] Fix typo (#133790)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133790
Approved by: https://github.com/Skylion007
2024-08-19 05:20:22 +00:00
wz337
3fc9ee5a31 [DeviceMesh] Directly retrieve flattened mesh if already created (#133195)
Add a mapping to keep track of the root_to_flatten relationship and directly retrieve the flattened mesh if it has already been created (no PG creation).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133195
Approved by: https://github.com/fegin, https://github.com/wanchaol
ghstack dependencies: #133193
2024-08-14 21:11:04 +00:00
wz337
ef580a0e5c [DeviceMesh] Restrict slicing to be a contiguous or non-contiguous subsequence of the root mesh_dim_names (#133193)
This PR adds a restriction for DeviceMesh slicing: no out-of-order subsequence slicing is allowed. To create a flattened mesh_dim_name, only in-order slicing is allowed.

```
mesh_3d = init_device_mesh(
    self.device_type, (2,2,2), mesh_dim_names=("dp", "cp", "tp"),
)

# valid 2d slicing
mesh_2d = mesh_3d["dp", "cp"]
mesh_2d = mesh_3d["dp", "tp"]
mesh_2d = mesh_3d["cp", "tp"]

# invalid 2d slicing
mesh_2d = mesh_3d["cp", "dp"]
mesh_2d = mesh_3d["tp", "cp"]
mesh_2d = mesh_3d["tp", "dp"]

# valid way to create dp_cp flatten slice
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()
# invalid way to create dp_cp flatten slice
dp_cp_mesh = mesh_3d["cp", "dp"]._flatten()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133193
Approved by: https://github.com/fegin, https://github.com/wanchaol
2024-08-14 07:18:41 +00:00
wz337
479d460471 [DeviceMesh] Add a private _flatten() API for device_mesh (#132632)
Adds a new private API to flatten a DeviceMesh to a 1D DeviceMesh such that:
```
mesh_3d = init_device_mesh(
    self.device_type, (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"),
)

dp_cp_mesh = mesh_3d["dp", "cp"]
# flattened_mesh on rank 0, 2, 4, 6 is DeviceMesh([0, 2, 4, 6], mesh_dim_names=('dp_cp',))
# flattened_mesh on rank 1, 3, 5, 7 is DeviceMesh([1, 3, 5, 7], mesh_dim_names=('dp_cp',))
flattened_dp_cp_mesh = dp_cp_mesh._flatten()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132632
Approved by: https://github.com/fegin, https://github.com/wanchaol
ghstack dependencies: #132310, #132311, #132339
2024-08-08 06:46:42 +00:00
wz337
8b50d5398f [DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)
More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366.

TLDR:
When cuda is available and users move tensors to cuda, we cannot really reuse the default PG if it is gloo, as many collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` results in a mysterious SIGTERM when all-gathering a cuda tensor using gloo. Without the change in this PR, users would have to know this context and explicitly move the cuda tensor to cpu before invoking most collectives, which is not ideal UX.

Therefore, given most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda.
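A hedged sketch of that condition (not the exact implementation):

```
import torch
import torch.distributed as dist


def _needs_new_cuda_group(device_type: str) -> bool:
    # gloo cannot run most collectives on CUDA tensors, so a CUDA mesh should
    # not silently reuse a gloo default PG.
    return (
        device_type == "cuda"
        and torch.cuda.is_available()
        and dist.get_backend() == "gloo"
    )
```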

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-08-07 16:13:11 +00:00
wz337
87053132ea [DeviceMesh] Remove parent mesh concept from _MeshEnv and replace by root mesh (#132339)
Previously, when we sliced out a submesh from a mesh, we assigned the mesh as the parent mesh of the submesh. In this case, when we have a 3D mesh topology, the parent mesh of a 1D mesh sliced out from the 3D mesh is different from the parent mesh of the same 1D mesh sliced out from a 2D submesh of the 3D mesh. For example:
```
mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]

mesh_2d = mesh_3d["dim0", "dim1"]
mesh_dim0_2 = mesh_2d["dim0"]

# This would evaluate to True
print(_mesh_resources.get_parent_mesh(mesh_dim0) != _mesh_resources.get_parent_mesh(mesh_dim0_2))
```

We can always reconstruct the mesh needed from the mesh dim names, as long as the two dims come from the same root. For simplicity, we do not see the necessity of building a tree structure to represent the child-parent relationship. Therefore, we are replacing the parent mesh concept with a root mesh concept in `_MeshEnv`, so we would have:

```
mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]

mesh_2d = mesh_3d["dim0", "dim1"]
mesh_dim0_2 = mesh_2d["dim0"]

# This would evaluate to True
print(_mesh_resources.get_root_mesh(mesh_dim0) == _mesh_resources.get_root_mesh(mesh_dim0_2))
```
With this change, we will have two types of meshes in an environment.
1. `device_mesh != _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is created by slicing.
2. `device_mesh == _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is a root mesh not created through slicing.
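A hedged illustration of the two cases above, assuming a 4-rank job (`_mesh_resources` is an internal module-level object and may change):

```
from torch.distributed.device_mesh import _mesh_resources, init_device_mesh

root = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))
sub = root["dp"]
assert _mesh_resources.get_root_mesh(root) == root  # a root mesh, not created by slicing
assert _mesh_resources.get_root_mesh(sub) == root   # created by slicing
```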

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132339
Approved by: https://github.com/wanchaol
ghstack dependencies: #132310, #132311
2024-08-07 07:01:12 +00:00