Commit Graph

80 Commits

Author SHA1 Message Date
Wanchao Liang
4c5cf18ee0 [device_mesh] improve device selection logic (#150897)
as titled, this PR improves the device selection logic when user did not
set the device before calling the DeviceMesh constructor, as a device
manager, DeviceMesh should try to set the device for users in a good
way.

The behavior of set_device before:

* If user call init_process_group to init a world process group, we assume user already called set_device and we don't set the device for the user
* If user does not init a world process group by themselves, we init a world process group for the user and follow a heuristic to set the device.
This is ok but sometimes the set_device heuristic wouldn't work well (i.e. if user use TORCH_CUDA_VISBILE_DEVICES

So this PR improves the device selection logic to:

* If the default cuda context is initialized by the time we init DeviceMesh, then we assume user must called some cuda operation before therefore must have selected the device by themselves
* If not the above, then we check if envvars have "LOCAL_RANK" and "WORLD_SIZE" from the launcher (i.e. torchrun), if so, we use "LOCAL_RANK" to set the device for the current process, which is a very standard practice. (This solves the TORCH_CUDA_VISBILE_DEVICES issue)
* If not above, then we throw warning to users about situation, and fallback to the old heuristic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150897
Approved by: https://github.com/tianyu-l
ghstack dependencies: #150898
2025-05-14 06:29:16 +00:00
Wanchao Liang
9df9d9ded0 [device_mesh] replace dim_group_info with group_name (#150898)
as titled, there's no need to maintain a dim_group_info anymore, we can
simply maintain a list of group_name instead. This will simplify the
logic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150898
Approved by: https://github.com/tianyu-l, https://github.com/fegin
2025-05-13 17:16:45 +00:00
Xilun Wu
cbb03e6971 [BE][DTensor] move torch.distributed._tensor import to torch.distributed.tensor in test files (#153225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153225
Approved by: https://github.com/kwen2501, https://github.com/fegin
2025-05-09 20:40:54 +00:00
Wanchao Liang
3380a46b44 Fix DTensorTestBase to barrier with device ids (#150896)
try to get rid of the below annoying warnings when running the unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150896
Approved by: https://github.com/fegin
2025-04-22 20:22:55 +00:00
Iris Z
2544afaa1a [DeviceMesh] Add some documentation for from_group API and add a 2D test (#146364)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146364
Approved by: https://github.com/fduwjj
2025-03-01 00:57:37 +00:00
Tom Ritchford
d25e6e623f Fix unused Python variables in test/[a-d]* (#134665)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134665
Approved by: https://github.com/albanD
2024-12-13 22:13:12 +00:00
Chien-Chin Huang
daa27fe59d [DeviceMesh] Call no_dispatch before doing tensor slicing in DeviceMesh (#142287)
Summary:
DeviceMesh's tensor operation is a control plane operation not data plane and should not be affected by FakeTensorMode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142287
Approved by: https://github.com/XilunWu
2024-12-10 06:33:01 +00:00
Xilun Wu
bef103934a [DeviceMesh][ROCm] skip ProcessGroup init test on ROCm because #ranks != #devices in CI (#142386)
**Summary**
Fixes #142361

Skip the DeviceMesh test since the test suite doesn't consider the case where `# ranks != # devices`.

**Test**
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142386
Approved by: https://github.com/huydhn, https://github.com/fegin
2024-12-10 01:22:21 +00:00
SCh-zx
c8c669ce74 When using a third-party device to test DeviceMesh,the error check for 'test_raises_invalid_device_type' can only prints 'GPU' (#142038)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142038
Approved by: https://github.com/kwen2501
2024-12-06 11:14:00 +00:00
Xilun Wu
7964bcc3dc [DeviceMesh] fix sub mesh size calculation in create_sub_mesh() (#138945)
**Summary**
This PR fixes a calculation miss in DeviceMesh's create_sub_mesh().

**Error Description**
When users call `device_mesh["dim0", "dim1", "dim2", "dim3"]`, it creates a slice of mesh or we call it "submesh". Users can also slice a submesh from a flattened mesh. For example:
```
flattened_mesh = device_mesh["dim0", "dim1", "dim2"]._flatten("dim0-2")
alias_flattened_mesh = device_mesh["dim0-2"]  # this mesh slice leads to error in current impl
```

It triggers the error in the size calculation `reduce(lambda, mesh_dim)` happening in `create_sub_mesh`:
```
IndexError: Dimension out of range (expected to be in range of [-4, 3], but got 4)
```

**Fix**
The usage of lambda is wrong, for `lambda x, y`, the x is the accumulated value while `y` is the iterator value.

**Test**
`pytest test/distributed/test_device_mesh.py -s -k test_flatten_mesh_4d`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138945
Approved by: https://github.com/wz337
2024-10-29 17:56:56 +00:00
wz337
5b39734a0a [DTensor][Test] Fix gloo backend failure when eager_init is turned on (#139097)
We should only pass the `device_id` when the backend is `nccl`. Otherwise, we would run into the following error:
```
RuntimeError: No backend for the parent process group or its backend does not support splitting
```

This also fixes test failure is not asserted when using `with_comms()` or `with_comms(eager_init=False)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139097
Approved by: https://github.com/XilunWu
2024-10-29 00:04:06 +00:00
wz337
d7e0e1dbc4 [DeviceMesh] Use split_group to create sub_groups for nccl backend if the default pg is eagerly initialized (#138129)
Use `split_group()` to create sub_groups for nccl backend if the default pg is eagerly initialized. Otherwise, it will still go through the normal lazy init process and call `new_group()` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138129
Approved by: https://github.com/kwen2501
2024-10-22 00:00:05 +00:00
Andrew Gu
39c5048549 [DeviceMesh] Fixed from_group when passing tensor mesh (#137713)
This fixes https://github.com/pytorch/pytorch/issues/137676. (sorry for messing this up in the original PR 😓 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137713
Approved by: https://github.com/wz337
2024-10-11 14:53:51 +00:00
fduwjj
a0c7029a75 [c10d][Reland] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931) (#135653)
We introduced the dispatchable backend for a ProcessGroup and collective in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"

We need to make changes to the test to make it aligned with the change.

This is try to reland D62008954 by fixing internal errors.

Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135653
Approved by: https://github.com/wz337, https://github.com/H-Huang
2024-09-16 19:56:42 +00:00
Wanchao Liang
cfc227ad43 [reland][dtensor] move DTensor to public namespace (#134203)
reland of https://github.com/pytorch/pytorch/pull/133113

I have to create a new PR because the previous reverted PR could not either be rebased, or imported successfully :(

----

Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next PRs)
* To preserve the BC for users still using the torch.distributed._tensor, I added a shim script to redirect old path calls to the new module

The BC preserving is evidented by the fact that all DTensor tests are still working without changing the public imports. So it's safe to land the changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
2024-09-08 17:08:40 +00:00
PyTorch MergeBot
351ba3e67c Revert "[c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)"
This reverts commit 65864d0134.

Reverted https://github.com/pytorch/pytorch/pull/132931 on behalf of https://github.com/ZainRizvi due to This PR is breaking builds internally due to the removal of ProcessGroup::Options ([comment](https://github.com/pytorch/pytorch/pull/132931#issuecomment-2321862402))
2024-08-30 16:27:40 +00:00
wz337
50efbb9f1e [DeviceMesh][Test] Add a unit test for get_local_rank for flattened mesh (#134603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134603
Approved by: https://github.com/fduwjj
ghstack dependencies: #133838, #133839, #134048
2024-08-30 08:13:37 +00:00
fduwjj
65864d0134 [c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)
We introduced the dispatchable backend for a ProcessGroup and collective in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"

We need to make changes to the test to make it aligned with the change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132931
Approved by: https://github.com/H-Huang
2024-08-29 22:40:12 +00:00
wz337
761cf91e3c [DeviceMesh] Add get_all_submeshes in _MeshEnv (#134275)
Adding a private helper method for Shampoo HSDP use cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134275
Approved by: https://github.com/XilunWu
2024-08-27 14:51:19 +00:00
wz337
268092db83 [DeviceMesh] Allow _flatten() to take in an optional mesh_dim_name (#134048)
If a mesh_dim_name is given, we will use the given mesh_dim_name to name the new flattened dim.
Otherwise, the default is a string concatentaing the mesh_dim_names of the given submesh with each mesh_dim_name separated by "_".

For example, if we have a 3D mesh DeviceMesh([[[0, 1], [2, 3]], [[4, 5], [6, 7]]], mesh_dim_names=("dp", "cp", "tp")), calling mesh_3d["dp", "cp"]._flatten() will create a 1D submesh DeviceMesh([0, 1, 2, 3], mesh_dim_names=("dp_cp",)) on rank 0, 1, 2, 3 and a 1D submesh DeviceMesh([4, 5, 6, 7], mesh_dim_names=("dp_cp",)) on rank 4, 5, 6, 7.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134048
Approved by: https://github.com/fegin
ghstack dependencies: #133838, #133839
2024-08-25 10:36:01 +00:00
wz337
5d39b14b68 [DeviceMesh] Add DeviceMesh slicing support for flatten mesh dim (#133839)
Add DeviceMesh slicing support such that we could do the following:
```
mesh_3d = init_device_mesh(
    self.device_type, (2, 2, 2), mesh_dim_names=("replicate", "shard", "cp")
)
shard_cp_mesh = mesh_3d["shard", "cp"]._flatten()
hsdp_mesh = mesh_3d["replicate", "shard_cp"]
# we can get the corresponding group of the flatten mesh through

group = shard_cp_mesh.get_group()
# or
group = mesh_3d["shard_cp"].get_group()
# or
mesh_3d.get_group(mesh_dim="shard_cp")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133839
Approved by: https://github.com/fegin
ghstack dependencies: #133838
2024-08-24 03:49:29 +00:00
wz337
49430bfd5c [DeviceMesh] Add a _MeshEnv attr to record the mapping of flatten mesh_dim_name to its mesh dim index in root mesh (#133838)
```
# supposed we have a 3d mesh
mesh_3d = init_device_mesh("cuda", (2,2,2), mesh_dim_names=("dp", "cp", "tp")
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()

"""
then we would have
flatten_name_to_root_dims[mesh_3d]: {
    "dp_cp": (0, 1)
}
"""
```

We need this information to validate the order mesh slice including flatten mesh dim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133838
Approved by: https://github.com/fegin
2024-08-20 19:43:45 +00:00
PyTorch MergeBot
35f36363ec Revert "[dtensor] move DTensor to public namespace (#133113)"
This reverts commit 2ee6b97464.

Reverted https://github.com/pytorch/pytorch/pull/133113 on behalf of https://github.com/wanchaol due to looks like it break some internal type imports ([comment](https://github.com/pytorch/pytorch/pull/133113#issuecomment-2295670911))
2024-08-19 05:00:19 +00:00
Wanchao Liang
2ee6b97464 [dtensor] move DTensor to public namespace (#133113)
Moving DTensor to be in the public namespace, to formally add the
documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next
  PRs)
* To preserve the BC for users still using the `torch.distributed._tensor`,
  I added a shim script to redirect old path calls to the new module

The BC preserving is evidented by the fact that all DTensor tests are still
working without changing the public imports. So it's safe to land the
changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
2024-08-17 05:09:52 +00:00
wz337
3fc9ee5a31 [DeviceMesh] Directly retrieve flattened mesh if already created (#133195)
Add mapping to keep track of root_to_flatten relationship and directly retrieve the flattened mesh if already created (no pg creation).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133195
Approved by: https://github.com/fegin, https://github.com/wanchaol
ghstack dependencies: #133193
2024-08-14 21:11:04 +00:00
wz337
ef580a0e5c [DeviceMesh] Restrict slicing to be a contiguous or non-contiguous subsequence of the root mesh_dim_names (#133193)
This PR adds restriction for DeviceMesh slicing. No out-of-order subsequence slicing is allowed. To create a flatten mesh_dim_names, only the in-order slicing is allowed.

```
mesh_3d = init_device_mesh(
    self.device_type, (2,2,2), mesh_dim_names=("dp", "cp", "tp"),
)

# valid 2d slicing
mesh_2d = mesh_3d["dp", "cp"]
mesh_2d = mesh_3d["dp", "tp"]
mesh_2d = mesh_3d["cp", "tp"]

# invalid 2d slicing
mesh_2d = mesh_3d["cp", "dp"]
mesh_2d = mesh_3d["tp", "cp"]
mesh_2d = mesh_3d["tp", "dp"]

# valid way to create dp_cp flatten slice
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()
# invalid way to create dp_cp flatten slice
dp_cp_mesh = mesh_3d["cp", "dp"]._flatten()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133193
Approved by: https://github.com/fegin, https://github.com/wanchaol
2024-08-14 07:18:41 +00:00
wz337
0ff0bf3d31 [Replicate] Fix replicate with DeviceMesh initialization (#133024)
A follow up on https://github.com/pytorch/pytorch/pull/132339.

`get_parent_mesh` is replaced by `get_root_mesh`. In addition, modify a few places that parent mesh is mentioned in test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133024
Approved by: https://github.com/Skylion007, https://github.com/fegin
2024-08-09 00:45:47 +00:00
wz337
479d460471 [DeviceMesh] Add a private _flatten() API for device_mesh (#132632)
Adds a new private API to flatten a DeviceMesh to a 1D DeviceMesh such that:
```
mesh_3d = init_device_mesh(
    self.device_type, (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"),
)

dp_cp_mesh = mesh_3d["dp", "cp"]
# flattened_mesh on rank 0, 2, 4, 6 is DeviceMesh([0, 2, 4, 6], mesh_dim_names=('dp_cp',))
# flattened_mesh on rank 1, 3, 5, 7 is DeviceMesh([1, 3, 5, 7], mesh_dim_names=('dp_cp',))
flattened_dp_cp_mesh = dp_cp_mesh._flatten()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132632
Approved by: https://github.com/fegin, https://github.com/wanchaol
ghstack dependencies: #132310, #132311, #132339
2024-08-08 06:46:42 +00:00
wz337
8b50d5398f [DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)
More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366.

TLDR:
When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if default pg is gloo, as lots of collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gather a cuda tensor using gloo. Without the change in this PR, users would have to know the context and explicitly move the cuda tensor to cpu before invoking most collectives, which I think is not so ideal UX.

Therefore, given most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-08-07 16:13:11 +00:00
wz337
87053132ea [DeviceMesh] Remove parent mesh concept from _MeshEnv and replace by root mesh (#132339)
Previously, when we slice out a submesh from a mesh, we assign the mesh as the parent mesh of the submesh. In this case, when we have a 3D mesh topology, the parent mesh of a 1D mesh sliced out from the 3D mesh is different from the parent mesh of the same 1D mesh sliced out from the 2D submesh of the 3D mesh. For example:
```
mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]

mesh_2d = mesh_2d["dim0", "dim1"]
mesh_dim0_2 =  mesh_2d["dim0_2"]

# This would evaluate to be True
print(_mesh_resources.get_parent_mesh(mesh_dim0) != _mesh_resources.get_parent_mesh(mesh_dim0))
```

We can always reconstruct the mesh needed from the mesh dim names, as long as two dims come from the same root. For simplicity, we do not see the necessity of building a tree structure to represent child-parent relationship. Therefore, we are replacing the parent mesh concept with a root mesh concept in `_MeshEnv` so we would have:

```
mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]

mesh_2d = mesh_2d["dim0", "dim1"]
mesh_dim0_2 =  mesh_2d["dim0_2"]

# This would evaluate to be True
print(_mesh_resources.get_root_mesh(mesh_dim0) == _mesh_resources.get_root_mesh(mesh_dim0))
```
With this change, we will have two types of meshes in an environment.
1. `device_mesh != _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is created by slicing.
2. `device_mesh == _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is a root mesh not created through slicing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132339
Approved by: https://github.com/wanchaol
ghstack dependencies: #132310, #132311
2024-08-07 07:01:12 +00:00
PyTorch MergeBot
c7113a6186 Revert "[DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)"
This reverts commit 1a23ef2ece.

Reverted https://github.com/pytorch/pytorch/pull/132709 on behalf of https://github.com/clee2000 due to I think this broke distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_device_mesh_initialization [GH job link](https://github.com/pytorch/pytorch/actions/runs/10274519791/job/28432469987) [HUD commit link](1a23ef2ece).  Test not run due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/132709#issuecomment-2272350923))
2024-08-06 23:47:53 +00:00
wz337
073cee531c [Test][Easy] Remove print in test_device_mesh.py (#132780)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132780
Approved by: https://github.com/XilunWu
2024-08-06 22:04:39 +00:00
wz337
1a23ef2ece [DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)
More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366.

TLDR:
When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if default pg is gloo, as lots of collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gather a cuda tensor using gloo. Without the change in this PR, users would have to know the context and explicitly move the cuda tensor to cpu before invoking most collectives, which I think is not so ideal UX.

Therefore, given most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-08-06 22:00:09 +00:00
wz337
fb87796d4f [DeviceMesh] Add supports for non-continuous slicing (#132310)
Removes constraint of continuous slicing to allow non-continuous slicing and adds a unit test for 3D non-continuous slicing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132310
Approved by: https://github.com/wanchaol
2024-08-05 09:30:07 +00:00
wz337
4de85e3c30 [DeviceMesh] Remove _parent_mesh as an attribute from DeviceMesh and remove it from DeviceMesh's hash (#131636)
We recently revisited the hash implementation and think `_parent_mesh` information should not be burned into DeviceMesh but rather be inferred from the MeshEnv which manages device meshes.

As `mesh_dim_names` is considered in device mesh's hash. This should not affect the issue brought up in https://github.com/pytorch/pytorch/issues/121799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131636
Approved by: https://github.com/wanchaol
2024-07-25 22:47:22 +00:00
Xuehai Pan
db3290846e [BE][Easy][10/19] enforce style for empty lines in import segments in test/d*/ (#129761)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129761
Approved by: https://github.com/fegin
2024-07-17 16:57:39 +00:00
Iris Z
1d84c7e100 [DeviceMesh] Update get_group and add get_all_groups (#128097)
Fixes #121984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128097
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-06-08 04:28:56 +00:00
Yifu Wang
1d0c1087dd Allow overriding per-dim group options via _MeshEnv.set_dim_group_options (#126599)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126599
Approved by: https://github.com/wanchaol
ghstack dependencies: #126598
2024-06-06 17:18:12 +00:00
Aaron Gokaslan
12c4a2c297 [BE]: Apply PLR1736 fixes (unnecessary index lookup) (#127716)
Applies the PLR1736 preview rule with some more autofixes to cut down on unnecessary accesses. Added a noqa since that test actually testing the dunder method.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127716
Approved by: https://github.com/ezyang
2024-06-03 17:22:13 +00:00
Iris Z
1699edaabb [DeviceMesh] Adding nD slicing support back (#127465)
Fixes #126530

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127465
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-05-31 17:06:36 +00:00
PyTorch MergeBot
f6e303fa47 Revert "[DeviceMesh] Adding nD slicing support back (#127465)"
This reverts commit e72232f8f0.

Reverted https://github.com/pytorch/pytorch/pull/127465 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint e72232f8f0, the error does not like look trivial fix, so I revert the change for a forward fix ([comment](https://github.com/pytorch/pytorch/pull/127465#issuecomment-2141051630))
2024-05-31 00:43:13 +00:00
wz337
e72232f8f0 [DeviceMesh] Adding nD slicing support back (#127465)
Fixes #126530

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127465
Approved by: https://github.com/wconstab
2024-05-30 23:55:21 +00:00
Andrew Gu
697ed6f5b3 [DeviceMesh] Supported N groups in from_group (#126258)
**Overview**
This PR supports constructing an ND mesh with `from_group()` by passing in `group: List[ProcessGroup]` and `mesh: Union[torch.Tensor, "ArrayLike"]` together. The `ndim` of the device mesh returned from `from_group()` is equal to the number of `ProcessGroup`s passed. If the `ndim` is greater than 1, then the `mesh` argument is required (since there is no simple way to recover the `mesh` tensor from the process groups otherwise).

This PR also adds `mesh_dim_names` as an argument to forward to the device mesh for convenience.

<details>
<summary> Old Approach </summary>

**Overview**
- This PR mainly adds `mesh_shape` to `from_group()` so that the user can construct an ND (N > 1) device mesh from a process group. This is to unblock HSDP, where we can pass the overall data parallel process group to `from_group()` with `mesh_shape = (replicate_dim_size, shard_dim_size)` and `from_group()` will construct subgroups for the user. (The user can then get the subgroups from the submeshes.)
    - Constructing the 2D `DeviceMesh` from an existing shard process group and replicate process group is hard because we cannot easily recover the array of ranks in their parent group on each rank in general.
- This PR also adds `mesh_dim_names` to `from_group()` so that the user can name the mesh dimensions of the constructed device mesh.

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126258
Approved by: https://github.com/wanchaol
2024-05-17 01:03:21 +00:00
Wanchao Liang
d0dfcd2c34 fix the device type for with_comms decorator (#125798)
found by @yifuwang, it looks like we are wrongly using
self.device_type="cuda" for gloo backend, which are triggering some
flakiness. i.e. https://github.com/pytorch/pytorch/issues/125366

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125798
Approved by: https://github.com/yifuwang
2024-05-16 03:40:19 +00:00
wz337
059b68fbdf [DeviceMesh] Fix hash and eq not match (#123572)
Fixes #121799

We fix DeviceMesh hash such that two mesh are considered equal if they have the same mesh and same parent_mesh.
Examples can be found here: https://github.com/pytorch/pytorch/issues/121799

Also need this to unblock https://github.com/pytorch/pytorch/pull/123394

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123572
Approved by: https://github.com/xunnanxu, https://github.com/wanchaol, https://github.com/yoyoyocmu
2024-05-16 02:00:45 +00:00
Wanchao Liang
04a241947a [dtensor] delete the old unused mesh_alltoall (#124879)
as titled, as we have a dedicated comm op, this is not needed anymore

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124879
Approved by: https://github.com/XilunWu, https://github.com/wz337
ghstack dependencies: #124871, #124872
2024-04-30 18:30:34 +00:00
Yuanhao Ji
e0d2c24de1 Fix device type issue in _get_device_handle (#124390)
Fix #124327

`device_type`, the first arg of [init_device_mesh()](a0466061e1/torch/distributed/device_mesh.py (L503)),  does not support types with indexes, such as `cuda:0`.
If `cuda:0` is used as a parameter, `_get_device_handle()` will not correctly return `torch.cuda`.
So the exception should be thrown before creating DeviceMesh object.

> See https://github.com/pytorch/pytorch/issues/124327#issuecomment-2062551161,

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124390
Approved by: https://github.com/wz337, https://github.com/wanchaol
2024-04-30 06:59:56 +00:00
PyTorch MergeBot
3bd67dab32 Revert "[dtensor] delete the old unused mesh_alltoall (#124879)"
This reverts commit f7f018a0ed.

Reverted https://github.com/pytorch/pytorch/pull/124879 on behalf of https://github.com/clee2000 due to broke distributed/tensor/parallel/test_tp_examples.py::DistTensorParallelExampleTest::test_transformer_training_is_seq_parallel_True https://github.com/pytorch/pytorch/actions/runs/8882762411/job/24389191482 f7f018a0ed.  Bad TD ([comment](https://github.com/pytorch/pytorch/pull/124872#issuecomment-2083599445))
2024-04-29 20:26:15 +00:00
Wanchao Liang
f7f018a0ed [dtensor] delete the old unused mesh_alltoall (#124879)
as titled, as we have a dedicated comm op, this is not needed anymore

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124879
Approved by: https://github.com/XilunWu, https://github.com/wz337
ghstack dependencies: #124871, #124872
2024-04-29 17:22:30 +00:00
Wanchao Liang
8d46ab4104 [dtensor] move pad/unpad_tensor to separate utils (#124871)
as titled, 1. pad/unpad is a general util not specific to the Shard
placement, 2. for the propose of the next PR, move these two out of Shard
placement itself, and give additional pad_dim argument

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124871
Approved by: https://github.com/awgu, https://github.com/wz337, https://github.com/XilunWu
2024-04-29 17:22:25 +00:00