Commit Graph

60 Commits

Author SHA1 Message Date
wz337
49430bfd5c [DeviceMesh] Add a _MeshEnv attr to record the mapping of flatten mesh_dim_name to its mesh dim index in root mesh (#133838)
```
# suppose we have a 3d mesh
mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()

"""
then we would have
flatten_name_to_root_dims[mesh_3d]: {
    "dp_cp": (0, 1)
}
"""
```

We need this information to validate the ordering of mesh slices that include a flattened mesh dim.
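For illustration, a minimal sketch of how this mapping could drive the ordering check; `_dims_for` and `_is_valid_slice` are hypothetical helpers, not the upstream implementation:

```
# Hypothetical helpers sketching the validation, not the actual upstream code.
def _dims_for(name, root_mesh, flatten_name_to_root_dims):
    # A flattened name maps to a tuple of root dims; a regular name maps to its own index.
    flat = flatten_name_to_root_dims.get(root_mesh, {})
    if name in flat:
        return flat[name]
    return (root_mesh.mesh_dim_names.index(name),)

def _is_valid_slice(names, root_mesh, flatten_name_to_root_dims):
    # A slice is valid iff the underlying root-dim indices are strictly increasing.
    dims = [d for n in names for d in _dims_for(n, root_mesh, flatten_name_to_root_dims)]
    return all(a < b for a, b in zip(dims, dims[1:]))
```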

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133838
Approved by: https://github.com/fegin
2024-08-20 19:43:45 +00:00
wz337
4bae7ae3d9 [DeviceMesh][Easy] Fix typo (#133790)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133790
Approved by: https://github.com/Skylion007
2024-08-19 05:20:22 +00:00
wz337
3fc9ee5a31 [DeviceMesh] Directly retrieve flattened mesh if already created (#133195)
Adds a mapping to keep track of the root_to_flatten relationship, so that an already-created flattened mesh can be retrieved directly (no new pg creation).
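A minimal sketch of the resulting behavior, assuming 8 ranks launched via torchrun:

```
from torch.distributed.device_mesh import init_device_mesh

mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))
dp_cp_first = mesh_3d["dp", "cp"]._flatten()
# The second call hits the root_to_flatten mapping: no new pg is created.
dp_cp_second = mesh_3d["dp", "cp"]._flatten()
assert dp_cp_first == dp_cp_second
```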

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133195
Approved by: https://github.com/fegin, https://github.com/wanchaol
ghstack dependencies: #133193
2024-08-14 21:11:04 +00:00
wz337
ef580a0e5c [DeviceMesh] Restrict slicing to be a contiguous or non-contiguous subsequence of the root mesh_dim_names (#133193)
This PR adds a restriction to DeviceMesh slicing: out-of-order subsequence slicing is no longer allowed. To create a flattened mesh_dim_name, only in-order slicing is allowed.

```
mesh_3d = init_device_mesh(
    self.device_type, (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"),
)

# valid 2d slicing
mesh_2d = mesh_3d["dp", "cp"]
mesh_2d = mesh_3d["dp", "tp"]
mesh_2d = mesh_3d["cp", "tp"]

# invalid 2d slicing
mesh_2d = mesh_3d["cp", "dp"]
mesh_2d = mesh_3d["tp", "cp"]
mesh_2d = mesh_3d["tp", "dp"]

# valid way to create dp_cp flatten slice
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()
# invalid way to create dp_cp flatten slice
dp_cp_mesh = mesh_3d["cp", "dp"]._flatten()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133193
Approved by: https://github.com/fegin, https://github.com/wanchaol
2024-08-14 07:18:41 +00:00
wz337
479d460471 [DeviceMesh] Add a private _flatten() API for device_mesh (#132632)
Adds a new private API to flatten a DeviceMesh to a 1D DeviceMesh such that:
```
mesh_3d = init_device_mesh(
    self.device_type, (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"),
)

dp_cp_mesh = mesh_3d["dp", "cp"]
# flattened_mesh on rank 0, 2, 4, 6 is DeviceMesh([0, 2, 4, 6], mesh_dim_names=('dp_cp',))
# flattened_mesh on rank 1, 3, 5, 7 is DeviceMesh([1, 3, 5, 7], mesh_dim_names=('dp_cp',))
flattened_dp_cp_mesh = dp_cp_mesh._flatten()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132632
Approved by: https://github.com/fegin, https://github.com/wanchaol
ghstack dependencies: #132310, #132311, #132339
2024-08-08 06:46:42 +00:00
wz337
8b50d5398f [DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)
More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366.

TLDR:
When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if the default pg is gloo, as many collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gathering a cuda tensor using gloo. Without the change in this PR, users would have to know this context and explicitly move the cuda tensor to cpu before invoking most collectives, which I think is not ideal UX.

Therefore, given that most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda.
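For context, a sketch of the workaround users needed without this change, assuming a CUDA machine and ranks launched via torchrun with a gloo default pg:

```
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # default pg is gloo
t = torch.ones(4, device="cuda")

# Pre-PR workaround: copy to cpu before the collective, since gloo does not
# support most collectives on cuda tensors.
t_cpu = t.cpu()
out = [torch.empty_like(t_cpu) for _ in range(dist.get_world_size())]
dist.all_gather(out, t_cpu)
```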

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-08-07 16:13:11 +00:00
wz337
87053132ea [DeviceMesh] Remove parent mesh concept from _MeshEnv and replace by root mesh (#132339)
Previously, when we sliced a submesh out of a mesh, we assigned that mesh as the parent mesh of the submesh. As a result, with a 3D mesh topology, the parent mesh of a 1D mesh sliced from the 3D mesh differs from the parent mesh of the same 1D mesh sliced from a 2D submesh of the 3D mesh. For example:
```
mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]

mesh_2d = mesh_3d["dim0", "dim1"]
mesh_dim0_2 = mesh_2d["dim0"]

# This would evaluate to True
print(_mesh_resources.get_parent_mesh(mesh_dim0) != _mesh_resources.get_parent_mesh(mesh_dim0_2))
```

We can always reconstruct the needed mesh from the mesh dim names, as long as the dims come from the same root. For simplicity, we do not see the necessity of building a tree structure to represent the child-parent relationship. Therefore, we are replacing the parent mesh concept with a root mesh concept in `_MeshEnv`, so we would have:

```
mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]

mesh_2d = mesh_3d["dim0", "dim1"]
mesh_dim0_2 = mesh_2d["dim0"]

# This would evaluate to True
print(_mesh_resources.get_root_mesh(mesh_dim0) == _mesh_resources.get_root_mesh(mesh_dim0_2))
```
With this change, we will have two types of meshes in an environment (see the sketch after the list).
1. `device_mesh != _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh was created by slicing.
2. `device_mesh == _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is a root mesh, not created through slicing.
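A minimal sketch of the two cases, assuming 8 ranks launched via torchrun:

```
from torch.distributed.device_mesh import init_device_mesh, _mesh_resources

mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dim0", "dim1", "dim2"))
sliced = mesh_3d["dim2"]

assert _mesh_resources.get_root_mesh(mesh_3d) == mesh_3d  # a root mesh
assert _mesh_resources.get_root_mesh(sliced) == mesh_3d   # created by slicing
```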

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132339
Approved by: https://github.com/wanchaol
ghstack dependencies: #132310, #132311
2024-08-07 07:01:12 +00:00
PyTorch MergeBot
c7113a6186 Revert "[DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)"
This reverts commit 1a23ef2ece.

Reverted https://github.com/pytorch/pytorch/pull/132709 on behalf of https://github.com/clee2000 due to I think this broke distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_device_mesh_initialization [GH job link](https://github.com/pytorch/pytorch/actions/runs/10274519791/job/28432469987) [HUD commit link](1a23ef2ece).  Test not run due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/132709#issuecomment-2272350923))
2024-08-06 23:47:53 +00:00
wz337
1a23ef2ece [DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)
More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366.

TLDR:
When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if the default pg is gloo, as many collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gathering a cuda tensor using gloo. Without the change in this PR, users would have to know this context and explicitly move the cuda tensor to cpu before invoking most collectives, which I think is not ideal UX.

Therefore, given that most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-08-06 22:00:09 +00:00
Brian Hirsh
e6eee04875 dynamo: use equality guards instead of id guards for Placement/DeviceMesh (#124401)
After talking to @anijain2305, we probably can't land this since it won't work for C++ guards. But we should still be able to do better than ID_MATCH.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124401
Approved by: https://github.com/anijain2305
2024-08-06 17:14:44 +00:00
wz337
4306eebab1 [DeviceMesh] Update slicing documentation to include nD and non-continuous slicing (#132311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132311
Approved by: https://github.com/wanchaol
ghstack dependencies: #132310
2024-08-05 23:44:23 +00:00
wz337
fb87796d4f [DeviceMesh] Add support for non-continuous slicing (#132310)
Removes the constraint of continuous slicing to allow non-continuous slicing, and adds a unit test for 3D non-continuous slicing.
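A minimal sketch of a now-valid non-continuous slice, assuming 8 ranks launched via torchrun:

```
from torch.distributed.device_mesh import init_device_mesh

mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))
# Non-continuous: "dp" and "tp" skip over the middle "cp" dim.
dp_tp_mesh = mesh_3d["dp", "tp"]
```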

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132310
Approved by: https://github.com/wanchaol
2024-08-05 09:30:07 +00:00
wz337
4de85e3c30 [DeviceMesh] Remove _parent_mesh as an attribute from DeviceMesh and remove it from DeviceMesh's hash (#131636)
We recently revisited the hash implementation and think `_parent_mesh` information should not be burned into DeviceMesh but rather be inferred from the MeshEnv which manages device meshes.

Since `mesh_dim_names` is considered in the device mesh's hash, this should not affect the issue brought up in https://github.com/pytorch/pytorch/issues/121799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131636
Approved by: https://github.com/wanchaol
2024-07-25 22:47:22 +00:00
Iris Zhang (PyTorch)
ee6f0ab190 [DeviceMesh][Reland] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495) (#130685)
Summary:
As a followup to https://github.com/pytorch/pytorch/pull/130454, users are hitting the cross-mesh operation error because the DeviceMesh thread ID differs between the saved vs. loaded DTensor.

This is a hot fix to only consider the real thread_id in DeviceMesh hash under threaded backend, but set it to None for all other cases.

As a follow-up, we need to look at the following test failures to better root-cause DeviceMesh failures related to MTPG when thread_id is not included as part of the hash.
```
test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShardRegisteredParams::test_param_registration_after_forward
test/distributed/_tensor/test_dtensor_ops.py::TestDTensorOpsCPU::test_dtensor_op_db_column_stack_cpu_float32
```

Adding an additional is_initialized() check, since APF has a test that mocks the backend without a pg initialized; the check avoids that test failure. In a real use case, a pg should be initialized before the get_backend() check. Not sure if we want to add this specifically for the test, but temporarily adding it to unblock APF conveyor runs.
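A hedged sketch combining the two checks described above; the helper name and the "threaded" backend string are assumptions, not the exact upstream source:

```
from typing import Optional
import threading

import torch.distributed as dist

def _thread_id_for_hash() -> Optional[int]:
    # Hypothetical helper: only the threaded test backend contributes a real
    # thread id to the DeviceMesh hash; all other cases use None. The
    # is_initialized() guard covers tests that mock the backend without a pg.
    if dist.is_initialized() and dist.get_backend() == "threaded":
        return threading.get_ident()
    return None
```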

Test Plan:
```
[irisz@devgpu051.cln3 /data/users/irisz/fbsource/fbcode (38e4a0a3b)]$ buck2 test 'fbcode//mode/opt' fbcode//apf/distributed/tests:pipeline_parallel_test_cpu -- --exact 'apf/distributed/tests:pipeline_parallel_test_cpu - apf.distributed.tests.pipeline_parallel_test_cpu.PipelineParallelContextTestCPU: test_stage_pg_creation_with_different_backends'
```

Reviewed By: gag1jain

Differential Revision: D59725924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130685
Approved by: https://github.com/gag1jain
2024-07-15 20:05:26 +00:00
Gagan Jain
97cfc65dbc Back out "[DeviceMesh] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495)" (#130676)
Summary:
Original commit changeset: 80c2ca639146

Original Phabricator Diff: D59612200

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//apf/distributed/tests:pipeline_parallel_test_cpu -- --exact 'apf/distributed/tests:pipeline_parallel_test_cpu - apf.distributed.tests.pipeline_parallel_test_cpu.PipelineParallelContextTestCPU: test_stage_pg_creation_with_different_backends'

Differential Revision: D59719562

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130676
Approved by: https://github.com/xunnanxu
2024-07-13 23:19:22 +00:00
wz337
3896ba3260 [DeviceMesh] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495)
Fixes #ISSUE_NUMBER

As a followup to https://github.com/pytorch/pytorch/pull/130454, users are hitting the cross-mesh operation error because the DeviceMesh thread ID differs between the saved vs. loaded DTensor.

This is a hot fix to only consider the real thread_id in DeviceMesh hash under threaded backend, but set it to None for all other cases.

As a follow-up, we need to look at the following test failures to better root-cause DeviceMesh failures related to MTPG when thread_id is not included as part of the hash.
```
test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShardRegisteredParams::test_param_registration_after_forward
test/distributed/_tensor/test_dtensor_ops.py::TestDTensorOpsCPU::test_dtensor_op_db_column_stack_cpu_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130495
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-07-11 17:02:18 +00:00
Zaida Zhou
bc8883a7c4 fix the error msg in device_mesh (#129747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129747
Approved by: https://github.com/awgu, https://github.com/wconstab
2024-06-28 20:12:09 +00:00
Xuehai Pan
94dc3253a0 [BE][Easy] enable UFMT for torch/distributed/ (#128870)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870
Approved by: https://github.com/fegin, https://github.com/wconstab
2024-06-22 18:53:28 +00:00
PyTorch MergeBot
9c929f6ce9 Revert "[BE][Easy] enable UFMT for torch/distributed/ (#128870)"
This reverts commit a0e1e20c41.

Reverted https://github.com/pytorch/pytorch/pull/128870 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128870#issuecomment-2181780356))
2024-06-21 00:38:28 +00:00
Xuehai Pan
a0e1e20c41 [BE][Easy] enable UFMT for torch/distributed/ (#128870)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870
Approved by: https://github.com/fegin
ghstack dependencies: #128868, #128869
2024-06-18 21:49:08 +00:00
Aaron Orenstein
3a0d088517 Flip default value for mypy disallow_untyped_defs [5/11] (#127842)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127842
Approved by: https://github.com/oulgen
2024-06-08 18:49:18 +00:00
Iris Z
1d84c7e100 [DeviceMesh] Update get_group and add get_all_groups (#128097)
Fixes #121984
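A minimal sketch of the updated API, assuming 8 ranks launched via torchrun:

```
from torch.distributed.device_mesh import init_device_mesh

mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))
tp_group = mesh_2d.get_group("tp")     # pg for a single named mesh dim
all_groups = mesh_2d.get_all_groups()  # one pg per mesh dim
```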

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128097
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-06-08 04:28:56 +00:00
Yifu Wang
1d0c1087dd Allow overriding per-dim group options via _MeshEnv.set_dim_group_options (#126599)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126599
Approved by: https://github.com/wanchaol
ghstack dependencies: #126598
2024-06-06 17:18:12 +00:00
Will Constable
22368eac10 [FSDP2] Fix submesh slicing to enable 3D parallelism (#127585)
Ensures that sharded parameters are created on a submesh that excludes the Pipeline Parallelism dimension.

Also cleans up the logic for storing placements to no longer consider the outer / global dims.  Since we store an 'spmd' submesh, we can avoid this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127585
Approved by: https://github.com/wanchaol
2024-06-04 04:24:09 +00:00
Iris Z
1699edaabb [DeviceMesh] Adding nD slicing support back (#127465)
Fixes #126530

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127465
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-05-31 17:06:36 +00:00
PyTorch MergeBot
f6e303fa47 Revert "[DeviceMesh] Adding nD slicing support back (#127465)"
This reverts commit e72232f8f0.

Reverted https://github.com/pytorch/pytorch/pull/127465 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint e72232f8f0, the error does not look like a trivial fix, so I revert the change for a forward fix ([comment](https://github.com/pytorch/pytorch/pull/127465#issuecomment-2141051630))
2024-05-31 00:43:13 +00:00
wz337
e72232f8f0 [DeviceMesh] Adding nD slicing support back (#127465)
Fixes #126530

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127465
Approved by: https://github.com/wconstab
2024-05-30 23:55:21 +00:00
Aaron Gokaslan
3cb16ebf08 [BE]: Update ruff to 0.4.5 (#126979)
Updates ruff to 0.4.5 and addresses some false negatives that were found in the newer version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126979
Approved by: https://github.com/ezyang
2024-05-24 18:38:35 +00:00
Andrew Gu
697ed6f5b3 [DeviceMesh] Supported N groups in from_group (#126258)
**Overview**
This PR supports constructing an ND mesh with `from_group()` by passing in `group: List[ProcessGroup]` and `mesh: Union[torch.Tensor, "ArrayLike"]` together. The `ndim` of the device mesh returned from `from_group()` is equal to the number of `ProcessGroup`s passed. If the `ndim` is greater than 1, then the `mesh` argument is required (since there is no simple way to recover the `mesh` tensor from the process groups otherwise).

This PR also adds `mesh_dim_names` as an argument to forward to the device mesh for convenience.
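For illustration, a hedged sketch of the new calling convention, assuming 4 ranks launched via torchrun (the group construction is only there to keep the example self-contained):

```
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import DeviceMesh

dist.init_process_group("nccl")
rank = dist.get_rank()

# new_group is collective: every rank creates every group, then picks its own.
pg_even, pg_odd = dist.new_group([0, 2]), dist.new_group([1, 3])
pg_lo, pg_hi = dist.new_group([0, 1]), dist.new_group([2, 3])
replicate_pg = pg_even if rank in (0, 2) else pg_odd
shard_pg = pg_lo if rank in (0, 1) else pg_hi

mesh_2d = DeviceMesh.from_group(
    [replicate_pg, shard_pg],            # one ProcessGroup per mesh dim
    device_type="cuda",
    mesh=torch.arange(4).reshape(2, 2),  # required when ndim > 1
    mesh_dim_names=("replicate", "shard"),
)
```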

<details>
<summary> Old Approach </summary>

**Overview**
- This PR mainly adds `mesh_shape` to `from_group()` so that the user can construct an ND (N > 1) device mesh from a process group. This is to unblock HSDP, where we can pass the overall data parallel process group to `from_group()` with `mesh_shape = (replicate_dim_size, shard_dim_size)` and `from_group()` will construct subgroups for the user. (The user can then get the subgroups from the submeshes.)
    - Constructing the 2D `DeviceMesh` from an existing shard process group and replicate process group is hard because we cannot easily recover the array of ranks in their parent group on each rank in general.
- This PR also adds `mesh_dim_names` to `from_group()` so that the user can name the mesh dimensions of the constructed device mesh.

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126258
Approved by: https://github.com/wanchaol
2024-05-17 01:03:21 +00:00
wz337
059b68fbdf [DeviceMesh] Fix hash and eq not match (#123572)
Fixes #121799

We fix the DeviceMesh hash such that two meshes are considered equal if they have the same mesh and the same parent_mesh.
Examples can be found here: https://github.com/pytorch/pytorch/issues/121799

Also need this to unblock https://github.com/pytorch/pytorch/pull/123394

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123572
Approved by: https://github.com/xunnanxu, https://github.com/wanchaol, https://github.com/yoyoyocmu
2024-05-16 02:00:45 +00:00
Yuanhao Ji
e0d2c24de1 Fix device type issue in _get_device_handle (#124390)
Fix #124327

`device_type`, the first arg of [init_device_mesh()](a0466061e1/torch/distributed/device_mesh.py (L503)), does not support device types with an index, such as `cuda:0`.
If `cuda:0` is used as a parameter, `_get_device_handle()` will not correctly return `torch.cuda`,
so the exception should be thrown before creating the DeviceMesh object.

> See https://github.com/pytorch/pytorch/issues/124327#issuecomment-2062551161,
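A minimal sketch of the resulting behavior:

```
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (8,))    # OK: bare device type
mesh = init_device_mesh("cuda:0", (8,))  # raises: device types with an index are rejected
```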

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124390
Approved by: https://github.com/wz337, https://github.com/wanchaol
2024-04-30 06:59:56 +00:00
Iris Zhang (PyTorch)
4ad291d07f [DeviceMesh] Removing mapping child_to_parent_mapping from _MeshEnv (#124890)
Summary: The mapping is no longer needed after https://github.com/pytorch/pytorch/pull/124780, as we are not going to re-create the pgs during mesh slicing.

Test Plan: CI

Differential Revision: D56499001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124890
Approved by: https://github.com/awgu
2024-04-26 06:40:36 +00:00
Iris Zhang (PyTorch)
43f4e71daa Making _MeshEnv subclassing thread local (#124555)
With _mesh_resources being a global var, when thread-pg-based testing is used (aka spawn_threads_and_init_comms()), the last rank with the same key would overwrite the former ones. This isn't an issue in the regular process-based runtime, as logically each key is unique.

Example failure: https://github.com/pytorch/pytorch/actions/runs/8779134353/job/24087295785
```
RuntimeError: Could not resolve the process group registered under the name 8
or
Throwing assert not none error
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124555
Approved by: https://github.com/xunnanxu, https://github.com/wanchaol
2024-04-26 02:45:42 +00:00
Andrew Gu
36c983a973 [DeviceMesh] Added DeviceMesh.from_group() (#124787)
This PR adds a `DeviceMesh.from_group()` static method to convert an existing process group to a device mesh.

Motivation: We need `DeviceMesh.from_group()` to allow FSDP2 to interoperate with distributed libraries that do not use `DeviceMesh` for all parallelisms.
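A minimal sketch, assuming the process group would normally come from an external library (here stubbed with the default pg):

```
import torch.distributed as dist
from torch.distributed.device_mesh import DeviceMesh

dist.init_process_group("nccl")  # stands in for a pg created elsewhere
mesh_1d = DeviceMesh.from_group(dist.group.WORLD, "cuda")
pg = mesh_1d.get_group()         # round-trips back to a process group
```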

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124787
Approved by: https://github.com/wanchaol
ghstack dependencies: #124651, #124741, #124767, #124768, #124780
2024-04-24 23:16:06 +00:00
Andrew Gu
48312a7fc3 [DeviceMesh] Removed unneeded .to(cpu) (#124768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124768
Approved by: https://github.com/wz337
ghstack dependencies: #124651, #124741, #124767
2024-04-24 18:07:20 +00:00
Andrew Gu
1db7d64af2 [DeviceMesh] Initialized mesh tensor with CPU context (#124767)
This PR makes sure to construct the `DeviceMesh`'s `mesh` tensor on the CPU device in `init_device_mesh()`. This means that we can call `init_device_mesh()` under a meta-device context and still construct the correct `mesh` tensor.
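A minimal sketch of the guarantee, assuming 8 ranks launched via torchrun:

```
import torch
from torch.distributed.device_mesh import init_device_mesh

with torch.device("meta"):
    mesh = init_device_mesh("cuda", (8,))
# The mesh tensor is constructed under a CPU context, so it is not a meta tensor.
assert mesh.mesh.device.type == "cpu"
```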

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124767
Approved by: https://github.com/wz337
ghstack dependencies: #124651, #124741
2024-04-24 18:04:06 +00:00
Wanchao Liang
0da94f3a08 [device_mesh] add a private init backend option (#124780)
This PR adds a private init-backend option to tackle an issue with submesh creation:

in device mesh slicing we don't want to create process groups again, so it's useful to be able to explicitly turn group creation off.

Also, I think there might be more submesh creation functionality coming, so having this flag would ensure that no new group is created.

Differential Revision: [D56497780](https://our.internmc.facebook.com/intern/diff/D56497780)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124780
Approved by: https://github.com/awgu
2024-04-24 04:31:58 +00:00
wz337
37fd547518 [DeviceMesh] Make dtype of mesh tensor from init_device_mesh() consistent with directly calling DeviceMesh() (#123677)
Currently, the mesh tensor from `init_device_mesh()` has a dtype of `torch.int64`, while the mesh tensor from `DeviceMesh()` has a dtype of `torch.int32`. This PR makes them consistent.

DeviceMesh ctor dtype pointer:
https://github.com/pytorch/pytorch/blob/main/torch/distributed/device_mesh.py#L217
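A minimal sketch of the consistency claim, assuming 8 ranks launched via torchrun (the assertion only checks that the two dtypes now match):

```
import torch
from torch.distributed.device_mesh import DeviceMesh, init_device_mesh

mesh_a = init_device_mesh("cuda", (8,))
mesh_b = DeviceMesh("cuda", torch.arange(8))
assert mesh_a.mesh.dtype == mesh_b.mesh.dtype
```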

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123677
Approved by: https://github.com/xunnanxu, https://github.com/wanchaol
2024-04-10 09:14:34 +00:00
wz337
2b1ba0ceae [DeviceMesh] Cache and reuse sliced result (#122975)
Fixes #118849

Adds a parent_to_child_mappings map in _mesh_resources so we can cache and reuse submesh slicing results, avoiding repeated recreation of the submesh and the underlying sub-pg, which could lead to funky behaviors.

We will follow up with reusing pg from the parent_mesh during submesh creation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122975
Approved by: https://github.com/wanchaol
2024-03-30 23:56:55 +00:00
Iris Zhang (PyTorch)
e99fa0042c Back out "[DeviceMesh] Add support for nD slicing (#119752)" (#121763)
Summary:
Original commit changeset: e52b8809c8d8

Original Phabricator Diff: D54778906

We have to back out this diff.
D54778906 seems to be causing test failures for APF, blocking trunk health and hence the release. Just starting to look at the issue. T182209248

Test Plan: Sandcastle

Reviewed By: satgera

Differential Revision: D54825114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121763
Approved by: https://github.com/osalpekar
2024-03-13 07:22:08 +00:00
wz337
60cd2a43ca [DeviceMesh] Add support for nD slicing (#119752)
Fixes one of the issues mentioned in #118639
@mvpatel2000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119752
Approved by: https://github.com/wanchaol
2024-03-10 00:16:37 +00:00
wz337
5603d95375 [DeviceMesh] Ensure mesh tensor is a cpu tensor (#120046)
More discussion in the last comment in https://github.com/pytorch/pytorch/pull/118614

In general, users won't pass a cuda tensor to DeviceMesh, as the mesh tensor is just a way to construct a mesh and doesn't require cuda compute. Taking the suggestion from @awgu, we enforce the tensor to be a cpu tensor if it is not already one, so that we can prevent a device sync.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120046
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2024-02-22 22:03:13 +00:00
Iris Zhang (PyTorch)
0245000be8 [DeviceMesh] Temporarily disable re-use subgroup (#118940)
Summary:
The subgroup-reuse logic is causing GLOO to time out on two internal modelstore tests (relevant tests in test plan).
We temporarily disable subgroup re-use while root-causing, so the internal tests can run again; they are currently omitted, as shown in T176426987.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118940
Approved by: https://github.com/wanchaol
2024-02-05 06:30:00 +00:00
wz337
c908caf92b [DeviceMesh] Allow 1d slice from 1d mesh (#118895)
Fixes [#118851](https://github.com/pytorch/pytorch/issues/118851)

i.e.
mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("dp",))
Then dp_mesh = mesh["dp"] should still work; it is just a dummy return, without recording a parent mesh.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118895
Approved by: https://github.com/wanchaol
2024-02-02 22:00:24 +00:00
Yifu Wang
697ca4f292 Preliminary DeviceMesh + native c10d functional integration (#118423)
### Summary
- Added `group_name` as the third field in `dim_group_infos`.
- `DeviceMeshTest` now runs both w/ and w/o `_USE_NATIVE_C10D_FUNCTIONAL=1` in CI.

### Other fixes
- Convert `reduceOp` to lower case before passing it into c10d_functional ops.
- Added a finalizer to handle unwaited collectives (this mirrors the treatment for Python functional collective ops).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118423
Approved by: https://github.com/wanchaol, https://github.com/LucasLLC, https://github.com/wconstab
2024-01-31 04:36:12 +00:00
Catherine Lee
4f5785b6b3 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 21:07:01 +00:00
PyTorch MergeBot
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
Edward Z. Yang
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
Andrew Gu
68b18dc2a2 [DeviceMesh] Removed print of self._dim_group_infos (#118527)
This print seems to have accidentally been merged in. It is a bit verbose during unit tests, so this PR removes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118527
Approved by: https://github.com/wz337
2024-01-29 19:14:25 +00:00
wz337
e1f9eca113 [DeviceMesh] Reuse sub_group pg if exists (#115716)
Currently, we create a new_group for the sub_group pg during mesh initialization. This PR changes this so we will:
1) re-use the sub_group pg if it exists,
2) create a new sub_group pg if it does not exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115716
Approved by: https://github.com/wanchaol
2024-01-25 18:07:16 +00:00