Commit Graph

87 Commits

Author SHA1 Message Date
Aaron Orenstein
e95e8eed0a mypy 1.16.0 (#155821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155821
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-06-14 18:18:43 +00:00
PyTorch MergeBot
7e4c097b07 Revert "[inductor] Add typing to _inductor/ir.py (#149958)"
This reverts commit 529e0357c6.

Reverted https://github.com/pytorch/pytorch/pull/149958 on behalf of https://github.com/malfet due to Looks like it broke inductor_torchbind tests, due to more graphbreaks, see b0fbbef136/1 ([comment](https://github.com/pytorch/pytorch/pull/149958#issuecomment-2949583209))
2025-06-06 15:19:16 +00:00
Tom Ritchford
529e0357c6 [inductor] Add typing to _inductor/ir.py (#149958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149958
Approved by: https://github.com/Skylion007
2025-06-06 14:15:01 +00:00
Jane Xu
8817e5ac80 Render Example: and not Example:: in docs (#153978)
Everything here is a grep except the changes in tools/autograd/load_derivatives.py, which I manually corrected.

The correct notation is:
```
Example::

    >>> ...
```

It is common and wrong to have:
```
Example::
    >>> ...
```

In the wrong example, we get these pesky double colons:
![image](https://github.com/user-attachments/assets/20ffd349-68bb-4552-966c-e23923350476)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153978
Approved by: https://github.com/soulitzer, https://github.com/malfet
2025-05-21 01:03:26 +00:00
Wanchao Liang
4c5cf18ee0 [device_mesh] improve device selection logic (#150897)
as titled, this PR improves the device selection logic for the case where the user has not set a device before calling the DeviceMesh constructor. As a device manager, DeviceMesh should try to set the device for users in a sensible way.

The behavior of set_device before this PR:

* If the user calls init_process_group to init a world process group, we assume they have already called set_device, so we don't set the device for them.
* If the user does not init a world process group themselves, we init one for them and follow a heuristic to set the device. This is OK, but the set_device heuristic sometimes doesn't work well (e.g. if the user uses TORCH_CUDA_VISIBLE_DEVICES).

So this PR improves the device selection logic to (see the sketch after this list):

* If the default cuda context is already initialized by the time we init DeviceMesh, we assume the user must have run some cuda operation earlier and therefore must have selected the device themselves.
* Otherwise, we check whether the env vars "LOCAL_RANK" and "WORLD_SIZE" were set by the launcher (e.g. torchrun); if so, we use "LOCAL_RANK" to set the device for the current process, which is a very standard practice. (This solves the TORCH_CUDA_VISIBLE_DEVICES issue.)
* Otherwise, we warn the user about the situation and fall back to the old heuristic.
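A minimal sketch of that selection order, assuming a hypothetical helper `_maybe_set_device` (illustrative only, not the actual DeviceMesh code):

```
import os
import warnings

import torch
import torch.distributed as dist


def _maybe_set_device() -> None:  # hypothetical helper, not the real API
    if not torch.cuda.is_available():
        return
    # 1. CUDA context already initialized: assume the user picked a device.
    if torch.cuda.is_initialized():
        return
    # 2. Launcher env vars present (e.g. torchrun): use LOCAL_RANK.
    if "LOCAL_RANK" in os.environ and "WORLD_SIZE" in os.environ:
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
        return
    # 3. Fall back to the old heuristic and warn.
    warnings.warn("No device selected; falling back to rank-based heuristic.")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
```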
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150897
Approved by: https://github.com/tianyu-l
ghstack dependencies: #150898
2025-05-14 06:29:16 +00:00
Wanchao Liang
9df9d9ded0 [device_mesh] replace dim_group_info with group_name (#150898)
as titled, there's no need to maintain dim_group_info anymore; we can simply maintain a list of group_names instead. This simplifies the logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150898
Approved by: https://github.com/tianyu-l, https://github.com/fegin
2025-05-13 17:16:45 +00:00
Iris Z
2544afaa1a [DeviceMesh] Add some documentation for from_group API and add a 2D test (#146364)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146364
Approved by: https://github.com/fduwjj
2025-03-01 00:57:37 +00:00
Xuehai Pan
995df34b19 [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547
Approved by: https://github.com/kwen2501
2025-02-28 07:35:56 +00:00
Aaron Orenstein
00ffeca1b1 PEP585 update - torch/distributed (#145164)
See #145101 for details.
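A small before/after illustration of the PEP 585 migration (a generic example, not a specific diff from this PR):

```
# Before: typing-module generics
from typing import Dict, List

def group_ranks(ranks: List[int]) -> Dict[int, List[int]]: ...

# After (PEP 585): builtin generics
def group_ranks(ranks: list[int]) -> dict[int, list[int]]: ...
```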

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-21 04:23:29 +00:00
PyTorch MergeBot
6374332d33 Revert "PEP585 update - torch/distributed (#145164)"
This reverts commit 6cb186e279.

Reverted https://github.com/pytorch/pytorch/pull/145164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing an inductor test ([comment](https://github.com/pytorch/pytorch/pull/145164#issuecomment-2602875679))
2025-01-20 16:46:46 +00:00
Aaron Orenstein
6cb186e279 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-20 00:19:01 +00:00
bobrenjc93
08be9ec312 Migrate from Tuple -> tuple in torch/distributed (#144258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258
Approved by: https://github.com/aorenste
2025-01-10 08:34:54 +00:00
lzhang2
b7ad52abb0 Use new group instead of split group on non-CUDA device (#141469)
Motivation:

Currently, `split_group` only works for the NCCL backend (https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L4745), so we need to use `new_group` on non-CUDA devices instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141469
Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD
2024-12-13 05:11:33 +00:00
Chien-Chin Huang
daa27fe59d [DeviceMesh] Call no_dispatch before doing tensor slicing in DeviceMesh (#142287)
Summary:
DeviceMesh's tensor operations are control-plane operations, not data-plane ones, and should not be affected by FakeTensorMode.
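A minimal sketch of the idea, assuming the internal `no_dispatch` utility (the exact call site inside DeviceMesh is not shown here):

```
import torch
from torch.utils._mode_utils import no_dispatch

mesh_tensor = torch.arange(8).reshape(2, 2, 2)

# Slice the mesh tensor with dispatch modes (e.g. FakeTensorMode) disabled:
# this is control-plane bookkeeping, not model data.
with no_dispatch():
    submesh_tensor = mesh_tensor[0]
```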

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142287
Approved by: https://github.com/XilunWu
2024-12-10 06:33:01 +00:00
SCh-zx
c8c669ce74 When using a third-party device to test DeviceMesh, the error check for 'test_raises_invalid_device_type' can only print 'GPU' (#142038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142038
Approved by: https://github.com/kwen2501
2024-12-06 11:14:00 +00:00
Xilun Wu
7964bcc3dc [DeviceMesh] fix sub mesh size calculation in create_sub_mesh() (#138945)
**Summary**
This PR fixes a miscalculation in DeviceMesh's create_sub_mesh().

**Error Description**
When users call `device_mesh["dim0", "dim1", "dim2", "dim3"]`, it creates a slice of the mesh, which we call a "submesh". Users can also slice a submesh from a flattened mesh. For example:
```
flattened_mesh = device_mesh["dim0", "dim1", "dim2"]._flatten("dim0-2")
alias_flattened_mesh = device_mesh["dim0-2"]  # this mesh slice leads to error in current impl
```

This triggers an error in the size calculation (a `reduce` with a lambda over the mesh dims) inside `create_sub_mesh`:
```
IndexError: Dimension out of range (expected to be in range of [-4, 3], but got 4)
```

**Fix**
The lambda's arguments were used incorrectly: in `reduce`'s `lambda x, y`, `x` is the accumulated value while `y` is the current element.
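For illustration (the dim sizes below are assumptions, not the actual test mesh):

```
from functools import reduce

dim_sizes = [2, 2, 2]  # sizes of the mesh dims being flattened

# In reduce's lambda, `x` is the running accumulator and `y` is the next
# element; swapping their roles indexes into the wrong value.
flattened_size = reduce(lambda x, y: x * y, dim_sizes, 1)
assert flattened_size == 8
```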

**Test**
`pytest test/distributed/test_device_mesh.py -s -k test_flatten_mesh_4d`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138945
Approved by: https://github.com/wz337
2024-10-29 17:56:56 +00:00
wz337
d7e0e1dbc4 [DeviceMesh] Use split_group to create sub_groups for nccl backend if the default pg is eagerly initialized (#138129)
Use `split_group()` to create sub_groups for the nccl backend if the default pg is eagerly initialized. Otherwise, it still goes through the normal lazy-init process and calls `new_group()` instead.
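A rough sketch of that dispatch (the helper name and eager-init flag are assumptions, and `split_group`'s exact arguments may differ across versions):

```
import torch.distributed as dist


def _make_subgroup(ranks, eager_nccl_default_pg: bool):  # hypothetical helper
    if eager_nccl_default_pg:
        # Split communicators off the already-initialized default NCCL pg.
        return dist.split_group(split_ranks=[ranks])
    # Otherwise fall back to the normal lazy-init path.
    return dist.new_group(ranks=ranks)
```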

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138129
Approved by: https://github.com/kwen2501
2024-10-22 00:00:05 +00:00
wz337
fb0da32377 [DeviceMesh] Small refactor to optimize DeviceMesh subgroup creation (#138117)
As `backend`, `pg_options`, and `group_desc` are the same for each mesh dimension, we don't need to get or create these args for `new_group` multiple times. This PR moves their creation from the inner loop of subgroup creation (over each subgroup's ranks within each mesh dimension) to the outer loop (over each mesh dimension).

For example, given a 2 x 4 DeviceMesh, we were previously re-creating the variables `backend`, `pg_options`, and `group_desc` 2 * 4 = 8 times. After this change, we create them only once per mesh dimension, i.e. 2 times.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138117
Approved by: https://github.com/kwen2501
2024-10-21 03:04:24 +00:00
Andrew Gu
39c5048549 [DeviceMesh] Fixed from_group when passing tensor mesh (#137713)
This fixes https://github.com/pytorch/pytorch/issues/137676. (Sorry for messing this up in the original PR 😓)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137713
Approved by: https://github.com/wz337
2024-10-11 14:53:51 +00:00
wz337
93dcb92bae [DeviceMesh][EZ] Add group description to new group (#136558)
Add a group description to `new_group` in device_mesh to help with debuggability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136558
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
2024-09-28 03:09:41 +00:00
fduwjj
a0c7029a75 [c10d][Reland] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931) (#135653)
We introduced the dispatchable backend for a ProcessGroup and collectives in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup that removes the Options of a ProcessGroup and asks users to either set the timeout or backend later on, or to create the backend directly after creating a PG.

Also, PGNCCL was using the Options class from ProcessGroup, but we should actually use the Options from the backend class. So this PR aligns the type and naming with what we do on the cpp side. I don't change the signatures of the public APIs, so they still use args named "pg_options".

We need to update the tests to align with this change.
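A hedged usage sketch after this change (assumes an NCCL build; the specific option field is illustrative):

```
from datetime import timedelta

import torch.distributed as dist

# Backend-specific options now come from the backend class
# (ProcessGroupNCCL.Options), not ProcessGroup.Options. Public APIs keep
# the `pg_options` arg name.
opts = dist.ProcessGroupNCCL.Options()
opts.is_high_priority_stream = True  # example NCCL-specific knob
dist.init_process_group("nccl", pg_options=opts, timeout=timedelta(minutes=10))
```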

This is an attempt to reland D62008954 with the internal errors fixed.

Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135653
Approved by: https://github.com/wz337, https://github.com/H-Huang
2024-09-16 19:56:42 +00:00
wz337
67f98a99a4 [DeviceMesh][Easy] Make RuntimeError a bit more descriptive by including the actual world_size (#135271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135271
Approved by: https://github.com/fduwjj
2024-09-06 06:23:20 +00:00
PyTorch MergeBot
351ba3e67c Revert "[c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)"
This reverts commit 65864d0134.

Reverted https://github.com/pytorch/pytorch/pull/132931 on behalf of https://github.com/ZainRizvi due to This PR is breaking builds internally due to the removal of ProcessGroup::Options ([comment](https://github.com/pytorch/pytorch/pull/132931#issuecomment-2321862402))
2024-08-30 16:27:40 +00:00
fduwjj
65864d0134 [c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)
We introduced the dispatchable backend for a ProcessGroup and collectives in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup that removes the Options of a ProcessGroup and asks users to either set the timeout or backend later on, or to create the backend directly after creating a PG.

Also, PGNCCL was using the Options class from ProcessGroup, but we should actually use the Options from the backend class. So this PR aligns the type and naming with what we do on the cpp side. I don't change the signatures of the public APIs, so they still use args named "pg_options".

We need to update the tests to align with this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132931
Approved by: https://github.com/H-Huang
2024-08-29 22:40:12 +00:00
wz337
761cf91e3c [DeviceMesh] Add get_all_submeshes in _MeshEnv (#134275)
Adding a private helper method for Shampoo HSDP use cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134275
Approved by: https://github.com/XilunWu
2024-08-27 14:51:19 +00:00
wz337
268092db83 [DeviceMesh] Allow _flatten() to take in an optional mesh_dim_name (#134048)
If a mesh_dim_name is given, we use it to name the new flattened dim.
Otherwise, the default is a string concatenating the mesh_dim_names of the given submesh, with each mesh_dim_name separated by "_".

For example, if we have a 3D mesh DeviceMesh([[[0, 1], [2, 3]], [[4, 5], [6, 7]]], mesh_dim_names=("dp", "cp", "tp")), calling mesh_3d["dp", "cp"]._flatten() will create a 1D submesh DeviceMesh([0, 1, 2, 3], mesh_dim_names=("dp_cp",)) on ranks 0, 1, 2, 3 and a 1D submesh DeviceMesh([4, 5, 6, 7], mesh_dim_names=("dp_cp",)) on ranks 4, 5, 6, 7.
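A short usage sketch of the new parameter (`_flatten` is a private API; the custom name below is illustrative):

```
from torch.distributed.device_mesh import init_device_mesh

mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))

# With an explicit name, the flattened dim is called "dp_cp_custom"
# instead of the default "dp_cp".
flat_mesh = mesh_3d["dp", "cp"]._flatten(mesh_dim_name="dp_cp_custom")
```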

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134048
Approved by: https://github.com/fegin
ghstack dependencies: #133838, #133839
2024-08-25 10:36:01 +00:00
wz337
5d39b14b68 [DeviceMesh] Add DeviceMesh slicing support for flatten mesh dim (#133839)
Add DeviceMesh slicing support such that we could do the following:
```
mesh_3d = init_device_mesh(
    self.device_type, (2, 2, 2), mesh_dim_names=("replicate", "shard", "cp")
)
shard_cp_mesh = mesh_3d["shard", "cp"]._flatten()
hsdp_mesh = mesh_3d["replicate", "shard_cp"]
# we can get the corresponding group of the flattened mesh through

group = shard_cp_mesh.get_group()
# or
group = mesh_3d["shard_cp"].get_group()
# or
mesh_3d.get_group(mesh_dim="shard_cp")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133839
Approved by: https://github.com/fegin
ghstack dependencies: #133838
2024-08-24 03:49:29 +00:00
wz337
49430bfd5c [DeviceMesh] Add a _MeshEnv attr to record the mapping of flatten mesh_dim_name to its mesh dim index in root mesh (#133838)
```
# supposed we have a 3d mesh
mesh_3d = init_device_mesh("cuda", (2,2,2), mesh_dim_names=("dp", "cp", "tp"))
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()

"""
then we would have
flatten_name_to_root_dims[mesh_3d]: {
    "dp_cp": (0, 1)
}
"""
```

We need this information to validate the ordering of mesh slices that include a flattened mesh dim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133838
Approved by: https://github.com/fegin
2024-08-20 19:43:45 +00:00
wz337
4bae7ae3d9 [DeviceMesh][Easy] Fix typo (#133790)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133790
Approved by: https://github.com/Skylion007
2024-08-19 05:20:22 +00:00
wz337
3fc9ee5a31 [DeviceMesh] Directly retrieve flattened mesh if already created (#133195)
Add a mapping to keep track of the root-to-flattened relationship and directly retrieve the flattened mesh if it was already created (no new pg creation).
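A minimal sketch of the caching idea (the mapping name and helper are assumptions, not the actual `_MeshEnv` internals):

```
root_to_flatten_mapping = {}  # root mesh -> {flattened_dim_name: flattened_mesh}


def get_or_create_flattened(root_mesh, name, create_fn):  # hypothetical
    cached = root_to_flatten_mapping.setdefault(root_mesh, {})
    if name not in cached:
        cached[name] = create_fn()  # pg creation happens only once
    return cached[name]
```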

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133195
Approved by: https://github.com/fegin, https://github.com/wanchaol
ghstack dependencies: #133193
2024-08-14 21:11:04 +00:00
wz337
ef580a0e5c [DeviceMesh] Restrict slicing to be a contiguous or non-contiguous subsequence of the root mesh_dim_names (#133193)
This PR adds a restriction on DeviceMesh slicing: no out-of-order subsequence slicing is allowed. To create a flattened mesh_dim_name, only in-order slicing is allowed.

```
mesh_3d = init_device_mesh(
    self.device_type, (2,2,2), mesh_dim_names=("dp", "cp", "tp"),
)

# valid 2d slicing
mesh_2d = mesh_3d["dp", "cp"]
mesh_2d = mesh_3d["dp", "tp"]
mesh_2d = mesh_3d["cp", "tp"]

# invalid 2d slicing
mesh_2d = mesh_3d["cp", "dp"]
mesh_2d = mesh_3d["tp", "cp"]
mesh_2d = mesh_3d["tp", "dp"]

# valid way to create dp_cp flatten slice
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()
# invalid way to create dp_cp flatten slice
dp_cp_mesh = mesh_3d["cp", "dp"]._flatten()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133193
Approved by: https://github.com/fegin, https://github.com/wanchaol
2024-08-14 07:18:41 +00:00
wz337
479d460471 [DeviceMesh] Add a private _flatten() API for device_mesh (#132632)
Adds a new private API to flatten a DeviceMesh to a 1D DeviceMesh such that:
```
mesh_3d = init_device_mesh(
    self.device_type, (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"),
)

dp_cp_mesh = mesh_3d["dp", "cp"]
# flattened_mesh on rank 0, 2, 4, 6 is DeviceMesh([0, 2, 4, 6], mesh_dim_names=('dp_cp',))
# flattened_mesh on rank 1, 3, 5, 7 is DeviceMesh([1, 3, 5, 7], mesh_dim_names=('dp_cp',))
flattened_dp_cp_mesh = dp_cp_mesh._flatten()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132632
Approved by: https://github.com/fegin, https://github.com/wanchaol
ghstack dependencies: #132310, #132311, #132339
2024-08-08 06:46:42 +00:00
wz337
8b50d5398f [DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)
More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366.

TLDR:
When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if the default pg is gloo, as many collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` results in a mysterious SIGTERM when all_gather-ing a cuda tensor using gloo. Without the change in this PR, users would have to know this context and explicitly move the cuda tensor to cpu before invoking most collectives, which is not ideal UX.

Therefore, given that most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda.
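A hedged sketch of the condition (the helper name is an assumption):

```
import torch
import torch.distributed as dist


def _needs_dedicated_pg(device_type: str) -> bool:  # hypothetical helper
    # Gloo cannot service most collectives on CUDA tensors, so a gloo
    # default pg cannot be reused for a cuda DeviceMesh.
    return (
        device_type == "cuda"
        and torch.cuda.is_available()
        and dist.get_backend() == "gloo"
    )
```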

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-08-07 16:13:11 +00:00
wz337
87053132ea [DeviceMesh] Remove parent mesh concept from _MeshEnv and replace by root mesh (#132339)
Previously, when we sliced out a submesh from a mesh, we assigned that mesh as the parent mesh of the submesh. With a 3D mesh topology, the parent mesh of a 1D mesh sliced out of the 3D mesh then differs from the parent mesh of the same 1D mesh sliced out of a 2D submesh of the 3D mesh. For example:
```
mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]

mesh_2d = mesh_3d["dim0", "dim1"]
mesh_dim0_2 = mesh_2d["dim0"]

# This would evaluate to True
print(_mesh_resources.get_parent_mesh(mesh_dim0) != _mesh_resources.get_parent_mesh(mesh_dim0_2))
```

We can always reconstruct the needed mesh from the mesh dim names, as long as the dims come from the same root. For simplicity, we do not see the need to build a tree structure to represent the child-parent relationship. Therefore, we are replacing the parent-mesh concept with a root-mesh concept in `_MeshEnv`, so we would have:

```
mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]

mesh_2d = mesh_3d["dim0", "dim1"]
mesh_dim0_2 = mesh_2d["dim0"]

# This would evaluate to True
print(_mesh_resources.get_root_mesh(mesh_dim0) == _mesh_resources.get_root_mesh(mesh_dim0_2))
```
With this change, we will have two types of meshes in an environment:
1. `device_mesh != _mesh_resources.get_root_mesh(device_mesh)`: the device_mesh was created by slicing.
2. `device_mesh == _mesh_resources.get_root_mesh(device_mesh)`: the device_mesh is a root mesh, not created through slicing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132339
Approved by: https://github.com/wanchaol
ghstack dependencies: #132310, #132311
2024-08-07 07:01:12 +00:00
PyTorch MergeBot
c7113a6186 Revert "[DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)"
This reverts commit 1a23ef2ece.

Reverted https://github.com/pytorch/pytorch/pull/132709 on behalf of https://github.com/clee2000 due to I think this broke distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_device_mesh_initialization [GH job link](https://github.com/pytorch/pytorch/actions/runs/10274519791/job/28432469987) [HUD commit link](1a23ef2ece).  Test not run due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/132709#issuecomment-2272350923))
2024-08-06 23:47:53 +00:00
wz337
1a23ef2ece [DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)
More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366.

TLDR:
When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if the default pg is gloo, as many collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` results in a mysterious SIGTERM when all_gather-ing a cuda tensor using gloo. Without the change in this PR, users would have to know this context and explicitly move the cuda tensor to cpu before invoking most collectives, which is not ideal UX.

Therefore, given that most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-08-06 22:00:09 +00:00
Brian Hirsh
e6eee04875 dynamo: use equality guards instead of id guards for Placement/DeviceMesh (#124401)
After talking to @anijain2305, we probably can't land this since it won't work with C++ guards, but we should still be able to do better than ID_MATCH.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124401
Approved by: https://github.com/anijain2305
2024-08-06 17:14:44 +00:00
wz337
4306eebab1 [DeviceMesh] Update slicing documentation to include nD and non-continuous slicing (#132311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132311
Approved by: https://github.com/wanchaol
ghstack dependencies: #132310
2024-08-05 23:44:23 +00:00
wz337
fb87796d4f [DeviceMesh] Add supports for non-continuous slicing (#132310)
Removes the constraint of continuous slicing to allow non-continuous slicing, and adds a unit test for 3D non-continuous slicing.
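For example (illustrative; skipping a middle dim like this was previously disallowed):

```
from torch.distributed.device_mesh import init_device_mesh

mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))

# Non-continuous slice: selects "dp" and "tp", skipping "cp".
dp_tp_mesh = mesh_3d["dp", "tp"]
```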

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132310
Approved by: https://github.com/wanchaol
2024-08-05 09:30:07 +00:00
wz337
4de85e3c30 [DeviceMesh] Remove _parent_mesh as an attribute from DeviceMesh and remove it from DeviceMesh's hash (#131636)
We recently revisited the hash implementation and think the `_parent_mesh` information should not be baked into DeviceMesh but rather be inferred from the MeshEnv that manages device meshes.

Since `mesh_dim_names` is considered in the device mesh's hash, this should not affect the issue brought up in https://github.com/pytorch/pytorch/issues/121799.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131636
Approved by: https://github.com/wanchaol
2024-07-25 22:47:22 +00:00
Iris Zhang (PyTorch)
ee6f0ab190 [DeviceMesh][Reland] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495) (#130685)
Summary:
As a follow-up to https://github.com/pytorch/pytorch/pull/130454: users are hitting the cross-mesh operation error because the DeviceMesh thread ID differs between the saved and the loaded DTensor.

This is a hot fix to only consider the real thread_id in the DeviceMesh hash under the threaded backend, and set it to None in all other cases.

As a follow-up, we need to look at the following test failures to better root-cause the DeviceMesh failures related to MTPG when thread_id is not included in the hash:
```
test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShardRegisteredParams::test_param_registration_after_forward
test/distributed/_tensor/test_dtensor_ops.py::TestDTensorOpsCPU::test_dtensor_op_db_column_stack_cpu_float32
```

Adding an additional is_initialized() check since APF has a test that mocks the backend without a pg initialized, so we need the is_initialized() check to avoid that test failure. In real use cases, a pg should be initialized before the get_backend() check. Not sure if we want to add this specifically for the test, but temporarily adding it to unblock APF conveyor runs.
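A hedged sketch of the described hashing rule (the helper and the "threaded" backend string are assumptions, not the real `DeviceMesh.__hash__`):

```
import threading
from typing import Optional

import torch.distributed as dist


def _hash_thread_id() -> Optional[int]:  # hypothetical helper
    # Only the threaded (multi-threaded test) backend keeps the real
    # thread id in the hash; every other backend uses None.
    if dist.is_initialized() and dist.get_backend() == "threaded":
        return threading.get_ident()
    return None
```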

Test Plan:
```
[irisz@devgpu051.cln3 /data/users/irisz/fbsource/fbcode (38e4a0a3b)]$ buck2 test 'fbcode//mode/opt' fbcode//apf/distributed/tests:pipeline_parallel_test_cpu -- --exact 'apf/distributed/tests:pipeline_parallel_test_cpu - apf.distributed.tests.pipeline_parallel_test_cpu.PipelineParallelContextTestCPU: test_stage_pg_creation_with_different_backends'
```

Reviewed By: gag1jain

Differential Revision: D59725924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130685
Approved by: https://github.com/gag1jain
2024-07-15 20:05:26 +00:00
Gagan Jain
97cfc65dbc Back out "[DeviceMesh] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495)" (#130676)
Summary:
Original commit changeset: 80c2ca639146

Original Phabricator Diff: D59612200

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//apf/distributed/tests:pipeline_parallel_test_cpu -- --exact 'apf/distributed/tests:pipeline_parallel_test_cpu - apf.distributed.tests.pipeline_parallel_test_cpu.PipelineParallelContextTestCPU: test_stage_pg_creation_with_different_backends'

Differential Revision: D59719562

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130676
Approved by: https://github.com/xunnanxu
2024-07-13 23:19:22 +00:00
wz337
3896ba3260 [DeviceMesh] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495)
As a follow-up to https://github.com/pytorch/pytorch/pull/130454: users are hitting the cross-mesh operation error because the DeviceMesh thread ID differs between the saved and the loaded DTensor.

This is a hot fix to only consider the real thread_id in the DeviceMesh hash under the threaded backend, and set it to None in all other cases.

As a follow-up, we need to look at the following test failures to better root-cause the DeviceMesh failures related to MTPG when thread_id is not included in the hash:
```
test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShardRegisteredParams::test_param_registration_after_forward
test/distributed/_tensor/test_dtensor_ops.py::TestDTensorOpsCPU::test_dtensor_op_db_column_stack_cpu_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130495
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-07-11 17:02:18 +00:00
Zaida Zhou
bc8883a7c4 fix the error msg in device_mesh (#129747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129747
Approved by: https://github.com/awgu, https://github.com/wconstab
2024-06-28 20:12:09 +00:00
Xuehai Pan
94dc3253a0 [BE][Easy] enable UFMT for torch/distributed/ (#128870)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870
Approved by: https://github.com/fegin, https://github.com/wconstab
2024-06-22 18:53:28 +00:00
PyTorch MergeBot
9c929f6ce9 Revert "[BE][Easy] enable UFMT for torch/distributed/ (#128870)"
This reverts commit a0e1e20c41.

Reverted https://github.com/pytorch/pytorch/pull/128870 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128870#issuecomment-2181780356))
2024-06-21 00:38:28 +00:00
Xuehai Pan
a0e1e20c41 [BE][Easy] enable UFMT for torch/distributed/ (#128870)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870
Approved by: https://github.com/fegin
ghstack dependencies: #128868, #128869
2024-06-18 21:49:08 +00:00
Aaron Orenstein
3a0d088517 Flip default value for mypy disallow_untyped_defs [5/11] (#127842)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127842
Approved by: https://github.com/oulgen
2024-06-08 18:49:18 +00:00
Iris Z
1d84c7e100 [DeviceMesh] Update get_group and add get_all_groups (#128097)
Fixes #121984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128097
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-06-08 04:28:56 +00:00
Yifu Wang
1d0c1087dd Allow overriding per-dim group options via _MeshEnv.set_dim_group_options (#126599)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126599
Approved by: https://github.com/wanchaol
ghstack dependencies: #126598
2024-06-06 17:18:12 +00:00