Commit Graph

21 Commits

Author SHA1 Message Date
Iris Zhang (PyTorch)
e99fa0042c Back out "[DeviceMesh] Add support for nD slicing (#119752)" (#121763)
Summary:
Original commit changeset: e52b8809c8d8

Original Phabricator Diff: D54778906

We have to backout this diff.
D54778906 seems to be causing test failures for APF blocking trunk health and hence release. Just starting to look at the issue. T182209248

Test Plan: Sandcastle

Reviewed By: satgera

Differential Revision: D54825114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121763
Approved by: https://github.com/osalpekar
2024-03-13 07:22:08 +00:00
wz337
60cd2a43ca [DeviceMesh] Add support for nD slicing (#119752)
Fixes one of the issue mentioned in #118639
@mvpatel2000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119752
Approved by: https://github.com/wanchaol
2024-03-10 00:16:37 +00:00
wz337
5603d95375 [DeviceMesh] Ensure mesh tensor is a cpu tensor (#120046)
More discussion in the last comment in https://github.com/pytorch/pytorch/pull/118614

In general, users won't pass a cuda tensor to DeviceMesh, as the mesh tensor is just a way to construct a mesh that doesn't require cuda compute. Taking suggestion from @awgu to enforce the tensor to be cpu tensor if it is not already so that we can prevent a device sync.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120046
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2024-02-22 22:03:13 +00:00
Iris Zhang (PyTorch)
0245000be8 [DeviceMesh] Temporarily disable re-use subgroup (#118940)
Summary:
The reuse subgroup logic is causing GLOO to timeout on two internal modelstore tests (relevant tests in test plan).
We temporarily disabling re-use subgroup during root-causing to allow the internal tests to be able to run again, as they are now omitted shown in T176426987.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118940
Approved by: https://github.com/wanchaol
2024-02-05 06:30:00 +00:00
wz337
c908caf92b [DeviceMesh] Alllow 1d slice from 1d mesh (#118895)
Fixes [ISSUE_NUMBER](https://github.com/pytorch/pytorch/issues/118851)

i.e.
mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("dp"))
then we do dp_mesh = mesh["dp"] should still work, just dummy return without recording parent mesh

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118895
Approved by: https://github.com/wanchaol
2024-02-02 22:00:24 +00:00
Yifu Wang
697ca4f292 Preliminary DeviceMesh + native c10d functional integration (#118423)
### Summary
- Added `group_name` as the third field in `dim_group_infos`.
- `DeviceMeshTest` now runs both w/ and w/0 `_USE_NATIVE_C10D_FUNCTIONAL=1` in CI.

### Other fixes
- Convert `reduceOp` to lower case before passing it into c10d_functional ops.
- Added a finalizer to handle unwaited collectives (this mirrors the treatment for Python functional collective ops).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118423
Approved by: https://github.com/wanchaol, https://github.com/LucasLLC, https://github.com/wconstab
2024-01-31 04:36:12 +00:00
Catherine Lee
4f5785b6b3 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 21:07:01 +00:00
PyTorch MergeBot
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
Edward Z. Yang
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
Andrew Gu
68b18dc2a2 [DeviceMesh] Removed print of self._dim_group_infos (#118527)
This print seems to have accidentally been merged in. It is a bit verbose during unit tests, so this PR removes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118527
Approved by: https://github.com/wz337
2024-01-29 19:14:25 +00:00
wz337
e1f9eca113 [DeviceMesh] Reuse sub_group pg if exists (#115716)
Currently, we create new_group for sub_group pg during mesh initialization. The PR changes this so we will:
1) re-use sub_group pg if it exsits,
2) create new sub_group pg if it does not exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115716
Approved by: https://github.com/wanchaol
2024-01-25 18:07:16 +00:00
Carlos Mocholí
a31effa15f Update device_mesh.py docs imports (#116074)
These are not importable from `torch.distributed`, at least today.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116074
Approved by: https://github.com/wz337, https://github.com/fegin
2023-12-19 09:44:55 +00:00
wz337
b48abbc020 [DeviceMesh] Fix DeviceMesh docstring (#116053)
1. remove outdated comments
2. fix examples in docstring

Doc after fix:
<img width="706" alt="image" src="https://github.com/pytorch/pytorch/assets/31293777/19f4f03c-0fd7-4e88-bca1-1a6ce693fbb7">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116053
Approved by: https://github.com/wanchaol
2023-12-19 04:05:49 +00:00
Iris Z
1eca63c6ac [DeviceMesh] Move helper function 'get_mesh_dim_by_name' to MeshEnv class (#115572)
Move helper function `get_mesh_dim_by_name ` outside of the DeviceMesh class to keep the public class cleaner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115572
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
2023-12-12 06:29:46 +00:00
wz337
c70f995b5c [DeviceMesh] Add mesh_dim_names to DeviceMesh __repr__ if it exists (#115579)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115579
Approved by: https://github.com/wanchaol
2023-12-12 02:18:34 +00:00
Chip Turner
937d616e82 Re-enable type checking for distributed_c10d.py (#115223)
Re-enable type checking for distributed_c10d.py

Type checking for distributed_c10d.py was inadvertently turned off in issues that have accumulated since.

Note: the backwards compatibility linter does not like some of these changes.  But they were incorrect before.  This needs human verification, however.

#suppress-api-compatibility-check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115223
Approved by: https://github.com/wconstab
2023-12-09 11:07:54 +00:00
Iris Zhang (PyTorch)
23fa9621e4 [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099) (#115193)
Summary:

Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation.
We created stubs for public class and methods in torch.distributed.device_mesh so that torch.distributed.device_mesh can be imported with or without distributed is available().

Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/115099
Prior to landing, CI signals are all passed. Shipit added the "ci/trunk" label to the PR and DID NOT wait for it and went ahead committing. More context can be found in the reverted PR above.

Test Plan: CI.

Differential Revision: D51861018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193
Approved by: https://github.com/fegin
2023-12-08 08:44:32 +00:00
Nikita Shulga
a827ac71f2 Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099)"
This reverts commit eaa64339d6.
2023-12-05 08:59:36 -08:00
Iris Zhang (PyTorch)
eaa64339d6 [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099)
Summary:
Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation.

Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/114991
It was failing because failing a public module binding tests in MacOS, and this is due to the change in import order for torch/distributed/fsdp/_common_utils.py. Since this original import would still work, we remove the changes in this file.

Test Plan: CI.

Differential Revision: D51825114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115099
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-12-05 05:44:52 +00:00
PyTorch MergeBot
3a2e2044cd Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710) (#114991)"
This reverts commit 729ac7317a.

Reverted https://github.com/pytorch/pytorch/pull/114991 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114991#issuecomment-1837214567))
2023-12-02 17:55:51 +00:00
Iris Zhang (PyTorch)
729ac7317a [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710) (#114991)
Summary:

Same content of changes as https://github.com/pytorch/pytorch/pull/114710

Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation.
ghstack-source-id: 208980207
exported-using-ghexport

Test Plan: CI.

Reviewed By: wanchaol

Differential Revision: D51629761

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114991
Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/fegin
2023-12-02 04:39:41 +00:00