pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Andrew Gu	697ed6f5b3	[DeviceMesh] Supported N groups in `from_group` (#126258 ) Overview This PR supports constructing an ND mesh with `from_group()` by passing in `group: List[ProcessGroup]` and `mesh: Union[torch.Tensor, "ArrayLike"]` together. The `ndim` of the device mesh returned from `from_group()` is equal to the number of `ProcessGroup`s passed. If the `ndim` is greater than 1, then the `mesh` argument is required (since there is no simple way to recover the `mesh` tensor from the process groups otherwise). This PR also adds `mesh_dim_names` as an argument to forward to the device mesh for convenience. <details> <summary> Old Approach </summary> Overview - This PR mainly adds `mesh_shape` to `from_group()` so that the user can construct an ND (N > 1) device mesh from a process group. This is to unblock HSDP, where we can pass the overall data parallel process group to `from_group()` with `mesh_shape = (replicate_dim_size, shard_dim_size)` and `from_group()` will construct subgroups for the user. (The user can then get the subgroups from the submeshes.) - Constructing the 2D `DeviceMesh` from an existing shard process group and replicate process group is hard because we cannot easily recover the array of ranks in their parent group on each rank in general. - This PR also adds `mesh_dim_names` to `from_group()` so that the user can name the mesh dimensions of the constructed device mesh. </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126258 Approved by: https://github.com/wanchaol	2024-05-17 01:03:21 +00:00
wz337	059b68fbdf	[DeviceMesh] Fix hash and eq not match (#123572) Fixes #121799 We fix the DeviceMesh hash so that two meshes are considered equal if they have the same mesh and the same parent_mesh. Examples can be found here: https://github.com/pytorch/pytorch/issues/121799 This is also needed to unblock https://github.com/pytorch/pytorch/pull/123394 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123572 Approved by: https://github.com/xunnanxu, https://github.com/wanchaol, https://github.com/yoyoyocmu	2024-05-16 02:00:45 +00:00
Yuanhao Ji	e0d2c24de1	Fix device type issue in `_get_device_handle` (#124390 ) Fix #124327 `device_type`, the first arg of [init_device_mesh()](`a0466061e1/torch/distributed/device_mesh.py (L503)`), does not support types with indexes, such as `cuda:0`. If `cuda:0` is used as a parameter, `_get_device_handle()` will not correctly return `torch.cuda`. So the exception should be thrown before creating DeviceMesh object. > See https://github.com/pytorch/pytorch/issues/124327#issuecomment-2062551161, Pull Request resolved: https://github.com/pytorch/pytorch/pull/124390 Approved by: https://github.com/wz337, https://github.com/wanchaol	2024-04-30 06:59:56 +00:00
Iris Zhang (PyTorch)	4ad291d07f	[DeviceMesh] Removing mapping child_to_parent_mapping from `_MeshEnv` (#124890 ) Summary: The mapping is no longer needed after https://github.com/pytorch/pytorch/pull/124780, as we are not going to re-create the pgs during mesh slicing. Test Plan: CI Differential Revision: D56499001 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124890 Approved by: https://github.com/awgu	2024-04-26 06:40:36 +00:00
Iris Zhang (PyTorch)	43f4e71daa	Making _MeshEnv subclassing thread local (#124555 ) With _mesh_resources being global var, when thread pg based testing is used (aka spawn_threads_and_init_comms()), the last rank with the same key would overwrite the formers. This isn't an issue in regular process-based runtime as logically each key is unique. Example failure: https://github.com/pytorch/pytorch/actions/runs/8779134353/job/24087295785 ``` RuntimeError: Could not resolve the process group registered under the name 8 or Throwing assert not none error ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124555 Approved by: https://github.com/xunnanxu, https://github.com/wanchaol	2024-04-26 02:45:42 +00:00
Andrew Gu	36c983a973	[DeviceMesh] Added `DeviceMesh.from_group()` (#124787 ) This PR adds a `DeviceMesh.from_group()` static method to convert an existing process group to a device mesh. Motivation: We need `DeviceMesh.from_group()` to allow FSDP2 to interoperate with distributed libraries that do not use `DeviceMesh` for all parallelisms. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124787 Approved by: https://github.com/wanchaol ghstack dependencies: #124651, #124741, #124767, #124768, #124780	2024-04-24 23:16:06 +00:00
Andrew Gu	48312a7fc3	[DeviceMesh] Removed unneeded `.to(cpu)` (#124768 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124768 Approved by: https://github.com/wz337 ghstack dependencies: #124651, #124741, #124767	2024-04-24 18:07:20 +00:00
Andrew Gu	1db7d64af2	[DeviceMesh] Initialized mesh tensor with CPU context (#124767 ) This PR makes sure to construct the `DeviceMesh`'s `mesh` tensor on CPU device in `init_device_mesh()`. This means that we can call `init_device_mesh()` under meta-device context and still construct the correct `mesh` tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124767 Approved by: https://github.com/wz337 ghstack dependencies: #124651, #124741	2024-04-24 18:04:06 +00:00
Wanchao Liang	0da94f3a08	[device_mesh] add a private init backend option (#124780) This PR adds a private init backend option to tackle an issue with submesh creation: in device mesh slicing we don't want to create process groups again, so it is useful to turn group creation off explicitly. Also, I think there might be more submesh creation functionality, so having this flag would ensure that no new group is created. Differential Revision: [D56497780](https://our.internmc.facebook.com/intern/diff/D56497780) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124780 Approved by: https://github.com/awgu	2024-04-24 04:31:58 +00:00
wz337	37fd547518	[DeviceMesh] Make dtype of mesh tensor from `init_device_mesh()` consistent with directly calling `DeviceMesh()` (#123677 ) Currently, mesh tensor from `init_device_mesh()` has a dtype of `torch.int64` while mesh tensor from `DeviceMesh()` would have dtype of `torch.int32`. Making them consistent in this PR. DeviceMesh ctor dtype pointer: https://github.com/pytorch/pytorch/blob/main/torch/distributed/device_mesh.py#L217 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123677 Approved by: https://github.com/xunnanxu, https://github.com/wanchaol	2024-04-10 09:14:34 +00:00
wz337	2b1ba0ceae	[DeviceMesh] Cache and reuse sliced result (#122975 ) Fixes #118849 Add a map for parent_to_child_mappings in _mesh_resources so we can cache and reuse submesh slicing result so that we can avoid recreating submesh and the underlying sub pg repeatedly, which could lead to funky behaviors. We will follow up with reusing pg from the parent_mesh during submesh creation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122975 Approved by: https://github.com/wanchaol	2024-03-30 23:56:55 +00:00
Iris Zhang (PyTorch)	e99fa0042c	Back out "[DeviceMesh] Add support for nD slicing (#119752 )" (#121763 ) Summary: Original commit changeset: e52b8809c8d8 Original Phabricator Diff: D54778906 We have to backout this diff. D54778906 seems to be causing test failures for APF blocking trunk health and hence release. Just starting to look at the issue. T182209248 Test Plan: Sandcastle Reviewed By: satgera Differential Revision: D54825114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121763 Approved by: https://github.com/osalpekar	2024-03-13 07:22:08 +00:00
wz337	60cd2a43ca	[DeviceMesh] Add support for nD slicing (#119752 ) Fixes one of the issue mentioned in #118639 @mvpatel2000 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119752 Approved by: https://github.com/wanchaol	2024-03-10 00:16:37 +00:00
wz337	5603d95375	[DeviceMesh] Ensure mesh tensor is a cpu tensor (#120046 ) More discussion in the last comment in https://github.com/pytorch/pytorch/pull/118614 In general, users won't pass a cuda tensor to DeviceMesh, as the mesh tensor is just a way to construct a mesh that doesn't require cuda compute. Taking suggestion from @awgu to enforce the tensor to be cpu tensor if it is not already so that we can prevent a device sync. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120046 Approved by: https://github.com/wanchaol, https://github.com/wconstab	2024-02-22 22:03:13 +00:00
Iris Zhang (PyTorch)	0245000be8	[DeviceMesh] Temporarily disable re-use subgroup (#118940) Summary: The reuse subgroup logic is causing GLOO to time out on two internal modelstore tests (relevant tests in test plan). We are temporarily disabling subgroup re-use while root-causing so that the internal tests can run again, as they are currently omitted, as shown in T176426987. Test Plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/118940 Approved by: https://github.com/wanchaol	2024-02-05 06:30:00 +00:00
wz337	c908caf92b	[DeviceMesh] Allow 1d slice from 1d mesh (#118895) Fixes https://github.com/pytorch/pytorch/issues/118851 For example, given `mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("dp",))`, slicing with `dp_mesh = mesh["dp"]` should still work; it is a dummy return that does not record a parent mesh. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118895 Approved by: https://github.com/wanchaol	2024-02-02 22:00:24 +00:00
Yifu Wang	697ca4f292	Preliminary DeviceMesh + native c10d functional integration (#118423) ### Summary - Added `group_name` as the third field in `dim_group_infos`. - `DeviceMeshTest` now runs both with and without `_USE_NATIVE_C10D_FUNCTIONAL=1` in CI. ### Other fixes - Convert `reduceOp` to lower case before passing it into c10d_functional ops. - Added a finalizer to handle unwaited collectives (this mirrors the treatment for Python functional collective ops). Pull Request resolved: https://github.com/pytorch/pytorch/pull/118423 Approved by: https://github.com/wanchaol, https://github.com/LucasLLC, https://github.com/wconstab	2024-01-31 04:36:12 +00:00
Catherine Lee	4f5785b6b3	Enable possibly-undefined error code (#118533) Fixes https://github.com/pytorch/pytorch/issues/118129 Suppressions automatically added with
```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com> Co-authored-by: Catherine Lee <csl@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2024-01-30 21:07:01 +00:00
PyTorch MergeBot	40ece2e579	Revert "Enable possibly-undefined error code (#118533 )" This reverts commit `4f13f69a45`. Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))	2024-01-30 19:00:34 +00:00
Edward Z. Yang	4f13f69a45	Enable possibly-undefined error code (#118533) Fixes https://github.com/pytorch/pytorch/issues/118129 Suppressions automatically added with
```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2024-01-30 05:08:10 +00:00
Andrew Gu	68b18dc2a2	[DeviceMesh] Removed print of `self._dim_group_infos` (#118527 ) This print seems to have accidentally been merged in. It is a bit verbose during unit tests, so this PR removes it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118527 Approved by: https://github.com/wz337	2024-01-29 19:14:25 +00:00
wz337	e1f9eca113	[DeviceMesh] Reuse sub_group pg if exists (#115716) Currently, we create new_group for sub_group pg during mesh initialization. The PR changes this so we will: 1) re-use the sub_group pg if it exists, 2) create a new sub_group pg if it does not exist. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115716 Approved by: https://github.com/wanchaol	2024-01-25 18:07:16 +00:00
Carlos Mocholí	a31effa15f	Update device_mesh.py docs imports (#116074 ) These are not importable from `torch.distributed`, at least today. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116074 Approved by: https://github.com/wz337, https://github.com/fegin	2023-12-19 09:44:55 +00:00
wz337	b48abbc020	[DeviceMesh] Fix DeviceMesh docstring (#116053) 1. remove outdated comments 2. fix examples in docstring Doc after fix: https://github.com/pytorch/pytorch/assets/31293777/19f4f03c-0fd7-4e88-bca1-1a6ce693fbb7 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116053 Approved by: https://github.com/wanchaol	2023-12-19 04:05:49 +00:00
Iris Z	1eca63c6ac	[DeviceMesh] Move helper function 'get_mesh_dim_by_name' to MeshEnv class (#115572 ) Move helper function `get_mesh_dim_by_name ` outside of the DeviceMesh class to keep the public class cleaner. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115572 Approved by: https://github.com/XilunWu, https://github.com/wanchaol	2023-12-12 06:29:46 +00:00
wz337	c70f995b5c	[DeviceMesh] Add mesh_dim_names to DeviceMesh __repr__ if it exists (#115579 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115579 Approved by: https://github.com/wanchaol	2023-12-12 02:18:34 +00:00
Chip Turner	937d616e82	Re-enable type checking for distributed_c10d.py (#115223 ) Re-enable type checking for distributed_c10d.py Type checking for distributed_c10d.py was inadvertently turned off in issues that have accumulated since. Note: the backwards compatibility linter does not like some of these changes. But they were incorrect before. This needs human verification, however. #suppress-api-compatibility-check Pull Request resolved: https://github.com/pytorch/pytorch/pull/115223 Approved by: https://github.com/wconstab	2023-12-09 11:07:54 +00:00
Iris Zhang (PyTorch)	23fa9621e4	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099) (#115193) Summary: Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation. We created stubs for the public class and methods in torch.distributed.device_mesh so that torch.distributed.device_mesh can be imported whether or not distributed is available. Original diff reverted: D51629761 Original PR reverted: https://github.com/pytorch/pytorch/pull/115099 Prior to landing, all CI signals had passed. Shipit added the "ci/trunk" label to the PR and DID NOT wait for it and went ahead committing. More context can be found in the reverted PR above. Test Plan: CI. Differential Revision: D51861018 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193 Approved by: https://github.com/fegin	2023-12-08 08:44:32 +00:00
Nikita Shulga	a827ac71f2	Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099 )" This reverts commit `eaa64339d6`.	2023-12-05 08:59:36 -08:00
Iris Zhang (PyTorch)	eaa64339d6	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099) Summary: Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation. Original diff reverted: D51629761 Original PR reverted: https://github.com/pytorch/pytorch/pull/114991 It was failing a public module binding test on macOS because of the change in import order in torch/distributed/fsdp/_common_utils.py. Since the original import still works, we removed the changes to this file. Test Plan: CI. Differential Revision: D51825114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115099 Approved by: https://github.com/wanchaol, https://github.com/fegin	2023-12-05 05:44:52 +00:00
PyTorch MergeBot	3a2e2044cd	Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710 ) (#114991 )" This reverts commit `729ac7317a`. Reverted https://github.com/pytorch/pytorch/pull/114991 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114991#issuecomment-1837214567))	2023-12-02 17:55:51 +00:00
Iris Zhang (PyTorch)	729ac7317a	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710 ) (#114991 ) Summary: Same content of changes as https://github.com/pytorch/pytorch/pull/114710 Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation. ghstack-source-id: 208980207 exported-using-ghexport Test Plan: CI. Reviewed By: wanchaol Differential Revision: D51629761 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114991 Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/fegin	2023-12-02 04:39:41 +00:00

32 Commits
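
Commit 059b68fbdf above describes making the DeviceMesh hash agree with equality, so that two meshes compare equal when they have the same mesh and the same parent_mesh. The following is only a minimal pure-Python sketch of that general pattern, not the PyTorch implementation: `ToyMesh`, its `ranks` and `parent` fields, and the `_key()` helper are hypothetical names introduced for illustration.

```python
from typing import Optional, Tuple


class ToyMesh:
    """Hypothetical stand-in for a device mesh; this is NOT torch's DeviceMesh."""

    def __init__(self, ranks: Tuple[int, ...], parent: Optional["ToyMesh"] = None) -> None:
        self.ranks = tuple(ranks)
        self.parent = parent

    def _key(self):
        # One key feeds both __eq__ and __hash__, so the two cannot disagree.
        parent_key = self.parent._key() if self.parent is not None else None
        return (self.ranks, parent_key)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ToyMesh):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self) -> int:
        return hash(self._key())


parent = ToyMesh((0, 1, 2, 3))
a = ToyMesh((0, 1), parent)   # submesh with a parent
b = ToyMesh((0, 1), parent)   # same ranks, same parent
c = ToyMesh((0, 1))           # same ranks, no parent

# Equal objects hash equally, so they hit the same dictionary entry.
cache = {a: "submesh"}
```

Deriving both methods from a single key is one simple way to keep dictionary and set lookups consistent with `==`; here `cache[b]` finds the entry stored under `a`, while `c` is treated as a different mesh.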