pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
fduwjj	a0c7029a75	[c10d][Reland] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931 ) (#135653 ) We introduced the dispatchable backend for a ProcessGroup and collective in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup that removes the option of a ProcessGroup and asks users to either set the timeout or backend later on, or directly create the backend after creating a PG. Also, PGNCCL is using the option class from ProcessGroup, but we actually should use the Option from the backend class. So this PR aligns the type and name with what we are doing on the cpp side. I don't change the signature of the public API, so it still uses args named "pg_options". We need to change the tests to align them with this change. This is an attempt to reland D62008954 by fixing internal errors. Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135653 Approved by: https://github.com/wz337, https://github.com/H-Huang	2024-09-16 19:56:42 +00:00
Wanchao Liang	cfc227ad43	[reland][dtensor] move DTensor to public namespace (#134203 ) Reland of https://github.com/pytorch/pytorch/pull/133113 I had to create a new PR because the previously reverted PR could be neither rebased nor imported successfully :( ---- Moving DTensor into the public namespace, to formally add the documentation page that includes all the public APIs. This includes: * many path renames and path import fixes * a dedicated doc page without too much content yet (more will be added in the next PRs) * to preserve BC for users still using `torch.distributed._tensor`, I added a shim script to redirect old path calls to the new module. BC preservation is evidenced by the fact that all DTensor tests still work without changing the public imports, so it's safe to land the changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203 Approved by: https://github.com/tianyu-l	2024-09-08 17:08:40 +00:00
PyTorch MergeBot	351ba3e67c	Revert "[c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931 )" This reverts commit `65864d0134`. Reverted https://github.com/pytorch/pytorch/pull/132931 on behalf of https://github.com/ZainRizvi due to This PR is breaking builds internally due to the removal of ProcessGroup::Options ([comment](https://github.com/pytorch/pytorch/pull/132931#issuecomment-2321862402))	2024-08-30 16:27:40 +00:00
wz337	50efbb9f1e	[DeviceMesh][Test] Add a unit test for get_local_rank for flattened mesh (#134603 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134603 Approved by: https://github.com/fduwjj ghstack dependencies: #133838, #133839, #134048	2024-08-30 08:13:37 +00:00
fduwjj	65864d0134	[c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931 ) We introduced the dispatchable backend for a ProcessGroup and collective in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG. Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options" We need to make changes to the test to make it aligned with the change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132931 Approved by: https://github.com/H-Huang	2024-08-29 22:40:12 +00:00
wz337	761cf91e3c	[DeviceMesh] Add get_all_submeshes in _MeshEnv (#134275 ) Adding a private helper method for Shampoo HSDP use cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134275 Approved by: https://github.com/XilunWu	2024-08-27 14:51:19 +00:00
wz337	268092db83	[DeviceMesh] Allow _flatten() to take in an optional mesh_dim_name (#134048 ) If a mesh_dim_name is given, we will use the given mesh_dim_name to name the new flattened dim. Otherwise, the default is a string concatenating the mesh_dim_names of the given submesh, with each mesh_dim_name separated by "_". For example, if we have a 3D mesh DeviceMesh([[[0, 1], [2, 3]], [[4, 5], [6, 7]]], mesh_dim_names=("dp", "cp", "tp")), calling mesh_3d["dp", "cp"]._flatten() will create a 1D submesh DeviceMesh([0, 2, 4, 6], mesh_dim_names=("dp_cp",)) on rank 0, 2, 4, 6 and a 1D submesh DeviceMesh([1, 3, 5, 7], mesh_dim_names=("dp_cp",)) on rank 1, 3, 5, 7. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134048 Approved by: https://github.com/fegin ghstack dependencies: #133838, #133839	2024-08-25 10:36:01 +00:00
wz337	5d39b14b68	[DeviceMesh] Add DeviceMesh slicing support for flatten mesh dim (#133839 ) Add DeviceMesh slicing support such that we could do the following: ``` mesh_3d = init_device_mesh( self.device_type, (2, 2, 2), mesh_dim_names=("replicate", "shard", "cp") ) shard_cp_mesh = mesh_3d["shard", "cp"]._flatten() hsdp_mesh = mesh_3d["replicate", "shard_cp"] # we can get the corresponding group of the flatten mesh through group = shard_cp_mesh.get_group() # or group = mesh_3d["shard_cp"].get_group() # or mesh_3d.get_group(mesh_dim="shard_cp") ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133839 Approved by: https://github.com/fegin ghstack dependencies: #133838	2024-08-24 03:49:29 +00:00
wz337	49430bfd5c	[DeviceMesh] Add a _MeshEnv attr to record the mapping of flatten mesh_dim_name to its mesh dim index in root mesh (#133838 ) ``` # suppose we have a 3d mesh mesh_3d = init_device_mesh("cuda", (2,2,2), mesh_dim_names=("dp", "cp", "tp")) dp_cp_mesh = mesh_3d["dp", "cp"]._flatten() """ then we would have flatten_name_to_root_dims[mesh_3d]: { "dp_cp": (0, 1) } """ ``` We need this information to validate the order of a mesh slice that includes a flattened mesh dim. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133838 Approved by: https://github.com/fegin	2024-08-20 19:43:45 +00:00
PyTorch MergeBot	35f36363ec	Revert "[dtensor] move DTensor to public namespace (#133113 )" This reverts commit `2ee6b97464`. Reverted https://github.com/pytorch/pytorch/pull/133113 on behalf of https://github.com/wanchaol due to looks like it break some internal type imports ([comment](https://github.com/pytorch/pytorch/pull/133113#issuecomment-2295670911))	2024-08-19 05:00:19 +00:00
Wanchao Liang	2ee6b97464	[dtensor] move DTensor to public namespace (#133113 ) Moving DTensor into the public namespace, to formally add the documentation page that includes all the public APIs. This includes: * many path renames and path import fixes * a dedicated doc page without too much content yet (more will be added in the next PRs) * to preserve BC for users still using `torch.distributed._tensor`, I added a shim script to redirect old path calls to the new module. BC preservation is evidenced by the fact that all DTensor tests still work without changing the public imports, so it's safe to land the changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113 Approved by: https://github.com/XilunWu ghstack dependencies: #133305, #133306	2024-08-17 05:09:52 +00:00
wz337	3fc9ee5a31	[DeviceMesh] Directly retrieve flattened mesh if already created (#133195 ) Add mapping to keep track of root_to_flatten relationship and directly retrieve the flattened mesh if already created (no pg creation). Pull Request resolved: https://github.com/pytorch/pytorch/pull/133195 Approved by: https://github.com/fegin, https://github.com/wanchaol ghstack dependencies: #133193	2024-08-14 21:11:04 +00:00
wz337	ef580a0e5c	[DeviceMesh] Restrict slicing to be a contiguous or non-contiguous subsequence of the root mesh_dim_names (#133193 ) This PR adds restriction for DeviceMesh slicing. No out-of-order subsequence slicing is allowed. To create a flatten mesh_dim_names, only the in-order slicing is allowed. ``` mesh_3d = init_device_mesh( self.device_type, (2,2,2), mesh_dim_names=("dp", "cp", "tp"), ) # valid 2d slicing mesh_2d = mesh_3d["dp", "cp"] mesh_2d = mesh_3d["dp", "tp"] mesh_2d = mesh_3d["cp", "tp"] # invalid 2d slicing mesh_2d = mesh_3d["cp", "dp"] mesh_2d = mesh_3d["tp", "cp"] mesh_2d = mesh_3d["tp", "dp"] # valid way to create dp_cp flatten slice dp_cp_mesh = mesh_3d["dp", "cp"]._flatten() # invalid way to create dp_cp flatten slice dp_cp_mesh = mesh_3d["cp", "dp"]._flatten() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133193 Approved by: https://github.com/fegin, https://github.com/wanchaol	2024-08-14 07:18:41 +00:00
wz337	0ff0bf3d31	[Replicate] Fix replicate with DeviceMesh initialization (#133024 ) A follow up on https://github.com/pytorch/pytorch/pull/132339. `get_parent_mesh` is replaced by `get_root_mesh`. In addition, modify a few places that parent mesh is mentioned in test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133024 Approved by: https://github.com/Skylion007, https://github.com/fegin	2024-08-09 00:45:47 +00:00
wz337	479d460471	[DeviceMesh] Add a private _flatten() API for device_mesh (#132632 ) Adds a new private API to flatten a DeviceMesh to a 1D DeviceMesh such that: ``` mesh_3d = init_device_mesh( self.device_type, (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"), ) dp_cp_mesh = mesh_3d["dp", "cp"] # flattened_mesh on rank 0, 2, 4, 6 is DeviceMesh([0, 2, 4, 6], mesh_dim_names=('dp_cp',)) # flattened_mesh on rank 1, 3, 5, 7 is DeviceMesh([1, 3, 5, 7], mesh_dim_names=('dp_cp',)) flattened_dp_cp_mesh = dp_cp_mesh._flatten() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132632 Approved by: https://github.com/fegin, https://github.com/wanchaol ghstack dependencies: #132310, #132311, #132339	2024-08-08 06:46:42 +00:00
wz337	8b50d5398f	[DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709 ) More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366. TLDR: When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if default pg is gloo, as lots of collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gather a cuda tensor using gloo. Without the change in this PR, users would have to know the context and explicitly move the cuda tensor to cpu before invoking most collectives, which I think is not so ideal UX. Therefore, given most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709 Approved by: https://github.com/awgu, https://github.com/wanchaol	2024-08-07 16:13:11 +00:00
wz337	87053132ea	[DeviceMesh] Remove parent mesh concept from _MeshEnv and replace by root mesh (#132339 ) Previously, when we sliced out a submesh from a mesh, we assigned the mesh as the parent mesh of the submesh. In this case, when we have a 3D mesh topology, the parent mesh of a 1D mesh sliced out from the 3D mesh is different from the parent mesh of the same 1D mesh sliced out from the 2D submesh of the 3D mesh. For example: ``` mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2")) mesh_dim0 = mesh_3d["dim0"] mesh_2d = mesh_3d["dim0", "dim1"] mesh_dim0_2 = mesh_2d["dim0"] # This would evaluate to be True print(_mesh_resources.get_parent_mesh(mesh_dim0) != _mesh_resources.get_parent_mesh(mesh_dim0_2)) ``` We can always reconstruct the mesh needed from the mesh dim names, as long as two dims come from the same root. For simplicity, we do not see the necessity of building a tree structure to represent the child-parent relationship. Therefore, we are replacing the parent mesh concept with a root mesh concept in `_MeshEnv` so we would have: ``` mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2")) mesh_dim0 = mesh_3d["dim0"] mesh_2d = mesh_3d["dim0", "dim1"] mesh_dim0_2 = mesh_2d["dim0"] # This would evaluate to be True print(_mesh_resources.get_root_mesh(mesh_dim0) == _mesh_resources.get_root_mesh(mesh_dim0_2)) ``` With this change, we will have two types of meshes in an environment. 1. `device_mesh != _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is created by slicing. 2. `device_mesh == _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is a root mesh not created through slicing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132339 Approved by: https://github.com/wanchaol ghstack dependencies: #132310, #132311	2024-08-07 07:01:12 +00:00
PyTorch MergeBot	c7113a6186	Revert "[DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709 )" This reverts commit `1a23ef2ece`. Reverted https://github.com/pytorch/pytorch/pull/132709 on behalf of https://github.com/clee2000 due to I think this broke distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_device_mesh_initialization [GH job link](https://github.com/pytorch/pytorch/actions/runs/10274519791/job/28432469987) [HUD commit link](`1a23ef2ece`). Test not run due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/132709#issuecomment-2272350923))	2024-08-06 23:47:53 +00:00
wz337	073cee531c	[Test][Easy] Remove print in test_device_mesh.py (#132780 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132780 Approved by: https://github.com/XilunWu	2024-08-06 22:04:39 +00:00
wz337	1a23ef2ece	[DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709 ) More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366. TLDR: When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if default pg is gloo, as lots of collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gather a cuda tensor using gloo. Without the change in this PR, users would have to know the context and explicitly move the cuda tensor to cpu before invoking most collectives, which I think is not so ideal UX. Therefore, given most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709 Approved by: https://github.com/awgu, https://github.com/wanchaol	2024-08-06 22:00:09 +00:00
wz337	fb87796d4f	[DeviceMesh] Add supports for non-continuous slicing (#132310 ) Removes constraint of continuous slicing to allow non-continuous slicing and adds a unit test for 3D non-continuous slicing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132310 Approved by: https://github.com/wanchaol	2024-08-05 09:30:07 +00:00
wz337	4de85e3c30	[DeviceMesh] Remove _parent_mesh as an attribute from DeviceMesh and remove it from DeviceMesh's hash (#131636 ) We recently revisited the hash implementation and think `_parent_mesh` information should not be burned into DeviceMesh but rather be inferred from the MeshEnv which manages device meshes. As `mesh_dim_names` is considered in device mesh's hash. This should not affect the issue brought up in https://github.com/pytorch/pytorch/issues/121799 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131636 Approved by: https://github.com/wanchaol	2024-07-25 22:47:22 +00:00
Xuehai Pan	db3290846e	[BE][Easy][10/19] enforce style for empty lines in import segments in `test/d*/` (#129761 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129761 Approved by: https://github.com/fegin	2024-07-17 16:57:39 +00:00
Iris Z	1d84c7e100	[DeviceMesh] Update get_group and add get_all_groups (#128097 ) Fixes #121984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128097 Approved by: https://github.com/wconstab, https://github.com/wanchaol	2024-06-08 04:28:56 +00:00
Yifu Wang	1d0c1087dd	Allow overriding per-dim group options via _MeshEnv.set_dim_group_options (#126599 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126599 Approved by: https://github.com/wanchaol ghstack dependencies: #126598	2024-06-06 17:18:12 +00:00
Aaron Gokaslan	12c4a2c297	[BE]: Apply PLR1736 fixes (unnecessary index lookup) (#127716 ) Applies the PLR1736 preview rule with some more autofixes to cut down on unnecessary accesses. Added a noqa since that test actually testing the dunder method. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127716 Approved by: https://github.com/ezyang	2024-06-03 17:22:13 +00:00
Iris Z	1699edaabb	[DeviceMesh] Adding nD slicing support back (#127465 ) Fixes #126530 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127465 Approved by: https://github.com/wconstab, https://github.com/wanchaol	2024-05-31 17:06:36 +00:00
PyTorch MergeBot	f6e303fa47	Revert "[DeviceMesh] Adding nD slicing support back (#127465 )" This reverts commit `e72232f8f0`. Reverted https://github.com/pytorch/pytorch/pull/127465 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint `e72232f8f0`, the error does not look like a trivial fix, so I am reverting the change for a forward fix ([comment](https://github.com/pytorch/pytorch/pull/127465#issuecomment-2141051630))	2024-05-31 00:43:13 +00:00
wz337	e72232f8f0	[DeviceMesh] Adding nD slicing support back (#127465 ) Fixes #126530 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127465 Approved by: https://github.com/wconstab	2024-05-30 23:55:21 +00:00
Andrew Gu	697ed6f5b3	[DeviceMesh] Supported N groups in `from_group` (#126258 ) Overview This PR supports constructing an ND mesh with `from_group()` by passing in `group: List[ProcessGroup]` and `mesh: Union[torch.Tensor, "ArrayLike"]` together. The `ndim` of the device mesh returned from `from_group()` is equal to the number of `ProcessGroup`s passed. If the `ndim` is greater than 1, then the `mesh` argument is required (since there is no simple way to recover the `mesh` tensor from the process groups otherwise). This PR also adds `mesh_dim_names` as an argument to forward to the device mesh for convenience. <details> <summary> Old Approach </summary> Overview - This PR mainly adds `mesh_shape` to `from_group()` so that the user can construct an ND (N > 1) device mesh from a process group. This is to unblock HSDP, where we can pass the overall data parallel process group to `from_group()` with `mesh_shape = (replicate_dim_size, shard_dim_size)` and `from_group()` will construct subgroups for the user. (The user can then get the subgroups from the submeshes.) - Constructing the 2D `DeviceMesh` from an existing shard process group and replicate process group is hard because we cannot easily recover the array of ranks in their parent group on each rank in general. - This PR also adds `mesh_dim_names` to `from_group()` so that the user can name the mesh dimensions of the constructed device mesh. </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126258 Approved by: https://github.com/wanchaol	2024-05-17 01:03:21 +00:00
Wanchao Liang	d0dfcd2c34	fix the device type for with_comms decorator (#125798 ) Found by @yifuwang: it looks like we are wrongly using self.device_type="cuda" for the gloo backend, which is triggering some flakiness, e.g. https://github.com/pytorch/pytorch/issues/125366 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125798 Approved by: https://github.com/yifuwang	2024-05-16 03:40:19 +00:00
wz337	059b68fbdf	[DeviceMesh] Fix hash and eq not match (#123572 ) Fixes #121799 We fix DeviceMesh hash such that two mesh are considered equal if they have the same mesh and same parent_mesh. Examples can be found here: https://github.com/pytorch/pytorch/issues/121799 Also need this to unblock https://github.com/pytorch/pytorch/pull/123394 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123572 Approved by: https://github.com/xunnanxu, https://github.com/wanchaol, https://github.com/yoyoyocmu	2024-05-16 02:00:45 +00:00
Wanchao Liang	04a241947a	[dtensor] delete the old unused mesh_alltoall (#124879 ) as titled, as we have a dedicated comm op, this is not needed anymore Pull Request resolved: https://github.com/pytorch/pytorch/pull/124879 Approved by: https://github.com/XilunWu, https://github.com/wz337 ghstack dependencies: #124871, #124872	2024-04-30 18:30:34 +00:00
Yuanhao Ji	e0d2c24de1	Fix device type issue in `_get_device_handle` (#124390 ) Fix #124327 `device_type`, the first arg of [init_device_mesh()](`a0466061e1/torch/distributed/device_mesh.py (L503)`), does not support types with indexes, such as `cuda:0`. If `cuda:0` is used as a parameter, `_get_device_handle()` will not correctly return `torch.cuda`. So the exception should be thrown before creating DeviceMesh object. > See https://github.com/pytorch/pytorch/issues/124327#issuecomment-2062551161, Pull Request resolved: https://github.com/pytorch/pytorch/pull/124390 Approved by: https://github.com/wz337, https://github.com/wanchaol	2024-04-30 06:59:56 +00:00
PyTorch MergeBot	3bd67dab32	Revert "[dtensor] delete the old unused mesh_alltoall (#124879 )" This reverts commit `f7f018a0ed`. Reverted https://github.com/pytorch/pytorch/pull/124879 on behalf of https://github.com/clee2000 due to broke distributed/tensor/parallel/test_tp_examples.py::DistTensorParallelExampleTest::test_transformer_training_is_seq_parallel_True https://github.com/pytorch/pytorch/actions/runs/8882762411/job/24389191482 `f7f018a0ed`. Bad TD ([comment](https://github.com/pytorch/pytorch/pull/124872#issuecomment-2083599445))	2024-04-29 20:26:15 +00:00
Wanchao Liang	f7f018a0ed	[dtensor] delete the old unused mesh_alltoall (#124879 ) as titled, as we have a dedicated comm op, this is not needed anymore Pull Request resolved: https://github.com/pytorch/pytorch/pull/124879 Approved by: https://github.com/XilunWu, https://github.com/wz337 ghstack dependencies: #124871, #124872	2024-04-29 17:22:30 +00:00
Wanchao Liang	8d46ab4104	[dtensor] move pad/unpad_tensor to separate utils (#124871 ) as titled, 1. pad/unpad is a general util not specific to the Shard placement, 2. for the purpose of the next PR, move these two out of the Shard placement itself, and add an additional pad_dim argument Pull Request resolved: https://github.com/pytorch/pytorch/pull/124871 Approved by: https://github.com/awgu, https://github.com/wz337, https://github.com/XilunWu	2024-04-29 17:22:25 +00:00
Andrew Gu	36c983a973	[DeviceMesh] Added `DeviceMesh.from_group()` (#124787 ) This PR adds a `DeviceMesh.from_group()` static method to convert an existing process group to a device mesh. Motivation: We need `DeviceMesh.from_group()` to allow FSDP2 to interoperate with distributed libraries that do not use `DeviceMesh` for all parallelisms. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124787 Approved by: https://github.com/wanchaol ghstack dependencies: #124651, #124741, #124767, #124768, #124780	2024-04-24 23:16:06 +00:00
Wanchao Liang	0da94f3a08	[device_mesh] add a private init backend option (#124780 ) This PR adds a private init backend option to tackle issues with submesh creation: in device mesh slicing we don't want to create process groups again, so it's useful to explicitly turn group creation off. Also, I think there might be more submesh creation functionality later, so having this flag would ensure that no new group is created. Differential Revision: [D56497780](https://our.internmc.facebook.com/intern/diff/D56497780) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124780 Approved by: https://github.com/awgu	2024-04-24 04:31:58 +00:00
Iris Z	2f45be46f6	[DeviceMesh][Test] Add 3d unit test for `get_local_rank()` (#124142 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/124142 Approved by: https://github.com/xunnanxu, https://github.com/fegin, https://github.com/XilunWu	2024-04-18 23:19:17 +00:00
PyTorch MergeBot	944d046645	Revert "[DeviceMesh][Test] Add 3d unit test for `get_local_rank()` (#124142 )" This reverts commit `a403757913`. Reverted https://github.com/pytorch/pytorch/pull/124142 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/124142#issuecomment-2062587289))	2024-04-17 22:31:30 +00:00
wz337	a403757913	[DeviceMesh][Test] Add 3d unit test for `get_local_rank()` (#124142 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/124142 Approved by: https://github.com/xunnanxu, https://github.com/fegin, https://github.com/XilunWu	2024-04-17 20:45:49 +00:00
Yifu Wang	2a2e1d8e4f	[functional collective] change the Python APIs to only use the native funcol ops (#123777 ) ## Summary After this PR, the functional collective Python APIs will stop honoring `TORCH_DISABLE_NATIVE_FUNCOL` and only use native funcol ops. Specifically, this PR: - Removed `use_native_funcol()`. - Removed the code path in the Python APIs when `use_native_funcol()` is `False`. - Changed the CI tests that run on both native funcol and legacy funcol through the Python API to only run with native funcol. ## Test Changes `test_functional_api.py` - Removed the tests where only one of output_split_sizes or input_split_sizes is specified. This behavior is unreliable and has been removed from the native funcol. - Removed `TestWaitiness`, which tests an implementation detail of the legacy funcol. We have equivalent tests for native funcol in `test/distributed/test_c10d_functional_native.py` `b7fac76fc2/test/distributed/test_c10d_functional_native.py (L114-L116)` `test/distributed/_tensor/test_dtensor.py` `test/distributed/_tensor/test_dtensor_compile.py` `test/distributed/test_device_mesh.py` `test/distributed/_tensor/experimental/test_tp_transform.py` `test/distributed/_tensor/test_matrix_ops.py` `test/distributed/test_inductor_collectives.py` - All these tests were double running with both native funcol and legacy funcol. Changed to only run with native funcol. `test/distributed/test_c10d_functional_native.py` - Removed the `run_with_native_funcol` decorators. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123777 Approved by: https://github.com/wanchaol ghstack dependencies: #123776	2024-04-13 03:08:36 +00:00
wz337	2b1ba0ceae	[DeviceMesh] Cache and reuse sliced result (#122975 ) Fixes #118849 Add a map for parent_to_child_mappings in _mesh_resources so we can cache and reuse submesh slicing result so that we can avoid recreating submesh and the underlying sub pg repeatedly, which could lead to funky behaviors. We will follow up with reusing pg from the parent_mesh during submesh creation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122975 Approved by: https://github.com/wanchaol	2024-03-30 23:56:55 +00:00
Iris Zhang (PyTorch)	e99fa0042c	Back out "[DeviceMesh] Add support for nD slicing (#119752 )" (#121763 ) Summary: Original commit changeset: e52b8809c8d8 Original Phabricator Diff: D54778906 We have to backout this diff. D54778906 seems to be causing test failures for APF blocking trunk health and hence release. Just starting to look at the issue. T182209248 Test Plan: Sandcastle Reviewed By: satgera Differential Revision: D54825114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121763 Approved by: https://github.com/osalpekar	2024-03-13 07:22:08 +00:00
wz337	60cd2a43ca	[DeviceMesh] Add support for nD slicing (#119752 ) Fixes one of the issue mentioned in #118639 @mvpatel2000 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119752 Approved by: https://github.com/wanchaol	2024-03-10 00:16:37 +00:00
wz337	5603d95375	[DeviceMesh] Ensure mesh tensor is a cpu tensor (#120046 ) More discussion in the last comment in https://github.com/pytorch/pytorch/pull/118614 In general, users won't pass a cuda tensor to DeviceMesh, as the mesh tensor is just a way to construct a mesh that doesn't require cuda compute. Taking suggestion from @awgu to enforce the tensor to be cpu tensor if it is not already so that we can prevent a device sync. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120046 Approved by: https://github.com/wanchaol, https://github.com/wconstab	2024-02-22 22:03:13 +00:00
Yifu Wang	637cf4a3f2	Test parametrization utils for native funcol migration (#119950 ) ``` Between the time we switch to the native funcol by default and the time when we are confident that we can remove the legacy implementation, we want to ensure that the legacy funcol remains covered by unit tests. This is to prepare for any potential (but unlikely) reverts. The following utilities help achieve this goal. run_with_{native,legacy}_funcol - mark a test to run with only {native,legacy} funcol. These decorators are for impl specific tests (e.g. verifying generated code with FileCheck). run_with_both_funcol_impls - parametrize a test to run with both legacy and native funcol. run_with_both_funcol_impls_with_arg - same as run_with_both_funcol_impls, but passes `enable_native_funcol` to the test so impl specific checks can be carried out. ``` This PR also marks some tests we want to cover in this fashion. More tests will be marked in subsequent PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119950 Approved by: https://github.com/wanchaol ghstack dependencies: #119881	2024-02-19 02:46:03 +00:00
wz337	5f9f771711	[DeviceMesh][Test] Remove test_raises_mesh_dim_less_than_2 (#119172 ) The test is no longer applicable after we allow 1D slice from 1D mesh. https://github.com/pytorch/pytorch/pull/118895 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119172 Approved by: https://github.com/awgu, https://github.com/atalman	2024-02-05 17:34:51 +00:00
Iris Zhang (PyTorch)	0245000be8	[DeviceMesh] Temporarily disable re-use subgroup (#118940 ) Summary: The subgroup re-use logic is causing GLOO to time out on two internal modelstore tests (relevant tests in test plan). We are temporarily disabling subgroup re-use while root-causing so that the internal tests can run again, as they are currently omitted, as shown in T176426987. Test Plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/118940 Approved by: https://github.com/wanchaol	2024-02-05 06:30:00 +00:00
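
Several of the DeviceMesh commits above (#133193, #132632, #134048) describe two rules: a slice must name dims in the same order as the root mesh, and `_flatten()` merges the sliced dims into one 1D mesh whose default name joins the dim names with "_". The following is a minimal pure-Python sketch of those rules, not PyTorch's implementation. The helper names (`is_valid_slice`, `default_flatten_name`, `flattened_ranks`) are hypothetical, and the sketch assumes ranks are laid out row-major over the mesh shape, as in the `(2, 2, 2)` examples in the commit messages.

```python
from itertools import product


def is_valid_slice(root_dim_names, requested):
    """Return True if `requested` is an in-order subsequence of `root_dim_names`."""
    if any(name not in root_dim_names for name in requested):
        return False
    positions = [root_dim_names.index(name) for name in requested]
    # Strictly increasing positions means in-order and without repeats.
    return all(a < b for a, b in zip(positions, positions[1:]))


def default_flatten_name(dim_names):
    """Default name of a flattened dim: the sliced dim names joined by '_'."""
    return "_".join(dim_names)


def flattened_ranks(mesh_shape, root_dim_names, flatten_dims, rank):
    """Ranks sharing a flattened 1D mesh with `rank` (row-major layout assumed)."""
    # Row-major strides, e.g. (2, 2, 2) -> [4, 2, 1].
    strides = []
    stride = 1
    for size in reversed(mesh_shape):
        strides.append(stride)
        stride *= size
    strides.reverse()

    # Coordinate of `rank` along every root dim.
    coord = [(rank // s) % n for s, n in zip(strides, mesh_shape)]

    # Vary only the flattened dims; keep the remaining coordinates fixed.
    flat_idx = [root_dim_names.index(name) for name in flatten_dims]
    ranks = []
    for combo in product(*(range(mesh_shape[i]) for i in flat_idx)):
        c = list(coord)
        for i, value in zip(flat_idx, combo):
            c[i] = value
        ranks.append(sum(ci * s for ci, s in zip(c, strides)))
    return ranks
```

With root dims `("dp", "cp", "tp")` on a `(2, 2, 2)` mesh, `is_valid_slice` accepts `("dp", "tp")` and rejects `("cp", "dp")`, and `flattened_ranks` for `("dp", "cp")` groups the ranks that share a `tp` coordinate, matching the example in #132632.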


67 Commits