pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
wz337	1a23ef2ece	[DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709 ) More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366. TLDR: When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if default pg is gloo, as lots of collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gather a cuda tensor using gloo. Without the change in this PR, users would have to know the context and explicitly move the cuda tensor to cpu before invoking most collectives, which I think is not so ideal UX. Therefore, given most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709 Approved by: https://github.com/awgu, https://github.com/wanchaol	2024-08-06 22:00:09 +00:00
wz337	fb87796d4f	[DeviceMesh] Add supports for non-continuous slicing (#132310 ) Removes constraint of continuous slicing to allow non-continuous slicing and adds a unit test for 3D non-continuous slicing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132310 Approved by: https://github.com/wanchaol	2024-08-05 09:30:07 +00:00
wz337	4de85e3c30	[DeviceMesh] Remove _parent_mesh as an attribute from DeviceMesh and remove it from DeviceMesh's hash (#131636 ) We recently revisited the hash implementation and think `_parent_mesh` information should not be burned into DeviceMesh but rather be inferred from the MeshEnv which manages device meshes. As `mesh_dim_names` is considered in device mesh's hash. This should not affect the issue brought up in https://github.com/pytorch/pytorch/issues/121799 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131636 Approved by: https://github.com/wanchaol	2024-07-25 22:47:22 +00:00
Xuehai Pan	db3290846e	[BE][Easy][10/19] enforce style for empty lines in import segments in `test/d*/` (#129761 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129761 Approved by: https://github.com/fegin	2024-07-17 16:57:39 +00:00
Iris Z	1d84c7e100	[DeviceMesh] Update get_group and add get_all_groups (#128097 ) Fixes #121984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128097 Approved by: https://github.com/wconstab, https://github.com/wanchaol	2024-06-08 04:28:56 +00:00
Yifu Wang	1d0c1087dd	Allow overriding per-dim group options via _MeshEnv.set_dim_group_options (#126599 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126599 Approved by: https://github.com/wanchaol ghstack dependencies: #126598	2024-06-06 17:18:12 +00:00
Aaron Gokaslan	12c4a2c297	[BE]: Apply PLR1736 fixes (unnecessary index lookup) (#127716 ) Applies the PLR1736 preview rule with some more autofixes to cut down on unnecessary accesses. Added a noqa since that test actually testing the dunder method. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127716 Approved by: https://github.com/ezyang	2024-06-03 17:22:13 +00:00
Iris Z	1699edaabb	[DeviceMesh] Adding nD slicing support back (#127465 ) Fixes #126530 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127465 Approved by: https://github.com/wconstab, https://github.com/wanchaol	2024-05-31 17:06:36 +00:00
PyTorch MergeBot	f6e303fa47	Revert "[DeviceMesh] Adding nD slicing support back (#127465 )" This reverts commit `e72232f8f0`. Reverted https://github.com/pytorch/pytorch/pull/127465 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint `e72232f8f0`, the error does not like look trivial fix, so I revert the change for a forward fix ([comment](https://github.com/pytorch/pytorch/pull/127465#issuecomment-2141051630))	2024-05-31 00:43:13 +00:00
wz337	e72232f8f0	[DeviceMesh] Adding nD slicing support back (#127465 ) Fixes #126530 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127465 Approved by: https://github.com/wconstab	2024-05-30 23:55:21 +00:00
Andrew Gu	697ed6f5b3	[DeviceMesh] Supported N groups in `from_group` (#126258 ) Overview This PR supports constructing an ND mesh with `from_group()` by passing in `group: List[ProcessGroup]` and `mesh: Union[torch.Tensor, "ArrayLike"]` together. The `ndim` of the device mesh returned from `from_group()` is equal to the number of `ProcessGroup`s passed. If the `ndim` is greater than 1, then the `mesh` argument is required (since there is no simple way to recover the `mesh` tensor from the process groups otherwise). This PR also adds `mesh_dim_names` as an argument to forward to the device mesh for convenience. <details> <summary> Old Approach </summary> Overview - This PR mainly adds `mesh_shape` to `from_group()` so that the user can construct an ND (N > 1) device mesh from a process group. This is to unblock HSDP, where we can pass the overall data parallel process group to `from_group()` with `mesh_shape = (replicate_dim_size, shard_dim_size)` and `from_group()` will construct subgroups for the user. (The user can then get the subgroups from the submeshes.) - Constructing the 2D `DeviceMesh` from an existing shard process group and replicate process group is hard because we cannot easily recover the array of ranks in their parent group on each rank in general. - This PR also adds `mesh_dim_names` to `from_group()` so that the user can name the mesh dimensions of the constructed device mesh. </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126258 Approved by: https://github.com/wanchaol	2024-05-17 01:03:21 +00:00
Wanchao Liang	d0dfcd2c34	fix the device type for with_comms decorator (#125798 ) found by @yifuwang, it looks like we are wrongly using self.device_type="cuda" for gloo backend, which are triggering some flakiness. i.e. https://github.com/pytorch/pytorch/issues/125366 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125798 Approved by: https://github.com/yifuwang	2024-05-16 03:40:19 +00:00
wz337	059b68fbdf	[DeviceMesh] Fix hash and eq not match (#123572 ) Fixes #121799 We fix DeviceMesh hash such that two mesh are considered equal if they have the same mesh and same parent_mesh. Examples can be found here: https://github.com/pytorch/pytorch/issues/121799 Also need this to unblock https://github.com/pytorch/pytorch/pull/123394 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123572 Approved by: https://github.com/xunnanxu, https://github.com/wanchaol, https://github.com/yoyoyocmu	2024-05-16 02:00:45 +00:00
Wanchao Liang	04a241947a	[dtensor] delete the old unused mesh_alltoall (#124879 ) as titled, as we have a dedicated comm op, this is not needed anymore Pull Request resolved: https://github.com/pytorch/pytorch/pull/124879 Approved by: https://github.com/XilunWu, https://github.com/wz337 ghstack dependencies: #124871, #124872	2024-04-30 18:30:34 +00:00
Yuanhao Ji	e0d2c24de1	Fix device type issue in `_get_device_handle` (#124390 ) Fix #124327 `device_type`, the first arg of [init_device_mesh()](`a0466061e1/torch/distributed/device_mesh.py (L503)`), does not support types with indexes, such as `cuda:0`. If `cuda:0` is used as a parameter, `_get_device_handle()` will not correctly return `torch.cuda`. So the exception should be thrown before creating DeviceMesh object. > See https://github.com/pytorch/pytorch/issues/124327#issuecomment-2062551161, Pull Request resolved: https://github.com/pytorch/pytorch/pull/124390 Approved by: https://github.com/wz337, https://github.com/wanchaol	2024-04-30 06:59:56 +00:00
PyTorch MergeBot	3bd67dab32	Revert "[dtensor] delete the old unused mesh_alltoall (#124879 )" This reverts commit `f7f018a0ed`. Reverted https://github.com/pytorch/pytorch/pull/124879 on behalf of https://github.com/clee2000 due to broke distributed/tensor/parallel/test_tp_examples.py::DistTensorParallelExampleTest::test_transformer_training_is_seq_parallel_True https://github.com/pytorch/pytorch/actions/runs/8882762411/job/24389191482 `f7f018a0ed`. Bad TD ([comment](https://github.com/pytorch/pytorch/pull/124872#issuecomment-2083599445))	2024-04-29 20:26:15 +00:00
Wanchao Liang	f7f018a0ed	[dtensor] delete the old unused mesh_alltoall (#124879 ) as titled, as we have a dedicated comm op, this is not needed anymore Pull Request resolved: https://github.com/pytorch/pytorch/pull/124879 Approved by: https://github.com/XilunWu, https://github.com/wz337 ghstack dependencies: #124871, #124872	2024-04-29 17:22:30 +00:00
Wanchao Liang	8d46ab4104	[dtensor] move pad/unpad_tensor to separate utils (#124871 ) as titled, 1. pad/unpad is a general util not specific to the Shard placement, 2. for the propose of the next PR, move these two out of Shard placement itself, and give additional pad_dim argument Pull Request resolved: https://github.com/pytorch/pytorch/pull/124871 Approved by: https://github.com/awgu, https://github.com/wz337, https://github.com/XilunWu	2024-04-29 17:22:25 +00:00
Andrew Gu	36c983a973	[DeviceMesh] Added `DeviceMesh.from_group()` (#124787 ) This PR adds a `DeviceMesh.from_group()` static method to convert an existing process group to a device mesh. Motivation: We need `DeviceMesh.from_group()` to allow FSDP2 to interoperate with distributed libraries that do not use `DeviceMesh` for all parallelisms. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124787 Approved by: https://github.com/wanchaol ghstack dependencies: #124651, #124741, #124767, #124768, #124780	2024-04-24 23:16:06 +00:00
Wanchao Liang	0da94f3a08	[device_mesh] add a private init backend option (#124780 ) This PR adds a private init backend option, to tackle the issues sub mesh creation: in device mesh slicing we don't want to create process groups again, so explicitly turn the group creation off it's useful Also I think there might be more submesh creation functionality so having this flag would ensure that there's no new group created Differential Revision: [D56497780](https://our.internmc.facebook.com/intern/diff/D56497780) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124780 Approved by: https://github.com/awgu	2024-04-24 04:31:58 +00:00
Iris Z	2f45be46f6	[DeviceMesh][Test] Add 3d unit test for `get_local_rank()` (#124142 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/124142 Approved by: https://github.com/xunnanxu, https://github.com/fegin, https://github.com/XilunWu	2024-04-18 23:19:17 +00:00
PyTorch MergeBot	944d046645	Revert "[DeviceMesh][Test] Add 3d unit test for `get_local_rank()` (#124142 )" This reverts commit `a403757913`. Reverted https://github.com/pytorch/pytorch/pull/124142 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/124142#issuecomment-2062587289))	2024-04-17 22:31:30 +00:00
wz337	a403757913	[DeviceMesh][Test] Add 3d unit test for `get_local_rank()` (#124142 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/124142 Approved by: https://github.com/xunnanxu, https://github.com/fegin, https://github.com/XilunWu	2024-04-17 20:45:49 +00:00
Yifu Wang	2a2e1d8e4f	[functional collective] change the Python APIs to only use the native funcol ops (#123777 ) ## Summary After this PR, the functional collective Python APIs will stop honoring `TORCH_DISABLE_NATIVE_FUNCOL` and only use native funcol ops. Specifically, this PR: - Removed `use_native_funcol()`. - Removed the code path in the Python APIs when `use_native_funcol()` is `False`. - Changed the CI tests that runs on both native funcol and legacy funcol through the Python API to only run with native funcol. ## Test Changes `test_functional_api.py` - Removed the tests where only one of output_split_sizes or input_split_sizes is specified. This behavior is unreliable has been removed from the native funcol. - Removed `TestWaitiness` which tests an implementation detail of the legacy funcol. We have equivalent tests for native funcol in `test/distributed/test_c10d_functional_native.py` `b7fac76fc2/test/distributed/test_c10d_functional_native.py (L114-L116)` `test/distributed/_tensor/test_dtensor.py` `test/distributed/_tensor/test_dtensor_compile.py` `test/distributed/test_device_mesh.py` `test/distributed/_tensor/experimental/test_tp_transform.py` `test/distributed/_tensor/test_matrix_ops.py` `test/distributed/test_inductor_collectives.py` - All these tests were double running with both native funcol and legacy funcol. Changed to only run with native funcol. `test/distributed/test_c10d_functional_native.py` - Removed the `run_with_native_funcol` decorators. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123777 Approved by: https://github.com/wanchaol ghstack dependencies: #123776	2024-04-13 03:08:36 +00:00
wz337	2b1ba0ceae	[DeviceMesh] Cache and reuse sliced result (#122975 ) Fixes #118849 Add a map for parent_to_child_mappings in _mesh_resources so we can cache and reuse submesh slicing result so that we can avoid recreating submesh and the underlying sub pg repeatedly, which could lead to funky behaviors. We will follow up with reusing pg from the parent_mesh during submesh creation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122975 Approved by: https://github.com/wanchaol	2024-03-30 23:56:55 +00:00
Iris Zhang (PyTorch)	e99fa0042c	Back out "[DeviceMesh] Add support for nD slicing (#119752 )" (#121763 ) Summary: Original commit changeset: e52b8809c8d8 Original Phabricator Diff: D54778906 We have to backout this diff. D54778906 seems to be causing test failures for APF blocking trunk health and hence release. Just starting to look at the issue. T182209248 Test Plan: Sandcastle Reviewed By: satgera Differential Revision: D54825114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121763 Approved by: https://github.com/osalpekar	2024-03-13 07:22:08 +00:00
wz337	60cd2a43ca	[DeviceMesh] Add support for nD slicing (#119752 ) Fixes one of the issue mentioned in #118639 @mvpatel2000 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119752 Approved by: https://github.com/wanchaol	2024-03-10 00:16:37 +00:00
wz337	5603d95375	[DeviceMesh] Ensure mesh tensor is a cpu tensor (#120046 ) More discussion in the last comment in https://github.com/pytorch/pytorch/pull/118614 In general, users won't pass a cuda tensor to DeviceMesh, as the mesh tensor is just a way to construct a mesh that doesn't require cuda compute. Taking suggestion from @awgu to enforce the tensor to be cpu tensor if it is not already so that we can prevent a device sync. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120046 Approved by: https://github.com/wanchaol, https://github.com/wconstab	2024-02-22 22:03:13 +00:00
Yifu Wang	637cf4a3f2	Test parametrization utils for native funcol migration (#119950 ) ``` Between the time we switch to the native funcol by default and the time when we are confident that we can remove the legacy implementation, we want to ensure that the legacy funcol remains covered by unit tests. This is to prepare for any potential (but unlikely) reverts. The following utilities help achieve this goal. run_with_{native,legacy}_funcol - mark a test to run with only {native,legacy} funcol. These decorators are for impl specific tests (e.g. verifying generated code with FileCheck). run_with_both_funcol_impls - parametrize a test to run with both legacy and native funcol. run_with_both_funcol_impls_with_arg - same as run_with_both_funcol_impls, but passes `enable_native_funcol` to the test so impl specific checks can be carried out. ``` This PR also marks some tests we want to cover in this fashion. More tests will be marked in subsequent PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119950 Approved by: https://github.com/wanchaol ghstack dependencies: #119881	2024-02-19 02:46:03 +00:00
wz337	5f9f771711	[DeviceMesh][Test] Remove test_raises_mesh_dim_less_than_2 (#119172 ) The test is no longer applicable after we allow 1D slice from 1D mesh. https://github.com/pytorch/pytorch/pull/118895 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119172 Approved by: https://github.com/awgu, https://github.com/atalman	2024-02-05 17:34:51 +00:00
Iris Zhang (PyTorch)	0245000be8	[DeviceMesh] Temporarily disable re-use subgroup (#118940 ) Summary: The reuse subgroup logic is causing GLOO to timeout on two internal modelstore tests (relevant tests in test plan). We temporarily disabling re-use subgroup during root-causing to allow the internal tests to be able to run again, as they are now omitted shown in T176426987. Test Plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/118940 Approved by: https://github.com/wanchaol	2024-02-05 06:30:00 +00:00
Yifu Wang	8f82a44a5b	Run device mesh tests with native funcol enabled (#118437 ) ### Summary Run the relevant tests in `test/distributed/_tensor/test_dtensor_compile.py` and `test/distributed/test_device_mesh.py` with native funcol enabled, in addition to with them being disabled. All tests excepts `test_tp_compile_comm_reordering` pass. This is expected because the native funcols have slightly different IRs, so the reordering pass needs to be adjusted. This test is disabled for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118437 Approved by: https://github.com/LucasLLC ghstack dependencies: #118910, #118911	2024-02-04 04:11:11 +00:00
wz337	c908caf92b	[DeviceMesh] Alllow 1d slice from 1d mesh (#118895 ) Fixes [ISSUE_NUMBER](https://github.com/pytorch/pytorch/issues/118851) i.e. mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("dp")) then we do dp_mesh = mesh["dp"] should still work, just dummy return without recording parent mesh Pull Request resolved: https://github.com/pytorch/pytorch/pull/118895 Approved by: https://github.com/wanchaol	2024-02-02 22:00:24 +00:00
Yifu Wang	697ca4f292	Preliminary DeviceMesh + native c10d functional integration (#118423 ) ### Summary - Added `group_name` as the third field in `dim_group_infos`. - `DeviceMeshTest` now runs both w/ and w/0 `_USE_NATIVE_C10D_FUNCTIONAL=1` in CI. ### Other fixes - Convert `reduceOp` to lower case before passing it into c10d_functional ops. - Added a finalizer to handle unwaited collectives (this mirrors the treatment for Python functional collective ops). Pull Request resolved: https://github.com/pytorch/pytorch/pull/118423 Approved by: https://github.com/wanchaol, https://github.com/LucasLLC, https://github.com/wconstab	2024-01-31 04:36:12 +00:00
wz337	e1f9eca113	[DeviceMesh] Reuse sub_group pg if exists (#115716 ) Currently, we create new_group for sub_group pg during mesh initialization. The PR changes this so we will: 1) re-use sub_group pg if it exsits, 2) create new sub_group pg if it does not exist. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115716 Approved by: https://github.com/wanchaol	2024-01-25 18:07:16 +00:00
Yue Dong	87ea6fb844	Make input contiguous for DTensor reduce scatter to fix the incorrect numerical values (#115847 ) Summary: This change is to make the input tensor contiguous for DTensor reduce scatter in the case no padding is needed. There's no exception thrown during training, but we ran into numerical value correctness issue without the change. Test Plan: CI CI test WHEN model test: - Verified loss for each iteration within the expected range. - Verified NE on-par with this change with 4B training data. Differential Revision: D52170822 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115847 Approved by: https://github.com/wanchaol	2023-12-17 01:35:09 +00:00
Iris Z	1eca63c6ac	[DeviceMesh] Move helper function 'get_mesh_dim_by_name' to MeshEnv class (#115572 ) Move helper function `get_mesh_dim_by_name ` outside of the DeviceMesh class to keep the public class cleaner. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115572 Approved by: https://github.com/XilunWu, https://github.com/wanchaol	2023-12-12 06:29:46 +00:00
Iris Zhang (PyTorch)	23fa9621e4	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099 ) (#115193 ) Summary: Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation. We created stubs for public class and methods in torch.distributed.device_mesh so that torch.distributed.device_mesh can be imported with or without distributed is available(). Original diff reverted: D51629761 Original PR reverted: https://github.com/pytorch/pytorch/pull/115099 Prior to landing, CI signals are all passed. Shipit added the "ci/trunk" label to the PR and DID NOT wait for it and went ahead committing. More context can be found in the reverted PR above. Test Plan: CI. Differential Revision: D51861018 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193 Approved by: https://github.com/fegin	2023-12-08 08:44:32 +00:00
Nikita Shulga	a827ac71f2	Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099 )" This reverts commit `eaa64339d6`.	2023-12-05 08:59:36 -08:00
Iris Zhang (PyTorch)	eaa64339d6	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099 ) Summary: Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation. Original diff reverted: D51629761 Original PR reverted: https://github.com/pytorch/pytorch/pull/114991 It was failing because failing a public module binding tests in MacOS, and this is due to the change in import order for torch/distributed/fsdp/_common_utils.py. Since this original import would still work, we remove the changes in this file. Test Plan: CI. Differential Revision: D51825114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115099 Approved by: https://github.com/wanchaol, https://github.com/fegin	2023-12-05 05:44:52 +00:00
PyTorch MergeBot	3a2e2044cd	Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710 ) (#114991 )" This reverts commit `729ac7317a`. Reverted https://github.com/pytorch/pytorch/pull/114991 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114991#issuecomment-1837214567))	2023-12-02 17:55:51 +00:00
Iris Zhang (PyTorch)	729ac7317a	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710 ) (#114991 ) Summary: Same content of changes as https://github.com/pytorch/pytorch/pull/114710 Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation. ghstack-source-id: 208980207 exported-using-ghexport Test Plan: CI. Reviewed By: wanchaol Differential Revision: D51629761 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114991 Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/fegin	2023-12-02 04:39:41 +00:00
wz337	2dd2fb91d9	[DeviceMesh] Add get_local_rank() API to DeviceMesh (#114709 ) As title. Differential Revision: [D51625152](https://our.internmc.facebook.com/intern/diff/D51625152/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114709 Approved by: https://github.com/wanchaol, https://github.com/fegin ghstack dependencies: #114708	2023-12-01 03:28:55 +00:00
wz337	7b3e45be59	[DeviceMesh] Rename get_dim_groups to get_group (#114708 ) Rename get_dim_groups to get_group and update all callsites. Differential Revision: [D51629801](https://our.internmc.facebook.com/intern/diff/D51629801/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114708 Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fegin	2023-11-30 23:40:14 +00:00
wz337	272e38e78b	[DeviceMesh] Update DeviceMesh's hash (#114812 ) Currently, when we create two DeviceMesh with the same mesh_tensor, the hash of the DeviceMesh will be the same. To follow the pattern of `dist.new_group()`, the two DeviceMesh should be different. Therefore, adding an id field for DeviceMesh creation to distinguish different DeviceMesh. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114812 Approved by: https://github.com/wanchaol, https://github.com/yoyoyocmu, https://github.com/fegin	2023-11-30 12:14:19 +00:00
Iris Zhang	596dab4277	[DeviceMesh] Remove _validate_mesh from device_mesh.py (#112928 ) Plan B for https://github.com/pytorch/pytorch/pull/112839 Motivation for the change: 1. We need to remove `funcol` as a dependency for device_mesh.py to resolve circular dependency issues when introducing device_mesh as an arg for DDP. In the meantime, we should not go from funcol to non-funcol as @voznesenskym suggested. Therefore, we want to remove this all_gather check completely. 2. For large scale, it would not make sense to validate the mesh at global scale anyway. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112928 Approved by: https://github.com/wanchaol	2023-11-04 05:12:27 +00:00
Kazuaki Ishizaki	9089242048	Fix typo under test directory (#112346 ) This PR fixes typo in comments and messages under `test` directory. This PR also fixes related typo in messages under `torch` directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112346 Approved by: https://github.com/kit1980, https://github.com/ezyang	2023-11-03 07:53:33 +00:00
Iris Zhang	b07cfd79fe	[DeviceMesh] Move DeviceMesh out from torch.distributed._tensor (#112364 ) Move DeviceMesh out as a standalone module. Once we make sure everything is migrated and doc is ready, we will make `torch.distributed._device_mesh` public in follow-up PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112364 Approved by: https://github.com/wanchaol, https://github.com/fegin, https://github.com/fduwjj	2023-11-02 04:44:25 +00:00

48 Commits