pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Maggie Moss	8f80892359	Use correct pyrefly syntax in suppressions distributed/... (#166241) Updates the pyrefly-ignores in the torch/distributed directory to use the correct syntax. No functional changes. pyrefly check lintrunner Pull Request resolved: https://github.com/pytorch/pytorch/pull/166241 Approved by: https://github.com/oulgen	2025-10-26 04:16:41 +00:00
fduwjj	7406d2e665	[DeviceMesh] Clean up the call into mesh_resouces to get root mesh (#165787 ) We moved the method to get root mesh into class in https://github.com/pytorch/pytorch/pull/164510. This is to further clean code up. Differential Revision: [D85090191](https://our.internmc.facebook.com/intern/diff/D85090191) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165787 Approved by: https://github.com/fegin	2025-10-21 02:54:04 +00:00
Rohit Singh Rathaur	c4565c3b94	[distributed] Replace 164 assert statements in fsdp directory (#165235 ) Replace assert statements with explicit if/raise patterns across 20 files: - _optim_utils.py (38 asserts) - _flat_param.py (25 asserts) - _fully_shard/_fsdp_param.py (23 asserts) - sharded_grad_scaler.py (12 asserts) - fully_sharded_data_parallel.py (11 asserts) - wrap.py (10 asserts) - _state_dict_utils.py (9 asserts) - _fully_shard/_fsdp_param_group.py (8 asserts) - _runtime_utils.py (6 asserts) - _init_utils.py (6 asserts) - 10 additional files (16 asserts) This prevents assertions from being disabled with Python -O flag. Fixes partially #164878 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165235 Approved by: https://github.com/albanD	2025-10-14 18:04:57 +00:00
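The motivation given in #165235 can be illustrated with a minimal sketch. The function names `check_with_assert` and `check_with_raise` are hypothetical and not taken from the FSDP sources, and the exception type used here is an assumption. The point is only that an `assert` statement is compiled away when Python runs with `-O`, while an explicit `if`/`raise` always executes.

```python
def check_with_assert(world_size: int) -> int:
    # Removed entirely when the interpreter runs with -O (__debug__ is False),
    # so an invalid value would pass through silently.
    assert world_size > 0, f"Expected world_size > 0, got {world_size}"
    return world_size


def check_with_raise(world_size: int) -> int:
    # Executed regardless of interpreter flags.
    # AssertionError is an assumed choice for this sketch.
    if world_size <= 0:
        raise AssertionError(f"Expected world_size > 0, got {world_size}")
    return world_size
```

Running a script that calls `check_with_assert(0)` under `python -O` returns `0` without complaint, whereas `check_with_raise(0)` raises in both modes.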
Maggie Moss	9944cac6e6	Add suppressions to torch/_inductor (#165062) Adds suppressions so pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283 Split this directory into two PRs to keep them from being too large. Test plan: dmypy restart && python3 scripts/lintrunner.py -a pyrefly check step 1: delete lines in the pyrefly.toml file from the project-excludes field step 2: run pyrefly check step 3: add suppressions, clean up unused suppressions before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199 after: INFO 0 errors (6,884 ignored) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165062 Approved by: https://github.com/oulgen, https://github.com/mlazos	2025-10-09 20:34:20 +00:00
Maggie Moss	7457d139c5	Add pyrefly suppressions to torch/distributed (7/n) (#165002) Adds suppressions so pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283 One more PR after this one. Test plan: dmypy restart && python3 scripts/lintrunner.py -a pyrefly check step 1: delete lines in the pyrefly.toml file from the project-excludes field step 2: run pyrefly check step 3: add suppressions, clean up unused suppressions before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199 after: INFO 0 errors (6,884 ignored) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165002 Approved by: https://github.com/oulgen	2025-10-09 04:08:25 +00:00
Yuanyuan Chen	f7ab8a2710	[1/N] Fix ruff warnings (#164333 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/164333 Approved by: https://github.com/albanD	2025-10-01 16:48:32 +00:00
Yuanyuan Chen	da003d7b95	[3/N] Import Callable from collections.abc in torch/distributed (#164104 ) This is the result of applying the ruff `UP035` check. `Callable` is imported from `collections.abc` instead of `typing`. This PR is the follow-up of #164054. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164104 Approved by: https://github.com/Skylion007	2025-09-30 00:28:53 +00:00
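A short sketch of the import style that the ruff `UP035` check asks for in #164104. The helper `apply_twice` is a hypothetical name used only for illustration and does not come from torch/distributed.

```python
from collections.abc import Callable  # instead of: from typing import Callable


def apply_twice(fn: Callable[[int], int], value: int) -> int:
    # collections.abc.Callable can be subscripted on Python 3.9 and later,
    # so the annotation reads the same as the typing.Callable version.
    return fn(fn(value))
```

Annotations written this way behave the same for callers; only the import source changes.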
Xuehai Pan	4ccc0381de	[BE][5/16] fix typos in torch/ (torch/distributed/) (#156315 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #156313, #156314	2025-06-23 02:57:28 +00:00
PyTorch MergeBot	145d4cdc11	Revert "[BE][5/16] fix typos in torch/ (torch/distributed/) (#156315 )" This reverts commit `c2f0292bd5`. Reverted https://github.com/pytorch/pytorch/pull/156315 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](`c95f7fa874`) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))	2025-06-22 12:31:57 +00:00
Xuehai Pan	c2f0292bd5	[BE][5/16] fix typos in torch/ (torch/distributed/) (#156315 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #156313, #156314	2025-06-22 08:43:26 +00:00
Xuehai Pan	995df34b19	[BE][PYFMT] migrate PYFMT for `torch.{distributed,distributions}` to `ruff format` (#144547 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547 Approved by: https://github.com/kwen2501	2025-02-28 07:35:56 +00:00
Ivan Skorokhodov	df776d64f7	chore: fix typos in error messages in FSDP (#146805 ) Fixes two small typos in FSDP error messages Pull Request resolved: https://github.com/pytorch/pytorch/pull/146805 Approved by: https://github.com/awgu, https://github.com/Skylion007	2025-02-13 15:22:13 +00:00
Aaron Orenstein	c64e657632	PEP585 update - torch/distributed/fsdp (#145162 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145162 Approved by: https://github.com/bobrenjc93	2025-01-19 20:04:05 +00:00
bobrenjc93	08be9ec312	Migrate from Tuple -> tuple in torch/distributed (#144258 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258 Approved by: https://github.com/aorenste	2025-01-10 08:34:54 +00:00
Chien-Chin Huang	d53dfa4680	[BE] Raise when the target model has scalar parameters (#132934 ) Address the issue, https://github.com/pytorch/pytorch/issues/130810. Both FSDP1 and FSDP2 do not support scalar parameters. For FSDP1, the issue happens during state_dict operations while FSDP2 fails during the initialization. This PR adds exceptions to help users debug the issue and change the scalar parameters to 1D parameters. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132934 Approved by: https://github.com/awgu, https://github.com/wz337	2024-08-12 18:28:02 +00:00
PyTorch MergeBot	50595ecef4	Revert "[BE] Raise when the target model has scalar parameters (#132934 )" This reverts commit `ea00036841`. Reverted https://github.com/pytorch/pytorch/pull/132934 on behalf of https://github.com/clee2000 due to I think this broke distributed/_composable/fsdp/test_fully_shard_init.py::TestFullyShardShardedParameterTensor::test_raise_scalar_parameter [GH job link](https://github.com/pytorch/pytorch/actions/runs/10314920655/job/28563430905) [HUD commit link](`ea00036841`). Dr CI is wrong, it is not flaky ([comment](https://github.com/pytorch/pytorch/pull/132934#issuecomment-2278208789))	2024-08-09 15:30:34 +00:00
Chien-Chin Huang	ea00036841	[BE] Raise when the target model has scalar parameters (#132934 ) Address the issue, https://github.com/pytorch/pytorch/issues/130810. Both FSDP1 and FSDP2 do not support scalar parameters. For FSDP1, the issue happens during state_dict operations while FSDP2 fails during the initialization. This PR adds exceptions to help users debug the issue and change the scalar parameters to 1D parameters. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132934 Approved by: https://github.com/awgu ghstack dependencies: #132908, #132933	2024-08-09 06:45:48 +00:00
daitian1995	aff48f7378	Autoselect default device in FSDP construction. (#127609) There are still some differences between CUDA and non-CUDA custom devices when constructing FSDP, because CUDA is selected as the default device. For example, when constructing FSDP from a CPU model without passing device_id, device_handle will choose CUDA as the default device. This PR autoselects the real device as the default device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127609 Approved by: https://github.com/awgu	2024-08-08 05:25:17 +00:00
wz337	87053132ea	[DeviceMesh] Remove parent mesh concept from _MeshEnv and replace by root mesh (#132339) Previously, when we slice out a submesh from a mesh, we assign the mesh as the parent mesh of the submesh. In this case, when we have a 3D mesh topology, the parent mesh of a 1D mesh sliced out from the 3D mesh is different from the parent mesh of the same 1D mesh sliced out from the 2D submesh of the 3D mesh. For example: ``` mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2")) mesh_dim0 = mesh_3d["dim0"] mesh_2d = mesh_3d["dim0", "dim1"] mesh_dim0_2 = mesh_2d["dim0"] # This would evaluate to be True print(_mesh_resources.get_parent_mesh(mesh_dim0) != _mesh_resources.get_parent_mesh(mesh_dim0_2)) ``` We can always reconstruct the mesh needed from the mesh dim names, as long as two dims come from the same root. For simplicity, we do not see the necessity of building a tree structure to represent child-parent relationship. Therefore, we are replacing the parent mesh concept with a root mesh concept in `_MeshEnv` so we would have: ``` mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2")) mesh_dim0 = mesh_3d["dim0"] mesh_2d = mesh_3d["dim0", "dim1"] mesh_dim0_2 = mesh_2d["dim0"] # This would evaluate to be True print(_mesh_resources.get_root_mesh(mesh_dim0) == _mesh_resources.get_root_mesh(mesh_dim0_2)) ``` With this change, we will have two types of meshes in an environment. 1. `device_mesh != _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is created by slicing. 2. `device_mesh == _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is a root mesh not created through slicing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132339 Approved by: https://github.com/wanchaol ghstack dependencies: #132310, #132311	2024-08-07 07:01:12 +00:00
Vishwa Raj Singh	bcdba9f91d	Added hpu backend support in fsdp utils (#127757 ) In fsdp init_utils, adding support for hpu backend device on _get_device API. Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127757 Approved by: https://github.com/wconstab, https://github.com/jgong5, https://github.com/awgu	2024-07-27 03:30:59 +00:00
Aaron Orenstein	634b62f111	typing proxy_tensor.py (#129182 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129182 Approved by: https://github.com/Chillee	2024-07-12 23:17:09 +00:00
Xuehai Pan	973037be6a	[BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): `list()` / `tuple()` / `dict()` (#130199 ) This PR changes the empty collection factory call to Python literals: - `list()` -> `[]` - `tuple()` -> `()` - `dict()` -> `{}` The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary: ```bash $ python3 -m dis - <<EOS import collections d1 = {} d2 = dict() dict = collections.OrderedDict d3 = dict() EOS ``` ```text 0 0 RESUME 0 1 2 LOAD_CONST 0 (0) 4 LOAD_CONST 1 (None) 6 IMPORT_NAME 0 (collections) 8 STORE_NAME 0 (collections) 3 10 BUILD_MAP 0 12 STORE_NAME 1 (d1) 4 14 PUSH_NULL 16 LOAD_NAME 2 (dict) 18 CALL 0 26 STORE_NAME 3 (d2) 6 28 LOAD_NAME 0 (collections) 30 LOAD_ATTR 8 (OrderedDict) 50 STORE_NAME 2 (dict) 7 52 PUSH_NULL 54 LOAD_NAME 2 (dict) 56 CALL 0 64 STORE_NAME 5 (d3) 66 RETURN_CONST 1 (None) ``` The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above). Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199 Approved by: https://github.com/malfet	2024-07-11 17:30:28 +00:00
Xuehai Pan	3b798df853	[BE][Easy] enable UFMT for `torch/distributed/{fsdp,optim,rpc}/` (#128869 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128869 Approved by: https://github.com/fegin ghstack dependencies: #128868	2024-06-18 21:49:08 +00:00
Aaron Orenstein	7c12cc7ce4	Flip default value for mypy disallow_untyped_defs [6/11] (#127843 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127843 Approved by: https://github.com/oulgen ghstack dependencies: #127842	2024-06-08 18:49:29 +00:00
Iris Z	1d84c7e100	[DeviceMesh] Update get_group and add get_all_groups (#128097 ) Fixes #121984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128097 Approved by: https://github.com/wconstab, https://github.com/wanchaol	2024-06-08 04:28:56 +00:00
Xuehai Pan	67ef2683d9	[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#127689 ) Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing. Note that only warnings that their messages contain `[Dd]eprecat(ed\|ion)` are updated in this PR. Resolves #126888 - #126888 This PR is split from PR #126898. - #126898 ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689 Approved by: https://github.com/Skylion007	2024-06-02 12:30:43 +00:00
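The fallback described in #127689 (passing `category=FutureWarning` to `warnings.warn`) can be sketched with the standard library alone. The function `old_api` and its message are hypothetical. The sketch does not use `typing_extensions.deprecated`, so it stays free of third-party dependencies.

```python
import warnings


def old_api() -> int:
    # Without an explicit category, warnings.warn defaults to UserWarning.
    # FutureWarning is shown to end users by default, whereas
    # DeprecationWarning is hidden by default outside of __main__.
    warnings.warn(
        "old_api is deprecated, use new_api instead",
        category=FutureWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return 42


# Capture the warning to inspect its category.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_api()
```

After the `with` block, `caught[0].category` is `FutureWarning`, which is the category the commit message settles on.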
PyTorch MergeBot	033e733021	Revert "[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 )" This reverts commit `749a132fb0`. Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))	2024-05-31 19:47:24 +00:00
Sergii Dymchenko	a2bff4dc8c	Fix lint (#127584 ) Trivial fix after https://github.com/pytorch/pytorch/pull/124678 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127584 Approved by: https://github.com/huydhn	2024-05-31 00:00:11 +00:00
Rohan Varma	f9a1bc2c65	[FSDP] Remove _sync_module_states (#124678 ) Remove this unused API Differential Revision: [D56445639](https://our.internmc.facebook.com/intern/diff/D56445639/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124678 Approved by: https://github.com/awgu	2024-05-30 23:02:09 +00:00
Xuehai Pan	749a132fb0	[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 ) Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing. Note that only warnings that their messages contain `[Dd]eprecat(ed\|ion)` are updated in this PR. UPDATE: Use `FutureWarning` instead of `DeprecationWarning`. Resolves #126888 - #126888 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898 Approved by: https://github.com/albanD	2024-05-29 12:09:27 +00:00
Andrew Gu	2978f07d0e	[FSDP] Fixed docs for inter/intra node PG helpers (#126288 ) 1. This fixes an issue where we had 9 ranks in one node and 7 in the other. 2. This makes the notation more explicit that `[0, 7]` is `[0, 1, ..., 7]`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126288 Approved by: https://github.com/weifengpy	2024-05-15 19:45:10 +00:00
Aaron Gokaslan	1dd42e42c4	[BE]: Try TCH autofixes on torch/ (#125536 ) Tries TCH autofixes and see what breaks Pull Request resolved: https://github.com/pytorch/pytorch/pull/125536 Approved by: https://github.com/ezyang	2024-05-05 23:13:59 +00:00
Andrew Gu	79af814369	[FSDP] Added private `_unshard` API (#124304 ) Some toy example: <img width="998" alt="Screenshot 2024-04-17 at 2 00 05 PM" src="https://github.com/pytorch/pytorch/assets/31054793/b5665a63-beb0-4ca1-92c6-c57a052812fd"> We define `FullyShardedDataParallel._unshard(async_op: bool = False)` that can be used to prefetch all-gathers. The user should make sure: 1. Run lazy init before the first `_unshard` call of training. For example, this can hackily be done via `root_module.check_is_root()` on the root FSDP module `root_module`. 2. Call `root_module._wait_unshard_streams_on_current_stream()` before the first `_unshard` call of the current iteration (just need to call it once after last optimizer step and before first `_unshard` of this iteration). Differential Revision: [D56262876](https://our.internmc.facebook.com/intern/diff/D56262876) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124304 Approved by: https://github.com/wanchaol	2024-05-03 13:14:15 +00:00
Shawn Xu	e203aa9fab	[FSDP] [easy] fix HSDP validation error msg (#123019) Summary: This would otherwise yield > ValueError: ('Manual wrapping with ShardingStrategy.HYBRID_SHARD', 'requires explicit specification of process group or device_mesh.') which is odd. Remove the extra trailing commas. Test Plan: CI Differential Revision: D55549851 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123019 Approved by: https://github.com/Skylion007	2024-03-30 18:12:34 +00:00
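The odd message quoted in #123019 is what Python produces when an exception receives two string arguments instead of one. The sketch below reproduces the effect in isolation. The variable names `buggy` and `fixed` are hypothetical, and the exact wording of the corrected message is an assumption.

```python
# A comma between the two literals passes two separate arguments,
# so str(exc) renders the args tuple.
buggy = ValueError(
    "Manual wrapping with ShardingStrategy.HYBRID_SHARD",
    "requires explicit specification of process group or device_mesh.",
)

# Without the comma, adjacent string literals are concatenated
# into a single message argument.
fixed = ValueError(
    "Manual wrapping with ShardingStrategy.HYBRID_SHARD "
    "requires explicit specification of process group or device_mesh."
)

print(str(buggy))
print(str(fixed))
```

The first `print` shows the tuple form from the commit message; the second shows a single readable sentence.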
Andrew Gu	bf8db86a19	[FSDP] Added deprecation msg for `NO_SHARD` (#119553 ) This only includes the warning for world size >1 since we clamp to `NO_SHARD` for world size 1. We mainly do not want `NO_SHARD` to proliferate anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119553 Approved by: https://github.com/Skylion007	2024-02-09 20:32:03 +00:00
Mihir Patel	33761969a4	Remove parent device mesh check (#118620 ) Removes raising error if a device_mesh has a parent. The comment says that HSDP + TP is not supported, but I'm able to do 2D parallelism + HSDP fine. The only issues are: - this check - https://github.com/pytorch/pytorch/pull/118618 - a series of PRs related to checkpointing with 3D meshes that I will open We currently monkeypatch for the above which I am slowly upstreaming. I imagine torch will have a better, native integration eventually, but this check seems too aggressive in the meantime given DTensor now lets users do some things themselves (which is amazing 🎉)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/118620 Approved by: https://github.com/Skylion007	2024-02-08 00:49:28 +00:00
PyTorch MergeBot	3aeaa21eb0	Revert "Remove parent device mesh check (#118620 )" This reverts commit `3f1f057adf`. Reverted https://github.com/pytorch/pytorch/pull/118620 on behalf of https://github.com/atalman due to broke periodic linux-focal-cuda11.8-py3.9-gcc9 ([comment](https://github.com/pytorch/pytorch/pull/118620#issuecomment-1924933878))	2024-02-03 00:22:56 +00:00
Mihir Patel	3f1f057adf	Remove parent device mesh check (#118620 ) Removes raising error if a device_mesh has a parent. The comment says that HSDP + TP is not supported, but I'm able to do 2D parallelism + HSDP fine. The only issues are: - this check - https://github.com/pytorch/pytorch/pull/118618 - a series of PRs related to checkpointing with 3D meshes that I will open We currently monkeypatch for the above which I am slowly upstreaming. I imagine torch will have a better, native integration eventually, but this check seems too aggressive in the meantime given DTensor now lets users do some things themselves (which is amazing 🎉)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/118620 Approved by: https://github.com/wz337, https://github.com/wanchaol	2024-02-02 05:29:49 +00:00
Catherine Lee	4f5785b6b3	Enable possibly-undefined error code (#118533) Fixes https://github.com/pytorch/pytorch/issues/118129 Suppressions automatically added with ``` import re with open("error_file.txt", "r") as f: errors = f.readlines() error_lines = {} for error in errors: match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error) if match: file_path, line_number, error_type = match.groups() if file_path not in error_lines: error_lines[file_path] = {} error_lines[file_path][int(line_number)] = error_type for file_path, lines in error_lines.items(): with open(file_path, "r") as f: code = f.readlines() for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True): code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n" with open(file_path, "w") as f: f.writelines(code) ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Co-authored-by: Catherine Lee <csl@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2024-01-30 21:07:01 +00:00
PyTorch MergeBot	40ece2e579	Revert "Enable possibly-undefined error code (#118533 )" This reverts commit `4f13f69a45`. Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))	2024-01-30 19:00:34 +00:00
Edward Z. Yang	4f13f69a45	Enable possibly-undefined error code (#118533) Fixes https://github.com/pytorch/pytorch/issues/118129 Suppressions automatically added with ``` import re with open("error_file.txt", "r") as f: errors = f.readlines() error_lines = {} for error in errors: match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error) if match: file_path, line_number, error_type = match.groups() if file_path not in error_lines: error_lines[file_path] = {} error_lines[file_path][int(line_number)] = error_type for file_path, lines in error_lines.items(): with open(file_path, "r") as f: code = f.readlines() for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True): code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n" with open(file_path, "w") as f: f.writelines(code) ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2024-01-30 05:08:10 +00:00
Wanchao Liang	eebf115686	[fsdp][2d] FSDP sync module states handle tensor subclass (#117336) This PR lets the FSDP sync module states kwarg handle tensor subclasses. Because FSDP works on the "dp" mesh dimension, as long as FSDP works on a different device mesh dimension, we can safely let FSDP just broadcast the DTensor local shards. fixes https://github.com/pytorch/pytorch/issues/117126 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117336 Approved by: https://github.com/awgu	2024-01-13 19:33:47 +00:00
Wanchao Liang	848cfe8d45	[reland] unflatten_tensor on compute stream for DTensorExtension (#117020) reland of https://github.com/pytorch/pytorch/pull/116559, which was reverted internally. The underlying reason for the revert is that torch.dynamo.disable can't be used by the pytorch codebase, as it conflicts with torch.deploy; although the latter only runs some inference, it somehow takes that weird dependency on fsdp. We have seen this issue with our functional collectives: we can't use any dynamo components, otherwise torch.deploy would complain. Verified internally that after removing torch.dynamo.disable the test passed again. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117020 Approved by: https://github.com/awgu	2024-01-09 21:25:15 +00:00
Qinfan Wu	b847290ddd	Back out "[2d] unflatten_tensor on compute stream for DTensorExtension (#116559 )" (#116939 ) Summary: Original commit changeset: 65298112f3db Original Phabricator Diff: D52530451 Differential Revision: D52583345 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116939 Approved by: https://github.com/842974287	2024-01-07 03:53:40 +00:00
Wanchao Liang	d9c0e37bab	[2d] unflatten_tensor on compute stream for DTensorExtension (#116559) Context: The existing FSDPExtension has a bug when unflattening a tensor involves compute/communication on a CUDA stream. The current FSDPExtension unflatten-tensor logic runs in the unshard stream, so the runtime loses sync with the compute stream; if there are dependencies between the compute stream and the unflatten-tensor logic, the missing sync point could lead to NaN. This PR makes the FSDPExtension record the compute stream and lets DTensorExtension use the compute stream directly for unflatten_tensor. In the long term we might want the FSDP runtime logic to perform only the unshard in the unshard stream and to run the unshard views in the compute stream. We currently fix this in the Extension directly, as this is the simplest thing to do without affecting FSDP runtime logic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116559 Approved by: https://github.com/awgu, https://github.com/fduwjj, https://github.com/yifuwang ghstack dependencies: #116426	2024-01-03 07:29:08 +00:00
Iris Zhang (PyTorch)	23fa9621e4	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099 ) (#115193 ) Summary: Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation. We created stubs for public class and methods in torch.distributed.device_mesh so that torch.distributed.device_mesh can be imported with or without distributed is available(). Original diff reverted: D51629761 Original PR reverted: https://github.com/pytorch/pytorch/pull/115099 Prior to landing, CI signals are all passed. Shipit added the "ci/trunk" label to the PR and DID NOT wait for it and went ahead committing. More context can be found in the reverted PR above. Test Plan: CI. Differential Revision: D51861018 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193 Approved by: https://github.com/fegin	2023-12-08 08:44:32 +00:00
Nikita Shulga	a827ac71f2	Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099 )" This reverts commit `eaa64339d6`.	2023-12-05 08:59:36 -08:00
Iris Zhang (PyTorch)	eaa64339d6	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099) Summary: Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation. Original diff reverted: D51629761 Original PR reverted: https://github.com/pytorch/pytorch/pull/114991 It was failing a public module binding test on macOS, due to the change in import order for torch/distributed/fsdp/_common_utils.py. Since the original import would still work, we remove the changes in this file. Test Plan: CI. Differential Revision: D51825114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115099 Approved by: https://github.com/wanchaol, https://github.com/fegin	2023-12-05 05:44:52 +00:00
PyTorch MergeBot	3a2e2044cd	Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710 ) (#114991 )" This reverts commit `729ac7317a`. Reverted https://github.com/pytorch/pytorch/pull/114991 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114991#issuecomment-1837214567))	2023-12-02 17:55:51 +00:00
Iris Zhang (PyTorch)	729ac7317a	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710 ) (#114991 ) Summary: Same content of changes as https://github.com/pytorch/pytorch/pull/114710 Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation. ghstack-source-id: 208980207 exported-using-ghexport Test Plan: CI. Reviewed By: wanchaol Differential Revision: D51629761 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114991 Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/fegin	2023-12-02 04:39:41 +00:00

135 Commits