Commit Graph

138 Commits

Author SHA1 Message Date
Xuehai Pan
995df34b19 [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547
Approved by: https://github.com/kwen2501
2025-02-28 07:35:56 +00:00
Aaron Orenstein
c64e657632 PEP585 update - torch/distributed/fsdp (#145162)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145162
Approved by: https://github.com/bobrenjc93
2025-01-19 20:04:05 +00:00
bobrenjc93
08be9ec312 Migrate from Tuple -> tuple in torch/distributed (#144258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258
Approved by: https://github.com/aorenste
2025-01-10 08:34:54 +00:00
Alex Kiefer
2f1191fb6a Corrected metadata variable names (#142342)
Fixes #142341

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142342
Approved by: https://github.com/janeyx99
2024-12-10 02:24:31 +00:00
Aaron Gokaslan
08db735629 [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0, which should hopefully reduce linting time. It has support for orjson cache serialization, which should improve mypy cache performance if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-03 02:50:10 +00:00
PyTorch MergeBot
daa77f3d9f Revert "[BE]: Update mypy to 1.13.0 (#140808)"
This reverts commit 00134d68af.

Reverted https://github.com/pytorch/pytorch/pull/140808 on behalf of https://github.com/huydhn due to This is failing a distributed test in trunk, target determination missed this test and did not run it on PR ([comment](https://github.com/pytorch/pytorch/pull/140808#issuecomment-2512788426))
2024-12-02 20:47:43 +00:00
Aaron Gokaslan
00134d68af [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0, which should hopefully reduce linting time. It has support for orjson cache serialization, which should improve mypy cache performance if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-02 18:47:54 +00:00
Tom Ritchford
c0582fd0f8 Remove unused Python variables in torch/[b-z]* (#136963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963
Approved by: https://github.com/ezyang
2024-10-19 16:45:22 +00:00
Wanchao Liang
cfc227ad43 [reland][dtensor] move DTensor to public namespace (#134203)
reland of https://github.com/pytorch/pytorch/pull/133113

I had to create a new PR because the previous reverted PR could neither be rebased nor imported successfully :(

----

Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next PRs)
* To preserve BC for users still using `torch.distributed._tensor`, I added a shim script to redirect old-path calls to the new module

BC preservation is evidenced by the fact that all DTensor tests still pass without changing the public imports, so it is safe to land the changes.
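A minimal sketch of the BC shim described above, using hypothetical module names (`pkg.tensor` / `pkg._tensor` stand in for the new public module and the legacy `torch.distributed._tensor` path): the old name is registered in `sys.modules` as an alias of the new module, so cached and future imports of the old path resolve to the new one.

```python
import importlib
import sys
import types

# Stand-in for the new public module (hypothetical; in PyTorch this would be
# the real torch.distributed.tensor module object).
new_mod = types.ModuleType("pkg.tensor")
new_mod.DTensor = type("DTensor", (), {})
sys.modules["pkg.tensor"] = new_mod

# The legacy path becomes a thin alias registered in sys.modules, so code
# importing the old name transparently gets the new module.
sys.modules["pkg._tensor"] = new_mod

old = importlib.import_module("pkg._tensor")
assert old.DTensor is new_mod.DTensor  # old imports resolve to the new class
```

This is only one way to build such a shim; the actual PR uses a redirect script inside the package, but the effect is the same: old import paths keep working without duplicating the implementation.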

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
2024-09-08 17:08:40 +00:00
PyTorch MergeBot
35f36363ec Revert "[dtensor] move DTensor to public namespace (#133113)"
This reverts commit 2ee6b97464.

Reverted https://github.com/pytorch/pytorch/pull/133113 on behalf of https://github.com/wanchaol due to looks like it break some internal type imports ([comment](https://github.com/pytorch/pytorch/pull/133113#issuecomment-2295670911))
2024-08-19 05:00:19 +00:00
Wanchao Liang
2ee6b97464 [dtensor] move DTensor to public namespace (#133113)
Moving DTensor to be in the public namespace, to formally add the
documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next
  PRs)
* To preserve BC for users still using `torch.distributed._tensor`,
  I added a shim script to redirect old-path calls to the new module

BC preservation is evidenced by the fact that all DTensor tests still
pass without changing the public imports, so it is safe to land the
changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
2024-08-17 05:09:52 +00:00
Xuehai Pan
3b798df853 [BE][Easy] enable UFMT for torch/distributed/{fsdp,optim,rpc}/ (#128869)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128869
Approved by: https://github.com/fegin
ghstack dependencies: #128868
2024-06-18 21:49:08 +00:00
Mihir Patel
153362fbc9 Support HSDP + Monolith Checkpointing (#128446)
Fixes #128444. The rank-0 check should use the same process group as the broadcast.
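An illustrative sketch of the bug class (no real process groups; the group layout is hypothetical): with HSDP, the broadcast runs within a replication subgroup, so "am I the source?" must be answered with the rank inside that subgroup, not the global rank.

```python
world = list(range(8))
# Hypothetical replication subgroups (shard groups of size 4, replicas of 2).
replicate_groups = [[0, 4], [1, 5], [2, 6], [3, 7]]

def is_broadcast_src(global_rank):
    # Find the subgroup this rank belongs to and check its rank *within* it.
    group = next(g for g in replicate_groups if global_rank in g)
    return group.index(global_rank) == 0  # rank 0 of the group, not globally

# Every subgroup leader is a broadcast source, not just global rank 0:
assert [r for r in world if is_broadcast_src(r)] == [0, 1, 2, 3]
```

Checking `global_rank == 0` instead would make ranks 1-3 wait for a broadcast source that never sends within their group, which is the hang/mismatch the fix addresses.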

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128446
Approved by: https://github.com/fegin
2024-06-17 16:59:41 +00:00
Aaron Orenstein
7c12cc7ce4 Flip default value for mypy disallow_untyped_defs [6/11] (#127843)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127843
Approved by: https://github.com/oulgen
ghstack dependencies: #127842
2024-06-08 18:49:29 +00:00
Chien-Chin Huang
0dd53650dd [BE][FSDP] Change the logging level to info (#126362)
As title

Differential Revision: [D57419445](https://our.internmc.facebook.com/intern/diff/D57419445/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126362
Approved by: https://github.com/awgu, https://github.com/Skylion007
2024-05-16 17:31:06 +00:00
Jeeja
7f1d5aba93 [FSDP] Use generic device handle instead of cuda (#121620)
In FSDP's _optim_utils.py, use a generic device handle instead of CUDA
to support other backends.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121620
Approved by: https://github.com/awgu, https://github.com/wz337
2024-05-13 18:07:08 +00:00
Aaron Gokaslan
1dd42e42c4 [BE]: Try TCH autofixes on torch/ (#125536)
Try TCH autofixes and see what breaks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125536
Approved by: https://github.com/ezyang
2024-05-05 23:13:59 +00:00
Aaron Gokaslan
c5fafe9f48 [BE]: TRY002 - Ban raising vanilla exceptions (#124570)
Adds a ruff lint rule to ban raising vanilla exceptions. Most of these should at the very least be runtime errors, value errors, type errors, or some other specific errors. There are hundreds of instances of these bad exception types already in the codebase, so I have noqa'd most of them. Hopefully this error code will get committers to rethink what exception type they should raise when they submit a PR.

I also encourage people to gradually fix all the existing noqas that have been added, so they can be removed over time and our exception typing can be improved.
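A before/after illustration of what TRY002 asks for (the function and messages are hypothetical): replace a vanilla `raise Exception(...)` with the specific built-in exception type that matches the failure.

```python
def set_port(port):
    # Before (flagged by TRY002): raise Exception("bad port")
    if not isinstance(port, int):
        # Wrong type of argument -> TypeError
        raise TypeError(f"port must be an int, got {type(port).__name__}")
    if not 0 < port < 65536:
        # Right type, invalid value -> ValueError
        raise ValueError(f"port out of range: {port}")
    return port

assert set_port(8080) == 8080
```

Specific types let callers catch precisely what they can handle (`except ValueError:`) instead of a blanket `except Exception:` that swallows unrelated bugs.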

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124570
Approved by: https://github.com/ezyang
2024-04-21 22:26:40 +00:00
wz337
de8af28083 [FSDP][StateDict] Allow FULL_STATE_DICT option for 2D (#120837)
Fixes #120722

TL;DR for the issue:
As users are expected to use get_model_state_dict for state_dict retrieval, I think it's fine to remove the warning and RuntimeError.
More context in #120722.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120837
Approved by: https://github.com/Skylion007
2024-03-05 10:03:44 +00:00
Catherine Lee
4f5785b6b3 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 21:07:01 +00:00
PyTorch MergeBot
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
Edward Z. Yang
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
Chien-Chin Huang
2ea38498b0 [FSDP][BE] Only show state_dict log when the debug level is detail (#118196)
As title

Differential Revision: [D53038704](https://our.internmc.facebook.com/intern/diff/D53038704/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118196
Approved by: https://github.com/rohan-varma, https://github.com/wz337
ghstack dependencies: #118197, #118195
2024-01-26 09:52:36 +00:00
Mihir Patel
84cfe6d8b2 Drop all gather stats to debug not warning (#117669)
The logger's default level results in these all-gather stats being spammed into every run, which is very annoying.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117669
Approved by: https://github.com/Skylion007, https://github.com/awgu
2024-01-17 21:44:59 +00:00
Andrew Gu
92cc8ae172 [FSDP] Cloned unsharded tensor slice in optim state dict load (#117261)
This takes the fix from https://github.com/pytorch/pytorch/issues/116553. Cloning the slice allows the (much larger) base tensor to be freed.
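A stdlib analogy for the fix above (torch-free so it runs anywhere): a slice that is a *view* pins the entire underlying buffer, while a copy owns only the bytes it needs. This is the same reason `.clone()` on a tensor slice lets the much larger base tensor be freed.

```python
big = bytearray(1_000_000)
view = memoryview(big)[:10]   # a view: still references the full 1 MB buffer
copy = bytes(view)            # a copy: owns just 10 bytes of its own

assert view.obj is big        # the view keeps `big`'s storage alive
assert len(copy) == 10
view.release()                # only now can `big`'s buffer be reclaimed
```

In PyTorch terms: `flat_param[a:b]` shares storage with `flat_param`, so holding the slice in the optimizer state keeps the whole flat parameter alive; `flat_param[a:b].clone()` does not.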

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117261
Approved by: https://github.com/wz337
2024-01-11 20:21:12 +00:00
Aaron Gokaslan
ee5d981249 [BE]: Enable RUFF PERF402 and apply fixes (#115505)
* Enable PERF402. Makes code more efficient and succinct by removing useless element-by-element list copies that could be accomplished via a list constructor or an extend call. All test cases have noqa added since performance is not as sensitive in that folder.
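What PERF402 flags versus what it suggests, as a minimal example: an element-by-element append loop replaced by a list constructor or a single `extend` call.

```python
src = ["a", "b", "c"]

# Flagged pattern: copying one element at a time.
dst_loop = []
for item in src:
    dst_loop.append(item)

# Preferred fixes:
dst_ctor = list(src)   # fresh copy via the constructor
dst_ext = []
dst_ext.extend(src)    # or extend an existing list in one call

assert dst_loop == dst_ctor == dst_ext == ["a", "b", "c"]
```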

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115505
Approved by: https://github.com/malfet
2023-12-20 18:01:24 +00:00
Chien-Chin Huang
cc28f61fa3 [DCP][BE] Move DCP._state_dict_utils out from DCP (#115523)
DCP._state_dict_utils is also used by FSDP, which can sometimes cause circular imports. Move it out of DCP to avoid this.

Differential Revision: [D52022440](https://our.internmc.facebook.com/intern/diff/D52022440/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115523
Approved by: https://github.com/wz337
2023-12-13 08:59:48 +00:00
Yue Dong
ab120e65fb Fix FSDP + TP state dict in param unflattening (#115105)
Summary:
This diff fixes param unflattening when using FSDP together with TP. Currently we hardcode the `reshape_size` to be multiplied by 2, when it should instead be multiplied by the size of the process group.

Before the fix, an example exception: `shape '[257, 514]' is invalid for input of size 264196`, where the process group size is 4 instead of 2.
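An illustrative version of the fix (function name hypothetical): the unflattened dimension must scale by the actual process-group size, not a hardcoded factor of 2.

```python
def unflattened_shape(local_shape, pg_size):
    # The sharded dim 0 is multiplied by the group size; the old code used
    # a literal 2, which only worked for 2-way sharding.
    return (local_shape[0] * pg_size, *local_shape[1:])

# A (128, 514) local shard in a group of 4 unflattens to (512, 514); the
# hardcoded 2 would have produced (256, 514) and a reshape size mismatch.
assert unflattened_shape((128, 514), 4) == (512, 514)
```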

Test Plan:
**CI**:
CI test

**Unit test**:
`buck2 test mode/dev-nosan //caffe2/test/distributed/tensor/parallel:fsdp_2d_parallel`
- Passed

**Test model with WHEN**:
- Verified that checkpoint can be saved and resumed successfully;
- Verified the accuracy with window_ne, which is on-par with baseline.
https://pxl.cl/3Wp8w

Differential Revision: D51826120

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115105
Approved by: https://github.com/fegin
2023-12-05 21:19:56 +00:00
Chien-Chin Huang
4ba649e207 [FSDP][state_dict] Avoid assigning the root _device_mesh to the children _device_mesh (#114384)
Assigning the root _device_mesh to the children's _device_mesh is not correct, as each FSDP state can have a different DeviceMesh. We are also replacing fully_shard with a new implementation, so there is no need to worry about the fully_shard behavior.

Differential Revision: [D51507959](https://our.internmc.facebook.com/intern/diff/D51507959/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114384
Approved by: https://github.com/wz337
2023-11-30 02:08:31 +00:00
Jez Ng
5cfa0647a7 Update mypy to 1.7.0 (#114160)
It appears that `mypy` is now checking a few more previously-unchecked files; these files
are being found via import-following. Not sure exactly why they weren't being checked before.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114160
Approved by: https://github.com/eellison
ghstack dependencies: #114162
2023-11-28 06:45:55 +00:00
Gary Zheng
d1ae5efa94 [torch][fsdp] More informative assertion error when rank mismatch (#113765)
Summary: I had a job fail due to a rank mismatch but didn't find enough information in the assertion message. This change makes the message more informative.

Test Plan:
CI tests and I ran a test job which failed as expected:

```
Rank 1 has different values for step: 8016.0. Other ranks: 7870.0
```

Differential Revision: D51322046

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113765
Approved by: https://github.com/wz337, https://github.com/fegin
2023-11-20 17:44:41 +00:00
wz337
ca9e654353 [FSDP] Fix FSDP submodule with DeviceMesh does not return DTensor state_dict error (#113593)
For scenarios where FSDP is not the root module, the `_use_dtensor` flag would not be switched on. This PR fixes it by checking whether the submodule has a `device_mesh` and turning the `_use_dtensor` flag on accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113593
Approved by: https://github.com/fegin
2023-11-15 19:00:19 +00:00
Chien-Chin Huang
2bcff4d8e3 [state_dict][11/N] Implement cpu_offload and full_state_dict for get_state_dict (#112837)
As title

Differential Revision: [D50962991](https://our.internmc.facebook.com/intern/diff/D50962991/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112837
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: #112836, #112885
2023-11-13 10:03:06 +00:00
wz337
31ded95cd5 [2D] Bind _fsdp_extension to FSDP instances (#113237)
Currently, when we have 2D composition, a global variable _extensions controls the 2D deviation we need to take in state_dict calls (see https://github.com/pytorch/pytorch/blob/release/2.1/torch/distributed/fsdp/_fsdp_extensions.py#L66-L68). This is problematic when we have both a 2D model and a plain FSDP model in the same dist environment, as _extensions will be mistakenly turned on for the plain FSDP model, resulting in a state_dict error (RuntimeError: No parent device_mesh is found for FSDP device_mesh.).

This PR binds _fsdp_extension to the FSDP instances to make sure that state_dict calls do not interfere with each other when mixing 2D and 1D parallelism.
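A sketch of the design change (class and attribute names are illustrative): bind the extension to each FSDP state object instead of flipping a module-level global that leaks across unrelated instances in the same process.

```python
class FSDPState:
    def __init__(self, fsdp_extension=None):
        # Per-instance binding; the old design read a module global that the
        # last-initialized model had set for everyone.
        self._fsdp_extension = fsdp_extension

    def state_dict_path(self):
        # Each instance consults only its own extension.
        return "2d-path" if self._fsdp_extension is not None else "1d-path"

two_d = FSDPState(fsdp_extension=object())  # 2D-composed model
plain = FSDPState()                         # plain FSDP model, same process

assert two_d.state_dict_path() == "2d-path"
assert plain.state_dict_path() == "1d-path"  # no longer sees the 2D extension
```

With the global, constructing `two_d` would have flipped the switch for `plain` too, producing exactly the "No parent device_mesh is found" error described above.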

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113237
Approved by: https://github.com/fduwjj, https://github.com/fegin
2023-11-09 03:31:03 +00:00
Chien-Chin Huang
a66f2a1b99 [state_dict] Move _gather_state_dict to dcp module (#112835)
This api is getting used by more than just FSDP. This PR moves it to DCP module.

Differential Revision: [D50962966](https://our.internmc.facebook.com/intern/diff/D50962966/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112835
Approved by: https://github.com/wz337
2023-11-08 19:42:56 +00:00
Chien-Chin Huang
a810126cf7 [FSDP][optim_state_dict] Skip the parameter if the parameter does not belong to the current FSDP instance (#112804)
Skip an FSDP-managed parameter if it is not managed by the current FSDP instance. This can happen if not all FSDP instances have all the parameters, e.g., with FSDP combined with some MPMD-style parallelism.

Differential Revision: [D50562170](https://our.internmc.facebook.com/intern/diff/D50562170/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112804
Approved by: https://github.com/wz337
2023-11-06 18:23:36 +00:00
Chien-Chin Huang
2a86bcbac2 [FSDP][state_dict] Cleanup the usage of _get_pg_default_device (#112168)
_get_pg_default_device is not suitable for the FSDP use case. We should always use the compute_device when communicating.

Differential Revision: [D50698730](https://our.internmc.facebook.com/intern/diff/D50698730/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112168
Approved by: https://github.com/wz337
2023-10-27 08:09:08 +00:00
Iris Zhang
c84dbd2c03 [2D] Enable 2D optimizer set_state_dict() (#111778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111778
Approved by: https://github.com/fegin, https://github.com/fduwjj
ghstack dependencies: #111774
2023-10-27 04:33:00 +00:00
wz337
8dc4887e84 [2D] Enable 2D optimizer get_state_dict() (#111774)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111774
Approved by: https://github.com/fegin
2023-10-25 03:44:14 +00:00
Daniel Dale
90e2117a99 Allow optimizer state conversion to accommodate optimizers that have no tensor state (e.g. SGD) (#111501)
Fixes #111499

This PR slightly alters the new fused `all_gather` `optim_state_dict` implementation to support optimizers without tensor state (e.g. SGD) in a `use_orig_params=True` context.

The principal change is to short-circuit `_allgather_orig_param_states` if an empty `state_buffers` dict is returned after completing `_convert_all_state_info` here:
93e5065ba0/torch/distributed/fsdp/_optim_utils.py (L1481-L1484)

To allow `_convert_all_state_info` to accommodate optimizers with no tensor state, I also changed the scope of `dtype` and made the return type `Optional`.

As discussed in the issue this PR fixes, I'm [extending](93e5065ba0/test/distributed/fsdp/test_fsdp_optim_state.py (L1915I)) `test_state_dict_with_none_tensor_state` to test with both Adam and SGD optimizers to validate that scalar and non-tensor states continue to be restored for both optimizer types.
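A hedged sketch of the short-circuit (function names are illustrative, not the real PR code): SGD without momentum keeps no per-parameter tensor state, so the fused allgather path must return early when the collected state buffers are empty rather than issuing collectives over nothing.

```python
def allgather_orig_param_states(state_buffers):
    if not state_buffers:  # e.g. plain SGD: no tensor state to all-gather
        return {}
    # ... the real implementation would issue the fused allgather here ...
    return {k: ("gathered", v) for k, v in state_buffers.items()}

# SGD-like: empty buffers short-circuit cleanly.
assert allgather_orig_param_states({}) == {}
# Adam-like: per-state buffers are gathered as before.
assert allgather_orig_param_states({"exp_avg": "buf"}) == {
    "exp_avg": ("gathered", "buf")
}
```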

Thanks to the distributed team as always for their adroit design and exceptionally valuable contributions to the open source ML community. Hope you all feel appreciated commensurate with the compounding progress your work enables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111501
Approved by: https://github.com/fegin
2023-10-19 06:47:04 +00:00
Chien-Chin Huang
7b25c2b90e [FSDP][optim_state_dict] Move local optimizer state to FSDP compute_device (#110929)
This will ensure all the tensors are on FSDP compute_device.

Differential Revision: [D50059492](https://our.internmc.facebook.com/intern/diff/D50059492/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110929
Approved by: https://github.com/wz337
2023-10-10 10:34:31 +00:00
Chien-Chin Huang
90bf6e3938 [FSDP][optim_state_dict] Enable cpu_offload config for optimzer state_dict (#108434)
We had the option but never used cpu_offload, as optimizer state_dict offloads the tensors to CPU by default. This is usually what most users want, as the tensors are required to be moved to CPU eventually. However, we may want to disable offloading to CPU in some cases, especially for debugging purposes. This PR lets optimizer state_dict read the flag.

Differential Revision: [D48913340](https://our.internmc.facebook.com/intern/diff/D48913340/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108434
Approved by: https://github.com/wz337
2023-10-07 01:14:49 +00:00
Chien-Chin Huang
1a729618ef [FSDP][optim_state_dict] Make the new optimizer allgather fusion work with fine-tuning models (#110540)
With use_orig_params=True, it is possible that some parameters sharing the same FlatParameter are in the optimizer while other parameters are frozen. This PR makes the allgather fusion logic support this case.

Differential Revision: [D49922028](https://our.internmc.facebook.com/intern/diff/D49922028/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110540
Approved by: https://github.com/awgu, https://github.com/rohan-varma
2023-10-05 15:17:10 +00:00
Chien-Chin Huang
cdde899a73 [FSDP][optim_state_dict] Fuse allgather for optim_state_dict when use_orig_params is True (#108298)
The original implementation of `_gather_orig_param_state` is naive. It performs one `allgather_object` and two `allgather`s (if the optimizer is Adam) per FQN. This can be slow and make `_optim_state_dict` a bottleneck.

This PR rewrites the implementation and fuses all the `allgather_object`s into one. The `allgather`s are fused based on the information in the FlatParameters, so there will be 2N `allgather`s, where N is the number of FlatParameters and the factor of 2 is due to Adam having 2 states per FQN.

One experiment on 8 A100 GPUs shows that the gathering step improves from 3 seconds to 0.3 seconds.
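The fusion idea can be sketched in pure Python (no collectives; function name illustrative): concatenating per-key buffers into one flat buffer turns N gather operations into one, and recorded offsets restore the per-key slices afterwards, which is the amortization the PR relies on.

```python
def fused_gather(states):
    keys = list(states)
    # One flat buffer: a single collective would gather this instead of
    # issuing len(keys) separate allgathers.
    flat = [x for k in keys for x in states[k]]
    # Slice the flat buffer back into per-key results using known lengths.
    out, i = {}, 0
    for k in keys:
        n = len(states[k])
        out[k] = flat[i:i + n]
        i += n
    return out

adam_like = {"exp_avg": [1.0, 2.0], "exp_avg_sq": [3.0, 4.0]}
assert fused_gather(adam_like) == adam_like  # round-trips per-key buffers
```

The win comes from collective-call latency: each allgather pays a fixed launch/synchronization cost, so one large gather beats many small ones even for the same total bytes.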

Differential Revision: [D48835138](https://our.internmc.facebook.com/intern/diff/D48835138/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108298
Approved by: https://github.com/awgu
2023-10-02 20:57:08 +00:00
Matthew Hoffman
68b0db1274 Define the public API for torch.distributed.fsdp (#109922)
Related: https://github.com/pytorch/pytorch/wiki/Public-API-definition-and-documentation
Related: https://github.com/microsoft/pylance-release/issues/2953

This fixes pylance issues for these classes:

```
"FullyShardedDataParallel" is not exported from module "torch.distributed.fsdp"
```

These classes all have public docs:

* [`BackwardPrefetch`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.BackwardPrefetch)
* [`CPUOffload`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.CPUOffload)
* [`FullyShardedDataParallel`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel)
* [`MixedPrecision`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.MixedPrecision)
* [`ShardingStrategy`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.ShardingStrategy)

And it seems like all the newly added classes will have docs once they are released.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109922
Approved by: https://github.com/wanchaol
2023-09-28 02:15:58 +00:00
Chien-Chin Huang
1b3e5b53f3 [FSDP][optim_state_dict] Add device to _shard_utils.py to explicitly use the device from fsdp_state (#109631)
_get_pg_default_device does not always get the device we want. This PR lets the user explicitly specify the correct device.

Differential Revision: [D49425743](https://our.internmc.facebook.com/intern/diff/D49425743/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109631
Approved by: https://github.com/awgu, https://github.com/fduwjj, https://github.com/wz337
2023-09-20 01:59:38 +00:00
wz337
66af4f6ec7 [HSDP] Add device_mesh to FSDP kwarg and add dtensor state_dict support for HSDP (#107533)
This PR:
1) Add a device_mesh kwarg to FSDP and remove init_device_mesh() from _runtime_utils.py, as device_mesh is now passed in by the user as a kwarg.
2) Change the use_dtensor flag for state_dict_config and optim_state_dict_config to be private. If device_mesh is used with a sharded model/optim state dict, the _use_dtensor flag is set to True and the model/optim state dict returns a DTensor state_dict; otherwise, the _use_dtensor flag is set to False and the model/optim state dict returns a sharded_tensor state_dict.
3) Update _optim_utils.py, _shard_utils.py, and _state_dict_utils.py to add support for HSDP to return a 2D DTensor state_dict.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107533
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/wanchaol
2023-09-05 21:21:21 +00:00
Chien-Chin Huang
591cb776af [FSDP][state_dict][optim_state_dict] Log slow optim and model state_dict paths (#108290)
This PR adds SimpleProfiler for FSDP state_dict/load_state_dict logging purposes. SimpleProfiler uses class variables to record profiling results and does everything in Python, which can be slow, so it is only suitable for logging slow actions such as initialization and state_dict/load_state_dict.

This PR uses SimpleProfiler to log some critical/slow paths of the model and optimizer state_dict/load_state_dict.

Differential Revision: [D48774406](https://our.internmc.facebook.com/intern/diff/D48774406/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108290
Approved by: https://github.com/wz337
2023-09-01 06:57:59 +00:00
PyTorch MergeBot
ab5b4c4419 Revert "[HSDP] Add device_mesh to FSDP and add dtensor state_dict support for HSDP (#107533)"
This reverts commit cc220e45a8.

Reverted https://github.com/pytorch/pytorch/pull/107533 on behalf of https://github.com/huydhn due to Sorry for reverting this, but it is failing in trunk with the same failure on test_dynamo_distributed cc220e45a8 ([comment](https://github.com/pytorch/pytorch/pull/107533#issuecomment-1701983247))
2023-09-01 01:26:30 +00:00
wz337
cc220e45a8 [HSDP] Add device_mesh to FSDP and add dtensor state_dict support for HSDP (#107533)
This PR:
1) Add a device_mesh kwarg to FSDP and remove init_device_mesh() from _runtime_utils.py, as device_mesh is now passed in by the user as a kwarg.
2) Change the use_dtensor flag for state_dict_config and optim_state_dict_config to be private. If device_mesh is used with a sharded model/optim state dict, the _use_dtensor flag is set to True and the model/optim state dict returns a DTensor state_dict; otherwise, the _use_dtensor flag is set to False and the model/optim state dict returns a sharded_tensor state_dict.
3) Update _optim_utils.py, _shard_utils.py, and _state_dict_utils.py to add support for HSDP to return a 2D DTensor state_dict.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107533
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/wanchaol
2023-09-01 00:15:00 +00:00