Commit Graph

20 Commits

Xuehai Pan
22d258427b [BE][Easy] enable UFMT for torch/distributed/_shard/ (#128867)
Part of #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128867
Approved by: https://github.com/fegin
ghstack dependencies: #128866
2024-06-18 14:39:25 +00:00
Xuehai Pan
67ef2683d9 [BE] wrap deprecated function/class with typing_extensions.deprecated (#127689)
Use `typing_extensions.deprecated` for deprecation annotations where possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` calls that are missing a category.

Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.
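
As a rough illustration of the first pattern, a minimal sketch of the decorator (the function names are hypothetical, not from this diff):

```python
# A sketch of the annotation pattern; assumes typing_extensions >= 4.5.
# `old_helper` / `new_helper` are hypothetical names for illustration.
from typing_extensions import deprecated


@deprecated(
    "`old_helper` is deprecated, please use `new_helper` instead.",
    category=FutureWarning,
)
def old_helper() -> None:
    ...
```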

Resolves #126888

This PR is split from PR #126898.

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689
Approved by: https://github.com/Skylion007
2024-06-02 12:30:43 +00:00
PyTorch MergeBot
033e733021 Revert "[BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)"
This reverts commit 749a132fb0.

Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))
2024-05-31 19:47:24 +00:00
Xuehai Pan
749a132fb0 [BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)
Use `typing_extensions.deprecated` for deprecation annotations where possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` calls that are missing a category.

Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.

UPDATE: Use `FutureWarning` instead of `DeprecationWarning`.
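
For the fallback case, a minimal sketch of the explicit-category fix (`legacy_entry_point` is a hypothetical name):

```python
import warnings


def legacy_entry_point():
    # A bare warnings.warn(...) defaults to UserWarning; the fix is to pass
    # the category explicitly (FutureWarning, per the UPDATE above).
    warnings.warn(
        "`legacy_entry_point` is deprecated, use the new API instead.",
        category=FutureWarning,
        stacklevel=2,
    )
```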

Resolves #126888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898
Approved by: https://github.com/albanD
2024-05-29 12:09:27 +00:00
Iris
aee96bbf5a [PT-D][Checkpointing] Move distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint (#88698)
Context in RFC: https://github.com/pytorch/pytorch/issues/86620

The `.rst` file will be finalized in subsequent PRs.
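
For callers, the move amounts to an import-path change; a sketch, assuming the public function names survive the move unchanged:

```python
# Old, private location (before this PR):
# from torch.distributed._shard.checkpoint import save_state_dict, load_state_dict

# New, public location (after this PR):
from torch.distributed.checkpoint import save_state_dict, load_state_dict
```
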
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88698
Approved by: https://github.com/wanchaol
2022-11-16 21:06:38 +00:00
Kurt Mohler
ee28b865ee Deprecate TypedStorage, its derived classes, and all of their public methods (#85303)
Part of #85302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85303
Approved by: https://github.com/ezyang
2022-11-08 18:11:01 +00:00
Rodrigo Kumpera
f66be71d77 [checkpoint] Adopt Planner interface across the board. (#83781)
Change StorageReader and StorageWriter to follow the new SavePlanner / LoadPlanner design.

Add an optional `planner` param to `load_state_dict` and `save_state_dict`, and implement the new protocol.

This includes a small rework of the FileSystem layer to support a single file per rank and to make fsync optional, matching `torch.save` behavior.
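
A minimal sketch of the new optional parameter, assuming the module path of this era and that `planner=None` selects the default planner:

```python
import torch
from torch.distributed._shard.checkpoint import FileSystemWriter, save_state_dict

state_dict = {"weight": torch.rand(4, 4)}

# `planner` is the new optional argument; passing None (or omitting it)
# falls back to the default planner. Assumes a process group is initialized.
save_state_dict(
    state_dict=state_dict,
    storage_writer=FileSystemWriter("/tmp/ckpt"),
    planner=None,
)
```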

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83781
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2022-08-29 14:38:32 +00:00
Sergii Dymchenko
591222f5d9 Fix use-dict-literal lint (#83718)
Fix use-dict-literal pylint suggestions by changing `dict()` to `{}`. This PR makes the change in every Python file except test/jit/test_list_dict.py, where I think the intent is to test the constructor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83718
Approved by: https://github.com/albanD
2022-08-24 00:26:46 +00:00
joncrall
b136f3f310 More doctest refinements. (#83317)
Follow-up to #82797

Now that the doctests themselves are in a better state, we should be able to enable xdoctest on the CI so they stay that way.

@ezyang @vadimkantorov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83317
Approved by: https://github.com/ezyang
2022-08-22 20:07:26 +00:00
Rodrigo Kumpera
d11d3dd036 [dist.cp] Introduce LoadPlanner and SavePlanner extensibility API. (#83419)
The planners come with default implementations in default_planner.py.

The default planners expose their core functionality as separate functions, making it easy for other checkpoint implementations to reuse it.
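
A sketch of the extensibility point this enables: a custom planner that subclasses the default and reuses its core logic (the class name and the exact `set_up_planner` signature are assumptions of this sketch):

```python
from torch.distributed._shard.checkpoint.default_planner import DefaultSavePlanner


class FilteringSavePlanner(DefaultSavePlanner):
    """Hypothetical planner: drop private keys, then reuse the default logic."""

    def set_up_planner(self, state_dict, is_coordinator):
        filtered = {k: v for k, v in state_dict.items() if not k.startswith("_")}
        super().set_up_planner(filtered, is_coordinator)
```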

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83419
Approved by: https://github.com/wanchaol
2022-08-18 19:40:15 +00:00
joncrall
4618371da5 Integrate xdoctest - Rebased (#82797)
This is a new version of #15648 based on the latest master branch.

Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.

In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will disable those tests. (Unfortunately, I don't have a tool that will insert the `# xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)
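
For reference, the directive looks like this inside a docstring (the function is illustrative):

```python
def troublesome_fn():
    """
    Example:
        >>> # xdoctest: +SKIP
        >>> troublesome_fn()  # would segfault on CI, so xdoctest skips it
    """
```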

Fixes https://github.com/pytorch/pytorch/issues/71105

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
2022-08-12 02:08:01 +00:00
Rodrigo Kumpera
f4ee37453c [dist.checkpoint] Change metadata format and improve error reporting (#82078)
This PR implements the following changes.

Move to a new checkpoint metadata format that splits logical and storage data.
This is a step toward extensible checkpointing, as it moves us away from the hardcoded storage model enforced by the FileSystem storage layer.

Change CheckpointException to include the exception traceback. Exception tracebacks are not serializable, so we need to take care of that ourselves; otherwise we give users horribly bad errors.
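
A minimal sketch of the workaround (the helper name is illustrative): traceback objects cannot be pickled, so they have to be formatted to strings before crossing rank boundaries.

```python
import traceback


def format_for_transport(exc):
    # Traceback objects don't pickle; serialize to a string before shipping
    # the failure to other ranks.
    return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
```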

Finally, remove `validate_state_dict`, as it has lost its usefulness. Loading is becoming flexible enough that the only reasonable way to verify whether a given configuration can be loaded is to actually try it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82078
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2022-08-03 17:00:12 +00:00
Rodrigo Kumpera
69eecdbc9c Introduce MetadataIndex and helper to use it. (#81909)
MetadataIndex simplifies indexing into a state dict and into Metadata.

This includes a `find_state_dict_object` helper that searches into a state dict.

This PR doesn't include search over Metadata, as it requires changes that will land in a subsequent PR.
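
A hedged usage sketch (the module paths and exact constructor are assumptions based on this description):

```python
import torch
from torch.distributed._shard.checkpoint.metadata import MetadataIndex
from torch.distributed._shard.checkpoint.utils import find_state_dict_object

state_dict = {"layer.weight": torch.rand(4, 4)}

# Index a plain tensor by FQN; for sharded tensors an offset additionally
# selects the matching shard.
idx = MetadataIndex(fqn="layer.weight")
obj = find_state_dict_object(state_dict, idx)
```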

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81909
Approved by: https://github.com/wanchaol
2022-07-28 12:20:58 +00:00
Rodrigo Kumpera
d2078fac11 [dist.checkpoint] Cleanup usage of collectives and introduce narrow helper (#81828)
Introduce a `_DistWrapper` class that wraps a process group and provides functional variants of collectives. It works without c10d enabled and is exception-robust.

Introduce `tensor_narrow_n`, which handles narrowing over multiple dimensions.
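
A minimal sketch of what narrowing over multiple dimensions means, in plain tensor ops (`tensor_narrow_n` itself is internal; this is not its actual implementation):

```python
import torch


def narrow_n(tensor, offsets, sizes):
    # Apply Tensor.narrow once per dimension, which is what a multi-dim
    # narrow helper has to do under the hood.
    for dim, (offset, size) in enumerate(zip(offsets, sizes)):
        tensor = tensor.narrow(dim, offset, size)
    return tensor


t = torch.arange(16).reshape(4, 4)
print(narrow_n(t, offsets=(1, 2), sizes=(2, 2)))  # the 2x2 block at (1, 2)
```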

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81828
Approved by: https://github.com/wanchaol
2022-07-27 12:59:58 +00:00
Sergii Dymchenko
d61ae1a773 Remove unused variables from state_dict_loader (#81513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81513
Approved by: https://github.com/mrshenli
2022-07-15 15:31:34 +00:00
Sergii Dymchenko
fe34bf1201 Remove unused storage_size (#81514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81514
Approved by: https://github.com/mrshenli
2022-07-15 15:30:52 +00:00
zilinzhu
3d9cef8c98 Clone tensor to write in ShardedTensor checkpoint (#79400)
The `torch.save` API saves the original tensor backing a view, which results in a much larger checkpoint when parameters are fused, e.g. in torchrec.

Relates to #79016
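
A small sketch of the underlying problem: `torch.save` serializes a view's whole backing storage, so cloning first keeps the checkpoint at the view's size.

```python
import io

import torch

base = torch.zeros(1_000_000)
view = base[:10]


def saved_bytes(t):
    buf = io.BytesIO()
    torch.save(t, buf)
    return buf.getbuffer().nbytes


print(saved_bytes(view))          # ~4 MB: the whole backing storage is written
print(saved_bytes(view.clone()))  # tiny: only the 10 cloned elements (plus framing)
```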

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79400
Approved by: https://github.com/kumpera
2022-06-29 03:47:24 +00:00
Rodrigo Kumpera
270c518be0 [checkpoint] Implement interop between Tensor and Sharded Tensor (#78120)
This allows loading a Tensor from a checkpoint that stores a ShardedTensor under the same FQN.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78120
Approved by: https://github.com/pritamdamania87
2022-06-16 15:31:09 +00:00
Rodrigo Kumpera
c9570e4b88 [checkpoint] Synchronize error handling across all ranks (#77091)
Introduce error handling across all ranks when loading and saving checkpoints.

This makes it a lot simpler for users to handle failures and, as a positive side effect, to coordinate on when a save or load has successfully finished.

This change requires 3 collectives when saving and 1 when loading.
All of these collectives carry a small payload, so they will be latency-bound, and write time should dominate.
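
From the caller's side, the synchronized behavior plausibly reduces to one try/except per rank; a sketch, treating the exception's module path and its `failures` attribute as assumptions:

```python
import torch
from torch.distributed._shard.checkpoint import FileSystemWriter, save_state_dict
from torch.distributed._shard.checkpoint.api import CheckpointException

state_dict = {"weight": torch.rand(4, 4)}

try:
    # Assumes the default process group is already initialized.
    save_state_dict(state_dict, storage_writer=FileSystemWriter("/tmp/ckpt"))
except CheckpointException as exc:
    # With synchronized error handling, every rank raises together and can
    # inspect which ranks failed.
    print(exc.failures)
```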

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77091
Approved by: https://github.com/pritamdamania87, https://github.com/wanchaol
2022-05-18 21:24:09 +00:00
Rodrigo Kumpera
710246ea99 Introduce distributed checkpoint with ShardedTensor.
This is a copy of #76123.
I had to create a new PR due to some infra limitations, so please look at the other PR for the comment history.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76897
Approved by: https://github.com/wanchaol
2022-05-05 20:28:12 +00:00