Commit Graph

9 Commits

Author SHA1 Message Date
Lucas Pasqualin
5432088098 Adds Checkpointer Wrapper for DCP [3/N] (#114603)
Adds a useful high level wrapper for calling `dist.save/load` with the correct storage readers and writers.

Instead of doing:

```
DCP.save(
    state_dict={...},
    storage_writer=StorageWriter(...)
)

DCP.load(
    state_dict={...},
    storage_reader=StorageReader(...)
)
```

We can now do:

```
checkpointer = Checkpointer(...)

checkpointer.save(state_dict={...})
checkpointer.load(state_dict={...})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114603
Approved by: https://github.com/fegin, https://github.com/wz337
2023-12-08 01:03:21 +00:00
Lucas Pasqualin
f073dcd4f7 Stateful Checkpointing for Distributed [1/N] (#113867)
First pass at adding a save/load API, as well as definition of Stateful objects.

Amongst a couple todo's, we still need to explore adding an `all_gather` & potentially a `barrier` while iterating through state keys.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113867
Approved by: https://github.com/fegin, https://github.com/wz337
2023-12-01 19:21:03 +00:00
Chien-Chin Huang
9d0c3e21d0 [state_dict][9/N] Add get and set APIs for model and optimizer state_dict (#112203)
The original get_state_dict and set_state_dict pair is too complicated because of the possible combinations of usages. This PR adds the APIs to get/set model_state_dict and optimizer_state_dict seperately.

Differential Revision: [D50713584](https://our.internmc.facebook.com/intern/diff/D50713584/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112203
Approved by: https://github.com/wz337
ghstack dependencies: #112167
2023-11-02 22:03:57 +00:00
Chien-Chin Huang
19a6487ad4 [state_dict][6/N] Change API names to avoid conflict and simplify the API signatures (#111120)
`state_dict` is a very common variable name people use to represent a local
state_dict and `load_state_dict` conflicts with DCP's `load_state_dict`.

This PR changes `state_dict` to `get_state_dict`. `get_state_dict` is more close to what is this API does -- users use the API to get the current state_dict for saving or for loading (passed to DCP for loading in-place)..

This PR also changes `load_state_dict` to `set_state_dict`. `set_state_dict` is less ideal compared to `get_state_dict` but is symetric. We can still change the API name before it goes to beta.

This PR also simplies the API signatures. `model_only` is removed and `optim_only` only exists for `get_state_dict`.

Differential Revision: [D50213931](https://our.internmc.facebook.com/intern/diff/D50213931/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111120
Approved by: https://github.com/wz337
ghstack dependencies: #111106, #111107, #111275, #111109, #111110
2023-10-17 00:15:31 +00:00
Chien-Chin Huang
7c67139e7b [state_dict][3/N] Cleanup StateDictOptions, make it more readable (#111275)
This is a reland PR for https://github.com/pytorch/pytorch/pull/111108 with the proper docstring fix.

1. Rename DistributedStateDictOptions to StateDictOptions.
2. Remove cpu_offload as we have not yet required this option.
3. Rename save_frozen_parameters to ignore_frozen_params.

Differential Revision: [D50294352](https://our.internmc.facebook.com/intern/diff/D50294352/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111275
Approved by: https://github.com/wz337
ghstack dependencies: #111106, #111107
2023-10-14 15:34:52 +00:00
Svetlana Karslioglu
d425da8bf3 Replace master with main in links and docs/conf.py (#100176)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100176
Approved by: https://github.com/albanD, https://github.com/malfet
2023-05-02 18:20:32 +00:00
Iris
a7698a8260 [DCP] Add DCP FSDP sharded_state_dict checkpoint example to DCP .rst file (#95517)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95517
Approved by: https://github.com/kumpera
2023-03-03 18:09:10 +00:00
Rodrigo Kumpera
9e56378ef2 Add documentation for DCP. (#92813)
This populates the website with some basic documentation.

It's far from ideal as we should include some basic usage example.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92813
Approved by: https://github.com/wz337
2023-01-24 17:21:51 +00:00
Iris
aee96bbf5a [PT-D][Checkpointing] Move distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint (#88698)
Context in RFC: https://github.com/pytorch/pytorch/issues/86620

.rst file will be finalized in subsequent PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88698
Approved by: https://github.com/wanchaol
2022-11-16 21:06:38 +00:00