Commit Graph

27 Commits

Author SHA1 Message Date
Pradeep Fernando
1b08aaeafe Supporting non-tensor-data write_size in planner write items. (#149699)
Summary:
1\ The current write item structure does not contain the amount of data that needs to be written.
2\ the planner.item already has a size primitive 'tensor_storage_size'. https://fburl.com/code/7a0gsmw7 But only for tensors.
3\ Right now, the only way the writer layer get hold of this property (fro non tensor data)
first do a lookup in to the actual tensor/bytes
then calculate the nbytes.
This change introduce a way to capture non-tensor data size within a write-plan item.

Test Plan: Existing UT.

Differential Revision: D71599725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149699
Approved by: https://github.com/MeetVadakkanchery
2025-03-21 18:09:14 +00:00
Howard Huang
b98af95401 Fix DCP link (#148974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148974
Approved by: https://github.com/svekars
2025-03-11 21:26:37 +00:00
Chien-Chin Huang
52acc1f955 [DSD] Update the document to mention the limitation of set_optimizer_state_dict (#148918)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/140898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148918
Approved by: https://github.com/fduwjj, https://github.com/mori360
ghstack dependencies: #148825
2025-03-11 18:24:12 +00:00
Meet Vadakkanchery
fdee60769a [DCP] Introduce process based async checkpointing (#147039)
Summary:
### Context
Background checkpoint upload thread interfering with trainer thread:

In [async save API](https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/state_dict_saver.py#L239-L248), the background thread spends a considerable amount of time on CPU-bound tasks (pickling/unpickling several metada objects a.k.a SavePlans) on rank0 during the collective operation; this kind of asymmetric computation heavily contends for GIL with the trainer thread causing GPU util to suffer significantly for the E2E checkpoint duration.

### Solution:
Introduce async save via a checkpoint daemon process. This daemon process will be created once (during the first save attempt) and can serve async checkpoint requests for the remainder of training lifetime.

Test Plan: Added E2E UTs for process based async save.

Differential Revision: D69272583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147039
Approved by: https://github.com/saumishr
2025-03-04 13:33:28 +00:00
Will Constable
2b400236c2 [DCP] Cross-link DCP doc to tutorials (#139776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139776
Approved by: https://github.com/mhorowitz, https://github.com/LucasLLC, https://github.com/fduwjj
ghstack dependencies: #139938
2024-11-07 02:19:49 +00:00
Wouter Devriendt
e8645fa2b9 [Doc] fix some typos (found by codespell and typos) (#132544)
Applying doc fixes from PR https://github.com/pytorch/pytorch/pull/127267 - with CLA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132544
Approved by: https://github.com/kit1980
2024-08-05 17:21:56 +00:00
Lucas Pasqualin
799f1460af [DCP] Provides default AsyncStager (#124939)
Differential Revision: [D56575987](https://our.internmc.facebook.com/intern/diff/D56575987/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124939
Approved by: https://github.com/fegin
ghstack dependencies: #122965
2024-05-02 19:48:54 +00:00
Lucas Pasqualin
3741fb3680 [DCP] Introduce async staging extension points (#122965)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #124944
* #124939
* __->__ #122965

Differential Revision: [D55493240](https://our.internmc.facebook.com/intern/diff/D55493240/)

*This PR is now ready for merge and is not an RFC*

Major choices are:
-- the introduction of the AsyncStager protocol
-- removed `executor` from param.
-- leave async as a separate method (for now)

This proposal seeks to add extension points to dcp.async_save, allowing users to:
- Specify a specific staging method when calling async_save
- Allow a vehicle for also making the staging method async, to allow for cases where we may want to overlap with the training loop (e.g., overlap d2h with and only synchronize at the optim.step)
- Potentially specify the execution method for doing async_save in parallel. For example some users may prefer a subprocess over a  thread to avoid GIL issues.

A totally reasonable alternative to this entire proposal is to expect users who want this level of customization
to write their own custom async save methods. Here's an example which addresses the issues mentioned
in PR comments.
```
def custom_async_save(...):
    #     this step accomplishes staging and includes the usual 'planning' calls (issue 1)
    buffered_writer = CpuBufferedWriter() # this is stateful, contains a copy of state_dict
    dcp.save(state_dict, storage_writer=buffered_writer)

    final_storage_writer = FileSystemWriter()
    mp.spawn(      # issue2 is gone, do whatever you want here
	dcp.save,    # or some custom sub-process method which calls dcp.save under the hood
        buffered_writer.state_dict,   # lot's of way's to do this, not really the most important part
	checkpoint_id=checkpoint_id,
	storage_writer=storage_writer,
	planner=planner,
	process_group=process_group, # this actually wouldn't work, but again not the pt.
      )
      # leaving out the rest of the details for managing your extra special subprocess.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122965
Approved by: https://github.com/daulet-askarov
2024-05-02 19:01:55 +00:00
Lucas Pasqualin
de7edeea25 [DCP] DCP logger (#121352)
Adds additional logging for improved observability in DCP.

Differential Revision: [D54512626](https://our.internmc.facebook.com/intern/diff/D54512626/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121352
Approved by: https://github.com/wz337, https://github.com/fegin
2024-04-05 17:50:50 +00:00
PyTorch MergeBot
0398dc9e8e Revert "[DCP] Makes fsspec public (#121508)"
This reverts commit d482614fec.

Reverted https://github.com/pytorch/pytorch/pull/121508 on behalf of https://github.com/osalpekar due to this causes torchrec tests to fail internally with this error: ModuleNotFoundError: No module named 'fsspec'. see [D54779117](https://www.internalfb.com/diff/D54779117) ([comment](https://github.com/pytorch/pytorch/pull/121508#issuecomment-1992137831))
2024-03-12 17:02:43 +00:00
Lucas Pasqualin
d482614fec [DCP] Makes fsspec public (#121508)
Fixes #118033

Also removes `_checkpointer.py` class
original PR's:
- https://github.com/pytorch/pytorch/pull/121330
- https://github.com/pytorch/pytorch/pull/121329

We're also disabling `test_fsdp` since it is failing on random PR's

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121508
Approved by: https://github.com/fegin
2024-03-09 01:14:18 +00:00
Chien-Chin Huang
2e789ad522 [DCP][state_dict][doc] Update the distributed state_dict document (#121290)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121290
Approved by: https://github.com/LucasLLC
ghstack dependencies: #121273, #121276
2024-03-08 07:58:18 +00:00
Lucas Pasqualin
96ed37ac13 [DCP] Makes async_save public (#121325)
Makes async_save public

Differential Revision: [D54593610](https://our.internmc.facebook.com/intern/diff/D54593610/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121325
Approved by: https://github.com/wz337
ghstack dependencies: #121317
2024-03-08 05:13:13 +00:00
Lucas Pasqualin
9d5dea7812 [DCP] Adds storage reader and planner classes for online loading/sharding of models in torch.save format (#119816)
as title

Differential Revision: [D53718041](https://our.internmc.facebook.com/intern/diff/D53718041/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119816
Approved by: https://github.com/fegin
2024-03-01 00:21:05 +00:00
Lucas Pasqualin
1c1028ac49 [DCP] Adds utility for converting torch save to dcp (#119815)
as title

Differential Revision: [D53718040](https://our.internmc.facebook.com/intern/diff/D53718040/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119815
Approved by: https://github.com/fegin
ghstack dependencies: #119813, #119814
2024-02-22 17:22:11 +00:00
Lucas Pasqualin
1ab441a7dd [DCP] Adds utility for converting dcp to torch save format (#119814)
as title

Differential Revision: [D53718042](https://our.internmc.facebook.com/intern/diff/D53718042/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119814
Approved by: https://github.com/fegin
ghstack dependencies: #119813
2024-02-22 16:55:58 +00:00
Lucas Pasqualin
ff8e33556e Enables load balancing duplicates in DCP (#116469)
Enables the deduplication of saved entries by load balancing duplicates across ranks.

Tested with existing and modified tests. Additionally tested with the following code snippet, which saves a 20GB DDP model in **~3 seconds on 8 ranks**.  Before this PR, the same operation has been measured at ~19 seconds.

```
def run(local_rank, world_size, param_size, num_params, work_dir):

    os.environ["RANK"] = str(local_rank)
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)
    dist.init_process_group(backend="nccl", rank=local_rank, world_size=world_size)

    model = Model(param_size=param_size, num_params=num_params)
    model = DistributedDataParallel(model, gradient_as_bucket_view=True)
    _patch_model_state_dict(model)

    sz = sum(t.nelement() * t.element_size() for t in model.parameters())
    rank_0_print(f"Model size: {sz / 1_000_000_000.0} GB")
    rank_0_print("Saving the model with DCP...")

    checkpointer = _FileSystemCheckpointer(
        f"{args.work_dir}/dcp",
        sync_files=False,
        single_file_per_rank=False,
        thread_count=1
    )

    begin_ts = time.monotonic()
    checkpointer.save(state_dict={"model": model})
    end_ts = time.monotonic()
    rank_0_print(f"Took {end_ts - begin_ts} seconds with DCP")
```

Differential Revision: [D52435926](https://our.internmc.facebook.com/intern/diff/D52435926/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116469
Approved by: https://github.com/fegin, https://github.com/wz337
2024-01-26 22:34:14 +00:00
Lucas Pasqualin
b342286646 adds async save, makes checkpointer private (#116293)
Adds Async Save and also makes `Checkpointer` classes private.

The original PR was here: https://github.com/pytorch/pytorch/pull/115864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116293
Approved by: https://github.com/fegin
2023-12-22 05:22:39 +00:00
Lucas Pasqualin
5432088098 Adds Checkpointer Wrapper for DCP [3/N] (#114603)
Adds a useful high level wrapper for calling `dist.save/load` with the correct storage readers and writers.

Instead of doing:

```
DCP.save(
    state_dict={...},
    storage_writer=StorageWriter(...)
)

DCP.load(
    state_dict={...},
    storage_reader=StorageReader(...)
)
```

We can now do:

```
checkpointer = Checkpointer(...)

checkpointer.save(state_dict={...})
checkpointer.load(state_dict={...})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114603
Approved by: https://github.com/fegin, https://github.com/wz337
2023-12-08 01:03:21 +00:00
Lucas Pasqualin
f073dcd4f7 Stateful Checkpointing for Distributed [1/N] (#113867)
First pass at adding a save/load API, as well as definition of Stateful objects.

Amongst a couple todo's, we still need to explore adding an `all_gather` & potentially a `barrier` while iterating through state keys.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113867
Approved by: https://github.com/fegin, https://github.com/wz337
2023-12-01 19:21:03 +00:00
Chien-Chin Huang
9d0c3e21d0 [state_dict][9/N] Add get and set APIs for model and optimizer state_dict (#112203)
The original get_state_dict and set_state_dict pair is too complicated because of the possible combinations of usages. This PR adds the APIs to get/set model_state_dict and optimizer_state_dict seperately.

Differential Revision: [D50713584](https://our.internmc.facebook.com/intern/diff/D50713584/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112203
Approved by: https://github.com/wz337
ghstack dependencies: #112167
2023-11-02 22:03:57 +00:00
Chien-Chin Huang
19a6487ad4 [state_dict][6/N] Change API names to avoid conflict and simplify the API signatures (#111120)
`state_dict` is a very common variable name people use to represent a local
state_dict and `load_state_dict` conflicts with DCP's `load_state_dict`.

This PR changes `state_dict` to `get_state_dict`. `get_state_dict` is more close to what is this API does -- users use the API to get the current state_dict for saving or for loading (passed to DCP for loading in-place)..

This PR also changes `load_state_dict` to `set_state_dict`. `set_state_dict` is less ideal compared to `get_state_dict` but is symetric. We can still change the API name before it goes to beta.

This PR also simplies the API signatures. `model_only` is removed and `optim_only` only exists for `get_state_dict`.

Differential Revision: [D50213931](https://our.internmc.facebook.com/intern/diff/D50213931/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111120
Approved by: https://github.com/wz337
ghstack dependencies: #111106, #111107, #111275, #111109, #111110
2023-10-17 00:15:31 +00:00
Chien-Chin Huang
7c67139e7b [state_dict][3/N] Cleanup StateDictOptions, make it more readable (#111275)
This is a reland PR for https://github.com/pytorch/pytorch/pull/111108 with the proper docstring fix.

1. Rename DistributedStateDictOptions to StateDictOptions.
2. Remove cpu_offload as we have not yet required this option.
3. Rename save_frozen_parameters to ignore_frozen_params.

Differential Revision: [D50294352](https://our.internmc.facebook.com/intern/diff/D50294352/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111275
Approved by: https://github.com/wz337
ghstack dependencies: #111106, #111107
2023-10-14 15:34:52 +00:00
Svetlana Karslioglu
d425da8bf3 Replace master with main in links and docs/conf.py (#100176)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100176
Approved by: https://github.com/albanD, https://github.com/malfet
2023-05-02 18:20:32 +00:00
Iris
a7698a8260 [DCP] Add DCP FSDP sharded_state_dict checkpoint example to DCP .rst file (#95517)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95517
Approved by: https://github.com/kumpera
2023-03-03 18:09:10 +00:00
Rodrigo Kumpera
9e56378ef2 Add documentation for DCP. (#92813)
This populates the website with some basic documentation.

It's far from ideal as we should include some basic usage example.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92813
Approved by: https://github.com/wz337
2023-01-24 17:21:51 +00:00
Iris
aee96bbf5a [PT-D][Checkpointing] Move distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint (#88698)
Context in RFC: https://github.com/pytorch/pytorch/issues/86620

.rst file will be finalized in subsequent PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88698
Approved by: https://github.com/wanchaol
2022-11-16 21:06:38 +00:00