Commit Graph

3 Commits

Author SHA1 Message Date
Teja
dd3e7170c2 Add async checkpointing impl to experimental checkpointer and add a builder API (#156927)
1. Adds an AsyncCheckpointer with out-of-process checkpointing and state_dict_stager with shared memory, pinned memory and Zero Overhead Support.

2. Adds two convenience functions to create sync/async checkpointers.
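
The stage-then-write idea behind async checkpointing can be sketched roughly as below. This is a toy under stated assumptions, not the code added by this commit: `ToyAsyncCheckpointer`, `_write` and `demo` are hypothetical names, a background thread stands in for the out-of-process writer, and `copy.deepcopy` stands in for the shared/pinned-memory stager.

```python
import copy
import os
import pickle
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor


def _write(path, staged):
    # Runs off the caller's thread: write to a temp file, then rename atomically.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(staged, f)
    os.replace(tmp, path)
    return path


class ToyAsyncCheckpointer:
    """Hypothetical stand-in, not the AsyncCheckpointer from this commit."""

    def __init__(self):
        # Assumption: a single worker thread replaces the real out-of-process writer.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def save(self, state_dict, path) -> Future:
        # Staging: snapshot the state so later mutations by the training
        # loop do not leak into the checkpoint being written.
        staged = copy.deepcopy(state_dict)
        return self._pool.submit(_write, path, staged)

    def close(self):
        self._pool.shutdown(wait=True)


def demo():
    ckpt = ToyAsyncCheckpointer()
    state = {"step": 1, "weights": [0.1, 0.2]}
    path = os.path.join(tempfile.mkdtemp(), "rank0.pkl")
    future = ckpt.save(state, path)
    state["step"] = 2  # training continues while the write is in flight
    future.result()    # block only when the checkpoint must be durable
    ckpt.close()
    with open(path, "rb") as f:
        return pickle.load(f)
```

The caller gets a future back immediately and waits on it only when it needs the checkpoint to be on disk.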

Differential Revision: [D77336833](https://our.internmc.facebook.com/intern/diff/D77336833/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156927
Approved by: https://github.com/pradeepfn
2025-07-03 22:49:20 +00:00
Teja
ef6dfa06a9 Create a base Checkpointer and SyncCheckpointer and add dist barrier impl and (#156926)
In preparation for adding async checkpointing, this diff:
1. Changes Checkpointer to an abstract base class and adds a sync checkpointer implementation.
2. Adds torch.distributed.barrier() as one of the barrier choices.
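
A minimal sketch of an abstract base with a synchronous implementation and a pluggable barrier might look like this. All names here (`BaseCheckpointer`, `ToySyncCheckpointer`, `roundtrip`) are hypothetical and the pickle-based format is an assumption; the actual classes and signatures in the PR may differ.

```python
import os
import pickle
import tempfile
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, Optional


class BaseCheckpointer(ABC):
    """Hypothetical abstract interface shared by sync and async variants."""

    @abstractmethod
    def save(self, state_dict: Dict[str, Any], path: str) -> None: ...

    @abstractmethod
    def load(self, path: str) -> Dict[str, Any]: ...


class ToySyncCheckpointer(BaseCheckpointer):
    def __init__(self, barrier: Optional[Callable[[], None]] = None):
        # Assumption: the barrier is any zero-argument callable.
        self._barrier = barrier

    def save(self, state_dict, path):
        with open(path, "wb") as f:
            pickle.dump(state_dict, f)
        # Wait for all ranks to finish writing before returning.
        if self._barrier is not None:
            self._barrier()

    def load(self, path):
        with open(path, "rb") as f:
            return pickle.load(f)


def roundtrip():
    calls = []
    # A recording lambda stands in for a real collective barrier.
    ckpt = ToySyncCheckpointer(barrier=lambda: calls.append("barrier"))
    path = os.path.join(tempfile.mkdtemp(), "rank0.pkl")
    ckpt.save({"step": 3}, path)
    return ckpt.load(path), calls
```

In a job with an initialized process group, `torch.distributed.barrier` could be passed as the `barrier` callable in place of the recording lambda.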

Differential Revision: [D77341314](https://our.internmc.facebook.com/intern/diff/D77341314/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156926
Approved by: https://github.com/pradeepfn
2025-06-28 02:48:29 +00:00
Teja Rao
d797038ea9 [dcp_poc] Introduce a new simple rank local checkpointer (#156142)
Summary:
Adds an experimental implementation for a rank local checkpointer with save and load with partial load, blind load and in-place load.
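
One possible reading of the three load modes, sketched with hypothetical helpers (`save_rank_local`, `load_rank_local`, `demo_loads`) and a pickle file per rank. The interpretation is an assumption: "blind" is taken to mean loading without supplying a target state dict, "partial" loading only selected keys, and "in-place" writing into a caller-provided dict.

```python
import os
import pickle
import tempfile


def save_rank_local(state_dict, directory, rank):
    # Each rank writes its own file; no cross-rank coordination.
    path = os.path.join(directory, f"rank_{rank}.pkl")
    with open(path, "wb") as f:
        pickle.dump(state_dict, f)
    return path


def load_rank_local(directory, rank, keys=None, into=None):
    path = os.path.join(directory, f"rank_{rank}.pkl")
    with open(path, "rb") as f:
        saved = pickle.load(f)
    if keys is not None:
        # Partial load: keep only the requested entries.
        saved = {k: saved[k] for k in keys}
    if into is None:
        # Blind load: the caller supplies no target; return what was saved.
        return saved
    # In-place load: update the caller's dict rather than returning a new one.
    into.update(saved)
    return into


def demo_loads():
    directory = tempfile.mkdtemp()
    save_rank_local(
        {"model": [1.0, 2.0], "optim": {"lr": 0.1}, "step": 7}, directory, rank=0
    )
    blind = load_rank_local(directory, rank=0)
    partial = load_rank_local(directory, rank=0, keys=["step"])
    target = {"model": None, "extra": "kept"}
    returned = load_rank_local(directory, rank=0, keys=["model"], into=target)
    return blind, partial, target, returned is target
```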

This uses a new API and a simpler format.

Async checkpointing, an IO layer, pluggable storage backends, layout customization, resharding, deduplication, etc. are planned but not yet implemented.

Test Plan: unit tests

Reviewed By: saumishr

Differential Revision: D75426560

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156142
Approved by: https://github.com/saumishr
2025-06-25 01:19:40 +00:00