9 Commits

Author SHA1 Message Date
Rohit Singh Rathaur
2bcd892c86 [distributed] Replace assert statements in distributed checkpoint with explicit checks (#165256)
Partially fixes #164878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165256
Approved by: https://github.com/albanD
2025-10-17 20:14:35 +00:00
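
The idea behind replacing `assert` statements with explicit checks can be illustrated with a minimal sketch. Python drops `assert` statements when run with `-O`, so a validation written as an assert can silently disappear, whereas an explicit `if` plus `raise` always runs. The function names and the length check below are hypothetical and are not taken from the distributed checkpoint code.

```python
from typing import Sequence


def check_lengths_assert(offsets: Sequence[int], sizes: Sequence[int]) -> None:
    # Hypothetical "before" form: this check is removed under `python -O`.
    assert len(offsets) == len(sizes), "offsets and sizes must have equal length"


def check_lengths_explicit(offsets: Sequence[int], sizes: Sequence[int]) -> None:
    # Hypothetical "after" form: the check always runs and raises a
    # specific exception type that callers can catch.
    if len(offsets) != len(sizes):
        raise ValueError(
            f"offsets and sizes must have equal length, "
            f"got {len(offsets)} and {len(sizes)}"
        )
```

Which exception type the actual change raises is not stated in the commit message; `ValueError` is used here only as an illustration.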
Saiteja Samudrala
2796f31b5e [DCP] OSS Zero Overhead Checkpointing Implementation (#156207)
Summary: This diff updates DCP driver code/APIs to support Zero Overhead Checkpointing

Test Plan: Test with TorchTitan on this PR: https://github.com/pytorch/torchtitan/pull/1287

Differential Revision: D72391401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156207
Approved by: https://github.com/teja-rao
2025-06-29 03:19:48 +00:00
Aaron Orenstein
316808e4e9 PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145163
Approved by: https://github.com/Skylion007
2025-01-19 20:55:59 +00:00
PyTorch MergeBot
0398dc9e8e Revert "[DCP] Makes fsspec public (#121508)"
This reverts commit d482614fec.

Reverted https://github.com/pytorch/pytorch/pull/121508 on behalf of https://github.com/osalpekar due to this causes torchrec tests to fail internally with this error: ModuleNotFoundError: No module named 'fsspec'. see [D54779117](https://www.internalfb.com/diff/D54779117) ([comment](https://github.com/pytorch/pytorch/pull/121508#issuecomment-1992137831))
2024-03-12 17:02:43 +00:00
Lucas Pasqualin
d482614fec [DCP] Makes fsspec public (#121508)
Fixes #118033

Also removes the `_checkpointer.py` class.
Original PRs:
- https://github.com/pytorch/pytorch/pull/121330
- https://github.com/pytorch/pytorch/pull/121329

We're also disabling `test_fsdp` since it is failing on random PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121508
Approved by: https://github.com/fegin
2024-03-09 01:14:18 +00:00
Lucas Pasqualin
96ed37ac13 [DCP] Makes async_save public (#121325)
Makes async_save public

Differential Revision: [D54593610](https://our.internmc.facebook.com/intern/diff/D54593610/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121325
Approved by: https://github.com/wz337
ghstack dependencies: #121317
2024-03-08 05:13:13 +00:00
Lucas Pasqualin
909d73d8cb [DCP] Removes no_dist and coordinator_rank from public DCP API's (#121317)
[DCP] Removes `no_dist` and `coordinator_rank` from public DCP API's

Differential Revision: [D54591181](https://our.internmc.facebook.com/intern/diff/D54591181/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121317
Approved by: https://github.com/fegin
2024-03-08 02:14:12 +00:00
Chien-Chin Huang
644bc69530 [DCP] Allow users to save and load without creating storage reader and writer (#117772)
Right now the DCP API requires users to create a StorageWriter and a StorageReader for every API call. This PR allows users to pass only the checkpointer_id (a path) and use it to read or write a checkpoint without creating a StorageReader or StorageWriter.

Differential Revision: [D52740556](https://our.internmc.facebook.com/intern/diff/D52740556/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117772
Approved by: https://github.com/wz337
ghstack dependencies: #116248
2024-01-26 09:08:35 +00:00
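
The API shape described in this commit message (accept a path, and build a default reader or writer only when none is supplied) can be sketched with a toy, self-contained mock. Everything below is hypothetical: `ToyFileWriter`, `ToyFileReader`, `toy_save`, and `toy_load` are stand-ins invented for illustration and are not the DCP classes or functions. The keyword name `checkpointer_id` simply mirrors the wording of the commit message.

```python
import json
import os
from typing import Optional


class ToyFileWriter:
    """Hypothetical stand-in for a storage writer (not the DCP class)."""

    def __init__(self, path: str) -> None:
        self.path = path

    def write(self, state: dict) -> None:
        os.makedirs(self.path, exist_ok=True)
        with open(os.path.join(self.path, "state.json"), "w") as f:
            json.dump(state, f)


class ToyFileReader:
    """Hypothetical stand-in for a storage reader (not the DCP class)."""

    def __init__(self, path: str) -> None:
        self.path = path

    def read(self) -> dict:
        with open(os.path.join(self.path, "state.json")) as f:
            return json.load(f)


def toy_save(
    state: dict,
    *,
    checkpointer_id: Optional[str] = None,
    storage_writer: Optional[ToyFileWriter] = None,
) -> None:
    # Build a default writer from the path only when none was passed in.
    if storage_writer is None:
        if checkpointer_id is None:
            raise ValueError("pass either checkpointer_id or storage_writer")
        storage_writer = ToyFileWriter(checkpointer_id)
    storage_writer.write(state)


def toy_load(
    *,
    checkpointer_id: Optional[str] = None,
    storage_reader: Optional[ToyFileReader] = None,
) -> dict:
    # Same defaulting rule on the read side.
    if storage_reader is None:
        if checkpointer_id is None:
            raise ValueError("pass either checkpointer_id or storage_reader")
        storage_reader = ToyFileReader(checkpointer_id)
    return storage_reader.read()
```

With this shape, a caller who only has a directory path can write `toy_save(state, checkpointer_id=path)` and `toy_load(checkpointer_id=path)`, while a caller with a custom backend can still pass an explicit reader or writer.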
Lucas Pasqualin
b342286646 adds async save, makes checkpointer private (#116293)
Adds Async Save and also makes `Checkpointer` classes private.

The original PR was here: https://github.com/pytorch/pytorch/pull/115864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116293
Approved by: https://github.com/fegin
2023-12-22 05:22:39 +00:00