Saiteja Samudrala
|
2796f31b5e
|
[DCP] OSS Zero Overhead Checkpointing Implementation (#156207)
Summary: This diff updates DCP driver code/APIs to support Zero Overhead Checkpointing
Test Plan: Test with TorchTitan on this PR: https://github.com/pytorch/torchtitan/pull/1287
Differential Revision: D72391401
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156207
Approved by: https://github.com/teja-rao
|
2025-06-29 03:19:48 +00:00 |
|
Aaron Orenstein
|
316808e4e9
|
PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145163
Approved by: https://github.com/Skylion007
|
2025-01-19 20:55:59 +00:00 |
|
PyTorch MergeBot
|
0398dc9e8e
|
Revert "[DCP] Makes fsspec public (#121508)"
This reverts commit d482614fec.
Reverted https://github.com/pytorch/pytorch/pull/121508 on behalf of https://github.com/osalpekar because it causes torchrec tests to fail internally with this error: ModuleNotFoundError: No module named 'fsspec'. See [D54779117](https://www.internalfb.com/diff/D54779117) ([comment](https://github.com/pytorch/pytorch/pull/121508#issuecomment-1992137831))
|
2024-03-12 17:02:43 +00:00 |
|
Lucas Pasqualin
|
d482614fec
|
[DCP] Makes fsspec public (#121508)
Fixes #118033
Also removes the `_checkpointer.py` class.
Original PRs:
- https://github.com/pytorch/pytorch/pull/121330
- https://github.com/pytorch/pytorch/pull/121329
We're also disabling `test_fsdp` since it is failing on unrelated PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121508
Approved by: https://github.com/fegin
|
2024-03-09 01:14:18 +00:00 |
|
Lucas Pasqualin
|
96ed37ac13
|
[DCP] Makes async_save public (#121325)
Makes async_save public
Differential Revision: [D54593610](https://our.internmc.facebook.com/intern/diff/D54593610/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121325
Approved by: https://github.com/wz337
ghstack dependencies: #121317
|
2024-03-08 05:13:13 +00:00 |
|
Lucas Pasqualin
|
909d73d8cb
|
[DCP] Removes no_dist and coordinator_rank from public DCP API's (#121317)
[DCP] Removes `no_dist` and `coordinator_rank` from public DCP API's
Differential Revision: [D54591181](https://our.internmc.facebook.com/intern/diff/D54591181/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121317
Approved by: https://github.com/fegin
|
2024-03-08 02:14:12 +00:00 |
|
Chien-Chin Huang
|
644bc69530
|
[DCP] Allow users to save and load without creating storage reader and writer (#117772)
Right now the DCP API requires users to create a StorageWriter and a StorageReader for every API call. This PR allows users to pass only the checkpoint_id (a path) and use it to read/write a checkpoint without creating a StorageReader and StorageWriter.
Differential Revision: [D52740556](https://our.internmc.facebook.com/intern/diff/D52740556/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117772
Approved by: https://github.com/wz337
ghstack dependencies: #116248
|
2024-01-26 09:08:35 +00:00 |
|
Lucas Pasqualin
|
b342286646
|
adds async save, makes checkpointer private (#116293)
Adds Async Save and also makes `Checkpointer` classes private.
The original PR was here: https://github.com/pytorch/pytorch/pull/115864
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116293
Approved by: https://github.com/fegin
|
2023-12-22 05:22:39 +00:00 |
|