15 Commits

Author SHA1 Message Date
Rohit Singh Rathaur
2bcd892c86 [distributed] Replace assert statements in distributed checkpoint with explicit checks (#165256)
Partially fixes #164878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165256
Approved by: https://github.com/albanD
2025-10-17 20:14:35 +00:00
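
The change described in the entry above (swapping `assert` statements for explicit checks) follows a common pattern, sketched here. The function name, its parameters, and the choice of `AssertionError` are assumptions for illustration only; none of them is taken from the PR.

```python
def check_rank(rank: int, world_size: int) -> None:
    """Hypothetical validation helper; not a function from torch."""
    # Before: `assert 0 <= rank < world_size`
    # A bare assert is stripped when Python runs with -O, so the check vanishes.
    # After: an explicit check that always runs and carries a clear message.
    if not 0 <= rank < world_size:
        raise AssertionError(
            f"rank {rank} is out of range for world size {world_size}"
        )
```

The explicit form behaves the same with or without `python -O`, which is the usual motivation for this kind of cleanup.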
Maggie Moss
7457d139c5 Add pyrefly suppressions to torch/distributed (7/n) (#165002)
Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

One more PR after this one.

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:
INFO 0 errors (6,884 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165002
Approved by: https://github.com/oulgen
2025-10-09 04:08:25 +00:00
Aaron Orenstein
316808e4e9 PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145163
Approved by: https://github.com/Skylion007
2025-01-19 20:55:59 +00:00
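
For readers unfamiliar with PEP 585, the update named in the entry above replaces `typing.List`, `typing.Dict` and similar aliases with the built-in generic types (Python 3.9+). A minimal sketch, with a hypothetical function that does not appear in the PR:

```python
from typing import Optional

# Before PEP 585 the signature would have needed `from typing import Dict, List`:
#     def shard_sizes(names: List[str], default: Optional[int] = None) -> Dict[str, int]

def shard_sizes(names: list[str], default: Optional[int] = None) -> dict[str, int]:
    """Hypothetical helper: map each name to a default size."""
    size = 0 if default is None else default
    return {name: size for name in names}
```

Only the annotations change; runtime behavior is identical.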
Adrian Wälchli
ad314a2f05 Pass torch.load(weights_only=) internally to avoid FutureWarning (#130663)
Fixes #130658

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130663
Approved by: https://github.com/malfet, https://github.com/LucasLLC
2024-07-16 01:24:38 +00:00
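
The idea in the entry above can be shown without torch: when a callee warns that its default will change, an internal caller can pass the argument explicitly so that users never see the warning. In this sketch `load` is a hypothetical stand-in, not `torch.load`, and the value `False` is an assumption rather than the value the PR passes.

```python
import warnings


def load(path: str, weights_only=None) -> dict:
    """Hypothetical loader whose default is scheduled to change."""
    if weights_only is None:
        warnings.warn(
            "the default value of weights_only will change",
            FutureWarning,
            stacklevel=2,
        )
        weights_only = False
    return {"path": path, "weights_only": weights_only}


def read_checkpoint(path: str) -> dict:
    """Internal caller: passes the argument explicitly, so no warning fires."""
    return load(path, weights_only=False)
```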
cdzhan
a89a1ed072 [easy][DCP] make BroadcastingTorchSaveReader device generic (#129231)
The tests in test/distributed/checkpoint/test_format_utils.py pass on GPU and other devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129231
Approved by: https://github.com/fegin
2024-06-26 02:37:30 +00:00
Aaron Orenstein
3a0d088517 Flip default value for mypy disallow_untyped_defs [5/11] (#127842)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127842
Approved by: https://github.com/oulgen
2024-06-08 18:49:18 +00:00
Kevin Yin
534c34b320 Fix copy-pasted docs, reversing the load and save description (#125993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125993
Approved by: https://github.com/kwen2501, https://github.com/fegin
2024-05-14 21:14:16 +00:00
Lucas Pasqualin
bb6ba31250 [DCP] Adds storage metadata, and passes it during the save path (#124772)
This PR seeks to increase observability of save/load requests. This is accomplished with two main changes:

1. The creation of save_id and load_id:
    - A `save_id` and a `load_id` are added to the filesystem writer. `save_id` is regenerated on every save call, and `load_id` is regenerated on every load call.
    - Both IDs are stored in a new `StorageMeta` class and saved as part of Metadata. (`load_id` is None when we save, and is only set during load.)

2. A new mechanism is implemented in the save path which gives the SavePlanner a chance to inspect the `storage_meta` object. The mechanism mirrors the same metadata exchange in the load path. In the load path, `storage_meta` is added to `metadata` such that the LoadPlanner can also access `storage_meta` before we begin loading.

*If users now wish to access the checkpoint_id in the SavePlanner, they simply need to read the value from `storage_meta` in the `set_up_planner` call*

*Additionally, users now have a generic way of passing data to the SavePlanner from the StorageWriter at the start of the save path, similar to the load path*

This PR has been tested for backwards compatibility -- meaning any checkpoints saved before this PR can continue being loaded after this PR.

One major consideration is that there is limited forwards compatibility. If a checkpoint is generated _past_ this PR, there is no support for loading it using older torch versions. This brings up a fairly important point: since we expect the metadata object (which is saved to the disk) to continue evolving, and we want to support forwards compatibility, we explore patching `pickle` so we can at least add new members to `metadata` and maintain fwd compat.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124772
Approved by: https://github.com/fegin
2024-05-07 23:53:53 +00:00
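
The save-path exchange described in the entry above can be sketched with plain Python. This is a simplified stand-in, not the torch implementation: `SketchWriter`, `SketchPlanner`, `save`, and all method signatures are hypothetical, and only the names `StorageMeta`, `save_id`, `load_id`, `checkpoint_id`, and `set_up_planner` come from the commit message.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageMeta:
    """Simplified stand-in for the metadata the writer hands to the planner."""
    checkpoint_id: Optional[str] = None
    save_id: Optional[str] = None
    load_id: Optional[str] = None  # stays None on the save path


class SketchWriter:
    """Hypothetical writer that regenerates save_id on every save call."""

    def __init__(self, checkpoint_id: str) -> None:
        self.checkpoint_id = checkpoint_id
        self.save_id: Optional[str] = None

    def reset(self) -> None:
        self.save_id = str(uuid.uuid4())

    def storage_meta(self) -> StorageMeta:
        return StorageMeta(checkpoint_id=self.checkpoint_id, save_id=self.save_id)


class SketchPlanner:
    """Hypothetical planner that inspects storage_meta during setup."""

    def set_up_planner(self, state_dict: dict, storage_meta: Optional[StorageMeta] = None) -> None:
        self.state_dict = state_dict
        self.checkpoint_id = storage_meta.checkpoint_id if storage_meta else None


def save(state_dict: dict, writer: SketchWriter, planner: SketchPlanner) -> StorageMeta:
    writer.reset()                       # fresh save_id for this call
    meta = writer.storage_meta()
    planner.set_up_planner(state_dict, meta)  # planner sees the writer's metadata
    return meta
```

Calling `save` twice with the same writer yields two different `save_id` values, while `load_id` remains `None` until a load takes place.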
Lucas Pasqualin
18c9d46068 Fixes format utils executable (#123407)
Fixes an issue with the format utils executable, which was causing it to run as a no-op. :(

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123407
Approved by: https://github.com/wz337, https://github.com/fegin
2024-04-05 03:53:22 +00:00
Lucas Pasqualin
bcb6e5aa72 [DCP] Support partial load (#122829)
Adds ability to load a subset of keys directly from a checkpoint, avoiding the need to initialize state dict first

Differential Revision: [D55441391](https://our.internmc.facebook.com/intern/diff/D55441391/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122829
Approved by: https://github.com/fegin
2024-04-02 19:22:22 +00:00
Lucas Pasqualin
909d73d8cb [DCP] Removes no_dist and coordinator_rank from public DCP API's (#121317)
[DCP] Removes `no_dist` and `coordinator_rank` from public DCP API's

Differential Revision: [D54591181](https://our.internmc.facebook.com/intern/diff/D54591181/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121317
Approved by: https://github.com/fegin
2024-03-08 02:14:12 +00:00
Lucas Pasqualin
eb1145436a [DCP] Adds main in format utils (#120128)
Adds main in format utils. Usage:

`python -m torch.distributed.checkpoint.format_utils dcp_to_torch dcp_dir torch_file.pt`

or

`python -m torch.distributed.checkpoint.format_utils torch_to_dcp torch_file.pt dcp_dir`

Differential Revision: [D53791355](https://our.internmc.facebook.com/intern/diff/D53791355/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120128
Approved by: https://github.com/fegin, https://github.com/wz337
2024-03-07 01:18:17 +00:00
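
The command-line shape shown in the entry above (a mode followed by a source and a destination) could be parsed roughly as follows. This is a sketch using only `argparse`; the parser name, argument names, and the placeholder return value are assumptions, and the real conversion calls are deliberately left out.

```python
import argparse

# The two modes named in the commit message.
MODES = ("dcp_to_torch", "torch_to_dcp")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="format_utils_sketch")
    parser.add_argument("mode", choices=MODES, help="conversion direction")
    parser.add_argument("src", help="source checkpoint path")
    parser.add_argument("dst", help="destination checkpoint path")
    return parser


def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    # Placeholder: a real tool would dispatch to the matching converter here.
    return f"{args.mode}: {args.src} -> {args.dst}"
```

For example, `main(["dcp_to_torch", "dcp_dir", "torch_file.pt"])` parses the same arguments as the first usage line in the commit message. A real script would also need to call `main()` when run as a module.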
Lucas Pasqualin
9d5dea7812 [DCP] Adds storage reader and planner classes for online loading/sharding of models in torch.save format (#119816)
as title

Differential Revision: [D53718041](https://our.internmc.facebook.com/intern/diff/D53718041/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119816
Approved by: https://github.com/fegin
2024-03-01 00:21:05 +00:00
Lucas Pasqualin
1c1028ac49 [DCP] Adds utility for converting torch save to dcp (#119815)
as title

Differential Revision: [D53718040](https://our.internmc.facebook.com/intern/diff/D53718040/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119815
Approved by: https://github.com/fegin
ghstack dependencies: #119813, #119814
2024-02-22 17:22:11 +00:00
Lucas Pasqualin
1ab441a7dd [DCP] Adds utility for converting dcp to torch save format (#119814)
as title

Differential Revision: [D53718042](https://our.internmc.facebook.com/intern/diff/D53718042/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119814
Approved by: https://github.com/fegin
ghstack dependencies: #119813
2024-02-22 16:55:58 +00:00