Summary:
DCP metadata collectives become prohibitively expensive as job scale grows. This PR introduces rank-local checkpointing, which saves and loads the checkpoint without any collectives. The trade-off, for now, is that dedupe and re-sharding are unsupported; support for these will be introduced soon.
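A conceptual sketch of the idea (not the API introduced by this PR; the file layout and function names here are hypothetical): each rank serializes only its own shards and reads them back, with no cross-rank exchange at any point.

```python
import os
import torch
import torch.distributed as dist

def rank_local_save(state_dict, checkpoint_dir):
    # Each rank writes only its own file; since no global plan or metadata
    # is exchanged, no collective is issued.
    rank = dist.get_rank() if dist.is_initialized() else 0
    torch.save(state_dict, os.path.join(checkpoint_dir, f"rank_{rank}.pt"))

def rank_local_load(checkpoint_dir):
    # Loading is likewise collective-free: each rank reads back exactly the
    # file it wrote. The trade-off: no dedupe across ranks, and no
    # re-sharding if the world size changes.
    rank = dist.get_rank() if dist.is_initialized() else 0
    return torch.load(os.path.join(checkpoint_dir, f"rank_{rank}.pt"))
```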
Differential Revision: D70112642
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147758
Approved by: https://github.com/meetv18
Summary: We want to add versioning to the DCP metadata so that whenever planner logic changes, we can use the version recorded at save time to determine how to load the data.
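A minimal sketch of version-gated loading (the field and function names are assumptions, not the actual DCP fields):

```python
from dataclasses import dataclass

@dataclass
class CheckpointMetadata:
    # Hypothetical version field, bumped whenever planner logic changes.
    version: int = 1

def choose_load_path(metadata: CheckpointMetadata) -> str:
    # Branch on the version recorded at save time so newer loaders can
    # still interpret checkpoints written under older planner logic.
    if metadata.version >= 2:
        return "current planner layout"
    return "legacy planner layout"
```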
Test Plan:
added a test
Rollback Plan:
Differential Revision: D76135887
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155343
Approved by: https://github.com/teja-rao
This change adds support for re-sharding HF safetensors checkpoints.
This is done by adding more metadata when saving each file: the metadata captures the size and offset of each saved shard. On load, this information is used to create the chunks belonging to the TensorStorageMetadata class, which enables re-sharding.
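As a sketch under stated assumptions (the exact keys recorded per shard are assumptions; the metadata classes are DCP's real ones), the size/offset information maps onto `TensorStorageMetadata` chunks like this:

```python
import torch
from torch.distributed.checkpoint.metadata import (
    ChunkStorageMetadata,
    TensorProperties,
    TensorStorageMetadata,
)

# Hypothetical per-shard record captured at save time.
shard_info = {"offset": [512, 0], "shape": [512, 1024], "full_shape": [1024, 1024]}

# On load, this is enough to rebuild the chunk list that DCP's planner
# uses to re-shard the tensor.
chunk = ChunkStorageMetadata(
    offsets=torch.Size(shard_info["offset"]),
    sizes=torch.Size(shard_info["shape"]),
)
tensor_md = TensorStorageMetadata(
    properties=TensorProperties(dtype=torch.float32),
    size=torch.Size(shard_info["full_shape"]),
    chunks=[chunk],
)
```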
Differential Revision: [D75226344](https://our.internmc.facebook.com/intern/diff/D75226344/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154519
Approved by: https://github.com/saumishr
Summary: If there is only one safetensors file, users don't need a metadata file; we can construct the metadata from the keys of that file. This is a use case for some HuggingFace models, so this adds support for it.
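A minimal sketch of the idea using the safetensors API (the file name and the shape of the returned index are assumptions):

```python
from safetensors import safe_open

def build_index(path="model.safetensors"):
    # The safetensors header already lists every key and its shape, so a
    # separate metadata/index file is unnecessary for a single-file repo.
    index = {}
    with safe_open(path, framework="pt") as f:
        for key in f.keys():
            index[key] = f.get_slice(key).get_shape()
    return index
```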
Test Plan:
ensure existing tests pass
tested e2e in a notebook
Differential Revision: D72472490
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150701
Approved by: https://github.com/joecummings
Summary: add a param that tells the storage writer how to save tensors. Right now the only options are safetensors and torch.save.
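A hypothetical illustration of the knob (the parameter and enum names are assumptions, not the exact API added here):

```python
from enum import Enum

class SerializationFormat(Enum):
    SAFETENSORS = "safetensors"  # write .safetensors files
    TORCH_SAVE = "torch_save"    # fall back to torch.save pickles

# Hypothetical usage; writer class and kwarg name assumed:
# writer = SomeStorageWriter(path="ckpt/", serialization_format=SerializationFormat.SAFETENSORS)
```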
Test Plan:
```
(lintrunner) [ankitageorge@devgpu003.cco3 /data/users/ankitageorge/fbsource/fbcode/caffe2 (1d57cb27b)]$ buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage
File changed: fbcode//caffe2/torch/distributed/checkpoint/filesystem.py
Buck UI: https://www.internalfb.com/buck2/e80cc963-e34a-4876-b6f4-7ce2794e48dd
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3659174965882569
Network: Up: 32KiB Down: 1.9KiB (reSessionID-ef9fa764-a40a-451b-ab58-08eabe7a9422)
Executing actions. Remaining 0/4 3.4s exec time total
Command: test. Finished 2 local
Time elapsed: 19.6s
Tests finished: Pass 4. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Reviewed By: saumishr
Differential Revision: D70271943
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150025
Approved by: https://github.com/saumishr
Summary: D69984656 caused issues by adding the fsspec dependency to torch.distributed when many internal packages didn't have it. In this diff, I'm not adding HFStorageReader/Writer to __init__.py, so the HFStorage components don't get imported internally and, in turn, no fsspec import happens. I did the removal from __init__.py in D70286926 to fix the failing tests, but the revert was done concurrently. I'll add the classes back to __init__.py once I figure out a better way to get fsspec added as a dependency everywhere.
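In the meantime, the classes can presumably still be imported from their defining submodule rather than from the package root (the module path below is an assumption):

```python
# Not re-exported from torch.distributed.checkpoint's __init__, so import
# directly from the submodule (path assumed):
from torch.distributed.checkpoint._hf_storage import HFStorageReader, HFStorageWriter
```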
Test Plan:
signals pass
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage
Differential Revision: D70324090
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148089
Approved by: https://github.com/saumishr
Summary: We want to write checkpoints in HF format with DCP; this diff enables that for the non-distributed use case.
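A rough usage sketch (the writer's import path and constructor arguments are assumptions; this also assumes the DCP save entry point falls back to a non-distributed path when no process group is initialized):

```python
import torch
import torch.distributed.checkpoint as dcp
# Import path assumed; see the diff for the actual module.
from torch.distributed.checkpoint._hf_storage import HFStorageWriter

state_dict = {"weight": torch.randn(4, 4)}
dcp.save(state_dict, storage_writer=HFStorageWriter(path="ckpt/"))
```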
Test Plan:
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_hf_torchtune_storage
N6476188 --> able to save and load tensor in hf format
Differential Revision: D68444967
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146352
Approved by: https://github.com/saumishr
Summary: When testing, I tried to pass a string argument to the FileSystem class' methods, which is a valid input, but the cast() that was supposed to convert the string to a path is a no-op at runtime, so all the methods failed with a string arg. Instead of a cast, the proper Path constructor should be used.
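The distinction, for reference (this snippet is illustrative, not the diff itself): `typing.cast` only informs the type checker and does nothing at runtime, while the `Path` constructor actually converts.

```python
from pathlib import Path
from typing import cast

p1 = cast(Path, "/tmp/ckpt")  # no-op at runtime: p1 is still a str,
                              # so Path-only methods fail later
p2 = Path("/tmp/ckpt")        # real conversion: p2 is a Path

assert isinstance(p1, str)
assert isinstance(p2, Path)
```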
Test Plan: N6475361 methods don't throw an error with a string arg like they were previously
Differential Revision: D68713937
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145751
Approved by: https://github.com/pradeepfn
The current call passes `['/actual/path']` to os.walk as a string, which points to no existing path and thus silently produces an empty traversal.
There is an unused function just above that handles this correctly, which appears to be what was supposed to be called.
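For reference, os.walk does not raise on a nonexistent root; it simply yields nothing, which is why the bug was silent:

```python
import os

# The stringified list names no real directory, so the walk is empty.
assert list(os.walk("['/actual/path']")) == []
```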
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126103
Approved by: https://github.com/suo
This PR seeks to increase observability of save/load requests. This is accomplished with two main changes:
1. The creation of save_id and load_id:
- A `save_id` and a `load_id` are added to the filesystem writer: `save_id` is re-generated on every save call, and `load_id` is re-generated on every load call.
- Both IDs are stored in a new `StorageMeta` class and saved as part of Metadata (`load_id` is None when we save and is only set during load).
2. A new mechanism in the save path gives the SavePlanner a chance to inspect the `storage_meta` object, mirroring the metadata exchange in the load path. In the load path, `storage_meta` is added to `metadata` so that the LoadPlanner can also access it before loading begins.
*If users now wish to access the checkpoint_id in the SavePlanner, they simply need to access the value in `storage_meta` from the `set_up_planner` call*
*Additionally, users now have a generic way of passing data to the SavePlanner from the StorageWriter at the start of the save path, similar to the load path*
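A sketch of such a planner, assuming the post-PR `set_up_planner` signature and the `StorageMeta.checkpoint_id` field described above:

```python
from torch.distributed.checkpoint.default_planner import DefaultSavePlanner

class IdAwareSavePlanner(DefaultSavePlanner):
    def set_up_planner(self, state_dict, storage_meta=None, is_coordinator=False):
        # storage_meta now arrives before planning starts, so the planner
        # can read the writer-provided checkpoint_id (and save_id) here.
        if storage_meta is not None:
            print(f"saving checkpoint_id={storage_meta.checkpoint_id}")
        super().set_up_planner(state_dict, storage_meta, is_coordinator)
```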
This PR has been tested for backwards compatibility -- meaning any checkpoints saved before this PR can continue being loaded after this PR.
One major consideration is that there is limited forwards compatibility: if a checkpoint is generated _after_ this PR, there is no support for loading it with older torch versions. This raises a fairly important point: since we expect the metadata object (which is saved to disk) to continue evolving, and we want to support forwards compatibility, we are exploring patching `pickle` so we can at least add new members to `metadata` without breaking older loaders.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124772
Approved by: https://github.com/fegin
Summary:
This diff has no logic changes. It updates the variable names to be in sync with the names used in prepare_global_plan in StorageWriter. Pasting the func signature for easy reference:
```python
@abc.abstractmethod
def prepare_global_plan(self, plans: List[SavePlan]) -> List[SavePlan]:
    ...
```
Differential Revision: D56480396
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124770
Approved by: https://github.com/fegin
Let us try to remove this warning 😄 :
```
[rank0]:/data/users/andgu/pytorch/torch/distributed/checkpoint/filesystem.py:150: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
[rank0]: if tensor.storage().size() != tensor.numel():
```
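One deprecation-free way to express the same check (an assumption about the fix, not necessarily the exact change in this PR): the untyped storage reports bytes rather than elements, so scale `numel()` by the element size.

```python
import torch

tensor = torch.randn(8)

# Equivalent to the old `tensor.storage().size() != tensor.numel()` check,
# but via the non-deprecated untyped storage (bytes, not elements).
storage_mismatch = (
    tensor.untyped_storage().nbytes() != tensor.numel() * tensor.element_size()
)
```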
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121538
Approved by: https://github.com/wz337, https://github.com/fegin
The overlapping CPU loader causes a major drop in performance when used with multiple threads. This PR is a temporary fix while we investigate why this is the case.
Benchmarks for save, using a 7.25GB FSDP model, as per the TSS benchmark. Both benchmarks run on 8 ranks.
Before this PR: 9.475 s (8 threads)
After this PR: 1.632 s (8 threads)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118114
Approved by: https://github.com/wz337, https://github.com/fegin