Summary: We want to write checkpoints in HF format with DCP; this diff enables that for the non-distributed use case.
Test Plan:
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_hf_torchtune_storage
N6476188 --> able to save and load a tensor in HF format
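A hedged sketch of the intended usage. The storage-writer class name and import path below are guesses inferred from the test target, not confirmed API:
```python
import torch
import torch.distributed.checkpoint as dcp

# Hypothetical name and module, inferred from the test target above;
# the real class added by this diff may differ.
from torch.distributed.checkpoint._hf_storage import _HuggingFaceStorageWriter

state_dict = {"model.embed_tokens.weight": torch.randn(8, 8)}
dcp.save(
    state_dict=state_dict,
    storage_writer=_HuggingFaceStorageWriter(path="/tmp/hf_ckpt"),
)
```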
Differential Revision: D68444967
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146352
Approved by: https://github.com/saumishr
Summary: When testing, I tried passing a string argument to the FileSystem class' methods, which is a valid input, but the `cast()` that was supposed to convert the string to a path does nothing at runtime, so all of the methods failed with a string arg. A proper constructor should be used instead of a cast.
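For illustration, a minimal sketch of the failure mode: `typing.cast` only informs the type checker and performs no conversion at runtime, whereas the `Path` constructor actually builds the object:
```python
from pathlib import Path
from typing import cast

p = cast(Path, "/tmp/checkpoint")
print(isinstance(p, Path))  # False -- cast() is a runtime no-op; p is still a str

p = Path("/tmp/checkpoint")
print(isinstance(p, Path))  # True -- the constructor builds a real Path
```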
Test Plan: N6475361 -- the methods no longer throw an error with a string arg as they did previously
Differential Revision: D68713937
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145751
Approved by: https://github.com/pradeepfn
The current call passes `['/actual/path']` to `os.walk`; that is a string pointing to no real path, so the traversal is silently empty.
There is an unused function just above that handles this correctly, which is presumably what was meant to be called.
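To illustrate why this went unnoticed: `os.walk` raises no error for a nonexistent path, it simply yields nothing:
```python
import os

# A stringified list is not a valid path; os.walk swallows the lookup
# error (onerror=None by default) and produces an empty traversal.
print(list(os.walk("['/actual/path']")))  # []
```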
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126103
Approved by: https://github.com/suo
This PR seeks to increase observability of save/load requests. This is accomplished with two main changes:
1. The creation of save_id and load_id:
- a `save_id` and a `load_id` are added to the filesystem writer. `save_id` is regenerated on every save call, and `load_id` is regenerated on every load call.
- both IDs are stored in a new `StorageMeta` class and saved as part of `Metadata`. (`load_id` is `None` when we save, and only set during load.)
2. A new mechanism is implemented in the save path which gives the SavePlanner a chance to inspect the `storage_meta` object. The mechanism mirrors the same metadata exchange in the load path. In the load path, `storage_meta` is added to `metadata` such that the LoadPlanner can also access `storage_meta` before we begin loading.
*If users now wish to access the checkpoint_id in the SavePlanner, they simply need to access the value in `storage_meta` from the `set_up_planner` call, as sketched below.*
*Additionally, users now have a generic way of passing data to the SavePlanner from the StorageWriter at the start of the save path, similar to the load path*
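A minimal sketch of that access pattern, assuming the post-PR `set_up_planner` signature takes an optional `storage_meta` and that `StorageMeta` exposes `checkpoint_id`:
```python
from torch.distributed.checkpoint.default_planner import DefaultSavePlanner


class IdAwareSavePlanner(DefaultSavePlanner):
    def set_up_planner(self, state_dict, storage_meta=None, is_coordinator=False):
        # storage_meta is populated by the StorageWriter at the start of the
        # save path, so the planner can inspect it before planning begins.
        if storage_meta is not None:
            print("checkpoint_id:", storage_meta.checkpoint_id)
        super().set_up_planner(state_dict, storage_meta, is_coordinator)
```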
This PR has been tested for backwards compatibility -- meaning any checkpoints saved before this PR can continue being loaded after this PR.
One major consideration is that there is limited forwards compatibility. If a checkpoint is generated _past_ this PR, there is no support for loading it using older torch versions. This brings up a fairly important point: since we expect the metadata object (which is saved to the disk) to continue evolving, and we want to support forwards compatibility, we explore patching `pickle` so we can at least add new members to `metadata` and maintain fwd compat.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124772
Approved by: https://github.com/fegin
Summary:
This diff has no logic changes. It updates the variable names to be in sync with the names used in `prepare_global_plan` in `StorageWriter`. Pasting the function signature for easy reference:
```python
@abc.abstractmethod
def prepare_global_plan(self, plans: List[SavePlan]) -> List[SavePlan]:
    ...
```
Differential Revision: D56480396
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124770
Approved by: https://github.com/fegin
Let us try to remove this warning 😄 :
```
[rank0]:/data/users/andgu/pytorch/torch/distributed/checkpoint/filesystem.py:150: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
[rank0]: if tensor.storage().size() != tensor.numel():
```
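One way to silence it, a sketch assuming the check only needs the storage size: `untyped_storage().size()` reports bytes rather than elements, so the comparison must scale by the element size.
```python
import torch

tensor = torch.randn(4, 4)

# Equivalent check without TypedStorage: untyped_storage().size() is in
# bytes, so scale numel() by element_size() before comparing.
tensor_fills_storage = (
    tensor.untyped_storage().size() == tensor.numel() * tensor.element_size()
)
```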
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121538
Approved by: https://github.com/wz337, https://github.com/fegin
The OverlappingCPU loader causes a major drop in performance when used with multiple threads. This PR is a temporary fix while we investigate why this is the case.
Benchmarks for save, using a 7.25 GB FSDP model, as per the TSS benchmark. Both benchmarks run on 8 ranks.

| | Save time | Threads |
|---|---|---|
| Before this PR | 9.475 s | 8 |
| After this PR | 1.632 s | 8 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118114
Approved by: https://github.com/wz337, https://github.com/fegin
Adds a useful high-level wrapper for calling `DCP.save/load` with the correct storage readers and writers.
Instead of doing:
```
DCP.save(
state_dict={...},
storage_writer=StorageWriter(...)
)
DCP.load(
state_dict={...},
storage_reader=StorageReader(...)
)
```
We can now do:
```
checkpointer = Checkpointer(...)
checkpointer.save(state_dict={...})
checkpointer.load(state_dict={...})
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114603
Approved by: https://github.com/fegin, https://github.com/wz337
Applies PLW0108, which removes useless lambdas in Python. The rule is in preview, so it is not yet ready to be enabled by default; these are the autofixes from the rule.
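For reference, a tiny illustrative example of the pattern the rule rewrites:
```python
values = ["10", "2", "33"]

# Before: a useless lambda that merely forwards its argument
sorted(values, key=lambda v: int(v))

# After the autofix: pass the callable directly
sorted(values, key=int)
```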
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113602
Approved by: https://github.com/albanD
Fixes: #113193
`pydocstyle <all_files_in_issue> --count`
- Before: 345
- After: 130
For deprecated methods, I have added a `noqa` to ignore them. I was not able to find the file `torch/distributed/tensor/parallel/multihead_attention_tp.py`, so I have ignored it for this PR.
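An illustrative sketch of both kinds of fix (the function names here are made up for the example):
```python
def legacy_save():  # noqa: D103 -- deprecated function, docstring check skipped
    ...


def save_state_dict() -> None:
    """Save the state dict to storage.

    pydocstyle's D2xx checks want a one-line summary, a blank line,
    and then any further detail.
    """
```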
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113241
Approved by: https://github.com/kit1980
I propose some changes so that `FileSystemReader` and `FileSystemWriter` can be used on other file systems. The user only needs to provide `path` as a subclass of `Path` that overrides the necessary interfaces.
For example, one can utilize `tf.io.gfile` to implement an interface to save to or load from HDFS. The following code snippet shows a working implementation.
```python
from pathlib import Path

import tensorflow as tf


class GFileWrapper(tf.io.gfile.GFile):
    def __init__(self, path, mode="r") -> None:
        super().__init__(path, mode)

    def write(self, data):
        return super().write(bytes(data))

    # a not quite efficient readinto, but it works
    def readinto(self, buffer):
        # read up to buffer's length
        data = self.read(len(buffer))
        length = len(data)
        buffer[:length] = data
        return length


class HdfsPath(type(Path())):
    def __new__(cls, *pathsegments):
        return super().__new__(cls, *pathsegments)

    @staticmethod
    def _fix_path(path):
        path = str(path)
        if path.startswith("hdfs:/") and not path.startswith("hdfs://"):
            path = path.replace("hdfs:/", "hdfs://")
        return path

    def open(self, mode="r", *args, **kwargs):
        return GFileWrapper(HdfsPath._fix_path(self), mode=mode)

    def mkdir(self, **kwargs) -> None:
        return tf.io.gfile.makedirs(HdfsPath._fix_path(self))

    def rename(self, target):
        return tf.io.gfile.rename(HdfsPath._fix_path(self), HdfsPath._fix_path(target))
```
```python
writer = FileSystemWriter(HdfsPath("hdfs://..."), sync_files=False)
reader = FileSystemReader(HdfsPath("hdfs://..."))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106635
Approved by: https://github.com/fduwjj
Moved `SlicedBufferedReader` to utils and renamed it to `_ReaderView`.
It no longer depends on file handles and is a pure wrapper, which makes it general enough to handle non-file stream objects like fsspec's.
Should help with #98386
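For intuition, a minimal sketch of what such a view looks like (not the actual `_ReaderView` implementation): it maps all positions into the window `[offset, offset + length)` of an underlying seekable stream.
```python
import io


class ReaderViewSketch(io.IOBase):
    """Expose a fixed window of any seekable stream as a read-only stream."""

    def __init__(self, base_stream, offset, length):
        super().__init__()
        self.base_stream = base_stream
        self.offset = offset
        self.len = length
        self.base_stream.seek(offset)

    def seek(self, pos, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos += self.offset
        elif whence == io.SEEK_END:
            whence = io.SEEK_SET
            pos += self.offset + self.len
        # io.SEEK_CUR is relative, so it passes through unchanged.
        return self.base_stream.seek(pos, whence) - self.offset

    def read(self, size=-1):
        # Never read past the end of the window.
        remaining = self.offset + self.len - self.base_stream.tell()
        if size < 0 or size > remaining:
            size = remaining
        return self.base_stream.read(size)
```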
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99167
Approved by: https://github.com/wz337
Applies some more harmless pyupgrades. This one gets rid of deprecated aliases in unit tests and upgrades `yield`-inside-for loops into `yield from` delegation, which is more performant and propagates more information and exceptions from the original generator. This is the modern recommended way of forwarding generators.
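A small before/after example of the `yield from` upgrade:
```python
def make_batches(items, size=2):
    for i in range(0, len(items), size):
        yield items[i:i + size]


# Before the pyupgrade fix: a manual for/yield forwarding loop.
def chunks_old(items):
    for batch in make_batches(items):
        yield batch


# After: `yield from` delegates directly, propagating send()/throw()
# and the return value of the inner generator.
def chunks_new(items):
    yield from make_batches(items)
```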
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94309
Approved by: https://github.com/albanD
Previously, we only created the directory on rank 0. Therefore, when running on multiple hosts with multiple GPUs, we would run into "No such file or directory" errors.
This is the fix for it.
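A minimal sketch of the idea behind the fix, with a hypothetical `checkpoint_dir`:
```python
import os

checkpoint_dir = "/checkpoints/step_100"  # hypothetical path

# Every host (not just global rank 0) ensures the directory exists;
# exist_ok=True makes this safe when several ranks race to create it.
os.makedirs(checkpoint_dir, exist_ok=True)
```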
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92553
Approved by: https://github.com/kumpera
This PR includes:
- Changes from @kumpera (https://github.com/pytorch/pytorch/pull/86327): adding a multithreaded FileSystemWriter for distributed checkpointing, which adds two knobs to `FileSystemWriter`: `thread_count` and `per_thread_copy_ahead`. This yields up to a 50% performance improvement on 32-GPU workloads on AWS (a usage sketch follows this list).
- Parametrized tests added to /test/distributed/_shard/checkpoint/test_file_system_checkpoint.py and /test/distributed/_shard/checkpoint/test_file_system_checkpoint_cpu.py.
- `@with_comms` in `ShardedTensorTestBase` modified to take in `*args` and `**kwargs`.
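A hedged usage sketch of the two new knobs, written against today's `dcp.save` entry point (the values are illustrative):
```python
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemWriter

state_dict = {"weight": torch.randn(16, 16)}

writer = FileSystemWriter(
    "/tmp/ckpt",
    thread_count=4,                    # writer threads per rank
    per_thread_copy_ahead=10_000_000,  # bytes staged on CPU ahead of each write
)
dcp.save(state_dict=state_dict, storage_writer=writer)
```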
Tests:
```
python3 test/distributed/checkpoint/test_file_system_checkpoint_cpu.py
```
test/distributed/checkpoint/test_file_system_checkpoint.py (GPU tests) runs fine locally but would time out on CI. We will use a thread-based PG and update this test in a following PR.
[T134844615]
## Docstrings and comment updates will land in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87987
Approved by: https://github.com/fduwjj