Commit Graph

22 Commits

Author SHA1 Message Date
Chien-Chin Huang
494c2ec054 [DCP][BE] Let FsspecWriter and FsspecReader inherit from FileSystemWriter and FileSystemReader (#118887)
There is no logic changed. However this PR dramatially reduces the effort to maintain filesystem-like storage backend. As we are going to enable fsspec, this is a must BE iteam.

Differential Revision: [D53318044](https://our.internmc.facebook.com/intern/diff/D53318044/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118887
Approved by: https://github.com/wz337
2024-02-03 01:14:13 +00:00
Chien-Chin Huang
644bc69530 [DCP] Allow users to save and load without creating storage reader and writer (#117772)
Right now DCP API requires users to create StorageWriter and StorageReader for every API call. This PR allows users to only pass the checkpointer_id (a path) and use it to read/write a checkpoint without creating a StorageReader and Writer.

Differential Revision: [D52740556](https://our.internmc.facebook.com/intern/diff/D52740556/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117772
Approved by: https://github.com/wz337
ghstack dependencies: #116248
2024-01-26 09:08:35 +00:00
Lucas Pasqualin
ea851eb027 Uses Serial Loader for DCP.save when more then one thread is used. (#118114)
The OverlappingCPU Loader is causing a major drop in performance when used with multiple threads. This PR is a temporary fix while we investigate why this is the case.

Benchmarks for save, using a 7.25GB FSDP model, as per the TSS benchmark. Both benchmarks run on 8 ranks.

Before this PR
9.475 s
8 threads

After this PR
1.632 s
8 threads

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118114
Approved by: https://github.com/wz337, https://github.com/fegin
2024-01-25 21:11:16 +00:00
Chien-Chin Huang
3d1869d0ae [DCP][BE] Improve the readability of filesystem and fsspec filesystem (#116246)
1. Better typing
2. Remove 1-liner function

Differential Revision: [D52357731](https://our.internmc.facebook.com/intern/diff/D52357731/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116246
Approved by: https://github.com/wz337
ghstack dependencies: #116245
2024-01-11 16:27:21 +00:00
Lucas Pasqualin
b342286646 adds async save, makes checkpointer private (#116293)
Adds Async Save and also makes `Checkpointer` classes private.

The original PR was here: https://github.com/pytorch/pytorch/pull/115864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116293
Approved by: https://github.com/fegin
2023-12-22 05:22:39 +00:00
Chien-Chin Huang
a548ff40de [DCP][BE] Remove unused function (#116006)
As title

Differential Revision: [D52245433](https://our.internmc.facebook.com/intern/diff/D52245433/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116006
Approved by: https://github.com/wz337
2023-12-21 07:20:08 +00:00
Chien-Chin Huang
db8d409d08 [DCP][BE] Apply ufmt to DCP and turn on lintrunner for DCP (#115302)
No logic change. Just typing and ufmt.

Differential Revision: [D51914982](https://our.internmc.facebook.com/intern/diff/D51914982/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115302
Approved by: https://github.com/XilunWu, https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #115523
2023-12-13 10:32:36 +00:00
Lucas Pasqualin
5432088098 Adds Checkpointer Wrapper for DCP [3/N] (#114603)
Adds a useful high level wrapper for calling `dist.save/load` with the correct storage readers and writers.

Instead of doing:

```
DCP.save(
    state_dict={...},
    storage_writer=StorageWriter(...)
)

DCP.load(
    state_dict={...},
    storage_reader=StorageReader(...)
)
```

We can now do:

```
checkpointer = Checkpointer(...)

checkpointer.save(state_dict={...})
checkpointer.load(state_dict={...})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114603
Approved by: https://github.com/fegin, https://github.com/wz337
2023-12-08 01:03:21 +00:00
Aaron Gokaslan
b7b2178204 [BE]: Remove useless lambdas (#113602)
Applies PLW0108 which removes useless lambda calls in Python, the rule is in preview so it is not ready to be enabled by default just yet. These are the autofixes from the rule.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113602
Approved by: https://github.com/albanD
2023-11-14 20:06:48 +00:00
NVS Abhilash
44c0521e8c fix: docstring error in torch/distributed module (#113241)
Fixes: #113193

`pydocstyle <all_files_in_issue> --count`

- Before: 345
- After: 130

For deprecated methods, I have added a `noqa` to ignore them. I was not able to find the file `torch/distributed/tensor/parallel/multihead_attention_tp.py`, so I've ignored it for this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113241
Approved by: https://github.com/kit1980
2023-11-09 19:10:20 +00:00
Hanzhi Zhou
bb45f89cd9 Hackable distributed filesystem reader and writer (#106635)
I propose some changes so that the `FileSystemReader` and `FileSystemWriter` can be used on other file systems. User only needs to provide `path` as a subclass of `Path` that overrides the necessary interfaces.

For example, one can utilize `tf.io.gfile` to implement an interface to save to or load from HDFS. The following code snippet shows a working implementation.

```python
from pathlib import Path
import tensorflow as tf

class GFileWrapper(tf.io.gfile.GFile):
    def __init__(self, path, mode="r") -> None:
        super().__init__(path, mode)

    def write(self, data):
        return super().write(bytes(data))

    # a not quite efficient readinto, but it works
    def readinto(self, buffer):
        # read up to buffer's length
        data = self.read(len(buffer))
        length = len(data)
        buffer[:length] = data
        return length

class HdfsPath(type(Path())):
    def __new__(cls, *pathsegments):
        return super().__new__(cls, *pathsegments)

    @staticmethod
    def _fix_path(path):
        path = str(path)
        if path.startswith("hdfs:/") and not path.startswith("hdfs://"):
          path = path.replace("hdfs:/", "hdfs://")
        return path

    def open(self, mode="r", *args, **kwargs):
        return GFileWrapper(HdfsPath._fix_path(self), mode=mode)

    def mkdir(self, **kwargs) -> None:
        return tf.io.gfile.makedirs(HdfsPath._fix_path(self))

    def rename(self, target):
        return tf.io.gfile.rename(HdfsPath._fix_path(self), HdfsPath._fix_path(target))
```

```python
writer = FileSystemWriter(HdfsPath("hdfs://..."), sync_files=False)
reader = FileSystemReader(HdfsPath("hdfs://..."))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106635
Approved by: https://github.com/fduwjj
2023-10-31 19:36:18 +00:00
dilililiwhy
ff37f6018d Enable custom device support in fsdp checkpoint (#107289)
Fixes https://github.com/pytorch/pytorch/issues/104390
Enable custom device(privateuse1 backend) support in checkpointing by a dynamic abstract device module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107289
Approved by: https://github.com/wz337
2023-08-25 11:50:03 +00:00
Rodrigo Kumpera
4833dc10b8 [DCP] Rewrite read slicing to use a wrapper. (#99167)
Moved SlicedBufferedReader to utils and renamed to _ReaderView.

It no longer depends on file handles and is a pure wrapper. This makes it general enought to handle non io stream objects like fsspec's.

Should help with #98386
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99167
Approved by: https://github.com/wz337
2023-06-08 13:52:13 +00:00
Aaron Gokaslan
8769fb854d [BE] Fix flake8 B027 errors - missing abstractmethod decorator (#100715)
Enables B027 and applies fixes by adding abstract method decorators. Autofix generated by ruff master.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100715
Approved by: https://github.com/ezyang
2023-05-09 17:28:48 +00:00
Kazuaki Ishizaki
35fd5c548e Fix typos under torch/distributed directory (#95638)
This PR fixes typos in comments and messages of `.py` files under torch/distributed directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638
Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980
2023-03-27 21:13:44 +00:00
Iris
bebe58bd71 [DCP] Set single_file_per_rank default to True (#94501)
The default behavior of FileSystemWriter should produce one file per rank instead of one file per tensor/blob.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94501
Approved by: https://github.com/fegin
2023-02-09 21:45:31 +00:00
Aaron Gokaslan
748bac8757 [BE]: Apply pyupgrade yield from and unit test alias upgrades (#94309)
Applies some more harmless pyupgrades. This one gets rid of deprecated aliases in unit_tests and more upgrades yield for loops into yield from generators which are more performance and propagates more information / exceptions from original generator. This is the modern recommended way of forwarding generators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94309
Approved by: https://github.com/albanD
2023-02-07 20:08:58 +00:00
Iris
dd05f028e2 [PT-D][Checkpoint] Rename DCP storage layer init() (#92869)
Rename DCP storage layer init() and update tests accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92869
Approved by: https://github.com/kumpera
2023-01-25 23:52:45 +00:00
Iris
eee2869ea7 [PT-D][checkpoint] Resolve no such file or directory issue when checkpointing on multi hosts (#92553)
Previously, we only create the directory in rank 0. Therefore, if running on multihosts with multiple GPUs, we would run into issues of "No such file or directory".

This is the fix for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92553
Approved by: https://github.com/kumpera
2023-01-20 21:54:04 +00:00
Iris
0cc0e5ef65 [PT-D][Checkpoint]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests (#87987)
This PR includes:

Changes from @kumpera (https://github.com/pytorch/pytorch/pull/86327): adding MultiThreaded FileSystemWriter for distributed checkpointing, which adds two knobs to FileSystemWriter: thread_count and per_thread_copy_ahead. This increases up to 50% performance improvement on 32 GPUS workloads on AWS.
Add parametrize tests to /test/distributed/_shard/checkpoint/test_file_system_checkpoint.py and /test/distributed/_shard/checkpoint/test_file_system_checkpoint_cpu.py
Modify @with_comms in ShardedTensorTestBase to take in *args and **kwargs.
Tests:

```
python3 test/distributed/checkpoint/test_file_system_checkpoint_cpu.py
```

test/distributed/checkpoint/test_file_system_checkpoint.py(GPU tests) runs fine locally but would timeout on CI. We will use thread-based PG and update this test in following PR.

[T134844615]

## Add docstring and update comments in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87987
Approved by: https://github.com/fduwjj
2022-11-30 08:19:41 +00:00
Iris
cefece3726 Fix typo in filesystem.py (#89849)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89849
Approved by: https://github.com/H-Huang
2022-11-30 01:06:58 +00:00
Iris
aee96bbf5a [PT-D][Checkpointing] Move distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint (#88698)
Context in RFC: https://github.com/pytorch/pytorch/issues/86620

.rst file will be finalized in subsequent PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88698
Approved by: https://github.com/wanchaol
2022-11-16 21:06:38 +00:00