pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Chien-Chin Huang	494c2ec054	[DCP][BE] Let FsspecWriter and FsspecReader inherit from FileSystemWriter and FileSystemReader (#118887 ) There is no logic changed. However this PR dramatially reduces the effort to maintain filesystem-like storage backend. As we are going to enable fsspec, this is a must BE iteam. Differential Revision: [D53318044](https://our.internmc.facebook.com/intern/diff/D53318044/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118887 Approved by: https://github.com/wz337	2024-02-03 01:14:13 +00:00
Chien-Chin Huang	644bc69530	[DCP] Allow users to save and load without creating storage reader and writer (#117772 ) Right now DCP API requires users to create StorageWriter and StorageReader for every API call. This PR allows users to only pass the checkpointer_id (a path) and use it to read/write a checkpoint without creating a StorageReader and Writer. Differential Revision: [D52740556](https://our.internmc.facebook.com/intern/diff/D52740556/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117772 Approved by: https://github.com/wz337 ghstack dependencies: #116248	2024-01-26 09:08:35 +00:00
Lucas Pasqualin	ea851eb027	Uses Serial Loader for DCP.save when more then one thread is used. (#118114 ) The OverlappingCPU Loader is causing a major drop in performance when used with multiple threads. This PR is a temporary fix while we investigate why this is the case. Benchmarks for save, using a 7.25GB FSDP model, as per the TSS benchmark. Both benchmarks run on 8 ranks. Before this PR 9.475 s 8 threads After this PR 1.632 s 8 threads Pull Request resolved: https://github.com/pytorch/pytorch/pull/118114 Approved by: https://github.com/wz337, https://github.com/fegin	2024-01-25 21:11:16 +00:00
Chien-Chin Huang	3d1869d0ae	[DCP][BE] Improve the readability of filesystem and fsspec filesystem (#116246 ) 1. Better typing 2. Remove 1-liner function Differential Revision: [D52357731](https://our.internmc.facebook.com/intern/diff/D52357731/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/116246 Approved by: https://github.com/wz337 ghstack dependencies: #116245	2024-01-11 16:27:21 +00:00
Lucas Pasqualin	b342286646	adds async save, makes checkpointer private (#116293 ) Adds Async Save and also makes `Checkpointer` classes private. The original PR was here: https://github.com/pytorch/pytorch/pull/115864 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116293 Approved by: https://github.com/fegin	2023-12-22 05:22:39 +00:00
Chien-Chin Huang	a548ff40de	[DCP][BE] Remove unused function (#116006 ) As title Differential Revision: [D52245433](https://our.internmc.facebook.com/intern/diff/D52245433/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/116006 Approved by: https://github.com/wz337	2023-12-21 07:20:08 +00:00
Chien-Chin Huang	db8d409d08	[DCP][BE] Apply ufmt to DCP and turn on lintrunner for DCP (#115302 ) No logic change. Just typing and ufmt. Differential Revision: [D51914982](https://our.internmc.facebook.com/intern/diff/D51914982/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115302 Approved by: https://github.com/XilunWu, https://github.com/wz337, https://github.com/LucasLLC ghstack dependencies: #115523	2023-12-13 10:32:36 +00:00
Lucas Pasqualin	5432088098	Adds Checkpointer Wrapper for DCP [3/N] (#114603 ) Adds a useful high level wrapper for calling `dist.save/load` with the correct storage readers and writers. Instead of doing: ``` DCP.save( state_dict={...}, storage_writer=StorageWriter(...) ) DCP.load( state_dict={...}, storage_reader=StorageReader(...) ) ``` We can now do: ``` checkpointer = Checkpointer(...) checkpointer.save(state_dict={...}) checkpointer.load(state_dict={...}) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/114603 Approved by: https://github.com/fegin, https://github.com/wz337	2023-12-08 01:03:21 +00:00
Aaron Gokaslan	b7b2178204	[BE]: Remove useless lambdas (#113602 ) Applies PLW0108 which removes useless lambda calls in Python, the rule is in preview so it is not ready to be enabled by default just yet. These are the autofixes from the rule. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113602 Approved by: https://github.com/albanD	2023-11-14 20:06:48 +00:00
NVS Abhilash	44c0521e8c	fix: docstring error in torch/distributed module (#113241 ) Fixes: #113193 `pydocstyle <all_files_in_issue> --count` - Before: 345 - After: 130 For deprecated methods, I have added a `noqa` to ignore them. I was not able to find the file `torch/distributed/tensor/parallel/multihead_attention_tp.py`, so I've ignored it for this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113241 Approved by: https://github.com/kit1980	2023-11-09 19:10:20 +00:00
Hanzhi Zhou	bb45f89cd9	Hackable distributed filesystem reader and writer (#106635 ) I propose some changes so that the `FileSystemReader` and `FileSystemWriter` can be used on other file systems. User only needs to provide `path` as a subclass of `Path` that overrides the necessary interfaces. For example, one can utilize `tf.io.gfile` to implement an interface to save to or load from HDFS. The following code snippet shows a working implementation. ```python from pathlib import Path import tensorflow as tf class GFileWrapper(tf.io.gfile.GFile): def __init__(self, path, mode="r") -> None: super().__init__(path, mode) def write(self, data): return super().write(bytes(data)) # a not quite efficient readinto, but it works def readinto(self, buffer): # read up to buffer's length data = self.read(len(buffer)) length = len(data) buffer[:length] = data return length class HdfsPath(type(Path())): def __new__(cls, pathsegments): return super().__new__(cls, pathsegments) @staticmethod def _fix_path(path): path = str(path) if path.startswith("hdfs:/") and not path.startswith("hdfs://"): path = path.replace("hdfs:/", "hdfs://") return path def open(self, mode="r", args, kwargs): return GFileWrapper(HdfsPath._fix_path(self), mode=mode) def mkdir(self, *kwargs) -> None: return tf.io.gfile.makedirs(HdfsPath._fix_path(self)) def rename(self, target): return tf.io.gfile.rename(HdfsPath._fix_path(self), HdfsPath._fix_path(target)) ``` ```python writer = FileSystemWriter(HdfsPath("hdfs://..."), sync_files=False) reader = FileSystemReader(HdfsPath("hdfs://...")) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/106635 Approved by: https://github.com/fduwjj	2023-10-31 19:36:18 +00:00
dilililiwhy	ff37f6018d	Enable custom device support in fsdp checkpoint (#107289 ) Fixes https://github.com/pytorch/pytorch/issues/104390 Enable custom device(privateuse1 backend) support in checkpointing by a dynamic abstract device module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107289 Approved by: https://github.com/wz337	2023-08-25 11:50:03 +00:00
Rodrigo Kumpera	4833dc10b8	[DCP] Rewrite read slicing to use a wrapper. (#99167 ) Moved SlicedBufferedReader to utils and renamed to _ReaderView. It no longer depends on file handles and is a pure wrapper. This makes it general enought to handle non io stream objects like fsspec's. Should help with #98386 Pull Request resolved: https://github.com/pytorch/pytorch/pull/99167 Approved by: https://github.com/wz337	2023-06-08 13:52:13 +00:00
Aaron Gokaslan	8769fb854d	[BE] Fix flake8 B027 errors - missing abstractmethod decorator (#100715 ) Enables B027 and applies fixes by adding abstract method decorators. Autofix generated by ruff master. Pull Request resolved: https://github.com/pytorch/pytorch/pull/100715 Approved by: https://github.com/ezyang	2023-05-09 17:28:48 +00:00
Kazuaki Ishizaki	35fd5c548e	Fix typos under torch/distributed directory (#95638 ) This PR fixes typos in comments and messages of `.py` files under torch/distributed directory Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638 Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980	2023-03-27 21:13:44 +00:00
Iris	bebe58bd71	[DCP] Set single_file_per_rank default to True (#94501 ) The default behavior of FileSystemWriter should produce one file per rank instead of one file per tensor/blob. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94501 Approved by: https://github.com/fegin	2023-02-09 21:45:31 +00:00
Aaron Gokaslan	748bac8757	[BE]: Apply pyupgrade yield from and unit test alias upgrades (#94309 ) Applies some more harmless pyupgrades. This one gets rid of deprecated aliases in unit_tests and more upgrades yield for loops into yield from generators which are more performance and propagates more information / exceptions from original generator. This is the modern recommended way of forwarding generators. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94309 Approved by: https://github.com/albanD	2023-02-07 20:08:58 +00:00
Iris	dd05f028e2	[PT-D][Checkpoint] Rename DCP storage layer init() (#92869 ) Rename DCP storage layer init() and update tests accordingly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/92869 Approved by: https://github.com/kumpera	2023-01-25 23:52:45 +00:00
Iris	eee2869ea7	[PT-D][checkpoint] Resolve no such file or directory issue when checkpointing on multi hosts (#92553 ) Previously, we only create the directory in rank 0. Therefore, if running on multihosts with multiple GPUs, we would run into issues of "No such file or directory". This is the fix for it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/92553 Approved by: https://github.com/kumpera	2023-01-20 21:54:04 +00:00
Iris	0cc0e5ef65	[PT-D][Checkpoint]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests (#87987 ) This PR includes: Changes from @kumpera (https://github.com/pytorch/pytorch/pull/86327): adding MultiThreaded FileSystemWriter for distributed checkpointing, which adds two knobs to FileSystemWriter: thread_count and per_thread_copy_ahead. This increases up to 50% performance improvement on 32 GPUS workloads on AWS. Add parametrize tests to /test/distributed/_shard/checkpoint/test_file_system_checkpoint.py and /test/distributed/_shard/checkpoint/test_file_system_checkpoint_cpu.py Modify @with_comms in ShardedTensorTestBase to take in args and *kwargs. Tests: ``` python3 test/distributed/checkpoint/test_file_system_checkpoint_cpu.py ``` test/distributed/checkpoint/test_file_system_checkpoint.py(GPU tests) runs fine locally but would timeout on CI. We will use thread-based PG and update this test in following PR. [T134844615] ## Add docstring and update comments in the following PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87987 Approved by: https://github.com/fduwjj	2022-11-30 08:19:41 +00:00
Iris	cefece3726	Fix typo in filesystem.py (#89849 ) As title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89849 Approved by: https://github.com/H-Huang	2022-11-30 01:06:58 +00:00
Iris	aee96bbf5a	[PT-D][Checkpointing] Move distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint (#88698 ) Context in RFC: https://github.com/pytorch/pytorch/issues/86620 .rst file will be finalized in subsequent PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88698 Approved by: https://github.com/wanchaol	2022-11-16 21:06:38 +00:00

22 Commits