Summary: We want to write checkpoints in HF format with DCP; this diff enables that for the non-distributed use case.
Test Plan:
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_hf_torchtune_storage
N6476188 --> able to save and load a tensor in HF format
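A hedged sketch of the intended usage. The storage-writer class name and import path below are guesses inferred from the test target, not confirmed API:
```python
import torch
import torch.distributed.checkpoint as dcp

# Hypothetical name and module, inferred from the test target above;
# the real class added by this diff may differ.
from torch.distributed.checkpoint._hf_storage import _HuggingFaceStorageWriter

state_dict = {"model.embed_tokens.weight": torch.randn(8, 8)}
dcp.save(
    state_dict=state_dict,
    storage_writer=_HuggingFaceStorageWriter(path="/tmp/hf_ckpt"),
)
```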
Differential Revision: D68444967
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146352
Approved by: https://github.com/saumishr
Summary: When testing, I tried passing a string argument to the FileSystem class' methods, which is a valid input, but the `cast()` that was supposed to convert the string to a path does nothing at runtime, so all of the methods failed with a string arg. A proper constructor should be used instead of a cast.
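For illustration, a minimal sketch of the failure mode: `typing.cast` only informs the type checker and performs no conversion at runtime, whereas the `Path` constructor actually builds the object:
```python
from pathlib import Path
from typing import cast

p = cast(Path, "/tmp/checkpoint")
print(isinstance(p, Path))  # False -- cast() is a runtime no-op; p is still a str

p = Path("/tmp/checkpoint")
print(isinstance(p, Path))  # True -- the constructor builds a real Path
```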
Test Plan: N6475361 -- the methods no longer throw an error with a string arg as they did previously
Differential Revision: D68713937
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145751
Approved by: https://github.com/pradeepfn
The current call passes `['/actual/path']` to `os.walk`; that is a string pointing to no real path, so the traversal is silently empty.
There is an unused function just above that handles this correctly, which is presumably what was meant to be called.
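To illustrate why this went unnoticed: `os.walk` raises no error for a nonexistent path, it simply yields nothing:
```python
import os

# A stringified list is not a valid path; os.walk swallows the lookup
# error (onerror=None by default) and produces an empty traversal.
print(list(os.walk("['/actual/path']")))  # []
```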
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126103
Approved by: https://github.com/suo
This PR seeks to increase observability of save/load requests. This is accomplished with two main changes:
1. The creation of save_id and load_id:
- a `save_id` and a `load_id` are added to the filesystem writer. `save_id` is regenerated on every save call, and `load_id` is regenerated on every load call.
- both IDs are stored in a new `StorageMeta` class and saved as part of `Metadata`. (`load_id` is `None` when we save, and only set during load.)
2. A new mechanism is implemented in the save path which gives the SavePlanner a chance to inspect the `storage_meta` object. The mechanism mirrors the same metadata exchange in the load path. In the load path, `storage_meta` is added to `metadata` such that the LoadPlanner can also access `storage_meta` before we begin loading.
*If users now wish to access the checkpoint_id in the SavePlanner, they simply need to access the value in `storage_meta` from the `set_up_planner` call, as sketched below.*
*Additionally, users now have a generic way of passing data to the SavePlanner from the StorageWriter at the start of the save path, similar to the load path*
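A minimal sketch of that access pattern, assuming the post-PR `set_up_planner` signature takes an optional `storage_meta` and that `StorageMeta` exposes `checkpoint_id`:
```python
from torch.distributed.checkpoint.default_planner import DefaultSavePlanner


class IdAwareSavePlanner(DefaultSavePlanner):
    def set_up_planner(self, state_dict, storage_meta=None, is_coordinator=False):
        # storage_meta is populated by the StorageWriter at the start of the
        # save path, so the planner can inspect it before planning begins.
        if storage_meta is not None:
            print("checkpoint_id:", storage_meta.checkpoint_id)
        super().set_up_planner(state_dict, storage_meta, is_coordinator)
```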
This PR has been tested for backwards compatibility -- meaning any checkpoints saved before this PR can continue being loaded after this PR.
One major consideration is that there is limited forwards compatibility. If a checkpoint is generated _past_ this PR, there is no support for loading it using older torch versions. This brings up a fairly important point: since we expect the metadata object (which is saved to the disk) to continue evolving, and we want to support forwards compatibility, we explore patching `pickle` so we can at least add new members to `metadata` and maintain fwd compat.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124772
Approved by: https://github.com/fegin
Summary:
This diff has no logic changes. It updates the variable names to be in sync with the names used in `prepare_global_plan` in `StorageWriter`. Pasting the function signature for easy reference:
```python
@abc.abstractmethod
def prepare_global_plan(self, plans: List[SavePlan]) -> List[SavePlan]:
    ...
```
Differential Revision: D56480396
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124770
Approved by: https://github.com/fegin
Let us try to remove this warning 😄 :
```
[rank0]:/data/users/andgu/pytorch/torch/distributed/checkpoint/filesystem.py:150: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
[rank0]: if tensor.storage().size() != tensor.numel():
```
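One way to silence it, a sketch assuming the check only needs the storage size: `untyped_storage().size()` reports bytes rather than elements, so the comparison must scale by the element size.
```python
import torch

tensor = torch.randn(4, 4)

# Equivalent check without TypedStorage: untyped_storage().size() is in
# bytes, so scale numel() by element_size() before comparing.
tensor_fills_storage = (
    tensor.untyped_storage().size() == tensor.numel() * tensor.element_size()
)
```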
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121538
Approved by: https://github.com/wz337, https://github.com/fegin
The OverlappingCPU loader causes a major drop in performance when used with multiple threads. This PR is a temporary fix while we investigate why this is the case.
Benchmarks for save, using a 7.25 GB FSDP model, as per the TSS benchmark. Both benchmarks run on 8 ranks.

| | Save time | Threads |
|---|---|---|
| Before this PR | 9.475 s | 8 |
| After this PR | 1.632 s | 8 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118114
Approved by: https://github.com/wz337, https://github.com/fegin
Adds a useful high-level wrapper for calling `DCP.save/load` with the correct storage readers and writers.
Instead of doing:
```
DCP.save(
state_dict={...},
storage_writer=StorageWriter(...)
)
DCP.load(
state_dict={...},
storage_reader=StorageReader(...)
)
```
We can now do:
```
checkpointer = Checkpointer(...)
checkpointer.save(state_dict={...})
checkpointer.load(state_dict={...})
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114603
Approved by: https://github.com/fegin, https://github.com/wz337
Applies PLW0108, which removes useless lambdas in Python. The rule is in preview, so it is not yet ready to be enabled by default; these are the autofixes from the rule.
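For reference, a tiny illustrative example of the pattern the rule rewrites:
```python
values = ["10", "2", "33"]

# Before: a useless lambda that merely forwards its argument
sorted(values, key=lambda v: int(v))

# After the autofix: pass the callable directly
sorted(values, key=int)
```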
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113602
Approved by: https://github.com/albanD
Fixes: #113193
`pydocstyle <all_files_in_issue> --count`
- Before: 345
- After: 130
For deprecated methods, I have added a `noqa` to ignore them. I was not able to find the file `torch/distributed/tensor/parallel/multihead_attention_tp.py`, so I have ignored it for this PR.
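An illustrative sketch of both kinds of fix (the function names here are made up for the example):
```python
def legacy_save():  # noqa: D103 -- deprecated function, docstring check skipped
    ...


def save_state_dict() -> None:
    """Save the state dict to storage.

    pydocstyle's D2xx checks want a one-line summary, a blank line,
    and then any further detail.
    """
```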
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113241
Approved by: https://github.com/kit1980
I propose some changes so that `FileSystemReader` and `FileSystemWriter` can be used on other file systems. The user only needs to provide `path` as a subclass of `Path` that overrides the necessary interfaces.
For example, one can utilize `tf.io.gfile` to implement an interface to save to or load from HDFS. The following code snippet shows a working implementation.
```python
from pathlib import Path

import tensorflow as tf


class GFileWrapper(tf.io.gfile.GFile):
    def __init__(self, path, mode="r") -> None:
        super().__init__(path, mode)

    def write(self, data):
        return super().write(bytes(data))

    # a not quite efficient readinto, but it works
    def readinto(self, buffer):
        # read up to buffer's length
        data = self.read(len(buffer))
        length = len(data)
        buffer[:length] = data
        return length


class HdfsPath(type(Path())):
    def __new__(cls, *pathsegments):
        return super().__new__(cls, *pathsegments)

    @staticmethod
    def _fix_path(path):
        path = str(path)
        if path.startswith("hdfs:/") and not path.startswith("hdfs://"):
            path = path.replace("hdfs:/", "hdfs://")
        return path

    def open(self, mode="r", *args, **kwargs):
        return GFileWrapper(HdfsPath._fix_path(self), mode=mode)

    def mkdir(self, **kwargs) -> None:
        return tf.io.gfile.makedirs(HdfsPath._fix_path(self))

    def rename(self, target):
        return tf.io.gfile.rename(HdfsPath._fix_path(self), HdfsPath._fix_path(target))
```
```python
writer = FileSystemWriter(HdfsPath("hdfs://..."), sync_files=False)
reader = FileSystemReader(HdfsPath("hdfs://..."))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106635
Approved by: https://github.com/fduwjj
Moved `SlicedBufferedReader` to utils and renamed it to `_ReaderView`.
It no longer depends on file handles and is a pure wrapper, which makes it general enough to handle non-file stream objects like fsspec's.
Should help with #98386
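For intuition, a minimal sketch of what such a view looks like (not the actual `_ReaderView` implementation): it maps all positions into the window `[offset, offset + length)` of an underlying seekable stream.
```python
import io


class ReaderViewSketch(io.IOBase):
    """Expose a fixed window of any seekable stream as a read-only stream."""

    def __init__(self, base_stream, offset, length):
        super().__init__()
        self.base_stream = base_stream
        self.offset = offset
        self.len = length
        self.base_stream.seek(offset)

    def seek(self, pos, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos += self.offset
        elif whence == io.SEEK_END:
            whence = io.SEEK_SET
            pos += self.offset + self.len
        # io.SEEK_CUR is relative, so it passes through unchanged.
        return self.base_stream.seek(pos, whence) - self.offset

    def read(self, size=-1):
        # Never read past the end of the window.
        remaining = self.offset + self.len - self.base_stream.tell()
        if size < 0 or size > remaining:
            size = remaining
        return self.base_stream.read(size)
```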
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99167
Approved by: https://github.com/wz337
Applies some more harmless pyupgrades. This one gets rid of deprecated aliases in unit tests and upgrades `yield`-inside-for loops into `yield from` delegation, which is more performant and propagates more information and exceptions from the original generator. This is the modern recommended way of forwarding generators.
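A small before/after example of the `yield from` upgrade:
```python
def make_batches(items, size=2):
    for i in range(0, len(items), size):
        yield items[i:i + size]


# Before the pyupgrade fix: a manual for/yield forwarding loop.
def chunks_old(items):
    for batch in make_batches(items):
        yield batch


# After: `yield from` delegates directly, propagating send()/throw()
# and the return value of the inner generator.
def chunks_new(items):
    yield from make_batches(items)
```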
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94309
Approved by: https://github.com/albanD
Previously, we only created the directory on rank 0. Therefore, when running on multiple hosts with multiple GPUs, we would run into "No such file or directory" errors.
This is the fix for it.
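A minimal sketch of the idea behind the fix, with a hypothetical `checkpoint_dir`:
```python
import os

checkpoint_dir = "/checkpoints/step_100"  # hypothetical path

# Every host (not just global rank 0) ensures the directory exists;
# exist_ok=True makes this safe when several ranks race to create it.
os.makedirs(checkpoint_dir, exist_ok=True)
```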
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92553
Approved by: https://github.com/kumpera
This PR includes:
- Changes from @kumpera (https://github.com/pytorch/pytorch/pull/86327): adding a multithreaded FileSystemWriter for distributed checkpointing, which adds two knobs to `FileSystemWriter`: `thread_count` and `per_thread_copy_ahead`. This yields up to a 50% performance improvement on 32-GPU workloads on AWS (a usage sketch follows this list).
- Parametrized tests added to /test/distributed/_shard/checkpoint/test_file_system_checkpoint.py and /test/distributed/_shard/checkpoint/test_file_system_checkpoint_cpu.py.
- `@with_comms` in `ShardedTensorTestBase` modified to take in `*args` and `**kwargs`.
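A hedged usage sketch of the two new knobs, written against today's `dcp.save` entry point (the values are illustrative):
```python
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemWriter

state_dict = {"weight": torch.randn(16, 16)}

writer = FileSystemWriter(
    "/tmp/ckpt",
    thread_count=4,                    # writer threads per rank
    per_thread_copy_ahead=10_000_000,  # bytes staged on CPU ahead of each write
)
dcp.save(state_dict=state_dict, storage_writer=writer)
```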
Tests:
```
python3 test/distributed/checkpoint/test_file_system_checkpoint_cpu.py
```
test/distributed/checkpoint/test_file_system_checkpoint.py (GPU tests) runs fine locally but would time out on CI. We will use a thread-based PG and update this test in a following PR.
[T134844615]
## Docstrings and comment updates will land in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87987
Approved by: https://github.com/fduwjj