Summary:
DCP metadata collectives become prohibitively expensive as job scale grows. This PR introduces rank-local checkpointing, which saves and loads the checkpoint without any collectives. The trade-off, for now, is that dedupe and re-sharding are unsupported; support for these will be introduced soon.
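A conceptual sketch of the idea (not the API introduced by this PR; the file layout and function names here are hypothetical): each rank serializes only its own shards and reads them back, with no cross-rank exchange at any point.

```python
import os
import torch
import torch.distributed as dist

def rank_local_save(state_dict, checkpoint_dir):
    # Each rank writes only its own file; since no global plan or metadata
    # is exchanged, no collective is issued.
    rank = dist.get_rank() if dist.is_initialized() else 0
    torch.save(state_dict, os.path.join(checkpoint_dir, f"rank_{rank}.pt"))

def rank_local_load(checkpoint_dir):
    # Loading is likewise collective-free: each rank reads back exactly the
    # file it wrote. The trade-off: no dedupe across ranks, and no
    # re-sharding if the world size changes.
    rank = dist.get_rank() if dist.is_initialized() else 0
    return torch.load(os.path.join(checkpoint_dir, f"rank_{rank}.pt"))
```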
Differential Revision: D70112642
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147758
Approved by: https://github.com/meetv18
Summary: We want to add versioning to the DCP metadata so that whenever planner logic changes, we can use the version recorded at save time to determine how to load the data.
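A minimal sketch of version-gated loading (the field and function names are assumptions, not the actual DCP fields):

```python
from dataclasses import dataclass

@dataclass
class CheckpointMetadata:
    # Hypothetical version field, bumped whenever planner logic changes.
    version: int = 1

def choose_load_path(metadata: CheckpointMetadata) -> str:
    # Branch on the version recorded at save time so newer loaders can
    # still interpret checkpoints written under older planner logic.
    if metadata.version >= 2:
        return "current planner layout"
    return "legacy planner layout"
```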
Test Plan:
added a test
Rollback Plan:
Differential Revision: D76135887
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155343
Approved by: https://github.com/teja-rao
This change adds support for re-sharding HF safetensors checkpoints.
This is done by adding more metadata when saving each file: the metadata captures the size and offset of each saved shard. On load, this information is used to create the chunks belonging to the TensorStorageMetadata class, which enables re-sharding.
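As a sketch under stated assumptions (the exact keys recorded per shard are assumptions; the metadata classes are DCP's real ones), the size/offset information maps onto `TensorStorageMetadata` chunks like this:

```python
import torch
from torch.distributed.checkpoint.metadata import (
    ChunkStorageMetadata,
    TensorProperties,
    TensorStorageMetadata,
)

# Hypothetical per-shard record captured at save time.
shard_info = {"offset": [512, 0], "shape": [512, 1024], "full_shape": [1024, 1024]}

# On load, this is enough to rebuild the chunk list that DCP's planner
# uses to re-shard the tensor.
chunk = ChunkStorageMetadata(
    offsets=torch.Size(shard_info["offset"]),
    sizes=torch.Size(shard_info["shape"]),
)
tensor_md = TensorStorageMetadata(
    properties=TensorProperties(dtype=torch.float32),
    size=torch.Size(shard_info["full_shape"]),
    chunks=[chunk],
)
```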
Differential Revision: [D75226344](https://our.internmc.facebook.com/intern/diff/D75226344/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154519
Approved by: https://github.com/saumishr
Summary: If there is only one safetensors file, users don't need a metadata file; we can construct the metadata from the keys of that file. This is a use case for some HuggingFace models, so this adds support for it.
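A minimal sketch of the idea using the safetensors API (the file name and the shape of the returned index are assumptions):

```python
from safetensors import safe_open

def build_index(path="model.safetensors"):
    # The safetensors header already lists every key and its shape, so a
    # separate metadata/index file is unnecessary for a single-file repo.
    index = {}
    with safe_open(path, framework="pt") as f:
        for key in f.keys():
            index[key] = f.get_slice(key).get_shape()
    return index
```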
Test Plan:
ensure existing tests pass
tested e2e in a notebook
Differential Revision: D72472490
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150701
Approved by: https://github.com/joecummings
Summary: add a param that tells the storage writer how to save tensors. Right now the only options are safetensors and torch.save.
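A hypothetical illustration of the knob (the parameter and enum names are assumptions, not the exact API added here):

```python
from enum import Enum

class SerializationFormat(Enum):
    SAFETENSORS = "safetensors"  # write .safetensors files
    TORCH_SAVE = "torch_save"    # fall back to torch.save pickles

# Hypothetical usage; writer class and kwarg name assumed:
# writer = SomeStorageWriter(path="ckpt/", serialization_format=SerializationFormat.SAFETENSORS)
```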
Test Plan:
```
(lintrunner) [ankitageorge@devgpu003.cco3 /data/users/ankitageorge/fbsource/fbcode/caffe2 (1d57cb27b)]$ buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage
File changed: fbcode//caffe2/torch/distributed/checkpoint/filesystem.py
Buck UI: https://www.internalfb.com/buck2/e80cc963-e34a-4876-b6f4-7ce2794e48dd
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3659174965882569
Network: Up: 32KiB Down: 1.9KiB (reSessionID-ef9fa764-a40a-451b-ab58-08eabe7a9422)
Executing actions. Remaining 0/4 3.4s exec time total
Command: test. Finished 2 local
Time elapsed: 19.6s
Tests finished: Pass 4. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Reviewed By: saumishr
Differential Revision: D70271943
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150025
Approved by: https://github.com/saumishr
Summary: D69984656 caused issues by adding the fsspec dependency to torch.distributed when many internal packages didn't have it. In this diff, I'm not adding HFStorageReader/Writer to __init__.py, so the HFStorage components don't get imported internally and, in turn, no fsspec import happens. I did the removal from __init__.py in D70286926 to fix the failing tests, but the revert was done concurrently. I'll add the classes back to __init__.py once I figure out a better way to get fsspec added as a dependency everywhere.
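In the meantime, the classes can presumably still be imported from their defining submodule rather than from the package root (the module path below is an assumption):

```python
# Not re-exported from torch.distributed.checkpoint's __init__, so import
# directly from the submodule (path assumed):
from torch.distributed.checkpoint._hf_storage import HFStorageReader, HFStorageWriter
```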
Test Plan:
signals pass
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage
Differential Revision: D70324090
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148089
Approved by: https://github.com/saumishr
Summary: We want to write checkpoints in HF format with DCP; this diff enables that for the non-distributed use case.
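A rough usage sketch (the writer's import path and constructor arguments are assumptions; this also assumes the DCP save entry point falls back to a non-distributed path when no process group is initialized):

```python
import torch
import torch.distributed.checkpoint as dcp
# Import path assumed; see the diff for the actual module.
from torch.distributed.checkpoint._hf_storage import HFStorageWriter

state_dict = {"weight": torch.randn(4, 4)}
dcp.save(state_dict, storage_writer=HFStorageWriter(path="ckpt/"))
```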
Test Plan:
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_hf_torchtune_storage
N6476188 --> able to save and load tensor in hf format
Differential Revision: D68444967
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146352
Approved by: https://github.com/saumishr
Summary: When testing, I tried to pass a string argument to the FileSystem class' methods, which is a valid input, but the cast() that was supposed to convert the string to a path is a no-op at runtime, so all the methods failed with a string arg. Instead of a cast, the proper Path constructor should be used.
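The distinction, for reference (this snippet is illustrative, not the diff itself): `typing.cast` only informs the type checker and does nothing at runtime, while the `Path` constructor actually converts.

```python
from pathlib import Path
from typing import cast

p1 = cast(Path, "/tmp/ckpt")  # no-op at runtime: p1 is still a str,
                              # so Path-only methods fail later
p2 = Path("/tmp/ckpt")        # real conversion: p2 is a Path

assert isinstance(p1, str)
assert isinstance(p2, Path)
```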
Test Plan: N6475361 methods don't throw an error with a string arg like they were previously
Differential Revision: D68713937
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145751
Approved by: https://github.com/pradeepfn
The current call passes `['/actual/path']` to os.walk as a string, which points to no existing path and thus silently produces an empty traversal.
There is an unused function just above that handles this correctly, which appears to be what was supposed to be called.
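For reference, os.walk does not raise on a nonexistent root; it simply yields nothing, which is why the bug was silent:

```python
import os

# The stringified list names no real directory, so the walk is empty.
assert list(os.walk("['/actual/path']")) == []
```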
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126103
Approved by: https://github.com/suo
This PR seeks to increase observability of save/load requests. This is accomplished with two main changes:
1. The creation of save_id and load_id:
- A `save_id` and a `load_id` are added to the filesystem writer: `save_id` is re-generated on every save call, and `load_id` is re-generated on every load call.
- Both IDs are stored in a new `StorageMeta` class and saved as part of Metadata (`load_id` is None when we save and is only set during load).
2. A new mechanism in the save path gives the SavePlanner a chance to inspect the `storage_meta` object, mirroring the metadata exchange in the load path. In the load path, `storage_meta` is added to `metadata` so that the LoadPlanner can also access it before loading begins.
*If users now wish to access the checkpoint_id in the SavePlanner, they simply need to access the value in `storage_meta` from the `set_up_planner` call*
*Additionally, users now have a generic way of passing data to the SavePlanner from the StorageWriter at the start of the save path, similar to the load path*
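A sketch of such a planner, assuming the post-PR `set_up_planner` signature and the `StorageMeta.checkpoint_id` field described above:

```python
from torch.distributed.checkpoint.default_planner import DefaultSavePlanner

class IdAwareSavePlanner(DefaultSavePlanner):
    def set_up_planner(self, state_dict, storage_meta=None, is_coordinator=False):
        # storage_meta now arrives before planning starts, so the planner
        # can read the writer-provided checkpoint_id (and save_id) here.
        if storage_meta is not None:
            print(f"saving checkpoint_id={storage_meta.checkpoint_id}")
        super().set_up_planner(state_dict, storage_meta, is_coordinator)
```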
This PR has been tested for backwards compatibility -- meaning any checkpoints saved before this PR can continue being loaded after this PR.
One major consideration is that there is limited forwards compatibility: if a checkpoint is generated _after_ this PR, there is no support for loading it with older torch versions. This raises a fairly important point: since we expect the metadata object (which is saved to disk) to continue evolving, and we want to support forwards compatibility, we are exploring patching `pickle` so we can at least add new members to `metadata` without breaking older loaders.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124772
Approved by: https://github.com/fegin
Summary:
This diff has no logic changes. It updates the variable names to be in sync with the names used in prepare_global_plan in StorageWriter. Pasting the func signature for easy reference:
```python
@abc.abstractmethod
def prepare_global_plan(self, plans: List[SavePlan]) -> List[SavePlan]:
    ...
```
Differential Revision: D56480396
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124770
Approved by: https://github.com/fegin
Let us try to remove this warning 😄 :
```
[rank0]:/data/users/andgu/pytorch/torch/distributed/checkpoint/filesystem.py:150: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
[rank0]: if tensor.storage().size() != tensor.numel():
```
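One deprecation-free way to express the same check (an assumption about the fix, not necessarily the exact change in this PR): the untyped storage reports bytes rather than elements, so scale `numel()` by the element size.

```python
import torch

tensor = torch.randn(8)

# Equivalent to the old `tensor.storage().size() != tensor.numel()` check,
# but via the non-deprecated untyped storage (bytes, not elements).
storage_mismatch = (
    tensor.untyped_storage().nbytes() != tensor.numel() * tensor.element_size()
)
```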
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121538
Approved by: https://github.com/wz337, https://github.com/fegin
The overlapping CPU loader causes a major drop in performance when used with multiple threads. This PR is a temporary fix while we investigate why this is the case.
Benchmarks for save, using a 7.25GB FSDP model, as per the TSS benchmark. Both benchmarks run on 8 ranks.
Before this PR: 9.475 s (8 threads)
After this PR: 1.632 s (8 threads)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118114
Approved by: https://github.com/wz337, https://github.com/fegin