DSD currently pops tensors if they are on the meta device. This prevents the use case where users want DCP to directly initialize those tensors when loading.
This PR also removes test/distributed/checkpoint/e2e/test_pipeline.py, which is based on the above behavior; that test is not realistic and is not used anywhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153185
Approved by: https://github.com/mori360
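A minimal sketch of the flow this unblocks, assuming a hypothetical MyModel module; DCP materializes the meta tensors during load instead of requiring them to be allocated up front:
```
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_model_state_dict

with torch.device("meta"):
    model = MyModel()  # hypothetical module; parameters carry no storage yet

# Previously DSD popped these meta tensors; now they stay in the state_dict
# so DCP can initialize them directly while loading.
state_dict = get_model_state_dict(model)
dcp.load(state_dict, checkpoint_id="/tmp/ckpt")
```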
Change logging.error to logging.exception to log additional information when relevant. A few logging.error calls have slipped into try/except blocks since I last did a cleanup here, and the rule has stabilized, so I am enabling it codebase-wide. I have NOQA'd much of our custom exception stack-trace handling for RPC calls and distributed, and tried to fix a few call sites depending on whether we immediately re-raised the exception or didn't print any exception information where it could be useful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473
Approved by: https://github.com/albanD, https://github.com/cyyever
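For illustration, the difference inside a try/except block (risky_call is a placeholder):
```
import logging

try:
    risky_call()  # placeholder for any failing operation
except Exception:
    # logging.error would record only the message; logging.exception also
    # appends the active traceback at ERROR level.
    logging.exception("risky_call failed")
```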
Summary:
### Diff Context
- This PR introduces Pipes for multiprocess comms with the checkpointer process.
- Pipes allow easier comms-contract management thanks to the close() API and a catch-all signal when the background process is dead (e.g. segfaults).
Test Plan: CI
Differential Revision: D74668559
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153488
Approved by: https://github.com/saumishr
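A rough sketch of that contract, not the PR's actual classes: using the spawn context, each process holds only its own end of the Pipe, so close() gives a clean shutdown signal and EOFError is the catch-all when the other side dies:
```
import multiprocessing as mp

def checkpointer(conn):
    while True:
        try:
            request = conn.recv()  # EOFError once the parent closes its end
        except EOFError:
            return
        conn.send(("ack", request))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    parent_end, child_end = ctx.Pipe()
    proc = ctx.Process(target=checkpointer, args=(child_end,), daemon=True)
    proc.start()
    child_end.close()               # parent keeps only its own end
    parent_end.send("save-step-100")
    print(parent_end.recv())        # EOFError here would mean the child died
    parent_end.close()              # close() API: tells the child to exit
```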
Unblocks #153020, which incidentally improves the CircularImportLinter to check all Python files. The script errors because it doesn't set a log name; another FSDP script already defaults LOGNAME to '' when it is not specified, so this change does the same.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153324
Approved by: https://github.com/awgu
Summary: HF loading when there is no metadata is an edge case for some users. We were previously calling safe_open(filename) to get the keys in the safetensors file, but this doesn't work with fsspec when models use a backend other than the local filesystem (e.g. HF, S3). This diff updates the code to open the file with fsspec.open() and then call safetensors.deserialize() to get the keys.
Test Plan: unit test and e2e test reading from hf
Differential Revision: D74181513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152856
Approved by: https://github.com/joecummings
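A hedged sketch of the described approach (the helper name is illustrative): read the raw bytes through fsspec so any backend works, then let safetensors parse out the keys:
```
import fsspec
from safetensors import deserialize

def read_safetensors_keys(path: str) -> list[str]:
    # works for hf://, s3://, local paths, etc., unlike safe_open(filename)
    with fsspec.open(path, "rb") as f:
        blob = f.read()
    # deserialize() yields (key, tensor_info) pairs parsed from the raw bytes
    return [key for key, _ in deserialize(blob)]
```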
Summary: The default behavior of ls for the local filesystem is detail=False, but this isn't the case for all filesystems (e.g. HuggingFace), so we set detail=False explicitly so that the return type of ls is a list of strings, and not the list of dictionaries it would be with detail=True.
Test Plan: tested in notebook
Differential Revision: D74080572
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152693
Approved by: https://github.com/joecummings
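Illustration with fsspec (paths are made up):
```
import fsspec

fs = fsspec.filesystem("file")
# detail=False pins the return type to a list of names; with detail=True
# (the default on some backends) ls returns a list of metadata dicts.
names = fs.ls("/tmp/checkpoint", detail=False)
assert all(isinstance(n, str) for n in names)
```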
Summary:
### Diff Context
- Sometimes the background process can get stuck processing an async checkpoint request, and trainer shutdown can occur before the background process completes.
- Fix: time out the thread while it reads the IPC queue for a response from the background process.
Differential Revision: D74017700
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152629
Approved by: https://github.com/saumishr
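A sketch of the timeout pattern, with an illustrative helper name and timeout value:
```
import queue

def recv_checkpoint_response(ipc_queue, timeout_s: float = 600.0):
    # bound the wait so trainer shutdown isn't blocked forever by a stuck
    # or dead background process
    try:
        return ipc_queue.get(timeout=timeout_s)
    except queue.Empty:
        raise TimeoutError(
            f"no response from the checkpoint process within {timeout_s}s"
        )
```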
Summary: The HF storage reader/writer component can in theory work with any backend, so we shouldn't require a token to be passed into FsspecReader/FsspecWriter, because the specific fsspec implementation may not handle tokens. Specifically, Manifold doesn't accept a token arg, but we always pass one in, which throws.
Test Plan: signals
Differential Revision: D73130679
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151464
Approved by: https://github.com/Skylion007
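The gist of the fix, sketched against fsspec (the helper name is illustrative): forward the token only when the caller actually provided one:
```
from fsspec.core import url_to_fs

def create_fs(path: str, token=None):
    # some backends reject a token kwarg entirely, so only pass it
    # through when it was explicitly given
    kwargs = {"token": token} if token is not None else {}
    return url_to_fs(path, **kwargs)
```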
Summary: If there is only one safetensors file, we don't need users to provide a metadata file; we can construct the metadata from the keys of that file. This is a use case for some HuggingFace models, so this adds support for it.
Test Plan:
ensure existing tests pass
tested e2e in a notebook
Differential Revision: D72472490
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150701
Approved by: https://github.com/joecummings
Summary: Introduce a barrier util in the DistWrapper for rank-local checkpointing. This barrier is used at the end of rank-local checkpointing to ensure all ranks synchronize.
Test Plan: UTs
Differential Revision: D72541431
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150748
Approved by: https://github.com/MeetVadakkanchery
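A minimal sketch of such a barrier helper, assuming torch.distributed is the underlying transport:
```
import torch.distributed as dist

def rank_local_checkpoint_barrier(process_group=None):
    # no-op outside distributed runs; otherwise block until every rank
    # has finished its rank-local checkpoint work
    if dist.is_available() and dist.is_initialized():
        dist.barrier(group=process_group)
```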
Summary: Add a param to specify to the storage writer how to save tensors. Right now the only options are safetensors and torch.save.
Test Plan:
(lintrunner) [ankitageorge@devgpu003.cco3 /data/users/ankitageorge/fbsource/fbcode/caffe2 (1d57cb27b)]$ buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage
File changed: fbcode//caffe2/torch/distributed/checkpoint/filesystem.py
Buck UI: https://www.internalfb.com/buck2/e80cc963-e34a-4876-b6f4-7ce2794e48dd
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3659174965882569
Network: Up: 32KiB Down: 1.9KiB (reSessionID-ef9fa764-a40a-451b-ab58-08eabe7a9422)
Executing actions. Remaining 0/4 3.4s exec time total
Command: test. Finished 2 local
Time elapsed: 19.6s
Tests finished: Pass 4. Fail 0. Fatal 0. Skip 0. Build failure 0
Reviewed By: saumishr
Differential Revision: D70271943
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150025
Approved by: https://github.com/saumishr
Summary:
The _load_state_dict_from_keys method documents that it `Loads any key specified in this set. If no keys are specified, the entire checkpoint is loaded.`
But this isn't happening right now, because an empty keys arg is passed into `_load_state_dict` as set(), and keys is expected to be None for everything to actually be included in the state_dict https://fburl.com/code/l8yzojyx. So with the set() argument, the state_dict is always going to be empty.
Test Plan: ensure existing tests pass
Differential Revision: D71930712
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150058
Approved by: https://github.com/saumishr
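The shape of the fix, sketched (names illustrative): normalize the empty set to None before delegating, so the documented "load everything" path is taken:
```
def normalize_keys(keys):
    # set() is falsy -> None means "no filter, load the entire checkpoint";
    # a non-empty set still restricts loading to those keys
    return set(keys) if keys else None
```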
Summary: This diff adds the ability for HF reader/writer to read/write in a distributed way. We do this by sending all the tensors meant for the same file to the same rank.
Test Plan:
ensure existing tests pass
I also ran a full end to end test on my devserver to read/write from my HF repo
Differential Revision: D70096439
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148189
Approved by: https://github.com/joecummings, https://github.com/saumishr
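One way to implement that routing, as a sketch (the real PR may differ): derive the owning rank from the target file name with a hash that is stable across processes:
```
import zlib

def owning_rank(file_name: str, world_size: int) -> int:
    # zlib.crc32 is deterministic across processes, unlike built-in hash()
    # on strings, so every rank computes the same assignment
    return zlib.crc32(file_name.encode()) % world_size
```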
Summary:
Cache save plan metadata to reduce the collective overhead.
Global plan dedupe and metadata creation are the main overheads on rank 0. This change avoids all of that cost on subsequent saves if the plans do not change. In a quick experiment with a 256-rank job, global step overhead drops by ~99%, from 90s+ to a mere 1.5s; that 1.5s was mostly spent creating the checkpoint module directories and on a near-empty collective.
Differential Revision: D71631441
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149785
Approved by: https://github.com/MeetVadakkanchery
Summary:
1. The current write item structure does not contain the amount of data that needs to be written.
2. planner.item already has a size primitive, 'tensor_storage_size' (https://fburl.com/code/7a0gsmw7), but only for tensors.
3. Right now, the only way the writer layer can get hold of this property (for non-tensor data) is to:
- first do a lookup into the actual tensor/bytes,
- then calculate the nbytes.
This change introduces a way to capture non-tensor data size within a write-plan item.
Test Plan: Existing UT.
Differential Revision: D71599725
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149699
Approved by: https://github.com/MeetVadakkanchery
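A hedged sketch of the idea (field names are illustrative, not the PR's): carry the payload size on the plan item itself so the writer layer never has to resolve the underlying object just to learn its size:
```
from dataclasses import dataclass
from typing import Optional

@dataclass
class SizedWriteItem:  # illustrative stand-in for the real write item
    fqn: str
    tensor_storage_size: Optional[int] = None  # tensors: already existed
    bytes_storage_size: Optional[int] = None   # non-tensor data: the new field

    def nbytes(self) -> Optional[int]:
        return self.tensor_storage_size or self.bytes_storage_size
```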
Summary:
1. The current write item structure does not contain the amount of data that needs to be written.
2. planner.item already has a size primitive, 'tensor_storage_size' (https://fburl.com/code/7a0gsmw7), but only for tensors.
3. Right now, the only way the writer layer can get hold of this property (for non-tensor data) is to:
- first do a lookup into the actual tensor/bytes,
- then calculate the nbytes.
This change introduces a way to capture non-tensor data size within a write-plan item.
Reviewed By: daulet-askarov
Differential Revision: D70497442
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149434
Approved by: https://github.com/MeetVadakkanchery
Summary:
Avoid the in-place update and deepcopy during dedupe. Deepcopy becomes prohibitively expensive with models having a huge number of FQNs. This manifested in the Ads 2K experiment as well. Here are the results from the TextRay model in Mitra:
#### Control job with deepcopy regression:
First save is ~24.8s
Global step latency is ~7-8s
#### Test job with the new fix to avoid deepcopy:
First save is ~21s
Global step latency is ~2s
Test Plan:
```
buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner
```
https://www.internalfb.com/intern/testinfra/testrun/3940649945104822
Differential Revision: D71245218
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149320
Approved by: https://github.com/MeetVadakkanchery
# Fix typo errors across PyTorch codebase
This PR fixes various spelling errors throughout the PyTorch codebase to improve documentation quality and code readability.
## Changes Made
### Documentation Fixes
- Changed "seperate" to "separate" in multiple files:
- `setup.py`: Build system documentation
- `torch/_library/triton.py`: AOT compilation comments
- `torch/csrc/dynamo/compiled_autograd.h`: Node compilation documentation
- `torch/export/_unlift.py`: Pass population comments
- `torch/export/exported_program.py`: Decomposition table notes
### Code Comments and Error Messages
- Changed "occured" to "occurred" in:
- `test/mobile/test_lite_script_module.py`: Exception handling comments
- `torch/export/_draft_export.py`: Error message text
- `aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp`: MAGMA bug comment
- `torch/csrc/utils/python_numbers.h`: Overflow handling comment
- `torch/csrc/jit/OVERVIEW.md`: Graph compilation documentation
- `torch/_dynamo/symbolic_convert.py`: Error explanation
### API Documentation
- Changed "fullfill" to "fulfill" in `torch/distributed/checkpoint/state_dict_loader.py`
- Changed "accross" to "across" in:
- `torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp`
- `torch/distributed/distributed_c10d.py`
## Motivation
These changes improve code readability and maintain consistent spelling throughout the codebase. No functional changes were made; this is purely a documentation and comment improvement PR.
## Test Plan
No testing required as these changes only affect comments and documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148262
Approved by: https://github.com/janeyx99
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Summary:
Title: Add the new HF storage class to the torch.distributed package so that it can be imported by customers.
The HF storage reader/writer was added as a DCP storage component so that DCP load and save can directly interact with the Hugging Face format and storage.
Test Plan: ensure signals pass
Differential Revision: D70495399
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148361
Approved by: https://github.com/MeetVadakkanchery
Summary:
### Context
Background checkpoint upload thread interfering with trainer thread:
In the [async save API](https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/state_dict_saver.py#L239-L248), the background thread spends a considerable amount of time on CPU-bound tasks (pickling/unpickling several metadata objects, a.k.a. SavePlans) on rank0 during the collective operation; this kind of asymmetric computation heavily contends for the GIL with the trainer thread, causing GPU util to suffer significantly for the E2E checkpoint duration.
### Solution:
Introduce async save via a checkpoint daemon process. This daemon process is created once (during the first save attempt) and can serve async checkpoint requests for the remainder of the training lifetime.
Test Plan: Added E2E UTs for process based async save.
Differential Revision: D69272583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147039
Approved by: https://github.com/saumishr
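A rough sketch of that lifecycle (not the PR's actual implementation): the daemon is created lazily on the first save and then reused, keeping the plan/pickle work off the trainer's GIL:
```
import multiprocessing as mp

_ctx = mp.get_context("spawn")
_daemon = None
_requests = None

def _serve(requests):
    while True:
        state_dict = requests.get()
        # plan, pickle, and write in this process, away from the trainer's GIL
        ...

def async_save(state_dict):
    global _daemon, _requests
    if _daemon is None:  # created once, on the first save attempt
        _requests = _ctx.Queue()
        _daemon = _ctx.Process(target=_serve, args=(_requests,), daemon=True)
        _daemon.start()
    _requests.put(state_dict)  # returns immediately; the daemon does the work
```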
Summary: D69984656 caused issues by adding the fsspec dependency to torch distributed when many packages internally didn't have it. In this diff I'm not adding HFStorageReader/Writer to __init__.py, so the HFStorage components don't get imported internally and no fsspec import happens. I did the removal from __init__.py in D70286926 to fix the failing tests, but the revert was done concurrently. I'll add the classes to __init__.py once I figure out a better way to get fsspec added as a dependency everywhere.
Test Plan:
signals pass
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage
Differential Revision: D70324090
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148089
Approved by: https://github.com/saumishr
Summary: This was causing an HFStorageReader/Writer to be imported, which imports fsspec, but dependencies don't have fsspec, causing failing builds.
Differential Revision: D70286926
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148030
Approved by: https://github.com/hl475
Summary:
This PR caches the save plans to significantly reduce the collective cost for successive checkpoint save attempts. Here is the high-level approach:
- Create the local plan and cache it.
- In the next iteration, compare the local plan with the cached plan metadata. If nothing changed, do not send that local plan in the collective.
- The global plan step will create global plans only from the new delta plans, with empty plans standing in for the cached ones.
- The finish plan step will check for empty plans: if a plan is empty, it grabs the cached plan; if not, it uses the new plan provided.
Test Plan: UTs
Differential Revision: D69224491
## How to enable the caching:
DefaultSavePlanner introduces enable_plan_caching, which is set to False by default for now.
https://github.com/pytorch/pytorch/pull/147343/files#diff-579bbb7b82572753afa91085fbf954f7c7613ff8376da9b26153d5cc3a3c4ee8R77
Set it to True to enable caching; you should see a significant speedup in subsequent checkpoint save attempts, especially for larger-scale jobs. Reference issue: https://github.com/pytorch/pytorch/issues/123695
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147343
Approved by: https://github.com/MeetVadakkanchery
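Usage sketch, with a toy state_dict standing in for a real model's:
```
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.default_planner import DefaultSavePlanner

state_dict = {"w": torch.zeros(4)}  # stand-in for a real model state_dict

# off by default for now; opt in and reuse the planner across save attempts
planner = DefaultSavePlanner(enable_plan_caching=True)
dcp.save(state_dict, checkpoint_id="/tmp/ckpt", planner=planner)
```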
Summary: Title - we want to write checkpoints in HF format with DCP; this diff enables that for the non-distributed use case.
Test Plan:
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_hf_torchtune_storage
N6476188 --> able to save and load tensor in hf format
Differential Revision: D68444967
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146352
Approved by: https://github.com/saumishr
Summary: When testing, I tried to pass a string argument to the FileSystem class' methods, which is a valid input, but the cast() that was supposed to convert the string to a path does nothing at runtime, leading all the methods to fail with a string arg. Instead of a cast, a proper constructor should be used.
Test Plan: N6475361 - the methods no longer throw an error with a string arg as they did previously
Differential Revision: D68713937
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145751
Approved by: https://github.com/pradeepfn
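Minimal illustration of the bug class (paths made up): typing.cast changes only the static type, so the value stays a str at runtime:
```
from pathlib import Path
from typing import cast

p1 = cast(Path, "checkpoints/step_100")  # still a str at runtime
p2 = Path("checkpoints/step_100")        # an actual Path object

print(type(p1))  # <class 'str'>
print(type(p2))  # e.g. <class 'pathlib.PosixPath'>
```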
In fusion models, users might change the state_dict keys via a state_dict_hook.
The load_state_dict APIs here won't call model.state_dict(), so the hooks are never invoked to change the keys, causing a mismatch between FQNs and state_dict keys.
This PR suggests that users declare how they change the state_dict key prefix (they can name it; here we call it "fqn_modifiers") by default.
During state_dict loading, we apply the prefix change while computing the FQN so the keys can be processed the same as if they had gone through the state_dict hook.
For example:
There's a state_dict_hook:
```
def _state_dict_hook(self, destination, prefix, keep_vars):
    """Remove "embedding" from the original embedding in the state_dict
    name. This keeps the original state dict name for the embedding
    from before fusing with the FusionEmbedding.
    [!Note] This update changes the order of the OrderedDict
    """
    key = prefix + "embedding.weight"
    new_key = prefix + "weight"
    destination[new_key] = destination[key]
    del destination[key]
```
In DSD after this PR, we will skip "embedding." before "weight" if we find the "fqn_modifiers" attribute on that module:
```
def fqn_modifiers(self) -> Dict[str, str]:
    return {
        "weight": "embedding",
    }
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146557
Approved by: https://github.com/fegin