Summary:
As we move toward supporting saving partial tensors natively with HFStorageWriter, a few simple changes are needed:
- The current approach to distributed writes assumes every rank holds full tensors and splits the writing of those full tensors across all available ranks. This change removes that logic from the HFSavePlanner and instead assumes every rank holds a shard, saving each rank's local state.
- As a result, the HFSavePlanner could probably be removed entirely, but it is kept as a placeholder for now.
- The current file naming ("model-00001-of-00004.safetensors") doesn't support shards: if every rank writes the same file names, the ranks overwrite each other. This change adds a "shard-00001"-style prefix so that per-rank files don't collide (see the filename sketch after this list).
- Don't save the metadata file model.safetensors.index.json when sharding is enabled. That file expects a 1:1 mapping between tensor and filename, which doesn't hold under sharded saving, so we simply omit it (its layout is shown after this list).
- Make the "fqn_to_file_index" map optional. It describes which file each tensor should be saved in; when users don't provide it, all tensors are saved to one file. If they run into the ~5GB Hugging Face remote-storage soft limit per file, they can use the map to choose how to split their tensors across files (a hypothetical usage sketch follows this list).
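A minimal sketch of the collision-free naming scheme, assuming a rank-based prefix; the helper name and exact zero-padding are illustrative, not the actual HFStorageWriter code:

```python
# Hypothetical helper: prefix the usual Hugging Face file name with the
# writing rank's shard index so that ranks emitting the same file_index
# don't overwrite each other.
def shard_file_name(rank: int, file_index: int, num_files: int) -> str:
    base = f"model-{file_index:05d}-of-{num_files:05d}.safetensors"
    return f"shard-{rank:05d}-{base}"

print(shard_file_name(rank=1, file_index=1, num_files=4))
# shard-00001-model-00001-of-00004.safetensors
```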
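For context, this is the standard Hugging Face index layout that gets dropped under sharding; the tensor names and byte count below are illustrative:

```python
# model.safetensors.index.json maps each tensor FQN to exactly one file,
# which is why it can't describe a tensor split across rank shards.
index = {
    "metadata": {"total_size": 4971057664},  # illustrative size in bytes
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
        "model.layers.0.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
    },
}
```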
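A hypothetical sketch of the optional mapping; the FQNs and the way the map is passed to the writer are assumptions based on this description, not a confirmed signature:

```python
# Hypothetical: assign each tensor FQN to a file index so that no single
# .safetensors file grows past the ~5GB soft limit. Omitting the map
# saves all tensors to one file.
fqn_to_file_index = {
    "model.embed_tokens.weight": 1,
    "model.layers.0.self_attn.q_proj.weight": 1,
    "lm_head.weight": 2,
}
# Passed to the storage writer at save time (parameter name assumed):
# writer = HFStorageWriter(..., fqn_to_file_index=fqn_to_file_index)
```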
Test Plan: test_hf_storage.py
Reviewed By: saumishr
Differential Revision: D75099862
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155566
Approved by: https://github.com/saumishr