pytorch/torch/distributed
Ankita George dbec08bc1c Changes to HFStorageWriter to support saving shards of tensors (#154742) (#155566)
Summary:

As we move towards supporting saving partial tensors natively with HFStorageWriter, a few simple changes are needed to make this happen.
- The current approach to distributed writes assumes that every rank holds the full tensors and splits the writing of those full tensors across all available ranks. We're removing this logic, which lived in the HFSavePlanner, and instead assuming that every rank holds a shard and saving each rank's local state.
    - As a result we could probably remove the HFSavePlanner entirely, but it is kept as a placeholder for now.

- The current file naming doesn't support shards: files are named in the format "model-00001-of-00004.safetensors", so if every rank writes the same file names the ranks overwrite each other. This change adds a shard-00001 prefix so that files from different ranks don't collide (see the sketch after this list).
- Don't save the metadata file model.safetensors.index.json when sharding is enabled. That file assumes a one-to-one mapping between tensor and file name, which doesn't hold in the sharded saving approach, so it is simply omitted.
- make the "fqn_to_file_index" map optional. This is to describe which files to save which tensors in, but if users don't want to provide this, we can just save all the tensors to one file. If they run into issues, they can choose how to split up their tensors to be more friendly with 5GB HF remote storage file size soft limit.

Test Plan: test_hf_storage.py

Reviewed By: saumishr

Differential Revision: D75099862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155566
Approved by: https://github.com/saumishr
2025-06-10 23:37:47 +00:00
_composable [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
_shard Add torch.Tensor._make_wrapper_subclass to torch/_C/__init__.pyi (#154022) 2025-05-27 14:10:00 +00:00
_sharded_tensor [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
_sharding_spec
_symmetric_memory [Async TP] Fix dim swapping before reduction in fused_scaled_matmul_reduce_scatter (#153595) 2025-05-15 21:44:57 +00:00
_tensor [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
_tools [BE]: Update ruff to 0.11.8 (#153249) 2025-05-12 18:30:52 +00:00
algorithms Revert "[BE]: Enable RUFF TRY400 rule - log.exception (#153473)" 2025-05-16 08:29:26 +00:00
autograd
benchmarks [BE][CI] bump ruff to 0.8.4 (#143753) 2024-12-24 12:24:10 +00:00
checkpoint Changes to HFStorageWriter to support saving shards of tensors (#154742) (#155566) 2025-06-10 23:37:47 +00:00
elastic [1/n]adding torch.distributed.run option to provide destination for event logging (#154644) (#155268) 2025-06-09 10:43:52 +00:00
examples [BE]: Add PEP621 project section to pyproject.toml (#153055) 2025-05-12 02:16:07 +00:00
fsdp [FSDP2] keep root unsharded when not specifying reshard_after_forward (#155319) 2025-06-06 20:29:31 +00:00
launcher [1/n]adding torch.distributed.run option to provide destination for event logging (#154644) (#155268) 2025-06-09 10:43:52 +00:00
nn [BE]: Update ruff to 0.11.8 (#153249) 2025-05-12 18:30:52 +00:00
optim [BE][Ez]: Remove unneeded mypy suppressions (#154800) 2025-06-01 06:10:41 +00:00
pipelining [BE][Ez]: Optimize unnecessary lambda with operator (#154722) 2025-05-30 23:47:10 +00:00
rpc Make torch importable if compiled without TensorPipe (#154382) 2025-05-27 18:13:38 +00:00
tensor [dtensor] fix simplefsdp mixed-precision training bugs (#154975) 2025-06-03 14:47:36 +00:00
__init__.py c10d/Store: add nonblocking mode to queue_pop (#151485) 2025-04-18 02:14:50 +00:00
_checkpointable.py [BE]: Backport runtime_checkable perf improvements/behavior from 3.12 (#155130) 2025-06-06 13:28:05 +00:00
_composable_state.py
_functional_collectives_impl.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
_functional_collectives.py Add torch.Tensor._make_wrapper_subclass to torch/_C/__init__.pyi (#154022) 2025-05-27 14:10:00 +00:00
_serialization.py PEP585: More UP006 fixes (#146392) 2025-02-20 06:18:13 +00:00
_state_dict_utils.py fix numpy compatibility for 2d small list indices (#154806) 2025-06-04 01:58:52 +00:00
argparse_util.py
c10d_logger.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
collective_utils.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
constants.py
CONTRIBUTING.md
device_mesh.py Revert "[inductor] Add typing to _inductor/ir.py (#149958)" 2025-06-06 15:19:16 +00:00
distributed_c10d.py [c10d][gloo] Integrate vendor generic FR into gloo (#152614) 2025-06-03 16:12:54 +00:00
launch.py [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
logging_handlers.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
remote_device.py
rendezvous.py Fix tcp init when using port 0 (#154156) 2025-05-23 21:41:58 +00:00
run.py [1/n]adding torch.distributed.run option to provide destination for event logging (#154644) (#155268) 2025-06-09 10:43:52 +00:00
utils.py Refactor to use torch.accelerator.device_index instead of torch.cuda.device for generic device context manager (#148880) 2025-04-25 09:45:25 +00:00