pytorch/torch/distributed/checkpoint
Saurabh Mishra 381d0cb239 [DCP] Avoid in-place update and deepcopy during dedupe (#149320)
Summary:
Avoid in-place updates and deepcopy during dedupe. Deepcopy becomes prohibitively expensive for models with a huge number of FQNs. This also manifested in the Ads 2K experiment. Here are the results from the TextRay model in Mitra:

Control job with the deepcopy regression:
First save: ~24.8s
Global step latency: ~7-8s

Test job with the new fix to avoid deepcopy:
First save: ~21s
Global step latency: ~2s
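
As a rough illustration of the idea in the summary (not the code from this PR), the sketch below dedupes per-rank plans by building new plan objects with filtered item lists, so the inputs are neither mutated nor deep-copied. `Item`, `Plan`, and `dedup_plans` are hypothetical stand-ins, and the "lowest rank wins" ownership rule is an arbitrary policy chosen for the example.

```python
import dataclasses
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    # Hypothetical stand-in for a write item, keyed by its fully qualified name (FQN).
    fqn: str
    size: int


@dataclass
class Plan:
    # Hypothetical stand-in for one rank's save plan.
    items: list[Item]


def dedup_plans(plans: list[Plan]) -> list[Plan]:
    """Keep each FQN on exactly one rank without mutating or deep-copying the inputs."""
    # Assumed policy: the first (lowest) rank that lists an FQN owns it.
    owner: dict[str, int] = {}
    for rank, plan in enumerate(plans):
        for item in plan.items:
            owner.setdefault(item.fqn, rank)

    # dataclasses.replace builds a new Plan around a new, filtered list.
    # The Item objects themselves are shared, so the cost is one shallow
    # list per rank rather than a deep copy of every item.
    return [
        dataclasses.replace(
            plan, items=[item for item in plan.items if owner[item.fqn] == rank]
        )
        for rank, plan in enumerate(plans)
    ]


# Two ranks both list "a"; after dedupe only rank 0 keeps it.
plans = [
    Plan([Item("a", 1), Item("b", 2)]),
    Plan([Item("a", 1), Item("c", 3)]),
]
deduped = dedup_plans(plans)
```

Because the original plans are left untouched, a caller can keep using or caching them after the dedupe step.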

Test Plan:
```
buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner
```
https://www.internalfb.com/intern/testinfra/testrun/3940649945104822

Differential Revision: D71245218

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149320
Approved by: https://github.com/MeetVadakkanchery
2025-03-18 16:08:40 +00:00
..
examples
__init__.py Add new hf storage class to torch.distributed package (#148361) 2025-03-05 21:52:06 +00:00
_async_executor.py [DCP] Introduce process based async checkpointing (#147039) 2025-03-04 13:33:28 +00:00
_async_process_executor.py [DCP] Introduce process based async checkpointing (#147039) 2025-03-04 13:33:28 +00:00
_async_thread_executor.py [DCP] Introduce process based async checkpointing (#147039) 2025-03-04 13:33:28 +00:00
_checkpointer.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
_dedup_save_plans.py [DCP] Avoid in-place update and deepcopy during dedupe (#149320) 2025-03-18 16:08:40 +00:00
_dedup_tensors.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
_extension.py PEP585: Missed conversions (#145342) 2025-01-29 05:24:36 +00:00
_fsspec_filesystem.py Revert "Build a storage reader/writer to write checkpoints in HF format (#146352)" 2025-02-21 07:30:52 +00:00
_hf_storage.py Build a storage reader/writer to write checkpoints in HF format (#148089) 2025-02-28 07:38:10 +00:00
_nested_dict.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
_sharded_tensor_utils.py
_storage_utils.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
_traverse.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
_version.py
api.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
default_planner.py [DCP] Avoid in-place update and deepcopy during dedupe (#149320) 2025-03-18 16:08:40 +00:00
filesystem.py Build a storage reader/writer to write checkpoints in HF format (#148089) 2025-02-28 07:38:10 +00:00
format_utils.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
logger.py [DCP] Introduce process based async checkpointing (#147039) 2025-03-04 13:33:28 +00:00
logging_handlers.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
metadata.py [DCP] Introduce modules metadata in the storage_meta (#146654) 2025-02-13 17:44:30 +00:00
optimizer.py [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
planner_helpers.py [DCP] Cache save plans in default planner (#147343) 2025-02-25 20:59:25 +00:00
planner.py [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
resharding.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
staging.py Fix staging for CPU tensors in OSS DCP async_save (#145408) 2025-01-23 12:49:26 -08:00
state_dict_loader.py Typo Errors fixed in multiple files (#148262) 2025-03-09 12:21:40 +00:00
state_dict_saver.py [DCP] Introduce process based async checkpointing (#147039) 2025-03-04 13:33:28 +00:00
state_dict.py [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257) 2025-03-18 00:46:07 +00:00
stateful.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
storage.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
utils.py [DCP] fix dcp gather_object/scatter_object_list (#147675) 2025-03-06 21:20:38 +00:00