pytorch/torch/distributed
Saurabh Mishra 381d0cb239 [DCP] Avoid in-place update and deepcopy during dudpe (#149320)
Summary:
Avoid in-place updates and deepcopy during dedupe. Deepcopy becomes prohibitively expensive for models with a very large number of FQNs. This also manifested in the Ads 2K experiment. Here are the results from the TextRay model in Mitra:

Control job with deepcopy regression:
First save: ~24.8s
Global step latency: ~7-8s

Test job with the new fix to avoid deepcopy:
First save: ~21s
Global step latency: ~2s
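
As a rough illustration of the idea in the summary, here is a minimal sketch of deduplicating per-rank save plans without `copy.deepcopy` and without mutating the inputs. It is not the code from this PR: `WriteItem`, `SavePlan`, and `dedup_plans` are hypothetical stand-ins, and the "assign each duplicate to the least-loaded rank" policy is an assumption made for the example. The only point is that `dataclasses.replace` builds a new plan object with a filtered item list, so nothing is deep-copied.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WriteItem:
    """Hypothetical stand-in for one entry to be written, keyed by FQN."""

    fqn: str
    nbytes: int


@dataclass(frozen=True)
class SavePlan:
    """Hypothetical stand-in for one rank's save plan."""

    items: tuple[WriteItem, ...]


def dedup_plans(plans: list[SavePlan]) -> list[SavePlan]:
    # Record which ranks (plan indices) want to write each FQN.
    owners: dict[str, list[int]] = defaultdict(list)
    sizes: dict[str, int] = {}
    for rank, plan in enumerate(plans):
        for item in plan.items:
            owners[item.fqn].append(rank)
            sizes[item.fqn] = item.nbytes

    # Assumed policy: give each FQN to the candidate rank with the least
    # bytes assigned so far, breaking ties by the lower rank.
    load = [0] * len(plans)
    keep: list[set[str]] = [set() for _ in plans]
    for fqn, ranks in owners.items():
        chosen = min(ranks, key=lambda r: (load[r], r))
        load[chosen] += sizes[fqn]
        keep[chosen].add(fqn)

    # Build new plans with replace(): no deepcopy, inputs left untouched.
    return [
        replace(plan, items=tuple(i for i in plan.items if i.fqn in keep[rank]))
        for rank, plan in enumerate(plans)
    ]
```

Because each output plan is a shallow copy that shares the surviving `WriteItem` objects with its input, the cost of building the result grows with the number of items rather than with the size of everything reachable from each plan.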

Test Plan:
```
buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner
```
https://www.internalfb.com/intern/testinfra/testrun/3940649945104822

Differential Revision: D71245218

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149320
Approved by: https://github.com/MeetVadakkanchery
2025-03-18 16:08:40 +00:00
..
_composable [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
_shard [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
_sharded_tensor [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
_sharding_spec
_symmetric_memory [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
_tensor [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
_tools Add support for non functional collectives under FakeTensorMode and fake_pg for memory tracking (#147566) 2025-03-08 18:00:49 +00:00
algorithms [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
autograd
benchmarks [BE][CI] bump ruff to 0.8.4 (#143753) 2024-12-24 12:24:10 +00:00
checkpoint [DCP] Avoid in-place update and deepcopy during dudpe (#149320) 2025-03-18 16:08:40 +00:00
elastic Expose the rendezvous keepalive arguments (#145228) 2025-03-03 19:11:56 +00:00
examples
fsdp [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257) 2025-03-18 00:46:07 +00:00
launcher PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
nn [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
optim [BE][Ez]: Use itertools.chain.from_iterable when possible (#148190) 2025-03-06 20:37:06 +00:00
pipelining [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
rpc [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
tensor Add batch dim sharding rule to sdpa (#149253) 2025-03-18 07:54:02 +00:00
__init__.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
_checkpointable.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
_composable_state.py
_functional_collectives_impl.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
_functional_collectives.py [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
_serialization.py PEP585: More UP006 fixes (#146392) 2025-02-20 06:18:13 +00:00
_state_dict_utils.py Create and send full_tensor on ProcessGroup-supported device in _broadcast_tensors (#148865) 2025-03-12 20:56:31 +00:00
argparse_util.py
c10d_logger.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
collective_utils.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
constants.py
CONTRIBUTING.md
device_mesh.py [DeviceMesh] Add some documentation for from_group API and add a 2D test (#146364) 2025-03-01 00:57:37 +00:00
distributed_c10d.py Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)" 2025-03-17 22:43:15 +00:00
launch.py [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
logging_handlers.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
remote_device.py
rendezvous.py Fix dist.init_process_group on windows (#148266) 2025-03-05 00:07:56 +00:00
run.py [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
utils.py [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00