pytorch/test/distributed
Saurabh Mishra 381d0cb239 [DCP] Avoid in-place update and deepcopy during dudpe (#149320)
Summary:
Avoid in-place update and deepcopy during dedup. Deepcopy becomes prohibitively expensive for models with a huge number of FQNs. This also manifested in the Ads 2K experiment. Here are the results from the TextRay model in Mitra:

Control job with deepcopy regression:
First save is ~24.8s
Global step latency is ~7-8s

Test job with the new fix to avoid deepcopy:
First save is ~21s
Global step latency is ~2s
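
The general idea described in the summary (dedup without deepcopy or in-place mutation) can be illustrated with a minimal sketch. The `WriteItem`, `SavePlan`, and `dedup_plans` names below are hypothetical stand-ins written for this illustration; they are not the actual DCP planner code, and the keep-first-occurrence policy is an assumption made for the example.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class WriteItem:
    # Hypothetical stand-in for a per-FQN write item.
    fqn: str
    nbytes: int


@dataclasses.dataclass(frozen=True)
class SavePlan:
    # Hypothetical stand-in for a per-rank save plan.
    items: tuple


def dedup_plans(plans):
    """Return new plans in which each FQN is kept only by the first plan listing it.

    The input plans are never mutated and never deep-copied: each output plan
    is a shallow replacement that reuses the original item objects.
    """
    seen = set()
    result = []
    for plan in plans:
        kept = []
        for item in plan.items:
            if item.fqn not in seen:
                seen.add(item.fqn)
                kept.append(item)
        # dataclasses.replace builds a new plan object with a new items tuple.
        result.append(dataclasses.replace(plan, items=tuple(kept)))
    return result


plans = [
    SavePlan(items=(WriteItem("a.weight", 4), WriteItem("b.weight", 8))),
    SavePlan(items=(WriteItem("a.weight", 4), WriteItem("c.bias", 2))),
]
deduped = dedup_plans(plans)
```

Because only the containers are rebuilt, the cost in this sketch grows with the number of items rather than with the size of everything reachable from each plan, which is what a deep copy would traverse.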

Test Plan:
```
buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner
```
https://www.internalfb.com/intern/testinfra/testrun/3940649945104822

Differential Revision: D71245218

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149320
Approved by: https://github.com/MeetVadakkanchery
2025-03-18 16:08:40 +00:00
..
_composable [FSDP2] Update ignored_params docstring and add unit test (#149074) 2025-03-15 00:23:09 +00:00
_shard PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
_tools Add support for non functional collectives under FakeTensorMode and fake_pg for memory tracking (#147566) 2025-03-08 18:00:49 +00:00
algorithms Remove NO_MULTIPROCESSING_SPAWN checks (#146705) 2025-02-28 05:53:19 +00:00
bin
checkpoint [DCP] Avoid in-place update and deepcopy during dudpe (#149320) 2025-03-18 16:08:40 +00:00
elastic Remove NO_MULTIPROCESSING_SPAWN checks (#146705) 2025-02-28 05:53:19 +00:00
flight_recorder [fr][fix] Split MatchState and dynamic info for fr analysis downstream (#147439) 2025-02-19 22:09:16 +00:00
fsdp Enable FSDP tests on XPU device (#147518) 2025-03-04 23:49:37 +00:00
launcher PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
nn/jit PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
optim Enable ASAN in CUDA tests (#147512) 2025-02-25 02:58:39 +00:00
pipelining Refactoring pipeline parallelism test cases to be device agnostic [1/n] (#146472) 2025-02-11 00:13:23 +00:00
rpc
tensor Support subclass constructor capturing in export (#147014) 2025-03-16 18:19:19 +00:00
argparse_util_test.py
test_backends.py API to retrieve default distributed backend from device (#140536) 2024-11-22 11:01:53 +00:00
test_c10d_common.py Skip distributed subprocess test internally as they don't work (#148909) 2025-03-11 02:07:45 +00:00
test_c10d_functional_native.py Revert "[AOTI] Update test runner to use the new APIs (#147105)" 2025-03-18 15:25:40 +00:00
test_c10d_gloo.py [ROCm] Enable post-merge trunk workflow on MI300 runners; skip and fix MI300 related failed tests (#143673) 2025-01-09 05:18:57 +00:00
test_c10d_logger.py [c10d] Switch all timer logging in c10d to wait_counter (#141154) 2024-11-21 01:10:11 +00:00
test_c10d_nccl.py Remove outdated CUDA version check (#148142) 2025-03-04 03:33:44 +00:00
test_c10d_object_collectives.py Update test_c10d_object_collectives.py with DistributedTestBase class (#145056) 2025-02-13 03:57:59 +00:00
test_c10d_ops_nccl.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_c10d_pypg.py c10d/ProcessGroup: cleanup abort and shutdown (#148798) 2025-03-08 18:33:18 +00:00
test_c10d_spawn_gloo.py PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
test_c10d_spawn_nccl.py PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
test_c10d_spawn_ucc.py PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
test_c10d_spawn.py Remove NO_MULTIPROCESSING_SPAWN checks (#146705) 2025-02-28 05:53:19 +00:00
test_c10d_ucc.py XFAIL test_save_load_checkpoint (#144927) 2025-01-16 07:31:56 +00:00
test_collective_utils.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_composability.py composability test cleanup (#145011) 2025-01-18 04:37:12 +00:00
test_compute_comm_reordering.py [CI] Add Compiled DDP / Compiled FSDP2 / compute-comm reordering tests to test_inductor_distributed (#138178) 2024-10-20 19:38:18 +00:00
test_control_collectives.py
test_data_parallel.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_device_mesh.py [DeviceMesh] Add some documentation for from_group API and add a 2D test (#146364) 2025-03-01 00:57:37 +00:00
test_distributed_spawn.py Remove NO_MULTIPROCESSING_SPAWN checks (#146705) 2025-02-28 05:53:19 +00:00
test_dynamo_distributed.py dynamo: fsdp throw unimplemented vs attribute error (#146188) 2025-02-04 21:45:55 +00:00
test_fake_pg.py [BE][Ez]: FURB148 - remove useless enumerate calls (#145619) 2025-01-24 23:37:15 +00:00
test_functional_api.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_inductor_collectives.py [partitioner] always ban compiler-driven recompute of collectives by default (#147561) 2025-03-13 03:36:13 +00:00
test_launcher.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_multi_threaded_pg.py
test_nccl.py [Pytorch][ATEN] Enable FP8 NCCL in Pytorch ATEN (#138776) 2024-10-25 21:56:47 +00:00
test_pg_wrapper.py
test_serialization.py distributed/serialization: add experimental streaming torch.save/load methods (#146555) 2025-02-07 18:08:11 +00:00
test_store.py TCPStore: soft fail bind when agent store active (#147465) 2025-02-21 03:02:26 +00:00
test_symmetric_memory.py [SymmetricMemory] fix an issue where rendezvous is performed with wrong device context when torch.cuda.set_device() is not callled (#144886) 2025-01-28 01:43:37 +00:00