pytorch/torch/testing/_internal/distributed
Teja dd3e7170c2 Add async checkpointing impl to experimental checkpointer and add a builder API (#156927)
1. Adds an AsyncCheckpointer with out-of-process checkpointing, and a state_dict_stager with shared-memory, pinned-memory, and zero-overhead support.

2. Adds two convenience functions to create sync/async checkpointers.

Differential Revision: [D77336833](https://our.internmc.facebook.com/intern/diff/D77336833/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156927
Approved by: https://github.com/pradeepfn
2025-07-03 22:49:20 +00:00
..
_shard [BE][lint] fix PYFMT for PT-D code under torch.testing._internal, add them to the lint list (#153114) 2025-05-08 14:01:49 +00:00
_tensor Add TEST_HPU flag to set device type (#153461) 2025-05-14 19:31:40 +00:00
nn [BE][lint] fix PYFMT for PT-D code under torch.testing._internal, add them to the lint list (#153114) 2025-05-08 14:01:49 +00:00
rpc [BE][6/16] fix typos in torch/ (#156316) 2025-06-23 02:57:34 +00:00
__init__.py
checkpoint_utils.py Add async checkpointing impl to experimental checkpointer and add a builder API (#156927) 2025-07-03 22:49:20 +00:00
common_state_dict.py [BE][6/16] fix typos in torch/ (#156316) 2025-06-23 02:57:34 +00:00
ddp_under_dist_autograd_test.py [BE][lint] fix PYFMT for PT-D code under torch.testing._internal, add them to the lint list (#153114) 2025-05-08 14:01:49 +00:00
distributed_test.py [BE][6/16] fix typos in torch/ (#156316) 2025-06-23 02:57:34 +00:00
distributed_utils.py [BE][lint] fix PYFMT for PT-D code under torch.testing._internal, add them to the lint list (#153114) 2025-05-08 14:01:49 +00:00
fake_pg.py Register hpu device to fake backend (#156076) 2025-06-23 16:08:08 +00:00
multi_threaded_pg.py [BE][lint] fix PYFMT for PT-D code under torch.testing._internal, add them to the lint list (#153114) 2025-05-08 14:01:49 +00:00
rpc_utils.py [BE][lint] fix PYFMT for PT-D code under torch.testing._internal, add them to the lint list (#153114) 2025-05-08 14:01:49 +00:00