pytorch/docs
Meet Vadakkanchery fdee60769a [DCP] Introduce process based async checkpointing (#147039)
Summary:
### Context
Background checkpoint upload thread interfering with trainer thread:

In the [async save API](https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/state_dict_saver.py#L239-L248), the background thread spends a considerable amount of time on CPU-bound tasks (pickling/unpickling several metadata objects, a.k.a. SavePlans) on rank0 during the collective operation. This asymmetric computation contends heavily for the GIL with the trainer thread, causing GPU utilization to suffer significantly for the E2E checkpoint duration.

### Solution:
Introduce async save via a checkpoint daemon process. This daemon process is created once (during the first save attempt) and can serve async checkpoint requests for the remainder of the training lifetime.
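
The idea can be illustrated with a minimal standard-library sketch. This is not the DCP implementation: `CheckpointDaemon`, `_daemon_loop`, and the queue-based protocol are hypothetical names chosen for illustration, and the POSIX-only `fork` start method is an assumption made so the sketch runs as a plain script. It shows only the two properties described above: the worker process is started lazily on the first save, and the same process serves every later request, so the serialization work runs under the child's own GIL.

```python
import multiprocessing as mp
import os
import pickle
import tempfile

# Assumption: POSIX "fork" start method, used only to keep the sketch self-contained.
_CTX = mp.get_context("fork")


def _daemon_loop(requests, results):
    """Child process: serve save requests until a None sentinel arrives."""
    while True:
        item = requests.get()
        if item is None:
            break
        path, state = item
        with open(path, "wb") as f:
            # Serialization to disk runs in the child, under the child's own GIL.
            pickle.dump(state, f)
        results.put((path, os.getpid()))


class CheckpointDaemon:
    """Hypothetical helper: starts one worker process lazily and reuses it."""

    def __init__(self):
        self._proc = None
        self._requests = None
        self._results = None

    def async_save(self, path, state):
        if self._proc is None:  # created once, on the first save attempt
            self._requests = _CTX.Queue()
            self._results = _CTX.Queue()
            self._proc = _CTX.Process(
                target=_daemon_loop,
                args=(self._requests, self._results),
                daemon=True,
            )
            self._proc.start()
        self._requests.put((path, state))  # returns without waiting for the write

    def wait(self, timeout=30):
        """Block until the next save finishes; returns (path, worker_pid)."""
        return self._results.get(timeout=timeout)

    def close(self):
        if self._proc is not None:
            self._requests.put(None)  # sentinel asks the worker to exit
            self._proc.join(timeout=30)
            self._proc = None
```

A caller would invoke `async_save(path, state)` after each training step that needs a checkpoint, continue training, and call `wait()` only when it needs to confirm that a given save has completed.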

Test Plan: Added E2E UTs for process based async save.

Differential Revision: D69272583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147039
Approved by: https://github.com/saumishr
2025-03-04 13:33:28 +00:00
cpp docs: get rid of copyright year (#144562) 2025-01-10 19:57:25 +00:00
source [DCP] Introduce process based async checkpointing (#147039) 2025-03-04 13:33:28 +00:00
.gitignore
libtorch.rst Add ROCm documentation to libtorch (C++) reST. (#136378) 2024-09-25 02:30:56 +00:00
make.bat
Makefile [ONNX] Update images and APIs to onnx_dynamo.rst (#144358) 2025-01-08 21:44:43 +00:00
README.md
requirements.txt Revert "Fix deprecated pytorch_sphinx_theme editable installation (#145347)" 2025-01-23 20:06:07 +00:00

Please see the Writing documentation section of CONTRIBUTING.md for details on both writing and building the docs.