Summary:

### Context

Background checkpoint upload thread interfering with the trainer thread: in the [async save API](https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/state_dict_saver.py#L239-L248), the background thread spends a considerable amount of time on CPU-bound work (pickling/unpickling several metadata objects, a.k.a. SavePlans) on rank 0 during the collective operation. This asymmetric computation contends heavily with the trainer thread for the GIL, causing GPU utilization to drop significantly for the entire end-to-end checkpoint duration.

### Solution

Introduce async save via a checkpoint daemon process. This daemon process is created once (during the first save attempt) and serves async checkpoint requests for the remainder of the training lifetime.

Test Plan: Added E2E unit tests for process-based async save.

Differential Revision: D69272583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147039
Approved by: https://github.com/saumishr
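The sketch below is only an illustration of the daemon-process pattern the summary describes, not the actual PyTorch implementation (the real entry point is the async save API in `state_dict_saver.py` linked above): a long-lived worker process is spawned lazily on the first save attempt and then serves all later checkpoint requests, so the CPU-bound serialization no longer contends with the trainer thread for the GIL. All names here (`CheckpointDaemon`, `_worker_loop`, the pickle-based writer) are hypothetical.

```python
# Illustrative sketch only -- not the PyTorch implementation. One long-lived
# background process is created lazily on the first save attempt and then serves
# all later async checkpoint requests, keeping CPU-bound pickling off the
# trainer process's GIL. All names here are hypothetical.
import multiprocessing as mp
import os
import pickle


def _worker_loop(requests, results):
    """Runs inside the daemon process: serialize and write each staged state dict."""
    while True:
        item = requests.get()
        if item is None:  # shutdown sentinel
            break
        path, staged_state_dict = item
        with open(path, "wb") as f:
            pickle.dump(staged_state_dict, f)  # CPU-bound work, outside the trainer's GIL
        results.put(path)


class CheckpointDaemon:
    """Lazily spawns a single background process that handles all async saves."""

    def __init__(self):
        self._ctx = mp.get_context("spawn")
        self._requests = self._ctx.Queue()
        self._results = self._ctx.Queue()
        self._proc = None

    def _ensure_started(self):
        if self._proc is None:  # created once, during the first save attempt
            self._proc = self._ctx.Process(
                target=_worker_loop, args=(self._requests, self._results), daemon=True
            )
            self._proc.start()

    def async_save(self, state_dict, path):
        """Enqueue a save request and return immediately."""
        self._ensure_started()
        # In a real trainer, tensors would be staged (copied to CPU) here before
        # handoff, so the GPU can keep training while the daemon process writes.
        self._requests.put((path, state_dict))

    def wait(self):
        """Block until the next pending save has finished; returns its path."""
        return self._results.get()

    def shutdown(self):
        if self._proc is not None:
            self._requests.put(None)
            self._proc.join()


if __name__ == "__main__":
    daemon = CheckpointDaemon()
    daemon.async_save({"step": 1, "weights": [0.1, 0.2]}, "ckpt_step1.pkl")
    print("saved:", daemon.wait())
    daemon.shutdown()
    os.remove("ckpt_step1.pkl")
```

A separate process avoids GIL contention entirely, at the cost of one extra process per trainer and the need to stage tensors into CPU (or shared) memory before handing them off to the daemon.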
Directory listing (`docs/`):

- `cpp/`
- `source/`
- `.gitignore`
- `libtorch.rst`
- `make.bat`
- `Makefile`
- `README.md`
- `requirements.txt`
Please see the Writing documentation section of CONTRIBUTING.md for details on both writing and building the docs.