pytorch/docs/source/notes
Mikayla Gawarecki 8e483654cb Add config.save.use_pinned_memory_for_d2h to serialization config (#143342)
This was benchmarked with two separate scripts on my A100
(A) Save state_dict of llama3-style model on CUDA to disk with ``torch.save``
(B) Save `ModuleList` of 10 `nn.Linear(10,000, 10,000)` on CUDA to disk with `torch.save`
Timings are an average of 5 runs and benchmark scripts + results are attached

Under both scenarios, we see **~2x speedup in ``torch.save`` time with (``compute_crc32=False`` and ``use_pinned_memory_for_d2h=True``)** compared to the baseline of the current defaults (``compute_crc32=True`` and ``use_pinned_memory_for_d2h=False``

(A)  Save state_dict of llama3-style model on CUDA to disk with ``torch.save`` [[script](https://gist.github.com/mikaylagawarecki/d3a86ea1bb08045d1a839976808d7432)][[results](https://gist.github.com/mikaylagawarecki/f61a4714e5cff703146a1fcb7e0c755c)]

|                                                                                 |  use_pinned_memory_for_d2h=False (Default) |  use_pinned_memory_for_d2h=True |
|-|-|-|
| `compute_crc_32= True`  (Default)| 28.54s | 20.76s |
| `compute_crc_32 = False` | 22.57s |  **14.51s** |

(B) Save `ModuleList` of 10 `nn.Linear(10,000, 10,000)` on CUDA to disk with `torch.save` [[script](https://gist.github.com/mikaylagawarecki/ecbc505436bdd4b5190ef1b3430c12b6)][[results](https://gist.github.com/mikaylagawarecki/4e686bcf030b57de8c3ca74d8f5a88f7)]

|                                                                                 |  use_pinned_memory_for_d2h=False (Default) |  use_pinned_memory_for_d2h=True |
|-|-|-|
| `compute_crc_32= True`  (Default)| 8.38s | 5.53s |
| `compute_crc_32 = False` | 6.94s |  **3.99s** |

Trace of (A) with `use_pinned_memory_for_d2h=True`, `compute_crc32=False`
<img width="1745" alt="Screenshot 2024-12-16 at 7 32 33 PM" src="https://github.com/user-attachments/assets/80b87a8c-5a70-4eb9-ad66-7abc4aa7cc25" />

Baseline trace of (A) with `use_pinned_memory_for_d2h=False`, `compute_crc32=True`
<img width="1799" alt="Screenshot 2024-12-16 at 7 38 20 PM" src="https://github.com/user-attachments/assets/13fa12d1-8f5f-424c-adc4-275b67012927" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143342
Approved by: https://github.com/albanD
ghstack dependencies: #143324
2024-12-20 21:01:18 +00:00
..
amp_examples.rst Update document for autocast on CPU (#135299) 2024-09-13 09:11:47 +00:00
autograd.rst Fix unexpected inference_mode interaction with torch.autograd.functional.jacobian (#130307) 2024-08-25 22:14:02 +00:00
broadcasting.rst
cpu_threading_runtimes.svg
cpu_threading_torchscript_inference.rst
cpu_threading_torchscript_inference.svg
cuda.rst [PyTorch Pinned Allocator] Add support of background thread to process events (#135524) 2024-09-17 21:08:10 +00:00
custom_operators.rst Redirect the custom ops landing page :D (#139634) 2024-11-04 22:25:15 +00:00
ddp.rst
extending.func.rst
extending.rst [doc] fix grammar in "Extending Torch" (#140209) 2024-11-13 05:34:43 +00:00
faq.rst
fsdp.rst
get_start_xpu.rst update get start xpu (#137479) 2024-10-16 17:36:29 +00:00
gradcheck.rst
hip.rst [ROCm] set hipblas workspace (#138791) 2024-10-29 01:37:55 +00:00
large_scale_deployments.rst
modules.rst Fix to modules.rst: indent line with activation functions (#139667) 2024-11-08 01:12:52 +00:00
mps.rst
multiprocessing.rst
numerical_accuracy.rst Add option to configure reduced precision math backend for SDPA (#135964) 2024-09-24 07:11:38 +00:00
randomness.rst Fix typo in Reproducibility docs (#141341) 2024-11-26 16:53:26 +00:00
serialization.rst Add config.save.use_pinned_memory_for_d2h to serialization config (#143342) 2024-12-20 21:01:18 +00:00
windows.rst