pytorch/docs/source/notes
Jaewon Lee 11ea09effc [CUDACachingAlloc/GPUInference] Implement garbage collection without GPU sync (#74261)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74261

### Goal
Implement a cheap way to reclaim GPU memory (garbage collection) without incurring GPU sync.

### Why do we need this?
Currently, there are only two ways to reclaim GPU memory blocks already assigned to a particular stream.

- `release_available_cached_blocks(params)`: Frees blocks exceeding `CachingAllocatorConfig::max_split_size()` until the request can be satisfied.

Issue: If `max_split_size` is unset (the default), this function is a no-op. Even when it is set, reclamation is quite conservative (e.g., blocks smaller than `max_split_size` are never freed).

- `release_cached_blocks()`: Waits for all in-flight events and then reclaims blocks.

Issue: waiting for all events is very expensive, as it will likely stall all GPU operations. Many GPU applications that do not properly handle the potential GPU throttling would suffer or crash (see the illustration after this list).
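For a sense of the cost, this heavyweight path is roughly what `torch.cuda.empty_cache()` triggers from Python: it returns unused cached blocks to the driver and has to wait on outstanding events first. A minimal illustration (not part of this change; timings vary by workload):

```python
import time
import torch

# Fill the caching allocator's pool, then drop the tensors: the freed blocks
# stay cached in the pool rather than being returned to the CUDA driver.
xs = [torch.randn(1024, 1024, device="cuda") for _ in range(64)]
del xs

torch.cuda.synchronize()
start = time.perf_counter()
torch.cuda.empty_cache()  # full release path: frees cached blocks to the driver
print(f"empty_cache() took {time.perf_counter() - start:.4f}s")
print(f"reserved after empty_cache: {torch.cuda.memory_reserved()} bytes")
```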

### Proposed idea
- If the garbage collection threshold is set, try to reclaim some memory blocks *without* synchronization. It should be safe to do so, as `release_available_cached_blocks` essentially does the same thing (but less aggressively).
- GC is triggered only when we fail to serve a `malloc` request from the block pool. No need to free blocks when the block pool is functioning just fine.
- Prioritize reclaiming blocks that have not been reused for a long time. Reclamation stops once the used memory capacity falls below the threshold.
- This code path is entirely optional; it is not invoked by default. A rough sketch of the policy follows.
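A minimal Python sketch of the policy described above, assuming hypothetical `Block` records with `size`, `is_free`, `is_split`, and `last_reuse_time` fields and a hypothetical `release_block` helper; the actual implementation lives in the C++ `CUDACachingAllocator` and tracks per-stream state omitted here:

```python
def garbage_collect(blocks, gc_threshold_bytes):
    """Called only after a malloc request cannot be served from the block pool.
    Frees least-recently-reused, unsplit free blocks until the cached capacity
    drops below the GC threshold. No GPU synchronization is performed."""
    cached_bytes = sum(b.size for b in blocks)
    # Oldest (least recently reused) blocks are reclaimed first.
    for b in sorted(blocks, key=lambda b: b.last_reuse_time):
        if cached_bytes <= gc_threshold_bytes:
            break  # back under the threshold; stop reclaiming
        if b.is_free and not b.is_split:
            release_block(b)          # hypothetical helper wrapping cudaFree
            cached_bytes -= b.size
```

In PyTorch builds that include this change, the threshold is configured through the caching allocator settings as the `garbage_collection_threshold` option of `PYTORCH_CUDA_ALLOC_CONF` (e.g. `PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.8`), expressed as a fraction of the memory capacity the application is allowed to use.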

Test Plan:
- Unit tests
- Manually checked that GPU memory usage stays within the limit indicated by the garbage collection threshold; when it does not, the caching allocator at least keeps trying to free blocks (a rough way to observe this from Python is sketched below).
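One way to observe the behavior using the standard memory introspection APIs (an illustration, not the exact manual check that was run):

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()

with torch.no_grad():
    for step in range(1000):
        # Vary the batch size so the allocator sees differently sized blocks.
        x = torch.randn(256 + (step % 8) * 64, 4096, device="cuda")
        model(x)
        if step % 100 == 0:
            allocated = torch.cuda.memory_allocated() / 2**20
            reserved = torch.cuda.memory_reserved() / 2**20
            # With garbage collection enabled, `reserved` should hover around
            # the configured threshold instead of only growing.
            print(f"step {step}: allocated={allocated:.0f} MiB "
                  f"reserved={reserved:.0f} MiB")
```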

Reviewed By: jianyuh

Differential Revision: D34482514

fbshipit-source-id: d5eae62ac60b94b0bca851f9d233a092d086e3c2
(cherry picked from commit 05780f1ed4b176f05e765b2411c9eaa2eaeb48b0)
2022-03-21 18:46:02 +00:00
amp_examples.rst Update the documentation for AMP with DataParallel (#69218) 2021-12-03 14:58:47 -08:00
autograd.rst [Doc] Better formatting in autograd.rst (#72586) 2022-02-11 22:46:10 +00:00
broadcasting.rst Fixes docs (#51439) 2021-01-31 22:00:26 -08:00
cpu_threading_runtimes.svg Update CPU threading doc (#33083) 2020-02-11 14:13:51 -08:00
cpu_threading_torchscript_inference.rst Upgrade MKL-DNN to DNNL v1.2 (#32422) 2020-03-26 22:07:59 -07:00
cpu_threading_torchscript_inference.svg Lint trailing newlines (#54737) 2021-03-30 13:09:52 -07:00
cuda.rst [CUDACachingAlloc/GPUInference] Implement garbage collection without GPU sync (#74261) 2022-03-21 18:46:02 +00:00
ddp.rst [Docs][BE] DDP doc fix (#71363) 2022-01-18 22:24:51 +00:00
extending.rst MAINT, DOC: Trivial spellings and warnings (#72745) 2022-02-14 21:55:19 +00:00
faq.rst Update faq.rst so OOM section mentions checkpoint (#62709) 2021-08-05 07:40:08 -07:00
gradcheck.rst Add first draft of gradcheck note (#55966) 2021-04-27 14:33:42 -07:00
hip.rst Add note on ifdefing based on CUDA_VERSION for ROCm path (#62850) 2021-08-25 15:02:03 -07:00
large_scale_deployments.rst Move ThreadLocalDebugInfo to c10 (#37774) 2020-05-11 19:27:41 -07:00
modules.rst Update link to tutorial on defining NN modules (#65534) 2021-09-23 11:26:50 -07:00
multiprocessing.rst Update docs for master to remove Python 2 references (#36336) 2020-04-16 10:15:48 -07:00
numerical_accuracy.rst Add an option to disable reduced precision reductions for FP16 GEMM (#67946) 2021-11-09 17:27:20 -08:00
randomness.rst add comma to prevent syntax errors (#62492) 2021-08-16 12:27:31 -07:00
serialization.rst docs: reference links to serialization.html (#54659) 2021-03-29 10:15:07 -07:00
windows.rst Remove remaining THC code (#69039) 2021-12-08 12:18:08 -08:00