Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74261

### Goal

Implement a cheap way to reclaim GPU memory (garbage collection) without incurring a GPU sync.

### Why do we need this?

Currently, there are only two ways to reclaim GPU memory blocks already assigned to a particular stream.

- `release_available_cached_blocks(params)`: Frees blocks exceeding `CachingAllocatorConfig::max_split_size()` until the request can be satisfied. Issue: if `max_split_size` is unset (the default), this function is a no-op. Even when it is set, the reclamation is quite conservative (e.g., it never frees blocks under `max_split_size`).
- `release_cached_blocks()`: Waits for all in-flight events and then reclaims blocks. Issue: waiting for all events is very expensive, as it will likely stall all GPU operations. Many GPU applications that do not properly handle potential GPU throttling would suffer or crash.

### Proposed idea

- If the garbage collection threshold is set, try to reclaim some memory blocks *without* synchronization. It should be safe to do so, as `release_available_cached_blocks` essentially does the same thing (but less aggressively).
- GC is triggered only when we fail to serve a `malloc` request from the block pool. There is no need to free blocks while the block pool is functioning just fine.
- Prioritize reclaiming blocks that have not been reused for a long time. Reclamation stops once the used memory capacity drops below the threshold (see the sketch below).
- This code path is entirely optional; by default it is never invoked.

Test Plan:
- Unit tests.
- Manually checked that GPU memory usage stays at the level indicated by the garbage collection threshold; when it cannot, the caching allocator at least keeps trying to free blocks.

Reviewed By: jianyuh

Differential Revision: D34482514

fbshipit-source-id: d5eae62ac60b94b0bca851f9d233a092d086e3c2
(cherry picked from commit 05780f1ed4b176f05e765b2411c9eaa2eaeb48b0)
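For illustration, here is a minimal C++ sketch of the GC pass described above. This is not the actual `CUDACachingAllocator` code: `Block`, `gc_count`, `cached_bytes`, and `garbage_collect_cached_blocks` are illustrative names, and the real implementation tracks block ages and the threshold somewhat differently.

```cpp
// Hedged sketch of the proposed sync-free GC pass. All names here are
// assumptions for illustration, not the allocator's real identifiers.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Block {
  size_t size = 0;
  bool allocated = false;  // still handed out to a tensor?
  int gc_count = 0;        // GC epochs since this cached block was last reused
};

// Frees long-unused cached blocks (no GPU sync) until the cached capacity
// drops below `threshold_bytes`. Invoked only after a malloc request could
// not be served from the block pool.
void garbage_collect_cached_blocks(std::vector<Block*>& pool,
                                   size_t& cached_bytes,
                                   size_t threshold_bytes) {
  if (cached_bytes <= threshold_bytes) {
    return;  // pool is under the threshold; nothing to reclaim
  }
  // Reclaim the least recently reused blocks first.
  std::sort(pool.begin(), pool.end(), [](const Block* a, const Block* b) {
    return a->gc_count > b->gc_count;
  });
  for (auto it = pool.begin();
       it != pool.end() && cached_bytes > threshold_bytes;) {
    Block* block = *it;
    if (!block->allocated) {
      cached_bytes -= block->size;
      // The real allocator would cudaFree the block's device pointer here;
      // no sync is needed because the block is idle, exactly as in
      // release_available_cached_blocks.
      delete block;
      it = pool.erase(it);
    } else {
      ++it;
    }
  }
}
```

In released PyTorch, this path is enabled through the allocator configuration, e.g. `PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.8`, which asks the allocator to start reclaiming unused cached blocks once usage exceeds roughly 80% of the allowed memory.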