pytorch/torch/utils
Michael Wootton 2f3be2735f Don't split oversize cached blocks (#44742)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35901

This change is designed to prevent fragmentation in the Caching Allocator. Permissive block splitting in the allocator allows very large blocks to be split into many pieces. Once a block is split too finely, it is unlikely that all of its pieces will be 'free' at the same time, so the original allocation can never be returned. Anecdotally, we've seen a model run out of memory while failing to allocate a 50 MB block on a 32 GB card, even though the caching allocator was holding 13 GB of 'split free blocks'.

Approach:

- Large blocks above a certain size are designated "oversize". This limit is currently set one decade above the "large" size, at 200 MB.
- Oversize blocks cannot be split.
- Oversize blocks must closely match the requested size (e.g. a 200 MB request will match an existing 205 MB block, but not a 300 MB block).
- In lieu of splitting oversize blocks, there is a mechanism to quickly free a single oversize block (to the system allocator) so that an appropriately sized block can be allocated. This is activated under memory pressure and prevents _release_cached_blocks()_ from triggering.
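
The matching rule in the list above can be sketched in a few lines of Python. This is only an illustration, not the allocator's actual code: the function names are hypothetical, and the 20 MB slack is an assumption chosen so that the 205 MB / 300 MB example holds. The commit message gives the 200 MB threshold but does not define "closely match".

```python
from typing import List, Optional

MB = 1024 * 1024
OVERSIZE_THRESHOLD = 200 * MB  # "oversize" limit stated in the commit message
OVERSIZE_SLACK = 20 * MB       # ASSUMPTION: illustrative value for "closely match"


def can_reuse(block_size: int, request: int) -> bool:
    """Decide whether a cached free block may serve a request (hypothetical helper)."""
    if block_size < request:
        return False
    if block_size >= OVERSIZE_THRESHOLD:
        # Oversize blocks are never split, so the whole block would be handed
        # out; accept it only if little memory would be wasted.
        return block_size - request <= OVERSIZE_SLACK
    # Smaller blocks may still be split, so any block that fits is acceptable.
    return True


def pick_cached_block(free_blocks: List[int], request: int) -> Optional[int]:
    """Return the size of the smallest reusable cached block, or None."""
    usable = [size for size in free_blocks if can_reuse(size, request)]
    return min(usable) if usable else None
```

With these assumptions, a 200 MB request reuses a cached 205 MB block but not a 300 MB one, and a 50 MB request is not served from a 300 MB block, because that would require splitting it. When `pick_cached_block` returns `None`, the allocator described above would free one oversize block or allocate fresh memory.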

Initial performance tests show this is similar to or quicker than the original strategy. Additional tests are ongoing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44742

Reviewed By: zou3519

Differential Revision: D29186394

Pulled By: ezyang

fbshipit-source-id: c88918836db3f51df59de6d1b3e03602ebe306a9
2021-06-21 11:46:08 -07:00
backcompat
benchmark irange for PyTorch sans jit (#59481) 2021-06-09 14:46:11 -07:00
bottleneck
data remove unused type: ignore directives (#60006) 2021-06-18 07:23:31 -07:00
ffi
hipify External stream (#59527) 2021-06-14 13:46:11 -07:00
model_dump remove unused type: ignore directives (#60006) 2021-06-18 07:23:31 -07:00
tensorboard remove unused type: ignore directives (#60006) 2021-06-18 07:23:31 -07:00
__init__.py Fix breakpad build and add to more images (#59236) 2021-06-01 22:47:14 -07:00
_cpp_extension_versioner.py Add option for cpp_extensions to compile standalone executable (#47862) 2020-12-01 20:03:08 -08:00
_crash_handler.py Fix breakpad build and add to more images (#59236) 2021-06-01 22:47:14 -07:00
_pytree.py [FX] Adds PyTree support to FX through concrete_args (#55888) 2021-05-07 04:48:35 -07:00
bundled_inputs.py [Pytorch] Remove run_on_bundled_input (#58344) 2021-05-17 12:44:00 -07:00
checkpoint.py Fix typo in checkpoint docs (#59646) 2021-06-09 12:48:18 -07:00
collect_env.py Don't split oversize cached blocks (#44742) 2021-06-21 11:46:08 -07:00
cpp_extension.py use importlib instead of imp as it support python 3.5+ (#57160) 2021-05-03 05:56:25 -07:00
dlpack.py
file_baton.py Fix version comparisons for Python 3.6, 3.10 and 4 (#32389) 2020-10-21 11:52:50 -07:00
hooks.py Fixes: register_full_backward_hook crash if first argument don't require a gradient (#57944) (#57945) 2021-05-11 15:07:35 -07:00
mkldnn.py BFloat16: enable prepacked weights's inference (#48922) 2021-02-17 11:20:00 -08:00
mobile_optimizer.py Small fix type hints in mobile optimizer (#59282) 2021-06-02 15:32:16 -07:00
model_zoo.py
show_pickle.py show_pickle/model_dump: Handle invalid UTF-8 in pickles (#57661) 2021-06-04 19:42:25 -07:00
throughput_benchmark.py