Summary: Adds comprehensive memory instrumentation to the CUDA caching memory allocator.

# Counters

Added comprehensive instrumentation for the following stats:

- Allocation requests (`allocation`)
- Allocated memory (`allocated_bytes`)
- Reserved segments from cudaMalloc (`segment`)
- Reserved memory (`reserved_bytes`)
- Active memory blocks (`active`)
- Active memory (`active_bytes`)
- Inactive, non-releasable blocks (`inactive_split`)
- Inactive, non-releasable memory (`inactive_split_bytes`)
- Number of failed cudaMalloc calls that result in a cache flush and retry (`cuda_malloc_retries`)
- Number of OOMs (`num_ooms`)

Except for the last two, these stats are segmented between all memory, large blocks, and small blocks. Along with the current value of each stat, the allocator tracks historical counts of allocations/frees as well as peak usage.
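To make the counter structure concrete, here is a minimal sketch of reading a few of these stats through the `torch.cuda.memory_stats()` call described below. The flattened `"<stat>.<pool>.<field>"` naming matches how the dictionary is laid out, but treat the exact keys in this sketch as illustrative:

```python
import torch

# Sketch: each segmented stat appears under keys of the form
# "<stat>.<pool>.<field>", where pool is one of {all, large_pool,
# small_pool} and field is one of {current, peak, allocated, freed}.
# Assumes a CUDA device is available.
x = torch.empty(1024, 1024, device="cuda")  # trigger an allocation

stats = torch.cuda.memory_stats()
print(stats["allocation.all.current"])             # live allocation count
print(stats["allocated_bytes.all.peak"])           # peak allocated bytes
print(stats["reserved_bytes.large_pool.current"])  # bytes reserved for large blocks
print(stats["num_ooms"])                           # unsegmented OOM counter
```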
# Snapshots

Added the capability to get a "memory snapshot": a complete dump of the allocator block/segment state.
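A sketch of walking the snapshot; the segment/block dictionary keys used here (`total_size`, `active_size`, `segment_type`, `blocks`) reflect the shape of the dump but should be read as illustrative:

```python
import torch

# Each snapshot entry describes one cudaMalloc'd segment and the
# blocks the allocator has carved out of it.
for seg in torch.cuda.memory_snapshot():
    print(f"device {seg['device']}: {seg['segment_type']} segment, "
          f"{seg['total_size']} bytes reserved, {seg['active_size']} bytes active")
    for block in seg["blocks"]:
        print(f"  block of {block['size']} bytes, state={block['state']}")
```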
# Implementation: major changes

- Added `torch.cuda.memory_stats()` (and associated C++ changes), which returns all instrumented stats as a dictionary.
- Added `torch.cuda.memory_snapshot()` (and associated C++ changes), which returns a complete dump of the allocator block/segment state as a list of segments.
- Added a memory summary generator, `torch.cuda.memory_summary()`, for easy client access to the instrumentation stats. Potentially useful to dump when catching OOMs; see the sketch below. Sample output here: https://pastebin.com/uKZjtupq
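For example, one hedged way to use the summary when catching an OOM (matching on the exception text is a heuristic, not a stable API):

```python
import torch

try:
    too_big = torch.empty(1 << 40, device="cuda")  # deliberately oversized
except RuntimeError as e:
    if "out of memory" in str(e):
        # Human-readable dump of all instrumentation stats.
        print(torch.cuda.memory_summary())
    raise
```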
# Implementation: minor changes

- Added error-checking helper functions for Python dicts and lists in `torch/csrc/utils/`.
- Moved the existing memory management functions in `torch.cuda` from `__init__.py` to `memory.py`, star-importing them into the main CUDA module.
- Added various helper functions to `torch.cuda` that return individual items from `torch.cuda.memory_stats()`.
- Deprecated `torch.cuda.reset_max_memory_cached()` and `torch.cuda.reset_max_memory_allocated()` in favor of `reset_peak_memory_stats()`. It's difficult to think of a case where only one of those stats should be reset, and this makes the peak stats collectively more consistent.
- Deprecated `torch.cuda.memory_cached()` and `torch.cuda.max_memory_cached()` in favor of `memory_reserved()` and `max_memory_reserved()`.
- Style fixes (access modifiers in the allocator class, assorted nits, etc.).
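A quick before/after sketch of the deprecations (the deprecated forms are noted in the comments):

```python
import torch

reserved = torch.cuda.memory_reserved()           # was: memory_cached()
peak_reserved = torch.cuda.max_memory_reserved()  # was: max_memory_cached()
torch.cuda.reset_peak_memory_stats()              # was: reset_max_memory_allocated()
                                                  #      and reset_max_memory_cached()
```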
# Testing

- Added a consistency check for stats in `test_cuda.py`, verifying that the data from `memory_stats()` is faithful to the data from `memory_snapshot()`.
- Ran on various basic workflows (toy example, CIFAR).
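The idea behind the check, sketched below as an illustration rather than the actual test code: every byte the counters report as reserved should be accounted for by some segment in the snapshot, and vice versa.

```python
import torch

tensors = [torch.randn(n, device="cuda") for n in (16, 4096, 1_000_000)]

stats = torch.cuda.memory_stats()
snapshot = torch.cuda.memory_snapshot()

# Reserved memory and segment counts must agree between the two views.
assert stats["reserved_bytes.all.current"] == sum(s["total_size"] for s in snapshot)
assert stats["segment.all.current"] == len(snapshot)
```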
# Performance

Running the following speed benchmark: https://pastebin.com/UNndQg50

- Before this PR: 45.98 microseconds per tensor creation
- After this PR: 46.65 microseconds per tensor creation
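The linked script isn't reproduced here; a minimal harness in the same spirit might look like this (tensor shape and iteration count are made up):

```python
import timeit

import torch

torch.cuda.synchronize()
n = 100_000
elapsed = timeit.timeit(lambda: torch.empty(128, device="cuda"), number=n)
print(f"{elapsed / n * 1e6:.2f} microseconds per tensor creation")
```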
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27361

Differential Revision: D17758747

Pulled By: jma127

fbshipit-source-id: 5a84e82d696c40c505646b9a1b4e0c3bba38aeb6