pytorch/docs/source/notes
Jerry Ma 1610ea8ef8 Comprehensive-ish instrumentation for CUDA memory allocator (#27361)
Summary:
Adds comprehensive memory instrumentation to the CUDA caching memory allocator.

# Counters

Added comprehensive instrumentation for the following stats:
  - Allocation requests (`allocation`)
  - Allocated memory (`allocated_bytes`)
  - Reserved segments from cudaMalloc (`segment`)
  - Reserved memory (`reserved_bytes`)
  - Active memory blocks (`active`)
  - Active memory (`active_bytes`)
  - Inactive, non-releasable blocks (`inactive_split`)
  - Inactive, non-releasable memory (`inactive_split_bytes`)
  - Number of failed cudaMalloc calls that result in a cache flush and retry (`cuda_malloc_retries`)
  - Number of OOMs (`num_ooms`)

Except for the last two, each stat is broken down three ways: all memory, large blocks, and small blocks. Alongside the current value of each stat, the allocator tracks historical counts of allocs/frees and peak usage.
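A rough sketch of how the breakdown above could map onto flat dictionary keys. The `"{stat}.{pool}.{metric}"` naming, the pool names `all` / `large_pool` / `small_pool`, and the metric names `current` / `peak` / `allocated` / `freed` are assumptions here (this summary does not spell them out), and `segmented_keys` / `read_stat` are hypothetical helpers, not part of `torch.cuda`.

```python
from itertools import product

# Stat prefixes taken from the list above (the two unsegmented stats are left out).
SEGMENTED_STATS = [
    "allocation", "allocated_bytes", "segment", "reserved_bytes",
    "active", "active_bytes", "inactive_split", "inactive_split_bytes",
]
# Assumed pool and metric names; not stated in this summary.
POOLS = ["all", "large_pool", "small_pool"]
METRICS = ["current", "peak", "allocated", "freed"]


def segmented_keys():
    """Enumerate every assumed '<stat>.<pool>.<metric>' key."""
    return [f"{s}.{p}.{m}" for s, p, m in product(SEGMENTED_STATS, POOLS, METRICS)]


def read_stat(stats, stat, pool="all", metric="current"):
    """Look up one value in a flat stats dict, defaulting to 0 if absent."""
    return stats.get(f"{stat}.{pool}.{metric}", 0)
```

With a real stats dictionary, `read_stat(stats, "allocated_bytes", metric="peak")` would then read the peak allocated bytes across all pools, provided the key naming assumption holds.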

# Snapshots

Added the capability to get a "memory snapshot" – that is, to generate a complete dump of the allocator block/segment state.
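As an illustration of what can be done with such a dump, here is a minimal sketch that totals block sizes by state. It assumes the snapshot is a list of segment dicts, each with a `blocks` list whose entries carry `size` and `state` fields; those field names and the state strings in the example data are assumptions, and `bytes_by_block_state` is a hypothetical helper.

```python
from collections import defaultdict


def bytes_by_block_state(snapshot):
    """Sum block sizes per block state across all segments."""
    totals = defaultdict(int)
    for segment in snapshot:
        for block in segment["blocks"]:
            totals[block["state"]] += block["size"]
    return dict(totals)


# Hand-written stand-in for a real snapshot (field names are assumed).
example_snapshot = [
    {"total_size": 2097152, "blocks": [
        {"size": 1048576, "state": "active_allocated"},
        {"size": 1048576, "state": "inactive"},
    ]},
    {"total_size": 512, "blocks": [
        {"size": 512, "state": "active_allocated"},
    ]},
]

print(bytes_by_block_state(example_snapshot))
```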

# Implementation: major changes

- Added `torch.cuda.memory_stats()` (and associated C++ changes) which returns all instrumented stats as a dictionary.
- Added `torch.cuda.snapshot()` (and associated C++ changes) which returns a complete dump of the allocator block/segment state as a list of segments.
- Added memory summary generator in `torch.cuda.memory_summary()` for ease of client access to the instrumentation stats. Potentially useful to dump when catching OOMs. Sample output here: https://pastebin.com/uKZjtupq
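One way the "dump when catching OOMs" idea could look in client code, as a sketch only. `run_with_oom_report` is a hypothetical wrapper, and matching on the text "out of memory" in a `RuntimeError` is an assumption about how the OOM surfaces. The summary function is passed in as a callable so the wrapper itself has no hard dependency on a GPU.

```python
def run_with_oom_report(step, summarize, log=print):
    """Run step(); on an out-of-memory RuntimeError, log a memory summary and re-raise."""
    try:
        return step()
    except RuntimeError as err:
        # Assumption: CUDA OOMs arrive as RuntimeError with this phrase in the message.
        if "out of memory" in str(err):
            log(summarize())
        raise


# Intended use on a CUDA machine (not executed here):
#   import torch
#   run_with_oom_report(lambda: model(batch), torch.cuda.memory_summary)
```

Re-raising keeps the original failure visible to the caller; the summary is only extra context in the logs.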

# Implementation: minor changes

- Add error-checking helper functions for Python dicts and lists in `torch/csrc/utils/`.
- Existing memory management functions in `torch.cuda` moved from `__init__.py` to `memory.py` and star-imported to the main CUDA module.
- Add various helper functions to `torch.cuda` to return individual items from `torch.cuda.memory_stats()`.
- `torch.cuda.reset_max_memory_cached()` and `torch.cuda.reset_max_memory_allocated()` are deprecated in favor of `reset_peak_stats`. It's a bit difficult to think of a case where only one of those stats should be reset, and IMO this makes the peak stats collectively more consistent.
- `torch.cuda.memory_cached()` and `torch.cuda.max_memory_cached()` are deprecated in favor of `*memory_reserved()`.
- Style (add access modifiers in the allocator class, random nit fixes, etc.)

# Testing

- Added consistency check for stats in `test_cuda.py`. This verifies that the data from `memory_stats()` is faithful to the data from `snapshot()`.
- Ran on various basic workflows (toy example, CIFAR)
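To give a flavor of what "faithful" means, below is a simplified sketch of one such cross-check. It is not the test in `test_cuda.py`. The key names `reserved_bytes.all.current` and `segment.all.current` and the `total_size` segment field are assumptions, and `reserved_matches_snapshot` is a hypothetical helper.

```python
def reserved_matches_snapshot(stats, snapshot):
    """Check that reserved bytes and segment count agree between stats and snapshot."""
    snapshot_bytes = sum(segment["total_size"] for segment in snapshot)
    return (
        stats["reserved_bytes.all.current"] == snapshot_bytes
        and stats["segment.all.current"] == len(snapshot)
    )


# Hand-written stand-ins for the two data sources (values are illustrative).
fake_stats = {"reserved_bytes.all.current": 2097664, "segment.all.current": 2}
fake_snapshot = [{"total_size": 2097152}, {"total_size": 512}]
print(reserved_matches_snapshot(fake_stats, fake_snapshot))
```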

# Performance

Results of running the following speed benchmark: https://pastebin.com/UNndQg50

- Before this PR: 45.98 microseconds per tensor creation
- After this PR: 46.65 microseconds per tensor creation
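For scale, the overhead implied by the two figures above can be worked out directly; this is plain arithmetic on the reported numbers, not a rerun of the benchmark.

```python
before_us = 45.98  # microseconds per tensor creation, before this PR
after_us = 46.65   # microseconds per tensor creation, after this PR

overhead_us = after_us - before_us
overhead_pct = 100 * overhead_us / before_us

print(f"{overhead_us:.2f} us extra per tensor creation ({overhead_pct:.1f}% slower)")
```

That comes to roughly 0.67 microseconds, or about 1.5%, per tensor creation.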

Pull Request resolved: https://github.com/pytorch/pytorch/pull/27361

Differential Revision: D17758747

Pulled By: jma127

fbshipit-source-id: 5a84e82d696c40c505646b9a1b4e0c3bba38aeb6
2019-10-08 15:42:48 -07:00
..
autograd.rst [docs] Update autograd notes (#6769) 2018-04-19 13:34:14 -04:00
broadcasting.rst [docs] Update broadcasting and cuda semantics notes (#6904) 2018-04-24 13:41:24 -04:00
cpu_threading_torchscript_inference.rst Threading and CPU Inference note 2019-07-29 15:45:49 -07:00
cpu_threading_torchscript_inference.svg Threading and CPU Inference note 2019-07-29 15:45:49 -07:00
cuda.rst Comprehensive-ish instrumentation for CUDA memory allocator (#27361) 2019-10-08 15:42:48 -07:00
extending.rst Update extension docs, fix Fold/Unfold docs (#9239) 2018-07-08 19:09:39 -07:00
faq.rst Use "length of the RNN input" instead of "length of the RNN" 2019-05-24 09:03:50 -07:00
large_scale_deployments.rst Thread local debug info 2019-08-12 14:53:57 -07:00
multiprocessing.rst Add IterableDataset (#19228) 2019-06-20 20:12:44 -07:00
randomness.rst Update randomness.rst (#21337) 2019-06-04 07:38:00 -07:00
serialization.rst code syntax error in document (serialization.rst) (#937) 2017-03-06 10:06:04 -05:00
windows.rst Add magma for CUDA 10.1 to Windows docs 2019-04-29 10:13:21 -07:00