Fixes #112591
Fixed pydocstyle errors in the following files. The remaining errors relate to docstrings at the module level and on methods within each module (see details below).
`pydocstyle torch/cuda/_memory_viz.py --count`
before: 7
after: 4
**remaining errors:**
```
torch/cuda/_memory_viz.py:77 in public function `format_flamegraph`:
D103: Missing docstring in public function
torch/cuda/_memory_viz.py:121 in public function `segments`:
D103: Missing docstring in public function
torch/cuda/_memory_viz.py:128 in public function `memory`:
D103: Missing docstring in public function
torch/cuda/_memory_viz.py:135 in public function `compare`:
D103: Missing docstring in public function
```
`pydocstyle torch/cuda/streams.py --count`
before: 29
after: 8
**remaining errors:**
```
torch/cuda/streams.py:1 at module level:
D100: Missing docstring in public module
torch/cuda/streams.py:31 in public method `__new__`:
D102: Missing docstring in public method
torch/cuda/streams.py:105 in public method `__eq__`:
D105: Missing docstring in magic method
torch/cuda/streams.py:110 in public method `__hash__`:
D105: Missing docstring in magic method
torch/cuda/streams.py:113 in public method `__repr__`:
D105: Missing docstring in magic method
torch/cuda/streams.py:135 in public method `__new__`:
D102: Missing docstring in public method
torch/cuda/streams.py:163 in public method `__new__`:
D102: Missing docstring in public method
torch/cuda/streams.py:237 in public method `__repr__`:
D105: Missing docstring in magic method
```
`pydocstyle torch/cuda/__init__.py --count`
before: 100
after: 46
**remaining errors:**
```
torch/cuda/__init__.py:251 in public class `DeferredCudaCallError`:
D101: Missing docstring in public class
torch/cuda/__init__.py:327 in public function `cudart`:
D103: Missing docstring in public function
torch/cuda/__init__.py:332 in public class `cudaStatus`:
D101: Missing docstring in public class
torch/cuda/__init__.py:337 in public class `CudaError`:
D101: Missing docstring in public class
torch/cuda/__init__.py:338 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/__init__.py:343 in public function `check_error`:
D103: Missing docstring in public function
torch/cuda/__init__.py:369 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/__init__.py:373 in public method `__enter__`:
D105: Missing docstring in magic method
torch/cuda/__init__.py:376 in public method `__exit__`:
D105: Missing docstring in magic method
torch/cuda/__init__.py:391 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/__init__.py:473 in public class `StreamContext`:
D204: 1 blank line required after class docstring (found 0)
torch/cuda/__init__.py:485 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/__init__.py:499 in public method `__enter__`:
D105: Missing docstring in magic method
torch/cuda/__init__.py:514 in public method `__exit__`:
D105: Missing docstring in magic method
torch/cuda/__init__.py:541 in public function `set_stream`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/__init__.py:838 in public function `current_blas_handle`:
D400: First line should end with a period (not 'e')
torch/cuda/__init__.py:894 in public function `memory_usage`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/__init__.py:894 in public function `memory_usage`:
D400: First line should end with a period (not ')')
torch/cuda/__init__.py:913 in public function `utilization`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/__init__.py:913 in public function `utilization`:
D400: First line should end with a period (not 'r')
torch/cuda/__init__.py:949 in public function `power_draw`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/__init__.py:949 in public function `power_draw`:
D400: First line should end with a period (not ')')
torch/cuda/__init__.py:1089 in public class `ByteStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1091 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1100 in public class `DoubleStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1102 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1111 in public class `FloatStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1113 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1122 in public class `HalfStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1124 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1133 in public class `LongStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1135 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1144 in public class `IntStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1146 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1155 in public class `ShortStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1157 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1166 in public class `CharStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1168 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1177 in public class `BoolStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1179 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1188 in public class `BFloat16Storage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1190 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1199 in public class `ComplexDoubleStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1201 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1210 in public class `ComplexFloatStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1212 in public method `dtype`:
D102: Missing docstring in public method
```
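For reference, the D205/D400 fixes converge on docstrings shaped like the following (a hypothetical function, not taken from the diff):
```python
def example_utilization(device=None):
    """Return the percent of time the GPU was busy over the past sample period.

    The summary line ends with a period (satisfying D400) and is separated
    from the description by a blank line (satisfying D205).

    Args:
        device: the device to query; defaults to the current device.
    """
```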
@mikaylagawarecki @albanD @svekars @jbschlosser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113233
Approved by: https://github.com/malfet
For free blocks of memory in the allocator, we previously kept a linked list
of the stack frames of the previous allocations that lived there. This was only
ever used in one flamegraph visualization and never proved useful for
understanding what was going on. When memory history tracing was added, it
became redundant, since we can see the history of the free space from the
recorded actions anyway.
This patch removes this functionality and simplifies the snapshot format:
allocated blocks directly have a 'frames' attribute rather than burying stack frames in the history.
Previously the memory history tracked the real size of allocations before rounding.
Since history was added, a 'requested_size' field has been added directly to the block, recording the same information,
so this patch also removes that redundancy.
None of this functionality has been part of a PyTorch release with BC guarantees, so it should be safe to alter
this part of the format.
This patch also updates our visualization tools to work with the simplified format. The visualization tools keep
support for the old format in `_legacy` functions so that old snapshot files can still be read during the transition.
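For illustration, a minimal sketch of walking the simplified format (the exact key names are assumptions based on this description; old files go through the `_legacy` helpers instead):
```python
# Sketch of reading the simplified snapshot format described above; assumes
# blocks carry 'frames' and 'requested_size' directly, as this patch describes.
import pickle

with open("snapshot.pickle", "rb") as f:
    snapshot = pickle.load(f)

for seg in snapshot["segments"]:
    for block in seg["blocks"]:
        if block["state"] == "active_allocated":
            frames = block.get("frames", [])    # stack at allocation time
            size = block.get("requested_size")  # size before rounding
            print(size, [fr["name"] for fr in frames[:3]])
```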
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106079
Approved by: https://github.com/eellison
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)
which were reverted due to a conflict with the internal source repo.
Mostly fixes for PEP 484 violations (i.e., when a default arg is set to None but the type is not annotated as Optional).
Plus a few real fixes:
- Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
- Add missing return statement to `torch._export.deserialize_graph`
- Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
- Add an assert in `torch/optim/optimizer.py` that an Optional list is not None
TODO (in followup PR):
- Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Unrelated changes to bypass CI failures caused by the gcc9 dependency update in Ubuntu-18.04:
- Add a hack to `.ci/docker/install_conda.sh` that squashes the older libstdc++ from the conda environment in favor of the one from the OS
- Update bazel CUDA builds to focal, as with libstdc++-6.0.32 bazel builds lose the ability to catch exceptions (probably because they link with cupti statically, but I could not find where that is done)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
It turns out that jsdelivr, which is used to access the MemoryViz.js
source from generated files, doesn't work unless a version is specified.
This couldn't be tested until the PR actually landed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103741
Approved by: https://github.com/aaronenyeshi
This replaces the individual visualization routines in _memory_viz.py with
a single JavaScript application.
The JavaScript application can load pickled snapshot dumps directly via
drag and drop, by requesting them via fetch, or by having them embedded in a webpage.
The _memory_viz.py commands use the embedding approach.
We can also host MemoryViz.js on a webpage to use the drag/drop approach, e.g.
https://zdevito.github.io/assets/viz/
(eventually this should be hosted with the PyTorch docs).
All views/multiple cuda devices are supported on one page.
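For example, a sketch of the embedding approach (assuming `torch.cuda._memory_viz.trace_plot(snapshot)` keeps its snapshot-to-HTML-string signature):
```python
# Hedged sketch: write a self-contained HTML page with the snapshot embedded.
import torch
from torch.cuda import _memory_viz

torch.cuda.memory._record_memory_history(True)   # start trace recording
x = torch.empty(256, 1024, 1024, device="cuda")  # some allocations to look at
del x

snapshot = torch.cuda.memory._snapshot()
with open("trace.html", "w") as f:
    f.write(_memory_viz.trace_plot(snapshot))    # page embeds the snapshot data
```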
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103565
Approved by: https://github.com/eellison, https://github.com/albanD
Common advice we give for handling memory fragmentation issues is to
allocate a big block upfront to reserve memory which will get split up later.
For programs with changing tensor sizes this can be especially helpful to
avoid OOMs that happen the first time we see a new largest input and would
otherwise have to allocate new segments.
However, the issue with allocating a block upfront is that it is nearly impossible
to correctly estimate the size of that block. If it is too small, space in the block
will run out and the allocator will allocate separate blocks anyway. Too large,
and other non-PyTorch libraries might stop working because they cannot allocate
any memory.
This patch provides the same benefits as using a pre-allocating block but
without having to choose its size upfront. Using the cuMemMap-style APIs,
it adds the ability to expand the last block in a segment when more memory is
needed.
Compared to universally using cudaMallocAsync to avoid fragmentation,
this patch can fix this common fragmentation issue while preserving most
of the existing allocator behavior. This behavior can be enabled and disabled dynamically.
This should allow users to, for instance, allocate long-lived parameters and state in individual buffers,
and put temporary state into the large expandable blocks, further reducing
fragmentation.
See inline comments for information about the implementation and its limitations.
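As a sketch, enabling it might look like the following (assuming the setting is exposed as the `expandable_segments` key of `PYTORCH_CUDA_ALLOC_CONF`, which must be set before CUDA is initialized):
```python
# Hedged sketch: opt in to expandable segments via the allocator config.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # assumed key

import torch
x = torch.empty(1024, 1024, device="cuda")  # served from an expandable segment
```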
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96995
Approved by: https://github.com/eellison
When there are more than 15,000 polygons, trace_plot starts to get really slow,
so this change orders the allocations and collapses the smallest allocations beyond
the 15,000 limit into a single summarized polygon.
A slider allows this limit to be adjusted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98865
Approved by: https://github.com/yf225
Previously we only plotted memory if it was allocated or freed while
trace recording was active. This change also adds any pre-existing blocks
to the visualization. This helps because it is common to enable trace recording
late and then not realize that a lot of allocated memory is missing from the
trace because it was allocated beforehand.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97590
Approved by: https://github.com/eellison
This refactors the stack trace facility specific to memory profiling
in Python+CUDA into a generic facility for generating combined stack
traces.
The generic facility (combined_traceback.h) does not require
Python to be present in order to work, but will return Python stacks if it
is.
This facility is then used to add support for stack trace gathering in memory profiling that
happens directly from C++.
It is also used to expose a Python API for gathering and symbolizing
combined stacks.
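A heavily hedged sketch of that API (the `torch._C._profiler` binding names and signatures here are assumptions, not confirmed by this description):
```python
# Hedged sketch: capture a combined Python/TorchScript/C++ traceback cheaply,
# then symbolize it later. Names and signatures are assumptions.
from torch._C._profiler import gather_traceback, symbolize_tracebacks

tb = gather_traceback(python=True, script=True, cpp=True)  # cheap capture
(frames,) = symbolize_tracebacks([tb])                     # costly symbolization
for frame in frames[:5]:
    print(frame["filename"], frame["line"], frame["name"])
```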
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95541
Approved by: https://github.com/ezyang
Adds the ability to quickly generate stack traces for C++,
and combine Python, TorchScript, and C++ frames into a single trace.
This makes it possible for the memory tracer to record allocations inside
C++ code (e.g. convolution temporaries, backward operators).
The unwinder code is ~10x faster than execinfo.h's backtrace because it
caches fast unwinding routines for instruction pointers that have already been seen.
It is also only 1.2-2x slower than copying the entire stack (the approach perf takes),
while using two orders of magnitude less space per stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95357
Approved by: https://github.com/bertmaher
Applies the remaining flake8-comprehensions fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, more performant, and better supported by our torch.jit compiler. It also rewrites useless generators such as `set(a for a in b)` into set comprehensions.
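For example, rewrites of the kind this change applies:
```python
# Illustrative examples of the comprehension rewrites described above.
b = ["a", "bb", "bb", "ccc"]

unique = {a for a in b}                  # instead of set(a for a in b)
lengths = [len(a) for a in b]            # instead of list(len(a) for a in b)
index = {a: i for i, a in enumerate(b)}  # instead of dict((a, i) for i, a in enumerate(b))
```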
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
This adds a d3-based interactive visualization for exploring the memory
allocation traces that the caching allocator can capture. This visualization
code can also be attached to Kineto trace information in the future to
provide visualization for the memory events captured there, which come with
additional information about the graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90348
Approved by: https://github.com/robieta
We currently can take snapshots of the state of the allocated CUDA memory, but we do not have a way to correlate these snapshots with the actions the allocator took between snapshots. This PR adds a simple fixed-size buffer that records the major actions the allocator takes (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM, SNAPSHOT) and includes them with the snapshot information. Capturing periodic snapshots with a big enough trace buffer makes it possible to see how the allocator state changes over time.
We plan to use this functionality to guide how settings in the allocator can be adjusted and eventually have a more robust overall algorithm.
As a component of this functionality, we also add the ability to get a callback when the allocator is about to throw an OOM, primarily so that snapshots can be taken immediately to see why the program ran out of memory (most programs have some C++ state that would free tensors before the OutOfMemory exception can be caught).
This PR also updates the _memory_viz.py script to pretty-print the trace information and provide a better textual summary of snapshots distinguishing between internal and external fragmentation.
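A hedged sketch of exercising this end to end (argument names and snapshot keys are assumptions that have shifted across versions):
```python
# Hedged sketch: record allocator actions, then snapshot and inspect the trace.
import torch

torch.cuda.memory._record_memory_history(True)  # enable the trace buffer
x = torch.empty(1024, 1024, device="cuda")      # ALLOC (and maybe SEGMENT_ALLOC)
del x                                           # FREE
torch.cuda.empty_cache()                        # SEGMENT_FREE

snapshot = torch.cuda.memory._snapshot()        # includes the recorded actions
for event in snapshot["device_traces"][0][:10]: # 'device_traces' key is assumed
    print(event["action"], event.get("size"))
```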
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86241
Approved by: https://github.com/ngimel
Record stack trace information for each allocated segment in the allocator.
It takes around 1.5us to record 50 stack frames of context.
Since invoking a PyTorch operator takes around 8us, this adds minimal overhead, but we still leave it disabled by default so that we can test it more on real workloads first.
Stack information is kept both for allocated blocks and for the last allocation that used now-inactive blocks. We could potentially keep around the _first_ allocation that caused the block to be allocated from CUDA as well.
Potential follow-ups:
* Stack frame entries are small (16 bytes), but the list of frames is not compressed even though most frames will share some entries. So far this doesn't produce huge dumps (7MB for one real workload that uses all memory on the GPU), but it could be much smaller through compression.
* Code to format the information is slow (a few seconds) because it uses Python and FlameGraph.pl.
* Things allocated during the backward pass have no stack frames because they are run on another C++ thread.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82146
Approved by: https://github.com/albanD