Commit Graph

13 Commits

Author SHA1 Message Date
Zachary DeVito
e74f70d212 Revert "Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)"" (#96878)
This reverts commit e1ea584b1c.
Adds __has_include check to fix fbcode build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96878
Approved by: https://github.com/ezyang
2023-03-16 04:12:54 +00:00
PyTorch MergeBot
e1ea584b1c Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)"
This reverts commit 4e1060c609.

Reverted https://github.com/pytorch/pytorch/pull/95541 on behalf of https://github.com/DanilBaibak due to breaking internal builds
2023-03-15 13:28:41 +00:00
Zachary DeVito
4e1060c609 [memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)
This refactors the stack trace facility specific to memory profiling
in Python + CUDA into a generic facility for generating combined stack
traces.

The generic facility (combined_traceback.h) does not require Python
to be present in order to work, but will return Python stacks when it is.

This facility is then used to add support for stack trace gathering in
memory profiling that happens directly from C++.

It is also used to expose a Python API for gathering and symbolizing
combined stacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95541
Approved by: https://github.com/ezyang
2023-03-14 18:26:05 +00:00
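As a rough sketch of how the Python-facing side of this facility might be used: the import path and the names `gather_traceback` / `symbolize_tracebacks` below are assumptions inferred from the description above, not confirmed by this log.

```python
# Hypothetical usage sketch for the combined-traceback Python API described
# above. The import path and signatures are assumptions, not taken from the PR.
from torch._C._profiler import gather_traceback, symbolize_tracebacks  # assumed names

def capture_combined_stack():
    # Gather a combined Python/TorchScript/C++ traceback for the current thread.
    tb = gather_traceback(python=True, script=True, cpp=True)
    # Symbolization (resolving raw instruction pointers to file/line/function)
    # is assumed to be a separate, batched step so captures stay cheap.
    (frames,) = symbolize_tracebacks([tb])
    return frames

for frame in capture_combined_stack():
    print(frame)  # expected to carry filename/line/name information per frame
```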
Zachary DeVito
4b372e3958 [memory profiling] C++ tracing support (#95357)
Adds the ability to quickly generate stack traces for C++,
and combine Python, TorchScript, and C++ frames into a single trace.

This makes it possible for the memory tracer to record allocations inside
C++ code (e.g. convolution temporaries, backward operators).

The unwinder code is ~10x faster than execinfo.h's backtrace because it
caches fast unwinder routines for instruction pointers that have already been seen.
It is also only 1.2-2x slower than copying the entire stack (the approach perf takes),
while using two orders of magnitude less space per stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95357
Approved by: https://github.com/bertmaher
2023-03-12 07:24:14 +00:00
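The caching idea described above can be illustrated with a small, purely conceptual Python sketch. This is not PyTorch's actual C++ unwinder; the cache keyed by instruction pointer is the only point being demonstrated.

```python
# Conceptual illustration of per-instruction-pointer caching, not the real
# unwinder. The first time an address is seen we do the slow unwind-table
# lookup; afterwards the cached "how to step from this frame" recipe is
# reused, which is where a large speedup over naive unwinding comes from.
unwind_cache = {}  # instruction pointer -> precomputed unwind step

def slow_compute_unwind_step(ip):
    # Placeholder for the expensive part (parsing unwind tables for `ip`).
    return {"ip": ip, "frame_size": 0x20}

def unwind_step(ip):
    step = unwind_cache.get(ip)
    if step is None:
        step = slow_compute_unwind_step(ip)
        unwind_cache[ip] = step
    return step
```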
Zachary DeVito
d6d8d3484e _memory_viz.py: Visualize how blocks fit into segments. (#91336)
Adds a segment_plot command that visualizes how blocks are allocated within segments.
This is similar to the 'stats' command but produces an interactive HTML viewer rather
than a text dump, allowing exploration of stack traces.

It also adds the ability to see the layout at any point in the trace by starting from the
snapshot and then applying the events backwards to reconstruct what memory would have looked like.

Example:
![Screen Shot 2022-12-22 at 3 32 49 PM](https://user-images.githubusercontent.com/370202/209242650-b952372e-37ac-400a-a01c-13be2b5426fa.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91336
Approved by: https://github.com/bhosmer
2023-03-07 21:07:18 +00:00
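A hedged usage sketch for producing such a plot: `torch.cuda.memory._snapshot()` is a real (private) API, but the `_memory_viz.py` command line shown in the trailing comment is an assumption about the subcommand and flags.

```python
# Capture a snapshot of the caching allocator state and pickle it so it can be
# fed to _memory_viz.py. The exact CLI in the comment below is an assumption.
import pickle
import torch

x = torch.randn(1024, 1024, device="cuda")  # allocate something to look at
snapshot = torch.cuda.memory._snapshot()

with open("snapshot.pickle", "wb") as f:
    pickle.dump(snapshot, f)

# Then, hypothetically:
#   python torch/cuda/_memory_viz.py segment_plot snapshot.pickle -o segments.html
```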
Zachary DeVito
71f369092d Revert "Revert "memory viz: Add colors for categories and a legend (#90587)"" (#96133)
This reverts commit b38b39c441.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96133
Approved by: https://github.com/bhosmer
2023-03-07 21:07:18 +00:00
Eli Uriegas
b38b39c441 Revert "memory viz: Add colors for categories and a legend (#90587)"
This reverts commit ee43842505.
2023-03-06 11:38:58 -08:00
Zachary DeVito
ee43842505 memory viz: Add colors for categories and a legend (#90587)
Adds a category legend to memory trace plots that colors allocations by their role (activation, parameter, gradient, etc.) as captured by kineto.

Differential Revision: [D43757381](https://our.internmc.facebook.com/intern/diff/D43757381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90587
Approved by: https://github.com/aaronenyeshi
2023-03-03 20:42:22 +00:00
Aaron Gokaslan
67d9790985 [BE] Apply almost all remaining flake8-comprehension checks (#94676)
Applies the remaining flake8-comprehension fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, more performant, and better supported by our torch.jit compiler. It also removes useless generator wrappers such as `set(a for a in b)`, resolving them into just the set call.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
2023-02-12 01:01:25 +00:00
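A small before/after example of the kind of rewrite described above:

```python
data = range(10)

# Before: an unnecessary generator expression wrapped in a constructor call.
evens_old = set(x for x in data if x % 2 == 0)

# After: a set comprehension (and a useless `set(a for a in b)` wrapper
# around an existing iterable would simply become `set(b)`).
evens_new = {x for x in data if x % 2 == 0}

assert evens_old == evens_new
```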
Zachary DeVito
bf2668a899 Add support for kineto in memory viz (#90567)
This is just rudimentary initial support that does the same things as the trace profile. Follow-ups will add category encodings to the tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90567
Approved by: https://github.com/robieta
2022-12-13 21:31:16 +00:00
Zachary DeVito
3b3ed25109 Add a way to visualize memory snapshot traces (#90348)
This adds a d3-based interactive visualization for exploring the memory
allocation traces that the caching allocator can capture. This visualization
code can also be attached to kineto trace information in the future to
provide visualization for the memory events captured there, which come with
additional information about the graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90348
Approved by: https://github.com/robieta
2022-12-10 02:45:11 +00:00
Zachary DeVito
91b1bae1df Caching allocator tracing (#86241)
We can currently take snapshots of the state of allocated CUDA memory, but we do not have a way to correlate these snapshots with the actions the allocator took between snapshots. This PR adds a simple fixed-size buffer that records the major actions the allocator takes (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM, SNAPSHOT) and includes these with the snapshot information. Capturing periodic snapshots with a big enough trace buffer makes it possible to see how the allocator state changes over time.

We plan to use this functionality to guide how settings in the allocator can be adjusted and eventually have a more robust overall algorithm.

As a component of this functionality, we also add the ability to get a callback when the allocator will throw an OOM, primarily so that snapshots can be taken immediately to see why the program ran out of memory (most programs have some C++ state that would free tensors before the OutOfMemory exception can be caught).

This PR also updates the _memory_viz.py script to pretty-print the trace information and provide a better textual summary of snapshots distinguishing between internal and external fragmentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86241
Approved by: https://github.com/ngimel
2022-10-07 23:19:54 +00:00
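A hedged sketch of inspecting such a trace buffer from Python: the `_record_memory_history` arguments and the snapshot's `device_traces` layout are assumptions here (both have changed across PyTorch versions).

```python
# Sketch of reading the allocator trace described above. The keyword argument
# to _record_memory_history and the exact snapshot dictionary layout
# ("device_traces", "action", "size") are assumptions, not guaranteed API.
import torch

torch.cuda.memory._record_memory_history(enabled=True)  # assumed signature

tensors = [torch.randn(256, 256, device="cuda") for _ in range(4)]
del tensors[0]

snapshot = torch.cuda.memory._snapshot()
for device_trace in snapshot.get("device_traces", []):
    for event in device_trace:
        # Events are expected to look like ALLOC / FREE / SEGMENT_ALLOC / ...
        print(event.get("action"), event.get("size"))
```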
Zachary DeVito
726d040692 annotated allocator snapshots (#82146)
Record stack trace information for each allocated segment in the allocator.
It takes around 1.5us to record 50 stack frames of context.
Since invoking a PyTorch operator takes around 8us, this adds minimal overhead, but we still leave it disabled by default so that we can test it more on real workloads first.

Stack information is kept both for allocated blocks and for the last allocation that used inactive blocks. We could potentially also keep around the _first_ allocation that caused the block to get allocated from CUDA.

Potential follow-ups:
* Stack frame entries are small (16 bytes), but the list of Frames is not compressed even though most frames will share some entries. So far this doesn't produce huge dumps (7MB for one real workload that uses all of the memory on the GPU), but it could be made much smaller through compression.
* Code to format the information is slow (a few seconds) because it uses Python and FlameGraph.pl.
* Things allocated during the backward pass have no stack frames because they are run on another C++ thread.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82146
Approved by: https://github.com/albanD
2022-08-09 17:21:35 +00:00
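A hedged sketch of turning the recording described above on and reading back per-block stack frames: the keyword arguments and the "segments"/"blocks"/"frames" keys are assumptions about the snapshot layout, and the feature is off by default as noted in the message.

```python
# Sketch of enabling stack-trace recording for allocations and inspecting the
# recorded frames. Argument names and the snapshot keys used below are
# assumptions, not guaranteed API.
import torch

torch.cuda.memory._record_memory_history(enabled=True, record_context=True)  # assumed

y = torch.randn(1024, 1024, device="cuda")

snapshot = torch.cuda.memory._snapshot()
for segment in snapshot.get("segments", []):
    for block in segment.get("blocks", []):
        for frame in block.get("frames", [])[:3]:
            print(frame.get("filename"), frame.get("line"), frame.get("name"))
```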