Summary: Return early if we can easily determine that the operator's qualified name is invalid before attempting to retrieve the schema. In particular, "::" should always be present. A quick estimate shows that this is >50x faster (100 us -> 2 us).
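A minimal sketch of the check (the helper name and caching are illustrative; the `"::"` early return is the substance of the change):

```python
import functools
from typing import Optional, Tuple

import torch


@functools.lru_cache(None)
def lookup_schemas(name: str) -> Optional[Tuple[torch._C.FunctionSchema, ...]]:
    # Valid qualified names are namespaced, e.g. "aten::add". If the
    # "::" separator is missing, skip the expensive schema lookup
    # (~100 us) and return immediately (~2 us).
    if "::" not in name:
        return None
    try:
        return tuple(torch._C._jit_get_schemas_for_operator(name))
    except RuntimeError:
        # Lookup still throws on otherwise malformed names.
        return None
```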
Test Plan: CI
Differential Revision: D47562587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105495
Approved by: https://github.com/aaronenyeshi
Summary:
Rather than processing the events into a plot of times and sizes, dump the actual events as (timestamp, action, number of bytes, category) when the output file ends in `raw.json.gz`.
This allows downstream analysis tools to process the events themselves. It also avoids having to control the granularity of the previous `json.gz` output inside the memory profiler.
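A sketch of the raw export (field order follows the summary above; the record structure and helper name are illustrative):

```python
import gzip
import json


def export_raw_events(events, path: str) -> None:
    # One record per memory event, rather than a pre-processed
    # (time, sizes-per-category) plot.
    assert path.endswith("raw.json.gz")
    records = [
        (e.timestamp, e.action, e.num_bytes, e.category) for e in events
    ]
    with gzip.open(path, "wt") as f:
        json.dump(records, f)
```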
Test Plan: CI Tests
Differential Revision: D47416544
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105094
Approved by: https://github.com/davidberard98
Summary:
Support the file extension `.html`, which embeds a PNG image of the plot into an HTML file.
This allows users to avoid processing the timeline manually in their own frontend UI.
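A sketch of the embedding (the helper is hypothetical; it assumes the plot has already been rendered to PNG bytes):

```python
import base64


def write_html_plot(png_bytes: bytes, path: str) -> None:
    # Embed the rendered PNG as a base64 data URI so the HTML file is
    # fully self-contained: no separate image asset, no frontend work.
    assert path.endswith(".html")
    encoded = base64.b64encode(png_bytes).decode("ascii")
    with open(path, "w") as f:
        f.write(
            f'<html><body><img src="data:image/png;base64,{encoded}"></body></html>'
        )
```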
Test Plan:
CI Tests
Ran on a resnet50 model and generated this HTML file with the plot:
See attached HTML file: {F954232276}
Screenshot: {F954232469}
Differential Revision: D45152735
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99751
Approved by: https://github.com/davidberard98
Summary: Rather than starting the timeline at t=0, keep the actual timestamps of the memory events.
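Schematically (the `timeline` list of `(timestamp, sizes)` pairs is illustrative):

```python
# Before: timestamps were shifted so the plot started at t=0.
times = [t - timeline[0][0] for t, _sizes in timeline]

# After: keep the events' real timestamps, which allows the memory
# timeline to be aligned with other traces of the same run.
times = [t for t, _sizes in timeline]
```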
Test Plan: CI Tests
Reviewed By: leitian, chaekit
Differential Revision: D43807624
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96535
Approved by: https://github.com/davidberard98
Summary: Added the functionality to export the memory timeline plot as a list of times and sizes, which the post-processing visualization can parse and plot.
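A sketch of the export, assuming a `timeline` of `(timestamp, sizes-per-category)` pairs (names and the exact serialization are illustrative):

```python
import json


def export_memory_timeline(timeline, path: str) -> None:
    # Parallel lists of timestamps and per-category sizes that the
    # post-processing visualization can parse and plot directly.
    times = [t for t, _sizes in timeline]
    sizes = [s for _t, s in timeline]
    with open(path, "w") as f:
        json.dump([times, sizes], f)
```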
Test Plan: CI Tests
Reviewed By: leitian, fengxizhou
Differential Revision: D43680760
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96137
Approved by: https://github.com/chaekit
There are various Tensors created in the backward pass which do not correspond to parameters. We don't want to mark these as gradients, but we do still want to convey as much information as possible. Thus, this PR introduces an AUTOGRAD_DETAIL category. (Which can be grouped with GRADIENT in visualization if one wishes to take a coarse-grained view of the world.)
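Schematically, the new category sits alongside GRADIENT, and a visualization is free to fold the two together (this enum is an illustrative subset, not the full category list):

```python
import enum


class Category(enum.Enum):
    INPUT = enum.auto()
    TEMPORARY = enum.auto()
    ACTIVATION = enum.auto()
    GRADIENT = enum.auto()
    AUTOGRAD_DETAIL = enum.auto()  # backward-pass Tensors that are not gradients


# Coarse-grained view: treat autograd detail as part of GRADIENT.
COARSE_GROUPING = {Category.AUTOGRAD_DETAIL: Category.GRADIENT}
```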
Differential Revision: [D40868661](https://our.internmc.facebook.com/intern/diff/D40868661/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88926
Approved by: https://github.com/chaekit
Following the pattern of earlier PRs, we use two methods to extract parameters. The primary one is the Python tracer; both nn.Module and optim.Optimizer collect parameters, and in most cases that is sufficient. As a fallback we can analyze the data flow graph and deduce likely parameters based on gradient computation and updates.
Parameter identification has a circular interaction with input identification. Inputs are defined as "not part of the core forward-backward-update loop", but we need inputs for the parameter identification fallback to give us a proxy for the forward pass. Thus, we first mark parameters from the Python tracer, which limits which Tensors get marked as inputs. While not necessary, it adds a bit of robustness. (As shown by the strengthening of the input unit tests.)
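A rough sketch of the two-tier extraction (all helper names here are hypothetical):

```python
def extract_parameters(trace) -> set:
    # Primary: the Python tracer observes nn.Module and optim.Optimizer
    # calls, both of which enumerate their parameters.
    params = set(trace.python_tracer_parameters())

    # Fallback: deduce likely parameters from the data flow graph. A
    # Tensor that receives a gradient and is then updated in place is
    # very likely a parameter.
    for tensor in trace.data_flow_graph.tensors():
        if tensor.receives_gradient and tensor.updated_in_place:
            params.add(tensor.id)
    return params
```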
Differential Revision: [D40238619](https://our.internmc.facebook.com/intern/diff/D40238619/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87568
Approved by: https://github.com/chaekit
It is surprisingly difficult to identify the leaves of the data flow graph. The issue is that inputs and pre-existing parameters look identical until parameter identification takes place. It's not too bad for training, since Autograd lets us differentiate between them; however, I still want the tool to do something reasonable in inference.
Some of this will be ameliorated when a later PR pulls in parameters from python tracing. The current approach is passable, but I will continue to mull over refinements.
Differential Revision: [D40220388](https://our.internmc.facebook.com/intern/diff/D40220388/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87567
Approved by: https://github.com/chaekit
Semantic assignment will be built up as a series of passes which gradually pin down the regions of a trace. For this reason it is important to be very meticulous in the assignment of categories.
We begin with gradients, as they are both straightforward to identify and foundational to subsequent analysis. There are two mechanisms that the profiler can use to tag gradients, each with its own advantages and limitations. The first is direct inspection of the op graph, which is generic but predicated on certain features of the Autograd engine. (And therefore not necessarily exhaustive.) The second approach is direct instrumentation via the Python tracer. This method requires that gradients be attached to an nn.Module parameter and can miss corner cases such as `set_to_none=True` due to the cache structure of the Python tracer. Combined, these two approaches provide very high coverage.
Temporaries are more straightforward; we can easily add them by trivial local inspection of a data flow node.
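A sketch of both gradient mechanisms plus the temporary check (`mark`, `Category`, and the graph accessors are hypothetical):

```python
def assign_gradients_and_temporaries(graph, tracer_events) -> None:
    # Mechanism 1: op graph inspection. Tensors written by gradient
    # accumulation nodes in the Autograd engine are gradients.
    for node in graph.nodes:
        if node.name == "torch::autograd::AccumulateGrad":
            mark(node.outputs, Category.GRADIENT)

    # Mechanism 2: the Python tracer reports `param.grad` directly for
    # nn.Module parameters. (May miss e.g. `set_to_none=True`.)
    for grad_id in tracer_events.parameter_gradients():
        mark([grad_id], Category.GRADIENT)

    # Temporaries: a Tensor created and destroyed within a single data
    # flow node never escapes it; purely local inspection suffices.
    for node in graph.nodes:
        mark(node.intermediates, Category.TEMPORARY)
```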
Because this is the first PR in the end-to-end section, most of the code is building the scaffolding for category bookkeeping and unit testing. (The actual gradient extraction was covered in an earlier PR.)
Differential Revision: [D40220389](https://our.internmc.facebook.com/intern/diff/D40220389/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87566
Approved by: https://github.com/chaekit
The semantic meaning of a Tensor is tightly coupled to its lineage. The data flow graph allows us to identify temporary Tensors, masks, inputs, activations, and more. However, one important nuance is that Tensors must be versioned; operations which mutate their inputs can also change the semantic meaning of said inputs.
It is challenging to assemble a complete picture of the data flow in a PyTorch model because ops can, and often do, recursively call into other ops. For the purpose of memory profiling this is an implementation detail, so instead we traverse the op tree to identify top level ops and allocations and then coalesce their children, folding inputs and outputs into the top level Node.
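A sketch of the coalescing pass (types and accessors are hypothetical; the versioning and folding are the substance):

```python
from typing import NamedTuple


class TensorKey(NamedTuple):
    # A Tensor is identified by (id, version); mutating it bumps the
    # version, since mutation may change its semantic meaning.
    tensor_id: int
    version: int


def coalesce(op_tree) -> list:
    # Only top level ops become data flow nodes. Inputs and outputs of
    # recursively called children fold into their top level ancestor;
    # Tensors produced by an earlier child are not external inputs.
    nodes = []
    for top_op in op_tree.top_level_ops():
        inputs, outputs = set(), set()
        for child in top_op.subtree():  # top_op plus all nested calls
            # input_keys() / output_keys() return sets of TensorKey.
            inputs |= child.input_keys() - outputs
            outputs |= child.output_keys()
        nodes.append((top_op, inputs, outputs))
    return nodes
```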
Differential Revision: [D40220391](https://our.internmc.facebook.com/intern/diff/D40220391/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87006
Approved by: https://github.com/chaekit
The appropriate annotation for a block of memory is a function of time: an input can be mutated in-place to become an activation, a clever kernel might steal the memory of a detached input (such as a mask) to use as output memory, etc.
We could pessimistically assume that all ops mutate all of their inputs; however, inspecting schemas allows us to significantly narrow that assumption with minimal effort. Checking schemas also allows us to distinguish between dispatcher ops (which have load-bearing semantics) and user annotations with reasonably high precision.
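A sketch of the schema check using real `torch._C` accessors (the helper itself is illustrative):

```python
import torch


def mutated_arg_positions(op_name: str) -> set:
    # An argument whose alias_info has is_write set may be mutated in
    # place (e.g. the "self" argument of "aten::add_"); all other
    # arguments can be assumed untouched.
    mutated = set()
    for schema in torch._C._jit_get_schemas_for_operator(op_name):
        for i, arg in enumerate(schema.arguments):
            if arg.alias_info is not None and arg.alias_info.is_write:
                mutated.add(i)
    return mutated
```

For example, `mutated_arg_positions("aten::add_")` contains position 0 (`self`) while `mutated_arg_positions("aten::add")` is empty; a name with no registered schema (such as a `record_function` annotation) raises, which is one way to tell dispatcher ops and user annotations apart.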
Differential Revision: [D40220390](https://our.internmc.facebook.com/intern/diff/D40220390/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86854
Approved by: https://github.com/chaekit