1. `torch.autograd.profiler` interface parameters changed: using `self.use_device` instead of `self.use_cuda` makes the profiler accessible to other device backends; they will be integrated in subsequent PRs.
2. Modify `ProfilerEventStub` (aka `std::shared_ptr<CUevent_st>`) to `ProfilerVoidEventStub` (aka `std::shared_ptr<void>`) so that `ProfilerStubs` can be inherited by any `{device}Methods`.
3. In addition, `cuda_event_start_` is renamed to `device_event_start_`; CUDA and other devices can use this event pointer if needed.
4. Custom devices can use legacy profiling (adds the `ProfilerState::KINETO_PRIVATEUSE1_FALLBACK` option).
5. Add `privateuse1Stubs` registration.
(Result parsing and test cases will be added in a subsequent PR; a usage sketch follows below.)
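A minimal usage sketch, assuming the new keyword is exposed on the public `torch.autograd.profiler.profile` entry point (the exact surface is finalized in follow-up PRs):

```python
import torch
from torch.autograd import profiler

# Sketch of the interface change: the target device is selected by name
# rather than by the CUDA-only boolean flag, e.g. use_device="cuda" instead
# of use_cuda=True, so out-of-tree backends (e.g. privateuse1) can reuse
# the same path. The keyword surface here is an assumption.
device = "cuda" if torch.cuda.is_available() else None
with profiler.profile(use_device=device) as prof:
    y = torch.randn(32, 32) @ torch.randn(32, 32)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```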
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101554
Approved by: https://github.com/aaronenyeshi
**TL;DR:** This re-introduces links between backward kernels and their corresponding forward kernels.
<img width="1020" alt="Screenshot 2023-05-26 at 7 25 22 PM" src="https://github.com/pytorch/pytorch/assets/5067123/02571b59-859c-4c9e-b3ef-121ef3159812">
In the example above, you can see two such flows: one for `aten::add` and one for `aten::binary_cross_entropy`.
### Details
Forward/backward links were added in https://github.com/pytorch/pytorch/pull/62553, but then disabled in https://github.com/pytorch/pytorch/pull/72904 due to segfaults (e.g. https://github.com/pytorch/pytorch/issues/69443).
A lot of refactoring has happened since the fwd-bwd links were disabled, so this PR updates the implementation:
* Use a raw profiler::impl::Result instead of a KinetoEvent
* Move the implementation to collection.cpp, where the TraceWrapper is currently handled.
* Sort the events before processing, because they aren't always in chronological order
* There can now be more than one event in the backward pass that matches the sequenceNr-threadID pair. The implementation needed to be updated to avoid showing multiple endpoints for a given sequenceNr-threadID pair ([ptr to where the bwd sequenceNr-threadID pair is duplicated](6e3e3dd477/torch/csrc/profiler/collection.cpp (L398-L399))).
Next, we need to verify that https://github.com/pytorch/pytorch/issues/69443 is fixed. Running the repro no longer errors. Looking further into the details of the issue it seems like the handling of the [raw linkedActivity pointer (old code from 2021)](6089dcac48/libkineto/src/output_json.cpp (L283)) resulted in the segfault. Now, it doesn't look like the linked activity is used anywhere in output_json.cpp so the issue should be fixed.
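For reference, a minimal way to produce a trace where these fwd-bwd flow arrows can be inspected (standard `torch.profiler` usage; viewer behavior as described above):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(16, 1)
x = torch.randn(4, 16)

# Export a chrome trace; fwd/bwd flow arrows should connect aten::* forward
# ops to their autograd backward kernels when viewed in chrome://tracing or
# Perfetto.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    loss = model(x).sum()
    loss.backward()

prof.export_chrome_trace("trace.json")
```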
### Testing
#### 1. unit test
`test_profiler_fwd_bwd_link` was un-skipped. It was modified to match the new implementation.
#### 2. https://github.com/pytorch/pytorch/issues/69443
I ran the repro in https://github.com/pytorch/pytorch/issues/69443 and verified there were no segfaults.
#### 3. Duplicate flow IDs
When forward-backward connections were first introduced, GPU-CPU async links did not yet exist. There's a possibility that GPU-CPU links and fwd-bwd links could interfere if their IDs overlap.
I manually tested this in chrome://tracing: I edited a trace file so that a GPU-CPU link had the same ID as one of the fwd-bwd connections, and the chrome tracing UI continued showing both types of links.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102424
Approved by: https://github.com/aaronenyeshi
Many ops take scalars or scalar lists as inputs, and these are important for understanding the properties of the op. For example, a convolution op's behavior and output shape often depend on padding and strides, which are provided as scalars or lists of scalars. This will record scalar lists when record_inputs=True.
Details:
During collection (and this was true before this PR as well), we serialize values and tensor metadata into an InputOutputEncoder. After collection occurs, we deserialize these values to attach the information to each of the events.
This PR does this:
- Adds support for serializing scalar lists during collection / serialization
- Adds an extra field called "Concrete Args"
- Splits the deserialization process into two steps: one for generating "input shapes" and one for generating "concrete args". We split input shapes and concrete args apart to avoid disrupting any previous workflows that relied on the specific data in the "input shapes" category; additionally, it's just a better description. Note that single scalars remain in the "input shapes" category, as they were already in that category in the past. (A usage sketch follows below.)
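A hedged usage sketch (assuming the Python-side `record_shapes` flag is what enables the input recording referred to as `record_inputs` above):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Scalar-list arguments such as stride and padding should now be captured
# and exported under the "Concrete Args" field of the trace, while tensor
# shapes remain under "input shapes".
x = torch.randn(1, 3, 32, 32)
w = torch.randn(8, 3, 3, 3)
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    torch.nn.functional.conv2d(x, w, stride=[2, 2], padding=[1, 1])

prof.export_chrome_trace("conv_trace.json")  # inspect the per-event fields here
```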
Differential Revision: [D45798431](https://our.internmc.facebook.com/intern/diff/D45798431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100593
Approved by: https://github.com/aaronenyeshi
Summary:
There are two variables for profiler input shapes:
- In C++ interface: report_input_shapes
- In Python interface: record_shapes
Therefore `record_input_shapes` is a typo. We should also look into reducing the redundant naming between the two.
Test Plan: CI
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99430
Approved by: https://github.com/davidberard98
Previously we only plotted memory if it was allocated or freed while
trace recording was active. This change also adds any pre-existing blocks
to the visualization. This helps because it is common to enable trace recording
later in a run and then not realize that a lot of memory had already been
allocated before recording started, since it did not appear in the trace.
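For context, a minimal sketch of the snapshot workflow the visualizer consumes (the private helper names below are assumptions about the intended usage, and their signatures have changed across versions):

```python
import pickle
import torch

# CUDA required. Blocks allocated before recording starts should now also
# appear in the memory visualization.
pre_existing = torch.empty(1024, 1024, device="cuda")  # allocated before recording
torch.cuda.memory._record_memory_history()
during = torch.empty(512, 512, device="cuda")           # allocated while recording

with open("snapshot.pickle", "wb") as f:
    pickle.dump(torch.cuda.memory._snapshot(), f)        # plot with torch.cuda._memory_viz tools
```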
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97590
Approved by: https://github.com/eellison
This refactors the stack trace facility specific to memory profiling
in python+cuda to make a generic facility to generate combined stack
traces.
The generic facility (combined_traceback.h) does not require
python to be around to work, but will return python stacks if it is
present.
This facility is then used to add support for stack trace gathering in memory profiling that
happens directly from C++.
It is also used to expose a Python API for gathering and symbolizing
combined stacks.
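A rough usage sketch of the new Python API (the binding location and argument names below are assumptions; the PR text only states that gather/symbolize entry points are exposed):

```python
# Assumption: the combined-traceback bindings live under torch._C._profiler.
from torch._C._profiler import gather_traceback, symbolize_tracebacks

tb = gather_traceback(python=True, script=True, cpp=True)  # capture a combined stack
(frames,) = symbolize_tracebacks([tb])                      # resolve to readable frames
for frame in frames[:5]:
    print(frame)  # each entry describes one frame of the combined Python/C++ stack
```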
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95541
Approved by: https://github.com/ezyang
This patch aims to add support for an XPU profiler that will work together with Kineto. After this PR, Kineto will follow these APIs to integrate with it. The development of the Python interface is also nearly done.
Signed-off-by: Huang, Xunsong <xunsong.huang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94502
Approved by: https://github.com/ezyang
@bypass-github-export-checks
This diff enables processing events in the profiler. Passing the events from the QueryPool, and making sure Vulkan events align correctly with parent CPU events, will be handled later in this diff stack.
This diff was made by forking Taylor's scaffolding diff, D39779878, with a few changes:
- Rebasing + resolving merge conflicts
- Fixing (i.e. removing) auto import of profiler/containers.h
- Changing the activity type to CPU_OP which makes the vulkan events appear on chrometrace
- Moving timestamp adjustment scaffolding to D39893109
Differential Revision: [D39834805](https://our.internmc.facebook.com/intern/diff/D39834805/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90852
Approved by: https://github.com/mcr229
The appropriate annotation for a block of memory is a function of time: an input can be mutated in-place to become an activation, a clever kernel might steal the memory of a detached input (such as a mask) to use as output memory, etc.
We could pessimistically assume that all ops mutate all of their inputs, however inspection of schema allows us to significantly narrow that assumption with minimal effort. Checking schemas also allows us to distinguish between dispatcher ops (which have load bearing semantics) and user annotations with reasonably high precision.
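For illustration, a small sketch of the kind of schema inspection described above (using the public op-schema accessors; this is not the profiler's internal code path):

```python
import torch

# A dispatcher op's schema records which arguments may be written, so the
# profiler does not have to assume every input of every op is mutated.
for op in (torch.ops.aten.add.Tensor, torch.ops.aten.add_.Tensor):
    schema = op._schema
    writes = [a.name for a in schema.arguments if a.alias_info and a.alias_info.is_write]
    print(schema.name, "mutates:", writes or "nothing")
```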
Differential Revision: [D40220390](https://our.internmc.facebook.com/intern/diff/D40220390/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86854
Approved by: https://github.com/chaekit
This PR unifies and rationalizes some of the input representation in Result. The current approach of storing separate types in separate vectors is tedious for two types (Tensors and scalars), but would be even more annoying with the addition of TensorLists. A similar disconnection exists with sizes and strides which the user is also expected to zip with tensor_metadata.
I simplified things by moving inputs to a variant and moving sizes and strides into TensorMetadata. This also forced collection of sizes and strides in python tracer which helps to bring it in line with op profiling. Collection of TensorLists is fairly straightforward; `InputOutputEncoder` already has a spot for them (I actually collected them in the original TorchTidy prototype) so it was just a matter of plumbing things through.
Differential Revision: [D40734451](https://our.internmc.facebook.com/intern/diff/D40734451/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87825
Approved by: https://github.com/slgong-fb, https://github.com/chaekit
Part of the current ID assignment algorithm groups any Storages which are associated with the same TensorImpl*. This isn't sound (which I knew but deferred until it actually became a problem) because pointers can be reused by different objects (the ABA problem).
ABA is easy to handle for Storage because we see allocations and frees, but ~TensorImpl is very hot and cannot tolerate profiling code without significant increases in overhead.
This PR narrows the conditions under which ID assignment will join on TensorImpl*. Two storages which are associated with the same TensorImpl* are grouped IFF they were live at the same time. (Note that this still allows storages with disjoint lifetimes to be joined transitively through a third storage which overlaps with both.)
The need for this PR arose in memory profiling. The Python argument parser creates short-lived Tensors for (some) scalar arguments, which triggers this issue. (This is stochastic and platform dependent, since optimizations like reusing recently freed allocations are implementation defined.) Spurious connections can lead to confusing, long-range interactions when building up the memory profile, so it makes sense to harden ID assignment to avoid any issues.
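As a loose analogy for the pointer-reuse hazard (it uses allocator data pointers, since TensorImpl* reuse is not observable from Python; purely illustrative):

```python
import torch

# After a tensor dies, a new tensor can land at the same address, so grouping
# by raw pointer alone can spuriously merge unrelated objects.
a = torch.randn(1024)
addr = a.data_ptr()
del a
b = torch.randn(1024)
print(addr == b.data_ptr())  # may print True when the allocation is reused
```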
Differential Revision: [D40445121](https://our.internmc.facebook.com/intern/diff/D40445121/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87133
Approved by: https://github.com/slgong-fb, https://github.com/chaekit
A recurring problem with assigning Tensor IDs is that we want to preserve identity when storage changes but we don't observe TensorImpl destruction so identity assignment is not robust to the ABA problem with respect to TensorImpl*. ~TensorImpl is far too hot to instrument; even adding a call to a no-op function in a different compilation unit increases overhead by tens of percent. (OSS builds do not have any sort of LTO.)
Fortunately there is a solution. A PyTorch Tensor is a `c10::intrusive_ptr<c10::TensorImpl>`, which in turn holds a storage. (Which is a `c10::intrusive_ptr<c10::StorageImpl>`) `c10::intrusive_ptr` has a `c10::weak_intrusive_ptr` class for taking non-owning references to the underlying object. The implementation involves both a strong refcount and weak refcount in `c10::intrusive_ptr`. If the strong refcount of an intrusive_ptr goes to zero and there are no weak references then everything is deleted. However if there is a weak reference then the intrusive_ptr calls `release_resources()` but not delete.
This has the effect of freeing the underlying resources (ensuring that program semantics are unchanged) but leaves behind an empty shell of an `intrusive_ptr` that the `weak_intrusive_ptr`s use to check status. And herein lies the solution: as long as we hold a weak reference to a TensorImpl we will block deletion and prevent the `TensorImpl*` from being reused.
This PR uses a `c10::weak_intrusive_ptr<c10::TensorImpl>` to store the address of profiled TensorImpls and then converts it to a raw pointer (or rather, a `TensorImplAddress`) during post processing when we no longer care about blocking address reuse.
Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87244
Approved by: https://github.com/slgong-fb, https://github.com/albanD
While an optimizer can store state however it likes, in practice most optimizer state corresponds to a particular parameter. (This is the case for all `torch.optim` optimizers.) Thus, it turns out to be ergonomic to collect state using that structure. Note that this doesn't lock us into anything; we can always collect state with non-Tensor keys if the use case arises.
One simplification that arises is that Module and Optimizer collection has very similar structure. So similar, in fact, that it is possible to use a common template for config. I also found that a lot of the `check_and_store` logic could be simplified and inlined by this joining of collected optimizer state.
Differential Revision: [D40210703](https://our.internmc.facebook.com/intern/diff/D40210703/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86753
Approved by: https://github.com/slgong-fb, https://github.com/aaronenyeshi
There are a number of instrumentation utils which have been added to the profiler toolkit. They are generally small and self contained, often wrapping vendor APIs. (NVTX, ITT)
They don't really interact with the much more expansive machinery of the PyTorch profiler beyond registration / unregistration, minor util sharing, and reusing the profiler base class. Just as in the case of stubs, it makes sense to group them in a dedicated subfolder.
Differential Revision: [D39108649](https://our.internmc.facebook.com/intern/diff/D39108649/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39108649/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85511
Approved by: https://github.com/albanD
Summary:
- Catch `.grad` tensor info
- Update data types and `check_and_store`, etc.
- Update unit test cases
Test Plan: buck run mode/opt //caffe2/test:profiler
Reviewed By: chaekit
Differential Revision: D39711295
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86355
Approved by: https://github.com/chaekit
Summary:
- Added a config option to remove the 'Call stack' field from the trace file (#84982)
- Changed the default value to `false`
Test Plan:
- `experimental_config=_ExperimentalConfig(verbose=True)` will add the 'Call stack' field back to the trace file (see the sketch below).
- CI tests
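For illustration, a minimal sketch of opting back in to the call-stack field (assuming the standard `torch.profiler` entry point; `with_stack=True` is needed so there are stacks to emit):

```python
import torch
from torch.profiler import profile, ProfilerActivity
from torch._C._profiler import _ExperimentalConfig

# With the new default, the 'Call stack' field is omitted; passing
# verbose=True in the experimental config opts back in.
with profile(
    activities=[ProfilerActivity.CPU],
    with_stack=True,
    experimental_config=_ExperimentalConfig(verbose=True),
) as prof:
    torch.randn(64, 64).softmax(dim=-1).sum()

prof.export_chrome_trace("trace_with_call_stack.json")
```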
Differential Revision: D40092377
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86263
Approved by: https://github.com/aaronenyeshi
This is necessary for memory profiling because we need to know how to interpret an allocation. However there is a slight wrinkle: we don't know if an allocation is for a Tensor's StorageImpl until we see it used in a later call. (We could record outputs, however we're not willing to incur the overhead.) So we instead treat all allocations as relevant and then filter out some later. Otherwise the change to the ID assignment algorithm is minimal.
Differential Revision: [D39788870](https://our.internmc.facebook.com/intern/diff/D39788870/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85719
Approved by: https://github.com/chaekit
I want to start using `TensorMetadata` elsewhere in profiler so we have a common representation of Tensor. The main changes in this PR are:
1) Replace raw pointers with strong typedefs and create a custom type caster to handle moving them to Python.
2) Add a `device()` method to handle reassembling type and index.
Differential Revision: [D39563965](https://our.internmc.facebook.com/intern/diff/D39563965/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85161
Approved by: https://github.com/chaekit
Summary: The `Call stack` field significantly increases trace file size for Python stack tracing (it needs to be deprecated carefully). Added a config option to avoid this increase.
Test Plan:
`experimental_config=_ExperimentalConfig(no_callstack_trace=True)` will remove the field.
+ CI tests
Differential Revision: D39489828
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84982
Approved by: https://github.com/robieta
Summary:
Record nn.Module's parameters for detailed memory profiling:
- extend `module_` in the value cache & `NNModuleInfo` to save parameters
- add Python bindings and a unit test case (a usage sketch follows below)
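A hedged usage sketch (the Python-side flags below are assumptions; the diff itself covers the value-cache/`NNModuleInfo` plumbing and bindings):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile a module with memory and module recording enabled, the setting
# under which the newly captured nn.Module parameter info can be used for
# detailed memory attribution.
model = torch.nn.Linear(32, 8)
inp = torch.randn(4, 32)
with profile(
    activities=[ProfilerActivity.CPU],
    profile_memory=True,
    with_modules=True,
) as prof:
    model(inp).sum().backward()

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=5))
```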
Test Plan: buck run mode/opt //caffe2/test:profiler -- -r test_nnmodule
Differential Revision: D38379717
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83209
Approved by: https://github.com/robieta