The legacy profiler is an eyesore in the autograd folder. At this point the implementation is almost completely decoupled from the rest of the profiler, and it is in maintenance mode pending deprecation.
As a result, I'm moving it to `torch/csrc/profiler/standalone`. Unfortunately, BC requires that the symbols remain in `torch::autograd::profiler`, so I've put some basic forwarding logic in `torch/csrc/autograd/profiler.h`.
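For illustration only, the forwarding amounts to something like the sketch below (the include path and symbol names are placeholders, not the exact ones in the tree):
```
// torch/csrc/autograd/profiler.h -- illustrative sketch of the BC forwarding.
// The implementation now lives under torch/csrc/profiler/standalone; the old
// namespace keeps working by re-exporting the symbols it used to define.
#include <torch/csrc/profiler/standalone/profiler_legacy.h>  // placeholder path

namespace torch {
namespace autograd {
namespace profiler {
// Placeholder symbol names; the real header forwards whatever the legacy API exposes.
using torch::profiler::impl::legacy::enableProfilerLegacy;
using torch::profiler::impl::legacy::disableProfilerLegacy;
} // namespace profiler
} // namespace autograd
} // namespace torch
```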
One strange bit is that `profiler_legacy.h` forward declares `torch::autograd::Node`, but doesn't seem to do anything with it. I think we can delete it, but I want to test to make sure.
(Note: this should not land until https://github.com/pytorch/torchrec/pull/595 is landed.)
Differential Revision: [D39108648](https://our.internmc.facebook.com/intern/diff/D39108648/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85512
Approved by: https://github.com/aaronenyeshi
There is a concept in the profiler of a stub that wraps a profiling API. It was introduced for CUDA profiling before Kineto, and ITT has adopted it to call into VTune APIs. However, for the most part we don't really interact with these stubs when developing the PyTorch profiler.
Thus it makes sense to unify the fallback registration mechanism and create a subfolder to free up real estate in the top-level `torch/csrc/profiler` directory.
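Roughly, the shape of the pattern is the following (names and signatures here are illustrative, not the actual declarations in the new subfolder):
```
// Illustrative sketch only; the real stub interfaces use different names/signatures.
#include <stdexcept>

struct ProfilerStubsExample {
  virtual ~ProfilerStubsExample() = default;
  virtual void mark(const char* name) const = 0;
  virtual void rangePush(const char* name) const = 0;
  virtual void rangePop() const = 0;
};

// Fallback used when no vendor library (CUPTI, ITT/VTune, ...) is linked in:
// every call simply reports that the backend is unavailable.
struct DefaultStubsExample : ProfilerStubsExample {
  void mark(const char*) const override { fail(); }
  void rangePush(const char*) const override { fail(); }
  void rangePop() const override { fail(); }

 private:
  [[noreturn]] static void fail() {
    throw std::runtime_error("profiler backend not available");
  }
};

// Unified registration point (hypothetical signature): a backend installs its
// implementation once, typically from a static initializer in its own TU.
void registerStubsExample(const ProfilerStubsExample* stubs);
```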
Differential Revision: [D39108647](https://our.internmc.facebook.com/intern/diff/D39108647/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85510
Approved by: https://github.com/aaronenyeshi
Right now the profiler is capable of leaking callback handles if a client does not call `at::removeCallback` (as well as of a double free if two clients both remove the handle). This modestly improves the situation by pulling removal into a single method and calling that removal code in the dtor unless explicitly opted out. Once we deprecate the legacy profiler we can further simplify by making `ProfilerThreadLocalStateBase` own the handle outright.
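A rough sketch of the shape of the change, with placeholder class names (the `at::removeCallback` / `CallbackHandle` usage mirrors the RecordFunction API):
```
// Illustrative sketch only; the real owner is the profiler's thread-local state.
#include <ATen/record_function.h>

class ProfilerThreadLocalStateExample {
 public:
  ~ProfilerThreadLocalStateExample() {
    // Unless explicitly opted out, remove the callback in the dtor so the
    // handle cannot leak if a client forgets to call at::removeCallback.
    removeCallback();
  }

  // Single removal path: idempotent, so a second call cannot double free.
  void removeCallback() {
    if (handle_ != at::INVALID_CALLBACK_HANDLE) {
      at::removeCallback(handle_);
      handle_ = at::INVALID_CALLBACK_HANDLE;
    }
  }

  // Explicit opt-out for clients that manage the handle themselves.
  void leaveCallbackInstalled() { handle_ = at::INVALID_CALLBACK_HANDLE; }

 private:
  at::CallbackHandle handle_ = at::INVALID_CALLBACK_HANDLE;
};
```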
Differential Revision: [D38920537](https://our.internmc.facebook.com/intern/diff/D38920537/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83892
Approved by: https://github.com/slgong-fb
`ProfilerState::Disabled` and `ProfilerState::KINETO_ONDEMAND` have special semantics. The former is somewhat intuitive, but the degree of behavior branching on the latter (and why the branching is necessary) is less clear. By factoring the enum checks into methods, we can both clarify intent and future-proof in case we ever add other global profiling contexts.
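A minimal sketch of the idea (placeholder names; the real helpers live on the profiler's config/state classes):
```
// Illustrative sketch only.
enum class ProfilerStateExample { Disabled, KINETO, KINETO_ONDEMAND /*, ... */ };

struct ProfilerConfigExample {
  ProfilerStateExample state;

  // Intent-revealing helpers instead of scattered enum comparisons. If another
  // global profiling context is ever added, only these methods need updating.
  bool disabled() const { return state == ProfilerStateExample::Disabled; }
  bool global() const { return state == ProfilerStateExample::KINETO_ONDEMAND; }
};
```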
Differential Revision: [D38917980](https://our.internmc.facebook.com/intern/diff/D38917980/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83891
Approved by: https://github.com/slgong-fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76078
Templatize `pushProfilingCallbacks` to support `RecordFunction` global callbacks. The reason for templatizing is to
1. squeeze out performance on the hot path
2. work around the restriction to capture-less lambdas (a rough sketch of the pattern follows this list)
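Roughly, the pattern looks like the sketch below (simplified, with placeholder names): the global/local decision becomes a template parameter, so the enter/exit callbacks stay capture-less function pointers and the hot path pays no runtime branch for it.
```
// Illustrative sketch only; the real callbacks record far more state.
#include <ATen/record_function.h>
#include <memory>

template <bool kGlobal>
static std::unique_ptr<at::ObserverContext> onFunctionEnter(const at::RecordFunction& /*fn*/) {
  // kGlobal is a compile-time constant, so "is this the global (on-demand)
  // profiler?" needs no captured state and costs nothing on the hot path.
  return nullptr;
}

template <bool kGlobal>
static void onFunctionExit(const at::RecordFunction& /*fn*/, at::ObserverContext* /*ctx*/) {
  // ... collect end-of-op information here ...
}

template <bool kGlobal>
void pushProfilingCallbacksExample() {
  auto cb = at::RecordFunctionCallback(
                &onFunctionEnter<kGlobal>, &onFunctionExit<kGlobal>)
                .needsInputs(false);
  // The returned handle would normally be stored for later removal.
  if (kGlobal) {
    at::addGlobalCallback(cb);
  } else {
    at::addThreadLocalCallback(cb);
  }
}
```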
Test Plan:
## Global Callback
These were tested in conjunction with the subsequent e2e diffs in both `trace_tester` and `sigrid`.
sample trace: https://fburl.com/perfdoctor/tzgtw2ln
## Local Callback
https://fburl.com/perfdoctor/l58nfiyp
Reviewed By: robieta
Differential Revision: D35457300
fbshipit-source-id: 9d587ec68bfd405e565cc8956b0afa2cdaf95b94
(cherry picked from commit 9d8a9063d7525972d5364307c95ed50f6bafe3ec)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75616
Kineto introduced a new profiler to read performance counters from NVIDIA GPUs (the CUPTI Range Profiler API).
Here we add support for configuring this Kineto range profiler mode.
Example:
```
import torch
from torch.profiler import ProfilerActivity

with torch.profiler.profile(
    activities=[ProfilerActivity.CUDA],
    record_shapes=True,
    on_trace_ready=trace_handler,
    experimental_config=torch.profiler._ExperimentalConfig(
        profiler_metrics=[
            "kineto__tensor_core_insts",
            "dram__bytes_read.sum",
            "dram__bytes_write.sum"],
        profiler_measure_per_kernel=False),
) as prof:
    res = train_batch(modeldef)
    prof.step()
```
## Details
* Introduce a new structure `KinetoProfilerConfig` so users can configure Kineto-specific options while keeping the profiler API consistent.
* Populate configuration options for Kineto.
Test Plan: CI and tested on resnet50
Reviewed By: robieta
Differential Revision: D34489487
fbshipit-source-id: 8ef82d2593f4f4d5824ca634f7d25507bc572caa
(cherry picked from commit 4a2af70629db55a605d4b8d0a54d41df2b247183)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71135
The NVTX profiler is quite different from the other Kineto cases, so it's worth peeling it off early so that later logic can assume either KINETO or KINETO_GPU_FALLBACK. This is more important since we're going to change the Kineto internals. (You can see the Python tracer was unnecessarily coupled to NVTX just because the control logic was intermingled.)
There's also no reason to keep the legacy observer state in the header rather than the cpp file now that the Kineto profiler doesn't need it, so we should shield it from prying eyes.
The recent headaches with TLS downcasting and RPC integration (D32678163 (7ea86dfdb1), D33283314 (681e78bace), D33437773 (7d6535cab3)) have made it crystal clear that we need a lot more safety in the profiler, particularly as we shift things around.
Test Plan: Unit tests. This is no longer a performance PR.
Reviewed By: aaronenyeshi
Differential Revision: D32710829
fbshipit-source-id: f9138598b3cfeba71872905a7afab3c03c0d56e7
(cherry picked from commit 059a39d8e3)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691
TraceType is a sharded file, so by only including specific operator headers, we ensure that changing one (non-method) operator only needs one shard to be re-compiled.
This also changes all the included autograd and jit headers from including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.
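Illustratively, the include pattern in a shard changes roughly like this (the per-operator header names below are just examples):
```
// Before: pulls in every operator declaration, so touching any operator
// recompiles every shard.
#include <ATen/ATen.h>

// After: only the Tensor type plus the specific operators this shard actually
// references, so a change to one non-method operator touches one shard.
#include <ATen/core/Tensor.h>
#include <ATen/ops/add.h>  // example per-operator header
#include <ATen/ops/mul.h>  // example per-operator header
```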
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D33336948
Pulled By: albanD
fbshipit-source-id: 4e40371592b9a5a7e7fcd1d8cecae11ffb873113
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70327
After D32678163 (7ea86dfdb1), test_rpc_profiler began failing. This was surprising, because it should have been a no-op refactor. However, one change is that a Kineto profiler is no longer also an autograd profiler; the RPC framework assumed a legacy profiler, but when a Kineto profiler was active things still mostly worked because of that implementation detail (and crashed after the class split).
This diff tidies up a couple of things:
1) Move `getProfilerConfig` into `api.cpp`, since it is no longer correct to static_cast a `KinetoThreadLocalState` to a `ProfilerLegacyThreadLocalState`. (And really the class we want is `ProfilerThreadLocalStateBase` anyway.)
2) Add a mechanism for callers to check if the active profiler is a legacy or Kineto profiler, so callers like RPC can adjust or provide a nice error message. (A rough sketch of such a check follows this list.)
3) Fix the RPC test to create a legacy profiler.
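The check in (2) can be as small as the sketch below (placeholder names; the real query lives in the shared profiler code):
```
// Illustrative sketch only.
enum class ActiveProfilerTypeExample { NONE, LEGACY, KINETO };

struct ProfilerStateBaseExample {
  virtual ~ProfilerStateBaseExample() = default;
  // Each concrete state (legacy, Kineto) reports what it is.
  virtual ActiveProfilerTypeExample profilerType() const = 0;
};

// Hypothetical accessor for the active thread-local profiler state (may be null).
const ProfilerStateBaseExample* activeProfilerStateExample();

// Callers such as RPC use this to adjust behavior or emit a clear error
// message instead of static_cast-ing to the wrong concrete type.
ActiveProfilerTypeExample activeProfilerTypeExample() {
  const auto* state = activeProfilerStateExample();
  return state != nullptr ? state->profilerType() : ActiveProfilerTypeExample::NONE;
}
```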
Test Plan: `caffe2/torch/fb/training_toolkit/backend/tests:test_rpc_profiler` now passes, and before the fix to `test_rpc_profiler.py`, I verified that the test failed with the error message added to `utils.cpp` rather than just crashing.
Reviewed By: suphoff
Differential Revision: D33283314
fbshipit-source-id: e4fc5b5cfc9ca3b91b8f5e09adea36f38611f90d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69459
This change breaks the dependency between the Kineto and legacy profilers; instead of `profiler_kineto.h` including `profiler_legacy.h`, they both include `profiler/api.h`. As part of this refactor, I injected some intermediate classes to keep legacy behavior from leaking into the Kineto profiler:
1) ProfilerThreadLocalState has become ProfilerThreadLocalStateBase, which just handles the config and callback handle. The legacy and Kineto profilers inherit from it and implement their own, largely disjoint, logic.
2) CUDAStubs is a pure virtual class to make the interface more readable, and the "always fail" behavior has been moved to a `DefaultCUDAStubs` class in `api.cpp`.
Test Plan: Ran the overhead ubenchmark.
Reviewed By: aaronenyeshi
Differential Revision: D32678163
fbshipit-source-id: 9b733283e4eae2614db68147de81b72f6094ce6c