Pull Request resolved: https://github.com/pytorch/pytorch/pull/78164
This PR finishes moving over the python tracer to use the unified event type. Things that changed:
1) The hacky after-the-fact splicing of python events in profiler_kineto.cpp is gone and python events now simply fold into the rest. (Yay!!!) This is a major BE win.
2) Added `ExtraFields<EventType::PyCall>` and `ExtraFields<EventType::PyCCall>`
3) The enter events (time + TraceKey) are now handled by RecordQueue for performance.
4) Python tracing now uses TSC for lower overhead.
Simplifications in profiler_python WRT part 1:
1) Rather than ValueCache emitting an intermediate value_t that gets further converted, load methods can now directly emit ExtraFields<...>
2) The complicated replay in profiler_python.cpp is replaced with a much simpler (and safer) pass that just pairs start and end times. (A rough sketch of this pairing follows the list.)
3) During post processing we can now use `CallTypeHelper::map` to automatically pull in all events instead of having to loop over the entries for each type manually. This will make it simpler to add new types of Python events later.
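To illustrate item (2), here is a minimal Python sketch of the pairing pass, assuming per-thread streams of `(key, is_enter, timestamp)` records in time order. The names are illustrative only, not the actual types in `profiler_python.cpp`:
```python
def pair_enter_exit(records):
    """records: iterable of (key, is_enter, t) tuples, in time order, for one thread.

    Returns (key, start_t, end_t) tuples. Enters still open when the stream ends
    are closed at the last observed timestamp, mirroring a trace that stops mid-call.
    """
    stack, paired, last_t = [], [], None
    for key, is_enter, t in records:
        last_t = t
        if is_enter:
            stack.append((key, t))
        elif stack:
            start_key, start_t = stack.pop()
            paired.append((start_key, start_t, t))
    while stack:  # close calls that never exited before profiling stopped
        start_key, start_t = stack.pop()
        paired.append((start_key, start_t, last_t))
    return paired
```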
Differential Revision: [D36515869](https://our.internmc.facebook.com/intern/diff/D36515869/)
Approved by: https://github.com/aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78163
The python function tracer is complicated and separate from the other profiler types, so I've chosen to break the change into two diffs. The first (this one) reworks the cache structure to make it amenable to integration (as well as making some other nice tweaks), and the next one actually moves it over.
The old cache scheme worked very hard to pack all the information about an event into a small struct via bit packing, with a couple of secondary caches for things like names. Because of the space constraints on that struct (and the fact that it had to represent all call and return types), there were a lot of subtle invariants swirling around that made it hard to offload anything to a different component. The new cache system is more modular and also, as it turns out, a bit faster. (Benchmarks in part 2.)
There is a more detailed description of the cache hierarchy in the PR, but the gist is that I use various specializations to handle the different event types (python call, nn module, c function) and lean on the type system to keep everything safe and organized. (One nice thing about using unique IDs is that they also implicitly encode the event type. They implicitly encode everything!) Given that we are going to want to expand the semantics (e.g. torch ops, DataLoader, etc) this will give a nice way to capture richer semantics without significantly increasing the complexity of the profiler.
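As a loose analogy for that layout (the real code is templated C++; everything below is illustrative, not the actual `ValueCache` API), each call type gets its own specialized store, and the opaque key returned on intern both identifies the entry and encodes which store it lives in:
```python
from dataclasses import dataclass
from enum import Enum

class CallType(Enum):
    PY_CALL = 0
    NN_MODULE = 1
    C_FUNCTION = 2

@dataclass(frozen=True)
class CacheKey:
    call_type: CallType  # the key implicitly encodes the event type
    index: int

class ValueCacheSketch:
    def __init__(self):
        # One specialized store per call type instead of a single bit-packed struct.
        self._stores = {t: [] for t in CallType}

    def intern(self, call_type, payload):
        store = self._stores[call_type]
        store.append(payload)
        return CacheKey(call_type, len(store) - 1)

    def load(self, key):
        # Dispatch on the type encoded in the key; each store can hold a
        # differently shaped record, analogous to the per-type specializations.
        return self._stores[key.call_type][key.index]
```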
Differential Revision: [D36379147](https://our.internmc.facebook.com/intern/diff/D36379147/)
Approved by: https://github.com/aaronenyeshi
Seems like it should be one. This will make it possible to register
meta implementations even when there is a CompositeImplicitAutograd
registration already. It also paves the way for sparse meta, etc.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78469
Approved by: https://github.com/ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77698
This PR adds tree building to the post processing of profiler. The basic algorithm is to sort the events, maintain a stack and a priority queue of event ends, and push/pop accordingly. The logic for merging Python events is still separate in `profiler_kineto.cpp`. That can be removed when Python events have an `EventType`.
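A rough Python sketch of that algorithm (not the actual C++ in `profiler_kineto.cpp`; the `Event` fields below are assumptions for illustration):
```python
import heapq
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Event:
    name: str
    start: int
    end: int
    parent: Optional["Event"] = None
    children: List["Event"] = field(default_factory=list)

def build_tree(events):
    # Assumes properly nested (non-crossing) intervals, as profiler events are.
    stack, end_heap, roots = [], [], []
    tie = 0  # tie-breaker so the heap never compares Event objects
    # Outer events sort before inner events that start at the same time.
    for ev in sorted(events, key=lambda e: (e.start, -e.end)):
        # Pop every open event that ends at or before this event's start.
        while end_heap and end_heap[0][0] <= ev.start:
            heapq.heappop(end_heap)
            stack.pop()
        if stack:
            ev.parent = stack[-1]
            stack[-1].children.append(ev)
        else:
            roots.append(ev)
        stack.append(ev)
        heapq.heappush(end_heap, (ev.end, tie))
        tie += 1
    return roots
```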
Differential Revision: [D36321105](https://our.internmc.facebook.com/intern/diff/D36321105/)
Approved by: https://github.com/aaronenyeshi
Summary:
I think this has to be set in all before() calls, because inputs_valid_ defaults to false. For a RecordFunction without any input parameters (so this interface is not used), calling record_func.inputs() will trigger an assert:
```
fbcode/caffe2/aten/src/ATen/record_function.h
322
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(inputs_valid_, "Called inputs() outside RecordFunction start callback");
```
I suppose an alternative would be to require users to call num_inputs() before calling inputs(). But I think the intent of inputs_valid_ is to verify that inputs are only requested within the lifetime of the start callback.
Test Plan:
After this diff fix:
```
=> buck build mode/dev-nosan caffe2/test:profiler --show-output
=> buck-out/gen/caffe2/test/profiler#binary.par test_profiler.TestRecordFunction.test_record_function
test_record_function (test_profiler.TestRecordFunction) ... Log file: /tmp/libkineto_activities_763488.json
INFO:2022-05-20 13:02:33 763488:763488 Config.cpp:470] Trace start time: 2022-05-20 13:02:48
Trace duration: 500ms
Warmup duration: 5s
Max GPU buffer size: 128MB
Enabled activities: cpu_op,user_annotation,external_correlation,cuda_runtime,cpu_instant_event,python_function
Manifold bucket: gpu_traces
Manifold object: tree/traces/clientAPI/0/1653076953/devgpu040.ftw6/libkineto_activities_763488.json
Trace compression enabled: 1
INFO:2022-05-20 13:02:33 763488:763488 CuptiActivityProfiler.cpp:559] Enabling GPU tracing
INFO:2022-05-20 13:02:33 763488:763488 CuptiActivityProfiler.cpp:486] Running child profiler CuptiRangeProfiler for 500 ms
INFO:2022-05-20 13:02:33 763488:763488 CuptiActivityProfiler.cpp:593] Tracing starting in 14s
INFO:2022-05-20 13:02:33 763488:763488 CuptiActivityProfiler.cpp:596] Tracing will end in 15s
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0520 13:02:34.013141 763488 Logger.cpp:2273] Dropping logs in unit tests. Set shouldLogDuringTests=True in your CBLC to fix this
STAGE:2022-05-20 13:02:33 763488:763488 ActivityProfilerController.cpp:269] Completed Stage: Warm Up
STAGE:2022-05-20 13:02:34 763488:763488 ActivityProfilerController.cpp:275] Completed Stage: Collection
INFO:2022-05-20 13:02:34 763488:763488 CuptiActivityProfiler.cpp:134] Processing 1 CPU buffers
INFO:2022-05-20 13:02:34 763488:763488 CuptiActivityProfiler.cpp:771] Traces Recorded:
INFO:2022-05-20 13:02:34 763488:763488 CuptiActivityProfiler.cpp:774] PyTorch Profiler: 1 iterations
ok
----------------------------------------------------------------------
Ran 1 test in 0.060s
OK
```
New test failure case:
```
=> buck-out/gen/caffe2/test/profiler#binary.par test_profiler.TestRecordFunction.test_record_function
test_record_function (test_profiler.TestRecordFunction) ... Log file: /tmp/libkineto_activities_808629.json
INFO:2022-05-20 13:04:46 808629:808629 Config.cpp:470] Trace start time: 2022-05-20 13:05:01
Trace duration: 500ms
Warmup duration: 5s
Max GPU buffer size: 128MB
Enabled activities: cpu_op,user_annotation,external_correlation,cuda_runtime,cpu_instant_event,python_function
Manifold bucket: gpu_traces
Manifold object: tree/traces/clientAPI/0/1653077086/devgpu040.ftw6/libkineto_activities_808629.json
Trace compression enabled: 1
INFO:2022-05-20 13:04:46 808629:808629 CuptiActivityProfiler.cpp:559] Enabling GPU tracing
INFO:2022-05-20 13:04:46 808629:808629 CuptiActivityProfiler.cpp:486] Running child profiler CuptiRangeProfiler for 500 ms
INFO:2022-05-20 13:04:46 808629:808629 CuptiActivityProfiler.cpp:593] Tracing starting in 14s
INFO:2022-05-20 13:04:46 808629:808629 CuptiActivityProfiler.cpp:596] Tracing will end in 15s
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0520 13:04:46.853253 808629 Logger.cpp:2273] Dropping logs in unit tests. Set shouldLogDuringTests=True in your CBLC to fix this
STAGE:2022-05-20 13:04:46 808629:808629 ActivityProfilerController.cpp:269] Completed Stage: Warm Up
W0520 13:04:48.126065 808629 record_function.cpp:470] Exception in RecordFunction callback: inputs_valid_ INTERNAL ASSERT FAILED at "caffe2/aten/src/ATen/record_function.h":322, please report a bug to PyTorch. Called inputs() outside RecordFunction start callback
Exception raised from inputs at caffe2/aten/src/ATen/record_function.h:322 (most recent call first):
# 0 c10::get_backtrace[abi:cxx11](unsigned long, unsigned long, bool)
# 1 std::_Function_handler<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > (), c10::(anonymous namespace)::GetFetchStackTrace()::$_1>::_M_invoke(std::_Any_data const&)
# 2 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
# 3 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
# 4 c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*)
# 5 torch::profiler::impl::ThreadLocalSubqueue::begin_op(at::RecordFunction const&, unsigned long)
# 6 std::unique_ptr<at::ObserverContext, std::default_delete<at::ObserverContext> > torch::autograd::profiler::(anonymous namespace)::onFunctionEnter<false>(at::RecordFunction const&)
# 7 at::RecordFunction::runStartCallbacks()
```
Differential Revision: D36556512
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78002
Approved by: https://github.com/swolchok
Summary: It's currently possible for C++ callers to check if there is an active profiler. This adds Python API parity. For now we just use `torch._C._autograd` namespace, as this is mostly for first party frameworks like RPC. (We can always move to public API if there is demand.)
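A minimal usage sketch, hedged: the binding name used below (`_profiler_enabled`) is an assumption for illustration rather than a quote of the PR; check `torch._C._autograd` for the actual symbol.
```python
import contextlib
import torch

def maybe_annotate(name):
    """Only pay for a record_function annotation when a profiler is active."""
    # Hypothetical attribute lookup; the real name of the new binding may differ.
    enabled = getattr(torch._C._autograd, "_profiler_enabled", lambda: False)()
    if enabled:
        return torch.autograd.profiler.record_function(name)
    return contextlib.nullcontext()

with maybe_annotate("rpc_request"):
    pass  # work that should only be annotated while profiling
```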
Test Plan: Added unit test
Differential Revision: D35602425
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75754
Approved by: https://github.com/rohan-varma
This PR aims to solve the problem of delegating the API from the profiler layer to the `Iterator` returned from `IterDataPipe`.
We need this for internal usage such as `limit`, `resume`, etc.
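A minimal sketch of the idea, with made-up names (the real iterator wrapper and method names in torch.utils.data may differ): the wrapper returned when iterating an `IterDataPipe` simply forwards control-style calls such as `limit` or `resume` to the underlying pipe.
```python
class _DelegatingIteratorSketch:
    """Illustrative only: forwards unknown attributes to the source IterDataPipe."""

    def __init__(self, source_datapipe):
        self._source = source_datapipe
        self._it = iter(source_datapipe)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._it)

    def __getattr__(self, name):
        # Anything the wrapper does not define itself (e.g. limit, resume)
        # is delegated to the wrapped datapipe.
        return getattr(self._source, name)
```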
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75275
Approved by: https://github.com/NivekT
Summary:
Include the sequence ids of the ops that produced each input tensor in the NVTX markers.
This feature adds additional information to the NVTX marker string, e.g. seq_ids=[101, 102, 103],
indicating the sequence id of the op that produced the input tensor at that position index
in the array. In the above example, input tensor 0 was produced by the node with sequence id 101,
input tensor 1 by node 102, and input tensor 2 by node 103. This is the same way the sizes array
is organized. If you know the sequence id of a node and the sequence ids of its input edges,
then you have enough information to construct the network graph.
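For example, a post-processing script can recover the input edges from the marker strings roughly like this (the full marker format below is a simplified assumption; only the `seq_ids=[...]` field is taken from the description above):
```python
import re

def input_edges(marker: str):
    """Return (producer_seq_id, input_index) pairs parsed from a marker string.

    Assumes the marker contains a field like "seq_ids=[101, 102, 103]".
    """
    m = re.search(r"seq_ids=\[([^\]]*)\]", marker)
    if not m:
        return []
    ids = [int(s) for s in m.group(1).split(",") if s.strip()]
    return [(seq_id, idx) for idx, seq_id in enumerate(ids)]

# e.g. input_edges("aten::add, sizes=[[3], [3]], seq_ids=[101, 102]")
# -> [(101, 0), (102, 1)]
```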
Fixes https://github.com/pytorch/pytorch/issues/66105
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70264
Reviewed By: chaekit
Differential Revision: D34792707
Pulled By: robieta
fbshipit-source-id: 4407b853c929a737505803b0db77a8ecd966cce2
(cherry picked from commit cd3c0c8c9d4d63d7897f60521c407883240d1d5b)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71417
I accidentally changed CPU_INSTANT_EVENT to CPU_OP, which broke TensorBoard.
Test Plan: Make memory profiling unit test check this case.
Reviewed By: aaronenyeshi
Differential Revision: D33637286
fbshipit-source-id: c95945f6b85cd4168820bd4d2a9203274a0a5bd6
(cherry picked from commit b1e258672a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65736
We ran into some limitations when extracting PyTorch operator parameters through hooks or the execution graph. Some of these limitations are not due to the operator not exposing the parameters; rather, the inputs for these operators are already fused/processed in some cases (like embedding tables). We want to be able to attach some metadata to the user scope record functions, allowing the profilers to later extract this information.
The record function C++ API already supports taking inputs and outputs information. The corresponding Python interface does not support them and only allows a string name as record function parameter.
This diff adds support for users to optionally add additional arguments to the record function, in two ways.
1. to remain backward compatible with `record_function_op`, we have added an optional string arg to the interface: `with record_function(name, arg_str)`.
2. to support the data dependency graph, we also have the new `torch.autograd._record_function_with_args_enter` and `torch.autograd._record_function_with_args_exit` functions to provide an interface where we can give additional tensor arguments. For now we imagine this can be used for debugging or analysis purposes. In this form, we currently support some basic data types as inputs: scalars, string, list, and tensor.
Example usage:
```
# record_function operator with a name and, optionally, a string for arguments.
with record_function("## TEST 1 ##", "[1, 2, 3]"):
    <actual module or operator>

# more general form of record_function
a = _record_function_with_args_enter("## TEST 2 ##", 1, False, 2.5, [u, u], "hello", u)
<actual module or operator>
_record_function_with_args_exit(a)
```
Corresponding outputs in execution graph:
```
{
  "name": "## TEST 2 ##", "id": 7, "parent": 3, "fw_parent": 0, "scope": 5, "tid": 1, "fw_tid": 0,
  "inputs": [1, false, 2.5, [6, 6], "hello", 6],
  "input_shapes": [[], [], [], [[3, 4, 5], [3, 4, 5]], [], [3, 4, 5]],
  "input_types": ["Int", "Bool", "Double", "GenericList[Tensor(float),Tensor(float)]", "String", "Tensor(float)"],
  "outputs": [], "output_shapes": [], "output_types": []
},
{
  "name": "## TEST 1 ##", "id": 3, "parent": 2, "fw_parent": 0, "scope": 5, "tid": 1, "fw_tid": 0,
  "inputs": ["1, 2, 3"], "input_shapes": [[]], "input_types": ["String"],
  "outputs": [], "output_shapes": [], "output_types": []
},
```
Test Plan:
```
=> buck build caffe2/test:profiler --show-output
=> buck-out/gen/caffe2/test/profiler#binary.par test_profiler.TestRecordFunction
test_record_function (test_profiler.TestRecordFunction) ... Log file: /tmp/libkineto_activities_1651304.json
Net filter:
Target net for iteration count:
Net Iterations: 3
INFO:2021-09-27 01:10:15 1651304:1651304 Config.cpp:424] Trace start time: 2021-09-27 01:10:30
Trace duration: 500ms
Warmup duration: 5s
Net size threshold: 0
GPU op count threshold: 0
Max GPU buffer size: 128MB
Enabled activities: cpu_op,user_annotation,external_correlation,cuda_runtime,cpu_instant_event
Manifold bucket: gpu_traces
Manifold object: tree/traces/clientAPI/0/1632730215/devvm2060.ftw0/libkineto_activities_1651304.json
Trace compression enabled: 1
INFO:2021-09-27 01:10:15 1651304:1651304 ActivityProfiler.cpp:536] Tracing starting in 14s
INFO:2021-09-27 01:10:15 1651304:1651304 ActivityProfiler.cpp:48] Target net for iterations not specified - picking first encountered that passes net filter
INFO:2021-09-27 01:10:15 1651304:1651304 ActivityProfiler.cpp:57] Tracking net PyTorch Profiler for 3 iterations
INFO:2021-09-27 01:10:15 1651304:1651304 ActivityProfiler.cpp:126] Processing 1 CPU buffers
INFO:2021-09-27 01:10:15 1651304:1651304 ActivityProfiler.cpp:686] Recorded nets:
INFO:2021-09-27 01:10:15 1651304:1651304 ActivityProfiler.cpp:689] PyTorch Profiler: 1 iterations
ok
----------------------------------------------------------------------
Ran 1 test in 0.021s
OK
```
Reviewed By: gdankel
Differential Revision: D31165259
fbshipit-source-id: 15920aaef7138c666e5eca2a71c3bf33073eadc4
Summary:
Use startThreadId+seqNumber of the forward op and fwdThreadId+seqNumber of the backward op to correlate pairs of them.
third_party/kineto should be updated accordingly: https://github.com/pytorch/kineto/pull/372
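Conceptually the pairing looks like the sketch below, assuming plain event records with the listed fields (illustrative; not the Kineto schema):
```python
def pair_fwd_bwd(events):
    """events: records with .is_backward, .seq_nr, .start_thread_id, .fwd_thread_id.

    Forward ops are keyed on (start_thread_id, seq_nr); backward ops look up
    their forward partner via (fwd_thread_id, seq_nr).
    """
    fwd = {}
    for ev in events:
        if not ev.is_backward and ev.seq_nr >= 0:
            fwd[(ev.start_thread_id, ev.seq_nr)] = ev
    pairs = []
    for ev in events:
        if ev.is_backward and ev.seq_nr >= 0:
            partner = fwd.get((ev.fwd_thread_id, ev.seq_nr))
            if partner is not None:
                pairs.append((partner, ev))
    return pairs
```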
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62553
Reviewed By: malfet
Differential Revision: D30125728
Pulled By: gdankel
fbshipit-source-id: 9877a54392ba043d0eac56ce5b7bbf244277fa7e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62779
Change from floating point operations per second to total floating point operations. This requires removing the division by execution time from the Kineto-computed FLOPs and updating the necessary documentation.
Test Plan:
Running the following script:
```
import torch
from torch.profiler import profile
import torchvision.models as models
model = models.resnet18().eval()
inputs = torch.randn(5, 3, 224, 224)
with torch.no_grad():
    with profile(record_shapes=True, with_flops=True) as prof:
        model(inputs)

print(prof.key_averages().table(sort_by="cpu_time_total"))
```
Before diff results in:
{F636640118}
And after the diff it should be about `(27.78 * 10^9) FLOP/s * 0.652838 s = 18,135,839,640 FLOP ≈ 18.136 GFLOP`. Running the script again yields this answer:
{F636655686}
Reviewed By: gdankel
Differential Revision: D29972997
fbshipit-source-id: 0f8d9f264b7d9f8f6bb3f10ab7c2c9794291e28b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61792
This PR adds module hierarchy information to KinetoEvent.
What is the module hierarchy information attached to events?
While profiling a TorchScript module, when events are added we ask the JIT
for the module hierarchy associated with the node being executed. At the
time of execution of that node there may be multiple frames in the
interpreter's stack. For each frame we find the corresponding node and
query its module hierarchy. The module hierarchy of a node is associated
with the node's InlinedCallStack, which tracks the path via which the node
was inlined. Thus, during the inlining process we annotate module
information corresponding to the CallMethod nodes being inlined.
With this PR, the chrome trace will contain additional metadata,
"Module Hierarchy", which can look like this:
TOP(ResNet)::forward.SELF(ResNet)::_forward_impl.layer1(Sequential)::forward.0(BasicBlock)::forward.conv1(Conv2d)::forward.SELF(Conv2d)::_conv_forward
It contains the module instance, type name, and method name for each frame
in the callstack.
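To make the format concrete, a small helper along these lines (illustrative only, assuming each frame looks like `instance(Type)::method` separated by `.` as in the example above) can split the string back into its components:
```python
def parse_module_hierarchy(hierarchy: str):
    """Split e.g. "TOP(ResNet)::forward.layer1(Sequential)::forward" into
    [("TOP", "ResNet", "forward"), ("layer1", "Sequential", "forward")]."""
    frames = []
    for frame in hierarchy.split("."):
        if "(" not in frame or ")::" not in frame:
            continue
        instance, rest = frame.split("(", 1)
        type_name, method = rest.split(")::", 1)
        frames.append((instance, type_name, method))
    return frames
```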
Test Plan:
test_profiler
Imported from OSS
Reviewed By: raziel, ilia-cher
Differential Revision: D29745442
fbshipit-source-id: dc8dfaf7c5b8ab256ff0b2ef1e5ec265ca366528
Summary:
Report the allocation size of the pointed-to memory, the total allocated memory, and the total reserved size, all in one report.
`ptr` and `alloc_size` will be used for associating with the op trace.
`allocated_size` and `reserved_size` will be used for the memory trace.
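As a rough sketch of how these fields can be consumed downstream (field names here are assumptions, not the profiler's actual schema): `ptr`/`alloc_size` let a post-processing pass attribute allocations and frees to the op that was running, while `allocated_size`/`reserved_size` track the running totals.
```python
def attribute_memory(memory_events, op_intervals):
    """memory_events: records with .t, .ptr, .alloc_size (negative for frees).
    op_intervals: (op_name, start, end) tuples. Returns net bytes per op.
    Illustrative only; the real association happens inside the profiler."""
    net = {}
    for ev in memory_events:
        owner = next(
            (name for name, start, end in op_intervals if start <= ev.t <= end),
            None,
        )
        if owner is not None:
            net[owner] = net.get(owner, 0) + ev.alloc_size
    return net
```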
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61282
Reviewed By: ejguan
Differential Revision: D29796282
Pulled By: chaekit
fbshipit-source-id: 5314c867632d3af1fa9a3811b35eaa5e931a5d87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58013
Add a test case and a fix (legacy profiler) for empty trace handling
Test Plan: python test/test_profiler.py
Reviewed By: gdankel
Differential Revision: D28345388
Pulled By: ilia-cher
fbshipit-source-id: 4727589ab83367ac8b506cc0f186e5292d974671
Summary:
Add the ability to use the new profiler API even if Kineto is not compiled
in, by falling back to the legacy profiler.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57612
Test Plan:
Compiled with
USE_KINETO=0 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install --cmake
and with USE_KINETO=1, and ran
python test/test_profiler.py -v
Reviewed By: gdankel
Differential Revision: D28217680
Pulled By: ilia-cher
fbshipit-source-id: ec81fb527eb69bb0a3e0bd6aad13592200d7fe70
Summary:
Expanding support to all builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56323
Test Plan: CI
Reviewed By: malfet
Differential Revision: D28171478
Pulled By: ilia-cher
fbshipit-source-id: 16bc752d1be3cbaeda5316f5d8a687ae05a83d22
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53174
Enable Kineto also in the CPU builds (non-mobile, non-Windows(atm))
Test Plan: CI
Reviewed By: gdankel
Differential Revision: D26776112
Pulled By: ilia-cher
fbshipit-source-id: 8733f65c2993105136c853f2a7b6e497d0fa53bf
Summary:
The new profiler API was added in PR #48280. This PR adds FLOPS
support to the new profiler API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51734
Test Plan:
```python
python test/test_profiler.py -k test_flops
```
Reviewed By: xuzhao9
Differential Revision: D26261851
Pulled By: ilia-cher
fbshipit-source-id: dbeba4c197e6f51a9a8e640e8bb60ec38df87f73
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51421
Mark memory events that did not happen within an operator context
explicitly in the profiler output.
Test Plan: python test/test_profiler.py -k test_memory_profiler
Reviewed By: ngimel
Differential Revision: D26166518
Pulled By: ilia-cher
fbshipit-source-id: 3c14d3ac25a7137733ea7cc65f0eb48693a98f5e
Summary:
Add the FLOPS metric computation to the experimental Kineto profiler.
This includes saving the necessary extra arguments and computing FLOPs in the C++ code,
and extracting the FLOPS value from the Python frontend.
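For reference, the per-op counts reduce to simple formulas over the saved shapes; a hedged sketch of the two common cases (shape conventions assumed here, not copied from the C++ code):
```python
def matmul_flops(m, k, n):
    # One multiply-add per (m, n, k) triple counts as two floating point ops.
    return 2 * m * n * k

def conv2d_flops(batch, c_in, c_out, kh, kw, h_out, w_out):
    # Each output element does kh * kw * c_in multiply-adds.
    return 2 * batch * c_out * h_out * w_out * kh * kw * c_in

# e.g. a 64x128 @ 128x256 matmul: matmul_flops(64, 128, 256) == 4_194_304
```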
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51503
Test Plan:
Build PyTorch with USE_KINETO option, then run the unit test:
```python
python test/test_profiler.py -k test_flops
```
Reviewed By: ilia-cher
Differential Revision: D26202711
Pulled By: xuzhao9
fbshipit-source-id: 7dab7c513f454355a220b72859edb3ccbddcb3ff