Summary:
In the profiler, CUDA did not report self time, so for composite functions there was no way to determine which function was actually taking the time. In addition, the reported "total CUDA time" was frequently greater than the total wallclock time. This PR adds "self CUDA time" to the profiler and computes total CUDA time from self CUDA time, the same way it is already done for CPU time. It also makes slight formatting changes to keep the table more compact.

Before:
```
--------------------  ----------------  --------------  -----------  ---------  ------------  ------------  ----------  -------------  ---------------
Name                  Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
--------------------  ----------------  --------------  -----------  ---------  ------------  ------------  ----------  -------------  ---------------
aten::matmul                     0.17%       890.805us       99.05%  523.401ms       5.234ms        49.91%   791.184ms        7.912ms              100
aten::mm                        98.09%       518.336ms       98.88%  522.511ms       5.225ms        49.89%   790.885ms        7.909ms              100
aten::t                          0.29%         1.530ms        0.49%    2.588ms      25.882us         0.07%     1.058ms       10.576us              100
aten::view                       0.46%         2.448ms        0.46%    2.448ms      12.238us         0.06%   918.936us        4.595us              200
aten::transpose                  0.13%       707.204us        0.20%    1.058ms      10.581us         0.03%   457.802us        4.578us              100
aten::empty                      0.14%       716.056us        0.14%  716.056us       7.161us         0.01%   185.694us        1.857us              100
aten::as_strided                 0.07%       350.935us        0.07%  350.935us       3.509us         0.01%   156.380us        1.564us              100
aten::stride                     0.65%         3.458ms        0.65%    3.458ms      11.527us         0.03%   441.258us        1.471us              300
--------------------  ----------------  --------------  -----------  ---------  ------------  ------------  ----------  -------------  ---------------
Self CPU time total: 528.437ms
CUDA time total: 1.585s
Recorded timeit time: 789.0814 ms
```
Note that the recorded timeit time (measured with proper CUDA syncs) is half the "CUDA time total" reported by the profiler.
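The fix applies to CUDA the same rule the profiler already uses for CPU time: an operator's self time is its total time minus the total time of its direct children, so summing self times cannot double-count nested calls. A toy sketch of that rule (plain Python; the event names and numbers are made up, not the profiler's real data structures):

```python
# Toy tree of profiler-like events (names and timings are illustrative only).
# Each node: (name, total_time_us, [children]).
tree = ("matmul", 100.0, [
    ("mm", 90.0, []),
    ("t", 4.0, [
        ("transpose", 3.0, []),
    ]),
])

def self_times(node, out):
    name, total, children = node
    # Self time = total time minus the total time of direct children.
    out[name] = total - sum(child[1] for child in children)
    for child in children:
        self_times(child, out)
    return out

print(self_times(tree, {}))
# -> {'matmul': 6.0, 'mm': 90.0, 't': 1.0, 'transpose': 3.0}
```

Because each nested interval is attributed to exactly one operator, the self times sum back to the root's total (100.0 here), which is why the "After" table's CUDA time total matches the wallclock measurement.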
After:
```
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------
Name                    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------
aten::matmul                 0.15%     802.716us        99.06%     523.548ms       5.235ms     302.451us         0.04%     791.151ms        7.912ms           100
aten::mm                    98.20%     519.007ms        98.91%     522.745ms       5.227ms     790.225ms        99.63%     790.848ms        7.908ms           100
aten::t                      0.27%       1.406ms         0.49%       2.578ms      25.783us     604.964us         0.08%       1.066ms       10.662us           100
aten::view                   0.45%       2.371ms         0.45%       2.371ms      11.856us     926.281us         0.12%     926.281us        4.631us           200
aten::transpose              0.15%     783.462us         0.22%       1.173ms      11.727us     310.016us         0.04%     461.282us        4.613us           100
aten::empty                  0.11%     591.603us         0.11%     591.603us       5.916us     176.566us         0.02%     176.566us        1.766us           100
aten::as_strided             0.07%     389.270us         0.07%     389.270us       3.893us     151.266us         0.02%     151.266us        1.513us           100
aten::stride                 0.60%       3.147ms         0.60%       3.147ms      10.489us     446.451us         0.06%     446.451us        1.488us           300
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------
Self CPU time total: 528.498ms
CUDA time total: 793.143ms
Recorded timeit time: 788.9832 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45209
Reviewed By: zou3519
Differential Revision: D23925491
Pulled By: ngimel
fbshipit-source-id: 7f9c49238d116bfd2db9db3e8943355c953a77d0
Autograd
Autograd is a hotspot for PyTorch performance, so most of the heavy lifting is implemented in C++. This implies that we have to do some shuffling between Python and C++; in general, we want data to be in a form that is convenient to manipulate from C++.
Our general model is that for any key data type that autograd manipulates,
there are two implementations: a C++ type and a Python object type. For
example, consider variables in autograd: we have both Variable in variable.h
(the C++ type) and THPVariable in python_variable.h (the Python type.)
(By the way, THP stands for TorcH Python, not to be confused with THPP, TorcH
C++). Variable contains the payload of a variable, while THPVariable just
contains a shared_ptr reference to Variable, as well as references to other
Python objects which the Python runtime needs to know about. A lot of
data accessor implementations in python_variable.cpp simply reach through
to the underlying Variable and return the appropriate value.
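This two-type pattern can be illustrated with a toy Python analogue (the class names below are hypothetical stand-ins, not the real types): the payload type owns the data, while the wrapper holds only a reference and reaches through in its accessors.

```python
class VariablePayload:
    """Stand-in for the C++ Variable: owns the actual data."""
    def __init__(self, data):
        self.data = data
        self.grad = None

class PythonWrapper:
    """Stand-in for THPVariable: holds only a reference to the payload
    (the analogue of its shared_ptr to Variable)."""
    def __init__(self, cdata):
        self.cdata = cdata

    @property
    def data(self):
        # Accessor simply reaches through to the underlying payload,
        # like the accessors in python_variable.cpp.
        return self.cdata.data

v = VariablePayload([1.0, 2.0])
w = PythonWrapper(v)
print(w.data)  # -> [1.0, 2.0]
```

Because the wrapper stores a reference rather than a copy, mutations made through the payload are immediately visible through the wrapper, mirroring how THPVariable and Variable share one underlying object.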
The most complicated application of this principle is Function, which also supports users implementing custom behavior in Python. We have the following classes:
- Node in function.h, the C++ type.
- THPFunction in python_function.h, the Python object type. In python_function.cpp, you can see the boilerplate that tells the Python interpreter about this object.
- PyNode in python_function.h, a subclass of Node which forwards apply to a Python THPFunction. (NOT a Python object, despite its name!)
Outside of PyNode, the C++ objects largely avoid referencing Python
objects (there are a few exceptions, like pyobj in Variable, and
PyNode, whose whole point is to let C++ call into Python). There is also
a pyobj field in Node, which ensures uniqueness of the associated Python
wrapper (if one exists).
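The forwarding role of PyNode can be sketched in a few lines of toy Python (these classes are illustrative stand-ins, not the real C++/CPython types):

```python
class Node:
    """Stand-in for the C++ Node base class."""
    def apply(self, inputs):
        raise NotImplementedError

class PyNodeLike(Node):
    """Stand-in for PyNode: a Node subclass whose only job is to
    forward apply() to a user-defined Python function object."""
    def __init__(self, py_fn):
        self.py_fn = py_fn

    def apply(self, inputs):
        # The C++ engine calls apply(); the call is forwarded into Python.
        return self.py_fn.backward(inputs)

class DoubleFn:
    """Stand-in for a user's THPFunction-backed custom function."""
    def backward(self, inputs):
        return [2 * x for x in inputs]

node = PyNodeLike(DoubleFn())
print(node.apply([1, 2]))  # -> [2, 4]
```

The engine only ever sees a Node, so custom Python behavior plugs into the graph without the core C++ code knowing anything about Python.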