pytorch/torch/csrc/autograd
Natalia Gimelshein 50b91103a9 add self cuda time to avoid double/quadruple counting (#45209)
Summary:
In the profiler, CUDA events did not report self time, so for composite functions there was no way to determine which function was really taking the time. In addition, the reported "total CUDA time" was frequently larger than the total wallclock time. This PR adds "self CUDA time" to the profiler and computes total CUDA time from self CUDA times, similar to how it is done for CPU. It also makes slight formatting changes so the table is more compact. Before:
```
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                  Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
aten::matmul          0.17%            890.805us        99.05%           523.401ms        5.234ms          49.91%           791.184ms        7.912ms          100
aten::mm              98.09%           518.336ms        98.88%           522.511ms        5.225ms          49.89%           790.885ms        7.909ms          100
aten::t               0.29%            1.530ms          0.49%            2.588ms          25.882us         0.07%            1.058ms          10.576us         100
aten::view            0.46%            2.448ms          0.46%            2.448ms          12.238us         0.06%            918.936us        4.595us          200
aten::transpose       0.13%            707.204us        0.20%            1.058ms          10.581us         0.03%            457.802us        4.578us          100
aten::empty           0.14%            716.056us        0.14%            716.056us        7.161us          0.01%            185.694us        1.857us          100
aten::as_strided      0.07%            350.935us        0.07%            350.935us        3.509us          0.01%            156.380us        1.564us          100
aten::stride          0.65%            3.458ms          0.65%            3.458ms          11.527us         0.03%            441.258us        1.471us          300
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 528.437ms
CUDA time total: 1.585s

Recorded timeit time:  789.0814 ms

```
Note that the recorded timeit time (with proper CUDA syncs) is two times smaller than the "CUDA time total" reported by the profiler.

After:
```
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
        aten::matmul         0.15%     802.716us        99.06%     523.548ms       5.235ms     302.451us         0.04%     791.151ms       7.912ms           100
            aten::mm        98.20%     519.007ms        98.91%     522.745ms       5.227ms     790.225ms        99.63%     790.848ms       7.908ms           100
             aten::t         0.27%       1.406ms         0.49%       2.578ms      25.783us     604.964us         0.08%       1.066ms      10.662us           100
          aten::view         0.45%       2.371ms         0.45%       2.371ms      11.856us     926.281us         0.12%     926.281us       4.631us           200
     aten::transpose         0.15%     783.462us         0.22%       1.173ms      11.727us     310.016us         0.04%     461.282us       4.613us           100
         aten::empty         0.11%     591.603us         0.11%     591.603us       5.916us     176.566us         0.02%     176.566us       1.766us           100
    aten::as_strided         0.07%     389.270us         0.07%     389.270us       3.893us     151.266us         0.02%     151.266us       1.513us           100
        aten::stride         0.60%       3.147ms         0.60%       3.147ms      10.489us     446.451us         0.06%     446.451us       1.488us           300
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 528.498ms
CUDA time total: 793.143ms

Recorded timeit time:  788.9832 ms

```
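
The fix can be illustrated with a small sketch (numbers are taken from the tables above; `self_time` is a hypothetical helper for illustration, not the profiler's actual implementation): an op's self time is its total time minus the time attributed to the ops it calls, so summing self times never double-counts nested calls.

```python
# Hypothetical sketch of self-time accounting, with times (in ms) taken from
# the "After" table above. aten::matmul dispatches to aten::mm, so mm's CUDA
# kernels show up inside matmul's total as well.

def self_time(total, children_totals):
    """Self time = total time minus time spent in direct children."""
    return total - sum(children_totals)

matmul_total = 791.151
mm_total = 790.848

mm_self = self_time(mm_total, [])              # mm launches its kernels itself
matmul_self = self_time(matmul_total, [mm_total])

# Summing *totals* counts mm's kernels twice, like the "Before" table...
naive_sum = matmul_total + mm_total            # ~1582 ms
# ...while summing *self* times recovers roughly the recorded wallclock time.
self_sum = matmul_self + mm_self               # ~791 ms

print(round(naive_sum), round(self_sum))
```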

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45209

Reviewed By: zou3519

Differential Revision: D23925491

Pulled By: ngimel

fbshipit-source-id: 7f9c49238d116bfd2db9db3e8943355c953a77d0
2020-09-28 21:51:13 -07:00
functions [reland] Make grad point to bucket buffer in DDP to save memory usage (#44344) 2020-09-24 20:54:51 -07:00
utils Add __torch_function__ for methods (#37091) 2020-08-05 20:44:13 -07:00
anomaly_mode.cpp
anomaly_mode.h Print all traceback for nested backwards in detect_anomaly (#43626) 2020-08-31 08:23:07 -07:00
autograd.cpp Fix error message in autograd (#39729) 2020-06-09 13:54:49 -07:00
autograd.h
cpp_hook.cpp
cpp_hook.h
custom_function.cpp Don't materialize output grads (#41821) 2020-08-11 04:27:07 -07:00
custom_function.h Don't materialize output grads (#41821) 2020-08-11 04:27:07 -07:00
edge.h Move torch/csrc/utils/hash.h to c10/util/hash.h. (#42503) 2020-08-29 17:47:00 -07:00
engine.cpp Preserve python backtrace in autograd engine errors. (#43684) 2020-09-01 01:28:47 -07:00
engine.h Preserve python backtrace in autograd engine errors. (#43684) 2020-09-01 01:28:47 -07:00
function_hook.cpp
function_hook.h
function.cpp Print all traceback for nested backwards in detect_anomaly (#43626) 2020-08-31 08:23:07 -07:00
function.h Print all traceback for nested backwards in detect_anomaly (#43626) 2020-08-31 08:23:07 -07:00
FunctionsManual.cpp Fixed handling of nan for evenly_distribute_backward (#45280) 2020-09-28 15:57:02 -07:00
FunctionsManual.h adding a beta parameter to the smooth_l1 loss fn (#44433) 2020-09-25 16:36:28 -07:00
grad_mode.h
init.cpp [RPC profiling] Extend RPC profiling to support async function execution over RPC. (#44664) 2020-09-25 13:19:26 -07:00
input_buffer.cpp
input_buffer.h
input_metadata.h
profiler_cuda.cpp Destroy CUDA events after profiling (#39962) 2020-06-23 10:44:39 -07:00
profiler.cpp add self cuda time to avoid double/quadruple counting (#45209) 2020-09-28 21:51:13 -07:00
profiler.h [RPC profiling] Extend RPC profiling to support async function execution over RPC. (#44664) 2020-09-25 13:19:26 -07:00
python_anomaly_mode.cpp Print all traceback for nested backwards in detect_anomaly (#43626) 2020-08-31 08:23:07 -07:00
python_anomaly_mode.h Print all traceback for nested backwards in detect_anomaly (#43626) 2020-08-31 08:23:07 -07:00
python_autograd.h
python_cpp_function.cpp
python_cpp_function.h
python_engine.cpp Use ivalue::Future in autograd engine and DistEngine. (#43676) 2020-08-29 02:15:26 -07:00
python_engine.h Use ivalue::Future in autograd engine and DistEngine. (#43676) 2020-08-29 02:15:26 -07:00
python_fft_functions.h Adds fft namespace (#41911) 2020-08-06 00:20:50 -07:00
python_function.cpp Print all traceback for nested backwards in detect_anomaly (#43626) 2020-08-31 08:23:07 -07:00
python_function.h Don't materialize output grads (#41821) 2020-08-11 04:27:07 -07:00
python_hook.cpp
python_hook.h
python_legacy_variable.cpp Fix return value of PyErr_WarnEx ignored (SystemError) (#44371) 2020-09-10 10:15:21 -07:00
python_legacy_variable.h
python_linalg_functions.h Adds torch.linalg namespace (#42664) 2020-08-07 10:18:30 -07:00
python_nn_functions.h Adds fft namespace (#41911) 2020-08-06 00:20:50 -07:00
python_variable_indexing.cpp Add __torch_function__ for methods (#37091) 2020-08-05 20:44:13 -07:00
python_variable_indexing.h
python_variable.cpp Fix return value of PyErr_WarnEx ignored (SystemError) (#44371) 2020-09-10 10:15:21 -07:00
python_variable.h
README.md
record_function_ops.cpp RecordFunction in Dispatcher (#37587) 2020-07-17 22:20:05 -07:00
record_function_ops.h [RFC] Profile rpc_async call from JIT (#40652) 2020-07-03 15:17:16 -07:00
saved_variable.cpp Change C++ frontend to take optional<Tensor> arguments (#41947) 2020-07-31 16:11:55 -07:00
saved_variable.h Change C++ frontend to take optional<Tensor> arguments (#41947) 2020-07-31 16:11:55 -07:00
symbolic.h
TraceTypeManual.cpp Allow Tensor& in the unboxing logic (#42712) 2020-08-12 17:33:23 -07:00
variable.cpp Reland split (#41567) 2020-07-21 08:06:27 -07:00
variable.h Reland split (#41567) 2020-07-21 08:06:27 -07:00
VariableTypeManual.cpp Byte-for-byte compatibility fixes in codegen (#44879) 2020-09-25 08:06:50 -07:00
VariableTypeUtils.h Add _foreach_add_(TensorList tensors, Scalar scalar) API (#42531) 2020-08-28 14:34:46 -07:00

Autograd

Autograd is a hotspot for PyTorch performance, so most of the heavy lifting is implemented in C++. This means we have to do some shuffling between Python and C++; in general, we want data to be in a form that is convenient to manipulate from C++.

Our general model is that for any key data type that autograd manipulates, there are two implementations: a C++ type and a Python object type. For example, consider variables in autograd: we have both Variable in variable.h (the C++ type) and THPVariable in python_variable.h (the Python type). (By the way, THP stands for TorcH Python, not to be confused with THPP, TorcH C++.) Variable contains the payload of a variable, while THPVariable just contains a shared_ptr reference to Variable, as well as references to other Python objects which the Python runtime needs to know about. A lot of the data accessor implementations in python_variable.cpp simply reach through to the underlying Variable and return the appropriate value.
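
The payload/wrapper split can be sketched in Python (the classes below are simplified stand-ins for the real C++/CPython types, for illustration only): the wrapper holds a reference to the payload, and its accessors simply reach through.

```python
class Variable:
    """Stand-in for the C++ payload type (variable.h): owns the actual data."""
    def __init__(self, data, requires_grad=False):
        self.data = data
        self.requires_grad = requires_grad

class THPVariable:
    """Stand-in for the Python object type (python_variable.h):
    holds a (shared) reference to the payload rather than the data itself."""
    def __init__(self, cdata):
        self.cdata = cdata  # analogous to the shared_ptr reference to Variable

    # Accessors reach through to the underlying Variable.
    @property
    def requires_grad(self):
        return self.cdata.requires_grad

v = Variable([1.0, 2.0], requires_grad=True)
w = THPVariable(v)
print(w.requires_grad)  # True: read straight off the payload
```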

The most complicated application of this principle is Function, which also supports users implementing custom behavior in Python. We have the following classes:

  • Node in function.h, the C++ type.
  • THPFunction in python_function.h, the Python object type. In python_function.cpp, you can see the boilerplate that tells the Python interpreter about this object.
  • PyNode in python_function.h, a subclass of Node which forwards apply to a Python THPFunction. (NOT a Python object, despite its name!)
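
The forwarding relationship between the three classes above can be sketched as (a simplified Python mock-up of the C++ structure; the method names on the stand-in classes are illustrative):

```python
class Node:
    """Stand-in for the C++ base type (function.h): a node in the graph."""
    def apply(self, grads):
        raise NotImplementedError

class THPFunction:
    """Stand-in for the Python object type: user-defined behavior lives here."""
    def backward(self, grads):
        return [g * 2 for g in grads]  # example custom Python behavior

class PyNode(Node):
    """Subclass of Node that forwards apply to a THPFunction."""
    def __init__(self, obj):
        self.obj = obj
    def apply(self, grads):
        # The real PyNode would acquire the GIL and call into Python here.
        return self.obj.backward(grads)

node = PyNode(THPFunction())
print(node.apply([1.0, 3.0]))  # [2.0, 6.0]
```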

Outside of PyNode, the C++ objects largely avoid referencing Python objects. The few exceptions are pyobj in Variable and pyobj in Node, which ensure uniqueness of the associated Python wrapper (if it exists), and PyNode itself, whose whole point is to let C++ call into Python.