pytorch/torch/csrc/autograd
Natalia Gimelshein 50b91103a9 add self cuda time to avoid double/quadruple counting (#45209)
Summary:
In the profiler, CUDA events did not report self time, so for composite functions there was no way to determine which function was really taking the time. In addition, the reported "total CUDA time" was frequently larger than the total wallclock time. This PR adds "self CUDA time" to the profiler and computes total CUDA time from self CUDA times, similar to how it is done for CPU. It also makes slight formatting changes so the table is more compact. Before:
```
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                  Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
aten::matmul          0.17%            890.805us        99.05%           523.401ms        5.234ms          49.91%           791.184ms        7.912ms          100
aten::mm              98.09%           518.336ms        98.88%           522.511ms        5.225ms          49.89%           790.885ms        7.909ms          100
aten::t               0.29%            1.530ms          0.49%            2.588ms          25.882us         0.07%            1.058ms          10.576us         100
aten::view            0.46%            2.448ms          0.46%            2.448ms          12.238us         0.06%            918.936us        4.595us          200
aten::transpose       0.13%            707.204us        0.20%            1.058ms          10.581us         0.03%            457.802us        4.578us          100
aten::empty           0.14%            716.056us        0.14%            716.056us        7.161us          0.01%            185.694us        1.857us          100
aten::as_strided      0.07%            350.935us        0.07%            350.935us        3.509us          0.01%            156.380us        1.564us          100
aten::stride          0.65%            3.458ms          0.65%            3.458ms          11.527us         0.03%            441.258us        1.471us          300
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 528.437ms
CUDA time total: 1.585s

Recorded timeit time:  789.0814 ms

```
Note that the recorded timeit time (with proper CUDA syncs) is two times smaller than the "CUDA time total" reported by the profiler.

After:
```
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
        aten::matmul         0.15%     802.716us        99.06%     523.548ms       5.235ms     302.451us         0.04%     791.151ms       7.912ms           100
            aten::mm        98.20%     519.007ms        98.91%     522.745ms       5.227ms     790.225ms        99.63%     790.848ms       7.908ms           100
             aten::t         0.27%       1.406ms         0.49%       2.578ms      25.783us     604.964us         0.08%       1.066ms      10.662us           100
          aten::view         0.45%       2.371ms         0.45%       2.371ms      11.856us     926.281us         0.12%     926.281us       4.631us           200
     aten::transpose         0.15%     783.462us         0.22%       1.173ms      11.727us     310.016us         0.04%     461.282us       4.613us           100
         aten::empty         0.11%     591.603us         0.11%     591.603us       5.916us     176.566us         0.02%     176.566us       1.766us           100
    aten::as_strided         0.07%     389.270us         0.07%     389.270us       3.893us     151.266us         0.02%     151.266us       1.513us           100
        aten::stride         0.60%       3.147ms         0.60%       3.147ms      10.489us     446.451us         0.06%     446.451us       1.488us           300
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 528.498ms
CUDA time total: 793.143ms

Recorded timeit time:  788.9832 ms

```
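
The fix can be illustrated with a small sketch (numbers are taken from the tables above; `self_time` is a hypothetical helper for illustration, not the profiler's actual implementation): an op's self time is its total time minus the time attributed to the ops it calls, so summing self times never double-counts nested calls.

```python
# Hypothetical sketch of self-time accounting, with times (in ms) taken from
# the "After" table above. aten::matmul dispatches to aten::mm, so mm's CUDA
# kernels show up inside matmul's total as well.

def self_time(total, children_totals):
    """Self time = total time minus time spent in direct children."""
    return total - sum(children_totals)

matmul_total = 791.151
mm_total = 790.848

mm_self = self_time(mm_total, [])              # mm launches its kernels itself
matmul_self = self_time(matmul_total, [mm_total])

# Summing *totals* counts mm's kernels twice, like the "Before" table...
naive_sum = matmul_total + mm_total            # ~1582 ms
# ...while summing *self* times recovers roughly the recorded wallclock time.
self_sum = matmul_self + mm_self               # ~791 ms

print(round(naive_sum), round(self_sum))
```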

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45209

Reviewed By: zou3519

Differential Revision: D23925491

Pulled By: ngimel

fbshipit-source-id: 7f9c49238d116bfd2db9db3e8943355c953a77d0
2020-09-28 21:51:13 -07:00
functions [reland] Make grad point to bucket buffer in DDP to save memory usage (#44344) 2020-09-24 20:54:51 -07:00
utils Add __torch_function__ for methods (#37091) 2020-08-05 20:44:13 -07:00
anomaly_mode.cpp
anomaly_mode.h Print all traceback for nested backwards in detect_anomaly (#43626) 2020-08-31 08:23:07 -07:00
autograd.cpp Fix error message in autograd (#39729) 2020-06-09 13:54:49 -07:00
autograd.h
cpp_hook.cpp
cpp_hook.h
custom_function.cpp Don't materialize output grads (#41821) 2020-08-11 04:27:07 -07:00
custom_function.h Don't materialize output grads (#41821) 2020-08-11 04:27:07 -07:00
edge.h Move torch/csrc/utils/hash.h to c10/util/hash.h. (#42503) 2020-08-29 17:47:00 -07:00
engine.cpp Preserve python backtrace in autograd engine errors. (#43684) 2020-09-01 01:28:47 -07:00
engine.h Preserve python backtrace in autograd engine errors. (#43684) 2020-09-01 01:28:47 -07:00
function_hook.cpp
function_hook.h
function.cpp Print all traceback for nested backwards in detect_anomaly (#43626) 2020-08-31 08:23:07 -07:00
function.h Print all traceback for nested backwards in detect_anomaly (#43626) 2020-08-31 08:23:07 -07:00
FunctionsManual.cpp Fixed handling of nan for evenly_distribute_backward (#45280) 2020-09-28 15:57:02 -07:00
FunctionsManual.h adding a beta parameter to the smooth_l1 loss fn (#44433) 2020-09-25 16:36:28 -07:00
grad_mode.h
init.cpp [RPC profiling] Extend RPC profiling to support async function execution over RPC. (#44664) 2020-09-25 13:19:26 -07:00
input_buffer.cpp
input_buffer.h
input_metadata.h
profiler_cuda.cpp Destroy CUDA events after profiling (#39962) 2020-06-23 10:44:39 -07:00
profiler.cpp add self cuda time to avoid double/quadruple counting (#45209) 2020-09-28 21:51:13 -07:00
profiler.h [RPC profiling] Extend RPC profiling to support async function execution over RPC. (#44664) 2020-09-25 13:19:26 -07:00
python_anomaly_mode.cpp Print all traceback for nested backwards in detect_anomaly (#43626) 2020-08-31 08:23:07 -07:00
python_anomaly_mode.h Print all traceback for nested backwards in detect_anomaly (#43626) 2020-08-31 08:23:07 -07:00
python_autograd.h
python_cpp_function.cpp
python_cpp_function.h
python_engine.cpp Use ivalue::Future in autograd engine and DistEngine. (#43676) 2020-08-29 02:15:26 -07:00
python_engine.h Use ivalue::Future in autograd engine and DistEngine. (#43676) 2020-08-29 02:15:26 -07:00
python_fft_functions.h Adds fft namespace (#41911) 2020-08-06 00:20:50 -07:00
python_function.cpp Print all traceback for nested backwards in detect_anomaly (#43626) 2020-08-31 08:23:07 -07:00
python_function.h Don't materialize output grads (#41821) 2020-08-11 04:27:07 -07:00
python_hook.cpp
python_hook.h
python_legacy_variable.cpp Fix return value of PyErr_WarnEx ignored (SystemError) (#44371) 2020-09-10 10:15:21 -07:00
python_legacy_variable.h
python_linalg_functions.h Adds torch.linalg namespace (#42664) 2020-08-07 10:18:30 -07:00
python_nn_functions.h Adds fft namespace (#41911) 2020-08-06 00:20:50 -07:00
python_variable_indexing.cpp Add __torch_function__ for methods (#37091) 2020-08-05 20:44:13 -07:00
python_variable_indexing.h
python_variable.cpp Fix return value of PyErr_WarnEx ignored (SystemError) (#44371) 2020-09-10 10:15:21 -07:00
python_variable.h
README.md
record_function_ops.cpp RecordFunction in Dispatcher (#37587) 2020-07-17 22:20:05 -07:00
record_function_ops.h [RFC] Profile rpc_async call from JIT (#40652) 2020-07-03 15:17:16 -07:00
saved_variable.cpp Change C++ frontend to take optional<Tensor> arguments (#41947) 2020-07-31 16:11:55 -07:00
saved_variable.h Change C++ frontend to take optional<Tensor> arguments (#41947) 2020-07-31 16:11:55 -07:00
symbolic.h
TraceTypeManual.cpp Allow Tensor& in the unboxing logic (#42712) 2020-08-12 17:33:23 -07:00
variable.cpp Reland split (#41567) 2020-07-21 08:06:27 -07:00
variable.h Reland split (#41567) 2020-07-21 08:06:27 -07:00
VariableTypeManual.cpp Byte-for-byte compatibility fixes in codegen (#44879) 2020-09-25 08:06:50 -07:00
VariableTypeUtils.h Add _foreach_add_(TensorList tensors, Scalar scalar) API (#42531) 2020-08-28 14:34:46 -07:00

Autograd

Autograd is a hotspot for PyTorch performance, so most of the heavy lifting is implemented in C++. This means we have to do some shuffling between Python and C++; in general, we want data to be in a form that is convenient to manipulate from C++.

Our general model is that for any key data type that autograd manipulates, there are two implementations: a C++ type and a Python object type. For example, consider variables in autograd: we have both Variable in variable.h (the C++ type) and THPVariable in python_variable.h (the Python type). (By the way, THP stands for TorcH Python, not to be confused with THPP, TorcH C++.) Variable contains the payload of a variable, while THPVariable just contains a shared_ptr reference to Variable, as well as references to other Python objects which the Python runtime needs to know about. A lot of the data accessor implementations in python_variable.cpp simply reach through to the underlying Variable and return the appropriate value.
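
The payload/wrapper split can be sketched in Python (the classes below are simplified stand-ins for the real C++/CPython types, for illustration only): the wrapper holds a reference to the payload, and its accessors simply reach through.

```python
class Variable:
    """Stand-in for the C++ payload type (variable.h): owns the actual data."""
    def __init__(self, data, requires_grad=False):
        self.data = data
        self.requires_grad = requires_grad

class THPVariable:
    """Stand-in for the Python object type (python_variable.h):
    holds a (shared) reference to the payload rather than the data itself."""
    def __init__(self, cdata):
        self.cdata = cdata  # analogous to the shared_ptr reference to Variable

    # Accessors reach through to the underlying Variable.
    @property
    def requires_grad(self):
        return self.cdata.requires_grad

v = Variable([1.0, 2.0], requires_grad=True)
w = THPVariable(v)
print(w.requires_grad)  # True: read straight off the payload
```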

The most complicated application of this principle is Function, which also supports users implementing custom behavior in Python. We have the following classes:

  • Node in function.h, the C++ type.
  • THPFunction in python_function.h, the Python object type. In python_function.cpp, you can see the boilerplate that tells the Python interpreter about this object.
  • PyNode in python_function.h, a subclass of Node which forwards apply to a Python THPFunction. (NOT a Python object, despite its name!)
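
The forwarding relationship between the three classes above can be sketched as (a simplified Python mock-up of the C++ structure; the method names on the stand-in classes are illustrative):

```python
class Node:
    """Stand-in for the C++ base type (function.h): a node in the graph."""
    def apply(self, grads):
        raise NotImplementedError

class THPFunction:
    """Stand-in for the Python object type: user-defined behavior lives here."""
    def backward(self, grads):
        return [g * 2 for g in grads]  # example custom Python behavior

class PyNode(Node):
    """Subclass of Node that forwards apply to a THPFunction."""
    def __init__(self, obj):
        self.obj = obj
    def apply(self, grads):
        # The real PyNode would acquire the GIL and call into Python here.
        return self.obj.backward(grads)

node = PyNode(THPFunction())
print(node.apply([1.0, 3.0]))  # [2.0, 6.0]
```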

Outside of PyNode, the C++ objects largely avoid referencing Python objects. The few exceptions are pyobj in Variable and pyobj in Node, which ensure uniqueness of the associated Python wrapper (if it exists), and PyNode itself, whose whole point is to let C++ call into Python.