pytorch/torch
Tristan Rice 758d7dea9c torch.monitor - Initial C++ Stats (#68074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68074

This is the first step of many PRs towards implementing the `torch.monitor` RFC https://github.com/pytorch/rfcs/pull/30

This defines the aggregation types, the `Stat` class and provides some simple collection of the stats.

This doesn't match the RFC exactly as it incorporates some of the comments on the RFC as well as a few changes for performance.

Changes:
* Added `window_size` to the stats. If specified, the stat is always computed over exactly `window_size` values; if there aren't enough values within the current window, the previously computed stats are reported.
* This doesn't include push metrics yet (they will be coming).
  After more discussion, the best way to handle this appears to be a hybrid where each metric can set how frequently it is logged. Fixed-window metrics are logged each time they fill their window, which allows both high-performance counters and lower-frequency push counters (window_size=1).

Performance considerations:
* Updating a stat acquires a lock on that Stat object. This should be performant unless many threads are writing to the same stat; a single-threaded writer will typically hit an uncontended futex, so it should be quite fast.
* Adding/removing/fetching all stats acquires a global lock on the stat list -- this shouldn't be an issue since these events happen infrequently.
* Fetching stats accesses one stat at a time instead of holding a global lock. This means the exported values are linearizable per stat but not serializable across multiple stats; I don't expect this to be an issue.

Next steps:
1. Add a StatCollector interface for push-style metrics
2. Add pybind interfaces to expose the stats to Python
3. Add default metric providers
4. Integrate into the Kineto trace view

Test Plan:
buck test //caffe2/test/cpp/monitor:monitor

CI

Reviewed By: kiukchung

Differential Revision: D32266032

fbshipit-source-id: dab8747b4712f5dba5644387817a3a0fda18b66a
2021-11-18 21:46:23 -08:00
_C [c10d] Fix object-based collectives for debug mode (#68223) 2021-11-13 04:18:31 -08:00
_masked Strided masked reduction: mean (2nd try) (#67088) 2021-11-01 16:12:07 -07:00
ao [quant][embedding qat] eager mode QAT for Embeddings (#66429) 2021-11-18 05:57:11 -08:00
autograd Stop warning spamming about vmap in gradcheck (#68586) 2021-11-18 07:00:36 -08:00
backends Add an option to disable reduced precision reductions for FP16 GEMM (#67946) 2021-11-09 17:27:20 -08:00
contrib
cpu Add fp16/fp32 autocasting to JIT/TorchScript (#63939) 2021-10-27 12:11:36 -07:00
csrc torch.monitor - Initial C++ Stats (#68074) 2021-11-18 21:46:23 -08:00
cuda Update __init__.py (#67900) 2021-11-08 08:56:38 -08:00
distributed [reland] simplify init_from_local_shards API (#68021) 2021-11-17 23:20:37 -08:00
distributions Implement Entropy methods for Binomial and Multinomial distributions (#67609) 2021-11-11 09:16:28 -08:00
fft
for_onnx
futures
fx [const_fold] Fix call_module const folding (#68614) 2021-11-18 20:56:06 -08:00
jit Update Freezing Logic and add new passes (#68024) 2021-11-09 13:21:52 -08:00
legacy
lib [NOOP][clangformat][codemod] Enable CLANGFORMAT for some folders in caffe2/* (#67746) 2021-11-03 12:23:14 -07:00
linalg Revert D32283178: Add linalg.solve_triangular 2021-11-18 14:46:10 -08:00
multiprocessing
nn [BC-breaking] Change dtype of softmax to support TorchScript and MyPy (#68336) 2021-11-18 11:26:14 -08:00
onnx Added antialias flag to interpolate (CPU only, bilinear) (#65142) 2021-11-17 09:10:15 -08:00
optim Adds an optimizer instance variable to ChainedScheduler (#68010) 2021-11-10 01:31:47 -08:00
package
profiler [Reland] Python tracer. (#68325) 2021-11-15 23:32:49 -08:00
quantization
sparse
special
testing Add native_dropout (#63937) 2021-11-18 19:41:10 -08:00
utils Fix DLPack CUDA stream convention (#67618) 2021-11-18 08:36:05 -08:00
__config__.py
__future__.py
__init__.py Add set_deterministic_debug_mode and get_deterministic_debug_mode (#67778) 2021-11-11 12:48:29 -08:00
_appdirs.py
_classes.py
_deploy.py [deploy] fix TypedStorage serialization (#67499) 2021-10-28 22:33:04 -07:00
_jit_internal.py [package] fix torchscript classes in package (#68028) 2021-11-16 10:01:40 -08:00
_linalg_utils.py
_lobpcg.py torch.lobpcg.backward: do not save non-Variable types with ctx.save_for_backward. (#67994) 2021-11-08 10:02:09 -08:00
_lowrank.py
_namedtensor_internals.py
_ops.py
_python_dispatcher.py
_six.py
_sources.py Disallow annotations on instance attributes outside __init__ (#67051) 2021-10-25 16:20:47 -07:00
_storage_docs.py
_tensor_docs.py [numpy] Alias arctan2 to atan2 (#67010) 2021-11-16 09:41:09 -08:00
_tensor_str.py
_tensor.py Sparse CSR: add convert_indices_from_csr_to_coo (#66774) 2021-11-17 22:28:30 -08:00
_torch_docs.py Revert D32283178: Add linalg.solve_triangular 2021-11-18 14:46:10 -08:00
_utils_internal.py
_utils.py
_VF.py
_vmap_internals.py More aggressively market functorch.vmap when torch.vmap gets called (#67347) 2021-11-12 16:10:16 -08:00
abi-check.cpp
autocast_mode.py Add fp16/fp32 autocasting to JIT/TorchScript (#63939) 2021-10-27 12:11:36 -07:00
CMakeLists.txt codegen: Split up source, header and Declarations.yaml generation (#67497) 2021-11-03 13:20:54 -07:00
custom_class_detail.h [NOOP][clangformat][codemod] Enable CLANGFORMAT for some folders in caffe2/* (#67746) 2021-11-03 12:23:14 -07:00
custom_class.h [NOOP][clangformat][codemod] Enable CLANGFORMAT (#67854) 2021-11-04 14:07:57 -07:00
deploy.h
extension.h
functional.py [lint] small pass to make lint clean (#68367) 2021-11-16 10:27:00 -08:00
hub.py making import_module private and deprecating public method (#67990) 2021-11-09 07:27:57 -08:00
library.h [NOOP][clangformat][codemod] Enable CLANGFORMAT for some folders in caffe2/* (#67746) 2021-11-03 12:23:14 -07:00
overrides.py Add native_dropout (#63937) 2021-11-18 19:41:10 -08:00
py.typed
quasirandom.py
random.py
README.txt
script.h [NOOP][clangformat][codemod] Enable CLANGFORMAT for some folders in caffe2/* (#67746) 2021-11-03 12:23:14 -07:00
serialization.py Throw error when saving storages that view same data with different type (#66949) 2021-11-16 08:44:44 -08:00
storage.py
torch_version.py
types.py

Note [TH abstraction violation]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TH/THC provide some hpp headers, which are proper C++ headers rather than
C headers.  These headers serve double duty as *internal implementation
detail* headers, whose contents should largely not be used by external
clients.

Ideally, we would not install these headers at all; instead, you should
use public functions (in headers like `THTensor.h`, NOT `THTensor.hpp`)
to manipulate these structs.  However, there are a few places
in torch/csrc where we violate this abstraction.  They are marked with
a pointer to this note.  Each of those sites will have to be refactored
when we refactor the guts of THTensor and related structures.