pytorch/test/cpp
Tristan Rice 758d7dea9c torch.monitor - Initial C++ Stats (#68074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68074

This is the first of many PRs towards implementing the `torch.monitor` RFC: https://github.com/pytorch/rfcs/pull/30

This defines the aggregation types and the `Stat` class, and provides some simple collection of the stats.

This doesn't match the RFC exactly, as it incorporates some of the comments on the RFC as well as a few changes for performance.

Changes:
* Added `window_size` to the stats. If specified, the stat is always computed over exactly `window_size` values. Until the current window has collected enough values, it reports the stats from the previous window.
* This doesn't include the push metrics yet (will be coming).
  After more discussion, the best way to handle this looks like a hybrid in which each metric sets how frequently it is logged. A metric with a fixed `window_size` is logged each time it fills its window. This allows performant counters as well as lower-frequency push counters that are logged on every update (`window_size=1`).
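
The windowing behaviour described above can be sketched as follows. This is a minimal illustration, not the code added by this PR: the `WindowedSum` class, its method names, and the use of `window_size == 0` to mean "no window" are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

// Hypothetical sketch of a fixed-window sum; not the torch::monitor API.
class WindowedSum {
 public:
  // window_size == 0 is this sketch's (assumed) way to say "no window".
  explicit WindowedSum(std::size_t window_size) : window_size_(window_size) {}

  // Records one value. Returns true when this call filled a window, which
  // is the point at which a push-style logger would emit the stat.
  bool add(std::int64_t value) {
    std::lock_guard<std::mutex> guard(mu_);
    current_sum_ += value;
    ++current_count_;
    if (window_size_ == 0 || current_count_ < window_size_) {
      return false;
    }
    // The window is full: publish it and start a new one.
    reported_sum_ = current_sum_;
    current_sum_ = 0;
    current_count_ = 0;
    return true;
  }

  // With a window, returns the sum of the last completed window (0 before
  // any window completes). Without a window, returns the running total.
  std::int64_t get() const {
    std::lock_guard<std::mutex> guard(mu_);
    return window_size_ == 0 ? current_sum_ : reported_sum_;
  }

 private:
  mutable std::mutex mu_;
  const std::size_t window_size_;
  std::int64_t current_sum_ = 0;
  std::size_t current_count_ = 0;
  std::int64_t reported_sum_ = 0;
};

// Usage: a partially filled window keeps reporting the previous result.
inline void windowed_sum_demo() {
  WindowedSum stat(3);
  stat.add(1);
  stat.add(2);
  assert(stat.get() == 0);  // window not full yet, nothing published
  stat.add(3);
  assert(stat.get() == 6);  // first full window: 1 + 2 + 3
  stat.add(10);
  assert(stat.get() == 6);  // still the previous window
}
```

With `window_size=1` every call to `add` returns true, so the stat would be logged on each update; a larger window amortizes the logging cost over many updates.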

Performance considerations:
* Updating a stat acquires a lock on that Stat object. This should be performant unless many threads write to the same stat. The uncontended single-threaded case will typically use a futex, so it should be quite fast.
* Adding, removing, or fetching all stats takes a global lock on the stat list. This shouldn't be an issue since these events happen infrequently.
* Fetching stats reads one stat at a time instead of locking all of them together. This means the exported values are linearizable per stat but not serializable across multiple stats. I don't expect this to be an issue.
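
The locking scheme in the list above might look roughly like the sketch below. The `Counter` and `Registry` classes and all of their members are hypothetical names chosen for illustration; the actual classes in this PR may be structured differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stat with its own lock; not the torch::monitor API.
class Counter {
 public:
  explicit Counter(std::string name) : name_(std::move(name)) {}

  // Updates only take this stat's lock, so writers to different stats
  // never contend with each other.
  void add(std::int64_t v) {
    std::lock_guard<std::mutex> guard(mu_);
    value_ += v;
  }

  std::int64_t get() const {
    std::lock_guard<std::mutex> guard(mu_);
    return value_;
  }

  const std::string& name() const { return name_; }

 private:
  const std::string name_;
  mutable std::mutex mu_;
  std::int64_t value_ = 0;
};

// Hypothetical registry: one global lock guards the list of stats.
class Registry {
 public:
  void add(Counter* stat) {
    std::lock_guard<std::mutex> guard(mu_);
    stats_.push_back(stat);
  }

  void remove(Counter* stat) {
    std::lock_guard<std::mutex> guard(mu_);
    stats_.erase(std::remove(stats_.begin(), stats_.end(), stat),
                 stats_.end());
  }

  // Holds the list lock so the pointers stay registered, but reads each
  // stat under its own lock, one at a time. Each value is consistent on
  // its own; the map as a whole is not a single atomic snapshot.
  std::map<std::string, std::int64_t> snapshot() const {
    std::lock_guard<std::mutex> guard(mu_);
    std::map<std::string, std::int64_t> out;
    for (const Counter* stat : stats_) {
      out[stat->name()] = stat->get();
    }
    return out;
  }

 private:
  mutable std::mutex mu_;
  std::vector<Counter*> stats_;
};

// Usage: register two stats, update them, and export their values.
inline void registry_demo() {
  Counter hits("hits");
  Counter misses("misses");
  Registry registry;
  registry.add(&hits);
  registry.add(&misses);
  hits.add(2);
  misses.add(5);
  const auto values = registry.snapshot();
  assert(values.at("hits") == 2);
  assert(values.at("misses") == 5);
}
```

Because a writer can update one stat between the reads of two others, two values in the same export may reflect slightly different points in time, which is the "not serializable across multiple stats" trade-off noted above.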

Next steps:
1. Add StatCollector interface for push style metrics
1. Add pybind interfaces to expose to Python
1. Add default metric providers
1. Integrate into Kineto trace view

Test Plan:
buck test //caffe2/test/cpp/monitor:monitor

CI

Reviewed By: kiukchung

Differential Revision: D32266032

fbshipit-source-id: dab8747b4712f5dba5644387817a3a0fda18b66a
2021-11-18 21:46:23 -08:00
api [easy][PyTorch] Use at::native::is_nonzero (#67195) 2021-10-26 12:40:32 -07:00
c10d Enable desync root cause analysis for NCCL (#68310) 2021-11-17 20:29:03 -08:00
common
dist_autograd Fix distributed autograd gradients synchronization (#57792) 2021-05-09 17:32:59 -07:00
jit Refactor saving jit::Module to mobile .pt in 2 steps: (#66494) 2021-11-17 12:02:20 -08:00
lazy [LT] Upstream TsNode, TsNodeLowering, TsLoweringContext (#68154) 2021-11-12 12:57:20 -08:00
lite_interpreter_runtime Back out "Revert D30710710: [Pytorch Edge] Support profiling kineto events from external source" (#66421) 2021-10-12 10:55:29 -07:00
monitor torch.monitor - Initial C++ Stats (#68074) 2021-11-18 21:46:23 -08:00
rpc Remove ProcessGroup from TensorPipeAgent initialization (#68128) 2021-11-11 12:28:55 -08:00
tensorexpr [TensorExpr] Remove non-determinism in iterating over unordered_set of intermediate buffers. (#68277) 2021-11-13 00:50:57 -08:00
__init__.py remediation of S205607 2020-07-17 17:19:47 -07:00