pytorch/docs
Zizeng Meng 861945100e [Kineto] Enable OOM observer (#152160)
Summary:
# Context:
When a memory leak occurs, it usually triggers an OOM only in later iterations. A snapshot covering every iteration would be huge and hard to interpret.
On the CUDA side, an OOM observer generates a snapshot when an OOM happens, keeping the latest 1,500,000 entries for debugging.

This diff implements the same feature on the MTIA side.
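
The idea of keeping only the most recent entries and dumping them at the moment of failure can be sketched in plain Python. This is a minimal illustration, not the Kineto, CUDA, or MTIA implementation: `BoundedTraceRecorder`, `run_iterations`, and the toy capacity model are hypothetical names introduced here, and the buffer size of 3 stands in for the 1,500,000 entries mentioned above.

```python
from collections import deque


class BoundedTraceRecorder:
    """Hypothetical recorder that keeps only the most recent events."""

    def __init__(self, max_entries):
        # A deque with maxlen silently drops the oldest entry when full.
        self.events = deque(maxlen=max_entries)
        self.snapshots = []

    def record(self, action, size):
        self.events.append((action, size))

    def on_oom(self):
        # Observer callback: copy the bounded history at the time of failure.
        self.snapshots.append(list(self.events))


def run_iterations(recorder, capacity, alloc_sizes):
    """Toy allocator loop: returns False as soon as an allocation does not fit."""
    used = 0
    for size in alloc_sizes:
        if used + size > capacity:
            recorder.record("oom", size)
            recorder.on_oom()
            return False
        used += size
        recorder.record("alloc", size)
    return True


recorder = BoundedTraceRecorder(max_entries=3)
ok = run_iterations(recorder, capacity=10, alloc_sizes=[2, 2, 2, 2, 4])
```

Because the history is bounded, the snapshot stays small no matter how many iterations ran before the failure, and it still contains the events closest to the OOM.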

Test Plan:
Run this test with the last diff in the stack.
```
buck run @//mode/opt  kineto/libkineto/fb/mtia/integration_tests:mtia_memory_auto_trace_test
```

As shown, the memory_snapshot is generated when the OOM happens.
Log: P1794792326
Snapshot: https://fburl.com/pytorch_memory_visualizer/lx73y6s3 {F1977402355}

Differential Revision: D71993315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152160
Approved by: https://github.com/sraikund16
2025-04-27 15:56:44 +00:00
cpp Fix broken URLs (#152237) 2025-04-27 09:56:42 +00:00
source [Kineto] Enable OOM observer (#152160) 2025-04-27 15:56:44 +00:00
.gitignore
libtorch.rst Add ROCm documentation to libtorch (C++) reST. (#136378) 2024-09-25 02:30:56 +00:00
make.bat
Makefile Add scripts to generate plots of LRSchedulers (#149189) 2025-04-14 09:53:38 +00:00
README.md
requirements.txt Update docs dependencies for local build (#151796) 2025-04-24 18:40:42 +00:00

Please see the Writing documentation section of CONTRIBUTING.md for details on both writing and building the docs.