pytorch/torch/_C
Zizeng Meng 861945100e [Kineto] Enable OOM observer (#152160)
Summary:
# Context:
When a memory leak happens, it usually triggers the OOM in later iterations. A snapshot covering every iteration would be huge and hard to interpret.
On the CUDA side, an OOM observer generates a snapshot when the OOM happens, containing the latest 1,500,000 entries for debugging.

In this diff, we implement the same feature on the MTIA side.
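
The idea described above (keep only the most recent entries, and dump them at the moment of OOM) can be sketched in plain Python. This is an illustration only, not the Kineto, CUDA, or MTIA implementation: `BoundedAllocHistory`, `attach_oom_observer`, `notify_oom`, and `simulate` are hypothetical names, and the toy capacity check stands in for a real allocator.

```python
from collections import deque


class BoundedAllocHistory:
    """Hypothetical allocator trace; not a PyTorch or Kineto API."""

    def __init__(self, max_entries):
        # deque(maxlen=...) silently drops the oldest entry once full,
        # so only the latest `max_entries` events are retained.
        self._events = deque(maxlen=max_entries)
        self._observers = []

    def attach_oom_observer(self, callback):
        # The callback receives a list copy of the retained events.
        self._observers.append(callback)

    def record(self, action, size):
        self._events.append((action, size))

    def notify_oom(self):
        snapshot = list(self._events)
        for callback in self._observers:
            callback(snapshot)


def simulate(capacity, requests, max_entries):
    """Toy allocator: returns the snapshots produced by OOM events."""
    history = BoundedAllocHistory(max_entries)
    snapshots = []
    history.attach_oom_observer(snapshots.append)
    used = 0
    for size in requests:
        if used + size > capacity:
            # Record the failing request, then hand the bounded
            # history to every attached observer.
            history.record("oom", size)
            history.notify_oom()
            break
        used += size
        history.record("alloc", size)
    return snapshots
```

Because the history is bounded, the snapshot stays small no matter how many iterations ran before the failure, which is the property the OOM observer relies on.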

Test Plan:
Run this test with the last diff in the stack:
```
buck run @//mode/opt  kineto/libkineto/fb/mtia/integration_tests:mtia_memory_auto_trace_test
```

As shown, the memory_snapshot is generated when the OOM happens.
Log: P1794792326
Snapshot: https://fburl.com/pytorch_memory_visualizer/lx73y6s3 {F1977402355}

Differential Revision: D71993315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152160
Approved by: https://github.com/sraikund16
2025-04-27 15:56:44 +00:00
..
_dynamo [ca] introduce RuntimeState to support c++ hooks via graph breaks (#149987) 2025-03-27 05:05:34 +00:00
__init__.pyi.in [Kineto] Enable OOM observer (#152160) 2025-04-27 15:56:44 +00:00
_aoti.pyi [AOTI XPU] Support AOT Inductor for Intel GPU. (#140269) 2024-12-10 05:05:08 +00:00
_autograd.pyi Add overload names to profiler trace (#143114) 2025-03-05 01:00:29 +00:00
_cpu.pyi [CPUInductor] Fix SVE256 detection (#146207) 2025-02-01 18:51:34 +00:00
_cudnn.pyi Improve typing in torch/types.py (#145237) 2025-01-28 05:29:12 +00:00
_cusparselt.pyi
_distributed_autograd.pyi remove allow-untyped-defs for torch/_C/_distributed_autograd.pyi (#143369) 2024-12-17 18:09:28 +00:00
_distributed_c10d.pyi c10d/Store: add nonblocking mode to queue_pop (#151485) 2025-04-18 02:14:50 +00:00
_distributed_rpc_testing.pyi
_distributed_rpc.pyi
_export.pyi [export] Implement cpp deserializer. (#136398) 2024-11-14 16:34:59 +00:00
_functions.pyi PEP585 update - torch/_C torch/_decomp torch/_lazy torch/_library torch/_numpy torch/_prims torch/_refs torch/_strobelight (#145102) 2025-01-18 20:47:12 +00:00
_functorch.pyi [BE] Upgrade to mypy 1.14 (#145966) 2025-03-04 20:58:26 +00:00
_instruction_counter.pyi
_itt.pyi
_lazy_ts_backend.pyi
_lazy.pyi remove allow-untyped-defs for torch/_C/_lazy.pyi (#143370) 2024-12-17 17:18:10 +00:00
_monitor.pyi PEP585: More UP006 fixes (#146392) 2025-02-20 06:18:13 +00:00
_nn.pyi.in Use Python 3.9 typing (#148157) 2025-03-04 03:09:55 +00:00
_nvtx.pyi Inductor annotations (#130429) 2024-12-10 08:53:39 +00:00
_onnx.pyi
_profiler.pyi [Profiler] Add profiler activity for HPU devices (#148182) 2025-03-05 01:37:48 +00:00
_VariableFunctions.pyi.in Use Python 3.9 typing (#148157) 2025-03-04 03:09:55 +00:00
_verbose.pyi
build.bzl
return_types.pyi.in Use Python 3.9 typing (#148157) 2025-03-04 03:09:55 +00:00