pytorch/torch/_C
Zizeng Meng 861945100e [Kineto] Enable OOM observer (#152160)
Summary:
# Context:
When a memory leak happens, it usually triggers the OOM only in later iterations. A snapshot of the full run would be huge and hard to interpret.
On the CUDA side, there is an OOM observer that generates a snapshot when the OOM happens, containing the latest 1,500,000 entries for debugging.

In this diff, we implement the same feature on the MTIA side.
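
The idea described above (keep only the most recent allocation events and dump them when an OOM occurs) can be sketched in plain Python. This is a conceptual toy under stated assumptions, not the Kineto, CUDA, or MTIA implementation: `SimulatedAllocator`, `attach_oom_observer`, and `run_leaky_loop` are hypothetical names, and the capacity and history size are made-up small numbers standing in for the real limits.

```python
from collections import deque


class SimulatedAllocator:
    """Toy allocator (hypothetical) with a bounded event history and OOM observers."""

    def __init__(self, capacity_bytes, max_entries):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        # deque(maxlen=...) drops the oldest event once the buffer is full,
        # so only the most recent `max_entries` events are retained.
        self.history = deque(maxlen=max_entries)
        self.oom_observers = []

    def attach_oom_observer(self, callback):
        # Each callback receives a list copy of the retained history on OOM.
        self.oom_observers.append(callback)

    def alloc(self, nbytes):
        if self.used_bytes + nbytes > self.capacity_bytes:
            self.history.append(("oom", nbytes))
            snapshot = list(self.history)
            for callback in self.oom_observers:
                callback(snapshot)
            raise MemoryError(f"cannot allocate {nbytes} bytes")
        self.used_bytes += nbytes
        self.history.append(("alloc", nbytes))

    def free(self, nbytes):
        self.used_bytes -= nbytes
        self.history.append(("free", nbytes))


def run_leaky_loop(iterations=100):
    """Leak 10 bytes per iteration until the toy allocator runs out of memory."""
    allocator = SimulatedAllocator(capacity_bytes=1000, max_entries=5)
    snapshots = []
    allocator.attach_oom_observer(snapshots.append)
    try:
        for _ in range(iterations):
            allocator.alloc(30)  # working buffer for this iteration
            allocator.free(20)   # only part of it is released: the leak
    except MemoryError:
        pass
    return snapshots


if __name__ == "__main__":
    for snapshot in run_leaky_loop():
        print(snapshot)
```

Because the history is bounded, the snapshot handed to the observer covers only the events leading up to the failure, not every iteration since the start of the run.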

Test Plan:
Run this test with the last diff in the stack applied.
```
buck run @//mode/opt  kineto/libkineto/fb/mtia/integration_tests:mtia_memory_auto_trace_test
```

As shown, the memory_snapshot is generated when the OOM happens.
Log: P1794792326
Snapshot: https://fburl.com/pytorch_memory_visualizer/lx73y6s3 {F1977402355}

Differential Revision: D71993315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152160
Approved by: https://github.com/sraikund16
2025-04-27 15:56:44 +00:00
..
_dynamo [ca] introduce RuntimeState to support c++ hooks via graph breaks (#149987) 2025-03-27 05:05:34 +00:00
__init__.pyi.in [Kineto] Enable OOM observer (#152160) 2025-04-27 15:56:44 +00:00
_aoti.pyi [AOTI XPU] Support AOT Inductor for Intel GPU. (#140269) 2024-12-10 05:05:08 +00:00
_autograd.pyi Add overload names to profiler trace (#143114) 2025-03-05 01:00:29 +00:00
_cpu.pyi [CPUInductor] Fix SVE256 detection (#146207) 2025-02-01 18:51:34 +00:00
_cudnn.pyi Improve typing in torch/types.py (#145237) 2025-01-28 05:29:12 +00:00
_cusparselt.pyi [sparse] Add cuSPARSELt as a backend (#128534) 2024-08-21 22:06:07 +00:00
_distributed_autograd.pyi remove allow-untyped-defs for torch/_C/_distributed_autograd.pyi (#143369) 2024-12-17 18:09:28 +00:00
_distributed_c10d.pyi c10d/Store: add nonblocking mode to queue_pop (#151485) 2025-04-18 02:14:50 +00:00
_distributed_rpc_testing.pyi Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in .pyi stub files (#129419) 2024-06-29 09:23:39 +00:00
_distributed_rpc.pyi Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in .pyi stub files (#129419) 2024-06-29 09:23:39 +00:00
_export.pyi [export] Implement cpp deserializer. (#136398) 2024-11-14 16:34:59 +00:00
_functions.pyi PEP585 update - torch/_C torch/_decomp torch/_lazy torch/_library torch/_numpy torch/_prims torch/_refs torch/_strobelight (#145102) 2025-01-18 20:47:12 +00:00
_functorch.pyi [BE] Upgrade to mypy 1.14 (#145966) 2025-03-04 20:58:26 +00:00
_instruction_counter.pyi Add compile time instruction count metric (#133834) 2024-08-27 23:29:02 +00:00
_itt.pyi
_lazy_ts_backend.pyi Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in .pyi stub files (#129419) 2024-06-29 09:23:39 +00:00
_lazy.pyi remove allow-untyped-defs for torch/_C/_lazy.pyi (#143370) 2024-12-17 17:18:10 +00:00
_monitor.pyi PEP585: More UP006 fixes (#146392) 2025-02-20 06:18:13 +00:00
_nn.pyi.in Use Python 3.9 typing (#148157) 2025-03-04 03:09:55 +00:00
_nvtx.pyi Inductor annotations (#130429) 2024-12-10 08:53:39 +00:00
_onnx.pyi [1/N] [Caffe2] Remove caffe2_aten_fallback code (#128675) 2024-06-17 21:25:59 +00:00
_profiler.pyi [Profiler] Add profiler activity for HPU devices (#148182) 2025-03-05 01:37:48 +00:00
_VariableFunctions.pyi.in Use Python 3.9 typing (#148157) 2025-03-04 03:09:55 +00:00
_verbose.pyi
build.bzl
return_types.pyi.in Use Python 3.9 typing (#148157) 2025-03-04 03:09:55 +00:00