pytorch/docs/source/mtia.rst
Zizeng Meng 861945100e [Kineto] Enable OOM observer (#152160)
Summary:
# Context:
When a memory leak happens, it usually triggers an OOM only in a later iteration, and a snapshot of the full iteration would be huge and hard to interpret.
On the CUDA side, an OOM observer is provided that, when an OOM happens, generates a snapshot containing the latest 1,500,000 entries for debugging.

In this diff, we implement the same feature on the MTIA side.
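The mechanism can be sketched generically: keep a bounded ring buffer of allocator events and, when an OOM is hit, hand the most recent entries to any registered observers. The following is an illustrative pure-Python model of that pattern, not the actual Kineto/MTIA implementation; the `MemoryTracer` class and event tuples are hypothetical.

```python
from collections import deque

class MemoryTracer:
    """Illustrative model of an OOM observer: a bounded event history
    plus callbacks fired when an out-of-memory condition occurs."""

    def __init__(self, max_entries=1_500_000):
        # Only the latest max_entries allocator events are retained,
        # mirroring the 1,500,000-entry cap described above.
        self.entries = deque(maxlen=max_entries)
        self.observers = []

    def record(self, event):
        self.entries.append(event)

    def attach_out_of_memory_observer(self, callback):
        self.observers.append(callback)

    def on_oom(self):
        # Snapshot only the retained tail of the history and notify
        # every observer with it.
        snapshot = list(self.entries)
        for cb in self.observers:
            cb(snapshot)

# Demo: with a 3-entry cap, only the newest 3 of 5 events survive.
tracer = MemoryTracer(max_entries=3)
seen = []
tracer.attach_out_of_memory_observer(lambda snap: seen.append(snap))
for i in range(5):
    tracer.record(("alloc", i))
tracer.on_oom()
# seen[0] == [("alloc", 2), ("alloc", 3), ("alloc", 4)]
```

The bounded `deque(maxlen=...)` is what keeps the eventual snapshot small enough to interpret even after many iterations.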

Test Plan:
Run this test with last diff in the stack.
```
buck run @//mode/opt  kineto/libkineto/fb/mtia/integration_tests:mtia_memory_auto_trace_test
```

As shown, the memory snapshot is generated when the OOM happens.
Log: P1794792326
Snapshot: https://fburl.com/pytorch_memory_visualizer/lx73y6s3 {F1977402355}

Differential Revision: D71993315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152160
Approved by: https://github.com/sraikund16
2025-04-27 15:56:44 +00:00

torch.mtia
===================================

The MTIA backend is implemented out of tree; only the interfaces are defined here.

.. automodule:: torch.mtia
.. currentmodule:: torch.mtia
.. autosummary::
    :toctree: generated
    :nosignatures:

    StreamContext
    current_device
    current_stream
    default_stream
    device_count
    init
    is_available
    is_initialized
    memory_stats
    get_device_capability
    empty_cache
    record_memory_history
    snapshot
    attach_out_of_memory_observer
    set_device
    set_stream
    stream
    synchronize
    device
    set_rng_state
    get_rng_state
    DeferredMtiaCallError

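As a minimal sketch of how the memory-debugging entries above might be wired together, assuming they behave like their `torch.cuda` counterparts: the helper name `install_mtia_oom_debugging`, the observer callback signature, and the pickled-snapshot output are all assumptions, not documented API.

```python
def install_mtia_oom_debugging():
    """Hypothetical convenience helper (not part of torch.mtia): enable
    memory-history recording and attach an OOM observer. Returns None if
    torch is not installed, False if no MTIA device is available, and
    True once the observer is attached."""
    try:
        import torch
    except ImportError:
        return None

    mtia = getattr(torch, "mtia", None)
    if mtia is None or not mtia.is_available():
        return False

    # Start recording allocator events so a later snapshot has history.
    mtia.record_memory_history()

    def on_oom(*args, **kwargs):
        # Callback signature is an assumption; on OOM, persist whatever
        # history was retained for offline inspection.
        import pickle
        with open("mtia_oom_snapshot.pickle", "wb") as f:
            pickle.dump(mtia.snapshot(), f)

    mtia.attach_out_of_memory_observer(on_oom)
    return True
```

On a machine without an MTIA device the helper is a no-op (it returns False before touching any allocator state).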

Streams and events
------------------

.. autosummary::
    :toctree: generated
    :nosignatures:

    Event
    Stream