pytorch/docs/source
Zizeng Meng 861945100e [Kineto] Enable OOM observer (#152160)
Summary:
# Context:
When a memory leak happens, it usually triggers the OOM in later iterations. A snapshot covering every iteration would be huge and hard to interpret.
On the CUDA side, there is an OOM observer that generates a snapshot when an OOM happens, containing the latest 1,500,000 entries, for debugging.

In this diff, we want to implement the same feature on the MTIA side.

Test Plan:
Run this test with last diff in the stack.
```
buck run @//mode/opt  kineto/libkineto/fb/mtia/integration_tests:mtia_memory_auto_trace_test
```

As shown, the memory_snapshot is generated when an OOM happens.
Log: P1794792326
Snapshot: https://fburl.com/pytorch_memory_visualizer/lx73y6s3 {F1977402355}

Differential Revision: D71993315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152160
Approved by: https://github.com/sraikund16
2025-04-27 15:56:44 +00:00
_static Migrate to new theme (#149331) 2025-04-16 21:35:19 +00:00
_templates Migrate to new theme (#149331) 2025-04-16 21:35:19 +00:00
community Fix broken URLs (#152237) 2025-04-27 09:56:42 +00:00
elastic DOC: add docstring to construct_and_record_rdzv_event() (#128189) 2024-06-10 22:17:33 +00:00
notes Fix broken URLs (#152237) 2025-04-27 09:56:42 +00:00
rpc Fix broken URLs (#152237) 2025-04-27 09:56:42 +00:00
scripts Add scripts to generate plots of LRSchedulers (#149189) 2025-04-14 09:53:38 +00:00
accelerator.rst Add torch.accelerator.device_index as accelerator's device switch context (#148864) 2025-04-25 09:45:25 +00:00
amp.rst Update document for autocast on CPU (#135299) 2024-09-13 09:11:47 +00:00
autograd.rst Add torch.library.register_autograd (#124071) 2024-04-18 12:47:59 +00:00
backends.rst Revert "Reverting the PR adding Kleidiai-based int4 kernels (#145392)" (#145505) 2025-01-23 18:50:59 +00:00
benchmark_utils.rst Adding Compare in torch.utils.benchmark documentation (#125009) 2024-05-03 00:50:54 +00:00
bottleneck.rst
checkpoint.rst [checkpoint] Clean up selective activation checkpoint and make public (#125795) 2024-06-18 18:18:50 +00:00
complex_numbers.rst Document complex optimizer semantic behavior (#121667) 2024-03-16 00:43:47 +00:00
cond.rst [Doc] fix some typos (found by codespell and typos) (#132544) 2024-08-05 17:21:56 +00:00
conf.py Fix broken URLs (#152237) 2025-04-27 09:56:42 +00:00
config_mod.rst
cpp_extension.rst xpu: support sycl with torch.utils.cpp_extension APIs (#132945) 2025-02-16 16:50:59 +00:00
cpp_index.rst Migrate to new theme (#149331) 2025-04-16 21:35:19 +00:00
cpu.rst
cuda_environment_variables.rst Add doc page for environment variables that effect PyTorch Runtime (#119087) 2024-02-15 21:41:38 +00:00
cuda._sanitizer.rst
cuda.rst Add Missing Communication collectives (#147379) 2025-03-19 06:59:04 +00:00
cuda.tunable.rst [Docs][TunableOp] TunableOp documentation update (#148384) 2025-03-07 21:02:49 +00:00
cudnn_persistent_rnn.rst
cudnn_rnn_determinism.rst Fix broken URLs (#152237) 2025-04-27 09:56:42 +00:00
data.rst
ddp_comm_hooks.rst
debugging_environment_variables.rst Add doc page for environment variables that effect PyTorch Runtime (#119087) 2024-02-15 21:41:38 +00:00
deploy.rst Migrate to new theme (#149331) 2025-04-16 21:35:19 +00:00
deterministic.rst Add torch.utils.deterministic.fill_uninitialized_memory flag (#111377) 2023-11-01 16:10:09 +00:00
distributed.algorithms.join.rst
distributed.checkpoint.rst Supporting non-tensor-data write_size in planner write items. (#149699) 2025-03-21 18:09:14 +00:00
distributed.elastic.rst Reapply "distributed debug handlers (#126601)" (#127805) 2024-06-04 19:44:30 +00:00
distributed.fsdp.fully_shard.rst [FSDP2] Move to public torch.distributed.fsdp (#141868) 2024-12-07 01:24:28 +00:00
distributed.optim.rst
distributed.pipelining.rst [pipelining] Update tutorials and documentation (#143045) 2024-12-12 18:42:17 +00:00
distributed.rst Fix broken URLs (#152237) 2025-04-27 09:56:42 +00:00
distributed.tensor.parallel.rst [dtensor][tp] add a ParallelStyle PrepareModuleInputOutput (#150372) 2025-04-01 19:15:43 +00:00
distributed.tensor.rst [dtensor] expose the __create_chunk_list__ in the doc (#144100) 2025-01-03 20:06:23 +00:00
distributions.rst add generalized pareto distribution (GPD) (#135968) 2025-04-17 18:51:02 +00:00
dlpack.rst
docutils.conf
export.ir_spec.rst [export] Update docs (#142011) 2024-12-05 03:44:46 +00:00
export.programming_model.rst fix formatting in programming model doc (#143587) 2024-12-20 07:09:19 +00:00
export.rst infer dynamic shapes through additional inputs (#150144) 2025-04-01 21:13:39 +00:00
fft.rst
fsdp.rst
func.api.rst Add torch.func.debug_unwrap (#146528) 2025-02-06 18:48:09 +00:00
func.batch_norm.rst
func.migrating.rst
func.rst
func.ux_limitations.rst
func.whirlwind_tour.rst
future_mod.rst Add swap_tensors path to nn.Module._apply (#117167) 2024-02-07 18:55:44 +00:00
futures.rst
fx.experimental.rst [dynamic shapes] user-code friendly statically_known_true, has_static_value (#151601) 2025-04-24 02:53:59 +00:00
fx.rst Fix the invalid link for FX (#149289) 2025-03-19 14:03:18 +00:00
hub.rst
index.md Migrate to new theme (#149331) 2025-04-16 21:35:19 +00:00
jit_builtin_functions.rst
jit_language_reference_v2.rst [Doc] fix some typos (found by codespell and typos) (#132544) 2024-08-05 17:21:56 +00:00
jit_language_reference.rst [Doc] fix some typos (found by codespell and typos) (#132544) 2024-08-05 17:21:56 +00:00
jit_python_reference.rst
jit_unsupported.rst Add support for torch.Generator type in TorchScript (#110413) 2023-11-21 23:07:21 +00:00
jit_utils.rst
jit.rst
library.rst [Custom Ops] Add a new API to allow users to register an autocast for the custom op (#145588) 2025-01-27 19:22:43 +00:00
linalg.rst
logging.rst Change classification to beta for TORCH_LOGS (#118682) 2024-01-31 21:50:55 +00:00
masked.rst Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262) 2024-09-06 19:06:23 +00:00
math-quantizer-equation.png
meta.rst Fix typos in meta.rst (#151979) 2025-04-24 01:25:09 +00:00
miscellaneous_environment_variables.rst Add environment variable to force no weights_only load (#138225) 2024-10-21 23:26:15 +00:00
mobile_optimizer.rst Add ExecuTorch warning to mobile_optimizer (#134697) 2024-09-04 17:47:14 +00:00
model_zoo.rst
module_tracker.rst Add module tracker (#125352) 2024-05-04 18:33:35 +00:00
monitor.rst
mps_environment_variables.rst [MPS] Add mps profiler env vars to docs (#129552) 2024-07-04 06:44:48 +00:00
mps.rst [MPS] Make torch.mps.compile_shader public (#148972) 2025-03-11 20:20:58 +00:00
mtia.memory.rst Revert "[MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#143347)" 2024-12-21 04:04:16 +00:00
mtia.rst [Kineto] Enable OOM observer (#152160) 2025-04-27 15:56:44 +00:00
multiprocessing.rst
name_inference.rst
named_tensor.rst
nested.rst Update OSS nested tensor docs to focus on NJT (#145402) 2025-01-25 04:08:19 +00:00
nn.attention.bias.rst Remove sdp_kernel and replace with sdpa_kernel in attention namespace (#114689) 2024-01-24 22:28:04 +00:00
nn.attention.experimental.rst [Flex Attention] Paged Attention (#137164) 2024-10-29 17:05:22 +00:00
nn.attention.flex_attention.rst FlexAttention support for NJT (#136792) 2024-10-28 20:01:27 +00:00
nn.attention.rst [Flex Attention] Paged Attention (#137164) 2024-10-29 17:05:22 +00:00
nn.functional.rst Add RMSNorm module (#121364) 2024-03-29 18:05:28 +00:00
nn.init.rst
nn.rst Add APIs to separate norm calculation and gradient scaling in nn.utils.clip_grad_norm_ (#139662) 2024-11-07 23:13:23 +00:00
notes.md Migrate to new theme (#149331) 2025-04-16 21:35:19 +00:00
onnx_dynamo_memory_usage.rst Update TorchDynamo-based ONNX Exporter memory usage example code. (#144139) 2025-01-03 20:41:36 +00:00
onnx_dynamo_onnxruntime_backend.rst
onnx_dynamo.rst Fix broken URLs (#152237) 2025-04-27 09:56:42 +00:00
onnx_ops.rst [ONNX] Create onnx_symbolic (#148905) 2025-03-18 21:32:06 +00:00
onnx_torchscript_supported_aten_ops.rst
onnx_torchscript.rst Fix broken URLs (#152237) 2025-04-27 09:56:42 +00:00
onnx_verification.rst [ONNX] Expose verification utilities (#148603) 2025-03-18 02:10:34 +00:00
onnx.rst [ONNX] Create onnx_symbolic (#148905) 2025-03-18 21:32:06 +00:00
optim.rst Ensure SWA boundary conditions w.r.t. definition (#133773) 2024-10-31 18:24:08 +00:00
package.rst
profiler.rst
pytorch-api.md Migrate to new theme (#149331) 2025-04-16 21:35:19 +00:00
quantization-accuracy-debugging.rst
quantization-backend-configuration.rst
quantization-support.rst [Quant][PT2E] add a lowering pass for x86 backend (#149708) 2025-04-01 17:32:41 +00:00
quantization.rst [Quant][PT2E] add a lowering pass for x86 backend (#149708) 2025-04-01 17:32:41 +00:00
random.rst
rpc.rst
signal.rst
size.rst Added a docstring for torch.Size.numel. (#124186) 2024-04-19 09:23:02 +00:00
sparse.rst SparseCsrCUDA: cuDSS backend for linalg.solve (#129856) 2024-08-22 07:57:30 +00:00
special.rst
storage.rst Super tiny fix typo (#151212) 2025-04-14 16:47:40 +00:00
tensor_attributes.rst [docs] Add 32-bit complex to the list of dtypes (#144590) 2025-04-09 13:10:21 +00:00
tensor_view.rst [docs] fix numpy docs reference (#147697) 2025-02-26 01:30:03 +00:00
tensorboard.rst
tensors.rst add xpu to torch.tensors (#127280) 2024-06-11 18:13:01 +00:00
testing.rst
threading_environment_variables.rst Add doc page for environment variables that effect PyTorch Runtime (#119087) 2024-02-15 21:41:38 +00:00
torch_cuda_memory.rst Document non-pytorch CUDA memory allocation and how to query it (#150880) 2025-04-18 03:48:54 +00:00
torch_environment_variables.rst [Docs][MPS] Add mps environment variable table (#129008) 2024-06-20 03:30:35 +00:00
torch_nccl_environment_variables.rst [c10d][doc] Add docs for ENV variables TORCH_NCCL_ASYNC_ERROR_HANDLING TORCH_NCCL_TRACE_CPP_STACK and TORCH_NCCL_COORD_CHECK_MILSEC (#132920) 2024-08-09 21:08:20 +00:00
torch.ao.ns._numeric_suite_fx.rst
torch.ao.ns._numeric_suite.rst
torch.compiler_aot_inductor_minifier.rst Aoti minifier flatten (#141156) 2024-12-06 07:12:45 +00:00
torch.compiler_aot_inductor.rst update aotinductor doc for XPU support (#149299) 2025-03-21 04:40:31 +00:00
torch.compiler_api.rst [export] add is_exporting flag (#142425) 2024-12-18 21:36:28 +00:00
torch.compiler_best_practices_for_backends.rst
torch.compiler_cudagraph_trees.rst Revert "Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979)" 2025-02-13 18:04:26 +00:00
torch.compiler_custom_backends.rst [pt2, docs] Add new PT2 troubleshooting doc (#138620) 2024-11-09 01:17:39 +00:00
torch.compiler_dynamic_shapes.rst feat: Add min, max ranges to mark_dynamic API (#119737) 2024-03-07 23:26:03 +00:00
torch.compiler_dynamo_deepdive.rst fix typo in torch.compiler_dynamo_deepdive.rst (#140871) 2024-11-19 14:42:36 +00:00
torch.compiler_dynamo_overview.rst Rename TorchDynamo -> Dyanamo in the dynamo tutorial doc (#123431) 2024-05-07 05:07:00 +00:00
torch.compiler_fake_tensor.rst [doc] improve code in fake tensor doc (#140329) 2024-11-13 05:14:56 +00:00
torch.compiler_faq.rst Rename cache limit to recompile limit in configs (#143709) 2024-12-22 10:03:57 +00:00
torch.compiler_fine_grain_apis.rst [export] add is_exporting flag (#142425) 2024-12-18 21:36:28 +00:00
torch.compiler_get_started.rst [Inductor] Update AttrsDescriptor instantiation for Triton changes (#137458) 2024-10-14 20:20:29 +00:00
torch.compiler_inductor_profiling.rst
torch.compiler_ir.rst
torch.compiler_nn_module.rst
torch.compiler_performance_dashboard.rst
torch.compiler_profiling_torch_compile.rst Update Doc for Intel XPU Profiling (#134515) 2025-03-27 09:15:35 +00:00
torch.compiler_transformations.rst
torch.compiler_troubleshooting_old.rst Fix broken URLs (#152237) 2025-04-27 09:56:42 +00:00
torch.compiler_troubleshooting.rst Rename cache limit to recompile limit in configs (#143709) 2024-12-22 10:03:57 +00:00
torch.compiler.config.rst Profile guided optimization for automatic_dynamic (#139001) 2024-11-03 06:29:57 +00:00
torch.compiler.rst Profile guided optimization for automatic_dynamic (#139001) 2024-11-03 06:29:57 +00:00
torch.overrides.rst
torch.rst Move get accelerator to use build time flags when possible (#146098) 2025-03-10 13:17:58 +00:00
type_info.rst Fix broken URLs (#152237) 2025-04-27 09:56:42 +00:00
utils.rst New swap function (#111747) 2023-12-08 18:49:35 +00:00
xpu.rst Add get_stream_from_external API for XPU backend (#141123) 2024-12-31 11:15:52 +00:00