pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

History

Zizeng Meng 861945100e [Kineto] Enable OOM observer (#152160 ) Summary: # Context: When memory leak happens, it usually trigger the OOM in the later iterations. The snapshot of full iteration will be huge and hard to interpret. On CUDA side, they provide OOM observer which generates snapshot when OOM happens with latest 1,500,000 entries for debugging. In this diff, we want to implement the feature on MTIA side Test Plan: Run this test with last diff in the stack. ``` buck run @//mode/opt kineto/libkineto/fb/mtia/integration_tests:mtia_memory_auto_trace_test ``` As shown, the memory_snapshot is generated when oom happens Log: P1794792326 Snapshot: https://fburl.com/pytorch_memory_visualizer/lx73y6s3 {F1977402355} Differential Revision: D71993315 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152160 Approved by: https://github.com/sraikund16		2025-04-27 15:56:44 +00:00
..
_static	Migrate to new theme (#149331 )	2025-04-16 21:35:19 +00:00
_templates	Migrate to new theme (#149331 )	2025-04-16 21:35:19 +00:00
community	Fix broken URLs (#152237 )	2025-04-27 09:56:42 +00:00
elastic	DOC: add docstring to construct_and_record_rdzv_event() (#128189 )	2024-06-10 22:17:33 +00:00
notes	Fix broken URLs (#152237 )	2025-04-27 09:56:42 +00:00
rpc	Fix broken URLs (#152237 )	2025-04-27 09:56:42 +00:00
scripts	Add scripts to generate plots of LRSchedulers (#149189 )	2025-04-14 09:53:38 +00:00
accelerator.rst	Add torch.accelerator.device_index as accelerator's device switch context (#148864 )	2025-04-25 09:45:25 +00:00
amp.rst	Update document for autocast on CPU (#135299 )	2024-09-13 09:11:47 +00:00
autograd.rst	Add torch.library.register_autograd (#124071 )	2024-04-18 12:47:59 +00:00
backends.rst	Revert "Reverting the PR adding Kleidiai-based int4 kernels (#145392 )" (#145505 )	2025-01-23 18:50:59 +00:00
benchmark_utils.rst	Adding Compare in torch.utils.benchmark documentation (#125009 )	2024-05-03 00:50:54 +00:00
bottleneck.rst
checkpoint.rst	[checkpoint] Clean up selective activation checkpoint and make public (#125795 )	2024-06-18 18:18:50 +00:00
complex_numbers.rst	Document complex optimizer semantic behavior (#121667 )	2024-03-16 00:43:47 +00:00
cond.rst	[Doc] fix some typos (found by codespell and typos) (#132544 )	2024-08-05 17:21:56 +00:00
conf.py	Fix broken URLs (#152237 )	2025-04-27 09:56:42 +00:00
config_mod.rst
cpp_extension.rst	xpu: support sycl with torch.utils.cpp_extension APIs (#132945 )	2025-02-16 16:50:59 +00:00
cpp_index.rst	Migrate to new theme (#149331 )	2025-04-16 21:35:19 +00:00
cpu.rst
cuda_environment_variables.rst	Add doc page for environment variables that effect PyTorch Runtime (#119087 )	2024-02-15 21:41:38 +00:00
cuda._sanitizer.rst
cuda.rst	Add Missing Communication collectives (#147379 )	2025-03-19 06:59:04 +00:00
cuda.tunable.rst	[Docs][TunableOp] TunableOp documentation update (#148384 )	2025-03-07 21:02:49 +00:00
cudnn_persistent_rnn.rst
cudnn_rnn_determinism.rst	Fix broken URLs (#152237 )	2025-04-27 09:56:42 +00:00
data.rst
ddp_comm_hooks.rst
debugging_environment_variables.rst	Add doc page for environment variables that effect PyTorch Runtime (#119087 )	2024-02-15 21:41:38 +00:00
deploy.rst	Migrate to new theme (#149331 )	2025-04-16 21:35:19 +00:00
deterministic.rst	Add `torch.utils.deterministic.fill_uninitialized_memory` flag (#111377 )	2023-11-01 16:10:09 +00:00
distributed.algorithms.join.rst
distributed.checkpoint.rst	Supporting non-tensor-data write_size in planner write items. (#149699 )	2025-03-21 18:09:14 +00:00
distributed.elastic.rst	Reapply "distributed debug handlers (#126601 )" (#127805 )	2024-06-04 19:44:30 +00:00
distributed.fsdp.fully_shard.rst	[FSDP2] Move to public `torch.distributed.fsdp` (#141868 )	2024-12-07 01:24:28 +00:00
distributed.optim.rst
distributed.pipelining.rst	[pipelining] Update tutorials and documentation (#143045 )	2024-12-12 18:42:17 +00:00
distributed.rst	Fix broken URLs (#152237 )	2025-04-27 09:56:42 +00:00
distributed.tensor.parallel.rst	[dtensor][tp] add a ParallelStyle PrepareModuleInputOutput (#150372 )	2025-04-01 19:15:43 +00:00
distributed.tensor.rst	[dtensor] expose the __create_chunk_list__ in the doc (#144100 )	2025-01-03 20:06:23 +00:00
distributions.rst	add generalized pareto distribution (GPD) (#135968 )	2025-04-17 18:51:02 +00:00
dlpack.rst
docutils.conf
export.ir_spec.rst	[export] Update docs (#142011 )	2024-12-05 03:44:46 +00:00
export.programming_model.rst	fix formatting in programming model doc (#143587 )	2024-12-20 07:09:19 +00:00
export.rst	infer dynamic shapes through additional inputs (#150144 )	2025-04-01 21:13:39 +00:00
fft.rst
fsdp.rst
func.api.rst	Add torch.func.debug_unwrap (#146528 )	2025-02-06 18:48:09 +00:00
func.batch_norm.rst
func.migrating.rst
func.rst
func.ux_limitations.rst
func.whirlwind_tour.rst
future_mod.rst	Add swap_tensors path to nn.Module._apply (#117167 )	2024-02-07 18:55:44 +00:00
futures.rst
fx.experimental.rst	[dynamic shapes] user-code friendly statically_known_true, has_static_value (#151601 )	2025-04-24 02:53:59 +00:00
fx.rst	Fix the invalid link for FX (#149289 )	2025-03-19 14:03:18 +00:00
hub.rst
index.md	Migrate to new theme (#149331 )	2025-04-16 21:35:19 +00:00
jit_builtin_functions.rst
jit_language_reference_v2.rst	[Doc] fix some typos (found by codespell and typos) (#132544 )	2024-08-05 17:21:56 +00:00
jit_language_reference.rst	[Doc] fix some typos (found by codespell and typos) (#132544 )	2024-08-05 17:21:56 +00:00
jit_python_reference.rst
jit_unsupported.rst	Add support for `torch.Generator` type in TorchScript (#110413 )	2023-11-21 23:07:21 +00:00
jit_utils.rst
jit.rst
library.rst	[Custom Ops] Add a new API to allow users to register an autocast for the custom op (#145588 )	2025-01-27 19:22:43 +00:00
linalg.rst
logging.rst	Change classification to beta for TORCH_LOGS (#118682 )	2024-01-31 21:50:55 +00:00
masked.rst	Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262 )	2024-09-06 19:06:23 +00:00
math-quantizer-equation.png
meta.rst	Fix typos in meta.rst (#151979 )	2025-04-24 01:25:09 +00:00
miscellaneous_environment_variables.rst	Add environment variable to force no weights_only load (#138225 )	2024-10-21 23:26:15 +00:00
mobile_optimizer.rst	Add ExecuTorch warning to mobile_optimizer (#134697 )	2024-09-04 17:47:14 +00:00
model_zoo.rst
module_tracker.rst	Add module tracker (#125352 )	2024-05-04 18:33:35 +00:00
monitor.rst
mps_environment_variables.rst	[MPS] Add mps profiler env vars to docs (#129552 )	2024-07-04 06:44:48 +00:00
mps.rst	[MPS] Make `torch.mps.compile_shader` public (#148972 )	2025-03-11 20:20:58 +00:00
mtia.memory.rst	Revert "[MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#143347 )"	2024-12-21 04:04:16 +00:00
mtia.rst	[Kineto] Enable OOM observer (#152160 )	2025-04-27 15:56:44 +00:00
multiprocessing.rst
name_inference.rst
named_tensor.rst
nested.rst	Update OSS nested tensor docs to focus on NJT (#145402 )	2025-01-25 04:08:19 +00:00
nn.attention.bias.rst	Remove sdp_kernel and replace with sdpa_kernel in attention namespace (#114689 )	2024-01-24 22:28:04 +00:00
nn.attention.experimental.rst	[Flex Attention] Paged Attention (#137164 )	2024-10-29 17:05:22 +00:00
nn.attention.flex_attention.rst	FlexAttention support for NJT (#136792 )	2024-10-28 20:01:27 +00:00
nn.attention.rst	[Flex Attention] Paged Attention (#137164 )	2024-10-29 17:05:22 +00:00
nn.functional.rst	Add RMSNorm module (#121364 )	2024-03-29 18:05:28 +00:00
nn.init.rst
nn.rst	Add APIs to separate norm calculation and gradient scaling in `nn.utils.clip_grad_norm_` (#139662 )	2024-11-07 23:13:23 +00:00
notes.md	Migrate to new theme (#149331 )	2025-04-16 21:35:19 +00:00
onnx_dynamo_memory_usage.rst	Update TorchDynamo-based ONNX Exporter memory usage example code. (#144139 )	2025-01-03 20:41:36 +00:00
onnx_dynamo_onnxruntime_backend.rst
onnx_dynamo.rst	Fix broken URLs (#152237 )	2025-04-27 09:56:42 +00:00
onnx_ops.rst	[ONNX] Create onnx_symbolic (#148905 )	2025-03-18 21:32:06 +00:00
onnx_torchscript_supported_aten_ops.rst
onnx_torchscript.rst	Fix broken URLs (#152237 )	2025-04-27 09:56:42 +00:00
onnx_verification.rst	[ONNX] Expose verification utilities (#148603 )	2025-03-18 02:10:34 +00:00
onnx.rst	[ONNX] Create onnx_symbolic (#148905 )	2025-03-18 21:32:06 +00:00
optim.rst	Ensure SWA boundary conditions w.r.t. definition (#133773 )	2024-10-31 18:24:08 +00:00
package.rst
profiler.rst
pytorch-api.md	Migrate to new theme (#149331 )	2025-04-16 21:35:19 +00:00
quantization-accuracy-debugging.rst
quantization-backend-configuration.rst
quantization-support.rst	[Quant][PT2E] add a lowering pass for x86 backend (#149708 )	2025-04-01 17:32:41 +00:00
quantization.rst	[Quant][PT2E] add a lowering pass for x86 backend (#149708 )	2025-04-01 17:32:41 +00:00
random.rst
rpc.rst
signal.rst
size.rst	Added a docstring for torch.Size.numel. (#124186 )	2024-04-19 09:23:02 +00:00
sparse.rst	SparseCsrCUDA: cuDSS backend for linalg.solve (#129856 )	2024-08-22 07:57:30 +00:00
special.rst
storage.rst	Super tiny fix typo (#151212 )	2025-04-14 16:47:40 +00:00
tensor_attributes.rst	[docs] Add 32-bit complex to the list of dtypes (#144590 )	2025-04-09 13:10:21 +00:00
tensor_view.rst	[docs] fix numpy docs reference (#147697 )	2025-02-26 01:30:03 +00:00
tensorboard.rst
tensors.rst	add xpu to torch.tensors (#127280 )	2024-06-11 18:13:01 +00:00
testing.rst
threading_environment_variables.rst	Add doc page for environment variables that effect PyTorch Runtime (#119087 )	2024-02-15 21:41:38 +00:00
torch_cuda_memory.rst	Document non-pytorch CUDA memory allocation and how to query it (#150880 )	2025-04-18 03:48:54 +00:00
torch_environment_variables.rst	[Docs][MPS] Add mps environment variable table (#129008 )	2024-06-20 03:30:35 +00:00
torch_nccl_environment_variables.rst	[c10d][doc] Add docs for ENV variables TORCH_NCCL_ASYNC_ERROR_HANDLING TORCH_NCCL_TRACE_CPP_STACK and TORCH_NCCL_COORD_CHECK_MILSEC (#132920 )	2024-08-09 21:08:20 +00:00
torch.ao.ns._numeric_suite_fx.rst
torch.ao.ns._numeric_suite.rst
torch.compiler_aot_inductor_minifier.rst	Aoti minifier flatten (#141156 )	2024-12-06 07:12:45 +00:00
torch.compiler_aot_inductor.rst	update aotinductor doc for XPU support (#149299 )	2025-03-21 04:40:31 +00:00
torch.compiler_api.rst	[export] add is_exporting flag (#142425 )	2024-12-18 21:36:28 +00:00
torch.compiler_best_practices_for_backends.rst
torch.compiler_cudagraph_trees.rst	Revert "Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979 )"	2025-02-13 18:04:26 +00:00
torch.compiler_custom_backends.rst	[pt2, docs] Add new PT2 troubleshooting doc (#138620 )	2024-11-09 01:17:39 +00:00
torch.compiler_dynamic_shapes.rst	feat: Add min, max ranges to mark_dynamic API (#119737 )	2024-03-07 23:26:03 +00:00
torch.compiler_dynamo_deepdive.rst	fix typo in `torch.compiler_dynamo_deepdive.rst` (#140871 )	2024-11-19 14:42:36 +00:00
torch.compiler_dynamo_overview.rst	Rename TorchDynamo -> Dyanamo in the dynamo tutorial doc (#123431 )	2024-05-07 05:07:00 +00:00
torch.compiler_fake_tensor.rst	[doc] improve code in fake tensor doc (#140329 )	2024-11-13 05:14:56 +00:00
torch.compiler_faq.rst	Rename cache limit to recompile limit in configs (#143709 )	2024-12-22 10:03:57 +00:00
torch.compiler_fine_grain_apis.rst	[export] add is_exporting flag (#142425 )	2024-12-18 21:36:28 +00:00
torch.compiler_get_started.rst	[Inductor] Update AttrsDescriptor instantiation for Triton changes (#137458 )	2024-10-14 20:20:29 +00:00
torch.compiler_inductor_profiling.rst
torch.compiler_ir.rst
torch.compiler_nn_module.rst
torch.compiler_performance_dashboard.rst
torch.compiler_profiling_torch_compile.rst	Update Doc for Intel XPU Profiling (#134515 )	2025-03-27 09:15:35 +00:00
torch.compiler_transformations.rst
torch.compiler_troubleshooting_old.rst	Fix broken URLs (#152237 )	2025-04-27 09:56:42 +00:00
torch.compiler_troubleshooting.rst	Rename cache limit to recompile limit in configs (#143709 )	2024-12-22 10:03:57 +00:00
torch.compiler.config.rst	Profile guided optimization for automatic_dynamic (#139001 )	2024-11-03 06:29:57 +00:00
torch.compiler.rst	Profile guided optimization for automatic_dynamic (#139001 )	2024-11-03 06:29:57 +00:00
torch.overrides.rst
torch.rst	Move get accelerator to use build time flags when possible (#146098 )	2025-03-10 13:17:58 +00:00
type_info.rst	Fix broken URLs (#152237 )	2025-04-27 09:56:42 +00:00
utils.rst	New swap function (#111747 )	2023-12-08 18:49:35 +00:00
xpu.rst	Add get_stream_from_external API for XPU backend (#141123 )	2024-12-31 11:15:52 +00:00