pytorch/docs/source
Gagan Jain 61d431fab0 Adding health check server hook in torch elastic (#122750)
Summary:
Building hook for external mechanism to monitor the health of torch elastic launcher. Health check server takes dependency on FileTimerServer to check if launcher is healthy or not. It will be always healthy if FileTimerServer is disabled.

Implementation of start_healthcheck_server is unsupported, however tcp/http server can be started on specific port which can monitor the aliveness of worker_watchdog and accordingly take the action.

Test Plan: buck test mode/opt caffe2/test/distributed/elastic/agent/server/test:local_agent_test

Differential Revision: D55108182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122750
Approved by: https://github.com/kurman
2024-04-05 23:17:30 +00:00
..
_static [docs] Update PT2+Profiler docs (#122272) 2024-03-28 17:52:28 +00:00
_templates Remove sdp_kernel and replace with sdpa_kernel in attention namespace (#114689) 2024-01-24 22:28:04 +00:00
community Fix typo on Contribution Guide (#119428) 2024-02-08 01:07:27 +00:00
elastic Adding health check server hook in torch elastic (#122750) 2024-04-05 23:17:30 +00:00
notes Graph-Safe RNG State Exchange for Tensor Parallelism (#114068) 2024-03-27 01:14:38 +00:00
rpc
scripts Fixed typo in build_activation_images.py (#117458) 2024-01-15 03:27:40 +00:00
amp.rst add GradScaler on CPU (#109993) 2024-01-29 23:42:35 +00:00
autograd.rst Autograd doc cleanup (#118500) 2024-01-29 21:51:33 +00:00
backends.rst [CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663) 2024-02-14 22:02:06 +00:00
benchmark_utils.rst
bottleneck.rst
checkpoint.rst Add missing words to torch.utils.checkpoint doc (#120196) 2024-02-20 20:18:42 +00:00
complex_numbers.rst Document complex optimizer semantic behavior (#121667) 2024-03-16 00:43:47 +00:00
cond.rst Fix typo under docs directory (#119657) 2024-02-15 21:14:34 +00:00
conf.py Ignore some known duplicated modules in doc build config script (#123425) 2024-04-05 21:12:14 +00:00
config_mod.rst
cpp_extension.rst
cpp_index.rst
cpu.rst Add current_device() to torch.cpu (#110987) 2023-10-11 05:13:10 +00:00
cuda_environment_variables.rst Add doc page for environment variables that effect PyTorch Runtime (#119087) 2024-02-15 21:41:38 +00:00
cuda._sanitizer.rst
cuda.rst [Doc][NVTX] Add documentation for nvtx.range (#121699) 2024-03-15 20:26:44 +00:00
cudnn_persistent_rnn.rst
cudnn_rnn_determinism.rst
data.rst Revert "reseed all Generators in Dataloader's _worker_loop() -- via GC (#107131)" 2023-08-23 17:08:07 +00:00
ddp_comm_hooks.rst [DOCS][DDP]Fix the simple of saving and reloading PowerSGD state and hook. (#102721) 2023-06-10 00:15:00 +00:00
debugging_environment_variables.rst Add doc page for environment variables that effect PyTorch Runtime (#119087) 2024-02-15 21:41:38 +00:00
deploy.rst
deterministic.rst Add torch.utils.deterministic.fill_uninitialized_memory flag (#111377) 2023-11-01 16:10:09 +00:00
distributed.algorithms.join.rst
distributed.checkpoint.rst [DCP] DCP logger (#121352) 2024-04-05 17:50:50 +00:00
distributed.elastic.rst [Torch Elastic][Draft] Refactor SubprocessHandler to separate module for easier subclass (#120373) 2024-03-08 01:37:34 +00:00
distributed.optim.rst
distributed.rst [dtensor] add support for loss parallel (#119877) 2024-03-02 05:06:26 +00:00
distributed.tensor.parallel.rst [tp] doc fixes (#121431) 2024-03-08 17:46:44 +00:00
distributions.rst Add inverse gamma distribution and fix sign bug in PowerTransform. (#104501) 2023-11-01 02:26:25 +00:00
dlpack.rst
docutils.conf
export.ir_spec.rst [export] Remove torch._export.export (#119095) 2024-02-08 21:22:04 +00:00
export.rst [export] Add docs for 2.3 release (#121466) 2024-03-08 22:29:48 +00:00
fft.rst
fsdp.rst [FSDP][state_dict] Expose optimizer state_dict config (#105949) 2023-08-21 07:29:49 +00:00
func.api.rst
func.batch_norm.rst
func.migrating.rst
func.rst
func.ux_limitations.rst
func.whirlwind_tour.rst
future_mod.rst Add swap_tensors path to nn.Module._apply (#117167) 2024-02-07 18:55:44 +00:00
futures.rst
fx.experimental.rst Revert "Preserve unbacked SymInt on SymNode (#120816)" (#122988) 2024-04-01 22:11:09 +00:00
fx.rst Introduce size oblivious guards (#118579) 2024-02-06 19:45:32 +00:00
hub.rst
index.rst Add doc page for environment variables that effect PyTorch Runtime (#119087) 2024-02-15 21:41:38 +00:00
jit_builtin_functions.rst
jit_language_reference_v2.rst
jit_language_reference.rst
jit_python_reference.rst
jit_unsupported.rst Add support for torch.Generator type in TorchScript (#110413) 2023-11-21 23:07:21 +00:00
jit_utils.rst
jit.rst Doc test non packages (#110568) 2023-10-06 14:16:01 +00:00
library.rst Add torch.library.custom_op (#122344) 2024-04-03 18:36:17 +00:00
linalg.rst
logging.rst Change classification to beta for TORCH_LOGS (#118682) 2024-01-31 21:50:55 +00:00
masked.rst Doc test non packages (#110568) 2023-10-06 14:16:01 +00:00
math-quantizer-equation.png
meta.rst Add documentation for meta device (#119119) 2024-02-04 01:05:22 +00:00
miscellaneous_environment_variables.rst Add doc page for environment variables that effect PyTorch Runtime (#119087) 2024-02-15 21:41:38 +00:00
mobile_optimizer.rst
model_zoo.rst
monitor.rst
mps.rst Doc test non packages (#110568) 2023-10-06 14:16:01 +00:00
multiprocessing.rst Doc test non packages (#110568) 2023-10-06 14:16:01 +00:00
name_inference.rst [docs] Properly link register_post_accumulate_grad_hook docs (#108157) 2023-08-29 22:13:33 +00:00
named_tensor.rst fixing named tensor unflatten example (#106921) 2023-08-22 18:00:10 +00:00
nested.rst
nn.attention.bias.rst Remove sdp_kernel and replace with sdpa_kernel in attention namespace (#114689) 2024-01-24 22:28:04 +00:00
nn.attention.rst Remove sdp_kernel and replace with sdpa_kernel in attention namespace (#114689) 2024-01-24 22:28:04 +00:00
nn.functional.rst Add RMSNorm module (#121364) 2024-03-29 18:05:28 +00:00
nn.init.rst
nn.rst Cleanup some duplicated placeholder py:module docs (#123244) 2024-04-05 03:18:53 +00:00
onnx_dynamo_onnxruntime_backend.rst Follow-up #108379 (#108905) 2023-09-09 01:38:36 +00:00
onnx_dynamo.rst [ez][doc] Fix sample code in onnx_dynamo.rst (#114770) 2023-11-29 19:27:52 +00:00
onnx_torchscript_supported_aten_ops.rst Refactor torch.onnx documentation (#108379) 2023-09-08 18:23:48 +00:00
onnx_torchscript.rst Follow-up #108379 (#108905) 2023-09-09 01:38:36 +00:00
onnx.rst Add symbolic_opset19.py and symbolic_opset20.py to support opset 19/20, extend opset 18 support (#118828) 2024-03-22 18:01:33 +00:00
optim.rst Added example regarding weight_decay distinction with per-parameter API (#117436) 2024-01-22 21:26:02 +00:00
package.rst Doc test non packages (#110568) 2023-10-06 14:16:01 +00:00
pipeline.rst [c10d] Deprecate torch.distributed.pipeline (#121464) 2024-03-08 19:55:02 +00:00
profiler.rst Doc test non packages (#110568) 2023-10-06 14:16:01 +00:00
quantization-accuracy-debugging.rst
quantization-backend-configuration.rst
quantization-support.rst [quant][pt2e] Add model_is_exported util function (#119726) 2024-02-16 19:29:36 +00:00
quantization.rst Cleanup some duplicated placeholder py:module docs (#123244) 2024-04-05 03:18:53 +00:00
random.rst
rpc.rst [BE] RPC is missing RRef docs (#106902) 2023-08-10 16:26:27 +00:00
signal.rst
sparse.rst Fix typo in sparse.rst (#121826) 2024-03-19 00:17:19 +00:00
special.rst
storage.rst
tensor_attributes.rst Include the scalar tensor auto-transfer in the doc (#119967) 2024-02-15 22:37:39 +00:00
tensor_view.rst
tensorboard.rst
tensors.rst Integrate swap_tensors into nn.Module.load_state_dict (#117913) 2024-02-09 22:32:29 +00:00
testing.rst
threading_environment_variables.rst Add doc page for environment variables that effect PyTorch Runtime (#119087) 2024-02-15 21:41:38 +00:00
torch_cuda_memory.rst Fix typo under docs directory (#110359) 2023-10-03 16:36:05 +00:00
torch_environment_variables.rst Add doc page for environment variables that effect PyTorch Runtime (#119087) 2024-02-15 21:41:38 +00:00
torch.ao.ns._numeric_suite_fx.rst
torch.ao.ns._numeric_suite.rst
torch.compiler_aot_inductor.rst Fix aoti doc to avoid cannot bind non-const lvalue reference error (#121672) 2024-03-12 23:43:40 +00:00
torch.compiler_api.rst [torch.export] Support is_compiling() flag for non-strict mode (#119602) 2024-02-29 05:52:51 +00:00
torch.compiler_best_practices_for_backends.rst Restructure torch.compile docs (#105376) 2023-07-28 20:58:57 +00:00
torch.compiler_cudagraph_trees.rst [docs] add mode="reduce-overhead" into torch.compile to enable cuda g… (#116529) 2024-01-05 22:54:20 +00:00
torch.compiler_custom_backends.rst Fix torch.compile links (#121824) 2024-03-15 19:49:37 +00:00
torch.compiler_deepdive.rst [Dynamo]Expose bytecode hooks and add example usage for decompilation in docs (#110714) 2023-10-13 12:36:00 +00:00
torch.compiler_dynamic_shapes.rst feat: Add min, max ranges to mark_dynamic API (#119737) 2024-03-07 23:26:03 +00:00
torch.compiler_dynamo_deepdive.rst Add a Dynamo deepdive to documentation (#122305) 2024-04-02 15:08:08 +00:00
torch.compiler_fake_tensor.rst Restructure torch.compile docs (#105376) 2023-07-28 20:58:57 +00:00
torch.compiler_faq.rst Update torch.compile_faq w.r.t to functorch (#122213) 2024-04-05 03:29:11 +00:00
torch.compiler_fine_grain_apis.rst [torch.export] Support is_compiling() flag for non-strict mode (#119602) 2024-02-29 05:52:51 +00:00
torch.compiler_get_started.rst [Reland2] [inductor][BE] split triton_meta and inductor_meta (#112351) 2023-11-02 00:40:12 +00:00
torch.compiler_inductor_profiling.rst Restructure torch.compile docs (#105376) 2023-07-28 20:58:57 +00:00
torch.compiler_ir.rst [export] torch.export landing page (#108783) 2023-09-10 01:40:42 +00:00
torch.compiler_nn_module.rst Revert "Reland 3rd try [finishing colesbury's PR 100642] Guard on nn.Module dicts and type (#109323)" + Forward fixes + test (#110964) 2023-10-11 05:16:47 +00:00
torch.compiler_performance_dashboard.rst Restructure torch.compile docs (#105376) 2023-07-28 20:58:57 +00:00
torch.compiler_profiling_torch_compile.rst [docs] Update PT2+Profiler docs (#122272) 2024-03-28 17:52:28 +00:00
torch.compiler_transformations.rst Fix typo under docs directory (#110359) 2023-10-03 16:36:05 +00:00
torch.compiler_troubleshooting.rst Add a Dynamo deepdive to documentation (#122305) 2024-04-02 15:08:08 +00:00
torch.compiler.rst Add a Dynamo deepdive to documentation (#122305) 2024-04-02 15:08:08 +00:00
torch.overrides.rst Doc test non packages (#110568) 2023-10-06 14:16:01 +00:00
torch.rst Fix torch.compile links (#121824) 2024-03-15 19:49:37 +00:00
type_info.rst
utils.rst New swap function (#111747) 2023-12-08 18:49:35 +00:00
xpu.rst [2/2] Intel GPU Runtime Upstreaming for Generator (#118613) 2024-02-28 05:28:11 +00:00