Mirror of https://github.com/zebrajr/pytorch.git (synced 2025-12-06 12:20:52 +01:00)
Summary: Adds a hook that lets an external mechanism monitor the health of the torch elastic launcher. The health check server depends on FileTimerServer to determine whether the launcher is healthy; if FileTimerServer is disabled, the launcher is always reported as healthy. The implementation of start_healthcheck_server is unsupported; however, a TCP/HTTP server can be started on a specific port to monitor the liveness of worker_watchdog and take action accordingly.

Test Plan: buck test mode/opt caffe2/test/distributed/elastic/agent/server/test:local_agent_test

Differential Revision: D55837899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123504

Approved by: https://github.com/kurman

| File |
|---|
| agent_diagram.jpg |
| agent.rst |
| customization.rst |
| errors.rst |
| etcd_rdzv_diagram.png |
| events.rst |
| examples.rst |
| kubernetes.rst |
| metrics.rst |
| multiprocessing.rst |
| quickstart.rst |
| rendezvous.rst |
| run.rst |
| subprocess_handler.rst |
| timer.rst |
| train_script.rst |
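
The commit summary mentions that a TCP/HTTP server can be started on a port to report whether the worker watchdog is still alive. The following is a minimal sketch of that idea using only the Python standard library. It is not the PyTorch implementation: `SketchHealthServer`, `alive_callback`, and `probe` are hypothetical names introduced here, and the assumption is that the watchdog can expose the Unix timestamp of its last sign of life, or `None` when it is disabled (in which case the endpoint always reports healthy, as the summary describes for a disabled FileTimerServer).

```python
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Optional


class SketchHealthServer:
    """Hypothetical HTTP health endpoint (illustration only, not a torch API)."""

    def __init__(
        self,
        alive_callback: Callable[[], Optional[float]],
        port: int = 0,
        timeout: float = 60.0,
    ) -> None:
        # alive_callback returns the last-alive Unix timestamp, or None if
        # the watchdog is disabled (assumed contract for this sketch).
        self._alive_callback = alive_callback
        self._timeout = timeout
        outer = self

        class Handler(BaseHTTPRequestHandler):
            def do_GET(self) -> None:
                healthy = outer.is_healthy()
                body = b"ok" if healthy else b"unhealthy"
                self.send_response(200 if healthy else 503)
                self.send_header("Content-Type", "text/plain")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)

            def log_message(self, format: str, *args: object) -> None:
                pass  # keep the sketch quiet

        # port=0 asks the OS for any free port.
        self._httpd = HTTPServer(("127.0.0.1", port), Handler)
        self._thread = threading.Thread(
            target=self._httpd.serve_forever, daemon=True
        )

    @property
    def port(self) -> int:
        return self._httpd.server_address[1]

    def is_healthy(self) -> bool:
        last_alive = self._alive_callback()
        if last_alive is None:
            return True  # watchdog disabled: always healthy
        return (time.time() - last_alive) <= self._timeout

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        # Call only after start(); shutdown() waits for serve_forever to exit.
        self._httpd.shutdown()
        self._httpd.server_close()
        self._thread.join()


def probe(port: int) -> int:
    """Return the HTTP status code of the local health endpoint."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(f"http://127.0.0.1:{port}/", timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
```

An external monitor would poll the port and treat a 503 (or a failed connection) as a signal to act, for example by restarting the launcher. Returning the status through HTTP codes keeps the probe compatible with common liveness checkers, though a plain TCP accept would also fit the description in the summary.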