pytorch/docs/source/elastic
Tristan Rice 3d541835d5 distributed debug handlers (#126601)
This adds debug handlers as described in:
* https://gist.github.com/d4l3k/828b7be585c7615e85b2c448b308d925 (public copy)
* https://docs.google.com/document/d/1la68szcS6wUYElUUX-P6zXgkPA8lnfzpagMTPys3aQ8/edit (internal copy)

This PR adds only the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow-up PR.

This adds two handlers out of the box:

* `/handler/ping` for testing purposes
* `/handler/dump_nccl_trace_pickle` as a POC integration with Flight Recorder
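
The routing idea behind paths such as `/handler/ping` can be illustrated with a small, self-contained sketch. This is not the C++ implementation from this PR: the `HANDLERS` registry, the `HandlerRequest` class, the `call_handler` helper, the TCP loopback transport, and the `pong` response body are all assumptions made for illustration.

```python
# Illustrative sketch only: a toy handler registry, NOT PyTorch's C++ code.
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical registry mapping a handler name to a callable returning bytes.
HANDLERS = {"ping": lambda: b"pong"}  # "pong" body is an assumption

PREFIX = "/handler/"


class HandlerRequest(BaseHTTPRequestHandler):
    def do_POST(self):
        # Extract the handler name from paths of the form /handler/<name>.
        name = self.path[len(PREFIX):] if self.path.startswith(PREFIX) else None
        handler = HANDLERS.get(name)
        if handler is None:
            self.send_error(404, "unknown handler")
            return
        body = handler()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence the default per-request logging.
        pass


def call_handler(name):
    """Start a throwaway server, POST to /handler/<name>, return (status, body)."""
    server = HTTPServer(("127.0.0.1", 0), HandlerRequest)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
        conn.request("POST", PREFIX + name)
        resp = conn.getresponse()
        result = (resp.status, resp.read())
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(call_handler("ping"))
```

In this sketch, adding a handler means adding one entry to the registry. Requests for names that are not registered receive a 404.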

Test plan:

```
python test/distributed/elastic/test_control_plane.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126601
Approved by: https://github.com/kurman, https://github.com/c-p-i-o
2024-05-30 02:21:08 +00:00
agent_diagram.jpg
agent.rst Adding health check server hook in torch elastic (#122750) (#123504) 2024-04-11 19:10:56 +00:00
control_plane.rst distributed debug handlers (#126601) 2024-05-30 02:21:08 +00:00
customization.rst
errors.rst
etcd_rdzv_diagram.png
events.rst
examples.rst
kubernetes.rst Fix typo under docs directory (#92762) 2023-01-23 18:07:22 +00:00
metrics.rst
multiprocessing.rst [TorchElastic] Refactoring to support non-default logging strategy (#120691) 2024-02-29 20:59:17 +00:00
quickstart.rst [BE] Prefer dash over underscore in command-line options (#94505) 2023-02-09 20:16:49 +00:00
rendezvous.rst [TorchElastic] Option for sharing TCPStore created by rdzv handlers (#125743) 2024-05-22 18:24:11 +00:00
run.rst Introduce the torchrun entrypoint (#64049) 2021-08-26 20:17:48 -07:00
subprocess_handler.rst [Torch Elastic][Draft] Refactor SubprocessHandler to separate module for easier subclass (#120373) 2024-03-08 01:37:34 +00:00
timer.rst [Torch][Timer] Adding debug info logging interface for expired timers (#123883) 2024-04-25 01:15:52 +00:00
train_script.rst [BE] Prefer dash over underscore in command-line options (#94505) 2023-02-09 20:16:49 +00:00