Torch Distributed Elastic
============================

Makes distributed PyTorch fault-tolerant and elastic.

Get Started
---------------
.. toctree::
   :maxdepth: 1
   :caption: Usage

   elastic/quickstart
   elastic/train_script
   elastic/examples

Documentation
---------------

.. toctree::
   :maxdepth: 1
   :caption: API

   elastic/run
   elastic/agent
   elastic/multiprocessing
   elastic/errors
   elastic/rendezvous
   elastic/timer
   elastic/metrics
   elastic/events
   elastic/subprocess_handler
   elastic/control_plane

.. toctree::
   :maxdepth: 1
   :caption: Advanced

   elastic/customization

.. toctree::
   :maxdepth: 1
   :caption: Plugins

   elastic/kubernetes