pytorch/docs/source/distributed.elastic.md
raghavhrishi 7ef3c3357d NUMA binding integration with elastic agent and torchrun (#149334)
Implements #148689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149334
Approved by: https://github.com/d4l3k

Co-authored-by: Paul de Supinski <pdesupinski@gmail.com>
2025-07-25 21:19:49 +00:00

Torch Distributed Elastic

Makes distributed PyTorch fault-tolerant and elastic.
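
As a rough illustration of what "elastic" means for a training script: when workers are launched with `torchrun`, each one receives environment variables such as `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, and `TORCHELASTIC_RESTART_COUNT`, and a script can read them on every (re)start to decide whether to resume from a checkpoint. The sketch below uses only the standard library; the `WorkerEnv` class and `read_worker_env` helper are hypothetical names introduced here for illustration, not part of the PyTorch API.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class WorkerEnv:
    """Hypothetical container for the values a launcher exports to a worker."""

    rank: int           # global rank of this worker
    local_rank: int     # rank of this worker on its own node
    world_size: int     # total number of workers in the job
    restart_count: int  # how many times the worker group has been restarted


def read_worker_env(environ: Mapping[str, str] = os.environ) -> WorkerEnv:
    """Read launcher-provided settings, falling back to single-process defaults."""
    return WorkerEnv(
        rank=int(environ.get("RANK", "0")),
        local_rank=int(environ.get("LOCAL_RANK", "0")),
        world_size=int(environ.get("WORLD_SIZE", "1")),
        restart_count=int(environ.get("TORCHELASTIC_RESTART_COUNT", "0")),
    )


if __name__ == "__main__":
    env = read_worker_env()
    if env.restart_count > 0:
        # After a failure or membership change, a script would typically
        # reload its latest checkpoint here (checkpoint logic is omitted).
        print(f"rank {env.rank}: restart #{env.restart_count}, resume from checkpoint")
    else:
        print(f"rank {env.rank} of {env.world_size}: first start")
```

Because the world size and ranks can differ between restarts, reading these values at startup rather than hard-coding them keeps the script usable both under `torchrun` and as a plain single-process run.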

Get Started

:caption: Usage
:maxdepth: 1

elastic/quickstart
elastic/train_script
elastic/examples

Documentation

:caption: API
:maxdepth: 1

elastic/run
elastic/agent
elastic/multiprocessing
elastic/errors
elastic/rendezvous
elastic/timer
elastic/metrics
elastic/events
elastic/subprocess_handler
elastic/control_plane
elastic/numa

:caption: Advanced
:maxdepth: 1

elastic/customization

:caption: Plugins
:maxdepth: 1

elastic/kubernetes