pytorch/docs/source/elastic
Kurman Karabukaev a60b566d37 [TorchElastic] Support for overprovisioning in C10 based rendezvous (#117066)
Summary:
Allow TorchElastic to manage more nodes than the maximum nnodes specified in a job. The extra nodes serve as spare capacity (warm nodes) for schedulers that support elasticity.

RFC: https://github.com/pytorch/pytorch/issues/114097

Test Plan: Integration tests

Differential Revision: D52343874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117066
Approved by: https://github.com/zdevito
2024-01-18 01:16:55 +00:00
agent_diagram.jpg
agent.rst
customization.rst
errors.rst
etcd_rdzv_diagram.png
events.rst
examples.rst
kubernetes.rst Fix typo under docs directory (#92762) 2023-01-23 18:07:22 +00:00
metrics.rst
multiprocessing.rst
quickstart.rst [BE] Prefer dash over underscore in command-line options (#94505) 2023-02-09 20:16:49 +00:00
rendezvous.rst [TorchElastic] Support for overprovisioning in C10 based rendezvous (#117066) 2024-01-18 01:16:55 +00:00
run.rst
timer.rst
train_script.rst [BE] Prefer dash over underscore in command-line options (#94505) 2023-02-09 20:16:49 +00:00