pytorch/docs/source/elastic
Kurman Karabukaev a60b566d37 [TorchElastic] Support for overprovisioning in C10 based rendezvous (#117066)
Summary:
Allow TorchElastic to manage more nodes than the maximum nnodes specified in a job. The extra nodes serve as spare capacity (warm nodes) for schedulers that support elasticity.

RFC: https://github.com/pytorch/pytorch/issues/114097

Test Plan: Integration tests

Differential Revision: D52343874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117066
Approved by: https://github.com/zdevito
2024-01-18 01:16:55 +00:00
..
agent_diagram.jpg
agent.rst
customization.rst
errors.rst
etcd_rdzv_diagram.png
events.rst
examples.rst
kubernetes.rst
metrics.rst
multiprocessing.rst
quickstart.rst
rendezvous.rst [TorchElastic] Support for overprovisioning in C10 based rendezvous (#117066) 2024-01-18 01:16:55 +00:00
run.rst
timer.rst
train_script.rst