pytorch/torch/distributed/elastic
Carlos Mocholi aade4fbd55 Expose the rendezvous keepalive arguments (#145228)
Enables support for this:

```python
from torch.distributed.launcher.api import LaunchConfig

config = LaunchConfig(
    ...,
    rdzv_configs={"keep_alive_interval": 1122, "heartbeat_timeout": 321, "keep_alive_max_attempt" 5},
)
```

These arguments are currently hard-coded inside torchrun. The default values are not suitable for jobs with thousands of ranks.

Today, `rdzv_configs` only allows the keys `join_timeout`, `last_call_timeout`, and `close_timeout`.
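
For context, here is a minimal end-to-end sketch (not part of the PR diff) of passing the newly exposed keepalive settings through `LaunchConfig` and running a job with `elastic_launch`. The endpoint, node counts, and trainer function are illustrative assumptions:

```python
# Hedged sketch: a single-node c10d rendezvous with the keepalive keys exposed
# by this PR. All concrete values below are assumptions for demonstration only.
from torch.distributed.launcher.api import LaunchConfig, elastic_launch


def train():
    # placeholder training entrypoint; each spawned rank runs this
    print("hello from a rank")


config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=2,
    rdzv_backend="c10d",
    rdzv_endpoint="localhost:29400",
    run_id="example_job",
    # keepalive tuning exposed by this PR; values here are illustrative
    rdzv_configs={
        "keep_alive_interval": 30,
        "heartbeat_timeout": 120,
        "keep_alive_max_attempt": 5,
    },
)

if __name__ == "__main__":
    elastic_launch(config, train)()
```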

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145228
Approved by: https://github.com/wconstab
2025-03-03 19:11:56 +00:00
agent PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
events [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
metrics [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
multiprocessing [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
rendezvous Expose the rendezvous keepalive arguments (#145228) 2025-03-03 19:11:56 +00:00
timer [BE][Ez]: ISC001 Auto concatenate implicit one line strings (#146408) 2025-02-04 19:07:04 +00:00
utils [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
__init__.py
control_plane.py [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00