pytorch/torch/distributed/elastic/rendezvous
Carlos Mocholi aade4fbd55 Expose the rendezvous keepalive arguments (#145228)
Enables support for this:

```python
from torch.distributed.launcher.api import LaunchConfig

config = LaunchConfig(
    ...,
    rdzv_configs={"keep_alive_interval": 1122, "heartbeat_timeout": 321, "keep_alive_max_attempt": 5},
)
```

These arguments are currently hard-coded inside torchrun. The default values are not suitable for jobs with thousands of ranks.

Today, `rdzv_configs` only accepts the keys `join_timeout`, `last_call_timeout`, and `close_timeout`.
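
As a rough illustration of the intended behavior, the sketch below overlays user-supplied `rdzv_configs` values on a set of defaults. It is not the code in `dynamic_rendezvous.py`: the helper `resolve_keepalive` is hypothetical, and the default values shown are assumptions made for the example.

```python
# Hypothetical helper for illustration only; not part of torch.
# The defaults below are assumed values, not confirmed by this commit.
ASSUMED_DEFAULTS = {
    "keep_alive_interval": 5,
    "keep_alive_max_attempt": 3,
    "heartbeat_timeout": 5,
}


def resolve_keepalive(rdzv_configs):
    """Overlay the keepalive keys from rdzv_configs on the assumed defaults."""
    resolved = dict(ASSUMED_DEFAULTS)
    for key in ASSUMED_DEFAULTS:
        if key in rdzv_configs:
            # Values may arrive as strings (e.g. from a command line), so coerce to int.
            resolved[key] = int(rdzv_configs[key])
    return resolved


resolved = resolve_keepalive(
    {"keep_alive_interval": 1122, "heartbeat_timeout": "321", "join_timeout": 900}
)
print(resolved)
```

Keys that are not keepalive settings, such as `join_timeout`, are ignored by this helper, and any keepalive key the caller omits keeps its assumed default.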

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145228
Approved by: https://github.com/wconstab
2025-03-03 19:11:56 +00:00
__init__.py [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
_etcd_stub.py Make torchelastic etcd rendezvous publicly importable (#145396) 2025-01-23 23:56:45 +00:00
api.py [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
c10d_rendezvous_backend.py
dynamic_rendezvous.py Expose the rendezvous keepalive arguments (#145228) 2025-03-03 19:11:56 +00:00
etcd_rendezvous_backend.py Make torchelastic etcd rendezvous publicly importable (#145396) 2025-01-23 23:56:45 +00:00
etcd_rendezvous.py [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
etcd_server.py
etcd_store.py Make torchelastic etcd rendezvous publicly importable (#145396) 2025-01-23 23:56:45 +00:00
registry.py [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547) 2025-02-28 07:35:56 +00:00
static_tcp_rendezvous.py
utils.py