mirror of
https://github.com/zebrajr/pytorch.git
synced 2025-12-07 00:21:07 +01:00
Enables support for this:
```python
from torch.distributed.launcher.api import LaunchConfig
config = LaunchConfig(
...,
rdzv_configs={"keep_alive_interval": 1122, "heartbeat_timeout": 321, "keep_alive_max_attempt" 5},
)
```
These arguments are currently hard-coded inside torchrun. The default values are not suitable for jobs with thousands of ranks.
Today, `rdzv_configs` only allows the keys `join_timeout`, `last_call_timeout`, `close_timeout`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145228
Approved by: https://github.com/wconstab
|
||
|---|---|---|
| .. | ||
| agent | ||
| events | ||
| metrics | ||
| multiprocessing | ||
| rendezvous | ||
| timer | ||
| utils | ||
| __init__.py | ||
| control_plane.py | ||