mirror of https://github.com/zebrajr/pytorch.git (synced 2025-12-07 00:21:07 +01:00)
Enables support for this:
```python
from torch.distributed.launcher.api import LaunchConfig
config = LaunchConfig(
...,
rdzv_configs={"keep_alive_interval": 1122, "heartbeat_timeout": 321, "keep_alive_max_attempt": 5},
)
```
These arguments are currently hard-coded inside torchrun. The default values are not suitable for jobs with thousands of ranks.
Today, `rdzv_configs` only allows the keys `join_timeout`, `last_call_timeout`, and `close_timeout`.
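
To make the effect of the new keys concrete, here is a minimal, torch-free sketch. The helper names (`pick_tunables`, `keep_alive_window`) are hypothetical and are not part of PyTorch. It also assumes that a peer is treated as dead once `keep_alive_interval * keep_alive_max_attempt` seconds pass without a keep-alive; check that against `dynamic_rendezvous.py` before relying on it.

```python
from datetime import timedelta

# Keys the commit message lists as already accepted in rdzv_configs.
EXISTING_KEYS = {"join_timeout", "last_call_timeout", "close_timeout"}
# Keys the commit message says become configurable with this change.
NEW_KEYS = {"keep_alive_interval", "heartbeat_timeout", "keep_alive_max_attempt"}


def pick_tunables(rdzv_configs):
    """Hypothetical helper: extract the timing keys and check they are positive ints."""
    tunables = {}
    for key in EXISTING_KEYS | NEW_KEYS:
        if key in rdzv_configs:
            value = int(rdzv_configs[key])
            if value <= 0:
                raise ValueError(f"{key} must be positive, got {value!r}")
            tunables[key] = value
    return tunables


def keep_alive_window(tunables):
    """Assumed rule: a peer counts as dead after interval * max_attempt seconds."""
    seconds = tunables["keep_alive_interval"] * tunables["keep_alive_max_attempt"]
    return timedelta(seconds=seconds)


# Values taken from the example in the commit message.
tunables = pick_tunables(
    {"keep_alive_interval": 1122, "heartbeat_timeout": 321, "keep_alive_max_attempt": 5}
)
window = keep_alive_window(tunables)
```

Under that assumption, raising either `keep_alive_interval` or `keep_alive_max_attempt` widens the window in which a slow rank is still considered alive, which is the kind of slack a job with thousands of ranks may need.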
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145228
Approved by: https://github.com/wconstab
| Name |
|---|
| __init__.py |
| _etcd_stub.py |
| api.py |
| c10d_rendezvous_backend.py |
| dynamic_rendezvous.py |
| etcd_rendezvous_backend.py |
| etcd_rendezvous.py |
| etcd_server.py |
| etcd_store.py |
| registry.py |
| static_tcp_rendezvous.py |
| utils.py |