mirror of
https://github.com/zebrajr/pytorch.git
synced 2025-12-07 12:21:27 +01:00
Fixes #132950. This fixes an issue in `torch/distributed/elastic/rendezvous/etcd_store.py` where the [get method](https://github.com/pytorch/pytorch/blob/v2.4.0/torch/distributed/elastic/rendezvous/etcd_store.py#L60) does not wait as expected when no keys have been written under the store prefix yet (and the store prefix key therefore does not exist). The cause was that the `_try_wait_get` method errored out immediately [here](https://github.com/alenawang/pytorch/blob/main/torch/distributed/elastic/rendezvous/etcd_store.py#L179) if the prefix was not found, instead of continuing to the etcd watch. As a result, distributed jobs using etcd-v2 could not get past the initial rendezvous at all (details in issue #132950). We added a test demonstrating this issue and the fix. Without the fix, the test fails with `etcd.EtcdKeyNotFound: Key not found : /torch/elastic/store` instead of waiting for the first key to be written; with the fix, the test waits properly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137056 Approved by: https://github.com/fduwjj Co-authored-by: tarat44 <32471142+tarat44@users.noreply.github.com> |
||
|---|---|---|
| .. | ||
| agent | ||
| events | ||
| metrics | ||
| multiprocessing | ||
| rendezvous | ||
| timer | ||
| utils | ||
| __init__.py | ||
| control_plane.py | ||
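
The behavior described in the commit message above can be illustrated with a minimal sketch. This is not the actual `etcd_store.py` code: `FakeStore`, `KeyNotFound`, and `try_wait_get` are hypothetical in-memory stand-ins, assumed here only to show the idea that a missing prefix should be treated as "nothing written yet" and the caller should fall through to the watch rather than raise.

```python
import threading
import time


class KeyNotFound(Exception):
    """Hypothetical stand-in for etcd.EtcdKeyNotFound."""


class FakeStore:
    """Hypothetical in-memory stand-in for an etcd directory under a prefix."""

    def __init__(self):
        self._data = None   # None models "the prefix key does not exist yet"
        self._version = 0   # bumped on every write, loosely like an etcd index
        self._cond = threading.Condition()

    def snapshot(self):
        """Return (contents, version); raise if nothing was ever written."""
        with self._cond:
            if self._data is None:
                raise KeyNotFound("Key not found : /torch/elastic/store")
            return dict(self._data), self._version

    def write(self, key, value):
        with self._cond:
            if self._data is None:
                self._data = {}
            self._data[key] = value
            self._version += 1
            self._cond.notify_all()

    def watch(self, seen_version, timeout):
        """Block until the store changes past seen_version or timeout elapses."""
        with self._cond:
            self._cond.wait_for(lambda: self._version != seen_version, timeout)


def try_wait_get(store, keys, timeout=5.0):
    """Wait until all keys exist; return their values, or None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            found, version = store.snapshot()
        except KeyNotFound:
            # The idea behind the fix: a missing prefix is not an error,
            # it just means no key has been written yet. Keep waiting.
            found, version = {}, 0
        if all(k in found for k in keys):
            return {k: found[k] for k in keys}
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        store.watch(version, remaining)
```

In this sketch, a reader that starts before any writer blocks in `watch` until the first key appears. Re-raising `KeyNotFound` in the `except` branch would reproduce the failure mode the commit message describes.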