pytorch/torch/distributed/elastic
alenawang 18525e185e Fix rendezvous error due to EtcdStore get method not waiting in some cases (#137056)
Fixes #132950

This fixes an issue in `torch/distributed/elastic/rendezvous/etcd_store.py` where the [get method](https://github.com/pytorch/pytorch/blob/v2.4.0/torch/distributed/elastic/rendezvous/etcd_store.py#L60) does not wait as expected when no keys have been written under the store prefix yet (and therefore the store prefix key does not exist). The cause was that the `_try_wait_get` method raised immediately [here](https://github.com/alenawang/pytorch/blob/main/torch/distributed/elastic/rendezvous/etcd_store.py#L179) when the prefix was not found, instead of continuing to the etcd watch.
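
The control flow described above can be illustrated with a minimal, self-contained sketch. It is not the code in `etcd_store.py`: `FakeClient`, `KeyNotFound`, `try_wait_get`, and the `wait_on_missing_prefix` flag are hypothetical stand-ins that mimic a key-value store whose prefix read raises when nothing has been written yet. The flag switches between the old behavior (re-raise immediately) and the fixed behavior (treat a missing prefix as "no keys yet" and keep waiting).

```python
import threading
import time


class KeyNotFound(Exception):
    """Hypothetical stand-in for etcd.EtcdKeyNotFound."""


class FakeClient:
    """Hypothetical in-memory stand-in for an etcd-v2 client."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def write(self, key, value):
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def read_prefix(self, prefix):
        # Like a prefix read on a store where nothing exists under the
        # prefix yet: raise instead of returning an empty result.
        with self._cond:
            found = {k: v for k, v in self._data.items() if k.startswith(prefix)}
        if not found:
            raise KeyNotFound(f"Key not found : {prefix}")
        return found

    def watch(self, timeout):
        # Block until some write happens or the timeout elapses.
        with self._cond:
            self._cond.wait(timeout)


def try_wait_get(client, prefix, key, timeout, wait_on_missing_prefix=True):
    """Return the value for `key`, or None if it does not appear in time."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            found = client.read_prefix(prefix)
        except KeyNotFound:
            if not wait_on_missing_prefix:
                raise  # old behavior: fail before ever reaching the watch
            found = {}  # fixed behavior: no keys yet, fall through to the watch
        if key in found:
            return found[key]
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        # Short, capped waits so a write that lands between the read and
        # the watch is still picked up on the next loop iteration.
        client.watch(min(remaining, 0.05))
```

With `wait_on_missing_prefix=False`, a call on an empty `FakeClient` raises `KeyNotFound` right away. With the default, the same call blocks until another thread calls `client.write(...)` for the requested key or the timeout expires.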

As a result, distributed jobs using etcd-v2 could not get past the initial rendezvous at all (details in issue #132950).

We added a test demonstrating this issue and the fix. Without the fix the test fails with `etcd.EtcdKeyNotFound: Key not found : /torch/elastic/store` instead of waiting for the first key to be written; with the fix the test waits properly.

Co-authored-by: tarat44 <32471142+tarat44@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137056
Approved by: https://github.com/fduwjj

2024-10-02 01:45:00 +00:00
agent [reland][Elastic] Skip store barrier and store get in host assign (#136865) 2024-09-27 23:40:42 +00:00
events [BE][Easy] enable UFMT for torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/ (#128866) 2024-06-18 13:51:53 +00:00
metrics [BE][Easy] enable UFMT for torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/ (#128866) 2024-06-18 13:51:53 +00:00
multiprocessing [torchelastic] Don't do signal handling when off the main thread (#135088) 2024-09-06 14:47:03 +00:00
rendezvous Fix rendezvous error due to EtcdStore get method not waiting in some cases (#137056) 2024-10-02 01:45:00 +00:00
timer passing FileTimerRequests.to_json() to log_debug_info_for_expired_timers for a better debugging experience (#135913) 2024-09-20 00:54:02 +00:00
utils [TorchElastic] add warning when users try to pass a "use_libuv" argument to create_c10d_store (#135062) 2024-09-04 22:05:51 +00:00
__init__.py
control_plane.py [BE][Easy] enable UFMT for torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/ (#128866) 2024-06-18 13:51:53 +00:00