pytorch/test/distributed/elastic
Georg Narodoslawsky 8739a8c288 elastic: do not shutdown rendezvous on leaving workers (#152525)
#117066 added a shutdown of the rendezvous whenever a worker shuts down. This is incorrect: the rendezvous is already shut down in [this file](fa6f9eb2be/torch/distributed/launcher/api.py (L290)), and it should not be shut down when a signal is received. See also [this pull request](https://github.com/pytorch/pytorch/pull/67749).

#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case. However, that fix only takes effect when the agent restarts the training, not when the rendezvous had already been shut down before that.

Removing both of these changes restores the original behavior. The rendezvous should only be shut down when a run completes or fails, not when a single worker leaves.
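
The intended rule can be illustrated with a minimal sketch. This is not the PyTorch code: `ToyRendezvous`, `launch`, and the local `SignalException` class are hypothetical stand-ins, assumed here only to show the control flow in which the rendezvous is closed when a run completes or fails, but left open when the agent is interrupted by a signal.

```python
import signal


class SignalException(Exception):
    """Hypothetical stand-in for 'the agent was interrupted by a signal'."""

    def __init__(self, msg, sigval):
        super().__init__(msg)
        self.sigval = sigval


class ToyRendezvous:
    """Hypothetical rendezvous that only records whether it was closed."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


def launch(run_fn, rdzv):
    """Run the agent; close the rendezvous unless a signal interrupted it."""
    shutdown_rdzv = True
    try:
        return run_fn()
    except SignalException:
        # A leaving worker must not close the rendezvous for everyone else.
        shutdown_rdzv = False
        raise
    finally:
        # Reached on success, on ordinary failure, and on signal.
        if shutdown_rdzv:
            rdzv.shutdown()


def interrupted():
    raise SignalException("worker received SIGTERM", signal.SIGTERM)


if __name__ == "__main__":
    rdzv = ToyRendezvous()
    try:
        launch(interrupted, rdzv)
    except SignalException:
        pass
    print("closed after signal:", rdzv.closed)
```

Keeping the decision in a single `finally` block means that no per-worker code path needs to call `shutdown()` itself, which is the situation the two reverted changes had introduced.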

Fixes #150916
Fixes #147064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152525
Approved by: https://github.com/kiukchung
2025-05-14 00:44:10 +00:00
agent/server/test elastic: do not shutdown rendezvous on leaving workers (#152525) 2025-05-14 00:44:10 +00:00
events
metrics
multiprocessing Remove NO_MULTIPROCESSING_SPAWN checks (#146705) 2025-02-28 05:53:19 +00:00
rendezvous Enable ruff F841 on distributed tests (#146131) 2025-02-01 03:06:16 +00:00
timer Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
utils [elastic][test] fix race condition in test_barrier_timeout_rank_tracing (#150768) 2025-04-10 04:40:16 +00:00
test_control_plane.py Fix xrefs (#151888) 2025-04-25 21:27:27 +00:00