pytorch/test/distributed/elastic/utils
Aliaksandr Ivanou e54c1f6c90 [torch][elastic] Make final agent barrier to shutdown properly
Summary:
When workers finish their work, the TE agent starts the `synchronize_barrier` procedure. The barrier waits for the other agents at the end of execution.

A race condition may occur: the barrier uses a TCPStore that is hosted on Rank0. When Rank0 finishes its work, other ranks may still be executing the `get_all` method. Some of them will then fail because the TCPStore has already been destroyed.

The fix adds an additional check to the Rank0 process: Rank0 now waits for all other ranks to finish before terminating.
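
The ordering described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the code from this change: `ToyStore`, `agent`, `run_barrier`, and the key names `arrived` and `finished` are hypothetical, threads stand in for agent processes, and an in-memory counter map stands in for the TCPStore hosted on Rank0.

```python
import threading
import time


class ToyStore:
    """Hypothetical in-memory stand-in for the store hosted by rank 0."""

    def __init__(self):
        self._counters = {}
        self._closed = False
        self._cond = threading.Condition()

    def add(self, key, amount):
        with self._cond:
            if self._closed:
                raise RuntimeError("store destroyed")
            self._counters[key] = self._counters.get(key, 0) + amount
            self._cond.notify_all()

    def wait_for(self, key, target, timeout=5.0):
        with self._cond:
            reached = self._cond.wait_for(
                lambda: self._closed or self._counters.get(key, 0) >= target,
                timeout,
            )
            if self._closed:
                raise RuntimeError("store destroyed")
            if not reached:
                raise TimeoutError(key)

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()


def agent(rank, world_size, store, errors):
    try:
        # Ordinary barrier: announce arrival, then wait for everyone.
        store.add("arrived", 1)
        store.wait_for("arrived", world_size)
        if rank == 0:
            # Extra check: rank 0 keeps the store alive until every
            # other rank has reported that it is done with the store.
            store.wait_for("finished", world_size - 1)
            store.close()
        else:
            time.sleep(0.01 * rank)  # simulate a rank that is still reading
            store.add("finished", 1)
    except Exception as exc:
        errors.append((rank, repr(exc)))


def run_barrier(world_size):
    store = ToyStore()
    errors = []
    threads = [
        threading.Thread(target=agent, args=(rank, world_size, store, errors))
        for rank in range(world_size)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors
```

Calling `run_barrier(4)` returns the list of per-rank failures, which stays empty here because rank 0 closes the store only after the other ranks have reported in. Removing the `finished` wait from the rank 0 branch reintroduces the race: slower ranks can then hit the "store destroyed" error.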

Test Plan: unit tests

Differential Revision: D35227180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74931
Approved by: https://github.com/kiukchung
2022-04-15 20:29:05 +00:00
..
data Add test owners for elastic tests (#67293) 2021-10-28 08:32:50 -07:00
__init__.py
distributed_test.py Revise the socket implementation of c10d (#68226) 2021-11-16 20:49:25 -08:00
logging_test.py Overload TestCase not vanilla TestCase for some elastic tests (#67700) 2021-11-03 11:14:52 -07:00
util_test.py [torch][elastic] Make final agent barrier to shutdown properly 2022-04-15 20:29:05 +00:00