Mirror of https://github.com/zebrajr/pytorch.git (synced 2025-12-07 12:21:27 +01:00)
Summary:
When workers finish their work, the TE agent starts the `synchronize_barrier` procedure. The barrier waits for the other agents at the end of execution. A race condition can occur: the barrier uses a TCPStore, which is hosted on Rank0. When Rank0 finishes its work, other ranks may still be executing the `get_all` method. Some of them will then fail because the TCPStore has already been destroyed.

The fix adds a check on the Rank0 process: Rank0 now waits for all other ranks to finish before terminating.

Test Plan: unit tests

Differential Revision: D35227180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74931
Approved by: https://github.com/kiukchung
| Name |
|---|
| .. |
| data |
| `__init__.py` |
| `distributed_test.py` |
| `logging_test.py` |
| `util_test.py` |
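
The race described in the commit summary, and the shape of the fix, can be illustrated with a small standard-library simulation. This is only a sketch under stated assumptions, not the actual PyTorch code: `InMemoryStore`, `run_rank`, `simulate`, and the `arrived/` and `left/` key names are hypothetical stand-ins, and threads play the role of ranks. The store raises an error once it is closed, mimicking what other ranks would see if Rank0 destroyed the TCPStore too early.

```python
import threading
import time


class InMemoryStore:
    """Hypothetical stand-in for a key-value store hosted by rank 0."""

    def __init__(self):
        self._data = {}
        self._closed = False
        self._cond = threading.Condition()

    def set(self, key, value):
        with self._cond:
            if self._closed:
                raise RuntimeError("store destroyed")
            self._data[key] = value
            self._cond.notify_all()

    def wait_all(self, keys, timeout=5.0):
        """Block until every key is present, then return the values."""
        deadline = time.monotonic() + timeout
        with self._cond:
            while True:
                if self._closed:
                    raise RuntimeError("store destroyed")
                if all(k in self._data for k in keys):
                    return [self._data[k] for k in keys]
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    raise TimeoutError("barrier timed out")
                self._cond.wait(remaining)

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()


def run_rank(store, rank, world_size, errors):
    try:
        # Phase 1: the barrier itself. Announce arrival, then read all keys.
        store.set(f"arrived/{rank}", b"1")
        time.sleep(0.01 * rank)  # non-zero ranks are slower to read
        store.wait_all([f"arrived/{r}" for r in range(world_size)])
        if rank != 0:
            # Phase 2: tell rank 0 this rank no longer needs the store.
            store.set(f"left/{rank}", b"1")
        else:
            # Rank 0 hosts the store, so it waits for every other rank
            # to leave before tearing the store down.
            store.wait_all([f"left/{r}" for r in range(1, world_size)])
            store.close()
    except Exception as exc:  # collect failures for inspection
        errors.append((rank, exc))


def simulate(world_size=4):
    """Run all ranks as threads and return the list of (rank, error)."""
    store = InMemoryStore()
    errors = []
    threads = [
        threading.Thread(target=run_rank, args=(store, r, world_size, errors))
        for r in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


if __name__ == "__main__":
    print(simulate(4))
```

In this sketch, removing the second `wait_all` call on rank 0 would let it close the store while slower ranks are still reading, which is the failure mode the commit describes.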