Mirror of https://github.com/zebrajr/pytorch.git, synced 2025-12-07 00:21:07 +01:00
Summary: When workers finish their work, the TE agent starts the `synchronize_barrier` procedure. The barrier waits for the other agents at the end of execution. A race condition can occur: the barrier uses a TCPStore that is hosted on Rank0. When Rank0 finishes its work, other ranks may still be executing the `get_all` method, and some of them will fail because the TCPStore has been destroyed. The fix adds a check on the Rank0 process: Rank0 now waits for all other ranks to finish before terminating. Test Plan: unit tests Differential Revision: D35227180 Pull Request resolved: https://github.com/pytorch/pytorch/pull/74931 Approved by: https://github.com/kiukchung |
||
|---|---|---|
| agent/server/test | ||
| events | ||
| metrics | ||
| multiprocessing | ||
| rendezvous | ||
| timer | ||
| utils | ||
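
The race described in the commit summary, and the kind of fix it outlines, can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the code from the pull request: `InMemoryStore`, `barrier`, and `run_ranks` are hypothetical names, ranks are modeled as threads, and the store is a plain in-process dictionary standing in for the TCPStore hosted on Rank0. The idea shown is that every rank records a completion marker after it has finished reading, and Rank0 waits for all markers before it destroys the store.

```python
import threading


class InMemoryStore:
    """Hypothetical stand-in for a key-value store hosted by rank 0."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()
        self.closed = False

    def set(self, key, value):
        with self._cond:
            if self.closed:
                raise RuntimeError("store was destroyed")
            self._data[key] = value
            self._cond.notify_all()

    def get(self, key, timeout=5.0):
        # Block until the key exists, the store is closed, or the timeout expires.
        with self._cond:
            found = self._cond.wait_for(
                lambda: key in self._data or self.closed, timeout
            )
            if self.closed:
                raise RuntimeError("store was destroyed")
            if not found:
                raise TimeoutError(key)
            return self._data[key]

    def close(self):
        with self._cond:
            self.closed = True
            self._cond.notify_all()


def barrier(store, rank, world_size, prefix="barrier/"):
    # Announce arrival, then read every rank's arrival key.
    store.set(f"{prefix}{rank}", b"arrived")
    for peer in range(world_size):
        store.get(f"{prefix}{peer}")
    # Record that this rank no longer needs the store.
    store.set(f"{prefix}{rank}.done", b"done")
    if rank == 0:
        # Rank 0 owns the store, so it waits for every rank's marker
        # before tearing the store down.
        for peer in range(world_size):
            store.get(f"{prefix}{peer}.done")
        store.close()


def run_ranks(world_size=4):
    store = InMemoryStore()
    errors = []

    def worker(rank):
        try:
            barrier(store, rank, world_size)
        except Exception as exc:  # collect failures instead of crashing the thread
            errors.append((rank, exc))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors, store.closed


if __name__ == "__main__":
    print(run_ranks(4))
```

If the final wait on the `.done` markers is removed, rank 0 may close the store while other ranks are still inside their read loop, which corresponds to the failure mode the summary describes.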