mirror of
https://github.com/zebrajr/pytorch.git
synced 2025-12-06 12:20:52 +01:00
This is the start of a series of efforts to consolidating auxiliary threads in PGNCCL, aka watchdog and heartbeat_monitoring threads. Right now we launch these two threads per PG instances, i.e., if users create hundred or thousand instances of PG or subPGs, we will end up with that twice many side threads which is not efficient. We have a RFC to consolidate them (https://github.com/pytorch/pytorch/issues/146956). Right now both threads are assigned with so many functionalities so it is hard to do the consolidations in one shot, we will try to split it into at least two steps (PRs) to make it easier to test and review. We did our first attemp in https://github.com/pytorch/pytorch/pull/153668 but we also want to try to see if we can make monitoring thread a class. This PR is doing the first step to make monitoring thread a class. The next step to also extract watchdog to be a separate class so that we know its dependency. What we did in this PR: 1. Move all related variables and methods into a class named `HeartbeatMonitor`. 2. Correct some errors in the original logics inside monitoring thread loop. 3. Move the error propagation check to watchdog thread which is more relevant. This is totally fine since we rolled out EventCache out fully so watchdog hang is rare now. Today there are two major functions inside heartbeat monitoring thread today: 1. Check the heartbeat of watchdog thread every 8 minutes. If no heartbeat detected and we are sure monitoring thread has not been stopped, we will kill the program by SIG_ABORT. 2. We check TCPStore every 30 sec to see if any watchdog timeout happens on other ranks, if so we will initiate a dump signal on the current rank as well. (We do this only in the default PG) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153977 Approved by: https://github.com/kwen2501, https://github.com/d4l3k |
||
|---|---|---|
| .. | ||
| aoti_abi_check | ||
| aoti_inference | ||
| api | ||
| c10d | ||
| common | ||
| dist_autograd | ||
| jit | ||
| lazy | ||
| lite_interpreter_runtime | ||
| monitor | ||
| nativert | ||
| profiler | ||
| rpc | ||
| tensorexpr | ||
| __init__.py | ||