mirror of
https://github.com/zebrajr/pytorch.git
synced 2025-12-07 00:21:07 +01:00
Summary: This diff implements a named-pipe-based watchdog timer (`FileTimerClient` and `FileTimerServer`). It is similar to the existing `LocalTimerClient` and `LocalTimerServer` (https://fburl.com/code/j4b9pyya).

The motivation is the need to handle various timeout issues. Training processes occasionally get stuck, so we need a proper watchdog to monitor their liveness. This timer allows the TorchElastic agent (as the watchdog) to monitor the progress of the training processes that it spawned. If a timeout occurs, the TorchElastic agent can take action to kill the stuck process and create a core dump for it.

`LocalTimerClient` and `LocalTimerServer` require a `multiprocessing.Queue()` to work, so they can only be used between `multiprocessing` parent and child processes. `FileTimerClient` and `FileTimerServer` do not have this limitation.

Test Plan:

### Unit Test

```
buck test mode/opt caffe2/test/distributed/elastic/timer:file_based_timer_test
```

```
RemoteExecution session id: reSessionID-06d70a77-043c-4d9d-b0f2-94c24460740a-tpx
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/844425186732666
✓ ListingSuccess: caffe2/test/distributed/elastic/timer:file_based_timer_test : 12 tests discovered (2.177)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_happy_path (file_based_local_timer_test.FileTimerTest) (2.463)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_expired_timers (file_based_local_timer_test.FileTimerServerTest) (1.889)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_send_request_release (file_based_local_timer_test.FileTimerServerTest) (1.700)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_valid_timers (file_based_local_timer_test.FileTimerServerTest) (1.873)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_watchdog_call_count (file_based_local_timer_test.FileTimerServerTest) (1.715)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_watchdog_empty_queue (file_based_local_timer_test.FileTimerServerTest) (1.609)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_exception_propagation (file_based_local_timer_test.FileTimerTest) (1.633)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_multiple_clients_interaction (file_based_local_timer_test.FileTimerTest) (2.189)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_get_timer_recursive (file_based_local_timer_test.FileTimerTest) (2.295)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_no_client (file_based_local_timer_test.FileTimerTest) (1.753)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_timer (file_based_local_timer_test.FileTimerTest) (2.151)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_client_interaction (file_based_local_timer_test.FileTimerTest) (1.895)
Summary
  Pass: 12
  ListingSuccess: 1
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/844425186732666
```

Differential Revision: D38604238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83695
Approved by: https://github.com/d4l3k

53 lines
1.4 KiB
ReStructuredText
Expiration Timers
==================

.. automodule:: torch.distributed.elastic.timer
.. currentmodule:: torch.distributed.elastic.timer

Client Methods
---------------
.. autofunction:: torch.distributed.elastic.timer.configure

.. autofunction:: torch.distributed.elastic.timer.expires

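
The two client methods are typically used together: a client is configured
once per process, and each guarded block is then wrapped in an expiration
scope. The following is a minimal, self-contained sketch of that pattern,
not the torchelastic implementation. ``RecordingClient`` and the local
``configure`` and ``expires`` functions are hypothetical stand-ins written
for illustration, and the assumption that the scope acquires a timer on entry
and releases it on exit should be checked against the function reference above.

```python
import time
from contextlib import contextmanager


class RecordingClient:
    """Hypothetical client that records calls instead of messaging a server."""

    def __init__(self):
        self.calls = []

    def acquire(self, scope_id, expiration_time):
        self.calls.append(("acquire", scope_id, expiration_time))

    def release(self, scope_id):
        self.calls.append(("release", scope_id))


_client = None  # process-wide client, set once by configure()


def configure(client):
    """Stand-in for a configure step: remember which client to use."""
    global _client
    _client = client


@contextmanager
def expires(after, scope="default"):
    """Stand-in for an expiration scope lasting `after` seconds."""
    if _client is None:
        raise RuntimeError("call configure() before expires()")
    _client.acquire(scope, time.time() + after)
    try:
        yield
    finally:
        # Release even if the guarded block raises.
        _client.release(scope)


client = RecordingClient()
configure(client)
with expires(after=60, scope="train_step"):
    pass  # the guarded work would run here
```

Releasing in a ``finally`` clause keeps a failed block from leaving a stale
timer behind, which a watchdog could otherwise mistake for a stuck process.
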
Server/Client Implementations
------------------------------
The following timer server and client pairs are provided by torchelastic.

.. note:: Timer servers and clients must always be implemented and used
          in pairs, since the server and client share a messaging protocol.

The first pair is based on a ``multiprocessing.Queue``, so it can only be
used between a ``multiprocessing`` parent process and its child processes.

.. autoclass:: LocalTimerServer

.. autoclass:: LocalTimerClient

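
To make the queue-based design concrete, here is a rough sketch of the idea:
timer requests travel over a ``multiprocessing.Queue`` and the server side
periodically checks which deadlines have passed. It runs in a single process
for brevity and does not use the torchelastic classes. The tuple layout
``(worker_id, scope_id, expiration_time)`` and the helper names ``drain`` and
``find_expired`` are assumptions made for this example.

```python
import multiprocessing as mp
import queue
import time


def drain(request_queue, timeout=0.5):
    """Collect every request currently available on the queue."""
    requests = []
    while True:
        try:
            requests.append(request_queue.get(timeout=timeout))
        except queue.Empty:
            return requests


def find_expired(timers, now):
    """Return the timers whose deadline is at or before `now`."""
    return {key: deadline for key, deadline in timers.items() if deadline <= now}


request_queue = mp.Queue()
now = time.time()

# In real use a child process would put these; the values are illustrative.
request_queue.put((101, "train_step", now - 1.0))   # already past its deadline
request_queue.put((102, "train_step", now + 60.0))  # still within its deadline

timers = {}
for worker_id, scope_id, expiration_time in drain(request_queue):
    timers[(worker_id, scope_id)] = expiration_time

expired = find_expired(timers, now)
```

Because the queue object has to be handed from the parent to its children,
this transport only works inside one ``multiprocessing`` process tree.
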
Below is another pair of timer server and client that is implemented
based on a named pipe.

.. autoclass:: FileTimerServer

.. autoclass:: FileTimerClient


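
A named pipe lets unrelated processes exchange timer requests through a path
on the file system. The sketch below shows that transport on a POSIX system
using only the standard library. It is not the wire format used by
``FileTimerServer`` and ``FileTimerClient``: the JSON-lines encoding, the
field names, and the ``watchdog.fifo`` path are all assumptions chosen for
illustration.

```python
import json
import os
import tempfile
import time

fifo_dir = tempfile.mkdtemp()
fifo_path = os.path.join(fifo_dir, "watchdog.fifo")  # hypothetical path
os.mkfifo(fifo_path)  # POSIX only

# Server side: open the read end non-blocking so open() does not wait
# for a writer to appear.
read_fd = os.open(fifo_path, os.O_RDONLY | os.O_NONBLOCK)

# Client side: any process that knows the path can open the write end.
write_fd = os.open(fifo_path, os.O_WRONLY)
request = {
    "pid": os.getpid(),
    "scope_id": "train_step",
    "expiration_time": time.time() + 60,
}
os.write(write_fd, (json.dumps(request) + "\n").encode())
os.close(write_fd)

# Server side: read whatever is buffered and decode one request per line.
raw = os.read(read_fd, 4096)
os.close(read_fd)
os.remove(fifo_path)
os.rmdir(fifo_dir)

received = [json.loads(line) for line in raw.decode().splitlines()]
```

Unlike a ``multiprocessing.Queue``, the pipe is addressed by path, so the
writer does not need to be a child of the process that reads it.
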
Writing a custom timer server/client
--------------------------------------

To write your own timer server and client, extend
``torch.distributed.elastic.timer.TimerServer`` for the server and
``torch.distributed.elastic.timer.TimerClient`` for the client. The
``TimerRequest`` object is used to pass messages between
the server and client.

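
The division of labor in a custom pair can be sketched without importing
torch. In the example below, ``Request``, ``InMemoryClient``, and
``InMemoryServer`` are hypothetical stand-ins: they do not subclass the real
base classes, and the method names are modeled on a reading of the abstract
interface that should be verified against the class reference in this
section. Using a negative ``expiration_time`` to signal a release is a
convention chosen for this sketch.

```python
import time
from collections import namedtuple

# Hypothetical message type carrying the three fields a timer needs.
Request = namedtuple("Request", ["worker_id", "scope_id", "expiration_time"])


class InMemoryClient:
    """Client side: turns acquire/release calls into request messages."""

    def __init__(self, worker_id, outbox):
        self.worker_id = worker_id
        self.outbox = outbox  # a plain list standing in for the transport

    def acquire(self, scope_id, expiration_time):
        self.outbox.append(Request(self.worker_id, scope_id, expiration_time))

    def release(self, scope_id):
        # Sketch convention: a negative expiration time means "release".
        self.outbox.append(Request(self.worker_id, scope_id, -1))


class InMemoryServer:
    """Server side: tracks live timers and reports the expired ones."""

    def __init__(self):
        self.timers = {}

    def register_timers(self, requests):
        for request in requests:
            key = (request.worker_id, request.scope_id)
            if request.expiration_time < 0:
                self.timers.pop(key, None)
            else:
                self.timers[key] = request

    def get_expired_timers(self, deadline):
        expired = {}
        for request in self.timers.values():
            if request.expiration_time <= deadline:
                expired.setdefault(request.worker_id, []).append(request)
        return expired

    def clear_timers(self, worker_ids):
        self.timers = {
            key: request
            for key, request in self.timers.items()
            if key[0] not in worker_ids
        }


outbox = []
client = InMemoryClient(worker_id=7, outbox=outbox)
now = time.time()
client.acquire("slow_step", now - 5)   # deadline already passed
client.acquire("fast_step", now + 60)
client.release("fast_step")            # finished in time

server = InMemoryServer()
server.register_timers(outbox)
expired = server.get_expired_timers(deadline=now)
server.clear_timers(set(expired))      # forget workers that were reported
```

The client only produces messages and the server only consumes them, which is
why the two halves must agree on the message format and be written as a pair.
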
.. autoclass:: TimerRequest
   :members:

.. autoclass:: TimerServer
   :members:

.. autoclass:: TimerClient
   :members: