pytorch/docs/source/elastic/timer.rst
Bin Chen 3b11b80fc3 Named pipe based watchdog timer (#83695)
Summary:
This diff implements a named pipe based watchdog timer (`FileTimerClient` and `FileTimerServer`). This is similar to the existing `LocalTimerClient` and `LocalTimerServer` (https://fburl.com/code/j4b9pyya).

The motivation is from the need of handling various timeout issues. The training process occasionally get stuck. We need a proper watchdog to monitor the liveness of the training processes. This timer allows the TorchElastic agent (as the watchdog) to monitor the progress of the training processes that it spawned. If a timeout occurred, he TorchElastic agent can take some action to kill the stuck process and creating a core dump for it.

`LocalTimerClient` and `LocalTimerServer` require  a `multiprocessing.Queue()` to work. So they can only be used between `multiprocessing` parent and child processes.

`FileTimerClient` and `FileTimerServer` does not have such limitation.

Test Plan:
### Unit Test
```
buck test mode/opt caffe2/test/distributed/elastic/timer:file_based_timer_test
```
```
RemoteExecution session id: reSessionID-06d70a77-043c-4d9d-b0f2-94c24460740a-tpx
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/844425186732666
    ✓ ListingSuccess: caffe2/test/distributed/elastic/timer:file_based_timer_test : 12 tests discovered (2.177)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_happy_path (file_based_local_timer_test.FileTimerTest) (2.463)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_expired_timers (file_based_local_timer_test.FileTimerServerTest) (1.889)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_send_request_release (file_based_local_timer_test.FileTimerServerTest) (1.700)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_valid_timers (file_based_local_timer_test.FileTimerServerTest) (1.873)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_watchdog_call_count (file_based_local_timer_test.FileTimerServerTest) (1.715)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_watchdog_empty_queue (file_based_local_timer_test.FileTimerServerTest) (1.609)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_exception_propagation (file_based_local_timer_test.FileTimerTest) (1.633)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_multiple_clients_interaction (file_based_local_timer_test.FileTimerTest) (2.189)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_get_timer_recursive (file_based_local_timer_test.FileTimerTest) (2.295)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_no_client (file_based_local_timer_test.FileTimerTest) (1.753)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_timer (file_based_local_timer_test.FileTimerTest) (2.151)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_client_interaction (file_based_local_timer_test.FileTimerTest) (1.895)
Summary
  Pass: 12
  ListingSuccess: 1
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/844425186732666
```

Differential Revision: D38604238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83695
Approved by: https://github.com/d4l3k
2022-08-24 22:16:12 +00:00

53 lines
1.4 KiB
ReStructuredText

Expiration Timers
==================
.. automodule:: torch.distributed.elastic.timer
.. currentmodule:: torch.distributed.elastic.timer
Client Methods
---------------
.. autofunction:: torch.distributed.elastic.timer.configure
.. autofunction:: torch.distributed.elastic.timer.expires
Server/Client Implementations
------------------------------
Below are the timer server and client pairs that are provided by torchelastic.
.. note:: Timer server and clients always have to be implemented and used
in pairs since there is a messaging protocol between the server
and client.
Below is a pair of timer server and client that is implemented based on
a ``multiprocess.Queue``.
.. autoclass:: LocalTimerServer
.. autoclass:: LocalTimerClient
Below is another pair of timer server and client that is implemented
based on a named pipe.
.. autoclass:: FileTimerServer
.. autoclass:: FileTimerClient
Writing a custom timer server/client
--------------------------------------
To write your own timer server and client extend the
``torch.distributed.elastic.timer.TimerServer`` for the server and
``torch.distributed.elastic.timer.TimerClient`` for the client. The
``TimerRequest`` object is used to pass messages between
the server and client.
.. autoclass:: TimerRequest
:members:
.. autoclass:: TimerServer
:members:
.. autoclass:: TimerClient
:members: