Expiration Timers
==================

.. automodule:: torch.distributed.elastic.timer
.. currentmodule:: torch.distributed.elastic.timer

Client Methods
---------------
.. autofunction:: torch.distributed.elastic.timer.configure

.. autofunction:: torch.distributed.elastic.timer.expires

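The two client methods are meant to be used together: a client is configured
once per process, and ``expires`` then wraps each block of work that must
finish within a deadline. The following is a minimal, stdlib-only sketch of
that pattern, not torchelastic's implementation. ``RecordingClient`` and the
local ``configure`` and ``expires`` functions are hypothetical stand-ins that
only record the requests a real client would send to its server.

.. code-block:: python

   import time
   from contextlib import contextmanager


   class RecordingClient:
       """Hypothetical stand-in for a timer client; it only records requests."""

       def __init__(self):
           self.requests = []

       def acquire(self, scope_id, expiration_time):
           self.requests.append(("acquire", scope_id, expiration_time))

       def release(self, scope_id):
           self.requests.append(("release", scope_id))


   _client = None  # one client per process, set by configure()


   def configure(client):
       """Register the client that later expires() calls will use."""
       global _client
       _client = client


   @contextmanager
   def expires(after, scope="default"):
       """Acquire a timer on entry and release it on exit."""
       if _client is None:
           raise RuntimeError("configure() must be called before expires()")
       _client.acquire(scope, time.time() + after)
       try:
           yield
       finally:
           # Release even if the wrapped block raises, so no timer lingers.
           _client.release(scope)


   client = RecordingClient()
   configure(client)
   with expires(after=60, scope="train_step"):
       pass  # work that must finish within 60 seconds

With the real module, the same shape would presumably be a call to
``configure`` with one of the clients documented below, followed by
``with expires(after=60):`` around the work to be guarded.
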

Server/Client Implementations
------------------------------
torchelastic provides the following timer server and client pairs.

.. note:: A timer server and its client must always be implemented and used
          as a pair, because the two communicate through a shared messaging
          protocol.

The following server and client pair is implemented on top of
a ``multiprocessing.Queue``.

.. autoclass:: LocalTimerServer

.. autoclass:: LocalTimerClient

The following server and client pair is implemented on top of
a named pipe.

.. autoclass:: FileTimerServer

.. autoclass:: FileTimerClient


Writing a custom timer server/client
--------------------------------------

To write your own timer server and client, extend
``torch.distributed.elastic.timer.TimerServer`` for the server and
``torch.distributed.elastic.timer.TimerClient`` for the client.
``TimerRequest`` objects carry the messages passed between
the server and client.

.. autoclass:: TimerRequest
   :members:

.. autoclass:: TimerServer
   :members:

.. autoclass:: TimerClient
   :members:

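To illustrate how a request object can carry messages between a paired client
and server, here is a simplified, stdlib-only model. It does not subclass the
real base classes: ``Request``, ``QueueClient``, and ``QueueServer`` are
hypothetical names, an in-process ``queue.Queue`` stands in for the transport,
and treating a negative expiration time as a release request is an assumption
of this sketch.

.. code-block:: python

   import queue
   import time
   from dataclasses import dataclass


   @dataclass
   class Request:
       """Hypothetical message, loosely modeled on TimerRequest."""

       worker_id: int
       scope_id: str
       expiration_time: float  # negative means "release" (sketch assumption)


   class QueueClient:
       """Client half of the pair: turns acquire/release into messages."""

       def __init__(self, q, worker_id):
           self._q = q
           self._worker_id = worker_id

       def acquire(self, scope_id, expiration_time):
           self._q.put(Request(self._worker_id, scope_id, expiration_time))

       def release(self, scope_id):
           self._q.put(Request(self._worker_id, scope_id, -1.0))


   class QueueServer:
       """Server half of the pair: tracks timers and reports expired workers."""

       def __init__(self, q):
           self._q = q
           self._timers = {}  # (worker_id, scope_id) -> expiration_time

       def drain(self):
           """Apply every pending request to the timer table."""
           while True:
               try:
                   req = self._q.get_nowait()
               except queue.Empty:
                   return
               key = (req.worker_id, req.scope_id)
               if req.expiration_time < 0:
                   self._timers.pop(key, None)  # release request
               else:
                   self._timers[key] = req.expiration_time

       def expired_workers(self, deadline):
           """Return the sorted ids of workers with a timer at or past deadline."""
           return sorted({w for (w, _), t in self._timers.items() if t <= deadline})


   q = queue.Queue()
   server = QueueServer(q)
   slow = QueueClient(q, worker_id=1)
   fast = QueueClient(q, worker_id=2)

   now = time.time()
   slow.acquire("step", now - 1.0)   # deadline already passed
   fast.acquire("step", now + 60.0)
   fast.release("step")              # finished in time
   server.drain()
   expired = server.expired_workers(deadline=now)

Only worker 1 is reported as expired, because worker 2 released its timer
before the server checked. The sketch also shows why the two halves must be
written together: the server can only interpret messages in the exact form
its client produces.
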

Debug info logging
-------------------

.. automodule:: torch.distributed.elastic.timer.debug_info_logging

.. autofunction:: torch.distributed.elastic.timer.debug_info_logging.log_debug_info_for_expired_timers