pytorch/docs/source/elastic/timer.rst
Gagan Jain c5e567c573 [Torch][Timer] Adding debug info logging interface for expired timers (#123883)
Summary:
Adding function to log additional debug information before killing the expired watchdog timers.

Additional information like stack trace can be added in the debug function using worker process IDs from expired timers.

Test Plan: buck test mode/opt caffe2/test/distributed/elastic/timer:file_based_timer_test

Differential Revision: D56044153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123883
Approved by: https://github.com/kurman
2024-04-25 01:15:52 +00:00


Expiration Timers
==================

.. automodule:: torch.distributed.elastic.timer
.. currentmodule:: torch.distributed.elastic.timer

Client Methods
---------------

.. autofunction:: torch.distributed.elastic.timer.configure

.. autofunction:: torch.distributed.elastic.timer.expires
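
The relationship between the two client methods can be modeled without importing torch. The snippet below is a simplified sketch, not the torchelastic implementation: ``configure_sketch``, ``expires_sketch``, and ``RecordingClient`` are hypothetical names, and the assumption is that ``expires`` acquires a timer from the configured client on entry and releases it on exit.

```python
import time
from contextlib import contextmanager

# Hypothetical stand-ins for illustration; these are not the torch implementations.
_client = None  # module-level default client, set by configure_sketch()


class RecordingClient:
    """Toy client that records acquire/release calls instead of messaging a server."""

    def __init__(self):
        self.calls = []

    def acquire(self, scope_id, expiration_time):
        self.calls.append(("acquire", scope_id, expiration_time))

    def release(self, scope_id):
        self.calls.append(("release", scope_id))


def configure_sketch(client):
    """Install the client that expires_sketch() uses by default."""
    global _client
    _client = client


@contextmanager
def expires_sketch(after, scope="default"):
    """Acquire a timer that expires `after` seconds from now; release it on exit."""
    if _client is None:
        raise RuntimeError("call configure_sketch(client) first")
    _client.acquire(scope, time.time() + after)
    try:
        yield
    finally:
        # Released even if the block raises, so a finished block never expires.
        _client.release(scope)


client = RecordingClient()
configure_sketch(client)
with expires_sketch(after=60, scope="train_step"):
    pass  # the guarded work goes here
print([call[0] for call in client.calls])
```

In this model the client is configured once per process, and each guarded block only names its deadline.
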

Server/Client Implementations
------------------------------

Below are the timer server and client pairs that are provided by torchelastic.

.. note:: Timer server and clients always have to be implemented and used
          in pairs since there is a messaging protocol between the server
          and client.

Below is a pair of timer server and client that is implemented based on
a ``multiprocessing.Queue``.

.. autoclass:: LocalTimerServer

.. autoclass:: LocalTimerClient

Below is another pair of timer server and client that is implemented
based on a named pipe.

.. autoclass:: FileTimerServer

.. autoclass:: FileTimerClient

Writing a custom timer server/client
--------------------------------------

To write your own timer server and client, extend
``torch.distributed.elastic.timer.TimerServer`` for the server and
``torch.distributed.elastic.timer.TimerClient`` for the client. The
``TimerRequest`` object is used to pass messages between
the server and client.

.. autoclass:: TimerRequest
   :members:

.. autoclass:: TimerServer
   :members:

.. autoclass:: TimerClient
   :members:
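
To give a feel for the bookkeeping a custom server has to do, here is a standard-library-only sketch of the server side of the messaging protocol. It does not subclass ``TimerServer``; ``SketchRequest`` and ``SketchServer`` are hypothetical names, and the field layout and the convention that a negative expiration time means "release" are assumptions made for illustration.

```python
import time
from dataclasses import dataclass


@dataclass
class SketchRequest:
    """Hypothetical message; stands in for what the text calls a TimerRequest."""
    worker_id: int
    scope_id: str
    expiration_time: float  # assumption: a negative value means "release"


class SketchServer:
    """Toy server-side bookkeeping, with no threads, queues, or process killing."""

    def __init__(self):
        self._timers = {}  # (worker_id, scope_id) -> SketchRequest

    def register_timers(self, requests):
        for req in requests:
            key = (req.worker_id, req.scope_id)
            if req.expiration_time < 0:
                self._timers.pop(key, None)  # release an existing timer
            else:
                self._timers[key] = req  # acquire a new timer or extend one

    def get_expired_timers(self, deadline):
        """Group timers whose expiration is at or before `deadline` by worker."""
        expired = {}
        for (worker_id, _), req in self._timers.items():
            if req.expiration_time <= deadline:
                expired.setdefault(worker_id, []).append(req)
        return expired


now = time.time()
server = SketchServer()
server.register_timers([
    SketchRequest(101, "load_data", now - 1),    # already past its deadline
    SketchRequest(102, "train_step", now + 60),  # still within its deadline
    SketchRequest(103, "eval", now - 5),
    SketchRequest(103, "eval", -1),              # released before the check
])
expired = server.get_expired_timers(deadline=now)
print(sorted(expired))  # worker ids with at least one expired timer
```

A real server would additionally receive these requests over its transport and act on the workers whose timers expired; the sketch only shows why the client and server must agree on the message format.
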

Debug info logging
-------------------

.. automodule:: torch.distributed.elastic.timer.debug_info_logging

.. autofunction:: torch.distributed.elastic.timer.debug_info_logging.log_debug_info_for_expired_timers
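
As a rough illustration of what such a logging hook might do, the sketch below logs one line per worker with its expired timers. It uses only the standard library; ``log_expired_timers_sketch`` is a hypothetical function, and its parameters (a run identifier and a mapping from worker process ID to expired scope names) are assumptions rather than the documented signature.

```python
import logging

logger = logging.getLogger("timer_debug_sketch")


def log_expired_timers_sketch(run_id, expired_timers):
    """Hypothetical hook: log expired timers per worker before workers are reaped.

    `expired_timers` is assumed to map a worker process ID to the names of
    its expired timer scopes.
    """
    messages = []
    for worker_pid, scopes in sorted(expired_timers.items()):
        message = f"run={run_id} worker_pid={worker_pid} expired={sorted(scopes)}"
        logger.info(message)
        messages.append(message)
    return messages


logging.basicConfig(level=logging.INFO)
messages = log_expired_timers_sketch(
    "run-0", {4242: ["train_step"], 4243: ["load_data", "eval"]}
)
```

Because the hook sees worker process IDs, it is a natural place to attach extra diagnostics such as a stack trace for each worker.
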