pytorch/docs/source/elastic/agent.rst
Bin Chen 06ebe2d5bc Add watchdog to TorchElastic agent and trainers (#84081)
Summary:
D38604238 (3b11b80fc3) introduced a named pipe based watchdog timer.

This diff uses the named pipe based watchdog timer in TorchElastic agent and training worker processes (in the StuckJobDetector class) to allow the TorchElastic agent to detect the stuck of a training process, and kill the process to create a core dump.

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test
```
```
RemoteExecution session id: reSessionID-0bfcacef-24d1-42bc-a1d3-f3058fc42b2f-tpx
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/7318349503394739
    ✓ ListingSuccess: caffe2/test/distributed/elastic/agent/server/test:local_agent_test : 55 tests discovered (22.699)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_barrier_failed_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (47.140)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_distributed_sum_homogeneous_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (49.198)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_happy_function_c10d (local_elastic_agent_test.LocalElasticAgentTest) (46.387)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_happy_function_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (46.094)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_bipolar_function_etcd (local_elastic_agent_test.LocalElasticAgentTest) (106.342)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_correct_rank_assignment_homogeneous_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (64.888)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_correct_rank_assignment_homogeneous_etcd (local_elastic_agent_test.LocalElasticAgentTest) (69.158)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_agent_local_watchdog_setup_enabled_etcd (local_elastic_agent_test.LocalElasticAgentTest) (46.965)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_double_agent_elastic_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (79.626)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_function_with_return_value_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (46.113)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_sad_function_etcd (local_elastic_agent_test.LocalElasticAgentTest) (46.487)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_shutdown_called_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (24.358)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_torch_rpc_c10d (local_elastic_agent_test.LocalElasticAgentTest) (48.216)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_distributed_sum_homogeneous_c10d (local_elastic_agent_test.LocalElasticAgentTest) (48.433)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_torch_rpc_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (47.029)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_simple_dist_sum_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (44.357)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_check_master_addr_port_override_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (45.176)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_check_nccl_async_error_handling_env_default_c10d (local_elastic_agent_test.LocalElasticAgentTest) (45.980)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_simple_dist_sum_c10d (local_elastic_agent_test.LocalElasticAgentTest) (47.151)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_simple_dist_sum_etcd (local_elastic_agent_test.LocalElasticAgentTest) (44.614)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_correct_rank_assignment_heterogeneous_etcd (local_elastic_agent_test.LocalElasticAgentTest) (69.099)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_agent_local_watchdog_setup_enabled_c10d (local_elastic_agent_test.LocalElasticAgentTest) (45.367)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_shutdown_called_etcd (local_elastic_agent_test.LocalElasticAgentTest) (22.804)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_double_agent_elastic_c10d (local_elastic_agent_test.LocalElasticAgentTest) (77.560)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_dummy_compute_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (46.050)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_distributed_sum_heterogeneous_c10d (local_elastic_agent_test.LocalElasticAgentTest) (48.088)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_double_agent_elastic_etcd (local_elastic_agent_test.LocalElasticAgentTest) (77.286)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_double_agent_fault_tolerance_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (50.670)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_check_master_addr_port_override_etcd (local_elastic_agent_test.LocalElasticAgentTest) (45.631)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_distributed_sum_heterogeneous_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (50.867)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_double_agent_fault_tolerance_etcd (local_elastic_agent_test.LocalElasticAgentTest) (51.095)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_happy_function_etcd (local_elastic_agent_test.LocalElasticAgentTest) (45.000)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_sad_function_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (45.197)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_distributed_sum_homogeneous_etcd (local_elastic_agent_test.LocalElasticAgentTest) (46.873)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_shutdown_called_c10d (local_elastic_agent_test.LocalElasticAgentTest) (23.160)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_barrier_failed_etcd (local_elastic_agent_test.LocalElasticAgentTest) (43.632)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_torch_rpc_etcd (local_elastic_agent_test.LocalElasticAgentTest) (44.536)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_bipolar_function_c10d (local_elastic_agent_test.LocalElasticAgentTest) (89.859)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_workers_drift_fail_etcd (local_elastic_agent_test.LocalElasticAgentTest) (48.277)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_check_nccl_async_error_handling_env_c10d (local_elastic_agent_test.LocalElasticAgentTest) (43.930)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_bipolar_function_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (87.677)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_workers_drift_success_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (48.965)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_workers_drift_fail_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (50.143)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_workers_drift_success_etcd (local_elastic_agent_test.LocalElasticAgentTest) (46.781)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_function_with_return_value_etcd (local_elastic_agent_test.LocalElasticAgentTest) (45.152)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_barrier_failed_c10d (local_elastic_agent_test.LocalElasticAgentTest) (44.832)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_function_with_return_value_c10d (local_elastic_agent_test.LocalElasticAgentTest) (45.281)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_correct_rank_assignment_heterogeneous_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (74.968)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_agent_local_watchdog_setup_disabled_c10d (local_elastic_agent_test.LocalElasticAgentTest) (46.141)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_dummy_compute_c10d (local_elastic_agent_test.LocalElasticAgentTest) (44.960)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_dummy_compute_etcd (local_elastic_agent_test.LocalElasticAgentTest) (45.292)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_agent_local_watchdog_setup_disabled_etcd (local_elastic_agent_test.LocalElasticAgentTest) (44.611)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_check_env_function_etcd (local_elastic_agent_test.LocalElasticAgentTest) (44.939)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_distributed_sum_heterogeneous_etcd (local_elastic_agent_test.LocalElasticAgentTest) (47.609)
    ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_sad_function_c10d (local_elastic_agent_test.LocalElasticAgentTest) (45.628)
Summary
  Pass: 55
  ListingSuccess: 1
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/7318349503394739
```
-----------
```
buck test caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test
```
```
RemoteExecution session id: reSessionID-607a0028-4095-4dfc-b657-55f0807fe621-tpx
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/8162774432794818
    ✓ ListingSuccess: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test : 11 tests discovered (39.037)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_thrift_api_called (caffe2.torch.fb.trainer.stuck_detection.tests.collect_quickstack_test.CollectQuickstackTrace) (0.655)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_setup_local_watchdog (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (36.510)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_dont_print_when_job_normal (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (36.727)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_send_watchdog_request_on_batch_callbacks_no_server (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (37.060)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_quickstack_stuck_job (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (37.242)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_setup_local_watchdog_disabled (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (37.243)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_print_stack_trace_when_job_stuck (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (37.590)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_print_when_stuck (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (37.590)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_setup_local_watchdog_no_file (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (37.589)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_signposts_stack_trace_when_job_stuck (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (38.132)
    ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_send_watchdog_request_on_batch_callbacks (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (38.133)
Summary
  Pass: 11
  ListingSuccess: 1
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/8162774432794818
```

Differential Revision: D38930476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84081
Approved by: https://github.com/d4l3k
2022-09-07 00:17:20 +00:00

77 lines
2.1 KiB
ReStructuredText

Elastic Agent
==============
.. automodule:: torch.distributed.elastic.agent
.. currentmodule:: torch.distributed.elastic.agent
Server
--------
.. automodule:: torch.distributed.elastic.agent.server
Below is a diagram of an agent that manages a local group of workers.
.. image:: agent_diagram.jpg
Concepts
--------
This section describes the high-level classes and concepts that
are relevant to understanding the role of the ``agent`` in torchelastic.
.. currentmodule:: torch.distributed.elastic.agent.server
.. autoclass:: ElasticAgent
:members:
.. autoclass:: WorkerSpec
:members:
.. autoclass:: WorkerState
:members:
.. autoclass:: Worker
:members:
.. autoclass:: WorkerGroup
:members:
Implementations
-------------------
Below are the agent implementations provided by torchelastic.
.. currentmodule:: torch.distributed.elastic.agent.server.local_elastic_agent
.. autoclass:: LocalElasticAgent
Extending the Agent
---------------------
To extend the agent you can implement ```ElasticAgent`` directly, however
we recommend you extend ``SimpleElasticAgent`` instead, which provides
most of the scaffolding and leaves you with a few specific abstract methods
to implement.
.. currentmodule:: torch.distributed.elastic.agent.server
.. autoclass:: SimpleElasticAgent
:members:
:private-members:
.. autoclass:: torch.distributed.elastic.agent.server.api.RunResult
Watchdog in the Agent
---------------------
A named pipe based watchdog can be enabled in ```LocalElasticAgent``` if an
environment variable ``TORCHELASTIC_ENABLE_FILE_TIMER`` with value 1 has
been defined in the ```LocalElasticAgent``` process.
Optionally, another environment variable ```TORCHELASTIC_TIMER_FILE```
can be set with a unique file name for the named pipe. If the environment
variable ```TORCHELASTIC_TIMER_FILE``` is not set, ```LocalElasticAgent```
will internally create a unique file name and set it to the environment
variable ```TORCHELASTIC_TIMER_FILE```, and this environment variable will
be propagated to the worker processes to allow them to connect to the same
named pipe that ```LocalElasticAgent``` uses.