pytorch/docs/source
Bin Chen 3b11b80fc3 Named pipe based watchdog timer (#83695)
Summary:
This diff implements a named pipe based watchdog timer (`FileTimerClient` and `FileTimerServer`). This is similar to the existing `LocalTimerClient` and `LocalTimerServer` (https://fburl.com/code/j4b9pyya).

The motivation is the need to handle various timeout issues. Training processes occasionally get stuck, so we need a proper watchdog to monitor their liveness. This timer allows the TorchElastic agent (as the watchdog) to monitor the progress of the training processes that it spawned. If a timeout occurs, the TorchElastic agent can take action to kill the stuck process and create a core dump for it.

`LocalTimerClient` and `LocalTimerServer` require a `multiprocessing.Queue()` to work, so they can only be used between a `multiprocessing` parent and its child processes.

`FileTimerClient` and `FileTimerServer` do not have this limitation.
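
For readers unfamiliar with the mechanism, here is a minimal, standard-library-only sketch of a watchdog that talks over a named pipe. It is not the code from this diff and does not use the `FileTimerClient` / `FileTimerServer` API. The function names, the `pid,scope,deadline` line format, and the "negative deadline means release" convention are assumptions made for illustration. It also assumes a POSIX system and that `scope` contains no commas.

```python
import os
import tempfile
import time


def make_pipe(path):
    """Create the FIFO if it does not exist yet (POSIX only)."""
    if not os.path.exists(path):
        os.mkfifo(path)


def send_request(path, pid, scope, deadline):
    """Client side: write one 'pid,scope,deadline' line into the pipe.

    Opening with O_NONBLOCK raises OSError if no reader has the pipe open,
    so a worker never hangs when the watchdog is absent.
    """
    fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
    try:
        os.write(fd, f"{pid},{scope},{deadline}\n".encode())
    finally:
        os.close(fd)


def read_requests(fd):
    """Server side: drain the pipe and return (pid, scope, deadline) tuples."""
    try:
        data = os.read(fd, 65536)
    except BlockingIOError:
        # A writer is connected but has not written anything yet.
        return []
    requests = []
    for line in data.decode().splitlines():
        pid, scope, deadline = line.split(",")
        requests.append((int(pid), scope, float(deadline)))
    return requests


def apply_requests(timers, requests):
    """Record acquired timers; a negative deadline releases the timer."""
    for pid, scope, deadline in requests:
        if deadline < 0:
            timers.pop((pid, scope), None)
        else:
            timers[(pid, scope)] = deadline


def expired_pids(timers, now):
    """Return the set of pids holding at least one timer past its deadline."""
    return {pid for (pid, _scope), deadline in timers.items() if deadline <= now}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        pipe_path = os.path.join(tmp, "watchdog.pipe")
        make_pipe(pipe_path)
        # The watchdog opens the read end first, without blocking.
        reader = os.open(pipe_path, os.O_RDONLY | os.O_NONBLOCK)
        timers = {}
        # A worker reports a deadline that has already passed.
        send_request(pipe_path, os.getpid(), "train_step", time.time() - 1)
        apply_requests(timers, read_requests(reader))
        print("stuck pids:", expired_pids(timers, time.time()))
        os.close(reader)
```

Because the pipe is addressed by a filesystem path, any process that knows the path can report to the watchdog, which is the property that removes the parent/child restriction described above. A real watchdog would also need to handle partial lines, stale pipe files, and what to do with an expired pid.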

Test Plan:
### Unit Test
```
buck test mode/opt caffe2/test/distributed/elastic/timer:file_based_timer_test
```
```
RemoteExecution session id: reSessionID-06d70a77-043c-4d9d-b0f2-94c24460740a-tpx
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/844425186732666
    ✓ ListingSuccess: caffe2/test/distributed/elastic/timer:file_based_timer_test : 12 tests discovered (2.177)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_happy_path (file_based_local_timer_test.FileTimerTest) (2.463)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_expired_timers (file_based_local_timer_test.FileTimerServerTest) (1.889)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_send_request_release (file_based_local_timer_test.FileTimerServerTest) (1.700)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_valid_timers (file_based_local_timer_test.FileTimerServerTest) (1.873)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_watchdog_call_count (file_based_local_timer_test.FileTimerServerTest) (1.715)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_watchdog_empty_queue (file_based_local_timer_test.FileTimerServerTest) (1.609)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_exception_propagation (file_based_local_timer_test.FileTimerTest) (1.633)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_multiple_clients_interaction (file_based_local_timer_test.FileTimerTest) (2.189)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_get_timer_recursive (file_based_local_timer_test.FileTimerTest) (2.295)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_no_client (file_based_local_timer_test.FileTimerTest) (1.753)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_timer (file_based_local_timer_test.FileTimerTest) (2.151)
    ✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_client_interaction (file_based_local_timer_test.FileTimerTest) (1.895)
Summary
  Pass: 12
  ListingSuccess: 1
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/844425186732666
```

Differential Revision: D38604238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83695
Approved by: https://github.com/d4l3k
2022-08-24 22:16:12 +00:00
_static
_templates Fix left nav (#78552) 2022-06-01 00:49:53 +00:00
community module names are made more consistent with POI page (#83219) 2022-08-16 18:38:08 +00:00
elastic Named pipe based watchdog timer (#83695) 2022-08-24 22:16:12 +00:00
notes Improve autograd custom function docs (#81340) 2022-07-21 19:54:30 +00:00
rpc
scripts [ONNX] Clean up onnx_supported_ops (#79424) 2022-06-23 20:44:51 +00:00
amp.rst Change seperate -> separate (#83056) 2022-08-09 23:11:34 +00:00
autograd.rst Enable Intel® VTune™ Profiler's Instrumentation and Tracing Technology APIs (ITT) to PyTorch (#63289) 2022-07-13 13:50:15 +00:00
backends.rst Update backends.rst (#82525) 2022-08-03 18:33:15 +00:00
benchmark_utils.rst Cleanup all module references in doc (#73983) 2022-03-10 22:26:29 +00:00
bottleneck.rst Enable Intel® VTune™ Profiler's Instrumentation and Tracing Technology APIs (ITT) to PyTorch (#63289) 2022-07-13 13:50:15 +00:00
checkpoint.rst
complex_numbers.rst Add a note on CUDA 11.6 (#80363) 2022-06-27 21:34:24 +00:00
conf.py Revert "[quant][ao_migration] torch.nn.quantized.modules → torch.ao.nn.quantized.modules (#78713)" 2022-08-22 07:32:37 +00:00
config_mod.rst
cpp_extension.rst
cpp_index.rst
cuda.rst Propagate CUDAOutOfMemoryError to Python. (#83146) 2022-08-11 21:32:11 +00:00
cudnn_persistent_rnn.rst
cudnn_rnn_determinism.rst
data.rst [DataLoader] Minor documentation improvement 2022-05-31 15:59:46 +00:00
ddp_comm_hooks.rst Fix two small typos in ddp_comm_hooks.rst (#82047) 2022-07-23 19:10:57 +00:00
deploy.rst Back out "Back out "[torch deploy] Update deploy.rst with working simple example"" (#76713) 2022-05-03 14:12:18 +00:00
distributed.algorithms.join.rst
distributed.elastic.rst
distributed.optim.rst
distributed.rst Add TORCH_CPP_LOG_LEVEL to the docs 2022-05-03 17:01:11 +00:00
distributions.rst
dlpack.rst
docutils.conf
fft.rst Cleanup all module references in doc (#73983) 2022-03-10 22:26:29 +00:00
fsdp.rst
futures.rst
fx.rst CSE Pass and common pass Tests (#81742) 2022-07-22 03:45:09 +00:00
hub.rst
index.rst [maskedtensor] first commit, core and creation (#82836) 2022-08-16 20:10:34 +00:00
jit_builtin_functions.rst
jit_language_reference_v2.rst
jit_language_reference.rst
jit_python_reference.rst
jit_unsupported.rst
jit_utils.rst Create __init__.py (#78629) 2022-06-03 18:14:21 +00:00
jit.rst torch.jit doc link for nvfuser readme.md (#77780) 2022-07-07 23:25:35 +00:00
library.rst Add docs for Python Registration 2022-06-13 23:21:23 +00:00
linalg.rst [Array API] Add linalg.vecdot (#70542) 2022-07-12 14:28:54 +00:00
masked.rst [maskedtensor] first commit, core and creation (#82836) 2022-08-16 20:10:34 +00:00
math-quantizer-equation.png
mobile_optimizer.rst
model_zoo.rst
monitor.rst
multiprocessing.rst
name_inference.rst
named_tensor.rst Add torch.unflatten and improve its docs (#81399) 2022-07-29 15:02:42 +00:00
nested.rst Update NestedTensor docs (#80963) 2022-07-07 22:15:39 +00:00
nn.functional.rst Add Dropout1d module 2022-06-15 14:39:07 +00:00
nn.init.rst update nn.init doc to reflect the no_grad (#80882) 2022-07-07 17:19:29 +00:00
nn.rst Add Dropout1d module 2022-06-15 14:39:07 +00:00
onnx_supported_aten_ops.rst Add list of supported ATen ops by ONNX converter into torch.onnx page 2022-04-07 00:05:44 +00:00
onnx.rst [ONNX] Reland #81953 Type utility for converting among JIT, torch and ONNX data types (#82995) 2022-08-08 23:43:43 +00:00
optim.rst feat: add PolynomialLR scheduler (#82769) 2022-08-10 18:21:00 +00:00
package.rst Fix typos in torch.package documentation (#82994) 2022-08-08 20:19:17 +00:00
pipeline.rst
profiler.rst
quantization-accuracy-debugging.rst quant docs: best practices for quantization accuracy debugging 2022-05-17 12:16:52 +00:00
quantization-backend-configuration.rst quantization: autogenerate quantization backend configs for documentation (#75126) 2022-04-04 22:22:30 +00:00
quantization-support.rst Revert "[quant][ao_migration] torch.nn.quantized.modules → torch.ao.nn.quantized.modules (#78713)" 2022-08-22 07:32:37 +00:00
quantization.rst Revert "[quant][ao_migration] torch.nn.quantized.modules → torch.ao.nn.quantized.modules (#78713)" 2022-08-22 07:32:37 +00:00
random.rst
rpc.rst
sparse.rst Revise sparse docs regarding Sparse Compressed tensors (#82108) 2022-07-29 18:15:09 +00:00
special.rst torch.special.scaled_modified_bessel_k0 (#78900) 2022-06-29 14:53:37 +00:00
storage.rst Rename _Typed/_UntypedStorage to Typed/UntypedStorage and update docs (#82438) 2022-07-30 19:37:08 +00:00
tensor_attributes.rst
tensor_view.rst
tensorboard.rst Cleanup all module references in doc (#73983) 2022-03-10 22:26:29 +00:00
tensors.rst Revise sparse docs regarding Sparse Compressed tensors (#82108) 2022-07-29 18:15:09 +00:00
testing.rst Fix links in torch.testing docs (#80353) 2022-07-11 19:15:53 +00:00
torch.ao.ns._numeric_suite_fx.rst
torch.ao.ns._numeric_suite.rst
torch.overrides.rst Revert "Revert "Implement sym_sizes to create proper IR for sym ints representing tensor sizes (#76836)"" 2022-05-18 18:40:57 +00:00
torch.rst Add torch.unflatten and improve its docs (#81399) 2022-07-29 15:02:42 +00:00
type_info.rst ENH: Convert finfo.tiny to finfo.smallest_normal (#76292) 2022-05-20 00:59:48 +00:00