12 Commits

06ebe2d5bc
Add watchdog to TorchElastic agent and trainers (#84081)
Summary: D38604238

3b11b80fc3
Named pipe based watchdog timer (#83695)
Summary: This diff implements a named pipe based watchdog timer (`FileTimerClient` and `FileTimerServer`). It is similar to the existing `LocalTimerClient` and `LocalTimerServer` (https://fburl.com/code/j4b9pyya). The motivation is the need to handle various timeout issues: the training process occasionally gets stuck, and we need a proper watchdog to monitor the liveness of the training processes. This timer allows the TorchElastic agent (as the watchdog) to monitor the progress of the training processes that it spawned. If a timeout occurs, the TorchElastic agent can take action, such as killing the stuck process and creating a core dump for it. `LocalTimerClient` and `LocalTimerServer` require a `multiprocessing.Queue()` to work, so they can only be used between `multiprocessing` parent and child processes. `FileTimerClient` and `FileTimerServer` do not have that limitation.

Test Plan:

### Unit Test

```
buck test mode/opt caffe2/test/distributed/elastic/timer:file_based_timer_test
```

```
RemoteExecution session id: reSessionID-06d70a77-043c-4d9d-b0f2-94c24460740a-tpx
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/844425186732666
✓ ListingSuccess: caffe2/test/distributed/elastic/timer:file_based_timer_test : 12 tests discovered (2.177)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_happy_path (file_based_local_timer_test.FileTimerTest) (2.463)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_expired_timers (file_based_local_timer_test.FileTimerServerTest) (1.889)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_send_request_release (file_based_local_timer_test.FileTimerServerTest) (1.700)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_valid_timers (file_based_local_timer_test.FileTimerServerTest) (1.873)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_watchdog_call_count (file_based_local_timer_test.FileTimerServerTest) (1.715)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_watchdog_empty_queue (file_based_local_timer_test.FileTimerServerTest) (1.609)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_exception_propagation (file_based_local_timer_test.FileTimerTest) (1.633)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_multiple_clients_interaction (file_based_local_timer_test.FileTimerTest) (2.189)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_get_timer_recursive (file_based_local_timer_test.FileTimerTest) (2.295)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_no_client (file_based_local_timer_test.FileTimerTest) (1.753)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_timer (file_based_local_timer_test.FileTimerTest) (2.151)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_client_interaction (file_based_local_timer_test.FileTimerTest) (1.895)

Summary
Pass: 12
ListingSuccess: 1
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/844425186732666
```

Differential Revision: D38604238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83695

Approved by: https://github.com/d4l3k
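As an illustration of how this watchdog is meant to be used, here is a minimal sketch in the spirit of the existing `LocalTimer` usage; the `FileTimerServer`/`FileTimerClient` constructor arguments are assumptions from the description above and may differ between PyTorch versions (newer releases also take a `run_id`):

```python
import time

import torch.distributed.elastic.timer as timer


def train_step():
    time.sleep(0.1)  # stand-in for real training work


# Watchdog side (in practice the TorchElastic agent): serve expiration
# requests that arrive over the named pipe.
server = timer.FileTimerServer("/tmp/watchdog_pipe", max_interval=1.0)
server.start()

# Trainer side: install a process-wide client, then guard a section that
# might hang. If the block outlives the deadline, the server reaps it.
timer.configure(timer.FileTimerClient("/tmp/watchdog_pipe"))
with timer.expires(after=60):
    train_step()
```

Because the channel is a named pipe rather than the `multiprocessing.Queue()` used by `LocalTimerClient`/`LocalTimerServer`, the server and client do not need to be `multiprocessing` parent and child processes.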

3900509b7d
(torchelastic) make --max_restarts explicit in the quickstart and runner docs (#65838)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65838

Closes https://github.com/pytorch/pytorch/pull/65675

The default `--max_restarts` for `torch.distributed.run` was changed from `3` to `0` to make things backwards compatible with `torch.distributed.launch`. Since the default `--max_restarts` used to be greater than `0`, we never documented passing `--max_restarts` explicitly in any of our example code.

Test Plan: N/A, doc change only

Reviewed By: d4l3k

Differential Revision: D31279544

fbshipit-source-id: 98b31e6a158371bc56907552c5c13958446716f9
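For readers who drive the launcher from Python rather than the CLI, a hedged sketch of passing `max_restarts` explicitly through the elastic launcher API; the `LaunchConfig` field names and the rendezvous values are assumptions for illustration, not taken from this commit:

```python
from torch.distributed.launcher.api import LaunchConfig, elastic_launch


def trainer():
    print("hello from a worker")


# Pass max_restarts explicitly, mirroring the docs change: do not rely on
# the default, which is now 0 for torch.distributed.launch compatibility.
config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=2,
    run_id="max-restarts-demo",      # hypothetical job id
    rdzv_backend="c10d",
    rdzv_endpoint="localhost:29400",
    max_restarts=3,
)

if __name__ == "__main__":
    elastic_launch(config, trainer)()
```

The CLI equivalent would be `python -m torch.distributed.run --nproc_per_node=2 --max_restarts=3 main.py`.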

65e6194aeb
Introduce the torchrun entrypoint (#64049)
Summary: This PR introduces a new `torchrun` entrypoint that simply "points" to `python -m torch.distributed.run`. It is shorter and less error-prone to type, and gives a nicer syntax than the rather cryptic `python -m ...` command line. Along with the new entrypoint, the documentation is also updated: places where `torch.distributed.run` is mentioned are replaced with `torchrun`.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64049

Reviewed By: cbalioglu

Differential Revision: D30584041

Pulled By: kiukchung

fbshipit-source-id: d99db3b5d12e7bf9676bab70e680d4b88031ae2d
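Either spelling of the launcher hands each worker its rank through environment variables, so a script written for `torchrun` typically looks like the following sketch (the environment variable names are the documented elastic ones; the `gloo` backend choice is just for illustration):

```python
# launched as: torchrun --nproc_per_node=2 train.py
import os

import torch.distributed as dist


def main():
    # torchrun exports these for every worker it spawns.
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # The default env:// init method also reads the RANK, MASTER_ADDR and
    # MASTER_PORT variables that torchrun sets.
    dist.init_process_group(backend="gloo")
    print(f"rank {dist.get_rank()} of {world_size}, local_rank={local_rank}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```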

73ba166e2a
fix(elastic-docs): Fix elastic launch doc (#62378)
Summary: The documentation link should be https://pytorch.org/docs/stable/elastic/run.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62378

Reviewed By: aivanou

Differential Revision: D30002830

Pulled By: kiukchung

fbshipit-source-id: 34b434acaa10222561df43f6397a2420eef02015

13658b10bb
[torch] Various improvements to torch.distributed.launch and torch.distributed.run (#61294)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61294

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925

* Set `torch.distributed.launch` restarts to 0
* Remove the unnecessary `--use_env` warning
* Move the `--use_env` warnings to `torch.distributed.launch`
* Make the default log level WARNING
* Add a new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error propagation
* Set the default events handler to `null`, which does not print events to the console
* Add a reference from `torch.distributed.launch` to `torch.distributed.run`
* Set the correct preexec function that sends SIGTERM to child processes when the parent dies

Issues resolved: https://github.com/pytorch/pytorch/issues/60716 https://github.com/pytorch/pytorch/issues/60754

Test Plan: sandcastle

python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
python -m torch.distributed.run --nproc_per_node 2 main.py -> uses the torchelastic default, 0 restarts
python -m torch.distributed.launch --nproc_per_node=4 --use_env --no_python main.py -> produces an error
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py -> no warning
python -m torch.distributed.launch --nproc_per_node=4 --no_python main.py -> warning

Output of running torch.distributed.launch without --use_env:

$path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torch.distributed.run. Note that --use_env is set by default in torch.distributed.run. If your script expects `--local_rank` argument to be set, please change it to read from `os.environ('LOCAL_RANK')` instead.

New section: {F628923078} {F628974089}

Reviewed By: cbalioglu

Differential Revision: D29559553

fbshipit-source-id: 03ed9ba638bf154354e1530ffc964688431edf6b
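The FutureWarning above asks scripts to stop relying on the `--local_rank` argument; a small sketch of a script that supports both launchers during the transition (note that the idiomatic read is `os.environ["LOCAL_RANK"]`, not the `os.environ('LOCAL_RANK')` pseudo-syntax the warning prints):

```python
import argparse
import os

# Old style: torch.distributed.launch (without --use_env) passes
# --local_rank=N as a command-line argument to the script.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# New style: torch.distributed.run sets LOCAL_RANK in the environment.
# Prefer the env var when present, fall back to the argument otherwise.
local_rank = int(os.environ.get("LOCAL_RANK", args.local_rank))
print(f"local_rank={local_rank}")
```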

ccfdb30644
Revert D29413019: [torch] Various improvements to torch.distributed.launch and torch.distributed.run
Test Plan: revert-hammer

Differential Revision: D29413019

4e181dfc35
[torch] Various improvements to torch.distributed.launch and torch.distributed.run (#60925)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925

* Set `torch.distributed.launch` restarts to 0
* Remove the unnecessary `--use_env` warning
* Move the `--use_env` warnings to `torch.distributed.launch`
* Make the default log level WARNING
* Add a new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error propagation
* Set the default events handler to `null`, which does not print events to the console
* Add a reference from `torch.distributed.launch` to `torch.distributed.run`
* Set the correct preexec function that sends SIGTERM to child processes when the parent dies

Issues resolved: https://github.com/pytorch/pytorch/issues/60716 https://github.com/pytorch/pytorch/issues/60754

Test Plan: sandcastle

python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
python -m torch.distributed.run --nproc_per_node 2 main.py -> uses the torchelastic default, 0 restarts
python -m torch.distributed.launch --nproc_per_node=4 --use_env --no_python main.py -> produces an error
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py -> no warning
python -m torch.distributed.launch --nproc_per_node=4 --no_python main.py -> warning

Output of running torch.distributed.launch without --use_env:

$path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torch.distributed.run. Note that --use_env is set by default in torch.distributed.run. If your script expects `--local_rank` argument to be set, please change it to read from `os.environ('LOCAL_RANK')` instead.

New section: {F628923078} {F628974089}

Reviewed By: kiukchung, cbalioglu

Differential Revision: D29413019

fbshipit-source-id: 323bfbad9d0e4aba3b10ddd7a243ca6e48169630

028f2f62ac
[torch/elastic] Update the rendezvous docs (#58160)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58160

This PR updates the Torch Distributed Elastic documentation with references to the new `c10d` backend.

ghstack-source-id: 128783809

Test Plan: Visually verified the correct rendering of the docs.

Reviewed By: tierex

Differential Revision: D28384996

fbshipit-source-id: a40b0c37989ce67963322565368403e2be5d2592
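Since the doc change centers on the `c10d` rendezvous backend, here is a hedged sketch of selecting that backend programmatically; `get_rendezvous_handler` and the `RendezvousParameters` fields are my reading of the elastic rendezvous API and should be treated as assumptions:

```python
from torch.distributed.elastic.rendezvous import RendezvousParameters
from torch.distributed.elastic.rendezvous.registry import get_rendezvous_handler

# Describe a rendezvous backed by the c10d store. The endpoint is where the
# underlying TCPStore listens; the values here are illustrative only.
params = RendezvousParameters(
    backend="c10d",
    endpoint="localhost:29400",
    run_id="docs-example",
    min_nodes=1,
    max_nodes=1,
)
handler = get_rendezvous_handler(params)
print(handler.get_backend())  # expected: "c10d"
```

From the command line, the same choice is made with the `--rdzv_backend=c10d --rdzv_endpoint=host:port` flags described in the updated docs.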

b7d674eb21
Revert D28331386: [pytorch][PR] [torch/elastic] Update the rendezvous docs
Test Plan: revert-hammer

Differential Revision: D28331386

e4418b67c7
[torch/elastic] Update the rendezvous docs (#57973)
Summary: This PR updates the rendezvous documentation for the Torch Distributed Elastic section of the PyTorch docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57973

Reviewed By: kiukchung

Differential Revision: D28331386

Pulled By: cbalioglu

fbshipit-source-id: 95dd32146222aaeff246bd3c3d2caf0036a9011b

a80b215a9a
[1/n][torch/elastic] Move torchelastic docs *.rst (#148)
Summary: Pull Request resolved: https://github.com/pytorch/elastic/pull/148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56811

Moves the Sphinx docs (`*.rst` files) from the torchelastic repository to torch. Note: this only moves the rst files; the next step is to link them into the main PyTorch `index.rst` and write a new `examples.rst`.

Reviewed By: H-Huang

Differential Revision: D27974751

fbshipit-source-id: 8ff9f242aa32e0326c37da3916ea0633aa068fc5