This PR enables all PIE rules on ruff, there are already some enabled rules from this family, the new added rules are
```
PIE796 Enum contains duplicate value: {value}
PIE808 Unnecessary start argument in range
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
This PR enables all PIE rules on ruff, there are already some enabled rules from this family, the new added rules are
```
PIE796 Enum contains duplicate value: {value}
PIE808 Unnecessary start argument in range
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
Summary:
we have seen cases where some workers don't receive stop signals, meaning watchdog isn't stopped accordingly. this diff introduces logic to kill the current pid alongside the worker pid
something to note is that there is a case where the worker pid to be killed either doesn't exist or cannot be killed for some reason which will result in the current pid also not being killed. this seems okay since the watchdog loop will just attempt to kill the worker pid on the next iteration but just wanted to point this out
Test Plan: experiment in next diff shows this works
Differential Revision: D65837085
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141060
Approved by: https://github.com/gag1jain
Fixes#143188
The fifo server binds from a thread -- under rare cases the client connects before the server thread starts. This adds a retry when opening the fifo socket in non-blocking mode. This will wait up to 1s for the server to start which balances fast error messages while still providing some wiggle room on the server side.
Test plan:
```
pytest --minutes 10 test/distributed/elastic/timer/file_based_local_timer_test.py -k test_watchdog_call_count -x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143318
Approved by: https://github.com/fegin
Fixes#143188
The fifo server binds from a thread -- under rare cases the client connects before the server thread starts. This adds a retry when opening the fifo socket in non-blocking mode. This will wait up to 1s for the server to start which balances fast error messages while still providing some wiggle room on the server side.
Test plan:
```
pytest --minutes 10 test/distributed/elastic/timer/file_based_local_timer_test.py -k test_watchdog_call_count -x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143318
Approved by: https://github.com/fegin
Summary:
Currently, if watchdog + healthcheck are enabled via knobs but watchdog is disabled via SJD config, we observe a stuck when the watchdog loop attempts to open the watchdog file path. This is because the FileTimerClient that is usually set in TorchElasticWatchdog will not be set since disabling watchdog via SJD config bypasses the TorchElasticWatchdog initialization
The workaround is to update the healthcheck time when calling `get_last_progress_time`
Test Plan:
Logs show that the progress time value is being changed despite client not being set
Behavior when watchdog is enabled with SJD config is left unchanged
Differential Revision: D64733766
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138615
Approved by: https://github.com/gag1jain
Summary: The change involves passing the expired timers to the log_debug_info_for_expired_timers function after to_json() has been applied . This change is made to provide a better debugging experience for the user.
Test Plan: unit tests
Reviewed By: gag1jain
Differential Revision: D62408767
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135913
Approved by: https://github.com/gag1jain
Summary:
There was a regression introduced in https://github.com/pytorch/pytorch/pull/125743 that made `local_addr` no longer used. This fixes that by passing `local_addr` to `RendezvousStoreInfo.build` everywhere it's used.
This also fixes a number of tests allowing them to be run in parallel which hugely sped up the testing cycle as this change touches many different rendezvous implementations. This required a few fixes in unrelated tests.
Test Plan:
Added tests for the common rendezvous implementations that `local_addr` to prevent future regressions.
```
buck2 test @//mode/dev-nosan fbcode//caffe2/test/distributed/elastic/... fbcode//caffe2/torch/distributed/elastic/... -- --stress-runs 3
```
To vet the parallelism changes I also ran with 3 stress runs each to identify flakiness caused by parallelism.
Differential Revision: D62256407
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135262
Approved by: https://github.com/fduwjj, https://github.com/wz337
The current call passes in `['/actual/path']` to os.walk which is a string pointing to no path and thus silently leads to and empty traversal.
There is an unused function just above that handles that, so I guess this is what was supposed to be called.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126103
Approved by: https://github.com/suo
Summary:
Adding function to log additional debug information before killing the expired watchdog timers.
Additional information like stack trace can be added in the debug function using worker process IDs from expired timers.
Test Plan: buck test mode/opt caffe2/test/distributed/elastic/timer:file_based_timer_test
Differential Revision: D56044153
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123883
Approved by: https://github.com/kurman
Summary:
Building hook for external mechanism to monitor the health of torch elastic launcher. Health check server takes dependency on FileTimerServer to check if launcher is healthy or not. It will be always healthy if FileTimerServer is disabled.
Implementation of start_healthcheck_server is unsupported, however tcp/http server can be started on specific port which can monitor the aliveness of worker_watchdog and accordingly take the action.
Test Plan: buck test mode/opt caffe2/test/distributed/elastic/agent/server/test:local_agent_test
Differential Revision: D55837899
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123504
Approved by: https://github.com/kurman
Summary:
Building hook for external mechanism to monitor the health of torch elastic launcher. Health check server takes dependency on FileTimerServer to check if launcher is healthy or not. It will be always healthy if FileTimerServer is disabled.
Implementation of start_healthcheck_server is unsupported, however tcp/http server can be started on specific port which can monitor the aliveness of worker_watchdog and accordingly take the action.
Test Plan: buck test mode/opt caffe2/test/distributed/elastic/agent/server/test:local_agent_test
Differential Revision: D55108182
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122750
Approved by: https://github.com/kurman
Summary:
Minor logging cleanup in distributed library
1. Don't use "f" formatted strings - address linter issues.
2. Nits: Make use of unused `e` (error) in a few logs.
3. Change info->debug as asked in issue #113545
4. Nit: rename log -> logger in a few files for consistency
5. Fix a linter error.
Test Plan:
1. Local build passes.
2. Linter is happy.
Reviewers: wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
Summary:
handling cases where worker process is terminated w/o releasing the timer request, this scenario causes reaping of process at expiry.
removing the non-existent process during clear timer.
Test Plan: unit tests
Differential Revision: D55099773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122324
Approved by: https://github.com/d4l3k
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)
That were reverted due to the conflict with internal source repo.
Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional)
Plus few real fixes:
- Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
- Add missing return statement to `torch._export. deserialize_graph`
- Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
- Add assert it `torch/optim/optimizer.py` that Optional list is not None
TODO (in followup PR):
- Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Unrelated, to bypass CI failures due to the gcc9 dependency update in Ubuntu-18.04:
- Add hack to squash older libstdc++ from conda environment in favor one from OS to `.ci/docker/install_conda.sh`
- Update bazel cuda builds to focal, as with libstdc++-6.0.32 bazel builds loose the ability to catch exceptions (probably because they link with cupti statically, but I could not found where it is done)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)
That were reverted due to the conflict with internal source repo.
Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional)
Plus few real fixes:
- Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
- Add missing return statement to `torch._export. deserialize_graph`
- Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
- Add assert it `torch/optim/optimizer.py` that Optional list is not None
TODO (in followup PR):
- Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
Not sure, how it worked before, but if arguments must be annotated is optional if they are defaulted to None
Towards enabling mypy-1.4.1 in lintrunner
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 5e1b9f4</samp>
> _We annotate the arguments of doom_
> _To show the `None` values of gloom_
> _We improve the type checking and readability_
> _With `Optional` annotations of metal-ity_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105022
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn, https://github.com/Skylion007
Summary: The "kill worker process" event was logged to Scuba only when the worker process was really reaped. We want to add a new event "timer expired", no matter the worker process will be reaped or not. This will help collect data before we enable the JustKnob to kill the worker process on timeout.
Test Plan:
### Unit Test
```
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test
```
```
Test Session: https://www.internalfb.com/intern/testinfra/testrun/7318349508929624
RE: reSessionID-ea464c43-54e7-44f2-942b-14ea8aa98c74 Up: 10.5 KiB Down: 1.1 MiB
Jobs completed: 100. Time elapsed: 3206.9s. Cache hits: 91%. Commands: 11 (cached: 10, remote: 1, local: 0)
Tests finished: Pass 55. Fail 0. Fatal 0. Skip 0. 0 builds failed
```
--------
```
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test
```
```
Test Session: https://www.internalfb.com/intern/testinfra/testrun/6473924579130483
RE: reSessionID-231a47b7-a43d-4c0f-9f73-64713ffcbbd3 Up: 5.7 MiB Down: 1.9 GiB
Jobs completed: 182156. Time elapsed: 282.4s. Cache hits: 99%. Commands: 72112 (cached: 72107, remote: 1, local: 4)
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. 0 builds failed
```
Differential Revision: D39903376
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85861
Approved by: https://github.com/d4l3k
Summary:
This diff implements a named pipe based watchdog timer (`FileTimerClient` and `FileTimerServer`). This is similar to the existing `LocalTimerClient` and `LocalTimerServer` (https://fburl.com/code/j4b9pyya).
The motivation is from the need of handling various timeout issues. The training process occasionally get stuck. We need a proper watchdog to monitor the liveness of the training processes. This timer allows the TorchElastic agent (as the watchdog) to monitor the progress of the training processes that it spawned. If a timeout occurred, he TorchElastic agent can take some action to kill the stuck process and creating a core dump for it.
`LocalTimerClient` and `LocalTimerServer` require a `multiprocessing.Queue()` to work. So they can only be used between `multiprocessing` parent and child processes.
`FileTimerClient` and `FileTimerServer` does not have such limitation.
Test Plan:
### Unit Test
```
buck test mode/opt caffe2/test/distributed/elastic/timer:file_based_timer_test
```
```
RemoteExecution session id: reSessionID-06d70a77-043c-4d9d-b0f2-94c24460740a-tpx
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/844425186732666
✓ ListingSuccess: caffe2/test/distributed/elastic/timer:file_based_timer_test : 12 tests discovered (2.177)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_happy_path (file_based_local_timer_test.FileTimerTest) (2.463)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_expired_timers (file_based_local_timer_test.FileTimerServerTest) (1.889)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_send_request_release (file_based_local_timer_test.FileTimerServerTest) (1.700)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_valid_timers (file_based_local_timer_test.FileTimerServerTest) (1.873)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_watchdog_call_count (file_based_local_timer_test.FileTimerServerTest) (1.715)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_watchdog_empty_queue (file_based_local_timer_test.FileTimerServerTest) (1.609)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_exception_propagation (file_based_local_timer_test.FileTimerTest) (1.633)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_multiple_clients_interaction (file_based_local_timer_test.FileTimerTest) (2.189)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_get_timer_recursive (file_based_local_timer_test.FileTimerTest) (2.295)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_no_client (file_based_local_timer_test.FileTimerTest) (1.753)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_timer (file_based_local_timer_test.FileTimerTest) (2.151)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_client_interaction (file_based_local_timer_test.FileTimerTest) (1.895)
Summary
Pass: 12
ListingSuccess: 1
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/844425186732666
```
Differential Revision: D38604238
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83695
Approved by: https://github.com/d4l3k