Summary:
Calls to this function without an argument capture a stack trace at
import time. This is expensive; we can skip it by passing in a value.
Test Plan: Wait for tests
Differential Revision: D44244345
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97274
Approved by: https://github.com/kiukchung
Summary: Refactor error_handler.py
Test Plan:
In the previous diff, I added a unit test which showcases the failing case. With this diff, we can see that the override works as expected.
Also added a few additional tests to improve coverage.
Reviewed By: wilson100hong
Differential Revision: D37677402
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81408
Approved by: https://github.com/d4l3k
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73477
resolves https://github.com/pytorch/pytorch/issues/73465
This `log.error` is not necessary (and it is also not formatted in a human-friendly way) because we end up re-raising the same exception after recording it into an error file (if present). Eventually Python handles this error the way it handles any other error and writes the trace info to the console. The additional logging produces duplicate error prints on the console, which affects all users whose schedulers do not set the `TORCHELASTIC_ERROR_FILE` env var when calling `torch.distributed.run`.
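A hedged sketch of the resulting behavior (hypothetical names, not the torchelastic source): record the exception into the error file if one is configured, then re-raise and let Python print the single traceback.
```
import json
import os

def record_and_reraise(e: BaseException) -> None:
    # Hypothetical helper illustrating the flow described above.
    error_file = os.environ.get("TORCHELASTIC_ERROR_FILE")
    if error_file is not None:
        with open(error_file, "w") as f:
            json.dump({"message": f"{type(e).__name__}: {e}"}, f)
    # No log.error() here: re-raising lets Python write the traceback
    # to the console exactly once.
    raise e
```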
Test Plan:
Induce an error on the agent process by `kill -15 $AGENT_PID`
```
python -m torch.distributed.run \
  --nproc_per_node 2 \
  --nnodes 1:1 \
  --rdzv_backend c10d \
  --rdzv_endpoint localhost:29500 \
  --monitor_interval 3 test.py
```
Produces
{F704936697}
In contrast to the duplicated error before:
{F704936729}
Reviewed By: d4l3k
Differential Revision: D34501852
fbshipit-source-id: 14fed18a9664130980205007ff104ff15a5fd4f8
(cherry picked from commit 0b7c51ba8834f4a4a5376f585c0795cb43be6521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66182
closes https://github.com/pytorch/pytorch/issues/63174
Does a few things:
1. adds hostname to the error report
2. moves the "root cause" section to the end (presumably since the logs are being "tailed" we want the root cause to appear at the end)
3. moves redundant error info logging to debug
4. makes the border at most 60 characters long and left-justifies the header
NOTE: you HAVE TO annotate your main function with `torch.distributed.elastic.multiprocessing.errors.record`, otherwise no traceback is printed (this is because Python exception propagation does NOT work out of the box for IPC; hence the extra `record` annotation). See the minimal example below.
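A minimal example of the required annotation (the failing `main` body here is illustrative):
```
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    raise RuntimeError("worker failed")  # recorded to the error file, then re-raised

if __name__ == "__main__":
    main()
```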
Test Plan:
Sample
```
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2021-10-05_17:37:22
  host      : devvm4955.prn0.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3296201)
  error_file: /home/kiuk/tmp/elastic/none_3_lsytqe/attempt_0/0/error.json
  traceback :
    Traceback (most recent call last):
      File "/tmp/jetter.xr3_x6qq/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 372, in wrapper
        return f(*args, **kwargs)
      File "main.py", line 28, in main
        raise RuntimeError(args.throws)
    RuntimeError: foobar
============================================================
```
Reviewed By: cbalioglu, aivanou
Differential Revision: D31416492
fbshipit-source-id: 0aeaf6e634e23ce0ea7f6a03b12c8a9ac57246e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65041
Fixes a bug introduced in https://github.com/pytorch/pytorch/pull/64036 where the traceback of the error handler is printed out rather than the traceback of the actual exception.
Fixes https://github.com/pytorch/pytorch/issues/60910
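The essence of the fix, as a hedged sketch (illustrative, not the actual `error_handler.py` code): format the exception's own traceback instead of the handler's call stack.
```
import traceback

def format_py_callstack_buggy(e: BaseException) -> str:
    # Captures the call stack of the *handler* at recording time.
    return "".join(traceback.format_stack())

def format_py_callstack_fixed(e: BaseException) -> str:
    # Formats the traceback attached to the exception itself.
    return "".join(traceback.format_exception(type(e), e, e.__traceback__))

try:
    raise RuntimeError("boom")
except RuntimeError as err:
    print(format_py_callstack_fixed(err))
```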
BEFORE (note that the `py_callstack` is NOT the traceback of the RuntimeError):
```
**************************************************************************************************************************************************************************************************************************************************
run_script_path FAILED
==================================================================================================================================================================================================================================================
Root Cause:
[0]:
  time: 2021-09-14_22:01:06
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 1092727)
  error_file: /tmp/torchelastic_aeyvjbpe/none_8zuih7tj/attempt_0/0/error.json
  msg:
    {
      "message": "RuntimeError: raising error since --throw was specified",
      "extraInfo": {
        "py_callstack": [
          " File \"<string>\", line 1, in <module>\n",
          " File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/spawn.py\", line 116, in spawn_main\n exitcode = _main(fd, parent_sentinel)\n",
          " File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/spawn.py\", line 129, in _main\n return self._bootstrap(parent_sentinel)\n",
          " File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/process.py\", line 315, in _bootstrap\n self.run()\n",
          " File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/process.py\", line 108, in run\n self._target(*self._args, **self._kwargs)\n",
          " File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/multiprocessing/spawn.py\", line 59, in _wrap\n fn(i, *args)\n",
          " File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/api.py\", line 382, in _wrap\n ret = record(fn)(*args_)\n",
          " File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 373, in wrapper\n error_handler.record_exception(e)\n",
          " File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/error_handler.py\", line 86, in record_exception\n _write_error(e, self._get_error_file_path())\n",
          " File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/error_handler.py\", line 26, in _write_error\n \"py_callstack\": traceback.format_stack(),\n"
        ],
        "timestamp": "1631682066"
      }
    }
==================================================================================================================================================================================================================================================
Other Failures:
<NO_OTHER_FAILURES>
**************************************************************************************************************************************************************************************************************************************************
```
AFTER (note the traceback is the traceback of the RuntimeError):
```
********************************************************************************
run_script_path FAILED
================================================================================
Root Cause:
[0]:
  time: 2021-09-14_21:49:25
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 1014681)
  error_file: /tmp/torchelastic_q0zods2c/none_qwmz5dgj/attempt_0/0/error.json
  msg: Traceback (most recent call last):
    File "/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
      return f(*args, **kwargs)
    File "/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/run.py", line 671, in run_script_path
      runpy.run_path(sys.argv[0], run_name="__main__")
    File "/usr/local/fbcode/platform009/lib/python3.8/runpy.py", line 265, in run_path
      return _run_module_code(code, init_globals, run_name,
    File "/usr/local/fbcode/platform009/lib/python3.8/runpy.py", line 97, in _run_module_code
      _run_code(code, mod_globals, init_globals,
    File "/usr/local/fbcode/platform009/lib/python3.8/runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "/home/kiuk/tmp/test.py", line 55, in <module>
      main()
    File "/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
      return f(*args, **kwargs)
    File "/home/kiuk/tmp/test.py", line 25, in main
      raise RuntimeError("raising error since --throw was specified")
  RuntimeError: raising error since --throw was specified
================================================================================
Other Failures:
<NO_OTHER_FAILURES>
********************************************************************************
```
Test Plan:
(see summary for before and after)
`test.py` contents:
```
import argparse
import os
import sys
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.distributed.elastic.multiprocessing.errors import record
def parse_args(argv):
    parser = argparse.ArgumentParser(description="test script")
    parser.add_argument("--init_method", type=str, default="env://")
    parser.add_argument("--backend", type=str, default="gloo")
    parser.add_argument("--throw", action="store_true", default=False)
    parser.add_argument("--exit", action="store_true", default=False)
    return parser.parse_args(argv)

@record
def main():
    args = parse_args(sys.argv[1:])
    if args.throw:
        raise RuntimeError("raising error since --throw was specified")
    if args.exit:
        sys.exit(1)

    init_method = args.init_method
    backend = args.backend
    world_size = int(os.environ["WORLD_SIZE"])
    rank = int(os.environ["RANK"])
    print(f"initializing `{backend}` process group with rank={rank}, world_size={world_size} at {init_method}")
    dist.init_process_group(
        backend=backend,
        init_method=init_method,
        world_size=world_size,
        rank=rank)
    print(f"successfully initialized process group with rank={dist.get_rank()}, world_size={dist.get_world_size()}")

    t = F.one_hot(torch.tensor(rank), num_classes=world_size)
    dist.all_reduce(t)
    derived_world_size = torch.sum(t).item()
    if derived_world_size != world_size:
        raise RuntimeError(f"derived world size: {derived_world_size} != actual world size: {world_size}")
    else:
        print(f"successfully derived world size: {derived_world_size} (expected: {world_size}). Exiting")

if __name__ == "__main__":
    main()
```
run it as:
```
$ python -m torch.distributed.run --nproc_per_node 2 test.py --throw
```
Reviewed By: cbalioglu
Differential Revision: D30953731
fbshipit-source-id: bbea04c59c2aec58969cf44d8e3723d5f8abe8a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64036
This PR slightly revises the implementation of the internal `_format_failure()` method in order to pretty-print the error message captured in a subprocess by the `record` annotation.
With this PR a failure log is formatted as below:
```
Root Cause:
[0]:
  time: 2021-08-26_17:12:07
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 8045)
  error_file: /tmp/torchelastic_6cj9eppm/6d9d844a-6ce4-4838-93ed-1639a9525b00_rec9kuv3/attempt_0/0/error.json
  msg:
    {
      "message": "ValueError: Test",
      "extraInfo": {
        "py_callstack": [
          " File \"/data/home/balioglu/fail.py\", line 7, in <module>\n main()\n",
          " File \"/fsx/users/balioglu/repos/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 373, in wrapper\n error_handler.record_exception(e)\n",
          " File \"/fsx/users/balioglu/repos/pytorch/torch/distributed/elastic/multiprocessing/errors/error_handler.py\", line 86, in record_exception\n _write_error(e, self._get_error_file_path())\n",
          " File \"/fsx/users/balioglu/repos/pytorch/torch/distributed/elastic/multiprocessing/errors/error_handler.py\", line 26, in _write_error\n \"py_callstack\": traceback.format_stack(),\n"
        ],
        "timestamp": "1629997927"
      }
    }
```
in contrast to the old formatting:
```
Root Cause:
[0]:
  time: 2021-08-26_17:15:50
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 9417)
  error_file: /tmp/torchelastic_22pwarnq/19f22638-848c-4b8f-8379-677f34fc44e7_u43o9vs7/attempt_0/0/error.json
  msg: "{'message': 'ValueError: Test', 'extraInfo': {'py_callstack': 'Traceback (most recent call last):\n File "/fsx/users/balioglu/repos/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 351, in wrapper\n return f(*args, **kwargs)\n File "/data/home/balioglu/fail.py", line 5, in main\n raise ValueError("BALIOGLU")\nValueError: BALIOGLU\n', 'timestamp': '1629998150'}}"
```
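The formatting change boils down to something like this hedged sketch (not the actual `_format_failure()` body): parse the recorded message and re-serialize it with indentation instead of embedding the raw string.
```
import json

# The raw message content here is taken from the example above.
raw = '{"message": "ValueError: Test", "extraInfo": {"timestamp": "1629997927"}}'
print(json.dumps(json.loads(raw), indent=2))
```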
ghstack-source-id: 136761768
Test Plan: Run the existing unit tests.
Reviewed By: kiukchung
Differential Revision: D30579025
fbshipit-source-id: 37df0b7c7ec9b620355766122986c2c77e8495ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62823
The diff makes sure that the warning message is printed only when the child processes are still stuck after the termination signal has been sent.
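A hedged sketch of the check (hypothetical helper, not the actual agent code):
```
import time

def wait_and_warn(procs, timeout_s: float = 30.0) -> None:
    # Only warn when children are still alive after the grace period.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not any(p.is_alive() for p in procs):
            return  # all children exited cleanly; no warning needed
        time.sleep(1)
    print(f"WARNING: {sum(p.is_alive() for p in procs)} child process(es) "
          "still alive after termination was requested")
```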
Test Plan:
sandcastle
buck build mode/dev-nosan //caffe2:run
buck-out/gen/caffe2/run.par --nnodes 1 --nproc_per_node 1 main.py
P435691445
Differential Revision: D30046695
fbshipit-source-id: c59170b297f4a0e530906fa5069234303deee938
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61602
The diff introduces signal handlers and a SignalException that is raised when the agent process receives SIGTERM or SIGINT.
When any of these signals is received, the termination handler raises the `SignalException`, which is then processed by the main agent loop. `shutdown(signum)` is invoked, which propagates the received signal to the child processes. A default 30-second timeout is introduced: if the child processes are not able to terminate gracefully within this timeout, the agent process kills them via SIGKILL.
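A hedged sketch of the mechanism (names follow the description above, but the code is illustrative):
```
import os
import signal

class SignalException(Exception):
    def __init__(self, msg: str, sigval: signal.Signals) -> None:
        super().__init__(msg)
        self.sigval = sigval

def _termination_handler(signum, frame):
    # Raised into the main agent loop, which then calls shutdown(signum)
    # to forward the same signal to the child processes.
    raise SignalException(f"Agent process {os.getpid()} got signal {signum}",
                          sigval=signal.Signals(signum))

signal.signal(signal.SIGTERM, _termination_handler)
signal.signal(signal.SIGINT, _termination_handler)
```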
Test Plan: unittests, sandcastle
Reviewed By: cbalioglu
Differential Revision: D29671783
fbshipit-source-id: 3dbca2125676dc18d417cc3e3bb0301fdd42737a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61294
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925
* Set `torch.distributed.launch` restarts to 0
* Remove the unnecessary `--use_env` warning and move the remaining `--use_env` warnings to `torch.distributed.launch`
* Make default log level WARNING
* Add new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error-propagation
* Set the default events handler to `null`, which does not print events to the console
* Add reference from `torch.distributed.launch` to `torch.distributed.run`
* Set the correct preexec function that sends SIGTERM to child processes when the parent dies
Issues resolved:
https://github.com/pytorch/pytorch/issues/60716
https://github.com/pytorch/pytorch/issues/60754
Test Plan:
sandcastle
python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts
python -m torch.distributed.launch --nproc_per_node=4 --use_env --no_python main.py -> produces error
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py -> no warning
python -m torch.distributed.launch --nproc_per_node=4 --no_python main.py -> warning
Output of running torch.distributed.launch without --use_env:
$path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ('LOCAL_RANK')` instead.
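For scripts migrating off `--local_rank`, the suggested change amounts to:
```
import os

# Read the local rank from the environment instead of parsing --local_rank.
local_rank = int(os.environ["LOCAL_RANK"])
print(f"running with local_rank={local_rank}")
```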
New section:
{F628923078}
{F628974089}
Reviewed By: cbalioglu
Differential Revision: D29559553
fbshipit-source-id: 03ed9ba638bf154354e1530ffc964688431edf6b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925
* Set `torch.distributed.launch` restarts to 0
* Remove the unnecessary `--use_env` warning and move the remaining `--use_env` warnings to `torch.distributed.launch`
* Make default log level WARNING
* Add new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error-propagation
* Set the default events handler to `null`, which does not print events to the console
* Add reference from `torch.distributed.launch` to `torch.distributed.run`
* Set the correct preexec function that sends SIGTERM to child processes when the parent dies
Issues resolved:
https://github.com/pytorch/pytorch/issues/60716
https://github.com/pytorch/pytorch/issues/60754
Test Plan:
sandcastle
python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts
python -m torch.distributed.launch --nproc_per_node=4 --use_env --no_python main.py -> produces error
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py -> no warning
python -m torch.distributed.launch --nproc_per_node=4 --no_python main.py -> warning
Output of running torch.distributed.launch without --use_env:
$path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ('LOCAL_RANK')` instead.
New section:
{F628923078}
{F628974089}
Reviewed By: kiukchung, cbalioglu
Differential Revision: D29413019
fbshipit-source-id: 323bfbad9d0e4aba3b10ddd7a243ca6e48169630
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59712
When a worker process fails in fb due to a signal, the TerminationHandler writes an error reply file. Recently the error reply file was changed for MAST jobs: the JSON value of ``timestamp`` is a string, even though in the thrift struct it is an int: https://fburl.com/diffusion/upa228u5
This diff adds support for casting a str timestamp to int.
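A hedged sketch of the cast (hypothetical helper, not the actual error-handling code):
```
def get_timestamp(reply: dict) -> int:
    # Some reply files serialize the thrift int field as a JSON string,
    # so cast defensively; int() accepts both forms.
    return int(reply.get("timestamp", 0))

assert get_timestamp({"timestamp": "1629997927"}) == 1629997927
assert get_timestamp({"timestamp": 1629997927}) == 1629997927
```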
Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test
Reviewed By: suphoff
Differential Revision: D28995827
fbshipit-source-id: 333448cfb4d062dc7fe751ef5839e66bfcb3ba00
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57217
In the torch multiprocessing error handler, we try to remove the error file if it already exists. Before removing it, we try to log its contents, on the assumption that they are valid JSON.
However, in some cases they aren't, and we end up not clearing the file.
Let's handle this error and make sure that the file is cleaned up regardless of its contents.
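A hedged sketch of the resulting behavior (hypothetical names): log the stale contents if they parse, but remove the file regardless.
```
import json
import os

def _remove_stale_error_file(path: str) -> None:
    try:
        with open(path) as f:
            print(f"removing stale error file: {json.load(f)}")
    except (OSError, json.JSONDecodeError):
        # Contents were unreadable or not valid JSON; still clean up.
        print(f"removing stale error file with unparsable contents: {path}")
    finally:
        if os.path.exists(path):
            os.remove(path)
```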
Reviewed By: devashisht
Differential Revision: D28041470
fbshipit-source-id: da96d11b8f7091715cf0152cccd3ecc08b688eae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56739
The diff makes several tiny changes:
* Add logs for each worker error file destination
* Make sure log_dir is propagated from the launcher
* Make ProcessFailure initialization error non-fatal.
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test
https://fburl.com/tupperware/0nizb9z8
Reviewed By: borovsky-d, wilson100hong
Differential Revision: D27952596
fbshipit-source-id: 69582bf4be47758def4008f2abf82d123294cd1a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56386
The diff resolves a bug around incorrect handler resolution:
`_create_static_handler` pointed towards etcd, and `_create_etcd_handler` pointed towards static.
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:test_launcher
Added test_launcher to the ci/cd tests
Reviewed By: cbalioglu
Differential Revision: D27858897
fbshipit-source-id: 440155789958c091ce5755e7c9524e4bb704203a
Summary: `Redirects` was renamed to `Std` in `torch.distributed.elastic.multiprocessing.api`. Pointed out by a user in https://github.com/pytorch/elastic/issues/147.
Test Plan: N/A just doc change
Reviewed By: tierex
Differential Revision: D27866614
fbshipit-source-id: 9fb901aae7ebe11cde13000a1c118de527f34400
Summary:
As this diff shows, currently there are a couple hundred instances of raw `noqa` in the codebase, which just ignore all errors on a given line. That isn't great, so this PR changes all existing instances of that antipattern to qualify the `noqa` with respect to a specific error code, and adds a lint to prevent more of this from happening in the future.
Interestingly, some of the examples the `noqa` lint catches are genuine attempts to qualify the `noqa` with a specific error code, such as these two:
```
test/jit/test_misc.py:27: print(f"{hello + ' ' + test}, I'm a {test}") # noqa E999
test/jit/test_misc.py:28: print(f"format blank") # noqa F541
```
However, those are still wrong because they are [missing a colon](https://flake8.pycqa.org/en/3.9.1/user/violations.html#in-line-ignoring-errors), which actually causes the error code to be completely ignored:
- If you change them to anything else, the warnings will still be suppressed.
- If you add the necessary colons then it is revealed that `E261` was also being suppressed, unintentionally:
```
test/jit/test_misc.py:27:57: E261 at least two spaces before inline comment
test/jit/test_misc.py:28:35: E261 at least two spaces before inline comment
```
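For illustration, the correctly qualified form versus the bare form:
```
print(f"format blank")  # noqa: F541
# The qualified form above suppresses only F541. A bare "# noqa" (no colon,
# no code) would suppress every error reported on that line.
```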
I did try using [flake8-noqa](https://pypi.org/project/flake8-noqa/) instead of a custom `git grep` lint, but it didn't seem to work. This PR is definitely missing some of the functionality that flake8-noqa is supposed to provide, though, so if someone can figure out how to use it, we should do that instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56272
Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI run (before this PR was finished) failed:
- https://github.com/pytorch/pytorch/runs/2365189927
Reviewed By: janeyx99
Differential Revision: D27830127
Pulled By: samestep
fbshipit-source-id: d6dcf4f945ebd18cd76c46a07f3b408296864fcb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55412
The diff resolves a bug where worker processes could exit before the torchelastic process reads their return values. This is a rare event, but it can still happen, e.g. https://fb.workplace.com/groups/319878845696681/permalink/512409069776990/
When users want to return a torch.Tensor object from the worker process, torchelastic multiprocessing will fail. Currently the worker process finishes its job right after it writes output to the IPC queue, without confirmation from the receiver process. When this happens, the underlying channel between the worker and the torchelastic process can already be closed (in the case of mp.SimpleQueue the channel is backed by file descriptors, which is why we see FileNotFoundException: since the worker process finished execution, the file descriptor got deleted and the torchelastic process cannot find it).
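A much-simplified, hedged sketch of the race (illustrative; the real failure involves tensors whose shared-memory file descriptors vanish when the worker exits early):
```
import multiprocessing as mp

def worker(q):
    q.put("result")
    # The worker may exit right after this put, before the parent reads;
    # resources backing the value (e.g. a tensor's shared-memory fd)
    # can be torn down with it.

if __name__ == "__main__":
    q = mp.SimpleQueue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    result = q.get()  # consume the value before the worker is reaped
    p.join()
    print(result)
```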
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test
User workflow: f263531643
Reviewed By: cbalioglu
Differential Revision: D27602838
fbshipit-source-id: 29871178232e3af4ad3dec406c234aba9c5faba1
Summary:
The diff resolves a bug where worker processes could exit before the torchelastic process reads their return values. This is a rare event, but it can still happen, e.g. https://fb.workplace.com/groups/319878845696681/permalink/512409069776990/
When users want to return a torch.Tensor object from the worker process, torchelastic multiprocessing will fail. Currently the worker process finishes its job right after it writes output to the IPC queue, without confirmation from the receiver process. When this happens, the underlying channel between the worker and the torchelastic process can already be closed (in the case of mp.SimpleQueue the channel is backed by file descriptors, which is why we see FileNotFoundException: since the worker process finished execution, the file descriptor got deleted and the torchelastic process cannot find it).
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test
User workflow: f263531643
Reviewed By: cbalioglu, wilson100hong
Differential Revision: D27572158
fbshipit-source-id: 9a360468acc98d85d587ebf223e7e96d4b43fe4b