Commit Graph

49 Commits

Justin Chu
232b96b6e2 [BE] Enable ruff's UP rules and autoformat distributed/ (#105433)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105433
Approved by: https://github.com/albanD
2023-07-19 14:27:11 +00:00
Edward Z. Yang
b8b840be3d Convert logging f-strings to use % format, part five (#98765)
This does some annoying but simple cases by hand.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98765
Approved by: https://github.com/wanchaol
2023-04-11 13:17:59 +00:00
Edward Z. Yang
5a7aad9681 Convert logging f-strings to use % format, part four (#98705)
This does multi-line concatenated string literals.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98705
Approved by: https://github.com/voznesenskym
2023-04-11 13:17:59 +00:00
Edward Z. Yang
b09722f540 Convert logging f-strings to use % format, part two (#98700)
This hits multi-line logging strings

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98700
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
Edward Z. Yang
9a8f71f23e Convert logging f-strings to use % format (#98697)
Codemod done with
https://gist.github.com/ezyang/2e8b0463cdc6be278478495b23ff0530 with
assistance from ChatGPT.
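
For illustration, a minimal before/after of the kind of rewrite this codemod series performs (the logger and variable names here are hypothetical):

```
import logging

logger = logging.getLogger(__name__)
rank, world_size = 0, 8

# Before: the f-string is formatted eagerly, even when INFO is disabled.
logger.info(f"initialized rank {rank} of {world_size}")

# After: %-style arguments are interpolated lazily, only if the record
# is actually emitted.
logger.info("initialized rank %s of %s", rank, world_size)
```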

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98697
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
Kazuaki Ishizaki
6514d71add Fix typos under torch/distributed directory (#98225)
This PR fixes typos in comments and messages of `.py` files under the `torch/distributed` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98225
Approved by: https://github.com/soulitzer, https://github.com/kit1980
2023-04-05 00:21:33 +00:00
Kazuaki Ishizaki
35fd5c548e Fix typos under torch/distributed directory (#95638)
This PR fixes typos in comments and messages of `.py` files under the torch/distributed directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638
Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980
2023-03-27 21:13:44 +00:00
Jeffrey Dunn
d779dadda1 Remove stack trace captures from import (#97274)
Summary:
Calls to this function without an argument will get a stack trace at
import time. This is expensive; we can skip it by passing in a value.
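
A hypothetical sketch of the pattern being avoided (the function and argument names are made up for illustration, not the actual PyTorch API):

```
import traceback

# Hypothetical sketch: a default computed from the call stack makes every
# call site that omits `caller` pay for a full stack walk -- expensive
# when the call happens at import time.
def register(name, caller=None):
    if caller is None:
        caller = traceback.extract_stack()[-2].name  # stack capture
    return (name, caller)

# Passing an explicit value skips the capture entirely:
register("my_handler", caller="torch.distributed.run")
```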

Test Plan: Wait for tests

Differential Revision: D44244345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97274
Approved by: https://github.com/kiukchung
2023-03-22 18:34:13 +00:00
Aaron Gokaslan
5471621497 [BE] Remove unnecessary dict comprehensions (#97116)
Removes unnecessary dict comprehensions, optimizing the creation of dicts from iterables
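
A minimal sketch of the kind of rewrite this implies (assuming the usual pattern flagged by comprehension lint rules):

```
pairs = [("a", 1), ("b", 2)]

# Before: a dict comprehension that only repackages an iterable of pairs.
d1 = {k: v for k, v in pairs}

# After: the dict constructor does the same thing directly.
d2 = dict(pairs)

assert d1 == d2
```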

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97116
Approved by: https://github.com/kit1980
2023-03-20 00:56:57 +00:00
Horace He
5bbec680d7 Fix usages of contextmanager without finally (#96170)
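
For context, a minimal sketch of the bug class being fixed (a hypothetical example, not code from the PR): a generator-based context manager must put its cleanup in a `finally`, or an exception in the `with` body skips it.

```
from contextlib import contextmanager

@contextmanager
def set_flag(obj, name):
    old = getattr(obj, name)
    setattr(obj, name, True)
    try:
        yield obj
    finally:
        # Without the try/finally, an exception raised inside the `with`
        # body would skip this line and leak the modified state.
        setattr(obj, name, old)
```
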
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96170
Approved by: https://github.com/ngimel, https://github.com/malfet
2023-03-08 20:59:27 +00:00
fduwjj
e98a942399 [PTD] Land 'to_std' utility parser fix #93209 (#94023)
Land https://github.com/pytorch/pytorch/pull/93209 faster.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94023
Approved by: https://github.com/wz337
2023-02-03 09:04:34 +00:00
blzheng
0c1777acec Dynamo benchmark: add CPU specific changes (#88477)
This PR adds some CPU-specific changes:

- Add support for IPEX backend
- https://github.com/pytorch/torchdynamo/issues/1618
- https://github.com/pytorch/torchdynamo/issues/1534
- Enable CPU launcher in runner.py.
- Fix the issue that some environment variables are not supported on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88477
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-01-07 09:26:06 +00:00
Sergii Dymchenko
365071c73c Fix non-existing parameters in docstrings in torch/distributed (#91116)
This is a continuation of https://github.com/pytorch/pytorch/pull/90505
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91116
Approved by: https://github.com/huydhn
2022-12-22 02:37:31 +00:00
Kazuaki Ishizaki
1cd6ebe095 Fix typos in messages under torch (#89049)
This PR fixes typos in messages of `.py` files under the torch directory.
Only in `torch/onnx/symbolic_opset16.py`, it fixes a typo in a comment to make the operator name correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89049
Approved by: https://github.com/lezcano
2022-11-17 04:18:14 +00:00
anjali411
cf2f552cd8 Add __all__ to torch.{fx, distributed, backends} submodules (#85079)
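
A minimal sketch of what these additions look like in a submodule (names are illustrative):

```
# `from pkg.module import *` only picks up names listed in __all__,
# and doc/lint tooling treats everything else as private.
__all__ = ["run"]

def run():
    pass

def _internal_helper():  # excluded from star-imports
    pass
```
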
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85079
Approved by: https://github.com/rohan-varma
2022-09-20 12:51:08 +00:00
Atul Jangra
ba54165392 Make sure that exit code is propagated from child to parent process (#81408)
Summary: Refactor error_handler.py

Test Plan:
In the previous diff, I added a unit test which showcases the failing case. With this diff, we can see that the override works as expected.

Also added a few additional tests for test coverage

Reviewed By: wilson100hong

Differential Revision: D37677402

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81408
Approved by: https://github.com/d4l3k
2022-07-19 18:47:54 +00:00
anjali411
4bf076e964 Add __all__ to torch.distributed, futures, fx, nn, package, benchmark submodules (#80520)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80520
Approved by: https://github.com/rohan-varma
2022-07-08 14:31:24 +00:00
PyTorch MergeBot
58532256e9 Revert "Add __all__ for torch.distributed and fx modules (#80460)"
This reverts commit 5d40c3d5c8.

Reverted https://github.com/pytorch/pytorch/pull/80460 on behalf of https://github.com/malfet due to Broke MacOS testing, see https://github.com/pytorch/pytorch/runs/7105579664?check_suite_focus=true
2022-06-29 16:20:55 +00:00
anjali411
5d40c3d5c8 Add __all__ for torch.distributed and fx modules (#80460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80460
Approved by: https://github.com/albanD, https://github.com/rohan-varma
2022-06-29 02:53:56 +00:00
Michael Suo
fb0f285638 [lint] upgrade mypy to latest version
Fixes https://github.com/pytorch/pytorch/issues/75927.

Had to fix some bugs and add some ignores.

To check if clean:
```
lintrunner --paths-cmd='git grep -Il .' --take MYPY,MYPYSTRICT
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76753

Approved by: https://github.com/malfet
2022-05-03 20:51:34 +00:00
PyTorch MergeBot
3d7428d9ac Revert "[lint] upgrade mypy to latest version"
This reverts commit 9bf18aab94.

Reverted https://github.com/pytorch/pytorch/pull/76753 on behalf of https://github.com/suo
2022-05-03 20:01:18 +00:00
Michael Suo
9bf18aab94 [lint] upgrade mypy to latest version
Fixes https://github.com/pytorch/pytorch/issues/75927.

Had to fix some bugs and add some ignores.

To check if clean:
```
lintrunner --paths-cmd='git grep -Il .' --take MYPY,MYPYSTRICT
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76753

Approved by: https://github.com/malfet
2022-05-03 19:43:28 +00:00
Kiuk Chung
b08309ee0a (torch/elastic) skip logging structured error info if error_file is not set (#73477)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73477

resolves https://github.com/pytorch/pytorch/issues/73465

This `log.error` is not necessary (and it's also not formatted in a human-friendly way) because we end up re-raising the same exception after recording it into an error_file (if present). Eventually Python handles this error the way it handles any other error and writes the trace info to the console. This additional logging produces duplicate error prints on the console, which affects all users whose schedulers do not set the `TORCHELASTIC_ERROR_FILE` env var when calling `torch.distributed.run`.

Test Plan:
Induce an error on the agent process by `kill -15 $AGENT_PID`
```
python -m torch.distributed.run \
    --nproc_per_node 2 \
    --nnodes 1:1 \
    --rdzv_backend c10d \
    --rdzv_endpoint localhost:29500 \
    --monitor_interval 3 test.py
```

Produces a single error printout (internal screenshot attachment {F704936697}), in contrast to the duplicated error before ({F704936729}).

Reviewed By: d4l3k

Differential Revision: D34501852

fbshipit-source-id: 14fed18a9664130980205007ff104ff15a5fd4f8
(cherry picked from commit 0b7c51ba8834f4a4a5376f585c0795cb43be6521)
2022-03-01 19:31:44 +00:00
Brian Muse
8bf3179f6e #71946 Remove Python 3.6 references (#72211)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/71946

This commit removes some bits of code that were hard-coded for Python 3.6 support from the `.circleci` and `torch` folders. It should only be merged if https://github.com/pytorch/pytorch/issues/66462 is complete.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72211

Reviewed By: dagitses, seemethere

Differential Revision: D33982604

Pulled By: musebc

fbshipit-source-id: 8f453bf9909df615addd59538adb369c65484044
(cherry picked from commit 944a9970fe)
2022-02-08 03:46:20 +00:00
s-kumano
4ed02748be fix typo in the docs of multiprocessing (#70448)
Summary:
Fix typo in the docs of multiprocessing.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70448

Reviewed By: gchanan

Differential Revision: D33336962

Pulled By: H-Huang

fbshipit-source-id: 1235703b8ddc26c33dcbc34bd25ac36b11a18923
2021-12-28 09:58:47 -08:00
Kiuk Chung
df11e2d6f9 (torch/elastic) add fqdn hostname to error printout (#66182)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66182

closes https://github.com/pytorch/pytorch/issues/63174

Does a few things:

1. adds hostname to the error report
2. moves the "root cause" section to the end (presumably since the logs are being "tailed" we want the root cause to appear at the end)
3. moves redundant error info logging to debug
4. makes the border max 60 char in length and justifies left for the header

NOTE: YOU HAVE TO annotate your main function with torch.distributed.elastic.multiprocessing.errors.record, otherwise no traceback is printed (this is because Python exception propagation does NOT work out of the box for IPC - hence the extra record annotation).
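
A minimal usage sketch of that annotation (the decorator does live at this import path; the failing body is illustrative):

```
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    # Any exception raised here is recorded to the error file (when
    # TORCHELASTIC_ERROR_FILE is set) before being re-raised, so the
    # agent can surface the real traceback in its summary.
    raise RuntimeError("worker failed")

if __name__ == "__main__":
    main()
```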

Test Plan:
Sample

```
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2021-10-05_17:37:22
  host      : devvm4955.prn0.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3296201)
  error_file: /home/kiuk/tmp/elastic/none_3_lsytqe/attempt_0/0/error.json
  traceback :
  Traceback (most recent call last):
    File "/tmp/jetter.xr3_x6qq/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 372, in wrapper
      return f(*args, **kwargs)
    File "main.py", line 28, in main
      raise RuntimeError(args.throws)
  RuntimeError: foobar

============================================================
```

Reviewed By: cbalioglu, aivanou

Differential Revision: D31416492

fbshipit-source-id: 0aeaf6e634e23ce0ea7f6a03b12c8a9ac57246e9
2021-10-07 01:40:02 -07:00
Kiuk Chung
db134a6843 (torch.distributed.elastic) properly format traceback on error (#65041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65041

Fixes a bug introduced in https://github.com/pytorch/pytorch/pull/64036 where the traceback of the error handler is printed out rather than the traceback of the actual exception.

Fixes https://github.com/pytorch/pytorch/issues/60910

BEFORE (note that the `py_callstack` is NOT the traceback of the RuntimeError):
```
**************************************************************************************************************************************************************************************************************************************************
                                                                                                              run_script_path FAILED
==================================================================================================================================================================================================================================================
Root Cause:
[0]:
  time: 2021-09-14_22:01:06
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 1092727)
  error_file: /tmp/torchelastic_aeyvjbpe/none_8zuih7tj/attempt_0/0/error.json
  msg:
    {
      "message": "RuntimeError: rasing error since --throw was specified",
      "extraInfo": {
        "py_callstack": [
          "  File \"<string>\", line 1, in <module>\n",
          "  File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/spawn.py\", line 116, in spawn_main\n    exitcode = _main(fd, parent_sentinel)\n",
          "  File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/spawn.py\", line 129, in _main\n    return self._bootstrap(parent_sentinel)\n",
          "  File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/process.py\", line 315, in _bootstrap\n    self.run()\n",
          "  File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/process.py\", line 108, in run\n    self._target(*self._args, **self._kwargs)\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/multiprocessing/spawn.py\", line 59, in _wrap\n    fn(i, *args)\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/api.py\", line 382, in _wrap\n    ret = record(fn)(*args_)\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 373, in wrapper\n    error_handler.record_exception(e)\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/error_handler.py\", line 86, in record_exception\n    _write_error(e, self._get_error_file_path())\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/error_handler.py\", line 26, in _write_error\n    \"py_callstack\": traceback.format_stack(),\n"
        ],
        "timestamp": "1631682066"
      }
    }

==================================================================================================================================================================================================================================================
Other Failures:
  <NO_OTHER_FAILURES>
**************************************************************************************************************************************************************************************************************************************************
```

AFTER (note the traceback is the traceback of the RuntimeError):
```
********************************************************************************
                             run_script_path FAILED
================================================================================
Root Cause:
[0]:
  time: 2021-09-14_21:49:25
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 1014681)
  error_file: /tmp/torchelastic_q0zods2c/none_qwmz5dgj/attempt_0/0/error.json
  msg: Traceback (most recent call last):
    File "/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
      return f(*args, **kwargs)
    File "/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/run.py", line 671, in run_script_path
      runpy.run_path(sys.argv[0], run_name="__main__")
    File "/usr/local/fbcode/platform009/lib/python3.8/runpy.py", line 265, in run_path
      return _run_module_code(code, init_globals, run_name,
    File "/usr/local/fbcode/platform009/lib/python3.8/runpy.py", line 97, in _run_module_code
      _run_code(code, mod_globals, init_globals,
    File "/usr/local/fbcode/platform009/lib/python3.8/runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "/home/kiuk/tmp/test.py", line 55, in <module>
      main()
    File "/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
      return f(*args, **kwargs)
    File "/home/kiuk/tmp/test.py", line 25, in main
      raise RuntimeError("rasing error since --throw was specified")
  RuntimeError: rasing error since --throw was specified

================================================================================
Other Failures:
  <NO_OTHER_FAILURES>
********************************************************************************
```

Test Plan:
(see summary for before and after)

`test.py` contents:
```
import argparse
import os
import sys

import torch
import torch.distributed as dist
import torch.nn.functional as F

from torch.distributed.elastic.multiprocessing.errors import record

def parse_args(argv):
    parser = argparse.ArgumentParser(description="test script")
    parser.add_argument("--init_method", type=str, default="env://")
    parser.add_argument("--backend", type=str, default="gloo")
    parser.add_argument("--throw", action="store_true", default=False)
    parser.add_argument("--exit", action="store_true", default=False)
    return parser.parse_args()

@record
def main():
    args = parse_args(sys.argv[1:])

    if args.throw:
        raise RuntimeError("raising error since --throw was specified")

    if args.exit:
        sys.exit(1)

    init_method=args.init_method
    backend=args.backend

    world_size = int(os.environ["WORLD_SIZE"])
    rank = int(os.environ["RANK"])

    print(f"initializing `{backend}` process group with rank={rank}, world_size={world_size} at {init_method}")

    dist.init_process_group(
        backend=backend,
        init_method=init_method,
        world_size=world_size,
        rank=rank)

    print(f"successfully initialized process group with rank={dist.get_rank()}, world_size={dist.get_world_size()}")

    t = F.one_hot(torch.tensor(rank), num_classes=world_size)
    dist.all_reduce(t)
    derived_world_size = torch.sum(t).item()
    if derived_world_size != world_size:
        raise RuntimeError(f"derived world size: {derived_world_size} != actual world size: {world_size}")
    else:
        print(f"sucessfully derived world size: {derived_world_size} (expected: {world_size}). Exiting")

if __name__ == "__main__":
    main()
```

run it as:

```
$ python -m torch.distributed.run --nproc_per_node 2 test.py --throw
```

Reviewed By: cbalioglu

Differential Revision: D30953731

fbshipit-source-id: bbea04c59c2aec58969cf44d8e3723d5f8abe8a8
2021-09-15 12:50:21 -07:00
Can Balioglu
d8d8e4902a [torch/elastic] Pretty print the failure message captured by @record (#64036)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64036

This PR slightly revises the implementation of the internal `_format_failure()` method in order to pretty print the error message captured in a subprocess by the `record` annotation.

With this PR a failure log is formatted as below:

```
Root Cause:
[0]:
  time: 2021-08-26_17:12:07
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 8045)
  error_file: /tmp/torchelastic_6cj9eppm/6d9d844a-6ce4-4838-93ed-1639a9525b00_rec9kuv3/attempt_0/0/error.json
  msg:
    {
      "message": "ValueError: Test",
      "extraInfo": {
        "py_callstack": [
          "  File \"/data/home/balioglu/fail.py\", line 7, in <module>\n    main()\n",
          "  File \"/fsx/users/balioglu/repos/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 373, in wrapper\n    error_handler.record_exception(e)\n",
          "  File \"/fsx/users/balioglu/repos/pytorch/torch/distributed/elastic/multiprocessing/errors/error_handler.py\", line 86, in record_exception\n    _write_error(e, self._get_error_file_path())\n",
          "  File \"/fsx/users/balioglu/repos/pytorch/torch/distributed/elastic/multiprocessing/errors/error_handler.py\", line 26, in _write_error\n    \"py_callstack\": traceback.format_stack(),\n"
        ],
        "timestamp": "1629997927"
      }
    }
```

in contrast to the old formatting:

```
Root Cause:
[0]:
  time: 2021-08-26_17:15:50
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 9417)
  error_file: /tmp/torchelastic_22pwarnq/19f22638-848c-4b8f-8379-677f34fc44e7_u43o9vs7/attempt_0/0/error.json
  msg: "{'message': 'ValueError: Test', 'extraInfo': {'py_callstack': 'Traceback (most recent call last):\n  File "/fsx/users/balioglu/repos/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 351, in wrapper\n    return f(*args, **kwargs)\n  File "/data/home/balioglu/fail.py", line 5, in main\n    raise ValueError("BALIOGLU")\nValueError: BALIOGLU\n', 'timestamp': '1629998150'}}"
```
ghstack-source-id: 136761768

Test Plan: Run the existing unit tests.

Reviewed By: kiukchung

Differential Revision: D30579025

fbshipit-source-id: 37df0b7c7ec9b620355766122986c2c77e8495ae
2021-08-26 13:56:46 -07:00
Aliaksandr Ivanou
e1f81c9321 [torchelastic][multiprocessing] Print warning message only when child processes are stuck (#62823)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62823

The diff makes sure that the warning message is printed only when the child processes are stuck after the termination signal has been sent.

Test Plan:
sandcastle

    buck build mode/dev-nosan //caffe2:run
    buck-out/gen/caffe2/run.par --nnodes 1 --nproc_per_node 1 main.py
P435691445

Differential Revision: D30046695

fbshipit-source-id: c59170b297f4a0e530906fa5069234303deee938
2021-08-05 19:57:31 -07:00
Aliaksandr Ivanou
0c55f1bdec [torchelastic] Improve process termination logic (#61602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61602

The diff introduces signal handlers and a `SignalException` that is raised when the agent process receives SIGTERM or SIGINT.

When any of these signals is received, the termination handler raises the `SignalException`, which is then processed by the main agent loop: `shutdown(signum)` is invoked, propagating the received signal to the child processes. A default 30-second timeout is introduced: if the child processes cannot terminate gracefully within this timeout, the agent process kills them via SIGKILL.
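
A rough sketch of that flow, assuming a `subprocess.Popen`-like child handle (illustrative, not the actual torchelastic implementation):

```
import signal
import time

class SignalException(Exception):
    def __init__(self, signum):
        super().__init__(f"received signal {signum}")
        self.signum = signum

def _termination_handler(signum, frame):
    raise SignalException(signum)  # unwinds into the main agent loop

signal.signal(signal.SIGTERM, _termination_handler)
signal.signal(signal.SIGINT, _termination_handler)

def shutdown(proc, signum, timeout=30):
    # Propagate the signal, then escalate to SIGKILL if the child does
    # not exit within the grace period (30 seconds per the commit).
    proc.send_signal(signum)
    deadline = time.monotonic() + timeout
    while proc.poll() is None and time.monotonic() < deadline:
        time.sleep(0.1)
    if proc.poll() is None:
        proc.kill()
```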

Test Plan: unittests, sandcastle

Reviewed By: cbalioglu

Differential Revision: D29671783

fbshipit-source-id: 3dbca2125676dc18d417cc3e3bb0301fdd42737a
2021-07-23 11:00:15 -07:00
Aliaksandr Ivanou
8296cb37c7 [torchelastic] Set the correct maximum border width
Summary: The diff sets the correct maximum width for the border delimiters between error sections

Test Plan: Example of the uncontrolled border: https://www.internalfb.com/intern/testinfra/diagnostics/7599824415964133.844424970500348.1625590344/

Reviewed By: kiukchung

Differential Revision: D29636814

fbshipit-source-id: 95465d3150066bff82dc7499bb1c63ea4f5ebc2d
2021-07-09 13:29:23 -07:00
Aliaksandr Ivanou
13658b10bb [torch] Various improvements to torch.distributed.launch and torch.distributed.run (#61294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61294

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925

* Set `torch.distributed.launch` restarts to 0
* Remove the unnecessary `--use_env` warning, move `--use_env` warnings
* Move `--use_env` warnings to `torch.distributed.launch`
* Make default log level WARNING
* Add new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error-propagation
* Set default events handler to `null` that does not print events to console
* Add reference from `torch.distributed.launch` to `torch.distributed.run`
* Set correct preexec function that sends SIGTERM to child processes when parent dies

Issues resolved:

https://github.com/pytorch/pytorch/issues/60716
https://github.com/pytorch/pytorch/issues/60754

Test Plan:
sandcastle

    python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
    python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts

    python -m torch.distributed.launch --nproc_per_node=4  --use_env --no_python  main.py -> produces error
    python -m torch.distributed.launch --nproc_per_node=4  --use_env main.py -> no warning
    python -m torch.distributed.launch --nproc_per_node=4  --no_python  main.py ->warning

Output of running torch.distributed.launch without --use_env:

    $path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated
    and will be removed in future. Use torch.distributed.run.
    Note that --use_env is set by default in torch.distributed.run.
    If your script expects `--local_rank` argument to be set, please
    change it to read from `os.environ('LOCAL_RANK')` instead.

New doc section (internal screenshot attachments {F628923078}, {F628974089}).

Reviewed By: cbalioglu

Differential Revision: D29559553

fbshipit-source-id: 03ed9ba638bf154354e1530ffc964688431edf6b
2021-07-08 16:28:06 -07:00
Vitaly Fedyunin
ccfdb30644 Revert D29413019: [torch] Various improvements to torch.distributed.launch and torch.distributed.run
Test Plan: revert-hammer

Differential Revision:
D29413019 (4e181dfc35)

Original commit changeset: 323bfbad9d0e

fbshipit-source-id: 1f8ae4b3d0a23f3eaff28c37e9148efff25fafe2
2021-07-01 08:44:51 -07:00
Aliaksandr Ivanou
4e181dfc35 [torch] Various improvements to torch.distributed.launch and torch.distributed.run (#60925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925

* Set `torch.distributed.launch` restarts to 0
* Remove the unnecessary `--use_env` warning, move `--use_env` warnings
* Move `--use_env` warnings to `torch.distributed.launch`
* Make default log level WARNING
* Add new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error-propagation
* Set default events handler to `null` that does not print events to console
* Add reference from `torch.distributed.launch` to `torch.distributed.run`
* Set correct preexec function that sends SIGTERM to child processes when parent dies

Issues resolved:

https://github.com/pytorch/pytorch/issues/60716
https://github.com/pytorch/pytorch/issues/60754

Test Plan:
sandcastle

    python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
    python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts

    python -m torch.distributed.launch --nproc_per_node=4  --use_env --no_python  main.py -> produces error
    python -m torch.distributed.launch --nproc_per_node=4  --use_env main.py -> no warning
    python -m torch.distributed.launch --nproc_per_node=4  --no_python  main.py ->warning

Output of running torch.distributed.launch without --use_env:

    $path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated
    and will be removed in future. Use torch.distributed.run.
    Note that --use_env is set by default in torch.distributed.run.
    If your script expects `--local_rank` argument to be set, please
    change it to read from `os.environ('LOCAL_RANK')` instead.

New doc section (internal screenshot attachments {F628923078}, {F628974089}).

Reviewed By: kiukchung, cbalioglu

Differential Revision: D29413019

fbshipit-source-id: 323bfbad9d0e4aba3b10ddd7a243ca6e48169630
2021-06-30 23:31:02 -07:00
Aliaksandr Ivanou
1735775662 [Torch] Cast timestamp type to int (#59712)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59712

When a worker process fails in fb due to a signal, the TerminationHandler writes an error reply file. Recently the error reply file was changed for mast jobs: the JSON value of `timestamp` is a string, even though in the thrift struct it is an int: https://fburl.com/diffusion/upa228u5

This diff adds support for casting the str timestamp to int.
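
A tiny sketch of the normalization (hypothetical helper name):

```
def get_timestamp(reply):
    # "timestamp" may arrive as an int or, per the format change above,
    # as a numeric string; int() accepts both.
    return int(reply["timestamp"])

assert get_timestamp({"timestamp": 12}) == 12
assert get_timestamp({"timestamp": "12"}) == 12
```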

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test

Reviewed By: suphoff

Differential Revision: D28995827

fbshipit-source-id: 333448cfb4d062dc7fe751ef5839e66bfcb3ba00
2021-06-09 15:56:37 -07:00
Aliaksandr Ivanou
c22ac14969 [Error-reporting] Set upper boundary on border element (#59311)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59311

The diff sets an upper boundary on the border element when presenting the error message. This is required to avoid unnecessary log contamination.

Test Plan: Example of log contamination: https://www.internalfb.com/fblearner/details/276849996/operator/2942475685?tab=try_27021599785797968

Reviewed By: d4l3k

Differential Revision: D28812745

fbshipit-source-id: 4f491b9acc8cc9831d763f185022879bbbfb4c8a
2021-06-02 12:28:54 -07:00
Atul Jangra
ca814904b4 Handle error reporting when reply file already exists (#57217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57217

In the torch multiprocessing error handler, we try to remove the reply file if it already exists. Before removing it, we try to log its contents, assuming they are valid JSON.
However, in some cases they aren't, and we end up not clearing the file.
Let's handle this error and make sure the file is cleaned up irrespective of its contents.
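
A sketch of the intended behavior (a hypothetical function, not the actual handler code): log the contents on a best-effort basis, but remove the file no matter what.

```
import json
import logging
import os

logger = logging.getLogger(__name__)

def clear_reply_file(path):
    try:
        with open(path) as f:
            logger.warning("removing existing reply file: %s", json.load(f))
    except Exception:
        # Contents were not valid JSON (or unreadable) -- still proceed.
        logger.warning("removing unparseable reply file: %s", path)
    finally:
        if os.path.exists(path):
            os.remove(path)
```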

Reviewed By: devashisht

Differential Revision: D28041470

fbshipit-source-id: da96d11b8f7091715cf0152cccd3ecc08b688eae
2021-04-29 04:57:35 -07:00
Aliaksandr Ivanou
0df574017d Torchelastic: add support for the new error file format (#57084)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57084

The diff adds support for a new error message file format:

    {
        "message":"test",
        "timestamp": 12
    }

Test Plan:
fbcode buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test

example job: tsm_aivanou-torchelastic_distributed_sum_77c0b147

Reviewed By: borovsky-d, wilson100hong

Differential Revision: D28042764

fbshipit-source-id: 4d21c2319654f3460d551d91cbf48568356cf4e8
2021-04-28 00:04:45 -07:00
Aliaksandr Ivanou
0a72904ab4 Torchelastic: make process failure init error non-fatal (#56739)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56739

The diff makes several tiny changes:
* Add logs for each worker error file destination
* Make sure log_dir is propagated from the launcher
* Make ProcessFailure initialization error non-fatal.

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test

    https://fburl.com/tupperware/0nizb9z8

Reviewed By: borovsky-d, wilson100hong

Differential Revision: D27952596

fbshipit-source-id: 69582bf4be47758def4008f2abf82d123294cd1a
2021-04-23 00:49:47 -07:00
Sam Estep
75024e228c Add lint for unqualified type: ignore (#56290)
Summary:
The other half of https://github.com/pytorch/pytorch/issues/56272.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56290

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI runs (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2384511062
- https://github.com/pytorch/pytorch/actions/runs/765036024

Reviewed By: seemethere

Differential Revision: D27867219

Pulled By: samestep

fbshipit-source-id: e648f07b6822867e70833e23ddafe7fb7eaca235
2021-04-21 08:07:23 -07:00
Aliaksandr Ivanou
c5c5230890 Pytorch resolve bug around incorrect rdzv handler resolution (#56386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56386

The diff resolves a bug around incorrect handler resolution:
`_create_static_handler` pointed to etcd, and `_create_etcd_handler` pointed to static.

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:test_launcher

Added test_launcher to the ci/cd tests

Reviewed By: cbalioglu

Differential Revision: D27858897

fbshipit-source-id: 440155789958c091ce5755e7c9524e4bb704203a
2021-04-19 23:50:28 -07:00
Kiuk Chung
023231a2ac [torch/distributed] Fix pydoc for torch.distributed.elastic.multiprocessing (replace Redirect with Std)
Summary: `Redirects` was renamed to `Std` in `torch.distributed.elastic.multiprocessing.api`. Pointed out by a user in https://github.com/pytorch/elastic/issues/147.

Test Plan: N/A just doc change

Reviewed By: tierex

Differential Revision: D27866614

fbshipit-source-id: 9fb901aae7ebe11cde13000a1c118de527f34400
2021-04-19 21:40:16 -07:00
Sam Estep
e3900d2ba5 Add lint for unqualified noqa (#56272)
Summary:
As this diff shows, currently there are a couple hundred instances of raw `noqa` in the codebase, which just ignore all errors on a given line. That isn't great, so this PR changes all existing instances of that antipattern to qualify the `noqa` with respect to a specific error code, and adds a lint to prevent more of this from happening in the future.

Interestingly, some of the examples the `noqa` lint catches are genuine attempts to qualify the `noqa` with a specific error code, such as these two:
```
test/jit/test_misc.py:27:            print(f"{hello + ' ' + test}, I'm a {test}") # noqa E999
test/jit/test_misc.py:28:            print(f"format blank") # noqa F541
```
However, those are still wrong because they are [missing a colon](https://flake8.pycqa.org/en/3.9.1/user/violations.html#in-line-ignoring-errors), which actually causes the error code to be completely ignored:

- If you change them to anything else, the warnings will still be suppressed.
- If you add the necessary colons then it is revealed that `E261` was also being suppressed, unintentionally:
  ```
  test/jit/test_misc.py:27:57: E261 at least two spaces before inline comment
  test/jit/test_misc.py:28:35: E261 at least two spaces before inline comment
  ```

I did try using [flake8-noqa](https://pypi.org/project/flake8-noqa/) instead of a custom `git grep` lint, but it didn't seem to work. This PR is definitely missing some of the functionality that flake8-noqa is supposed to provide, though, so if someone can figure out how to use it, we should do that instead.
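
For reference, the correctly qualified form that this lint enforces:

```
import os  # noqa: F401  (only the unused-import warning is suppressed)

# An unqualified `# noqa` (or a colon-less `# noqa F401`) would silence
# every error on the line instead of just the named one.
x=1  # noqa: E225
```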

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56272

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI run (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2365189927

Reviewed By: janeyx99

Differential Revision: D27830127

Pulled By: samestep

fbshipit-source-id: d6dcf4f945ebd18cd76c46a07f3b408296864fcb
2021-04-19 13:16:18 -07:00
Richard Barnes
6269efde91 Add stricter typing to caffe2/torch/distributed/elastic/multiprocessing/errors/__init__.py (#55848)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55848

Test Plan: Sandcastle

Reviewed By: xush6528

Differential Revision: D27714781

fbshipit-source-id: cff651e04c1e8363a249c7de9de01c33db47f003
2021-04-13 10:47:08 -07:00
Wilson Hong
f665a7f8a1 [pet] Set error code in reply file when child process is terminated by signals.
Summary: Fill the reply file's error code with ProcessFailure's exitcode. This is necessary when a child process is terminated by a signal (e.g. SIGSEGV).

Test Plan:
- Buck test
```
buck test mode/dev-nosan pytorch/elastic/torchelastic/distributed/fb/test:launch_test
buck test mode/dev-nosan caffe2/torch/distributed/elastic/multiprocessing/errors/fb/test:error_handler_fb_test_needed_coverage
```

- TSM
```
fbpkg build -E torchelastic_distributed_sum

buck run mode/dev-nosan //pytorch/elastic/torchelastic/tsm/fb/cli:tsm -- run_ddp --scheduler mast --fbpkg torchelastic_distributed_sum:ecdf31f --nnodes 2 --nproc_per_node 2 --resource T1  --run_cfg hpcIdentity=oncall_dai_pet,hpcClusterUuid=MastNaoTestCluster main.pa
```
https://www.internalfb.com/mast/job/tsm_wilsonhong-torchelastic_distributed_sum_ef3fd8d3

- classy_vision
```
flow-cli canary  pytorch.elastic.examples.classy_vision.main --entitlement gpu_prod --run-as-secure-group oncall_dai_pet --buck-target //fblearner/flow/projects/pytorch/elastic/examples:workflow
```
https://our.intern.facebook.com/intern/fblearner/details/263970380/?notif_channel=cli

Reviewed By: tierex

Differential Revision: D27512554

fbshipit-source-id: 903d25d96655085685f874113826d4627d9a79e4
2021-04-08 09:58:20 -07:00
Aliaksandr Ivanou
f5675f8306 [torchelastic] Make sure torchelastic mp wait for queue to be drained before finishing the process (#55412)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55412

The diff resolves a bug where worker processes could exit before the torchelastic process reads their return values. This is a rare event, but it can still happen, e.g. https://fb.workplace.com/groups/319878845696681/permalink/512409069776990/

When users want to return a torch.Tensor object from a worker process, torchelastic multiprocessing can fail. Currently a worker process finishes its job after writing its output to the IPC queue, without confirmation from the receiver process. When this happens, the underlying channel between the worker and the torchelastic process can already be closed (in the case of mp.SimpleQueue the channel is file descriptors, which is why we see a FileNotFoundException: since the worker process finished execution, the file descriptor was deleted and the torchelastic process cannot find it).
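
An illustrative sketch of the handshake described here (not the torchelastic implementation): the worker holds the IPC channel open until the parent acknowledges receipt.

```
import multiprocessing as mp

def worker(queue, ack):
    queue.put({"rank": 0, "result": 42})
    ack.wait()  # keep the IPC channel alive until the parent has read it

if __name__ == "__main__":
    queue = mp.SimpleQueue()
    ack = mp.Event()
    p = mp.Process(target=worker, args=(queue, ack))
    p.start()
    result = queue.get()  # drain first ...
    ack.set()             # ... then let the worker exit
    p.join()
    print(result)
```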

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test

User workflow: f263531643

Reviewed By: cbalioglu

Differential Revision: D27602838

fbshipit-source-id: 29871178232e3af4ad3dec406c234aba9c5faba1
2021-04-07 09:39:24 -07:00
Brian Hirsh
ae3a876c9c Revert D27572158: [torchelastic] Make sure torchelastic mp wait for queue to be drained before finishing the process
Test Plan: revert-hammer

Differential Revision:
D27572158 (e9c6a51100)

Original commit changeset: 9a360468acc9

fbshipit-source-id: 29f7e2cba3e134bc81fb31b7e1dfceb7c1f9d734
2021-04-06 11:41:55 -07:00
Aliaksandr Ivanou
e9c6a51100 [torchelastic] Make sure torchelastic mp wait for queue to be drained before finishing the process
Summary:
The diff resolves a bug where worker processes could exit before the torchelastic process reads their return values. This is a rare event, but it can still happen, e.g. https://fb.workplace.com/groups/319878845696681/permalink/512409069776990/

When users want to return a torch.Tensor object from a worker process, torchelastic multiprocessing can fail. Currently a worker process finishes its job after writing its output to the IPC queue, without confirmation from the receiver process. When this happens, the underlying channel between the worker and the torchelastic process can already be closed (in the case of mp.SimpleQueue the channel is file descriptors, which is why we see a FileNotFoundException: since the worker process finished execution, the file descriptor was deleted and the torchelastic process cannot find it).

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test

User workflow: f263531643

Reviewed By: cbalioglu, wilson100hong

Differential Revision: D27572158

fbshipit-source-id: 9a360468acc98d85d587ebf223e7e96d4b43fe4b
2021-04-06 11:03:00 -07:00
Kiuk Chung
b03c92a9c5 [2/n][torch/elastic][upstream] Move torchelastic/timer torchelastic/multiprocessing to torch/distributed/elastic (#53574)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53574

Upstreams `torchelastic/timer|multiprocessing` to `torch/distributed/elastic/timer|multiprocessing`

Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/...
buck test mode/dev-nosan //caffe2/test/distributed/elastic/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
buck test mode/dev-nosan //hpc/...
buck test mode/dev-nosan //caffe2/torch/fb/training_toolkit/...
```

Reviewed By: borovsky-d, wilson100hong

Differential Revision: D26899809

fbshipit-source-id: e6dbc2a78282eac296c262b3206a979e3ef1ff53
2021-03-10 12:32:53 -08:00