Commit Graph

247 Commits

Author SHA1 Message Date
Michael Suo
9bf18aab94 [lint] upgrade mypy to latest version
Fixes https://github.com/pytorch/pytorch/issues/75927.

Had to fix some bugs and add some ignores.

To check if clean:
```
lintrunner --paths-cmd='git grep -Il .' --take MYPY,MYPYSTRICT
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76753

Approved by: https://github.com/malfet
2022-05-03 19:43:28 +00:00
Aliaksandr Ivanou
e54c1f6c90 [torch][elastic] Make final agent barrier to shutdown properly
Summary:
When workers finish their work, the TE agent starts the `synchronize_barrier` procedure. The barrier waits for the other agents at the end of execution.

A race condition may occur: the barrier uses a TCPStore, which is hosted on rank 0. When rank 0 finishes its work, other ranks may still be executing the `get_all` method, so some of them will fail because the TCPStore has already been destroyed.

The fix adds an additional check to the rank 0 process: it now waits for all other ranks to finish before terminating.
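The shape of the fix can be sketched with a toy in-memory store standing in for the TCPStore (the class, method, and key names below are illustrative, not torchelastic's):

```python
import threading
import time

class InMemoryStore:
    """Hypothetical stand-in for the TCPStore hosted on rank 0."""
    def __init__(self):
        self._data, self._cond = {}, threading.Condition()

    def set(self, key, value):
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def wait(self, keys, timeout=5.0):
        deadline = time.monotonic() + timeout
        with self._cond:
            while not all(k in self._data for k in keys):
                left = deadline - time.monotonic()
                if left <= 0:
                    raise TimeoutError(f"missing keys: {keys}")
                self._cond.wait(left)

def final_barrier(store, rank, world_size):
    store.set(f"arrived/{rank}", "1")
    store.wait([f"arrived/{r}" for r in range(world_size)])
    if rank != 0:
        # Confirm this rank is past all of its store reads.
        store.set(f"finished/{rank}", "1")
    else:
        # The fix: rank 0 must not exit (destroying the store) until every
        # other rank has confirmed it is done using the store.
        store.wait([f"finished/{r}" for r in range(1, world_size)])

store, world = InMemoryStore(), 4
threads = [threading.Thread(target=final_barrier, args=(store, r, world))
           for r in range(world)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("all ranks exited the barrier")
```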

Test Plan: unit tests

Differential Revision: D35227180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74931
Approved by: https://github.com/kiukchung
2022-04-15 20:29:05 +00:00
Kiuk Chung
766eba60f7 (torchx/elastic) honor NCCL_ASYNC_ERROR_HANDLING set from the env var (#73982)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73982

Currently there is no way for torchelastic users to override `NCCL_ASYNC_ERROR_HANDLING` (e.g. to set it to `0`); this PR honors a value already set in the environment.
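The honoring logic amounts to a set-if-absent on the environment passed to workers; a minimal sketch (the function name is hypothetical):

```python
def with_nccl_async_error_handling(env: dict) -> dict:
    """Honor a value already present in the environment; only fall back to
    torchelastic's default ("1") when the user did not set one."""
    env = dict(env)  # do not mutate the caller's mapping
    env.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")
    return env

env = with_nccl_async_error_handling({"NCCL_ASYNC_ERROR_HANDLING": "0"})
print(env["NCCL_ASYNC_ERROR_HANDLING"])  # 0: the user's value is honored
```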

Test Plan:
Added unittests

Manual testing
```
$ torchx run fb.dist.ddp -- --img torchx_examples -m print_env_vars.py --env NCCL_ASYNC_ERROR_HANDLING=0
```

Validated that `NCCL_ASYNC_ERROR_HANDLING` in the process running `print_env_vars.py` is indeed `0`.

Reviewed By: mannatsingh, aivanou

Differential Revision: D34765786

fbshipit-source-id: 3f9f6d3b61e7d265adf689d387e020ab534c9259
(cherry picked from commit 2b787b46c6d37f049fe39eb64eecedf68799e75c)
2022-03-11 01:03:54 +00:00
Kiuk Chung
b08309ee0a (torch/elastic) skip logging structured error info if error_file is not set (#73477)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73477

resolves https://github.com/pytorch/pytorch/issues/73465

This `log.error` is unnecessary (and not formatted in a human-friendly way) because we end up re-raising the same exception after recording it into an error file (if present). Python then handles the error the way it handles any other error and writes the traceback to the console. The extra logging produced duplicate error prints on the console, affecting all users whose schedulers do not set the `TORCHELASTIC_ERROR_FILE` env var when calling `torch.distributed.run`.
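The intended flow can be sketched as follows (a simplified stand-in for the real `record` decorator, with a hypothetical error payload):

```python
import json
import os
import tempfile

def record(fn):
    """Hedged sketch of the behavior after this change; not the real
    torchelastic decorator."""
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as e:
            error_file = os.environ.get("TORCHELASTIC_ERROR_FILE")
            if error_file:
                with open(error_file, "w") as f:
                    json.dump({"message": f"{type(e).__name__}: {e}"}, f)
            # No log.error here: re-raising lets Python report the traceback
            # exactly once, instead of producing a duplicate console print.
            raise
    return wrapper

@record
def main_fn():
    raise ValueError("test error")

err_path = tempfile.NamedTemporaryFile(delete=False).name
os.environ["TORCHELASTIC_ERROR_FILE"] = err_path
try:
    main_fn()
except ValueError:
    pass  # the exception still propagates to the caller

with open(err_path) as f:
    recorded = json.load(f)
print(recorded["message"])  # ValueError: test error
```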

Test Plan:
Induce an error on the agent process by `kill -15 $AGENT_PID`
```
python -m torch.distributed.run \
    --nproc_per_node 2 \
    --nnodes 1:1 \
    --rdzv_backend c10d \
    --rdzv_endpoint localhost:29500 \
    --monitor_interval 3 test.py
```

Produces

{F704936697}

In contrast to the duplicated error before:

{F704936729}

Reviewed By: d4l3k

Differential Revision: D34501852

fbshipit-source-id: 14fed18a9664130980205007ff104ff15a5fd4f8
(cherry picked from commit 0b7c51ba8834f4a4a5376f585c0795cb43be6521)
2022-03-01 19:31:44 +00:00
Steven Troxler
906d26fb9b [codemod][type-comments] Convert type comments in api.py (#73084)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73084

I'm wrapping up the conversion of type comments to type annotations
in caffe2. The last remaining "bulk" codemod has test failures that
are hard for me to understand, so I'm going to submit PRs for each
module individually which makes it easier to see what's causing
problems.

All the codemods were produced via LibCST and then manually cleaned up.
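For reference, the conversion the codemod performs looks like this (the function below is illustrative, not taken from api.py):

```python
from typing import List

# Before (Python-2-era type comment, readable by tools like LibCST):
# def gather_ranks(ranks):
#     # type: (List[int]) -> List[int]
#     return sorted(ranks)

# After (inline annotations carrying the same information):
def gather_ranks(ranks: List[int]) -> List[int]:
    return sorted(ranks)

print(gather_ranks([3, 1, 2]))  # [1, 2, 3]
```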

Test Plan: Wait for github CI

Reviewed By: H-Huang

Differential Revision: D34344289

fbshipit-source-id: e8e3a13c3d95f6804829f1818fb7f0605e5ba137
(cherry picked from commit 92d47d9cd5)
2022-02-19 00:31:45 +00:00
Brian Muse
8bf3179f6e #71946 Remove Python 3.6 references (#72211)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/71946

This commit removes some bits of code that were hard coded for Python 3.6 support from the `.circleci` and `torch` folders. It should only be merged if https://github.com/pytorch/pytorch/issues/66462 is complete.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72211

Reviewed By: dagitses, seemethere

Differential Revision: D33982604

Pulled By: musebc

fbshipit-source-id: 8f453bf9909df615addd59538adb369c65484044
(cherry picked from commit 944a9970fe)
2022-02-08 03:46:20 +00:00
Erjia Guan
0721fc6474 Decouple MapDataPipe from Dataset (#70991)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70991

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D33477680

Pulled By: ejguan

fbshipit-source-id: d3e89492e921a96791319f35052a229684ddf7cf
2022-01-07 14:28:41 -08:00
s-kumano
4ed02748be fix typo in the docs of multiprocessing (#70448)
Summary:
Fix typo in the docs of multiprocessing.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70448

Reviewed By: gchanan

Differential Revision: D33336962

Pulled By: H-Huang

fbshipit-source-id: 1235703b8ddc26c33dcbc34bd25ac36b11a18923
2021-12-28 09:58:47 -08:00
Can Balioglu
6e640a0acf Revise the socket implementation of c10d (#68226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68226

**Note that this PR is unusually big due to the urgency of the changes. Please reach out to me in case you wish to have a "pair" review.**

This PR introduces a major refactoring of the socket implementation of the C10d library. A big portion of the logic is now contained in the `Socket` class and a follow-up PR will further consolidate the remaining parts. As of today the changes in this PR offer:

 - significantly better error handling and much more verbose logging (see the example output below)
 - explicit support for IPv6 and dual-stack sockets
 - correct handling of signal interrupts
 - better Windows support

A follow-up PR will consolidate `send`/`recv` logic into `Socket` and fully migrate to non-blocking sockets.
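The dual-stack listening behavior can be sketched with plain Python sockets (an approximation of what the `Socket` class does internally, with a fallback for hosts without a usable IPv6 stack):

```python
import socket

def create_listener(port: int = 0) -> socket.socket:
    """Sketch of the dual-stack listen path: prefer an IPv6 wildcard socket
    that also accepts IPv4 clients, falling back to plain IPv4."""
    if socket.has_ipv6:
        try:
            s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
            # 0 disables IPV6_V6ONLY, i.e. enables dual-stack operation, so
            # IPv4 clients appear as mapped addresses (::ffff:a.b.c.d).
            s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
            s.bind(("::", port))
            s.listen()
            return s
        except OSError:
            pass  # no usable IPv6 stack on this host; fall back below
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("0.0.0.0", port))
    s.listen()
    return s

srv = create_listener()
family = "IPv6 (dual-stack)" if srv.family == socket.AF_INET6 else "IPv4"
print("listening on", srv.getsockname()[:2], "as", family)
srv.close()
```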

## Example Output

```
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[W logging.h:28] The server socket on [localhost]:29501 is not yet listening (Error: 111 - Connection refused), retrying...
[I logging.h:21] The server socket will attempt to listen on an IPv6 address.
[I logging.h:21] The server socket is attempting to listen on [::]:29501.
[I logging.h:21] The server socket has started to listen on [::]:29501.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42650.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42650.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42722.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42722.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42724.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42724.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42726.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42726.
```
ghstack-source-id: 143501987

Test Plan: Run existing unit and integration tests on devserver, Fedora, Ubuntu, macOS Big Sur, Windows 10.

Reviewed By: Babar, wilson100hong, mrshenli

Differential Revision: D32372333

fbshipit-source-id: 2204ffa28ed0d3683a9cb3ebe1ea8d92a831325a
2021-11-16 20:49:25 -08:00
Kiuk Chung
f6402c469e (torch/elastic) fix scale down bug caused by calling rdzv_handler.shutdown() on premature agent failures (#67749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67749

Fixes: https://github.com/pytorch/pytorch/issues/67742

Test Plan:
Added unittests.

Validated manually:

```
# start agent 0
$ torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py

# start agent 1
torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py

# kill agent 0
CTRL+C (SIGINT) or kill -15 (SIGTERM)

# restart it
torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py
```

Reviewed By: cbalioglu

Differential Revision: D32129005

fbshipit-source-id: db292268250ef6f1e06f5b4c5bd67124d8dfd325
2021-11-05 12:18:46 -07:00
Howard Huang
d7ac6e977a Fix test_create_store_multi flaky test (#66953)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66953

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: kiukchung

Differential Revision: D31802767

Pulled By: H-Huang

fbshipit-source-id: a430e242788aac164496d4e65b85bf326537d019
2021-10-26 11:08:51 -07:00
Kiuk Chung
df11e2d6f9 (torch/elastic) add fqdn hostname to error printout (#66182)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66182

closes https://github.com/pytorch/pytorch/issues/63174

Does a few things:

1. adds hostname to the error report
2. moves the "root cause" section to the end (presumably since the logs are being "tailed" we want the root cause to appear at the end)
3. moves redundant error info logging to debug
4. makes the border max 60 char in length and justifies left for the header
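Item (4) above can be sketched as follows (illustrative, not the actual torchelastic formatter):

```python
def format_header(title: str, width: int = 60) -> str:
    """Sketch of rule (4): the border is capped at 60 characters and the
    header line is left-justified."""
    border = "=" * width
    return "\n".join([border, f"{title} FAILED".ljust(width), "-" * width])

print(format_header("run_script_path"))
```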

NOTE: you HAVE TO annotate your main function with `torch.distributed.elastic.multiprocessing.errors.record`, otherwise no traceback is printed (this is because Python exception propagation does NOT work out of the box for IPC, hence the extra `record` annotation).

Test Plan:
Sample

```
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2021-10-05_17:37:22
  host      : devvm4955.prn0.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3296201)
  error_file: /home/kiuk/tmp/elastic/none_3_lsytqe/attempt_0/0/error.json
  traceback :
  Traceback (most recent call last):
    File "/tmp/jetter.xr3_x6qq/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 372, in wrapper
      return f(*args, **kwargs)
    File "main.py", line 28, in main
      raise RuntimeError(args.throws)
  RuntimeError: foobar

============================================================
```

Reviewed By: cbalioglu, aivanou

Differential Revision: D31416492

fbshipit-source-id: 0aeaf6e634e23ce0ea7f6a03b12c8a9ac57246e9
2021-10-07 01:40:02 -07:00
edward-io
51e12f0071 fix torch.distributed.elastic event docs (#64974)
Summary:
The example code wasn't working for me.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang cbalioglu gcramer23

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64974

Reviewed By: kiukchung, cbalioglu

Differential Revision: D30926481

Pulled By: edward-io

fbshipit-source-id: f5e32cc2b948b6ee30d84a8247856f39fc786f67
2021-09-17 12:27:09 -07:00
Kiuk Chung
db134a6843 (torch.distributed.elastic) properly format traceback on error (#65041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65041

Fixes a bug introduced in https://github.com/pytorch/pytorch/pull/64036 where the traceback of the error handler is printed out rather than the traceback of the actual exception.

Fixes https://github.com/pytorch/pytorch/issues/60910
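The gist of the bug and the fix, in a minimal sketch (function names are illustrative): `traceback.format_stack()` inside the handler captures the handler's own call stack, while formatting the exception's `__traceback__` yields the frames that actually raised:

```python
import traceback

def record_wrong(e: BaseException) -> str:
    # The bug: format_stack() captures the *error handler's* own call stack,
    # which does not include the frames that raised the exception.
    return "".join(traceback.format_stack())

def record_fixed(e: BaseException) -> str:
    # The fix, in spirit: format the exception's own traceback instead.
    return "".join(traceback.format_exception(type(e), e, e.__traceback__))

def throwing_main():
    raise RuntimeError("raising error since --throw was specified")

try:
    throwing_main()
except RuntimeError as e:
    wrong = record_wrong(e)
    fixed = record_fixed(e)

print("handler stack mentions throwing_main: ", "throwing_main" in wrong)
print("exception traceback mentions throwing_main:", "throwing_main" in fixed)
```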

BEFORE (note that the `py_callstack` is NOT the traceback of the RuntimeError):
```
**************************************************************************************************************************************************************************************************************************************************
                                                                                                              run_script_path FAILED
==================================================================================================================================================================================================================================================
Root Cause:
[0]:
  time: 2021-09-14_22:01:06
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 1092727)
  error_file: /tmp/torchelastic_aeyvjbpe/none_8zuih7tj/attempt_0/0/error.json
  msg:
    {
      "message": "RuntimeError: rasing error since --throw was specified",
      "extraInfo": {
        "py_callstack": [
          "  File \"<string>\", line 1, in <module>\n",
          "  File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/spawn.py\", line 116, in spawn_main\n    exitcode = _main(fd, parent_sentinel)\n",
          "  File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/spawn.py\", line 129, in _main\n    return self._bootstrap(parent_sentinel)\n",
          "  File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/process.py\", line 315, in _bootstrap\n    self.run()\n",
          "  File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/process.py\", line 108, in run\n    self._target(*self._args, **self._kwargs)\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/multiprocessing/spawn.py\", line 59, in _wrap\n    fn(i, *args)\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/api.py\", line 382, in _wrap\n    ret = record(fn)(*args_)\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 373, in wrapper\n    error_handler.record_exception(e)\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/error_handler.py\", line 86, in record_exception\n    _write_error(e, self._get_error_file_path())\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/error_handler.py\", line 26, in _write_error\n    \"py_callstack\": traceback.format_stack(),\n"
        ],
        "timestamp": "1631682066"
      }
    }

==================================================================================================================================================================================================================================================
Other Failures:
  <NO_OTHER_FAILURES>
**************************************************************************************************************************************************************************************************************************************************
```

AFTER (note the traceback is the traceback of the RuntimeError):
```
********************************************************************************
                             run_script_path FAILED
================================================================================
Root Cause:
[0]:
  time: 2021-09-14_21:49:25
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 1014681)
  error_file: /tmp/torchelastic_q0zods2c/none_qwmz5dgj/attempt_0/0/error.json
  msg: Traceback (most recent call last):
    File "/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
      return f(*args, **kwargs)
    File "/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/run.py", line 671, in run_script_path
      runpy.run_path(sys.argv[0], run_name="__main__")
    File "/usr/local/fbcode/platform009/lib/python3.8/runpy.py", line 265, in run_path
      return _run_module_code(code, init_globals, run_name,
    File "/usr/local/fbcode/platform009/lib/python3.8/runpy.py", line 97, in _run_module_code
      _run_code(code, mod_globals, init_globals,
    File "/usr/local/fbcode/platform009/lib/python3.8/runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "/home/kiuk/tmp/test.py", line 55, in <module>
      main()
    File "/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
      return f(*args, **kwargs)
    File "/home/kiuk/tmp/test.py", line 25, in main
      raise RuntimeError("rasing error since --throw was specified")
  RuntimeError: rasing error since --throw was specified

================================================================================
Other Failures:
  <NO_OTHER_FAILURES>
********************************************************************************
```

Test Plan:
(see summary for before and after)

`test.py` contents:
```
import argparse
import os
import sys

import torch
import torch.distributed as dist
import torch.nn.functional as F

from torch.distributed.elastic.multiprocessing.errors import record

def parse_args(argv):
    parser = argparse.ArgumentParser(description="test script")
    parser.add_argument("--init_method", type=str, default="env://")
    parser.add_argument("--backend", type=str, default="gloo")
    parser.add_argument("--throw", action="store_true", default=False)
    parser.add_argument("--exit", action="store_true", default=False)
    return parser.parse_args()

@record
def main():
    args = parse_args(sys.argv[1:])

    if args.throw:
        raise RuntimeError("rasing error since --throw was specified")

    if args.exit:
        sys.exit(1)

    init_method = args.init_method
    backend = args.backend

    world_size = int(os.environ["WORLD_SIZE"])
    rank = int(os.environ["RANK"])

    print(f"initializing `{backend}` process group with rank={rank}, world_size={world_size} at {init_method}")

    dist.init_process_group(
        backend=backend,
        init_method=init_method,
        world_size=world_size,
        rank=rank)

    print(f"successfully initialized process group with rank={dist.get_rank()}, world_size={dist.get_world_size()}")

    t = F.one_hot(torch.tensor(rank), num_classes=world_size)
    dist.all_reduce(t)
    derived_world_size = torch.sum(t).item()
    if derived_world_size != world_size:
        raise RuntimeError(f"derived world size: {derived_world_size} != actual world size: {world_size}")
    else:
        print(f"successfully derived world size: {derived_world_size} (expected: {world_size}). Exiting")

if __name__ == "__main__":
    main()
```

run it as:

```
$ python -m torch.distributed.run --nproc_per_node 2 test.py --throw
```

Reviewed By: cbalioglu

Differential Revision: D30953731

fbshipit-source-id: bbea04c59c2aec58969cf44d8e3723d5f8abe8a8
2021-09-15 12:50:21 -07:00
Yanli Zhao
adb85b32d3 minor fix for elastic doc (#64531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64531

fix #64530

Test Plan: unit test

Reviewed By: mrshenli

Differential Revision: D30760879

fbshipit-source-id: 94ed1476e886513427d928a36f5be6b9bfff0826
2021-09-07 09:31:01 -07:00
Can Balioglu
d8d8e4902a [torch/elastic] Pretty print the failure message captured by @record (#64036)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64036

This PR slightly revises the implementation of the internal `_format_failure()` method in order to pretty print the error message captured in a subprocess by the `record` annotation.

With this PR a failure log is formatted as below:

```
Root Cause:
[0]:
  time: 2021-08-26_17:12:07
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 8045)
  error_file: /tmp/torchelastic_6cj9eppm/6d9d844a-6ce4-4838-93ed-1639a9525b00_rec9kuv3/attempt_0/0/error.json
  msg:
    {
      "message": "ValueError: Test",
      "extraInfo": {
        "py_callstack": [
          "  File \"/data/home/balioglu/fail.py\", line 7, in <module>\n    main()\n",
          "  File \"/fsx/users/balioglu/repos/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 373, in wrapper\n    error_handler.record_exception(e)\n",
          "  File \"/fsx/users/balioglu/repos/pytorch/torch/distributed/elastic/multiprocessing/errors/error_handler.py\", line 86, in record_exception\n    _write_error(e, self._get_error_file_path())\n",
          "  File \"/fsx/users/balioglu/repos/pytorch/torch/distributed/elastic/multiprocessing/errors/error_handler.py\", line 26, in _write_error\n    \"py_callstack\": traceback.format_stack(),\n"
        ],
        "timestamp": "1629997927"
      }
    }
```

in contrast to the old formatting:

```
Root Cause:
[0]:
  time: 2021-08-26_17:15:50
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 9417)
  error_file: /tmp/torchelastic_22pwarnq/19f22638-848c-4b8f-8379-677f34fc44e7_u43o9vs7/attempt_0/0/error.json
  msg: "{'message': 'ValueError: Test', 'extraInfo': {'py_callstack': 'Traceback (most recent call last):\n  File "/fsx/users/balioglu/repos/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 351, in wrapper\n    return f(*args, **kwargs)\n  File "/data/home/balioglu/fail.py", line 5, in main\n    raise ValueError("BALIOGLU")\nValueError: BALIOGLU\n', 'timestamp': '1629998150'}}"
```
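The pretty-printing step itself amounts to decoding the recorded message and re-encoding it with indentation; a hedged sketch (not the actual `_format_failure()` code):

```python
import json
import textwrap

def pretty_message(raw: str, indent: int = 4) -> str:
    """Sketch of the pretty-printing: if the recorded msg is JSON, re-encode
    it with indentation; otherwise return it unchanged."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return raw
    pretty = json.dumps(data, indent=2)
    return textwrap.indent(pretty, " " * indent)

raw = '{"message": "ValueError: Test", "extraInfo": {"timestamp": "1629997927"}}'
print(pretty_message(raw))
```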
ghstack-source-id: 136761768

Test Plan: Run the existing unit tests.

Reviewed By: kiukchung

Differential Revision: D30579025

fbshipit-source-id: 37df0b7c7ec9b620355766122986c2c77e8495ae
2021-08-26 13:56:46 -07:00
Aliaksandr Ivanou
f12f667e12 [torch] Set default log level for torch elastic (#63214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63214

The default log level in fb and oss is different: in oss we use WARNING and in fb we use INFO.

Test Plan: unittests, f291441502

Reviewed By: cbalioglu

Differential Revision: D30296298

fbshipit-source-id: 89067352be767255fbc66e790ec333582de64c6c
2021-08-17 19:58:13 -07:00
Aliaksandr Ivanou
e1f81c9321 [torchelastic][multiprocessing] Print warning message only when child processes are stuck (#62823)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62823

The diff makes sure that the warning message is printed only when the child processes remain stuck after the termination signal has been sent.

Test Plan:
sandcastle

    buck build mode/dev-nosan //caffe2:run
    buck-out/gen/caffe2/run.par --nnodes 1 --nproc_per_node 1 main.py
P435691445

Differential Revision: D30046695

fbshipit-source-id: c59170b297f4a0e530906fa5069234303deee938
2021-08-05 19:57:31 -07:00
Neel Pragnesh Gandhi
f2369f12f9 Add logging for dynamic rendezvous (#61822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61822

Added scuba logging to the following files:
- dynamic_rendezvous.py
- c10d_rendezvous_backend.py

NOTE: This diff introduces the use of python's inspect module to easily allow for obtaining the calling method name and filename when logging. This module can mess with python's garbage collector, so special care was taken to never store references to results from inspect.stack() longer than absolutely needed.

Test Plan:
The following tests can be run.
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/rendezvous:c10d_rendezvous_backend_test
```
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/rendezvous:dynamic_rendezvous_test
```
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/events:lib_test
```

Reviewed By: aivanou

Differential Revision: D29643774

fbshipit-source-id: f10cd5ebf8f6860856267bc2483c0b85faacb0fd
2021-07-26 09:39:09 -07:00
Aliaksandr Ivanou
0c55f1bdec [torchelastic] Improve process termination logic (#61602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61602

The diff introduces signal handlers and a `SignalException` that is raised when the agent process receives SIGTERM or SIGINT.

When either of these signals is received, the termination handler raises the `SignalException`, which is then processed by the main agent loop: `shutdown(signum)` is invoked, propagating the received signal to the child processes. A default 30-second timeout is introduced: if the child processes cannot terminate gracefully within it, the agent kills them via SIGKILL.
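The termination flow can be sketched as follows (the class and functions are illustrative stand-ins using `subprocess` children; a 5-second timeout is used for the demo instead of the default 30):

```python
import subprocess
import sys

class SignalException(Exception):
    """Sketch of the exception the agent's termination handler raises on
    SIGTERM/SIGINT (illustrative, not the torchelastic class)."""
    def __init__(self, signum: int):
        super().__init__(f"process received signal {signum}")
        self.signum = signum

def shutdown(procs, timeout: float = 30.0) -> None:
    """Propagate termination to child processes, escalating to SIGKILL for
    any child that fails to exit gracefully within the timeout."""
    for p in procs:
        p.terminate()                # forward SIGTERM to the child
    for p in procs:
        try:
            p.wait(timeout=timeout)  # grace period for cleanup
        except subprocess.TimeoutExpired:
            p.kill()                 # grace period elapsed: SIGKILL
            p.wait()

# Demo: a child that would sleep for a minute is shut down promptly.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
shutdown([child], timeout=5.0)
print("child terminated, return code set:", child.returncode is not None)
```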

Test Plan: unittests, sandcastle

Reviewed By: cbalioglu

Differential Revision: D29671783

fbshipit-source-id: 3dbca2125676dc18d417cc3e3bb0301fdd42737a
2021-07-23 11:00:15 -07:00
Rohan Varma
8bcf24b37a [TCPStore] enhance connect timeout error message (#61390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61390

Enhances this error message for better debuggability.
ghstack-source-id: 133185482

Test Plan: CI

Reviewed By: H-Huang

Differential Revision: D29601528

fbshipit-source-id: f7aaf4d67ac96e6ed0b535e0200f918dd01e42f9
2021-07-10 03:57:23 -07:00
Aliaksandr Ivanou
8296cb37c7 [torchelastic] Set the correct maximum border width
Summary: The diff sets the correct maximum width for the border delimiters between error sections

Test Plan: Example of the uncontrolled border: https://www.internalfb.com/intern/testinfra/diagnostics/7599824415964133.844424970500348.1625590344/

Reviewed By: kiukchung

Differential Revision: D29636814

fbshipit-source-id: 95465d3150066bff82dc7499bb1c63ea4f5ebc2d
2021-07-09 13:29:23 -07:00
Aliaksandr Ivanou
13658b10bb [torch] Various improvements to torch.distributed.launch and torch.distributed.run (#61294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61294

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925

* Set `torch.distributed.launch` restarts to 0
* Remove the unnecessary `--use_env` warning
* Move `--use_env` warnings to `torch.distributed.launch`
* Make default log level WARNING
* Add new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error-propagation
* Set default events handler to `null` that does not print events to console
* Add reference from `torch.distributed.launch` to `torch.distributed.run`
* Set correct preexec function that sends SIGTERM to child processes when parent dies

Issues resolved:

https://github.com/pytorch/pytorch/issues/60716
https://github.com/pytorch/pytorch/issues/60754

Test Plan:
sandcastle

    python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
    python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts

    python -m torch.distributed.launch --nproc_per_node=4  --use_env --no_python  main.py -> produces error
    python -m torch.distributed.launch --nproc_per_node=4  --use_env main.py -> no warning
    python -m torch.distributed.launch --nproc_per_node=4  --no_python  main.py -> warning

Output of running torch.distributed.launch without --use_env:

    $path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated
    and will be removed in future. Use torch.distributed.run.
    Note that --use_env is set by default in torch.distributed.run.
    If your script expects `--local_rank` argument to be set, please
    change it to read from `os.environ('LOCAL_RANK')` instead.

New section:

{F628923078}

{F628974089}

Reviewed By: cbalioglu

Differential Revision: D29559553

fbshipit-source-id: 03ed9ba638bf154354e1530ffc964688431edf6b
2021-07-08 16:28:06 -07:00
Vitaly Fedyunin
ccfdb30644 Revert D29413019: [torch] Various improvements to torch.distributed.launch and torch.distributed.run
Test Plan: revert-hammer

Differential Revision:
D29413019 (4e181dfc35)

Original commit changeset: 323bfbad9d0e

fbshipit-source-id: 1f8ae4b3d0a23f3eaff28c37e9148efff25fafe2
2021-07-01 08:44:51 -07:00
Aliaksandr Ivanou
4e181dfc35 [torch] Various improvements to torch.distributed.launch and torch.distributed.run (#60925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925

* Set `torch.distributed.launch` restarts to 0
* Remove the unnecessary `--use_env` warning
* Move `--use_env` warnings to `torch.distributed.launch`
* Make default log level WARNING
* Add new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error-propagation
* Set default events handler to `null` that does not print events to console
* Add reference from `torch.distributed.launch` to `torch.distributed.run`
* Set correct preexec function that sends SIGTERM to child processes when parent dies

Issues resolved:

https://github.com/pytorch/pytorch/issues/60716
https://github.com/pytorch/pytorch/issues/60754

Test Plan:
sandcastle

    python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
    python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts

    python -m torch.distributed.launch --nproc_per_node=4  --use_env --no_python  main.py -> produces error
    python -m torch.distributed.launch --nproc_per_node=4  --use_env main.py -> no warning
    python -m torch.distributed.launch --nproc_per_node=4  --no_python  main.py -> warning

Output of running torch.distributed.launch without --use_env:

    $path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated
    and will be removed in future. Use torch.distributed.run.
    Note that --use_env is set by default in torch.distributed.run.
    If your script expects `--local_rank` argument to be set, please
    change it to read from `os.environ('LOCAL_RANK')` instead.

New section:

{F628923078}

{F628974089}

Reviewed By: kiukchung, cbalioglu

Differential Revision: D29413019

fbshipit-source-id: 323bfbad9d0e4aba3b10ddd7a243ca6e48169630
2021-06-30 23:31:02 -07:00
Kiuk Chung
aaea81e3fb [torch/distributed] remove outdated FutureWarning in distributed/elastic/util/store.py (#60807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60807

Addresses: https://github.com/pytorch/pytorch/issues/60717

This warning should have been removed since this code is no longer in "experimental" mode.

Test Plan: N/A - just removing experimental warning that should've been removed.

Reviewed By: H-Huang, aivanou

Differential Revision: D29412972

fbshipit-source-id: 16a8a98abde70a4ae0c1ac1b14bda339cb44863a
2021-06-28 10:22:16 -07:00
Philip Meier
d5988c5eca remove unused type: ignore directives (#60006)
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct, but that `mypy` doesn't recognize as such. This often stems from the fact that the `mypy` version in use wasn't able to handle the pattern in question.

With every new release `mypy` gets better at handling complex code. In addition to fixing all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see if they are still needed. Fortunately, we don't need to do this manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out whenever it encounters a `type: ignore` that is no longer needed.
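The opt-in described above amounts to a one-line configuration change; a hypothetical `mypy.ini` fragment (`warn_unused_ignores` is a standard mypy option):

```ini
; Enabling this makes mypy report an error on every "type: ignore"
; comment that is no longer needed by the current mypy version.
[mypy]
warn_unused_ignores = True
```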

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006

Reviewed By: jbschlosser, malfet

Differential Revision: D29133237

Pulled By: albanD

fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
2021-06-18 07:23:31 -07:00
Neel Pragnesh Gandhi
2c5db9a40a Add c10d filestore functionality to the current c10d_rendezvous_backend (#59719)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59719

Added filestore functionality to the c10d backend. FileStore will create a temporary file in the /tmp directory to use if it is selected as the store type. Appropriate tests were added as well.
FileStore was modified to expose the path field for testing. It was also modified so that the numWorkers field in the constructor is optional (defaulting to -1). A negative value indicates there is not a fixed number of workers; in this case, no attempt is made to clean up the file at the end.

Test Plan: Unit tests for creating a c10d backend with filestore and simple error handling.

Reviewed By: cbalioglu, H-Huang

Differential Revision: D28997436

fbshipit-source-id: 24c9b2c9b13ea6c947e8b1207beda892bdca2217
2021-06-16 12:13:36 -07:00
Aliaksandr Ivanou
1735775662 [Torch] Cast timestamp type to int (#59712)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59712

When a worker process fails in fb due to a signal, the TerminationHandler writes an error reply file. Recently the error reply file was changed for mast jobs. The JSON value of ``timestamp`` is a string, even though in the thrift struct it is an int: https://fburl.com/diffusion/upa228u5

This diff adds support for casting the str timestamp to int.
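The defensive cast can be sketched as follows; `parse_timestamp` is a hypothetical helper name, not the function from the diff:

```python
import json

def parse_timestamp(reply: str) -> int:
    # The reply file may carry "timestamp" either as a JSON int or,
    # as observed on some jobs, as a JSON string -- int() handles both.
    data = json.loads(reply)
    return int(data["timestamp"])

print(parse_timestamp('{"message": "test", "timestamp": 12}'))    # -> 12
print(parse_timestamp('{"message": "test", "timestamp": "12"}'))  # -> 12
```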

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test

Reviewed By: suphoff

Differential Revision: D28995827

fbshipit-source-id: 333448cfb4d062dc7fe751ef5839e66bfcb3ba00
2021-06-09 15:56:37 -07:00
Can Balioglu
44c442293f [torch/elastic] Fix the edge case when no node is alive (#59663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59663

This PR fixes an edge case bug in `DynamicRendezvousHandler` where the state of the rendezvous is not always entirely updated when one or more nodes are not alive anymore.

Test Plan: Run the existing and newly-introduced unit tests.

Reviewed By: tierex

Differential Revision: D28971809

fbshipit-source-id: ebbb6a5f2b04f045c3732d6cf0f8fdc7c2381a7c
2021-06-09 15:31:50 -07:00
Can Balioglu
4ee761c2c5 [2/n] [c10d] Introduce the 'multiTenant' constructor parameter in TCPStore (#58329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58329

This PR is part of a stack that addresses the GitHub issue #41614; it introduces:

- A new `multiTenant` constructor option for the `TCPStore` class indicating whether multiple store instances can be initialized with the same host:port pair.

- Updates to the C10d distributed (elastic) rendezvous and the `init_process_group` method to leverage the new `multiTenant` feature.

Note that the multi-tenancy feature itself is implemented in the fourth PR of this stack. In this PR, passing `true` to `multiTenant` results only in a warning output.
ghstack-source-id: 130676389

Test Plan: Run the existing tests since there are no behavioral changes.

Reviewed By: rohan-varma

Differential Revision: D28424978

fbshipit-source-id: fb1d1d81b8b5884cc5b54486700a8182a69c1f29
2021-06-05 07:50:04 -07:00
Aliaksandr Ivanou
c22ac14969 [Error-reporting] Set upper boundary on border element (#59311)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59311

The diff sets an upper boundary on the border element when presenting the error message. This is required to avoid unnecessary log contamination.

Test Plan: Example of log contamination: https://www.internalfb.com/fblearner/details/276849996/operator/2942475685?tab=try_27021599785797968

Reviewed By: d4l3k

Differential Revision: D28812745

fbshipit-source-id: 4f491b9acc8cc9831d763f185022879bbbfb4c8a
2021-06-02 12:28:54 -07:00
Can Balioglu
f0a5500722 [torch/elastic] Add logging to the sanitize function of RendezvousStateHolder (#58169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58169

This PR adds logging to the `_sanitize()` function of `RendezvousStateHolder` to output the nodes that had no recent heartbeat and are considered "dead".
ghstack-source-id: 128798389

Test Plan: Run the existing tests.

Reviewed By: tierex

Differential Revision: D28333394

fbshipit-source-id: ba0a398a759815e4224b58323c0e743eb383f723
2021-05-12 18:53:55 -07:00
Can Balioglu
028f2f62ac [torch/elastic] Update the rendezvous docs (#58160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58160

This PR updates the Torch Distributed Elastic documentation with references to the new `c10d` backend.
ghstack-source-id: 128783809

Test Plan: Visually verified the correct rendering of the docs.

Reviewed By: tierex

Differential Revision: D28384996

fbshipit-source-id: a40b0c37989ce67963322565368403e2be5d2592
2021-05-12 16:54:28 -07:00
Can Balioglu
ae63b1d1c6 [torch/elastic] Revise distributed run script (#58159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58159

This PR includes the following changes:

- The `--standalone` option of `torch.distributed.run` now uses the `c10d` backend instead of `etcd` backend.

- The `import` statement for `EtcdServer` has been removed from the run script.

- The docstrings and parameter descriptions of the run script have been revised and improved.

- The default port number of `EtcdRendezvousBackend` has been changed from 29500 to 29400 to improve the user experience when used along with the run script, which uses port 29500 for the distributed job store (a.k.a. `MASTER_PORT`) by default.
ghstack-source-id: 128782267

Test Plan:
- Run existing tests.
- Visually verified the correct rendering of the docs.

Reviewed By: tierex

Differential Revision: D28383681

fbshipit-source-id: a4098f7c23c97a2376a9c4023e81f82fedd04b10
2021-05-12 16:53:31 -07:00
Kimish Patel
b7d674eb21 Revert D28331386: [pytorch][PR] [torch/elastic] Update the rendezvous docs
Test Plan: revert-hammer

Differential Revision:
D28331386 (e4418b67c7)

Original commit changeset: 95dd32146222

fbshipit-source-id: 5522d4a09bc06ac42943eec9aa8bf5292cc778b2
2021-05-11 18:10:46 -07:00
Can Balioglu
e4418b67c7 [torch/elastic] Update the rendezvous docs (#57973)
Summary:
This PR updates the rendezvous documentation for the Torch Distributed Elastic section of PyTorch docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57973

Reviewed By: kiukchung

Differential Revision: D28331386

Pulled By: cbalioglu

fbshipit-source-id: 95dd32146222aaeff246bd3c3d2caf0036a9011b
2021-05-11 15:32:50 -07:00
Can Balioglu
1d4d9ffca0 [torch/elastic] Refactor rendezvous store initialization logic (#58057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58057

This PR refactors the store initialization logic and moves it to the `create_backend` function for both C10d and etcd backends.
ghstack-source-id: 128671579

Test Plan: Run the existing and revised tests.

Reviewed By: tierex

Differential Revision: D28356587

fbshipit-source-id: caf9416ab811eefe4834268d8a11a48f2236ed5b
2021-05-11 13:46:07 -07:00
Can Balioglu
731dcd75f5 [torch/elastic] Revise the note section of RendezvousHandler doc (#57723)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57723

Updated the note section of `RendezvousHandler`:

- Removed the experimental API warning.
- Recommended using the C10d Store instead of etcd for most users.

Test Plan: N/A

Reviewed By: kiukchung

Differential Revision: D28253828

fbshipit-source-id: c4f34dffd1a3cc132977029fe449b6d63ddc877b
2021-05-07 13:10:22 -07:00
Can Balioglu
e5e095cbe4 [torch/elastic] Rename etcd-/c10d-experimental to etcd-v2 and c10d (#57764)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57764

As discussed offline this PR renames etcd-experimental backend to etcd-v2 and c10d-experimental backend to c10d.
ghstack-source-id: 128342523

Test Plan: Run the existing unit tests.

Reviewed By: kiukchung

Differential Revision: D28263739

fbshipit-source-id: c3409037ecea5a8ff6daadeeb1f2fb4205cc3852
2021-05-06 19:51:53 -07:00
Aliaksandr Ivanou
cb7197ce3f Torchelastic: populate __init__.py with failover documentation
Summary: Torchelastic: populate __init__.py with failover documentation

Test Plan: {F613772684}

Reviewed By: cbalioglu

Differential Revision: D28243715

fbshipit-source-id: aeed8d3ddd2d27ef86d837e7e3ebfa7a0b80a07d
2021-05-06 07:38:48 -07:00
Kiuk Chung
a80b215a9a [1/n][torch/elastic] Move torchelastic docs *.rst (#148)
Summary:
Pull Request resolved: https://github.com/pytorch/elastic/pull/148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56811

Moves docs sphinx `*.rst` files from the torchelastic repository to torch. Note: this only moves the rst files; the next step is to link them to the main pytorch `index.rst` and write a new `examples.rst`

Reviewed By: H-Huang

Differential Revision: D27974751

fbshipit-source-id: 8ff9f242aa32e0326c37da3916ea0633aa068fc5
2021-05-04 00:57:56 -07:00
Can Balioglu
bf6e3425b0 [23/n] [torch/elastic] Introduce the implementation of DynamicRendezvousHandler (#57151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57151

This PR introduces the implementation of `DynamicRendezvousHandler` that mostly facilitates the types introduced in previous PRs.
ghstack-source-id: 127685212

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28060531

fbshipit-source-id: 844ff0e9c869f2bbb85fba05a16002d00eae130f
2021-05-03 18:32:43 -07:00
Can Balioglu
a357fc8a4b [22/n] [torch/elastic] Introduce a new from_backend static constructor for DynamicRendezvousHandler (#57150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57150

This PR refactors the `__init__` method of `DynamicRendezvousHandler` to a `from_backend` static constructor for easier testing and future extensibility.
ghstack-source-id: 127685183

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28060336

fbshipit-source-id: b07dcbb61e8ff5a536b7b021cd50438010c648dd
2021-05-03 18:32:42 -07:00
Can Balioglu
4a10bd3b58 [21/n] [torch/elastic] Introduce _RendezvousJoinOp (#57149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57149

This PR introduces the `_RendezvousJoinOp` type that represents a rendezvous join operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685142

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059785

fbshipit-source-id: 6e67a54289eef1a2349fcc52f8841e49c139459a
2021-05-03 18:32:40 -07:00
Can Balioglu
81ef683cb3 [20/n] [torch/elastic] Introduce _RendezvousExitOp (#57148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57148

This PR introduces the `_RendezvousExitOp` type that represents a rendezvous exit operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685094

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059764

fbshipit-source-id: 2da428885f1390957242fdd82d68cee2ac273c71
2021-05-03 18:32:38 -07:00
Can Balioglu
baf8f4c0a6 [19/n] [torch/elastic] Introduce _RendezvousKeepAliveOp (#57147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57147

This PR introduces the `_RendezvousKeepAliveOp` type that represents a rendezvous keep-alive heartbeat operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685037

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059733

fbshipit-source-id: 31fd8fc06f03d8f9cd21558b15a06dea7ad85bc6
2021-05-03 18:32:37 -07:00
Can Balioglu
3e024fcfc9 [18/n] [torch/elastic] Introduce _RendezvousCloseOp (#57146)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57146

This PR introduces the `_RendezvousCloseOp` type that represents a rendezvous close operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127684991

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059693

fbshipit-source-id: 6c944d3b4f6a6ed2057ea2921ae8a42609998dd2
2021-05-03 18:32:35 -07:00
Can Balioglu
aa5d35e1d7 [17/n] [torch/elastic] Introduce _DistributedRendezvousOpExecutor (#57145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57145

This PR introduces the `_DistributedRendezvousOpExecutor` type that implements the `_RendezvousOpExecutor` interface for rendezvous shared via a `_RendezvousStateHolder`.
ghstack-source-id: 127684945

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059417

fbshipit-source-id: 7ef72ea16b54eaaa11a6ece7459d385d49692a84
2021-05-03 18:31:23 -07:00
Can Balioglu
1a6f827ae6 [16/n] [torch/elastic] Introduce _RendezvousOpExecutor (#57144)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57144

This PR introduces the `_RendezvousOpExecutor` interface. Implementers of this interface are responsible for executing rendezvous operations in a state machine that outputs actions based on the current state of the rendezvous.
ghstack-source-id: 127684898

Test Plan: None beyond `flake8` and `mypy` as this is solely an interface definition.

Reviewed By: tierex

Differential Revision: D28059159

fbshipit-source-id: 8e7da33e02336206cddbe76d773681e98c28a98f
2021-05-03 12:18:27 -07:00
Can Balioglu
76bccfb2e0 [15/n] [torch/elastic] Introduce _RendezvousStateHolder (#56538)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56538

This PR introduces the `_RendezvousStateHolder` interface and its accompanying `_BackendRendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state with the other nodes.
ghstack-source-id: 127684796

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D27892600

fbshipit-source-id: a55d884a1f9b0d742787be4dff4271e076c08962
2021-05-03 12:17:18 -07:00
Can Balioglu
1b745efbe8 [14/n] Introduce a name attribute to _PeriodicTimer (#57143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57143

This PR introduces a `name` attribute in `_PeriodicTimer` for testing and debugging purposes.
ghstack-source-id: 127684751

Test Plan: Run the new and updated unit tests.

Reviewed By: tierex

Differential Revision: D28059045

fbshipit-source-id: 9eb067300aea21a99577e6cd8a354f7eb749f4a6
2021-05-03 11:37:05 -07:00
Can Balioglu
233004b4c8 [13/n] Extend the return type of RendezvousBackend's set_state method (#57142)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57142

This PR extends the return type of `RendezvousBackend`'s `set_state` method with an additional boolean flag that specifies whether the write attempt has succeeded.
ghstack-source-id: 127629538

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28058980

fbshipit-source-id: 26333790c39386891beb155b20ba1291d2cbdd03
2021-05-03 11:37:03 -07:00
Can Balioglu
a6f60cf4f0 [12/n] Rename last_keep_alives to last_heartbeats in _RendezvousState (#57141)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57141

Per feedback this PR renames `last_keep_alives` to `last_heartbeats` in `_RendezvousState`.
ghstack-source-id: 127629442

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28058948

fbshipit-source-id: 0db12eac56a47a426a7a48fb5c93ac6a08b0d22e
2021-05-03 11:37:01 -07:00
Can Balioglu
3209364724 [11/n] [torch/elastic] Add heartbeat timeout to RendezvousTimeout (#57140)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57140

This PR introduces a new `heartbeat` attribute in `RendezvousTimeout`.
ghstack-source-id: 127626815

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28058908

fbshipit-source-id: c6f8b3a06210cc59714fa841d9387eeb028dc02f
2021-05-03 11:37:00 -07:00
Can Balioglu
6876e15dbe [10/n] [torch/elastic] Add comparison operators to _NodeDesc (#57139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57139

This PR sets the `order` attribute of the `dataclass` annotation to `True` in order to introduce comparison operators for `_NodeDesc`.
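The effect of `order=True` can be sketched with a stand-in class (field names here are hypothetical, not the real `_NodeDesc` attributes): the dataclass machinery generates `__lt__`/`__le__`/`__gt__`/`__ge__` that compare instances as tuples of their fields, in declaration order.

```python
from dataclasses import dataclass

# Hypothetical stand-in for _NodeDesc; order=True generates the
# comparison operators from the fields in declaration order.
@dataclass(order=True, frozen=True)
class NodeDesc:
    addr: str
    pid: int
    local_id: int

a = NodeDesc("host1", 100, 0)
b = NodeDesc("host1", 100, 1)
print(a < b)                    # -> True (compared as field tuples)
print(sorted([b, a])[0] == a)   # -> True
```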
ghstack-source-id: 127626783

Test Plan: Run the existing unit tests.

Reviewed By: tierex

Differential Revision: D28058851

fbshipit-source-id: 66313f84f507100e20acb687a3427b3dd51a6310
2021-05-03 11:36:58 -07:00
Can Balioglu
6bf8df6b3b [9/n] [torch/elastic] Introduce RendezvousSettings (#56537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56537

This PR introduces the `RendezvousSettings` type to consolidate the arguments passed to `DynamicRendezvousHandler`.
ghstack-source-id: 127626738

Test Plan: Run the existing unit tests.

Reviewed By: tierex

Differential Revision: D27890155

fbshipit-source-id: 22060c25b6927cc832f18ae6c5f7ba0f7a9ef3cf
2021-05-03 11:36:04 -07:00
Atul Jangra
ca814904b4 Handle error reporting when reply file already exists (#57217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57217

In the torch multiprocessing error handler, we try to remove the reply file if it already exists. Before removing it, we try to log its contents, on the assumption that they are valid JSON.
However, in some cases they aren't, and we end up not clearing the file.
Let's handle this error and make sure the file is cleaned up regardless of its contents.
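The cleanup behavior described above can be sketched as a minimal helper (hypothetical name; not the code from the diff): log the contents if they parse, but remove the file in a `finally` block either way.

```python
import json
import logging
import os
import tempfile

def consume_reply_file(path: str) -> None:
    # Best-effort: log the contents if they are valid JSON, but remove
    # the file unconditionally so a corrupt reply never lingers.
    try:
        with open(path) as f:
            logging.info("reply file contents: %s", json.load(f))
    except (OSError, json.JSONDecodeError):
        logging.warning("reply file %s was unreadable or not valid JSON", path)
    finally:
        if os.path.exists(path):
            os.remove(path)

fd, p = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("not json {")      # deliberately corrupt contents
consume_reply_file(p)
print(os.path.exists(p))       # -> False: the file is gone either way
```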

Reviewed By: devashisht

Differential Revision: D28041470

fbshipit-source-id: da96d11b8f7091715cf0152cccd3ecc08b688eae
2021-04-29 04:57:35 -07:00
Aliaksandr Ivanou
6ff0002b12 Pytorch: enable many torchelastic tests (#56970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56970

The diff enables the metrics, events, utils and timer tests on the ci/cd pipeline

Test Plan: ci/cd

Reviewed By: cbalioglu

Differential Revision: D28015200

fbshipit-source-id: 6b419aaf9e62a10a747b6511bff90c82cfb7bcd6
2021-04-28 17:05:09 -07:00
Aliaksandr Ivanou
0df574017d Torchelastic: add support for the new error file format (#57084)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57084

The diff adds support for a new error message file format:

    {
        "message":"test",
        "timestamp": 12
    }

Test Plan:
fbcode buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test

example job: tsm_aivanou-torchelastic_distributed_sum_77c0b147

Reviewed By: borovsky-d, wilson100hong

Differential Revision: D28042764

fbshipit-source-id: 4d21c2319654f3460d551d91cbf48568356cf4e8
2021-04-28 00:04:45 -07:00
Aliaksandr Ivanou
0a72904ab4 Torchelastic: make process failure init error non-fatal (#56739)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56739

The diff makes several tiny changes:
* Add logs for each worker error file destination
* Make sure log_dir is propagated from the launcher
* Make ProcessFailure initialization error non-fatal.

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test

    https://fburl.com/tupperware/0nizb9z8

Reviewed By: borovsky-d, wilson100hong

Differential Revision: D27952596

fbshipit-source-id: 69582bf4be47758def4008f2abf82d123294cd1a
2021-04-23 00:49:47 -07:00
Can Balioglu
853112bbfc [7/n] [torch/elastic] Rename _Rendezvous to _RendezvousState (#56535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56535

This PR renames the `_Rendezvous` class to `_RendezvousState` in preparation of the upcoming changes.
ghstack-source-id: 126979138

Test Plan: Run the existing unit tests.

Reviewed By: H-Huang

Differential Revision: D27889894

fbshipit-source-id: 027d26aa5e1acd5bba3ad2e58b140428a4a176b2
2021-04-21 16:01:03 -07:00
Can Balioglu
21d9bc246b [6/n] [torch/elastic] Reorder type definitions in dynamic_rendezvous.py (#56534)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56534

This PR reorders the type definitions in dynamic_rendezvous.py to increase the readability.
ghstack-source-id: 126979087

Test Plan: Run the existing unit tests.

Reviewed By: H-Huang

Differential Revision: D27889817

fbshipit-source-id: 04291af9b8f3170e4b33cb4f33e0dff0d2d3fb23
2021-04-21 16:01:02 -07:00
Can Balioglu
df91eb924c [5/n] [torch/elastic] Introduce the delay utility function (#56533)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56533

This PR introduces a small utility function to delay the execution of the current thread.
ghstack-source-id: 126979035

Test Plan: Run the associated unit tests.

Reviewed By: H-Huang

Differential Revision: D27889671

fbshipit-source-id: aae93b624bd4704da7a48004f50d130cec64969d
2021-04-21 16:01:00 -07:00
Can Balioglu
76ca1eeeb8 [4/n] [torch/elastic] Fix the finalizer of PeriodicTimer (#56532)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56532

This PR fixes a subtle issue with the finalizer implementation of `_PeriodicTimer`.

We avoid using a regular finalizer (a.k.a. `__del__`) for stopping the timer, as joining a daemon thread during interpreter shutdown can cause deadlocks. `weakref.finalize` is a superior alternative that provides consistent behavior regardless of the GC implementation.
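The pattern can be sketched as below (a simplified stand-in, not the actual `_PeriodicTimer` code). The key detail is that the callback registered with `weakref.finalize` must not hold a reference to `self`, which is why the cleanup is a static method taking the event and thread as explicit arguments:

```python
import threading
import weakref

class StoppableWorker:
    """Minimal sketch of finalizer-based cleanup for a background thread."""

    def __init__(self) -> None:
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._stop_event.wait, daemon=True)
        self._thread.start()
        # Runs when the object is garbage-collected or at interpreter
        # shutdown -- unlike __del__, with consistent, GC-independent timing.
        self._finalizer = weakref.finalize(
            self, self._stop_thread, self._stop_event, self._thread
        )

    @staticmethod
    def _stop_thread(stop_event: threading.Event, thread: threading.Thread) -> None:
        # Static: keeping a bound method here would keep `self` alive
        # and the finalizer would never fire on collection.
        stop_event.set()
        thread.join()

w = StoppableWorker()
w._finalizer()                 # invoking the finalizer doubles as an explicit stop
print(w._thread.is_alive())    # -> False
```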
ghstack-source-id: 126978904

Test Plan: Run the existing unit tests as there is no behavioral change.

Reviewed By: H-Huang

Differential Revision: D27889289

fbshipit-source-id: a248cf6fd1abc4da8bef90e160fa9669a4961fa5
2021-04-21 15:59:19 -07:00
Sam Estep
75024e228c Add lint for unqualified type: ignore (#56290)
Summary:
The other half of https://github.com/pytorch/pytorch/issues/56272.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56290

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI runs (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2384511062
- https://github.com/pytorch/pytorch/actions/runs/765036024

Reviewed By: seemethere

Differential Revision: D27867219

Pulled By: samestep

fbshipit-source-id: e648f07b6822867e70833e23ddafe7fb7eaca235
2021-04-21 08:07:23 -07:00
Aliaksandr Ivanou
c5c5230890 Pytorch resolve bug around incorrect rdzv handler resolution (#56386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56386

The diff resolves a bug around incorrect handler resolution:
_create_static_handler pointed towards etcd, and _create_etcd_handler pointed towards static.

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:test_launcher

Added test_launcher to the ci/cd tests

Reviewed By: cbalioglu

Differential Revision: D27858897

fbshipit-source-id: 440155789958c091ce5755e7c9524e4bb704203a
2021-04-19 23:50:28 -07:00
Kiuk Chung
023231a2ac [torch/distributed] Fix pydoc for torch.distributed.elastic.multiprocessing (replace Redirect with Std)
Summary: `Redirects` was renamed to `Std` in `torch.distributed.elastic.multiprocessing.api`. Pointed out by a user in https://github.com/pytorch/elastic/issues/147.

Test Plan: N/A just doc change

Reviewed By: tierex

Differential Revision: D27866614

fbshipit-source-id: 9fb901aae7ebe11cde13000a1c118de527f34400
2021-04-19 21:40:16 -07:00
Sam Estep
e3900d2ba5 Add lint for unqualified noqa (#56272)
Summary:
As this diff shows, currently there are a couple hundred instances of raw `noqa` in the codebase, which just ignore all errors on a given line. That isn't great, so this PR changes all existing instances of that antipattern to qualify the `noqa` with respect to a specific error code, and adds a lint to prevent more of this from happening in the future.

Interestingly, some of the examples the `noqa` lint catches are genuine attempts to qualify the `noqa` with a specific error code, such as these two:
```
test/jit/test_misc.py:27:            print(f"{hello + ' ' + test}, I'm a {test}") # noqa E999
test/jit/test_misc.py:28:            print(f"format blank") # noqa F541
```
However, those are still wrong because they are [missing a colon](https://flake8.pycqa.org/en/3.9.1/user/violations.html#in-line-ignoring-errors), which actually causes the error code to be completely ignored:

- If you change them to anything else, the warnings will still be suppressed.
- If you add the necessary colons then it is revealed that `E261` was also being suppressed, unintentionally:
  ```
  test/jit/test_misc.py:27:57: E261 at least two spaces before inline comment
  test/jit/test_misc.py:28:35: E261 at least two spaces before inline comment
  ```
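The colon rule can be illustrated with a tiny runnable snippet (E226 is flake8's "missing whitespace around arithmetic operator"; the comments only affect linting, not execution):

```python
# Missing colon: flake8 treats this as a bare "noqa" -- the error code
# after it is ignored and ALL errors on the line are silenced:
x = 1+1  # noqa E226

# With the colon, only the named code E226 is suppressed; any other
# error on the line would still be reported:
y = 1+1  # noqa: E226

print(x == y == 2)  # -> True
```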

I did try using [flake8-noqa](https://pypi.org/project/flake8-noqa/) instead of a custom `git grep` lint, but it didn't seem to work. This PR is definitely missing some of the functionality that flake8-noqa is supposed to provide, though, so if someone can figure out how to use it, we should do that instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56272

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI run (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2365189927

Reviewed By: janeyx99

Differential Revision: D27830127

Pulled By: samestep

fbshipit-source-id: d6dcf4f945ebd18cd76c46a07f3b408296864fcb
2021-04-19 13:16:18 -07:00
Aliaksandr Ivanou
a6940aae37 [19/n][torch/elastic][upstream] Replace pytorch.distributed.launch with torchelastic launcher (#56214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56037

The diff introduces a new `torch.distributed.elastic_launch` and removes the internals of `torch.distributed.launch` while keeping backwards compatibility.

Since torchelastic and torch.launch are not fully compatible due to the `--use_env` arg, the `torch.distributed.launch` deprecation is going to be iterative: as part of pytorch 1.9 we are going to deprecate it, and in the following releases we will remove `torch.distributed.launch` entirely.

The diff leaves the `torchelastic.distributed.launch` module in place, and the follow-up diffs will migrate users from `torchelastic.distributed.launch` to `torch.distributed.elastic_launch`

Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/...

Reviewed By: H-Huang

Differential Revision: D27805799

fbshipit-source-id: 599a4c0592fbc7a1bc1953040626dd6b72bac907
2021-04-16 13:38:23 -07:00
Can Balioglu
512c744f2e [torch/elastic] Introduce PeriodicTimer (#55919)
Summary:
This PR introduces a basic timer type that periodically calls a specified function. Its main use in the upcoming `DynamicRendezvousHandler` implementation will be to send periodic keep-alive updates in a background thread.
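A periodic timer of this kind can be sketched in a few lines with the standard library (a simplified stand-in, not the `_PeriodicTimer` implementation from the PR): a daemon thread waits on an event with a timeout and invokes the callback each time the wait expires.

```python
import threading
import time

class PeriodicTimer:
    """Minimal sketch: call `function` every `interval` seconds on a
    background thread until cancelled."""

    def __init__(self, interval: float, function) -> None:
        self._interval = interval
        self._function = function
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Event.wait returns False on timeout (tick) and True once
        # cancel() sets the event, which ends the loop.
        while not self._stop_event.wait(self._interval):
            self._function()

    def start(self) -> None:
        self._thread.start()

    def cancel(self) -> None:
        self._stop_event.set()
        self._thread.join()

hits = []
t = PeriodicTimer(0.01, lambda: hits.append(1))
t.start()
time.sleep(0.1)
t.cancel()
print(len(hits) >= 2)  # -> True: several ticks fired before cancel
```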

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55919

Reviewed By: tierex

Differential Revision: D27740823

Pulled By: cbalioglu

fbshipit-source-id: e46fc848ab033995946a38a29c01d67d387a4cf5
2021-04-15 14:51:14 -07:00
Can Balioglu
71f9e99e29 [torch/elastic] Introduce aux types required by DynamicRendezvousHandler (#55932)
Summary:
This PR includes the auxiliary types used by the upcoming implementation of the `DynamicRendezvousHandler`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55932

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27742329

Pulled By: cbalioglu

fbshipit-source-id: cf2e0d88042909739e7c37c25b4b90192c26e198
2021-04-15 11:12:20 -07:00
Aliaksandr Ivanou
8f663170bd [17/n][torch/elastic] Make torchelastic launcher compatible with the caffe2.distributed.launch (#55687)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55687

The diff makes sure that users can pass the following parameters:
* master_addr
* master_port
* node_rank
* use_env

The diff implements StaticTCPRendezvous, which creates a store with a listener on agent rank #0

The diff modifies caffe2/rendezvous: if the worker processes are launched with the torchelastic agent, they will create a PrefixStore("worker/") from a TCPStore without a listener.

The diff adds macros functionality to torch/distributed/elastic/utils that helps to resolve the local_rank parameter.

Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/test:launch_test

Reviewed By: cbalioglu, wilson100hong

Differential Revision: D27643206

fbshipit-source-id: 540fb26feac322cc3ec0a989fe53324755ccc4ea
2021-04-14 19:33:26 -07:00
Richard Barnes
6269efde91 Add stricter typing to caffe2/torch/distributed/elastic/multiprocessing/errors/__init__.py (#55848)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55848

Test Plan: Sandcastle

Reviewed By: xush6528

Differential Revision: D27714781

fbshipit-source-id: cff651e04c1e8363a249c7de9de01c33db47f003
2021-04-13 10:47:08 -07:00
Sam Estep
4753100a3b Un-ignore F403 in .flake8 (#55838)
Summary:
Generally wildcard imports are bad for the reasons described here: https://www.flake8rules.com/rules/F403.html

This PR replaces wildcard imports with an explicit list of imported items where possible, and adds a `# noqa: F403` comment in the other cases (mostly re-exports in `__init__.py` files).

This is a prerequisite for https://github.com/pytorch/pytorch/issues/55816, because currently [`tools/codegen/dest/register_dispatch_key.py` simply fails if you sort its imports](https://github.com/pytorch/pytorch/actions/runs/742505908).
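The replacement the PR performs can be illustrated with a stdlib module (this is a generic sketch of the F403 fix, not a line from the PR):

```python
# Wildcard import -- flake8 F403; names like "basename" appear from
# nowhere, and the file would need "# noqa: F403" (acceptable only
# for deliberate re-exports in an __init__.py):
#     from os.path import *
#
# Explicit import -- the origin of every name is visible to readers
# and to linters:
from os.path import basename, join

print(basename("a/b.txt"))  # -> b.txt
print(join("a", "b"))       # "a" and "b" joined with the platform separator
```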

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55838

Test Plan: CI. You can also run `flake8` locally.

Reviewed By: jbschlosser

Differential Revision: D27724232

Pulled By: samestep

fbshipit-source-id: 269fb09cb4168f8a51fd65bfaacc6cda7fb87c34
2021-04-13 09:24:07 -07:00
Can Balioglu
e61b4fa691 [3/n] [torch/elastic] Introduce EtcdRendezvousBackend. (#55637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55637

This diff introduces the `EtcdRendezvousBackend` type that will serve as an experimental alternative to the existing `EtcdRendezvousHandler`.

The major advantage of `EtcdRendezvousBackend` is that it delegates the bulk of the rendezvous handling logic to `DynamicRendezvousHandler` which is shared with `C10dRendezvousBackend` (see D27654492) and any other potential future rendezvous backend (e.g. Amazon S3).
ghstack-source-id: 126312209

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27654498

fbshipit-source-id: f3259adfc9068b7e323b947a7d8d52fcd0b8ada1
2021-04-12 22:20:29 -07:00
Can Balioglu
339d3bf394 [2/n] [torch/elastic] Introduce C10dRendezvousBackend. (#55636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55636

This diff introduces:

- The `C10dRendezvousBackend` type to support C10d stores as rendezvous backends.
- A fix to the `TCPStore.compare_set()` function to support non-existent keys.
- A placeholder `c10d-experimental` registry to instantiate C10d-backed rendezvous backends via `get_rendezvous_handler()`.
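The compare-and-set semantics a rendezvous backend relies on, including the non-existent-key case the fix addresses, can be sketched with a dict-backed toy (this models the behavior only; it is not the real `TCPStore` API):

```python
class ToyStore:
    """Dict-backed sketch of a compare-and-set primitive."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def compare_set(self, key: str, expected: str, desired: str) -> str:
        # Treat a missing key as the empty value so the first writer
        # (expected == "") can create it -- the non-existent-key case.
        current = self._data.get(key, "")
        if current == expected:
            self._data[key] = desired
            return desired
        return current  # mismatch: report what is actually stored

s = ToyStore()
print(s.compare_set("state", "", "v1"))    # -> v1  (key created)
print(s.compare_set("state", "v0", "v2"))  # -> v1  (lost the race, no write)
print(s.compare_set("state", "v1", "v2"))  # -> v2  (swap succeeded)
```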
ghstack-source-id: 126312162

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27654492

fbshipit-source-id: 09f498138b35186de4b0e174adb33fb5b5aa4b52
2021-04-12 22:20:27 -07:00
Can Balioglu
b3dd8cde61 [1/n] [torch/elastic] Introduce DynamicRendezvousHandler and RendezvousBackend. (#55635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55635

This diff introduces the `DynamicRendezvousHandler` type as a stub implementation and its accompanying `RendezvousBackend` interface.

`DynamicRendezvousHandler` is intended to be a backend-agnostic type that will contain the core (bulk) logic of rendezvous handling. Any backend specific operation will be delegated to a concrete subclass of `RendezvousBackend` (e.g. `C10dRendezvousBackend` - see D27654492) that is passed as a constructor argument to `DynamicRendezvousHandler`.
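The delegation described above can be sketched as a minimal illustration; the class and method names here are assumptions for exposition, not the exact `torch.distributed.elastic` API:

```python
# Minimal sketch of the backend-delegation pattern described above.
# Names are illustrative assumptions, not the real API surface.
from abc import ABC, abstractmethod


class RendezvousBackend(ABC):
    """Backend-specific state storage (e.g. etcd, a C10d store)."""

    @abstractmethod
    def get_state(self) -> bytes:
        """Return the serialized rendezvous state."""

    @abstractmethod
    def set_state(self, state: bytes) -> None:
        """Persist the serialized rendezvous state."""


class InMemoryBackend(RendezvousBackend):
    """Toy backend used here only to exercise the handler."""

    def __init__(self):
        self._state = b""

    def get_state(self) -> bytes:
        return self._state

    def set_state(self, state: bytes) -> None:
        self._state = state


class DynamicRendezvousHandler:
    """Backend-agnostic handler: core logic here, storage delegated."""

    def __init__(self, backend: RendezvousBackend):
        self._backend = backend

    def record_round(self, round_num: int) -> bytes:
        # All backend-specific I/O goes through the interface, so any
        # concrete backend can be swapped in via the constructor.
        self._backend.set_state(f"round={round_num}".encode())
        return self._backend.get_state()


handler = DynamicRendezvousHandler(InMemoryBackend())
assert handler.record_round(1) == b"round=1"
```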
ghstack-source-id: 126304697

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27654478

fbshipit-source-id: 9fc89a6e4cb308971c65b29a7c5af7ae191f70c5
2021-04-12 22:18:49 -07:00
Ralf Gommers
48ddc9762b Upgrade mypy to version 0.812 (#55712)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54211

This was a little more annoying than expected, because the `exclude = ` key in `mypy.ini` is weird. I'll file an upstream issue about that.

I ignored one file, `torch/distributed/elastic/agent/server/api.py`, that had ~8 errors that were hard to figure out; those can be addressed in a follow-up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55712

Reviewed By: walterddr

Differential Revision: D27694976

Pulled By: malfet

fbshipit-source-id: 228d8be6af040343ce46595dabaca212e69ccc68
2021-04-12 18:08:28 -07:00
Sam Estep
cc11aaaa60 Disallow non-breaking spaces (#55465)
Summary:
malfet found a couple of these in https://github.com/pytorch/pytorch/issues/55346; this PR removes the rest and adds a lint that prevents them from being accidentally added again in the future. It also removes the `-o` flag added in https://github.com/pytorch/pytorch/issues/53733 (which was unnecessarily hiding context without reducing the number of lines of output), and updates the lint error messages to reflect that the individual line numbers are shown in the logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55465

Test Plan:
The "Lint / quick-checks" job in GitHub Actions should succeed on this PR. To verify that the lint does correctly find and error on non-breaking spaces, checkout ece075195d and run it locally:
```sh
(! git --no-pager grep -In $'\u00a0' -- . || (echo "The above lines have non-breaking spaces (U+00A0); please convert them to spaces (U+0020)"; false))
```
It should print over a hundred lines of output and exit with status 1.

Reviewed By: janeyx99

Differential Revision: D27622136

Pulled By: samestep

fbshipit-source-id: e7ffd5a9519093e7a0ffdf55e9291f63e21ce841
2021-04-08 15:44:44 -07:00
Wilson Hong
f665a7f8a1 [pet] Set error code in reply file when child process is terminated by signals.
Summary: Fill the reply file's error code with the ProcessFailure's exitcode. This is necessary when the child process is terminated by a signal (e.g. SIGSEGV).
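For illustration, `multiprocessing` reports death-by-signal as a negative exitcode, so the mapping to an error code can be sketched like this (hypothetical helper, not the actual reply-file code):

```python
# Sketch of mapping a signal-terminated child's exit status to an
# error code (hypothetical helper, not the actual torchelastic code).
# multiprocessing reports death-by-signal as a negative exitcode.
import signal


def reply_error_code(exitcode: int) -> int:
    if exitcode < 0:
        # The child was killed by signal -exitcode; shells
        # conventionally report this as 128 + signal number.
        return 128 + (-exitcode)
    return exitcode


assert reply_error_code(-signal.SIGSEGV) == 128 + signal.SIGSEGV
assert reply_error_code(1) == 1
```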

Test Plan:
- Buck test
```
buck test mode/dev-nosan pytorch/elastic/torchelastic/distributed/fb/test:launch_test
buck test mode/dev-nosan caffe2/torch/distributed/elastic/multiprocessing/errors/fb/test:error_handler_fb_test_needed_coverage
```

- TSM
```
fbpkg build -E torchelastic_distributed_sum

buck run mode/dev-nosan //pytorch/elastic/torchelastic/tsm/fb/cli:tsm -- run_ddp --scheduler mast --fbpkg torchelastic_distributed_sum:ecdf31f --nnodes 2 --nproc_per_node 2 --resource T1  --run_cfg hpcIdentity=oncall_dai_pet,hpcClusterUuid=MastNaoTestCluster main.pa
```
https://www.internalfb.com/mast/job/tsm_wilsonhong-torchelastic_distributed_sum_ef3fd8d3

- classy_vision
```
flow-cli canary  pytorch.elastic.examples.classy_vision.main --entitlement gpu_prod --run-as-secure-group oncall_dai_pet --buck-target //fblearner/flow/projects/pytorch/elastic/examples:workflow
```
https://our.intern.facebook.com/intern/fblearner/details/263970380/?notif_channel=cli

Reviewed By: tierex

Differential Revision: D27512554

fbshipit-source-id: 903d25d96655085685f874113826d4627d9a79e4
2021-04-08 09:58:20 -07:00
Can Balioglu
493a233c04 [torch/elastic] Revise the rendezvous handler registry logic. (#55466)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55466

Improve the implementation and the unit test coverage of `RendezvousHandlerRegistry`.

### Note
See the original diff (D27442325 (df299dbd7d)) that had to be reverted due to an unexpected Python version incompatibility between the internal and external PyTorch CI tests.

Test Plan: Run the existing and newly-introduced unit tests.

Reviewed By: tierex

Differential Revision: D27623215

fbshipit-source-id: 51538d0f154f64e04f685a95d40d805b478c93f9
2021-04-07 20:43:20 -07:00
Aliaksandr Ivanou
f5675f8306 [torchelastic] Make sure torchelastic mp wait for queue to be drained before finishing the process (#55412)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55412

The diff resolves a bug where worker processes could exit before the torchelastic process reads the return values. This is a rare event, but it can still happen, e.g. https://fb.workplace.com/groups/319878845696681/permalink/512409069776990/

When users want to return a torch.Tensor object from a worker process, torchelastic multiprocessing will fail. Currently the worker process finishes its job after it writes output to the IPC queue, without confirmation from the receiver process. When this happens, the underlying channel between the worker and the torchelastic process can be closed (in the case of mp.SimpleQueue the channel is a set of file descriptors, which is why we see FileNotFoundException: since the worker process finished execution, the file descriptor got deleted, and the torchelastic process cannot find it).
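The drain-before-exit pattern described here can be sketched with a plain `multiprocessing` example (hypothetical names; not the actual torchelastic implementation):

```python
# Sketch of the drain/ack pattern described above (hypothetical names,
# not the actual torchelastic implementation). The worker writes its
# result, then blocks on an ack event instead of exiting immediately,
# so the parent can drain the queue before the IPC channel is torn down.
import multiprocessing as mp


def worker(result_queue, ack_event):
    result_queue.put({"rank": 0, "value": 42})
    # Without this wait, the worker could exit and invalidate the
    # queue's file descriptors before the parent reads from them.
    ack_event.wait(timeout=30)


ctx = mp.get_context("fork")  # fork keeps the example self-contained
queue = ctx.SimpleQueue()
ack = ctx.Event()
proc = ctx.Process(target=worker, args=(queue, ack))
proc.start()
result = queue.get()  # the parent drains the queue first...
ack.set()             # ...then signals the worker that it may exit
proc.join()
assert result == {"rank": 0, "value": 42}
```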

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test

User workflow: f263531643

Reviewed By: cbalioglu

Differential Revision: D27602838

fbshipit-source-id: 29871178232e3af4ad3dec406c234aba9c5faba1
2021-04-07 09:39:24 -07:00
Nikita Shulga
add49e7e4e Enforce PEP263 for PyTorch python codebase (#55346)
Summary:
All python files containing non-ASCII characters should be correctly annotated with `# -*- coding: utf-8 -*-` comment

Delete a number of superfluous UTF-8 characters, most commonly the UTF-8 closing quotation mark U+2019 (’) used instead of the ASCII apostrophe ', for example `Module’s`->`Module's`
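For reference, PEP 263 compliance can be checked roughly like this (a hypothetical helper, not the lint actually used for this change):

```python
# Rough sketch of a PEP 263 compliance check (hypothetical helper, not
# the actual lint used for this change): a file containing non-ASCII
# bytes must declare its encoding in one of its first two lines.
import re

# Pattern adapted from the one given in PEP 263.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def needs_coding_comment(source: bytes) -> bool:
    """True if the file has non-ASCII bytes but no PEP 263 comment."""
    try:
        source.decode("ascii")
        return False  # pure ASCII never needs a declaration
    except UnicodeDecodeError:
        pass
    first_two = source.splitlines()[:2]
    return not any(CODING_RE.match(line) for line in first_two)


assert needs_coding_comment("x = 'Module\u2019s'".encode("utf-8"))
assert not needs_coding_comment(
    b"# -*- coding: utf-8 -*-\n" + "x = 'Module\u2019s'".encode("utf-8")
)
assert not needs_coding_comment(b"x = 1\n")
```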

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55346

Reviewed By: samestep

Differential Revision: D27582044

Pulled By: malfet

fbshipit-source-id: c1cd89655915858ff3a41f675cdfffff795a8e44
2021-04-06 18:31:38 -07:00
Brian Hirsh
ae3a876c9c Revert D27572158: [torchelastic] Make sure torchelastic mp wait for queue to be drained before finishing the process
Test Plan: revert-hammer

Differential Revision:
D27572158 (e9c6a51100)

Original commit changeset: 9a360468acc9

fbshipit-source-id: 29f7e2cba3e134bc81fb31b7e1dfceb7c1f9d734
2021-04-06 11:41:55 -07:00
Aliaksandr Ivanou
e9c6a51100 [torchelastic] Make sure torchelastic mp wait for queue to be drained before finishing the process
Summary:
The diff resolves a bug where worker processes could exit before the torchelastic process reads the return values. This is a rare event, but it can still happen, e.g. https://fb.workplace.com/groups/319878845696681/permalink/512409069776990/

When users want to return a torch.Tensor object from a worker process, torchelastic multiprocessing will fail. Currently the worker process finishes its job after it writes output to the IPC queue, without confirmation from the receiver process. When this happens, the underlying channel between the worker and the torchelastic process can be closed (in the case of mp.SimpleQueue the channel is a set of file descriptors, which is why we see FileNotFoundException: since the worker process finished execution, the file descriptor got deleted, and the torchelastic process cannot find it).

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test

User workflow: f263531643

Reviewed By: cbalioglu, wilson100hong

Differential Revision: D27572158

fbshipit-source-id: 9a360468acc98d85d587ebf223e7e96d4b43fe4b
2021-04-06 11:03:00 -07:00
Brian Hirsh
bf70fe69ae Revert D27442325: [torch/elastic] Revise the rendezvous handler registry logic.
Test Plan: revert-hammer

Differential Revision:
D27442325 (df299dbd7d)

Original commit changeset: 8519a2caacbe

fbshipit-source-id: f10452567f592c23ae79ca31556a2a77546726b1
2021-04-06 06:17:14 -07:00
Can Balioglu
df299dbd7d [torch/elastic] Revise the rendezvous handler registry logic.
Summary: Improve the implementation and the unit test coverage of `RendezvousHandlerRegistry`.

Test Plan: Run the existing and newly-introduced unit tests.

Reviewed By: tierex

Differential Revision: D27442325

fbshipit-source-id: 8519a2caacbe2e3ce5d9a02e87a910503dea27d7
2021-04-05 23:38:29 -07:00
Can Balioglu
359d0a0205 [torch/elastic] Improve the implementation of RendezvousParameters and add its unit tests. (#146)
Summary:
Pull Request resolved: https://github.com/pytorch/elastic/pull/146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54807

Improve the implementation and the unit test coverage of `RendezvousParameters`.

Test Plan: Run the existing and newly-introduced unit tests.

Reviewed By: kiukchung

Differential Revision: D27342444

fbshipit-source-id: 88de356c0a799844a739eb9105185bb8c1acf11f
2021-04-05 23:38:27 -07:00
Can Balioglu
7f06c65a4c [torch/elastic] Improve the implementation of the utility functions and add their unit tests. (#54804)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54804

Improve the implementation of the utility functions to handle more edge cases, and add a new set of unit tests to cover their usage.

Test Plan: Run the existing and newly introduced unit tests.

Reviewed By: kiukchung

Differential Revision: D27327898

fbshipit-source-id: 96b6fe2d910e3de69f44947a0e8a9f687ab50633
2021-04-05 23:38:25 -07:00
Can Balioglu
de7f05b9eb [torch/elastic] Expose a stderr parameter in EtcdServer. (#54805)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54805

Expose a `stderr` parameter on `EtcdServer` to get clean unit test output.

Test Plan: Run the existing test suite.

Reviewed By: kiukchung

Differential Revision: D27327495

fbshipit-source-id: 0a342aeda0ff4d85d809aab1cbf155d3fafd4fa1
2021-04-05 23:38:22 -07:00
Can Balioglu
bad8d34780 [torch/elastic] Revise the rendezvous exception types. (#54803)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54803

Revise the rendezvous exception types to align their naming convention more closely with the standard Python exception types.

Test Plan: Run the existing test suite.

Reviewed By: H-Huang

Differential Revision: D27327505

fbshipit-source-id: 862c59222f9ca61a0e5afde89ae8f226090b4f92
2021-04-05 23:36:50 -07:00
Aliaksandr Ivanou
77ccd4f9a3 [5/n][torch/elastic][upstream] Move torchelastic/agent to torch/distributed/elastic/agent (#54343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54343

Move torchelastic/agent to torch/distributed/elastic/agent

Test Plan:
```
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test/...
```

Reviewed By: kiukchung, wilson100hong

Differential Revision: D27173271

fbshipit-source-id: 26761acc3f962af2afffcc3c7a237f3b6d65e531
2021-03-22 23:15:37 -07:00
Aliaksandr Ivanou
e91aeb0470 [4/n][torch/elastic][upstream] Move torchelastic/metrics to torch/distributed/elastic/metrics (#53870)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53870

Move torchelastic/metrics to torch/distributed/elastic/metrics

Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/...

Reviewed By: kiukchung

Differential Revision: D26970901

fbshipit-source-id: 0e0a211fe509b7bc3ab10adfefba81cd71b6db37
2021-03-15 16:07:18 -07:00
Aliaksandr Ivanou
ec484981c6 [3/n][torch/elastic][upstream] Move torchelastic/events to torch/distributed/events (#53760)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53760

Pull Request resolved: https://github.com/pytorch/elastic/pull/143

The diff upstreams torchelastic/events to torch.

Test Plan:
```
buck test mode/dev-nosan //pytorch/elastic/torchelastic/agent/...
buck test mode/dev-nosan //caffe2/test/distributed/elastic/events/fb/...
```

Reviewed By: kiukchung

Differential Revision: D26932830

fbshipit-source-id: 23fc10d2ead5af7f7ed510ae0d2581cc2421cf76
2021-03-11 11:25:24 -08:00
Kiuk Chung
b03c92a9c5 [2/n][torch/elastic][upstream] Move torchelastic/timer torchelastic/multiprocessing to torch/distributed/elastic (#53574)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53574

Upstreams `torchelastic/timer|multiprocessing` to `torch/distributed/elastic/timer|multiprocessing`

Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/...
buck test mode/dev-nosan //caffe2/test/distributed/elastic/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
buck test mode/dev-nosan //hpc/...
buck test mode/dev-nosan //caffe2/torch/fb/training_toolkit/...
```

Reviewed By: borovsky-d, wilson100hong

Differential Revision: D26899809

fbshipit-source-id: e6dbc2a78282eac296c262b3206a979e3ef1ff53
2021-03-10 12:32:53 -08:00
Kiuk Chung
ba75cedfc5 [1/n][torch/elastic][upstream] Move torchelastic/rendezvous to torch/distributed/rendezvous (#53172)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53172

Pull Request resolved: https://github.com/pytorch/elastic/pull/141

Upstreams two modules to torch:

1. `torchelastic.rendezvous`
2. `torchelastic.utils`

These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic.

==== NOTES: ====
1. I'm disabling the etcd_rendezvous and etcd_server tests in CircleCI for the moment since I need to edit the test dockers to contain the etcd server binary (there are 4-5 test dockers, one for each platform, so this is going to take some time for me to set up the environments and test) - T85992919.

2. I've fixed all lint errors in the Python files, but there are remaining ones in the cpp files in ZeusRendezvous. I took a look at them, and I don't want to fix those linter errors right now for 2 major reasons:
     1. Some of them are more than formatting changes (e.g. std::move vs pass by value) and I don't want to introduce bundled changes with the move
     1. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems, and I think it's better for us to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic - T86012579.

Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/data/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/fb/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
```
\+ Sandcastle

Reviewed By: H-Huang

Differential Revision: D26718746

fbshipit-source-id: 67cc0350c3d847221cb3c3038f98f47915362f51
2021-03-05 11:27:57 -08:00