Commit Graph

45 Commits

Author SHA1 Message Date
Carlos Mocholi
aade4fbd55 Expose the rendezvous keepalive arguments (#145228)
Enables support for this:

```python
from torch.distributed.launcher.api import LaunchConfig

config = LaunchConfig(
    ...,
    rdzv_configs={"keep_alive_interval": 1122, "heartbeat_timeout": 321, "keep_alive_max_attempt": 5},
)
```

These arguments are currently hard-coded inside torchrun. The default values are not suitable for jobs with thousands of ranks.

Today, `rdzv_configs` only allows the keys `join_timeout`, `last_call_timeout`, and `close_timeout`.
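
For context, a hedged sketch of end-to-end usage through the Python launcher API once these keys are exposed; the entrypoint, endpoint, and numeric values below are illustrative, not recommendations:

```python
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def train_fn():
    ...  # placeholder for a user-provided entrypoint

config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=8,
    rdzv_backend="c10d",
    rdzv_endpoint="localhost:29400",
    # Keepalive knobs this PR exposes; the defaults target much smaller jobs.
    rdzv_configs={
        "keep_alive_interval": 30,
        "heartbeat_timeout": 120,
        "keep_alive_max_attempt": 5,
    },
)
elastic_launch(config, train_fn)()
```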

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145228
Approved by: https://github.com/wconstab
2025-03-03 19:11:56 +00:00
Aaron Orenstein
316808e4e9 PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145163
Approved by: https://github.com/Skylion007
2025-01-19 20:55:59 +00:00
bobrenjc93
08be9ec312 Migrate from Tuple -> tuple in torch/distributed (#144258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258
Approved by: https://github.com/aorenste
2025-01-10 08:34:54 +00:00
Junjie Wang (PyTorch)
13eb3b3f6f [Torch Elastic] Fix the bug caused by wrong host address in creating TCPStore server inside dynamic rendezvous (#139702)
Summary: During dynamic rendezvous, we shouldn't use the address from the store but should use `self._this_node.addr` directly, because the store host is sometimes not the host of rank 0. Passing the wrong host causes a timeout error. This is a follow-up fix to S463164; for internal tests, we disable TCPStore sharing for now.
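
As a hedged illustration of the described fix (not the actual diff): the TCPStore server should be created on the current node's own address, the role `self._this_node.addr` plays above, rather than on whatever host the rendezvous store reports:

```python
import socket
from datetime import timedelta

from torch.distributed import TCPStore

# Stand-in for self._this_node.addr from the summary.
this_node_addr = socket.getfqdn()

# Host the server on this node; other ranks connect with is_master=False.
server = TCPStore(
    host_name=this_node_addr,
    port=29500,
    is_master=True,
    timeout=timedelta(seconds=30),
)
```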

Test Plan: CI.

Differential Revision: D65453312

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139702
Approved by: https://github.com/XilunWu
2024-11-05 15:28:03 +00:00
fduwjj
40c825d773 [reland] [torchelastic][c10d] Fix store prefix race in rendezvous (#136768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136768
Approved by: https://github.com/kwen2501, https://github.com/atalman
2024-09-26 17:37:07 +00:00
PyTorch MergeBot
706eda5cd8 Revert "[RFC][torchelastic][c10d] Fix store prefix race in rendezvous (#135957)"
This reverts commit 5033a1ca0d.

Reverted https://github.com/pytorch/pytorch/pull/135957 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/135957#issuecomment-2372493186))
2024-09-24 22:24:26 +00:00
Amin Alam
1266be21f4 deprecated datetime.utcnow() fix and _RendezvousJoinOp module initiation bug fix (#136141)
Fix to #136140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136141
Approved by: https://github.com/kwen2501
2024-09-24 07:26:10 +00:00
fduwjj
5033a1ca0d [RFC][torchelastic][c10d] Fix store prefix race in rendezvous (#135957)
1. We take option 3 as discussed in https://github.com/pytorch/pytorch/issues/135712: every time we retry, we first create a new TCPStore server, so we no longer need to append the attempt count as a prefix and we avoid eventual TCPStore sync failures. (This applies only when TCPStore sharing is enabled.)
2. We start the new server bound to an ephemeral port (i.e., 0) so it gets assigned a free port, then pass that port downstream (to the trainer or c10d). This way the TCPStore is managed by the elastic agent rather than racing to bind a specific port in the trainer; a minimal sketch follows below.
3. The port is then broadcast for dynamic_rendezvous.

One open question: what do we do about the store created by `_create_tcp_store` in torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py? Are we OK with creating a duplicate TCPStore server?
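
A minimal sketch of point 2, assuming `TCPStore` accepts port 0 and exposes the bound port through its `port` property:

```python
from datetime import timedelta

from torch.distributed import TCPStore

# Bind to port 0 so the OS assigns a free ephemeral port; no race on a
# fixed port inside the trainer.
server = TCPStore("localhost", 0, is_master=True, timeout=timedelta(seconds=30))

# Read back the assigned port to broadcast downstream (trainer or c10d).
print(f"TCPStore server bound to port {server.port}")
```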

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135957
Approved by: https://github.com/d4l3k, https://github.com/c-p-i-o
2024-09-23 20:32:24 +00:00
Tristan Rice
196748d491 [elastic] support local_addr across all rendezvous impls (#135262)
Summary:
There was a regression introduced in https://github.com/pytorch/pytorch/pull/125743 that left `local_addr` unused. This fixes it by passing `local_addr` to `RendezvousStoreInfo.build` everywhere it's used.

This also fixes a number of tests, allowing them to be run in parallel, which hugely sped up the testing cycle since this change touches many different rendezvous implementations. It also required a few fixes in unrelated tests. A sketch of the `local_addr` fix follows.
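
A hedged sketch of the fix, assuming `RendezvousStoreInfo.build` accepts the `local_addr` argument as described; the address value is made up:

```python
from datetime import timedelta

from torch.distributed import TCPStore
from torch.distributed.elastic.rendezvous.api import RendezvousStoreInfo

store = TCPStore("localhost", 0, is_master=True, timeout=timedelta(seconds=10))

# Thread the user-supplied local_addr through instead of dropping it.
info = RendezvousStoreInfo.build(rank=0, store=store, local_addr="10.0.0.7")
print(info.master_addr, info.master_port)
```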

Test Plan:
Added tests for the common rendezvous implementations verifying that `local_addr` is honored, to prevent future regressions.

```
buck2 test @//mode/dev-nosan fbcode//caffe2/test/distributed/elastic/... fbcode//caffe2/torch/distributed/elastic/... -- --stress-runs 3
```

To vet the parallelism changes I also ran with 3 stress runs each to identify flakiness caused by parallelism.

Differential Revision: D62256407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135262
Approved by: https://github.com/fduwjj, https://github.com/wz337
2024-09-06 17:55:43 +00:00
Xilun Wu
e7731b3f8a [TorchElastic] make torch elastic not have to realize TCPStore backend type and rely on c10d to decide which backend to use (#134882)
D53335860 and D56435815 added an option to torch elastic that lets users choose which TCPStore backend type to use, either by
1) explicitly passing an argument in user code when instantiating `MastRendezvousHandler`, or
2) passing the `--use_libuv` command line argument to `torchrun`.

The motivation was to offer a quick way to roll back to the non-libuv TCPStore backend while we were making libuv the default in `c10d`. We now think it is better for torch elastic not to be aware of the TCPStore backend type at all and to rely on `c10d`'s mechanism to decide which backend to use. The TCPStore backend type used by torch elastic will therefore be identical to the one used elsewhere in pytorch.

PyTorch's TCPStore uses the environment variable `USE_LIBUV` to determine the backend type:
- when `USE_LIBUV="0"`, the non-libuv backend is used;
- when `USE_LIBUV="1"` (the default), the libuv backend is used.
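
A hedged example of the resulting behavior: the backend choice follows the `c10d`-level environment variable rather than any torch elastic flag; host and port values are illustrative:

```python
import os
from datetime import timedelta

from torch.distributed import TCPStore

# Opt out of the libuv backend; leaving USE_LIBUV unset or set to "1"
# selects libuv, the default.
os.environ["USE_LIBUV"] = "0"

store = TCPStore("localhost", 29501, is_master=True, timeout=timedelta(seconds=30))
```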

Differential Revision: [D58259590](https://our.internmc.facebook.com/intern/diff/D58259590/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134882
Approved by: https://github.com/shuqiangzhang
2024-09-03 19:43:21 +00:00
Xuehai Pan
e6d4451ae8 [BE][Easy] enable UFMT for torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/ (#128866)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128866
Approved by: https://github.com/fegin
2024-06-18 13:51:53 +00:00
Kurman Karabukaev
bb13fad7aa Share TCPStore by default when using c10d rdzv handler (#128096)
Summary:
A number of features rely on the TCPStore as a control plane. By default the TCPStore server is started on the rank-0 trainer, which can create a race condition: rank 0 may exit (through an error or a graceful exit) while other ranks are still reading from or writing to the store, causing them to fail.

Solution: the TCPStore server should outlive all the trainer processes. Moving ownership of the TCPStore to the torchelastic agent naturally fixes the server's lifecycle.

Static rendezvous in torchelastic already supports sharing the TCPStore server. We are extending this to the more commonly used c10d rendezvous handler.

Any handler that wants to manage the TCP store has to:
- return true from its `use_agent_store` property;
- make the `master_addr`/`master_port` values in `RendezvousInfo`'s `RendezvousStoreInfo` refer to the managed TCPStore (these are returned by the `next_rendezvous` call).

Note: in some instances users may want to use non-TCPStore-based stores for the torchelastic rendezvous process itself, so the handler will need to create and hold a reference to the TCPStore (as done in this change). A minimal handler sketch follows.
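
A minimal handler sketch of this contract, assuming only the `RendezvousHandler` base class and the `use_agent_store` property named above; the class name is illustrative and the remaining abstract methods are elided:

```python
from torch.distributed.elastic.rendezvous.api import RendezvousHandler

class TCPStoreSharingHandler(RendezvousHandler):  # illustrative name
    @property
    def use_agent_store(self) -> bool:
        # Opt in: the agent-managed TCPStore outlives the trainers.
        return True

    # next_rendezvous() must return a RendezvousInfo whose
    # RendezvousStoreInfo master_addr/master_port point at the managed
    # TCPStore; that implementation is elided here.
```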

Test Plan:
`cat ~/workspace/dist-demo/stores.py`
~~~
import logging
import os
import sys
import time

import torch
import torch.distributed as dist

logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stderr))
logger.setLevel(logging.INFO)

def _run_test(store):
    if dist.get_rank() == 1:
        logger.info("Rank %s is sleeping", dist.get_rank())
        time.sleep(5)
        key = "lookup_key"
        logger.info("Checking key %s in store on rank %s", key, dist.get_rank())
        store.check([key])
    else:
        logger.info("rank %s done", dist.get_rank())

def main() -> None:
    use_gpu = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_gpu else "gloo")
    dist.barrier()

    logger.info(f"Hello World from rank {dist.get_rank()}")

    host = os.environ['MASTER_ADDR']
    port = os.environ['MASTER_PORT']
    world_size = os.environ['WORLD_SIZE']

    logger.info("testing TCPStore")
    store = dist.TCPStore(
        host_name=host, port=int(port), world_size=int(world_size),
    )
    _run_test(store)

if __name__ == "__main__":
    main()
~~~

With the fix (TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 or just drop the option)
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 1
Hello World from rank 2
Hello World from rank 0
testing TCPStore
testing TCPStore
testing TCPStore
rank 2 done
Rank 1 is sleeping
rank 0 done
Checking key lookup_key in store on rank 1
~~~

TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 0
Hello World from rank 2
Hello World from rank 1
testing TCPStore
testing TCPStore
testing TCPStore
rank 0 done
rank 2 done
Rank 1 is sleeping
Checking key lookup_key in store on rank 1
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/kurman/workspace/dist-demo/stores.py", line 46, in <module>
[rank1]:     main()
[rank1]:   File "/home/kurman/workspace/dist-demo/stores.py", line 42, in main
[rank1]:     _run_test(store)
[rank1]:   File "/home/kurman/workspace/dist-demo/stores.py", line 22, in _run_test
[rank1]:     store.check([key])
[rank1]: torch.distributed.DistNetworkError: Connection reset by peer
E0605 17:40:22.853277 140249136719680 torch/distributed/elastic/multiprocessing/api.py:832] failed (exitcode: 1) local_rank: 1 (pid: 2279237) of binary: /home/kurman/.conda/envs/pytorch_38/bin/python
Traceback (most recent call last):
  File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/users/kurman/pytorch/torch/distributed/run.py", line 904, in <module>
    main()
  File "/data/users/kurman/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/data/users/kurman/pytorch/torch/distributed/run.py", line 900, in main
    run(args)
  File "/data/users/kurman/pytorch/torch/distributed/run.py", line 891, in run
    elastic_launch(
  File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/kurman/workspace/dist-demo/stores.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-05_17:40:22
  host      : devgpu011.cln5.facebook.com
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2279237)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
~~~

Differential Revision: D58180193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128096
Approved by: https://github.com/shuqiangzhang
2024-06-12 21:49:42 +00:00
Aaron Orenstein
3a0d088517 Flip default value for mypy disallow_untyped_defs [5/11] (#127842)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127842
Approved by: https://github.com/oulgen
2024-06-08 18:49:18 +00:00
Kurman Karabukaev
d62b025efc [TorchElastic] Option for sharing TCPStore created by rdzv handlers (#125743)
Summary:

1. Define an explicit `use_agent_store` property on rdzv handlers. Handlers that set it to true can share the store.
2. Instead of the agent coordinating the master_addr/master_port values, the logic is now encapsulated by the *rdzv_handler*: `RendezvousInfo` carries a `RendezvousStoreInfo` object that handlers must return.
    - Depending on the implementation they can either:
         - point to an existing store (and are expected to report `use_agent_store` as true, per point 1). Client code relies on the `TORCHELASTIC_USE_AGENT_STORE` env variable to know whether the store is shared.
         - build args from which `torch.distributed.init_process_group` can bootstrap by creating a new store.

Additional points:

- When the TCPStore is shared, it should be wrapped in a PrefixStore to qualify/scope the namespace for other use cases.
- The `next_rendezvous` signature changed to return an instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple, for extensibility; a consumption sketch follows this list.
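
A hedged sketch of consuming the new return type, given a rendezvous `handler` obtained elsewhere (hypothetical variable); attribute names follow the summary:

```python
from torch.distributed import PrefixStore

# next_rendezvous() now returns a RendezvousInfo instead of a 3-tuple.
rdzv_info = handler.next_rendezvous()

store = rdzv_info.store
rank, world_size = rdzv_info.rank, rdzv_info.world_size

# Scope the shared TCPStore so different components don't collide on keys.
scoped_store = PrefixStore("my_component/", store)
```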

Why:
- Reduce moving parts:
   - easier to swap implementations
   - improves tractability
   - addressing perf/debuggability will benefit all use cases
Test Plan: CI

Differential Revision: D57055235

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125743
Approved by: https://github.com/d4l3k
2024-05-22 18:24:11 +00:00
Xuehai Pan
93e249969b [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261)
Remove useless parentheses in `raise` statements if the exception type is raised with no argument.
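
For reference, a before/after of the pattern this rule rewrites (ruff's RSE rule, as named in the title):

```python
def before() -> None:
    raise NotImplementedError()  # unnecessary parentheses, flagged by RSE

def after() -> None:
    raise NotImplementedError  # equivalent: the bare exception type is raised
```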

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261
Approved by: https://github.com/albanD
2024-04-17 19:29:34 +00:00
Chirag Pandya
b6201a60c5 [BE] minor logging cleanup in distributed (#122921)
Summary:
    Minor logging cleanup in distributed library
    1. Don't use "f" formatted strings - address linter issues.
    2. Nits: Make use of unused `e` (error) in a few logs.
    3. Change info->debug as asked in issue #113545
    4. Nit: rename log -> logger in a few files for consistency
    5. Fix a linter error.

    Test Plan:
    1. Local build passes.
    2. Linter is happy.

    Reviewers: wanchaol

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
2024-03-29 03:34:01 +00:00
Kurman Karabukaev
a60b566d37 [TorchElastic] Support for overprovisioning in C10 based rendezvous (#117066)
Summary:
Allow TorchElastic to manage more nodes than the maximum nnodes specified for a job. The extra nodes will be used as spare capacity/warm nodes for schedulers that support elasticity.

RFC: https://github.com/pytorch/pytorch/issues/114097

Test Plan: Integration tests

Differential Revision: D52343874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117066
Approved by: https://github.com/zdevito
2024-01-18 01:16:55 +00:00
Kazuaki Ishizaki
91973e1c31 Issue113185 (#113523)
Fixes #113185

I have fixed the given docstring errors. The before-and-after outputs with the error counts are in the linked PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113523
Approved by: https://github.com/kit1980
2023-11-14 22:25:28 +00:00
Kazuaki Ishizaki
35fd5c548e Fix typos under torch/distributed directory (#95638)
This PR fixes typos in comments and messages of `.py` files under torch/distributed directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638
Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980
2023-03-27 21:13:44 +00:00
Chris Zheng
5d37890b8e Update torchrun and TorchElastic to take optional local_addr param to allow skip local IP lookup if specified (#88922)
Summary:
Update dynamic rendezvous nodes to use the rendezvous hostname if provided.
For PR: https://github.com/pytorch/pytorch/issues/85300

Before:
For dynamic rendezvous, each node always grabbed the `fqdn` from the socket, even if the user specified an address.
For example,
https://github.com/pytorch/pytorch/blob/master/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py#L248-L256
```
return _NodeDesc(socket.getfqdn(), os.getpid(), local_id)
```

Now:
If the user specifies the hostname, each node respects the given hostname,
e.g. `socket.getfqdn(<hostname>)`. A small illustration follows.
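
A small hedged illustration of the behavior change; the hostname value is made up:

```python
import os
import socket

local_addr = "worker-3.example.internal"  # hypothetical user-provided address

# Respect the provided hostname; fall back to the local FQDN lookup.
name = socket.getfqdn(local_addr) if local_addr else socket.getfqdn()
print(name, os.getpid())
```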

Test Plan: Unit tests.

Differential Revision: D41204028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88922
Approved by: https://github.com/d4l3k
2022-12-21 03:55:01 +00:00
PyTorch MergeBot
9db3c517de Add __all__ for torch.nn.modules, torch.distributed.elastic, torch.nn.utils submodules (#80240)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80240
Approved by: https://github.com/rohan-varma
2022-06-27 17:11:12 +00:00
Neel Pragnesh Gandhi
f2369f12f9 Add logging for dynamic rendezvous (#61822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61822

Added scuba logging to the following files:
- dynamic_rendezvous.py
- c10d_rendezvous_backend.py

NOTE: This diff introduces the use of python's inspect module to easily allow for obtaining the calling method name and filename when logging. This module can mess with python's garbage collector, so special care was taken to never store references to results from inspect.stack() longer than absolutely needed.
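
A hedged sketch of the careful `inspect` usage described above; the helper name is illustrative:

```python
import inspect

def _caller_info(depth: int = 2):
    """Return (function, filename) for a frame up the stack without
    keeping references to inspect.stack() results longer than needed."""
    stack = inspect.stack()
    try:
        frame_info = stack[depth]
        return frame_info.function, frame_info.filename
    finally:
        # Drop the reference promptly so frames can be garbage collected.
        del stack
```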

Test Plan:
The following tests can be run.
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/rendezvous:c10d_rendezvous_backend_test
```
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/rendezvous:dynamic_rendezvous_test
```
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/events:lib_test
```

Reviewed By: aivanou

Differential Revision: D29643774

fbshipit-source-id: f10cd5ebf8f6860856267bc2483c0b85faacb0fd
2021-07-26 09:39:09 -07:00
Can Balioglu
44c442293f [torch/elastic] Fix the edge case when no node is alive (#59663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59663

This PR fixes an edge case bug in `DynamicRendezvousHandler` where the state of the rendezvous is not always entirely updated when one or more nodes are not alive anymore.

Test Plan: Run the existing and newly-introduced unit tests.

Reviewed By: tierex

Differential Revision: D28971809

fbshipit-source-id: ebbb6a5f2b04f045c3732d6cf0f8fdc7c2381a7c
2021-06-09 15:31:50 -07:00
Can Balioglu
f0a5500722 [torch/elastic] Add logging to the sanitize function of RendezvousStateHolder (#58169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58169

This PR adds logging to the `_sanitize()` function of `RendezvousStateHolder` to output the nodes that had no recent heartbeat and are considered "dead".
ghstack-source-id: 128798389

Test Plan: Run the existing tests.

Reviewed By: tierex

Differential Revision: D28333394

fbshipit-source-id: ba0a398a759815e4224b58323c0e743eb383f723
2021-05-12 18:53:55 -07:00
Can Balioglu
028f2f62ac [torch/elastic] Update the rendezvous docs (#58160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58160

This PR updates the Torch Distributed Elastic documentation with references to the new `c10d` backend.
ghstack-source-id: 128783809

Test Plan: Visually verified the correct rendering.

Reviewed By: tierex

Differential Revision: D28384996

fbshipit-source-id: a40b0c37989ce67963322565368403e2be5d2592
2021-05-12 16:54:28 -07:00
Kimish Patel
b7d674eb21 Revert D28331386: [pytorch][PR] [torch/elastic] Update the rendezvous docs
Test Plan: revert-hammer

Differential Revision:
D28331386 (e4418b67c7)

Original commit changeset: 95dd32146222

fbshipit-source-id: 5522d4a09bc06ac42943eec9aa8bf5292cc778b2
2021-05-11 18:10:46 -07:00
Can Balioglu
e4418b67c7 [torch/elastic] Update the rendezvous docs (#57973)
Summary:
This PR updates the rendezvous documentation for the Torch Distributed Elastic section of PyTorch docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57973

Reviewed By: kiukchung

Differential Revision: D28331386

Pulled By: cbalioglu

fbshipit-source-id: 95dd32146222aaeff246bd3c3d2caf0036a9011b
2021-05-11 15:32:50 -07:00
Can Balioglu
bf6e3425b0 [23/n] [torch/elastic] Introduce the implementation of DynamicRendezvousHandler (#57151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57151

This PR introduces the implementation of `DynamicRendezvousHandler`, which mostly ties together the types introduced in the previous PRs.
ghstack-source-id: 127685212

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28060531

fbshipit-source-id: 844ff0e9c869f2bbb85fba05a16002d00eae130f
2021-05-03 18:32:43 -07:00
Can Balioglu
a357fc8a4b [22/n] [torch/elastic] Introduce a new from_backend static constructor for DynamicRendezvousHandler (#57150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57150

This PR refactors the `__init__` method of `DynamicRendezvousHandler` to a `from_backend` static constructor for easier testing and future extensibility.
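
A hedged usage sketch of the new constructor; the store/backend construction mirrors the c10d backend from this stack, and all values are illustrative:

```python
from datetime import timedelta

from torch.distributed import TCPStore
from torch.distributed.elastic.rendezvous.c10d_rendezvous_backend import (
    C10dRendezvousBackend,
)
from torch.distributed.elastic.rendezvous.dynamic_rendezvous import (
    DynamicRendezvousHandler,
)

store = TCPStore("localhost", 29400, is_master=True, timeout=timedelta(seconds=30))
backend = C10dRendezvousBackend(store, "my_run_id")

handler = DynamicRendezvousHandler.from_backend(
    run_id="my_run_id",
    store=store,
    backend=backend,
    min_nodes=2,
    max_nodes=4,
)
```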
ghstack-source-id: 127685183

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28060336

fbshipit-source-id: b07dcbb61e8ff5a536b7b021cd50438010c648dd
2021-05-03 18:32:42 -07:00
Can Balioglu
4a10bd3b58 [21/n] [torch/elastic] Introduce _RendezvousJoinOp (#57149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57149

This PR introduces the `_RendezvousJoinOp` type that represents a rendezvous join operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685142

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059785

fbshipit-source-id: 6e67a54289eef1a2349fcc52f8841e49c139459a
2021-05-03 18:32:40 -07:00
Can Balioglu
81ef683cb3 [20/n] [torch/elastic] Introduce _RendezvousExitOp (#57148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57148

This PR introduces the `_RendezvousExitOp` type that represents a rendezvous exit operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685094

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059764

fbshipit-source-id: 2da428885f1390957242fdd82d68cee2ac273c71
2021-05-03 18:32:38 -07:00
Can Balioglu
baf8f4c0a6 [19/n] [torch/elastic] Introduce _RendezvousKeepAliveOp (#57147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57147

This PR introduces the `_RendezvousKeepAliveOp` type that represents a rendezvous keep-alive heartbeat operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685037

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059733

fbshipit-source-id: 31fd8fc06f03d8f9cd21558b15a06dea7ad85bc6
2021-05-03 18:32:37 -07:00
Can Balioglu
3e024fcfc9 [18/n] [torch/elastic] Introduce _RendezvousCloseOp (#57146)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57146

This PR introduces the `_RendezvousCloseOp` type that represents a rendezvous close operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127684991

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059693

fbshipit-source-id: 6c944d3b4f6a6ed2057ea2921ae8a42609998dd2
2021-05-03 18:32:35 -07:00
Can Balioglu
aa5d35e1d7 [17/n] [torch/elastic] Introduce _DistributedRendezvousOpExecutor (#57145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57145

This PR introduces the `_DistributedRendezvousOpExecutor` type that implements the `_RendezvousOpExecutor` interface for rendezvous state shared via a `_RendezvousStateHolder`.
ghstack-source-id: 127684945

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059417

fbshipit-source-id: 7ef72ea16b54eaaa11a6ece7459d385d49692a84
2021-05-03 18:31:23 -07:00
Can Balioglu
1a6f827ae6 [16/n] [torch/elastic] Introduce _RendezvousOpExecutor (#57144)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57144

This PR introduces the `_RendezvousOpExecutor` interface. Implementers of this interface are responsible for executing rendezvous operations in a state machine that outputs actions based on the current state of the rendezvous.
ghstack-source-id: 127684898

Test Plan: None beyond `flake8` and `mypy` as this is solely an interface definition.

Reviewed By: tierex

Differential Revision: D28059159

fbshipit-source-id: 8e7da33e02336206cddbe76d773681e98c28a98f
2021-05-03 12:18:27 -07:00
Can Balioglu
76bccfb2e0 [15/n] [torch/elastic] Introduce _RendezvousStateHolder (#56538)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56538

This PR introduces the `_RendezvousStateHolder` interface and its accompanying `_BackendRendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state with the other nodes.
ghstack-source-id: 127684796

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D27892600

fbshipit-source-id: a55d884a1f9b0d742787be4dff4271e076c08962
2021-05-03 12:17:18 -07:00
Can Balioglu
233004b4c8 [13/n] Extend the return type of RendezvousBackend's set_state method (#57142)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57142

This PR extends the return type of `RendezvousBackend`'s `set_state` method with an additional boolean flag that specifies whether the write attempt has succeeded.
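
A toy in-memory backend sketch to demonstrate the extended contract; the real backends are store-based, and this only illustrates the compare-and-set semantics:

```python
from typing import Optional, Tuple

class ToyBackend:
    """Illustrative stand-in for a RendezvousBackend."""

    def __init__(self) -> None:
        self._state: bytes = b""
        self._token: int = 0

    def set_state(
        self, state: bytes, token: Optional[int] = None
    ) -> Tuple[bytes, int, bool]:
        if token == self._token:  # compare-and-set succeeded
            self._state, self._token = state, self._token + 1
            return self._state, self._token, True
        # Lost the race: return the winner's state plus succeeded=False.
        return self._state, self._token, False

backend = ToyBackend()
state, token, ok = backend.set_state(b"round-1", token=0)
assert ok
state, token, ok = backend.set_state(b"stale-write", token=0)  # stale token
assert not ok
```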
ghstack-source-id: 127629538

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28058980

fbshipit-source-id: 26333790c39386891beb155b20ba1291d2cbdd03
2021-05-03 11:37:03 -07:00
Can Balioglu
a6f60cf4f0 [12/n] Rename last_keep_alives to last_heartbeats in _RendezvousState (#57141)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57141

Per feedback this PR renames `last_keep_alives` to `last_heartbeats` in `_RendezvousState`.
ghstack-source-id: 127629442

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28058948

fbshipit-source-id: 0db12eac56a47a426a7a48fb5c93ac6a08b0d22e
2021-05-03 11:37:01 -07:00
Can Balioglu
3209364724 [11/n] [torch/elastic] Add heartbeat timeout to RendezvousTimeout (#57140)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57140

This PR introduces a new `heartbeat` attribute in `RendezvousTimeout`.
ghstack-source-id: 127626815

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28058908

fbshipit-source-id: c6f8b3a06210cc59714fa841d9387eeb028dc02f
2021-05-03 11:37:00 -07:00
Can Balioglu
6876e15dbe [10/n] [torch/elastic] Add comparison operators to _NodeDesc (#57139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57139

This PR sets the `order` attribute of the `dataclass` annotation to `True` in order to introduce comparison operators for `_NodeDesc`.
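
For illustration, the effect of the flag on a simplified stand-in for `_NodeDesc` (field names per the rest of this stack):

```python
from dataclasses import dataclass

@dataclass(order=True)
class NodeDesc:  # simplified stand-in for _NodeDesc
    addr: str
    pid: int
    local_id: int

# order=True generates <, <=, >, >= from the fields in declaration order,
# so node descriptors can be compared and sorted deterministically.
nodes = [NodeDesc("b", 2, 0), NodeDesc("a", 1, 0)]
assert sorted(nodes)[0].addr == "a"
```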
ghstack-source-id: 127626783

Test Plan: Run the existing unit tests.

Reviewed By: tierex

Differential Revision: D28058851

fbshipit-source-id: 66313f84f507100e20acb687a3427b3dd51a6310
2021-05-03 11:36:58 -07:00
Can Balioglu
6bf8df6b3b [9/n] [torch/elastic] Introduce RendezvousSettings (#56537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56537

This PR introduces the `RendezvousSettings` type to consolidate the arguments passed to `DynamicRendezvousHandler`.
ghstack-source-id: 127626738

Test Plan: Run the existing unit tests.

Reviewed By: tierex

Differential Revision: D27890155

fbshipit-source-id: 22060c25b6927cc832f18ae6c5f7ba0f7a9ef3cf
2021-05-03 11:36:04 -07:00
Can Balioglu
853112bbfc [7/n] [torch/elastic] Rename _Rendezvous to _RendezvousState (#56535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56535

This PR renames the `_Rendezvous` class to `_RendezvousState` in preparation of the upcoming changes.
ghstack-source-id: 126979138

Test Plan: Run the existing unit tests.

Reviewed By: H-Huang

Differential Revision: D27889894

fbshipit-source-id: 027d26aa5e1acd5bba3ad2e58b140428a4a176b2
2021-04-21 16:01:03 -07:00
Can Balioglu
21d9bc246b [6/n] [torch/elastic] Reorder type definitions in dynamic_rendezvous.py (#56534)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56534

This PR reorders the type definitions in dynamic_rendezvous.py to improve readability.
ghstack-source-id: 126979087

Test Plan: Run the existing unit tests.

Reviewed By: H-Huang

Differential Revision: D27889817

fbshipit-source-id: 04291af9b8f3170e4b33cb4f33e0dff0d2d3fb23
2021-04-21 16:01:02 -07:00
Can Balioglu
71f9e99e29 [torch/elastic] Introduce aux types required by DynamicRendezvousHandler (#55932)
Summary:
This PR includes the auxiliary types used by the upcoming implementation of the `DynamicRendezvousHandler`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55932

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27742329

Pulled By: cbalioglu

fbshipit-source-id: cf2e0d88042909739e7c37c25b4b90192c26e198
2021-04-15 11:12:20 -07:00
Can Balioglu
b3dd8cde61 [1/n] [torch/elastic] Introduce DynamicRendezvousHandler and RendezvousBackend. (#55635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55635

This diff introduces the `DynamicRendezvousHandler` type as a stub implementation and its accompanying `RendezvousBackend` interface.

`DynamicRendezvousHandler` is intended to be a backend-agnostic type that will contain the core (bulk) logic of rendezvous handling. Any backend specific operation will be delegated to a concrete subclass of `RendezvousBackend` (e.g. `C10dRendezvousBackend` - see D27654492) that is passed as a constructor argument to `DynamicRendezvousHandler`.
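
A hedged sketch of the seam being introduced, with method names taken from later PRs in this stack (`get_state`/`set_state`) and simplified types:

```python
from abc import ABC, abstractmethod
from typing import Any, Optional, Tuple

class RendezvousBackendSketch(ABC):
    """Backend-specific state storage; DynamicRendezvousHandler keeps the
    core rendezvous logic and delegates persistence to this interface."""

    @abstractmethod
    def get_state(self) -> Optional[Tuple[bytes, Any]]:
        """Return the serialized rendezvous state and a fencing token."""

    @abstractmethod
    def set_state(
        self, state: bytes, token: Optional[Any] = None
    ) -> Optional[Tuple[bytes, Any, bool]]:
        """Attempt a compare-and-set write of the state."""
```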
ghstack-source-id: 126304697

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27654478

fbshipit-source-id: 9fc89a6e4cb308971c65b29a7c5af7ae191f70c5
2021-04-12 22:18:49 -07:00