Enables configuring the rendezvous keep-alive parameters through `rdzv_configs`:
```python
from torch.distributed.launcher.api import LaunchConfig
config = LaunchConfig(
    ...,
    rdzv_configs={
        "keep_alive_interval": 1122,
        "heartbeat_timeout": 321,
        "keep_alive_max_attempt": 5,
    },
)
```
These arguments are currently hard-coded inside torchrun, and the default values are not suitable for jobs with thousands of ranks.
Today, `rdzv_configs` only accepts the keys `join_timeout`, `last_call_timeout`, and `close_timeout`.
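For reference, a sketch of today's usage with only the currently allowed keys (the node/process counts are placeholder values):
```python
from torch.distributed.launcher.api import LaunchConfig

config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=8,
    rdzv_configs={"join_timeout": 600, "last_call_timeout": 30, "close_timeout": 30},
)
```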
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145228
Approved by: https://github.com/wconstab
Summary: During dynamic rendezvous, we shouldn't use the address from the store; we should use `self._this_node.addr` directly, because sometimes the store host is not the host of rank0. Passing the wrong host causes a timeout error. This is a follow-up fix to S463164; for internal tests, we disable TCPStore sharing for now.
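A minimal sketch of the intent (the helper below is hypothetical, not the exact diff):
```python
# Hypothetical helper: prefer this node's own address over the store's host
# when advertising the bootstrap address, since the store host may not be
# rank0's host, and a wrong host leads to timeouts.
def _bootstrap_addr(self) -> str:
    return self._this_node.addr  # not the address recorded in the store
```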
Test Plan: CI.
Differential Revision: D65453312
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139702
Approved by: https://github.com/XilunWu
1. We take option 3 as discussed in https://github.com/pytorch/pytorch/issues/135712: every time we retry, we first create a new TCPStore server, so that we don't need to append the attempt count as a prefix and we avoid eventual TCPStore sync failures. (This applies only when TCPStore sharing is enabled.)
2. We start the new server bound to an ephemeral port (i.e. port 0) so it gets assigned a free port, and we then pass that port downstream (to the trainer or c10d). This way the TCPStore is managed by the elastic agent instead of racing to bind a specific port in the trainer (see the sketch after this list).
3. The port is then broadcast for dynamic rendezvous.
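A minimal sketch of point 2, assuming a recent `torch.distributed.TCPStore` that supports binding to port 0:
```python
from torch.distributed import TCPStore

# Bind the server to an ephemeral port (0) so the OS assigns a free one; the
# elastic agent then reads back the chosen port and passes it downstream.
store = TCPStore("localhost", 0, world_size=1, is_master=True)
print(store.port)  # the OS-assigned port to hand to the trainer or c10d
```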
One remaining question: for the store created via `_create_tcp_store` in torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py, are we OK with creating a duplicate TCPStore server?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135957
Approved by: https://github.com/d4l3k, https://github.com/c-p-i-o
Summary:
There was a regression introduced in https://github.com/pytorch/pytorch/pull/125743 that made `local_addr` no longer used. This fixes that by passing `local_addr` to `RendezvousStoreInfo.build` everywhere it's used.
This also fixes a number of tests, allowing them to be run in parallel, which hugely sped up the testing cycle since this change touches many different rendezvous implementations. It also required a few fixes in unrelated tests.
Test Plan:
Added tests for the common rendezvous implementations verifying that `local_addr` is respected, to prevent future regressions.
```
buck2 test @//mode/dev-nosan fbcode//caffe2/test/distributed/elastic/... fbcode//caffe2/torch/distributed/elastic/... -- --stress-runs 3
```
To vet the parallelism changes I also ran with 3 stress runs each to identify flakiness caused by parallelism.
Differential Revision: D62256407
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135262
Approved by: https://github.com/fduwjj, https://github.com/wz337
D53335860 and D56435815 added an option to torch elastic allowing users to choose the TCPStore backend type via:
1) explicitly passing an argument in user code when instantiating `MastRendezvousHandler`, or
2) passing the `--use_libuv` command line argument to `torchrun`.
The motivation was to offer a quick way to roll back to the non-libuv TCPStore backend while we were making libuv the default in `c10d`. We now think it is better for torch elastic to not be aware of the TCPStore backend type at all, and instead rely on `c10d`'s mechanism to decide which backend to use. This way, the TCPStore backend type used by torch elastic is identical to that used in the rest of PyTorch.
PyTorch TCPStore uses the environment variable `USE_LIBUV` to determine the backend type (illustrated below):
- when `USE_LIBUV="0"`, the non-libuv backend is used;
- when `USE_LIBUV="1"`, the libuv backend is used (this is the default).
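For example (a hedged illustration; it assumes the store-creation path in `init_process_group` reads `USE_LIBUV` as described above):
```python
import os

import torch.distributed as dist

# Opt out of the libuv backend before any TCPStore is created; "1" (the
# default) selects libuv.
os.environ["USE_LIBUV"] = "0"
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"

dist.init_process_group(backend="gloo", init_method="env://", rank=0, world_size=1)
```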
Differential Revision: [D58259590](https://our.internmc.facebook.com/intern/diff/D58259590/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134882
Approved by: https://github.com/shuqiangzhang
Summary:
A number of features rely on the TCP store as a control plane. By default the TCPStore server is started on the rank0 trainer, which can create a race condition when rank0 exits (whether due to an error or a graceful exit) while other ranks are still reading from or writing to the store.
Solution: the TCPStore server should outlive all the trainer processes. Moving ownership of the TCPStore to the torchelastic agent naturally fixes the lifecycle of the server.
Static rendezvous in torchelastic already supports sharing the TCPStore server. We are extending this to the more commonly used c10d rendezvous handler.
Any handler that would like to manage the TCP store has to:
- return true from its `use_agent_store` property;
- make the `master_addr`/`master_port` values of the `RendezvousStoreInfo` in `RendezvousInfo` refer to the managed TCPStore (these are returned from the `next_rendezvous` call).
Note: in some instances users may want to use non-TCPStore-based stores for the torchelastic rendezvous process itself, so the handler will need to create and hold a reference to the TCPStore (as done in this change, and as sketched below).
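A hedged sketch of that contract (the handler name and port choices are hypothetical, and the remaining abstract `RendezvousHandler` methods are omitted):
```python
from torch.distributed import PrefixStore, TCPStore
from torch.distributed.elastic.rendezvous.api import (
    RendezvousHandler,
    RendezvousInfo,
    RendezvousStoreInfo,
)


class StoreManagingHandler(RendezvousHandler):  # hypothetical handler
    def __init__(self) -> None:
        # Create and hold the TCPStore so it outlives the trainer processes.
        self._tcp_store = TCPStore("localhost", 0, is_master=True)

    @property
    def use_agent_store(self) -> bool:
        return True  # signal that the agent-managed store is shared

    def next_rendezvous(self) -> RendezvousInfo:
        rank, world_size = 0, 1  # placeholders for the sketch
        # Wrap the shared store in a PrefixStore to scope its namespace.
        store = PrefixStore("rdzv/", self._tcp_store)
        bootstrap = RendezvousStoreInfo(
            master_addr="localhost", master_port=self._tcp_store.port
        )
        return RendezvousInfo(store, rank, world_size, bootstrap)
```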
Test Plan:
`cat ~/workspace/dist-demo/stores.py`
~~~
import logging
import os
import sys
import time

import torch
import torch.distributed as dist

logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stderr))
logger.setLevel(logging.INFO)


def _run_test(store):
    if dist.get_rank() == 1:
        logger.info("Rank %s is sleeping", dist.get_rank())
        time.sleep(5)
        key = "lookup_key"
        logger.info("Checking key %s in store on rank %s", key, dist.get_rank())
        store.check([key])
    else:
        logger.info("rank %s done", dist.get_rank())


def main() -> None:
    use_gpu = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_gpu else "gloo")
    dist.barrier()
    logger.info(f"Hello World from rank {dist.get_rank()}")
    host = os.environ["MASTER_ADDR"]
    port = os.environ["MASTER_PORT"]
    world_size = os.environ["WORLD_SIZE"]
    logger.info("testing TCPStore")
    store = dist.TCPStore(
        host_name=host, port=int(port), world_size=int(world_size),
    )
    _run_test(store)


if __name__ == "__main__":
    main()
~~~
With the fix (`TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0`, or simply dropping the option):
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 1
Hello World from rank 2
Hello World from rank 0
testing TCPStore
testing TCPStore
testing TCPStore
rank 2 done
Rank 1 is sleeping
rank 0 done
Checking key lookup_key in store on rank 1
~~~
With `TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1` (store sharing disabled):
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 0
Hello World from rank 2
Hello World from rank 1
testing TCPStore
testing TCPStore
testing TCPStore
rank 0 done
rank 2 done
Rank 1 is sleeping
Checking key lookup_key in store on rank 1
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 46, in <module>
[rank1]: main()
[rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 42, in main
[rank1]: _run_test(store)
[rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 22, in _run_test
[rank1]: store.check([key])
[rank1]: torch.distributed.DistNetworkError: Connection reset by peer
E0605 17:40:22.853277 140249136719680 torch/distributed/elastic/multiprocessing/api.py:832] failed (exitcode: 1) local_rank: 1 (pid: 2279237) of binary: /home/kurman/.conda/envs/pytorch_38/bin/python
Traceback (most recent call last):
File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/users/kurman/pytorch/torch/distributed/run.py", line 904, in <module>
main()
File "/data/users/kurman/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/data/users/kurman/pytorch/torch/distributed/run.py", line 900, in main
run(args)
File "/data/users/kurman/pytorch/torch/distributed/run.py", line 891, in run
elastic_launch(
File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/kurman/workspace/dist-demo/stores.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-06-05_17:40:22
host : devgpu011.cln5.facebook.com
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2279237)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
~~~
Differential Revision: D58180193
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128096
Approved by: https://github.com/shuqiangzhang
Summary:
1. Define an explicit `use_agent_store` property on rdzv handlers. Handlers that set it to true can share the store.
2. Instead of the agent coordinating the `master_addr`/`master_port` values, the logic is now encapsulated by the *rdzv_handler*: `RendezvousInfo` carries a `RendezvousStoreInfo` object that handlers must return.
   - Depending on the implementation, they can either:
     - point to the existing store (and are expected to set `use_agent_store` to true, per point 1); client code relies on the `TORCHELASTIC_USE_AGENT_STORE` env variable to know whether the store is shared, or
     - build args from which `torch.distributed.init_process_group` can bootstrap a new store.
Additional points:
- When the TCPStore is shared, it should be wrapped in a `PrefixStore` to qualify/scope its namespace for other use cases.
- The `next_rendezvous` signature changed to return an instance of `RendezvousInfo` instead of a `(store, rank, world_size)` tuple, for extensibility (see the sketch after this list).
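For instance, call sites migrate roughly like this (a sketch; handler construction is elided):
```python
# Before: next_rendezvous() returned a plain tuple.
# store, rank, world_size = rdzv_handler.next_rendezvous()

# After: it returns a RendezvousInfo carrying the same data plus store info.
rdzv_info = rdzv_handler.next_rendezvous()
store = rdzv_info.store
rank, world_size = rdzv_info.rank, rdzv_info.world_size
bootstrap = rdzv_info.bootstrap_store_info  # RendezvousStoreInfo(master_addr, master_port)
```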
Why:
- reduce moving parts;
- easier to swap implementations;
- improved tractability;
- addressing perf/debuggability will benefit all use cases.
Test Plan: CI
Differential Revision: D57055235
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125743
Approved by: https://github.com/d4l3k
Summary:
Minor logging cleanup in distributed library
1. Don't use "f" formatted strings, to address linter issues (see the sketch after this list).
2. Nits: make use of the unused `e` (error) in a few logs.
3. Change info->debug as asked in issue #113545.
4. Nit: rename log -> logger in a few files for consistency.
5. Fix a linter error.
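Illustrating item 1 (a generic example, not a call site from the diff):
```python
import logging

logger = logging.getLogger(__name__)
rank = 0

logger.info("Rank %s joined the rendezvous", rank)   # preferred: lazy %-style args
# logger.info(f"Rank {rank} joined the rendezvous")  # flagged by the linter
```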
Test Plan:
1. Local build passes.
2. Linter is happy.
Reviewers: wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61822
Added scuba logging to the following files:
- dynamic_rendezvous.py
- c10d_rendezvous_backend.py
NOTE: This diff introduces the use of Python's inspect module to easily obtain the calling method name and filename when logging. This module can interfere with Python's garbage collector, so special care was taken to never store references to results from inspect.stack() longer than absolutely needed.
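A minimal sketch of that pattern (the helper name and frame depth are assumptions):
```python
import inspect


def _caller_info():  # hypothetical helper
    frame_info = inspect.stack()[1]  # the immediate caller
    try:
        # Copy out plain strings only; never store the FrameInfo itself.
        return frame_info.filename, frame_info.function
    finally:
        del frame_info  # drop the frame reference as soon as possible
```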
Test Plan:
The following tests can be run.
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/rendezvous:c10d_rendezvous_backend_test
```
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/rendezvous:dynamic_rendezvous_test
```
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/events:lib_test
```
Reviewed By: aivanou
Differential Revision: D29643774
fbshipit-source-id: f10cd5ebf8f6860856267bc2483c0b85faacb0fd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59663
This PR fixes an edge case bug in `DynamicRendezvousHandler` where the state of the rendezvous is not always entirely updated when one or more nodes are not alive anymore.
Test Plan: Run the existing and newly-introduced unit tests.
Reviewed By: tierex
Differential Revision: D28971809
fbshipit-source-id: ebbb6a5f2b04f045c3732d6cf0f8fdc7c2381a7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58169
This PR adds logging to the `_sanitize()` function of `RendezvousStateHolder` to output the nodes that had no recent heartbeat and are considered "dead".
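A hedged sketch of the added logging (the function is hypothetical; the field name follows `last_heartbeats` used elsewhere in these notes):
```python
import logging
from datetime import datetime, timedelta

logger = logging.getLogger(__name__)


def _log_dead_nodes(last_heartbeats: dict, keep_alive_interval: timedelta, max_misses: int) -> None:
    expiration = datetime.utcnow() - keep_alive_interval * max_misses
    dead_nodes = [node for node, hb in last_heartbeats.items() if hb < expiration]
    if dead_nodes:
        logger.debug("Nodes with no recent heartbeat considered dead: %s", dead_nodes)
```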
ghstack-source-id: 128798389
Test Plan: Run the existing tests.
Reviewed By: tierex
Differential Revision: D28333394
fbshipit-source-id: ba0a398a759815e4224b58323c0e743eb383f723
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58160
This PR updates the Torch Distributed Elastic documentation with references to the new `c10d` backend.
ghstack-source-id: 128783809
Test Plan: Visually verified the correct rendering of the updated documentation.
Reviewed By: tierex
Differential Revision: D28384996
fbshipit-source-id: a40b0c37989ce67963322565368403e2be5d2592
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57151
This PR introduces the implementation of `DynamicRendezvousHandler`, which mostly builds on the types introduced in previous PRs.
ghstack-source-id: 127685212
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28060531
fbshipit-source-id: 844ff0e9c869f2bbb85fba05a16002d00eae130f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57150
This PR refactors the `__init__` method of `DynamicRendezvousHandler` to a `from_backend` static constructor for easier testing and future extensibility.
ghstack-source-id: 127685183
Test Plan: Run the updated unit tests.
Reviewed By: tierex
Differential Revision: D28060336
fbshipit-source-id: b07dcbb61e8ff5a536b7b021cd50438010c648dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57149
This PR introduces the `_RendezvousJoinOp` type that represents a rendezvous join operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685142
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059785
fbshipit-source-id: 6e67a54289eef1a2349fcc52f8841e49c139459a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57148
This PR introduces the `_RendezvousExitOp` type that represents a rendezvous exit operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685094
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059764
fbshipit-source-id: 2da428885f1390957242fdd82d68cee2ac273c71
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57147
This PR introduces the `_RendezvousKeepAliveOp` type that represents a rendezvous keep-alive heartbeat operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685037
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059733
fbshipit-source-id: 31fd8fc06f03d8f9cd21558b15a06dea7ad85bc6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57146
This PR introduces the `_RendezvousCloseOp` type that represents a rendezvous close operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127684991
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059693
fbshipit-source-id: 6c944d3b4f6a6ed2057ea2921ae8a42609998dd2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57145
This PR introduces the `_DistributedRendezvousOpExecutor` type that implements the `_RendezvousOpExecutor` interface for rendezvous shared via a `_RendezvousStateHolder`.
ghstack-source-id: 127684945
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059417
fbshipit-source-id: 7ef72ea16b54eaaa11a6ece7459d385d49692a84
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57144
This PR introduces the `_RendezvousOpExecutor` interface. Implementers of this interface are responsible for executing rendezvous operations in a state machine that outputs actions based on the current state of the rendezvous.
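A hedged sketch of that shape (the action names are an illustrative subset; the real enum and loop are richer):
```python
from enum import Enum


class _Action(Enum):  # illustrative subset of possible actions
    KEEP_ALIVE = 1
    ADD_TO_PARTICIPANTS = 2
    SYNC = 3
    FINISH = 4


def run(op, ctx, deadline) -> None:
    """Drive a rendezvous op until it reports that it is finished."""
    action = None
    while action != _Action.FINISH:
        # The op inspects the current rendezvous state and returns the next
        # action; a real executor would perform it against the shared state.
        action = op(ctx, deadline)
```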
ghstack-source-id: 127684898
Test Plan: None beyond `flake8` and `mypy` as this is solely an interface definition.
Reviewed By: tierex
Differential Revision: D28059159
fbshipit-source-id: 8e7da33e02336206cddbe76d773681e98c28a98f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56538
This PR introduces the `_RendezvousStateHolder` interface and its accompanying `_BackendRendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state with the other nodes.
ghstack-source-id: 127684796
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D27892600
fbshipit-source-id: a55d884a1f9b0d742787be4dff4271e076c08962
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57142
This PR extends the return type of `RendezvousBackend`'s `set_state` method with an additional boolean flag that specifies whether the write attempt has succeeded.
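A hedged sketch of consuming the new return value (the retry policy and helper name are assumptions):
```python
def update_state(backend, mutate):
    """Compare-and-set loop over a backend's get_state/set_state."""
    result = backend.get_state()
    state, token = result if result is not None else (b"", None)
    while True:
        outcome = backend.set_state(mutate(state), token)
        if outcome is None:
            continue  # hypothetical handling: treat as retryable
        state, token, has_set = outcome
        if has_set:
            return state  # our write won
        # Otherwise another node wrote first; retry against the fresh state.
```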
ghstack-source-id: 127629538
Test Plan: Run the updated unit tests.
Reviewed By: tierex
Differential Revision: D28058980
fbshipit-source-id: 26333790c39386891beb155b20ba1291d2cbdd03
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57141
Per feedback this PR renames `last_keep_alives` to `last_heartbeats` in `_RendezvousState`.
ghstack-source-id: 127629442
Test Plan: Run the updated unit tests.
Reviewed By: tierex
Differential Revision: D28058948
fbshipit-source-id: 0db12eac56a47a426a7a48fb5c93ac6a08b0d22e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57140
This PR introduces a new `heartbeat` attribute in `RendezvousTimeout`.
ghstack-source-id: 127626815
Test Plan: Run the updated unit tests.
Reviewed By: tierex
Differential Revision: D28058908
fbshipit-source-id: c6f8b3a06210cc59714fa841d9387eeb028dc02f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57139
This PR sets the `order` attribute of the `dataclass` annotation to `True` in order to introduce comparison operators for `_NodeDesc`.
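For reference, the `order` flag generates the comparison dunders from the declared field order (the field names here mirror `_NodeDesc` but are illustrative):
```python
from dataclasses import dataclass


@dataclass(order=True, frozen=True)
class NodeDesc:  # illustrative stand-in for _NodeDesc
    addr: str
    pid: int
    local_id: int


# Comparison is lexicographic over (addr, pid, local_id).
assert NodeDesc("host-a", 1, 0) < NodeDesc("host-b", 0, 0)
```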
ghstack-source-id: 127626783
Test Plan: Run the existing unit tests.
Reviewed By: tierex
Differential Revision: D28058851
fbshipit-source-id: 66313f84f507100e20acb687a3427b3dd51a6310
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56537
This PR introduces the `RendezvousSettings` type to consolidate the arguments passed to `DynamicRendezvousHandler`.
ghstack-source-id: 127626738
Test Plan: Run the existing unit tests.
Reviewed By: tierex
Differential Revision: D27890155
fbshipit-source-id: 22060c25b6927cc832f18ae6c5f7ba0f7a9ef3cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56535
This PR renames the `_Rendezvous` class to `_RendezvousState` in preparation for the upcoming changes.
ghstack-source-id: 126979138
Test Plan: Run the existing unit tests.
Reviewed By: H-Huang
Differential Revision: D27889894
fbshipit-source-id: 027d26aa5e1acd5bba3ad2e58b140428a4a176b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56534
This PR reorders the type definitions in dynamic_rendezvous.py to improve readability.
ghstack-source-id: 126979087
Test Plan: Run the existing unit tests.
Reviewed By: H-Huang
Differential Revision: D27889817
fbshipit-source-id: 04291af9b8f3170e4b33cb4f33e0dff0d2d3fb23
Summary:
This PR includes the auxiliary types used by the upcoming implementation of the `DynamicRendezvousHandler`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55932
Test Plan: Run the existing and newly-introduced unit/integration tests.
Reviewed By: tierex
Differential Revision: D27742329
Pulled By: cbalioglu
fbshipit-source-id: cf2e0d88042909739e7c37c25b4b90192c26e198
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55635
This diff introduces the `DynamicRendezvousHandler` type as a stub implementation and its accompanying `RendezvousBackend` interface.
`DynamicRendezvousHandler` is intended to be a backend-agnostic type that will contain the core (bulk) logic of rendezvous handling. Any backend-specific operation is delegated to a concrete subclass of `RendezvousBackend` (e.g. `C10dRendezvousBackend`, see D27654492) that is passed as a constructor argument to `DynamicRendezvousHandler`.
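A hedged sketch of that delegation (the signatures are simplified from the description above):
```python
from abc import ABC, abstractmethod
from typing import Optional, Tuple


class Backend(ABC):  # simplified stand-in for RendezvousBackend
    @abstractmethod
    def get_state(self) -> Optional[Tuple[bytes, object]]:
        """Return the encoded rendezvous state and a fencing token, if set."""

    @abstractmethod
    def set_state(self, state: bytes, token: Optional[object] = None):
        """Attempt to write the state, guarded by the fencing token."""


class Handler:  # simplified stand-in for DynamicRendezvousHandler
    def __init__(self, backend: Backend) -> None:
        # The core rendezvous logic lives in the handler; all storage
        # operations are delegated to the backend (e.g. C10dRendezvousBackend).
        self._backend = backend
```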
ghstack-source-id: 126304697
Test Plan: Run the existing and newly-introduced unit/integration tests.
Reviewed By: tierex
Differential Revision: D27654478
fbshipit-source-id: 9fc89a6e4cb308971c65b29a7c5af7ae191f70c5