Enables support for configuring rendezvous keep-alive settings via `rdzv_configs`:
```python
from torch.distributed.launcher.api import LaunchConfig
config = LaunchConfig(
    ...,
    rdzv_configs={"keep_alive_interval": 1122, "heartbeat_timeout": 321, "keep_alive_max_attempt": 5},
)
```
These arguments are currently hard-coded inside torchrun. The default values are not suitable for jobs with thousands of ranks.
Today, `rdzv_configs` only allows the keys `join_timeout`, `last_call_timeout`, and `close_timeout`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145228
Approved by: https://github.com/wconstab
Summary:
We have seen cases where some workers don't receive stop signals, meaning the watchdog isn't stopped accordingly. This diff introduces logic to kill the current pid alongside the worker pid.
Note that there is a case where the worker pid to be killed either doesn't exist or cannot be killed for some reason, which will result in the current pid also not being killed. This seems okay since the watchdog loop will just attempt to kill the worker pid on the next iteration, but it's worth pointing out.
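Roughly, the reaping logic looks like this (a sketch; the function name is illustrative, not the actual watchdog code):
```python
import os
import signal

def _kill_worker_and_self(worker_pid: int) -> None:
    # If the worker pid is already gone or cannot be signalled, bail out;
    # the watchdog loop will retry on its next iteration. Otherwise take
    # down the current process as well so the watchdog does not linger.
    try:
        os.kill(worker_pid, signal.SIGKILL)
    except OSError:
        return
    os.kill(os.getpid(), signal.SIGKILL)
```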
Test Plan: experiment in next diff shows this works
Differential Revision: D65837085
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141060
Approved by: https://github.com/gag1jain
Fixes #143188
The fifo server binds from a thread -- in rare cases the client connects before the server thread starts. This adds a retry when opening the fifo socket in non-blocking mode, waiting up to 1s for the server to start, which balances fast error messages with some wiggle room on the server side.
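A condensed sketch of the retry (illustrative; the real timer server code differs in details):
```python
import errno
import os
import time

def _open_fifo_write_nonblocking(path: str, timeout: float = 1.0) -> int:
    # Opening a FIFO with O_WRONLY | O_NONBLOCK fails with ENXIO until a
    # reader (the server thread) has opened the other end, so retry for up
    # to `timeout` seconds before giving up with a fast error.
    deadline = time.monotonic() + timeout
    while True:
        try:
            return os.open(path, os.O_WRONLY | os.O_NONBLOCK)
        except OSError as e:
            if e.errno != errno.ENXIO or time.monotonic() >= deadline:
                raise
            time.sleep(0.01)
```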
Test plan:
```
pytest --minutes 10 test/distributed/elastic/timer/file_based_local_timer_test.py -k test_watchdog_call_count -x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143318
Approved by: https://github.com/fegin
Summary: During dynamic rendezvous, we shouldn't use the address from the store but instead use `self._this_node.addr` directly, because sometimes the store host is not the host of rank0. Passing the wrong host will cause a timeout error. This is a follow-up fix to S463164; for internal tests, we disable TCPStore sharing for now.
Test Plan: CI.
Differential Revision: D65453312
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139702
Approved by: https://github.com/XilunWu
Summary:
Currently, if watchdog + healthcheck are enabled via knobs but the watchdog is disabled via SJD config, we observe a hang when the watchdog loop attempts to open the watchdog file path. This is because the FileTimerClient that is usually set in TorchElasticWatchdog will not be set, since disabling the watchdog via SJD config bypasses the TorchElasticWatchdog initialization.
The workaround is to update the healthcheck time when calling `get_last_progress_time`.
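A rough sketch of the workaround (attribute names here are illustrative, not the actual ones):
```python
import time

def get_last_progress_time(self) -> int:
    # When no FileTimerClient was set (watchdog disabled via SJD config),
    # refresh the healthcheck timestamp directly instead of reading
    # progress from the watchdog file, which would block forever.
    if self._watchdog_client is None:
        self._last_progress_time = int(time.time())
    return self._last_progress_time
```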
Test Plan:
Logs show that the progress time value is being changed despite the client not being set.
Behavior when the watchdog is enabled via SJD config is left unchanged.
Differential Revision: D64733766
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138615
Approved by: https://github.com/gag1jain
Fixes #132950
This fixes an issue in `torch/distributed/elastic/rendezvous/etcd_store.py` where the [get method](https://github.com/pytorch/pytorch/blob/v2.4.0/torch/distributed/elastic/rendezvous/etcd_store.py#L60) does not wait as expected when no keys have been written under the store prefix yet (and therefore the store prefix key does not exist). This was because the `_try_wait_get` method would error out immediately [here](https://github.com/alenawang/pytorch/blob/main/torch/distributed/elastic/rendezvous/etcd_store.py#L179) if the prefix was not found instead of continuing to the etcd watch.
This was causing upstream issues where distributed jobs using etcd-v2 could not get past the initial rendezvous at all (details in issue #132950).
We added a test demonstrating this issue and the fix. Without the fix, the test fails with `etcd.EtcdKeyNotFound: Key not found : /torch/elastic/store` instead of waiting for the first key to be written; with the fix, the test waits properly.
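A condensed sketch of the fix (the real `_try_wait_get` also tracks etcd watch indices and overall timeouts):
```python
import time

import etcd  # python-etcd, the client used by the etcd rendezvous backend

def _try_wait_get(client: "etcd.Client", store_prefix: str, deadline: float):
    # A missing prefix key just means nothing has been written under the
    # store prefix yet, so keep waiting instead of erroring out immediately.
    while time.time() < deadline:
        try:
            return client.read(store_prefix, recursive=True)
        except etcd.EtcdKeyNotFound:
            time.sleep(0.1)  # no keys yet; keep polling until the first write
    return None
```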
Co-authored-by: tarat44 <32471142+tarat44@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137056
Approved by: https://github.com/fduwjj
Co-authored-by: tarat44 <32471142+tarat44@users.noreply.github.com>
1. We want to take option 3 as discussed in https://github.com/pytorch/pytorch/issues/135712, so every time we retry, we create a new TCPStore server first so that we don't need to append the attempt count as a prefix and we avoid eventual TCPStore sync failure. (This is only for the TCPStore-sharing-enabled case.)
2. We start a new server bound to an ephemeral port (i.e. 0) so it gets assigned a free port, and we then pass that port downstream (trainer or c10d). By doing so, the TCPStore is managed by the elastic agent rather than having a race condition on binding to a specific port in the trainer (a sketch follows this list).
3. The port is then broadcast for dynamic_rendezvous.
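A minimal sketch of step 2 (assuming a TCPStore build that supports binding port 0, which this change relies on):
```python
from torch.distributed import TCPStore

# Bind the agent-owned server to an ephemeral port; the OS picks a free
# one, so there is no race on a specific port in the trainer.
server = TCPStore("localhost", 0, is_master=True, wait_for_workers=False)

# Read back the assigned port and pass it downstream (trainer or c10d).
print(server.port)
```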
One remaining question: what do we do about the store created via `_create_tcp_store` in torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py? Are we ok with creating a duplicate TCPStore server?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135957
Approved by: https://github.com/d4l3k, https://github.com/c-p-i-o
Summary: The change involves passing the expired timers to the `log_debug_info_for_expired_timers` function after `to_json()` has been applied. This change is made to provide a better debugging experience for the user.
Test Plan: unit tests
Reviewed By: gag1jain
Differential Revision: D62408767
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135913
Approved by: https://github.com/gag1jain
Fixes #127519
Currently in torchrun rendezvous, there are only two rendezvous backends supported out of the box: `C10d` and `Etcd`. The changes in this PR enable distributed elastic users to bring their own out-of-tree rendezvous backend implementations as Python packages.
#### AUTHORING NEW PLUGIN
Any new plugin will be a python package exposing entry-points. For example, the structure of redis plugin is as follows:
```
plugin_root
|_ pyproject.toml
|_ src
|_ redis
|_ __init__.py
|_ redis_store.py
|_ redis_backend.py
```
The contents of the `pyproject.toml` should indicate that this package exposes a torchrun entry-point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for the redis plugin would be as follows:
```
[project]
name = "redis"
version = "0.0.1"
[project.entry-points.'torchrun.plugins']
redis = 'redis'
```
The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows:
```
def getPluginHandler():
    def _create_redis_handler(params: RendezvousParameters):
        from redis_rendezvous_backend import create_backend

        backend, store = create_backend(params)
        return create_handler(store, backend, params)

    return _create_redis_handler
```
The files `redis_store` and `redis_backend` contain the implementation of [Store](41189b0da4/torch/_C/_distributed_c10d.pyi (L171)) and [RendezvousBackend](e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)) respectively.
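For illustration, entry points in the `torchrun.plugins` group can be enumerated like this (a sketch using the stdlib, not necessarily the exact discovery code in torchrun):
```python
from importlib.metadata import entry_points  # selection API needs Python 3.10+

# Find installed packages registered under 'torchrun.plugins' and load
# each plugin module to obtain its handler factory.
for ep in entry_points(group="torchrun.plugins"):
    plugin = ep.load()                        # e.g. the 'redis' package above
    handler_factory = plugin.getPluginHandler()
```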
#### USER EXPERIENCE
Before using the plugin for the first time, the user has to install the plugin packages. For example, published packages can be installed using `pip3 install <plugin-name>`, and a plugin in the local file system can be installed using `pip3 install -e <plugin-location>`.
Once installed, the new backend can be used in torchrun as follows:
```
torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633
Approved by: https://github.com/fduwjj, https://github.com/wconstab
Summary:
There was a regression introduced in https://github.com/pytorch/pytorch/pull/125743 that made `local_addr` no longer used. This fixes that by passing `local_addr` to `RendezvousStoreInfo.build` everywhere it's used.
This also fixes a number of tests, allowing them to run in parallel, which hugely sped up the testing cycle since this change touches many different rendezvous implementations. This required a few fixes in unrelated tests.
Test Plan:
Added tests for the common rendezvous implementations verifying that `local_addr` is respected, to prevent future regressions.
```
buck2 test @//mode/dev-nosan fbcode//caffe2/test/distributed/elastic/... fbcode//caffe2/torch/distributed/elastic/... -- --stress-runs 3
```
To vet the parallelism changes I also ran with 3 stress runs each to identify flakiness caused by parallelism.
Differential Revision: D62256407
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135262
Approved by: https://github.com/fduwjj, https://github.com/wz337
Summary:
In multiprocessing, signal handling is not possible if the thread is not the main thread. This resulted in the following error:
> "ValueError('signal only works in main thread of the main interpreter')"
To address this issue, the diff checks whether the thread is the main thread and, if not, skips signal handling.
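Roughly, the guard looks like this (function and handler names are illustrative):
```python
import logging
import signal
import threading

logger = logging.getLogger(__name__)

def _register_handlers(handler) -> None:
    # signal.signal() raises "ValueError: signal only works in main thread
    # of the main interpreter" off the main thread, so only register
    # handlers when we are actually on it.
    if threading.current_thread() is threading.main_thread():
        signal.signal(signal.SIGTERM, handler)
        signal.signal(signal.SIGINT, handler)
    else:
        logger.warning("Not the main thread; skipping signal handler registration")
```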
Test Plan:
Before this change, MAST job failed:
https://fburl.com/mlhub/iq2m10v8
With this change, MAST job succeeded:
https://fburl.com/mlhub/q6kb8343
Differential Revision: D62166943
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135088
Approved by: https://github.com/d4l3k
D53335860 and D56435815 added an option to torch elastic allowing users to choose a TCPStore backend type to use via
1) explicit argument passing in user code when instantiating `MastRendezvousHandler`
2) passing the `--use_libuv` command line argument to `torchrun`.
The motivation was to offer a quick way to roll back to the non-libuv TCPStore backend since we were making libuv the default in `c10d` code. Now we think it's better for torch elastic not to be aware of the TCPStore backend type, but instead to rely on `c10d`'s mechanism to decide which backend to use for torch elastic as well. This way, the TCPStore backend type used by torch elastic will be identical to that in pytorch.
PyTorch TCPStore uses the environment variable `USE_LIBUV` to determine the backend type:
- when `USE_LIBUV="0"`, the non-libuv backend will be used;
- when `USE_LIBUV="1"`, the libuv backend will be used (this is the default).
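For example, to opt out of libuv before the store is created:
```python
import os

# c10d reads USE_LIBUV when constructing the TCPStore server: "1" (libuv)
# is the default, "0" selects the legacy backend.
os.environ["USE_LIBUV"] = "0"
```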
Differential Revision: [D58259590](https://our.internmc.facebook.com/intern/diff/D58259590/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134882
Approved by: https://github.com/shuqiangzhang
Adds an optional way to detect missing ranks (which can be mapped to host info via the `rank_tracing_decoder` lambda argument) in the store barrier operation.
This approach has already been used in some form; this moves it into the collectives API.
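A hedged usage sketch (parameter names other than `rank_tracing_decoder` follow the existing elastic store barrier helper; treat the exact signature as an assumption):
```python
from torch.distributed.elastic.utils.store import barrier

rank_to_host = {0: "host-a", 1: "host-b"}  # hypothetical rank -> host mapping

# On timeout, the decoder turns each missing rank into a readable string,
# so the error names the stuck host rather than just a rank number.
barrier(
    store,
    world_size=2,
    key_prefix="epoch-0/barrier",
    rank_tracing_decoder=lambda rank: f"rank {rank} on {rank_to_host.get(rank, '?')}",
)
```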
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132818
Approved by: https://github.com/d4l3k
Summary: We call `.get` in the elastic store barrier operation, but we don't need the result. This switches it to use `.wait` instead, which eliminates one network round trip, as `get` internally does a wait first.
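The change in a nutshell, using the c10d Store API (the key name here is illustrative):
```python
# Before: get() waits for the key and then fetches its value, costing an
# extra round trip when the value is never used.
# value = store.get("barrier/epoch-0/last_member")

# After: wait() only blocks until the key exists.
store.wait(["barrier/epoch-0/last_member"])
```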
Test Plan:
CI + existing tests -- no behavior change
Differential Revision: D59396199
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130148
Approved by: https://github.com/kurman, https://github.com/wconstab
Summary:
Use JK to control the release instead of using an env variable to toggle the feature.
Note: sharing the store reduces shutdown races, as the TCPStore lifecycle is managed outside of trainer rank execution time.
Test Plan: CI
Differential Revision: D59071544
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129603
Approved by: https://github.com/d4l3k
Summary:
A number of features rely on the TCP store as a control plane. By default, the TCPStore server is started on the rank0 trainer, which can create a race condition where rank0 exits (by error or graceful exit) and any other rank reading/writing will fail.
Solution: the TCPStore server should outlive all the trainer processes. Moving ownership of the TCPStore to the torchelastic agent naturally fixes the lifecycle of the server.
Static rendezvous in torchelastic already supports sharing of the TCPStore server. We are extending this to the more commonly used c10d rendezvous handler.
Any handler that would like to manage the TCP store has to:
- Return true on the `use_agent_store` property
- Make the `RendezvousStoreInfo` `master_addr`/`master_port` values in `RendezvousInfo` refer to the managed TCPStore (these are returned from the `next_rendezvous` call)
Note: in some instances users may want to use non-TCPStore-based stores for the torchelastic rendezvous process, so the handler will need to create and hold a reference to the TCPStore (as done in this change).
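A hedged sketch of a handler opting in, per the requirements above (the rendezvous logic itself and the `_do_rendezvous` helper are hypothetical placeholders):
```python
from torch.distributed.elastic.rendezvous import (
    RendezvousHandler,
    RendezvousInfo,
    RendezvousStoreInfo,
)

class MyHandler(RendezvousHandler):
    @property
    def use_agent_store(self) -> bool:
        # Signal that the agent-managed TCPStore should be shared.
        return True

    def next_rendezvous(self) -> RendezvousInfo:
        store, rank, world_size = self._do_rendezvous()  # hypothetical helper
        # master_addr/master_port must point at the managed TCPStore.
        bootstrap = RendezvousStoreInfo(master_addr=self._addr, master_port=self._port)
        return RendezvousInfo(store, rank, world_size, bootstrap)
```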
Test Plan:
`cat ~/workspace/dist-demo/stores.py`
~~~
import logging
import os
import sys
import time

import torch
import torch.distributed as dist

logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stderr))
logger.setLevel(logging.INFO)


def _run_test(store):
    if dist.get_rank() == 1:
        logger.info("Rank %s is sleeping", dist.get_rank())
        time.sleep(5)
        key = "lookup_key"
        logger.info("Checking key %s in store on rank %s", key, dist.get_rank())
        store.check([key])
    else:
        logger.info("rank %s done", dist.get_rank())


def main() -> None:
    use_gpu = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_gpu else "gloo")
    dist.barrier()
    logger.info(f"Hello World from rank {dist.get_rank()}")
    host = os.environ['MASTER_ADDR']
    port = os.environ['MASTER_PORT']
    world_size = os.environ['WORLD_SIZE']
    logger.info("testing TCPStore")
    store = dist.TCPStore(
        host_name=host, port=int(port), world_size=int(world_size),
    )
    _run_test(store)


if __name__ == "__main__":
    main()
~~~
With the fix (TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0, or just drop the option):
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 1
Hello World from rank 2
Hello World from rank 0
testing TCPStore
testing TCPStore
testing TCPStore
rank 2 done
Rank 1 is sleeping
rank 0 done
Checking key lookup_key in store on rank 1
~~~
With TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1:
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 0
Hello World from rank 2
Hello World from rank 1
testing TCPStore
testing TCPStore
testing TCPStore
rank 0 done
rank 2 done
Rank 1 is sleeping
Checking key lookup_key in store on rank 1
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 46, in <module>
[rank1]: main()
[rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 42, in main
[rank1]: _run_test(store)
[rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 22, in _run_test
[rank1]: store.check([key])
[rank1]: torch.distributed.DistNetworkError: Connection reset by peer
E0605 17:40:22.853277 140249136719680 torch/distributed/elastic/multiprocessing/api.py:832] failed (exitcode: 1) local_rank: 1 (pid: 2279237) of binary: /home/kurman/.conda/envs/pytorch_38/bin/python
Traceback (most recent call last):
File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/users/kurman/pytorch/torch/distributed/run.py", line 904, in <module>
main()
File "/data/users/kurman/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/data/users/kurman/pytorch/torch/distributed/run.py", line 900, in main
run(args)
File "/data/users/kurman/pytorch/torch/distributed/run.py", line 891, in run
elastic_launch(
File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/kurman/workspace/dist-demo/stores.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-06-05_17:40:22
host : devgpu011.cln5.facebook.com
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2279237)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
~~~
Differential Revision: D58180193
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128096
Approved by: https://github.com/shuqiangzhang
Use `typing_extensions.deprecated` for deprecation annotations where possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` calls where the category is missing.
Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.
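For example (the function names are hypothetical):
```python
import warnings

from typing_extensions import deprecated

@deprecated("old_fn is deprecated, use new_fn instead.", category=FutureWarning)
def old_fn() -> None: ...

def legacy_path() -> None:
    # Where the decorator doesn't apply, the plain warning gains an explicit
    # category so callers see a FutureWarning instead of a UserWarning.
    warnings.warn("legacy_path is deprecated.", category=FutureWarning, stacklevel=2)
```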
Resolves #126888
- #126888
This PR is split from PR #126898.
- #126898
------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689
Approved by: https://github.com/Skylion007