pytorch/test/distributed/elastic
Sathyanarayanan Saravanamuthu 34dc8f69a1 Adding entry-point based support for out-of-tree rendezvous plugins (#132633)
Fixes #127519

Currently, torchrun supports only two rendezvous backends out of the box: `C10d` and `Etcd`. The changes in this PR enable distributed elastic users to bring their own out-of-tree rendezvous backend implementations as Python packages.

#### AUTHORING NEW PLUGIN
Any new plugin will be a Python package exposing entry points. For example, the structure of a Redis plugin is as follows:

```
plugin_root
|_ pyproject.toml
|_ src
   |_ redis
      |_ __init__.py
      |_ redis_store.py
      |_ redis_backend.py
```

The `pyproject.toml` should indicate that the package exposes a torchrun entry point by declaring it under the group name `torchrun.plugins`. The `pyproject.toml` for the Redis plugin would be as follows:

```
[project]
name = "redis"
version = "0.0.1"

[project.entry-points.'torchrun.plugins']
redis = 'redis'
```

The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows:

```
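from torch.distributed.elastic.rendezvous import RendezvousParameters
from torch.distributed.elastic.rendezvous.dynamic_rendezvous import create_handler


def getPluginHandler():
    # Return the factory used to build the rendezvous handler for this backend.
    def _create_redis_handler(params: RendezvousParameters):
        # Import lazily so the backend module loads only when a handler is built.
        from redis_rendezvous_backend import create_backend
        backend, store = create_backend(params)
        return create_handler(store, backend, params)
    return _create_redis_handler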
```
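
To illustrate how the entry-point group and `getPluginHandler` fit together, here is a minimal sketch of discovering installed plugins with the standard-library `importlib.metadata`. It is not torchrun's actual loader: the helper names `find_entry_points` and `collect_handlers` are hypothetical, and the only details taken from the description above are the group name `torchrun.plugins` and the `getPluginHandler` function.

```python
from importlib.metadata import entry_points
from typing import Callable, Dict, Iterable, List

GROUP = "torchrun.plugins"  # group name from the pyproject.toml above


def find_entry_points(group: str = GROUP) -> List:
    """Return the installed entry points registered under `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):  # selection API, Python 3.10+
        return list(eps.select(group=group))
    return list(eps.get(group, []))  # older dict-style return value


def collect_handlers(eps: Iterable) -> Dict[str, Callable]:
    """Load each entry point and ask its module for a handler factory.

    Assumes every plugin module defines `getPluginHandler()`, as in the
    `__init__.py` example above.
    """
    handlers: Dict[str, Callable] = {}
    for ep in eps:
        module = ep.load()  # e.g. the `redis` package
        handlers[ep.name] = module.getPluginHandler()
    return handlers


if __name__ == "__main__":
    # Prints the names of the plugins installed in this environment.
    print(sorted(collect_handlers(find_entry_points())))
```

With the Redis package from this example installed, the resulting dictionary would map the entry-point name `redis` to the `_create_redis_handler` function.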

The files `redis_store` and `redis_backend` contain the implementation of [Store](41189b0da4/torch/_C/_distributed_c10d.pyi (L171)) and [RendezvousBackend](e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)) respectively.
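
As a rough illustration of what a backend has to provide, the sketch below mirrors the three members of `RendezvousBackend` (`name`, `get_state`, `set_state`) with an in-memory stand-in. It deliberately does not subclass the real class so that it runs without torch installed. The class name `InMemoryBackend` and the integer version counter used as a fencing token are assumptions for illustration only; a Redis-backed implementation would need to perform the same compare-and-set atomically on the server.

```python
from typing import Optional, Tuple

Token = int  # hypothetical fencing token: a simple version counter


class InMemoryBackend:
    """Stand-in that mirrors the method shapes of RendezvousBackend."""

    def __init__(self) -> None:
        self._state: Optional[bytes] = None
        self._version: Token = 0

    @property
    def name(self) -> str:
        return "in-memory"

    def get_state(self) -> Optional[Tuple[bytes, Token]]:
        # No state has been written yet.
        if self._state is None:
            return None
        return self._state, self._version

    def set_state(
        self, state: bytes, token: Optional[Token] = None
    ) -> Optional[Tuple[bytes, Token, bool]]:
        # The write succeeds only if the caller's token matches the current
        # one; a token of None matches only when no state exists yet.
        current = self._version if self._state is not None else None
        if token != current:
            if self._state is None:
                return None
            # Stale token: report the existing state and a failed write.
            return self._state, self._version, False
        self._state = state
        self._version += 1
        return state, self._version, True


if __name__ == "__main__":
    backend = InMemoryBackend()
    _, token, ok = backend.set_state(b"round-0")
    print(ok, backend.set_state(b"round-1", token)[2])
```

The token check is what lets several nodes race to update the shared rendezvous state safely: a node holding an outdated token gets the current state back and can retry.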

#### USER EXPERIENCE
Before using a plugin for the first time, the user has to install the plugin package. Published packages can be installed using `pip3 install <plugin-name>`, and a plugin on the local file system can be installed using `pip3 install -e <plugin-location>`.

Once installed, the new backend can be used in torchrun as follows:

```
torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633
Approved by: https://github.com/fduwjj
2024-09-11 03:35:02 +00:00
| Name | Last commit | Date |
|---|---|---|
| agent/server/test | [BE][Easy][10/19] enforce style for empty lines in import segments in test/d*/ (#129761) | 2024-07-17 16:57:39 +00:00 |
| events | Enable UFMT on all of test/distributed (#123539) | 2024-04-17 06:46:02 +00:00 |
| metrics | Add None return type to init -- tests rest (#132376) | 2024-08-01 15:44:51 +00:00 |
| multiprocessing | [torchelastic] Don't do signal handling when off the main thread (#135088) | 2024-09-06 14:47:03 +00:00 |
| rendezvous | Adding entry-point based support for out-of-tree rendezvous plugins (#132633) | 2024-09-11 03:35:02 +00:00 |
| timer | [elastic] support local_addr across all rendezvous impls (#135262) | 2024-09-06 17:55:43 +00:00 |
| utils | [TorchElastic] make torch elastic not have to realize TCPStore backend type and rely on c10d to decide which backend to use (#134882) | 2024-09-03 19:43:21 +00:00 |
| test_control_plane.py | Add new control plane handler (#129712) | 2024-07-12 17:32:01 +00:00 |