pytorch/docs/source/elastic/customization.rst
Kiuk Chung a80b215a9a [1/n][torch/elastic] Move torchelastic docs *.rst (#148)
Summary:
Pull Request resolved: https://github.com/pytorch/elastic/pull/148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56811

Moves docs sphinx `*.rst` files from the torchelastic repository to torch. Note: only moves the rst files the next step is to link it to the main pytorch `index.rst` and write new `examples.rst`

Reviewed By: H-Huang

Differential Revision: D27974751

fbshipit-source-id: 8ff9f242aa32e0326c37da3916ea0633aa068fc5
2021-05-04 00:57:56 -07:00

119 lines
3.5 KiB
ReStructuredText

Customization
=============
This section describes how to customize TorchElastic to fit your needs.
Launcher
------------------------
The launcher program that ships with TorchElastic
should be sufficient for most use-cases (see :ref:`launcher-api`).
You can implement a custom launcher by
programmatically creating an agent and passing it specs for your workers as
shown below.
.. code-block:: python
# my_launcher.py
if __name__ == "__main__":
args = parse_args(sys.argv[1:])
rdzv_handler = RendezvousHandler(...)
spec = WorkerSpec(
local_world_size=args.nproc_per_node,
fn=trainer_entrypoint_fn,
args=(trainer_entrypoint_fn args.fn_args,...),
rdzv_handler=rdzv_handler,
max_restarts=args.max_restarts,
monitor_interval=args.monitor_interval,
)
agent = LocalElasticAgent(spec, start_method="spawn")
try:
run_result = agent.run()
if run_result.is_failed():
print(f"worker 0 failed with: run_result.failures[0]")
else:
print(f"worker 0 return value is: run_result.return_values[0]")
except Exception ex:
# handle exception
Rendezvous Handler
------------------------
To implement your own rendezvous, extend ``torch.distributed.elastic.rendezvous.RendezvousHandler``
and implement its methods.
.. warning:: Rendezvous handlers are tricky to implement. Before you begin
make sure you completely understand the properties of rendezvous.
Please refer to :ref:`rendezvous-api` for more information.
Once implemented you can pass your custom rendezvous handler to the worker
spec when creating the agent.
.. code-block:: python
spec = WorkerSpec(
rdzv_handler=MyRendezvousHandler(params),
...
)
elastic_agent = LocalElasticAgent(spec, start_method=start_method)
elastic_agent.run(spec.role)
Metric Handler
-----------------------------
TorchElastic emits platform level metrics (see :ref:`metrics-api`).
By default metrics are emitted to `/dev/null` so you will not see them.
To have the metrics pushed to a metric handling service in your infrastructure,
implement a `torch.distributed.elastic.metrics.MetricHandler` and `configure` it in your
custom launcher.
.. code-block:: python
# my_launcher.py
import torch.distributed.elastic.metrics as metrics
class MyMetricHandler(metrics.MetricHandler):
def emit(self, metric_data: metrics.MetricData):
# push metric_data to your metric sink
def main():
metrics.configure(MyMetricHandler())
spec = WorkerSpec(...)
agent = LocalElasticAgent(spec)
agent.run()
Events Handler
-----------------------------
TorchElastic supports events recording (see :ref:`events-api`).
The events module defines API that allows you to record events and
implement custom EventHandler. EventHandler is used for publishing events
produced during torchelastic execution to different sources, e.g. AWS CloudWatch.
By default it uses `torch.distributed.elastic.events.NullEventHandler` that ignores
events. To configure custom events handler you need to implement
`torch.distributed.elastic.events.EventHandler` interface and `configure` it
in your custom launcher.
.. code-block:: python
# my_launcher.py
import torch.distributed.elastic.events as events
class MyEventHandler(events.EventHandler):
def record(self, event: events.Event):
# process event
def main():
events.configure(MyEventHandler())
spec = WorkerSpec(...)
agent = LocalElasticAgent(spec)
agent.run()