mirror of
https://github.com/zebrajr/pytorch.git
synced 2025-12-06 12:20:52 +01:00
Summary: Pull Request resolved: https://github.com/pytorch/elastic/pull/148 Pull Request resolved: https://github.com/pytorch/pytorch/pull/56811 Moves docs sphinx `*.rst` files from the torchelastic repository to torch. Note: only moves the rst files the next step is to link it to the main pytorch `index.rst` and write new `examples.rst` Reviewed By: H-Huang Differential Revision: D27974751 fbshipit-source-id: 8ff9f242aa32e0326c37da3916ea0633aa068fc5
119 lines
3.5 KiB
ReStructuredText
119 lines
3.5 KiB
ReStructuredText
Customization
|
|
=============
|
|
|
|
This section describes how to customize TorchElastic to fit your needs.
|
|
|
|
Launcher
|
|
------------------------
|
|
|
|
The launcher program that ships with TorchElastic
|
|
should be sufficient for most use-cases (see :ref:`launcher-api`).
|
|
You can implement a custom launcher by
|
|
programmatically creating an agent and passing it specs for your workers as
|
|
shown below.
|
|
|
|
.. code-block:: python
|
|
|
|
# my_launcher.py
|
|
|
|
if __name__ == "__main__":
|
|
args = parse_args(sys.argv[1:])
|
|
rdzv_handler = RendezvousHandler(...)
|
|
spec = WorkerSpec(
|
|
local_world_size=args.nproc_per_node,
|
|
fn=trainer_entrypoint_fn,
|
|
args=(trainer_entrypoint_fn args.fn_args,...),
|
|
rdzv_handler=rdzv_handler,
|
|
max_restarts=args.max_restarts,
|
|
monitor_interval=args.monitor_interval,
|
|
)
|
|
|
|
agent = LocalElasticAgent(spec, start_method="spawn")
|
|
try:
|
|
run_result = agent.run()
|
|
if run_result.is_failed():
|
|
print(f"worker 0 failed with: run_result.failures[0]")
|
|
else:
|
|
print(f"worker 0 return value is: run_result.return_values[0]")
|
|
except Exception ex:
|
|
# handle exception
|
|
|
|
|
|
Rendezvous Handler
|
|
------------------------
|
|
|
|
To implement your own rendezvous, extend ``torch.distributed.elastic.rendezvous.RendezvousHandler``
|
|
and implement its methods.
|
|
|
|
.. warning:: Rendezvous handlers are tricky to implement. Before you begin
|
|
make sure you completely understand the properties of rendezvous.
|
|
Please refer to :ref:`rendezvous-api` for more information.
|
|
|
|
Once implemented you can pass your custom rendezvous handler to the worker
|
|
spec when creating the agent.
|
|
|
|
.. code-block:: python
|
|
|
|
spec = WorkerSpec(
|
|
rdzv_handler=MyRendezvousHandler(params),
|
|
...
|
|
)
|
|
elastic_agent = LocalElasticAgent(spec, start_method=start_method)
|
|
elastic_agent.run(spec.role)
|
|
|
|
|
|
Metric Handler
|
|
-----------------------------
|
|
|
|
TorchElastic emits platform level metrics (see :ref:`metrics-api`).
|
|
By default metrics are emitted to `/dev/null` so you will not see them.
|
|
To have the metrics pushed to a metric handling service in your infrastructure,
|
|
implement a `torch.distributed.elastic.metrics.MetricHandler` and `configure` it in your
|
|
custom launcher.
|
|
|
|
.. code-block:: python
|
|
|
|
# my_launcher.py
|
|
|
|
import torch.distributed.elastic.metrics as metrics
|
|
|
|
class MyMetricHandler(metrics.MetricHandler):
|
|
def emit(self, metric_data: metrics.MetricData):
|
|
# push metric_data to your metric sink
|
|
|
|
def main():
|
|
metrics.configure(MyMetricHandler())
|
|
|
|
spec = WorkerSpec(...)
|
|
agent = LocalElasticAgent(spec)
|
|
agent.run()
|
|
|
|
Events Handler
|
|
-----------------------------
|
|
|
|
TorchElastic supports events recording (see :ref:`events-api`).
|
|
The events module defines API that allows you to record events and
|
|
implement custom EventHandler. EventHandler is used for publishing events
|
|
produced during torchelastic execution to different sources, e.g. AWS CloudWatch.
|
|
By default it uses `torch.distributed.elastic.events.NullEventHandler` that ignores
|
|
events. To configure custom events handler you need to implement
|
|
`torch.distributed.elastic.events.EventHandler` interface and `configure` it
|
|
in your custom launcher.
|
|
|
|
.. code-block:: python
|
|
|
|
# my_launcher.py
|
|
|
|
import torch.distributed.elastic.events as events
|
|
|
|
class MyEventHandler(events.EventHandler):
|
|
def record(self, event: events.Event):
|
|
# process event
|
|
|
|
def main():
|
|
events.configure(MyEventHandler())
|
|
|
|
spec = WorkerSpec(...)
|
|
agent = LocalElasticAgent(spec)
|
|
agent.run()
|