pytorch/docs/source/elastic
Kurman Karabukaev 67d3e4f2a2 [TorchElastic] Refactoring to support non-default logging strategy (#120691)
Summary:
Pulling logging parameters out into logging specs that can be overridden (follow-up changes will propose a possible override mechanism)

Why?
Right now the logging approach is quite rigid (the resulting layout is sketched below):
- Requires the log directory to exist and not be empty
- Creates a temp directory otherwise
- Creates a subdirectory for each run
- Creates a subdirectory for each attempt
- Creates files named `stdout.log`, `stderr.log`, and `error.json`

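For concreteness, the default behavior ends up producing roughly the following on-disk layout (a sketch, not the verbatim paths; the placeholder names are derived from the run id, attempt number, and local rank):

```
<log_dir>/                # a tempdir if no directory was supplied
  <run_id>/
    attempt_0/
      <local_rank>/
        stdout.log
        stderr.log
        error.json
```
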
In some instances, users would like to customize this behavior, including file names, based on context. We do already have a mechanism for templating the multiplexed, tee'd output prefix.

With these changes, users can create a custom logs spec that uses environment variables to change this behavior, as sketched below.
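
To illustrate the idea, here is a minimal, self-contained Python sketch of a logs spec that derives its file locations from environment variables. Everything here (`CustomLogsSpecs`, its `reify` method, the `MY_LOG_DIR`/`MY_LOG_PREFIX` variables) is a hypothetical stand-in to show the shape of the customization, not the exact torch.distributed.elastic API:

```python
import os
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical sketch only: names are illustrative, not the real API.
@dataclass
class CustomLogsSpecs:
    # Resolve the root directory and file prefix from (assumed) env variables.
    root: str = field(default_factory=lambda: os.environ.get("MY_LOG_DIR", "/tmp/logs"))
    prefix: str = field(default_factory=lambda: os.environ.get("MY_LOG_PREFIX", "worker"))

    def reify(self, envs: Dict[int, Dict[str, str]]) -> Dict[str, Dict[int, str]]:
        # Map each local rank to its stdout/stderr/error-file destination,
        # replacing the fixed stdout.log/stderr.log/error.json names.
        os.makedirs(self.root, exist_ok=True)
        ranks = envs.keys()
        return {
            "stdouts": {r: os.path.join(self.root, f"{self.prefix}_{r}.out") for r in ranks},
            "stderrs": {r: os.path.join(self.root, f"{self.prefix}_{r}.err") for r in ranks},
            "error_files": {r: os.path.join(self.root, f"{self.prefix}_{r}.json") for r in ranks},
        }

# e.g. MY_LOG_PREFIX=trainer python this_sketch.py
print(CustomLogsSpecs().reify({0: {}, 1: {}}))
```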

Notes:
Made `LaunchConfig.logs_specs` an optional field that, when unset, is bound to a `DefaultLogsSpecs` instance. A large number of clients (code) use the API directly without going through torchrun; those callers have to explicitly pass a `LogsSpecs` implementation if they want to override the default (see the sketch below). Regular torchrun users can rely on the pluggable approach proposed in the follow-up change.
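
A minimal sketch of such a direct-API caller, assuming the `logs_specs` field added by this change and the `DefaultLogsSpecs` export; treat the exact signatures and rendezvous settings as indicative rather than definitive:

```python
# Direct-API caller (no torchrun) explicitly overriding the logs spec.
from torch.distributed.elastic.multiprocessing import DefaultLogsSpecs
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def trainer():
    print("hello from an elastic worker")

config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=2,
    rdzv_backend="c10d",
    rdzv_endpoint="localhost:29500",
    # Without this, the launcher binds a DefaultLogsSpecs instance itself.
    logs_specs=DefaultLogsSpecs(log_dir="/tmp/elastic_logs"),
)

if __name__ == "__main__":
    elastic_launch(config, trainer)()
```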

Test Plan: CI + unit tests

Differential Revision: D54176265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120691
Approved by: https://github.com/ezyang
2024-02-29 20:59:17 +00:00
agent_diagram.jpg
agent.rst
customization.rst
errors.rst
etcd_rdzv_diagram.png
events.rst
examples.rst
kubernetes.rst
metrics.rst
multiprocessing.rst
quickstart.rst
rendezvous.rst [TorchElastic] Support for overprovisioning in C10 based rendezvous (#117066) 2024-01-18 01:16:55 +00:00
run.rst
timer.rst
train_script.rst