This reduces the default monitor_interval for torchelastic to 0.1s, as testing shows negligible load for common use cases. Even at the extreme of 100k processes, CPU utilization is only 45.4% of a single core.
Torchelastic's monitor_interval only monitors the processes on a single worker, so under typical loads, even for huge jobs, we expect ~8 subprocesses per machine, one per GPU.
As an external data point, Python's subprocess wait polls every 50 µs to 50 ms (https://github.com/python/cpython/blob/main/Lib/subprocess.py#L2035).
## Motivation
This setting is used to control how frequently we poll for failed processes in elastic.
* For some jobs of note we run elastic 3 times per try, so with the previous default interval of 5 seconds this should save ~15 seconds per retry.
* @kiukchung's use case: apparently this is annoying in notebooks etc. since it adds a delay to shutdown when testing things.
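For reference, a minimal sketch of setting the interval programmatically; the `LaunchConfig`/`elastic_launch` names and fields below are assumed from `torch.distributed.launcher.api` (the CLI equivalent is the `--monitor-interval` flag used in the Methodology section).
```python
# Hedged sketch: assumes torch.distributed.launcher.api.LaunchConfig exposes a
# monitor_interval field (seconds between polls for failed worker processes).
from torch.distributed.launcher.api import LaunchConfig, elastic_launch


def trainer():
    # placeholder workload; the agent polls this subprocess every monitor_interval
    print("worker running")


config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=8,          # roughly one process per GPU on a typical machine
    rdzv_backend="c10d",       # assumed single-node rendezvous settings
    rdzv_endpoint="localhost:29400",
    run_id="monitor_interval_demo",
    monitor_interval=0.1,      # the new default; previously 5 seconds
)

if __name__ == "__main__":
    elastic_launch(config, trainer)()
```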
## Results
This is measured in cores (100% is a single core under full load).
| monitor_interval (s) | nproc-per-node | CPU util (highest observed) |
| -------------------- | -------------- | --------------------------- |
| 1.0 | 10 | 0.2% |
| 0.1 | 1 | 0.4% |
| 0.1 | 10 | 0.4% |
| 0.01 | 10 | 0.9% |
| 0.001 | 10 | 4.0% |
| 0.1 | 100 | 0.5% |
| 0.1 | 1000 | 2.2% |
| 0.1 | 10000 | 15.7% |
| 0.1 | 100000 | 45.4% |
## Methodology
```sh
# run command
$ LOGLEVEL=INFO torchrun --nnodes 1 --nproc-per-node 10 --monitor-interval 0.1 ~/wait.py
# wait a few seconds for all processes to start and reach steady state, then run the following; wait ~30s (3 prints) and take the highest reading
$ top -b -d 10 -c | rg 'torchrun.*wait'
```
wait.py
```py
# idle worker: sleep so the agent's monitor loop is the only activity measured
import time
time.sleep(10 * 60)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124692
Approved by: https://github.com/kiukchung, https://github.com/kurman
Summary:
Pulling out logging parameters into logging specs that can be overridden (follow-up changes cover a possible override mechanism).
Why?
Right now the logging approach is quite rigid:
- Requires the log directory to exist and not be empty
- Will create a tempdir otherwise
- Creates a subdir for the run
- Creates a subdir for each attempt
- Creates files named stdout.log, stderr.log, and error.json
In some instances users would like to customize this behavior, including the file names, based on context. And we do already have a mechanism to template the multiplexed teed output prefix.
With the current changes, users can create a custom log spec that uses env variables to change the behavior.
Notes:
Made `LaunchConf.logs_specs` an optional field that will be bound to a `DefaultLogsSpecs` instance. There are a large number of clients (code paths) that use the API directly without using the torchrun API. For those cases, we have to explicitly pass a LogsSpecs implementation if we would like to override it. For regular torchrun users, we can use the pluggable approach proposed in the follow-up change.
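A hypothetical sketch of what an explicit override could look like under these assumptions; the import paths, the `LaunchConfig` field names, and the `MY_JOB_LOG_DIR` variable are illustrative rather than the exact surface introduced by this diff.
```python
import os

# Assumed import locations; actual module paths may differ by version.
from torch.distributed.elastic.multiprocessing import DefaultLogsSpecs
from torch.distributed.launcher.api import LaunchConfig, elastic_launch


def trainer():
    print("local rank:", os.environ.get("LOCAL_RANK"))


# Pick the log directory from an environment variable (hypothetical name)
# instead of relying on the tempdir fallback described above.
logs_specs = DefaultLogsSpecs(log_dir=os.environ.get("MY_JOB_LOG_DIR"))

config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=2,
    rdzv_backend="c10d",
    rdzv_endpoint="localhost:29400",
    run_id="logs_specs_demo",
    logs_specs=logs_specs,  # explicit override; left unset it falls back to DefaultLogsSpecs
)

if __name__ == "__main__":
    elastic_launch(config, trainer)()
```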
Test Plan: CI + unit tests
Differential Revision: D54176265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120691
Approved by: https://github.com/ezyang
Prefer dashes over underscores in command-line options. Add `--command-arg-name` variants to the argument parser. The old arguments with underscores (`--command_arg_name`) are kept for backward compatibility.
Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only accept dashes or only accept underscores. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they appear to be the default choice in the Python standard library:
`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)
```python
class BooleanOptionalAction(Action):
def __init__(...):
if option_string.startswith('--'):
option_string = '--no-' + option_string[2:]
_option_strings.append(option_string)
```
It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the Shift (or Caps Lock) key, whereas `-` does not.
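As an illustration of keeping both spellings, a small argparse sketch (the `--master-port` option is just an example here, not the full torchrun parser).
```python
import argparse

parser = argparse.ArgumentParser()
# Preferred dashed spelling first; the underscore spelling stays as a
# backward-compatible alias. Both write to args.master_port.
parser.add_argument(
    "--master-port", "--master_port",
    dest="master_port",
    type=int,
    default=29500,
)

args = parser.parse_args(["--master_port", "29501"])  # old spelling still accepted
print(args.master_port)  # -> 29501
```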
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68624
Fix `caffe2.test.distributed.launcher.api_test` flaky tests for opt-tsan mode.
The diff changes the default `mp.Process` invocation to use the spawn context. By default, `mp.Process` uses the `fork` start method, which is not compatible with `*san`.
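The general pattern looks roughly like this (a sketch of the spawn-context idiom, not the exact code in the diff).
```python
import multiprocessing as mp


def worker():
    print("child running")


if __name__ == "__main__":
    # "fork" (the default on Linux) duplicates the parent's address space,
    # including sanitizer runtime state, which TSAN/ASAN builds do not handle
    # well; "spawn" starts a fresh interpreter instead.
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=worker)
    p.start()
    p.join()
```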
Test Plan: CI
Reviewed By: d4l3k
Differential Revision: D32550578
fbshipit-source-id: f4767987e8e10a6a2ece3f86e48278f2dbaebe7c
Summary:
It turns out my lint doesn't always work on CI because of shell differences. I'm working on a new, more comprehensive lint in https://github.com/pytorch/pytorch/pull/66826, and it'd be nice if these could be cleared first.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67583
Reviewed By: H-Huang, mruberry
Differential Revision: D32045155
Pulled By: janeyx99
fbshipit-source-id: ecfe9f008310c28e3b731e246c2b2ed0106d03b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63442
Continuation of https://github.com/pytorch/pytorch/pull/62051: I've enabled elastic and launcher tests to run in opt-asan mode, which is supported with spawn multiprocessing.
This allows us to completely get rid of fork-based tests from torch.distributed and have all tests run in spawn mode.
ghstack-source-id: 136057123
Test Plan: waitforbuildbot
Reviewed By: cbalioglu
Differential Revision: D30384267
fbshipit-source-id: ad3447cfb9d6e31e7ec8332d64c8ff1054858dcb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887
1) Introduced a `sandcastle_skip_if` decorator that ensures these tests are simply reported as passed on Sandcastle.
2) Fixed all test files under `test/distributed` to not use `unittest.skip`.
The overall goal is to avoid using skips, since Sandcastle tags these tests as continuously skipping.
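A rough sketch of the idea; the `IS_SANDCASTLE` flag below is an assumed placeholder for the internal-CI detection, not the actual helper.
```python
import unittest
from functools import wraps

IS_SANDCASTLE = False  # assumed placeholder; the real detection differs


def sandcastle_skip_if(condition, reason):
    """If `condition` holds, report the test as passed on Sandcastle (so it is
    not tagged as continuously skipping) and skip it as usual elsewhere."""
    def decorator(fn):
        if not condition:
            return fn
        if IS_SANDCASTLE:
            @wraps(fn)
            def passed(*args, **kwargs):
                print(f"Skipping {fn.__name__} on Sandcastle: {reason}")
                return None  # counted as a pass rather than a skip
            return passed
        return unittest.skip(reason)(fn)
    return decorator
```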
ghstack-source-id: 134382237
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D29784152
fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56827
The diff makes sure that mp tests are not executed in modes that enable *san, since Python multiprocessing does not behave well with TSAN and ASAN.
Test Plan: buck test mode/opt-tsan //caffe2/test/distributed/launcher/... -- --run-disabled
Reviewed By: cbalioglu
Differential Revision: D27976626
fbshipit-source-id: 7747d67687fa0fd095f799b3708038f672119e73
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55687
The diff makes sure that users can pass the following parameters:
* master_addr
* master_port
* node_rank
* use_env
The diff implements StaticTCPRendezvous, which creates a store with a listener on the agent with rank 0.
The diff modifies caffe2/rendezvous: if the worker process is launched by the torchelastic agent, the worker processes will create a PrefixStore("worker/") from a TCPStore without a listener.
The diff adds macros functionality to torch/distributed/elastic/utils that helps resolve the local_rank parameter.
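A minimal sketch of the store layout described above, using the public `torch.distributed` store APIs; the address, port, and ranks are placeholders, not the code from the diff.
```python
from datetime import timedelta

from torch.distributed import PrefixStore, TCPStore

MASTER_ADDR, MASTER_PORT = "127.0.0.1", 29500  # placeholder endpoint
node_rank, world_size = 0, 1                   # single-node example

# The agent on rank 0 hosts the TCPStore listener; all other agents and the
# worker processes connect to it as clients (is_master=False).
tcp_store = TCPStore(
    MASTER_ADDR,
    MASTER_PORT,
    world_size,
    is_master=(node_rank == 0),
    timeout=timedelta(seconds=60),
)

# Worker processes namespace their keys under "worker/" so they do not collide
# with keys written by the agents.
worker_store = PrefixStore("worker/", tcp_store)
worker_store.set("status", "ready")
print(worker_store.get("status"))  # b'ready'
```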
Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/test:launch_test
Reviewed By: cbalioglu, wilson100hong
Differential Revision: D27643206
fbshipit-source-id: 540fb26feac322cc3ec0a989fe53324755ccc4ea