This reduces the default monitor_interval for torchelastic to 0.1s, as testing shows negligible load for common use cases. Even at the extreme of 100k processes, CPU utilization is only 45.4% of a single core.
Torchelastic's monitor_interval only monitors the processes on a single worker, so under typical loads, even for huge jobs, we expect ~8 subprocesses per machine, one per GPU.
As an external data point, Python's subprocess wait polls every 50 µs to 50 ms (https://github.com/python/cpython/blob/main/Lib/subprocess.py#L2035).
## Motivation
This setting is used to control how frequently we poll for failed processes in elastic.
* For some jobs of note we run elastic 3 times per try, so with the previous default interval of 5 seconds this should save ~15 seconds per retry.
* @kiukchung's use case: apparently this is annoying in notebooks etc. since it adds a delay to shutdown when testing things.
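For reference, a minimal sketch of setting the interval programmatically; the `LaunchConfig`/`elastic_launch` names and fields below are assumed from `torch.distributed.launcher.api` (the CLI equivalent is the `--monitor-interval` flag used in the Methodology section).
```python
# Hedged sketch: assumes torch.distributed.launcher.api.LaunchConfig exposes a
# monitor_interval field (seconds between polls for failed worker processes).
from torch.distributed.launcher.api import LaunchConfig, elastic_launch


def trainer():
    # placeholder workload; the agent polls this subprocess every monitor_interval
    print("worker running")


config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=8,          # roughly one process per GPU on a typical machine
    rdzv_backend="c10d",       # assumed single-node rendezvous settings
    rdzv_endpoint="localhost:29400",
    run_id="monitor_interval_demo",
    monitor_interval=0.1,      # the new default; previously 5 seconds
)

if __name__ == "__main__":
    elastic_launch(config, trainer)()
```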
## Results
This is measured in cores (100% is a single core under full load).
| monitor_interval (s) | nproc-per-node | CPU util (highest observed) |
| -------------------- | -------------- | --------------------------- |
| 1.0 | 10 | 0.2% |
| 0.1 | 1 | 0.4% |
| 0.1 | 10 | 0.4% |
| 0.01 | 10 | 0.9% |
| 0.001 | 10 | 4.0% |
| 0.1 | 100 | 0.5% |
| 0.1 | 1000 | 2.2% |
| 0.1 | 10000 | 15.7% |
| 0.1 | 100000 | 45.4% |
## Methodology
```sh
# run command
$ LOGLEVEL=INFO torchrun --nnodes 1 --nproc-per-node 10 --monitor-interval 0.1 ~/wait.py
# wait a few seconds for all processes to start and reach steady state, then run the following; wait ~30s (3 prints) and take the highest reading
$ top -b -d 10 -c | rg 'torchrun.*wait'
```
wait.py
```py
# idle worker: sleep so the agent's monitor loop is the only activity measured
import time
time.sleep(10 * 60)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124692
Approved by: https://github.com/kiukchung, https://github.com/kurman
Summary:
Pulling out logging parameters into logging specs that can be overridden (follow-up changes cover a possible override mechanism).
Why?
Right now the logging approach is quite rigid:
- Requires the log directory to exist and not be empty
- Will create a tempdir otherwise
- Creates a subdir for the run
- Creates a subdir for each attempt
- Creates files named stdout.log, stderr.log, and error.json
In some instances users would like to customize this behavior, including the file names, based on context. And we do already have a mechanism to template the multiplexed teed output prefix.
With the current changes, users can create a custom log spec that uses env variables to change the behavior.
Notes:
Made `LaunchConf.logs_specs` an optional field that will be bound to a `DefaultLogsSpecs` instance. There are a large number of clients (code paths) that use the API directly without using the torchrun API. For those cases, we have to explicitly pass a LogsSpecs implementation if we would like to override it. For regular torchrun users, we can use the pluggable approach proposed in the follow-up change.
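A hypothetical sketch of what an explicit override could look like under these assumptions; the import paths, the `LaunchConfig` field names, and the `MY_JOB_LOG_DIR` variable are illustrative rather than the exact surface introduced by this diff.
```python
import os

# Assumed import locations; actual module paths may differ by version.
from torch.distributed.elastic.multiprocessing import DefaultLogsSpecs
from torch.distributed.launcher.api import LaunchConfig, elastic_launch


def trainer():
    print("local rank:", os.environ.get("LOCAL_RANK"))


# Pick the log directory from an environment variable (hypothetical name)
# instead of relying on the tempdir fallback described above.
logs_specs = DefaultLogsSpecs(log_dir=os.environ.get("MY_JOB_LOG_DIR"))

config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=2,
    rdzv_backend="c10d",
    rdzv_endpoint="localhost:29400",
    run_id="logs_specs_demo",
    logs_specs=logs_specs,  # explicit override; left unset it falls back to DefaultLogsSpecs
)

if __name__ == "__main__":
    elastic_launch(config, trainer)()
```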
Test Plan: CI + unit tests
Differential Revision: D54176265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120691
Approved by: https://github.com/ezyang
Prefer dashes over underscores in command-line options. Add `--command-arg-name` variants to the argument parser. The old arguments with underscores (`--command_arg_name`) are kept for backward compatibility.
Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only accept dashes or only accept underscores. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they appear to be the default choice in the Python standard library:
`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)
```python
class BooleanOptionalAction(Action):
def __init__(...):
if option_string.startswith('--'):
option_string = '--no-' + option_string[2:]
_option_strings.append(option_string)
```
It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the Shift (or Caps Lock) key, whereas `-` does not.
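As an illustration of keeping both spellings, a small argparse sketch (the `--master-port` option is just an example here, not the full torchrun parser).
```python
import argparse

parser = argparse.ArgumentParser()
# Preferred dashed spelling first; the underscore spelling stays as a
# backward-compatible alias. Both write to args.master_port.
parser.add_argument(
    "--master-port", "--master_port",
    dest="master_port",
    type=int,
    default=29500,
)

args = parser.parse_args(["--master_port", "29501"])  # old spelling still accepted
print(args.master_port)  # -> 29501
```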
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68624
Fix `caffe2.test.distributed.launcher.api_test` flaky tests for opt-tsan mode.
The diff changes the default `mp.Process` invocation to use the spawn context. By default, `mp.Process` uses the `fork` start method, which is not compatible with `*san`.
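The general pattern looks roughly like this (a sketch of the spawn-context idiom, not the exact code in the diff).
```python
import multiprocessing as mp


def worker():
    print("child running")


if __name__ == "__main__":
    # "fork" (the default on Linux) duplicates the parent's address space,
    # including sanitizer runtime state, which TSAN/ASAN builds do not handle
    # well; "spawn" starts a fresh interpreter instead.
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=worker)
    p.start()
    p.join()
```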
Test Plan: CI
Reviewed By: d4l3k
Differential Revision: D32550578
fbshipit-source-id: f4767987e8e10a6a2ece3f86e48278f2dbaebe7c
Summary:
It turns out my lint doesn't always work on CI because of shell differences. I'm working on a new, more comprehensive lint in https://github.com/pytorch/pytorch/pull/66826, and it'd be nice if these could be cleared first.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67583
Reviewed By: H-Huang, mruberry
Differential Revision: D32045155
Pulled By: janeyx99
fbshipit-source-id: ecfe9f008310c28e3b731e246c2b2ed0106d03b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63442
Continuation of https://github.com/pytorch/pytorch/pull/62051: I've enabled elastic and launcher tests to run in opt-asan mode, which is supported with spawn multiprocessing.
This allows us to completely get rid of fork-based tests from torch.distributed and have all tests run in spawn mode.
ghstack-source-id: 136057123
Test Plan: waitforbuildbot
Reviewed By: cbalioglu
Differential Revision: D30384267
fbshipit-source-id: ad3447cfb9d6e31e7ec8332d64c8ff1054858dcb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887
1) Introduced a `sandcastle_skip_if` decorator that ensures these tests are simply reported as passed on Sandcastle.
2) Fixed all test files under `test/distributed` to not use `unittest.skip`.
The overall goal is to avoid using skips, since Sandcastle tags these tests as continuously skipping.
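A rough sketch of the idea; the `IS_SANDCASTLE` flag below is an assumed placeholder for the internal-CI detection, not the actual helper.
```python
import unittest
from functools import wraps

IS_SANDCASTLE = False  # assumed placeholder; the real detection differs


def sandcastle_skip_if(condition, reason):
    """If `condition` holds, report the test as passed on Sandcastle (so it is
    not tagged as continuously skipping) and skip it as usual elsewhere."""
    def decorator(fn):
        if not condition:
            return fn
        if IS_SANDCASTLE:
            @wraps(fn)
            def passed(*args, **kwargs):
                print(f"Skipping {fn.__name__} on Sandcastle: {reason}")
                return None  # counted as a pass rather than a skip
            return passed
        return unittest.skip(reason)(fn)
    return decorator
```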
ghstack-source-id: 134382237
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D29784152
fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56827
The diff makes sure that mp tests are not executed in modes that enable *san, since Python multiprocessing does not behave well with TSAN and ASAN.
Test Plan: buck test mode/opt-tsan //caffe2/test/distributed/launcher/... -- --run-disabled
Reviewed By: cbalioglu
Differential Revision: D27976626
fbshipit-source-id: 7747d67687fa0fd095f799b3708038f672119e73
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55687
The diff makes sure that users can pass the following parameters:
* master_addr
* master_port
* node_rank
* use_env
The diff implements StaticTCPRendezvous, which creates a store with a listener on the agent with rank 0.
The diff modifies caffe2/rendezvous: if the worker process is launched by the torchelastic agent, the worker processes will create a PrefixStore("worker/") from a TCPStore without a listener.
The diff adds macros functionality to torch/distributed/elastic/utils that helps resolve the local_rank parameter.
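A minimal sketch of the store layout described above, using the public `torch.distributed` store APIs; the address, port, and ranks are placeholders, not the code from the diff.
```python
from datetime import timedelta

from torch.distributed import PrefixStore, TCPStore

MASTER_ADDR, MASTER_PORT = "127.0.0.1", 29500  # placeholder endpoint
node_rank, world_size = 0, 1                   # single-node example

# The agent on rank 0 hosts the TCPStore listener; all other agents and the
# worker processes connect to it as clients (is_master=False).
tcp_store = TCPStore(
    MASTER_ADDR,
    MASTER_PORT,
    world_size,
    is_master=(node_rank == 0),
    timeout=timedelta(seconds=60),
)

# Worker processes namespace their keys under "worker/" so they do not collide
# with keys written by the agents.
worker_store = PrefixStore("worker/", tcp_store)
worker_store.set("status", "ready")
print(worker_store.get("status"))  # b'ready'
```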
Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/test:launch_test
Reviewed By: cbalioglu, wilson100hong
Differential Revision: D27643206
fbshipit-source-id: 540fb26feac322cc3ec0a989fe53324755ccc4ea