pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Xuehai Pan	a229b4526f	[BE] Prefer dash over underscore in command-line options (#94505 ) Preferring dash over underscore in command-line options. Add `--command-arg-name` to the argument parser. The old arguments with underscores `--command_arg_name` are kept for backward compatibility. Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they appear to be the default choice in the Python standard library, as in `argparse.BooleanOptionalAction`: `4a9dff0e5a/Lib/argparse.py (L893-L895)` ```python class BooleanOptionalAction(Action): def __init__(...): if option_string.startswith('--'): option_string = '--no-' + option_string[2:] _option_strings.append(option_string) ``` It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the shift or caps-lock key, whereas `-` does not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505 Approved by: https://github.com/ezyang, https://github.com/seemethere	2023-02-09 20:16:49 +00:00
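A minimal sketch of the backward-compatible pattern this commit describes: one `argparse` option registered under both the dash and the underscore spelling. The program name `demo-launcher` and the default port are assumptions for illustration; this is not the actual `torchrun` parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser for illustration; not the real torchrun argument parser.
    parser = argparse.ArgumentParser(prog="demo-launcher")
    # Both spellings are registered as option strings of the same argument,
    # so old scripts that pass --master_port keep working.
    parser.add_argument(
        "--master-port",
        "--master_port",
        dest="master_port",
        type=int,
        default=29500,  # assumed default, chosen only for this example
    )
    return parser


new_style = build_parser().parse_args(["--master-port", "1234"])
old_style = build_parser().parse_args(["--master_port", "1234"])
default_style = build_parser().parse_args([])
```

Both spellings write to the same `master_port` attribute, so downstream code does not need to know which one the user typed.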
Chris Zheng	5d37890b8e	Update torchrun and TorchElastic to take optional `local_addr` param to allow skipping local IP lookup if specified (#88922 ) Summary: Update dynamic rendezvous nodes to use rendezvous hostname if provided. For PR: https://github.com/pytorch/pytorch/issues/85300 Before: For dynamic rendezvous, it always grabs the `fqdn` from the socket for each node, even if the user specified the address. For example, https://github.com/pytorch/pytorch/blob/master/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py#L248-L256 ``` return _NodeDesc(socket.getfqdn(), os.getpid(), local_id) ``` Now: If the user specifies the hostname, each node will respect the given hostname. For example, `socket.getfqdn(<hostname>)` Test Plan: Unit tests. Differential Revision: D41204028 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88922 Approved by: https://github.com/d4l3k	2022-12-21 03:55:01 +00:00
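The before/after behaviour in this commit can be sketched roughly as follows. The function name `describe_node` and the plain tuple return value are hypothetical stand-ins for the `_NodeDesc` construction quoted above, and this sketch uses the supplied address verbatim rather than passing it through `socket.getfqdn(<hostname>)` as the commit message does.

```python
import os
import socket
from typing import Optional, Tuple


def describe_node(local_id: int, local_addr: Optional[str] = None) -> Tuple[str, int, int]:
    """Return (address, pid, local_id) for this node.

    Hypothetical helper: if the caller supplies local_addr, use it and skip
    the hostname lookup; otherwise fall back to socket.getfqdn().
    """
    addr = local_addr if local_addr else socket.getfqdn()
    return (addr, os.getpid(), local_id)


explicit = describe_node(0, local_addr="node-a.example")
```

Skipping the lookup matters mainly on hosts where resolving the fully qualified domain name is slow or returns a name other nodes cannot reach.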
Kevin Wang	b6f114c208	Fix a minor typo in documentation (#90667 ) This change fixes a typo in function's documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/90667 Approved by: https://github.com/kit1980	2022-12-13 00:41:25 +00:00
PyTorch MergeBot	14a7cf79c1	Add __all__ to torch.distributed and tensorboard submodules (#80444 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/80444 Approved by: https://github.com/rohan-varma	2022-06-28 16:33:22 +00:00
Kiuk Chung	1a8bd1a7eb	(torch/elastic) add documentation clarifying that torchrun is a console script to torch.distributed.run (#73598 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73598 resolves https://github.com/pytorch/pytorch/issues/73319 Simply clarifies that `torchrun` is a console script that invokes `python -m torch.distributed.run`. Test Plan: N/A doc change only, letting github CI validate that the docs build correctly. Reviewed By: sinannasir, d4l3k Differential Revision: D34558538 fbshipit-source-id: 70332c7efc57164a15eda6621575a7c6f14120c8 (cherry picked from commit a349c048c788ece514658a0c94dc0c87c9644e71)	2022-03-03 08:35:50 +00:00
Kiuk Chung	f6402c469e	(torch/elastic) fix scale down bug caused by calling rdzv_handler.shutdown() on premature agent failures (#67749 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67749 Fixes: https://github.com/pytorch/pytorch/issues/67742 Test Plan: Added unittests. Validated manually: ``` # start agent 0 $ torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py # start agent 1 torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py # kill agent 0 CTRL+C (SIGINT) or kill -15 (SIGTERM) # restart it torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py ``` Reviewed By: cbalioglu Differential Revision: D32129005 fbshipit-source-id: db292268250ef6f1e06f5b4c5bd67124d8dfd325	2021-11-05 12:18:46 -07:00
Aliaksandr Ivanou	028e438d6c	[torchelastic] Make sure `rdzv_configs[timeout]` is not getting overwritten (#61471 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61471 Make sure `rdzv_configs[timeout]` is not getting overwritten Test Plan: sandcastle Differential Revision: D29638606 fbshipit-source-id: e164cdddaed77e7e35412ed58ac1ee312e9d489d	2021-07-09 15:27:00 -07:00
Aliaksandr Ivanou	13658b10bb	[torch] Various improvements to `torch.distributed.launch` and `torch.distributed.run` (#61294 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61294 Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925 * Make `torch.distributed.launch` restarts to 0 * Remove unnecessary `-use_env` warning, move `-use_env` warnings * Move `-use_env` warnings to `torch.distributed.launch` * Make default log level WARNING * Add new doc section around transitioning to `torch.distributed.run` * Make `torch.distributed.launch` not use error-propagation * Set default events handler to `null` that does not print events to console * Add reference from `torch.distributed.launch` to `torch.distributed.run` * Set correct preexec function that sends SIGTERM to child processes when parent dies Issues resolved: https://github.com/pytorch/pytorch/issues/60716 https://github.com/pytorch/pytorch/issues/60754 Test Plan: sandcastle python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts python -m torch.distributed.launch --nproc_per_node=4 --use_env --no_python main.py -> produces error python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py -> no warning python -m torch.distributed.launch --nproc_per_node=4 --no_python main.py ->warning Output of running torch.distributed.launch without --use_env: $path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torch.distributed.run. Note that --use_env is set by default in torch.distributed.run. If your script expects `--local_rank` argument to be set, please change it to read from `os.environ('LOCAL_RANK')` instead. New section: {F628923078} {F628974089} Reviewed By: cbalioglu Differential Revision: D29559553 fbshipit-source-id: 03ed9ba638bf154354e1530ffc964688431edf6b	2021-07-08 16:28:06 -07:00
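The migration that the quoted FutureWarning asks for can be sketched as below: a training script that prefers the `LOCAL_RANK` environment variable and only falls back to a `--local_rank` argument. The helper name `resolve_local_rank` and the fallback value of 0 are assumptions for illustration. Note that `os.environ` is a mapping, so it is indexed with square brackets rather than called.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence


def resolve_local_rank(
    argv: Optional[Sequence[str]] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> int:
    """Hypothetical helper: pick the local rank from the environment first,
    then from a legacy --local_rank argument, then default to 0."""
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", "--local-rank", type=int, default=None)
    # parse_known_args ignores any other arguments the script may receive.
    args, _unknown = parser.parse_known_args(argv)

    if "LOCAL_RANK" in environ:
        return int(environ["LOCAL_RANK"])
    if args.local_rank is not None:
        return args.local_rank
    return 0  # assumed fallback for single-process runs


from_env = resolve_local_rank([], {"LOCAL_RANK": "3"})
from_flag = resolve_local_rank(["--local_rank", "2"], {})
fallback = resolve_local_rank([], {})
```

Keeping the argument as a fallback lets the same script run under both the older launcher invocation and an environment-variable based one during a transition.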
Vitaly Fedyunin	ccfdb30644	Revert D29413019: [torch] Various improvements to `torch.distributed.launch` and `torch.distributed.run` Test Plan: revert-hammer Differential Revision: D29413019 (`4e181dfc35`) Original commit changeset: 323bfbad9d0e fbshipit-source-id: 1f8ae4b3d0a23f3eaff28c37e9148efff25fafe2	2021-07-01 08:44:51 -07:00
Aliaksandr Ivanou	4e181dfc35	[torch] Various improvements to `torch.distributed.launch` and `torch.distributed.run` (#60925 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925 * Make `torch.distributed.launch` restarts to 0 * Remove unnecessary `-use_env` warning, move `-use_env` warnings * Move `-use_env` warnings to `torch.distributed.launch` * Make default log level WARNING * Add new doc section around transitioning to `torch.distributed.run` * Make `torch.distributed.launch` not use error-propagation * Set default events handler to `null` that does not print events to console * Add reference from `torch.distributed.launch` to `torch.distributed.run` * Set correct preexec function that sends SIGTERM to child processes when parent dies Issues resolved: https://github.com/pytorch/pytorch/issues/60716 https://github.com/pytorch/pytorch/issues/60754 Test Plan: sandcastle python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts python -m torch.distributed.launch --nproc_per_node=4 --use_env --no_python main.py -> produces error python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py -> no warning python -m torch.distributed.launch --nproc_per_node=4 --no_python main.py ->warning Output of running torch.distributed.launch without --use_env: $path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torch.distributed.run. Note that --use_env is set by default in torch.distributed.run. If your script expects `--local_rank` argument to be set, please change it to read from `os.environ('LOCAL_RANK')` instead. New section: {F628923078} {F628974089} Reviewed By: kiukchung, cbalioglu Differential Revision: D29413019 fbshipit-source-id: 323bfbad9d0e4aba3b10ddd7a243ca6e48169630	2021-06-30 23:31:02 -07:00
Philip Meier	d5988c5eca	remove unused `type: ignore` directives (#60006 ) Summary: During development it is common practice to put `type: ignore` comments on lines that are correct, but `mypy` doesn't recognize this. This often stems from the fact that the `mypy` version in use wasn't able to handle the pattern. With every new release `mypy` gets better at handling complex code. In addition to fixing all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see if they are still needed or not. Fortunately, we don't need to do it manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out in case it encounters a `type: ignore` that is no longer needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006 Reviewed By: jbschlosser, malfet Differential Revision: D29133237 Pulled By: albanD fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a	2021-06-18 07:23:31 -07:00
Howard Huang	c3745dc580	Small change for torch.distributed launcher (#59152 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59152 Small change for https://fb.workplace.com/groups/319878845696681 Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D28773682 Pulled By: H-Huang fbshipit-source-id: acf82273e8622b7ffd3088d8d766bdf49273754c	2021-06-02 15:05:41 -07:00
Sam Estep	75024e228c	Add lint for unqualified `type: ignore` (#56290 ) Summary: The other half of https://github.com/pytorch/pytorch/issues/56272. Pull Request resolved: https://github.com/pytorch/pytorch/pull/56290 Test Plan: CI should pass on the tip of this PR, and we know that the lint works because the following CI runs (before this PR was finished) failed: - https://github.com/pytorch/pytorch/runs/2384511062 - https://github.com/pytorch/pytorch/actions/runs/765036024 Reviewed By: seemethere Differential Revision: D27867219 Pulled By: samestep fbshipit-source-id: e648f07b6822867e70833e23ddafe7fb7eaca235	2021-04-21 08:07:23 -07:00
Aliaksandr Ivanou	98ac6f7cbc	Increase default rendezvous timeout to 15 minutes Summary: Increase default rendezvous timeout to 15 minutes to address slow static initialization. Test Plan: n/a Reviewed By: wilson100hong Differential Revision: D27725655 fbshipit-source-id: a1b8c49b225b61be0d13ff5e52bf6677bf72f792	2021-04-19 09:20:15 -07:00
Aliaksandr Ivanou	8f663170bd	[17/n][torch/elastic] Make torchelastic launcher compatible with the caffe2.distributed.launch (#55687 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55687 The diff makes sure that users can transfer the following parameters: * master_addr * master_port * node_rank * use_env The diff implements StaticTCPRendezvous, which creates a store with a listener on agent rank #0. The diff modifies caffe2/rendezvous: if the worker processes are launched with the torchelastic agent, they will create a PrefixStore("worker/") from a TCPStore without a listener. The diff adds macros functionality to torch/distributed/elastic/utils that helps to resolve the local_rank parameter. Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/test:launch_test Reviewed By: cbalioglu, wilson100hong Differential Revision: D27643206 fbshipit-source-id: 540fb26feac322cc3ec0a989fe53324755ccc4ea	2021-04-14 19:33:26 -07:00
Aliaksandr Ivanou	960b40156c	[6/n][torch/elastic][upstream] Move torchelastic/distributed/api to torch/distributed/elastic/launchers/api (#55471 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55471 Move torchelastic/distributed/api to torch/distributed/elastic/launchers/api Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/... buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test/... SyncSGD: tsm_aivanou-SparseNNApplication_432fc009 f263322216 Reviewed By: wilson100hong Differential Revision: D27614353 fbshipit-source-id: a3b58fac2ebf803b8da5852ae2be0851b1cca695	2021-04-08 12:30:25 -07:00

16 Commits