Summary:
Closes https://github.com/pytorch/pytorch/issues/7134. The request is to add an option to log each subprocess's output (each subprocess trains a network with DDP) to a file instead of the default stdout.
The reason for this is that if we have N processes all writing to stdout, it'll be hard to decipher the output, and it would be cleaner to log these to separate files.
To support this, we add an optional argument `--logdir` that redirects each subprocess's stdout to a file of the format "node_rank_{}_local_rank_{}" in the logging directory. With this enabled, none of the training processes write to the parent process's stdout; they write to the aforementioned files instead. If a user accidentally passes in something that is not a directory, we fall back to ignoring this argument.
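A minimal sketch of how the redirection could work (variable and helper names here are illustrative, not the exact ones in launch.py):

    import os
    import subprocess

    def launch_worker(cmd, logdir, node_rank, local_rank):
        # Redirect this worker's stdout to a per-rank file when a valid
        # log directory was given; otherwise inherit the parent's stdout.
        stdout_handle = None
        if logdir is not None and os.path.isdir(logdir):
            log_path = os.path.join(
                logdir, "node_{}_local_rank_{}".format(node_rank, local_rank))
            stdout_handle = open(log_path, "w")
        return subprocess.Popen(cmd, stdout=stdout_handle)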
Tested by taking a training script at https://gist.github.com/rohan-varma/2ff1d6051440d2c18e96fe57904b55d9 and running `python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port="29500" --logdir test_logdir train.py`. This creates a directory `test_logdir` containing the files "node_0_local_rank_0" and "node_0_local_rank_1", each holding the corresponding training process's stdout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33193
Reviewed By: gchanan
Differential Revision: D24496013
Pulled By: rohan-varma
fbshipit-source-id: 1d3264cba242290d43db736073e841bbb5cb9e68
Summary:
Adds a '-m' flag to torch.distributed.launch that allows users to launch Python modules with the launcher instead of specifying the full file path to a script.
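A hedged sketch of how the launcher could build the worker command when the flag is set (variable names are illustrative, not the exact ones in launch.py):

    import sys

    def build_cmd(args, script_args):
        # With -m/--module, run the training target as "python -m <module>"
        # instead of treating it as a file path.
        cmd = [sys.executable]
        if args.module:
            cmd.append("-m")
        cmd.append(args.training_script)
        cmd.extend(script_args)
        return cmd

An invocation would then look something like `python -m torch.distributed.launch --nproc_per_node=2 -m my_pkg.train` (module name hypothetical).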
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24910
Differential Revision: D17221653
Pulled By: pietern
fbshipit-source-id: 5c6453ed266fd121103b11caab303e3f9404227d
Summary:
Per https://github.com/pytorch/pytorch/issues/22260, the default number of OpenMP threads spawned equals the number of CPU cores available. In multi-process data-parallel cases this spawns too many threads, which can overload the CPU and cause a performance regression.
So, by default, set OMP_NUM_THREADS = number of CPU cores / number of processes, to neither overload nor waste CPU threads.
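A minimal sketch of the heuristic (the helper name is illustrative; the default only applies when the user has not set the variable themselves):

    import multiprocessing
    import os

    def maybe_set_omp_num_threads(nproc_per_node):
        # Respect an explicit user setting; otherwise divide the cores
        # among the local processes so they do not oversubscribe the CPU.
        if "OMP_NUM_THREADS" not in os.environ and nproc_per_node > 1:
            num_threads = max(1, multiprocessing.cpu_count() // nproc_per_node)
            os.environ["OMP_NUM_THREADS"] = str(num_threads)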
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22501
Test Plan:
1. With and without this change, the example code produces the same result:
python ~/local/fbsource-fbcode/fbcode/caffe2/torch/distributed/launch.py --nproc_per_node=2 pytorch/examples/yanlizhao/distributed_launch_example.py
Setting OMP_NUM_THREADS environment variable for each process to be: 24, which
is max(1, num_cpus / num_processes), you can further tune the variable for optimal performance in your application if needed.
final loss = tensor(0.5211, device='cuda:0', grad_fn=<MseLossBackward>)
Differential Revision: D16092225
Pulled By: zhaojuanmao
fbshipit-source-id: b792a4c27a7ffae40e4a59e96669209c6a85e27f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**
This was requested by someone at Facebook; this lint is turned
on for Facebook by default. "Sure, why not."
I had to noqa a number of imports in __init__. Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it. Left for future work.
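For illustration, the two options look roughly like this (module and symbol names hypothetical):

    # What this PR does: silence the lint at the import site.
    from some_submodule import some_symbol  # noqa: F401

    # The alternative left for future work: declare the re-export.
    __all__ = ["some_symbol"]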
Be careful! flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments. flake8-3 will
report an import unused; flake8-2 will not. For now, I just
noqa'd all these sites.
All the changes were done by hand.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14687478
fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
Summary:
In `torch.distributed.launch.py`, the launcher passes `local_rank` as a command-line argument and requires the user's program to parse it. However, it would be more flexible for users, and consistent with other variables such as `RANK`, `MASTER_PORT`, and `WORLD_SIZE`, to pass it through an environment variable.
265ed8ff45/torch/distributed/launch.py (L200-L212)
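A hedged sketch of the two styles on the worker side (the environment variable name LOCAL_RANK is illustrative, spelled to match the other variables above):

    import argparse
    import os

    # Current style: launch.py appends --local_rank=<n> to the command
    # line, so every user script has to parse it.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    local_rank = parser.parse_args().local_rank

    # Proposed style: read it from the environment, like RANK and WORLD_SIZE.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))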
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16360
Differential Revision: D14070372
Pulled By: ezyang
fbshipit-source-id: c3f6a8e55ab513918cad09d1326eccdedb4d98c9
Summary:
`torch.distributed.launch.py` will not raise an error when a process started via `subprocess.Popen` returns a non-zero exit code.
For easier debugging, it should always raise an error if a launched process behaves abnormally.
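A minimal sketch of the check (helper name illustrative):

    import subprocess

    def wait_and_check(processes, cmd):
        # Wait for every launched worker and fail loudly if any of them
        # exits with a non-zero return code.
        for process in processes:
            process.wait()
            if process.returncode != 0:
                raise subprocess.CalledProcessError(
                    returncode=process.returncode, cmd=cmd)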
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16069
Differential Revision: D13709467
Pulled By: ezyang
fbshipit-source-id: 31d32a5ec8fed7bccd62d845bfba0e670ed3fe20
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
CC deepakn94
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12370
Differential Revision: D10220135
Pulled By: ezyang
fbshipit-source-id: 6d1a8a383951ae52753e4f75a14b8080bf02b815