Summary:
Closes https://github.com/pytorch/pytorch/issues/7134. The request is to add an option to log each subprocess's output (each subprocess trains a network with DDP) to a file instead of the default stdout.
The reason for this is that if we have N processes all writing to stdout, it'll be hard to decipher the output, and it would be cleaner to log these to separate files.
To support this, we add an optional argument `--logdir` that redirects each subprocess's stdout to a file of the format "node_rank_{}_local_rank_{}" in the logging directory. With this enabled, none of the training processes write to the parent process's stdout; they write to the aforementioned files instead. If a user accidentally passes in something that is not a directory, we fall back to ignoring this argument.
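A minimal sketch of how the redirection could work (variable and helper names here are illustrative, not the exact ones in launch.py):

    import os
    import subprocess

    def launch_worker(cmd, logdir, node_rank, local_rank):
        # Redirect this worker's stdout to a per-rank file when a valid
        # log directory was given; otherwise inherit the parent's stdout.
        stdout_handle = None
        if logdir is not None and os.path.isdir(logdir):
            log_path = os.path.join(
                logdir, "node_{}_local_rank_{}".format(node_rank, local_rank))
            stdout_handle = open(log_path, "w")
        return subprocess.Popen(cmd, stdout=stdout_handle)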
Tested by taking a training script at https://gist.github.com/rohan-varma/2ff1d6051440d2c18e96fe57904b55d9 and running `python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port="29500" --logdir test_logdir train.py`. This creates a directory `test_logdir` containing the files "node_0_local_rank_0" and "node_0_local_rank_1", each holding the corresponding training process's stdout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33193
Reviewed By: gchanan
Differential Revision: D24496013
Pulled By: rohan-varma
fbshipit-source-id: 1d3264cba242290d43db736073e841bbb5cb9e68
Summary:
Adds a '-m' flag to torch.distributed.launch that allows users to launch Python modules with the launcher instead of specifying the full file path to a script.
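A hedged sketch of how the launcher could build the worker command when the flag is set (variable names are illustrative, not the exact ones in launch.py):

    import sys

    def build_cmd(args, script_args):
        # With -m/--module, run the training target as "python -m <module>"
        # instead of treating it as a file path.
        cmd = [sys.executable]
        if args.module:
            cmd.append("-m")
        cmd.append(args.training_script)
        cmd.extend(script_args)
        return cmd

An invocation would then look something like `python -m torch.distributed.launch --nproc_per_node=2 -m my_pkg.train` (module name hypothetical).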
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24910
Differential Revision: D17221653
Pulled By: pietern
fbshipit-source-id: 5c6453ed266fd121103b11caab303e3f9404227d
Summary:
Per https://github.com/pytorch/pytorch/issues/22260, the default number of OpenMP threads spawned equals the number of CPU cores available. In multi-process data-parallel cases this spawns too many threads, which can overload the CPU and cause a performance regression.
So, by default, set OMP_NUM_THREADS = number of CPU cores / number of processes, to neither overload nor waste CPU threads.
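A minimal sketch of the heuristic (the helper name is illustrative; the default only applies when the user has not set the variable themselves):

    import multiprocessing
    import os

    def maybe_set_omp_num_threads(nproc_per_node):
        # Respect an explicit user setting; otherwise divide the cores
        # among the local processes so they do not oversubscribe the CPU.
        if "OMP_NUM_THREADS" not in os.environ and nproc_per_node > 1:
            num_threads = max(1, multiprocessing.cpu_count() // nproc_per_node)
            os.environ["OMP_NUM_THREADS"] = str(num_threads)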
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22501
Test Plan:
1. With and without this change, the example code produces the same result:
python ~/local/fbsource-fbcode/fbcode/caffe2/torch/distributed/launch.py --nproc_per_node=2 pytorch/examples/yanlizhao/distributed_launch_example.py
Setting OMP_NUM_THREADS environment variable for each process to be: 24, which
is max(1, num_cpus / num_processes), you can further tune the variable for optimal performance in your application if needed.
final loss = tensor(0.5211, device='cuda:0', grad_fn=<MseLossBackward>)
Differential Revision: D16092225
Pulled By: zhaojuanmao
fbshipit-source-id: b792a4c27a7ffae40e4a59e96669209c6a85e27f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**
This was requested by someone at Facebook; this lint is turned
on for Facebook by default. "Sure, why not."
I had to noqa a number of imports in __init__. Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it. Left for future work.
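For illustration, the two options look roughly like this (module and symbol names hypothetical):

    # What this PR does: silence the lint at the import site.
    from some_submodule import some_symbol  # noqa: F401

    # The alternative left for future work: declare the re-export.
    __all__ = ["some_symbol"]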
Be careful! flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments. flake8-3 will
report an import unused; flake8-2 will not. For now, I just
noqa'd all these sites.
All the changes were done by hand.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14687478
fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
Summary:
In `torch.distributed.launch.py`, the launcher passes `local_rank` as a command-line argument and requires the user's program to parse it. However, it would be more flexible for users, and consistent with other variables such as `RANK`, `MASTER_PORT`, and `WORLD_SIZE`, to pass it through an environment variable.
265ed8ff45/torch/distributed/launch.py (L200-L212)
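A hedged sketch of the two styles on the worker side (the environment variable name LOCAL_RANK is illustrative, spelled to match the other variables above):

    import argparse
    import os

    # Current style: launch.py appends --local_rank=<n> to the command
    # line, so every user script has to parse it.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    local_rank = parser.parse_args().local_rank

    # Proposed style: read it from the environment, like RANK and WORLD_SIZE.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))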
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16360
Differential Revision: D14070372
Pulled By: ezyang
fbshipit-source-id: c3f6a8e55ab513918cad09d1326eccdedb4d98c9
Summary:
`torch.distributed.launch.py` will not raise an error when a process started via `subprocess.Popen` returns a non-zero exit code.
For easier debugging, it should always raise an error if a launched process behaves abnormally.
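A minimal sketch of the check (helper name illustrative):

    import subprocess

    def wait_and_check(processes, cmd):
        # Wait for every launched worker and fail loudly if any of them
        # exits with a non-zero return code.
        for process in processes:
            process.wait()
            if process.returncode != 0:
                raise subprocess.CalledProcessError(
                    returncode=process.returncode, cmd=cmd)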
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16069
Differential Revision: D13709467
Pulled By: ezyang
fbshipit-source-id: 31d32a5ec8fed7bccd62d845bfba0e670ed3fe20
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
CC deepakn94
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12370
Differential Revision: D10220135
Pulled By: ezyang
fbshipit-source-id: 6d1a8a383951ae52753e4f75a14b8080bf02b815