pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Kiuk Chung 9d95d48567 (torch.distributed) Add torch.distributed.is_torchelastic_launched() util method + make init_method=tcp:// compatible with torchelastic (#63910)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63910

Addresses the current issue that `init_method=tcp://` is not compatible with `torch.distributed.run` and `torch.distributed.launch`. When running with a training script that initializes the process group with `init_method=tcp://localhost:$port` as such:

```
$ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py
```

An `Address in use` error is raised, because the training script tries to create a TCPStore on port 6000, which is already taken by the TCPStore that the elastic agent is running on that port.

For details see: https://github.com/pytorch/pytorch/issues/63874.

This change does a couple of things:

1. Adds an `is_torchelastic_launched()` check function that users can call in their training scripts to see whether the script was launched via torchelastic.
2. Updates the `torch.distributed` docs page to include the new `is_torchelastic_launched()` function.
3. Makes `init_method=tcp://` torchelastic compatible by modifying `_tcp_rendezvous_handler` in `torch.distributed.rendezvous` (this is NOT the elastic rendezvous, it is the old rendezvous module which is slated for deprecation in future releases) to check `is_torchelastic_launched()` AND `torchelastic_use_agent_store()` and, if both hold, to create only TCPStore clients (no daemons, not even for rank 0).
4. Adds a bunch of unit tests to cover the different code paths.

NOTE: the issue mentions that we should fail fast with an assertion on `init_method!=env://` when `is_torchelastic_launched()` is `True`. There are three registered init_methods in pytorch: env://, tcp://, file://. Since this diff makes tcp:// compatible with torchelastic and I've validated that file:// is compatible with torchelastic, there is no need to add assertions. I did update the docs to point out that env:// is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future, but this is out of scope for this issue.

Test Plan: Unit tests.

Reviewed By: cbalioglu

Differential Revision: D30529984

fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5

2021-08-25 22:57:43 -07:00
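To illustrate how a training script might use the check described in the commit message, here is a minimal sketch. It does not import torch: `is_torchelastic_launched` below is a local stand-in that assumes the launcher marks its workers with a `TORCHELASTIC_RUN_ID` environment variable (an assumption, not stated in the commit message), and `choose_init_method` and `fallback_port` are hypothetical names introduced only for this example.

```python
import os


def is_torchelastic_launched(env=None):
    """Local stand-in for torch.distributed.is_torchelastic_launched().

    Assumption: torchelastic sets TORCHELASTIC_RUN_ID in each worker's
    environment. The real function lives in torch.distributed.
    """
    env = os.environ if env is None else env
    return env.get("TORCHELASTIC_RUN_ID") is not None


def choose_init_method(env=None, fallback_port=29500):
    """Hypothetical helper: pick an init_method string for init_process_group.

    Under torchelastic, prefer env://, which the commit message names as the
    recommended init_method. Otherwise fall back to a tcp:// URL on
    fallback_port (an arbitrary port chosen for this sketch).
    """
    if is_torchelastic_launched(env):
        return "env://"
    return f"tcp://localhost:{fallback_port}"


if __name__ == "__main__":
    # Prints the init_method this process would use.
    print(choose_init_method())
```

In a real script the returned string would be passed as `init_method` to `torch.distributed.init_process_group`, and the stand-in would be replaced by the library function.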
..
_static	clarify the documentation of `torch.meshgrid` (#62977)	2021-08-18 04:01:22 -07:00
_templates	Remove master documentation from being indexable by search engines (#58056)	2021-05-18 06:20:09 -07:00
community	Update persons_of_interest.rst (#63907)	2021-08-25 22:50:54 -07:00
elastic	fix(elastic-docs): Fix elastic launch doc (#62378)	2021-07-30 10:58:13 -07:00
notes	Add note on ifdefing based on CUDA_VERSION for ROCm path (#62850)	2021-08-25 15:02:03 -07:00
rpc
scripts	Revert D30279364: [codemod][lint][fbcode/c*] Enable BLACK by default	2021-08-12 11:45:01 -07:00
__config__.rst
amp.rst	rebase for autocast updates to include device_type and dtype flags (#61002)	2021-08-10 20:03:12 -07:00
autograd.rst	Add docs describing saved tensor hooks (#62362)	2021-08-20 11:10:51 -07:00
backends.rst
benchmark_utils.rst
bottleneck.rst
checkpoint.rst
complex_numbers.rst	Grammatical update of tech docs (#61547)	2021-07-14 14:01:59 -07:00
conf.py	Add copy button to code snippets in docs (#63149)	2021-08-15 06:25:32 -07:00
cpp_extension.rst
cpp_index.rst
cuda.rst	enable warnings on cuda synchronization (#62092)	2021-07-30 09:13:01 -07:00
cudnn_persistent_rnn.rst
cudnn_rnn_determinism.rst
data.rst	[DataLoader][doc] Randomness for base_seed generator and NumPy seed (#56528)	2021-04-22 09:40:45 -07:00
ddp_comm_hooks.rst	Add GradBucket::parameters() to ddp_comm_hooks.rst (#62877)	2021-08-06 14:50:47 -07:00
distributed.algorithms.join.rst	Add tutorial link (#62785)	2021-08-05 17:28:02 -07:00
distributed.elastic.rst	[1/n][torch/elastic] Move torchelastic docs *.rst (#148)	2021-05-04 00:57:56 -07:00
distributed.optim.rst
distributed.rst	(torch.distributed) Add torch.distributed.is_torchelastic_launched() util method + make init_method=tcp:// compatible with torchelastic (#63910)	2021-08-25 22:57:43 -07:00
distributions.rst
dlpack.rst	Lint trailing newlines (#54737)	2021-03-30 13:09:52 -07:00
docutils.conf
fft.rst	Use autosummary on torch.fft, torch.linalg (#55748)	2021-04-13 12:02:36 -07:00
futures.rst	Update docs to mention CUDA support for Future (#50048)	2021-05-11 08:26:33 -07:00
fx.rst	[skip ci] Fix "arugment" typos (#61459)	2021-07-15 15:20:18 -07:00
hub.rst
index.rst	Make _Join, _Joinable, _JoinHook public (#62605)	2021-08-03 12:20:11 -07:00
jit_builtin_functions.rst	Lint trailing newlines (#54737)	2021-03-30 13:09:52 -07:00
jit_language_reference_v2.rst	Fix hasattr support type (#57950)	2021-05-10 12:21:56 -07:00
jit_language_reference.rst
jit_python_reference.rst	[JIT] improve documentation (#57991)	2021-05-19 11:47:32 -07:00
jit_unsupported.rst
jit.rst	Updates internal `assert_allclose` callsites in favor of `assert_close` (#61841)	2021-08-19 12:50:41 -07:00
linalg.rst	Add torch.linalg.inv_ex without checking for errors by default (#58039)	2021-05-13 09:42:15 -07:00
math-quantizer-equation.png
mobile_optimizer.rst
model_zoo.rst
multiprocessing.rst
name_inference.rst	Abladawood patch 1 (#58496)	2021-05-20 10:32:18 -07:00
named_tensor.rst
nn.functional.rst	Add mish activation function (#58648)	2021-05-25 10:36:21 -07:00
nn.init.rst
nn.rst	Adds _LazyInstanceNorm and LazyInstanceNormXd (#60982)	2021-07-21 06:45:45 -07:00
onnx.rst	[ONNX] Update documentation (#58712) (#60249)	2021-07-08 16:29:32 -07:00
optim.rst	To add warm-up scheduler to optim (#60836)	2021-08-15 12:31:45 -07:00
package.rst	[package] PackageExporter remove verbose mode (#61145)	2021-07-08 18:26:43 -07:00
pipeline.rst	Add tutorials to pipeline docs. (#55209)	2021-04-05 20:01:00 -07:00
profiler.rst	docs: fix profiler docstring (#55750)	2021-04-13 00:23:14 -07:00
quantization-support.rst	fix typo errors in quantization-support.rst Line320 (#44447)	2021-07-27 10:42:29 -07:00
quantization.rst	quantization: improve documentation on natively supported backends (#58925)	2021-06-07 17:29:03 -07:00
random.rst
rpc.rst	Remove PROCESS GROUP rpc backend (#62411)	2021-08-02 12:26:22 -07:00
sparse.rst	Add CSR (compressed sparse row) layout for sparse tensors (#50937)	2021-04-12 10:09:12 -07:00
special.rst	[special] alias for mvlgamma (#61633)	2021-07-23 11:24:27 -07:00
storage.rst	Lint trailing newlines (#54737)	2021-03-30 13:09:52 -07:00
tensor_attributes.rst	Remove legacy constructor calls from pytorch codebase. (#54142)	2021-04-11 15:45:17 -07:00
tensor_view.rst	[docs] Mention `vsplit`, `hsplit` and `tensor_split` in Tensor views doc (#63191)	2021-08-13 11:44:38 -07:00
tensorboard.rst
tensors.rst	Exposes _aminmax as aminmax and makes it structured (#62401)	2021-08-03 16:10:43 -07:00
testing.rst	add `torch.testing` to docs (#57247)	2021-05-07 09:16:39 -07:00
torch.nn.intrinsic.qat.rst
torch.nn.intrinsic.quantized.rst	Lint trailing newlines (#54737)	2021-03-30 13:09:52 -07:00
torch.nn.intrinsic.rst
torch.nn.qat.rst	Lint trailing newlines (#54737)	2021-03-30 13:09:52 -07:00
torch.nn.quantized.dynamic.rst
torch.nn.quantized.rst
torch.overrides.rst
torch.quantization.rst	Lint trailing newlines (#54737)	2021-03-30 13:09:52 -07:00
torch.rst	[docs][ao] Add missing docstrings for quantized_max_pool1d and quantized_max_pool2d (#63242)	2021-08-15 22:47:03 -07:00
type_info.rst	clarify that `torch.finfo.tiny` is the smallest normal number (#63241)	2021-08-18 13:44:52 -07:00