Applies the remaining flake8-comprehensions fixes and checks. These changes replace all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, more performant, and better supported by our torch.jit compiler. They also remove useless generators such as `set(a for a in b)`, resolving them into just the set call.
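For illustration, the two kinds of rewrite this applies (representative sketches, not taken from the diff):
```python
# Unnecessary generator passed to a constructor -> use a comprehension.
squares = set(x * x for x in range(10))   # before
squares = {x * x for x in range(10)}      # after

# Generator that just re-yields its input -> call the constructor directly.
unique = set(a for a in [3, 1, 3, 2])     # before
unique = set([3, 1, 3, 2])                # after
```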
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
### **SUMMARY:**
It is unnecessary to perform `n + 1` calls to `cast` (one cast for each parameter name-value pair and one cast for the filter generator itself) in a dictionary comprehension in an effort to avoid mypy errors.
Previously, the `cast` to `Tuple[str, str]` was necessary to prevent mypy from complaining that we are trying to create a dictionary out of lists as opposed to tuples (see the mypy issue [here](https://github.com/python/mypy/issues/7509)). However, a `cast` both adds unnecessary overhead due to the extra function call and should generally only be used when a linter is unable to properly infer the type of a variable, not to "lie" to it about the type. We can avoid this by instead using a generator within the dictionary comprehension and then indexing into each split result twice to produce tuples of size 2; mypy recognizes this as a valid dictionary initialization.
The above change is significantly faster than the previous version of the code (roughly 2.4x in the microbenchmark below). Timing the two versions of the dictionary construction yielded the following results:
```
>python -m timeit -s "from typing import cast, Dict, Tuple, Iterable" -n 100000 -p "dict(cast(Tuple[str, str], pair.split('=')) for pair in cast(Iterable[str], filter(None, 'rank=3&world_size=5'.split('&'))))"
100000 loops, best of 5: 2.66 usec per loop
>python -m timeit -n 100000 -p "dict((pair[0], pair[1]) for pair in (pair.split('=') for pair in filter(None, 'rank=3&world_size=5'.split('&'))))"
100000 loops, best of 5: 1.09 usec per loop
```
The `cast` to `Iterable[str]` is similarly a "lie" that is not necessary. It is best to be as transparent as possible with the type checker rather than using `cast` to silence errors. Removing this cast in isolation from the other changes does not even produce any mypy errors.
Further, it is good practice to annotate the return type in the function signature rather than annotating the return value inside the function. Thus, the unnecessary intermediate variable `query_dict` inside `_query_to_dict` was removed, and the return type annotation was moved to the function declaration. The argument `query` is annotated as `str`.
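For reference, a minimal sketch of what `_query_to_dict` looks like after these changes, reconstructed from the description above rather than copied from the diff:
```python
from typing import Dict

def _query_to_dict(query: str) -> Dict[str, str]:
    # Split "rank=3&world_size=5" on '&', drop empty segments, split each pair
    # once on '=', and index into the result so mypy accepts the dict construction.
    return {pair[0]: pair[1] for pair in (p.split("=") for p in filter(None, query.split("&")))}
```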
### **EDITS (additional commits):**
[The sole type hint for `query_dict` (in `_env_rendezvous_handler`) was removed to match all other functions in the file.](76d78bfc9c)
[Incorrect typing in `_env_rendezvous_handler` is fixed so that `rank`, `world_size`, `master_port`, and `master_addr` are specified to be `int`, `int`, `int`, and `str`, respectively.](3cc5844264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75959
Approved by: https://github.com/kumpera, https://github.com/mrshenli
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73372
This PR allows the `world_size` argument of `init_rpc` to be optional. It changes rendezvous to accept `NoneType` for world_size and creates a new code path when initializing the TensorPipe agent for `init_rpc`. TensorPipe agent creation is protected by a critical section enforced using the store, so that only one node can create a TPAgent at a time.
This PR does not yet enable RPC commands between ranks.
Previously:
```python
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
init_rpc("worker0", world_size=1, rank=0)
```
Now (only rank is needed):
```python
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
init_rpc("worker0", rank=0)
```
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D34621651
Pulled By: H-Huang
fbshipit-source-id: 09dbb511d5a00c219a6ce0a35501ff2e388998b0
(cherry picked from commit 834aedc3256167399c323169ef2f0c9b3cf98dff)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71863
The port number is an `int` in Python, but needs to fit in a `uint16_t` when passed to the TCPStore constructor.
Related to #67172
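A minimal sketch of the kind of range check this implies (the helper name is hypothetical, not the actual code):
```python
def _check_port(port: int) -> int:
    # A Python int can be arbitrarily large; a TCPStore port must fit in uint16_t.
    if not (0 <= port <= 0xFFFF):
        raise ValueError(f"Port must be in [0, 65535] to fit in uint16_t, got {port}")
    return port
```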
Test Plan: Imported from OSS
Reviewed By: cbalioglu
Differential Revision: D33793270
Pulled By: H-Huang
fbshipit-source-id: 89ab47ec8bd7518f9ecbf7d01871fe059b0e77b1
(cherry picked from commit 84bff1f5bb)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63910
Addresses the current issue that `init_method=tcp://` is not compatible with `torch.distributed.run` and `torch.distributed.launch`. When running a training script that initializes the process group with `init_method=tcp://localhost:$port` like so:
```
$ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py
```
An `Address in use` error is raised, because the training script tries to create a TCPStore on port 6000, which is already occupied by the TCPStore that the elastic agent is running.
For details see: https://github.com/pytorch/pytorch/issues/63874.
This change does a couple of things:
1. Adds an `is_torchelastic_launched()` check function that users can call in their training scripts to see whether the script was launched via torchelastic (see the sketch after the note below).
1. Updates the `torch.distributed` docs page to include the new `is_torchelastic_launched()` function.
1. Makes `init_method=tcp://` torchelastic compatible by modifying `_tcp_rendezvous_handler` in `torch.distributed.rendezvous` (this is NOT the elastic rendezvous, it is the old rendezvous module which is slotted for deprecation in future releases) to check `is_torchelastic_launched()` AND `torchelastic_use_agent_store()` and if so, only create TCPStore clients (no daemons, not even for rank 0).
1. Adds a bunch of unittests to cover the different code paths
NOTE: the issue mentions that we should fail fast with an assertion on `init_method!=env://` when `is_torchelastic_launched()` is `True`. There are three registered init_methods in pytorch: env://, tcp://, file://. Since this diff makes tcp:// compatible with torchelastic, and I've validated that file:// is compatible with torchelastic, there is no need to add assertions. I did update the docs to point out that env:// is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future, but that is out of scope for this issue.
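A minimal sketch of the check described in item 1 above; the exact environment variable consulted is an assumption here, not taken from the diff:
```python
import os

def is_torchelastic_launched() -> bool:
    # torchelastic exports run metadata to the workers it launches via
    # environment variables; TORCHELASTIC_RUN_ID is assumed to be the marker.
    return os.environ.get("TORCHELASTIC_RUN_ID") is not None
```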
Test Plan: Unittests.
Reviewed By: cbalioglu
Differential Revision: D30529984
fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61590
This PR fixes a bug where state from the first run of a failed training job leaks into subsequent runs because the worker key prefix is constant.
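One way to realize this idea, as an illustrative sketch only (the helper and prefix format are hypothetical):
```python
import uuid

def _unique_worker_prefix() -> str:
    # Derive the store key prefix from a per-run identifier so keys written by
    # a failed first run cannot collide with those of a restarted run.
    return f"worker/{uuid.uuid4().hex}/"
```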
ghstack-source-id: 133494239
Test Plan: Run the existing integ tests.
Reviewed By: SciPioneer
Differential Revision: D29682743
fbshipit-source-id: d96ecadcfe5b6563225ee19f5d0776c7f935393a
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct, but that `mypy` doesn't recognize as such. This often stems from the fact that the `mypy` version in use wasn't able to handle the pattern.
With every new release `mypy` gets better at handling complex code. In addition to fixing all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see if they are still needed. Fortunately, we don't need to do this manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out whenever it encounters a `type: ignore` that is no longer needed.
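For reference, a minimal sketch of where that setting lives, assuming a standalone `mypy.ini` (the project may keep its mypy configuration elsewhere):
```
[mypy]
warn_unused_ignores = True
```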
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006
Reviewed By: jbschlosser, malfet
Differential Revision: D29133237
Pulled By: albanD
fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58329
This PR is part of a stack that addresses the GitHub issue #41614; it introduces:
- A new `multiTenant` constructor option for the `TCPStore` class indicating whether multiple store instances can be initialized with the same host:port pair.
- Updates to the C10d distributed (elastic) rendezvous and the `init_process_group` method to leverage the new `multiTenant` feature.
Note that the multi-tenancy feature itself is implemented in the fourth PR of this stack. In this PR, passing `true` to `multiTenant` only results in a warning being emitted.
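A hedged usage sketch of what this could look like from Python, assuming the C++ `multiTenant` option is exposed in the binding as a `multi_tenant` keyword argument (an assumption, not something this PR confirms):
```python
from datetime import timedelta
from torch.distributed import TCPStore

# A server bound to a host:port that may already host another store instance;
# with multi-tenancy this is permitted (in this PR it only emits a warning,
# the real behavior lands later in the stack).
store = TCPStore("localhost", 29500, world_size=1, is_master=True,
                 timeout=timedelta(seconds=30), multi_tenant=True)
```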
ghstack-source-id: 130676389
Test Plan: Run the existing tests since there are no behavioral changes.
Reviewed By: rohan-varma
Differential Revision: D28424978
fbshipit-source-id: fb1d1d81b8b5884cc5b54486700a8182a69c1f29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55687
The diff makes sure that users can pass through the following parameters:
* master_addr
* master_port
* node_rank
* use_env
The diff implements StaticTCPRendezvous, which creates a store with a listener on the agent of rank 0.
The diff modifies caffe2/rendezvous: if the worker process is launched by the torchelastic agent, the worker processes create a `PrefixStore("worker/")` on top of a TCPStore without a listener.
The diff adds macros functionality to torch/distributed/elastic/utils that helps resolve the local_rank parameter.
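An illustrative sketch of the worker-side pattern described above, assuming the standard `TCPStore`/`PrefixStore` constructors (this is not the actual StaticTCPRendezvous code):
```python
from datetime import timedelta
from torch.distributed import PrefixStore, TCPStore

def _worker_store(master_addr: str, master_port: int, world_size: int) -> PrefixStore:
    # The agent on rank 0 owns the listening TCPStore; workers connect as
    # clients (is_master=False) and namespace their keys under "worker/".
    client = TCPStore(master_addr, master_port, world_size, is_master=False,
                      timeout=timedelta(seconds=60))
    return PrefixStore("worker/", client)
```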
Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/test:launch_test
Reviewed By: cbalioglu, wilson100hong
Differential Revision: D27643206
fbshipit-source-id: 540fb26feac322cc3ec0a989fe53324755ccc4ea
Summary:
Enable TCPStore for DDP on the Windows platform, in order to improve the performance of running DDP across machines.
Related RFC is https://github.com/pytorch/pytorch/issues/47659
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47749
Reviewed By: bdhirsh
Differential Revision: D25220401
Pulled By: mrshenli
fbshipit-source-id: da4b46b42296e666fa7d8ec8040093de7443a529
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33434
Reland of https://github.com/pytorch/pytorch/pull/33325, since the
unit test was flaky and failed on land.
To ensure that the test is not flaky, I bumped the timeout so the rendezvous does not time out (timing out the rendezvous after 1s led to the flakiness). I also generalized our mechanism for retrying on errors to include retrying on errors due to rendezvous timeouts.
ghstack-source-id: 98558377
Test Plan: Added UT test_tcp_store_timeout_set
Differential Revision: D19935390
fbshipit-source-id: 56ccf8c333dd2f954a33614d35cd1642d4e9473a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33325
Closes https://github.com/pytorch/pytorch/issues/32924. There was a bug where, for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that the default timeout of 300s was used while connecting to the server. This diff forwards the timeout passed into `init_process_group` to rendezvous so that it can be passed into the TCPStore constructor, ensuring the right timeout is used at construction time.
Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.
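A usage sketch illustrating the path that this fixes (values are placeholders): the timeout given to `init_process_group` now also applies while the TCPStore client connects, instead of the 300s default.
```python
from datetime import timedelta
import torch.distributed as dist

# With this fix, the 30-second timeout below is honored during TCPStore
# construction as well, not only after rendezvous has created the store.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
    timeout=timedelta(seconds=30),
)
```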
ghstack-source-id: 98401875
Test Plan: Added a UT
Differential Revision: D19871946
fbshipit-source-id: dd002180c4c883216645b8a97cc472c6116ac117
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32016
The previous logic raised an exception when the URL contained a query and rank or world_size was specified.
The fix parses the URL, stitches rank and world_size into url.query, and regenerates the URL.
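A minimal sketch of the described approach (the helper name is hypothetical, not the actual code):
```python
from urllib.parse import urlparse, urlunparse

def _stitch_rank_and_world_size(url: str, rank: int, world_size: int) -> str:
    # Parse the URL, append rank/world_size to whatever query is already
    # present, and rebuild the URL instead of raising.
    parsed = urlparse(url)
    pairs = [p for p in parsed.query.split("&") if p]
    pairs += [f"rank={rank}", f"world_size={world_size}"]
    return urlunparse(parsed._replace(query="&".join(pairs)))
```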
Test Plan: f161291877
Differential Revision: D19337929
fbshipit-source-id: 6bb3a07716dda5233553804000b706052ff18db8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30093
https://github.com/pytorch/pytorch/pull/28226 introduced the `worker_to_id` arg to the `init_rpc` function for other `RpcAgent` implementations, but it is not really used by `ProcessGroupAgent`. Cleanup is wanted for this, as described in https://github.com/pytorch/pytorch/issues/29031.
To accommodate the differences between `RpcAgent` implementations, this adds an `RpcAgentOptions` base class, which allows leveraging inheritance to add extra fields.
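An illustrative sketch of the pattern in Python; the real option classes are C++ structs exposed through pybind, and the derived fields here are hypothetical:
```python
from dataclasses import dataclass

@dataclass
class RpcAgentOptions:
    # Fields common to every RpcAgent.
    rpc_timeout: float = 60.0

@dataclass
class ProcessGroupAgentOptions(RpcAgentOptions):
    # Agent-specific fields live only on the derived options class.
    num_send_recv_threads: int = 4
```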
ghstack-source-id: 94197295
Test Plan:
### OSS RPC + RRef tests
```
buck test mode/dev-nosan //caffe2/test:rpc_fork
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_rpc_fork_test -- test_sync_rpc
```
### Prototype RRef tests
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc
```
```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_rpc_thrift_rpc_agent
```
### Dist autograd
```
buck test mode/dev-nosan caffe2/test:dist_autograd_fork
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_dist_autograd_fork_test
```
Differential Revision: D18595578
fbshipit-source-id: 616fca3b844c171ed5277bbc6a2b1693bc3a8065
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28226
# Goal
The rendezvous step should be the first step not only for `init_process_group` but also for `init_model_parallel`.
The roadblock is that there is a special step in `init_process_group` where the `rank` and `world_size` arguments passed to `init_process_group(..)` are appended to the `init_method` URL string.
We need to make this argument-appending step common and reusable for both `init_process_group` and `init_model_parallel`.
# Solution
- Put the argument appending inside the `rendezvous` function.
- Remove manual `init_method` URL construction and delegate that responsibility to the `rendezvous` function.
- Use the `rendezvous` function for any `RpcAgent`.
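A usage sketch of the shared entry point this describes (URL and values are placeholders): both `init_process_group` and the RPC init path can hand `rank`/`world_size` to `rendezvous()` and let it fold them into the URL.
```python
from torch.distributed.rendezvous import rendezvous

# rendezvous() yields a (store, rank, world_size) triple; callers no longer
# build the query string by hand.
store, rank, world_size = next(rendezvous("tcp://127.0.0.1:29500", rank=0, world_size=1))
```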
Test Plan:
```
buck test mode/dev-nosan caffe2/test:c10d
```
```
buck test mode/dev-nosan caffe2/test:rpc_fork -- test_invalid_names
buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_worker_id
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc -- test_sync_rpc
```
```
buck test mode/dev-nosan caffe2/torch/fb/rendezvous:zeus_test
```
```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_pairwise_attention_pooling -- test_single_trainer_multiple_pss
```
Differential Revision: D5524494
fbshipit-source-id: 50be58ec3c928621b0874b044ef4a1640534d8ef
Summary:
This PR fixes a race condition for the TCP init method, where the master rank can exit earlier than the slave ranks and thus the TCP daemon thread gets shut down before the other ranks are able to access it.
With this change, every rank (process) writes a special key to the store to mark that it has completed (and is thus about to exit). The master rank (which hosts the server) always waits until all ranks have completed before completing itself.
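An illustrative sketch of the completion barrier described above (the key names and helper are hypothetical):
```python
def _mark_done_and_wait(store, rank: int, world_size: int) -> None:
    # Each rank records that it is done; the master additionally waits for
    # every rank's marker before letting the TCP daemon shut down.
    store.set(f"finished/{rank}", "done")
    if rank == 0:
        store.wait([f"finished/{r}" for r in range(world_size)])
```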
This should fix: https://github.com/pytorch/pytorch/issues/15638
Tested using the repro from https://github.com/pytorch/pytorch/issues/15638, and it works fine. test_distributed and test_c10d should already have provided this coverage.
I had to make the rendezvous test in c10d use a world size of 1, since it is single-process code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15684
Differential Revision: D13570904
Pulled By: teng-li
fbshipit-source-id: 34f3bc471204bbd29320df359347ad5561c6b589
Summary:
Fixing: https://github.com/pytorch/pytorch/issues/14446
This was a supported behavior in old torch.distributed. We want to support it in the new release.
Tests should cover all combinations of scenarios where rank and/or world size are set up via either env or arg, or both.
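For illustration, one of the combinations covered (values are placeholders): rank and world size supplied through the environment rather than as arguments.
```python
import os
import torch.distributed as dist

# Rank and world size come from the environment; init_process_group is called
# without passing them explicitly.
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"
os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
dist.init_process_group(backend="gloo", init_method="env://")
```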
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14494
Differential Revision: D13253433
Pulled By: teng-li
fbshipit-source-id: c05974d84f1bdf969f74ec45763e11a841fe4848
Summary:
This addresses: https://github.com/pytorch/pytorch/issues/11874
and gives us file init_method behavior identical to the previous THD file init.
Also the FileStore::add bug is pretty annoying.
Two bugs:
(1) Add doesn't append to the end of the file.
(2) Cache doesn't get updated.
Both are fixed and covered by tests.
I examined /tmp to ensure that all temp files are automatically deleted after test_c10d.py runs.
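An illustrative exercise of the fixed `FileStore::add` path (the file path and values are placeholders):
```python
import tempfile
from torch.distributed import FileStore

# Back a FileStore with a temp file and increment a counter twice; with the
# fix, each add() appends its update record and the cached value stays current.
path = tempfile.NamedTemporaryFile(delete=False).name
store = FileStore(path, 1)
assert store.add("counter", 1) == 1
assert store.add("counter", 2) == 3
```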
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13708
Reviewed By: pietern
Differential Revision: D12972810
Pulled By: teng-li
fbshipit-source-id: 917255390aa52845f6b0ad0f283875a7a704da48
Summary:
A missing environment variable used to raise a missing key error. Now it raises a more descriptive error explaining the actual problem, for example:
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set
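A sketch of the behavior described above; the helper name is hypothetical, and only the error message format is taken from the example:
```python
import os

def _env_or_error(var: str) -> str:
    # Turn a missing environment variable into a descriptive ValueError
    # instead of a bare KeyError.
    value = os.environ.get(var)
    if value is None:
        raise ValueError(
            "Error initializing torch.distributed using env:// rendezvous: "
            f"environment variable {var} expected, but not set"
        )
    return value
```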
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11782
Differential Revision: D9888962
Pulled By: pietern
fbshipit-source-id: 5947e7a7bf7aa45f13bbd7b5e997529f26cc92d6
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`
The old DDP will go to `torch.nn.parallel.deprecated`
Now `torch.nn.parallel.DDP` will use c10d DDP
Now `torch.distributed` will use C10d frontend API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405
Reviewed By: pietern
Differential Revision: D9733733
Pulled By: teng-li
fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08