Enables configuring the rendezvous keep-alive parameters through `rdzv_configs`:
```python
from torch.distributed.launcher.api import LaunchConfig
config = LaunchConfig(
    ...,
    rdzv_configs={
        "keep_alive_interval": 1122,
        "heartbeat_timeout": 321,
        "keep_alive_max_attempt": 5,
    },
)
```
These arguments are currently hard-coded inside torchrun, and the default values are not suitable for jobs with thousands of ranks.
Today, `rdzv_configs` only accepts the keys `join_timeout`, `last_call_timeout`, and `close_timeout`.
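For reference, a sketch of today's usage with only the currently allowed keys (the node/process counts are placeholder values):
```python
from torch.distributed.launcher.api import LaunchConfig

config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=8,
    rdzv_configs={"join_timeout": 600, "last_call_timeout": 30, "close_timeout": 30},
)
```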
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145228
Approved by: https://github.com/wconstab
Summary: During dynamic rendezvous, we shouldn't use the address from the store; we should use `self._this_node.addr` directly, because sometimes the store host is not the host of rank0. Passing the wrong host causes a timeout error. This is a follow-up fix to S463164; for internal tests, we disable TCPStore sharing for now.
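A minimal sketch of the intent (the helper below is hypothetical, not the exact diff):
```python
# Hypothetical helper: prefer this node's own address over the store's host
# when advertising the bootstrap address, since the store host may not be
# rank0's host, and a wrong host leads to timeouts.
def _bootstrap_addr(self) -> str:
    return self._this_node.addr  # not the address recorded in the store
```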
Test Plan: CI.
Differential Revision: D65453312
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139702
Approved by: https://github.com/XilunWu
1. We take option 3 as discussed in https://github.com/pytorch/pytorch/issues/135712: every time we retry, we first create a new TCPStore server, so that we don't need to append the attempt count as a prefix and we avoid eventual TCPStore sync failures. (This applies only when TCPStore sharing is enabled.)
2. We start the new server bound to an ephemeral port (i.e. port 0) so it gets assigned a free port, and we then pass that port downstream (to the trainer or c10d). This way the TCPStore is managed by the elastic agent instead of racing to bind a specific port in the trainer (see the sketch after this list).
3. The port is then broadcast for dynamic rendezvous.
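A minimal sketch of point 2, assuming a recent `torch.distributed.TCPStore` that supports binding to port 0:
```python
from torch.distributed import TCPStore

# Bind the server to an ephemeral port (0) so the OS assigns a free one; the
# elastic agent then reads back the chosen port and passes it downstream.
store = TCPStore("localhost", 0, world_size=1, is_master=True)
print(store.port)  # the OS-assigned port to hand to the trainer or c10d
```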
One remaining question: for the store created via `_create_tcp_store` in torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py, are we OK with creating a duplicate TCPStore server?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135957
Approved by: https://github.com/d4l3k, https://github.com/c-p-i-o
Summary:
There was a regression introduced in https://github.com/pytorch/pytorch/pull/125743 that made `local_addr` no longer used. This fixes that by passing `local_addr` to `RendezvousStoreInfo.build` everywhere it's used.
This also fixes a number of tests, allowing them to be run in parallel, which hugely sped up the testing cycle since this change touches many different rendezvous implementations. It also required a few fixes in unrelated tests.
Test Plan:
Added tests for the common rendezvous implementations verifying that `local_addr` is respected, to prevent future regressions.
```
buck2 test @//mode/dev-nosan fbcode//caffe2/test/distributed/elastic/... fbcode//caffe2/torch/distributed/elastic/... -- --stress-runs 3
```
To vet the parallelism changes I also ran with 3 stress runs each to identify flakiness caused by parallelism.
Differential Revision: D62256407
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135262
Approved by: https://github.com/fduwjj, https://github.com/wz337
D53335860 and D56435815 added an option to torch elastic allowing users to choose the TCPStore backend type via:
1) explicitly passing an argument in user code when instantiating `MastRendezvousHandler`, or
2) passing the `--use_libuv` command line argument to `torchrun`.
The motivation was to offer a quick way to roll back to the non-libuv TCPStore backend while we were making libuv the default in `c10d`. We now think it is better for torch elastic to not be aware of the TCPStore backend type at all, and instead rely on `c10d`'s mechanism to decide which backend to use. This way, the TCPStore backend type used by torch elastic is identical to that used in the rest of PyTorch.
PyTorch TCPStore uses the environment variable `USE_LIBUV` to determine the backend type (illustrated below):
- when `USE_LIBUV="0"`, the non-libuv backend is used;
- when `USE_LIBUV="1"`, the libuv backend is used (this is the default).
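For example (a hedged illustration; it assumes the store-creation path in `init_process_group` reads `USE_LIBUV` as described above):
```python
import os

import torch.distributed as dist

# Opt out of the libuv backend before any TCPStore is created; "1" (the
# default) selects libuv.
os.environ["USE_LIBUV"] = "0"
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"

dist.init_process_group(backend="gloo", init_method="env://", rank=0, world_size=1)
```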
Differential Revision: [D58259590](https://our.internmc.facebook.com/intern/diff/D58259590/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134882
Approved by: https://github.com/shuqiangzhang
Summary:
A number of features rely on the TCP store as a control plane. By default the TCPStore server is started on the rank0 trainer, which can create a race condition when rank0 exits (whether due to an error or a graceful exit) while other ranks are still reading from or writing to the store.
Solution: the TCPStore server should outlive all the trainer processes. Moving ownership of the TCPStore to the torchelastic agent naturally fixes the lifecycle of the server.
Static rendezvous in torchelastic already supports sharing the TCPStore server. We are extending this to the more commonly used c10d rendezvous handler.
Any handler that would like to manage the TCP store has to:
- return true from its `use_agent_store` property;
- make the `master_addr`/`master_port` values of the `RendezvousStoreInfo` in `RendezvousInfo` refer to the managed TCPStore (these are returned from the `next_rendezvous` call).
Note: in some instances users may want to use non-TCPStore-based stores for the torchelastic rendezvous process itself, so the handler will need to create and hold a reference to the TCPStore (as done in this change, and as sketched below).
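A hedged sketch of that contract (the handler name and port choices are hypothetical, and the remaining abstract `RendezvousHandler` methods are omitted):
```python
from torch.distributed import PrefixStore, TCPStore
from torch.distributed.elastic.rendezvous.api import (
    RendezvousHandler,
    RendezvousInfo,
    RendezvousStoreInfo,
)


class StoreManagingHandler(RendezvousHandler):  # hypothetical handler
    def __init__(self) -> None:
        # Create and hold the TCPStore so it outlives the trainer processes.
        self._tcp_store = TCPStore("localhost", 0, is_master=True)

    @property
    def use_agent_store(self) -> bool:
        return True  # signal that the agent-managed store is shared

    def next_rendezvous(self) -> RendezvousInfo:
        rank, world_size = 0, 1  # placeholders for the sketch
        # Wrap the shared store in a PrefixStore to scope its namespace.
        store = PrefixStore("rdzv/", self._tcp_store)
        bootstrap = RendezvousStoreInfo(
            master_addr="localhost", master_port=self._tcp_store.port
        )
        return RendezvousInfo(store, rank, world_size, bootstrap)
```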
Test Plan:
`cat ~/workspace/dist-demo/stores.py`
~~~
import logging
import os
import sys
import time

import torch
import torch.distributed as dist

logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stderr))
logger.setLevel(logging.INFO)


def _run_test(store):
    if dist.get_rank() == 1:
        logger.info("Rank %s is sleeping", dist.get_rank())
        time.sleep(5)
        key = "lookup_key"
        logger.info("Checking key %s in store on rank %s", key, dist.get_rank())
        store.check([key])
    else:
        logger.info("rank %s done", dist.get_rank())


def main() -> None:
    use_gpu = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_gpu else "gloo")
    dist.barrier()
    logger.info(f"Hello World from rank {dist.get_rank()}")
    host = os.environ["MASTER_ADDR"]
    port = os.environ["MASTER_PORT"]
    world_size = os.environ["WORLD_SIZE"]
    logger.info("testing TCPStore")
    store = dist.TCPStore(
        host_name=host, port=int(port), world_size=int(world_size),
    )
    _run_test(store)


if __name__ == "__main__":
    main()
~~~
With the fix (`TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0`, or simply dropping the option):
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 1
Hello World from rank 2
Hello World from rank 0
testing TCPStore
testing TCPStore
testing TCPStore
rank 2 done
Rank 1 is sleeping
rank 0 done
Checking key lookup_key in store on rank 1
~~~
With `TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1` (store sharing disabled):
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 0
Hello World from rank 2
Hello World from rank 1
testing TCPStore
testing TCPStore
testing TCPStore
rank 0 done
rank 2 done
Rank 1 is sleeping
Checking key lookup_key in store on rank 1
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 46, in <module>
[rank1]: main()
[rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 42, in main
[rank1]: _run_test(store)
[rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 22, in _run_test
[rank1]: store.check([key])
[rank1]: torch.distributed.DistNetworkError: Connection reset by peer
E0605 17:40:22.853277 140249136719680 torch/distributed/elastic/multiprocessing/api.py:832] failed (exitcode: 1) local_rank: 1 (pid: 2279237) of binary: /home/kurman/.conda/envs/pytorch_38/bin/python
Traceback (most recent call last):
File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/users/kurman/pytorch/torch/distributed/run.py", line 904, in <module>
main()
File "/data/users/kurman/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/data/users/kurman/pytorch/torch/distributed/run.py", line 900, in main
run(args)
File "/data/users/kurman/pytorch/torch/distributed/run.py", line 891, in run
elastic_launch(
File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/kurman/workspace/dist-demo/stores.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-06-05_17:40:22
host : devgpu011.cln5.facebook.com
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2279237)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
~~~
Differential Revision: D58180193
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128096
Approved by: https://github.com/shuqiangzhang
Summary:
1. Define an explicit `use_agent_store` property on rdzv handlers. Handlers that set it to true can share the store.
2. Instead of the agent coordinating the `master_addr`/`master_port` values, the logic is now encapsulated by the *rdzv_handler*: `RendezvousInfo` carries a `RendezvousStoreInfo` object that handlers must return.
   - Depending on the implementation, they can either:
     - point to the existing store (and are expected to set `use_agent_store` to true, per point 1); client code relies on the `TORCHELASTIC_USE_AGENT_STORE` env variable to know whether the store is shared, or
     - build args from which `torch.distributed.init_process_group` can bootstrap a new store.
Additional points:
- When the TCPStore is shared, it should be wrapped in a `PrefixStore` to qualify/scope its namespace for other use cases.
- The `next_rendezvous` signature changed to return an instance of `RendezvousInfo` instead of a `(store, rank, world_size)` tuple, for extensibility (see the sketch after this list).
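For instance, call sites migrate roughly like this (a sketch; handler construction is elided):
```python
# Before: next_rendezvous() returned a plain tuple.
# store, rank, world_size = rdzv_handler.next_rendezvous()

# After: it returns a RendezvousInfo carrying the same data plus store info.
rdzv_info = rdzv_handler.next_rendezvous()
store = rdzv_info.store
rank, world_size = rdzv_info.rank, rdzv_info.world_size
bootstrap = rdzv_info.bootstrap_store_info  # RendezvousStoreInfo(master_addr, master_port)
```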
Why:
- reduce moving parts;
- easier to swap implementations;
- improved tractability;
- addressing perf/debuggability will benefit all use cases.
Test Plan: CI
Differential Revision: D57055235
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125743
Approved by: https://github.com/d4l3k
Summary:
Minor logging cleanup in distributed library
1. Don't use "f" formatted strings, to address linter issues (see the sketch after this list).
2. Nits: make use of the unused `e` (error) in a few logs.
3. Change info->debug as asked in issue #113545.
4. Nit: rename log -> logger in a few files for consistency.
5. Fix a linter error.
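Illustrating item 1 (a generic example, not a call site from the diff):
```python
import logging

logger = logging.getLogger(__name__)
rank = 0

logger.info("Rank %s joined the rendezvous", rank)   # preferred: lazy %-style args
# logger.info(f"Rank {rank} joined the rendezvous")  # flagged by the linter
```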
Test Plan:
1. Local build passes.
2. Linter is happy.
Reviewers: wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61822
Added scuba logging to the following files:
- dynamic_rendezvous.py
- c10d_rendezvous_backend.py
NOTE: This diff introduces the use of Python's inspect module to easily obtain the calling method name and filename when logging. This module can interfere with Python's garbage collector, so special care was taken to never store references to results from inspect.stack() longer than absolutely needed.
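A minimal sketch of that pattern (the helper name and frame depth are assumptions):
```python
import inspect


def _caller_info():  # hypothetical helper
    frame_info = inspect.stack()[1]  # the immediate caller
    try:
        # Copy out plain strings only; never store the FrameInfo itself.
        return frame_info.filename, frame_info.function
    finally:
        del frame_info  # drop the frame reference as soon as possible
```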
Test Plan:
The following tests can be run.
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/rendezvous:c10d_rendezvous_backend_test
```
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/rendezvous:dynamic_rendezvous_test
```
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/events:lib_test
```
Reviewed By: aivanou
Differential Revision: D29643774
fbshipit-source-id: f10cd5ebf8f6860856267bc2483c0b85faacb0fd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59663
This PR fixes an edge case bug in `DynamicRendezvousHandler` where the state of the rendezvous is not always entirely updated when one or more nodes are not alive anymore.
Test Plan: Run the existing and newly-introduced unit tests.
Reviewed By: tierex
Differential Revision: D28971809
fbshipit-source-id: ebbb6a5f2b04f045c3732d6cf0f8fdc7c2381a7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58169
This PR adds logging to the `_sanitize()` function of `RendezvousStateHolder` to output the nodes that had no recent heartbeat and are considered "dead".
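A hedged sketch of the added logging (the function is hypothetical; the field name follows `last_heartbeats` used elsewhere in these notes):
```python
import logging
from datetime import datetime, timedelta

logger = logging.getLogger(__name__)


def _log_dead_nodes(last_heartbeats: dict, keep_alive_interval: timedelta, max_misses: int) -> None:
    expiration = datetime.utcnow() - keep_alive_interval * max_misses
    dead_nodes = [node for node, hb in last_heartbeats.items() if hb < expiration]
    if dead_nodes:
        logger.debug("Nodes with no recent heartbeat considered dead: %s", dead_nodes)
```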
ghstack-source-id: 128798389
Test Plan: Run the existing tests.
Reviewed By: tierex
Differential Revision: D28333394
fbshipit-source-id: ba0a398a759815e4224b58323c0e743eb383f723
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58160
This PR updates the Torch Distributed Elastic documentation with references to the new `c10d` backend.
ghstack-source-id: 128783809
Test Plan: Visually verified the correct rendering of the updated documentation.
Reviewed By: tierex
Differential Revision: D28384996
fbshipit-source-id: a40b0c37989ce67963322565368403e2be5d2592
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57151
This PR introduces the implementation of `DynamicRendezvousHandler`, which mostly builds on the types introduced in previous PRs.
ghstack-source-id: 127685212
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28060531
fbshipit-source-id: 844ff0e9c869f2bbb85fba05a16002d00eae130f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57150
This PR refactors the `__init__` method of `DynamicRendezvousHandler` to a `from_backend` static constructor for easier testing and future extensibility.
ghstack-source-id: 127685183
Test Plan: Run the updated unit tests.
Reviewed By: tierex
Differential Revision: D28060336
fbshipit-source-id: b07dcbb61e8ff5a536b7b021cd50438010c648dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57149
This PR introduces the `_RendezvousJoinOp` type that represents a rendezvous join operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685142
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059785
fbshipit-source-id: 6e67a54289eef1a2349fcc52f8841e49c139459a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57148
This PR introduces the `_RendezvousExitOp` type that represents a rendezvous exit operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685094
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059764
fbshipit-source-id: 2da428885f1390957242fdd82d68cee2ac273c71
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57147
This PR introduces the `_RendezvousKeepAliveOp` type that represents a rendezvous keep-alive heartbeat operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685037
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059733
fbshipit-source-id: 31fd8fc06f03d8f9cd21558b15a06dea7ad85bc6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57146
This PR introduces the `_RendezvousCloseOp` type that represents a rendezvous close operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127684991
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059693
fbshipit-source-id: 6c944d3b4f6a6ed2057ea2921ae8a42609998dd2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57145
This PR introduces the `_DistributedRendezvousOpExecutor` type that implements the `_RendezvousOpExecutor` interface for rendezvous shared via a `_RendezvousStateHolder`.
ghstack-source-id: 127684945
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059417
fbshipit-source-id: 7ef72ea16b54eaaa11a6ece7459d385d49692a84
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57144
This PR introduces the `_RendezvousOpExecutor` interface. Implementers of this interface are responsible for executing rendezvous operations in a state machine that outputs actions based on the current state of the rendezvous.
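A hedged sketch of that shape (the action names are an illustrative subset; the real enum and loop are richer):
```python
from enum import Enum


class _Action(Enum):  # illustrative subset of possible actions
    KEEP_ALIVE = 1
    ADD_TO_PARTICIPANTS = 2
    SYNC = 3
    FINISH = 4


def run(op, ctx, deadline) -> None:
    """Drive a rendezvous op until it reports that it is finished."""
    action = None
    while action != _Action.FINISH:
        # The op inspects the current rendezvous state and returns the next
        # action; a real executor would perform it against the shared state.
        action = op(ctx, deadline)
```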
ghstack-source-id: 127684898
Test Plan: None beyond `flake8` and `mypy` as this is solely an interface definition.
Reviewed By: tierex
Differential Revision: D28059159
fbshipit-source-id: 8e7da33e02336206cddbe76d773681e98c28a98f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56538
This PR introduces the `_RendezvousStateHolder` interface and its accompanying `_BackendRendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state with the other nodes.
ghstack-source-id: 127684796
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D27892600
fbshipit-source-id: a55d884a1f9b0d742787be4dff4271e076c08962
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57142
This PR extends the return type of `RendezvousBackend`'s `set_state` method with an additional boolean flag that specifies whether the write attempt has succeeded.
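A hedged sketch of consuming the new return value (the retry policy and helper name are assumptions):
```python
def update_state(backend, mutate):
    """Compare-and-set loop over a backend's get_state/set_state."""
    result = backend.get_state()
    state, token = result if result is not None else (b"", None)
    while True:
        outcome = backend.set_state(mutate(state), token)
        if outcome is None:
            continue  # hypothetical handling: treat as retryable
        state, token, has_set = outcome
        if has_set:
            return state  # our write won
        # Otherwise another node wrote first; retry against the fresh state.
```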
ghstack-source-id: 127629538
Test Plan: Run the updated unit tests.
Reviewed By: tierex
Differential Revision: D28058980
fbshipit-source-id: 26333790c39386891beb155b20ba1291d2cbdd03
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57141
Per feedback this PR renames `last_keep_alives` to `last_heartbeats` in `_RendezvousState`.
ghstack-source-id: 127629442
Test Plan: Run the updated unit tests.
Reviewed By: tierex
Differential Revision: D28058948
fbshipit-source-id: 0db12eac56a47a426a7a48fb5c93ac6a08b0d22e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57140
This PR introduces a new `heartbeat` attribute in `RendezvousTimeout`.
ghstack-source-id: 127626815
Test Plan: Run the updated unit tests.
Reviewed By: tierex
Differential Revision: D28058908
fbshipit-source-id: c6f8b3a06210cc59714fa841d9387eeb028dc02f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57139
This PR sets the `order` attribute of the `dataclass` annotation to `True` in order to introduce comparison operators for `_NodeDesc`.
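For reference, the `order` flag generates the comparison dunders from the declared field order (the field names here mirror `_NodeDesc` but are illustrative):
```python
from dataclasses import dataclass


@dataclass(order=True, frozen=True)
class NodeDesc:  # illustrative stand-in for _NodeDesc
    addr: str
    pid: int
    local_id: int


# Comparison is lexicographic over (addr, pid, local_id).
assert NodeDesc("host-a", 1, 0) < NodeDesc("host-b", 0, 0)
```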
ghstack-source-id: 127626783
Test Plan: Run the existing unit tests.
Reviewed By: tierex
Differential Revision: D28058851
fbshipit-source-id: 66313f84f507100e20acb687a3427b3dd51a6310
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56537
This PR introduces the `RendezvousSettings` type to consolidate the arguments passed to `DynamicRendezvousHandler`.
ghstack-source-id: 127626738
Test Plan: Run the existing unit tests.
Reviewed By: tierex
Differential Revision: D27890155
fbshipit-source-id: 22060c25b6927cc832f18ae6c5f7ba0f7a9ef3cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56535
This PR renames the `_Rendezvous` class to `_RendezvousState` in preparation for the upcoming changes.
ghstack-source-id: 126979138
Test Plan: Run the existing unit tests.
Reviewed By: H-Huang
Differential Revision: D27889894
fbshipit-source-id: 027d26aa5e1acd5bba3ad2e58b140428a4a176b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56534
This PR reorders the type definitions in dynamic_rendezvous.py to improve readability.
ghstack-source-id: 126979087
Test Plan: Run the existing unit tests.
Reviewed By: H-Huang
Differential Revision: D27889817
fbshipit-source-id: 04291af9b8f3170e4b33cb4f33e0dff0d2d3fb23
Summary:
This PR includes the auxiliary types used by the upcoming implementation of the `DynamicRendezvousHandler`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55932
Test Plan: Run the existing and newly-introduced unit/integration tests.
Reviewed By: tierex
Differential Revision: D27742329
Pulled By: cbalioglu
fbshipit-source-id: cf2e0d88042909739e7c37c25b4b90192c26e198
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55635
This diff introduces the `DynamicRendezvousHandler` type as a stub implementation and its accompanying `RendezvousBackend` interface.
`DynamicRendezvousHandler` is intended to be a backend-agnostic type that will contain the core (bulk) logic of rendezvous handling. Any backend-specific operation is delegated to a concrete subclass of `RendezvousBackend` (e.g. `C10dRendezvousBackend`, see D27654492) that is passed as a constructor argument to `DynamicRendezvousHandler`.
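A hedged sketch of that delegation (the signatures are simplified from the description above):
```python
from abc import ABC, abstractmethod
from typing import Optional, Tuple


class Backend(ABC):  # simplified stand-in for RendezvousBackend
    @abstractmethod
    def get_state(self) -> Optional[Tuple[bytes, object]]:
        """Return the encoded rendezvous state and a fencing token, if set."""

    @abstractmethod
    def set_state(self, state: bytes, token: Optional[object] = None):
        """Attempt to write the state, guarded by the fencing token."""


class Handler:  # simplified stand-in for DynamicRendezvousHandler
    def __init__(self, backend: Backend) -> None:
        # The core rendezvous logic lives in the handler; all storage
        # operations are delegated to the backend (e.g. C10dRendezvousBackend).
        self._backend = backend
```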
ghstack-source-id: 126304697
Test Plan: Run the existing and newly-introduced unit/integration tests.
Reviewed By: tierex
Differential Revision: D27654478
fbshipit-source-id: 9fc89a6e4cb308971c65b29a7c5af7ae191f70c5