pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-08 07:39:33 +01:00

Author	SHA1	Message	Date
Sathyanarayanan Saravanamuthu	34dc8f69a1	Adding entry-point based support for out-of-tree rendezvous plugins (#132633 ) Fixes #127519 Currently in torchrun rendezvous, there are only two rendezvous backends supported out of the box: `C10d` and `Etcd`. The changes in this PR enable distributed elastic users to bring their out-of-tree rendezvous backend implementations as Python packages. #### AUTHORING NEW PLUGIN Any new plugin will be a python package exposing entry-points. For example, the structure of a redis plugin is as follows: ``` plugin_root \|_ pyproject.toml \|_ src \|_ redis \|_ __init__.py \|_ redis_store.py \|_ redis_backend.py ``` The contents of the `pyproject.toml` should indicate that the package exposes a torchrun entry-point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for the redis plugin would be as follows: ``` [project] name = "redis" version = "0.0.1" [project.entry-points.'torchrun.plugins'] redis = 'redis' ``` The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows: ``` def getPluginHandler(): def _create_redis_handler(params: RendezvousParameters): from redis_rendezvous_backend import create_backend backend, store = create_backend(params) return create_handler(store, backend, params) return _create_redis_handler ``` The files `redis_store` and `redis_backend` contain the implementation of [Store](`41189b0da4/torch/_C/_distributed_c10d.pyi (L171)`) and [RendezvousBackend](`e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)`) respectively. #### USER EXPERIENCE Before using the plugin for the first time, the user has to install the plugin packages. For example, published packages can be installed using `pip3 install <plugin-name>` and a plugin in the local file system can be installed using `pip3 install -e <plugin-location>`. 
Once installed, the new backend can be used in torchrun as follows: ``` torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633 Approved by: https://github.com/fduwjj	2024-09-11 03:35:02 +00:00
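The plugin mechanism described in this commit builds on Python packaging entry points. As a rough sketch of how a launcher could look up handlers registered under the `torchrun.plugins` group, assuming only the standard `importlib.metadata` module (the helper name `discover_plugins` is hypothetical and is not the function added by the PR):

```python
from importlib.metadata import EntryPoint, entry_points
from typing import Callable, Dict, Iterable, Optional

GROUP = "torchrun.plugins"  # group name taken from the commit message


def discover_plugins(eps: Optional[Iterable[EntryPoint]] = None) -> Dict[str, Callable]:
    """Map each entry-point name in GROUP to the object it points at.

    Hypothetical helper for illustration; not the torchrun implementation.
    """
    if eps is None:
        found = entry_points()
        # Python 3.10+ exposes .select(); older versions return a dict of lists.
        eps = found.select(group=GROUP) if hasattr(found, "select") else found.get(GROUP, [])
    # Loading an entry point imports the module named in its value.
    return {ep.name: ep.load() for ep in eps if ep.group == GROUP}
```

Passing an explicit list of `EntryPoint` objects makes the lookup easy to exercise without installing a real plugin package.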
Tristan Rice	196748d491	[elastic] support local_addr across all rendezvous impls (#135262 ) Summary: There was a regression introduced in https://github.com/pytorch/pytorch/pull/125743 that caused `local_addr` to be ignored. This fixes that by passing `local_addr` to `RendezvousStoreInfo.build` everywhere it's used. This also fixes a number of tests, allowing them to be run in parallel, which hugely sped up the testing cycle as this change touches many different rendezvous implementations. This required a few fixes in unrelated tests. Test Plan: Added `local_addr` tests for the common rendezvous implementations to prevent future regressions. ``` buck2 test @//mode/dev-nosan fbcode//caffe2/test/distributed/elastic/... fbcode//caffe2/torch/distributed/elastic/... -- --stress-runs 3 ``` To vet the parallelism changes I also ran with 3 stress runs each to identify flakiness caused by parallelism. Differential Revision: D62256407 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135262 Approved by: https://github.com/fduwjj, https://github.com/wz337	2024-09-06 17:55:43 +00:00
Yiwen Shi	3a9e33dca8	[torchelastic] Don't do signal handling when off the main thread (#135088 ) Summary: In multiprocessing, signal handling is not possible if the thread is not the main thread. This resulted in the following error: > "ValueError('signal only works in main thread of the main interpreter')" To address this issue, the diff checks whether the thread is the main thread and, if not, skips signal handling. Test Plan: Before this change, MAST job failed: https://fburl.com/mlhub/iq2m10v8 With this change, MAST job succeeded: https://fburl.com/mlhub/q6kb8343 Differential Revision: D62166943 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135088 Approved by: https://github.com/d4l3k	2024-09-06 14:47:03 +00:00
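The check described in this commit can be sketched with the standard library alone. This is a minimal illustration, not the torchelastic code; the helper name `install_sigterm_handler` is hypothetical:

```python
import signal
import threading


def install_sigterm_handler(handler) -> bool:
    """Register handler for SIGTERM only when running on the main thread.

    Returns True if the handler was installed, False if it was skipped.
    Hypothetical helper; the real change lives inside torchelastic.
    """
    if threading.current_thread() is not threading.main_thread():
        # signal.signal() raises ValueError off the main thread, so skip it.
        return False
    signal.signal(signal.SIGTERM, handler)
    return True
```

Returning a flag instead of raising lets callers that run inside worker threads continue without signal handling.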
Xilun Wu	ed06772e35	[TorchElastic] add warning when users try to pass a "use_libuv" argument to create_c10d_store (#135062 ) Summary: Extend the warning message to be more self-explanatory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135062 Approved by: https://github.com/shuqiangzhang	2024-09-04 22:05:51 +00:00
Xilun Wu	e7731b3f8a	[TorchElastic] make torch elastic not have to realize TCPStore backend type and rely on c10d to decide which backend to use (#134882 ) D53335860 and D56435815 added an option to torch elastic allowing users to choose a TCPStore backend type via 1) explicit argument passing in user code when instantiating `MastRendezvousHandler`, or 2) passing the `--use_libuv` command line argument to `torchrun`. The motivation was to offer a quick way to roll back to the non-libuv TCPStore backend since we were making libuv the default in `c10d` code. Now we think that it's better for torch elastic not to be aware of the TCPStore backend type and to rely on `c10d`'s mechanism to decide which backend to use. This way, the TCPStore backend type used by torch elastic will be identical to that in PyTorch. PyTorch TCPStore uses the environment variable `USE_LIBUV` to determine the backend type: when `USE_LIBUV="0"`, the non-libuv backend is used; when `USE_LIBUV="1"`, the libuv backend is used, and this is the default. Differential Revision: [D58259590](https://our.internmc.facebook.com/intern/diff/D58259590/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134882 Approved by: https://github.com/shuqiangzhang	2024-09-03 19:43:21 +00:00
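The `USE_LIBUV` rule stated above can be written down as a tiny predicate. This is only an illustration of the rule as worded in the commit message: the real decision is made inside `c10d`, the function name `libuv_enabled` is hypothetical, and treating any value other than `"0"` as "libuv" is an assumption of this sketch.

```python
import os
from typing import Mapping, Optional


def libuv_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True unless USE_LIBUV is explicitly set to "0".

    Illustration only: the real decision is made inside c10d's TCPStore.
    """
    env = os.environ if env is None else env
    # An unset variable falls back to "1", the default named in the commit.
    return env.get("USE_LIBUV", "1") != "0"
```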
PyTorch MergeBot	d52aff3e73	Revert "Adding entry-point based support for out-of-tree rendezvous plugins (#132633 )" This reverts commit `136b19b062`. Reverted https://github.com/pytorch/pytorch/pull/132633 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing internal tests to fail with the error `ImportError: cannot import name '_register_out_of_tree_handlers' from 'torch.distributed.elastic.rendezvous.registry'` ([comment](https://github.com/pytorch/pytorch/pull/132633#issuecomment-2315716201))	2024-08-28 15:49:18 +00:00
Sathyanarayanan Saravanamuthu	136b19b062	Adding entry-point based support for out-of-tree rendezvous plugins (#132633 ) Fixes #127519 Currently in torchrun rendezvous, there are only two rendezvous backends supported out of the box: `C10d` and `Etcd`. The changes in this PR enable distributed elastic users to bring their out-of-tree rendezvous backend implementations as Python packages. #### AUTHORING NEW PLUGIN Any new plugin will be a python package exposing entry-points. For example, the structure of a redis plugin is as follows: ``` plugin_root \|_ pyproject.toml \|_ src \|_ redis \|_ __init__.py \|_ redis_store.py \|_ redis_backend.py ``` The contents of the `pyproject.toml` should indicate that the package exposes a torchrun entry-point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for the redis plugin would be as follows: ``` [project] name = "redis" version = "0.0.1" [project.entry-points.'torchrun.plugins'] redis = 'redis' ``` The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows: ``` def getPluginHandler(): def _create_redis_handler(params: RendezvousParameters): from redis_rendezvous_backend import create_backend backend, store = create_backend(params) return create_handler(store, backend, params) return _create_redis_handler ``` The files `redis_store` and `redis_backend` contain the implementation of [Store](`41189b0da4/torch/_C/_distributed_c10d.pyi (L171)`) and [RendezvousBackend](`e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)`) respectively. #### USER EXPERIENCE Before using the plugin for the first time, the user has to install the plugin packages. For example, published packages can be installed using `pip3 install <plugin-name>` and a plugin in the local file system can be installed using `pip3 install -e <plugin-location>`. 
Once installed, the new backend can be used in torchrun as follows: ``` torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633 Approved by: https://github.com/wconstab	2024-08-27 07:09:41 +00:00
Xuehai Pan	758a0a88a2	[BE][Easy] enable `ruff` rule `PIE790`: unnecessary `pass` statement (#133200 ) This PR removes unnecessary `pass` statements. This is semantically safe because the bytecode for the Python code does not change. Note that if there is a docstring in the function, an empty function does not need a `pass` statement as a placeholder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133200 Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/kit1980	2024-08-15 15:50:19 +00:00
Kurman Karabukaev	3b440f358c	[elastic collectives API] add missing rank tracing support (#132818 ) Adds an option to detect missing ranks (which can be mapped to host info via the `rank_tracing_decoder` lambda argument) in the store barrier operation. This approach has already been used in some form; this change moves it into the collectives API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132818 Approved by: https://github.com/d4l3k	2024-08-09 22:55:04 +00:00
Cheng Ni	27c9262d29	Fix stdout / stderr typing in SubprocessHandler (#132071 ) Summary: Fix stdout / stderr typing in SubprocessHandler. Stdout and Stderr should be `Optional[str]` instead of `str`. Test Plan: CI Differential Revision: D60319648 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132071 Approved by: https://github.com/Skylion007	2024-07-31 02:51:11 +00:00
Tristan Rice	a8985a97f9	elastic/store: use wait instead of get for barrier (#130148 ) Summary: We call `.get` in the elastic store barrier operation but we don't need the result. This switches it to use `.wait` instead which eliminates one network round trip as `get` internally does a wait first. Test Plan: CI + existing tests -- no behavior change Differential Revision: D59396199 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130148 Approved by: https://github.com/kurman, https://github.com/wconstab	2024-07-08 19:53:42 +00:00
Kurman Karabukaev	c9ceae3fac	Use JK for mast rdzv handler tcpstore handling and additional logging (#129603 ) Summary: Use JK to control the release instead of using an env variable to toggle the feature. Note: sharing the store reduces shutdown races as the TCPStore lifecycle is managed outside of trainer rank execution time. Test Plan: CI Differential Revision: D59071544 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129603 Approved by: https://github.com/d4l3k	2024-06-27 03:34:52 +00:00
Xuehai Pan	e6d4451ae8	[BE][Easy] enable UFMT for `torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/` (#128866 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128866 Approved by: https://github.com/fegin	2024-06-18 13:51:53 +00:00
Kurman Karabukaev	bb13fad7aa	Share TCPStore by default when using c10d rdzv handler (#128096 ) Summary: A number of features rely on TCPStore as a control plane. By default the TCPStore server is started on the rank0 trainer, which can create a race condition: when rank0 exits (on error or gracefully), any other ranks still reading/writing will fail. Solution: TCPStore server should outlive all the trainer processes. Moving the ownership of TCPStore to the torchelastic agent naturally fixes the lifecycle of the server. Static rendezvous in torchelastic already supports sharing of the TCPStore server. We are extending this to the more commonly used c10d rendezvous handler. Any handler that wants to manage the TCPStore has to: - Return true on `use_agent_store` property - `RendezvousInfo`.`RendezvousStoreInfo`#[`master_addr/master_port`] values refer to managed TCPStore (those are returned on `next_rendezvous` call) Note: in some instances users may want to use non-TCPStore based stores for the torchelastic rendezvous process, so the handler will need to create and hold a reference to TCPStore (as done in this change) Test Plan: `cat ~/workspace/dist-demo/stores.py` ~~~ import torch import logging import sys import torch.distributed as dist import torch import os import time logger = logging.getLogger(__name__) logger.addHandler(logging.StreamHandler(sys.stderr)) logger.setLevel(logging.INFO) def _run_test(store): if dist.get_rank() == 1: logger.info("Rank %s is sleeping", dist.get_rank()) time.sleep(5) key = "lookup_key" logger.info("Checking key %s in store on rank %s", key, dist.get_rank()) store.check([key]) else: logger.info("rank %s done", dist.get_rank()) def main() -> None: use_gpu = torch.cuda.is_available() dist.init_process_group(backend="nccl" if use_gpu else "gloo") dist.barrier() logger.info(f"Hello World from rank {dist.get_rank()}") host = os.environ['MASTER_ADDR'] port = os.environ['MASTER_PORT'] world_size = os.environ['WORLD_SIZE'] logger.info("testing 
TCPStore") store = dist.TCPStore( host_name=host, port=int(port), world_size=int(world_size), ) _run_test(store) if __name__ == "__main__": main() ~~~ With the fix (TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 or just drop the option) ~~~ (pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified. WARNING:__main__: *************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ************************************* Hello World from rank 1 Hello World from rank 2 Hello World from rank 0 testing TCPStore testing TCPStore testing TCPStore rank 2 done Rank 1 is sleeping rank 0 done Checking key lookup_key in store on rank 1 ~~~ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 ~~~ (pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified. WARNING:__main__: ************************************* Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*************************************** Hello World from rank 0 Hello World from rank 2 Hello World from rank 1 testing TCPStore testing TCPStore testing TCPStore rank 0 done rank 2 done Rank 1 is sleeping Checking key lookup_key in store on rank 1 [rank1]: Traceback (most recent call last): [rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 46, in <module> [rank1]: main() [rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 42, in main [rank1]: _run_test(store) [rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 22, in _run_test [rank1]: store.check([key]) [rank1]: torch.distributed.DistNetworkError: Connection reset by peer E0605 17:40:22.853277 140249136719680 torch/distributed/elastic/multiprocessing/api.py:832] failed (exitcode: 1) local_rank: 1 (pid: 2279237) of binary: /home/kurman/.conda/envs/pytorch_38/bin/python Traceback (most recent call last): File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/data/users/kurman/pytorch/torch/distributed/run.py", line 904, in <module> main() File "/data/users/kurman/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/data/users/kurman/pytorch/torch/distributed/run.py", line 900, in main run(args) File "/data/users/kurman/pytorch/torch/distributed/run.py", line 891, in run elastic_launch( File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ 
/home/kurman/workspace/dist-demo/stores.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-06-05_17:40:22 host : devgpu011.cln5.facebook.com rank : 1 (local_rank: 1) exitcode : 1 (pid: 2279237) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ ~~~ Differential Revision: D58180193 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128096 Approved by: https://github.com/shuqiangzhang	2024-06-12 21:49:42 +00:00
loganthomas	583a56d5a8	DOC: add docstring to construct_and_record_rdzv_event() (#128189 ) Fixes #127902 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128189 Approved by: https://github.com/kurman	2024-06-10 22:17:33 +00:00
Andrea Frittoli	83941482f7	Add docstring for the torch.distributed.elastic.utils.distributed.get_free_port function (#128133 ) Fixes: #127914 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128133 Approved by: https://github.com/H-Huang	2024-06-10 18:10:58 +00:00
Aaron Orenstein	7c12cc7ce4	Flip default value for mypy disallow_untyped_defs [6/11] (#127843 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127843 Approved by: https://github.com/oulgen ghstack dependencies: #127842	2024-06-08 18:49:29 +00:00
Aaron Orenstein	3a0d088517	Flip default value for mypy disallow_untyped_defs [5/11] (#127842 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127842 Approved by: https://github.com/oulgen	2024-06-08 18:49:18 +00:00
Tristan Rice	597922ba21	Reapply "distributed debug handlers (#126601 )" (#127805 ) This reverts commit `7646825c3e`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127805 Approved by: https://github.com/PaliC	2024-06-04 19:44:30 +00:00
Xuehai Pan	67ef2683d9	[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#127689 ) Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing. Note that only warnings that their messages contain `[Dd]eprecat(ed\|ion)` are updated in this PR. Resolves #126888 - #126888 This PR is split from PR #126898. - #126898 ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689 Approved by: https://github.com/Skylion007	2024-06-02 12:30:43 +00:00
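The fallback path mentioned in this commit, passing `category=FutureWarning` to `warnings.warn`, can be sketched with the standard library only. The decorator name `deprecated_call` and the function `old_add` are hypothetical; this is a stand-in for illustration and not `typing_extensions.deprecated` itself:

```python
import functools
import warnings


def deprecated_call(message: str):
    """Decorator that emits a FutureWarning each time the function is called.

    Stdlib-only stand-in for `typing_extensions.deprecated`; hypothetical name.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 attributes the warning to the caller's line.
            warnings.warn(message, category=FutureWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated_call("old_add is deprecated, use new_add instead")
def old_add(a, b):
    return a + b
```

Unlike `DeprecationWarning`, `FutureWarning` is shown by default by Python's warning filters, so end users see it without extra configuration.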
PyTorch MergeBot	033e733021	Revert "[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 )" This reverts commit `749a132fb0`. Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))	2024-05-31 19:47:24 +00:00
PyTorch MergeBot	7646825c3e	Revert "distributed debug handlers (#126601 )" This reverts commit `3d541835d5`. Reverted https://github.com/pytorch/pytorch/pull/126601 on behalf of https://github.com/PaliC due to breaking internal typechecking tests ([comment](https://github.com/pytorch/pytorch/pull/126601#issuecomment-2141076987))	2024-05-31 01:21:24 +00:00
Tristan Rice	3d541835d5	distributed debug handlers (#126601 ) This adds debug handlers as described in: * https://gist.github.com/d4l3k/828b7be585c7615e85b2c448b308d925 (public copy) * https://docs.google.com/document/d/1la68szcS6wUYElUUX-P6zXgkPA8lnfzpagMTPys3aQ8/edit (internal copy) This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow up PR. This adds 2 handlers out of the box: * `/handler/ping` for testing purposes * `/handler/dump_nccl_trace_pickle` as a POC integration with Flight Recorder Test plan: ``` python test/distributed/elastic/test_control_plane.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126601 Approved by: https://github.com/kurman, https://github.com/c-p-i-o	2024-05-30 02:21:08 +00:00
Xuehai Pan	749a132fb0	[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 ) Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing. Note that only warnings that their messages contain `[Dd]eprecat(ed\|ion)` are updated in this PR. UPDATE: Use `FutureWarning` instead of `DeprecationWarning`. Resolves #126888 - #126888 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898 Approved by: https://github.com/albanD	2024-05-29 12:09:27 +00:00
Kurman Karabukaev	d62b025efc	[TorchElastic] Option for sharing TCPStore created by rdzv handlers (#125743 ) Summary: 1. Define an explicit `use_agent_store` property on rdzv handlers. Handlers that set it to true can share the store. 2. Instead of the agent coordinating master_addr/master_port values, the logic is now encapsulated by the rdzv_handler: `RendezvousInfo` carries a `RendezvousStoreInfo` object that handlers must return. - Depending on the implementation, handlers can either: - point to an existing store (and are expected to set `use_agent_store` to true, see point 1). Client code will rely on the `TORCHELASTIC_USE_AGENT_STORE` env variable to know if the store is shared. - build args that `torch.distributed.init_process_group` can use to bootstrap by creating a new store. Additional points: - When TCPStore is shared, it should be wrapped in PrefixStore to qualify/scope the namespace for other usecases. - `next_rendezvous` signature changed to return an instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple for extensibility purposes. Why: - Reduce moving parts - easier to swap implementation - improve tractability - addressing perf/debug-ability will benefit all usecases Test Plan: CI Differential Revision: D57055235 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125743 Approved by: https://github.com/d4l3k	2024-05-22 18:24:11 +00:00
Kostas Tsiampouris	2863c76b1f	[torch-distributed] Make log directory creation idempotent (#126496 ) Summary: https://docs.python.org/3/library/os.html#os.makedirs > If exist_ok is False (the default), a FileExistsError is raised if the target directory already exists. Test Plan: Existing tests Differential Revision: D57471577 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126496 Approved by: https://github.com/d4l3k	2024-05-18 00:17:13 +00:00
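The behaviour quoted from the `os.makedirs` documentation is easy to demonstrate. A minimal sketch, assuming a hypothetical helper `ensure_log_dir` (not the function changed in the PR):

```python
import os
import tempfile


def ensure_log_dir(path: str) -> str:
    """Create path (and any missing parents); calling it again is a no-op."""
    os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "logs", "attempt_0")
    ensure_log_dir(target)
    ensure_log_dir(target)  # no FileExistsError on the second call
    created = os.path.isdir(target)
```

Without `exist_ok=True`, the second call would raise `FileExistsError`, which is the failure mode the commit removes.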
albanD	af9acc4168	Fix public binding to actually traverse modules (#126103 ) The current call passes in `['/actual/path']` to `os.walk`, which is a string pointing to no path and thus silently leads to an empty traversal. There is an unused function just above that handles that, so I guess this is what was supposed to be called. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126103 Approved by: https://github.com/suo	2024-05-15 19:36:03 +00:00
Kiuk Chung	92eb1731d4	[torch/distributed] Bugfix: wait for all child procs to exit before c… (#125969 ) Observed Problem --------------------- When `torchrun` has finished running the main trainer function (aka entrypoint/user function) successfully, I noticed that sometimes it SIGTERMS the child processes. Then `torchrun` exits successfully. This results in misleading warning log messages towards the end of the job like the one below: ``` W0510 14:52:48.185934 672413 api.py:513] Closing process 675171 via signal SIGTERM W0510 14:52:48.185984 672413 api.py:513] Closing process 675172 via signal SIGTERM W0510 14:52:48.186013 672413 api.py:513] Closing process 675174 via signal SIGTERM # <---- ^^^ ??? everything runs successfully but child still SIGTERM'ed? ^^^ ---> I0510 14:52:48.229119 672413 api.py:877] [main] worker group successfully finished. Waiting 300 seconds for other agents to finish. I0510 14:52:48.229161 672413 api.py:922] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish I0510 14:52:48.229395 672413 api.py:936] Done waiting for other agents. Elapsed: 0.0001709461212158203 seconds I0510 14:52:48.257544 672413 dynamic_rendezvous.py:1131] The node 'localhost_672413_0' has closed the rendezvous 'torchrun_qpfd'. I0510 14:52:48.568198 672413 distributed.py:200] Deleting temp log directory: /tmp/torchrun_udgp8zoq I0510 14:52:48.568989 672413 distributed.py:202] Finished running `main` ``` Root Cause ------------------ I noticed that this was due to the incorrect usage of `torch.multiprocessing.ProcessContext.join()` in `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext`. `torch.multiprocessing.ProcessContext.join()` does not actually wait for ALL child procs to exit, but rather waits for at-least-one child proc to exit. If only a subset of the child procs have exited, it returns `False` and if all child procs have exited it returns `True`. 
`torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` was assuming that `torch.multiprocessing.ProcessContext.join()` blocks indefinitely until all child procs have exited. Fix --------- The fix is simple, just loop, while continuing to call `pc.join()` until it returns `True` > NOTE: that the indefinite blocking is NOT an issue since by the time `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` calls `pc.join()` it already did all the checking to validate that the entrypoint functions either return successfully or that one of them has failed. So we are really just waiting for the unix process to exit after running the entrypoint function. > NOTE: since `pc.join()` already blocks until at-least-one child proc exits, there is no need to add a polling interval in the body of the loop and the debug logging will show at most `nproc_per_node` times so no log spamming is observed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125969 Approved by: https://github.com/d4l3k	2024-05-15 00:13:08 +00:00
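The fix described above amounts to a small loop. The sketch below uses a stand-in class, `FakeProcessContext`, that only mimics the `join()` contract as the commit message describes it (False while some children are still running, True once all have exited); both names are hypothetical and this is not the torchelastic code:

```python
class FakeProcessContext:
    """Stand-in mimicking the join() contract described above (hypothetical)."""

    def __init__(self, nprocs: int) -> None:
        self.remaining = nprocs

    def join(self) -> bool:
        # Each call "reaps" one exited child; True only once all have exited.
        if self.remaining > 0:
            self.remaining -= 1
        return self.remaining == 0


def wait_for_all(pc) -> int:
    """Call pc.join() until it reports that every child has exited.

    Returns the number of join() calls made, which is at most the
    number of child processes in this model.
    """
    calls = 1
    while not pc.join():
        calls += 1
    return calls
```

Because each `join()` call blocks until at least one child exits, the loop needs no sleep or polling interval.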
Sathyanarayanan Saravanamuthu	0bde9c08ef	Prevent rendezvous shutdown on worker restarts (#124819 ) Fixes #123678 #### Summary When the rank leaves and joins back, the workers are restarted and while restarting the rendezvous is shut down. This change prevents rendezvous shutdown during worker restarts. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124819 Approved by: https://github.com/malfet, https://github.com/kurman, https://github.com/eqy	2024-05-09 02:40:31 +00:00
Michael Suo	21aaac47e7	[torchelastic] add timing events to different stages of rendezvous (#125636 ) Summary: as title Test Plan: unit tests. Launched a test job and observed scuba results: {F1506543300} Differential Revision: D57018103 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125636 Approved by: https://github.com/d4l3k	2024-05-08 01:14:23 +00:00
Aaron Gokaslan	1dd42e42c4	[BE]: Try TCH autofixes on torch/ (#125536 ) Tries TCH autofixes and see what breaks Pull Request resolved: https://github.com/pytorch/pytorch/pull/125536 Approved by: https://github.com/ezyang	2024-05-05 23:13:59 +00:00
Gagan Jain	02e7800b3f	[Torch][Timer] Skip expired timer logging for empty expired timers (#125039 ) Summary: same as title Test Plan: unit tests Differential Revision: D56636566 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125039 Approved by: https://github.com/kurman	2024-04-30 18:28:49 +00:00
Tristan Rice	dc4c75ba72	elastic/rendezvous: make barrier and rank assignment operations O(n) instead of O(n^2) (#124982 ) Summary: This makes barrier and rank operations linear instead of quadratic in the number of workers. This drastically improves performance for rendezvous when running with over 1000 hosts. This uses 2 approaches for different areas: * local rank assignment: each worker does 1 set and 1 get, local ranks are assigned on the rank 0 host in an O(n) operation, which reduces total store operations to be linear in the number of workers. * exit_barrier: use a counter and a final flag so each worker has to do max 1 set, 1 get and 1 add. At 4000 hosts we see torchelastic able to run in as little as 10 seconds, down from 373 seconds. Test Plan: This was tested using many small tests running on a remote cluster. {D56549942} ``` torchx run --scheduler mast -- --image=torchelastic_benchmark --j=4000x1 ``` Differential Revision: D56605193 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124982 Approved by: https://github.com/kiukchung, https://github.com/kurman	2024-04-27 02:21:44 +00:00
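The counter-plus-flag idea behind the exit barrier can be sketched with an in-memory store and threads. Everything here is illustrative: `InMemoryStore`, the key names, and `counter_barrier` are hypothetical, and the real implementation runs against a distributed store rather than a local dictionary.

```python
import threading


class InMemoryStore:
    """Tiny thread-safe stand-in for a key-value store (hypothetical)."""

    def __init__(self) -> None:
        self._data = {}
        self._cond = threading.Condition()
        self.ops = 0  # total store operations, counted for illustration

    def add(self, key: str, amount: int) -> int:
        with self._cond:
            self.ops += 1
            self._data[key] = self._data.get(key, 0) + amount
            return self._data[key]

    def set(self, key: str, value: str) -> None:
        with self._cond:
            self.ops += 1
            self._data[key] = value
            self._cond.notify_all()

    def get(self, key: str) -> str:
        # Blocks until the key exists.
        with self._cond:
            self.ops += 1
            self._cond.wait_for(lambda: key in self._data)
            return self._data[key]


def counter_barrier(store: InMemoryStore, world_size: int) -> None:
    """Each worker does one add, at most one set, and one get."""
    if store.add("barrier/count", 1) == world_size:
        store.set("barrier/done", "1")  # last arrival releases everyone
    store.get("barrier/done")


def run(world_size: int) -> int:
    """Run the barrier with world_size threads and return total store ops."""
    store = InMemoryStore()
    threads = [
        threading.Thread(target=counter_barrier, args=(store, world_size))
        for _ in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.ops
```

In this model the total number of store operations is `2 * world_size + 1`, which grows linearly with the number of workers.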
Aaron Orenstein	a8574a9719	Fix global flake8 issues (#124771 ) Prior to this `lintrunner --all-files --take FLAKE8` failed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771 Approved by: https://github.com/Skylion007 ghstack dependencies: #124428	2024-04-26 15:35:53 +00:00
PyTorch MergeBot	1ac60484c1	Revert "Fix global flake8 issues (#124771 )" This reverts commit `f01275934b`. Reverted https://github.com/pytorch/pytorch/pull/124771 on behalf of https://github.com/jeanschmidt due to Unfortunately, I needed to revert #123735 and this one depends on it. So please check if there are no merge conflicts or breakages and feel free to merge this PR again ([comment](https://github.com/pytorch/pytorch/pull/124428#issuecomment-2078699836))	2024-04-26 06:15:17 +00:00
Aaron Orenstein	f01275934b	Fix global flake8 issues (#124771 ) Prior to this `lintrunner --all-files --take FLAKE8` failed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771 Approved by: https://github.com/Skylion007 ghstack dependencies: #124428	2024-04-25 14:25:00 +00:00
Gagan Jain	c5e567c573	[Torch][Timer] Adding debug info logging interface for expired timers (#123883 ) Summary: Adding function to log additional debug information before killing the expired watchdog timers. Additional information like stack trace can be added in the debug function using worker process IDs from expired timers. Test Plan: buck test mode/opt caffe2/test/distributed/elastic/timer:file_based_timer_test Differential Revision: D56044153 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123883 Approved by: https://github.com/kurman	2024-04-25 01:15:52 +00:00
Tristan Rice	952a00eda7	torchelastic: change monitor_interval default to 0.1 (#124692 ) This reduces the default monitor_interval for torchelastic to 0.1s as testing shows negligible load for common use cases. Even at the extremes, 100k processes is only 45.4% cpu util of a single core. Torchelastic monitor_interval only monitors the processes on a single worker so under typical loads even for huge jobs we expect ~8 subprocesses per machine with one per GPU. As an external datapoint, Python's wait polls every 50usec-50ms (https://github.com/python/cpython/blob/main/Lib/subprocess.py#L2035). ## Motivation This setting is used to control how frequently we poll for failed processes in elastic. * For some jobs of note we run elastic 3 times per try so with the default timeout of 5 seconds we should save ~15 seconds per retry. * @kiukchung's use case: Apparently this is annoying in notebooks etc since it adds delay to shutdown when testing things ## Results This is measured in cores (100% is a single core under full load). \| monitor_interval (s) \| nproc-per-node \| CPU util (highest observed) \| \| -------------------- \| -------------- \| --------------------------- \| \| 1.0 \| 10 \| 0.2% \| \| 0.1 \| 1 \| 0.4% \| \| 0.1 \| 10 \| 0.4% \| \| 0.01 \| 10 \| 0.9% \| \| 0.001 \| 10 \| 4.0% \| \| 0.1 \| 100 \| 0.5% \| \| 0.1 \| 1000 \| 2.2% \| \| 0.1 \| 10000 \| 15.7% \| \| 0.1 \| 100000 \| 45.4% \| ## Methodology ```sh # run command $ LOGLEVEL=INFO torchrun --nnodes 1 --nproc-per-node 10 --monitor-interval 0.1 ~/wait.py # wait a few seconds for all processes to start and reach steady state and then run, wait ~30s or 3 prints and take the highest $ top -b -d 10 -c \| rg 'torchrun.wait ``` wait.py ```py import time time.sleep(1060) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124692 Approved by: https://github.com/kiukchung, https://github.com/kurman	2024-04-24 01:44:41 +00:00
Kurman Karabukaev	1c4ad87396	[TorchElastic] Option to enable TCPStore libuv backed (#124684 ) Summary: The libuv backend isn't enabled in PTD by default now. Add an option to enable the libuv backend to improve scaling of the rendezvous process. Tries not to make assumptions about the default libuv settings in TCPStore since they may change in the next release. Test Plan: CI Differential Revision: D56435815 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124684 Approved by: https://github.com/d4l3k, https://github.com/XilunWu	2024-04-23 23:12:17 +00:00
Aaron Gokaslan	c5fafe9f48	[BE]: TRY002 - Ban raising vanilla exceptions (#124570 ) Adds a ruff lint rule to ban raising raw exceptions. Most of these should at the very least be runtime errors, value errors, type errors or some other errors. There are hundreds of instances of these bad exception types already in the codebase, so I have noqa'd most of them. Hopefully this error code will get committers to rethink what exception type they should raise when they submit a PR. I also encourage people to gradually go and fix all the existing noqas that have been added so they can be removed over time and our exception typing can be improved. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124570 Approved by: https://github.com/ezyang	2024-04-21 22:26:40 +00:00
Xuehai Pan	93e249969b	[BE] enable `ruff` rule `RSE` and remove useless parentheses in `raise` statements (#124261 ) Remove useless parentheses in `raise` statements if the exception type is raised with no argument. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261 Approved by: https://github.com/albanD	2024-04-17 19:29:34 +00:00
Gagan Jain	016ca546aa	Adding health check server hook in torch elastic (#122750 ) (#123504 ) Summary: Building a hook for an external mechanism to monitor the health of the torch elastic launcher. The health check server depends on FileTimerServer to check whether the launcher is healthy. It will always be healthy if FileTimerServer is disabled. Implementation of start_healthcheck_server is unsupported; however, a tcp/http server can be started on a specific port to monitor the aliveness of worker_watchdog and take action accordingly. Test Plan: buck test mode/opt caffe2/test/distributed/elastic/agent/server/test:local_agent_test Differential Revision: D55837899 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123504 Approved by: https://github.com/kurman	2024-04-11 19:10:56 +00:00
PyTorch MergeBot	ecb2418dd6	Revert "Adding health check server hook in torch elastic (#122750 )" This reverts commit `61d431fab0`. Reverted https://github.com/pytorch/pytorch/pull/122750 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/122750#issuecomment-2041104931))	2024-04-06 14:31:07 +00:00
Gagan Jain	61d431fab0	Adding health check server hook in torch elastic (#122750 ) Summary: Building a hook for an external mechanism to monitor the health of the torch elastic launcher. The health check server depends on FileTimerServer to check whether the launcher is healthy. It will always be healthy if FileTimerServer is disabled. Implementation of start_healthcheck_server is unsupported; however, a tcp/http server can be started on a specific port to monitor the aliveness of worker_watchdog and take action accordingly. Test Plan: buck test mode/opt caffe2/test/distributed/elastic/agent/server/test:local_agent_test Differential Revision: D55108182 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122750 Approved by: https://github.com/kurman	2024-04-05 23:17:30 +00:00
Chirag Pandya	b6201a60c5	[BE] minor logging cleanup in distributed (#122921 ) Summary: Minor logging cleanup in distributed library 1. Don't use "f" formatted strings - address linter issues. 2. Nits: Make use of unused `e` (error) in a few logs. 3. Change info->debug as asked in issue #113545 4. Nit: rename log -> logger in a few files for consistency 5. Fix a linter error. Test Plan: 1. Local build passes. 2. Linter is happy. Reviewers: wanchaol Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921 Approved by: https://github.com/wanchaol	2024-03-29 03:34:01 +00:00
Gagan Jain	9347a79f1c	[Watchdog Timer] Clear timer for already terminated process (#122324 ) Summary: Handles cases where a worker process is terminated without releasing its timer request; this scenario causes reaping of the process at expiry. Removes the non-existent process during clear timer. Test Plan: unit tests Differential Revision: D55099773 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122324 Approved by: https://github.com/d4l3k	2024-03-22 15:48:03 +00:00
Cheng Ni	9bff1599b6	[Torch Elastic][Draft] Refactor SubprocessHandler to separate module for easier subclass (#120373 ) Summary: ## No Functional Change - Refactor Subprocess Handler into a separate folder for easier subclassing - SubprocessHandler - added `local_rank_id` in `SubprocessHandler` to make it available as a field in the class - pass in `local_rank_id` from subprocess start Test Plan: No functional changes. Differential Revision: D54038627 #suppress-api-compatibility-check Pull Request resolved: https://github.com/pytorch/pytorch/pull/120373 Approved by: https://github.com/kurman	2024-03-08 01:37:34 +00:00
Kurman Karabukaev	360761f7d0	[Torchelasic] Create root log directory by default (#121257 ) Summary: After the refactoring in https://github.com/pytorch/pytorch/pull/120691, the default behavior was unintentionally changed from creating a tempdir for logging to not capturing any logs by the torch Elastic Agent. Reverting the behavior to: - making a tempdir when the log dir is not specified - allowing a non-empty root log dir - Note: in case the attempt folder exists, it will be pruned here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/multiprocessing/api.py#L294 Differential Revision: D54531851 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121257 Approved by: https://github.com/d4l3k	2024-03-06 18:50:38 +00:00
Kurman Karabukaev	b0cfa96e82	[Torchelastic][Logging] Pluggable logsspecs using python entrypoints and option to specify one by name. (#120942 ) Summary: Expose an option for users to specify the name of the LogsSpec implementation to use. - Has to be defined in entrypoints under the `torchrun.logs_specs` group. - Must implement LogsSpec defined in the prior PR/diff. Test Plan: unit test+local tests Reviewed By: ezyang Differential Revision: D54180838 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120942 Approved by: https://github.com/ezyang	2024-03-02 08:07:52 +00:00
Kurman Karabukaev	67d3e4f2a2	[TorchElastic] Refactoring to support non-default logging strategy (#120691 ) Summary: Pulling out logging parameters into a logging spec that can be overridden (follow-up changes on a possible mechanism) Why? Right now the logging approach is quite rigid: - Requires the log directory to exist and not be empty - Will create a tempdir otherwise, - Creates a subdir for a run - creates a subdir for each attempt - creates files named stdout.log, stderr.log, error.json In some instances users would like to customize the behavior, including file names based on context. We do currently have a mechanism to template the multiplexed teed output prefix. With the current changes, users can create a custom log spec that can use env variables to change the behavior. Notes: Made `LaunchConf.logs_specs` an optional field that will be bound to a `DefaultLogsSpecs` instance. There are a large number of clients (code) that use the API directly without using the torchrun API. For those cases, we have to explicitly pass a LogSpecs implementation if we would like to override the implementation. For regular torchrun users, we can use the pluggable approach proposed in the follow-up change. Test Plan: CI + unit tests Differential Revision: D54176265 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120691 Approved by: https://github.com/ezyang	2024-02-29 20:59:17 +00:00

218 Commits