pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	adcce538b7	Revert "Allow mp.start_processes to create processes in parallel (#133707 )" This reverts commit `3546628a2a`. Reverted https://github.com/pytorch/pytorch/pull/133707 on behalf of https://github.com/ZainRizvi due to sorry but trunk has been consistently broken since this PR was merged. See: [GH job link](https://github.com/pytorch/pytorch/actions/runs/10529617600/job/29191757055) [HUD commit link](`3546628a2a`) ([comment](https://github.com/pytorch/pytorch/pull/133707#issuecomment-2310709523))	2024-08-26 17:31:10 +00:00
Jia Li	3546628a2a	Allow mp.start_processes to create processes in parallel (#133707 ) Summary: Background discussion in https://fb.workplace.com/groups/319878845696681/posts/1226087421742481 and pytorch issue filed https://github.com/pytorch/pytorch/issues/133010 one way to fix this problem is to add an option to parallel start processes on pytorch side. Test Plan: Tested aps run in problem and things are in parallel now (next diff) Differential Revision: D61301603 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133707 Approved by: https://github.com/d4l3k, https://github.com/ezyang	2024-08-23 17:11:20 +00:00
Kiuk Chung	92eb1731d4	[torch/distributed] Bugfix: wait for all child procs to exit before c… (#125969 ) Observed Problem --------------------- When `torchrun` has finished running the main trainer function (aka entrypoint/user function) successfully, I noticed that sometimes it SIGTERMS the child processes. Then `torchrun` exits successfully. This results in misleading warning log messages towards the end of the job like the one below: ``` W0510 14:52:48.185934 672413 api.py:513] Closing process 675171 via signal SIGTERM W0510 14:52:48.185984 672413 api.py:513] Closing process 675172 via signal SIGTERM W0510 14:52:48.186013 672413 api.py:513] Closing process 675174 via signal SIGTERM # <---- ^^^ ??? everything runs successfully but child still SIGTERM'ed? ^^^ ---> I0510 14:52:48.229119 672413 api.py:877] [main] worker group successfully finished. Waiting 300 seconds for other agents to finish. I0510 14:52:48.229161 672413 api.py:922] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish I0510 14:52:48.229395 672413 api.py:936] Done waiting for other agents. Elapsed: 0.0001709461212158203 seconds I0510 14:52:48.257544 672413 dynamic_rendezvous.py:1131] The node 'localhost_672413_0' has closed the rendezvous 'torchrun_qpfd'. I0510 14:52:48.568198 672413 distributed.py:200] Deleting temp log directory: /tmp/torchrun_udgp8zoq I0510 14:52:48.568989 672413 distributed.py:202] Finished running `main` ``` Root Cause ------------------ I noticed that this was due to the incorrect usage of `torch.multiprocessing.ProcessContext.join()` in `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext`. `torch.multiprocessing.ProcessContext.join()` does not actually wait for ALL child procs to exit, but rather waits for at-least-one child proc to exit. If only a subset of the child procs have exited, it returns `False` and if all child procs have exited it returns `True`. `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` was assuming that `torch.multiprocessing.ProcessContext.join()` blocks indefinitely until all child procs have exited. Fix --------- The fix is simple, just loop, while continuing to call `pc.join()` until it returns `True` > NOTE: that the indefinite blocking is NOT an issue since by the time `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` calls `pc.join()` it already did all the checking to validate that the entrypoint functions either return successfully or that one of them has failed. So we are really just waiting for the unix process to exit after running the entrypoint function. > NOTE: since `pc.join()` already blocks until at-least-one child proc exits, there is no need to add a polling interval in the body of the loop and the debug logging will show at most `nproc_per_node` times so no log spamming is observed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125969 Approved by: https://github.com/d4l3k	2024-05-15 00:13:08 +00:00
Yuanhao Ji	e3effa5855	Enable UFMT on all of `test/distributed` (#123539 ) Partially addresses #123062 Ran lintrunner on: - `test/distributed` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123539 Approved by: https://github.com/ezyang	2024-04-17 06:46:02 +00:00
PyTorch MergeBot	52be63eb2c	Revert "Enable UFMT on all of `test/distributed` (#123539 )" This reverts commit `89ac37fe91`. Reverted https://github.com/pytorch/pytorch/pull/123539 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123539#issuecomment-2058329471))	2024-04-16 06:33:21 +00:00
Yuanhao Ji	89ac37fe91	Enable UFMT on all of `test/distributed` (#123539 ) Partially addresses #123062 Ran lintrunner on: - `test/distributed` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123539 Approved by: https://github.com/ezyang	2024-04-16 03:23:56 +00:00
Kurman Karabukaev	360761f7d0	[Torchelasic] Create root log directory by default (#121257 ) Summary: After refactoring in https://github.com/pytorch/pytorch/pull/120691, default behavior unintentionally was changes from creating tempdir for logging to not capturing any logs by torch Elastic Agent. Reverting the behavior to: - making tempdir when log dir is not specified - allowing non-empty root log dir - Note: in case attempt folder exists, it will be pruned here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/multiprocessing/api.py#L294 Differential Revision: D54531851 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121257 Approved by: https://github.com/d4l3k	2024-03-06 18:50:38 +00:00
Kurman Karabukaev	67d3e4f2a2	[TorchElastic] Refactoring to support non-default logging strategy (#120691 ) Summary: Pulling out logging parameters into a logging specs that can be overridden (follow-up changes on possible mechanism) Why? Right now the logging approach is quite rigid: - Requires for log directory to exist and not be empty - Will create tempdir otherwise, - Creates subdir for a run - creates subdir for each attempt - creates files named as stdout.log, stderr.log, error.json In some instances some of the users would like to customize the behavior including file names based on context. And we do have right now a mechanism to template multiplexed teed output prefix. With current changes, users can create custom log spec that can use env variables to change the behavior. Notes: Made `LaunchConf.logs_specs` as an optional field that will be bound to `DefaultLogsSpecs` instance. There are large number of clients (code) that use the API directly without using torchrun API. For those cases, we have to explicitly pass LogSpecs implementation if we would like to override the implementation. For the regular torchrun users, we can use pluggable approach proposed in the follow up change. Test Plan: CI + unit tests Differential Revision: D54176265 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120691 Approved by: https://github.com/ezyang	2024-02-29 20:59:17 +00:00
Alexander Grund	99cb807e25	Skip test_wrap_bad if run under pytest (#115070 ) Pytest replaces sys.stdout/stderr by `TextIOWrapper` instances which do not support `fileno()` Hence skip that test in this case Fixes #115069 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115070 Approved by: https://github.com/clee2000	2024-02-15 00:10:05 +00:00
Kurman Karabukaev	bae8506589	[TorchElastic] Add option to configure log prefix for each rank (#112357 ) Summary: Add an ability to customize log lines and addtional template like behavior to enrich log information. Motivation: a) Log stream processing/aggregation gains additional value when it includes information about the global rank. Extension to that is that it will be easier to map ranks to hosts from log stream information (less relevant at the moment) b) Users can easily map the failure to the right rank without matching node rank offset+local rank. Implementation - BC change - keeps the logs line prefix as `[<role name><local rank>]:` - Optional env variable TORCHELASTIC_LOG_LINE_HEADER that will be used as a prefix when specified and currently exposes `role_name`, `rank` and `local_rank` variables that will be bound when agent assigns the ranks. Test Plan: CI https://fburl.com/mlhub/mzx5xspv Differential Revision: D50584590 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112357 Approved by: https://github.com/kiukchung	2023-11-08 01:00:26 +00:00
Justin Chu	232b96b6e2	[BE] Enable ruff's UP rules and autoformat distributed/ (#105433 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105433 Approved by: https://github.com/albanD	2023-07-19 14:27:11 +00:00
Sergii Dymchenko	35bf5bac26	Fix "sandcastle_skip_if decorator name is confusing" (#95649 ) Fixes https://github.com/pytorch/pytorch/issues/89473 See the issue https://github.com/pytorch/pytorch/issues/89473 Pull Request resolved: https://github.com/pytorch/pytorch/pull/95649 Approved by: https://github.com/atalman, https://github.com/malfet	2023-03-03 09:29:40 +00:00
Michael Suo	c978b609f7	[ci] remove IN_CI env var The conventional env var to set is CI. Both circle and GHA set it, so IN_CI is unnecessary Pull Request resolved: https://github.com/pytorch/pytorch/pull/79229 Approved by: https://github.com/janeyx99	2022-06-11 17:16:30 +00:00
Kiuk Chung	b08309ee0a	(torch/elastic) skip logging structured error info if error_file is not set (#73477 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73477 resolves https://github.com/pytorch/pytorch/issues/73465 This `log.error` is not necessary (and its also not human-friendly formatted) because we end up re-raising the same exception after recording the exception into an error_file (if present). Eventually python should handle this error the way it handles any other errors and will write the trace info into the console. This additional logging produces duplicate error console prints, which affects all users whose schedulers do not set `TORCHELASTIC_ERROR_FILE` env var when calling `torch.distributed.run`. Test Plan: Induce an error on the agent process by `kill -15 $AGENT_PID` ``` python -m torch.distributed.run \ --nproc_per_node 2 \ --nnodes 1:1 \ --rdzv_backend c10d \ --rdzv_endpoint localhost:29500 \ --monitor_interval 3 test.py ``` Produces {F704936697} In contrast to the duplicated error before: {F704936729} Reviewed By: d4l3k Differential Revision: D34501852 fbshipit-source-id: 14fed18a9664130980205007ff104ff15a5fd4f8 (cherry picked from commit 0b7c51ba8834f4a4a5376f585c0795cb43be6521)	2022-03-01 19:31:44 +00:00
Jane Xu	36d9a74bc6	Enforce that test cases extend from correct TestCase (#67819 ) Summary: Addresses https://github.com/pytorch/pytorch/issues/66903 Main code is in torch/testing/_internal/common_utils.py and everything else is fixing the lint Pull Request resolved: https://github.com/pytorch/pytorch/pull/67819 Reviewed By: H-Huang Differential Revision: D32259978 Pulled By: janeyx99 fbshipit-source-id: 39c5ffbaa510e1e533d6bdacf5c6158a3dd9885d	2021-11-08 18:28:36 -08:00
Jane Xu	eb8b80b76f	Add test owners for elastic tests (#67293 ) Summary: Action following discussion with distributed and r2p team--the tests under elastic in distributed should be owned by oncall: r2p and not distributed. cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/67293 Reviewed By: jbschlosser Differential Revision: D31973779 Pulled By: janeyx99 fbshipit-source-id: 05875a7600c6eb1da1310a48e1e32a1a69461c55	2021-10-28 08:32:50 -07:00
Kiuk Chung	df11e2d6f9	(torch/elastic) add fqdn hostname to error printout (#66182 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66182 closes https://github.com/pytorch/pytorch/issues/63174 Does a few things: 1. adds hostname to the error report 2. moves the "root cause" section to the end (presumably since the logs are being "tailed" we want the root cause to appear at the end) 3. moves redundant error info logging to debug 4. makes the border max 60 char in length and justifies left for the header NOTE: YOU HAVE TO annotate your main function with torch.distributed.elastic.multiprocessing.errors.record, otherwise no traceback is printed (this is because python exception propagation does NOT work out of the both for IPC - hence the extra record annotation). Test Plan: Sample ``` ============================================================ run_script_path FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2021-10-05_17:37:22 host : devvm4955.prn0.facebook.com rank : 0 (local_rank: 0) exitcode : 1 (pid: 3296201) error_file: /home/kiuk/tmp/elastic/none_3_lsytqe/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/tmp/jetter.xr3_x6qq/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 372, in wrapper return f(args, *kwargs) File "main.py", line 28, in main raise RuntimeError(args.throws) RuntimeError: foobar ============================================================ ``` Reviewed By: cbalioglu, aivanou Differential Revision: D31416492 fbshipit-source-id: 0aeaf6e634e23ce0ea7f6a03b12c8a9ac57246e9	2021-10-07 01:40:02 -07:00
Pritam Damania	2d671ca41b	[8/N] Remove c10d/ddp fork tests. (#63454 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63454 Continuation of https://github.com/pytorch/pytorch/pull/63443, this PR removes all fork tests from torch.distributed. ghstack-source-id: 136285511 Test Plan: waitforbuildbot Reviewed By: SciPioneer Differential Revision: D30387872 fbshipit-source-id: f6d6313db126ae7b95b86f78a1e0726887c5c513	2021-08-20 12:23:18 -07:00
Pritam Damania	d565a7bd68	[6/N] Enable opt-asan for elastic and launcher tests. (#63442 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63442 Continuation of https://github.com/pytorch/pytorch/pull/62051, I've enabled elastic and launcher tests to run in opt-asan mode which is supported with spawn multiprocessing. This allows us to completely get rid of fork based tests from torch.distributed and have all tests run in spawn mode. ghstack-source-id: 136057123 Test Plan: waitforbuildbot Reviewed By: cbalioglu Differential Revision: D30384267 fbshipit-source-id: ad3447cfb9d6e31e7ec8332d64c8ff1054858dcb	2021-08-18 10:48:49 -07:00
Pritam Damania	82d81455ae	[2/N] Remove unittest.skip across all of torch.distributed. (#61887 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887 1) Introduced a `sandcastle_skip_if` decorator that ensures these tests just get passed on sandcastle. 2) Fixed all test files under `test/distributed` to not use `unittest.skip` Overall goal is to avoid using skips since sandcastle tags these tests as continuously skipping. ghstack-source-id: 134382237 Test Plan: waitforbuildbot Reviewed By: SciPioneer Differential Revision: D29784152 fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d	2021-07-27 10:53:23 -07:00
Aliaksandr Ivanou	0c55f1bdec	[torchelastic] Improve process termination logic (#61602 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61602 The diff introduces signal handlers and SignalException that is raised when the agent process receives SIGTERM or SIGINT. When any of these signals received, the termination handler will raise the `SignalException`. The exception will then be processed by the main agent loop. The `shutdown(signum)` will be invoked, that would propagate the received signal to the child processes. The default 30 seconds timeout introduced: if child processes will not be able gracefully terminate during this timeout, the agent process would kill the processes via SIGKILL. Test Plan: unittests, sandcastle Reviewed By: cbalioglu Differential Revision: D29671783 fbshipit-source-id: 3dbca2125676dc18d417cc3e3bb0301fdd42737a	2021-07-23 11:00:15 -07:00
Aliaksandr Ivanou	8f03018980	[pytorch] Move signal handler test to internal codebase (#60394 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60394 Move signal handler test to internal codebase Github issue: https://github.com/pytorch/pytorch/issues/60260 Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing:api_test buck test mode/dev-nosan //caffe2/torch/distributed/elastic/multiprocessing/fb/test:api_test Reviewed By: cbalioglu Differential Revision: D29273160 fbshipit-source-id: e4ae72f7f6d54cbba324119fce7446a30a6c37c9	2021-06-21 18:26:41 -07:00
Rong Rong (AI Infra)	510334f34b	[BE] clean up IS_PYTORCH_CI and IN_CI (#60279 ) Summary: `IS_PYTORCH_CI` and `IN_CI` are used randomly, however in some cases IN_CI is not currently set because it only exist in .circleci/scripts/setup_ci_environment.sh. This cleans up the 2 flags and only use IN_CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/60279 Test Plan: CI Reviewed By: seemethere Differential Revision: D29239545 Pulled By: walterddr fbshipit-source-id: a069424a2bb8790a3adfdaf0dc460301026bf8c7	2021-06-20 19:45:07 -07:00
Aliaksandr Ivanou	7fe4c1d0e7	Torchelastic: add multiprocessing tests to ci/cd (#56842 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56842 Add elastic multiprocessing test to ci/cd Test Plan: buck test mode/opt-tsan //caffe2/test/distributed/elastic/multiprocessing/... -- --run-disabled Reviewed By: wilson100hong Differential Revision: D27982226 fbshipit-source-id: 1b4e6f1a20867a6aa7ca409e280fdb04e8db198b	2021-05-02 14:03:47 -07:00
Aliaksandr Ivanou	0a72904ab4	Torchelastic: make process failure init error non-fatal (#56739 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56739 The diff makes several tiny changes: * Add logs for each worker error file destination * Make sure log_dir is propagated from the launcher * Make ProcessFailure initialization error non-fatal. Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test https://fburl.com/tupperware/0nizb9z8 Reviewed By: borovsky-d, wilson100hong Differential Revision: D27952596 fbshipit-source-id: 69582bf4be47758def4008f2abf82d123294cd1a	2021-04-23 00:49:47 -07:00
Aliaksandr Ivanou	f5675f8306	[torchelastic] Make sure torchelastic mp wait for queue to be drained before finishing the process (#55412 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55412 The diff resolves bug where worker processes could exit before torchelastic process would read the return values. This is a rare event, but still can happen, e.g. https://fb.workplace.com/groups/319878845696681/permalink/512409069776990/ When users want to return torch.Tensor object from worker process, the torchelastic multiprocessing will fail. Currently worker process finishes its job after it writes output to the IPC queue without receiver process confirmation. When this happens, the underlying channel between worker and torchelastic process could be closed (in case of mp.SimpleQueue it is file descriptors, that is why we see FileNotFoundException: since worker process finished execution, the file descriptor just got deleted, and torchelastic process cannot find it). Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test User workflow: f263531643 Reviewed By: cbalioglu Differential Revision: D27602838 fbshipit-source-id: 29871178232e3af4ad3dec406c234aba9c5faba1	2021-04-07 09:39:24 -07:00
Brian Hirsh	ae3a876c9c	Revert D27572158: [torchelastic] Make sure torchelastic mp wait for queue to be drained before finishing the process Test Plan: revert-hammer Differential Revision: D27572158 (`e9c6a51100`) Original commit changeset: 9a360468acc9 fbshipit-source-id: 29f7e2cba3e134bc81fb31b7e1dfceb7c1f9d734	2021-04-06 11:41:55 -07:00
Aliaksandr Ivanou	e9c6a51100	[torchelastic] Make sure torchelastic mp wait for queue to be drained before finishing the process Summary: The diff resolves bug where worker processes could exit before torchelastic process would read the return values. This is a rare event, but still can happen, e.g. https://fb.workplace.com/groups/319878845696681/permalink/512409069776990/ When users want to return torch.Tensor object from worker process, the torchelastic multiprocessing will fail. Currently worker process finishes its job after it writes output to the IPC queue without receiver process confirmation. When this happens, the underlying channel between worker and torchelastic process could be closed (in case of mp.SimpleQueue it is file descriptors, that is why we see FileNotFoundException: since worker process finished execution, the file descriptor just got deleted, and torchelastic process cannot find it). Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test User workflow: f263531643 Reviewed By: cbalioglu, wilson100hong Differential Revision: D27572158 fbshipit-source-id: 9a360468acc98d85d587ebf223e7e96d4b43fe4b	2021-04-06 11:03:00 -07:00
Kiuk Chung	b03c92a9c5	[2/n][torch/elastic][upstream] Move torchelastic/timer torchelastic/multiprocessing to torch/distributed/elastic (#53574 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53574 Upstreams `torchelastic/timer\|multiprocessing` to `torch/distributed/elastic/timer\|multiprocessing` Test Plan: ``` buck test mode/dev-nosan //caffe2/torch/distributed/elastic/... buck test mode/dev-nosan //caffe2/test/distributed/elastic/... buck test mode/dev-nosan //pytorch/elastic/torchelastic/... buck test mode/dev-nosan //hpc/... buck test mode/dev-nosan //caffe2/torch/fb/training_toolkit/... ``` Reviewed By: borovsky-d, wilson100hong Differential Revision: D26899809 fbshipit-source-id: e6dbc2a78282eac296c262b3206a979e3ef1ff53	2021-03-10 12:32:53 -08:00

29 Commits