Commit Graph

88 Commits

Author SHA1 Message Date
Xuehai Pan
995df34b19 [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547
Approved by: https://github.com/kwen2501
2025-02-28 07:35:56 +00:00
Aaron Orenstein
db4ce78d46 PEP585: More UP006 fixes (#146392)
This should be the final PR before we can enable RUFF UP006.
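A minimal illustration of the kind of rewrite UP006 performs (the function name here is hypothetical):

```python
# Before: from typing import Dict, List
#         def group_ranks(ranks: List[int]) -> Dict[int, str]: ...
# After PEP 585, the builtin containers act as generics directly:
def group_ranks(ranks: list[int]) -> dict[int, str]:
    return {r: f"rank{r}" for r in ranks}
```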

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146392
Approved by: https://github.com/justinchuby, https://github.com/albanD, https://github.com/Skylion007
2025-02-20 06:18:13 +00:00
Link Li
995f607c74 fix doc string (#146968)
Fixes a wrong function name in a docstring

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146968
Approved by: https://github.com/zackycao, https://github.com/H-Huang
2025-02-12 21:43:16 +00:00
Aaron Gokaslan
292af3cc89 [BE][Ez]: ISC001 Auto concatenate implicit one line strings (#146408)
Apply the ruff rule about implicit string concatenation; this autofixes strings that are all of the same type and on the same line. These lines were likely broken up by autoformatters in the past. All fixes are automated using the autofixes in ISC001.
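A hedged before/after sketch of what the ISC001 autofix does (string contents are made up):

```python
# Two same-type literals sitting on one line are implicitly concatenated ...
before = "failed to launch " "worker process"
# ... and the autofix joins them into a single literal:
after = "failed to launch worker process"
assert before == after
```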

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146408
Approved by: https://github.com/justinchuby, https://github.com/janeyx99
2025-02-04 19:07:04 +00:00
Aaron Orenstein
316808e4e9 PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145163
Approved by: https://github.com/Skylion007
2025-01-19 20:55:59 +00:00
bobrenjc93
08be9ec312 Migrate from Tuple -> tuple in torch/distributed (#144258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258
Approved by: https://github.com/aorenste
2025-01-10 08:34:54 +00:00
bobrenjc93
88ccf2fa5e remove allow-untyped-defs from distributed/elastic/multiprocessing/subprocess_handler/handlers.py (#143917)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143917
Approved by: https://github.com/Skylion007
2024-12-28 00:13:05 +00:00
bobrenjc93
fda9048ca8 remove allow-untyped-defs from distributed/elastic/multiprocessing/errors/handlers.py (#143869)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143869
Approved by: https://github.com/Skylion007
2024-12-27 15:49:19 +00:00
bobrenjc93
dd346dbeab remove allow-untyped-defs from torch/distributed/elastic/multiprocessing/errors/handlers.py (#143605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143605
Approved by: https://github.com/aorenste
2024-12-20 05:25:01 +00:00
Jane Xu
fd65bd755d [BE] replace incorrect .. note:: invocations (#142868)
Something I've noticed is that a lot of the distributed sites don't render on our docs at all, but if they ever do, the notes will render properly now 😛

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142868
Approved by: https://github.com/albanD
2024-12-11 19:58:18 +00:00
Tom Ritchford
c0582fd0f8 Remove unused Python variables in torch/[b-z]* (#136963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963
Approved by: https://github.com/ezyang
2024-10-19 16:45:22 +00:00
Yiwen Shi
3a9e33dca8 [torchelastic] Don't do signal handling when off the main thread (#135088)
Summary:
In multiprocessing, signal handling is not possible if the thread is not the main thread. This resulted in the following error:
> "ValueError('signal only works in main thread of the main interpreter')"

To address this issue, the diff checks whether the thread is the main thread and, if not, skips signal handling.
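A minimal sketch of such a guard, using a hypothetical helper name (not the actual torchelastic code):

```python
import signal
import threading


def _maybe_setup_signal_handlers(handler) -> None:
    # signal.signal() may only be called from the main thread of the main
    # interpreter; registering elsewhere raises the ValueError quoted above.
    if threading.current_thread() is threading.main_thread():
        signal.signal(signal.SIGTERM, handler)
        signal.signal(signal.SIGINT, handler)
    # otherwise skip registration entirely
```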

Test Plan:
Before this change, MAST job failed:
https://fburl.com/mlhub/iq2m10v8

With this change, MAST job succeeded:
https://fburl.com/mlhub/q6kb8343

Differential Revision: D62166943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135088
Approved by: https://github.com/d4l3k
2024-09-06 14:47:03 +00:00
Xuehai Pan
758a0a88a2 [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200)
This PR removes unnecessary `pass` statements. This is semantically safe because the bytecode for the Python code does not change.

Note that if there is a docstring in the function, an empty function does not need a `pass` statement as a placeholder.
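For illustration (hypothetical function names), a docstring already satisfies the requirement for a function body:

```python
def reset_counters():
    """Placeholder hook; the docstring alone is a valid function body."""
    # no `pass` needed here


def noop():
    pass  # still required: a function without a docstring needs some statement
```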

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133200
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/kit1980
2024-08-15 15:50:19 +00:00
Cheng Ni
27c9262d29 Fix stdout / stderr typing in SubprocessHandler (#132071)
Summary: Fix stdout / stderr typing in SubprocessHandler. Stdout and Stderr should be `Optional[str]` instead of `str`.

Test Plan: CI

Differential Revision: D60319648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132071
Approved by: https://github.com/Skylion007
2024-07-31 02:51:11 +00:00
Xuehai Pan
e6d4451ae8 [BE][Easy] enable UFMT for torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/ (#128866)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128866
Approved by: https://github.com/fegin
2024-06-18 13:51:53 +00:00
Aaron Orenstein
3a0d088517 Flip default value for mypy disallow_untyped_defs [5/11] (#127842)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127842
Approved by: https://github.com/oulgen
2024-06-08 18:49:18 +00:00
Kostas Tsiampouris
2863c76b1f [torch-distributed] Make log directory creation idempotent (#126496)
Summary:
https://docs.python.org/3/library/os.html#os.makedirs
> If exist_ok is False (the default), a FileExistsError is raised if the target directory already exists.
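A sketch of the idempotent form (the path is illustrative):

```python
import os

# exist_ok=True makes a repeated call a no-op instead of raising FileExistsError.
os.makedirs("/tmp/torchelastic/run_0/attempt_0", exist_ok=True)
os.makedirs("/tmp/torchelastic/run_0/attempt_0", exist_ok=True)  # safe second call
```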

Test Plan: Existing tests

Differential Revision: D57471577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126496
Approved by: https://github.com/d4l3k
2024-05-18 00:17:13 +00:00
albanD
af9acc4168 Fix public binding to actually traverse modules (#126103)
The current call passes in `['/actual/path']` to os.walk, which is a string pointing to no existing path and thus silently leads to an empty traversal.
There is an unused function just above that handles this, so I guess this is what was supposed to be called.
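For context, os.walk swallows the lookup error for a nonexistent root by default (onerror=None), which is why the bad argument went unnoticed:

```python
import os

# A path that does not exist (here, the stringified list) silently produces an
# empty traversal rather than an exception.
print(list(os.walk("['/actual/path']")))  # -> []
```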

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126103
Approved by: https://github.com/suo
2024-05-15 19:36:03 +00:00
Kiuk Chung
92eb1731d4 [torch/distributed] Bugfix: wait for all child procs to exit before c… (#125969)
Observed Problem
---------------------

When `torchrun` has finished running the main trainer function (aka entrypoint/user function) successfully, I noticed that sometimes it SIGTERMS the child processes. Then `torchrun` exits successfully.

This results in misleading warning log messages towards the end of the job like the one below:

```
W0510 14:52:48.185934  672413 api.py:513] Closing process 675171 via signal SIGTERM
W0510 14:52:48.185984  672413 api.py:513] Closing process 675172 via signal SIGTERM
W0510 14:52:48.186013  672413 api.py:513] Closing process 675174 via signal SIGTERM
# <---- ^^^ ??? everything runs successfully but child still SIGTERM'ed? ^^^ --->

I0510 14:52:48.229119  672413 api.py:877] [main] worker group successfully finished. Waiting 300 seconds for other agents to finish.
I0510 14:52:48.229161  672413 api.py:922] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish
I0510 14:52:48.229395  672413 api.py:936] Done waiting for other agents. Elapsed: 0.0001709461212158203 seconds
I0510 14:52:48.257544  672413 dynamic_rendezvous.py:1131] The node 'localhost_672413_0' has closed the rendezvous 'torchrun_qpfd'.
I0510 14:52:48.568198  672413 distributed.py:200] Deleting temp log directory: /tmp/torchrun_udgp8zoq
I0510 14:52:48.568989  672413 distributed.py:202] Finished running `main`
```

Root Cause
------------------

I noticed that this was due to the incorrect usage of `torch.multiprocessing.ProcessContext.join()` in `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext`.

`torch.multiprocessing.ProcessContext.join()` does not actually wait for ALL child procs to exit, but rather waits for **at-least-one** child proc to exit. If only a subset of the child procs have exited, it returns `False` and if all child procs have exited it returns `True`.

`torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` was assuming that `torch.multiprocessing.ProcessContext.join()` blocks indefinitely until all child procs have exited.

Fix
---------

The fix is simple: keep looping and calling `pc.join()` until it returns `True` (sketched below).

> **NOTE**: the indefinite blocking is NOT an issue, since by the time `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` calls `pc.join()` it has already done all the checking to validate that the entrypoint functions either returned successfully or that one of them failed. So we are really just waiting for the unix process to exit after running the entrypoint function.

> **NOTE**: since `pc.join()` already blocks until at least one child proc exits, there is no need to add a polling interval in the body of the loop, and the debug logging will show at most `nproc_per_node` times, so no log spamming is observed.
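A minimal sketch of the described loop, assuming `pc` is the `ProcessContext` (not the actual torch source):

```python
def wait_for_all_children(pc) -> None:
    # join() returns True only once every child has exited, and False when
    # merely at least one of them has, so keep calling it until it reports True.
    while not pc.join():
        pass  # join() already blocks for at least one exit; no sleep needed
```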

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125969
Approved by: https://github.com/d4l3k
2024-05-15 00:13:08 +00:00
Aaron Gokaslan
1dd42e42c4 [BE]: Try TCH autofixes on torch/ (#125536)
Tries TCH autofixes to see what breaks
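For reference, the TCH rules move typing-only imports behind `TYPE_CHECKING`; a generic sketch (the import target is illustrative):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Needed only by the type checker, so it carries no runtime import cost.
    from collections import OrderedDict


def describe(d: OrderedDict[str, int]) -> str:
    return ", ".join(d)
```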

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125536
Approved by: https://github.com/ezyang
2024-05-05 23:13:59 +00:00
Xuehai Pan
93e249969b [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261)
Remove useless parentheses in `raise` statements if the exception type is raised with no argument.
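A small illustration with hypothetical names:

```python
class WorkerTimeoutError(Exception):
    pass


def check_heartbeat(alive: bool) -> None:
    if not alive:
        # RSE: no empty parentheses when raising an exception class with no args
        raise WorkerTimeoutError
```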

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261
Approved by: https://github.com/albanD
2024-04-17 19:29:34 +00:00
Chirag Pandya
b6201a60c5 [BE] minor logging cleanup in distributed (#122921)
Summary:
    Minor logging cleanup in distributed library
    1. Don't use "f" formatted strings - address linter issues.
    2. Nits: Make use of unused `e` (error) in a few logs.
    3. Change info->debug as asked in issue #113545
    4. Nit: rename log -> logger in a few files for consistency
    5. Fix a linter error.

    Test Plan:
    1. Local build passes.
    2. Linter is happy.

    Reviewers: wanchaol

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
2024-03-29 03:34:01 +00:00
Cheng Ni
9bff1599b6 [Torch Elastic][Draft] Refactor SubprocessHandler to separate module for easier subclass (#120373)
Summary:
## No Functional Change
- Refactor Subprocess Handler into a separate folder for easier subclassing
- SubprocessHandler
    - added `local_rank_id` in `SubprocessHandler` to make it available as a field in the class
    - pass in `local_rank_id` from subprocess start

Test Plan: No functional changes.

Differential Revision: D54038627

#suppress-api-compatibility-check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120373
Approved by: https://github.com/kurman
2024-03-08 01:37:34 +00:00
Kurman Karabukaev
360761f7d0 [Torchelasic] Create root log directory by default (#121257)
Summary:
After refactoring in https://github.com/pytorch/pytorch/pull/120691, the default behavior unintentionally changed from creating a tempdir for logging to not capturing any logs in the torch Elastic Agent.

Reverting the behavior to:
- making a tempdir when the log dir is not specified
- allowing a non-empty root log dir
    - Note: in case the attempt folder exists, it will be pruned here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/multiprocessing/api.py#L294

Differential Revision: D54531851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121257
Approved by: https://github.com/d4l3k
2024-03-06 18:50:38 +00:00
Kurman Karabukaev
b0cfa96e82 [Torchelastic][Logging] Pluggable logsspecs using python entrypoints and option to specify one by name. (#120942)
Summary:
Expose an option for users to specify the name of the LogsSpec implementation to use (a lookup sketch follows below).
- Has to be defined in entry points under the `torchrun.logs_specs` group.
- Must implement the LogsSpec defined in the prior PR/diff.
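A hedged sketch of how such a named implementation could be resolved through Python entry points (the helper name is hypothetical; `entry_points(group=...)` needs Python 3.10+):

```python
from importlib.metadata import entry_points


def resolve_logs_specs(name: str):
    # Scan the "torchrun.logs_specs" entry-point group for a matching name.
    for ep in entry_points(group="torchrun.logs_specs"):
        if ep.name == name:
            return ep.load()()  # load and instantiate the registered class
    raise ValueError(f"no LogsSpecs implementation registered as {name!r}")
```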

Test Plan: unit test+local tests

Reviewed By: ezyang

Differential Revision: D54180838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120942
Approved by: https://github.com/ezyang
2024-03-02 08:07:52 +00:00
Kurman Karabukaev
67d3e4f2a2 [TorchElastic] Refactoring to support non-default logging strategy (#120691)
Summary:
Pull logging parameters out into logging specs that can be overridden (a possible override mechanism follows in later changes).

Why?
Right now the logging approach is quite rigid:
- Requires the log directory to exist and not be empty
- Creates a tempdir otherwise
- Creates a subdir for a run
- Creates a subdir for each attempt
- Creates files named stdout.log, stderr.log, error.json

In some instances users would like to customize the behavior, including file names, based on context. We do already have a mechanism to template the prefix of multiplexed teed output.

With current changes, users can create custom log spec that can use env variables to change the behavior.

Notes:
Made `LaunchConf.logs_specs` an optional field that will be bound to a `DefaultLogsSpecs` instance. A large number of clients use the API directly without going through the torchrun API; for those cases, we have to explicitly pass a LogsSpecs implementation if we would like to override the default. For regular torchrun users, we can use the pluggable approach proposed in the follow-up change.

Test Plan: CI + unit tests

Differential Revision: D54176265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120691
Approved by: https://github.com/ezyang
2024-02-29 20:59:17 +00:00
Kurman Karabukaev
4240304da4 [TorchElastic] Handle SystemExit with code == 0 (#119697)
Summary:
Fix for a case where the --run-path option fails to exit if the script exits with a non-error status code.
When there is an error exit code, run-path correctly detects the error and fails when calling spawn.join(). For the non-error case, however, the current behavior is to check the return value of the operation; the fix is to return None so that our MP code detects an exit.
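A hedged sketch of the described handling (not the actual --run-path wrapper):

```python
import runpy


def run_script_path(script_path: str) -> None:
    try:
        runpy.run_path(script_path, run_name="__main__")
    except SystemExit as e:
        # A zero/falsy exit code means success: swallow it and return None so the
        # MP machinery sees a clean exit; re-raise real failures so they are detected.
        if e.code not in (None, 0):
            raise
```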

Test Plan:
cat /tmp/script.py
~~~
import sys
def main():
    exit_code = 1
    if len(sys.argv) > 1:
        exit_code = int(sys.argv[1])
    sys.exit(exit_code)

if __name__=="__main__":
    main()
~~~

Case of exit code with 0 (prior behavior - never exits):
torchrun --run-path /tmp/script.py 0

~~~
[2024-02-12 09:20:57,523] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:20:58,980] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
(conda:pytorch) ➜  workspace echo $?
0
~~~

Existing behavior for non-zero exit code still works:
torchrun --run-path /tmp/script.py
~~~
(conda:pytorch) ➜  workspace torchrun --run-path /tmp/script.py
[2024-02-12 09:16:20,667] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:22,197] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 64668) of fn: run_script_path (start_method: spawn)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] Traceback (most recent call last):
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]   File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/api.py", line 441, in _poll
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]     self._pc.join(-1)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]   File "/Users/kurman/workspace/pytorch/torch/multiprocessing/spawn.py", line 177, in join
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]     raise ProcessExitedException(
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1
Traceback (most recent call last):
  File "/Users/kurman/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 812, in main
    run(args)
  File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-12_09:16:25
  host      : kurman-mbp.dhcp.thefacebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 64668)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
~~~

Differential Revision: D53653874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119697
Approved by: https://github.com/wconstab
2024-02-14 03:09:09 +00:00
Jack Zhang
51fb99250b Fix missing MAST log when there is Unicode non-decodable text in logs (#119298)
Summary:
## Issue
When there is Unicode non-decodable text in logs, `tail_logger` will stop working afterwards, i.e. f527390102

In the example, the process stopped producing Python logs after 17:20:21 until the job finished:
```
[0]:I0201 17:20:21.338000 3429 gen_ai/genie_projects/llm/metaformers/reward_model_score.py:335] Progress: 118 batches out of 512 total batches. 23.05 % | (gpu mem: 25.8GB, free CPU mem: 1387.8GB)
I0201 17:39:14 Stopping twtask-main.service with Service Result: [success] Exit Code: [exited] Exit Status: [0]
```
Finally, `UnicodeDecodeError` was thrown with no call stack.

## Fix
Use `errors="replace"` to avoid throwing exception when `UnicodeDecodeError` happens.
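A quick illustration of the decoding behavior the fix relies on:

```python
# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising
# UnicodeDecodeError, so log tailing can continue past bad bytes.
raw = b"Progress: 118 batches \xff out of 512"
print(raw.decode("utf-8", errors="replace"))
```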

Test Plan: f528854819

Differential Revision: D53483644

Co-authored-by: Jack Zhang <jackzh@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119298
Approved by: https://github.com/XilunWu
2024-02-07 19:25:43 +00:00
Simon Fan
284b0b5f44 Add --local-ranks-filter to torchrun: allow logs filtering by rank (#118562)
Addresses issue https://github.com/pytorch/pytorch/issues/117383

The implementation exposes `--local-ranks-filter`, which filters, by rank, the files we pass to `TailLog` (used in torchrun to determine which logs to output to stdout/stderr).

## Behavior
### with --tee
Currently --tee is implemented as --redirect to a file, whose contents are streamed to the console using `tail`. When --tee is specified, file logs are unaffected and we only filter the output to the console.

### with --redirect
When --redirect is specified without --tee, nothing is logged to console, so we no-op.

### with neither
When neither --tee nor --redirect is specified, torchrun uses the empty string "" to indicate logging to the console. We intercept this empty string and redirect it to "/dev/null" so that nothing is printed to the console.

The API also allows per-rank configuration of --tee and --redirect, which this filter implementation supports as well.

## Usage
### without --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --local_rank_filter=0 t.py
hello from rank 0 python
DEBUG: TRACED GRAPH
 __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
-------------  ------  -----------------------  ---------  --------
placeholder    l_x_    L_x_                     ()         {}
call_function  mul     <built-in function mul>  (l_x_, 5)  {}
output         output  output                   ((mul,),)  {}
...
```
### with --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --tee 3 --local_rank_filter=0 t.py
[rank0]:hello from rank 0 python
[rank0]:DEBUG: TRACED GRAPH
[rank0]: __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
[rank0]:-------------  ------  -----------------------  ---------  --------
[rank0]:placeholder    l_x_    L_x_                     ()         {}
[rank0]:call_function  mul     <built-in function mul>  (l_x_, 5)  {}
[rank0]:output         output  output                   ((mul,),)  {}
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118562
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-02-07 04:29:54 +00:00
PyTorch MergeBot
a4355d6b9a Revert "Add --filter-rank to torchrun: allow logs filtering by rank (#118562)"
This reverts commit 73229b4f93.

Reverted https://github.com/pytorch/pytorch/pull/118562 on behalf of https://github.com/xmfan due to breaks MAST precheck, flag naming conflict ([comment](https://github.com/pytorch/pytorch/pull/118562#issuecomment-1924916601))
2024-02-02 23:56:21 +00:00
Simon Fan
73229b4f93 Add --filter-rank to torchrun: allow logs filtering by rank (#118562)
Addresses issue https://github.com/pytorch/pytorch/issues/117383

The implementation exposes `--filter-ranks`, which filters, by rank, the files we pass to `TailLog` (used in torchrun to determine which logs to output to stdout/stderr).

## Behavior
### with --tee
Currently --tee is implemented as --redirect to a file, whose contents are streamed to the console using `tail`. When --tee is specified, file logs are unaffected and we only filter the output to the console.

### with --redirect
When --redirect is specified without --tee, nothing is logged to console, so we no-op.

### with neither
When neither --tee nor --redirect is specified, torchrun uses the empty string "" to indicate logging to the console. We intercept this empty string and redirect it to "/dev/null" so that nothing is printed to the console.

The API also allows per-rank configuration of --tee and --redirect, which this filter implementation supports as well.

## Usage
### without --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --filter_ranks=0 t.py
hello from rank 0 python
DEBUG: TRACED GRAPH
 __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
-------------  ------  -----------------------  ---------  --------
placeholder    l_x_    L_x_                     ()         {}
call_function  mul     <built-in function mul>  (l_x_, 5)  {}
output         output  output                   ((mul,),)  {}
...
```
### with --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --tee 3 --filter_ranks=0 t.py
[rank0]:hello from rank 0 python
[rank0]:DEBUG: TRACED GRAPH
[rank0]: __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
[rank0]:-------------  ------  -----------------------  ---------  --------
[rank0]:placeholder    l_x_    L_x_                     ()         {}
[rank0]:call_function  mul     <built-in function mul>  (l_x_, 5)  {}
[rank0]:output         output  output                   ((mul,),)  {}
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118562
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-01-31 07:40:01 +00:00
Aaron Gokaslan
1562dae62c [BE]: Apply RUF025 dict.fromkeys preview rule (#118637)
Simplifies and optimizes dict construction using the `fromkeys` classmethod ctor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable.
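A small illustration of the rewrite:

```python
ranks = [0, 1, 2, 3]

# A dict comprehension with a constant value ...
mapped = {r: None for r in ranks}
# ... is clearer (and faster) as the classmethod constructor:
mapped = dict.fromkeys(ranks)          # every key maps to None
flags = dict.fromkeys(ranks, False)    # or to any shared static value
```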

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637
Approved by: https://github.com/albanD
2024-01-30 20:46:54 +00:00
Aaron Gokaslan
4bb3a02d02 [BE]: Enable Ruff + Flake8 G201,G202 logging format rule. (#114474)
Standardizes logging calls to always use logging.exception instead of logging.error where appropriate and enforces it with a lint.
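An illustration of the pattern being standardized (names are made up):

```python
import logging

logger = logging.getLogger(__name__)


def launch():
    raise RuntimeError("boom")


try:
    launch()
except Exception:
    # logging.exception logs at ERROR level *and* appends the current traceback,
    # which a bare logging.error call would drop.
    logger.exception("worker launch failed")
```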

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114474
Approved by: https://github.com/jansel, https://github.com/malfet
2023-11-27 17:38:08 +00:00
PyTorch MergeBot
8232d4d1c3 Revert "[BE]: Enable Ruff + Flake8 G201,G202 logging format rule. (#114474)"
This reverts commit d30497f6b6.

Reverted https://github.com/pytorch/pytorch/pull/114474 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I see a bunch of inductor failure after the commit d30497f6b6, trying to revert to see if it helps fix the issues ([comment](https://github.com/pytorch/pytorch/pull/114474#issuecomment-1827271887))
2023-11-27 07:36:08 +00:00
Aaron Gokaslan
d30497f6b6 [BE]: Enable Ruff + Flake8 G201,G202 logging format rule. (#114474)
Standardizes logging calls to always use logging.exception instead of logging.error where appropriate and enforces it with a lint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114474
Approved by: https://github.com/jansel
2023-11-24 23:29:51 +00:00
zdevito
d968c4cac3 [torchelastic] ensure grandchild processes are restarted correctly (#113231)
When torchelastic notices that one rank has failed, it will send a SIGTERM
signal to the other trainer ranks to tear them down before restarting. However,
if the trainer itself launches subprocesses, or is launched by a non-Python
wrapper script, then the SIGTERM is delivered only to the direct child of
torchelastic and not to all descendants. This change opens subprocesses in a new
Linux 'session', which starts a new process group whose pgid is the same
as the trainer's pid. Then when we send signals, we deliver them to the
process group rather than just the direct child.
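A hedged sketch of the mechanism on POSIX (not the torchelastic source):

```python
import os
import signal
import subprocess
import sys

# Start the worker in a new session so its pid becomes the process-group id,
# then signal the whole group so grandchildren spawned by wrapper scripts are
# torn down too, not just the direct child.
proc = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    start_new_session=True,
)
os.killpg(proc.pid, signal.SIGTERM)  # pgid == pid because of the new session
proc.wait()
```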
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113231
Approved by: https://github.com/H-Huang
2023-11-19 04:05:01 +00:00
Kazuaki Ishizaki
91973e1c31 Issue113185 (#113523)
Fixes #113185

I have fixed the given docstring errors. The following are the outputs, with counts before and after the changes:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113523
Approved by: https://github.com/kit1980
2023-11-14 22:25:28 +00:00
Kurman Karabukaev
bae8506589 [TorchElastic] Add option to configure log prefix for each rank (#112357)
Summary:
Add the ability to customize log lines and additional template-like behavior to enrich log information.

Motivation:
a) Log stream processing/aggregation gains additional value when it includes information about the global rank. An extension of that is that it will be easier to map ranks to hosts from log stream information (less relevant at the moment).
b) Users can easily map a failure to the right rank without matching node rank offset + local rank.

Implementation
- BC change: keeps the log line prefix as `[<role name><local rank>]:`
- Optional env variable TORCHELASTIC_LOG_LINE_HEADER is used as the prefix when specified, and currently exposes `role_name`, `rank` and `local_rank` variables that are bound when the agent assigns the ranks.
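A hypothetical sketch only of how such a header template might be expanded; the actual substitution mechanism is not described in this commit message:

```python
import os

template = os.environ.get("TORCHELASTIC_LOG_LINE_HEADER")
if template is not None:
    # Assumed str.format-style expansion of the variables the agent binds.
    prefix = template.format(role_name="trainer", rank=8, local_rank=0)
else:
    prefix = "[trainer0]:"  # BC default: [<role name><local rank>]:
```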

Test Plan:
CI

https://fburl.com/mlhub/mzx5xspv

Differential Revision: D50584590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112357
Approved by: https://github.com/kiukchung
2023-11-08 01:00:26 +00:00
Atul Jangra
88244cd7a9 [torchx] Do not terminate parent process if exit code from child isn't valid (#111961)
Summary:
There's no reason to terminate the parent process while trying to find the name of the signal received by the child process.
Let's make sure this is handled properly, which will ensure that the parent process can process child failures.
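A hedged sketch of guarding the signal-name lookup (the helper name is hypothetical):

```python
import signal


def describe_child_exit(exitcode: int) -> str:
    # A negative exit code means the child was killed by a signal; guard the
    # lookup so an unrecognized value can never crash the parent process.
    if exitcode < 0:
        try:
            return f"terminated by {signal.Signals(-exitcode).name}"
        except ValueError:
            return f"terminated by unknown signal {-exitcode}"
    return f"exited with code {exitcode}"
```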

Test Plan: Unit tests.

Reviewed By: aaronenyeshi

Differential Revision: D50615419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111961
Approved by: https://github.com/aaronenyeshi
2023-10-25 07:13:28 +00:00
Justin Chu
232b96b6e2 [BE] Enable ruff's UP rules and autoformat distributed/ (#105433)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105433
Approved by: https://github.com/albanD
2023-07-19 14:27:11 +00:00
Edward Z. Yang
b8b840be3d Convert logging f-strings to use % format, part five (#98765)
This does some annoying but simple cases by hand.
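The shape of the conversion, for reference:

```python
import logging

logger = logging.getLogger(__name__)
rank, elapsed = 0, 1.234

# %-style arguments defer interpolation until the record is actually emitted,
# unlike an f-string, which always builds the message up front:
logger.info("rank %d finished in %.2fs", rank, elapsed)   # preferred
logger.info(f"rank {rank} finished in {elapsed:.2f}s")    # pattern converted away
```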

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98765
Approved by: https://github.com/wanchaol
2023-04-11 13:17:59 +00:00
Edward Z. Yang
5a7aad9681 Convert logging f-strings to use % format, part four (#98705)
This does multi-line concatenated string literals.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98705
Approved by: https://github.com/voznesenskym
2023-04-11 13:17:59 +00:00
Edward Z. Yang
b09722f540 Convert logging f-strings to use % format, part two (#98700)
This hits multi-line logging strings

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98700
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
Edward Z. Yang
9a8f71f23e Convert logging f-strings to use % format (#98697)
Codemod done with
https://gist.github.com/ezyang/2e8b0463cdc6be278478495b23ff0530 with
assistance from ChatGPT.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98697
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
Kazuaki Ishizaki
6514d71add Fix typos under torch/distributed directory (#98225)
This PR fixes typos in comments and messages of `.py` files under the `torch/distributed` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98225
Approved by: https://github.com/soulitzer, https://github.com/kit1980
2023-04-05 00:21:33 +00:00
Kazuaki Ishizaki
35fd5c548e Fix typos under torch/distributed directory (#95638)
This PR fixes typos in comments and messages of `.py` files under the torch/distributed directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638
Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980
2023-03-27 21:13:44 +00:00
Jeffrey Dunn
d779dadda1 Remove stack trace captures from import (#97274)
Summary:
Calls to this function without an argument will capture a stack trace at
import time. This is expensive; we can just skip it by passing in a value.

Test Plan: Wait for tests

Differential Revision: D44244345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97274
Approved by: https://github.com/kiukchung
2023-03-22 18:34:13 +00:00
Aaron Gokaslan
5471621497 [BE] Remove unnecessary dict comprehensions (#97116)
Removes unnecessary dict comprehensions, optimizing the creation of dicts from iterables.
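A small illustration of the simplification:

```python
pairs = [("rank", 0), ("local_rank", 1)]

# An identity dict comprehension over key/value pairs ...
built = {k: v for k, v in pairs}
# ... is just the dict constructor:
built = dict(pairs)
```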

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97116
Approved by: https://github.com/kit1980
2023-03-20 00:56:57 +00:00
Horace He
5bbec680d7 Fix usages of contextmanager without finally (#96170)
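A generic sketch of the pattern being enforced (hypothetical helper):

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(records):
    start = time.monotonic()
    try:
        yield
    finally:
        # Without the finally, an exception inside the `with` body would skip
        # this cleanup and the measurement would be lost.
        records.append(time.monotonic() - start)
```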
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96170
Approved by: https://github.com/ngimel, https://github.com/malfet
2023-03-08 20:59:27 +00:00
fduwjj
e98a942399 [PTD] Land 'to_std' utility parser fix #93209 (#94023)
Land https://github.com/pytorch/pytorch/pull/93209 faster.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94023
Approved by: https://github.com/wz337
2023-02-03 09:04:34 +00:00