Commit Graph

70 Commits

Author SHA1 Message Date
Maggie Moss
8f80892359 Use correct pyrefly syntax in suppressions distributed/... (#166241)
Updates the pyrefly-ignores in the torch/distributed directory to use the correct syntax. No functional changes.

pyrefly check
lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166241
Approved by: https://github.com/oulgen
2025-10-26 04:16:41 +00:00
Phil Hu
cbcb4f7768 [pytorch][torchelastic] Duplicate stdout and stderr and apply custom filter in torchrun (#160712)
Summary:
Part of an effort to extract some important error logs (e.g. [#157996](https://github.com/pytorch/pytorch/pull/157996)) that were `tee`'d to `stdout` and `stderr`.

The general idea is to:

- Duplicate the `tee`s on `stdout` and `stderr` to separate files, `filtered_stdout.log` and `filtered_stderr.log`, respectively.
- In these files, as their names suggest, only log lines that match a customizable filter.
- Later on in another PR, append the contents of these files to the reply file.

Outline of changes in this PR:

- Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter.
- Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them.
- In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created.
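
The sketch below illustrates the duplication idea in isolation: every line still goes to the main stream, and is additionally written to the filtered file only when it matches one of the supplied filters. This is a conceptual example, not the actual `TailLog` API; the function and file names are hypothetical.

```py
import re
from typing import Callable, TextIO

# Conceptual sketch (not the TailLog API): tee every line to the main stream as
# before, and additionally write it to e.g. filtered_stdout.log when it matches
# one of the supplied filters.
def duplicate_filtered(line: str, main: TextIO, filtered: TextIO,
                       filters: list[Callable[[str], bool]]) -> None:
    main.write(line)
    if any(f(line) for f in filters):
        filtered.write(line)

# Example filter: keep only CUDA OOM lines.
oom_filter = re.compile("CUDA out of memory").search
```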

Test Plan:
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688
Network: Up: 398B  Down: 44MiB  (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a)
Analyzing targets. Remaining     0/200
Executing actions. Remaining     0/12856                                                                                                                                        0.1s exec time total
Command: test.     Finished 1 local
Time elapsed: 17:37.9s
Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test
```
```
Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262
Network: Up: 94KiB  Down: 417MiB  (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922)
Analyzing targets. Remaining     0/3                                                                                                                                            536 actions, 555 artifacts declared
Executing actions. Remaining     0/186                                                                                                                                          1:05.5s exec time total
Command: test.     Finished 7 local, 1 remote, 115 cache (93% hit)                                                                                                              37.0s exec time cached (56%)
Time elapsed: 1:11.5s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Rollback Plan:

Differential Revision: D80188995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160712
Approved by: https://github.com/fduwjj
2025-10-23 14:22:21 +00:00
PyTorch MergeBot
d2494cbb2b Revert "[distributed] Replace assert statements with AssertionError exceptions (#165216)"
This reverts commit 74db92b218.

Reverted https://github.com/pytorch/pytorch/pull/165216 on behalf of https://github.com/clee2000 due to I think this broke distributed/test_pg_wrapper.py::ProcessGroupNCCLWrapperTest::test_debug_level_detail_no_gloo [GH job link](https://github.com/pytorch/pytorch/actions/runs/18492765290/job/52693842750) [HUD commit link](74db92b218), note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/165216#issuecomment-3402838765))
2025-10-14 17:05:16 +00:00
Rohit Singh Rathaur
74db92b218 [distributed] Replace assert statements with AssertionError exceptions (#165216)
Replaces 71 assert statements across 11 files in `torch.distributed` with explicit if-checks that raise `AssertionError`, so the checks cannot be disabled by Python's `-O` flag.
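
A minimal illustration of the pattern (the contiguity check here is a made-up example, not one of the 71 call sites):

```py
def _check_contiguous(tensor):
    # Before (stripped when running under `python -O`):
    #     assert tensor.is_contiguous(), "expected a contiguous tensor"
    # After (always enforced):
    if not tensor.is_contiguous():
        raise AssertionError("expected a contiguous tensor")
```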

Fixes #164878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165216
Approved by: https://github.com/albanD
2025-10-14 09:58:59 +00:00
Maggie Moss
7457d139c5 Add pyrefly suppressions to torch/distributed (7/n) (#165002)
Adds suppressions so that pyrefly typechecks cleanly: https://github.com/pytorch/pytorch/issues/163283

One more PR after this one.

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:
INFO 0 errors (6,884 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165002
Approved by: https://github.com/oulgen
2025-10-09 04:08:25 +00:00
Yuanyuan Chen
da003d7b95 [3/N] Import Callable from collections.abc in torch/distributed (#164104)
This is the result of applying the ruff `UP035` check.
`Callable` is imported from `collections.abc` instead of `typing`.
This PR is a follow-up to #164054.
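
In essence (a minimal before/after sketch, not a diff from this PR):

```py
# Before (flagged by ruff UP035): from typing import Callable
from collections.abc import Callable

def run_hook(hook: Callable[[int], None], rank: int) -> None:
    hook(rank)
```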

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164104
Approved by: https://github.com/Skylion007
2025-09-30 00:28:53 +00:00
Yuanyuan Chen
c2862c8e66 [distributed] Remove python code older than 3.10 (#163613)
The minimum supported Python version is now 3.10, so compatibility code for older versions is no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163613
Approved by: https://github.com/XuehaiPan, https://github.com/kwen2501
2025-09-28 04:15:24 +00:00
Miroslaw Oksiucik
66c0f14ecc Support XPU in --nproc-per-node option to torchrun (#159474)
Supports both `--nproc-per-node=xpu` and autodetection of XPU devices when `--nproc-per-node=auto` is used.
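
Roughly, the autodetection idea looks like the sketch below; this is only an illustration using the public `torch.xpu` queries, not the actual torchrun implementation.

```py
import os
import torch

# Illustrative sketch of "auto" device detection; the real torchrun logic may differ.
def local_world_size() -> int:
    if torch.cuda.is_available():
        return torch.cuda.device_count()
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.xpu.device_count()
    return os.cpu_count() or 1
```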

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159474
Approved by: https://github.com/tsocha, https://github.com/guangyey, https://github.com/d4l3k

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-09-12 08:32:04 +00:00
SandishKumarHN
b498299953 154849 Add support to handle SIGUSR1 and SIGUSR2 in multiprocessing (#160690)
Fixes #154849

This change addresses the request to add support for SIGUSR1 and SIGUSR2 signals in torchrun for SLURM environments. The change supports these signals through the configurable `TORCHELASTIC_SIGNALS_TO_HANDLE` environment variable and the `signals_to_handle` parameter of the launcher API.
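
Conceptually, the launcher installs handlers for the configured signals and forwards them to its workers. The helper below is only an illustration built on the standard `signal`/`os` modules, not the torchelastic implementation.

```py
import os
import signal

# Illustration only: install handlers for the configured signals and forward
# them to the worker processes.
def install_forwarding_handlers(worker_pids, signal_names=("SIGTERM", "SIGUSR1", "SIGUSR2")):
    def forward(signum, frame):
        for pid in worker_pids:
            os.kill(pid, signum)
    for name in signal_names:
        signal.signal(getattr(signal, name), forward)
```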

Tests:
For validation purposes:
test_signal_handling.py,
simple_test_api_signal_handling.py

Unit tests:
for launcher changes: launcher/test_api.py
for API changes: multiprocessing/test_api.py
E2E: test_run.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160690
Approved by: https://github.com/fduwjj
2025-09-09 22:23:06 +00:00
Paul de Supinski
7e91394955 Support NUMA Binding for Callable Entrypoints (#160163)
# Context
This is an extension of #149334.

# This PR
Add support for NUMA bindings with Callable entrypoints, such as `do_train` instead of `/usr/local/bin/python`.

Most notably, we utilize a hack in order to force `Process.start()` to use custom NUMA bindings for each subprocess. Please search for `HACK:` in the code to see a description of the implementation we chose, and #160006 for discussion of alternatives and why this is necessary.

Other changes:
* Remove unnecessary `--preferred` option from all binding strategies. By default, Linux already allocates memory to the NUMA node local to the CPU which triggered the allocation. (See [MPOL_LOCAL](https://man7.org/linux/man-pages/man2/set_mempolicy.2.html).)
* Refactor so that the main API is `maybe_wrap_command_with_numa_bindings`, which computes bindings for a single rank at a time, rather than `maybe_wrap_with_numa_bindings`, which computed bindings for all ranks at once. This allowed for more code sharing between `Callable` and `str` entrypoints; a rough sketch of the per-rank wrapping idea follows below.
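
For `str` entrypoints the wrapping amounts to prefixing the command with a `numactl` invocation that pins the rank to its GPU's NUMA node; memory then stays node-local by default (MPOL_LOCAL, as noted above). The helper below is a hypothetical illustration, not the PR's actual API.

```py
# Hypothetical illustration of per-rank command wrapping; names are not the PR's API.
def wrap_command_with_node_binding(command: list[str], numa_node: int) -> list[str]:
    # Pin CPUs to the node; memory allocation follows the local node by default.
    return ["numactl", f"--cpunodebind={numa_node}", *command]

# e.g. wrap_command_with_node_binding(["/usr/local/bin/python", "train.py"], numa_node=0)
```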

# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`

## Manual
Using [this benchmark](https://gist.github.com/pdesupinski/bbe01ade455d86e989794f2c612e2d91), I ran

```
$ PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -m torch.distributed.run --standalone --nproc-per-node=8 --numa-binding=node --run-path mlp_train.py 2>&1 | tee node_callable.txt && PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -u -m torch.distributed.run --standalone --nproc-per-node=8 --run-path mlp_train.py 2>&1 | tee none_callable.txt
```

and observed
* 6.6% remote memory accesses with 'node' bindings
* 11.6% remote without bindings

I also ran similar with `str` entrypoints as before just to be sure it's still working.

NOTE: [--run-path triggers the code to be run inside a `Callable`.](017259f9c6/torch/distributed/run.py (L870))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160163
Approved by: https://github.com/d4l3k
2025-08-12 20:08:49 +00:00
raghavhrishi
7ef3c3357d NUMA binding integration with elastic agent and torchrun (#149334)
Implements #148689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149334
Approved by: https://github.com/d4l3k

Co-authored-by: Paul de Supinski <pdesupinski@gmail.com>
2025-07-25 21:19:49 +00:00
Xuehai Pan
4ccc0381de [BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #156313, #156314
2025-06-23 02:57:28 +00:00
PyTorch MergeBot
145d4cdc11 Revert "[BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)"
This reverts commit c2f0292bd5.

Reverted https://github.com/pytorch/pytorch/pull/156315 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
Xuehai Pan
c2f0292bd5 [BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #156313, #156314
2025-06-22 08:43:26 +00:00
Amandeep Chhabra
e15848669f [1/n]adding torch.distributed.run option to provide destination for event logging (#154644) (#155268)
Summary:

**Problem Statement**
Currently, torch distributed elastic does not provide an option to specify the destination for event logging from torch.distributed.run.
*recording events to default destination:* https://fburl.com/code/7f9b0993
The default destination is "null".

***Solution***
Add an option in torch.distributed.run to specify event_logging_destination. The default value will be "null", which is the current default, so it won't affect users unless they specify it via the command line.

Test Plan:

https://www.internalfb.com/mlhub/pipelines/runs/mast/f738408681-TrainingApplication_torch_distributed_run_3?job_attempt=0&version=0&tab=execution_details&env=PRODUCTION

Rollback Plan:

Reviewed By: kiukchung

Differential Revision: D75183591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155268
Approved by: https://github.com/d4l3k
2025-06-09 10:43:52 +00:00
Xuehai Pan
995df34b19 [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547
Approved by: https://github.com/kwen2501
2025-02-28 07:35:56 +00:00
Fernando Pérez-García
d79c6f4946 Improve torchrun documentation (#144354)
Fixes #142042:
- #142042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144354
Approved by: https://github.com/c-p-i-o, https://github.com/H-Huang
2025-01-24 20:40:05 +00:00
Aaron Orenstein
00ffeca1b1 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-21 04:23:29 +00:00
PyTorch MergeBot
6374332d33 Revert "PEP585 update - torch/distributed (#145164)"
This reverts commit 6cb186e279.

Reverted https://github.com/pytorch/pytorch/pull/145164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing an inductor test ([comment](https://github.com/pytorch/pytorch/pull/145164#issuecomment-2602875679))
2025-01-20 16:46:46 +00:00
Aaron Orenstein
6cb186e279 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-20 00:19:01 +00:00
bobrenjc93
08be9ec312 Migrate from Tuple -> tuple in torch/distributed (#144258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258
Approved by: https://github.com/aorenste
2025-01-10 08:34:54 +00:00
Aaron Gokaslan
72c6d13cea [BE]: Use proper logger in torch.distributed.run (#140547)
`torch.distributed.run` was improperly using the root logger and ignoring all logging settings and useful debugging info. It now properly uses the correct logger. Will be added to ruff as part of LOG015 soon.
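
In essence (a minimal sketch of the pattern, not the actual diff):

```py
import logging

# Use a module-level logger so the user's logging configuration is respected,
# instead of calling logging.info(...) on the root logger.
logger = logging.getLogger(__name__)

def main() -> None:
    logger.info("starting elastic agent")
```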

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140547
Approved by: https://github.com/XuehaiPan, https://github.com/fegin
2024-11-14 14:49:17 +00:00
Howard Huang
1eedb0a962 fix torchrun log message (#131652)
fixes https://github.com/pytorch/pytorch/issues/131461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131652
Approved by: https://github.com/awgu
2024-07-25 14:50:10 +00:00
Xuehai Pan
94dc3253a0 [BE][Easy] enable UFMT for torch/distributed/ (#128870)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870
Approved by: https://github.com/fegin, https://github.com/wconstab
2024-06-22 18:53:28 +00:00
PyTorch MergeBot
9c929f6ce9 Revert "[BE][Easy] enable UFMT for torch/distributed/ (#128870)"
This reverts commit a0e1e20c41.

Reverted https://github.com/pytorch/pytorch/pull/128870 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128870#issuecomment-2181780356))
2024-06-21 00:38:28 +00:00
Xuehai Pan
a0e1e20c41 [BE][Easy] enable UFMT for torch/distributed/ (#128870)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870
Approved by: https://github.com/fegin
ghstack dependencies: #128868, #128869
2024-06-18 21:49:08 +00:00
Tristan Rice
7c370d2fb0 expose set_thread_name to Python and set thread names (#128448)
This adds a new multiprocessing method `_set_thread_name` and calls it from torchelastic and dataloader main functions. This will allow better monitoring of processes as we can separate elastic and dataloading processes from the main training process.

Threads named:

* torchrun/elastic
* PyTorch dataloader worker processes + pin memory thread
* TCPStore
* ProcessGroupNCCL background threads
* WorkerServer httpserver thread
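
Assumed usage of the new private helper (the import path and single-string-argument signature are assumptions, not confirmed by this message):

```py
import torch.multiprocessing as mp

def data_worker_main() -> None:
    # Assumed helper added by this PR; the name shows up in `ps -eL` / `top -H`.
    mp._set_thread_name("pt_data_worker")
    ...
```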

Test plan:

```
$ torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c 'ps -eL | grep pt_'
3264281 3264281 pts/45   00:00:02 pt_elastic
3264281 3267950 pts/45   00:00:00 pt_elastic
```

dataloading

```py
import torch
import time

from torch.utils.data import (
    DataLoader,
    Dataset,
)

class NoopDataset(Dataset):
    def __getitem__(self, index):
        return index

    def __len__(self):
        return 10

dataloader = DataLoader(NoopDataset(), num_workers=2)

for i, x in enumerate(dataloader):
    print(i, x)
    time.sleep(10000)
```

```
$ python3 ~/scripts/dataloader_test.py
$ ps -eL | grep pt_
1228312 1228312 pts/45   00:00:02 pt_main_thread
1228312 1230058 pts/45   00:00:00 pt_main_thread
1228312 1230059 pts/45   00:00:00 pt_main_thread
1230052 1230052 pts/45   00:00:00 pt_data_worker
1230052 1230198 pts/45   00:00:00 pt_data_worker
1230052 1230740 pts/45   00:00:00 pt_data_worker
1230055 1230055 pts/45   00:00:00 pt_data_worker
1230055 1230296 pts/45   00:00:00 pt_data_worker
1230055 1230759 pts/45   00:00:00 pt_data_worker
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128448
Approved by: https://github.com/c-p-i-o, https://github.com/andrewkho, https://github.com/rsdcastro
2024-06-13 16:38:23 +00:00
Aaron Orenstein
7c12cc7ce4 Flip default value for mypy disallow_untyped_defs [6/11] (#127843)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127843
Approved by: https://github.com/oulgen
ghstack dependencies: #127842
2024-06-08 18:49:29 +00:00
Aaron Gokaslan
8cad88e1f3 [BE]: Improve exception typing. Remove NOQAs (#125535)
Improve some exception typing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125535
Approved by: https://github.com/albanD
2024-05-08 14:07:13 +00:00
Tristan Rice
952a00eda7 torchelastic: change monitor_interval default to 0.1 (#124692)
This reduces the default monitor_interval for torchelastic to 0.1s as testing shows negligible load for common use cases. Even at the extremes, 100k processes is only 45.4% CPU util of a single core.

Torchelastic monitor_interval only monitors the processes on a single worker so under typical loads even for huge jobs we expect ~8 subprocesses per machine with one per GPU.

As an external datapoint, Python's wait polls every 50usec-50ms (https://github.com/python/cpython/blob/main/Lib/subprocess.py#L2035).

## Motivation

This setting is used to control how frequently we poll for failed processes in elastic.

* For some jobs of note we run elastic 3 times per try so with the default timeout of 5 seconds we should save ~15 seconds per retry.
* @kiukchung's use case: Apparently this is annoying in notebooks etc since it adds delay to shutdown when testing things

## Results

This is measured in cores (100% is a single core under full load).

| monitor_interval (s) | nproc-per-node | CPU util (highest observed) |
| -------------------- | -------------- | --------------------------- |
| 1.0                  | 10             | 0.2%                        |
| 0.1                  | 1              | 0.4%                        |
| 0.1                  | 10             | 0.4%                        |
| 0.01                 | 10             | 0.9%                        |
| 0.001                | 10             | 4.0%                        |
| 0.1                  | 100            | 0.5%                        |
| 0.1                  | 1000           | 2.2%                        |
| 0.1                  | 10000          | 15.7%                       |
| 0.1                  | 100000         | 45.4%                       |

## Methodology

```sh
# run command
$ LOGLEVEL=INFO torchrun --nnodes 1 --nproc-per-node 10 --monitor-interval 0.1 ~/wait.py

# wait a few seconds for all processes to start and reach steady state and then run, wait ~30s or 3 prints and take the highest
$ top -b -d 10 -c | rg 'torchrun.*wait'
```

wait.py

```py
import time

time.sleep(10*60)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124692
Approved by: https://github.com/kiukchung, https://github.com/kurman
2024-04-24 01:44:41 +00:00
Aaron Gokaslan
c5fafe9f48 [BE]: TRY002 - Ban raising vanilla exceptions (#124570)
Adds a ruff lint rule to ban raising raw exceptions. Most of these should at the very least be runtime errors, value errors, type errors, or some other more specific error type. There are hundreds of instances of these bad exception types already in the codebase, so I have noqa'd most of them. Hopefully this error code will get committers to rethink what exception type they should raise when they submit a PR.

I also encourage people to gradually go and fix all the existing noqas that have been added so they can be removed over time and our exception typing can be improved.
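
A minimal before/after of what the rule asks for (the rank check is an invented example, not from the diff):

```py
def check_rank(rank: int) -> None:
    # Before (flagged by TRY002): raise Exception(f"invalid rank {rank}")
    # After: raise the narrowest fitting built-in exception instead.
    if rank < 0:
        raise ValueError(f"invalid rank {rank}; expected a non-negative integer")
```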

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124570
Approved by: https://github.com/ezyang
2024-04-21 22:26:40 +00:00
Xuehai Pan
0eab740db3 [Docs][Distributed] Add migration notes for --local-rank option style change for torchrun in PyTorch 2.0 (#109480)
Fixes https://github.com/pytorch/pytorch/pull/94505#issuecomment-1722777767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109480
Approved by: https://github.com/ezyang
2024-04-16 05:51:57 +00:00
Chirag Pandya
b6201a60c5 [BE] minor logging cleanup in distributed (#122921)
Summary:
    Minor logging cleanup in distributed library
    1. Don't use "f" formatted strings - address linter issues.
    2. Nits: Make use of unused `e` (error) in a few logs.
    3. Change info->debug as asked in issue #113545
    4. Nit: rename log -> logger in a few files for consistency
    5. Fix a linter error.

    Test Plan:
    1. Local build passes.
    2. Linter is happy.

    Reviewers: wanchaol

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
2024-03-29 03:34:01 +00:00
Kurman Karabukaev
b0cfa96e82 [Torchelastic][Logging] Pluggable logsspecs using python entrypoints and option to specify one by name. (#120942)
Summary:
Expose an option for users to specify the name of the LogsSpec implementation to use.
- Has to be defined in entrypoints under `torchrun.logs_specs` group.
- Must implement LogsSpec defined in prior PR/diff.

Test Plan: unit test+local tests

Reviewed By: ezyang

Differential Revision: D54180838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120942
Approved by: https://github.com/ezyang
2024-03-02 08:07:52 +00:00
Kurman Karabukaev
67d3e4f2a2 [TorchElastic] Refactoring to support non-default logging strategy (#120691)
Summary:
Pulling out logging parameters into logging specs that can be overridden (follow-up changes will propose a possible override mechanism).

Why?
Right now the logging approach is quite rigid:
- Requires the log directory to exist and not be empty
- Creates a tempdir otherwise
- Creates a subdir for each run
- Creates a subdir for each attempt
- Creates files named stdout.log, stderr.log, error.json

In some instances, users would like to customize this behavior, including file names, based on context. We do already have a mechanism to template the multiplexed teed output prefix.

With the current changes, users can create a custom logs spec that uses environment variables to change the behavior.

Notes:
Made `LaunchConf.logs_specs` an optional field that will be bound to a `DefaultLogsSpecs` instance. There are a large number of clients (code) that use the API directly without going through the torchrun API. For those cases, we have to explicitly pass a LogsSpecs implementation if we would like to override the implementation. For regular torchrun users, we can use the pluggable approach proposed in the follow-up change.

Test Plan: CI + unit tests

Differential Revision: D54176265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120691
Approved by: https://github.com/ezyang
2024-02-29 20:59:17 +00:00
Simon Fan
284b0b5f44 Add --local-ranks-filter to torchrun: allow logs filtering by rank (#118562)
Addresses issue https://github.com/pytorch/pytorch/issues/117383

The implementation exposes `--local-ranks-filter`, which filters, by rank, which files we pass to `TailLog` (used in torchrun to determine which logs to output to stdout/stderr).

## Behavior
### with --tee
Currently --tee is implemented as --redirect to file, and the file is streamed to the console using `tail`. When --tee is specified, file logs are unaffected and we only filter the output to the console.

### with --redirect
When --redirect is specified without --tee, nothing is logged to console, so we no-op.

### with neither
When neither --tee or --redirect are specified, torchrun uses empty string "" to indicate logging to console. We intercept this empty string, and redirect it to "/dev/null" to not print to console.

The API also allows per-rank configuration for --tee and --redirect, which is also supported by this filter implementation.

## Usage
### without --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --local_rank_filter=0 t.py
hello from rank 0 python
DEBUG: TRACED GRAPH
 __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
-------------  ------  -----------------------  ---------  --------
placeholder    l_x_    L_x_                     ()         {}
call_function  mul     <built-in function mul>  (l_x_, 5)  {}
output         output  output                   ((mul,),)  {}
...
```
### with --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --tee 3 --local_rank_filter=0 t.py
[rank0]:hello from rank 0 python
[rank0]:DEBUG: TRACED GRAPH
[rank0]: __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
[rank0]:-------------  ------  -----------------------  ---------  --------
[rank0]:placeholder    l_x_    L_x_                     ()         {}
[rank0]:call_function  mul     <built-in function mul>  (l_x_, 5)  {}
[rank0]:output         output  output                   ((mul,),)  {}
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118562
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-02-07 04:29:54 +00:00
PyTorch MergeBot
a4355d6b9a Revert "Add --filter-rank to torchrun: allow logs filtering by rank (#118562)"
This reverts commit 73229b4f93.

Reverted https://github.com/pytorch/pytorch/pull/118562 on behalf of https://github.com/xmfan due to breaks MAST precheck, flag naming conflict ([comment](https://github.com/pytorch/pytorch/pull/118562#issuecomment-1924916601))
2024-02-02 23:56:21 +00:00
Simon Fan
73229b4f93 Add --filter-rank to torchrun: allow logs filtering by rank (#118562)
Addresses issue https://github.com/pytorch/pytorch/issues/117383

The implementation exposes `--filter-ranks`, which filters, by rank, which files we pass to `TailLog` (used in torchrun to determine which logs to output to stdout/stderr).

## Behavior
### with --tee
Currently --tee is implemented as --redirect to file, and the file is streamed to the console using `tail`. When --tee is specified, file logs are unaffected and we only filter the output to the console.

### with --redirect
When --redirect is specified without --tee, nothing is logged to console, so we no-op.

### with neither
When neither --tee or --redirect are specified, torchrun uses empty string "" to indicate logging to console. We intercept this empty string, and redirect it to "/dev/null" to not print to console.

The API also allows per-rank configuration for --tee and --redirect, which is also supported by this filter implementation.

## Usage
### without --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --filter_ranks=0 t.py
hello from rank 0 python
DEBUG: TRACED GRAPH
 __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
-------------  ------  -----------------------  ---------  --------
placeholder    l_x_    L_x_                     ()         {}
call_function  mul     <built-in function mul>  (l_x_, 5)  {}
output         output  output                   ((mul,),)  {}
...
```
### with --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --tee 3 --filter_ranks=0 t.py
[rank0]:hello from rank 0 python
[rank0]:DEBUG: TRACED GRAPH
[rank0]: __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
[rank0]:-------------  ------  -----------------------  ---------  --------
[rank0]:placeholder    l_x_    L_x_                     ()         {}
[rank0]:call_function  mul     <built-in function mul>  (l_x_, 5)  {}
[rank0]:output         output  output                   ((mul,),)  {}
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118562
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-01-31 07:40:01 +00:00
Wanchao Liang
d416e5b34f [torchrun] fix incorrect warning for non static backend (#114335)
This PR fixes an incorrect warning for non-static rdzv backends; the
warning should only be thrown when the rdzv endpoint is not specified.

error repro from @stas00

```
$ cat test.py
import torch

$ python -u -m torch.distributed.run --nproc_per_node=1 --rdzv_endpoint localhost:6000  --rdzv_backend c10d test.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114335
Approved by: https://github.com/H-Huang
2023-11-22 20:09:14 +00:00
ooooo
a8097ed479 Fix docstring errors in _composable_state.py, remote_device.py, value_ranges.py, utils.py, run.py, rendezvous.py, launch.py, argparse_util.py, __init__.py, _cycles.py (#112953)
Fixes #112639

```txt
 torch/utils/_sympy/value_ranges.py
 torch/utils/_sympy/value_ranges.py:60 in public class `ValueRanges`:
        D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:68 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:81 in public method `__contains__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:86 in public method `tighten`:
        D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:90 in public method `__and__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:103 in public method `__or__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:113 in public method `is_singleton`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:118 in public method `unknown`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:122 in public method `wrap`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:129 in public method `increasing_map`:
        D400: First line should end with a period (not ')')
torch/utils/_sympy/value_ranges.py:135 in public method `decreasing_map`:
        D400: First line should end with a period (not ')')
torch/utils/_sympy/value_ranges.py:141 in public method `monotone_map`:
        D400: First line should end with a period (not 'g')
torch/utils/_sympy/value_ranges.py:149 in public method `convex_min_zero_map`:
        D400: First line should end with a period (not '0')
torch/utils/_sympy/value_ranges.py:149 in public method `convex_min_zero_map`:
        D403: First word of the first line should be properly capitalized ('Fn', not 'fn')
torch/utils/_sympy/value_ranges.py:158 in public method `coordinatewise_increasing_map`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:158 in public method `coordinatewise_increasing_map`:
        D400: First line should end with a period (not ':')
torch/utils/_sympy/value_ranges.py:171 in public method `coordinatewise_monotone_map`:
        D400: First line should end with a period (not 'e')
torch/utils/_sympy/value_ranges.py:180 in private class `SymPyValueRangeAnalysis`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:180 in private class `SymPyValueRangeAnalysis`:
        D400: First line should end with a period (not 's')
torch/utils/_sympy/value_ranges.py:386 in private method `reciprocal`:
        D210: No whitespaces allowed surrounding docstring text
torch/utils/_sympy/value_ranges.py:386 in private method `reciprocal`:
        D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:488 in public class `ValueRangeAnalysis`:
        D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:489 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:501 in public method `bool_handler`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:506 in public method `default_handler`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:511 in public method `load`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:514 in public method `store`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:517 in public method `reduction`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:520 in public method `index_expr`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:525 in public method `to_dtype`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:558 in public method `square`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:562 in public method `neg`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:566 in public method `truncdiv`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:577 in public method `sub`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:580 in public method `__getattr__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:585 in public function `bound_sympy`:
        D103: Missing docstring in public function
36
torch/utils/_sympy/value_ranges.py:60 in public class `ValueRanges`:
        D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:68 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:81 in public method `__contains__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:86 in public method `tighten`:
        D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:90 in public method `__and__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:103 in public method `__or__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:113 in public method `is_singleton`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:118 in public method `unknown`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:122 in public method `wrap`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:182 in private class `SymPyValueRangeAnalysis`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:182 in private class `SymPyValueRangeAnalysis`:
        D400: First line should end with a period (not 's')
torch/utils/_sympy/value_ranges.py:388 in private method `reciprocal`:
        D210: No whitespaces allowed surrounding docstring text
torch/utils/_sympy/value_ranges.py:388 in private method `reciprocal`:
        D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:490 in public class `ValueRangeAnalysis`:
        D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:491 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:503 in public method `bool_handler`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:508 in public method `default_handler`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:513 in public method `load`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:516 in public method `store`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:519 in public method `reduction`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:522 in public method `index_expr`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:527 in public method `to_dtype`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:560 in public method `square`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:564 in public method `neg`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:568 in public method `truncdiv`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:579 in public method `sub`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:582 in public method `__getattr__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:587 in public function `bound_sympy`:
        D103: Missing docstring in public function
28

torch/utils/viz/_cycles.py
torch/utils/viz/_cycles.py:14 in public function `observe_garbage`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:207 in public function `object_annotation`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/viz/_cycles.py:207 in public function `object_annotation`:
        D400: First line should end with a period (not 'g')
torch/utils/viz/_cycles.py:256 in public class `Node`:
        D101: Missing docstring in public class
torch/utils/viz/_cycles.py:262 in public function `create_graph`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:308 in public function `escape`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:312 in public function `is_cuda_tensor`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:315 in public function `cuda_allocation_context`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:335 in public function `to_dot`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:406 in public function `to_html`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:416 in public function `observe_tensor_cycles`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
        D400: First line should end with a period (not 'p')
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
        D401: First line should be in imperative mood; try rephrasing (found 'Reference')
14
torch/utils/viz/_cycles.py:14 in public function `observe_garbage`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:256 in public class `Node`:
        D101: Missing docstring in public class
torch/utils/viz/_cycles.py:262 in public function `create_graph`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:308 in public function `escape`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:312 in public function `is_cuda_tensor`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:315 in public function `cuda_allocation_context`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:335 in public function `to_dot`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:406 in public function `to_html`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:416 in public function `observe_tensor_cycles`:
        D103: Missing docstring in public function
9

torch/distributed/argparse_util.py
torch/distributed/argparse_util.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/argparse_util.py:13 in public class `env`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/argparse_util.py:13 in public class `env`:
        D400: First line should end with a period (not 'g')
torch/distributed/argparse_util.py:13 in public class `env`:
        D412: No blank lines allowed between a section header and its content ('Example')
torch/distributed/argparse_util.py:43 in public method `__init__`:
        D107: Missing docstring in __init__
torch/distributed/argparse_util.py:56 in public method `__call__`:
        D102: Missing docstring in public method
torch/distributed/argparse_util.py:61 in public class `check_env`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/argparse_util.py:61 in public class `check_env`:
        D400: First line should end with a period (not 's')
torch/distributed/argparse_util.py:61 in public class `check_env`:
        D412: No blank lines allowed between a section header and its content ('Example')
torch/distributed/argparse_util.py:97 in public method `__init__`:
        D107: Missing docstring in __init__
torch/distributed/argparse_util.py:102 in public method `__call__`:
        D102: Missing docstring in public method
11
torch/distributed/argparse_util.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/argparse_util.py:43 in public method `__init__`:
        D107: Missing docstring in __init__
torch/distributed/argparse_util.py:56 in public method `__call__`:
        D102: Missing docstring in public method
torch/distributed/argparse_util.py:97 in public method `__init__`:
        D107: Missing docstring in __init__
torch/distributed/argparse_util.py:102 in public method `__call__`:
        D102: Missing docstring in public method
5

torch/distributed/_composable_state.py
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
        D202: No blank lines allowed after function docstring (found 1)
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
        D400: First line should end with a period (not '`')
3
0

torch/distributed/launch.py
torch/distributed/launch.py:1 at module level:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/launch.py:1 at module level:
        D400: First line should end with a period (not 'd')
torch/distributed/launch.py:156 in public function `parse_args`:
        D103: Missing docstring in public function
torch/distributed/launch.py:171 in public function `launch`:
        D103: Missing docstring in public function
torch/distributed/launch.py:180 in public function `main`:
        D103: Missing docstring in public function
5
torch/distributed/launch.py:157 in public function `parse_args`:
        D103: Missing docstring in public function
torch/distributed/launch.py:172 in public function `launch`:
        D103: Missing docstring in public function
torch/distributed/launch.py:181 in public function `main`:
        D103: Missing docstring in public function
3

torch/distributed/remote_device.py
torch/distributed/remote_device.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/remote_device.py:81 in private method `worker_name`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:81 in private method `worker_name`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/distributed/remote_device.py:88 in private method `rank`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:88 in private method `rank`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/distributed/remote_device.py:95 in private method `device`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/distributed/remote_device.py:95 in private method `device`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
7
torch/distributed/remote_device.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/remote_device.py:85 in private method `rank`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:85 in private method `rank`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
3

torch/distributed/rendezvous.py
torch/distributed/rendezvous.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/rendezvous.py:23 in public function `register_rendezvous_handler`:
        D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/distributed/rendezvous.py:88 in public function `rendezvous`:
        D103: Missing docstring in public function
torch/distributed/rendezvous.py:147 in private function `_create_c10d_store`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/rendezvous.py:147 in private function `_create_c10d_store`:
        D400: First line should end with a period (not 'r')
5
torch/distributed/rendezvous.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/rendezvous.py:89 in public function `rendezvous`:
        D103: Missing docstring in public function
2

torch/distributed/run.py
torch/distributed/run.py:9 at module level:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:9 at module level:
        D400: First line should end with a period (not '`')
torch/distributed/run.py:393 in public function `get_args_parser`:
        D202: No blank lines allowed after function docstring (found 1)
torch/distributed/run.py:393 in public function `get_args_parser`:
        D401: First line should be in imperative mood; try rephrasing (found 'Helper')
torch/distributed/run.py:610 in public function `parse_args`:
        D103: Missing docstring in public function
torch/distributed/run.py:615 in public function `parse_min_max_nnodes`:
        D103: Missing docstring in public function
torch/distributed/run.py:629 in public function `determine_local_world_size`:
        D103: Missing docstring in public function
torch/distributed/run.py:670 in public function `get_rdzv_endpoint`:
        D103: Missing docstring in public function
torch/distributed/run.py:677 in public function `get_use_env`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:677 in public function `get_use_env`:
        D401: First line should be in imperative mood (perhaps 'Retrieve', not 'Retrieves')
torch/distributed/run.py:689 in public function `config_from_args`:
        D103: Missing docstring in public function
torch/distributed/run.py:770 in public function `run_script_path`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:770 in public function `run_script_path`:
        D401: First line should be in imperative mood (perhaps 'Run', not 'Runs')
torch/distributed/run.py:781 in public function `run`:
        D103: Missing docstring in public function
torch/distributed/run.py:804 in public function `main`:
        D103: Missing docstring in public function
15
torch/distributed/run.py:611 in public function `parse_args`:
        D103: Missing docstring in public function
torch/distributed/run.py:616 in public function `parse_min_max_nnodes`:
        D103: Missing docstring in public function
torch/distributed/run.py:630 in public function `determine_local_world_size`:
        D103: Missing docstring in public function
torch/distributed/run.py:671 in public function `get_rdzv_endpoint`:
        D103: Missing docstring in public function
torch/distributed/run.py:691 in public function `config_from_args`:
        D103: Missing docstring in public function
torch/distributed/run.py:784 in public function `run`:
        D103: Missing docstring in public function
torch/distributed/run.py:807 in public function `main`:
        D103: Missing docstring in public function
7

torch/distributed/__init__.py
torch/distributed/__init__.py:1 at module level:
        D104: Missing docstring in public package
torch/distributed/__init__.py:8 in public function `is_available`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/__init__.py:8 in public function `is_available`:
        D400: First line should end with a period (not ',')
torch/distributed/__init__.py:8 in public function `is_available`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
4
torch/distributed/__init__.py:1 at module level:
        D104: Missing docstring in public package
1

torch/distributed/utils.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/utils.py:16 in private function `_pack_kwargs`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:16 in private function `_pack_kwargs`:
        D400: First line should end with a period (not ')')
torch/distributed/utils.py:47 in private function `_cast_forward_inputs`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:88 in private function `_recursive_to`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/distributed/utils.py:141 in private function `_p_assert`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:141 in private function `_p_assert`:
        D209: Multi-line docstring closing quotes should be on a separate line
torch/distributed/utils.py:141 in private function `_p_assert`:
        D400: First line should end with a period (not 't')
torch/distributed/utils.py:141 in private function `_p_assert`:
        D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/distributed/utils.py:275 in private function `_sync_module_states`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:275 in private function `_sync_module_states`:
        D400: First line should end with a period (not 'n')
torch/distributed/utils.py:275 in private function `_sync_module_states`:
        D401: First line should be in imperative mood (perhaps 'Sync', not 'Syncs')
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
        D400: First line should end with a period (not 'y')
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
        D401: First line should be in imperative mood (perhaps 'Synchronize', not 'Synchronizes')
15
torch/distributed/utils.py:1 at module level:
        D100: Missing docstring in public module
1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112953
Approved by: https://github.com/weifengpy
2023-11-08 01:13:09 +00:00
Kurman Karabukaev
bae8506589 [TorchElastic] Add option to configure log prefix for each rank (#112357)
Summary:
Add the ability to customize log lines and additional template-like behavior to enrich log information.

Motivation:
a) Log stream processing/aggregation gains additional value when it includes information about the global rank. An extension of that is that it will be easier to map ranks to hosts from log stream information (less relevant at the moment).
b) Users can easily map the failure to the right rank without matching node rank offset+local rank.

Implementation
- BC change: keeps the log line prefix as `[<role name><local rank>]:`
- Optional env variable TORCHELASTIC_LOG_LINE_HEADER will be used as a prefix when specified; it currently exposes `role_name`, `rank`, and `local_rank` variables that will be bound when the agent assigns ranks.

Test Plan:
CI

https://fburl.com/mlhub/mzx5xspv

Differential Revision: D50584590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112357
Approved by: https://github.com/kiukchung
2023-11-08 01:00:26 +00:00
Kunal Bhalla
af229ecd34 [RFC] Change --standalone to bind to a random port (#107734)
Given that --standalone generates the rendezvous args anyway, it seems like it would be more convenient if it explicitly used a random port by default instead of trying to use 29400.

That way users can directly go with `--standalone` instead of having to spell out `--rdzv-backend=c10d --rdzv-endpoint=localhost:0`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107734
Approved by: https://github.com/H-Huang
2023-08-25 22:13:44 +00:00
shibo19
0af3203c72 fix torchrun script for custom device (#105443)
Fixes #ISSUE_NUMBER
As the title says, add torchrun support for custom devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105443
Approved by: https://github.com/kumpera
2023-07-31 05:46:23 +00:00
Edward Z. Yang
5a7aad9681 Convert logging f-strings to use % format, part four (#98705)
This does multi-line concatenated string literals.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98705
Approved by: https://github.com/voznesenskym
2023-04-11 13:17:59 +00:00
Edward Z. Yang
9a8f71f23e Convert logging f-strings to use % format (#98697)
Codemod done with
https://gist.github.com/ezyang/2e8b0463cdc6be278478495b23ff0530 with
assistance from ChatGPT.
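
A minimal before/after of the codemod (the log message here is made up, not from the diff):

```py
import logging

logger = logging.getLogger(__name__)

def announce(rank: int, group: str) -> None:
    # Before: logger.info(f"rank {rank} joined group {group}")  # formatted eagerly
    # After: %-style args are formatted lazily, only if the record is emitted.
    logger.info("rank %s joined group %s", rank, group)
```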

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98697
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
Kazuaki Ishizaki
6514d71add Fix typos under torch/distributed directory (#98225)
This PR fixes typos in comments and messages of `.py` files under the `torch/distributed` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98225
Approved by: https://github.com/soulitzer, https://github.com/kit1980
2023-04-05 00:21:33 +00:00
Kazuaki Ishizaki
35fd5c548e Fix typos under torch/distributed directory (#95638)
This PR fixes typos in comments and messages of `.py` files under the torch/distributed directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638
Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980
2023-03-27 21:13:44 +00:00
Jeffrey Dunn
d779dadda1 Remove stack trace captures from import (#97274)
Summary:
Calls to this function without an argument will get a stack trace at
import time. This is expensive; we can just skip it by passing in a value.

Test Plan: Wait for tests

Differential Revision: D44244345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97274
Approved by: https://github.com/kiukchung
2023-03-22 18:34:13 +00:00
Xuehai Pan
a229b4526f [BE] Prefer dash over underscore in command-line options (#94505)
Prefer dashes over underscores in command-line options. Add `--command-arg-name` to the argument parser. The old arguments with underscores (`--command_arg_name`) are kept for backward compatibility.

Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in their arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they appear to be the default choice in the Python standard library:

`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)

```python
class BooleanOptionalAction(Action):
    def __init__(...):
            if option_string.startswith('--'):
                option_string = '--no-' + option_string[2:]
                _option_strings.append(option_string)
```

It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the Shift or Caps Lock key, unlike `-`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-09 20:16:49 +00:00
Chris Zheng
5d37890b8e Update torchrun and TorchElastic to take an optional local_addr param to allow skipping local IP lookup if specified (#88922)
Summary:
Update dynamic rendezvous nodes to use the rendezvous hostname if provided.
For PR: https://github.com/pytorch/pytorch/issues/85300

Before:
For dynamic rendezvous, it always grabs the `fqdn` from the socket for each node, even if the user specified an address.
For example,
https://github.com/pytorch/pytorch/blob/master/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py#L248-L256
```
return _NodeDesc(socket.getfqdn(), os.getpid(), local_id)
```

Now:
If the user specifies a hostname, each node will respect the given hostname.
For example, `socket.getfqdn(<hostname>)`

Test Plan: Unit tests.

Differential Revision: D41204028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88922
Approved by: https://github.com/d4l3k
2022-12-21 03:55:01 +00:00