pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Lucas Pasqualin	8452f41305	Adds allreduce to inductor remap (#115950 ) Fixes #115728 Implements a rewrite path for allreduce Pull Request resolved: https://github.com/pytorch/pytorch/pull/115950 Approved by: https://github.com/wconstab	2023-12-18 22:00:22 +00:00
Chien-Chin Huang	8c57fde21f	Let all_reduce_coalesced accept one tensor as well (#115650 ) This diff introduces a change to the `all_reduce_coalesced` function in `distributed_c10d.py`. The function now accepts a single tensor as well as a list of tensors. This allows for more flexibility in the use of the function. This is just a syntax sugar for the compiler to use `all_reduce_coalesced` without worrying about converting the input to a list. Differential Revision: [D51433236](https://our.internmc.facebook.com/intern/diff/D51433236/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115650 Approved by: https://github.com/wconstab ghstack dependencies: #115523, #115302, #115648, #115649	2023-12-13 21:32:01 +00:00
Pavan Balaji	afa62d6237	[nccl-pg] Pass group global rank information to NCCL PG (#114736 ) We were only passing a subset of the group creation information to the NCCL PG. We are specifically missing the information on which global ranks belong to a particular PG. This allows the NCCL PG to use this additional information for things like better trace logging. Test Plan: OSS CI Reviewers: Subscribers: Tasks: Tags: Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/114736 Approved by: https://github.com/kwen2501	2023-12-13 18:02:51 +00:00
fduwjj	40ce9a4cfb	[c10d] Create a python c10d API _set_pg_timeout to set timeout (#115453 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115453 Approved by: https://github.com/wconstab, https://github.com/H-Huang	2023-12-12 20:52:43 +00:00
Howard Huang	99f06c0cc2	[BE] update errors to be more descriptive (#115443 ) we call `_check_single_tensor` and `_check_tensor_list` as validation but don't print out the param types that were invalid Pull Request resolved: https://github.com/pytorch/pytorch/pull/115443 Approved by: https://github.com/XilunWu	2023-12-11 21:21:10 +00:00
Chip Turner	937d616e82	Re-enable type checking for distributed_c10d.py (#115223 ) Re-enable type checking for distributed_c10d.py Type checking for distributed_c10d.py was inadvertently turned off in issues that have accumulated since. Note: the backwards compatibility linter does not like some of these changes. But they were incorrect before. This needs human verification, however. #suppress-api-compatibility-check Pull Request resolved: https://github.com/pytorch/pytorch/pull/115223 Approved by: https://github.com/wconstab	2023-12-09 11:07:54 +00:00
Chip Turner	78b945484b	[c10d] Extend NCCL communicator splitting to more use cases (#114916 ) Previously we could only use `ncclCommSplit` when we knew all backends were connected on all shards (due to the need to perform a NOCOLOR split), which in practice meant we could only use it for subgroups that were copies of the entire world. This change allows for specifying a bound device id to `init_process_group` which tells the pg and its backends that the specified device, and the specified device only, will be associated with this rank. This guarantee lets us do an early connect (which we could not previously do due to how ProcessGroupNCCL infers devices based on tensors and not the rank number). And by doing the early connect, we have the guarantee ranks are connected and can perform nocolor splits when needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114916 Approved by: https://github.com/kwen2501	2023-12-07 15:13:01 +00:00
Chip Turner	9cc040fef6	Switch env variable use in test harnesses to the non-deprecated names to fix warnings (#114880 ) Previously: ``` [W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt) [W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt) ``` With this PR, those warnings disappear. They were introduced in #114077 This change was generated with this sed script, applied with `sed -i -f /tmp/x */.{py,hpp,cpp,cc}` and hand inspected. ``` s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/114880 Approved by: https://github.com/kwen2501	2023-12-01 20:08:23 +00:00
Chip Turner	066e072524	Retry #112889 (Opportunistically use ncclCommSplit when creating new NCCL groups) (#114385 ) - [c10d] (retry) Opportunistically use `ncclCommSplit` when creating new NCCL groups (#112889) - Guard use of `split_from` with a `hasattr` check for cases when NCCL (or RCCL) lacks `ncclCommSplit` Fixes cause of revert of original PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/114385 Approved by: https://github.com/huydhn	2023-11-23 07:00:00 +00:00
PyTorch MergeBot	b927a4e2ca	Revert "Opportunistically use `ncclCommSplit` when creating new NCCL groups (#112889 )" This reverts commit `64a5372e6c`. Reverted https://github.com/pytorch/pytorch/pull/112889 on behalf of https://github.com/huydhn due to Sorry for reverting you change, but it is failing ROCm distributed jobs in trunk `4d07428ede` ([comment](https://github.com/pytorch/pytorch/pull/112889#issuecomment-1823214376))	2023-11-22 17:43:51 +00:00
Chip Turner	64a5372e6c	Opportunistically use `ncclCommSplit` when creating new NCCL groups (#112889 ) Currently `ncclCommInitRankConfig` is always used when creating new communicator groups. This is wasteful as it creates non-shared pairs of endpoint queues as well as costs time to re-establish communication. This change is transparent and opportunistic; when `dist.new_group` is called, it will use the existing, healthy world process group to select the right ranks to include in the process group. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112889 Approved by: https://github.com/kwen2501	2023-11-21 21:03:52 +00:00
Ke Wen	dc65f6c601	[c10d] Remove deprecated multi-gpu-per-thread APIs (#114156 ) As of today, PyTorch Distributed's preferred programming model is one device per thread, as exemplified by the APIs in its document. The multi-GPU functions (which stand for multiple GPUs per CPU thread) have been deprecated for three versions. Removing them now before 2.2 release. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114156 Approved by: https://github.com/albanD, https://github.com/fduwjj, https://github.com/H-Huang	2023-11-21 03:50:23 +00:00
Shengbao Zheng	e53da90fe6	[Execution Trace] record global rank in pg_config_info (#113316 ) Summary: pg_config_info is used to dump pg information in Execution Trace(ET). For trace analysis purpose and PARAM replay benchmark, global rank is more meaningful than group ranks. p.s. ranks is a map of global rank: group rank Test Plan: Tested in HPC Differential Revision: D51136587 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113316 Approved by: https://github.com/XilunWu	2023-11-09 20:04:43 +00:00
Ke Wen	bb7ac12cbf	[ProcessGroupNCCL] Avoid recording stream for broadcast and scatter (#112896 ) Summary: Follows PR #111431, save memory for DTensor init Test Plan: Sandcastle Reviewed By: wanchaol Differential Revision: D50985365 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112896 Approved by: https://github.com/wanchaol	2023-11-07 15:44:04 +00:00
Will Constable	ff51f94e32	[Reland] Fix default timeouts for python entrypoints (e.g. init_process_group) (#113094 ) Previous PRs changed the c++ default timeout for PGNccl, but this path was only hit in some cases, and the python defaults took over in other cases. This PR ensures that NCCL pg always default to the changed NCCL-specific timeout value. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113094 Approved by: https://github.com/fduwjj	2023-11-07 05:34:26 +00:00
PyTorch MergeBot	75adb9f371	Revert "Fix default timeouts for python entrypoints (e.g. init_process_group) (#112893 )" This reverts commit `f9d47e1381`. Reverted https://github.com/pytorch/pytorch/pull/112893 on behalf of https://github.com/clee2000 due to sorry this seems to have broken inductor `f9d47e1381` https://github.com/pytorch/pytorch/actions/runs/6776367936/job/18418174752 ([comment](https://github.com/pytorch/pytorch/pull/112893#issuecomment-1796979811))	2023-11-06 22:49:53 +00:00
Will Constable	f9d47e1381	Fix default timeouts for python entrypoints (e.g. init_process_group) (#112893 ) Previous PRs changed the c++ default timeout for PGNccl, but this path was only hit in some cases, and the python defaults took over in other cases. This PR ensures that NCCL pg always default to the changed NCCL-specific timeout value. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112893 Approved by: https://github.com/xw285cornell, https://github.com/kwen2501, https://github.com/XilunWu ghstack dependencies: #112611, #112803	2023-11-06 20:48:39 +00:00
Sahdev Zala	c6ecd018d5	Fix docstring errors (#112693 ) This PR reduces docstring erros to 0 from total 128. This can be verified by running, pydocstyle path-to-distributed_c10d.py --count Where, path-to-distributed_c10d.py is `torch/distributed/distributed_c10d.py` BEFORE the PR: `pydocstyle torch/distributed/distributed_c10d.py --count` 128 AFTER the PR: `pydocstyle torch/distributed/distributed_c10d.py --count` 0 Fixes #112640 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112693 Approved by: https://github.com/H-Huang	2023-11-06 18:45:05 +00:00
Will Constable	65b74c9254	Make init_process_group timeout kwarg override pg_options (#112611 ) This used to be ambiguous but the pg_options._timeout value, if passed in, is being ignored. Make it sane and warn if 2 values are provided. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112611 Approved by: https://github.com/H-Huang	2023-11-03 23:13:03 +00:00
Aaron Gokaslan	cb856b08b2	[BE]: Attach cause to some exceptions and enable RUFF TRY200 (#111496 ) Did some easy fixes from enabling TRY200. Most of these seem like oversights instead of intentional. The proper way to silence intentional errors is with `from None` to note that you thought about whether it should contain the cause and decided against it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111496 Approved by: https://github.com/malfet	2023-10-19 21:56:36 +00:00
Shengbao Zheng	8899abde32	[PyTorch][ET] Improve Process Groups Mapping Info Collection (#110908 ) Summary: Process Groups Mapping info collection was introduced in D46321690. Improve the mapping info collected there: - replace pg_id (a unique ID for the PG object) with pg_names (a unique name for each pg and shared by all ranks) - add number of pg info with group_count - reduce the length of pg_config_info to avoid being truncated(max length of 4096, now doubled ) by - migrating ranks(a map from global ranks to group ranks) with the list of global ranks of a pg, since we currently don't use group rank id - using an empty rank list to indicate that all ranks are involved in a pg and adding a field of group_size to show how many ranks are involved Test Plan: Tested in HPC ``` buck2 run mode/opt //hpc/torchrec/models/ads:cmf_10x_launcher -- launcher=local data_loader=random data_loader.num_batches=100 checkpoint=model_store max_ind_range=10 launcher.num_trainers=8 ``` Example output in ET ``` { "name": "## process_group:init ##", "id": 3, "rf_id": 1, "parent": 2, "fw_parent": 0, "seq_id": -1, "scope": 7, "tid": 1, "fw_tid": 0, "op_schema": "", "inputs": ["[{\"pg_name\": \"0\", \"backend_id\": 140688385794048, \"backend_config\": \"cuda:nccl\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"1\", \"backend_id\": 140688386762752, \"backend_config\": \"cuda:nccl\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"2\", \"backend_id\": 140682531798720, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"faa29c0b1e06cd7abc873bd561414911_0\", \"backend_id\": 140672678002688, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"3\", \"backend_id\": 140672678007616, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"faa29c0b1e06cd7abc873bd561414911_1\", \"backend_id\": 140672678012544, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}]"], "input_shapes": [[]], "input_types": ["String"], "outputs": [], "output_shapes": [], "output_types": [] }, ``` Before the change, pg_config_info of >128 rank will be truncated, e.g. ``` "inputs": ["[{\"pg_id\": 140321146893696, \"backend_id\": 140321113854976, \"backend_config\": \"cuda:nccl\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7, \"8\": 8, \"9\": 9, \"10\": 10, \"11\": 11, \"12\": 12, \"13\": 13, \"14\": 14, \"15\": 15, \"16\": 16, \"17\": 17, \"18\": 18, \"19\": 19, \"20\": 20, \"21\": 21, \"22\": 22, \"23\": 23, \"24\": 24, \"25\": 25, \"26\": 26, \"27\": 27, \"28\": 28, \"29\": 29, \"30\": 30, \"31\": 31, \"32\": 32, \"33\": 33, \"34\": 34, \"35\": 35, \"36\": 36, \"37\": 37, \"38\": 38, \"39\": 39, \"40\": 40, \"41\": 41, \"42\": 42, \"43\": 43, \"44\": 44, \"45\": 45, \"46\": 46, \"47\": 47, \"48\": 48, \"49\": 49, \"50\": 50, \"51\": 51, \"52\": 52, \"53\": 53, \"54\": 54, \"55\": 55, \"56\": 56, \"57\": 57, \"58\": 58, \"59\": 59, \"60\": 60, \"61\": 61, \"62\": 62, \"63\": 63, \"64\": 64, \"65\": 65, \"66\": 66, \"67\": 67, \"68\": 68, \"69\": 69, \"70\": 70, \"71\": 71, \"72\": 72, \"73\": 73, \"74\": 74, \"75\": 75, \"76\": 76, \"77\": 77, \"78\": 78, \"79\": 79, \"80\": 80, \"81\": 81, \"82\": 82, \"83\": 83, \"84\": 84, \"85\": 85, \"86\": 86, \"87\": 87, \"88\": 88, \"89\": 89, \"90\": 90, \"91\": 91, \"92\": 92, \"93\": 93, \"94\": 94, \"95\": 95, \"96\": 96, \"97\": 97, \"98\": 98, \"99\": 99, \"100\": 100, \"101\": 101, \"102\": 102, \"103\": 103, \"104\": 104, \"105\": 105, \"106\": 106, \"107\": 107, \"108\": 108, \"109\": 109, \"110\": 110, \"111\": 111, \"112\": 112, \"113\": 113, \"114\": 114, \"115\": 115, \"116\": 116, \"117\": 117, \"118\": 118, \"119\": 119, \"120\": 120, \"121\": 121, \"122\": 122, \"123\": 123, \"124\": 124, \"125\": 125, \"126\": 126, \"127\": 127}}, {\"pg_id\": 140321074662400, \"backend_id\": 140321100033024, \"backend_config\": \"cuda:nccl\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7, \"8\": 8, \"9\": 9, \"10\": 10, \"11\": 11, \"12\": 12, \"13\": 13, \"14\": 14, \"15\": 15, \"16\": 16, \"17\": 17, \"18\": 18, \"19\": 19, \"20\": 20, \"21\": 21, \"22\": 22, \"23\": 23, \"24\": 24, \"25\": 25, \"26\": 26, \"27\": 27, \"28\": 28, \"29\": 29, \"30\": 30, \"31\": 31, \"32\": 32, \"33\": 33, \"34\": 34, \"35\": 35, \"36\": 36, \"37\": 37, \"38\": 38, \"39\": 39, \"40\": 40, \"41\": 41, \"42\": 42, \"43\": 43, \"44\": 44, \"45\": 45, \"46\": 46, \"47\": 47, \"48\": 48, \"49\": 49, \"50\": 50, \"51\": 51, \"52\": 52, \"53\": 53, \"54\": 54, \"55\": 55, \"56\": 56, \"57\": 57, \"58\": 58, \"59\": 59, \"60\": 60, \"61\": 61, \"62\": 62, \"63\": 63, \"64\": 64, \"65\": 65, \"66\": 66, \"67\": 67, \"68\": 68, \"69\": 69, \"70\": 70, \"71\": 71, \"72\": 72, \"73\": 73, \"74\": 74, \"75\": 75, \"76\": 76, \"77\": 77, \"78\": 78, \"79\": 79, \"80\": 80, \"81\": 81, \"82\": 82, \"83\": 83, \"84\": 84, \"85\": 85, \"86\": 86, \"87\": 87, \"88\": 88, \"89\": 89, \"90\": 90, \"91\": 91, \"92\": 92, \"93\": 93, \"94\": 94, \"95\": 95, \"96\": 96, \"97\": 97, \"98\": 98, \"99\": 99, \"100\": 100, \"101\": 101, \"102\": 102, \"103\": 103, \"104\": 104, \"105\": 105, \"106\": 106, \"107\": 107, \"108\": 108, \"109\": 109, \"110\": 110, \"111\": 111, \"112\": 112, \"113\": 113, \"114\": 114, \"115\": 115, \"116\": 116, \"117\": 117, \"118\": 118, \"119\": 119, \"120\": 120, \"121\": 121, \"122\": 122, \"123\": 123, \"124\": 124, \"125\": 125, \"126\": 126, \"127\": 127}}, {\"pg_id\": 140321154994304, \"backend_id\": 140319780290048, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7, \"8\": 8, \"9\": 9, \"10\": 10, \"11\": 11, \"12\": 12, \"13\": 13, \"14\": 14, \"15\": 15, \"16\": 16, \"17\": 17, \"18\": 18, \"19\": 19, \"20\": 20, \"21\": 21, \"22\": 22, \"23\": 23, \"24\": 24, \"25\": 25, \"26\": 26, \"27\": 27, \"28\": 28, \"29\": 29, \"30\": 30, \"31\": 31, \"32\": 32, \"33\": 33, \"34\": 34, \"35\": 35, \"36\": 36, \"37\": 37, \"38\": 38, \"39\": 39, \"40\": 40, \"41\": 41, \"42\": 42, \"43\": 43, \"44\": 44, \"45\": 45, \"46\": 46, \"47\": 47, \"48\": 48, \"49\": 49, \"50\": 50, \"51\": 51, \"52\": 52, \"53\": 53, \"54\": 54, \"55\": 55, \"56\": 56, \"57\": 57, \"58\": 58, \"59\": 59, \"60\": 60, \"61\": 61, \"62\": 62, \"63\": 63, \"64\": 64, \"65\": 65, \"66\": 66, \"67\": 67, \"68\": 68, \"69\": 69, \"70\": 70, \"71\": 71, \"72\": 72, \"73\": 73, \"74\": 74, \"75\": 75, \"76\": 76, \"77\": 77, \"78\": 78, \"79\": 79, \"80\": 80, \"81\": 81, \"82\": 82, \"83\": 83, \"84\": 84, \"85\": 85, \"86\": 86, \"87\": 87, \"88\": 88, \"89\": 89, \"90\": 90, \"91\": 91, \"92\": 92, \"93\": 93, \"94\": 94, \"95\": 95, \"96\": 96, \"97\": 97, \"98\": 98, \"99\": 99, \"100\": 100, \"101\": 101, \"102\": 102, \"103\": 103, \"104\": 104, \"105\": 105, \"106\": 106, \"107\": 107, \"108\": 108, \"109\": 109, \"110\": 110, \"111\": 111, \"112\": 112, \"113\": 113, \"114\""], "input_shapes": [[]], "input_types": ["String"], ``` After the change the length reduced ``` "inputs": ["[{\"pg_name\": \"0\", \"backend_id\": 140551405059072, \"backend_config\": \"cuda:nccl\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"1\", \"backend_id\": 140551399745536, \"backend_config\": \"cuda:nccl\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"2\", \"backend_id\": 140578999821184, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"ea2f9024c70c8b9a25bc06a4723e5805_0\", \"backend_id\": 140559197777152, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"3\", \"backend_id\": 140549119076736, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"ea2f9024c70c8b9a25bc06a4723e5805_1\", \"backend_id\": 140571995143424, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}]"], "input_shapes": [[]], "input_types": ["String"], ``` Reviewed By: louisfeng, fduwjj Differential Revision: D50048147 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110908 Approved by: https://github.com/fduwjj	2023-10-19 21:37:19 +00:00
Ke Wen	18cc8a92ac	[ProcessGroupNCCL] Avoid recording stream for synchronous ops (#111431 ) For synchronous ops (i.e. `asyncOp = False`), we don't want to record streams because we know that the NCCL stream will join back to the "current" stream right after this op. So we might just as well keep the stream ownership of the input/output tensors unchanged. The benefit would be that the allocation/free of the tensors would look deterministic to the "current" stream so that the caching allocator can reuse memory pool for this stream in a clever way. To prevent the input/output tensors from being recycled by python, we rely on the stashing mechanism in ProcessGroupNCCL (which can be also turned on by setting `TORCH_NCCL_AVOID_RECORD_STREAMS=1`). This mechanism change is for libraries like FSDP which uses `all_gather_into_tensor` and `reduce_scatter_tensor` in a synchronous way and which cannot set `TORCH_NCCL_AVOID_RECORD_STREAMS=1` for their users. And therefore, this change is limited to these two collectives for now. Cc: @awgu @janeyx99 @albanD Pull Request resolved: https://github.com/pytorch/pytorch/pull/111431 Approved by: https://github.com/H-Huang	2023-10-19 00:41:09 +00:00
PyTorch MergeBot	1e70f4d02c	Revert "Reland #2 "[C10] PG observability hooks. (#108815 , #110907 )" (#111072 )" This reverts commit `bb1424d46e`. Reverted https://github.com/pytorch/pytorch/pull/111072 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111072#issuecomment-1765399829))	2023-10-16 23:03:26 +00:00
Will Constable	bb1424d46e	Reland #2 "[C10] PG observability hooks. (#108815 , #110907 )" (#111072 ) This reverts commit `314a502eb0`. Changes since original PR: Reland 1 * rename torch.distributed.hooks to torch.distributed._hooks Reland 2 * make _hooks importable even if !distributed.is_available() * handle cuda driver exit intermittent failure caused by new cuda api usage in callback caller (see prev PR in stack) (original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below) Expose a set of observability hooks into C10D such that our users can detect collectives failure both faster and more easily. The design is similar to NCCL desync debug that it minimized the overhead by doing most of the work out of the main thread. This PR introduces a new module torch.distributed.hooks that exposes the following set of methods: register_collective_start_hook register_collective_end_hook register_process_group_hook The process group hook exposes PG creation on the member ranks and call them inline from the the PG creation code. This is fine since this happens during initialization and a limited number of times. The collective start/end hooks are fired from a single background thread. It reads events from a C++ queue and dispatches over. Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown and have it as background thread. This is not possible with more reasonable choices like a condvar. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111072 Approved by: https://github.com/malfet ghstack dependencies: #111061	2023-10-12 16:59:23 +00:00
PyTorch MergeBot	314a502eb0	Revert "Reland "[C10] PG observability hooks. (#108815 )" (#110907 )" This reverts commit `7678cd22af`. Reverted https://github.com/pytorch/pytorch/pull/110907 on behalf of https://github.com/huydhn due to Sorry for reverting this, but macos job in trunk starts failing after this `7678cd22af` ([comment](https://github.com/pytorch/pytorch/pull/110907#issuecomment-1756497387))	2023-10-11 00:23:42 +00:00
Will Constable	7678cd22af	Reland "[C10] PG observability hooks. (#108815 )" (#110907 ) This reverts commit `ff0358b038`. (original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below) Expose a set of observability hooks into C10D such that our users can detect collectives failure both faster and more easily. The design is similar to NCCL desync debug that it minimized the overhead by doing most of the work out of the main thread. This PR introduces a new module torch.distributed.hooks that exposes the following set of methods: register_collective_start_hook register_collective_end_hook register_process_group_hook The process group hook exposes PG creation on the member ranks and call them inline from the the PG creation code. This is fine since this happens during initialization and a limited number of times. The collective start/end hooks are fired from a single background thread. It reads events from a C++ queue and dispatches over. Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown and have it as background thread. This is not possible with more reasonable choices like a condvar. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907 Approved by: https://github.com/fduwjj	2023-10-10 20:09:40 +00:00
Edward Z. Yang	de3ae93e9b	Include rank of default PG in C++ log messages (#110623 ) I tested by adding some warning logs in C++, run a distributed program and show that they now had `[rank0]:` in the messages. There is no existing test infra for C++ logging so I couldn't easily add a unit test. The implementation strategy is to setup a global variable in C++, and then poke it when we initialize a process group. This was the simplest thing I could think of that would work. This PR only works for non-glog logging. Probably need to come up with some other strategy for glog, e.g., a custom prefix, but need to make sure this doesn't conflict with fbcode. I can't easily test this from OSS, will leave as follow up work. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/110623 Approved by: https://github.com/voznesenskym, https://github.com/wanchaol, https://github.com/fduwjj	2023-10-10 00:26:52 +00:00
Kazuaki Ishizaki	b5f9696d81	Fix typo under torch directory (#110824 ) This PR fixes typo `the the` of comments and exception messages in files under `torch` directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110824 Approved by: https://github.com/H-Huang	2023-10-09 19:16:43 +00:00
PyTorch MergeBot	ff0358b038	Revert "[C10] PG observability hooks. (#108815 )" This reverts commit `0c7a877745`. Reverted https://github.com/pytorch/pytorch/pull/108815 on behalf of https://github.com/albanD due to Add a new torch.distributed.hooks namespace but does not document it, test was added this morning ([comment](https://github.com/pytorch/pytorch/pull/108815#issuecomment-1751327751))	2023-10-06 19:49:49 +00:00
Rodrigo Kumpera	0c7a877745	[C10] PG observability hooks. (#108815 ) Expose a set of observability hooks into C10D such that our users can detect collectives failure both faster and more easily. The design is similar to NCCL desync debug that it minimized the overhead by doing most of the work out of the main thread. This PR introduces a new module torch.distributed.hooks that exposes the following set of methods: register_collective_start_hook register_collective_end_hook register_process_group_hook The process group hook exposes PG creation on the member ranks and call them inline from the the PG creation code. This is fine since this happens during initialization and a limited number of times. The collective start/end hooks are fired from a single background thread. It reads events from a C++ queue and dispatches over. Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown and have it as background thread. This is not possible with more reasonable choices like a condvar. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815 Approved by: https://github.com/wconstab, https://github.com/fduwjj	2023-10-06 18:52:46 +00:00
Howard Huang	0949d97c16	fix batch_isend_irecv example incorrect usage (#110408 ) mismatched dtypes silently leads to wrong outputs in nccl ``` 1:recv_tensor=tensor([0., 0.], device='cuda:1') 0:recv_tensor=tensor([2.8026e-45, 0.0000e+00], device='cuda:0') ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/110408 Approved by: https://github.com/awgu, https://github.com/Neilblaze	2023-10-04 22:57:03 +00:00
Rohan Varma	40be6b72e1	[ez] Type function in distributed_c10d (#110435 ) This function returns a `torch.device`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110435 Approved by: https://github.com/awgu	2023-10-03 17:54:04 +00:00
Rodrigo Kumpera	c26270c733	[C10D] Even more store scalability work. (#109218 ) Fix a bug socket.cpp in timeout detection that only shows up with 10k ranks. Make the minimum wait time in _store_based_barrier to be adaptative based on the number of ranks. Longer timeouts give more room for the store to do productive work when swamped. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109218 Approved by: https://github.com/XilunWu ghstack dependencies: #109217	2023-09-22 21:27:09 +00:00
Howard Huang	600d0d0284	Add "cuda" to MPI backend capabilities (#109614 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/109543 Test Plan: We need to run CUDA aware MPI in PyTorch to actually test this change, we currently have no MPI tests. Differential Revision: D49420438 Pull Request resolved: https://github.com/pytorch/pytorch/pull/109614 Approved by: https://github.com/XilunWu	2023-09-21 13:34:58 +00:00
Rodrigo Kumpera	881bfbf21d	[c10d] Add tests for usig libuv through init_process_group. (#108661 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108661 Approved by: https://github.com/XilunWu, https://github.com/fduwjj	2023-09-20 16:02:20 +00:00
Rodrigo Kumpera	2bca5f2af7	[C10D] Track pg name in c++. (#108813 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108813 Approved by: https://github.com/wconstab	2023-09-15 01:10:29 +00:00
Brian Vaughan	bb14805bcd	fix an incorrect indent in documentation (#108273 ) doc for `torch.distributed.send(tensor, dst, group=None, tag=0)` was rendering incorrectly here: https://pytorch.org/docs/stable/distributed.html due to lack of indent (it was interpreting the continuation as a new argument). Pull Request resolved: https://github.com/pytorch/pytorch/pull/108273 Approved by: https://github.com/awgu, https://github.com/kit1980	2023-09-11 21:27:52 +00:00
Pritam Damania	704b0b3c67	[RESUBMIT] Standardize on error types for distributed errors. (#108191 ) We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc. This results in messy code during error handling somewhat like this: ``` if "NCCL" in exception_str: ... if "Timed out initializing process group in store based barrier on rank" in exception_str: ... if "The client socket has timed out after" in exception_str: ... if "Broken pipe" in exception_str: ... if "Connection reset by peer" in exception_str: ... ``` To address this issue, in this PR I've ensured added these error types: 1. DistError - the base type of all distributed errors 2. DistBackendError - this already existed and referred to PG backend errors 3. DistStoreError - for errors originating from the store 4. DistNetworkError - for general network errors coming from the socket library Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191 Approved by: https://github.com/H-Huang	2023-08-30 21:47:39 +00:00
PyTorch MergeBot	d4ff06ec84	Revert "Standardize on error types for distributed errors. (#107651 )" This reverts commit `0e2317479b`. Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor test in trunk for one of its model moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))	2023-08-28 23:58:33 +00:00
Pritam Damania	0e2317479b	Standardize on error types for distributed errors. (#107651 ) We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc. This results in messy code during error handling somewhat like this: ``` if "NCCL" in exception_str: ... if "Timed out initializing process group in store based barrier on rank" in exception_str: ... if "The client socket has timed out after" in exception_str: ... if "Broken pipe" in exception_str: ... if "Connection reset by peer" in exception_str: ... ``` To address this issue, in this PR I've ensured added these error types: 1. DistError - the base type of all distributed errors 2. DistBackendError - this already existed and referred to PG backend errors 3. DistStoreError - for errors originating from the store 4. DistNetworkError - for general network errors coming from the socket library Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651 Approved by: https://github.com/H-Huang	2023-08-28 21:58:15 +00:00
wz337	264df88a2d	[C10D][Logger]Add more info to c10d logger (#107331 ) This PR adds pg_name and world_size to c10d logging. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107331 Approved by: https://github.com/kumpera	2023-08-28 15:10:56 +00:00
Codle	42738c56a0	Skip the extra copy operation in broadcast_object_list if tensor_list has only one element (#107509 ) The `broadcast_object_list` function can easily broadcast the state_dict of models/optimizers. However, the `torch.cat` operation performed within `broadcast_object_list` consumes an additional double amount of memory space. This means that only objects with a maximum memory occupancy of half the device capacity can be broadcasted. This PR improves usability by skipping the `torch.cat` operation on object_lists with only a single element. Before (30G tensor)： <img width="607" alt="image" src="https://github.com/pytorch/pytorch/assets/22362311/c0c67931-0851-4f27-81c1-0119c6cd2944"> After (46G tensor): <img width="600" alt="image" src="https://github.com/pytorch/pytorch/assets/22362311/90cd1536-be7c-43f4-82ef-257234afcfa5"> Test Code: ```python if __name__ == "__main__": dist.init_process_group(backend='nccl') torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count()) fake_tensor = torch.randn(30 * 1024 * 1024 * 1024 // 4) if dist.get_rank() == 0: state_dict = {"fake_tensor": fake_tensor} else: state_dict = {} object_list = [state_dict] dist.broadcast_object_list(object_list, src=0) print("Rank: ", dist.get_rank(), " Broadcasted Object: ", object_list[0].keys()) dist.barrier() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/107509 Approved by: https://github.com/awgu	2023-08-23 17:19:10 +00:00
Aaron Gokaslan	660e8060ad	[BE]: Update ruff to 0.285 (#107519 ) This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings. I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519 Approved by: https://github.com/ezyang	2023-08-22 23:16:38 +00:00
PyTorch MergeBot	d59a6864fb	Revert "[BE]: Update ruff to 0.285 (#107519 )" This reverts commit `88ab3e4322`. Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please hep them get unblocked? It seems like one of the strings was prob accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))	2023-08-22 19:53:32 +00:00
Aaron Gokaslan	88ab3e4322	[BE]: Update ruff to 0.285 (#107519 ) This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings. I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519 Approved by: https://github.com/ezyang	2023-08-20 01:36:18 +00:00
Rodrigo Kumpera	bbf03561a9	[functional collectives] Move back to registering finalizers on wrappers. (#107250 ) We cannot use inner tensors for finalizers as they are uncollective until waited. This PR adds a bunch of tests for the observable behavior we want, including the necessary scafold for us to test code for their waitiness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107250 Approved by: https://github.com/wconstab	2023-08-17 21:08:28 +00:00
Shen Li	45128ab67c	[Reland] Add OnCompletion Hook to ProcessGroup (#106988 ) (#107233 ) This allows infra/trainers to get detailed stats about communication efficiencies without know anything about what model or distributed training paradigms have been used. This is helpful as infra/trainer package usually prefers to be as model/algorithm agnostic as possible. Therefore, we cannot assume that infra/trainer can have access to all collectives used by the model authors. This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which will be fired on every work completion event. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107233 Approved by: https://github.com/kumpera	2023-08-15 17:35:14 +00:00
PyTorch MergeBot	fd214aa8be	Revert "Add OnCompletion Hook to ProcessGroup (#106988 )" This reverts commit `ba1da47e8f`. Reverted https://github.com/pytorch/pytorch/pull/106988 on behalf of https://github.com/huydhn due to Sorry for reverting you change, but it is failing Windows build with some linker error. The Windows failures on PR looks legit ([comment](https://github.com/pytorch/pytorch/pull/106988#issuecomment-1678580899))	2023-08-15 08:24:33 +00:00
Shen Li	ba1da47e8f	Add OnCompletion Hook to ProcessGroup (#106988 ) This allows infra/trainers to get detailed stats about communication efficiencies without know anything about what model or distributed training paradigms have been used. This is helpful as infra/trainer package usually prefers to be as model/algorithm agnostic as possible. Therefore, we cannot assume that infra/trainer can have access to all collectives used by the model authors. This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which will be fired on every work completion event. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106988 Approved by: https://github.com/kumpera, https://github.com/H-Huang ghstack dependencies: #107140, #107141, #107160	2023-08-15 04:32:23 +00:00
Bruce Jiang	2624da638d	Support third-party devices to use the init_process_group method with… (#107113 ) …out specifying the Backend When init_process_group is not been done before, it will automatically apply init_process_group within Devicemesh without specifying the backend. Thus, when a third-party device want to use Devicemesh without doing init_process_group before, there comes a problem. In this PR, add a default_device_backend_map for third-party device users to add their backends to this map when they register their backends to pytorch firstly. When doing init_process_group without parameter backend, it will init the backends in this map. Thus, a third-party user can use init_process_group method without specifying the Backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107113 Approved by: https://github.com/wanchaol	2023-08-15 03:46:07 +00:00

1 2 3 4 5 ...

283 Commits