Re-enable type checking for distributed_c10d.py
Type checking for distributed_c10d.py was inadvertently turned off, and type errors have accumulated since.
Note: the backwards-compatibility linter does not like some of these changes, but they were incorrect before. This still needs human verification, however.
#suppress-api-compatibility-check
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115223
Approved by: https://github.com/wconstab
Previously we could only use `ncclCommSplit` when we knew all backends were connected on all shards (due to the need to perform a NOCOLOR split), which in practice meant we could only use it for subgroups that were copies of the entire world.
This change allows specifying a bound device id to `init_process_group`, which tells the PG and its backends that the specified device, and only that device, will be associated with this rank.
This guarantee lets us do an early connect (which we could not previously do, because ProcessGroupNCCL infers devices from tensors rather than from the rank number). And by doing the early connect, we have the guarantee that ranks are connected and can perform no-color splits when needed.
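For illustration, a minimal sketch of the intended usage, assuming the bound device is passed via a `device_id` argument as described in this PR:
```python
import torch
import torch.distributed as dist

local_rank = 0  # typically int(os.environ["LOCAL_RANK"])
# Binding this rank to exactly one device lets the PG connect eagerly,
# which in turn makes no-color ncclCommSplit usable for later subgroups.
dist.init_process_group(
    "nccl",
    device_id=torch.device(f"cuda:{local_rank}"),
)
```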
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114916
Approved by: https://github.com/kwen2501
Previously:
```
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
```
With this PR, those warnings disappear. They were introduced in #114077
This change was generated with this sed script, applied with `sed -i -f /tmp/x **/*.{py,hpp,cpp,cc}` and hand inspected.
```
s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g
s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g
s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g
s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g
s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g
s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g
```
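For illustration, a hedged Python sketch of the fallback semantics behind the deprecation warning (the actual implementation is the C++ `getCvarInt` in Utils.hpp):
```python
import os
import warnings

def get_cvar(new_name: str, old_name: str, default: str) -> str:
    # Prefer the TORCH_-prefixed variable; fall back to the deprecated
    # name with a warning, mirroring the behavior described above.
    if new_name in os.environ:
        return os.environ[new_name]
    if old_name in os.environ:
        warnings.warn(
            f"Environment variable {old_name} is deprecated; "
            f"use {new_name} instead"
        )
        return os.environ[old_name]
    return default
```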
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114880
Approved by: https://github.com/kwen2501
- [c10d] (retry) Opportunistically use `ncclCommSplit` when creating new NCCL groups (#112889)
- Guard use of `split_from` with a `hasattr` check for cases when NCCL (or RCCL) lacks `ncclCommSplit`
Fixes cause of revert of original PR
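A hedged sketch of the guard described above (the accessor and option names illustrate the pattern and are not a verbatim excerpt):
```python
import torch
import torch.distributed as dist

def make_split_options(parent_pg):
    # `split_from` only exists when the NCCL/RCCL build provides ncclCommSplit.
    opts = dist.ProcessGroupNCCL.Options()
    if hasattr(opts, "split_from"):
        # assumed accessor: fetch the parent's NCCL backend to split from
        opts.split_from = parent_pg._get_backend(torch.device("cuda"))
    return opts
```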
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114385
Approved by: https://github.com/huydhn
Currently `ncclCommInitRankConfig` is always used when creating new
communicator groups. This is wasteful, as it creates non-shared pairs
of endpoint queues and costs time to re-establish communication.
This change is transparent and opportunistic; when `dist.new_group` is
called, it will use the existing, healthy world process group to
select the right ranks to include in the process group.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112889
Approved by: https://github.com/kwen2501
Summary:
pg_config_info is used to dump PG information into the Execution Trace (ET). For trace analysis purposes and the PARAM replay benchmark, global ranks are more meaningful than group ranks.
(Note: `ranks` is a map from global rank to group rank.)
Test Plan: Tested in HPC
Differential Revision: D51136587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113316
Approved by: https://github.com/XilunWu
Previous PRs changed the C++ default timeout for ProcessGroupNCCL, but this path
was only hit in some cases, and the Python defaults took over in others.
This PR ensures that NCCL PGs always default to the changed NCCL-specific
timeout value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113094
Approved by: https://github.com/fduwjj
This PR reduces docstring errors from a total of 128 to 0. This can be verified by running `pydocstyle path-to-distributed_c10d.py --count`,
where path-to-distributed_c10d.py is `torch/distributed/distributed_c10d.py`.
BEFORE the PR:
`pydocstyle torch/distributed/distributed_c10d.py --count`
128
AFTER the PR:
`pydocstyle torch/distributed/distributed_c10d.py --count`
0
Fixes #112640
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112693
Approved by: https://github.com/H-Huang
Did some easy fixes from enabling TRY200. Most of these look like oversights rather than intentional choices. The proper way to silence intentional cases is `raise ... from None`, which signals that you considered whether the exception should carry its cause and decided against it.
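A brief sketch of the pattern TRY200 enforces (the example names are illustrative):
```python
import json

def load_config(path: str) -> dict:
    try:
        with open(path) as f:
            return json.load(f)
    except OSError as err:
        # TRY200-compliant: chain the original cause for debuggability.
        raise RuntimeError(f"could not load config from {path}") from err
    except json.JSONDecodeError:
        # Intentionally suppressing the cause: be explicit with `from None`.
        raise RuntimeError("config is malformed") from None
```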
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111496
Approved by: https://github.com/malfet
For synchronous ops (i.e. `asyncOp = False`), we don't want to record streams because we know that the NCCL stream will join back to the "current" stream right after this op. So we might just as well keep the stream ownership of the input/output tensors unchanged. The benefit would be that the allocation/free of the tensors would look deterministic to the "current" stream so that the caching allocator can reuse memory pool for this stream in a clever way.
To prevent the input/output tensors from being recycled by python, we rely on the stashing mechanism in ProcessGroupNCCL (which can be also turned on by setting `TORCH_NCCL_AVOID_RECORD_STREAMS=1`).
This mechanism change is for libraries like FSDP, which use `all_gather_into_tensor` and `reduce_scatter_tensor` in a synchronous way and cannot set `TORCH_NCCL_AVOID_RECORD_STREAMS=1` for their users. Therefore, this change is limited to these two collectives for now.
Cc: @awgu @janeyx99 @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111431
Approved by: https://github.com/H-Huang
This reverts commit 314a502eb0.
Changes since original PR:
Reland 1
* rename torch.distributed.hooks to torch.distributed._hooks
Reland 2
* make _hooks importable even if !distributed.is_available()
* handle cuda driver exit intermittent failure caused by new cuda api usage in callback caller (see prev PR in stack)
(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)
Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.
The design is similar to NCCL desync debug in that it minimizes the
overhead by doing most of the work off the main thread.
This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:
register_collective_start_hook
register_collective_end_hook
register_process_group_hook
The process group hook exposes PG creation on the member ranks, and the hooks are called inline from the
PG creation code. This is fine since it happens during initialization and a limited number of times.
The collective start/end hooks are fired from a single background thread, which reads
events from a C++ queue and dispatches them.
Queue notification is done using a pipe; this is needed so Python can abort the thread on shutdown
while keeping it a background thread, which is not possible with more conventional choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111072
Approved by: https://github.com/malfet
ghstack dependencies: #111061
This reverts commit ff0358b038.
(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)
Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.
The design is similar to NCCL desync debug in that it minimizes the
overhead by doing most of the work off the main thread.
This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:
register_collective_start_hook
register_collective_end_hook
register_process_group_hook
The process group hook exposes PG creation on the member ranks, and the hooks are called inline from the
PG creation code. This is fine since it happens during initialization and a limited number of times.
The collective start/end hooks are fired from a single background thread, which reads
events from a C++ queue and dispatches them.
Queue notification is done using a pipe; this is needed so Python can abort the thread on shutdown
while keeping it a background thread, which is not possible with more conventional choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907
Approved by: https://github.com/fduwjj
I tested by adding some warning logs in C++, running a distributed program, and showing that the messages now had `[rank0]:` in them. There is no existing test infra for C++ logging, so I couldn't easily add a unit test.
The implementation strategy is to setup a global variable in C++, and then poke it when we initialize a process group. This was the simplest thing I could think of that would work.
This PR only works for non-glog logging. Probably need to come up with some other strategy for glog, e.g., a custom prefix, but need to make sure this doesn't conflict with fbcode. I can't easily test this from OSS, will leave as follow up work.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110623
Approved by: https://github.com/voznesenskym, https://github.com/wanchaol, https://github.com/fduwjj
Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.
The design is similar to NCCL desync debug in that it minimizes the
overhead by doing most of the work off the main thread.
This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:
register_collective_start_hook
register_collective_end_hook
register_process_group_hook
The process group hook exposes PG creation on the member ranks, and the hooks are called inline from the
PG creation code. This is fine since it happens during initialization and a limited number of times.
The collective start/end hooks are fired from a single background thread, which reads
events from a C++ queue and dispatches them.
Queue notification is done using a pipe; this is needed so Python can abort the thread on shutdown
while keeping it a background thread, which is not possible with more conventional choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815
Approved by: https://github.com/wconstab, https://github.com/fduwjj
Fix a bug in socket.cpp timeout detection that only shows up with 10k ranks.
Make the minimum wait time in _store_based_barrier adaptive based on
the number of ranks.
Longer timeouts give more room for the store to do productive work when swamped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109218
Approved by: https://github.com/XilunWu
ghstack dependencies: #109217
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.
This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
...
if "The client socket has timed out after" in exception_str:
...
if "Broken pipe" in exception_str:
...
if "Connection reset by peer" in exception_str:
...
```
To address this issue, in this PR I've added these error types:
1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
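A hedged sketch of what error handling looks like with the typed hierarchy (assuming the new classes are exposed under `torch.distributed`):
```python
import torch.distributed as dist

try:
    dist.init_process_group("nccl")
except dist.DistStoreError:
    pass  # e.g. retry the store-based rendezvous
except dist.DistNetworkError:
    pass  # e.g. back off and reconnect the socket
except dist.DistBackendError:
    raise  # backend-specific failure; surface it
```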
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.
This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
...
if "The client socket has timed out after" in exception_str:
...
if "Broken pipe" in exception_str:
...
if "Connection reset by peer" in exception_str:
...
```
To address this issue, in this PR I've added these error types:
1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
The `broadcast_object_list` function can easily broadcast the state_dict of models/optimizers. However, the `torch.cat` operation performed within `broadcast_object_list` temporarily consumes double the memory, which means only objects occupying at most half the device capacity can be broadcast. This PR improves usability by skipping the `torch.cat` operation for object lists with a single element.
Before (30G tensor):
![memory usage before](https://github.com/pytorch/pytorch/assets/22362311/c0c67931-0851-4f27-81c1-0119c6cd2944)
After (46G tensor):
![memory usage after](https://github.com/pytorch/pytorch/assets/22362311/90cd1536-be7c-43f4-82ef-257234afcfa5)
Test Code:
```python
import torch
import torch.distributed as dist

if __name__ == "__main__":
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    fake_tensor = torch.randn(30 * 1024 * 1024 * 1024 // 4)
    if dist.get_rank() == 0:
        state_dict = {"fake_tensor": fake_tensor}
    else:
        state_dict = {}
    object_list = [state_dict]
    dist.broadcast_object_list(object_list, src=0)
    print("Rank: ", dist.get_rank(), " Broadcasted Object: ", object_list[0].keys())
    dist.barrier()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107509
Approved by: https://github.com/awgu
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.
I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, there seem to be no instances of it in our codebase, so I'm enabling it so that it stays that way. :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
We cannot use inner tensors for finalizers, as they are not collectable until waited on.
This PR adds a bunch of tests for the observable behavior we want, including the
necessary scaffolding for us to test whether code waits correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107250
Approved by: https://github.com/wconstab
This allows infra/trainers to get detailed stats about communication
efficiency without knowing anything about what model or distributed
training paradigm has been used. This is helpful because infra/trainer
packages usually prefer to be as model/algorithm agnostic as possible.
Therefore, we cannot assume that infra/trainers have access to all
collectives used by the model authors.
This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which
will be fired on every work completion event.
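For illustration only, a sketch of how such a hook might be consumed; the registration method name below is an assumption, not a verified binding:
```python
from collections import defaultdict

op_counts = defaultdict(int)

def on_completion(work_info):
    # Tally completed collectives without knowing the model's code.
    op_counts[str(work_info.op_type)] += 1

# Hypothetical registration call on a ProcessGroupNCCL instance `pg`:
pg._register_on_completion_hook(on_completion)
```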
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107233
Approved by: https://github.com/kumpera
This allows infra/trainers to get detailed stats about communication
efficiency without knowing anything about what model or distributed
training paradigm has been used. This is helpful because infra/trainer
packages usually prefer to be as model/algorithm agnostic as possible.
Therefore, we cannot assume that infra/trainers have access to all
collectives used by the model authors.
This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which
will be fired on every work completion event.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106988
Approved by: https://github.com/kumpera, https://github.com/H-Huang
ghstack dependencies: #107140, #107141, #107160
…out specifying the Backend
When init_process_group has not been called beforehand, DeviceMesh automatically calls it without specifying the backend. This is a problem when a third-party device wants to use DeviceMesh without calling init_process_group first. This PR adds a default_device_backend_map so that third-party device users can add their backend to this map when they first register it with PyTorch. When init_process_group is called without the backend parameter, it initializes the backends in this map, so a third-party user can call init_process_group without specifying the Backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107113
Approved by: https://github.com/wanchaol
Counterpart of #101157, which added the same coalescing support for `all_gather_into_tensor`.
This PR adds support for coalesced `reduce_scatter_tensor` calls in the following syntax:
Sync communication style:
```
with dist._coalescing_manager():
    for i in range(num_coll):
        dist.reduce_scatter_tensor(output_tensors[i], input_tensors[i])
```
Async communication style:
```
with dist._coalescing_manager(async_ops=True) as cm:
    for i in range(num_coll):
        dist.reduce_scatter_tensor(output_tensors[i], input_tensors[i])
# do a bunch of other things
cm.wait()
# do things that depend on the reduce-scatters' results
```
Each `reduce_scatter_tensor` call can be independent in terms of its data and buffer location, but the calls can be executed in parallel by supported backends (like NCCL).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103561
Approved by: https://github.com/fegin
**Motivation:**
For collective dispatching, we want to provide a more user-friendly way to map the xpu device to the CCL backend (a user-specified backend).
**Solution:**
We add xpu to the default device list, so the mapping between xpu and the user-specified backend can be constructed directly.
Usage:
When using xpu device, user can specify backend name only:
`dist.init_process_group(backend='ccl')`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103410
Approved by: https://github.com/jgong5, https://github.com/ezyang
Both internal and OSS users trying https://github.com/pytorch/pytorch/pull/99937 report that their workloads perform normally even with the barrier removed, and see a scalability win. Thus in this PR, we make it the default that PGs do not perform a barrier after init.
In the discussion of #99937, people point out that such a barrier might be needed for c10d + RPC cases. IMO, this need originates from RPC's programming model and should be the responsibility of RPC or RPC users to deal with; it can happen with other functions/libraries too. So the need for c10d to do so big a favor is not justified IMO. It is also good to remove the barrier before users become reliant on it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103033
Approved by: https://github.com/XilunWu
Fixes #101911
Currently, `DTensor` supports cuda and cpu. This PR makes some changes for easier integration with the ort backend.
* `Backend.NAME` attribute now has value `name` instead of `NAME` for backends registered through `register_backend(name)`; this matches the pattern for backends with built-in support like nccl.
* remove unused `_check_for_nccl_backend` function
* add test case that moves parameters to device in the `partition_fn` - a scenario that's useful for big models
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101914
Approved by: https://github.com/wanchaol
Add is_backend_available for c10d backends, covering both the built-in backends and third-party backends registered through the function ``Backend.register_backend``.
There is a related discussion in https://github.com/pytorch/pytorch/pull/101775#discussion_r1199253553
> For example in python constructor for their backend they should explicitly add the is_X_available. Or if defining in C++ they should modify pybind like this https://github.com/H-Huang/torch_collective_extension/blob/main/custom_backend/include/dummy.hpp#L98-L101
to also add their own is_available property
It is a natural choice for users to add their own `is_available` when they create a backend. We think this could let users call `is_X_available` the same way as for native backends, for example via a dynamically added `torch.distributed.is_dummy_available()` function. This is why we want to dynamically add `is_X_available` to `torch.distributed` in `register_backend`.
> Or we could add an Is_available(backend) function, that checks for the backend.
Providing a public function is indeed another good approach. We have implemented an `is_backend_available` in https://github.com/pytorch/pytorch/pull/101945 that supports both built-in backends and third-party backends.
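A minimal usage sketch of the new helper (assuming it is exposed as `torch.distributed.is_backend_available`):
```python
import torch.distributed as dist

# Works for built-in backends and backends added via Backend.register_backend.
if dist.is_backend_available("nccl"):
    dist.init_process_group(backend="nccl")
else:
    dist.init_process_group(backend="gloo")
```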
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101945
Approved by: https://github.com/H-Huang
1. Record time spent for init_process_group, new_group, _store_based_barrier
2. Rename c10d_error_logger to c10d_logger for generalization.
3. Refactor to move the logger wrappers in distributed_c10d.py to c10d_logger.py.
4. Rename the logger wrappers (BC-breaking): exception_handler is renamed to exception_logger to avoid confusion with logging handlers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101912
Approved by: https://github.com/fduwjj
This PR adds support for the following use cases:
- Sync style:
```
with dist._coalescing_manager():
    for i in range(num_coll):
        dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])
```
- Async style:
```
with dist._coalescing_manager(async_ops=True) as cm:
    for i in range(num_coll):
        dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])
# do a bunch of other things
cm.wait()
# do things that depend on the all-gathers' results
```
Each `all_gather_into_tensor` call is independent in terms of its data and buffer location, but the calls can be executed in parallel by supported backends (like NCCL).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101157
Approved by: https://github.com/kumpera, https://github.com/wanchaol
Summary: with the new c10d API, we don't need all ranks to call new_group. Integrate with the new API so that every rank just calls new_group 3 times, with a local barrier among the members of the group.
Reviewed By: xunnanxu, eeggl
Differential Revision: D45315615
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100518
Approved by: https://github.com/kumpera
Add a use_local_synchronization argument to new_group.
When this argument is True, new_group performs a store_barrier only on the ranks that are part of the group, not the whole cluster.
This addresses both the scalability and composability problems associated with new_group.
Fixes #81291.
This is relanding #84224
As part of the original PR I did a quick benchmark of creating 3 PGs per rank using both functions and perf is the following:
new_group use_local_synchronization=False:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.12 |
| 8 | 0.25 |
| 16 | 0.51 |
| 32 | 0.87 |
| 64 | 1.50 |
| 128 | 2.87 |
new_group use_local_synchronization=True:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.05 |
| 8 | 0.04 |
| 16 | 0.03 |
| 32 | 0.03 |
| 64 | 0.04 |
| 128 | 0.04 |
Scaling for `use_local_synchronization=False` is sub linear because the number of process groups created as a multiple of world_size decreases as we go up. It's 6 with world_size 4 and 192 with world_size 128.
Scaling for `use_local_synchronization=True` is constant as the number of store barriers executed per rank remains constant at 3.
Setup:
1 AWS host, backend gloo.
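A minimal sketch of the new argument:
```python
import torch.distributed as dist

# The store barrier involves only the group's members (ranks 0 and 1),
# so creating this group does not synchronize the whole cluster.
subgroup = dist.new_group(ranks=[0, 1], use_local_synchronization=True)
```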
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99931
Approved by: https://github.com/xw285cornell
### Description
The PR aims at reducing CPU overhead of context manager style coalescing.
By "context manager style coalescing", we mean:
Sync style:
```
with _coalescing_manager():
    for i in range(num_coll):
        dist.all_reduce(tensors[i])
```
Async style:
```
with _coalescing_manager(async_ops=True) as cm:
    for i in range(num_coll):
        dist.all_reduce(tensors[i])
cm.wait()
```
In the previous implementation, each collective in the `num_coll` loop actually calls into the C++ backend, accumulating pybind overhead.
In the new implementation, we capture the collectives at the Python level and only fire towards C++ at the exit of the coalescing manager.
### Tests
In current PR, the "fast path" only applies to all-reduce.
- Flattened 512M: 16.38 ms, including CPU time 131.21 us
- Old _coalescing_manager 64 x 8M: 22.19 ms, including CPU time 2865 us
- New _coalescing_manager 64 x 8M: 16.93 ms, including CPU time 635 us
Hence a 4x reduction in CPU overhead (dependent on `num_coll`).
Cc @mrshenli @kumpera @wanchaol @fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98793
Approved by: https://github.com/kumpera
The `new_subgroups` API allows for the easy creation of sub-communication groups, but it currently requires CUDA availability. For communications that do not rely on CUDA, such as CPU-based gloo or custom communication backends, I would still like to be able to use it. For example, with CPU-based gloo (the same applies when using a custom backend):
```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def gloo_process(rank_id, world_size, group_size, mp_lock):
    assert not torch.cuda.is_available()

    def lock_print(*args, **kwargs):
        with mp_lock:
            print(*args, **kwargs, flush=True)

    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('gloo', rank=rank_id, world_size=world_size)
    subgroup, _ = dist.new_subgroups(group_size)
    subgroup_ranks = list(range(subgroup.rank() * group_size, (subgroup.rank() + 1) * group_size))
    lock_print(f"Rank {rank_id} initialized in subgroup_{subgroup.rank()}: {subgroup_ranks}")
    tensor = torch.Tensor([rank_id + 1])
    subgroup.broadcast(tensor, root=0)
    lock_print(f"After broadcast, rank {rank_id} in subgroup_{subgroup.rank()}:{subgroup_ranks} got {tensor}")

if __name__ == "__main__":
    world_size = 4
    group_size = 2
    processes = []
    mp.set_start_method("spawn")
    mp_lock = mp.Lock()
    for rank in range(world_size):
        p = mp.Process(target=gloo_process, args=(rank, world_size, group_size, mp_lock))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```
```bash
Rank 0 assigned to subgroup_0: [0, 1]
Rank 1 assigned to subgroup_1: [2, 3]
Rank 2 assigned to subgroup_0: [0, 1]
Rank 3 assigned to subgroup_1: [2, 3]
After broadcast, rank 2 in subgroup_0:[0, 1] got tensor([3.])
After broadcast, rank 3 in subgroup_1:[2, 3] got tensor([3.])
After broadcast, rank 1 in subgroup_1:[2, 3] got tensor([1.])
After broadcast, rank 0 in subgroup_0:[0, 1] got tensor([1.])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99706
Approved by: https://github.com/kumpera
Summary:
Update _store_based_barrier from using add(), which overloads rank 0 with requests, to a single request every 10 seconds to handle the last-joined worker.
Added an optional logging_interval arg to _store_based_barrier.
Test Plan:
```
pytest test/distributed/test_c10d_common.py -vsk test_store_based_barrier
```
Reviewed By: rohan-varma
Differential Revision: D44430531
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98000
Approved by: https://github.com/kumpera
Summary: Adding exception handler to a few more APIs so that internal errors are logged to the c10d errors scuba table
Test Plan: sandcastle
Differential Revision: D44068557
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96770
Approved by: https://github.com/wz337
Summary: For backend/PG plugins, use `ProcessGroup.BackendType.CUSTOM` to avoid an uninitialized variable during the later `pg._register_backend` call
Test Plan: CI/CD and internal tests
Differential Revision: D42793222
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93129
Approved by: https://github.com/H-Huang
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330
### Implementation
Move backend-specific (NCCL, Gloo, etc.) collective implementations to the corresponding `Backend` class. Update ProcessGroup to support multiple backends and use the dispatcher to call backends based on the tensor device type.
### Changes
#### c++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`)
- Update pybind definitions for new process group base class and new backend class
- Update pybinded backend class with collective definitions to keep BC with Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`) which are used in tests
- Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, `ProcessGroupUCC` to derive from the `Backend` class.
- Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by querying the backend using the device type
- Update the internal dispatched implementation of `barrier` to use a tensor, which allows the operation to be dispatched.
- Update the `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching it correctly. I still don't understand why and had originally filed an issue in #85122.
#### python changes (`distributed_c10d.py`, test files)
- Add BackendConfig class to specify the default configurations of backends and `get_backend_config()` API
- `get_backend()` deprecation warning
- `init_process_group` now returns a generic `ProcessGroup` object; it contains a list of backends (the ones stated above) to which it dispatches operations.
- `new_group` updated to return the same as above
- Update `test_c10d_gloo.py`, Update `DistributedDataParallelTest` to use `init_process_group`, Update `ReducerTest`, update `test_broadcast_coalesced_gloo` to move from PG instance and gloo options
- Update `test_c10d_nccl.py`, Update `DistributedDataParallelTest` to use `init_process_group`
- Specific tests updated: `test_Backend_enum_class`
### Changes missing
- lazy initialization of backends
- support parsing of BackendConfig
### Open questions
- Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338)
# Example
This is a basic script (using 2 backends within a process group)
```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py
import torch.distributed as dist
import torch
import os

if __name__ == "__main__":
    rank = os.environ.get("RANK")
    # initialize with both gloo and nccl
    dist.init_process_group()
    # with gloo
    dist.all_reduce(torch.tensor([1.0]))
    print(f"Rank {rank} finished")
    # with nccl
    dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}"))
```
Test Plan: Imported from OSS
Differential Revision: D42069829
Pulled By: H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997
Approved by: https://github.com/awgu, https://github.com/fduwjj
Adds 2 new hybrid sharding strategies to FSDP:
1. HYBRID_SHARD: applies zero-3-style sharding within a node and data parallelism across nodes
2. HYBRID_SHARD_ZERO2: applies zero-2-style sharding within a node and data parallelism across nodes
These are useful for medium-sized models and aim to decrease communication volume; tests and benchmarks will be run to understand which workloads are optimal under which sharding strategy.
Hybrid sharding in general works by sharding the model using a process group within a single node, and creating inter-node process groups for replication / data parallelism. The user either needs to pass in a tuple of these process groups, or None, in which case we generate the process groups appropriately.
**Acknowledgements**
- @awgu 's excellent prototype: 5ad3a16d48
- @liangluofb For ideation, feedback, and initial implementation and experimentation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89915
Approved by: https://github.com/awgu
Summary:
`RuntimeError: Invalid function argument. Expected parameter "tensor_list" to be of type List[torch.Tensor].`
to
`RuntimeError: Invalid function argument. Expected parameter "input_tensor_list" to be of type List[torch.Tensor].`
Test Plan: sandcastle
Differential Revision: D41405238
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89294
Approved by: https://github.com/awgu
Summary:
X-link: https://github.com/pytorch/torchrec/pull/781
Move a bunch of globals to instance methods and replace all uses of them.
We move all PG-related globals under World and use a singleton instance under _world.
This creates an undocumented extension point to inject full control over how c10d
state behaves.
One simple hack is to change _world to an implementation that uses a threadlocal
and enables per-thread PGs.
It almost gets DDP working; the PG is still missing an implementation of all_reduce.
This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68
This change ensures BC by keeping the global variables around and having the default _World wrap them.
I have relinked this diff to a new github PR, so that I can update it. The original PR is
> Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348
Differential Revision: D40236769
Pulled By: yhcharles
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88471
Approved by: https://github.com/gnadathur, https://github.com/rohan-varma
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87643
1. Add a decorator function exception_handlers to c10d collectives.
2. Update tests (torch/distributed/distributed_c10d.py) to include mp tests for exception_handler.
```
python3 test/distributed/test_c10d_error_logger.py
```
Test Plan: Test in OSS.
Reviewed By: H-Huang
Differential Revision: D40281632
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87988
Approved by: https://github.com/H-Huang
Move a bunch of globals to instance methods and replace all uses of them.
We move all PG-related globals under World and use a singleton instance under _world.
This creates an undocumented extension point to inject full control over how c10d
state behaves.
One simple hack is to change _world to an implementation that uses a threadlocal
and enables per-thread PGs.
It almost gets DDP working; the PG is still missing an implementation of all_reduce.
This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68
This change ensures BC by keeping the global variables around and having the default _World wrap them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348
Approved by: https://github.com/rohan-varma
### Deprecation reasons:
- For most users, training is on one GPU per process, so these APIs are rarely used
- They added one more API dimension
- They can be expressed in a composed manner
- They are not abstracted – they are specific to GPU
- They caused backend APIs and implementations to have nested `std::vector<std::vector<Tensor>>`, which is hard to read and maintain
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85961
Approved by: https://github.com/XilunWu, https://github.com/H-Huang
- We consider that general users need not use the `*_coalesced` APIs unless there is an extreme concern about performance.
- We are investigating a context manager named `coalescing_manager` which wraps around multiple individual collectives to compose the coalescing hint, rather than giving each collective a `*_coalesced` variant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85959
Approved by: https://github.com/XilunWu, https://github.com/H-Huang
### Description
- This PR renames `_all_gather_base` to `all_gather_into_tensor` so that it is clearer in meaning.
- The `all_gather_into_tensor` API differs from the `all_gather` API in the output it accepts -- a single, large tensor instead of a list of tensors.
- This PR also adds deprecation warning to `_all_gather_base`.
### Issue
`_all_gather_base` was implemented in https://github.com/pytorch/pytorch/pull/33924 to avoid unnecessary flattening. There was previous effort (#82639) to merge `_all_gather_base` with the existing `all_gather` API by detecting the parameter type passed in for the output.
There are, however, two "blockers" that make the merge difficult:
(i) The merge leads to a backward-compatibility break. We would need to change the parameter name `tensor_list` in `all_gather` to a general name `output` that can cover both a tensor and a tensor list.
(ii) Recently, the `all_gather` API added uneven tensor support, utilizing the tensor boundaries implied by the list. We are, however, not sure about adding such support to `_all_gather_base`, because that would require users to pass in additional tensor boundary information.
In view of the above, we decided to productize `_all_gather_base` as a separate function, but with a clearer name.
### Testing
Added tests:
- `test_all_gather_into_cat_tensor_cuda` -- output form as with `torch.cat`. For example:
```
>>> tensor_in
tensor([1, 2], device='cuda:0') # Rank 0
tensor([3, 4], device='cuda:1') # Rank 1
>>> tensor_out
tensor([1, 2, 3, 4], device='cuda:0') # Rank 0
tensor([1, 2, 3, 4], device='cuda:1') # Rank 1
```
- `test_all_gather_into_stack_tensor_cuda` -- output form as with `torch.stack`. For example:
```
>>> tensor_out2
tensor([[1, 2],
        [3, 4]], device='cuda:0') # Rank 0
tensor([[1, 2],
        [3, 4]], device='cuda:1') # Rank 1
```
The output form is determined by the shape of the output tensor passed by the user, no flag used.
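A hedged sketch combining both forms (shapes and variable names are illustrative):
```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
world = dist.get_world_size()
tensor_in = torch.arange(2, device=f"cuda:{rank}") + 1 + 2 * rank

# torch.cat-like output: one flat tensor of length world * len(tensor_in)
out_cat = torch.empty(world * 2, dtype=tensor_in.dtype, device=tensor_in.device)
dist.all_gather_into_tensor(out_cat, tensor_in)

# torch.stack-like output: add a leading world-size dimension instead
out_stack = torch.empty((world, 2), dtype=tensor_in.dtype, device=tensor_in.device)
dist.all_gather_into_tensor(out_stack, tensor_in)
```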
Cc @rohan-varma @mrshenli @crcrpar @ptrblck @H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85686
Approved by: https://github.com/rohan-varma, https://github.com/crcrpar
Move a bunch of globals to instance methods and replace all uses of them.
We move all PG-related globals under World and use a singleton instance under _world.
This creates an undocumented extension point to inject full control over how c10d
state behaves.
One simple hack is to change _world to an implementation that uses a threadlocal
and enables per-thread PGs.
It almost gets DDP working; the PG is still missing an implementation of all_reduce.
This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84153
Approved by: https://github.com/rohan-varma
While passing tensors with different dtypes doesn't crash, it doesn't produce sensible results:
we see data tearing instead of casting.
It's not clear we want to support transparent casting, so, for now, we fail when such input is presented.
Fixes #84525
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84664
Approved by: https://github.com/rohan-varma
Fixes #84865
Previous `torch.distributed.reduce_scatter`:
```
def reduce_scatter(output, input_list, op=ReduceOp.SUM, group=None, async_op=False):
    """
    Reduces, then scatters a list of tensors to all processes in a group.

    Args:
        output (Tensor): Output tensor.
        input_list (list[Tensor]): List of tensors to reduce and scatter.
        group (ProcessGroup, optional): The process group to work on. If None,
            the default process group will be used.
        async_op (bool, optional): Whether this op should be an async op.
```
Fixed:
```
def reduce_scatter(output, input_list, op=ReduceOp.SUM, group=None, async_op=False):
    """
    Reduces, then scatters a list of tensors to all processes in a group.

    Args:
        output (Tensor): Output tensor.
        input_list (list[Tensor]): List of tensors to reduce and scatter.
        op (optional): One of the values from
            ``torch.distributed.ReduceOp``
            enum. Specifies an operation used for element-wise reductions
        group (ProcessGroup, optional): The process group to work on. If None,
            the default process group will be used.
        async_op (bool, optional): Whether this op should be an async op.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84983
Approved by: https://github.com/H-Huang
Those functions enable membership introspection into a ProcessGroup. A common scenario
that needs this is library code that consumes a PG but doesn't create it, which means
it likely doesn't know the global ranks used to create it.
Translating from local to global is necessary when using c10d collectives like broadcast,
so if your library code adopts the convention of using local rank 0, it needs
to do the following:
```python
import torch.distributed as dist

my_pg: dist.ProcessGroup = ...

def my_library_bcast(tensor):
    # translate the group-local rank 0 into its global rank
    dist.broadcast(tensor, src=dist.get_global_rank(my_pg, 0), group=my_pg)
```
This implements some of the helpers needed to implement the `clone` API from: https://github.com/pytorch/pytorch/issues/81291
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82134
Approved by: https://github.com/rohan-varma
This is a new version of #15648 based on the latest master branch.
Unlike the previous PR, where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here: I'm simply going to integrate xdoctest and mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.
In my initial commit, I do the bare minimum to get something running, with failing dashboards. The few tests that I marked as skip were causing segfaults. Running xdoctest results in 293 failed and 201 passed tests. The next commits will disable those tests. (Unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)
Fixes https://github.com/pytorch/pytorch/issues/71105
@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
- Modifies the current cmake build definitions to use `find_package` to find UCX and UCC installed in the system
- Install UCX and UCC in CUDA dockers
- Build PyTorch with `USE_UCC=1` in pipelines
- Currently, we are not running unit tests with the UCC PG. Those tests will be added in future PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81583
Approved by: https://github.com/vtlam, https://github.com/malfet
Summary:
`batch_isend_irecv` previously required the use of `torch.cuda.synchronize` to avoid data races. This was because the ncclStreams were recorded in the returned ncclWork object _before_ a ncclGroupEnd was issued by the `_batch_p2p_manager`. Thus, `req.wait()` was effectively waiting on nothing, leading later operators to work on incorrect intermediate data.
This fix:
- keeps track of ncclStreams to wait on, and records them in the work objects after the batch manager issues a ncclGroupEnd
- renames the `_batch_p2p_manager` to `_coalescing_manager` for generality
- removes the explicit check for the NCCL backend inside `_batch_p2p_manager` in distributed_c10d.py and moves the manager start/end to ProcessGroup.hpp, in order to transparently work with all process groups
Test Plan: Modified the unittest for `batch_isend_irecv` to check that received tensors are the same as expected tensors. Verified that the test fails before the change, and passes after the change.
Differential Revision: D38100789
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82450
Approved by: https://github.com/kwen2501
### Description
Across PyTorch's docstrings, both `callable` and `Callable` are used for variable types. `Callable` should be capitalized, as we are referring to the `Callable` type and not the Python `callable()` function.
### Testing
There shouldn't be any testing required.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82487
Approved by: https://github.com/albanD
Summary:
This diff integrates the UCC process group as a native component of PyTorch Distributed core. It is based on the existing torch-ucc (https://github.com/facebookresearch/torch_ucc) as the wrapper for the UCC collective communication library.
The environment and cmake variables are named to mirror the existing process groups such as NCCL and Gloo. Specifically,
- USE_UCC: enables UCC PG. This defaults to OFF, so there is no breakage of existing builds that do not have UCX/UCC external libraries.
- USE_SYSTEM_UCC: uses external UCX and UCC shared libraries that are set accordingly with UCX_HOME and UCC_HOME.
Currently, this diff only supports USE_SYSTEM_UCC=ON, i.e., it requires users to specify external libraries for UCX and UCC. In subsequent diffs, we will add the UCX and UCC repos as third-party dependencies in pytorch/third-party.
Test Plan:
Passed Torch-UCC tests that invoke UCC process group. For example:
$ sh test/start_test.sh test/torch_allreduce_test.py --backend gloo --use-cuda
...
Test allreduce: succeeded
Differential Revision: D36973688
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79918
Approved by: https://github.com/kwen2501, https://github.com/kingchc
This fixes all object collectives under NCCL and adds some automated tests for them.
This PR *does not* fix sending tensors using object collectives.
It simplifies device handling by computing the appropriate device earlier and then ensuring all tensor ops happen on it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79034
Approved by: https://github.com/rohan-varma
Summary:
`batch_isend_irecv` previously only worked for two-rank cases;
otherwise it would hang, e.g. pytorch/pytorch#73960. This diff extends
`batch_isend_irecv` to support more than two ranks. The fix treats
the operation more like a collective than a two-rank P2P when selecting
the communicator, since more ranks can participate in the batch call than "my" rank and "my" peer.
Rules:
- If `batch_isend_irecv` is the first collective call (including collectives and
all-to-all) in the `group` given as the argument, then all ranks of the
`group` are expected to participate in this call.
- Otherwise, if it is not the first collective call in the `group` (i.e. the
communicator has been initialized), then batched P2P communication involving
only subset of processes of the `group` is allowed.
Test Plan:
Added p2p_tests.py testing the following patterns:
+ sendrecv_neighbor(input, output) # Ring like neighbor exchange
+ sendrecv_ripple(input, output) # Exchange with growing distance (pytorch/pytorch#73960)
+ sendrecv_P2P(input, output) # Single P2P operation
+ isendrecv_P2P(input, output) # Single non-blocking P2P operation
+ isendrecv_P2P_batch(input, output, 0) # batched P2P between only two ranks
+ isendrecv_P2P_batch(input, output, 1) # batched P2P within a new group created for two ranks
Differential Revision: D35122664
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74701
Approved by: https://github.com/mingzhe09088, https://github.com/osalpekar
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166
This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
ghstack-source-id: 149778566
Test Plan: Run the existing unit tests.
Reviewed By: rohan-varma
Differential Revision: D34371226
fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72657
The _ProcessGroupWrapper check needs to be gated on Gloo availability;
it fails when Gloo is not available.
ghstack-source-id: 148837056
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D34144848
fbshipit-source-id: 42a04918b968247f3259cd2cde5438e1265b04fe
(cherry picked from commit ba5de98939)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71623
Enable gather_object on the NCCL backend, since we already support `dist.gather` on NCCL. This requires the user to set the current device properly.
ghstack-source-id: 147754836
Test Plan: distributed_nccl_spawn -r test_gather_object
Reviewed By: zou3519
Differential Revision: D33701042
fbshipit-source-id: 39cff22947a7cac69d0c923b956dc10f25353a6f
(cherry picked from commit 6e6eff497f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70605
broadcast_object_list cast the sum of all object lengths from long to int, causing overflows.
Test Plan:
Add a Tensor with a >2GB storage requirement (in distributed_test.py) to object broadcast.
This Tensor is only added if tests are running at Meta, as the GitHub tests will OOM.
Without the fix, the length overflows and the program requests a negative-sized Tensor:
```
RuntimeError: Trying to create tensor with negative dimension -2147482417: [-2147482417]
```
With the fix, the test passes.
Test used on server with GPUs:
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn --local -- broadcast_object
buck test mode/dev-nosan //caffe2/test/distributed:distributed_gloo_spawn --local -- broadcast_object
Reviewed By: r-barnes
Differential Revision: D33405741
fbshipit-source-id: 972165f8297b3f5d475636e6127ed4a49adacab1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70336
broadcast_object_list cast the sum of all object lengths from long to int, causing overflows.
Test Plan:
Increased the size of the Tensor used in object transfers to have a >2GB storage requirement (in distributed_test.py).
Without the fix, the length overflows and the program requests a negative-sized Tensor:
```
RuntimeError: Trying to create tensor with negative dimension -2147482417: [-2147482417]
```
With the fix, the test passes.
Test used on server with GPUs:
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn --local -- broadcast_object
Differential Revision: D33281300
fbshipit-source-id: 1bc83e8624edc14e747eeced7bc8a7a10e443ee4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69400
Hopefully this makes naming more consistent. Without this change, some tests will fail for plugins, since values can be set to upper case in some cases. This should prevent that and make lookup and comparison consistent.
Test Plan: Check the signals. There is no specific test for this, but all tests should pass.
Reviewed By: mrshenli
Differential Revision: D32836529
fbshipit-source-id: 1b7d2b64e04fe0391b710aa6ed6d1e47df9027a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68223
DETAIL debug mode didn't work with object-based collectives for the NCCL backend, because we'd only check whether the backend is NCCL and then move tensors to CUDA.
Instead, check whether it is a wrapped PG, and then check the wrapped PG to see if it's NCCL.
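A hedged sketch of the described check (the wrapper's inner-PG accessor is an assumption):
```python
import torch.distributed as dist
from torch._C._distributed_c10d import _ProcessGroupWrapper

def _uses_nccl(pg) -> bool:
    # Under DETAIL debug mode the PG is a _ProcessGroupWrapper; unwrap first.
    if isinstance(pg, _ProcessGroupWrapper):
        pg = pg.wrapped_pg  # assumed accessor for the inner PG
    return dist.get_backend(pg) == dist.Backend.NCCL
```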
ghstack-source-id: 143242023
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D32366840
fbshipit-source-id: be0a2af6849f8f24446593f4a4fbea4a67586ee5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67639
Due to BC considerations, we cannot directly error out, as that
might break existing applications. Raise warnings first to improve
debuggability.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D32075151
Pulled By: mrshenli
fbshipit-source-id: 5680d420f5f6cd3f74a36616c03350e8a976b363
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66991
Currently, c10d extensions use Backend.NAME to store the creator
function. However, built-in ones use that same field to store the
name. This commit makes c10d extensions comply with built-in ones,
and uses a dedicated `_plugins` field to store creator functions.
Thanks bryanmr for pointing this out.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D31820307
Pulled By: mrshenli
fbshipit-source-id: 259769ebfc80c0c9fc44d25498c8d19a3a09d1bc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64722
`all_reduce_coalesced` and `all_gather_coalesced` were never publicly
released in our API docs, so I would assume the blast radius to be small.
The motivation for this change is to allow implementing
`all_reduce_coalesced` and `all_gather_coalesced` by re-using the `allreduce`
and `allgather` C++ cores and performing the flatten and copy only on the Python
side. With that, we can then remove `all_reduce_coalesced` and
`all_gather_coalesced` from the C++ ProcessGroup APIs. For the async mode,
the copy-back logic after the communication will need to be chained
as a callback on the returned Future, with the chained child Future used
as the return value (otherwise, we would need to wrap the child Future
into another work handle). This PR tests whether we can directly
return a Future without breaking tests and internal use cases. If yes,
it will make the consolidation a lot easier.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D30830994
Pulled By: mrshenli
fbshipit-source-id: dcde0ed9245e9e8fee357b3588b07d540a4b6318
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63910
Addresses the current issue that `init_method=tcp://` is not compatible with `torch.distributed.run` and `torch.distributed.launch`. When running with a training script that initializes the process group with `init_method=tcp://localhost:$port` as such:
```
$ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py
```
An `Address in use` error is raised, since the training script tries to create a TCPStore on port 6000, which is already taken: the elastic agent is already running a TCPStore on that port.
For details see: https://github.com/pytorch/pytorch/issues/63874.
This change does a couple of things:
1. Adds `is_torchelastic_launched()` check function that users can use in the training scripts to see whether the script is launched via torchelastic.
1. Update the `torch.distributed` docs page to include the new `is_torchelastic_launched()` function.
1. Makes `init_method=tcp://` torchelastic compatible by modifying `_tcp_rendezvous_handler` in `torch.distributed.rendezvous` (this is NOT the elastic rendezvous, it is the old rendezvous module which is slotted for deprecation in future releases) to check `is_torchelastic_launched()` AND `torchelastic_use_agent_store()` and if so, only create TCPStore clients (no daemons, not even for rank 0).
1. Adds a bunch of unittests to cover the different code paths
NOTE: the issue mentions that we should fail fast with an assertion on `init_method!=env://` when `is_torchelastic_launched()` is `True`. There are three registered init_methods in PyTorch: env://, tcp://, and file://. Since this diff makes tcp:// compatible with torchelastic, and I've validated that file:// is compatible with torchelastic, there is no need to add assertions. I did update the docs to point out that env:// is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future, but that is out of scope for this issue.
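A minimal sketch of the new check in a training script (assuming it is exposed as `torch.distributed.is_torchelastic_launched`):
```python
import torch.distributed as dist

if dist.is_torchelastic_launched():
    # The elastic agent already provides the rendezvous; env:// is implied.
    dist.init_process_group(backend="gloo")
else:
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://localhost:29500",
        rank=0,
        world_size=1,
    )
```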
Test Plan: Unittests.
Reviewed By: cbalioglu
Differential Revision: D30529984
fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61299
Modify all_to_all and scatter to support complex tensors in addition to float tensors.
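A quick sketch of what this enables (assumes an initialized process group whose backend supports all_to_all):
```
import torch
import torch.distributed as dist

rank = dist.get_rank()
world_size = dist.get_world_size()

# Complex tensors now flow through all_to_all just like float tensors.
inputs = [torch.full((2,), complex(rank, i), dtype=torch.cfloat) for i in range(world_size)]
outputs = [torch.empty(2, dtype=torch.cfloat) for _ in range(world_size)]
dist.all_to_all(outputs, inputs)
```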
Test Plan: buck run //caffe2/test/distributed:distributed_gloo_fork -- test_name --print-passing-details --run-disabled
Reviewed By: wanchaol
Differential Revision: D29563938
fbshipit-source-id: 59e436b3fa1aee3d5195cbcffd39587e642c76b9
Summary:
Revised version of https://github.com/pytorch/pytorch/issues/60573.
**Overview:**
This makes two changes:
- It introduces a `map_location` argument to `broadcast_object_list()`. The argument specifies the device to load tensors contained in objects received from the broadcast. This change requires modifying the implementation of `_object_to_tensor()` and `_tensor_to_object()` to use `torch.save()` and `torch.load()` respectively.
- It removes all calls to `_broadcast_object()` in `ZeroRedundancyOptimizer` and the corresponding test file in favor of `broadcast_object_list()`.
The default value of `map_location` is `None`, in which case `_object_to_tensor()` and hence `broadcast_object_list()` preserve their original behavior. Namely, contained tensors are loaded to their original device.
In `consolidate_state_dict()`, I specify `map_location=torch.device("cpu")` instead of `self._default_device`. This slightly changes the behavior from before when using `_broadcast_object()`. The reason I do so is that it saves one GPU to CPU data transfer since the action immediately after receiving the broadcasted `local_state_dict` is to copy it to CPU.
Explicitly, if `map_location=self._default_device`, then the data transfer path assuming NCCL backend is as follows:
`source GPU --[before serialize]--> source CPU --[before broadcast]--> source GPU --[broadcast]--> destination GPU --[before deserialize]--> destination CPU --[deserialize]--> destination GPU --[copy]--> destination CPU`
Hence, by setting `map_location=torch.device("cpu")` instead, the suffix becomes:
`destination CPU --[deserialize]--> destination CPU --[copy]--> destination CPU`
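Usage would look roughly like this (sketch; the argument name `map_location` is as described in this PR):
```
import torch
import torch.distributed as dist

# Rank 0 broadcasts an object containing a CUDA tensor; map_location
# controls which device the contained tensors are loaded onto on the
# receiving ranks.
objects = [{"weights": torch.ones(4, device="cuda:0")}] if dist.get_rank() == 0 else [None]
dist.broadcast_object_list(objects, src=0, map_location=torch.device("cpu"))
# On nonzero ranks, objects[0]["weights"].device is now cpu.
```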
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61539
Test Plan:
I added a test `test_broadcast_object_list_map_location()` that checks for both `map_location` as CPU and GPU that (1) tensors contained in broadcasted objects are appropriately loaded onto the specified device and (2) that the contents of the tensors are correct.
The existing `ZeroRedundancyOptimizer` tests pass.
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
The existing `broadcast_object_list()` test passes:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_broadcast_object_list
```
Reviewed By: zou3519
Differential Revision: D29701479
Pulled By: andwgu
fbshipit-source-id: c8d5f9057b32e5e9f40e8edc5b2cc25fb21414a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61305
Part I (this PR): Add dist_device argument to broadcast_object_list API
Part II: andwgu@ will deprecate _broadcast_object in favor of the newly introduced API
Also includes the changes to `_object_to_tensor()`/`_tensor_to_object()` from PR 60573
Context: https://github.com/pytorch/pytorch/issues/60062
Test Plan:
Run the following on a devgpu with two CUDA devices:
$ python setup.py develop  # run this build on the devgpu
$ BACKEND='nccl' WORLD_SIZE=2 with-proxy python test/distributed/test_distributed_fork.py TestDistBackendWithFork.test_broadcast_object_list --v
$ BACKEND='gloo' WORLD_SIZE=2 with-proxy python test/distributed/test_distributed_fork.py TestDistBackendWithFork.test_broadcast_object_list --v
Build with distributed on: USE_DISTRIBUTED=1 python setup.py develop
Test on a CPU devvm:
$ with-proxy python test/distributed/optim/test_zero_redundancy_optimizer.py
Imported from OSS
Differential Revision: D29566538
Reviewed By: iramazanli, mrshenli
Pulled By: bowangbj
fbshipit-source-id: 0bea52442551c5194acba85eadda16ba2ec4b6ef
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct, but that `mypy` fails to recognize as such. This often stems from the fact that the `mypy` version in use wasn't able to handle the pattern.
With every new release `mypy` gets better at handling complex code. In addition to fixing all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see if they are still needed. Fortunately, we don't need to do it manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out whenever it encounters a `type: ignore` that is no longer needed.
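A small illustration of what `warn_unused_ignores` catches (assuming it is enabled in the repo's mypy configuration):
```
# With warn_unused_ignores = True, mypy reports suppressions that no
# longer suppress anything:

x: int = 1  # type: ignore  # error: unused "type: ignore" comment

# An ignore on a pattern the pinned mypy version genuinely cannot
# handle produces no warning and stays in place.
```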
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006
Reviewed By: jbschlosser, malfet
Differential Revision: D29133237
Pulled By: albanD
fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59111
Create a util function for initializing subgroups. By default, each subgroup contains all the ranks within a machine. This util function can be used by both local SGD and SyncBatchNorm optimization.
Additionally, clang format `distributed/__init__.py` after importing `_rank_not_in_group` which is used by the unit test, and also clang format `distributed_c10d.py`.
Note that this API does not accept another overall main group. Like the APEX API `create_syncbn_process_group` [here](https://nvidia.github.io/apex/_modules/apex/parallel.html), it always uses the global world size and should only be applied when CUDA is available.
#Closes: https://github.com/pytorch/pytorch/issues/53962
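Usage sketch (assuming the util landed as `new_subgroups`/`new_subgroups_by_enumeration`):
```
import torch.distributed as dist

dist.init_process_group("nccl")
# By default, each subgroup contains all the ranks within one machine,
# giving every rank its intra-machine group for local SGD / SyncBatchNorm.
cur_subgroup, subgroups = dist.new_subgroups()

# Or spell out the rank groupings explicitly:
cur_subgroup, subgroups = dist.new_subgroups_by_enumeration([[0, 1], [2, 3]])
```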
ghstack-source-id: 130975027
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_group_size_exceeds_world_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_world_size_not_divisible_by_group_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_by_enumeration
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_by_enumeration_input_rank_exceeds_world_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_overlap_not_allowed
Reviewed By: rohan-varma
Differential Revision: D28495672
fbshipit-source-id: fdcc405411dd409634eb51806ee0a320d1ecd4e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58329
This PR is part of a stack that addresses the GitHub issue #41614; it introduces:
- A new `multiTenant` constructor option for the `TCPStore` class indicating whether multiple store instances can be initialized with the same host:port pair.
- Updates to the C10d distributed (elastic) rendezvous and the `init_process_group` method to leverage the new `multiTenant` feature.
Note that the multi-tenancy feature itself is implemented in the fourth PR of this stack. In this PR, passing `true` to `multiTenant` results only in a warning output.
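In Python terms, the intended end state looks roughly like this (sketch; the binding's keyword is assumed to be `multi_tenant`):
```
from datetime import timedelta
from torch.distributed import TCPStore

# Two store instances bound to the same host:port pair, which the
# multi-tenant option is meant to allow (in this PR it only warns).
store_a = TCPStore("localhost", 29500, world_size=2, is_master=True,
                   timeout=timedelta(seconds=30), multi_tenant=True)
store_b = TCPStore("localhost", 29500, world_size=2, is_master=True,
                   timeout=timedelta(seconds=30), multi_tenant=True)
```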
ghstack-source-id: 130676389
Test Plan: Run the existing tests since there are no behavioral changes.
Reviewed By: rohan-varma
Differential Revision: D28424978
fbshipit-source-id: fb1d1d81b8b5884cc5b54486700a8182a69c1f29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58753
TSAN was (rightfully!) detecting and complaining about a race due to the fact that upon init the TP agent exchanges the device maps between nodes using RPC requests (and by doing so it accesses the device maps) and then sets the reverse device maps (thus possibly modifying the set of devices). This resulted in a data race, i.e., simultaneously reading and writing the set of devices without synchronizing.
One solution is to add a mutex around the devices, which works, but is "annoying". An alternative solution is to make the set of devices immutable (i.e., `const`). For that to work, we need to exchange the device maps without using RPC calls. We can do so using the process group that we need to create anyway.
Since now there's a lot more logic in Python, I've moved (and restructured) all safety checks over there, and removed them from C++.
ghstack-source-id: 130583775
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D28603754
fbshipit-source-id: 88533e65d72d1eb806dc41bec8d55def5082e290
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58281
When TORCH_DISTRIBUTED_DEBUG=DETAIL is enabled, this PR causes process groups created by `new_group` and `init_process_group` that are nccl or gloo to be wrapped in `ProcessGroupWrapper`.
As a result, the user will get back a `ProcessGroupWrapper` that they can use in the exact same way as a regular nccl/gloo pg, but will be more helpful in terms of debugging desync/hangs.
Besides doing collective desync checks, which should be transparent if there are indeed no issues in the user application, there are no semantic differences in using the wrapper pg. Note that there is a performance implication here but that is a tradeoff we are making when DETAIL debug mode is enabled.
Open to suggestions on how to test this better. Currently I verified locally that enabling TORCH_DISTRIBUTED_DEBUG=detail creates the wrapper and all tests still pass, but that doesn't run in CI. On the other hand, testing everything with debug=detail in addition to the regular tests might be too much, so we have only added it to a few tests for now. We also have tests in the below diff.
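For reference, enabling the wrapper is just an environment variable (sketch):
```
import os
import torch.distributed as dist

# Must be set before the process group is created; pgs created by
# init_process_group/new_group are then wrapped in ProcessGroupWrapper.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
dist.init_process_group("nccl")
pg = dist.new_group(ranks=[0, 1])  # also wrapped under DETAIL
```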
ghstack-source-id: 129817857
Test Plan: ci
Reviewed By: SciPioneer
Differential Revision: D28402301
fbshipit-source-id: c4d3438320f6f0986e128c738c9d4a87bbb6eede
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58224
Adds C++ implementation of ProcessGroupWrapper. It wraps
an underlying ProcessGroup and does debug checks before dispatching the
collective to the underlying pg. The design mostly follows https://github.com/pytorch/pytorch/issues/22071.
Concretely, on each collective, we:
1. Verify op type consistency. This can help catch mismatched ops in the user application (e.g., allreduce on one rank and allgather on another)
2. Verify tensor shapes. This can help catch bugs where the tensor inputs are malformed, whereas normally in NCCL this would just lead to a hang. The shapes verification for allgather/allreduce_coalesced is omitted because they actually accept different shape tensors and don't error out.
This is done through an abstraction called `CollectiveFingerPrint` which uses a helper process group to do the above verification. Concretely, we gather the data we need for each of the above checks into tensors, and allgather them, and verify their equivalence.
Once all of this passes we simply dispatch the collective to the underlying pg.
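In Python pseudocode, the per-collective check amounts to something like this (illustrative sketch only; the real logic lives in C++ and also handles variable-length metadata):
```
import torch
import torch.distributed as dist

def check_collective_fingerprint(helper_pg, op_type, tensors):
    # Encode this rank's view of the collective: the op type plus the
    # shape of every input tensor, flattened into one integer tensor.
    shapes = [d for t in tensors for d in t.shape]
    fingerprint = torch.tensor([op_type, len(tensors)] + shapes, dtype=torch.long)

    # Allgather the fingerprints through the helper pg and compare.
    world_size = dist.get_world_size(group=helper_pg)
    gathered = [torch.empty_like(fingerprint) for _ in range(world_size)]
    dist.all_gather(gathered, fingerprint, group=helper_pg)
    for rank, other in enumerate(gathered):
        if not torch.equal(other, fingerprint):
            raise RuntimeError(f"Detected mismatched collective on rank {rank}")
```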
Added `ProcessGroupWrapperTest` in python to comprehensively test these changes.
ghstack-source-id: 129735687
Test Plan: ci
Reviewed By: zhaojuanmao
Differential Revision: D28023981
fbshipit-source-id: 1defc203c5efa72ca0476ade0d1d8d05aacd4e64
Summary:
Will not land before the release, but it would be good to have this function documented in master for its use in distributed debuggability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58322
Reviewed By: SciPioneer
Differential Revision: D28595405
Pulled By: rohan-varma
fbshipit-source-id: fb00fa22fbe97a38c396eae98a904d1c4fb636fa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57711
We are seeing some hangs/issues around the store-based barrier internally; it would
be good to have this log to indicate whether the store-based barrier has completed
successfully for a particular rank, to help debug further.
ghstack-source-id: 128605600
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D28249087
fbshipit-source-id: 644e5780519017ae780c3bc78bbe5def322db3f8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56531
per discussions in
https://github.com/pytorch/pytorch/pull/53663/files#r593409009, we need
to make sure our API does not confuse users by accepting a timeout both as an
argument and in processgroup.options. This PR makes
`ProcessGroup.Options.timeout` a private field used only in
our test utils; for both `init_process_group` and `new_group`, we still
allow users to pass `timeout` as a separate argument. Since
`ProcessGroupGloo.Options` only has a `timeout` config, both functions
disallow passing in options for the GLOO backend.
This way we still preserve the single `timeout` API, and only allow users
to pass `ProcessGroupNCCL.Options` when needed.
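After this change, the supported patterns look roughly like (sketch):
```
from datetime import timedelta
import torch.distributed as dist
from torch.distributed import ProcessGroupNCCL

# timeout stays a plain argument for every backend...
dist.init_process_group("nccl", timeout=timedelta(minutes=10))

# ...while backend-specific options remain available for NCCL only.
opts = ProcessGroupNCCL.Options()
opts.is_high_priority_stream = True
pg = dist.new_group(ranks=[0, 1], timeout=timedelta(minutes=5), pg_options=opts)
```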
cc pritamdamania87 rohan-varma
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D27893395
Pulled By: wanchaol
fbshipit-source-id: cdd29c84648002226ef3d9f9f3ea67b795e64bc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55319
Adds a sequence number class as well as integration with ProcessGroup (nccl and gloo) as part of better debuggability.
The main use case is that each ProcessGroup instantiated will have a sequence number initially set by rank 0, and broadcasted to all others. We will increment the number on each collective, thus allowing us to match the numbers appropriately when checking for desynchronization.
This PR just adds the bare-bones integration and verifies sequence numbers are set appropriately at the beginning.
ghstack-source-id: 127011277
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27562769
fbshipit-source-id: d4a4de7529ce07a0c86fcf6beb06f317f359d89b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55861
APIs such as torch.LongTensor and torch.ByteTensor are deprecated and
the recommended API is torch.tensor(args, dtype=...). Use this API in
distributed_c10d.
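i.e., replacing calls like the following:
```
import torch

# Deprecated constructors:
ranks = torch.LongTensor([0])
flags = torch.ByteTensor([1, 0])

# Recommended equivalents:
ranks = torch.tensor([0], dtype=torch.long)
flags = torch.tensor([1, 0], dtype=torch.uint8)
```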
ghstack-source-id: 126777875
Test Plan: CI
Reviewed By: pbelevich
Differential Revision: D27726600
fbshipit-source-id: 07eb8168d93697593589002c93c3903ce29431ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55990
Reland of https://github.com/pytorch/pytorch/pull/55197, which fails windows test that was only run on master.
Disabled these tests for Windows, similar to how they are disabled on macOS. The reason for disabling is that they use the libuv transport, which does not have error handling as robust as TCP on Linux. The result is that non-zero ranks that were healthy don't throw immediately (like they do on Linux) but instead throw on timeout. The error handling still occurs as expected on rank 0 for all platforms.
ghstack-source-id: 126478371
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D27758424
fbshipit-source-id: d30841c8dda77f51b09a58161e638657ef758e63
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55197
From initial user feedback, one unexpected difference between the monitored_barrier implementation and barrier is the "all or nothing" semantics.
In barrier, all ranks pass or they all fail. With monitored barrier, however, if rank 1 is healthy, it will respond to both send and recv from rank 0, but rank 0 can later fail because rank 2 is stuck. In this case, rank 1 will move forward out of the barrier.
This change makes it so that if a rank fails in monitored barrier, all other ranks in monitored barrier will also fail. It does so by the following process, similar to acknowledgements:
1. Nonzero ranks call send()
1. Nonzero ranks call recv()
1. Rank 0 calls recv(); if this succeeds, rank 0 has acknowledged rank N as healthy
1. Once all ranks are acknowledged as healthy, rank 0 calls send() to all nonzero ranks to unblock them
Modified unittests to ensure the all or nothing failure behavior
ghstack-source-id: 126413088
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D27523060
fbshipit-source-id: fa05e4f8ad8ae97fd6cb20da5c3a7ef76fd31de6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55660
Noticed this doc was missing the clarification on NCCL env vars that the
init_process_group docs have. Also, specified the default behavior when backend=None
is passed in.
ghstack-source-id: 126251116
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27672208
fbshipit-source-id: 2e79d297174e135173bceb059450ea267367bde4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55010
Follow-up change to add a flag providing an option for monitored barrier to collect all the failed ranks and then throw, instead of just throwing on the first one. This is useful, as monitored barrier will now be able to pick up all hanging ranks instead of just one.
This is done by passing in a flag `wait_all_ranks=True`.
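Usage sketch (monitored barrier is gloo-only):
```
from datetime import timedelta
import torch.distributed as dist

# With wait_all_ranks=True, rank 0 collects every unresponsive rank
# before raising, instead of failing on the first one it finds.
dist.monitored_barrier(timeout=timedelta(seconds=30), wait_all_ranks=True)
```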
ghstack-source-id: 125699839
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27447787
fbshipit-source-id: ec23aee212060d9eb515ff8adc96c6a17822d1bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53787
Per title, exposes a python-based monitored barrier API that we can use as part of debuggability and that may be useful for user applications.
ghstack-source-id: 125124315
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D26965127
fbshipit-source-id: 6c7826e63758462e3e5111f28cced54cba76a758
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53663
This adds the process group options as an optional argument to `new_group`
and `init_process_group`, allowing users to pass in initialized
process group options for gloo and nccl.
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D26968857
Pulled By: wanchaol
fbshipit-source-id: 2ff73a009120b85e83ecde7c69956b731902abc2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53060
As title. We would like to use alternative pickler/unpickler
implementations, to make it possible to send objects over the wire that
come from a torch.package.
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D26737317
Pulled By: suo
fbshipit-source-id: 6bdef9824e48ef657dcad72cc5a9114e6612ea4a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50625
Make API signatures consistent and provide default arguments similar to
the tensor collectives.
ghstack-source-id: 120718121
Test Plan: CI
Reviewed By: wanchaol
Differential Revision: D25932012
fbshipit-source-id: d16267e236a65ac9d55e19e2178f9d9267b08a20
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49930
Certain store implementations don't work well when we use get() and
add() on the same key. To avoid this issue, we only use add() in the store
based barrier. The buggy store implementations can't be properly fixed due to
legacy reasons.
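The trick is that `add(key, 0)` doubles as a read, so the barrier never has to call `get()` on the same key (illustrative sketch):
```
import time
from datetime import timedelta

def store_based_barrier(store, world_size, timeout=timedelta(seconds=300)):
    key = "store_based_barrier_key"
    # Each rank increments the shared counter once...
    store.add(key, 1)
    # ...and then "reads" it by adding zero until everyone has arrived.
    start = time.time()
    while store.add(key, 0) != world_size:
        if time.time() - start > timeout.total_seconds():
            raise RuntimeError("Timed out waiting for store-based barrier")
        time.sleep(0.01)
```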
Test Plan:
1) unit tests.
2) waitforbuildbot
Reviewed By: osalpekar
Differential Revision: D25725386
fbshipit-source-id: 1535e2629914de7f78847b730f8764f92cde67e7
Summary:
For a multi-GPU node, a rank's GPU need not match its rank number.
Provide an optional parameter to specify the GPU device number for the
allreduce operation in the barrier function.
Add test cases to validate barrier device_ids.
Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>
Fixes https://github.com/pytorch/pytorch/issues/48110
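With the new parameter, a rank can pin the barrier's internal allreduce to its own GPU (sketch, NCCL backend; LOCAL_RANK as set by the launcher):
```
import os
import torch.distributed as dist

dist.init_process_group("nccl")
# device_ids tells the NCCL barrier which GPU to run its internal
# allreduce on, instead of inferring the device from the rank number.
local_rank = int(os.environ["LOCAL_RANK"])
dist.barrier(device_ids=[local_rank])
```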
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49069
Reviewed By: mrshenli
Differential Revision: D25658528
Pulled By: rohan-varma
fbshipit-source-id: 418198b6224c8c1fd95993b80c072a8ff8f02eec