Commit Graph

430 Commits

Author SHA1 Message Date
fduwjj
40ce9a4cfb [c10d] Create a python c10d API _set_pg_timeout to set timeout (#115453)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115453
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2023-12-12 20:52:43 +00:00
Howard Huang
99f06c0cc2 [BE] update errors to be more descriptive (#115443)
we call `_check_single_tensor` and `_check_tensor_list` as validation but don't print out the param types that were invalid

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115443
Approved by: https://github.com/XilunWu
2023-12-11 21:21:10 +00:00
Chip Turner
937d616e82 Re-enable type checking for distributed_c10d.py (#115223)
Re-enable type checking for distributed_c10d.py

Type checking for distributed_c10d.py was inadvertently turned off in issues that have accumulated since.

Note: the backwards compatibility linter does not like some of these changes.  But they were incorrect before.  This needs human verification, however.

#suppress-api-compatibility-check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115223
Approved by: https://github.com/wconstab
2023-12-09 11:07:54 +00:00
Chip Turner
78b945484b [c10d] Extend NCCL communicator splitting to more use cases (#114916)
Previously we could only use `ncclCommSplit` when we knew all backends were connected on all shards (due to the need to perform a NOCOLOR split), which in practice meant we could only use it for subgroups that were copies of the entire world.

This change allows for specifying a bound device id to `init_process_group` which tells the pg and its backends that the specified device, and the specified device only, will be associated with this rank.

This guarantee lets us do an early connect (which we could not previously do due to how ProcessGroupNCCL infers devices based on tensors and not the rank number).  And by doing the early connect, we have the guarantee ranks are connected and can perform nocolor splits when needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114916
Approved by: https://github.com/kwen2501
2023-12-07 15:13:01 +00:00
Chip Turner
9cc040fef6 Switch env variable use in test harnesses to the non-deprecated names to fix warnings (#114880)
Previously:

```
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
```

With this PR, those warnings disappear.  They were introduced in #114077

This change was generated with this sed script, applied with `sed -i -f /tmp/x **/*.{py,hpp,cpp,cc}` and hand inspected.

```
s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g
s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g
s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g
s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g
s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g
s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114880
Approved by: https://github.com/kwen2501
2023-12-01 20:08:23 +00:00
Chip Turner
066e072524 Retry #112889 (Opportunistically use ncclCommSplit when creating new NCCL groups) (#114385)
- [c10d] (retry) Opportunistically use `ncclCommSplit` when creating new NCCL groups (#112889)
- Guard use of `split_from` with a `hasattr` check for cases when NCCL (or RCCL) lacks `ncclCommSplit`

Fixes cause of revert of original PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114385
Approved by: https://github.com/huydhn
2023-11-23 07:00:00 +00:00
PyTorch MergeBot
b927a4e2ca Revert "Opportunistically use ncclCommSplit when creating new NCCL groups (#112889)"
This reverts commit 64a5372e6c.

Reverted https://github.com/pytorch/pytorch/pull/112889 on behalf of https://github.com/huydhn due to Sorry for reverting you change, but it is failing ROCm distributed jobs in trunk 4d07428ede ([comment](https://github.com/pytorch/pytorch/pull/112889#issuecomment-1823214376))
2023-11-22 17:43:51 +00:00
Chip Turner
64a5372e6c Opportunistically use ncclCommSplit when creating new NCCL groups (#112889)
Currently `ncclCommInitRankConfig` is always used when creating new
communicator groups.  This is wasteful as it creates non-shared pairs
of endpoint queues as well as costs time to re-establish
communication.

This change is transparent and opportunistic; when `dist.new_group` is
called, it will use the existing, healthy world process group to
select the right ranks to include in the process group.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112889
Approved by: https://github.com/kwen2501
2023-11-21 21:03:52 +00:00
Ke Wen
dc65f6c601 [c10d] Remove deprecated multi-gpu-per-thread APIs (#114156)
As of today, PyTorch Distributed's preferred programming model is one device per thread, as exemplified by the APIs in its document.  The multi-GPU functions (which stand for multiple GPUs per CPU thread) have been deprecated for three versions. Removing them now before 2.2 release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114156
Approved by: https://github.com/albanD, https://github.com/fduwjj, https://github.com/H-Huang
2023-11-21 03:50:23 +00:00
Shengbao Zheng
e53da90fe6 [Execution Trace] record global rank in pg_config_info (#113316)
Summary:
pg_config_info is used to dump pg information in Execution Trace(ET). For trace analysis purpose and PARAM replay benchmark, global rank is more meaningful than group ranks.

p.s. ranks is a map of global rank: group rank

Test Plan: Tested in HPC

Differential Revision: D51136587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113316
Approved by: https://github.com/XilunWu
2023-11-09 20:04:43 +00:00
Ke Wen
bb7ac12cbf [ProcessGroupNCCL] Avoid recording stream for broadcast and scatter (#112896)
Summary: Follows PR #111431, save memory for DTensor init

Test Plan: Sandcastle

Reviewed By: wanchaol

Differential Revision: D50985365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112896
Approved by: https://github.com/wanchaol
2023-11-07 15:44:04 +00:00
Will Constable
ff51f94e32 [Reland] Fix default timeouts for python entrypoints (e.g. init_process_group) (#113094)
Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.

This PR ensures that NCCL pg always default to the changed NCCL-specific
timeout value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113094
Approved by: https://github.com/fduwjj
2023-11-07 05:34:26 +00:00
PyTorch MergeBot
75adb9f371 Revert "Fix default timeouts for python entrypoints (e.g. init_process_group) (#112893)"
This reverts commit f9d47e1381.

Reverted https://github.com/pytorch/pytorch/pull/112893 on behalf of https://github.com/clee2000 due to sorry this seems to have broken inductor f9d47e1381 https://github.com/pytorch/pytorch/actions/runs/6776367936/job/18418174752 ([comment](https://github.com/pytorch/pytorch/pull/112893#issuecomment-1796979811))
2023-11-06 22:49:53 +00:00
Will Constable
f9d47e1381 Fix default timeouts for python entrypoints (e.g. init_process_group) (#112893)
Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.

This PR ensures that NCCL pg always default to the changed NCCL-specific
timeout value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112893
Approved by: https://github.com/xw285cornell, https://github.com/kwen2501, https://github.com/XilunWu
ghstack dependencies: #112611, #112803
2023-11-06 20:48:39 +00:00
Sahdev Zala
c6ecd018d5 Fix docstring errors (#112693)
This PR reduces docstring erros to 0 from total 128. This can be verified by running, pydocstyle path-to-distributed_c10d.py --count

Where, path-to-distributed_c10d.py is `torch/distributed/distributed_c10d.py`

BEFORE the PR:
`pydocstyle torch/distributed/distributed_c10d.py --count`
128
AFTER the PR:
`pydocstyle torch/distributed/distributed_c10d.py --count`
0

Fixes #112640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112693
Approved by: https://github.com/H-Huang
2023-11-06 18:45:05 +00:00
Will Constable
65b74c9254 Make init_process_group timeout kwarg override pg_options (#112611)
This used to be ambiguous but the pg_options._timeout value, if passed
in, is being ignored.  Make it sane and warn if 2 values are provided.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112611
Approved by: https://github.com/H-Huang
2023-11-03 23:13:03 +00:00
Aaron Gokaslan
cb856b08b2 [BE]: Attach cause to some exceptions and enable RUFF TRY200 (#111496)
Did some easy fixes from enabling TRY200. Most of these seem like oversights instead of intentional. The proper way to silence intentional errors is with `from None` to note that you thought about whether it should contain the cause and decided against it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111496
Approved by: https://github.com/malfet
2023-10-19 21:56:36 +00:00
Shengbao Zheng
8899abde32 [PyTorch][ET] Improve Process Groups Mapping Info Collection (#110908)
Summary:
Process Groups Mapping info collection was introduced in D46321690.

Improve the mapping info collected there:
- replace pg_id (a unique ID for the PG object) with pg_names (a unique name for each pg and shared by all ranks)
- add number of pg info with group_count
- reduce the length of pg_config_info to avoid being truncated(max length of 4096, now doubled ) by
  - migrating ranks(a map from global ranks to group ranks) with the list of global ranks of a pg, since we currently don't use group rank id
  - using an empty rank list to indicate that all ranks are involved in a pg and adding a field of group_size to show how many ranks are involved

Test Plan:
Tested in HPC
```
buck2 run mode/opt //hpc/torchrec/models/ads:cmf_10x_launcher -- launcher=local data_loader=random data_loader.num_batches=100 checkpoint=model_store max_ind_range=10 launcher.num_trainers=8
```
Example output in ET
```
{
"name": "## process_group:init ##", "id": 3, "rf_id": 1, "parent": 2, "fw_parent": 0, "seq_id": -1, "scope": 7, "tid": 1, "fw_tid": 0, "op_schema": "",
      "inputs": ["[{\"pg_name\": \"0\", \"backend_id\": 140688385794048, \"backend_config\": \"cuda:nccl\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"1\", \"backend_id\": 140688386762752, \"backend_config\": \"cuda:nccl\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"2\", \"backend_id\": 140682531798720, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"faa29c0b1e06cd7abc873bd561414911_0\", \"backend_id\": 140672678002688, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"3\", \"backend_id\": 140672678007616, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"faa29c0b1e06cd7abc873bd561414911_1\", \"backend_id\": 140672678012544, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}]"], "input_shapes": [[]], "input_types": ["String"],
      "outputs": [], "output_shapes": [], "output_types": []
    },
```

Before the change, pg_config_info of >128 rank will be truncated, e.g.
```
"inputs": ["[{\"pg_id\": 140321146893696, \"backend_id\": 140321113854976, \"backend_config\": \"cuda:nccl\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7, \"8\": 8, \"9\": 9, \"10\": 10, \"11\": 11, \"12\": 12, \"13\": 13, \"14\": 14, \"15\": 15, \"16\": 16, \"17\": 17, \"18\": 18, \"19\": 19, \"20\": 20, \"21\": 21, \"22\": 22, \"23\": 23, \"24\": 24, \"25\": 25, \"26\": 26, \"27\": 27, \"28\": 28, \"29\": 29, \"30\": 30, \"31\": 31, \"32\": 32, \"33\": 33, \"34\": 34, \"35\": 35, \"36\": 36, \"37\": 37, \"38\": 38, \"39\": 39, \"40\": 40, \"41\": 41, \"42\": 42, \"43\": 43, \"44\": 44, \"45\": 45, \"46\": 46, \"47\": 47, \"48\": 48, \"49\": 49, \"50\": 50, \"51\": 51, \"52\": 52, \"53\": 53, \"54\": 54, \"55\": 55, \"56\": 56, \"57\": 57, \"58\": 58, \"59\": 59, \"60\": 60, \"61\": 61, \"62\": 62, \"63\": 63, \"64\": 64, \"65\": 65, \"66\": 66, \"67\": 67, \"68\": 68, \"69\": 69, \"70\": 70, \"71\": 71, \"72\": 72, \"73\": 73, \"74\": 74, \"75\": 75, \"76\": 76, \"77\": 77, \"78\": 78, \"79\": 79, \"80\": 80, \"81\": 81, \"82\": 82, \"83\": 83, \"84\": 84, \"85\": 85, \"86\": 86, \"87\": 87, \"88\": 88, \"89\": 89, \"90\": 90, \"91\": 91, \"92\": 92, \"93\": 93, \"94\": 94, \"95\": 95, \"96\": 96, \"97\": 97, \"98\": 98, \"99\": 99, \"100\": 100, \"101\": 101, \"102\": 102, \"103\": 103, \"104\": 104, \"105\": 105, \"106\": 106, \"107\": 107, \"108\": 108, \"109\": 109, \"110\": 110, \"111\": 111, \"112\": 112, \"113\": 113, \"114\": 114, \"115\": 115, \"116\": 116, \"117\": 117, \"118\": 118, \"119\": 119, \"120\": 120, \"121\": 121, \"122\": 122, \"123\": 123, \"124\": 124, \"125\": 125, \"126\": 126, \"127\": 127}}, {\"pg_id\": 140321074662400, \"backend_id\": 140321100033024, \"backend_config\": \"cuda:nccl\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7, \"8\": 8, \"9\": 9, \"10\": 10, \"11\": 11, \"12\": 12, \"13\": 13, \"14\": 14, \"15\": 15, \"16\": 16, \"17\": 17, \"18\": 18, \"19\": 19, \"20\": 20, \"21\": 21, \"22\": 22, \"23\": 23, \"24\": 24, \"25\": 25, \"26\": 26, \"27\": 27, \"28\": 28, \"29\": 29, \"30\": 30, \"31\": 31, \"32\": 32, \"33\": 33, \"34\": 34, \"35\": 35, \"36\": 36, \"37\": 37, \"38\": 38, \"39\": 39, \"40\": 40, \"41\": 41, \"42\": 42, \"43\": 43, \"44\": 44, \"45\": 45, \"46\": 46, \"47\": 47, \"48\": 48, \"49\": 49, \"50\": 50, \"51\": 51, \"52\": 52, \"53\": 53, \"54\": 54, \"55\": 55, \"56\": 56, \"57\": 57, \"58\": 58, \"59\": 59, \"60\": 60, \"61\": 61, \"62\": 62, \"63\": 63, \"64\": 64, \"65\": 65, \"66\": 66, \"67\": 67, \"68\": 68, \"69\": 69, \"70\": 70, \"71\": 71, \"72\": 72, \"73\": 73, \"74\": 74, \"75\": 75, \"76\": 76, \"77\": 77, \"78\": 78, \"79\": 79, \"80\": 80, \"81\": 81, \"82\": 82, \"83\": 83, \"84\": 84, \"85\": 85, \"86\": 86, \"87\": 87, \"88\": 88, \"89\": 89, \"90\": 90, \"91\": 91, \"92\": 92, \"93\": 93, \"94\": 94, \"95\": 95, \"96\": 96, \"97\": 97, \"98\": 98, \"99\": 99, \"100\": 100, \"101\": 101, \"102\": 102, \"103\": 103, \"104\": 104, \"105\": 105, \"106\": 106, \"107\": 107, \"108\": 108, \"109\": 109, \"110\": 110, \"111\": 111, \"112\": 112, \"113\": 113, \"114\": 114, \"115\": 115, \"116\": 116, \"117\": 117, \"118\": 118, \"119\": 119, \"120\": 120, \"121\": 121, \"122\": 122, \"123\": 123, \"124\": 124, \"125\": 125, \"126\": 126, \"127\": 127}}, {\"pg_id\": 140321154994304, \"backend_id\": 140319780290048, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7, \"8\": 8, \"9\": 9, \"10\": 10, \"11\": 11, \"12\": 12, \"13\": 13, \"14\": 14, \"15\": 15, \"16\": 16, \"17\": 17, \"18\": 18, \"19\": 19, \"20\": 20, \"21\": 21, \"22\": 22, \"23\": 23, \"24\": 24, \"25\": 25, \"26\": 26, \"27\": 27, \"28\": 28, \"29\": 29, \"30\": 30, \"31\": 31, \"32\": 32, \"33\": 33, \"34\": 34, \"35\": 35, \"36\": 36, \"37\": 37, \"38\": 38, \"39\": 39, \"40\": 40, \"41\": 41, \"42\": 42, \"43\": 43, \"44\": 44, \"45\": 45, \"46\": 46, \"47\": 47, \"48\": 48, \"49\": 49, \"50\": 50, \"51\": 51, \"52\": 52, \"53\": 53, \"54\": 54, \"55\": 55, \"56\": 56, \"57\": 57, \"58\": 58, \"59\": 59, \"60\": 60, \"61\": 61, \"62\": 62, \"63\": 63, \"64\": 64, \"65\": 65, \"66\": 66, \"67\": 67, \"68\": 68, \"69\": 69, \"70\": 70, \"71\": 71, \"72\": 72, \"73\": 73, \"74\": 74, \"75\": 75, \"76\": 76, \"77\": 77, \"78\": 78, \"79\": 79, \"80\": 80, \"81\": 81, \"82\": 82, \"83\": 83, \"84\": 84, \"85\": 85, \"86\": 86, \"87\": 87, \"88\": 88, \"89\": 89, \"90\": 90, \"91\": 91, \"92\": 92, \"93\": 93, \"94\": 94, \"95\": 95, \"96\": 96, \"97\": 97, \"98\": 98, \"99\": 99, \"100\": 100, \"101\": 101, \"102\": 102, \"103\": 103, \"104\": 104, \"105\": 105, \"106\": 106, \"107\": 107, \"108\": 108, \"109\": 109, \"110\": 110, \"111\": 111, \"112\": 112, \"113\": 113, \"114\""], "input_shapes": [[]], "input_types": ["String"],

```
After the change the length reduced
```
"inputs": ["[{\"pg_name\": \"0\", \"backend_id\": 140551405059072, \"backend_config\": \"cuda:nccl\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"1\", \"backend_id\": 140551399745536, \"backend_config\": \"cuda:nccl\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"2\", \"backend_id\": 140578999821184, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"ea2f9024c70c8b9a25bc06a4723e5805_0\", \"backend_id\": 140559197777152, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"3\", \"backend_id\": 140549119076736, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"ea2f9024c70c8b9a25bc06a4723e5805_1\", \"backend_id\": 140571995143424, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}]"], "input_shapes": [[]], "input_types": ["String"],
```

Reviewed By: louisfeng, fduwjj

Differential Revision: D50048147

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110908
Approved by: https://github.com/fduwjj
2023-10-19 21:37:19 +00:00
Ke Wen
18cc8a92ac [ProcessGroupNCCL] Avoid recording stream for synchronous ops (#111431)
For synchronous ops (i.e. `asyncOp = False`), we don't want to record streams because we know that the NCCL stream will join back to the "current" stream right after this op. So we might just as well keep the stream ownership of the input/output tensors unchanged. The benefit would be that the allocation/free of the tensors would look deterministic to the "current" stream so that the caching allocator can reuse memory pool for this stream in a clever way.

To prevent the input/output tensors from being recycled by python, we rely on the stashing mechanism in ProcessGroupNCCL (which can be also turned on by setting `TORCH_NCCL_AVOID_RECORD_STREAMS=1`).

This mechanism change is for libraries like FSDP which uses `all_gather_into_tensor` and `reduce_scatter_tensor` in a synchronous way and which cannot set `TORCH_NCCL_AVOID_RECORD_STREAMS=1` for their users. And therefore, this change is limited to these two collectives for now.

Cc: @awgu @janeyx99 @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111431
Approved by: https://github.com/H-Huang
2023-10-19 00:41:09 +00:00
PyTorch MergeBot
1e70f4d02c Revert "Reland #2 "[C10] PG observability hooks. (#108815, #110907)" (#111072)"
This reverts commit bb1424d46e.

Reverted https://github.com/pytorch/pytorch/pull/111072 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111072#issuecomment-1765399829))
2023-10-16 23:03:26 +00:00
Will Constable
bb1424d46e Reland #2 "[C10] PG observability hooks. (#108815, #110907)" (#111072)
This reverts commit 314a502eb0.

Changes since original PR:
Reland 1
 *  rename torch.distributed.hooks to torch.distributed._hooks

Reland 2
 * make _hooks importable even if !distributed.is_available()
 * handle cuda driver exit intermittent failure caused by new cuda api usage in callback caller (see prev PR in stack)

(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)

Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.

The design is similar to NCCL desync debug that it minimized the
overhead by doing most of the work out of the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and call them inline from the
the PG creation code. This is fine since this happens during initialization and a limited number of times.

The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches over.

Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown
and have it as background thread. This is not possible with more reasonable choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111072
Approved by: https://github.com/malfet
ghstack dependencies: #111061
2023-10-12 16:59:23 +00:00
PyTorch MergeBot
314a502eb0 Revert "Reland "[C10] PG observability hooks. (#108815)" (#110907)"
This reverts commit 7678cd22af.

Reverted https://github.com/pytorch/pytorch/pull/110907 on behalf of https://github.com/huydhn due to Sorry for reverting this, but macos job in trunk starts failing after this 7678cd22af ([comment](https://github.com/pytorch/pytorch/pull/110907#issuecomment-1756497387))
2023-10-11 00:23:42 +00:00
Will Constable
7678cd22af Reland "[C10] PG observability hooks. (#108815)" (#110907)
This reverts commit ff0358b038.

(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)

Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.

The design is similar to NCCL desync debug that it minimized the
overhead by doing most of the work out of the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and call them inline from the
the PG creation code. This is fine since this happens during initialization and a limited number of times.

The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches over.

Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown
and have it as background thread. This is not possible with more reasonable choices like a condvar.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907
Approved by: https://github.com/fduwjj
2023-10-10 20:09:40 +00:00
Edward Z. Yang
de3ae93e9b Include rank of default PG in C++ log messages (#110623)
I tested by adding some warning logs in C++, run a distributed program and show that they now had `[rank0]:` in the messages. There is no existing test infra for C++ logging so I couldn't easily add a unit test.

The implementation strategy is to setup a global variable in C++, and then poke it when we initialize a process group. This was the simplest thing I could think of that would work.

This PR only works for non-glog logging. Probably need to come up with some other strategy for glog, e.g., a custom prefix, but need to make sure this doesn't conflict with fbcode. I can't easily test this from OSS, will leave as follow up work.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110623
Approved by: https://github.com/voznesenskym, https://github.com/wanchaol, https://github.com/fduwjj
2023-10-10 00:26:52 +00:00
Kazuaki Ishizaki
b5f9696d81 Fix typo under torch directory (#110824)
This PR fixes typo `the the` of comments and exception messages in files under `torch` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110824
Approved by: https://github.com/H-Huang
2023-10-09 19:16:43 +00:00
PyTorch MergeBot
ff0358b038 Revert "[C10] PG observability hooks. (#108815)"
This reverts commit 0c7a877745.

Reverted https://github.com/pytorch/pytorch/pull/108815 on behalf of https://github.com/albanD due to Add a new torch.distributed.hooks namespace but does not document it, test was added this morning ([comment](https://github.com/pytorch/pytorch/pull/108815#issuecomment-1751327751))
2023-10-06 19:49:49 +00:00
Rodrigo Kumpera
0c7a877745 [C10] PG observability hooks. (#108815)
Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.

The design is similar to NCCL desync debug that it minimized the
overhead by doing most of the work out of the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and call them inline from the
the PG creation code. This is fine since this happens during initialization and a limited number of times.

The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches over.

Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown
and have it as background thread. This is not possible with more reasonable choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2023-10-06 18:52:46 +00:00
Howard Huang
0949d97c16 fix batch_isend_irecv example incorrect usage (#110408)
mismatched dtypes silently leads to wrong outputs in nccl

```
1:recv_tensor=tensor([0., 0.], device='cuda:1')
0:recv_tensor=tensor([2.8026e-45, 0.0000e+00], device='cuda:0')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110408
Approved by: https://github.com/awgu, https://github.com/Neilblaze
2023-10-04 22:57:03 +00:00
Rohan Varma
40be6b72e1 [ez] Type function in distributed_c10d (#110435)
This function returns a `torch.device`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110435
Approved by: https://github.com/awgu
2023-10-03 17:54:04 +00:00
Rodrigo Kumpera
c26270c733 [C10D] Even more store scalability work. (#109218)
Fix a bug socket.cpp in timeout detection that only shows up with 10k ranks.

Make the minimum wait time in _store_based_barrier to be adaptative based on
the number of ranks.

Longer timeouts give more room for the store to do productive work when swamped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109218
Approved by: https://github.com/XilunWu
ghstack dependencies: #109217
2023-09-22 21:27:09 +00:00
Howard Huang
600d0d0284 Add "cuda" to MPI backend capabilities (#109614)
Summary: Fixes https://github.com/pytorch/pytorch/issues/109543

Test Plan: We need to run CUDA aware MPI in PyTorch to actually test this change, we currently have no MPI tests.

Differential Revision: D49420438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109614
Approved by: https://github.com/XilunWu
2023-09-21 13:34:58 +00:00
Rodrigo Kumpera
881bfbf21d [c10d] Add tests for usig libuv through init_process_group. (#108661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108661
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2023-09-20 16:02:20 +00:00
Rodrigo Kumpera
2bca5f2af7 [C10D] Track pg name in c++. (#108813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108813
Approved by: https://github.com/wconstab
2023-09-15 01:10:29 +00:00
Brian Vaughan
bb14805bcd fix an incorrect indent in documentation (#108273)
doc for `torch.distributed.send(tensor, dst, group=None, tag=0)` was rendering incorrectly here: https://pytorch.org/docs/stable/distributed.html due to lack of indent (it was interpreting the continuation as a new argument).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108273
Approved by: https://github.com/awgu, https://github.com/kit1980
2023-09-11 21:27:52 +00:00
Pritam Damania
704b0b3c67 [RESUBMIT] Standardize on error types for distributed errors. (#108191)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
2023-08-30 21:47:39 +00:00
PyTorch MergeBot
d4ff06ec84 Revert "Standardize on error types for distributed errors. (#107651)"
This reverts commit 0e2317479b.

Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor test in trunk for one of its model moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))
2023-08-28 23:58:33 +00:00
Pritam Damania
0e2317479b Standardize on error types for distributed errors. (#107651)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
2023-08-28 21:58:15 +00:00
wz337
264df88a2d [C10D][Logger]Add more info to c10d logger (#107331)
This PR adds pg_name and world_size to c10d logging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107331
Approved by: https://github.com/kumpera
2023-08-28 15:10:56 +00:00
Codle
42738c56a0 Skip the extra copy operation in broadcast_object_list if tensor_list has only one element (#107509)
The `broadcast_object_list` function can easily broadcast the state_dict of models/optimizers. However, the `torch.cat` operation performed within `broadcast_object_list` consumes an additional double amount of memory space. This means that only objects with a maximum memory occupancy of half the device capacity can be broadcasted. This PR improves usability by skipping the `torch.cat` operation on object_lists with only a single element.

Before (30G tensor):
<img width="607" alt="image" src="https://github.com/pytorch/pytorch/assets/22362311/c0c67931-0851-4f27-81c1-0119c6cd2944">

After (46G tensor):
<img width="600" alt="image" src="https://github.com/pytorch/pytorch/assets/22362311/90cd1536-be7c-43f4-82ef-257234afcfa5">

Test Code:
```python
if __name__ == "__main__":
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    fake_tensor = torch.randn(30 * 1024 * 1024 * 1024 // 4)

    if dist.get_rank() == 0:
        state_dict = {"fake_tensor": fake_tensor}
    else:
        state_dict = {}
    object_list = [state_dict]
    dist.broadcast_object_list(object_list, src=0)
    print("Rank: ", dist.get_rank(), " Broadcasted Object: ", object_list[0].keys())
    dist.barrier()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107509
Approved by: https://github.com/awgu
2023-08-23 17:19:10 +00:00
Aaron Gokaslan
660e8060ad [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings.

I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-22 23:16:38 +00:00
PyTorch MergeBot
d59a6864fb Revert "[BE]: Update ruff to 0.285 (#107519)"
This reverts commit 88ab3e4322.

Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please hep them get unblocked? It seems like one of the strings was prob accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))
2023-08-22 19:53:32 +00:00
Aaron Gokaslan
88ab3e4322 [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings.

I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-20 01:36:18 +00:00
Rodrigo Kumpera
bbf03561a9 [functional collectives] Move back to registering finalizers on wrappers. (#107250)
We cannot use inner tensors for finalizers as they are uncollective until waited.

This PR adds a bunch of tests for the observable behavior we want, including the
necessary scafold for us to test code for their waitiness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107250
Approved by: https://github.com/wconstab
2023-08-17 21:08:28 +00:00
Shen Li
45128ab67c [Reland] Add OnCompletion Hook to ProcessGroup (#106988) (#107233)
This allows infra/trainers to get detailed stats about communication
efficiencies without know anything about what model or distributed
training paradigms have been used. This is helpful as infra/trainer
package usually prefers to be as model/algorithm agnostic as possible.
Therefore, we cannot assume that infra/trainer can have access to all
collectives used by the model authors.

This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which
will be fired on every work completion event.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107233
Approved by: https://github.com/kumpera
2023-08-15 17:35:14 +00:00
PyTorch MergeBot
fd214aa8be Revert "Add OnCompletion Hook to ProcessGroup (#106988)"
This reverts commit ba1da47e8f.

Reverted https://github.com/pytorch/pytorch/pull/106988 on behalf of https://github.com/huydhn due to Sorry for reverting you change, but it is failing Windows build with some linker error.  The Windows failures on PR looks legit ([comment](https://github.com/pytorch/pytorch/pull/106988#issuecomment-1678580899))
2023-08-15 08:24:33 +00:00
Shen Li
ba1da47e8f Add OnCompletion Hook to ProcessGroup (#106988)
This allows infra/trainers to get detailed stats about communication
efficiencies without know anything about what model or distributed
training paradigms have been used. This is helpful as infra/trainer
package usually prefers to be as model/algorithm agnostic as possible.
Therefore, we cannot assume that infra/trainer can have access to all
collectives used by the model authors.

This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which
will be fired on every work completion event.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106988
Approved by: https://github.com/kumpera, https://github.com/H-Huang
ghstack dependencies: #107140, #107141, #107160
2023-08-15 04:32:23 +00:00
Bruce Jiang
2624da638d Support third-party devices to use the init_process_group method with… (#107113)
…out specifying the Backend

When init_process_group is not been done before, it will automatically apply  init_process_group within Devicemesh without specifying the backend. Thus, when a third-party device want to use Devicemesh without doing init_process_group before, there comes a problem. In this PR, add a default_device_backend_map for third-party device users to add their backends to this map when they register their backends to pytorch firstly. When doing init_process_group without parameter backend, it will init the backends in this map. Thus, a third-party user can use init_process_group method without specifying the Backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107113
Approved by: https://github.com/wanchaol
2023-08-15 03:46:07 +00:00
Jirka
858b465d74 fix str splits in single line (#106005)
Simple formating improvement and two spell fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106005
Approved by: https://github.com/H-Huang
2023-08-14 23:07:38 +00:00
Michael Voznesensky
42660015b4 [Dynamo x FSDP][2/x] Small changes to distributed to make it dynamo friendly (#106886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106886
Approved by: https://github.com/awgu, https://github.com/wconstab
ghstack dependencies: #106884
2023-08-11 22:35:50 +00:00
Louis Feng
3a01c056f5 [PyTorch][ET] Collect Process Groups Mapping Info (#104373)
Summary: Add the logics and interface to log ProcessGroup comms configuration (unique ID, type, and ranks info).

Test Plan:
Testing in HPC:
```
TORCH_LOGS=all ../buck-out/v2/gen/fbcode/c8344b52091f4f7f/hpc/models/ads/__ads_10x_launcher__/ads_10x_launcher.par  +launcher=local launcher.num_trainers=4 +data_loader=random data_loader.num_batches=2000
```
Example output in ET:
```
    {
      "name": "## process_group:init ##", "id": 3, "rf_id": 1, "parent": 2, "fw_parent": 0, "seq_id": -1, "scope": 7, "tid": 1, "fw_tid": 0, "op_schema": "",
      "inputs": ["[{'pg_id': 140538064364672, 'backend_id': 140538060772480, 'backend_config': 'cuda:nccl', 'ranks': {0: 0, 1: 1, 2: 2, 3: 3}}, {'pg_id': 140538064363904, 'backend_id': 140538042628864, 'backend_config': 'cuda:nccl', 'ranks': {0: 0, 1: 1, 2: 2, 3: 3}}]"], "input_shapes": [[]], "input_types": ["String"],
      "outputs": [], "output_shapes": [], "output_types": []
    },
```

Differential Revision: D46321690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104373
Approved by: https://github.com/kwen2501
2023-07-25 03:34:53 +00:00
Howard Huang
0ab74044c2 [BE] remove deprecated attributes from distributed_c10d (#105753)
Removing these attributes as they were introduced 5 years ago and before pytorch 1.0. `Backend` is the only support use now.

Differential Revision: [D47683717](https://our.internmc.facebook.com/intern/diff/D47683717)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105753
Approved by: https://github.com/rohan-varma
2023-07-24 16:35:08 +00:00
Justin Chu
232b96b6e2 [BE] Enable ruff's UP rules and autoformat distributed/ (#105433)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105433
Approved by: https://github.com/albanD
2023-07-19 14:27:11 +00:00
Ke Wen
22e8a61d9b Implement coalesced reduce_scatter_tensor (#103561)
Map of #101157.

This PR adds support for coalesced `reduce_scatter_tensor` calls in the following syntax:

Sync communication style:
```
with dist._coalescing_manager():
     for i in range(num_coll):
         dist.reduce_scatter_tensor(output_tensors[i], input_tensors[i])
```

Async communication style:
```
with dist._coalescing_manager(async_ops=True) as cm:
     for i in range(num_coll):
         dist.reduce_scatter_tensor(output_tensors[i], input_tensors[i])

# do a bunch of other things
cm.wait()
# do things that depend on the reduce-scatters' results
```
Each `reduce_scatter_tensor` call can be independent in terms of their data and buffer locations. But could be executed in parallel by supported backends (like NCCL).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103561
Approved by: https://github.com/fegin
2023-06-15 20:11:12 +00:00
zhuhong61
50c972bfd2 [c10d] Add xpu to the default device supported by user specified backend (#103410)
**Motivation:**
For collective dispatching, we want to provide a more user friendly usage for xpu device and CCL backend (user specified backend) mapping.

**Solution:**
We add xpu to the default device list, and it can construct the mapping between xpu and the user specified backend directly.
Usage:
When using xpu device, user can specify backend name only:
`dist.init_process_group(backend='ccl')`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103410
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-06-12 19:46:33 +00:00
Ke Wen
07104ca99c [c10d] Make it default that PG do not perform barrier after init (#103033)
Both internal and OSS users trying https://github.com/pytorch/pytorch/pull/99937 report that their workloads perform normally even with the barrier removed and see a scalability win. Thus in this PR, we decide to make it default that PG do not perform a barrier after init.

In the discussion of #99937, people point out that such barrier might be needed for c10d + RPC cases. IMO, this need originates from RPC's programming model and should be RPC or RPC user's responsibility to deal with. That is, with other functions/libraries, it can happen too. So the need for c10d to do so big a favor is not justified IMO. Also good to remove it before users become reliant on this barrier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103033
Approved by: https://github.com/XilunWu
2023-06-07 06:11:14 +00:00
Ashwin Hari
cf0aa38005 Allow ORT backend for DTensor (#101914)
fixes #101911

Currently, `DTensor` supports cuda and cpu. This PR makes some changes for easier integration with the ort backend.

* `Backend.NAME`  attribute now has value `name` instead of `NAME` for backends registered through `register_backend(name)`; this matches the pattern for backends with built-in support like nccl.
* remove unused `_check_for_nccl_backend` function
* add test case that moves parameters to device in the `partition_fn` - a scenario that's useful for big models
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101914
Approved by: https://github.com/wanchaol
2023-06-01 22:37:09 +00:00
shaoyf42
8d7e082300 [c10d] Add is_backend_available for c10d backend. (#101945)
Add is_backend_available for c10d backend, either the built-in backends or third-party backends through function ``Backend.register_backend``.

There is a related discussion in https://github.com/pytorch/pytorch/pull/101775#discussion_r1199253553
> For example in python constructor for their backend they should explicitly add the is_X_available. Or if defining in C++ they should modify pybind like this https://github.com/H-Huang/torch_collective_extension/blob/main/custom_backend/include/dummy.hpp#L98-L101
to also add their own is_available property

It is a natural choice for users to add their own `is_available` when they create a backend. We think it might be a possible way for the user to use `is_X_available` in the same way as the native, for example by dynamically adding`torch.distributed.is_dummpy_available()` function.  This is why we want to dynamically add the `is_X_available` to `torch.distributed` in `register_backend`.

> Or we could add an Is_available(backend) function, that checks for the backend.

Providing a public function is indeed another good approach. We have implemented an `is_backend_available` in https://github.com/pytorch/pytorch/pull/101945  that supports both built-in backends and third-party backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101945
Approved by: https://github.com/H-Huang
2023-05-31 22:51:51 +00:00
Wanchao Liang
3ef4d697df [c10d] default backend need to check for nccl availability (#102470)
As titled, we can only initialize nccl backend when NCCL is available
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102470
Approved by: https://github.com/Skylion007, https://github.com/XilunWu
2023-05-30 19:22:37 +00:00
Wanchao Liang
7b47cd0a6c [c10d] add fake pg necessary collectives (#102238)
This PR adds fake pg necessary collectives to enable e2e FSDP run
with out multiprocess or multithreading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102238
Approved by: https://github.com/ezyang
2023-05-25 05:01:16 +00:00
Wanchao Liang
9a19262556 [c10d] conslidate barrier after init logic (#102237)
This PR consolidates the barrier after init logic to allow custom
backend to set the env var when creating the pg, so that
`init_process_group` would skip barrier
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102237
Approved by: https://github.com/ezyang
2023-05-25 05:01:16 +00:00
Edward Z. Yang
c903b12cb8 Add fake process group (#102180)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102180
Approved by: https://github.com/wanchaol
2023-05-24 23:27:40 +00:00
Iris
ee95e37a69 [c10d] Record time spent for init_process_group, new_group, _store_based_barrier (#101912)
1. Record time spent for init_process_group, new_group, _store_based_barrier
2. Rename c10d_error_logger to c10d_logger for generalization.
3. Refactor to move logger wrappers in distributed_c10d.py to logger to c10d_logger.py.
4. Rename the logger wrappers (bc breaking). Exception_handler is renamed to exception_logger to avoid confusion with logging handler.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101912
Approved by: https://github.com/fduwjj
2023-05-24 09:36:34 +00:00
Aaron Gokaslan
3e2ea32dab [BE]: Enable ruff rule TRY302 and apply fixes (#101874)
Removes useless try statements and unreachable code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101874
Approved by: https://github.com/malfet
2023-05-19 17:30:52 +00:00
shaoyf42
97180aca5e Enables barrier to support the specified device (#99589)
Enables barrier to support the specified device, e.g cuda/custom device. There is some discussion here: https://github.com/pytorch/pytorch/issues/97938#issue-1646833919

Today, there are two limitations of barrier:
One is that barrier does not support custom  #device:
fbdb86c174/torch/csrc/distributed/c10d/ProcessGroup.hpp (L512-L522)

The second is that there is a special valid for nccl when device_id is not None, which is an assumption for cuda and nccl bindings, and also hinders custom device.
789070986c/torch/distributed/distributed_c10d.py (L3504-L3508)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99589
Approved by: https://github.com/kwen2501
2023-05-17 05:26:04 +00:00
Ke Wen
daed3bf8f9 Implement coalesced all_gather_into_tensor (#101157)
This PR adds support for the following use cases:
- Sync style:
```
with dist._coalescing_manager():
     for i in range(num_coll):
         dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])
```
- Async style:
```
with dist._coalescing_manager(async_ops=True) as cm:
     for i in range(num_coll):
         dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])

# do a bunch of other things
cm.wait()
# do things that depend on the all-gather's
```
Each `all_gather_into_tensor` would be independent in terms of data and their buffer location. But could be executed in parallel by supported backends (like NCCL).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101157
Approved by: https://github.com/kumpera, https://github.com/wanchaol
2023-05-11 20:58:47 +00:00
Ke Wen
0848ed21b8 [c10d] Figure out device to use for object collectives (#100954)
Fixes https://github.com/pytorch/pytorch/issues/97938

this pr is clone from https://github.com/pytorch/pytorch/pull/100238, which is important to me. But
@kwen2501 has not resolved the confliction. So, this pr is submitted to resolve the confliction.
the only confliction is `distributed_c10d.py:2653`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100954
Approved by: https://github.com/kwen2501
2023-05-11 01:49:09 +00:00
Rodrigo Kumpera
a204f7f518 [c10d] Fix subprocess group handlig in scatter_object_list. (#100552)
scatter_object_list assumed src was a group rank while all collectives use global ranks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100552
Approved by: https://github.com/fduwjj
2023-05-04 10:04:21 +00:00
Xiaodong Wang
c29ab84115 Fix bug in process_group_name when there is duplicate pgs (#100518)
Summary: with the new c10d API, we don't need all ranks to call new_group. Integrate with the new API, so that every rank just call new_group 3 times, with a local barrier with the members within the group.

Reviewed By: xunnanxu, eeggl

Differential Revision: D45315615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100518
Approved by: https://github.com/kumpera
2023-05-04 02:12:28 +00:00
Animesh Jain
5fbb40669f [dynamo][moco] Disallow_in_graph distributed APIs (#100071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100071
Approved by: https://github.com/jansel, https://github.com/H-Huang
2023-05-02 20:09:25 +00:00
Ke Wen
ae0eb2342d [Experimental] Remove store barrier after PG init (#99937)
Store based barrier is not scalable.
Experimenting to see if removing it breaks any CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99937
Approved by: https://github.com/kumpera, https://github.com/H-Huang
2023-04-27 17:23:10 +00:00
Rodrigo Kumpera
ad21890f8f [c10d] Scalable PG initiation. (#99931)
Add use_local_synchronization argument to new_group.

When this argument is True, is change new_group to do a store_barrier only on the ranks that are park of the group and not the whole cluster.

This addressess both scalability and composability problems associated with new_group.

Fixes #81291.

This is relanding #84224
As part of the original PR I did a quick benchmark of creating 3 PGs per rank using both functions and perf is the following:

new_group use_local_synchronization=False:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.12 |
| 8 | 0.25 |
| 16 | 0.51 |
| 32 | 0.87 |
| 64 | 1.50 |
| 128 | 2.87 |

new_group use_local_synchronization=True:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.05 |
| 8 | 0.04 |
| 16 | 0.03 |
| 32 | 0.03 |
| 64 | 0.04 |
| 128 | 0.04 |

Scaling for `use_local_synchronization=False` is sub linear because the number of process groups created as a multiple of world_size decreases as we go up. It's 6 with world_size 4 and 192 with world_size 128.

Scaling for `use_local_synchronization=True` is constant as the number of store barriers executed per rank remains constant at 3.

Setup:

1 AWS host, backend gloo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99931
Approved by: https://github.com/xw285cornell
2023-04-27 13:44:02 +00:00
Aaron Gokaslan
e2a3817dfd [BE] Enable C419 rule for any all shortcircuiting (#99890)
Apparently https://github.com/pytorch/pytorch/pull/78142 made torch.JIT allow for simple generator expressions which allows us to enable rules that replace unnecessary list comprehensions with generators in any/all. This was originally part of #99280 but I split it off into this PR so that it can be easily reverted should anything break.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99890
Approved by: https://github.com/justinchuby, https://github.com/kit1980, https://github.com/malfet
2023-04-25 15:02:13 +00:00
Ke Wen
3a09aa5977 [c10d] Faster coalescing (#98793)
### Description
The PR aims at reducing CPU overhead of context manager style coalescing.

By "context manager style coalescing", we mean:
Sync style:
```
with _coalescing_manager():
     for i in range(num_coll):
         dist.all_reduce(tensors[i])
```
Async style:
```
with _coalescing_manager(async_ops=True) as cm:
     for i in range(num_coll):
         dist.all_reduce(tensors[i])
cm.wait()
```
In previous implementation, each collective in the `num_coll` loop actually calls into the C++ backend, accumulating pybind overhead.

In the new implementation, we capture the collectives at Python level, and only fire towards C++ at the exit of the coalescing manager.

### Tests
In current PR, the "fast path" only applies to all-reduce.
- Flattened 512M: 16.38 ms, including CPU time 131.21 us
- Old _coalescing_manager 64 x 8M: 22.19 ms, including CPU time 2865 us
- New _coalescing_manager 64 x 8M: 16.93 ms, including CPU time 635 us

Hence a 4x reduction in CPU overhead (dependent on `num_coll`).

Cc @mrshenli @kumpera @wanchaol @fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98793
Approved by: https://github.com/kumpera
2023-04-24 21:27:26 +00:00
medivh-xp
39590d06c5 Make new_subgroups avaliable for non-cuda depend backend (#99706)
The `new_subgroups` allows for the easy creation of sub-communication groups, but it currently requires CUDA availability. For communications that do not rely on CUDA, such as the CPU-based gloo or custom communication backends, I still hope to be able to use it, such as with the CPU-based gloo (which is also the case when using a custom backend):
```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def gloo_process(rank_id, world_size, group_size, mp_lock):
    assert not torch.cuda.is_available()
    def lock_print(*args, **kwargs):
        with mp_lock:
            print(*args, **kwargs, flush=True)

    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('gloo', rank=rank_id, world_size=world_size)

    subgroup, _ = dist.new_subgroups(group_size)
    subgroup_ranks = list(range(subgroup.rank() * group_size, (subgroup.rank() + 1) * group_size))
    lock_print(f"Rank {rank_id} initialized in subgroup_{subgroup.rank()}: {subgroup_ranks}")

    tensor = torch.Tensor([rank_id + 1])
    subgroup.broadcast(tensor, root=0)

    lock_print(f"After broadcast, rank {rank_id} in subgroup_{subgroup.rank()}:{subgroup_ranks} got {tensor}")

if __name__ == "__main__":
    world_size = 4
    group_size = 2
    processes = []
    mp.set_start_method("spawn")
    mp_lock = mp.Lock()
    for rank in range(world_size):
        p = mp.Process(target=gloo_process, args=(rank, world_size, group_size, mp_lock))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()
```

```bash
Rank 0 assigned to subgroup_0: [0, 1]
Rank 1 assigned to subgroup_1: [2, 3]
Rank 2 assigned to subgroup_0: [0, 1]
Rank 3 assigned to subgroup_1: [2, 3]
After broadcast, rank 2 in subgroup_0:[0, 1] got tensor([3.])
After broadcast, rank 3 in subgroup_1:[2, 3] got tensor([3.])
After broadcast, rank 1 in subgroup_1:[2, 3] got tensor([1.])
After broadcast, rank 0 in subgroup_0:[0, 1] got tensor([1.])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99706
Approved by: https://github.com/kumpera
2023-04-24 18:22:59 +00:00
PyTorch MergeBot
9861ec9785 Revert "[c10d] Faster coalescing (#98793)"
This reverts commit db456ab83d.

Reverted https://github.com/pytorch/pytorch/pull/98793 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-04-21 09:15:04 +00:00
Ke Wen
db456ab83d [c10d] Faster coalescing (#98793)
### Description
The PR aims at reducing CPU overhead of context manager style coalescing.

By "context manager style coalescing", we mean:
Sync style:
```
with _coalescing_manager():
     for i in range(num_coll):
         dist.all_reduce(tensors[i])
```
Async style:
```
with _coalescing_manager(async_ops=True) as cm:
     for i in range(num_coll):
         dist.all_reduce(tensors[i])
cm.wait()
```
In previous implementation, each collective in the `num_coll` loop actually calls into the C++ backend, accumulating pybind overhead.

In the new implementation, we capture the collectives at Python level, and only fire towards C++ at the exit of the coalescing manager.

### Tests
In current PR, the "fast path" only applies to all-reduce.
- Flattened 512M: 16.38 ms, including CPU time 131.21 us
- Old _coalescing_manager 64 x 8M: 22.19 ms, including CPU time 2865 us
- New _coalescing_manager 64 x 8M: 16.93 ms, including CPU time 635 us

Hence a 4x reduction in CPU overhead (dependent on `num_coll`).

Cc @mrshenli @kumpera @wanchaol @fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98793
Approved by: https://github.com/kumpera
2023-04-19 20:17:58 +00:00
Howard Huang
760967a284 Update _store_based_barrier implementation to reduce load on rank 0 (#98000)
Summary:

Update from using add() which makes rank 0 overloaded with requests to a single request every 10 seconds to handle the last joined worker
Added optional logging_interval arg to _store_based_barrier

Test Plan:
```
pytest test/distributed/test_c10d_common.py -vsk test_store_based_barrier
```

Reviewed By: rohan-varma

Differential Revision: D44430531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98000
Approved by: https://github.com/kumpera
2023-04-11 14:25:29 +00:00
Edward Z. Yang
b09722f540 Convert logging f-strings to use % format, part two (#98700)
This hits multi-line logging strings

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98700
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
Howard Huang
61c74ab0f8 Fix MPI rank and world size pg initialization (#98545)
Fixes https://github.com/pytorch/pytorch/issues/97507

Test command
`pytest test/distributed/test_c10d_common.py -vsk def test_init_process_group_for_all_backends`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98545
Approved by: https://github.com/malfet
2023-04-07 21:57:31 +00:00
Rohan Varma
8a29afe98a [RFC] Add warning about object-based collectives for GPU tensors to docs. (#97702)
Using GPU tensors in these collectives have caused SEVs, user
confusion, and slowness in the past. These APIs were only designed to
communicate arbitrary python objects, and GPU tensors should either be copied
to CPU first or use the regular collecitves. Add a warning indicating so.

Differential Revision: [D44435849](https://our.internmc.facebook.com/intern/diff/D44435849/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97702
Approved by: https://github.com/kumpera
2023-04-06 23:47:35 +00:00
Howard Huang
3b6e94cb8c [small] replace with .format() with f-strings (#98514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98514
Approved by: https://github.com/awgu
2023-04-06 18:58:56 +00:00
Kazuaki Ishizaki
6514d71add Fix typos under torch/distributed directory (#98225)
This PR fixes typos in comments and messages of `.py` files under `torch/distributed` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98225
Approved by: https://github.com/soulitzer, https://github.com/kit1980
2023-04-05 00:21:33 +00:00
Edward Z. Yang
5df59f957f Fix G001,G002,G003 in logs to % syntax (#97812)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97812
Approved by: https://github.com/Skylion007, https://github.com/kiukchung, https://github.com/malfet, https://github.com/mlazos
2023-04-01 01:43:33 +00:00
Kazuaki Ishizaki
35fd5c548e Fix typos under torch/distributed directory (#95638)
This PR fixes typos in comments and messages of `.py` files under torch/distributed directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638
Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980
2023-03-27 21:13:44 +00:00
Howard Huang
ac7329b323 Add exceptionhandler to more distributed_c10d APIs (#96770)
Summary: Adding exception handler to a few more APIs so that internal errors are logged to the c10d errors scuba table

Test Plan: sandcastle

Differential Revision: D44068557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96770
Approved by: https://github.com/wz337
2023-03-15 20:31:46 +00:00
Howard Huang
02fa2291f7 Add support for custom backend (#95072)
Fixes https://github.com/pytorch/pytorch/issues/92344

A custom backend can be specified by passing in a string with format `"<device_type1>:<backend_name>,<device_type2>:<backend_name>"`, e.g. `"cpu:gloo,cuda:custom_backend"`.

Differential Revision: [D43630050](https://our.internmc.facebook.com/intern/diff/D43630050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95072
Approved by: https://github.com/kwen2501
2023-03-02 21:41:49 +00:00
Howard Huang
c0fa0669f6 Update isend/irecv warning messages for nccl (#95236)
Summary: nccl backend does not support `tag` as mentioned in https://github.com/pytorch/pytorch/issues/94819. Adding a note in the documentation for it.

Example:

<img width="888" alt="image" src="https://user-images.githubusercontent.com/14858254/220464900-094c8063-797a-4bdc-8e25-657f17593fe9.png">

Differential Revision: D43475756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95236
Approved by: https://github.com/awgu, https://github.com/rohan-varma
2023-02-22 22:00:13 +00:00
Rodrigo Kumpera
641cb4243c Fix c10d regression during cleanup. (#94988)
This fixes a regression introduced earlier today with a change to c10d global state.

It must be cleaned up in destroy_process_group or root PG and its Store will stay alive.

Fixes regression in test_c10d_nccl.py :: RendezvousEnvTest.test_common_errors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94988
Approved by: https://github.com/H-Huang, https://github.com/wanchaol, https://github.com/malfet
2023-02-16 19:12:00 +00:00
Rodrigo Kumpera
e22d791287 [PTD] Introduce tracing friendly collectives. (#93990)
This change adds torch.distributed.traceable_collectives.

This experimental API enables collectives to be fully traced by dynamo and FX.

See #93173 for the RFC

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93990
Approved by: https://github.com/wconstab, https://github.com/wanchaol, https://github.com/H-Huang
2023-02-16 15:35:01 +00:00
Xuehai Pan
b005ec62b9 [BE] Remove dependency on six and future (#94709)
Remove the Python 2 and 3 compatibility library [six](https://pypi.org/project/six) and [future](https://pypi.org/project/future) and `torch._six`. We only support Python 3.8+ now. It's time to retire them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94709
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-02-14 09:14:14 +00:00
Howard Huang
8b3e3f937d Update documentation init_process_group optional backend (#94543)
Update documentation for `init_process_group()` to mention the `backend` argument is optional.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94543
Approved by: https://github.com/kwen2501
2023-02-13 21:45:38 +00:00
Howard Huang
f45c196653 Update backend config to be under _World (#94191)
All the c10d process group state is under `_World`, so this is BE work to include a missing map
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94191
Approved by: https://github.com/kumpera
2023-02-09 20:48:42 +00:00
Aaron Gokaslan
8fce9a09cd [BE]: pyupgrade Python to 3.8 - imports and object inheritance only (#94308)
Apply parts of pyupgrade to torch (starting with the safest changes).
This PR only does two things: removes the need to inherit from object and removes unused future imports.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94308
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-07 21:10:56 +00:00
Iris
f54fd6fb28 [c10d] Update get_backend() in exception_handler (#94063)
Currently, get_backend() and get_world_size() would always return the default value if no pg group argument is passed. This fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94063
Approved by: https://github.com/H-Huang
2023-02-04 19:39:36 +00:00
Ching-Hsiang Chu
1fa68d40b8 [pytorch] fix backend_type for backend/PG plugin (#93129)
Summary: For backend/PG plugin, use `ProcessGroup.BackendType.CUSTOM` to avoid uninitialized variable during `pg._register_backend` later

Test Plan: CI/CD and internal tests

Differential Revision: D42793222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93129
Approved by: https://github.com/H-Huang
2023-01-30 23:16:08 +00:00
Howard Huang
2503a4a7c6 Fix MPI backend PG initialization (#92847)
Fixes #92573

Add test to check that all default backends can be initialized to prevent the above from regressing in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92847
Approved by: https://github.com/rohan-varma
2023-01-24 23:24:41 +00:00
Andrew Gu
cb67d9460b [PT-D] Fix send, recv return type (#92152)
- `send` returns `None`.
- `recv` returns the sender rank if valid or -1 otherwise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92152
Approved by: https://github.com/wz337
2023-01-14 01:09:49 +00:00
joncrall
ad782ff7df Enable xdoctest runner in CI for real this time (#83816)
Builds on #83317 and enables running the doctests. Just need to figure out what is causing the failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83816
Approved by: https://github.com/ezyang, https://github.com/malfet
2022-12-29 05:32:42 +00:00
Howard Huang
99aec69f58 [BE] remove Backend.TCP (#91314)
Remove Backend.TCP which is unused. Fixes a task in https://github.com/pytorch/pytorch/issues/90544

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91314
Approved by: https://github.com/awgu
2022-12-23 15:48:29 +00:00
Sergii Dymchenko
365071c73c Fix non-existing parameters in docstrings in torch/distributed (#91116)
This is a continuation of https://github.com/pytorch/pytorch/pull/90505
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91116
Approved by: https://github.com/huydhn
2022-12-22 02:37:31 +00:00
Sadra Barikbin
97f514f38e Fix two typos in torch.distributed.distributed_c10d.py::broadcast_object_list (#91237)
Fixes #91236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91237
Approved by: https://github.com/malfet, https://github.com/H-Huang
2022-12-21 19:45:08 +00:00
Howard Huang
7a0f29b776 Allow Process Group to support multiple backends (#88330) (#90997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330

### Implementation
Move backend-specific (NCCL, Gloo, etc) collective implementations to corresponding `Backend` class. Update ProcessGroup to support multiple backends and use dispatcher to calls backends based on tensor device type.

### Changes

#### c++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`)
- Update pybind definitions for new process group base class and new backend class
- Update pybinded backend class with collective definitions to keep BC with Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`) which are used in tests
- Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, `ProcessGroupUCC` to derive from the `Backend` class.
- Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by querying the backend using the device type
- Update internal dispatched implementation of `barrier` to use a tensor which allows operation to be dispatched.
- Update `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching it correctly. I still don't understand why and had originally filed an issue in 85122.

#### python changes (`distributed_c10d.py`, test files)
- Add BackendConfig class to specify the default configurations of backends and `get_backend_config()` API
- `get_backend()` deprecation warning
- `init_process_group` how returns a generic `ProcessGroup` object, it contains a list of backends (the ones stated above) which it will dispatch operations to.
- `new_group` updated to return the same as above
- Update `test_c10d_gloo.py`, Update `DistributedDataParallelTest` to use `init_process_group`, Update `ReducerTest`, update `test_broadcast_coalesced_gloo` to move from PG instance and gloo options
- Update `test_c10d_nccl.py`, Update `DistributedDataParallelTest` to use `init_process_group`
- Specific tests updated: `test_Backend_enum_class`

### Changes missing
- lazy initialization of backends
- support parsing of BackendConfig

### open questions
- Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338)

# Example

This is a basic script (using 2 backends within a process group)

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py
import torch.distributed as dist
import torch
import os

if __name__ == "__main__":
    rank = os.environ.get("RANK")
    # initialize with both gloo and nccl
    dist.init_process_group()
    # with gloo
    dist.all_reduce(torch.tensor([1.0]))
    print(f"Rank {rank} finished")
    # with nccl
    dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}"))
```

Test Plan: Imported from OSS

Differential Revision: D42069829

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997
Approved by: https://github.com/awgu, https://github.com/fduwjj
2022-12-16 23:15:00 +00:00
Rohan Varma
793a999ce0 Hybrid Sharded Data Parallel (#89915)
Adds 2 new hybrid sharding strategy to FSDP:
1. HYBRID_SHARD: applies zero-3 style sharding within a node, and data parallel across
2. HYBRID_SHARD_ZERO2: applies zero-2 style sharding within a node, and data parallel across

These are useful for medium sized models and aim to decrease communication volume, tests and benchmarks will be run to understand which workloads are optimal under which sharding strategy.

Hybrid sharding in general works by sharding the model using a process group within a single node, and creating intra-node process groups for replication / data parallelism. The user either needs to pass in a tuple of these process groups, or None, and we generate the process groups appropriately.

** Acknowledgements **
- @awgu 's excellent prototype: 5ad3a16d48
- @liangluofb For ideation, feedback, and initial implementation and experimentation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89915
Approved by: https://github.com/awgu
2022-12-08 16:18:03 +00:00
Howard Huang
ee907375fa [small] Update error message (#89294)
Summary:
`RuntimeError: Invalid function argument. Expected parameter "tensor_list" to be of type List[torch.Tensor].`

to

`RuntimeError: Invalid function argument. Expected parameter "input_tensor_list" to be of type List[torch.Tensor].`

Test Plan: sandcastle

Differential Revision: D41405238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89294
Approved by: https://github.com/awgu
2022-11-19 00:21:14 +00:00
Masaki Kozuki
63e16216d8 [c10d] Implement __instancecheck__ for c10d::ReduceOp (#88275)
Summary:
- Customize the metaclass of `torch.distributed.distributed_c10d.ReduceOp` for the sake of custom `__instancecheck__`
- Add `copy.copy`, `copy.deepcopy`, and `pickle` support with tests

Rel:
- #81272
- #84243
- #87191
- #87303
- #87555

Ref:
- https://github.com/pybind/pybind11/issues/2696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88275
Approved by: https://github.com/wanchaol
2022-11-15 13:21:41 +00:00
Iris
68fd8f3706 [BE] [c10d][send] Improve error message on dist.send() with destination rank as itself (#89004)
This improves error msg on dist.send() and add corresponding test in test_c10d_common.py(https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_common.py).
Context in issue#83912: https://github.com/pytorch/pytorch/issues/83912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89004
Approved by: https://github.com/H-Huang
2022-11-15 06:13:17 +00:00
Howard Huang
1d54ce9d5d [14/N] Refactor _new_process_group_helper() to remove repeated code (#88351)
Changes:
- refactor parts of `_new_process_group_helper()` to remove repeated code

Differential Revision: [D41188274](https://our.internmc.facebook.com/intern/diff/D41188274)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88351
Approved by: https://github.com/kwen2501
2022-11-10 19:27:17 +00:00
Kurt Mohler
ee28b865ee Deprecate TypedStorage, its derived classes, and all of their public methods (#85303)
Part of #85302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85303
Approved by: https://github.com/ezyang
2022-11-08 18:11:01 +00:00
Rodrigo Kumpera
6663ae5537 [2/n] Thread PG: add class _World to distributed_c10d.py (#781) (#88471)
Summary:
X-link: https://github.com/pytorch/torchrec/pull/781

Move a bunch of globals to instance methods and replace all use to them.

We move all PG related globals under World and use a singleton instance under _world.

This creates an undocumented extension point to inject full control of how how c10d
state behaves.

One simple hack is to change _world to an implementation that uses a threadlocal
and enable per-thread PGs.

It almost get DDP working and the PG is missing an implementation of all_reduce.

This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68

This change ensures BC by keeping the global variables around and have the default _World wrap it.

I have relinked this diff to a new github PR, so that I can update it. The original PR is
> Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348

Differential Revision: D40236769

Pulled By: yhcharles

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88471
Approved by: https://github.com/gnadathur, https://github.com/rohan-varma
2022-11-07 17:56:40 +00:00
Kazuaki Ishizaki
2ddefbdc3c Fix typos used in documents under torch directory (#88300)
This PR fixes typos, in comments of Python files, that are found from a search box at https://pytorch.org/docs/master/search.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88300
Approved by: https://github.com/lezcano
2022-11-02 09:38:13 +00:00
Iris Zhang
0cf572ff6c [C10D][BE] Add exception handlers to c10d collectives function (#87643) (#87988)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87643

1. Add a decorator function exception_handlers to  c10d collectives.
2. Update test(torch/distributed/distributed_c10d.py) to include mp tests for exception_handler.

```
python3 test/distributed/test_c10d_error_logger.py
```

Test Plan: Test in OSS.

Reviewed By: H-Huang

Differential Revision: D40281632

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87988
Approved by: https://github.com/H-Huang
2022-10-29 04:38:34 +00:00
Masaki Kozuki
aa8248cc9a Reenable isinstance with torch.distributed.ReduceOp (#87303)
tentatively marking as draft as I haven't gotten a comprehensive list of side effects...

Ref: https://stackoverflow.com/questions/40244413/python-static-class-attribute-of-the-class-itself
Rel: https://github.com/pytorch/pytorch/issues/87191

cc @kwen2501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87303
Approved by: https://github.com/wanchaol
2022-10-21 15:05:36 +00:00
PyTorch MergeBot
f451e824f3 Revert " C10D extension to enable per-thread PG (#86348)"
This reverts commit 97abc21f2b.

Reverted https://github.com/pytorch/pytorch/pull/86348 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks macos tests 97abc21f2b
2022-10-14 01:26:46 +00:00
Rodrigo Kumpera
97abc21f2b C10D extension to enable per-thread PG (#86348)
Move a bunch of globals to instance methods and replace all use to them.

We move all PG related globals under World and use a singleton instance under _world.

This creates an undocumented extension point to inject full control of how how c10d
state behaves.

One simple hack is to change _world to an implementation that uses a threadlocal
and enable per-thread PGs.

It almost get DDP working and the PG is missing an implementation of all_reduce.

This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68

This change ensures BC by keeping the global variables around and have the default _World wrap it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348
Approved by: https://github.com/rohan-varma
2022-10-13 22:23:28 +00:00
Louis Feng
55479fe80e Enable capturing of comm collective parameters (#98) (#85368)
Summary:
X-link: https://github.com/facebookresearch/torch_ucc/pull/98

Add tensor input, output, and other metadata for PyTorch comms.

Test Plan: P517138779

Reviewed By: Pavani-Panakanti

Differential Revision: D38357077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85368
Approved by: https://github.com/H-Huang
2022-10-11 04:38:26 +00:00
Jesus Magana
c670bad72f Update dist.scatter() documentation (#86069)
Update documentation for dist. scatter

Fixes #84566

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86069
Approved by: https://github.com/rohan-varma, https://github.com/H-Huang
2022-10-03 17:22:08 +00:00
Ke Wen
05d1128106 [c10d] Start deprecating *_multigpu APIs (#85961)
### Deprecation reasons:
- For most users training is on one GPU per process so these APIs are rarely used
- They added one more API dimension
- They can be expressed in a composed manner
- They are not abstracted – specific to GPU
- They caused backend APIs and implementations to have nested `std::vector<std::vector<Tensor>>`, which is hard to read or maintain

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85961
Approved by: https://github.com/XilunWu, https://github.com/H-Huang
2022-10-01 00:59:39 +00:00
Ke Wen
463283e016 [c10d] Start deprecating *_coalesced APIs (#85959)
- We consider that general users need not to use the `*_coalesced` APIs unless there is an extreme concern about performance.

- We are investigating using a context manager named `coalescing_manager` which wrap around multiple individual collectives to compose the coalescing hint, rather than giving each collective a *_coalesced variant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85959
Approved by: https://github.com/XilunWu, https://github.com/H-Huang
2022-10-01 00:55:27 +00:00
Ke Wen
ade1c19612 Add reduce_scatter_tensor in place of _reduce_scatter_base (#85867)
This is a twin PR similar to the one for `all_gather_into_tensor` (#85686).
The philosophy for renaming `_reduce_scatter_base` instead of merging it is described in #85686.

Cc @rohan-varma @H-Huang @crcrpar @ptrblck @mrshenli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85867
Approved by: https://github.com/crcrpar, https://github.com/H-Huang
2022-09-30 05:48:16 +00:00
Saliya Ekanayake
941d7a31f6 Pass group ranks and options to third party distributed backends (#73164)
Fixes #73163

PyTorch's [_new_process_group_helper()](9f541aa3ac/torch/distributed/distributed_c10d.py (L633)) does not pass group's participating ranks to the backend.

This PR adds the above capability. Also, refactors some variables for better clarity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73164
Approved by: https://github.com/kumpera
2022-09-29 17:28:58 +00:00
PyTorch MergeBot
6fae62b35f Revert "C10D extension to enable per-thread PG (#84153)"
This reverts commit 5cbffbbac9.

Reverted https://github.com/pytorch/pytorch/pull/84153 on behalf of https://github.com/kumpera due to broke internal stuff
2022-09-29 13:51:05 +00:00
Ke Wen
775a22c7c6 Add all_gather_into_tensor in place of _all_gather_base (#85686)
### Description
- This PR renames `_all_gather_base` to `all_gather_into_tensor` so that it is clearer in meaning.
- The `all_gather_into_tensor` API differs from the `all_gather` API in the output it accepts -- a single, large tensor instead of a list of tensors.
- This PR also adds deprecation warning to `_all_gather_base`.

### Issue
`_all_gather_base` was implemented in https://github.com/pytorch/pytorch/pull/33924 to avoid unnecessary flattening. There was previous effort (#82639) to merge `_all_gather_base` with the existing `all_gather` API by detecting the parameter type passed in for the output.

There are, however, two "blockers" that make the merge difficult:
(i) The merge leads to backward compatibility break. We would need to change the parameter name `tensor_list` in `all_gather` to a general name `output` that can cover both tensor and tensor list.
(ii) Recently, the `all_gather` API has added uneven tensor support, utilizing the tensor boundaries implied by the list. We are, however, not sure to add such support to the `_all_gather_base` function, because that would require users to pass in additional tensor boundary information.

In view of the above, we decided to productize `_all_gather_base` as a separate function, but with a clearer name.

### Testing
Added tests:
- `test_all_gather_into_cat_tensor_cuda` -- output form as with `torch.cat`. For example:
```
        >>> tensor_in
        tensor([1, 2], device='cuda:0') # Rank 0
        tensor([3, 4], device='cuda:1') # Rank 1
        >>> tensor_out
        tensor([1, 2, 3, 4], device='cuda:0') # Rank 0
        tensor([1, 2, 3, 4], device='cuda:1') # Rank 1
```
- `test_all_gather_into_stack_tensor_cuda` -- output form as with `torch.stack`. For example:
```
        >>> tensor_out2
        tensor([[1, 2],
                [3, 4]], device='cuda:0') # Rank 0
        tensor([[1, 2],
                [3, 4]], device='cuda:1') # Rank 1
```
The output form is determined by the shape of the output tensor passed by the user, no flag used.

Cc @rohan-varma @mrshenli @crcrpar @ptrblck @H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85686
Approved by: https://github.com/rohan-varma, https://github.com/crcrpar
2022-09-27 22:50:22 +00:00
Rodrigo Kumpera
5cbffbbac9 C10D extension to enable per-thread PG (#84153)
Move a bunch of globals to instance methods and replace all use to them.

We move all PG related globals under World and use a singleton instance under _world.

This creates an undocumented extension point to inject full control of how how c10d
state behaves.

One simple hack is to change _world to an implementation that uses a threadlocal
and enable per-thread PGs.

It almost get DDP working and the PG is missing an implementation of all_reduce.

This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84153
Approved by: https://github.com/rohan-varma
2022-09-27 21:42:31 +00:00
Rodrigo Kumpera
7dcc723d35 [c10d] Ensure collectives are called with the same dtype for all tensor params. (#84664)
While passing tensors with different dtypes don't crash, they don't produce sensible results.

We see data tearing instead of casting.

It's not clear we want to support transparent casting so, for now, we fail when such input is presented.

Fixes #84525

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84664
Approved by: https://github.com/rohan-varma
2022-09-15 22:32:51 +00:00
Salahuddin
6bd7d0f856 doc string fixed in torch.distributed.reduce_scatter (#84983)
Fixes #84865

Previous `torch.distributed.reduce_scatter`:

```
def reduce_scatter(output, input_list, op=ReduceOp.SUM, group=None, async_op=False):
    """
    Reduces, then scatters a list of tensors to all processes in a group.

    Args:
        output (Tensor): Output tensor.
        input_list (list[Tensor]): List of tensors to reduce and scatter.
        group (ProcessGroup, optional): The process group to work on. If None,
            the default process group will be used.
        async_op (bool, optional): Whether this op should be an async op.
```

Fixed:

```
def reduce_scatter(output, input_list, op=ReduceOp.SUM, group=None, async_op=False):
    """
    Reduces, then scatters a list of tensors to all processes in a group.

    Args:
        output (Tensor): Output tensor.
        input_list (list[Tensor]): List of tensors to reduce and scatter.
        op (optional): One of the values from
            ``torch.distributed.ReduceOp``
            enum.  Specifies an operation used for element-wise reductions
        group (ProcessGroup, optional): The process group to work on. If None,
            the default process group will be used.
        async_op (bool, optional): Whether this op should be an async op.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84983
Approved by: https://github.com/H-Huang
2022-09-15 18:17:10 +00:00
Rodrigo Kumpera
38192f63cd Add __all__ for a few distributed modules plus a little typing (reland) (#84872)
This handles distributed_c10d, which is massive and ddp_comm_hooks.

This relands #84119 with the required fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84872
Approved by: https://github.com/rohan-varma
2022-09-13 21:57:49 +00:00
PyTorch MergeBot
219ff26172 Revert "Add __all__ for a few distributed modules plus a little typing (#84119)"
This reverts commit 6f21680563.

Reverted https://github.com/pytorch/pytorch/pull/84119 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D39386448
2022-09-09 20:01:07 +00:00
Rodrigo Kumpera
6f21680563 Add __all__ for a few distributed modules plus a little typing (#84119)
This handles distributed_c10d, which is massive and ddp_comm_hooks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84119
Approved by: https://github.com/rohan-varma
2022-09-08 23:28:31 +00:00
Rodrigo Kumpera
e96fb5d58c [c10d] Fix docstring of scatter_object_list (#84596)
The docstring for scatter_object_list mentions is doesn't work with NCCL, but this was fixed in #79034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84596
Approved by: https://github.com/H-Huang
2022-09-07 14:49:45 +00:00
Masaki Kozuki
ab6c57217a Add NCCL PreMul Sum to c10d redce ops (#84243)
This is based on #81272 but this conforms to TorchScript Compiler

## TODO
- [ ] Update abaf8112e6/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp (L64-L73) to use `ReduceOp::RedOpType`. In my first try with `USE_SYSTEM_UCC=1`, this change wasn't necessary (I think) because of `ReduceOp::RedOpType` operator. That being said, I want to make it more explicit.

cc @ptrblck @kwen2501 @aazzolini
cc @zasdfgbnm for visibility to the TODO above
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84243
Approved by: https://github.com/kwen2501
2022-09-02 21:57:45 +00:00
Rodrigo Kumpera
7a348a1d4a Fix internal breakage caused by #82134 (#84363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84363
Approved by: https://github.com/rohan-varma, https://github.com/mehtanirav
2022-09-01 17:54:10 +00:00
Rodrigo Kumpera
65dc5dd3f3 [c10d] Introduce dist.get_local_rank, dist.get_global_rank and dist.get_global_ranks (#82134)
Those functions enable membership introspection into a ProcessGroup. A common scenario
that needs this is library code that consumes a PG but doesn't create it, which means
it likely doesn't know the global ranks used to create it.

Translating from local to global is necessary when using c10d collectives like broadcast
so if your library code adopts the convention of using local rank 0, it needs
to the following:

```python
import torch.distributed as dist

my_pg: dist.ProcessGroup = ...

def my_library_bcast(tensor)
    dist.broadcast(tensor, src=dist.get_global_rank(my_pg, local_rank=0), my_pg)

```

This implements some of the helpers needed to implement the `clone` API from: https://github.com/pytorch/pytorch/issues/81291
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82134
Approved by: https://github.com/rohan-varma
2022-08-30 17:45:00 +00:00
PyTorch MergeBot
1f61c39ac4 Revert "Support NCCL Premul Sum (#81272)"
This reverts commit 432c508e71.

Reverted https://github.com/pytorch/pytorch/pull/81272 on behalf of https://github.com/weiwangmeta due to breaking internal builds
2022-08-25 05:01:37 +00:00
Masaki Kozuki
432c508e71 Support NCCL Premul Sum (#81272)
This PR adds the support for https://docs.nvidia.com/deeplearning/nccl/archives/nccl_21212/user-guide/docs/api/ops.html?highlight=premul#c.ncclRedOpCreatePreMulSum.

The major changes include
- convert enum ReduceOp to struct
- add premul sum specific paths to init.cpp and Ops.cpp.

note:
- For pip wheels / conda binaries to support this, ~~I think https://github.com/pytorch/pytorch/pull/79132 would be needed~~ https://github.com/pytorch/pytorch/pull/82775 landed

The commit titled "add nccl premul" whose current hash is cb99ad6744 was authored by @mcarilli and @ptrblck.

cc @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81272
Approved by: https://github.com/kwen2501
2022-08-24 04:53:25 +00:00
joncrall
4618371da5 Integrate xdoctest - Rebased (#82797)
This is a new version of #15648 based on the latest master branch.

Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.

In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)

Fixes https://github.com/pytorch/pytorch/issues/71105

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
2022-08-12 02:08:01 +00:00
Xiang Gao
cda210e23b UCC PG build in CI (#81583)
- Modifies the current cmake build definitions to use `find_package` to find UCX and UCC installed in the system
- Install UCX and UCC in CUDA dockers
- Build PyTorch with `USE_UCC=1` in pipelines
- Currently, we are not running unit tests with the UCC PG. Those tests will be added in future PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81583
Approved by: https://github.com/vtlam, https://github.com/malfet
2022-08-10 00:23:47 +00:00
Aashaka Shah
24a084eda6 [c10d] Fix async error in batch_isend_irecv (#82450)
Summary:
`batch_isend_irecv` previously required the use of `torch.cuda.synchronize` to avoid data race conditions. This was because the ncclStreams were recorderd in the returned ncclWork object _before_ a ncclGroupEnd by the `_batch_p2p_manager` was issued. Thus, the `req.wait()` was effectively waiting on nothing, leading to the later operators working on incorrect intermediate data.

This fix:
- keeps track of ncclStreams to wait on, and records them in the work objects after the batch manager issues a ncclGroupEnd
- renames the `_batch_p2p_manager` to `_coalescing_manager` for generality
- removes the explicit check for NCCL backend inside `_batch_p2p_manager` in distributed_c10.py and moves the manager start/end to ProcessGroup.hpp, in order to transparently work with all process groups

Test Plan: Modified the unittest for `batch_isend_irecv` to check that received tensors are the same as expected tensors. Verified that the test fails before the change, and passes after the change.

Differential Revision: D38100789

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82450
Approved by: https://github.com/kwen2501
2022-08-08 17:50:22 +00:00
ProGamerGov
71d50f4f89 Change docstring type callable to Callable for consistency (#82487)
### Description

Across PyTorch's docstrings, both `callable` and `Callable` for variable types. The Callable should be capitalized as we are referring to the `Callable` type, and not the Python `callable()` function.

### Testing

There shouldn't be any testing required.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82487
Approved by: https://github.com/albanD
2022-08-01 17:26:09 +00:00
Terry Lam
54bdaf76d6 [PFC] Native UCC process group for Pytorch (#79918)
Summary:
This diff integrates UCC process group as a native component of Pytorch Distributed core. It is based on the existing torch-ucc (https://github.com/facebookresearch/torch_ucc) as the wrapper for UCC collective communication library.
The environment and cmake variables are named in mirroring to the existing process groups such as NCCL and Gloo. Specifically,
- USE_UCC: enables UCC PG. This defaults to OFF, so there is no breakage of existing builds that do not have UCX/UCC external libraries.
- USE_SYSTEM_UCC: uses external UCX and UCC shared libraries that are set accordingly with UCX_HOME and UCC_HOME.

Currently, this diff only supports USE_SYSTEM_UCC=ON, i.e., requiring users to specify external libraries for UCX and UCC. In subsequent diffs, we will add UCX and UCC repos as third-party dependencies in pytorch/third-party.

Test Plan:
Passed Torch-UCC tests that invoke UCC process group. For example:

$ sh test/start_test.sh test/torch_allreduce_test.py --backend gloo --use-cuda
...
Test allreduce: succeeded

Differential Revision: D36973688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79918
Approved by: https://github.com/kwen2501, https://github.com/kingchc
2022-07-12 14:45:44 +00:00
Nikita Shulga
09df27fe45 Revert "Revert "[distributed] Handle object collectives and NCCL. (#79034)""
This reverts commit 279634f384.
2022-06-15 10:04:37 -07:00
PyTorch MergeBot
279634f384 Revert "[distributed] Handle object collectives and NCCL. (#79034)"
This reverts commit 4ebb326b75.

Reverted https://github.com/pytorch/pytorch/pull/79034 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-06-15 16:16:21 +00:00
Rodrigo Kumpera
4ebb326b75 [distributed] Handle object collectives and NCCL. (#79034)
This fixes all object collectives under NCCL and adds some automated tests for them.

This PR *does not* fix sending tensors using object collectives.

It simplifies device handling by computing the appropriate one earlier and then ensuring all tensor ops happen on it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79034
Approved by: https://github.com/rohan-varma
2022-06-13 19:23:39 +00:00
Michael Suo
fb0f285638 [lint] upgrade mypy to latest version
Fixes https://github.com/pytorch/pytorch/issues/75927.

Had to fix some bugs and add some ignores.

To check if clean:
```
lintrunner --paths-cmd='git grep -Il .' --take MYPY,MYPYSTRICT
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76753

Approved by: https://github.com/malfet
2022-05-03 20:51:34 +00:00
PyTorch MergeBot
3d7428d9ac Revert "[lint] upgrade mypy to latest version"
This reverts commit 9bf18aab94.

Reverted https://github.com/pytorch/pytorch/pull/76753 on behalf of https://github.com/suo
2022-05-03 20:01:18 +00:00
Michael Suo
9bf18aab94 [lint] upgrade mypy to latest version
Fixes https://github.com/pytorch/pytorch/issues/75927.

Had to fix some bugs and add some ignores.

To check if clean:
```
lintrunner --paths-cmd='git grep -Il .' --take MYPY,MYPYSTRICT
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76753

Approved by: https://github.com/malfet
2022-05-03 19:43:28 +00:00
Ke Wen
d6f22abbcc [PyTorch Distributed] Fix batch_isend_irecv
Summary:
`batch_isend_irecv` previously only worked for two-rank cases,
otherwise it would hang, e.g. pytorch/pytorch#73960. This Diff extends
`batch_isend_irecv` to support more than two ranks. The fix is by treating
the operation more like a collective rather than two-rank P2P when selecting
the communicator, since there can be more ranks participating in the batch call than "my" rank and "my" peer.

Rules:
- If `batch_isend_irecv` is the first collective call (including collectives and
  all-to-all) in the `group` given as the argument, then all ranks of the
  `group` are expected to participate in this call.
- Otherwise, if it is not the first collective call in the `group` (i.e. the
  communicator has been initialized), then batched P2P communication involving
  only subset of processes of the `group` is allowed.

Test Plan:
Added p2p_tests.py testing the following patterns:
+    sendrecv_neighbor(input, output)       # Ring like neighbor exchange
+    sendrecv_ripple(input, output)         # Exchange with growing distance (pytorch/pytorch#73960)
+    sendrecv_P2P(input, output)            # Single P2P operation
+    isendrecv_P2P(input, output)           # Single non-blocking P2P operation
+    isendrecv_P2P_batch(input, output, 0)  # batched P2P between only two ranks
+    isendrecv_P2P_batch(input, output, 1)  # batched P2P within a new group created for two ranks

Differential Revision: D35122664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74701
Approved by: https://github.com/mingzhe09088, https://github.com/osalpekar
2022-04-13 05:55:00 +00:00
Can Balioglu
e1db2f13ce Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166

This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
ghstack-source-id: 149778566

Test Plan: Run the existing unit tests.

Reviewed By: rohan-varma

Differential Revision: D34371226

fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
2022-02-24 02:33:05 +00:00
Yuxin Wu
1ed4653e89 Stop writing logs to root logger (#72649)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/72648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72649

Reviewed By: soulitzer

Differential Revision: D34172113

Pulled By: mrshenli

fbshipit-source-id: 98cb4140b978a0d9fa53876e427ea3b8bbe884cf
(cherry picked from commit c14297cee6)
2022-02-11 21:30:53 +00:00
Rohan Varma
678c08bb55 [PG Wrapper] Small fix (#72657)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72657

_ProcessGroupWrapper check needs to be gated on Gloo availability,
this fails when gloo is not avail_ProcessGroupWrapper check needs to be gated
on Gloo availability, this fails when gloo is not avail.
ghstack-source-id: 148837056

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D34144848

fbshipit-source-id: 42a04918b968247f3259cd2cde5438e1265b04fe
(cherry picked from commit ba5de98939)
2022-02-11 15:59:13 +00:00
Wanchao Liang
8551989bff [c10d] Enable gather_object on nccl (#71623)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71623

Enable gather_object on the nccl backend, since we already support `dist.gather` on nccl. This requires user to set the current device properly.
ghstack-source-id: 147754836

Test Plan: distributed_nccl_spawn -r test_gather_object

Reviewed By: zou3519

Differential Revision: D33701042

fbshipit-source-id: 39cff22947a7cac69d0c923b956dc10f25353a6f
(cherry picked from commit 6e6eff497f)
2022-01-27 14:59:55 -08:00
Shen Li
7bc220e060 Update distributed.rst for ProcessGroup Extensions (#71482)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71482

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D33745986

Pulled By: mrshenli

fbshipit-source-id: fe2d0491901bf00be09deb5c556bc1e1d359b725
(cherry picked from commit be5104bfd7)
2022-01-25 00:30:08 +00:00
Stephan Uphoff
e1e43c4e71 Prevent sum overflow in broadcast_object_list (#70605)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70605

broadcast_object_list casted the sum of all object lengths to int from long causing overflows.

Test Plan:
Add a Tensor  with >2GB storage requirement (in distributed_test.py) to object broadcast.

This Tensor is only added if test are running at Meta as github tests will oom.

Without fix the length will overflow and the program will request a negative sized Tensor:
```
RuntimeError: Trying to create tensor with negative dimension -2147482417: [-2147482417]
```
With fix it will pass the test.

Test used on server with GPUs:

buck test  mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn --local -- broadcast_object
buck test  mode/dev-nosan //caffe2/test/distributed:distributed_gloo_spawn --local -- broadcast_object

Reviewed By: r-barnes

Differential Revision: D33405741

fbshipit-source-id: 972165f8297b3f5d475636e6127ed4a49adacab1
2022-01-05 09:07:39 -08:00
Michael Suo
b7b32b56f1 Revert D33281300: Prevent sum overflow in broadcast_object_list
Test Plan: revert-hammer

Differential Revision:
D33281300 (807f9a828c)

Original commit changeset: 1bc83e8624ed

Original Phabricator Diff: D33281300 (807f9a828c)

fbshipit-source-id: beb81a9cbfba405a61b11dfaa8e39c9601f45643
2021-12-27 19:01:53 -08:00
Stephan Uphoff
807f9a828c Prevent sum overflow in broadcast_object_list (#70336)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70336

broadcast_object_list casted the sum of all object lengths to int from long causing overflows.

Test Plan:
Increased size of Tensor used in object transfers to have  >2GB storage requirement (in distributed_test.py)

Without fix the length will overflow and the program will request a negative sized Tensor:
```
RuntimeError: Trying to create tensor with negative dimension -2147482417: [-2147482417]
```
With fix it will pass the test.

Test used on server with GPUs:

buck test  mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn --local -- broadcast_object

Differential Revision: D33281300

fbshipit-source-id: 1bc83e8624edc14e747eeced7bc8a7a10e443ee4
2021-12-27 16:17:53 -08:00
s-kumano
ff53ed24d2 fix NameError of docstring in broadcast_object_list (#69810)
Summary:
This PR fixes NameError of docstring in broadcast_object_list.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69810

Reviewed By: kimishpatel

Differential Revision: D33143167

Pulled By: jbschlosser

fbshipit-source-id: 99c076466ae4b4a332763b7546028c5097b417d7
2021-12-16 10:50:45 -08:00
Bryan Reese
4670f0f2c5 Set non-default backend names to lower case (#69400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69400

Hopefully this makes naming more consistent. Without this change, some tests will fail for plugins since values can be set to upper case in some cases. This should prevent that and make lookup and comparison consistent.

Test Plan: Check the signals. There is no specific test for this, but all tests should pass.

Reviewed By: mrshenli

Differential Revision: D32836529

fbshipit-source-id: 1b7d2b64e04fe0391b710aa6ed6d1e47df9027a3
2021-12-07 07:58:46 -08:00
Rohan Varma
cb14a258a2 [c10d] Fix object-based collectives for debug mode (#68223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68223

DETAIL debug mode didn't work with object-based collectives for NCCL backend, because we'd only check if backend is NCCL and then move tensors to CUDA.

Instead, check if it is a wrapped PG, and then check the pg that is wrapped to see if its nccl.
ghstack-source-id: 143242023

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D32366840

fbshipit-source-id: be0a2af6849f8f24446593f4a4fbea4a67586ee5
2021-11-13 04:18:31 -08:00
Shen Li
18955d3564 Raise warning when calling collectives on non-member group objects (#67639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67639

Due to BC considerations, we cannot directly error out, as that
might break existing applications. Raise warnings first to improve
debuggability.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D32075151

Pulled By: mrshenli

fbshipit-source-id: 5680d420f5f6cd3f74a36616c03350e8a976b363
2021-11-02 20:04:07 -07:00
Shen Li
ce6f4b3a02 Setup c10d extension Backend class attr the same way as builtin ones (#66991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66991

Currently, c10d extensions uses Backend.NAME to store the creator
function. However, builtin ones use that same field to store the
name. This commit makes c10d extensions comply with builtin ones,
and uses a dedicated `_plugins` field to store creator functions.

Thanks bryanmr for pointing this out.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D31820307

Pulled By: mrshenli

fbshipit-source-id: 259769ebfc80c0c9fc44d25498c8d19a3a09d1bc
2021-10-21 12:35:07 -07:00
Yi Wang
12137db5e3 Fix the slowdown of _object_to_tensor since 1.9 (#65721)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65721

#Closes: https://github.com/pytorch/pytorch/issues/65696

The bug is introduced in https://github.com/pytorch/pytorch/pull/55861, and it causes 100X slowdown since 1.9.
ghstack-source-id: 139128267

Test Plan:
Performance test:
```
import time

from torch.distributed.distributed_c10d import _object_to_tensor

start = time.time()
_object_to_tensor("x" * 50_000_000)
print("Time:", time.time() - start)
```

Reviewed By: rohan-varma

Differential Revision: D31219794

fbshipit-source-id: 1abec38f9d51361c1eab6ad5efd87b589322e208
2021-09-27 19:22:10 -07:00
Shen Li
2a81e8b8f1 Let all_reduce_coalesced and all_gather_coalesced return Future objects (#64722)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64722

`all_reduce_coalesced` and `all_gather_coalesced` are never publicly
released in our API docs. So, I would assume the blast radius to be small.

The motivation for this change to allow implementing
`all_reduce_coalesced` and `all_gather_coalesced` by re-using `allreduce`
and `allgather` C++ cores and perform flatten and copy only on the Python
side. With that, we can then remove `all_reduce_coalesced` and
`all_gather_coalesced` from C++ ProcessGroup APIs. For the async mode,
the copy-back logic after the communication will need to be chained
as a callback on the returned Future and use the chained child Future
as the return value (otherwise, we will need to wrap the child Future
into another work handle). This PR tries to test if we can directly
return a Future without breaking tests and internal use cases. If yes,
it will make the consolidation a lot easier.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D30830994

Pulled By: mrshenli

fbshipit-source-id: dcde0ed9245e9e8fee357b3588b07d540a4b6318
2021-09-10 07:45:25 -07:00
mrshenli
101a626330 Improve distributed.get_rank() API docstring (#63296)
Summary:
See discussion in https://pytorch.slack.com/archives/CBHSWPNM7/p1628792389008600

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63296

Reviewed By: cbalioglu

Differential Revision: D30332042

Pulled By: mrshenli

fbshipit-source-id: 3a642fda2e106fd35b67709ed2adb60e408854c2
2021-08-27 11:34:55 -07:00
Kiuk Chung
9d95d48567 (torch.distributed) Add torch.distributed.is_torchelastic_launched() util method + make init_method=tcp:// compatible with torchelastic (#63910)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63910

Addresses the current issue that `init_method=tcp://` is not compatible with `torch.distributed.run` and `torch.distributed.launch`. When running with a training script that initializes the process group with `init_method=tcp://localhost:$port` as such:

```
$ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py
```

An `Address in use` error is raised since the training script tries to create a TCPStore on port 6000, which is already taken since the elastic agent is already running a TCPStore on that port.

For details see: https://github.com/pytorch/pytorch/issues/63874.

This change does a couple of things:

1. Adds `is_torchelastic_launched()` check function that users can use in the training scripts to see whether the script is launched via torchelastic.
1. Update the `torch.distributed` docs page to include the new `is_torchelastic_launched()` function.
1. Makes `init_method=tcp://` torchelastic compatible by modifying `_tcp_rendezvous_handler` in `torch.distributed.rendezvous` (this is NOT the elastic rendezvous, it is the old rendezvous module which is slotted for deprecation in future releases) to check `is_torchelastic_launched()` AND `torchelastic_use_agent_store()` and if so, only create TCPStore clients (no daemons, not even for rank 0).
1. Adds a bunch of unittests to cover the different code paths

NOTE: the issue mentions that we should fail-fast with an assertion on `init_method!=env://` when `is_torchelastic_launched()` is `True`. There are three registered init_methods in pytorch: env://, tcp://, file://. Since this diff makes tcp:// compatible with torchelastic and I've validated that file is compatible with torchelastic. There is no need to add assertions. I did update the docs to point out that env:// is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future but this is out of scope for this issue.

Test Plan: Unittests.

Reviewed By: cbalioglu

Differential Revision: D30529984

fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5
2021-08-25 22:57:43 -07:00
Gao, Xiang
2d103025a5 Adding warning on isend about modifying after send (#61875)
Summary:
This is a standard limitation on communication collective libraries. For example:

https://www.open-mpi.org/doc/v4.0/man3/MPI_Isend.3.php
```
A nonblocking send call indicates that the system may start copying data out of the send buffer. The sender should not modify any part of the send buffer after a nonblocking send operation is called, until the send completes.
```

http://openucx.github.io/ucx/api/latest/html/group___u_c_p___c_o_m_m.html#ga8323878b60f426c630d4ff8996ede3cc
```
The user should not modify any part of the buffer after this operation is called, until the operation completes.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61875

Reviewed By: suo

Differential Revision: D29783720

Pulled By: mrshenli

fbshipit-source-id: 78fd047c74449f77b906f3766a6c2bc29499847d
2021-07-29 07:37:18 -07:00
Marjan Fariborz
994434ad16 Adding complex number support for all_to_all/scatter (#61299)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61299

Modifying all_to_all and scatter to support complex numbers as well as float numbers.

Test Plan: buck run //caffe2/test/distributed:distributed_gloo_fork -- test_name --print-passing-details --run-disabled

Reviewed By: wanchaol

Differential Revision: D29563938

fbshipit-source-id: 59e436b3fa1aee3d5195cbcffd39587e642c76b9
2021-07-20 15:45:34 -07:00
Yu Guo
a50a389ca6 Revert D29701479: [pytorch][PR] Remove _broadcast_object() from ZeroRedundancyOptimizer
Test Plan: revert-hammer

Differential Revision:
D29701479 (9b5d9b4049)

Original commit changeset: c8d5f9057b32

fbshipit-source-id: 35ab1f399513fb9d1c4e73b1fa906e559d2a6994
2021-07-15 10:03:08 -07:00
Andrew Gu
9b5d9b4049 Remove _broadcast_object() from ZeroRedundancyOptimizer (#61539)
Summary:
Revised version of https://github.com/pytorch/pytorch/issues/60573.

**Overview:**
This makes two changes:
- It introduces a `map_location` argument to `broadcast_object_list()`. The argument specifies the device to load tensors contained in objects received from the broadcast. This change requires modifying the implementation of `_object_to_tensor()` and `_tensor_to_object()` to use `torch.save()` and torch.load()` respectively.
- It removes all calls to `_broadcast_object()` in `ZeroRedundancyOptimizer` and the corresponding test file in favor of `broadcast_object_list()`.

The default value of `map_location` is `None`, in which case `_object_to_tensor()` and hence `broadcast_object_list()` preserve their original behavior. Namely, contained tensors are loaded to their original device.

In `consolidate_state_dict()`, I specify `map_location=torch.device("cpu")` instead of `self._default_device`. This slightly changes the behavior from before when using `_broadcast_object()`. The reason I do so is that it saves one GPU to CPU data transfer since the action immediately after receiving the broadcasted `local_state_dict` is to copy it to CPU.

Explicitly, if `map_location=self._default_device`, then the data transfer path assuming NCCL backend is as follows:
`source GPU --[before serialize]--> source CPU --[before broadcast]--> source GPU --[broadcast]--> destination GPU --[before deserialize]--> destination CPU --[deserialize]--> destination GPU --[copy]--> destination CPU`
Hence, by setting `map_location=torch.device("cpu")` instead, the suffix becomes:
`destination CPU --[deserialize]--> destination CPU --[copy]--> destination CPU`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61539

Test Plan:
I added a test `test_broadcast_object_list_map_location()` that checks for both `map_location` as CPU and GPU that (1) tensors contained in broadcasted objects are appropriately loaded onto the specified device and (2) that the contents of the tensors are correct.

The existing `ZeroRedundancyOptimizer` tests pass.
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```

The existing `broadcast_object_list()` test passes:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_broadcast_object_list
```

Reviewed By: zou3519

Differential Revision: D29701479

Pulled By: andwgu

fbshipit-source-id: c8d5f9057b32e5e9f40e8edc5b2cc25fb21414a9
2021-07-14 17:36:30 -07:00
Bo Wang
ab27399566 Make broadcast_object_list accept a device parameter. (#61305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61305

Part I (this PR): Add dist_device argument to broadcast_object_list API
Part II: andwgu@ will deprecate _broadcast_object with the newly introduced API
	 Also include the changes to _object_to_tensor()/_tensor_to_object() with PR 60573

Context: https://github.com/pytorch/pytorch/issues/60062

Test Plan:
Run the following on DevGpus with two cuda devices

$python setup.py develop    --- run this build on DevGPU
$BACKEND='nccl' WORLD_SIZE=2 with-proxy  python test/distributed/test_distributed_fork.py  TestDistBackendWithFork.test_broadcast_object_list --v
$BACKEND='gloo' WORLD_SIZE=2 with-proxy  python test/distributed/test_distributed_fork.py  TestDistBackendWithFork.test_broadcast_object_list --v

Build with distributed on: USE_DISTRIBUTE=1 python setup.py develop
Test on CPU devvm:

$ with-proxy python test/distributed/optim/test_zero_redundancy_optimizer.py

Imported from OSS

Differential Revision:
D29566538
D29566538

Reviewed By: iramazanli, mrshenli

Pulled By: bowangbj

fbshipit-source-id: 0bea52442551c5194acba85eadda16ba2ec4b6ef
2021-07-14 11:43:17 -07:00
Philip Meier
d5988c5eca remove unused type: ignore directives (#60006)
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct, but `mypy` doesn't recognize this. This often stems from the fact, that the used `mypy` version wasn't able to handle the used pattern.

With every new release `mypy` gets better at handling complex code. In addition to fix all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see if they are still needed or not. Fortunately, we don't need to do it manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out in case it encounters an `type: ignore` that is no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006

Reviewed By: jbschlosser, malfet

Differential Revision: D29133237

Pulled By: albanD

fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
2021-06-18 07:23:31 -07:00
Ruilin Chen
38c3116813 [hierarchical sharding 5/n] enable table-wise -> col-wise sharding in embedding table lookup
Summary:
This diff add table-wise -> col-wise sharding support in GroupedShardedEmbeddingBag. Changes includes:
1. Add necessary member variables set up.
2. Create new fast kernel and add fast kernel lookup support
3. Add intra-host all2all and cross-host all2all logic.

Test Plan:
UT
```
buck test mode/dev-nosan //caffe2/torch/fb/training_toolkit/backend/tests:test_model_materializer_full_sync_spawn
```
```
buck test caffe2/torch/fb/hpc/tests:model_sharder_test
```
QPS check:
```
buck run mode/dev-nosan -c python.package_style=inplace caffe2/torch/fb/training_toolkit/examples:sync_sgd_local_driver -- prod-preset --num-trainers 32 --use-shrunk-model false --model-version=inline_cvr_dec_2020 --fast-kernel table_batched --max-batches 10000 --num-dpp-worker-threads 16 --num-readers 100 --hpc-identity ads_model_platform --table-partition hierarchical_based --hierarchical-options "["table_based", "column_based"]" --flow-entitlement ads_global_qps
```
with diff:
dec inline_cvr:
table-wise -> table-wise (82K):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_d0a0cba5?version=0&tab=status&env=PRODUCTION

table-wise -> column-wise (80k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_b1ac5873

column-wise:
dec inline_cvr:
gpu trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1623827677%2F127.0.0.1%2Flibkineto_activities_4550.json.gz&bucket=gpu_traces

https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_a79e1522 (81k)

https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_2dacc13e (88k)

row-wise(62k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_4e349cab

table-wise(90k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_5d51b608

10x ctr_mbl_feed:
```
buck run mode/dev-nosan -c python.package_style=inplace caffe2/torch/fb/training_toolkit/examples:sync_sgd_local_driver -- prod-preset --num-trainers 128 --use-shrunk-model false --model-version=ctr_mbl_oct_2020_10x_3tb --num-dpp-worker-threads 16 --num-readers 200 --fast-kernel table_batched --max-batches 5000000 --hpc-identity ads_model_platform --table-partition column_based --flow-entitlement ads_global_tc_mimo
```
column-wise:
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_f05fb306?version=0&tab=status&env=PRODUCTION (290k)

w/o diff:
dec inline_cvr:
column-wise (87K):
gpu trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1623864444%2F127.0.0.1%2Flibkineto_activities_4451.json.gz&bucket=gpu_traces
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_e1315f14

row-wise (60k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_8fcc0adf

table-wise (91k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_cb94ff41

10x ctr_mbl_feed:
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_203ef35b?version=0&tab=status&env=PRODUCTION (281k)

NE check(use deterministic reading D28711400)
```
buck run mode/dev-nosan -c python.package_style=inplace caffe2/torch/fb/training_toolkit/examples:sync_sgd_local_driver -- prod-preset --num-trainers 32 --use-shrunk-model false --model-version=inline_cvr_dec_2020 --fast-kernel table_batched --max-batches 100000 --num-dpp-worker-threads 16 --num-readers 64 --hpc-identity ads_model_platform --table-partition hierarchical_based --hierarchical-options "[table_based, column_based]" --flow-entitlement ads_global_qps --use-deterministic-model --use-deterministic-reading --model-entity-id 995557193
```
w/o this diff:
```
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: ne-ne|lifetime_ne 0.8660048340401448
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: ne-ne|window_ne 0.8660048340401447
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: qps-qps|total_examples 1867776.0
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: qps-qps|window_qps 491.5199890136719
```
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_15bc6243?version=0&tab=status&env=PRODUCTION

w this diff:
```
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: ne-ne|lifetime_ne 0.8660048340401448
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: ne-ne|window_ne 0.8660048340401447
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: qps-qps|total_examples 1867776.0
```
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_15bc6243?version=0&tab=status&env=PRODUCTION

Reviewed By: JadeNie

Differential Revision: D28689126

fbshipit-source-id: 1c7879d4e3ee2b90aaf2a89e87f7b827d54173b3
2021-06-17 22:25:25 -07:00
clint
78011bc0ce typofix (torch.zero to torch.zeros) in docstring (#59703)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59703

Reviewed By: ezyang

Differential Revision: D29145998

Pulled By: H-Huang

fbshipit-source-id: f2670502170aa100fb02408046b7f6850f9379cf
2021-06-15 21:12:42 -07:00
Yi Wang
48ea7c808d [C10d] Support subgroups (#59111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59111

Create a util function for initializing subgroups. By default, each subgroup contains all the ranks within a machine. This util function can be used by both local SGD and SyncBatchNorm optimization.

Additionally, clang format `distributed/__init__.py` after importing `_rank_not_in_group` which is used by the unit test, and also clang format `distributed_c10d.py`.

Note that this API does not accept another overall main group. Like APEX API `create_syncbn_process_group` [here](https://nvidia.github.io/apex/_modules/apex/parallel.html), always uses the global world size and should only be applied when CUDA is available.

#Closes: https://github.com/pytorch/pytorch/issues/53962
ghstack-source-id: 130975027

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_group_size_exceeds_world_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_world_size_not_divisible_by_group_size

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_by_enumeration
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_by_enumeration_input_rank_exceeds_world_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_overlap_not_allowed

Reviewed By: rohan-varma

Differential Revision: D28495672

fbshipit-source-id: fdcc405411dd409634eb51806ee0a320d1ecd4e0
2021-06-09 22:35:11 -07:00
Can Balioglu
4ee761c2c5 [2/n] [c10d] Introduce the 'multiTenant' constructor parameter in TCPStore (#58329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58329

This PR is part of a stack that addresses the GitHub issue #41614; it introduces:

- A new `multiTenant` constructor option for the `TCPStore` class indicating whether multiple store instances can be initialized with the same host:port pair.

- Updates to the C10d distributed (elastic) rendezvous and the `init_process_group` method to leverage the new `multiTenant` feature.

Note that the multi-tenancy feature itself is implemented in the fourth PR of this stack. In this PR passing `true` to `multiTenant` results only with a warning output.
ghstack-source-id: 130676389

Test Plan: Run the existing tests since there are no behavioral changes.

Reviewed By: rohan-varma

Differential Revision: D28424978

fbshipit-source-id: fb1d1d81b8b5884cc5b54486700a8182a69c1f29
2021-06-05 07:50:04 -07:00
Luca Wehrstedt
8f4cfaa9db Fix race condition in TP agent (#58753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58753

TSAN was (rightfully!) detecting and complaining about a race due to the fact that upon init the TP agent exchanges the device maps between nodes using RPC requests (and by doing so it accesses the device maps) and then sets the reverse device maps (thus possibly modifying the set of devices). This resulted in a data race, i.e., simultaneously reading and writing the set of devices without synchronizing.

One solution is to add a mutex around the devices, which works, but is "annoying". An alternative solution is to make the set of devices immutable (i.e., `const`). For that to work, we need to exchange the device maps without using RPC calls. We can do so using the process group that we need to create anyways.

Since now there's a lot more logic in Python, I've moved (and restructured) all safety checks over there, and removed them from C++.
ghstack-source-id: 130583775

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D28603754

fbshipit-source-id: 88533e65d72d1eb806dc41bec8d55def5082e290
2021-06-04 06:53:42 -07:00
Liang Luo
77de640f4b [torch distributed] Implementing reduce_scatter_base (#57567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57567

Support flattened reduce_scatter.

Test Plan:
buck test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/torch/lib/c10d:ProcessGroupNCCLTest
buck test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/test/distributed:c10d

Reviewed By: zhaojuanmao

Differential Revision: D27876281

fbshipit-source-id: 58e2edfb1baff5cdc083dbaaba9f19502ef0b298
2021-06-03 17:17:53 -07:00
Rohan Varma
19bcbfc5cf [c10d] Use pg wrapper in detailed debug mode (#58281)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58281

When TORCH_DISTRIBUTED_DEBUG=DETAIL is enabled, this PR causes process groups created by `new_group` and `init_process_group` that are nccl or gloo to be wrapped in `ProcessGroupWrapper`.

As a result, the user will get back a `ProcessGroupWrapper` that they can use in the exact same way as a regular nccl/gloo pg, but will be more helpful in terms of debugging desync/hangs.

Besides doing collective desync checks, which should be transparent if there are indeed no issues in the user application, there are no semantic differences in using the wrapper pg. Note that there is a performance implication here but that is a tradeoff we are making when DETAIL debug mode is enabled.

Open to suggestions on how to test better. Currently I verified locally that enabling TORCH_DISTRIBUTED_DEBUG=detail creates the wrapper and all tests still pass, but that doesn't run in CI. On the other hand testing everything with debug=detail and the regular tests might be too much, so we have only added it to a few tests for now. We also do have tests in the below diff.
ghstack-source-id: 129817857

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D28402301

fbshipit-source-id: c4d3438320f6f0986e128c738c9d4a87bbb6eede
2021-05-25 09:55:52 -07:00
Rohan Varma
cf395c0718 [c10d] Introduce ProcessGroupWrapper (#58224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58224

Adds C++ implementation of ProcessGroupWrapper. It wraps
an underlying ProcessGroup and does debug checks before dispatching the
collective to the underlying pg. The design mostly follows https://github.com/pytorch/pytorch/issues/22071.

Concretely, on each collective, we:
1. Verify op type consistency. This can help catch mismatched ops in the user application (i.e. allreduce on one rank and allgather on another)
2. Verify tensor shapes. This can help catch bugs where the tensor inputs are malformed, whereas normally in NCCL this would just lead to a hang. The shapes verification for allgather/allreduce_coalesced is omitted because they actually accept different shape tensors and don't error out.

This is done through an abstraction called `CollectiveFingerPrint` which uses a helper process group to do the above verification. Concretely, we gather the data we need for each of the above checks into tensors, and allgather them, and verify their equivalence.

Once all of this passes we simply dispatch the collective to the underlying pg.

Added `ProcessGroupWrapperTest` in python to comprehensively test these changes.
ghstack-source-id: 129735687

Test Plan: ci

Reviewed By: zhaojuanmao

Differential Revision: D28023981

fbshipit-source-id: 1defc203c5efa72ca0476ade0d1d8d05aacd4e64
2021-05-24 20:09:51 -07:00
Rohan Varma
071d49a970 Document monitored barrier (#58322)
Summary:
Will not land before the release, but would be good to have this function documented in master for its use in distributed debugability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58322

Reviewed By: SciPioneer

Differential Revision: D28595405

Pulled By: rohan-varma

fbshipit-source-id: fb00fa22fbe97a38c396eae98a904d1c4fb636fa
2021-05-21 19:04:57 -07:00
Yi Wang
314a578154 Clang format distributed_c10d.py (#58435)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58435

Prepare for #53962

ghstack-source-id: 129171617

Test Plan: N/A

Reviewed By: zhaojuanmao

Differential Revision: D28490326

fbshipit-source-id: 2ed3c5850788b9702a8020f6ee6d0b579625bf89
2021-05-17 16:47:35 -07:00
Rohan Varma
e90fcffb65 [c10d] Log when store based barrier succeeds (#57711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57711

Seeing some hangs/issues around store based barrier internally, would
be good to have this log to indicate whether store based barrier has completed
successfully or not for a particular rank to debug further.
ghstack-source-id: 128605600

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28249087

fbshipit-source-id: 644e5780519017ae780c3bc78bbe5def322db3f8
2021-05-10 21:09:40 -07:00
Liang Luo
c37095760d [torch distributed] Implementing all_gather_base (#56315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56315

This diff implements the all_gather_base in pytorch distributed.

Test Plan: dist.all_gather_base(output, input)...

Reviewed By: agolynski, amylittleyang

Differential Revision: D27488999

fbshipit-source-id: 937ec8bddf9527fa4d114f984d1d0f6a5b8c3936
2021-04-23 14:16:47 -07:00
Wanchao Liang
a970e525fd make ProcessGroup.Options.timeout argument private in python (#56531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56531

per discussions in
https://github.com/pytorch/pytorch/pull/53663/files#r593409009, we need
to make sure our API not confusing user by passing in both timeout in
argument and timeout in processgroup.options. This PR tries to make the
`ProcessGroup.Options.timeout` be a private field, and only be used in
our test utils, for both `init_process_group` and `new_group`, we still
allow user pass `timeout` as a separate argument. Since
`ProcessGroupGloo.Options` only have a `timeout` config, both functions
will not allow passing in options for the GLOO backend.

This way we still preserve the only `timeout` API, and only allow user
to use `ProcessGroupNCCL.Options` when needed.

cc pritamdamania87 rohan-varma

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D27893395

Pulled By: wanchaol

fbshipit-source-id: cdd29c84648002226ef3d9f9f3ea67b795e64bc5
2021-04-21 17:55:10 -07:00
Rohan Varma
b7d5a0cf10 [c10d] sequence number in process group (#55319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55319

Adds a sequence number class as well as integration with ProcessGroup (nccl and gloo) as part of better debugability.

The main use case is that each ProcessGroup instantiated will have a sequence number initially set by rank 0, and broadcasted to all others. We will increment the number on each collective, thus allowing us to match the numbers appropriately when checking for desynchronization.

This PR just adds the bare-bones integration and verifies sequence numbers are set appropriately at the beginning.
ghstack-source-id: 127011277

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27562769

fbshipit-source-id: d4a4de7529ce07a0c86fcf6beb06f317f359d89b
2021-04-21 10:59:24 -07:00
Sam Estep
75024e228c Add lint for unqualified type: ignore (#56290)
Summary:
The other half of https://github.com/pytorch/pytorch/issues/56272.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56290

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI runs (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2384511062
- https://github.com/pytorch/pytorch/actions/runs/765036024

Reviewed By: seemethere

Differential Revision: D27867219

Pulled By: samestep

fbshipit-source-id: e648f07b6822867e70833e23ddafe7fb7eaca235
2021-04-21 08:07:23 -07:00
Rohan Varma
ce05b7a324 [c10d] Remove deprecated use of torch.LongTensor, torch.ByteTensor (#55861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55861

APIs such as torch.LongTensor and torch.ByteTensor are deprecated and
the recommended API is torch.tensor(args, dtype=...). Use this API in
distributed_c10d.
ghstack-source-id: 126777875

Test Plan: CI

Reviewed By: pbelevich

Differential Revision: D27726600

fbshipit-source-id: 07eb8168d93697593589002c93c3903ce29431ef
2021-04-18 14:12:02 -07:00
Rohan Varma
bbc4c775bb [reland][c10d] monitored_barrier: ensure all ranks pass or none do (#55990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55990

Reland of https://github.com/pytorch/pytorch/pull/55197, which fails windows test that was only run on master.

Disabled these tests for windows, similar to they are disabled on MacOS. The reason for disabling as that they use libuv transport which does not have as robust error handling as tcp on linux. The result is that non-zero ranks that were healthy don't throw immediately (like they do on linux) but they throw on timeout. The error handling still occurs as expected on rank 0 for all platforms.
ghstack-source-id: 126478371

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27758424

fbshipit-source-id: d30841c8dda77f51b09a58161e638657ef758e63
2021-04-14 12:26:54 -07:00
Rohan Varma
48c73d24b8 Revert D27523060: [c10d] monitored_barrier: ensure all ranks pass or none do
Test Plan: revert-hammer

Differential Revision:
D27523060 (a5290adea5)

Original commit changeset: fa05e4f8ad8a

fbshipit-source-id: aa59c1c3ab0ed5b124583a52aed0f93c3b93a05a
2021-04-13 21:33:09 -07:00
Rohan Varma
a5290adea5 [c10d] monitored_barrier: ensure all ranks pass or none do (#55197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55197

From initial user feedback, one unexpected difference between monitored_barrier impl and barrier is the "all or nothing" semantics.

In barrier, all ranks pass or they all fail. With monitored barrier however, if rank 1 is healthy, it will respond to both send and recv from rank 0, but rank 0 can later fail because rank 2 is stuck. In this case, rank 1 will move forward out of the barrier.

This change makes it so that if a rank fails in monitored barrier, all other ranks in monitored barrier will also fail. It does so by the following process, similar to acknowledgements:

Nonzero ranks call send()
Nonzero ranks call recv()

Rank 0 calls recv(), if this succeeds, rank 0 has acknowledged rank N as healthy
Once all ranks are acknowledged as healthy:
Rank 0 calls send() to all nonzero ranks to unblock them

Modified unittests to ensure the all or nothing failure behavior
ghstack-source-id: 126413088

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27523060

fbshipit-source-id: fa05e4f8ad8ae97fd6cb20da5c3a7ef76fd31de6
2021-04-13 19:01:25 -07:00
Rohan Varma
19f15317a0 [BE][Docs] Improve dist.new_group doc (#55660)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55660

Noticed this doc was missing clarification on nccl env vars that
init_process_group docs have. Also, specify default behavior when backend=None
is passed in.
ghstack-source-id: 126251116

Test Plan: Ci

Reviewed By: SciPioneer

Differential Revision: D27672208

fbshipit-source-id: 2e79d297174e135173bceb059450ea267367bde4
2021-04-11 16:16:18 -07:00
Szymon Migacz
8e78a1b084 [Resubmit] Fix for incorrect usage of logging in torch/distributed/distributed_c10d.py (#52757)
Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/51739
Fixes https://github.com/pytorch/pytorch/issues/51428

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52757

Reviewed By: cbalioglu

Differential Revision: D26646843

fbshipit-source-id: df4962ef86ea465307e39878860b9fbbcc958d52
2021-04-06 11:32:26 -07:00
Rohan Varma
19a0eb4cdb [c10d] Monitored barrier: option to collect all failed ranks (#55010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55010

Follow up change to add a flag to provide an option for monitored barrier to collect all the failed ranks and then throw instead of just throwing on the first one. This is useful as now monitored barrier will be able to pick up on all hanging ranks instead of just one.

This is done by passing in a flag `wait_all_ranks=True`.
ghstack-source-id: 125699839

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27447787

fbshipit-source-id: ec23aee212060d9eb515ff8adc96c6a17822d1bb
2021-04-04 21:39:54 -07:00
Rohan Varma
d185719455 Expose dist.monitored_barrier() API (#53787)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53787

Per title, exposes a python-based monitored barrier API that we can use as part of debugability and may be useful for user applications.
ghstack-source-id: 125124315

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26965127

fbshipit-source-id: 6c7826e63758462e3e5111f28cced54cba76a758
2021-03-29 14:15:37 -07:00
Jeff Yang
0435059ddf docs: fix docstring signature in all_reduce_multigpu (#54665)
Summary:
fixes https://github.com/pytorch/pytorch/issues/43500

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54665

Reviewed By: ezyang

Differential Revision: D27340481

Pulled By: rohan-varma

fbshipit-source-id: d53c36b41dd26c7a791d3674a5b4b67daaadae13
2021-03-26 11:08:32 -07:00
Wanchao Liang
133000fe7a [distributed] add processgroup options as argument (#53663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53663

This add the processgroup option as an optional argument to new_group
and init_processgroup, this allows user to pass in a initialized
processgroup option for gloo and nccl.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D26968857

Pulled By: wanchaol

fbshipit-source-id: 2ff73a009120b85e83ecde7c69956b731902abc2
2021-03-18 01:04:17 -07:00
Michael Suo
87b6702833 [distributed] make the pickler in distributed_c10d pluggable (#53060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53060

As title. We would like to use alternative pickler/unpickler
implementations, to make it possible to send objects over the wire that
are coming from a torch.package

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D26737317

Pulled By: suo

fbshipit-source-id: 6bdef9824e48ef657dcad72cc5a9114e6612ea4a
2021-03-01 21:37:48 -08:00
Howard Huang
b56f59ea20 Revert D26599390: [pytorch][PR] Fix for incorrect usage of logging in torch/distributed/distributed_c10d.py
Test Plan: revert-hammer

Differential Revision:
D26599390 (075bbe0d6a)

Original commit changeset: d822658076f7

fbshipit-source-id: 6c4421f4de99794ea66780175af549cef9410a20
2021-02-24 05:38:34 -08:00
Szymon Migacz
075bbe0d6a Fix for incorrect usage of logging in torch/distributed/distributed_c10d.py (#51739)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51428

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51739

Reviewed By: bdhirsh

Differential Revision: D26599390

fbshipit-source-id: d822658076f7b08ebfde3dc9994159539490fda0
2021-02-23 22:30:37 -08:00
Rohan Varma
c255628134 [Collective APIs] Make python object collective API args consistent (#50625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50625

Make API signatures consistent and provide default argument similar to
the tensor collectives.
ghstack-source-id: 120718121

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D25932012

fbshipit-source-id: d16267e236a65ac9d55e19e2178f9d9267b08a20
2021-01-30 19:47:16 -08:00
Pritam Damania
16e5af41da Fix store based barrier to only use 'add'. (#49930)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49930

Certain store implementations don't work well when we use get() and
add() on the same key. To avoid this issue, we only use add() in the store
based barrier. The buggy store implementations can't be properly fixed due to
legacy reasons.

Test Plan:
1) unit tests.
2) waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D25725386

fbshipit-source-id: 1535e2629914de7f78847b730f8764f92cde67e7
2021-01-05 12:46:24 -08:00
Jagadish Krishnamoorthy
c115957df0 [distributed] Provide parameter to pass GPU ID in barrier function (#49069)
Summary:
For a multi GPU node, rank and corresponding GPU mapping can be different.
Provide optional parameter to specify the GPU device number for the
allreduce operation in barrier function.

Add test cases to validate barrier device_ids.

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Fixes https://github.com/pytorch/pytorch/issues/48110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49069

Reviewed By: mrshenli

Differential Revision: D25658528

Pulled By: rohan-varma

fbshipit-source-id: 418198b6224c8c1fd95993b80c072a8ff8f02eec
2021-01-05 11:27:54 -08:00