Commit Graph

393 Commits

Author SHA1 Message Date
Edward Z. Yang
3e76a0e9c2 Install an excepthook which annotates exceptions with rank information when distributed is initialized (#118190)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118190
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2024-01-25 20:43:18 +00:00
Shengbao Zheng
9c1348feb3 [pytorch][kineto] log process group config in distributed info (#117774)
Summary: Process Group config is essential to analyze collective pattern. We have added this in Execution Trace. Now expose this information in Kineto as well

Test Plan: Tested in HPC

Differential Revision: D52882292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117774
Approved by: https://github.com/wconstab, https://github.com/aaronenyeshi
2024-01-25 00:08:10 +00:00
Ke Wen
1e185c7803 [c10d] Barrier uses stream sync instead of device sync (#117804)
Resubmitting #96785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117804
Approved by: https://github.com/wconstab
2024-01-24 18:42:14 +00:00
PyTorch MergeBot
b5799d9977 Revert "[c10d] Barrier uses stream sync instead of device sync (#117804)"
This reverts commit 0f6bbb1c07.

Reverted https://github.com/pytorch/pytorch/pull/117804 on behalf of https://github.com/clee2000 due to sorry the docs test failure is real, I think it wants the lines after the .. note to be indented https://github.com/pytorch/pytorch/actions/runs/7616827874/job/20745016788.  Marking as nosignal due to bad Dr. CI categorization ([comment](https://github.com/pytorch/pytorch/pull/117804#issuecomment-1904882487))
2024-01-22 21:54:03 +00:00
Ke Wen
0f6bbb1c07 [c10d] Barrier uses stream sync instead of device sync (#117804)
Resubmitting #96785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117804
Approved by: https://github.com/wconstab
2024-01-22 20:14:51 +00:00
Ke Wen
6d96beb6be [c10d] Remove health check (#117699)
https://github.com/pytorch/pytorch/pull/114916 and https://github.com/pytorch/pytorch/pull/116222 added support for eager NCCL comm init (performed as soon as `init_process_group` is called).

If any user cares about the time difference and want to see NCCL init errors early, they can use eager init now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117699
Approved by: https://github.com/wconstab
2024-01-18 02:14:49 +00:00
FFFrog
7b0926cc3e Fix wrong class inheritance in pyi (#116404)
As the title stated.

f6dfbffb3b/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L153)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116404
Approved by: https://github.com/ezyang, https://github.com/wconstab
2024-01-12 21:25:29 +00:00
Chip Turner
9693b3740b [easy] [c10d] Add documentation for the device_id parameter for init_process_group (#116222)
Follow-up to add missing docs for #114916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116222
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
2024-01-03 19:32:18 +00:00
Aaron Gokaslan
bbe3261dd3 [BE]: Use iterable.chain.from_iterable where possible (#116376)
This is more readable and more efficient when dealing with lots of sequences to chain together.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116376
Approved by: https://github.com/albanD
2023-12-27 19:20:07 +00:00
fduwjj
f6dfbffb3b [c10d] Add hashing as a debug feature for before and after NCCL collective call (#113238)
For now, we use `TORCH_DISTRIBUTED_DEBUG = DETAIL` to turn a debug feature which calculate the hashing for input tensors and output results of c10d collective in NCCL.  This is a debugging feature so that we can rule out the bug from c10d level.

<img width="840" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/cdc70b0b-ae3c-4efd-86ff-adc5c5ba505f">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113238
Approved by: https://github.com/wconstab, https://github.com/fegin
2023-12-25 22:25:38 +00:00
Lucas Pasqualin
8452f41305 Adds allreduce to inductor remap (#115950)
Fixes #115728

Implements a rewrite path for allreduce

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115950
Approved by: https://github.com/wconstab
2023-12-18 22:00:22 +00:00
Chien-Chin Huang
8c57fde21f Let all_reduce_coalesced accept one tensor as well (#115650)
This diff introduces a change to the `all_reduce_coalesced` function in `distributed_c10d.py`. The function now accepts a single tensor as well as a list of tensors. This allows for more flexibility in the use of the function.

This is just a syntax sugar for the compiler to use `all_reduce_coalesced` without worrying  about converting the input to a list.

Differential Revision: [D51433236](https://our.internmc.facebook.com/intern/diff/D51433236/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115650
Approved by: https://github.com/wconstab
ghstack dependencies: #115523, #115302, #115648, #115649
2023-12-13 21:32:01 +00:00
Pavan Balaji
afa62d6237 [nccl-pg] Pass group global rank information to NCCL PG (#114736)
We were only passing a subset of the group creation information to the
NCCL PG.  We are specifically missing the information on which global
ranks belong to a particular PG.

This allows the NCCL PG to use this additional information for things
like better trace logging.

Test Plan:

OSS CI

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114736
Approved by: https://github.com/kwen2501
2023-12-13 18:02:51 +00:00
fduwjj
40ce9a4cfb [c10d] Create a python c10d API _set_pg_timeout to set timeout (#115453)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115453
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2023-12-12 20:52:43 +00:00
Howard Huang
99f06c0cc2 [BE] update errors to be more descriptive (#115443)
we call `_check_single_tensor` and `_check_tensor_list` as validation but don't print out the param types that were invalid

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115443
Approved by: https://github.com/XilunWu
2023-12-11 21:21:10 +00:00
Chip Turner
937d616e82 Re-enable type checking for distributed_c10d.py (#115223)
Re-enable type checking for distributed_c10d.py

Type checking for distributed_c10d.py was inadvertently turned off in issues that have accumulated since.

Note: the backwards compatibility linter does not like some of these changes.  But they were incorrect before.  This needs human verification, however.

#suppress-api-compatibility-check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115223
Approved by: https://github.com/wconstab
2023-12-09 11:07:54 +00:00
Chip Turner
78b945484b [c10d] Extend NCCL communicator splitting to more use cases (#114916)
Previously we could only use `ncclCommSplit` when we knew all backends were connected on all shards (due to the need to perform a NOCOLOR split), which in practice meant we could only use it for subgroups that were copies of the entire world.

This change allows for specifying a bound device id to `init_process_group` which tells the pg and its backends that the specified device, and the specified device only, will be associated with this rank.

This guarantee lets us do an early connect (which we could not previously do due to how ProcessGroupNCCL infers devices based on tensors and not the rank number).  And by doing the early connect, we have the guarantee ranks are connected and can perform nocolor splits when needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114916
Approved by: https://github.com/kwen2501
2023-12-07 15:13:01 +00:00
Chip Turner
9cc040fef6 Switch env variable use in test harnesses to the non-deprecated names to fix warnings (#114880)
Previously:

```
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
```

With this PR, those warnings disappear.  They were introduced in #114077

This change was generated with this sed script, applied with `sed -i -f /tmp/x **/*.{py,hpp,cpp,cc}` and hand inspected.

```
s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g
s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g
s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g
s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g
s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g
s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114880
Approved by: https://github.com/kwen2501
2023-12-01 20:08:23 +00:00
Chip Turner
066e072524 Retry #112889 (Opportunistically use ncclCommSplit when creating new NCCL groups) (#114385)
- [c10d] (retry) Opportunistically use `ncclCommSplit` when creating new NCCL groups (#112889)
- Guard use of `split_from` with a `hasattr` check for cases when NCCL (or RCCL) lacks `ncclCommSplit`

Fixes cause of revert of original PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114385
Approved by: https://github.com/huydhn
2023-11-23 07:00:00 +00:00
PyTorch MergeBot
b927a4e2ca Revert "Opportunistically use ncclCommSplit when creating new NCCL groups (#112889)"
This reverts commit 64a5372e6c.

Reverted https://github.com/pytorch/pytorch/pull/112889 on behalf of https://github.com/huydhn due to Sorry for reverting you change, but it is failing ROCm distributed jobs in trunk 4d07428ede ([comment](https://github.com/pytorch/pytorch/pull/112889#issuecomment-1823214376))
2023-11-22 17:43:51 +00:00
Chip Turner
64a5372e6c Opportunistically use ncclCommSplit when creating new NCCL groups (#112889)
Currently `ncclCommInitRankConfig` is always used when creating new
communicator groups.  This is wasteful as it creates non-shared pairs
of endpoint queues as well as costs time to re-establish
communication.

This change is transparent and opportunistic; when `dist.new_group` is
called, it will use the existing, healthy world process group to
select the right ranks to include in the process group.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112889
Approved by: https://github.com/kwen2501
2023-11-21 21:03:52 +00:00
Ke Wen
dc65f6c601 [c10d] Remove deprecated multi-gpu-per-thread APIs (#114156)
As of today, PyTorch Distributed's preferred programming model is one device per thread, as exemplified by the APIs in its document.  The multi-GPU functions (which stand for multiple GPUs per CPU thread) have been deprecated for three versions. Removing them now before 2.2 release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114156
Approved by: https://github.com/albanD, https://github.com/fduwjj, https://github.com/H-Huang
2023-11-21 03:50:23 +00:00
Shengbao Zheng
e53da90fe6 [Execution Trace] record global rank in pg_config_info (#113316)
Summary:
pg_config_info is used to dump pg information in Execution Trace(ET). For trace analysis purpose and PARAM replay benchmark, global rank is more meaningful than group ranks.

p.s. ranks is a map of global rank: group rank

Test Plan: Tested in HPC

Differential Revision: D51136587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113316
Approved by: https://github.com/XilunWu
2023-11-09 20:04:43 +00:00
Ke Wen
bb7ac12cbf [ProcessGroupNCCL] Avoid recording stream for broadcast and scatter (#112896)
Summary: Follows PR #111431, save memory for DTensor init

Test Plan: Sandcastle

Reviewed By: wanchaol

Differential Revision: D50985365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112896
Approved by: https://github.com/wanchaol
2023-11-07 15:44:04 +00:00
Will Constable
ff51f94e32 [Reland] Fix default timeouts for python entrypoints (e.g. init_process_group) (#113094)
Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.

This PR ensures that NCCL pg always default to the changed NCCL-specific
timeout value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113094
Approved by: https://github.com/fduwjj
2023-11-07 05:34:26 +00:00
PyTorch MergeBot
75adb9f371 Revert "Fix default timeouts for python entrypoints (e.g. init_process_group) (#112893)"
This reverts commit f9d47e1381.

Reverted https://github.com/pytorch/pytorch/pull/112893 on behalf of https://github.com/clee2000 due to sorry this seems to have broken inductor f9d47e1381 https://github.com/pytorch/pytorch/actions/runs/6776367936/job/18418174752 ([comment](https://github.com/pytorch/pytorch/pull/112893#issuecomment-1796979811))
2023-11-06 22:49:53 +00:00
Will Constable
f9d47e1381 Fix default timeouts for python entrypoints (e.g. init_process_group) (#112893)
Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.

This PR ensures that NCCL pg always default to the changed NCCL-specific
timeout value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112893
Approved by: https://github.com/xw285cornell, https://github.com/kwen2501, https://github.com/XilunWu
ghstack dependencies: #112611, #112803
2023-11-06 20:48:39 +00:00
Sahdev Zala
c6ecd018d5 Fix docstring errors (#112693)
This PR reduces docstring erros to 0 from total 128. This can be verified by running, pydocstyle path-to-distributed_c10d.py --count

Where, path-to-distributed_c10d.py is `torch/distributed/distributed_c10d.py`

BEFORE the PR:
`pydocstyle torch/distributed/distributed_c10d.py --count`
128
AFTER the PR:
`pydocstyle torch/distributed/distributed_c10d.py --count`
0

Fixes #112640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112693
Approved by: https://github.com/H-Huang
2023-11-06 18:45:05 +00:00
Will Constable
65b74c9254 Make init_process_group timeout kwarg override pg_options (#112611)
This used to be ambiguous but the pg_options._timeout value, if passed
in, is being ignored.  Make it sane and warn if 2 values are provided.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112611
Approved by: https://github.com/H-Huang
2023-11-03 23:13:03 +00:00
Aaron Gokaslan
cb856b08b2 [BE]: Attach cause to some exceptions and enable RUFF TRY200 (#111496)
Did some easy fixes from enabling TRY200. Most of these seem like oversights instead of intentional. The proper way to silence intentional errors is with `from None` to note that you thought about whether it should contain the cause and decided against it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111496
Approved by: https://github.com/malfet
2023-10-19 21:56:36 +00:00
Shengbao Zheng
8899abde32 [PyTorch][ET] Improve Process Groups Mapping Info Collection (#110908)
Summary:
Process Groups Mapping info collection was introduced in D46321690.

Improve the mapping info collected there:
- replace pg_id (a unique ID for the PG object) with pg_names (a unique name for each pg and shared by all ranks)
- add number of pg info with group_count
- reduce the length of pg_config_info to avoid being truncated(max length of 4096, now doubled ) by
  - migrating ranks(a map from global ranks to group ranks) with the list of global ranks of a pg, since we currently don't use group rank id
  - using an empty rank list to indicate that all ranks are involved in a pg and adding a field of group_size to show how many ranks are involved

Test Plan:
Tested in HPC
```
buck2 run mode/opt //hpc/torchrec/models/ads:cmf_10x_launcher -- launcher=local data_loader=random data_loader.num_batches=100 checkpoint=model_store max_ind_range=10 launcher.num_trainers=8
```
Example output in ET
```
{
"name": "## process_group:init ##", "id": 3, "rf_id": 1, "parent": 2, "fw_parent": 0, "seq_id": -1, "scope": 7, "tid": 1, "fw_tid": 0, "op_schema": "",
      "inputs": ["[{\"pg_name\": \"0\", \"backend_id\": 140688385794048, \"backend_config\": \"cuda:nccl\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"1\", \"backend_id\": 140688386762752, \"backend_config\": \"cuda:nccl\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"2\", \"backend_id\": 140682531798720, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"faa29c0b1e06cd7abc873bd561414911_0\", \"backend_id\": 140672678002688, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"3\", \"backend_id\": 140672678007616, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}, {\"pg_name\": \"faa29c0b1e06cd7abc873bd561414911_1\", \"backend_id\": 140672678012544, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7}, \"group_count\": 4}]"], "input_shapes": [[]], "input_types": ["String"],
      "outputs": [], "output_shapes": [], "output_types": []
    },
```

Before the change, pg_config_info of >128 rank will be truncated, e.g.
```
"inputs": ["[{\"pg_id\": 140321146893696, \"backend_id\": 140321113854976, \"backend_config\": \"cuda:nccl\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7, \"8\": 8, \"9\": 9, \"10\": 10, \"11\": 11, \"12\": 12, \"13\": 13, \"14\": 14, \"15\": 15, \"16\": 16, \"17\": 17, \"18\": 18, \"19\": 19, \"20\": 20, \"21\": 21, \"22\": 22, \"23\": 23, \"24\": 24, \"25\": 25, \"26\": 26, \"27\": 27, \"28\": 28, \"29\": 29, \"30\": 30, \"31\": 31, \"32\": 32, \"33\": 33, \"34\": 34, \"35\": 35, \"36\": 36, \"37\": 37, \"38\": 38, \"39\": 39, \"40\": 40, \"41\": 41, \"42\": 42, \"43\": 43, \"44\": 44, \"45\": 45, \"46\": 46, \"47\": 47, \"48\": 48, \"49\": 49, \"50\": 50, \"51\": 51, \"52\": 52, \"53\": 53, \"54\": 54, \"55\": 55, \"56\": 56, \"57\": 57, \"58\": 58, \"59\": 59, \"60\": 60, \"61\": 61, \"62\": 62, \"63\": 63, \"64\": 64, \"65\": 65, \"66\": 66, \"67\": 67, \"68\": 68, \"69\": 69, \"70\": 70, \"71\": 71, \"72\": 72, \"73\": 73, \"74\": 74, \"75\": 75, \"76\": 76, \"77\": 77, \"78\": 78, \"79\": 79, \"80\": 80, \"81\": 81, \"82\": 82, \"83\": 83, \"84\": 84, \"85\": 85, \"86\": 86, \"87\": 87, \"88\": 88, \"89\": 89, \"90\": 90, \"91\": 91, \"92\": 92, \"93\": 93, \"94\": 94, \"95\": 95, \"96\": 96, \"97\": 97, \"98\": 98, \"99\": 99, \"100\": 100, \"101\": 101, \"102\": 102, \"103\": 103, \"104\": 104, \"105\": 105, \"106\": 106, \"107\": 107, \"108\": 108, \"109\": 109, \"110\": 110, \"111\": 111, \"112\": 112, \"113\": 113, \"114\": 114, \"115\": 115, \"116\": 116, \"117\": 117, \"118\": 118, \"119\": 119, \"120\": 120, \"121\": 121, \"122\": 122, \"123\": 123, \"124\": 124, \"125\": 125, \"126\": 126, \"127\": 127}}, {\"pg_id\": 140321074662400, \"backend_id\": 140321100033024, \"backend_config\": \"cuda:nccl\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7, \"8\": 8, \"9\": 9, \"10\": 10, \"11\": 11, \"12\": 12, \"13\": 13, \"14\": 14, \"15\": 15, \"16\": 16, \"17\": 17, \"18\": 18, \"19\": 19, \"20\": 20, \"21\": 21, \"22\": 22, \"23\": 23, \"24\": 24, \"25\": 25, \"26\": 26, \"27\": 27, \"28\": 28, \"29\": 29, \"30\": 30, \"31\": 31, \"32\": 32, \"33\": 33, \"34\": 34, \"35\": 35, \"36\": 36, \"37\": 37, \"38\": 38, \"39\": 39, \"40\": 40, \"41\": 41, \"42\": 42, \"43\": 43, \"44\": 44, \"45\": 45, \"46\": 46, \"47\": 47, \"48\": 48, \"49\": 49, \"50\": 50, \"51\": 51, \"52\": 52, \"53\": 53, \"54\": 54, \"55\": 55, \"56\": 56, \"57\": 57, \"58\": 58, \"59\": 59, \"60\": 60, \"61\": 61, \"62\": 62, \"63\": 63, \"64\": 64, \"65\": 65, \"66\": 66, \"67\": 67, \"68\": 68, \"69\": 69, \"70\": 70, \"71\": 71, \"72\": 72, \"73\": 73, \"74\": 74, \"75\": 75, \"76\": 76, \"77\": 77, \"78\": 78, \"79\": 79, \"80\": 80, \"81\": 81, \"82\": 82, \"83\": 83, \"84\": 84, \"85\": 85, \"86\": 86, \"87\": 87, \"88\": 88, \"89\": 89, \"90\": 90, \"91\": 91, \"92\": 92, \"93\": 93, \"94\": 94, \"95\": 95, \"96\": 96, \"97\": 97, \"98\": 98, \"99\": 99, \"100\": 100, \"101\": 101, \"102\": 102, \"103\": 103, \"104\": 104, \"105\": 105, \"106\": 106, \"107\": 107, \"108\": 108, \"109\": 109, \"110\": 110, \"111\": 111, \"112\": 112, \"113\": 113, \"114\": 114, \"115\": 115, \"116\": 116, \"117\": 117, \"118\": 118, \"119\": 119, \"120\": 120, \"121\": 121, \"122\": 122, \"123\": 123, \"124\": 124, \"125\": 125, \"126\": 126, \"127\": 127}}, {\"pg_id\": 140321154994304, \"backend_id\": 140319780290048, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": {\"0\": 0, \"1\": 1, \"2\": 2, \"3\": 3, \"4\": 4, \"5\": 5, \"6\": 6, \"7\": 7, \"8\": 8, \"9\": 9, \"10\": 10, \"11\": 11, \"12\": 12, \"13\": 13, \"14\": 14, \"15\": 15, \"16\": 16, \"17\": 17, \"18\": 18, \"19\": 19, \"20\": 20, \"21\": 21, \"22\": 22, \"23\": 23, \"24\": 24, \"25\": 25, \"26\": 26, \"27\": 27, \"28\": 28, \"29\": 29, \"30\": 30, \"31\": 31, \"32\": 32, \"33\": 33, \"34\": 34, \"35\": 35, \"36\": 36, \"37\": 37, \"38\": 38, \"39\": 39, \"40\": 40, \"41\": 41, \"42\": 42, \"43\": 43, \"44\": 44, \"45\": 45, \"46\": 46, \"47\": 47, \"48\": 48, \"49\": 49, \"50\": 50, \"51\": 51, \"52\": 52, \"53\": 53, \"54\": 54, \"55\": 55, \"56\": 56, \"57\": 57, \"58\": 58, \"59\": 59, \"60\": 60, \"61\": 61, \"62\": 62, \"63\": 63, \"64\": 64, \"65\": 65, \"66\": 66, \"67\": 67, \"68\": 68, \"69\": 69, \"70\": 70, \"71\": 71, \"72\": 72, \"73\": 73, \"74\": 74, \"75\": 75, \"76\": 76, \"77\": 77, \"78\": 78, \"79\": 79, \"80\": 80, \"81\": 81, \"82\": 82, \"83\": 83, \"84\": 84, \"85\": 85, \"86\": 86, \"87\": 87, \"88\": 88, \"89\": 89, \"90\": 90, \"91\": 91, \"92\": 92, \"93\": 93, \"94\": 94, \"95\": 95, \"96\": 96, \"97\": 97, \"98\": 98, \"99\": 99, \"100\": 100, \"101\": 101, \"102\": 102, \"103\": 103, \"104\": 104, \"105\": 105, \"106\": 106, \"107\": 107, \"108\": 108, \"109\": 109, \"110\": 110, \"111\": 111, \"112\": 112, \"113\": 113, \"114\""], "input_shapes": [[]], "input_types": ["String"],

```
After the change the length reduced
```
"inputs": ["[{\"pg_name\": \"0\", \"backend_id\": 140551405059072, \"backend_config\": \"cuda:nccl\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"1\", \"backend_id\": 140551399745536, \"backend_config\": \"cuda:nccl\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"2\", \"backend_id\": 140578999821184, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"ea2f9024c70c8b9a25bc06a4723e5805_0\", \"backend_id\": 140559197777152, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"3\", \"backend_id\": 140549119076736, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}, {\"pg_name\": \"ea2f9024c70c8b9a25bc06a4723e5805_1\", \"backend_id\": 140571995143424, \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": [], \"group_size\": 128, \"group_count\": 4}]"], "input_shapes": [[]], "input_types": ["String"],
```

Reviewed By: louisfeng, fduwjj

Differential Revision: D50048147

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110908
Approved by: https://github.com/fduwjj
2023-10-19 21:37:19 +00:00
Ke Wen
18cc8a92ac [ProcessGroupNCCL] Avoid recording stream for synchronous ops (#111431)
For synchronous ops (i.e. `asyncOp = False`), we don't want to record streams because we know that the NCCL stream will join back to the "current" stream right after this op. So we might just as well keep the stream ownership of the input/output tensors unchanged. The benefit would be that the allocation/free of the tensors would look deterministic to the "current" stream so that the caching allocator can reuse memory pool for this stream in a clever way.

To prevent the input/output tensors from being recycled by python, we rely on the stashing mechanism in ProcessGroupNCCL (which can be also turned on by setting `TORCH_NCCL_AVOID_RECORD_STREAMS=1`).

This mechanism change is for libraries like FSDP which uses `all_gather_into_tensor` and `reduce_scatter_tensor` in a synchronous way and which cannot set `TORCH_NCCL_AVOID_RECORD_STREAMS=1` for their users. And therefore, this change is limited to these two collectives for now.

Cc: @awgu @janeyx99 @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111431
Approved by: https://github.com/H-Huang
2023-10-19 00:41:09 +00:00
PyTorch MergeBot
1e70f4d02c Revert "Reland #2 "[C10] PG observability hooks. (#108815, #110907)" (#111072)"
This reverts commit bb1424d46e.

Reverted https://github.com/pytorch/pytorch/pull/111072 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111072#issuecomment-1765399829))
2023-10-16 23:03:26 +00:00
Will Constable
bb1424d46e Reland #2 "[C10] PG observability hooks. (#108815, #110907)" (#111072)
This reverts commit 314a502eb0.

Changes since original PR:
Reland 1
 *  rename torch.distributed.hooks to torch.distributed._hooks

Reland 2
 * make _hooks importable even if !distributed.is_available()
 * handle cuda driver exit intermittent failure caused by new cuda api usage in callback caller (see prev PR in stack)

(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)

Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.

The design is similar to NCCL desync debug that it minimized the
overhead by doing most of the work out of the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and call them inline from the
the PG creation code. This is fine since this happens during initialization and a limited number of times.

The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches over.

Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown
and have it as background thread. This is not possible with more reasonable choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111072
Approved by: https://github.com/malfet
ghstack dependencies: #111061
2023-10-12 16:59:23 +00:00
PyTorch MergeBot
314a502eb0 Revert "Reland "[C10] PG observability hooks. (#108815)" (#110907)"
This reverts commit 7678cd22af.

Reverted https://github.com/pytorch/pytorch/pull/110907 on behalf of https://github.com/huydhn due to Sorry for reverting this, but macos job in trunk starts failing after this 7678cd22af ([comment](https://github.com/pytorch/pytorch/pull/110907#issuecomment-1756497387))
2023-10-11 00:23:42 +00:00
Will Constable
7678cd22af Reland "[C10] PG observability hooks. (#108815)" (#110907)
This reverts commit ff0358b038.

(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)

Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.

The design is similar to NCCL desync debug that it minimized the
overhead by doing most of the work out of the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and call them inline from the
the PG creation code. This is fine since this happens during initialization and a limited number of times.

The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches over.

Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown
and have it as background thread. This is not possible with more reasonable choices like a condvar.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907
Approved by: https://github.com/fduwjj
2023-10-10 20:09:40 +00:00
Edward Z. Yang
de3ae93e9b Include rank of default PG in C++ log messages (#110623)
I tested by adding some warning logs in C++, run a distributed program and show that they now had `[rank0]:` in the messages. There is no existing test infra for C++ logging so I couldn't easily add a unit test.

The implementation strategy is to setup a global variable in C++, and then poke it when we initialize a process group. This was the simplest thing I could think of that would work.

This PR only works for non-glog logging. Probably need to come up with some other strategy for glog, e.g., a custom prefix, but need to make sure this doesn't conflict with fbcode. I can't easily test this from OSS, will leave as follow up work.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110623
Approved by: https://github.com/voznesenskym, https://github.com/wanchaol, https://github.com/fduwjj
2023-10-10 00:26:52 +00:00
Kazuaki Ishizaki
b5f9696d81 Fix typo under torch directory (#110824)
This PR fixes typo `the the` of comments and exception messages in files under `torch` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110824
Approved by: https://github.com/H-Huang
2023-10-09 19:16:43 +00:00
PyTorch MergeBot
ff0358b038 Revert "[C10] PG observability hooks. (#108815)"
This reverts commit 0c7a877745.

Reverted https://github.com/pytorch/pytorch/pull/108815 on behalf of https://github.com/albanD due to Add a new torch.distributed.hooks namespace but does not document it, test was added this morning ([comment](https://github.com/pytorch/pytorch/pull/108815#issuecomment-1751327751))
2023-10-06 19:49:49 +00:00
Rodrigo Kumpera
0c7a877745 [C10] PG observability hooks. (#108815)
Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.

The design is similar to NCCL desync debug that it minimized the
overhead by doing most of the work out of the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and call them inline from the
the PG creation code. This is fine since this happens during initialization and a limited number of times.

The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches over.

Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown
and have it as background thread. This is not possible with more reasonable choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2023-10-06 18:52:46 +00:00
Howard Huang
0949d97c16 fix batch_isend_irecv example incorrect usage (#110408)
mismatched dtypes silently leads to wrong outputs in nccl

```
1:recv_tensor=tensor([0., 0.], device='cuda:1')
0:recv_tensor=tensor([2.8026e-45, 0.0000e+00], device='cuda:0')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110408
Approved by: https://github.com/awgu, https://github.com/Neilblaze
2023-10-04 22:57:03 +00:00
Rohan Varma
40be6b72e1 [ez] Type function in distributed_c10d (#110435)
This function returns a `torch.device`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110435
Approved by: https://github.com/awgu
2023-10-03 17:54:04 +00:00
Rodrigo Kumpera
c26270c733 [C10D] Even more store scalability work. (#109218)
Fix a bug socket.cpp in timeout detection that only shows up with 10k ranks.

Make the minimum wait time in _store_based_barrier to be adaptative based on
the number of ranks.

Longer timeouts give more room for the store to do productive work when swamped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109218
Approved by: https://github.com/XilunWu
ghstack dependencies: #109217
2023-09-22 21:27:09 +00:00
Howard Huang
600d0d0284 Add "cuda" to MPI backend capabilities (#109614)
Summary: Fixes https://github.com/pytorch/pytorch/issues/109543

Test Plan: We need to run CUDA aware MPI in PyTorch to actually test this change, we currently have no MPI tests.

Differential Revision: D49420438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109614
Approved by: https://github.com/XilunWu
2023-09-21 13:34:58 +00:00
Rodrigo Kumpera
881bfbf21d [c10d] Add tests for usig libuv through init_process_group. (#108661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108661
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2023-09-20 16:02:20 +00:00
Rodrigo Kumpera
2bca5f2af7 [C10D] Track pg name in c++. (#108813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108813
Approved by: https://github.com/wconstab
2023-09-15 01:10:29 +00:00
Brian Vaughan
bb14805bcd fix an incorrect indent in documentation (#108273)
doc for `torch.distributed.send(tensor, dst, group=None, tag=0)` was rendering incorrectly here: https://pytorch.org/docs/stable/distributed.html due to lack of indent (it was interpreting the continuation as a new argument).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108273
Approved by: https://github.com/awgu, https://github.com/kit1980
2023-09-11 21:27:52 +00:00
Pritam Damania
704b0b3c67 [RESUBMIT] Standardize on error types for distributed errors. (#108191)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
2023-08-30 21:47:39 +00:00
PyTorch MergeBot
d4ff06ec84 Revert "Standardize on error types for distributed errors. (#107651)"
This reverts commit 0e2317479b.

Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor test in trunk for one of its model moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))
2023-08-28 23:58:33 +00:00
Pritam Damania
0e2317479b Standardize on error types for distributed errors. (#107651)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
2023-08-28 21:58:15 +00:00
wz337
264df88a2d [C10D][Logger]Add more info to c10d logger (#107331)
This PR adds pg_name and world_size to c10d logging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107331
Approved by: https://github.com/kumpera
2023-08-28 15:10:56 +00:00
Codle
42738c56a0 Skip the extra copy operation in broadcast_object_list if tensor_list has only one element (#107509)
The `broadcast_object_list` function can easily broadcast the state_dict of models/optimizers. However, the `torch.cat` operation performed within `broadcast_object_list` consumes an additional double amount of memory space. This means that only objects with a maximum memory occupancy of half the device capacity can be broadcasted. This PR improves usability by skipping the `torch.cat` operation on object_lists with only a single element.

Before (30G tensor):
<img width="607" alt="image" src="https://github.com/pytorch/pytorch/assets/22362311/c0c67931-0851-4f27-81c1-0119c6cd2944">

After (46G tensor):
<img width="600" alt="image" src="https://github.com/pytorch/pytorch/assets/22362311/90cd1536-be7c-43f4-82ef-257234afcfa5">

Test Code:
```python
if __name__ == "__main__":
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    fake_tensor = torch.randn(30 * 1024 * 1024 * 1024 // 4)

    if dist.get_rank() == 0:
        state_dict = {"fake_tensor": fake_tensor}
    else:
        state_dict = {}
    object_list = [state_dict]
    dist.broadcast_object_list(object_list, src=0)
    print("Rank: ", dist.get_rank(), " Broadcasted Object: ", object_list[0].keys())
    dist.barrier()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107509
Approved by: https://github.com/awgu
2023-08-23 17:19:10 +00:00
Aaron Gokaslan
660e8060ad [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings.

I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-22 23:16:38 +00:00
PyTorch MergeBot
d59a6864fb Revert "[BE]: Update ruff to 0.285 (#107519)"
This reverts commit 88ab3e4322.

Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please hep them get unblocked? It seems like one of the strings was prob accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))
2023-08-22 19:53:32 +00:00
Aaron Gokaslan
88ab3e4322 [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings.

I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-20 01:36:18 +00:00
Rodrigo Kumpera
bbf03561a9 [functional collectives] Move back to registering finalizers on wrappers. (#107250)
We cannot use inner tensors for finalizers as they are uncollective until waited.

This PR adds a bunch of tests for the observable behavior we want, including the
necessary scafold for us to test code for their waitiness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107250
Approved by: https://github.com/wconstab
2023-08-17 21:08:28 +00:00
Shen Li
45128ab67c [Reland] Add OnCompletion Hook to ProcessGroup (#106988) (#107233)
This allows infra/trainers to get detailed stats about communication
efficiencies without know anything about what model or distributed
training paradigms have been used. This is helpful as infra/trainer
package usually prefers to be as model/algorithm agnostic as possible.
Therefore, we cannot assume that infra/trainer can have access to all
collectives used by the model authors.

This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which
will be fired on every work completion event.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107233
Approved by: https://github.com/kumpera
2023-08-15 17:35:14 +00:00
PyTorch MergeBot
fd214aa8be Revert "Add OnCompletion Hook to ProcessGroup (#106988)"
This reverts commit ba1da47e8f.

Reverted https://github.com/pytorch/pytorch/pull/106988 on behalf of https://github.com/huydhn due to Sorry for reverting you change, but it is failing Windows build with some linker error.  The Windows failures on PR looks legit ([comment](https://github.com/pytorch/pytorch/pull/106988#issuecomment-1678580899))
2023-08-15 08:24:33 +00:00
Shen Li
ba1da47e8f Add OnCompletion Hook to ProcessGroup (#106988)
This allows infra/trainers to get detailed stats about communication
efficiencies without know anything about what model or distributed
training paradigms have been used. This is helpful as infra/trainer
package usually prefers to be as model/algorithm agnostic as possible.
Therefore, we cannot assume that infra/trainer can have access to all
collectives used by the model authors.

This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which
will be fired on every work completion event.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106988
Approved by: https://github.com/kumpera, https://github.com/H-Huang
ghstack dependencies: #107140, #107141, #107160
2023-08-15 04:32:23 +00:00
Bruce Jiang
2624da638d Support third-party devices to use the init_process_group method with… (#107113)
…out specifying the Backend

When init_process_group is not been done before, it will automatically apply  init_process_group within Devicemesh without specifying the backend. Thus, when a third-party device want to use Devicemesh without doing init_process_group before, there comes a problem. In this PR, add a default_device_backend_map for third-party device users to add their backends to this map when they register their backends to pytorch firstly. When doing init_process_group without parameter backend, it will init the backends in this map. Thus, a third-party user can use init_process_group method without specifying the Backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107113
Approved by: https://github.com/wanchaol
2023-08-15 03:46:07 +00:00
Jirka
858b465d74 fix str splits in single line (#106005)
Simple formating improvement and two spell fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106005
Approved by: https://github.com/H-Huang
2023-08-14 23:07:38 +00:00
Michael Voznesensky
42660015b4 [Dynamo x FSDP][2/x] Small changes to distributed to make it dynamo friendly (#106886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106886
Approved by: https://github.com/awgu, https://github.com/wconstab
ghstack dependencies: #106884
2023-08-11 22:35:50 +00:00
Louis Feng
3a01c056f5 [PyTorch][ET] Collect Process Groups Mapping Info (#104373)
Summary: Add the logics and interface to log ProcessGroup comms configuration (unique ID, type, and ranks info).

Test Plan:
Testing in HPC:
```
TORCH_LOGS=all ../buck-out/v2/gen/fbcode/c8344b52091f4f7f/hpc/models/ads/__ads_10x_launcher__/ads_10x_launcher.par  +launcher=local launcher.num_trainers=4 +data_loader=random data_loader.num_batches=2000
```
Example output in ET:
```
    {
      "name": "## process_group:init ##", "id": 3, "rf_id": 1, "parent": 2, "fw_parent": 0, "seq_id": -1, "scope": 7, "tid": 1, "fw_tid": 0, "op_schema": "",
      "inputs": ["[{'pg_id': 140538064364672, 'backend_id': 140538060772480, 'backend_config': 'cuda:nccl', 'ranks': {0: 0, 1: 1, 2: 2, 3: 3}}, {'pg_id': 140538064363904, 'backend_id': 140538042628864, 'backend_config': 'cuda:nccl', 'ranks': {0: 0, 1: 1, 2: 2, 3: 3}}]"], "input_shapes": [[]], "input_types": ["String"],
      "outputs": [], "output_shapes": [], "output_types": []
    },
```

Differential Revision: D46321690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104373
Approved by: https://github.com/kwen2501
2023-07-25 03:34:53 +00:00
Howard Huang
0ab74044c2 [BE] remove deprecated attributes from distributed_c10d (#105753)
Removing these attributes as they were introduced 5 years ago and before pytorch 1.0. `Backend` is the only support use now.

Differential Revision: [D47683717](https://our.internmc.facebook.com/intern/diff/D47683717)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105753
Approved by: https://github.com/rohan-varma
2023-07-24 16:35:08 +00:00
Justin Chu
232b96b6e2 [BE] Enable ruff's UP rules and autoformat distributed/ (#105433)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105433
Approved by: https://github.com/albanD
2023-07-19 14:27:11 +00:00
Ke Wen
22e8a61d9b Implement coalesced reduce_scatter_tensor (#103561)
Map of #101157.

This PR adds support for coalesced `reduce_scatter_tensor` calls in the following syntax:

Sync communication style:
```
with dist._coalescing_manager():
     for i in range(num_coll):
         dist.reduce_scatter_tensor(output_tensors[i], input_tensors[i])
```

Async communication style:
```
with dist._coalescing_manager(async_ops=True) as cm:
     for i in range(num_coll):
         dist.reduce_scatter_tensor(output_tensors[i], input_tensors[i])

# do a bunch of other things
cm.wait()
# do things that depend on the reduce-scatters' results
```
Each `reduce_scatter_tensor` call can be independent in terms of their data and buffer locations. But could be executed in parallel by supported backends (like NCCL).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103561
Approved by: https://github.com/fegin
2023-06-15 20:11:12 +00:00
zhuhong61
50c972bfd2 [c10d] Add xpu to the default device supported by user specified backend (#103410)
**Motivation:**
For collective dispatching, we want to provide a more user friendly usage for xpu device and CCL backend (user specified backend) mapping.

**Solution:**
We add xpu to the default device list, and it can construct the mapping between xpu and the user specified backend directly.
Usage:
When using xpu device, user can specify backend name only:
`dist.init_process_group(backend='ccl')`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103410
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-06-12 19:46:33 +00:00
Ke Wen
07104ca99c [c10d] Make it default that PG do not perform barrier after init (#103033)
Both internal and OSS users trying https://github.com/pytorch/pytorch/pull/99937 report that their workloads perform normally even with the barrier removed and see a scalability win. Thus in this PR, we decide to make it default that PG do not perform a barrier after init.

In the discussion of #99937, people point out that such barrier might be needed for c10d + RPC cases. IMO, this need originates from RPC's programming model and should be RPC or RPC user's responsibility to deal with. That is, with other functions/libraries, it can happen too. So the need for c10d to do so big a favor is not justified IMO. Also good to remove it before users become reliant on this barrier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103033
Approved by: https://github.com/XilunWu
2023-06-07 06:11:14 +00:00
Ashwin Hari
cf0aa38005 Allow ORT backend for DTensor (#101914)
fixes #101911

Currently, `DTensor` supports cuda and cpu. This PR makes some changes for easier integration with the ort backend.

* `Backend.NAME`  attribute now has value `name` instead of `NAME` for backends registered through `register_backend(name)`; this matches the pattern for backends with built-in support like nccl.
* remove unused `_check_for_nccl_backend` function
* add test case that moves parameters to device in the `partition_fn` - a scenario that's useful for big models
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101914
Approved by: https://github.com/wanchaol
2023-06-01 22:37:09 +00:00
shaoyf42
8d7e082300 [c10d] Add is_backend_available for c10d backend. (#101945)
Add is_backend_available for c10d backend, either the built-in backends or third-party backends through function ``Backend.register_backend``.

There is a related discussion in https://github.com/pytorch/pytorch/pull/101775#discussion_r1199253553
> For example in python constructor for their backend they should explicitly add the is_X_available. Or if defining in C++ they should modify pybind like this https://github.com/H-Huang/torch_collective_extension/blob/main/custom_backend/include/dummy.hpp#L98-L101
to also add their own is_available property

It is a natural choice for users to add their own `is_available` when they create a backend. We think it might be a possible way for the user to use `is_X_available` in the same way as the native, for example by dynamically adding`torch.distributed.is_dummpy_available()` function.  This is why we want to dynamically add the `is_X_available` to `torch.distributed` in `register_backend`.

> Or we could add an Is_available(backend) function, that checks for the backend.

Providing a public function is indeed another good approach. We have implemented an `is_backend_available` in https://github.com/pytorch/pytorch/pull/101945  that supports both built-in backends and third-party backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101945
Approved by: https://github.com/H-Huang
2023-05-31 22:51:51 +00:00
Wanchao Liang
3ef4d697df [c10d] default backend need to check for nccl availability (#102470)
As titled, we can only initialize nccl backend when NCCL is available
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102470
Approved by: https://github.com/Skylion007, https://github.com/XilunWu
2023-05-30 19:22:37 +00:00
Wanchao Liang
7b47cd0a6c [c10d] add fake pg necessary collectives (#102238)
This PR adds fake pg necessary collectives to enable e2e FSDP run
with out multiprocess or multithreading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102238
Approved by: https://github.com/ezyang
2023-05-25 05:01:16 +00:00
Wanchao Liang
9a19262556 [c10d] conslidate barrier after init logic (#102237)
This PR consolidates the barrier after init logic to allow custom
backend to set the env var when creating the pg, so that
`init_process_group` would skip barrier
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102237
Approved by: https://github.com/ezyang
2023-05-25 05:01:16 +00:00
Edward Z. Yang
c903b12cb8 Add fake process group (#102180)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102180
Approved by: https://github.com/wanchaol
2023-05-24 23:27:40 +00:00
Iris
ee95e37a69 [c10d] Record time spent for init_process_group, new_group, _store_based_barrier (#101912)
1. Record time spent for init_process_group, new_group, _store_based_barrier
2. Rename c10d_error_logger to c10d_logger for generalization.
3. Refactor to move logger wrappers in distributed_c10d.py to logger to c10d_logger.py.
4. Rename the logger wrappers (bc breaking). Exception_handler is renamed to exception_logger to avoid confusion with logging handler.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101912
Approved by: https://github.com/fduwjj
2023-05-24 09:36:34 +00:00
Aaron Gokaslan
3e2ea32dab [BE]: Enable ruff rule TRY302 and apply fixes (#101874)
Removes useless try statements and unreachable code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101874
Approved by: https://github.com/malfet
2023-05-19 17:30:52 +00:00
shaoyf42
97180aca5e Enables barrier to support the specified device (#99589)
Enables barrier to support the specified device, e.g cuda/custom device. There is some discussion here: https://github.com/pytorch/pytorch/issues/97938#issue-1646833919

Today, there are two limitations of barrier:
One is that barrier does not support custom  #device:
fbdb86c174/torch/csrc/distributed/c10d/ProcessGroup.hpp (L512-L522)

The second is that there is a special valid for nccl when device_id is not None, which is an assumption for cuda and nccl bindings, and also hinders custom device.
789070986c/torch/distributed/distributed_c10d.py (L3504-L3508)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99589
Approved by: https://github.com/kwen2501
2023-05-17 05:26:04 +00:00
Ke Wen
daed3bf8f9 Implement coalesced all_gather_into_tensor (#101157)
This PR adds support for the following use cases:
- Sync style:
```
with dist._coalescing_manager():
     for i in range(num_coll):
         dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])
```
- Async style:
```
with dist._coalescing_manager(async_ops=True) as cm:
     for i in range(num_coll):
         dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])

# do a bunch of other things
cm.wait()
# do things that depend on the all-gather's
```
Each `all_gather_into_tensor` would be independent in terms of data and their buffer location. But could be executed in parallel by supported backends (like NCCL).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101157
Approved by: https://github.com/kumpera, https://github.com/wanchaol
2023-05-11 20:58:47 +00:00
Ke Wen
0848ed21b8 [c10d] Figure out device to use for object collectives (#100954)
Fixes https://github.com/pytorch/pytorch/issues/97938

this pr is clone from https://github.com/pytorch/pytorch/pull/100238, which is important to me. But
@kwen2501 has not resolved the confliction. So, this pr is submitted to resolve the confliction.
the only confliction is `distributed_c10d.py:2653`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100954
Approved by: https://github.com/kwen2501
2023-05-11 01:49:09 +00:00
Rodrigo Kumpera
a204f7f518 [c10d] Fix subprocess group handlig in scatter_object_list. (#100552)
scatter_object_list assumed src was a group rank while all collectives use global ranks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100552
Approved by: https://github.com/fduwjj
2023-05-04 10:04:21 +00:00
Xiaodong Wang
c29ab84115 Fix bug in process_group_name when there is duplicate pgs (#100518)
Summary: with the new c10d API, we don't need all ranks to call new_group. Integrate with the new API, so that every rank just call new_group 3 times, with a local barrier with the members within the group.

Reviewed By: xunnanxu, eeggl

Differential Revision: D45315615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100518
Approved by: https://github.com/kumpera
2023-05-04 02:12:28 +00:00
Animesh Jain
5fbb40669f [dynamo][moco] Disallow_in_graph distributed APIs (#100071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100071
Approved by: https://github.com/jansel, https://github.com/H-Huang
2023-05-02 20:09:25 +00:00
Ke Wen
ae0eb2342d [Experimental] Remove store barrier after PG init (#99937)
Store based barrier is not scalable.
Experimenting to see if removing it breaks any CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99937
Approved by: https://github.com/kumpera, https://github.com/H-Huang
2023-04-27 17:23:10 +00:00
Rodrigo Kumpera
ad21890f8f [c10d] Scalable PG initiation. (#99931)
Add use_local_synchronization argument to new_group.

When this argument is True, is change new_group to do a store_barrier only on the ranks that are park of the group and not the whole cluster.

This addressess both scalability and composability problems associated with new_group.

Fixes #81291.

This is relanding #84224
As part of the original PR I did a quick benchmark of creating 3 PGs per rank using both functions and perf is the following:

new_group use_local_synchronization=False:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.12 |
| 8 | 0.25 |
| 16 | 0.51 |
| 32 | 0.87 |
| 64 | 1.50 |
| 128 | 2.87 |

new_group use_local_synchronization=True:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.05 |
| 8 | 0.04 |
| 16 | 0.03 |
| 32 | 0.03 |
| 64 | 0.04 |
| 128 | 0.04 |

Scaling for `use_local_synchronization=False` is sub linear because the number of process groups created as a multiple of world_size decreases as we go up. It's 6 with world_size 4 and 192 with world_size 128.

Scaling for `use_local_synchronization=True` is constant as the number of store barriers executed per rank remains constant at 3.

Setup:

1 AWS host, backend gloo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99931
Approved by: https://github.com/xw285cornell
2023-04-27 13:44:02 +00:00
Aaron Gokaslan
e2a3817dfd [BE] Enable C419 rule for any all shortcircuiting (#99890)
Apparently https://github.com/pytorch/pytorch/pull/78142 made torch.JIT allow for simple generator expressions which allows us to enable rules that replace unnecessary list comprehensions with generators in any/all. This was originally part of #99280 but I split it off into this PR so that it can be easily reverted should anything break.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99890
Approved by: https://github.com/justinchuby, https://github.com/kit1980, https://github.com/malfet
2023-04-25 15:02:13 +00:00
Ke Wen
3a09aa5977 [c10d] Faster coalescing (#98793)
### Description
The PR aims at reducing CPU overhead of context manager style coalescing.

By "context manager style coalescing", we mean:
Sync style:
```
with _coalescing_manager():
     for i in range(num_coll):
         dist.all_reduce(tensors[i])
```
Async style:
```
with _coalescing_manager(async_ops=True) as cm:
     for i in range(num_coll):
         dist.all_reduce(tensors[i])
cm.wait()
```
In previous implementation, each collective in the `num_coll` loop actually calls into the C++ backend, accumulating pybind overhead.

In the new implementation, we capture the collectives at Python level, and only fire towards C++ at the exit of the coalescing manager.

### Tests
In current PR, the "fast path" only applies to all-reduce.
- Flattened 512M: 16.38 ms, including CPU time 131.21 us
- Old _coalescing_manager 64 x 8M: 22.19 ms, including CPU time 2865 us
- New _coalescing_manager 64 x 8M: 16.93 ms, including CPU time 635 us

Hence a 4x reduction in CPU overhead (dependent on `num_coll`).

Cc @mrshenli @kumpera @wanchaol @fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98793
Approved by: https://github.com/kumpera
2023-04-24 21:27:26 +00:00
medivh-xp
39590d06c5 Make new_subgroups avaliable for non-cuda depend backend (#99706)
The `new_subgroups` allows for the easy creation of sub-communication groups, but it currently requires CUDA availability. For communications that do not rely on CUDA, such as the CPU-based gloo or custom communication backends, I still hope to be able to use it, such as with the CPU-based gloo (which is also the case when using a custom backend):
```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def gloo_process(rank_id, world_size, group_size, mp_lock):
    assert not torch.cuda.is_available()
    def lock_print(*args, **kwargs):
        with mp_lock:
            print(*args, **kwargs, flush=True)

    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('gloo', rank=rank_id, world_size=world_size)

    subgroup, _ = dist.new_subgroups(group_size)
    subgroup_ranks = list(range(subgroup.rank() * group_size, (subgroup.rank() + 1) * group_size))
    lock_print(f"Rank {rank_id} initialized in subgroup_{subgroup.rank()}: {subgroup_ranks}")

    tensor = torch.Tensor([rank_id + 1])
    subgroup.broadcast(tensor, root=0)

    lock_print(f"After broadcast, rank {rank_id} in subgroup_{subgroup.rank()}:{subgroup_ranks} got {tensor}")

if __name__ == "__main__":
    world_size = 4
    group_size = 2
    processes = []
    mp.set_start_method("spawn")
    mp_lock = mp.Lock()
    for rank in range(world_size):
        p = mp.Process(target=gloo_process, args=(rank, world_size, group_size, mp_lock))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()
```

```bash
Rank 0 assigned to subgroup_0: [0, 1]
Rank 1 assigned to subgroup_1: [2, 3]
Rank 2 assigned to subgroup_0: [0, 1]
Rank 3 assigned to subgroup_1: [2, 3]
After broadcast, rank 2 in subgroup_0:[0, 1] got tensor([3.])
After broadcast, rank 3 in subgroup_1:[2, 3] got tensor([3.])
After broadcast, rank 1 in subgroup_1:[2, 3] got tensor([1.])
After broadcast, rank 0 in subgroup_0:[0, 1] got tensor([1.])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99706
Approved by: https://github.com/kumpera
2023-04-24 18:22:59 +00:00
PyTorch MergeBot
9861ec9785 Revert "[c10d] Faster coalescing (#98793)"
This reverts commit db456ab83d.

Reverted https://github.com/pytorch/pytorch/pull/98793 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-04-21 09:15:04 +00:00
Ke Wen
db456ab83d [c10d] Faster coalescing (#98793)
### Description
The PR aims at reducing CPU overhead of context manager style coalescing.

By "context manager style coalescing", we mean:
Sync style:
```
with _coalescing_manager():
     for i in range(num_coll):
         dist.all_reduce(tensors[i])
```
Async style:
```
with _coalescing_manager(async_ops=True) as cm:
     for i in range(num_coll):
         dist.all_reduce(tensors[i])
cm.wait()
```
In previous implementation, each collective in the `num_coll` loop actually calls into the C++ backend, accumulating pybind overhead.

In the new implementation, we capture the collectives at Python level, and only fire towards C++ at the exit of the coalescing manager.

### Tests
In current PR, the "fast path" only applies to all-reduce.
- Flattened 512M: 16.38 ms, including CPU time 131.21 us
- Old _coalescing_manager 64 x 8M: 22.19 ms, including CPU time 2865 us
- New _coalescing_manager 64 x 8M: 16.93 ms, including CPU time 635 us

Hence a 4x reduction in CPU overhead (dependent on `num_coll`).

Cc @mrshenli @kumpera @wanchaol @fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98793
Approved by: https://github.com/kumpera
2023-04-19 20:17:58 +00:00
Howard Huang
760967a284 Update _store_based_barrier implementation to reduce load on rank 0 (#98000)
Summary:

Update from using add() which makes rank 0 overloaded with requests to a single request every 10 seconds to handle the last joined worker
Added optional logging_interval arg to _store_based_barrier

Test Plan:
```
pytest test/distributed/test_c10d_common.py -vsk test_store_based_barrier
```

Reviewed By: rohan-varma

Differential Revision: D44430531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98000
Approved by: https://github.com/kumpera
2023-04-11 14:25:29 +00:00
Edward Z. Yang
b09722f540 Convert logging f-strings to use % format, part two (#98700)
This hits multi-line logging strings

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98700
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
Howard Huang
61c74ab0f8 Fix MPI rank and world size pg initialization (#98545)
Fixes https://github.com/pytorch/pytorch/issues/97507

Test command
`pytest test/distributed/test_c10d_common.py -vsk def test_init_process_group_for_all_backends`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98545
Approved by: https://github.com/malfet
2023-04-07 21:57:31 +00:00
Rohan Varma
8a29afe98a [RFC] Add warning about object-based collectives for GPU tensors to docs. (#97702)
Using GPU tensors in these collectives have caused SEVs, user
confusion, and slowness in the past. These APIs were only designed to
communicate arbitrary python objects, and GPU tensors should either be copied
to CPU first or use the regular collecitves. Add a warning indicating so.

Differential Revision: [D44435849](https://our.internmc.facebook.com/intern/diff/D44435849/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97702
Approved by: https://github.com/kumpera
2023-04-06 23:47:35 +00:00
Howard Huang
3b6e94cb8c [small] replace with .format() with f-strings (#98514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98514
Approved by: https://github.com/awgu
2023-04-06 18:58:56 +00:00
Kazuaki Ishizaki
6514d71add Fix typos under torch/distributed directory (#98225)
This PR fixes typos in comments and messages of `.py` files under `torch/distributed` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98225
Approved by: https://github.com/soulitzer, https://github.com/kit1980
2023-04-05 00:21:33 +00:00
Edward Z. Yang
5df59f957f Fix G001,G002,G003 in logs to % syntax (#97812)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97812
Approved by: https://github.com/Skylion007, https://github.com/kiukchung, https://github.com/malfet, https://github.com/mlazos
2023-04-01 01:43:33 +00:00
Kazuaki Ishizaki
35fd5c548e Fix typos under torch/distributed directory (#95638)
This PR fixes typos in comments and messages of `.py` files under torch/distributed directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638
Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980
2023-03-27 21:13:44 +00:00
Howard Huang
ac7329b323 Add exceptionhandler to more distributed_c10d APIs (#96770)
Summary: Adding exception handler to a few more APIs so that internal errors are logged to the c10d errors scuba table

Test Plan: sandcastle

Differential Revision: D44068557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96770
Approved by: https://github.com/wz337
2023-03-15 20:31:46 +00:00
Howard Huang
02fa2291f7 Add support for custom backend (#95072)
Fixes https://github.com/pytorch/pytorch/issues/92344

A custom backend can be specified by passing in a string with format `"<device_type1>:<backend_name>,<device_type2>:<backend_name>"`, e.g. `"cpu:gloo,cuda:custom_backend"`.

Differential Revision: [D43630050](https://our.internmc.facebook.com/intern/diff/D43630050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95072
Approved by: https://github.com/kwen2501
2023-03-02 21:41:49 +00:00
Howard Huang
c0fa0669f6 Update isend/irecv warning messages for nccl (#95236)
Summary: nccl backend does not support `tag` as mentioned in https://github.com/pytorch/pytorch/issues/94819. Adding a note in the documentation for it.

Example:

<img width="888" alt="image" src="https://user-images.githubusercontent.com/14858254/220464900-094c8063-797a-4bdc-8e25-657f17593fe9.png">

Differential Revision: D43475756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95236
Approved by: https://github.com/awgu, https://github.com/rohan-varma
2023-02-22 22:00:13 +00:00
Rodrigo Kumpera
641cb4243c Fix c10d regression during cleanup. (#94988)
This fixes a regression introduced earlier today with a change to c10d global state.

It must be cleaned up in destroy_process_group or root PG and its Store will stay alive.

Fixes regression in test_c10d_nccl.py :: RendezvousEnvTest.test_common_errors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94988
Approved by: https://github.com/H-Huang, https://github.com/wanchaol, https://github.com/malfet
2023-02-16 19:12:00 +00:00
Rodrigo Kumpera
e22d791287 [PTD] Introduce tracing friendly collectives. (#93990)
This change adds torch.distributed.traceable_collectives.

This experimental API enables collectives to be fully traced by dynamo and FX.

See #93173 for the RFC

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93990
Approved by: https://github.com/wconstab, https://github.com/wanchaol, https://github.com/H-Huang
2023-02-16 15:35:01 +00:00
Xuehai Pan
b005ec62b9 [BE] Remove dependency on six and future (#94709)
Remove the Python 2 and 3 compatibility library [six](https://pypi.org/project/six) and [future](https://pypi.org/project/future) and `torch._six`. We only support Python 3.8+ now. It's time to retire them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94709
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-02-14 09:14:14 +00:00
Howard Huang
8b3e3f937d Update documentation init_process_group optional backend (#94543)
Update documentation for `init_process_group()` to mention the `backend` argument is optional.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94543
Approved by: https://github.com/kwen2501
2023-02-13 21:45:38 +00:00
Howard Huang
f45c196653 Update backend config to be under _World (#94191)
All the c10d process group state is under `_World`, so this is BE work to include a missing map
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94191
Approved by: https://github.com/kumpera
2023-02-09 20:48:42 +00:00
Aaron Gokaslan
8fce9a09cd [BE]: pyupgrade Python to 3.8 - imports and object inheritance only (#94308)
Apply parts of pyupgrade to torch (starting with the safest changes).
This PR only does two things: removes the need to inherit from object and removes unused future imports.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94308
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-07 21:10:56 +00:00
Iris
f54fd6fb28 [c10d] Update get_backend() in exception_handler (#94063)
Currently, get_backend() and get_world_size() would always return the default value if no pg group argument is passed. This fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94063
Approved by: https://github.com/H-Huang
2023-02-04 19:39:36 +00:00
Ching-Hsiang Chu
1fa68d40b8 [pytorch] fix backend_type for backend/PG plugin (#93129)
Summary: For backend/PG plugin, use `ProcessGroup.BackendType.CUSTOM` to avoid uninitialized variable during `pg._register_backend` later

Test Plan: CI/CD and internal tests

Differential Revision: D42793222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93129
Approved by: https://github.com/H-Huang
2023-01-30 23:16:08 +00:00
Howard Huang
2503a4a7c6 Fix MPI backend PG initialization (#92847)
Fixes #92573

Add test to check that all default backends can be initialized to prevent the above from regressing in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92847
Approved by: https://github.com/rohan-varma
2023-01-24 23:24:41 +00:00
Andrew Gu
cb67d9460b [PT-D] Fix send, recv return type (#92152)
- `send` returns `None`.
- `recv` returns the sender rank if valid or -1 otherwise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92152
Approved by: https://github.com/wz337
2023-01-14 01:09:49 +00:00
joncrall
ad782ff7df Enable xdoctest runner in CI for real this time (#83816)
Builds on #83317 and enables running the doctests. Just need to figure out what is causing the failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83816
Approved by: https://github.com/ezyang, https://github.com/malfet
2022-12-29 05:32:42 +00:00
Howard Huang
99aec69f58 [BE] remove Backend.TCP (#91314)
Remove Backend.TCP which is unused. Fixes a task in https://github.com/pytorch/pytorch/issues/90544

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91314
Approved by: https://github.com/awgu
2022-12-23 15:48:29 +00:00
Sergii Dymchenko
365071c73c Fix non-existing parameters in docstrings in torch/distributed (#91116)
This is a continuation of https://github.com/pytorch/pytorch/pull/90505
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91116
Approved by: https://github.com/huydhn
2022-12-22 02:37:31 +00:00
Sadra Barikbin
97f514f38e Fix two typos in torch.distributed.distributed_c10d.py::broadcast_object_list (#91237)
Fixes #91236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91237
Approved by: https://github.com/malfet, https://github.com/H-Huang
2022-12-21 19:45:08 +00:00
Howard Huang
7a0f29b776 Allow Process Group to support multiple backends (#88330) (#90997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330

### Implementation
Move backend-specific (NCCL, Gloo, etc) collective implementations to corresponding `Backend` class. Update ProcessGroup to support multiple backends and use dispatcher to calls backends based on tensor device type.

### Changes

#### c++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`)
- Update pybind definitions for new process group base class and new backend class
- Update pybinded backend class with collective definitions to keep BC with Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`) which are used in tests
- Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, `ProcessGroupUCC` to derive from the `Backend` class.
- Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by querying the backend using the device type
- Update internal dispatched implementation of `barrier` to use a tensor which allows operation to be dispatched.
- Update `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching it correctly. I still don't understand why and had originally filed an issue in 85122.

#### python changes (`distributed_c10d.py`, test files)
- Add BackendConfig class to specify the default configurations of backends and `get_backend_config()` API
- `get_backend()` deprecation warning
- `init_process_group` how returns a generic `ProcessGroup` object, it contains a list of backends (the ones stated above) which it will dispatch operations to.
- `new_group` updated to return the same as above
- Update `test_c10d_gloo.py`, Update `DistributedDataParallelTest` to use `init_process_group`, Update `ReducerTest`, update `test_broadcast_coalesced_gloo` to move from PG instance and gloo options
- Update `test_c10d_nccl.py`, Update `DistributedDataParallelTest` to use `init_process_group`
- Specific tests updated: `test_Backend_enum_class`

### Changes missing
- lazy initialization of backends
- support parsing of BackendConfig

### open questions
- Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338)

# Example

This is a basic script (using 2 backends within a process group)

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py
import torch.distributed as dist
import torch
import os

if __name__ == "__main__":
    rank = os.environ.get("RANK")
    # initialize with both gloo and nccl
    dist.init_process_group()
    # with gloo
    dist.all_reduce(torch.tensor([1.0]))
    print(f"Rank {rank} finished")
    # with nccl
    dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}"))
```

Test Plan: Imported from OSS

Differential Revision: D42069829

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997
Approved by: https://github.com/awgu, https://github.com/fduwjj
2022-12-16 23:15:00 +00:00
Rohan Varma
793a999ce0 Hybrid Sharded Data Parallel (#89915)
Adds 2 new hybrid sharding strategy to FSDP:
1. HYBRID_SHARD: applies zero-3 style sharding within a node, and data parallel across
2. HYBRID_SHARD_ZERO2: applies zero-2 style sharding within a node, and data parallel across

These are useful for medium sized models and aim to decrease communication volume, tests and benchmarks will be run to understand which workloads are optimal under which sharding strategy.

Hybrid sharding in general works by sharding the model using a process group within a single node, and creating intra-node process groups for replication / data parallelism. The user either needs to pass in a tuple of these process groups, or None, and we generate the process groups appropriately.

** Acknowledgements **
- @awgu 's excellent prototype: 5ad3a16d48
- @liangluofb For ideation, feedback, and initial implementation and experimentation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89915
Approved by: https://github.com/awgu
2022-12-08 16:18:03 +00:00
Howard Huang
ee907375fa [small] Update error message (#89294)
Summary:
`RuntimeError: Invalid function argument. Expected parameter "tensor_list" to be of type List[torch.Tensor].`

to

`RuntimeError: Invalid function argument. Expected parameter "input_tensor_list" to be of type List[torch.Tensor].`

Test Plan: sandcastle

Differential Revision: D41405238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89294
Approved by: https://github.com/awgu
2022-11-19 00:21:14 +00:00
Masaki Kozuki
63e16216d8 [c10d] Implement __instancecheck__ for c10d::ReduceOp (#88275)
Summary:
- Customize the metaclass of `torch.distributed.distributed_c10d.ReduceOp` for the sake of custom `__instancecheck__`
- Add `copy.copy`, `copy.deepcopy`, and `pickle` support with tests

Rel:
- #81272
- #84243
- #87191
- #87303
- #87555

Ref:
- https://github.com/pybind/pybind11/issues/2696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88275
Approved by: https://github.com/wanchaol
2022-11-15 13:21:41 +00:00
Iris
68fd8f3706 [BE] [c10d][send] Improve error message on dist.send() with destination rank as itself (#89004)
This improves error msg on dist.send() and add corresponding test in test_c10d_common.py(https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_common.py).
Context in issue#83912: https://github.com/pytorch/pytorch/issues/83912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89004
Approved by: https://github.com/H-Huang
2022-11-15 06:13:17 +00:00
Howard Huang
1d54ce9d5d [14/N] Refactor _new_process_group_helper() to remove repeated code (#88351)
Changes:
- refactor parts of `_new_process_group_helper()` to remove repeated code

Differential Revision: [D41188274](https://our.internmc.facebook.com/intern/diff/D41188274)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88351
Approved by: https://github.com/kwen2501
2022-11-10 19:27:17 +00:00
Kurt Mohler
ee28b865ee Deprecate TypedStorage, its derived classes, and all of their public methods (#85303)
Part of #85302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85303
Approved by: https://github.com/ezyang
2022-11-08 18:11:01 +00:00
Rodrigo Kumpera
6663ae5537 [2/n] Thread PG: add class _World to distributed_c10d.py (#781) (#88471)
Summary:
X-link: https://github.com/pytorch/torchrec/pull/781

Move a bunch of globals to instance methods and replace all use to them.

We move all PG related globals under World and use a singleton instance under _world.

This creates an undocumented extension point to inject full control of how how c10d
state behaves.

One simple hack is to change _world to an implementation that uses a threadlocal
and enable per-thread PGs.

It almost get DDP working and the PG is missing an implementation of all_reduce.

This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68

This change ensures BC by keeping the global variables around and have the default _World wrap it.

I have relinked this diff to a new github PR, so that I can update it. The original PR is
> Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348

Differential Revision: D40236769

Pulled By: yhcharles

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88471
Approved by: https://github.com/gnadathur, https://github.com/rohan-varma
2022-11-07 17:56:40 +00:00
Kazuaki Ishizaki
2ddefbdc3c Fix typos used in documents under torch directory (#88300)
This PR fixes typos, in comments of Python files, that are found from a search box at https://pytorch.org/docs/master/search.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88300
Approved by: https://github.com/lezcano
2022-11-02 09:38:13 +00:00
Iris Zhang
0cf572ff6c [C10D][BE] Add exception handlers to c10d collectives function (#87643) (#87988)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87643

1. Add a decorator function exception_handlers to  c10d collectives.
2. Update test(torch/distributed/distributed_c10d.py) to include mp tests for exception_handler.

```
python3 test/distributed/test_c10d_error_logger.py
```

Test Plan: Test in OSS.

Reviewed By: H-Huang

Differential Revision: D40281632

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87988
Approved by: https://github.com/H-Huang
2022-10-29 04:38:34 +00:00
Masaki Kozuki
aa8248cc9a Reenable isinstance with torch.distributed.ReduceOp (#87303)
tentatively marking as draft as I haven't gotten a comprehensive list of side effects...

Ref: https://stackoverflow.com/questions/40244413/python-static-class-attribute-of-the-class-itself
Rel: https://github.com/pytorch/pytorch/issues/87191

cc @kwen2501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87303
Approved by: https://github.com/wanchaol
2022-10-21 15:05:36 +00:00
PyTorch MergeBot
f451e824f3 Revert " C10D extension to enable per-thread PG (#86348)"
This reverts commit 97abc21f2b.

Reverted https://github.com/pytorch/pytorch/pull/86348 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks macos tests 97abc21f2b
2022-10-14 01:26:46 +00:00
Rodrigo Kumpera
97abc21f2b C10D extension to enable per-thread PG (#86348)
Move a bunch of globals to instance methods and replace all use to them.

We move all PG related globals under World and use a singleton instance under _world.

This creates an undocumented extension point to inject full control of how how c10d
state behaves.

One simple hack is to change _world to an implementation that uses a threadlocal
and enable per-thread PGs.

It almost get DDP working and the PG is missing an implementation of all_reduce.

This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68

This change ensures BC by keeping the global variables around and have the default _World wrap it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348
Approved by: https://github.com/rohan-varma
2022-10-13 22:23:28 +00:00
Louis Feng
55479fe80e Enable capturing of comm collective parameters (#98) (#85368)
Summary:
X-link: https://github.com/facebookresearch/torch_ucc/pull/98

Add tensor input, output, and other metadata for PyTorch comms.

Test Plan: P517138779

Reviewed By: Pavani-Panakanti

Differential Revision: D38357077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85368
Approved by: https://github.com/H-Huang
2022-10-11 04:38:26 +00:00
Jesus Magana
c670bad72f Update dist.scatter() documentation (#86069)
Update documentation for dist. scatter

Fixes #84566

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86069
Approved by: https://github.com/rohan-varma, https://github.com/H-Huang
2022-10-03 17:22:08 +00:00
Ke Wen
05d1128106 [c10d] Start deprecating *_multigpu APIs (#85961)
### Deprecation reasons:
- For most users training is on one GPU per process so these APIs are rarely used
- They added one more API dimension
- They can be expressed in a composed manner
- They are not abstracted – specific to GPU
- They caused backend APIs and implementations to have nested `std::vector<std::vector<Tensor>>`, which is hard to read or maintain

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85961
Approved by: https://github.com/XilunWu, https://github.com/H-Huang
2022-10-01 00:59:39 +00:00
Ke Wen
463283e016 [c10d] Start deprecating *_coalesced APIs (#85959)
- We consider that general users need not to use the `*_coalesced` APIs unless there is an extreme concern about performance.

- We are investigating using a context manager named `coalescing_manager` which wrap around multiple individual collectives to compose the coalescing hint, rather than giving each collective a *_coalesced variant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85959
Approved by: https://github.com/XilunWu, https://github.com/H-Huang
2022-10-01 00:55:27 +00:00
Ke Wen
ade1c19612 Add reduce_scatter_tensor in place of _reduce_scatter_base (#85867)
This is a twin PR similar to the one for `all_gather_into_tensor` (#85686).
The philosophy for renaming `_reduce_scatter_base` instead of merging it is described in #85686.

Cc @rohan-varma @H-Huang @crcrpar @ptrblck @mrshenli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85867
Approved by: https://github.com/crcrpar, https://github.com/H-Huang
2022-09-30 05:48:16 +00:00
Saliya Ekanayake
941d7a31f6 Pass group ranks and options to third party distributed backends (#73164)
Fixes #73163

PyTorch's [_new_process_group_helper()](9f541aa3ac/torch/distributed/distributed_c10d.py (L633)) does not pass group's participating ranks to the backend.

This PR adds the above capability. Also, refactors some variables for better clarity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73164
Approved by: https://github.com/kumpera
2022-09-29 17:28:58 +00:00
PyTorch MergeBot
6fae62b35f Revert "C10D extension to enable per-thread PG (#84153)"
This reverts commit 5cbffbbac9.

Reverted https://github.com/pytorch/pytorch/pull/84153 on behalf of https://github.com/kumpera due to broke internal stuff
2022-09-29 13:51:05 +00:00
Ke Wen
775a22c7c6 Add all_gather_into_tensor in place of _all_gather_base (#85686)
### Description
- This PR renames `_all_gather_base` to `all_gather_into_tensor` so that it is clearer in meaning.
- The `all_gather_into_tensor` API differs from the `all_gather` API in the output it accepts -- a single, large tensor instead of a list of tensors.
- This PR also adds deprecation warning to `_all_gather_base`.

### Issue
`_all_gather_base` was implemented in https://github.com/pytorch/pytorch/pull/33924 to avoid unnecessary flattening. There was previous effort (#82639) to merge `_all_gather_base` with the existing `all_gather` API by detecting the parameter type passed in for the output.

There are, however, two "blockers" that make the merge difficult:
(i) The merge leads to backward compatibility break. We would need to change the parameter name `tensor_list` in `all_gather` to a general name `output` that can cover both tensor and tensor list.
(ii) Recently, the `all_gather` API has added uneven tensor support, utilizing the tensor boundaries implied by the list. We are, however, not sure to add such support to the `_all_gather_base` function, because that would require users to pass in additional tensor boundary information.

In view of the above, we decided to productize `_all_gather_base` as a separate function, but with a clearer name.

### Testing
Added tests:
- `test_all_gather_into_cat_tensor_cuda` -- output form as with `torch.cat`. For example:
```
        >>> tensor_in
        tensor([1, 2], device='cuda:0') # Rank 0
        tensor([3, 4], device='cuda:1') # Rank 1
        >>> tensor_out
        tensor([1, 2, 3, 4], device='cuda:0') # Rank 0
        tensor([1, 2, 3, 4], device='cuda:1') # Rank 1
```
- `test_all_gather_into_stack_tensor_cuda` -- output form as with `torch.stack`. For example:
```
        >>> tensor_out2
        tensor([[1, 2],
                [3, 4]], device='cuda:0') # Rank 0
        tensor([[1, 2],
                [3, 4]], device='cuda:1') # Rank 1
```
The output form is determined by the shape of the output tensor passed by the user, no flag used.

Cc @rohan-varma @mrshenli @crcrpar @ptrblck @H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85686
Approved by: https://github.com/rohan-varma, https://github.com/crcrpar
2022-09-27 22:50:22 +00:00
Rodrigo Kumpera
5cbffbbac9 C10D extension to enable per-thread PG (#84153)
Move a bunch of globals to instance methods and replace all use to them.

We move all PG related globals under World and use a singleton instance under _world.

This creates an undocumented extension point to inject full control of how how c10d
state behaves.

One simple hack is to change _world to an implementation that uses a threadlocal
and enable per-thread PGs.

It almost get DDP working and the PG is missing an implementation of all_reduce.

This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84153
Approved by: https://github.com/rohan-varma
2022-09-27 21:42:31 +00:00
Rodrigo Kumpera
7dcc723d35 [c10d] Ensure collectives are called with the same dtype for all tensor params. (#84664)
While passing tensors with different dtypes don't crash, they don't produce sensible results.

We see data tearing instead of casting.

It's not clear we want to support transparent casting so, for now, we fail when such input is presented.

Fixes #84525

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84664
Approved by: https://github.com/rohan-varma
2022-09-15 22:32:51 +00:00
Salahuddin
6bd7d0f856 doc string fixed in torch.distributed.reduce_scatter (#84983)
Fixes #84865

Previous `torch.distributed.reduce_scatter`:

```
def reduce_scatter(output, input_list, op=ReduceOp.SUM, group=None, async_op=False):
    """
    Reduces, then scatters a list of tensors to all processes in a group.

    Args:
        output (Tensor): Output tensor.
        input_list (list[Tensor]): List of tensors to reduce and scatter.
        group (ProcessGroup, optional): The process group to work on. If None,
            the default process group will be used.
        async_op (bool, optional): Whether this op should be an async op.
```

Fixed:

```
def reduce_scatter(output, input_list, op=ReduceOp.SUM, group=None, async_op=False):
    """
    Reduces, then scatters a list of tensors to all processes in a group.

    Args:
        output (Tensor): Output tensor.
        input_list (list[Tensor]): List of tensors to reduce and scatter.
        op (optional): One of the values from
            ``torch.distributed.ReduceOp``
            enum.  Specifies an operation used for element-wise reductions
        group (ProcessGroup, optional): The process group to work on. If None,
            the default process group will be used.
        async_op (bool, optional): Whether this op should be an async op.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84983
Approved by: https://github.com/H-Huang
2022-09-15 18:17:10 +00:00
Rodrigo Kumpera
38192f63cd Add __all__ for a few distributed modules plus a little typing (reland) (#84872)
This handles distributed_c10d, which is massive and ddp_comm_hooks.

This relands #84119 with the required fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84872
Approved by: https://github.com/rohan-varma
2022-09-13 21:57:49 +00:00
PyTorch MergeBot
219ff26172 Revert "Add __all__ for a few distributed modules plus a little typing (#84119)"
This reverts commit 6f21680563.

Reverted https://github.com/pytorch/pytorch/pull/84119 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D39386448
2022-09-09 20:01:07 +00:00
Rodrigo Kumpera
6f21680563 Add __all__ for a few distributed modules plus a little typing (#84119)
This handles distributed_c10d, which is massive and ddp_comm_hooks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84119
Approved by: https://github.com/rohan-varma
2022-09-08 23:28:31 +00:00
Rodrigo Kumpera
e96fb5d58c [c10d] Fix docstring of scatter_object_list (#84596)
The docstring for scatter_object_list mentions is doesn't work with NCCL, but this was fixed in #79034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84596
Approved by: https://github.com/H-Huang
2022-09-07 14:49:45 +00:00
Masaki Kozuki
ab6c57217a Add NCCL PreMul Sum to c10d redce ops (#84243)
This is based on #81272 but this conforms to TorchScript Compiler

## TODO
- [ ] Update abaf8112e6/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp (L64-L73) to use `ReduceOp::RedOpType`. In my first try with `USE_SYSTEM_UCC=1`, this change wasn't necessary (I think) because of `ReduceOp::RedOpType` operator. That being said, I want to make it more explicit.

cc @ptrblck @kwen2501 @aazzolini
cc @zasdfgbnm for visibility to the TODO above
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84243
Approved by: https://github.com/kwen2501
2022-09-02 21:57:45 +00:00
Rodrigo Kumpera
7a348a1d4a Fix internal breakage caused by #82134 (#84363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84363
Approved by: https://github.com/rohan-varma, https://github.com/mehtanirav
2022-09-01 17:54:10 +00:00
Rodrigo Kumpera
65dc5dd3f3 [c10d] Introduce dist.get_local_rank, dist.get_global_rank and dist.get_global_ranks (#82134)
Those functions enable membership introspection into a ProcessGroup. A common scenario
that needs this is library code that consumes a PG but doesn't create it, which means
it likely doesn't know the global ranks used to create it.

Translating from local to global is necessary when using c10d collectives like broadcast
so if your library code adopts the convention of using local rank 0, it needs
to the following:

```python
import torch.distributed as dist

my_pg: dist.ProcessGroup = ...

def my_library_bcast(tensor)
    dist.broadcast(tensor, src=dist.get_global_rank(my_pg, local_rank=0), my_pg)

```

This implements some of the helpers needed to implement the `clone` API from: https://github.com/pytorch/pytorch/issues/81291
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82134
Approved by: https://github.com/rohan-varma
2022-08-30 17:45:00 +00:00
PyTorch MergeBot
1f61c39ac4 Revert "Support NCCL Premul Sum (#81272)"
This reverts commit 432c508e71.

Reverted https://github.com/pytorch/pytorch/pull/81272 on behalf of https://github.com/weiwangmeta due to breaking internal builds
2022-08-25 05:01:37 +00:00
Masaki Kozuki
432c508e71 Support NCCL Premul Sum (#81272)
This PR adds the support for https://docs.nvidia.com/deeplearning/nccl/archives/nccl_21212/user-guide/docs/api/ops.html?highlight=premul#c.ncclRedOpCreatePreMulSum.

The major changes include
- convert enum ReduceOp to struct
- add premul sum specific paths to init.cpp and Ops.cpp.

note:
- For pip wheels / conda binaries to support this, ~~I think https://github.com/pytorch/pytorch/pull/79132 would be needed~~ https://github.com/pytorch/pytorch/pull/82775 landed

The commit titled "add nccl premul" whose current hash is cb99ad6744 was authored by @mcarilli and @ptrblck.

cc @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81272
Approved by: https://github.com/kwen2501
2022-08-24 04:53:25 +00:00
joncrall
4618371da5 Integrate xdoctest - Rebased (#82797)
This is a new version of #15648 based on the latest master branch.

Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.

In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)

Fixes https://github.com/pytorch/pytorch/issues/71105

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
2022-08-12 02:08:01 +00:00
Xiang Gao
cda210e23b UCC PG build in CI (#81583)
- Modifies the current cmake build definitions to use `find_package` to find UCX and UCC installed in the system
- Install UCX and UCC in CUDA dockers
- Build PyTorch with `USE_UCC=1` in pipelines
- Currently, we are not running unit tests with the UCC PG. Those tests will be added in future PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81583
Approved by: https://github.com/vtlam, https://github.com/malfet
2022-08-10 00:23:47 +00:00
Aashaka Shah
24a084eda6 [c10d] Fix async error in batch_isend_irecv (#82450)
Summary:
`batch_isend_irecv` previously required the use of `torch.cuda.synchronize` to avoid data race conditions. This was because the ncclStreams were recorderd in the returned ncclWork object _before_ a ncclGroupEnd by the `_batch_p2p_manager` was issued. Thus, the `req.wait()` was effectively waiting on nothing, leading to the later operators working on incorrect intermediate data.

This fix:
- keeps track of ncclStreams to wait on, and records them in the work objects after the batch manager issues a ncclGroupEnd
- renames the `_batch_p2p_manager` to `_coalescing_manager` for generality
- removes the explicit check for NCCL backend inside `_batch_p2p_manager` in distributed_c10.py and moves the manager start/end to ProcessGroup.hpp, in order to transparently work with all process groups

Test Plan: Modified the unittest for `batch_isend_irecv` to check that received tensors are the same as expected tensors. Verified that the test fails before the change, and passes after the change.

Differential Revision: D38100789

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82450
Approved by: https://github.com/kwen2501
2022-08-08 17:50:22 +00:00