Commit Graph

332 Commits

Author SHA1 Message Date
Chien-Chin Huang
e72936c27c [PT2D] Fix the circular import issue (#125618)
As title

Differential Revision: [D57011394](https://our.internmc.facebook.com/intern/diff/D57011394/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125618
Approved by: https://github.com/wz337
2024-05-07 05:10:18 +00:00
Muralidhar Andoorveedu
b96b1e8cff [Distributed] Add P2P versions of *object_list operations (#124379)
This PR adds `send_object_list` and `recv_object_list` to `distributed_c10d.py`. This is extending functionality already present in PyTorch with `broadcast_object_list` that I noticed was missing and decided to upstream.

With this change, sending and receiving arbitrary picklable Python objects is possible.

Relevant issue: https://github.com/pytorch/pytorch/issues/3473
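A minimal usage sketch (assuming the new ops mirror `broadcast_object_list`'s calling convention and the script is launched with torchrun on 2 ranks):

```
import torch.distributed as dist

# Sketch: send/receive arbitrary picklable Python objects between two ranks.
dist.init_process_group("gloo")  # env (MASTER_ADDR, RANK, ...) set by torchrun

if dist.get_rank() == 0:
    objects = [{"step": 1}, "hello", 3.14]   # arbitrary picklable objects
    dist.send_object_list(objects, dst=1)
else:
    objects = [None, None, None]             # pre-sized list, filled in place
    dist.recv_object_list(objects, src=0)
```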

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124379
Approved by: https://github.com/kwen2501, https://github.com/wconstab
2024-05-03 23:22:58 +00:00
Chien-Chin Huang
1eb7b8eb60 [PT2D] Ensure the trace rules are correct with distributed (#125333)
Summary:
1. Avoid using `torch._dynamo.disable`.
2. Clear the LRU cache of the trace rules. This won't do anything if rules are not evaluated before PG initialization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125333
Approved by: https://github.com/yanboliang
2024-05-02 16:28:38 +00:00
feifan
197612c84c ProcessGroupWrapper support custom backend (#124447)
Fixes #ISSUE_NUMBER
In the current code, ProcessGroupWrapper only works for the `GLOO, NCCL, UCC` backends when `TORCH_DISTRIBUTED_DEBUG=DETAIL`.
Reading the ProcessGroupWrapper code, I found that a communication op in ProcessGroupWrapper is just the communication op of the original backend plus runCollectiveChecks in gloo, e.g. allreduce:
82e0153487/torch/csrc/distributed/c10d/ProcessGroupWrapper.cpp (L406-L411)

`runCollectiveChecks` computes a collective fingerprint of the tensors and runs gloo's `monitoredBarrier`.
82e0153487/torch/csrc/distributed/c10d/ProcessGroupWrapper.cpp (L586-L590)
I don't know why ProcessGroupWrapper doesn't work for all backends, but I think custom backends can support it.
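A hedged sketch of how the wrapper is exercised (assumes a 2-rank job launched with torchrun, one GPU per rank; wrapping of custom backends is what this PR enables):

```
import torch
import torch.distributed as dist

# Sketch: with TORCH_DISTRIBUTED_DEBUG=DETAIL exported before launch, the
# process group is wrapped in ProcessGroupWrapper, which runs the gloo-based
# checks (collective fingerprint + monitoredBarrier) before each collective.
dist.init_process_group("nccl")

t = torch.ones(4, device=f"cuda:{dist.get_rank()}")
dist.all_reduce(t)  # mismatched shapes/dtypes/ops across ranks are reported early
```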

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124447
Approved by: https://github.com/kwen2501
2024-05-01 19:59:55 +00:00
Will Constable
8f31988088 [C10D] Document 'tag' limitation for nccl send/recv (#125278)
Existing documentation on isend/irecv also applies to send/recv. This PR
copies the doc/warning to send/recv ops as well.

Note: a tag may be supplied, but it will be ignored when used with the nccl
backend.
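A hedged sketch (assumes a 2-rank NCCL process group is already initialized, one GPU per rank):

```
import torch
import torch.distributed as dist

# Sketch: tag is accepted by send/recv but silently ignored by the NCCL
# backend, so it cannot be used to match messages there.
if dist.get_rank() == 0:
    dist.send(torch.ones(2, device="cuda:0"), dst=1, tag=42)  # tag ignored on nccl
else:
    buf = torch.empty(2, device="cuda:1")
    dist.recv(buf, src=0, tag=42)
```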

Fixes #94819 #125079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125278
Approved by: https://github.com/kwen2501
2024-05-01 02:53:30 +00:00
Aaron Gokaslan
5a1216bb2e [BE]: Update ruff to 0.4.1 (#124549)
Update ruff to 0.4.1.
This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes.

Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0

| Repository                                         | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7         | 251.8         | 351.1            | 274.9            |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
2024-04-21 14:06:23 +00:00
Kiuk Chung
87f44d70b1 [torch/distributed] Check gloo availability when doing isinstance(pg,… (#124233)
Fixes a bug where a reference to `_ProcessGroupWrapper` is used without first checking whether gloo is available. This fails on pytorch builds that do not include gloo because `_ProcessGroupWrapper` is only pybinded when building with `USE_GLOO=1`. Therefore, creation of a new process group fails with a `NameError` when only NCCL is available as the backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124233
Approved by: https://github.com/rohan-varma, https://github.com/d4l3k
2024-04-19 04:07:00 +00:00
Shuqiang Zhang
ca6a0e1348 [c10d] remove the env of TORCH_NCCL_ABORT_IN_DESTROY_PG (#124334)
Summary:
This env var was introduced to safely roll out the behavior change in destroy
process group (e.g., calling ncclCommAbort). Now that this behavior change
has been rolled out, we no longer need the env var and should clean it up
to keep the code cleaner.
Test Plan:
Modified/existing ut pass

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124334
Approved by: https://github.com/wconstab
2024-04-18 23:42:55 +00:00
Will Constable
1885c3972d [C10D] Add dist.get_node_local_rank helper (#123992)
Fixes #122816

Summarizing the pros/cons of the request and motivation from #122816

- (+) it's really common for users to read `os.environ["LOCAL_RANK"]`, so we
  should provide a helper
- (-) we can't really control if/how local rank information is made
  available, but it is handled automatically if torchrun is used.

We can assume the local rank is correct when it is passed.
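A minimal sketch of the intended usage (the `fallback_rank` keyword shown here is an assumption for the no-torchrun case, not confirmed by this log):

```
import torch
import torch.distributed as dist

# Sketch: read the node-local rank via the helper instead of poking at
# os.environ["LOCAL_RANK"]; fallback_rank is an assumed parameter for the
# single-process case.
local_rank = dist.get_node_local_rank(fallback_rank=0)
torch.cuda.set_device(local_rank)
```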

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123992
Approved by: https://github.com/shuqiangzhang, https://github.com/zdevito, https://github.com/XilunWu
2024-04-16 00:09:46 +00:00
Shengbao Zheng
9fa922c2ed [profiler] Log process group name instead of pg uid (#124035)
Summary:
As part of the work to unify process group identifiers, log <group_name, group_desc> instead of the pg uid in the profiler.
- group_name remains the unique identifier, e.g. “0”, "1"
- group_desc will be the user-specified name, e.g. "fsdp".

Reviewed By: aaronenyeshi, kwen2501

Differential Revision: D55610682

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124035
Approved by: https://github.com/aaronenyeshi
2024-04-15 21:49:06 +00:00
Shengbao Zheng
4e9094533e [c10d/nccl-pg] allow user to pass process group description (#123472)
Summary:
We need a way to allow users to set a customized description for a process group, e.g. FSDP, PP.

Here are several use cases for a user-specified group_desc:
- Logging: we can easily match a log line and understand what a collective/PG is used for.
- PyTorch traces (e.g. Kineto, Execution Trace) can benefit from the PG desc, since trace analysis and benchmarks will be able to easily differentiate PG purposes like FSDP and PP.
- Lower-layer collective (e.g. NCCL) debugging: we will be able to expose the PG desc to the NCCL communicator so NCCL-layer operations can be easily correlated to a PG.

Solution: Add a group_desc field to c10d
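A hedged sketch of what this could look like from Python (`group_desc` as a `new_group` keyword is assumed from the PR description, not confirmed here):

```
import torch.distributed as dist

# Sketch: attach a human-readable description to a sub-group; it then shows up
# in logs, traces, and NCCL-level debug output alongside the numeric group name.
dist.init_process_group("nccl")
fsdp_pg = dist.new_group(ranks=list(range(dist.get_world_size())), group_desc="fsdp")
```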

Differential Revision: D55781850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123472
Approved by: https://github.com/kwen2501
2024-04-12 08:44:21 +00:00
Gufan Yin
65710d95c9 Fix example in torch.distributed.new_subgroups docstring (#123492)
Summary: As title

Test Plan: Run the example locally

Reviewed By: zhaojuanmao

Differential Revision: D55617871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123492
Approved by: https://github.com/wconstab, https://github.com/wz337
2024-04-10 03:33:07 +00:00
Shengbao Zheng
ae6f8d923c Pass and record process_group_name when creating ProcessGroupNCCL (#123117)
Summary:
Pass the Python c10d group_name to the C++ ProcessGroupNCCL so that the pg name is consistent across different layers.
Also record pg_name in flight recorder entry.

Differential Revision: D55597200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123117
Approved by: https://github.com/wconstab
2024-04-05 18:57:45 +00:00
Chirag Pandya
b6201a60c5 [BE] minor logging cleanup in distributed (#122921)
Summary:
    Minor logging cleanup in distributed library
    1. Don't use "f" formatted strings - address linter issues.
    2. Nits: Make use of unused `e` (error) in a few logs.
    3. Change info->debug as asked in issue #113545
    4. Nit: rename log -> logger in a few files for consistency
    5. Fix a linter error.

    Test Plan:
    1. Local build passes.
    2. Linter is happy.

    Reviewers: wanchaol

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
2024-03-29 03:34:01 +00:00
Yifu Wang
36188360dd [dynamo] support torch.distributed.{group.WORLD, GroupMember.WORLD, distributed_c10d._get_default_group} (#120560)
Fixes https://github.com/pytorch/pytorch/issues/120431

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120560
Approved by: https://github.com/wconstab
2024-03-24 11:13:05 +00:00
lezcano
8a5a377190 Move doc links to point to main (#121823)
The previous links were pointing to an outdated branch

Command: `find . -type f -exec sed -i "s:docs/master:docs/main:g" {} +`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121823
Approved by: https://github.com/albanD, https://github.com/malfet
2024-03-15 19:49:37 +00:00
Ke Wen
038b2e8780 [c10d] Add complex support for P2P (#121240)
Fixes the following error when `tensor` is a complex tensor:
```
[rank0]:     return pg.send([tensor], dst, tag)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Unconvertible NCCL type ComplexFloat
```
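A hedged sketch of the now-working pattern (assumes a 2-rank NCCL process group, one GPU per rank):

```
import torch
import torch.distributed as dist

# Sketch: complex tensors can now be sent/received over NCCL P2P
# (presumably handled by viewing them as real tensors under the hood).
if dist.get_rank() == 0:
    z = torch.randn(4, dtype=torch.complex64, device="cuda:0")
    dist.send(z, dst=1)
else:
    buf = torch.empty(4, dtype=torch.complex64, device="cuda:1")
    dist.recv(buf, src=0)
```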

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121240
Approved by: https://github.com/shuqiangzhang
2024-03-08 22:47:49 +00:00
Shengbao Zheng
60aaba4128 create function to get ProcessGroupNCCL uid (#121132)
Summary: expose ProcessGroupNCCL uid

Differential Revision: D54446056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121132
Approved by: https://github.com/aaronenyeshi
2024-03-07 18:34:38 +00:00
Will Constable
f5b99976ad [C10D] Make _set_pg_timeout work with DeviceMesh PG (#120850)
Fixes #120847

Makes _set_pg_timeout work on the nccl and/or gloo backends instead of only one backend (gloo) in cases where both backends exist for the group.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120850
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
2024-02-29 03:41:15 +00:00
Will Constable
c016ffed5b [C10D] Fix logic for default group=None in _set_pg_timeout (#120686)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120686
Approved by: https://github.com/yifuwang
2024-02-28 20:31:14 +00:00
Chien-Chin Huang
f422467ccb [BE]Delay the call to set_pytorch_distributed_envs_from_justknobs (#120625)
When the default process group is initialized twice, `init_process_group` shows an explicit error message indicating this.

However, with `set_pytorch_distributed_envs_from_justknobs` as the very first line of `init_process_group`, the error message becomes implicit and the root cause is hard to understand when testing with the FB code base.

Differential Revision: [D54206202](https://our.internmc.facebook.com/intern/diff/D54206202/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120625
Approved by: https://github.com/wconstab, https://github.com/yifuwang
2024-02-28 18:34:45 +00:00
Shengbao Zheng
440a9b212d [profiler] log process group config information in distributedInfo field (#119443)
Summary: Process group config is essential for analyzing collective patterns. We have added this to Execution Trace; now expose this information in Kineto as well.

Differential Revision: D53557965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119443
Approved by: https://github.com/kwen2501
2024-02-27 18:21:54 +00:00
Yifu Wang
11e4a9266d Temporarily support ranks + tag as pg identifier in native funcol (#120226)
As communicated in https://github.com/pytorch/pytorch/issues/93173#issuecomment-1907095208, although we are dropping `(ranks, tag)` as group identifier in funcols, there will be a grace period for migration. This PR adds temporary `(ranks, tag)` support in native funcols. It also helps us decouple the py funcol -> native funcol transition from the API change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120226
Approved by: https://github.com/wanchaol, https://github.com/wconstab
ghstack dependencies: #120042, #120043, #120070
2024-02-22 20:24:16 +00:00
Shengbao Zheng
9630bcbd49 [execution trace/chakra] remove backend_id from pg_info (#120038)
Summary:
PR #104373 (https://github.com/pytorch/pytorch/pull/104373) logged the backend via an unsafe dict lookup that might fail.
We decided to deprecate backend_id and use the pg id/name directly.

Differential Revision: D53676181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120038
Approved by: https://github.com/aaronenyeshi
2024-02-21 19:37:18 +00:00
Shuqiang Zhang
893dcac068 [c10d] explicitly abort communicators in destroy_process_group call (#119250)
Summary:
This PR tries to resolve issue #119215.

Basically, process group shutdown (and hence ncclCommAbort) is called in the
destroy_process_group APIs for the corresponding PGs. In the
destructor of ProcessGroup, we avoid calling abort/ncclCommAbort;
instead, it just checks whether the user has already explicitly called destroy_process_group. If
not, the destructor will log a warning and encourage/expect users to do so
to clean up PG resources.
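A minimal sketch of the cleanup pattern this change expects:

```
import torch.distributed as dist

# Sketch: call destroy_process_group explicitly so communicators are aborted
# deterministically; the ProcessGroup destructor now only warns if you forget.
dist.init_process_group("nccl")
try:
    ...  # training loop / collectives
finally:
    dist.destroy_process_group()
```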

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119250
Approved by: https://github.com/minsii, https://github.com/kwen2501
2024-02-12 18:40:28 +00:00
Alexander Grund
69344fe987 c10d: Don't add NCCL backend by default without CUDA (#119149)
The NCCL backend requires CUDA (including devices) to be available, so don't use that backend by default when CUDA isn't available, to avoid the following error when creating a CPU-only device mesh:
> RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

Fixes #117746
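A hedged sketch of the CPU-only flow this fixes (assumes a gloo-capable build and a torchrun launch):

```
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Sketch: on a CPU-only host, init_process_group() with no explicit backend
# should now resolve to gloo only, so a CPU device mesh can be created
# without a NCCL process group being instantiated.
dist.init_process_group()
mesh = init_device_mesh("cpu", (dist.get_world_size(),))
```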

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119149
Approved by: https://github.com/kwen2501
2024-02-05 23:55:07 +00:00
Xu Song
3d8c36786b Add device for distributed examples (#118867)
## 🐛 Describe the bug

The following example (`all_reduce`) is missing `device` allocation:
a205e7bf56/torch/distributed/distributed_c10d.py (L2080-L2087)

## Solution

A better example would look like this:
a205e7bf56/torch/distributed/distributed_c10d.py (L3212-L3222)
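Roughly, the improved style pins the tensor to the rank's device before the collective (hedged paraphrase, not a verbatim copy of the referenced lines):

```
import torch
import torch.distributed as dist

# Sketch: allocate the input on the rank's own device explicitly.
rank = dist.get_rank()
device = torch.device(f"cuda:{rank}")
tensor = torch.arange(2, dtype=torch.int64, device=device) + 1 + 2 * rank
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
```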

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118867
Approved by: https://github.com/soulitzer
2024-02-02 05:51:59 +00:00
Catherine Lee
4f5785b6b3 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 21:07:01 +00:00
Aaron Gokaslan
1562dae62c [BE]: Apply RUF025 dict.fromkeys preview rule (#118637)
Simplifies and optimizes dict construction using the `fromkeys` classmethod ctor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable.
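An illustration of the rewrite the rule performs (generic example, not taken from the PR):

```
# dict comprehension with a constant value -> dict.fromkeys
keys = ["a", "b", "c"]
before = {k: None for k in keys}
after = dict.fromkeys(keys)        # value defaults to None
assert before == after

shared = dict.fromkeys(keys, [])   # note: one list object shared by every key
```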

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637
Approved by: https://github.com/albanD
2024-01-30 20:46:54 +00:00
PyTorch MergeBot
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
Wei (Will) Feng
644f64f2d1 [c10d] added docstrings and tests for src / dst (#118593)
Follow-up to https://github.com/pytorch/pytorch/pull/118359: whether ``src`` and ``dst`` are based on the global pg or a sub-pg
* update c10d docstrings: ``src`` / ``dst`` are based on the global pg regardless of the ``group`` argument (see the sketch after this list)
* communication ops with ``dst`` argument: ``reduce``, ``gather_object``, ``gather``, ``send``, ``isend``
* communication ops with ``src`` argument: ``irecv``, ``recv``, ``broadcast``, ``broadcast_object_list``, ``scatter``, ``scatter_object_list``
* ``pytest test/distributed/test_c10d_nccl.py -k subgroup``

3 collectives are for picklable objects (``gather_object``, ``broadcast_object_list``, ``scatter_object_list``). There are 2 ways to set the device:
* use the ``device`` argument: it's implemented in ``broadcast_object_list``; maybe worth implementing in the other 2
* ``torch.cuda.set_device(global_rank)``
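A hedged sketch of the first point (assumes a 4-rank NCCL job, one GPU per rank):

```
import torch
import torch.distributed as dist

# Sketch: dst/src are global ranks even when a subgroup is passed as `group`.
subgroup = dist.new_group(ranks=[2, 3])   # must be called by all ranks
if dist.get_rank() in (2, 3):
    t = torch.ones(1, device=f"cuda:{dist.get_rank()}")
    dist.reduce(t, dst=2, group=subgroup)  # dst=2 is the global rank, not 0
```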

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118593
Approved by: https://github.com/wconstab
2024-01-30 17:47:58 +00:00
Will Constable
da0635d17c Add pytorch-distributed justknobs helper (#118568)
Summary:
Sets up a helper that checks any JKs relevant to pytorch distributed,
and propagates their values to ENV.

Test Plan: Added unit test

Differential Revision: D53192406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118568
Approved by: https://github.com/zdevito
2024-01-30 08:13:52 +00:00
Edward Z. Yang
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
Yifu Wang
b778f44e97 Allow using native c10d_functional via _functional_collectives (#113057)
This diff introduces an env var `_USE_NATIVE_C10D_FUNCTIONAL` that tells `_functional_collective` to use native `c10d_functional` ops. The Python version and the native version will co-exist until we completely switch to the native version after more testing and verification.

NOTE: `DeviceMesh` support for native `c10d_functional` will be added in a subsequent PR.
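A hedged opt-in sketch (assumes the env var must be set before `_functional_collectives` is imported):

```
import os

# Sketch: opt in to the native c10d_functional ops during the migration window.
os.environ["_USE_NATIVE_C10D_FUNCTIONAL"] = "1"

import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

dist.init_process_group("nccl")
out = funcol.all_reduce(torch.ones(4, device="cuda"), "sum", group=dist.group.WORLD)
```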

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113057
Approved by: https://github.com/LucasLLC, https://github.com/wconstab, https://github.com/wanchaol
2024-01-30 02:34:25 +00:00
PyTorch MergeBot
bb55970e5b Revert "Add justknobs env helper for pytorch distributed (#118451)"
This reverts commit 4d1bb2175a.

Reverted https://github.com/pytorch/pytorch/pull/118451 on behalf of https://github.com/wconstab due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/118451#issuecomment-1915369013))
2024-01-29 19:01:05 +00:00
Will Constable
4d1bb2175a Add justknobs env helper for pytorch distributed (#118451)
Summary:
Adds a JK killswitch check and configures the env for enabling pytorch
nccl flight recorder. Note: this only enables recording events in memory, not
dumping them.

Test Plan: CI test

Reviewed By: zdevito

Differential Revision: D52920092

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118451
Approved by: https://github.com/malfet
2024-01-29 08:57:16 +00:00
Edward Z. Yang
46712b019d Enable local_partial_types (#118467)
When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432
2024-01-28 13:38:22 +00:00
Will Constable
70699a6357 [C10D] Add tests for gather and gather_object with subgroup (#118359)
Addresses #118337 somewhat - we probably need to update docs. Let's first
confirm what behavior we want.

Identifies a couple of confusing things
1) 'dst' arg for many collectives is always in 'global' rank regardless
   of whether a subgroup is passed in.  This needs a doc update
2) gather_object has a strong dependency on setting the cuda device;
   could we make that smoother?
3) gather_object also should be happy with an empty list on the dst
   side, imo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118359
Approved by: https://github.com/weifengpy
2024-01-27 04:08:56 +00:00
PyTorch MergeBot
3d062f9abe Revert "[pytorch][kineto] log process group config in distributed info (#117774)"
This reverts commit 9c1348feb3.

Reverted https://github.com/pytorch/pytorch/pull/117774 on behalf of https://github.com/aaronenyeshi due to This diff is breaking internal jobs, but has been internally reverted ([comment](https://github.com/pytorch/pytorch/pull/117774#issuecomment-1911251092))
2024-01-26 01:10:31 +00:00
Edward Z. Yang
3e76a0e9c2 Install an excepthook which annotates exceptions with rank information when distributed is initialized (#118190)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118190
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2024-01-25 20:43:18 +00:00
Shengbao Zheng
9c1348feb3 [pytorch][kineto] log process group config in distributed info (#117774)
Summary: Process group config is essential for analyzing collective patterns. We have added this to Execution Trace; now expose this information in Kineto as well.

Test Plan: Tested in HPC

Differential Revision: D52882292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117774
Approved by: https://github.com/wconstab, https://github.com/aaronenyeshi
2024-01-25 00:08:10 +00:00
Ke Wen
1e185c7803 [c10d] Barrier uses stream sync instead of device sync (#117804)
Resubmitting #96785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117804
Approved by: https://github.com/wconstab
2024-01-24 18:42:14 +00:00
PyTorch MergeBot
b5799d9977 Revert "[c10d] Barrier uses stream sync instead of device sync (#117804)"
This reverts commit 0f6bbb1c07.

Reverted https://github.com/pytorch/pytorch/pull/117804 on behalf of https://github.com/clee2000 due to sorry the docs test failure is real, I think it wants the lines after the .. note to be indented https://github.com/pytorch/pytorch/actions/runs/7616827874/job/20745016788.  Marking as nosignal due to bad Dr. CI categorization ([comment](https://github.com/pytorch/pytorch/pull/117804#issuecomment-1904882487))
2024-01-22 21:54:03 +00:00
Ke Wen
0f6bbb1c07 [c10d] Barrier uses stream sync instead of device sync (#117804)
Resubmitting #96785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117804
Approved by: https://github.com/wconstab
2024-01-22 20:14:51 +00:00
Ke Wen
6d96beb6be [c10d] Remove health check (#117699)
https://github.com/pytorch/pytorch/pull/114916 and https://github.com/pytorch/pytorch/pull/116222 added support for eager NCCL comm init (performed as soon as `init_process_group` is called).

If any users care about the time difference and want to see NCCL init errors early, they can use eager init now.
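A minimal sketch of the eager-init path (assumes a torchrun launch that sets LOCAL_RANK):

```
import os
import torch
import torch.distributed as dist

# Sketch: passing device_id (from #114916/#116222) opts into eager NCCL comm
# init, so NCCL setup errors surface at init_process_group time instead of at
# the first collective.
local_rank = int(os.environ["LOCAL_RANK"])
dist.init_process_group("nccl", device_id=torch.device("cuda", local_rank))
```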

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117699
Approved by: https://github.com/wconstab
2024-01-18 02:14:49 +00:00
FFFrog
7b0926cc3e Fix wrong class inheritance in pyi (#116404)
As the title states.

f6dfbffb3b/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L153)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116404
Approved by: https://github.com/ezyang, https://github.com/wconstab
2024-01-12 21:25:29 +00:00
Chip Turner
9693b3740b [easy] [c10d] Add documentation for the device_id parameter for init_process_group (#116222)
Follow-up to add missing docs for #114916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116222
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
2024-01-03 19:32:18 +00:00
Aaron Gokaslan
bbe3261dd3 [BE]: Use iterable.chain.from_iterable where possible (#116376)
This is more readable and more efficient when dealing with lots of sequences to chain together.
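A small illustration of the pattern (generic example, not taken from the PR):

```
import itertools

# Flattening many sequences lazily, without building intermediate lists.
nested = [[1, 2], [3], [4, 5, 6]]
flat = list(itertools.chain.from_iterable(nested))  # [1, 2, 3, 4, 5, 6]
```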

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116376
Approved by: https://github.com/albanD
2023-12-27 19:20:07 +00:00
fduwjj
f6dfbffb3b [c10d] Add hashing as a debug feature for before and after NCCL collective call (#113238)
For now, we use `TORCH_DISTRIBUTED_DEBUG=DETAIL` to turn on a debug feature which calculates a hash of the input tensors and output results of c10d collectives in NCCL. This is a debugging feature so that we can rule out bugs at the c10d level.

<img width="840" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/cdc70b0b-ae3c-4efd-86ff-adc5c5ba505f">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113238
Approved by: https://github.com/wconstab, https://github.com/fegin
2023-12-25 22:25:38 +00:00
Lucas Pasqualin
8452f41305 Adds allreduce to inductor remap (#115950)
Fixes #115728

Implements a rewrite path for allreduce

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115950
Approved by: https://github.com/wconstab
2023-12-18 22:00:22 +00:00