Commit Graph

37 Commits

Author SHA1 Message Date
Yuanyuan Chen
115af42e9d Fix readability checks in TIDY and apply them (#164475)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164475
Approved by: https://github.com/albanD, https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-10-02 20:34:49 +00:00
Richard Barnes
fddabc6e0b C10_UNUSED to [[maybe_unused]] (#6357) (#138364)
Summary: Pull Request resolved: https://github.com/pytorch/executorch/pull/6357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138364
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-10-19 13:17:43 +00:00
Richard Barnes
b7f798caa4 Use C10_UNUSED instead of (void)X (#137239)
Summary:
Auto-generated with
```
buck run //scripts/rbarnes/regex_multiline_replacer:regex_multiline_replacer -- --find '^(\s*for\s*\()(const.*\n)\s*\(void\)[A-Za-z]+;\s*//\s*Suppress.*\s*\n(.*)'  --replace '\1C10_UNUSED \2\3' `find caffe2/ -regex ".*\.\(cpp\|h\)"`
```

Differential Revision: D33432600

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137239
Approved by: https://github.com/Skylion007
2024-10-15 14:32:59 +00:00
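The two migrations above follow the same pattern: `(void)x;` became the `C10_UNUSED` macro, which in turn became the standard C++17 attribute. A minimal sketch (hypothetical helper, not PyTorch code) of the post-migration form:

```cpp
#include <cstddef>
#include <vector>

// Count iterations of a range-for whose loop variable is otherwise unused.
// Pre-C++17 the idiom was `(void)item;` (or a C10_UNUSED macro expanding to
// a compiler-specific attribute); since C++17 the standard spelling is
// [[maybe_unused]], which the commits above migrate to.
inline std::size_t count_items(const std::vector<int>& items) {
    std::size_t n = 0;
    for ([[maybe_unused]] const auto& item : items) {
        ++n;  // `item` is deliberately unused; the attribute silences -Wunused-variable
    }
    return n;
}
```

Note the attribute sits where the regex in the commit above places `C10_UNUSED`: before `const` in the range-for declaration.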
cyy
71efbf701d [3/N] Change #include <c10/util/Optional.h> to #include <optional> (#130300)
Follows #130236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130300
Approved by: https://github.com/ezyang
2024-07-09 13:32:57 +00:00
cyy
4457cd9a30 [Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987)
This PR continues to clean clang-tidy warnings in torch/csrc/distributed/c10d, following #124701.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124987
Approved by: https://github.com/malfet
2024-05-11 00:03:52 +00:00
PyTorch MergeBot
724c7491d0 Revert " [Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987)"
This reverts commit b3fd94d15e.

Reverted https://github.com/pytorch/pytorch/pull/124987 on behalf of https://github.com/ezyang due to broke downstream extensions ([comment](https://github.com/pytorch/pytorch/pull/124987#issuecomment-2083956511))
2024-04-30 00:37:53 +00:00
cyy
b3fd94d15e [Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987)
This PR continues to clean clang-tidy warnings in torch/csrc/distributed/c10d, following #124701. In addition, a libfmt dependency is added in the CMake code to enable using it in the headers. libfmt has to be added as a private dependency to torch_cuda and torch_hip because they include torch/csrc/distributed/c10d/Utils.hpp, which uses libfmt.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124987
Approved by: https://github.com/malfet
2024-04-27 07:22:27 +00:00
cyy
6d8bb0e984 [Distributed] [1/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#122884)
This PR fixes some clang-tidy warnings in distributed code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122884
Approved by: https://github.com/kwen2501
2024-03-31 09:06:35 +00:00
Aaron Gokaslan
647f14e70b [BE]: Enable clang-tidy check for readability-string-compare (#115994)
Adds a clang-tidy check to ensure string compare is not used unnecessarily in a way that is less efficient and less readable if an equality overload exists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115994
Approved by: https://github.com/albanD
2023-12-18 16:13:00 +00:00
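The readability-string-compare check above targets a specific pattern; a minimal sketch (hypothetical functions, not PyTorch code) of what the check flags and what it prefers:

```cpp
#include <string>

// Equality via compare(): the form readability-string-compare flags.
// compare() does a three-way lexicographic comparison even when only
// equality is needed.
inline bool eq_via_compare(const std::string& a, const std::string& b) {
    return a.compare(b) == 0;  // flagged: less readable, not faster
}

// Preferred form: operator== can short-circuit on unequal lengths
// before comparing any characters.
inline bool eq_via_operator(const std::string& a, const std::string& b) {
    return a == b;
}
```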
Xiaodong Wang
7553c49514 [S382174] Fix distributed debug w/ non-equal split (#115483)
Summary:
In collectives, it's possible to have a non-equal split, which has a different implementation, and the output tensor size will be different, e.g. https://www.internalfb.com/code/fbsource/[460afb1172b5]/fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp?lines=3104. However, TORCH_DISTRIBUTED_DEBUG=DETAIL assumes the output tensor size is the same, performs the check, and will fail the job if the sizes don't match: https://fburl.com/code/mhte9ty8. c10d code should handle this.

Ideally we should check the input size across ranks and make sure they're the same. Maybe for next diff.

Test Plan: Test torchrec's TWRW w/ non-even split and it's working now.

Reviewed By: zhangruiskyline

Differential Revision: D52010942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115483
Approved by: https://github.com/kwen2501, https://github.com/fegin, https://github.com/XilunWu
2023-12-12 18:02:05 +00:00
Rohan Varma
ebcc42ea10 [Dist] Fix coalescing manager + DETAIL debug mode (#111878)
Fix https://github.com/pytorch/pytorch/issues/109520 by adding it to
ProcessGroupWrapper.

Differential Revision: [D50583403](https://our.internmc.facebook.com/intern/diff/D50583403/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111878
Approved by: https://github.com/fegin, https://github.com/wanchaol, https://github.com/fduwjj
2023-10-24 07:47:39 +00:00
Rohan Varma
9ba2bfea9c [PG Wrapper] Add diff capability (#100214)
Currently we print out the mismatched collectives, but it is hard to tell exactly what the mismatch is. This diff adds functionality to detect the exact mismatch and report it.

New error is as follows:

```
Detected mismatch between collectives on ranks. Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=1151423632, OpType=ALLREDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=1151423632, OpType=REDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))). Collectives differ in the following aspects: Op type: ALLREDUCE vs REDUCE
```

i.e. the "Collectives differ in the following..." messaging is added.

Differential Revision: [D45375737](https://our.internmc.facebook.com/intern/diff/D45375737/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100214
Approved by: https://github.com/H-Huang
2023-05-10 15:32:30 +00:00
Rohan Varma
c4bed869d1 [PG Wrapper] Enhance error msg (#100213)
Previously, the mismatch report would not give the full details of the
collective running on the mismatched rank; it would look something like:

```
Detected mismatch between collectives on ranks. Rank 26 is running collective: CollectiveFingerPrint(SequenceNumber=683057617, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=513876813OpType=BROADCAST).
```

i.e. Rank 1 is missing more details such as the shape, type etc.

This was due to the `num_tensors` field not being populated, which `operator<<`
checks to determine whether to print additional information such as the tensor
shape.

Adding this field gives a better error:

```
Detected mismatch between collectives on ranks. Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=1564312518, OpType=ALLREDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=1564312518, OpType=REDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))).
```

Differential Revision: [D45372325](https://our.internmc.facebook.com/intern/diff/D45372325/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100213
Approved by: https://github.com/H-Huang
2023-04-28 18:49:18 +00:00
Rohan Varma
b8cf010139 Print collective (#97544)
Prints the collective running in TDD.

Differential Revision: [D44347417](https://our.internmc.facebook.com/intern/diff/D44347417/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97544
Approved by: https://github.com/zhaojuanmao
2023-04-06 06:47:19 +00:00
Rohan Varma
dab1a7e6a1 [PG Wrapper] Add sequence number (#97462)
Adds sequence number to PG wrapper

Differential Revision: [D44347419](https://our.internmc.facebook.com/intern/diff/D44347419/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97462
Approved by: https://github.com/zhaojuanmao
2023-04-06 06:47:19 +00:00
Nikita Shulga
a229e78544 [BE] Enforce sign-compare (#96723)
A number of OSS PRs were reverted because of new signed-unsigned comparison warnings, which are treated as errors in some internal builds.
Not sure how those selective rules are applied, but this PR removes `-Wno-sign-compare` from the PyTorch codebase.

The only tricky part in this PR is making sure that non-ASCII character detection works for both signed and unsigned chars here:
6e3d51b08a/torch/csrc/jit/serialization/python_print.cpp (L926)

Exclude several files from sign-compare if flash attention is used, due to a violation in cutlass to be fixed by https://github.com/NVIDIA/cutlass/pull/869.
Do not try to fix sign-compare violations in the caffe2 codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96723
Approved by: https://github.com/albanD
2023-03-15 06:04:20 +00:00
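The class of warning the commit above starts enforcing comes from mixing signed loop indices with unsigned container sizes. A minimal sketch (hypothetical function, not PyTorch code) of the problem and the conventional fix:

```cpp
#include <cstddef>
#include <vector>

// With -Wsign-compare enforced (and -Werror in some internal builds),
// the classic loop
//
//   for (int i = 0; i < v.size(); ++i)   // int vs. std::size_t
//
// warns, because `i` is signed while size() is unsigned. One conventional
// fix is to use the container's size type for the index so both sides of
// the comparison are unsigned.
inline long long sum(const std::vector<int>& v) {
    long long total = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {  // both sides unsigned
        total += v[i];
    }
    return total;
}
```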
Aaron Gokaslan
97db9fde69 Fix header-filter for clang-tidy c10 and apply some fixes to c10 and c10d (#91178)
Fixes the broken header filters from #90699 and applies a few more relevant clang-tidy fixes to c10 and c10d. The header-filter pattern was actually broken, and the clang-tidy include pattern was redundant. Also fixes a few bugs in torch/distributed/c10d.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91178
Approved by: https://github.com/ezyang
2022-12-27 07:34:12 +00:00
Howard Huang
7a0f29b776 Allow Process Group to support multiple backends (#88330) (#90997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330

### Implementation
Move backend-specific (NCCL, Gloo, etc.) collective implementations to the corresponding `Backend` class. Update ProcessGroup to support multiple backends and use the dispatcher to call backends based on tensor device type.

### Changes

#### c++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`)
- Update pybind definitions for new process group base class and new backend class
- Update pybinded backend class with collective definitions to keep BC with Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`) which are used in tests
- Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, `ProcessGroupUCC` to derive from the `Backend` class.
- Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by querying the backend using the device type
- Update internal dispatched implementation of `barrier` to use a tensor which allows operation to be dispatched.
- Update the `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching it correctly. I still don't understand why and had originally filed an issue in #85122.

#### python changes (`distributed_c10d.py`, test files)
- Add BackendConfig class to specify the default configurations of backends and `get_backend_config()` API
- `get_backend()` deprecation warning
- `init_process_group` now returns a generic `ProcessGroup` object; it contains a list of backends (the ones stated above) to which it will dispatch operations.
- `new_group` updated to return the same as above
- Update `test_c10d_gloo.py`, Update `DistributedDataParallelTest` to use `init_process_group`, Update `ReducerTest`, update `test_broadcast_coalesced_gloo` to move from PG instance and gloo options
- Update `test_c10d_nccl.py`, Update `DistributedDataParallelTest` to use `init_process_group`
- Specific tests updated: `test_Backend_enum_class`

### Changes missing
- lazy initialization of backends
- support parsing of BackendConfig

### open questions
- Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338)

# Example

This is a basic script (using 2 backends within a process group)

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py
import torch.distributed as dist
import torch
import os

if __name__ == "__main__":
    rank = os.environ.get("RANK")
    # initialize with both gloo and nccl
    dist.init_process_group()
    # with gloo
    dist.all_reduce(torch.tensor([1.0]))
    print(f"Rank {rank} finished")
    # with nccl
    dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}"))
```

Test Plan: Imported from OSS

Differential Revision: D42069829

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997
Approved by: https://github.com/awgu, https://github.com/fduwjj
2022-12-16 23:15:00 +00:00
Aaron Gokaslan
da8f539e84 [Fix]: Add missing std::vector reserve in aten and torch/csrc (#90627)
Applies some clang-tidy static-analysis fixes in places where a std::vector could call .reserve() first to allocate the appropriate amount of space.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90627
Approved by: https://github.com/ezyang
2022-12-13 14:46:27 +00:00
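The fix pattern in the commit above is mechanical; a minimal sketch (hypothetical function, not PyTorch code):

```cpp
#include <cstddef>
#include <vector>

// When the final element count is known up front, reserve() once so that
// push_back never reallocates mid-loop; without it, the vector may
// reallocate and copy its contents several times as it grows.
inline std::vector<int> squares(int n) {
    std::vector<int> out;
    out.reserve(static_cast<std::size_t>(n));  // single allocation up front
    for (int i = 0; i < n; ++i) {
        out.push_back(i * i);
    }
    return out;
}
```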
Min Si
1ad0048b64 Refactor distributed to use absolute header path (#85780)
Headers under torch/csrc/distributed may be referenced with relative paths, e.g. "<c10d/...>". However, relative paths cannot be gracefully handled by Meta's internal build when the NCCL PG is hipified to support AMD/RCCL, because the "hipified" header files are generated in other directories. Moreover, using absolute paths for header inclusion is the state of the art in most components of PyTorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.

See D39835774 for more details about Meta internal complication.

**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options. Use it to verify that we don't miss any relative-path uses of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera, https://github.com/huydhn
2022-09-30 05:13:50 +00:00
PyTorch MergeBot
a50d8864fc Revert "Refactor distributed to use absolute header path (#85780)"
This reverts commit 668082718a.

Reverted https://github.com/pytorch/pytorch/pull/85780 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks build due to a missing file <c10d/Store.hpp>
2022-09-30 02:04:29 +00:00
Min Si
668082718a Refactor distributed to use absolute header path (#85780)
Headers under torch/csrc/distributed may be referenced with relative paths, e.g. "<c10d/...>". However, relative paths cannot be gracefully handled by Meta's internal build when the NCCL PG is hipified to support AMD/RCCL, because the "hipified" header files are generated in other directories. Moreover, using absolute paths for header inclusion is the state of the art in most components of PyTorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.

See D39835774 for more details about Meta internal complication.

**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options. Use it to verify that we don't miss any relative-path uses of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera
2022-09-30 00:27:24 +00:00
Howard Huang
74ead61944 [2/N] [Dispatchable Collectives] Extract ProcessGroup::Work into a separate class and update references (#83680)
### Changes
- Move ProcessGroup::Work into its own class and update all the references to it / header includes.

#### Motivation
In future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. This change prevents a circular dependency, with ProcessGroup depending on Backend and Backend depending on ProcessGroup::Work.

Differential Revision: [D38839212](https://our.internmc.facebook.com/intern/diff/D38839212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83680
Approved by: https://github.com/kwen2501
2022-09-14 13:05:58 +00:00
Rodrigo Kumpera
08795f9afc Add _reduce_scatter_base to ProcessGroupWrapper. (#79633)
Fixes #66329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79633
Approved by: https://github.com/fduwjj, https://github.com/rohan-varma
2022-06-29 15:32:42 +00:00
PyTorch MergeBot
f62d8b2a0f ProcessGroupWrapper log full rank fingerprint mismatches (#79901)
### Current Error Message:
```
Detected mismatch between collectives on ranks. Rank 0 is running collective:
CollectiveFingerPrint(OpType=ALLREDUCE, …), but Rank 1 is running collective: REDUCE.
```

### Ops Mismatch, New Error Message (shows full fingerprint, includes tensor shape, data types, and device types):
```
Detected mismatch between collectives on ranks. Rank 0 is running collective:
CollectiveFingerPrint(OpType=ALLREDUCE, TensorShape=[20, 10], TensorDtypes=Float,
TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu,
layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))),

but Rank 1 is running collective:
CollectiveFingerPrint(OpType=REDUCE, TensorShape=[20, 10], TensorDtypes=Float,
TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu,
layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))).
```

### Shape Mismatch, New Error Message
```
RuntimeError: Detected mismatch between collectives on ranks. Rank 0 is running collective:
CollectiveFingerPrint(OpType=SCATTER, TensorShape=[1], TensorDtypes=Long,
TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false
(default), pinned_memory=false (default), memory_format=(nullopt))),

but Rank 1 is running collective:
CollectiveFingerPrint(OpType=SCATTER, TensorShape=[2], TensorDtypes=Long,
TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false
(default), pinned_memory=false (default), memory_format=(nullopt))).

```

Changes:
- Update deserialize function to read shape of tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79901
Approved by: https://github.com/rohan-varma
2022-06-28 18:30:38 +00:00
Howard Huang
26b51290e5 [BE] Update ProcessGroupWrapper to add deserializer and improve logs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79724

Approved by: https://github.com/kumpera, https://github.com/rohan-varma
2022-06-20 19:24:59 +00:00
Howard Huang
ee715e0a65 [small changes] add closing parentheses to print out
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79723

Approved by: https://github.com/rohan-varma
2022-06-20 18:00:48 +00:00
Michael Suo
30fb2c4aba [lint] autoformat test/cpp and torch/csrc
Let's have some fun.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78828

Approved by: https://github.com/ezyang
2022-06-11 21:11:16 +00:00
Andrew Gu
a37d54b6d1 [Easy][c10d][DDP] (Reland) Minor fixes (#73569)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73569

Reland https://github.com/pytorch/pytorch/pull/73299 and https://github.com/pytorch/pytorch/pull/73318 together.

Test Plan: Imported from OSS

Reviewed By: zhaojuanmao

Differential Revision: D34552418

Pulled By: awgu

fbshipit-source-id: 95088d2c1c67cd4fb9bbb115e15ba6b26ae06bdb
(cherry picked from commit 695ebc3dc0ccb08a167445588c293b3a6c3c00b7)
2022-03-03 14:30:54 +00:00
Yanli Zhao
2d45a3d7cf Back out "[Easy][c10d] Minor fixes" (#73521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73521

Original commit changeset: 9ee79a72b4fc

Original Phabricator Diff: D34431929 (12890abcb4)
ghstack-source-id: 150127107

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D34527593

fbshipit-source-id: 050f909800626761cacc9aa0cc5af6bef20966ae
(cherry picked from commit 2ddbb49886961c0ba7f3344e3e3a41080d45bd6b)
2022-03-01 04:35:29 +00:00
Andrew Gu
12890abcb4 [Easy][c10d] Minor fixes (#73318)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73318

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D34431929

Pulled By: awgu

fbshipit-source-id: 9ee79a72b4fc2a86c2fd846e091e278dfc242898
(cherry picked from commit e7c920a593c771096649bc5abb40f775a381729d)
2022-02-26 00:45:50 +00:00
Amir Khojaste
748790588c Upgrading the loop to use irange (#70326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70326

See D24145988 for context: it allows loops such as `for (int i = 0; i < 10; i++)` to be expressed as `for (const auto i : c10::irange(10))`. This is nice because it auto-types the loops and adds const-safety to the iteration variable.

Test Plan: buck run //caffe2/torch/fb/sparsenn:test

Reviewed By: r-barnes

Differential Revision: D33243400

fbshipit-source-id: b1f1b4163f4bf662031baea9e5268459b40c69a3
2022-01-06 07:06:53 -08:00
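To make the transformation in the commit above concrete, here is a deliberately minimal sketch of an irange-like type (this is not the real `c10::irange`, which lives in c10/util/irange.h and is templated and more careful about types):

```cpp
#include <cstdint>

// A stripped-down integer range: lets `for (int i = 0; i < n; i++)` be
// written as `for (const auto i : irange(n))`, auto-typing the index and
// keeping it const within each iteration.
class irange {
 public:
    class iterator {
     public:
        explicit iterator(std::int64_t v) : v_(v) {}
        std::int64_t operator*() const { return v_; }
        iterator& operator++() { ++v_; return *this; }
        bool operator!=(const iterator& other) const { return v_ != other.v_; }
     private:
        std::int64_t v_;
    };

    explicit irange(std::int64_t end) : end_(end) {}
    iterator begin() const { return iterator(0); }
    iterator end() const { return iterator(end_); }

 private:
    std::int64_t end_;
};
```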
Rohan Varma
cb14a258a2 [c10d] Fix object-based collectives for debug mode (#68223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68223

DETAIL debug mode didn't work with object-based collectives for the NCCL backend, because we'd only check whether the backend is NCCL and then move tensors to CUDA.

Instead, check whether it is a wrapped PG, and then check the wrapped PG to see if it's NCCL.
ghstack-source-id: 143242023

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D32366840

fbshipit-source-id: be0a2af6849f8f24446593f4a4fbea4a67586ee5
2021-11-13 04:18:31 -08:00
Rohan Varma
b72a1782d8 [PG Wrapper][BE] Add collective information when monitored barrier error is (#66167)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66167

Sometimes, due to desync, we see the PG wrapper's monitored barrier fail. In
this case it would be useful to print the info about the collective that was
trying to run along with the actual error.
ghstack-source-id: 140037653

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D31353021

fbshipit-source-id: e2a515326c9314c98119978d5566eb5431cca96c
2021-10-08 09:14:24 -07:00
Rohan Varma
b5b1d49a66 [PG Wrapper][BE] Make some methods private (#66166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66166

These methods should be private.
ghstack-source-id: 139782587

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D31353020

fbshipit-source-id: 583fb315cc2cacc37df3d29cd5793b42558930b3
2021-10-08 09:13:02 -07:00
Rohan Varma
f5341bd5e6 Enhance ProcessGroupWrapper with additional checks + refactor (#60237)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60237

Closes https://github.com/pytorch/pytorch/issues/58711

This diff refactors the collective consistency checking in `ProcessGroupWrapper` as described in the above issue. In particular, we no longer run separate verification checks (`all_gather`s) for shapes, op type, etc. Instead, we implement a function `serialize_fingerprint` to serialize all this data into a single tensor and only verify that.

This has the benefit of being a lot more extensible: the developer does not need to add separate `all_gather` calls in order to verify additional data in the future. We can also provide some sort of mechanism where data that needs to be verified is "registered" in the `CollectiveFingerPrint` struct, making it even easier to add additional data; we can consider doing this if there are significant additions to `ProcessGroupWrapper`.

We now also begin to check tensor `dtypes` and device types for consistency as well. Tests are refactored/added accordingly.
ghstack-source-id: 132520261

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D28597287

fbshipit-source-id: b09f14f628df9e2457623ba81fc13fd4e214f3c9
2021-06-28 10:24:11 -07:00
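The single-tensor fingerprint idea described in the commit above can be sketched roughly as follows. This is a hypothetical illustration, not the real `CollectiveFingerPrint`: the field layout and codes are invented, and the actual implementation serializes into a tensor exchanged via `all_gather`.

```cpp
#include <cstdint>
#include <vector>

// Flatten the metadata to verify (op type, dtype, shape) into one integer
// buffer, so a single all_gather suffices instead of one per field.
// Field codes here are invented for illustration.
inline std::vector<std::int64_t> serialize_fingerprint(
        std::int64_t op_type,
        std::int64_t dtype,
        const std::vector<std::int64_t>& shape) {
    std::vector<std::int64_t> out;
    out.reserve(3 + shape.size());
    out.push_back(op_type);
    out.push_back(dtype);
    out.push_back(static_cast<std::int64_t>(shape.size()));  // rank first, so
    out.insert(out.end(), shape.begin(), shape.end());       // shapes of different
    return out;                                              // length never alias
}

// Each rank would gather the other ranks' buffers and compare; equality
// means the ranks are running consistent collectives.
inline bool fingerprints_match(const std::vector<std::int64_t>& a,
                               const std::vector<std::int64_t>& b) {
    return a == b;
}
```

Extending verification to a new field then only means appending it to the buffer, rather than adding another `all_gather` round.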
Luca Wehrstedt
a016150163 Move torch/lib/c10d to torch/csrc/distributed/c10d (#60543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60543

Since now c10d is part of libtorch, it would also be nice if the sources lived all in one place.
ghstack-source-id: 132306292

Test Plan: It builds

Reviewed By: cbalioglu

Differential Revision: D29062002

fbshipit-source-id: d9e1301e9d73e1643fa0f0119cd2d618f1ad52e6
2021-06-24 12:38:51 -07:00