This PR continues cleaning up clang-tidy warnings in torch/csrc/distributed/c10d, following #124701. In addition, a libfmt dependency is added to the CMake code to enable using it in headers. libfmt has to be added as a private dependency of torch_cuda and torch_hip because they include torch/csrc/distributed/c10d/Utils.hpp, which uses libfmt.
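For illustration only (this is not code from the PR): a header-only helper in the spirit of Utils.hpp might format an error message with libfmt as below, which is why every target that includes the header, including torch_cuda and torch_hip, must be able to resolve libfmt symbols at link time.

```cpp
// Hypothetical sketch of libfmt use in a header; names are illustrative.
#include <fmt/format.h>
#include <stdexcept>

inline void checkRank(int rank, int world_size) {
  if (rank < 0 || rank >= world_size) {
    // The fmt::format call is instantiated in every including target,
    // so that target needs libfmt as a (private) link dependency.
    throw std::invalid_argument(
        fmt::format("Invalid rank {} for world size {}", rank, world_size));
  }
}
```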
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124987
Approved by: https://github.com/malfet
Currently we print out the mismatched collectives, but it is hard to
tell exactly what the mismatch is. This diff adds functionality to detect the exact mismatch
and report it.
The new error is as follows:
```
Detected mismatch between collectives on ranks. Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=1151423632, OpType=ALLREDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=1151423632, OpType=REDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))). Collectives differ in the following aspects: Op type: ALLREDUCE vs REDUCE
```
i.e., the "Collectives differ in the following..." message is added.
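A minimal sketch of the comparison, with illustrative field names rather than the actual `CollectiveFingerPrint` members: each field of the two fingerprints is compared and a note is appended for every field that differs.

```cpp
// Illustrative only: toy fingerprint plus a field-by-field diff that
// produces the "Collectives differ in the following aspects" text.
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

struct ToyFingerprint {
  std::string op_type;
  std::vector<int64_t> shape;
  std::string dtype;
};

std::string diffFingerprints(const ToyFingerprint& a, const ToyFingerprint& b) {
  std::ostringstream oss;
  oss << "Collectives differ in the following aspects: ";
  if (a.op_type != b.op_type) {
    oss << "Op type: " << a.op_type << " vs " << b.op_type << "; ";
  }
  if (a.shape != b.shape) {
    oss << "Tensor shape mismatch; ";
  }
  if (a.dtype != b.dtype) {
    oss << "Tensor dtype: " << a.dtype << " vs " << b.dtype << "; ";
  }
  return oss.str();
}
```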
Differential Revision: [D45375737](https://our.internmc.facebook.com/intern/diff/D45375737/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100214
Approved by: https://github.com/H-Huang
Previously, the mismatch report would not give the full details of the
collective running on the mismatched rank; it would look something like:
```
Detected mismatch between collectives on ranks. Rank 26 is running collective: CollectiveFingerPrint(SequenceNumber=683057617, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=513876813OpType=BROADCAST).
```
i.e., Rank 1's entry is missing details such as the tensor shape and dtype.
This was due to the `num_tensors` field not being populated, which `operator<<`
checks to determine whether to print additional information such as the tensor
shape.
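A minimal sketch of that gating, with a toy struct rather than the real `CollectiveFingerPrint`: when `num_tensors` is left at zero, the printer stops after the op type, which is exactly the truncated output shown above.

```cpp
// Illustrative only: tensor details are printed only when num_tensors > 0,
// so a fingerprint deserialized without this field loses the details.
#include <cstddef>
#include <cstdint>
#include <ostream>
#include <string>
#include <vector>

struct ToyFingerprint {
  uint64_t sequence_number = 0;
  std::string op_type;
  size_t num_tensors = 0;            // must be populated on deserialization
  std::vector<int64_t> tensor_shape;
};

std::ostream& operator<<(std::ostream& os, const ToyFingerprint& fp) {
  os << "CollectiveFingerPrint(SequenceNumber=" << fp.sequence_number
     << ", OpType=" << fp.op_type;
  if (fp.num_tensors > 0) {  // gate on num_tensors before printing details
    os << ", TensorShape=[";
    for (size_t i = 0; i < fp.tensor_shape.size(); ++i) {
      os << (i > 0 ? ", " : "") << fp.tensor_shape[i];
    }
    os << "]";
  }
  return os << ")";
}
```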
Adding this field gives a better error:
```
Detected mismatch between collectives on ranks. Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=1564312518, OpType=ALLREDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=1564312518, OpType=REDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))).
```
Differential Revision: [D45372325](https://our.internmc.facebook.com/intern/diff/D45372325/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100213
Approved by: https://github.com/H-Huang
A number of OSS PRs were reverted because of new signed-unsigned comparison warnings, which are treated as errors in some internal builds.
Not sure how those selective rules are applied, but this PR removes `-Wno-sign-compare` from the PyTorch codebase.
The only tricky part in this PR is making sure that non-ASCII character detection works for both signed and unsigned chars here:
6e3d51b08a/torch/csrc/jit/serialization/python_print.cpp (L926)
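A minimal sketch of the idea, assuming nothing about the actual python_print.cpp logic: casting to `unsigned char` before comparing makes the check well-defined and warning-free whether `char` is signed or unsigned on the target.

```cpp
// Illustrative sketch only, not the code at the linked location.
#include <string>

bool has_non_ascii(const std::string& s) {
  for (char c : s) {
    // Cast to unsigned char first so the comparison is well-defined and
    // does not trip -Wsign-compare on platforms where char is signed.
    if (static_cast<unsigned char>(c) > 0x7F) {
      return true;
    }
  }
  return false;
}
```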
Several files are excluded from sign-compare checks when flash attention is used, due to a violation in cutlass, to be fixed by https://github.com/NVIDIA/cutlass/pull/869
Sign-compare violations in the caffe2 codebase are intentionally left unfixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96723
Approved by: https://github.com/albanD
…c10d
Fixes a broken header filter from #90699 and applies a few more relevant clang-tidy fixes from c10 and c10d. The header filter pattern was actually broken, and the clang-tidy include pattern was redundant. Also fixes a few bugs in torch/distributed/c10d.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91178
Approved by: https://github.com/ezyang
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330
### Implementation
Move backend-specific (NCCL, Gloo, etc.) collective implementations to the corresponding `Backend` class. Update ProcessGroup to support multiple backends and use the dispatcher to call backends based on tensor device type.
### Changes
#### c++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`)
- Update pybind definitions for new process group base class and new backend class
- Update the pybind-bound backend class with collective definitions to keep BC with Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`) which are used in tests
- Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, `ProcessGroupUCC` to derive from the `Backend` class.
- Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by querying the backend using the device type (a rough sketch of this lookup follows the list)
- Update the internal dispatched implementation of `barrier` to use a tensor, which allows the operation to be dispatched.
- Update the `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching it correctly; I still don't understand why and had originally filed an issue in 85122.
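A rough sketch of the lookup referenced above, using toy types rather than the actual `Ops.cpp`/dispatcher machinery: each dispatched kernel asks the process group for the backend registered for its device type and forwards the call.

```cpp
// Toy sketch only; the real code registers these ops with the c10
// dispatcher and uses c10::intrusive_ptr, Work, and option structs.
#include <map>
#include <memory>
#include <stdexcept>

enum class DeviceType { CPU, CUDA };

struct Backend {
  virtual ~Backend() = default;
  virtual void allreduce() = 0;  // collective signatures reduced for brevity
};

struct ProcessGroup {
  std::map<DeviceType, std::shared_ptr<Backend>> backends;

  Backend& getBackend(DeviceType device) {
    auto it = backends.find(device);
    if (it == backends.end()) {
      throw std::runtime_error("no backend registered for this device type");
    }
    return *it->second;
  }
};

// The CPU/CUDA kernels differ only in which backend they look up.
void allreduce_cpu(ProcessGroup& pg) { pg.getBackend(DeviceType::CPU).allreduce(); }
void allreduce_cuda(ProcessGroup& pg) { pg.getBackend(DeviceType::CUDA).allreduce(); }
```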
#### python changes (`distributed_c10d.py`, test files)
- Add BackendConfig class to specify the default configurations of backends and `get_backend_config()` API
- `get_backend()` deprecation warning
- `init_process_group` now returns a generic `ProcessGroup` object; it contains a list of backends (the ones stated above) to which it will dispatch operations.
- `new_group` updated to return the same as above
- Update `test_c10d_gloo.py`: update `DistributedDataParallelTest` to use `init_process_group`, update `ReducerTest`, and update `test_broadcast_coalesced_gloo` to move away from the PG instance and gloo options
- Update `test_c10d_nccl.py`: update `DistributedDataParallelTest` to use `init_process_group`
- Specific tests updated: `test_Backend_enum_class`
### Changes missing
- lazy initialization of backends
- support parsing of BackendConfig
### Open questions
- Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338)
# Example
This is a basic script (using 2 backends within a process group)
```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py
import torch.distributed as dist
import torch
import os
if __name__ == "__main__":
rank = os.environ.get("RANK")
# initialize with both gloo and nccl
dist.init_process_group()
# with gloo
dist.all_reduce(torch.tensor([1.0]))
print(f"Rank {rank} finished")
# with nccl
dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}"))
```
Test Plan: Imported from OSS
Differential Revision: D42069829
Pulled By: H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997
Approved by: https://github.com/awgu, https://github.com/fduwjj
Headers under torch/csrc/distributed may be referenced with a relative path, e.g., "<c10d/...>". However, relative paths cannot be gracefully handled by the Meta internal build when the NCCL PG is hipified to support AMD/RCCL, because the hipified header files are generated in other directories. Moreover, using absolute paths for header inclusion is the state of the art in most components of PyTorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.
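For illustration, the refactor amounts to switching includes like the following (using Utils.hpp as an example header):

```cpp
// Before: relative include, which breaks when hipified headers are
// generated in a different directory.
// #include <c10d/Utils.hpp>

// After: path rooted at the repository's include directory.
#include <torch/csrc/distributed/c10d/Utils.hpp>
```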
See D39835774 for more details about Meta internal complication.
**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options. Thus, use it to verify that we don't miss any relative-path uses of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera, https://github.com/huydhn
### Changes
- Move ProcessGroup::Work into its own class and update all the references to it / header includes.
#### Motivation
In future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. This change prevents a circular dependency, with ProcessGroup depending on Backend and Backend depending on ProcessGroup::Work.
Differential Revision: [D38839212](https://our.internmc.facebook.com/intern/diff/D38839212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83680
Approved by: https://github.com/kwen2501
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70326
See D24145988 for context: it allows loops such as `for(int i=0;i<10;i++)` to be expressed as `for(const auto i : c10::irange(10))`. This is nice because it auto-types the loops and adds const-safety to the iteration variable.
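For example, a minimal sketch (assuming a build with the c10 headers available):

```cpp
#include <c10/util/irange.h>
#include <iostream>

int main() {
  // Old style: the index type and bounds bookkeeping are spelled out by hand.
  for (int i = 0; i < 10; i++) {
    std::cout << i << ' ';
  }
  std::cout << '\n';

  // With c10::irange: the loop variable is auto-typed and const.
  for (const auto i : c10::irange(10)) {
    std::cout << i << ' ';
  }
  std::cout << '\n';
  return 0;
}
```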
Test Plan: buck run //caffe2/torch/fb/sparsenn:test
Reviewed By: r-barnes
Differential Revision: D33243400
fbshipit-source-id: b1f1b4163f4bf662031baea9e5268459b40c69a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68223
DETAIL debug mode didn't work with object-based collectives for the NCCL backend, because we'd only check whether the backend is NCCL before moving tensors to CUDA.
Instead, check whether it is a wrapped PG, and then check whether the wrapped pg is NCCL.
ghstack-source-id: 143242023
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D32366840
fbshipit-source-id: be0a2af6849f8f24446593f4a4fbea4a67586ee5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66167
Sometimes, due to desync, we see the PG wrapper's monitored barrier fail. In
this case it is useful to print info about the collective that was
trying to run along with the actual error.
ghstack-source-id: 140037653
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D31353021
fbshipit-source-id: e2a515326c9314c98119978d5566eb5431cca96c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66166
These methods should be private.
ghstack-source-id: 139782587
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D31353020
fbshipit-source-id: 583fb315cc2cacc37df3d29cd5793b42558930b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60237
Closes https://github.com/pytorch/pytorch/issues/58711
This diff refactors the collective consistency checking in `ProcessGroupWrapper` as described in the above issue. In particular, we no longer run separate verification checks (`all_gather`s) for shapes, op type, etc. Instead, we implement a function `serialize_fingerprint` to serialize all of this data into a single tensor and verify only that.
This has the benefit of being much more extensible: the developer does not need to add separate `all_gather` calls in order to verify additional data in the future. We could also provide a mechanism where data that needs to be verified is "registered" in the `CollectiveFingerPrint` struct, making it even easier to add additional data; we can consider doing this if there are significant additions to `ProcessGroupWrapper`.
We now also begin to check tensor `dtypes` and device types for consistency. Tests are refactored/added accordingly.
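A rough sketch of the single-tensor idea, with an illustrative field layout rather than the actual `serialize_fingerprint` implementation: every field to be verified is packed into one flat buffer, so a single `all_gather` suffices and new fields only extend the packing step.

```cpp
// Illustrative only: in the real code the packed data lives in an
// at::Tensor that each rank all_gathers and compares element-wise.
#include <cstdint>
#include <vector>

struct ToyFingerprint {
  int64_t op_type;                    // collective op as an enum value
  int64_t dtype;                      // scalar type as an enum value
  int64_t device_type;                // device type as an enum value
  std::vector<int64_t> tensor_shape;  // shape of the first tensor
};

std::vector<int64_t> serializeFingerprint(const ToyFingerprint& fp) {
  std::vector<int64_t> out;
  out.push_back(fp.op_type);
  out.push_back(fp.dtype);
  out.push_back(fp.device_type);
  out.push_back(static_cast<int64_t>(fp.tensor_shape.size()));
  out.insert(out.end(), fp.tensor_shape.begin(), fp.tensor_shape.end());
  return out;  // every rank exchanges this buffer and compares element-wise
}
```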
ghstack-source-id: 132520261
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D28597287
fbshipit-source-id: b09f14f628df9e2457623ba81fc13fd4e214f3c9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60543
Since c10d is now part of libtorch, it would also be nice if the sources all lived in one place.
ghstack-source-id: 132306292
Test Plan: It builds
Reviewed By: cbalioglu
Differential Revision: D29062002
fbshipit-source-id: d9e1301e9d73e1643fa0f0119cd2d618f1ad52e6