pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Svetlana Karslioglu	d425da8bf3	Replace master with main in links and docs/conf.py (#100176 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/100176 Approved by: https://github.com/albanD, https://github.com/malfet	2023-05-02 18:20:32 +00:00
xiny	57bb4cd046	[Doc][Distributed] Add missing functions to distributed.rst (#89905 ) Add missing documents for `torch.distributed.all_to_all_single` and other functions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89905 Approved by: https://github.com/kit1980	2022-12-04 07:22:54 +00:00
Wanchao Liang	4451eb24e6	Move tensor_parallel out to distributed.tensor folder (#89878 ) This PR moves tensor parallel from torch.distributed._tensor.parallel to torch.distributed.tensor.parallel, to prepare for beta release Pull Request resolved: https://github.com/pytorch/pytorch/pull/89878 Approved by: https://github.com/fduwjj	2022-11-30 22:13:10 +00:00
Howard Huang	bc66ddb5cb	Add torch.distributed.DistBackendError exception type, thrown from C10D_NCCL_CHECK (#88134 ) Currently all of the distributed errors are thrown from the `TORCH_CHECK` macro which throws a generic `RuntimeError`. This change introduced a new error type `DistBackendError` which derives from `RuntimeError` to signify there was an error with the backend communication library. This allows for better error handling and analysis at higher levels in the stack. Motivation: https://docs.google.com/document/d/1j6VPOkC6znscliFuiDWMuMV1_fH4Abgdq7TCHMcXai4/edit#heading=h.a9rc38misyx8 Changes: - introduce new error type - Update `C10D_NCCL_CHECK` Sample script to demonstrate new error type ```python # python -m torch.distributed.run --nproc_per_node=2 <script>.py import torch import torch.distributed as dist if __name__ == "__main__": dist.init_process_group("nccl") dist.broadcast(torch.tensor([1, 2, 3]).cuda(), 0) ``` Differential Revision: [D40998803](https://our.internmc.facebook.com/intern/diff/D40998803) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88134 Approved by: https://github.com/rohan-varma	2022-11-08 13:26:42 +00:00
Masaki Kozuki	28593a8339	[docs] `batch_isend_irecv` and `P2POp` of torch.distributed (#86438 ) Reopening https://github.com/pytorch/pytorch/pull/79722 cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu Pull Request resolved: https://github.com/pytorch/pytorch/pull/86438 Approved by: https://github.com/kit1980	2022-10-25 00:11:50 +00:00
Howard Huang	cc9183eb4c	Update distributed.rst backend collective support chart (#86406 ) NCCL `scatter` was added by Wanchao in https://github.com/pytorch/pytorch/pull/70029 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86406 Approved by: https://github.com/wanchaol	2022-10-07 12:59:09 +00:00
Ke Wen	05d1128106	[c10d] Start deprecating *_multigpu APIs (#85961 ) ### Deprecation reasons: - For most users training is on one GPU per process so these APIs are rarely used - They added one more API dimension - They can be expressed in a composed manner - They are not abstracted – specific to GPU - They caused backend APIs and implementations to have nested `std::vector<std::vector<Tensor>>`, which is hard to read or maintain Pull Request resolved: https://github.com/pytorch/pytorch/pull/85961 Approved by: https://github.com/XilunWu, https://github.com/H-Huang	2022-10-01 00:59:39 +00:00
Ke Wen	ade1c19612	Add reduce_scatter_tensor in place of _reduce_scatter_base (#85867 ) This is a twin PR similar to the one for `all_gather_into_tensor` (#85686). The philosophy for renaming `_reduce_scatter_base` instead of merging it is described in #85686. Cc @rohan-varma @H-Huang @crcrpar @ptrblck @mrshenli Pull Request resolved: https://github.com/pytorch/pytorch/pull/85867 Approved by: https://github.com/crcrpar, https://github.com/H-Huang	2022-09-30 05:48:16 +00:00
Ke Wen	775a22c7c6	Add all_gather_into_tensor in place of _all_gather_base (#85686 ) ### Description - This PR renames `_all_gather_base` to `all_gather_into_tensor` so that it is clearer in meaning. - The `all_gather_into_tensor` API differs from the `all_gather` API in the output it accepts -- a single, large tensor instead of a list of tensors. - This PR also adds deprecation warning to `_all_gather_base`. ### Issue `_all_gather_base` was implemented in https://github.com/pytorch/pytorch/pull/33924 to avoid unnecessary flattening. There was previous effort (#82639) to merge `_all_gather_base` with the existing `all_gather` API by detecting the parameter type passed in for the output. There are, however, two "blockers" that make the merge difficult: (i) The merge leads to backward compatibility break. We would need to change the parameter name `tensor_list` in `all_gather` to a general name `output` that can cover both tensor and tensor list. (ii) Recently, the `all_gather` API has added uneven tensor support, utilizing the tensor boundaries implied by the list. We are, however, not sure to add such support to the `_all_gather_base` function, because that would require users to pass in additional tensor boundary information. In view of the above, we decided to productize `_all_gather_base` as a separate function, but with a clearer name. ### Testing Added tests: - `test_all_gather_into_cat_tensor_cuda` -- output form as with `torch.cat`. For example: ``` >>> tensor_in tensor([1, 2], device='cuda:0') # Rank 0 tensor([3, 4], device='cuda:1') # Rank 1 >>> tensor_out tensor([1, 2, 3, 4], device='cuda:0') # Rank 0 tensor([1, 2, 3, 4], device='cuda:1') # Rank 1 ``` - `test_all_gather_into_stack_tensor_cuda` -- output form as with `torch.stack`. For example: ``` >>> tensor_out2 tensor([[1, 2], [3, 4]], device='cuda:0') # Rank 0 tensor([[1, 2], [3, 4]], device='cuda:1') # Rank 1 ``` The output form is determined by the shape of the output tensor passed by the user, no flag used. Cc @rohan-varma @mrshenli @crcrpar @ptrblck @H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85686 Approved by: https://github.com/rohan-varma, https://github.com/crcrpar	2022-09-27 22:50:22 +00:00
Shawn Zhong	9c902f4749	Add `TORCH_CPP_LOG_LEVEL` to the docs Fixes #70667 `TORCH_CPP_LOG_LEVEL=INFO` is needed for `TORCH_DISTRIBUTED_DEBUG` to be effective. For reference, https://github.com/pytorch/pytorch/pull/71746 introduced the environment variable `TORCH_CPP_LOG_LEVEL` and https://github.com/pytorch/pytorch/pull/73361 documented it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/76625 Approved by: https://github.com/rohan-varma	2022-05-03 17:01:11 +00:00
Ke Wen	1f04a00ccf	[PyTorch Distributed] Update documentation about NCCL environment variables (#74006 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74006 updated recommendations about environment variables to use during debug and performance tuning Test Plan: `make html` Reviewed By: rohan-varma Differential Revision: D34767454 fbshipit-source-id: 08cd58469bf72b58702e50e82020fa19b43b5911 (cherry picked from commit ac7e6630f8043f85d3d16be17c6a8ad1ebb2990c)	2022-03-11 23:57:17 +00:00
Alban Desmaison	734281c3d6	Cleanup all module references in doc (#73983 ) Summary: Working towards https://docs.google.com/document/d/10yx2-4gs0gTMOimVS403MnoAWkqitS8TUHX73PN8EjE/edit?pli=1# This PR: - Ensure that all the submodules are listed in a rst file (that ensure they are considered by the coverage tool) - Remove some long deprecated code that just error out on import - Remove the allow list altogether to ensure nothing gets added back there Pull Request resolved: https://github.com/pytorch/pytorch/pull/73983 Reviewed By: anjali411 Differential Revision: D34787908 Pulled By: albanD fbshipit-source-id: 163ce61e133b12b2f2e1cbe374f979e3d6858db7 (cherry picked from commit c9edfead7a01dc45bfc24eaf7220d2a84ab1f62e)	2022-03-10 22:26:29 +00:00
Can Balioglu	0e7a7a5fe7	Add documentation for c10d log levels (#73361 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73361 This PR adds the documentation for the newly introduced `TORCH_CPP_LOG_LEVEL` and how it can be used along with `TORCH_DISTRIBUTED_DEBUG` to adjust the log level of c10d. ghstack-source-id: 149874995 Test Plan: Locally rendered and checked the documentation. Reviewed By: rohan-varma Differential Revision: D34452352 fbshipit-source-id: ecb54590f3030ddef9921a7152ca9f7fc9438345 (cherry picked from commit f4c7c6f3b27dbd3006686cf26a6e9e53cd2c8f09)	2022-02-24 20:38:15 +00:00
Can Balioglu	e1db2f13ce	Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166 This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started. ghstack-source-id: 149778566 Test Plan: Run the existing unit tests. Reviewed By: rohan-varma Differential Revision: D34371226 fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b (cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)	2022-02-24 02:33:05 +00:00
Wanchao Liang	9b53d3194c	Implement gather primitive for ProcessGroupNCCL (#66745 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66745 This PR implement NCCL gather and add gather to ProcessGroupNCCL using nccl send/recv api. NCCL doesn’t directly provide primitives for gather, so we need to be implemented on top of NCCL’s send/recv API. 1. In ProcessGroupNCCL.cpp, the outputTensors are first flattened, then inputTensors and outputFlattened are passed by the collective class to gather() function in nccl.cpp. 1. In nccl.cpp, gather is implemented using ncclSend/ncclRecv: all the ranks send inputTensor to the root rank, and the root rank uses a for loop to receive these inputTensors. ghstack-source-id: 147754838 Test Plan: test_gather_ops test_gather_checks test_gather_stress Reviewed By: pritamdamania87 Differential Revision: D29616361 fbshipit-source-id: b500d9b8e67113194c5cc6575fb0e5d806dc7782 (cherry picked from commit `d560ee732e`)	2022-01-27 19:37:55 +00:00
Shen Li	7bc220e060	Update distributed.rst for ProcessGroup Extensions (#71482 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71482 cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D33745986 Pulled By: mrshenli fbshipit-source-id: fe2d0491901bf00be09deb5c556bc1e1d359b725 (cherry picked from commit `be5104bfd7`)	2022-01-25 00:30:08 +00:00
mrshenli	b8c3693281	Remove autograd-enabled collective APIs from distributed docs (#69011 ) Summary: These APIs are not yet officially released and are still under discussion. Hence, this commit removes those APIs from docs and will add them back when ready. cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/69011 Reviewed By: fduwjj Differential Revision: D32703124 Pulled By: mrshenli fbshipit-source-id: ea049fc7ab6b0015d38cc40c5b5daf47803b7ea0	2021-11-29 18:14:50 -08:00
Yi Wang	7f25c3e666	Update distributed.rst to show that CUDA send/recv on GPU is supported (#65601 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65601 I believe this feature was supported one year ago: https://github.com/pytorch/pytorch/pull/44921 #Closes: https://github.com/pytorch/pytorch/issues/65525 ghstack-source-id: 138918961 Test Plan: N/A Reviewed By: pritamdamania87, mingzhe09088 Differential Revision: D31163535 fbshipit-source-id: 9321a0a5137a3e265e2b54bd78730ac28c7acd55	2021-09-24 12:30:10 -07:00
Kiuk Chung	9d95d48567	(torch.distributed) Add torch.distributed.is_torchelastic_launched() util method + make init_method=tcp:// compatible with torchelastic (#63910 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63910 Addresses the current issue that `init_method=tcp://` is not compatible with `torch.distributed.run` and `torch.distributed.launch`. When running with a training script that initializes the process group with `init_method=tcp://localhost:$port` as such: ``` $ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py ``` An `Address in use` error is raised since the training script tries to create a TCPStore on port 6000, which is already taken since the elastic agent is already running a TCPStore on that port. For details see: https://github.com/pytorch/pytorch/issues/63874. This change does a couple of things: 1. Adds `is_torchelastic_launched()` check function that users can use in the training scripts to see whether the script is launched via torchelastic. 1. Update the `torch.distributed` docs page to include the new `is_torchelastic_launched()` function. 1. Makes `init_method=tcp://` torchelastic compatible by modifying `_tcp_rendezvous_handler` in `torch.distributed.rendezvous` (this is NOT the elastic rendezvous, it is the old rendezvous module which is slotted for deprecation in future releases) to check `is_torchelastic_launched()` AND `torchelastic_use_agent_store()` and if so, only create TCPStore clients (no daemons, not even for rank 0). 1. Adds a bunch of unittests to cover the different code paths NOTE: the issue mentions that we should fail-fast with an assertion on `init_method!=env://` when `is_torchelastic_launched()` is `True`. There are three registered init_methods in pytorch: env://, tcp://, file://. Since this diff makes tcp:// compatible with torchelastic and I've validated that file is compatible with torchelastic. There is no need to add assertions. I did update the docs to point out that env:// is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future but this is out of scope for this issue. Test Plan: Unittests. Reviewed By: cbalioglu Differential Revision: D30529984 fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5	2021-08-25 22:57:43 -07:00
Howard Huang	cdc027679b	Add compare_set in distributed docs (#61351 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61351 Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D29588206 Pulled By: H-Huang fbshipit-source-id: 9db48e7b6de29503275f10616470ad2d66b075f9	2021-07-08 12:30:32 -07:00
Rohan Varma	2f395f3b54	[reland] Document debugability features in torch.distributed (#59726 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59726 Reland of https://github.com/pytorch/pytorch/pull/59604 with indentation fix ghstack-source-id: 130979356 Test Plan: ci Reviewed By: SciPioneer Differential Revision: D29001923 fbshipit-source-id: 225d9dc5054c223b453f3b39749e2b62f61b9a2c	2021-06-09 16:40:11 -07:00
Luca Wehrstedt	f1786b293d	Revert D28972444: [pytorch][PR] Document debugability features in torch.distributed Test Plan: revert-hammer Differential Revision: D28972444 (`a9d2810817`) Original commit changeset: da5e8ee84f0d fbshipit-source-id: 94d3b3b75ddec74ea5b2b76f6a7519dc921ee2a7	2021-06-09 03:04:36 -07:00
Rohan Varma	a9d2810817	Document debugability features in torch.distributed (#59604 ) Summary: Adds comprehensive documentation around debugability features added to `torch.distributed` recently, including the `monitored_barrier` and TORCH_DISTRIBUTED_DEBUG env variable. ![dist_one](https://user-images.githubusercontent.com/8039770/121102672-0f052180-c7b3-11eb-974c-81dbbe102cb6.png) ![dist_two](https://user-images.githubusercontent.com/8039770/121102734-39ef7580-c7b3-11eb-94f7-c75469351440.png) Pull Request resolved: https://github.com/pytorch/pytorch/pull/59604 Reviewed By: jbschlosser, SciPioneer Differential Revision: D28972444 Pulled By: rohan-varma fbshipit-source-id: da5e8ee84f0d6f252c703c4d70ff2a0d5817cc4e	2021-06-08 23:52:19 -07:00
Rohan Varma	071d49a970	Document monitored barrier (#58322 ) Summary: Will not land before the release, but would be good to have this function documented in master for its use in distributed debugability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/58322 Reviewed By: SciPioneer Differential Revision: D28595405 Pulled By: rohan-varma fbshipit-source-id: fb00fa22fbe97a38c396eae98a904d1c4fb636fa	2021-05-21 19:04:57 -07:00
Rohan Varma	52bb8120b8	Mention distributed profiling in documentation (#58286 ) Summary: Added a simple section indicating distributed profiling is expected to work similar to other torch operators, and is supported for all communication backends out of the box. Pull Request resolved: https://github.com/pytorch/pytorch/pull/58286 Reviewed By: bdhirsh Differential Revision: D28436489 Pulled By: rohan-varma fbshipit-source-id: ce1905a987c0ede8011e8086a2c30edc777b4a38	2021-05-14 09:43:00 -07:00
Alexander Golynski	bc30c3165c	Update docs for get_future support (#58107 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58107 Test Plan: Imported from OSS Reviewed By: SciPioneer Differential Revision: D28387374 Pulled By: agolynski fbshipit-source-id: 70052afbb0b07ba341ea55f7ec30f7d9759b7bd4	2021-05-12 18:29:28 -07:00
Wanchao Liang	270d675f86	update distributed doc table for alltoall nccl (#54277 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54277 alltoall already supported in nccl backend, so update the doc to reflect it. Test Plan: Imported from OSS Reviewed By: divchenko Differential Revision: D27172904 Pulled By: wanchaol fbshipit-source-id: 9afa89583d56b247b2017ea2350936053eb30827	2021-03-19 15:35:10 -07:00
Stas Bekman	924c15c962	[doc] reorg dist init and non-init functions (#52976 ) Summary: This PR proposes to improve the distributed doc: * [x] putting the init functions together * [x] moving post-init functions into their own sub-section as they are only available after init and moving that group to after all init sub-sections If this is too much, could we at least put these 2 functions together: ``` .. autofunction:: init_process_group .. autofunction:: is_initialized ``` as they are interconnected. and the other functions are not alphabetically sorted in the first place. Thank you. Pull Request resolved: https://github.com/pytorch/pytorch/pull/52976 Reviewed By: albanD Differential Revision: D26993933 Pulled By: mrshenli fbshipit-source-id: 7cacbe28172ebb5849135567b1d734870b49de77	2021-03-12 08:48:18 -08:00
Joe Zhu	f2b43ddbf4	Update api doc for enabling TcpStore on Windows (#51847 ) Summary: Fixes #{issue number} Pull Request resolved: https://github.com/pytorch/pytorch/pull/51847 Reviewed By: albanD Differential Revision: D26405678 Pulled By: malfet fbshipit-source-id: 073b675225b48d1732771583f8f2473e0fdcf35c	2021-02-11 14:44:03 -08:00
Nikita Shulga	76c6e12a5c	Minor spelling updates (#52149 ) Summary: Add space between 'e.g.' and 'build' 'pacakge'->'package' Pull Request resolved: https://github.com/pytorch/pytorch/pull/52149 Reviewed By: osalpekar Differential Revision: D26405824 Pulled By: malfet fbshipit-source-id: 386390d3f31a9fc268b05902b9dca1deeaf626f9	2021-02-11 12:36:27 -08:00
Emilio Castillo	233e4ebdb6	Implement autograd functions for c10d communication operations (#40762 ) Summary: Closes https://github.com/pytorch/pytorch/issues/40702, Fixes https://github.com/pytorch/pytorch/issues/40690 Currently wip. But I would appreciate some feedback. Functions should be double-differentiable. Contrary to `b35cdc5200/torch/nn/parallel/_functions.py` This PR generates list of tensors instead of aggregating the received data in a single tensor. Is this behavior correct? Thanks! Pull Request resolved: https://github.com/pytorch/pytorch/pull/40762 Reviewed By: glaringlee Differential Revision: D24758889 Pulled By: mrshenli fbshipit-source-id: 79285fb4b791cae3d248f34e2aadb11c9ab10cce	2021-01-26 07:52:51 -08:00
Rohan Varma	d6b5f3ad98	Add object-based collective APIs to public docs (#48909 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48909 Adds these new APIs to the documentation ghstack-source-id: 117965961 Test Plan: CI Reviewed By: mrshenli Differential Revision: D25363279 fbshipit-source-id: af6889d377f7b5f50a1a77a36ab2f700e5040150	2020-12-07 14:30:25 -08:00
Rohan Varma	362d9a932e	Remove object-based collective APIs from public docs (#46075 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46075 Removes these from public docs for now as we are still iterating/formalizing these APIs. Will add them back once they are part of a PyTorch release. ghstack-source-id: 113928700 Test Plan: CI Reviewed By: mrshenli Differential Revision: D24211510 fbshipit-source-id: 3e36ff6990cf8e6ef72b6e524322ae06f9097aa2	2020-10-09 09:24:51 -07:00
Rohan Varma	154347d82f	Fix distributed documentation for asynchronous collective Work objects (#45709 ) Summary: Closes https://github.com/pytorch/pytorch/issues/42247. Clarifies some documentation related to `Work` object semantics (outputs of async collective functions). Clarifies the difference between CPU operations and CUDA operations (on Gloo or NCCL backend), and provides an example where the difference in CUDA operation's wait() semantics is necessary to understand for correct code. ![sync](https://user-images.githubusercontent.com/8039770/94875710-6f64e780-040a-11eb-8fb5-e94fd53534e5.png) Pull Request resolved: https://github.com/pytorch/pytorch/pull/45709 Reviewed By: ngimel Differential Revision: D24171256 Pulled By: rohan-varma fbshipit-source-id: 6365a569ef477b59eb2ac0a8a9a1c1f34eb60e22	2020-10-07 19:59:51 -07:00
Omkar Salpekar	3799ba83e5	[Docs] Adding Store API Docs (#45543 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45543 This PR adds documentation for the c10d Store to the public docs. Previously these docs were missing although we exposed a lightly-used (but potentially useful) Python API for our distributed key-value store. ghstack-source-id: 113409195 Test Plan: Will verify screenshots by building the docs. Reviewed By: pritamdamania87 Differential Revision: D24005598 fbshipit-source-id: 45c3600e7c3f220710e99a0483a9ce921d75d044	2020-10-02 11:16:56 -07:00
gunandrose4u	47debdca42	Document change for DDP enabled on Windows platform (#45392 ) Summary: Document change for DDP enabled on Windows platform Pull Request resolved: https://github.com/pytorch/pytorch/pull/45392 Reviewed By: gchanan Differential Revision: D23962344 Pulled By: mrshenli fbshipit-source-id: 8924c6ca36d68699871d8add3e0aab6542ea269c	2020-09-28 13:22:42 -07:00
Rohan Varma	fbea2ee917	broadcast_object API for c10d (#43887 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43887 As part of addressing #23232, this PR adds support for `broadcast_object_list` which is an API to broadcast arbitrary picklable objects to all the other ranks. This has been a long-requested feature, so would be good for Pytorch to natively support this. The implementation approach follows a similar approach as https://github.com/pytorch/pytorch/pull/42189. The input is a list of objects to be broadcasted and it is in place, meaning all ranks part of the group will have their input list modified to contain the broadcasted objects from the src rank. Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this. ghstack-source-id: 111180436 Reviewed By: mrshenli Differential Revision: D23422577 fbshipit-source-id: fa700abb86eff7128dc29129a0823e83caf4ab0e	2020-09-01 18:54:17 -07:00
Shen Li	2f52748515	Publish all_gather_object and gather_object docs (#43772 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43772 Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D23398495 Pulled By: rohan-varma fbshipit-source-id: 032e1d628c0c0f2dec297226167471698c56b605	2020-08-31 13:28:00 -07:00
Shen Li	0edbe6b063	Add a link in RPC doc page to point to PT Distributed overview (#41108 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41108 Test Plan: Imported from OSS Differential Revision: D22440751 Pulled By: mrshenli fbshipit-source-id: 9e7b002091a3161ae385fdfcc26484ae8fc243bb	2020-07-08 14:00:05 -07:00
Shen Li	b982a6a247	Expose torch.distributed.is_available() API (#37021 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37021 Test Plan: Imported from OSS Differential Revision: D21164318 Pulled By: mrshenli fbshipit-source-id: 08a446af342cbe54f3eb4994956ffa7ef4922bcf	2020-04-21 18:38:46 -07:00
Orion Reblitz-Richardson	2d8dbcd3ef	Remove python2 and 3.5 from requirements.txt, README and docs (#35677 ) Summary: Some more cleanup now that we no longer support python2 or 3.5 on master and eventually PyTorch 1.6 release. Pull Request resolved: https://github.com/pytorch/pytorch/pull/35677 Differential Revision: D20838097 Pulled By: orionr fbshipit-source-id: 95d553a1e8769f3baa395e0bc6d4ce7cd93236e9	2020-04-03 11:05:43 -07:00
Feng Tian	762270c51f	add c10d dynamic loading mechanism and unit test (#28068 ) Summary: The original behavior of pytorch c10d only supports built-in c10d backends, such as nccl/gloo/mpi. This patch is used to extend the c10d capability to support dynamically loading 3rd party communication libraries which are derived from ProcessGroup base class. related RFC is in: https://github.com/pytorch/pytorch/issues/27955 Through this way, user just need specify a 3rd party c10d backend name when invoking torch.distributed.init_process_group(). The proposed logic will try to load corresponding c10d backend cpp extension automatically. as for how to develop a new 3rd party c10d backend through cpp extension, pls refer to test/cpp_extensions/cpp_c10d_extension.cpp Pull Request resolved: https://github.com/pytorch/pytorch/pull/28068 Differential Revision: D19174838 Pulled By: agolynski fbshipit-source-id: 3409a504a43ce7260e6f9d1207c00e87471fac62	2020-04-02 15:46:51 -07:00
Dhiraj D Kalamkar	945d7a7408	Add All-to-all comms support to distributed module and MPI backend (#32361 ) Summary: As described in https://github.com/pytorch/pytorch/issues/32345, a prototype implementation to add an alltoall communication primitive to torch.distributed module and ProcessGroup abstract interface. Also, implements alltoall in ProcessGroupMPI backend. mnaumovfb JianpingChen066 dmudiger srinivas212 Jianhui-Li mshiryaev ftian1 cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini xush6528 osalpekar Pull Request resolved: https://github.com/pytorch/pytorch/pull/32361 Reviewed By: mrshenli Differential Revision: D20635481 Pulled By: srinivas212 fbshipit-source-id: 3dd0af800ce55d02f02813cde550e3a0f1a287d2	2020-04-01 08:57:12 -07:00
Yuichiro Ueno	aadd0fda8b	Document reduce_scatter collective operation (#35274 ) Summary: I don't know why reduce_scatter collective operation is not documented so I add it to the document. Pull Request resolved: https://github.com/pytorch/pytorch/pull/35274 Differential Revision: D20645850 Pulled By: mrshenli fbshipit-source-id: 0a4458bff1a4e15a4593dd4dcc25e4e0f6e2265d	2020-03-25 13:36:29 -07:00
Xiang Gao	df8d6eeb19	Update docs about DP and DDP for CUDA (#35063 ) Summary: We should recommend DDP instead of DP. Hope we can also cherry-pick this for 1.5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/35063 Differential Revision: D20549621 Pulled By: ngimel fbshipit-source-id: 86b1b2134664065cc6070ea4212895f993eaf543	2020-03-20 20:06:37 -07:00
Brian Wignall	e7fe64f6a6	Fix typos (#30606 ) Summary: Should be non-semantic. Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos. Pull Request resolved: https://github.com/pytorch/pytorch/pull/30606 Differential Revision: D18763028 Pulled By: mrshenli fbshipit-source-id: 896515a2156d062653408852e6c04b429fc5955c	2019-12-02 20:17:42 -08:00
zou3519	23bffc4f14	Fix most documentation warnings (#27782 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27782 Warnings show up when running `make html` to build documentation. All of the warnings are very reasonable and point to bugs in our docs. This PR attempts to fix most of those warnings. In the future we will add something to the CI that asserts that there are no warnings in our docs. Test Plan: - build and view changes locally Differential Revision: D17887067 Pulled By: zou3519 fbshipit-source-id: 6bf4d08764759133b20983d6cd7f5d27e5ee3166	2019-10-13 10:34:01 -07:00
Yuxin Wu	23f963e4a8	Update distributed.rst (#23289 ) Summary: Different backend is supported since https://github.com/pytorch/pytorch/pull/18595 Pull Request resolved: https://github.com/pytorch/pytorch/pull/23289 Differential Revision: D16528229 Pulled By: soumith fbshipit-source-id: 57753e84c015817661ba30835278ee3a899aa2d0	2019-07-26 16:55:52 -07:00
Pieter Noordhuis	95e822622b	Enhance interpretation of GLOO_SOCKET_IFNAME (#22978 ) Summary: With this change you can now list multiple interfaces separated by comma. ProcessGroupGloo creates a single Gloo context for every device in the list (a context represents a connection to every other rank). For every collective that is called, it will select the context in a round robin fashion. The number of worker threads responsible for executing the collectives is set to be twice the number of devices. If you have a single physical interface, and wish to employ increased parallelism, you can also specify `GLOO_SOCKET_IFNAME=eth0,eth0,eth0,eth0`. This makes ProcessGroupGloo use 4 connections per rank, 4 I/O threads, and 8 worker threads responsible for executing the collectives. Pull Request resolved: https://github.com/pytorch/pytorch/pull/22978 ghstack-source-id: 87006270 Differential Revision: D16339962 fbshipit-source-id: 9aa1dc93d8e131c1714db349b0cbe57e9e7266f1	2019-07-25 04:52:38 -07:00
Seungwon Park	6c7135decb	fix typo: pytoch -> pytorch Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/19719 Differential Revision: D15080095 Pulled By: ezyang fbshipit-source-id: b731a0fde87d25c63c1e3d4b9a9c2244e5ad84af	2019-04-25 06:40:40 -07:00

1 2

67 Commits