Commit Graph

94 Commits

Author SHA1 Message Date
Rohan Varma
071d49a970 Document monitored barrier (#58322)
Summary:
Will not land before the release, but it would be good to have this function documented in master for its use in distributed debuggability.
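
For reference, a minimal usage sketch (it assumes a Gloo process group has already been initialized, since monitored_barrier is documented for the Gloo backend):

```python
from datetime import timedelta

import torch.distributed as dist

# Assumes dist.init_process_group("gloo", ...) has already run on every rank.
# Raises on the calling rank if some rank fails to reach the barrier in time.
dist.monitored_barrier(timeout=timedelta(seconds=30))
```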

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58322

Reviewed By: SciPioneer

Differential Revision: D28595405

Pulled By: rohan-varma

fbshipit-source-id: fb00fa22fbe97a38c396eae98a904d1c4fb636fa
2021-05-21 19:04:57 -07:00
Rohan Varma
52bb8120b8 Mention distributed profiling in documentation (#58286)
Summary:
Added a simple section indicating that distributed profiling is expected to work similarly to other torch operators, and is supported for all communication backends out of the box.
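
As a rough illustration (not taken from the PR), profiling a collective looks the same as profiling any other op; the snippet assumes a CPU/Gloo process group, with NCCL requiring CUDA tensors instead:

```python
import torch
import torch.autograd.profiler as profiler
import torch.distributed as dist

# Assumes a process group is already initialized (e.g. Gloo on CPU).
with profiler.profile() as prof:
    t = torch.ones(10)
    dist.all_reduce(t)
print(prof.key_averages().table(sort_by="cpu_time_total"))
```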

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58286

Reviewed By: bdhirsh

Differential Revision: D28436489

Pulled By: rohan-varma

fbshipit-source-id: ce1905a987c0ede8011e8086a2c30edc777b4a38
2021-05-14 09:43:00 -07:00
Alexander Golynski
bc30c3165c Update docs for get_future support (#58107)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58107

Test Plan: Imported from OSS

Reviewed By: SciPioneer

Differential Revision: D28387374

Pulled By: agolynski

fbshipit-source-id: 70052afbb0b07ba341ea55f7ec30f7d9759b7bd4
2021-05-12 18:29:28 -07:00
Wanchao Liang
270d675f86 update distributed doc table for alltoall nccl (#54277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54277

alltoall is already supported in the NCCL backend, so update the doc to reflect it.
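
A minimal sketch of the list-based variant on NCCL (assuming each process has already been pinned to its own GPU):

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl", ...) has run and torch.cuda.set_device
# was called, so "cuda" resolves to this rank's GPU.
world_size = dist.get_world_size()
rank = dist.get_rank()
inputs = [torch.full((4,), float(rank), device="cuda") for _ in range(world_size)]
outputs = [torch.empty(4, device="cuda") for _ in range(world_size)]
dist.all_to_all(outputs, inputs)  # outputs[i] now holds rank i's chunk for this rank
```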

Test Plan: Imported from OSS

Reviewed By: divchenko

Differential Revision: D27172904

Pulled By: wanchaol

fbshipit-source-id: 9afa89583d56b247b2017ea2350936053eb30827
2021-03-19 15:35:10 -07:00
Stas Bekman
924c15c962 [doc] reorg dist init and non-init functions (#52976)
Summary:
This PR proposes to improve the distributed doc:

* [x] putting the init functions together
* [x] moving post-init functions into their own sub-section, since they are only available after init, and moving that group to after all the init sub-sections

If this is too much, could we at least put these 2 functions together:

```
.. autofunction:: init_process_group

.. autofunction:: is_initialized
```
as they are interconnected, and the other functions are not alphabetically sorted in the first place.
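
For context, a tiny sketch of why the two belong side by side (names and the env:// rendezvous are just illustrative):

```python
import torch.distributed as dist

# env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment.
if not dist.is_initialized():
    dist.init_process_group(backend="gloo", init_method="env://")
```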

Thank you.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52976

Reviewed By: albanD

Differential Revision: D26993933

Pulled By: mrshenli

fbshipit-source-id: 7cacbe28172ebb5849135567b1d734870b49de77
2021-03-12 08:48:18 -08:00
Joe Zhu
f2b43ddbf4 Update api doc for enabling TcpStore on Windows (#51847)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51847

Reviewed By: albanD

Differential Revision: D26405678

Pulled By: malfet

fbshipit-source-id: 073b675225b48d1732771583f8f2473e0fdcf35c
2021-02-11 14:44:03 -08:00
Nikita Shulga
76c6e12a5c Minor spelling updates (#52149)
Summary:
Add space between 'e.g.' and 'build'
'pacakge'->'package'

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52149

Reviewed By: osalpekar

Differential Revision: D26405824

Pulled By: malfet

fbshipit-source-id: 386390d3f31a9fc268b05902b9dca1deeaf626f9
2021-02-11 12:36:27 -08:00
Emilio Castillo
233e4ebdb6 Implement autograd functions for c10d communication operations (#40762)
Summary:
Closes https://github.com/pytorch/pytorch/issues/40702, Fixes https://github.com/pytorch/pytorch/issues/40690

Currently a work in progress, but I would appreciate some feedback. Functions should be double-differentiable.

Contrary to b35cdc5200/torch/nn/parallel/_functions.py, this PR generates a list of tensors instead of aggregating the received data in a single tensor. Is this behavior correct?

Thanks!
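
A rough sketch of how such autograd-aware collectives are used, assuming they end up under torch.distributed.nn.functional (the exact location and naming are up to this PR):

```python
import torch
import torch.distributed as dist
import torch.distributed.nn.functional as dist_fn  # assumed module path

# Assumes a process group is initialized; gradients flow back through the collective.
x = torch.randn(4, requires_grad=True)
gathered = dist_fn.all_gather(x)      # list with one tensor per rank
loss = torch.stack(gathered).sum()
loss.backward()                       # x.grad is populated on every rank
```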

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40762

Reviewed By: glaringlee

Differential Revision: D24758889

Pulled By: mrshenli

fbshipit-source-id: 79285fb4b791cae3d248f34e2aadb11c9ab10cce
2021-01-26 07:52:51 -08:00
Rohan Varma
d6b5f3ad98 Add object-based collective APIs to public docs (#48909)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48909

Adds these new APIs to the documentation
ghstack-source-id: 117965961
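
For readers, a small sketch of one of the documented object collectives (assuming an initialized group and picklable inputs):

```python
import torch.distributed as dist

rank = dist.get_rank()
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, {"rank": rank})
# gathered now holds one dict per rank, in rank order, on every process.
```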

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D25363279

fbshipit-source-id: af6889d377f7b5f50a1a77a36ab2f700e5040150
2020-12-07 14:30:25 -08:00
Rohan Varma
362d9a932e Remove object-based collective APIs from public docs (#46075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46075

Removes these from public docs for now as we are still
iterating/formalizing these APIs. Will add them back once they are part of a
PyTorch release.
ghstack-source-id: 113928700

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D24211510

fbshipit-source-id: 3e36ff6990cf8e6ef72b6e524322ae06f9097aa2
2020-10-09 09:24:51 -07:00
Rohan Varma
154347d82f Fix distributed documentation for asynchronous collective Work objects (#45709)
Summary:
Closes https://github.com/pytorch/pytorch/issues/42247. Clarifies some documentation related to `Work` object semantics (outputs of async collective functions). Clarifies the difference between CPU operations and CUDA operations (on the Gloo or NCCL backend), and provides an example where understanding the difference in a CUDA operation's wait() semantics is necessary for writing correct code.
![sync](https://user-images.githubusercontent.com/8039770/94875710-6f64e780-040a-11eb-8fb5-e94fd53534e5.png)
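
A hedged sketch of the semantics being clarified (assuming an NCCL group with one GPU per rank):

```python
import torch
import torch.distributed as dist

t = torch.ones(8, device="cuda")
work = dist.all_reduce(t, async_op=True)
work.wait()                # for NCCL this blocks the current CUDA stream, not the host
out = t * 2                # safe: queued on the same stream after the collective
torch.cuda.synchronize()   # only now is the result guaranteed visible on the host
```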

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45709

Reviewed By: ngimel

Differential Revision: D24171256

Pulled By: rohan-varma

fbshipit-source-id: 6365a569ef477b59eb2ac0a8a9a1c1f34eb60e22
2020-10-07 19:59:51 -07:00
Omkar Salpekar
3799ba83e5 [Docs] Adding Store API Docs (#45543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45543

This PR adds documentation for the c10d Store to the public docs. Previously these docs were missing although we exposed a lightly-used (but potentially useful) Python API for our distributed key-value store.
ghstack-source-id: 113409195
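
As a reference point, the kind of usage these docs describe (host and port are placeholders; the two constructors would normally run in different processes):

```python
from datetime import timedelta

import torch.distributed as dist

# Run in process 1 (the server).
server_store = dist.TCPStore("127.0.0.1", 29500, 2, True, timedelta(seconds=30))
# Run in process 2 (a client).
client_store = dist.TCPStore("127.0.0.1", 29500, 2, False)
# Any rank can then use the shared key-value API.
client_store.set("first_key", "first_value")
print(server_store.get("first_key"))  # b'first_value'
```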

Test Plan: Will verify screenshots by building the docs.

Reviewed By: pritamdamania87

Differential Revision: D24005598

fbshipit-source-id: 45c3600e7c3f220710e99a0483a9ce921d75d044
2020-10-02 11:16:56 -07:00
gunandrose4u
47debdca42 Document change for DDP enabled on Windows platform (#45392)
Summary:
Document change for DDP enabled on Windows platform

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45392

Reviewed By: gchanan

Differential Revision: D23962344

Pulled By: mrshenli

fbshipit-source-id: 8924c6ca36d68699871d8add3e0aab6542ea269c
2020-09-28 13:22:42 -07:00
Rohan Varma
fbea2ee917 broadcast_object API for c10d (#43887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43887

As part of addressing #23232, this PR adds support for `broadcast_object_list`, which is an API to broadcast arbitrary picklable objects to all the other ranks. This has been a long-requested feature, so it would be good for PyTorch to support it natively.

The implementation approach follows a similar approach to https://github.com/pytorch/pytorch/pull/42189. The input is a list of objects to broadcast, and the operation is in place, meaning every rank in the group has its input list modified to contain the objects broadcast from the src rank.

Note that the API is designed to match the tensor-based collectives, except that it does not support async_op. For now, it is a blocking call. If we see demand for async_op, we will have to make more progress on merging work/future to support it.
ghstack-source-id: 111180436
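
A minimal sketch of the in-place semantics described above (assuming an initialized group):

```python
import torch.distributed as dist

if dist.get_rank() == 0:
    objects = ["config.yaml", {"lr": 0.01}, 42]
else:
    objects = [None, None, None]  # must have the same length on every rank
dist.broadcast_object_list(objects, src=0)
# After the call, every rank's `objects` holds the values from rank 0.
```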

Reviewed By: mrshenli

Differential Revision: D23422577

fbshipit-source-id: fa700abb86eff7128dc29129a0823e83caf4ab0e
2020-09-01 18:54:17 -07:00
Shen Li
2f52748515 Publish all_gather_object and gather_object docs (#43772)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43772
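
For reference, a sketch of gather_object, the destination-only counterpart of all_gather_object (assuming an initialized group):

```python
import torch.distributed as dist

rank = dist.get_rank()
output = [None] * dist.get_world_size() if rank == 0 else None
dist.gather_object({"rank": rank}, output, dst=0)
# Only rank 0's `output` is filled; other ranks pass None.
```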

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23398495

Pulled By: rohan-varma

fbshipit-source-id: 032e1d628c0c0f2dec297226167471698c56b605
2020-08-31 13:28:00 -07:00
Shen Li
0edbe6b063 Add a link in RPC doc page to point to PT Distributed overview (#41108)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41108

Test Plan: Imported from OSS

Differential Revision: D22440751

Pulled By: mrshenli

fbshipit-source-id: 9e7b002091a3161ae385fdfcc26484ae8fc243bb
2020-07-08 14:00:05 -07:00
Shen Li
b982a6a247 Expose torch.distributed.is_available() API (#37021)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37021
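
A small sketch of the guard this API enables (the NCCL check is just an extra illustration):

```python
import torch

if torch.distributed.is_available():
    import torch.distributed as dist
    print("NCCL available:", dist.is_nccl_available())
```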

Test Plan: Imported from OSS

Differential Revision: D21164318

Pulled By: mrshenli

fbshipit-source-id: 08a446af342cbe54f3eb4994956ffa7ef4922bcf
2020-04-21 18:38:46 -07:00
Orion Reblitz-Richardson
2d8dbcd3ef Remove python2 and 3.5 from requirements.txt, README and docs (#35677)
Summary:
Some more cleanup now that we no longer support Python 2 or 3.5 on master and, eventually, in the PyTorch 1.6 release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35677

Differential Revision: D20838097

Pulled By: orionr

fbshipit-source-id: 95d553a1e8769f3baa395e0bc6d4ce7cd93236e9
2020-04-03 11:05:43 -07:00
Feng Tian
762270c51f add c10d dynamic loading mechanism and unit test (#28068)
Summary:
The original behavior of pytorch c10d only supports built-in c10d backends, such as
nccl/gloo/mpi. This patch is used to extend the c10d capability to support dynamically
loading 3rd party communication libraries which are derived from ProcessGroup base class.

related RFC is in: https://github.com/pytorch/pytorch/issues/27955

Through this way, user just need specify a 3rd party c10d backend name when invoking
torch.distributed.init_process_group(). The proposed logic will try to load corresponding
c10d backend cpp extension automatically. as for how to develop a new 3rd party c10d backend
through cpp extension, pls refer to test/cpp_extensions/cpp_c10d_extension.cpp
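
A hypothetical sketch of the user-facing flow (the `dummy_collectives` extension name and the "dummy" backend string are made up; the real registration happens inside the backend's C++ extension, as in the test file above):

```python
import torch.distributed as dist

# Hypothetical third-party C++ extension; importing it is assumed to register
# a ProcessGroup-derived backend under the name "dummy".
import dummy_collectives  # noqa: F401

dist.init_process_group(backend="dummy", init_method="env://")
```
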
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28068

Differential Revision: D19174838

Pulled By: agolynski

fbshipit-source-id: 3409a504a43ce7260e6f9d1207c00e87471fac62
2020-04-02 15:46:51 -07:00
Dhiraj D Kalamkar
945d7a7408 Add All-to-all comms support to distributed module and MPI backend (#32361)
Summary:
As described in https://github.com/pytorch/pytorch/issues/32345, this is a prototype implementation that adds an alltoall communication primitive to the torch.distributed module and the ProcessGroup abstract interface. It also implements alltoall in the ProcessGroupMPI backend.
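
A hedged sketch of the fixed-split variant (equal-sized chunks; values are illustrative):

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("mpi") has run (or another alltoall-capable backend).
world_size = dist.get_world_size()
rank = dist.get_rank()
inp = torch.arange(4 * world_size, dtype=torch.float32) + rank * 100
out = torch.empty_like(inp)
dist.all_to_all_single(out, inp)  # each rank sends/receives equal-sized chunks
```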

mnaumovfb JianpingChen066 dmudiger srinivas212 Jianhui-Li mshiryaev ftian1

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini xush6528 osalpekar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32361

Reviewed By: mrshenli

Differential Revision: D20635481

Pulled By: srinivas212

fbshipit-source-id: 3dd0af800ce55d02f02813cde550e3a0f1a287d2
2020-04-01 08:57:12 -07:00
Yuichiro Ueno
aadd0fda8b Document reduce_scatter collective operation (#35274)
Summary:
I don't know why the reduce_scatter collective operation is not documented, so I am adding it to the documentation.
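
A brief usage sketch (assuming an NCCL group with one GPU per rank):

```python
import torch
import torch.distributed as dist

world_size = dist.get_world_size()
# Each rank contributes world_size chunks and receives the reduced chunk for its rank.
inputs = [torch.full((4,), float(r), device="cuda") for r in range(world_size)]
output = torch.empty(4, device="cuda")
dist.reduce_scatter(output, inputs, op=dist.ReduceOp.SUM)
```
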
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35274

Differential Revision: D20645850

Pulled By: mrshenli

fbshipit-source-id: 0a4458bff1a4e15a4593dd4dcc25e4e0f6e2265d
2020-03-25 13:36:29 -07:00
Xiang Gao
df8d6eeb19 Update docs about DP and DDP for CUDA (#35063)
Summary:
We should recommend DDP instead of DP. Hopefully we can also cherry-pick this for 1.5.
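
For readers, a minimal DDP setup of the kind the updated docs recommend (LOCAL_RANK is assumed to be provided by the launcher):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; assumes dist.init_process_group("nccl", ...) has run.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = torch.nn.Linear(10, 10).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
```
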
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35063

Differential Revision: D20549621

Pulled By: ngimel

fbshipit-source-id: 86b1b2134664065cc6070ea4212895f993eaf543
2020-03-20 20:06:37 -07:00
Brian Wignall
e7fe64f6a6 Fix typos (#30606)
Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30606

Differential Revision: D18763028

Pulled By: mrshenli

fbshipit-source-id: 896515a2156d062653408852e6c04b429fc5955c
2019-12-02 20:17:42 -08:00
zou3519
23bffc4f14 Fix most documentation warnings (#27782)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27782

Warnings show up when running `make html` to build documentation. All of
the warnings are very reasonable and point to bugs in our docs. This PR
attempts to fix most of those warnings.

In the future we will add something to the CI that asserts that there
are no warnings in our docs.

Test Plan: - build and view changes locally

Differential Revision: D17887067

Pulled By: zou3519

fbshipit-source-id: 6bf4d08764759133b20983d6cd7f5d27e5ee3166
2019-10-13 10:34:01 -07:00
Yuxin Wu
23f963e4a8 Update distributed.rst (#23289)
Summary:
Different backends have been supported since https://github.com/pytorch/pytorch/pull/18595.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23289

Differential Revision: D16528229

Pulled By: soumith

fbshipit-source-id: 57753e84c015817661ba30835278ee3a899aa2d0
2019-07-26 16:55:52 -07:00
Pieter Noordhuis
95e822622b Enhance interpretation of GLOO_SOCKET_IFNAME (#22978)
Summary:
With this change you can now list multiple interfaces separated by
commas. ProcessGroupGloo creates a single Gloo context for every device
in the list (a context represents a connection to every other
rank). For every collective that is called, it selects a context
in round-robin fashion. The number of worker threads responsible for
executing the collectives is set to twice the number of devices.

If you have a single physical interface, and wish to employ increased
parallelism, you can also specify
`GLOO_SOCKET_IFNAME=eth0,eth0,eth0,eth0`.  This makes ProcessGroupGloo
use 4 connections per rank, 4 I/O threads, and 8 worker threads
responsible for executing the collectives.
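
Illustrative only (interface names are placeholders); the variable has to be set before the process group is created:

```python
import os

import torch.distributed as dist

os.environ["GLOO_SOCKET_IFNAME"] = "eth0,eth1"  # or "eth0,eth0,eth0,eth0" on one NIC
dist.init_process_group(backend="gloo", init_method="env://")
```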

Pull Request resolved: https://github.com/pytorch/pytorch/pull/22978
ghstack-source-id: 87006270

Differential Revision: D16339962

fbshipit-source-id: 9aa1dc93d8e131c1714db349b0cbe57e9e7266f1
2019-07-25 04:52:38 -07:00
Seungwon Park
6c7135decb fix typo: pytoch -> pytorch
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/19719

Differential Revision: D15080095

Pulled By: ezyang

fbshipit-source-id: b731a0fde87d25c63c1e3d4b9a9c2244e5ad84af
2019-04-25 06:40:40 -07:00
Teng Li
2d3cf98b49 Making dist.get_default_group private for PT1 release (#14767)
Summary:
When I wrote the frontend API, it was designed so that users would not use the default group directly in any functions. It should really be private.

All collectives are supposed to use either group.WORLD or anything that comes out of new_group. That was the initial design.

We need to make a TODO on removing group.WORLD one day. It exists for backward-compatibility reasons and adds lots of complexity.
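
A sketch of the intended usage pattern (collectives target either the implicit default group or an explicit subgroup from new_group; the default group object itself stays private):

```python
import torch
import torch.distributed as dist

t = torch.ones(1)
dist.all_reduce(t)                       # implicitly uses the default (WORLD) group
subgroup = dist.new_group(ranks=[0, 1])  # every rank must call new_group
if dist.get_rank() in (0, 1):
    dist.all_reduce(t, group=subgroup)
```
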
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14767

Reviewed By: pietern

Differential Revision: D13330655

Pulled By: teng-li

fbshipit-source-id: ace107e1c3a9b3910a300b22815a9e8096fafb1c
2018-12-04 19:22:24 -08:00
Pieter Noordhuis
3648c269e9 Misc distributed documentation updates (#14605)
Summary:
* s/environmental/environment/g
* Casing (CUDA, InfiniBand, Ethernet)
* Don't embed torch.multiprocessing.spawn but link to it (not part of the package)
* spawn _function_ instead of _utility_ (it's mentioned after the launch utility which is a proper utility)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14605

Differential Revision: D13273480

Pulled By: pietern

fbshipit-source-id: da6b4b788134645f2dcfdd666d1bbfc9aabd97b1
2018-11-29 21:51:43 -08:00
Teng Li
2b7345bcd5 PT1 distributed doc update (#14530)
Summary:
Removed an incorrect section. We don't support this. I wrote this from my memory :(
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14530

Differential Revision: D13253471

Pulled By: teng-li

fbshipit-source-id: c3f1ffc6c98ef8789157e885776e0b775ec47b15
2018-11-29 17:50:47 -08:00
Teng Li
a38ed0268e PT1 Stable Release Distributed Documentation (#14444)
Summary:
The doc covers pretty much everything we have had on distributed for the PT1 stable release, tracked in https://github.com/pytorch/pytorch/issues/14080

Tested by previewing the Sphinx-generated webpages. All look good.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14444

Differential Revision: D13227675

Pulled By: teng-li

fbshipit-source-id: 752f00df096af38dd36e4a337ea2120ffea79f86
2018-11-28 00:34:11 -08:00
Tongzhou Wang
044d00516c Rename DistBackend -> Backend (#11830)
Summary:
Also add docs for get_backend, Backend, and reduce_op
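
A tiny sketch of the documented helpers (assuming a Gloo group was initialized):

```python
import torch.distributed as dist

# Backend values compare as lowercase strings, e.g. "gloo".
assert dist.get_backend() == dist.Backend.GLOO
```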

fixes #11803

cc pietern apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11830

Differential Revision: D9927991

Pulled By: SsnL

fbshipit-source-id: a2ffb70826241ba84264f36f2cb173e00b19af48
2018-11-07 11:58:12 -08:00
Teng Li
3d5fd12488 Documentation for c10d: torch.distributed and deprecate the old distributed doc (#11450)
Summary:
This is the new documentation for the c10d release, and it also deprecates the old torch.distributed documentation.

This PR depends on https://github.com/pytorch/pytorch/pull/11405

and should only be landed after https://github.com/pytorch/pytorch/pull/11405 is landed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11450

Differential Revision: D9765504

Pulled By: teng-li

fbshipit-source-id: 48f38b27b8c270baf389f8e478ea226b9ecc63db
2018-09-11 02:10:28 -07:00
Tongzhou Wang
8e33451e2e Make torch.cuda.* take device objects; Update distributed docs (#10833)
Summary:
Commits:

1. Make `torch.cuda.*` take device objects
2. Update `torch.distributed` docs to emphasize calling `torch.cuda.set_device` before `init_process_group`
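
The ordering being emphasized, as a sketch (local_rank is assumed to come from the launcher):

```python
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)        # pin this process to its GPU first
dist.init_process_group(backend="nccl", init_method="env://")
```
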
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10833

Differential Revision: D9514241

Pulled By: SsnL

fbshipit-source-id: 2497464305fb1e63d6c495291a5744aaa7e2696e
2018-08-27 15:24:42 -07:00
Soumith Chintala
d4f6c84041 fix nccl distributed documentation 2018-05-17 18:03:54 -04:00
Teng Li
f5beff334b Added distributed docs on NCCL2 backend/functions and launch module (#6579) 2018-04-15 21:53:10 -04:00
Scott Sievert
3821fca0c6 DOC: i{send, recv} message order with MPI backend 2017-09-14 20:38:11 -04:00
Brett Koonce
08b4770adf minor spelling, intialize->initialize 2017-09-14 15:13:01 -04:00
Soumith Chintala
4fec5f658b add Bilinear to docs, fix reference 2017-09-11 20:12:27 -04:00
jekbradbury
7aa6bc516f add "Basics" section to distributed docs (#2433) 2017-08-24 17:07:20 -04:00
Kai Arulkumaran
11a14fd0fd Clarifications on setting up torch.distributed (#2475) 2017-08-18 09:21:04 -04:00
Adam Paszke
4f035f14de Add a support matrix for distributed backends 2017-07-21 14:19:46 -04:00
Soumith Chintala
81fd2bf2d0 fix some language / typos 2017-07-12 14:47:36 -04:00
Adam Paszke
8915e2710c Refactor scatter/gather and add distributed docs 2017-07-12 14:47:36 -04:00