Summary:
Will not land before the release, but it would be good to have this function documented in master for its use in distributed debuggability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58322
Reviewed By: SciPioneer
Differential Revision: D28595405
Pulled By: rohan-varma
fbshipit-source-id: fb00fa22fbe97a38c396eae98a904d1c4fb636fa
Summary:
Added a simple section indicating that distributed profiling is expected to work similarly to other torch operators and is supported for all communication backends out of the box.
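A minimal sketch of what that section describes, assuming an already-initialized process group (the backend and tensor size below are placeholders): a collective shows up in the profiler output just like any other torch operator.
```python
import torch
import torch.distributed as dist
from torch.profiler import profile

def profile_allreduce():
    # assumes dist.init_process_group(...) has already been called
    t = torch.ones(1000)
    with profile() as prof:
        dist.all_reduce(t)
    print(prof.key_averages().table(sort_by="cpu_time_total"))
```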
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58286
Reviewed By: bdhirsh
Differential Revision: D28436489
Pulled By: rohan-varma
fbshipit-source-id: ce1905a987c0ede8011e8086a2c30edc777b4a38
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54277
all_to_all is already supported in the NCCL backend, so update the doc to reflect it.
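A hedged sketch of all_to_all under the NCCL backend, assuming one GPU per rank and a process group initialized with backend="nccl":
```python
import torch
import torch.distributed as dist

def run_all_to_all():
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = torch.device("cuda", rank)
    # rank r contributes one tensor for every destination rank
    inputs = [torch.full((1,), float(rank), device=device) for _ in range(world_size)]
    outputs = [torch.empty(1, device=device) for _ in range(world_size)]
    dist.all_to_all(outputs, inputs)
    # outputs[i] now holds the tensor sent by rank i
```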
Test Plan: Imported from OSS
Reviewed By: divchenko
Differential Revision: D27172904
Pulled By: wanchaol
fbshipit-source-id: 9afa89583d56b247b2017ea2350936053eb30827
Summary:
This PR proposes to improve the distributed doc:
* [x] putting the init functions together
* [x] moving post-init functions into their own sub-section, since they are only available after init, and placing that group after all the init sub-sections
If this is too much, could we at least put these 2 functions together:
```
.. autofunction:: init_process_group
.. autofunction:: is_initialized
```
as they are interconnected, and the other functions are not alphabetically sorted in the first place.
Thank you.
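For reference, a minimal sketch of how the two interconnect (the backend, rank, and world size below are placeholders, and `env://` assumes MASTER_ADDR/MASTER_PORT are set):
```python
import torch.distributed as dist

if not dist.is_initialized():
    dist.init_process_group(backend="gloo", init_method="env://",
                            rank=0, world_size=1)
assert dist.is_initialized()
```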
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52976
Reviewed By: albanD
Differential Revision: D26993933
Pulled By: mrshenli
fbshipit-source-id: 7cacbe28172ebb5849135567b1d734870b49de77
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48909
Adds these new APIs to the documentation
ghstack-source-id: 117965961
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D25363279
fbshipit-source-id: af6889d377f7b5f50a1a77a36ab2f700e5040150
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46075
Removes these from the public docs for now, as we are still
iterating on and formalizing these APIs. Will add them back once they are part of a
PyTorch release.
ghstack-source-id: 113928700
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D24211510
fbshipit-source-id: 3e36ff6990cf8e6ef72b6e524322ae06f9097aa2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45543
This PR adds documentation for the c10d Store to the public docs. Previously these docs were missing, even though we expose a lightly used (but potentially useful) Python API for our distributed key-value store.
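A hedged sketch of the Store API being documented, using TCPStore; the host, port, and world size are placeholders:
```python
from datetime import timedelta
import torch.distributed as dist

# rank 0 hosts the store; other ranks would connect with is_master=False
store = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True,
                      timeout=timedelta(seconds=30))
store.set("first_key", "first_value")
print(store.get("first_key"))  # b'first_value'
```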
ghstack-source-id: 113409195
Test Plan: Will verify screenshots by building the docs.
Reviewed By: pritamdamania87
Differential Revision: D24005598
fbshipit-source-id: 45c3600e7c3f220710e99a0483a9ce921d75d044
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43887
As part of addressing #23232, this PR adds support for `broadcast_object_list`, an API to broadcast arbitrary picklable objects to all the other ranks. This has been a long-requested feature, so it would be good for PyTorch to support it natively.
The implementation follows a similar approach to https://github.com/pytorch/pytorch/pull/42189. The input is a list of objects to be broadcast, and the operation is in place, meaning all ranks in the group will have their input list modified to contain the broadcast objects from the src rank.
Note that the API is designed to match the tensor-based collectives, other than supporting async_op. For now, it is a blocking call. If we see demand for async_op support, we will have to make more progress on merging Work/Future.
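A hedged sketch of the in-place semantics described above, assuming an already-initialized process group (the payload objects are placeholders):
```python
import torch.distributed as dist

if dist.get_rank() == 0:
    objects = ["foo", {"answer": 42}, 3.14]
else:
    # every rank must pass a list of the same length; its entries are overwritten
    objects = [None, None, None]
dist.broadcast_object_list(objects, src=0)
# after the call, every rank's `objects` matches rank 0's list
```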
ghstack-source-id: 111180436
Reviewed By: mrshenli
Differential Revision: D23422577
fbshipit-source-id: fa700abb86eff7128dc29129a0823e83caf4ab0e
Summary:
Some more cleanup now that we no longer support Python 2 or 3.5 on master and, eventually, in the PyTorch 1.6 release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35677
Differential Revision: D20838097
Pulled By: orionr
fbshipit-source-id: 95d553a1e8769f3baa395e0bc6d4ce7cd93236e9
Summary:
The original behavior of PyTorch c10d only supports the built-in c10d backends, such as
NCCL/Gloo/MPI. This patch extends c10d to support dynamically
loading third-party communication libraries that are derived from the ProcessGroup base class.
The related RFC is: https://github.com/pytorch/pytorch/issues/27955
This way, users just need to specify a third-party c10d backend name when invoking
torch.distributed.init_process_group(). The proposed logic will try to load the corresponding
c10d backend cpp extension automatically. For how to develop a new third-party c10d backend
through a cpp extension, please refer to test/cpp_extensions/cpp_c10d_extension.cpp.
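A hedged sketch of the intended usage; the backend name "my_backend" and the creator function are hypothetical placeholders standing in for what a cpp extension would provide:
```python
import torch.distributed as dist

# the extension registers its ProcessGroup constructor under a custom name
dist.Backend.register_backend("my_backend", create_my_process_group)  # hypothetical creator

# users then select it by name, just like the built-in nccl/gloo/mpi backends
dist.init_process_group(backend="my_backend", rank=0, world_size=1,
                        init_method="tcp://127.0.0.1:29500")
```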
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28068
Differential Revision: D19174838
Pulled By: agolynski
fbshipit-source-id: 3409a504a43ce7260e6f9d1207c00e87471fac62
Summary:
I don't know why the reduce_scatter collective operation is not documented, so I am adding it to the document.
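A minimal sketch of reduce_scatter, assuming the NCCL backend with one GPU per rank and an already-initialized process group:
```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
world_size = dist.get_world_size()
device = torch.device("cuda", rank)

# every rank contributes one chunk per destination rank and keeps one reduced chunk
input_list = [torch.full((2,), float(i), device=device) for i in range(world_size)]
output = torch.empty(2, device=device)
dist.reduce_scatter(output, input_list, op=dist.ReduceOp.SUM)
# rank r now holds the sum over all ranks of their input_list[r]
```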
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35274
Differential Revision: D20645850
Pulled By: mrshenli
fbshipit-source-id: 0a4458bff1a4e15a4593dd4dcc25e4e0f6e2265d
Summary:
We should recommend DistributedDataParallel (DDP) instead of DataParallel (DP). Hope we can also cherry-pick this for 1.5.
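A hedged sketch of the recommendation, assuming the process group is already initialized and each rank owns one GPU (the model is a placeholder):
```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

device = torch.device("cuda", torch.cuda.current_device())
model = nn.Linear(10, 10).to(device)
# instead of nn.DataParallel(model), wrap the per-rank replica in DDP
ddp_model = DDP(model, device_ids=[device.index])
```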
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35063
Differential Revision: D20549621
Pulled By: ngimel
fbshipit-source-id: 86b1b2134664065cc6070ea4212895f993eaf543
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27782
Warnings show up when running `make html` to build documentation. All of
the warnings are very reasonable and point to bugs in our docs. This PR
attempts to fix most of those warnings.
In the future we will add something to the CI that asserts that there
are no warnings in our docs.
Test Plan: - build and view changes locally
Differential Revision: D17887067
Pulled By: zou3519
fbshipit-source-id: 6bf4d08764759133b20983d6cd7f5d27e5ee3166
Summary:
With this change you can now list multiple interfaces separated by
commas. ProcessGroupGloo creates a single Gloo context for every device
in the list (a context represents a connection to every other
rank). For every collective that is called, it selects a context
in a round-robin fashion. The number of worker threads responsible for
executing the collectives is set to twice the number of devices.
If you have a single physical interface, and wish to employ increased
parallelism, you can also specify
`GLOO_SOCKET_IFNAME=eth0,eth0,eth0,eth0`. This makes ProcessGroupGloo
use 4 connections per rank, 4 I/O threads, and 8 worker threads
responsible for executing the collectives.
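A hedged sketch of the single-interface trick, assuming the usual env:// rendezvous variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are already set; eth0 is a placeholder interface name:
```python
import os
import torch.distributed as dist

# must be set before the process group (and its Gloo contexts) is created
os.environ["GLOO_SOCKET_IFNAME"] = "eth0,eth0,eth0,eth0"
dist.init_process_group(backend="gloo", init_method="env://")
```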
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22978
ghstack-source-id: 87006270
Differential Revision: D16339962
fbshipit-source-id: 9aa1dc93d8e131c1714db349b0cbe57e9e7266f1
Summary:
When I wrote the frontend API, it was designed so that users never use the default group directly in any functions; it should really be private.
All collectives are supposed to use either group.WORLD or anything that comes out of new_group. That was the initial design.
We need to add a TODO about removing group.WORLD one day. It exists for backward-compatibility reasons and adds lots of complexity.
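A minimal sketch of the intended surface, assuming an initialized group with at least two ranks: collectives either use group.WORLD implicitly or take a handle returned by new_group.
```python
import torch
import torch.distributed as dist

t = torch.ones(1)
dist.all_reduce(t)                    # defaults to group.WORLD
pair = dist.new_group(ranks=[0, 1])   # new_group must be called by every rank
if dist.get_rank() in (0, 1):
    dist.all_reduce(t, group=pair)
```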
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14767
Reviewed By: pietern
Differential Revision: D13330655
Pulled By: teng-li
fbshipit-source-id: ace107e1c3a9b3910a300b22815a9e8096fafb1c
Summary:
* s/environmental/environment/g
* Casing (CUDA, InfiniBand, Ethernet)
* Don't embed torch.multiprocessing.spawn but link to it (not part of the package)
* spawn _function_ instead of _utility_ (it's mentioned after the launch utility, which is a proper utility)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14605
Differential Revision: D13273480
Pulled By: pietern
fbshipit-source-id: da6b4b788134645f2dcfdd666d1bbfc9aabd97b1
Summary:
Removed an incorrect section. We don't support this. I wrote this from my memory :(
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14530
Differential Revision: D13253471
Pulled By: teng-li
fbshipit-source-id: c3f1ffc6c98ef8789157e885776e0b775ec47b15
Summary:
The doc covers pretty much everything we have on distributed for the PT1 stable release, tracked in https://github.com/pytorch/pytorch/issues/14080
Tested by previewing the Sphinx-generated webpages. All look good.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14444
Differential Revision: D13227675
Pulled By: teng-li
fbshipit-source-id: 752f00df096af38dd36e4a337ea2120ffea79f86
Summary:
Also add docs for get_backend, Backend, and reduce_op
Fixes #11803
cc pietern apaszke
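A hedged sketch of the helpers being documented, assuming an already-initialized process group:
```python
import torch.distributed as dist

backend = dist.get_backend()  # e.g. "gloo" or "nccl"
print(backend in (dist.Backend.GLOO, dist.Backend.NCCL, dist.Backend.MPI))
```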
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11830
Differential Revision: D9927991
Pulled By: SsnL
fbshipit-source-id: a2ffb70826241ba84264f36f2cb173e00b19af48