Commit Graph

68 Commits

Author SHA1 Message Date
Mingzhe Li
66f9b1de1b [NCCL] enable p2p tests (#47797)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47797

NCCL p2p tests had hang issues before; the reason is that there were some unexpected context switches. For example, process 1, which is supposed to use only GPU 1, could end up using GPU 0 as a result of not explicitly setting the device.
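A minimal sketch of the fixed test pattern (rank count, tensor, and rendezvous address are illustrative, not the actual test code):

```python
import torch
import torch.distributed as dist

def run(rank: int, world_size: int):
    # Pin the CUDA device for this rank first, so NCCL p2p ops never create a
    # context on another rank's GPU (the missing step that caused the hangs).
    torch.cuda.set_device(rank)
    dist.init_process_group(
        "nccl", init_method="tcp://127.0.0.1:29500",
        rank=rank, world_size=world_size,
    )
    t = torch.ones(4, device="cuda")
    if rank == 0:
        dist.send(t, dst=1)
    elif rank == 1:
        dist.recv(t, src=0)
    dist.destroy_process_group()
```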
ghstack-source-id: 116461969

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D24863808

fbshipit-source-id: 92bd3a4874be8334210c7c8ee6363648893c963e
2020-11-12 10:44:50 -08:00
Omkar Salpekar
32b4b51254 [Docs] Minor doc fixes for init_process_group (#47644)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47644

Minor Update to the init_process_group docs.
ghstack-source-id: 116441798

Test Plan: CI

Reviewed By: jiayisuse, mrshenli

Differential Revision: D24633432

fbshipit-source-id: fbd38dab464ee156d119f9f0b22ffd0e416c4fd7
2020-11-11 15:21:30 -08:00
Xu Zhao
73a3e70b24 Add type annotations for torch._C._distributed_c10d module. (#46623)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46623

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24761606

Pulled By: xuzhao9

fbshipit-source-id: 827eaf2502e381ee24d36741c1613b4c08208569
2020-11-06 01:28:48 -08:00
Rohan Varma
c7183c9878 Fix object-based collectives API to use torch.cuda.current_device instead of (#46897)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46897

These APIs implicitly assumed that the GPU for a rank == the rank index, but
that is not necessarily true. For example, the first GPU could be used for a
different purpose, so rank 0 uses GPU 1, rank 1 uses GPU 2, etc. Thus, we
mandate that the user specify the device to use via `torch.cuda.set_device()`
before making calls to this API. This expectation should be okay since we
clearly document it, and we expect the user to set this for
DistributedDataParallel as well.

Also adds/tidies up some documentation.
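A minimal usage sketch of the documented expectation (assumes an already-initialized NCCL process group; the payload and function name are illustrative):

```python
import torch
import torch.distributed as dist

def gather_metrics(rank: int, world_size: int, metrics: dict):
    # The object-based collectives no longer assume GPU index == rank; the
    # caller must pin the device explicitly, as with DistributedDataParallel.
    torch.cuda.set_device(rank)
    output = [None for _ in range(world_size)]
    dist.all_gather_object(output, metrics)
    return output
```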
ghstack-source-id: 115359633

Test Plan: Modified unittests

Reviewed By: divchenko

Differential Revision: D24556177

fbshipit-source-id: 7e826007241eba0fde3019180066ed56faf3c0ca
2020-10-28 18:12:50 -07:00
Omkar Salpekar
5e2f17d77a Add NCCL_ASYNC_ERROR_HANDLING to docs (#46856)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46856

Add reference to NCCL_ASYNC_ERROR_HANDLING in the pytorch docs,
similar to how NCCL_BLOCKING_WAIT is currently described.
ghstack-source-id: 115186877

Test Plan: CI, verifying docs change

Reviewed By: jiayisuse

Differential Revision: D24541822

fbshipit-source-id: a0b3e843bc6392d2787a4bb270118f2dfda5f4ec
2020-10-26 14:41:32 -07:00
Luca Wehrstedt
f230245c06 Revert D24422354: [pytorch][PR] fix-process-group-counter
Test Plan: revert-hammer

Differential Revision:
D24422354 (caed29a069)

Original commit changeset: 32493cc2001d

fbshipit-source-id: 9b633f738ea555f45031056689f780dde8eda859
2020-10-23 08:04:37 -07:00
Brian Hirsh
db83ddcb86 small doc fix (#46599)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46599

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24426181

Pulled By: bdhirsh

fbshipit-source-id: d0900d5c43574c80f1bf614824eafd21ba6a9caf
2020-10-21 20:17:31 -07:00
Joel Lamy-Poirier
caed29a069 fix-process-group-counter (#46563)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46561

A minimal fix to issue https://github.com/pytorch/pytorch/issues/46561. Increment the global variable `_group_count` at the same time as the others so the global state remains consistent in case of a failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46563

Reviewed By: zou3519

Differential Revision: D24422354

Pulled By: mrshenli

fbshipit-source-id: 32493cc2001d21ad366c396d16c303936959434e
2020-10-21 13:03:53 -07:00
Alexander Golynski
e7e919fc34 Add warning on ProcessGroup and ProcessGroup::Work APIs (#46220)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46220

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24294437

Pulled By: gmagogsfm

fbshipit-source-id: 198f8e5760beeb1d18740f971647d2537afb3dd6
2020-10-14 16:27:37 -07:00
Brian Hirsh
1f791c06f0 adding BAND/BOR/BXOR reduce ops to unsupported list for complex numbers. added tests (#46270)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46270

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24284702

Pulled By: bdhirsh

fbshipit-source-id: 7e6c3fce83a4367808a638f0400999399b2c35b0
2020-10-14 08:48:14 -07:00
Brian Hirsh
c02efdefa8 adding complex support for distributed functions. Fixes #45760 (#45879)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45879

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24127949

Pulled By: bdhirsh

fbshipit-source-id: 8061b14fa1c0adbe22b9397c2d7f92618556d223
2020-10-12 12:44:47 -07:00
Mingzhe Li
281463ba0b [NCCL] Enable send/recv tests (#45994)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45994

Send/Recv tests were disabled because of https://github.com/pytorch/pytorch/issues/42517. With that issue fixed, this diff enables those tests.
ghstack-source-id: 113970569

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D24172484

fbshipit-source-id: 7492ee2e9bf88840c0d0086003ce8e99995aeb91
2020-10-09 15:00:39 -07:00
Mingzhe Li
59083d6176 [NCCL] Support NCCL Send/Recv (#44921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44921

This diff adds support for Process Group point-to-point operations on the NCCL backend, based on ncclSend/ncclRecv. See https://github.com/pytorch/pytorch/issues/43995 for more context.
ghstack-source-id: 113592785

Test Plan: unittest

Reviewed By: jiayisuse

Differential Revision: D23709848

fbshipit-source-id: cdf38050379ecbb10450f3394631317b41163258
2020-10-05 18:27:57 -07:00
Pritam Damania
a2b4177c5b Add barrier() at the end of init_process_group and new_group. (#45181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45181

`init_process_group` and `new_group` update a bunch of global
variables after initializing the actual process group. As a result, there is a
race: after initializing the process group on, say, rank 0, if we immediately
check the default process group on rank 1 (say via RPC), we might actually get
an error since rank 1 hasn't yet updated its _default_pg variable.

To resolve this issue, I've added barrier() at the end of both of these calls.
This ensures that once these calls return, correct initialization is guaranteed
on all ranks.

Since these calls are usually done mostly during initialization, it should be
fine to add the overhead of a barrier() here.
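A small sketch of the guarantee this provides (backend, address, and helper name are illustrative):

```python
import torch.distributed as dist

def init(rank: int, world_size: int):
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:29500",
        rank=rank, world_size=world_size,
    )
    # Because of the trailing barrier(), when this call returns every rank has
    # finished updating its globals (e.g. _default_pg), so it is safe for rank 0
    # to immediately interact with other ranks (e.g. via RPC) without racing.
    assert dist.is_initialized()
```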

#Closes: https://github.com/pytorch/pytorch/issues/40434, https://github.com/pytorch/pytorch/issues/40378
ghstack-source-id: 112923112

Test Plan:
Reproduced the failures in
https://github.com/pytorch/pytorch/issues/40434 and
https://github.com/pytorch/pytorch/issues/40378 and verified that this PR fixes
the issue.

Reviewed By: mrshenli

Differential Revision: D23858025

fbshipit-source-id: c4d5e46c2157981caf3ba1525dec5310dcbc1830
2020-09-25 15:46:59 -07:00
Rohan Varma
bee97d5be0 Document the default behavior for dist.new_group() when ranks=None (#44000)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44000

This wasn't documented, so add a doc saying all ranks are used when
ranks=None
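A short illustration of the documented default (assumes an initialized default process group):

```python
import torch.distributed as dist

# ranks=None (the default) means the new group contains every rank,
# i.e. it is equivalent to passing list(range(dist.get_world_size())).
group_all = dist.new_group()
group_sub = dist.new_group(ranks=[0, 1])  # explicit subset, for comparison
```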
ghstack-source-id: 111206308

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D23465034

fbshipit-source-id: 4c51f37ffcba3d58ffa5a0adcd5457e0c5676a5d
2020-09-17 11:30:37 -07:00
Rohan Varma
fbea2ee917 broadcast_object API for c10d (#43887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43887

As part of addressing #23232, this PR adds support for `broadcast_object_list`, an API to broadcast arbitrary picklable objects to all the other ranks. This has been a long-requested feature, so it is good for PyTorch to support it natively.

The implementation follows a similar approach to https://github.com/pytorch/pytorch/pull/42189. The input is a list of objects to be broadcast, and the operation is in-place, meaning that all ranks in the group will have their input list modified to contain the broadcast objects from the src rank.

Note that the API is designed to match the tensor-based collectives, except that it does not support async_op. For now, it is a blocking call. If we see demand for async_op, we will have to make more progress on merging work/future to support it.
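A minimal usage sketch (assumes an initialized process group; the objects are illustrative):

```python
import torch.distributed as dist

# Every rank supplies a list of the same length; after the call, all ranks
# hold the src rank's objects -- the operation mutates the list in place.
if dist.get_rank() == 0:
    objects = [{"lr": 0.1}, "resnet50", 42]
else:
    objects = [None, None, None]
dist.broadcast_object_list(objects, src=0)
```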
ghstack-source-id: 111180436

Reviewed By: mrshenli

Differential Revision: D23422577

fbshipit-source-id: fa700abb86eff7128dc29129a0823e83caf4ab0e
2020-09-01 18:54:17 -07:00
Akihiro Nitta
f17d7a5556 Fix exception chaining in torch/ (#43836)
Summary:
## Motivation
Fixes https://github.com/pytorch/pytorch/issues/43770.

## Description of the change
This PR fixes exception chaining only in files under `torch/` where appropriate.
To fix exception chaining, I used either:
1. `raise new_exception from old_exception` where `new_exception` itself seems not descriptive enough to debug or `old_exception` delivers valuable information.
2. `raise new_exception from None` where raising both of `new_exception` and `old_exception` seems a bit noisy and redundant.
I subjectively chose which one to use from the above options.
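For illustration, the two styles look like this (hypothetical examples, not taken from the touched files):

```python
import argparse

def load_config(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError as e:
        # Style 1: the original exception carries valuable debugging information.
        raise RuntimeError(f"failed to load config from {path}") from e

def positive_int(value):
    try:
        number = int(value)
    except ValueError:
        # Style 2: chaining the ValueError would just be noisy and redundant.
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer") from None
    if number <= 0:
        raise argparse.ArgumentTypeError(f"{value!r} is not positive")
    return number
```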

## List of lines containing raise in except clause:
I wrote [this simple script](https://gist.github.com/akihironitta/4223c1b32404b36c1b349d70c4c93b4d) using [ast](https://docs.python.org/3.8/library/ast.html#module-ast) to list the lines that `raise` in an `except` clause.

- [x] 000739c31a/torch/jit/annotations.py (L35)
- [x] 000739c31a/torch/jit/annotations.py (L150)
- [x] 000739c31a/torch/jit/annotations.py (L158)
- [x] 000739c31a/torch/jit/annotations.py (L231)
- [x] 000739c31a/torch/jit/_trace.py (L432)
- [x] 000739c31a/torch/nn/utils/prune.py (L192)
- [x] 000739c31a/torch/cuda/nvtx.py (L7)
- [x] 000739c31a/torch/utils/cpp_extension.py (L1537)
- [x] 000739c31a/torch/utils/tensorboard/_pytorch_graph.py (L292)
- [x] 000739c31a/torch/utils/data/dataloader.py (L835)
- [x] 000739c31a/torch/utils/data/dataloader.py (L849)
- [x] 000739c31a/torch/utils/data/dataloader.py (L856)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L186)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L189)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L424)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1279)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1283)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1356)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1388)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1391)
- [ ] 000739c31a/torch/testing/_internal/common_utils.py (L1412)
- [x] 000739c31a/torch/testing/_internal/codegen/random_topo_test.py (L310)
- [x] 000739c31a/torch/testing/_internal/codegen/random_topo_test.py (L329)
- [x] 000739c31a/torch/testing/_internal/codegen/random_topo_test.py (L332)
- [x] 000739c31a/torch/testing/_internal/jit_utils.py (L183)
- [x] 000739c31a/torch/testing/_internal/common_nn.py (L4789)
- [x] 000739c31a/torch/onnx/utils.py (L367)
- [x] 000739c31a/torch/onnx/utils.py (L659)
- [x] 000739c31a/torch/onnx/utils.py (L892)
- [x] 000739c31a/torch/onnx/utils.py (L897)
- [x] 000739c31a/torch/serialization.py (L108)
- [x] 000739c31a/torch/serialization.py (L754)
- [x] 000739c31a/torch/distributed/rpc/_testing/faulty_agent_backend_registry.py (L76)
- [x] 000739c31a/torch/distributed/rpc/backend_registry.py (L260)
- [x] 000739c31a/torch/distributed/distributed_c10d.py (L184)
- [x] 000739c31a/torch/_utils_internal.py (L57)
- [x] 000739c31a/torch/hub.py (L494)
- [x] 000739c31a/torch/contrib/_tensorboard_vis.py (L16)
- [x] 000739c31a/torch/distributions/lowrank_multivariate_normal.py (L100)
- [x] 000739c31a/torch/distributions/constraint_registry.py (L142)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43836

Reviewed By: ailzhang

Differential Revision: D23431212

Pulled By: malfet

fbshipit-source-id: 5f7f41b391164a5ad0efc06e55cd58c23408a921
2020-08-31 20:26:23 -07:00
Shen Li
2f52748515 Publish all_gather_object and gather_object docs (#43772)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43772

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23398495

Pulled By: rohan-varma

fbshipit-source-id: 032e1d628c0c0f2dec297226167471698c56b605
2020-08-31 13:28:00 -07:00
Rohan Varma
f22aa601ce All Gather and gather APIs for Python Objects (#42189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42189

Rehash of https://github.com/pytorch/pytorch/pull/28811, which was several months old.

As part of addressing https://github.com/pytorch/pytorch/issues/23232, this PR adds support for the following APIs:

`allgather_object` and `gather_object` to support gather/allgather of generic, picklable Python objects. This has been a long-requested feature, so PyTorch should provide these helpers built-in.

The methodology is what is proposed in the original issue:
1) Pickle object to ByteTensor using torch.save
2) Comm. tensor sizes
3) Copy local ByteTensor into a tensor of maximal size
4) Call tensor-based collectives on the result of (3)
5) Unpickle back into object using torch.load

Note that the API is designed to match the tensor-based collectives, except that it does not support `async_op`. For now, it is a blocking call. If we see demand to support `async_op`, we will have to make more progress on merging work/future to support this.

If this is a suitable approach, we can support `scatter`, `broadcast` in follow up PRs.
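A rough sketch of steps 1-5 above (it uses `pickle` directly for brevity, whereas the PR pickles via `torch.save`/`torch.load`; the helper name is illustrative):

```python
import pickle
import torch
import torch.distributed as dist

def all_gather_object_sketch(obj, group=None):
    world_size = dist.get_world_size(group=group)
    # 1) pickle the local object into a ByteTensor
    data = pickle.dumps(obj)
    local = torch.tensor(list(data), dtype=torch.uint8)
    # 2) communicate tensor sizes so every rank knows the maximum
    local_size = torch.tensor([local.numel()], dtype=torch.long)
    sizes = [torch.zeros(1, dtype=torch.long) for _ in range(world_size)]
    dist.all_gather(sizes, local_size, group=group)
    max_size = int(max(s.item() for s in sizes))
    # 3) copy the local ByteTensor into a zero-padded tensor of maximal size
    padded = torch.zeros(max_size, dtype=torch.uint8)
    padded[: local.numel()] = local
    # 4) run the ordinary tensor-based collective
    output = [torch.zeros(max_size, dtype=torch.uint8) for _ in range(world_size)]
    dist.all_gather(output, padded, group=group)
    # 5) unpickle each rank's bytes back into a Python object
    return [
        pickle.loads(bytes(t[: int(sz.item())].tolist()))
        for t, sz in zip(output, sizes)
    ]
```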
ghstack-source-id: 109322433

Reviewed By: mrshenli

Differential Revision: D22785387

fbshipit-source-id: a265a44ec0aa3aaffc3c6966023400495904c7d8
2020-08-06 13:30:25 -07:00
Tongzhou Wang
3001facd7a [doc] [distributed] fix typo (#39264)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39264

Differential Revision: D21791426

Pulled By: mrshenli

fbshipit-source-id: c3aa8fda1893aa3c0f9ad3db7da25f1ee80303e8
2020-06-01 19:19:46 -07:00
Quang Luong
9d7a79ac27 [Caffe2] raise exceptions instead of str (#37744)
Summary:
Some exceptions were not correctly wrapped inside an exception class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37744

Differential Revision: D21388197

Pulled By: mrshenli

fbshipit-source-id: 2d69e2543c2e05116c367d137968b982c254d2dc
2020-05-05 13:34:33 -07:00
Pritam Damania
136d84dd38 Enhance error message for MPI unavailability. (#36781)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36781

Mention that you need to build PyTorch from source to enable MPI.
Additional context:
https://discuss.pytorch.org/t/distributed-pytorch-with-mpi/77106.
ghstack-source-id: 102341246

Test Plan: waitforbuildbot

Differential Revision: D21082009

fbshipit-source-id: 3a3286349e71322726a341dfc743b5978c7d9a56
2020-04-18 14:45:44 -07:00
Sudarshan Raghunathan
739351fac4 Fix linter warning: replace f-strings with str.format for Py2 compat (#35492)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35492

Test Plan: Imported from OSS

Differential Revision: D20998727

Pulled By: drdarshan

fbshipit-source-id: 54f34a7649a2772ad030b456f1b50aba831ce2e0
2020-04-13 18:43:58 -07:00
Feng Tian
762270c51f add c10d dynamic loading mechanism and unit test (#28068)
Summary:
The original behavior of PyTorch c10d only supports built-in c10d backends, such as
nccl/gloo/mpi. This patch extends the c10d capability to support dynamically
loading 3rd-party communication libraries that are derived from the ProcessGroup base class.

The related RFC is: https://github.com/pytorch/pytorch/issues/27955

This way, the user just needs to specify a 3rd-party c10d backend name when invoking
torch.distributed.init_process_group(). The proposed logic will try to load the corresponding
c10d backend cpp extension automatically. For how to develop a new 3rd-party c10d backend
through a cpp extension, please refer to test/cpp_extensions/cpp_c10d_extension.cpp
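A hedged usage sketch of what calling such a backend looks like ("my_backend" is a hypothetical backend name, and the rendezvous address is illustrative):

```python
import torch.distributed as dist

# "my_backend" is a hypothetical third-party backend; per the mechanism above,
# init_process_group would try to load the matching cpp extension automatically
# (see test/cpp_extensions/cpp_c10d_extension.cpp for a sample extension).
dist.init_process_group(
    backend="my_backend",
    init_method="tcp://127.0.0.1:29500",
    rank=0, world_size=1,
)
```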
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28068

Differential Revision: D19174838

Pulled By: agolynski

fbshipit-source-id: 3409a504a43ce7260e6f9d1207c00e87471fac62
2020-04-02 15:46:51 -07:00
Dhiraj D Kalamkar
945d7a7408 Add All-to-all comms support to distributed module and MPI backend (#32361)
Summary:
As described in https://github.com/pytorch/pytorch/issues/32345, a prototype implementation to add an alltoall communication primitive to torch.distributed module and ProcessGroup abstract interface. Also, implements alltoall in ProcessGroupMPI backend.
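A hedged usage sketch (MPI backend; assumes the script is launched with `mpirun` and that the Python-level wrapper is `all_to_all_single`, as in current releases):

```python
import torch
import torch.distributed as dist

dist.init_process_group("mpi")
world_size = dist.get_world_size()
rank = dist.get_rank()

# Rank r sends chunk i of its input to rank i and receives chunk r from every rank.
inp = torch.arange(world_size, dtype=torch.float32) + rank * world_size
out = torch.empty(world_size, dtype=torch.float32)
dist.all_to_all_single(out, inp)
```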

mnaumovfb JianpingChen066 dmudiger srinivas212 Jianhui-Li mshiryaev ftian1

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini xush6528 osalpekar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32361

Reviewed By: mrshenli

Differential Revision: D20635481

Pulled By: srinivas212

fbshipit-source-id: 3dd0af800ce55d02f02813cde550e3a0f1a287d2
2020-04-01 08:57:12 -07:00
Ankesh Anand
45c45195cd Remove warning about building from source to use the NCCL backend (#34051)
Summary:
I think this warning isn't true anymore, and the NCCL backend works without PyTorch needing to be built from source.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34051

Differential Revision: D20195310

Pulled By: ezyang

fbshipit-source-id: 14f879a8c43ea5efdbdf0f638792ea2b90011f4a
2020-03-02 13:43:43 -08:00
Rohan Varma
6cb9e6b015 Back out "Revert D19871946: [distributed] pass in timeout to TCP store when initializing" (#33434)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33434

Reland of https://github.com/pytorch/pytorch/pull/33325, since the
unit test was flaky and failed on land.
To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not time out (timing out the rendezvous in 1s led to the flakiness). I also
generalized our mechanism for retrying on errors to include retrying on errors
due to timeouts in rendezvous.
ghstack-source-id: 98558377

Test Plan: Added UT test_tcp_store_timeout_set

Differential Revision: D19935390

fbshipit-source-id: 56ccf8c333dd2f954a33614d35cd1642d4e9473a
2020-02-19 17:17:17 -08:00
Rohan Varma
d4e4beddc4 Revert D19871946: [distributed] pass in timeout to TCP store when initializing
Test Plan: revert-hammer

Differential Revision:
D19871946

Original commit changeset: dd002180c4c8

fbshipit-source-id: 40b0676c51e43366c0700e81d16cc7927ee8efc2
2020-02-16 19:37:44 -08:00
Rohan Varma
df47a3abe0 [distributed] pass in timeout to TCP store when initializing (#33325)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33325

Closes https://github.com/pytorch/pytorch/issues/32924. There was a bug where, for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server. This diff forwards the timeout passed into `init_process_group` to rendezvous so that it can be passed into the TCPStore constructor, ensuring the right timeout is used at construction time.
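A small usage sketch of the behavior being fixed (backend, address, and timeout value are illustrative):

```python
from datetime import timedelta
import torch.distributed as dist

def init(rank: int, world_size: int):
    # With this change, the timeout below also bounds the initial TCPStore
    # connection during rendezvous, not just the collectives issued afterwards.
    dist.init_process_group(
        "gloo",
        init_method="tcp://127.0.0.1:29500",
        rank=rank, world_size=world_size,
        timeout=timedelta(seconds=60),
    )
```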

Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.
ghstack-source-id: 98401875

Test Plan: Added a UT

Differential Revision: D19871946

fbshipit-source-id: dd002180c4c883216645b8a97cc472c6116ac117
2020-02-16 17:59:44 -08:00
Brian Wignall
f326045b37 Fix typos, via a Levenshtein-type corrector (#31523)
Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos, with https://github.com/bwignall/typochecker to help automate the checking.

Uses an updated version of the tool used in https://github.com/pytorch/pytorch/pull/30606 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31523

Differential Revision: D19216749

Pulled By: mrshenli

fbshipit-source-id: 7fd489cb9a77cd7e4950c1046f925d57524960ea
2020-01-17 16:03:19 -08:00
Alexander Golynski
23695ab23f Moving python allgather_coalesced impl from Py to C. (#29059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29059
This is a resubmit of reverted diff D18209289 (PR #28857).

Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo

Reviewed By: pietern

Differential Revision: D18277097

fbshipit-source-id: aecfd7206d70829f0cac66182bf02fccee410fed
2019-11-04 08:34:34 -08:00
Shen Li
9041e29d94 Revert D18209289: Moving python allgather_coalesced impl from Py to C
Test Plan: revert-hammer

Differential Revision:
D18209289

Original commit changeset: c5a4c4a1aaa0

fbshipit-source-id: d4865e3f8c4eeee285c711e5c2250b8c9f9b0d25
2019-11-01 11:23:41 -07:00
Alexander Golynski
22a346ee34 Moving python allgather_coalesced impl from Py to C
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28857

Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo

Reviewed By: mrshenli

Differential Revision: D18209289

fbshipit-source-id: c5a4c4a1aaa07286a05a7c842dda428eeb46f696
2019-11-01 10:34:23 -07:00
Alexander Golynski
45dab56153 adding python all_gather coalesced functionality and testing. (#28634)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28634

Caveat 1: this only works in sync mode.
Caveat 2: this is going to go away and be replaced by a C++ implementation.

Test Plan: buck test caffe2/test:distributed_gloo -- test_all_gather_coalesced

Reviewed By: mrshenli

Differential Revision: D18123422

fbshipit-source-id: cfb9950d5d54c6181a5240e7cc9fed88ed47f5d9
2019-10-28 08:12:36 -07:00
Shihao Xu
59402f51cf Make init_method url appending step re-usable by both init_process_group and init_model_parallel(init_rpc) (#28226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28226

# Goal

The rendezvous step should be the first step not only for `init_process_group` but also for `init_model_parallel`.

The roadblock is that there is a special step in `init_process_group` where the `rank` and `world_size` arguments passed to `init_process_group(..)` are appended to the `init_method` URL string.

We need to make this argument-appending step common and re-usable for both `init_process_group` and `init_model_parallel`.

# Solution

- Put argument appending inside the `rendezvous` function (a rough sketch follows below).
- Remove manual `init_method` URL construction. Delegate the responsibility to the `rendezvous` function.
- Use the `rendezvous` function for any `RpcAgent`.
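As a rough illustration of the appending step being moved into `rendezvous` (the helper below is purely illustrative, not the actual implementation):

```python
def _append_rank_and_world_size(url: str, rank: int, world_size: int) -> str:
    # e.g. "tcp://127.0.0.1:29500" -> "tcp://127.0.0.1:29500?rank=0&world_size=2"
    sep = "&" if "?" in url else "?"
    return f"{url}{sep}rank={rank}&world_size={world_size}"
```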

Test Plan:
```
buck test mode/dev-nosan caffe2/test:c10d
```

```
buck test mode/dev-nosan caffe2/test:rpc_fork -- test_invalid_names

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_worker_id
```

```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc -- test_sync_rpc
```

```
buck test mode/dev-nosan caffe2/torch/fb/rendezvous:zeus_test
```

```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_pairwise_attention_pooling -- test_single_trainer_multiple_pss
```

Differential Revision: D5524494

fbshipit-source-id: 50be58ec3c928621b0874b044ef4a1640534d8ef
2019-10-23 21:51:08 -07:00
zou3519
e5d6b75319 Bag of documentation fixes; fix more sphinx warnings (#27850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27850

Many of these are real problems in the documentation (e.g., a link or
bullet point that doesn't display correctly).

Test Plan: - built and viewed the documentation for each change locally.

Differential Revision: D17908123

Pulled By: zou3519

fbshipit-source-id: 65c92a352c89b90fb6b508c388b0874233a3817a
2019-10-15 07:31:14 -07:00
Pritam Damania
646e214706 ProcessGroupNCCL should respect timeout passed in to init_process_group. (#27224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27224

As part of adding error handling to NCCL, we are now able to specify a
timeout for operations using ProcessGroupNCCL. However, this timeout had a
default of 10 seconds and didn't respect the timeout specified in
init_process_group.

In this change, I've ensured we pass the appropriate timeout to
ProcessGroupNCCL.
ghstack-source-id: 91283548

Test Plan:
Added unit test to verify timeout passed in to init_process_group is
respected.

Differential Revision: D17717992

fbshipit-source-id: c73320187f1f3b2693ba1e177d80646e282d01a2
2019-10-04 13:28:57 -07:00
Vikas Mehta
3a18e2e768 support re-creating/destroying process groups when some trainers recover after failures (#26912)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26912

The group name is used as a prefix in the c10d store, and without a consistent name a process group cannot be initialized.

When a process group doesn't have an explicit name (only the WORLD (default) process group can have an explicit name), we use the global _group_counter to generate the name. We need to reset the counter on destruction so that a consistent value is generated when we re-create process groups after some trainers recover from failure.

Test Plan: existing tests passed

Reviewed By: mrshenli

Differential Revision: D17594268

fbshipit-source-id: 17f4d2746584dadaa5d468085d871ff3e95a1c84
2019-09-27 16:16:58 -07:00
Pieter Noordhuis
ebdb32c749 Remove global group name tracking for ProcessGroupNCCL (#25905)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25905

Now that we can detect and recover from failures in NCCL we should
allow processes that are started at different times (and perhaps have
had previous NCCL process group instances), to eventually be part of
the same process group. Keeping track of group names in global
variables prevents that, because the processes will be out of sync.

This commit removes the global group name maps and defers
responsibility of isolating access to the same store from multiple
process groups to the store itself. Users can use `c10d::PrefixStore`
to derive new store instances whose keyspace is scoped to some
prefix. Functionally, this is identical to keeping a global map and
using a group name, but also gives more flexibility to the front-end
API to reset state and have processes that have started at different
times to join the same process group.
ghstack-source-id: 89804865

Test Plan: Tests pass.

Differential Revision: D17281416

fbshipit-source-id: eab3b48463a9b0ef24aedeca76e2bb970b9f33ef
2019-09-11 06:56:33 -07:00
Pieter Noordhuis
500e72aaa5 Make scatter/gather arguments optional (#25575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25575

For both scatter and gather, only the source and destination rank,
respectively, need to supply a list of tensors. The `scatter_list` and
`gather_list` arguments were mandatory, however, and this has resulted
in some confusion. This commit makes both the `scatter_list` and
`gather_list`, and the `src` and `dst` arguments optional.
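A short usage sketch of the relaxed API (assumes an initialized process group):

```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
world_size = dist.get_world_size()
t = torch.full((2,), float(rank))

if rank == 0:
    # Only the destination rank needs to supply gather_list now.
    gathered = [torch.zeros(2) for _ in range(world_size)]
    dist.gather(t, gather_list=gathered, dst=0)
else:
    dist.gather(t, dst=0)
```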

Closes #25463.

Test Plan: Imported from OSS

Differential Revision: D17164253

fbshipit-source-id: a16bc208c87a1c96163c1a86d4a7ca8634a26f95
2019-09-03 12:27:05 -07:00
Pieter Noordhuis
493f7bd817 Error phrasing in torch.distributed helper functions
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25574

Test Plan: Imported from OSS

Differential Revision: D17164254

fbshipit-source-id: 13dbcffd67c2b5425c722b2b21765345a85a3872
2019-09-03 12:27:01 -07:00
jfc4050
590619ab8c Support all_reduce a list of same-device tensors #21640 (#24949)
Summary:
addresses https://github.com/pytorch/pytorch/issues/21640 for CPU tensors and the Gloo backend.

Questions:
- ~~currently takes `AllreduceOptions`, since all of the options are the same. Would it be better to make a new `AllreduceCoalescedOptions` class?~~
- ~~I decided to inherit from `ProcessGroupGloo::AsyncWork` instead of `AsyncAllreduceWork` to shorten the inheritance chain a bit and for consistency with existing classes. However, this means that the two `getFunction` methods are copy-pasted. Would inheriting from `AsyncAllreduceWork` be preferable?~~
- ~~should the work class be named `AsyncCoalescedAllreduceWork` or `AsyncAllreduceCoalescedWork`?~~

thank you!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24949

Differential Revision: D17055580

Pulled By: mrshenli

fbshipit-source-id: e63b5fcaec6021053ea960776a09ee8cf11d1ec2
2019-08-28 10:57:37 -07:00
Max Wang
c5845c4482 Add support for reduce-scatter in c10d (#18844)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18844
ghimport-source-id: c6b2f0032c7c2212be2000a9c1f262f63d878a97

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18844 Add support for reduce-scatter in c10d**
* #18820 Refactor ProcessGroupNCCL collective primitives

Reviewed By: mrshenli

Differential Revision: D14768369

fbshipit-source-id: a9def7a0da6e9cd995e982371cc1e22f3df1a156
2019-04-26 13:46:57 -07:00
Kutta Srinivasan
b7323a94ad Cleanup init_process_group (#19033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19033

torch.distributed.init_process_group() has had many parameters added, but the contract isn't clear. Adding documentation, asserts, and explicit args should make the contract clearer to callers and enforce it more strictly.

Reviewed By: mrshenli

Differential Revision: D14813070

fbshipit-source-id: 80e4e7123087745bed436eb390887db9d1876042
2019-04-18 09:37:38 -07:00
Pieter Noordhuis
ce166d949d ProcessGroupMPI exists only if it is valid (#14809)
Summary:
Previously, MPI process groups were created for all processes, even if
they were not part of the created group. Their MPI_Comm member field
would be MPI_COMM_NULL and they would ignore any calls. Their rank and
size were identical to that of the global process group and they had a
special groupRank and groupSize field to capture the _real_ rank.

This also meant asymmetry with other process group types, where creating
a new group would either return the process group OR
GroupMember.NON_GROUP_MEMBER. For the MPI process group, it would always
return a process group, and an additional check was needed to verify
whether or not a process was indeed part of a process group.

This commit changes this such that every MPI process group is a valid
process group, and by extension that we no longer have to special case
MPI to determine whether or not a process is part of a group. Now, if
the value returned by `new_group` is GroupMember.NON_GROUP_MEMBER, the
process is not a member, otherwise it is.
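The resulting uniform check looks roughly like this (illustrative; assumes an initialized process group with at least two ranks):

```python
import torch
import torch.distributed as dist

subgroup = dist.new_group(ranks=[0, 1])
if subgroup != dist.GroupMember.NON_GROUP_MEMBER:
    # Only actual members of the group hold a usable process group object.
    t = torch.ones(1)
    dist.all_reduce(t, group=subgroup)
```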
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14809

Differential Revision: D14887937

Pulled By: pietern

fbshipit-source-id: c5bf86d3b33e524cc5004ee68e30103178fa491d
2019-04-10 21:36:35 -07:00
Shen Li
8f9b11cf33 Propagate ProcessGroup timeout to Store (#16571)
Summary:
closes #16520

Hi pietern, I am not sure if this is the expected way to pass the timeout to `Store`; could you please help take a look? Thanks!

Questions:
1. How do I write tests for this? I wanted to do something like `test_barrier_timeout_global`, but it seems I need to set the pg's timeout larger than the `Store`'s default timeout (3 min) to see a difference, which is too long for a unit test. And I do not want to change the `Store`'s default timeout either. Any suggestion?
2. Should I also propagate timeout configuration down to `PrefixStore` in `_new_process_group_helper`?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16571

Differential Revision: D13954527

Pulled By: mrshenli

fbshipit-source-id: 77f2653903f24255207233eb298f7c0321119a87
2019-04-09 12:36:28 -07:00
Pieter Noordhuis
7a19d3c9e1 Allow override of backend in dist.new_group() (#18595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18595

There is no need to force the backend to be the same as the global
process group, as long as the backend is "nccl" or "gloo".

Reviewed By: mrshenli

Differential Revision: D14657204

fbshipit-source-id: 868817b9f219e3be8db0761a487f0027ed46663b
2019-04-04 14:23:03 -07:00
Shen Li
c0ad6747a9 Highlight NCCL all_reduce and all_gather requirements (#18741)
Summary:
See #18689
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18741

Differential Revision: D14726874

Pulled By: mrshenli

fbshipit-source-id: a92404c653e3c62fc23fa3ccacfb3b2959b2e307
2019-04-03 09:50:29 -07:00
Igor Fedan
36237c4893 Fix flake8 issues in gradgrad test
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/18727

Differential Revision: D14724887

Pulled By: ifedan

fbshipit-source-id: 8c1db6460303e746e4aea0142302b8d61277c067
2019-04-02 12:45:18 -07:00
Pieter Noordhuis
bdfdf6c2b9 C++ handler for gradient reduction (#18251)
Summary:
This commit adds the `c10d::Reducer` class that hooks into autograd
and performs gradient bucketing and reduction. These are the core
parts of `nn.parallel.DistributedDataParallel` that up to now were
only usable for CUDA models.

This should enable the following:

* Distributed data parallelism for models defined using the C++ frontend.
* Allow overlap of gradient computation and reduction for non-CUDA models.
* Enable distributed data parallelism for models with some unused parameters.

This does not include any logic for computing bucket assignment, which
can be done separately: either by observing autograd execution order
(this is what Apex does), or by assigning buckets based on some
maximum byte size, or both.
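A toy sketch of the size-based strategy mentioned above (the greedy packing and the 25 MB cap are illustrative, not what the Reducer actually does):

```python
def assign_buckets(params, bucket_bytes_cap=25 * 1024 * 1024):
    """Greedily group parameters into buckets no larger than bucket_bytes_cap."""
    buckets, current, current_bytes = [], [], 0
    for p in params:
        nbytes = p.numel() * p.element_size()
        if current and current_bytes + nbytes > bucket_bytes_cap:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(p)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets
```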

Also see #17757 and #13273.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18251

Reviewed By: mrshenli

Differential Revision: D14571899

Pulled By: pietern

fbshipit-source-id: 20f95eefd288dfe8cfffe0a28ca22fa7c9c3cd4c
2019-04-01 14:30:02 -07:00