Commit Graph

1403 Commits

Author SHA1 Message Date
Rohan Varma
6dabe0b291 [Dist Profiling] Enable dist profiling for DDP (gloo only) (#52031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52031

Closes https://github.com/pytorch/pytorch/issues/52020
Ensures that we can profile collectives in DDP by propagating the profiler threadLocalState appropriately. As described in the issue above, this previously did not work because the profiler was only enabled on the main thread.
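
A minimal sketch of the resulting behavior (a gloo process group is assumed to be initialized already; the model and input here are placeholders): the allreduce collectives issued from DDP's reducer thread now show up in the profiler output.
```
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# minimal sketch, assuming dist.init_process_group("gloo", ...) has already run
model = torch.nn.Linear(8, 8)
ddp_model = DDP(model)
with torch.autograd.profiler.profile() as prof:
    ddp_model(torch.randn(4, 8)).sum().backward()  # allreduce issued from the reducer thread
print(prof.key_averages().table(sort_by="cpu_time_total"))
```
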
ghstack-source-id: 121818080

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26356192

fbshipit-source-id: 0158b5833a3f857a0b4b2943ae3037e9d998dfd1
2021-02-17 12:21:37 -08:00
Rohan Varma
7b21c6be67 [Dist Profiling] Enable profiling for gloo send/recv (#52004)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52004

Enables profiling of point-to-point (send/recv) operations for Gloo. Modified/added relevant unit tests.
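
A minimal usage sketch (assuming a two-rank gloo process group is already initialized):
```
import torch
import torch.distributed as dist

# minimal sketch, assuming dist.init_process_group("gloo", ...) has already run
tensor = torch.ones(4)
with torch.autograd.profiler.profile() as prof:
    if dist.get_rank() == 0:
        dist.send(tensor, dst=1)
    else:
        dist.recv(tensor, src=0)
print(prof.key_averages().table())
```
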
ghstack-source-id: 121507511

Test Plan: CI

Reviewed By: mrzzd

Differential Revision: D26347164

fbshipit-source-id: f4d1c474fccf40d5776fc13c4add7a053ea08960
2021-02-12 13:46:51 -08:00
Rohan Varma
4c93a79a04 [Dist Profiling] Support shape recording for profiling collectives (#51822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51822

Adds support for shape recording when profiling distributed collectives on the nccl/gloo backends. Added both C++ and Python tests to ensure that shapes are recorded properly. Note that we don't add `ProcessGroupNCCLTest`s since they would need to be modified to support a single process per device and a world size > 1.
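
A minimal sketch of consuming the recorded shapes from Python (assuming an initialized gloo process group):
```
import torch
import torch.distributed as dist

# minimal sketch, assuming dist.init_process_group("gloo", ...) has already run
tensor = torch.ones(8)
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    dist.all_reduce(tensor)
# group the profiled events by the input shapes that were recorded
print(prof.key_averages(group_by_input_shape=True).table())
```
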
ghstack-source-id: 121507509

Test Plan: CI

Reviewed By: mrzzd

Differential Revision: D26291739

fbshipit-source-id: 5f7bd54d8c36d17a4a29e172b25266ca3dbd8fbd
2021-02-11 12:42:26 -08:00
Richard Barnes
fa325d7c9f Use sum_integers and multiply_integers (#51146)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51146

Test Plan: Sandcastle tests

Reviewed By: ngimel

Differential Revision: D25903430

fbshipit-source-id: 329c14018c9e5192864eed88a8ed0a5068ff1c69
2021-02-10 18:05:45 -08:00
Yanli Zhao
18e0a61388 add more logging fields that can be set in construction time (#51260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51260

Add more logging fields to DDPLoggingData, including parameter stats, bucket stats, environment variables, NCCL version, and data type.
ghstack-source-id: 121260224

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D26118245

fbshipit-source-id: ba48b7a11340bda1f5f3b24c8603545d346361e9
2021-02-09 21:58:58 -08:00
Howard Huang
97e35858ec [Resubmit] Add compare_set operation and test to TCPStore (#51815)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51815

This is a resubmission of #51593, which was already approved.

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D26316875

Pulled By: H-Huang

fbshipit-source-id: d81cb131ef6b9e2ebaee32bb505dfc11235bc29d
2021-02-08 13:44:31 -08:00
Yi Wang
5a962369e2 [Gradient Compression] Check if the backend is NCCL when a DDP communication hook is registered (#51759)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51759

Some unit tests actually register a comm hook on other backends such as GLOO, for example `test_ddp_comm_hook_future_passing_cpu`.

Therefore, only perform the check in `register_builtin_comm_hook`.

Currently, DDP communication hooks are only supported on NCCL, so add a check to the registration methods.
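
A rough Python-side sketch of the registration path being checked (an NCCL process group and a CUDA model are assumed; `rank` is a placeholder and the bucket accessor name is an assumption that may differ between versions):
```
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# minimal sketch, assuming dist.init_process_group("nccl", ...) has already run
# and `rank` is this process's rank
def noop_hook(state, bucket):
    fut = torch.futures.Future()
    # hand the bucket's tensors back unchanged; accessor name assumed for this version
    fut.set_result(bucket.get_tensors())
    return fut

model = torch.nn.Linear(8, 8).cuda(rank)
ddp_model = DDP(model, device_ids=[rank])
ddp_model.register_comm_hook(state=None, hook=noop_hook)
```
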
ghstack-source-id: 121115814

Test Plan: unit tests.

Reviewed By: pritamdamania87

Differential Revision: D26268581

fbshipit-source-id: c739fa4dca6d320202dc6689d790c2761c834c30
2021-02-05 09:59:12 -08:00
Howard Huang
62aea33d7f Revert D26237328: Add compare_set operation and test to TCPStore
Test Plan: revert-hammer

Differential Revision:
D26237328 (7d00aec6bc)

Original commit changeset: c6837a4cc34f

fbshipit-source-id: 662f8067ead9bce0da13b35d393fb781635dd2b9
2021-02-04 13:43:05 -08:00
Howard Huang
7d00aec6bc Add compare_set operation and test to TCPStore (#51593)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51593

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D26237328

Pulled By: H-Huang

fbshipit-source-id: c6837a4cc34f8247df6e1c29c1f40fd9e7953313
2021-02-04 10:36:58 -08:00
Omkar Salpekar
3361d365bd [Gloo] Use TORCH_CHECK for ensuring tag is nonnegative (#51370)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51370

TORCH_CHECK should be used when confirming the correctness of function
arguments like the tag passed to Gloo functions.
ghstack-source-id: 120908449

Test Plan: Sandcastle/CI

Reviewed By: mingzhe09088

Differential Revision: D26152359

fbshipit-source-id: ddffaa6f11393aaedaf0870759dc526d8d4530ee
2021-02-03 11:48:20 -08:00
Yanli Zhao
e54cbb8250 Create PyTorch DDP logging APIs for applications to use (#50637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50637

Add APIs so that applications can retrieve and log PyTorch DDP logging data.

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D25933411

fbshipit-source-id: 57c248a2f002da06a386fc7406d3e5533ebb9124
2021-02-02 18:24:21 -08:00
Yanli Zhao
d5541c50a3 add a c++ interface in processGroup to get its backend name (#51066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51066

The backend name of a process group created using the distributed_c10d Python API is tracked, but there is no good way to track the name of a process group created using the ProcessGroup C++ API. In some cases, knowing the backend name of a ProcessGroup is useful, e.g., to log the backend name, or to write code that depends on the known backend.
ghstack-source-id: 120628432

Test Plan: unit tests

Reviewed By: pritamdamania87

Differential Revision: D26059769

fbshipit-source-id: 6584c6695c5c3570137dc98c16e06cbe4b7f5503
2021-01-29 17:28:42 -08:00
Yanli Zhao
250c71121b Create a DDPLoggingData and expose it to python interface (#50622)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50622

1. Define a DDPLoggingData struct that is the placeholder for all DDP-related logging fields
2. Put the DDPLoggingData struct in the C10 directory so that it can be easily imported by c10 and torch files
3. Expose a get_ddp_logging_data() method in Python so that users can get the logging data and dump it in their applications (see the sketch below)
4. Unit tests check that the logging data can be set and retrieved as expected
5. Follow-ups will add more logging fields such as perf stats, internal states, and environment variables
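
A minimal usage sketch for item 3 (the method name follows the description above; an initialized process group is assumed):
```
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# minimal sketch, assuming dist.init_process_group(...) has already run
ddp_model = DDP(torch.nn.Linear(8, 8))
logging_data = ddp_model.get_ddp_logging_data()  # name follows item 3 above
print(logging_data)
```
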
ghstack-source-id: 120275870

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D25930527

fbshipit-source-id: 290c200161019c58e28eed9a5a2a7a8153113f99
2021-01-25 15:23:07 -08:00
Xiang Gao
44922f26f5 Add support for NCCL alltoall (#44374)
Summary:
In https://github.com/pytorch/pytorch/issues/42514, NCCL `alltoall_single` was already added. This PR adds NCCL `alltoall`.

The difference between `alltoall_single` and `alltoall` is: `alltoall_single` works on a single tensor and sends/receives slices of that tensor, while `alltoall` works on a list of tensors and sends/receives the tensors in that list.
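
A minimal sketch contrasting the two calls (assuming an initialized NCCL process group with one GPU per rank):
```
import torch
import torch.distributed as dist

# minimal sketch, assuming dist.init_process_group("nccl", ...) has already run
world_size = dist.get_world_size()

# alltoall_single: a single tensor per rank; slices of it are exchanged
inp = torch.arange(world_size, device="cuda") + dist.get_rank() * world_size
out = torch.empty_like(inp)
dist.all_to_all_single(out, inp)

# alltoall: a list of tensors per rank; whole tensors are exchanged
in_list = list(inp.chunk(world_size))
out_list = list(torch.empty_like(inp).chunk(world_size))
dist.all_to_all(out_list, in_list)
```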

cc: ptrblck ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44374

Reviewed By: zhangguanheng66, mrshenli

Differential Revision: D24455427

Pulled By: srinivas212

fbshipit-source-id: 42fdebdd14f8340098e2c34ef645bd40603552b1
2021-01-20 14:57:12 -08:00
Pritam Damania
4e248eb3f6 Change watchdog timeout logging from INFO to ERROR. (#50455)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50455

Certain systems only print log messages at the ERROR/WARN levels, and the error message that the watchdog is timing out a particular operation is pretty important.

As a result, changing its level to ERROR instead of INFO.
ghstack-source-id: 119761029

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25894795

fbshipit-source-id: 259b16c13f6cdf9cb1956602d15784b92aa53f17
2021-01-12 20:15:39 -08:00
Rohan Varma
78e71ce627 warn user once for possible unnecessary find_unused_params (#50133)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50133

`find_unused_parameters=True` is only needed when the model has unused parameters that are not known at model definition time or differ due to control flow.

Unfortunately, many DDP users pass this flag as `True` even when they do not need it, sometimes as a precaution to mitigate possible errors that may be raised (such as the error we raise when not all outputs are used). While this is a larger issue to be fixed in DDP, it would also be useful to warn once if we did not detect unused parameters.

The downside of this is that for control-flow models where the first iteration doesn't have unused params but the rest do, this would be a false warning. However, I think the warning's value exceeds this downside.
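
A minimal construction sketch of the case the warning targets (an initialized process group is assumed and `rank` is a placeholder):
```
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# minimal sketch, assuming dist.init_process_group(...) has already run and
# `rank` is this process's rank; the flag is only needed when some parameters
# may not receive gradients in a given iteration (e.g. due to control flow)
model = torch.nn.Linear(8, 8).cuda(rank)
ddp_model = DDP(model, device_ids=[rank], find_unused_parameters=True)
# if every parameter is in fact used in every iteration, DDP now warns once
# that the flag (and the extra traversal it triggers) was unnecessary
```
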
ghstack-source-id: 119707101

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D25411118

fbshipit-source-id: 9f4a18ad8f45e364eae79b575cb1a9eaea45a86c
2021-01-12 02:55:06 -08:00
Rohan Varma
294b7867eb Address clang-tidy warnings in ProcessGroupNCCL (#50131)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50131

Noticed that in the internal diff for
https://github.com/pytorch/pytorch/pull/49069 there was a clang-tidy warning to
use emplace instead of push_back. This can save us a copy, since the element is
constructed in place rather than copied.
ghstack-source-id: 119560979

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D25800134

fbshipit-source-id: 243e57318f5d6e43de524d4e5409893febe6164c
2021-01-07 21:29:28 -08:00
Jagadish Krishnamoorthy
c115957df0 [distributed] Provide parameter to pass GPU ID in barrier function (#49069)
Summary:
On a multi-GPU node, the mapping between rank and the corresponding GPU can differ.
Provide an optional parameter to specify the GPU device number used for the
allreduce operation in the barrier function.

Add test cases to validate barrier device_ids.
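
A minimal usage sketch (assuming an initialized NCCL process group and that `gpu_id` is the device this rank drives, which may differ from its rank):
```
import torch
import torch.distributed as dist

# minimal sketch, assuming dist.init_process_group("nccl", ...) has already run
# and `gpu_id` is the device this rank drives
torch.cuda.set_device(gpu_id)
dist.barrier(device_ids=[gpu_id])  # run the barrier's allreduce on this specific GPU
```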

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Fixes https://github.com/pytorch/pytorch/issues/48110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49069

Reviewed By: mrshenli

Differential Revision: D25658528

Pulled By: rohan-varma

fbshipit-source-id: 418198b6224c8c1fd95993b80c072a8ff8f02eec
2021-01-05 11:27:54 -08:00
Omkar Salpekar
31fcbbdf35 [FileStore] Implemented numKeys and Added Tests (#49556)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49556

Implemented the missing Store functionality (specifically numKeys) in the FileStore.
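
A minimal usage sketch (the file path and world size are placeholders):
```
import torch.distributed as dist

# minimal sketch: a FileStore backed by a shared file, world size of 1
store = dist.FileStore("/tmp/example_filestore", 1)
store.set("key0", "value0")
store.set("key1", "value1")
print(store.num_keys())  # the newly implemented API; an internal init key may also be counted
```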

Test Plan: Added both C++ and Python tests to verify functionality.

Reviewed By: jiayisuse

Differential Revision: D25619001

fbshipit-source-id: 9146d0da9e0903622be3035880f619bbb2cc3891
2020-12-17 14:54:24 -08:00
Luca Wehrstedt
9234f5026d Make WorkNCCL use CUDAEvent::query() rather than re-implement it (#49343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49343

at::cuda::CUDAEvent is "lazy" and only creates an event when it's first recorded. Until then, at::cuda::CUDAEvent is empty. If we use at::cuda::CUDAEvent::query() this is taken into account (an empty event is always ready), but WorkNCCL extracts the raw cudaEvent_t value from at::cuda::CUDAEvent and calls cudaEventQuery manually and doesn't check this. This could cause a failure.

It's unclear if this is ever supposed to happen, but we're seeing that failure, and we want to sort it out in order to see if there's something "deeper" going on.
ghstack-source-id: 118532806

Test Plan: Unit tests

Reviewed By: SciPioneer

Differential Revision: D25537844

fbshipit-source-id: 506319f4742e1c0a02aa75ecc01112ea3be42d8f
2020-12-15 03:15:48 -08:00
Luca Wehrstedt
f204f77e6d Drop FutureNCCL in favor of vanilla CUDAFuture (#49014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49014

We extracted a generic and reusable CUDAFuture class from FutureNCCL, but we had left FutureNCCL around, as a subclass of CUDAFuture, in order to deal with some peculiarity of ProcessGroupNCCL, namely that the future would be completed right away when constructed and that its CUDA events would be _shared_ with the ones of the WorkNCCL. This required some "hacks" in CUDAFuture itself (protected members, fields wrapped in shared_ptrs, ...).

My understanding is that creating CUDA events is a rather cheap operation. That would mean that we could afford to record _twice_ the events after each NCCL call, once for the WorkNCCL and once for the future. By doing so, we can use the CUDAFuture class directly and revert all its hacks.
ghstack-source-id: 118391217

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25355272

fbshipit-source-id: 3a2a0891724928221ff0f08600675d2f5990e674
2020-12-11 09:25:05 -08:00
Luca Wehrstedt
5ab90b2fda Make CUDAFuture remember and restore current device in callback (#48789)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48789

CUDAFuture aims to "capture" the current state of CUDA-related stuff when the future is marked complete (e.g., by looking at current streams and recording events on them) and then "replicate" a similar state when users synchronize with the result of the future (by synchronizing the current streams with these events).

However, one "contextual" aspect of CUDA that we weren't capturing/replicating was the current device. This diff tries to fix that. I must mention that we can only do this for callbacks, while we cannot do it for the wait() method. I don't know if such a discrepancy between the two actually makes the overall behavior _worse_. I'd love to hear people's opinions on this.
ghstack-source-id: 118081338

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25210335

fbshipit-source-id: 1d1a3f80b1cc42e5114bc88554ed50617f1aaa90
2020-12-11 03:35:53 -08:00
Rohan Varma
696e30af6e Fix ProcessGroupNCCL profiling when profiler is not run with use_cuda (#48946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48946

Move recordFunctionEndCallback to after the blocking portion of launching the NCCL kernel, and remove addCallback since it runs the lambda inline anyway and triggers unnecessary CUDA stream logic. If we want CUDA operations such as NCCL kernels accurately profiled, we should use the profiler with use_cuda=True. However, we are currently debugging a deadlock for the use_cuda=True case; the fix is being tracked in #48987.

To ensure that the tests are no longer flaky, submitted this PR to ci-all: #48947 and ran the test a bunch of times ssh'd into the CI machine.

ghstack-source-id: 118330130

Test Plan: CI

Reviewed By: mrzzd

Differential Revision: D25368322

fbshipit-source-id: 7d17036248a3dcd855e58addc383bba64d6bc391
2020-12-10 21:09:41 -08:00
Yixin Bao
840e71f4e6 Check CUDA kernel launches (/fbcode/caffe2/) (#49145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49105

(1) Add a safety check `C10_CUDA_KERNEL_LAUNCH_CHECK()` after each kernel launch. This diff only changes the files inside the directory /fbsource/fbcode/caffe2/modules/, /fbsource/fbcode/caffe2/fb/, /fbsource/fbcode/caffe2/test/.

(2) Get rid of old check `AT_CUDA_CHECK(cudaGetLastError())` when necessary.

Test Plan:
Test build:
```
buck build mode/dev-nosan //caffe2/modules/detectron:
buck test mode/dev-nosan //caffe2/modules/detectron:
buck build mode/dev-nosan //caffe2/torch/fb/:
buck test mode/dev-nosan //caffe2/torch/fb/:
```

To check for launches without checks:
```
python3 caffe2/torch/testing/check_kernel_launches.py
```
Make sure none of the updated files are in the returned list.

Reviewed By: r-barnes

Differential Revision: D25452852

fbshipit-source-id: d6657edab612c9e0fa99b29c68460be8b1a20064
2020-12-10 10:43:03 -08:00
Luca Wehrstedt
b5a7e25059 Cache the DataPtrs in CUDAFuture (#48788)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48788

CUDAFuture needs to inspect the value it contains in order to first determine what devices its tensors reside on (so that it can record events on those devices), and then to record these tensors with the caching allocator when they are used in other streams. Extracting data ptrs can become somewhat expensive (especially if we resort to using the pickler to do that), hence it's probably a good idea to cache the result the first time we compute it.
ghstack-source-id: 118180023

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25303486

fbshipit-source-id: 5c541640f6d19249dfb5489ba5e8fad2502836fb
2020-12-10 03:54:29 -08:00
Luca Wehrstedt
030fa6cfba Split out reusable CUDAFuture from FutureNCCL (#48506)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48506

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

FutureNCCL is now a general-purpose type-agnostic multi-device class, so in this commit I extract it from ProcessGroupNCCL to make it available for wider use (notably by the RPC module). We'll call this new class CUDAFuture. We'll keep FutureNCCL as a subclass of CUDAFuture to deal with some NCCL peculiarity, namely the fact that the future becomes complete immediately upon creation. We can clean this up for good once we're done merging Future and Work.

I'm not exactly sure of where to put CUDAFuture. It needs to be available to both c10d and RPC (which lives under torch/csrc). If I figured CMake out correctly (and that's a big if) I think c10d can only depend on ATen (I'll maybe add a comment with how I tracked that down). Hence we cannot put CUDAFuture in torch/csrc. On the other hand, RPC currently depends on c10d, because RPC agents use ProcessGroups internally, so it would be "ok" to put CUDAFuture in c10d. However, we want to get rid of ProcessGroups in RPC, and at that point RPC should in principle not depend on c10d. In that case, the only shared dep between the two that I see is ATen itself.

While I'm a bit wary of putting it right in ATen, I think it might actually make sense. CUDAFuture is intended to be a general-purpose component that can be reused in all settings and is not particularly tied to c10d or RPC. Moreover, ATen already contains ivalue::Future, and it contains a lot of CUDA helpers, so CUDAFuture definitely belongs to the "closure" of what's already there.
ghstack-source-id: 118180030

Test Plan: Unit tests?

Reviewed By: wanchaol

Differential Revision: D25180532

fbshipit-source-id: 697f655240dbdd3be22a568d5102ab27691f86d4
2020-12-10 03:54:26 -08:00
Luca Wehrstedt
4c425e8da0 Merge common parts of FutureNCCL into at::ivalue::Future (#48505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48505

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

FutureNCCL isn't just adding CUDA support to ivalue::Future, it's also reimplementing a lot of the latter's logic (by overriding plenty of its methods). That's brittle, as whenever a new method is added to ivalue::Future there's a risk of forgetting to add it to FutureNCCL, and in such a case calling this method on FutureNCCL would defer to the base class and give inconsistent results (e.g., future not being completed when it actually is). This _is already happening_, for example with the waitAndThrow or hasError, which are not implemented by FutureNCCL. In addition, this creates duplication between the two classes, which could lead to inconsistencies of behavior, bugs, missing features, ...

The best solution would be to keep the core future logic in ivalue::Future, and have _only_ the CUDA additions in FutureNCCL. That's what we're going to do, in two steps. In the previous commit, I split the CUDA features into separate hooks, which are called by FutureNCCL's other methods. In this commit, I'm removing these latter methods, and invoke the hooks directly from ivalue::Future.
ghstack-source-id: 118180032

Test Plan: Unit tests

Reviewed By: wanchaol

Differential Revision: D25180535

fbshipit-source-id: 19181fe133152044eb677062a9e31e5e4ad3c03c
2020-12-10 03:54:22 -08:00
Luca Wehrstedt
9078088edb Split FutureNCCL's CUDA-specific parts from generic future logic (#48504)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48504

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

FutureNCCL isn't just adding CUDA support to ivalue::Future, it's also reimplementing a lot of the latter's logic (by overriding plenty of its methods). That's brittle, as whenever a new method is added to ivalue::Future there's a risk of forgetting to add it to FutureNCCL, and in such a case calling this method on FutureNCCL would defer to the base class and give inconsistent results (e.g., future not being completed when it actually is). This _is already happening_, for example with the waitAndThrow or hasError, which are not implemented by FutureNCCL. In addition, this creates duplication between the two classes, which could lead to inconsistencies of behavior, bugs, missing features, ...

The best solution would be to keep the core future logic in ivalue::Future, and have _only_ the CUDA additions in FutureNCCL. That's what we're going to do, in two steps. In this commit, I'll split the CUDA features into separate hooks, which are called by FutureNCCL's other methods. In the next commit, I'll remove these latter methods, and invoke the hooks directly from ivalue::Future.
ghstack-source-id: 118180025

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25180534

fbshipit-source-id: 7b3cd374aee78f6c07104daec793c4d248404c61
2020-12-10 03:54:19 -08:00
Luca Wehrstedt
a6778989d1 Support wider range of types in FutureNCCL (#48502)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48502

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

FutureNCCL restricted the values to be tensors, or (singleton) lists of tensors, or Python object that could be converted to either of those types. We need a CUDA future that can handle more generic types though.

The main challenge is extracting all DataPtrs from an arbitrary object. I think I found some ways of doing so, but I'd like some JIT experts to look into this and tell me if there are better ways. I'll add inline comments for where their input would be appreciated.
ghstack-source-id: 118180026

Test Plan: Unit tests (I should probably add new ones)

Reviewed By: wanchaol

Differential Revision: D25177562

fbshipit-source-id: 1ef18e67bf44543c70abb4ca152f1610dea4e533
2020-12-10 03:54:15 -08:00
Luca Wehrstedt
9fe3ac3650 Don't store device indices separately on FutureNCCL (#48501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48501

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

FutureNCCL stores a set of devices (on which the tensors in the data reside) and a CUDA event for each of those devices. In fact, each event instance also already contains the device it belongs to, which means we can avoid storing that information separately (with the risk that it'll be mismatched and/or inaccurate).
ghstack-source-id: 118180024

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25177554

fbshipit-source-id: 64667c176efc2a7dafe99457a1fbba5d142cb06c
2020-12-10 03:54:12 -08:00
Luca Wehrstedt
e294c2d841 Add multi-GPU support to FutureNCCL (#48500)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48500

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

After the previous changes, this is now much simpler than it sounds. For the most part it just consists of repeating some operations multiple times, once per device (e.g., recording and blocking on events). Funnily, we already had a vector of events, even though we only ever stored one element in it (this probably comes from the fact that this is shared with WorkNCCL, which can hold more than one event). Here, we now also store a vector of device indices.

Perhaps the only non-trivial part of this is that now, for "follow-up" Futures (for callbacks), we can't know in advance which device the result will be on, so we must determine it dynamically when we receive the result, by inspecting it. That's also easier than it sounds because we already have a dataptr extractor.
ghstack-source-id: 118180022

Test Plan: Unit tests (I should probably add new ones)

Reviewed By: mrshenli

Differential Revision: D25177556

fbshipit-source-id: 41ef39ec0dc458e341aa1564f2b9f2b573d7fa9f
2020-12-10 03:54:09 -08:00
Luca Wehrstedt
91ad3ed831 Fix FutureNCCL not recording dataptrs with caching alloc in wait() (#48563)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48563

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

The CUDA caching allocator requires us to register all streams in which a DataPtr is used. We already do so when we invoke a callback, for which we obtain streams from the ATen pool. However, we didn't do so when the user waits for the Future and then uses the results in their current streams. This was probably fine in most cases, because the outputs of the NCCL ops (which are the tensors we're dealing with here) were user-provided, and thus already registered in some user streams, but in principle the user could use different streams when waiting than the ones they used to create the tensors. (If they use the same streams, registering becomes a no-op.) But, more importantly, this change will help us turn FutureNCCL into a more general-purpose class, as for example in RPC the tensors of the result are allocated by PyTorch itself and thus we need to record their usage on the user's streams with the caching allocator.
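
A minimal Python sketch of the caching-allocator contract this refers to (a CUDA device is assumed): a tensor allocated on one stream must be registered with any other stream that consumes it.
```
import torch

# minimal sketch, assuming a CUDA device is available
side_stream = torch.cuda.Stream()
with torch.cuda.stream(side_stream):
    out = torch.randn(1024, device="cuda")  # memory is associated with side_stream

torch.cuda.current_stream().wait_stream(side_stream)  # order the computation
out.record_stream(torch.cuda.current_stream())        # register the consumer stream with the allocator
result = out * 2                                      # now safe to use on the current stream
```
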
ghstack-source-id: 118180033

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25210338

fbshipit-source-id: e0a4ba157653b74dd84cf5665c992ccce2dea188
2020-12-10 03:54:06 -08:00
Luca Wehrstedt
003c30ba82 Fix FutureNCCL's completed() disagreeing with wait() (#48503)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48503

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

My impression is that one property of the upstream Future class is that once .wait() returns, or once a callback is invoked, then .completed() should return True. This was not the case for FutureNCCL because .wait() would return immediately, and callbacks would be invoked inline, but .completed() could return False if the CUDA async operations hadn't completed yet.

That was odd and confusing. Since there are other ways for users to check the status of CUDA operations (if they really need to, which I don't think is common), perhaps it's best to avoid checking the status of CUDA events in .completed().
ghstack-source-id: 118180028

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25180531

fbshipit-source-id: e1207f6b91f010f278923cc5fec1190d0fcdab30
2020-12-10 03:54:02 -08:00
Luca Wehrstedt
b91b0872a1 Record CUDA events for "follow-up" FutureNCCL inside markCompleted (#48499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48499

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

We can merge and "hide" a whole bunch of CUDA-related logic if we store and record the CUDA events that correspond to the completion of a FutureNCCL when we call markCompleted (rather than splitting it between the constructor, the `then` method, and a wrapper around the callback).

A more concrete reason for this change is that soon I'll add support for multi-device, and in that case we can't necessarily know in advance which devices a value will be on until we get that value (and we don't want to record an event on all devices as then we might "over-synchronize").

To me, this also makes more conceptual sense: the moment when we store a value on the future, which is the "signal" that the future is now ready, should also be the time at which we record the events needed to synchronize with that value. Though this may just be personal preference.
ghstack-source-id: 118180034

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25177557

fbshipit-source-id: 53d4bcdfb89fa0d11bb7b1b94db5d652edeb3b7b
2020-12-10 03:53:59 -08:00
Luca Wehrstedt
6157f8aeb5 Use fresh stream from pool for each FutureNCCL callback (#48498)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48498

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

FutureNCCL has a dedicated CUDA stream that it sets as current when running callbacks. This stream is initialized by the ProcessGroupNCCL by extracting it from the global ATen pool.

In order to decouple FutureNCCL from that specific ProcessGroup and make it more generic, in this commit we make FutureNCCL extract a fresh stream from the ATen pool each time it needs one.

This introduces a functional change, because it removes the implicit synchronization and ordering between the callbacks of a same Future. In fact, such an ordering is hard to guarantee in the general case as, for example, a user could attach a new callback just after the future becomes completed, and thus that callback would be run inline, immediately, out-of-order wrt the other callbacks. (There are ways to "fix" this but they are complicated). NCCL got around this because its futures are already marked complete when they're returned, but in fact it could also run into issues if multiple threads were adding callbacks simultaneously.

Note that it remains still possible to enforce ordering between callbacks, but one must now do so explicitly. Namely, instead of this:
```
fut.then(cb1)
fut.then(cb2)
```
one must now do:
```
fut.then(cb1).then(cb2)
```
ghstack-source-id: 118180029

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25177559

fbshipit-source-id: 4d4e73ea7bda0ea65066548109b9ea6d5b465599
2020-12-10 03:53:56 -08:00
Luca Wehrstedt
8fb52e7fa2 Make FutureNCCL record events in current stream (#48497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48497

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

When we record the events to mark a "follow-up" future complete (for a callback), we used to record them onto the dedicated stream, but that stream is the current stream at that time, so instead we can just record them onto the current stream. This introduces no functional differences. The reason I'm adding such an additional layer of indirection is so that the dedicated stream is only referenced inside the `addCallback` method, which will later allow us to more easily change how that stream works.
ghstack-source-id: 118180035

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25177553

fbshipit-source-id: c6373eddd34bd399df09fd4861915bf98fd50681
2020-12-10 03:53:53 -08:00
Luca Wehrstedt
e4267eb424 Have FutureNCCL record streams w/ allocator in addCallback (#48496)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48496

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

There are two ways to add a callback to a Future: `then` and `addCallback` (with the former deferring to the latter). FutureNCCL only "patched" `then`, which caused `addCallback` to be unsupported. By patching `addCallback`, on the other hand, we cover both.

The high-level goal of this change though is to remove all CUDA-specific stuff from `then`, and move it to either `markCompleted` or to a wrapper around the callback. This will take a few more steps to achieve.
ghstack-source-id: 118180031

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25177558

fbshipit-source-id: ee0ad24eb2e56494c353db700319858ef9dcf32b
2020-12-10 03:53:50 -08:00
Luca Wehrstedt
868a1a48c6 Add some safeguards to FutureNCCL (#48562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48562

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

In this commit I'm adding a few asserts to the constructors of FutureNCCL to make sure that what's passed in is what we expect (fun fact: until two commits ago that wasn't the case, as we were passed some empty events).

I'm also making the second constructor private, as it's only supposed to be used by the then() method.
ghstack-source-id: 118180036

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25210333

fbshipit-source-id: d2eacf0f7de5cc763e3cdd1ae5fd521fd2eec317
2020-12-10 03:53:47 -08:00
Luca Wehrstedt
b7f5aa9890 Remove NCCL dependency from PythonFutureWrapper (#48495)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48495

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

PythonFutureWrapper needs to provide a GIL-aware way to extract tensors from an IValue of type PyObject. Since this was only used by FutureNCCL it was guarded by #ifdef USE_C10D_NCCL. However, we will need to use it with CUDA-aware futures other than the NCCL one. This might have been achieved simply by replacing USE_C10D_NCCL with USE_CUDA, but I wanted to clean this up better.

We're dealing with two independent dimensions: C++-vs-Python and CPU-vs-CUDA. To make the code more modular, the two dimensions should be dealt with by orthogonal solutions: the user setting a custom callback to handle Python, and the subclass being CUDA-aware. Mixing these two axes makes it more complicated.

Another reason for changing how this works is that later on, when we'll introduce multi-device support, we'll need to extract dataptrs for other reasons too (rather than just recording streams with the caching allocator), namely to inspect the value to determine which devices it resides on.
ghstack-source-id: 118180038

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25177560

fbshipit-source-id: 3a424610c1ea191e8371ffee0a26d62639895884
2020-12-10 03:53:44 -08:00
Luca Wehrstedt
7f7f0fa335 Avoid using FutureNCCL before it's ready (#48561)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48561

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

WorkNCCL allows to extract a FutureNCCL through getFuture(). There is one instance of this method being called by ProcessGroupNCCL itself, in order to attach a callback to it. This was happening _before_ the work was actually launched, however FutureNCCL does _always_ invoke its callbacks immediately inline. The events that the FutureNCCL was using hadn't been recorded yet, thus blocking on them was a no-op. Moreover, the function that was being called was installed by the generic ProcessGroup superclass, which is not CUDA-aware, and thus probably didn't make any use of the CUDA events or streams.

383abf1f0c/torch/lib/c10d/ProcessGroup.cpp (L66)

In short: I believe that creating a FutureNCCL and attaching a callback was equivalent to just invoking that function directly, without any CUDA-specific thing. I'm thus converting the code to do just that, in order to simplify it.

Note that, given the comment, I don't think this was the original intention of that code. It seems that the function was intended to be run once the work finished. However, I am not familiar with this code, and I don't want to introduce any functional changes.
ghstack-source-id: 118180037

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25210337

fbshipit-source-id: 54033c814ac77641cbbe79b4d01686dfc2b45495
2020-12-10 03:48:43 -08:00
Supriya Rao
bfa95f90a0 Revert D25325039: Check CUDA kernel launches (/fbcode/caffe2/)
Test Plan: revert-hammer

Differential Revision:
D25325039 (f5e9ffbc27)

Original commit changeset: 2043d6e63c7d

fbshipit-source-id: 5377dd2aa7c6f58c8641c956b7642c7c559bbc40
2020-12-09 14:07:16 -08:00
Yixin Bao
f5e9ffbc27 Check CUDA kernel launches (/fbcode/caffe2/) (#49105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49105

(1) Add a safety check `C10_CUDA_KERNEL_LAUNCH_CHECK()` after each kernel launch. This diff only changes the files inside the directory /fbsource/fbcode/caffe2/modules/, /fbsource/fbcode/caffe2/fb/, /fbsource/fbcode/caffe2/test/.

(2) Get rid of old check `AT_CUDA_CHECK(cudaGetLastError())` when necessary.

Test Plan:
Test build:
```
buck build //caffe2/modules/detectron:
buck build //caffe2/torch/fb/:
```

To check for launches without checks:
```
python3 caffe2/torch/testing/check_kernel_launches.py
```
Make sure none of the updated files are in the returned list.

Reviewed By: r-barnes

Differential Revision: D25325039

fbshipit-source-id: 2043d6e63c7d029c35576d3101c18247ffe92f01
2020-12-09 12:34:55 -08:00
Yi Wang
7439bc4dd6 [Gradient Compression] Add an index field to GradBucket for PowerSGD (#48757)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48757

Add an index field to GradBucket, so that error_dict is keyed by this index instead of the hashcode of the input tensor. The replacement will be done in a separate diff, as the definition of this new method somehow couldn't be recognized in the OSS version.
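
A rough hook sketch of how per-bucket state could be keyed by the new index (both accessor names used below are assumptions and may differ between versions):
```
import torch

# minimal sketch of a comm hook keying per-bucket error-feedback state by the
# new index; assumes `state` is a dict passed to register_comm_hook and that
# the bucket exposes the index via an accessor (names assumed here)
def powersgd_like_hook(state, bucket):
    idx = bucket.get_index()              # assumed accessor for the new index field
    state.setdefault(idx, None)           # e.g. one error tensor per bucket for PowerSGD
    fut = torch.futures.Future()
    fut.set_result(bucket.get_tensors())  # accessor name assumed for this version
    return fut
```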

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117939208

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D25288496

fbshipit-source-id: 6f71977809690a0367e408bd59601ee62c9c03ea
2020-12-05 01:39:58 -08:00
Yanan Cao
a3298c2f64 Implement JIT serialization of ProcessGroup (#48544)
Summary:
This diff enables JIT serialization of `ProcessGroup`, including both base `ProcessGroup` class and derived classes like `ProcessGroupNCCL`.

If a `ProcessGroup` is created via high-level APIs like `dist_c10d.frontend().new_process_group_helper()`, they are automatically serializable. If a `ProcessGroup` is created via its derived class TorchBind APIs like `dist_c10d.ProcessGroupNCCL()`, then it has to be given a name and registered with `dist_c10d.frontend().register_process_group_name` to be uniquely identifiable and serializable.

* Fixed a minor bug in the new dist_c10d frontend, which failed to check whether a process group was used or not
* Fixed an issue where `test_jit_c10d.py` wasn't really run due to a configuration bug. Now tests are run as a slow test (need ci-all/* branch)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48544

Reviewed By: wanchaol

Differential Revision: D25298309

Pulled By: gmagogsfm

fbshipit-source-id: ed27ce37373c88277dc0c78704c48d4c19d46d46
2020-12-04 18:44:38 -08:00
Nikita Shulga
5654fc8edd Revert D25293474: [pytorch][PR] Server connects to its listen socket addr
Test Plan: revert-hammer

Differential Revision:
D25293474 (7c9ba62130)

Original commit changeset: 15f75dab48a4

fbshipit-source-id: 71ca136f2aa3204ad49f76c604f51c477cba270a
2020-12-04 17:08:03 -08:00
Zrss
7c9ba62130 Server connects to its listen socket addr (#46801)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46800

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46801

Reviewed By: heitorschueroff

Differential Revision: D25293474

fbshipit-source-id: 15f75dab48a4360645436360c216885cf3bd5667
2020-12-04 13:21:57 -08:00
Joe Zhu
92f376147c Enable TCPStore on Windows (#47749)
Summary:
Enable TCPStore for DDP on the Windows platform, in order to improve the performance of running DDP across machines.

Related RFC is https://github.com/pytorch/pytorch/issues/47659
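
A minimal sketch of bootstrapping DDP through a TCPStore (host, port, `rank`, and `world_size` are placeholders; rank 0 hosts the store and the other ranks connect to it):
```
import torch.distributed as dist

# minimal sketch, assuming `rank` and `world_size` are provided by the launcher
store = dist.TCPStore("127.0.0.1", 29500, world_size, rank == 0)
dist.init_process_group("gloo", store=store, rank=rank, world_size=world_size)
```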

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47749

Reviewed By: bdhirsh

Differential Revision: D25220401

Pulled By: mrshenli

fbshipit-source-id: da4b46b42296e666fa7d8ec8040093de7443a529
2020-12-03 08:32:01 -08:00
Ilia Cherniavskii
f7a8bf2855 Use libkineto in profiler (#46470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46470

Adding ability to use Kineto (CUPTI) to profile CUDA kernels

Test Plan:
USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
python test/test_profiler.py

python test/test_autograd.py -k test_profile
python test/test_autograd.py -k test_record

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                       Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                      sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                       Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                            aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                            aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                          aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                    aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                            aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                        cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                  cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                               aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                           aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                       cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                              aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
```

benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a

Reviewed By: Chillee

Differential Revision: D25142223

Pulled By: ilia-cher

fbshipit-source-id: b0dff46c28da5fb0a8e01cf548aa4f2b723fde80
2020-11-25 04:32:16 -08:00
Chester Liu
8177f63c91 Reorganize and refine the Windows.h import in C++ files (#48009)
Summary:
This PR aims to reduce the import overhead and symbol noise from the `windows.h` headers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48009

Reviewed By: gchanan

Differential Revision: D25045840

Pulled By: ezyang

fbshipit-source-id: 01fda70f433ba2dd0cd2d7cd676ab6ffe9d98b90
2020-11-20 14:21:09 -08:00
Yanan Cao
28580d3c0f Add TorchBind-based Python and TorchScript binding for ProcessGroup (#47907)
Summary:
Add TorchBind-binding for ProcessGroup class.

Currently there are a few limitations of TorchBind that prevent us from fully matching the existing PyBind binding of ProcessGroup:

- TorchBind doesn't support method overloading. The current PyBind binding uses overloading extensively to provide a flexible API, but TorchBind (and the TorchScript ClassType behind it) doesn't yet support it. Therefore, we can provide at most one version of each API under a given name.

- TorchBind doesn't support C++ enums yet. This prevents us from making real use of XXXOptions, which are widely used in many APIs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47907

Reviewed By: wanchaol

Differential Revision: D24945814

Pulled By: gmagogsfm

fbshipit-source-id: e103d448849ea838c10414068c3e4795db91ab1c
2020-11-19 20:25:56 -08:00
Yanan Cao
db767b7862 Add c10d new frontend to build (#48146)
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* https://github.com/pytorch/pytorch/issues/48148 Add TorchBind-based Python and TorchScript binding for ProcessGroup
* https://github.com/pytorch/pytorch/issues/48147 Add process group creation logic in c10d new frontend
* **https://github.com/pytorch/pytorch/issues/48146 Add c10d new frontend to build**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48146

Reviewed By: wanchaol

Differential Revision: D25073969

Pulled By: gmagogsfm

fbshipit-source-id: d111649144a4de9f380e5f7a2ad936860de4bd7b
2020-11-19 04:47:02 -08:00
Scott Wolchok
383abf1f0c [PyTorch] Make RecordFunction::active private (#47549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47549

In preparation for moving state onto the heap.
ghstack-source-id: 117027862

Test Plan: CI

Reviewed By: ilia-cher

Differential Revision: D24812214

fbshipit-source-id: 1455c2782b66f6a59c4d45ba58e1c4c92402a323
2020-11-18 17:58:54 -08:00
Omkar Salpekar
f8c559db8e [resubmit] Providing more information while crashing process in async error handling (#47246)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47246

We crash the process in NCCL Async Error Handling if the collective
has been running for longer than some set timeout. This PR introduces more
information about the rank and the duration for which the collective ran.
ghstack-source-id: 116676182

Test Plan: Run desync tests and flow.

Reviewed By: pritamdamania87

Differential Revision: D24695126

fbshipit-source-id: 61ae46477065a1a451dc46fb29c3ac0073ca531b
2020-11-13 20:11:06 -08:00
Omkar Salpekar
5d51b63984 Use Blocking Wait if both Blocking Wait and Async Error Handling Are Set (#47926)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47926

Given that we're soon enabling async error handling in PET, we should make the behavior explicit when users have set NCCL_BLOCKING_WAIT in their own code while also using PET. This PR essentially gives blocking wait precedence (for now). This way the blast radius of the PET change is smaller, while we continue working with blocking wait users and discussing whether moving to async error handling may be a good fit.
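
A minimal sketch of the configuration this decision applies to (these environment variables must be set in every process before the NCCL process group is created):
```
import os

# minimal sketch: when both variables are set, blocking wait takes precedence (for now)
os.environ["NCCL_BLOCKING_WAIT"] = "1"
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
# ... then dist.init_process_group("nccl", ...) as usual
```
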
ghstack-source-id: 116553583

Test Plan: Simple FBL run/CI

Reviewed By: jiayisuse

Differential Revision: D24928149

fbshipit-source-id: d42c038ad44607feb3d46dd65925237c564ff7a3
2020-11-13 14:43:00 -08:00
Wanchao Liang
553ccccc54 [c10d] switch ProcessGroup to be managed by intrusive_ptr (#47343)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47343

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24723418

Pulled By: wanchaol

fbshipit-source-id: 0463819b96c53b12bdbb3905431110d7b21beb77
2020-11-12 07:36:23 -08:00
Wanchao Liang
a02baa0c7a [reland][c10d] switch ProcessGroupNCCL:Options to be managed by intrusive_ptr (#47807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47807

reland https://github.com/pytorch/pytorch/pull/47075

Test Plan: wait for ci

Reviewed By: gmagogsfm

Differential Revision: D24905247

fbshipit-source-id: abd9731d86b3bd48d60bbc90d534823e0c037b93
2020-11-11 22:53:22 -08:00
Wanchao Liang
665ac2f7b0 [reland] [c10d] switch Store to be managed by intrusive_ptr (#47808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47808

reland https://github.com/pytorch/pytorch/pull/47074

Test Plan: wait for ci

Reviewed By: gmagogsfm

Differential Revision: D24905246

fbshipit-source-id: edeb7e6e486570ce889f12512e9dc02061d6cc03
2020-11-11 22:53:20 -08:00
Wanchao Liang
70ae5685f9 [reland][c10d] switch ProcessGroup::Work to be managed by intrusive_ptr (#47806)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47806

reland https://github.com/pytorch/pytorch/pull/44046

Test Plan: wait for ci

Reviewed By: gmagogsfm

Differential Revision: D24905245

fbshipit-source-id: ad75ace5432fcfd22d513878f5a73c4bb017324e
2020-11-11 22:51:03 -08:00
Wanchao Liang
dac0192148 Revert D23632280: [c10d] switch ProcessGroup::Work to be managed by intrusive_ptr
Test Plan: revert-hammer

Differential Revision:
D23632280 (0650a6166f)

Original commit changeset: 0a4642a8ffab

fbshipit-source-id: 2aa8ddb874fab11f773f4c08d740afcd865482e9
2020-11-11 10:54:08 -08:00
Wanchao Liang
1f946e942d Revert D24667128: [c10d] switch Store to be managed by intrusive_ptr
Test Plan: revert-hammer

Differential Revision:
D24667128 (0cfe3451d4)

Original commit changeset: 9b6024c31c85

fbshipit-source-id: d8ddf9eb2fccef5023e05698e0c4662708fe4945
2020-11-11 10:49:58 -08:00
Wanchao Liang
2204374fd4 Revert D24667127: [c10d] switch ProcessGroupNCCL:Options to be managed by intrusive_ptr
Test Plan: revert-hammer

Differential Revision:
D24667127 (ae5c2febb9)

Original commit changeset: 54986193ba1b

fbshipit-source-id: 12e1ebea1981c0b1b6dff4c8a2e2045878d44537
2020-11-11 10:42:33 -08:00
Wanchao Liang
ae5c2febb9 [c10d] switch ProcessGroupNCCL:Options to be managed by intrusive_ptr (#47075)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47075

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D24667127

Pulled By: wanchaol

fbshipit-source-id: 54986193ba1b22480622a2e9d6d41d9472d201f3
2020-11-10 23:36:47 -08:00
Wanchao Liang
0cfe3451d4 [c10d] switch Store to be managed by intrusive_ptr (#47074)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47074

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24667128

Pulled By: wanchaol

fbshipit-source-id: 9b6024c31c851b7c3243540f460ae57323da523b
2020-11-10 23:36:44 -08:00
Wanchao Liang
0650a6166f [c10d] switch ProcessGroup::Work to be managed by intrusive_ptr (#44046)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44046

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23632280

Pulled By: wanchaol

fbshipit-source-id: 0a4642a8ffabdd26c52c1baabfa30c0f446c3c85
2020-11-10 23:30:22 -08:00
Yanan Cao
9d0c6e9469 Implement Complex tensor support in all reduce and all gather (#47523)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47523

Reviewed By: bdhirsh

Differential Revision: D24806743

Pulled By: gmagogsfm

fbshipit-source-id: 627a5a0654c603bc82b90e4cb3d924b4ca416fbe
2020-11-06 22:26:48 -08:00
Mehdi Mirzazadeh
160db3db4f Adding profiling capability to c++ ddp collective functions (#46471)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46471

ghstack-source-id: 116018837

Test Plan:
Added unit tests:

 buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork
 buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork

Reviewed By: rohan-varma

Differential Revision: D23948397

fbshipit-source-id: 6d93a370aff26bf96c39e5d78a2492c5142a9156
2020-11-06 10:29:58 -08:00
Yi Wang
6b3802a711 [Gradient Compression] Export sizes, along with length and offset of each variable to GradBucket for PowerSGD (#47203)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47203

1. Create a new field in BucketReplica to store sizes info for each variable.
2. Export the sizes list, along with lengths and offsets, to GradBucket.

These fields are needed for PowerSGD.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 115875194

Test Plan: Checked the field values from log.

Reviewed By: rohan-varma

Differential Revision: D24644137

fbshipit-source-id: bcec0daf0d02cbf25389bfd9be90df1e6fd8fc56
2020-11-04 12:34:53 -08:00
Yanan Cao
5c4bd9a38f Move python-independent c10d implementations to torch/lib (#47309)
Summary:
* This is a pre-step to build c10d into libtorch
* Includes a minor cleanup in c10d/CMakeLists.txt

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47309

Reviewed By: wanchaol

Differential Revision: D24711768

Pulled By: gmagogsfm

fbshipit-source-id: 6f9e0a6a73c30f5ac7dafde9082efcc4b725dde1
2020-11-03 23:39:54 -08:00
Yi Wang
f91fcefc81 [Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks (#47270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47270

This is almost the same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType should be imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/csrc/distributed/c10d/init.cpp cannot be imported. See https://github.com/pytorch/pytorch/issues/47153

I tried to follow another enum type, ReduceOp, defined in the same file, but that did not work, because that C++ enum class is defined in the torch/lib/c10d library, whereas BuiltinCommHookType is defined in the torch/csrc/distributed library. These two libraries are compiled in two different ways.

To avoid adding typing to distributed package, which can be a new project, I simply removed the arg type of BuiltinCommHookType in this file.

To review the diff on top of #46959, compare V1 vs Latest:
https://www.internalfb.com/diff/D24700959?src_version_fbid=270445741055617

Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the built-in comm hooks.

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115783237

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl

//arvr/projects/eye_tracking/Masquerade:python_test

USE_DISTRIBUTED=0 USE_GLOO=0 BUILD_TEST=0 USE_CUDA=1 USE_MKLDNN=0 DEBUG=0 python setup.py install

Reviewed By: mrshenli

Differential Revision: D24700959

fbshipit-source-id: 69f303a48ae275aa856e6e9b50e12ad8602e1c7a
2020-11-03 18:33:50 -08:00
Omkar Salpekar
8b13ab9370 Event Logging for NCCL Async Error Handling Process Crash (#47244)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47244

This is an event-logging based update that should allow us to collect high-quality data about how many times the NCCL Async Error Handling mechanism is triggered. This logs an event called `ProcessGroupNCCL.WorkNCCL.handleNCCLGuard`, which is recorded as an entry in the `scuba_caffe2_pytorch_usage_stats` Scuba table. This Scuba entry will also contain metadata like workflow status, entitlement, hostnames, and workflow names, which will give us insight into what workloads/domains and machines are benefiting from async error handling. It also contains the Flow Run ID, which can be used as a join key with the `fblearner_workflow_run_status` scuba table for additional information like final error message, etc. We can easily quantify the number of times the async handling code was triggered by querying the `scuba_caffe2_pytorch_usage_stats` table.

As a demonstration, I ran the following workflow with this diff patched: f229675892
Since the workflow above causes a desync, the `handleNCCLGuard` event is logged in Scuba soon after. See here for the filtered table: https://www.fburl.com/scuba/scuba_caffe2_pytorch_usage_stats/tmp1uvio

As you can see, there are 4 entries. The workflow above uses 3 GPUs, 2 of which run into the desync scenario and are crashed using async error handling. We make this fail twice before succeeding the 3rd time, hence 4 entries.
ghstack-source-id: 115708632

Test Plan: Did a quick demo as described above. Scuba entries with the logs can be found here: https://www.fburl.com/scuba/scuba_caffe2_pytorch_usage_stats/tmp1uvio

Reviewed By: jiayisuse

Differential Revision: D24688739

fbshipit-source-id: 7532dfeebc53e291fbe10d28a6e50df6324455b1
2020-11-03 13:42:42 -08:00
Yi Wang
b1b77148ac Back out "[Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks" (#47234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47234

Revert the diff because of https://github.com/pytorch/pytorch/issues/47153

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115720415

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D24691866

fbshipit-source-id: 58fe0c45943a2ae2a09fe5d5eac4a4d947586539
2020-11-02 20:51:18 -08:00
Alban Desmaison
c10aa44e33 Back out "Providing more information while crashing process in async error handling" (#47185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47185

Original commit changeset: 02d48f13352a

Test Plan: CI

Reviewed By: mruberry

Differential Revision: D24682055

fbshipit-source-id: 060efa29eb2f322971848ead447021f6972cb3f3
2020-11-02 08:34:30 -08:00
Yi Wang
ee0033af9b [Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks (#46959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46959

1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the built-in comm hooks.

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115629230

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl

Reviewed By: pritamdamania87

Differential Revision: D24471910

fbshipit-source-id: f96b752298549ea2067e2568189f1b394abcd99a
2020-10-30 23:19:42 -07:00
Omkar Salpekar
7eb427e931 Providing more information while crashing process in async error handling (#46274)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46274

We crash the process in NCCL Async Error Handling if the collective
has been running for greater than some set timeout. This PR logs more
information about the rank and duration the collective ran before throwing an exception.
ghstack-source-id: 115614622

Test Plan:
Run desync tests and flow. Here are the Flow runs showing the right messages: f225031389
f225032004

Reviewed By: jiayisuse

Differential Revision: D24200144

fbshipit-source-id: 02d48f13352aed40a4476768c123d5cebbedc8e0
2020-10-30 16:22:51 -07:00
Jeff Daily
ce5bca5502 ProcessGroupNCCL::alltoall_base needs to call recordStream (#46603)
Summary:
For similar reasons as documented in the `[Sync Streams]` note.  For a current example, `ProcessGroupNCCL::allgather` must also call `recordStream` and does so already.

The output tensor is created on the default stream (by the application).  NCCL/RCCL internally uses another stream (i.e., ncclStream).  If we do not record the output tensor on the ncclStream, there is a chance that the output tensor might be deallocated while NCCL/RCCL is using it.

The application is not aware of the ncclStream since it's internal to ProcessGroupNCCL.  So, the application cannot record the output tensor on the ncclStream.

Patch originally developed by sarunyap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46603

Reviewed By: srinivas212

Differential Revision: D24458530

fbshipit-source-id: b02e74d1c3a176ea1b9bbdd7dc671b221fcadaef
2020-10-22 15:53:19 -07:00
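A Python-level illustration of the caching-allocator rule the fix above relies on; the side stream here is a stand-in for the internal ncclStream, and the sizes are arbitrary:
```
import torch

side_stream = torch.cuda.Stream()            # stands in for the internal ncclStream
out = torch.empty(1024, device="cuda")       # allocated on the default stream by the application
with torch.cuda.stream(side_stream):
    out.fill_(1.0)                           # work enqueued on the side stream
# Without this call the caching allocator may recycle `out`'s memory once the
# tensor goes out of scope, even though the side stream may still be using it.
out.record_stream(side_stream)
```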
Yi Wang
98aad933b6 [pytorch][PR] Record FutureNCCL callback stream on CUDA caching allocator (#45318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45318

When calling `then()` from WorkNCCL, record the input data pointers in futureNCCLCallbackStream_ before the execution of the input callback.

Note that the recording cannot be directly added to the lambda used by addCallback in ProcessGroupNCCL.hpp. This is because the type of the future value in that context is pyobject rather than TensorList, and such a type cast would require pybind and introduce a Python dependency, which should not be allowed in the c10d library.

I have considered creating a util function in a separate file to support this type casting, and then placing it under the torch/csrc directory, where a Python dependency is allowed. However, torch/csrc has a dependency on c10d, so this would create a circular dependency.

Finally, a `record_stream_cb_` member is added to FutureNCCL, and the default value is nullptr. A default `record_stream_cb_` implementation is added to `PythonFutureWrapper`, where a Python dependency is allowed.

In addition, a few lines are reformatted by lint.
caffe2/torch/csrc/distributed/c10d/init.cpp is only reformatted.

#Closes: https://github.com/pytorch/pytorch/issues/44203

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- ProcessGroupNCCLTest
buck test mode/dev-nosan caffe2/test/distributed:c10d  -- test_accumulate_gradients_no_sync_allreduce_with_then_hook
buck test mode/dev-nosan caffe2/test/distributed:c10d  -- test_ddp_comm_hook_allreduce_with_then_hook_nccl

Reviewed By: pritamdamania87

Differential Revision: D23910257

fbshipit-source-id: 66920746c41f3a27a3689f22e2a2d9709d0faa15
2020-10-22 01:49:47 -07:00
Omkar Salpekar
2e2fe8cf3b [NCCL] Modularize ncclCommWatchdog (#46051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46051

Creates a subroutine for aborting timed out collectives. This should help modularize the ncclCommWatchdog a bit, since it is growing too large.
ghstack-source-id: 114398496

Test Plan:
Successful Flow Run:
f225037915
f217609101

Reviewed By: jiayisuse

Differential Revision: D23607535

fbshipit-source-id: 0b1c9483bcd3a41847fc8c0bf6b22cdba01fb1e6
2020-10-16 11:06:40 -07:00
Alexander Golynski
e7e919fc34 Add warning on ProcessGroup and ProcessGroup::Work APIs (#46220)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46220

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24294437

Pulled By: gmagogsfm

fbshipit-source-id: 198f8e5760beeb1d18740f971647d2537afb3dd6
2020-10-14 16:27:37 -07:00
Omkar Salpekar
d655341adb [Distributed] General Function for Parsing Environment Variable Flags in PG (#46045)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46045

PG NCCL functionality differs based on certain binary environment
variables such as NCCL_BLOCKING_WAIT and NCCL_ASYNC_ERROR_HANDLING. Previously
we had separate helper functions to parse these env vars and set class variables
accordingly. This PR introduces a single general-purpose parsing function instead.
ghstack-source-id: 114209823

Test Plan:
Ran the following flow with NCCL_BLOCKING_WAIT set, and ensured the
ProcessGroup constructor set blockingWait_ to true: f223454701

Reviewed By: jiayisuse

Differential Revision: D24173982

fbshipit-source-id: b84db2dda29fcf5d163ce8860e8499d5070f8818
2020-10-14 12:21:11 -07:00
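A rough Python sketch of the idea; the actual helper described above is a C++ function inside ProcessGroupNCCL, so the name and behavior below are illustrative assumptions:
```
import os

def parse_env_flag(name, default=False):
    # hypothetical helper mirroring the general-purpose parser described above
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip() not in ("", "0")

blocking_wait = parse_env_flag("NCCL_BLOCKING_WAIT")
async_error_handling = parse_env_flag("NCCL_ASYNC_ERROR_HANDLING")
```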
Omkar Salpekar
2ffb768607 [Distributed] deleteKey support for HashStore (#46049)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46049

Adding support for the deleteKey API in the c10d HashStore.
ghstack-source-id: 113874207

Test Plan:
Added C++ tests to check whether deleteKey function works, and
whether it returns an exception for attempting to delete non-existing keys.

Reviewed By: jiayisuse

Differential Revision: D24067657

fbshipit-source-id: 4c58dab407c6ffe209585ca91aa430850261b29e
2020-10-14 12:04:42 -07:00
Omkar Salpekar
74f13a8b8f [Distributed] Adding getNumKeys support to the HashStore (#46048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46048

This PR adds support for the getNumKeys API for the HashStore
ghstack-source-id: 113874241

Test Plan: Added C++ tests for the HashStore::getNumKeys

Reviewed By: jiayisuse

Differential Revision: D24067658

fbshipit-source-id: 2db70a90f0ab8ddf0ff03cedda59b45ec987af07
2020-10-14 12:01:22 -07:00
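A small sketch of how the HashStore operations added in the two entries above might be exercised from Python, assuming the Python bindings expose delete_key() and num_keys():
```
import torch.distributed as dist

store = dist.HashStore()                 # in-process key/value store
store.set("epoch", "3")
print(store.get("epoch"))                # b'3' -- values come back as bytes
print(store.num_keys())                  # getNumKeys equivalent
store.delete_key("epoch")                # deleteKey equivalent; a later get() would time out
```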
Rohan Varma
965046c445 [NCCL] Provide additional information about NCCL error codes. (#45950)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45950

A pain point for debugging failed training jobs due to NCCL errors has
been understanding the source of the error, since NCCL does not itself report
too many details (usually just "unhandled {system, cuda, internal} error").

In this PR, we add some basic debug information about what went wrong. The information is collected by grepping the NCCL codebase for when these errors are thrown. For example, `ncclSystemError` is what is thrown when system calls such as malloc or munmap fail.

Tested by forcing `result = ncclSystemError` in the macro. The new error
message looks like:

```RuntimeError: NCCL error in:
caffe2/torch/lib/c10d/ProcessGroupNCCL.cpp:759, unhandled system error, NCCL
version 2.7.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
```

The last line is what we have added to the message.

In the future, we will also evaluate setting NCCL_DEBUG=WARN, by which NCCL
provides more details about errors as well.
ghstack-source-id: 114219288

Test Plan: CI

Reviewed By: mingzhe09088

Differential Revision: D24155894

fbshipit-source-id: 10810ddf94d6f8cd4989ddb3436ddc702533e1e1
2020-10-13 21:18:20 -07:00
Omkar Salpekar
952dc7ed87 [NCCL] Fix Hang in Async Error Handling due to Work logging (#46265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46265

tl;dr - we must remove tensor-related logging from the
WorkNCCL::operator<< function, otherwise printing the work objects tracked in
the workMetaList_ will cause segfaults.

The Work objects we track in the workMetaList for the NCCL Async Error
Handling mechanism don't have any `outputs_`. As described in the workEnqueue
function, destructing the output tensors calls into autograd_meta, which
happens in the user thread, but our system destructs work objects in the
workCleanupThread, so this could lead to a deadlock scenario. We avoid this
problem by not tracking the tensors in the work objects in the workMetaList
(it's called work meta list because these work objects only track the metadata
and not the actual tensors), so when the WorkNCCL::operator<< function tried to
log tensor shapes for work objects from the watchdog thread, the async error
handling mechanism hung (in the desync test) or segfaulted (in the desync
flow). This PR removes the tensor-related logging from the operator<< function.
ghstack-source-id: 114192929

Test Plan: Verified that this fixes the desync test and desync flow.

Reviewed By: jiayisuse

Differential Revision: D24268204

fbshipit-source-id: 20ccb8800aa3d71a48bfa3cbb65e07ead42cd0dc
2020-10-13 16:23:56 -07:00
Omkar Salpekar
172036a565 [NCCL] Add Error log when ProcessGroupNCCL takes down process upon (#44988)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44988

The new NCCL async error handling feature throws an exception from the
workCleanup Thread if one of the NCCL operations encounters an error or times
out. This PR adds an error log to make it more clear to the user why the
training process crashed.
ghstack-source-id: 114002493

Test Plan:
Verified that we see this error message when running with the desync
test.

Reviewed By: pritamdamania87

Differential Revision: D23794801

fbshipit-source-id: 16a44ce51f01531062167fb762a8553221363698
2020-10-09 16:58:50 -07:00
Omkar Salpekar
e33d455ef7 [Distributed] Set smaller Store timeouts to make c10d tests run faster (#46067)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46067

In our store tests, we expect there to be an exception when we call
get on a recently-deleted key. Unfortunately, the store waits for the timeout
period for the key to be set before throwing, which causes the tests to idle
for 5+ minutes. This PR decreases the timeouts before this get call so
these tests run faster.
ghstack-source-id: 113917315

Test Plan: Ran both the Python and C++ tests.

Reviewed By: pritamdamania87

Differential Revision: D24208617

fbshipit-source-id: c536e59ee305e0c01c44198a3b1a2247b8672af2
2020-10-09 15:45:42 -07:00
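A hedged sketch of the pattern described above, shown with a TCPStore from Python; the host, port, and timeout values are placeholders:
```
from datetime import timedelta

import torch.distributed as dist

# placeholder host/port; a single-rank "master" store for illustration
store = dist.TCPStore("127.0.0.1", 29500, 1, True, timeout=timedelta(seconds=300))
store.set("key", "value")
store.delete_key("key")
store.set_timeout(timedelta(seconds=2))  # shrink the wait before the expected failure
try:
    store.get("key")                     # raises once the short timeout expires
except RuntimeError:
    pass
```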
Pritam Damania
c83314e982 [ci-all tests] Improve logging in ProcessGroupNCCL for debugging purposes. (#46010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46010

When training jobs running with NCCL fail sometimes it is hard to
debug the reason of the failure and our logging doesn't provide enough
information at times to narrow down the issue.

To improve the debugging experience, I've enhanced our logging to add a lot
more information about what the ProcessGroup is doing under the hood.

#Closes: https://github.com/pytorch/pytorch/issues/45310

Sample output:
```
> I1002 15:18:48.539551 1822062 ProcessGroupNCCL.cpp:528] [Rank 2] NCCL watchdog thread started!
> I1002 15:18:48.539533 1821946 ProcessGroupNCCL.cpp:492] [Rank 2] ProcessGroupNCCL initialized with following options:
> NCCL_ASYNC_ERROR_HANDLING: 0
> NCCL_BLOCKING_WAIT: 1
> TIMEOUT(ms): 1000
> USE_HIGH_PRIORITY_STREAM: 0
> I1002 15:18:51.080338 1822035 ProcessGroupNCCL.cpp:530] [Rank 1] NCCL watchdog thread terminated normally
> I1002 15:18:52.161218 1821930 ProcessGroupNCCL.cpp:385] [Rank 0] Wrote aborted communicator id to store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> I1002 15:18:52.161238 1821930 ProcessGroupNCCL.cpp:388] [Rank 0] Caught collective operation timeout for work: WorkNCCL(OpType=ALLREDUCE, TensorShape=[10], Timeout(ms)=1000)
> I1002 15:18:52.162120 1821957 ProcessGroupNCCL.cpp:530] [Rank 0] NCCL watchdog thread terminated normally
> I1002 15:18:58.539937 1822062 ProcessGroupNCCL.cpp:649] [Rank 2] Found key in store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, from rank: 0, aborting appropriate communicators
> I1002 15:19:34.740937 1822062 ProcessGroupNCCL.cpp:662] [Rank 2] Aborted communicators for key in store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> I1002 15:19:34.741678 1822062 ProcessGroupNCCL.cpp:530] [Rank 2] NCCL watchdog thread terminated normally
```
ghstack-source-id: 113961408

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24183463

fbshipit-source-id: cb09c1fb3739972294e7edde4aae331477621c67
2020-10-09 09:46:58 -07:00
Mingzhe Li
8cd3857bc7 [NCCL] Add torch::cuda::nccl::send/recv (#45926)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45926

torch/csrc/cuda/nccl.cpp is compiled as part of the torch_cuda library, and thus calling this function from ProcessGroupNCCL.cpp avoids linking a second instance of libnccl.a into torch_python.
Fixes a similar issue as https://github.com/pytorch/pytorch/issues/42517

ghstack-source-id: 113910530

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D24147802

fbshipit-source-id: d8901fdb31bdc22ddca2364f8050844639a1beb3
2020-10-08 19:20:40 -07:00
Mingzhe Li
b7f7378b2d [NCCL] support send/recv to/from self when communicator is created on demand (#45873)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45873

This diff adds support for sending/receiving to/from self. It also fixes a bug where p2p operations are not used by all processes.
ghstack-source-id: 113910526

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D24124413

fbshipit-source-id: edccb830757ac64f569e7908fec8cb2b43cd098d
2020-10-08 19:19:15 -07:00
Nikita Shulga
c19b9cd18d Add torch::cuda::ncll::all2all (#45900)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45900

Use `torch::cuda::nccl::all2all` from `ProcessGroupNCCL.cpp`

Fixes https://github.com/pytorch/pytorch/issues/42517

Here is a NCCL dependency graph:
```
libnccl.a --> libtorch_cuda.so ---> libtorch_python.so
    |                                   ^
    |                                   |
    --------> libc10d.a -----------------
```
When a static library is linked into a dynamic library or an executable, the linker removes all unused/duplicate symbols from that library, unless the `-whole-archive` option is used. Before https://github.com/pytorch/pytorch/pull/42514, all NCCL calls made from `ProcessGroupNCCL.cpp` were also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`.
But adding `ncclSend`/`ncclRecv` to `ProcessGroupNCCL.cpp` forced the linker to embed those into `libtorch_python.so`, which also resulted in linking other dependent symbols into the library.

This PR adds the `nccl[Send|Recv]` calls to `torch_cuda.so` by implementing `all2all` in `torch_cuda`, and thus avoids double-linking the static library.

A more involved, but less error-prone, solution would be to use the wrappers exported in the `torch::cuda::nccl` namespace instead of making direct NCCL API calls.

Test Plan: Imported from OSS

Reviewed By: mingzhe09088

Differential Revision: D24138011

Pulled By: malfet

fbshipit-source-id: 33305197fc7d8707b7fd3a66b543f7733b9241a1
2020-10-07 23:56:31 -07:00
Natalia Gimelshein
de0d0bd5ee Revert D24093032: Improve logging in ProcessGroupNCCL for debugging purposes.
Test Plan: revert-hammer

Differential Revision:
D24093032 (c8d76ff7dc)

Original commit changeset: 240b03562f8c

fbshipit-source-id: dab7d54a5ba517bb308a1825b0d63ed146e5269d
2020-10-07 16:41:35 -07:00
Pritam Damania
c8d76ff7dc Improve logging in ProcessGroupNCCL for debugging purposes. (#45780)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45780

When training jobs running with NCCL fail sometimes it is hard to
debug the reason of the failure and our logging doesn't provide enough
information at times to narrow down the issue.

To improve the debugging experience, I've enhanced our logging to add a lot
more information about what the ProcessGroup is doing under the hood.

#Closes: https://github.com/pytorch/pytorch/issues/45310

Sample output:
```
> I1002 15:18:48.539551 1822062 ProcessGroupNCCL.cpp:528] [Rank 2] NCCL watchdog thread started!
> I1002 15:18:48.539533 1821946 ProcessGroupNCCL.cpp:492] [Rank 2] ProcessGroupNCCL initialized with following options:
> NCCL_ASYNC_ERROR_HANDLING: 0
> NCCL_BLOCKING_WAIT: 1
> TIMEOUT(ms): 1000
> USE_HIGH_PRIORITY_STREAM: 0
> I1002 15:18:51.080338 1822035 ProcessGroupNCCL.cpp:530] [Rank 1] NCCL watchdog thread terminated normally
> I1002 15:18:52.161218 1821930 ProcessGroupNCCL.cpp:385] [Rank 0] Wrote aborted communicator id to store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> I1002 15:18:52.161238 1821930 ProcessGroupNCCL.cpp:388] [Rank 0] Caught collective operation timeout for work: WorkNCCL(OpType=ALLREDUCE, TensorShape=[10], Timeout(ms)=1000)
> I1002 15:18:52.162120 1821957 ProcessGroupNCCL.cpp:530] [Rank 0] NCCL watchdog thread terminated normally
> I1002 15:18:58.539937 1822062 ProcessGroupNCCL.cpp:649] [Rank 2] Found key in store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, from rank: 0, aborting appropriate communicators
> I1002 15:19:34.740937 1822062 ProcessGroupNCCL.cpp:662] [Rank 2] Aborted communicators for key in store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> I1002 15:19:34.741678 1822062 ProcessGroupNCCL.cpp:530] [Rank 2] NCCL watchdog thread terminated normally
```
ghstack-source-id: 113731163

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24093032

fbshipit-source-id: 240b03562f8ccccc3d872538f5e331df598ceca7
2020-10-07 12:18:41 -07:00
Mingzhe Li
10d86d1196 [NCCL] create NCCL communicator for send/recv on demand (#44922)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44922

For NCCL send/recv operations, we will create NCCL communicator on demand following the same design as how it's currently done for collective operations.
ghstack-source-id: 113592757

Test Plan: to add

Reviewed By: pritamdamania87

Differential Revision: D23773726

fbshipit-source-id: 0d47c29d670ddc07f7181e8485af0e02e2c9cfaf
2020-10-05 18:33:03 -07:00
Mingzhe Li
59083d6176 [NCCL] Support NCCL Send/Recv (#44921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44921

This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See https://github.com/pytorch/pytorch/issues/43995 for more context.
ghstack-source-id: 113592785

Test Plan: unittest

Reviewed By: jiayisuse

Differential Revision: D23709848

fbshipit-source-id: cdf38050379ecbb10450f3394631317b41163258
2020-10-05 18:27:57 -07:00
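A minimal sketch of the point-to-point usage this enables, assuming a two-rank NCCL process group where each rank owns one GPU:
```
import torch
import torch.distributed as dist

def exchange(rank):
    # assumes dist.init_process_group("nccl", rank=rank, world_size=2) was called
    t = torch.full((4,), float(rank), device=f"cuda:{rank}")
    if rank == 0:
        dist.send(t, dst=1)
    else:
        dist.recv(t, src=0)              # t now holds rank 0's values
```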
Pritam Damania
b5a2f04089 Disallow creation of ProcessGroupNCCL without GPUs. (#45642)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45642

Prior to https://github.com/pytorch/pytorch/pull/45181, initializing a
NCCL process group would work even if no GPUs were present. Although, now since
init_process_group calls `barrier()` this would fail.

In general the problem was that we could initialize ProcessGroupNCCL without
GPUs and then if we called a method like `barrier()` the process would crash
since we do % numGPUs resulting in division by zero.
ghstack-source-id: 113490343

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24038839

fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc
2020-10-05 12:05:48 -07:00
Hongyi Jia
06a566373a [PyTorch/NCCL] Fix async error handling (#45456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45456

Remove work while not holding lock, to avoid deadlock with watchdog thread while GPU is 100%

SyncBatchNorm failure trace: P143879560

Test Plan:
**Desync test:**
BACKEND=nccl WORLD_SIZE=3 NCCL_ASYNC_ERROR_HANDLING=1 ./buck-out/gen/caffe2/test/distributed/distributed_nccl_spawn#binary.par -r test_DistributedDataParallel_desync

**SyncBatchNorm test:**
BACKEND=nccl WORLD_SIZE=3 NCCL_ASYNC_ERROR_HANDLING=1 ./buck-out/gen/caffe2/test/distributed/distributed_nccl_fork#binary.par -r test_DistributedDataParallel_SyncBatchNorm_Diff_Input_Sizes_gradient

Reviewed By: osalpekar

Differential Revision: D23972071

fbshipit-source-id: f03d9637a6ec998d64dab1a062a81e0f3697275f
2020-09-29 15:44:34 -07:00
Omkar Salpekar
6b65b3cbd8 [Distributed] DeleteKey API for c10d TCP Store (#45401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45401

Added a DeleteKey API for the TCP Store
ghstack-source-id: 112997162

Test Plan:
Modified the existing get/set test to use delete. Verified that the
correct keys were deleted and that the numKeys API returned the right values

Reviewed By: mrshenli

Differential Revision: D23955730

fbshipit-source-id: 5c9f82be34ff4521c59f56f8d9c1abf775c67f9f
2020-09-28 15:30:39 -07:00
Natalia Gimelshein
78caa028b6 Revert D23009117: [Distributed] DeleteKey API for c10d TCP Store
Test Plan: revert-hammer

Differential Revision:
D23009117 (addf94f2d6)

Original commit changeset: 1a0d95b43d79

fbshipit-source-id: ad3fe5501267e1a0a7bf23410766f1e92b34b24d
2020-09-27 12:04:42 -07:00
Omkar Salpekar
addf94f2d6 [Distributed] DeleteKey API for c10d TCP Store (#43963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43963

Added a DeleteKey API for the TCP Store
ghstack-source-id: 112939762

Test Plan:
Modified the existing get/set test to use delete. Verified that the
correct keys were deleted and that the numKeys API returned the right values

Reviewed By: jiayisuse

Differential Revision: D23009117

fbshipit-source-id: 1a0d95b43d79e665a69b2befbaa059b2b50a1f66
2020-09-26 00:54:21 -07:00
Omkar Salpekar
304e1d1e19 [Distributed] getNumKeys API to c10d TCPStore (#43962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43962

TCPStore needs a getNumKeys API for our logging needs.
ghstack-source-id: 112939761

Test Plan: Adding tests to C++ Store Tests

Reviewed By: pritamdamania87

Differential Revision: D22985085

fbshipit-source-id: 8a0d286fbd6fd314dcc997bae3aad0e62b51af83
2020-09-26 00:49:00 -07:00
gunandrose4u
f07ac6a004 Fix Windows build failure after DDP PR merged (#45335)
Summary:
Fixes #{issue number}
This is a resubmission of PR https://github.com/pytorch/pytorch/issues/42897, together with a fix for the Windows build issue introduced by PR https://github.com/pytorch/pytorch/issues/44344.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45335

Reviewed By: zou3519

Differential Revision: D23931471

Pulled By: mrshenli

fbshipit-source-id: f49b5a114944c1450b32934b3292170be064f494
2020-09-25 12:37:50 -07:00
Mike Ruberry
103fa3894a Revert D23841786: [pytorch][PR] Enable distributed package on windows, Gloo backend supported only
Test Plan: revert-hammer

Differential Revision:
D23841786 (0122299f9b)

Original commit changeset: 334ba1ed73ef

fbshipit-source-id: ec95432f9957df56a5a04e52661f5db920b7f57f
2020-09-24 22:44:33 -07:00
gunandrose4u
0122299f9b Enable distributed package on windows, Gloo backend supported only (#42897)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42095

Test cases will be committed to this PR later

mrshenli, please help to review

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42897

Reviewed By: osalpekar

Differential Revision: D23841786

Pulled By: mrshenli

fbshipit-source-id: 334ba1ed73eff2f668857390fc32d1bc7f08e5f3
2020-09-24 21:13:55 -07:00
Mingzhe Li
574f9af160 [NCCL] Add option to run NCCL on high priority cuda stream (#43796)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43796

This diff adds an option for the process group NCCL backend to pick high priority cuda streams.

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D23404286

fbshipit-source-id: b79ae097b7cd945a26e8ba1dd13ad3147ac790eb
2020-09-16 16:00:41 -07:00
Omkar Salpekar
f7278473d3 [NCCL] Fix NCCL_BLOCKING_WAIT functionality with Async Error Handling (#44411)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44411

This basically aborts errored NCCL communicators if either blocking
wait or async error handling is enabled. Otherwise we may abort nccl
communicators where neither are enabled, and this may result in subsequent GPU
operations using corrupted data.
ghstack-source-id: 111839264

Test Plan: Succesful Flow run: f217591683

Reviewed By: jiayisuse

Differential Revision: D23605382

fbshipit-source-id: 6c16f9626362be3b0ce2feaf0979b2dff97ce61b
2020-09-10 20:57:55 -07:00
Mehdi Mirzazadeh
2e744b1820 Support work.result() to get result tensors for allreduce for Gloo, NCCL backends (#43970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43970

It is a resubmission of #43386

Original commit changeset: 27fbeb161706
ghstack-source-id: 111775070

Test Plan:
Added checks to existing unit test and ran it on gpu devserver.
Verified the test that was failing in original diff also passes: https://app.circleci.com/pipelines/github/pytorch/pytorch/210229/workflows/86bde47b-f2da-48e3-a618-566ae2713102/jobs/7253683

Reviewed By: pritamdamania87

Differential Revision: D23455047

fbshipit-source-id: b8dc4a30b95570d68a482c19131674fff2a3bc7c
2020-09-10 17:13:37 -07:00
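A brief sketch of the API being reinstated here, assuming an initialized Gloo or NCCL process group:
```
import torch
import torch.distributed as dist

# assumes an initialized Gloo or NCCL process group
t = torch.ones(4)
work = dist.all_reduce(t, async_op=True)
work.wait()
outputs = work.result()                  # list of result tensors for the allreduce
```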
Yi Wang
38c10b4f30 [NCCL] Fix the initialization of futureNCCLCallbackStreams (#44347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44347

Cloned from Pull Request resolved: https://github.com/pytorch/pytorch/pull/44097, because the original author Sinan has completed the internship and now is unable to submit this diff.

As johnsonpaul mentioned in D23277575 (7d517cf96f), it looks like all processes were allocating memory on GPU-ID=0.

I was able to reproduce it by running `test_ddp_comm_hook_allreduce_with_then_hook_nccl` unit test of `test_c10d.py` and running `nvidia-smi` while test was running. The issue was reproduced as:
```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   3132563      C   python                                       777MiB |
|    0   3132564      C   python                                       775MiB |
|    4   3132564      C   python                                       473MiB |
+-----------------------------------------------------------------------------+
```
I realized that as we initialize ProcessGroupNCCL, both processes were initially allocating memory on GPU 0.

We later also realized that I forgot the `isHighPriority` input of `getStreamFromPool`, so `futureNCCLCallbackStreams_.push_back(std::make_shared<at::cuda::CUDAStream>(at::cuda::getStreamFromPool(device_index)));` was just creating a vector of GPU 0 streams. After I changed `at::cuda::getStreamFromPool(device_index)` to `at::cuda::getStreamFromPool(false, device_index)`, `nvidia-smi` looked like:
```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    673925      C   python                                       771MiB |
|    0    673926      C   python                                       771MiB |
|    1    673925      C   python                                       771MiB |
|    1    673926      C   python                                       771MiB |
|    2    673925      C   python                                       771MiB |
|    2    673926      C   python                                       771MiB |
|    3    673925      C   python                                       771MiB |
|    3    673926      C   python                                       771MiB |
|    4    673925      C   python                                       771MiB |
|    4    673926      C   python                                       771MiB |
|    5    673925      C   python                                       771MiB |
|    5    673926      C   python                                       771MiB |
|    6    673925      C   python                                       771MiB |
|    6    673926      C   python                                       771MiB |
|    7    673925      C   python                                       707MiB |
|    7    673926      C   python                                       623MiB |
+-----------------------------------------------------------------------------+
```
This confirms that we were just getting GPU 0 streams for the callback. I think this does not explain the `fp16_compress` stability issue, because we were able to reproduce that even without any then callback and just calling copy from fp32 to fp16 before allreduce. However, this can explain other issues where `allreduce` was not on par with `no_hook`. I'll run some additional simulations with this diff.

I tried to replace `getStreamFromPool` by `getDefaultCUDAStream(deviceIndex)`, and it wasn't causing additional memory usage. In this diff, I temporarily solved the issue by just initializing null pointers for each device in the constructor and setting the callback stream for the corresponding devices inside `ProcessGroupNCCL::getNCCLComm`. After the fix, it looks like the memory issue was resolved:
```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   2513142      C   python                                       745MiB |
|    4   2513144      C   python                                       747MiB |
+-----------------------------------------------------------------------------+
```
I could use a dictionary instead of a vector for `futureNCCLCallbackStreams_`, but since number of devices is fixed, I think it isn't necessary. Please let me know what you think in the comments.
ghstack-source-id: 111485483

Test Plan:
`test_c10d.py` and some perf tests. Also check `nvidia-smi` while running tests to validate memory looks okay.

This diff also fixes the regression in HPC tests as we register a hook:

{F322730175}

See https://fb.quip.com/IGuaAbD8bnvy (474fdd7e2d) for details.

Reviewed By: pritamdamania87

Differential Revision: D23495436

fbshipit-source-id: ad08e1d94343252224595d7c8a279fe75e244822
2020-09-10 11:25:38 -07:00
Omkar Salpekar
e028ad0762 Fix HashStoreTests and move to Gtest (#43384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43384

Much like the FileStoreTests, the HashStoreTests were also run in a single blob and threw exceptions upon failure. This modularizes the test by separating each function into separate gtest test cases.
ghstack-source-id: 111690834

Test Plan: Confirmed that the tests pass on devvm.

Reviewed By: jiayisuse

Differential Revision: D23257579

fbshipit-source-id: 7e821f0e9ee74c8b815f06facddfdb7dc2724294
2020-09-09 17:56:33 -07:00
Omkar Salpekar
69a3ff005d Modularize FileStoreTest and move to Gtest (#43383)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43383

FileStore Test currently has a large blob of tests that throw
exceptions upon failure. This PR modularizes each test so they can run
independently, and migrates the framework to gtest.
ghstack-source-id: 111690831

Test Plan: Confirmed tests pass on devvm

Reviewed By: jiayisuse

Differential Revision: D22879473

fbshipit-source-id: 6fa5468e594a53c9a6b972757068dfc41645703e
2020-09-09 17:56:30 -07:00
Omkar Salpekar
a7fba7de22 Convert StoreTestUtils to Gtest (#43382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43382

StoreTestCommon defines standard helper functions that are used by all of our Store tests. These helpers currently throw exceptions upon failure; this PR changes them to use gtest assertions instead.
ghstack-source-id: 111690833

Test Plan: Tested the 2 PR's above this on devvm

Reviewed By: jiayisuse

Differential Revision: D22828156

fbshipit-source-id: 9e116cf2904e05ac0342a441e483501e00aad3dd
2020-09-09 17:55:25 -07:00
Omkar Salpekar
48c47db8fe [NCCL] Add Environment Variable to guard Async Error Handling feature (#44163)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44163

In this PR, we introduce a new environment variable
(NCCL_ASYNC_ERROR_HANDLING), which guards the asynchronous error handling
feature. We intend to eventually turn this feature on by default for all users,
but this is a temporary solution so the change in behavior from hanging to
crashing is not the default for users all of a sudden.
ghstack-source-id: 111637788

Test Plan:
CI/Sandcastle. We will turn on this env var by default in
torchelastic and HPC trainer soon.

Reviewed By: jiayisuse

Differential Revision: D23517895

fbshipit-source-id: e7cd244b2ddf2dc0800ff7df33c73a6f00b63dcc
2020-09-09 12:26:25 -07:00
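A short sketch of opting into the guarded feature, assuming the usual env:// rendezvous variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are provided by the launcher:
```
import os

import torch.distributed as dist

# must be set before the NCCL process group is created
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
dist.init_process_group(backend="nccl")  # rank/world size taken from the launcher's env vars
```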
Omkar Salpekar
211ece7267 [NCCL] ProcessGroupNCCL Destructor Blocks on WorkNCCL Completion (#41054)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41054

**This Commit:**
ProcessGroupNCCL destructor now blocks until all WorkNCCL objects have either been aborted or completed and removed from the work vector.

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

ghstack-source-id: 111614314

Test Plan:
1. **DDP Sanity Check**: First we have a sanity check based on the PyTorch DDP benchmark. This verifies that the baseline DDP training with NCCL for  standard CU workloads works well (esp. with standard models like Resnet50 and BERT). Here is a sample Flow: f213293473

1. **HPC Performance Benchmarks**: This stack has undergone thorough testing and profiling on the Training Cluster with varying number of nodes. This introduces 1-1.5% QPS regression only (~200-400 QPS regression for 8-64 GPUs).

1. **HPC Accuracy Benchmarks**: We've confirmed NE parity with the existing NCCL/DDP stack without this change.

1. **Kernel-Specific Benchmarks**: We have profiled other approaches for this system (such as cudaStreamAddCallback) and performed microbenchmarks to confirm the current solution is optimal.

1. **Sandcastle/CI**: Apart from the recently fixed ProcessGroupNCCL tests, we will also introduce a new test for desynchronization scenarios.

Reviewed By: jiayisuse

Differential Revision: D22054298

fbshipit-source-id: 2b95a4430a4c9e9348611fd9cbcb476096183c06
2020-09-09 12:26:22 -07:00
Omkar Salpekar
afbf2f140b [NCCL] WorkNCCL Helper Functions (#41053)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41053

**This Commit:**
Some minor refactoring - added helper to check if `WorkNCCL` objects have timed out. Adding a new finish function to ProcessGroupNCCL::WorkNCCL that avoids notifying CV and uses `lock_guard`. Also renaming the timeoutCVMutex mutex to be more descriptive.

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

ghstack-source-id: 111614315

Test Plan: See D22054298 for verification of correctness and performance

Reviewed By: jiayisuse

Differential Revision: D21943520

fbshipit-source-id: b27ee329f0da6465857204ee9d87953ed6072cbb
2020-09-09 12:26:18 -07:00
Omkar Salpekar
f8f7b7840d [NCCL] Abort Errored and Timed Out NCCL Communicators from Watchdog Thread (#41052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41052

**This Commit:**
Watchdog Thread checks for error-ed or timed out `WorkNCCL` objects and aborts all associated NCCL Communicators. For now, we  also process these aborted communicators as with the existing Watchdog logic (by adding them to abortedCommIds and writing aborted communicator ids to the store.)

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

ghstack-source-id: 111614313

Test Plan: See D22054298 for verification of correctness and performance

Reviewed By: jiayisuse

Differential Revision: D21943151

fbshipit-source-id: 337bfcb8af7542c451f1e4b3dcdfc5870bdec453
2020-09-09 12:26:15 -07:00
Omkar Salpekar
4e5c55ef69 [NCCL] Use cudaEventQuery to Poll for GPU operation errors (#41051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41051

**This Commit:**
In the workCleanupThread, we process completion and exception handling for workNCCL objects corresponding to collective calls that have either completed GPU Execution, or have already thrown an exception. This way, we throw an exception from the workCleanupThread for failed GPU operations. This approach replaces the previous (and lower performance) approach of enqueuing a callback on the CUDA stream to process failures.

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

ghstack-source-id: 111614319

Test Plan: See D22054298 for verification of correctness and performance

Reviewed By: jiayisuse

Differential Revision: D21938498

fbshipit-source-id: df598365031ff210afba57e0c7be865e3323ca07
2020-09-09 12:26:12 -07:00
Omkar Salpekar
1df24fd457 [NCCL] Timeout Loop Thread for Async Error Handling (#41050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41050

**This Commit:**
We introduce a workVector to track live workNCCL objects corresponding to collective operations. Further, we introduce a workCleanupLoop, which busy-polls the vector of workNCCL objects and removes them upon completion.

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

Test Plan: See D22054298 for verification of correctness and performance

Reviewed By: jiayisuse

Differential Revision: D21916637

fbshipit-source-id: f8cadaab0071aaad1c4e31f9b089aa23cba0cfbe
2020-09-09 12:25:06 -07:00
Omkar Salpekar
7c464eed16 Skipping CUDA tests in ProcessGroupGloo and logs (#42488)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42488

Currently, ProcessGroupGloo tests do not emit logs if the test was
skipped due to CUDA not being available or not enough CUDA devices. This PR
clarifies the reason for skipping through these logs.
ghstack-source-id: 111638111

Test Plan: tested on devvm and devgpu

Reviewed By: jiayisuse

Differential Revision: D22879396

fbshipit-source-id: d483ca46b5e22ed986521262c11a1c6dbfbe7efd
2020-09-09 10:52:52 -07:00
Nikita Shulga
7035cd0f84 Revert D23216393: Support work.result() to get result tensors for allreduce for Gloo, NCCL backends
Test Plan: revert-hammer

Differential Revision:
D23216393 (0b2694cd11)

Original commit changeset: fed5e37fbabb

fbshipit-source-id: 27fbeb1617066fa3f271a681cb089622027d6689
2020-09-01 10:32:38 -07:00
Mehdi Mirzazadeh
0b2694cd11 Support work.result() to get result tensors for allreduce for Gloo, NCCL backends (#43386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43386

Resolves #43178

ghstack-source-id: 111109716

Test Plan: Added checks to existing unit test and ran it on gpu devserver.

Reviewed By: rohan-varma

Differential Revision: D23216393

fbshipit-source-id: fed5e37fbabbd2ac4a9055b20057fffe3c416c0b
2020-09-01 08:05:55 -07:00
Pritam Damania
f1624b82b5 Preserve python backtrace in autograd engine errors. (#43684)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43684

This PR attempts to address #42560 by capturing the appropriate
exception_ptr in the autograd engine and passing it over to the Future.

As part of this change, there is a significant change the Future API where we
now only accept an exception_ptr as part of setError.

For the example in #42560, the exception trace would now look like:

```
> Traceback (most recent call last):
>   File "test_autograd.py", line 6914, in test_preserve_backtrace
>     Foo.apply(t).sum().backward()
>   File "torch/tensor.py", line 214, in backward
>     torch.autograd.backward(self, gradient, retain_graph, create_graph)
>   File "torch/autograd/__init__.py", line 127, in backward
>     allow_unreachable=True)  # allow_unreachable flag
>   File "torch/autograd/function.py", line 87, in apply
>     return self._forward_cls.backward(self, *args)
>   File "test_autograd.py", line 6910, in backward
>     raise ValueError("something")
> ValueError: something
```
ghstack-source-id: 111109637

Test Plan: waitforbuildbot

Reviewed By: albanD

Differential Revision: D23365408

fbshipit-source-id: 1470c4776ec8053ea92a6ee1663460a3bae6edc5
2020-09-01 01:28:47 -07:00
Ashkan Aliabadi
4e39c310eb Move torch/csrc/utils/hash.h to c10/util/hash.h. (#42503)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42503

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252331

Pulled By: AshkanAliabadi

fbshipit-source-id: 3c4c0e27b9a7eec8560e374c2a3ba5f1c65dae48
2020-08-29 17:47:00 -07:00
Sinan Nasir
7d517cf96f [NCCL] Dedicated stream to run all FutureNCCL callbacks. (#43447)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43447

Two main better-engineering motivations to run all FutureNCCL callbacks on a dedicated stream:
1. Each time a then callback was called, we would get a stream from the pool and run the callback on that stream. If we observe the stream traces using that approach, we would see a lot of streams and debugging would become more complicated. If we have a dedicated stream to run all then callback operations, the trace results will be much cleaner and easier to follow.
2. getStreamFromPool may eventually return the default stream or a stream that is used for other operations. This can cause slowdowns.

Unless the then callback takes longer than the preceding allreduce, this approach will be as performant as the previous one.
ghstack-source-id: 110909401

Test Plan:
Perf trace runs to validate the desired behavior:
See the dedicated stream 152 is running the then callback operations:

{F299759342}

I run pytorch.benchmark.main.workflow using resnet50 and 32 GPUs registering allreduce with then hook.
See f213777896 [traces](https://www.internalfb.com/intern/perfdoctor/results?run_id=26197585)

After updates, same observation: see f214890101

Reviewed By: malfet

Differential Revision: D23277575

fbshipit-source-id: 67a89900ed7b70f3daa92505f75049c547d6b4d9
2020-08-28 17:26:23 -07:00
Rohan Varma
5ca6cbbd93 Remove unnecessary copies in ProcessGroupGloo for multiple inputs allreduce (#43543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43543

Closes https://github.com/pytorch/pytorch/issues/14691. This is not needed in the multiple outputs case, because gloo allreduce
will broadcast the result tensor to all the outputs. See
https://github.com/facebookincubator/gloo/issues/152 and commit
9cabb5aaa4
for more details. Came across this when debugging https://github.com/pytorch/pytorch/pull/42577.

This effectively reverts https://github.com/pytorch/pytorch/pull/14688 while still keeping the tests.

Tested by ensuring `test_allreduce_basics` in `test_c10d.py` still works as expected.
ghstack-source-id: 110636498

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23173945

fbshipit-source-id: d1ae08f84b4ac9919c53080949b8fffcb2fe63a8
2020-08-25 14:01:26 -07:00
Sinan Nasir
6e1127ea3f [NCCL] Changed FutureNCCL's then callback logic for better efficiency. (#42869)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42869

We realized that when we invoke a simple callback that divides the tensors by `world_size` after `allreduce`, the performance was almost 50% lower in terms of QPS compared to the case where a simple `allreduce` hook is used with no `then` callback.

The main problem was that, since we call `work.wait()` before invoking the `then` callback, we were synchronizing `work`'s stream with the default PyTorch stream inside [`runHook`](https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp#L609) and stalling the backward computation.

In this PR, we ensure that FutureNCCL's `then` callback does not stall the backward computation. Assuming single-process single-device, `FutureNCCL` gets a new stream from the device's pool using `at::cuda::getStreamFromPool` to run the `callback`, and before invoking the `callback` inline it synchronizes `WorkNCCL`'s stream with the callback's stream, not the default stream.

ghstack-source-id: 110208431

Test Plan: Run performance benchmark tests to validate performance issue is resolved. Also, `python test/distributed/test_c10d.py` to avoid any odd issues.

Reviewed By: pritamdamania87

Differential Revision: D23055807

fbshipit-source-id: 60e50993f1ed97497514eac5cb1018579ed2a4c5
2020-08-19 19:42:22 -07:00
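A hedged sketch of the hook pattern discussed above: an async allreduce whose `then` callback averages the bucket by world size. The GradBucket accessor has varied across releases, so `bucket.buffer()` is an assumption here:
```
import torch
import torch.distributed as dist

def allreduce_then_average(state, bucket):
    # bucket.buffer() is assumed; some releases expose bucket.get_tensors() instead
    tensor = bucket.buffer()
    fut = dist.all_reduce(tensor, async_op=True).get_future()

    def average(fut):
        return fut.value()[0] / dist.get_world_size()

    return fut.then(average)

# ddp_model.register_comm_hook(state=None, hook=allreduce_then_average)
```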
Hongyi Jia
d467ac8ff0 [GLOO] handle empty split size (#43256)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43256

* Handle empty split size by moving to call computeLengthsAndOffsets()
* Enable GLOO alltoall python tests
ghstack-source-id: 109292763

Test Plan:
buck build mode/dev-nosan caffe2/torch/lib/c10d:ProcessGroupGlooTest

./trainer_cmd.sh -p 16 -n 8 -d gloo (modify ./trainer_cmd.sh a bit)

Reviewed By: mingzhe09088

Differential Revision: D22961600

fbshipit-source-id: b9e90dadf7b45323b8af2e6cab2e156043b7743b
2020-08-19 11:14:06 -07:00
Hongyi Jia
c9e825640a [c10d] Template computeLengthsAndOffsets() (#42706)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42706

Different backends accept different length types (e.g. MPI_Alltoallv, ncclSend/Recv(), gloo::alltoallv()), so computeLengthsAndOffsets() is made a template.

Test Plan:
Sandcastle
CI
HPC: ./trainer_cmd.sh -p 16 -n 8 -d nccl

Reviewed By: osalpekar

Differential Revision: D22961459

fbshipit-source-id: 45ec271f8271b96f2dba76cd9dce3e678bcfb625
2020-08-10 19:21:46 -07:00
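A Python sketch of the computation the templated helper performs; the name and signature below are illustrative, not the C++ API. Given per-rank split sizes, it produces the element counts and starting offsets that alltoallv-style calls expect, with an empty split list meaning an even split:
```
def compute_lengths_and_offsets(split_sizes, total_numel, world_size):
    # illustrative only; the real helper is a C++ template over the length type
    if not split_sizes:                                  # empty split -> even split
        split_sizes = [total_numel // world_size] * world_size
    offsets, running = [], 0
    for length in split_sizes:
        offsets.append(running)
        running += length
    return split_sizes, offsets

print(compute_lengths_and_offsets([], 8, 4))             # ([2, 2, 2, 2], [0, 2, 4, 6])
print(compute_lengths_and_offsets([1, 3, 2, 2], 8, 4))   # ([1, 3, 2, 2], [0, 1, 4, 6])
```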
Sinan Nasir
0a804be47d [NCCL] DDP communication hook: getFuture() without cudaStreamAddCallback (#42335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42335

**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version; the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](https://github.com/pytorch/pytorch/pull/41596).

ghstack-source-id: 109461507

Test Plan:
```(pytorch) [sinannasir@devgpu017.ash6 ~/local/pytorch] python test/distributed/test_c10d.py
Couldn't download test skip set, leaving all tests enabled...
..............................s.....................................................s................................
----------------------------------------------------------------------
Ran 117 tests in 298.042s

OK (skipped=2)
```
### Facebook Internal:
2\. HPC PT trainer run to validate no regression. Check the QPS number:
**Master:** QPS after 1000 iters: around ~34100
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"testvideo_master" --trainers 16 --trainer-version 1c53912
```
```
[0] I0806 142048.682 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963963 0.950479 0.953704], lifetime NE: [0.963963 0.950479 0.953704], loss: [0.243456 0.235225 0.248375], QPS: 34199
```
[detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtestvideo_mastwarm.trainer.trainer%2F0&ta_tab=logs)

**getFuture/new design:** QPS after 1000 iters: around ~34030
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"testvideo_getFutureCyclicFix" --trainers 16 --trainer-version 8553aee
```
```
[0] I0806 160149.197 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963959 0.950477 0.953704], lifetime NE: [0.963959 0.950477 0.953704], loss: [0.243456 0.235225 0.248375], QPS: 34018
```
[detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtestvideo_getFutureCyclicFix.trainer.trainer%2F0&ta_tab=logs)
**getFuture/new design Run 2:** QPS after 1000 iters: around ~34200
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"test2video_getFutureCyclicFix" --trainers 16 --trainer-version 8553aee
```
```
[0] I0806 160444.650 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963963 0.950482 0.953706], lifetime NE: [0.963963 0.950482 0.953706], loss: [0.243456 0.235225 0.248375], QPS: 34201
```
[detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtest2video_getFutureCyclicFix.trainer.trainer%2F0&ta_tab=logs)
**getFuture/old design (Regression):** QPS after 1000 iters: around ~31150
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"testvideo_OLDgetFutureD22583690 (d904ea5972)" --trainers 16 --trainer-version 1cb5cbb
```
```
priv3_global/mast_hpc/hpc.sinannasirtestvideo_OLDgetFutureD22583690 (d904ea5972).trainer.trainer/0 [0] I0805 101320.407 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963964 0.950482 0.953703], lifetime NE: [0.963964 0.950482 0.953703], loss: [0.243456 0.235225 0.248375], QPS: 31159
```
3\. `flow-cli` tests; roberta_base; world_size=4:
**Master:** f210039922
```
total:
  32 GPUs -- 32 GPUs: p25:  0.908    35/s  p50:  1.002    31/s  p75:  1.035    30/s  p90:  1.051    30/s  p95:  1.063    30/s
forward:
  32 GPUs -- 32 GPUs: p25:  0.071   452/s  p50:  0.071   449/s  p75:  0.072   446/s  p90:  0.072   445/s  p95:  0.072   444/s
backward:
  32 GPUs -- 32 GPUs: p25:  0.821    38/s  p50:  0.915    34/s  p75:  0.948    33/s  p90:  0.964    33/s  p95:  0.976    32/s
optimizer:
  32 GPUs -- 32 GPUs: p25:  0.016  2037/s  p50:  0.016  2035/s  p75:  0.016  2027/s  p90:  0.016  2019/s  p95:  0.016  2017/s
```
**getFuture new design:** f210285797
```
total:
  32 GPUs -- 32 GPUs: p25:  0.952    33/s  p50:  1.031    31/s  p75:  1.046    30/s  p90:  1.055    30/s  p95:  1.070    29/s
forward:
  32 GPUs -- 32 GPUs: p25:  0.071   449/s  p50:  0.072   446/s  p75:  0.072   445/s  p90:  0.072   444/s  p95:  0.072   443/s
backward:
  32 GPUs -- 32 GPUs: p25:  0.865    37/s  p50:  0.943    33/s  p75:  0.958    33/s  p90:  0.968    33/s  p95:  0.982    32/s
optimizer:
  32 GPUs -- 32 GPUs: p25:  0.016  2037/s  p50:  0.016  2033/s  p75:  0.016  2022/s  p90:  0.016  2018/s  p95:  0.016  2017/s

```

Reviewed By: ezyang

Differential Revision: D22833298

fbshipit-source-id: 1bb268d3b00335b42ee235c112f93ebe2f25b208
2020-08-07 18:48:35 -07:00
Darius Tan
6ebc0504ca BAND, BOR and BXOR for NCCL (all_)reduce should throw runtime errors (#42669)
Summary:
cc rohan-varma
Fixes https://github.com/pytorch/pytorch/issues/41362 #39708

# Description
NCCL doesn't support `BAND, BOR, BXOR`. Since the [current mapping](0642d17efc/torch/lib/c10d/ProcessGroupNCCL.cpp (L39)) doesn't contain any of the mentioned bitwise operators, a default value of `ncclSum` is used instead.

This PR should provide the expected behaviour where a runtime exception is thrown.
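
A minimal sketch of the kind of check this adds; the names and structure are illustrative, not the literal ProcessGroupNCCL code:

```cpp
#include <nccl.h>
#include <map>
#include <stdexcept>

// Stand-in for c10d::ReduceOp, for illustration only.
enum class ReduceOp { SUM, PRODUCT, MIN, MAX, BAND, BOR, BXOR };

// Only the ops NCCL actually implements get a mapping; anything else now
// throws instead of silently falling back to ncclSum.
ncclRedOp_t getNcclReduceOp(ReduceOp op) {
  static const std::map<ReduceOp, ncclRedOp_t> kNcclOps = {
      {ReduceOp::SUM, ncclSum},
      {ReduceOp::PRODUCT, ncclProd},
      {ReduceOp::MIN, ncclMin},
      {ReduceOp::MAX, ncclMax},
  };
  auto it = kNcclOps.find(op);
  if (it == kNcclOps.end()) {
    throw std::runtime_error(
        "Cannot use ReduceOp BAND/BOR/BXOR with the NCCL backend");
  }
  return it->second;
}
```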

# Notes
- The way I'm throwing exceptions is derived from [ProcessGroupGloo.cpp](0642d17efc/torch/lib/c10d/ProcessGroupGloo.cpp (L101))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42669

Reviewed By: ezyang

Differential Revision: D22996295

Pulled By: rohan-varma

fbshipit-source-id: 83a9fedf11050d2890f9f05ebcedf53be0fc3516
2020-08-07 13:09:07 -07:00
Omkar Salpekar
e97e87368e Clean up CUDA Sleep and Tensor Initialization in ProcessGroupNCCLTest (#42211)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42211

Helper functions for launching CUDA Sleep and Tensor Value Initialization for the collective test functions.

This is more of a code cleanup fix compared to the previous diffs.
ghstack-source-id: 109097243

Test Plan: working on devGPU and devvm

Reviewed By: jiayisuse

Differential Revision: D22782671

fbshipit-source-id: 7d88f568a4e08feae778669affe69c8d638973db
2020-08-04 12:36:27 -07:00
Omkar Salpekar
3ca361791f TearDown function for ProcessGroupNCCLTest Initializer (#42209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42209

This PR adds a TearDown function to the testing superclass to ensure that the NCCL_BLOCKING_WAIT environment variable is reset after each test case.
ghstack-source-id: 109097247

Test Plan: Working on devGPU and devvm.

Reviewed By: jiayisuse

Differential Revision: D22782672

fbshipit-source-id: 8f919a96d7112f9f167e90ce3df59886c88f3514
2020-08-04 12:36:24 -07:00
Omkar Salpekar
2b8e7e2f2d Moving ProcessGroupNCCLTest to Gtest (#42208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42208

ProcessGroupNCCLTest is currently written without any testing framework, and all tests are simply called from the main function and throw exceptions upon failure. As a result, it is hard to debug and pinpoint which tests have succeeded/failed.

This PR moves ProcessGroupNCCLTest to gtest with appropriate setup and skipping functionality in the test superclass.
ghstack-source-id: 109097246

Test Plan: Working Correctly on devGPU and devvm.

Reviewed By: jiayisuse

Differential Revision: D22782673

fbshipit-source-id: 85bd407f4534f3d339ddcdd65ef3d2022aeb7064
2020-08-04 12:34:09 -07:00
gunandrose4u
d2a2ac4eea Fix read/write bulk data (#42504)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42504

Reviewed By: glaringlee

Differential Revision: D22922750

Pulled By: mrshenli

fbshipit-source-id: 9008fa22c00513bd75c3cf88a3081184cd72b0e3
2020-08-04 11:30:53 -07:00
Srinivas Sridharan
ecb88c5d11 Add NCCL Alltoall to PT NCCL process group (#42514)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42514

Add Alltoall and Alltoallv to PT NCCL process group using NCCL Send/Recv.

Reviewed By: mrshenli

Differential Revision: D22917967

fbshipit-source-id: 402f2870915bc237845864a4a27c97df4351d975
2020-08-04 08:39:28 -07:00
Mustafa Said Mehmetoglu
44b018ddeb Convert ProcessGroupNCCLTest.cpp to gtest unittest (#42365)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42365

Converting the test

Reviewed By: malfet

Differential Revision: D22855087

fbshipit-source-id: dc917950dcf99ec7036e48aaa4264d2c455cb19e
2020-07-31 20:34:11 -07:00
Brandon Lin
4c6878c97d [gloo] change ProcessGroupGlooAsyncTest to use gtest (#42313)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42313

Changes the tests in `ProcessGroupGlooAsyncTest.cpp` to use the Gtest testing framework.

Reviewed By: malfet

Differential Revision: D22821577

fbshipit-source-id: 326b24a334ae84a16434d0d5ef27d16ba4b90d5d
2020-07-31 08:54:50 -07:00
Omkar Salpekar
b6a9f42758 Add appropriate error messages for ProcessGroupNCCLTest (#42143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42143

Replaces the original makeshift error messages in ProcessGroupNCCLTest
with more appropriate ones.
ghstack-source-id: 108711579

Test Plan: Ran the tests on DevGPU

Reviewed By: mrshenli

Differential Revision: D22778505

fbshipit-source-id: 27109874f0b474a74b09f588cf6e7528d2069702
2020-07-28 18:31:23 -07:00
Omkar Salpekar
e4c3f526c8 Fixed Skipping Logic in ProcessGroupNCCLErrors tests (#42192)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42192

This PR fixes the complicated skipping logic for ProcessGroupNCCLErrors Tests - it correctly logs the reason for skipping tests when GPUs are not available or the NCCL version is too old.

This is part of a broader effort to improve the testing of the ProcessGroup and Collectives tests.
ghstack-source-id: 108620568

Test Plan: Tested on devGPU and devvm. Tests are run correctly on GPU and skipped on CPU as expected.

Reviewed By: mrshenli

Differential Revision: D22782856

fbshipit-source-id: 6071dfdd9743f45e59295e5cee09e89c8eb299c9
2020-07-28 16:59:40 -07:00
Jongsoo Park
73ff252913 Back out "[NCCL] DDP communication hook: getFuture()" (#42152)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42152

Original commit changeset: 8c059745261d

Test Plan: .

Reviewed By: ajtulloch, jianyuh

Differential Revision: D22786183

fbshipit-source-id: 51155389d37dc82ccb4d2fa20d350f9d14abeaca
2020-07-28 10:05:35 -07:00
Nikita Shulga
fbdaa555a2 Enable ProcessGroupGlooTest in CI (take 2) (#42086)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42086

Reviewed By: ngimel

Differential Revision: D22765777

Pulled By: malfet

fbshipit-source-id: ebbcd44f448a1e7f9a3d18fa9967461129dd1dcd
2020-07-27 10:21:59 -07:00
Shen Li
47e6d4b3c8 Revert D22741514: [pytorch][PR] Enable ProcessGroupGlooTest in CI
Test Plan: revert-hammer

Differential Revision:
D22741514 (45e6f2d600)

Original commit changeset: 738d2e27f523

fbshipit-source-id: 0381105ed0ab676b0abd1927f602a35b1b264a6a
2020-07-25 18:19:17 -07:00
Rohan Varma
366c014a77 [Resubmit #41318] NCCL backend support for torch bool (#41959)
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/41318 pushed to ci-all branch.

Original description:
Closes https://github.com/pytorch/pytorch/issues/24137.
This PR adds support for the torch.bool tensor type to ProcessGroupNCCL. For most types we use the existing mapping, but since bool is not supported as a native ncclDataType_t, we add the following logic:

Map at::kBool to ncclUint8
During reduction (allreduce for example), if the operation is SUM, we instead override it to a MAX to avoid overflow issues. The rest of the operations work with no changes. In the boolean case, changing sum to max makes no correctness difference since they both function as a bitwise OR.
The reduction logic (for example for reduce/allreduce) is as follows:
sum, max = bitwise or
product, min = bitwise and

Note that this PR doesn't add support for BAND/BOR/BXOR. That is because these reduction ops currently are not supported by NCCL backend, see https://github.com/pytorch/pytorch/issues/41362
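
A compact sketch of the mapping logic described above (illustrative helper names, not the actual ProcessGroupNCCL code):

```cpp
#include <c10/core/ScalarType.h>
#include <nccl.h>
#include <stdexcept>

// Stand-in reduce-op enum for illustration only.
enum class ReduceOp { SUM, PRODUCT, MIN, MAX };

// bool has no native ncclDataType_t, so ship it as uint8.
ncclDataType_t toNcclDataType(c10::ScalarType type) {
  switch (type) {
    case c10::kBool:
      return ncclUint8;
    case c10::kFloat:
      return ncclFloat;
    // ... the remaining dtypes keep their existing mapping ...
    default:
      throw std::runtime_error("Unsupported dtype for NCCL");
  }
}

// For bool tensors SUM would overflow uint8, so rewrite it to MAX, which is
// equivalent to a bitwise OR over {0, 1} values.
ncclRedOp_t toNcclReduceOp(ReduceOp op, c10::ScalarType type) {
  if (type == c10::kBool && op == ReduceOp::SUM) {
    return ncclMax;
  }
  switch (op) {
    case ReduceOp::SUM: return ncclSum;
    case ReduceOp::PRODUCT: return ncclProd;
    case ReduceOp::MIN: return ncclMin;
    case ReduceOp::MAX: return ncclMax;
  }
  throw std::runtime_error("Unknown ReduceOp");
}
```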

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41959

Reviewed By: mrshenli

Differential Revision: D22719665

Pulled By: rohan-varma

fbshipit-source-id: 8bc4194a8d1268589640242277124f277d2ec9f1
2020-07-24 23:44:29 -07:00
Omkar Salpekar
6287f9ed65 Remove AllGatherTestWithTimeout (#41945)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41945

This test previously did a thread sleep before launching the allgather operation, and then waited on the work object. Since the sleep was done before the work object was created, it did not affect the allgather call, and thus, did not test work-level timeouts as intended.

I am removing this test for now. In the future we can add this test back, but we would need to somehow inject a `cudaSleep` call before the allgather (so the collective operation itself is delayed). This may require overriding `ProcessGroupNCCL::collective`, so it's a bit more heavyweight.

In the meantime, we can remove this test - work-level timeouts are still thoroughly tested with Gloo.
ghstack-source-id: 108370178

Test Plan: Ran ProcessGroupNCCL tests on devGPU

Reviewed By: jiayisuse

Differential Revision: D22702291

fbshipit-source-id: a36ac3d83abfab6351c0476046a2f3b04a80c44d
2020-07-24 18:17:48 -07:00
Nikita Shulga
45e6f2d600 Enable ProcessGroupGlooTest in CI (#41985)
Summary:
Partially addresses https://github.com/pytorch/pytorch/issues/41143

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41985

Reviewed By: rohan-varma

Differential Revision: D22741514

Pulled By: malfet

fbshipit-source-id: 738d2e27f52334e402b65b724b8ba3b0b41372ee
2020-07-24 17:44:00 -07:00
Sinan Nasir
d904ea5972 [NCCL] DDP communication hook: getFuture() (#41596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41596

We've modified the previous design of `convert_dist_work_to_future` API in the GH Issue [#39272](https://github.com/pytorch/pytorch/issues/39272).

1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object.
2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`.
3. This API will only be supported by NCCL in the first version; the default implementation will throw UnsupportedOperation.
4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function.

`cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr).
ghstack-source-id: 108409748

Test Plan:
Run old  python test/distributed/test_c10d.py.
Some additional tests:
`test_ddp_comm_hook_allreduce_hook_nccl`: This unit test verifies whether a DDP communication hook that just calls allreduce gives the same result as the case of no hook registered. Without the then callback, the future_value in reducer is no longer a PyObject, and this unit test verifies future_value is properly checked.
`test_ddp_comm_hook_allreduce_then_mult_ten_hook_nccl`: This unit test verifies whether a DDP communication hook that calls allreduce and then multiplies the result by ten gives the expected result.

As of v10:
```
........................s.....s.....................................................s...............................
----------------------------------------------------------------------
Ran 116 tests

OK (skipped=3)
```
`flow-cli` performance validation using a stacked diff where `bucket.work` is completely replaced with `bucket.future_work` in `reducer`. See PR [#41840](https://github.com/pytorch/pytorch/pull/41840) [D22660198](https://www.internalfb.com/intern/diff/D22660198/).

Reviewed By: izdeby

Differential Revision: D22583690

fbshipit-source-id: 8c059745261d68d543eaf21a5700e64826e8d94a
2020-07-24 11:22:44 -07:00
Shen Li
dbe6bfbd7e Revert D22496604: NCCL Backend support for torch.bool
Test Plan: revert-hammer

Differential Revision:
D22496604 (3626473105)

Original commit changeset: a1a15381ec41

fbshipit-source-id: 693c2f9fd1df568508cbcf8c734c092cec3b0a72
2020-07-23 15:33:58 -07:00
Rohan Varma
3626473105 NCCL Backend support for torch.bool (#41318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41318

Closes https://github.com/pytorch/pytorch/issues/24137.

This PR adds support for the `torch.bool` tensor type to ProcessGroupNCCL. For most types we use the existing mapping, but since `bool` is not supported as a native `ncclDataType_t`, we add the following logic:
1) Map `at::kBool` to `ncclUint8`
2) During reduction (allreduce for example), if the operation is SUM, we instead override it to a MAX to avoid overflow issues. The rest of the operations work with no changes. In the boolean case, changing sum to max makes no correctness difference since they both function as a bitwise OR.

The reduction logic (for example for reduce/allreduce) is as follows:
sum, max = bitwise or
product, min = bitwise and

Tests are added to ensure that the reductions work as expected.
ghstack-source-id: 108315417

Test Plan: Added unittests

Reviewed By: mrshenli

Differential Revision: D22496604

fbshipit-source-id: a1a15381ec41dc59923591885d40d966886ff556
2020-07-23 12:33:39 -07:00
Shen Li
b80ffd44b0 Revert D20781624: Add NCCL Alltoall to PT NCCL process group
Test Plan: revert-hammer

Differential Revision:
D20781624 (b87f0e5085)

Original commit changeset: 109436583ff6

fbshipit-source-id: 03f6ee4d56baea93a1cf795d26dd92b7d6d1df28
2020-07-22 13:22:17 -07:00
Srinivas Sridharan
b87f0e5085 Add NCCL Alltoall to PT NCCL process group (#39984)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39984

Add Alltoall and Alltoallv to PT NCCL process group using NCCL Send/Recv.

Reviewed By: jiayisuse

Differential Revision: D20781624

fbshipit-source-id: 109436583ff69a3fea089703d32cfc5a75f973e0
2020-07-22 10:55:51 -07:00
Hongyi Jia
65bd38127a GLOO process group GPU alltoall (#41690)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41690

Gloo alltoall for GPU

Test Plan: buck test mode/dev-nosan caffe2/torch/lib/c10d:ProcessGroupGlooTest

Reviewed By: osalpekar

Differential Revision: D22631554

fbshipit-source-id: 4b126d9d991a118f3925c005427f399fc60f92f7
2020-07-20 19:01:12 -07:00
Hongyi Jia
6f5f455c54 [Gloo] alltoall to ProcessGroupGloo (#41424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41424

Adding alltoall to Gloo process group

Test Plan:
buck test caffe2/torch/lib/c10d:ProcessGroupGlooTest

Verified on TSC as well D22141532

Reviewed By: osalpekar

Differential Revision: D22451929

fbshipit-source-id: 695c4655c894c85229b16097fa63352ed04523ef
2020-07-16 11:27:26 -07:00
Omkar Salpekar
81e964904e [Gloo] Tests for Gloo Async Work Wait-level Timeouts (#41265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41265

This PR adds tests for the Async Work wait-level timeouts that were added in the previous PR
ghstack-source-id: 107835732

Test Plan: New tests are in this diff - Running on local machine and Sandcastle

Reviewed By: jiayisuse

Differential Revision: D22470084

fbshipit-source-id: 5552e384d384962e359c5f665e6572df03b6aa63
2020-07-16 10:59:01 -07:00
Omkar Salpekar
b979129cba [Gloo] Support work-level timeouts in ProcessGroupGloo (#40948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40948

Add work-level timeouts to ProcessGroupGloo. This uses the timeout support in `waitSend` and `waitRecv` functions from Gloo's `unbound_buffer` construct.
ghstack-source-id: 107835738

Test Plan: Tests are in the last PR in this stack

Reviewed By: jiayisuse

Differential Revision: D22173763

fbshipit-source-id: e0493231a23033464708ee2bc0e295d2b087a1c9
2020-07-16 10:58:59 -07:00
Omkar Salpekar
01dcef2e15 [NCCL] Tests for WorkNCCL::wait with Timeouts (#40947)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40947

This PR adds tests for work-level timeouts in WorkNCCL objects. We kick off an allgather operation that waits for 1000ms before actually starting computation. We wait on completion of this allgather op with a timeout of 250ms, expecting the operation to timeout and throw a runtime error.
ghstack-source-id: 107835734

Test Plan: This diff added tests - checking CI/Sandcastle for correctness. These are NCCL tests so they require at least 2 GPUs to run.

Reviewed By: jiayisuse

Differential Revision: D22173101

fbshipit-source-id: 8595e4b67662cef781b20ced0befdcc53d157c39
2020-07-16 10:58:56 -07:00
Omkar Salpekar
edf3dc73f2 [NCCL] Support Wait Timeout in ProcessGroupNCCL (#40946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40946

Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown.
ghstack-source-id: 107835739

Test Plan: Test added to `ProcessGroupNCCLTest` in the next PR in this stack.

Reviewed By: jiayisuse

Differential Revision: D22127898

fbshipit-source-id: 543964855ac5b41e464b2df4bb6c211ef053e73b
2020-07-16 10:58:54 -07:00
Omkar Salpekar
9d92fa2679 [NCCL] Add timeout to ProcessGroup Work Wait (#40944)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40944

This stack adds Work-level timeout for blocking wait.

This PR just changes the API to accept a default wait arg for the wait function in each ProcessGroup backend. The ProcessGroup superclass correctly waits for the given timeout by changing the CV wait to wait_for.
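
Roughly, the base-class change amounts to something like the following simplified sketch, assuming a zero default timeout means "wait forever"; it is not the exact ProcessGroup::Work code:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <stdexcept>

// Simplified stand-in for the Work base class, to show the wait_for change.
class Work {
 public:
  // wait() now accepts a timeout; zero is treated here as "no timeout".
  bool wait(std::chrono::milliseconds timeout = std::chrono::milliseconds(0)) {
    std::unique_lock<std::mutex> lock(mutex_);
    if (timeout == std::chrono::milliseconds(0)) {
      cv_.wait(lock, [&] { return completed_; });        // old behavior
    } else if (!cv_.wait_for(lock, timeout, [&] { return completed_; })) {
      throw std::runtime_error("Operation timed out");   // new behavior
    }
    return true;
  }

  void markCompleted() {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      completed_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool completed_ = false;
};
```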

Closes: https://github.com/pytorch/pytorch/issues/37571
ghstack-source-id: 107835735

Test Plan: Tests in 4th PR in this stack

Reviewed By: jiayisuse

Differential Revision: D22107135

fbshipit-source-id: b38c07cb5e79e6c86c205e580336e7918ed96501
2020-07-16 10:56:58 -07:00
Omkar Salpekar
33f9fbf8ba Modularize parsing NCCL_BLOCKING_WAIT in ProcessGroupNCCL (#41076)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41076

Modularizes Parsing the NCCL_BLOCKING_WAIT environment variable in the ProcessGroupNCCL Constructor.
ghstack-source-id: 107491850

Test Plan: Sandcastle/CI

Differential Revision: D22401225

fbshipit-source-id: 79866d3f4f1a617cdcbca70e3bea1ce9dcac3316
2020-07-10 10:47:38 -07:00
Jithun Nair
eea535742f Add bfloat16 support for nccl path (#38515)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38515

Differential Revision: D22420896

Pulled By: ezyang

fbshipit-source-id: 80d2d0c2052c91c9035e1e025ebb14e210cb0100
2020-07-07 18:07:06 -07:00
Omkar Salpekar
49e12d888a [NCCL - reland] Explicitly abort NCCL Communicators on Process Group Destruction (#40585)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40585

This PR aborts incomplete NCCL Communicators in the ProcessGroupNCCL
destructor. This should prevent pending NCCL communicators from blocking other CUDA ops.
ghstack-source-id: 106988073

Test Plan: Sandcastle/ OSS CI

Differential Revision: D22244873

fbshipit-source-id: 4b4fe65e1bd875a50151870f8120498193d7535e
2020-07-01 16:21:16 -07:00
Natalia Gimelshein
b05c34259b relax size check in flatten_for_scatter_gather (#40573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40573

Per title, to workaround apex sbn bug.

Test Plan: Covered by existing tests

Reviewed By: blefaudeux

Differential Revision: D22236942

fbshipit-source-id: ddb164ee347a7d472a206087e4dbd16aa9d72387
2020-06-25 15:16:37 -07:00
Yanli Zhao
dfbf0164c9 Revert D22103662: [NCCL] Explicitly Abort NCCL Communicators on Process Group Destruction
Test Plan: revert-hammer

Differential Revision:
D22103662 (527ab13436)

Original commit changeset: 1f6f88b56bd7

fbshipit-source-id: d0944462c021ec73c7f883f98609fc4a3408efd9
2020-06-25 12:27:24 -07:00
Omkar Salpekar
0c923eea0a Add finishAndThrow function to ProcessGroup::Work, and use with Gloo (#40405)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40405

This adds a finishAndThrow function that completes the work object,
sets an exception if one is provided by the user, and throws an exception (if
it is already set or passed by the caller). This is now done by grabbing the
lock just once and simplifies the wait functions in ProcessGroupGloo.
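
In spirit, the helper looks something like this sketch with assumed member names, not the literal implementation:

```cpp
#include <condition_variable>
#include <exception>
#include <mutex>

// Sketch of the helper added to the work object; member names are assumed.
class Work {
 public:
  void finishAndThrow(std::exception_ptr exception) {
    std::unique_lock<std::mutex> lock(mutex_);
    completed_ = true;
    if (exception) {
      exception_ = exception;  // a caller-provided exception takes precedence
    }
    cv_.notify_all();
    if (exception_) {
      // Still under the single lock acquisition: rethrow whatever is recorded,
      // whether it was set earlier by the async work or passed in just now.
      std::rethrow_exception(exception_);
    }
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool completed_ = false;
  std::exception_ptr exception_;
};
```
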
ghstack-source-id: 106516114

Test Plan: CI

Differential Revision: D22174890

fbshipit-source-id: ea74702216c4328187c8d193bf39e1fea43847f6
2020-06-24 14:46:25 -07:00
Omkar Salpekar
3e2d2fc856 [NCCL Docs] Adding Comments for Work-level Finish in ProcessGroup (#40404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40404

Adds docs to the finish function in ProcessGroup::Work. It's better to have some documentation around these functions since we have some PR's with API-changes/optimizations for these work-level functions here and in the subclasses.
ghstack-source-id: 106381736

Test Plan: CI (Docs change only)

Differential Revision: D22174891

fbshipit-source-id: 7901ea3b35caf6f69f37178ca574104d3412de28
2020-06-24 14:44:18 -07:00
Omkar Salpekar
527ab13436 [NCCL] Explicitly Abort NCCL Communicators on Process Group Destruction (#40241)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40241

We abort incomplete NCCL Communicators in the ProcessGroupNCCL
destructor, otherwise pending NCCL communicators may block other CUDA ops.

Closes: https://github.com/pytorch/pytorch/issues/32231
ghstack-source-id: 106469423

Test Plan: CI/Sandcastle

Reviewed By: jiayisuse

Differential Revision: D22103662

fbshipit-source-id: 1f6f88b56bd7a5e9ca5a41698995a76e60e8ad9f
2020-06-24 14:34:00 -07:00
Michael Carilli
8066fba226 [RELAND2] Change AccumulateGrad to yield .grads that match weights' memory layout (#40358)
Summary:
https://github.com/pytorch/pytorch/pull/40129 fixed the error responsible for the first revert, but exposed another error in the same test.

This PR is intended as the "master copy" for merge, and it runs on full CI.
Two other PRs (restricted to run on a small subset of CI) supporting debugging DDP failures/hangs with multiple devices per process (`test_c10d.py:DistributedDataParallelTest.test_grad_layout_1devicemodule_2replicaperprocess`).
- https://github.com/pytorch/pytorch/pull/40290 tries the test with purely rowmajor contiguous params on an untouched master.  In other words https://github.com/pytorch/pytorch/pull/40290 contains none of this PR's diffs aside from the test itself.
- https://github.com/pytorch/pytorch/pull/40178, for comparison, tries the test with this PR's diffs.

Both fail the same way, indicating failure is unrelated to this PR's other diffs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40358

Differential Revision: D22165785

Pulled By: albanD

fbshipit-source-id: ac7cdd79af5c080ab74341671392dca8e717554e
2020-06-22 17:13:21 -07:00
Pritam Damania
a80dd02a22 [Resubmit] Ensure NCCL_BLOCKING_WAIT=1 works for dist.barrier() (#40249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40249

Blocking wait didn't work for dist.barrier() since we performed a
cudaDeviceSynchronize() before we performed any of the timeout checks. As a
result, in case of failures/desync the barrier() call would get stuck on
cudaDeviceSynchronize() and would never return a timeout error to the user.

To fix this, I've moved the device synchronization after the timeout checks.
ghstack-source-id: 106250153

Test Plan: waitforbuildbot

Differential Revision: D22126152

fbshipit-source-id: d919a7a6507cca7111d8ad72e916777b986d0d67
2020-06-19 15:42:43 -07:00
Omkar Salpekar
52e4e3a9b8 NCCL Comment typo fix (#40242)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40242

Comment Typo in ProcessGroupNCCL
ghstack-source-id: 106088379

Test Plan: CI

Differential Revision: D22099219

fbshipit-source-id: ddce91e640d4eea54e0698166c6276aeffedeb1e
2020-06-19 11:24:52 -07:00
Ilia Cherniavskii
30648985a7 Revert D22108899: Ensure NCCL_BLOCKING_WAIT=1 works for dist.barrier()
Test Plan: revert-hammer

Differential Revision:
D22108899

Original commit changeset: 6b109ef9357e

fbshipit-source-id: 41ca36091a7d4d5e94143835809560362fb14fcd
2020-06-18 13:35:10 -07:00
Pritam Damania
d1a0e88075 Ensure NCCL_BLOCKING_WAIT=1 works for dist.barrier() (#40207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40207

Blocking wait didn't work for dist.barrier() since we performed a
cudaDeviceSynchronize() before we performed any of the timeout checks. As a
result, in case of failures/desync the barrier() call would get stuck on
cudaDeviceSynchronize() and would never return a timeout error to the user.

To fix this, I've moved the device synchronization after the timeout checks.
ghstack-source-id: 106123004

Test Plan: waitforbuildbot

Differential Revision: D22108899

fbshipit-source-id: 6b109ef9357e9464e7d66b540caabf5801e6a44a
2020-06-17 23:44:59 -07:00
Alban Desmaison
08227fea4f Revert D22079377: [pytorch][PR] [RELAND] Change AccumulateGrad to yield .grads that match weights' memory layout
Test Plan: revert-hammer

Differential Revision:
D22079377

Original commit changeset: 9bd2b7e0c34f

fbshipit-source-id: c22cc349d790caa574eace0d63980854c33e5a59
2020-06-17 10:17:27 -07:00
Michael Carilli
1ec8ece2b9 [RELAND] Change AccumulateGrad to yield .grads that match weights' memory layout (#40129)
Summary:
https://github.com/pytorch/pytorch/pull/34904 was reverted because it had a misconfigured 4 GPU test that for some reason wasn't caught by external CI ([example failure](https://app.circleci.com/pipelines/github/pytorch/pytorch/181719/workflows/cfb37cd9-9a0c-4738-898b-d683934cd308/jobs/5868948/steps)).

This PR reverts the revert, and adds diffs that should repair the misconfigured test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40129

Differential Revision: D22079377

Pulled By: albanD

fbshipit-source-id: 9bd2b7e0c34fdaf887497b52037cfe82cba709c1
2020-06-17 09:02:54 -07:00
Alban Desmaison
f1e575a0bf Revert D20496044: [pytorch][PR] Change AccumulateGrad to yield .grads that match weights' memory layout
Test Plan: revert-hammer

Differential Revision:
D20496044

Original commit changeset: 248d680f4b1b

fbshipit-source-id: 6462b25e3fb9c8596c1da443389089f09c32df4d
2020-06-16 10:38:40 -07:00
Michael Carilli
2beb9690c3 Change AccumulateGrad to yield .grads that match weights' memory layout (#34904)
Summary:
Currently, whether `AccumulateGrad`  [steals](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L42)) or [clones](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L80)) an incoming gradient, the gradient ends up rowmajor contiguous, regardless of its param's layout.  If the param's layout is channels last, or otherwise not rowmajor contigous, later kernels that apply gradients to params are forced into an uncoalesced memory access pattern for either the param or the gradient.  This may not sound like a big deal but for any binary op on large tensors it's a >3X increase in gmem traffic => 3X slowdown.

The present PR changes `AccumulateGrad` to prefer, where possible, stashing gradients that match their params' layouts (["Gradient Layout Contract"](https://github.com/pytorch/pytorch/pull/34904/files#diff-ef1a56d24f66b280dcdb401502d6a796R29-R38)).

Allowing `AccumulateGrad` to stash non-rowmajor-contiguous grads means DDP allreduces and DP reduces must allow non-rowmajor-contiguous grads.  This PR extends DDP and DP to allow gradients with non-rowmajor-contiguous strides as long as their layout is nonoverlapping and dense.

For good measure, I include changes that allow all five nccl primitives (allreduce, reduce, broadcast, allgather, reducescatter) to act on non-rowmajor-contiguous tensors (again as long as each input's layout is nonoverlapping and dense, and as long as all tensors participating in a given collective have the same layout).  The primitive comm changes aren't necessary to enable the DDP changes, but I wasn't sure this would end up true until I had written both sets of changes.  I think primitive comm enablement is reasonable to keep in the PR, especially since the code for it is simple.

Channels last params will be a major beneficiary of this PR, but I don't see it as channels-last-specific fix.  The spirit is layout matching in general:
- Grads should be stashed with memory layouts matching their params.
- Src and dst tensors on opposite ends of collectives should have matching dense layouts.

This PR also updates autograd docs to describe potential BC-breaking changes below.

## BC notes
ngimel albanD gchanan

#### BC-breaking
In the common case where the user lets AccumulateGrad decide grad layouts, strides for grads of dense but non-rowmajor-contiguous params will change.  Any user code that was accustomed to `view(-1)`ing these grads will break.

Also, the circumstances under which a grad can be stolen directly from the backward function that created it, as opposed to deep-copied by AccumulateGrad, have changed.  In most cases we expect silent performance improvement, because we expect channels-last-aware backward kernels will create channels last gradients for channels last params.  Now those can be stolen, whereas before this PR they were cloned and made rowmajor contiguous.  IMO this is a mild BC breakage.  Param backward hooks still see grads come in with whatever format the backward kernel gave them.  The only BC breakage potential I see is if user code relies somehow on a grad in a hook having or not having the same deep memory as the eventual `param.grad`.  Any such users hopefully know they're off the edge of the map and understand how to update their expectations.

#### BC escape hatches
At alband's recommendation, this PR's changes to AccumulateGrad do not alter the pre-PR code's decisions about whether grad is accumulated in or out of place.  Accumulations of new grads onto an existing `.grad` attribute were (usually) in-place before this PR and remain in-place after this PR, keeping the existing `.grad`'s layout.  After this PR, if the user wants to force accumulation into a grad with a particular layout, they can preset `param.grad` to a zeroed tensor with the desired strides or call `grad.contiguous(desired format)`.  This likely won't be as performant as letting AccumulateGrad establish grad layouts by cloning or stealing grads with contract-compliant strides, but at least users have a control point.
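
For illustration only, a hedged C++-frontend sketch of that escape hatch; the calls below are standard libtorch APIs rather than code from this PR, and the tensor shapes are made up:

```cpp
#include <torch/torch.h>

int main() {
  // A 4-D parameter whose gradient we want kept in channels-last form.
  auto param = torch::randn({8, 3, 32, 32}, torch::requires_grad());

  // Preset .grad to a zeroed tensor with the desired memory format so that
  // subsequent in-place accumulation keeps this layout.
  param.mutable_grad() =
      torch::zeros_like(param).contiguous(at::MemoryFormat::ChannelsLast);

  auto loss = param.sum();
  loss.backward();

  // Alternatively, reformat an existing gradient after the fact.
  auto grad_cl = param.grad().contiguous(at::MemoryFormat::ChannelsLast);
  return 0;
}
```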

One limitation (present before this PR and unchanged by this PR):  Presetting `param.grad` does not ensure in-place accumulation all the time.  For example, if `create_graph=True`, or if incoming `new_grad` is dense and existing `variable_grad` is sparse, accumulation occurs out of place, and the out-of-place result may not match the existing grad's strides.

----------------------------
I also noticed some potential DDP improvements that I considered out of scope but want to mention for visibility:
1. make sure Reducer's ops sync with AccumulateGrad streams
2. ~to reduce CPU overhead and incur fewer kernel launches, lazily create flat `contents` tensors by a single `cat` kernel only when a bucket is full, instead of `copy_`ing grads into `contents` individually as soon as they are received.~  PR includes a [minor change](https://github.com/pytorch/pytorch/pull/34904/files#diff-c269190a925a4b0df49eda8a8f6c5bd3R312-R315) to divide grads while copying them into flat buffers, instead of copying them in, then dividing separately.  Without cat+div fusion, div-while-copying is the best we can do.
3. https://github.com/pytorch/pytorch/issues/38942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34904

Differential Revision: D20496044

Pulled By: albanD

fbshipit-source-id: 248d680f4b1bf77b0a986451844ec6e254469217
2020-06-16 08:43:31 -07:00
Kurt Mohler
f9eb8824f1 Remove datatype from Storage and StorageImpl (#38870)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38870

* Removed dtype data member from StorageImpl
* Removed any methods or method arguments in Storage/StorageImpl that deal with dtypes
* Update all callers of the changed API

Part of issue https://github.com/pytorch/pytorch/issues/33950
Original PR: https://github.com/pytorch/pytorch/pull/38038

Reviewed By: albanD

Differential Revision: D21549645

Pulled By: ezyang

fbshipit-source-id: 4289b356c55ff6b9530376a79343b99b540ee3de
2020-05-21 15:26:08 -07:00
Rohan Varma
6d4d508d8e Log incorrect device in ProcessGroupGloo (#38844)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38844

Enhances error message in ProcessGroupGloo to log the unsupported
device. Been seeing a few issues with this and this will provide more debug
information.

Test Plan: CI

Differential Revision: D21676881

fbshipit-source-id: 1fd727162682e1a55003adff67c4358dab488455
2020-05-21 13:16:50 -07:00
Edward Yang
fe88806784 Back out "Revert D21171334: [pytorch][PR] Change StorageImpl to track byte count rather than element count" (#37893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37893

Original commit changeset: 50746043acf3

Test Plan: sandcastle and ossci

Reviewed By: malfet, seemethere, ngimel

Differential Revision: D21416509

fbshipit-source-id: 735ec4e61f9d36d4537f52dd2dc6267751aeb94b
2020-05-05 22:43:15 -07:00
Edward Yang
a2fc7f787a Revert D21171334: [pytorch][PR] Change StorageImpl to track byte count rather than element count
Test Plan: revert-hammer

Differential Revision:
D21171334

Original commit changeset: 37329a379de9

fbshipit-source-id: 50746043acf3c76754688de0fe6f1cc12437ea2f
2020-05-05 16:36:15 -07:00
Kurt Mohler
3706803b60 Change StorageImpl to track byte count rather than element count (#37776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37776

* Remove type-specific size tracking in favor of byte size tracking in Storage and StorageImpl
* Changed numel() and set_numel() to nbytes() and set_nbytes()
* Added enum argument to Storage/StorageImpl constructor to indicate new meaning of the size parameter
* Update all callers of the changed API

Part of issue https://github.com/pytorch/pytorch/issues/33950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37028

Differential Revision: D21171334

Pulled By: ezyang

fbshipit-source-id: 37329a379de9a3a83cc5e9007e455a3e1c2d10b8
2020-05-05 14:20:51 -07:00
cyy
2658bae570 use std::move (#34365)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34365

Differential Revision: D21349942

Pulled By: mrshenli

fbshipit-source-id: 4deb51cbb557501b43990ec7080c71a839cb5db9
2020-05-01 13:42:23 -07:00
Mo Zhou
69e2f1aaff [cmake] add HAVE_SOVERSION option (default=OFF). (#37502)
Summary:
This is useful for linux distributions when the ABI/API of libtorch has
been changed. The default SOVERSION is set to
"${TORCH_VERSION_MAJOR}.${TORCH_VERSION_MINOR}".

ezyang

But if the release strategy of pytorch/caffe2 involves avoiding breaking API/ABI changes to libtorch for minor/patch releases, then we can set `TORCH_SOVERSION` to simply `TORCH_VERSION_MAJOR`. Please confirm that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37502

Differential Revision: D21303565

Pulled By: ezyang

fbshipit-source-id: 798f5ec7fc5f0431ff1a7f9e8e5d3a0d3b25bb22
2020-04-30 06:52:33 -07:00
Dhiraj D Kalamkar
945d7a7408 Add All-to-all comms support to distributed module and MPI backend (#32361)
Summary:
As described in https://github.com/pytorch/pytorch/issues/32345, this is a prototype implementation that adds an alltoall communication primitive to the torch.distributed module and the ProcessGroup abstract interface. It also implements alltoall in the ProcessGroupMPI backend.

mnaumovfb JianpingChen066 dmudiger srinivas212 Jianhui-Li mshiryaev ftian1

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini xush6528 osalpekar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32361

Reviewed By: mrshenli

Differential Revision: D20635481

Pulled By: srinivas212

fbshipit-source-id: 3dd0af800ce55d02f02813cde550e3a0f1a287d2
2020-04-01 08:57:12 -07:00
peter
3bdc4a37ed CMake script cleanup - mixed case for function names (#35589)
Summary:
Running the following code.
```bash
cmake --help-command-list |
grep -v "cmake version" |
while read c; do
    echo 's/\b'"$(echo $c | tr '[:lower:]' '[:upper:]')"'\(\s*\)(/'"$c"'\1(/g'
done >convert.sed &&
git ls-files -z -- bootstrap '*.cmake' '*.cmake.in' '*CMakeLists.txt' |
egrep -z -v '^(cmake/Modules/|cmake/Modules_CUDA_fix/)' |
xargs -0 sed -i -f convert.sed &&
rm convert.sed
```
cmake-lint is too sensitive about mixed case so I didn't switch the check on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35589

Differential Revision: D20735648

Pulled By: ezyang

fbshipit-source-id: a09a60a7ce921bb198575a35335faa299bd10b66
2020-03-30 11:37:02 -07:00
peter
45c9ed825a Formatting cmake (to lowercase without space for if/elseif/else/endif) (#35521)
Summary:
Running commands:
```bash
shopt -s globstar

sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i caffe2/**/CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i torch/**/CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i c10/**/CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i cmake/**/*.cmake
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i cmake/**/*.cmake.in
```
We may further convert all the commands into lowercase according to the following issue: 77543bde41.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35521

Differential Revision: D20704382

Pulled By: malfet

fbshipit-source-id: 42186b9b1660c34428ab7ceb8d3f7a0ced5d2e80
2020-03-27 14:25:17 -07:00
Johannes M Dieterich
835ee34e38 [ROCm] Update to ROCm 3.1.1 (#35552)
Summary:
Redux.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35552

Differential Revision: D20701593

Pulled By: ezyang

fbshipit-source-id: 1946d1e8fb47d597da903bae5d355bf52a5f017f
2020-03-27 12:21:12 -07:00
Edward Yang
3622e1c90f Revert D20589048: [pytorch][PR] [ROCm] Update CI dockers to ROCm release 3.1.1
Test Plan: revert-hammer

Differential Revision:
D20589048

Original commit changeset: 568f40c1b90f

fbshipit-source-id: 724c4fe99e8806f00d2f7dceb71d15a02358f663
2020-03-26 09:31:59 -07:00
Johannes M Dieterich
f7f7c4edd9 [ROCm] Update CI dockers to ROCm release 3.1.1 (#33930)
Summary:
Request to update ROCm CI dockers to release 3.1

Changes required to the PyTorch source base attached:
* switch to the fast path for the Caffe2 ReLU operator
* switch to the new hipMemcpyWithStream(stream) API to replace hipMemcpyAsync(stream) && hipStreamSynchronize(stream) paradigm in an optimized fashion
* disable two regressed unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33930

Differential Revision: D20589048

Pulled By: ezyang

fbshipit-source-id: 568f40c1b90f311eb2ba57f02a9901114d8364af
2020-03-26 07:55:44 -07:00
Hong Xu
a8ca340ad6 Remove all uses of AT_CHECK and replace them with TORCH_CHECK (#34846)
Summary:
AT_CHECK has been deprecated and provides no more features than
TORCH_CHECK
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34846

Differential Revision: D20481339

Pulled By: mrshenli

fbshipit-source-id: 1777e769a069a78e03118270294e5e273d516ca7
2020-03-17 08:59:02 -07:00
cyy
c218963270 fix more errors (#34480)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34480

Differential Revision: D20345198

Pulled By: ezyang

fbshipit-source-id: 583246acd02850ead96f1f0574d01ef6697c6352
2020-03-09 14:54:15 -07:00
Rohan Varma
92083f31b5 [gloo] dont hold locks in calls to buffer in ProcessGroupGloo:RecvWork::wait() and (#33926)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33926

The UnboundBuffer calls here are already protected by a mutex. We only
need to hold the lock while writing the shared structures completed_ and
exception_.
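
Conceptually the locking change looks like this sketch, where `waitOnBuffer()` is an assumed placeholder for the UnboundBuffer wait call; it is not the literal Gloo code:

```cpp
#include <exception>
#include <mutex>
#include <stdexcept>

// Sketch of the locking pattern only; waitOnBuffer() stands in for the
// already-synchronized gloo::transport::UnboundBuffer wait calls.
class RecvWork {
 public:
  void wait() {
    bool ok = waitOnBuffer();          // no lock held across the buffer call
    std::lock_guard<std::mutex> guard(mutex_);
    completed_ = true;                 // the lock only protects shared state
    if (!ok) {
      exception_ = std::make_exception_ptr(
          std::runtime_error("recv did not complete"));
    }
  }

 private:
  bool waitOnBuffer() { return true; } // placeholder for buffer->waitRecv(...)
  std::mutex mutex_;
  bool completed_ = false;
  std::exception_ptr exception_;
};
```
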
ghstack-source-id: 99315427

Test Plan: CI

Differential Revision: D20154546

fbshipit-source-id: d1b74508c917b21acdcd0f6a914eb0455437ca0e
2020-03-03 13:28:45 -08:00
Pritam Damania
c90b393c00 Fix logging for aborted communicators in ProcessGroupNCCL. (#33147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33147

The log mentioned that it is aborting communicators even if
`blockingWait_` was false. This was incorrect, and I updated the logging to
reflect the appropriate behavior.
ghstack-source-id: 98025017

Test Plan: waitforbuildbot

Differential Revision: D19817967

fbshipit-source-id: fb3415af2cc99eb20981ceaa5203c0a1880fd6f3
2020-02-17 14:42:51 -08:00
Pritam Damania
ab75d64e6e Add ability to abort NCCL communicators from the store. (#32895)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32895

When a particular rank calls `ncclCommAbort` on a communicator, it is
important to ensure all other ranks call `ncclCommAbort` on their respective
communicators. If this is not done, the other ranks could get stuck causing the
GPU to spin with 100% utilization.

To alleviate this issue, whenever any rank calls `ncclCommAbort` we put the
unique communicator id in the store. The NCCL watchdog thread then monitors the
store and aborts any communicators found in the store as "aborted".
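
The coordination protocol, sketched against a generic key-value interface standing in for the c10d store; the key prefix and helper names are made up for illustration:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Minimal stand-in for the store interface the watchdog talks to.
struct KeyValueStore {
  virtual void set(const std::string& key, const std::vector<uint8_t>& value) = 0;
  virtual bool check(const std::vector<std::string>& keys) = 0;
  virtual ~KeyValueStore() = default;
};

// When a rank aborts a communicator, publish its unique id for the peers.
void publishAbort(KeyValueStore& store, const std::string& ncclUniqueId) {
  store.set("abortedComm:" + ncclUniqueId, {1});
}

// Run periodically by each rank's watchdog: if a peer has published an abort
// for this communicator, abort it locally too instead of spinning forever.
template <typename Comm>
void maybeAbortIfPeerAborted(KeyValueStore& store,
                             const std::string& ncclUniqueId,
                             Comm& comm) {
  if (store.check({"abortedComm:" + ncclUniqueId})) {
    comm.abort();  // mirrors the remote ncclCommAbort
  }
}
```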

A few more general fixes in this PR:

1) Use std::shared_ptr for the store in PrefixStore. PrefixStore was using a
reference to the store and when that reference went out of scope the store
object it was holding onto was invalid. This caused a segfault in the watchdog
thread.
2) Enhanced logging for the watchdog thread.

Test Plan: waitforbuildbot

Differential Revision: D19638159

fbshipit-source-id: 596cd87c9fe6d4aeaaab4cb7319cc37784d06eaa
2020-02-05 15:28:05 -08:00
Gaurav Singh
765904f1b9 [torch] fd error check
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32797

Differential Revision: D19642262

Pulled By: mrshenli

fbshipit-source-id: 1720812166dd583dca6d72cb7e24b65ec013a62b
2020-01-30 15:30:03 -08:00
comet
9a2691f2fc Fix spelling errors
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32673

Differential Revision: D19597118

Pulled By: pietern

fbshipit-source-id: f88c1da7548fcee141ed248f5f49d25c1d639955
2020-01-28 04:46:15 -08:00
Edward Yang
57519bd829 Revert "Fix iterator for ncclCommWatchdog. (#32571)" (#32649)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32649

This reverts commit 59dbece371.

Revert "Enhance NCCL watchdog to acitvely abort communicators for timed out ops. (#32338)"

This reverts commit f86d6c6afd.

Test Plan: Imported from OSS

Differential Revision: D19584224

Pulled By: ezyang

fbshipit-source-id: 6cc0ad56ba1f3aec5b48db44e8c6c24c8105db4a
2020-01-27 14:25:30 -08:00
Pritam Damania
59dbece371 Fix iterator for ncclCommWatchdog. (#32571)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32571

The watchdog thread would erase an element and call `it--` (implicitly relying on `it++` in the for loop to position correctly). However, `it--` causes undefined behavior if the iterator is pointing to begin(). As a result, I've modified the logic to update the iterator appropriately.
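
The standard-library idiom the fix moves to, as a small self-contained sketch:

```cpp
#include <string>
#include <unordered_map>

// Sketch of the corrected erase-while-iterating pattern: let erase() return
// the next valid iterator instead of stepping backwards with it--.
void sweep(std::unordered_map<std::string, int>& outstandingWork) {
  for (auto it = outstandingWork.begin(); it != outstandingWork.end();) {
    const bool done = (it->second == 0);   // placeholder completion check
    if (done) {
      it = outstandingWork.erase(it);      // safe even at begin()
    } else {
      ++it;
    }
  }
}
```

`erase()` returns the iterator following the removed element, so the loop never has to step backwards and the begin() case is handled naturally.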

I've also enhanced the watchdog thread to catch and log exceptions.
ghstack-source-id: 97150763

Test Plan: waitforbuildbot

Differential Revision: D19551365

fbshipit-source-id: 426835819ad8d467bccf5846b04d14442a342f78
2020-01-24 17:34:36 -08:00
Hongyi Jia
21d475e20d [gloo] Skip registry warning (#31126)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31126

The Gloo device creator registry is throwing a warning that confuses users - https://fb.workplace.com/groups/1405155842844877/permalink/3217491788277931/
Create a C10_DEFINE_SHARED_REGISTRY_WITHOUT_WARNING API to skip such warnings

Test Plan:
{F224342749}

Tested both `C10_DEFINE_SHARED_REGISTRY` and `C10_DEFINE_SHARED_REGISTRY_WITHOUT_WARNING`.
Make sure nothing breaks

Reviewed By: d4l3k

Differential Revision: D18904783

fbshipit-source-id: 0e0065d530956249a18325d4ed3cb58dec255d4c
2020-01-22 22:46:27 -08:00
Yanli Zhao
c342c354a9 Put sparse all reduce results to input tensors (#32226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32226

Right now, if users call torch.dist.all_reduce() on dense tensors, outputs are put in the input tensors, but if users call torch.dist.all_reduce() on sparse tensors, outputs are neither returned explicitly to users nor put in the input tensors.

To make the torch.dist.all_reduce() API behave the same on both dense and sparse tensors, this diff makes torch.dist.all_reduce() on sparse tensors put the output in the input tensors as well. This is achieved by simply calling input_sparse.copy_(output_sparse); see PR https://github.com/pytorch/pytorch/pull/9005, which implemented copy_ for sparse tensors.

close #31413
ghstack-source-id: 96984228

Test Plan: unit test

Differential Revision: D19192952

fbshipit-source-id: 2dd31dc057f20cc42b44b9e55df864afa2918c33
2020-01-22 08:06:56 -08:00
Pritam Damania
f86d6c6afd Enhance NCCL watchdog to actively abort communicators for timed out ops. (#32338)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32338

Timed out ops could linger around if the user doesn't actually call `wait()` on that op. As a result, to fix this I've introduced the following functionality in this PR (a sketch of the watchdog sweep follows the list below):

1. Keep track of all outstanding work in ProcessGroupNCCL.
2. Enhance NCCL watchdog to sweep through all outstanding work and perform the
following operations:
  i.   If the work has timed out, abort all communicators for that work and
       remove them from the cache.
  ii.  If the communicators for the work receive an error, abort the
       communicators and remove them from the cache.
  iii. If the work has completed (successfully/unsuccessfully), remove it from
       the list of outstanding work.
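
Conceptually, each watchdog pass then looks something like the following sketch (placeholder types, not the actual ProcessGroupNCCL implementation):

```cpp
#include <list>
#include <memory>

// Illustrative model of one watchdog pass over the outstanding work list;
// WorkEntry and its members are placeholders, not the real c10d types.
struct WorkEntry {
  bool timedOut() const { return false; }   // placeholder check
  bool commError() const { return false; }  // placeholder check
  bool completed() const { return true; }   // placeholder check
  void abortComms() {}                      // would abort + drop cached NCCL comms
};

void watchdogSweep(std::list<std::shared_ptr<WorkEntry>>& outstanding) {
  for (auto it = outstanding.begin(); it != outstanding.end();) {
    auto& work = **it;
    if (work.timedOut() || work.commError()) {
      work.abortComms();                 // cases (i) and (ii) above
      it = outstanding.erase(it);
    } else if (work.completed()) {
      it = outstanding.erase(it);        // case (iii): stop tracking it
    } else {
      ++it;
    }
  }
}
```
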
ghstack-source-id: 96895704

Test Plan: waitforbuildbot

Differential Revision: D19401625

fbshipit-source-id: 8f6f277ba2750a1e1aa03cdbc76e8c11862e7ce5
2020-01-21 12:05:40 -08:00
Brian Wignall
f326045b37 Fix typos, via a Levenshtein-type corrector (#31523)
Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos, with https://github.com/bwignall/typochecker to help automate the checking.

Uses an updated version of the tool used in https://github.com/pytorch/pytorch/pull/30606 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31523

Differential Revision: D19216749

Pulled By: mrshenli

fbshipit-source-id: 7fd489cb9a77cd7e4950c1046f925d57524960ea
2020-01-17 16:03:19 -08:00
Rohan Varma
bdd5e15437 skip testExceptions in ProcessGroupGloo if built with TSAN (#32242)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32242

TSAN and fork don't play well together, so skip this test if we're
building under TSAN. It will still run in other modes.

Differential Revision: D19416113

fbshipit-source-id: 7e88d63a843356372160c2524c05e8fd1706553e
2020-01-17 14:17:06 -08:00
Rohan Varma
6a5a55d573 use gtest asserts in ProcessGroupGlooTest instead of other checks (#32138)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32138

I personally prefer `throw std::runtime_error("BOOM")`, but we should
probably have asserts here now that it is gtest. Also ensures that the correct
exceptions are thrown by the `testSignal` tests.
ghstack-source-id: 96811000

Differential Revision: D19382905

fbshipit-source-id: 1b00dd70524d03c8bd6f48715baa5070a7985467
2020-01-17 10:31:59 -08:00
Rohan Varma
904ab092c2 fix testSend and testRecv in ProcessGroupGlooTest (#32134)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32134

These tests weren't written in the most correct way and were often
flaky. It was tricky to identify these tests as flaky until we moved this file
to use gtest.

The gist of the issue is that the test previously would not coordinate sends
and recvs properly. For example, we created a single thread to test an
abortRecv and a successful recv. A separate sender thread was used to send 2
messages. What could go wrong here is that the first send could successfully
complete, resulting in the receiving end processing the message before it gets
the abort signal. In this case we would have an error in the test.
ghstack-source-id: 96806879

Differential Revision: D19379395

fbshipit-source-id: 24782ccaf6e6ec6b445378b29d5f10f901e0dee6
2020-01-17 04:00:39 -08:00
Yanli Zhao
7a9c920bac add lock for ncclCommAbort (#31901)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31901

ncclCommAbort is not thread safe, so adding a lock for it
ghstack-source-id: 96829715

Test Plan: unit tests

Differential Revision: D19293869

fbshipit-source-id: 711b4a07605d6e5a81577247d2f90a78041c1809
2020-01-17 03:57:08 -08:00
Alexander Golynski
74621ca926 Add allgather_base as per our discussion re: ProcessGroup interface. (#31892)
Summary:
Introduce ProcessGroup::allgather_base. No implementation yet: plan to add it one PG backend at a time in a follow up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31892

Test Plan: No functional changes, no tests yet.

Differential Revision: D19290739

Pulled By: agolynski

fbshipit-source-id: c2f4947d2980995724c539de7c6d97618e1ba11a
2020-01-15 14:05:23 -08:00
Rohan Varma
7572501d40 move ProcessGroupGlooTest to gtest (#32133)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32133

We should do this to better debug the test.

Differential Revision: D19375479

fbshipit-source-id: 8c2bf61bae605a38252bb793b091ade479bea11a
2020-01-14 17:42:42 -08:00
Pritam Damania
f003008d6e Allow TCPStore to pick a port to bind to. (#31674)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31674

The motivation of this PR was to fix the problem where we would see
"Address already in use" issues for TCPStoreTest due to port conflicts. To
resolve this:

1. We can now pass in port 0 for TCPStore and retrieve the port it actually
bound to using a new getPort() API.
2. Added a `wait` flag to TCPStore constructor indicating whether or not it
should wait for workers (defaults to true).
3. Made `waitForWorkers` a public API to ensure that we can construct TCPStore
without waiting and wait for workers separately. This helps in TCPStoreTest to
ensure we can retrieve the port and pass it to the client stores.
ghstack-source-id: 96486845

Test Plan: waitforbuildbot

Differential Revision: D19240947

fbshipit-source-id: 7b1d1cb2730209fac788764845f1dbbe73d75d9b
2020-01-13 14:23:31 -08:00
Michael Suo
8420f205ee Remove refs from ArrayRef arguments (#31845)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31845

ArrayRef is trivially copyable and should be passed by value. Removing
unnecessary `&`s.

Test Plan: Imported from OSS

Differential Revision: D19278523

Pulled By: suo

fbshipit-source-id: 026db693ea98d19246b02c48d49d1929ecb6478e
2020-01-03 22:50:55 -08:00
Rohan Varma
28c9dd4436 fix ProcessGroupGlooTest (#31255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31255

This test had 2 issues: it would occasionally time out due to a 50ms timeout, and CUDA code would get compiled and run on CPU, leading to errors. This PR fixes those issues.

Differential Revision: D19028231

fbshipit-source-id: e50752228affe0021e7c0caa83bce78d76473759
2020-01-03 18:35:29 -08:00
Mingbo Wan
647569e546 get rid of choco install (#30897)
Summary:
7zip and cmake are part of the base image, so there is no need to re-install them. Removing the install step can make build/test more stable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30897

Differential Revision: D19232961

Pulled By: mingbowan

fbshipit-source-id: fa3bbd1325839a2a977bf13fdbd97fda43793b8d
2019-12-27 13:12:04 -08:00
Sebastian Messmer
f0243ea712 Use [[deprecated]] instead of C10_DEPRECATED (#30918)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30918

This is a C++14 feature we can use now
ghstack-source-id: 95811482

Test Plan: waitforsandcastle

Differential Revision: D18869636

fbshipit-source-id: b5b3d78b61b6ceb2deda509131f8502e95b1d057
2019-12-17 15:21:34 -08:00
Sebastian Messmer
643ca5def2 Replace c10::guts::stuff with std::stuff (#30915)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30915

Since we now have C++14, we don't need these c10::guts helpers anymore
ghstack-source-id: 95777609

Test Plan: waitforsandcastle

Differential Revision: D18869639

fbshipit-source-id: 97716f932297c64c6e814410ac47b444c33d4e2e
2019-12-16 13:57:19 -08:00
Yanli Zhao
36d17f4105 abort nccl communicators before throwing operation timed out (#31128)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31128

When an operation times out due to errors that are not detected by the nccl communicators, ncclCommWatchdog cannot see this timeout error and thus cannot abort the ncclComms accordingly. So we explicitly abort the ncclComms here before throwing the timed-out exception to users; after this, ncclCommWatchdog can detect that the nccl communicators are aborted and clean up devNCCLCommMap_ accordingly. If the timed-out exception is thrown without aborting the nccl communicators here, it was observed that the CUDA GPU hits 100% utilization and cannot run new events successfully.
ghstack-source-id: 95528488

Test Plan: the newly revised test _test_nccl_errors_blocking passed with the changes in this diff; the revised test failed without the changes in this diff

Reviewed By: isunjin

Differential Revision: D18928607

fbshipit-source-id: be65a05ce4ff005f0c7fed36ae8e28903e8ffe2b
2019-12-13 00:33:36 -08:00
hxia11
06c7420fa2 Raise error if a block can not be found from a CUDA tensor (#30870)
Summary:
After several discussions, we agreed not to add any extra safety checks for recordStream, since such a check would either cause failures in certain scenarios or there is no need to throw for user errors.

In summary, it simply does what is described in https://github.com/pytorch/pytorch/issues/27405: check whether a tensor was indeed allocated by a CUDACachingAllocator instance, and if it was, throw an internal error if its block cannot be retrieved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30870

Differential Revision: D18851669

Pulled By: yxia11

fbshipit-source-id: c2f01798cd24f1fd0f35db8764057d5d333dab95
2019-12-10 08:04:00 -08:00
Nathan Goldbaum
f531815526 Deprecate tensor.type() (#30281)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/29161.

I looked a bit at the code changes related to this and think I have all of the use cases of `DeprecatedTypeProperties` covered in the message, but suggestions from someone with more context on this would be very much appreciated :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30281
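For context, a hedged sketch of the migration this deprecation nudges users toward (not part of this PR): query `dtype`/`device` and use `.to()` instead of `tensor.type()`.

```
# Hedged sketch: preferred replacements for the deprecated tensor.type().
import torch

x = torch.zeros(2, 3)

# Deprecated style:
#   x.type()                     -> 'torch.FloatTensor'
#   x.type(torch.DoubleTensor)   -> conversion via a type string/class

# Preferred replacements:
print(x.dtype)            # torch.float32
print(x.device)           # cpu
y = x.to(torch.float64)   # dtype conversion
z = y.to("cuda") if torch.cuda.is_available() else y  # device conversion
```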

Differential Revision: D18830818

Pulled By: ezyang

fbshipit-source-id: 1a7fcee15354ae09e6644577e7fa33bd26acfe20
2019-12-05 10:55:34 -08:00
Sebastian Messmer
bc2e6d10fa Back out "Revert D17908478: Switch PyTorch/Caffe2 to C++14"
Summary: Original commit changeset: 775d2e29be0b

Test Plan: CI

Reviewed By: mruberry

Differential Revision: D18775520

fbshipit-source-id: a350b3f86b66d97241f208786ee67e9a51172eac
2019-12-03 14:33:43 -08:00
Yanli Zhao
40146eb48e Skip ProcessGroupGlooAyncTest if there is no CUDA available (#30345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30345

Skip ProcessGroupGlooAyncTest if CUDA is not available; otherwise, on sandcastle non-GPU hosts the test aborts because it fails to load the CUDA library.
ghstack-source-id: 94771241

Test Plan: test skipped on non GPU host

Differential Revision: D18665322

fbshipit-source-id: 8c7b89aeecc6ec007bee12d864a6058384254e61
2019-12-03 13:27:34 -08:00
Brian Wignall
e7fe64f6a6 Fix typos (#30606)
Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30606

Differential Revision: D18763028

Pulled By: mrshenli

fbshipit-source-id: 896515a2156d062653408852e6c04b429fc5955c
2019-12-02 20:17:42 -08:00
Pritam Damania
db81e13d6b Fix TCPStoreTest and improve tcputils::connect() (#30354)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30354

TCPStoreTest would time out since the TCPStore constructor for the
server would block the main thread waiting for workers. The workers themselves
were spawned later on, once the server store was created. As a result, this test
would always time out.

To fix the test, I moved the server store to a thread so that the workers can
register with the server in parallel.
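A minimal sketch of that restructuring, written against the Python `TCPStore` binding purely for illustration (the actual test is C++); the fixed port number is a placeholder:

```
# Hedged sketch: run the (blocking) server store on a background thread so the
# client store can register with it in parallel.
import threading
from datetime import timedelta
from torch.distributed import TCPStore

PORT = 29500  # placeholder port for illustration

def run_server():
    # Blocks until world_size members (server + clients) have joined.
    TCPStore("127.0.0.1", PORT, world_size=2, is_master=True,
             timeout=timedelta(seconds=30))

server_thread = threading.Thread(target=run_server, daemon=True)
server_thread.start()

# The client can now connect while the server waits on its own thread.
client = TCPStore("127.0.0.1", PORT, world_size=2, is_master=False,
                  timeout=timedelta(seconds=30))
server_thread.join()
```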

In addition to this, I made a few improvements to tcputils::connect(). When
tcputils::connect() encountered an exception, it always looked at `errno` for
the error code. In some cases `errno` could be overwritten and the real error
code would be stored in `std::system_error`. As a result, I've modified the
code to look at the error code in `std::system_error` if we catch an exception
of that type.
ghstack-source-id: 94758939

Test Plan: waitforbuildbot

Differential Revision: D18668454

fbshipit-source-id: d5a3c57b066b094bfecda9a79d9d31bfa32e17f0
2019-12-02 19:52:34 -08:00
Sebastian Messmer
a2ed50c920 Revert D17908478: Switch PyTorch/Caffe2 to C++14
Test Plan: revert-hammer

Differential Revision:
D17908478

Original commit changeset: 6e340024591e

fbshipit-source-id: 775d2e29be0bc3a0db64f164c8960c44d4877d5d
2019-11-27 14:57:05 -08:00
Sebastian Messmer
d0acc9c085 Switch PyTorch/Caffe2 to C++14 (#30406)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30406

ghstack-source-id: 94642238

Test Plan: waitforsandcastle

Differential Revision: D17908478

fbshipit-source-id: 6e340024591ec2c69521668022999df4a33b4ddb
2019-11-27 10:47:31 -08:00
Pieter Noordhuis
0282c5ae69 Add helper to aggregate multiple process groups (#25768)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25768

The round robin process group can be constructed from multiple other
process groups. Every collective call against this new process group
is delegated to the specified process groups in a round robin fashion.

Doing so may benefit performance when calling into multiple NCCL
process groups. Instead of adding support for round-robin usage of
NCCL communicators, we achieve the same by adding this wrapper class,
without changing the NCCL process group itself.

The API to create this round robin process group is a bit harsh. If we
find it adds significant benefit we can revisit and make this a first
class citizen in the torch.distributed module.
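A simplified, hypothetical sketch of the delegation idea (this is not the actual c10d API; the class and method names here are made up for illustration):

```
# Hedged sketch: forward every collective to one of the wrapped process
# groups in round-robin order.
import itertools

class RoundRobinProcessGroup:
    def __init__(self, process_groups):
        self._next_group = itertools.cycle(process_groups)

    def allreduce(self, tensors):
        pg = next(self._next_group)   # pick the next underlying process group
        return pg.allreduce(tensors)  # delegate; return its Work handle

    def broadcast(self, tensors):
        pg = next(self._next_group)
        return pg.broadcast(tensors)
```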
ghstack-source-id: 94578376

Test Plan: The newly added test passes.

Reviewed By: chenyangyu1988

Differential Revision: D17226323

fbshipit-source-id: ec9f754b66f33b983fee30bfb86a1c4c5d74767d
2019-11-27 08:34:34 -08:00
Hongyi Jia
c7f988b8c6 transport open registration (#30167)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/29164

- Created GlooDeviceFactory to hide device creation details
- Added a transport option to the Python interface

The reason for making the factory class is to make it easier to extend the gloo transport in the future

Test Plan: Imported from OSS

Reviewed By: satgera, d4l3k

Differential Revision: D18596527

fbshipit-source-id: e8114162ee8d841c0e0769315b48356b37d6ca0a
2019-11-22 17:41:52 -08:00
Pieter Noordhuis
a074080d57 Mark c10d::~NCCLUtils as noexcept (#29118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29118

It's never a good idea to throw from a destructor and per #28288 we
can't use `std::make_shared` on a class with a `noexcept(false)`
destructor.

To fix this, we `abort` instead of throw from the `NCCLComm` destructor.

Closes #28288.
ghstack-source-id: 93182910

Test Plan: ProcessGroupNCCLErrorsTest runs successfully.

Reviewed By: pritamdamania87

Differential Revision: D18298271

fbshipit-source-id: ccac37753fef64fb63cb304433f4f97dc5621379
2019-11-22 04:06:12 -08:00
Rohan Varma
cc16819028 Add abort API in gloo ProcessGroup Send/Recv Work (#29928)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29928

Original author: Shihao Xu
- Add abort to `c10d::ProcessGroup::Work`.
- Change the return type of `c10d::ProcessGroup::Work::wait()` to boolean to indicate if the work is aborted after waiting.
- Add unit test for the correctness of abort.
ghstack-source-id: 94305515

Differential Revision: D5685727

fbshipit-source-id: 6e682bb563c2393a5c303c877331140417d3f607
2019-11-20 20:18:54 -08:00
Igor Fedan
65f3b98c35 explicitly provide memory format when calling to clone() at ProcessGroupGloo.cpp
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28688

Test Plan: Imported from OSS

Differential Revision: D18333382

Pulled By: ifedan

fbshipit-source-id: b698b647eaa1e318210f445c864d6333e7d46a15
2019-11-11 11:48:53 -08:00
Alexander Golynski
23695ab23f Moving python allgather_coalesced impl from Py to C. (#29059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29059
This is a resubmit of reverted diff D18209289 ( PR #28857 ).

Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo

Reviewed By: pietern

Differential Revision: D18277097

fbshipit-source-id: aecfd7206d70829f0cac66182bf02fccee410fed
2019-11-04 08:34:34 -08:00
Shen Li
9041e29d94 Revert D18209289: Moving python allgather_coalesced impl from Py to C
Test Plan: revert-hammer

Differential Revision:
D18209289

Original commit changeset: c5a4c4a1aaa0

fbshipit-source-id: d4865e3f8c4eeee285c711e5c2250b8c9f9b0d25
2019-11-01 11:23:41 -07:00
Alexander Golynski
22a346ee34 Moving python allgather_coalesced impl from Py to C
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28857

Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo

Reviewed By: mrshenli

Differential Revision: D18209289

fbshipit-source-id: c5a4c4a1aaa07286a05a7c842dda428eeb46f696
2019-11-01 10:34:23 -07:00
Jeremy Lilley
579ffb647d Add HashStore to c10d (#28921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28921

This implementation is quite similar to the HashStore in gloo -
an ephemeral in-process store with a lock and unordered_map<>.

There are a few tweaks/differences based on c10d vs gloo:
  - c10d expects add/check methods
  - c10d get() use cases expect to wait up to super::timeout_ if the value isn't present
  - c10d set() isn't expected to throw if the value is present.
  - c10d uses uint8_t vs char

It's potentially a better choice than FileStore for cases where we
don't need cross-process access or don't care about the backing file.
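A hedged usage sketch, assuming the `HashStore` binding exposed to Python mirrors the c10d Store interface described above:

```
# Hedged sketch: HashStore is an ephemeral, in-process key/value store.
from datetime import timedelta
import torch.distributed as dist

store = dist.HashStore()
store.set_timeout(timedelta(seconds=10))  # get()/wait() block up to this long
store.set("status", "ready")              # set() does not throw if key exists
store.add("counter", 1)                   # add/check methods expected by c10d
print(store.get("status"))                # b'ready'
```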
ghstack-source-id: 92992341

Test Plan:
buck build mode/dev-nosan caffe2/torch/lib/c10d/...
    buck-out/dev/gen/caffe2/torch/lib/c10d/HashStoreTest

Differential Revision: D18233713

fbshipit-source-id: ab23f3f93d3148c1337f2cc6a8f2aff4aa6549f3
2019-10-31 13:55:22 -07:00
Jeremy Lilley
331e09eca4 Make FileStore not segfault with concurrent accesses. (#28812)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28812

FileStore isn't thread-safe. We've observed a few FB unittests
already using this class in an unsafe manner.

This change enforces at most a single concurrent use of
the various file options, from this specific Store instance.
This protects the cache_, pos_, and the relative integrity
of the operations.

An alternative would be simply to explicitly document this
class as non-thread-safe, though perhaps not everybody will
read the warning.

ghstack-source-id: 92874098

Test Plan:
buck test mode/dev-nosan caffe2/...
  Actual observed failures were in ThreadRpcAgentTest

Differential Revision: D18187821

fbshipit-source-id: 67c765da74c836a9ac9f887cdf1a28a75247e04b
2019-10-30 11:03:00 -07:00
Rohan Varma
a783563738 Skip ProcessGroupNCCLTest if CUDA is not available (#28393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28393

We should skip this test if CUDA is not available and alert the user.
Previously, if this test was run on CPU, it would fail with:
```
terminate called after throwing an instance of 'std::runtime_error'
  what():  cuda runtime error (3) : This binary is linked with CUDA lazy stubs and underlying .so files were not loaded. CUDA functionality is disabled. Set env variable CUDA_LAZY_DEBUG to get messages during startup
```
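The guard amounts to the usual availability check; a hedged Python illustration of the same pattern (the real test is C++):

```
# Hedged sketch: skip the test early when CUDA is unavailable.
import sys
import torch

if not torch.cuda.is_available():
    print("CUDA not available, skipping test")
    sys.exit(0)

# ... CUDA-dependent test body would run here ...
```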

Test Plan:
Build on CPU and verify that there are no errors when running; we should get the message:
`CUDA not available, skipping test`. Previously, we would get an error:
```
terminate called after throwing an instance of 'std::runtime_error'
  what():  cuda runtime error (3) : This binary is linked with CUDA lazy stubs and underlying .so files were not loaded. CUDA functionality is disabled. Set env variable CUDA_LAZY_DEBUG to get messages during startup. at caffe2/aten/src/THC/THCGeneral.cpp:54
```

Differential Revision: D18054369

fbshipit-source-id: f1d06af88b780a24ca3373a7a133047a2cfe366e
2019-10-24 14:02:09 -07:00
Rohan Varma
00a2b36188 improve error handling in getNCCLVersion in NCCLUtils (#27883)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27883

Returns early if the NCCL version code returned to us is < 100, to prevent
division errors. This shouldn't actually happen since the NVIDIA NCCL version is well past 0.1.0, but it is nice to have this safeguard.
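For reference, a hedged sketch of the decoding and the `< 100` guard (the real implementation is C++ in NCCLUtils; the `major*1000 + minor*100 + patch` encoding shown is the scheme NCCL used at the time):

```
# Hedged sketch: convert an NCCL version code into a readable string, bailing
# out early for bogus codes to avoid nonsensical divisions.
def nccl_version_string(code: int) -> str:
    if code < 100:
        return "Unknown NCCL version"
    major = code // 1000
    minor = (code % 1000) // 100
    patch = code % 100
    return f"{major}.{minor}.{patch}"

assert nccl_version_string(2408) == "2.4.8"
```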
ghstack-source-id: 91861083

Test Plan: Follow same process as https://github.com/pytorch/pytorch/pull/27068. Also force version to be < 100 and ensure that "Unknown NCCL Version" is returned.

Differential Revision: D17903234

fbshipit-source-id: c4df63bb1c18f1b2ef9e4cd434d4ca6c5ac556df
2019-10-15 17:33:09 -07:00
Rohan Varma
1054ab213d improve error message for scatter in processGroupGloo (#27458)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27458

Same as the previous diff - improve error message by passing back the
size discrepancy.
ghstack-source-id: 91864213

Test Plan: `python test/test_c10d.py`

Differential Revision: D17785296

fbshipit-source-id: f939b8091aede768ea215f69df2c83e438c430cf
2019-10-15 11:09:47 -07:00
Rohan Varma
f36345eb0b improve error message on incorrect inputs into gather for (#27439)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27439

When users call dist.gather, they have to pass in a `gather_list` to
the function on the destination worker, and this list needs to have the same
size as the number of processes in the group. When the user initializes this
list incorrectly, the current error message is not very helpful:

This changes the error message so that the incorrect gather_list size is
pointed out and the correct one is given.
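A hedged sketch of the contract the improved message points users at: only the destination rank passes a `gather_list`, and it must hold `world_size` tensors.

```
# Hedged sketch: correct gather_list usage on the destination rank.
import torch
import torch.distributed as dist

def gather_to_rank0(tensor):
    world_size = dist.get_world_size()
    if dist.get_rank() == 0:
        gather_list = [torch.empty_like(tensor) for _ in range(world_size)]
        dist.gather(tensor, gather_list=gather_list, dst=0)
        return gather_list
    dist.gather(tensor, dst=0)  # non-destination ranks pass no gather_list
    return None
```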
ghstack-source-id: 91413442

Test Plan: Added a unit test and tested with an incorrect gather_list size.

Differential Revision: D17781370

fbshipit-source-id: b49aad1b1197daf77daa10911296664e6340e2fa
2019-10-11 11:00:42 -07:00
Pritam Damania
24242e86fa Ensure NCCL error handling code is disabled for NCCL versions < 2.4 (#27124)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27124

ncclCommAbort() and ncclGetAsyncError() were two APIs added in NCCL
2.4 to detect errors in NCCL communicators. These were used as part of
ProcessGroupNCCL, and we also enforced that only NCCL versions 2.4+ were
supported. However, there is still legitimate use for older NCCL versions, and
hence we should still support those.

For that purpose, in this change I've ensured we disable NCCL error checking
for versions < 2.4.
ghstack-source-id: 91452959

Test Plan:
1) Test with 2.4.8
2) Test with 2.2.13
3) unit tests.

Differential Revision: D17178988

fbshipit-source-id: 5dc44b5f7b4b00466c67fd452315f1d4f5c47698
2019-10-07 17:39:32 -07:00
Rohan Varma
0be6641fbf add function to get nccl version for error messages (#27068)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27068

Adds a function that uses ncclGetVersion from the NCCL API to retrieve the NCCL version and converts it into a readable string; this function is called in NCCL-related error messages to log the NCCL version. Hopefully this will help with debugging NCCL errors.

Test Plan:
1) Modify C10D_NCCL_CHECK in NCCLUtils.hpp to always error by setting ncclResult_t error = ncclSystemError.
2) Force an NCCL error with the script test/simulate_nccl_errors.py:
   - Start the master node: python test/simulate_nccl_errors.py localhost 9124 0 2
   - Start the other node: python test/simulate_nccl_errors.py localhost 9124 1 2
3) On the master node, you should see the following error message with the NCCL version:

```
Traceback (most recent call last):
  File "simulate_nccl_errors.py", line 29, in <module>
    process_group.allreduce(torch.rand(10).cuda(rank)).wait()
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:375, unhandled system error, NCCL version 2.4.8
```

Differential Revision: D17639476

fbshipit-source-id: a2f558ad9e883b6be173cfe758ec56cf140bc1ee
2019-10-04 12:49:45 -07:00
Ilia Cherniavskii
a444054d4b Fix build (#27318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27318

Fix TBB build
USE_TBB=1 ATEN_THREADING=TBB python setup.py develop install --cmake

Test Plan: Imported from OSS

Differential Revision: D17747449

Pulled By: ilia-cher

fbshipit-source-id: 421f362bd10f3be34bffe86ae4f26e8f1c15f1a4
2019-10-03 15:43:06 -07:00
Pieter Noordhuis
2991bfdbe0 Add bitwise distributed reduction ops (#26824)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26824

These ops are named after the bitwise reduction ops in MPI.

This is based on the work done by knottb in #22449.

Closes #22449.
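A hedged usage sketch of the new ops, shown with an integer tensor (support depends on the backend, e.g. gloo):

```
# Hedged sketch: combine per-rank bit flags with the new bitwise reductions.
import torch
import torch.distributed as dist

def union_of_flags(local_flags: torch.Tensor) -> torch.Tensor:
    # local_flags: integer tensor of per-rank bit flags; BOR takes their union.
    dist.all_reduce(local_flags, op=dist.ReduceOp.BOR)
    return local_flags

# dist.ReduceOp.BAND and dist.ReduceOp.BXOR work the same way.
```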

Test Plan: Imported from OSS

Differential Revision: D17600210

Pulled By: pietern

fbshipit-source-id: 44c7041ce01bc5de170a4591c5a696e4f24431ef
2019-09-26 08:09:49 -07:00
Sam Pepose
4bd1da1458 Revert D17473200: [pytorch][distributed] add function to get NCCL version for logging
Test Plan: revert-hammer

Differential Revision:
D17473200

Original commit changeset: 4881ed5221b3

fbshipit-source-id: c5635ce89de1644d2135b657427cbd0c3af83576
2019-09-25 14:53:59 -07:00
Rohan Varma
d9055319d4 add function to get NCCL version for logging (#26583)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26583

Adds a function that uses the nccl api to get the version code. Converts it to a readable version. Will be
used for logging NCCL version in exception messages.

Test Plan: See above

Differential Revision: D17473200

fbshipit-source-id: 4881ed5221b397f2f967262668c2b376b6bf3c64
2019-09-25 11:56:31 -07:00
Rohan Varma
f57ecd5f29 add timeout parameter to connect function in TCPStore (#26554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26554

Previously, in `TCPStore`'s constructor we did not pass in a timeout to
the `connect` function, which thus used the default timeout (-1, so infinite).
But the timeout variable in `TCPStore.cpp` is configurable by the user and set to
300 seconds by default, so we should be passing it into the connect function.

Test Plan: see above.

Differential Revision: D17486779

fbshipit-source-id: 42d38a3b8d492d9e9ff09110990a8e4a3a1292b2
2019-09-24 16:29:52 -07:00
Rohan Varma
efd933dd01 use timeout in connect function to prevent against (#26364)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26364

Per https://github.com/pytorch/pytorch/issues/25769, we sometimes get
an infinite loop when `TCPStore` calls `tcputil::connect`, and the server
continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.
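A simplified, hedged sketch of the behavior in plain Python (the real logic lives in C++ `tcputil::connect`): retry on connection-refused/reset, but give up once the overall timeout elapses.

```
# Hedged sketch: retry a TCP connect with an overall deadline.
import errno
import socket
import time

def connect_with_timeout(host: str, port: int, timeout_s: float) -> socket.socket:
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return socket.create_connection((host, port), timeout=1.0)
        except OSError as exc:
            # errno is None for per-attempt socket timeouts; retry those too.
            retryable = exc.errno in (errno.ECONNREFUSED, errno.ECONNRESET, None)
            if not retryable:
                raise
            if time.monotonic() >= deadline:
                raise RuntimeError("Connecting to TCP store timed out.") from exc
            time.sleep(0.1)  # brief backoff before retrying
```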

Testing: Tested by modifying `TCPStore` to connect to an invalid port, thus getting
`ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:
1) in TCPStore.cpp's constructor, replace the `connect` call with this line:
 `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));`
2) Build the `TCPStoreTest` binary.
3) Run the binary. Expected output:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  Connecting to TCP store timed out.
Aborted (core dumped)
```
ghstack-source-id: 90480086

Test Plan: See above.

Differential Revision: D17430164

fbshipit-source-id: 1482aca72fcc3ddb95ea25649ec057edda5d1934
2019-09-20 10:28:30 -07:00
Edward Yang
9b7011c5c2 Implement multiple dispatch (#26468) (#26501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26501

Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.

XLA companion patch at https://github.com/pytorch/xla/pull/1031

Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core.  There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'.  I think this may be duplicated with some logic somewhere else but I have to double check.

The new generated code looks like this:

```
inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const {
    static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)");
    return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking);
}
```

The key difference is that previously we wrote `type_set()` as argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together.

After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.

* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.

Benchmark:

Apply the following patch to the base commit and this commit:

```
 diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
 --- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+  return self;
+}
+
+}} // namespace at::native
 diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
 --- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
   dispatch:
     CPU: im2col_backward_cpu
     CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+  variants: function
+  dispatch:
+    CPU: _const5
```

Comparisons with timeit:

One-argument, representative case:

Before:

```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):

Before:

```
In [1]: import torch

In [2]: x = torch.zeros(1)

In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D17499154

Pulled By: ezyang

fbshipit-source-id: 8ea237c2e935134b0f4f8d6cfd89c6a93037c02c
2019-09-20 10:12:04 -07:00
Michael Suo
5304358859 Revert D17481256: Implement multiple dispatch
Test Plan: revert-hammer

Differential Revision:
D17481256

Original commit changeset: b3206936b4ca

fbshipit-source-id: a162c42168c17e24b5eaff83a7aae48beef3d2c2
2019-09-19 14:53:40 -07:00
Edward Yang
0705f759a3 Implement multiple dispatch (#26468)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26468

Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.

XLA companion patch at https://github.com/pytorch/xla/pull/1031

Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core.  There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'.  I think this may be duplicated with some logic somewhere else but I have to double check.

The new generated code looks like this:

```
inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const {
    static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)");
    return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking);
}
```

The key difference is that previously we wrote `type_set()` as argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together.

After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.

* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.

Benchmark:

Apply the following patch to the base commit and this commit:

```
 diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
 --- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+  return self;
+}
+
+}} // namespace at::native
 diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
 --- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
   dispatch:
     CPU: im2col_backward_cpu
     CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+  variants: function
+  dispatch:
+    CPU: _const5
```

Comparisons with timeit:

One-argument, representative case:

Before:

```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):

Before:

```
In [1]: import torch

In [2]: x = torch.zeros(1)

In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bddppq

Differential Revision: D17481256

Pulled By: ezyang

fbshipit-source-id: b3206936b4ca8938d45ea90fd71422e0d80b5f96
2019-09-19 14:29:38 -07:00
Junjie Bai
07bd76988e Revert D17265918: Implement multiple dispatch
Test Plan: revert-hammer

Differential Revision:
D17265918

Original commit changeset: 221efe4e86a4

fbshipit-source-id: f0ab90fa1201080e0d62fd140faf0fcdfd56601b
2019-09-19 09:50:17 -07:00
Edward Yang
ece14ff473 Implement multiple dispatch (#25653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25653

Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.

Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core.  There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'.  I think this may be duplicated with some logic somewhere else but I have to double check.

After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.

* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.

Benchmark:

Apply the following patch to the base commit and this commit:

```
 diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
 --- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+  return self;
+}
+
+}} // namespace at::native
 diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
 --- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
   dispatch:
     CPU: im2col_backward_cpu
     CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+  variants: function
+  dispatch:
+    CPU: _const5
```

Comparisons with timeit:

One-argument, representative case:

Before:

```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):

Before:

```
In [1]: import torch

In [2]: x = torch.zeros(1)

In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D17265918

Pulled By: ezyang

fbshipit-source-id: 221efe4e86a40f36abc81e2ebceaa7e251c90b3d
2019-09-19 09:30:40 -07:00
Yanli Zhao
ed09704899 use allgatherv for sparse all reduce (#23917)
Summary:
Per https://github.com/pytorch/pytorch/issues/22226, the current sparse allreduce in ProcessGroupGloo pads the indices and values tensors to the maximum length across all processes and then performs a regular allgather (because they'll have equal size across processes). Instead, we can use allgatherv. This is mostly a win for memory usage if there is severe size imbalance between processes.

close https://github.com/pytorch/pytorch/issues/22226
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23917

Test Plan:
buck run mode/dev-nosan caffe2/test:c10d -- test_c10d.ProcessGroupGlooTest.test_sparse_allreduce_basics

buck run mode/dev-nosan caffe2/test:c10d -- test_c10d.ProcessGroupGlooTest.test_sparse_allreduce_basics_cuda

buck run mode/dev-nosan caffe2/test:c10d -- test_c10d.ProcessGroupGlooTest.test_sparse_allreduce_checks

Differential Revision: D16664985

Pulled By: zhaojuanmao

fbshipit-source-id: e7d3c0770cbc09f9175b3027b527e95053724843
2019-09-18 09:57:45 -07:00
Pieter Noordhuis
f43a2c9c2f Add ProcessGroupGloo::createDefaultDevice (#26166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26166

There were two variants to create a new device: one based on the
name of a network interface, and one based on a hostname or
address. In the latter, if the address was not specified, it would
lookup the local hostname and try to resolve that. If that failed, the
process would crash.

In this default path, we now try to lookup and use the local hostname,
and if that fails we fallback to using the loopback address.

If the local hostname doesn't resolve to an address that we can bind
to, it is very likely that this process won't join other processes
over the network, and that the user is trying to run a local test.

If this assumption is wrong, the user can override the default
interface selection by setting the environment variable
`GLOO_SOCKET_IFNAME` to the name of the external network interface.
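For example, a hedged sketch of that override from Python (the interface name is a placeholder; `env://` initialization additionally requires MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE to be set):

```
# Hedged sketch: force gloo to use a specific network interface instead of the
# default hostname/loopback selection.
import os
import torch.distributed as dist

os.environ["GLOO_SOCKET_IFNAME"] = "eth0"  # placeholder interface name

# env:// also expects MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE env vars.
dist.init_process_group(backend="gloo", init_method="env://")
```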

I tested this by changing the local hostname to a bogus name and
confirmed that default initialization works as expected.

Closes #26049.

Test Plan: Imported from OSS

Differential Revision: D17397898

Pulled By: pietern

fbshipit-source-id: 95a2467761d89df87b520d6e5837b92184b0dc12
2019-09-16 12:00:43 -07:00
Pieter Noordhuis
ebdb32c749 Remove global group name tracking for ProcessGroupNCCL (#25905)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25905

Now that we can detect and recover from failures in NCCL we should
allow processes that are started at different times (and perhaps have
had previous NCCL process group instances), to eventually be part of
the same process group. Keeping track of group names in global
variables prevents that, because the processes will be out of sync.

This commit removes the global group name maps and defers
responsibility of isolating access to the same store from multiple
process groups to the store itself. Users can use `c10d::PrefixStore`
to derive new store instances whose keyspace is scoped to some
prefix. Functionally, this is identical to keeping a global map and
using a group name, but also gives more flexibility to the front-end
API to reset state and have processes that have started at different
times to join the same process group.
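A hedged sketch of the pattern, assuming the Python `PrefixStore`/`HashStore` bindings mirror the C++ behavior described above:

```
# Hedged sketch: scope two groups' keyspaces by wrapping one underlying store
# in differently-prefixed PrefixStore views.
from torch.distributed import HashStore, PrefixStore

base = HashStore()                         # any Store works as the base
store_a = PrefixStore("group_a", base)
store_b = PrefixStore("group_b", base)

store_a.set("rank0", "ready")  # stored under a group_a-prefixed key
store_b.set("rank0", "ready")  # does not collide with group_a's key
```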
ghstack-source-id: 89804865

Test Plan: Tests pass.

Differential Revision: D17281416

fbshipit-source-id: eab3b48463a9b0ef24aedeca76e2bb970b9f33ef
2019-09-11 06:56:33 -07:00
Pieter Noordhuis
929764ac2a Remove superfluous check for POLLIN in TCPStore (#25911)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25911

The check is practically equivalent to checking for equivalence with
POLLIN (because the constant is a single bit and poll(2) is asked to
check for POLLIN). On macOS, if a client disconnects, POLLHUP will be
set as well, and the check fails. Instead of performing the check and
letting it fail, we can simply run the `query` function and catch
exceptions, in case we see EOF.

Test Plan: Imported from OSS

Differential Revision: D17313301

Pulled By: pietern

fbshipit-source-id: 00c5a69043f70848ef632d53f8e046dc69e15650
2019-09-11 02:23:34 -07:00
Pieter Noordhuis
bf4a28175d Retry connecting to TCP store on ECONNRESET (#25707)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25707

The retry logic dealt with ECONNREFUSED to deal with the client being
started before the server. It didn't yet deal with the server being
started but having its listen backlog exhausted. This may happen when
starting many processes that all try to connect at the same time.

The server implementation uses blocking I/O to read and write entire
messages, so it may take a bit longer to call `accept(2)` on new
connections compared to a fully event driven approach.

This commit both increases the default listen backlog on the server
side and implements retries on ECONNRESET after `connect(2)`.

Test Plan: Imported from OSS

Differential Revision: D17226958

Pulled By: pietern

fbshipit-source-id: 877a7758b29286e06039f31b5c900de094aa3100
2019-09-09 02:54:20 -07:00