Commit Graph

122 Commits

Author SHA1 Message Date
Rohan Varma
c9f6e70c09 Refactor DDP uneven inputs control flags (#47394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47394

This is a preliminary refactor for the next diff, which will add an
additional flag to control whether we throw a StopIteration or not. We
move the flags for DDP uneven inputs into a simple class.
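For reference, a minimal sketch of what such a flags class might look like; the class and field names here are hypothetical, not the exact ones added in the PR:

```python
from typing import NamedTuple

class UnevenInputsConfig(NamedTuple):
    # Hypothetical fields: whether join() is active, and whether allreduce
    # results are divided by the initial or the effective world size.
    join_enabled: bool = False
    divide_by_initial_world_size: bool = True
```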
ghstack-source-id: 116428177

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D24739509

fbshipit-source-id: 96bf41bd1c02dd27e68f6f37d08e22f33129b319
2020-11-11 16:51:56 -08:00
Zhicheng Chen
3dd266304c Fix inaccurate note in DistributedDataParallel (#47156)
Summary:
Sorry for my previous inaccurate [PR](https://github.com/pytorch/pytorch/pull/42471#issue-462329192 ).

Here is some toy code to illustrate my point:

* non-DistributedDataParallel version

```python
import torch

if __name__ == "__main__":
    torch.manual_seed(0)
    inp = torch.randn(1,16)
    inp = torch.cat([inp, inp], dim=0)
    model = torch.nn.Linear(16, 2)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()
    loss = loss_func(model(inp), torch.tensor([0, 0]))
    loss.backward()
    opti.step()

    print("grad:", model.weight.grad)
    print("updated weight:\n", model.weight)
```

* DistributedDataParallel version

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    torch.manual_seed(0)
    x = torch.randn(1,16)

    model = torch.nn.Linear(16, 2)
    model = torch.nn.parallel.DistributedDataParallel(model)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()

    y = model(x)

    label = torch.tensor([0])
    loss = loss_func(y, label)

    loss.backward()
    opti.step()

    if rank == 0:
        print("grad:", model.module.weight.grad)
        print("updated weight:\n", model.module.weight)

def init_process(rank, size, fn, backend="gloo"):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 2
    process = []
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run))
        p.start()
        process.append(p)

    for p in process:
        p.join()
```

Both pieces of code produce the same output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47156

Reviewed By: mruberry

Differential Revision: D24675199

Pulled By: mrshenli

fbshipit-source-id: 1238a63350a32a824b4b8c0018dc80454ea502bb
2020-11-09 17:42:57 -08:00
Yi Wang
fccfe7bd1a [Gradient Compression] Add unit tests that test default Python comm hook implementations (#47158)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47158

1. Test the default Python comm hook implementations ALLREDUCE and FP16_COMPRESS, besides an ad-hoc all-reduce implementation.
2. Typo fix.
3. Reformat default_hooks.py.
4. Publish register_comm_hook API for DDP module (This should be done in a separate diff, but got merged unintentionally.)

The new style makes it easy to test any new comm hook, such as PowerSGD.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
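For context, a rough sketch of how a default hook is registered on a DDP model; the module path and hook name follow current PyTorch and are an assumption for the code as of this commit:

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

def wrap_with_fp16_compress(model: nn.Module) -> DDP:
    # Assumes torch.distributed.init_process_group(...) has already run.
    ddp_model = DDP(model)
    # state=None means "use the default process group"; the hook compresses
    # gradients to FP16 for the allreduce and decompresses the result.
    ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
    return ddp_model
```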

ghstack-source-id: 116012600

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D24669639

fbshipit-source-id: 048c87084234edc2398f0ea6f01f2f083a707939
2020-11-06 00:28:09 -08:00
Yi Wang
f91fcefc81 [Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks (#47270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47270

This is almost the same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType should be imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/csrc/distributed/c10d/init.cpp cannot be imported. See https://github.com/pytorch/pytorch/issues/47153

I tried to follow another enum type, ReduceOp, defined in the same file, but that did not work, because that C++ enum class is defined in the torch/lib/c10d library, while BuiltinCommHookType is defined in the torch/csrc/distributed library. These two libraries are compiled in two different ways.

To avoid adding typing to the distributed package, which could be a project of its own, I simply removed the BuiltinCommHookType arg type annotation in this file.
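The guard in question looks roughly like the sketch below; the exact import path of BuiltinCommHookType is an assumption for illustration:

```python
import torch.distributed as dist

# Only import the pybind-defined enum when the distributed package was built;
# otherwise the extension module that defines it does not exist.
# (Import path below is illustrative and may not match the actual file.)
if dist.is_available():
    from torch._C._distributed_c10d import BuiltinCommHookType
```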

To review the diff on top of #46959, compare V1 vs Latest:
https://www.internalfb.com/diff/D24700959?src_version_fbid=270445741055617

Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set, a C++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the built-in comm hooks.

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115783237

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl

//arvr/projects/eye_tracking/Masquerade:python_test

USE_DISTRIBUTED=0 USE_GLOO=0 BUILD_TEST=0 USE_CUDA=1 USE_MKLDNN=0 DEBUG=0 python setup.py install

Reviewed By: mrshenli

Differential Revision: D24700959

fbshipit-source-id: 69f303a48ae275aa856e6e9b50e12ad8602e1c7a
2020-11-03 18:33:50 -08:00
Yi Wang
b1b77148ac Back out "[Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks" (#47234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47234

Revert the diff because of https://github.com/pytorch/pytorch/issues/47153

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115720415

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D24691866

fbshipit-source-id: 58fe0c45943a2ae2a09fe5d5eac4a4d947586539
2020-11-02 20:51:18 -08:00
Yi Wang
ee0033af9b [Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks (#46959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46959

1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set, a C++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the built-in comm hooks.

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115629230

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl

Reviewed By: pritamdamania87

Differential Revision: D24471910

fbshipit-source-id: f96b752298549ea2067e2568189f1b394abcd99a
2020-10-30 23:19:42 -07:00
Rohan Varma
ecdbea77bc Fix DDP documentation (#46861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46861

Noticed that in the DDP documentation:
https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html?highlight=distributeddataparallel
there were some examples using `torch.nn.DistributedDataParallel`; fix these to
read `torch.nn.parallel.DistributedDataParallel`.
ghstack-source-id: 115453703

Test Plan: ci

Reviewed By: pritamdamania87, SciPioneer

Differential Revision: D24534486

fbshipit-source-id: 64b92dc8a55136c23313f7926251fe825a2cb7d5
2020-10-29 09:13:47 -07:00
Rohan Varma
7245d2c939 Avoid scatter for single-device case in DDP (#46304)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46304

In the case where a single process operates on only one GPU, we can
avoid the scatter entirely and replace it with a recursive version of `to`
which transfers the input tensors to the correct device.

The implementation of `_recursive_to` is modeled after `scatter` in https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/scatter_gather.py, in order to keep parity with the previous conventions (i.e. custom types not having their tensors moved).
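A simplified sketch of the recursive move-to-device idea (the real `_recursive_to` also mirrors scatter's handling of namedtuples and custom types, which is omitted here):

```python
import torch

def recursive_to(obj, device):
    # Move tensors, and tensors nested inside lists/tuples/dicts, to `device`.
    # Anything else is returned untouched, matching the convention that custom
    # types do not have their tensors moved.
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    if isinstance(obj, tuple):
        return tuple(recursive_to(o, device) for o in obj)
    if isinstance(obj, list):
        return [recursive_to(o, device) for o in obj]
    if isinstance(obj, dict):
        return {k: recursive_to(v, device) for k, v in obj.items()}
    return obj
```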
ghstack-source-id: 114896677

Test Plan: Added unittest, and CI

Reviewed By: pritamdamania87

Differential Revision: D24296377

fbshipit-source-id: 536242da05ecabfcd36dffe14168b1f2cf58ca1d
2020-10-22 08:29:37 -07:00
Alexander Grund
5b0f400488 Replace list(map(...)) constructs by list comprehensions (#46461)
Summary:
As discussed in https://github.com/pytorch/pytorch/issues/46392 this makes the code more readable and possibly more performant.

It also fixes a bug detected by this where the argument order of `map` was confused: 030a24906e (diff-5bb26bd3a23ee3bb540aeadcc0385df2a4e48de39f87ed9ea76b21990738fe98L1537-R1537)
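For illustration, the kind of rewrite applied throughout (variable names made up):

```python
import torch

tensors = [torch.randn(3), torch.randn(5)]

# Before: list(map(...))
sizes = list(map(lambda t: t.numel(), tensors))

# After: equivalent, more readable list comprehension
sizes = [t.numel() for t in tensors]
```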

Fixes https://github.com/pytorch/pytorch/issues/46392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46461

Reviewed By: ailzhang

Differential Revision: D24367015

Pulled By: ezyang

fbshipit-source-id: d55a67933cc22346b00544c9671f09982ad920e7
2020-10-19 18:42:49 -07:00
Emilio Castillo
d38a71d579 torch.nn.modules.LazyModuleMixin and torch.nn.LazyLinear (Shape Inference II) (#44538)
Summary:
Retake on https://github.com/pytorch/pytorch/issues/40493 after all the feedback from albanD

This PR implements the generic Lazy mechanism and a sample `LazyLinear` layer with the `UninitializedParameter`.

There are two main differences from the previous PR:
Now `torch.nn.Module` remains untouched.
We don't require an explicit initialization or a dummy forward pass before starting training or inference of the actual module, which makes this much simpler to use from the user side.

As we discussed offline, there was a suggestion of not using a mixin, but instead changing the `__class__` attribute of `LazyLinear` to become `Linear` once it's completely initialized. While this can be useful, for the time being we need `LazyLinear` to be a `torch.nn.Module` subclass, since there are many checks that rely on modules being instances of `torch.nn.Module`.
This can cause problems when we create complex modules such as
```python
class MyNetwork(torch.nn.Module):
    def __init__(self):
        super(MyNetwork, self).__init__()
        self.conv = torch.nn.Conv2d(20, 4, 2)
        self.linear = torch.nn.LazyLinear(10)
    def forward(self, x):
        y = self.conv(x).clamp(min=0)
        return self.linear(y)
```
Here, when the `__setattr__` function is called at the time `LazyLinear` is registered, it won't be added to the child modules of `MyNetwork`, so we would have to add it manually later; but currently there is no way to do that, as we can't access the parent module from `LazyLinear` once it becomes the `Linear` module. (We can add a workaround for this if needed.)
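A small usage sketch of the lazy-initialization behaviour (shapes chosen arbitrarily):

```python
import torch

lazy = torch.nn.LazyLinear(out_features=10)  # in_features not known yet
x = torch.randn(4, 16)
y = lazy(x)                                  # first forward materializes the weight
print(lazy.weight.shape)                     # torch.Size([10, 16])
```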

TODO:

Add convolutions once the design is OK
Fix docstrings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44538

Reviewed By: ngimel

Differential Revision: D24162854

Pulled By: albanD

fbshipit-source-id: 6d58dfe5d43bfb05b6ee506e266db3cf4b885f0c
2020-10-19 13:13:54 -07:00
Rohan Varma
181afd5220 Add an option to DDP to take a list of parameters to ignore upfront. (#44826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44826

As described in https://github.com/pytorch/pytorch/issues/43690, there
is a need for DDP to be able to ignore certain parameters in the module (not
install allreduce hooks) for certain use cases. `find_unused_parameters` is
sufficient from a correctness perspective, but we can get better performance
with this upfront list if users know which params are unused, since we won't
have to traverse the autograd graph every iteration.

To enable this, we add a field `parameters_to_ignore` to DDP init and don't
pass a parameter to the reducer if it is in the given list.
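A rough sketch of how the upfront ignore list can be supplied; the static helper name below is how this is exposed in later PyTorch versions and should be treated as an assumption for this PR:

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def build_ddp_with_ignored_params(model: nn.Module, names_to_ignore):
    # names_to_ignore holds fully qualified parameter/buffer names for which
    # DDP should not install allreduce hooks (assumed helper, see above).
    DDP._set_params_and_buffers_to_ignore_for_model(model, names_to_ignore)
    return DDP(model)
```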
ghstack-source-id: 113210109

Test Plan: Added unittest

Reviewed By: xw285cornell, mrshenli

Differential Revision: D23740639

fbshipit-source-id: a0411712a8b0b809b9c9e6da04bef2b955ba5314
2020-09-30 11:52:50 -07:00
Shen Li
c5ade5f698 Fix no_sync docs (#45455)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45455

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23973365

Pulled By: mrshenli

fbshipit-source-id: 87c9878cdc7310754670b83efa65ae6f877f86fb
2020-09-28 20:48:09 -07:00
Shen Li
6967e6295e Fix DDP docs (#45454)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45454

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23973367

Pulled By: mrshenli

fbshipit-source-id: 11f20d51d0d0f92f199e4023f02b86623867bae0
2020-09-28 20:43:22 -07:00
Yanli Zhao
c6500bcf14 [reland] Make grad point to bucket buffer in DDP to save memory usage (#44344)
Summary:
[test all]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44344

reland #41954

Add one argument to the DDP API to enable/disable letting grads point to views. When it is disabled, behavior is the same as DDP today; when it is enabled, make both variable.grad() and the grad in the dist autograd context point to the bucket buffer in DDP to save memory.
In this case, grad will be a view of the bucket buffer tensors; to make this compatible with optimizer.zero_grad(), we
made changes in #41283.

Also note that we cannot make variable.grad() point to the bucket buffer at construction time, because we want to
keep grad undefined for unused parameters.
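In current PyTorch this toggle is the `gradient_as_bucket_view` argument to DDP; whether the argument carried that exact name in this diff is an assumption:

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_with_grad_views(model: nn.Module) -> DDP:
    # Grads become views into DDP's bucket buffers, saving one full copy of the
    # gradients; optimizer.zero_grad() must handle grad views (see #41283).
    return DDP(model, gradient_as_bucket_view=True)
```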
ghstack-source-id: 112845787

Test Plan:
1. When grad_is_view=false:
a. roberta_base, peak memory usage 8250MB, p50 per iteration latency 0.923second, https://www.internalfb.com/intern/fblearner/details/218029699/?notif_channel=cli
b. resnet, peak memory usage 3089MB, p50 per iteration latency 0.120second, https://www.internalfb.com/intern/fblearner/details/218029035/?notif_channel=cli
c. accuracy benchmark, distributed=false, .accuracy 40.914535522461, .loss: 1.6370717287064; distributed=true, .accuracy: 39.966053009033, .loss: 1.6849111318588
https://www.internalfb.com/intern/fblearner/details/218035688/?notif_channel=cli
d. classy vision uru production flow, https://www.internalfb.com/intern/fblearner/details/219065811/?notif_channel=cli
e. pytext flow, https://www.internalfb.com/intern/fblearner/details/219137458/?notif_channel=cli

2. When grad_is_view=true:
a. roberta_base, peak memory usage 7183MB, p50 per iteration latency 0.908second, https://www.internalfb.com/intern/fblearner/details/217882539?tab=operator_details
b. resnet, peak memory usage 2988 MB, p50 per iteration latency 0.119second, https://www.internalfb.com/intern/fblearner/details/218028479/?notif_channel=cli
c. accuracy benchmark, distributed=false, .accuracy 41.713260650635, .loss: 1.69939661026; distributed=true, .accuracy: 39.966053009033, .loss: 1.6849111318588, https://www.internalfb.com/intern/fblearner/details/218037058/?notif_channel=cli
d. classy vision uru production flow, expected, can not work well with apex.amp https://www.internalfb.com/intern/fblearner/details/219205218/?notif_channel=cli
e. pytext flow, detach_() related error, expected, as pytext zero_grad depends on apex repo where detach_() is called. also seeing the warning in finalize_bucket_dense due to tied weights, which is expected. https://www.internalfb.com/intern/fblearner/details/219150229/?notif_channel=cli

Reviewed By: mrshenli

Differential Revision: D23588186

fbshipit-source-id: f724d325b954ef6f06ede31759bf01dd29a6f5e5
2020-09-24 20:54:51 -07:00
Rohan Varma
e57a08119b Add a warning log when there is high skew of uneven inputs in DDP training (#45238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45238

Adds a warning when the discrepancy in the number of inputs across different
processes is much higher than expected when running with uneven inputs.
A skew in the thousands can reduce performance by a nontrivial amount, as
shown in benchmarks, so it was proposed to add this warning. Tested by
running the tests so that the threshold is hit and observing the output.
ghstack-source-id: 112773552

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23719270

fbshipit-source-id: 306264f62c1de65e733696a912bdb6e9376d5622
2020-09-24 09:50:44 -07:00
Bugra Akyildiz
1b059f2c6d Directly use work.result() to retrieve tensor rather than passing as a separate argument (#44914)
Summary:
We currently fetch an allreduced tensor from Python in C++, storing the resulting tensor in a field of a struct. This PR removes the extra tensor parameter from the function signature and fetches the tensor from a single place.

Fixes https://github.com/pytorch/pytorch/issues/43960

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44914

Reviewed By: rohan-varma

Differential Revision: D23798888

Pulled By: bugra

fbshipit-source-id: ad1b8c31c15e3758a57b17218bbb9dc1f61f1577
2020-09-22 06:28:47 -07:00
Yanli Zhao
e14b2080be [reland] move rebuild buckets from end of first iteration to beginning of second iteration (#44798)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44798

[test all]

Update for relanding: in ddp.join(), moved _rebuild_buckets from the end of backward to the beginning of forward as well.

Part of relanding PR #41954, this refactoring moves the rebuild_buckets call from the end of the first iteration to the beginning of the second iteration.
ghstack-source-id: 112279261

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D23735185

fbshipit-source-id: c26e0efeecb3511640120faa1122a2c856cd694e
2020-09-17 17:10:21 -07:00
Ailing Zhang
fb085d90e3 Revert D23583017: move rebuild buckets from end of first iteration to beginning of second iteration
Test Plan: revert-hammer

Differential Revision:
D23583017 (f5d231d593)

Original commit changeset: ef67f79437a8

fbshipit-source-id: fd914b7565aba6a5574a32b31403525abb80ff07
2020-09-15 15:10:52 -07:00
Yanli Zhao
f5d231d593 move rebuild buckets from end of first iteration to beginning of second iteration (#44326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44326

Part of relanding PR #41954, this refactoring moves the rebuild_buckets call from the end of the first iteration to the beginning of the second iteration.
ghstack-source-id: 112011490

Test Plan: unit tests

Reviewed By: mrshenli

Differential Revision: D23583017

fbshipit-source-id: ef67f79437a820d9b5699b651803622418499a83
2020-09-15 09:51:33 -07:00
Yi Wang
ace81b6794 Remove an extra empty line in the warning comments. (#44622)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44622

Remove an extra empty line in the warning comments.

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D23674070

fbshipit-source-id: 4ee570590c66a72fb808e9ee034fb773b833efcd
2020-09-14 11:15:35 -07:00
Rohan Varma
41f62b17e7 Fix DDP join() API in the case of model.no_sync() (#44427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44427

Closes https://github.com/pytorch/pytorch/issues/44425

DDP join API currently does not work properly with `model.no_sync()`, see https://github.com/pytorch/pytorch/issues/44425 for details. This PR fixes the problem via the approach mentioned in the issue, namely scheduling an allreduce that tells joined ranks whether to sync in the backwards pass or not. Tests are added for skipping gradient synchronization for various `sync_interval`s.
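A sketch of the scenario being fixed: `no_sync()` used for gradient accumulation inside the `join()` context (loop structure and `sync_interval` are illustrative):

```python
import contextlib

def train_uneven(ddp_model, optimizer, loader, sync_interval=4):
    with ddp_model.join():
        for step, inp in enumerate(loader):
            sync_now = (step + 1) % sync_interval == 0
            # Skip gradient synchronization except on every sync_interval-th step.
            ctx = contextlib.nullcontext() if sync_now else ddp_model.no_sync()
            with ctx:
                loss = ddp_model(inp).sum()  # toy loss
                loss.backward()
            if sync_now:
                optimizer.step()
                optimizer.zero_grad()
```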
ghstack-source-id: 111786479

Reviewed By: pritamdamania87

Differential Revision: D23609070

fbshipit-source-id: e8716b7881f8eee95e3e3499283e716bd3d7fe76
2020-09-10 18:31:40 -07:00
Rohan Varma
3806c939bd Polish DDP join API docstrings (#43973)
Summary:
Polishes DDP join api docstrings and makes a few minor cosmetic changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43973

Reviewed By: zou3519

Differential Revision: D23467238

Pulled By: rohan-varma

fbshipit-source-id: faf0ee56585fca5cc16f6891ea88032336b3be56
2020-09-03 13:39:45 -07:00
Rohan Varma
4e4626a23d Join-based API to support DDP uneven inputs (#42577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42577

Closes https://github.com/pytorch/pytorch/issues/38174. Implements a join-based API to support training with the DDP module in the scenario where different processes have different no. of inputs. The implementation follows the description in https://github.com/pytorch/pytorch/issues/38174. Details are available in the RFC, but as a summary, we make the following changes:

#### Approach
1) Add a context manager `torch.nn.parallel.distributed.join`
2) In the forward pass, we schedule a "present" allreduce where non-joined processes contribute 1 and joined processes contribute 0. This lets us keep track of joined processes and know when all procs are joined.
3) When a process depletes its input and exits the context manager, it enters "joining" mode and attempts to "shadow" the collective comm. calls made in the model's forward and backward pass. For example, we schedule the same allreduces in the same order as in the backward pass, but with zeros.
4) We adjust the allreduce division logic to divide by the effective world size (no. of non-joined procs) rather than the absolute world size to maintain correctness.
5) At the end of training, the last joined process is selected to be the "authoritative" model copy

We also make some misc. changes such as adding a `rank` argument to `_distributed_broadcast_coalesced` and exposing some getters/setters on `Reducer` to support the above changes.
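For reference, the intended usage pattern as a minimal sketch (using the method form `ddp_model.join()` as exposed in later releases; each rank may have a different number of batches):

```python
with ddp_model.join():
    for inp in per_rank_loader:      # loaders may have different lengths per rank
        loss = ddp_model(inp).sum()  # toy loss
        loss.backward()              # joined ranks shadow these collectives
        optimizer.step()
        optimizer.zero_grad()
```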

#### How is it tested?
We have tests covering the following models/scenarios:
- [x] Simple linear model
- [x] Large convolutional model
- [x] Large model with module buffers that are broadcast in the forward pass (resnet). We verify this with a helper function `will_sync_module_buffers` and ensure this is true for ResNet (due to batchnorm)
- [x] Scenario where a rank calls join() without iterating at all, so without rebuilding buckets (which requires collective comm)
- [x] Model with unused params (with find unused parameters=True)
- [x] Scenarios where different processes iterate for a varying number of different iterations.
- [x] Test consistency in tie-breaking when multiple ranks are the last ones to join
- [x] Test that we divide by the effective world_size (no. of unjoined processes)

#### Performance implications

###### Trunk vs PR patched, 32 GPUs, batch size = 32
P50, forward + backward + optimizer batch latency & total QPS: 0.121 264/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.087 369/s vs 0.087 368/s

###### join(enable=True) vs without join, 32 GPUs, batch size = 32, even inputs
P50, forward + backward + optimizer batch latency & total QPS: 0.120 265/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.088 364/s vs 0.087 368/s

###### join(enable=False) vs without join, 32 GPUs, batch size = 32, even inputs
P50 forward + backward + optimizer batch latency & total QPS: 0.121 264/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.087 368/s vs 0.087 368/s

###### join(enable=True) with uneven inputs (offset = 2000), 32 GPUs, batch size = 32
P50 forward + backward + optimizer batch latency & total QPS: 0.183 174/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.150 213/s vs 0.087 368/s

###### join(enable=True) with uneven inputs ((offset = 2000)), 8 GPUs, batch size = 32
P50 forward + backward + optimizer batch latency & total QPS: 0.104 308/s vs 0.104 308/s
P50 backwards only batch latency & total QPS: 0.070 454/s vs 0.070 459/s

The two uneven-inputs benchmarks above were conducted with 32 GPUs, with 4 GPUs immediately depleting their inputs and entering "join" mode (i.e. not iterating at all) while the other 28 iterated as normal. It looks like there is a pretty significant perf hit for this case when there are uneven inputs and multi-node training. Strangely, with a single node (8 GPUs), this does not reproduce.

#### Limitations
1) This is only implemented for MPSD, not SPMD. Per a discussion with mrshenli we want to encourage the use of MPSD over SPMD for DDP.
2) This does not currently work with SyncBN or custom collective calls made in the model's forward pass. This is because the `join` class only shadows the `broadcast` for buffers in the forward pass, the gradient allreduces in the bwd pass, unused parameters reduction, and (optionally) the rebuild buckets broadcasting in the backwards pass. Supporting this will require additional design thought.
3) Has not been tested with the [DDP comm. hook](https://github.com/pytorch/pytorch/issues/39272) as this feature is still being finalized/in progress. We will add support for this in follow up PRs.
ghstack-source-id: 111033819

Reviewed By: mrshenli

Differential Revision: D22893859

fbshipit-source-id: dd02a7aac6c6cd968db882c62892ee1c48817fbe
2020-08-31 13:29:03 -07:00
Haoran Li
f35e069622 Back out "Make grad point to bucket buffer in DDP to save memory usage" (#43557)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43557

Back out the diff that caused some errors in pytext distributed training

Test Plan: Tested by rayhou who verified reverting the diff works

Differential Revision: D23320238

fbshipit-source-id: caa0fe74404059e336cd95fdb41373f58ecf486e
2020-08-25 18:04:39 -07:00
Yanli Zhao
97d594b9f7 Make grad point to bucket buffer in DDP to save memory usage (#41954)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41954
Make both variable.grad() and the grad in the dist autograd context point to the bucket buffer in DDP to save memory.
In this case, grad will be a view of the bucket buffer tensors; to make this compatible with optimizer.zero_grad(), we
made changes in https://github.com/pytorch/pytorch/pull/41283.

Also note that we cannot make variable.grad() point to the bucket buffer at construction time, because we want to
keep grad undefined for unused parameters.
ghstack-source-id: 110260297

Test Plan:
unit tests,

For roberta_base model with ~1GB parameters, peak memory dropped ~1GB (8250MB-7183MB).  Per iteration latency (0.982s ->0.909s), 8% speed up
https://www.internalfb.com/intern/fblearner/details/211713882?tab=operator_details
https://www.internalfb.com/intern/fblearner/details/211772923?tab=operator_details

For resnet model with ~97M parameters, peak memory dropped ~100MB (3089MB -> 2988MB). Per iteration latency has no change (0.122s -> 0.123s)
https://www.internalfb.com/intern/fblearner/details/211713577?tab=operator_details
https://www.internalfb.com/intern/fblearner/details/211712582?tab=operator_details

accuracy benchmark is expected as well
https://www.internalfb.com/intern/fblearner/details/213237067?tab=Outputs

Reviewed By: mrshenli

Differential Revision: D22707857

fbshipit-source-id: b5e767cfb34ccb3d067db2735482a86d59aea7a4
2020-08-20 15:33:44 -07:00
Sinan Nasir
6e1127ea3f [NCCL] Changed FutureNCCL's then callback logic for better efficiency. (#42869)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42869

We realized that when we invoke a simple callback that divides the tensors by `world_size` after `allreduce`, the performance was almost 50% lower in terms of QPS compared to the case where a simple `allreduce` hook is used with no `then` callback.

The main problem was that, because we call `work.wait()` before invoking the `then` callback, we were synchronizing `work`'s stream with the default PyTorch stream inside [`runHook`](https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp#L609) and stalling the backward computation.

In this PR, we ensure that FutureNCCL's `then` callback does not stall the backward computation. Assuming single-process single-device, `FutureNCCL` gets a new stream from the device's pool using `at::cuda::getStreamFromPool` to run the `callback`, and before invoking the `callback` inline it synchronizes `WorkNCCL`'s stream with the callback's stream rather than the default stream.
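The shape of the callback under discussion, as a hedged sketch; the bucket accessor and return convention follow the current comm-hook API and may differ from this diff's exact interface:

```python
import torch.distributed as dist

def allreduce_then_average_hook(process_group, bucket):
    group = process_group if process_group is not None else dist.group.WORLD
    tensor = bucket.buffer()  # assumed accessor for the flat bucket tensor
    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()

    def average(fut):
        # With this change, the callback runs on a side stream picked by
        # FutureNCCL instead of blocking the default stream.
        return fut.value()[0] / dist.get_world_size(group=group)

    return fut.then(average)
```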

ghstack-source-id: 110208431

Test Plan: Run performance benchmark tests to validate performance issue is resolved. Also, `python test/distributed/test_c10d.py` to avoid any odd issues.

Reviewed By: pritamdamania87

Differential Revision: D23055807

fbshipit-source-id: 60e50993f1ed97497514eac5cb1018579ed2a4c5
2020-08-19 19:42:22 -07:00
Sinan Nasir
752f433a24 DDP communication hook: skip dividing grads by world_size if hook registered. (#42400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42400

mcarilli spotted that in the original DDP communication hook design described in [39272](https://github.com/pytorch/pytorch/issues/39272), the hooks receive grads that are already predivided by world size.

It makes sense to skip the divide completely if a hook is registered. The hook is meant for the user to completely override DDP communication. For example, if the user would like to implement something like GossipGrad, always dividing by the world_size would not be a good idea.

We also included a warning in the register_comm_hook API as:
> GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce.
ghstack-source-id: 109548696

**Update:** We discovered and fixed a bug with the sparse tensors case. See new unit test called `test_ddp_comm_hook_sparse_gradients` and changes in `reducer.cpp`.

Test Plan: python test/distributed/test_c10d.py and perf benchmark tests.

Reviewed By: ezyang

Differential Revision: D22883905

fbshipit-source-id: 3277323fe9bd7eb6e638b7ef0535cab1fc72f89e
2020-08-10 13:55:42 -07:00
Sinan Nasir
0a804be47d [NCCL] DDP communication hook: getFuture() without cudaStreamAddCallback (#42335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42335

**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](https://github.com/pytorch/pytorch/pull/41596).
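A minimal sketch of what the new API enables (NCCL backend assumed, since it is the only one supporting `getFuture()` in this diff):

```python
import torch
import torch.distributed as dist

def async_allreduce(tensor: torch.Tensor) -> torch.Tensor:
    # Launch an async allreduce and obtain a future tied to its completion.
    work = dist.all_reduce(tensor, async_op=True)
    fut = work.get_future()  # raises on backends without support
    fut.wait()               # or chain work with fut.then(...)
    return tensor
```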

ghstack-source-id: 109461507

Test Plan:
```(pytorch) [sinannasir@devgpu017.ash6 ~/local/pytorch] python test/distributed/test_c10d.py
Couldn't download test skip set, leaving all tests enabled...
..............................s.....................................................s................................
----------------------------------------------------------------------
Ran 117 tests in 298.042s

OK (skipped=2)
```
### Facebook Internal:
2\. HPC PT trainer run to validate no regression. Check the QPS number:
**Master:** QPS after 1000 iters: around ~34100
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"testvideo_master" --trainers 16 --trainer-version 1c53912
```
```
[0] I0806 142048.682 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963963 0.950479 0.953704], lifetime NE: [0.963963 0.950479 0.953704], loss: [0.243456 0.235225 0.248375], QPS: 34199
```
[detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtestvideo_mastwarm.trainer.trainer%2F0&ta_tab=logs)

**getFuture/new design:** QPS after 1000 iters: around ~34030
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"testvideo_getFutureCyclicFix" --trainers 16 --trainer-version 8553aee
```
```
[0] I0806 160149.197 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963959 0.950477 0.953704], lifetime NE: [0.963959 0.950477 0.953704], loss: [0.243456 0.235225 0.248375], QPS: 34018
```
[detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtestvideo_getFutureCyclicFix.trainer.trainer%2F0&ta_tab=logs)
**getFuture/new design Run 2:** QPS after 1000 iters: around ~34200
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"test2video_getFutureCyclicFix" --trainers 16 --trainer-version 8553aee
```
```
[0] I0806 160444.650 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963963 0.950482 0.953706], lifetime NE: [0.963963 0.950482 0.953706], loss: [0.243456 0.235225 0.248375], QPS: 34201
```
[detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtest2video_getFutureCyclicFix.trainer.trainer%2F0&ta_tab=logs)
**getFuture/old design (Regression):** QPS after 1000 iters: around ~31150
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER”testvideo_OLDgetFutureD22583690 (d904ea5972)" --trainers 16 --trainer-version 1cb5cbb
```
```
priv3_global/mast_hpc/hpc.sinannasirtestvideo_OLDgetFutureD22583690 (d904ea5972).trainer.trainer/0 [0] I0805 101320.407 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963964 0.950482 0.953703], lifetime NE: [0.963964 0.950482 0.953703], loss: [0.243456 0.235225 0.248375], QPS: 31159
```
3\. `flow-cli` tests; roberta_base; world_size=4:
**Master:** f210039922
```
total:
  32 GPUs -- 32 GPUs: p25:  0.908    35/s  p50:  1.002    31/s  p75:  1.035    30/s  p90:  1.051    30/s  p95:  1.063    30/s
forward:
  32 GPUs -- 32 GPUs: p25:  0.071   452/s  p50:  0.071   449/s  p75:  0.072   446/s  p90:  0.072   445/s  p95:  0.072   444/s
backward:
  32 GPUs -- 32 GPUs: p25:  0.821    38/s  p50:  0.915    34/s  p75:  0.948    33/s  p90:  0.964    33/s  p95:  0.976    32/s
optimizer:
  32 GPUs -- 32 GPUs: p25:  0.016  2037/s  p50:  0.016  2035/s  p75:  0.016  2027/s  p90:  0.016  2019/s  p95:  0.016  2017/s
```
**getFuture new design:** f210285797
```
total:
  32 GPUs -- 32 GPUs: p25:  0.952    33/s  p50:  1.031    31/s  p75:  1.046    30/s  p90:  1.055    30/s  p95:  1.070    29/s
forward:
  32 GPUs -- 32 GPUs: p25:  0.071   449/s  p50:  0.072   446/s  p75:  0.072   445/s  p90:  0.072   444/s  p95:  0.072   443/s
backward:
  32 GPUs -- 32 GPUs: p25:  0.865    37/s  p50:  0.943    33/s  p75:  0.958    33/s  p90:  0.968    33/s  p95:  0.982    32/s
optimizer:
  32 GPUs -- 32 GPUs: p25:  0.016  2037/s  p50:  0.016  2033/s  p75:  0.016  2022/s  p90:  0.016  2018/s  p95:  0.016  2017/s

```

Reviewed By: ezyang

Differential Revision: D22833298

fbshipit-source-id: 1bb268d3b00335b42ee235c112f93ebe2f25b208
2020-08-07 18:48:35 -07:00
Nikita Shulga
56fc7d0345 Fix doc build (#42559)
Summary:
Add space between double back quotes and left curly bracket

Otherwise doc generation failed with `Inline literal start-string without end-string.`

This regression was introduced by b56db305cf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42559

Reviewed By: glaringlee

Differential Revision: D22931527

Pulled By: malfet

fbshipit-source-id: 11c04a92dbba48592505f704d77222cf92a81055
2020-08-04 15:15:15 -07:00
Zhicheng Chen
b56db305cf Improve the documentation of DistributedDataParallel (#42471)
Summary:
Fixes #{issue number}

The statement that 'gradients from each node are averaged' in the documentation of DistributedDataParallel is not clear. Many people, including me, had a totally wrong understanding of this part. I added a note to the documentation to make it more straightforward and more user friendly.

Here is some toy code to illustrate my point:

* non-DistributedDataParallel version
    ```python
    import torch
    import torch.nn as nn

    x = torch.tensor([-1, 2, -3, 4], dtype=torch.float).view(-1, 1)
    print("input:", x)

    model = nn.Linear(in_features=1, out_features=1, bias=False)
    model.weight.data.zero_()
    model.weight.data.add_(1.0)

    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()

    y = model(x)

    label = torch.zeros(4, 1, dtype=torch.float)
    loss = torch.sum((y - label)**2)

    loss.backward()
    opti.step()

    print("grad:", model.weight.grad)
    print("updated weight:\n", model.weight)

    # OUTPUT
    # $ python test.py
    # input: tensor([[-1.],
    #         [ 2.],
    #         [-3.],
    #         [ 4.]])
    # grad: tensor([[60.]])
    # updated weight:
    #  Parameter containing:
    # tensor([[0.9400]], requires_grad=True)
    ```

* DistributedDataParallel version
    ```python
    import os
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    from torch.multiprocessing import Process

    def run(rank, size):
        x = torch.tensor([-(1 + 2 * rank), 2 + 2 * rank], dtype=torch.float).view(-1, 1)
        print("input:", x)

        model = nn.Linear(in_features=1, out_features=1, bias=False)
        model.weight.data.zero_()
        model.weight.data.add_(1.0)
        model = torch.nn.parallel.DistributedDataParallel(model)

        opti = torch.optim.SGD(model.parameters(), lr=0.001)
        opti.zero_grad()

        y = model(x)

        label = torch.zeros(2, 1, dtype=torch.float)
        loss = torch.sum((y.view(-1, 1) - label)**2)

        loss.backward()
        opti.step()

        if rank == 0:
            print("grad:", model.module.weight.grad)
            print("updated weight:\n", model.module.weight)

    def init_process(rank, size, fn, backend="gloo"):
        os.environ['MASTER_ADDR'] = '127.0.0.1'
        os.environ['MASTER_PORT'] = '29500'
        dist.init_process_group(backend, rank=rank, world_size=size)
        fn(rank, size)

    if __name__ == "__main__":
        size = 2
        process = []
        for rank in range(size):
            p = Process(target=init_process, args=(rank, size, run))
            p.start()
            process.append(p)

        for p in process:
            p.join()

    # OUTPUT
    # $ python test_d.py
    # input: tensor([[-3.],
    #         [ 4.]])input: tensor([[-1.],
    #         [ 2.]])

    # grad: tensor([[30.]])
    # updated weight:
    #  Parameter containing:
    # tensor([[0.9700]], requires_grad=True)
    ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42471

Reviewed By: glaringlee

Differential Revision: D22923340

Pulled By: mrshenli

fbshipit-source-id: 40b8c8ba63a243f857cd5976badbf7377253ba82
2020-08-04 08:36:42 -07:00
Yanli Zhao
79cfd85987 grad detach_ only when it has grad_fn in zero_grad call (#41283)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41283

In optimizer.zero_grad(), detach_ is useful to avoid a memory leak only when the grad has a grad_fn, so add a check so that grad.detach_ is called only when the grad has a grad_fn.
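In effect, the zero_grad loop becomes something like this sketch (simplified; the exact optimizer code may differ):

```python
def zero_grad(params):
    for p in params:
        if p.grad is not None:
            if p.grad.grad_fn is not None:
                # Only detach when the grad is attached to a graph; this is the
                # case where detach_ prevents the memory leak mentioned above.
                p.grad.detach_()
            else:
                p.grad.requires_grad_(False)
            p.grad.zero_()
```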
ghstack-source-id: 108702289

Test Plan: unit test

Reviewed By: mrshenli

Differential Revision: D22487315

fbshipit-source-id: 861909b15c8497f1da57f092d8963d4920c85e38
2020-07-29 11:40:13 -07:00
Jongsoo Park
73ff252913 Back out "[NCCL] DDP communication hook: getFuture()" (#42152)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42152

Original commit changeset: 8c059745261d

Test Plan: .

Reviewed By: ajtulloch, jianyuh

Differential Revision: D22786183

fbshipit-source-id: 51155389d37dc82ccb4d2fa20d350f9d14abeaca
2020-07-28 10:05:35 -07:00
Shen Li
c76fada4a8 Let DDP.train() return self to stay consistent with nn.Module (#42131)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42131

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D22775311

Pulled By: mrshenli

fbshipit-source-id: ac9e6cf8b2381036a2b6064bd029dca361a81777
2020-07-27 18:22:13 -07:00
Sinan Nasir
d904ea5972 [NCCL] DDP communication hook: getFuture() (#41596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41596

We've modified the previous design of `convert_dist_work_to_future` API in the GH Issue [#39272](https://github.com/pytorch/pytorch/issues/39272).

1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object.
2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`.
3. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.
4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function.

`cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr).
ghstack-source-id: 108409748

Test Plan:
Run old  python test/distributed/test_c10d.py.
Some additional tests:
`test_ddp_comm_hook_allreduce_hook_nccl`: This unit test verifies whether a DDP communication hook that just calls allreduce gives the same result as the case where no hook is registered. Without the then callback, the future_value in reducer is no longer a PyObject, and this unit test verifies future_value is properly checked.
`test_ddp_comm_hook_allreduce_then_mult_ten_hook_nccl`: This unit test verifies whether a DDP communication hook that calls allreduce and then multiplies the result by ten gives the expected result.

As of v10:
```
........................s.....s.....................................................s...............................
----------------------------------------------------------------------
Ran 116 tests

OK (skipped=3)
```
`flow-cli` performance validation using a stacked diff where `bucket.work` is completely replaced with `bucket.future_work` in `reducer`. See PR [#41840](https://github.com/pytorch/pytorch/pull/41840) [D22660198](https://www.internalfb.com/intern/diff/D22660198/).

Reviewed By: izdeby

Differential Revision: D22583690

fbshipit-source-id: 8c059745261d68d543eaf21a5700e64826e8d94a
2020-07-24 11:22:44 -07:00
Sinan Nasir
d5ae4a07ef DDP Communication Hook Main Structure (#40848)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40848

Sub-tasks 1 and 2 of [39272](https://github.com/pytorch/pytorch/issues/39272)
ghstack-source-id: 107787878

Test Plan:
1\. Perf tests to validate that the new code (`if` conditions before `allreduce`) doesn't slow down today's DDP. Execute the following command with the diff patched/unpatched (with V25):

* **Unpatched Runs:**
```
hg checkout D22514243
flow-cli canary pytorch.benchmark.main.workflow --parameters-json '{"model_arch": "resnet50", "batch_size": 32, "world_size": 1, "use_fp16": false, "print_percentile": true, "backend": "gloo"}' --entitlement pytorch_ftw_gpu --name test_torchelastic_gloo_masterD22514243 --run-as-secure-group pytorch_distributed
```
* **Run 1 (unpatched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 2 mins 59 s
f204539235
```
sum:
8 GPUs: p25:  0.156   205/s  p50:  0.160   200/s  p75:  0.164   194/s  p90:  0.169   189/s  p95:  0.173   185/s
fwds:
8 GPUs: p25:  0.032  1011/s  p50:  0.032  1006/s  p75:  0.032  1000/s  p90:  0.032   992/s  p95:  0.033   984/s
bwds:
8 GPUs: p25:  0.121   265/s  p50:  0.125   256/s  p75:  0.129   248/s  p90:  0.134   239/s  p95:  0.137   232/s
opts:
8 GPUs: p25:  0.003  11840/s  p50:  0.003  11550/s  p75:  0.004  8037/s  p90:  0.006  5633/s  p95:  0.007  4631/s
```
* **Run 2 (unpatched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 3 mins 1 s
f204683840
```
sum:
8 GPUs: p25:  0.145   220/s  p50:  0.147   217/s  p75:  0.150   213/s  p90:  0.154   207/s  p95:  0.157   204/s
fwds:
8 GPUs: p25:  0.032  1015/s  p50:  0.032  1009/s  p75:  0.032  1002/s  p90:  0.032   994/s  p95:  0.032   990/s
bwds:
8 GPUs: p25:  0.107   297/s  p50:  0.111   288/s  p75:  0.115   278/s  p90:  0.119   268/s  p95:  0.122   262/s
opts:
8 GPUs: p25:  0.003  11719/s  p50:  0.004  9026/s  p75:  0.006  5160/s  p90:  0.009  3700/s  p95:  0.010  3184/s
```

* **Patched Runs:**
```
hg checkout D22328310
flow-cli canary pytorch.benchmark.main.workflow --parameters-json '{"model_arch": "resnet50", "batch_size": 32, "world_size": 1, "use_fp16": false, "print_percentile": true, "backend": "gloo"}' --entitlement pytorch_ftw_gpu --name test_torchelastic_gloo_localD22328310 --run-as-secure-group pytorch_distributed
```
* **Run 1 (patched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 3 mins 30 s
f204544541
```
sum:
8 GPUs: p25:  0.148   216/s  p50:  0.152   210/s  p75:  0.156   205/s  p90:  0.160   200/s  p95:  0.163   196/s
fwds:
8 GPUs: p25:  0.032  1011/s  p50:  0.032  1005/s  p75:  0.032   999/s  p90:  0.032   991/s  p95:  0.033   984/s
bwds:
8 GPUs: p25:  0.112   286/s  p50:  0.116   275/s  p75:  0.120   265/s  p90:  0.125   256/s  p95:  0.128   250/s
opts:
8 GPUs: p25:  0.003  11823/s  p50:  0.003  10948/s  p75:  0.004  7225/s  p90:  0.007  4905/s  p95:  0.008  3873/s
```
* **Run 2 (patched):** `elastic_gang:benchmark_single.elastic_operator`
Ran for 3 mins 14 s
f204684520
```
sum:
8 GPUs: p25:  0.146   219/s  p50:  0.147   217/s  p75:  0.150   214/s  p90:  0.152   210/s  p95:  0.153   208/s
fwds:
8 GPUs: p25:  0.032  1013/s  p50:  0.032  1008/s  p75:  0.032  1002/s  p90:  0.032   996/s  p95:  0.032   990/s
bwds:
8 GPUs: p25:  0.107   299/s  p50:  0.110   290/s  p75:  0.114   280/s  p90:  0.117   274/s  p95:  0.119   269/s
opts:
8 GPUs: p25:  0.003  11057/s  p50:  0.005  6490/s  p75:  0.008  4110/s  p90:  0.010  3309/s  p95:  0.010  3103/s
```
* **Run 3 (patched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 2 mins 54 s
f204692872
```
sum:
8 GPUs: p25:  0.145   220/s  p50:  0.147   217/s  p75:  0.150   213/s  p90:  0.154   207/s  p95:  0.156   204/s
fwds:
8 GPUs: p25:  0.032  1001/s  p50:  0.032   995/s  p75:  0.032   988/s  p90:  0.033   980/s  p95:  0.033   973/s
bwds:
8 GPUs: p25:  0.108   295/s  p50:  0.111   287/s  p75:  0.114   280/s  p90:  0.119   269/s  p95:  0.121   264/s
opts:
8 GPUs: p25:  0.003  11706/s  p50:  0.003  9257/s  p75:  0.005  6333/s  p90:  0.008  4242/s  p95:  0.009  3554/s
```

* **Memory:**
   * Unpatched:
```
CUDA Memory Summary After                     first iteration: |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| Active memory         |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    3490 MB |    3490 MB |    3490 MB |       0 B  |
|       from large pool |    3434 MB |    3434 MB |    3434 MB |       0 B  |
|       from small pool |      56 MB |      56 MB |      56 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |  315332 KB |  343472 KB |    2295 MB |    1987 MB |
|       from large pool |  311166 KB |  340158 KB |    2239 MB |    1935 MB |
|       from small pool |    4166 KB |    4334 KB |      56 MB |      52 MB |
|---------------------------------------------------------------------------|
| Allocations           |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| Active allocs         |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| GPU reserved segments |     102    |     102    |     102    |       0    |
|       from large pool |      74    |      74    |      74    |       0    |
|       from small pool |      28    |      28    |      28    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      34    |      54    |     430    |     396    |
|       from large pool |      15    |      48    |     208    |     193    |
|       from small pool |      19    |      19    |     222    |     203    |
|===========================================================================|

```
   * Patched:
```
CUDA Memory Summary After                     first iteration: |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| Active memory         |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    3490 MB |    3490 MB |    3490 MB |       0 B  |
|       from large pool |    3434 MB |    3434 MB |    3434 MB |       0 B  |
|       from small pool |      56 MB |      56 MB |      56 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |  315332 KB |  343472 KB |    2295 MB |    1987 MB |
|       from large pool |  311166 KB |  340158 KB |    2239 MB |    1935 MB |
|       from small pool |    4166 KB |    4334 KB |      56 MB |      52 MB |
|---------------------------------------------------------------------------|
| Allocations           |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| Active allocs         |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| GPU reserved segments |     102    |     102    |     102    |       0    |
|       from large pool |      74    |      74    |      74    |       0    |
|       from small pool |      28    |      28    |      28    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      34    |      54    |     431    |     397    |
|       from large pool |      15    |      48    |     208    |     193    |
|       from small pool |      19    |      19    |     223    |     204    |
|===========================================================================|

```

2\. As of v18: `python test/distributed/test_c10d.py`
```
....................s.....s.....................................................s................................
----------------------------------------------------------------------
Ran 114 tests in 215.983s

OK (skipped=3)

```

3\. Additional tests in `python test/distributed/test_c10d.py`:
* `test_ddp_comm_hook_future_passing_cpu`: This unit test verifies whether the Future object is passed properly. The callback function creates a Future object and sets a value to it.
* `_test_ddp_comm_hook_future_passing_gpu`: This unit test verifies whether the Future object is passed properly. The callback function creates a Future object and sets a value to it.
* `test_ddp_comm_hook_future_passing_gpu_gloo`: This unit test executes _test_ddp_comm_hook_future_passing_gpu using gloo backend.
* `test_ddp_comm_hook_future_passing_gpu_nccl`: This unit test executes _test_ddp_comm_hook_future_passing_gpu using nccl backend.
* `test_ddp_invalid_comm_hook_init`: This unit test makes sure that register_comm_hook properly checks the format of hook defined by user. The Python hook must be callable. This test also checks whether bucket annotation checked properly if defined.
* `test_ddp_invalid_comm_hook_return_type`: This test checks whether return annotation checked properly if defined. It also checks whether an internal error is thrown if return type is incorrect and user hasn't specified any return type annotation.
* `test_ddp_comm_hook_register_just_once`: DDP communication hook can only be registered once. This test validates whether the error is thrown properly when register_comm_hook is called more than once.

Reviewed By: ezyang

Differential Revision: D22328310

fbshipit-source-id: 77a6a71808e7b6e947795cb3fcc68c8c8f024549
2020-07-15 11:25:29 -07:00
Yi Huang (PyTorch)
4196605776 helper function to print out all DDP-relevant env vars (#41297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41297

GH issue: https://github.com/pytorch/pytorch/issues/40105

Add a helper function to DDP to print out all relevant env vars for debugging
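A sketch of such a helper, printing the variables shown in the sample output below (the function and list names here are illustrative, not the actual internal names):

```python
import os

DDP_RELEVANT_ENV_VARS = [
    "RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_PORT", "MASTER_ADDR",
    "CUDA_VISIBLE_DEVICES", "GLOO_SOCKET_IFNAME", "GLOO_DEVICE_TRANSPORT",
    "NCCL_SOCKET_IFNAME", "NCCL_BLOCKING_WAIT",
]

def dump_ddp_relevant_env_vars():
    # Print each variable as "env:NAME=value", or N/A if unset.
    for name in DDP_RELEVANT_ENV_VARS:
        print(f"env:{name}={os.environ.get(name, 'N/A')}")
```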

Test Plan:
test through unittest, example output:
 ---
env:RANK=3
env:LOCAL_RANK=N/A
env:WORLD_SIZE=N/A
env:MASTER_PORT=N/A
env:MASTER_ADDR=N/A
env:CUDA_VISIBLE_DEVICES=N/A
env:GLOO_SOCKET_IFNAME=N/A
env:GLOO_DEVICE_TRANSPORT=N/A
env:NCCL_SOCKET_IFNAME=N/A
env:NCCL_BLOCKING_WAIT=N/A
...
 ---

Reviewed By: mrshenli

Differential Revision: D22490486

fbshipit-source-id: 5dc7d2a18111e5a5a12a1b724d90eda5d35acd1c
2020-07-13 14:03:04 -07:00
Shen Li
0edbe6b063 Add a link in RPC doc page to point to PT Distributed overview (#41108)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41108

Test Plan: Imported from OSS

Differential Revision: D22440751

Pulled By: mrshenli

fbshipit-source-id: 9e7b002091a3161ae385fdfcc26484ae8fc243bb
2020-07-08 14:00:05 -07:00
chengjun
8d570bc708 Decouple DataParallel/DistributedDataParallel from CUDA (#38454)
Summary:
Decouple DataParallel/DistributedDataParallel from CUDA to support more device types.
- Move torch/cuda/comm.py to torch/nn/parallel/comm.py with minor changes for common device support. torch.cuda.comm is kept as-is for backward compatibility
- Provide common APIs for arbitrary device types without changing existing CUDA APIs in the torch.cuda space.
- Replace the torch.cuda calls in DataParallel/DistributedDataParallel with the new APIs.

Related RFC: [https://github.com/pytorch/pytorch/issues/36160](https://github.com/pytorch/pytorch/issues/36160)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38454

Differential Revision: D22051557

Pulled By: mrshenli

fbshipit-source-id: 7842dad0e5d3ca0f6fb760bda49182dcf6653af8
2020-07-07 12:48:16 -07:00
Sinan Nasir
15864d1703 Skip allreducing local_used_maps_dev_ when find_unused_param=False
Summary:
1. In reducer.cpp, we have a new boolean `find_unused_param_` and its value is set in `Reducer::prepare_for_backward`.
If `!find_unused_param_`, then it avoids `allreduce(local_used_maps_dev_)`.
2. Solves issue [38942](https://github.com/pytorch/pytorch/issues/38942).
3. Fixes incorrect handling of `find_unused_parameters_`, which was previously inferred from checks like `outputs.empty()` or `unused_parameters_.empty()` (see the usage sketch after this list).
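
A usage sketch of the flag this change optimizes for (assumes the default process group is already initialized and one GPU per process):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

model = nn.Linear(16, 2).cuda()

# find_unused_parameters=False (the default): every parameter is expected to
# receive a gradient each iteration, so the reducer can skip the extra
# allreduce of the locally-used-parameters map described above.
ddp_model = DDP(model, device_ids=[0])

# find_unused_parameters=True is only needed when some parameters may not get
# gradients in a given iteration, at the cost of that extra allreduce.
# ddp_model = DDP(model, device_ids=[0], find_unused_parameters=True)
```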

ghstack-source-id: 106693089

Test Plan:
1. Run `test/distributed/test_c10d.py` and make sure all tests pass.
2. A new test case `test_find_unused_parameters_when_unused_parameters_empty` is included. The old `reducer.cpp` failed that unit test because it checked `find_unused_parameters_` via `unused_parameters_.empty()`; the current `reducer.cpp` passes it.
3. Two test cases, `test_forward_backward_unused_parameters` and `test_forward_backward_optimizer`, were failing because `find_unused_parameter_` of their `reducer` object was not set properly. I fixed that as well.

Imported from OSS

**Output of version 14:**
```
................s.....s...............................................test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
.test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
.....s...............................
----------------------------------------------------------------------
Ran 108 tests in 214.210s

OK (skipped=3)
```

Differential Revision: D22176231

fbshipit-source-id: b5d15f034e13a0915a474737779cc5aa8e068836
2020-06-26 19:20:59 -07:00
Michael Carilli
8066fba226 [RELAND2] Change AccumulateGrad to yield .grads that match weights' memory layout (#40358)
Summary:
https://github.com/pytorch/pytorch/pull/40129 fixed the error responsible for the first revert, but exposed another error in the same test.

This PR is intended as the "master copy" for merge, and it runs on full CI.
Two other PRs (restricted to run on a small subset of CI) support debugging DDP failures/hangs with multiple devices per process (`test_c10d.py:DistributedDataParallelTest.test_grad_layout_1devicemodule_2replicaperprocess`):
- https://github.com/pytorch/pytorch/pull/40290 tries the test with purely rowmajor contiguous params on an untouched master.  In other words https://github.com/pytorch/pytorch/pull/40290 contains none of this PR's diffs aside from the test itself.
- https://github.com/pytorch/pytorch/pull/40178, for comparison, tries the test with this PR's diffs.

Both fail the same way, indicating failure is unrelated to this PR's other diffs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40358

Differential Revision: D22165785

Pulled By: albanD

fbshipit-source-id: ac7cdd79af5c080ab74341671392dca8e717554e
2020-06-22 17:13:21 -07:00
Shen Li
30364f0b01 Remove obsolete warning message from DDP (#40190)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40190

Fixed by #36503

Test Plan: Imported from OSS

Differential Revision: D22101516

Pulled By: mrshenli

fbshipit-source-id: 9abd6dce602530c11b7fe623ac0f4d556dccc961
2020-06-17 17:58:21 -07:00
Alban Desmaison
08227fea4f Revert D22079377: [pytorch][PR] [RELAND] Change AccumulateGrad to yield .grads that match weights' memory layout
Test Plan: revert-hammer

Differential Revision:
D22079377

Original commit changeset: 9bd2b7e0c34f

fbshipit-source-id: c22cc349d790caa574eace0d63980854c33e5a59
2020-06-17 10:17:27 -07:00
Michael Carilli
1ec8ece2b9 [RELAND] Change AccumulateGrad to yield .grads that match weights' memory layout (#40129)
Summary:
https://github.com/pytorch/pytorch/pull/34904 was reverted because it had a misconfigured 4 GPU test that for some reason wasn't caught by external CI ([example failure](https://app.circleci.com/pipelines/github/pytorch/pytorch/181719/workflows/cfb37cd9-9a0c-4738-898b-d683934cd308/jobs/5868948/steps)).

This PR reverts the revert, and adds diffs that should repair the misconfigured test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40129

Differential Revision: D22079377

Pulled By: albanD

fbshipit-source-id: 9bd2b7e0c34fdaf887497b52037cfe82cba709c1
2020-06-17 09:02:54 -07:00
Pritam Damania
15823ac6d5 Enhance DDP docstrings for DDP + RPC support. (#39916)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39916

ghstack-source-id: 105999275

Test Plan: waitforbuildbot

Differential Revision: D22013190

fbshipit-source-id: be3bb12b2281579610581b809c822ab6b027fa71
2020-06-16 20:05:13 -07:00
Alban Desmaison
f1e575a0bf Revert D20496044: [pytorch][PR] Change AccumulateGrad to yield .grads that match weights' memory layout
Test Plan: revert-hammer

Differential Revision:
D20496044

Original commit changeset: 248d680f4b1b

fbshipit-source-id: 6462b25e3fb9c8596c1da443389089f09c32df4d
2020-06-16 10:38:40 -07:00
Michael Carilli
2beb9690c3 Change AccumulateGrad to yield .grads that match weights' memory layout (#34904)
Summary:
Currently, whether `AccumulateGrad`  [steals](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L42)) or [clones](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L80)) an incoming gradient, the gradient ends up rowmajor contiguous, regardless of its param's layout.  If the param's layout is channels last, or otherwise not rowmajor contiguous, later kernels that apply gradients to params are forced into an uncoalesced memory access pattern for either the param or the gradient.  This may not sound like a big deal but for any binary op on large tensors it's a >3X increase in gmem traffic => 3X slowdown.

The present PR changes `AccumulateGrad` to prefer, where possible, stashing gradients that match their params' layouts (["Gradient Layout Contract"](https://github.com/pytorch/pytorch/pull/34904/files#diff-ef1a56d24f66b280dcdb401502d6a796R29-R38)).

Allowing `AccumulateGrad` to stash non-rowmajor-contiguous grads means DDP allreduces and DP reduces must allow non-rowmajor-contiguous grads.  This PR extends DDP and DP to allow gradients with non-rowmajor-contiguous strides as long as their layout is nonoverlapping and dense.

For good measure, I include changes that allow all five nccl primitives (allreduce, reduce, broadcast, allgather, reducescatter) to act on non-rowmajor-contiguous tensors (again as long as each input's layout is nonoverlapping and dense, and as long as all tensors participating in a given collective have the same layout).  The primitive comm changes aren't necessary to enable the DDP changes, but I wasn't sure this would end up true until I had written both sets of changes.  I think primitive comm enablement is reasonable to keep in the PR, especially since the code for it is simple.

Channels last params will be a major beneficiary of this PR, but I don't see it as a channels-last-specific fix. The spirit is layout matching in general (see the sketch after this list):
- Grads should be stashed with memory layouts matching their params.
- Src and dst tensors on opposite ends of collectives should have matching dense layouts.
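
As a quick check of that contract, a minimal sketch (CPU-only, assumes a post-PR build):

```python
import torch

# Under the layout contract above, the stashed .grad of a channels_last
# parameter should itself be channels_last rather than rowmajor contiguous.
conv = torch.nn.Conv2d(8, 8, 3).to(memory_format=torch.channels_last)
x = torch.randn(2, 8, 16, 16).to(memory_format=torch.channels_last)
conv(x).sum().backward()
print(conv.weight.grad.is_contiguous(memory_format=torch.channels_last))  # expected: True
```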

This PR also updates autograd docs to describe potential BC-breaking changes below.

## BC notes
ngimel albanD gchanan

#### BC-breaking
In the common case where the user lets AccumulateGrad decide grad layouts, strides for grads of dense but non-rowmajor-contiguous params will change.  Any user code that was accustomed to `view(-1)`ing these grads will break.
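
For example, a minimal illustration of that breakage (the grad is simulated here with a dense, non-rowmajor-contiguous tensor):

```python
import torch

# A channels_last tensor is dense but not rowmajor contiguous, so view(-1)
# on such a grad now raises; reshape(-1), which may copy, is the replacement.
g = torch.empty(2, 8, 4, 4).to(memory_format=torch.channels_last)
try:
    g.view(-1)
except RuntimeError as err:
    print("view(-1) failed:", err)
flat = g.reshape(-1)  # works (copies if necessary)
```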

Also, the circumstances under which a grad can be stolen directly from the backward function that created it, as opposed to deep-copied by AccumulateGrad, have changed.  In most cases we expect silent performance improvement, because we expect channels-last-aware backward kernels will create channels last gradients for channels last params.  Now those can be stolen, whereas before this PR they were cloned and made rowmajor contiguous.  IMO this is a mild BC breakage.  Param backward hooks still see grads come in with whatever format the backward kernel gave them.  The only BC breakage potential I see is if user code relies somehow on a grad in a hook having or not having the same deep memory as the eventual `param.grad`.  Any such users hopefully know they're off the edge of the map and understand how to update their expectations.

#### BC escape hatches
At albanD's recommendation, this PR's changes to AccumulateGrad do not alter the pre-PR code's decisions about whether grad is accumulated in or out of place.  Accumulations of new grads onto an existing `.grad` attribute were (usually) in-place before this PR and remain in-place after this PR, keeping the existing `.grad`'s layout.  After this PR, if the user wants to force accumulation into a grad with a particular layout, they can preset `param.grad` to a zeroed tensor with the desired strides or call `grad.contiguous(desired format)`.  This likely won't be as performant as letting AccumulateGrad establish grad layouts by cloning or stealing grads with contract-compliant strides, but at least users have a control point (see the sketch after the limitation note below).

One limitation (present before this PR and unchanged by this PR):  Presetting `param.grad` does not ensure in-place accumulation all the time.  For example, if `create_graph=True`, or if incoming `new_grad` is dense and existing `variable_grad` is sparse, accumulation occurs out of place, and the out-of-place result may not match the existing grad's strides.
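
A minimal sketch of that escape hatch, assuming the common in-place accumulation path (create_graph=False, dense grads): here rowmajor-contiguous grads are forced for a channels_last module by presetting `.grad`.

```python
import torch

conv = torch.nn.Conv2d(8, 8, 3).to(memory_format=torch.channels_last)

# Preset .grad with the desired strides; in-place accumulation keeps them.
for p in conv.parameters():
    p.grad = torch.zeros_like(p, memory_format=torch.contiguous_format)

x = torch.randn(2, 8, 16, 16).to(memory_format=torch.channels_last)
conv(x).sum().backward()
print(conv.weight.grad.is_contiguous())  # expected: True (rowmajor layout kept)
```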

----------------------------
I also noticed some potential DDP improvements that I considered out of scope but want to mention for visibility:
1. make sure Reducer's ops sync with AccumulateGrad streams
2. ~to reduce CPU overhead and incur fewer kernel launches, lazily create flat `contents` tensors by a single `cat` kernel only when a bucket is full, instead of `copy_`ing grads into `contents` individually as soon as they are received.~  PR includes a [minor change](https://github.com/pytorch/pytorch/pull/34904/files#diff-c269190a925a4b0df49eda8a8f6c5bd3R312-R315) to divide grads while copying them into flat buffers, instead of copying them in, then dividing separately.  Without cat+div fusion, div-while-copying is the best we can do.
3. https://github.com/pytorch/pytorch/issues/38942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34904

Differential Revision: D20496044

Pulled By: albanD

fbshipit-source-id: 248d680f4b1bf77b0a986451844ec6e254469217
2020-06-16 08:43:31 -07:00
Yanli Zhao
b98948e6dd implement dynamic bucket order in DDP (#35137)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35137

bucket order is rebuilt dynamically in the first reduction backward pass when find_unused_parameters = false
ghstack-source-id: 104794018

Test Plan: unit test

Differential Revision: D20128537

fbshipit-source-id: fad73de965cdcb59a51c0a12b248271344584b9f
2020-05-28 12:59:52 -07:00
Shen Li
8d6a8d2b3f Fix DDP bug in single process multiple device use cases (#36503)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36503

Test Plan: Imported from OSS

Differential Revision: D21179274

Pulled By: mrshenli

fbshipit-source-id: 0afce30ae0ddda753d1e240584a0f80df9aec4c2
2020-04-22 15:06:28 -07:00
Shen Li
5afd816793 Add a warning for Single-Process Multi-GPU DDP (#36656)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36656

Test Plan: Imported from OSS

Differential Revision: D21042537

Pulled By: mrshenli

fbshipit-source-id: fa3501dc2bba14550ec4f254612a80f61fe86a4a
2020-04-15 12:43:50 -07:00
Xiang Gao
df8d6eeb19 Update docs about DP and DDP for CUDA (#35063)
Summary:
We should recommend DDP instead of DP. Hope we can also cherry-pick this for 1.5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35063

Differential Revision: D20549621

Pulled By: ngimel

fbshipit-source-id: 86b1b2134664065cc6070ea4212895f993eaf543
2020-03-20 20:06:37 -07:00