30217 Commits

Author SHA1 Message Date
Iurii Zdebskyi
722faeb2a4 [RELAND] Added optimizers based on multi tensor apply (#45408)
Summary:
Original PR: https://github.com/pytorch/pytorch/pull/45299. This PR fixes the minor bugs that caused the revert.

Adds a new namespace, `torch.optim._multi_tensor`, with updated versions of several optimizers. These optimizers use the `_foreach` APIs, which improve performance significantly.

### Tests
- updated existing tests to use both optimizers
- added `test_multi_tensor_optimizers` test to verify correctness.

### Perf results

**Adam**
timeit: 42.69 ms --> 10.16 ms
autorange: 41.96 ms --> 10.28 ms

**AdamW**
timeit: 51.38 ms --> 15.63 ms
autorange: 50.82 ms --> 16.07 ms

**SGD**
timeit: 6.28 ms --> 4.40 ms
autorange: 6.13 ms --> 4.73 ms

**RMSprop**
timeit: 28.63 ms --> 5.89 ms
autorange: 28.27 ms --> 5.76 ms

**Rprop**
timeit: 213.30 --> 178.42
autorange: 212.03 --> 178.03

**ASGD**
timeit: 21.67 --> 9.33
autorange: 21.64 --> 9.27

**Adamax**
timeit: 55.60 --> 48.29
autorange: 55.22 --> 49.13

**Perf script used**

```
import torch
import time
import torch.optim as optim
from torch.autograd import Variable
from torch.optim.lr_scheduler import ExponentialLR, ReduceLROnPlateau, StepLR
import torch.nn as nn
import torchvision
import torch.utils._benchmark as benchmark_utils

device = "cuda"
model = torchvision.models.resnet.resnet101(pretrained=True).to(device)
targets = torch.randint(0, 1000, (100, 100), device=device)
criterion = nn.CrossEntropyLoss()

optimizer = optim.SGD(model.parameters(), lr=1e-3) # <----------------------- optimizer.
                                                          # would compare optim.SGD vs optim._multi_tensor.SGD
running_loss = 0.0
target = torch.empty(128, dtype=torch.long, device=device).random_(5)

optimizer.zero_grad()
inputs = torch.rand(128, 3, 100, 100, device=device , requires_grad=True)
outputs = model(inputs)
loss = criterion(outputs, target)
loss.backward()
optimizer.step()
running_loss += loss.item()

def main():
    timer = benchmark_utils.Timer(
        stmt="optimizer.step()",
        globals=globals(),
        label=str(optimizer),  # use the optimizer's repr, not the literal text "str(optimizer)"
    )

    for i in range(1):
        print(f"Run: {i}\n{'-' * 40}")
        print(f"timeit:\n{timer.timeit(1000)}\n")
        print(f"autorange:\n{timer.blocked_autorange()}\n\n")

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45408

Reviewed By: gchanan

Differential Revision: D23956680

Pulled By: izdeby

fbshipit-source-id: c5eab7bf5fce14a287c15cead1cdc26e42cfed94
2020-09-28 13:14:04 -07:00
Bram Wasti
87b356d093 [static runtime] Split out graph preparation from runtime (#44131)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44131

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604305

Pulled By: bwasti

fbshipit-source-id: 7b47da4961d99074199417ef1407a788c7d80ee6
2020-09-28 13:01:23 -07:00
Nikolay Korovaiko
6ab1c0b1ca Disable a few tests in preparation to enabling PE+TE (#44815)
Summary:
Disable a few tests in preparation for enabling PE+TE.
Next PR: https://github.com/pytorch/pytorch/pull/45396

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44815

Reviewed By: ZolotukhinM

Differential Revision: D23948445

Pulled By: Krovatkin

fbshipit-source-id: 93e641b7b8a3f13bd3fd3840116076553408f224
2020-09-28 12:55:12 -07:00
Xiang Gao
36c3fbc9e3 CUDA BFloat Conv (non-cuDNN) (#45007)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45007

Reviewed By: zou3519

Differential Revision: D23933174

Pulled By: ngimel

fbshipit-source-id: 84eb028f09c9197993fb9981c0efb535014e5f78
2020-09-28 11:42:42 -07:00
Bert Maher
03342af3a3 Add env variable to bypass CUDACachingAllocator for debugging (#45294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45294

While tracking down a recent memory corruption bug we found that
cuda-memcheck wasn't finding the bad accesses, and ngimel pointed out that
it's because we use a caching allocator so a lot of "out of bounds" accesses
land in a valid slab.

This PR adds a runtime knob (`PYTORCH_NO_CUDA_MEMORY_CACHING`) that, when set,
bypasses the caching allocator's caching logic so that allocations go straight
to cudaMalloc.  This way, cuda-memcheck will actually work.
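
To illustrate why a caching allocator can hide out-of-bounds accesses, and what bypassing it changes, here is a toy Python model. It is only a sketch and not the CUDACachingAllocator itself: `ToyCachingAllocator`, `raw_malloc`, and the 1024-byte slab size are hypothetical. Only the environment variable name comes from this PR.

```python
import os

SLAB_BYTES = 1024  # hypothetical slab granularity, for illustration only


class ToyCachingAllocator:
    """Toy model: hands out slab-sized blocks unless caching is bypassed."""

    def __init__(self):
        # The knob is treated as "on" whenever the variable is set.
        self.bypass = os.environ.get("PYTORCH_NO_CUDA_MEMORY_CACHING") is not None
        self.free_blocks = []  # sizes of cached blocks available for reuse
        self.raw_mallocs = 0   # number of calls that reached the "driver"

    def raw_malloc(self, nbytes):
        # Stand-in for cudaMalloc: returns the number of bytes reserved.
        self.raw_mallocs += 1
        return nbytes

    def malloc(self, nbytes):
        if self.bypass:
            # Exact-size allocation: an out-of-bounds access leaves the block.
            return self.raw_malloc(nbytes)
        # Round up to a whole slab; accesses past nbytes may still be "valid".
        rounded = -(-nbytes // SLAB_BYTES) * SLAB_BYTES
        if rounded in self.free_blocks:
            self.free_blocks.remove(rounded)
            return rounded
        return self.raw_malloc(rounded)

    def free(self, block_bytes):
        # With caching, freed blocks are kept for reuse instead of released.
        if not self.bypass:
            self.free_blocks.append(block_bytes)
```

In this model a 100-byte request is backed by a 1024-byte block when caching is on, so an access at byte 500 lands in memory the allocator owns. With the variable set, the request is backed by exactly 100 bytes.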

Test Plan:
Insert some memory errors and run a test under cuda-memcheck;
observe that cuda-memcheck flags an error where expected.

Specifically I removed the output-masking logic here:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/cuda_codegen.cpp#L819-L826

And ran:
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 cuda-memcheck pytest -k test_superslomo test_jit_fuser_te.py
```

Reviewed By: ngimel

Differential Revision: D23964734

Pulled By: bertmaher

fbshipit-source-id: 04efd11e8aff037b9edde80c70585cb820ee6e39
2020-09-28 11:40:04 -07:00
Nikolay Korovaiko
993628c74a Build shape expressions and remove outputs that are only used by aten::sizes (#45080)
Summary:
Currently, TE materializes all intermediate results even if they are only used for computing their shapes. This diff ports the approach that the OF (Old Fuser) took to deal with this issue: given the structure of a fusion group, we infer all the sizes outside the fusion group from the fusion group's inputs.

A simple example would be:

```
def test_fuse(a, b):
    c = a + b
    d = c + b
    return d
```

Here we don't need to cache `c` as computing a gradient for `b` in `d = c + b` doesn't need it. We do need to compute sizes for all arguments here in case broadcasts happen.

Without this optimization, TE would need to materialize `c` so that we can get its size:

```
[DUMP profiling_graph_executor_impl.cpp:499] Optimized Graph:
[DUMP profiling_graph_executor_impl.cpp:499] graph(%a.1 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %b.1 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %11 : Tensor = prim::DifferentiableGraph_0(%b.1, %a.1)
[DUMP profiling_graph_executor_impl.cpp:499]   return (%11)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::DifferentiableGraph_0 = graph(%11 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %13 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %59 : int[] = aten::size(%13) # <string>:3:44
[DUMP profiling_graph_executor_impl.cpp:499]   %62 : int[] = aten::size(%11) # <string>:3:93
[DUMP profiling_graph_executor_impl.cpp:499]   %83 : Double(1:1, requires_grad=0, device=cuda:0), %84 : Double(1:1, requires_grad=0, device=cuda:0), %85 : bool = prim::TypeCheck(%11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]   %86 : Tensor, %87 : Tensor = prim::If(%85)
[DUMP profiling_graph_executor_impl.cpp:499]     block0():
[DUMP profiling_graph_executor_impl.cpp:499]       %d.4 : Double(1:1, requires_grad=0, device=cuda:0), %c.4 : Double(1:1, requires_grad=0, device=cuda:0) = prim::TensorExprGroup_0(%83, %84)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%d.4, %c.4)
[DUMP profiling_graph_executor_impl.cpp:499]     block1():
[DUMP profiling_graph_executor_impl.cpp:499]       %94 : Function = prim::Constant[name="fallback_function", fallback=1]()
[DUMP profiling_graph_executor_impl.cpp:499]       %95 : (Tensor, Tensor) = prim::CallFunction(%94, %11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]       %96 : Tensor, %97 : Tensor = prim::TupleUnpack(%95)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%96, %97)
[DUMP profiling_graph_executor_impl.cpp:499]   %60 : int[] = aten::size(%87) # <string>:3:55
[DUMP profiling_graph_executor_impl.cpp:499]   %61 : int[]? = aten::_size_if_not_equal(%59, %60) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %64 : int[]? = aten::_size_if_not_equal(%62, %60) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   %67 : int[] = aten::size(%86) # <string>:3:55
[DUMP profiling_graph_executor_impl.cpp:499]   %68 : int[]? = aten::_size_if_not_equal(%60, %67) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %71 : int[]? = aten::_size_if_not_equal(%62, %67) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   return (%86, %61, %64, %68, %71)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::TensorExprGroup_0 = graph(%1 : Double(1:1, requires_grad=0, device=cuda:0),
[DUMP profiling_graph_executor_impl.cpp:499]       %4 : Double(1:1, requires_grad=0, device=cuda:0)):
[DUMP profiling_graph_executor_impl.cpp:499]   %5 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %c.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%4, %1, %5) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2872:16
[DUMP profiling_graph_executor_impl.cpp:499]   %2 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %d.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%c.3, %1, %2) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2873:16
[DUMP profiling_graph_executor_impl.cpp:499]   return (%d.3, %c.3)
```

With this optimization we use `prim::BroadcastSizes` to compute the size of `c`. No need to materialize it.

```
[DUMP profiling_graph_executor_impl.cpp:499] Optimized Graph:
[DUMP profiling_graph_executor_impl.cpp:499] graph(%a.1 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %b.1 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %11 : Tensor = prim::DifferentiableGraph_0(%b.1, %a.1)
[DUMP profiling_graph_executor_impl.cpp:499]   return (%11)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::DifferentiableGraph_0 = graph(%11 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %13 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %59 : int[] = aten::size(%13) # <string>:3:44
[DUMP profiling_graph_executor_impl.cpp:499]   %62 : int[] = aten::size(%11) # <string>:3:93
[DUMP profiling_graph_executor_impl.cpp:499]   %88 : Double(1:1, requires_grad=0, device=cuda:0), %89 : Double(1:1, requires_grad=0, device=cuda:0), %90 : bool = prim::TypeCheck(%11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]   %91 : Tensor = prim::If(%90)
[DUMP profiling_graph_executor_impl.cpp:499]     block0():
[DUMP profiling_graph_executor_impl.cpp:499]       %d.4 : Double(1:1, requires_grad=0, device=cuda:0) = prim::TensorExprGroup_0(%88, %89)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%d.4)
[DUMP profiling_graph_executor_impl.cpp:499]     block1():
[DUMP profiling_graph_executor_impl.cpp:499]       %97 : Function = prim::Constant[name="fallback_function", fallback=1]()
[DUMP profiling_graph_executor_impl.cpp:499]       %98 : (Tensor) = prim::CallFunction(%97, %11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]       %99 : Tensor = prim::TupleUnpack(%98)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%99)
[DUMP profiling_graph_executor_impl.cpp:499]   %85 : int[] = aten::size(%91)
[DUMP profiling_graph_executor_impl.cpp:499]   %86 : int[] = prim::BroadcastSizes(%59, %62)
[DUMP profiling_graph_executor_impl.cpp:499]   %61 : int[]? = aten::_size_if_not_equal(%59, %86) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %64 : int[]? = aten::_size_if_not_equal(%62, %86) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   %68 : int[]? = aten::_size_if_not_equal(%86, %85) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %71 : int[]? = aten::_size_if_not_equal(%62, %85) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   return (%91, %61, %64, %68, %71)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::TensorExprGroup_0 = graph(%1 : Double(1:1, requires_grad=0, device=cuda:0),
[DUMP profiling_graph_executor_impl.cpp:499]       %4 : Double(1:1, requires_grad=0, device=cuda:0)):
[DUMP profiling_graph_executor_impl.cpp:499]   %5 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %c.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%4, %1, %5) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2872:16
[DUMP profiling_graph_executor_impl.cpp:499]   %2 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %d.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%c.3, %1, %2) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2873:16
[DUMP profiling_graph_executor_impl.cpp:499]   return (%d.3)
```
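
The idea behind computing the size of `c` from the input sizes alone can be sketched in plain Python. This is an illustrative reimplementation of the usual broadcasting rule, not the `prim::BroadcastSizes` implementation. The function name `broadcast_sizes` and the example shapes are assumptions.

```python
def broadcast_sizes(*shapes):
    """Return the broadcast shape of the given shapes (trailing-dim alignment)."""
    ndim = max((len(s) for s in shapes), default=0)
    result = [1] * ndim
    for shape in shapes:
        # Align each shape with the result on its trailing dimensions.
        offset = ndim - len(shape)
        for i, dim in enumerate(shape):
            current = result[offset + i]
            if current == 1:
                result[offset + i] = dim
            elif dim != 1 and dim != current:
                raise ValueError(f"shapes {shapes} are not broadcastable")
    return result


# Size of c = a + b, derived without ever materializing c.
size_a = [4, 1, 3]
size_b = [5, 3]
print(broadcast_sizes(size_a, size_b))  # [4, 5, 3]
```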

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45080

Reviewed By: bertmaher

Differential Revision: D23856410

Pulled By: Krovatkin

fbshipit-source-id: 2956286eb03a4894a5baa151c35e6092466322b1
2020-09-28 10:45:56 -07:00
Luca Wehrstedt
e5242aaf89 Update TensorPipe submodule (#45433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45433

Primarily in order to pick up the fix landed in https://github.com/pytorch/tensorpipe/pull/225 which fixes the handling of scopes in link-local IPv6 addresses, which was reported by a user.

Test Plan: The specific upstream change is covered by new unit tests. The submodule update will be validated by the PyTorch CI.

Reviewed By: beauby

Differential Revision: D23962289

fbshipit-source-id: 4ed762fc19c4aeb1398d1337d61b3188c4c228be
2020-09-28 10:32:06 -07:00
Rong Rong
48d29c830d [hotfix] disable problematic cuda tests on rocm builds (#45435)
Summary:
Disable the 3 recently added CUDA tests on AMD ROCm builds/tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45435

Reviewed By: malfet

Differential Revision: D23962881

Pulled By: walterddr

fbshipit-source-id: ad4ea1f835b4722cdbdce685806cfd64376cc16f
2020-09-28 10:02:12 -07:00
Eli Uriegas
e2ffdf467a docker: Add torchelastic to docker image (#45438)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45438

Adds torchelastic (as well as its dependencies) to the official docker
images

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: tierex

Differential Revision: D23963787

Pulled By: seemethere

fbshipit-source-id: 54ebb4b9c50699e543f264975dadf99badf55753
2020-09-28 09:53:07 -07:00
Nikita Vedeneev
e4950a093a Backward support for generalized eigenvalue solver with LOBPCG in forward [only k-rank SYMEIG case] (#43002)
Summary:
As per title. Fixes [#38948](https://github.com/pytorch/pytorch/issues/38948). Therein you can find some blueprints for the algorithm being used in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43002

Reviewed By: zou3519

Differential Revision: D23931326

Pulled By: albanD

fbshipit-source-id: e6994af70d94145f974ef87aa5cea166d6deff1e
2020-09-28 07:22:35 -07:00
Mike Ruberry
6417a70465 Updates linalg warning + docs (#45415)
Summary:
Changes the deprecation of norm to a docs deprecation, since PyTorch components still rely on norm and some behavior, like automatically flattening tensors, may need to be ported to torch.linalg.norm. The documentation is also updated to clarify that torch.norm and torch.linalg.norm are distinct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45415

Reviewed By: ngimel

Differential Revision: D23958252

Pulled By: mruberry

fbshipit-source-id: fd54e807c59a2655453a6bcd9f4073cb2c12e8ac
2020-09-28 05:28:42 -07:00
generatedunixname89002005325676
7818a214c5 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D23959094

fbshipit-source-id: 6caa046d263114bff38a38d756099aac357e4f04
2020-09-28 05:08:46 -07:00
Negin Raoof
95a97e51b5 [ONNX] Improve scripting inplace indexing ops (#44351)
Summary:
Fix a couple of issues with scripting inplace indexing in the prepare_inplace_ops_for_onnx pass.
1- Tracing index copy (such as cases like x[1:3] = data) already applies broadcasting on the rhs if needed. The broadcasting node (aten::expand) is missing in scripting cases.

2- Inplace indexing with ellipsis (aten::copy_) is replaced with aten::index_put and then handled with slice+select in this pass.
Support for negative indices is also added for this op.

Shape inference is also enabled for scripting tests using new JIT API.
A few more tests are enabled for scripting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44351

Reviewed By: ezyang

Differential Revision: D23880267

Pulled By: bzinodev

fbshipit-source-id: 78b33444633eb7ae0fbabc7415e3b16001f5207f
2020-09-28 00:32:36 -07:00
Zino Benaissa
13f76f2be4 Fix preserve submodule attribute in freezing (#45143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45143

This PR prevents freezing from cleaning up a submodule when the user requests
that the submodule be preserved.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23844969

Pulled By: bzinodev

fbshipit-source-id: 80e6db3fc12460d62e634ea0336ae2a3551c2151
2020-09-28 00:05:38 -07:00
liqunfu
c3bf402cbb handle onnx nll with default ignore index (#44816)
Summary:
In the ONNX NegativeLogLikelihoodLoss specification, ignore_index is optional and has no default value.
Therefore, when converting the nll op to ONNX, we need to set the ignore_index attribute even if it is not specified (e.g. ignore_index=-100).
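
A minimal pure-Python sketch of why the attribute has to be emitted explicitly: with a default of -100 on the PyTorch side, a target of -100 is silently skipped, so an exported op without the attribute would behave differently. Both helpers below are hypothetical and are not the exporter code. Returning 0.0 when every target is ignored is a choice of this sketch only.

```python
def nll_loss(log_probs, targets, ignore_index=-100):
    """Mean negative log-likelihood over targets that are not ignore_index.

    log_probs: list of per-sample lists of log-probabilities.
    targets:   list of class indices, one per sample.
    """
    total, count = 0.0, 0
    for row, target in zip(log_probs, targets):
        if target == ignore_index:
            continue  # ignored samples add neither loss nor count
        total -= row[target]
        count += 1
    return total / count if count else 0.0


def onnx_nll_attributes(ignore_index=-100, reduction="mean"):
    # Hypothetical helper: always emit ignore_index, even at its default value,
    # because the ONNX op has no default to fall back on.
    return {"ignore_index": ignore_index, "reduction": reduction}


log_probs = [[-0.5, -1.0], [-2.0, -0.25]]
print(nll_loss(log_probs, [0, 1]))     # 0.375
print(nll_loss(log_probs, [0, -100]))  # 0.5
print(onnx_nll_attributes())           # {'ignore_index': -100, 'reduction': 'mean'}
```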

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44816

Reviewed By: ezyang

Differential Revision: D23880354

Pulled By: bzinodev

fbshipit-source-id: d0bdd58d0a4507ed9ce37133e68533fe6d1bdf2b
2020-09-27 23:26:19 -07:00
Mike Ruberry
8bdbedd4ee Revert "Updates and simplifies nonzero as_tuple behavior"
This reverts commit 8b143771d0.
2020-09-27 20:58:42 -07:00
Mike Ruberry
8b143771d0 Updates and simplifies nonzero as_tuple behavior
2020-09-27 20:56:30 -07:00
shubhambhokare1
5b839bca78 [ONNX] Optimize export_onnx api to reduce string and model proto exchange (#44332)
Summary:
Optimize export_onnx api to reduce string and model proto exchange in export.cpp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44332

Reviewed By: bwasti, eellison

Differential Revision: D23880129

Pulled By: bzinodev

fbshipit-source-id: 1d216d8f710f356cbba2334fb21ea15a89dd16fa
2020-09-27 16:29:08 -07:00
neginraoof
4005afe94b [ONNX] Update narrow for dynamic inputs (#44039)
Summary:
Update narrow for dynamic inputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44039

Reviewed By: mruberry

Differential Revision: D23742215

Pulled By: bzinodev

fbshipit-source-id: 0d58d2fe996f91a124af988a9a21ee433e842d07
2020-09-27 15:52:57 -07:00
Natalia Gimelshein
78caa028b6 Revert D23009117: [Distributed] DeleteKey API for c10d TCP Store
Test Plan: revert-hammer

Differential Revision:
D23009117 (addf94f2d6)

Original commit changeset: 1a0d95b43d79

fbshipit-source-id: ad3fe5501267e1a0a7bf23410766f1e92b34b24d
2020-09-27 12:04:42 -07:00
Natalia Gimelshein
f84b2e865f Revert D23878455: [Distributed] Adding Python tests for the TCPStore getNumKeys and deleteKey
Test Plan: revert-hammer

Differential Revision:
D23878455 (cf808bed73)

Original commit changeset: 0a17ecf66b28

fbshipit-source-id: 93e60b23f66324e3e5266c45abb0cec295bb3d23
2020-09-27 12:02:24 -07:00
Mikhail Zolotukhin
bc5710f2f7 Benchmarks: tweak PE config settings. (#45349)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45349

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D23935518

Pulled By: ZolotukhinM

fbshipit-source-id: 5a7c508c6fc84eafbc23399f095d732b903510dc
2020-09-26 23:13:29 -07:00
Mikhail Zolotukhin
a07d82982a CI: Add a run of FastRNN benchmarks in default executor/fuser configuration. (#45348)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45348

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D23935520

Pulled By: ZolotukhinM

fbshipit-source-id: efecaaab68caaaa057b354884f4ae37b6ef36983
2020-09-26 23:13:27 -07:00
Mikhail Zolotukhin
8cef7326f4 Benchmarks: add 'default' options for fuser and executor. (#45347)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45347

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D23935519

Pulled By: ZolotukhinM

fbshipit-source-id: 8323fafe7828683c4d29c12a1e5722adb6f945ff
2020-09-26 23:09:02 -07:00
Natalia Gimelshein
37a671abc7 Revert D23828257: Quantization: add API summary section
Test Plan: revert-hammer

Differential Revision:
D23828257 (d2bd556e7d)

Original commit changeset: 9311ee3f394c

fbshipit-source-id: 80b16fc123191e249e6a070ec5360a15fe91cf61
2020-09-26 22:53:10 -07:00
Natalia Gimelshein
110aa45387 Revert D23842456: Quantization: combine previous summary with new summary
Test Plan: revert-hammer

Differential Revision:
D23842456 (278da57255)

Original commit changeset: db2399e51e9a

fbshipit-source-id: 7878257330bf83751cb17c0971a5c894bdf256ba
2020-09-26 22:53:07 -07:00
Natalia Gimelshein
3da1061059 Revert D23916669: quant docs: add reduce_range explanatation to top level doc
Test Plan: revert-hammer

Differential Revision:
D23916669 (eb39624394)

Original commit changeset: ef93fb774cb1

fbshipit-source-id: 7b56020427e76e13f847494044179c81d508db11
2020-09-26 22:48:38 -07:00
Mike Ruberry
54a253fded Revert D23931987: Added optimizers based on multi tensor apply
Test Plan: revert-hammer

Differential Revision:
D23931987 (2b21e7767e)

Original commit changeset: 582134ef2d40

fbshipit-source-id: ffd500aea55fda34155442fb15e2529cb9c00100
2020-09-26 18:11:54 -07:00
Mike Ruberry
e52762cbb7 Revert D23917034: quant docs: document how to customize qconfigs in eager mode
Test Plan: revert-hammer

Differential Revision:
D23917034 (7763e1d7b1)

Original commit changeset: ccf71ce4300c

fbshipit-source-id: 9ce99e880b4a22e824f4413354a0f3703e7c5c2c
2020-09-26 18:05:38 -07:00
Rohan Varma
23dfca8351 Support record_shapes in RPC profiling (#44419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44419

Closes https://github.com/pytorch/pytorch/issues/39969

This PR adds support for propagation of input shapes over the wire when the profiler is invoked with `record_shapes=True` over RPC. Previously, we did not respect this argument.

This is done by saving the shapes as an ivalue list and recovering it as the type expected (`std::vector<std::vector<int>>` on the client). Test is added to ensure that remote ops have the same `input_shapes` as if the op were run locally.
ghstack-source-id: 112977899

Reviewed By: pritamdamania87

Differential Revision: D23591274

fbshipit-source-id: 7cf3b2e8df26935ead9d70e534fc2c872ccd6958
2020-09-26 13:26:44 -07:00
Rohan Varma
19dda7c68a Fallback to CPU when remote end does not have CUDA for profiling (#44967)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44967

When enabling the profiler on the server, if it is a different machine it may
not have CUDA while the caller does. Previously we would crash in this case; now we
fall back to CPU and log a warning.
ghstack-source-id: 112977906
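
The fallback behavior described above can be sketched as follows. This is an illustration under assumptions, not the RPC profiler code: `resolve_profiler_mode` and the `"cpu"`/`"cuda"` string labels are hypothetical.

```python
import warnings


def resolve_profiler_mode(requested_mode, cuda_available):
    """Return the profiler mode to use on the remote end.

    requested_mode is "cpu" or "cuda" (hypothetical labels for this sketch).
    """
    if requested_mode == "cuda" and not cuda_available:
        # Degrade gracefully instead of crashing on a CUDA-less machine.
        warnings.warn(
            "CUDA profiling requested but CUDA is unavailable; "
            "falling back to CPU profiling."
        )
        return "cpu"
    return requested_mode
```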

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D23790729

fbshipit-source-id: dc6eba172b7e666842d54553f52a6b9d5f0a5362
2020-09-26 13:12:55 -07:00
Iurii Zdebskyi
2b21e7767e Added optimizers based on multi tensor apply (#45299)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45299

Adds a new namespace, `torch.optim._multi_tensor`, with updated versions of several optimizers. These optimizers use the `_foreach` APIs, which improve performance significantly.

### Tests
- updated existing tests to use both optimizers
- added `test_multi_tensor_optimizers` test to verify correctness.

### Perf results

**Adam**
timeit: 42.69 ms --> 10.16 ms
autorange: 41.96 ms --> 10.28 ms

**AdamW**
timeit: 51.38 ms --> 15.63 ms
autorange: 50.82 ms --> 16.07 ms

**SGD**
timeit: 6.28 ms --> 4.40 ms
autorange: 6.13 ms --> 4.73 ms

**RMSprop**
timeit: 28.63 ms --> 5.89 ms
autorange: 28.27 ms --> 5.76 ms

**Rprop**
timeit: 213.30 --> 178.42
autorange: 212.03 --> 178.03

**ASGD**
timeit: 21.67 --> 9.33
autorange: 21.64 --> 9.27

**Adamax**
timeit: 55.60 --> 48.29
autorange: 55.22 --> 49.13

**Perf script used**

```
import torch
import time
import torch.optim as optim
from torch.autograd import Variable
from torch.optim.lr_scheduler import ExponentialLR, ReduceLROnPlateau, StepLR
import torch.nn as nn
import torchvision
import torch.utils._benchmark as benchmark_utils

device = "cuda"
model = torchvision.models.resnet.resnet101(pretrained=True).to(device)
targets = torch.randint(0, 1000, (100, 100), device=device)
criterion = nn.CrossEntropyLoss()

optimizer = optim.SGD(model.parameters(), lr=1e-3) # <----------------------- optimizer.
                                                          # would compare optim.SGD vs optim._multi_tensor.SGD
running_loss = 0.0
target = torch.empty(128, dtype=torch.long, device=device).random_(5)

optimizer.zero_grad()
inputs = torch.rand(128, 3, 100, 100, device=device , requires_grad=True)
outputs = model(inputs)
loss = criterion(outputs, target)
loss.backward()
optimizer.step()
running_loss += loss.item()

def main():
    timer = benchmark_utils.Timer(
        stmt="optimizer.step()",
        globals=globals(),
        label=str(optimizer),  # use the optimizer's repr, not the literal text "str(optimizer)"
    )

    for i in range(1):
        print(f"Run: {i}\n{'-' * 40}")
        print(f"timeit:\n{timer.timeit(1000)}\n")
        print(f"autorange:\n{timer.blocked_autorange()}\n\n")

if __name__ == "__main__":
    main()
```

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23931987

Pulled By: izdeby

fbshipit-source-id: 582134ef2d402909d27d89a45c5b588fb7130ea1
2020-09-26 12:17:43 -07:00
Thomas Bredillet
0fa551f0ab [c2] Fix int types for learning rate
Summary: Currently GetSingleArgument is overflowing since it's expecting an int instead of an int64 when using a 1cycle (hill policy) annealing schedule
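
The kind of overflow described here can be reproduced in isolation with `ctypes`, which stores a value in a fixed-width integer without range checking. This is only an illustration of 32-bit versus 64-bit storage. The value used is made up and is not taken from the learning rate op.

```python
import ctypes

value = 2**31  # one past the largest 32-bit signed integer

as_int32 = ctypes.c_int32(value).value  # wraps around
as_int64 = ctypes.c_int64(value).value  # preserved

print(as_int32)  # -2147483648
print(as_int64)  # 2147483648
```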

Test Plan:
unittest

buck test  caffe2/caffe2/python/operator_test:learning_rate_op_test

Differential Revision: D23938169

fbshipit-source-id: 20d65df800d7a0f1dd9520705af31f63ae716463
2020-09-26 10:59:29 -07:00
Omkar Salpekar
cf808bed73 [Distributed] Adding Python tests for the TCPStore getNumKeys and deleteKey (#45223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45223

Previous diffs in this stack implemented the getNumKeys and deleteKey
APIs in the c10d Store as well as added tests at the C++ layer. This diff adds
tests at the Python level in test_c10d.py
ghstack-source-id: 112939763

Test Plan: Ensured these new python tests as well as previous C++ tests pass

Reviewed By: jiayisuse

Differential Revision: D23878455

fbshipit-source-id: 0a17ecf66b28d46438a77346e5bf36414e05e25c
2020-09-26 00:54:28 -07:00
Omkar Salpekar
addf94f2d6 [Distributed] DeleteKey API for c10d TCP Store (#43963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43963

Added a DeleteKey API for the TCP Store
ghstack-source-id: 112939762

Test Plan:
Modified the existing get/set test to use delete. Verified that the
correct keys were deleted and that the numKeys API returned the right values.
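
The shape of such a test can be sketched against a toy in-memory store. `ToyStore` and its method names are hypothetical stand-ins. This is not the c10d TCPStore, whose key counting may differ.

```python
class ToyStore:
    """In-memory stand-in for a key-value store; names are illustrative."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

    def delete_key(self, key):
        # Return True if the key existed and was removed, False otherwise.
        if key in self._data:
            del self._data[key]
            return True
        return False

    def num_keys(self):
        return len(self._data)


store = ToyStore()
store.set("key0", "value0")
store.set("key1", "value1")
print(store.num_keys())          # 2
print(store.delete_key("key0"))  # True
print(store.num_keys())          # 1
print(store.delete_key("key0"))  # False
```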

Reviewed By: jiayisuse

Differential Revision: D23009117

fbshipit-source-id: 1a0d95b43d79e665a69b2befbaa059b2b50a1f66
2020-09-26 00:54:21 -07:00
Omkar Salpekar
304e1d1e19 [Distributed] getNumKeys API to c10d TCPStore (#43962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43962

TCPStore needs a getNumKeys API for our logging needs.
ghstack-source-id: 112939761

Test Plan: Adding tests to C++ Store Tests

Reviewed By: pritamdamania87

Differential Revision: D22985085

fbshipit-source-id: 8a0d286fbd6fd314dcc997bae3aad0e62b51af83
2020-09-26 00:49:00 -07:00
Zafar
d9af3d2fcd [quant] ConvTranspose warnings (#45081)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45081

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23822449

Pulled By: z-a-f

fbshipit-source-id: f21a5f3ef4d09f703c96fff0bc413dbadeac8202
2020-09-25 22:30:14 -07:00
Wang Xu
92189b34b7 Add get_all_users_of function to GraphManipulation (#45216)
Summary:
This PR adds the get_all_users_of function, which returns all the users of a specific node. A unit test is also added.
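
As a rough illustration of what "all the users of a node" means, here is a sketch over a toy graph represented as `(name, input_names)` pairs. The representation and the signature are assumptions made for this example and do not match the function added in GraphManipulation.

```python
def get_all_users_of(nodes, target):
    """Return the names of nodes that take `target` as an input.

    nodes: list of (name, input_names) pairs in topological order.
    """
    return [name for name, inputs in nodes if target in inputs]


graph = [
    ("a", []),
    ("b", []),
    ("add_1", ["a", "b"]),
    ("linear_1", ["add_1"]),
    ("add_2", ["add_1", "linear_1"]),
]
print(get_all_users_of(graph, "add_1"))  # ['linear_1', 'add_2']
```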

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45216

Reviewed By: ezyang

Differential Revision: D23883572

Pulled By: scottxu0730

fbshipit-source-id: 3eb68a411c3c6db39ed2506c9cb7bb7337520ee4
2020-09-25 19:32:49 -07:00
Vasiliy Kuznetsov
7763e1d7b1 quant docs: document how to customize qconfigs in eager mode (#45306)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45306

Adds details to the main quantization doc on how specifically
users can skip or customize quantization of layers.

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23917034

Pulled By: vkuzo

fbshipit-source-id: ccf71ce4300c1946b2ab63d1f35a07691fd7a2af
2020-09-25 18:33:35 -07:00
Vasiliy Kuznetsov
eb39624394 quant docs: add reduce_range explanatation to top level doc (#45305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45305

Adds an explanation for reduce_range to the main quantization
doc page.

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23916669

Pulled By: vkuzo

fbshipit-source-id: ef93fb774cb15741cd92889f114f6ab76c39f051
2020-09-25 18:33:32 -07:00
Vasiliy Kuznetsov
278da57255 Quantization: combine previous summary with new summary (#45135)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45135

The previous quantization summary had steps on what to do for
dynamic, static, QAT.  This PR moves these steps to comments in the
example code, so it is more clear how to accomplish the steps.

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23842456

Pulled By: vkuzo

fbshipit-source-id: db2399e51e9ae33c8a1ac610e3d7dbdb648742b0
2020-09-25 18:33:30 -07:00
Vasiliy Kuznetsov
d2bd556e7d Quantization: add API summary section (#45093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45093

This adds a tl;dr; style summary of the quantization API
to the documentation. Hopefully this will make this easier
for new folks to learn how to use quantization.

This is not meant to be all-encompassing.  Future PRs
can improve the documentation further.

Test Plan:
1. build the doc as specified in https://github.com/pytorch/pytorch#building-the-documentation
2. inspect the quantization page in Chrome, format looks good

Reviewed By: jerryzh168

Differential Revision: D23828257

Pulled By: vkuzo

fbshipit-source-id: 9311ee3f394cd83af0aeafb6e2fcdc3e0321fa38
2020-09-25 18:30:51 -07:00
Zafar
958c208666 [quant] conv_transpose graph patterns (#45078)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45078

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23821580

Pulled By: z-a-f

fbshipit-source-id: 813a4ef1bbc429720765d61791fe754b6678a334
2020-09-25 18:14:29 -07:00
Ailing Zhang
606b1a9a2e Move xla codegen to aten. (#45241)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45241

Test Plan: Imported from OSS

Reviewed By: soumith

Differential Revision: D23926750

Pulled By: ailzhang

fbshipit-source-id: f768e24a9baeca9f9df069a62d6f8b94a853a1ee
2020-09-25 18:07:32 -07:00
Wanchao Liang
32c355af5b [dist_optim] introduce distributed functional optimizer (#45221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45221

This PR introduces a distributed functional optimizer, so that the
distributed optimizer can reuse the functional optimizer APIs and
maintain its own states. This could enable a TorchScript-compatible
functional optimizer when using the distributed optimizer, which helps get rid
of the GIL and improves the overall performance of training, especially distributed
model parallel training.

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23935256

Pulled By: wanchaol

fbshipit-source-id: 59b6d77ff4693ab24a6e1cbb6740bcf614cc624a
2020-09-25 17:13:10 -07:00
Wanchao Liang
08caf15502 [optimizer] refactor Adam to use functional API (#44791)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44791

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23935257

Pulled By: wanchaol

fbshipit-source-id: 6f6e22a9287f5515d2e4e6abd4dee2fe7e17b945
2020-09-25 17:13:08 -07:00
Wanchao Liang
0444c372e1 [optimizer] introduce optimizer functional API, refactor Adagrad (#44715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44715

We have provided a nice and intuitive API in Python. But in the context of large scale distributed training (e.g. Distributed Model Parallel), users often want to use multithreaded training instead of multiprocess training as it provides better resource utilization and efficiency.

This PR introduces the functional optimizer concept (similar to the concept of `nn.functional`). We split the optimizer into two parts: 1. optimizer state management 2. optimizer computation. We expose the computation part as a separate functional API that is available to internal and OSS developers; the caller of the functional API maintains its own states in order to call the functional API directly. While keeping the end-user API the same, the functional API is TorchScript friendly and could be used by the distributed optimizer to speed up training without the GIL.
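
The split between state management and computation can be sketched in plain Python on lists of floats. This is a simplified illustration and not the API added by this PR: the names `adagrad_step` and `Adagrad` and their signatures are assumptions, and options such as learning rate decay and weight decay are omitted.

```python
import math


def adagrad_step(params, grads, state_sums, lr=0.01, eps=1e-10):
    """Pure computation: returns updated (params, state_sums); holds no state."""
    new_sums = [s + g * g for s, g in zip(state_sums, grads)]
    new_params = [
        p - lr * g / (math.sqrt(s) + eps)
        for p, g, s in zip(params, grads, new_sums)
    ]
    return new_params, new_sums


class Adagrad:
    """State management: owns params and accumulators, delegates the math."""

    def __init__(self, params, lr=0.01):
        self.params = list(params)
        self.lr = lr
        self.state_sums = [0.0] * len(self.params)

    def step(self, grads):
        self.params, self.state_sums = adagrad_step(
            self.params, grads, self.state_sums, lr=self.lr
        )
```

A caller that keeps its own state, as a distributed optimizer would, can call `adagrad_step` directly. The `Adagrad` class keeps the familiar end-user interface on top of the same function.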

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23935258

Pulled By: wanchaol

fbshipit-source-id: d2a5228439edb3bc64f7771af2bb9e891847136a
2020-09-25 17:10:26 -07:00
Nikita Shulga
8ab2ad306d Enable torch.cuda.nccl typechecking (#45344)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45336

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45344

Reviewed By: walterddr

Differential Revision: D23935306

Pulled By: malfet

fbshipit-source-id: dd09d4f8ff7a327131764487158675027a13bf69
2020-09-25 17:02:47 -07:00
Shen Li
5211fb97ac Remove device maps from TensorPipe for v1.7 release (#45353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45353

Temporarily removing this feature, will add this back after branch cut.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23939865

Pulled By: mrshenli

fbshipit-source-id: 7dceaffea6b9a16512b5ba6036da73e7f8f83a8e
2020-09-25 16:51:45 -07:00
Brian Hirsh
439930c81b adding a beta parameter to the smooth_l1 loss fn (#44433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44433

Not entirely sure why, but changing the type of beta from `float` to `double` in autocast_mode.cpp and FunctionsManual.h fixes my compiler errors, failing instead at link time

fixing some type errors, updated fn signature in a few more files

removing my usage of Scalar, making beta a double everywhere instead
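
For reference, the role of `beta` in the loss can be sketched for a single pair of scalars. This follows the commonly documented smooth L1 formula, quadratic below `beta` and linear above it. It is an illustration and not the ATen kernel. The function name is chosen for this example.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss for two scalars.

    Quadratic for |pred - target| < beta, linear otherwise.
    With beta == 0 the quadratic branch is never taken, giving plain L1.
    """
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


print(smooth_l1(0.5, 0.0))            # 0.125
print(smooth_l1(3.0, 0.0))            # 2.5
print(smooth_l1(1.0, 0.0, beta=2.0))  # 0.25
```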

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23636720

Pulled By: bdhirsh

fbshipit-source-id: caea2a1f8dd72b3b5fd1d72dd886b2fcd690af6d
2020-09-25 16:36:28 -07:00