Commit Graph

23444 Commits

Author SHA1 Message Date
neginraoof
0b57b383b1 Im2col export (#30972)
Summary:
Added im2col to opset 11.
This symbolic is used to export torch.nn.Unfold
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30972
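
A minimal sketch of what this enables (module configuration and file name are illustrative): exporting a model that uses torch.nn.Unfold, which now lowers to the Im2col symbolic when opset 11 is requested.

```
import torch
import torch.nn as nn

# nn.Unfold extracts sliding local blocks; it previously had no ONNX symbolic.
model = nn.Unfold(kernel_size=(2, 3))
dummy = torch.randn(1, 3, 8, 8)

# opset_version=11 is required, since im2col was added to opset 11.
torch.onnx.export(model, dummy, "unfold.onnx", opset_version=11)
```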

Reviewed By: hl475

Differential Revision: D18946921

Pulled By: houseroad

fbshipit-source-id: 13dd0cbae899700df32fd74d6dff1f29033a2b4c
2019-12-20 09:45:45 -08:00
Nikita Shulga
6cd987e7c0 Make fully_qualified_type_name_impl() compatible with VS2017 15.9 (#31455)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31455

In 15.9, `__FUNCSIG__` unwraps `using` definitions and also preserves `noexcept` qualifiers

Test Plan: Build caffe2 on Windows using VS2017

Differential Revision: D19166204

fbshipit-source-id: b6c5f70e5262d13adf585f77b92223cf5f1e78dd
2019-12-20 09:17:44 -08:00
Nikita Shulga
2099cfa13d Fix input_channels divisibility check in concat_split_op (#31448)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31448

Replace `(!x%y)` with `(x%y != 0)`. In C++, unary `!` binds tighter than `%`, so `!x % y` parses as `(!x) % y` rather than the intended `!(x % y)`.

Test Plan: CI

Reviewed By: orionr

Differential Revision: D19165492

fbshipit-source-id: 246635fb8ddd5823196bcef9d0e6cdf1c349015e
2019-12-20 09:12:54 -08:00
Gregory Chanan
b38901aa15 Test reading __cuda_array_interface__ inferred strides. (#31451)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31451

The PR that fixed this, https://github.com/pytorch/pytorch/pull/24947, didn't add a test.

Fixes: https://github.com/pytorch/pytorch/issues/31443

Test Plan: Imported from OSS

Differential Revision: D19170020

Pulled By: gchanan

fbshipit-source-id: bdbf09989ac8a61b1b70bb1ddee103caa8ef435b
2019-12-20 08:21:39 -08:00
Brian Vaughan
d0d6e0b5e3 add type promotion support for sparse tensors (#30429)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30429

also fix a bug in uncoalesced division

General approach here is that we:
* compute the common dtype based on input tensors
* error if the output tensor is specified and the common type can't be cast back to the output type (e.g. for inplace ops)
* convert input tensor (values) to the common dtype
* perform the op as normal (computing at the common dtype instead of the result type).
* convert/copy the result values back to that of the result tensor (for in-place ops).

For uncoalesced division we need to coalesce, because an integral tensor with values=[1,1] at the same index divided by 2 would give 1/2 + 1/2 =0 instead of 2/2=1.
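
A minimal sketch of the uncoalesced case described above (indices and values illustrative):

```
import torch

# Two entries stored at the same index: the logical value at index 0 is 1 + 1 = 2.
i = torch.tensor([[0, 0]])
v = torch.tensor([1, 1])
s = torch.sparse_coo_tensor(i, v, (1,))  # uncoalesced

# Dividing each stored value first would compute 1/2 + 1/2 = 0 in integer math;
# coalescing first gives the correct 2/2 = 1 (promoted to floating point).
print((s / 2).to_dense())
```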

Test Plan: Imported from OSS

Differential Revision: D19143223

Pulled By: nairbv

fbshipit-source-id: 480fa334c0b2b3df046818f2342cfd4e2d9d892a
2019-12-20 08:01:00 -08:00
svcscm
e9ef087d2d Updating submodules
Summary:
GitHub commits:

357842e091
d62f47c763
dc94cd4972

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: dcb9813e1469cc867d9c826daa873c535ef408ab
2019-12-20 00:57:39 -08:00
Chunli Fu
4c341582ea modify model to enable loading by blob (#31507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31507

This script is used to generate a model with bound shape inference and
blob reorder, which are requirements for big model loading on T17.
1. Load existing model.
2. Do bound shape inference and blob reorder (put embedding blobs at the end).
3. Save the modified model.

Test Plan:
Generated a new model and tested on NNPI.
P124181047 (mismatch is AA variance)

Reviewed By: ipiszy

Differential Revision: D19165467

fbshipit-source-id: c3522fc5dc53b7ec652420558e9e8bf65a1ccfae
2019-12-19 21:57:22 -08:00
davidriazati
06dbef663d Add support for del (#31273)
Summary:
Adds the `del` keyword to the parser and corresponding `aten::Delete` op for lists and dicts

Fixes #20615
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31273
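
A minimal sketch of what the new keyword allows in TorchScript:

```
from typing import Dict, List

import torch

@torch.jit.script
def prune(d: Dict[str, int], l: List[int]):
    del d["unused"]  # dict deletion, lowered to aten::Delete
    del l[0]         # list deletion
    return d, l

print(prune({"unused": 1, "kept": 2}, [3, 4]))
```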

Pulled By: driazati

Differential Revision: D19181473

fbshipit-source-id: c42a2d43ec361a98e0c425232981edc9c39388c4
2019-12-19 21:48:11 -08:00
Xiang Gao
624088e444 Don't dispatch to cudnn if it is not possible to make it 32bit by splitting batch dim (#31383)
Summary:
Also a step towards supporting 64bit indexing in convolution.

See also: https://github.com/pytorch/pytorch/pull/31379
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31383

Differential Revision: D19183443

Pulled By: ngimel

fbshipit-source-id: 0c2030fac147e629d7be0c29f0683ec2b3f28c71
2019-12-19 18:00:03 -08:00
svcscm
87768e5ade Updating submodules
Summary:
GitHub commits:

286867987e
09cbf47ea5
db100834c1
1ba92b8582
60240e3f08
beb5c4798e
c37eb5d377
1ada29037c
f12539bbc9

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: 75b16ea1bc038b599b3540d0615dd9eb9ecfda74
2019-12-19 17:30:48 -08:00
Zachary DeVito
457286a383 fix missing type check in dictionary literal
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31375

Test Plan: Imported from OSS

Differential Revision: D19145440

Pulled By: zdevito

fbshipit-source-id: 69909089586149ef766b4858d3420864a81b2493
2019-12-19 16:22:36 -08:00
Rohan Varma
348d42114e Kill MessageType::SHUTDOWN related logic in pg agent (#31270)
Summary:
https://github.com/pytorch/pytorch/pull/30330 got rid of the need to send a `MessageType::SHUTDOWN` message, so we can now remove the logic/utils for this type of message.

I think we can also delete the enum entry in the `enum MessageType`, but we may want to keep it in case the logic in https://github.com/pytorch/pytorch/pull/30710 is ever moved to C++.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31270

Test Plan: All existing unit tests pass

Differential Revision: D19146983

Pulled By: rohan-varma

fbshipit-source-id: 35b185411f9446d7d4dfc37a6cb5477cf041e647
2019-12-19 13:47:43 -08:00
davidriazati
57caeb3fc1 Fix builtins table (#31492)
Summary:
Fixes a bad merge that is breaking distributed tests on master
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31492

Pulled By: driazati

Differential Revision: D19180978

fbshipit-source-id: f69f525e2c7f61194686f07cf75db00eb642882f
2019-12-19 13:33:15 -08:00
Jerry Zhang
226c2d79ce Get QScheme from observer module (#31293)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31293

Previously we checked the number of elements in scale to determine whether we are using per-channel quantization,
but we should get the qscheme information directly from the observer module; we'll expose this information
to the caller as well.
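
A minimal sketch of the distinction (assuming the default observer configurations):

```
from torch.quantization import MinMaxObserver, PerChannelMinMaxObserver

# The observer's qscheme, not the number of scale elements, identifies
# per-tensor vs. per-channel quantization.
print(MinMaxObserver().qscheme)            # torch.per_tensor_affine
print(PerChannelMinMaxObserver().qscheme)  # torch.per_channel_affine
```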

Test Plan:
.

Imported from OSS

Differential Revision: D19146669

fbshipit-source-id: ea430eeae0ef8f441be39aa6dcc1bb530b065554
2019-12-19 13:33:11 -08:00
Richard Zou
dbe2f265d0 Better error msg for autograd profiler + multi-worker dataloader crash (#31473)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31473

Mitigates #6313

A common use case for the autograd profiler is to use it to run over an
entire model, including dataloading. The following will crash:
- run the autograd profiler in CUDA mode
- use a multi-worker DataLoader (presumably with the 'fork' start method)

This crashes because the autograd profiler initializes CUDA, and forking after CUDA is initialized is bad.

This PR puts in a nice error message when this happens so that users
aren't too confused. The new error message looks like:
https://gist.github.com/zou3519/903f15c3e86bad4585b7e5ce14cc1b70
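
A minimal sketch of the pattern that now fails with the descriptive error (assumes a CUDA build; the legacy use_cuda flag is shown):

```
import torch
from torch.utils.data import DataLoader, TensorDataset

loader = DataLoader(TensorDataset(torch.randn(8, 2)), num_workers=2)

# CUDA-mode profiling initializes CUDA; forking DataLoader workers afterwards
# used to crash, and now raises a descriptive RuntimeError instead.
with torch.autograd.profiler.profile(use_cuda=True):
    for batch in loader:
        pass
```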

Test Plan:
- Tested locally.
- I didn't add a test case for this because it's hard to write a test
case that doesn't completely stop the rest of our test suite from
running.

Differential Revision: D19178080

Pulled By: zou3519

fbshipit-source-id: c632525ba1f7b168324f1aa55416e5250f56a086
2019-12-19 13:30:19 -08:00
Richard Zou
e67064a96f Exclude generated source docs from Google (#31484)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31484

See https://github.com/pytorch/pytorch/issues/26123 for context.

Previously, when someone googles for `pytorch "adaptive_max_pool2d"`,
https://pytorch.org/docs/stable/_modules/torch/nn/modules/pooling.html
is the first result. This PR changes the docs build script to exclude
all such generated source docs under `_modules/` from Google.

It does this by doing a search for `<head>` and then appending
`<meta name="robots" content="noindex">`.
The [google developer
docs](https://support.google.com/webmasters/answer/93710?hl=en) suggest
that this is the right way to prevent google from indexing the page.

In the future, when the CI
builds documentation (both master and stable docs), the newly created
docs under _modules will have the meta noindex tag.

Test Plan:
- I ran `find "$install_path/_modules" -name "*.html" -print0 | xargs -0
sed -i '/<head>/a \ \ <meta name="robots" content="noindex">'` on a docs
build locally and checked that it does indeed append the meta noindex
tag after `<head>`.
- In a few days we should rerun the search to see if these pages are
still being indexed.

Differential Revision: D19180300

Pulled By: zou3519

fbshipit-source-id: 5f5aa95a85dd9f065607c2a16f4cdd24ed699a83
2019-12-19 13:27:12 -08:00
Richard Zou
8f3c0d541e Speed up Tensor::has_names for unnamed tensors (#31436)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31436

Tensor::has_names is slower than it should be for unnamed tensors
because of the following:
- it always tries to access the TLS for NamesMode. Unnamed tensors don't
need to peek at NamesMode to determine if they have names or not.
- There is some virtual function being called because TensorImpl is in
c10 and NamedTensorMeta is in libtorch.

This PR short-circuits Tensor::has_names for unnamed tensors by
checking whether the underlying TensorImpl holds a pointer to
NamedTensorMeta. If the NamedTensorMeta is nullptr, then the tensor is
definitely unnamed.

Benchmarks:
- I have a dedicated benchmarking machine where I isolate a single CPU
and make sure it runs at a fixed frequency.
- I benchmarked torch.add, which calls `tensor::has_names` three times.
- The TL;DR is that torch.add between size-1 unnamed tensors gets sped up
~200ns after this change which is a 9% improvement.
- Before, on my machine:
https://gist.github.com/zou3519/dfd648a1941d584711d850754e0694bc
- After on my machine:
https://gist.github.com/zou3519/e78f0d8980b43d0d9c3e3e78ecd0d4d5
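
A rough sketch of an equivalent micro-benchmark using timeit (the numbers above came from a dedicated harness, not this snippet):

```
import timeit

import torch

a = torch.randn(1)
b = torch.randn(1)

# torch.add on unnamed size-1 tensors calls Tensor::has_names three times,
# so per-call time here reflects the fast path added in this change.
print(timeit.timeit(lambda: torch.add(a, b), number=100_000))
```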

Test Plan: - run tests

Differential Revision: D19166510

Pulled By: zou3519

fbshipit-source-id: 1888a4e92d29152a5e3b778a95e531087e532f53
2019-12-19 13:19:30 -08:00
anjali411
9d9bc93bfb Added error message to indicate that reduction operations are not supported for dim>=64 (#31476)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/23159
Currently we don't support reduction operations for dim >= 64, so we should give a descriptive RuntimeError indicating this.
Diff: D19179039
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31476
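
A rough sketch of the case that now gets a descriptive error (assumes a build where tensors with 64+ dims can be constructed):

```
import torch

x = torch.zeros([1] * 65)  # a 65-dimensional tensor
x.sum()  # raises a RuntimeError describing the dim >= 64 limit
```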

Differential Revision: D19179039

Pulled By: anjali411

fbshipit-source-id: 58568f64627bf3df6b3e00a1498544c030e74a0e
2019-12-19 13:00:53 -08:00
Elias Ellison
779b128872 add back in reference to jit_unsupported section (#31486)
Summary:
It was added in https://github.com/pytorch/pytorch/pull/31329 and removed in a bad merge in https://github.com/pytorch/pytorch/pull/31138/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31486

Differential Revision: D19181967

Pulled By: eellison

fbshipit-source-id: 7e4b4a9b2042c30ec18f7f737bc4a9a56fac7d92
2019-12-19 12:44:16 -08:00
anjali411
49fe7a7401 Updated documentation for NLLLoss to explain what x, y and w refer to (#31488)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/31385

In the current documentation for NLLLoss, it's unclear what `y` refers to in the math section of the loss description. An issue (https://github.com/pytorch/pytorch/issues/31295) was filed earlier expressing confusion about whether the loss returned for reduction=mean is correct, perhaps because the formula's symbols are not clearly described in the current documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31488
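
A minimal sketch relating the symbols: x is the log-probability input, y the target class indices, and w the per-class weights.

```
import torch
import torch.nn as nn

x = torch.log_softmax(torch.randn(3, 5), dim=1)  # x: log-probabilities (N=3, C=5)
y = torch.tensor([1, 0, 4])                      # y: target class index per sample
w = torch.ones(5)                                # w: per-class weights

loss = nn.NLLLoss(weight=w, reduction='mean')
print(loss(x, y))  # the mean is over w[y_n]-weighted terms, not a plain 1/N average
```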

Differential Revision: D19181391

Pulled By: anjali411

fbshipit-source-id: 8b75f97aef93c92c26ecbce55b3faf2cd01d3e74
2019-12-19 12:28:16 -08:00
Jerry Zhang
d6acc87c93 Guard against copying from quantized Tensor to non-quantized Tensor (#29660)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29660

att
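
A minimal sketch of the newly guarded operation:

```
import torch

q = torch.quantize_per_tensor(torch.randn(4), scale=0.1, zero_point=0,
                              dtype=torch.quint8)
y = torch.empty(4)
y.copy_(q)  # now raises instead of silently copying quantized data into a float tensor
```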

Test Plan:
python test/test_quantized_tensor.py

Imported from OSS

Differential Revision: D18799897

fbshipit-source-id: 5d1b4ef84f5ae8eba830784b74485d78fa1e6fcf
2019-12-19 12:16:44 -08:00
peterjc123
c4121ed8db Fix is_fundamental template for MSVC (#30959)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/30932
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30959

Differential Revision: D18891797

Pulled By: mingbowan

fbshipit-source-id: e6c36ee80065e66117873e768f86f507c48aaef1
2019-12-19 12:10:22 -08:00
svcscm
6d6a91fb0f Updating submodules
Summary:
GitHub commits:

58a1ec274c
24da1c8b66
77d5ba7887
c7b80d7ab5

Test Plan: n/a

Reviewed By: tgreenidge

fbshipit-source-id: be872df9014b795b279b93bd81efbaa41f2d0fd7
2019-12-19 12:05:29 -08:00
davidriazati
28376e826d Fix lint
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31463

Pulled By: driazati

Differential Revision: D19173580

fbshipit-source-id: 6e5bb24949ec357c4d5b29a16d1733b664f21e05
2019-12-19 10:17:01 -08:00
Gregory Chanan
540b9da41e Bump numba version in circleCI config to 0.46.0. (#31435)
Summary:
The current numba version doesn't appear to actually work with our numba-cuda tests (numba.cuda.is_available() fails).

Previous attempts to upgrade were blocked by https://github.com/numba/numba/issues/4368.

It's a bit unclear to me, but I believe 0.46.0 fixes the above issue. I'm verifying that we catch that issue in CI via https://github.com/pytorch/pytorch/pull/31434.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31435

Differential Revision: D19166865

Pulled By: gchanan

fbshipit-source-id: e01fa48c577e35de178423db7a7f79ac3dd3894d
2019-12-19 07:55:55 -08:00
Nikolay Korovaiko
fc3103b116 fixing a naming issue in creating a residual loop node in a bailout graph (#31400)
Summary:
This addresses the issue of differentiating between the `%4` in
`%12 : int, %y.1 : Tensor = prim::Loop(%9, %6, %4, %3)` and the `%4` in `%y.5 : Double(3) = aten::cat(%22, %4) # test_jit.py:3772:24` inside the loop body of a residual continuation loop, because these should be different values.

```
[DUMP profiling_graph_executor_impl.cpp:124] with prim::BailoutTemplate_0 = graph(%z.1 : int,
[DUMP profiling_graph_executor_impl.cpp:124]       %size.1 : int):
[DUMP profiling_graph_executor_impl.cpp:124]   %2 : Tensor = prim::Constant[value= 1  1 [ CPUDoubleType{2} ]]()
[DUMP profiling_graph_executor_impl.cpp:124]   %3 : Double(2) = prim::BailOut[index=0](%2, %z.1, %size.1)
[DUMP profiling_graph_executor_impl.cpp:124]   %4 : int = prim::Constant[value=0]() # test_jit.py:3772:54
[DUMP profiling_graph_executor_impl.cpp:124]   %5 : None = prim::Constant()
[DUMP profiling_graph_executor_impl.cpp:124]   %6 : bool = prim::Constant[value=1]() # test_jit.py:3770:16
[DUMP profiling_graph_executor_impl.cpp:124]   %counters.1 : int[] = prim::ListConstruct()
[DUMP profiling_graph_executor_impl.cpp:124]   %8 : int = prim::Constant[value=8]()
[DUMP profiling_graph_executor_impl.cpp:124]   %9 : int = aten::__round_to_zero_floordiv(%size.1, %8)
[DUMP profiling_graph_executor_impl.cpp:124]   %10 : int = aten::mul(%9, %8)
[DUMP profiling_graph_executor_impl.cpp:124]   %11 : int = aten::sub(%size.1, %10)
[DUMP profiling_graph_executor_impl.cpp:124]   %12 : int, %y.1 : Tensor = prim::Loop(%9, %6, %4, %3) # test_jit.py:3770:16
[DUMP profiling_graph_executor_impl.cpp:124]     block0(%i.2 : int, %15 : int, %y.7 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:124]       %17 : Double(2) = prim::BailOut[index=1](%y.7, %z.1, %counters.1, %9, %11, %i.2, %15)
[DUMP profiling_graph_executor_impl.cpp:124]       %18 : int[] = aten::append(%counters.1, %15) # test_jit.py:3771:20
[DUMP profiling_graph_executor_impl.cpp:124]       %19 : int[] = prim::ListConstruct(%z.1)
[DUMP profiling_graph_executor_impl.cpp:124]       %20 : Tensor = aten::ones(%19, %5, %5, %5, %5) # test_jit.py:3772:38
[DUMP profiling_graph_executor_impl.cpp:124]       %21 : Double(1) = prim::BailOut[index=2](%20, %z.1, %counters.1, %9, %11, %i.2, %15, %17)
[DUMP profiling_graph_executor_impl.cpp:124]       %22 : Tensor[] = prim::ListConstruct(%17, %21)
[DUMP profiling_graph_executor_impl.cpp:124]       %y.5 : Double(3) = aten::cat(%22, %4) # test_jit.py:3772:24
[DUMP profiling_graph_executor_impl.cpp:124]       %24 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:124]       %25 : int = aten::add(%15, %24)
[DUMP profiling_graph_executor_impl.cpp:124]       %26 : int[] = aten::append(%counters.1, %25) # test_jit.py:3771:20
[DUMP profiling_graph_executor_impl.cpp:124]       %27 : int[] = prim::ListConstruct(%z.1)
[DUMP profiling_graph_executor_impl.cpp:124]       %28 : Tensor = aten::ones(%27, %5, %5, %5, %5) # test_jit.py:3772:38
[DUMP profiling_graph_executor_impl.cpp:124]       %29 : Double(1) = prim::BailOut[index=3](%28, %z.1, %counters.1, %9, %11, %i.2, %y.5, %25)
[DUMP profiling_graph_executor_impl.cpp:124]       %30 : Tensor[] = prim::ListConstruct(%y.5, %29)
[DUMP profiling_graph_executor_impl.cpp:124]       %y.9 : Double(4) = aten::cat(%30, %4) # test_jit.py:3772:24
[DUMP profiling_graph_executor_impl.cpp:124]       %32 : int = aten::add(%25, %24)
[DUMP profiling_graph_executor_impl.cpp:124]       %33 : int[] = aten::append(%counters.1, %32) # test_jit.py:3771:20
[DUMP profiling_graph_executor_impl.cpp:124]       %34 : int[] = prim::ListConstruct(%z.1)
[DUMP profiling_graph_executor_impl.cpp:124]       %35 : Tensor = aten::ones(%34, %5, %5, %5, %5) # test_jit.py:3772:38
[DUMP profiling_graph_executor_impl.cpp:124]       %36 : Double(1) = prim::BailOut[index=4](%35, %z.1, %counters.1, %9, %11, %i.2, %y.9, %32)
[DUMP profiling_graph_executor_impl.cpp:124]       %37 : Tensor[] = prim::ListConstruct(%y.9, %36)
[DUMP profiling_graph_executor_impl.cpp:124]       %y.10 : Double(5) = aten::cat(%37, %4) # test_jit.py:3772:24
[DUMP profiling_graph_executor_impl.cpp:124]       %39 : int = aten::add(%32, %24)
[DUMP profiling_graph_executor_impl.cpp:124]       %40 : int[] = aten::append(%counters.1, %39) # test_jit.py:3771:20
[DUMP profiling_graph_executor_impl.cpp:124]       %41 : int[] = prim::ListConstruct(%z.1)
[DUMP profiling_graph_executor_impl.cpp:124]       %42 : Tensor = aten::ones(%41, %5, %5, %5, %5) # test_jit.py:3772:38
[DUMP profiling_graph_executor_impl.cpp:124]       %43 : Double(1) = prim::BailOut[index=5](%42, %z.1, %counters.1, %9, %11, %i.2, %y.10, %39)
[DUMP profiling_graph_executor_impl.cpp:124]       %44 : Tensor[] = prim::ListConstruct(%y.10, %43)
[DUMP profiling_graph_executor_impl.cpp:124]       %y.11 : Double(6) = aten::cat(%44, %4) # test_jit.py:3772:24
[DUMP profiling_graph_executor_impl.cpp:124]       %46 : int = aten::add(%39, %24)
[DUMP profiling_graph_executor_impl.cpp:124]       %47 : int[] = aten::append(%counters.1, %46) # test_jit.py:3771:20
[DUMP profiling_graph_executor_impl.cpp:124]       %48 : int[] = prim::ListConstruct(%z.1)
[DUMP profiling_graph_executor_impl.cpp:124]       %49 : Tensor = aten::ones(%48, %5, %5, %5, %5) # test_jit.py:3772:38
[DUMP profiling_graph_executor_impl.cpp:124]       %50 : Double(1) = prim::BailOut[index=6](%49, %z.1, %counters.1, %9, %11, %i.2, %y.11, %46)
[DUMP profiling_graph_executor_impl.cpp:124]       %51 : Tensor[] = prim::ListConstruct(%y.11, %50)
[DUMP profiling_graph_executor_impl.cpp:124]       %y.12 : Double(7) = aten::cat(%51, %4) # test_jit.py:3772:24
[DUMP profiling_graph_executor_impl.cpp:124]       %53 : int = aten::add(%46, %24)
[DUMP profiling_graph_executor_impl.cpp:124]       %54 : int[] = aten::append(%counters.1, %53) # test_jit.py:3771:20
[DUMP profiling_graph_executor_impl.cpp:124]       %55 : int[] = prim::ListConstruct(%z.1)
[DUMP profiling_graph_executor_impl.cpp:124]       %56 : Tensor = aten::ones(%55, %5, %5, %5, %5) # test_jit.py:3772:38
[DUMP profiling_graph_executor_impl.cpp:124]       %57 : Double(1) = prim::BailOut[index=7](%56, %z.1, %counters.1, %9, %11, %i.2, %y.12, %53)
[DUMP profiling_graph_executor_impl.cpp:124]       %58 : Tensor[] = prim::ListConstruct(%y.12, %57)
[DUMP profiling_graph_executor_impl.cpp:124]       %y.13 : Double(8) = aten::cat(%58, %4) # test_jit.py:3772:24
[DUMP profiling_graph_executor_impl.cpp:124]       %60 : int = aten::add(%53, %24)
[DUMP profiling_graph_executor_impl.cpp:124]       %61 : int[] = aten::append(%counters.1, %60) # test_jit.py:3771:20
[DUMP profiling_graph_executor_impl.cpp:124]       %62 : int[] = prim::ListConstruct(%z.1)
[DUMP profiling_graph_executor_impl.cpp:124]       %63 : Tensor = aten::ones(%62, %5, %5, %5, %5) # test_jit.py:3772:38
[DUMP profiling_graph_executor_impl.cpp:124]       %64 : Double(1) = prim::BailOut[index=8](%63, %z.1, %counters.1, %9, %11, %i.2, %y.13, %60)
[DUMP profiling_graph_executor_impl.cpp:124]       %65 : Tensor[] = prim::ListConstruct(%y.13, %64)
[DUMP profiling_graph_executor_impl.cpp:124]       %y.14 : Double(9) = aten::cat(%65, %4) # test_jit.py:3772:24
[DUMP profiling_graph_executor_impl.cpp:124]       %67 : int = aten::add(%60, %24)
[DUMP profiling_graph_executor_impl.cpp:124]       %68 : int[] = aten::append(%counters.1, %67) # test_jit.py:3771:20
[DUMP profiling_graph_executor_impl.cpp:124]       %69 : int[] = prim::ListConstruct(%z.1)
[DUMP profiling_graph_executor_impl.cpp:124]       %70 : Tensor = aten::ones(%69, %5, %5, %5, %5) # test_jit.py:3772:38
[DUMP profiling_graph_executor_impl.cpp:124]       %71 : Double(1) = prim::BailOut[index=9](%70, %z.1, %counters.1, %9, %11, %i.2, %y.14, %67)
[DUMP profiling_graph_executor_impl.cpp:124]       %72 : Tensor[] = prim::ListConstruct(%y.14, %71)
[DUMP profiling_graph_executor_impl.cpp:124]       %y.15 : Tensor = aten::cat(%72, %4) # test_jit.py:3772:24
[DUMP profiling_graph_executor_impl.cpp:124]       %74 : int = aten::add(%67, %24)
[DUMP profiling_graph_executor_impl.cpp:124]       -> (%6, %74, %y.15)
[DUMP profiling_graph_executor_impl.cpp:124]   %75 : Double(10) = prim::BailOut[index=10](%y.1, %z.1, %counters.1, %11, %12)
[DUMP profiling_graph_executor_impl.cpp:124]   %76 : int, %y : Tensor = prim::Loop(%11, %6, %12, %75) # test_jit.py:3770:16
[DUMP profiling_graph_executor_impl.cpp:124]     block0(%i.1 : int, %79 : int, %y.6 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:124]       %81 : Double(*) = prim::BailOut[index=11](%y.6, %z.1, %counters.1, %11, %i.1, %79)
[DUMP profiling_graph_executor_impl.cpp:124]       %82 : int[] = aten::append(%counters.1, %79) # test_jit.py:3771:20
[DUMP profiling_graph_executor_impl.cpp:124]       %83 : int[] = prim::ListConstruct(%z.1)
[DUMP profiling_graph_executor_impl.cpp:124]       %84 : Tensor = aten::ones(%83, %5, %5, %5, %5) # test_jit.py:3772:38
[DUMP profiling_graph_executor_impl.cpp:124]       %85 : Double(1) = prim::BailOut[index=12](%84, %counters.1, %11, %i.1, %79, %81)
[DUMP profiling_graph_executor_impl.cpp:124]       %86 : Tensor[] = prim::ListConstruct(%81, %85)
[DUMP profiling_graph_executor_impl.cpp:124]       %y.4 : Tensor = aten::cat(%86, %4) # test_jit.py:3772:24
[DUMP profiling_graph_executor_impl.cpp:124]       %88 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:124]       %89 : int = aten::add(%79, %88)
[DUMP profiling_graph_executor_impl.cpp:124]       -> (%6, %89, %y.4)
[DUMP profiling_graph_executor_impl.cpp:124]   %90 : Double(12) = prim::BailOut[index=13](%y, %counters.1)
[DUMP profiling_graph_executor_impl.cpp:124]   %91 : (Tensor, int[]) = prim::TupleConstruct(%90, %counters.1)
[DUMP profiling_graph_executor_impl.cpp:124]   return (%91)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31400

Differential Revision: D19172750

Pulled By: Krovatkin

fbshipit-source-id: 85d3aac4e80b65b83b6be3c0bca8075a731a2b7e
2019-12-19 00:34:50 -08:00
David Riazati
1e116a5089 Revert D19054937: Add support for del
Test Plan: revert-hammer

Differential Revision:
D19054937

Original commit changeset: c535ea16a9e6

fbshipit-source-id: e57d31811441947b7ee38c8c2b16eecde5005792
2019-12-18 22:39:41 -08:00
Junjie Bai
489dd6cb90 Add TORCH_DCHECK macro that checks only in debug builds (#31240)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31240

Follow up on discoveries/discussions in https://github.com/pytorch/pytorch/pull/30810

Mimic the `DCHECK` macro from https://github.com/pytorch/pytorch/blob/e5eb871/c10/util/logging_is_not_google_glog.h#L117-L125

With this change the perf gap is eliminated:

```
================================================================================
Program Output:
================================================================================
Run on (36 X 1601 MHz CPU s)
2019-12-12 20:12:13
-----------------------------------------------------------------
Benchmark                          Time           CPU Iterations
-----------------------------------------------------------------
BM_IntrusivePtrCtorDtor           23 ns         23 ns   30914703
BM_SharedPtrCtorDtor              27 ns         27 ns   25895944
BM_IntrusivePtrArray/16          503 ns        503 ns    1392139
BM_IntrusivePtrArray/32         1006 ns       1006 ns     695749
BM_IntrusivePtrArray/64         2013 ns       2013 ns     347714
BM_IntrusivePtrArray/128        4024 ns       4024 ns     173964
BM_IntrusivePtrArray/256        8047 ns       8047 ns      86994
BM_IntrusivePtrArray/512       16106 ns      16106 ns      43461
BM_IntrusivePtrArray/1024      32208 ns      32207 ns      21731
BM_IntrusivePtrArray/2048      64431 ns      64430 ns      10865
BM_IntrusivePtrArray/4096     128940 ns     128938 ns       5429
BM_SharedPtrArray/16             503 ns        503 ns    1392128
BM_SharedPtrArray/32            1006 ns       1006 ns     695940
BM_SharedPtrArray/64            2012 ns       2012 ns     347817
BM_SharedPtrArray/128           4024 ns       4023 ns     173927
BM_SharedPtrArray/256           8069 ns       8069 ns      86741
BM_SharedPtrArray/512          16143 ns      16142 ns      43357
BM_SharedPtrArray/1024         32283 ns      32283 ns      21685
BM_SharedPtrArray/2048         64718 ns      64717 ns      10817
BM_SharedPtrArray/4096        129469 ns     129466 ns       5407
================================================================================
```
```
================================================================================
Program Output:
================================================================================
Run on (80 X 2001 MHz CPU s)
2019-12-12 20:12:23
-----------------------------------------------------------------
Benchmark                          Time           CPU Iterations
-----------------------------------------------------------------
BM_IntrusivePtrCtorDtor           18 ns         18 ns   38630411
BM_SharedPtrCtorDtor              22 ns         22 ns   32356114
BM_IntrusivePtrArray/16          402 ns        402 ns    1739637
BM_IntrusivePtrArray/32          805 ns        805 ns     869818
BM_IntrusivePtrArray/64         1610 ns       1609 ns     434881
BM_IntrusivePtrArray/128        3218 ns       3218 ns     217437
BM_IntrusivePtrArray/256        6436 ns       6436 ns     108739
BM_IntrusivePtrArray/512       12882 ns      12882 ns      54356
BM_IntrusivePtrArray/1024      25763 ns      25763 ns      27177
BM_IntrusivePtrArray/2048      51532 ns      51531 ns      13590
BM_IntrusivePtrArray/4096     103091 ns     103091 ns       6778
BM_SharedPtrArray/16             402 ns        402 ns    1740165
BM_SharedPtrArray/32             804 ns        804 ns     869035
BM_SharedPtrArray/64            1610 ns       1610 ns     434975
BM_SharedPtrArray/128           3218 ns       3218 ns     217505
BM_SharedPtrArray/256           6457 ns       6457 ns     108510
BM_SharedPtrArray/512          12909 ns      12909 ns      54249
BM_SharedPtrArray/1024         25810 ns      25810 ns      27127
BM_SharedPtrArray/2048         51763 ns      51763 ns      13531
BM_SharedPtrArray/4096        103506 ns     103505 ns       6759
================================================================================
```

Test Plan:
buck test caffe2/c10/...
buck test mode/opt caffe2/c10/...

Differential Revision: D18998243

fbshipit-source-id: ddf0a118a80efe032b52d403867c1f416c721590
2019-12-18 21:55:58 -08:00
Elias Ellison
fb24f7c4ad catch all exceptions in converting default values to ivalues (#31398)
Summary:
Previously we would only catch `py::cast_error`, which led to incomprehensible error messages like `TypeError: 'NoneType' object is not iterable`. We are running arbitrary pybind code here and not doing anything with the error message, so we should be less restrictive about the types of errors we catch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31398

Differential Revision: D19166655

Pulled By: eellison

fbshipit-source-id: 84db8b3714c718b475913f2f4bb6f19e62f2d9ec
2019-12-18 20:27:46 -08:00
Jerry Zhang
1bb6c51421 Fix getAttribute (#31011)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31011

`getAttribute` is supposed to throw when the attribute is not found,
rather than return a `nullptr`.

Test Plan:
.

Imported from OSS

Differential Revision: D18898417

fbshipit-source-id: 0fe7d824b978ad19bb5ef094d3aa560e9fc57f87
2019-12-18 19:27:39 -08:00
Jeremy Lilley
dff7b945bf Avoid sending large unneeded data over wire in process_group_agent. (#31357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31357

If a user selects a subset of a Tensor and sends it in an RPC, we were sending
the whole original Tensor Storage over the network.

While this sounds reasonable, in practice, we observed view-like Tensors being sent
over rpc, where only 1% of the data in the provided Tensor's Storage was
actually used/needed.

The simple solution here is to just force a clone in the serializer code if we see that
less than (arbitrary) half the bits are used, and the tensor is more than a nominal few KB.
Add related tests to ensure this doesn't break.

An alternate approach would be to modify the Pickler. That said, since Pickler is shared by more
components, the logic might be harder to tailor appropriately at that layer (particularly
given that the Pickler has explicit logic to share a single Storage* among several Tensors
that commonly point to the same Storage*).

It's possible that we might want to further refine the basic thresholds in this change.
In practice, we've seen a mostly bimodal distribution thus far for the percent of Tensor
Storage referred by a Tensor in observed rpcs (i.e. either 90%+ or sub-10% of the Storage
referenced), hence the existing 50% threshold here is probably not an unreasonable
starting point.
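
A minimal sketch of the situation (sizes illustrative):

```
import torch

big = torch.randn(1000, 1000)
view = big[:10, :10]  # 100 elements, but it points at the 1,000,000-element Storage

# Serializing `view` naively ships the whole Storage. The serializer now does
# the equivalent of this clone when a tensor uses less than half of a
# non-trivially-sized Storage:
payload = view.clone()  # a contiguous copy holding only the 100 needed elements
```
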
ghstack-source-id: 95925474

Test Plan: buck test mode/dev caffe2/test/cpp/rpc/...

Differential Revision: D19137056

fbshipit-source-id: e2b3a4dd0cc6e1de820fd0740aa1d59883dbf8d4
2019-12-18 19:24:24 -08:00
svcscm
1bb800cf5c Updating submodules
Summary:
GitHub commits:

f5d37bdcfd
21ba9e3692
576eeaee27
7ba1f57d53
e520f8f5b3
54f9092b0c
88bb770ce1
d91888de6c
ff06eb0881
fdaeb6ea30
1fd432f00f
60b7cb3408

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: f63bd0a879f4d08e159f530f595067f5a09ffe70
2019-12-18 18:41:23 -08:00
Jerry Zhang
fe707c7849 Use default_observer and default_weight_observer in tests (#31424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31424

att

Test Plan:
test_jit.py

Imported from OSS

Differential Revision: D19162368

fbshipit-source-id: 33b95ba643eeeae942283bbc33f7ceda8d14c431
2019-12-18 18:35:07 -08:00
davidriazati
e1509cb468 Add support for del (#31273)
Summary:
Adds the `del` keyword to the parser and corresponding `aten::Delete` op for lists and dicts

Fixes #20615
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31273

Pulled By: driazati

Differential Revision: D19054937

fbshipit-source-id: c535ea16a9e62d176f8ad45947670fc3535af77c
2019-12-18 18:19:22 -08:00
Michael Suo
e7d25a3e4d add a suggested alternative to _get_trace_graph
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31441

Test Plan: Imported from OSS

Differential Revision: D19165646

Pulled By: suo

fbshipit-source-id: 96a264bc55ceafd798d92b986d319cddbb0d9c69
2019-12-18 17:34:25 -08:00
Kaikai Wang
d2e66b44cc Temporary fix to support building pytorch from fbsource (for xplat dependencies) (#31393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31393

pytorch build was set up with the include paths (-I) relative to fbcode/. This works well for fbcode builds, but doesn't work for the new fbcode_deps args for xplat build targets that work across xplat and fbcode. When these targets are built, the include paths need to be relative to fbsource, so the fbcode/ prefix needs to be added to those paths.

Longer term, to properly fix this, we need to use raw_headers with public_include_directories specified for all of these targets.

Test Plan: buck test mode/dev //papaya/integration/service/local/test:mnist_federated_system_test -- 'MnistFederatedSystemTest\.test' --run-disabled

Reviewed By: mzlee

Differential Revision: D19148465

fbshipit-source-id: a610e84bf4cad5838e54e94bae71b957c4b6d4b5
2019-12-18 17:30:57 -08:00
James Reed
a3cdb7eca3 Fix default instantation of dynamic quantized LSTM
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31433

Test Plan: Imported from OSS

Differential Revision: D19164539

Pulled By: jamesr66a

fbshipit-source-id: 7045817ab3dfb530c4480a10523c4c6bcdbfc7eb
2019-12-18 16:59:00 -08:00
Tristan Rice
1e80ff7a67 autograd/profiler: make record_function more threadsafe (#31346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31346

This makes it so that if profiling is enabled/disabled from a different thread while a RecordFunction span is active via an op it doesn't crash the process.

We currently see when using torch.distributed.rpc to enable/disable profiling on other nodes while other things are running.
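
A minimal sketch of the kind of span that this change protects (the profiler may be toggled from another thread while it is open):

```
import torch

# If another thread enables or disables profiling while this span is active,
# the process no longer crashes.
with torch.autograd.profiler.record_function("forward_pass"):
    y = torch.relu(torch.randn(8))
```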

Test Plan: buck test //caffe2/test:autograd -- test_record_function

Reviewed By: albanD

Differential Revision: D19133258

fbshipit-source-id: 30712b06c6aa051789948de2918dcfb9b78967ba
2019-12-18 16:27:42 -08:00
davidriazati
148bcd3ee5 Add support for builtins as attributes (#31269)
Summary:
Fixes #27495

This adds builtins as another piece of a concrete type. They're separate from normal functions since they represent the `BuiltinFunction` sugared value (which is a direct call to a builtin op). It also moves the builtins related logic from `jit/__init__.py` to `jit/_builtins.py` so it can be used from `jit/_recursive.py` to look up functions in the builtins table.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31269
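
A minimal sketch of the now-supported pattern (module and attribute names are illustrative):

```
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.op = torch.add  # a builtin function stored as an attribute

    def forward(self, x):
        return self.op(x, x)

scripted = torch.jit.script(M())  # resolves the builtin via the builtins table
print(scripted(torch.ones(2)))
```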

Pulled By: driazati

Differential Revision: D19149779

fbshipit-source-id: d4e5e5d7d7d528b75a2f503e6004394251a4e82d
2019-12-18 15:24:45 -08:00
davidriazati
503a4e9019 Cleanup after moving language reference (#31146)
Summary:
Stacked PRs
 * **#31146 - [jit] Cleanup after moving language reference**
 * #31138 - [jit] Move TorchScript language reference to its own page

Preview: https://driazati.github.io/pytorch_doc_previews/jit.html#torchscript-language

Pull Request resolved: https://github.com/pytorch/pytorch/pull/31146

Pulled By: driazati

Differential Revision: D19167390

fbshipit-source-id: f28daed36754a553264fc8ac142ed22c3e26d63e
2019-12-18 15:09:35 -08:00
davidriazati
ae2487bf4d Move TorchScript language reference to its own page (#31138)
Summary:
Stacked PRs
 * #31146 - [jit] Cleanup after moving language reference
 * **#31138 - [jit] Move TorchScript language reference to its own page**

Preview: https://driazati.github.io/pytorch_doc_previews/jit.html#torchscript-language

Pull Request resolved: https://github.com/pytorch/pytorch/pull/31138

Pulled By: driazati

Differential Revision: D19167375

fbshipit-source-id: d37110d85fc8b8d2c741be49846e873de1357c2a
2019-12-18 15:09:31 -08:00
Yanghan Wang
d08250c223 fix zero-batch handling in convtranspose (#24341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24341

ConvTransposeOp doesn't crash for zero-batch, but it doesn't modify the output blob. This leads to buggy behaviour especially when running the same network twice using different input, or backprop during training.

Seems `ConvTransposeUnpoolBase<Context>::GetOutputSize` works for zero-batch, so I remove the check for `input.numel() > 0`, and reshape the output blob before returning.

For CudnnConvTransposeGradientOp, it's a bit verbose to set `dfilter` and `dbias` by hand, and cuDNN seems to handle the zero-batch case itself, so simply remove the `X.numel() == 0` branch.

Test Plan: buck test mode/dev-nosan caffe2/caffe2/python/operator_test:conv_transpose_test -- --run-disabled

Reviewed By: BIT-silence

Differential Revision: D16807606

fbshipit-source-id: 0d72c5bd8f2e03c34465e7b530cca548d9bdd5e1
2019-12-18 15:06:36 -08:00
davidriazati
7692494c67 Fix hex literal parsing (#29935)
Summary:
Stacked PRs
 * #29940 - [jit] Fix parsing of big float literals
 * **#29935 - [jit] Fix hex literal parsing**
 * #29931 - [jit] Throw a better error for int too big for int64_t

Previously these were all parsed as `0`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29935
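
A minimal sketch of the fix in TorchScript:

```
import torch

@torch.jit.script
def mask() -> int:
    return 0xFF  # previously parsed as 0

print(mask())  # 255
```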

Pulled By: driazati

Differential Revision: D19124944

fbshipit-source-id: 1ee0c1dee589933363a5efba069a2cfaf94373c5
2019-12-18 14:00:22 -08:00
davidriazati
1f50cfc24d Throw a better error for int too big for int64_t
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29931

Pulled By: driazati

Differential Revision: D19124934

fbshipit-source-id: 91841d7ba4f2f6142c51fba07b7faa14bb817e3a
2019-12-18 14:00:16 -08:00
Elias Ellison
fb30a48b4e add unsupported section (#31329)
Summary:
Add a section for unsupported ops and modules. Automatically generate the properties and attributes that aren't bound, and for ops that have semantic mismatches, set up tests so the docs stay up to date.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31329

Differential Revision: D19164472

Pulled By: eellison

fbshipit-source-id: 46290bb8a64d9de928cfb1eda5ff4558c3799c88
2019-12-18 13:56:02 -08:00
Andreas Koepf
5e8bac24b4 Migrate soft_margin_loss from the TH to Aten (CUDA+CPU) (#28135)
Summary:
Fix: https://github.com/pytorch/pytorch/issues/24631, https://github.com/pytorch/pytorch/issues/24632, https://github.com/pytorch/pytorch/issues/24764, https://github.com/pytorch/pytorch/issues/24765

Port of TH SoftMarginCriterion to ATen using un-fused tensor operators but with custom backward code. This is a follow-up/fix of the reverted PR https://github.com/pytorch/pytorch/issues/27673.
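
For reference, a minimal usage sketch of the migrated op:

```
import torch
import torch.nn.functional as F

x = torch.randn(8, requires_grad=True)
y = torch.sign(torch.randn(8))  # targets in {-1, 1}

loss = F.soft_margin_loss(x, y)  # now runs through ATen on both CPU and CUDA
loss.backward()
```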

Benchmark results:

CPU became faster, GPU slower. Reaching the previous TH performance probably requires manual fusion.

### WITH patch
```
CPU warmup 1000 took 7.997200009413064e-05
CPU warmup 10000 took 0.0008116499957395718
CPU warmup 100000 took 0.0012691459996858612
CPU warmup TOTAL time 0.0021982479956932366
CPU forward 1000 took 7.320100849028677e-05
CPU forward 10000 took 0.00015837099635973573
CPU forward 100000 took 0.0010471990099176764
CPU forward 1000000 took 0.01238470000680536
CPU forward 10000000 took 0.12747182900784537
CPU forward 100000000 took 1.2076255190040683
CPU forward TOTAL time 1.3488940890092636
CPU for- & backward 1000 took 0.00032587299938313663
CPU for- & backward 10000 took 0.0006926299975020811
CPU for- & backward 100000 took 0.002146183993318118
CPU for- & backward 1000000 took 0.019158899012836628
CPU for- & backward 10000000 took 0.2957490350090666
CPU for- & backward 100000000 took 1.7630806300003314
CPU for- & backward TOTAL time 2.081367089995183

GPU warmup 1000 took 0.0004558280052151531
GPU warmup 10000 took 0.0002567449992056936
GPU warmup 100000 took 0.0001593509950907901
GPU warmup TOTAL time 0.0009442300070077181
GPU forward 1000 took 0.00015061900194268674
GPU forward 10000 took 0.00015258099301718175
GPU forward 100000 took 0.00015409699699375778
GPU forward 1000000 took 0.0008183339959941804
GPU forward 10000000 took 0.004424853003001772
GPU forward 100000000 took 0.04356115800328553
GPU forward TOTAL time 0.04938192600093316
GPU for- & backward 1000 took 0.0008062430133577436
GPU for- & backward 10000 took 0.0006074949924368411
GPU for- & backward 100000 took 0.0007091690058587119
GPU for- & backward 1000000 took 0.001022183001623489
GPU for- & backward 10000000 took 0.009945805999450386
GPU for- & backward 100000000 took 0.0944173600000795
GPU for- & backward TOTAL time 0.28060428200114984
```

### WITHOUT patch
```
CPU warmup 1000 took 6.394000956788659e-05
CPU warmup 10000 took 0.00038220599526539445
CPU warmup 100000 took 0.0034939230099553242
CPU warmup TOTAL time 0.003981974994530901
CPU forward 1000 took 4.7855006414465606e-05
CPU forward 10000 took 0.000347569992300123
CPU forward 100000 took 0.003367935001733713
CPU forward 1000000 took 0.03605044000141788
CPU forward 10000000 took 0.35935167300340254
CPU forward 100000000 took 3.630371332008508
CPU forward TOTAL time 4.029640004009707
CPU for- & backward 1000 took 0.00028494100843090564
CPU for- & backward 10000 took 0.0006738200027029961
CPU for- & backward 100000 took 0.0051178760040784255
CPU for- & backward 1000000 took 0.04925115800870117
CPU for- & backward 10000000 took 0.7172313440096332
CPU for- & backward 100000000 took 5.441953932997421
CPU for- & backward TOTAL time 6.21466830400459

GPU warmup 1000 took 0.001803738996386528
GPU warmup 10000 took 0.00041877900366671383
GPU warmup 100000 took 0.0003870719956466928
GPU warmup TOTAL time 0.0026561370032140985
GPU forward 1000 took 0.00037833399255760014
GPU forward 10000 took 0.00038825398951303214
GPU forward 100000 took 0.0003841099969577044
GPU forward 1000000 took 0.0007090550061548129
GPU forward 10000000 took 0.0016171559982467443
GPU forward 100000000 took 0.013463679002597928
GPU forward TOTAL time 0.017010531009873375
GPU for- & backward 1000 took 0.0007374050037469715
GPU for- & backward 10000 took 0.0006343529967125505
GPU for- & backward 100000 took 0.0006375070079229772
GPU for- & backward 1000000 took 0.0007550300069851801
GPU for- & backward 10000000 took 0.002672752001672052
GPU for- & backward 100000000 took 0.023170708998804912
GPU for- & backward TOTAL time 0.20251446698966902
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28135

Differential Revision: D18001447

Pulled By: VitalyFedyunin

fbshipit-source-id: ad90dc1cca42dcaf3ea9e17e4f8fd79cee0a293e
2019-12-18 13:33:59 -08:00
xiaobing.zhang
7cf8b9bada Move leaky_relu to Aten(CPU, CUDA) (#29899)
Summary:
VitalyFedyunin, this PR ports the LeakyReLU activation to ATen.
**Test script:**
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)
def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"
m = nn.LeakyReLU()
if torch.cuda.is_available():
    device = "cuda"
    m = m.cuda()

#warm up
for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    for i in range(1000):
        output = m(input)
        output.backward(grad_output)

for n in [100, 10000]:
    fwd_t = 0
    bwd_t = 0
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    for i in range(10000):
        t1 = _time()
        output = m(input)
        t2 = _time()
        output.backward(grad_output)
        t3 = _time()
        fwd_t = fwd_t + (t2 -t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backwad avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))
```
Test Device: CPU: skx-8180, GPU: Tesla P40.
Performance:
Before:
```
GPU:
input size(128, 100) forward time is 0.05 (ms); backward avg time is 0.11 (ms).
input size(128, 10000) forward time is 0.06 (ms); backward avg time is 0.17 (ms).

CPU:
OMP_NUM_THREADS=56
input size(128, 100) forward time is 0.05 (ms); backward avg time is 0.14 (ms).
input size(128, 10000) forward time is 4.21 (ms); backward avg time is 8.02 (ms).
OMP_NUM_THREADS=1
input size(128, 100) forward time is 0.02 (ms); backward avg time is 0.07 (ms).
input size(128, 10000) forward time is 1.98 (ms); backward avg time is 6.21 (ms).
```
After:
```
GPU:
input size(128, 100) forward time is 0.05 (ms); backward avg time is 0.11 (ms).
input size(128, 10000) forward time is 0.06 (ms); backward avg time is 0.17 (ms).

CPU:
OMP_NUM_THREADS=56
input size(128, 100) forward time is 0.02 (ms); backward avg time is 0.04 (ms).
input size(128, 10000) forward time is 0.03 (ms); backward avg time is 0.09 (ms).
OMP_NUM_THREADS=1
input size(128, 100) forward time is 0.01 (ms); backward avg time is 0.02 (ms).
input size(128, 10000) forward time is 0.47 (ms); backward avg time is 1.02 (ms).
```
How to set the numbers of thread? using following script:
```
num_threads=$1
script=$2
last_core=`expr $num_threads - 1`
echo "using $num_threads OMP threads"
echo "bind cores to 0~$last_core"
export OMP_NUM_THREADS=$num_threads
export KMP_AFFINITY=granularity=fine,compact,1,0
numactl --physcpubind=0-$last_core --membind=0 python $script
```
and run **./run.sh num_threads test.py**.

Fixes https://github.com/pytorch/pytorch/issues/24583 #24584 https://github.com/pytorch/pytorch/issues/24720 #24721
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29899

Differential Revision: D18816231

Pulled By: VitalyFedyunin

fbshipit-source-id: afb1e43a99317d17f50cff1b593cd8f7a0a83da2
2019-12-18 13:14:11 -08:00
Tristan Rice
b0bd35ff13 caffe2/event: allow multiple errors such as when cancelled (#31335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31335

When an error occurs in a net we end up cancelling all the async ops. If one error occurs it's highly likely other errors will occur as well.

Typically we see:
1. SendOp failed due to a network error
2. async scheduling cancels all other ops via `SetFinished("Cancelled");`
3. Another SendOp fails due to a network error and crashes the process when the exception is thrown.

This changes caffe2 ops to allow failing twice.

Test Plan: buck test //caffe2/caffe2:caffe2_test_cpu

Reviewed By: andrewwdye

Differential Revision: D19106548

fbshipit-source-id: 4b7882258a240894cc16d061a563c83a3214d3d9
2019-12-18 13:10:57 -08:00
Mingbo Wan
4d22c3ba01 fix docker login, add docker image tag list after purge as html (#31328)
Summary:
example of the generated html: http://ossci-docker.s3-website.us-east-1.amazonaws.com/pytorch.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31328

Differential Revision: D19147113

Pulled By: mingbowan

fbshipit-source-id: 5104e92d4490f047a6474e2b12aed3293b52a9df
2019-12-18 12:08:51 -08:00
Pavel Belevich
47766e648f C++ API parity: MultiheadAttention
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27309

Test Plan: Imported from OSS

Differential Revision: D17766736

Pulled By: pbelevich

fbshipit-source-id: 7a5f2399f081945d31d4c13d7a8d248c387fc1a6
2019-12-18 10:13:29 -08:00