Commit Graph

12304 Commits

Author SHA1 Message Date
Mikhail Zolotukhin
29da553dd9 [TensorExpr] Loopnest: unify intermediate_tensors_ and temp_bufs_. (#45947)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45947

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24155999

Pulled By: ZolotukhinM

fbshipit-source-id: d82acf6aba570f6a675eea683c306088e2a41f91
2020-10-08 00:58:08 -07:00
Mikhail Zolotukhin
598caddd93 [TensorExpr] Add shorthand versions for splitWith{Mask,Tail} functions. (#45946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45946

Also, make these functions static - they are not using anything from
`LoopNest` and can be applied to any `Stmt`.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24156002

Pulled By: ZolotukhinM

fbshipit-source-id: 1c7d205f85a2a1684e07eb836af662f10d0a50fc
2020-10-08 00:58:06 -07:00
Mikhail Zolotukhin
b65ffa365c [TensorExpr] Nuke Function class and directly use Tensor instead. (#45936)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45936

`Tensor` has been a view into a `Function` that was supposed to be used
for a more general case when we have multiple computations over the same
domain (aka multiple output functions). We have never got to a point
where we need this and now have other ideas in mind on how to support
this case if need be. For now, let's just nuke `Function` to reduce the
overall system complexity.

The change should not affect any existing behavior.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24153214

Pulled By: ZolotukhinM

fbshipit-source-id: 26d5f11db5d661ff5e1135f4a49eff1c6d4c1bd5
2020-10-08 00:55:31 -07:00
Nikita Shulga
c19b9cd18d Add torch::cuda::nccl::all2all (#45900)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45900

Use `torch::cuda::nccl::all2all` from `ProcessGroupNCCL.cpp`

Fixes https://github.com/pytorch/pytorch/issues/42517

Here is a NCCL dependency graph:
```
libnccl.a --> libtorch_cuda.so ---> libtorch_python.so
    |                                   ^
    |                                   |
    --------> libc10d.a -----------------
```
When a static library is linked into a dynamic library or an executable, the linker removes all unused/duplicate symbols from that library unless the `-whole-archive` option is used. Before https://github.com/pytorch/pytorch/pull/42514, all NCCL calls made from `ProcessGroupNCCL.cpp` were also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`.
But adding `ncclSend`|`ncclRecv` to `ProcessGroupNCCL.cpp` forced the linker to embed those into `libtorch_python.so`, which also resulted in linking other dependent symbols into the library.

This PR adds `nccl[Send|Recv]` call to `torch_cuda.so` by implementing `all2all` in `torch_cuda` and thus avoids double linking the static library.

A more involved, but less error-prone, solution would be to use wrappers exported in the `torch::cuda::nccl` namespace instead of making direct NCCL API calls.

Test Plan: Imported from OSS

Reviewed By: mingzhe09088

Differential Revision: D24138011

Pulled By: malfet

fbshipit-source-id: 33305197fc7d8707b7fd3a66b543f7733b9241a1
2020-10-07 23:56:31 -07:00
Kurt Mohler
ef4817fe5a Add tensor_split function, based on numpy.array_split (#45168)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/9382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45168
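
As a rough illustration of the `numpy.array_split` behavior this function is based on, the sketch below computes chunk sizes for a 1-D split into a given number of sections. It is plain Python rather than the PyTorch implementation; the helper names `split_sizes` and `split_list` are hypothetical.

```python
def split_sizes(length, sections):
    """Chunk sizes for splitting `length` items into `sections` parts."""
    if sections <= 0:
        raise ValueError("sections must be positive")
    base, extra = divmod(length, sections)
    # The first `extra` chunks each get one additional element.
    return [base + 1 if i < extra else base for i in range(sections)]


def split_list(items, sections):
    """Split a list into `sections` nearly equal chunks (hypothetical helper)."""
    chunks, start = [], 0
    for size in split_sizes(len(items), sections):
        chunks.append(items[start:start + size])
        start += size
    return chunks
```

Unlike an even split, this never fails when the length is not divisible by the number of sections; trailing chunks are simply shorter (or empty).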

Reviewed By: ngimel

Differential Revision: D24166164

Pulled By: mruberry

fbshipit-source-id: 795459821e52885bc99623a01a2abec060995ce6
2020-10-07 23:14:48 -07:00
James Reed
00b8ebe60c [FX] Preserve type annotations on generated code in Graph (#45880)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45880

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D24127303

Pulled By: jamesr66a

fbshipit-source-id: 3a042bcfb0bf9f58ac318cc814dfc3cca683c7f8
2020-10-07 21:34:47 -07:00
Nick Gibson
19da1d22fe [NNC] Registerizer V2, supporting partial and conditional replacement (#45574)
Summary:
This is a rewrite of the Registerizer, supporting scalar replacement in *vastly* more situations. As a refresher, the registerizer does this:

Before:
```
A[0] = 0;
for (int x = 0; x < 10; x++) {
  A[0] = (A[0]) + x;
}
```
After:
```
int A_ = 0;
for (int x = 0; x < 10; x++) {
  A_ = x + A_;
}
A[0] = A_;
```

This can greatly reduce the number of accesses to main memory in a kernel. There are cases where doing this gets complicated, and the existing implementation bails out whenever it encounters multiple partial overlaps of the same buffer, or conditional accesses under any circumstances. This makes it much less useful in the presence of complex (i.e. real-world, not example) kernels. This new version should work optimally in almost all cases (I have a few minor follow-ups).
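
To make the saving concrete, here is a small Python sketch (not NNC code) that replays the two kernels above against a buffer that counts its loads and stores. `CountingBuffer` is a hypothetical helper introduced only for this illustration.

```python
class CountingBuffer:
    """A list wrapper that counts element loads and stores."""

    def __init__(self, size):
        self.data = [0] * size
        self.loads = 0
        self.stores = 0

    def __getitem__(self, i):
        self.loads += 1
        return self.data[i]

    def __setitem__(self, i, value):
        self.stores += 1
        self.data[i] = value


def kernel_before(A):
    A[0] = 0
    for x in range(10):
        A[0] = A[0] + x      # one load and one store per iteration


def kernel_after(A):
    a_ = 0                   # scalar stands in for A[0]
    for x in range(10):
        a_ = x + a_
    A[0] = a_                # single write-back at the end


before, after = CountingBuffer(1), CountingBuffer(1)
kernel_before(before)
kernel_after(after)
```

Both kernels leave the same value in the buffer, but the registerized form touches it once instead of on every iteration.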

I tested this version extensively, and found quite a few bugs in the original implementation I'd prefer not to back port fixes for - so I'm in favor of landing this even if we don't immediately see a perf win. I believe the killer app for this kind of optimization is fused reductions and we haven't enabled many examples of that yet.

It is safe to move two accesses of the same Tensor element to a local scalar Var if between all usages of the element there are no other Loads or Stores that may refer to it. In the comments I refer to this as overlapping the access, or "cutting" the existing AccessInfo. In the case where a candidate for registerization is cut, it may be possible to finalize the access early by writing it back to the Tensor and then create a new scalar variable after the overlapping access is complete. We will attempt to do this when it saves memory accesses.

There are a few cases that make this more challenging:

 - For: Loops change the number of real usages of a buffer by the loop extent, but only if we can pull the definition and finalization of the scalar variable out of the loop block. For loops often create accesses which are conditional on a loop var and will overlap large ranges of elements.

E.g. Before:
```
A[0] = 2;
for (int x1 = 0; x1 < 10; x1++) {
  A[0] = (A[0]) + x1;
}
for (int x2 = 1; x2 < 10; x2++) {
  A[x2] = A[x2 - 1];
}
for (int x3 = 0; x3 < 10; x3++) {
  A[0] = (A[0]) + x3;
}
```
After:
```
int A_1 = 2;
for (int x1 = 0; x1 < 10; x1++) {
  A_1 = A_1 + x1;
}
A[0] = A_1;
for (int x2 = 1; x2 < 10; x2++) {
  A[x2] = A[x2 - 1];
}
int A_2 = A[0];
for (int x3 = 0; x3 < 10; x3++) {
  A_2 = A_2 + x3;
}
A[0] = A_2;
```
- Cond: Conditions complicate lifting scalars out of internal scopes. Generally we cannot lift an access outside of a conditional scope unless there is already a reference to that same access at the higher scope, since we don't know if the condition was guarding an array access not safe at the higher scope. In the comments I refer to this as the condition "hiding" the access, and the outer access "unhiding" it.

E.g. this example:
```
if (x<5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
A[x] = (A[x]) + 1;
if (x>5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
```
The A[x] access can be registerized due to the unconditional access between the two conditions:
```
int A_1 = A[x];
if (x<5 ? 1 : 0) {
  A_1 = A_1 + 1;
}
A_1 = A_1 + 1;
if (x>5 ? 1 : 0) {
  A_1 = A_1 + 1;
}
A[x] = A_1;
```
But this example has no accesses that can be registerized:
```
if (x<5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
if (x>5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
```

- IfThenElse: Same situation as Cond, except since IfThenElse is an Expr rather than a Stmt we cannot insert the scalar definition or finalizer within the conditional scope. Accesses inside an IfThenElse can be safely combined with external accesses but cannot exist completely within.

E.g. in this example `B[x]` cannot be registerized as there is no safe place to define it.
```
A[x] = IfThenElse(x<3 ? 1 : 0, (B[x]) + (B[x]), B[x]);
```

But the equivalent kernel using Cond can be registerized:
```
if (x<3 ? 1 : 0) {
  float B_1 = B[x];
  A[x] = B_1 + B_1;
} else {
  A[x] = B[x];
}
```
- Let: Accesses dependent on local variables via Let Stmts, or loop vars, cannot be raised outside of the scope of the dependent var.

E.g. no accesses in this example can be registerized:
```
for (int x = 0; x < 10; x++) {
  int y = 30;
  A[y] = x + (A[y]);
}
```

But they can in this example:
```
int y = 30;
for (int x = 0; x < 10; x++) {
  A[y] = x + (A[y]);
}
```
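
The Cond rule above (an access cannot be lifted out of a condition that may be guarding it) can be illustrated with a Python sketch in which the condition is a bounds check. This is an assumed scenario chosen for illustration, not one of the PR's test cases, and both function names are hypothetical.

```python
def guarded_increment(A, x):
    # The condition guards the access: A[x] is only touched when x is in bounds.
    if x < len(A):
        A[x] = A[x] + 1


def hoisted_increment(A, x):
    # Unsafe registerization: the load and the write-back are lifted outside
    # the condition, so they execute even when the guard would have skipped them.
    a_1 = A[x]
    if x < len(A):
        a_1 = a_1 + 1
    A[x] = a_1
```

With an in-bounds `x` the two functions agree; with an out-of-bounds `x` only the hoisted version raises `IndexError`, which is the kind of hidden access the registerizer has to respect.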

**Testing**

The majority of this PR is tests, over 3k lines of them, because there are many different rules to consider and they can interact together more or less arbitrarily. I'd greatly appreciate any ideas for situations we could encounter that are not covered by the tests.

**Performance**

Still working on it, will update. In many FastRNNS sub-kernels this diff reduces the total number of calls to Store or Load by 4x, but since those kernels use Concat very heavily (meaning a lot of branches) the actual number encountered by any particular thread on GPU is reduced only slightly. Overall perf improved by a very small amount.

Reductions are where this optimization should really shine, and in particular the more complex the kernel gets (with extra fusions, etc.) the better this version of the registerizer should do compared to the existing version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45574

Reviewed By: albanD

Differential Revision: D24151517

Pulled By: nickgg

fbshipit-source-id: 9f0b2d98cc213eeea3fda16fee3d144d49fd79ae
2020-10-07 18:17:27 -07:00
Elias Ellison
c86655a815 [JIT] Fix Dict bug in constant hashing (#45929)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45929

We were checking `and` when we should have been checking `or`.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24148804

Pulled By: eellison

fbshipit-source-id: 9c394ea10ac91a588169d934b1e3208512c71b9d
2020-10-07 17:40:17 -07:00
Elias Ellison
72e4f51bc0 [JIT] fix dict update (#45857)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45857

Fix for https://github.com/pytorch/pytorch/issues/45627

Op was calling `insert` instead of `insert_or_assign`, so it wouldn't overwrite an existing key.
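
The difference between the two calls can be mirrored in Python as a loose analogy: `dict.setdefault` keeps an existing value, much like C++ `insert`, while plain assignment overwrites it, like `insert_or_assign`. The wrapper names below are only for illustration.

```python
def insert(d, key, value):
    """Mimics C++ map insert: keeps the existing value if the key is present."""
    d.setdefault(key, value)


def insert_or_assign(d, key, value):
    """Mimics C++ insert_or_assign: always stores the new value."""
    d[key] = value


a = {"k": 1}
insert(a, "k", 2)            # existing key is left untouched
b = {"k": 1}
insert_or_assign(b, "k", 2)  # existing key is overwritten
```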

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24148805

Pulled By: eellison

fbshipit-source-id: bf39c71d5d928890b82cff1a9a0985dc47c1ffac
2020-10-07 17:36:02 -07:00
Natalia Gimelshein
de0d0bd5ee Revert D24093032: Improve logging in ProcessGroupNCCL for debugging purposes.
Test Plan: revert-hammer

Differential Revision:
D24093032 (c8d76ff7dc)

Original commit changeset: 240b03562f8c

fbshipit-source-id: dab7d54a5ba517bb308a1825b0d63ed146e5269d
2020-10-07 16:41:35 -07:00
Wanchao Liang
505be08c75 [dist_optim] serialize compilation when creating dist_optim (#45871)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45871

Attempt to fix https://github.com/pytorch/pytorch/issues/45845

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D24125209

Pulled By: wanchaol

fbshipit-source-id: e3697dd6ef107d8153d2a82d78a17c66d109b4fa
2020-10-07 15:10:41 -07:00
Andy Zhang
ce82b522c8 Define objects using classes instead of namedtuples in torch.utils.data._utils.worker (#45870)
Summary:
This PR fixes a bug when torch is used with pyspark, by converting namedtuples in `torch.utils.data._utils.worker` into classes.

Before this PR, creating an IterableDataset and then running `list(torch.utils.data.DataLoader(MyIterableDataset(...), num_workers=2))` does not terminate if pyspark is also being used. This is because pyspark hijacks namedtuples to make them pickleable ([see here](https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L370)). So `_IterableDatasetStopIteration` would be modified, and then the check at [this line in dataloader.py](5472426b9f/torch/utils/data/dataloader.py (L1072)) is never true.
Converting the namedtuples to classes avoids this hijack and allows the iteration to correctly stop when signaled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45870

Reviewed By: ngimel

Differential Revision: D24162748

Pulled By: albanD

fbshipit-source-id: 52f009784500fa594b2bbd15a8b2e486e00c37fb
2020-10-07 15:03:38 -07:00
Pritam Damania
c8d76ff7dc Improve logging in ProcessGroupNCCL for debugging purposes. (#45780)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45780

When training jobs running with NCCL fail, it is sometimes hard to
debug the cause of the failure, and our logging doesn't always provide
enough information to narrow down the issue.

To improve the debugging experience, I've enhanced our logging to add a lot
more information about what the ProcessGroup is doing under the hood.

Closes: https://github.com/pytorch/pytorch/issues/45310

Sample output:
```
> I1002 15:18:48.539551 1822062 ProcessGroupNCCL.cpp:528] [Rank 2] NCCL watchdog thread started!
> I1002 15:18:48.539533 1821946 ProcessGroupNCCL.cpp:492] [Rank 2] ProcessGroupNCCL initialized with following options:
> NCCL_ASYNC_ERROR_HANDLING: 0
> NCCL_BLOCKING_WAIT: 1
> TIMEOUT(ms): 1000
> USE_HIGH_PRIORITY_STREAM: 0
> I1002 15:18:51.080338 1822035 ProcessGroupNCCL.cpp:530] [Rank 1] NCCL watchdog thread terminated normally
> I1002 15:18:52.161218 1821930 ProcessGroupNCCL.cpp:385] [Rank 0] Wrote aborted communicator id to store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> I1002 15:18:52.161238 1821930 ProcessGroupNCCL.cpp:388] [Rank 0] Caught collective operation timeout for work: WorkNCCL(OpType=ALLREDUCE, TensorShape=[10], Timeout(ms)=1000)
> I1002 15:18:52.162120 1821957 ProcessGroupNCCL.cpp:530] [Rank 0] NCCL watchdog thread terminated normally
> I1002 15:18:58.539937 1822062 ProcessGroupNCCL.cpp:649] [Rank 2] Found key in store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, from rank: 0, aborting appropriate communicators
> I1002 15:19:34.740937 1822062 ProcessGroupNCCL.cpp:662] [Rank 2] Aborted communicators for key in store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> I1002 15:19:34.741678 1822062 ProcessGroupNCCL.cpp:530] [Rank 2] NCCL watchdog thread terminated normally
```
ghstack-source-id: 113731163

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24093032

fbshipit-source-id: 240b03562f8ccccc3d872538f5e331df598ceca7
2020-10-07 12:18:41 -07:00
Guilherme Leobas
9679e1affc annotate torch.autograd.* modules (#45004)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45004

Reviewed By: VitalyFedyunin

Differential Revision: D24113562

Pulled By: ezyang

fbshipit-source-id: a85018b7e08b2fe6cf2bc14a217eb418cb2b9de4
2020-10-07 10:53:41 -07:00
Jerry Zhang
83d2c9a232 [quant] Add quantized Sigmoid module (#45883)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45883

Test Plan:
python test/test_quantization.py TestStaticQuantizedModule.test_sigmoid

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D24129116

fbshipit-source-id: aa960549509c60374012f35b1f5be39e90418099
2020-10-07 10:33:18 -07:00
Nikita Vedeneev
30bf799f9c torch.matrix_exp doc fix (#45909)
Summary:
As per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45909

Reviewed By: dzhulgakov

Differential Revision: D24147314

Pulled By: albanD

fbshipit-source-id: fc21094f4dbdd04cc2063a9639b9d1f5728cb53f
2020-10-07 10:23:37 -07:00
neginraoof
5ce31b6f3f [ONNX] Improve error handling for adaptive_pool (#45874)
Summary:
Duplicate of https://github.com/pytorch/pytorch/issues/43032
This update would also improve error handling for interpolate with 'area' mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45874

Reviewed By: albanD

Differential Revision: D24141266

Pulled By: bzinodev

fbshipit-source-id: 7559f1d6af4f1ef3507c15a1aee76fe01fa433cd
2020-10-07 09:20:35 -07:00
Michael Carilli
5640b79bf8 Allow consumer ops to sync on GraphRoot's gradient (#45787)
Summary:
Currently, a GraphRoot instance doesn't have an associated stream.  Streaming backward synchronization logic assumes the instance ran on the default stream, and tells consumer ops to sync with the default stream.  If the gradient the GraphRoot instance passes to consumer backward ops was populated on a non-default stream, we have a race condition.

The race condition can exist even if the user doesn't give a manually populated gradient:
```python
with torch.cuda.stream(side_stream):
    # loss.backward() implicitly synthesizes a one-element 1.0 tensor on side_stream
    # GraphRoot passes it to consumers, but consumers first sync on default stream, not side_stream.
    loss.backward()

    # Internally to backward(), streaming-backward logic takes over, stuff executes on the same stream it ran on in forward,
    # and the side_stream context is irrelevant.  GraphRoot's interaction with its first consumer(s) is the spot where
    # the side_stream context causes a problem.
```

This PR fixes the race condition by associating a GraphRoot instance, at construction time, with the current stream(s) on the device(s) of the grads it will pass to consumers. (I think this relies on GraphRoot executing in the main thread, before backward thread(s) fork, because the grads were populated on the main thread.)

The test demonstrates the race condition. It fails reliably without the PR's GraphRoot diffs and passes with the GraphRoot diffs.

With the GraphRoot diffs, manually populating an incoming-gradient arg for `backward` (or `torch.autograd.grad`) and the actual call to `autograd.backward` will have the same stream-semantics relationship as any other pair of ops:
```python
# implicit population is safe
with torch.cuda.stream(side_stream):
    loss.backward()

# explicit population in side stream then backward in side stream is safe
with torch.cuda.stream(side_stream):
    kickoff_grad = torch.ones_like(loss)
    loss.backward(gradient=kickoff_grad)

# explicit population in one stream then backward kickoff in another stream
# is NOT safe, even with this PR's diffs, but that unsafety is consistent with
# stream-semantics relationship of any pair of ops
kickoff_grad = torch.ones_like(loss)
with torch.cuda.stream(side_stream):
    loss.backward(gradient=kickoff_grad)

# Safe, as you'd expect for any pair of ops
kickoff_grad = torch.ones_like(loss)
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    loss.backward(gradient=kickoff_grad)
```
This PR also adds the last three examples above to cuda docs and references them from autograd docstrings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45787

Reviewed By: nairbv

Differential Revision: D24138376

Pulled By: albanD

fbshipit-source-id: bc4cd9390f9f0358633db530b1b09f9c1080d2a3
2020-10-07 08:53:53 -07:00
James Reed
be45c3401a [JIT] Make objects throw Python AttributeError on nonexistent attr access (#45911)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45911

Test Plan: Imported from OSS

Reviewed By: robieta

Differential Revision: D24140971

Pulled By: jamesr66a

fbshipit-source-id: 046a2cffff898efad5bcc36a41bf992f36f555f9
2020-10-07 01:57:29 -07:00
James Reed
8cdb638c62 [FX] Track use nodes in Node (#45775)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45775

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D24091082

Pulled By: jamesr66a

fbshipit-source-id: b09bb6ae78436a7722fb135b8ec71464ef9587cd
2020-10-07 00:15:04 -07:00
Zachary DeVito
205ab49612 [packaging] simpler dependency plotting (#45686)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45686

This uses an online graphviz viewer. The code is simpler, and
since it embeds all the data in the url you can just click the url
from your terminal.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D24059157

Pulled By: zdevito

fbshipit-source-id: 94d755cc2986c4226180b09ba36f8d040dda47cc
2020-10-06 23:40:00 -07:00
Kurt Mohler
ed1552a48f Add note about in-place weight modification for nn.Embedding (#45595)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/26596

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45595

Reviewed By: albanD

Differential Revision: D24143456

Pulled By: mruberry

fbshipit-source-id: a884a32809105ce16959b40ec745ec873b3c8375
2020-10-06 23:11:39 -07:00
Peter Bell
8b39498a23 codegen: Allow string arguments to have defaults (#45665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45665

Fixes #43944

Note that the codegen doesn't use a proper parser so, in the same way as with lists, the string `, ` cannot appear in defaults or it will be interpreted as a splitting point between arguments.
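
The limitation is easy to reproduce with a naive splitter. The sketch below is not the actual codegen; it only assumes that arguments are separated by splitting on `, `, and the two signatures are hypothetical.

```python
def naive_split_args(signature):
    """Split an argument list on ', ' without parsing string literals."""
    return signature.split(", ")


# A string default without ', ' splits as intended.
ok = naive_split_args('Tensor self, str mode="floor"')

# A string default containing ', ' is cut in the middle of the literal.
broken = naive_split_args('Tensor self, str sep="a, b"')
```

Here `ok` has two entries, while `broken` has three, with the default value torn across the last two.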

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24141835

Pulled By: ezyang

fbshipit-source-id: 578127861fd2504917f4486c44100491a2c40343
2020-10-06 21:53:56 -07:00
Supriya Rao
43dc7ef933 [quant] Support for 4-bit quantized EmbeddingBag module (#45865)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45865

Test Plan:
python test/test_quantization.py TestPostTrainingStatic.test_quantized_embedding_bag
python test/test_quantization.py TestStaticQuantizedModule.test_embedding_bag_api

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D24120995

fbshipit-source-id: c55fc6b2cfd683d14d2a05be7c04f787fdf8cc79
2020-10-06 21:11:52 -07:00
Supriya Rao
11c32611d7 [quant] Support 4-bit embedding_bag operators using the dtype quint4x2 (#45752)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45752

Use the torch.quint4x2 dtype to create 4-bit packed tensors in the previous PR.
These packed tensors can be directly consumed by the operator.
Serialization of the packed tensors is supported using torchbind custom class.
Module support will follow in a later PR.
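
For readers unfamiliar with 4-bit packing, the general idea is to store two 4-bit values in each byte. The sketch below shows this in plain Python. The nibble order (low nibble first) and zero padding are assumptions made for the example and are not taken from the `torch.quint4x2` implementation.

```python
def pack_4bit(values):
    """Pack integers in [0, 15] two per byte (low nibble first, by assumption)."""
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    # Pad odd-length input with a zero so every byte holds two values.
    padded = list(values) + [0] * (len(values) % 2)
    return bytes(lo | (hi << 4) for lo, hi in zip(padded[0::2], padded[1::2]))


def unpack_4bit(packed, count):
    """Recover the first `count` 4-bit values from packed bytes."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)   # low nibble
        out.append(byte >> 4)     # high nibble
    return out[:count]
```

Packing halves the storage needed compared with one value per byte, at the cost of a shift and mask on every access.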

Test Plan:
python test/test_quantization.py TestEmbeddingBagOps

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D24120996

fbshipit-source-id: 2639353b3343ebc69e058b5ba237d3fc56728e1c
2020-10-06 21:11:49 -07:00
Hao Lu
e8d8de32b4 [StaticRuntime] Implement StaticRuntime::benchmark (#45639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45639

`StaticRuntime::run_individual` is meant to mimic the caffe2 operator benchmark `SimpleNet::TEST_Benchmark`, so we can get accurate information on the operator breakdown. We found that the PyTorch AutogradProfiler adds a lot of overhead to small models, such as the adindexer precomputation_merge net: 100% for batch_size 1, 33% for batch_size 20. This implementation adds very little overhead, as shown in the test plan.

Test Plan: Test results are fb internal only.

Reviewed By: yinghai, dzhulgakov

Differential Revision: D24012088

fbshipit-source-id: f32eb420aace93e2de421a15e4209fce6a3d90f0
2020-10-06 20:54:43 -07:00
Meghan Lele
4fdba30500 [JIT] Add API for ignoring arbitrary module attributes (#45262)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45262

**Summary**
This commit adds an API for ignoring arbitrary module attributes during
scripting. A class attribute named `ignored_attributes` containing names
of attributes to ignore can be added to the class of the instance being
scripted. Attributes ignored in this fashion cannot be used in
`forward`, methods used by `forward` or by `exported` methods. They
are, however, copied to the `RecursiveScriptModule` wrapper and can be
used by `ignored` methods and regular Python code.

**Test Plan**
This commit adds unit tests to `TestScriptPy3` to test this new API.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23971882

Pulled By: SplitInfinity

fbshipit-source-id: 8c81fb415fde7b78aa2f87e5d83a477e876a7cc3
2020-10-06 18:02:06 -07:00
Nikita Shulga
49af421143 Embed callgrind headers (#45914)
Summary:
Because access to https://sourceware.org/git/valgrind.git can be really slow especially in some regions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45914

Reviewed By: seemethere

Differential Revision: D24144420

Pulled By: malfet

fbshipit-source-id: a454c8c3182c570ec344bf6468bb5e55d8b8da79
2020-10-06 17:51:10 -07:00
Bert Maher
624084e6d6 [te][llvm] Enable fused multiply-add (fma) in code generation (#45906)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45906

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D24142404

Pulled By: bertmaher

fbshipit-source-id: a8db2e66c1e65bbb255886e165a1773723cbcd20
2020-10-06 16:57:34 -07:00
Jerry Zhang
14997f2125 [quant][graphmode][fx] Add warning for unsupported case (#45714)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45714

Hit the problem when writing a test like the following:
```
class M(...):
      def forward(self, x):
          x = x.some_op()
          return x
```
We need to know the scope of `x` to figure out the qconfig for `x`.

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D24069959

fbshipit-source-id: 95ac8963c802ebce5d0e54d55f5ebb42085ca8a6
2020-10-06 15:33:34 -07:00
Ansley Ussery
5072728d88 Fix stride printing/parsing formatting (#45156)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45156

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24078695

Pulled By: ansley

fbshipit-source-id: dab993277d43b31105c38d12098c37653747b42a
2020-10-06 15:06:46 -07:00
anjali411
a3662fa78c Minor gradcheck update to reduce computations (#45757)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45757

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24137143

Pulled By: anjali411

fbshipit-source-id: e0174ec03d93b1fedf27baa72c3542dac0b70058
2020-10-06 13:59:01 -07:00
Vaidotas Simkus
e154b36685 Standardized clamp kernels to Numpy-like implementation (#43288)
Summary:
**BC-breaking note**

For ease of exposition let a_min be the value of the "min" argument to clamp, and a_max be the value of the "max" argument to clamp.

This PR changes the behavior of torch.clamp to always compute min(max(a, a_min), a_max). torch.clamp currently computes this in its vectorized CPU specializations:

78b95b6204/aten/src/ATen/cpu/vec256/vec256_double.h (L304)

but in other places it clamps differently:

78b95b6204/aten/src/ATen/cpu/vec256/vec256_base.h (L624)

78b95b6204/aten/src/ATen/native/cuda/UnaryOpsKernel.cu (L160)

These implementations are the same when a_min < a_max, but divergent when a_min > a_max. This divergence is easily triggered:

```
t = torch.arange(200).to(torch.float)
torch.clamp(t, 4, 2)[0]
: tensor(2.)

torch.clamp(t.cuda(), 4, 2)[0]
: tensor(4., device='cuda:0')

torch.clamp(torch.tensor(0), 4, 2)
: tensor(4)
```

This PR makes the behavior consistent with NumPy's clip. C++'s std::clamp's behavior is undefined when a_min > a_max, but Clang's std::clamp will return 10 in this case (although the program, per the above comment, is in error). Python has no standard clamp implementation.
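
The divergence between the two orderings can be checked with scalar Python code. This is only a sketch of the arithmetic, not of the kernels themselves; the function names are hypothetical.

```python
def clamp_min_of_max(a, a_min, a_max):
    """min(max(a, a_min), a_max): the ordering this PR standardizes on."""
    return min(max(a, a_min), a_max)


def clamp_max_of_min(a, a_min, a_max):
    """max(min(a, a_max), a_min): the other ordering."""
    return max(min(a, a_max), a_min)


# Same inputs as the first element in the snippet above: a = 0, a_min = 4, a_max = 2.
numpy_like = clamp_min_of_max(0, 4, 2)
other = clamp_max_of_min(0, 4, 2)
```

With `a_min > a_max` the first ordering always returns `a_max` and the second always returns `a_min`; when `a_min <= a_max` they agree for every input.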

**PR Summary**

Fixes the discrepancy between the AVX, CUDA, and base vector implementations of clamp so that all of them use the min(max_vec, max(min_vec, x)) formula, making clamp equivalent to numpy.clip everywhere.

The same fix as in https://github.com/pytorch/pytorch/issues/32587 but isolated to the kernel change only, so that the internal team can benchmark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43288

Reviewed By: colesbury

Differential Revision: D24079453

Pulled By: mruberry

fbshipit-source-id: 67f30d2f2c86bbd3e87080b32f00e8fb131a53f7
2020-10-06 13:42:08 -07:00
Yanan Cao
64681d6bec Add all remaining method declarations from torch.distributed Python API to C++ (#45768)
Summary:
Also ran formatter on previous sections

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45768

Reviewed By: wanchaol

Differential Revision: D24129467

Pulled By: gmagogsfm

fbshipit-source-id: aa8a5c45c3609d5b96e5f585b699d9e3e71394c8
2020-10-06 12:36:36 -07:00
Jerry Zhang
0da6730f02 [quant][graphmode][fx][eagermode] Add leaky relu support in quantization workflows (#45712)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45712

Eager mode will still be able to use functional leaky relu, but it will be less accurate than
LeakyReLU module.
FX graph mode will support both leaky relu functional and module

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D24069961

fbshipit-source-id: 8d91c3c50c0bcd068ba3072378ebb4da9549be3b
2020-10-06 12:16:04 -07:00
Jerry Zhang
8b7ee33ee6 [quant] Add quantized LeakyReLU module (#45711)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45711

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D24069960

fbshipit-source-id: ccdd294308e07fd215556a63fa47191c09a1519f
2020-10-06 11:34:48 -07:00
Nikita Shulga
930bddd403 Cleanup nccl.cpp (#45899)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45899

Use function polymorphism to avoid repeated casts
I.e. instead of using `NCCL_CHECK(from_nccl_result(` add variant of the function that takes `ncclResult_t` as input argument
Add non-pointer variant of `to_nccl_comm` to avoid `*to_nccl_comm(&comm)` pattern

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D24138012

Pulled By: malfet

fbshipit-source-id: 7f62a03e108cbe455910e86e894afdd1c27e8ff1
2020-10-06 11:26:14 -07:00
Peter Bell
d44eaf63d1 torch.fft helper functions (#44877)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44877

Part of gh-42175. This implements the `torch.fft` helper functions: `fftfreq`, `rfftfreq`, `fftshift` and `ifftshift`.

* #43009 Cleanup tracer handling of optional arguments
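
For reference, the four helpers can be sketched on plain Python lists following the conventions of their NumPy counterparts. This is an illustration of the definitions under that assumption, not the `torch.fft` code.

```python
def fftfreq(n, d=1.0):
    """Sample frequencies in FFT order: non-negative bins first, then negative."""
    half = (n - 1) // 2 + 1                      # number of non-negative bins
    bins = list(range(half)) + list(range(-(n // 2), 0))
    return [k / (d * n) for k in bins]


def rfftfreq(n, d=1.0):
    """Non-negative sample frequencies for a real-input FFT."""
    return [k / (d * n) for k in range(n // 2 + 1)]


def fftshift(xs):
    """Rotate right by n // 2 so the zero-frequency bin moves to the centre."""
    shift = len(xs) // 2
    return xs[-shift:] + xs[:-shift]


def ifftshift(xs):
    """Inverse of fftshift: rotate left by n // 2."""
    shift = len(xs) // 2
    return xs[shift:] + xs[:shift]
```

For odd lengths `fftshift` and `ifftshift` differ by one position, which is why both are provided.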

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D24043473

Pulled By: mruberry

fbshipit-source-id: 35de7b70b27658a426773f62d23722045ea53268
2020-10-05 22:04:52 -07:00
Eric Cotner
e4efc420ae Correct Categorical docstring (#45804)
Summary:
Clarified that the `Categorical` distribution will actually accept input of any arbitrary tensor shape, not just 1D and 2D tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45804

Reviewed By: dzhulgakov

Differential Revision: D24125415

Pulled By: VitalyFedyunin

fbshipit-source-id: 5fa1f07911bd85e172199b28d79763428db3a0f4
2020-10-05 21:49:10 -07:00
Pritam Damania
bf85642c4c Remove lock from GraphTask::set_exception_without_signal. (#45867)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45867

In most cases the lock ordering was to hold a lock in local autograd and
then hold a lock in DistAutogradContext.

In case of `set_exception_without_signal` the lock order was in reverse and as
a result we saw potential deadlock issues in our TSAN tests. To fix this, I
removed the lock and instead just used std::atomic exchange.

In addition to this, I fixed TestE2E to ensure that we use the appropriate
timeout.

TestE2EProcessGroup was flaky for these two reasons and now is fixed.
ghstack-source-id: 113592709

Test Plan: waitforbuildbot.

Reviewed By: albanD

Differential Revision: D24120962

fbshipit-source-id: 12447b84ceae772b91e9a183c90d1e6340f44e66
2020-10-05 20:02:29 -07:00
Mingzhe Li
10d86d1196 [NCCL] create NCCL communicator for send/recv on demand (#44922)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44922

For NCCL send/recv operations, we create the NCCL communicator on demand, following the same design currently used for collective operations.
ghstack-source-id: 113592757

Test Plan: to add

Reviewed By: pritamdamania87

Differential Revision: D23773726

fbshipit-source-id: 0d47c29d670ddc07f7181e8485af0e02e2c9cfaf
2020-10-05 18:33:03 -07:00
Mingzhe Li
59083d6176 [NCCL] Support NCCL Send/Recv (#44921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44921

This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See https://github.com/pytorch/pytorch/issues/43995 for more context.
ghstack-source-id: 113592785

Test Plan: unittest

Reviewed By: jiayisuse

Differential Revision: D23709848

fbshipit-source-id: cdf38050379ecbb10450f3394631317b41163258
2020-10-05 18:27:57 -07:00
James Reed
b04ae953b4 [FX][WIP] Mutable Graph APIs (#45227)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45227

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23880730

Pulled By: jamesr66a

fbshipit-source-id: eb4e8c14d7f6b1deb1ddd6cf38a360413a1705ed
2020-10-05 17:07:08 -07:00
Nikita Shulga
1558a3657b Add LazyNVRTC (#45674)
Summary:
Instead of dynamically loading `caffe2_nvrtc`, lazyNVRTC provides the same functionality by binding all the hooks to lazy bind implementation, very similar to the shared library jump tables:
On the first call, each function from the list tries to get a global handle to the respective shared library and replace itself with the dynamically resolved symbol, using the following template:
```
  auto fn = reinterpret_cast<decltype(&NAME)>(getCUDALibrary().sym(C10_SYMBOLIZE(NAME)));
  if (!fn)
    throw std::runtime_error("Can't get" ## NAME);
  lazyNVRTC.NAME = fn;
  return fn(...)
```
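
The self-replacing stub pattern in the template above can be sketched in Python. This is only an analogy under stated assumptions: `LazyTable` is a hypothetical name, and the standard `math` module stands in for the shared library that the real code resolves symbols from.

```python
import math


class LazyTable:
    """Table of entry points whose stubs resolve the real symbol on first use."""

    def __init__(self, library, names):
        self.resolved = []  # names that have been looked up so far
        for name in names:
            setattr(self, name, self._make_stub(library, name))

    def _make_stub(self, library, name):
        def stub(*args, **kwargs):
            fn = getattr(library, name, None)
            if fn is None:
                raise RuntimeError(f"Can't get {name}")
            self.resolved.append(name)
            # Replace the stub so later calls go straight to the real function.
            setattr(self, name, fn)
            return fn(*args, **kwargs)

        return stub


# `math` is a stand-in for the dynamically loaded library (an assumption).
table = LazyTable(math, ["sqrt", "floor"])
```

Nothing is looked up until `table.sqrt(...)` is first called; after that call the table entry is the resolved function itself.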
Fixes https://github.com/pytorch/pytorch/issues/31985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45674

Reviewed By: ezyang

Differential Revision: D24073946

Pulled By: malfet

fbshipit-source-id: 1479a75e5200e14df003144625a859d312885874
2020-10-05 16:27:40 -07:00
Meghan Lele
4ab73c1f74 [docs] Fix EmbeddingBag docs (#45763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45763

**Summary**
This commit updates the documentation for `EmbeddingBag` to say that for
bags of constant length with no per-sample weights, the class is
equivalent to `Embedding` followed by `torch.sum(dim=1)`. The current
docs say `dim=0` and this is readily falsifiable.
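
The claimed equivalence can be sketched without PyTorch, using nested lists in place of tensors. All helper names here are hypothetical, and the integer weight table is made up for illustration.

```python
def embedding(weight, indices):
    """Look up one weight row per index: result is (bags, bag_length, dim)."""
    return [[weight[i] for i in bag] for bag in indices]


def sum_dim0(looked_up):
    """Sum across bags: result is (bag_length, dim)."""
    return [[sum(col) for col in zip(*rows)] for rows in zip(*looked_up)]


def sum_dim1(looked_up):
    """Sum across positions inside each bag: result is (bags, dim)."""
    return [[sum(col) for col in zip(*bag)] for bag in looked_up]


def embedding_bag_sum(weight, indices):
    """Sum-mode bag lookup for constant-length bags without per-sample weights."""
    out = []
    for bag in indices:
        acc = [0] * len(weight[0])
        for i in bag:
            acc = [a + w for a, w in zip(acc, weight[i])]
        out.append(acc)
    return out


# Ten rows of dimension three; the values are arbitrary.
weight = [[r, r ** 3, 1] for r in range(10)]
inputs = [[4, 1, 2, 7], [5, 6, 0, 3]]

by_bag = embedding_bag_sum(weight, inputs)
via_dim1 = sum_dim1(embedding(weight, inputs))
via_dim0 = sum_dim0(embedding(weight, inputs))
```

Summing over dimension 1 yields one row per bag and matches the bag lookup, while summing over dimension 0 yields one row per position inside a bag, which has a different shape altogether.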

**Test Plan**
1) Tried `Embedding` + `sum` with `dim`=0,1 in interpreter and compared
to `EmbeddingBag`
```
>>> import torch
>>> weights = torch.nn.Parameter(torch.randn(10, 3))
>>> e = torch.nn.Embedding(10, 3)
>>> eb = torch.nn.EmbeddingBag(10, 3, mode="sum")
>>> e.weight = weights
>>> eb.weight = weights
# Use 2D inputs because we are trying to test the case in which bags have constant length
>>> inputs = torch.LongTensor([[4,1,2,7],[5,6,0,3]])
>>> eb(inputs)
tensor([[-2.5497, -0.1556, -0.5166],
        [ 2.2528, -0.3627,  2.5822]], grad_fn=<EmbeddingBagBackward>)
>>> torch.sum(e(inputs), dim=0)
tensor([[ 1.6181, -0.8739,  0.8168],
        [ 0.0295,  2.3274,  1.2558],
        [-0.7958, -0.4228,  0.5961],
        [-1.1487, -1.5490, -0.6031]], grad_fn=<SumBackward1>)
>>> torch.sum(e(inputs), dim=1)
tensor([[-2.5497, -0.1556, -0.5166],
        [ 2.2528, -0.3627,  2.5822]], grad_fn=<SumBackward1>)
```
So clearly `torch.sum` with `dim=0` is not correct here.

2) Built docs and viewed in browser.

*Before*
<img width="882" alt="Captura de Pantalla 2020-10-02 a la(s) 12 26 20 p  m" src="https://user-images.githubusercontent.com/4392003/94963035-557be100-04ac-11eb-986c-088965ac3050.png">

*After*
<img width="901" alt="Captura de Pantalla 2020-10-05 a la(s) 11 26 51 a  m" src="https://user-images.githubusercontent.com/4392003/95117732-ea294d80-06fd-11eb-9d6b-9b4e6c805cd0.png">

**Fixes**
This commit closes #43197.

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24118206

Pulled By: SplitInfinity

fbshipit-source-id: cd0d6b5db33e415d8e04ba04f2c7074dcecf3eee
2020-10-05 15:56:35 -07:00
Meghan Lele
78f055272c [docs] Add 3D reduction example to tensordot docs (#45697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45697

**Summary**
This commit adds an example of a reduction over three dimensions with
`torch.tensordot`. It is unclear from existing docs whether `dims`
should be a list of pairs or a pair of lists.
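
For reference, `dims` is read as a pair of lists: the first list names axes of the first tensor and the second list names the axes of the second tensor they are contracted with, position by position. The plain-Python sketch below spells this out for a two-axis contraction with explicit loops; the arrays are illustrative and stand in for tensors of shape (3, 4, 5) and (4, 3, 2).

```python
# a[i][j][k] has shape (3, 4, 5); b[j][i][l] has shape (4, 3, 2).
a = [[[i * 20 + j * 5 + k for k in range(5)] for j in range(4)] for i in range(3)]
b = [[[j * 6 + i * 2 + l for l in range(2)] for i in range(3)] for j in range(4)]

# Equivalent of dims=([1, 0], [0, 1]):
#   axis 1 of a (length 4) is paired with axis 0 of b,
#   axis 0 of a (length 3) is paired with axis 1 of b.
# The remaining axes (k from a, l from b) form the (5, 2) result.
c = [
    [
        sum(a[i][j][k] * b[j][i][l] for i in range(3) for j in range(4))
        for l in range(2)
    ]
    for k in range(5)
]
```

A reduction over three dimensions works the same way, with three entries in each list.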

**Test Plan**
Built the docs locally.

*Before*
<img width="864" alt="Captura de Pantalla 2020-10-01 a la(s) 1 35 46 p  m" src="https://user-images.githubusercontent.com/4392003/94866838-f0b17f80-03f4-11eb-8692-8f50fe3b9863.png">

*After*
<img width="831" alt="Captura de Pantalla 2020-10-05 a la(s) 12 06 28 p  m" src="https://user-images.githubusercontent.com/4392003/95121092-670af600-0703-11eb-959f-73c7797a76ee.png">

**Fixes**
This commit closes #22748.

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24118186

Pulled By: SplitInfinity

fbshipit-source-id: c19b0b7e001f8cd099dc4c2e0e8ec39310510b46
2020-10-05 15:36:59 -07:00
Zachary DeVito
26a9012f84 [fx] import used modules for code gen (#45471)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45471

Instead of assuming that 'torch' is the only module used by generated code,
use the qualified names of builtin functions to generate import statements
for all builtins. This allows user-captured functions to also get code generated correctly.
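
A minimal sketch of the idea, assuming the module is read from each callable's `__module__` attribute. The helper `import_lines` is hypothetical and is not the FX code generator's actual function.

```python
import json
import math


def import_lines(functions):
    """Return sorted `import <module>` lines for the given callables."""
    modules = {
        fn.__module__
        for fn in functions
        if getattr(fn, "__module__", None)
    }
    # Names from `builtins` are always available and need no import.
    modules.discard("builtins")
    return [f"import {name}" for name in sorted(modules)]


lines = import_lines([math.sqrt, json.dumps, len])
```

Here `lines` covers `math` and `json` but not `builtins`, so generated code that calls any of the three functions by qualified name would resolve.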

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23978696

Pulled By: zdevito

fbshipit-source-id: ecbff150e3de38532531cdadbfe4965468f29a38
2020-10-05 15:21:44 -07:00
Dmytro Dzhulgakov
5177f8de2b Revert D23398534: [pytorch][PR] [ONNX] Improve error handling for adaptive_pool
Test Plan: revert-hammer

Differential Revision:
D23398534 (45ddeb5ce6)

Original commit changeset: f2d60d40340f

fbshipit-source-id: acc9d6c3d031662c37447fcee027b0c97b8492a7
2020-10-05 15:16:59 -07:00
Ansley Ussery
f18cc9c57d Change type inferred from empty annotation (#45360)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45360

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24078645

Pulled By: ansley

fbshipit-source-id: 5d37d07df75bd7a2111d44638befe53c1021ee82
2020-10-05 15:16:56 -07:00
KyleCZH
a9a9d0b181 Rocm skip test cases (#45782)
Summary:
Skip the following test cases for rocm (When PYTORCH_TEST_WITH_ROCM=1):
- test_reference_numerics_tan_cuda_float64 (__main__.TestUnaryUfuncsCUDA)
- test_addmv_cuda_float16 (__main__.TestTorchDeviceTypeCUDA)
- test_logspace_cuda_float64 (__main__.TestTensorCreationCUDA)
- test_gloo_backend_2gpu_module (__main__.DistributedDataParallelTest)
jeffdaily
pruthvistony

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45782

Reviewed By: VitalyFedyunin

Differential Revision: D24115581

Pulled By: xw285cornell

fbshipit-source-id: 4043a9fa19e242301b5007813c15b6b3873889c5
2020-10-05 15:12:25 -07:00
Lillian Johnson
9a668f94bb [jit] allow slicing multiple dimensions with indices (#45239)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45239

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23886919

Pulled By: Lilyjjo

fbshipit-source-id: d45c2a550fa8df9960cf2ab5da9d1ae0058a967a
2020-10-05 15:03:54 -07:00
Taras Galkovskyi
f11f9a8c1f [pytorch][improvement] Improve torch logging to identify problematic key (#45766)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45766

As per subj, making KeyError message more verbose.

Test Plan:
Verified that breakage can be successfully investigated with verbose error message
unit tests

Reviewed By: esqu1

Differential Revision: D24080362

fbshipit-source-id: f4e22a78809e5cff65a69780d5cbbc1e8b11b2e5
2020-10-05 14:54:52 -07:00
Jerry Zhang
21fa877026 [quant][test] Remove numeric equivalence test for debug and non-debug option (#45852)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45852

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24115329

fbshipit-source-id: ad32e68cbd54431fd440c8437a4361905a5dbdad
2020-10-05 14:11:07 -07:00
Jane Xu
ffbffc0436 fixed formatting in function rstrings in torch.autograd.functional (#45849)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44426

The changes look like:
![Screen Shot 2020-10-05 at 12 34 32 PM](https://user-images.githubusercontent.com/31798555/95107954-9839f500-0708-11eb-88b0-444486f53061.png)
(compare with https://pytorch.org/docs/stable/autograd.html#torch.autograd.functional.jacobian)

and also
![Screen Shot 2020-10-05 at 12 35 15 PM](https://user-images.githubusercontent.com/31798555/95107966-9bcd7c00-0708-11eb-979a-b3578b8203da.png)
(compare with https://pytorch.org/docs/stable/autograd.html#torch.autograd.functional.hessian)

and lastly
![Screen Shot 2020-10-05 at 12 38 19 PM](https://user-images.githubusercontent.com/31798555/95107971-9e2fd600-0708-11eb-9919-5b809f5f0f20.png)
(compare with https://pytorch.org/docs/stable/autograd.html#torch.autograd.functional.hvp)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45849

Reviewed By: albanD

Differential Revision: D24114223

Pulled By: janeyx99

fbshipit-source-id: bfea5f0d594933db4b2c400291d330f747f518e8
2020-10-05 13:39:01 -07:00
Pritam Damania
b5a2f04089 Disallow creation of ProcessGroupNCCL without GPUs. (#45642)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45642

Prior to https://github.com/pytorch/pytorch/pull/45181, initializing a
NCCL process group would work even if no GPUs were present. However, now that
init_process_group calls `barrier()`, this would fail.

In general the problem was that we could initialize ProcessGroupNCCL without
GPUs and then if we called a method like `barrier()` the process would crash
since we do % numGPUs resulting in division by zero.
ghstack-source-id: 113490343

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24038839

fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc
2020-10-05 12:05:48 -07:00
Negin Raoof
45ddeb5ce6 [ONNX] Improve error handling for adaptive_pool (#43032)
Summary:
This would also improve error handling for interpolate with 'area' mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43032

Reviewed By: malfet

Differential Revision: D23398534

Pulled By: bzinodev

fbshipit-source-id: f2d60d40340f46e7c0499ea73c1e39945713418d
2020-10-05 11:53:14 -07:00
Nikolay Korovaiko
adc21c6db2 Rename jobs and cli switches for testing GraphExecutor configurations to something a little bit more sensical. (#45715)
Summary:
Rename jobs for testing GraphExecutor configurations to something a little bit more sensical.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45715

Reviewed By: ezyang, anjali411

Differential Revision: D24114344

Pulled By: Krovatkin

fbshipit-source-id: 89e5f54aaebd88f8c5878e060e983c6f1f41b9bb
2020-10-05 11:43:28 -07:00
Thomas Viehmann
3ab88c3903 Enable TorchBind tests on ROCm (#45426)
Summary:
The torchbind tests didn't work because somehow we missed the rename of caffe2_gpu to torch_... (hip for us) in https://github.com/pytorch/pytorch/issues/20774 (merged 2019-06-13, oops) and still tried to link against it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45426

Reviewed By: VitalyFedyunin

Differential Revision: D24112439

Pulled By: walterddr

fbshipit-source-id: a66a574e63714728183399c543d2dafbd6c028f7
2020-10-05 09:38:12 -07:00
kshitij12345
f65ab89edd [numpy] Add torch.nan_to_num (#44592)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515

TODO:
* [x] Add tests
* [x] Add docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44592

Reviewed By: colesbury

Differential Revision: D24079472

Pulled By: mruberry

fbshipit-source-id: 2b67d36cba46eaa7ca16cd72671b57750bd568bc
2020-10-05 01:38:56 -07:00
James Reed
2ab74a4839 [FX] Make Tracer.trace() just return a Graph (#45704)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45704

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D24067982

Pulled By: jamesr66a

fbshipit-source-id: c82aa6be504d45e110055a3c4db129d0b9ac3ef5
2020-10-03 21:13:48 -07:00
Hao Lu
8a6b919163 [StaticRuntime] Fix broken tests (#45813)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45813

Fix tests broken by D23996656 (2b48dd168d).

Test Plan:
```
buck test mode/opt //pytorch/tensorboardX:test_pytorchtb -- 'test_pytorch_graph \(pytorch\.tensorboardX\.tests\.test_pytorch_graph\.PytorchGraphTest\)'
buck test mode/opt //pytext/tests:
buck test mode/dev-nosan //mobile-vision/projects/detectron2go/tests:test_caffe2_compatibles
```

Reviewed By: yinghai

Differential Revision: D24100807

fbshipit-source-id: e2f92aadca4161f5cf9f552e922fb4d6500af3a4
2020-10-03 16:54:22 -07:00
Nikita Shulga
24fa2daea6 Revert D24100389: Revert D24072697: [te] Get llvm codegen to compile with llvm9 and llvm-fb
Test Plan: revert-hammer

Differential Revision:
D24100389

Original commit changeset: b32c5163e4fb

fbshipit-source-id: 9ce7bfbcf411c0584e5d535ee107fb5a135ee6e6
2020-10-03 15:33:42 -07:00
Nikita Shulga
ff568a0e6b Revert D24072697: [te] Get llvm codegen to compile with llvm9 and llvm-fb
Test Plan: revert-hammer

Differential Revision:
D24072697 (e3d2defdc8)

Original commit changeset: 7f56b9f3cbe5

fbshipit-source-id: b32c5163e4fb6df99447f95fdb82674e5ae62f22
2020-10-03 12:27:26 -07:00
Hao Lu
2b48dd168d [StaticRuntime] Integrate Static Runtime into PyTorchPredictor (#45640)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45640

Reviewed By: dzhulgakov

Differential Revision: D23996656

fbshipit-source-id: 63d88c89d1df61a04deadc472319607ed83867e5
2020-10-02 23:03:05 -07:00
Edward Yang
546aab66c1 Revert D24027761: Update backward definition for more operators and reenable tests in test_ops.py
Test Plan: revert-hammer

Differential Revision:
D24027761 (7d809f5d8e)

Original commit changeset: c1f707c2a039

fbshipit-source-id: 30750d2f08886036fb8b2cd0ae51c7732d3b7b19
2020-10-02 18:52:57 -07:00
Michael Suo
31621c828d Fix JIT tests when run locally in fbcode (#45776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45776

Splitting out backend and custom class registration into their own library is
not currently implemented in fbcode, so detect that we are running tests in
fbcode and disable those tests.

Test Plan: buck test mode/no-gpu mode/dev caffe2/test:jit

Reviewed By: smessmer

Differential Revision: D24085871

fbshipit-source-id: 1fcc0547880bc4be59428e2810b6a7f6e50ef798
2020-10-02 17:43:01 -07:00
James Reed
53aea60bce [FX] Make output a non-special Node (#45599)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45599

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D24027586

Pulled By: jamesr66a

fbshipit-source-id: 747c25e3c7668ca45f03bed0be71fd3c9af67286
2020-10-02 17:08:17 -07:00
Xiang Gao
2fa062002e CUDA BFloat16 infrastructure (#44925)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44925

Reviewed By: agolynski

Differential Revision: D23783910

Pulled By: ngimel

fbshipit-source-id: dacac2ad87d58056bdc68bfe0b7ab1de5c2af0d8
2020-10-02 16:21:30 -07:00
Shen Li
8cb7280242 Revert "Remove device maps from TensorPipe for v1.7 release (#45353)" (#45762)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45762

This reverts commit 5211fb97ac.

Test Plan: Imported from OSS

Reviewed By: colesbury

Differential Revision: D24088231

Pulled By: mrshenli

fbshipit-source-id: b6ee15ec5ae137ea127bdc2db8e1842764bc01d4
2020-10-02 15:14:05 -07:00
Yanan Cao
d150d3e276 Make sure each warnings.warn only executes once inside TorchScript. (#45382)
Summary:
* Add a pass at end of runCleanupPasses to annotate `aten::warn` so that each has its unique id
* Enhanced interpreter so that it tracks which `aten::warn` has been executed before and skip them
* Improved insertInstruction so that it correctly checks for overflow
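
The first two bullets can be pictured with a small sketch: every warn site carries a unique id, and the interpreter remembers which ids have already fired. The class and method names below are hypothetical and do not mirror the TorchScript interpreter's real structure.

```python
class InterpreterSketch:
    """Toy interpreter that emits each annotated warning at most once."""

    def __init__(self):
        self._warned_ids = set()  # ids of warn sites that already fired
        self.emitted = []         # messages actually shown

    def run_warn(self, warn_id, message):
        if warn_id in self._warned_ids:
            return  # this site has warned before, so skip it
        self._warned_ids.add(warn_id)
        self.emitted.append(message)


interp = InterpreterSketch()
for _ in range(3):  # e.g. a loop body containing two distinct warn sites
    interp.run_warn(0, "first site")
    interp.run_warn(1, "second site")
```

After the loop, `interp.emitted` holds one message per site rather than one per iteration.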

Fixes https://github.com/pytorch/pytorch/issues/45108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45382

Reviewed By: mrshenli

Differential Revision: D24060677

Pulled By: gmagogsfm

fbshipit-source-id: 9221bc55b9ce36b374bdf614da3fe47496b481c1
2020-10-02 14:55:10 -07:00
anjali411
7d809f5d8e Update backward definition for more operators and reenable tests in test_ops.py (#44444)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44444

This PR:
1. Fixes https://github.com/pytorch/pytorch/issues/41510. Updates backward formula for the following functions: `asin`, `acos`, `asinh`, `acosh`, `atan`, `atanh`, `div`, `log`, `log10`, `log2`, `log1p`, `pow`, `reciprocal`, `angle`.
2. Re-enables the tests in `test_ops.py`.
3. Adds dispatch for complex dtypes for `tanh_backward`.
4. Re-enables commented tests in `common_methods_invocation.py`.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24027761

Pulled By: anjali411

fbshipit-source-id: c1f707c2a039149a6e04bbde53ee120d9119d99a
2020-10-02 13:37:10 -07:00
Bert Maher
e3d2defdc8 [te] Get llvm codegen to compile with llvm9 and llvm-fb (#45726)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45726

FB has an old internal platform that uses some random llvm version
that looks sort of like llvm 7.  I've guarded that with the appropriate
LLVM_VERSION_PATCH.

I've also swapped out some of our uses of ThreadSafeModule/ThreadSafeContext
for the variants without ThreadSafe in the name.  As far as I can tell we
weren't using the bundled locks anyways, but I'm like 85% sure this is OK since
we compile under the Torch JIT lock anyways.

Test Plan: unit tests

Reviewed By: ZolotukhinM, asuhan

Differential Revision: D24072697

fbshipit-source-id: 7f56b9f3cbe5e6d54416acdf73876338df69ddb2
2020-10-02 13:33:13 -07:00
Rohan Varma
f8c1ca5dd8 Enable NamedTuple data type to work with DDP (#44220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44220

Closes https://github.com/pytorch/pytorch/issues/44009
Currently, if a dataloader returns objects created with a
collections.namedtuple, they are incorrectly cast to a plain tuple. As a result, if we have data of these types, there can be runtime errors during the forward pass if the module is expecting a named tuple.

Fix this in
`scatter_gather.py` to resolve the issue reported in
https://github.com/pytorch/pytorch/issues/44009
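
One way to preserve the named tuple type while recursing through a container is sketched below. `is_namedtuple` and `map_structure` are hypothetical helpers, not the functions in `scatter_gather.py`, and doubling a number stands in for scattering a tensor.

```python
from collections import namedtuple


def is_namedtuple(obj):
    """Heuristic check: a tuple subclass that exposes `_fields`."""
    return isinstance(obj, tuple) and hasattr(obj, "_fields")


def map_structure(fn, obj):
    """Apply fn to every leaf while keeping the container type intact."""
    if is_namedtuple(obj):
        # Rebuild with the original class so field names survive.
        return type(obj)(*(map_structure(fn, x) for x in obj))
    if isinstance(obj, tuple):
        return tuple(map_structure(fn, x) for x in obj)
    return fn(obj)


Batch = namedtuple("Batch", ["inputs", "labels"])
out = map_structure(lambda x: x * 2, Batch(inputs=1, labels=2))
```

The named tuple check has to come before the generic tuple branch; otherwise the result degrades to a plain tuple and `out.inputs` would raise `AttributeError`.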
ghstack-source-id: 113423287

Test Plan: CI

Reviewed By: colesbury

Differential Revision: D23536752

fbshipit-source-id: 3838e60162f29ebe424e83e474c4350ae838180b
2020-10-02 13:33:08 -07:00
Rong Rong
322855e380 type check for torch.quantization.observer (#45630)
Summary:
add type checker for observer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45630

Reviewed By: malfet

Differential Revision: D24058304

Pulled By: walterddr

fbshipit-source-id: ac1c0f5ff0d34b0445bd1364653fc5c9d7571b05
2020-10-02 13:25:41 -07:00
Ansley Ussery
db8b076272 Change signature for torch.poisson (#45656)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45656

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24078609

Pulled By: ansleyadelaide

fbshipit-source-id: 97a95b08334ed0d710e032a267b940c2fc9f7f40
2020-10-02 13:14:12 -07:00
Ansley Ussery
7726754e70 Add function signature for pixel_shuffle (#45661)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45661

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24078627

Pulled By: ansleyadelaide

fbshipit-source-id: 44917ff5932e4d0adcc18ce24ecfc0b5686818e3
2020-10-02 11:46:35 -07:00
Omkar Salpekar
3799ba83e5 [Docs] Adding Store API Docs (#45543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45543

This PR adds documentation for the c10d Store to the public docs. Previously these docs were missing although we exposed a lightly-used (but potentially useful) Python API for our distributed key-value store.
ghstack-source-id: 113409195

Test Plan: Will verify screenshots by building the docs.

Reviewed By: pritamdamania87

Differential Revision: D24005598

fbshipit-source-id: 45c3600e7c3f220710e99a0483a9ce921d75d044
2020-10-02 11:16:56 -07:00
Eli Uriegas
a052597e6c Bump nightlies to 1.8.0 (#45696)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45696

Similar to https://github.com/pytorch/pytorch/pull/40519

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D24064381

Pulled By: seemethere

fbshipit-source-id: 1484b9c4fc5fa8cfa7be591a0a5d4b6e05968589
2020-10-02 11:10:34 -07:00
Pritam Damania
6e43f0db8b Use correct signatures for METH_NOARGS. (#45528)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45528

As described in https://github.com/pytorch/pytorch/issues/45419,
resolving a bunch of cpython signature issues.

#Closes: https://github.com/pytorch/pytorch/issues/45419
ghstack-source-id: 113385726

Test Plan: sentinel

Reviewed By: albanD

Differential Revision: D24000626

fbshipit-source-id: d334596f1f0256063691aa044c8fb2face260817
2020-10-02 10:43:58 -07:00
Andrew Millspaugh
cdf93b03de Add string versions of argument funcs in jit Node (#45464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45464

Usage of Symbols to find arguments requires one to generate a nonsense symbol for inputs which don't already have one. The intention of symbols appears to be something of an internalized string, but the namespace component doesn't apply to an argument. In order to access the arguments by name without adding new symbols, versions of those functions with std::string input were added. These can be proved valid based on the existing codepath. Additionally, a hasNamedInput convenience function was added to remove the necessity of a try/catch block in user code.

The primary motivation is to be able to easily handle the variable number of arguments in glow, so that the arange op may be implemented.

Reviewed By: eellison

Differential Revision: D23972315

fbshipit-source-id: 3e0b41910cf07e916186f1506281fb221725a91b
2020-10-02 10:26:29 -07:00
Sam Estep
24187a0b42 Enable type check for torch.quantization.fake_quantize (#45701)
Summary:
Addresses part of https://github.com/pytorch/pytorch/issues/42969.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45701

Reviewed By: walterddr

Differential Revision: D24066672

Pulled By: samestep

fbshipit-source-id: 53bb5e7b4703738d3de86fa89fb0980f1d6251f3
2020-10-02 09:27:34 -07:00
Brian Hirsh
869b2ca048 some documentation and style fixes to smooth_l1_loss (#45587)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45587

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24024313

Pulled By: bdhirsh

fbshipit-source-id: c50efb2934d7b9d3b090e92678319cde42c0df45
2020-10-02 07:47:31 -07:00
Brian Hirsh
c703602e17 make broadcasting explanation clearer in matmul doc: #22763 (#45699)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45699

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24065584

Pulled By: bdhirsh

fbshipit-source-id: 5e2cdd00ed18ad47d24d11751cfa5bee63853cc9
2020-10-02 06:51:42 -07:00
Natalia Gimelshein
9201c37d02 Use addmm directly for 1x1 convolution (#45557)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45274
Based on https://github.com/pytorch/pytorch/issues/44041, sets intermediate for backward computation (otherwise, backward tests are failing).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45557

Reviewed By: izdeby

Differential Revision: D24030655

Pulled By: ngimel

fbshipit-source-id: 368fe9440668dffc004879f8b1d2dd3787d915c9
2020-10-02 00:26:53 -07:00
Supriya Rao
04526a49d3 [quant] creating quint4x2 dtype for quantized tensors (#44678)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44678

This is a prototype PR that introduces 4 bit qtensors. The new dtype added for this is c10::quint4x2.
The underlying storage for this is still uint8_t, so we pack 2 4-bit values in a byte while quantizing it.

This change uses most of the existing scaffolding for qtensor storage. We allocate storage
based on the dtype before creating a new qtensor.

It also adds a dispatch mechanism for this dtype so we can use this to get the bitwidth, qmin and qmax info
while quantizing and packing the qtensor (when we add 2-bit qtensor)

Kernels that use this dtype should be aware of the packing format.
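
To make the packing concrete, here is a small sketch of storing two 4-bit values per byte. The nibble order (first value in the low four bits) and the zero padding for odd lengths are assumptions made for illustration; the commit message does not specify the layout used by `c10::quint4x2`.

```python
def pack_4bit(values):
    """Pack integers in [0, 15] two per byte; odd lengths are zero-padded."""
    values = list(values)
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("every value must fit in 4 bits")
    if len(values) % 2:
        values.append(0)
    # Assumed layout: even-indexed value in the low nibble, next one in the high nibble.
    return bytes(
        (values[i + 1] << 4) | values[i] for i in range(0, len(values), 2)
    )


def unpack_4bit(packed, count):
    """Inverse of pack_4bit; `count` drops any padding nibble."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out[:count]


vals = [1, 15, 0, 7, 3]
packed = pack_4bit(vals)
```

Five values occupy three bytes here, which matches the roughly halved payload relative to 8-bit storage that the sizes in the test plan suggest.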

Test Plan:
Locally tested
```
x = torch.ones((100, 100), dtype=torch.float)
qx_8bit = torch.quantize_per_tensor(x, scale=1.0, zero_point=2, dtype=torch.quint8)
qx = torch.quantize_per_tensor(x, scale=1.0, zero_point=2, dtype=torch.quint4x2)

torch.save(x, "temp.p")
print('Size float (B):', os.path.getsize("temp.p"))
os.remove('temp.p')

torch.save(qx_8bit, "temp.p")
print('Size quantized 8bit(B):', os.path.getsize("temp.p"))
os.remove('temp.p')

torch.save(qx, "temp.p")
print('Size quantized 4bit(B):', os.path.getsize("temp.p"))
os.remove('temp.p')
```

Size float (B): 40760
Size quantized 8bit(B): 10808
Size quantized 4bit(B): 5816

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23993134

fbshipit-source-id: 073bf262f9680416150ba78ed2d932032275946d
2020-10-01 23:53:34 -07:00
Nikolay Korovaiko
a0d08b2199 Set the default bailout depth to 20 (#45710)
Summary:
This modifies the default bailout depth to 20 which gives us a reasonable performance in benchmarks we considered (fastrnns, maskrcnn, hub/benchmark, etc)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45710

Reviewed By: robieta

Differential Revision: D24071861

Pulled By: Krovatkin

fbshipit-source-id: 472aacc136f37297b21f577750c1d60683a6c81e
2020-10-01 23:37:41 -07:00
Meghan Lele
402caaeba5 [docs] Update docs for NegativeBinomial (#45693)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45693

**Summary**
This commit updates the docstring for
`torch.distributions.NegativeBinomial` to better match actual behaviour.
In particular, the parameter currently documented as probability of
success is actually probability of failure.

**Test Plan**
1) Ran the code from the issue to make sure this is still an issue (it
is)
2) `make html` and viewed the docs in a browser.

*Before*
<img width="879" alt="Captura de Pantalla 2020-10-01 a la(s) 1 35 28 p  m" src="https://user-images.githubusercontent.com/4392003/94864456-db3a5680-03f0-11eb-977e-3bab0fb9c206.png">

*After*
<img width="877" alt="Captura de Pantalla 2020-10-01 a la(s) 2 12 24 p  m" src="https://user-images.githubusercontent.com/4392003/94864478-e42b2800-03f0-11eb-965a-51493ca27c80.png">

**Fixes**
This commit closes #42449.

Test Plan: Imported from OSS

Reviewed By: robieta

Differential Revision: D24071048

Pulled By: SplitInfinity

fbshipit-source-id: d345b4de721475dbe26233e368af62eb57a47970
2020-10-01 23:20:34 -07:00
Lillian Johnson
f6dc256bc6 example of splitting up an FX graph into smaller subgraphs with own submodules (#45404)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45404

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23956147

Pulled By: Lilyjjo

fbshipit-source-id: a35e33a0b9f1ed5f3fb6e5cd146f66c29bf3d518
2020-10-01 20:40:27 -07:00
lixinyu
fc4209bd4f Fix the bucketization wrong doc for right argument (#45684)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45684

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24057996

Pulled By: glaringlee

fbshipit-source-id: 3db1c24f3cae9747effa4b1f3c5c3baf6888c9a1
2020-10-01 18:16:49 -07:00
Abaho Katabarwa
de3a48013a Use CAFFE2_USE_MSVC_STATIC_RUNTIME to determine when to avoid waiting for global destructors on Windows (#43532)
Summary:
We are trying to build libtorch statically (BUILD_SHARED_LIBS=OFF) then link it into a DLL. Our setup hits the infinite loop mentioned [here](54c05fa34e/torch/csrc/autograd/engine.cpp (L228)) because we build with `BUILD_SHARED_LIBS=OFF` but still link it all into a DLL at the end of the day.

This PR fixes the issue by changing the condition to guard on which windows runtime the build links against using the `CAFFE2_USE_MSVC_STATIC_RUNTIME` flag. `CAFFE2_USE_MSVC_STATIC_RUNTIME` defaults to ON when `BUILD_SHARED_LIBS=OFF`, so backwards compatibility is maintained.

I'm not entirely confident I understand the subtleties of the windows runtime versus linking setup, but this setup works for us and should not affect the existing builds.

Fixes https://github.com/pytorch/pytorch/issues/44470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43532

Reviewed By: mrshenli

Differential Revision: D24053767

Pulled By: albanD

fbshipit-source-id: 1127fefe5104d302a4fc083106d4e9f48e50add8
2020-10-01 16:41:14 -07:00
Jerry Zhang
4f685ecc25 [reland][quant][graphmode][fx] Merge all quantization mode (#45292) (#45672)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45672

This PR merges all quantization modes and will only expose the following top level functions:
```
prepare_fx
prepare_qat_fx
convert_fx
```

Test Plan:
Imported from OSS

Reviewed By: z-a-f

Differential Revision: D24053439

fbshipit-source-id: 03d545e26a36bc22a73349061b751eeb35171e64
2020-10-01 15:47:11 -07:00
Wang Xu
03e4e94d24 Find single partition (#45429)
Summary:
WIP: This PR is a work in progress on partitioning an fx graph module. _class partitioner_ generates partitions for the graph module. _class partition_ is a partition node in the partitions.
_Partitioner()_ : create a partitioner
_partition_graph(self, fx_module: GraphModule, devices: List[str]) -> None_:
use fx graph module and devices as the input and create partition_ids for each node inside the graph module

_dump_partition_DAG(self) -> None_:
print out the information about each partition, including its id, its backend type (what type of device this partition uses), all the nodes included in this partition,  its parent partitions, children partitions, input nodes, and output nodes.

So far, only a single partition is considered, which means there is only one device with unlimited memory.
A unit test called _test_find_single_partition()_ is added to check that all nodes in the graph are marked for the only partition.
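
The single-partition case can be sketched as follows. `PartitionSketch` and `find_single_partition` are hypothetical stand-ins working on node names rather than a real `GraphModule`, and the device string is made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PartitionSketch:
    partition_id: int
    device: str
    nodes: List[str] = field(default_factory=list)


def find_single_partition(node_names: List[str], devices: List[str]) -> Dict[str, int]:
    """Put every node into one partition on the only available device."""
    if len(devices) != 1:
        raise ValueError("this sketch only handles exactly one device")
    partition = PartitionSketch(partition_id=0, device=devices[0])
    node_to_partition = {}
    for name in node_names:
        partition.nodes.append(name)
        node_to_partition[name] = partition.partition_id
    return node_to_partition


assignment = find_single_partition(["x", "linear", "relu", "output"], ["dev_0"])
```

Because the device is treated as having unlimited memory, no size accounting is needed and every node maps to partition 0.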

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45429

Reviewed By: izdeby

Differential Revision: D24026268

Pulled By: scottxu0730

fbshipit-source-id: 119d506f33049a59b54ad993670f4ba5d8e15b0b
2020-10-01 13:07:34 -07:00
Richard Zou
381f6d32a7 [docs] Fix hyperlinks for nn.CrossEntropyLoss (#45660)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45460. This PR makes it so that LogSoftmax and NLLLoss are correctly linked from the nn.CrossEntropyLoss documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45660

Test Plan:
- built and viewed docs locally

![image](https://user-images.githubusercontent.com/5652049/94816513-ee85fb80-03c9-11eb-8289-56642c133e11.png)

Reviewed By: glaringlee

Differential Revision: D24049009

Pulled By: zou3519

fbshipit-source-id: 3bd0660acb8575d753cefd2d0f1e523ca58a25b6
2020-10-01 12:18:43 -07:00
Richard Zou
1efdbfabcc [docs] Fix back quote rendering in loss modules docs (#45662)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42855. Previously, back quotes weren't rendering correctly in
equations. This is because we were quoting things like `'mean'`. In
order to backquote properly in latex in text-mode, the back-quote needs
to be written as a back-tick.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45662

Test Plan:
- built docs locally and viewed the changes.

For NLLLoss (which is not the original module mentioned in the issue, but it has the same problem), we can see how the back quotes now render properly:

![image](https://user-images.githubusercontent.com/5652049/94819862-c5676a00-03cd-11eb-9e92-01380ee52bd6.png)

Reviewed By: glaringlee

Differential Revision: D24049880

Pulled By: zou3519

fbshipit-source-id: 61a1257994144549eb8f29f19d639aea962dfec0
2020-10-01 11:52:27 -07:00
Ivan Yashchuk
77cd8e006b Added support for complex torch.symeig (#45121)
Summary:
This PR adds support for complex-valued input for `torch.symeig`.

TODO:
- [ ] complex cuda tests raise `RuntimeError: _th_bmm_out not supported on CUDAType for ComplexFloat`
Update: Added xfailing tests for complex dtypes on CUDA. Once support for complex `bmm` is added these tests will work.

Fixes https://github.com/pytorch/pytorch/issues/45061.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45121

Reviewed By: mrshenli

Differential Revision: D24049649

Pulled By: anjali411

fbshipit-source-id: 2cd11f0e47d37c6ad96ec786762f2da57f25dac5
2020-10-01 08:57:13 -07:00
Michael Carilli
72bc3d9de4 Use MTA for amp grad unscaling, enforce op math type in MTA functors, and allow op lambdas (#44778)
Summary:
Amp gradient unscaling is a great use case for multi tensor apply (in fact it's the first case I wrote it for).  This PR adds an MTA unscale+infcheck functor.  Really excited to have it for `torch.cuda.amp`. izdeby your interface was clean and straightforward to use, great work!
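
For readers unfamiliar with the unscale+infcheck step, here is a rough pure-Python sketch of the arithmetic only. It is not the CUDA functor from this PR: gradients are modeled as lists of floats, and `unscale_and_check` is a hypothetical name.

```python
import math
from typing import List, Tuple


def unscale_and_check(grads: List[List[float]], inv_scale: float) -> Tuple[List[List[float]], bool]:
    """Multiply every gradient by inv_scale and report whether any input was inf/NaN."""
    found_inf = False
    unscaled = []
    for grad in grads:  # one pass over the whole list of "tensors"
        out = []
        for g in grad:
            if not math.isfinite(g):
                found_inf = True
            out.append(g * inv_scale)
        unscaled.append(out)
    return unscaled, found_inf


grads = [[2.0, 4.0], [8.0, float("inf")]]
unscaled, found_inf = unscale_and_check(grads, inv_scale=0.5)
```

Handling the whole list in one call is what makes this a natural fit for multi tensor apply: the per-tensor loop is the part that a single fused launch would replace.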

Labeled as bc-breaking because the native_functions.yaml exposure of unscale+infcheck changes from [`_amp_non_finite_check_and_unscale_` to `_amp_foreach_non_finite_check_and_unscale_`]( https://github.com/pytorch/pytorch/pull/44778/files#diff-f1e4b2c15de770d978d0eb77b53a4077L6289-L6293).

The PR also modifies Unary/Binary/Pointwise Functors to
- do ops' internal math in FP32 for FP16 or bfloat16 inputs, which improves precision ([and throughput, on some architectures!](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions)) and has no downside for the ops we care about.
- accept an instantiated op functor rather than an op functor template (`template<class> class Op`).  This allows calling code to pass lambdas.

Open question:  As written now, the PR has MTA Functors take care of pre- and post-casting FP16/bfloat16 inputs to FP32 before running the ops.  However, alternatively, the pre- and post-math casting could be deferred/written into the ops themselves, which gives them a bit more control.  I can easily rewrite it that way if you prefer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44778

Reviewed By: gchanan

Differential Revision: D23944102

Pulled By: izdeby

fbshipit-source-id: 22b25ccad5f69b413c77afe8733fa9cacc8e766d
2020-10-01 07:51:16 -07:00
generatedunixname89002005325676
84cf3372d1 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D24044108

fbshipit-source-id: 6dfe2f1201304fa58e42472e3f53c72cbb63d7d2
2020-10-01 05:29:03 -07:00
Mike Ruberry
c36b354072 Revert D23913105: [quant][graphmode][fx] Merge all quantization mode
Test Plan: revert-hammer

Differential Revision:
D23913105 (ffcb0989e7)

Original commit changeset: 4e335286d6de

fbshipit-source-id: 5765b4e8ec917423f1745f73a9f3f235fc53423d
2020-10-01 03:12:42 -07:00
James Reed
78b95b6204 Revert "Revert D24024606: [FX] Shape propagation example" (#45637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45637

This reverts commit 869b05648d.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24037870

Pulled By: jamesr66a

fbshipit-source-id: 851beb42fe72383108ceeff1fe97f388d9ad059e
2020-10-01 01:07:56 -07:00
Xingying Cheng
4339f5c076 [PyTorch][QPL] Add instance_key into MOBILE_MODULE_LOAD_STATS logging. (#45518)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45518

Similar to the previous diff, add instance_key to MOBILE_MODULE_LOAD_STATS logging.
ghstack-source-id: 113149713

Test Plan:
```
09-29 11:50:23.345  6477  9351 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterLoadModel instance_key = 2015064908
09-29 11:50:23.409  6477  9351 W MobileModuleQPLObserver.cpp: TESTINGTESTING markerAnnotate instance_key = 2015064908, model_name = bi_pytext_v10
09-29 11:50:23.410  6477  9351 W MobileModuleQPLObserver.cpp: TESTINGTESTING markerAnnotate instance_key = 2015064908, model_type = FBNet
09-29 11:50:23.410  6477  9351 W MobileModuleQPLObserver.cpp: TESTINGTESTING markerAnnotate instance_key = 2015064908, op_list_string = ["aten::__getitem__.t", "aten::__is__", "aten::__isnot__", "aten::add.Tensor", "aten::append.t", "aten::cat", "aten::contiguous", "aten::conv1d", "aten::dim", "aten::embedding", "aten::eq.int", "aten::format", "aten::len.t", "aten::max.dim", "aten::mul.Tensor", "aten::permute", "aten::relu", "aten::softmax.int", "aten::tanh", "prepacked::linear_clamp_run", "prim::RaiseException", "prim::TupleIndex", "prim::TupleUnpack", "prim::Uninitialized", "prim::unchecked_cast"]
09-29 11:50:23.410  6477  9351 W MobileModuleQPLObserver.cpp: TESTINGTESTING onExitLoadModel instance_key = 2015064908
```

Reviewed By: iseeyuan

Differential Revision: D23996150

fbshipit-source-id: 7bf76af3b7e6b346afd20ab341204743c81cfe83
2020-09-30 23:31:35 -07:00
BowenBao
3da4cea658 [ONNX] Add dim_param support in export with onnx shape inference (#44920)
Summary:
* Support propagating `dim_param` in ONNX by encoding as `ShapeSymbol` in `SymbolicShape` of outputs. If export is called with `dynamic_axes` provided, shape inference will start with these axes set as dynamic.
* Add new test file `test_pytorch_onnx_shape_inference.py`, reusing all test cases from `test_pytorch_onnx_onnxruntime.py` but focusing on validating shapes for all nodes in the graph. It is currently not enabled in CI, since there are still a number of existing issues and corner cases to fix. By default the test runs only at opset 12.
* Bug fixes, such as div, _len, and peephole.cpp passes for PackPadded, and LogSoftmaxCrossEntropy.
* This PR depends on existing PRs such as 44332.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44920

Reviewed By: eellison

Differential Revision: D23958398

Pulled By: bzinodev

fbshipit-source-id: 00479d9bd19c867d526769a15ba97ec16d56e51d
2020-09-30 21:56:24 -07:00
Jerry Zhang
ffcb0989e7 [quant][graphmode][fx] Merge all quantization mode (#45292)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45292

This PR merges all quantization modes and exposes only the following top-level functions:
```
prepare_fx
prepare_qat_fx
convert_fx
```

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23913105

fbshipit-source-id: 4e335286d6de225839daf51d1df54322d52d68e5
2020-09-30 21:20:34 -07:00
Xingying Cheng
3f440d74fc [PyTorch][QPL] Add instance_key into MOBILE_MODULE_STATS logging. (#45517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45517

Add a unique instance_key instead of the default one to MOBILE_MODULE_STATS logging, to avoid overlap between multiple events.
ghstack-source-id: 113149453

Test Plan:
Make sure that each event's start, annotate, and end have the same instance_key:
```
09-28 23:46:03.094 19349 21069 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod instance_key = 1123198800, method_name = forward
09-28 23:46:03.094 19349 21069 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod instance_key = 1123198800, model_name = bi_pytext_v10
09-28 23:46:03.094 19349 21069 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod instance_key = 1123198800, model_type = FBNet
09-28 23:46:03.094 19349 21069 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod instance_key = 1123198800, op_list_string = ["aten::__getitem__.t", "aten::__is__", "aten::__isnot__", "aten::add.Tensor", "aten::append.t", "aten::cat", "aten::contiguous", "aten::conv1d", "aten::dim", "aten::embedding", "aten::eq.int", "aten::format", "aten::len.t", "aten::max.dim", "aten::mul.Tensor", "aten::permute", "aten::relu", "aten::softmax.int", "aten::tanh", "prepacked::linear_clamp_run", "prim::RaiseException", "prim::TupleIndex", "prim::TupleUnpack", "prim::Uninitialized", "prim::unchecked_cast"]
09-28 23:46:03.181 19349 21069 W MobileModuleQPLObserver.cpp: TESTINGTESTING onExitRunMethod instance_key = 1123198800
09-28 23:46:04.183 19349 20896 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod instance_key = 1521608147, method_name = forward
09-28 23:46:04.184 19349 20896 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod instance_key = 1521608147, model_name = __torch__.Model
09-28 23:46:04.205 19349 20896 W MobileModuleQPLObserver.cpp: TESTINGTESTING onExitRunMethod instance_key = 1521608147
```

Reviewed By: iseeyuan

Differential Revision: D23985178

fbshipit-source-id: bcd5db8dc680e3cf8d12edf865377e80693cc23b
2020-09-30 20:13:33 -07:00
Jerry Zhang
9d5607fcd9 [quant] Use PlaceholderObserver as default dynamic quant observer (#45343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45343

The current default dynamic quant observer is not correct, since we don't accumulate
min/max and we don't need to calculate qparams.

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D23933995

fbshipit-source-id: 3ff497c9f5f74c687e8e343ab9948d05ccbba09b
2020-09-30 19:01:18 -07:00
Taylor Robie
2b13d9413e Re-land: Add callgrind collection to Timer #44717 (#45586)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45586

Test Plan: The unit test has been softened to be less platform sensitive.

Reviewed By: mruberry

Differential Revision: D24025415

Pulled By: robieta

fbshipit-source-id: ee986933b984e736cf1525e1297de6b21ac1f0cf
2020-09-30 17:43:06 -07:00
Yanan Cao
3a2d45304d [Experimental][Partial] New implementation for torch.distributed APIs in C++ (#45547)
Summary:
This is an attempt at refactoring the `torch.distributed` implementation. The goal is to push the Python layer's global state (like _default_pg) down to the C++ layer so that `torch.distributed` becomes more TorchScript friendly.

This PR adds the skeleton of the C++ implementation; at the moment it is not included in any build (and won't be until the method implementations are filled in). If you see any related test failures, feel free to revert.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45547

Reviewed By: izdeby

Differential Revision: D24024213

Pulled By: gmagogsfm

fbshipit-source-id: 2762767f63ebef43bf58e17f9447d53cf119f05f
2020-09-30 17:35:51 -07:00
David Reiss
869b05648d Revert D24024606: [FX] Shape propagation example
Test Plan: revert-hammer

Differential Revision:
D24024606 (ac9a708ed0)

Original commit changeset: 5340eab20f80

fbshipit-source-id: f465eb5e8e994b3b0bedbc779901f76b9ab16f02
2020-09-30 17:03:14 -07:00
Hector Yuen
f2c2b75e80 flush the buffer when printing the IR (#45585)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45585

I discovered this bug when I was trying to print the graph to a file. Turns out I had to close the file, but flushing should be a good safeguard in case other users forget.
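
The same pitfall can be shown with a small Python sketch (the fix in this PR is on the C++ side; `dump_ir` and the graph text below are purely illustrative). Flushing after the write makes the text visible to other readers even if the caller never closes the stream.

```python
import os
import tempfile


def dump_ir(text: str, stream) -> None:
    """Write IR text and flush so it reaches the OS even if the stream is never closed."""
    stream.write(text)
    stream.flush()


ir_text = "graph(%x : Tensor):\n  return (%x)\n"
path = os.path.join(tempfile.mkdtemp(), "graph.txt")
out = open(path, "w")  # deliberately left open, as a forgetful caller might
dump_ir(ir_text, out)

with open(path) as f:  # a second handle already sees the full text
    seen = f.read()
out.close()
```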

Test Plan:
Tested with and without flushing.
with P144064292
without P144064767

Reviewed By: mortzur

Differential Revision: D24023819

fbshipit-source-id: 39574b3615feb28e5b5939664c04ddfb1257706a
2020-09-30 16:55:27 -07:00
Zino Benaissa
4be42034b6 Clear shape information before finalizing graph-mode quantization (#45282)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45282

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23909601

Pulled By: bzinodev

fbshipit-source-id: 3062cda46b15a79094a360216c35906afab7c723
2020-09-30 16:13:55 -07:00
Malgi Nikitha Vivekananda
85a70ce71f Add multiline string dedent support (#45580)
Summary:
Fixes #44842
Summary
========
This PR adds support for multiline string dedents.

Test
=====
pytest -k test_multiline_string_dedents test/test_jit.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45580

Reviewed By: wconstab

Differential Revision: D24025866

Pulled By: nikithamalgifb

fbshipit-source-id: 0f49739fb93f70f73a8f367caca2887f558a3937
2020-09-30 16:08:26 -07:00
Sam Tsai
2596113a79 Add fuse support for batchnorm with affine=False (#45474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45474

When batchnorm `affine` is set to False, weight and bias are set to None, which fusion did not support. This adds a fix that sets weight to 1 and bias to 0 if they are not set.
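
As an illustration, the usual conv/batchnorm folding arithmetic with that fallback might look like the sketch below. It uses one scalar weight per output channel to stay dependency-free, and `fuse_conv_bn_params` is a hypothetical helper, not the function changed in this PR.

```python
import math
from typing import List, Optional, Tuple


def fuse_conv_bn_params(
    conv_w: List[float],          # one scalar weight per output channel, for simplicity
    conv_b: Optional[List[float]],
    bn_mean: List[float],
    bn_var: List[float],
    bn_eps: float,
    bn_w: Optional[List[float]],  # None when affine=False
    bn_b: Optional[List[float]],  # None when affine=False
) -> Tuple[List[float], List[float]]:
    n = len(conv_w)
    if conv_b is None:
        conv_b = [0.0] * n
    if bn_w is None:
        bn_w = [1.0] * n  # identity scale when batchnorm has no learnable weight
    if bn_b is None:
        bn_b = [0.0] * n  # zero shift when batchnorm has no learnable bias
    fused_w, fused_b = [], []
    for c in range(n):
        scale = bn_w[c] / math.sqrt(bn_var[c] + bn_eps)
        fused_w.append(conv_w[c] * scale)
        fused_b.append((conv_b[c] - bn_mean[c]) * scale + bn_b[c])
    return fused_w, fused_b


# affine=False: both batchnorm weight and bias are None.
w, b = fuse_conv_bn_params(
    [2.0], [1.0], bn_mean=[1.0], bn_var=[4.0], bn_eps=0.0, bn_w=None, bn_b=None
)
```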

Test Plan: Add a unit test for fusing conv and batchnorm where batchnorm is in affine=False mode.

Reviewed By: z-a-f

Differential Revision: D23977080

fbshipit-source-id: 2782be626dc67553f3d27d8f8b1ddc7dea022c2a
2020-09-30 14:15:05 -07:00
Negin Raoof
6b42ca2d69 [ONNX] Update embedding_bag export (#44693)
Summary:
Export of embedding bag with dynamic list of offsets.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44693

Reviewed By: malfet

Differential Revision: D23831980

Pulled By: bzinodev

fbshipit-source-id: 3eaff1a0f20d1bcfb8039e518d78c491be381e1a
2020-09-30 13:36:40 -07:00
James Reed
ac9a708ed0 [FX] Shape propagation example (#45589)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45589

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D24024606

Pulled By: jamesr66a

fbshipit-source-id: 5340eab20f805c232bfeb37e4e2156f39a161c19
2020-09-30 13:18:23 -07:00
Xinyu Li
c9bb990707 [c++] Distance-agnostic triplet margin loss (#45377)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45377

This PR adds a C++ implementation of the TripletMarginWithDistanceLoss, for which the Python implementation was introduced in PR #43680.  It's based on PR #44072, but I'm resubmitting this to unlink it from Phabricator.

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D24003973

fbshipit-source-id: 2d9ada7260a6f27425ff2fdbbf623dad0fb79405
2020-09-30 12:37:35 -07:00
Rohan Varma
181afd5220 Add an option to DDP to take a list of parameters to ignore upfront. (#44826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44826

As described in https://github.com/pytorch/pytorch/issues/43690, there
is a need for DDP to be able to ignore certain parameters in the module (not
install allreduce hooks) for certain use cases. `find_unused_parameters` is
sufficient from a correctness perspective, but we can get better performance
with this upfront list if users know which params are unused, since we won't
have to traverse the autograd graph every iteration.

To enable this, we add a field `parameters_to_ignore` to DDP init and don't
pass a parameter to the reducer if it is in the given list.
ghstack-source-id: 113210109
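
The filtering step can be sketched in a few lines of plain Python. This is only an illustration of the idea: `params_for_reducer` is a hypothetical helper, parameters are matched by name as an assumption, and the real DDP constructor is not shown.

```python
from typing import Iterable, List, Tuple


def params_for_reducer(
    named_params: Iterable[Tuple[str, object]],
    parameters_to_ignore: Iterable[str],
) -> List[Tuple[str, object]]:
    """Keep only the parameters that should get allreduce hooks."""
    ignore = set(parameters_to_ignore)
    return [(name, p) for name, p in named_params if name not in ignore]


# Stand-in parameters; in practice these would be tensors.
named = [("encoder.weight", 1), ("encoder.bias", 2), ("unused_head.weight", 3)]
kept = params_for_reducer(named, ["unused_head.weight"])
```

Because the ignored names are known upfront, nothing has to be discovered by walking the autograd graph each iteration.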

Test Plan: Added unittest

Reviewed By: xw285cornell, mrshenli

Differential Revision: D23740639

fbshipit-source-id: a0411712a8b0b809b9c9e6da04bef2b955ba5314
2020-09-30 11:52:50 -07:00
Jerry Zhang
5539066d12 [quant][graphmode][fx] Support quantization for custom module (#44074)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44074

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23580642

fbshipit-source-id: a80b0b3e5e1f4c4a9647da872239cc0a4d58dd3b
2020-09-30 10:24:54 -07:00
Mike Ruberry
51d0ae9207 Revert D24010742: [pytorch][PR] Add callgrind collection to Timer
Test Plan: revert-hammer

Differential Revision:
D24010742 (9b27e0926b)

Original commit changeset: df6bc765f8ef

fbshipit-source-id: 4c1edd57ea932896f7052716427059c924222501
2020-09-30 10:15:46 -07:00
Brian Hirsh
6c4aa2a79c Revert D24002415: Some fixes to smooth_l1_loss
Test Plan: revert-hammer

Differential Revision:
D24002415 (fdbed7118e)

Original commit changeset: 980c141019ec

fbshipit-source-id: 8981b5f6d982ed66c670122e437540444cb5f39c
2020-09-30 10:00:17 -07:00
Rong Rong
4f3920951e type check for torch.quantization.quantize_jit (#45548)
Summary:
added type signal for more jit python functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45548

Reviewed By: malfet

Differential Revision: D24010922

Pulled By: walterddr

fbshipit-source-id: 2fdd75482481adf2eddc01b915d7d5720fbb2b82
2020-09-30 09:17:00 -07:00
anjali411
415ed434aa Add whitelist for complex backward (#45461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45461

This PR disables autograd for all C -> C, R -> C functions which are not included in the whitelist `GRADIENT_IMPLEMENTED_FOR_COMPLEX`. In practice, there will be a RuntimeError during forward computation when the outputs are differentiable:
```
>>> x=torch.randn(4, 4, requires_grad=True, dtype=torch.cdouble)
>>> x.pow(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: pow does not support automatic differentiation for outputs with complex dtype.
```

The implicit assumption here is that all the C -> R functions have correct backward definitions. So before merging this PR, the following functions must be tested and verified to have correct backward definitions:
`torch.abs` (updated in #39955 ), `torch.angle`, `torch.norm`, `torch.irfft`, `torch.istft`.
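
A rough sketch of the whitelist check, in plain Python. The set contents below are placeholders chosen for illustration, not the real `GRADIENT_IMPLEMENTED_FOR_COMPLEX` list, and `check_complex_autograd` is a hypothetical name; only the error message format is taken from the example above.

```python
# Placeholder entries only; the real whitelist lives in the autograd codegen.
GRADIENT_IMPLEMENTED_FOR_COMPLEX = {"view", "reshape"}


def check_complex_autograd(op_name: str, output_is_complex: bool, requires_grad: bool) -> None:
    """Raise if a differentiable complex output comes from a non-whitelisted op."""
    if output_is_complex and requires_grad and op_name not in GRADIENT_IMPLEMENTED_FOR_COMPLEX:
        raise RuntimeError(
            f"{op_name} does not support automatic differentiation "
            "for outputs with complex dtype."
        )


check_complex_autograd("view", output_is_complex=True, requires_grad=True)   # allowed
check_complex_autograd("pow", output_is_complex=True, requires_grad=False)   # no grad, allowed
try:
    check_complex_autograd("pow", output_is_complex=True, requires_grad=True)
    raised = False
except RuntimeError as err:
    raised = True
    message = str(err)
```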

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D23998156

Pulled By: anjali411

fbshipit-source-id: 370eb07fe56ac84dd8e2233ef7bf3a3eb8aeb179
2020-09-30 08:45:55 -07:00
Erjia Guan
96540e918c Add ShuffleDataset with buffer (#45290)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45290

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D24001084

Pulled By: erjia-guan

fbshipit-source-id: d8a7455cf3f18e1f8c1edc53c42c1a99c8573c51
2020-09-30 07:58:15 -07:00
Brian Hirsh
fdbed7118e Some fixes to smooth_l1_loss (#45532)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45532

- updated documentation
- explicitly not supporting negative values for beta (previously the
result was incorrect)
- Removing default value for beta in the backwards function, since it's
only used internally by autograd (as per convention)
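
For reference, a scalar sketch of smooth L1 with an explicit check on beta. Raising `ValueError` is an assumption made for this example; the summary only says negative values are explicitly not supported.

```python
def smooth_l1(x: float, y: float, beta: float = 1.0) -> float:
    """Elementwise smooth L1: quadratic below beta, linear above it."""
    if beta < 0:
        raise ValueError("smooth_l1 does not support negative values for beta")
    diff = abs(x - y)
    if diff < beta:
        return 0.5 * diff * diff / beta
    # Also covers beta == 0, where the loss reduces to plain L1.
    return diff - 0.5 * beta


quadratic = smooth_l1(0.5, 0.0, beta=1.0)
linear = smooth_l1(3.0, 0.0, beta=1.0)
plain_l1 = smooth_l1(3.0, 0.0, beta=0.0)
```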

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D24002415

Pulled By: bdhirsh

fbshipit-source-id: 980c141019ec2d437b771ee11fc1cec4b1fcfb48
2020-09-30 07:28:44 -07:00
VinodSKumar
e02868e12d Unify Transformer coder Constructors (#45515)
Summary:
Fixes [#45502](https://github.com/pytorch/pytorch/issues/45502)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45515

Reviewed By: zhangguanheng66, ZolotukhinM

Differential Revision: D23994644

Pulled By: glaringlee

fbshipit-source-id: b8728e8dfd8857e27246ebb11b17c2d1b48796ca
2020-09-30 07:05:41 -07:00
Nikolay Korovaiko
7566823779 Enable PE + TE (#45546)
Summary:
This PR enables PE + TE for 1.7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45546

Reviewed By: ZolotukhinM

Differential Revision: D24006940

Pulled By: Krovatkin

fbshipit-source-id: a3326077d34a023941acdb06c4907c96e7ba0115
2020-09-30 06:49:59 -07:00
Taylor Robie
9b27e0926b Add callgrind collection to Timer (#44717)
Summary:
This PR allows Timer to collect deterministic instruction counts for (some) snippets. Because of the intrusive nature of Valgrind (effectively replacing the CPU with an emulated one) we have to perform our measurements in a separate process. This PR writes a `.py` file containing the Timer's `setup` and `stmt`, and executes it within a `valgrind` subprocess along with a plethora of checks and error handling. There is still a bit of jitter around the edges due to the Python glue that I'm using, but the PyTorch signal is quite good and thus this provides a low friction way of getting signal. I considered using JIT as an alternative, but:

A) Python specific overheads (e.g. parsing) are important
B) JIT might do rewrites which would complicate measurement.

Consider the following bit of code, related to https://github.com/pytorch/pytorch/issues/44484:
```
from torch.utils._benchmark import Timer
counts = Timer(
    "x.backward()",
    setup="x = torch.ones((1,)) + torch.ones((1,), requires_grad=True)"
).collect_callgrind()

for c, fn in counts[:20]:
    print(f"{c:>12}  {fn}")
```

```
      812800  ???:_dl_update_slotinfo
      355600  ???:update_get_addr
      308300  work/Python/ceval.c:_PyEval_EvalFrameDefault'2
      304800  ???:__tls_get_addr
      196059  ???:_int_free
      152400  ???:__tls_get_addr_slow
      138400  build/../c10/core/ScalarType.h:c10::typeMetaToScalarType(caffe2::TypeMeta)
      126526  work/Objects/dictobject.c:_PyDict_LoadGlobal
      114268  ???:malloc
      101400  work/Objects/unicodeobject.c:PyUnicode_FromFormatV
       85900  work/Python/ceval.c:_PyEval_EvalFrameDefault
       79946  work/Objects/typeobject.c:_PyType_Lookup
       72000  build/../c10/core/Device.h:c10::Device::validate()
       70000  /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
       66400  work/Objects/object.c:_PyObject_GenericGetAttrWithDict
       63000  ???:pthread_mutex_lock
       61200  work/Objects/dictobject.c:PyDict_GetItem
       59800  ???:free
       58400  work/Objects/tupleobject.c:tupledealloc
       56707  work/Objects/dictobject.c:lookdict_unicode_nodummy
```

Moreover, if we backport this PR to 1.6 (just copy the `_benchmark` folder) and load those counts as `counts_1_6`, then we can easily diff them:
```
print(f"Head instructions: {sum(c for c, _ in counts)}")
print(f"1.6 instructions:  {sum(c for c, _ in counts_1_6)}")
count_dict = {fn: c for c, fn in counts}
for c, fn in counts_1_6:
    _ = count_dict.setdefault(fn, 0)
    count_dict[fn] -= c
count_diffs = sorted([(c, fn) for fn, c in count_dict.items()], reverse=True)
for c, fn in count_diffs[:15] + [["", "..."]] + count_diffs[-15:]:
    print(f"{c:>8}  {fn}")
```

```
Head instructions: 7609547
1.6 instructions:  6059648
  169600  ???:_dl_update_slotinfo
  101400  work/Objects/unicodeobject.c:PyUnicode_FromFormatV
   74200  ???:update_get_addr
   63600  ???:__tls_get_addr
   46800  work/Python/ceval.c:_PyEval_EvalFrameDefault
   33512  work/Objects/dictobject.c:_PyDict_LoadGlobal
   31800  ???:__tls_get_addr_slow
   31700  build/../aten/src/ATen/record_function.cpp:at::RecordFunction::RecordFunction(at::RecordScope)
   28300  build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object*, _object**, bool)
   27800  work/Objects/object.c:_PyObject_GenericGetAttrWithDict
   27401  work/Objects/dictobject.c:lookdict_unicode_nodummy
   24115  work/Objects/typeobject.c:_PyType_Lookup
   24080  ???:_int_free
   21700  work/Objects/dictobject.c:PyDict_GetItemWithError
   20700  work/Objects/dictobject.c:PyDict_GetItem
          ...
   -3200  build/../c10/util/SmallVector.h:at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool)
   -3400  build/../aten/src/ATen/native/TensorIterator.cpp:at::TensorIterator::resize_outputs(at::TensorIteratorConfig const&)
   -3500  /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:std::unique_lock<std::mutex>::unlock()
   -3700  build/../torch/csrc/utils/python_arg_parser.cpp:torch::PythonArgParser::raw_parse(_object*, _object*, _object**)
   -4207  work/Objects/obmalloc.c:PyMem_Calloc
   -4500  /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
   -4800  build/../torch/csrc/autograd/generated/VariableType_2.cpp:torch::autograd::VariableType::add__Tensor(at::Tensor&, at::Tensor const&, c10::Scalar)
   -5000  build/../c10/core/impl/LocalDispatchKeySet.cpp:c10::impl::ExcludeDispatchKeyGuard::ExcludeDispatchKeyGuard(c10::DispatchKey)
   -5300  work/Objects/listobject.c:PyList_New
   -5400  build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionParameter::check(_object*, std::vector<pybind11::handle, std::allocator<pybind11::handle> >&)
   -5600  /usr/include/c++/8/bits/std_mutex.h:std::unique_lock<std::mutex>::unlock()
   -6231  work/Objects/obmalloc.c:PyMem_Free
   -6300  work/Objects/listobject.c:list_repeat
  -11200  work/Objects/listobject.c:list_dealloc
  -28900  build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object**, bool)
```

Remaining TODOs:
  * Include a timer in the generated script for cuda sync.
  * Add valgrind to CircleCI machines and add a unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44717

Reviewed By: soumith

Differential Revision: D24010742

Pulled By: robieta

fbshipit-source-id: df6bc765f8efce7193893edba186cd62b4b23623
2020-09-30 05:52:54 -07:00
Ilia Cherniavskii
f5c95d5cf1 Source code level attribution in profiler (#43898)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43898

Add a with_source parameter to enable tracking source code locations
(filename and line) in the profiler for eager, TorchScript, and autograd
modes.

Test Plan:
python test/test_profiler.py
```
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Source Location
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  --------------------------------------------
ts_method_1                          10.43%           235.364us        36.46%           822.920us        822.920us        1                test/test_profiler.py(70): test_source
aten::add                            7.52%            169.833us        8.88%            200.439us        200.439us        1                test/test_profiler.py(69): test_source
aten::normal_                        6.26%            141.380us        6.26%            141.380us        141.380us        1                test/test_profiler.py(67): test_source
aten::add                            5.80%            130.830us        8.41%            189.800us        63.267us         3                test/test_profiler.py(72): test_source
aten::sum                            5.02%            113.340us        8.39%            189.475us        189.475us        1                test/test_profiler.py(64): ts_method_1
aten::add                            4.58%            103.346us        6.33%            142.847us        142.847us        1                test/test_profiler.py(62): ts_method_1
aten::mul                            4.05%            91.498us         9.62%            217.113us        217.113us        1                test/test_profiler.py(71): test_source
aten::add                            4.03%            90.880us         5.60%            126.405us        126.405us        1                test/test_profiler.py(58): ts_method_2
aten::empty                          3.49%            78.735us         3.49%            78.735us         19.684us         4                test/test_profiler.py(72): test_source
```

Reviewed By: ngimel

Differential Revision: D23432664

Pulled By: ilia-cher

fbshipit-source-id: 83ad7ebe0c2502494d3b48c4e687802db9c77615
2020-09-30 00:57:35 -07:00
Xiang Gao
c2c7099944 Fix docs for kwargs, q-z (#43589)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43589

Reviewed By: zhangguanheng66

Differential Revision: D24006259

Pulled By: mruberry

fbshipit-source-id: 39abd474744f152648aad201d7311b42d20efc88
2020-09-29 22:57:02 -07:00
Peng-Jen Chen
93650a82c9 Move prim::tolist math.log and aten::cpu to lite interpreter for translation model (#45482)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45482

Some models we are working on need these ops in the lite interpreter.

Test Plan: built locally and loaded/ran the TS model without problems.

Reviewed By: iseeyuan

Differential Revision: D23906581

fbshipit-source-id: 01b9de2af2046296165892b837bc14a7e5d59b4e
2020-09-29 21:42:18 -07:00
Mikhail Zolotukhin
4aca63d38a [TensorExpr] Change API for creating Load and Store expressions. (#45520)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45520

With this change `Load`s and `Store`s no longer accept `Placeholder`s in
their constructor and `::make` functions and can only be built with
`Buf`.
`Placeholder` gets its own `store`, `load`, `storeWithMask`, and
`loadWithMask` methods for more convenient construction.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23998789

Pulled By: ZolotukhinM

fbshipit-source-id: 3fe018e00c1529a563553b2b215f403b34aea912
2020-09-29 20:52:38 -07:00
Taylor Robie
ccad73ab41 Fix D23995953 import.
Summary: https://github.com/pytorch/pytorch/pull/45511 could not be properly imported

Test Plan: See https://github.com/pytorch/pytorch/pull/45511

Reviewed By: zhangguanheng66

Differential Revision: D23995953

fbshipit-source-id: a6224a67d54617ddf34c2392e65f2142c4e78ea4
2020-09-29 19:30:23 -07:00
Xiang Gao
0a15646e15 CUDA RTX30 series support (#45489)
Summary:
I also opened a PR on cmake upstream: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/5292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45489

Reviewed By: zhangguanheng66

Differential Revision: D23997844

Pulled By: ezyang

fbshipit-source-id: 4e7443dde9e70632ee429184f0d51cb9aa5a98b5
2020-09-29 18:19:23 -07:00
Guilherme Leobas
c1e6592964 Enable type-checking of torch.nn.quantized.* modules (#43110)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43029

I am not changing the following files in this PR:
* `torch/nn/quantized/dynamic/modules/rnn.py` due to https://github.com/pytorch/pytorch/issues/43072
* `torch/nn/quantized/modules/conv.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43110

Reviewed By: gchanan

Differential Revision: D23963258

Pulled By: ezyang

fbshipit-source-id: 0fb0fd13af283f6f7b3434e7bbf62165357d1f98
2020-09-29 18:14:29 -07:00
Guilherme Leobas
375a83e6c1 Annotate torch.utils.(tensorboard/show_pickle/hipify) (#44216)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44216

Reviewed By: gchanan

Differential Revision: D23963216

Pulled By: ezyang

fbshipit-source-id: b3fed51b2a1cbd05e3cd0222c89c38d61d8968c1
2020-09-29 18:14:26 -07:00
Guilherme Leobas
eb39542e67 Add typing annotations for torch.utils.data.* modules (#44136)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44136

Reviewed By: gchanan

Differential Revision: D23963273

Pulled By: ezyang

fbshipit-source-id: 939234dddbe89949bd8e5ff05d06f6c8add6935c
2020-09-29 18:12:05 -07:00
Thomas Viehmann
22a34bcf4e ROCm {emoji:2764} TensorExpr (#45506)
Summary:
This might be an alternative to reverting https://github.com/pytorch/pytorch/issues/45396 .
The obvious rough edge is that I'm not really seeing the work group limits that TensorExpr produces.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45506

Reviewed By: zhangguanheng66

Differential Revision: D23991410

Pulled By: Krovatkin

fbshipit-source-id: 11d3fc4600e4bffb1d1192c6b8dd2fe22c1e064e
2020-09-29 16:52:16 -07:00
Hongyi Jia
06a566373a [PyTorch/NCCL] Fix async error handling (#45456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45456

Remove work while not holding the lock, to avoid a deadlock with the watchdog thread while the GPU is at 100% utilization.
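
One common shape for this kind of fix, sketched in Python under stated assumptions: `Work`, `WorkQueue`, and `release()` are hypothetical stand-ins (the real code is C++ in ProcessGroupNCCL), and `release()` represents teardown that might block on the GPU.

```python
import threading
from typing import List


class Work:
    """Hypothetical work item; release() stands in for teardown that may block."""

    def __init__(self, name: str, completed: bool) -> None:
        self.name = name
        self.completed = completed
        self.released = False

    def release(self) -> None:
        self.released = True


class WorkQueue:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._works: List[Work] = []

    def add(self, work: Work) -> None:
        with self._lock:
            self._works.append(work)

    def cleanup_completed(self) -> List[Work]:
        done: List[Work] = []
        remaining: List[Work] = []
        # Under the lock: only unlink finished items from the shared list.
        with self._lock:
            for w in self._works:
                (done if w.completed else remaining).append(w)
            self._works = remaining
        # Lock released: run the teardown that might block.
        for w in done:
            w.release()
        return done

    def pending(self) -> int:
        with self._lock:
            return len(self._works)


queue = WorkQueue()
queue.add(Work("allreduce_0", completed=True))
queue.add(Work("allreduce_1", completed=False))
finished = queue.cleanup_completed()
```

Keeping the critical section down to list bookkeeping means another thread waiting on the same lock is never stuck behind a blocking teardown.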

SyncBatchNorm failure trace: P143879560

Test Plan:
**Desync test:**
BACKEND=nccl WORLD_SIZE=3 NCCL_ASYNC_ERROR_HANDLING=1 ./buck-out/gen/caffe2/test/distributed/distributed_nccl_spawn#binary.par -r test_DistributedDataParallel_desync

**SyncBatchNorm test:**
BACKEND=nccl WORLD_SIZE=3 NCCL_ASYNC_ERROR_HANDLING=1 ./buck-out/gen/caffe2/test/distributed/distributed_nccl_fork#binary.par -r test_DistributedDataParallel_SyncBatchNorm_Diff_Input_Sizes_gradient

Reviewed By: osalpekar

Differential Revision: D23972071

fbshipit-source-id: f03d9637a6ec998d64dab1a062a81e0f3697275f
2020-09-29 15:44:34 -07:00
Garret Catron
ef41472544 Create experimental FX graph manipulation library (#44775)
Summary:
This PR adds a new GraphManipulation library for operating on the GraphModule nodes.
It also adds an implementation of replace_target_nodes_with, which replaces all nodes in the GraphModule of a specific op/target with a new specified op/target. An example use of this function would be replacing a generic operator with an optimized operator for specific sizes and shapes.
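
The replacement described above can be illustrated without PyTorch. The following is a minimal sketch on a toy node list; `ToyNode` and `replace_target_nodes` are hypothetical names made up for this illustration and are not the FX API or the library's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyNode:
    """Hypothetical stand-in for a graph node; not the real torch.fx Node."""
    name: str
    op: str      # e.g. "placeholder", "call_function"
    target: str  # e.g. "add", "optimized_add"


def replace_target_nodes(nodes: List[ToyNode], old_op: str, old_target: str,
                         new_op: str, new_target: str) -> List[ToyNode]:
    """Return a new list where every (old_op, old_target) node is rewritten."""
    result = []
    for node in nodes:
        if node.op == old_op and node.target == old_target:
            # Keep the node's name, swap only its op/target pair.
            result.append(ToyNode(node.name, new_op, new_target))
        else:
            result.append(node)
    return result


graph = [
    ToyNode("x", "placeholder", "x"),
    ToyNode("y", "call_function", "add"),
    ToyNode("z", "call_function", "mul"),
]
rewritten = replace_target_nodes(graph, "call_function", "add",
                                 "call_function", "optimized_add")
print([n.target for n in rewritten])
```

Only nodes matching both the op and the target are rewritten; the input list is left untouched.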

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44775

Reviewed By: jamesr66a

Differential Revision: D23874561

Pulled By: gcatron

fbshipit-source-id: e1497cd11e0bbbf1fabdf137d65c746248998e0b
2020-09-29 15:32:41 -07:00
Randall Hunt
ab5cf16b6c fix standard deviation gradient NaN behavior (#45468)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/4320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45468

Reviewed By: zhangguanheng66

Differential Revision: D23991064

Pulled By: albanD

fbshipit-source-id: d4274895f2dac8b2cdbd73e5276ce3df466fc341
2020-09-29 13:47:29 -07:00
anjali411
18876b5722 Update backward formula for torch.dot and add backward definition for torch.vdot (#45074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45074

TODO: Add R -> C tests in https://github.com/pytorch/pytorch/pull/44744 (blocked on some JIT changes)

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D23975361

Pulled By: anjali411

fbshipit-source-id: 3512bd2962b588a198bc317673bd18cc96ac823f
2020-09-29 12:52:03 -07:00
Mike Ruberry
b2925671b6 Updates deterministic flag to throw a warning, makes docs consistent (#45410)
Summary:
Per feedback in the recent design review. Also tweaks the documentation to clarify what "deterministic" means and adds a test for the behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45410

Reviewed By: ngimel

Differential Revision: D23974988

Pulled By: mruberry

fbshipit-source-id: e48307da9c90418fc6834fbd67b963ba2fe0ba9d
2020-09-29 11:17:33 -07:00
Ivan Yashchuk
f47fd0eb72 Updated cholesky_backward for complex inputs (#45267)
Summary:
Updated `cholesky_backward` to work correctly for complex input.
Note that the current implementation gives the conjugate of what JAX would return. anjali411, is that the correct thing to do?
Ref. https://github.com/pytorch/pytorch/issues/44895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45267

Reviewed By: bwasti

Differential Revision: D23975269

Pulled By: anjali411

fbshipit-source-id: 9908b0bb53c411e5ad24027ff570c4f0abd451e6
2020-09-29 11:07:32 -07:00
Xingying Cheng
ea59251f51 Fix model_name not logged properly issue. (#45488)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45488

model_name logging was broken by the recent change that assigns the method name to the module name. This diff fixes it.
ghstack-source-id: 113103942

Test Plan:
Made sure that the model_name is now logged from module_->name().
Verified with one model that does not contain the model metadata; the model_name field is logged as below:

09-28 21:59:30.065 11530 12034 W module.cpp: TESTINGTESTING run() module = __torch__.Model
09-28 21:59:30.065 11530 12034 W module.cpp: TESTINGTESTING metadata does not have model_name assigning to __torch__.Model
09-28 21:59:30.066 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod log  model_name = __torch__.Model
09-28 21:59:30.066 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod log  method_name = labels
09-28 21:59:30.068 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onExitRunMethod()

Reviewed By: linbinyu

Differential Revision: D23984165

fbshipit-source-id: 5b00f50ea82106b695c2cee14029cb3b2e02e2c8
2020-09-29 10:37:36 -07:00
Meghan Lele
09b3e16b40 [JIT] Enable @unused syntax for ignoring properties (#45261)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45261

**Summary**
This commit enables `unused` syntax for ignoring
properties. Ignoring properties is more intuitive with this feature enabled.
`ignore` is not supported because class type properties cannot be
executed in Python (because they exist only as TorchScript types) like
an `ignored` function and module properties that cannot be scripted
are not added to the `ScriptModule` wrapper so that they
may execute in Python.

**Test Plan**
This commit updates the existing unit tests for class type and module
properties to test properties ignored using `unused`.

Test Plan: Imported from OSS

Reviewed By: navahgar, Krovatkin, mannatsingh

Differential Revision: D23971881

Pulled By: SplitInfinity

fbshipit-source-id: 8d3cc1bbede7753d6b6f416619e4660c56311d33
2020-09-29 10:24:25 -07:00
Akshit Khurana
5f49d14be2 Add mobile_optimized tag to optimized model. (#45479)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45479

Add a top level boolean attribute to the model called mobile_optimized that is set to true if it is optimized.

Test Plan: buck test //caffe2/test:mobile passes

Reviewed By: kimishpatel

Differential Revision: D23956728

fbshipit-source-id: 79c5931702208b871454319ca2ab8633596b1eb8
2020-09-29 10:06:57 -07:00
Mike Ruberry
ab5edf21b0 Revert D23789657: [wip] fast typeMeta/ScalarType conversion approach 2
Test Plan: revert-hammer

Differential Revision:
D23789657 (1ed1a2f5b0)

Original commit changeset: 5afdd52d24bd

fbshipit-source-id: 6d827be8895bcb39c8e85342eee0f7a3f5056c76
2020-09-29 09:40:53 -07:00
Nikita Shulga
b3135c2056 Enable torch.cuda.amp typechecking (#45480)
Summary:
Fix `torch._C._autocast_*_nesting` declarations in __init__.pyi

Fix iterable constructor logic: not every iterable can be reconstructed with the `type(val)(val)` trick. For example, it does not work for `val=range(10)` even though `isinstance(val, Iterable)` is True.
Change optional resolution logic to meet mypy expectations
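
The iterable constructor pitfall mentioned above can be reproduced with the standard library alone. This is a small sketch; the helper name `rebuild` is made up for illustration and is not taken from the PR.

```python
from collections.abc import Iterable


def rebuild(val):
    """Try to rebuild an iterable with the type(val)(val) trick."""
    return type(val)(val)


# Works for containers whose constructor accepts an iterable.
print(rebuild([1, 2, 3]))
print(rebuild((1, 2, 3)))

val = range(10)
print(isinstance(val, Iterable))  # range counts as an Iterable

try:
    rebuild(val)  # equivalent to range(range(10))
except TypeError as exc:
    failed = True
    print("range cannot be rebuilt this way:", exc)
else:
    failed = False
```

The call fails because the `range` constructor expects integers, not another iterable.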

Fixes https://github.com/pytorch/pytorch/issues/45436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45480

Reviewed By: walterddr

Differential Revision: D23982822

Pulled By: malfet

fbshipit-source-id: 6418a28d04ece1b2427dcde4b71effb67856a872
2020-09-29 09:31:55 -07:00
Mike Ruberry
bb19a55429 Improves fft doc consistency and makes deprecation warnings more prominent (#45409)
Summary:
This PR makes the deprecation warnings for existing fft functions more prominent and makes the torch.stft deprecation warning consistent with our current deprecation planning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45409

Reviewed By: ngimel

Differential Revision: D23974975

Pulled By: mruberry

fbshipit-source-id: b90d8276095122ac3542ab625cb49b991379c1f8
2020-09-29 09:07:49 -07:00
Mike Ruberry
6d37126a10 Makes rdiv consistent with div (#45407)
Summary:
In addition to making rdiv consistent with div, this PR significantly expands division testing, accounting for floor_divide actually performing truncation division, too.
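
The difference between floor division and truncation division only shows up when the operands have opposite signs. A plain-Python sketch (no torch involved; the helper names are illustrative):

```python
import math


def floor_div(a, b):
    """Round the quotient toward negative infinity."""
    return math.floor(a / b)


def trunc_div(a, b):
    """Round the quotient toward zero."""
    return math.trunc(a / b)


for a, b in [(7, 2), (-7, 2), (7, -2)]:
    print(f"a={a:3d} b={b:3d} floor={floor_div(a, b):3d} trunc={trunc_div(a, b):3d}")
```

For same-sign operands both helpers agree; for mixed signs the floor result is one lower than the truncated result whenever the division is inexact.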

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45407

Reviewed By: ngimel

Differential Revision: D23974967

Pulled By: mruberry

fbshipit-source-id: 82b46b07615603f161ab7cd1d3afaa6d886bfe95
2020-09-29 08:34:01 -07:00
Mike Ruberry
87f98a5b54 Updates torch.floor_divide documentation to clarify it's actually torch.trunc_divide (or torch.rtz_divide) (#45411)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/43874 for 1.7. 1.8 will need to take floor_divide through a proper deprecation process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45411

Reviewed By: ngimel

Differential Revision: D23974997

Pulled By: mruberry

fbshipit-source-id: 16dd07e50a17ac76bfc93bd6b71d4ad72d909bf4
2020-09-29 05:55:44 -07:00
Antonio Cuni
37f9af7f29 Missing tests about torch.xxx(out=...) (#44465)
Summary:
PR opened just to run the CI tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44465

Reviewed By: ngimel

Differential Revision: D23907565

Pulled By: mruberry

fbshipit-source-id: 620661667877f1e9a2bab17d19988e2dc986fc0f
2020-09-29 04:54:46 -07:00
Mike Ruberry
56af122659 Revert D23966878: [pytorch][PR] This PR flips a switch to enable PE + TE
Test Plan: revert-hammer

Differential Revision:
D23966878 (dddb685c11)

Original commit changeset: 2010a0b07c59

fbshipit-source-id: 132556039730fd3e4babd0d7ca8daf9c8d14f728
2020-09-29 04:33:19 -07:00
Basil Hosmer
1ed1a2f5b0 [wip] fast typeMeta/ScalarType conversion approach 2 (#44965)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44965

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23789657

Pulled By: bhosmer

fbshipit-source-id: 5afdd52d24bd097891ff4a7313033f7bd400165e
2020-09-29 02:39:36 -07:00
Supriya Rao
489af4ddcb [quant] Add quant APIs to save/load observer state_dict (#44846)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44846

The save function traverses the model state dict to pick out the observer stats
The load function traverses the module hierarchy to load the state dict into module attributes depending on observer type.

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_save_observer_state_dict

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23746821

fbshipit-source-id: 05c571b62949a2833602d736a81924d77e7ade55
2020-09-29 01:52:42 -07:00
Zafar
bb478810e0 [quant] torch.max_pool1d (#45152)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45152

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23846473

Pulled By: z-a-f

fbshipit-source-id: 38fd611e568e4f8b39b7a00adeb42c7b99576360
2020-09-29 01:45:22 -07:00
Mikhail Zolotukhin
b86008ab75 [TensorExpr] Remove buf_ field from class Tensor. (#45390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45390

Tensor objects should always refer to their Function's bufs. Currently
we never create a Tensor with a buffer different from its Function's,
but having it in two places seems incorrect and dangerous.

Differential Revision: D23952865

Test Plan: Imported from OSS

Reviewed By: nickgg

Pulled By: ZolotukhinM

fbshipit-source-id: e63fc26d7078427514649d9ce973b74ea635a94a
2020-09-29 01:21:57 -07:00
Mikhail Zolotukhin
3c33695a6d [TensorExpr] Rename Buffer to Placeholder. (#45389)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45389

Differential Revision: D23952866

Test Plan: Imported from OSS

Reviewed By: nickgg

Pulled By: ZolotukhinM

fbshipit-source-id: 17eedd3ac17897501403482ac1866c569d247c75
2020-09-29 01:21:54 -07:00
Mikhail Zolotukhin
92306b85d5 [TensorExpr] Consolidate {buffer,function,tensor}.{h.cpp} in tensor.{h,cpp}. (#45388)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45388

Classes defined in these files are closely related, so it is reasonable
to have them all in one file. The change is purely a code move.

Differential Revision: D23952867

Test Plan: Imported from OSS

Reviewed By: nickgg

Pulled By: ZolotukhinM

fbshipit-source-id: 12cfaa968bdfc4dff00509e34310a497c7b59155
2020-09-29 01:17:10 -07:00
Iurii Zdebskyi
8c309fc052 Add more tests for mt optimizers (#45475)
Summary:
Add more test cases for mt optimizers and fix Adam/AdamW

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45475

Reviewed By: soumith

Differential Revision: D23982727

Pulled By: izdeby

fbshipit-source-id: 4b24d37bd52a2fa3719d3e3a5dcf3b96990b0f5b
2020-09-28 23:59:58 -07:00
James Reed
6bdb871d47 [FX] Lint pass for Graphs (#44973)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44973

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23792631

Pulled By: jamesr66a

fbshipit-source-id: d8faef0c311d8bd611ba0a7e1e2f353e3e5a1068
2020-09-28 23:00:32 -07:00
James Reed
b0bdc82a00 [FX][EZ] Fix bug where copying node made non-unique name (#45311)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45311

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D23917864

Pulled By: jamesr66a

fbshipit-source-id: 10d0a4017ffe160bce4ba0d830e035616bbded74
2020-09-28 22:55:20 -07:00
lixinyu
417e3f85e5 Support tuple inputs in NN Module test (#44853)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44853

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23750441

Pulled By: glaringlee

fbshipit-source-id: 1b111a370a726b40521134b711c35f48dda99411
2020-09-28 22:05:05 -07:00
Nikolay Korovaiko
dddb685c11 This PR flips a switch to enable PE + TE (#45396)
Summary:
This PR flips a switch to enable PE + TE
next PR: https://github.com/pytorch/pytorch/pull/45397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45396

Reviewed By: suo

Differential Revision: D23966878

Pulled By: Krovatkin

fbshipit-source-id: 2010a0b07c595992a88b3fe0792d6af315cf421e
2020-09-28 21:57:50 -07:00
Natalia Gimelshein
50b91103a9 add self cuda time to avoid double/quadruple counting (#45209)
Summary:
The profiler did not report self time for CUDA, so for composite functions there was no way to determine which function was really taking the time. In addition, the reported "total cuda time" was frequently more than the total wallclock time. This PR adds "self CUDA time" to the profiler and computes total CUDA time from self CUDA time, similar to how it's done for CPU. It also makes slight formatting changes to make the table more compact. Before:
```
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                  Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
aten::matmul          0.17%            890.805us        99.05%           523.401ms        5.234ms          49.91%           791.184ms        7.912ms          100
aten::mm              98.09%           518.336ms        98.88%           522.511ms        5.225ms          49.89%           790.885ms        7.909ms          100
aten::t               0.29%            1.530ms          0.49%            2.588ms          25.882us         0.07%            1.058ms          10.576us         100
aten::view            0.46%            2.448ms          0.46%            2.448ms          12.238us         0.06%            918.936us        4.595us          200
aten::transpose       0.13%            707.204us        0.20%            1.058ms          10.581us         0.03%            457.802us        4.578us          100
aten::empty           0.14%            716.056us        0.14%            716.056us        7.161us          0.01%            185.694us        1.857us          100
aten::as_strided      0.07%            350.935us        0.07%            350.935us        3.509us          0.01%            156.380us        1.564us          100
aten::stride          0.65%            3.458ms          0.65%            3.458ms          11.527us         0.03%            441.258us        1.471us          300
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 528.437ms
CUDA time total: 1.585s

Recorded timeit time:  789.0814 ms

```
Note that the recorded timeit time (with proper CUDA syncs) is half the "CUDA time total" reported by the profiler.

After:
```
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
        aten::matmul         0.15%     802.716us        99.06%     523.548ms       5.235ms     302.451us         0.04%     791.151ms       7.912ms           100
            aten::mm        98.20%     519.007ms        98.91%     522.745ms       5.227ms     790.225ms        99.63%     790.848ms       7.908ms           100
             aten::t         0.27%       1.406ms         0.49%       2.578ms      25.783us     604.964us         0.08%       1.066ms      10.662us           100
          aten::view         0.45%       2.371ms         0.45%       2.371ms      11.856us     926.281us         0.12%     926.281us       4.631us           200
     aten::transpose         0.15%     783.462us         0.22%       1.173ms      11.727us     310.016us         0.04%     461.282us       4.613us           100
         aten::empty         0.11%     591.603us         0.11%     591.603us       5.916us     176.566us         0.02%     176.566us       1.766us           100
    aten::as_strided         0.07%     389.270us         0.07%     389.270us       3.893us     151.266us         0.02%     151.266us       1.513us           100
        aten::stride         0.60%       3.147ms         0.60%       3.147ms      10.489us     446.451us         0.06%     446.451us       1.488us           300
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 528.498ms
CUDA time total: 793.143ms

Recorded timeit time:  788.9832 ms

```
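
The relationship between self time and total time in the table above can be checked with a few lines of arithmetic. This is a sketch using numbers copied from the "After" table; treating `aten::as_strided` as the only child of `aten::transpose` is an assumption made for the illustration, not something the profiler output states.

```python
def self_time(total_us, child_totals_us):
    """Self time = inclusive (total) time minus time spent in direct children."""
    return total_us - sum(child_totals_us)


# CUDA totals from the "After" table, in microseconds.
transpose_total = 461.282
as_strided_total = 151.266  # assumed to be transpose's only child

transpose_self = self_time(transpose_total, [as_strided_total])
print(f"aten::transpose self CUDA: {transpose_self:.3f}us")

# Self CUDA column from the "After" table, in microseconds.
self_cuda_us = {
    "aten::matmul": 302.451,
    "aten::mm": 790225.0,
    "aten::t": 604.964,
    "aten::view": 926.281,
    "aten::transpose": 310.016,
    "aten::empty": 176.566,
    "aten::as_strided": 151.266,
    "aten::stride": 446.451,
}

# Summing self times counts every microsecond exactly once.
total_ms = sum(self_cuda_us.values()) / 1000.0
print(f"CUDA time total: {total_ms:.3f}ms")
```

Summing the self times reproduces the "CUDA time total" line of the new output, whereas summing inclusive totals would count nested calls more than once.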

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45209

Reviewed By: zou3519

Differential Revision: D23925491

Pulled By: ngimel

fbshipit-source-id: 7f9c49238d116bfd2db9db3e8943355c953a77d0
2020-09-28 21:51:13 -07:00
Shen Li
5be954b502 Fix WorkerInfo link format (#45476)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45476

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23982069

Pulled By: mrshenli

fbshipit-source-id: 6d932e77c1941dfd96592b388353f0fc8968dde6
2020-09-28 20:48:15 -07:00
Shen Li
8e47fcba5f Update docs for RPC async_execution (#45458)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45458

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23973366

Pulled By: mrshenli

fbshipit-source-id: 3697f07fa972db21746aa25eaf461c1b93293f58
2020-09-28 20:48:12 -07:00
Shen Li
c5ade5f698 Fix no_sync docs (#45455)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45455

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23973365

Pulled By: mrshenli

fbshipit-source-id: 87c9878cdc7310754670b83efa65ae6f877f86fb
2020-09-28 20:48:09 -07:00
Shen Li
6967e6295e Fix DDP docs (#45454)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45454

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23973367

Pulled By: mrshenli

fbshipit-source-id: 11f20d51d0d0f92f199e4023f02b86623867bae0
2020-09-28 20:43:22 -07:00
Alex Suhan
52cbc9e4ec [TensorExpr] Always inline and DCE in the LLVM backend (#45445)
Summary:
Inline pytorch into wrapper, which is especially helpful in combination
with dead code elimination to reduce IR size and compilation times when
a lot of parameters are unused.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45445

Test Plan: CI

Reviewed By: ZolotukhinM

Differential Revision: D23969009

Pulled By: asuhan

fbshipit-source-id: a21509d07e4c130b6aa6eae5236bb64db2748a3d
2020-09-28 18:11:13 -07:00
Meghan Lele
7ac872b934 [JIT] Modify to_backend API so that it accepts wrapped modules (#43612)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43612

**Summary**
This commit modifies the `torch._C._jit_to_backend` function so that it
accepts `ScriptModules` as inputs. It already returns `ScriptModules`
(as opposed to C++ modules), so this makes sense and makes the API more
intuitive.

**Test Plan**
Continuous integration, which includes unit tests and out-of-tree tests
for custom backends.

**Fixes**
This commit fixes #41432.

Test Plan: Imported from OSS

Reviewed By: suo, jamesr66a

Differential Revision: D23339854

Pulled By: SplitInfinity

fbshipit-source-id: 08ecef729c4e1e6bddf3f483276947fc3559ea88
2020-09-28 17:17:01 -07:00
Rong Rong
5855aa8dac Type check quasirandom (#45434)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42978.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45434

Reviewed By: walterddr

Differential Revision: D23967139

Pulled By: ajitmaths

fbshipit-source-id: bcee6627f367fd01aa9a5c10a7c24331fc1823ad
2020-09-28 16:49:38 -07:00
Rong Rong
49b198c454 type check for torch.testing._internal.common_utils (#45375)
Summary:
part of torch.testing._internal.* effort

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45375

Reviewed By: malfet

Differential Revision: D23964315

Pulled By: walterddr

fbshipit-source-id: efdd643297f5c7f75670ffe60ff7e82fc413d18d
2020-09-28 16:28:46 -07:00
Heitor Schueroff de Souza
96f8755034 Fixed handling of nan for evenly_distribute_backward (#45280)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45280

Performance is unchanged on CPU and only 1-1.05x slower on CUDA. This change is necessary for the upcoming nan ops, including nan(min|max|median).

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D23908796

Pulled By: heitorschueroff

fbshipit-source-id: c2b57acbe924cfa59fbd85216811f29f4af05088
2020-09-28 15:57:02 -07:00
Jan Schlüter
6a206df891 20000x faster audio conversion for SummaryWriter (#44201)
Summary:
Stumbled upon a little gem in the audio conversion for `SummaryWriter.add_audio()`: two Python `for` loops to convert a float array to little-endian int16 samples. On my machine, this took 35 seconds for a 30-second 22.05 kHz excerpt. The same can be done directly in numpy in 1.65 milliseconds. (No offense, I'm glad that the functionality was there!)
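
For reference, the conversion itself (float samples to little-endian int16 bytes) can be sketched with the standard library. The PR does this in numpy; the version below swaps in `struct` so it runs without dependencies, and the clipping to [-1, 1] and the 32767 scale factor are assumptions of this sketch rather than details taken from the PR.

```python
import struct


def floats_to_int16_le(samples):
    """Convert float samples in [-1, 1] to little-endian int16 bytes."""
    # Clip first so out-of-range samples cannot overflow int16 (assumption).
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(s * 32767) for s in clipped]
    # "<" forces little-endian, "h" is a signed 16-bit integer;
    # a single pack call handles the whole buffer.
    return struct.pack("<%dh" % len(ints), *ints)


data = floats_to_int16_le([0.0, 1.0, -1.0, 2.0])
print(data.hex())
```

Each sample occupies two bytes, least significant byte first.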

Would also be ready to extend this to support stereo waveforms, or should this become a separate PR?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44201

Reviewed By: J0Nreynolds

Differential Revision: D23831002

Pulled By: edward-io

fbshipit-source-id: 5c8f1ac7823d1ed41b53c4f97ab9a7bac33ea94b
2020-09-28 15:44:29 -07:00
Zachary DeVito
e54e1fe51e [package] Add dependency viz (#45214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45214

When in verbose mode the package exporter will produce an html visualization
of dependencies of a module to make it easier to trim out unneeded code,
or debug inclusion of things that cannot be exported.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D23873525

Pulled By: zdevito

fbshipit-source-id: 6801991573d8dd5ab8c284e09572b36a35e1e5a4
2020-09-28 15:38:41 -07:00
Omkar Salpekar
6b65b3cbd8 [Distributed] DeleteKey API for c10d TCP Store (#45401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45401

Added a DeleteKey API for the TCP Store
ghstack-source-id: 112997162

Test Plan:
Modified the existing get/set test to use delete. Verified that the
correct keys were deleted and that the numKeys API returned the right values.

Reviewed By: mrshenli

Differential Revision: D23955730

fbshipit-source-id: 5c9f82be34ff4521c59f56f8d9c1abf775c67f9f
2020-09-28 15:30:39 -07:00
Gregory Chanan
1097fe0088 Remove CriterionTest.test_cuda code for dtype None. (#45316)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45316

It's never used.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23919449

Pulled By: gchanan

fbshipit-source-id: f9aaeeabf3940389156bfc01bc3118d348ca4cf6
2020-09-28 15:08:09 -07:00
lcskrishna
a4486fe7ba [ROCm] Print name irrespective of seq number assignment for roctx traces (#45229)
Summary:
Recent changes to the seq_num correlation behavior in profiler (PR https://github.com/pytorch/pytorch/issues/42565) have changed the behavior for emit_nvtx(record_shapes=True), which no longer prints the name of the operator properly.

This PR dumps out the name in roctx traces regardless of the sequence number assigned. The change applies only to ROCm.

cc: jeffdaily sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45229

Reviewed By: zou3519

Differential Revision: D23932902

Pulled By: albanD

fbshipit-source-id: c782667ff002b70b51f1cc921afd1b1ac533b39d
2020-09-28 15:03:47 -07:00
Taylor Robie
c6b7eeb654 Gh/taylorrobie/timer cleanup (#45361)
Summary:
This PR cleans up some of the rough edges around `Timer` and `Compare`
* Moves `Measurement` to be dataclass based
* Adds a bunch of type annotations. MyPy is now happy.
* Allows missing entries in `Compare`. This is one of the biggest usability issues with `Compare` right now, both from an API perspective and because the current failure mode is really unpleasant.
* Greatly expands the testing of `Compare`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45361

Test Plan: Changes to Timer are covered under existing tests, changes to `Compare` are covered by the expanded `test_compare` method.

Reviewed By: bwasti

Differential Revision: D23966816

Pulled By: robieta

fbshipit-source-id: 826969f73b42f72fa35f4de3c64d0988b61474cd
2020-09-28 14:56:43 -07:00
Negin Raoof
a77d633db1 [ONNX] Fix view for dynamic input shape (#43558)
Summary:
Export of view op with dynamic input shape is broken when using tensors with a 0-dim.
This fix removes symbolic use of static input size to fix this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43558

Reviewed By: ailzhang

Differential Revision: D23965090

Pulled By: bzinodev

fbshipit-source-id: 628e9d7ee5d53375f25052340ca6feabf7ba7c53
2020-09-28 14:46:51 -07:00
Gregory Chanan
5d1fee23b3 Remove convert_target from NN tests. (#45291)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45291

It's not necessary, you can just check if the dtype is integral.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23911963

Pulled By: gchanan

fbshipit-source-id: 230139e1651eb76226f4095e31068dded30e03e8
2020-09-28 14:21:42 -07:00
Rong Rong
986af53be2 type check for torch.testing._internal.codegen:* (#45368)
Summary:
part of `torch.testing._internal.*` effort

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45368

Reviewed By: malfet

Differential Revision: D23950512

Pulled By: walterddr

fbshipit-source-id: 399f712d12cdd9795b0136328f512c3f86a15f24
2020-09-28 14:04:52 -07:00
Yi Wang
7a4c417ed3 Fix typo (#45379)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45379

Registeres -> Registers in reducer.h.
ghstack-source-id: 112982279

Test Plan: N/A

Reviewed By: mrshenli

Differential Revision: D23951203

fbshipit-source-id: 96c7dc2e1e12c132339b9ac83ce1da52c812740c
2020-09-28 14:02:01 -07:00
BowenBao
57c18127dc [ONNX] Update div export to perform true divide (#44831)
Summary:
related https://github.com/pytorch/pytorch/issues/43787

Now that PyTorch div is actually performing true divide, update onnx export code to stay consistent.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44831

Reviewed By: eellison

Differential Revision: D23880316

Pulled By: bzinodev

fbshipit-source-id: 3bb8db34142ac4fed4039295ad3c4cb79487987f
2020-09-28 13:53:43 -07:00
gunandrose4u
47debdca42 Document change for DDP enabled on Windows platform (#45392)
Summary:
Document change for DDP enabled on Windows platform

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45392

Reviewed By: gchanan

Differential Revision: D23962344

Pulled By: mrshenli

fbshipit-source-id: 8924c6ca36d68699871d8add3e0aab6542ea269c
2020-09-28 13:22:42 -07:00
Iurii Zdebskyi
722faeb2a4 [RELAND] Added optimizers based on multi tensor apply (#45408)
Summary:
Original PR https://github.com/pytorch/pytorch/pull/45299.  The present PR fixes minor bugs that caused revert.

Adding a new namespace `torch.optim._multi_tensor` with a bunch of updated optimizers. Those optimizers are using _foreach APIs which improve performance significantly.

### Tests
- updated existing tests to use both optimizers
- added `test_multi_tensor_optimizers` test to verify correctness.

### Perf results

**Adam**
timeit: 42.69 ms --> 10.16 ms
autorange: 41.96 ms --> 10.28 ms

**AdamW**
timeit: 51.38 ms --> 15.63 ms
autorange: 50.82 ms --> 16.07 ms

**SGD**
timeit: 6.28 ms --> 4.40 ms
autorange: 6.13 ms --> 4.73 ms

**RMSprop**
timeit: 28.63 ms --> 5.89 ms
autorange: 28.27 ms -->  5.76 ms

**Rprop**
timeit: 213.30 --> 178.42
autorange: 212.03 --> 178.03

**ASGD**
timeit: 21.67 --> 9.33
autorange: 21.64 --> 9.27

**Adamax**
timeit: 55.60 --> 48.29
autorange: 55.22 --> 49.13
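
The timeit numbers above can be turned into speedup ratios to make the optimizers easier to compare. A small sketch; the figures are copied from the results above and the ratio does not depend on the unit.

```python
# (before, after) timeit numbers copied from the perf results above.
timeit_results = {
    "Adam": (42.69, 10.16),
    "AdamW": (51.38, 15.63),
    "SGD": (6.28, 4.40),
    "RMSprop": (28.63, 5.89),
    "Rprop": (213.30, 178.42),
    "ASGD": (21.67, 9.33),
    "Adamax": (55.60, 48.29),
}

# Speedup = time before / time after; values above 1 mean the new code is faster.
speedups = {name: before / after
            for name, (before, after) in timeit_results.items()}

for name, ratio in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:8s} {ratio:.2f}x")
```

Sorting by ratio shows which optimizers gain most from the multi-tensor path in this particular benchmark.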

**Perf script used**

```
import torch
import time
import torch.optim as optim
from torch.autograd import Variable
from torch.optim.lr_scheduler import ExponentialLR, ReduceLROnPlateau, StepLR
import torch.nn as nn
import time
import torchvision
import torch.utils._benchmark as benchmark_utils

device = "cuda"
model = torchvision.models.resnet.resnet101(pretrained=True).to(device)
targets = torch.randint(0, 1000, (100, 100), device=device)
criterion = nn.CrossEntropyLoss()

optimizer = optim.SGD(model.parameters(), lr=1e-3) # <----------------------- optimizer.
                                                          # would compare optim.SGD vs optim._multi_tensor.SGD
running_loss = 0.0
target = torch.empty(128, dtype=torch.long, device=device).random_(5)

optimizer.zero_grad()
inputs = torch.rand(128, 3, 100, 100, device=device , requires_grad=True)
outputs = model(inputs)
loss = criterion(outputs, target)
loss.backward()
optimizer.step()
running_loss += loss.item()

def main():
    timer = benchmark_utils.Timer(
        stmt="optimizer.step()",
        globals=globals(),
        label=str(optimizer),  # label with the optimizer's repr, not the literal string
    )

    for i in range(1):
        print(f"Run: {i}\n{'-' * 40}")
        print(f"timeit:\n{timer.timeit(1000)}\n")
        print(f"autorange:\n{timer.blocked_autorange()}\n\n")

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45408

Reviewed By: gchanan

Differential Revision: D23956680

Pulled By: izdeby

fbshipit-source-id: c5eab7bf5fce14a287c15cead1cdc26e42cfed94
2020-09-28 13:14:04 -07:00
Bram Wasti
87b356d093 [static runtime] Split out graph preparation from runtime (#44131)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44131

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604305

Pulled By: bwasti

fbshipit-source-id: 7b47da4961d99074199417ef1407a788c7d80ee6
2020-09-28 13:01:23 -07:00
Nikolay Korovaiko
993628c74a Build shape expressions and remove outputs that are only used by aten::sizes (#45080)
Summary:
Currently, TE materializes all intermediate results even if they are only used for computing their shapes. This diff ports the approach the OF (Old Fuser) took to deal with this issue. Namely, given the structure of a fusion group we infer all the sizes outside a fusion group based on fusion group's inputs.

A simple example would be:

```
        def test_fuse(a, b):
            c = a + b
            d = c + b
            return d
```

Here we don't need to cache `c` as computing a gradient for `b` in `d = c + b` doesn't need it. We do need to compute sizes for all arguments here in case broadcasts happen.

Without this optimization, TE would need to materialize `c` so we can get its size:

```
[DUMP profiling_graph_executor_impl.cpp:499] Optimized Graph:
[DUMP profiling_graph_executor_impl.cpp:499] graph(%a.1 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %b.1 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %11 : Tensor = prim::DifferentiableGraph_0(%b.1, %a.1)
[DUMP profiling_graph_executor_impl.cpp:499]   return (%11)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::DifferentiableGraph_0 = graph(%11 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %13 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %59 : int[] = aten::size(%13) # <string>:3:44
[DUMP profiling_graph_executor_impl.cpp:499]   %62 : int[] = aten::size(%11) # <string>:3:93
[DUMP profiling_graph_executor_impl.cpp:499]   %83 : Double(1:1, requires_grad=0, device=cuda:0), %84 : Double(1:1, requires_grad=0, device=cuda:0), %85 : bool = prim::TypeCheck(%11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]   %86 : Tensor, %87 : Tensor = prim::If(%85)
[DUMP profiling_graph_executor_impl.cpp:499]     block0():
[DUMP profiling_graph_executor_impl.cpp:499]       %d.4 : Double(1:1, requires_grad=0, device=cuda:0), %c.4 : Double(1:1, requires_grad=0, device=cuda:0) = prim::TensorExprGroup_0(%83, %84)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%d.4, %c.4)
[DUMP profiling_graph_executor_impl.cpp:499]     block1():
[DUMP profiling_graph_executor_impl.cpp:499]       %94 : Function = prim::Constant[name="fallback_function", fallback=1]()
[DUMP profiling_graph_executor_impl.cpp:499]       %95 : (Tensor, Tensor) = prim::CallFunction(%94, %11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]       %96 : Tensor, %97 : Tensor = prim::TupleUnpack(%95)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%96, %97)
[DUMP profiling_graph_executor_impl.cpp:499]   %60 : int[] = aten::size(%87) # <string>:3:55
[DUMP profiling_graph_executor_impl.cpp:499]   %61 : int[]? = aten::_size_if_not_equal(%59, %60) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %64 : int[]? = aten::_size_if_not_equal(%62, %60) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   %67 : int[] = aten::size(%86) # <string>:3:55
[DUMP profiling_graph_executor_impl.cpp:499]   %68 : int[]? = aten::_size_if_not_equal(%60, %67) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %71 : int[]? = aten::_size_if_not_equal(%62, %67) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   return (%86, %61, %64, %68, %71)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::TensorExprGroup_0 = graph(%1 : Double(1:1, requires_grad=0, device=cuda:0),
[DUMP profiling_graph_executor_impl.cpp:499]       %4 : Double(1:1, requires_grad=0, device=cuda:0)):
[DUMP profiling_graph_executor_impl.cpp:499]   %5 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %c.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%4, %1, %5) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2872:16
[DUMP profiling_graph_executor_impl.cpp:499]   %2 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %d.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%c.3, %1, %2) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2873:16
[DUMP profiling_graph_executor_impl.cpp:499]   return (%d.3, %c.3)
```

With this optimization we use `prim::BroadcastSizes` to compute the size of `c`. No need to materialize it.

```
[DUMP profiling_graph_executor_impl.cpp:499] Optimized Graph:
[DUMP profiling_graph_executor_impl.cpp:499] graph(%a.1 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %b.1 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %11 : Tensor = prim::DifferentiableGraph_0(%b.1, %a.1)
[DUMP profiling_graph_executor_impl.cpp:499]   return (%11)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::DifferentiableGraph_0 = graph(%11 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %13 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %59 : int[] = aten::size(%13) # <string>:3:44
[DUMP profiling_graph_executor_impl.cpp:499]   %62 : int[] = aten::size(%11) # <string>:3:93
[DUMP profiling_graph_executor_impl.cpp:499]   %88 : Double(1:1, requires_grad=0, device=cuda:0), %89 : Double(1:1, requires_grad=0, device=cuda:0), %90 : bool = prim::TypeCheck(%11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]   %91 : Tensor = prim::If(%90)
[DUMP profiling_graph_executor_impl.cpp:499]     block0():
[DUMP profiling_graph_executor_impl.cpp:499]       %d.4 : Double(1:1, requires_grad=0, device=cuda:0) = prim::TensorExprGroup_0(%88, %89)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%d.4)
[DUMP profiling_graph_executor_impl.cpp:499]     block1():
[DUMP profiling_graph_executor_impl.cpp:499]       %97 : Function = prim::Constant[name="fallback_function", fallback=1]()
[DUMP profiling_graph_executor_impl.cpp:499]       %98 : (Tensor) = prim::CallFunction(%97, %11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]       %99 : Tensor = prim::TupleUnpack(%98)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%99)
[DUMP profiling_graph_executor_impl.cpp:499]   %85 : int[] = aten::size(%91)
[DUMP profiling_graph_executor_impl.cpp:499]   %86 : int[] = prim::BroadcastSizes(%59, %62)
[DUMP profiling_graph_executor_impl.cpp:499]   %61 : int[]? = aten::_size_if_not_equal(%59, %86) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %64 : int[]? = aten::_size_if_not_equal(%62, %86) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   %68 : int[]? = aten::_size_if_not_equal(%86, %85) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %71 : int[]? = aten::_size_if_not_equal(%62, %85) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   return (%91, %61, %64, %68, %71)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::TensorExprGroup_0 = graph(%1 : Double(1:1, requires_grad=0, device=cuda:0),
[DUMP profiling_graph_executor_impl.cpp:499]       %4 : Double(1:1, requires_grad=0, device=cuda:0)):
[DUMP profiling_graph_executor_impl.cpp:499]   %5 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %c.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%4, %1, %5) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2872:16
[DUMP profiling_graph_executor_impl.cpp:499]   %2 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %d.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%c.3, %1, %2) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2873:16
[DUMP profiling_graph_executor_impl.cpp:499]   return (%d.3)
```
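
For intuition, the size that `prim::BroadcastSizes` yields for `c` can be modeled in plain Python. This is a minimal sketch assuming standard right-aligned broadcasting rules; `broadcast_sizes` is a hypothetical helper for illustration, not the JIT implementation.

```python
def broadcast_sizes(a, b):
    """Return the broadcast of two shapes, aligning dimensions from the right."""
    result = []
    for i in range(1, max(len(a), len(b)) + 1):
        dim_a = a[-i] if i <= len(a) else 1  # missing leading dims act like 1
        dim_b = b[-i] if i <= len(b) else 1
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        result.append(dim_b if dim_a == 1 else dim_a)
    return result[::-1]


# The size of c = a + b is known from the input sizes alone,
# so c itself never has to be materialized just to ask for its size.
size_c = broadcast_sizes([4, 1], [3])
print(size_c)
```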

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45080

Reviewed By: bertmaher

Differential Revision: D23856410

Pulled By: Krovatkin

fbshipit-source-id: 2956286eb03a4894a5baa151c35e6092466322b1
2020-09-28 10:45:56 -07:00
Rong Rong
48d29c830d [hotfix] disable problematic cuda tests on rocm builds (#45435)
Summary:
Disable the 3 recently added CUDA tests on AMD ROCm builds/tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45435

Reviewed By: malfet

Differential Revision: D23962881

Pulled By: walterddr

fbshipit-source-id: ad4ea1f835b4722cdbdce685806cfd64376cc16f
2020-09-28 10:02:12 -07:00
Nikita Vedeneev
e4950a093a Backward support for generalized eigenvalue solver with LOBPCG in forward [only k-rank SYMEIG case] (#43002)
Summary:
As per title. Fixes [#38948](https://github.com/pytorch/pytorch/issues/38948). Therein you can find some blueprints for the algorithm being used in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43002

Reviewed By: zou3519

Differential Revision: D23931326

Pulled By: albanD

fbshipit-source-id: e6994af70d94145f974ef87aa5cea166d6deff1e
2020-09-28 07:22:35 -07:00
Mike Ruberry
6417a70465 Updates linalg warning + docs (#45415)
Summary:
Changes the deprecation of norm to a docs deprecation, since PyTorch components still rely on norm and some behavior, like automatically flattening tensors, may need to be ported to torch.linalg.norm. The documentation is also updated to clarify that torch.norm and torch.linalg.norm are distinct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45415

Reviewed By: ngimel

Differential Revision: D23958252

Pulled By: mruberry

fbshipit-source-id: fd54e807c59a2655453a6bcd9f4073cb2c12e8ac
2020-09-28 05:28:42 -07:00
generatedunixname89002005325676
7818a214c5 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D23959094

fbshipit-source-id: 6caa046d263114bff38a38d756099aac357e4f04
2020-09-28 05:08:46 -07:00
Negin Raoof
95a97e51b5 [ONNX] Improve scripting inplace indexing ops (#44351)
Summary:
Fix a couple of issues with scripting inplace indexing in prepare_inplace_ops_for_onnx pass.
1- Tracing index copy (such as cases like x[1:3] = data) already applies broadcasting on the rhs if needed. The broadcasting node (aten::expand) is missing in scripting cases.

2- Inplace indexing with ellipsis (aten::copy_) is replaced with aten::index_put and then handled with slice+select in this pass.
Support for negative indices is added for this op.

Shape inference is also enabled for scripting tests using new JIT API.
A few more tests are enabled for scripting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44351

Reviewed By: ezyang

Differential Revision: D23880267

Pulled By: bzinodev

fbshipit-source-id: 78b33444633eb7ae0fbabc7415e3b16001f5207f
2020-09-28 00:32:36 -07:00
Zino Benaissa
13f76f2be4 Fix preserve submodule attribute in freezing (#45143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45143

This PR prevents freezing from cleaning up a submodule when the user
requests that the submodule be preserved.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23844969

Pulled By: bzinodev

fbshipit-source-id: 80e6db3fc12460d62e634ea0336ae2a3551c2151
2020-09-28 00:05:38 -07:00
liqunfu
c3bf402cbb handle onnx nll with default ignore index (#44816)
Summary:
In the ONNX NegativeLogLikelihoodLoss specification, ignore_index is optional and has no default value.
Therefore, when converting the nll op to ONNX, we need to set the ignore_index attribute even if it is not specified (e.g. ignore_index=-100).
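
To see why the attribute has to be written out explicitly, here is a minimal pure-Python sketch of a mean-reduced NLL loss. It assumes PyTorch's default of `ignore_index=-100`; `nll_loss` below is an illustrative stand-in, not the exporter code or the real operator.

```python
import math

DEFAULT_IGNORE_INDEX = -100  # PyTorch's default; ONNX defines no default


def nll_loss(log_probs, targets, ignore_index=DEFAULT_IGNORE_INDEX):
    """Mean negative log-likelihood over the targets that are not ignored."""
    kept = [-row[t] for row, t in zip(log_probs, targets) if t != ignore_index]
    # Assumption of this sketch: return nan when every target is ignored.
    return sum(kept) / len(kept) if kept else float("nan")


log_probs = [
    [math.log(0.5), math.log(0.5)],
    [math.log(0.25), math.log(0.75)],
]
loss_all = nll_loss(log_probs, [0, 1])         # both rows contribute
loss_ignored = nll_loss(log_probs, [0, -100])  # second row is skipped
print(loss_all, loss_ignored)
```

If the exported graph omitted the attribute, a runtime following the ONNX specification would have no reason to skip the `-100` targets, so the two frameworks could disagree on the result.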

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44816

Reviewed By: ezyang

Differential Revision: D23880354

Pulled By: bzinodev

fbshipit-source-id: d0bdd58d0a4507ed9ce37133e68533fe6d1bdf2b
2020-09-27 23:26:19 -07:00
shubhambhokare1
5b839bca78 [ONNX] Optimize export_onnx api to reduce string and model proto exchange (#44332)
Summary:
Optimize export_onnx api to reduce string and model proto exchange in export.cpp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44332

Reviewed By: bwasti, eellison

Differential Revision: D23880129

Pulled By: bzinodev

fbshipit-source-id: 1d216d8f710f356cbba2334fb21ea15a89dd16fa
2020-09-27 16:29:08 -07:00
neginraoof
4005afe94b [ONNX] Update narrow for dynamic inputs (#44039)
Summary:
Update narrow for dynamic inputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44039

Reviewed By: mruberry

Differential Revision: D23742215

Pulled By: bzinodev

fbshipit-source-id: 0d58d2fe996f91a124af988a9a21ee433e842d07
2020-09-27 15:52:57 -07:00
Natalia Gimelshein
78caa028b6 Revert D23009117: [Distributed] DeleteKey API for c10d TCP Store
Test Plan: revert-hammer

Differential Revision:
D23009117 (addf94f2d6)

Original commit changeset: 1a0d95b43d79

fbshipit-source-id: ad3fe5501267e1a0a7bf23410766f1e92b34b24d
2020-09-27 12:04:42 -07:00
Mike Ruberry
54a253fded Revert D23931987: Added optimizers based on multi tensor apply
Test Plan: revert-hammer

Differential Revision:
D23931987 (2b21e7767e)

Original commit changeset: 582134ef2d40

fbshipit-source-id: ffd500aea55fda34155442fb15e2529cb9c00100
2020-09-26 18:11:54 -07:00
Rohan Varma
23dfca8351 Support record_shapes in RPC profiling (#44419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44419

Closes https://github.com/pytorch/pytorch/issues/39969

This PR adds support for propagation of input shapes over the wire when the profiler is invoked with `record_shapes=True` over RPC. Previously, we did not respect this argument.

This is done by saving the shapes as an ivalue list and recovering it as the type expected (`std::vector<std::vector<int>>` on the client). Test is added to ensure that remote ops have the same `input_shapes` as if the op were run locally.
ghstack-source-id: 112977899

Reviewed By: pritamdamania87

Differential Revision: D23591274

fbshipit-source-id: 7cf3b2e8df26935ead9d70e534fc2c872ccd6958
2020-09-26 13:26:44 -07:00
Rohan Varma
19dda7c68a Fallback to CPU when remote end does not have CUDA for profiling (#44967)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44967

When enabling the profiler on the server, if it is a different machine it may
not have CUDA while the caller does. Previously we would crash in this case; now we
fall back to CPU and log a warning.
ghstack-source-id: 112977906

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D23790729

fbshipit-source-id: dc6eba172b7e666842d54553f52a6b9d5f0a5362
2020-09-26 13:12:55 -07:00
Iurii Zdebskyi
2b21e7767e Added optimizers based on multi tensor apply (#45299)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45299

Adding a new namespace `torch.optim._multi_tensor` with a bunch of updated optimizers. These optimizers use the `_foreach` APIs, which improve performance significantly.

### Tests
- updated existing tests to use both optimizers
- added `test_multi_tensor_optimizers` test to verify correctness.

### Perf results

**Adam**
timeit: 42.69 ms --> 10.16 ms
autorange: 41.96 ms --> 10.28 ms

**AdamW**
timeit: 51.38 ms --> 15.63 ms
autorange: 50.82 ms --> 16.07 ms

**SGD**
timeit: 6.28 ms --> 4.40 ms
autorange: 6.13 ms --> 4.73 ms

**RMSprop**
timeit: 28.63 ms --> 5.89 ms
autorange: 28.27 ms --> 5.76 ms

**Rprop**
timeit: 213.30 --> 178.42
autorange: 212.03 --> 178.03

**ASGD**
timeit: 21.67 --> 9.33
autorange: 21.64 --> 9.27

**Adamax**
timeit: 55.60 --> 48.29
autorange: 55.22 --> 49.13
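
As a quick reading aid, the `timeit` numbers above can be turned into speedup ratios. The snippet below is only arithmetic on the figures listed in this message; the ratios are unitless, so the entries reported without a unit are handled the same way.

```python
# (before, after) timeit numbers copied from the results above
timeit_results = {
    "Adam": (42.69, 10.16),
    "AdamW": (51.38, 15.63),
    "SGD": (6.28, 4.40),
    "RMSprop": (28.63, 5.89),
    "Rprop": (213.30, 178.42),
    "ASGD": (21.67, 9.33),
    "Adamax": (55.60, 48.29),
}

# speedup = time with the existing optimizer / time with the multi-tensor one
speedups = {name: before / after for name, (before, after) in timeit_results.items()}

for name, ratio in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:8s} {ratio:.2f}x")
```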

**Perf script used**

```
import torch
import time
import torch.optim as optim
from torch.autograd import Variable
from torch.optim.lr_scheduler import ExponentialLR, ReduceLROnPlateau, StepLR
import torch.nn as nn
import torchvision
import torch.utils._benchmark as benchmark_utils

device = "cuda"
model = torchvision.models.resnet.resnet101(pretrained=True).to(device)
targets = torch.randint(0, 1000, (100, 100), device=device)
criterion = nn.CrossEntropyLoss()

optimizer = optim.SGD(model.parameters(), lr=1e-3) # <----------------------- optimizer.
                                                          # would compare optim.SGD vs optim._multi_tensor.SGD
running_loss = 0.0
target = torch.empty(128, dtype=torch.long, device=device).random_(5)

optimizer.zero_grad()
inputs = torch.rand(128, 3, 100, 100, device=device , requires_grad=True)
outputs = model(inputs)
loss = criterion(outputs, target)
loss.backward()
optimizer.step()
running_loss += loss.item()

def main():
    timer = benchmark_utils.Timer(
        stmt="optimizer.step()",
        globals=globals(),
        label=str(optimizer),  # use the optimizer's repr, not the literal text "str(optimizer)"
    )

    for i in range(1):
        print(f"Run: {i}\n{'-' * 40}")
        print(f"timeit:\n{timer.timeit(1000)}\n")
        print(f"autorange:\n{timer.blocked_autorange()}\n\n")

if __name__ == "__main__":
    main()
```

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23931987

Pulled By: izdeby

fbshipit-source-id: 582134ef2d402909d27d89a45c5b588fb7130ea1
2020-09-26 12:17:43 -07:00
Omkar Salpekar
addf94f2d6 [Distributed] DeleteKey API for c10d TCP Store (#43963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43963

Added a DeleteKey API for the TCP Store
ghstack-source-id: 112939762

Test Plan:
Modified the existing get/set test to use delete. Verified that the
correct keys were deleted and that the numKeys API returned the right values.
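
The check described in the test plan can be sketched against a toy store. `InMemoryStore`, `delete_key`, and `num_keys` are hypothetical names used for illustration only; this is not the c10d TCPStore or its actual API.

```python
class InMemoryStore:
    """Hypothetical in-memory stand-in for a key-value store."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

    def delete_key(self, key):
        """Remove `key`; return True if it existed, False otherwise."""
        if key in self._data:
            del self._data[key]
            return True
        return False

    def num_keys(self):
        return len(self._data)


store = InMemoryStore()
store.set("key0", b"value0")
store.set("key1", b"value1")
deleted = store.delete_key("key0")
remaining = store.num_keys()
print(deleted, remaining)
```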

Reviewed By: jiayisuse

Differential Revision: D23009117

fbshipit-source-id: 1a0d95b43d79e665a69b2befbaa059b2b50a1f66
2020-09-26 00:54:21 -07:00
Omkar Salpekar
304e1d1e19 [Distributed] getNumKeys API to c10d TCPStore (#43962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43962

TCPStore needs a getNumKeys API for our logging needs.
ghstack-source-id: 112939761

Test Plan: Adding tests to C++ Store Tests

Reviewed By: pritamdamania87

Differential Revision: D22985085

fbshipit-source-id: 8a0d286fbd6fd314dcc997bae3aad0e62b51af83
2020-09-26 00:49:00 -07:00
Zafar
d9af3d2fcd [quant] ConvTranspose warnings (#45081)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45081

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23822449

Pulled By: z-a-f

fbshipit-source-id: f21a5f3ef4d09f703c96fff0bc413dbadeac8202
2020-09-25 22:30:14 -07:00
Wang Xu
92189b34b7 Add get_all_users_of function to GraphManipulation (#45216)
Summary:
This PR adds the get_all_users_of function. The function returns all the users of a specific node. A unit test is also added.
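
The idea of "users of a node" can be sketched on a toy graph. The `Node` class and the signature below are assumptions made for this example; the function added in the PR operates on real graph objects and may differ in both inputs and return type.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    inputs: list = field(default_factory=list)  # names of the nodes this node reads


def get_all_users_of(nodes, target):
    """Return the indices of all nodes that take `target` as an input."""
    return [i for i, node in enumerate(nodes) if target in node.inputs]


nodes = [
    Node("x"),
    Node("y"),
    Node("add", ["x", "y"]),
    Node("relu", ["add"]),
    Node("mul", ["add", "x"]),
]
users_of_add = get_all_users_of(nodes, "add")
users_of_x = get_all_users_of(nodes, "x")
print(users_of_add, users_of_x)
```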

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45216

Reviewed By: ezyang

Differential Revision: D23883572

Pulled By: scottxu0730

fbshipit-source-id: 3eb68a411c3c6db39ed2506c9cb7bb7337520ee4
2020-09-25 19:32:49 -07:00
Zafar
958c208666 [quant] conv_transpose graph patterns (#45078)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45078

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23821580

Pulled By: z-a-f

fbshipit-source-id: 813a4ef1bbc429720765d61791fe754b6678a334
2020-09-25 18:14:29 -07:00
Wanchao Liang
32c355af5b [dist_optim] introduce distributed functional optimizer (#45221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45221

This PR introduces a distributed functional optimizer, so that the
distributed optimizer can reuse the functional optimizer APIs and
maintain its own state. This could enable a TorchScript-compatible
functional optimizer when using the distributed optimizer, which helps get rid
of the GIL and improves the overall performance of training, especially distributed
model parallel training.

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23935256

Pulled By: wanchaol

fbshipit-source-id: 59b6d77ff4693ab24a6e1cbb6740bcf614cc624a
2020-09-25 17:13:10 -07:00
Wanchao Liang
08caf15502 [optimizer] refactor Adam to use functional API (#44791)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44791

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23935257

Pulled By: wanchaol

fbshipit-source-id: 6f6e22a9287f5515d2e4e6abd4dee2fe7e17b945
2020-09-25 17:13:08 -07:00
Wanchao Liang
0444c372e1 [optimizer] introduce optimizer functional API, refactor Adagrad (#44715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44715

We have provided a nice and intuitive API in Python. But in the context of large scale distributed training (e.g. Distributed Model Parallel), users often want to use multithreaded training instead of multiprocess training as it provides better resource utilization and efficiency.

This PR introduces the functional optimizer concept (similar to the concept of `nn.functional`). We split the optimizer into two parts: 1. optimizer state management and 2. optimizer computation. We expose the computation part as a separate functional API that is available to internal and OSS developers; callers of the functional API maintain their own state so that they can call it directly. The end-user API stays the same, while the functional API is TorchScript friendly and could be used by the distributed optimizer to speed up training without the GIL.
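
The split between state management and computation can be sketched in plain Python with scalar parameters. This is a simplified illustration of the idea using the basic Adagrad update rule; `adagrad_step` and the `Adagrad` class here are hypothetical and do not mirror the signatures introduced by the PR.

```python
import math


def adagrad_step(params, grads, state_sums, lr=0.01, eps=1e-10):
    """Computation only: update `params` and `state_sums` in place; owns no state."""
    for i, (p, g) in enumerate(zip(params, grads)):
        state_sums[i] += g * g
        params[i] = p - lr * g / (math.sqrt(state_sums[i]) + eps)


class Adagrad:
    """State management only: owns the accumulators and delegates the math."""

    def __init__(self, params, lr=0.01):
        self.params = params
        self.lr = lr
        self.state_sums = [0.0] * len(params)

    def step(self, grads):
        adagrad_step(self.params, grads, self.state_sums, lr=self.lr)


# End-user API: unchanged, class-based.
params = [1.0]
Adagrad(params, lr=0.1).step([2.0])

# Functional API: the caller (e.g. a distributed optimizer) keeps its own state.
my_params, my_state = [1.0], [0.0]
adagrad_step(my_params, [2.0], my_state, lr=0.1)
print(params, my_params)
```

Both paths run the same computation, which is the point of the refactor: only the owner of the state differs.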

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23935258

Pulled By: wanchaol

fbshipit-source-id: d2a5228439edb3bc64f7771af2bb9e891847136a
2020-09-25 17:10:26 -07:00
Nikita Shulga
8ab2ad306d Enable torch.cuda.nccl typechecking (#45344)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45336

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45344

Reviewed By: walterddr

Differential Revision: D23935306

Pulled By: malfet

fbshipit-source-id: dd09d4f8ff7a327131764487158675027a13bf69
2020-09-25 17:02:47 -07:00
Shen Li
5211fb97ac Remove device maps from TensorPipe for v1.7 release (#45353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45353

Temporarily removing this feature, will add this back after branch cut.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23939865

Pulled By: mrshenli

fbshipit-source-id: 7dceaffea6b9a16512b5ba6036da73e7f8f83a8e
2020-09-25 16:51:45 -07:00
Brian Hirsh
439930c81b adding a beta parameter to the smooth_l1 loss fn (#44433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44433

Not entirely sure why, but changing the type of beta from `float` to `double` in autocast_mode.cpp and FunctionsManual.h fixes my compiler errors, failing instead at link time.

fixing some type errors, updated fn signature in a few more files

removing my usage of Scalar, making beta a double everywhere instead

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23636720

Pulled By: bdhirsh

fbshipit-source-id: caea2a1f8dd72b3b5fd1d72dd886b2fcd690af6d
2020-09-25 16:36:28 -07:00
Pritam Damania
a2b4177c5b Add barrier() at the end of init_process_group and new_group. (#45181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45181

`init_process_group` and `new_group` update a bunch of global
variables after initializing the actual process group. As a result, there is a
race that after initializing the process group on say rank 0, if we immediately
check the default process group on rank 1 (say via RPC), we might actually get
an error since rank 1 hasn't yet updated its _default_pg variable.

To resolve this issue, I've added barrier() at the end of both of these calls.
This ensures that once these calls return we are guaranteed about correct
initialization on all ranks.

Since these calls are usually done mostly during initialization, it should be
fine to add the overhead of a barrier() here.
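
The effect of the trailing barrier can be sketched with threads standing in for ranks. This is a toy model, assuming `threading.Barrier` as a substitute for the collective `barrier()`; the names `default_pg` and `init_process_group` below only echo the description and are not the real implementation.

```python
import threading

WORLD_SIZE = 4
barrier = threading.Barrier(WORLD_SIZE)
default_pg = [None] * WORLD_SIZE  # per-rank "global" state
observed = [None] * WORLD_SIZE    # what each rank saw on its peer after init


def init_process_group(rank):
    default_pg[rank] = f"pg-{rank}"  # update this rank's globals
    barrier.wait()                   # do not return until every rank got here


def worker(rank):
    init_process_group(rank)
    peer = (rank + 1) % WORLD_SIZE
    # Safe: the barrier guarantees the peer finished its own initialization.
    observed[rank] = default_pg[peer]


threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(observed)
```

Without the `barrier.wait()` call, a fast rank could read its peer's slot while it is still `None`, which is the race described above.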

Closes: https://github.com/pytorch/pytorch/issues/40434, https://github.com/pytorch/pytorch/issues/40378
ghstack-source-id: 112923112

Test Plan:
Reproduced the failures in
https://github.com/pytorch/pytorch/issues/40434 and
https://github.com/pytorch/pytorch/issues/40378 and verified that this PR fixes
the issue.

Reviewed By: mrshenli

Differential Revision: D23858025

fbshipit-source-id: c4d5e46c2157981caf3ba1525dec5310dcbc1830
2020-09-25 15:46:59 -07:00
Vasiliy Kuznetsov
eee7dad376 Add torch.do_assert, which is symbolically traceable (#45188)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45188

This is a symbolically traceable alternative to Python's `assert`.
It should be useful to allow people who want to use FX to also
be able to assert things.

A bunch of TODO(before land) comments are inline - would love thoughts
on where is the best place for this code to live, and what this
function should be called (since `assert` is reserved).

Test Plan:
```
python test/test_fx.py TestFX.test_symbolic_trace_assert
```

Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23861567

fbshipit-source-id: d9d6b9556140faccc0290eba1fabea401d7850de
2020-09-25 13:46:28 -07:00
Rohan Varma
7c5436d557 [RPC profiling] Add tests to ensure RPC profiling works on single threaded (#44923)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44923

This ensures that RPC profiling works in single-threaded server
scenarios and that we won't make the assumption that we'll have multiple
threads when working on this code. For example, this assumption resulted in a
bug in the previous diff (which was fixed).
ghstack-source-id: 112868469

Test Plan: CI

Reviewed By: lw

Differential Revision: D23691304

fbshipit-source-id: b17d34ade823794cbe949b70a5ab35723d974203
2020-09-25 13:24:18 -07:00
Rohan Varma
27ab9bc0f9 [RPC profiling] Extend RPC profiling to support async function execution over RPC. (#44664)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44664

Closes https://github.com/pytorch/pytorch/issues/39971. This PR adds support for functions decorated with `rpc.functions.async_execution` to be profiled over RPC as builtins, jit functions, and blocking python UDFs currently can be. The reasoning for this is to provide complete feature support in terms of RPC profiling and the various types of functions users can run.

To enable this, the PR below this enables calling `disableProfiler()` safely from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only the blocking `processRPC` call as was done previously). Since when the future completes we've kicked off the async function and the future corresponding to it has completed, we are able to capture any RPCs the function would have called and the actual work done on the other node.

For example, if the following async function is run on a server over RPC:

```
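import time

import torch
import torch.distributed.rpc as rpc

def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)

# The decorator marks the function as returning a Future instead of a value.
@rpc.functions.async_execution
def slow_async_add(to, x, y):
    return rpc.rpc_async(to, slow_add, args=(x, y))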
```

we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output:

```
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  --------
-------  ---------------  ---------------  ---------------
Name                                                                                                                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Node ID
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  --------
-------  ---------------  ---------------  ---------------                                                                                                                            rpc_async#slow_async_add(worker1 -> worker2)                                                                               0.00%            0.000us          0                1.012s
         1.012s           1                1
aten::empty                                                                                                                7.02%            11.519us         7.02%            11.519us         11.519us         1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                             0.00%            0.000us          0                1.006s
         1.006s           1                2                                                                                                                                          rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                        7.21%            11.843us         7.21%            11.843us
         11.843us         1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add        71.94%           118.107us        85.77%           140.802us        140.802us        1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty      13.82%           22.695us         13.82%           22.695us
         22.695us         1                3                                                                                                                                          -------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  --------
-------  ---------------  ---------------  ---------------
Self CPU time total: 164.164us
```
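
The "disable the profiler only when the future completes" idea can be sketched with the standard library. This is a toy model, assuming `concurrent.futures.Future` in place of the RPC future; `ToyProfiler` and `process_rpc` are hypothetical names and not the server code.

```python
from concurrent.futures import Future

events = []


class ToyProfiler:
    """Hypothetical stand-in for the server-side profiler."""

    def __init__(self):
        self.enabled = False

    def enable(self):
        self.enabled = True

    def disable(self):
        self.enabled = False
        events.append("profiler disabled")

    def record(self, name):
        if self.enabled:
            events.append(name)


def process_rpc(profiler, async_fn):
    profiler.enable()
    fut = async_fn(profiler)  # returns immediately with a pending future
    # Disable once the async work is done, not when this call returns.
    fut.add_done_callback(lambda _: profiler.disable())
    return fut


def slow_async_add(profiler):
    profiler.record("slow_async_add started")
    return Future()  # completed later, simulating the nested RPC finishing


profiler = ToyProfiler()
fut = process_rpc(profiler, slow_async_add)
profiler.record("aten::add")  # work that happens after process_rpc returned
fut.set_result(3)             # completion triggers the disable callback
profiler.record("too late")   # dropped: the profiler is already disabled
print(events)
```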

This PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter `request_callback` code.
ghstack-source-id: 112868470

Test Plan:
```
rvarm1@devbig978:fbcode  (52dd34f6)$ buck test mode/no-gpu mode/dev-nosan //caffe2/test/distributed/rpc:process_group_agent -- test_rpc_profiling_async_function --print-passing-details --stress-runs 1
```

Reviewed By: mrshenli

Differential Revision: D23638387

fbshipit-source-id: eedb6d48173a4ecd41d70a9c64048920bd4807c4
2020-09-25 13:19:26 -07:00
Iurii Zdebskyi
d5748d9a1a Enable binary ops with Scalar Lists with for foreach APIs (#45298)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45298

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23931986

Pulled By: izdeby

fbshipit-source-id: 281267cd6f90d57a169af89f9f10b0f4fcab47e3
2020-09-25 12:58:34 -07:00
gunandrose4u
f07ac6a004 Fix Windows build failure after DDP PR merged (#45335)
Summary:
This is a resubmit of PR https://github.com/pytorch/pytorch/issues/42897, together with a fix for the Windows build issue introduced by PR https://github.com/pytorch/pytorch/issues/44344.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45335

Reviewed By: zou3519

Differential Revision: D23931471

Pulled By: mrshenli

fbshipit-source-id: f49b5a114944c1450b32934b3292170be064f494
2020-09-25 12:37:50 -07:00
Nikita Shulga
c8166d4b58 Add torch.cuda.comm to typechecking CI (#45350)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45350

Reviewed By: walterddr

Differential Revision: D23935750

Pulled By: malfet

fbshipit-source-id: 5a7d2d4fbc976699d80bb5caf4727c19fa2c5bc8
2020-09-25 12:13:43 -07:00
Gao, Xiang
dc9e9c118e CUDA BFloat16 neg (#45240)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45240

Reviewed By: mruberry

Differential Revision: D23933392

Pulled By: ngimel

fbshipit-source-id: 2472dc550600ff470a1044ddee39054e22598038
2020-09-25 11:25:49 -07:00
Bram Wasti
e5f6e5af13 Add Deep and wide to test and flatten/transpose for good measure (#44129)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44129

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604302

Pulled By: bwasti

fbshipit-source-id: 5787f6f32a80b22b1b712c4116f70370dad98f12
2020-09-25 11:05:41 -07:00
Bram Wasti
d1a11618f5 [static runtime] Add _out variants and reuse memory (#44128)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44128

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604304

Pulled By: bwasti

fbshipit-source-id: 06a23cb75700a0fc733069071843b7b498e7b9e9
2020-09-25 11:03:06 -07:00
Nick Gibson
d1d9017a66 [NNC] fix Half conversion of immediates in Cuda backend (#45213)
Summary:
The Cuda HalfChecker casts up all loads and stores of Half to Float, so we do math in Float on the device. It didn't cast up HalfImmediate (i.e. constants), so they could introduce mixed-size ops. The fix is to cast them up as well.
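
The problem can be illustrated on a toy expression tree. The classes and the `promote_half` pass below are hypothetical and written in Python purely for illustration; they are not the NNC IR or the actual HalfChecker.

```python
from dataclasses import dataclass


@dataclass
class Load:
    dtype: str
    name: str


@dataclass
class Imm:
    dtype: str
    value: float


@dataclass
class Cast:
    dtype: str
    src: object


@dataclass
class Add:
    lhs: object
    rhs: object


def dtype_of(expr):
    """Result dtype of an expression; mixed-size Add operands are an error."""
    if isinstance(expr, Add):
        lhs, rhs = dtype_of(expr.lhs), dtype_of(expr.rhs)
        if lhs != rhs:
            raise TypeError(f"mixed-size op: {lhs} + {rhs}")
        return lhs
    return expr.dtype


def promote_half(expr, cast_immediates=True):
    """Rewrite Half loads (and, optionally, Half immediates) to Float."""
    if isinstance(expr, Add):
        return Add(promote_half(expr.lhs, cast_immediates),
                   promote_half(expr.rhs, cast_immediates))
    if isinstance(expr, Load) and expr.dtype == "half":
        return Cast("float", expr)
    if isinstance(expr, Imm) and expr.dtype == "half" and cast_immediates:
        return Imm("float", expr.value)
    return expr


expr = Add(Load("half", "x"), Imm("half", 1.0))
fixed = promote_half(expr)  # both operands are now float
print(dtype_of(fixed))
```

Calling `promote_half(expr, cast_immediates=False)` models the earlier behavior: the load becomes Float while the constant stays Half, and `dtype_of` rejects the resulting mixed-size addition.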

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45213

Reviewed By: ezyang

Differential Revision: D23885287

Pulled By: nickgg

fbshipit-source-id: 912991d85cc06ebb282625cfa5080d7525c8eba9
2020-09-25 10:53:36 -07:00
Supriya Rao
a117d968f6 [quant][graph] Remove redundant aten::wait calls in the graph (#45257)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45257

Currently we inline fork-wait calls when we insert observers for quantization.
In the case where fork and wait are in different subgraphs, inlining the fork-wait calls
only gets rid of the fork. This leaves the aten::wait call in the graph with a torch.Tensor as input,
which is currently not supported.
To avoid this, in the cleanup phase we check that the input to every wait call in the graph is of type Future[tensor].

Test Plan:
python test/test_quantization.py TestQuantizeJitPasses.test_quantize_fork_wait

Imported from OSS

Reviewed By: qizzzh

Differential Revision: D23895412

fbshipit-source-id: 3c58c6be7d7e7904eb6684085832ac21f827a399
2020-09-25 09:52:52 -07:00
Shinichiro Hamaji
8b00c4c794 [ONNX] Correct a minor typo in warning (#45187)
Summary:
The warning for batch_norm was mentioning dropout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45187

Reviewed By: glaringlee

Differential Revision: D23873215

Pulled By: ezyang

fbshipit-source-id: 1dcc82ad16522215f49b4cd0fc0e357b2094e4f2
2020-09-25 09:26:51 -07:00
Sebastian Messmer
78fcde9c50 Trace scattered tensor options arguments (#44071)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44071

Previously, tracing re-gathered ScalarType, Layout, Device, bool into a TensorOptions object and called `tracer::addInput()` on the gathered TensorOptions argument. `tracer::addInput()` then scattered them again and added the individual scattered arguments to the traced graph. This PR avoids the extraneous gathering and re-scattering step and calls `tracer::addInput()` on the individual arguments directly. This avoids the perf hit of an unnecessary gathering step.

This applies to both c10-full and non-c10-full ops. In the case of c10-full ops, the tracing kernels take scattered arguments and we can directly pass them to `tracer::addInput()`. In the case of non-c10-full ops, the kernel takes a `TensorOptions` argument but we still call `tracer::addInput()` on the scattered arguments.
ghstack-source-id: 112825793

Test Plan:
waitforsandcastle

vs master: https://www.internalfb.com/intern/fblearner/details/216129483/

vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170069/

Reviewed By: ezyang

Differential Revision: D23486638

fbshipit-source-id: e0b53e6673cef8d7f94158e718301eee261e5d22
2020-09-25 09:04:06 -07:00
Sebastian Messmer
2ac7de7d53 Remove hacky_wrapper from BackendSelect kernels (#44062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44062

Previously, BackendSelect kernels were still written in the legacy way, i.e. they took one TensorOptions argument instead of scattered dtype, layout, device, pin_memory, and they used hacky_wrapper to be callable. This caused a re-wrapping step. Calling into a BackendSelect kernel required taking the individual scattered arguments, packing them into a TensorOptions, and the kernel itself then gathered them again for redispatch.

Now with this PR, BackendSelect kernels are written in the new way and no hacky_wrapper or rewrapping is needed for them.
ghstack-source-id: 112825789

Test Plan:
vs master: https://www.internalfb.com/intern/fblearner/details/216117032/

vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170194/

Reviewed By: ezyang

Differential Revision: D23484192

fbshipit-source-id: e8fb49c4692404b6b775d18548b990c4cdddbada
2020-09-25 09:04:03 -07:00
Brian Hirsh
2739a7c599 Byte-for-byte compatibility fixes in codegen (#44879)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44879

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23825163

Pulled By: bdhirsh

fbshipit-source-id: 4d8028274f82c401b393c4fe1b9e32de3f4909c6
2020-09-25 08:06:50 -07:00
kshitij12345
00e704e757 [fix] torch.repeat : dim-0 backward (#45212)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45201

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45212

Reviewed By: mrshenli

Differential Revision: D23905545

Pulled By: albanD

fbshipit-source-id: c5bf9cf481c8cf3ccc1fdbfb364006b29f67dc9f
2020-09-25 07:53:00 -07:00
Alex Suhan
76ee58e2ec [TensorExpr] Move inner loops vectorization logic to its own method (#45287)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45287

Test Plan: CI, build

Reviewed By: gmagogsfm

Differential Revision: D23913432

Pulled By: asuhan

fbshipit-source-id: 3bf8fe09753f349e3c857863a43d2b1fca5101c1
2020-09-25 02:29:36 -07:00
Xiong Wei
241afc9188 Migrate addr from the TH to Aten (CPU) (#44364)
Summary:
Related https://github.com/pytorch/pytorch/issues/24507
Fixes https://github.com/pytorch/pytorch/issues/24666

This PR modernizes the CPU implementation of the vector `outer product`.
The existing TH implementation of `torch.addr` is migrated to `aten`, since `torch.ger` uses the `addr` functions to calculate the outer product.
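
For reference, a minimal pure-Python sketch of what `addr` computes, `beta * M + alpha * (v1 ⊗ v2)`, with the outer product as the `beta = 0` case. This is not the ATen kernel; the helper names `addr` and `outer` and the list-of-lists representation are assumptions made for illustration only.

```python
def addr(M, v1, v2, beta=1.0, alpha=1.0):
    """Return beta * M + alpha * outer(v1, v2) for a list-of-lists M."""
    if len(M) != len(v1) or any(len(row) != len(v2) for row in M):
        raise ValueError("M must have shape (len(v1), len(v2))")
    return [
        [beta * M[i][j] + alpha * v1[i] * v2[j] for j in range(len(v2))]
        for i in range(len(v1))
    ]


def outer(v1, v2):
    """Outer product expressed through addr with a zero matrix (illustrative)."""
    zeros = [[0.0] * len(v2) for _ in v1]
    return addr(zeros, v1, v2, beta=0.0, alpha=1.0)
```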

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44364

Reviewed By: ezyang

Differential Revision: D23866733

Pulled By: mruberry

fbshipit-source-id: 5159ea22f0e3c991123fe7c19cc9beb6ad00301e
2020-09-25 01:18:09 -07:00
jjsjann123
99e0a87bbb [nvFuser] Latency improvements for pointwise + reduction fusion (#45218)
Summary:
A lot of changes are in this update, some highlights:

- Added Doxygen config file
- Split the fusion IR (higher level TE like IR) from kernel IR (lower level CUDA like IR)
- Improved latency with dynamic shape handling for the fusion logic
- Prevent recompilation for pointwise + reduction fusions when not needed
- Improvements to inner dimension reduction performance
- Added input -> kernel + kernel launch parameters cache, added eviction policy
- Added reduction fusions with multiple outputs (still single reduction stage)
- Fixed code generation bugs for symbolic tiled GEMM example
- Added thread predicates to prevent shared memory from being loaded multiple times
- Improved sync threads placements with shared memory and removed read before write race
- Fixes to FP16 reduction fusions where output would come back as FP32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45218

Reviewed By: ezyang

Differential Revision: D23905183

Pulled By: soumith

fbshipit-source-id: 12f5ad4cbe03e9a25043bccb89e372f8579e2a79
2020-09-24 23:17:20 -07:00
Vasiliy Kuznetsov
bdf329ef8a SyncBN: preserve qconfig if it exists (#45317)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45317

Eager mode quantization depends on the presence of the `qconfig`
model attribute.  Currently, converting a model to use `SyncBatchNorm`
removes the qconfig; this PR fixes that.  This is important if a BN is not
fused to anything during quantization convert.
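
A rough sketch of the idea, using stand-in classes rather than `torch.nn` modules: the conversion builds a new module and copies `qconfig` over only when the source module has one. The names `BatchNorm`, `SyncBatchNorm` and `convert_to_sync_bn` here are hypothetical and are not the actual PyTorch code.

```python
class BatchNorm:
    """Stand-in for a batch norm module (hypothetical, not torch.nn)."""

    def __init__(self, num_features):
        self.num_features = num_features


class SyncBatchNorm(BatchNorm):
    """Stand-in for the synchronized variant (hypothetical)."""


def convert_to_sync_bn(module):
    """Build a SyncBatchNorm from `module`, carrying over `qconfig` if set."""
    converted = SyncBatchNorm(module.num_features)
    if hasattr(module, "qconfig"):
        # Preserve the quantization config so later quantization steps see it.
        converted.qconfig = module.qconfig
    return converted
```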

Test Plan:
```
python test/test_quantization.py TestDistributed.test_syncbn_preserves_qconfig
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23922072

fbshipit-source-id: cc1bc25c8e5243abb924c6889f78cf65a81be158
2020-09-24 22:52:07 -07:00
Mike Ruberry
103fa3894a Revert D23841786: [pytorch][PR] Enable distributed package on windows, Gloo backend supported only
Test Plan: revert-hammer

Differential Revision:
D23841786 (0122299f9b)

Original commit changeset: 334ba1ed73ef

fbshipit-source-id: ec95432f9957df56a5a04e52661f5db920b7f57f
2020-09-24 22:44:33 -07:00
Jerry Zhang
bc3151dee0 [quant] Remove unused qconfig argument in qat linear module (#45307)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45307

fixes: https://github.com/pytorch/pytorch/issues/35634

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23917339

fbshipit-source-id: 65f8844b98198bbf93547b3d71408c2a54605218
2020-09-24 22:15:16 -07:00
gunandrose4u
0122299f9b Enable distributed package on windows, Gloo backend supported only (#42897)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42095

The test cases will be committed to this PR later.

mrshenli, please help review

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42897

Reviewed By: osalpekar

Differential Revision: D23841786

Pulled By: mrshenli

fbshipit-source-id: 334ba1ed73eff2f668857390fc32d1bc7f08e5f3
2020-09-24 21:13:55 -07:00
Yanli Zhao
c6500bcf14 [reland] Make grad point to bucket buffer in DDP to save memory usage (#44344)
Summary:
[test all]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44344

reland #41954

Add one argument to the DDP API to enable/disable letting grads point to views. When it is disabled, behavior is the same as DDP right now; when it is enabled, both variable.grad() and the grad in the distautograd context point to the bucket buffer in DDP to save memory.
In this case, grad will be a view of the bucket buffer tensors; to make it compatible with optimizer.zero_grad(), we
made changes in #41283.

Also note that we cannot make variable.grad() point to the bucket buffer at construction time, because we want to
keep grad undefined for unused parameters.
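
To illustrate why views save memory, here is a toy sketch using Python's `array` and `memoryview` in place of tensors. It assumes nothing about the DDP internals; `make_bucket_views` and the parameter names are hypothetical. Each "grad" is a slice of one flat bucket, so values written into the bucket are visible through the grads without a copy.

```python
from array import array


def make_bucket_views(sizes):
    """Allocate one flat bucket and hand out per-parameter views into it."""
    bucket = array("d", [0.0] * sum(sizes.values()))
    flat = memoryview(bucket)
    grads, offset = {}, 0
    for name, size in sizes.items():
        grads[name] = flat[offset:offset + size]  # a view, not a copy
        offset += size
    return bucket, grads


bucket, grads = make_bucket_views({"weight": 3, "bias": 2})
# Writing reduced values into the bucket is visible through the views.
for i in range(len(bucket)):
    bucket[i] = float(i)
print(list(grads["weight"]), list(grads["bias"]))
```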
ghstack-source-id: 112845787

Test Plan:
1. When grad_is_view=false:
a. roberta_base, peak memory usage 8250MB, p50 per iteration latency 0.923second, https://www.internalfb.com/intern/fblearner/details/218029699/?notif_channel=cli
b. resnet, peak memory usage 3089MB, p50 per iteration latency 0.120second, https://www.internalfb.com/intern/fblearner/details/218029035/?notif_channel=cli
c. accuracy benchmark, distributed=false, .accuracy 40.914535522461, .loss: 1.6370717287064; distributed=true, .accuracy: 39.966053009033, .loss: 1.6849111318588
https://www.internalfb.com/intern/fblearner/details/218035688/?notif_channel=cli
d. classy vision uru production flow, https://www.internalfb.com/intern/fblearner/details/219065811/?notif_channel=cli
e. pytext flow, https://www.internalfb.com/intern/fblearner/details/219137458/?notif_channel=cli

2. When grad_is_view=true:
a. roberta_base, peak memory usage 7183MB, p50 per iteration latency 0.908second, https://www.internalfb.com/intern/fblearner/details/217882539?tab=operator_details
b. resnet, peak memory usage 2988 MB, p50 per iteration latency 0.119second, https://www.internalfb.com/intern/fblearner/details/218028479/?notif_channel=cli
c. accuracy benchmark, distributed=false, .accuracy 41.713260650635, .loss: 1.69939661026; distributed=true, .accuracy: 39.966053009033, .loss: 1.6849111318588, https://www.internalfb.com/intern/fblearner/details/218037058/?notif_channel=cli
d. classy vision uru production flow, expected, can not work well with apex.amp https://www.internalfb.com/intern/fblearner/details/219205218/?notif_channel=cli
e. pytext flow, detach_() related error, expected, as pytext zero_grad depends on apex repo where detach_() is called. also seeing the warning in finalize_bucket_dense due to tied weights, which is expected. https://www.internalfb.com/intern/fblearner/details/219150229/?notif_channel=cli

Reviewed By: mrshenli

Differential Revision: D23588186

fbshipit-source-id: f724d325b954ef6f06ede31759bf01dd29a6f5e5
2020-09-24 20:54:51 -07:00
Xiao Wang
7e5492e1be [minor] Fix undefined variable (#45246)
Summary:
The commit 2a37f3fd2f https://github.com/pytorch/pytorch/pull/45130 deleted the python variable `capability` which is used in later lines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45246

Reviewed By: walterddr

Differential Revision: D23923916

Pulled By: malfet

fbshipit-source-id: c5d7fef9e4a87ccc621191200e5965710e9d6aaa
2020-09-24 20:17:13 -07:00
Linbin Yu
0f2c648c97 log metadata when model loading failed (#44430)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44430

Log metadata even when model loading fails.

Test Plan: {F331550976}

Reviewed By: husthyc

Differential Revision: D23577711

fbshipit-source-id: 0504e75625f377269f1e5df0f1ebe34b8e564c4b
2020-09-24 20:09:22 -07:00
Himangshu
92ebb04f92 added check for NumberType (#44375)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44107

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44375

Reviewed By: mrshenli

Differential Revision: D23906728

Pulled By: eellison

fbshipit-source-id: 3b534e5dd3af1f5e43a7314953e64117cbe8ffe4
2020-09-24 16:26:59 -07:00
Rohan Varma
bee1d448e7 Fix test_rpc_profiling_remote_record_function (#45162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45162

This test was flaky because it was not able to validate that the
overall record_function's CPU times are greater than the sum of its children.
It turns out that this is a general bug in the profiler that can be reproduced
without RPC, see https://github.com/pytorch/pytorch/issues/45160. Hence,
removing this from the test and replacing it by just validating the expected
children.

Ran the test 1000 times and they all passed.
ghstack-source-id: 112632327

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23851854

fbshipit-source-id: 5d9023acd17800a6668ba4849659d8cc902b8d6c
2020-09-24 15:57:32 -07:00
Elias Ellison
5dd288eb06 [JIT] Regularize tensorexpr fuser strategy with other fusers (#44972)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44972

Previously, our fusion strategy would be:
- start at the end of the block, find a fusable node
- iteratively try to merge inputs into the fusion group, sorted topologically

This strategy works pretty well, but has the possibility of missing fusion groups. See my attached test case for an example where we wouldn't find all possible fusion groups. bertmaher found an example of a missed fusion group in one of our rnn examples (jit_premul) that caused a regression from the legacy fuser.

Here, I'm updating our fusion strategy to be the same as our other fusion passes - create_autodiff_subgraphs, and graph_fuser.cpp.

The basic strategy is:
- iterate until you find a fusible node
- try to merge the node's inputs; whenever a successful merge occurs, restart at the beginning of the node's inputs
- after you've exhausted a node, continue searching the block for fusion opportunities from the node
- continue doing this on the block until we go through an iteration without any successful merges
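
The steps above can be sketched as a toy pass over a small graph. This is not the code in the fuser: the tuple-based graph, the `fuse` helper, the reverse visiting order, and the rule "only merge a producer whose every user is already in the group" are simplifying assumptions for illustration.

```python
def fuse(nodes, fusible_ops):
    """Toy fusion pass.

    nodes: list of (name, op, inputs) tuples in topological order.
    Returns the fusion groups as a sorted list of sorted name lists.
    """
    ops = {name: op for name, op, _ in nodes}
    inputs = {name: list(ins) for name, _, ins in nodes}
    users = {name: set() for name in ops}
    for name, ins in inputs.items():
        for producer in ins:
            users[producer].add(name)

    group_of = {}  # node name -> the group (a set) it currently belongs to

    def try_merge_inputs(group):
        """Merge one fusible producer into `group`; return True on success."""
        for member in sorted(group):
            for producer in inputs[member]:
                if producer in group or ops[producer] not in fusible_ops:
                    continue
                # Simplifying assumption: every user must be in the group.
                if not users[producer] <= group:
                    continue
                other = group_of.get(producer, {producer})
                group |= other
                for n in other:
                    group_of[n] = group
                return True
        return False

    changed = True
    while changed:  # repeat until a full sweep performs no merge
        changed = False
        for name, op, _ in reversed(nodes):
            if op not in fusible_ops:
                continue
            group = group_of.setdefault(name, {name})
            # Restart the scan of the inputs after every successful merge.
            while try_merge_inputs(group):
                changed = True

    unique = {id(g): g for g in group_of.values()}
    return sorted(sorted(g) for g in unique.values())


diamond = [
    ("x", "input", []),
    ("p", "mul", ["x"]),
    ("q", "add", ["p"]),
    ("r", "relu", ["p"]),
    ("s", "add", ["q", "r"]),
]
print(fuse(diamond, {"mul", "add", "relu"}))
```

In the diamond, `p` cannot be merged on the first look at `q`'s inputs because `r` is still outside the group; restarting after `r` is merged is what lets `p` join.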

Since we create the fusion groups once, and only re-specialize within the fusion groups, we should be running this very infrequently (it only re-triggers when we fail undefinedness specializations). Also, because it's the same algorithm as the existing fuser, it is unlikely to cause a regression.

Test Plan: Imported from OSS

Reviewed By: Krovatkin, robieta

Differential Revision: D23821581

Pulled By: eellison

fbshipit-source-id: e513d1ef719120dadb0bfafc7a14f4254cd806ee
2020-09-24 15:34:21 -07:00
Elias Ellison
0137e3641d Refactor subgraph merging (#44238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44238

Refactor create_autodiff_subgraphs to use the same updating of output aliasing properties logic as tensorexpr fuser, and factor that out to a common function in subgraph utils.

Test Plan: Imported from OSS

Reviewed By: Krovatkin, robieta

Differential Revision: D23871565

Pulled By: eellison

fbshipit-source-id: 72df253b16baf8e4aabf3d68b103b29e6a54d44c
2020-09-24 15:29:34 -07:00
Mikhail Zolotukhin
71e6ce6616 [JIT] Specialize AutogradZero: merge AutogradAnyNonZero and Not(AutogradAnyNonZero) checks into one. (#44987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44987

This PR introduces new `prim::AutogradAllZero` and
`prim::AutogradAllNonZero` ops that are used for a batch check for
multiple tensors. The specialize-autogradzero pass now generates one
check for all expected-to-be-undefined tensors, one check for all
expected-to-be-defined tensors, and a bunch of checks for size
parameters passed to `grad_sum_to_size` (this probably could be cleaned
up somehow as well in future).

An example of what we generated before this change:
```
%1626 : bool = prim::AutogradAnyNonZero(%0)
%1627 : bool = prim::AutogradAnyNonZero(%2)
%1628 : bool = aten::__not__(%1627)
%1629 : bool = prim::AutogradAnyNonZero(%3)
%1630 : bool = aten::__not__(%1629)
%1631 : bool = prim::AutogradAnyNonZero(%4)
%1632 : bool = aten::__not__(%1631)
%1633 : bool = prim::AutogradAnyNonZero(%5)
%1634 : bool = aten::__not__(%1633)
%1635 : bool = prim::AutogradAnyNonZero(%6)
%1636 : bool = aten::__not__(%1635)
%1637 : bool = prim::AutogradAnyNonZero(%7)
%1638 : bool = aten::__not__(%1637)
%1639 : bool = prim::AutogradAnyNonZero(%8)
%1640 : bool = aten::__not__(%1639)
%1641 : bool = prim::AutogradAnyNonZero(%9)
%1642 : bool = aten::__not__(%1641)
%1643 : bool = prim::AutogradAnyNonZero(%10)
%1644 : bool = aten::__not__(%1643)
%1645 : bool = prim::AutogradAnyNonZero(%11)
%1646 : bool = aten::__not__(%1645)
%1647 : bool = prim::AutogradAnyNonZero(%12)
%1648 : bool = aten::__not__(%1647)
%1649 : bool = prim::AutogradAnyNonZero(%13)
%1650 : bool = aten::__not__(%1649)
%1651 : bool = prim::AutogradAnyNonZero(%14)
%1652 : bool = aten::__not__(%1651)
%1653 : bool = prim::AutogradAnyNonZero(%15)
%1654 : bool = aten::__not__(%1653)
%1655 : bool = prim::AutogradAnyNonZero(%16)
%1656 : bool = aten::__not__(%1655)
%1657 : bool = prim::AutogradAnyNonZero(%17)
%1658 : bool = prim::AutogradAnyNonZero(%18)
%1659 : bool = prim::AutogradAnyNonZero(%19)
%1660 : bool = prim::AutogradAnyNonZero(%20)
%1661 : bool = aten::__is__(%self_size.16, %1625)
%1662 : bool = aten::__is__(%other_size.16, %1625)
%1663 : bool = aten::__is__(%self_size.14, %1625)
%1664 : bool = aten::__is__(%self_size.12, %1625)
%1665 : bool = prim::AutogradAnyNonZero(%ingate.7)
%1666 : bool = prim::AutogradAnyNonZero(%forgetgate.7)
%1667 : bool = prim::AutogradAnyNonZero(%cellgate.7)
%1668 : bool = prim::AutogradAnyNonZero(%30)
%1669 : bool = prim::AutogradAnyNonZero(%31)
%1670 : bool = aten::__is__(%self_size.10, %1625)
%1671 : bool = aten::__is__(%other_size.10, %1625)
%1672 : bool = prim::AutogradAnyNonZero(%34)
%1673 : bool = prim::AutogradAnyNonZero(%35)
%1674 : bool = aten::__is__(%self_size.8, %1625)
%1675 : bool = aten::__is__(%other_size.8, %1625)
%1676 : bool = aten::__is__(%self_size.6, %1625)
%1677 : bool = aten::__is__(%other_size.6, %1625)
%1678 : bool = prim::AutogradAnyNonZero(%outgate.7)
%1679 : bool = prim::AutogradAnyNonZero(%41)
%1680 : bool = prim::AutogradAnyNonZero(%42)
%1681 : bool = prim::AutogradAnyNonZero(%43)
%1682 : bool = aten::__is__(%self_size.4, %1625)
%1683 : bool = aten::__is__(%other_size.4, %1625)
%1684 : bool[] = prim::ListConstruct(%1626, %1628, %1630, %1632, %1634, %1636, %1638, %1640, %1642, %1644, %1646, %1648, %1650, %1652, %1654, %1656, %1657, %1658, %1659, %1660, %1661, %1662, %1663, %1664, %1665, %1666, %1667, %1668, %1669, %1670, %1671, %1672, %1673, %1674, %1675, %1676, %1677, %1678, %1679, %1680, %1681, %1682, %1683)
%1685 : bool = aten::all(%1684)
```

Same example after this change:
```
%1625 : None = prim::Constant()
%1626 : bool = aten::__is__(%self_size.16, %1625)
%1627 : bool = aten::__is__(%other_size.16, %1625)
%1628 : bool = aten::__is__(%self_size.14, %1625)
%1629 : bool = aten::__is__(%self_size.12, %1625)
%1630 : bool = aten::__is__(%self_size.10, %1625)
%1631 : bool = aten::__is__(%other_size.10, %1625)
%1632 : bool = aten::__is__(%self_size.8, %1625)
%1633 : bool = aten::__is__(%other_size.8, %1625)
%1634 : bool = aten::__is__(%self_size.6, %1625)
%1635 : bool = aten::__is__(%other_size.6, %1625)
%1636 : bool = aten::__is__(%self_size.4, %1625)
%1637 : bool = aten::__is__(%other_size.4, %1625)
%1638 : bool = prim::AutogradAllNonZero(%0, %17, %18, %19, %20, %ingate.7, %forgetgate.7, %cellgate.7, %30, %31, %34, %35, %outgate.7, %41, %42, %43)
%1639 : bool = prim::AutogradAllZero(%2, %3, %4, %5, %6, %7, %8, %9, %10, %11, %12, %13, %14, %15, %16)
%1640 : bool[] = prim::ListConstruct(%1626, %1627, %1628, %1629, %1630, %1631, %1632, %1633, %1634, %1635, %1636, %1637, %1638, %1639)
%1641 : bool = aten::all(%1640)
```
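
A small Python model of the batched guard, assuming an undefined tensor is represented as `None`. The function names mirror the new ops but are only an illustration of their described behavior, not the JIT implementation.

```python
def autograd_all_nonzero(*tensors):
    """True if every tensor is defined (modeled here as 'not None')."""
    return all(t is not None for t in tensors)


def autograd_all_zero(*tensors):
    """True if every tensor is undefined (modeled here as None)."""
    return all(t is None for t in tensors)


def guard(expected_defined, expected_undefined, sizes_expected_none):
    """One batched check per category, plus one `is None` check per size."""
    checks = [size is None for size in sizes_expected_none]
    checks.append(autograd_all_nonzero(*expected_defined))
    checks.append(autograd_all_zero(*expected_undefined))
    return all(checks)
```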

My performance measurements showed some changes, but I don't really
trust them and think that they are probably just noise. Below are
tables with min-aggregation over 10 runs:

FastRNN models:

| name                                             | base time (s) |   diff time (s) |   % change |
| :---                                             |          ---: |            ---: |       ---: |
| lstm[aten]:bwd                                   |     30.059927 |       29.834089 |      -0.8% |
| lstm[aten]:fwd                                   |     25.673708 |       25.700039 |       0.1% |
| lstm[cudnn]:bwd                                  |     17.866232 |       17.893120 |       0.2% |
| lstm[cudnn]:fwd                                  |     11.418444 |       11.408514 |      -0.1% |
| lstm[jit]:bwd                                    |     27.127205 |       27.141029 |       0.1% |
| lstm[jit]:fwd                                    |     17.018047 |       16.975451 |      -0.3% |
| lstm[jit_multilayer]:bwd                         |     27.502396 |       27.365149 |      -0.5% |
| lstm[jit_multilayer]:fwd                         |     16.918591 |       16.917767 |      -0.0% |
| lstm[jit_premul]:bwd                             |     22.281199 |       22.215082 |      -0.3% |
| lstm[jit_premul]:fwd                             |     14.848708 |       14.896231 |       0.3% |
| lstm[jit_premul_bias]:bwd                        |     20.761206 |       21.170969 |       2.0% |
| lstm[jit_premul_bias]:fwd                        |     15.013515 |       15.037978 |       0.2% |
| lstm[jit_simple]:bwd                             |     26.715771 |       26.697786 |      -0.1% |
| lstm[jit_simple]:fwd                             |     16.675898 |       16.545893 |      -0.8% |
| lstm[py]:bwd                                     |     56.327065 |       54.731030 |      -2.8% |
| lstm[py]:fwd                                     |     39.876324 |       39.230572 |      -1.6% |

Torch Hub models:

| name                                             | base time (s) |   diff time (s) |   % change |
| :---                                             |          ---: |            ---: |       ---: |
| test_eval[BERT_pytorch-cuda-jit]                 |      0.111706 |        0.106604 |      -4.6% |
| test_eval[LearningToPaint-cuda-jit]              |      0.002841 |        0.002801 |      -1.4% |
| test_eval[Super_SloMo-cuda-jit]                  |      0.384869 |        0.384737 |      -0.0% |
| test_eval[attension_is_all_you_nee...-cuda-jit]  |      0.123857 |        0.123923 |       0.1% |
| test_eval[demucs-cuda-jit]                       |      0.077270 |        0.076878 |      -0.5% |
| test_eval[fastNLP-cuda-jit]                      |      0.000255 |        0.000249 |      -2.3% |
| test_eval[moco-cuda-jit]                         |      0.426472 |        0.427380 |       0.2% |
| test_eval[pytorch_CycleGAN_and_pix...-cuda-jit]  |      0.026483 |        0.026423 |      -0.2% |
| test_eval[pytorch_mobilenet_v3-cuda-jit]         |      0.036202 |        0.035853 |      -1.0% |
| test_eval[pytorch_struct-cuda-jit]               |      0.001439 |        0.001495 |       3.9% |
| test_train[BERT_pytorch-cuda-jit]                |      0.247236 |        0.247188 |      -0.0% |
| test_train[Background_Matting-cuda-jit]          |      3.536659 |        3.581864 |       1.3% |
| test_train[LearningToPaint-cuda-jit]             |      0.015341 |        0.015331 |      -0.1% |
| test_train[Super_SloMo-cuda-jit]                 |      1.018626 |        1.019098 |       0.0% |
| test_train[attension_is_all_you_nee...-cuda-jit] |      0.446314 |        0.444893 |      -0.3% |
| test_train[demucs-cuda-jit]                      |      0.169647 |        0.169846 |       0.1% |
| test_train[fastNLP-cuda-jit]                     |      0.001990 |        0.001978 |      -0.6% |
| test_train[moco-cuda-jit]                        |      0.855323 |        0.856974 |       0.2% |
| test_train[pytorch_mobilenet_v3-cuda-jit]        |      0.497723 |        0.485416 |      -2.5% |
| test_train[pytorch_struct-cuda-jit]              |      0.309692 |        0.308792 |      -0.3% |
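
For reading the tables: the `% change` column is consistent with `(diff - base) / base`, and min-aggregation keeps the fastest of the runs. A short sketch with hypothetical helper names, checked against a few rows above:

```python
def percent_change(base, diff):
    """Relative change of `diff` versus `base`, in percent."""
    return (diff - base) / base * 100.0


def best_of(runs):
    """Min-aggregation: keep the fastest of several timing runs."""
    return min(runs)


# (base time, diff time) pairs copied from the tables above.
rows = {
    "lstm[py]:bwd": (56.327065, 54.731030),
    "test_eval[BERT_pytorch-cuda-jit]": (0.111706, 0.106604),
    "test_eval[pytorch_struct-cuda-jit]": (0.001439, 0.001495),
}
for name, (base, diff) in rows.items():
    print(f"{name}: {percent_change(base, diff):+.1f}%")
```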

Differential Revision: D23794659

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: 859b68868ef839c5c6cbc7021879ee22d3144ea8
2020-09-24 14:31:49 -07:00
Yi Wang
022ba5a78b Make ddp_comm_hook_wrapper a private method. (#44643)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44643

This method is not used anywhere else.

Also formatted the file.

Test Plan: buck test caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks

Reviewed By: pritamdamania87

Differential Revision: D23675945

fbshipit-source-id: 2d04f94589a20913e46b8d71e6a39b70940c1461
2020-09-24 13:29:48 -07:00
Xinyu Li
26001a2334 Revert D23753711: [pytorch][PR] Add foreach APIs for binary ops with ScalarList
Test Plan: revert-hammer

Differential Revision:
D23753711 (71d1b5b0e2)

Original commit changeset: bf3e8c54bc07

fbshipit-source-id: 192692e0d3fff4cade9983db0a1760fedfc9674c
2020-09-24 11:55:49 -07:00
Gao, Xiang
3f5eee666c Adjust TF32 tests (#44240)
Summary:
- The thresholds of some tests are bumped up. Depending on the random generator, these tests sometimes fail with errors like 0.0059 is not smaller than 0.005. I ran `test_nn.py` and `test_torch.py` 10+ times to check that they are no longer flaky.
- Add `tf32_on_and_off` to new `matrix_exp` tests.
- Disable TF32 on test suites other than `test_nn.py` and `test_torch.py`

cc: ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44240

Reviewed By: mruberry

Differential Revision: D23882498

Pulled By: ngimel

fbshipit-source-id: 44a9ec08802c93a2efaf4e01d7487222478b6df8
2020-09-24 10:25:58 -07:00
Rohan Varma
e57a08119b Add a warning log when there is high skew of uneven inputs in DDP training (#45238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45238

Adds a warning when the discrepancy in the number of inputs across
processes is much higher than expected while running with uneven
inputs. A skew in the thousands can reduce performance by a
nontrivial amount, as shown in benchmarks, so it was proposed to add this
warning. Tested by running the tests so that the threshold is hit and
observing the output.
ghstack-source-id: 112773552

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23719270

fbshipit-source-id: 306264f62c1de65e733696a912bdb6e9376d5622
2020-09-24 09:50:44 -07:00
Raziel Alvarez Guevara
2b38c09f69 Moves prim ops from C10 back to JIT (#45144)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45144

Moves prim ops from C10 back to JIT.

These were originally moved to C10 from JIT in D19237648 (f362cd510d)
ghstack-source-id: 112775781

Test Plan:
buck test //caffe2/test/cpp/jit:jit

https://pxl.cl/1l22N

buck test adsatlas/gavel/lib/ata_processor/tests:ata_processor_test

https://pxl.cl/1lBxD

Reviewed By: iseeyuan

Differential Revision: D23697598

fbshipit-source-id: 36d1eb8c346e9b161ba6af537a218440a9bafd27
2020-09-24 09:44:20 -07:00
Taylor Robie
8507ea22b2 replace timer test with a mocked variant (#45173)
Summary:
I noticed that the recently introduced adaptive_autorange tests occasionally time out in CI, and I've been meaning to improve the Timer tests for a while. This PR allows unit tests to swap the measurement portion of `Timer` with a deterministic mock so we can thoroughly test behavior without having to worry about flaky CI measurements. It also means that the tests can be much more detailed and still finish very quickly.
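
The general pattern, sketched with a hypothetical minimal `Timer` (this is not the `torch.utils.benchmark` API, and `_measure`, `mean_time` and `MockTimer` are made-up names): only the measurement step is overridden, so the surrounding logic is tested against scripted, deterministic values.

```python
import itertools
import time


class Timer:
    """Hypothetical minimal timer, for illustration only."""

    def __init__(self, fn):
        self._fn = fn

    def _measure(self, number):
        """Real measurement: wall-clock seconds for `number` calls."""
        start = time.perf_counter()
        for _ in range(number):
            self._fn()
        return time.perf_counter() - start

    def mean_time(self, number=10, repeats=3):
        """Average per-call time across `repeats` measurements."""
        totals = [self._measure(number) for _ in range(repeats)]
        return sum(totals) / (number * repeats)


class MockTimer(Timer):
    """Replaces only the measurement step with scripted values."""

    def __init__(self, fn, scripted_totals):
        super().__init__(fn)
        self._scripted = itertools.cycle(scripted_totals)

    def _measure(self, number):
        # Ignore the clock entirely and return the next scripted total.
        return next(self._scripted)
```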

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45173

Test Plan: You're lookin' at it.

Reviewed By: ezyang

Differential Revision: D23873548

Pulled By: robieta

fbshipit-source-id: 26113e5cea0cbf46909b9bf5e90c878c29e87e88
2020-09-24 09:42:37 -07:00