Commit Graph

6361 Commits

Author SHA1 Message Date
Yanan Cao
64681d6bec Add all remaining method declarations from torch.distributed Python API to C++ (#45768)
Summary:
Also ran formatter on previous sections

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45768

Reviewed By: wanchaol

Differential Revision: D24129467

Pulled By: gmagogsfm

fbshipit-source-id: aa8a5c45c3609d5b96e5f585b699d9e3e71394c8
2020-10-06 12:36:36 -07:00
Nikita Shulga
930bddd403 Cleanup nccl.cpp (#45899)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45899

Use function polymorphism to avoid repeated casts.
I.e., instead of using `NCCL_CHECK(from_nccl_result(...))`, add a variant of the function that takes `ncclResult_t` as its input argument.
Add a non-pointer variant of `to_nccl_comm` to avoid the `*to_nccl_comm(&comm)` pattern.

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D24138012

Pulled By: malfet

fbshipit-source-id: 7f62a03e108cbe455910e86e894afdd1c27e8ff1
2020-10-06 11:26:14 -07:00
Peter Bell
d44eaf63d1 torch.fft helper functions (#44877)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44877

Part of gh-42175. This implements the `torch.fft` helper functions: `fftfreq`, `rfftfreq`, `fftshift` and `ifftshift`.
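
For reference, a quick sketch of the four helpers (assuming a build where the `torch.fft` module is available):
```
import torch

freqs = torch.fft.fftfreq(8)               # discrete frequency bins for an fft output
rfreqs = torch.fft.rfftfreq(8)             # non-negative bins matching rfft output
centered = torch.fft.fftshift(freqs)       # reorder so the zero frequency is centered
restored = torch.fft.ifftshift(centered)   # exact inverse of fftshift
assert torch.equal(freqs, restored)
```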

* #43009 Cleanup tracer handling of optional arguments

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D24043473

Pulled By: mruberry

fbshipit-source-id: 35de7b70b27658a426773f62d23722045ea53268
2020-10-05 22:04:52 -07:00
Pritam Damania
bf85642c4c Remove lock from GraphTask::set_exception_without_signal. (#45867)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45867

In most cases the lock ordering was hold a lock in local autograd and
then hold a lock in DistAutogradContext.

In case of `set_exception_without_signal` the lock order was in reverse and as
a result we saw potential deadlock issues in our TSAN tests. To fix this, I
removed the lock and instead just used std::atomic exchange.

In addition to this, I fixed TestE2E to ensure that we use the appropriate
timeout.

TestE2EProcessGroup was flaky for these two reasons and now is fixed.
ghstack-source-id: 113592709

Test Plan: waitforbuildbot.

Reviewed By: albanD

Differential Revision: D24120962

fbshipit-source-id: 12447b84ceae772b91e9a183c90d1e6340f44e66
2020-10-05 20:02:29 -07:00
Mingzhe Li
59083d6176 [NCCL] Support NCCL Send/Recv (#44921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44921

This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See https://github.com/pytorch/pytorch/issues/43995 for more context.
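
For illustration, a minimal sketch of the point-to-point calls this enables, assuming two ranks already launched with a standard launcher and an NCCL build that supports send/recv:
```
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
t = torch.ones(4, device=f"cuda:{rank}")
if rank == 0:
    dist.send(t, dst=1)   # blocking send to rank 1
elif rank == 1:
    dist.recv(t, src=0)   # blocking receive from rank 0
```
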
ghstack-source-id: 113592785

Test Plan: unittest

Reviewed By: jiayisuse

Differential Revision: D23709848

fbshipit-source-id: cdf38050379ecbb10450f3394631317b41163258
2020-10-05 18:27:57 -07:00
Nikita Shulga
1558a3657b Add LazyNVRTC (#45674)
Summary:
Instead of dynamically loading `caffe2_nvrtc`, lazyNVRTC provides the same functionality by binding all the hooks to a lazy-bind implementation, very similar to shared-library jump tables:
On the first call, each function from the list tries to get a global handle to the respective shared library and replaces itself with the dynamically resolved symbol, using the following template:
```
  auto fn = reinterpret_cast<decltype(&NAME)>(getCUDALibrary().sym(C10_SYMBOLIZE(NAME)));
  if (!fn)
    throw std::runtime_error("Can't get " C10_SYMBOLIZE(NAME));
  lazyNVRTC.NAME = fn;
  return fn(...);
```
Fixes https://github.com/pytorch/pytorch/issues/31985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45674

Reviewed By: ezyang

Differential Revision: D24073946

Pulled By: malfet

fbshipit-source-id: 1479a75e5200e14df003144625a859d312885874
2020-10-05 16:27:40 -07:00
Ansley Ussery
f18cc9c57d Change type inferred from empty annotation (#45360)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45360

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24078645

Pulled By: ansley

fbshipit-source-id: 5d37d07df75bd7a2111d44638befe53c1021ee82
2020-10-05 15:16:56 -07:00
Hao Lu
8a6b919163 [StaticRuntime] Fix broken tests (#45813)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45813

Fix tests broken by D23996656 (2b48dd168d).

Test Plan:
```
buck test mode/opt //pytorch/tensorboardX:test_pytorchtb -- 'test_pytorch_graph \(pytorch\.tensorboardX\.tests\.test_pytorch_graph\.PytorchGraphTest\)'
buck test mode/opt //pytext/tests:
buck test mode/dev-nosan //mobile-vision/projects/detectron2go/tests:test_caffe2_compatibles
```

Reviewed By: yinghai

Differential Revision: D24100807

fbshipit-source-id: e2f92aadca4161f5cf9f552e922fb4d6500af3a4
2020-10-03 16:54:22 -07:00
Nikita Shulga
24fa2daea6 Revert D24100389: Revert D24072697: [te] Get llvm codegen to compile with llvm9 and llvm-fb
Test Plan: revert-hammer

Differential Revision:
D24100389

Original commit changeset: b32c5163e4fb

fbshipit-source-id: 9ce7bfbcf411c0584e5d535ee107fb5a135ee6e6
2020-10-03 15:33:42 -07:00
Nikita Shulga
ff568a0e6b Revert D24072697: [te] Get llvm codegen to compile with llvm9 and llvm-fb
Test Plan: revert-hammer

Differential Revision:
D24072697 (e3d2defdc8)

Original commit changeset: 7f56b9f3cbe5

fbshipit-source-id: b32c5163e4fb6df99447f95fdb82674e5ae62f22
2020-10-03 12:27:26 -07:00
Hao Lu
2b48dd168d [StaticRuntime] Integrate Static Runtime into PyTorchPredictor (#45640)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45640

Reviewed By: dzhulgakov

Differential Revision: D23996656

fbshipit-source-id: 63d88c89d1df61a04deadc472319607ed83867e5
2020-10-02 23:03:05 -07:00
Edward Yang
546aab66c1 Revert D24027761: Update backward definition for more operators and reenable tests in test_ops.py
Test Plan: revert-hammer

Differential Revision:
D24027761 (7d809f5d8e)

Original commit changeset: c1f707c2a039

fbshipit-source-id: 30750d2f08886036fb8b2cd0ae51c7732d3b7b19
2020-10-02 18:52:57 -07:00
Shen Li
8cb7280242 Revert "Remove device maps from TensorPipe for v1.7 release (#45353)" (#45762)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45762

This reverts commit 5211fb97ac.

Test Plan: Imported from OSS

Reviewed By: colesbury

Differential Revision: D24088231

Pulled By: mrshenli

fbshipit-source-id: b6ee15ec5ae137ea127bdc2db8e1842764bc01d4
2020-10-02 15:14:05 -07:00
Yanan Cao
d150d3e276 Make sure each warnings.warn only executes once inside TorchScript. (#45382)
Summary:
* Add a pass at the end of runCleanupPasses to annotate each `aten::warn` so that it has a unique id
* Enhanced the interpreter so that it tracks which `aten::warn` instances have been executed before and skips them (illustrated below)
* Improved insertInstruction so that it correctly checks for overflow

Fixes https://github.com/pytorch/pytorch/issues/45108
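
A minimal sketch of the intended behavior (assuming a build with this change):
```
import warnings
import torch

@torch.jit.script
def foo(x: torch.Tensor) -> torch.Tensor:
    for _ in range(3):
        warnings.warn("only emitted once per aten::warn instruction")
        x = x + 1
    return x

foo(torch.zeros(1))  # the warning fires once, not three times
```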

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45382

Reviewed By: mrshenli

Differential Revision: D24060677

Pulled By: gmagogsfm

fbshipit-source-id: 9221bc55b9ce36b374bdf614da3fe47496b481c1
2020-10-02 14:55:10 -07:00
anjali411
7d809f5d8e Update backward definition for more operators and reenable tests in test_ops.py (#44444)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44444

This PR:
1. Fixes https://github.com/pytorch/pytorch/issues/41510. Updates backward formula for the following functions: `asin`, `acos`, `asinh`, `acosh`, `atan`, `atanh`, `div`, `log`, `log10`, `log2`, `log1p`, `pow`, `reciprocal`, `angle` (a short sketch follows this list).
2. Re-enables the tests in `test_ops.py`.
3. Adds dispatch for complex dtypes for `tanh_backward`.
4. Re-enables commented tests in `common_methods_invocation.py`.
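
As a small sanity check of item 1, a real-valued loss through one of the updated C -> C functions now backpropagates for complex inputs (a sketch, not from the PR itself):
```
import torch

x = torch.randn(4, dtype=torch.cdouble, requires_grad=True)
loss = x.log().abs().sum()  # C -> C log, then C -> R abs, real scalar loss
loss.backward()
print(x.grad)
```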

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24027761

Pulled By: anjali411

fbshipit-source-id: c1f707c2a039149a6e04bbde53ee120d9119d99a
2020-10-02 13:37:10 -07:00
Bert Maher
e3d2defdc8 [te] Get llvm codegen to compile with llvm9 and llvm-fb (#45726)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45726

FB has an old internal platform that uses some random llvm version
that looks sort of like llvm 7.  I've guarded that with the appropriate
LLVM_VERSION_PATCH.

I've also swapped out some of our uses of ThreadSafeModule/ThreadSafeContext
for the variants without ThreadSafe in the name.  As far as I can tell we
weren't using the bundled locks anyways, but I'm like 85% sure this is OK since
we compile under the Torch JIT lock anyways.

Test Plan: unit tests

Reviewed By: ZolotukhinM, asuhan

Differential Revision: D24072697

fbshipit-source-id: 7f56b9f3cbe5e6d54416acdf73876338df69ddb2
2020-10-02 13:33:13 -07:00
Omkar Salpekar
3799ba83e5 [Docs] Adding Store API Docs (#45543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45543

This PR adds documentation for the c10d Store to the public docs. Previously these docs were missing although we exposed a lightly-used (but potentially useful) Python API for our distributed key-value store.
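
A hedged sketch of the kind of usage the new docs cover; the host/port are placeholders and a single-process store keeps the example self-contained:
```
import datetime
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True,
                      timeout=datetime.timedelta(seconds=30))
store.set("first_key", "first_value")
print(store.get("first_key"))  # b'first_value'
```
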
ghstack-source-id: 113409195

Test Plan: Will verify screenshots by building the docs.

Reviewed By: pritamdamania87

Differential Revision: D24005598

fbshipit-source-id: 45c3600e7c3f220710e99a0483a9ce921d75d044
2020-10-02 11:16:56 -07:00
Eli Uriegas
a052597e6c Bump nightlies to 1.8.0 (#45696)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45696

Similar to https://github.com/pytorch/pytorch/pull/40519

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D24064381

Pulled By: seemethere

fbshipit-source-id: 1484b9c4fc5fa8cfa7be591a0a5d4b6e05968589
2020-10-02 11:10:34 -07:00
Pritam Damania
6e43f0db8b Use correct signatures for METH_NOARGS. (#45528)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45528

As described in https://github.com/pytorch/pytorch/issues/45419,
this resolves a bunch of cpython signature issues.

Closes: https://github.com/pytorch/pytorch/issues/45419
ghstack-source-id: 113385726

Test Plan: sentinel

Reviewed By: albanD

Differential Revision: D24000626

fbshipit-source-id: d334596f1f0256063691aa044c8fb2face260817
2020-10-02 10:43:58 -07:00
Andrew Millspaugh
cdf93b03de Add string versions of argument funcs in jit Node (#45464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45464

Usage of Symbols to find arguments requires one to generate a nonsense symbol for inputs that don't already have one. The intention of symbols appears to be something of an interned string, but the namespace component doesn't apply to an argument. In order to access the arguments by name without adding new symbols, versions of those functions taking std::string input were added. These can be proven valid based on the existing codepath. Additionally, a hasNamedInput convenience function was added to remove the necessity of a try/catch block in user code.

The primary motivation is to be able to easily handle the variable number of arguments in glow, so that the arange op may be implemented.

Reviewed By: eellison

Differential Revision: D23972315

fbshipit-source-id: 3e0b41910cf07e916186f1506281fb221725a91b
2020-10-02 10:26:29 -07:00
Supriya Rao
04526a49d3 [quant] creating quint4x2 dtype for quantized tensors (#44678)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44678

This is a prototype PR that introduces 4-bit qtensors. The new dtype added for this is c10::quint4x2.
The underlying storage for this is still uint8_t, so we pack two 4-bit values into a byte while quantizing.

This change uses most of the existing scaffolding for qtensor storage. We allocate storage
based on the dtype before creating a new qtensor.

It also adds a dispatch mechanism for this dtype so we can use it to get the bitwidth, qmin and qmax info
while quantizing and packing the qtensor (needed when we add 2-bit qtensors).

Kernels that use this dtype should be aware of the packing format.

Test Plan:
Locally tested
```
import os

import torch

x = torch.ones((100, 100), dtype=torch.float)
qx_8bit = torch.quantize_per_tensor(x, scale=1.0, zero_point=2, dtype=torch.quint8)
qx = torch.quantize_per_tensor(x, scale=1.0, zero_point=2, dtype=torch.quint4x2)

torch.save(x, "temp.p")
print('Size float (B):', os.path.getsize("temp.p"))
os.remove('temp.p')

torch.save(qx_8bit, "temp.p")
print('Size quantized 8bit(B):', os.path.getsize("temp.p"))
os.remove('temp.p')

torch.save(qx, "temp.p")
print('Size quantized 4bit(B):', os.path.getsize("temp.p"))
os.remove('temp.p')
```

Size float (B): 40760
Size quantized 8bit(B): 10808
Size quantized 4bit(B): 5816

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23993134

fbshipit-source-id: 073bf262f9680416150ba78ed2d932032275946d
2020-10-01 23:53:34 -07:00
Nikolay Korovaiko
a0d08b2199 Set the default bailout depth to 20 (#45710)
Summary:
This modifies the default bailout depth to 20, which gives us reasonable performance in the benchmarks we considered (fastrnns, maskrcnn, hub/benchmark, etc.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45710

Reviewed By: robieta

Differential Revision: D24071861

Pulled By: Krovatkin

fbshipit-source-id: 472aacc136f37297b21f577750c1d60683a6c81e
2020-10-01 23:37:41 -07:00
Abaho Katabarwa
de3a48013a Use CAFFE2_USE_MSVC_STATIC_RUNTIME to determine when to avoid waiting for global destructors on Windows (#43532)
Summary:
We are trying to build libtorch statically (BUILD_SHARED_LIBS=OFF) then link it into a DLL. Our setup hits the infinite loop mentioned [here](54c05fa34e/torch/csrc/autograd/engine.cpp (L228)) because we build with `BUILD_SHARED_LIBS=OFF` but still link it all into a DLL at the end of the day.

This PR fixes the issue by changing the condition to guard on which windows runtime the build links against using the `CAFFE2_USE_MSVC_STATIC_RUNTIME` flag. `CAFFE2_USE_MSVC_STATIC_RUNTIME` defaults to ON when `BUILD_SHARED_LIBS=OFF`, so backwards compatibility is maintained.

I'm not entirely confident I understand the subtleties of the windows runtime versus linking setup, but this setup works for us and should not affect the existing builds.

Fixes https://github.com/pytorch/pytorch/issues/44470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43532

Reviewed By: mrshenli

Differential Revision: D24053767

Pulled By: albanD

fbshipit-source-id: 1127fefe5104d302a4fc083106d4e9f48e50add8
2020-10-01 16:41:14 -07:00
generatedunixname89002005325676
84cf3372d1 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D24044108

fbshipit-source-id: 6dfe2f1201304fa58e42472e3f53c72cbb63d7d2
2020-10-01 05:29:03 -07:00
Xingying Cheng
4339f5c076 [PyTorch][QPL] Add instance_key into MOBILE_MODULE_LOAD_STATS logging. (#45518)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45518

Similar to the previous diff, add instance_key into MOBILE_MODULE_LOAD_STATS logging.
ghstack-source-id: 113149713

Test Plan:
```
09-29 11:50:23.345  6477  9351 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterLoadModel instance_key = 2015064908
09-29 11:50:23.409  6477  9351 W MobileModuleQPLObserver.cpp: TESTINGTESTING markerAnnotate instance_key = 2015064908, model_name = bi_pytext_v10
09-29 11:50:23.410  6477  9351 W MobileModuleQPLObserver.cpp: TESTINGTESTING markerAnnotate instance_key = 2015064908, model_type = FBNet
09-29 11:50:23.410  6477  9351 W MobileModuleQPLObserver.cpp: TESTINGTESTING markerAnnotate instance_key = 2015064908, op_list_string = ["aten::__getitem__.t", "aten::__is__", "aten::__isnot__", "aten::add.Tensor", "aten::append.t", "aten::cat", "aten::contiguous", "aten::conv1d", "aten::dim", "aten::embedding", "aten::eq.int", "aten::format", "aten::len.t", "aten::max.dim", "aten::mul.Tensor", "aten::permute", "aten::relu", "aten::softmax.int", "aten::tanh", "prepacked::linear_clamp_run", "prim::RaiseException", "prim::TupleIndex", "prim::TupleUnpack", "prim::Uninitialized", "prim::unchecked_cast"]
09-29 11:50:23.410  6477  9351 W MobileModuleQPLObserver.cpp: TESTINGTESTING onExitLoadModel instance_key = 2015064908
```

Reviewed By: iseeyuan

Differential Revision: D23996150

fbshipit-source-id: 7bf76af3b7e6b346afd20ab341204743c81cfe83
2020-09-30 23:31:35 -07:00
BowenBao
3da4cea658 [ONNX] Add dim_param support in export with onnx shape inference (#44920)
Summary:
* Support propagating `dim_param` in ONNX by encoding it as `ShapeSymbol` in the `SymbolicShape` of outputs. If export is called with `dynamic_axes` provided, shape inference will start with these axes set as dynamic.
* Add a new test file `test_pytorch_onnx_shape_inference.py`, reusing all test cases from `test_pytorch_onnx_onnxruntime.py` but focusing on validating shapes for all nodes in the graph. Currently this is not enabled in the CI, since there are still quite a few existing issues and corner cases to fix. The test defaults to running only at opset 12.
* Bug fixes, such as div, _len, and peephole.cpp passes for PackPadded, and LogSoftmaxCrossEntropy.
* This PR depends on existing PRs such as #44332.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44920

Reviewed By: eellison

Differential Revision: D23958398

Pulled By: bzinodev

fbshipit-source-id: 00479d9bd19c867d526769a15ba97ec16d56e51d
2020-09-30 21:56:24 -07:00
Xingying Cheng
3f440d74fc [PyTorch][QPL] Add instance_key into MOBILE_MODULE_STATS logging. (#45517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45517

Add a unique instance_key instead of the default one into MOBILE_MODULE_STATS logging to avoid overlaps between multiple events.
ghstack-source-id: 113149453

Test Plan:
Make sure that each event's start, annotate and end have the same instance_key:
```
09-28 23:46:03.094 19349 21069 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod instance_key = 1123198800, method_name = forward
09-28 23:46:03.094 19349 21069 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod instance_key = 1123198800, model_name = bi_pytext_v10
09-28 23:46:03.094 19349 21069 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod instance_key = 1123198800, model_type = FBNet
09-28 23:46:03.094 19349 21069 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod instance_key = 1123198800, op_list_string = ["aten::__getitem__.t", "aten::__is__", "aten::__isnot__", "aten::add.Tensor", "aten::append.t", "aten::cat", "aten::contiguous", "aten::conv1d", "aten::dim", "aten::embedding", "aten::eq.int", "aten::format", "aten::len.t", "aten::max.dim", "aten::mul.Tensor", "aten::permute", "aten::relu", "aten::softmax.int", "aten::tanh", "prepacked::linear_clamp_run", "prim::RaiseException", "prim::TupleIndex", "prim::TupleUnpack", "prim::Uninitialized", "prim::unchecked_cast"]
09-28 23:46:03.181 19349 21069 W MobileModuleQPLObserver.cpp: TESTINGTESTING onExitRunMethod instance_key = 1123198800
09-28 23:46:04.183 19349 20896 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod instance_key = 1521608147, method_name = forward
09-28 23:46:04.184 19349 20896 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod instance_key = 1521608147, model_name = __torch__.Model
09-28 23:46:04.205 19349 20896 W MobileModuleQPLObserver.cpp: TESTINGTESTING onExitRunMethod instance_key = 1521608147
```

Reviewed By: iseeyuan

Differential Revision: D23985178

fbshipit-source-id: bcd5db8dc680e3cf8d12edf865377e80693cc23b
2020-09-30 20:13:33 -07:00
Jerry Zhang
9d5607fcd9 [quant] Use PlaceholderObserver as default dynamic quant observer (#45343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45343

The current default dynamic quant observer is not correct, since we don't accumulate
min/max and we don't need to calculate qparams.

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D23933995

fbshipit-source-id: 3ff497c9f5f74c687e8e343ab9948d05ccbba09b
2020-09-30 19:01:18 -07:00
Taylor Robie
2b13d9413e Re-land: Add callgrind collection to Timer #44717 (#45586)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45586

Test Plan: The unit test has been softened to be less platform sensitive.

Reviewed By: mruberry

Differential Revision: D24025415

Pulled By: robieta

fbshipit-source-id: ee986933b984e736cf1525e1297de6b21ac1f0cf
2020-09-30 17:43:06 -07:00
Yanan Cao
3a2d45304d [Experimental][Partial] New implementation for torch.distributed APIs in C++ (#45547)
Summary:
This is an attempt at refactoring the `torch.distributed` implementation. The goal is to push the Python layer's global state (like _default_pg) into the C++ layer so that `torch.distributed` becomes more TorchScript friendly.

This PR adds the skeleton of the C++ implementation; at the moment it is not included in any build (and won't be until the method implementations are filled in). If you see any related test failures, feel free to revert.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45547

Reviewed By: izdeby

Differential Revision: D24024213

Pulled By: gmagogsfm

fbshipit-source-id: 2762767f63ebef43bf58e17f9447d53cf119f05f
2020-09-30 17:35:51 -07:00
Hector Yuen
f2c2b75e80 flush the buffer when printing the IR (#45585)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45585

I discovered this bug when I was trying to print the graph to a file. It turns out I had to close the file, but flushing should be a good safeguard in case other users forget.

Test Plan:
Tested with and without flushing.
with P144064292
without P144064767

Reviewed By: mortzur

Differential Revision: D24023819

fbshipit-source-id: 39574b3615feb28e5b5939664c04ddfb1257706a
2020-09-30 16:55:27 -07:00
Zino Benaissa
4be42034b6 Clear shape information before finalizing graph-mode quantization (#45282)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45282

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23909601

Pulled By: bzinodev

fbshipit-source-id: 3062cda46b15a79094a360216c35906afab7c723
2020-09-30 16:13:55 -07:00
Negin Raoof
6b42ca2d69 [ONNX] Update embedding_bag export (#44693)
Summary:
Export of embedding bag with a dynamic list of offsets.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44693

Reviewed By: malfet

Differential Revision: D23831980

Pulled By: bzinodev

fbshipit-source-id: 3eaff1a0f20d1bcfb8039e518d78c491be381e1a
2020-09-30 13:36:40 -07:00
Xinyu Li
c9bb990707 [c++] Distance-agnostic triplet margin loss (#45377)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45377

This PR adds a C++ implementation of the TripletMarginWithDistanceLoss, for which the Python implementation was introduced in PR #43680.  It's based on PR #44072, but I'm resubmitting this to unlink it from Phabricator.

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D24003973

fbshipit-source-id: 2d9ada7260a6f27425ff2fdbbf623dad0fb79405
2020-09-30 12:37:35 -07:00
Rohan Varma
181afd5220 Add an option to DDP to take a list of parameters to ignore upfront. (#44826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44826

As described in https://github.com/pytorch/pytorch/issues/43690, there
is a need for DDP to be able to ignore certain parameters in the module (not
install allreduce hooks) for certain use cases. `find_unused_parameters` is
sufficient from a correctness perspective, but we can get better performance
with this upfront list if users know which params are unused, since we won't
have to traverse the autograd graph every iteration.

To enable this, we add a field `parameters_to_ignore` to DDP init and don't
pass a parameter to the reducer if it is in the given list.
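
A hedged sketch of supplying the upfront ignore list; the helper name below is my assumption about the hook this PR adds, and the entries are fully qualified parameter names:
```
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
model = torch.nn.Linear(10, 10).cuda()
# Assumed hook: mark "bias" so DDP does not install an allreduce for it.
DDP._set_params_and_buffers_to_ignore_for_model(model, ["bias"])
ddp_model = DDP(model, device_ids=[torch.cuda.current_device()])
```
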
ghstack-source-id: 113210109

Test Plan: Added unittest

Reviewed By: xw285cornell, mrshenli

Differential Revision: D23740639

fbshipit-source-id: a0411712a8b0b809b9c9e6da04bef2b955ba5314
2020-09-30 11:52:50 -07:00
Mike Ruberry
51d0ae9207 Revert D24010742: [pytorch][PR] Add callgrind collection to Timer
Test Plan: revert-hammer

Differential Revision:
D24010742 (9b27e0926b)

Original commit changeset: df6bc765f8ef

fbshipit-source-id: 4c1edd57ea932896f7052716427059c924222501
2020-09-30 10:15:46 -07:00
anjali411
415ed434aa Add whitelist for complex backward (#45461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45461

This PR disables autograd for all C -> C, R -> C functions which are not included in the whitelist `GRADIENT_IMPLEMENTED_FOR_COMPLEX`. In practice, there will be a RuntimeError during forward computation when the outputs are differentiable:
```
>>> x=torch.randn(4, 4, requires_grad=True, dtype=torch.cdouble)
>>> x.pow(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: pow does not support automatic differentiation for outputs with complex dtype.
```

The implicit assumption here is that all the C -> R functions have correct backward definitions. So before merging this PR, the following functions must be tested and verified to have correct backward definitions:
`torch.abs` (updated in #39955), `torch.angle`, `torch.norm`, `torch.irfft`, `torch.istft`.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D23998156

Pulled By: anjali411

fbshipit-source-id: 370eb07fe56ac84dd8e2233ef7bf3a3eb8aeb179
2020-09-30 08:45:55 -07:00
VinodSKumar
e02868e12d Unify Transformer coder Constructors (#45515)
Summary:
Fixes [#45502](https://github.com/pytorch/pytorch/issues/45502)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45515

Reviewed By: zhangguanheng66, ZolotukhinM

Differential Revision: D23994644

Pulled By: glaringlee

fbshipit-source-id: b8728e8dfd8857e27246ebb11b17c2d1b48796ca
2020-09-30 07:05:41 -07:00
Nikolay Korovaiko
7566823779 Enable PE + TE (#45546)
Summary:
This PR enables PE + TE for 1.7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45546

Reviewed By: ZolotukhinM

Differential Revision: D24006940

Pulled By: Krovatkin

fbshipit-source-id: a3326077d34a023941acdb06c4907c96e7ba0115
2020-09-30 06:49:59 -07:00
Taylor Robie
9b27e0926b Add callgrind collection to Timer (#44717)
Summary:
This PR allows Timer to collect deterministic instruction counts for (some) snippets. Because of the intrusive nature of Valgrind (effectively replacing the CPU with an emulated one) we have to perform our measurements in a separate process. This PR writes a `.py` file containing the Timer's `setup` and `stmt`, and executes it within a `valgrind` subprocess along with a plethora of checks and error handling. There is still a bit of jitter around the edges due to the Python glue that I'm using, but the PyTorch signal is quite good and thus this provides a low friction way of getting signal. I considered using JIT as an alternative, but:

A) Python specific overheads (e.g. parsing) are important
B) JIT might do rewrites which would complicate measurement.

Consider the following bit of code, related to https://github.com/pytorch/pytorch/issues/44484:
```
from torch.utils._benchmark import Timer
counts = Timer(
    "x.backward()",
    setup="x = torch.ones((1,)) + torch.ones((1,), requires_grad=True)"
).collect_callgrind()

for c, fn in counts[:20]:
    print(f"{c:>12}  {fn}")
```

```
      812800  ???:_dl_update_slotinfo
      355600  ???:update_get_addr
      308300  work/Python/ceval.c:_PyEval_EvalFrameDefault'2
      304800  ???:__tls_get_addr
      196059  ???:_int_free
      152400  ???:__tls_get_addr_slow
      138400  build/../c10/core/ScalarType.h:c10::typeMetaToScalarType(caffe2::TypeMeta)
      126526  work/Objects/dictobject.c:_PyDict_LoadGlobal
      114268  ???:malloc
      101400  work/Objects/unicodeobject.c:PyUnicode_FromFormatV
       85900  work/Python/ceval.c:_PyEval_EvalFrameDefault
       79946  work/Objects/typeobject.c:_PyType_Lookup
       72000  build/../c10/core/Device.h:c10::Device::validate()
       70000  /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
       66400  work/Objects/object.c:_PyObject_GenericGetAttrWithDict
       63000  ???:pthread_mutex_lock
       61200  work/Objects/dictobject.c:PyDict_GetItem
       59800  ???:free
       58400  work/Objects/tupleobject.c:tupledealloc
       56707  work/Objects/dictobject.c:lookdict_unicode_nodummy
```

Moreover, if we backport this PR to 1.6 (just copy the `_benchmarks` folder) and load those counts as `counts_1_6`, then we can easily diff them:
```
print(f"Head instructions: {sum(c for c, _ in counts)}")
print(f"1.6 instructions:  {sum(c for c, _ in counts_1_6)}")
count_dict = {fn: c for c, fn in counts}
for c, fn in counts_1_6:
    _ = count_dict.setdefault(fn, 0)
    count_dict[fn] -= c
count_diffs = sorted([(c, fn) for fn, c in count_dict.items()], reverse=True)
for c, fn in count_diffs[:15] + [["", "..."]] + count_diffs[-15:]:
    print(f"{c:>8}  {fn}")
```

```
Head instructions: 7609547
1.6 instructions:  6059648
  169600  ???:_dl_update_slotinfo
  101400  work/Objects/unicodeobject.c:PyUnicode_FromFormatV
   74200  ???:update_get_addr
   63600  ???:__tls_get_addr
   46800  work/Python/ceval.c:_PyEval_EvalFrameDefault
   33512  work/Objects/dictobject.c:_PyDict_LoadGlobal
   31800  ???:__tls_get_addr_slow
   31700  build/../aten/src/ATen/record_function.cpp:at::RecordFunction::RecordFunction(at::RecordScope)
   28300  build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object*, _object**, bool)
   27800  work/Objects/object.c:_PyObject_GenericGetAttrWithDict
   27401  work/Objects/dictobject.c:lookdict_unicode_nodummy
   24115  work/Objects/typeobject.c:_PyType_Lookup
   24080  ???:_int_free
   21700  work/Objects/dictobject.c:PyDict_GetItemWithError
   20700  work/Objects/dictobject.c:PyDict_GetItem
          ...
   -3200  build/../c10/util/SmallVector.h:at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool)
   -3400  build/../aten/src/ATen/native/TensorIterator.cpp:at::TensorIterator::resize_outputs(at::TensorIteratorConfig const&)
   -3500  /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:std::unique_lock<std::mutex>::unlock()
   -3700  build/../torch/csrc/utils/python_arg_parser.cpp:torch::PythonArgParser::raw_parse(_object*, _object*, _object**)
   -4207  work/Objects/obmalloc.c:PyMem_Calloc
   -4500  /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
   -4800  build/../torch/csrc/autograd/generated/VariableType_2.cpp:torch::autograd::VariableType::add__Tensor(at::Tensor&, at::Tensor const&, c10::Scalar)
   -5000  build/../c10/core/impl/LocalDispatchKeySet.cpp:c10::impl::ExcludeDispatchKeyGuard::ExcludeDispatchKeyGuard(c10::DispatchKey)
   -5300  work/Objects/listobject.c:PyList_New
   -5400  build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionParameter::check(_object*, std::vector<pybind11::handle, std::allocator<pybind11::handle> >&)
   -5600  /usr/include/c++/8/bits/std_mutex.h:std::unique_lock<std::mutex>::unlock()
   -6231  work/Objects/obmalloc.c:PyMem_Free
   -6300  work/Objects/listobject.c:list_repeat
  -11200  work/Objects/listobject.c:list_dealloc
  -28900  build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object**, bool)
```

Remaining TODOs:
  * Include a timer in the generated script for cuda sync.
  * Add valgrind to CircleCI machines and add a unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44717

Reviewed By: soumith

Differential Revision: D24010742

Pulled By: robieta

fbshipit-source-id: df6bc765f8efce7193893edba186cd62b4b23623
2020-09-30 05:52:54 -07:00
Ilia Cherniavskii
f5c95d5cf1 Source code level attribution in profiler (#43898)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43898

Adding a with_source parameter to enable tracking source code
(filename and line) in the profiler for eager, torchscript and autograd
modes.
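
A hedged sketch; the flag name `with_source` follows this commit message (released versions expose similar functionality as `with_stack`):
```
import torch
from torch.autograd import profiler

with profiler.profile(with_source=True) as prof:  # flag name per this commit
    x = torch.randn(10, 10)
    y = (x + x).sum()
print(prof.table())
```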

Test Plan:
python test/test_profiler.py
```
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Source Location
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  --------------------------------------------
ts_method_1                          10.43%           235.364us        36.46%           822.920us        822.920us        1                test/test_profiler.py(70): test_source
aten::add                            7.52%            169.833us        8.88%            200.439us        200.439us        1                test/test_profiler.py(69): test_source
aten::normal_                        6.26%            141.380us        6.26%            141.380us        141.380us        1                test/test_profiler.py(67): test_source
aten::add                            5.80%            130.830us        8.41%            189.800us        63.267us         3                test/test_profiler.py(72): test_source
aten::sum                            5.02%            113.340us        8.39%            189.475us        189.475us        1                test/test_profiler.py(64): ts_method_1
aten::add                            4.58%            103.346us        6.33%            142.847us        142.847us        1                test/test_profiler.py(62): ts_method_1
aten::mul                            4.05%            91.498us         9.62%            217.113us        217.113us        1                test/test_profiler.py(71): test_source
aten::add                            4.03%            90.880us         5.60%            126.405us        126.405us        1                test/test_profiler.py(58): ts_method_2
aten::empty                          3.49%            78.735us         3.49%            78.735us         19.684us         4                test/test_profiler.py(72): test_source
```

Reviewed By: ngimel

Differential Revision: D23432664

Pulled By: ilia-cher

fbshipit-source-id: 83ad7ebe0c2502494d3b48c4e687802db9c77615
2020-09-30 00:57:35 -07:00
Peng-Jen Chen
93650a82c9 Move prim::tolist math.log and aten::cpu to lite interpreter for translation model (#45482)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45482

Working on some models that need these ops on lite interpreter.

Test Plan: Locally built and loaded/ran the TS model without problems.

Reviewed By: iseeyuan

Differential Revision: D23906581

fbshipit-source-id: 01b9de2af2046296165892b837bc14a7e5d59b4e
2020-09-29 21:42:18 -07:00
Mikhail Zolotukhin
4aca63d38a [TensorExpr] Change API for creating Load and Store expressions. (#45520)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45520

With this change, `Load`s and `Store`s no longer accept `Placeholder`s in
their constructors and `::make` functions and can only be built with a
`Buf`.
`Placeholder` gets its own `store`, `load`, `storeWithMask`, and
`loadWithMask` methods for more convenient construction.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23998789

Pulled By: ZolotukhinM

fbshipit-source-id: 3fe018e00c1529a563553b2b215f403b34aea912
2020-09-29 20:52:38 -07:00
Thomas Viehmann
22a34bcf4e ROCm ❤ TensorExpr (#45506)
Summary:
This might be an alternative to reverting https://github.com/pytorch/pytorch/issues/45396.
The obvious rough edge is that I'm not really seeing the work group limits that TensorExpr produces.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45506

Reviewed By: zhangguanheng66

Differential Revision: D23991410

Pulled By: Krovatkin

fbshipit-source-id: 11d3fc4600e4bffb1d1192c6b8dd2fe22c1e064e
2020-09-29 16:52:16 -07:00
Randall Hunt
ab5cf16b6c fix standard deviation gradient NaN behavior (#45468)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/4320
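
A minimal repro of the behavior being fixed; the post-fix gradient value is my reading of the linked issue, not stated in the commit:
```
import torch

x = torch.ones(3, requires_grad=True)  # zero-variance input
x.std().backward()
print(x.grad)  # was tensor([nan, nan, nan]); finite after this fix
```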

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45468

Reviewed By: zhangguanheng66

Differential Revision: D23991064

Pulled By: albanD

fbshipit-source-id: d4274895f2dac8b2cdbd73e5276ce3df466fc341
2020-09-29 13:47:29 -07:00
anjali411
18876b5722 Update backward formula for torch.dot and add backward definition for torch.vdot (#45074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45074

TODO: Add R -> C tests in https://github.com/pytorch/pytorch/pull/44744 (blocked on some JIT changes)

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D23975361

Pulled By: anjali411

fbshipit-source-id: 3512bd2962b588a198bc317673bd18cc96ac823f
2020-09-29 12:52:03 -07:00
Ivan Yashchuk
f47fd0eb72 Updated cholesky_backward for complex inputs (#45267)
Summary:
Updated `cholesky_backward` to work correctly for complex input.
Note that the current implementation gives the conjugate of what JAX would return. anjali411, is that the correct thing to do?
Ref. https://github.com/pytorch/pytorch/issues/44895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45267

Reviewed By: bwasti

Differential Revision: D23975269

Pulled By: anjali411

fbshipit-source-id: 9908b0bb53c411e5ad24027ff570c4f0abd451e6
2020-09-29 11:07:32 -07:00
Xingying Cheng
ea59251f51 Fix model_name not logged properly issue. (#45488)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45488

model_name logging was broken; the issue came from the recent change that assigned the method name into the module name. This diff fixes it.
ghstack-source-id: 113103942

Test Plan:
Made sure that the model_name is now logged from module_->name().
Verified with one model that does not contain the model metadata; the model_name field is logged as below:

09-28 21:59:30.065 11530 12034 W module.cpp: TESTINGTESTING run() module = __torch__.Model
09-28 21:59:30.065 11530 12034 W module.cpp: TESTINGTESTING metadata does not have model_name assigning to __torch__.Model
09-28 21:59:30.066 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod log  model_name = __torch__.Model
09-28 21:59:30.066 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod log  method_name = labels
09-28 21:59:30.068 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onExitRunMethod()

Reviewed By: linbinyu

Differential Revision: D23984165

fbshipit-source-id: 5b00f50ea82106b695c2cee14029cb3b2e02e2c8
2020-09-29 10:37:36 -07:00
Akshit Khurana
5f49d14be2 Add mobile_optimized tag to optimized model. (#45479)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45479

Add a top-level boolean attribute to the model called mobile_optimized that is set to true if it is optimized.
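
A hedged sketch of checking the new flag; reading it with `getattr` is an assumption about how the attribute is exposed:
```
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

scripted = torch.jit.script(torch.nn.Linear(4, 4))
opt_model = optimize_for_mobile(scripted)
print(getattr(opt_model, "mobile_optimized", False))  # True once optimized
```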

Test Plan: buck test //caffe2/test:mobile passes

Reviewed By: kimishpatel

Differential Revision: D23956728

fbshipit-source-id: 79c5931702208b871454319ca2ab8633596b1eb8
2020-09-29 10:06:57 -07:00
Mike Ruberry
ab5edf21b0 Revert D23789657: [wip] fast typeMeta/ScalarType conversion approach 2
Test Plan: revert-hammer

Differential Revision:
D23789657 (1ed1a2f5b0)

Original commit changeset: 5afdd52d24bd

fbshipit-source-id: 6d827be8895bcb39c8e85342eee0f7a3f5056c76
2020-09-29 09:40:53 -07:00
Mike Ruberry
56af122659 Revert D23966878: [pytorch][PR] This PR flips a switch to enable PE + TE
Test Plan: revert-hammer

Differential Revision:
D23966878 (dddb685c11)

Original commit changeset: 2010a0b07c59

fbshipit-source-id: 132556039730fd3e4babd0d7ca8daf9c8d14f728
2020-09-29 04:33:19 -07:00
Basil Hosmer
1ed1a2f5b0 [wip] fast typeMeta/ScalarType conversion approach 2 (#44965)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44965

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23789657

Pulled By: bhosmer

fbshipit-source-id: 5afdd52d24bd097891ff4a7313033f7bd400165e
2020-09-29 02:39:36 -07:00
Mikhail Zolotukhin
b86008ab75 [TensorExpr] Remove buf_ field from class Tensor. (#45390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45390

Tensor objects should always refer to their Function's bufs. Currently
we never create a Tensor with a buffer different from that of its function,
but having it in two places seems incorrect and dangerous.

Differential Revision: D23952865

Test Plan: Imported from OSS

Reviewed By: nickgg

Pulled By: ZolotukhinM

fbshipit-source-id: e63fc26d7078427514649d9ce973b74ea635a94a
2020-09-29 01:21:57 -07:00
Mikhail Zolotukhin
3c33695a6d [TensorExpr] Rename Buffer to Placeholder. (#45389)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45389

Differential Revision: D23952866

Test Plan: Imported from OSS

Reviewed By: nickgg

Pulled By: ZolotukhinM

fbshipit-source-id: 17eedd3ac17897501403482ac1866c569d247c75
2020-09-29 01:21:54 -07:00
Mikhail Zolotukhin
92306b85d5 [TensorExpr] Consolidate {buffer,function,tensor}.{h.cpp} in tensor.{h,cpp}. (#45388)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45388

Classes defined in these files are closely related, so it is reasonable
to have them all in one file. The change is purely a code move.

Differential Revision: D23952867

Test Plan: Imported from OSS

Reviewed By: nickgg

Pulled By: ZolotukhinM

fbshipit-source-id: 12cfaa968bdfc4dff00509e34310a497c7b59155
2020-09-29 01:17:10 -07:00
Nikolay Korovaiko
dddb685c11 This PR flips a switch to enable PE + TE (#45396)
Summary:
This PR flips a switch to enable PE + TE
next PR: https://github.com/pytorch/pytorch/pull/45397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45396

Reviewed By: suo

Differential Revision: D23966878

Pulled By: Krovatkin

fbshipit-source-id: 2010a0b07c595992a88b3fe0792d6af315cf421e
2020-09-28 21:57:50 -07:00
Natalia Gimelshein
50b91103a9 add self cuda time to avoid double/quadruple counting (#45209)
Summary:
In the profiler, CUDA did not report self time, so for composite functions there was no way to determine which function was really taking the time. In addition, the reported "total cuda time" was frequently more than the total wallclock time. This PR adds "self CUDA time" to the profiler and computes total CUDA time based on self CUDA time, similar to how it's done for CPU. Also, slight formatting changes to make the table more compact. Before:
```
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                  Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
aten::matmul          0.17%            890.805us        99.05%           523.401ms        5.234ms          49.91%           791.184ms        7.912ms          100
aten::mm              98.09%           518.336ms        98.88%           522.511ms        5.225ms          49.89%           790.885ms        7.909ms          100
aten::t               0.29%            1.530ms          0.49%            2.588ms          25.882us         0.07%            1.058ms          10.576us         100
aten::view            0.46%            2.448ms          0.46%            2.448ms          12.238us         0.06%            918.936us        4.595us          200
aten::transpose       0.13%            707.204us        0.20%            1.058ms          10.581us         0.03%            457.802us        4.578us          100
aten::empty           0.14%            716.056us        0.14%            716.056us        7.161us          0.01%            185.694us        1.857us          100
aten::as_strided      0.07%            350.935us        0.07%            350.935us        3.509us          0.01%            156.380us        1.564us          100
aten::stride          0.65%            3.458ms          0.65%            3.458ms          11.527us         0.03%            441.258us        1.471us          300
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 528.437ms
CUDA time total: 1.585s

Recorded timeit time:  789.0814 ms

```
Note that the recorded timeit time (with proper cuda syncs) is half the "CUDA time total" reported by the profiler.

After
```
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
        aten::matmul         0.15%     802.716us        99.06%     523.548ms       5.235ms     302.451us         0.04%     791.151ms       7.912ms           100
            aten::mm        98.20%     519.007ms        98.91%     522.745ms       5.227ms     790.225ms        99.63%     790.848ms       7.908ms           100
             aten::t         0.27%       1.406ms         0.49%       2.578ms      25.783us     604.964us         0.08%       1.066ms      10.662us           100
          aten::view         0.45%       2.371ms         0.45%       2.371ms      11.856us     926.281us         0.12%     926.281us       4.631us           200
     aten::transpose         0.15%     783.462us         0.22%       1.173ms      11.727us     310.016us         0.04%     461.282us       4.613us           100
         aten::empty         0.11%     591.603us         0.11%     591.603us       5.916us     176.566us         0.02%     176.566us       1.766us           100
    aten::as_strided         0.07%     389.270us         0.07%     389.270us       3.893us     151.266us         0.02%     151.266us       1.513us           100
        aten::stride         0.60%       3.147ms         0.60%       3.147ms      10.489us     446.451us         0.06%     446.451us       1.488us           300
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 528.498ms
CUDA time total: 793.143ms

Recorded timeit time:  788.9832 ms

```
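
For context, a hedged sketch of the kind of run that produces the tables above; the `self_cuda_time_total` sort key is an assumption based on the new column:
```
import torch

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for _ in range(100):
        torch.matmul(a, b)
torch.cuda.synchronize()
print(prof.key_averages().table(sort_by="self_cuda_time_total"))
```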

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45209

Reviewed By: zou3519

Differential Revision: D23925491

Pulled By: ngimel

fbshipit-source-id: 7f9c49238d116bfd2db9db3e8943355c953a77d0
2020-09-28 21:51:13 -07:00
Shen Li
5be954b502 Fix WorkerInfo link format (#45476)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45476

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23982069

Pulled By: mrshenli

fbshipit-source-id: 6d932e77c1941dfd96592b388353f0fc8968dde6
2020-09-28 20:48:15 -07:00
Alex Suhan
52cbc9e4ec [TensorExpr] Always inline and DCE in the LLVM backend (#45445)
Summary:
Inline pytorch into the wrapper, which is especially helpful in combination
with dead code elimination to reduce IR size and compilation times when
a lot of parameters are unused.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45445

Test Plan: CI

Reviewed By: ZolotukhinM

Differential Revision: D23969009

Pulled By: asuhan

fbshipit-source-id: a21509d07e4c130b6aa6eae5236bb64db2748a3d
2020-09-28 18:11:13 -07:00
Meghan Lele
7ac872b934 [JIT] Modify to_backend API so that it accepts wrapped modules (#43612)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43612

**Summary**
This commit modifies the `torch._C._jit_to_backend` function so that it
accepts `ScriptModules` as inputs. It already returns `ScriptModules`
(as opposed to C++ modules), so this makes sense and makes the API more
intuitive.
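
A minimal sketch of the updated usage; the backend name "example_backend" and the compile-spec dict are placeholders:
```
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

scripted = torch.jit.script(M())  # a ScriptModule is now accepted directly
lowered = torch._C._jit_to_backend("example_backend", scripted, {"forward": {}})
```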

**Test Plan**
Continuous integration, which includes unit tests and out-of-tree tests
for custom backends.

**Fixes**
This commit fixes #41432.

Test Plan: Imported from OSS

Reviewed By: suo, jamesr66a

Differential Revision: D23339854

Pulled By: SplitInfinity

fbshipit-source-id: 08ecef729c4e1e6bddf3f483276947fc3559ea88
2020-09-28 17:17:01 -07:00
Heitor Schueroff de Souza
96f8755034 Fixed handling of nan for evenly_distribute_backward (#45280)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45280

Performance is the same on CPU, and on CUDA it is only 1-1.05x slower. This change is necessary for the future nan ops, including nan(min|max|median)

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D23908796

Pulled By: heitorschueroff

fbshipit-source-id: c2b57acbe924cfa59fbd85216811f29f4af05088
2020-09-28 15:57:02 -07:00
Omkar Salpekar
6b65b3cbd8 [Distributed] DeleteKey API for c10d TCP Store (#45401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45401

Added a DeleteKey API for the TCP Store
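
A hedged sketch of the new call next to the existing Store API; the Python method name is assumed from the commit title, and host/port are placeholders:
```
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True)
store.set("k1", "v1")
store.set("k2", "v2")
store.delete_key("k1")   # new API (name assumed)
print(store.num_keys())  # verify deletion via the key count
```
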
ghstack-source-id: 112997162

Test Plan:
Modified the existing get/set test to use delete. Verified that the
correct keys were deleted and that the numKeys API returned the right values

Reviewed By: mrshenli

Differential Revision: D23955730

fbshipit-source-id: 5c9f82be34ff4521c59f56f8d9c1abf775c67f9f
2020-09-28 15:30:39 -07:00
lcskrishna
a4486fe7ba [ROCm] Print name irrespective of seq number assignment for roctx traces (#45229)
Summary:
Recent changes to the seq_num correlation behavior in the profiler (PR https://github.com/pytorch/pytorch/issues/42565) changed the behavior of emit_nvtx(record_shapes=True), which no longer prints the name of the operator properly.

Created this PR to dump out the name in roctx traces irrespective of the assigned sequence number, for ROCm only.

cc: jeffdaily sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45229

Reviewed By: zou3519

Differential Revision: D23932902

Pulled By: albanD

fbshipit-source-id: c782667ff002b70b51f1cc921afd1b1ac533b39d
2020-09-28 15:03:47 -07:00
Yi Wang
7a4c417ed3 Fix typo (#45379)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45379

Registeres -> Registers in reducer.h.
ghstack-source-id: 112982279

Test Plan: N/A

Reviewed By: mrshenli

Differential Revision: D23951203

fbshipit-source-id: 96c7dc2e1e12c132339b9ac83ce1da52c812740c
2020-09-28 14:02:01 -07:00
Bram Wasti
87b356d093 [static runtime] Split out graph preparation from runtime (#44131)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44131

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604305

Pulled By: bwasti

fbshipit-source-id: 7b47da4961d99074199417ef1407a788c7d80ee6
2020-09-28 13:01:23 -07:00
Nikolay Korovaiko
993628c74a Build shape expressions and remove outputs that are only used by aten::sizes (#45080)
Summary:
Currently, TE materializes all intermediate results even if they are only used for computing their shapes. This diff ports the approach the OF (Old Fuser) took to deal with this issue. Namely, given the structure of a fusion group, we infer all the sizes outside the fusion group based on the fusion group's inputs.

A simple example would be:

```
        def test_fuse(a, b):
            c = a + b
            d = c + b
            return d
```

Here we don't need to cache `c` as computing a gradient for `b` in `d = c + b` doesn't need it. We do need to compute sizes for all arguments here in case broadcasts happen.

Without this optimization, TE would need to materialize `c` so we can get its size

```
[DUMP profiling_graph_executor_impl.cpp:499] Optimized Graph:
[DUMP profiling_graph_executor_impl.cpp:499] graph(%a.1 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %b.1 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %11 : Tensor = prim::DifferentiableGraph_0(%b.1, %a.1)
[DUMP profiling_graph_executor_impl.cpp:499]   return (%11)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::DifferentiableGraph_0 = graph(%11 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %13 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %59 : int[] = aten::size(%13) # <string>:3:44
[DUMP profiling_graph_executor_impl.cpp:499]   %62 : int[] = aten::size(%11) # <string>:3:93
[DUMP profiling_graph_executor_impl.cpp:499]   %83 : Double(1:1, requires_grad=0, device=cuda:0), %84 : Double(1:1, requires_grad=0, device=cuda:0), %85 : bool = prim::TypeCheck(%11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]   %86 : Tensor, %87 : Tensor = prim::If(%85)
[DUMP profiling_graph_executor_impl.cpp:499]     block0():
[DUMP profiling_graph_executor_impl.cpp:499]       %d.4 : Double(1:1, requires_grad=0, device=cuda:0), %c.4 : Double(1:1, requires_grad=0, device=cuda:0) = prim::TensorExprGroup_0(%83, %84)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%d.4, %c.4)
[DUMP profiling_graph_executor_impl.cpp:499]     block1():
[DUMP profiling_graph_executor_impl.cpp:499]       %94 : Function = prim::Constant[name="fallback_function", fallback=1]()
[DUMP profiling_graph_executor_impl.cpp:499]       %95 : (Tensor, Tensor) = prim::CallFunction(%94, %11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]       %96 : Tensor, %97 : Tensor = prim::TupleUnpack(%95)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%96, %97)
[DUMP profiling_graph_executor_impl.cpp:499]   %60 : int[] = aten::size(%87) # <string>:3:55
[DUMP profiling_graph_executor_impl.cpp:499]   %61 : int[]? = aten::_size_if_not_equal(%59, %60) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %64 : int[]? = aten::_size_if_not_equal(%62, %60) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   %67 : int[] = aten::size(%86) # <string>:3:55
[DUMP profiling_graph_executor_impl.cpp:499]   %68 : int[]? = aten::_size_if_not_equal(%60, %67) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %71 : int[]? = aten::_size_if_not_equal(%62, %67) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   return (%86, %61, %64, %68, %71)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::TensorExprGroup_0 = graph(%1 : Double(1:1, requires_grad=0, device=cuda:0),
[DUMP profiling_graph_executor_impl.cpp:499]       %4 : Double(1:1, requires_grad=0, device=cuda:0)):
[DUMP profiling_graph_executor_impl.cpp:499]   %5 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %c.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%4, %1, %5) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2872:16
[DUMP profiling_graph_executor_impl.cpp:499]   %2 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %d.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%c.3, %1, %2) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2873:16
[DUMP profiling_graph_executor_impl.cpp:499]   return (%d.3, %c.3)
```

With this optimization we use `prim::BroadcastSizes` to compute the size of `c`. No need to materialize it.

```
[DUMP profiling_graph_executor_impl.cpp:499] Optimized Graph:
[DUMP profiling_graph_executor_impl.cpp:499] graph(%a.1 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %b.1 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %11 : Tensor = prim::DifferentiableGraph_0(%b.1, %a.1)
[DUMP profiling_graph_executor_impl.cpp:499]   return (%11)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::DifferentiableGraph_0 = graph(%11 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %13 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %59 : int[] = aten::size(%13) # <string>:3:44
[DUMP profiling_graph_executor_impl.cpp:499]   %62 : int[] = aten::size(%11) # <string>:3:93
[DUMP profiling_graph_executor_impl.cpp:499]   %88 : Double(1:1, requires_grad=0, device=cuda:0), %89 : Double(1:1, requires_grad=0, device=cuda:0), %90 : bool = prim::TypeCheck(%11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]   %91 : Tensor = prim::If(%90)
[DUMP profiling_graph_executor_impl.cpp:499]     block0():
[DUMP profiling_graph_executor_impl.cpp:499]       %d.4 : Double(1:1, requires_grad=0, device=cuda:0) = prim::TensorExprGroup_0(%88, %89)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%d.4)
[DUMP profiling_graph_executor_impl.cpp:499]     block1():
[DUMP profiling_graph_executor_impl.cpp:499]       %97 : Function = prim::Constant[name="fallback_function", fallback=1]()
[DUMP profiling_graph_executor_impl.cpp:499]       %98 : (Tensor) = prim::CallFunction(%97, %11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]       %99 : Tensor = prim::TupleUnpack(%98)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%99)
[DUMP profiling_graph_executor_impl.cpp:499]   %85 : int[] = aten::size(%91)
[DUMP profiling_graph_executor_impl.cpp:499]   %86 : int[] = prim::BroadcastSizes(%59, %62)
[DUMP profiling_graph_executor_impl.cpp:499]   %61 : int[]? = aten::_size_if_not_equal(%59, %86) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %64 : int[]? = aten::_size_if_not_equal(%62, %86) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   %68 : int[]? = aten::_size_if_not_equal(%86, %85) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %71 : int[]? = aten::_size_if_not_equal(%62, %85) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   return (%91, %61, %64, %68, %71)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::TensorExprGroup_0 = graph(%1 : Double(1:1, requires_grad=0, device=cuda:0),
[DUMP profiling_graph_executor_impl.cpp:499]       %4 : Double(1:1, requires_grad=0, device=cuda:0)):
[DUMP profiling_graph_executor_impl.cpp:499]   %5 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %c.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%4, %1, %5) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2872:16
[DUMP profiling_graph_executor_impl.cpp:499]   %2 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %d.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%c.3, %1, %2) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2873:16
[DUMP profiling_graph_executor_impl.cpp:499]   return (%d.3)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45080

Reviewed By: bertmaher

Differential Revision: D23856410

Pulled By: Krovatkin

fbshipit-source-id: 2956286eb03a4894a5baa151c35e6092466322b1
2020-09-28 10:45:56 -07:00
generatedunixname89002005325676
7818a214c5 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D23959094

fbshipit-source-id: 6caa046d263114bff38a38d756099aac357e4f04
2020-09-28 05:08:46 -07:00
Negin Raoof
95a97e51b5 [ONNX] Improve scripting inplace indexing ops (#44351)
Summary:
Fix a couple of issues with scripting inplace indexing in prepare_inplace_ops_for_onnx pass.
1- Tracing index copy (such as cases like x[1:3] = data) already applies broadcasting on the rhs if needed. The broadcasting node (aten::expand) was missing in scripting cases.

2- Inplace indexing with ellipsis (aten::copy_) is replaced with aten::index_put and then handled with slice+select in this pass.
Support for negative indices for this op has been added. (A usage sketch follows below.)

Shape inference is also enabled for scripting tests using the new JIT API.
A few more tests are enabled for scripting.
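A minimal sketch of the two patterns (illustrative only; the module and shapes are made up, and whether each form scripts cleanly is exactly what this PR addresses):

```python
import torch

class InplaceIndex(torch.nn.Module):
    def forward(self, x: torch.Tensor, data: torch.Tensor) -> torch.Tensor:
        x[1:3] = data     # index copy; the rhs may need an aten::expand to broadcast
        x[..., -1] = 0.0  # ellipsis + negative index, lowered through aten::index_put
        return x

scripted = torch.jit.script(InplaceIndex())
# e.g. x of shape (4, 5); data of shape (5,) broadcasts to x[1:3]'s shape (2, 5)
out = scripted(torch.zeros(4, 5), torch.ones(5))
```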

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44351

Reviewed By: ezyang

Differential Revision: D23880267

Pulled By: bzinodev

fbshipit-source-id: 78b33444633eb7ae0fbabc7415e3b16001f5207f
2020-09-28 00:32:36 -07:00
Zino Benaissa
13f76f2be4 Fix preserve submodule attribute in freezing (#45143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45143

This PR prevents freezing from cleaning up a submodule when the user
requests that the submodule be preserved.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23844969

Pulled By: bzinodev

fbshipit-source-id: 80e6db3fc12460d62e634ea0336ae2a3551c2151
2020-09-28 00:05:38 -07:00
shubhambhokare1
5b839bca78 [ONNX] Optimize export_onnx api to reduce string and model proto exchange (#44332)
Summary:
Optimize the export_onnx API to reduce string and model proto exchanges in export.cpp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44332

Reviewed By: bwasti, eellison

Differential Revision: D23880129

Pulled By: bzinodev

fbshipit-source-id: 1d216d8f710f356cbba2334fb21ea15a89dd16fa
2020-09-27 16:29:08 -07:00
Natalia Gimelshein
78caa028b6 Revert D23009117: [Distributed] DeleteKey API for c10d TCP Store
Test Plan: revert-hammer

Differential Revision:
D23009117 (addf94f2d6)

Original commit changeset: 1a0d95b43d79

fbshipit-source-id: ad3fe5501267e1a0a7bf23410766f1e92b34b24d
2020-09-27 12:04:42 -07:00
Rohan Varma
23dfca8351 Support record_shapes in RPC profiling (#44419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44419

Closes https://github.com/pytorch/pytorch/issues/39969

This PR adds support for propagation of input shapes over the wire when the profiler is invoked with `record_shapes=True` over RPC. Previously, we did not respect this argument.

This is done by saving the shapes as an IValue list and recovering them as the expected type (`std::vector<std::vector<int>>` on the client). A test is added to ensure that remote ops have the same `input_shapes` as if the op were run locally.
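As a rough usage sketch (assuming `init_rpc` has already been called on the participating workers; worker names are placeholders):

```python
import torch
import torch.distributed.rpc as rpc

with torch.autograd.profiler.profile(record_shapes=True) as prof:
    rpc.rpc_sync("worker1", torch.add,
                 args=(torch.ones(2, 3), torch.ones(2, 3)))

# Remote events now report input_shapes such as [[2, 3], [2, 3]],
# matching what a local run of the op would record.
print(prof.key_averages(group_by_input_shape=True).table())
```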
ghstack-source-id: 112977899

Reviewed By: pritamdamania87

Differential Revision: D23591274

fbshipit-source-id: 7cf3b2e8df26935ead9d70e534fc2c872ccd6958
2020-09-26 13:26:44 -07:00
Rohan Varma
19dda7c68a Fallback to CPU when remote end does not have CUDA for profiling (#44967)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44967

When enabling the profiler on the server, the server may be a different
machine that does not have CUDA while the caller does. Previously we would
crash in this case; now we fall back to CPU profiling and log a warning.
ghstack-source-id: 112977906

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D23790729

fbshipit-source-id: dc6eba172b7e666842d54553f52a6b9d5f0a5362
2020-09-26 13:12:55 -07:00
Omkar Salpekar
addf94f2d6 [Distributed] DeleteKey API for c10d TCP Store (#43963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43963

Added a DeleteKey API for the TCP Store
ghstack-source-id: 112939762

Test Plan:
Modified the existing get/set test to use delete. Verified that the
correct keys were deleted and that the numKeys API returned the right values

Reviewed By: jiayisuse

Differential Revision: D23009117

fbshipit-source-id: 1a0d95b43d79e665a69b2befbaa059b2b50a1f66
2020-09-26 00:54:21 -07:00
Omkar Salpekar
304e1d1e19 [Distributed] getNumKeys API to c10d TCPStore (#43962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43962

TCPStore needs a getNumKeys API for our logging needs.
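A hedged sketch of how the key APIs in this stack might be exercised from Python, assuming the C++ methods are exposed as `num_keys()`/`delete_key()` (the tests here are C++-only, so treat the Python spellings as an assumption):

```python
from datetime import timedelta
import torch.distributed as dist

# Single-process store for illustration; host and port are placeholders.
store = dist.TCPStore("127.0.0.1", 29500, 1, True, timedelta(seconds=30))
store.set("first", "1")
store.set("second", "2")
print(store.num_keys())   # counts stored keys (plus any internal init key)
store.delete_key("first")
```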
ghstack-source-id: 112939761

Test Plan: Adding tests to C++ Store Tests

Reviewed By: pritamdamania87

Differential Revision: D22985085

fbshipit-source-id: 8a0d286fbd6fd314dcc997bae3aad0e62b51af83
2020-09-26 00:49:00 -07:00
Zafar
958c208666 [quant] conv_transpose graph patterns (#45078)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45078

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23821580

Pulled By: z-a-f

fbshipit-source-id: 813a4ef1bbc429720765d61791fe754b6678a334
2020-09-25 18:14:29 -07:00
Nikita Shulga
8ab2ad306d Enable torch.cuda.nccl typechecking (#45344)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45336

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45344

Reviewed By: walterddr

Differential Revision: D23935306

Pulled By: malfet

fbshipit-source-id: dd09d4f8ff7a327131764487158675027a13bf69
2020-09-25 17:02:47 -07:00
Shen Li
5211fb97ac Remove device maps from TensorPipe for v1.7 release (#45353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45353

Temporarily removing this feature, will add this back after branch cut.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23939865

Pulled By: mrshenli

fbshipit-source-id: 7dceaffea6b9a16512b5ba6036da73e7f8f83a8e
2020-09-25 16:51:45 -07:00
Brian Hirsh
439930c81b adding a beta parameter to the smooth_l1 loss fn (#44433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44433

Not entirely sure why, but changing the type of beta from `float` to `double` in autocast_mode.cpp and FunctionsManual.h fixes my compiler errors, failing instead at link time

fixing some type errors, updated fn signature in a few more files

removing my usage of Scalar, making beta a double everywhere instead

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23636720

Pulled By: bdhirsh

fbshipit-source-id: caea2a1f8dd72b3b5fd1d72dd886b2fcd690af6d
2020-09-25 16:36:28 -07:00
Rohan Varma
27ab9bc0f9 [RPC profiling] Extend RPC profiling to support async function execution over RPC. (#44664)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44664

Closes https://github.com/pytorch/pytorch/issues/39971. This PR adds support for functions decorated with `rpc.functions.async_execution` to be profiled over RPC, as builtins, jit functions, and blocking python UDFs already can be. The goal is to provide complete feature support in terms of RPC profiling across the various types of functions users can run.

To enable this, the PR below enables calling `disableProfiler()` safely from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only the blocking `processRPC` call, as was done previously). Since by the time the future completes the async function has been kicked off and the future corresponding to it has completed, we are able to capture any RPCs the function called as well as the actual work done on the other node.

For example, if the following async function is ran on a server over RPC:

```
def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)

@rpc.functions.async_execution
def slow_async_add(to, x, y):
    return rpc.rpc_async(to, slow_add, args=(x, y))
```

we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output:

```
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
Name                                                                                                                     Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
rpc_async#slow_async_add(worker1 -> worker2)                                                                             0.00%             0.000us         0            1.012s     1.012s        1                1
aten::empty                                                                                                              7.02%             11.519us        7.02%        11.519us   11.519us      1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                           0.00%             0.000us         0            1.006s     1.006s        1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                      7.21%             11.843us        7.21%        11.843us   11.843us      1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add      71.94%            118.107us       85.77%       140.802us  140.802us     1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty    13.82%            22.695us        13.82%       22.695us   22.695us      1                3
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
Self CPU time total: 164.164us
```

This PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter `request_callback` code.
ghstack-source-id: 112868470

Test Plan:
```
rvarm1@devbig978:fbcode  (52dd34f6)$ buck test mode/no-gpu mode/dev-nosan //caffe2/test/distributed/rpc:process_group_agent -- test_rpc_profiling_async_function --print-passing-details --stress-runs 1
```

Reviewed By: mrshenli

Differential Revision: D23638387

fbshipit-source-id: eedb6d48173a4ecd41d70a9c64048920bd4807c4
2020-09-25 13:19:26 -07:00
Iurii Zdebskyi
d5748d9a1a Enable binary ops with Scalar Lists for foreach APIs (#45298)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45298

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23931986

Pulled By: izdeby

fbshipit-source-id: 281267cd6f90d57a169af89f9f10b0f4fcab47e3
2020-09-25 12:58:34 -07:00
gunandrose4u
f07ac6a004 Fix Windows build failure after DDP PR merged (#45335)
Summary:
Fixes #{issue number}
This is a resubmit of PR https://github.com/pytorch/pytorch/issues/42897, together with a fix for the Windows build issue introduced by PR https://github.com/pytorch/pytorch/issues/44344.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45335

Reviewed By: zou3519

Differential Revision: D23931471

Pulled By: mrshenli

fbshipit-source-id: f49b5a114944c1450b32934b3292170be064f494
2020-09-25 12:37:50 -07:00
Bram Wasti
e5f6e5af13 Add Deep and wide to test and flatten/transpose for good measure (#44129)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44129

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604302

Pulled By: bwasti

fbshipit-source-id: 5787f6f32a80b22b1b712c4116f70370dad98f12
2020-09-25 11:05:41 -07:00
Bram Wasti
d1a11618f5 [static runtime] Add _out variants and reuse memory (#44128)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44128

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604304

Pulled By: bwasti

fbshipit-source-id: 06a23cb75700a0fc733069071843b7b498e7b9e9
2020-09-25 11:03:06 -07:00
Nick Gibson
d1d9017a66 [NNC] fix Half conversion of immediates in Cuda backend (#45213)
Summary:
The Cuda HalfChecker casts up all loads and stores of Half to Float, so we do math in Float on the device. It didn't cast up HalfImmediate (i.e. constants), so those could introduce mixed-size ops. The fix is to cast them up as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45213

Reviewed By: ezyang

Differential Revision: D23885287

Pulled By: nickgg

fbshipit-source-id: 912991d85cc06ebb282625cfa5080d7525c8eba9
2020-09-25 10:53:36 -07:00
Supriya Rao
a117d968f6 [quant][graph] Remove redundant aten::wait calls in the graph (#45257)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45257

Currently we inline fork-wait calls when we insert observers for quantization.
In the case where fork and wait are in different subgraphs, inlining the fork-wait calls
only gets rid of the fork. This leaves the aten::wait call in the graph with a torch.Tensor as input,
which is currently not supported.
To avoid this, we check in the cleanup phase that the input to every wait call
in the graph is of type Future[Tensor].

Test Plan:
python test/test_quantization.py TestQuantizeJitPasses.test_quantize_fork_wait

Imported from OSS

Reviewed By: qizzzh

Differential Revision: D23895412

fbshipit-source-id: 3c58c6be7d7e7904eb6684085832ac21f827a399
2020-09-25 09:52:52 -07:00
Sebastian Messmer
78fcde9c50 Trace scattered tensor options arguments (#44071)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44071

Previously, tracing re-gathered ScalarType, Layout, Device, bool into a TensorOptions object and called `tracer::addInput()` on the gathered TensorOptions argument. `tracer::addInput()` then scattered them again and added the individual scattered arguments to the traced graph. This PR avoids the extraneous gathering and re-scattering step and calls `tracer::addInput()` on the individual arguments directly. This avoids the perf hit of an unnecessary gathering step.

This applies to both c10-full and non-c10-full ops. In the case of c10-full ops, the tracing kernel takes scattered arguments and we can directly pass them to `tracer::addInput()`. In the case of non-c10-full ops, the kernel takes a `TensorOptions` argument but we still call `tracer::addInput()` on the scattered arguments.
ghstack-source-id: 112825793

Test Plan:
waitforsandcastle

vs master: https://www.internalfb.com/intern/fblearner/details/216129483/

vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170069/

Reviewed By: ezyang

Differential Revision: D23486638

fbshipit-source-id: e0b53e6673cef8d7f94158e718301eee261e5d22
2020-09-25 09:04:06 -07:00
Sebastian Messmer
2ac7de7d53 Remove hacky_wrapper from BackendSelect kernels (#44062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44062

Previously, BackendSelect kernels were still written in the legacy way, i.e. they took one TensorOptions argument instead of scattered dtype, layout, device, pin_memory, and they used hacky_wrapper to be callable. This caused a re-wrapping step: calling into a BackendSelect kernel required taking the individual scattered arguments and packing them into a TensorOptions, and the kernel itself then gathered them again for redispatch.

Now with this PR, BackendSelect kernels are written in the new way and no hacky_wrapper or rewrapping is needed for them.
ghstack-source-id: 112825789

Test Plan:
vs master: https://www.internalfb.com/intern/fblearner/details/216117032/

vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170194/

Reviewed By: ezyang

Differential Revision: D23484192

fbshipit-source-id: e8fb49c4692404b6b775d18548b990c4cdddbada
2020-09-25 09:04:03 -07:00
Brian Hirsh
2739a7c599 Byte-for-byte compatibility fixes in codegen (#44879)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44879

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23825163

Pulled By: bdhirsh

fbshipit-source-id: 4d8028274f82c401b393c4fe1b9e32de3f4909c6
2020-09-25 08:06:50 -07:00
kshitij12345
00e704e757 [fix] torch.repeat : dim-0 backward (#45212)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45201

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45212

Reviewed By: mrshenli

Differential Revision: D23905545

Pulled By: albanD

fbshipit-source-id: c5bf9cf481c8cf3ccc1fdbfb364006b29f67dc9f
2020-09-25 07:53:00 -07:00
Alex Suhan
76ee58e2ec [TensorExpr] Move inner loops vectorization logic to its own method (#45287)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45287

Test Plan: CI, build

Reviewed By: gmagogsfm

Differential Revision: D23913432

Pulled By: asuhan

fbshipit-source-id: 3bf8fe09753f349e3c857863a43d2b1fca5101c1
2020-09-25 02:29:36 -07:00
Xiong Wei
241afc9188 Migrate addr from the TH to Aten (CPU) (#44364)
Summary:
Related https://github.com/pytorch/pytorch/issues/24507
Fixes https://github.com/pytorch/pytorch/issues/24666

This PR modernizes the CPU implementation of the vector outer product.
The existing TH implementation of `torch.addr` is migrated to `aten`, as `torch.ger` uses the `addr` functions to calculate the outer product.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44364

Reviewed By: ezyang

Differential Revision: D23866733

Pulled By: mruberry

fbshipit-source-id: 5159ea22f0e3c991123fe7c19cc9beb6ad00301e
2020-09-25 01:18:09 -07:00
jjsjann123
99e0a87bbb [nvFuser] Latency improvements for pointwise + reduction fusion (#45218)
Summary:
A lot of changes are in this update, some highlights:

- Added Doxygen config file
- Split the fusion IR (higher level TE like IR) from kernel IR (lower level CUDA like IR)
- Improved latency with dynamic shape handling for the fusion logic
- Prevent recompilation for pointwise + reduction fusions when not needed
- Improvements to inner dimension reduction performance
- Added input -> kernel + kernel launch parameters cache, added eviction policy
- Added reduction fusions with multiple outputs (still single reduction stage)
- Fixed code generation bugs for symbolic tiled GEMM example
- Added thread predicates to prevent shared memory from being loaded multiple times
- Improved syncthreads placement with shared memory and removed a read-before-write race
- Fixes to FP16 reduction fusions where output would come back as FP32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45218

Reviewed By: ezyang

Differential Revision: D23905183

Pulled By: soumith

fbshipit-source-id: 12f5ad4cbe03e9a25043bccb89e372f8579e2a79
2020-09-24 23:17:20 -07:00
Mike Ruberry
103fa3894a Revert D23841786: [pytorch][PR] Enable distributed package on windows, Gloo backend supported only
Test Plan: revert-hammer

Differential Revision:
D23841786 (0122299f9b)

Original commit changeset: 334ba1ed73ef

fbshipit-source-id: ec95432f9957df56a5a04e52661f5db920b7f57f
2020-09-24 22:44:33 -07:00
gunandrose4u
0122299f9b Enable distributed package on windows, Gloo backend supported only (#42897)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42095

For test case part will be committed to this PR later

mrshenli, please help to review

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42897

Reviewed By: osalpekar

Differential Revision: D23841786

Pulled By: mrshenli

fbshipit-source-id: 334ba1ed73eff2f668857390fc32d1bc7f08e5f3
2020-09-24 21:13:55 -07:00
Yanli Zhao
c6500bcf14 [reland] Make grad point to bucket buffer in DDP to save memory usage (#44344)
Summary:
[test all]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44344

reland #41954

Add one argument to the DDP API to enable/disable letting grads point to views. When it is disabled, the behavior is the same as DDP today; when it is enabled, both variable.grad() and the grad in the dist autograd context point to the bucket buffer in DDP to save memory. (See the sketch below.)
In this case, grad will be a view of the bucket buffer tensors; to make this compatible with optimizer.zero_grad(), we
made changes in #41283.

Also note that we cannot make variable.grad() point to the bucket buffer at construction time, because we want to
keep grad undefined for unused parameters.
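A minimal sketch of the new switch (the keyword name `gradient_as_bucket_view` matches the public DDP API this stack landed; treat the exact spelling here as an assumption):

```python
# Assumes a process group is already initialized and one GPU per process.
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(nn.Linear(10, 10).cuda(),
            device_ids=[0],
            gradient_as_bucket_view=True)  # grads become views into the buckets
```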
ghstack-source-id: 112845787

Test Plan:
1. When grad_is_view=false:
a. roberta_base, peak memory usage 8250MB, p50 per iteration latency 0.923second, https://www.internalfb.com/intern/fblearner/details/218029699/?notif_channel=cli
b. resnet, peak memory usage 3089MB, p50 per iteration latency 0.120second, https://www.internalfb.com/intern/fblearner/details/218029035/?notif_channel=cli
c. accuracy benchmark, distributed=false, .accuracy 40.914535522461, .loss: 1.6370717287064; distributed=true, .accuracy: 39.966053009033, .loss: 1.6849111318588
https://www.internalfb.com/intern/fblearner/details/218035688/?notif_channel=cli
d. classy vision uru production flow, https://www.internalfb.com/intern/fblearner/details/219065811/?notif_channel=cli
e. pytext flow, https://www.internalfb.com/intern/fblearner/details/219137458/?notif_channel=cli

2. When grad_is_view=true:
a. roberta_base, peak memory usage 7183MB, p50 per iteration latency 0.908second, https://www.internalfb.com/intern/fblearner/details/217882539?tab=operator_details
b. resnet, peak memory usage 2988 MB, p50 per iteration latency 0.119second, https://www.internalfb.com/intern/fblearner/details/218028479/?notif_channel=cli
c. accuracy benchmark, distributed=false, .accuracy 41.713260650635, .loss: 1.69939661026; distributed=true, .accuracy: 39.966053009033, .loss: 1.6849111318588, https://www.internalfb.com/intern/fblearner/details/218037058/?notif_channel=cli
d. classy vision uru production flow, expected, can not work well with apex.amp https://www.internalfb.com/intern/fblearner/details/219205218/?notif_channel=cli
e. pytext flow, detach_() related error, expected, as pytext zero_grad depends on apex repo where detach_() is called. also seeing the warning in finalize_bucket_dense due to tied weights, which is expected. https://www.internalfb.com/intern/fblearner/details/219150229/?notif_channel=cli

Reviewed By: mrshenli

Differential Revision: D23588186

fbshipit-source-id: f724d325b954ef6f06ede31759bf01dd29a6f5e5
2020-09-24 20:54:51 -07:00
Linbin Yu
0f2c648c97 log metadata when model loading failed (#44430)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44430

log metadata even when model loading fails

Test Plan: {F331550976}

Reviewed By: husthyc

Differential Revision: D23577711

fbshipit-source-id: 0504e75625f377269f1e5df0f1ebe34b8e564c4b
2020-09-24 20:09:22 -07:00
Himangshu
92ebb04f92 added check for NumberType (#44375)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44107

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44375

Reviewed By: mrshenli

Differential Revision: D23906728

Pulled By: eellison

fbshipit-source-id: 3b534e5dd3af1f5e43a7314953e64117cbe8ffe4
2020-09-24 16:26:59 -07:00
Elias Ellison
5dd288eb06 [JIT] Regularize tensorexpr fuser strategy with other fusers (#44972)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44972

Previously, our fusion strategy would be:
- start at the end of the block, find a fusable node
- iteratively try to merge inputs into the fusion group, sorted topologically

This strategy works pretty well, but it can miss fusion groups. See my attached test case for an example where we wouldn't find all possible fusion groups. bertmaher found an example of a missed fusion group in one of our rnn examples (jit_premul) that caused a regression relative to the legacy fuser.

Here, I'm updating our fusion strategy to be the same as our other fusion passes - create_autodiff_subgraphs, and graph_fuser.cpp.

The basic strategy is:
- iterate until you find a fusible node
- try to merge the node's inputs; whenever a successful merge occurs, restart at the beginning of the node's inputs
- after you've exhausted a node, continue searching the block for fusion opportunities from that node
- continue doing this on the block until we go through an iteration without any successful merges

Since we create the fusion groups once, and only re-specialize within the fusion groups, we should be running this very infrequently (it only re-triggers when we fail undefinedness specializations). Also, because it's the same algorithm as the existing fusers, it is unlikely to cause a regression.

Test Plan: Imported from OSS

Reviewed By: Krovatkin, robieta

Differential Revision: D23821581

Pulled By: eellison

fbshipit-source-id: e513d1ef719120dadb0bfafc7a14f4254cd806ee
2020-09-24 15:34:21 -07:00
Elias Ellison
0137e3641d Refactor subgraph merging (#44238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44238

Refactor create_autodiff_subgraphs to use the same logic for updating output aliasing properties as the tensorexpr fuser, and factor that logic out into a common function in subgraph utils.

Test Plan: Imported from OSS

Reviewed By: Krovatkin, robieta

Differential Revision: D23871565

Pulled By: eellison

fbshipit-source-id: 72df253b16baf8e4aabf3d68b103b29e6a54d44c
2020-09-24 15:29:34 -07:00
Mikhail Zolotukhin
71e6ce6616 [JIT] Specialize AutogradZero: merge AutogradAnyNonZero and Not(AutogradAnyNonZero) checks into one. (#44987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44987

This PR introduces new `prim::AutogradAllZero` and
`prim::AutogradAllNonZero` ops that are used for a batch check for
multiple tensors. The specialize-autogradzero pass now generates one
check for all expected-to-be-undefined tensors, one check for all
expected-to-be-defined tensors, and a bunch of checks for size
parameters passed to `grad_sum_to_size` (this probably could be cleaned
up somehow as well in future).

An example of what we generated before this change:
```
%1626 : bool = prim::AutogradAnyNonZero(%0)
%1627 : bool = prim::AutogradAnyNonZero(%2)
%1628 : bool = aten::__not__(%1627)
%1629 : bool = prim::AutogradAnyNonZero(%3)
%1630 : bool = aten::__not__(%1629)
%1631 : bool = prim::AutogradAnyNonZero(%4)
%1632 : bool = aten::__not__(%1631)
%1633 : bool = prim::AutogradAnyNonZero(%5)
%1634 : bool = aten::__not__(%1633)
%1635 : bool = prim::AutogradAnyNonZero(%6)
%1636 : bool = aten::__not__(%1635)
%1637 : bool = prim::AutogradAnyNonZero(%7)
%1638 : bool = aten::__not__(%1637)
%1639 : bool = prim::AutogradAnyNonZero(%8)
%1640 : bool = aten::__not__(%1639)
%1641 : bool = prim::AutogradAnyNonZero(%9)
%1642 : bool = aten::__not__(%1641)
%1643 : bool = prim::AutogradAnyNonZero(%10)
%1644 : bool = aten::__not__(%1643)
%1645 : bool = prim::AutogradAnyNonZero(%11)
%1646 : bool = aten::__not__(%1645)
%1647 : bool = prim::AutogradAnyNonZero(%12)
%1648 : bool = aten::__not__(%1647)
%1649 : bool = prim::AutogradAnyNonZero(%13)
%1650 : bool = aten::__not__(%1649)
%1651 : bool = prim::AutogradAnyNonZero(%14)
%1652 : bool = aten::__not__(%1651)
%1653 : bool = prim::AutogradAnyNonZero(%15)
%1654 : bool = aten::__not__(%1653)
%1655 : bool = prim::AutogradAnyNonZero(%16)
%1656 : bool = aten::__not__(%1655)
%1657 : bool = prim::AutogradAnyNonZero(%17)
%1658 : bool = prim::AutogradAnyNonZero(%18)
%1659 : bool = prim::AutogradAnyNonZero(%19)
%1660 : bool = prim::AutogradAnyNonZero(%20)
%1661 : bool = aten::__is__(%self_size.16, %1625)
%1662 : bool = aten::__is__(%other_size.16, %1625)
%1663 : bool = aten::__is__(%self_size.14, %1625)
%1664 : bool = aten::__is__(%self_size.12, %1625)
%1665 : bool = prim::AutogradAnyNonZero(%ingate.7)
%1666 : bool = prim::AutogradAnyNonZero(%forgetgate.7)
%1667 : bool = prim::AutogradAnyNonZero(%cellgate.7)
%1668 : bool = prim::AutogradAnyNonZero(%30)
%1669 : bool = prim::AutogradAnyNonZero(%31)
%1670 : bool = aten::__is__(%self_size.10, %1625)
%1671 : bool = aten::__is__(%other_size.10, %1625)
%1672 : bool = prim::AutogradAnyNonZero(%34)
%1673 : bool = prim::AutogradAnyNonZero(%35)
%1674 : bool = aten::__is__(%self_size.8, %1625)
%1675 : bool = aten::__is__(%other_size.8, %1625)
%1676 : bool = aten::__is__(%self_size.6, %1625)
%1677 : bool = aten::__is__(%other_size.6, %1625)
%1678 : bool = prim::AutogradAnyNonZero(%outgate.7)
%1679 : bool = prim::AutogradAnyNonZero(%41)
%1680 : bool = prim::AutogradAnyNonZero(%42)
%1681 : bool = prim::AutogradAnyNonZero(%43)
%1682 : bool = aten::__is__(%self_size.4, %1625)
%1683 : bool = aten::__is__(%other_size.4, %1625)
%1684 : bool[] = prim::ListConstruct(%1626, %1628, %1630, %1632, %1634, %1636, %1638, %1640, %1642, %1644, %1646, %1648, %1650, %1652, %1654, %1656, %1657, %1658, %1659, %1660, %1661, %1662, %1663, %1664, %1665, %1666, %1667, %1668, %1669, %1670, %1671, %1672, %1673, %1674, %1675, %1676, %1677, %1678, %1679, %1680, %1681, %1682, %1683)
%1685 : bool = aten::all(%1684)
```

Same example after this change:
```
%1625 : None = prim::Constant()
%1626 : bool = aten::__is__(%self_size.16, %1625)
%1627 : bool = aten::__is__(%other_size.16, %1625)
%1628 : bool = aten::__is__(%self_size.14, %1625)
%1629 : bool = aten::__is__(%self_size.12, %1625)
%1630 : bool = aten::__is__(%self_size.10, %1625)
%1631 : bool = aten::__is__(%other_size.10, %1625)
%1632 : bool = aten::__is__(%self_size.8, %1625)
%1633 : bool = aten::__is__(%other_size.8, %1625)
%1634 : bool = aten::__is__(%self_size.6, %1625)
%1635 : bool = aten::__is__(%other_size.6, %1625)
%1636 : bool = aten::__is__(%self_size.4, %1625)
%1637 : bool = aten::__is__(%other_size.4, %1625)
%1638 : bool = prim::AutogradAllNonZero(%0, %17, %18, %19, %20, %ingate.7, %forgetgate.7, %cellgate.7, %30, %31, %34, %35, %outgate.7, %41, %42, %43)
%1639 : bool = prim::AutogradAllZero(%2, %3, %4, %5, %6, %7, %8, %9, %10, %11, %12, %13, %14, %15, %16)
%1640 : bool[] = prim::ListConstruct(%1626, %1627, %1628, %1629, %1630, %1631, %1632, %1633, %1634, %1635, %1636, %1637, %1638, %1639)
%1641 : bool = aten::all(%1640)
```

My performance measurements showed some changes, but I don't really
trust them and think they are probably just noise. Below are
tables with min-aggregation over 10 runs:

FastRNN models:

| name                                             | base time (s) |   diff time (s) |   % change |
| :---                                             |          ---: |            ---: |       ---: |
| lstm[aten]:bwd                                   |     30.059927 |       29.834089 |      -0.8% |
| lstm[aten]:fwd                                   |     25.673708 |       25.700039 |       0.1% |
| lstm[cudnn]:bwd                                  |     17.866232 |       17.893120 |       0.2% |
| lstm[cudnn]:fwd                                  |     11.418444 |       11.408514 |      -0.1% |
| lstm[jit]:bwd                                    |     27.127205 |       27.141029 |       0.1% |
| lstm[jit]:fwd                                    |     17.018047 |       16.975451 |      -0.3% |
| lstm[jit_multilayer]:bwd                         |     27.502396 |       27.365149 |      -0.5% |
| lstm[jit_multilayer]:fwd                         |     16.918591 |       16.917767 |      -0.0% |
| lstm[jit_premul]:bwd                             |     22.281199 |       22.215082 |      -0.3% |
| lstm[jit_premul]:fwd                             |     14.848708 |       14.896231 |       0.3% |
| lstm[jit_premul_bias]:bwd                        |     20.761206 |       21.170969 |       2.0% |
| lstm[jit_premul_bias]:fwd                        |     15.013515 |       15.037978 |       0.2% |
| lstm[jit_simple]:bwd                             |     26.715771 |       26.697786 |      -0.1% |
| lstm[jit_simple]:fwd                             |     16.675898 |       16.545893 |      -0.8% |
| lstm[py]:bwd                                     |     56.327065 |       54.731030 |      -2.8% |
| lstm[py]:fwd                                     |     39.876324 |       39.230572 |      -1.6% |

Torch Hub models:

| name                                             | base time (s) |   diff time (s) |   % change |
| :---                                             |          ---: |            ---: |       ---: |
| test_eval[BERT_pytorch-cuda-jit]                 |      0.111706 |        0.106604 |      -4.6% |
| test_eval[LearningToPaint-cuda-jit]              |      0.002841 |        0.002801 |      -1.4% |
| test_eval[Super_SloMo-cuda-jit]                  |      0.384869 |        0.384737 |      -0.0% |
| test_eval[attension_is_all_you_nee...-cuda-jit]  |      0.123857 |        0.123923 |       0.1% |
| test_eval[demucs-cuda-jit]                       |      0.077270 |        0.076878 |      -0.5% |
| test_eval[fastNLP-cuda-jit]                      |      0.000255 |        0.000249 |      -2.3% |
| test_eval[moco-cuda-jit]                         |      0.426472 |        0.427380 |       0.2% |
| test_eval[pytorch_CycleGAN_and_pix...-cuda-jit]  |      0.026483 |        0.026423 |      -0.2% |
| test_eval[pytorch_mobilenet_v3-cuda-jit]         |      0.036202 |        0.035853 |      -1.0% |
| test_eval[pytorch_struct-cuda-jit]               |      0.001439 |        0.001495 |       3.9% |
| test_train[BERT_pytorch-cuda-jit]                |      0.247236 |        0.247188 |      -0.0% |
| test_train[Background_Matting-cuda-jit]          |      3.536659 |        3.581864 |       1.3% |
| test_train[LearningToPaint-cuda-jit]             |      0.015341 |        0.015331 |      -0.1% |
| test_train[Super_SloMo-cuda-jit]                 |      1.018626 |        1.019098 |       0.0% |
| test_train[attension_is_all_you_nee...-cuda-jit] |      0.446314 |        0.444893 |      -0.3% |
| test_train[demucs-cuda-jit]                      |      0.169647 |        0.169846 |       0.1% |
| test_train[fastNLP-cuda-jit]                     |      0.001990 |        0.001978 |      -0.6% |
| test_train[moco-cuda-jit]                        |      0.855323 |        0.856974 |       0.2% |
| test_train[pytorch_mobilenet_v3-cuda-jit]        |      0.497723 |        0.485416 |      -2.5% |
| test_train[pytorch_struct-cuda-jit]              |      0.309692 |        0.308792 |      -0.3% |

Differential Revision: D23794659

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: 859b68868ef839c5c6cbc7021879ee22d3144ea8
2020-09-24 14:31:49 -07:00
Xinyu Li
26001a2334 Revert D23753711: [pytorch][PR] Add foreach APIs for binary ops with ScalarList
Test Plan: revert-hammer

Differential Revision:
D23753711 (71d1b5b0e2)

Original commit changeset: bf3e8c54bc07

fbshipit-source-id: 192692e0d3fff4cade9983db0a1760fedfc9674c
2020-09-24 11:55:49 -07:00
Raziel Alvarez Guevara
2b38c09f69 Moves prim ops from C10 back to JIT (#45144)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45144

Moves prim ops from C10 back to JIT.

These were originally moved to C10 from JIT in D19237648 (f362cd510d)
ghstack-source-id: 112775781

Test Plan:
buck test //caffe2/test/cpp/jit:jit

https://pxl.cl/1l22N

buck test adsatlas/gavel/lib/ata_processor/tests:ata_processor_test

https://pxl.cl/1lBxD

Reviewed By: iseeyuan

Differential Revision: D23697598

fbshipit-source-id: 36d1eb8c346e9b161ba6af537a218440a9bafd27
2020-09-24 09:44:20 -07:00
iurii zdebskyi
71d1b5b0e2 Add foreach APIs for binary ops with ScalarList (#44743)
Summary:
In this PR:
1) Added binary operations with ScalarLists (see the sketch below).
2) Fixed _foreach_div(...) bug in native_functions
3) Covered all possible cases with scalars and scalar lists in tests
4) [minor] fixed bug in native_functions by adding "use_c10_dispatcher: full" to all _foreach functions

tested via unit tests
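A small usage sketch of the new overloads (the `_foreach_*` functions are private APIs; the exact overload shape shown here is an assumption):

```python
import torch

tensors = [torch.ones(3), torch.full((3,), 2.0)]
scalars = [10.0, 100.0]

# One scalar per tensor, applied in a single fused call.
out = torch._foreach_add(tensors, scalars)
# out: [tensor([11., 11., 11.]), tensor([102., 102., 102.])]
```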

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44743

Reviewed By: bwasti, malfet

Differential Revision: D23753711

Pulled By: izdeby

fbshipit-source-id: bf3e8c54bc07867e8f6e82b5d3d35ff8e99b5a0a
2020-09-24 08:30:42 -07:00
Alex Suhan
3dd0e362db [TensorExpr] Fix min and max for integral inputs in CUDA backend (#44984)
Summary:
For integral types, isnan is meaningless. Provide specializations for
maximum and minimum which don't call it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44984

Test Plan: python test/test_jit_fuser_te.py -k TestTEFuser.test_minmax_int_ops

Reviewed By: ezyang

Differential Revision: D23885259

Pulled By: asuhan

fbshipit-source-id: 2e6da2c43c0ed18f0b648a2383d510894c574437
2020-09-23 23:19:12 -07:00
Peter Bell
6a2e9eb51c torch.fft: Multi-dimensional transforms (#44550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44550

Part of the `torch.fft` work (gh-42175).
This adds n-dimensional transforms: `fftn`, `ifftn`, `rfftn` and `irfftn`.

This is aiming for correctness first, with the implementation on top of the existing `_fft_with_size` restrictions. I plan to follow up later with a more efficient rewrite that makes `_fft_with_size` work with arbitrary numbers of dimensions.
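A quick round-trip sanity check using the new functions:

```python
import torch
import torch.fft  # the torch.fft namespace must be imported explicitly

x = torch.randn(4, 5, 6)
X = torch.fft.fftn(x)
assert torch.allclose(torch.fft.ifftn(X).real, x, atol=1e-6)

# The real-input transform halves the last dimension: (4, 5, 6) -> (4, 5, 4)
print(torch.fft.rfftn(x).shape)
```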

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23846032

Pulled By: mruberry

fbshipit-source-id: e6950aa8be438ec5cb95fb10bd7b8bc9ffb7d824
2020-09-23 22:09:58 -07:00
Supriya Rao
60665ace17 [quant] Add optimized approach to calculate qparams for qembedding_bag (#45149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45149

choose_qparams_optimized calculates the optimized qparams.
It uses a greedy approach to nudge the min and max, calculating the L2 norm
and minimizing the quant error `torch.norm(x - fake_quant(x, s, z))`. A sketch of the idea follows below.
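A hedged sketch of the greedy idea (not the actual implementation; the bin count and step size are made up for illustration):

```python
import torch

def greedy_qparams(x, bit_width=8, steps=100):
    # Start from the full range and repeatedly keep whichever endpoint
    # nudge reduces the L2 error of the fake-quantized tensor.
    lo, hi = x.min().item(), x.max().item()
    step = (hi - lo) / steps

    def quant_err(lo, hi):
        scale = max((hi - lo) / (2 ** bit_width - 1), 1e-12)
        q = torch.clamp(torch.round((x - lo) / scale), 0, 2 ** bit_width - 1)
        return torch.norm(x - (q * scale + lo)).item()

    for _ in range(steps):
        candidates = [(lo, hi), (lo + step, hi), (lo, hi - step)]
        lo, hi = min(candidates, key=lambda c: quant_err(*c))
    return lo, hi

print(greedy_qparams(torch.randn(1000)))
```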

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23848060

fbshipit-source-id: c6c57c9bb07664c3f1c87dd7664543e09f634aee
2020-09-23 19:00:22 -07:00
Alex Suhan
76c185dcca [TensorExpr] When lanes differ, insert Broadcast instead of Cast (#45179)
Summary:
We need to check if dtypes differ in scalar type or lanes to decide between
Cast and Broadcast.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45179

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.SimplifyBroadcastTermExpander

Reviewed By: bwasti

Differential Revision: D23873316

Pulled By: asuhan

fbshipit-source-id: ca141be67e10c2b6c5f2ff9c11e42dcfc62ac620
2020-09-23 17:06:54 -07:00
Alex Suhan
0495998862 [TensorExpr] Disallow arithmetic binary operations on Bool (#44677)
Summary:
Arithmetic operations on Bool aren't fully supported in the evaluator. Moreover,
such semantics can be implemented by the client code through insertion of
explicit casts to widen and narrow to the desired types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44677

Test Plan:
test_tensorexpr --gtest_filter=TensorExprTest.ExprDisallowBoolArithmetic
python test/test_jit_fuser_te.py

Reviewed By: agolynski

Differential Revision: D23801412

Pulled By: asuhan

fbshipit-source-id: fff5284e3a216655dbf5a9a64d1cb1efda271a36
2020-09-23 14:59:11 -07:00
Alex Suhan
8e0fc711f4 [TensorExpr] Remove unused EvalConstExpr function (#45180)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45180

Test Plan: build

Reviewed By: ezyang

Differential Revision: D23877151

Pulled By: asuhan

fbshipit-source-id: a5d4d211c1dc85e6f7045330606163a933b9474e
2020-09-23 14:55:27 -07:00
Yi Wang
2a1a51facb Fix typos. (#45195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45195

Fix some typos in reducer class.
ghstack-source-id: 112673443

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D23862399

fbshipit-source-id: 0dc69e5ea1fa7d33c85d1909b2216bcd1f579f6a
2020-09-23 14:51:15 -07:00
Nick Gibson
9e206ee9f1 [NNC] Fix a bug in SplitWithMask when splitting multiple times (#45141)
Summary:
When doing a splitWithMask we only mask if the loop extent is not cleanly divided by the split factor. However, the logic does not simplify the extents, so any nontrivial loop extent will always cause a mask to be added, e.g. if the loop had been previously split. Unlike splitWithTail, the masks added by splitWithMask are always overhead, and we don't have the analysis to optimize them out when they are unnecessary, so it's good to avoid inserting them if we can.

The fix is just to simplify the loop extents before doing the extent calculation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45141

Reviewed By: ezyang

Differential Revision: D23869170

Pulled By: nickgg

fbshipit-source-id: 44686fd7b802965ca4f5097b0172a41cf837a1f5
2020-09-23 14:04:58 -07:00
Bradley Davis
21fabae47a Remove expensive call to PyObject_GetAttrString in PyTorch_LookupSpecial (#44684)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44684

The ad-hoc quantization benchmarking script in D23689062 recently highlighted that quantized ops were surprisingly slow after the introduction of support for custom ops in torch.fx in D23203204 (f15e27265f).

Using strobelight, it's immediately clear that up to 66% of samples were seen in `c10::get_backtrace`, which descends from `torch::is_tensor_and_append_overloaded -> torch::check_has_torch_function ->  torch::PyTorch_LookupSpecial -> PyObject_HasAttrString ->  PyObject_GetAttrString`.

I'm no expert by any means so please correct any/all misinterpretation, but it appears that:
- `check_has_torch_function` only needs to return a bool
- `PyTorch_LookupSpecial` should return `NULL` if a matching method is not found on the object
- in the impl of `PyTorch_LookupSpecial` the return value from `PyObject_HasAttrString` only serves as a bool to return early, but ultimately ends up invoking `PyObject_GetAttrString`, which raises, spawning the generation of a backtrace
- `PyObject_FastGetAttrString` returns `NULL` (stolen ref to an empty py::object if the if/else if isn't hit) if the method is not found, anyway, so it could be used singularly instead of invoking both `GetAttrString` and `FastGetAttrString`
- D23203204 (f15e27265f) compounded (but maybe not directly caused) the problem by increasing the number of invocations

so, removing it in this diff and seeing how many things break :)

before:
strobelight: see internal section
output from D23689062 script:
```
$ ./buck-out/gen/scripts/v/test_pt_quant_perf.par
Sequential(
  (0): Quantize(scale=tensor([0.0241]), zero_point=tensor([60]), dtype=torch.quint8)
  (1): QuantizedLinear(in_features=4, out_features=4, scale=0.017489388585090637, zero_point=68, qscheme=torch.per_tensor_affine)
  (2): DeQuantize()
)
fp 0.010896682739257812
q 0.11908197402954102
```

after:
strobelight: see internal section
output from D23689062 script:
```
$ ./buck-out/gen/scripts/v/test_pt_quant_perf.par
Sequential(
  (0): Quantize(scale=tensor([0.0247]), zero_point=tensor([46]), dtype=torch.quint8)
  (1): QuantizedLinear(in_features=4, out_features=4, scale=0.012683945707976818, zero_point=41, qscheme=torch.per_tensor_affine)
  (2): DeQuantize()
)
fp 0.011141300201416016
q 0.022639036178588867
```

which roughly restores original performance seen in P142370729
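For context, a public-API reconstruction of this kind of micro-benchmark (the actual D23689062 script is internal; the model, calibration data, and iteration count below are guesses, and a quantized backend such as fbgemm must be available):

```python
import time
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = torch.nn.Linear(4, 4)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

m = M().eval()
m.qconfig = torch.quantization.default_qconfig
torch.quantization.prepare(m, inplace=True)
m(torch.randn(32, 4))                       # calibrate the observers
torch.quantization.convert(m, inplace=True)

x = torch.randn(1, 4)
start = time.time()
for _ in range(10000):
    m(x)
print("q", time.time() - start)
```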

UPDATE: 9/22 mode/opt benchmarks
```
buck run //scripts/x:test_pt_quant_perf mode/opt
Sequential(
  (0): Quantize(scale=tensor([0.0263]), zero_point=tensor([82]), dtype=torch.quint8)
  (1): QuantizedLinear(in_features=4, out_features=4, scale=0.021224206313490868, zero_point=50, qscheme=torch.per_tensor_affine)
  (2): DeQuantize()
)
fp 0.002968311309814453
q 0.5138928890228271
```

with patch:
```
buck run //scripts/x:test_pt_quant_perf mode/opt
Sequential(
  (0): Quantize(scale=tensor([0.0323]), zero_point=tensor([70]), dtype=torch.quint8)
  (1): QuantizedLinear(in_features=4, out_features=4, scale=0.017184294760227203, zero_point=61, qscheme=torch.per_tensor_affine)
  (2): DeQuantize()
)
fp 0.0026655197143554688
q 0.0064449310302734375
```

Reviewed By: ezyang

Differential Revision: D23697334

fbshipit-source-id: f756d744688615e01c94bf5c48c425747458fb33
2020-09-23 13:52:54 -07:00
Zino Benaissa
4d80c8c648 Fix inlining interface call in fork subgraph (#43790)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43790

Interface calls were not handled properly when they are used in a fork
subgraph. This PR fixes that issue.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23402039

Pulled By: bzinodev

fbshipit-source-id: 41adc5ee7d942250e732e243ab30e356d78d9bf7
2020-09-23 11:17:19 -07:00
Edward Yang
da4033d32a Make cudaHostRegister actually useful on cudart. (#45159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45159

By default, pybind11 binds void* to be capsules.  After a lot of
Googling, I have concluded that this is not actually useful:
you can't actually create a capsule from Python land, and our
data_ptr() function returns an int, which means that the
function is effectively unusable.  It didn't help that we had no
tests exercising it.

I've replaced the void* with uintptr_t, so that we now accept int
(and you can pass data_ptr() in directly).  I'm not sure if we
should make these functions accept ctypes types; unfortunately,
pybind11 doesn't seem to have any easy way to do this.

Fixes #43006

Also added cudaHostUnregister which was requested.
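With the change, a call like the following works directly with `data_ptr()` (a sketch; error handling is omitted and a CUDA device is required):

```python
import torch

assert torch.cuda.is_available()
t = torch.empty(1024)                  # pageable CPU tensor
size = t.numel() * t.element_size()

cudart = torch.cuda.cudart()
# data_ptr() returns a plain int, which the binding now accepts directly.
cudart.cudaHostRegister(t.data_ptr(), size, 0)
cudart.cudaHostUnregister(t.data_ptr())
```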

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D23849731

Pulled By: ezyang

fbshipit-source-id: 8a79986f3aa9546abbd2a6a5828329ae90fd298f
2020-09-23 11:05:44 -07:00
Shen Li
94c3cdd994 Let rpc._all_gather use default RPC timeout (#44983)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44983

`_all_gather` was converted from `_wait_all_workers` and inherited its
fixed 5-second timeout. As `_all_gather` is meant to support a broader
set of use cases, the timeout configuration should be more flexible.
This PR makes `rpc._all_gather` use the global default RPC timeout.

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23794383

Pulled By: mrshenli

fbshipit-source-id: 382f52c375f0f25c032c5abfc910f72baf4c5ad9
2020-09-23 08:06:09 -07:00
Martin Yuan
e5bade7b2c [PyTorch Mobile] Move string op registrations to prim and make them selective (#44960)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44960

Since we have templated selective build, it should be safe to move the operators to prim so that they can be selectively built on mobile.

Test Plan: CI

Reviewed By: linbinyu

Differential Revision: D23772025

fbshipit-source-id: 52cebae76e4df5a6b2b51f2cd82f06f75e2e45d0
2020-09-23 07:42:35 -07:00
Luca Wehrstedt
76dc50e9c8 [RPC] Infer backend type if only options are given (#45065)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45065

To preserve backwards compatibility with applications that were passing in some ProcessGroupRpcBackendOptions but were not explicitly setting backend=BackendType.PROCESS_GROUP, we now infer the backend type from the options if only the latter are passed. If neither is passed, we default to TensorPipe, as before this change.
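A sketch of the now-supported call pattern (worker name, world size, and thread count are placeholders):

```python
import torch.distributed.rpc as rpc

opts = rpc.ProcessGroupRpcBackendOptions(num_send_recv_threads=8)

# No backend= argument: it is inferred as PROCESS_GROUP from the options.
rpc.init_rpc("worker0", rank=0, world_size=1, rpc_backend_options=opts)
rpc.shutdown()
```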
ghstack-source-id: 112586258

Test Plan: Added new unit tests.

Reviewed By: pritamdamania87

Differential Revision: D23814289

fbshipit-source-id: f4be7919e0817a4f539a50ab12216dc3178cb752
2020-09-23 00:46:27 -07:00
Alex Suhan
215679573e [TensorExpr] Fix operator order in combineMultilane (#45157)
Summary:
combineMultilane used the wrong operand order when the Ramp was on the
left-hand side, which matters for subtraction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45157

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.SimplifyRampSubBroadcast

Reviewed By: ailzhang

Differential Revision: D23851751

Pulled By: asuhan

fbshipit-source-id: 864d1611e88769fb43327ef226bb3310017bf858
2020-09-22 23:50:47 -07:00
Rohan Varma
d4a634c209 [RPC profiling] Don't wrap toHere() calls with profiling (#44655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44655

Since `toHere()` does not execute operations over RPC and simply
transfers the value to the local node, we don't need to enable the profiler
remotely for this message; doing so causes unnecessary overhead.

Since `toHere` is a blocking call, we already profile the call on the local node using `RECORD_USER_SCOPE`, so this does not change the expected profiler results (validated by ensuring all remote profiling tests pass).
ghstack-source-id: 112605610

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23641466

fbshipit-source-id: 109d9eb10bd7fe76122b2026aaf1c7893ad10588
2020-09-22 21:17:00 -07:00
Rohan Varma
70d2e4d1f6 [RPC profiling] Allow disableProfiler() to be called from another thread. (#44653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44653

This changes the profiler, per an offline discussion with ilia-cher, so that the `disableProfiler()` event consolidation logic can be called from different threads (i.e. threads where the profiler was not explicitly enabled). This is needed to support the functionality enabled by D23638387, where we defer profiling event collection until executing an async callback that can run on a different thread, to support RPC async function profiling.

This is done by introducing two flags, `cleanupTLSState` and `consolidate`, which control whether we should clean up thread-local settings (we don't do this when calling `disableProfiler()` on non-main threads) and whether we should consolidate all profiled events. Backwards compatibility is ensured since both options are true by default.

Added a test in `test_misc.cpp` to test this.
ghstack-source-id: 112605620

Reviewed By: mrshenli

Differential Revision: D23638499

fbshipit-source-id: f5bbb0d41ef883c5e5870bc27e086b8b8908f46b
2020-09-22 21:16:58 -07:00
Rohan Varma
1bd6533d60 Remove thread_local RecordFunctionGuard from profiler. (#44646)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44646

Per a discussion with ilia-cher, this is not needed anymore and
removing it would make some future changes to support async RPC profiling
easier. Tested by ensuring profiling tests in `test_autograd.py` still pass.
ghstack-source-id: 112605618

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23683998

fbshipit-source-id: 4e49a439509884fe04d922553890ae353e3331ab
2020-09-22 21:15:31 -07:00
Jerry Zhang
f575df201f [quant][graphmode][jit][api] Expose preserved_attrs from finalize to convert_jit (#44490)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44490

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23631142

fbshipit-source-id: f0913f0cb4576067e2a7288326024942d12e0ae0
2020-09-22 19:37:25 -07:00
Meghan Lele
e045119956 [JIT] Add default arguments for class types (#45098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45098

**Summary**
This commit adds support for default arguments in methods of class
types. Similar to how default arguments are supported for regular
script functions and methods on scripted modules, default values are
retrieved from the definition of a TorchScript class in Python as Python
objects, converted to IValues, and then attached to the schemas of
already compiled class methods.
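For example (a sketch adapted from the feature description, not the PR's own tests):

```python
import torch

@torch.jit.script
class Counter(object):
    def __init__(self, start: int = 0):
        self.count = start

    def increment(self, by: int = 1) -> int:
        self.count += by
        return self.count

@torch.jit.script
def run() -> int:
    c = Counter()          # uses the default start=0
    c.increment()          # uses the default by=1
    return c.increment(5)  # returns 6
```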

**Test Plan**
This commit adds a set of new tests to TestClassType to test default
arguments.

**Fixes**
This commit fixes #42562.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23844769

Pulled By: SplitInfinity

fbshipit-source-id: ceedff7703bf9ede8bd07b3abcb44a0f654936bd
2020-09-22 18:37:44 -07:00
Bram Wasti
ebde5a80bb [tensorexpr] Add flag to fuse with unknown shapes (#44401)
Summary:
This flag simply allows users to get fusion groups that will *eventually* have shapes (such that `getOperation` is valid).

This is useful for doing early analysis and compiling just in time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44401

Reviewed By: ZolotukhinM

Differential Revision: D23656140

Pulled By: bwasti

fbshipit-source-id: 9a26c202752399d1932ad7d69f21c88081ffc1e5
2020-09-22 18:17:47 -07:00
Yanan Cao
c253b10154 Fix incorrect EnumValue serialization issue (#44891)
Summary:
Previously, `prim::EnumValue` was serialized to `ops.prim.EnumValue`, which doesn't have the right implementation to refine the return type. This diff correctly serializes it to `enum.value`, thus fixing the issue.

Fixes https://github.com/pytorch/pytorch/issues/44892

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44891

Reviewed By: malfet

Differential Revision: D23818962

Pulled By: gmagogsfm

fbshipit-source-id: 6edfdf9c4b932176b08abc69284a916cab10081b
2020-09-22 11:59:45 -07:00
Ailing Zhang
10f287539f Align casing in test_dispatch with dispatch keys. (#44933)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44933

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23778247

Pulled By: ailzhang

fbshipit-source-id: bc3725eae670b03543015afe763cb3bb16baf8f6
2020-09-22 10:50:08 -07:00
Elias Ellison
ae286d81e0 [JIT] improve alias analysis for list constructs (#39111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39111

In our present alias analysis, we consider any Value that enters another container as entering the heap, and thus aliasing all other heap values of the same type. There are a number of advantages to this approach:
- it is not too hard to maintain the aliasDb implementation
- it is much easier from an op schema perspective - there are many composite list ops registered internally and externally that would be tricky to register and get right if we did something more complicated
- it limits the size of the AliasDb, because a container of size 10 only contributes a single memory dag element instead of 10 elements

The downside is that we are unable to handle the simple and extremely common case of a list of tensors being used in an ATen op.

In an example like:

```
def foo(input):
    x = torch.tensor([1, 2, 3, 4])
    y = [x, x]
    input.add_(1)
    return torch.cat(y)
```

we will consider x to be written to: any write to any wildcard element (an element that enters a tuple, an element that is taken from a list) will mark x as written to. This limits our ability to create a functional subset and fuse graphs - as a result, 4 TorchVision classification models could not be functionalized.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23828003

Pulled By: eellison

fbshipit-source-id: 9109fcb6f2ca20ca897cae71683530285da9d537
2020-09-22 09:38:59 -07:00
Nikita Shulga
63fd257879 Add Ellipsis constant to the list of recognized tokens (#44959)
Summary:
Per https://docs.python.org/3.6/library/constants.html
> `Ellipsis` is the same as the ellipsis literal `...`
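
For example, after this change the following scripts (a minimal sketch):

```python
import torch

@torch.jit.script
def first_slice(x: torch.Tensor) -> torch.Tensor:
    # `Ellipsis` is now recognized just like the `...` literal
    return x[Ellipsis, 0]
```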

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44959

Reviewed By: suo

Differential Revision: D23785660

Pulled By: malfet

fbshipit-source-id: f68461849e7d16ef68042eb96566f2c936c06b0f
2020-09-22 09:05:25 -07:00
anjali411
58b6ab69e5 torch.sgn for complex tensors (#39955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39955

resolves https://github.com/pytorch/pytorch/issues/36323 by adding `torch.sgn` for complex tensors.
`torch.sgn` returns `x/abs(x)` for `x != 0` and returns `0 + 0j` for `x == 0`.
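
For example (the values follow directly from the definition above):

```python
import torch

z = torch.tensor([3 + 4j, 0 + 0j])
torch.sgn(z)  # tensor([0.6000+0.8000j, 0.0000+0.0000j]) since |3+4j| == 5
```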

This PR doesn't test the correctness of the gradients. That will be done as part of auditing all the ops in the future, once we decide the autograd behavior (JAX vs TF) and add gradcheck.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23460526

Pulled By: anjali411

fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92
2020-09-22 08:24:53 -07:00
Bugra Akyildiz
1b059f2c6d Directly use work.result() to retrieve tensor rather than passing as a separate argument (#44914)
Summary:
We currently fetch an allreduced tensor from Python in C++, storing the resulting tensor in a struct's member. This PR removes the extra tensor parameter from the function signature and fetches the result from a single place via `work.result()`.

Fixes https://github.com/pytorch/pytorch/issues/43960

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44914

Reviewed By: rohan-varma

Differential Revision: D23798888

Pulled By: bugra

fbshipit-source-id: ad1b8c31c15e3758a57b17218bbb9dc1f61f1577
2020-09-22 06:28:47 -07:00
Jerry Zhang
5aed75b21b [quant][graphmode][jit] Try to support append (#44641)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44641

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23682356

fbshipit-source-id: 09a03dfde0b1346a5764e8e28ba56e32b343d239
2020-09-21 23:13:56 -07:00
Ksenija Stanojevic
0dda65ac77 [ONNX] add jit pass for lists (#43820)
Summary:
Adds a JIT preprocessing pass for handling int lists in ONNX export.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43820

Reviewed By: albanD

Differential Revision: D23674598

Pulled By: bzinodev

fbshipit-source-id: 35766403a073e202563bba5251c07efb7cc5cfb1
2020-09-21 22:05:25 -07:00
Shen Li
09e7f62ce2 Fix RPC and ProcessGroup GIL deadlock (#45088)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45088

Fixes #45082

Found a few problems while working on #44983

1. We deliberately swallow RPC timeouts during shutdown, as we haven't
found a good way to handle those. When we converted `_wait_all_workers`
into `_all_gather`, the same logic was inherited. However, as
`_all_gather` is meant to be used in more general scenarios, we should
no longer keep silent about errors. This commit lets errors throw
in `_all_gather` and also lets `shutdown()` catch and log them.
2. After fixing (1), I found that `UnpickledPythonCall` needs to
acquire the GIL on destruction, and this can lead to a deadlock when
used in conjunction with `ProcessGroup`, because the `ProcessGroup`
ctor is a synchronization point which holds the GIL. In `init_rpc`,
followers (`rank != 0`) can exit before the leader (`rank == 0`). If
the two happen together, we can deadlock as follows: a follower exits
`init_rpc` after running `_broadcast_to_followers` and before reaching
the dtor of `UnpickledPythonCall`. It then runs the ctor of
`ProcessGroup`, which holds the GIL and waits for the leader to join.
However, the leader is waiting for the response from
`_broadcast_to_followers`, which is blocked by the dtor of
`UnpickledPythonCall`. Hence the deadlock. This commit drops the GIL
in the `ProcessGroup` ctor.
3. After fixing (2), I found that the `TensorPipe` backend
nondeterministically fails on `test_local_shutdown`, for a
similar reason as (2), but this time it is that `shutdown()` on a
follower runs before the leader finishes `init_rpc`. This commit
adds a join for the `TensorPipe` backend `init_rpc` after `_all_gather`.

The 3rd fix should solve the 2nd issue as well, but since
I didn't see a reason to hold the GIL during the `ProcessGroup` ctor, I
made that change too.

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23825592

Pulled By: mrshenli

fbshipit-source-id: 94920f2ad357746a6b8e4ffaa380dd56a7310976
2020-09-21 21:47:27 -07:00
Nikita Shulga
81bb19c9f0 [JIT] Prohibit subscripted assignments for tuple types (#44929)
Summary:
This would force jit.script to raise an error if someone tries to mutate a tuple:
```
Tuple[int, int] does not support subscripted assignment:
  File "/home/nshulga/test/tupleassignment.py", line 9
torch.jit.script
def foo(x: Tuple[int, int]) -> int:
    x[-1] = x[0] + 1
    ~~~~~ <--- HERE
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44929

Reviewed By: suo

Differential Revision: D23777668

Pulled By: malfet

fbshipit-source-id: 8efaa4167354ffb4930ccb3e702736a3209151b6
2020-09-21 16:35:44 -07:00
Ailing Zhang
92f8f75c59 Add alias dispatch key Math. (#44354)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44354

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23591481

Pulled By: ailzhang

fbshipit-source-id: 6e93c4ec99a07f3fc920ba2d09dc222e6ced5adf
2020-09-21 11:10:39 -07:00
Lucas Hosseini
ac8c7c4e9f Make Channel API accept buffer structs rather than raw pointers. (#45014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45014

Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/219

Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/212

+ Introduce buffer.h defining the buffer struct(s). The `CpuBuffer`
struct is always defined, while the `CudaBuffer` struct is defined
only when `TENSORPIPE_SUPPORTS_CUDA` is true.
+ Update all channels to take a `CpuBuffer` or `CudaBuffer` for
`send`/`recv` rather than a raw pointer and a length.
+ Make the base `Channel`/`Context` classes templated on `TBuffer`,
effectively creating two channel hierarchies (one for CPU channels,
one for CUDA channels).
+ Update the Pipe and the generic channel tests to use the new API. So
far, generic channel tests are CPU only, and tests for the CUDA IPC
channel are (temporarily) disabled. A subsequent PR will take care of
refactoring tests so that generic tests work for CUDA channels. Another
PR will add support for CUDA tensors in the Pipe.

Differential Revision: D23598033

Test Plan: Imported from OSS

Reviewed By: lw

Pulled By: beauby

fbshipit-source-id: 1d6c3f91e288420858835cd5e7962e8da051b44b
2020-09-21 10:18:45 -07:00
Nick Gibson
4bbb6adff5 [NNC] fix SyncThreads insertion and reenable CudaSharedMem test (#44909)
Summary:
A previous fix for masking Cuda dimensions (https://github.com/pytorch/pytorch/issues/44733) changed the behaviour of inserting thread synchronization barriers in the Cuda CodeGen, causing CudaSharedMemReduce_1 to be flaky and ultimately disabled.

The issue is working out where these barriers must be inserted - solving this optimally is very hard, and I think not possible without dependency analysis we don't have, so I've changed our logic to be quite pessimistic: we'll insert barriers before and after any blocks that have thread dimensions masked (even between blocks that have no data dependencies). This should be correct, but it's an area where we could improve performance. To address this somewhat, I've added a simplifier pass that removes obviously unnecessary syncThreads.

To avoid this test being flaky again, I've added a check against the generated code to ensure there is a syncThread in the right place.

Also fixed a couple of non-functional clarity issues in the generated code: added the missing newline after Stores in the CudaPrinter, and prevented the PrioritizeLoad mutator from pulling out loads contained within simple Let statements (such as those produced by the Registerizer).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44909

Reviewed By: agolynski

Differential Revision: D23800565

Pulled By: nickgg

fbshipit-source-id: bddef1f40d8d461da965685f01d00b468d8a2c2f
2020-09-21 09:27:22 -07:00
anjali411
9f67176b82 Complex gradcheck logic (#43208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43208

This PR adds gradcheck for complex. The logic used for complex gradcheck is described in Section 3.5.3 here: https://arxiv.org/pdf/1701.00392.pdf

More concretely, this PR introduces the following changes (a usage sketch follows the list):
1. Updates get_numerical_jacobian to take as input a scalar value for the vector (v). Adds gradcheck logic for C -> C, C -> R, and R -> C. For R -> C functions, only the real part of the gradient is propagated.
2. Adds a backward definition for `torch.complex` and a test to verify the added definition.
3. Updates backward for `mul`, `sin`, `cos`, `sinh`, `cosh`.
4. Adds tests for all `torch.real`, `torch.imag`, `torch.view_as_real`, `torch.view_as_complex`, `torch.conj`.
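
A minimal usage sketch of the new complex gradcheck (assuming double-precision complex inputs, as gradcheck requires):

```python
import torch
from torch.autograd import gradcheck

# C -> C case; torch.sin's backward was updated in this PR
x = torch.randn(3, dtype=torch.cdouble, requires_grad=True)
assert gradcheck(torch.sin, (x,))
```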

Follow up tasks:
1. Add more thorough tests for R -> C cases. Specifically, add R -> C test variants for functions, e.g. `torch.mul(complex_tensor, real_tensor)`.
2. Add back commented test in `common_methods_invocation.py`.
3. Add more special case checking for complex gradcheck to make debugging easier.
4. Update complex autograd note.
5. disable complex autograd for operators not tested for complex.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23655088

Pulled By: anjali411

fbshipit-source-id: caa75e09864b5f6ead0f988f6368dce64cf15deb
2020-09-20 22:05:04 -07:00
Peter Bell
da7863f46b Add one dimensional FFTs to torch.fft namespace (#43011)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43011
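
The PR title names the one-dimensional transforms; a small usage sketch, assuming the standard `torch.fft.fft`/`torch.fft.ifft` entry points:

```python
import torch

t = torch.tensor([0.0, 1.0, 0.0, -1.0])
ft = torch.fft.fft(t)    # one-dimensional discrete Fourier transform
rt = torch.fft.ifft(ft)  # inverse transform recovers t (as a complex tensor)
```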

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23751850

Pulled By: mruberry

fbshipit-source-id: 8dc5fec75102d8809eeb85a3d347ba1b5de45b33
2020-09-19 23:32:22 -07:00
Mike Ruberry
60709ad1bf Adds multiply and divide aliases (#44463)
Summary:
These aliases are consistent with NumPy. Note that C++'s naming would be different (std::multiplies and std::divides), and that PyTorch's existing names (mul and div) are consistent with Python's dunders.

This also improves the instructions for adding an alias to clarify that dispatch keys should be removed when copying native_function.yaml entries to create the alias entries.
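
For example:

```python
import torch

a = torch.tensor([6.0, 8.0])
b = torch.tensor([2.0, 4.0])
torch.multiply(a, b)  # same as torch.mul(a, b) -> tensor([12., 32.])
torch.divide(a, b)    # same as torch.div(a, b) -> tensor([3., 2.])
```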

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44463

Reviewed By: ngimel

Differential Revision: D23670782

Pulled By: mruberry

fbshipit-source-id: 9f1bdf8ff447abc624ff9e9be7ac600f98340ac4
2020-09-19 15:47:52 -07:00
Ivan Kobzarev
e9941a5dd4 [vulkan][py] torch.utils.optimize_for_vulkan (#44903)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44903
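
A usage sketch, assuming the new entry point takes a scripted module and returns an optimized copy (the signature is an assumption; `MyModel` is hypothetical):

```python
import torch

class MyModel(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x)

scripted = torch.jit.script(MyModel())
# assumed signature: ScriptModule in, Vulkan-optimized ScriptModule out
vulkan_module = torch.utils.optimize_for_vulkan(scripted)
```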

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D23766039

Pulled By: IvanKobzarev

fbshipit-source-id: dbdf484ee7d3a7719aab105efba51b92ebc51568
2020-09-18 18:20:11 -07:00
Shawn Wu
572f7e069c Enable type check for torch.testing._internal.te_utils.* (#44927)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44927

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D23776842

Pulled By: sshawnwu

fbshipit-source-id: 65c028169a37e1f2f7d9fdce8a958234ee1caa26
2020-09-18 18:09:15 -07:00
Peter Bell
fd4e21c91e Add optional string support to native_functions schema (#43010)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43010

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23751851

Pulled By: mruberry

fbshipit-source-id: 648f7430e1b7311eff28421f38e01f52d998fcbd
2020-09-18 14:57:24 -07:00
Michael Suo
374e9373b5 [jit] Pull (most) tests out of libtorch_python (#44795)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44795

Today, we build our cpp tests twice, once as a standalone gtest binary,
and once linked in `libtorch_python` so we can call them from
`test_jit.py`.

This is convenient (it means that `test_jit.py` is a single entry point
for all our tests), but has a few drawbacks:
1. We can't actually use the gtest APIs, since we don't link gtest into
`libtorch_python`. We're stuck with the subset that we want to write
polyfills for, and an awkward registration scheme (you have to
write a test, then include it in `tests.h`).
2. More seriously, we register custom operators and classes in these
tests. In a world where we may be linking many `libtorch_python`s, this
has a tendency to cause errors with `libtorch`.

So now, only tests that explicitly require cooperation with Python are
built into `libtorch_python`. The rest are built into
`build/bin/test_jit`.

There are tests which require that we define custom classes and
operators. In these cases, I've built them into separate `.so`s that we
call `torch.ops.load_library()` on.
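
For example, from Python (the library path is illustrative):

```python
import torch

# make the custom ops/classes defined in the separate .so visible
torch.ops.load_library("build/lib/libtorchbind_test.so")
```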

Test Plan: Imported from OSS

Reviewed By: SplitInfinity, ZolotukhinM

Differential Revision: D23735520

Pulled By: suo

fbshipit-source-id: d146bf4e7eb908afa6f96b394e4d395d63ad72ff
2020-09-18 14:04:40 -07:00
Lucas Hosseini
af3fc9725d Extract rpc/tensorpipe_utils.{cpp,h} from rpc/utils.{cpp,h} (#44803)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44803

Test Plan: CI

Reviewed By: lw

Differential Revision: D23732022

fbshipit-source-id: 5b839c7997bbee162a14d03414ee32baabbc8ece
2020-09-18 13:51:43 -07:00
Nick Gibson
f175830558 [NNC] Fuse identical conditions in simplifier (#44886)
Summary:
Adds a pass to the IR Simplifier which fuses together the bodies of Cond statements that have identical conditions, e.g.

```
if (i < 10) {
  do_thing_1;
} else {
  do_thing_2;
}
if (i < 10) {
  do_thing_3;
}
```

is transformed into:

```
if (i < 10) {
  do_thing_1;
  do_thing_3;
} else {
  do_thing_2;
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44886

Reviewed By: glaringlee

Differential Revision: D23768565

Pulled By: nickgg

fbshipit-source-id: 3fe40d91e82bdfff8dcb8c56a02a4fd579c070df
2020-09-18 11:38:03 -07:00
Yanan Cao
174cbff00a Improve sugared value's error message (#42889)
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **https://github.com/pytorch/pytorch/issues/42889 Improve sugared value's error message**

I think most (if not all) cases where this code path is reached can be attributed to closing over a global variable.
Improving the error message to make this clearer to users.
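
A hypothetical sketch of the kind of code that reaches this path (all names are illustrative; the global is of a type TorchScript cannot compile):

```python
import torch

class Helper:            # plain Python class, not scriptable as-is
    def scale(self, x):
        return x * 2

helper = Helper()        # global captured by the function below

@torch.jit.script        # fails here with the improved error message
def f(x: torch.Tensor) -> torch.Tensor:
    return helper.scale(x)
```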

Closes https://github.com/pytorch/pytorch/issues/41288

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42889

Reviewed By: SplitInfinity

Differential Revision: D23779347

Pulled By: gmagogsfm

fbshipit-source-id: ced702a96234040f79eb16ad998d202e360d6654
2020-09-18 11:01:40 -07:00
Peter Bell
df39c40054 Cleanup tracer handling of optional arguments (#43009)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43009

* **#43009 Cleanup tracer handling of optional arguments**

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23766621

Pulled By: mruberry

fbshipit-source-id: c1b46cd23b58b18ef4c03021b2514d7e692badb6
2020-09-18 06:54:09 -07:00
Rohan Varma
5dbcbea265 TorchScript with record_function (#44345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44345

As part of enhancing profiler support for RPC, when executing TorchScript functions over RPC, we would like to be able to support user-defined profiling scopes created by `with record_function(...)`.

Since after https://github.com/pytorch/pytorch/pull/34705, we support `with` statements in TorchScript, this PR adds support for `with torch.autograd.profiler.record_function` to be used within TorchScript.

This can be accomplished via the following without this PR:
```
torch.ops.profiler._record_function_enter(...)
# Script code, such as forward pass
torch.ops.profiler._record_function_exit(...)
```

This is a bit hacky, and it would be much cleaner to use the context manager now that we support `with` statements. Also, the `_record_function_*` operators are internal operators that are subject to change; using the context manager will help avoid BC issues in the future.
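
With this change, the same can be written with the context manager directly (a sketch; the scope name is arbitrary):

```python
import torch

@torch.jit.script
def forward_with_scope(x: torch.Tensor) -> torch.Tensor:
    with torch.autograd.profiler.record_function("my_scope"):
        y = torch.relu(x + 1)
    return y
```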

Tested with `python test/test_jit.py TestWith.test_with_record_function -v`
ghstack-source-id: 112320645

Test Plan:
Repro instructions:
1) Change `def script_add_ones_return_any(x) -> Any` to `def script_add_ones_return_any(x) -> Tensor` in `jit/rpc_test.py`
2) `buck test mode/dev-nosan //caffe2/test/distributed/rpc:process_group_agent -- test_record_function_on_caller_rpc_async --print-passing-details`
3) The function which ideally should accept `Future[Any]` is `def _call_end_callbacks_on_future` in `autograd/profiler.py`.

python test/test_jit.py TestWith.test_with_foo -v

Reviewed By: pritamdamania87

Differential Revision: D23332074

fbshipit-source-id: 61b0078578e8b23bfad5eeec3b0b146b6b35a870
2020-09-17 18:45:00 -07:00