Commit Graph

1014 Commits

Author SHA1 Message Date
Gao, Xiang
5e97f251a8 Enable TF32 support for cuDNN (#40737)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40737

Reviewed By: mruberry

Differential Revision: D22801525

Pulled By: ngimel

fbshipit-source-id: ac7f7e728b4b3e01925337e8c9996f26a6433fd2
2020-09-01 15:34:24 -07:00
Bert Maher
c14a3613a8 Fix NaN propagation in TE fuser's min/max implementation (#43609)
Summary:
Per the eager-mode source of truth, NaNs shall be propagated by min/max.
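
A quick illustration of the eager-mode behavior the fuser must now match (a minimal sketch; the expected outputs follow the propagation rule stated above):

```python
import torch

a = torch.tensor([1.0, float("nan")])
b = torch.tensor([2.0, 0.0])

# Elementwise min/max should propagate NaN wherever either input is NaN.
print(torch.max(a, b))  # expected: tensor([2., nan])
print(torch.min(a, b))  # expected: tensor([1., nan])
```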

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43609

Reviewed By: ZolotukhinM

Differential Revision: D23349184

Pulled By: bertmaher

fbshipit-source-id: 094eb8b89a02b27d5ecf3988d0f473c0f91e4afb
2020-09-01 02:10:13 -07:00
Pritam Damania
f1624b82b5 Preserve python backtrace in autograd engine errors. (#43684)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43684

This PR attempts to address #42560 by capturing the appropriate
exception_ptr in the autograd engine and passing it over to the Future.

As part of this change, there is a significant change to the Future API: setError
now only accepts an exception_ptr.

For the example in #42560, the exception trace would now look like:

```
> Traceback (most recent call last):
>   File "test_autograd.py", line 6914, in test_preserve_backtrace
>     Foo.apply(t).sum().backward()
>   File "torch/tensor.py", line 214, in backward
>     torch.autograd.backward(self, gradient, retain_graph, create_graph)
>   File "torch/autograd/__init__.py", line 127, in backward
>     allow_unreachable=True)  # allow_unreachable flag
>   File "torch/autograd/function.py", line 87, in apply
>     return self._forward_cls.backward(self, *args)
>   File "test_autograd.py", line 6910, in backward
>     raise ValueError("something")
> ValueError: something
```
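
For reference, a minimal sketch of the kind of custom Function exercised above (the forward body is assumed; only the raising backward matters here):

```python
import torch

class Foo(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, grad):
        # With this change, the original Python traceback of this error is
        # preserved when the autograd engine re-raises it.
        raise ValueError("something")

t = torch.rand(10, requires_grad=True)
Foo.apply(t).sum().backward()
```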
ghstack-source-id: 111109637

Test Plan: waitforbuildbot

Reviewed By: albanD

Differential Revision: D23365408

fbshipit-source-id: 1470c4776ec8053ea92a6ee1663460a3bae6edc5
2020-09-01 01:28:47 -07:00
Alex Suhan
85d91a3230 [TensorExpr] Check statements in test_kernel.cpp (#43911)
Summary:
Check statements and fix all the warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43911

Test Plan: test_tensorexpr

Reviewed By: ZolotukhinM

Differential Revision: D23441092

Pulled By: asuhan

fbshipit-source-id: f671eef4b4eb9b51acb15054131152ae650fedbd
2020-08-31 22:16:25 -07:00
Alex Suhan
deb5fde51c [TensorExpr] Make KernelSumMultipleAxes much faster (#43905)
Summary:
Reduce input size, skip the dtype conversion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43905

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.KernelSum*

Reviewed By: ailzhang

Differential Revision: D23433398

Pulled By: asuhan

fbshipit-source-id: 0d95ced3c1382f10595a9e5745bf4bef007cc913
2020-08-31 17:58:43 -07:00
Elias Ellison
a7e7981c0b Use prim::TensorExprGroup interned symbol (#43635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43635

Intern the symbol; no functional changes. Aliasing needs to be looked at, but that should be done in a separate PR; this PR just changes the symbol.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23358806

Pulled By: eellison

fbshipit-source-id: f18bcd142a0daf514136f019ae607e4c3f45d9f8
2020-08-31 11:52:16 -07:00
Nick Gibson
1390cad2d8 [NNC] Hook up registerizer to Cuda codegen [2/x] (#42878)
Summary:
Insert the registerizer into the Cuda Codegen pass list, to enable scalar replacement and close the gap in simple reduction performance.

First up the good stuff, benchmark before:
```
          Column sum          Caffe2             NNC          Simple          Better
           (10, 100)          5.7917          9.7037          6.9386          6.0448
          (100, 100)          5.9338          14.972          7.1139          6.3254
        (100, 10000)          21.453          741.54          145.74          12.555
        (1000, 1000)          8.0678          122.75          22.833          9.0778

             Row sum          Caffe2             NNC          Simple          Better
           (10, 100)          5.4502          7.9661          6.1469          5.5587
          (100, 100)          5.7613          13.897           21.49          5.5808
        (100, 10000)          21.702          82.398          75.462          22.793
        (1000, 1000)          22.527             129          176.51          22.517

```

After:
```
          Column sum          Caffe2             NNC          Simple          Better
           (10, 100)          6.0458          9.4966          7.1094           6.056
          (100, 100)          5.9299          9.1482          7.1693           6.593
        (100, 10000)          21.739          121.97          162.63          14.376
        (1000, 1000)          9.2374           29.01          26.883          10.127

             Row sum          Caffe2             NNC          Simple          Better
           (10, 100)          5.9773          8.1792          7.2307          5.8941
          (100, 100)          6.1456          9.3155          24.563          5.8163
        (100, 10000)          25.384          30.212          88.531          27.185
        (1000, 1000)          26.517          32.702          209.31          26.537
```

The speedup is about 3-8x depending on the size of the data (increasing with bigger inputs).

The gap between NNC and Simple is closed or eliminated; the remaining issue appears to be kernel launch overhead. Next up is getting closer to the _Better_ kernel.

It required a lot of refactoring and bug fixes on the way:
* Refactored flattening of parallelized loops out of the CudaPrinter and into its own stage, so we can transform the graph in the stage between flattening and printing (where registerization occurs).
* Made AtomicAddFuser less pessimistic: it will now recognize that if an Add to a buffer depends on all used Block and Thread vars, then it has no overlap and does not need to be atomic. This allows registerization to apply to these stores.
* Fixed PrioritizeLoad mutator so that it does not attempt to separate the Store and Load to the same buffer (i.e. reduction case).
* Moved CudaAnalysis earlier in the process, allowing later stages to use the analyzed bufs.
* Fixed a bug in the Registerizer where when adding a default initializer statement it would use the dtype of the underlying var (which is always kHandle) instead of the dtype of the Buf.
* Fixed a bug in the IRMutator where the logic for replacing Allocate statements was inverted, so they were replaced only if they did not change.
* Added simplification of simple Division patterns to the IRSimplifier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42878

Reviewed By: glaringlee

Differential Revision: D23382499

Pulled By: nickgg

fbshipit-source-id: 3640a98fd843723abad9f54e67070d48c96fe949
2020-08-31 10:39:46 -07:00
Alex Suhan
60ad7e9c04 [TensorExpr] Make sum available from Python (#43730)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43730

Test Plan:
python test/test_jit_fuser_te.py -k TestTEFuser.test_sum
test_tensorexpr --gtest_filter=TensorExprTest.KernelSum*

Reviewed By: ZolotukhinM

Differential Revision: D23407600

Pulled By: asuhan

fbshipit-source-id: e6da4690ae6d802f9be012e39e61b7467aa5285c
2020-08-29 10:38:21 -07:00
Nikolay Korovaiko
000739c31a Function calls for fallback paths (#43274)
Summary:
This PR adds an API to package unoptimized/fallback blocks as function calls. It's mainly meant to be used by the TensorExpressionsFuser and SpecializeAutogradZero passes, as both specialize the original graph but would also like to provide a fallback path in case the assumptions under which the graph was specialized do not hold for some inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43274

Reviewed By: malfet

Differential Revision: D23406961

Pulled By: Krovatkin

fbshipit-source-id: ef21fc9ad886953461b09418d02c75c58375490c
2020-08-28 23:31:02 -07:00
Vinod Kumar S
13c7c6227e Python/C++ API Parity: TransformerDecoder (#42886)
Summary:
Fixes #37756 (https://github.com/pytorch/pytorch/issues/37756)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42886

Reviewed By: zhangguanheng66

Differential Revision: D23385631

Pulled By: glaringlee

fbshipit-source-id: 610a2fabb4c25b2dfd37b33287215bb8872d653d
2020-08-28 20:13:53 -07:00
Mikhail Zolotukhin
776c2d495f [JIT] IRParser: store list attributes as generic ivalue lists. (#43785)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43785

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23400565

Pulled By: ZolotukhinM

fbshipit-source-id: e248eb1854c4ec40da9455d4279ea6e47b1f2a16
2020-08-28 13:27:28 -07:00
Mike Ruberry
f4695203c2 Fixes fft function calls for C++ API (#43749)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43732.

Requires importing the fft namespace in the C++ API, just as the Python API does, to avoid clobbering the torch::fft function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43749

Reviewed By: glaringlee

Differential Revision: D23391544

Pulled By: mruberry

fbshipit-source-id: d477d0b6d9a689d5c154ad6c31213a7d96fdf271
2020-08-28 12:41:30 -07:00
Martin Yuan
288a2effa0 Operator generator based on templated selective build. (#43456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43456

Introduce the template OperatorGenerator, which returns an optional Operator. It's null if the templated bool value is false.

RegisterOperators() is updated to take the optional Operator. A null will not be registered.

With this update, selective operator registration can be done at compile time. Tests are added to show that an operator is registered if it's in the whitelist and is not registered otherwise.

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D23283563

Pulled By: iseeyuan

fbshipit-source-id: 456e0c72b2f335256be800aeabb797bd83bcf0b3
2020-08-27 07:26:07 -07:00
Alex Suhan
de84db2a9d [TensorExpr] Add aten::sum lowering to the kernel (#43585)
Summary:
Handles all dimensions and selected dimensions, per PyTorch semantics.
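
For reference, the PyTorch semantics being matched (full reduction and reduction over selected dimensions):

```python
import torch

t = torch.rand(4, 5)
print(torch.sum(t))         # all dimensions -> scalar
print(torch.sum(t, dim=1))  # selected dimension -> shape (4,)
```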

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43585

Test Plan: test_tensorexpr

Reviewed By: bertmaher

Differential Revision: D23362382

Pulled By: asuhan

fbshipit-source-id: e8d8f1197a026be0b46603b0807d996a0de5d58c
2020-08-27 02:46:47 -07:00
lixinyu
48e08f884e C++ APIs TransformerEncoder (#43187)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43187

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23182770

Pulled By: glaringlee

fbshipit-source-id: 968846138d4b1c391a74277216111dba8b72d683
2020-08-27 01:31:46 -07:00
James Reed
a070c619b9 [FX] Native callables in FX lowering (#43426)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43426

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23273427

Pulled By: jamesr66a

fbshipit-source-id: 3a9d04486c72933d8afd9c181578fe98c3d825b0
2020-08-27 00:00:03 -07:00
Mikhail Zolotukhin
3ec24f02af [TensorExpr] Start using typecheck in the fuser. (#43173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43173

With this change the fuser starts to generate typechecks for the inputs of a
fusion group. For each fusion group we generate a typecheck and an if
node: the true block contains the fused subgraph, and the false block
contains the unoptimized original subgraph.
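
Conceptually, the generated structure behaves like this Python-level sketch (illustration only, not the actual IR or any real API):

```python
# Python-level sketch of the guarded fusion group (names are illustrative).
def run_guarded(inputs, expected_shapes, fused_subgraph, original_subgraph):
    # Typecheck: all inputs must match the expected (profiled) shapes.
    if all(tuple(x.shape) == s for x, s in zip(inputs, expected_shapes)):
        return fused_subgraph(*inputs)   # true block: fused subgraph
    return original_subgraph(*inputs)    # false block: original subgraph
```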

Differential Revision: D23178230

Test Plan: Imported from OSS

Reviewed By: eellison

Pulled By: ZolotukhinM

fbshipit-source-id: f56e9529613263fb3e6575869fdb49973c7a520b
2020-08-25 18:13:32 -07:00
Mikhail Zolotukhin
b763666f9f [JIT] Subgraph utils: add an optional vmap argument to the API to allow retrieving value mappings. (#43235)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43235

This functionality is needed when we do not want to lose track of
nodes/values as we merge and unmerge them into other nodes. For
instance, if we have a side data structure with some meta information
about values or nodes, this new functionality allows us to keep that
metadata up to date after merging and unmerging nodes.

Differential Revision: D23202648

Test Plan: Imported from OSS

Reviewed By: eellison

Pulled By: ZolotukhinM

fbshipit-source-id: 350d21a5d462454166f8a61b51d833551c49fcc9
2020-08-25 18:13:29 -07:00
Ann Shan
7cc1efec13 Add lite SequentialSampler to torch mobile (#43299)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43299

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23228415

Pulled By: ann-ss

fbshipit-source-id: eebe54353a128783f039c7dac0e2dd765a61940d
2020-08-24 09:45:24 -07:00
Nikolay Korovaiko
a97ca93c0e remove prim::profile and special-casing (#43160)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43160

Reviewed By: ZolotukhinM

Differential Revision: D23284421

Pulled By: Krovatkin

fbshipit-source-id: 35e97aad299509a682ae7e95d7cef53301625309
2020-08-22 23:52:36 -07:00
Zino Benaissa
40c77f926c Add prim::TypeCheck operation (#43026)
Summary:
TypeCheck is a new operation to check the shapes of tensors against
 expected shapes. TypeCheck is a variadic operation. An example:

 %t0 : Tensor = ...
 %t1 : Tensor = ...
 %2 : FLOAT(20, 20), %3 : FLOAT(30, 30), %1 : bool =
 prim::TypeCheck(%t0, %t1)
 prim::If(%1)

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43026

Reviewed By: ZolotukhinM

Differential Revision: D23115830

Pulled By: bzinodev

fbshipit-source-id: fbf142126002173d2d865cf4b932dea3864466b4
2020-08-21 20:03:24 -07:00
Raghavan Raman
100649d6a9 Normalize loops with non-zero start. (#43179)
Summary:
This diff normalizes for-loops that have non-zero loop starts so that they always start from 0. Given a for-loop, this normalization changes the loop start to 0 and adjusts the loop end and all accesses to the index variable within the loop body accordingly.
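
As an illustration (a Python-style sketch of the transformation, not the NNC API):

```python
# Sketch only: list indexing stands in for buffer accesses.
A = [0] * 100
B = list(range(100))

# Before normalization: the loop starts at 10.
for x in range(10, 100):
    A[x] = B[x] + 1

# After normalization: the loop starts at 0, and the end bound and all
# index accesses are shifted by the original start.
for x in range(0, 90):
    A[x + 10] = B[x + 10] + 1
```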

This diff also adds tests for several cases of normalization, and also tests normalization in conjunction with the `splitWithTail` transformation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43179

Reviewed By: nickgg

Differential Revision: D23220534

Pulled By: navahgar

fbshipit-source-id: 64be0c72e4dbc76906084f7089dea81ae07d6020
2020-08-21 12:37:27 -07:00
Alex Suhan
f20a04fa2d [TensorExpr] Simplify conditional select (#43350)
Summary:
Fold conditional select when both sides are constant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43350

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.ConditionalSelectFold*

Reviewed By: pbelevich

Differential Revision: D23256602

Pulled By: asuhan

fbshipit-source-id: ec04b1e4ae64f59fa574047f2d7af55a717a5262
2020-08-21 11:15:48 -07:00
lixinyu
e32d014f46 remove empty override pretty_print (#43341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43341

This removes the empty pretty_print(), since it overrides the implementation in the Module base class, which is not the intended behavior here.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D23244616

Pulled By: glaringlee

fbshipit-source-id: 94b8dfd3697dfc450f53b3b4eee6e9c13cafba7b
2020-08-20 18:48:29 -07:00
Ann Shan
dd194c1612 add _save_parameters to serialize map (#43163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43163

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23175287

Pulled By: ann-ss

fbshipit-source-id: ddfd734513c07e8bdbec108f26d1ca1770d098a6
2020-08-18 14:58:04 -07:00
Ann Shan
2e6e295ecc refactor _save_parameters to _save_data (#43162)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43162

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23175286

Pulled By: ann-ss

fbshipit-source-id: 6f930b98c367242fd4efbf51cb1d09995f7c4b40
2020-08-18 14:57:03 -07:00
Christian Sarofeen
b3bda94393 [NVFuser] Enable E2E BCast-PWise-Reduction fusions (#43129)
Summary:
Had a bunch of merged commits that shouldn't have been there; reverted them to prevent conflicts. Lots of new features; highlights are listed below.

**Overall:**

- Enables pointwise fusion, single (but N-D) broadcast -- pointwise fusion, and single (but N-D) broadcast -- pointwise -- single (but N-D) reduction fusion.

**Integration:**

- Separate "magic scheduler" logic that takes a fusion and generates code generator schedule
- Reduction fusion scheduling with heuristics closely matching eagermode (unrolling supported, but no vectorize support)
- 2-Stage caching mechanism, one on contiguity, device, type, and operations, the other one is input size->reduction heuristic

**Code Generation:**

- More generic support in code generation for computeAt
- Full rework of loop nest generation and Indexing to more generically handle broadcast operations
- Code generator has automatic kernel launch configuration (including automatic allocation of grid reduction buffers)
- Symbolic (runtime) tiling on grid/block dimensions is supported
- Simplified index generation based on user-defined input contiguity
- Automatic broadcast support (similar to numpy/pytorch semantics)
- Support for compile time constant shared memory buffers
- Parallelized broadcast support (i.e. block reduction -> block broadcast support)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43129

Reviewed By: mrshenli

Differential Revision: D23162207

Pulled By: soumith

fbshipit-source-id: 16deee4074c64de877eed7c271d6a359927111b2
2020-08-18 09:10:08 -07:00
lixinyu
269fdb5bb2 prepare to split transformer header file (#43069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43069

The transformer C++ impl needs to put TransformerEncoderLayer/DecoderLayer and TransformerEncoder/TransformerDecoder in different headers, since TransformerEncoder/Decoder's options class needs TransformerEncoderLayer/DecoderLayer as an input parameter. Split the header files to avoid cyclic inclusion.

Test Plan: Imported from OSS

Reviewed By: yf225

Differential Revision: D23139437

Pulled By: glaringlee

fbshipit-source-id: 3c752ed7702ba18a9742e4d47d049e62d2813de0
2020-08-17 07:54:05 -07:00
Ann Shan
248b6a30f4 add training mode to mobile::Module (#42880)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42880

Enable switching between and checking for training and eval mode for torch::jit::mobile::Module using train(), eval(), and is_training(), as already exists for torch::jit::Module.

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23063006

Pulled By: ann-ss

fbshipit-source-id: b79002148c46146b6e961cbef8aaf738bbd53cb2
2020-08-17 00:20:03 -07:00
Elias Ellison
91f3114fc1 [JIT] Represent profiled types as a node attribute (#43035)
Summary:
This changes profiled types from being represented as:
`%23 : Float(4:256, 256:1, requires_grad=0, device=cpu) = prim::profile(%0)`
->
`%23 : Tensor = prim::profile[profiled_type=Float(4:256, 256:1, requires_grad=0, device=cpu)](%0)`

Previously, because the profiled type was represented directly in the IR, it was very easy for optimizations to accidentally use profiled types without inserting the proper guards that ensure the specialized type is actually seen.

It would be a nice follow-up to extend this to prim::Guard as well; however, we have short-term plans to get rid of prim::Guard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43035

Reviewed By: ZolotukhinM

Differential Revision: D23120226

Pulled By: eellison

fbshipit-source-id: c78d7904edf314dd65d1a343f2c3a947cb721b32
2020-08-14 20:17:46 -07:00
Shen Li
06aaf8c20d Add set_device_map to TensorPipeOptions to support GPU args (#42637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42637

This commit enables sending non-CPU tensors through RPC using the
TensorPipe backend. Users can configure device mappings by calling
set_device_map on `TensorPipeRpcBackendOptions`. Internally,
the `init_rpc` API verifies the correctness of the device mappings. It
will shut down RPC if the check fails, or proceed and pass the global
mappings to `TensorPipeAgent` if the check succeeds. For serde,
we added a device indices field to the TensorPipe read and write buffers,
which should be either empty (all tensors must be on CPU) or match
the tensors in order and number in the RPC message. This commit
does not yet achieve zero-copy: the tensor is always copied to CPU
on the sender and then moved to the specified device on the receiver.
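
A sketch of how a device mapping might be configured (worker names, ranks, and the device indices are assumptions for illustration):

```python
import torch.distributed.rpc as rpc

# Map local GPU 0 to remote GPU 1 for RPCs sent from "worker0" to "worker1".
options = rpc.TensorPipeRpcBackendOptions()
options.set_device_map("worker1", {0: 1})

rpc.init_rpc(
    "worker0",
    rank=0,
    world_size=2,
    backend=rpc.BackendType.TENSORPIPE,
    rpc_backend_options=options,
)
```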

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D23011572

Pulled By: mrshenli

fbshipit-source-id: 62b617eed91237d4e9926bc8551db78b822a1187
2020-08-14 18:46:55 -07:00
Heitor Schueroff de Souza
3d8c144400 Implemented torch::nn::Unflatten in libtorch (#42613)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42613

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23030302

Pulled By: heitorschueroff

fbshipit-source-id: 954f1cdfcbd3a62a7f0e887fcf5995ef27222a87
2020-08-14 15:32:13 -07:00
Nikita Shulga
2f9fd8ad29 Build test_e2e_tensorpipe only if Gloo is enabled (#43041)
Summary:
test_e2e_tensorpipe depends on ProcessGroupGloo, therefore it cannot be tested with Gloo disabled.
Otherwise, it re-introduces https://github.com/pytorch/pytorch/issues/42776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43041

Reviewed By: lw

Differential Revision: D23122101

Pulled By: malfet

fbshipit-source-id: a8a088b6522a3bc888238ede5c2d589b83c6ea94
2020-08-14 09:24:47 -07:00
Luca Wehrstedt
ed242cbec5 Guard TensorPipe agent by USE_TENSORPIPE (#42682)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42682

ghstack-source-id: 109834351

Test Plan: CI

Reviewed By: malfet

Differential Revision: D22978717

fbshipit-source-id: 18b7cbdb532e78ff9259e82f0f92ad279124419d
2020-08-14 02:57:36 -07:00
taivu
ccd9f3244b Get, save, and load module information for each operator (#42133)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42133

Test Plan:
We save a module with module debugging information as follows.
```
import torch
m = torch.jit.load('./detect.pt')
# Save module without debug info
m._save_for_lite_interpreter('./detect.bc')
# Save module with debug info
m._save_for_lite_interpreter('./detect.bc', _save_debug_info_in_bytecode=True)
```
Size of the file without module debugging information: 4.508 MB
Size of the file with module debugging information: 4.512 MB

Reviewed By: kimishpatel

Differential Revision: D22803740

Pulled By: taivu1998

fbshipit-source-id: c82ea62498fde36a1cfc5b073e2cea510d3b7edb
2020-08-14 01:25:27 -07:00
Vinod Kumar S
830423b80b Python/C++ API Parity: TransformerDecoderLayer (#42717)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42717

Reviewed By: zhangguanheng66

Differential Revision: D23095841

Pulled By: glaringlee

fbshipit-source-id: 327a5a23c9a3cca05e422666a6d7d802a7e8c468
2020-08-13 20:31:13 -07:00
Nick Gibson
6fb5ce5569 [NNC] Fix some bugs in Round+Mod simplification (#42934)
Summary:
When working on the Cuda Codegen, I found that running the IRSimplifier before generating code led to test failures. This was due to a bug in Round+Mod simplification (e.g. (x / y * y) + (x % y) => x) related to the order in which the terms appeared. After fixing it and writing a few tests around those cases, I found another bug in simplification of the same pattern and have fixed it (with some more test coverage).
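
The underlying identity being simplified can be sanity-checked directly (a quick Python check for non-negative integers; this is not the simplifier itself):

```python
# (x / y) * y + (x % y) == x, using integer (floor) division.
for x in range(0, 50):
    for y in range(1, 10):
        assert (x // y) * y + x % y == x
```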

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42934

Reviewed By: zhangguanheng66

Differential Revision: D23085548

Pulled By: nickgg

fbshipit-source-id: e780967dcaa7a5fda9f6d7d19a6b7e7b4e94374b
2020-08-13 09:47:21 -07:00
Bram Wasti
ba9025bc1a [tensorexpr] Autograd for testing (#42548)
Summary:
A simple differentiable abstraction to allow testing of full training graphs.

Included in this 1st PR is an example of trivial differentiation.

If approved, I can add a full MLP and demonstrate convergence using purely NNC (for performance testing) in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42548

Reviewed By: ZolotukhinM

Differential Revision: D23057920

Pulled By: bwasti

fbshipit-source-id: 4a239852c5479bf6bd20094c6c35f066a81a832e
2020-08-13 07:58:06 -07:00
Luca Wehrstedt
8493b0d5d6 Enroll TensorPipe agent in C++-only E2E test (#42680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42680

ghstack-source-id: 109544678

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D22978714

fbshipit-source-id: 04d6d190c240c6ead9bd9f3b7f3a5f964d7451e8
2020-08-13 07:07:30 -07:00
Nick Gibson
aabdef51f9 [NNC] Registerizer for GPU [1/x] (#42606)
Summary:
Adds a new optimization pass, the Registerizer, which looks for common Stores and Loads to a single item in a buffer and replaces them with a local temporary scalar which is cheaper to write.

For example it can replace:
```
A[0] = 0;
for (int x = 0; x < 10; x++) {
  A[0] = (A[0]) + x;
}
```

with:
```
int A_ = 0;
for (int x = 0; x < 10; x++) {
  A_ = x + A_;
}
A[0] = A_;
```

This is particularly useful on GPUs when parallelizing, since after replacing loops with metavars we have a lot of accesses like this. Early tests of simple reductions on a V100 indicate this can speed them up by ~5x.

This diff got a bit unwieldy with the integration code so that will come in a follow up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42606

Reviewed By: bertmaher

Differential Revision: D22970969

Pulled By: nickgg

fbshipit-source-id: 831fd213f486968624b9a4899a331ea9aeb40180
2020-08-11 11:17:50 -07:00
Heitor Schueroff de Souza
ffc3da35f4 Don't materialize output grads (#41821)
Summary:
Added a new option in AutogradContext to tell autograd not to materialize output grad tensors, that is, not to expand undefined/None tensors into tensors full of zeros before passing them as input to the backward function.
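
For illustration, a minimal sketch of the analogous Python-side usage (assuming the `ctx.set_materialize_grads` entry point from the first PR; this commit itself targets the C++ AutogradContext):

```python
import torch

class TwoOutputs(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.set_materialize_grads(False)  # opt out of materializing output grads
        return x * 2, x * 3

    @staticmethod
    def backward(ctx, g1, g2):
        # With materialization off, the grad of an unused output arrives as
        # None instead of a zero-filled tensor.
        grad = None
        if g1 is not None:
            grad = 2 * g1
        if g2 is not None:
            grad = 3 * g2 if grad is None else grad + 3 * g2
        return grad

x = torch.rand(3, requires_grad=True)
a, b = TwoOutputs.apply(x)
a.sum().backward()  # only the first output is used, so g2 is None in backward
```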

This PR is the second part that closes https://github.com/pytorch/pytorch/issues/41359. The first PR is https://github.com/pytorch/pytorch/pull/41490.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41821

Reviewed By: albanD

Differential Revision: D22693163

Pulled By: heitorschueroff

fbshipit-source-id: a8d060405a17ab1280a8506a06a2bbd85cb86461
2020-08-11 04:27:07 -07:00
Nikita Shulga
64a7939ee5 test_cpp_rpc: Build test_e2e_process_group.cpp only if USE_GLOO is true (#42836)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42836

Reviewed By: seemethere

Differential Revision: D23041274

Pulled By: malfet

fbshipit-source-id: 8605332701271bea6d9b3a52023f548c11d8916f
2020-08-10 16:54:26 -07:00
Ann Shan
13bc542829 Fix lite trainer unit test submodule registration (#42714)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42714

Change two unit tests for the lite trainer to register two instances/objects of the same submodule type instead of the same submodule object twice.

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D22990736

Pulled By: ann-ss

fbshipit-source-id: 2bf56b5cc438b5a5fc3db90d3f30c5c431d3ae77
2020-08-07 18:26:56 -07:00
lixinyu
98de150381 C++ API TransformerEncoderLayer (#42633)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42633

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22994332

Pulled By: glaringlee

fbshipit-source-id: 873abdf887d135fb05bde560d695e2e8c992c946
2020-08-07 11:49:42 -07:00
Nick Gibson
944ac133d0 [NNC] Remove VarBinding and go back to Let stmts (#42634)
Summary:
A while back, when commonizing the Let and LetStmt nodes, I ended up removing both and adding a separate VarBinding section to the Block. At the time I couldn't find a counterexample, but I found one today: dependencies between local Vars and Allocations may go in either direction, so we need to support interleaving of those statements.

So, I've removed all the VarBinding logic and reimplemented Let statements. ZolotukhinM, I think you get to say "I told you so". No new tests; existing tests should cover this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42634

Reviewed By: mruberry

Differential Revision: D22969771

Pulled By: nickgg

fbshipit-source-id: a46c5193357902d0f59bf30ab103fe123b1503f1
2020-08-07 10:50:38 -07:00
Luca Wehrstedt
c30bc6d4d7 Update TensorPipe submodule (#42522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42522

Main changes:
- Consolidated CMake files to have a single entry point, rather than having a specialized one for PyTorch.
- Changed the way the preprocessor flags are provided, and changed their name.

There were a few instances in PyTorch's CMake files where we were directly adding TensorPipe's source directory as an include path, which, however, doesn't contain the auto-generated header we now added. We fix that by linking those targets to the `tensorpipe` CMake target instead, so that they pick up the include directories defined by TensorPipe, which do contain that auto-generated header.

I'm turning off SHM and CMA for now because they have never been covered by the CI. I'll enable them in a separate PR so that if they turn out to be flaky we can revert that change without reverting this one.

Test Plan: CI

Reviewed By: malfet

Differential Revision: D22959472

fbshipit-source-id: 1959a41c4a66ef78bf0f3bd5e3964969a2a1bf67
2020-08-06 02:14:58 -07:00
Ilia Cherniavskii
a53fdaa23f Remove ProfiledType (#42570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42570

ProfiledType doesn't do anything and is not used at the moment; removing it.

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D22938664

Pulled By: ilia-cher

fbshipit-source-id: 037c512938028f44258b702bbcde3f8c144f4aa0
2020-08-06 01:52:08 -07:00
Mike Ruberry
ccfce9d4a9 Adds fft namespace (#41911)
Summary:
This PR creates a new namespace, torch.fft (torch::fft), and puts a single function, fft, in it. This function is a simplified version of NumPy's [numpy.fft.fft](https://numpy.org/doc/1.18/reference/generated/numpy.fft.fft.html?highlight=fft#numpy.fft.fft) that accepts no optional arguments. It is intended to demonstrate how to add and document functions in the namespace, and is not intended to deprecate the existing torch.fft function.

Adding this namespace was complicated by the existence of the torch.fft function in Python. Creating a torch.fft Python module makes this name ambiguous: does it refer to a function or module? If the JIT didn't exist, a solution to this problem would have been to make torch.fft refer to a callable class that mimicked both the function and module. The JIT, however, cannot understand this pattern. As a workaround it's required to explicitly `import torch.fft` to access the torch.fft.fft function in Python:

```
import torch.fft

t = torch.randn(128, dtype=torch.cdouble)
torch.fft.fft(t)
```

See https://github.com/pytorch/pytorch/issues/42175 for future work. Another possible future PR is to get the JIT to understand torch.fft as a callable class so it need not be imported explicitly to be used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41911

Reviewed By: glaringlee

Differential Revision: D22941894

Pulled By: mruberry

fbshipit-source-id: c8e0b44cbe90d21e998ca3832cf3a533f28dbe8d
2020-08-06 00:20:50 -07:00
Alexandru Suhan
1848b43c4d [NNC] Add loop unroll transformation (#42465)
Summary:
Unroll a loop with constant boundaries, replacing it with multiple
instances of the loop body. For example:

```
for x in 0..3:
  A[x] = x*2
```

becomes:

```
A[0] = 0
A[1] = 2
A[2] = 4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42465

Test Plan: `test_tensorexpr` unit tests.

Reviewed By: agolynski

Differential Revision: D22914418

Pulled By: asuhan

fbshipit-source-id: 72ca10d7c0b1ac7f9a3688ac872bd94a1c53dc51
2020-08-05 20:46:32 -07:00
Mikhail Zolotukhin
ef50694d44 [TensorExpr] Apply GenericIntrinsicExpander recursively. (#42567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42567

Before this change we didn't expand arguments, and thus in an expr
`sigmoid(sigmoid(x))` only the outer call was expanded.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D22936177

Pulled By: ZolotukhinM

fbshipit-source-id: 9c05dc96561225bab9a90a407d7bcf9a89b078a1
2020-08-05 14:13:46 -07:00