Commit Graph

159 Commits

Author SHA1 Message Date
Horace He
b38f153d91 [nnc] Added NNC lowerings for t/transpose/permute/expand + other cleaning (#57426)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57426

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28293191

Pulled By: Chillee

fbshipit-source-id: b8fc44299acf2569c11e87e1991a2b724434b15d
2021-05-07 15:38:56 -07:00
Elias Ellison
241c2f4496 Add Gelu To NNC (#57753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57753

I'm not adding symbolic gradient because that is being added in https://github.com/pytorch/pytorch/pull/46785.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28262765

Pulled By: eellison

fbshipit-source-id: be365a2d392d7ac4bcc099a184762249ec2e18a6
2021-05-06 16:04:50 -07:00
CodemodService FBSourceClangFormatLinterBot
f1a62264f3 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D28250914

fbshipit-source-id: 8bec4e0806891a045becf59c2d2f44f12bc41926
2021-05-06 05:11:25 -07:00
Horace He
c27428b5e9 [nnc] ported conv2d lowering over (#56875)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56875

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28213450

Pulled By: Chillee

fbshipit-source-id: bacdcec83ec61aba1d55f5e3a16f81d6ada3cff2
2021-05-05 20:54:43 -07:00
Elias Ellison
7627dd568a hardswish reland (#57652)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57652

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D28226724

Pulled By: eellison

fbshipit-source-id: 585a91ffab7a855b5600e79130a37be25ef9b354
2021-05-05 17:21:43 -07:00
Horace He
56211524a7 [NNC] ported over sum and softmax to new scheme (#56775)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56775

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28173905

Pulled By: Chillee

fbshipit-source-id: 865ff71e5a428341d7225f534f7093ef2994fe5a
2021-05-05 17:09:34 -07:00
Mikhail Zolotukhin
e686c66fe7 Reland: [TensorExpr] Add TensorExprKernel::runFast method. (#57552)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57552

This method uses `CodeGen::call_raw` instead of `CodeGen::call`.

Relanding #57328 (the entire stack) which was reverted because I forgot
to guard a new test with `ifdef LLVM`.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28195047

Pulled By: ZolotukhinM

fbshipit-source-id: bcfd3cb5b4f33a149b7549515ffd705e2c4f208f
2021-05-05 09:11:37 -07:00
Shen Li
887d0e5657 Revert D28197820: [JIT][NNC] add hardswish symbolic gradient and NNC lowering
Test Plan: revert-hammer

Differential Revision:
D28197820 (0142fd0b57)

Original commit changeset: 05305d85c5bb

fbshipit-source-id: 2e1d9699515982ba2a9be06e83a2ce043ec857ee
2021-05-05 07:53:30 -07:00
eellison
0142fd0b57 [JIT][NNC] add hardswish symbolic gradient and NNC lowering (#57383)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57383

Notes: I picked up an activation from https://github.com/pytorch/pytorch/issues/56969. You can look at the [activations.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/Activation.cpp#L429) file which has both forward and backward kernel code to help you write the NNC lowering and the symbolic gradient.

I added a test in test_jit_fuser_te for the fusion, and I added an OpInfo and asserted that we expect to see autodiffable nodes to test the symbolic gradient.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28197820

Pulled By: eellison

fbshipit-source-id: 05305d85c5bb0847c8f911b95ba47b137dca7e90
2021-05-04 23:39:59 -07:00
Mikhail Zolotukhin
839d549f8f [JIT] Add a pass for removing a first (self) argument from a graph if it is unused. (#57169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57169

The pass is planned to be used in the AOT pipeline, where we expect input
graphs to be functional. As such, these graphs should not use the 'self'
argument even if it is present, and thus it can be removed safely.
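
For illustration only (not part of this change), a minimal sketch of such a pass using the standard `torch::jit` Graph API; the function name is hypothetical:

```cpp
// Hypothetical sketch: drop the leading 'self' argument when nothing uses it.
#include <torch/csrc/jit/ir/ir.h>

#include <memory>

namespace sketch {

void removeUnusedSelfArgument(std::shared_ptr<torch::jit::Graph>& graph) {
  if (graph->inputs().empty()) {
    return;
  }
  torch::jit::Value* self = graph->inputs()[0];
  // Only safe when the (functional) graph never touches 'self'.
  if (!self->hasUses()) {
    graph->eraseInput(0);
  }
}

} // namespace sketch
```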

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28128328

Pulled By: ZolotukhinM

fbshipit-source-id: a7dfbf7776682826100c8eb0fef982a2e81c2554
2021-05-03 20:02:25 -07:00
Mikhail Zolotukhin
3ad3d8bd3f [JIT] Add a pass for annotating graph with input types derived from sample inputs. (#57076)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57076

This pass is intended to be used in conjunction with the shape propagation
pass: first we use sample inputs to specify shape info for the graph inputs,
and then we run shape-prop to infer the shapes of intermediate values in the
graph.
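
A minimal sketch of how such an annotation step could look, assuming the standard `torch::jit` Graph API (the function name and signature here are hypothetical):

```cpp
// Hypothetical sketch: stamp concrete input types derived from sample tensors
// onto the graph inputs so a later shape-propagation pass has a starting point.
#include <torch/csrc/jit/ir/ir.h>

#include <ATen/ATen.h>

#include <vector>

namespace sketch {

void annotateInputTypes(
    std::shared_ptr<torch::jit::Graph>& graph,
    const std::vector<at::Tensor>& sample_inputs) {
  TORCH_CHECK(graph->inputs().size() == sample_inputs.size());
  for (size_t i = 0; i < sample_inputs.size(); ++i) {
    // TensorType::create captures the dtype, device, and sizes of the sample.
    graph->inputs()[i]->setType(c10::TensorType::create(sample_inputs[i]));
  }
}

} // namespace sketch
```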

Differential Revision: D28048290

Test Plan: Imported from OSS

Reviewed By: astaff

Pulled By: ZolotukhinM

fbshipit-source-id: 778d772e873d59d77af9f669f45dc44b9ee5e443
2021-05-03 20:01:15 -07:00
Mike Ruberry
3018093066 Revert D28110359: [TensorExpr] Add TensorExprKernel::runFast method.
Test Plan: revert-hammer

Differential Revision:
D28110359 (f219ed6627)

Original commit changeset: 4fdffc8196d2

fbshipit-source-id: 3c93a058b5dd7a3b71e399341a408ec74949ef56
2021-05-01 16:16:37 -07:00
Horace He
47e9ec401a [nnc] ported some more ops + added vectors to argvalue (#56766)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56766

Test Plan: Imported from OSS

Reviewed By: desertfire

Differential Revision: D28118331

Pulled By: Chillee

fbshipit-source-id: eb012943ad3b83e72a8cb17b594852164c3f0567
2021-04-30 17:34:49 -07:00
Mikhail Zolotukhin
f219ed6627 [TensorExpr] Add TensorExprKernel::runFast method. (#57328)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57328

This method uses `CodeGen::call_raw` instead of `CodeGen::call`.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28110359

Pulled By: ZolotukhinM

fbshipit-source-id: 4fdffc8196d24fc3300a9b4bc69f67562042a045
2021-04-30 15:26:18 -07:00
Horace He
3a923a555a [NNC] moved lowerings out of the TensorExprKernel and into independent functions (#56679)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56679

moved lowerings out of the TensorExprKernel and into independent functions

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28082921

Pulled By: Chillee

fbshipit-source-id: af530510957ed4aa8b64dcc77ca36b69866d8000
2021-04-29 05:46:50 -07:00
Nikita Shulga
eac02f85cf Fix more clang-tidy errors (#57235)
Summary:
In my last PR I missed the CUDA and distributed folders; fixing this now.
This change is autogenerated by `python tools/clang_tidy.py -s`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57235

Reviewed By: janeyx99

Differential Revision: D28084444

Pulled By: malfet

fbshipit-source-id: bf222f69ee90c7872c3cb0931e8cdb84f0cb3cda
2021-04-28 23:29:10 -07:00
Nikita Shulga
4cb534f92e Make PyTorch code-base clang-tidy compliant (#56892)
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname,"-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit","--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892

Reviewed By: H-Huang

Differential Revision: D27991944

Pulled By: malfet

fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
2021-04-28 14:10:25 -07:00
Mikhail Zolotukhin
f3743f097f [TensorExpr] Nuke tensorexpr::ScalarType and instead use c10::ScalarType directly. (#56825)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56825

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27977461

Pulled By: ZolotukhinM

fbshipit-source-id: f8a72938ba395e426e2d9449627113abb1c9c34f
2021-04-26 01:51:21 -07:00
Mikhail Zolotukhin
441c835733 [TensorExpr] Remove unused field from TensorExprKernel. (#56761)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56761

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27960594

Pulled By: ZolotukhinM

fbshipit-source-id: 8f2bf1d688422363b97f48045ff96601665301f5
2021-04-26 01:51:19 -07:00
Horace He
bcef7ebd60 [NNC] Added matmul for NNC lowering/unified dtypes (#56456)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56456

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27977532

Pulled By: Chillee

fbshipit-source-id: c04372d988c8ef795f27037348a155894c2eddad
2021-04-24 19:15:16 -07:00
Horace He
7c50852a60 moved more lowerings over (#55372)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55372

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27884601

Pulled By: Chillee

fbshipit-source-id: 91b00182abb5dcf60209425d2717fa0303cb4932
2021-04-23 00:08:26 -07:00
Horace He
b66a1e00a6 [NNC] added skeleton for refactoring (#55371)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55371

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D27616418

Pulled By: Chillee

fbshipit-source-id: 8187a0cb2495b6bec07bb5992e352da3ffb299fb
2021-04-21 04:07:01 -07:00
Bert Maher
c91c4a081d [NNC] Horizontally fuse all loops (#56324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56324

Inlining is great if LLVM's CSE kicks in; but if a kernel has multiple outputs
(and thus multiple loops), CSE has no chance.

So, this pass "horizontally" fuses the output loops together so that CSE can go
to town. Essentially we want to turn
```
for (...) {
  output_1[] = some_complicated_expr...
}
for (...) {
  output_2[] = some_complicated_expr...
}
```

Into:
```
for (...) {
  output_1[] = complicated_expr
  output_2[] = complicated_expr. // llvm cse should take care of this
}
```

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27841194

Pulled By: bertmaher

fbshipit-source-id: 54153bb59786be87183c636d64f05963c4b1624a
2021-04-20 23:54:40 -07:00
Mikhail Zolotukhin
85126629a5 [TensorExpr] Add support for constant tensors in tensorexpr kernel. (#56319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56319

With this change the TorchScript graph can have constant tensors in it
and we will still be able to lower it to TE. The constants are
registered (or bound) within the `TensorExprKernel` object and, when the
codegen is called, they are passed along with the usual inputs and outputs.
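
A minimal standalone sketch of the binding idea (hypothetical names, not the actual `TensorExprKernel` internals): constants are bound once, and their buffers ride along with the regular arguments on every call.

```cpp
#include <vector>

// Hypothetical sketch of a kernel that appends bound constants to its call args.
class KernelSketch {
 public:
  // Called once when a constant tensor is discovered in the graph.
  void bindConstant(const void* data) {
    constants_.push_back(data);
  }

  // Called at run time: the caller passes inputs/outputs, constants are appended.
  std::vector<const void*> buildCallArgs(
      const std::vector<const void*>& inputs_and_outputs) const {
    std::vector<const void*> args = inputs_and_outputs;
    args.insert(args.end(), constants_.begin(), constants_.end());
    return args;
  }

 private:
  std::vector<const void*> constants_;
};
```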

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27838747

Pulled By: ZolotukhinM

fbshipit-source-id: 4a519d66fcc07fe5fa53f5cf9af28d25611f8437
2021-04-17 11:15:35 -07:00
Mikhail Zolotukhin
dd9ef529ba [TensorExpr] TensorExprKernel: switch type of tensors_ from Tensor to Buf. (#56318)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56318

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27838748

Pulled By: ZolotukhinM

fbshipit-source-id: 371a454912be76889999eda79e60d8154b749134
2021-04-17 11:14:26 -07:00
Bert Maher
928a4733af [nnc] Only lower float conv2d's (#56289)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56289

While there's no reason to think non-float32 conv2d's *don't* work,
they're only tested in float32 now.  Since that's the most important use case,
I'd rather restrict the dtypes than spend time testing all the weird dtype
combinations that could possibly happen.
ghstack-source-id: 126755549

Test Plan: unit tests

Reviewed By: navahgar

Differential Revision: D27828495

fbshipit-source-id: fcf179207f2c9b20e0e86eb2b85687517d87063c
2021-04-17 05:12:54 -07:00
Mikhail Zolotukhin
5f19385588 [TensorExpr] Add aten::matmuls to TE fuser. (#54605)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54605

For small sizes we generate a naive 3-layer loopnest; for bigger sizes
we generate an external call.
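
For reference, the "naive 3-layer loopnest" corresponds to something like the following (illustrative C++, not the generated TE IR):

```cpp
// Naive 3-loop matmul: C (MxN) = A (MxK) * B (KxN).
void naive_matmul(const float* A, const float* B, float* C, int M, int N, int K) {
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc = 0.f;
      for (int k = 0; k < K; ++k) {
        acc += A[m * K + k] * B[k * N + n];
      }
      C[m * N + n] = acc;
    }
  }
}
```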

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27298364

Pulled By: ZolotukhinM

fbshipit-source-id: 2ddf275ff68d6fca16a3befca5ce5c26aef462b5
2021-04-16 12:54:38 -07:00
Natalia Gimelshein
506eca24b9 Revert D27752279: [nnc] Do not try to vectorize kernels that use float16
Test Plan: revert-hammer

Differential Revision:
D27752279 (8df5e61fd6)

Original commit changeset: ac115080bf2a

fbshipit-source-id: cbc0aa2dcb7691d9fc9d081c6169dea711cd9fac
2021-04-14 20:21:40 -07:00
Bert Maher
8df5e61fd6 [nnc] Do not try to vectorize kernels that use float16 (#55970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55970

LLVM's support for float16 is not great, and we were seeing assertion
failures trying to generate code for vectorized uses.  I note that clang
doesn't even try to vectorize operations involving half:
https://gcc.godbolt.org/z/86MW4xr17, so that's a good sign we shouldn't either.

Fixes #55905
ghstack-source-id: 126511474

Test Plan: pytest test_jit_fuser_te.py -k test_isnan

Reviewed By: asuhan

Differential Revision: D27752279

Pulled By: bertmaher

fbshipit-source-id: ac115080bf2a4a73d52b396d64a5bce0cf13abfe
2021-04-14 11:28:34 -07:00
Mikhail Zolotukhin
7ab654afd7 [TensorExpr] Rename Tensor::call to Tensor::load to be consistent with Buf and Placeholder. (#55826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55826

It's a mechanical change.

Differential Revision: D27717777

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: fbc1bb99602250c706cf2c8c2684119c323e4d51
2021-04-13 12:08:53 -07:00
Mikhail Zolotukhin
1263448cb2 [TensorExpr] Remove mask field from Load and Store classes. (#55825)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55825

The mask has never been used (in vectorization we generate an explicit
`IfThenElse` construct when we need to mask out some elements). The PR
removes it and cleans up all its traces from tests.

Differential Revision: D27717776

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 41d1feeea4322da75b3999d661801c2a7f82b9db
2021-04-13 12:08:51 -07:00
Bert Maher
42486963b2 Integrate NNC conv2d with fuser (#55213)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55213

Adds the integration of conv2d with the TE fuser.  A few things of interest:

- I'm *super* selective of what convs get lowered.  Only 3x3 depthwise, because
  I've benchmarked those to death and I'm pretty sure it's a good change.

- I'm allowing single-node "fusion" groups for supported convs.  (Maybe this is
  a sign that conv2d codegen should go through a different path entirely, but
  it seems to basically work).

I'll share full benchmark results once I clean them up a little.  To
summarize, I tested the following torchvision models containing depthwise
convolutions.  Results are single-core on a skylake-avx512:

mobilenet_v2: 8% improvement
mobilenet_v3: 9% improvement
mnasnet: 10% improvement
shufflenet: 18% improvement

Note these are comparing against a baseline with a fast-but-buggy grouped
convolution implementation in MKLDNN.  So perf results will be better if
compared on master, but I'm going to assume the MKLDNN bug will be fixed and
re-enabled.

Perf results are more complicated when comparing to freezing plus conversion to
mkldnn layout; mobilenet v2/v3 are still faster, but mnasnet and shufflenet are
not.  Landing this doesn't prevent MKLDNN freezing from kicking in though, so
there's no harm (although landing mkldnn freezing will regress mobilenet, but
cest la vie).
ghstack-source-id: 126076112

Test Plan: New unit test, plus torchvision

Reviewed By: ZolotukhinM

Differential Revision: D27530272

fbshipit-source-id: 92153fad234bc9f1eaa4f7624c543168d1294a87
2021-04-08 21:58:27 -07:00
Nikita Shulga
6a39613f35 [BE] Make torch/csrc/jit/tensorexpr/ clang-tidy clean (#55628)
Summary:
Mostly auto-generated changes using
```
 python3 tools/clang_tidy.py -c build -x torch/csrc/jit/tensorexpr/eval.cpp -s
```
With the following common patterns fixed manually (see the sketch below):
- Use ` = default` instead of `{}`
- deleted methods should be public
- Use pass-by-value + std::move instead of pass-by-reference+copy
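
A sketch of those patterns (hypothetical class, not taken from the diff):

```cpp
#include <string>
#include <utility>

class Example {
 public:
  // Use `= default` instead of an empty body `{}`.
  Example() = default;

  // Deleted methods should be public.
  Example(const Example&) = delete;
  Example& operator=(const Example&) = delete;

  // Pass by value + std::move instead of pass-by-reference + copy.
  explicit Example(std::string name) : name_(std::move(name)) {}

 private:
  std::string name_;
};
```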

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55628

Reviewed By: walterddr

Differential Revision: D27655378

Pulled By: malfet

fbshipit-source-id: 92be87a08113435d820711103ea9b0364182c71a
2021-04-08 19:44:14 -07:00
Mike Ruberry
c0ac0fef4e Revert D27448156: irange for size_t
Test Plan: revert-hammer

Differential Revision:
D27448156 (041b4431b2)

Original commit changeset: 585da57d4de9

fbshipit-source-id: 8e047c29f391c0166e0a1a87c3fb2a0854377365
2021-04-03 19:14:00 -07:00
Richard Barnes
041b4431b2 irange for size_t (#55163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55163

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27448156

fbshipit-source-id: 585da57d4de91c692b6360d65f7b8a66deb0f8c1
2021-04-02 23:22:29 -07:00
Mikhail Zolotukhin
e8dbd0e1a0 [TensorExpr] Minor cleanups in kernel.cpp. (#55257)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55257

Test Plan: Imported from OSS

Reviewed By: asuhan

Differential Revision: D27544659

Pulled By: ZolotukhinM

fbshipit-source-id: c2f51be1a42df090a105689c8e3e91446e9ea8b4
2021-04-02 21:47:48 -07:00
Bert Maher
8e89d30f09 [nnc] Lower scalar constants as doubles/longs (#54824)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54824

Test Plan: Imported from OSS

Reviewed By: asuhan

Differential Revision: D27383224

Pulled By: bertmaher

fbshipit-source-id: 84b43ba6c22c1338c68c40a11ca647c3717f2abc
2021-03-29 14:06:04 -07:00
Mikhail Zolotukhin
1ceb90405b [TensorExpr] Add plumbing for conv2d fusion. (#54439)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54439

For now the only way to represent conv2d in TE is via an external call,
and since the aten library doesn't have an out variant for conv2d, the
external call has to perform an extra copy. Because of that, fusing
conv2d currently regresses performance and hence is disabled. However, in the
near future we should have two alternative ways to enable it:
1) represent conv2d natively in TE (without an external call)
2) add an out variant for conv2d

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27237045

Pulled By: ZolotukhinM

fbshipit-source-id: f5545ff711b75f9f37bc056316d1999a70043b4c
2021-03-24 18:49:07 -07:00
Hui Guo
2a53897114 [jit][tensorexpr] Added aten::batch_norm into fuser when in inference mode (#54204)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54204

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27134348

Pulled By: huiguoo

fbshipit-source-id: 5ea7a6c5bc694fcdfc436dba3fa6eb269420324e
2021-03-23 04:41:52 -07:00
Raghavan Raman
a5e19126b6 [NNC] LoopNest cleanup (#53688)
Summary:
* Replacing vector of Tensors with a set of output buffers in `TensorExprKernel`.
* Creating a block statement while compiling in `TensorExprKernel`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53688

Reviewed By: mrshenli

Differential Revision: D26941222

Pulled By: navahgar

fbshipit-source-id: 9eb81ec2effcdeafbeaa67d1e12475166054f80f
2021-03-10 20:20:03 -08:00
Raghavan Raman
d3cde6c23c [NNC] Implementation for aten::cat without conditionals. (#53128)
Summary:
This PR adds an implementation for `aten::cat` in NNC without any conditionals. This version is not enabled by default.

Here is the performance of some micro benchmarks with and without conditionals. There is up to 50% improvement in performance without conditionals for some of the shapes.

aten::cat implementation in NNC **with** conditionals
```
$ python -m benchmarks.tensorexpr --device cpu --mode fwd --jit_mode trace --cpu_fusion concat
pt: concat2d2input_fwd_cpu_1_160_1_14_1: 5.44 us, SOL 0.26 GB/s, algorithmic 0.51 GB/s
pt: concat2d2input_fwd_cpu_1_580_1_174_1: 5.75 us, SOL 1.05 GB/s, algorithmic 2.10 GB/s
pt: concat2d2input_fwd_cpu_20_160_20_14_1: 6.87 us, SOL 4.05 GB/s, algorithmic 8.11 GB/s
pt: concat2d2input_fwd_cpu_20_580_20_174_1: 14.52 us, SOL 8.31 GB/s, algorithmic 16.62 GB/s
pt: concat2d2input_fwd_cpu_8_512_8_512_1: 9.58 us, SOL 6.84 GB/s, algorithmic 13.68 GB/s
```
aten::cat implementation in NNC **without** conditionals
```
$ python -m benchmarks.tensorexpr --device cpu --mode fwd --jit_mode trace --cpu_fusion --cat_wo_conditionals concat
pt: concat2d2input_fwd_cpu_1_160_1_14_1: 4.67 us, SOL 0.30 GB/s, algorithmic 0.60 GB/s
pt: concat2d2input_fwd_cpu_1_580_1_174_1: 5.65 us, SOL 1.07 GB/s, algorithmic 2.14 GB/s
pt: concat2d2input_fwd_cpu_20_160_20_14_1: 6.10 us, SOL 4.56 GB/s, algorithmic 9.12 GB/s
pt: concat2d2input_fwd_cpu_20_580_20_174_1: 7.44 us, SOL 16.22 GB/s, algorithmic 32.44 GB/s
pt: concat2d2input_fwd_cpu_8_512_8_512_1: 6.46 us, SOL 10.14 GB/s, algorithmic 20.29 GB/s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53128

Reviewed By: bertmaher

Differential Revision: D26758613

Pulled By: navahgar

fbshipit-source-id: 00f56b7da630b42bc6e7ddd4444bae0cf3a5780a
2021-03-07 22:57:02 -08:00
Mikhail Zolotukhin
8bac382d9d [TensorExpr] Remove unused classes from TensorExprKernel. (#53283)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53283

We had `ShapeArg` and `KernelArg` classes, which were wrappers over
`BufferArg` without adding any new functionality on top of what already
existed. This PR removes them and replaces their uses with `BufferArg`s
directly.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26821993

Pulled By: ZolotukhinM

fbshipit-source-id: d1f95ea069b9f38f1d32424464551df2565b3c49
2021-03-04 21:24:29 -08:00
Raghavan Raman
d382693263 [NNC] Build aggregate stmt for kernel before LoopNest. (#53024)
Summary:
This PR builds an aggregate stmt for all the tensors in the kernel before constructing the LoopNest. This migrates to using the LoopNest constructor that takes in a stmt and output buffers, and is one step closer to eliminating the dependency of LoopNest on Tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53024

Reviewed By: H-Huang

Differential Revision: D26729221

Pulled By: navahgar

fbshipit-source-id: 43e972585351f6902c14b383b137aaaee3aaa3e1
2021-03-02 00:51:56 -08:00
Elias Ellison
43f56e19a6 [NNC] Make NNC sanitize input names (#52786)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52786

Previously, NNC did not sanitize input names. I ran into this in the next PR, where making subgraph creation preserve debug names caused a number of NNC CUDA failures. I also previously ran into this with some masked_fill failures internally, which led me to disable that operator.
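
A minimal sketch of what name sanitization can look like (illustrative only, not the actual NNC implementation): map anything outside `[A-Za-z0-9_]` to `_` and avoid a leading digit.

```cpp
#include <cctype>
#include <string>

std::string sanitizeName(const std::string& input) {
  std::string out;
  out.reserve(input.size());
  for (char c : input) {
    const bool ok = std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    out.push_back(ok ? c : '_');
  }
  if (out.empty() || std::isdigit(static_cast<unsigned char>(out[0]))) {
    out.insert(out.begin(), 'v');  // identifiers should not start with a digit
  }
  return out;
}
```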

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26696699

Pulled By: eellison

fbshipit-source-id: 7c3af4d559d58762fb8332666784a4d5cd6a4167
2021-03-01 21:22:16 -08:00
Richard Barnes
26419815af Modernize for-loops (#52330)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52330

Test Plan: Sandcastle

Reviewed By: mruberry

Differential Revision: D26001961

fbshipit-source-id: e75cc8f1a8d30917b4d55df9e1a3c7836c271820
2021-02-23 17:32:33 -08:00
Raghavan Raman
09c56ef45e Remove DepTracker from LoopNest (#52405)
Summary:
Remove the dependency tracker that works on Tensors, DepTracker, from LoopNest. This is essential to the goal of removing Tensors from LoopNest.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52405

Reviewed By: heitorschueroff

Differential Revision: D26548621

Pulled By: navahgar

fbshipit-source-id: b20f23d608c19ac71aebd31c14777d653eead36c
2021-02-22 12:48:07 -08:00
Hui Guo
d8b28579c3 Add NNC support for aten::hardtanh (a hot operation in mobilenet v2/v3) (#52394)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52394

Test Plan:
Imported from OSS

test/test_tensorexpr.py
test/test_jit_fuser_te.py

Reviewed By: bertmaher

Differential Revision: D26497856

Pulled By: huiguoo

fbshipit-source-id: 8558f89826cad250da6f970bfc49384f2b9d7ee0
2021-02-18 22:56:03 -08:00
Raghavan Raman
c7a70eec1b Make LLVM the default backend for TE (#52314)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52264

When CPU fusion is enabled without LLVM support in PyTorch, it causes a huge slowdown (> 50x). This PR makes the LLVM backend the default backend for TE. Now, an error is reported if CPU fusion is enabled without LLVM support, to avoid this performance regression.

This PR also updates the tests to not use LLVM, so that the old flow is continued. This is necessary because tests run in CI do not have LLVM.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52314

Reviewed By: ejguan

Differential Revision: D26491294

Pulled By: navahgar

fbshipit-source-id: 74561db1207da805d6d28039450db046ba2988fb
2021-02-18 12:00:38 -08:00
Scott Wolchok
7328710cbc [PyTorch][codemod] Replace immediately-dereferenced cast calls w/castRaw (#50229)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50229

`fastmod -m 'cast(<((at|c10)::)?\w+Type>\(\)\s*)->' 'castRaw${1}->'`

Presuming it builds, this is a safe change: the result of `cast()` wasn't being
saved anywhere, so we didn't need it, and we can use a raw pointer instead of a
new `shared_ptr`.
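
An illustrative call site for the pattern (hypothetical helper, assuming the type is known to be a `TensorType`):

```cpp
#include <ATen/core/jit_type.h>

bool isCompleteTensor(const c10::TypePtr& t) {
  // Before the codemod, the temporary shared_ptr from cast<>() was dereferenced
  // and immediately discarded:
  //   return t->cast<c10::TensorType>()->isComplete();
  // castRaw<>() returns a raw pointer instead, skipping the refcount churn.
  return t->castRaw<c10::TensorType>()->isComplete();
}
```
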
ghstack-source-id: 120769170

Test Plan: CI

Reviewed By: SplitInfinity

Differential Revision: D25837494

fbshipit-source-id: 46319100dc0dfc78f6d2b45148207f83481f2ada
2021-02-01 23:12:07 -08:00
Mikhail Zolotukhin
e975169426 [TensorExpr] Redesign Tensor class. (#50995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50995

This change makes 'Tensor' a thin wrapper over 'Buf' and 'Stmt', and
merges it with the recently introduced 'CompoundTensor'. A statement for the
tensor is either passed directly to the Tensor constructor (akin to
'CompoundTensor') or is built immediately in the constructor.

LoopNest is no longer responsible for constructing statements from
tensors - it simply stitches already constructed statements contained in
Tensors. This has the side effect that we can no longer construct several
loopnests from the same tensors - we need to explicitly clone statements
if we want to do that. A special copy constructor was added to LoopNest
to make it more convenient (note: this only affects tests, we don't
usually create multiple loopnests in other places).

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26038223

Pulled By: ZolotukhinM

fbshipit-source-id: 27a2e5900437cfb0c151e8f89815edec53608e17
2021-01-27 16:14:22 -08:00