Commit Graph

40 Commits

Author SHA1 Message Date
Nikita Shulga
3a66a1cb99 [clang-tidy] Exclude cppcoreguidelines-avoid-magic-numbers (#57841)
Summary:
Add cppcoreguidelines-avoid-magic-numbers exclusion to clang-tidy
Remove existing NOLINT suppressions using the following script:
```
for file in `git ls-files | grep -v \.py`; do gsed '/^ *\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)/d' -i  $file; done
```
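
For context, here is an illustrative before/after of the kind of suppression the sed command above deletes; the constant names below are hypothetical, not taken from the PyTorch sources:
```
// Before this change, silencing the check required a NOLINT stub:
// NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
constexpr int kDefaultBlockSize = 256;

// With the check excluded in .clang-tidy, the literal needs no suppression:
constexpr int kDefaultNumWarps = 8;
```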

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57841

Reviewed By: samestep

Differential Revision: D28295045

Pulled By: malfet

fbshipit-source-id: 7c6e8d1213c9593f169ed3df6a916498f1a97163
2021-05-07 20:02:33 -07:00
Nikita Shulga
4cb534f92e Make PyTorch code-base clang-tidy compliant (#56892)
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

# Collect the relative paths of all files present in the compilation database.
def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

# Run clang-tidy on a single file and commit any NOLINT stubs it introduces.
def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname,"-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit","--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892

Reviewed By: H-Huang

Differential Revision: D27991944

Pulled By: malfet

fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
2021-04-28 14:10:25 -07:00
Bert Maher
c42dd8b257 Revert "Use at::cpu in bench_approx (#56563)" (#56816)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56816

This doesn't actually work.  For some reason the linker can't find
at::cpu::logit_out, and it's not worth digging into why not.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27977406

Pulled By: bertmaher

fbshipit-source-id: d0235a393f25243e2c8a011e9baf267daf483ae4
2021-04-26 23:51:49 -07:00
Bert Maher
461e887d92 CPU Convolution benchmark harness for some popular models (#56455)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56455

CPU convolution performance is pretty important for inference, so
tracking performance for CNNs often boils down to finding shapes that have
either regressed or need optimization.  This diff adds a benchmark harness that
lets you pretty easily add new sets of convolution parameters to benchmark.

I've started with an exhaustive list of layers from MobileNetV3, ResNet-18 and
ResNet-50, which are fairly popular torchvision models.  More to come if these
prove useful.

I've also added four backend configurations:

- native: uses at::conv2d, which applies its own backend selection heuristics
- mkldnn_none: uses mkldnn but applies no prepacking; uses the NCHW default
- mkldnn_weight: prepacks weights in an mkldnn-friendly format
- mkldnn_input: also prepacks the inputs in NCHW16c
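
As a rough sketch of the "native" configuration above (not the actual harness added in this diff), a single parameter set could be timed against at::conv2d as follows, assuming libtorch is available; the struct and shape names are illustrative only:
```
#include <torch/torch.h>
#include <chrono>
#include <cstdio>

// Hypothetical parameter set; the real harness covers MobileNetV3/ResNet layers.
struct ConvParams {
  int64_t n, c_in, h, w, c_out, kh, kw, stride, pad;
};

void bench_native_conv(const ConvParams& p, int iters = 100) {
  torch::NoGradGuard no_grad;
  auto input = torch::randn({p.n, p.c_in, p.h, p.w});
  auto weight = torch::randn({p.c_out, p.c_in, p.kh, p.kw});
  auto bias = torch::randn({p.c_out});
  // Warm-up so backend selection and packing are not timed.
  auto out = torch::conv2d(input, weight, bias, {p.stride, p.stride}, {p.pad, p.pad});
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    out = torch::conv2d(input, weight, bias, {p.stride, p.stride}, {p.pad, p.pad});
  }
  auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                std::chrono::steady_clock::now() - start).count();
  std::printf("conv2d: %.1f us/iter\n", static_cast<double>(us) / iters);
}
```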
ghstack-source-id: 127027784

Test Plan: Ran this on my Skylake Xeon

Reviewed By: ngimel

Differential Revision: D27876139

fbshipit-source-id: 950e1dfa09a33cc3acc7efd579f56df8453af1f2
2021-04-22 22:14:36 -07:00
Bert Maher
57cba8e601 Use at::cpu in bench_approx (#56563)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56563

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27902737

Pulled By: bertmaher

fbshipit-source-id: 66962671afbb093d5ae0b9308a401536c06ce8f5
2021-04-21 22:56:07 -07:00
Ailing Zhang
f096245610 AutoNonVariableTypeMode->InferenceMode in OSS. (#56421)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56421

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27866609

Pulled By: ailzhang

fbshipit-source-id: 040991a031c5511501b03cfe21a4a636586e120e
2021-04-19 18:07:41 -07:00
Raghavan Raman
164de39a11 Fix build failure due to namespace change for log_out and tanh_out (#56278)
Summary:
There is a build failure in `bench_approx.cpp` due to a namespace change for `log_out` and `tanh_out`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56278

Reviewed By: bertmaher, nikithamalgifb

Differential Revision: D27825621

Pulled By: navahgar

fbshipit-source-id: 0bccd324af92a3460610bf475514449f0223de2b
2021-04-16 13:34:32 -07:00
Mikhail Zolotukhin
7ab654afd7 [TensorExpr] Rename Tensor::call to Tensor::load to be consistent with Buf and Placeholder. (#55826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55826

It's a mechanical change.

Differential Revision: D27717777

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: fbc1bb99602250c706cf2c8c2684119c323e4d51
2021-04-13 12:08:53 -07:00
Mikhail Zolotukhin
1263448cb2 [TensorExpr] Remove mask field from Load and Store classes. (#55825)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55825

The mask has never been used (in vectorization we generate an explicit
`IfThenElse` construct when we need to mask out some elements). The PR
removes it and cleans up all its traces from tests.

Differential Revision: D27717776

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 41d1feeea4322da75b3999d661801c2a7f82b9db
2021-04-13 12:08:51 -07:00
Mikhail Zolotukhin
754b0d073a [TensorExpr] Unbreak benchmarks. (#55824)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55824

Seemingly some of my last changes (namely, removing dep-tracker) broke
the TE benchmarks. This PR fixes it.

Differential Revision: D27717778

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 48584bc0cfd4879a3e44cb45ee1f0d5c91b5afbc
2021-04-13 12:08:50 -07:00
Mikhail Zolotukhin
b01a15d3d3 [TensorExpr] Redesign Rfactor loopnest transformation. (#55324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55324

With this change, `rfactor` only affects the passed loop and its body, never
touching anything outside it (which was the root cause of a bug in the
previous implementation). We also no longer have an `insertion_point`
parameter - its meaning was vague, and its effect should be achievable with
other transformations anyway.

The new `rfactor` semantics is as follows:

```
Requirements:
 * S is the reduction store
 * S is the only statement in the innermost loop
 * There are at least two reduction arguments in S
 * OUTER_REDUCTION_FOR loop corresponds to the outermost reduction variable
 used in the store and all other reduction variables are index variables of
 children loops of OUTER_REDUCTION_FOR
 * OUTER_REDUCTION_FOR is a perfect loop nest, i.e. it has only loops
 corresponding to the other reduction variables and the store, nested into
 each other

What it does:
  * Introduce a new buffer with an extra dimension of a size equal to the
  span of the loop OUTER_REDUCTION_FOR (the new buffer is returned via
  RFAC_BUF_PTR)
  * Insert an initialization store for the new buffer in
  OUTER_REDUCTION_FOR before its nested loop
  * Replace the reduction store to the original buffer with the reduction
  store to the temp buffer, removing the index var of OUTER_REDUCTION_FOR
  from reduction arguments
  * Insert a final reduction store over the extra dimension of the new
  buffer to the original buffer
  * Returns TRUE if the transformation succeeded and FALSE otherwise

Example:
Original IR:
S1: for i        # normal axis
S2:   X[i] = 0
S3:   for j      # reduction axis
S4:     for k    # reduction axis
S5:       X[i] = ReduceOp(X[i] + Y[i,j,k], reduce_axis={j,k})

After RFACTOR(S5, S3)
S1: for i               # normal axis
S2:   X[i] = 0
S3:   for j             # reduction axis for X, normal axis for X_rfac
        X_rfac[i,j] = 0
S4:     for k           # reduction axis
          X_rfac[i,j] = ReduceOp(X_rfac[i,j] + Y[i,j,k], reduce_axis={k})
        X[i] = ReduceOp(X[i] + X_rfac[i,j], reduce_axis={j})
```

Differential Revision: D27694960

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 076fa6a1df2c23f5948302aa6b43e82cb222901c
2021-04-13 12:08:48 -07:00
Ailing Zhang
24c904951c Replace AutoNonVariableTypeMode with InferenceMode in fbcode. (#55114)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55114

Test Plan: CI

Reviewed By: ezyang, bhosmer

Differential Revision: D27472768

fbshipit-source-id: 76f17ef7de40f6e04e2968f8958027b5f93e1c0c
2021-04-02 11:45:53 -07:00
Mikhail Zolotukhin
688e350725 [TensorExpr] Nuke DepTracker and findAllNeededTensors. (#54997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54997

DepTracker was used to automatically pull in dependent computations from the
output ones. While it seemed quite convenient, it led to several
architectural issues, which are fixed in this stack.

DepTracker worked on Tensors, where a Tensor is a pair of a Buf and a Stmt.
However, the Stmt could become stale, and there was no way to reliably update
the corresponding Tensor. We now use Bufs and Stmts directly and are moving
away from Tensors to avoid these problems.

Removing DepTracker also allowed us to unify Loads and FunctionCalls, which
were essentially duplicates of each other.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27446414

Pulled By: ZolotukhinM

fbshipit-source-id: a2a32749d5b28beed92a601da33d126c0a2cf399
2021-04-01 19:46:26 -07:00
Wenlei Xie
53596cdb73 Remove hacky wrapper for about 100 kernels (#54367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54367

Codemod commands generated by https://github.com/pytorch/pytorch/pull/54098
ghstack-source-id: 124804544

Test Plan: buck build //caffe2/aten/...

Reviewed By: smessmer

Differential Revision: D27210057

fbshipit-source-id: 368dc77843468cfc44535488a040dbc2cb67208d
2021-03-25 10:00:16 -07:00
Hui Guo
2a53897114 [jit][tensorexpr] Added aten::batch_norm into fuser when in inference mode (#54204)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54204

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27134348

Pulled By: huiguoo

fbshipit-source-id: 5ea7a6c5bc694fcdfc436dba3fa6eb269420324e
2021-03-23 04:41:52 -07:00
Xiaoqiang Zheng
9f86b656ba Resubmit: Adding parallel support for the LLVM backend. (#54122)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54122

Test Plan:
* USE_TBB=1 ATEN_THREADING=TBB python setup.py develop --cmake
* USE_TBB=1 ATEN_THREADING=NATIVE python setup.py develop --cmake
* USE_TBB=1 ATEN_THREADING=OMP python setup.py develop --cmake
* cd build; ninja bin/tensorexpr_bench
* bin/test_tensorexpr --gtest_filter="*Parallel*"

Reviewed By: bertmaher

Differential Revision: D27109802

Pulled By: zheng-xq

fbshipit-source-id: db159466d0b46357bcf0fbefb36094bee312368c
2021-03-18 07:19:37 -07:00
Nikita Shulga
d57ae6c46d Revert D26906509: Adding parallel support for the LLVM backend.
Test Plan: revert-hammer

Differential Revision:
D26906509 (95d2318510)

Original commit changeset: 12c17f2f21af

fbshipit-source-id: cc86d0dfca0dd791b31bda23a0172fc1cfd89760
2021-03-11 17:54:47 -08:00
Xiaoqiang Zheng
95d2318510 Adding parallel support for the LLVM backend. (#53243)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53243

Test Plan: Imported from OSS

Reviewed By: bertmaher, Chillee

Differential Revision: D26906509

Pulled By: zheng-xq

fbshipit-source-id: 12c17f2f21af11e73fa4c5b5199043a7a15ecdec
2021-03-11 03:27:37 -08:00
Sam Estep
8c798e0622 Forbid trailing whitespace (#53406)
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857

These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
  - `GLOSSARY.md`
  - `aten/src/ATen/core/op_registration/README.md`
  - `scripts/README.md`
  - `torch/csrc/jit/codegen/fuser/README.md`

The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```

I looked over the auto-generated changes and didn't see anything that looked problematic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406

Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377

This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348

Reviewed By: walterddr, seemethere

Differential Revision: D26856620

Pulled By: samestep

fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
2021-03-05 17:22:55 -08:00
Raghavan Raman
8af648354f [nnc] Benchmarks for concat (#52592)
Summary:
This PR adds a C++ benchmark for "concat" with three different versions: 1) aten::cat, 2) an NNC implementation using if-then-else, and 3) an NNC implementation using multiple loops. It also adds a Python benchmark for "concat", which can now be invoked with and without CPU fusion.
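
To make the difference between versions 2 and 3 concrete, here is a rough scalar sketch (not the generated NNC code) of concatenating two 2-D inputs along dim 1, first with a single loop nest plus if-then-else, then with one loop per input:
```
// Sketch only: concat A (M x K1) and B (M x K2) into OUT (M x (K1 + K2)).

// Version 2 idea: one loop nest over the output, selecting the source per column.
void concat_if_then_else(const float* A, const float* B, float* OUT,
                         int M, int K1, int K2) {
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < K1 + K2; ++j) {
      OUT[i * (K1 + K2) + j] = (j < K1) ? A[i * K1 + j] : B[i * K2 + (j - K1)];
    }
  }
}

// Version 3 idea: one loop per input, each writing its own slice of the output.
void concat_multiple_loops(const float* A, const float* B, float* OUT,
                           int M, int K1, int K2) {
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < K1; ++j) OUT[i * (K1 + K2) + j] = A[i * K1 + j];
    for (int j = 0; j < K2; ++j) OUT[i * (K1 + K2) + K1 + j] = B[i * K2 + j];
  }
}
```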

Here are the results of these benchmarks on an `Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz` machine with `OMP_NUM_THREADS=1`:

```
--------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                   Time           CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------
Concat2D2Input/ATen/1/160/1/14/1                                          1211 ns       1211 ns     567896 GB/s=1.14953G/s
Concat2D2Input/ATen/1/580/1/174/1                                         1296 ns       1296 ns     537060 GB/s=4.65362G/s
Concat2D2Input/ATen/20/160/20/14/1                                        1823 ns       1823 ns     382052 GB/s=15.2677G/s
Concat2D2Input/ATen/20/580/20/174/1                                       3347 ns       3347 ns     210036 GB/s=36.0432G/s
Concat2D2Input/ATen/8/512/8/512/1                                         2093 ns       2093 ns     324760 GB/s=31.3061G/s
Concat2D2Input/NNC/1/160/1/14/1                                            694 ns        694 ns    1002902 GB/s=2.00692G/s
Concat2D2Input/NNC/1/580/1/174/1                                           852 ns        852 ns     803002 GB/s=7.08127G/s
Concat2D2Input/NNC/20/160/20/14/1                                         1639 ns       1639 ns     419683 GB/s=16.9828G/s
Concat2D2Input/NNC/20/580/20/174/1                                        5956 ns       5956 ns     117833 GB/s=20.2548G/s
Concat2D2Input/NNC/8/512/8/512/1                                          3136 ns       3136 ns     224122 GB/s=20.8958G/s
Concat2D2Input/NNCLoop/1/160/1/14/1                                        581 ns        581 ns    1209873 GB/s=2.39737G/s
Concat2D2Input/NNCLoop/1/580/1/174/1                                       614 ns        614 ns    1132332 GB/s=9.82955G/s
Concat2D2Input/NNCLoop/20/160/20/14/1                                     1091 ns       1091 ns     622952 GB/s=25.5247G/s
Concat2D2Input/NNCLoop/20/580/20/174/1                                    2399 ns       2399 ns     288376 GB/s=50.289G/s
Concat2D2Input/NNCLoop/8/512/8/512/1                                      1500 ns       1500 ns     478360 GB/s=43.6968G/s
Concat2D3Input/ATen/8/512/8/512/8/512/1                                   2584 ns       2584 ns     266394 GB/s=38.0397G/s
Concat2D3Input/NNC/8/512/8/512/8/512/1                                    5056 ns       5056 ns     139768 GB/s=19.4416G/s
Concat2D3Input/NNCLoop/8/512/8/512/8/512/1                                1917 ns       1917 ns     369626 GB/s=51.2758G/s
Concat2D7Input/ATen/8/128/8/256/8/384/8/512/8/512/8/512/8/512/1           3888 ns       3888 ns     178124 GB/s=46.3571G/s
Concat2D7Input/NNC/8/128/8/256/8/384/8/512/8/512/8/512/8/512/1           24639 ns      24638 ns      28336 GB/s=7.31481G/s
Concat2D7Input/NNCLoop/8/128/8/256/8/384/8/512/8/512/8/512/8/512/1        3093 ns       3093 ns     226326 GB/s=58.265G/s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52592

Reviewed By: bertmaher

Differential Revision: D26596701

Pulled By: navahgar

fbshipit-source-id: 650fa88febf4423ea49f5a1d3d734edc2294d257
2021-02-24 06:09:32 -08:00
Zirui Tao
2b202667c1 [1/N] CPU pointwise optimization: Add a benchmark for Relu
Summary: As title

Test Plan:
```
Building: finished in 01:58.4 min (100%) 16761/16761 jobs, 16761 updated
  Total time: 02:32.3 min
Run on (24 X 2394.45 MHz CPU s)
2021-02-16 21:29:30
----------------------------------------------------------------------------------------------------
Benchmark                                             Time           CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------
relu_nnc/64                                        1738 ns       1738 ns     410535 log/s=36.8257M/s
relu_nnc/512                                       1708 ns       1708 ns     408678 log/s=299.711M/s
relu_nnc/8192                                      3297 ns       3297 ns     214362 log/s=2.48499G/s
relu_nnc/32768                                    10725 ns      10722 ns      61032 log/s=3.05603G/s
log_nnc_sleef/64                                   2076 ns       2075 ns     326248 log/s=30.8436M/s
log_nnc_sleef/512                                  3070 ns       3069 ns     230616 log/s=166.81M/s
log_nnc_sleef/8192                                22214 ns      22210 ns      31251 log/s=368.849M/s
log_nnc_sleef/32768                               85835 ns      85824 ns       8366 log/s=381.804M/s
log_nnc_fast/64                                    1852 ns       1852 ns     379123 log/s=34.5532M/s
log_nnc_fast/512                                   2456 ns       2456 ns     299463 log/s=208.503M/s
log_nnc_fast/8192                                 10953 ns      10952 ns      69894 log/s=747.957M/s
log_nnc_fast/32768                                35424 ns      35422 ns      19986 log/s=925.08M/s
log_nnc_vml/64                                     2361 ns       2361 ns     356220 log/s=27.1063M/s
log_nnc_vml/512                                    2218 ns       2218 ns     313444 log/s=230.857M/s
log_nnc_vml/8192                                   8420 ns       8420 ns      81594 log/s=972.912M/s
log_nnc_vml/32768                                 29484 ns      29484 ns      21701 log/s=1.1114G/s
log_aten/64                                       15970 ns      15970 ns      44401 log/s=4.00742M/s
log_aten/512                                      18344 ns      18344 ns      41056 log/s=27.9114M/s
log_aten/8192                                     24894 ns      24893 ns      27414 log/s=329.084M/s
log_aten/32768                                    29129 ns      29125 ns      22477 log/s=1.12508G/s
logit_nnc_sleef/64                                 2379 ns       2379 ns     261168 logit/s=26.8981M/s
logit_nnc_sleef/512                                5778 ns       5774 ns     114009 logit/s=88.6757M/s
logit_nnc_sleef/8192                              57268 ns      57236 ns      12429 logit/s=143.127M/s
logit_nnc_sleef/32768                            216356 ns     216344 ns       3026 logit/s=151.462M/s
logit_nnc_fast/64                                  2178 ns       2173 ns     282306 logit/s=29.4565M/s
logit_nnc_fast/512                                 2955 ns       2943 ns     202527 logit/s=173.95M/s
logit_nnc_fast/8192                               14836 ns      14835 ns      46794 logit/s=552.192M/s
logit_nnc_fast/32768                              53999 ns      53997 ns      12842 logit/s=606.846M/s
logit_nnc_vml/64                                   2132 ns       2132 ns     335874 logit/s=30.018M/s
logit_nnc_vml/512                                  3029 ns       3029 ns     250988 logit/s=169.058M/s
logit_nnc_vml/8192                                13264 ns      13263 ns      53504 logit/s=617.655M/s
logit_nnc_vml/32768                               49395 ns      48284 ns      14526 logit/s=678.654M/s
logit_aten/64                                     88180 ns      86690 ns       9270 logit/s=738.261k/s
logit_aten/512                                    54682 ns      54489 ns      10000 logit/s=9.3964M/s
logit_aten/8192                                  170878 ns     164357 ns       6965 logit/s=49.8427M/s
logit_aten/32768                                 452291 ns     434638 ns       3967 logit/s=75.3915M/s
logit_caffe2/64                                   30170 ns      29902 ns      24686 logit/s=2.14029M/s
logit_caffe2/512                                 203517 ns     201201 ns       3570 logit/s=2.54472M/s
logit_caffe2/8192                               3199528 ns    3157098 ns        220 logit/s=2.59479M/s
logit_caffe2/32768                             12520838 ns   12504846 ns         56 logit/s=2.62042M/s
tanh_nnc_fast/64                                   1979 ns       1977 ns     309745 tanh/s=32.3752M/s
tanh_nnc_fast/512                                  2331 ns       2331 ns     300937 tanh/s=219.636M/s
tanh_nnc_fast/8192                                 8323 ns       8323 ns      83601 tanh/s=984.26M/s
tanh_nnc_fast/32768                               30767 ns      30766 ns      23024 tanh/s=1065.06M/s
tanh_aten/64                                      17181 ns      17180 ns      36818 tanh/s=3.72522M/s
tanh_aten/512                                     19071 ns      19036 ns      37243 tanh/s=26.8968M/s
tanh_aten/8192                                    53542 ns      52006 ns      16268 tanh/s=157.521M/s
tanh_aten/32768                                  619869 ns     587600 ns       1000 tanh/s=55.7658M/s
tanh_caffe2/64                                     9668 ns       9654 ns      70926 tanh/s=6.62919M/s
tanh_caffe2/512                                   70409 ns      70409 ns       9881 tanh/s=7.27184M/s
tanh_caffe2/8192                                1179098 ns    1179011 ns        644 tanh/s=6.9482M/s
tanh_caffe2/32768                               4384300 ns    4382613 ns        156 tanh/s=7.47682M/s
BatchNorm/ATen/1/64/112/112                    23186429 ns   23183715 ns         27 GB/s=277.028M/s
BatchNorm/ATen/1/256/14/14                      1772907 ns    1770636 ns        394 GB/s=226.703M/s
BatchNorm/ATen/1/128/28/28                      3069417 ns    3069229 ns        232 GB/s=261.569M/s
BatchNorm/ATen/1/64/56/56                       6367276 ns    6367190 ns        111 GB/s=252.173M/s
BatchNorm/ATen/1/512/7/7                        1334734 ns    1334373 ns        516 GB/s=150.411M/s
BatchNorm/ATen/5/64/112/112                   131727903 ns  131721364 ns          7 GB/s=243.792M/s
BatchNorm/ATen/5/256/14/14                      7879002 ns    7874672 ns         85 GB/s=254.873M/s
BatchNorm/ATen/5/128/28/28                     15561373 ns   15269781 ns         42 GB/s=262.877M/s
BatchNorm/ATen/5/64/56/56                      29169722 ns   29107393 ns         24 GB/s=275.812M/s
BatchNorm/ATen/5/512/7/7                        5042006 ns    5028687 ns        100 GB/s=199.559M/s
BatchNorm/NNC/1/64/112/112                      3303598 ns    3271058 ns        188 GB/s=1.96344G/s
BatchNorm/NNC/1/256/14/14                        330641 ns     326644 ns       2033 GB/s=1.22889G/s
BatchNorm/NNC/1/128/28/28                        498706 ns     497894 ns       1131 GB/s=1.61242G/s
BatchNorm/NNC/1/64/56/56                        1116910 ns    1114768 ns        641 GB/s=1.44033G/s
BatchNorm/NNC/1/512/7/7                          163380 ns     163351 ns       3493 GB/s=1.22867G/s
BatchNorm/NNC/5/64/112/112                     16392078 ns   16386427 ns         41 GB/s=1.95971G/s
BatchNorm/NNC/5/256/14/14                       1133781 ns    1133369 ns        674 GB/s=1.77086G/s
BatchNorm/NNC/5/128/28/28                       2053208 ns    2053211 ns        276 GB/s=1.95503G/s
BatchNorm/NNC/5/64/56/56                        3874949 ns    3874734 ns        165 GB/s=2.07193G/s
BatchNorm/NNC/5/512/7/7                          653665 ns     651498 ns       1236 GB/s=1.54033G/s
BatchNorm/ATenRelu/1/64/112/112                36878892 ns   36100523 ns         22 GB/s=177.907M/s
BatchNorm/ATenRelu/1/256/14/14                  6404318 ns    5544976 ns        100 GB/s=72.3913M/s
BatchNorm/ATenRelu/1/128/28/28                  5897059 ns    5735509 ns        106 GB/s=139.973M/s
BatchNorm/ATenRelu/1/64/56/56                  10075458 ns    9965146 ns         62 GB/s=161.125M/s
BatchNorm/ATenRelu/1/512/7/7                    2680507 ns    2662541 ns        254 GB/s=75.3806M/s
BatchNorm/ATenRelu/5/64/112/112               145738113 ns  144253693 ns          5 GB/s=222.612M/s
BatchNorm/ATenRelu/5/256/14/14                 13582519 ns   13427209 ns         65 GB/s=149.476M/s
BatchNorm/ATenRelu/5/128/28/28                 22747138 ns   22627185 ns         31 GB/s=177.401M/s
BatchNorm/ATenRelu/5/64/56/56                  53609692 ns   52936728 ns         15 GB/s=151.656M/s
BatchNorm/ATenRelu/5/512/7/7                   11378314 ns   11083777 ns         65 GB/s=90.5395M/s
BatchNorm/NNCRelu/1/64/112/112                  3154436 ns    3148939 ns        193 GB/s=2.03958G/s
BatchNorm/NNCRelu/1/256/14/14                    337341 ns     337163 ns       1926 GB/s=1.19055G/s
BatchNorm/NNCRelu/1/128/28/28                    505570 ns     505569 ns       1231 GB/s=1.58794G/s
BatchNorm/NNCRelu/1/64/56/56                     903452 ns     903421 ns        659 GB/s=1.77728G/s
BatchNorm/NNCRelu/1/512/7/7                      158521 ns     158321 ns       3781 GB/s=1.2677G/s
BatchNorm/NNCRelu/5/64/112/112                 15488210 ns   15480019 ns         41 GB/s=2.07446G/s
BatchNorm/NNCRelu/5/256/14/14                   1149186 ns    1148963 ns        649 GB/s=1.74683G/s
BatchNorm/NNCRelu/5/128/28/28                   2011589 ns    2011424 ns        320 GB/s=1.99564G/s
BatchNorm/NNCRelu/5/64/56/56                    3776274 ns    3776060 ns        161 GB/s=2.12607G/s
BatchNorm/NNCRelu/5/512/7/7                      699762 ns     699582 ns        975 GB/s=1.43446G/s
BM_CompileSwish                                30471825 ns   30470017 ns         24
BM_CompileSwishLLVMOnly                        27479624 ns   27473475 ns         25
FusedOverhead                                    196219 ns     196195 ns       3342
UnfusedOverhead                                  220210 ns     220119 ns       3302
Gemm/Torch/128/128/128                           115526 ns     115343 ns       7414 GFLOPS=36.3637G/s
Gemm/TensorExprNoopt/128/128/128                3155851 ns    3155706 ns        210 GFLOPS=1.32912G/s
Gemm/TensorExprTile32x32/128/128/128             124454 ns     124452 ns       5774 GFLOPS=33.7021G/s
Gemm/TensorExprTile4x16/128/128/128              174408 ns     174366 ns       3987 GFLOPS=24.0546G/s
Gemm/TensorExprTile4x16VecUnroll/128/128/128      72949 ns      72948 ns       9028 GFLOPS=57.4974G/s
Gemm/TensorExprTile4x16Cache/128/128/128          73237 ns      73234 ns       9501 GFLOPS=57.2726G/s
Reduce1D/Torch/16777216                       426865265 ns  426853756 ns          2 BYTES=157.217M/s
Reduce1D/Naive/16777216                       132347709 ns  132343710 ns          5 BYTES=507.08M/s
Reduce1D/NativeRfactor/16777216               234668375 ns  234664682 ns          3 BYTES=285.978M/s
Reduce1D/TeNaive/16777216                      20468304 ns   20467906 ns         34 BYTES=3.27874G/s
Reduce1D/TeSplitTail/16777216                  20378995 ns   20378678 ns         34 BYTES=3.29309G/s
Reduce1D/TeSplitMask/16777216                  20371783 ns   20371260 ns         36 BYTES=3.29429G/s
Reduce1D/TeRfactorV2/16777216                   8235908 ns    8235723 ns         84 BYTES=8.14851G/s
```

CPU info:

Running `sudo lshw -class processor` reports 24 CPUs with identical architecture:

```
  *-cpu:0
       description: CPU
       product: Intel Core Processor (Broadwell)
       vendor: Intel Corp.
       physical id: 400
       bus info: cpu@0
       version: 6.61.2
       slot: CPU 0
       size: 2GHz
       capacity: 2GHz
       width: 64 bits
       capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp x86-64 constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
       configuration: cores=1 enabledcores=1 microcode=1 threads=1
```

Reviewed By: bwasti

Differential Revision: D26275048

fbshipit-source-id: 3de669f622eb8cd328787caa878dc0c05de600a5
2021-02-17 17:18:28 -08:00
Bert Maher
71d5a8ea62 [nnc] Benchmark inference batchnorm (#52251)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52251

Batchnorm in inference is just a bunch of pointwise ops.  NNC
should be able to do a good job of this, and indeed it does.  For fun
I've included a fused BN->Relu (although the real fusion fun would be
Conv->BN->Relu...).
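
For reference, inference-mode batch norm is just the per-channel affine transform y = (x - mean) / sqrt(var + eps) * gamma + beta; here is a plain scalar sketch (not the NNC-generated kernel) with an optional fused Relu:
```
#include <algorithm>
#include <cmath>

// x, y: NCHW tensors of n*c*hw floats; mean/var/gamma/beta are per-channel.
void batch_norm_relu_inference(const float* x, float* y,
                               const float* mean, const float* var,
                               const float* gamma, const float* beta,
                               int n, int c, int hw, float eps, bool fuse_relu) {
  for (int ni = 0; ni < n; ++ni) {
    for (int ci = 0; ci < c; ++ci) {
      const float scale = gamma[ci] / std::sqrt(var[ci] + eps);
      const float shift = beta[ci] - mean[ci] * scale;
      const float* xp = x + (ni * c + ci) * hw;
      float* yp = y + (ni * c + ci) * hw;
      for (int i = 0; i < hw; ++i) {
        float v = xp[i] * scale + shift;          // pointwise BN
        yp[i] = fuse_relu ? std::max(v, 0.f) : v; // optional fused Relu
      }
    }
  }
}
```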

```
---------------------------------------------------------------------------------------
Benchmark                                Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------------------
BatchNorm/ATen/1/64/112/112         252886 ns     252875 ns       2785 GB/s=25.3981G/s
BatchNorm/ATen/1/256/14/14           12145 ns      12145 ns      55347 GB/s=33.0525G/s
BatchNorm/ATen/1/128/28/28           18919 ns      18918 ns      37749 GB/s=42.437G/s
BatchNorm/ATen/1/64/56/56            61434 ns      61433 ns      11315 GB/s=26.1363G/s
BatchNorm/ATen/1/512/7/7             11924 ns      11923 ns      59070 GB/s=16.8327G/s
BatchNorm/ATen/5/64/112/112        1873321 ns    1873292 ns        382 GB/s=17.1424G/s
BatchNorm/ATen/5/256/14/14           83470 ns      83459 ns       8538 GB/s=24.0483G/s
BatchNorm/ATen/5/128/28/28          157521 ns     157520 ns       4440 GB/s=25.4829G/s
BatchNorm/ATen/5/64/56/56           314675 ns     314670 ns       2235 GB/s=25.513G/s
BatchNorm/ATen/5/512/7/7             48129 ns      48128 ns      14582 GB/s=20.851G/s

BatchNorm/NNC/1/64/112/112          249454 ns     249428 ns       2802 GB/s=25.749G/s
BatchNorm/NNC/1/256/14/14             9321 ns       9321 ns      74573 GB/s=43.066G/s
BatchNorm/NNC/1/128/28/28            16874 ns      16873 ns      40999 GB/s=47.5797G/s
BatchNorm/NNC/1/64/56/56             59276 ns      59275 ns      12047 GB/s=27.0878G/s
BatchNorm/NNC/1/512/7/7               3452 ns       3452 ns     202610 GB/s=58.1394G/s
BatchNorm/NNC/5/64/112/112         1820201 ns    1820038 ns        373 GB/s=17.6439G/s
BatchNorm/NNC/5/256/14/14            78429 ns      78420 ns       8871 GB/s=25.5935G/s
BatchNorm/NNC/5/128/28/28           155214 ns     155202 ns       4514 GB/s=25.8635G/s
BatchNorm/NNC/5/64/56/56            311454 ns     311449 ns       2163 GB/s=25.7768G/s
BatchNorm/NNC/5/512/7/7              26853 ns      26851 ns      25283 GB/s=37.3735G/s

BatchNorm/ATenRelu/1/64/112/112     378879 ns     378849 ns       1844 GB/s=16.9528G/s
BatchNorm/ATenRelu/1/256/14/14       16707 ns      16705 ns      41391 GB/s=24.029G/s
BatchNorm/ATenRelu/1/128/28/28       30235 ns      30235 ns      23060 GB/s=26.5529G/s
BatchNorm/ATenRelu/1/64/56/56        91164 ns      91160 ns       7662 GB/s=17.6132G/s
BatchNorm/ATenRelu/1/512/7/7         14681 ns      14681 ns      46088 GB/s=13.6707G/s
BatchNorm/ATenRelu/5/64/112/112    2864060 ns    2863566 ns        243 GB/s=11.2142G/s
BatchNorm/ATenRelu/5/256/14/14      118376 ns     118367 ns       5907 GB/s=16.9561G/s
BatchNorm/ATenRelu/5/128/28/28      237893 ns     237873 ns       2936 GB/s=16.8749G/s
BatchNorm/ATenRelu/5/64/56/56       472452 ns     472386 ns       1479 GB/s=16.9949G/s
BatchNorm/ATenRelu/5/512/7/7         61389 ns      61379 ns      11442 GB/s=16.3496G/s

BatchNorm/NNCRelu/1/64/112/112      248378 ns     248341 ns       2812 GB/s=25.8618G/s
BatchNorm/NNCRelu/1/256/14/14         9965 ns       9964 ns      76013 GB/s=40.2861G/s
BatchNorm/NNCRelu/1/128/28/28        16153 ns      16153 ns      43343 GB/s=49.7004G/s
BatchNorm/NNCRelu/1/64/56/56         58761 ns      58757 ns      12095 GB/s=27.3265G/s
BatchNorm/NNCRelu/1/512/7/7          10529 ns      10529 ns      66590 GB/s=19.0625G/s
BatchNorm/NNCRelu/5/64/112/112     1799001 ns    1798757 ns        362 GB/s=17.8527G/s
BatchNorm/NNCRelu/5/256/14/14        78252 ns      78246 ns       8974 GB/s=25.6504G/s
BatchNorm/NNCRelu/5/128/28/28       154940 ns     154923 ns       4483 GB/s=25.9102G/s
BatchNorm/NNCRelu/5/64/56/56        312329 ns     312324 ns       2244 GB/s=25.7046G/s
BatchNorm/NNCRelu/5/512/7/7          51203 ns      51199 ns      13559 GB/s=19.6004G/s
```

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D26440786

Pulled By: bertmaher

fbshipit-source-id: 7d3f7bf6eee4c37736e9875d31ae1b483af9fb6f
2021-02-16 10:57:38 -08:00
Bert Maher
602434bcbe [te] Benchmark vml-based logit (#51771)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51771

This benchmarks an NNC implementation of logit based on VML's log
implementation.

It's a modest improvement over the sleef algorithm, but seems to be a bit
slower than aten (at larger sizes), and I'm not totally sure why, since you'd
think a fused logit kernel would be better than doing clamp/sub/div, followed
by log.  And yet...

Note that it's important to vectorize this kernel by 16, even on an 8-wide AVX2
machine; I suspect that it's needed to give the scheduler enough freedom to
fill up both FMA pipes to avoid stalling on fpdiv or (maybe) memory.
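
For reference, the scalar computation being benchmarked is logit(x) = log(x / (1 - x)) with x clamped to [eps, 1 - eps]; a plain sketch (not any particular backend's kernel) of the clamp/sub/div/log chain mentioned above:
```
#include <algorithm>
#include <cmath>

inline float logit_scalar(float x, float eps) {
  x = std::min(std::max(x, eps), 1.0f - eps);  // clamp
  return std::log(x / (1.0f - x));             // sub, div, log
}
```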
ghstack-source-id: 121392349

Test Plan:
```
-----------------------------------------------------------------------------
Benchmark                      Time           CPU Iterations UserCounters...
-----------------------------------------------------------------------------
logit_nnc_sleef/64           483 ns        483 ns    1452336 logit/s=132.469M/s
logit_nnc_sleef/512         3019 ns       3019 ns     228059 logit/s=169.577M/s
logit_nnc_sleef/8192       71427 ns      71424 ns       9662 logit/s=114.695M/s
logit_nnc_sleef/32768     307062 ns     306722 ns       2406 logit/s=106.833M/s

logit_nnc_fast/64            147 ns        147 ns    4408910 logit/s=434.908M/s
logit_nnc_fast/512           781 ns        781 ns     881230 logit/s=655.53M/s
logit_nnc_fast/8192        12519 ns      12518 ns      55626 logit/s=654.421M/s
logit_nnc_fast/32768       50530 ns      50526 ns      10000 logit/s=648.536M/s

logit_nnc_vml/64             125 ns        125 ns    5551460 logit/s=511.603M/s
logit_nnc_vml/512            733 ns        733 ns     938444 logit/s=698.955M/s
logit_nnc_vml/8192         11282 ns      11280 ns      61610 logit/s=726.23M/s
logit_nnc_vml/32768        45051 ns      44991 ns      15473 logit/s=728.325M/s

logit_aten/64                450 ns        449 ns    1599269 logit/s=142.429M/s
logit_aten/512              1055 ns       1054 ns     665538 logit/s=485.595M/s
logit_aten/8192            10865 ns      10864 ns      64152 logit/s=754.032M/s
logit_aten/32768           42106 ns      42103 ns      16477 logit/s=778.287M/s

logit_caffe2/64              233 ns        233 ns    2952127 logit/s=274.761M/s
logit_caffe2/512            1795 ns       1795 ns     393354 logit/s=285.177M/s
logit_caffe2/8192          29924 ns      29923 ns      23225 logit/s=273.77M/s
logit_caffe2/32768        123899 ns     123893 ns       5642 logit/s=264.487M/s
```

Reviewed By: bwasti

Differential Revision: D26272325

fbshipit-source-id: b9771a96e0150685506dbc625e7894e81c93a688
2021-02-10 02:09:14 -08:00
Bert Maher
2e35fe9535 [te] Implement log approximation using the VML approach (#51752)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51752

Using a straight power series approximation with enough terms gives
precision down to the denormal range, and avoids the fp division used in the
sleef approach.  This is nice because recent CPUs have dual pipelined fma units,
so we can compute 16 logarithms in parallel; whereas there's usually only one
FP divider and it has a fairly high latency/low throughput.
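
A deliberately low-accuracy scalar sketch of that structure (the actual kernel uses a much longer minimax polynomial evaluated with vector FMAs; the terms below are just the leading Taylor terms):
```
#include <cmath>

// Assumes x > 0; no NaN/denormal handling. log(x) = log(m) + e * ln(2),
// where x = m * 2^e and m is reduced to roughly [1/sqrt(2), sqrt(2)).
inline float log_poly_sketch(float x) {
  int e = 0;
  float m = std::frexp(x, &e);          // m in [0.5, 1)
  if (m < 0.70710678f) { m *= 2.0f; --e; }
  const float t = m - 1.0f;
  // A few Taylor terms of log(1 + t); a real kernel uses many more (or minimax).
  const float p = t - 0.5f * t * t + (t * t * t) / 3.0f - 0.25f * t * t * t * t;
  return p + static_cast<float>(e) * 0.69314718f;
}
```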
ghstack-source-id: 121392347

Test Plan:
On my avx2+fma broadwell:
```
---------------------------------------------------------------------------
Benchmark                    Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64           178 ns        178 ns    3933565 log/s=358.993M/s
log_nnc_sleef/512         1286 ns       1285 ns     559459 log/s=398.354M/s
log_nnc_sleef/8192       19366 ns      19364 ns      36619 log/s=423.053M/s
log_nnc_sleef/32768      79288 ns      79286 ns       8718 log/s=413.287M/s

log_nnc_fast/64             92 ns         92 ns    7644990 log/s=696.939M/s
log_nnc_fast/512           483 ns        483 ns    1426802 log/s=1059.49M/s
log_nnc_fast/8192         7519 ns       7514 ns      95319 log/s=1090.23M/s
log_nnc_fast/32768       31344 ns      31338 ns      22397 log/s=1045.62M/s

log_nnc_vml/64              88 ns         88 ns    7923812 log/s=728.469M/s
log_nnc_vml/512            454 ns        454 ns    1521437 log/s=1.12739G/s
log_nnc_vml/8192          6763 ns       6763 ns     103264 log/s=1.21136G/s
log_nnc_vml/32768        26565 ns      26564 ns      23609 log/s=1.23354G/s

log_aten/64                418 ns        418 ns    1651401 log/s=153.117M/s
log_aten/512               801 ns        801 ns     875857 log/s=638.923M/s
log_aten/8192             6877 ns       6872 ns     100840 log/s=1.19208G/s
log_aten/32768           26989 ns      26988 ns      26268 log/s=1.21416G/s
```

Reviewed By: bwasti, zheng-xq

Differential Revision: D26246400

fbshipit-source-id: dae47ee6baeab1a813ec4d4440748164051aed3d
2021-02-10 02:09:10 -08:00
Bert Maher
a23e82df10 [nnc] Tweak log_nnc_sleef so vectorization kicks in (#51491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51491

The vectorizer heuristic is pretty dumb and only kicks in if the
unroll factor is exactly 8 or 4.

It's still slower than the direct implementation, which isn't surprising.
ghstack-source-id: 120783426

Test Plan:
`buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench`

Before:
```
---------------------------------------------------------------------------
Benchmark                    Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64           438 ns        438 ns    1795511 log/s=146.259M/s
log_nnc_sleef/512         3196 ns       3195 ns     210032 log/s=160.235M/s
log_nnc_sleef/8192       77467 ns      77466 ns       8859 log/s=105.749M/s
log_nnc_sleef/32768     310206 ns     310202 ns       2170 log/s=105.634M/s
log_nnc_fast/64            100 ns        100 ns    7281074 log/s=637.144M/s
log_nnc_fast/512           546 ns        546 ns    1335816 log/s=938.361M/s
log_nnc_fast/8192         7360 ns       7359 ns      91971 log/s=1.11316G/s
log_nnc_fast/32768       30793 ns      30792 ns      22633 log/s=1064.17M/s
log_aten/64           427 ns        427 ns    1634897 log/s=150.021M/s
log_aten/512          796 ns        796 ns     877318 log/s=643.566M/s
log_aten/8192        6690 ns       6690 ns     102649 log/s=1.22452G/s
log_aten/32768      25357 ns      25350 ns      27808 log/s=1.29263G/s
```

After:
```
---------------------------------------------------------------------------
Benchmark                    Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64           189 ns        188 ns    3872475 log/s=340.585M/s
log_nnc_sleef/512         1307 ns       1307 ns     557770 log/s=391.709M/s
log_nnc_sleef/8192       20259 ns      20257 ns      34240 log/s=404.404M/s
log_nnc_sleef/32768      81556 ns      81470 ns       8767 log/s=402.209M/s
log_nnc_fast/64            110 ns        110 ns    6564558 log/s=581.116M/s
log_nnc_fast/512           554 ns        554 ns    1279304 log/s=923.376M/s
log_nnc_fast/8192         7774 ns       7774 ns      91421 log/s=1053.75M/s
log_nnc_fast/32768       31008 ns      31006 ns      21279 log/s=1056.83M/s
```

Reviewed By: bwasti

Differential Revision: D26139067

fbshipit-source-id: db31897ee9922695ff9dff4ff46e3d3fbd61f4c2
2021-02-01 16:35:37 -08:00
Mikhail Zolotukhin
e975169426 [TensorExpr] Redesign Tensor class. (#50995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50995

This change makes 'Tensor' a thin wrapper over 'Buf' and 'Stmt', and
merges it with the recently introduced 'CompoundTensor'. A statement for the
tensor is either passed directly to the Tensor constructor (akin to
'CompoundTensor'), or is built immediately in the constructor.

LoopNest is no longer responsible for constructing statements from
tensors - it simply stitches together the already-constructed statements
contained in Tensors. A side effect is that we can no longer construct
several loopnests from the same tensors - we need to explicitly clone the
statements if we want to do that. A special copy constructor was added to
LoopNest to make this more convenient (note: this only affects tests; we
don't usually create multiple loopnests elsewhere).
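
Schematically (placeholder types, not the actual classes), the Tensor described here boils down to a (Buf, Stmt) pair:
```
#include <memory>

struct Buf {};   // placeholder for the buffer the tensor defines
struct Stmt {};  // placeholder for the statement that computes it

// A Tensor after this change: nothing but a Buf plus the Stmt computing it;
// LoopNest stitches the already-constructed Stmts together.
class Tensor {
 public:
  Tensor(std::shared_ptr<Buf> buf, std::shared_ptr<Stmt> stmt)
      : buf_(std::move(buf)), stmt_(std::move(stmt)) {}
  const std::shared_ptr<Buf>& buf() const { return buf_; }
  const std::shared_ptr<Stmt>& stmt() const { return stmt_; }

 private:
  std::shared_ptr<Buf> buf_;
  std::shared_ptr<Stmt> stmt_;
};
```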

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26038223

Pulled By: ZolotukhinM

fbshipit-source-id: 27a2e5900437cfb0c151e8f89815edec53608e17
2021-01-27 16:14:22 -08:00
Nikita Shulga
97ea95ddd7 Delete tabs from bench_approx.cpp (#51157)
Summary:
Introduced by D25981260 (f08464f31d)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51157

Reviewed By: bwasti

Differential Revision: D26090008

Pulled By: malfet

fbshipit-source-id: b63f1bb1683c7261902de7eaab24a05a5159ce7e
2021-01-26 15:53:47 -08:00
Bert Maher
c4029444d1 [nnc] Per-operator benchmarks (#51093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51093

Operator-level benchmarks comparing eager-mode PyTorch to
NNC-generated fused kernels. We wouldn't normally see these ops in isolation,
but the comparison points out where NNC is falling short (or doing well).

I threw in a composed hardswish for fun, because it's my favorite activation
function.

Notably, it exposes a bug in our build process that's preventing vectorization
from using `sleef`, so we're using scalar calls to libm with predictably lousy
performance.  Fix incoming.

This benchmark is similar to the pure NNC approach in `microbenchmarks.py`, but
will include the overhead of dispatching the fused kernel through TorchScript.
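
The composed hardswish mentioned above is x * relu6(x + 3) / 6; a scalar sketch of the composition (the benchmark presumably expresses it as elementary tensor ops so the fuser can emit a single kernel):
```
#include <algorithm>

// hardswish(x) = x * clamp(x + 3, 0, 6) / 6, built from simple pointwise steps.
inline float hardswish_scalar(float x) {
  float t = x + 3.0f;
  t = std::min(std::max(t, 0.0f), 6.0f);  // relu6(x + 3)
  return x * t * (1.0f / 6.0f);
}
```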
ghstack-source-id: 120403675

Test Plan:
```
op                        eager        nnc    speedup
hardswish                 0.187      0.051       3.70
hardswish                 0.052      0.052       1.00
sigmoid                   0.148      1.177       0.13
reciprocal                0.049      0.050       0.98
neg                       0.038      0.037       1.02
relu                      0.037      0.036       1.03
isnan                     0.119      0.020       5.86
log                       0.082      1.330       0.06
log10                     0.148      1.848       0.08
log1p                     0.204      1.413       0.14
log2                      0.285      1.167       0.24
exp                       0.063      1.123       0.06
expm1                     0.402      1.417       0.28
erf                       0.167      0.852       0.20
erfc                      0.181      1.098       0.16
cos                       0.124      0.793       0.16
sin                       0.126      0.838       0.15
tan                       0.285      1.777       0.16
acos                      0.144      1.358       0.11
asin                      0.126      1.193       0.11
cosh                      0.384      1.761       0.22
sinh                      0.390      2.279       0.17
atan                      0.240      1.564       0.15
tanh                      0.320      2.259       0.14
sqrt                      0.043      0.069       0.63
rsqrt                     0.118      0.117       1.01
abs                       0.038      0.037       1.03
ceil                      0.038      0.038       1.01
floor                     0.039      0.039       1.00
round                     0.039      0.292       0.13
trunc                     0.040      0.036       1.12
lgamma                    2.045      2.721       0.75
```

Reviewed By: zheng-xq

Differential Revision: D26069791

fbshipit-source-id: 236e7287ba1b3f67fdcb938949a92bbbdfa13dba
2021-01-26 14:10:08 -08:00
Bram Wasti
f08464f31d [nnc] Add benchmarks
Summary: Adding a set of benchmarks for key operators

Test Plan:
buck build mode/opt -c 'fbcode.caffe2_gpu_type=none' caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 numactl -C 3 ./buck-out/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench

Reviewed By: ZolotukhinM

Differential Revision: D25981260

fbshipit-source-id: 17681fc1527f43ccf9bcc80704415653a627b396
2021-01-26 13:51:33 -08:00
Xiaoqiang Zheng
b96a6516a6 Add CPP Full Reduction Benchmarks. (#50193)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50193

* Supports ATen, a native reference implementation, and NNC TE implementations.
* Supports functionality checks against ATen, in addition to performance checks.

Test plans:

* After enabling "BUILD_TENSOREXPR_BENCHMARK" in CMakeLists.txt,
* bin/tensorexpr_bench --benchmark_filter=Reduce1D

Measurements:

On a Broadwell E5-2686 CPU,

```
Reduce1D/Torch/16777216            5638547 ns    5638444 ns        119 BYTES=11.902G/s
Reduce1D/Naive/16777216           19308235 ns   19308184 ns         36 BYTES=3.47567G/s
Reduce1D/NativeRfactor/16777216    8433348 ns    8433038 ns         85 BYTES=7.95785G/s
Reduce1D/NativeVector/16777216     5608836 ns    5608727 ns        124 BYTES=11.9651G/s
Reduce1D/NativeTiled/16777216      5550233 ns    5550221 ns        126 BYTES=12.0912G/s
Reduce1D/TeNaive/16777216         21451047 ns   21450752 ns         33 BYTES=3.12851G/s
Reduce1D/TeSplitTail/16777216     23701732 ns   23701229 ns         30 BYTES=2.83145G/s
Reduce1D/TeSplitMask/16777216     23683589 ns   23682978 ns         30 BYTES=2.83363G/s
Reduce1D/TeRfactorV2/16777216      5378019 ns    5377909 ns        131 BYTES=12.4786G/s
```

Result summary:

* The single-threaded performance of NNC TeRfactorV2 matches and exceeds the ATen and AVX2 naive counterparts.

Follow-up items:

* rfactor does not work well with split
* We don't have a multi-threaded implementation yet.
  * Missing "parallel" scheduling primitive, which is not different from what we need for pointwise ops.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D25821880

Pulled By: zheng-xq

fbshipit-source-id: 8df3f40d1eed8749c8edcaacae5f0544dbf6bed3
2021-01-21 10:00:50 -08:00
Bert Maher
468c99fba4 Reapply D25856891: [te] Benchmark comparing fused overhead to unfused (#50543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50543

Original commit changeset: 2d2f07f79986

Was part of a stack that got reverted.  This is just a benchmark.
ghstack-source-id: 119825594

Test Plan: CI

Reviewed By: navahgar

Differential Revision: D25912439

fbshipit-source-id: 5d9ca45810fff8931a3cfbd03965e11050180676
2021-01-14 14:17:45 -08:00
Mike Ruberry
4ee631cdf0 Revert D25856891: [te] Benchmark comparing fused overhead to unfused
Test Plan: revert-hammer

Differential Revision:
D25856891 (36ae3feb22)

Original commit changeset: 0e99515ec2e7

fbshipit-source-id: 2d2f07f79986ca7815b9eae63e734db76bdfc0c8
2021-01-14 04:33:35 -08:00
Bert Maher
36ae3feb22 [te] Benchmark comparing fused overhead to unfused (#50305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50305

That's it
ghstack-source-id: 119631533

Test Plan:
```
buck run //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -- --benchmark_filter=Overhead
```
```
Run on (24 X 2394.67 MHz CPU s)
2021-01-08 16:06:17
-------------------------------------------------------
Benchmark                Time           CPU Iterations
-------------------------------------------------------
FusedOverhead         2157 ns       2157 ns     311314
UnfusedOverhead       2443 ns       2443 ns     311221
```

Reviewed By: ZolotukhinM

Differential Revision: D25856891

fbshipit-source-id: 0e99515ec2e769a04929157d46903759c03182a3
2021-01-13 12:09:37 -08:00
Bram Wasti
1047957831 [te][reapply] Add fast log approximation based on sleef (#49575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49575

This is a fast log implementation.

benchmark:

```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat

Reviewed By: bertmaher

Differential Revision: D25627157

fbshipit-source-id: a4920f4f4005ce617d372b375e790ca966275cd9
2020-12-17 17:02:00 -08:00
Edward Yang
ea4ccc730e Revert D25445815: [te] Add fast log approximation based on sleef
Test Plan: revert-hammer

Differential Revision:
D25445815 (1329066b69)

Original commit changeset: 20696eacd12a

fbshipit-source-id: 38830a6abd16260d60e5dd9a5594e65736a9c782
2020-12-17 15:03:17 -08:00
Bram Wasti
1329066b69 [te] Add fast log approximation based on sleef
Summary:
This is a fast log implementation.

benchmark:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat

Reviewed By: bertmaher

Differential Revision: D25445815

fbshipit-source-id: 20696eacd12a55e797f606f4a6dbbd94c9652888
2020-12-17 14:28:34 -08:00
Bert Maher
464d23e6b4 [te][benchmark] Add more optimized versions of gemm (#48159)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48159

Test Plan: Imported from OSS

Reviewed By: Chillee, ngimel

Differential Revision: D25059742

Pulled By: bertmaher

fbshipit-source-id: f197347f739c5bd2a4182c59ebf4642000c3dd55
2020-11-18 12:21:08 -08:00
Bert Maher
b7261de0df [pytorch][te] Add compilation time benchmark (#46124)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46124

We want to make sure we can actually fuse kernels within a fairly
tight time budget.  So here's a quick benchmark of codegen for a simple
pointwise activation function (swish).  I kept all the intermediate tensors
separate to force TE to actually do inlining.
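
One plausible way to keep the intermediates separate, shown with libtorch ops purely for illustration (the benchmark itself presumably builds the equivalent expression in the TE IR rather than through libtorch):
```
#include <torch/torch.h>

// swish(x) = x * sigmoid(x), deliberately left as a chain of pointwise ops
// with separate intermediates so the fuser has real inlining work to do.
torch::Tensor swish_unfused(const torch::Tensor& x) {
  auto neg = torch::neg(x);             // -x
  auto e = torch::exp(neg);             // exp(-x)
  auto denom = e + 1.0;                 // 1 + exp(-x)
  auto sig = torch::reciprocal(denom);  // sigmoid(x)
  return x * sig;                       // x * sigmoid(x)
}
```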

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
```

I've only run this in debug mode, so the results aren't super meaningful, but
even in that mode it's 18 ms for compilation, 15 ms of which are in LLVM.

Update, opt build mode:
```
----------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations
----------------------------------------------------------------------------
BM_CompileSwish                         5123276 ns    5119846 ns        148
BM_CompileSwishLLVMOnly                 4754361 ns    4753701 ns        160
```

Reviewed By: asuhan

Differential Revision: D24232801

fbshipit-source-id: d58a8b7f79bcd9244c49366af7a693e09f24bf76
2020-10-09 23:11:37 -07:00
Bert Maher
f2e569461b [te] Tiled (m=32 x n=32) gemm benchmark (#45905)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45905

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D24142402

Pulled By: bertmaher

fbshipit-source-id: b39e18b6985ee1c1f654fba4498ed91ff14d8d5f
2020-10-06 16:57:31 -07:00
Bert Maher
50f89578dd [te] Add a benchmark harness (#45875)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45875

Adds a Google Benchmark harness for perf testing programs generated by
tensorexpr, sans any pytorch wrappings (for python-level benchmarks of
tensorexpr, see benchmarks/tensorexpr).

Currently there's a harness for gemm that sets up the problem using torch (and
also measures the perf of a torch::mm to give a baseline).
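
A minimal sketch of such a torch::mm baseline under Google Benchmark (the function name and FLOP counter are illustrative, not the harness added here):
```
#include <benchmark/benchmark.h>
#include <torch/torch.h>

static void BM_GemmTorch(benchmark::State& state) {
  torch::NoGradGuard no_grad;
  auto a = torch::randn({128, 128});
  auto b = torch::randn({128, 128});
  for (auto _ : state) {
    auto c = torch::mm(a, b);
    benchmark::DoNotOptimize(c);
  }
  // 2*M*N*K flops per iteration, reported as a rate (FLOP/s).
  state.counters["FLOPS"] = benchmark::Counter(
      2.0 * 128 * 128 * 128 * state.iterations(), benchmark::Counter::kIsRate);
}
BENCHMARK(BM_GemmTorch);
BENCHMARK_MAIN();
```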

Right now there's just an unoptimized implementation that is not expected to be
very fast. More optimized versions are coming.

Sample output from my dev box:
```
Run on (48 X 2501 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 256K (x24)
  L3 Unified 30720K (x2)
--------------------------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------
Gemm/Torch/128/128/128                    73405 ns      73403 ns       8614 GFLOPS=57.1411G/s
Gemm/TensorExprNoopt/128/128/128        3073003 ns    3072808 ns        229 GFLOPS=1.36497G/s
```

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D24142403

Pulled By: bertmaher

fbshipit-source-id: 3354aaa56868a43a553acd1ad9a192f28d8e3597
2020-10-06 16:57:27 -07:00