Commit Graph

73 Commits

Author SHA1 Message Date
cyy
8f291e8c00 Fix clang-tidy warnings in torch/jit (#146963)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146963
Approved by: https://github.com/davidberard98
2025-02-15 03:36:59 +00:00
cyy
419a7e197d [6/N] Fix Wextra-semi warning (#139605)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139605
Approved by: https://github.com/ezyang
2024-11-04 13:43:16 +00:00
cyy
7bbdf87517 [22/N] Fix clang-tidy warnings in jit (#134829)
Follows #134537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134829
Approved by: https://github.com/ezyang
2024-09-19 19:24:42 +00:00
cyy
07fe1dd58f [13/N] Fix clang-tidy warnings in jit (#132411)
Follows #132209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132411
Approved by: https://github.com/Skylion007
2024-08-02 03:14:09 +00:00
cyy
c99adce9a1 [12/N] Fix clang-tidy warnings in jit (#132209)
Follows #132131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132209
Approved by: https://github.com/Skylion007
2024-08-01 15:12:12 +00:00
cyy
eccbd408e5 [10/N] Fix clang-tidy warnings in jit (#132122)
Follows #132010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132122
Approved by: https://github.com/Skylion007
2024-07-30 12:56:31 +00:00
cyy
f4dcf2ae93 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-07-08 07:03:53 +00:00
PyTorch MergeBot
846bb30e13 Revert "[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)"
This reverts commit bd72e28314.

Reverted https://github.com/pytorch/pytorch/pull/128301 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails XLA build bd72e28314. Please rebase your PR before relanding because I think the failure is hidden by an unrelated broken trunk XLA failure from your current base commit ([comment](https://github.com/pytorch/pytorch/pull/128301#issuecomment-2169035822))
2024-06-15 01:58:20 +00:00
cyy
bd72e28314 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang
2024-06-14 23:21:01 +00:00
Richard Barnes
ed327876f5 [codemod] c10:optional -> std::optional (#126135)
Generated by running the following from PyTorch root:
```
find . -regex ".*\.\(cpp\|h\|cu\|hpp\|cc\|cxx\)$" | grep -v "build/" | xargs -n 50 -P 4 perl -pi -e 's/c10::optional/std::optional/'
```

`c10::optional` is just an alias for `std::optional`. This removes usages of that alias in preparation for eliminating it entirely.
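Since the alias makes the two spellings the same type, the rewrite is purely mechanical. A minimal sketch of why (the `c10_like` namespace below is a hypothetical stand-in for the real c10 header, which this listing does not show):

```cpp
#include <optional>
#include <type_traits>

namespace c10_like {
// Hypothetical stand-in for the real header: an alias template, so
// c10_like::optional<T> and std::optional<T> are the *same* type.
template <typename T>
using optional = std::optional<T>;
}  // namespace c10_like

// Because it is an alias (not a distinct wrapper type), rewriting one
// spelling to the other cannot change behavior or overload resolution.
static_assert(std::is_same_v<c10_like::optional<int>, std::optional<int>>,
              "alias and standard type are identical");
```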

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126135
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/albanD, https://github.com/aaronenyeshi
2024-05-14 19:35:51 +00:00
Aaron Gokaslan
b9182cbbd8 Fixup torch jit with some initializers and moves (#92037)
Fix up some minor code-quality issues in torch JIT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92037
Approved by: https://github.com/ezyang
2023-01-12 17:29:24 +00:00
Aaron Gokaslan
18b37bbff9 Clang-Tidy: Improve tensorexpr headers with additional std::moves (#91572)
Splitting #91559 into smaller pieces

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91572
Approved by: https://github.com/ezyang
2023-01-05 09:57:54 +00:00
Wang, Eikan
429a80dded [NNC] Lowering function generates the output buffer with the specified stride (#76529)
Summary:
Pass stride information to the lowering function so it generates the output buffer with the proper memory layout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76529

Reviewed By: ZolotukhinM

Differential Revision: D36116712

Pulled By: IvanKobzarev

fbshipit-source-id: d3901f756b3710ecce172d6db3ecb0b7c12fb929
(cherry picked from commit b6cd53c91c01db36ea0e99167dc0ce0ae1d3aa23)
2022-05-04 20:04:22 +00:00
Peter Bell
2e480fc2db Cleanup ATen-core forward declarations
I noticed that when `SymInt` was introduced, `jit_type_base.h` was
added as an include to the `Operator.h` template, which is supposed to
be kept extremely clean and use only forward declarations. I also
noticed that the forward declarations for `OptionalArrayRef` were missing.

So, I've refactored the forward declarations into
`ATen/core/ATen_fwd.h` and cleaned up some of the `c10`
headers that were masking these missing declarations. I've also
re-generated the pre-compiled header so `SymInt` is included.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76576
Approved by: https://github.com/albanD
2022-05-02 14:50:48 +00:00
zengk95
1d55518198 Revert "[nnc] Strides to Tensor (#72962)"
This reverts commit 939060925f.

Fixes https://github.com/pytorch/vision/issues/5873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76332
Approved by: https://github.com/seemethere
2022-04-25 19:50:00 +00:00
Ivan Kobzarev
939060925f [nnc] Strides to Tensor (#72962)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72962

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM, cpuhrsch

Differential Revision: D34589306

Pulled By: IvanKobzarev

fbshipit-source-id: ecee5249760ecc0c8b2edb1842b90218899bc944
(cherry picked from commit 9e310c4c67389da30da89126d838ffe3864aba6f)
2022-04-23 19:35:15 +00:00
Wang, Eikan
ef0873327e [NNC] Add utility functions to check channels-last contiguous (#75938)
Summary:
The `Buf` uses `std::vector<ExprHandle>` to represent its strides. An `ExprHandle` can be an immediate value or a mathematical expression with variables, for both static and dynamic shapes, so it is hard to deduce a channels-last contiguous layout directly by numerical calculation. Hence, the utility functions in this PR use pattern matching to check whether a `Buf` is channels-last contiguous.
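For the fully static case, the stride pattern being matched reduces to a simple check. A minimal scalar sketch (a hypothetical helper, not the NNC API, using plain integers instead of `ExprHandle`s):

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: for a 4-D buffer with dims ordered (N, C, H, W),
// channels-last ("NHWC") contiguity means the strides, in that same dim
// order, are (C*H*W, 1, W*C, C): channels vary fastest in memory.
bool isChannelsLastContiguous(const std::array<int64_t, 4>& dims,
                              const std::array<int64_t, 4>& strides) {
  const int64_t C = dims[1], H = dims[2], W = dims[3];
  return strides[1] == 1 &&           // C is innermost
         strides[3] == C &&           // a step in W skips C elements
         strides[2] == W * C &&       // a step in H skips a row of W*C
         strides[0] == H * W * C;     // a step in N skips a full image
}
```

With symbolic `ExprHandle` strides the same relationships have to be recognized structurally, which is why pattern matching is needed.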

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75938

Reviewed By: cpuhrsch

Differential Revision: D35724091

Pulled By: ZolotukhinM

fbshipit-source-id: f79ae21749d0aad8601f0434b52df88602ff09bf
(cherry picked from commit 3712bbbe4bea57c5c1abe1eafde4b8778e13e0c4)
2022-04-22 06:42:39 -07:00
Mikhail Zolotukhin
9123e9b3b5 [TensorExpr] Switch from ExprPtr to ExprHandle in Compute impl. (#72389)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72389

This is an NFC change that just prepares the code for the upcoming
deletion of the `DimArg` class. It makes the `Compute` and `Reduce`
APIs use `ExprHandle` everywhere.

There should be no observable behavior change from this PR.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D34030295

Pulled By: ZolotukhinM

fbshipit-source-id: 3fd035b6a6bd0a07ccfa92e118819478ae85412a
(cherry picked from commit 1b0a4b6fac)
2022-02-11 01:21:59 +00:00
Ivan Kobzarev
6fb8ebcd92 [tensorexp] Add strides to Buf (#68018)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68018

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32262381

Pulled By: IvanKobzarev

fbshipit-source-id: dba79add0bf703bc2378d64e726d4c47ec30e3be
2021-11-13 08:33:01 -08:00
Ivan Kobzarev
e52d0e773b [tensorexpr][ir][quant] Adding qscale and qzero to tensorexpr IR Buf (#66675)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66675

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D31676328

Pulled By: IvanKobzarev

fbshipit-source-id: c6479415fa7d809e02dd3789ee0bfd6dfe50dc92
2021-10-27 01:32:16 -07:00
Mikhail Zolotukhin
f23f21dafe [TensorExpr] Remove 'Placeholder' class. (#64887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887

BufHandle has exactly the same functionality and should be used instead.

Differential Revision: D30889483

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3
2021-09-14 00:22:44 -07:00
Bert Maher
e7fb35021a [nnc] Enable fusion of bfloat16 ops (#64196)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64196

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30643864

Pulled By: bertmaher

fbshipit-source-id: e95edeaf7089464d713ea1d1f951743d3e5f61c5
2021-08-30 20:09:36 -07:00
Raghavan Raman
a836d83957 [nnc] Fixed warning due to implicit parameter conversion (#64117)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64117

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30616945

Pulled By: navahgar

fbshipit-source-id: eaf69232ac4a684ab5f97a54a514971655f86ef3
2021-08-30 04:39:34 -07:00
Bert Maher
2e6221a232 [nnc] Make 64-bit dimensions work (#64077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64077

We were assuming kernel dimensions fit in 32 bits (the old fuser made
this assumption too), but we should be able to support 64.
ghstack-source-id: 136933272

Test Plan: unit tests; new IR level test with huge sizes

Reviewed By: ZolotukhinM

Differential Revision: D30596689

fbshipit-source-id: 23b7e393a2ebaecb0c391a6b1f0c4b05a98bcc94
2021-08-28 19:59:47 -07:00
Cheng Chang
0f6b524665 [NNC] Add C++ codegen backend to NNC (#62869)
Summary:
Adds a C++ codegen backend to NNC to generate C++ for CPU instead of generating LLVM IR.
Tensors are represented as blobs of float. Vector operations are devectorized/unrolled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62869

Test Plan:
https://github.com/pytorch/pytorch/tree/mvz-nnc-aot-prototype makes it able to AOT compile the whole MobileNetV3 model into binary code through LLVM codegen in NNC.

I forked that branch to https://github.com/cheng-chang/pytorch/tree/cc-aot-cpp, merged this PR into it, and modified `fancy_compile` to compile MobileNetV3 into C++ through

```
import torch

m = torch.jit.load('mobnet.pt')
m.eval()
f = torch.jit.freeze(m)
torch._C._fancy_compile(f.graph, [1, 3, 224, 224])
```

The generated C++ file `mobnet.cc` can be found at https://gist.github.com/cheng-chang/e2830cc6920b39204ebf368035b2bcec.

I manually compiled the generated C++ through `g++ -o mobnet -std=c++14 -L./build/lib -ltorch_cpu -ltorch mobnet.cc`, and it succeeded.

Reviewed By: ZolotukhinM

Differential Revision: D30149482

Pulled By: cheng-chang

fbshipit-source-id: e77b189f0353e37cd309423a48a513e668d07675
2021-08-26 09:56:37 -07:00
Mikhail Zolotukhin
f0d274294d [TensorExpr] Nuke KernelArena and KernelScope. (#63587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587

Now that there are no classes using KernelArena for memory management,
we can remove it.

Differential Revision: D30429115

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
2021-08-24 00:32:16 -07:00
Mikhail Zolotukhin
4e15a6f495 [TensorExpr] Switch Exprs and Stmt from kernel-arena to shared_ptr. (#63216)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63216

Currently there are three classes managed by KernelArena: Expr, Stmt,
and Tensor (and derived classes). KernelArena has been a long-standing
pain point for NNC devs, and we're moving away from that memory-management
model to a ref-count-based model (using shared_ptr). This commit
switches Expr and Stmt to shared_ptr and is the biggest change in this
transition. Later commits will detach Tensor from KernelArena and kill
the arena + scope altogether.

Differential Revision: D30353195

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 9575225ada3d0fb65087ae40435f3dfea4792cae
2021-08-24 00:32:11 -07:00
Mikhail Zolotukhin
1dc2b52764 [TensorExpr] Add a wrapper for all expr and stmt pointers. (#63195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63195

This helps us to later switch from using KernelArena with raw pointers
to shared pointers without having to change all our source files at
once.

The changes are mechanical and should not affect any functionality.

With this PR, we're changing the following:
 * `Add*` --> `AddPtr`
 * `new Add(...)` --> `alloc<Add>(...)`
 * `dynamic_cast<Add*>` --> `to<Add>`
 * `static_cast<Add*>` --> `static_to<Add>`

Due to some complications with args forwarding, some places became more
verbose, e.g.:
 * `new Block({})` --> `new Block(std::vector<ExprPtr>())`
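The mappings above can be sketched with a toy hierarchy (the `Expr`/`Add` types below are illustrative, not the real NNC classes; only the alias/helper names mirror the renames listed in the message):

```cpp
#include <memory>
#include <utility>

// Toy IR node hierarchy standing in for the real NNC classes.
struct Expr {
  virtual ~Expr() = default;
};
struct Add : Expr {
  int lhs, rhs;
  Add(int l, int r) : lhs(l), rhs(r) {}
};

// `Add*` --> `AddPtr`: the pointer spelling becomes an alias, so it can
// later be retargeted from raw pointers to shared_ptr in one place.
using ExprPtr = std::shared_ptr<Expr>;
using AddPtr = std::shared_ptr<Add>;

// `new Add(...)` --> `alloc<Add>(...)`
template <class T, class... Args>
std::shared_ptr<T> alloc(Args&&... args) {
  return std::make_shared<T>(std::forward<Args>(args)...);
}

// `dynamic_cast<Add*>` --> `to<Add>`
template <class T>
std::shared_ptr<T> to(const ExprPtr& e) {
  return std::dynamic_pointer_cast<T>(e);
}

// `static_cast<Add*>` --> `static_to<Add>`
template <class T>
std::shared_ptr<T> static_to(const ExprPtr& e) {
  return std::static_pointer_cast<T>(e);
}
```

Because every allocation and cast goes through these wrappers, switching the underlying pointer type does not require touching each call site again.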

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30292779

Pulled By: ZolotukhinM

fbshipit-source-id: 150301c7d2df56b608b035827b6a9a87f5e2d9e9
2021-08-17 13:44:45 -07:00
Raghavan Raman
e50e8b07d8 [nnc] Updated IRMutator and IRSimplifier to perform in-place mutations. (#63246)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63246

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30309636

Pulled By: navahgar

fbshipit-source-id: 409ea8d6982888cfee9127e6248044dd2ed9d8d4
2021-08-16 00:09:22 -07:00
Raghavan Raman
59dd12042e [nnc] Removed const from all fields in IR. (#62336)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62336

This PR was generated by removing `const` from all types of nodes in NNC IR and fixing the compilation errors that resulted from this change.

This is the first step in making all NNC mutations in-place.

Test Plan: Imported from OSS

Reviewed By: iramazanli

Differential Revision: D30049829

Pulled By: navahgar

fbshipit-source-id: ed14e2d2ca0559ffc0b92ac371f405579c85dd63
2021-08-03 11:44:36 -07:00
Mike Guo
6ecc1a4c4f Make pytorch clang-tidy clean (#60649)
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.

I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop

# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
  -j \
  -s \
  -k \
  -v \
  --paths torch/csrc/ \
  -g"-torch/csrc/jit/passes/onnx/helper.cpp" \
  -g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
  -g"-torch/csrc/jit/serialization/onnx.cpp" \
  -g"-torch/csrc/jit/serialization/export.cpp" \
  -g"-torch/csrc/jit/serialization/import.cpp" \
  -g"-torch/csrc/jit/serialization/import_legacy.cpp" \
  -g"-torch/csrc/onnx/init.cpp" \
  -g"-torch/csrc/cuda/nccl.*" \
  -g"-torch/csrc/cuda/python_nccl.cpp" \
  -g"-torch/csrc/autograd/FunctionsManual.cpp" \
  -g"-torch/csrc/generic/*.cpp" \
  -g"-torch/csrc/jit/codegen/cuda/runtime/*" \
  -g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
  -g"-torch/csrc/deploy/interpreter/interpreter.h" \
  -g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
  -g"-torch/csrc/deploy/interpreter/test_main.cpp"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649

Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.

Reviewed By: walterddr, janeyx99

Differential Revision: D29504258

Pulled By: 1ntEgr8

fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
2021-07-01 12:21:07 -07:00
Jason Ansel
85517a2b70 [TensorExpr] More python binding cleanups (#60058)
Summary:
A few more quality of life improvements for NNC's python bindings:
- Use standard `torch.dtype`s (rather than `te.Dtype`)
- Make names optional (they don't seem to matter)
- Make shapes optional
- A few implicit conversions to make code cleaner

Followup to https://github.com/pytorch/pytorch/issues/59920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60058

Reviewed By: bertmaher

Differential Revision: D29151953

Pulled By: jansel

fbshipit-source-id: c8286e329eb4ee3921ca0786e17248cf6a898bd8
2021-06-16 20:06:08 -07:00
Hui Guo
f4fdc49957 [NNC] Add python bindings for loopnest.compress_buffer (#59681)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59681

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28981573

Pulled By: huiguoo

fbshipit-source-id: 003d66df576903c71bf46c95851fe6ccbba76f29
2021-06-11 11:28:39 -07:00
Raghavan Raman
eef72f3f8a [NNC] Update Buf on mutation instead of creating new ones (#57513)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57513

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28226917

Pulled By: navahgar

fbshipit-source-id: 4e74c56a85b7aadc285b872b8ef8f8e26f31c8ce
2021-05-06 01:08:23 -07:00
Hui Guo
afe6b4c8ee [NNC] Add logical Operators '&&' and '||' (#56947)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56947

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28007342

Pulled By: huiguoo

fbshipit-source-id: a2ad8d2e99d7c8d8c8bdcd8f65fa3f340bdd2bbc
2021-05-01 18:44:27 -07:00
Raghavan Raman
5b7317b562 [NNC] API for Buffer Compression (#55853)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54338

This PR adds the following API in NNC to implement "buffer compression".

```
static void compressBuffer(Buf* buf, Stmt* stmt);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55853

Reviewed By: ezyang

Differential Revision: D27960986

Pulled By: navahgar

fbshipit-source-id: a69988e607196f3e2db0212313ea5deefb9859ac
2021-04-23 14:12:03 -07:00
Bert Maher
90f848572c NNC depthwise conv2d implementation (#54920)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54920

Add a depthwise convolution implementation and reasonably good
schedules for 3x3 stride=1,2.
ghstack-source-id: 126076113

Test Plan: new tensorexpr test: Conv.DepthwiseConv2D

Reviewed By: ZolotukhinM

Differential Revision: D27413745

fbshipit-source-id: 833da6072b655fbe2b679704e9d56a08e1bf7e7e
2021-04-08 21:56:53 -07:00
Nikita Shulga
6a39613f35 [BE] Make torch/csrc/jit/tensorexpr/ clang-tidy clean (#55628)
Summary:
Mostly auto-generated changes using
```
 python3 tools/clang_tidy.py -c build -x torch/csrc/jit/tensorexpr/eval.cpp -s
```
With following common patterns manually fixed
- Use ` = default` instead of `{}`
- Make deleted methods public
- Use pass-by-value + std::move instead of pass-by-reference+copy
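The last pattern can be sketched as follows (an illustrative class, not one from the codebase):

```cpp
#include <string>
#include <utility>

// Pass-by-value + std::move: the caller pays one copy for an lvalue
// argument or one move for an rvalue, instead of always paying
// a copy with the pass-by-const-reference-then-copy pattern.
class Kernel {
 public:
  explicit Kernel(std::string name) : name_(std::move(name)) {}
  const std::string& name() const { return name_; }

 private:
  std::string name_;
};
```

This is the fix clang-tidy's `modernize-pass-by-value` check suggests for copy-only constructor parameters.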

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55628

Reviewed By: walterddr

Differential Revision: D27655378

Pulled By: malfet

fbshipit-source-id: 92be87a08113435d820711103ea9b0364182c71a
2021-04-08 19:44:14 -07:00
Mikhail Zolotukhin
ff6b3c76ab [TensorExpr] Add TORCH_APIs to all expr classes. (#55002)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55002

Test Plan: Imported from OSS

Reviewed By: navahgar, walterddr

Differential Revision: D27446409

Pulled By: ZolotukhinM

fbshipit-source-id: 3442d5876bc68974fb3d44878f89c1a7895668d2
2021-04-01 19:48:10 -07:00
Mikhail Zolotukhin
1ccaec0238 [TensorExpr] Cleanup IRNodeType enum. (#55001)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55001

The enum is only used for precedence computation, so we only need to
enumerate node types for which we know the precedence priority.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27446410

Pulled By: ZolotukhinM

fbshipit-source-id: 217dd63c4fd086155030ebf0c3e1772605109f7b
2021-04-01 19:48:07 -07:00
Horace He
42e0983230 [NNC] Added some APIs for dealing directly with Bufs (instead of Tensors) (#53011)
Summary:
(also includes some python binding stuff :P)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53011

Reviewed By: gchanan, robieta

Differential Revision: D26801120

Pulled By: Chillee

fbshipit-source-id: 42a1efb6cbc9ddc0b72b780f3d6b712b3ae62b09
2021-03-05 06:55:48 -08:00
Hui Guo
973e306c84 changed TE 'Allocate' API to take one argument 'Buf' instead of three arguments 'Var', 'dtype', 'dims'. (#50167)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50167

Test Plan:
Imported from OSS

`python test/test_jit_fuser_te.py`
`python test/test_jit_fuser_legacy.py`
`python test/test_jit_fuser.py`
`build/bin/test_tensorexpr`

Reviewed By: ZolotukhinM

Differential Revision: D25814342

Pulled By: huiguoo

fbshipit-source-id: 44cba7f92365b826c9cb1d385a94858934570dee
2021-02-22 15:08:51 -08:00
Zirui Tao
2b202667c1 [1/N] CPU pointwise optimization: Add a benchmark for Relu
Summary: As title

Test Plan:
Building: finished in 01:58.4 min (100%) 16761/16761 jobs, 16761 updated
  Total time: 02:32.3 min
Run on (24 X 2394.45 MHz CPU s)
2021-02-16 21:29:30
----------------------------------------------------------------------------------------------------
Benchmark                                             Time           CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------
relu_nnc/64                                        1738 ns       1738 ns     410535 log/s=36.8257M/s
relu_nnc/512                                       1708 ns       1708 ns     408678 log/s=299.711M/s
relu_nnc/8192                                      3297 ns       3297 ns     214362 log/s=2.48499G/s
relu_nnc/32768                                    10725 ns      10722 ns      61032 log/s=3.05603G/s
log_nnc_sleef/64                                   2076 ns       2075 ns     326248 log/s=30.8436M/s
log_nnc_sleef/512                                  3070 ns       3069 ns     230616 log/s=166.81M/s
log_nnc_sleef/8192                                22214 ns      22210 ns      31251 log/s=368.849M/s
log_nnc_sleef/32768                               85835 ns      85824 ns       8366 log/s=381.804M/s
log_nnc_fast/64                                    1852 ns       1852 ns     379123 log/s=34.5532M/s
log_nnc_fast/512                                   2456 ns       2456 ns     299463 log/s=208.503M/s
log_nnc_fast/8192                                 10953 ns      10952 ns      69894 log/s=747.957M/s
log_nnc_fast/32768                                35424 ns      35422 ns      19986 log/s=925.08M/s
log_nnc_vml/64                                     2361 ns       2361 ns     356220 log/s=27.1063M/s
log_nnc_vml/512                                    2218 ns       2218 ns     313444 log/s=230.857M/s
log_nnc_vml/8192                                   8420 ns       8420 ns      81594 log/s=972.912M/s
log_nnc_vml/32768                                 29484 ns      29484 ns      21701 log/s=1.1114G/s
log_aten/64                                       15970 ns      15970 ns      44401 log/s=4.00742M/s
log_aten/512                                      18344 ns      18344 ns      41056 log/s=27.9114M/s
log_aten/8192                                     24894 ns      24893 ns      27414 log/s=329.084M/s
log_aten/32768                                    29129 ns      29125 ns      22477 log/s=1.12508G/s
logit_nnc_sleef/64                                 2379 ns       2379 ns     261168 logit/s=26.8981M/s
logit_nnc_sleef/512                                5778 ns       5774 ns     114009 logit/s=88.6757M/s
logit_nnc_sleef/8192                              57268 ns      57236 ns      12429 logit/s=143.127M/s
logit_nnc_sleef/32768                            216356 ns     216344 ns       3026 logit/s=151.462M/s
logit_nnc_fast/64                                  2178 ns       2173 ns     282306 logit/s=29.4565M/s
logit_nnc_fast/512                                 2955 ns       2943 ns     202527 logit/s=173.95M/s
logit_nnc_fast/8192                               14836 ns      14835 ns      46794 logit/s=552.192M/s
logit_nnc_fast/32768                              53999 ns      53997 ns      12842 logit/s=606.846M/s
logit_nnc_vml/64                                   2132 ns       2132 ns     335874 logit/s=30.018M/s
logit_nnc_vml/512                                  3029 ns       3029 ns     250988 logit/s=169.058M/s
logit_nnc_vml/8192                                13264 ns      13263 ns      53504 logit/s=617.655M/s
logit_nnc_vml/32768                               49395 ns      48284 ns      14526 logit/s=678.654M/s
logit_aten/64                                     88180 ns      86690 ns       9270 logit/s=738.261k/s
logit_aten/512                                    54682 ns      54489 ns      10000 logit/s=9.3964M/s
logit_aten/8192                                  170878 ns     164357 ns       6965 logit/s=49.8427M/s
logit_aten/32768                                 452291 ns     434638 ns       3967 logit/s=75.3915M/s
logit_caffe2/64                                   30170 ns      29902 ns      24686 logit/s=2.14029M/s
logit_caffe2/512                                 203517 ns     201201 ns       3570 logit/s=2.54472M/s
logit_caffe2/8192                               3199528 ns    3157098 ns        220 logit/s=2.59479M/s
logit_caffe2/32768                             12520838 ns   12504846 ns         56 logit/s=2.62042M/s
tanh_nnc_fast/64                                   1979 ns       1977 ns     309745 tanh/s=32.3752M/s
tanh_nnc_fast/512                                  2331 ns       2331 ns     300937 tanh/s=219.636M/s
tanh_nnc_fast/8192                                 8323 ns       8323 ns      83601 tanh/s=984.26M/s
tanh_nnc_fast/32768                               30767 ns      30766 ns      23024 tanh/s=1065.06M/s
tanh_aten/64                                      17181 ns      17180 ns      36818 tanh/s=3.72522M/s
tanh_aten/512                                     19071 ns      19036 ns      37243 tanh/s=26.8968M/s
tanh_aten/8192                                    53542 ns      52006 ns      16268 tanh/s=157.521M/s
tanh_aten/32768                                  619869 ns     587600 ns       1000 tanh/s=55.7658M/s
tanh_caffe2/64                                     9668 ns       9654 ns      70926 tanh/s=6.62919M/s
tanh_caffe2/512                                   70409 ns      70409 ns       9881 tanh/s=7.27184M/s
tanh_caffe2/8192                                1179098 ns    1179011 ns        644 tanh/s=6.9482M/s
tanh_caffe2/32768                               4384300 ns    4382613 ns        156 tanh/s=7.47682M/s
BatchNorm/ATen/1/64/112/112                    23186429 ns   23183715 ns         27 GB/s=277.028M/s
BatchNorm/ATen/1/256/14/14                      1772907 ns    1770636 ns        394 GB/s=226.703M/s
BatchNorm/ATen/1/128/28/28                      3069417 ns    3069229 ns        232 GB/s=261.569M/s
BatchNorm/ATen/1/64/56/56                       6367276 ns    6367190 ns        111 GB/s=252.173M/s
BatchNorm/ATen/1/512/7/7                        1334734 ns    1334373 ns        516 GB/s=150.411M/s
BatchNorm/ATen/5/64/112/112                   131727903 ns  131721364 ns          7 GB/s=243.792M/s
BatchNorm/ATen/5/256/14/14                      7879002 ns    7874672 ns         85 GB/s=254.873M/s
BatchNorm/ATen/5/128/28/28                     15561373 ns   15269781 ns         42 GB/s=262.877M/s
BatchNorm/ATen/5/64/56/56                      29169722 ns   29107393 ns         24 GB/s=275.812M/s
BatchNorm/ATen/5/512/7/7                        5042006 ns    5028687 ns        100 GB/s=199.559M/s
BatchNorm/NNC/1/64/112/112                      3303598 ns    3271058 ns        188 GB/s=1.96344G/s
BatchNorm/NNC/1/256/14/14                        330641 ns     326644 ns       2033 GB/s=1.22889G/s
BatchNorm/NNC/1/128/28/28                        498706 ns     497894 ns       1131 GB/s=1.61242G/s
BatchNorm/NNC/1/64/56/56                        1116910 ns    1114768 ns        641 GB/s=1.44033G/s
BatchNorm/NNC/1/512/7/7                          163380 ns     163351 ns       3493 GB/s=1.22867G/s
BatchNorm/NNC/5/64/112/112                     16392078 ns   16386427 ns         41 GB/s=1.95971G/s
BatchNorm/NNC/5/256/14/14                       1133781 ns    1133369 ns        674 GB/s=1.77086G/s
BatchNorm/NNC/5/128/28/28                       2053208 ns    2053211 ns        276 GB/s=1.95503G/s
BatchNorm/NNC/5/64/56/56                        3874949 ns    3874734 ns        165 GB/s=2.07193G/s
BatchNorm/NNC/5/512/7/7                          653665 ns     651498 ns       1236 GB/s=1.54033G/s
BatchNorm/ATenRelu/1/64/112/112                36878892 ns   36100523 ns         22 GB/s=177.907M/s
BatchNorm/ATenRelu/1/256/14/14                  6404318 ns    5544976 ns        100 GB/s=72.3913M/s
BatchNorm/ATenRelu/1/128/28/28                  5897059 ns    5735509 ns        106 GB/s=139.973M/s
BatchNorm/ATenRelu/1/64/56/56                  10075458 ns    9965146 ns         62 GB/s=161.125M/s
BatchNorm/ATenRelu/1/512/7/7                    2680507 ns    2662541 ns        254 GB/s=75.3806M/s
BatchNorm/ATenRelu/5/64/112/112               145738113 ns  144253693 ns          5 GB/s=222.612M/s
BatchNorm/ATenRelu/5/256/14/14                 13582519 ns   13427209 ns         65 GB/s=149.476M/s
BatchNorm/ATenRelu/5/128/28/28                 22747138 ns   22627185 ns         31 GB/s=177.401M/s
BatchNorm/ATenRelu/5/64/56/56                  53609692 ns   52936728 ns         15 GB/s=151.656M/s
BatchNorm/ATenRelu/5/512/7/7                   11378314 ns   11083777 ns         65 GB/s=90.5395M/s
BatchNorm/NNCRelu/1/64/112/112                  3154436 ns    3148939 ns        193 GB/s=2.03958G/s
BatchNorm/NNCRelu/1/256/14/14                    337341 ns     337163 ns       1926 GB/s=1.19055G/s
BatchNorm/NNCRelu/1/128/28/28                    505570 ns     505569 ns       1231 GB/s=1.58794G/s
BatchNorm/NNCRelu/1/64/56/56                     903452 ns     903421 ns        659 GB/s=1.77728G/s
BatchNorm/NNCRelu/1/512/7/7                      158521 ns     158321 ns       3781 GB/s=1.2677G/s
BatchNorm/NNCRelu/5/64/112/112                 15488210 ns   15480019 ns         41 GB/s=2.07446G/s
BatchNorm/NNCRelu/5/256/14/14                   1149186 ns    1148963 ns        649 GB/s=1.74683G/s
BatchNorm/NNCRelu/5/128/28/28                   2011589 ns    2011424 ns        320 GB/s=1.99564G/s
BatchNorm/NNCRelu/5/64/56/56                    3776274 ns    3776060 ns        161 GB/s=2.12607G/s
BatchNorm/NNCRelu/5/512/7/7                      699762 ns     699582 ns        975 GB/s=1.43446G/s
BM_CompileSwish                                30471825 ns   30470017 ns         24
BM_CompileSwishLLVMOnly                        27479624 ns   27473475 ns         25
FusedOverhead                                    196219 ns     196195 ns       3342
UnfusedOverhead                                  220210 ns     220119 ns       3302
Gemm/Torch/128/128/128                           115526 ns     115343 ns       7414 GFLOPS=36.3637G/s
Gemm/TensorExprNoopt/128/128/128                3155851 ns    3155706 ns        210 GFLOPS=1.32912G/s
Gemm/TensorExprTile32x32/128/128/128             124454 ns     124452 ns       5774 GFLOPS=33.7021G/s
Gemm/TensorExprTile4x16/128/128/128              174408 ns     174366 ns       3987 GFLOPS=24.0546G/s
Gemm/TensorExprTile4x16VecUnroll/128/128/128      72949 ns      72948 ns       9028 GFLOPS=57.4974G/s
Gemm/TensorExprTile4x16Cache/128/128/128          73237 ns      73234 ns       9501 GFLOPS=57.2726G/s
Reduce1D/Torch/16777216                       426865265 ns  426853756 ns          2 BYTES=157.217M/s
Reduce1D/Naive/16777216                       132347709 ns  132343710 ns          5 BYTES=507.08M/s
Reduce1D/NativeRfactor/16777216               234668375 ns  234664682 ns          3 BYTES=285.978M/s
Reduce1D/TeNaive/16777216                      20468304 ns   20467906 ns         34 BYTES=3.27874G/s
Reduce1D/TeSplitTail/16777216                  20378995 ns   20378678 ns         34 BYTES=3.29309G/s
Reduce1D/TeSplitMask/16777216                  20371783 ns   20371260 ns         36 BYTES=3.29429G/s
Reduce1D/TeRfactorV2/16777216                   8235908 ns    8235723 ns         84 BYTES=8.14851G/s

CPU info:

Running ```sudo lshw -class processor``` reports 24 CPUs with identical architecture, as follows:

  *-cpu:0
       description: CPU
       product: Intel Core Processor (Broadwell)
       vendor: Intel Corp.
       physical id: 400
       bus info: cpu@0
       version: 6.61.2
       slot: CPU 0
       size: 2GHz
       capacity: 2GHz
       width: 64 bits
       capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp x86-64 constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
       configuration: cores=1 enabledcores=1 microcode=1 threads=1

Reviewed By: bwasti

Differential Revision: D26275048

fbshipit-source-id: 3de669f622eb8cd328787caa878dc0c05de600a5
2021-02-17 17:18:28 -08:00
Bert Maher
2e35fe9535 [te] Implement log approximation using the VML approach (#51752)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51752

Using a straight power series approximation with enough terms gives
precision down to the denormal range, and avoids the fp division used in the
sleef approach.  This is nice because recent CPUs have dual pipelined fma units,
so we can compute 16 logarithms in parallel; whereas there's usually only one
FP divider and it has a fairly high latency/low throughput.
ghstack-source-id: 121392347

Test Plan:
On my avx2+fma broadwell:
```
---------------------------------------------------------------------------
Benchmark                    Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64           178 ns        178 ns    3933565 log/s=358.993M/s
log_nnc_sleef/512         1286 ns       1285 ns     559459 log/s=398.354M/s
log_nnc_sleef/8192       19366 ns      19364 ns      36619 log/s=423.053M/s
log_nnc_sleef/32768      79288 ns      79286 ns       8718 log/s=413.287M/s

log_nnc_fast/64             92 ns         92 ns    7644990 log/s=696.939M/s
log_nnc_fast/512           483 ns        483 ns    1426802 log/s=1059.49M/s
log_nnc_fast/8192         7519 ns       7514 ns      95319 log/s=1090.23M/s
log_nnc_fast/32768       31344 ns      31338 ns      22397 log/s=1045.62M/s

log_nnc_vml/64              88 ns         88 ns    7923812 log/s=728.469M/s
log_nnc_vml/512            454 ns        454 ns    1521437 log/s=1.12739G/s
log_nnc_vml/8192          6763 ns       6763 ns     103264 log/s=1.21136G/s
log_nnc_vml/32768        26565 ns      26564 ns      23609 log/s=1.23354G/s

log_aten/64                418 ns        418 ns    1651401 log/s=153.117M/s
log_aten/512               801 ns        801 ns     875857 log/s=638.923M/s
log_aten/8192             6877 ns       6872 ns     100840 log/s=1.19208G/s
log_aten/32768           26989 ns      26988 ns      26268 log/s=1.21416G/s
```

Reviewed By: bwasti, zheng-xq

Differential Revision: D26246400

fbshipit-source-id: dae47ee6baeab1a813ec4d4440748164051aed3d
2021-02-10 02:09:10 -08:00
Mikhail Zolotukhin
42aeb68128 [TensorExpr] Move 'initializer' field from 'Tensor' to 'Buf'. (#50993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50993

This is the first step toward making `Tensor` a thin wrapper over `Buf` and
`Stmt`, which will be finished in subsequent PRs. This change also
makes it possible to remove `buf_initializers_` from `LoopNest`, making it "less
stateful".

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26038224

Pulled By: ZolotukhinM

fbshipit-source-id: f418816e54c62f291fa45812901487394e9b95b5
2021-01-27 16:10:53 -08:00
Bram Wasti
d60d108280 [nnc] Expose fast tanh/sigmoid (#50736)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50736

Exposes tanh and sigmoid to other backends

Test Plan: buck test caffe2/test/cpp/tensorexpr:tensorexpr -- "ATen.fast"

Reviewed By: bertmaher

Differential Revision: D25884911

fbshipit-source-id: f9a5286450331f60935cfd40bb23f4a4f4c1d087
2021-01-22 09:56:02 -08:00
Peng Wu
6568572712 Support integral types for kAbs in SimpleIREvaluator (#49357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49357

This is a follow-up fix for PR #48679. The previous PR
added support for integer inputs to aten::abs by promoting integers to
float and then demoting the result back to integers. This PR supports
integer inputs to aten::abs more efficiently in the SimpleIREvaluator
by implementing integer inputs for kAbs (renamed from kFabs).
- Rename kFabs to kAbs
- Add support for integer input to kAbs in SimpleIREvaluator (note that
llvm_codegen and cuda_codegen already support integer inputs to kAbs)

Test Plan:
- `PYTORCH_TENSOREXPR_DONT_USE_LLVM=1 python test/test_jit_fuser_te.py
TestTEFuser.test_unary_ops`
- `python test/test_jit_fuser_te.py TestTEFuser.test_unary_ops`

Imported from OSS

Reviewed By: eellison

Differential Revision: D25545791

fbshipit-source-id: e52f51a352d149f66ce8341fb3beb479be08a230
2020-12-18 07:57:58 -08:00
Bram Wasti
1047957831 [te][reapply] Add fast log approximation based on sleef (#49575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49575

This is a fast log implementation.

benchmark:

```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat

Reviewed By: bertmaher

Differential Revision: D25627157

fbshipit-source-id: a4920f4f4005ce617d372b375e790ca966275cd9
2020-12-17 17:02:00 -08:00
Edward Yang
ea4ccc730e Revert D25445815: [te] Add fast log approximation based on sleef
Test Plan: revert-hammer

Differential Revision:
D25445815 (1329066b69)

Original commit changeset: 20696eacd12a

fbshipit-source-id: 38830a6abd16260d60e5dd9a5594e65736a9c782
2020-12-17 15:03:17 -08:00
Bram Wasti
1329066b69 [te] Add fast log approximation based on sleef
Summary:
This is a fast log implementation.

benchmark:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat

Reviewed By: bertmaher

Differential Revision: D25445815

fbshipit-source-id: 20696eacd12a55e797f606f4a6dbbd94c9652888
2020-12-17 14:28:34 -08:00