Commit Graph

21 Commits

Author SHA1 Message Date
Richard Barnes
29d759948e use irange for loops 2 (#66746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66746

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for (TYPE var = x0; var < x_max; var++)`

to the format

`for (const auto var : irange(x_max))`

This was achieved by running r-barnes's loop upgrader script (D28874212) with some modifications to exclude all files under /torch/jit, plus a number of hand-written reversions and unused-variable warning suppressions.
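
As a minimal sketch of the rewrite (variable names illustrative; `c10::irange` is the real helper used):

```
#include <c10/util/irange.h>
#include <vector>

void scale(std::vector<float>& v, float s) {
  // Before: for (size_t i = 0; i < v.size(); i++) v[i] *= s;
  // After: the index is const and its type is deduced automatically.
  for (const auto i : c10::irange(v.size())) {
    v[i] *= s;
  }
}
```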

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D31705361

fbshipit-source-id: 33fd22eb03086d114e2c98e56703e8ec84460268
2021-12-10 04:26:23 -08:00
Shashank Chaudhry
89c4e8c22b [NOOP][clangformat][codemod] Enable CLANGFORMAT for some folders in caffe2/* (#67746)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67746

Test Plan: Visual inspection. Sandcastle.

Reviewed By: zertosh

Differential Revision: D31986646

fbshipit-source-id: 91885c20c3cead3853c49abb9fe0a94a67f33cc8
2021-11-03 12:23:14 -07:00
Xue Li
2f099c7555 Revert D30652629: use irange for loops
Test Plan: revert-hammer

Differential Revision: D30652629 (687c2267d4)

Original commit changeset: 0ae6c4bbbb55

fbshipit-source-id: 5c4f067b584a021c8c9656454d1ee60999600fb3
2021-10-15 15:23:10 -07:00
Richard Barnes
687c2267d4 use irange for loops (#66234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for (TYPE var = x0; var < x_max; var++)`

to the format

`for (const auto var : irange(x_max))`

This was achieved by running r-barnes's loop upgrader script (D28874212) with some modifications to exclude all files under /torch/jit, plus a number of hand-written reversions and unused-variable warning suppressions.

bypass_size_limit
allow-large-files

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D30652629

fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e
2021-10-15 13:50:33 -07:00
Mikhail Zolotukhin
3a0165da49 [TensorExpr] Port NNC lowerings to the new registry mechanism. (#65551)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65551

Previously we had a big switch on Op kind to decide how to lower a given
JIT operator to NNC. This PR changes this switch to a hash table lookup.

Why? This helps us with at least two things:
1) With this approach we can easily check in advance whether we know how to
handle a given node - i.e. we can inspect the entire graph and tell whether
it's possible to compile it without actually trying to do so and dying in
the middle. This would allow us to, say, provide user-friendly error
messages in the AOT workflow.
2) We can switch to using schema instead of op kind to determine the correct
lowering. Unlike op schema, op kind might be ambiguous (see e.g. #64963),
and using it instead of schema can lead to bugs.
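
Illustratively, the lookup-based approach might look like this (a sketch with stand-in types and hypothetical names, not the actual NNC registry API):

```
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

struct Tensor {};    // stand-ins for the NNC types
struct ArgValue {};

// Map a full operator schema string to its lowering function; keying on
// the schema rather than the op kind keeps overloads unambiguous.
using LoweringFunc = std::function<Tensor(const std::vector<ArgValue>&)>;

std::unordered_map<std::string, LoweringFunc>& getLoweringRegistry() {
  static std::unordered_map<std::string, LoweringFunc> registry;
  return registry;
}

// Now "can we compile this graph?" is answerable up front:
bool canLower(const std::string& schema) {
  return getLoweringRegistry().count(schema) != 0;
}
```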

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31148926

Pulled By: ZolotukhinM

fbshipit-source-id: ac12684e2126c899426ef5e4cc1e3f70fa01f704
2021-09-30 22:56:18 -07:00
Mikhail Zolotukhin
f23f21dafe [TensorExpr] Remove 'Placeholder' class. (#64887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887

BufHandle has exactly the same functionality and should be used instead.

Differential Revision: D30889483

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3
2021-09-14 00:22:44 -07:00
Mikhail Zolotukhin
f0d274294d [TensorExpr] Nuke KernelArena and KernelScope. (#63587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587

Now that there are no classes using KernelArena for memory management, we
can remove it.

Differential Revision: D30429115

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
2021-08-24 00:32:16 -07:00
Mikhail Zolotukhin
62d02f2b57 [TensorExpr] Make 'Tensor' a value type. (#63586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63586

This is another commit in the transition away from KernelArena memory
management. A Tensor is essentially just a pair of <BufPtr, StmtPtr>, so we
don't need to dynamically allocate it at all - it's cheap to pass by value,
and that's what we're switching to in this commit.

After this change nothing uses KernelScope/KernelArena and they can be
safely removed.
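
A rough sketch of what the value type amounts to (stand-in types; the actual class has more API):

```
#include <memory>
#include <utility>

struct Buf {};   // stand-ins for the NNC IR nodes
struct Stmt {};
using BufPtr = std::shared_ptr<Buf>;
using StmtPtr = std::shared_ptr<Stmt>;

class Tensor {
 public:
  Tensor(BufPtr buf, StmtPtr stmt)
      : buf_(std::move(buf)), stmt_(std::move(stmt)) {}
  BufPtr buf() const { return buf_; }
  StmtPtr stmt() const { return stmt_; }

 private:
  BufPtr buf_;   // copying a Tensor just bumps two refcounts
  StmtPtr stmt_;
};
```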

Differential Revision: D30429114

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: f90b859cfe863692b7beffbe9bd0e4143df1e819
2021-08-24 00:32:13 -07:00
Mikhail Zolotukhin
dd96c26066 [TensorExpr] More NFC changes like Expr* -> ExprPtr. (#63778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63778

This is a preparation for a switch from raw pointers to shared pointers
as a memory model for TE expressions and statements.
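
The rename makes the eventual switch a one-line change; conceptually (a sketch, not the exact declarations):

```
struct Expr {};  // stand-in for the NNC IR base class

// This PR (NFC): alias the raw pointer and spell ExprPtr everywhere.
using ExprPtr = Expr*;

// Planned follow-up: retarget the alias without touching call sites.
// using ExprPtr = std::shared_ptr<Expr>;
```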

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30487425

Pulled By: ZolotukhinM

fbshipit-source-id: 9cbe817b7d4e5fc2f150b29bb9b3bf578868f20c
2021-08-24 00:30:49 -07:00
Bert Maher
10e11dbdcd Reland D29190420: [nnc][tests] Tests and benchmarks for computeSum (#60550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60550

Original commit changeset: ed655497a981

Whatever gcc version the OSS Bazel build uses wasn't happy move-constructing
the SimpleIREvaluator, so use a unique_ptr instead.
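
A hedged sketch of the workaround (stand-in type; the real constructor takes arguments):

```
#include <memory>

struct SimpleIREvaluator {};  // stand-in for the NNC evaluator

void build() {
  // Before: move-constructing the evaluator by value tripped up the gcc
  // used by the OSS Bazel build:
  //   SimpleIREvaluator eval = makeEvaluator();
  // After: construct once on the heap and hold it behind a unique_ptr.
  auto eval = std::make_unique<SimpleIREvaluator>();
}
```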

Test Plan:
CI. Hopefully the gcc version used by the OSS Bazel build is happier with
this (it should be), since actually testing it locally is an intractable
pain.

Reviewed By: navahgar

Differential Revision: D29333116

fbshipit-source-id: c3e4b5d8c91eb96a43ae5315a01ca0c0f4d4a99d
2021-06-23 10:50:03 -07:00
Anjali Chourdia
b14f19b6fe Revert D29190420: [nnc][tests] Tests and benchmarks for computeSum
Test Plan: revert-hammer

Differential Revision: D29190420 (21479ad20c)

Original commit changeset: 86246df82098

fbshipit-source-id: ed655497a981783da4c8f13e2d7fec104e3cb184
2021-06-23 06:59:37 -07:00
Bert Maher
21479ad20c [nnc][tests] Tests and benchmarks for computeSum (#60160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60160

Adds a few simple tests and benchmarks for the `computeSum` op
(equivalent to `at::sum`).

The benchmarks test 1D reduction and 2D row and column reduction. Performance
is in the ballpark of ATen (14-15 GB/s) on my Skylake devserver for all cases,
and occasionally better (e.g. the 256k * 64 row reduction goes from 9 GB/s to
13 GB/s).
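
For context, the ATen calls being compared against (a sketch):

```
#include <ATen/ATen.h>

void reductions() {
  at::Tensor a = at::rand({16777216});    // 1-D input
  at::Tensor m = at::rand({64, 262144});  // 2-D input

  at::Tensor r    = at::sum(a);       // Reduce1D
  at::Tensor rows = at::sum(m, {1});  // Reduce2DRow
  at::Tensor cols = at::sum(m, {0});  // Reduce2DCol
}
```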

Results (on my skylake-avx512, with turbo disabled):
```
------------------------------------------------------------------------------------------
Benchmark                                   Time           CPU Iterations UserCounters...
------------------------------------------------------------------------------------------
Reduce1D/Torch/16777216               4746995 ns    4746722 ns        150 BYTES=14.1379G/s
Reduce1D/Naive/16777216              34063215 ns   34061388 ns         21 BYTES=1.97023G/s
Reduce1D/NativeRfactor/16777216       5057175 ns    5057167 ns        139 BYTES=13.2701G/s
Reduce1D/TeNaive/16777216            33868945 ns   33868851 ns         21 BYTES=1.98143G/s
Reduce1D/TeSplitTail/16777216        33902786 ns   33900436 ns         21 BYTES=1.97959G/s
Reduce1D/TeSplitMask/16777216        33922509 ns   33920604 ns         21 BYTES=1.97841G/s
Reduce1D/TeRfactorV1/16777216         5141150 ns    5141002 ns        135 BYTES=13.0537G/s
Reduce1D/Op/16777216                  5140390 ns    5140091 ns        135 BYTES=13.056G/s
Reduce2DCol/Torch/8/2097152          12824403 ns   12823563 ns         55 BYTES=5.8874G/s
Reduce2DCol/Torch/64/262144           8306873 ns    8306743 ns         83 BYTES=8.20507G/s
Reduce2DCol/Torch/4096/4096           7992364 ns    7992239 ns         87 BYTES=8.3988G/s
Reduce2DCol/OpSchedule/8/2097152/0    4866144 ns    4865766 ns        138 BYTES=15.5161G/s
Reduce2DCol/OpSchedule/64/262144/0   36668978 ns   36666415 ns         19 BYTES=1.85885G/s
Reduce2DCol/OpSchedule/4096/4096/0  155862459 ns  155801266 ns          4 BYTES=430.839M/s
Reduce2DCol/OpSchedule/8/2097152/1    8067683 ns    8061117 ns         85 BYTES=9.36563G/s
Reduce2DCol/OpSchedule/64/262144/1    7496686 ns    7496562 ns         93 BYTES=9.09183G/s
Reduce2DCol/OpSchedule/4096/4096/1    5262821 ns    5262186 ns        131 BYTES=12.7562G/s
Reduce2DCol/OpSchedule/8/2097152/2    6237899 ns    6237210 ns        109 BYTES=12.1044G/s
Reduce2DCol/OpSchedule/64/262144/2    5258012 ns    5257655 ns        127 BYTES=12.9635G/s
Reduce2DCol/OpSchedule/4096/4096/2    5231686 ns    5228241 ns        132 BYTES=12.839G/s
Reduce2DCol/OpSchedule/8/2097152/3   11088573 ns   11087557 ns         62 BYTES=6.80921G/s
Reduce2DCol/OpSchedule/64/262144/3    5338843 ns    5338326 ns        127 BYTES=12.7676G/s
Reduce2DCol/OpSchedule/4096/4096/3    4311617 ns    4308102 ns        162 BYTES=15.5812G/s
Reduce2DRow/Torch/8/2097152           4642244 ns    4641794 ns        151 BYTES=14.4575G/s
Reduce2DRow/Torch/64/262144           4628311 ns    4628245 ns        151 BYTES=14.4999G/s
Reduce2DRow/Torch/4096/4096           4894012 ns    4893316 ns        143 BYTES=13.7177G/s
Reduce2DRow/Torch/262144/64          10469098 ns   10468027 ns         68 BYTES=6.51101G/s
Reduce2DRow/Hand/262144/64            5554380 ns    5554059 ns        126 BYTES=12.2716G/s
Reduce2DRow/OpSchedule/8/2097152/0   33890363 ns   33888931 ns         21 BYTES=1.98026G/s
Reduce2DRow/OpSchedule/64/262144/0   33901317 ns   33899436 ns         21 BYTES=1.97965G/s
Reduce2DRow/OpSchedule/4096/4096/0   33500358 ns   33498815 ns         21 BYTES=2.00381G/s
Reduce2DRow/OpSchedule/262144/64/0   13132231 ns   13131049 ns         53 BYTES=5.19056G/s
Reduce2DRow/OpSchedule/8/2097152/1    5200423 ns    5200025 ns        134 BYTES=12.9055G/s
Reduce2DRow/OpSchedule/64/262144/1    5204428 ns    5204327 ns        133 BYTES=12.8949G/s
Reduce2DRow/OpSchedule/4096/4096/1    8724355 ns    8723370 ns         80 BYTES=7.69488G/s
Reduce2DRow/OpSchedule/262144/64/1 1811861280 ns 1811352083 ns          1 BYTES=37.6279M/s
Reduce2DRow/OpSchedule/8/2097152/2    9169829 ns    9168946 ns         76 BYTES=7.31915G/s
Reduce2DRow/OpSchedule/64/262144/2    9159901 ns    9158560 ns         76 BYTES=7.32747G/s
Reduce2DRow/OpSchedule/4096/4096/2    9217398 ns    9215557 ns         76 BYTES=7.28391G/s
Reduce2DRow/OpSchedule/262144/64/2   10820450 ns   10818998 ns         66 BYTES=6.29979G/s
Reduce2DRow/OpSchedule/8/2097152/3    5227921 ns    5226544 ns        133 BYTES=12.84G/s
Reduce2DRow/OpSchedule/64/262144/3    5194362 ns    5194082 ns        133 BYTES=12.9203G/s
Reduce2DRow/OpSchedule/4096/4096/3    5196080 ns    5195349 ns        134 BYTES=12.9203G/s
Reduce2DRow/OpSchedule/262144/64/3    5235189 ns    5234728 ns        133 BYTES=13.0202G/s
```

ghstack-source-id: 131753875

Test Plan: these tests

Reviewed By: navahgar

Differential Revision: D29190420

fbshipit-source-id: 86246df82098da4f5493d6c4f34a40016d95a9f0
2021-06-22 23:04:09 -07:00
Raghavan Raman
dd7bbe1a63 [NNC] Make splitWithMask transform in-place (#58269)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58269

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28427227

Pulled By: navahgar

fbshipit-source-id: 4e38a436abcf4752fd7ef6ab3666876eec6ea5ba
2021-05-25 11:32:51 -07:00
Raghavan Raman
e2467cc43e [NNC] Make splitWithTail transform in-place (#58268)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58268

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28427228

Pulled By: navahgar

fbshipit-source-id: 270b62c4e83739ad21dd68f375120e56881b394f
2021-05-25 11:31:14 -07:00
Mikhail Zolotukhin
b01a15d3d3 [TensorExpr] Redesign Rfactor loopnest transformation. (#55324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55324

With this change `rfactor` only affects the passed loop and its body
never touching anything outside (that was a rootcause of a bug with the
previous implementation). Also, we don't have an `insertion_point`
parameter anymore - its meaning was vague, and the effect of it
should've been achievable with other transformations anyway.

The new `rfactor` semantics is as follows:

```
Requirements:
 * S is the reduction store
 * S is the only statement in the innermost loop
 * There are at least two reduction arguments in S
 * OUTER_REDUCTION_FOR loop corresponds to the outermost reduction variable
 used in the store and all other reduction variables are index variables of
 children loops of OUTER_REDUCTION_FOR
 * OUTER_REDUCTION_FOR is a perfect loop nest, i.e. it has only loops
 corresponding to the other reduction variables and the store, nested into
 each other

What it does:
  * Introduce a new buffer with an extra dimension of a size equal to the
  span of the loop OUTER_REDUCTION_FOR (the new buffer is returned via
  RFAC_BUF_PTR)
  * Insert an initialization store for the new buffer in
  OUTER_REDUCTION_FOR before its nested loop
  * Replace the reduction store to the original buffer with the reduction
  store to the temp buffer, removing the index var of OUTER_REDUCTION_FOR
  from reduction arguments
  * Insert a final reduction store over the extra dimension of the new
  buffer to the original buffer
  * Returns TRUE if the transformation succeeded and FALSE otherwise

Example:
Original IR:
S1: for i        # normal axis
S2:   X[i] = 0
S3:   for j      # reduction axis
S4:     for k    # reduction axis
S5:       X[i] = ReduceOp(X[i] + Y[i,j,k], reduce_axis={j,k})

After RFACTOR(S5, S3)
S1: for i               # normal axis
S2:   X[i] = 0
S3:   for j             # reduction axis for X, normal axis for X_rfac
        X_rfac[i,j] = 0
S4:     for k           # reduction axis
          X_rfac[i,j] = ReduceOp(X_rfac[i,j] + Y[i,j,k], reduce_axis={k})
        X[i] = ReduceOp(X[i] + X_rfac[i,j], reduce_axis={j})
```
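
A hedged sketch of invoking the transformation on that example (signature illustrative; s5_store and s3_loop stand for the S5 and S3 nodes):

```
// Given the example above: s5_store is the reduction store S5 and
// s3_loop is the outer reduction loop S3.
BufPtr rfac_buf;  // receives the new X_rfac buffer (RFAC_BUF_PTR)
bool ok = LoopNest::rfactor(s5_store, s3_loop, &rfac_buf);
// ok == false means a requirement above was not met; the IR is unchanged.
```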

Differential Revision: D27694960

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 076fa6a1df2c23f5948302aa6b43e82cb222901c
2021-04-13 12:08:48 -07:00
Xiaoqiang Zheng
9f86b656ba Resubmit: Adding parallel support for the LLVM backend. (#54122)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54122

Test Plan:
* USE_TBB=1 ATEN_THREADING=TBB python setup.py develop --cmake
* USE_TBB=1 ATEN_THREADING=NATIVE python setup.py develop --cmake
* USE_TBB=1 ATEN_THREADING=OMP python setup.py develop --cmake
* cd build; ninja bin/tensorexpr_bench
* bin/test_tensorexpr --gtest_filter="*Parallel*"

Reviewed By: bertmaher

Differential Revision: D27109802

Pulled By: zheng-xq

fbshipit-source-id: db159466d0b46357bcf0fbefb36094bee312368c
2021-03-18 07:19:37 -07:00
Nikita Shulga
d57ae6c46d Revert D26906509: Adding parallel support for the LLVM backend.
Test Plan: revert-hammer

Differential Revision: D26906509 (95d2318510)

Original commit changeset: 12c17f2f21af

fbshipit-source-id: cc86d0dfca0dd791b31bda23a0172fc1cfd89760
2021-03-11 17:54:47 -08:00
Xiaoqiang Zheng
95d2318510 Adding parallel support for the LLVM backend. (#53243)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53243

Test Plan: Imported from OSS

Reviewed By: bertmaher, Chillee

Differential Revision: D26906509

Pulled By: zheng-xq

fbshipit-source-id: 12c17f2f21af11e73fa4c5b5199043a7a15ecdec
2021-03-11 03:27:37 -08:00
Sam Estep
8c798e0622 Forbid trailing whitespace (#53406)
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857

These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
  - `GLOSSARY.md`
  - `aten/src/ATen/core/op_registration/README.md`
  - `scripts/README.md`
  - `torch/csrc/jit/codegen/fuser/README.md`

The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```

I looked over the auto-generated changes and didn't see anything that looked problematic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406

Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377

This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348

Reviewed By: walterddr, seemethere

Differential Revision: D26856620

Pulled By: samestep

fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
2021-03-05 17:22:55 -08:00
Mikhail Zolotukhin
e975169426 [TensorExpr] Redesign Tensor class. (#50995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50995

This change makes 'Tensor' a thin wrapper over 'Buf' and 'Stmt', and merges
it with the recently introduced 'CompoundTensor'. A statement for the tensor
is either passed directly to the Tensor constructor (akin to
'CompoundTensor'), or is built immediately in the constructor.

LoopNest is no longer responsible for constructing statements from
tensors - it simply stitches already constructed statements contained in
Tensors. This has the side effect that we can no longer construct several
loopnests from the same tensors - we need to explicitly clone statements if
we want to do that. A special copy constructor was added to LoopNest to make
this more convenient (note: this only affects tests, we don't usually create
multiple loopnests in other places). See the sketch below.
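
A sketch of the cloning convenience (constructor shapes illustrative):

```
// Tensors now own their statements, so building two loopnests over the
// same tensors requires cloning; the copy constructor handles it.
LoopNest l1({result_tensor});
LoopNest l2(l1);         // deep-clones l1's root statement
l1.prepareForCodegen();  // transforms on l1 leave l2 untouched
```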

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26038223

Pulled By: ZolotukhinM

fbshipit-source-id: 27a2e5900437cfb0c151e8f89815edec53608e17
2021-01-27 16:14:22 -08:00
Xiaoqiang Zheng
b96a6516a6 Add CPP Full Reduction Benchmarks. (#50193)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50193

* Supports ATen, native reference, and NNC TE implementations.
* Supports functionality checks against ATen, in addition to performance checks.

Test plans:

* After enabling "BUILD_TENSOREXPR_BENCHMARK" in CMakeLists.txt, run:
* bin/tensorexpr_bench --benchmark_filter=Reduce1D

Measurements:

On a Broadwell E5-2686 CPU,

```
Reduce1D/Torch/16777216            5638547 ns    5638444 ns        119 BYTES=11.902G/s
Reduce1D/Naive/16777216           19308235 ns   19308184 ns         36 BYTES=3.47567G/s
Reduce1D/NativeRfactor/16777216    8433348 ns    8433038 ns         85 BYTES=7.95785G/s
Reduce1D/NativeVector/16777216     5608836 ns    5608727 ns        124 BYTES=11.9651G/s
Reduce1D/NativeTiled/16777216      5550233 ns    5550221 ns        126 BYTES=12.0912G/s
Reduce1D/TeNaive/16777216         21451047 ns   21450752 ns         33 BYTES=3.12851G/s
Reduce1D/TeSplitTail/16777216     23701732 ns   23701229 ns         30 BYTES=2.83145G/s
Reduce1D/TeSplitMask/16777216     23683589 ns   23682978 ns         30 BYTES=2.83363G/s
Reduce1D/TeRfactorV2/16777216      5378019 ns    5377909 ns        131 BYTES=12.4786G/s
```

Result summary:

* The single-threaded performance of NNC TeRfactorV2 matches and exceeds the ATen and AVX2 naive counterparts.

Follow-up items:

* rfactor does not work well with split
* We don't have a multi-threaded implementation yet.
  * Missing "parallel" scheduling primitive, which is not different from what we need for pointwise ops.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D25821880

Pulled By: zheng-xq

fbshipit-source-id: 8df3f40d1eed8749c8edcaacae5f0544dbf6bed3
2021-01-21 10:00:50 -08:00