54 Commits

Author SHA1 Message Date
Mikhail Zolotukhin
f23f21dafe [TensorExpr] Remove 'Placeholder' class. (#64887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887

BufHandle has exactly the same functionality and should be used instead.

Differential Revision:
D30889483

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3
2021-09-14 00:22:44 -07:00
Bert Maher
2e6221a232 [nnc] Make 64-bit dimensions work (#64077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64077

We were assuming kernel dimensions fit in 32 bits (the old fuser made
this assumption too), but we should be able to support 64.
ghstack-source-id: 136933272
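
The 32-bit assumption can be illustrated with a small sketch, assuming row-major flat indexing. `wrap_int32` and `flat_index` are hypothetical helpers written for this illustration, not NNC code; they simulate what happens to index arithmetic when it is carried out in 32 bits instead of 64.

```python
def wrap_int32(value: int) -> int:
    """Reduce an integer to the two's-complement 32-bit range, as int32 arithmetic would."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def flat_index(indices, dims, wrap=None):
    """Row-major flat index; `wrap` (optional) is applied after every arithmetic step."""
    wrap = wrap or (lambda v: v)
    flat = 0
    for idx, dim in zip(indices, dims):
        flat = wrap(wrap(flat * dim) + idx)
    return flat


dims = (1 << 16, 1 << 16)            # 2**32 elements in total
last = (dims[0] - 1, dims[1] - 1)    # index of the final element

exact = flat_index(last, dims)                 # Python ints do not overflow
wrapped = flat_index(last, dims, wrap_int32)   # simulated 32-bit arithmetic
```

With these sizes the exact index no longer fits in a signed 32-bit integer, so the simulated 32-bit result wraps to a negative value.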

Test Plan: unit tests; new IR level test with huge sizes

Reviewed By: ZolotukhinM

Differential Revision: D30596689

fbshipit-source-id: 23b7e393a2ebaecb0c391a6b1f0c4b05a98bcc94
2021-08-28 19:59:47 -07:00
Mikhail Zolotukhin
f0d274294d [TensorExpr] Nuke KernelArena and KernelScope. (#63587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587

Now that there are no classes using KernelArena for memory management, we
can remove it.

Differential Revision:
D30429115

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
2021-08-24 00:32:16 -07:00
Mikhail Zolotukhin
62d02f2b57 [TensorExpr] Make 'Tensor' a value type. (#63586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63586

This is another commit in transition from KernelArena memory management.
Tensor is essentially just a pair of <BufPtr, StmtPtr> and we don't need
to dynamically allocate it at all - it's cheap to pass it by value, and
that's what we're switching to in this commit.

After this change nothing uses KernelScope/KernelArena and they can be
safely removed.

Differential Revision:
D30429114

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: f90b859cfe863692b7beffbe9bd0e4143df1e819
2021-08-24 00:32:13 -07:00
Mikhail Zolotukhin
dd96c26066 [TensorExpr] More NFC changes like Expr* -> ExprPtr. (#63778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63778

This is a preparation for a switch from raw pointers to shared pointers
as a memory model for TE expressions and statements.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30487425

Pulled By: ZolotukhinM

fbshipit-source-id: 9cbe817b7d4e5fc2f150b29bb9b3bf578868f20c
2021-08-24 00:30:49 -07:00
Mikhail Zolotukhin
7fdba4564a [TensorExpr] IRSimplifier: sort terms in polynomials, terms, minterms, maxterms. (#63197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63197

This solves non-determinism from using hash values in sort methods.
Changes in tests are mostly mechanical.
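
As a rough illustration of the idea (not the IRSimplifier implementation), the sketch below orders polynomial terms by a structural key instead of a hash value. The term representation `(coefficient, variables)` and the helpers `term_key` and `canonical` are assumptions made for this example.

```python
from itertools import permutations


def term_key(term):
    """Structural sort key: sorted variable names first, then the coefficient."""
    coeff, variables = term
    return (tuple(sorted(variables)), coeff)


def canonical(terms):
    """Order the terms of a polynomial by structure rather than by hash value."""
    return sorted(terms, key=term_key)


# 2*x*y + 3*x + 5*y, each term written as (coefficient, variables)
poly = [(2, ("y", "x")), (3, ("x",)), (5, ("y",))]

# Every input ordering collapses to the same canonical ordering.
orders = {tuple(canonical(list(p))) for p in permutations(poly)}
```

Because the key depends only on the structure of each term, the result does not change between runs or between input orderings, which is the property a hash-based sort cannot guarantee.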

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30292776

Pulled By: ZolotukhinM

fbshipit-source-id: 74f57b53c3afc9d4be45715fd74781271373e055
2021-08-18 14:49:27 -07:00
Mikhail Zolotukhin
1dc2b52764 [TensorExpr] Add a wrapper for all expr and stmt pointers. (#63195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63195

This helps us to later switch from using KernelArena with raw pointers
to shared pointers without having to change all our source files at
once.

The changes are mechanical and should not affect any functionality.

With this PR, we're changing the following:
 * `Add*` --> `AddPtr`
 * `new Add(...)` --> `alloc<Add>(...)`
 * `dynamic_cast<Add*>` --> `to<Add>`
 * `static_cast<Add*>` --> `static_to<Add>`

Due to some complications with args forwarding, some places became more
verbose, e.g.:
 * `new Block({})` --> `new Block(std::vector<ExprPtr>())`

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30292779

Pulled By: ZolotukhinM

fbshipit-source-id: 150301c7d2df56b608b035827b6a9a87f5e2d9e9
2021-08-17 13:44:45 -07:00
Nikita Shulga
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
The GoogleTest `TEST` macro is non-compliant with this check, as is `DEFINE_DISPATCH`.

All changes except those to `.clang-tidy` were generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00
Bert Maher
b963607d50 [nnc] Insert alloc/free at global scope (#61725)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61725

Alloc/free inside a loop isn't really an optimization, and furthermore
it breaks some attempted optimization in the llvm backend: we use alloca for
small allocations, which is efficient since alloca is on the stack, but there's
no corresponding free, so we leak tons of stack.  I hit this while building an
rfactor buffer inside a very deeply nested loop.
ghstack-source-id: 133627310
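
The stack growth described above can be modelled with a toy counter. This is only a sketch under the assumption that alloca'd memory is released at function exit and never inside the loop; `peak_stack_bytes` is a hypothetical helper, not part of the LLVM backend.

```python
def peak_stack_bytes(iterations, buf_bytes, hoisted):
    """Simulated peak stack usage when alloca'd memory is only released at function exit."""
    stack = 0
    peak = 0
    if hoisted:
        stack += buf_bytes        # one allocation before the loop nest
        peak = stack
    for _ in range(iterations):
        if not hoisted:
            stack += buf_bytes    # alloca in the loop body: nothing pops it
        peak = max(peak, stack)
    return peak


in_loop = peak_stack_bytes(1000, 64, hoisted=False)
at_global_scope = peak_stack_bytes(1000, 64, hoisted=True)
```

In this model the in-loop variant grows linearly with the trip count, while the hoisted variant stays at the size of a single buffer.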

Test Plan:
Unit test which simulates use of a temp buffer in a deeply nested
loop.

Reviewed By: navahgar

Differential Revision: D29533364

fbshipit-source-id: c321f4cb05304cfb9146afe32edc4567b623412e
2021-07-16 08:42:24 -07:00
Raghavan Raman
b822928e33 [nnc] Removed setGPUBlockIndex and setGPUThreadIndex methods from LoopNest (#59495)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59495

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28915960

Pulled By: navahgar

fbshipit-source-id: 20a4032b031aba6e43d85433ade5f0680c65fbc0
2021-06-15 10:37:46 -07:00
Raghavan Raman
aa163aeff5 [nnc] Made several LoopNest APIs static (#59494)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59494

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28915959

Pulled By: navahgar

fbshipit-source-id: bf52e30d893f4d86812219b538a14307f347f10b
2021-06-15 10:36:31 -07:00
Raghavan Raman
30e24b2d2b [nnc] Modified vectorize API to return bool (#59422)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59422

Test Plan: Imported from OSS

Reviewed By: huiguoo

Differential Revision: D28886980

Pulled By: navahgar

fbshipit-source-id: 58cc3ecd86564a312a132f8260d836b096505095
2021-06-11 12:02:19 -07:00
Mikhail Zolotukhin
daa35141e8 Reland: "[TensorExpr] Fix handling of 0-dim tensors." (#59508)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59508

An assert that was triggering in a previous version is now relaxed to
take 0-dim tensors into account.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28918342

Pulled By: ZolotukhinM

fbshipit-source-id: c09b62c9725d1603b0ec11fcc051e7c932af06ae
2021-06-08 22:48:17 -07:00
Nikita Shulga
ba3a90b55e Revert D28819780: [TensorExpr] Fix handling of 0-dim tensors.
Test Plan: revert-hammer

Differential Revision:
D28819780

Original commit changeset: f3feff35a1ce

fbshipit-source-id: 1dca4ac9cea0b67e9f02800f6d5b3c7e4ae1d81a
2021-06-04 19:25:30 -07:00
Mikhail Zolotukhin
d60efd8207 [TensorExpr] Fix handling of 0-dim tensors. (#59279)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59279

There were some issues with how we handle 0-dim cases in lowerings and
also in how we generate reductions in that special case. This PR fixes
those issues and reenables a bunch of tests.

Differential Revision:
D28819780

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: f3feff35a1ce11821ada2f8d04ae9d4be10dc736
2021-06-04 13:58:15 -07:00
Hui Guo
7c4ac9e3ee [NNC] Fix loopnest.cache_accesses for reduce ops (fixed #59002) (#59136)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59136

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28768598

Pulled By: huiguoo

fbshipit-source-id: 99ab8430bc0ba395e2a041b03a7761de335ddda5
2021-06-03 21:04:14 -07:00
Raghavan Raman
dd7bbe1a63 [NNC] Make splitWithMask transform in-place (#58269)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58269

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28427227

Pulled By: navahgar

fbshipit-source-id: 4e38a436abcf4752fd7ef6ab3666876eec6ea5ba
2021-05-25 11:32:51 -07:00
Raghavan Raman
e2467cc43e [NNC] Make splitWithTail transform in-place (#58268)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58268

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28427228

Pulled By: navahgar

fbshipit-source-id: 270b62c4e83739ad21dd68f375120e56881b394f
2021-05-25 11:31:14 -07:00
Nikita Shulga
3a66a1cb99 [clang-tidy] Exclude cppcoreguidelines-avoid-magic-numbers (#57841)
Summary:
Add cppcoreguidelines-avoid-magic-numbers exclusion to clang-tidy
Remove existing nolint warnings using the following script:
```
for file in `git ls-files | grep -v \.py`; do gsed '/^ *\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)/d' -i  $file; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57841

Reviewed By: samestep

Differential Revision: D28295045

Pulled By: malfet

fbshipit-source-id: 7c6e8d1213c9593f169ed3df6a916498f1a97163
2021-05-07 20:02:33 -07:00
Nikita Shulga
4cb534f92e Make PyTorch code-base clang-tidy compliant (#56892)
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname,"-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit","--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892

Reviewed By: H-Huang

Differential Revision: D27991944

Pulled By: malfet

fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
2021-04-28 14:10:25 -07:00
Mikhail Zolotukhin
7ab654afd7 [TensorExpr] Rename Tensor::call to Tensor::load to be consistent with Buf and Placeholder. (#55826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55826

It's a mechanical change.

Differential Revision: D27717777

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: fbc1bb99602250c706cf2c8c2684119c323e4d51
2021-04-13 12:08:53 -07:00
Mikhail Zolotukhin
b01a15d3d3 [TensorExpr] Redesign Rfactor loopnest transformation. (#55324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55324

With this change `rfactor` only affects the passed loop and its body,
never touching anything outside (that was the root cause of a bug with the
previous implementation). Also, we don't have an `insertion_point`
parameter anymore - its meaning was vague, and its effect
should be achievable with other transformations anyway.

The new `rfactor` semantics is as follows:

```
Requirements:
 * S is the reduction store
 * S is the only statement in the innermost loop
 * There are at least two reduction arguments in S
 * OUTER_REDUCTION_FOR loop corresponds to the outermost reduction variable
 used in the store and all other reduction variables are index variables of
 children loops of OUTER_REDUCTION_FOR
 * OUTER_REDUCTION_FOR is a perfect loop nest, i.e. it has only loops
 corresponding to the other reduction variables and the store, nested into
 each other

What it does:
  * Introduce a new buffer with an extra dimension of a size equal to the
  span of the loop OUTER_REDUCTION_FOR (the new buffer is returned via
  RFAC_BUF_PTR)
  * Insert an initialization store for the new buffer in
  OUTER_REDUCTION_FOR before its nested loop
  * Replace the reduction store to the original buffer with the reduction
  store to the temp buffer, removing the index var of OUTER_REDUCTION_FOR
  from reduction arguments
  * Insert a final reduction store over the extra dimension of the new
  buffer to the original buffer
  * Returns TRUE if the transformation succeeded and FALSE otherwise

Example:
Original IR:
S1: for i        # normal axis
S2:   X[i] = 0
S3:   for j      # reduction axis
S4:     for k    # reduction axis
S5:       X[i] = ReduceOp(X[i] + Y[i,j,k], reduce_axis={j,k})

After RFACTOR(S5, S3)
S1: for i               # normal axis
S2:   X[i] = 0
S3:   for j             # reduction axis for X, normal axis for X_rfac
        X_rfac[i,j] = 0
S4:     for k           # reduction axis
          X_rfac[i,j] = ReduceOp(X_rfac[i,j] + Y[i,j,k], reduce_axis={k})
        X[i] = ReduceOp(X[i] + X_rfac[i,j], reduce_axis={j})
```
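
The example above can be checked numerically with a plain-Python sketch, assuming a sum reduction. The functions `reduce_direct` and `reduce_rfactored` are illustrative stand-ins for the IR before and after the transformation, not the LoopNest API.

```python
def reduce_direct(Y):
    """Original IR: X[i] accumulates over both reduction axes j and k."""
    X = [0] * len(Y)
    for i in range(len(Y)):
        for j in range(len(Y[i])):
            for k in range(len(Y[i][j])):
                X[i] = X[i] + Y[i][j][k]
    return X


def reduce_rfactored(Y):
    """After RFACTOR(S5, S3): partial sums go into X_rfac[i][j], then fold into X[i]."""
    X = [0] * len(Y)
    X_rfac = [[0] * len(Y[i]) for i in range(len(Y))]
    for i in range(len(Y)):
        for j in range(len(Y[i])):
            X_rfac[i][j] = 0                      # initialization store
            for k in range(len(Y[i][j])):
                X_rfac[i][j] = X_rfac[i][j] + Y[i][j][k]
            X[i] = X[i] + X_rfac[i][j]            # final reduction over j
    return X


Y = [[[i * 100 + j * 10 + k for k in range(4)] for j in range(3)] for i in range(2)]
```

Both functions produce the same `X` for this input, which is the property the transformation is expected to preserve.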

Differential Revision: D27694960

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 076fa6a1df2c23f5948302aa6b43e82cb222901c
2021-04-13 12:08:48 -07:00
Joel Schlosser
defc649eca Update to short forms of splitWithTail / splitWithMask (#55542)
Summary:
Switched to short forms of `splitWithTail` / `splitWithMask` for all tests in `test/cpp/tensorexpr/test_*.cpp` (except test_loopnest.cpp)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55542

Reviewed By: mrshenli

Differential Revision: D27632033

Pulled By: jbschlosser

fbshipit-source-id: dc2ba134f99bff8951ae61e564cd1daea92c41df
2021-04-09 10:15:20 -07:00
Mikhail Zolotukhin
688e350725 [TensorExpr] Nuke DepTracker and findAllNeededTensors. (#54997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54997

DepTracker was used to automatically pull in dependent computations from
output ones. While it seems quite convenient, it's led to several
architectural issues, which are fixed in this stack.

DepTracker worked on Tensors, each of which is a pair of Buf and Stmt. However,
Stmt could become stale and there was no way to reliably update the
corresponding tensor. We're now using Bufs and Stmts directly and moving
away from using Tensors to avoid these problems.

Removing DepTracker allowed us to unify Loads and FunctionCalls, which
essentially were duplicates of each other.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27446414

Pulled By: ZolotukhinM

fbshipit-source-id: a2a32749d5b28beed92a601da33d126c0a2cf399
2021-04-01 19:46:26 -07:00
Bert Maher
997f05cd34 [nnc] Add an initialization expression to Reduce() (#53751)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53751

Sometimes the initial value of a reduction expression needs to be
computed with reference to the loop axes; for example, adding bias can be
efficiently represented by initializing the accumulator from the bias tensor:
```
C[n, c, h, w] = bias[c]
for (...)
  C[n, c, h, w] += ...
```
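
A minimal sketch of the same pattern in plain Python, assuming a sum reduction over the innermost axis and a simplified `[n][c][k]` layout. `reduce_with_init` is a hypothetical name and does not mirror the `Reduce()` signature.

```python
def reduce_with_init(x, bias):
    """C[n][c] = bias[c] + sum_k x[n][c][k], with the accumulator initialized from bias."""
    out = []
    for n in range(len(x)):
        row = []
        for c in range(len(x[n])):
            acc = bias[c]          # initializer depends on the loop axis c
            for value in x[n][c]:  # reduction axis
                acc += value
            row.append(acc)
        out.append(row)
    return out


x = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
bias = [10, 20]
result = reduce_with_init(x, bias)
```

Initializing from `bias[c]` avoids a separate pass that would otherwise add the bias after the reduction finishes.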
ghstack-source-id: 123592861

Test Plan: `buck test //caffe2/test/cpp/tensorexpr -- Reductions.InitFunction`

Reviewed By: navahgar

Differential Revision: D26940321

fbshipit-source-id: 8a08e19e5d0b9ad453a07fab8b61e75dcd3d626b
2021-03-10 17:13:14 -08:00
Hui Guo
973e306c84 changed TE 'Allocate' API to take one argument 'Buf' instead of three arguments 'Var', 'dtype', 'dims'. (#50167)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50167

Test Plan:
Imported from OSS

`python test/test_jit_fuser_te.py`
`python test/test_jit_fuser_legacy.py`
`python test/test_jit_fuser.py`
`build/bin/test_tensorexpr`

Reviewed By: ZolotukhinM

Differential Revision: D25814342

Pulled By: huiguoo

fbshipit-source-id: 44cba7f92365b826c9cb1d385a94858934570dee
2021-02-22 15:08:51 -08:00
Bert Maher
ac121165e2 Remove ReduceOp::accumulator (#52196)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52196

A reduction does not need to know the buffer into which its
result will be written.  This change gets us closer to being able to
create reductions inside Compute, where we have access to the tensor
axes.
ghstack-source-id: 121813071

Test Plan: test_tensorexpr

Reviewed By: ZolotukhinM

Differential Revision: D26420107

Pulled By: bertmaher

fbshipit-source-id: c8d8a99649adfd6de56fe53a728f5aa034a84f13
2021-02-17 23:36:23 -08:00
Bert Maher
a788c2d777 [nnc] Remove output_args from ReduceOp (#52187)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52187

ReduceOp doesn't need to track the indices that its result will be written into.
ghstack-source-id: 121813075

Test Plan:
test_tensorexpr, tensorexpr_bench

Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D26420575

fbshipit-source-id: 7afcfa611515334e36de8039722011687f3b61e4
2021-02-17 23:36:18 -08:00
Mikhail Zolotukhin
e975169426 [TensorExpr] Redesign Tensor class. (#50995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50995

This change makes 'Tensor' a thin wrapper over 'Buf' and 'Stmt', and
merges it with recently introduced 'CompoundTensor'. A statement for the
tensor is either passed directly to the Tensor constructor (akin to
'CompoundTensor'), or is built immediately in the constructor.

LoopNest is no longer responsible for constructing statements from
tensors - it simply stitches already constructed statements contained in
Tensors. This has a side effect that now we cannot construct several
loopnests from the same tensors - we need to explicitly clone statements
if we want to do that. A special copy constructor was added to LoopNest
to make it more convenient (note: this only affects tests, we don't
usually create multiple loopnests in other places).

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26038223

Pulled By: ZolotukhinM

fbshipit-source-id: 27a2e5900437cfb0c151e8f89815edec53608e17
2021-01-27 16:14:22 -08:00
Mikhail Zolotukhin
a5b27d7a31 [TensorExpr] Move SimpleIREval implementation from .h to .cpp. (#49697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49697

Mostly mechanical move. This refactoring helps to hide unnecessary
details from the SimpleIREval interface and make it more similar to a
pure 'codegen'.

Test Plan: Imported from OSS

Reviewed By: nickgg

Differential Revision: D25668696

Pulled By: ZolotukhinM

fbshipit-source-id: 423247bfcdfa88403e8ec92152f00110bb9da19c
2020-12-21 20:20:15 -08:00
Nick Gibson
db2e9c1e7f [NNC] Intermediate allocs flattened and dependency support (#49554)
Summary:
Makes two changes in NNC for intermediate buffer allocations:
1. Flattens dimensions of buffers allocated in LoopNest::prepareForCodegen() to match their flattened usages.
2. Adds support for tracking memory dependencies of Alloc/Free to the MemDependencyChecker, which will allow us to check safety of accesses to intermediate buffers (coming in a future diff).

I didn't add any new tests as the mem dependency checker tests already cover it pretty well, particularly the GEMM test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49554

Reviewed By: VitalyFedyunin

Differential Revision: D25643133

Pulled By: nickgg

fbshipit-source-id: 66be3054eb36f0a4279d0c36562e63aa2dae371c
2020-12-21 10:35:15 -08:00
Bert Maher
07657b6001 [tensorexpr] Switch cpp tests to pure gtest (#48160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48160

We no longer use the custom c++ test infra anyways, so move to pure
gtest.

Fixes #45703
ghstack-source-id: 116977283

Test Plan: `buck test //caffe2/test/cpp/tensorexpr`

Reviewed By: navahgar, nickgg

Differential Revision: D25046618

fbshipit-source-id: da34183d87465f410379048148c28e1623618553
2020-11-18 12:23:34 -08:00
Nick Gibson
957e45a97c [NNC] Support vectorization of reductions (#47924)
Summary:
Add support for ReduceOp in the Vectorizer, which allows vectorization of reductions. Only non-reduce axes can be vectorized currently; to make vectorizing reduce axes work we'd need either to automatically pull out the RHS of reductions (better as a separate transform, I think) or special handling of vector reduce in the LLVM codegen (tricky, maybe not useful?).

There was a disabled LLVM test for this case which I reenabled with a bit of massaging, and added a few more.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47924

Reviewed By: bertmaher

Differential Revision: D24963464

Pulled By: nickgg

fbshipit-source-id: 91d91e9e2696555ab5690b154984b1ce48359d51
2020-11-16 10:43:53 -08:00
Raghavan Raman
f58842c214 Enable inlining into reductions (#47020)
Summary:
This diff enables inlining producers into reductions. It also guards against inlining reductions themselves.

Prior to this diff, if there was a reduction in the loopnest, no inlining was happening. After this change, we will inline all non-output buffers that do not correspond to a reduction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47020

Reviewed By: albanD

Differential Revision: D24644346

Pulled By: navahgar

fbshipit-source-id: ad234a6877b65be2457b734cbb7f3a1800baa6a5
2020-11-02 15:33:38 -08:00
Nick Gibson
402abdfdf4 [NNC] cacheAccesses transform (cache_reads + cache_writes) (#45869)
Summary:
Adds a new transform to the NNC compiler, which adds support for buffer access caching. All accesses within a provided scope are redirected to a cache which is initialized or written back as necessary at the boundaries of that scope. For TVM fans, this is essentially a combination of cache_reads and cache_writes. E.g. it can do this kind of thing:

Before:
```
for (int i = 0; i < 64; i++) {
  for (int j = 0; j < 64; j++) {
    A[i, j] = i * j;
  }
}
for (int i_1 = 0; i_1 < 20; i_1++) {
  for (int j_1 = 0; j_1 < 10; j_1++) {
    B[i_1, j_1] = (A(i_1 + 30, j_1 + 40)) + (A(i_1 + 31, j_1 + 41));
  }
}
```

After `cacheAccesses(A->buf(), "A_local", j_loop);`

```
for (int i = 0; i < 64; i++) {
  for (int j = 0; j < 64; j++) {
    A[i, j] = i * j;
  }
}
for (int i_1 = 0; i_1 < 20; i_1++) {
  for (int i_2 = 0; i_2 < 2; i_2++) {
    for (int j_1 = 0; j_1 < 11; j_1++) {
      A_local[i_2, j_1] = A[(i_2 + i_1) + 30, j_1 + 40];
    }
  }
  for (int j_2 = 0; j_2 < 10; j_2++) {
    B[i_1, j_2] = (A_local[1, j_2 + 1]) + (A_local[0, j_2]);
  }
}
```
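
The before/after pair above can be replayed in plain Python to confirm that the cached version reads the same elements. This is a sketch for checking the index arithmetic only; `make_a`, `compute_b_direct`, and `compute_b_cached` are illustrative names, not NNC functions.

```python
def make_a():
    """A[i, j] = i * j over a 64 x 64 grid, as in the producer loop."""
    return [[i * j for j in range(64)] for i in range(64)]


def compute_b_direct(A):
    """Consumer loop before the transform: two direct reads of A per element."""
    return [[A[i + 30][j + 40] + A[i + 31][j + 41] for j in range(10)]
            for i in range(20)]


def compute_b_cached(A):
    """Consumer loop after the transform: reads go through a 2 x 11 local cache."""
    B = [[0] * 10 for _ in range(20)]
    for i_1 in range(20):
        A_local = [[0] * 11 for _ in range(2)]
        for i_2 in range(2):
            for j_1 in range(11):
                A_local[i_2][j_1] = A[(i_2 + i_1) + 30][j_1 + 40]
        for j_2 in range(10):
            B[i_1][j_2] = A_local[1][j_2 + 1] + A_local[0][j_2]
    return B


A = make_a()
```

The cache covers two rows and eleven columns because the two reads are offset by one in each dimension.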

Or this reduction:
```
for (int l1 = 0; l1 < 4; l1++) {
  sum[l1] = 0.f;
  for (int n1_1 = 0; n1_1 < 3; n1_1++) {
    for (int m1_1 = 0; m1_1 < 2; m1_1++) {
      sum[l1] = (sum[l1]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
    }
  }
}
```

After `l.cacheAccesses(d->buf(), "d_local", n_loop);`:

```
for (int l1 = 0; l1 < 4; l1++) {
  Allocate(d_local, float, {1});
  sum[l1] = 0.f;
  d_local[0] = 0.f;
  for (int n1_1 = 0; n1_1 < 3; n1_1++) {
    for (int m1_1 = 0; m1_1 < 2; m1_1++) {
      d_local[0] = (d_local[0]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
    }
  }
  sum[l1] = (sum[l1]) + (d_local[0]);
  Free(d_local);
}
```

I had originally planned to write `cacheReads` and `cacheWrites` wrappers so we could use them just like their TVM cousins, but they just ended up being big masses of checking that reads or writes weren't present. Didn't feel too useful so I removed them, but let me know.

This is based on bounds inference and inherits a few bugs present in that functionality, which I will address in a followup.

While working on this I realized that it overlaps heavily with `computeAt`, which is really just `cacheReads` + `computeInline`. I'm considering refactoring computeAt to be a wrapper around those two transforms. ZolotukhinM opinions on this?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45869

Reviewed By: mruberry

Differential Revision: D24195276

Pulled By: nickgg

fbshipit-source-id: 36a58ae265f346903187ebc4923637b628048155
2020-10-08 14:13:28 -07:00
Mikhail Zolotukhin
4aca63d38a [TensorExpr] Change API for creating Load and Store expressions. (#45520)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45520

With this change `Load`s and `Store`s no longer accept `Placeholder`s in
their constructor and `::make` functions and can only be built with
`Buf`.
`Placeholder` gets its own `store`, `load`, `storeWithMask`, and
`loadWithMask` method for more convenient construction.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23998789

Pulled By: ZolotukhinM

fbshipit-source-id: 3fe018e00c1529a563553b2b215f403b34aea912
2020-09-29 20:52:38 -07:00
Mikhail Zolotukhin
b86008ab75 [TensorExpr] Remove buf_ field from class Tensor. (#45390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45390

Tensor objects should always refer to their Function's bufs. Currently
we never create a Tensor with a buffer different from that of its function,
but having it in two places seems incorrect and dangerous.

Differential Revision: D23952865

Test Plan: Imported from OSS

Reviewed By: nickgg

Pulled By: ZolotukhinM

fbshipit-source-id: e63fc26d7078427514649d9ce973b74ea635a94a
2020-09-29 01:21:57 -07:00
Mikhail Zolotukhin
3c33695a6d [TensorExpr] Rename Buffer to Placeholder. (#45389)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45389

Differential Revision: D23952866

Test Plan: Imported from OSS

Reviewed By: nickgg

Pulled By: ZolotukhinM

fbshipit-source-id: 17eedd3ac17897501403482ac1866c569d247c75
2020-09-29 01:21:54 -07:00
Mikhail Zolotukhin
92306b85d5 [TensorExpr] Consolidate {buffer,function,tensor}.{h,cpp} in tensor.{h,cpp}. (#45388)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45388

Classes defined in these files are closely related, so it is reasonable
to have them all in one file. The change is purely a code move.

Differential Revision: D23952867

Test Plan: Imported from OSS

Reviewed By: nickgg

Pulled By: ZolotukhinM

fbshipit-source-id: 12cfaa968bdfc4dff00509e34310a497c7b59155
2020-09-29 01:17:10 -07:00
Nick Gibson
69839ea3f6 [NNC] make inlining immediate (take 3) (#44231)
Summary:
This is a reup of https://github.com/pytorch/pytorch/issues/43885 with an extra commit which should fix the bugs that caused it to be reverted. Read that for general context.

The issue here was that we were still using the side maps `tensor_to_stmt_` and `stmt_to_tensor_` which get invalidated by any transform of the IR (rather than just any transform that isn't computeInline). I added a comment about this but didn't actually address our usages of it.

I've removed these maps and changed the `getLoopBodyFor` and `getLoopStatementsFor` helpers to search the root stmt directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44231

Reviewed By: albanD

Differential Revision: D23689688

Pulled By: nickgg

fbshipit-source-id: 1c6009a880f8c0cebf2300fd06b5cc9322bffbf9
2020-09-15 11:12:24 -07:00
Mikhail Zolotukhin
6474057c76 Revert D23503636: [pytorch][PR] [NNC] make inlining immediate (take 2) and fix bugs
Test Plan: revert-hammer

Differential Revision:
D23503636 (70aecd2a7f)

Original commit changeset: cdbdc902b7a1

fbshipit-source-id: b5164835f874a56213de4bed9ad690164eae9230
2020-09-04 10:58:23 -07:00
Nick Gibson
70aecd2a7f [NNC] make inlining immediate (take 2) and fix bugs (#43885)
Summary:
A rework of `computeInline` which makes it work a bit better, particularly when combined with other transformations. Previously we stored Functions that were inlined and then deferred the actual inlining of the function body until prepareForCodegen was called. This has an issue when transformations are applied to the LoopNest: the function body can be different from what appears in the root_stmt and result in inlining that a) fails, b) reverses other transformations or c) a weird unpredictable combination of the two.

This PR changes that behaviour so that the inlining occurs in the root stmt immediately, which means it reflects any previous transformations and any future transformations have a true view of the internal IR. It also has the benefit that inspecting the root statement gives an accurate view of it without needing to call prepareForCodegen. I also removed the difference between `computeInline` and `computeInlineWithRand` and we handle calls to `rand()` in all branches.

This is a rework of https://github.com/pytorch/pytorch/issues/38696, with the agreed changes from ZolotukhinM and zheng-xq: we should only inline if the dimensions are trivial (i.e., they are vars, not exprs).

This PR is mostly tests, and I fixed a bunch of bugs I found along the way. Partial list:
* When inlining an expression involving rand, we would create random vars equal to the dimensionality of the enclosing Tensor not the produced Tensor - meaning we'd use an incorrect value if the inlined tensor was smaller. E.g: `X[i] = rand(); A[i, j] = X[i]` would produce a tensor where `A[0, 0] != A[0, 1]`. This is fixed by inserting the Let binding of the random variable at the correct loop body.
* When inlining we'd replace all calls to `rand()` rather than just those present in the Tensor being inlined.
* `rand()` was treated symbolically by the simplifier and we would aggregate or cancel calls to `rand()`. Have fixed the hasher to hash all calls to `rand()` distinctly.
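
The first bug in the list can be sketched in plain Python, assuming `X[i] = rand(); A[i, j] = X[i]` as in the example. `inline_naive` and `inline_with_let` are hypothetical helpers that model the two inlining strategies; they are not NNC code.

```python
import random


def inline_naive(rows, cols, rng):
    """Buggy inlining: rand() is re-evaluated for every (i, j) element."""
    return [[rng.random() for _ in range(cols)] for _ in range(rows)]


def inline_with_let(rows, cols, rng):
    """Fixed inlining: bind rand() once per i, in the loop body that produced X[i]."""
    A = []
    for _ in range(rows):
        x_i = rng.random()          # plays the role of the Let binding
        A.append([x_i for _ in range(cols)])
    return A


fixed = inline_with_let(3, 4, random.Random(0))
naive = inline_naive(3, 4, random.Random(0))
```

In the fixed version every element of a row shares one random value, matching the semantics of the program before inlining.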

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43885

Reviewed By: gmagogsfm

Differential Revision: D23503636

Pulled By: nickgg

fbshipit-source-id: cdbdc902b7a14d269911d978a74a1c11eab004fa
2020-09-03 16:49:24 -07:00
Mikhail Zolotukhin
3105d8a9b2 [TensorExpr] Fuser: rely on input types when checking whether a device is supported. (#44139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44139

Also, make sure that we're checking that condition when we're starting a
new fusion group, not only when we merge a node into an existing fusion
group. Oh, and one more: add a test checking that we're rejecting graphs
with unspecified shapes.

Differential Revision: D23507510

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: 9c268825ac785671d7c90faf2aff2a3e5985ac5b
2020-09-03 16:27:14 -07:00
generatedunixname89002005287564@sandcastle1415.cln1.facebook.com
1dd658f28f [Codemod][GleanFbcode] Remove dead includes in caffe2/test (#43953)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43953

Reviewed By: malfet

Differential Revision: D23445556

fbshipit-source-id: 89cd6833aa06f35c5d3c99d698abb08cd61ae4ab
2020-09-01 21:48:28 -07:00
Nick Gibson
7ffdd765c8 [TensorExpr] more convenient outer Rfactor output (#40050)
Summary:
Auto fuse the output loops of outer Rfactors, so they are in a more convenient format for binding GPU axes.

An example:
```
  Tensor* c = Reduce("sum", {}, Sum(), b, {{m, "m"}, {n, "n"}, {k, "k"}});
  LoopNest loop({c});
  std::vector<For*> loops = loop.getLoopStmtsFor(c);
  auto v = loops.at(0)->var();
  loop.rfactor(c->body(), v);
```
Before:
```
{
  Allocate(tmp_buf, float, {m});
  sum[0] = 0.f;
  for (int m_1 = 0; m_1 < m; m_1++) {
    tmp_buf[m_1] = 0.f;
  }
  for (int m_1 = 0; m_1 < m; m_1++) {
    for (int n = 0; n < n_1; n++) {
      for (int k = 0; k < k_1; k++) {
        tmp_buf[m_1] = (tmp_buf[m_1]) + (b[((n_1 * m_1) * k_1 + k) + k_1 * n]);
      }
    }
  }
  for (int m_1 = 0; m_1 < m; m_1++) {
    sum[0] = (sum[0]) + (tmp_buf[m_1]);
  }
  Free(tmp_buf);
}
```

After:
```
{
  sum[0] = 0.f;
  for (int m = 0; m < m_1; m++) {
    Allocate(tmp_buf, float, {m_1});
    tmp_buf[m] = 0.f;
    for (int n = 0; n < n_1; n++) {
      for (int k = 0; k < k_1; k++) {
        tmp_buf[m] = (tmp_buf[m]) + (b[((n_1 * m) * k_1 + k) + k_1 * n]);
      }
    }
    sum[0] = (sum[0]) + (tmp_buf[m]);
    Free(tmp_buf);
  }
}
```

The existing Rfactor tests cover this case, although I did rename a few for clarity. This change broke the LLVMRFactorVectorizedReduction test because it now does what it is intended to do (vectorize a loop with a reduction in it) rather than nothing, and since that doesn't work, it correctly fails. I've disabled it for now.
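
As a rough check that the fused form computes the same value, the two listings can be mirrored in plain C++. This is a sketch, not TensorExpr code: the function names are hypothetical, `b` is assumed to be a flat row-major buffer of size `m * n * k`, and the temporary buffer is allocated once per function for simplicity.

```cpp
#include <cstddef>
#include <vector>

// Analogue of the "Before" listing: initialization, accumulation and the
// final reduction each run in their own loop over m.
float sumSeparateLoops(const std::vector<float>& b, std::size_t m,
                       std::size_t n, std::size_t k) {
  std::vector<float> tmp_buf(m);
  float sum = 0.f;
  for (std::size_t mi = 0; mi < m; mi++) {
    tmp_buf[mi] = 0.f;
  }
  for (std::size_t mi = 0; mi < m; mi++) {
    for (std::size_t ni = 0; ni < n; ni++) {
      for (std::size_t ki = 0; ki < k; ki++) {
        tmp_buf[mi] += b[((n * mi) * k + ki) + k * ni];
      }
    }
  }
  for (std::size_t mi = 0; mi < m; mi++) {
    sum += tmp_buf[mi];
  }
  return sum;
}

// Analogue of the "After" listing: a single loop over m holds the
// initialization, the inner reduction and the final accumulation.
float sumFusedLoop(const std::vector<float>& b, std::size_t m,
                   std::size_t n, std::size_t k) {
  std::vector<float> tmp_buf(m);
  float sum = 0.f;
  for (std::size_t mi = 0; mi < m; mi++) {
    tmp_buf[mi] = 0.f;
    for (std::size_t ni = 0; ni < n; ni++) {
      for (std::size_t ki = 0; ki < k; ki++) {
        tmp_buf[mi] += b[((n * mi) * k + ki) + k * ni];
      }
    }
    sum += tmp_buf[mi];
  }
  return sum;
}
```

Because the fused version has a single outer loop over `m`, that one loop is the natural candidate to bind to a GPU axis.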

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40050

Reviewed By: ZolotukhinM

Differential Revision: D22605639

Pulled By: nickgg

fbshipit-source-id: e359be53ea62d9106901cfbbc42d55d0e300e8e0
2020-07-21 14:44:26 -07:00
Nick Gibson
2ed4ed8733 [TensorExpr] Fix two bugs in Rfactor (#39268)
Summary:
The two bugs were:
* Non-reduction axes were not added when inserting the new ReduceOp, meaning that if a reduction with non-reduce axes was rfactored we'd produce bad outputs. There were no tests of Rfactor with non-reduce axes, so I modified a test to do this.
* The new statements were always prepended to the block, meaning writes to a buffer could be reordered after the usage of that buffer. This mostly happened in the case where we rfactor a previously rfactored reduction. There was a test of this, but since it only tested rfactoring the outer reduction axis there were never any other statements at the insertion point (the tests of the insertion point argument also do this). I added a new test which covers various rfactor-axis cases.

Also cleaned up tests, removed some helper code we don't need etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39268

Differential Revision: D21864489

Pulled By: nickgg

fbshipit-source-id: d314d20997a8472ec96b72f7a9068d6da6d2399c
2020-06-03 14:38:34 -07:00
Nick Gibson
cf8001d2d0 [TensorExpr] Fix a bug in Rfactor when there are multiple reductions (#38733)
Summary:
In `LoopNest::rfactor` we assume that there is only a single reduction below the insertion point, and when replacing the reduction we recursively replace all reductions below that point. This is not a safe assumption, as a number of transformations can introduce additional ReduceOps - most directly a `splitWithTail` on the innermost reduce axis.

This PR fixes that bug, and adds some unit tests covering the case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38733

Differential Revision: D21723634

Pulled By: nickgg

fbshipit-source-id: 3ed6ffcdc2c15aef7504f9b2b91e8d827e0b5d88
2020-05-27 16:49:34 -07:00
Nick Gibson
2f21dfb541 [TensorExpr] Eager reduction initialization & removal from ReduceOp (#38585)
Summary:
This PR removes the deferred initializer field from ReduceOp in favour of eagerly initializing buffers when they are created (either in the constructor of `LoopNest`, or in `rfactor()`). This allows a pretty good simplification of reduction logic, removing almost all of the reduction expander and the ReduceInitCleaner & unpopular NoOp node added in the last fix.

Eager initialization is better for us anyway because it allows more opportunities to transform the initialization loop.

Added a few more tests. testReduceOverSplitWithTail failed before this change due to a bug in splitWithTail, which can no longer occur.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38585

Differential Revision: D21621551

Pulled By: nickgg

fbshipit-source-id: 378137e5723b4a6d6e390239efb12adce22a8215
2020-05-18 15:56:43 -07:00
Nick Gibson
8bf3124572 [TensorExpr] Fix bug when splitting inner reduce axis with tail (#38420)
Summary:
Fixes a bug in the following code:
```
    Tensor* c = Reduce("sum", {{10, "m"}}, Sum(), b, {{10, "n"}, {10, "k"}});
    // split N loop with tail:
    loop.splitWithTail(loop.getLoopStmtsFor(c)[1], 8, &outer, &inner, &tail);
```

When this is expanded there are two ReduceOps:

```
for (int m = 0; m < 10; m++) {
  for (int n_outer = 0; n_outer < (10 - 0) / 8; n_outer++) {
    for (int n_inner = 0; n_inner < 8; n_inner++) {
      for (int k = 0; k < 10; k++) {
        sum[m] = ReduceOp(sum, float(0), (sum[m]) + (b[m, n_outer * 8 + n_inner, k]), out_args={m}, reduce_args={n_inner, n_outer, k});
      }
    }
  }
  for (int n_tail = 0; n_tail < (10 - 0) % 8; n_tail++) {
    for (int k = 0; k < 10; k++) {
      sum[m] = ReduceOp(sum, float(0), (sum[m]) + (b[m, n_tail + ((10 - 0) / 8) * 8, k]), out_args={m}, reduce_args={n_tail, k});
    }
  }
}
```

But each ReduceOp will expand its initializer, which in this case will overwrite the sum of the split loop:

```
for (int m = 0; m < 10; m++) {
  sum[m] = 0.f;
  for (int n_inner = 0; n_inner < 8; n_inner++) {
    for (int k = 0; k < 10; k++) {
      sum[m] = (sum[m]) + (b[(100 * m + k) + 10 * n_inner]);
    }
  }
  sum[m] = 0.f;          <------- *HERE*
  for (int n_tail = 0; n_tail < 2; n_tail++) {
    for (int k = 0; k < 10; k++) {
      sum[m] = (sum[m]) + (b[((100 * m + k) + 10 * n_tail) + 80]);
    }
  }
}
```

The simplest fix is to remove the initializer from the tail loop, which requires adding support for Reductions without an initializer (I did this by adding a NoOp Expr rather than handling nullptr). Also moved the ReductionExpander from loopnest.cpp to reduction.h as loopnest is getting a bit heavy.
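
The effect of the duplicated initializer can be reproduced in plain C++. This is only a sketch of the expanded loops above, not the TensorExpr implementation: `Buf3D` and both function names are hypothetical, and `b` is assumed to be a dense 10x10x10 buffer.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for the 10x10x10 input buffer `b`.
using Buf3D = std::array<std::array<std::array<float, 10>, 10>, 10>;

// Mirrors the buggy expansion: the tail loop carries its own initializer,
// which discards what the main (split) loop accumulated.
float reduceRowBuggy(const Buf3D& b, std::size_t m) {
  float sum = 0.f;
  for (std::size_t n_inner = 0; n_inner < 8; n_inner++) {
    for (std::size_t k = 0; k < 10; k++) {
      sum += b[m][n_inner][k];
    }
  }
  sum = 0.f;  // second initializer: overwrites the partial sum
  for (std::size_t n_tail = 0; n_tail < 2; n_tail++) {
    for (std::size_t k = 0; k < 10; k++) {
      sum += b[m][8 + n_tail][k];
    }
  }
  return sum;
}

// Mirrors the fix: only the first loop initializes the accumulator.
float reduceRowFixed(const Buf3D& b, std::size_t m) {
  float sum = 0.f;
  for (std::size_t n_inner = 0; n_inner < 8; n_inner++) {
    for (std::size_t k = 0; k < 10; k++) {
      sum += b[m][n_inner][k];
    }
  }
  for (std::size_t n_tail = 0; n_tail < 2; n_tail++) {
    for (std::size_t k = 0; k < 10; k++) {
      sum += b[m][8 + n_tail][k];
    }
  }
  return sum;
}
```

With every element of `b` set to 1, the buggy version returns only the tail's contribution (20) while the fixed version returns the full row sum (100).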

Added tests for all kinds of splits on a simple 3D reduction to verify no more problems of this type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38420

Differential Revision: D21587583

Pulled By: nickgg

fbshipit-source-id: e0766934481917007119612eb60cc76c3242e44a
2020-05-14 22:58:28 -07:00
Nick Gibson
4e2ea6e013 [TensorExpr] Remove the Tensor argument from loopnest.reorderAxis (#37873)
Summary:
Remove the requirement for the axes provided to reorderAxis to come from a Tensor. We were using that to determine the relevant loops, but we can alternatively determine it by traversing the parents of each provided For.
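
The parent traversal can be sketched as follows. This is an assumption-laden illustration rather than the actual `reorderAxis` code: `Stmt`, its fields and `loopsBetween` are hypothetical, and the real IR classes carry much more state.

```cpp
#include <vector>

// Hypothetical minimal IR node; the real For/Stmt classes are richer.
struct Stmt {
  Stmt* parent = nullptr;
  bool is_for = false;
};

// Collects the For loops from `inner` up to and including `outer` by
// following parent pointers, innermost first. Returns an empty vector
// when `outer` does not enclose `inner`.
std::vector<const Stmt*> loopsBetween(const Stmt* inner, const Stmt* outer) {
  std::vector<const Stmt*> loops;
  for (const Stmt* s = inner; s != nullptr; s = s->parent) {
    if (s->is_for) {
      loops.push_back(s);
    }
    if (s == outer) {
      return loops;
    }
  }
  return {};
}
```

Walking upwards from the inner loop needs no Tensor at all: the two For statements alone determine which loops are involved.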

resistor does this work for you?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37873

Differential Revision: D21428016

Pulled By: nickgg

fbshipit-source-id: b16b2f41cb443dfc2c6548b7980731d1e7d89a35
2020-05-06 12:02:15 -07:00