Commit Graph

63 Commits

Bert Maher
e4d19798f3 [nnc][tests] Convert a bunch of FileCheck to checkIR
Summary:
I added a helper to convert a Stmt to a string and FileCheck it, and
started using it in a bunch of places.  I replaced about half the current uses,
got tired, started to write a Perl script to automate it, realized that was
hard, and decided to give up for a bit.  But this cleans up some of the tests a
bit, so it seems easy to review and worth landing.

Test Plan: test_tensorexpr --gtest_filter=LoopNest.*

Reviewed By: navahgar

Differential Revision: D27375866

fbshipit-source-id: 15894b9089dec5cf25f340fe17e6e54546a64257
2021-03-26 20:27:50 -07:00
Bert Maher
24f589df44 [nnc] Disabled test case for failure in implementing conv1d (#54756)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54756

We have multiple bugs here, one relating to index flattening and the
other to computeAt.
ghstack-source-id: 125054729

Test Plan: yikes

Reviewed By: ZolotukhinM

Differential Revision: D27354082

fbshipit-source-id: 8b15bac28e3eba4629881ae0f3bd143636f65ad7
2021-03-26 20:27:48 -07:00
Bert Maher
e542e67253 [nnc] Test case for computeAt with reduction (#54755)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54755

As title.  A step on the way to using computeAt to optimize
convolution.
ghstack-source-id: 125054730

Test Plan: new test

Reviewed By: ZolotukhinM

Differential Revision: D27353663

fbshipit-source-id: 930e09d96d1f74169bf148cd30fc195c6759a3e9
2021-03-26 20:25:18 -07:00
Raghavan Raman
601e79200d [NNC] Implementing LoopFusion (#54461)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54337

This PR adds a new API to NNC to perform loop fusion.

```
static For* fuseLoops(const std::vector<For*>& loops);
```

Loop fusion is done only when all the conditions below are satisfied.
  * All the loops have the same parent.
  * There are no statements between these loops in their parent body.
  * The start bounds are the same for all loops.
  * The stop bounds are the same for all loops.
  * Fusing the loops does not violate or add any dependencies.

This PR also adds an API to check for partial overlaps in `buffer_inference.h` and fixes a bug in `mem_dependency_checker.cpp`.
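
For illustration, here is a sketch (not taken from the PR) of fusing two loops that satisfy all of the conditions above; `i_loop` and `j_loop` stand for hypothetical `For*` handles:
```
// Before: same parent, same bounds, nothing between the loops.
for (int i = 0; i < 100; i++) {
  A[i] = i * 2;
}
for (int j = 0; j < 100; j++) {
  B[j] = (A[j]) + 1;
}

// After fuseLoops({i_loop, j_loop}): one loop containing both bodies.
// B still reads the A element written in the same iteration, so no
// dependency is violated or added.
for (int i = 0; i < 100; i++) {
  A[i] = i * 2;
  B[i] = (A[i]) + 1;
}
```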

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54461

Reviewed By: bertmaher

Differential Revision: D27254888

Pulled By: navahgar

fbshipit-source-id: c21b027d738e5022e9cb88f6f72cd9e255bdb15e
2021-03-23 21:20:00 -07:00
Raghavan Raman
4b2abc4b8e [NNC] Adding API to distribute loops (#53865)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53864

This PR adds the following APIs that perform loop distribution to `LoopNest`:
```
static std::vector<For*> distributeLoop(For* loop, const std::unordered_set<Stmt*>& pivots);
static std::vector<For*> distributeLoop(For* loop);
static std::vector<For*> distributeLoopOverInnerLoops(For* loop);
```

* The first method distributes the given loop over its body by splitting after every given pivot stmt.
* The second method distributes the given loop over every stmt in its body.
* The last method distributes the given loop over its body by splitting after every `For` stmt in its body.
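
As a sketch (not from the PR), `distributeLoop(loop)` applied to a loop whose body has two statements would split it into one loop per statement:
```
// Before: one loop whose body has two statements.
for (int i = 0; i < 100; i++) {
  A[i] = i * 2;
  B[i] = (A[i]) + 1;
}

// After distributeLoop(i_loop): one loop per statement.
for (int i = 0; i < 100; i++) {
  A[i] = i * 2;
}
for (int i_1 = 0; i_1 < 100; i_1++) {
  B[i_1] = (A[i_1]) + 1;
}
```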

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53865

Reviewed By: mruberry

Differential Revision: D27075006

Pulled By: navahgar

fbshipit-source-id: 031746aad619fe84c109e78b53387535e7f77cef
2021-03-18 07:27:39 -07:00
Bert Maher
a852fdb6b5 [nnc] Test for using int64 dimensions (#54094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54094

We should be able to use 64-bit integers for loop boundaries and
buffer/tensor indexing.
ghstack-source-id: 124116846

Test Plan: New tests, disabled

Reviewed By: ZolotukhinM

Differential Revision: D27094934

fbshipit-source-id: a53de21a0ef523ea3560d5dd4707df50624896ef
2021-03-17 10:59:26 -07:00
Raghavan Raman
ef07a04072 [NNC] New APIs to get loops corresponding to a Buf (#53778)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53092

This PR adds the following APIs to NNC.
```
// In For:
static For* getParentLoop(const Stmt* st);
static std::vector<For*> getEnclosingLoopNest(const Stmt* st);

// In LoopNest:
std::vector<const Stmt*> getAllWritesToBuf(const Buf*) const;
std::vector<For*> getAllInnermostLoopsWritingToBuf(const Buf*) const;
std::vector<std::vector<For*>> getAllLoopNestsWritingToBuf(const Buf*) const;
```

These APIs are required for some use cases that involve multiple transformations, like `splitWithTail` followed by `reorder`, as shown in https://github.com/pytorch/pytorch/issues/53092
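
A rough usage sketch (illustrative only; it assumes a `LoopNest nest` whose output Tensor `tensor` has already been transformed, e.g. by `splitWithTail`, so earlier `For*` handles are stale):
```
const Buf* buf = tensor->buf();

// Every loop nest (outermost to innermost) that writes to buf:
std::vector<std::vector<For*>> nests = nest.getAllLoopNestsWritingToBuf(buf);

// Just the innermost loop around each write, e.g. to reorder next:
std::vector<For*> inner = nest.getAllInnermostLoopsWritingToBuf(buf);

// The full nest enclosing any given statement:
std::vector<For*> enclosing = For::getEnclosingLoopNest(inner[0]);
```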

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53778

Reviewed By: albanD

Differential Revision: D26987013

Pulled By: navahgar

fbshipit-source-id: 491459eddfff045132d2358631ad069bbcc520df
2021-03-12 18:50:15 -08:00
Horace He
d4602b7e45 [NNC] Fixes case where inlining wouldn't work because dim-size was 1. (#53254)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52581

The git diff is absolutely atrocious since I also refactored the code to share stuff between `Load` and `FunctionCall`.

Biggest questions I have about this diff are:

1. The asserts I added. From my understanding, it's not possible to have a non-zero constant index in `Store`, since `Store` always creates a new buffer. The user could still write this kind of incorrect code, though, so maybe I should just check for it rather than assert?

2. I don't think(?) I need to do any special handling for `index_vars`, but I wasn't totally able to track the logic there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53254

Reviewed By: albanD

Differential Revision: D26991064

Pulled By: Chillee

fbshipit-source-id: 0bcd612d5f4b031c0b34e68a72d9c8d12d118be8
2021-03-11 20:53:20 -08:00
Bert Maher
3bd250fd03 [nnc] Test ability to vectorize reads from an intermediate tensor (#53752)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53752

This test doesn't work today because we don't properly vectorize
"FunctionCall" (which is the way one accesses an intermediate tensor).
ghstack-source-id: 123592860

Test Plan: `buck test //caffe2/test/cpp/tensorexpr -- LoopNest.VectorizeUse`

Reviewed By: ZolotukhinM

Differential Revision: D26895550

fbshipit-source-id: 0798ebf3e6a834bd70181732c81528455d5329fa
2021-03-10 20:32:10 -08:00
Bert Maher
565d8235e5 [nnc] Test cases for uneven split + reorder (#53091)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53091

Split with tail followed by reorder causes a segfault in NNC.
Split with mask followed by reorder generates invalid code that writes out of
bounds.
ghstack-source-id: 122870733

Test Plan: LoopNest.ColReduceSplit*

Reviewed By: navahgar

Differential Revision: D26746254

fbshipit-source-id: f8a0de18531b34d2bf06ccaa35d9c98b81b5c600
2021-03-02 20:36:48 -08:00
Mikhail Zolotukhin
aba33b0042 [TensorExpr] IRVerifier: add index verifier for Store. (#53137)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53137

Also, add casting to Int for Load and Store indices.

Fixes #52773.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26760256

Pulled By: ZolotukhinM

fbshipit-source-id: a2d3141b17584724a5feabcabec25d0577b83a30
2021-03-02 19:56:28 -08:00
Raghavan Raman
aae188c529 [NNC] Handle non-literal constant bounds in Unroll. (#53029)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53000

Also added a test to confirm this case works in FlattenLoop as well.
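
For example (a sketch, not from the PR), a bound can be constant without being a literal until the simplifier runs:
```
// Constant but non-literal stop expression, e.g. left over from an
// earlier transformation:
for (int i = 0; i < 2 + 3; i++) {
  A[i] = i;
}
// Unroll should treat this the same as "for (int i = 0; i < 5; i++)".
```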

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53029

Reviewed By: bertmaher

Differential Revision: D26742705

Pulled By: navahgar

fbshipit-source-id: d87a0f9698411026b5b6e55eee7c2b9fb123d06b
2021-03-02 00:35:27 -08:00
Mikhail Zolotukhin
d3b427a0e3 [TensorExpr] Add an unmasked Load constructor. (#52790)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52790

Fixes #52774.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26649542

Pulled By: ZolotukhinM

fbshipit-source-id: ab1c9e55f52e59d0bd00fbde2ec3125f8c7917ee
2021-02-24 22:45:29 -08:00
Mikhail Zolotukhin
88a160dc21 [TensorExpr] LoopNest: Cleanup LoopNest constructors. (#52726)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52726

This change removes `input_bufs_` and `intermediate_bufs_` from the
`LoopNest` class, as they can be deduced from the root stmt and the list
of output bufs. As a result, the constructor of the LoopNest also becomes
simpler, as we now need to pass just one list of bufs.

Note: we might consider passing the list of input bufs for verification
purposes (only input buffers are allowed to not have a definition), but
since we don't really have an IR verifier yet, there is no need for it
now. Once we add an IR verifier, we could reconsider it.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26629596

Pulled By: ZolotukhinM

fbshipit-source-id: 81f544e9602b6855b7968d540b9ae06bd7c7e6d8
2021-02-24 13:26:22 -08:00
Mikhail Zolotukhin
b63a1e31d3 [TensorExpr] Inlining: allow inlining into Load exprs. (#52627)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52627

Currently the inliner only inlines into Calls; this PR extends it to
cover Loads too. Eventually we will remove Calls altogether and use
Loads everywhere; this is one step in that direction.
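
Illustratively (this example is not from the PR), for a tensor X defined as `X[i] = i * 2`, inlining now handles both access forms:
```
A[i] = (X(i)) + 1;   // Call form: already handled
A[i] = (X[i]) + 1;   // Load form: handled after this PR
// Either way, inlining X yields:
A[i] = (i * 2) + 1;
```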

Differential Revision: D26589377

Test Plan: Imported from OSS

Reviewed By: asuhan

Pulled By: ZolotukhinM

fbshipit-source-id: ca28f0df2273eb214f203467c6ba3d8f02a8a3b6
2021-02-22 21:47:24 -08:00
Hui Guo
973e306c84 Changed the TE 'Allocate' API to take one argument, 'Buf', instead of three arguments: 'Var', 'dtype', and 'dims'. (#50167)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50167

Test Plan:
Imported from OSS

`python test/test_jit_fuser_te.py`
`python test/test_jit_fuser_legacy.py`
`python test/test_jit_fuser.py`
`build/bin/test_tensorexpr`

Reviewed By: ZolotukhinM

Differential Revision: D25814342

Pulled By: huiguoo

fbshipit-source-id: 44cba7f92365b826c9cb1d385a94858934570dee
2021-02-22 15:08:51 -08:00
Mikhail Zolotukhin
e975169426 [TensorExpr] Redesign Tensor class. (#50995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50995

This change makes 'Tensor' a thin wrapper over 'Buf' and 'Stmt', and
merges it with the recently introduced 'CompoundTensor'. A statement for the
tensor is either passed directly to the Tensor constructor (akin to
'CompoundTensor') or is built immediately in the constructor.

LoopNest is no longer responsible for constructing statements from
tensors - it simply stitches together already constructed statements contained
in Tensors. A side effect is that we can no longer construct several
loopnests from the same tensors - we need to explicitly clone statements
if we want to do that. A special copy constructor was added to LoopNest
to make this more convenient (note: this only affects tests; we don't
usually create multiple loopnests in other places).

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26038223

Pulled By: ZolotukhinM

fbshipit-source-id: 27a2e5900437cfb0c151e8f89815edec53608e17
2021-01-27 16:14:22 -08:00
Mikhail Zolotukhin
42aeb68128 [TensorExpr] Move 'initializer' field from 'Tensor' to 'Buf'. (#50993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50993

This is the first step to making 'Tensor' a thin wrapper over 'Buf' and
'Stmt', which will be finished in subsequent PRs. This change also
allows us to remove 'buf_initializers_' from 'LoopNest', making it "less
stateful".

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26038224

Pulled By: ZolotukhinM

fbshipit-source-id: f418816e54c62f291fa45812901487394e9b95b5
2021-01-27 16:10:53 -08:00
Andres Suarez
8530c65e25 [codemod][fbcode/caffe2] Apply clang-format update fixes
Test Plan: Sandcastle and visual inspection.

Reviewed By: igorsugak

Differential Revision: D25849205

fbshipit-source-id: ef664c1ad4b3ee92d5c020a5511b4ef9837a09a0
2021-01-09 14:37:36 -08:00
Elias Ellison
efe1fc21fc Don't inline intermediates on CPU (#49565)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49565

Test Plan: Imported from OSS

Reviewed By: Krovatkin, ZolotukhinM

Differential Revision: D25688271

Pulled By: eellison

fbshipit-source-id: 9ea7858e2db4fb31292e04440fc72ee04623c688
2021-01-04 15:46:20 -08:00
Mikhail Zolotukhin
a5b27d7a31 [TensorExpr] Move SimpleIREval implementation from .h to .cpp. (#49697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49697

Mostly a mechanical move. This refactoring helps hide unnecessary
details from the SimpleIREval interface and makes it more similar to a
pure 'codegen'.

Test Plan: Imported from OSS

Reviewed By: nickgg

Differential Revision: D25668696

Pulled By: ZolotukhinM

fbshipit-source-id: 423247bfcdfa88403e8ec92152f00110bb9da19c
2020-12-21 20:20:15 -08:00
Nick Gibson
db2e9c1e7f [NNC] Intermediate allocs flattened and dependency support (#49554)
Summary:
Makes two changes in NNC for intermediate buffer allocations:
1. Flattens dimensions of buffers allocated in LoopNest::prepareForCodegen() to match their flattened usages.
2. Adds support for tracking memory dependencies of Alloc/Free to the MemDependencyChecker, which will allow us to check safety of accesses to intermediate buffers (coming in a future diff).

I didn't add any new tests as the mem dependency checker tests already cover it pretty well, particularly the GEMM test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49554

Reviewed By: VitalyFedyunin

Differential Revision: D25643133

Pulled By: nickgg

fbshipit-source-id: 66be3054eb36f0a4279d0c36562e63aa2dae371c
2020-12-21 10:35:15 -08:00
Elias Ellison
9056173acc [NNC] Don't inline output buffers on CPU (#49488)
Summary:
In https://github.com/pytorch/pytorch/pull/48967/ we enabled output buffer inlining, which results in duplicate computation if one output depends on another. This was done to fix correctness for CUDA, but it is not needed for correctness on CPU and results in a perf slowdown.
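
A sketch of the duplicate computation (illustrative, not from the PR): if output `out2` depends on output `out1`, inlining `out1` into `out2` recomputes it:
```
// Without inlining (fine on CPU):
out1[i] = expensive(i);
out2[i] = (out1[i]) + 1;

// With output inlining (needed for CUDA correctness):
out1[i] = expensive(i);
out2[i] = (expensive(i)) + 1;  // expensive(i) is computed twice
```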

Output buffer inlining on CUDA is intended as an interim solution because it does not work with reductions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49488

Reviewed By: ezyang

Differential Revision: D25596071

Pulled By: eellison

fbshipit-source-id: bc3d987645da5ce3c603b4abac3586b169656cfd
2020-12-16 16:28:25 -08:00
Nick Gibson
5469aa5e7f [NNC] Add a non functional Tensor kind (#48750)
Summary:
Adds the CompoundTensor, a specialisation of the NNC Tensor that allows arbitrary production statements. This will allow lowering of aten ops into specific NNC IR patterns (which don't need to be functional), letting us shortcut to the optimized form of common patterns.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48750

Reviewed By: tugsbayasgalan

Differential Revision: D25433517

Pulled By: nickgg

fbshipit-source-id: de13c4719f8f87619ab254e5f324f13b5be1c9da
2020-12-10 19:43:50 -08:00
Nick Gibson
c5bc6b40ab [NNC] Dead Store Elimination (#49030)
Summary:
Adds a new optimization method to LoopNest which eliminates stores that do not contribute to any output. It's unlikely any of the lowerings of aten operators produce these stores yet, but this creates some wiggle room for transformations in the future.
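
A minimal sketch of the kind of store this removes (illustrative, not from the PR):
```
// T is written but never read and is not an output:
for (int i = 0; i < 10; i++) {
  T[i] = i * 2;
  O[i] = i + 1;
}

// After dead store elimination:
for (int i = 0; i < 10; i++) {
  O[i] = i + 1;
}
```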

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49030

Reviewed By: tugsbayasgalan

Differential Revision: D25434538

Pulled By: nickgg

fbshipit-source-id: fa1ead82e6f7440cc783c6116b23d0b7a5b5db4b
2020-12-09 18:49:53 -08:00
Mikhail Zolotukhin
2b70bcd014 [TensorExpr] Enable inlining for output tensors too. (#48967)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48967

We previously didn't inline output tensors, which resulted in correctness
issues like #48533. This PR allows inlining for output tensors too;
this could result in duplicated computations, but we can address that
later once correctness is ensured.

Performance results on FastRNNS:
Before the fix:
```
Benchmarking LSTMs...
            name          avg_fwd          std_fwd          avg_bwd          std_bwd
           cudnn            10.09          0.05431            17.55           0.2108
            aten            21.52           0.1276             26.7            1.471
             jit            13.25           0.8748            22.47             1.73
      jit_premul            11.43           0.3226            19.43            2.231
 jit_premul_bias            11.84           0.2245            20.33            2.205
      jit_simple            13.27           0.9906            22.15           0.9724
  jit_multilayer            13.38           0.8748            22.82             1.01
              py            33.55            4.837            46.41            6.333
```
After the fix:
```
Benchmarking LSTMs...
            name          avg_fwd          std_fwd          avg_bwd          std_bwd
           cudnn            10.09          0.05979            17.45           0.1987
            aten            21.21            0.144            26.43           0.7356
             jit            13.01           0.2925            23.21           0.8454
      jit_premul             11.4           0.3905            19.62            2.448
 jit_premul_bias            11.85           0.2461            20.29           0.6592
      jit_simple            13.08           0.8533            22.81            1.315
  jit_multilayer            12.93           0.1095            23.57            1.459
              py            31.21            2.783            44.63            6.073
```

Differential Revision: D25383949

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Pulled By: ZolotukhinM

fbshipit-source-id: 16f5727475109a278499bef7905f6aad18c8527a
2020-12-08 13:24:40 -08:00
Bert Maher
07657b6001 [tensorexpr] Switch cpp tests to pure gtest (#48160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48160

We no longer use the custom C++ test infra anyway, so move to pure
gtest.

Fixes #45703
ghstack-source-id: 116977283

Test Plan: `buck test //caffe2/test/cpp/tensorexpr`

Reviewed By: navahgar, nickgg

Differential Revision: D25046618

fbshipit-source-id: da34183d87465f410379048148c28e1623618553
2020-11-18 12:23:34 -08:00
Raghavan Raman
fa108bd264 Add flatten loops transformation (#46365)
Summary:
This diff removes the dependency of flattening on tensors by performing flattening on loops instead.
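
As an illustration (not from the PR), flattening a 2-D loop nest yields a single loop whose indices are recovered with division and modulo:
```
// Before: a 2-D loop nest.
for (int i = 0; i < 10; i++) {
  for (int j = 0; j < 5; j++) {
    A[i, j] = i * j;
  }
}

// After flattening {i, j} into one loop:
for (int i_flat = 0; i_flat < 50; i_flat++) {
  A[i_flat / 5, i_flat % 5] = (i_flat / 5) * (i_flat % 5);
}
```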

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46365

Reviewed By: ailzhang

Differential Revision: D24366347

Pulled By: navahgar

fbshipit-source-id: 4ba182f37212b6e4033cae13f8e75bc5144389f4
2020-10-16 17:05:26 -07:00
Nick Gibson
402abdfdf4 [NNC] cacheAccesses transform (cache_reads + cache_writes) (#45869)
Summary:
Adds a new transform to the NNC compiler that supports caching of buffer accesses. All accesses within a provided scope are redirected to a cache, which is initialized or written back as necessary at the boundaries of that scope. For TVM fans, this is essentially a combination of cache_reads and cache_writes. E.g. it can do this kind of thing:

Before:
```
for (int i = 0; i < 64; i++) {
  for (int j = 0; j < 64; j++) {
    A[i, j] = i * j;
  }
}
for (int i_1 = 0; i_1 < 20; i_1++) {
  for (int j_1 = 0; j_1 < 10; j_1++) {
    B[i_1, j_1] = (A(i_1 + 30, j_1 + 40)) + (A(i_1 + 31, j_1 + 41));
  }
}
```

After `cacheAccesses(A->buf(), "A_local", j_loop);`

```
for (int i = 0; i < 64; i++) {
  for (int j = 0; j < 64; j++) {
    A[i, j] = i * j;
  }
}
for (int i_1 = 0; i_1 < 20; i_1++) {
  for (int i_2 = 0; i_2 < 2; i_2++) {
    for (int j_1 = 0; j_1 < 11; j_1++) {
      A_local[i_2, j_1] = A[(i_2 + i_1) + 30, j_1 + 40];
    }
  }
  for (int j_2 = 0; j_2 < 10; j_2++) {
    B[i_1, j_2] = (A_local[1, j_2 + 1]) + (A_local[0, j_2]);
  }
}
```

Or this reduction:
```
for (int l1 = 0; l1 < 4; l1++) {
  sum[l1] = 0.f;
  for (int n1_1 = 0; n1_1 < 3; n1_1++) {
    for (int m1_1 = 0; m1_1 < 2; m1_1++) {
      sum[l1] = (sum[l1]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
    }
  }
}
```

After `l.cacheAccesses(d->buf(), "d_local", n_loop);`:

```
for (int l1 = 0; l1 < 4; l1++) {
  Allocate(d_local, float, {1});
  sum[l1] = 0.f;
  d_local[0] = 0.f;
  for (int n1_1 = 0; n1_1 < 3; n1_1++) {
    for (int m1_1 = 0; m1_1 < 2; m1_1++) {
      d_local[0] = (d_local[0]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
    }
  }
  sum[l1] = (sum[l1]) + (d_local[0]);
  Free(d_local);
}
```

I had originally planned to write `cacheReads` and `cacheWrites` wrappers so we could use them just like their TVM cousins, but they just ended up being big masses of checks that reads or writes weren't present. That didn't feel too useful, so I removed them, but let me know.

This is based on bounds inference and inherits a few bugs present in that functionality, which I will address in a followup.

While working on this I realized that it overlaps heavily with `computeAt`, which is really just `cacheReads` + `computeInline`. I'm considering refactoring computeAt to be a wrapper around those two transforms. ZolotukhinM, opinions on this?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45869

Reviewed By: mruberry

Differential Revision: D24195276

Pulled By: nickgg

fbshipit-source-id: 36a58ae265f346903187ebc4923637b628048155
2020-10-08 14:13:28 -07:00
Mikhail Zolotukhin
4aca63d38a [TensorExpr] Change API for creating Load and Store expressions. (#45520)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45520

With this change, `Load`s and `Store`s no longer accept `Placeholder`s in
their constructors and `::make` functions and can only be built with a
`Buf`.
`Placeholder` gets its own `store`, `load`, `storeWithMask`, and
`loadWithMask` methods for more convenient construction.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23998789

Pulled By: ZolotukhinM

fbshipit-source-id: 3fe018e00c1529a563553b2b215f403b34aea912
2020-09-29 20:52:38 -07:00
Mikhail Zolotukhin
3c33695a6d [TensorExpr] Rename Buffer to Placeholder. (#45389)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45389

Differential Revision: D23952866

Test Plan: Imported from OSS

Reviewed By: nickgg

Pulled By: ZolotukhinM

fbshipit-source-id: 17eedd3ac17897501403482ac1866c569d247c75
2020-09-29 01:21:54 -07:00
Mikhail Zolotukhin
92306b85d5 [TensorExpr] Consolidate {buffer,function,tensor}.{h.cpp} in tensor.{h,cpp}. (#45388)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45388

Classes defined in these files are closely related, so it is reasonable
to have them all in one file. The change is purely a code move.

Differential Revision: D23952867

Test Plan: Imported from OSS

Reviewed By: nickgg

Pulled By: ZolotukhinM

fbshipit-source-id: 12cfaa968bdfc4dff00509e34310a497c7b59155
2020-09-29 01:17:10 -07:00
Nick Gibson
9e206ee9f1 [NNC] Fix a bug in SplitWithMask when splitting multiple times (#45141)
Summary:
When doing a splitWithMask, we only mask if the loop extent is not cleanly divided by the split factor. However, the logic does not simplify the extent expression first, so any nontrivial loop extent (e.g. one left by a previous split) will always cause a mask to be added. Unlike splitWithTail, the masks added by splitWithMask are always overhead, and we don't have the analysis to optimize them out when they are unnecessary, so it's good to avoid inserting them if we can.

The fix is just to simplify the loop extents before doing the extent calculation.
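
A sketch of the overhead (illustrative): splitting a loop of extent 16 by 4 divides cleanly and needs no mask, but with an unsimplified extent expression (e.g. `8 * 2` left by a previous split) the redundant mask was still inserted:
```
for (int i_outer = 0; i_outer < 4; i_outer++) {
  for (int i_inner = 0; i_inner < 4; i_inner++) {
    if (i_outer * 4 + i_inner < 16) {  // always true: pure overhead
      A[i_outer * 4 + i_inner] = 0.f;
    }
  }
}
```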

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45141

Reviewed By: ezyang

Differential Revision: D23869170

Pulled By: nickgg

fbshipit-source-id: 44686fd7b802965ca4f5097b0172a41cf837a1f5
2020-09-23 14:04:58 -07:00
Nick Gibson
69839ea3f6 [NNC] make inlining immediate (take 3) (#44231)
Summary:
This is a reup of https://github.com/pytorch/pytorch/issues/43885 with an extra commit that should fix the bugs that caused it to be reverted. Read that PR for general context.

The issue here was that we were still using the side maps `tensor_to_stmt_` and `stmt_to_tensor_` which get invalidated by any transform of the IR (rather than just any transform that isn't computeInline). I added a comment about this but didn't actually address our usages of it.

I've removed these maps and changed the `getLoopBodyFor` and `getLoopStatementsFor` helpers to search the root stmt directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44231

Reviewed By: albanD

Differential Revision: D23689688

Pulled By: nickgg

fbshipit-source-id: 1c6009a880f8c0cebf2300fd06b5cc9322bffbf9
2020-09-15 11:12:24 -07:00
Alex Suhan
a188dbdf3f Check for index-rank consistency in FunctionInliner (#44561)
Summary:
When caller/callee pairs are inserted into the mapping, verify that
the arity of the buffer access is consistent with its declared rank.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44561

Test Plan: CI, test_tensorexpr --gtest_filter=TensorExprTest.DetectInlineRankMismatch

Reviewed By: albanD

Differential Revision: D23684342

Pulled By: asuhan

fbshipit-source-id: dd3a0cdd4c2492853fa68381468e0ec037136cab
2020-09-14 14:07:22 -07:00
Cheng Chang
b7ef4eec46 [NNC] Add loop slicing transforms (#43854)
Summary:
Add new transforms `sliceHead` and `sliceTail` to `LoopNest`, for example:

Before transformation:
```
for x in 0..10:
  A[x] = x*2
```

After `sliceHead(x, 4)`:

```
for x in 0..4:
  A[x] = x*2
for x in 4..10:
  A[x] = x*2
```

After `sliceTail(x, 1)`:
```
for x in 0..4:
  A[x] = x*2
for x in 4..9:
  A[x] = x*2
for x in 9..10:
  A[x] = x*2
```

`sliceHead(x, 10)` and `sliceTail(x, 10)` are no-ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43854

Test Plan: Tests are added in `test_loopnest.cpp`; they cover the basic transformations and also test the combination with other transformations such as `splitWithTail`.

Reviewed By: nickgg

Differential Revision: D23417366

Pulled By: cheng-chang

fbshipit-source-id: 06c6348285f2bafb4be3286d1642bfbe1ea499bf
2020-09-11 12:09:12 -07:00
Cheng Chang
28bd4929bd [NNC] Make it able to normalize loop with variable start (#44133)
Summary:
Loops with a variable start can also be normalized.
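
Illustratively (not from the PR), normalizing a loop whose start is a variable `n`:
```
// Before: the loop starts at a variable.
for (int i = n; i < n + 10; i++) {
  A[i] = i * 2;
}

// After normalization: start at 0, with the index shifted in the body.
for (int i = 0; i < 10; i++) {
  A[i + n] = (i + n) * 2;
}
```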

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44133

Test Plan: updated testNormalizeStartVariable.

Reviewed By: navahgar

Differential Revision: D23507097

Pulled By: cheng-chang

fbshipit-source-id: 4e9aad1cd4f4a839f59a00bf8ddf97637a1a6648
2020-09-09 23:05:57 -07:00
Mikhail Zolotukhin
6474057c76 Revert D23503636: [pytorch][PR] [NNC] make inlining immediate (take 2) and fix bugs
Test Plan: revert-hammer

Differential Revision:
D23503636 (70aecd2a7f)

Original commit changeset: cdbdc902b7a1

fbshipit-source-id: b5164835f874a56213de4bed9ad690164eae9230
2020-09-04 10:58:23 -07:00
Nick Gibson
70aecd2a7f [NNC] make inlining immediate (take 2) and fix bugs (#43885)
Summary:
A rework of `computeInline` which makes it work a bit better, particularly when combined with other transformations. Previously we stored Functions that were inlined and then deferred the actual inlining of the function body until prepareForCodegen was called. This had an issue when transformations were applied to the LoopNest: the function body could be different from what appears in the root_stmt and result in inlining that a) fails, b) reverses other transformations, or c) does a weird, unpredictable combination of the two.

This PR changes that behaviour so that the inlining occurs in the root stmt immediately, which means it reflects any previous transformations, and any future transformations have a true view of the internal IR. It also has the benefit that inspecting the root statement gives an accurate view of it without needing to call prepareForCodegen. I also removed the difference between `computeInline` and `computeInlineWithRand`, and we handle calls to `rand()` in all branches.

This is a rework of https://github.com/pytorch/pytorch/issues/38696, with the agreed changes from ZolotukhinM and zheng-xq: we should only inline if the dimensions are trivial (i.e. they are vars, not exprs).

This PR is mostly tests, and I fixed a bunch of bugs I found along the way. Partial list:
* When inlining an expression involving rand, we would create random vars equal to the dimensionality of the enclosing Tensor, not the produced Tensor - meaning we'd use an incorrect value if the inlined tensor was smaller. E.g: `X[i] = rand(); A[i, j] = X[i]` would produce a tensor where `A[0, 0] != A[0, 1]`. This is fixed by inserting the Let binding of the random variable into the correct loop body.
* When inlining we'd replace all calls to `rand()` rather than just those present in the Tensor being inlined.
* `rand()` was treated symbolically by the simplifier and we would aggregate or cancel calls to `rand()`. Have fixed the hasher to hash all calls to `rand()` distinctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43885

Reviewed By: gmagogsfm

Differential Revision: D23503636

Pulled By: nickgg

fbshipit-source-id: cdbdc902b7a14d269911d978a74a1c11eab004fa
2020-09-03 16:49:24 -07:00
Raghavan Raman
100649d6a9 Normalize loops with non-zero start. (#43179)
Summary:
This diff normalizes for-loops that have non-zero starts so that they always start from 0. Given a for-loop, this normalization changes the loop start to 0 and adjusts the loop end and all accesses to the index variable within the loop body appropriately.
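
For example (a sketch, not from the diff):
```
// Before: the loop starts at 5.
for (int i = 5; i < 15; i++) {
  A[i] = i;
}

// After normalization:
for (int i = 0; i < 10; i++) {
  A[i + 5] = i + 5;
}
```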

This diff also adds tests for several cases of normalization, and also tests normalization in conjunction with the `splitWithTail` transformation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43179

Reviewed By: nickgg

Differential Revision: D23220534

Pulled By: navahgar

fbshipit-source-id: 64be0c72e4dbc76906084f7089dea81ae07d6020
2020-08-21 12:37:27 -07:00
Nick Gibson
944ac133d0 [NNC] Remove VarBinding and go back to Let stmts (#42634)
Summary:
A while back, when commonizing the Let and LetStmt nodes, I ended up removing both and adding a separate VarBinding section to the Block. At the time I couldn't find a counterexample, but I found one today: local Var and Allocation dependencies may go in either direction, and so we need to support interleaving of those statements.

So, I've removed all the VarBinding logic and reimplemented Let statements. ZolotukhinM, I think you get to say "I told you so". No new tests; existing tests should cover this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42634

Reviewed By: mruberry

Differential Revision: D22969771

Pulled By: nickgg

fbshipit-source-id: a46c5193357902d0f59bf30ab103fe123b1503f1
2020-08-07 10:50:38 -07:00
Alexandru Suhan
1848b43c4d [NNC] Add loop unroll transformation (#42465)
Summary:
Unroll a loop with constant boundaries, replacing it with multiple
instances of the loop body. For example:

```
for x in 0..3:
  A[x] = x*2
```

becomes:

```
A[0] = 0
A[1] = 2
A[2] = 4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42465

Test Plan: `test_tensorexpr` unit tests.

Reviewed By: agolynski

Differential Revision: D22914418

Pulled By: asuhan

fbshipit-source-id: 72ca10d7c0b1ac7f9a3688ac872bd94a1c53dc51
2020-08-05 20:46:32 -07:00
Nick Gibson
f47e00bdc3 [NNC] Bounds Inference: make inferred bounds respect gaps (#42185)
Summary:
A heavy refactor of bounds inference to fix some issues and bugs that were blocking its use for analyzing cross-thread interactions:
* We were merging all accesses to a Buf into a single bounds info entry, even if they did not overlap. E.g. if we accessed a[0:2] and a[5:6] we would merge that into a bound of a[0:6]. I've changed this behaviour to merge only overlapping bounds.
* We were not separating bounds of different kinds (e.g. Load vs Store) and would merge a Store bounds into a Load bounds, losing the information about what kind of access it was. E.g. this loop would produce bounds: [{Load, 0, 10}] and now produces bounds [{Load, 0, 9}, {Store, 1, 10}]:
```
for i in 1 to 10...
  x[i] = x[i-1]
```
* Both ComputeAt and Rfactor relied on the overzealous merging and only used a single entry in the bounds list to determine the bounds of temporary buffers they created, which could result in temporary buffers being allocated smaller than the accesses to them. I've fixed Rfactor, but *not* ComputeAt - however, all ComputeAt tests still pass (this may require loop fusion to trigger) - I will come back to it.

Being more precise about bounds is more complex: rather than taking the minimum of starts and the maximum of stops, we now need to determine whether two bounds overlap or are adjacent. There are many edge cases, so I've added a bunch of test coverage of the merging method.
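
Illustratively (the ranges are made up), the merge now distinguishes three cases:
```
// a[0:2] and a[1:5]  -> overlapping: merged into a[0:5]
// a[0:2] and a[3:4]  -> adjacent:    merged into a[0:4]
// a[0:2] and a[5:6]  -> disjoint:    kept as two separate bounds
```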

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42185

Reviewed By: mruberry

Differential Revision: D22870391

Pulled By: nickgg

fbshipit-source-id: 3ee34fcbf0740a47259defeb44cba783b54d0baa
2020-07-31 20:22:04 -07:00
Nick Gibson
aa91a65b59 [TensorExpr] Fix propagation of loop options when splitting loops (#40035)
Summary:
Fixes a bug in SplitWithTail and SplitWithMask where loop_options such as CUDA block/thread bindings were overwritten by the split. This PR propagates the loop options to the outer loop, which for axis bindings should be equivalent.
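
A schematic sketch of the intended behaviour (the axis-binding annotation is illustrative, not actual NNC output):
```
// A loop bound to a GPU axis:
for (int i = 0; i < 64; i++) {  // gpu_block_index = 0
  A[i] = 0.f;
}

// After splitting by 8, the binding is propagated to the outer loop
// instead of being dropped:
for (int i_outer = 0; i_outer < 8; i_outer++) {  // gpu_block_index = 0
  for (int i_inner = 0; i_inner < 8; i_inner++) {
    A[i_outer * 8 + i_inner] = 0.f;
  }
}
```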

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40035

Reviewed By: ZolotukhinM

Differential Revision: D22080263

Pulled By: nickgg

fbshipit-source-id: b8a9583fd90f69319fc4bb4db644e91f6ffa8e67
2020-07-22 11:49:07 -07:00
Nick Gibson
5153cdbe87 [TensorExpr] fix a bug in ReorderAxis when there are trailing loops (#38841)
Summary:
Fixes a bug in reorder axis where we appended the new reordered loops to the enclosing block, even if there were statements after it. E.g. with 3 Computes:
```
for (int m1 ...
  for (int n1 ...
    for (int k1 ...
      Body 1
for (int m2 ...
  for (int n2 ...
    for (int k2 ...
      Body 2
for (int m3 ...
  for (int n3 ...
    for (int k3 ...
      Body 3
```

If we reorder loops m2 and k2, we were also reordering the body statements like this:

```
for (int m1 ...
  for (int n1 ...
    for (int k1 ...
      Body 1
for (int m3 ...
  for (int n3 ...
    for (int k3 ...
      Body 3
for (int k2 ...
  for (int n2 ...
    for (int m2 ...
      Body 2
```

This is because we always appended the new loops to their parent. This PR fixes the logic to replace the old loop root with the new loop, which keeps things consistent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38841

Differential Revision: D21723670

Pulled By: nickgg

fbshipit-source-id: 1dee8bb153182fcaa2cabd948197577e8e80acd7
2020-05-31 22:22:45 -07:00
Nikita Shulga
c6e9e9359f [Codemod][GleanFbcode] Remove dead includes in caffe2/test (#39023)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39023

Reviewed By: orionr

Differential Revision: D21702529

fbshipit-source-id: 6945bba95609102409850b105a8a091e33b8acc9
2020-05-27 14:07:26 -07:00
Owen Anderson
65260d48c8 Fix splitWithTail to insert the tail immediately after the outer loop. (#37941)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37941

Differential Revision: D21429733

Pulled By: resistor

fbshipit-source-id: 12094d990c11da8b44f32a52aa5e50b3f3575145
2020-05-07 00:05:23 -07:00
Nick Gibson
4e2ea6e013 [TensorExpr] Remove the Tensor argument from loopnest.reorderAxis (#37873)
Summary:
Remove the requirement for the axes provided to reorderAxis to come from a Tensor. We were using that to determine the relevant loops, but we can alternatively determine them by traversing the parents of each provided For.

resistor does this work for you?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37873

Differential Revision: D21428016

Pulled By: nickgg

fbshipit-source-id: b16b2f41cb443dfc2c6548b7980731d1e7d89a35
2020-05-06 12:02:15 -07:00
Mikhail Zolotukhin
1c0bad25f3 [TensorExpr] Add dtype to class Buf. (#36611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36611

Buf represents the underlying storage, but previously it didn't have a dtype.
That resulted in dtypes being specified in different places, and there was no
mechanism to enforce their consistency: e.g. one could have created a kFloat
expression and used a kInt buffer to store its result. Now we're
centralizing where the logic regarding the storage is located, and we can
start enforcing semantic rules.

Follow-ups: we can merge Buffer and BufHandle classes as the former is
now a mere wrapper over the latter.

Test Plan: Imported from OSS

Differential Revision: D21027356

Pulled By: ZolotukhinM

fbshipit-source-id: c06aa2c4077fdcde3bb4ca622d324aece79b5a9c
2020-05-05 15:04:37 -07:00
Owen Anderson
564de515f5 Add an iterator to Block. (#37542)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37542

Differential Revision: D21314421

Pulled By: resistor

fbshipit-source-id: e54d7a8a5c9c1186be59f69b5b8af030fc054b32
2020-05-01 15:12:49 -07:00