Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887
BufHandle has exactly the same functionality and should be used instead.
Differential Revision: D30889483
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64077
We were assuming kernel dimensions fit in 32 bits (the old fuser made
this assumption too), but we should be able to support 64.
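For illustration only (this is not code from the patch, and the sizes are made up): once the flattened element count exceeds 2^31 - 1, a 32-bit size silently truncates, which is why 64-bit sizes are needed.
```
#include <cstdint>
#include <iostream>

int main() {
  // Hypothetical kernel sizes whose product exceeds 2^31 - 1.
  int64_t rows = 1 << 20;  // 1,048,576
  int64_t cols = 1 << 12;  // 4,096
  int64_t numel = rows * cols;                      // 4,294,967,296: needs 64 bits
  int32_t truncated = static_cast<int32_t>(numel);  // becomes 0 -> wrong indexing
  std::cout << numel << " vs " << truncated << "\n";
}
```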
ghstack-source-id: 136933272
Test Plan: unit tests; new IR level test with huge sizes
Reviewed By: ZolotukhinM
Differential Revision: D30596689
fbshipit-source-id: 23b7e393a2ebaecb0c391a6b1f0c4b05a98bcc94
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587
Now that there are no classes using KernelArena for memory management, we
can remove it.
Differential Revision: D30429115
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63586
This is another commit in the transition away from KernelArena memory management.
Tensor is essentially just a pair of <BufPtr, StmtPtr> and we don't need
to dynamically allocate it at all - it's cheap to pass it by value, and
that's what we're switching to in this commit.
After this change nothing uses KernelScope/KernelArena and they can be
safely removed.
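A minimal sketch of the idea (the aliases and class shape are assumed here, not copied from the source): a Tensor that is just a pair of shared pointers is trivially cheap to copy and needs no arena allocation.
```
#include <memory>
#include <utility>

// Assumed aliases matching the shared-pointer memory model this stack moves to.
class Buf;
class Stmt;
using BufPtr = std::shared_ptr<Buf>;
using StmtPtr = std::shared_ptr<Stmt>;

// Value-type Tensor: essentially just a pair of <BufPtr, StmtPtr>.
class Tensor {
 public:
  Tensor(BufPtr buf, StmtPtr stmt)
      : buf_(std::move(buf)), stmt_(std::move(stmt)) {}
  BufPtr buf() const { return buf_; }
  StmtPtr stmt() const { return stmt_; }

 private:
  BufPtr buf_;
  StmtPtr stmt_;
};

// Passing by value copies two shared_ptrs - no KernelArena allocation needed.
void consume(Tensor t);
```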
Differential Revision: D30429114
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: f90b859cfe863692b7beffbe9bd0e4143df1e819
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63778
This is preparation for a switch from raw pointers to shared pointers
as the memory model for TE expressions and statements.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30487425
Pulled By: ZolotukhinM
fbshipit-source-id: 9cbe817b7d4e5fc2f150b29bb9b3bf578868f20c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63197
This solves non-determinism from using hash values in sort methods.
Changes in tests are mostly mechanical.
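An illustration of the underlying problem (this is not the NNC code, and not necessarily the exact fix taken here): a hash derived from pointer identity changes between runs, so any order built from it is non-deterministic, whereas ordering by a stable key is not.
```
#include <algorithm>
#include <memory>
#include <string>
#include <vector>

struct Var {
  std::string name;
};
using VarPtr = std::shared_ptr<Var>;

// Non-deterministic: pointer values (and hashes built from them) differ
// from run to run, so the resulting order does too.
void sortByPointer(std::vector<VarPtr>& vars) {
  std::sort(vars.begin(), vars.end(),
            [](const VarPtr& a, const VarPtr& b) { return a.get() < b.get(); });
}

// Deterministic: order by a stable key, such as the variable's name.
void sortByName(std::vector<VarPtr>& vars) {
  std::sort(vars.begin(), vars.end(),
            [](const VarPtr& a, const VarPtr& b) { return a->name < b->name; });
}
```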
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30292776
Pulled By: ZolotukhinM
fbshipit-source-id: 74f57b53c3afc9d4be45715fd74781271373e055
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63195
This helps us to later switch from using KernelArena with raw pointers
to shared pointers without having to change all our source files at
once.
The changes are mechanical and should not affect any functionality.
With this PR, we're changing the following:
* `Add*` --> `AddPtr`
* `new Add(...)` --> `alloc<Add>(...)`
* `dynamic_cast<Add*>` --> `to<Add>`
* `static_cast<Add*>` --> `static_to<Add>`
Due to some complications with args forwarding, some places became more
verbose, e.g.:
* `new Block({})` --> `new Block(std::vector<ExprPtr>())`
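For illustration, a hedged before/after sketch of what the mapping above looks like at a use site (this assumes the NNC tensorexpr headers are available; exact helper signatures may differ slightly):
```
// Before: raw pointers owned by KernelArena.
//   Add* sum = new Add(lhs, rhs);
//   if (Add* a = dynamic_cast<Add*>(expr)) { ... }
//   Block* b = new Block({});

// After: pointer aliases plus the alloc/to helpers, per the mapping above.
AddPtr sum = alloc<Add>(lhs, rhs);
if (AddPtr a = to<Add>(expr)) {
  // use `a` as an Add
}
BlockPtr b = alloc<Block>(std::vector<ExprPtr>());  // explicit vector due to arg forwarding
```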
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30292779
Pulled By: ZolotukhinM
fbshipit-source-id: 150301c7d2df56b608b035827b6a9a87f5e2d9e9
Summary:
The GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` check, as is `DEFINE_DISPATCH`.
All changes except those to `.clang-tidy` were generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`; do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61725
Alloc/free inside a loop isn't really an optimization, and furthermore
it breaks an attempted optimization in the LLVM backend: we use alloca for
small allocations, which is efficient since alloca is on the stack, but there's
no corresponding free, so we leak a lot of stack. I hit this while building an
rfactor buffer inside a very deeply nested loop.
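A plain C++ analogy of the stack growth described above (this is not the NNC code; it only illustrates why `alloca` inside a loop keeps consuming stack until the function returns):
```
#include <alloca.h>  // non-standard header, available on the relevant platforms
#include <cstddef>

void allocaInLoop(int iters, size_t k) {
  for (int i = 0; i < iters; i++) {
    // Each iteration carves out fresh stack space; nothing releases it until
    // the function returns, so deep or long loops keep eating stack.
    float* tmp = static_cast<float*>(alloca(k * sizeof(float)));
    tmp[0] = static_cast<float>(i);
  }
}

void allocaHoisted(int iters, size_t k) {
  // One allocation reused across iterations - the effect of keeping
  // alloc/free out of the loop body.
  float* tmp = static_cast<float*>(alloca(k * sizeof(float)));
  for (int i = 0; i < iters; i++) {
    tmp[0] = static_cast<float>(i);
  }
}
```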
ghstack-source-id: 133627310
Test Plan:
Unit test which simulates use of a temp buffer in a deeply nested
loop.
Reviewed By: navahgar
Differential Revision: D29533364
fbshipit-source-id: c321f4cb05304cfb9146afe32edc4567b623412e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59508
An assert that was triggering in a previous version is now relaxed to
take 0-dim tensors into account.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D28918342
Pulled By: ZolotukhinM
fbshipit-source-id: c09b62c9725d1603b0ec11fcc051e7c932af06ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59279
There were some issues with how we handle 0-dim cases in lowerings and
also in how we generate reductions in that special case. This PR fixes
those issues and reenables a bunch of tests.
Differential Revision: D28819780
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: f3feff35a1ce11821ada2f8d04ae9d4be10dc736
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os


def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55324
With this change `rfactor` only affects the passed loop and its body,
never touching anything outside (that was the root cause of a bug in the
previous implementation). Also, we no longer have an `insertion_point`
parameter - its meaning was vague, and its effect should have been
achievable with other transformations anyway.
The new `rfactor` semantics are as follows:
```
Requirements:
* S is the reduction store
* S is the only statement in the innermost loop
* There are at least two reduction arguments in S
* OUTER_REDUCTION_FOR loop corresponds to the outermost reduction variable
used in the store and all other reduction variables are index variables of
children loops of OUTER_REDUCTION_FOR
* OUTER_REDUCTION_FOR is a perfect loop nest, i.e. it has only loops
corresponding to the other reduction variables and the store, nested into
each other
What it does:
* Introduce a new buffer with an extra dimension of a size equal to the
span of the loop OUTER_REDUCTION_FOR (the new buffer is returned via
RFAC_BUF_PTR)
* Insert an initialization store for the new buffer in
OUTER_REDUCTION_FOR before its nested loop
* Replace the reduction store to the original buffer with the reduction
store to the temp buffer, removing the index var of OUTER_REDUCTION_FOR
from reduction arguments
* Insert a final reduction store over the extra dimension of the new
buffer to the original buffer
* Returns TRUE if the transformation succeeded and FALSE otherwise
Example:
Original IR:
S1: for i # normal axis
S2: X[i] = 0
S3: for j # reduction axis
S4: for k # reduction axis
S5: X[i] = ReduceOp(X[i] + Y[i,j,k], reduce_axis={j,k})
After RFACTOR(S5, S3)
S1: for i # normal axis
S2: X[i] = 0
S3: for j # reduction axis for X, normal axis for X_rfac
X_rfac[i,j] = 0
S4: for k # reduction axis
X_rfac[i,j] = ReduceOp(X_rfac[i,j] + Y[i,j,k], reduce_axis={k})
X[i] = ReduceOp(X[i] + X_rfac[i,j], reduce_axis={j})
```
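A hedged usage sketch matching the description above (the exact signature is assumed from the text: the reduction store, the outer reduction loop, a boolean result, and the new buffer returned through an out-parameter):
```
// Sketch only - names and signature are inferred from the description,
// not copied from the source.
BufPtr rfac_buf = nullptr;
bool ok = loopnest.rfactor(
    /*reduction_store=*/S5,        // the ReduceOp store (S5 in the example)
    /*outer_reduction_for=*/S3,    // the outermost reduction loop (S3)
    /*rfac_buf_ptr=*/&rfac_buf);   // receives the new X_rfac buffer
if (!ok) {
  // One of the requirements above was not met; the IR is left untouched.
}
```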
Differential Revision: D27694960
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 076fa6a1df2c23f5948302aa6b43e82cb222901c
Summary:
Switched to short forms of `splitWithTail` / `splitWithMask` for all tests in `test/cpp/tensorexpr/test_*.cpp` (except test_loopnest.cpp)
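For reference, a hedged sketch of the two forms (signatures assumed; the long form appears verbatim in a later commit message in this log):
```
For *outer, *inner, *tail;
// Long form: the newly created loops come back via out-parameters.
loopnest.splitWithTail(f, 8, &outer, &inner, &tail);
// Short form adopted by these tests: same split, no out-parameters.
loopnest.splitWithTail(f, 8);
```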
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55542
Reviewed By: mrshenli
Differential Revision: D27632033
Pulled By: jbschlosser
fbshipit-source-id: dc2ba134f99bff8951ae61e564cd1daea92c41df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54997
DepTracker was used to automatically pull in dependent computations from
output ones. While that seems quite convenient, it led to several
architectural issues, which are fixed in this stack.
DepTracker worked on Tensors, where a Tensor is a pair of Buf and Stmt. However,
the Stmt could become stale and there was no way to reliably update the
corresponding tensor. We're now using Bufs and Stmts directly and moving
away from using Tensors to avoid these problems.
Removing DepTracker allowed us to unify Loads and FunctionCalls, which
were essentially duplicates of each other.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D27446414
Pulled By: ZolotukhinM
fbshipit-source-id: a2a32749d5b28beed92a601da33d126c0a2cf399
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53751
Sometimes the initial value of a reduction expression needs to be
computed with reference to the loop axes; for example, adding bias can be
efficiently represented by initializing the accumulator from the bias tensor:
```
C[n, c, h, w] = bias[c]
for (...)
C[n, c, h, w] += ...
```
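A hedged sketch of what this enables (the overload shape is assumed from the description and the `Reductions.InitFunction` test name, not copied from the patch): the initializer becomes a function of the output axes, so the accumulator can start from `bias[c]` rather than a constant.
```
// Sketch only - argument order and lambda signatures are assumed.
Tensor* C = Reduce(
    "C",
    {{N, "n"}, {Co, "c"}, {H, "h"}, {W, "w"}},  // output axes
    Sum(),
    /*init=*/[&](const std::vector<VarHandle>& v) { return bias.load(v[1]); },
    /*body=*/[&](const std::vector<VarHandle>& v,
                 const std::vector<VarHandle>& r) {
      return input.load(v[0], v[1], r[0]);
    },
    {{K, "k"}});                                 // reduction axes
```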
ghstack-source-id: 123592861
Test Plan: `buck test //caffe2/test/cpp/tensorexpr -- Reductions.InitFunction`
Reviewed By: navahgar
Differential Revision: D26940321
fbshipit-source-id: 8a08e19e5d0b9ad453a07fab8b61e75dcd3d626b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52196
A reduction does not need to know the buffer into which its
result will be written. This change gets us closer to being able to
create reductions inside Compute, where we have access to the tensor
axes.
ghstack-source-id: 121813071
Test Plan: test_tensorexpr
Reviewed By: ZolotukhinM
Differential Revision: D26420107
Pulled By: bertmaher
fbshipit-source-id: c8d8a99649adfd6de56fe53a728f5aa034a84f13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52187
ReduceOp doesn't need to track the indices that its result will be written into.
ghstack-source-id: 121813075
Test Plan:
test_tensorexpr, tensorexpr_bench
Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D26420575
fbshipit-source-id: 7afcfa611515334e36de8039722011687f3b61e4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50995
This change makes 'Tensor' a thin wrapper over 'Buf' and 'Stmt', and
merges it with the recently introduced 'CompoundTensor'. A statement for the
tensor is either passed directly to the Tensor constructor (akin to
'CompoundTensor'), or is built immediately in the constructor.
LoopNest is no longer responsible for constructing statements from
tensors - it simply stitches together the already constructed statements contained in
Tensors. A side effect is that we can no longer construct several
loopnests from the same tensors - we need to explicitly clone statements
if we want to do that. A special copy constructor was added to LoopNest
to make this more convenient (note: this only affects tests; we don't
usually create multiple loopnests in other places).
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D26038223
Pulled By: ZolotukhinM
fbshipit-source-id: 27a2e5900437cfb0c151e8f89815edec53608e17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49697
Mostly a mechanical move. This refactoring helps hide unnecessary
details from the SimpleIREval interface and makes it more similar to a
pure 'codegen'.
Test Plan: Imported from OSS
Reviewed By: nickgg
Differential Revision: D25668696
Pulled By: ZolotukhinM
fbshipit-source-id: 423247bfcdfa88403e8ec92152f00110bb9da19c
Summary:
Makes two changes in NNC for intermediate buffer allocations:
1. Flattens dimensions of buffers allocated in LoopNest::prepareForCodegen() to match their flattened usages.
2. Adds support for tracking memory dependencies of Alloc/Free to the MemDependencyChecker, which will allow us to check safety of accesses to intermediate buffers (coming in a future diff).
I didn't add any new tests as the mem dependency checker tests already cover it pretty well, particularly the GEMM test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49554
Reviewed By: VitalyFedyunin
Differential Revision: D25643133
Pulled By: nickgg
fbshipit-source-id: 66be3054eb36f0a4279d0c36562e63aa2dae371c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48160
We no longer use the custom C++ test infra anyway, so move to pure
gtest.
Fixes #45703
ghstack-source-id: 116977283
Test Plan: `buck test //caffe2/test/cpp/tensorexpr`
Reviewed By: navahgar, nickgg
Differential Revision: D25046618
fbshipit-source-id: da34183d87465f410379048148c28e1623618553
Summary:
Add support for ReduceOp in the Vectorizer, which allows vectorization of reductions. Only non-reduce axes can be vectorized currently; to make vectorizing reduce axes work we'd need either to automatically pull out the RHS of reductions (better as a separate transform, I think) or special handling of vector reduce in the LLVM codegen (tricky, maybe not useful?).
There was a disabled LLVM test for this case which I reenabled with a bit of massaging, and added a few more.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47924
Reviewed By: bertmaher
Differential Revision: D24963464
Pulled By: nickgg
fbshipit-source-id: 91d91e9e2696555ab5690b154984b1ce48359d51
Summary:
This diff enables inlining producers into reductions. It also guards against inlining reductions themselves.
Prior to this diff, if there was a reduction in the loopnest, no inlining was happening. After this change, we will inline all non-output buffers that do not correspond to a reduction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47020
Reviewed By: albanD
Differential Revision: D24644346
Pulled By: navahgar
fbshipit-source-id: ad234a6877b65be2457b734cbb7f3a1800baa6a5
Summary:
Adds a new transform to the NNC compiler that supports buffer access caching. All accesses within a provided scope are redirected to a cache, which is initialized or written back as necessary at the boundaries of that scope. For TVM fans, this is essentially a combination of cache_reads and cache_writes. E.g. it can do this kind of thing:
Before:
```
for (int i = 0; i < 64; i++) {
for (int j = 0; j < 64; j++) {
A[i, j] = i * j;
}
}
for (int i_1 = 0; i_1 < 20; i_1++) {
for (int j_1 = 0; j_1 < 10; j_1++) {
B[i_1, j_1] = (A(i_1 + 30, j_1 + 40)) + (A(i_1 + 31, j_1 + 41));
}
```
After `cacheAccesses(A->buf(), "A_local", j_loop);`
```
for (int i = 0; i < 64; i++) {
for (int j = 0; j < 64; j++) {
A[i, j] = i * j;
}
}
for (int i_1 = 0; i_1 < 20; i_1++) {
for (int i_2 = 0; i_2 < 2; i_2++) {
for (int j_1 = 0; j_1 < 11; j_1++) {
A_local[i_2, j_1] = A[(i_2 + i_1) + 30, j_1 + 40];
}
}
for (int j_2 = 0; j_2 < 10; j_2++) {
B[i_1, j_2] = (A_local[1, j_2 + 1]) + (A_local[0, j_2]);
}
}
```
Or this reduction:
```
for (int l1 = 0; l1 < 4; l1++) {
sum[l1] = 0.f;
for (int n1_1 = 0; n1_1 < 3; n1_1++) {
for (int m1_1 = 0; m1_1 < 2; m1_1++) {
sum[l1] = (sum[l1]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
}
}
}
```
After `l.cacheAccesses(d->buf(), "d_local", n_loop);`:
```
for (int l1 = 0; l1 < 4; l1++) {
Allocate(d_local, float, {1});
sum[l1] = 0.f;
d_local[0] = 0.f;
for (int n1_1 = 0; n1_1 < 3; n1_1++) {
for (int m1_1 = 0; m1_1 < 2; m1_1++) {
d_local[0] = (d_local[0]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
}
}
sum[l1] = (sum[l1]) + (d_local[0]);
Free(d_local);
}
```
I had originally planned to write `cacheReads` and `cacheWrites` wrappers so we could use them just like their TVM cousins, but they just ended up being big masses of checking that reads or writes weren't present. Didn't feel too useful so I removed them, but let me know.
This is based on bounds inference and inherits a few bugs present in that functionality, which I will address in a followup.
While working on this I realized that it overlaps heavily with `computeAt`, which is really just `cacheReads` + `computeInline`. I'm considering refactoring computeAt to be a wrapper around those two transforms. ZolotukhinM, opinions on this?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45869
Reviewed By: mruberry
Differential Revision: D24195276
Pulled By: nickgg
fbshipit-source-id: 36a58ae265f346903187ebc4923637b628048155
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45520
With this change `Load`s and `Store`s no longer accept `Placeholder`s in
their constructors and `::make` functions and can only be built with a
`Buf`.
`Placeholder` gets its own `store`, `load`, `storeWithMask`, and
`loadWithMask` methods for more convenient construction.
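A hedged sketch of the new convenience path (the method names come from the sentence above; exact signatures are assumed):
```
// Sketch only: Load/Store are now constructed from a Buf, while Placeholder
// offers convenience helpers so call sites stay short.
Placeholder A("A", kFloat, {64});
VarHandle i("i", kInt);

auto st = A.store({i}, ExprHandle(1.0f));  // instead of Store::make(A, ...)
auto ld = A.load(i);                       // instead of Load::make(A, ...)
```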
Test Plan: Imported from OSS
Reviewed By: glaringlee
Differential Revision: D23998789
Pulled By: ZolotukhinM
fbshipit-source-id: 3fe018e00c1529a563553b2b215f403b34aea912
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45390
Tensor objects should always refer to their Function's bufs. Currently
we never create a Tensor with a buffer different from that of its function,
but having it in two places seems incorrect and dangerous.
Differential Revision: D23952865
Test Plan: Imported from OSS
Reviewed By: nickgg
Pulled By: ZolotukhinM
fbshipit-source-id: e63fc26d7078427514649d9ce973b74ea635a94a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45388
Classes defined in these files are closely related, so it is reasonable
to have them all in one file. The change is purely a code move.
Differential Revision: D23952867
Test Plan: Imported from OSS
Reviewed By: nickgg
Pulled By: ZolotukhinM
fbshipit-source-id: 12cfaa968bdfc4dff00509e34310a497c7b59155
Summary:
This is a reup of https://github.com/pytorch/pytorch/issues/43885 with an extra commit that should fix the bugs that caused it to be reverted. Read that for general context.
The issue here was that we were still using the side maps `tensor_to_stmt_` and `stmt_to_tensor_` which get invalidated by any transform of the IR (rather than just any transform that isn't computeInline). I added a comment about this but didn't actually address our usages of it.
I've removed these maps and changed the `getLoopBodyFor` and `getLoopStatementsFor` helpers to search the root stmt directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44231
Reviewed By: albanD
Differential Revision: D23689688
Pulled By: nickgg
fbshipit-source-id: 1c6009a880f8c0cebf2300fd06b5cc9322bffbf9
Summary:
A rework of `computeInline` which makes it work a bit better, particularly when combined with other transformations. Previously we stored Functions that were inlined and then deferred the actual inlining of the function body until prepareForCodegen was called. This has an issue when transformations are applied to the LoopNest: the function body can be different from what appears in the root_stmt and result in inlining that a) fails, b) reverses other transformations or c) a weird unpredictable combination of the two.
This PR changes that behaviour so that the inlining occurs in the root stmt immediately, which means it reflects any previous transformations and any future transformations have a true view of the internal IR. It also has the benefit that inspecting the root statement gives an accurate view of it without needing to call prepareForCodegen. I also removed the difference between `computeInline` and `computeInlineWithRand`, and we handle calls to `rand()` in all branches.
This is a rework of https://github.com/pytorch/pytorch/issues/38696, with the agreed changes from ZolotukhinM and zheng-xq: we should only inline if the dimensions are trivial (ie. they are vars not exprs).
This PR is mostly tests, and I fixed a bunch of bugs I found along the way. Partial list:
* When inlining an expression involving rand, we would create random vars equal to the dimensionality of the enclosing Tensor not the produced Tensor - meaning we'd use an incorrect value if the inlined tensor was smaller. E.g: `X[i] = rand(); A[i, j] = X[i]` would produce a tensor where `A[0, 0] != A[0, 1]`. This is fixed by inserting the Let binding of the random variable at the correct loop body.
* When inlining we'd replace all calls to `rand()` rather than just those present in the Tensor being inlined.
* `rand()` was treated symbolically by the simplifier and we would aggregate or cancel calls to `rand()`. Have fixed the hasher to hash all calls to `rand()` distinctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43885
Reviewed By: gmagogsfm
Differential Revision: D23503636
Pulled By: nickgg
fbshipit-source-id: cdbdc902b7a14d269911d978a74a1c11eab004fa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44139
Also, make sure that we're checking that condition (that shapes are
specified) when we're starting a new fusion group, not only when we merge
a node into an existing fusion group. Oh, and one more: add a test
checking that we're rejecting graphs with unspecified shapes.
Differential Revision: D23507510
Test Plan: Imported from OSS
Reviewed By: bertmaher
Pulled By: ZolotukhinM
fbshipit-source-id: 9c268825ac785671d7c90faf2aff2a3e5985ac5b
Summary:
Auto-fuse the output loops of outer Rfactors so they are in a more convenient format for binding GPU axes.
An example:
```
Tensor* c = Reduce("sum", {}, Sum(), b, {{m, "m"}, {n, "n"}, {k, "k"}});
LoopNest loop({c});
std::vector<For*> loops = loop.getLoopStmtsFor(c);
auto v = loops.at(0)->var();
loop.rfactor(c->body(), v);
```
Before:
```
{
Allocate(tmp_buf, float, {m});
sum[0] = 0.f;
for (int m_1 = 0; m_1 < m; m_1++) {
tmp_buf[m_1] = 0.f;
}
for (int m_1 = 0; m_1 < m; m_1++) {
for (int n = 0; n < n_1; n++) {
for (int k = 0; k < k_1; k++) {
tmp_buf[m_1] = (tmp_buf[m_1]) + (b[((n_1 * m_1) * k_1 + k) + k_1 * n]);
}
}
}
for (int m_1 = 0; m_1 < m; m_1++) {
sum[0] = (sum[0]) + (tmp_buf[m_1]);
}
Free(tmp_buf);
}
```
After:
```
{
sum[0] = 0.f;
for (int m = 0; m < m_1; m++) {
Allocate(tmp_buf, float, {m_1});
tmp_buf[m] = 0.f;
for (int n = 0; n < n_1; n++) {
for (int k = 0; k < k_1; k++) {
tmp_buf[m] = (tmp_buf[m]) + (b[((n_1 * m) * k_1 + k) + k_1 * n]);
}
}
sum[0] = (sum[0]) + (tmp_buf[m]);
Free(tmp_buf);
}
}
```
The existing Rfactor tests cover this case, although I did rename a few for clarity. This change broke the LLVMRFactorVectorizedReduction test because it now does what it's intended to do (vectorize a loop with a reduction in it) rather than nothing, and since that doesn't work, it correctly fails. I've disabled it for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40050
Reviewed By: ZolotukhinM
Differential Revision: D22605639
Pulled By: nickgg
fbshipit-source-id: e359be53ea62d9106901cfbbc42d55d0e300e8e0
Summary:
The two bugs were:
* Non-reduction axes were not added when inserting the new ReduceOp, meaning that if a reduction with non-reduce axes was rfactored we'd produce bad outputs. There were no tests of Rfactor with non-reduce axes, so I modified a test to cover this.
* The new statements were always prepended to the block, meaning writes to a buffer could be reordered after the usage of that buffer. This mostly happened in the case where we rfactor a previously rfactored reduction. There was a test of this, but since it only tested rfactoring the outer reduction axis, there were never any other statements at the insertion point (the tests of the insertion point argument also do this). I added a new test which covers various rfactor-axis cases.
Also cleaned up tests, removed some helper code we don't need etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39268
Differential Revision: D21864489
Pulled By: nickgg
fbshipit-source-id: d314d20997a8472ec96b72f7a9068d6da6d2399c
Summary:
In `LoopNest::rfactor` we assume that there is only a single reduction below the insertion point, and when replacing the reduction we recursively replace all reductions below that point. This is not a safe assumption, as a number of transformations can introduce additional ReduceOps - most directly a `splitWithTail` on the innermost reduce axis.
This PR fixes that bug, and adds some unit tests covering the case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38733
Differential Revision: D21723634
Pulled By: nickgg
fbshipit-source-id: 3ed6ffcdc2c15aef7504f9b2b91e8d827e0b5d88
Summary:
This PR removes the deferred initializer field from ReduceOp in favour of eagerly initializing buffers when they are created (either in the constructor of `LoopNest`, or in `rfactor()`). This allows a pretty good simplification of reduction logic, removing almost all of the reduction expander and the ReduceInitCleaner & unpopular NoOp node added in the last fix.
Eager initialization is better for us anyway because it allows more opportunities to transform the initialization loop.
Added a few more tests; testReduceOverSplitWithTail failed before this change due to a bug in splitWithTail that now can't happen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38585
Differential Revision: D21621551
Pulled By: nickgg
fbshipit-source-id: 378137e5723b4a6d6e390239efb12adce22a8215
Summary:
Fixes a bug in the following code:
```
Tensor* c = Reduce("sum", {{10, "m"}}, Sum(), b, {{10, "n"}, {10, "k"}});
// split N loop with tail:
loop.splitWithTail(loop.getLoopStmtsFor(c)[1], 8, &outer, &inner, &tail);
```
When this is expanded there are two ReduceOps:
```
for (int m = 0; m < 10; m++) {
for (int n_outer = 0; n_outer < (10 - 0) / 8; n_outer++) {
for (int n_inner = 0; n_inner < 8; n_inner++) {
for (int k = 0; k < 10; k++) {
sum[m] = ReduceOp(sum, float(0), (sum[m]) + (b[m, n_outer * 8 + n_inner, k]), out_args={m}, reduce_args={n_inner, n_outer, k});
}
}
}
for (int n_tail = 0; n_tail < (10 - 0) % 8; n_tail++) {
for (int k = 0; k < 10; k++) {
sum[m] = ReduceOp(sum, float(0), (sum[m]) + (b[m, n_tail + ((10 - 0) / 8) * 8, k]), out_args={m}, reduce_args={n_tail, k});
}
}
}
```
But each ReduceOp will expand its initializer, which in this case will overwrite the sum of the split loop:
```
for (int m = 0; m < 10; m++) {
sum[m] = 0.f;
for (int n_inner = 0; n_inner < 8; n_inner++) {
for (int k = 0; k < 10; k++) {
sum[m] = (sum[m]) + (b[(100 * m + k) + 10 * n_inner]);
}
}
sum[m] = 0.f; <------- *HERE*
for (int n_tail = 0; n_tail < 2; n_tail++) {
for (int k = 0; k < 10; k++) {
sum[m] = (sum[m]) + (b[((100 * m + k) + 10 * n_tail) + 80]);
}
}
}
```
The simplest fix is to remove the initializer from the tail loop, which requires adding support for Reductions without an initializer (I did this by adding a NoOp Expr rather than handling nullptr). Also moved the ReductionExpander from loopnest.cpp to reduction.h, as loopnest is getting a bit heavy.
Added tests for all kinds of splits on a simple 3D reduction to verify no more problems of this type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38420
Differential Revision: D21587583
Pulled By: nickgg
fbshipit-source-id: e0766934481917007119612eb60cc76c3242e44a
Summary:
Remove the requirement for the axes provided to reorderAxis to come from a Tensor. We were using that to determine the relevant loops, but we can alternatively determine it by traversing the parents of each provided For.
resistor, does this work for you?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37873
Differential Revision: D21428016
Pulled By: nickgg
fbshipit-source-id: b16b2f41cb443dfc2c6548b7980731d1e7d89a35