Summary:
A rework of `computeInline` which makes it work a bit better, particularly when combined with other transformations. Previously we stored the Functions that were inlined and deferred the actual inlining of the function body until prepareForCodegen was called. This is a problem when transformations are applied to the LoopNest: the function body can differ from what appears in the root_stmt, resulting in inlining that a) fails, b) reverses other transformations, or c) does some weird, unpredictable combination of the two.
This PR changes that behaviour so that the inlining occurs in the root stmt immediately, which means it reflects any previous transformations and any future transformations get a true view of the internal IR. It also has the benefit that inspecting the root statement gives an accurate view of it without needing to call prepareForCodegen. I also removed the difference between `computeInline` and `computeInlineWithRand`; we now handle calls to `rand()` in all branches.
This is a rework of https://github.com/pytorch/pytorch/issues/38696, with the agreed changes from ZolotukhinM and zheng-xq: we should only inline if the dimensions are trivial (i.e. they are vars, not exprs).
This PR is mostly tests, and I fixed a bunch of bugs I found along the way. Partial list:
* When inlining an expression involving rand, we would create random vars matching the dimensionality of the enclosing Tensor rather than the produced Tensor - meaning we'd use an incorrect value if the inlined tensor was smaller. E.g. `X[i] = rand(); A[i, j] = X[i]` would produce a tensor where `A[0, 0] != A[0, 1]`. This is fixed by inserting the Let binding of the random variable at the correct loop body (see the sketch after this list).
* When inlining we'd replace all calls to `rand()` rather than just those present in the Tensor being inlined.
* `rand()` was treated symbolically by the simplifier, so we would aggregate or cancel calls to `rand()`. I've fixed the hasher to hash each call to `rand()` distinctly.
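To illustrate the first fix, here is a minimal sketch in plain C++ standing in for the generated IR (`rand01()` is a hypothetical helper, not an NNC function): the random value must be Let-bound once per index of the produced tensor `X`, i.e. inside the `i` loop, not once per `(i, j)` element of the enclosing tensor `A`.
```cpp
#include <cstdlib>

// Hypothetical helper standing in for the IR's rand() intrinsic.
static float rand01() {
  return static_cast<float>(std::rand()) / RAND_MAX;
}

int main() {
  float A[4][4];
  for (int i = 0; i < 4; i++) {
    // Correct placement: the Let binding of the random value lives in the
    // body of the i loop, matching the produced tensor X's dimensionality.
    float x_i = rand01();
    for (int j = 0; j < 4; j++) {
      A[i][j] = x_i;  // A[i][0] == A[i][1] == ..., as inlining X[i] requires
      // The bug was equivalent to calling rand01() here instead, which makes
      // A[0][0] != A[0][1] even though both inline the same X[0].
    }
  }
  return 0;
}
```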
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43885
Reviewed By: gmagogsfm
Differential Revision: D23503636
Pulled By: nickgg
fbshipit-source-id: cdbdc902b7a14d269911d978a74a1c11eab004fa
Summary:
This diff normalizes for-loops that have non-zero loop starts so that they always start from 0. Given a for-loop, this normalization changes the loop start to be 0 and adjusts the loop end and all accesses to the index variable within the loop body appropriately.
This diff also adds tests for several cases of normalization, and also tests normalization in conjunction with the `splitWithTail` transformation.
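As a rough illustration of the rewrite on an ordinary C++ loop (a sketch only, not the actual transformation code): the start becomes 0, the stop becomes `stop - start`, and every use of the index inside the body is shifted by the old start.
```cpp
#include <cassert>

int sumOriginal(const int* a) {
  int s = 0;
  for (int i = 5; i < 10; i++) {  // loop with a non-zero start
    s += a[i];
  }
  return s;
}

int sumNormalized(const int* a) {
  int s = 0;
  for (int i = 0; i < 10 - 5; i++) {  // start adjusted to 0
    s += a[i + 5];                    // index accesses shifted by the old start
  }
  return s;
}

int main() {
  int a[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
  assert(sumOriginal(a) == sumNormalized(a));
  return 0;
}
```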
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43179
Reviewed By: nickgg
Differential Revision: D23220534
Pulled By: navahgar
fbshipit-source-id: 64be0c72e4dbc76906084f7089dea81ae07d6020
Summary:
A while back, when commonizing the Let and LetStmt nodes, I ended up removing both and adding a separate VarBinding section to the Block. At the time I couldn't find a counterexample, but I found one today: dependencies between local Vars and Allocations may go in either direction, so we need to support interleaving of those statements.
So, I've removed all the VarBinding logic and reimplemented Let statements. ZolotukhinM I think you get to say "I told you so". No new tests; existing tests should cover this.
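A minimal sketch of the counterexample in plain C++ (hypothetical names, standing in for the IR): a Let can feed an Allocation's size and a later Let can read from that Allocation, so neither "all var bindings first" nor "all allocations first" works.
```cpp
#include <vector>

// Hypothetical helper: the buffer size is only known at runtime.
static int computeSize() { return 16; }

int main() {
  int n = computeSize();       // Let: a local var binding
  std::vector<float> tmp(n);   // Allocate: its size depends on the Let above
  tmp[0] = 1.0f;
  float v = tmp[0];            // Let: depends on the Allocation above
  return static_cast<int>(v);  // the two kinds of statement must interleave
}
```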
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42634
Reviewed By: mruberry
Differential Revision: D22969771
Pulled By: nickgg
fbshipit-source-id: a46c5193357902d0f59bf30ab103fe123b1503f1
Summary:
Unroll a loop with constant bounds, replacing it with multiple instances of the loop body. For example:
```
for x in 0..3:
  A[x] = x*2
```
becomes:
```
A[0] = 0
A[1] = 2
A[2] = 4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42465
Test Plan: `test_tensorexpr` unit tests.
Reviewed By: agolynski
Differential Revision: D22914418
Pulled By: asuhan
fbshipit-source-id: 72ca10d7c0b1ac7f9a3688ac872bd94a1c53dc51
Summary:
A heavy refactor of bounds inference to fix some issues and bugs blocking using it to analyze cross thread interactions:
* We were merging all accesses to a Buf into a single bounds info entry, even if they did not overlap. E.g. if we accessed a[0:2] and a[5:6] we would merge that into a bound of a[0:6]. I've changed this behaviour to merge only overlapping bounds.
* We were not separating bounds of different kinds (e.g. Load vs Store) and would merge Store bounds into Load bounds, losing the information about what kind of access it was. E.g. this loop would produce the bounds [{Load, 0, 10}] and now produces [{Load, 0, 9}, {Store, 1, 10}]:
```
for i in 1 to 10...
  x[i] = x[i-1]
```
* Both ComputeAt and Rfactor relied on the overzealous merging and only used a single entry in the bounds list to determine the bounds of temporary buffers they created, which could result in temporary buffers being allocated smaller than the accesses made to them. I've fixed Rfactor, but *not* ComputeAt - however, all ComputeAt tests still pass (it may require loop fusion to trigger this issue) - I will come back to it.
Being more precise about bounds is more complex: rather than taking the minimum of starts and the maximum of stops, we now need to determine whether two bounds overlap or are adjacent. There are many edge cases, so I've added a bunch of test coverage of the merging method.
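As a sketch of the merging rule only (illustrative, not the actual bounds-inference code): two bounds are combined only if they overlap or are adjacent, otherwise both are kept.
```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Bound {
  int start;
  int stop;  // inclusive
};

// Merge a new bound into a list, combining only bounds that overlap or are
// adjacent (e.g. [0,2] and [3,6] touch, but [0,2] and [5,6] stay separate).
std::vector<Bound> mergeBound(std::vector<Bound> bounds, Bound b) {
  std::vector<Bound> out;
  for (const Bound& cur : bounds) {
    bool overlapsOrAdjacent =
        cur.start <= b.stop + 1 && b.start <= cur.stop + 1;
    if (overlapsOrAdjacent) {
      b.start = std::min(b.start, cur.start);
      b.stop = std::max(b.stop, cur.stop);
    } else {
      out.push_back(cur);
    }
  }
  out.push_back(b);
  return out;
}

int main() {
  auto r = mergeBound({{0, 2}}, {5, 6});
  assert(r.size() == 2);  // disjoint accesses stay separate
  r = mergeBound({{0, 2}}, {2, 6});
  assert(r.size() == 1 && r[0].start == 0 && r[0].stop == 6);  // overlap merges
  return 0;
}
```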
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42185
Reviewed By: mruberry
Differential Revision: D22870391
Pulled By: nickgg
fbshipit-source-id: 3ee34fcbf0740a47259defeb44cba783b54d0baa
Summary:
Fixes a bug in SplitWithTail and SplitWithMask where loop_options such as CUDA block/thread bindings were overwritten by the split. The fix propagates the loop options to the outer loop, which for axis bindings should be equivalent.
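A minimal sketch of the intended outcome in plain C++ (ordinary loops standing in for the IR; the axis binding is shown as a comment): after splitting by 8, the CUDA axis binding stays with the outer loop instead of being dropped.
```cpp
#include <cassert>

void before(float* A) {
  for (int i = 0; i < 64; i++) {  // loop option: bound to blockIdx.x
    A[i] = static_cast<float>(i);
  }
}

void afterSplit(float* A) {
  for (int i_outer = 0; i_outer < 8; i_outer++) {  // binding propagated here
    for (int i_inner = 0; i_inner < 8; i_inner++) {
      int i = i_outer * 8 + i_inner;
      A[i] = static_cast<float>(i);
    }
  }
}

int main() {
  float a[64], b[64];
  before(a);
  afterSplit(b);
  for (int i = 0; i < 64; i++) assert(a[i] == b[i]);  // same result either way
  return 0;
}
```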
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40035
Reviewed By: ZolotukhinM
Differential Revision: D22080263
Pulled By: nickgg
fbshipit-source-id: b8a9583fd90f69319fc4bb4db644e91f6ffa8e67
Summary:
Fixes a bug in reorderAxis where we appended the new reordered loops to the end of the enclosing block, even if there were statements after the original loops. E.g. with 3 Computes:
```
for (int m1 ...
  for (int n1 ...
    for (int k1 ...
      Body 1
for (int m2 ...
  for (int n2 ...
    for (int k2 ...
      Body 2
for (int m3 ...
  for (int n3 ...
    for (int k3 ...
      Body 3
```
If we reordered loops m2 and k2, the reordered nest containing Body 2 would also be moved after Body 3, like this:
```
for (int m1 ...
  for (int n1 ...
    for (int k1 ...
      Body 1
for (int m3 ...
  for (int n3 ...
    for (int k3 ...
      Body 3
for (int k2 ...
  for (int n2 ...
    for (int m2 ...
      Body 2
```
This happened because we always appended the new loops to their parent. This PR fixes the logic to replace the old loop root with the new loop in place, which keeps the surrounding statements in their original order.
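A minimal sketch of the difference (hypothetical stand-ins for Stmt and Block, not the NNC classes): replacing the old loop in place keeps the sibling order, while appending moves the reordered nest to the end of the block.
```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

using Stmt = std::string;         // stand-in for an IR statement
using Block = std::vector<Stmt>;  // stand-in for the enclosing Block

int main() {
  Block parent = {"nest1", "nest2", "nest3"};
  Stmt reordered = "nest2_reordered";

  // Buggy behaviour: remove the old loop and append the new one.
  Block appended = parent;
  appended.erase(std::find(appended.begin(), appended.end(), "nest2"));
  appended.push_back(reordered);  // the nest now comes after nest3

  // Fixed behaviour: replace the old loop root in place.
  Block replaced = parent;
  *std::find(replaced.begin(), replaced.end(), "nest2") = reordered;
  assert(replaced[1] == reordered);  // original statement order is preserved
  return 0;
}
```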
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38841
Differential Revision: D21723670
Pulled By: nickgg
fbshipit-source-id: 1dee8bb153182fcaa2cabd948197577e8e80acd7
Summary:
Remove the requirement that the axes provided to reorderAxis come from a Tensor. We were using the Tensor to determine the relevant loops, but we can instead determine them by traversing the parents of each provided For.
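A rough sketch of the alternative (a hypothetical node type with a parent pointer, not the NNC classes): collect the ancestors of one For, then walk up from the other until a shared ancestor is found; that ancestor bounds the loop nest to reorder.
```cpp
#include <unordered_set>

// Hypothetical stand-in for a For node with a parent pointer.
struct Node {
  Node* parent = nullptr;
};

// Find the closest common ancestor of two loops by walking parent links.
Node* commonAncestor(Node* a, Node* b) {
  std::unordered_set<Node*> seen;
  for (Node* n = a; n != nullptr; n = n->parent) {
    seen.insert(n);
  }
  for (Node* n = b; n != nullptr; n = n->parent) {
    if (seen.count(n)) {
      return n;
    }
  }
  return nullptr;
}

int main() {
  Node root, outer, innerA, innerB;
  outer.parent = &root;
  innerA.parent = &outer;
  innerB.parent = &outer;
  return commonAncestor(&innerA, &innerB) == &outer ? 0 : 1;
}
```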
resistor does this work for you?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37873
Differential Revision: D21428016
Pulled By: nickgg
fbshipit-source-id: b16b2f41cb443dfc2c6548b7980731d1e7d89a35
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36611
Currently Buf represents the underlying storage, but it didn't have a dtype.
That resulted in dtypes being specified in different places, and there was no
mechanism to enforce their consistency: e.g. one could have created a kFloat
expression and used a kInt buffer to store its result. Now we're
centralizing where the logic regarding the storage is located and we can
start enforcing semantic rules.
Follow-ups: we can merge Buffer and BufHandle classes as the former is
now a mere wrapper over the latter.
Test Plan: Imported from OSS
Differential Revision: D21027356
Pulled By: ZolotukhinM
fbshipit-source-id: c06aa2c4077fdcde3bb4ca622d324aece79b5a9c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37050
With this change, curly braces are printed as part of the Block rather than
as part of the enclosing statement. This allows us, for instance, to more
easily see nested blocks: each is now printed in its own curly-braced scope.
As a side effect, I had to change how we print loop options. Previously
we did it like this:
```
for (...) { // <loop options>
  <loop body (Block)>
}
```
Now, since everything in between { and } is a part of the block, we have
to do it the following way:
```
for (...) /* <loop options> */ {
  <loop body (Block)>
}
```
Note the change from '//' to '/* .. */' for the loop option comments.
Test Plan: Imported from OSS
Differential Revision: D21171851
Pulled By: ZolotukhinM
fbshipit-source-id: 39f51a9e15aec03b6527b0634fd4b9e01a912cda
Summary:
Adds the capability to reorder axes in the LoopNest. This was fairly straightforward except for handling Reduction initializers, which required more changes. UPDATE: actually the complicated bit was preserving the ordering of statements in the loopnest which should not be reordered.
Usage looks something like this:
```
Tensor* tensor = Compute(
    "f", {{2, "x"}, {3, "y"}}, [](const VarHandle& x, const VarHandle& y) {
      return ExprHandle(1.0f) + cast<float>(x) * x + cast<float>(y) * y;
    });
LoopNest l({tensor});
/* LoopNest looks like:
   for x in ...
     for y in ...
       f[x,y] = 1 + x * x + y * y;
*/
auto loops = l.getLoopStmtsFor(tensor);
l.reorderAxis(tensor, loops[0], loops[1]);
/* LoopNest looks like:
   for y in ...
     for x in ...
       f[x,y] = 1 + x * x + y * y;
*/
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36540
Differential Revision: D21068143
Pulled By: nickgg
fbshipit-source-id: f02c29004376df4f5a9bedff366c075772726618
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35800
This PR includes the following changes:
* Introduce a new `Expr` type `Buf`: it plays a role similar to `Var`, but also has dimensions.
* Use the new `Buf` class in `Store` and `Load` instead of `Var` for specifying where to store to or load from. `Buf` contains the dimensions info of the buffer we're loading from or storing to, and hence we are able to keep N-d indexes without flattening them into a 1-d index ([x,y] vs [x+y*W]).
* Flattening of the indexes is now a separate pass that is executed in `LoopNest::prepareForCodegen` - backends still expect indexes to be flattened, and this PR preserves that (a rough sketch of the flattening follows this list).
* `Tensor` now contains a `Buf` instead of a `Var`, and thus Tensor now has the dimensions info (previously it was a property of a `Function`, not a `Tensor`). This brings us closer to Tensor being a combination of Buffer + Function, where the Buffer specifies the iteration domain and the Function defines the computation.
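As a rough illustration of what the flattening pass computes (not the actual pass; the stride order is chosen only to match the [x+y*W] example above):
```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Flatten an N-d index into a 1-d offset, with the first index varying
// fastest, so that [x, y] over a buffer of width W becomes x + y * W.
int64_t flattenIndex(const std::vector<int64_t>& indices,
                     const std::vector<int64_t>& dims) {
  int64_t flat = 0;
  int64_t stride = 1;
  for (size_t i = 0; i < indices.size(); ++i) {
    flat += indices[i] * stride;
    stride *= dims[i];
  }
  return flat;
}

int main() {
  const int64_t W = 5, H = 4, x = 2, y = 3;
  assert(flattenIndex({x, y}, {W, H}) == x + y * W);
  return 0;
}
```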
TODOs:
* Consider merging `Buffer` with `Buf` or `BufHandle`. It seems that we don't need all of them.
* Harden the logic of how we create buffers in the fuser pass. Currently it seems that sometimes we don't set dimensions.
* Use `Buf` in `Allocate` and `Free`.
* Make it clearer that `Function` doesn't "own" dimensions info and that dimensions are a property of a Tensor, not a Function.
Differential Revision: D20789005
Test Plan: Imported from OSS
Reviewed By: zheng-xq
Pulled By: ZolotukhinM
fbshipit-source-id: e04188d1d297f195f1c46669c614557d6bb6cde4