Summary:
Partial fix for https://github.com/pytorch/pytorch/issues/56157
This PR updates the `flatten` API in `LoopNest` to perform the flattening transformation in-place. After this transformation, the first loop in the input becomes the flattened loop.
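A minimal usage sketch (assuming the in-place form reports success via a bool and returns the flattened loop through an out-parameter; `outer` and `inner` are illustrative):
```
std::vector<For*> loops = {outer, inner};  // perfectly nested loops to flatten
For* flattened = nullptr;
if (LoopNest::flatten(loops, &flattened)) {
  // In-place: `flattened` is the first loop of the input, now iterating
  // over the combined iteration space of `outer` and `inner`.
}
```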
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56629
Reviewed By: H-Huang
Differential Revision: D28004787
Pulled By: navahgar
fbshipit-source-id: 7474ae237fae3fff0cd1c64a276a8831dc5b7db0
Summary:
This PR includes:
* Update to the loop-carried dependence check API to correctly ignore loop-independent dependences and handle all kinds of loop-carried dependences, such as RAW, WAR, and WAW (illustrated below).
* Fix for the overlap API to look only for conflicting buffer accesses where at least one of them is a Store.
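For reference, a sketch of the dependence kinds involved (illustrative C-style loops, not actual test cases):
```
// Loop-independent (must be ignored): both accesses happen in the same iteration.
for (int i = 0; i < N; i++) {
  A[i] = 0;
  B[i] = A[i];
}
// RAW (flow), loop-carried: iteration i reads what iteration i-1 wrote.
for (int i = 1; i < N; i++) {
  A[i] = A[i - 1] + 1;
}
// WAR (anti), loop-carried: iteration i reads A[i + 1], which iteration i+1 overwrites.
for (int i = 0; i < N - 1; i++) {
  B[i] = A[i + 1];
  A[i] = 0;
}
// WAW (output), loop-carried: iterations i and i+1 both write A[i + 1].
for (int i = 0; i < N - 1; i++) {
  A[i + 1] = i;
  A[i] = 0;
}
```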
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56354
Reviewed By: bertmaher
Differential Revision: D27856202
Pulled By: navahgar
fbshipit-source-id: 206e4ec771fe0f7f2ccf4b11b29e35df7b9b18bc
Summary:
Partial fix for https://github.com/pytorch/pytorch/issues/56357
Changes the `fuseLoops` API to the following form:
```
static bool fuseLoops(const std::vector<For*>& loops, For** fused);
```
Also, adds a new API to check for loop-carried dependences:
```
static bool hasLoopCarriedDependence(For* loop);
```
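A minimal usage sketch combining the two (loop pointers are illustrative):
```
std::vector<For*> loops = {loop1, loop2};
For* fused = nullptr;
if (LoopNest::fuseLoops(loops, &fused)) {
  // Fusion succeeded and `fused` points to the fused loop. The dependence
  // check can then gate further transformations, e.g. parallelization:
  bool carried = LoopNest::hasLoopCarriedDependence(fused);
}
```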
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56353
Reviewed By: bertmaher
Differential Revision: D27856214
Pulled By: navahgar
fbshipit-source-id: 443557088692585657faee296602c547a00117dd
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/56157
This PR changes the `normalize` API in `LoopNest` to transform the given `For` statement in place instead of creating a new one.
New API:
```
static bool normalize(For* f);
```
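For illustration, the effect of normalization on a loop with a non-zero start (a sketch, assuming the usual semantics of making the loop start at zero; `f` is the loop being normalized):
```
// Before:             for (int i = 10; i < 100; i++) { A[i] = i; }
// After normalize(f): for (int i = 0; i < 90; i++) { A[i + 10] = i + 10; }
```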
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56158
Reviewed By: agolynski
Differential Revision: D27798361
Pulled By: navahgar
fbshipit-source-id: 57626a5a367bdf94a0efbd9dc8538f5e4e410d6b
Summary:
This PR allows fusing loops whose bounds are specified as expressions that are equal.
For example:
```
for (int j = 0; j < M + N; j++) {
  A[j] = 10 * j;
}
for (int k = 0; k < M + N; k++) {
  B[k] = 20 * k;
}
```
`fuseLoops(j, k)` is now possible: even though the stop bounds of the two loops are distinct `Expr*` objects, they are equal expressions. The result is:
```
for (int j = 0; j < M + N; j++) {
  A[j] = 10 * j;
  B[j] = 20 * j;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55997
Reviewed By: bertmaher
Differential Revision: D27841270
Pulled By: navahgar
fbshipit-source-id: a64e4503b7f8f28bc0c9823225bc923177bb4c2e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55825
The mask has never been used (in vectorization we generate an explicit
`IfThenElse` construct when we need to mask out some elements). The PR
removes it and cleans up all its traces from tests.
Differential Revision: D27717776
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 41d1feeea4322da75b3999d661801c2a7f82b9db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55324
With this change `rfactor` only affects the passed loop and its body, never touching anything outside (that was the root cause of a bug in the previous implementation). Also, we no longer have an `insertion_point` parameter - its meaning was vague, and its effect should be achievable with other transformations anyway.
The new `rfactor` semantics are as follows:
```
Requirements:
* S is the reduction store
* S is the only statement in the innermost loop
* There are at least two reduction arguments in S
* OUTER_REDUCTION_FOR loop corresponds to the outermost reduction variable
used in the store and all other reduction variables are index variables of
children loops of OUTER_REDUCTION_FOR
* OUTER_REDUCTION_FOR is a perfect loop nest, i.e. it has only loops
corresponding to the other reduction variables and the store, nested into
each other
What it does:
* Introduce a new buffer with an extra dimension of a size equal to the
span of the loop OUTER_REDUCTION_FOR (the new buffer is returned via
RFAC_BUF_PTR)
* Insert an initialization store for the new buffer in
OUTER_REDUCTION_FOR before its nested loop
* Replace the reduction store to the original buffer with the reduction
store to the temp buffer, removing the index var of OUTER_REDUCTION_FOR
from reduction arguments
* Insert a final reduction store over the extra dimension of the new
buffer to the original buffer
* Returns TRUE if the transformation succeeded and FALSE otherwise
Example:
Original IR:
S1: for i                 # normal axis
S2:   X[i] = 0
S3:   for j               # reduction axis
S4:     for k             # reduction axis
S5:       X[i] = ReduceOp(X[i] + Y[i,j,k], reduce_axis={j,k})
After RFACTOR(S5, S3)
S1: for i                 # normal axis
S2:   X[i] = 0
S3:   for j               # reduction axis for X, normal axis for X_rfac
        X_rfac[i,j] = 0
S4:     for k             # reduction axis
          X_rfac[i,j] = ReduceOp(X_rfac[i,j] + Y[i,j,k], reduce_axis={k})
        X[i] = ReduceOp(X[i] + X_rfac[i,j], reduce_axis={j})
```
Differential Revision: D27694960
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 076fa6a1df2c23f5948302aa6b43e82cb222901c
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52690
This PR adds the following APIs:
```
static bool areLoopsPerfectlyNested(const std::vector<For*>& loops);
static std::vector<For*> reorder(
const std::vector<For*>& loops,
const std::vector<size_t>& permutation);
```
The first API checks whether the given list of loops is perfectly nested. The second API reorders the given loops according to the specified permutation.
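A minimal usage sketch (loop pointers are illustrative, and the permutation convention - `permutation[i]` selects the loop for position `i` - is an assumption):
```
std::vector<For*> loops = {i_loop, j_loop, k_loop};  // must be perfectly nested
if (LoopNest::areLoopsPerfectlyNested(loops)) {
  // Permute (i, j, k) -> (k, i, j); element 0 of the result is the new
  // outermost loop.
  std::vector<For*> reordered = LoopNest::reorder(loops, {2, 0, 1});
}
```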
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55568
Reviewed By: albanD
Differential Revision: D27689734
Pulled By: navahgar
fbshipit-source-id: dc1bffdbee068c3f401188035772b41847cbc7c6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54999
BaseCallNode was used as a base class for Intrinsics and FunctionCall.
Now that FunctionCall is gone, BaseCallNode can be removed as well.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D27446411
Pulled By: ZolotukhinM
fbshipit-source-id: be8ce06fbac72bfe355e5e3e1d2aa2267fae79fd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54998
The only reason we couldn't use Load instead of FunctionCall was
DepTracker. Now that it is gone, we can finally replace FunctionCall
with Load.
Test Plan: Imported from OSS
Reviewed By: bertmaher, pbelevich
Differential Revision: D27446412
Pulled By: ZolotukhinM
fbshipit-source-id: 9183ae5541c2618abc9026b1dc4c4c9fab085d47
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54997
DepTracker was used to automatically pull in dependent computations from
output ones. While it seemed quite convenient, it led to several
architectural issues, which are fixed in this stack.
DepTracker worked on Tensors, where each Tensor is a pair of a Buf and a
Stmt. However, the Stmt could become stale, and there was no way to
reliably update the corresponding Tensor. We are now using Bufs and Stmts
directly and moving away from Tensors to avoid these problems.
Removing DepTracker allowed us to unify Loads and FunctionCalls, which
were essentially duplicates of each other.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D27446414
Pulled By: ZolotukhinM
fbshipit-source-id: a2a32749d5b28beed92a601da33d126c0a2cf399
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54439
For now the only way to represent conv2d in TE is via an external call,
and since the aten library doesn't have an out variant for conv2d, the
external call has to perform an extra copy. Because of that, fusing
conv2d currently regresses performance and is therefore disabled. However,
in the near future we should have two alternative ways to enable it:
1) represent conv2d natively in TE (without an external call)
2) add an out variant for conv2d
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D27237045
Pulled By: ZolotukhinM
fbshipit-source-id: f5545ff711b75f9f37bc056316d1999a70043b4c
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54337
This PR adds a new API to NNC to perform loop fusion.
```
static For* fuseLoops(const std::vector<For*>& loops);
```
Loop fusion is done only when all the conditions below are satisfied.
* All the loops have the same parent.
* There are no statements between these loops in their parent body.
* The start bounds are the same for all loops.
* The stop bounds are the same for all loops.
* Fusing the loops does not violate or add any dependences (see the counterexample below).
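As a sketch of what the last condition rules out (illustrative):
```
for (int i = 0; i < N; i++) {
  A[i] = B[i];
}
for (int j = 0; j < N; j++) {
  C[j] = A[j + 1];
}
```
Before fusion, every read of `A[j + 1]` sees the value written by the first loop; after fusion, iteration `j` would read `A[j + 1]` before it is written, so fusing these loops must be rejected.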
This PR also adds an API to check for partial overlaps in `buffer_inference.h` and fixes a bug in `mem_dependency_checker.cpp`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54461
Reviewed By: bertmaher
Differential Revision: D27254888
Pulled By: navahgar
fbshipit-source-id: c21b027d738e5022e9cb88f6f72cd9e255bdb15e
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53864
This PR adds the following APIs that perform loop distribution to `LoopNest`:
```
static std::vector<For*> distributeLoop(For* loop, const std::unordered_set<Stmt*>& pivots);
static std::vector<For*> distributeLoop(For* loop);
static std::vector<For*> distributeLoopOverInnerLoops(For* loop);
```
* The first method distributes the given loop over its body by splitting after every given pivot stmt.
* The second method distributes the given loop over every stmt in its body.
* The last method distributes the given loop over its body by splitting after every `For` stmt in its body (see the sketch below).
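A sketch of the second form in action (illustrative statements):
```
// Before:
for (int i = 0; i < N; i++) {
  A[i] = 0;
  B[i] = A[i];
}
// After distributeLoop(loop):
for (int i = 0; i < N; i++) {
  A[i] = 0;
}
for (int i = 0; i < N; i++) {
  B[i] = A[i];
}
```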
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53865
Reviewed By: mruberry
Differential Revision: D27075006
Pulled By: navahgar
fbshipit-source-id: 031746aad619fe84c109e78b53387535e7f77cef
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53092
This PR adds the following APIs to NNC.
```
// In For:
static For* getParentLoop(const Stmt* st);
static std::vector<For*> getEnclosingLoopNest(const Stmt* st);
// In LoopNest:
std::vector<const Stmt*> getAllWritesToBuf(const Buf*) const;
std::vector<For*> getAllInnermostLoopsWritingToBuf(const Buf*) const;
std::vector<std::vector<For*>> getAllLoopNestsWritingToBuf(const Buf*) const;
```
These APIs are required for some use cases that involve multiple transformations, like `splitWithTail` followed by `reorder`, as shown in https://github.com/pytorch/pytorch/issues/53092.
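A hedged sketch of the intended flow (buffer and loop-nest names are illustrative):
```
// After an earlier transformation (e.g. splitWithTail) has restructured the
// IR, rediscover the loops over a buffer instead of holding stale For*:
std::vector<For*> inner = loop_nest.getAllInnermostLoopsWritingToBuf(out_buf);
for (For* f : inner) {
  std::vector<For*> nest = For::getEnclosingLoopNest(f);
  // `nest` can now be fed to reorder(), etc.
}
```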
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53778
Reviewed By: albanD
Differential Revision: D26987013
Pulled By: navahgar
fbshipit-source-id: 491459eddfff045132d2358631ad069bbcc520df
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52581
The git diff is absolutely atrocious since I also refactored the code to share stuff between `Load` and `FunctionCall`.
The biggest questions I have about this diff are:
1. The asserts I added. From my understanding it's not possible to have a non-zero constant index in `Store`, since `Store` always creates a new buffer. The user may still be able to write this kind of incorrect code, though, so perhaps I should just check for it instead of asserting?
2. I don't think(?) I need to do any special handling for `index_vars`, but I wasn't totally able to track the logic there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53254
Reviewed By: albanD
Differential Revision: D26991064
Pulled By: Chillee
fbshipit-source-id: 0bcd612d5f4b031c0b34e68a72d9c8d12d118be8
Summary:
* Replacing the vector of Tensors with a set of output buffers in `TensorExprKernel`.
* Creating a block statement while compiling in `TensorExprKernel`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53688
Reviewed By: mrshenli
Differential Revision: D26941222
Pulled By: navahgar
fbshipit-source-id: 9eb81ec2effcdeafbeaa67d1e12475166054f80f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52901
This PR implements IR Verifier and adds a call to it in `LoopNest`
constructors. Checks that were in expr/stmt constructors before are now
moved to the corresponding `::make` functions or to the verifier. They
didn't really help in the constructors anyway, since an exception thrown
from there led to a segfault due to the way our memory management works
(the object was not fully created but was registered in the kernel arena
for destruction anyway).
Fixes #52778.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D26682928
Pulled By: ZolotukhinM
fbshipit-source-id: c56524015cdffb1ed8bce4394509961a4071dcfa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52726
This change removes `input_bufs_` and `intermediate_bufs_` from the
`LoopNest` class, as they can be deduced from the root stmt and the list
of output bufs. As a result, the constructor of the LoopNest also becomes
simpler: we now need to pass just one list of bufs.
Note: we might consider passing the list of input bufs for verification
purposes (only input buffers are allowed to not have a definition), but
since we don't really have an IR verifier yet, there is no need for it
now. Once we add an IR verifier, we could reconsider.
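A hedged sketch of the simplified construction (the exact parameter types are an assumption):
```
// Before (roughly): LoopNest(root_stmt, output_bufs, intermediate_bufs);
// After: intermediates are deduced from the root stmt.
LoopNest nest(root_stmt, output_bufs);  // root_stmt, output_bufs are illustrative
```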
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D26629596
Pulled By: ZolotukhinM
fbshipit-source-id: 81f544e9602b6855b7968d540b9ae06bd7c7e6d8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52628
Prior to this change, ExternalCalls were not considered Loads from or
Stores to their buffers, which led to incorrect behavior in inlining.
This PR fixes that.
Differential Revision: D26589378
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: cd69d5f7075f6dc756aabcf676842b9a250334d6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52627
Currently the inliner only inlines into Calls; this PR extends it to
cover Loads too. Eventually we will remove Calls altogether and use
Loads everywhere - this is one step in that direction.
Differential Revision: D26589377
Test Plan: Imported from OSS
Reviewed By: asuhan
Pulled By: ZolotukhinM
fbshipit-source-id: ca28f0df2273eb214f203467c6ba3d8f02a8a3b6
Summary:
Remove the dependency tracker that works on Tensors, DepTracker, from LoopNest. This is essential to the goal of removing Tensors from LoopNest.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52405
Reviewed By: heitorschueroff
Differential Revision: D26548621
Pulled By: navahgar
fbshipit-source-id: b20f23d608c19ac71aebd31c14777d653eead36c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52196
A reduction does not need to know the buffer into which its
result will be written. This change gets us closer to being able to
create reductions inside Compute, where we have access to the tensor
axes.
ghstack-source-id: 121813071
Test Plan: test_tensorexpr
Reviewed By: ZolotukhinM
Differential Revision: D26420107
Pulled By: bertmaher
fbshipit-source-id: c8d8a99649adfd6de56fe53a728f5aa034a84f13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52187
ReduceOp doesn't need to track the indices that its result will be written into.
ghstack-source-id: 121813075
Test Plan:
test_tensorexpr, tensorexpr_bench
Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D26420575
fbshipit-source-id: 7afcfa611515334e36de8039722011687f3b61e4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52177
I'm trying to get rid of `output_args` for reductions, because they
shouldn't be necessary; it's reducing over its reduction axis, why
does it need to know where its output is going?
Rfactor is probably the trickiest place where we use output_args, but
it looks like it's mostly just carrying around the location of the
store, so use that instead.
ghstack-source-id: 121813072
Test Plan:
build/bin/test_tensorexpr && build/bin/tensorexpr_bench
Imported from OSS
Reviewed By: navahgar
Differential Revision: D26420548
fbshipit-source-id: aeab564c6113fa02eabb14c9b70c7edfd05b264d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51751
Similar in spirit to the `__builtin_expect` C intrinsic, it's useful
to be able to hint the expected branch direction in a tensor expression. Using
this flag has a few effects on codegen:
- The CompareSelect is generated using conditional branches, rather than selects
- The conditional branches are strongly hinted (like, 100000:1) in the indicated direction
- A vectorized hinted CompareSelect computes its condition in parallel with a
mask "reduction" (e.g. a bitcast from `<i1 x 8>` to `i8`). In AVX terms
this sequence might look like:
```
vpcmpgtd %ymm0, %ymm1, %ymm2
vmovmskps %ymm2, %eax
```
The motivating case for this addition is an attempt I'm making to replicate
fast transcendentals using tensor expressions. Floating-point numbers have
lots of special cases (denormals, inf, nan) that need special handling, and
it's convenient to be able to punt that handling off to a slow path while
keeping the fast path nice and tight.
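A hedged sketch of what such a hint could look like in a tensor expression (the enum value and argument order here are assumptions, not the exact API):
```
// Take the tight fast path for "normal" inputs and hint that the
// special-case (denormal/inf/nan) branch is almost never taken.
ExprHandle result = CompareSelect::make(
    exp_bits, ExprHandle(0),  // condition operands (illustrative)
    fast_path, slow_path,     // values for true/false
    kGT, kLikely);            // comparison op and branch-direction hint
```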
ghstack-source-id: 121366315
Test Plan:
I'm not sure how to test this (except I can tell you it works for
the `log` implementation I'm working on right now). It would be nice to plumb
the LLIR/ASM output through programmatically so it can be used in FileCheck.
Maybe I'll do that in another diff?
Reviewed By: asuhan
Differential Revision: D26246401
fbshipit-source-id: 900f7fa0520010fb9931d6e3efc8680a51f8d844
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50995
This change makes 'Tensor' a thin wrapper over 'Buf' and 'Stmt', and
merges it with the recently introduced 'CompoundTensor'. A statement for
the tensor is either passed directly to the Tensor constructor (akin to
'CompoundTensor') or is built immediately in the constructor.
LoopNest is no longer responsible for constructing statements from
tensors - it simply stitches together the already-constructed statements
contained in Tensors. A side effect is that we can no longer construct
several loopnests from the same tensors - we need to explicitly clone
statements if we want to do that. A special copy constructor was added
to LoopNest to make this more convenient (note: this only affects tests;
we don't usually create multiple loopnests elsewhere).
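Conceptually (a sketch, not the exact declaration):
```
// Tensor is now (roughly) just a pair:
class Tensor {
  const Buf* buf_;  // the buffer the tensor writes to
  Stmt* stmt_;      // the statement computing it, passed in or built in the ctor
};
```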
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D26038223
Pulled By: ZolotukhinM
fbshipit-source-id: 27a2e5900437cfb0c151e8f89815edec53608e17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50994
Eventually, 'Tensor' will be fully responsible for its 'Stmt' and moving
this method to it is one step in that direction.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D26038222
Pulled By: ZolotukhinM
fbshipit-source-id: 0549f0ae6b46a93ff7608a22e79faa5115eef661
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50993
This is the first step toward making 'Tensor' a thin wrapper over 'Buf'
and 'Stmt', which will be finished in subsequent PRs. This change also
allows us to remove 'buf_initializers_' from 'LoopNest', making it "less
stateful".
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D26038224
Pulled By: ZolotukhinM
fbshipit-source-id: f418816e54c62f291fa45812901487394e9b95b5