Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72390
This class didn't add much value and only caused more boilerplate code.
This change removes the class and replaces all of its uses with
`ExprHandle`.
A side effect of this change is that loop variables get different names,
which caused massive mechanical changes in our tests.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D34030296
Pulled By: ZolotukhinM
fbshipit-source-id: 2ba4e313506a43ab129a10d99e72b638b7d40108
(cherry picked from commit c2ec46a058)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66515
These passes should not be used generally because they change the API of
the model's forward method, but they help with experimenting on the model
and ironing out all the kinks before it can be compiled properly. In the
long run, we should ideally provide a better way to enable such
experiments.
Differential Revision: D31590862
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 74ded34c6c871d4cafa29f43dc27c7e71daff8fc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66744
Modified loops in files under fbsource/fbcode/caffe2/ from the format
`for(TYPE var=x0;var<x_max;var++)`
to the format
`for(const auto var: irange(x_max))`
This was achieved by running r-barnes's loop upgrader script (D28874212) with some modification to exclude all files under /torch/jit and a number of reversions or unused variable suppression warnings added by hand.
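As a rough illustration of the rewrite, the sketch below compares the two loop styles. `IntRange` and `irange` here are hypothetical, simplified stand-ins written for this example; they are not the actual `c10::irange` implementation, and `sumOld` / `sumNew` are made-up helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, minimal stand-in for an irange helper: iterates over [0, end).
class IntRange {
 public:
  class Iterator {
   public:
    explicit Iterator(std::size_t v) : v_(v) {}
    std::size_t operator*() const { return v_; }
    Iterator& operator++() {
      ++v_;
      return *this;
    }
    bool operator!=(const Iterator& other) const { return v_ != other.v_; }

   private:
    std::size_t v_;
  };

  explicit IntRange(std::size_t end) : end_(end) {}
  Iterator begin() const { return Iterator(0); }
  Iterator end() const { return Iterator(end_); }

 private:
  std::size_t end_;
};

inline IntRange irange(std::size_t end) {
  return IntRange(end);
}

// Old style: explicit index variable with a hand-written bound and increment.
inline long sumOld(const std::vector<int>& v) {
  long total = 0;
  for (std::size_t i = 0; i < v.size(); i++) {
    total += v[i];
  }
  return total;
}

// New style: the loop variable is const and its type is deduced from the range.
inline long sumNew(const std::vector<int>& v) {
  long total = 0;
  for (const auto i : irange(v.size())) {
    total += v[i];
  }
  return total;
}
```

Because the loop variable is `const`, a body that never reads it triggers an unused-variable warning, which is presumably why some suppressions had to be added by hand.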
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D31705358
fbshipit-source-id: d6ea350cbaa8f452fc78f238160e5374be637a48
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234
Modified loops in files under fbsource/fbcode/caffe2/ from the format
`for(TYPE var=x0;var<x_max;var++)`
to the format
`for(const auto var: irange(x_max))`
This was achieved by running r-barnes's loop upgrader script (D28874212) with some modification to exclude all files under /torch/jit and a number of reversions or unused variable suppression warnings added by hand.
bypass_size_limit
allow-large-files
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D30652629
fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887
BufHandle has exactly the same functionality and should be used instead.
Differential Revision: D30889483
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64077
We were assuming kernel dimensions fit in 32 bits (the old fuser made
this assumption too), but we should be able to support 64.
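To see why 32 bits can be too few, consider the element count of a large shape. The following is only a sketch; `numElements` and `needs64BitIndex` are hypothetical helpers for this example, not functions from the fuser.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Number of elements for a shape, accumulated in 64 bits.
inline int64_t numElements(const std::vector<int64_t>& dims) {
  int64_t n = 1;
  for (const int64_t d : dims) {
    n *= d;
  }
  return n;
}

// True when the element count cannot be represented by a 32-bit signed index.
inline bool needs64BitIndex(const std::vector<int64_t>& dims) {
  return numElements(dims) > std::numeric_limits<int32_t>::max();
}
```

For example, a 65536 x 65536 buffer has 2^32 elements, so a flattened index into it no longer fits in an `int32_t`.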
ghstack-source-id: 136933272
Test Plan: unit tests; new IR level test with huge sizes
Reviewed By: ZolotukhinM
Differential Revision: D30596689
fbshipit-source-id: 23b7e393a2ebaecb0c391a6b1f0c4b05a98bcc94
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587
Now that there are no classes using KernelArena for memory management, we
can remove it.
Differential Revision: D30429115
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63586
This is another commit in the transition away from KernelArena memory management.
Tensor is essentially just a pair of <BufPtr, StmtPtr> and we don't need
to dynamically allocate it at all - it's cheap to pass it by value, and
that's what we're switching to in this commit.
After this change nothing uses KernelScope/KernelArena and they can be
safely removed.
Differential Revision: D30429114
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: f90b859cfe863692b7beffbe9bd0e4143df1e819
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63778
This is a preparation for a switch from raw pointers to shared pointers
as a memory model for TE expressions and statements.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30487425
Pulled By: ZolotukhinM
fbshipit-source-id: 9cbe817b7d4e5fc2f150b29bb9b3bf578868f20c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63197
This solves non-determinism from using hash values in sort methods.
Changes in tests are mostly mechanical.
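One general way to get a reproducible order is to compare by a key derived from the contents rather than by a hash or an address. The sketch below shows that idea only; `Term` and `sortDeterministically` are hypothetical names, and this is not necessarily the comparison used in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an IR term.
struct Term {
  std::string name;
  int coefficient;
};

// Order by content (name first, then coefficient as a tie-breaker), so the
// result does not depend on hash values or on where objects were allocated.
inline void sortDeterministically(std::vector<Term>& terms) {
  std::sort(terms.begin(), terms.end(), [](const Term& a, const Term& b) {
    if (a.name != b.name) {
      return a.name < b.name;
    }
    return a.coefficient < b.coefficient;
  });
}
```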
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30292776
Pulled By: ZolotukhinM
fbshipit-source-id: 74f57b53c3afc9d4be45715fd74781271373e055
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63195
This helps us to later switch from using KernelArena with raw pointers
to shared pointers without having to change all our source files at
once.
The changes are mechanical and should not affect any functionality.
With this PR, we're changing the following:
* `Add*` --> `AddPtr`
* `new Add(...)` --> `alloc<Add>(...)`
* `dynamic_cast<Add*>` --> `to<Add>`
* `static_cast<Add*>` --> `static_to<Add>`
Due to some complications with args forwarding, some places became more
verbose, e.g.:
* `new Block({})` --> `new Block(std::vector<ExprPtr>())`
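The value of this indirection is that the pointer representation lives in one alias. A minimal sketch of the idea follows, assuming the alias has already been switched to `std::shared_ptr`; the `Expr`, `Add`, and `Mul` types here are toy stand-ins, and the helper bodies are an illustration rather than the actual TensorExpr definitions.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Single place that decides the pointer representation (sketch: shared_ptr).
template <class T>
using NodePtr = std::shared_ptr<T>;

// Toy node hierarchy, for illustration only.
struct Expr {
  virtual ~Expr() = default;
};
struct Add : Expr {
  Add(int l, int r) : lhs(l), rhs(r) {}
  int lhs;
  int rhs;
};
struct Mul : Expr {};

using ExprPtr = NodePtr<Expr>;
using AddPtr = NodePtr<Add>;

// Replaces `new T(...)`.
template <class T, class... Args>
NodePtr<T> alloc(Args&&... args) {
  return std::make_shared<T>(std::forward<Args>(args)...);
}

// Replaces `dynamic_cast<T*>(p)`: yields nullptr on a type mismatch.
template <class To, class From>
NodePtr<To> to(const NodePtr<From>& p) {
  return std::dynamic_pointer_cast<To>(p);
}

// Replaces `static_cast<T*>(p)`: the caller guarantees the dynamic type.
template <class To, class From>
NodePtr<To> static_to(const NodePtr<From>& p) {
  assert(p == nullptr || std::dynamic_pointer_cast<To>(p) != nullptr);
  return std::static_pointer_cast<To>(p);
}
```

A forwarding helper like `alloc` cannot deduce a type for a bare `{}` argument, which is one reason a call site may need to spell out the container type explicitly.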
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30292779
Pulled By: ZolotukhinM
fbshipit-source-id: 150301c7d2df56b608b035827b6a9a87f5e2d9e9
Summary:
The GoogleTest `TEST` macro, as well as `DEFINE_DISPATCH`, is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` check.
All changes except the ones to `.clang-tidy` were generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`; do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61725
Alloc/free inside a loop isn't really an optimization, and furthermore
it breaks some attempted optimization in the llvm backend: we use alloca for
small allocations, which is efficient since alloca is on the stack, but there's
no corresponding free, so we leak tons of stack. I hit this while building an
rfactor buffer inside a very deeply nested loop.
ghstack-source-id: 133627310
Test Plan:
Unit test which simulates use of a temp buffer in a deeply nested
loop.
Reviewed By: navahgar
Differential Revision: D29533364
fbshipit-source-id: c321f4cb05304cfb9146afe32edc4567b623412e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59508
An assert that was triggering in a previous version is now relaxed to
take 0-dim tensors into account.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D28918342
Pulled By: ZolotukhinM
fbshipit-source-id: c09b62c9725d1603b0ec11fcc051e7c932af06ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59279
There were some issues with how we handle 0-dim cases in lowerings and
also in how we generate reductions in that special case. This PR fixes
those issues and reenables a bunch of tests.
Differential Revision: D28819780
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: f3feff35a1ce11821ada2f8d04ae9d4be10dc736
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os


def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55324
With this change `rfactor` only affects the passed loop and its body,
never touching anything outside (that was the root cause of a bug with the
previous implementation). Also, we don't have an `insertion_point`
parameter anymore - its meaning was vague, and the effect of it
should've been achievable with other transformations anyway.
The new `rfactor` semantics are as follows:
```
Requirements:
 * S is the reduction store
 * S is the only statement in the innermost loop
 * There are at least two reduction arguments in S
 * OUTER_REDUCTION_FOR loop corresponds to the outermost reduction variable
   used in the store and all other reduction variables are index variables of
   children loops of OUTER_REDUCTION_FOR
 * OUTER_REDUCTION_FOR is a perfect loop nest, i.e. it has only loops
   corresponding to the other reduction variables and the store, nested into
   each other

What it does:
 * Introduce a new buffer with an extra dimension of a size equal to the
   span of the loop OUTER_REDUCTION_FOR (the new buffer is returned via
   RFAC_BUF_PTR)
 * Insert an initialization store for the new buffer in
   OUTER_REDUCTION_FOR before its nested loop
 * Replace the reduction store to the original buffer with the reduction
   store to the temp buffer, removing the index var of OUTER_REDUCTION_FOR
   from reduction arguments
 * Insert a final reduction store over the extra dimension of the new
   buffer to the original buffer
 * Returns TRUE if the transformation succeeded and FALSE otherwise

Example:
Original IR:
S1: for i        # normal axis
S2:   X[i] = 0
S3:   for j      # reduction axis
S4:     for k    # reduction axis
S5:       X[i] = ReduceOp(X[i] + Y[i,j,k], reduce_axis={j,k})

After RFACTOR(S5, S3)
S1: for i               # normal axis
S2:   X[i] = 0
S3:   for j             # reduction axis for X, normal axis for X_rfac
        X_rfac[i,j] = 0
S4:     for k           # reduction axis
          X_rfac[i,j] = ReduceOp(X_rfac[i,j] + Y[i,j,k], reduce_axis={k})
        X[i] = ReduceOp(X[i] + X_rfac[i,j], reduce_axis={j})
```
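The equivalence of the two loop nests in the example can be checked numerically. The sketch below hand-writes both forms for a sum reduction over plain arrays; `reduceDirect` and `reduceRfactored` are hypothetical names for this illustration and do not call the actual `rfactor` transformation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Original nest: reduce over j and k directly into X[i].
// Y is stored row-major with shape [I, J, K].
inline std::vector<int> reduceDirect(
    const std::vector<int>& Y, int I, int J, int K) {
  assert(Y.size() == static_cast<std::size_t>(I) * J * K);
  std::vector<int> X(I, 0);
  for (int i = 0; i < I; i++) {
    for (int j = 0; j < J; j++) {
      for (int k = 0; k < K; k++) {
        X[i] += Y[(i * J + j) * K + k];
      }
    }
  }
  return X;
}

// Shape of the nest after rfactor on the j loop: reduce over k into
// X_rfac[i, j], then reduce X_rfac over j into X[i].
inline std::vector<int> reduceRfactored(
    const std::vector<int>& Y, int I, int J, int K) {
  assert(Y.size() == static_cast<std::size_t>(I) * J * K);
  std::vector<int> X(I, 0);
  std::vector<int> X_rfac(static_cast<std::size_t>(I) * J, 0);
  for (int i = 0; i < I; i++) {
    for (int j = 0; j < J; j++) {
      X_rfac[i * J + j] = 0; // initialization store for the new buffer
      for (int k = 0; k < K; k++) {
        X_rfac[i * J + j] += Y[(i * J + j) * K + k];
      }
      X[i] += X_rfac[i * J + j]; // final reduction over the extra dimension
    }
  }
  return X;
}
```

Integer addition is used here so that the two orders of summation agree exactly; with floating-point data the results may differ by rounding.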
Differential Revision: D27694960
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 076fa6a1df2c23f5948302aa6b43e82cb222901c
Summary:
Switched to short forms of `splitWithTail` / `splitWithMask` for all tests in `test/cpp/tensorexpr/test_*.cpp` (except test_loopnest.cpp)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55542
Reviewed By: mrshenli
Differential Revision: D27632033
Pulled By: jbschlosser
fbshipit-source-id: dc2ba134f99bff8951ae61e564cd1daea92c41df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54997
DepTracker was used to automatically pull in dependent computations from
output ones. While it seems quite convenient, it's led to several
architectural issues, which are fixed in this stack.
DepTracker worked on Tensors, each of which is a pair of Buf and Stmt.
However, Stmt could become stale and there was no way to reliably update
the corresponding tensor. We're now using Bufs and Stmts directly and
moving away from using Tensors to avoid these problems.
Removing DepTracker allowed us to unify Loads and FunctionCalls, which
essentially were duplicates of each other.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D27446414
Pulled By: ZolotukhinM
fbshipit-source-id: a2a32749d5b28beed92a601da33d126c0a2cf399
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53751
Sometimes the initial value of a reduction expression needs to be
computed with reference to the loop axes; for example, adding bias can be
efficiently represented by initializing the accumulator from the bias tensor:
```
C[n, c, h, w] = bias[c]
for (...)
C[n, c, h, w] += ...
```
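In plain C++ terms, the pattern above amounts to seeding the accumulator from the bias instead of zero. The following is a sketch on flat arrays with a simplified `[C, K]` shape; `reduceWithBiasInit` is a hypothetical helper for this example, not part of the TensorExpr API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// out[c] = bias[c] + sum over k of A[c, k].
// The accumulator is initialized from bias[c], so no separate add is needed.
inline std::vector<float> reduceWithBiasInit(
    const std::vector<float>& A, // row-major, shape [C, K]
    const std::vector<float>& bias, // shape [C]
    int K) {
  const int C = static_cast<int>(bias.size());
  assert(A.size() == bias.size() * static_cast<std::size_t>(K));
  std::vector<float> out(C);
  for (int c = 0; c < C; c++) {
    out[c] = bias[c]; // the initializer depends on the loop axis c
    for (int k = 0; k < K; k++) {
      out[c] += A[c * K + k];
    }
  }
  return out;
}
```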
ghstack-source-id: 123592861
Test Plan: `buck test //caffe2/test/cpp/tensorexpr -- Reductions.InitFunction`
Reviewed By: navahgar
Differential Revision: D26940321
fbshipit-source-id: 8a08e19e5d0b9ad453a07fab8b61e75dcd3d626b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52196
A reduction does not need to know the buffer into which its
result will be written. This change gets us closer to being able to
create reductions inside Compute, where we have access to the tensor
axes.
ghstack-source-id: 121813071
Test Plan: test_tensorexpr
Reviewed By: ZolotukhinM
Differential Revision: D26420107
Pulled By: bertmaher
fbshipit-source-id: c8d8a99649adfd6de56fe53a728f5aa034a84f13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52187
ReduceOp doesn't need to track the indices that its result will be written into.
ghstack-source-id: 121813075
Test Plan:
test_tensorexpr, tensorexpr_bench
Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D26420575
fbshipit-source-id: 7afcfa611515334e36de8039722011687f3b61e4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50995
This change makes 'Tensor' a thin wrapper over 'Buf' and 'Stmt', and
merges it with recently introduced 'CompoundTensor'. A statement for the
tensor is either passed directly to the Tensor constructor (akin to
'CompoundTensor'), or is built immediately in constructor.
LoopNest is no longer responsible for constructing statements from
tensors - it simply stitches already constructed statements contained in
Tensors. This has a side effect that now we cannot construct several
loopnests from the same tensors - we need to explicitly clone statements
if we want to do that. A special copy constructor was added to LoopNest
to make it more convenient (note: this only affects tests, we don't
usually create multiple loopnests in other places).
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D26038223
Pulled By: ZolotukhinM
fbshipit-source-id: 27a2e5900437cfb0c151e8f89815edec53608e17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49697
Mostly mechanical move. This refactoring helps to hide unnecessary
details from the SimpleIREval interface and make it more similar to a
pure 'codegen'.
Test Plan: Imported from OSS
Reviewed By: nickgg
Differential Revision: D25668696
Pulled By: ZolotukhinM
fbshipit-source-id: 423247bfcdfa88403e8ec92152f00110bb9da19c
Summary:
Makes two changes in NNC for intermediate buffer allocations:
1. Flattens dimensions of buffers allocated in LoopNest::prepareForCodegen() to match their flattened usages.
2. Adds support for tracking memory dependencies of Alloc/Free to the MemDependencyChecker, which will allow us to check safety of accesses to intermediate buffers (coming in a future diff).
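For the first point, flattening means a multi-dimensional access is turned into a single offset, and the allocation gets one dimension of matching size. The sketch below assumes a contiguous row-major layout; `flattenIndex` and `flattenedSize` are hypothetical helpers, not NNC functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major flat offset of `indices` in a buffer with shape `dims`.
inline int64_t flattenIndex(
    const std::vector<int64_t>& indices,
    const std::vector<int64_t>& dims) {
  assert(indices.size() == dims.size());
  int64_t flat = 0;
  for (std::size_t d = 0; d < dims.size(); d++) {
    flat = flat * dims[d] + indices[d];
  }
  return flat;
}

// Size of the single dimension a flattened allocation would need.
inline int64_t flattenedSize(const std::vector<int64_t>& dims) {
  int64_t n = 1;
  for (const int64_t d : dims) {
    n *= d;
  }
  return n;
}
```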
I didn't add any new tests as the mem dependency checker tests already cover it pretty well, particularly the GEMM test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49554
Reviewed By: VitalyFedyunin
Differential Revision: D25643133
Pulled By: nickgg
fbshipit-source-id: 66be3054eb36f0a4279d0c36562e63aa2dae371c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48160
We no longer use the custom C++ test infra anyway, so move to pure
gtest.
Fixes #45703
ghstack-source-id: 116977283
Test Plan: `buck test //caffe2/test/cpp/tensorexpr`
Reviewed By: navahgar, nickgg
Differential Revision: D25046618
fbshipit-source-id: da34183d87465f410379048148c28e1623618553
Summary:
Add support for ReduceOp in the Vectorizer, which allows vectorization of reductions. Only non-reduce axes can be vectorized currently. To make vectorizing reduce axes work, we'd need either to automatically pull out the RHS of reductions (better as a separate transform, I think) or special handling of vector reduce in the LLVM codegen (tricky, maybe not useful?).
There was a disabled LLVM test for this case which I reenabled with a bit of massaging, and added a few more.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47924
Reviewed By: bertmaher
Differential Revision: D24963464
Pulled By: nickgg
fbshipit-source-id: 91d91e9e2696555ab5690b154984b1ce48359d51
Summary:
This diff enables inlining producers into reductions. It also guards against inlining reductions themselves.
Prior to this diff, if there was a reduction in the loopnest, no inlining was happening. After this change, we will inline all non-output buffers that do not correspond to a reduction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47020
Reviewed By: albanD
Differential Revision: D24644346
Pulled By: navahgar
fbshipit-source-id: ad234a6877b65be2457b734cbb7f3a1800baa6a5
Summary:
Adds a new transform to the NNC compiler, which adds support for buffer access caching. All accesses within a provided scope are redirected to a cache which is initialized or written back as necessary at the boundaries of that scope. For TVM fans, this is essentially a combination of cache_reads and cache_writes. E.g. it can do this kind of thing:
Before:
```
for (int i = 0; i < 64; i++) {
  for (int j = 0; j < 64; j++) {
    A[i, j] = i * j;
  }
}
for (int i_1 = 0; i_1 < 20; i_1++) {
  for (int j_1 = 0; j_1 < 10; j_1++) {
    B[i_1, j_1] = (A(i_1 + 30, j_1 + 40)) + (A(i_1 + 31, j_1 + 41));
  }
}
```
After `cacheAccesses(A->buf(), "A_local", j_loop);`
```
for (int i = 0; i < 64; i++) {
  for (int j = 0; j < 64; j++) {
    A[i, j] = i * j;
  }
}
for (int i_1 = 0; i_1 < 20; i_1++) {
  for (int i_2 = 0; i_2 < 2; i_2++) {
    for (int j_1 = 0; j_1 < 11; j_1++) {
      A_local[i_2, j_1] = A[(i_2 + i_1) + 30, j_1 + 40];
    }
  }
  for (int j_2 = 0; j_2 < 10; j_2++) {
    B[i_1, j_2] = (A_local[1, j_2 + 1]) + (A_local[0, j_2]);
  }
}
```
Or this reduction:
```
for (int l1 = 0; l1 < 4; l1++) {
  sum[l1] = 0.f;
  for (int n1_1 = 0; n1_1 < 3; n1_1++) {
    for (int m1_1 = 0; m1_1 < 2; m1_1++) {
      sum[l1] = (sum[l1]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
    }
  }
}
```
After `l.cacheAccesses(d->buf(), "d_local", n_loop);`:
```
for (int l1 = 0; l1 < 4; l1++) {
  Allocate(d_local, float, {1});
  sum[l1] = 0.f;
  d_local[0] = 0.f;
  for (int n1_1 = 0; n1_1 < 3; n1_1++) {
    for (int m1_1 = 0; m1_1 < 2; m1_1++) {
      d_local[0] = (d_local[0]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
    }
  }
  sum[l1] = (sum[l1]) + (d_local[0]);
  Free(d_local);
}
```
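The reduction case can be mirrored in ordinary C++ to show what the cache does: accesses inside the scope go to a local accumulator, which is written back once at the scope boundary. This is a hand-written sketch of the resulting loop shape, with hypothetical function names; it does not invoke `cacheAccesses`.

```cpp
#include <cassert>
#include <vector>

// Uncached form: every update reads and writes sum[l] directly.
// `scale` has shape [4, 3, 2], stored flattened.
inline std::vector<float> sumDirect(const std::vector<float>& scale) {
  assert(scale.size() == 24);
  std::vector<float> sum(4);
  for (int l = 0; l < 4; l++) {
    sum[l] = 0.f;
    for (int n = 0; n < 3; n++) {
      for (int m = 0; m < 2; m++) {
        sum[l] = sum[l] + scale[(6 * l + 2 * n) + m];
      }
    }
  }
  return sum;
}

// Cached form: updates inside the scope go to a local accumulator.
inline std::vector<float> sumCached(const std::vector<float>& scale) {
  assert(scale.size() == 24);
  std::vector<float> sum(4);
  for (int l = 0; l < 4; l++) {
    float d_local = 0.f; // plays the role of Allocate(d_local, float, {1})
    sum[l] = 0.f;
    for (int n = 0; n < 3; n++) {
      for (int m = 0; m < 2; m++) {
        d_local = d_local + scale[(6 * l + 2 * n) + m];
      }
    }
    sum[l] = sum[l] + d_local; // write-back at the boundary of the scope
  }
  return sum;
}
```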
I had originally planned to write `cacheReads` and `cacheWrites` wrappers so we could use them just like their TVM cousins, but they just ended up being big masses of checking that reads or writes weren't present. Didn't feel too useful so I removed them, but let me know.
This is based on bounds inference and inherits a few bugs present in that functionality, which I will address in a followup.
While working on this I realized that it overlaps heavily with `computeAt`, which is really just `cacheReads` + `computeInline`. I'm considering refactoring computeAt to be a wrapper around those two transforms. ZolotukhinM opinions on this?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45869
Reviewed By: mruberry
Differential Revision: D24195276
Pulled By: nickgg
fbshipit-source-id: 36a58ae265f346903187ebc4923637b628048155
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45520
With this change `Load`s and `Store`s no longer accept `Placeholder`s in
their constructor and `::make` functions and can only be built with
`Buf`.
`Placeholder` gets its own `store`, `load`, `storeWithMask`, and
`loadWithMask` methods for more convenient construction.
Test Plan: Imported from OSS
Reviewed By: glaringlee
Differential Revision: D23998789
Pulled By: ZolotukhinM
fbshipit-source-id: 3fe018e00c1529a563553b2b215f403b34aea912
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45390
Tensor objects should always refer to their Function's bufs. Currently
we never create a Tensor with a buffer different from that of its function,
but having it in two places seems incorrect and dangerous.
Differential Revision: D23952865
Test Plan: Imported from OSS
Reviewed By: nickgg
Pulled By: ZolotukhinM
fbshipit-source-id: e63fc26d7078427514649d9ce973b74ea635a94a