Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66242
While working on random test generation, I observed that many simple transformations were upsetting vectorization. Digging deeper, I found that vectorization calls SplitWithTail, which incorrectly splits the loop when the loop start is not zero. This patch normalizes the loop before we start splitting it.
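For illustration, a hedged sketch in plain C++ (not the NNC API; the exact form of the miscomputation is an assumption): splitting a loop whose start is not zero needs the trip count `stop - start`, which is exactly what normalizing to a zero start provides.
```
#include <cstdio>

int main() {
  // Original loop: for (int i = 5; i < 15; i++) body(i);  -- 10 iterations.
  // If the split math derives the trip count from the stop value alone
  // (i.e. assumes the loop starts at 0), it sees 15 iterations instead of 10.
  // Normalizing first rewrites the loop as
  //   for (int i = 0; i < 10; i++) body(i + 5);
  // and splitting that by 4 gives 2 full chunks plus a tail of 2.
  const int start = 5, stop = 15, factor = 4;
  const int trip = stop - start;     // trip count of the normalized loop
  const int chunks = trip / factor;  // 2 full chunks
  const int tail = trip % factor;    // 2 tail iterations
  for (int outer = 0; outer < chunks; outer++)
    for (int inner = 0; inner < factor; inner++)
      std::printf("body(%d)\n", outer * factor + inner + start);
  for (int t = 0; t < tail; t++)
    std::printf("body(%d)\n", chunks * factor + t + start);
  return 0;
}
```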
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D31506853
Pulled By: anijain2305
fbshipit-source-id: 5c5f2568ce0a239bfaa515458be52541eafd23b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887
BufHandle has exactly the same functionality and should be used instead.
Differential Revision: D30889483
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64077
We were assuming kernel dimensions fit in 32 bits (the old fuser made
this assumption too), but we should be able to support 64-bit sizes.
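A generic illustration (not the fuser code) of why 32-bit size arithmetic is not enough once buffers get large:
```
#include <cstdint>
#include <iostream>

int main() {
  // Each dimension of this hypothetical buffer fits comfortably in 32 bits,
  // but the total number of elements (3'000'000'000) does not.
  const int64_t dims[] = {50000, 60000};
  const int64_t numel = dims[0] * dims[1];
  const int32_t narrowed = static_cast<int32_t>(numel);  // wraps/truncates
  std::cout << "64-bit element count: " << numel << "\n"
            << "narrowed to 32 bits:  " << narrowed << "\n";
  return 0;
}
```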
ghstack-source-id: 136933272
Test Plan: unit tests; new IR level test with huge sizes
Reviewed By: ZolotukhinM
Differential Revision: D30596689
fbshipit-source-id: 23b7e393a2ebaecb0c391a6b1f0c4b05a98bcc94
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587
Now that there are no classes using KernelArena for memory management, we
can remove it.
Differential Revision: D30429115
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63586
This is another commit in the transition away from KernelArena memory management.
Tensor is essentially just a pair of <BufPtr, StmtPtr>, and we don't need
to dynamically allocate it at all - it's cheap to pass by value, and
that's what we're switching to in this commit.
After this change nothing uses KernelScope/KernelArena and they can be
safely removed.
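A hedged sketch of the idea (the names and members below are illustrative stand-ins, not the exact NNC declarations): a value type that just holds two shared pointers is cheap to copy, so there is no need to heap-allocate it through an arena.
```
#include <memory>
#include <utility>

// Stand-ins for the real NNC node types.
struct Buf {};
struct Stmt {};
using BufPtr = std::shared_ptr<Buf>;
using StmtPtr = std::shared_ptr<Stmt>;

// Tensor as a plain value type: copying it copies two shared_ptrs,
// which is cheap, so it can be passed and returned by value.
class Tensor {
 public:
  Tensor(BufPtr buf, StmtPtr stmt) : buf_(std::move(buf)), stmt_(std::move(stmt)) {}
  BufPtr buf() const { return buf_; }
  StmtPtr stmt() const { return stmt_; }

 private:
  BufPtr buf_;
  StmtPtr stmt_;
};

Tensor makeTensor() {
  return Tensor(std::make_shared<Buf>(), std::make_shared<Stmt>());
}

int main() {
  Tensor t = makeTensor();  // returned by value, no arena allocation
  Tensor copy = t;          // shallow copy of two shared_ptrs
  (void)copy;
  return 0;
}
```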
Differential Revision: D30429114
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: f90b859cfe863692b7beffbe9bd0e4143df1e819
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63778
This is preparation for a switch from raw pointers to shared pointers
as the memory model for TE expressions and statements.
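A minimal sketch of the direction this prepares for (the aliases are assumptions, not the exact ones used in the codebase): pointer-type aliases let the underlying memory model change in one place.
```
#include <memory>

struct Expr {};
struct Stmt {};

// Before the switch these aliases could simply be raw pointers:
//   using ExprPtr = Expr*;
// After the switch they become shared_ptrs without touching call sites:
using ExprPtr = std::shared_ptr<Expr>;
using StmtPtr = std::shared_ptr<Stmt>;

int main() {
  ExprPtr e = std::make_shared<Expr>();  // call sites only ever see "ExprPtr"
  StmtPtr s = std::make_shared<Stmt>();
  (void)e; (void)s;
  return 0;
}
```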
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30487425
Pulled By: ZolotukhinM
fbshipit-source-id: 9cbe817b7d4e5fc2f150b29bb9b3bf578868f20c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63197
This removes the non-determinism that came from using hash values as keys in sort methods.
Changes in tests are mostly mechanical.
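As a generic illustration (plain C++, not the actual NNC code): ordering by a pointer- or hash-derived key can differ from run to run, while ordering by a stable property of the node does not.
```
#include <algorithm>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Node {
  std::string name;
};

int main() {
  std::vector<std::unique_ptr<Node>> nodes;
  for (auto n : {"c", "a", "b"})
    nodes.push_back(std::make_unique<Node>(Node{n}));

  // Non-deterministic: addresses (and hashes derived from them) vary per run.
  std::sort(nodes.begin(), nodes.end(),
            [](const auto& x, const auto& y) { return x.get() < y.get(); });

  // Deterministic: a stable key gives the same order on every run.
  std::sort(nodes.begin(), nodes.end(),
            [](const auto& x, const auto& y) { return x->name < y->name; });

  for (const auto& n : nodes) std::cout << n->name << "\n";  // a b c
  return 0;
}
```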
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30292776
Pulled By: ZolotukhinM
fbshipit-source-id: 74f57b53c3afc9d4be45715fd74781271373e055
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63195
This helps us to later switch from using KernelArena with raw pointers
to shared pointers without having to change all our source files at
once.
The changes are mechanical and should not affect any functionality.
With this PR, we're changing the following:
* `Add*` --> `AddPtr`
* `new Add(...)` --> `alloc<Add>(...)`
* `dynamic_cast<Add*>` --> `to<Add>`
* `static_cast<Add*>` --> `static_to<Add>`
Due to some complications with args forwarding, some places became more
verbose, e.g.:
* `new Block({})` --> `new Block(std::vector<ExprPtr>())`
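A hedged before/after sketch of the mechanical rewrite; the `Expr`/`Add` stand-ins and the helper definitions below are illustrative, not the PR's actual declarations.
```
#include <iostream>
#include <memory>

// Stand-ins for the real NNC node types.
struct Expr { virtual ~Expr() = default; };
struct Add : Expr { Add(int l, int r) : lhs(l), rhs(r) {} int lhs, rhs; };
using ExprPtr = std::shared_ptr<Expr>;
using AddPtr = std::shared_ptr<Add>;

// alloc<Node>(...) replaces `new Node(...)`.
template <class Node, class... Args>
std::shared_ptr<Node> alloc(Args&&... args) {
  return std::make_shared<Node>(std::forward<Args>(args)...);
}

// to<Node>(e) replaces `dynamic_cast<Node*>(e)`.
template <class Node>
std::shared_ptr<Node> to(const ExprPtr& e) {
  return std::dynamic_pointer_cast<Node>(e);
}

int main() {
  // Before: Add* sum = new Add(1, 2);  if (Add* a = dynamic_cast<Add*>(expr)) ...
  ExprPtr expr = alloc<Add>(1, 2);   // after: alloc<Add>(...) replaces new
  if (AddPtr a = to<Add>(expr)) {    // after: to<Add>(...) replaces dynamic_cast
    std::cout << a->lhs + a->rhs << "\n";
  }
  return 0;
}
```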
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30292779
Pulled By: ZolotukhinM
fbshipit-source-id: 150301c7d2df56b608b035827b6a9a87f5e2d9e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62336
This PR was generated by removing `const` from all node types in the NNC IR and fixing the compilation errors that resulted from this change.
This is the first step in making all NNC mutations in-place.
Test Plan: Imported from OSS
Reviewed By: iramazanli
Differential Revision: D30049829
Pulled By: navahgar
fbshipit-source-id: ed14e2d2ca0559ffc0b92ac371f405579c85dd63
Summary:
The GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` check, as is `DEFINE_DISPATCH`.
All changes except the ones to `.clang-tidy` were generated using the following script:
```
# Delete every NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)
# suppression from the C/C++ sources and headers that contain one.
for i in `find . -type f -iname "*.c*" -or -iname "*.h" \
          | xargs grep cppcoreguidelines-avoid-non-const-global-variables \
          | cut -f1 -d: | sort | uniq`; do
  sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i
done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61725
Allocating and freeing inside a loop isn't really an optimization, and furthermore
it breaks an attempted optimization in the LLVM backend: we use alloca for
small allocations, which is efficient since alloca is on the stack, but there's
no corresponding free, so we leak tons of stack. I hit this while building an
rfactor buffer inside a very deeply nested loop.
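A plain-C++ sketch that mirrors the shape of the Allocate/Free statements in the IR (the buffer and its use are made up for illustration):
```
#include <cstdlib>

void before(float* out, int N, int K) {
  for (int i = 0; i < N; i++) {
    // Allocation and free inside the loop body: one allocation per iteration.
    float* tmp = static_cast<float*>(std::malloc(K * sizeof(float)));
    for (int k = 0; k < K; k++) tmp[k] = 0.f;
    out[i] = tmp[0];
    std::free(tmp);
  }
}

void after(float* out, int N, int K) {
  // Hoisted: a single allocation that lives across the whole loop nest.
  float* tmp = static_cast<float*>(std::malloc(K * sizeof(float)));
  for (int i = 0; i < N; i++) {
    for (int k = 0; k < K; k++) tmp[k] = 0.f;
    out[i] = tmp[0];
  }
  std::free(tmp);
}
```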
ghstack-source-id: 133627310
Test Plan:
Unit test which simulates use of a temp buffer in a deeply nested
loop.
Reviewed By: navahgar
Differential Revision: D29533364
fbshipit-source-id: c321f4cb05304cfb9146afe32edc4567b623412e
Summary:
Partial fix for https://github.com/pytorch/pytorch/issues/56157
This PR updates the `flatten` API in `LoopNest` to perform the flattening transformation in-place. After this transformation, the first loop in the input becomes the flattened loop.
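For reference, a hedged sketch of what flattening a two-level nest looks like at the loop level (plain C++ loops, not the `LoopNest` API):
```
#include <vector>

// Before flattening: a 2-level nest over an M x N buffer.
void before(std::vector<int>& A, int M, int N) {
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      A[i * N + j] = i + j;
}

// After flattening: a single loop of M * N iterations; the original
// indices are recovered from the flattened index.
void after(std::vector<int>& A, int M, int N) {
  for (int f = 0; f < M * N; f++) {
    int i = f / N;
    int j = f % N;
    A[i * N + j] = i + j;
  }
}
```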
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56629
Reviewed By: H-Huang
Differential Revision: D28004787
Pulled By: navahgar
fbshipit-source-id: 7474ae237fae3fff0cd1c64a276a8831dc5b7db0
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os


def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
This PR includes:
* Update to the loop-carried dependence check API to correctly ignore loop-independent dependences and handle all kinds of loop-carried dependences like RAW, WAR and WAW.
* Fix for the overlap API to look only for conflicting buffer accesses where at least one of them is a Store.
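For illustration (plain C++ loops with made-up arrays, not NNC IR), only the dependences that cross iterations are loop-carried:
```
void dependenceExamples(int* A, int* B, int* C, int* D, int N) {
  // Loop-independent RAW: B[i] reads the A[i] written in the same iteration,
  // so no dependence crosses iterations.
  for (int i = 1; i < N; i++) {
    A[i] = i;
    B[i] = A[i];
  }

  // Loop-carried RAW: C[i] reads A[i - 1], which was written by the
  // previous iteration.
  for (int i = 1; i < N; i++) {
    A[i] = i;
    C[i] = A[i - 1];
  }

  // Loop-carried WAR: iteration i reads A[i + 1] before iteration i + 1
  // overwrites it.
  for (int i = 1; i < N - 1; i++) {
    D[i] = A[i + 1];
    A[i] = i;
  }
}
```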
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56354
Reviewed By: bertmaher
Differential Revision: D27856202
Pulled By: navahgar
fbshipit-source-id: 206e4ec771fe0f7f2ccf4b11b29e35df7b9b18bc
Summary:
Partial fix for https://github.com/pytorch/pytorch/issues/56357
Changes the `fuseLoops` API to the following form:
```
static bool fuseLoops(const std::vector<For*>& loops, For** fused);
```
Also, adds a new API to check for loop-carried dependences:
```
static bool hasLoopCarriedDependence(For* loop);
```
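A hedged usage sketch against these signatures; it is a fragment that assumes the `torch::jit::tensorexpr` namespace, that both functions are static members of `LoopNest`, and that `loop1`/`loop2` are adjacent `For*` statements obtained elsewhere:
```
std::vector<For*> loops = {loop1, loop2};
For* fused = nullptr;
if (LoopNest::fuseLoops(loops, &fused)) {
  // Fusion succeeded; `fused` points to the resulting loop.
  if (LoopNest::hasLoopCarriedDependence(fused)) {
    // Later transformations on `fused` must respect the carried dependence.
  }
} else {
  // Fusion was not performed; the original loops are left unchanged.
}
```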
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56353
Reviewed By: bertmaher
Differential Revision: D27856214
Pulled By: navahgar
fbshipit-source-id: 443557088692585657faee296602c547a00117dd
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/56157
This PR changes `normalize` API in `LoopNest` to transform the given `For` statement and not create a new one.
New API:
```
static bool normalize(For* f);
```
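A worked illustration of what normalization does to a loop with a non-zero start (shown as plain C++ loops, not the IR):
```
void before(int* A) {
  // Start is 5, so the loop runs over [5, 15).
  for (int i = 5; i < 15; i++) {
    A[i] = 2 * i;
  }
}

void after(int* A) {
  // Normalized in place: the same 10 iterations, but starting at 0,
  // with the index shifted inside the body.
  for (int i = 0; i < 10; i++) {
    A[i + 5] = 2 * (i + 5);
  }
}
```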
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56158
Reviewed By: agolynski
Differential Revision: D27798361
Pulled By: navahgar
fbshipit-source-id: 57626a5a367bdf94a0efbd9dc8538f5e4e410d6b
Summary:
This PR allows fusing loops whose bounds are specified as expressions that are equal.
For example:
```
for (int j = 0; j < M + N; j++) {
  A[j] = 10 * j;
}
for (int k = 0; k < M + N; k++) {
  B[k] = 20 * k;
}
```
`fuseLoops(j, k)` is possible since the stop bounds of the two loops are equal, even though they are represented by different `Expr*` objects, and will result in:
```
for (int j = 0; j < M + N; j++) {
  A[j] = 10 * j;
  B[j] = 20 * j;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55997
Reviewed By: bertmaher
Differential Revision: D27841270
Pulled By: navahgar
fbshipit-source-id: a64e4503b7f8f28bc0c9823225bc923177bb4c2e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56094
Now that FunctionCalls are merged with Loads, vectorization for
intermediate values automatically started to work.
Fixes #53553.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D27781519
Pulled By: ZolotukhinM
fbshipit-source-id: 1ed68ca2399e9bd4598639bd6dd8f369365f0ef0