Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587
Now that there are no classes using KernelArena for memory management, we
can remove it.
Differential Revision: D30429115
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63197
This eliminates the non-determinism that came from using hash values in sort methods.
Changes in tests are mostly mechanical.
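For illustration only (this is not the actual NNC change, just the general shape of the problem): ordering nodes by a pointer-derived hash can differ from run to run, while ordering by a stable, content-based key is reproducible.
```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

struct Var {
  std::string name;
};

// Non-deterministic: pointer-based hashes vary between runs, so the resulting
// order (and anything generated from it) is not reproducible.
void sort_by_hash(std::vector<const Var*>& vars) {
  std::sort(vars.begin(), vars.end(), [](const Var* a, const Var* b) {
    return std::hash<const Var*>{}(a) < std::hash<const Var*>{}(b);
  });
}

// Deterministic: a stable key such as the variable's name gives the same
// order on every run.
void sort_by_name(std::vector<const Var*>& vars) {
  std::sort(vars.begin(), vars.end(), [](const Var* a, const Var* b) {
    return a->name < b->name;
  });
}
```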
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30292776
Pulled By: ZolotukhinM
fbshipit-source-id: 74f57b53c3afc9d4be45715fd74781271373e055
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63195
This helps us to later switch from using KernelArena with raw pointers
to shared pointers without having to change all our source files at
once.
The changes are mechanical and should not affect any functionality.
With this PR, we're changing the following (see the illustrative sketch below):
* `Add*` --> `AddPtr`
* `new Add(...)` --> `alloc<Add>(...)`
* `dynamic_cast<Add*>` --> `to<Add>`
* `static_cast<Add*>` --> `static_to<Add>`
Due to some complications with args forwarding, some places became more
verbose, e.g.:
* `new Block({})` --> `alloc<Block>(std::vector<ExprPtr>())`
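For illustration, here is a minimal sketch of what such aliases and helpers could look like once everything is backed by shared pointers (this is a simplification, not the actual NNC definitions, which at this stage may still wrap raw pointers):
```cpp
#include <memory>
#include <utility>

// Hypothetical stand-ins for the IR node types.
class Expr {
 public:
  virtual ~Expr() = default;
};
class Add : public Expr {};

// Pointer aliases: AddPtr instead of Add*.
using ExprPtr = std::shared_ptr<Expr>;
using AddPtr = std::shared_ptr<Add>;

// alloc<Add>(...) instead of new Add(...).
template <class T, class... Args>
std::shared_ptr<T> alloc(Args&&... args) {
  return std::make_shared<T>(std::forward<Args>(args)...);
}

// to<Add>(e) instead of dynamic_cast<Add*>(e), and
// static_to<Add>(e) instead of static_cast<Add*>(e).
template <class T>
std::shared_ptr<T> to(const ExprPtr& e) {
  return std::dynamic_pointer_cast<T>(e);
}
template <class T>
std::shared_ptr<T> static_to(const ExprPtr& e) {
  return std::static_pointer_cast<T>(e);
}
```
The extra verbosity in the `Block` case comes from perfect forwarding: a braced-init-list like `{}` has no type that `Args&&...` can deduce, so the argument has to be spelled out as `std::vector<ExprPtr>()`.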
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30292779
Pulled By: ZolotukhinM
fbshipit-source-id: 150301c7d2df56b608b035827b6a9a87f5e2d9e9
Summary:
The GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` clang-tidy check, as is `DEFINE_DISPATCH`, so the per-line `NOLINTNEXTLINE` suppressions for this check are removed (with a corresponding update to `.clang-tidy`).
All changes but the ones to `.clang-tidy` are generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`; do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os


def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55825
The mask has never been used (in vectorization we generate an explicit
`IfThenElse` construct when we need to mask out some elements). The PR
removes it and cleans up all its traces from tests.
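As an illustrative sketch of the masking idea (plain C++, not NNC IR): when a loop is vectorized by a width that does not divide the trip count, the tail lanes can be guarded with an explicit conditional rather than a mask operand attached to every Load/Store.
```cpp
#include <vector>

// Illustration only: each fixed-width chunk guards its lanes with an explicit
// conditional (the moral equivalent of an IfThenElse) instead of carrying a
// mask on the individual loads and stores.
void scale(std::vector<float>& data, float factor) {
  constexpr int kLanes = 4;
  const int n = static_cast<int>(data.size());
  for (int base = 0; base < n; base += kLanes) {
    for (int lane = 0; lane < kLanes; lane++) {
      const int i = base + lane;
      if (i < n) {  // explicit guard for the partial tail chunk
        data[i] = data[i] * factor;
      }
    }
  }
}
```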
Differential Revision: D27717776
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 41d1feeea4322da75b3999d661801c2a7f82b9db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48160
We no longer use the custom C++ test infra anyway, so move to pure
gtest.
Fixes #45703
ghstack-source-id: 116977283
Test Plan: `buck test //caffe2/test/cpp/tensorexpr`
Reviewed By: navahgar, nickgg
Differential Revision: D25046618
fbshipit-source-id: da34183d87465f410379048148c28e1623618553
Summary:
This is a rewrite of the Registerizer, supporting scalar replacement in *vastly* more situations. As a refresher, the registerizer does this:
Before:
```
A[0] = 0;
for (int x = 0; x < 10; x++) {
A[0] = (A[0]) + x;
}
```
After:
```
int A_ = 0;
for (int x = 0; x < 10; x++) {
A_ = x + A_;
}
A[0] = A_;
```
Which can greatly reduce the number of accesses to main memory in a kernel. There are cases where doing this gets complicated, and the existing implementation bails out whenever it encounters multiple partial overlaps of the same buffer, or conditional accesses under any circumstances. This makes it much less useful in the presence of complex (i.e. real-world, not example) kernels. This new version should work optimally in almost all cases (I have a few minor follow-ups).
I tested this version extensively, and found quite a few bugs in the original implementation that I'd prefer not to backport fixes for - so I'm in favor of landing this even if we don't immediately see a perf win. I believe the killer app for this kind of optimization is fused reductions, and we haven't enabled many examples of that yet.
It is safe to move two accesses of the same Tensor element to a local scalar Var if between all usages of the element there are no other Loads or Stores that may refer to it. In the comments I refer to this as overlapping the access, or "cutting" the existing AccessInfo. In the case where a candidate for registerization is cut, it may be possible to finalize the access early by writing it back to the Tensor and then creating a new scalar variable after the overlapping access is complete. We will attempt to do this when it saves memory accesses.
There are a few cases that make this more challenging:
- For: Loops multiply the number of real usages of a buffer by the loop extent, but only if we can pull the definition and finalization of the scalar variable out of the loop block. For loops often create accesses that are conditional on a loop var and overlap large ranges of elements.
E.g. Before:
```
A[0] = 2;
for (int x1 = 0; x1 < 10; x1++) {
A[0] = (A[0]) + x1;
}
for (int x2 = 1; x2 < 10; x2++) {
A[x2] = A[x2 - 1];
}
for (int x3 = 0; x3 < 10; x3++) {
A[0] = (A[0]) + x3;
}
```
After:
```
int A_1 = 2;
for (int x1 = 0; x1 < 10; x1++) {
A_1 = A_1 + x1;
}
A[0] = A_1;
for (int x2 = 1; x2 < 10; x2++) {
A[x2] = A[x2 - 1];
}
int A_2 = A[0];
for (int x3 = 0; x3 < 10; x3++) {
A_2 = A_2 + x3;
}
A[0] = A_2;
```
- Cond: Conditions complicate lifting scalars out of internal scopes. Generally we cannot lift an access outside of a conditional scope unless there is already a reference to that same access at the higher scope, since we don't know if the condition was guarding an array access not safe at the higher scope. In the comments I refer to this as the condition "hiding" the access, and the outer access "unhiding" it.
E.g. this example:
```
if (x<5 ? 1 : 0) {
A[x] = (A[x]) + 1;
}
A[x] = (A[x]) + 1;
if (x>5 ? 1 : 0) {
A[x] = (A[x]) + 1;
}
```
The A[x] access can be registerized due to the unconditional access between the two conditions:
```
int A_1 = A[x];
if (x<5 ? 1 : 0) {
A_1 = A_1 + 1;
}
A_1 = A_1 + 1;
if (x>5 ? 1 : 0) {
A_1 = A_1 + 1;
}
A[x] = A_1;
```
But this example has no accesses that can be registerized:
```
if (x<5 ? 1 : 0) {
A[x] = (A[x]) + 1;
}
if (x>5 ? 1 : 0) {
A[x] = (A[x]) + 1;
}
```
- IfThenElse: Same situation as Cond, except since IfThenElse is an Expr rather than a Stmt we cannot insert the scalar definition or finalizer within the conditional scope. Accesses inside an IfThenElse can be safely combined with external accesses but cannot exist completely within.
E.g. in this example `B[x]` cannot be registerized, as there is no safe place to define it.
```
A[x] = IfThenElse(x<3 ? 1 : 0, (B[x]) + (B[x]), B[x]);
```
But the equivalent kernel using Cond can be registerized:
```
if (x<3 ? 1 : 0) {
float B_1 = B[x];
A[x] = B_1 + B_1;
} else {
A[x] = B[x];
}
```
- Let: Accesses dependent on local variables via Let Stmts, or loop vars, cannot be raised outside of the scope of the dependent var.
E.g. no accesses in this example can be registerized:
```
for (int x = 0; x < 10; x++) {
int y = 30;
A[y] = x + (A[y]);
}
```
But they can in this example:
```
int y = 30;
for (int x = 0; x < 10; x++) {
A[y] = x + (A[y]);
}
```
**Testing**
The majority of this PR is tests, over 3k lines of them, because there are many different rules to consider and they can interact together more or less arbitrarily. I'd greatly appreciate any ideas for situations we could encounter that are not covered by the tests.
**Performance**
Still working on it, will update. In many FastRNNs sub-kernels this diff reduces the total number of calls to Store or Load by 4x, but since those kernels use Concat very heavily (meaning a lot of branches) the actual number encountered by any particular thread on GPU is reduced only slightly. Overall perf improved by a very small amount.
Reductions are where this optimization should really shine, and in particular, the more complex the kernel gets (with extra fusions, etc.), the better this version of the registerizer should do compared to the existing version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45574
Reviewed By: albanD
Differential Revision: D24151517
Pulled By: nickgg
fbshipit-source-id: 9f0b2d98cc213eeea3fda16fee3d144d49fd79ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45520
With this change `Load`s and `Store`s no longer accept `Placeholder`s in
their constructor and `::make` functions and can only be built with
`Buf`.
`Placeholder` gets its own `store`, `load`, `storeWithMask`, and
`loadWithMask` methods for more convenient construction.
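As a rough sketch of the shape of this API (simplified and hypothetical; the real NNC classes and signatures differ), the idea is that the `Placeholder` owns a `Buf` and builds `Load`/`Store` nodes from it, so callers never hand the `Placeholder` itself to those constructors:
```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for NNC IR types.
struct Expr {
  virtual ~Expr() = default;
};
struct Stmt {
  virtual ~Stmt() = default;
};
struct Buf {};
using ExprPtr = std::shared_ptr<Expr>;
using StmtPtr = std::shared_ptr<Stmt>;
using BufPtr = std::shared_ptr<Buf>;

struct Load : Expr {
  Load(BufPtr buf, std::vector<ExprPtr> indices)
      : buf(std::move(buf)), indices(std::move(indices)) {}
  BufPtr buf;
  std::vector<ExprPtr> indices;
};
struct Store : Stmt {
  Store(BufPtr buf, std::vector<ExprPtr> indices, ExprPtr value)
      : buf(std::move(buf)), indices(std::move(indices)), value(std::move(value)) {}
  BufPtr buf;
  std::vector<ExprPtr> indices;
  ExprPtr value;
};

// The Placeholder keeps the Buf and forwards construction to it, so Load and
// Store themselves only ever see a Buf.
class Placeholder {
 public:
  explicit Placeholder(BufPtr buf) : buf_(std::move(buf)) {}

  ExprPtr load(std::vector<ExprPtr> indices) const {
    return std::make_shared<Load>(buf_, std::move(indices));
  }
  StmtPtr store(std::vector<ExprPtr> indices, ExprPtr value) const {
    return std::make_shared<Store>(buf_, std::move(indices), std::move(value));
  }

 private:
  BufPtr buf_;
};
```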
Test Plan: Imported from OSS
Reviewed By: glaringlee
Differential Revision: D23998789
Pulled By: ZolotukhinM
fbshipit-source-id: 3fe018e00c1529a563553b2b215f403b34aea912
Summary:
Fixes a bug in the NNC registerizer for Cuda where it would hoist reads out of a conditional context when trying to cache them. As a quick fix, prevent scalar replacement if a usage is within a condition.
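A hypothetical C++ analogue of the hazard (not the NNC IR itself): hoisting a guarded read above its condition executes the read even when the guard is what made it safe.
```cpp
#include <vector>

// Buggy "registerized" form: the read of in[idx] has been hoisted above the
// guard, so it executes even when flag is false and idx may be out of range.
float hoisted(const std::vector<float>& in, int idx, bool flag) {
  float cached = in[idx];  // unconditional read -- unsafe
  float out = 0.0f;
  if (flag) {
    out = cached;
  }
  return out;
}

// Safe form (what the fix preserves): the read stays inside the condition,
// so scalar replacement is simply skipped for conditional usages.
float guarded(const std::vector<float>& in, int idx, bool flag) {
  float out = 0.0f;
  if (flag) {
    out = in[idx];
  }
  return out;
}
```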
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44223
Reviewed By: gchanan
Differential Revision: D23551247
Pulled By: nickgg
fbshipit-source-id: 17a7bf2be4c8c3dd8a9ab7997dce9aea200c3685
Summary:
Insert the registerizer into the Cuda Codegen pass list, to enable scalar replacement and close the gap in simple reduction performance.
First up the good stuff, benchmark before:
```
Column sum     Caffe2    NNC       Simple    Better
(10, 100)      5.7917    9.7037    6.9386    6.0448
(100, 100)     5.9338    14.972    7.1139    6.3254
(100, 10000)   21.453    741.54    145.74    12.555
(1000, 1000)   8.0678    122.75    22.833    9.0778

Row sum        Caffe2    NNC       Simple    Better
(10, 100)      5.4502    7.9661    6.1469    5.5587
(100, 100)     5.7613    13.897    21.49     5.5808
(100, 10000)   21.702    82.398    75.462    22.793
(1000, 1000)   22.527    129       176.51    22.517
```
After:
```
Column sum     Caffe2    NNC       Simple    Better
(10, 100)      6.0458    9.4966    7.1094    6.056
(100, 100)     5.9299    9.1482    7.1693    6.593
(100, 10000)   21.739    121.97    162.63    14.376
(1000, 1000)   9.2374    29.01     26.883    10.127

Row sum        Caffe2    NNC       Simple    Better
(10, 100)      5.9773    8.1792    7.2307    5.8941
(100, 100)     6.1456    9.3155    24.563    5.8163
(100, 10000)   25.384    30.212    88.531    27.185
(1000, 1000)   26.517    32.702    209.31    26.537
```
Speedup is about 3-8x depending on the size of the data (increasing with bigger inputs).
The gap between NNC and Simple is closed or eliminated - the remaining issue appears to be kernel launch overhead. Next up is getting us closer to the _Better_ kernel.
It required a lot of refactoring and bug fixes along the way:
* Refactored flattening of parallelized loops out of the CudaPrinter and into its own stage, so we can transform the graph in the stage between flattening and printing (where registerization occurs).
* Made the AtomicAddFuser less pessimistic: it now recognizes that if an Add to a buffer is dependent on all used Block and Thread vars then it has no overlap and does not need to be atomic. This allows registerization to apply to these stores (see the sketch after this list).
* Fixed the PrioritizeLoad mutator so that it does not attempt to separate a Store and Load to the same buffer (i.e. the reduction case).
* Moved CudaAnalysis earlier in the process, allowing later stages to use the analyzed bufs.
* Fixed a bug in the Registerizer where, when adding a default initializer statement, it would use the dtype of the underlying var (which is always kHandle) instead of the dtype of the Buf.
* Fixed a bug in the IRMutator where the logic for Allocate statements was inverted, so they were replaced only if they did not change.
* Added simplification of simple Division patterns to the IRSimplifier.
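To illustrate the AtomicAddFuser change referenced in the list above (a simplified, hypothetical helper rather than the actual pass): if every Block/Thread variable the kernel is parallelized over appears in the store's index, each thread writes a distinct element, so a plain add is safe; otherwise the atomicAdd must stay.
```cpp
#include <set>
#include <string>

// Hypothetical sketch: decide whether an accumulation into buf[index] needs
// atomicAdd. `index_vars` holds the Block/Thread variables appearing in the
// store's index expression; `parallel_vars` holds all Block/Thread variables
// the kernel is bound to.
bool needs_atomic_add(const std::set<std::string>& index_vars,
                      const std::set<std::string>& parallel_vars) {
  for (const std::string& v : parallel_vars) {
    if (index_vars.count(v) == 0) {
      // Some threads along `v` write the same element: keep the atomicAdd.
      return true;
    }
  }
  // Every parallel variable appears in the index, so each thread owns a
  // distinct element and a plain add suffices (and can be registerized).
  return false;
}
```
The real pass operates on IR expressions rather than name sets, but the overlap condition it checks is the one described above.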
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42878
Reviewed By: glaringlee
Differential Revision: D23382499
Pulled By: nickgg
fbshipit-source-id: 3640a98fd843723abad9f54e67070d48c96fe949
Summary:
Adds a new optimization pass, the Registerizer, which looks for common Stores and Loads to a single item in a buffer and replaces them with a local temporary scalar which is cheaper to write.
For example it can replace:
```
A[0] = 0;
for (int x = 0; x < 10; x++) {
A[0] = (A[0]) + x;
}
```
with:
```
int A_ = 0;
for (int x = 0; x < 10; x++) {
A_ = x + A_;
}
A[0] = A_;
```
This is particularly useful on GPUs when parallelizing, since after replacing loops with metavars we have a lot of accesses like this. Early tests of simple reductions on a V100 indicate this can speed them up by ~5x.
This diff got a bit unwieldy with the integration code, so that will come in a follow-up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42606
Reviewed By: bertmaher
Differential Revision: D22970969
Pulled By: nickgg
fbshipit-source-id: 831fd213f486968624b9a4899a331ea9aeb40180