Not only is this change usually shorter and more readable, it can also yield better performance: size() is not always a constant-time operation (for example, on linked lists), whereas empty() always is.
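For illustration only (a generic snippet, not a call site from this PR), the pattern being replaced looks like:
```
#include <list>

bool hasItems(const std::list<int>& l) {
  // Before: `return l.size() > 0;`
  // After: `empty()` expresses the intent directly and is always constant time.
  return !l.empty();
}
```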
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93236
Approved by: https://github.com/malfet
As we now live in a C++17 world, this is a functional no-op, just:
- `s/namespace at { namespace native {/namespace at::native {/`
- `s/namespace torch { namespace jit {/namespace torch::jit {/`
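For illustration, the kind of change those substitutions produce (a generic example, not a specific file from this PR):
```
// Before (pre-C++17 nested namespaces):
namespace at { namespace native {
void foo();
}} // namespace at::native

// After (C++17 nested namespace definition):
namespace at::native {
void foo();
} // namespace at::native
```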
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92100
Approved by: https://github.com/izaitsevfb
Apply clang-tidy fixups to prefer member initializers and modernize-pass-by-value. This is mostly a no-op, but it should make a few ctors slightly more readable and more efficient. It also adds some missing moves that prevent a lot of unnecessary copying.
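A generic sketch of the two patterns these checks rewrite (hypothetical class, not code from this PR):
```
#include <string>
#include <utility>

// Before: assignment in the ctor body plus pass-by-const-ref, which forces a
// copy even when the caller passes a temporary.
struct OptionsBefore {
  explicit OptionsBefore(const std::string& name) { name_ = name; }
  std::string name_;
};

// After: member initializer list plus pass-by-value and std::move, so a
// temporary argument is moved rather than copied.
struct OptionsAfter {
  explicit OptionsAfter(std::string name) : name_(std::move(name)) {}
  std::string name_;
};
```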
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91538
Approved by: https://github.com/ezyang
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887
BufHandle has exactly the same functionality and should be used instead.
Differential Revision: D30889483
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64077
We were assuming kernel dimensions fit in 32 bits (the old fuser made this assumption too), but we should be able to support 64-bit sizes.
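To illustrate the limitation (generic arithmetic, not code from this PR): once the number of elements exceeds INT32_MAX, index math done in 32 bits silently wraps, so dimension and index computations need 64-bit types.
```
#include <cstdint>
#include <iostream>

int main() {
  const int64_t numel = int64_t(70000) * 70000;         // ~4.9e9 elements
  const int32_t wrapped = static_cast<int32_t>(numel);  // no longer fits in 32 bits
  std::cout << numel << " vs " << wrapped << "\n";
  return 0;
}
```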
ghstack-source-id: 136933272
Test Plan: unit tests; new IR level test with huge sizes
Reviewed By: ZolotukhinM
Differential Revision: D30596689
fbshipit-source-id: 23b7e393a2ebaecb0c391a6b1f0c4b05a98bcc94
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63195
This helps us to later switch from using KernelArena with raw pointers
to shared pointers without having to change all our source files at
once.
The changes are mechanical and should not affect any functionality.
With this PR, we're changing the following:
* `Add*` --> `AddPtr`
* `new Add(...)` --> `alloc<Add>(...)`
* `dynamic_cast<Add*>` --> `to<Add>`
* `static_cast<Add*>` --> `static_to<Add>`
Due to some complications with args forwarding, some places became more
verbose, e.g.:
* `new Block({})` --> `new Block(std::vector<ExprPtr>())`
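A rough sketch of what the aliases and helpers behind this mapping might look like (simplified and hypothetical; the actual definitions differ, and at the time of this PR the `*Ptr` aliases may still wrap raw KernelArena pointers rather than `std::shared_ptr`):
```
#include <memory>
#include <utility>

class Expr { public: virtual ~Expr() = default; };
class Add : public Expr {};

// Node pointers become aliases so the ownership model can be changed in one place.
using ExprPtr = std::shared_ptr<Expr>;
using AddPtr = std::shared_ptr<Add>;

// `new Add(...)` --> `alloc<Add>(...)`
template <class T, class... Args>
std::shared_ptr<T> alloc(Args&&... args) {
  return std::make_shared<T>(std::forward<Args>(args)...);
}

// `dynamic_cast<Add*>(e)` --> `to<Add>(e)`
template <class T, class U>
std::shared_ptr<T> to(const std::shared_ptr<U>& e) {
  return std::dynamic_pointer_cast<T>(e);
}

// `static_cast<Add*>(e)` --> `static_to<Add>(e)`
template <class T, class U>
std::shared_ptr<T> static_to(const std::shared_ptr<U>& e) {
  return std::static_pointer_cast<T>(e);
}
```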
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30292779
Pulled By: ZolotukhinM
fbshipit-source-id: 150301c7d2df56b608b035827b6a9a87f5e2d9e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62336
This PR was generated by removing `const` for all types of nodes in NNC IR and fixing the compilation errors that resulted from this change.
This is the first step in making all NNC mutations in-place.
Test Plan: Imported from OSS
Reviewed By: iramazanli
Differential Revision: D30049829
Pulled By: navahgar
fbshipit-source-id: ed14e2d2ca0559ffc0b92ac371f405579c85dd63
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55825
The mask has never been used (in vectorization we generate an explicit
`IfThenElse` construct when we need to mask out some elements). The PR
removes it and cleans up all its traces from tests.
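For context, a hedged sketch of the explicit masking pattern mentioned above, written against the NNC expression API (signatures are from memory and may differ between versions; assumes `a` holds floats):
```
#include <torch/csrc/jit/tensorexpr/ir.h>

using namespace torch::jit::tensorexpr;

// Guard a tail element with an explicit select instead of a per-Load mask:
// value = (i < n) ? a[i] : 0.f
ExprHandle guardedLoad(const BufHandle& a, const VarHandle& i, const ExprHandle& n) {
  return IfThenElse::make(
      CompareSelect::make(i, n, kLT), // 1 if i < n, else 0
      Load::make(a, {i}),             // in-bounds: load the element
      FloatImm::make(0.f));           // out of bounds: a neutral value
}
```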
Differential Revision: D27717776
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 41d1feeea4322da75b3999d661801c2a7f82b9db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53137
Also, add casting to Int for Load and Store indices.
Fixes #52773.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D26760256
Pulled By: ZolotukhinM
fbshipit-source-id: a2d3141b17584724a5feabcabec25d0577b83a30
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52901
This PR implements an IR Verifier and adds a call to it in `LoopNest`
constructors. Checks that previously lived in expr/stmt constructors are now
moved to the corresponding `::make` functions or to the verifier. They didn't
really help in the constructors anyway: because of the way our memory
management works, an exception thrown from a constructor led to a segfault
(the object was not fully created but was already registered in the kernel
arena for destruction).
Fixes #52778.
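A hedged sketch of the kind of check that moves from a constructor into a `::make` helper or the verifier (hypothetical helper; the real verifier covers many more invariants):
```
#include <torch/csrc/jit/tensorexpr/exceptions.h>
#include <torch/csrc/jit/tensorexpr/expr.h>

using namespace torch::jit::tensorexpr;

// Validating inputs before any node is allocated means a failure throws a
// regular exception instead of leaving a half-constructed object registered
// in the kernel arena.
ExprHandle makeCheckedAdd(const ExprHandle& lhs, const ExprHandle& rhs) {
  if (lhs.dtype() != rhs.dtype()) {
    throw malformed_input("dtype mismatch in Add operands");
  }
  return lhs + rhs;
}
```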
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D26682928
Pulled By: ZolotukhinM
fbshipit-source-id: c56524015cdffb1ed8bce4394509961a4071dcfa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51594
ExternalCall nodes represent opaque calls to external functions that fill a
tensor (buffer) with values. They can be used to include nodes that are
otherwise not representable as TE, or whose TE representation is currently too
slow.
To make an external function available in NNC as ExternalCall, one needs to
implement a "bridge" function that would take raw (void*) pointers to the data
along with the arrays containing dimension info. This function would then
internally call the desired external function and make sure the results of the
call are correctly placed in the provided raw data buffers.
The reason the PR was previously reverted was that the LLVM-generated
calls to bridge functions were breaking unwind tables. This is now fixed
by requiring bridge functions to never throw and by setting the
corresponding attribute in the LLVM-generated code.
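A hedged sketch of the shape of such a bridge function (hypothetical name and parameter layout; the actual NNC external-functions API may differ in details):
```
#include <cstdint>

// Raw pointers plus dimension info in; results are written back into the
// provided output buffer. Declared noexcept so LLVM-generated call sites do
// not need unwind tables, per the fix described above.
extern "C" void nnc_my_external_op(
    int64_t bufs_num,     // number of buffers passed in
    void** buf_data,      // raw data pointer for each buffer
    int64_t* buf_ranks,   // rank of each buffer
    int64_t* buf_dims,    // concatenated dimension sizes of all buffers
    int8_t* buf_dtypes,   // dtype tag for each buffer
    int64_t args_num,     // number of extra scalar arguments
    int64_t* extra_args) noexcept {
  // Wrap the raw buffers, call the desired external function, and copy the
  // result into the output buffer's storage (e.g. buf_data[0]).
}
```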
Differential Revision: D26213882
Test Plan: Imported from OSS
Reviewed By: pbelevich, ngimel
Pulled By: ZolotukhinM
fbshipit-source-id: db954d8338e2d750c2bf0a41e88e38bd494f2945
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51475
ExternalCall nodes represent opaque calls to external functions that fill a
tensor (buffer) with values. They can be used to include nodes that are
otherwise not representable as TE, or whose TE representation is currently too
slow.
To make an external function available in NNC as ExternalCall, one needs to
implement a "bridge" function that would take raw (void*) pointers to the data
along with the arrays containing dimension info. This function would then
internally call the desired external function and make sure the results of the
call are correctly placed in the provided raw data buffers.
Test Plan: Imported from OSS
Reviewed By: pbelevich, Chillee
Differential Revision: D26179083
Pulled By: ZolotukhinM
fbshipit-source-id: 9e44de098ae94d25772cf5e2659d539fa6f3f659
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49357
This is a follow-up fix for PR #48679, which added support for integer
inputs to aten::abs by promoting integers to float and then demoting the
result back to integers. This PR supports integer inputs to aten::abs more
efficiently in the SimpleIREvaluator by implementing integer support for
kAbs (renamed from kFabs).
- Rename kFabs to kAbs
- Add support for integer inputs to kAbs in the SimpleIREvaluator (note that
llvm_codegen and cuda_codegen already support integer inputs to kAbs)
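A hedged sketch of the dtype-aware dispatch this enables (generic code, not the actual SimpleIREvaluator implementation):
```
#include <cmath>
#include <type_traits>

// kAbs on an integer operand can now stay in the integer domain instead of
// taking a float round-trip; floating-point operands keep using std::fabs.
template <typename T>
T evalAbs(T v) {
  if constexpr (std::is_integral_v<T>) {
    return v < 0 ? -v : v;  // integer abs, no promotion to float
  } else {
    return std::fabs(v);    // floating-point abs
  }
}
```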
Test Plan:
- `PYTORCH_TENSOREXPR_DONT_USE_LLVM=1 python test/test_jit_fuser_te.py
TestTEFuser.test_unary_ops`
- `python test/test_jit_fuser_te.py TestTEFuser.test_unary_ops`
Imported from OSS
Reviewed By: eellison
Differential Revision: D25545791
fbshipit-source-id: e52f51a352d149f66ce8341fb3beb479be08a230
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45520
With this change `Load`s and `Store`s no longer accept `Placeholder`s in
their constructor and `::make` functions and can only be built with
`Buf`.
`Placeholder` gets its own `store`, `load`, `storeWithMask`, and
`loadWithMask` methods for more convenient construction.
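A hedged usage sketch of the new convenience methods (made-up names and sizes; exact signatures may differ between versions):
```
// Assumes the relevant NNC tensorexpr headers are included.
using namespace torch::jit::tensorexpr;

void example() {
  VarHandle i("i", kInt);
  Placeholder a("a", kFloat, {ExprHandle(64)});

  ExprHandle v = a.load(i);   // builds a Load against a's underlying Buf
  auto st = a.store({i}, v);  // builds a Store against a's underlying Buf
  (void)st;                   // suppress unused-variable warning in this sketch
}
```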
Test Plan: Imported from OSS
Reviewed By: glaringlee
Differential Revision: D23998789
Pulled By: ZolotukhinM
fbshipit-source-id: 3fe018e00c1529a563553b2b215f403b34aea912
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45388
Classes defined in these files are closely related, so it is reasonable
to have them all in one file. The change is purely a code move.
Differential Revision: D23952867
Test Plan: Imported from OSS
Reviewed By: nickgg
Pulled By: ZolotukhinM
fbshipit-source-id: 12cfaa968bdfc4dff00509e34310a497c7b59155
Summary:
Adds a new optimization pass, the Registerizer, which looks for common Stores and Loads to a single item in a buffer and replaces them with a local temporary scalar, which is much cheaper to access.
For example, it can replace:
```
A[0] = 0;
for (int x = 0; x < 10; x++) {
  A[0] = (A[0]) + x;
}
```
with:
```
int A_ = 0;
for (int x = 0; x < 10; x++) {
  A_ = x + A_;
}
A[0] = A_;
```
This is particularly useful on GPUs when parallelizing, since after replacing loops with metavars we have a lot of accesses like this. Early tests of simple reductions on a V100 indicate this can speed them up by ~5x.
This diff got a bit unwieldy with the integration code, so that will come in a follow-up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42606
Reviewed By: bertmaher
Differential Revision: D22970969
Pulled By: nickgg
fbshipit-source-id: 831fd213f486968624b9a4899a331ea9aeb40180
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36611
Buf represents the underlying storage, but until now it didn't have a dtype.
That resulted in dtypes being specified in different places, with no
mechanism to enforce consistency: e.g. one could have created a kFloat
expression and used a kInt buffer to store its result. Now we're
centralizing where the storage-related logic lives, and we can start
enforcing semantic rules.
Follow-ups: we can merge the Buffer and BufHandle classes, as the former is
now a mere wrapper over the latter.
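A generic sketch (not NNC code) of the invariant this change makes enforceable: the storage object now owns its dtype, so a mismatched store can be rejected in one place.
```
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

enum class Dtype { kInt, kFloat };

struct Buf {
  std::string name;
  std::vector<int64_t> dims;
  Dtype dtype;  // the dtype now lives on the buffer itself
};

struct Expr {
  Dtype dtype;
};

// With the dtype attached to Buf, storing a kFloat expression into a kInt
// buffer can be detected at construction time.
void checkStore(const Buf& buf, const Expr& value) {
  if (buf.dtype != value.dtype) {
    throw std::runtime_error("store dtype does not match buffer dtype");
  }
}
```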
Test Plan: Imported from OSS
Differential Revision: D21027356
Pulled By: ZolotukhinM
fbshipit-source-id: c06aa2c4077fdcde3bb4ca622d324aece79b5a9c
Summary:
Second attempt at the reduction frontend for the TensorExpr compiler. It has two APIs: a simple version for common reduction types, and a customizable Reducer frontend which allows specifying the initializer, the reduction interaction, and the body via lambdas.
The simple API looks like this:
```
Buffer b(BufHandle("b", {10}), kInt);
Tensor* c = Reduce("sum", {}, Sum(b), {{10, "m"}});
```
An example of specializing a Sum to do Matmul:
```
Buffer tA(BufHandle("tA", {M, K}), kFloat);
Buffer tB(BufHandle("tB", {K, N}), kFloat);
Sum matmul([&](ParameterList& v) {
  ExprHandle m = v[0];
  ExprHandle n = v[1];
  ExprHandle k = v[2];
  return tA(m, k) * tB(k, n);
});
Tensor* mm = Reduce("mm", {{M, "m"}, {N, "n"}}, matmul, {{K, "k"}});
```
A fully specialized Reduction:
```
VarHandle searchValue("searchValue", kInt);
Buffer b(BufHandle("b", {4, 10}), kInt);
Reducer anyEqSV(
    ExprHandle(0),
    [](ExprHandle a, ExprHandle b) {
      return CompareSelect::make(a, 1, 1, b, kEQ);
    },
    [&](ParameterList& v) {
      return CompareSelect::make(b.call(v), searchValue, kEQ);
    });
Tensor* any = Reduce("anyEqual", {{4, "i"}}, anyEqSV, {{10, "j"}});
```
---
Until lowering, Reductions are held in a compound form for easier optimization:
```
VarHandle m("m", kInt);
Buffer b(BufHandle("b", {2, 3, m}), kFloat);
Tensor* c = Reduce("sum", {{2, "l"}, {3, "n"}}, Sum(b), {{m, "m"}});
LoopNest loop({c});
std::cout << *loop.root_stmt() << "\n";
```
```
for (int l = 0; l < 2; l++) {
  for (int n = 0; n < 3; n++) {
    for (int m = 0; m < m_1; m++) {
      sum[l, n] = ReduceOp(sum[l, n] = float(0);, (sum[l, n]) + (b[l, n, m]), {m});
    }
  }
}
```
```
loop.prepareForCodegen();
std::cout << *loop.root_stmt() << "\n";
```
```
for (int l = 0; l < 2; l++) {
  for (int n = 0; n < 3; n++) {
    sum[(0 + l * (1 * 3)) + n * 1] = float(0);
    for (int m = 0; m < m_1; m++) {
      sum[(0 + l * (1 * 3)) + n * 1] = (sum[(0 + l * (1 * 3)) + n * 1]) + (b[((0 + l * ((1 * m_1) * 3)) + n * (1 * m_1)) + m * 1]);
    }
  }
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35866
Differential Revision: D20965577
Pulled By: nickgg
fbshipit-source-id: afe506c90db794447180056417013bcaf0e2c049
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35800
This PR includes the following changes:
* Introduce a new `Expr` type `Buf`: it plays a role similar to `Var`, but also has dimensions.
* Use the new `Buf` class in `Store` and `Load` instead of `Var` for specifying where to store to or load from. `Buf` contains the dimensions info of the buffer we're loading/storing to and hence we are able to keep N-d indexes without flattening them into a 1-d index ([x,y] vs [x+y*W]).
* Flattening of the indexes is now a separate pass that is executed in `LoopNest::prepareForCodegen` - backends still expect indexes to be flattened, and this PR preserves that.
* `Tensor` now contains a `Buf` instead of `Var`, and thus Tensor now has the dimensions info (previously it was a property of a `Function`, not a `Tensor`). This brings us closer to Tensor being a combination of Buffer + Function, where Buffer specifies iteration domain and the Function defines a computation.
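As a hedged aside on the flattening mentioned above (generic arithmetic, not NNC code), the separate pass turns an N-d index into the equivalent 1-d offset, e.g.:
```
#include <cstdint>

// Matches the [x, y] vs [x + y * W] example from the description: the 2-d
// index (x, y) over a buffer of width W becomes the flat offset x + y * W.
int64_t flattenIndex2d(int64_t x, int64_t y, int64_t W) {
  return x + y * W;
}
```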
TODOs:
* Consider merging `Buffer` with `Buf` or `BufHandle`. It seems that we don't need all of them.
* Harden the logic of how we create buffers in the fuser pass. Currently it seems that sometimes we don't set dimensions.
* Use `Buf` in `Allocate` and `Free`.
* Make it clearer that `Function` doesn't "own" dimensions info and that dimensions are a property of a Tensor, not a Function.
Differential Revision: D20789005
Test Plan: Imported from OSS
Reviewed By: zheng-xq
Pulled By: ZolotukhinM
fbshipit-source-id: e04188d1d297f195f1c46669c614557d6bb6cde4