Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69477
This diff adds a new run method to `TensorExprKernel` that takes output
tensors as inputs and stores the results in those tensors.
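A minimal usage sketch (assuming the new entry point is named `runWithAllocatedOutputs` after the test below and follows the same `Stack`-based calling convention as the existing `run`; both the name and the argument order here are assumptions, not a verified API):
```
// Sketch only: run a compiled kernel into caller-owned output tensors.
// runWithAllocatedOutputs and the input/output ordering are assumptions.
#include <torch/csrc/jit/tensorexpr/kernel.h>
#include <vector>

void runIntoPreallocatedOutputs(
    torch::jit::tensorexpr::TensorExprKernel& kernel,
    const std::vector<at::Tensor>& inputs,
    std::vector<at::Tensor>& outputs) {
  torch::jit::Stack stack;
  for (const auto& t : inputs) {
    stack.emplace_back(t);  // regular inputs
  }
  for (const auto& t : outputs) {
    stack.emplace_back(t);  // pre-allocated outputs; results are written in place
  }
  kernel.runWithAllocatedOutputs(stack);  // hypothetical call, per Kernel.RunWithAllocatedOutputs
}
```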
ghstack-source-id: 146107009
Test Plan: buck test mode/dev-nosan //caffe2/test/cpp/tensorexpr:tensorexpr -- --exact 'caffe2/test/cpp/tensorexpr:tensorexpr - Kernel.RunWithAllocatedOutputs'
Reviewed By: ZolotukhinM
Differential Revision: D32823890
fbshipit-source-id: edc1f4839785124048b034060feb71cb8c1be34f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68858
When executing with ir_eval, check for out-of-bounds indices.
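A hedged sketch of the kind of guard this implies; the helper below is illustrative and not the actual ir_eval code:
```
// Illustrative only: bounds-check a flattened index before a buffer
// load/store, as an IR interpreter might do.
#include <cstdint>
#include <stdexcept>
#include <string>

inline void checkIndexBounds(int64_t index, int64_t buf_size, const std::string& buf_name) {
  if (index < 0 || index >= buf_size) {
    throw std::runtime_error(
        "Index out of bounds in buffer '" + buf_name + "': " + std::to_string(index) +
        " not in [0, " + std::to_string(buf_size) + ")");
  }
}
```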
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D32657881
Pulled By: davidberard98
fbshipit-source-id: 62dd0f85bb182b34e9c9f795ff761081290f6922
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67861
Previously submitted as https://github.com/pytorch/pytorch/pull/67197.
This got reverted because its failures were hidden by the failures of
another PR.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D32178196
Pulled By: navahgar
fbshipit-source-id: cc8a5c68aed360d06289e69645461cfa773e1300
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67734
The implementation of the `aten::cat` op in NNC has to ignore tensors that have a size of 0 in any dimension.
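A hedged sketch of the filtering this implies, written against plain ATen tensors rather than the actual NNC lowering:
```
// Illustrative only: drop cat inputs that are empty in any dimension.
#include <ATen/ATen.h>
#include <vector>

std::vector<at::Tensor> dropEmptyCatInputs(const std::vector<at::Tensor>& inputs) {
  std::vector<at::Tensor> non_empty;
  for (const auto& t : inputs) {
    bool has_zero_dim = false;
    for (auto s : t.sizes()) {
      if (s == 0) {
        has_zero_dim = true;
        break;
      }
    }
    if (!has_zero_dim) {
      non_empty.push_back(t);
    }
  }
  return non_empty;
}
```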
Test Plan: `buck test mode/dev-nosan //caffe2/test/cpp/tensorexpr:tensorexpr -- --exact 'caffe2/test/cpp/tensorexpr:tensorexpr - Kernel.CatWithEmptyInputs'`
Reviewed By: ZolotukhinM
Differential Revision: D32122171
fbshipit-source-id: 90c697813bc504664673cdc262df6e7ce419c655
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66744
Modified loops in files under fbsource/fbcode/caffe2/ from the format
`for(TYPE var=x0;var<x_max;x++)`
to the format
`for(const auto var: irange(xmax))`
This was achieved by running r-barnes's loop upgrader script (D28874212) with some modifications to exclude all files under /torch/jit, plus a number of reversions and unused-variable warning suppressions added by hand.
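For illustration, the rewrite looks like this (`c10::irange` is the existing utility from `<c10/util/irange.h>`; the loop bodies are placeholders):
```
#include <c10/util/irange.h>
#include <cstdint>

void example(int64_t xmax) {
  // Before:
  for (int64_t var = 0; var < xmax; var++) {
    (void)var;  // ... use var ...
  }
  // After:
  for (const auto var : c10::irange(xmax)) {
    (void)var;  // ... use var ...
  }
}
```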
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D31705358
fbshipit-source-id: d6ea350cbaa8f452fc78f238160e5374be637a48
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234
Modified loops in files under fbsource/fbcode/caffe2/ from the format
`for(TYPE var=x0;var<x_max;x++)`
to the format
`for(const auto var: irange(xmax))`
This was achieved by running r-barnes's loop upgrader script (D28874212) with some modifications to exclude all files under /torch/jit, plus a number of reversions and unused-variable warning suppressions added by hand.
bypass_size_limit
allow-large-files
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D30652629
fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64717
This also exposed several bugs, which are fixed in this PR.
Differential Revision: D30826408
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: a67ec5739aceed9ffdf0d24f77eb3787cefe4560
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64627
This fixes the root cause of S242719
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D30801686
Pulled By: navahgar
fbshipit-source-id: b6d3ebdc7eb57116eaced53c2f35c7798bb17e80
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64609
We've been using exceptions to indicate whether vectorization succeeded
or not, but that posed some problems (e.g., we spent too much time
symbolicating these exceptions). This change converts this mechanism to
a standard error return code.
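A generic sketch of the pattern (not the actual vectorizer code): the throw-on-failure API becomes a status return, so callers probing for vectorizability no longer pay for throwing and symbolicating exceptions:
```
// Illustrative only: replace exception-based failure signaling with a
// return code that callers can check cheaply on hot paths.
#include <stdexcept>

// Before: failure is signaled by throwing.
void vectorizeOrThrow(bool can_vectorize) {
  if (!can_vectorize) {
    throw std::runtime_error("Unable to vectorize loop");
  }
  // ... perform vectorization ...
}

// After: failure is an ordinary return value.
bool tryVectorize(bool can_vectorize) {
  if (!can_vectorize) {
    return false;  // caller falls back to the scalar loop
  }
  // ... perform vectorization ...
  return true;
}
```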
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D30795342
Pulled By: ZolotukhinM
fbshipit-source-id: 16e38b37bcdd78ceb438ac814cc377f35b058e17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64077
We were assuming kernel dimensions fit in 32 bits (the old fuser made
this assumption too), but we should be able to support 64-bit dimensions.
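A minimal illustration of why 32 bits is not enough; the fix amounts to computing sizes and flattened indices in 64-bit integers throughout (the example numbers are hypothetical):
```
// Illustrative only: an element count for a large tensor overflows int32_t
// but fits comfortably in int64_t.
#include <cstdint>
#include <vector>

int64_t numElements(const std::vector<int64_t>& sizes) {
  int64_t n = 1;
  for (auto s : sizes) {
    n *= s;  // e.g. {70000, 70000} -> ~4.9e9 elements, > INT32_MAX
  }
  return n;
}
```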
ghstack-source-id: 136933272
Test Plan: unit tests; new IR level test with huge sizes
Reviewed By: ZolotukhinM
Differential Revision: D30596689
fbshipit-source-id: 23b7e393a2ebaecb0c391a6b1f0c4b05a98bcc94
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63923
The input graph can contain constants whose names contain special characters. So, all names of constants in the input graph need to be sanitized.
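A hedged sketch of what such sanitization could look like; the real NNC helper may differ in detail:
```
// Illustrative only: map characters that are not valid in a codegen
// identifier to '_' and avoid a leading digit.
#include <cctype>
#include <string>

std::string sanitizeName(const std::string& input) {
  std::string out;
  out.reserve(input.size());
  for (char c : input) {
    const bool ok = std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    out.push_back(ok ? c : '_');
  }
  if (out.empty() || std::isdigit(static_cast<unsigned char>(out[0]))) {
    out.insert(out.begin(), '_');
  }
  return out;
}
```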
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63990
Reviewed By: ZolotukhinM
Differential Revision: D30558432
Pulled By: navahgar
fbshipit-source-id: de5b0c23d50ee8997f40f2c0fc605dda3719186f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63776
I reverted this out of an abundance of caution because some test
failures occurred, but they were all due to precision issues fixed lower in
this stack. Let's try again.
I've rolled the elimination of the allow-parallelism-in-fusions toggle into
this diff since they're pretty tightly coupled.
ghstack-source-id: 136529847
Test Plan: CI
Reviewed By: huiguoo
Differential Revision: D30484555
fbshipit-source-id: 38fd33520f710585d1130c365a8c60c9ce794a59
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587
Now that there are no classes using KernelArena for memory management, we
can remove it.
Differential Revision: D30429115
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63586
This is another commit in the transition away from KernelArena memory management.
Tensor is essentially just a pair of <BufPtr, StmtPtr> and we don't need
to dynamically allocate it at all - it's cheap to pass it by value, and
that's what we're switching to in this commit.
After this change nothing uses KernelScope/KernelArena and they can be
safely removed.
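Conceptually, the post-change Tensor is a small value type along these lines (a simplified sketch, not the actual class; `BufPtr`/`StmtPtr` stand for the pointer aliases introduced earlier in this stack):
```
// Simplified sketch only: Tensor as a cheap-to-copy pair of the buffer it
// defines and the statement that computes it.
#include <memory>
#include <utility>

class Buf {};
class Stmt {};
using BufPtr = std::shared_ptr<Buf>;
using StmtPtr = std::shared_ptr<Stmt>;

class Tensor {
 public:
  Tensor(BufPtr buf, StmtPtr stmt) : buf_(std::move(buf)), stmt_(std::move(stmt)) {}
  BufPtr buf() const { return buf_; }
  StmtPtr stmt() const { return stmt_; }

 private:
  BufPtr buf_;
  StmtPtr stmt_;
};
```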
Differential Revision: D30429114
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: f90b859cfe863692b7beffbe9bd0e4143df1e819
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63195
This helps us to later switch from using KernelArena with raw pointers
to shared pointers without having to change all our source files at
once.
The changes are mechanical and should not affect any functionality.
With this PR, we're changing the following:
* `Add*` --> `AddPtr`
* `new Add(...)` --> `alloc<Add>(...)`
* `dynamic_cast<Add*>` --> `to<Add>`
* `static_cast<Add*>` --> `static_to<Add>`
Due to some complications with args forwarding, some places became more
verbose, e.g.:
* `new Block({})` --> `new Block(std::vector<ExprPtr>())`
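A hedged before/after illustration of the mechanical rewrite (the alias and `alloc<>` helper below are stand-ins; the real aliases initially wrapped raw pointers and were only later flipped to smart pointers):
```
// Illustrative only: funnel allocation and pointer spelling through an alias
// and a helper so the underlying representation can change in one place.
#include <memory>
#include <utility>

class Expr {
 public:
  virtual ~Expr() = default;
};
class Add : public Expr {};

using AddPtr = std::shared_ptr<Add>;  // phase 1 used Add*; phase 2 flips the alias

template <typename T, typename... Args>
std::shared_ptr<T> alloc(Args&&... args) {
  return std::make_shared<T>(std::forward<Args>(args)...);
}

void example() {
  // Before: Add* a = new Add();
  // After:
  AddPtr a = alloc<Add>();
  (void)a;
}
```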
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30292779
Pulled By: ZolotukhinM
fbshipit-source-id: 150301c7d2df56b608b035827b6a9a87f5e2d9e9
Summary:
The GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` check, as is `DEFINE_DISPATCH`.
All changes but the ones to `.clang-tidy` were generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`; do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60804
The lowerings are stored as a map c10::Symbol -> std::function, and the
signature of those functions matches the signature of
`computeOperandValue`. Custom lowerings take priority over the standard
ones, i.e. we can redefine how a given op is lowered.
In general, this feature is aimed at unblocking users whose models
contain ops that are not yet supported by NNC - it allows quickly adding
a custom lowering for a given op.
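A hedged sketch of such a registry; the types below are stand-ins (the real map is keyed by `c10::Symbol` and the callback signature follows `computeOperandValue`), so treat everything here as illustrative rather than the actual NNC API:
```
// Illustrative only: custom lowerings as a symbol -> callback map that is
// consulted before the built-in lowerings.
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

struct Tensor {};      // stand-in for tensorexpr::Tensor
struct ArgValue {};    // stand-in for the lowering's argument type
struct ExprHandle {};  // stand-in for shape expressions

using CustomLowering = std::function<Tensor(
    const std::vector<ArgValue>& inputs,
    const std::vector<ExprHandle>& output_shape)>;

// Keyed by op name here; the real map uses c10::Symbol.
std::unordered_map<std::string, CustomLowering> custom_lowerings;

void registerExample() {
  // Takes priority over the standard lowering for this op.
  custom_lowerings["aten::relu"] =
      [](const std::vector<ArgValue>&, const std::vector<ExprHandle>&) {
        return Tensor{};  // a real lowering would build the NNC tensor here
      };
}
```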
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D29409580
Pulled By: ZolotukhinM
fbshipit-source-id: e8e8dc9d3cb9155cfbf5c08a4216ba1b5b791a60
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59508
An assert that was triggering in a previous version is now relaxed to
take 0-dim tensors into account.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D28918342
Pulled By: ZolotukhinM
fbshipit-source-id: c09b62c9725d1603b0ec11fcc051e7c932af06ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59279
There were some issues with how we handle 0-dim cases in lowerings and
also in how we generate reductions in that special case. This PR fixes
those issues and reenables a bunch of tests.
Differential Revision: D28819780
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: f3feff35a1ce11821ada2f8d04ae9d4be10dc736
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57560
The new methods allow peeking into bufferArgs, which describe the
parameters that codegen expects. This description includes whether a
given parameter is a scalar var or a buffer and, if it is a buffer,
allows getting the corresponding `Buf*` pointer, from which we can
obtain the expected sizes.
Relanding #57074 which was reverted because I forgot to guard a new
test with `ifdef LLVM`.
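A hedged sketch of how a caller might use such accessors; the method and field names here are placeholders, not the verified `CodeGen` API:
```
// Illustrative only: walk codegen parameter descriptions to tell scalar vars
// from buffers and to recover a buffer's expected sizes.
#include <cstdio>
#include <vector>

struct Buf {
  std::vector<long> dims;  // expected sizes of the buffer
};

struct BufferArg {
  bool is_var = false;       // scalar var vs. buffer parameter
  const Buf* buf = nullptr;  // valid only when !is_var
};

void describeParams(const std::vector<BufferArg>& buffer_args) {
  for (const auto& arg : buffer_args) {
    if (arg.is_var) {
      std::printf("scalar parameter\n");
    } else {
      std::printf("buffer parameter with %zu dims\n", arg.buf->dims.size());
    }
  }
}
```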
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D28199048
Pulled By: ZolotukhinM
fbshipit-source-id: 636e838e7e242a3c63e97ec453b8fae9b6380231
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57552
This method uses `CodeGen::call_raw` instead of `CodeGen::call`.
Relanding #57328 (the entire stack) which was reverted because I forgot
to guard a new test with `ifdef LLVM`.
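A hedged sketch of the difference: `call_raw` takes untyped `void*` argument pointers instead of wrapped `CallArg`s, skipping per-call wrapping. The struct below is a stand-in, not the actual `CodeGen` class:
```
// Illustrative only: the raw-pointer fast path vs. the wrapped call path.
#include <vector>

struct CallArg {
  void* data = nullptr;  // the real CallArg wraps buffers and scalars
};

struct CodeGenLike {
  void call(const std::vector<CallArg>& /*args*/) { /* checked, wrapped path */ }
  void call_raw(const std::vector<void*>& /*args*/) { /* raw-pointer fast path */ }
};

void runFastSketch(CodeGenLike& cg, float* input, float* output) {
  std::vector<void*> args = {input, output};
  cg.call_raw(args);  // no per-invocation CallArg construction
}
```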
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D28195047
Pulled By: ZolotukhinM
fbshipit-source-id: bcfd3cb5b4f33a149b7549515ffd705e2c4f208f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57074
The new methods allow peeking into bufferArgs, which describe the
parameters that codegen expects. This description includes whether a
given parameter is a scalar var or a buffer and, if it is a buffer,
allows getting the corresponding `Buf*` pointer, from which we can
obtain the expected sizes.
Differential Revision: D28048289
Test Plan: Imported from OSS
Reviewed By: bertmaher
Pulled By: ZolotukhinM
fbshipit-source-id: 3867e862a0ec3593906820826c2344bd8a8f5c0a
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os
def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56324
Inlining is great if LLVM's CSE kicks in; but if a kernel has multiple outputs
(and thus multiple loops), CSE has no chance.
So, this pass "horizontally" fuses the output loops together so that CSE can go
to town. Essentially we want to turn
```
for (...) {
  output_1[] = some_complicated_expr...
}
for (...) {
  output_2[] = some_complicated_expr...
}
```
Into:
```
for (...) {
  output_1[] = complicated_expr
  output_2[] = complicated_expr // llvm cse should take care of this
}
```
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D27841194
Pulled By: bertmaher
fbshipit-source-id: 54153bb59786be87183c636d64f05963c4b1624a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56319
With this change the TorchScript graph can have constant tensors in it
and we will still be able to lower it to TE. The constants are
registered (or bound) within the `TensorExprKernel` object, and when the
codegen is called, they are passed along with the usual inputs and outputs.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D27838747
Pulled By: ZolotukhinM
fbshipit-source-id: 4a519d66fcc07fe5fa53f5cf9af28d25611f8437
Summary:
This PR adds an implementation for `aten::cat` in NNC without any conditionals. This version is not enabled by default.
Here is the performance of some micro benchmarks with and without conditionals. There is up to 50% improvement in performance without conditionals for some of the shapes.
aten::cat implementation in NNC **with** conditionals
```
$ python -m benchmarks.tensorexpr --device cpu --mode fwd --jit_mode trace --cpu_fusion concat
pt: concat2d2input_fwd_cpu_1_160_1_14_1: 5.44 us, SOL 0.26 GB/s, algorithmic 0.51 GB/s
pt: concat2d2input_fwd_cpu_1_580_1_174_1: 5.75 us, SOL 1.05 GB/s, algorithmic 2.10 GB/s
pt: concat2d2input_fwd_cpu_20_160_20_14_1: 6.87 us, SOL 4.05 GB/s, algorithmic 8.11 GB/s
pt: concat2d2input_fwd_cpu_20_580_20_174_1: 14.52 us, SOL 8.31 GB/s, algorithmic 16.62 GB/s
pt: concat2d2input_fwd_cpu_8_512_8_512_1: 9.58 us, SOL 6.84 GB/s, algorithmic 13.68 GB/s
```
aten::cat implementation in NNC **without** conditionals
```
$ python -m benchmarks.tensorexpr --device cpu --mode fwd --jit_mode trace --cpu_fusion --cat_wo_conditionals concat
pt: concat2d2input_fwd_cpu_1_160_1_14_1: 4.67 us, SOL 0.30 GB/s, algorithmic 0.60 GB/s
pt: concat2d2input_fwd_cpu_1_580_1_174_1: 5.65 us, SOL 1.07 GB/s, algorithmic 2.14 GB/s
pt: concat2d2input_fwd_cpu_20_160_20_14_1: 6.10 us, SOL 4.56 GB/s, algorithmic 9.12 GB/s
pt: concat2d2input_fwd_cpu_20_580_20_174_1: 7.44 us, SOL 16.22 GB/s, algorithmic 32.44 GB/s
pt: concat2d2input_fwd_cpu_8_512_8_512_1: 6.46 us, SOL 10.14 GB/s, algorithmic 20.29 GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53128
Reviewed By: bertmaher
Differential Revision: D26758613
Pulled By: navahgar
fbshipit-source-id: 00f56b7da630b42bc6e7ddd4444bae0cf3a5780a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52786
Previously, NNC did not sanitize input names. I ran into this in the next PR, where making subgraph creation preserve debug names caused a number of NNC CUDA failures. I also previously ran into this with some masked_fill failures internally, which led me to disable the operator.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D26696699
Pulled By: eellison
fbshipit-source-id: 7c3af4d559d58762fb8332666784a4d5cd6a4167
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52264
When CPU fusion is enabled without LLVM support in PyTorch, it causes a huge slowdown (> 50x). This PR makes the LLVM backend the default backend for TE. Now, an error will be reported if CPU fusion is enabled without LLVM support, to avoid this performance regression.
This PR also updates the tests to not use LLVM, so that the old flow continues to be exercised. This is necessary because the tests run in CI do not have LLVM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52314
Reviewed By: ejguan
Differential Revision: D26491294
Pulled By: navahgar
fbshipit-source-id: 74561db1207da805d6d28039450db046ba2988fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48264
Preserves the strided representation of NNC Tensor outputs by transforming them into the right layout at the end of the kernel.
Fix for https://github.com/pytorch/pytorch/issues/45604
Test Plan: Imported from OSS
Reviewed By: nikithamalgifb
Differential Revision: D25286213
Pulled By: eellison
fbshipit-source-id: 64d94ac463741e2568a1c9d44174e15ea26e511f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47813
We have some code paths that seem to handle dynamic sizes at kernel invocation, but I'm not sure how well they work, because other parts of our code base assume that tensor shapes are always fully specified. https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/kernel.cpp#L1572
As with some other PRs in the stack, I think it would be good to remove features that aren't on or actively being worked on while they are not used.
I initially did this PR to try to speed up perf. I couldn't observe much of a speed up, so we can decide to keep or drop this PR.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D25286212
Pulled By: eellison
fbshipit-source-id: 4ae66e0af88d649dd4e592bc78686538c2fdbaeb