Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57552
This method uses `CodeGen::call_raw` instead of `CodeGen::call`.
Relanding #57328 (the entire stack), which was reverted because I forgot
to guard a new test with `#ifdef LLVM`.
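For context, a minimal sketch of the difference between the two entry points, assuming simplified signatures (the real interface lives in torch/csrc/jit/tensorexpr/codegen.h):
```
// Sketch only, not the actual NNC declarations. CodeGen::call boxes every
// argument in a CodeGen::CallArg before dispatch; CodeGen::call_raw takes
// the underlying void* pointers directly and skips the per-call boxing.
#include <torch/csrc/jit/tensorexpr/codegen.h>
#include <vector>

void run_with_call(torch::jit::tensorexpr::CodeGen* cg,
                   float* a, float* b, float* out) {
  using CG = torch::jit::tensorexpr::CodeGen;
  std::vector<CG::CallArg> args = {a, b, out};  // each pointer boxed in a CallArg
  cg->call(args);
}

void run_with_call_raw(torch::jit::tensorexpr::CodeGen* cg,
                       float* a, float* b, float* out) {
  std::vector<void*> args = {a, b, out};  // raw pointers, no wrapping
  cg->call_raw(args);
}
```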
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D28195047
Pulled By: ZolotukhinM
fbshipit-source-id: bcfd3cb5b4f33a149b7549515ffd705e2c4f208f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57383
Notes: I picked up an activation from https://github.com/pytorch/pytorch/issues/56969. You can look at the [activations.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/Activation.cpp#L429) file, which has both the forward and backward kernel code, to help you write the NNC lowering and the symbolic gradient.
I added a test in test_jit_fuser_te for the fusion, and I added an OpInfo that asserts we see autodiffable nodes, to test the symbolic gradient.
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D28197820
Pulled By: eellison
fbshipit-source-id: 05305d85c5bb0847c8f911b95ba47b137dca7e90
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57169
The pass is planned to be used in the AOT pipeline, where we expect input
graphs to be functional. As such, these graphs should not use the 'self'
argument even if it is present, and thus it can be removed safely.
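A minimal sketch of the idea, using the standard torch::jit::Graph API (the actual pass in this PR may differ in details):
```
#include <torch/csrc/jit/ir/ir.h>

// In a functional graph the module 'self' input has no uses, so the first
// input can be dropped once we verify nothing references it.
void removeUnusedSelfArgument(const std::shared_ptr<torch::jit::Graph>& graph) {
  if (graph->inputs().empty()) {
    return;
  }
  torch::jit::Value* self = graph->inputs()[0];
  if (self->hasUses()) {
    return;  // not functional after all; leave the graph alone
  }
  graph->eraseInput(0);
}
```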
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D28128328
Pulled By: ZolotukhinM
fbshipit-source-id: a7dfbf7776682826100c8eb0fef982a2e81c2554
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57076
This pass is intended to be used in conjunction with the shape propagation
pass: first we use sample inputs to specify shape info for the graph inputs,
and then we run shape-prop to infer the shapes of intermediate values in the
graph.
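A minimal sketch of how the two steps compose, assuming the public JIT shape-analysis entry point (exact names in this PR may differ):
```
#include <torch/csrc/jit/ir/ir.h>
#include <torch/csrc/jit/passes/shape_analysis.h>

void annotateAndPropagate(const std::shared_ptr<torch::jit::Graph>& graph,
                          const std::vector<at::Tensor>& sample_inputs) {
  // Step 1: specify shape info for the graph inputs from sample tensors.
  for (size_t i = 0; i < sample_inputs.size(); ++i) {
    graph->inputs()[i]->setType(c10::TensorType::create(sample_inputs[i]));
  }
  // Step 2: run shape propagation to infer shapes of intermediate values.
  torch::jit::PropagateInputShapes(graph);
}
```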
Differential Revision: D28048290
Test Plan: Imported from OSS
Reviewed By: astaff
Pulled By: ZolotukhinM
fbshipit-source-id: 778d772e873d59d77af9f669f45dc44b9ee5e443
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56679
Moved lowerings out of the `TensorExprKernel` and into independent functions.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D28082921
Pulled By: Chillee
fbshipit-source-id: af530510957ed4aa8b64dcc77ca36b69866d8000
Summary:
In my last PR I missed the CUDA and distributed folders; this fixes that now.
This change is autogenerated by `python tools/clang_tidy.py -s`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57235
Reviewed By: janeyx99
Differential Revision: D28084444
Pulled By: malfet
fbshipit-source-id: bf222f69ee90c7872c3cb0931e8cdb84f0cb3cda
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os


def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56324
Inlining is great if LLVM's CSE kicks in, but if a kernel has multiple outputs
(and thus multiple loops), CSE has no chance.
So, this pass "horizontally" fuses the output loops together so that CSE can go
to town. Essentially we want to turn
```
for (...) {
  output_1[] = some_complicated_expr...
}
for (...) {
  output_2[] = some_complicated_expr...
}
```
Into:
```
for (...) {
  output_1[] = some_complicated_expr...
  output_2[] = some_complicated_expr...  // LLVM CSE should take care of this
}
```
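A standalone C++ illustration (not NNC code) of why the fused form is CSE-friendly: once both stores live in one loop body, the shared subexpression can be computed once and reused.
```
#include <cmath>
#include <cstddef>

// Fused form: `e` plays the role of "some_complicated_expr" above; the
// compiler (or LLVM's CSE) computes it once per iteration for both outputs.
void fused(const float* a, float* out1, float* out2, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    float e = std::exp(a[i]) * std::tanh(a[i]);
    out1[i] = e + 1.0f;  // first output reuses e...
    out2[i] = e * 2.0f;  // ...and so does the second, with no recomputation
  }
}
```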
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D27841194
Pulled By: bertmaher
fbshipit-source-id: 54153bb59786be87183c636d64f05963c4b1624a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56319
With this change the TorchScript graph can have constant tensors in it,
and we will still be able to lower it to TE. The constants are
registered (or bound) within the `TensorExprKernel` object, and when the
codegen is called, they are passed along with the usual inputs and outputs.
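A minimal sketch of the resulting calling convention; `constants_` and `codegen_` are simplified stand-ins for the real bookkeeping inside `TensorExprKernel`:
```
#include <ATen/ATen.h>
#include <torch/csrc/jit/tensorexpr/codegen.h>
#include <vector>

// Sketch only: bound constants are appended to the argument list alongside
// the usual inputs and outputs when invoking the generated code.
void runKernel(torch::jit::tensorexpr::CodeGen* codegen_,
               const std::vector<at::Tensor>& inputs,
               const std::vector<at::Tensor>& constants_,
               at::Tensor& output) {
  std::vector<void*> args;
  for (const at::Tensor& in : inputs) {
    args.push_back(in.data_ptr());      // usual inputs
  }
  for (const at::Tensor& c : constants_) {
    args.push_back(c.data_ptr());       // bound constant tensors
  }
  args.push_back(output.data_ptr());    // outputs
  codegen_->call_raw(args);             // all buffers passed together
}
```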
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D27838747
Pulled By: ZolotukhinM
fbshipit-source-id: 4a519d66fcc07fe5fa53f5cf9af28d25611f8437
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56289
While there's no reason to think non-float32 conv2ds *don't* work,
they're only tested in float32 now. Since that's the most important use case,
I'd rather restrict the dtypes than spend time testing all the weird dtype
combinations that could possibly happen.
ghstack-source-id: 126755549
Test Plan: unit tests
Reviewed By: navahgar
Differential Revision: D27828495
fbshipit-source-id: fcf179207f2c9b20e0e86eb2b85687517d87063c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54605
For small sizes we generate a naive 3-layer loopnest; for bigger sizes
we generate an external call.
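For reference, a plain C++ sketch of the kind of naive 3-layer loopnest generated for small sizes (row-major M×K times K×N):
```
// Naive matmul: three nested loops, no tiling or vectorization.
void naive_matmul(const float* a, const float* b, float* c,
                  int M, int N, int K) {
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) {
        acc += a[i * K + k] * b[k * N + j];
      }
      c[i * N + j] = acc;
    }
  }
}
```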
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D27298364
Pulled By: ZolotukhinM
fbshipit-source-id: 2ddf275ff68d6fca16a3befca5ce5c26aef462b5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55970
LLVM's support for float16 is not great, and we were seeing assertion
failures trying to generate code for vectorized uses. I note that clang
doesn't even try to vectorize operations involving half:
https://gcc.godbolt.org/z/86MW4xr17, so that's a good sign we shouldn't either.
Fixes #55905
ghstack-source-id: 126511474
Test Plan: pytest test_jit_fuser_te.py -k test_isnan
Reviewed By: asuhan
Differential Revision: D27752279
Pulled By: bertmaher
fbshipit-source-id: ac115080bf2a4a73d52b396d64a5bce0cf13abfe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55825
The mask has never been used (in vectorization we generate an explicit
`IfThenElse` construct when we need to mask out some elements). The PR
removes it and cleans up all its traces from tests.
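A minimal sketch of the explicit masking construct mentioned above, using NNC expression handles (simplified; the actual vectorizer code differs):
```
#include <torch/csrc/jit/tensorexpr/expr.h>
#include <torch/csrc/jit/tensorexpr/ir.h>

using namespace torch::jit::tensorexpr;

// Elements at or past `n` are masked to 0 with an explicit IfThenElse
// expression rather than a per-Load/Store mask operand.
ExprHandle maskedLoad(const BufHandle& buf,
                      const VarHandle& i,
                      const ExprHandle& n) {
  return IfThenElse::make(
      CompareSelect::make(i, n, CompareSelectOperation::kLT),
      Load::make(buf, {i}),
      FloatImm::make(0.0f));
}
```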
Differential Revision: D27717776
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 41d1feeea4322da75b3999d661801c2a7f82b9db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55213
Adds the integration of conv2d with the TE fuser. A few things of interest:
- I'm *super* selective of what convs get lowered. Only 3x3 depthwise, because
I've benchmarked those to death and I'm pretty sure it's a good change.
- I'm allowing single-node "fusion" groups for supported convs. (Maybe this is
a sign that conv2d codegen should go through a different path entirely, but
it seems to basically work).
I'll share the full benchmark results once I clean them up a little. To
summarize, I tested the following torchvision models containing depthwise
convolutions. Results are single-core on a skylake-avx512:
mobilenet_v2: 8% improvement
mobilenet_v3: 9% improvement
mnasnet: 10% improvement
shufflenet: 18% improvement
Note these numbers compare against a baseline with a fast-but-buggy grouped
convolution implementation in MKLDNN. So the perf results will look better if
compared on master, but I'm going to assume the MKLDNN bug will be fixed and
the implementation re-enabled.
Perf results are more complicated when comparing to freezing plus conversion to
MKLDNN layout; mobilenet v2/v3 are still faster, but mnasnet and shufflenet are
not. Landing this doesn't prevent MKLDNN freezing from kicking in though, so
there's no harm (although landing MKLDNN freezing will regress mobilenet, but
c'est la vie).
ghstack-source-id: 126076112
Test Plan: New unit test, plus torchvision
Reviewed By: ZolotukhinM
Differential Revision: D27530272
fbshipit-source-id: 92153fad234bc9f1eaa4f7624c543168d1294a87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54439
For now the only way to represent conv2d in TE is via an external call,
and since the aten library doesn't have an out variant for conv2d, the
external call has to perform an extra copy. Because of that, fusing
conv2d currently regresses performance and hence is disabled. However, in
the near future we should have two alternative ways to enable it:
1) represent conv2d natively in TE (without an external call)
2) add an out variant for conv2d
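For reference, a minimal sketch of the extra copy the external call pays today; names like `nnc_out_buf` are illustrative:
```
#include <ATen/ATen.h>
#include <cstring>

// Without an out variant, aten::conv2d allocates a fresh result tensor,
// so the external call must copy it into the buffer that NNC owns.
void external_conv2d(void* nnc_out_buf, const at::Tensor& input,
                     const at::Tensor& weight, const at::Tensor& bias) {
  at::Tensor tmp = at::conv2d(input, weight, bias);
  std::memcpy(nnc_out_buf, tmp.data_ptr(), tmp.nbytes());  // the extra copy
}
```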
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D27237045
Pulled By: ZolotukhinM
fbshipit-source-id: f5545ff711b75f9f37bc056316d1999a70043b4c
Summary:
* Replacing vector of Tensors with a set of output buffers in `TensorExprKernel`.
* Creating a block statement while compiling in `TensorExprKernel`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53688
Reviewed By: mrshenli
Differential Revision: D26941222
Pulled By: navahgar
fbshipit-source-id: 9eb81ec2effcdeafbeaa67d1e12475166054f80f
Summary:
This PR adds an implementation for `aten::cat` in NNC without any conditionals. This version is not enabled by default.
Here is the performance of some micro benchmarks with and without conditionals. There is up to a 50% performance improvement without conditionals for some of the shapes.
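Conceptually, the two lowering strategies differ as in the following plain C++ sketch (not NNC code) of a 1-D cat of two inputs of sizes n1 and n2:
```
#include <cstddef>

// With conditionals: a single loop over the output, with a per-element
// branch selecting which input to read from.
void cat_with_conditionals(const float* a, size_t n1,
                           const float* b, size_t n2, float* out) {
  for (size_t i = 0; i < n1 + n2; ++i) {
    out[i] = (i < n1) ? a[i] : b[i - n1];
  }
}

// Without conditionals: one branch-free loop per input, each writing to
// its own offset in the output.
void cat_without_conditionals(const float* a, size_t n1,
                              const float* b, size_t n2, float* out) {
  for (size_t i = 0; i < n1; ++i) out[i] = a[i];
  for (size_t i = 0; i < n2; ++i) out[n1 + i] = b[i];
}
```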
aten::cat implementation in NNC **with** conditionals
```
$ python -m benchmarks.tensorexpr --device cpu --mode fwd --jit_mode trace --cpu_fusion concat
pt: concat2d2input_fwd_cpu_1_160_1_14_1: 5.44 us, SOL 0.26 GB/s, algorithmic 0.51 GB/s
pt: concat2d2input_fwd_cpu_1_580_1_174_1: 5.75 us, SOL 1.05 GB/s, algorithmic 2.10 GB/s
pt: concat2d2input_fwd_cpu_20_160_20_14_1: 6.87 us, SOL 4.05 GB/s, algorithmic 8.11 GB/s
pt: concat2d2input_fwd_cpu_20_580_20_174_1: 14.52 us, SOL 8.31 GB/s, algorithmic 16.62 GB/s
pt: concat2d2input_fwd_cpu_8_512_8_512_1: 9.58 us, SOL 6.84 GB/s, algorithmic 13.68 GB/s
```
aten::cat implementation in NNC **without** conditionals
```
$ python -m benchmarks.tensorexpr --device cpu --mode fwd --jit_mode trace --cpu_fusion --cat_wo_conditionals concat
pt: concat2d2input_fwd_cpu_1_160_1_14_1: 4.67 us, SOL 0.30 GB/s, algorithmic 0.60 GB/s
pt: concat2d2input_fwd_cpu_1_580_1_174_1: 5.65 us, SOL 1.07 GB/s, algorithmic 2.14 GB/s
pt: concat2d2input_fwd_cpu_20_160_20_14_1: 6.10 us, SOL 4.56 GB/s, algorithmic 9.12 GB/s
pt: concat2d2input_fwd_cpu_20_580_20_174_1: 7.44 us, SOL 16.22 GB/s, algorithmic 32.44 GB/s
pt: concat2d2input_fwd_cpu_8_512_8_512_1: 6.46 us, SOL 10.14 GB/s, algorithmic 20.29 GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53128
Reviewed By: bertmaher
Differential Revision: D26758613
Pulled By: navahgar
fbshipit-source-id: 00f56b7da630b42bc6e7ddd4444bae0cf3a5780a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53283
We had `ShapeArg` and `KernelArg` classes, which were wrappers over
`BufferArg` without adding any new functionality on top of what already
existed. This PR removes them and replaces their uses with `BufferArg`s
directly.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D26821993
Pulled By: ZolotukhinM
fbshipit-source-id: d1f95ea069b9f38f1d32424464551df2565b3c49
Summary:
This PR builds an aggregate stmt for all the tensors in the kernel before constructing the LoopNest. This migrates to using the LoopNest constructor that takes in a stmt and output buffers, and is one more step toward eliminating the dependency of LoopNest on Tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53024
Reviewed By: H-Huang
Differential Revision: D26729221
Pulled By: navahgar
fbshipit-source-id: 43e972585351f6902c14b383b137aaaee3aaa3e1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52786
Previously, NNC did not sanitize input names. I ran into this in the next PR, where making subgraph creation preserve debug names caused a number of NNC CUDA failures. I also previously ran into this with some masked_fill failures internally, which led me to disable the operator.
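A minimal sketch of what sanitizing a name involves, assuming the goal is a C-like identifier acceptable to the CUDA/LLVM backends; `sanitizeName` and its exact rules are illustrative, not the PR's implementation:
```
#include <cctype>
#include <string>

std::string sanitizeName(const std::string& name) {
  std::string out;
  for (char c : name) {
    // keep letters, digits, and underscores; replace everything else
    bool ok = std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    out += ok ? c : '_';
  }
  if (out.empty() || std::isdigit(static_cast<unsigned char>(out[0]))) {
    out = "v" + out;  // identifiers cannot start with a digit
  }
  return out;
}
```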
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D26696699
Pulled By: eellison
fbshipit-source-id: 7c3af4d559d58762fb8332666784a4d5cd6a4167
Summary:
Remove DepTracker, the dependency tracker that works on Tensors, from LoopNest. This is essential to the goal of removing Tensors from LoopNest.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52405
Reviewed By: heitorschueroff
Differential Revision: D26548621
Pulled By: navahgar
fbshipit-source-id: b20f23d608c19ac71aebd31c14777d653eead36c
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52264
When CPU fusion is enabled without LLVM support in PyTorch, it causes a huge slowdown (>50x). This PR makes the LLVM backend the default backend for TE. Now, an error is reported if CPU fusion is enabled without LLVM support, to avoid this performance regression.
This PR also updates the tests to not require LLVM, so that the old flow continues to be exercised. This is necessary because the tests run in CI do not have LLVM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52314
Reviewed By: ejguan
Differential Revision: D26491294
Pulled By: navahgar
fbshipit-source-id: 74561db1207da805d6d28039450db046ba2988fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50229
`fastmod -m 'cast(<((at|c10)::)?\w+Type>\(\)\s*)->' 'castRaw${1}->'`
Presuming it builds, this is a safe change: the result of `cast()` wasn't
being saved anywhere, so we didn't need it, and we can use a raw pointer
instead of a new `shared_ptr`.
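A before/after sketch of what the codemod rewrites; when the result of `cast<T>()` is only inspected transiently, `castRaw<T>()` avoids materializing a new `shared_ptr`:
```
#include <ATen/core/jit_type.h>

bool isFloatTensor(const c10::TypePtr& type) {
  // Before: auto tt = type->cast<c10::TensorType>();  // new shared_ptr,
  //         immediately discarded after the check below.
  // After: a raw pointer, with no refcount traffic.
  auto* tt = type->castRaw<c10::TensorType>();
  return tt != nullptr && tt->scalarType() == c10::ScalarType::Float;
}
```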
ghstack-source-id: 120769170
Test Plan: CI
Reviewed By: SplitInfinity
Differential Revision: D25837494
fbshipit-source-id: 46319100dc0dfc78f6d2b45148207f83481f2ada
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50995
This change makes 'Tensor' a thin wrapper over 'Buf' and 'Stmt', and
merges it with the recently introduced 'CompoundTensor'. A statement for the
tensor is either passed directly to the Tensor constructor (akin to
'CompoundTensor') or is built immediately in the constructor.
LoopNest is no longer responsible for constructing statements from
tensors; it simply stitches together the already constructed statements
contained in the Tensors. A side effect is that we can no longer construct
several loopnests from the same tensors; we need to explicitly clone the
statements if we want to do that. A special copy constructor was added to
LoopNest to make this more convenient (note: this only affects tests; we
don't usually create multiple loopnests in other places).
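A minimal sketch of the resulting shape of the class, with simplified members (the real class carries more API surface):
```
// Sketch only: Tensor as a thin pairing of the buffer it writes and the
// statement that computes it.
class Tensor {
 public:
  Tensor(const Buf* buf, Stmt* stmt) : buf_(buf), stmt_(stmt) {}
  const Buf* buf() const { return buf_; }
  Stmt* stmt() const { return stmt_; }

 private:
  const Buf* buf_;  // the buffer this tensor writes to
  Stmt* stmt_;      // the statement that computes the buffer's contents
};
```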
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D26038223
Pulled By: ZolotukhinM
fbshipit-source-id: 27a2e5900437cfb0c151e8f89815edec53608e17