Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59157
Currently `view` is represented as a copy since we don't support in-place
operations in NNC (similar to `aten::reshape`). The lowering for
`aten::expand_as` is exactly the same as for `aten::expand`, since
we build the TE expression based on the output shape anyway.
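For reference, a minimal eager-mode sketch (not the NNC lowering itself) of why the same lowering works for both ops: `expand_as(y)` produces the same result as `expand(y.shape)`, so an expression built from the output shape covers both.
```
import torch

# expand_as(y) gives the same result as expand(y.shape), which is why the
# expand lowering can be reused verbatim for expand_as.
x = torch.randn(3, 1)
y = torch.randn(3, 4)
assert torch.equal(x.expand_as(y), x.expand(y.shape))
```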
Differential Revision: D28774224
Test Plan: Imported from OSS
Reviewed By: Chillee
Pulled By: ZolotukhinM
fbshipit-source-id: 0a1593c4c6500dcc5a374213adb734180ae1f72e
Summary:
The `triangular_solve` lowering only computes the first output, since the second output is just a clone of the input coefficient matrix. Why does that output even exist?
Also, I fixed the permute lowering: I was previously applying the inverse of the permutation.
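A quick eager-mode sketch (illustrative only, not the lowering code) of the two behaviors mentioned above: the second output of `triangular_solve` is just a clone of the coefficient matrix, and applying a permutation is generally not the same as applying its inverse.
```
import torch

# The second output of triangular_solve is just a clone of A.
A = torch.triu(torch.randn(3, 3))
b = torch.randn(3, 1)
solution, cloned = torch.triangular_solve(b, A)
assert torch.equal(cloned, A)

# A permutation vs. its inverse gives different layouts.
x = torch.randn(2, 3, 4)
perm = [2, 0, 1]
inv = [perm.index(i) for i in range(3)]  # [1, 2, 0]
assert x.permute(perm).shape != x.permute(inv).shape
```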
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59131
Reviewed By: ansley
Differential Revision: D28768169
Pulled By: Chillee
fbshipit-source-id: 8e78611c6145fb2257cb409ba98c14ac55cdbccf
Summary:
Finds a few bugs:
1. permute needs to wrap dimensions
2. slice needs to wrap dimensions
3. frac doesn't work correctly for negative values
4. permute has some other failures
This PR also fixes bugs 1 and 2; the snippet below illustrates the expected dimension wrapping and `frac` behavior.
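A small eager-mode sketch (not the fuser code) of the reference behavior for bugs 1-3: negative dimensions should wrap, and `frac` keeps the sign of its input.
```
import torch

x = torch.randn(2, 3, 4)
# Negative dims must be wrapped: -1 -> 2, -2 -> 1 for a 3-d tensor.
assert x.permute(0, -1, -2).shape == x.permute(0, 2, 1).shape
# slice/narrow with a negative dim should also wrap.
assert torch.equal(x[:, :, 0:2], x.narrow(-1, 0, 2))
# frac keeps the sign of the input: frac(-1.5) == -0.5, not 0.5.
assert torch.frac(torch.tensor(-1.5)).item() == -0.5
```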
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58719
Reviewed By: SplitInfinity
Differential Revision: D28590457
Pulled By: Chillee
fbshipit-source-id: a67fce67799602f9396bfeef615e652364918fbd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58676
We only generate asm for small matmuls, but we were computing the # of
flops using an int32, which is too small.
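For context, a back-of-the-envelope sketch (sizes are illustrative, not from this PR) of how quickly a flop count overflows an int32:
```
# The flop count of an MxK @ KxN matmul is roughly 2*M*N*K, which exceeds
# INT32_MAX (2**31 - 1) already for fairly modest sizes.
M = N = K = 1100
flops = 2 * M * N * K       # ~2.7e9
assert flops > 2**31 - 1    # needs a 64-bit counter
```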
Test Plan:
```
buck test mode/dev //caffe2/test:static_runtime -- --exact 'caffe2/test:static_runtime - test_mlp (test_static_runtime.TestStaticModule)'
```
Reviewed By: navahgar
Differential Revision: D28562157
fbshipit-source-id: a07ceba5209ef6022ead09140380c116994755cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58256
Size-1 dims mess up our output restriding logic, because they're
technically "dense" no matter what stride the dimension has. In this example a
size-1 dim has stride 1, which causes all the indices to be taken mod 1 (i.e.,
all indices become 0). We work around this peculiar case by skipping size-1 dims
in our layout logic, since they have no impact on the rest of the tensor's indexing.
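A minimal eager-mode illustration (assuming nothing beyond standard striding semantics) of why size-1 dims can be skipped: the stride of a size-1 dimension never affects which element is addressed, because the only valid index along it is 0.
```
import torch

base = torch.arange(6.0)
a = base.as_strided((1, 6), (1, 1))
b = base.as_strided((1, 6), (6, 1))  # different stride on the size-1 dim
assert torch.equal(a, b)             # indexing is unaffected
```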
ghstack-source-id: 128932739
Test Plan:
new unit test, plus
```
buck test mode/dev //langtech/mobile/audio_stream_processor:audio_stream_processor_test -- --exact 'langtech/mobile/audio_stream_processor:audio_stream_processor_test - AudioStreamProcessorTest.DemucsReadWriteFloat'
```
Reviewed By: eellison
Differential Revision: D28424388
fbshipit-source-id: e33e39eef2a5bf2797bee78a5987558308b6d110
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57749
Add to an FX test.
Test Plan: Imported from OSS
Reviewed By: huiguoo
Differential Revision: D28425974
fbshipit-source-id: 195c7a1944decb7a2a99c2831cab38485f32be17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58028
We were trying to translate the device argument and thus throwing an
'unsupported dtype' error.
ghstack-source-id: 128748658
Test Plan: predictor models
Reviewed By: navahgar
Differential Revision: D28347704
fbshipit-source-id: 331a5786339e01f9df1b1878970b0c5983a92980
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58026
Cat-without-conditionals is a valuable optimization on CPU, but on GPU
it can generate invalid code since it may introduce allocations (i.e., extra
kernel launches).
ghstack-source-id: 128748630
Test Plan: predictor
Reviewed By: navahgar
Differential Revision: D28347703
fbshipit-source-id: f9e68cd7bcf5d316082ce8378ddf99f2d33fcc07
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57906
I think it was accidentally flipped in #56875.
Test Plan: Imported from OSS
Reviewed By: Chillee
Differential Revision: D28312947
Pulled By: ZolotukhinM
fbshipit-source-id: 8d0f45e540f47daefbc270f5a2ade87f2171b958
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57552
This method uses `CodeGen::call_raw` instead of `CodeGen::call`.
Relanding #57328 (the entire stack) which was reverted because I forgot
to guard a new test with `ifdef LLVM`.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D28195047
Pulled By: ZolotukhinM
fbshipit-source-id: bcfd3cb5b4f33a149b7549515ffd705e2c4f208f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57383
Notes: I picked up an activation from https://github.com/pytorch/pytorch/issues/56969. You can look at the [activations.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/Activation.cpp#L429) file which has both forward and backward kernel code to help you write the NNC lowering and the symbolic gradient.
I added a test in test_jit_fuser_te for the fusion, and I added an OpInfo and asserted that we expect to see autodiffable nodes to test the symbolic gradient.
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D28197820
Pulled By: eellison
fbshipit-source-id: 05305d85c5bb0847c8f911b95ba47b137dca7e90
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57169
The pass is planned to be used in the AOT pipeline, where we expect input
graphs to be functional. As such, these graphs should not use the 'self'
argument even if it is present, and thus it can be removed safely.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D28128328
Pulled By: ZolotukhinM
fbshipit-source-id: a7dfbf7776682826100c8eb0fef982a2e81c2554
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57076
This pass is intended to be used in conjunction with the shape propagation
pass: first we use sample inputs to specify shape info for the graph inputs,
and then we run shape-prop to infer the shapes of intermediate values in the
graph.
Differential Revision: D28048290
Test Plan: Imported from OSS
Reviewed By: astaff
Pulled By: ZolotukhinM
fbshipit-source-id: 778d772e873d59d77af9f669f45dc44b9ee5e443
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56679
moved lowerings out of the TensorExprKernel and into independent functions
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D28082921
Pulled By: Chillee
fbshipit-source-id: af530510957ed4aa8b64dcc77ca36b69866d8000
Summary:
In my last PR I missed the CUDA and distributed folders; fixing this now.
This change is autogenerated by `python tools/clang_tidy.py -s`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57235
Reviewed By: janeyx99
Differential Revision: D28084444
Pulled By: malfet
fbshipit-source-id: bf222f69ee90c7872c3cb0931e8cdb84f0cb3cda
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os


def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56324
Inlining is great if LLVM's CSE kicks in; but if a kernel has multiple outputs
(and thus multiple loops), CSE has no chance.
So, this pass "horizontally" fuses the output loops together so that CSE can go
to town. Essentially we want to turn
```
for (...) {
  output_1[] = some_complicated_expr...
}
for (...) {
  output_2[] = some_complicated_expr...
}
```
Into:
```
for (...) {
  output_1[] = complicated_expr
  output_2[] = complicated_expr  // LLVM CSE should take care of this
}
```
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D27841194
Pulled By: bertmaher
fbshipit-source-id: 54153bb59786be87183c636d64f05963c4b1624a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56319
With this change the TorchScript graph can have constant tensors in it
and we will still be able to lower it to TE. The constants are
registered (or bound) within the `TensorExprKernel` object, and when the
codegen is called they are passed along with the usual inputs and outputs.
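A sketch of the kind of graph this enables (a hypothetical module, using freezing to turn a parameter into an in-graph constant tensor):
```
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(4))

    def forward(self, x):
        return x * self.w + 1.0

# After freezing, self.w shows up as a constant tensor node inside the graph
# rather than as a graph input; the kernel now binds such constants internally.
frozen = torch.jit.freeze(torch.jit.script(M().eval()))
print(frozen.graph)
```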
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D27838747
Pulled By: ZolotukhinM
fbshipit-source-id: 4a519d66fcc07fe5fa53f5cf9af28d25611f8437
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56289
While there's no reason to think non-float32 conv2d's *don't* work,
they're only tested in float32 now. Since that's the most important use case,
I'd rather restrict the dtypes than spend time testing all the weird dtype
combinations that could possibly happen.
ghstack-source-id: 126755549
Test Plan: unit tests
Reviewed By: navahgar
Differential Revision: D27828495
fbshipit-source-id: fcf179207f2c9b20e0e86eb2b85687517d87063c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54605
For small sizes we generate a naive 3-layer loop nest; for bigger sizes
we generate an external call.
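For reference, a sketch of what the naive 3-layer loop nest computes (plain Python for illustration; the actual lowering emits NNC IR):
```
import torch

def naive_matmul(a, b):
    M, K = a.shape
    K2, N = b.shape
    assert K == K2
    out = torch.zeros(M, N, dtype=a.dtype)
    for i in range(M):          # rows of the output
        for j in range(N):      # columns of the output
            for k in range(K):  # reduction over the shared dim
                out[i, j] += a[i, k] * b[k, j]
    return out

a, b = torch.randn(4, 5), torch.randn(5, 3)
assert torch.allclose(naive_matmul(a, b), a @ b, atol=1e-5)
```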
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D27298364
Pulled By: ZolotukhinM
fbshipit-source-id: 2ddf275ff68d6fca16a3befca5ce5c26aef462b5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55970
LLVM's support for float16 is not great, and we were seeing assertion
failures trying to generate code for vectorized uses. I note that clang
doesn't even try to vectorize operations involving half:
https://gcc.godbolt.org/z/86MW4xr17, so that's a good sign we shouldn't either.
Fixes #55905
ghstack-source-id: 126511474
Test Plan: pytest test_jit_fuser_te.py -k test_isnan
Reviewed By: asuhan
Differential Revision: D27752279
Pulled By: bertmaher
fbshipit-source-id: ac115080bf2a4a73d52b396d64a5bce0cf13abfe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55825
The mask has never been used (in vectorization we generate an explicit
`IfThenElse` construct when we need to mask out some elements). The PR
removes it and cleans up all its traces from tests.
Differential Revision: D27717776
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 41d1feeea4322da75b3999d661801c2a7f82b9db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55213
Adds the integration of conv2d with the TE fuser. A few things of interest:
- I'm *super* selective about which convs get lowered. Only 3x3 depthwise (see
the sketch below for what that means), because I've benchmarked those to death
and I'm pretty sure it's a good change.
- I'm allowing single-node "fusion" groups for supported convs. (Maybe this is
a sign that conv2d codegen should go through a different path entirely, but
it seems to basically work.)
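A quick sketch of what "3x3 depthwise" means here (illustrative shapes, not taken from the benchmarks): a Conv2d whose groups equal its input channels, so each channel is convolved with its own 3x3 filter.
```
import torch

depthwise = torch.nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32)
x = torch.randn(1, 32, 56, 56)
assert depthwise(x).shape == (1, 32, 56, 56)
```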
I'll share full benchmark results once I clean them up a little. To
summarize, I tested the following torchvision models containing depthwise
convolutions. Results are single-core on a skylake-avx512:
mobilenet_v2: 8% improvement
mobilenet_v3: 9% improvement
mnasnet: 10% improvement
shufflenet: 18% improvement
Note these are comparing against a baseline with a fast-but-buggy grouped
convolution implementation in MKLDNN. So perf results will be better if
compared on master, but I'm going to assume the MKLDNN bug will be fixed and
re-enabled.
Perf results are more complicated when comparing to freezing plus conversion to
mkldnn layout; mobilenet v2/v3 are still faster, but mnasnet and shufflenet are
not. Landing this doesn't prevent MKLDNN freezing from kicking in though, so
there's no harm (although landing mkldnn freezing will regress mobilenet, but
c'est la vie).
ghstack-source-id: 126076112
Test Plan: New unit test, plus torchvision
Reviewed By: ZolotukhinM
Differential Revision: D27530272
fbshipit-source-id: 92153fad234bc9f1eaa4f7624c543168d1294a87