Summary:
This PR adds an implementation of `aten::cat` in NNC that avoids conditionals. This version is not enabled by default.
Below are micro-benchmark results with and without conditionals. The conditional-free version improves performance by up to 50% for some shapes.
aten::cat implementation in NNC **with** conditionals
```
$ python -m benchmarks.tensorexpr --device cpu --mode fwd --jit_mode trace --cpu_fusion concat
pt: concat2d2input_fwd_cpu_1_160_1_14_1: 5.44 us, SOL 0.26 GB/s, algorithmic 0.51 GB/s
pt: concat2d2input_fwd_cpu_1_580_1_174_1: 5.75 us, SOL 1.05 GB/s, algorithmic 2.10 GB/s
pt: concat2d2input_fwd_cpu_20_160_20_14_1: 6.87 us, SOL 4.05 GB/s, algorithmic 8.11 GB/s
pt: concat2d2input_fwd_cpu_20_580_20_174_1: 14.52 us, SOL 8.31 GB/s, algorithmic 16.62 GB/s
pt: concat2d2input_fwd_cpu_8_512_8_512_1: 9.58 us, SOL 6.84 GB/s, algorithmic 13.68 GB/s
```
aten::cat implementation in NNC **without** conditionals
```
$ python -m benchmarks.tensorexpr --device cpu --mode fwd --jit_mode trace --cpu_fusion --cat_wo_conditionals concat
pt: concat2d2input_fwd_cpu_1_160_1_14_1: 4.67 us, SOL 0.30 GB/s, algorithmic 0.60 GB/s
pt: concat2d2input_fwd_cpu_1_580_1_174_1: 5.65 us, SOL 1.07 GB/s, algorithmic 2.14 GB/s
pt: concat2d2input_fwd_cpu_20_160_20_14_1: 6.10 us, SOL 4.56 GB/s, algorithmic 9.12 GB/s
pt: concat2d2input_fwd_cpu_20_580_20_174_1: 7.44 us, SOL 16.22 GB/s, algorithmic 32.44 GB/s
pt: concat2d2input_fwd_cpu_8_512_8_512_1: 6.46 us, SOL 10.14 GB/s, algorithmic 20.29 GB/s
```
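For intuition, the two lowering strategies can be sketched in plain Python (a simplified illustration, not the NNC codegen itself), concatenating two 2-D inputs along dim 1: the conditional version branches on every output element to pick its source, while the conditional-free version emits one unconditional copy loop per input.

```python
def cat_with_conditionals(a, b):
    # One loop over the full output; an `if` on every element picks the source.
    rows, wa, wb = len(a), len(a[0]), len(b[0])
    out = [[0] * (wa + wb) for _ in range(rows)]
    for i in range(rows):
        for j in range(wa + wb):
            out[i][j] = a[i][j] if j < wa else b[i][j - wa]
    return out

def cat_without_conditionals(a, b):
    # One unconditional copy loop per input; no branch in any inner loop.
    rows, wa, wb = len(a), len(a[0]), len(b[0])
    out = [[0] * (wa + wb) for _ in range(rows)]
    for i in range(rows):
        for j in range(wa):
            out[i][j] = a[i][j]
        for j in range(wb):
            out[i][wa + j] = b[i][j]
    return out
```

Branch-free inner loops are much friendlier to vectorization, which is a plausible source of the speedups in the numbers above.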
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53128
Reviewed By: bertmaher
Differential Revision: D26758613
Pulled By: navahgar
fbshipit-source-id: 00f56b7da630b42bc6e7ddd4444bae0cf3a5780a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52786
Previously, NNC did not sanitize input names. I ran into this in the next PR in the stack: making subgraph creation preserve debug names caused a number of NNC CUDA failures. I had also run into it earlier with some masked_fill failures internally, which led me to disable that operator.
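The kind of sanitization involved can be sketched with a hypothetical helper (not NNC's actual implementation): debug names coming from the JIT graph may contain characters such as `.` that are not legal identifiers in generated CUDA/LLVM code.

```python
import re

def sanitize_name(name):
    # Replace any character that is not legal in a C identifier,
    # and make sure the result does not start with a digit.
    clean = re.sub(r"[^0-9a-zA-Z_]", "_", name)
    if clean and clean[0].isdigit():
        clean = "_" + clean
    return clean or "_"
```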
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D26696699
Pulled By: eellison
fbshipit-source-id: 7c3af4d559d58762fb8332666784a4d5cd6a4167
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52264
When CPU fusion is enabled in a PyTorch build without LLVM support, it causes a huge slowdown (> 50x). This PR makes the LLVM backend the default backend for TE. Now an error is reported if CPU fusion is enabled without LLVM support, avoiding this performance regression.
This PR also updates the tests to not use LLVM, so that the old flow continues to be exercised. This is necessary because the tests run in CI do not have LLVM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52314
Reviewed By: ejguan
Differential Revision: D26491294
Pulled By: navahgar
fbshipit-source-id: 74561db1207da805d6d28039450db046ba2988fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48264
Preserves the strided representation of NNC Tensor outputs by transforming them into the right layout at the end of the kernel.
Fix for https://github.com/pytorch/pytorch/issues/45604
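The core idea can be sketched in plain Python (hypothetical helper names, not the kernel code): a logical index maps to a flat buffer offset through the strides, so the same logical tensor can be written out in whatever layout the output's strides demand at the end of the kernel.

```python
def flat_offset(index, strides):
    # Dot product of the logical index with the strides.
    return sum(i * s for i, s in zip(index, strides))

def write_strided(values, shape, strides):
    # Scatter a row-major list of values into a flat buffer laid out
    # according to `strides` (e.g. a transposed, non-contiguous layout).
    rows, cols = shape
    buf = [0] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            buf[flat_offset((r, c), strides)] = values[r * cols + c]
    return buf
```

With contiguous strides the buffer matches the row-major input; with swapped strides the same logical tensor lands in column-major order.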
Test Plan: Imported from OSS
Reviewed By: nikithamalgifb
Differential Revision: D25286213
Pulled By: eellison
fbshipit-source-id: 64d94ac463741e2568a1c9d44174e15ea26e511f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47813
We have some code paths that appear to handle dynamic sizes at kernel invocation, but I'm not sure how well they work, because other parts of our code base assume that tensor shapes are always fully specified. https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/kernel.cpp#L1572
As with some other PRs in the stack, I think it is good to remove features that are not in use and not actively being worked on.
I initially did this PR to try to speed up perf. Since I couldn't observe much of a speedup, we can decide to drop this PR if we want.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D25286212
Pulled By: eellison
fbshipit-source-id: 4ae66e0af88d649dd4e592bc78686538c2fdbaeb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48160
We no longer use the custom C++ test infra anyway, so move to pure gtest.
Fixes #45703
ghstack-source-id: 116977283
Test Plan: `buck test //caffe2/test/cpp/tensorexpr`
Reviewed By: navahgar, nickgg
Differential Revision: D25046618
fbshipit-source-id: da34183d87465f410379048148c28e1623618553
Summary:
This diff adds support for `log_softmax` op in NNC.
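For reference, the usual numerically stable formulation that such a lowering computes can be sketched in plain Python (an illustration, not the NNC lowering itself):

```python
import math

def log_softmax(xs):
    # Subtract the max first so exp() never overflows:
    # log_softmax(x)_i = x_i - max(x) - log(sum_j exp(x_j - max(x)))
    m = max(xs)
    lse = math.log(sum(math.exp(x - m) for x in xs))
    return [x - m - lse for x in xs]
```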
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47409
Reviewed By: ejguan
Differential Revision: D24750203
Pulled By: navahgar
fbshipit-source-id: c4dacc7f62f9df65ae467f0d578ea03d3698273d
Summary:
This diff enables inlining for all non-output buffers, including the intermediate buffers that are created as part of an op. However, the buffers that correspond to reductions will not be inlined.
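The transformation can be sketched abstractly (a hypothetical example, not NNC's IR): an intermediate, non-output buffer's definition is substituted into its uses so the temp buffer disappears, while a reduction is kept materialized, since inlining it would recompute the whole sum at every use site.

```python
def fused_consumer(a):
    # Before inlining: b[i] = a[i] + 1 is stored to a temp buffer,
    # then c[i] = b[i] * 2 reads it back.
    # After inlining b into c, the temp buffer is gone:
    return [(x + 1) * 2 for x in a]

def reduction_kept(a):
    # The reduction is computed once and reused; inlining `s` into both
    # uses below would recompute the O(n) sum twice.
    s = sum(x + 1 for x in a)  # producer (x + 1) can be inlined into the reduction
    return [s, s * 2]
```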
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47258
Reviewed By: anjali411
Differential Revision: D24707015
Pulled By: navahgar
fbshipit-source-id: ad8b03e38497600cd69980424db6d586bf93db74
Summary:
This diff enables inlining producers into reductions. It also guards against inlining reductions themselves.
Prior to this diff, if there was a reduction anywhere in the loopnest, no inlining happened at all. After this change, we inline all non-output buffers that do not themselves correspond to a reduction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47020
Reviewed By: albanD
Differential Revision: D24644346
Pulled By: navahgar
fbshipit-source-id: ad234a6877b65be2457b734cbb7f3a1800baa6a5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45388
Classes defined in these files are closely related, so it is reasonable
to have them all in one file. The change is purely a code move.
Differential Revision: D23952867
Test Plan: Imported from OSS
Reviewed By: nickgg
Pulled By: ZolotukhinM
fbshipit-source-id: 12cfaa968bdfc4dff00509e34310a497c7b59155
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41451
Since TE operates on a limited subset of ops with a well-defined
semantics, we can easily infer shapes of intermediate and output tensors
given shapes of the inputs.
There are a couple of ops not yet supported by shape inference; once we
add them, we can relax the shape-info requirements in the TE fuser:
currently it requires all values in the fusion group to have known
shapes, and we can change that to only the inputs.
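For elementwise ops, such inference amounts to standard broadcasting over the input shapes; a minimal sketch (not the fuser's actual code):

```python
from itertools import zip_longest

def broadcast_shape(lhs, rhs):
    # Align the two shapes from the right; each pair of dimensions must
    # match, or one of them must be 1.
    out = []
    for a, b in zip_longest(reversed(lhs), reversed(rhs), fillvalue=1):
        if a != b and a != 1 and b != 1:
            raise ValueError(f"incompatible shapes {lhs} and {rhs}")
        out.append(max(a, b))
    return tuple(reversed(out))
```

Applied op by op from the graph inputs, this propagates shapes to every intermediate and output tensor.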
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D22543470
Pulled By: ZolotukhinM
fbshipit-source-id: 256bae921028cb6ec3af91977f12bb870c385f40
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41507
These fields have always been a part of tensor types, this change just
makes them serializable through IR dumps.
Test Plan: Imported from OSS
Reviewed By: Krovatkin, ngimel
Differential Revision: D22563661
Pulled By: ZolotukhinM
fbshipit-source-id: f01aaa130b7e0005bf1ff21f65827fc24755b360
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37948
The input JIT graph has all the information we need to perform the
entire compilation at the construction time. We don't need to postpone
any steps until the execution time. Also, from the graph we always know
what device we will be executing on and thus we don't need to have a
CodeGen cache in TensorExprKernel - we always have one and only one
CodeGen.
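The design change amounts to moving from lazy, cached compilation at first execution to eager compilation in the constructor; a schematic sketch with hypothetical names:

```python
class EagerKernel:
    """Compile once at construction; run() only executes."""

    def __init__(self, graph, compile_fn):
        # The graph fully determines the device and shapes, so the single
        # CodeGen can be built right here -- no cache keyed at run time.
        self.codegen = compile_fn(graph)

    def run(self, *inputs):
        return self.codegen(*inputs)
```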
Test Plan: Imported from OSS
Reviewed By: protonu
Differential Revision: D21432145
Pulled By: ZolotukhinM
fbshipit-source-id: 8dc86b891713056b2c62f30170cd4a168912f027