Summary:
Fixes an issue with the TensorExpr lowering of `aten::remainder` for integral inputs: we were always lowering to `fmod` and never to `Mod`.
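For reference, a minimal Python sketch (not part of this diff) showing why the distinction matters for integral inputs: `remainder` (Mod) takes the sign of the divisor, while `fmod` takes the sign of the dividend.
```
import torch

# Integral inputs where remainder (Mod) and fmod disagree:
a = torch.tensor([-7, 7], dtype=torch.int64)
b = torch.tensor([3, -3], dtype=torch.int64)

print(torch.remainder(a, b))  # tensor([ 2, -2]) -- sign follows the divisor
print(torch.fmod(a, b))       # tensor([-1,  1]) -- sign follows the dividend
```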
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47611
Reviewed By: bertmaher, heitorschueroff
Differential Revision: D24846929
Pulled By: nickgg
fbshipit-source-id: adac4322ced5761a11a8e914debc9abe09cf5637
Summary:
This diff adds support for `log_softmax` op in NNC.
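As a quick illustration (a hedged sketch, not a test from this diff), one way to exercise the lowering is to script a function that calls `log_softmax` and compare it against eager mode:
```
import torch
import torch.nn.functional as F

@torch.jit.script
def fn(x):
    # A pointwise op feeding log_softmax, so there is something to fuse.
    return F.log_softmax(x + 1.0, dim=-1)

x = torch.randn(8, 16)
for _ in range(3):  # warm up so the profiling executor can specialize the graph
    out = fn(x)
assert torch.allclose(out, F.log_softmax(x + 1.0, dim=-1))
```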
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47409
Reviewed By: ejguan
Differential Revision: D24750203
Pulled By: navahgar
fbshipit-source-id: c4dacc7f62f9df65ae467f0d578ea03d3698273d
Summary:
Take 2 of this fix: I removed the repro from the issue, which is a bit flaky due to parallelism. It broke on Windows but isn't specific to Windows or to this fix, I think. I'll make sure all the tests pass this time (cc zou3519).
Fixes an issue where fp16 scalars created by the registerizer could be referenced as floats, causing invalid conversions that would crash the NVRTC compile. I also noticed that we were inserting patterns like `float(half(float(X)))` and added a pass to collapse those down inside the `CudaHalfScalarRewriter`.
Fixes https://github.com/pytorch/pytorch/issues/47138
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47448
Reviewed By: glaringlee
Differential Revision: D24765070
Pulled By: nickgg
fbshipit-source-id: 5297e647534d53657bef81f4798e8aa6a93d1fbd
Summary:
Fixes an issue where fp16 scalars created by the registerizer could be referenced as floats, causing invalid conversions that would crash the NVRTC compile. I also noticed that we were inserting patterns like `float(half(float(X)))` and added a pass to collapse those down inside the `CudaHalfScalarRewriter`.
Fixes https://github.com/pytorch/pytorch/issues/47138
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47229
Reviewed By: agolynski
Differential Revision: D24706475
Pulled By: nickgg
fbshipit-source-id: 9df72bbbf203353009e98b9cce7ab735efff8b21
Summary:
Fixes two bugs reported by https://github.com/pytorch/pytorch/issues/45953 in the NNC Cuda codegen, which could break when using Half floats:
1. The Registerizer will generate new scalars with the type of the load being replaced, and doesn't have Cuda-specific logic to avoid using the half type. I've added a quick mutator to coerce these to float, similar to the existing load casting rules.
2. We weren't handling explicit casts to Half inserted by the user (in the report, the user being the JIT). This is addressed by replacing those casts with casts to Float, since that's the type we do Half math in.
Fixes https://github.com/pytorch/pytorch/issues/45953.
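A minimal Python repro sketch of the kind of workload affected (a hedged sketch assuming a CUDA device is available, not the exact repro from the issue):
```
import torch

@torch.jit.script
def fn(a, b):
    # Pointwise math on Half tensors; NNC internally promotes Half to Float,
    # which is where the bad casts were being introduced.
    return (a * b + 1.0) * a

a = torch.randn(1024, dtype=torch.half, device="cuda")
b = torch.randn(1024, dtype=torch.half, device="cuda")
for _ in range(3):  # warm up the profiling executor so the kernel is compiled
    out = fn(a, b)
assert out.dtype == torch.half
```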
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46129
Reviewed By: glaringlee
Differential Revision: D24253639
Pulled By: nickgg
fbshipit-source-id: 3fef826eab00355c81edcfabb1030332cae595ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45791
Most of the lowering for log1p and lgamma already existed; this adds the JIT integration.
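A hedged sketch of how the newly integrated ops can be exercised through the JIT (not the actual test added here):
```
import torch

@torch.jit.script
def fn(x):
    return torch.log1p(x) + torch.lgamma(x + 2.0)

x = torch.rand(64) + 0.5
for _ in range(3):  # let the profiling executor specialize and (potentially) fuse
    out = fn(x)
assert torch.allclose(out, torch.log1p(x) + torch.lgamma(x + 2.0))
```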
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24169536
Pulled By: eellison
fbshipit-source-id: a009c77a3471f3b5d378bad5de6d8e0880e9da3c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45790
Making sure that more tests invoke a run with a Fusion Group.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24169534
Pulled By: eellison
fbshipit-source-id: a2666df53fbb12c64571e960f59dbe94df2437e4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45789
Making sure that more tests invoke a run with a Fusion Group.
Test Plan: Imported from OSS
Reviewed By: Krovatkin
Differential Revision: D24169535
Pulled By: eellison
fbshipit-source-id: 54d7af434772ba52144b12d15d32ae30460c0c3c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45788
We were only running the traced graph once, at which point it would not yet have been fused. We should run it `num_profiled_runs + 1` times, and also assert that all nodes in the graph were fused.
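The intended pattern looks roughly like the sketch below (assuming the default profiling executor; `assert_fused` is an illustrative helper, not the exact test utility):
```
import torch

def assert_fused(fn, *inputs, num_profiled_runs=1):
    # Run num_profiled_runs + 1 times so the graph is profiled, optimized,
    # and fused before we inspect it.
    for _ in range(num_profiled_runs + 1):
        fn(*inputs)
    graph = torch.jit.last_executed_optimized_graph()
    assert "TensorExprGroup" in str(graph) or "FusionGroup" in str(graph), graph

@torch.jit.script
def f(a, b):
    return (a + b).relu() * 2.0

assert_fused(f, torch.randn(32), torch.randn(32))
```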
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24169537
Pulled By: eellison
fbshipit-source-id: 8499bb1a5bd9d2221b1f1c54d6352558cf07ba9a
Summary:
Some tests for alias analysis.
The first test aliases at the module level and the second at the input level.
Please let me know if there are other alias situations!
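Roughly, the two situations covered look like this (a hedged sketch, not the exact tests):
```
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        t = torch.randn(4)
        self.a = t
        self.b = t  # module-level alias: two attributes share storage

    def forward(self, x):
        return self.a + self.b + x

m = torch.jit.script(M())
m(torch.randn(4))

@torch.jit.script
def f(x, y):
    return x * y + x

t = torch.randn(4)
f(t, t)  # input-level alias: the same tensor passed for both arguments
```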
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44110
Reviewed By: nickgg
Differential Revision: D23509473
Pulled By: bwasti
fbshipit-source-id: fbfe71a1d40152c8fbbd8d631f0a54589b791c34
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43173
With this change the fuser starts to generate typechecks for the inputs of a
fusion group. For each fusion group we generate a typecheck and an if
node: the true block contains the fused subgraph, the false block
contains the unoptimized original subgraph.
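Conceptually the guard behaves like the sketch below: inputs that match the profiled types run the fused subgraph, anything else falls back to the unoptimized code. The Python here only illustrates exercising both paths; the actual guard is a `prim::TypeCheck` node in the optimized graph.
```
import torch

@torch.jit.script
def f(a, b):
    return (a + b) * a

# Profile and optimize with float inputs; the fusion group ends up guarded by a typecheck.
for _ in range(3):
    f(torch.randn(8), torch.randn(8))
print("TypeCheck" in str(torch.jit.last_executed_optimized_graph()))

# Inputs with a different dtype won't match the guard, so execution takes an unoptimized path.
f(torch.randn(8, dtype=torch.double), torch.randn(8, dtype=torch.double))
```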
Differential Revision: D23178230
Test Plan: Imported from OSS
Reviewed By: eellison
Pulled By: ZolotukhinM
fbshipit-source-id: f56e9529613263fb3e6575869fdb49973c7a520b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43365
We don't have shape inference for them yet.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D23253418
Pulled By: ZolotukhinM
fbshipit-source-id: 9c38778b8a616e70f6b2cb5aab03d3c2013b34b0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43097
Boolean arguments weren't promoted, so if you tried to write a comparison with
types such as `Tensor(Bool) == Int` you'd fail typechecking inside the TE
engine.
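A small Python sketch of the kind of comparison that previously tripped the TE type checks (assuming the fuser picks this graph up):
```
import torch

@torch.jit.script
def f(x):
    # (x > 0) is a Bool tensor; comparing it with an Int scalar requires promotion.
    return (x > 0) == 1

x = torch.randn(16)
for _ in range(3):  # warm up so the TE engine compiles the kernel
    out = f(x)
assert out.dtype == torch.bool
```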
Test Plan: Imported from OSS
Reviewed By: protonu, zheng-xq
Differential Revision: D23167926
Pulled By: bertmaher
fbshipit-source-id: 47091a815d5ae521637142a5c390e8a51a776906
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42766
**Summary**
Some Python tests are missing from `caffe2/test/TARGETS`; add them to make the test coverage more comprehensive.
According to [run_test.py](https://github.com/pytorch/pytorch/blob/master/test/run_test.py#L125), some tests are slower than others. Slow tests are added as independent targets, and the rest are put together into one `others` target. The reason is that we want to reduce overhead, especially for code coverage collection: tests in one target can be run as a bundle, and then coverage can be collected together. The coverage collection procedure is typically time-expensive, so this saves time.
Test Plan:
Run all the new test targets locally on a dev server and record the time they take.
**Statistics**
```
# jit target
real 33m7.694s
user 653m1.181s
sys 58m14.160s
--------- Compare to Initial Jit Target runtime: ----------------
real 32m13.057s
user 613m52.843s
sys 54m58.678s
```
```
# others target
real 9m2.920s
user 164m21.927s
sys 12m54.840s
```
```
# serialization target
real 4m21.090s
user 23m33.501s
sys 1m53.308s
```
```
# tensorexpr
real 11m28.187s
user 33m36.420s
sys 1m15.925s
```
```
# type target
real 3m36.197s
user 51m47.912s
sys 4m14.149s
```
Reviewed By: malfet
Differential Revision: D22979219
fbshipit-source-id: 12a30839bb76a64871359bc024e4bff670c5ca8b
Summary:
Adding `test_tensorexpr.py` to our CI. There are a few complications. The first is that we now always run `SimpleIREval` as part of the simplifier, so the counts will always be greater than one. We could potentially invest some effort to differentiate between a real codegen call to `SimpleIREval` and calls made by the simplifier, but it's probably not that important. The second change is to turn a failure to retrieve a counter into a default value of 0: since the tests are structured to check for either the LLVM or the SimpleIREval backend, it seems appropriate not to fail the test too early.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35776
Differential Revision: D20799333
Pulled By: Krovatkin
fbshipit-source-id: 2a94ff98e647180c6e6aea141a411c3376c509f9
Summary:
This commit allows one to use an environment variable to enable the fuser in `torch/csrc/jit/tensorexpr/`:
```
PYTORCH_TENSOREXPR=1 python benchmark.py
```
This commit also changes the registration to happen by default, removing the requirement to call the Python-exposed `_jit_register_tensorexpr_fuser`.
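For a Python-side equivalent (a hedged sketch; `torch._C._jit_set_texpr_fuser_enabled` is an internal toggle and may differ across versions), the effect of the environment variable can be mimicked like so:
```
import os
import torch

# Mirror the PYTORCH_TENSOREXPR=1 behavior from Python (internal API, subject to change).
if os.environ.get("PYTORCH_TENSOREXPR", "0") == "1":
    torch._C._jit_set_texpr_fuser_enabled(True)
```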
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35341
Reviewed By: ZolotukhinM
Differential Revision: D20676348
Pulled By: bwasti
fbshipit-source-id: 4c997cdc310e7567c03905ebff72b3e8a4c2f464
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34842
This PR (hopefully the last one of its kind) merges changes from a
side branch where the tensor-expressions-based fuser work has been done so
far. It is a squashed version of the changes in the side branch,
which is available here: https://github.com/bertmaher/pytorch
Differential Revision: D20478208
Test Plan: Imported from OSS
Pulled By: ZolotukhinM
fbshipit-source-id: 21556e009f1fd88099944732edba72ac40e9b9c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34228
This PR adds LLVM codegen to tensor expressions. LLVM is added as an
optional build dependency specified with the `USE_LLVM=<path_to_llvm>`
variable. If this variable is not set or LLVM is not found at the
specified path, the LLVM codegen is completely disabled.
Differential Revision: D20251832
Test Plan: Imported from OSS
Pulled By: ZolotukhinM
fbshipit-source-id: 77e203ab4421eb03afc64f8da17e0daab277ecc2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34227
This PR adds CUDA support to tensor expressions.
Differential Revision: D20251836
Test Plan: Imported from OSS
Pulled By: ZolotukhinM
fbshipit-source-id: ab36a55834cceff30c8371fef6cca1054a32f017
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34226
The LLVM and Cuda backends are added in subsequent PRs, so at this point the fuser is pretty useless, but it can still be tested, and its logic is not going to change with the addition of the codegens.
Differential Revision: D20251838
Test Plan: Imported from OSS
Pulled By: ZolotukhinM
fbshipit-source-id: 82b0d221fa89904ed526689d02a6c7676a8ce8de