Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46950
Make sure that we're actually fusing in the fuser tests, and refactor to a more concise API for checking whether fusions have happened.
Test Plan: Imported from OSS
Reviewed By: ansley
Differential Revision: D24805250
Pulled By: eellison
fbshipit-source-id: f898008a64b74e761bb5fe85f91b3cdf2dbdf878
Summary:
References https://github.com/pytorch/pytorch/issues/42515
> Enable integer -> float unary type promotion for ops like sin
Will follow up for other such ops once this PR is merged.
cc: mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45733
Reviewed By: zou3519
Differential Revision: D24431194
Pulled By: mruberry
fbshipit-source-id: db600bc5de0e535b538d2aa301c3526b7c75ed17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45788
We were only running the traced graph once, which would not yet have been fused at that point. We should run for num_profiled_runs + 1, and also assert that all nodes in the graph were fused.
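A toy model of why a single run is not enough, assuming an executor that only switches to the optimized (fused) plan after a fixed number of profiling runs. `ToyProfilingExecutor`, `warm_up_and_run`, and `NUM_PROFILED_RUNS` are hypothetical stand-ins for illustration, not PyTorch APIs.

```python
NUM_PROFILED_RUNS = 2  # hypothetical value; stands in for num_profiled_runs


class ToyProfilingExecutor:
    """Runs a function unoptimized while profiling, then uses a 'fused' plan."""

    def __init__(self, fn, num_profiled_runs=NUM_PROFILED_RUNS):
        self.fn = fn
        self.num_profiled_runs = num_profiled_runs
        self.runs = 0
        self.used_fused_plan = False

    def __call__(self, *args):
        # The first num_profiled_runs calls only collect profiling data.
        self.used_fused_plan = self.runs >= self.num_profiled_runs
        self.runs += 1
        return self.fn(*args)


def warm_up_and_run(executor, *args):
    """Call the executor num_profiled_runs + 1 times so the last call is optimized."""
    result = None
    for _ in range(executor.num_profiled_runs + 1):
        result = executor(*args)
    return result


single = ToyProfilingExecutor(lambda a, b: a * b + b)
single(3, 4)  # one run only: the executor is still profiling

warmed = ToyProfilingExecutor(lambda a, b: a * b + b)
out = warm_up_and_run(warmed, 3, 4)
```

In this sketch `single.used_fused_plan` stays false, which mirrors the original test bug: the graph inspected after one run had not been fused yet.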
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24169537
Pulled By: eellison
fbshipit-source-id: 8499bb1a5bd9d2221b1f1c54d6352558cf07ba9a
Summary:
The CUDA HalfChecker casts up all loads and stores of Half to Float, so that math is done in Float on the device. It didn't cast up HalfImmediate (i.e. constants), so they could introduce mixed-size ops. The fix is to cast those up as well.
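A rough sketch of the idea on a toy IR, written in Python rather than the real C++ classes. `Imm`, `Load`, `Cast`, `Add`, and `promote_half` are hypothetical names, and whether the real pass rewrites the constant or wraps it in a cast is an assumption here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Imm:
    """A constant in a toy IR; dtype is 'half' or 'float'."""
    value: float
    dtype: str


@dataclass(frozen=True)
class Load:
    """A load from a named buffer with the given dtype."""
    buf: str
    dtype: str


@dataclass(frozen=True)
class Cast:
    """Converts src to dtype."""
    dtype: str
    src: object


@dataclass(frozen=True)
class Add:
    lhs: object
    rhs: object


def dtype_of(e):
    """Result dtype of an expression; Add takes the dtype of its left operand."""
    if isinstance(e, Add):
        return dtype_of(e.lhs)
    return e.dtype


def promote_half(e):
    """Rewrite an expression so that every Half leaf is computed in Float."""
    if isinstance(e, Add):
        return Add(promote_half(e.lhs), promote_half(e.rhs))
    if isinstance(e, Load) and e.dtype == "half":
        return Cast("float", e)       # loads were already cast up
    if isinstance(e, Imm) and e.dtype == "half":
        return Imm(e.value, "float")  # the missing case: constants
    return e


expr = Add(Load("a", "half"), Imm(1.0, "half"))
fixed = promote_half(expr)
```

Without the `Imm` branch, the `Add` above would mix a Float operand with a Half constant, which is the mixed-size op described in the summary.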
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45213
Reviewed By: ezyang
Differential Revision: D23885287
Pulled By: nickgg
fbshipit-source-id: 912991d85cc06ebb282625cfa5080d7525c8eba9
Summary:
For integral types, isnan is meaningless. Provide specializations for
maximum and minimum which don't call it.
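A small Python illustration of the difference, assuming NaN-propagating semantics for the floating-point version. The function names are made up for this sketch and are not the evaluator's.

```python
import math


def maximum_float(a: float, b: float) -> float:
    """Float maximum that propagates NaN, so it must check isnan on both sides."""
    if math.isnan(a):
        return a
    if math.isnan(b):
        return b
    return a if a > b else b


def maximum_int(a: int, b: int) -> int:
    """Integer maximum: no value is NaN, so a plain comparison is enough."""
    return a if a > b else b


def minimum_int(a: int, b: int) -> int:
    """Integer minimum, likewise without any isnan call."""
    return a if a < b else b
```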
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44984
Test Plan: python test/test_jit_fuser_te.py -k TestTEFuser.test_minmax_int_ops
Reviewed By: ezyang
Differential Revision: D23885259
Pulled By: asuhan
fbshipit-source-id: 2e6da2c43c0ed18f0b648a2383d510894c574437
Summary:
Arithmetic operations on Bool aren't fully supported in the evaluator. Moreover,
such semantics can be implemented by the client code through insertion of
explicit casts to widen and narrow to the desired types.
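As a sketch of what this means for client code, here is a toy typed evaluator in Python. `Value`, `cast`, and `add` are hypothetical helpers, not the tensorexpr API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    """A typed scalar in a toy evaluator; dtype is 'bool' or 'int'."""
    dtype: str
    data: int


def cast(v: Value, dtype: str) -> Value:
    """Explicit cast: narrowing to bool maps any nonzero value to 1."""
    if dtype == "bool":
        return Value("bool", int(v.data != 0))
    return Value(dtype, int(v.data))


def add(a: Value, b: Value) -> Value:
    """Addition that rejects Bool operands instead of guessing a semantics."""
    if a.dtype == "bool" or b.dtype == "bool":
        raise TypeError("arithmetic on Bool is not supported; cast explicitly")
    if a.dtype != b.dtype:
        raise TypeError("operand dtypes must match")
    return Value(a.dtype, a.data + b.data)


t = Value("bool", 1)
# Client code that wants "bool + bool" widens, adds, then narrows.
total = add(cast(t, "int"), cast(t, "int"))
narrowed = cast(total, "bool")
```

Calling `add(t, t)` directly raises `TypeError` in this sketch, which corresponds to the new restriction.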
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44677
Test Plan:
test_tensorexpr --gtest_filter=TensorExprTest.ExprDisallowBoolArithmetic
python test/test_jit_fuser_te.py
Reviewed By: agolynski
Differential Revision: D23801412
Pulled By: asuhan
fbshipit-source-id: fff5284e3a216655dbf5a9a64d1cb1efda271a36
Summary:
Fixes a bug where FP16 values could be incorrectly cast to a half type that doesn't have a cast operator. The CUDA-specific cast to float is now inserted while handling the Cast node, rather than as a wrapper around the printing of Loads and Stores. Two main changes: the HalfChecker now inserts the casts to float explicitly in the IR, and the PrioritizeLoad mutator now consumes both Loads and a Cast that immediately precedes a Load.
Tested with test_jit_fuser_te.py and test_tensorexpr.py, plus the C++ tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44209
Reviewed By: izdeby
Differential Revision: D23575577
Pulled By: nickgg
fbshipit-source-id: 808605aeb2af812758f96f9fdc11b07e08053b46
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44073
We don't have proper support for it yet, either in NNC or in the JIT IR -> NNC lowering.
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D23487905
Pulled By: ZolotukhinM
fbshipit-source-id: da0da7478fc8ce7b455176c95d8fd610c94352c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43635
Intern the symbol, no functional changes. Aliasing needs to be looked at, but that should be done in a separate PR; this PR only changes the symbol.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23358806
Pulled By: eellison
fbshipit-source-id: f18bcd142a0daf514136f019ae607e4c3f45d9f8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43631
I added a new test for just profiler stuff - I don't think the test should go in test_jit.py. Maybe this should just go in test_tensorexpr_fuser, but I'm not really testing tensorexpr stuff either... LMK
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23358810
Pulled By: eellison
fbshipit-source-id: 074238e1b60e4c4a919a052b7a5312b790ad5d82
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43173
With this change the fuser starts to generate typechecks for the inputs of
a fusion group. For each fusion group we generate a typecheck and an if
node: the true block contains the fused subgraph, the false block
contains the unoptimized original subgraph.
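The guard structure can be sketched in plain Python as follows. This is only an analogy: `make_guarded` is a hypothetical helper, and Python types stand in for the profiled tensor types the real typecheck inspects.

```python
def make_guarded(fused_fn, fallback_fn, expected_types):
    """Return a callable that runs fused_fn only when inputs match the profile."""

    def type_check(args):
        return len(args) == len(expected_types) and all(
            type(a) is t for a, t in zip(args, expected_types)
        )

    def run(*args):
        if type_check(args):
            # True block: the fused subgraph.
            return "fused", fused_fn(*args)
        # False block: the unoptimized original subgraph.
        return "fallback", fallback_fn(*args)

    return run


guarded = make_guarded(
    fused_fn=lambda a, b: a * b + b,
    fallback_fn=lambda a, b: a * b + b,
    expected_types=(float, float),
)
```

Both branches compute the same result; the check only decides which implementation is safe to run for the given inputs.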
Differential Revision: D23178230
Test Plan: Imported from OSS
Reviewed By: eellison
Pulled By: ZolotukhinM
fbshipit-source-id: f56e9529613263fb3e6575869fdb49973c7a520b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42766
**Summary**
Some Python tests are missing from `caffe2/test/TARGETS`; add them to make the target list more comprehensive.
According to [run_test.py](https://github.com/pytorch/pytorch/blob/master/test/run_test.py#L125), some tests are slower. Slow tests are added as independent targets and the rest are put together into one `others` target. The reason is that we want to reduce overhead, especially for code coverage collection. Tests in one target can be run as a bundle, and then coverage can be collected together. Coverage collection is typically time-expensive, so this helps us save time.
Test Plan:
Run all the new test targets locally on a dev server and record how long they take.
**Statistics**
```
# jit target
real 33m7.694s
user 653m1.181s
sys 58m14.160s
--------- Compare to Initial Jit Target runtime: ----------------
real 32m13.057s
user 613m52.843s
sys 54m58.678s
```
```
# others target
real 9m2.920s
user 164m21.927s
sys 12m54.840s
```
```
# serialization target
real 4m21.090s
user 23m33.501s
sys 1m53.308s
```
```
# tensorexpr
real 11m28.187s
user 33m36.420s
sys 1m15.925s
```
```
# type target
real 3m36.197s
user 51m47.912s
sys 4m14.149s
```
Reviewed By: malfet
Differential Revision: D22979219
fbshipit-source-id: 12a30839bb76a64871359bc024e4bff670c5ca8b
Summary:
Remove `skipIfRocm` from most jit tests and enable `RUN_CUDA_HALF` tests for ROCm.
These changes passed more than three rounds of CI testing against the ROCm CI.
CC ezyang xw285cornell sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40447
Differential Revision: D22190711
Pulled By: xw285cornell
fbshipit-source-id: bac44825a2675d247b3abe2ec2f80420a95348a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40142
test_jit is becoming huge again, which makes it hard for editors to load and
hard to write new tests. This splits out the tracer-related tests.
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D22085035
Pulled By: wanchaol
fbshipit-source-id: 696bee84985ecfbfeac8e2ee5c27f1bdda8de394
Summary:
After an early return, we conditionalize all further execution. This means that currently the pattern of
`if return elif return elif return` generates better code than `if return if return if return`. It's obviously not good to have semantically equivalent code generate worse IR, so we should rewrite the graph to handle this case. This came up in https://github.com/pytorch/pytorch/pull/37171
```
import torch

@torch.jit.script
def test_foo(x: bool, y: bool):
    if x:
        return 1
    return 2

print(test_foo.code)
```
generates:
```
def test_foo(x: bool,
    y: bool) -> int:
  _0 = uninitialized(int)
  if x:
    _1, _2 = True, 1
  else:
    _1, _2 = False, _0
  if _1:
    _3 = _2
  else:
    _3 = 2
  return _3
```
while
```
import torch

@torch.jit.script
def test_foo(x: bool, y: bool):
    if x:
        return 1
    else:
        return 2

print(test_foo.code)
```
generates:
```
def test_foo(x: bool,
    y: bool) -> int:
  if x:
    _0 = 1
  else:
    _0 = 2
  return _0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38282
Differential Revision: D21576733
Pulled By: eellison
fbshipit-source-id: 80cf1ad7fbda6d8d58557abbfb21c90eafae7488
Summary:
The existing contextmanager only conditionally enabled profiling mode, which was counterintuitive. As a result, internal benchmarking broke when we changed the default executor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37825
Differential Revision: D21404611
Pulled By: eellison
fbshipit-source-id: 306b3c333ef4eb44ab6a6e5ab4e0682e5ce312ce
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35913
The pass itself is still disabled by default, but with this change we
don't need to register it as a custom pass anymore. It allows us to
control its behavior with env variables more easily.
Test Plan: Imported from OSS
Reviewed By: suo
Differential Revision: D20827189
Pulled By: ZolotukhinM
fbshipit-source-id: e74d90b5e46422e7ab7bc40974a805220da50fbc
Summary:
**Summary:** This PR contains the infrastructure of a new CUDA fuser. This CUDA fuser is based on many of the same principles as TensorExpressions and Halide; however, the implementation is written from the ground up. The fusion pass itself is similar to the default CUDA fuser, but it has undergone some refactoring and uses the new code generation infrastructure. For those interested in how the code generation in this PR works, I recommend reviewing _test/cpp/jit/test_gpu_fusion.cpp_ as well as the long comment section at the beginning of _torch/csrc/jit/codegen/cuda/transform_replay.h_.
One of the largest differences between our approach and that of TVM/Halide is the concept of "TensorView". At a high level, a TensorView should be thought of similarly to how we think of working with Tensors in PyTorch. It's an N-D object which can undergo transformations that change its dimensionality. Dimensionality changes are done through the operations split/merge/reorder/computeAt. These transformations are similar to split/fuse/reorder/compute_at in TVM: they modify how a tensor is iterated over to generate GPU code. Interestingly, in our scheme these transformations are applied to tensors and only impact how that tensor is generated.
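To give a feel for how split/merge/reorder change dimensionality, here is a toy Python model that tracks only the extents of an iteration domain. The function signatures are assumptions made for this sketch and do not match the C++ TensorView API; computeAt is not modeled.

```python
def split(extents, axis, factor):
    """Split one axis into (outer, inner) with inner == factor, rounding outer up."""
    outer = -(-extents[axis] // factor)  # ceiling division
    return extents[:axis] + [outer, factor] + extents[axis + 1:]


def merge(extents, axis):
    """Merge axis and axis + 1 into one axis whose extent is their product."""
    return extents[:axis] + [extents[axis] * extents[axis + 1]] + extents[axis + 2:]


def reorder(extents, order):
    """Permute axes; order[i] is the old position of the new axis i."""
    return [extents[i] for i in order]


domain = [128, 64]                # a 2-D iteration domain
merged = merge(domain, 0)         # one flat axis
tiled = split(merged, 0, 256)     # outer and inner axes
swapped = reorder(tiled, [1, 0])  # inner axis first
```

The number of elements iterated over stays the same under merge and reorder; split can pad it when the factor does not divide the extent evenly.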
**Warning:** This PR is purposefully not feature complete with the current fuser. We wanted to separate out the infrastructure from the fusion capabilities. Once in, smaller incremental PRs will be submitted to expand capabilities of the fuser.
**Short term goals:**
Parity with current CUDA fuser (including performance):
- Dynamic shapes (no recompilation)
- Implicit handling of broadcast (broadcasted tensors are treated as tensors of the broadcasted size in the generated code)
- Dropout
**Mid-term goals:**
- Transposes fused with pointwise operations where transpose involves only 2 axes (across the fused operation).
- 1-D reductions fused with pointwise operations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34785
Reviewed By: ZolotukhinM
Differential Revision: D20650977
Pulled By: soumith
fbshipit-source-id: ee39c95a880e1b9822e874ed4cc180971572bf63
Summary:
This commit allows one to use an environment variable to enable the fuser in torch/csrc/jit/tensorexpr/:
```
PYTORCH_TENSOREXPR=1 python benchmark.py
```
This commit also changes the registration to happen by default, removing the requirement to call the Python-exposed "_jit_register_tensorexpr_fuser".
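For scripts that want to mirror the same switch, the check can be sketched in Python as below. `tensorexpr_fuser_enabled` is a hypothetical helper, and treating only the exact value "1" as enabled is an assumption; the C++ side may parse the variable differently.

```python
import os


def tensorexpr_fuser_enabled(environ=None):
    """Return True when PYTORCH_TENSOREXPR is set to "1" (assumed convention)."""
    if environ is None:
        environ = os.environ
    return environ.get("PYTORCH_TENSOREXPR", "0") == "1"
```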
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35341
Reviewed By: ZolotukhinM
Differential Revision: D20676348
Pulled By: bwasti
fbshipit-source-id: 4c997cdc310e7567c03905ebff72b3e8a4c2f464