Summary:
This adds guarding for DifferentiableGraph nodes in order to not depend on
Also, the CUDA fuser now bails out when inputs require gradients.
Fixes https://github.com/pytorch/pytorch/issues/49299
I still need to look into a handful of failing tests, but this can serve as a basis for discussion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49433
Reviewed By: ngimel
Differential Revision: D25681374
Pulled By: Krovatkin
fbshipit-source-id: 8e7be53a335c845560436c0cceeb5e154c9cf296
Summary:
An internal user is experiencing a bug with masked_fill. While I am almost certain this corresponds to an old PyTorch version that contained the bug, the model that is breaking is important and time-sensitive, so we are covering all bases to try to get it working again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50147
Reviewed By: nhsoukai
Differential Revision: D25806541
Pulled By: eellison
fbshipit-source-id: 131bd71b5db9717a8a9cb97973d0b4f0e96455d6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49627
There was a bug in the test that was hidden by the `If eager mode doesn't support a dtype/op/device combo` try / catch, so CUDA wasn't being tested. The fix is just to rename `aten::masked_fill` to `aten_masked_fill`.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D25696409
Pulled By: eellison
fbshipit-source-id: 83de1f5a194df54fe317b0035d4a6c1aed1d19a0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49357
This is a follow-up fix for PR #48679. The previous PR added support for integer
inputs to aten::abs by promoting integers to float and then demoting the result
back to integers. This PR supports integer inputs to aten::abs more efficiently
in the SimpleIREvaluator by implementing integer inputs for kAbs (renamed from
kFabs); the eager-mode behavior being matched is shown below.
- Rename kFabs to kAbs
- Add support for integer inputs to kAbs in the SimpleIREvaluator (note:
llvm_codegen and cuda_codegen already support integer inputs to kAbs)
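For reference, a minimal eager-mode check of the behavior the evaluator needs to match (not part of this PR's test plan): abs on integral tensors keeps the integral dtype rather than round-tripping through float.
```python
import torch

x_int = torch.tensor([-3, 2], dtype=torch.int32)
x_f32 = torch.tensor([-3.5, 2.25])

print(torch.abs(x_int).dtype)  # torch.int32 -- integer abs stays integral
print(torch.abs(x_f32).dtype)  # torch.float32
```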
Test Plan:
- `PYTORCH_TENSOREXPR_DONT_USE_LLVM=1 python test/test_jit_fuser_te.py
TestTEFuser.test_unary_ops`
- `python test/test_jit_fuser_te.py TestTEFuser.test_unary_ops`
Imported from OSS
Reviewed By: eellison
Differential Revision: D25545791
fbshipit-source-id: e52f51a352d149f66ce8341fb3beb479be08a230
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49396
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49271
Two things:
1. These throw exceptions in their constructor, which causes a segfault (*), so
move the exceptions to ::make.
2. They technically support FP types but the rules are complicated so let's not
bother.
(*) The reason for the segfault: all Exprs including these inherit from
KernelScopedObject, whose constructor adds the object to a list for destruction
at the end of the containing KernelArena's lifetime. But if the derived-class
constructor throws, the object is deleted even though it's still in the
KernelArena's list. So when the KernelArena is itself deleted, it double-frees
the pointer and dies. I've also fixed And, Or, and Xor in this diff.
ghstack-source-id: 118594998
Test Plan: `buck test //caffe2/test:jit`
Reviewed By: bwasti
Differential Revision: D25512052
fbshipit-source-id: 42670b3be0cc1600dc5cda6811f7f270a2c88bba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49247
uint8 exposes all kinds of corner cases in type promotion. As an example, consider:
```
>>> torch.tensor([1], dtype=torch.uint8).lt(-1)
tensor([True])
>>> torch.tensor([1], dtype=torch.uint8).lt(torch.tensor(-1))
tensor([True])
>>> torch.tensor([1], dtype=torch.uint8).lt(torch.tensor([-1]))
tensor([False])
```
The difference is how promotions involving scalars (or 0-dim tensors, which are treated like scalars) are prioritized relative to tensor dtypes.
Per eellison, the order is something like:
1. Tensor FP types
2. Scalar FP types
3. Tensor Int types
4. Scalar Int types
The logic for this is here: c73e97033a/aten/src/ATen/native/TypeProperties.cpp (L93)
AFAICT the effects are mainly visible for the unsigned byte type (the only unsigned type, besides bool) since the others degrade more or less gracefully.
It's hard to re-use this logic as is in TensorIterator/TypeProperties, and it's complicated enough that it's not worth re-implementing in TE unless there's evidence that it matters for real models.
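For illustration (not from the original PR), `torch.result_type` makes the priority order above visible directly in eager mode:
```python
import torch

t_u8 = torch.tensor([1], dtype=torch.uint8)

# A tensor int dtype outranks a scalar int: the scalar adapts to uint8.
print(torch.result_type(t_u8, 5))                    # torch.uint8
# A scalar FP outranks a tensor int: the result becomes the default float dtype.
print(torch.result_type(t_u8, 1.0))                  # torch.float32
# A tensor FP outranks everything else here.
print(torch.result_type(t_u8, torch.tensor([1.0])))  # torch.float32
```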
ghstack-source-id: 118555597
Test Plan: `buck test //caffe2/test:jit`
Reviewed By: eellison
Differential Revision: D25489035
fbshipit-source-id: db3ab84286d472fd8a247aeb7b36c441293aad85
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49143
Riddle me this, Batman: how could `torch.clamp(torch.tensor([0], dtype=torch.uint8), -10, 10)` equal `10`? The answer: the min/max args are first cast to the dtype of the input, giving min=246 and max=10. Then you have to apply Min and Max in the right order: `Min(Max(in, min), max)`. Differ in any way and you're doomed. Hooray.
This PR makes TE match eager mode for this operator, plus fixes a major facepalm in the llvm min/max codegen where we were always generating signed comparisons.
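A minimal eager-mode repro of the behavior described above (the expected output follows the semantics stated in this summary):
```python
import torch

x = torch.tensor([0], dtype=torch.uint8)
# min=-10 and max=10 are first cast to uint8: -10 wraps around to 246, 10 stays 10.
# Eager then applies Min(Max(x, 246), 10), which yields 10.
print(torch.clamp(x, -10, 10))  # tensor([10], dtype=torch.uint8)
```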
ghstack-source-id: 118415318
Test Plan: `buck test //caffe2/test:{jit,tensorexpr}`
Reviewed By: robieta
Differential Revision: D25456366
fbshipit-source-id: dde3c26c2134bdbe803227601fa3d23eaac750fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48679
This addresses the remaining problem reported in issue #48053
Data type support for aten kernels in the SimpleIREvaluator is not
consistent with the aten::native library implementation. In the SimpleIREvaluator,
- only float/double are supported for aten::abs (integral types and half are missing)
- only float/double are supported for aten::frac (half is missing)
It is also not clear from the kernel.cpp source code what the expected
input data types for an aten kernel are, leading to potential missing-data-type
issues down the road.
This commit addresses both issues in a limited way:
- Added type promotion ops from half/integral input types to float (a rough
sketch of the strategy follows the limitations list below)
- Added skeleton support for type checking of aten kernels; currently it only
checks for valid data types for frac and abs to limit the scope of the change,
but the utility function can be used to consistently add type checking for all
aten functions
Known limitations:
- abs support for integral types could be made more efficient by invoking
std::abs for integral tensors (currently kFabs maps to std::fabs).
Since that change is a bit more involved (e.g., changing the IntrinsicsOp
kFabs to kAbs and updating the other code generators accordingly), it is left
for another issue
- other aten kernels may need similar type checking, and the use of
promoteToFloat deserves some scrutiny to detect invalid data types early on.
That is also left for another issue
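A rough Python sketch of the promote-then-demote strategy described above (the helper name is made up for illustration; the real change lives in the NNC lowering, not in Python):
```python
import torch

def abs_via_float_promotion(x: torch.Tensor) -> torch.Tensor:
    # Widen half/integral inputs to float, run the op in float,
    # then narrow the result back to the original dtype.
    if not x.dtype.is_floating_point or x.dtype == torch.half:
        return torch.abs(x.to(torch.float)).to(x.dtype)
    return torch.abs(x)

print(abs_via_float_promotion(torch.tensor([-3, 2], dtype=torch.int32)))
# tensor([3, 2], dtype=torch.int32)
```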
Test Plan:
test_jit_fuser_te.test_unary_ops
Imported from OSS
Reviewed By: asuhan
Differential Revision: D25344839
fbshipit-source-id: 95aca04c99b947dc20f11e4b3bae002f0ae37044
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48264
Preserves the strided representation of NNC Tensor outputs by transforming them into the right layout at the end of the kernel.
Fix for https://github.com/pytorch/pytorch/issues/45604
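As a concrete illustration (not from the PR itself), this is the kind of strided output layout eager mode produces and the fused kernel must now reproduce:
```python
import torch

# Pointwise ops follow the input's memory layout in eager mode, so a
# channels-last input yields a channels-last (strided) output.
x = torch.randn(2, 3, 4, 4).to(memory_format=torch.channels_last)
y = torch.relu(x) * 2
print(y.is_contiguous(memory_format=torch.channels_last))  # True
```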
Test Plan: Imported from OSS
Reviewed By: nikithamalgifb
Differential Revision: D25286213
Pulled By: eellison
fbshipit-source-id: 64d94ac463741e2568a1c9d44174e15ea26e511f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48700
fmod and remainder on int tensors will raise ZeroDivisionError if their divisors are 0. I don't think we should try to generate code that raises exceptions. If at some point we really wanted to fuse these, I might lean towards calling a C++ helper function from the generated code.
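A small eager-mode example of the behavior in question (the exception is per the note above; the exact error type and message may vary by build):
```python
import torch

a = torch.tensor([5, 7], dtype=torch.int64)
b = torch.tensor([2, 0], dtype=torch.int64)
# Eager mode raises a zero-division error here because the divisor contains 0;
# generated fused code has no clean way to surface the same exception.
torch.fmod(a, b)
```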
ghstack-source-id: 117845642
Test Plan: `buck test //caffe2/test:jit -- test_binary_ops`
Reviewed By: eellison
Differential Revision: D25265792
fbshipit-source-id: 0be56ba3feafa1dbf3c37f6bb8c1550cb6891e6d
Summary:
Add missing types for bitwise_ops in `SimpleIREvaluator`
This is the first part of fixes for issue https://github.com/pytorch/pytorch/issues/48053.
- The original implementation of bitwise_ops supports only int operands; this
fix adds support for all integral types supported by the IR.
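For reference, the eager-mode behavior being matched: bitwise ops work for any integral dtype, not just the default int (a minimal example, not from the PR):
```python
import torch

a = torch.tensor([0b1100], dtype=torch.int16)
b = torch.tensor([0b1010], dtype=torch.int16)
print(a & b, a | b, a ^ b)  # values 8, 14, 6, all with dtype torch.int16
```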
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48179
Test Plan: `python test/test_jit_fuser_te.py TestTEFuser.test_bitwise_ops`
Reviewed By: ZolotukhinM
Differential Revision: D25126944
Pulled By: penguinwu
fbshipit-source-id: 04dc7fc00c93b2bf1bd9f9cd09f7252357840b85
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48326
The PR introduces a set of 'cuda-only' ops into the `isSupported` function.
This is done to disable `pow` lowering on CPU, where it's tricky to support
integer versions.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D25129211
Pulled By: ZolotukhinM
fbshipit-source-id: c62ae466e1d9ba9b3020519aadaa2a7fe7942d84
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48213
It was completely broken unless the rhs was a constant.
Test Plan: new unit test in test_jit_fuser_te.py
Reviewed By: eellison
Differential Revision: D25071639
fbshipit-source-id: ef1010a9fd551db646b83adfaa961648a5c388ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48085
We were treating it as a binary operator, which implies shape
broadcasting, even though the second arg is thrown away aside from its type.
Treating it as a unary op is the proper approach.
ghstack-source-id: 116873680
Test Plan: new unit test
Reviewed By: ZolotukhinM
Differential Revision: D25017585
fbshipit-source-id: 0cfa89683c9bfd4fbb132617c74b47b268d7f368
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48084
as title
ghstack-source-id: 116870328
Test Plan: new unit test
Reviewed By: Krovatkin
Differential Revision: D25017489
fbshipit-source-id: 0d1998fccad6f509db04b6c67a4e4e4093d96751
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47884
We need to know output types of everything in a fusion group to ensure
that we generate correctly-typed tensors. We were incorrectly starting a
fusion group with an unknown-typed output.
Test Plan:
New unit tests:
```
buck test //caffe2/test:jit //caffe2/test/cpp/tensorexpr:tensorexpr
```
Reviewed By: eellison
Differential Revision: D24932786
fbshipit-source-id: 83978a951f32c1207bbc3555a7d3bd94fe4e70fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47374
A few small fixes needed to enable unary op cpu testing. If reviewers would prefer I split them up let me know.
Test Plan: Imported from OSS
Reviewed By: ansley
Differential Revision: D24805248
Pulled By: eellison
fbshipit-source-id: c2cfe2e3319a633e64da3366e68f5bf21d390cb7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46951
If, e.g., we're casting from torch.int -> torch.bool, previously we would just truncate from int32 -> i8. Since torch.bool has 8 bits but only uses one of them, we need to make sure that bit is set correctly.
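A quick eager-mode illustration of why plain truncation is wrong (not from the PR):
```python
import torch

x = torch.tensor([256, 2, 0], dtype=torch.int32)
print(x.to(torch.bool))  # tensor([ True,  True, False])
# Truncating 256 (0x100) to 8 bits would give 0, i.e. False; the cast must
# instead test against zero so any nonzero value maps to True.
```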
Test Plan: Imported from OSS
Reviewed By: ansley
Differential Revision: D24805253
Pulled By: eellison
fbshipit-source-id: af3aa323f10820d189827eb51037adfa7d80fed9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46950
Make sure that we're actually fusing in the fusion tests, and refactor to a more concise API for checking whether fusions have happened.
Test Plan: Imported from OSS
Reviewed By: ansley
Differential Revision: D24805250
Pulled By: eellison
fbshipit-source-id: f898008a64b74e761bb5fe85f91b3cdf2dbdf878
Summary:
References https://github.com/pytorch/pytorch/issues/42515
> Enable integer -> float unary type promotion for ops like sin
Will follow up for other such ops once this PR is merged.
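A minimal eager-mode reference for the promotion being enabled:
```python
import torch

x = torch.tensor([0, 1, 2], dtype=torch.int64)
print(torch.sin(x).dtype)  # torch.float32 -- integer inputs promote to float
```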
cc: mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45733
Reviewed By: zou3519
Differential Revision: D24431194
Pulled By: mruberry
fbshipit-source-id: db600bc5de0e535b538d2aa301c3526b7c75ed17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45788
We were only running the traced graph once, at which point it would not yet have been fused. We should run it num_profiled_runs + 1 times, and also assert that all nodes in the graph were fused.
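A rough sketch of the warm-up pattern this implies (illustrative only; the warm-up count and inspection call here are assumptions, not the exact test code):
```python
import torch

@torch.jit.script
def f(x):
    return (x + 1).relu()

x = torch.randn(8)
# Under the profiling executor the graph is only optimized (and fused) after
# it has been profiled, so run it more than once before inspecting it.
for _ in range(3):  # roughly num_profiled_runs + 1
    f(x)
print(torch.jit.last_executed_optimized_graph())  # should show the fusion group
```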
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24169537
Pulled By: eellison
fbshipit-source-id: 8499bb1a5bd9d2221b1f1c54d6352558cf07ba9a
Summary:
The CUDA HalfChecker casts up all loads and stores of Half to Float, so we do math in Float on the device. It didn't cast up HalfImmediate (i.e. constants), so those could introduce mixed-size ops. The fix is to cast them up as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45213
Reviewed By: ezyang
Differential Revision: D23885287
Pulled By: nickgg
fbshipit-source-id: 912991d85cc06ebb282625cfa5080d7525c8eba9
Summary:
For integral types, isnan is meaningless. Provide specializations for
maximum and minimum which don't call it.
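For context (a small eager-mode reference, not part of this change): floating-point maximum/minimum propagate NaN, which is why the isnan check exists in the first place, while integral types have no NaN at all.
```python
import torch

print(torch.maximum(torch.tensor([1.0]), torch.tensor([float('nan')])))  # tensor([nan])
print(torch.maximum(torch.tensor([1]), torch.tensor([2])))               # tensor([2])
```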
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44984
Test Plan: python test/test_jit_fuser_te.py -k TestTEFuser.test_minmax_int_ops
Reviewed By: ezyang
Differential Revision: D23885259
Pulled By: asuhan
fbshipit-source-id: 2e6da2c43c0ed18f0b648a2383d510894c574437
Summary:
Arithmetic operations on Bool aren't fully supported in the evaluator. Moreover,
such semantics can be implemented by the client code through insertion of
explicit casts to widen and narrow to the desired types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44677
Test Plan:
test_tensorexpr --gtest_filter=TensorExprTest.ExprDisallowBoolArithmetic
python test/test_jit_fuser_te.py
Reviewed By: agolynski
Differential Revision: D23801412
Pulled By: asuhan
fbshipit-source-id: fff5284e3a216655dbf5a9a64d1cb1efda271a36
Summary:
Fixes a bug where FP16 values could be incorrectly cast to a half type that doesn't have a cast operator. The fix inserts the CUDA-specific cast to float while handling the Cast node, rather than as a wrapper around printing Loads and Stores. Two main changes: the HalfChecker now inserts the casts to float explicitly in the IR, and the PrioritizeLoad mutator now consumes both Loads and a Cast that immediately precedes a Load.
Tested with test_jit_fuser_te.py and test_tensorexpr.py, plus C++ tests obv.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44209
Reviewed By: izdeby
Differential Revision: D23575577
Pulled By: nickgg
fbshipit-source-id: 808605aeb2af812758f96f9fdc11b07e08053b46
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44073
We don't have proper support for it on the NNC side or in the JIT IR->NNC lowering yet.
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D23487905
Pulled By: ZolotukhinM
fbshipit-source-id: da0da7478fc8ce7b455176c95d8fd610c94352c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43635
Intern the symbol; no functional changes. Aliasing needs to be looked at, but that should be done in a separate PR; this PR just changes the symbol.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23358806
Pulled By: eellison
fbshipit-source-id: f18bcd142a0daf514136f019ae607e4c3f45d9f8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43631
I added a new test for just profiler stuff - I don't think the test should go in test_jit.py. Maybe this should just go in test_tensorexpr_fuser, but I'm not really testing tensorexpr stuff either... LMK
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23358810
Pulled By: eellison
fbshipit-source-id: 074238e1b60e4c4a919a052b7a5312b790ad5d82
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43173
With this change the fuser starts to generate typechecks for the inputs of a
fusion group. For each fusion group we generate a typecheck and an if
node: the true block contains the fused subgraph, the false block
contains the unoptimized original subgraph.
Differential Revision: D23178230
Test Plan: Imported from OSS
Reviewed By: eellison
Pulled By: ZolotukhinM
fbshipit-source-id: f56e9529613263fb3e6575869fdb49973c7a520b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42766
**Summary**
Some python tests are missing in `caffe2/test/TARGETS`; add them to make the coverage more comprehensive.
According to [run_test.py](https://github.com/pytorch/pytorch/blob/master/test/run_test.py#L125), some tests are slower. Slow tests are added as independent targets and the others are put together into one `others` target. The reason is that we want to reduce overhead, especially for code coverage collection. Tests in one target can be run as a bundle, and then coverage can be collected together. The coverage collection procedure is typically time-expensive, so this helps us save time.
Test Plan:
Run all the new test targets locally on a dev server and record the time they take.
**Statistics**
```
# jit target
real 33m7.694s
user 653m1.181s
sys 58m14.160s
--------- Compare to Initial Jit Target runtime: ----------------
real 32m13.057s
user 613m52.843s
sys 54m58.678s
```
```
# others target
real 9m2.920s
user 164m21.927s
sys 12m54.840s
```
```
# serialization target
real 4m21.090s
user 23m33.501s
sys 1m53.308s
```
```
# tensorexpr
real 11m28.187s
user 33m36.420s
sys 1m15.925s
```
```
# type target
real 3m36.197s
user 51m47.912s
sys 4m14.149s
```
Reviewed By: malfet
Differential Revision: D22979219
fbshipit-source-id: 12a30839bb76a64871359bc024e4bff670c5ca8b
Summary:
Remove `skipIfRocm` from most jit tests and enable `RUN_CUDA_HALF` tests for ROCm.
These changes passed more than three rounds of CI testing against the ROCm CI.
CC ezyang xw285cornell sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40447
Differential Revision: D22190711
Pulled By: xw285cornell
fbshipit-source-id: bac44825a2675d247b3abe2ec2f80420a95348a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40142
test_jit is becoming huge again, which makes it hard for editors to load and
hard to write new tests; this splits out the tracer-related tests.
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D22085035
Pulled By: wanchaol
fbshipit-source-id: 696bee84985ecfbfeac8e2ee5c27f1bdda8de394
Summary:
After an early return, we conditionalize all further execution. This means that currently the pattern of
`if return elif return elif return` generates better code than `if return if return if return`. It's obviously not good to have semantically equivalent code generate worse IR, so we should rewrite the graph to handle this case. This came up in https://github.com/pytorch/pytorch/pull/37171
```
@torch.jit.script
def test_foo(x: bool, y: bool):
if x:
return 1
return 2
print(test_foo.code)
```
generates:
```
def test_foo(x: bool,
y: bool) -> int:
_0 = uninitialized(int)
if x:
_1, _2 = True, 1
else:
_1, _2 = False, _0
if _1:
_3 = _2
else:
_3 = 2
return _3
```
while
```
@torch.jit.script
def test_foo(x: bool, y: bool):
if x:
return 1
else:
return 2
print(test_foo.code)
```
generates:
```
def test_foo(x: bool,
y: bool) -> int:
if x:
_0 = 1
else:
_0 = 2
return _0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38282
Differential Revision: D21576733
Pulled By: eellison
fbshipit-source-id: 80cf1ad7fbda6d8d58557abbfb21c90eafae7488
Summary:
The existing contextmanager only conditionally enabled profiling mode, which was counterintuitive. When we changed the default executor, this broke internal benchmarking as a result.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37825
Differential Revision: D21404611
Pulled By: eellison
fbshipit-source-id: 306b3c333ef4eb44ab6a6e5ab4e0682e5ce312ce
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35913
The pass itself is still disabled by default, but with this change we
don't need to register it as a custom pass anymore. It allows us to
control its behavior with env variables more easily.
Test Plan: Imported from OSS
Reviewed By: suo
Differential Revision: D20827189
Pulled By: ZolotukhinM
fbshipit-source-id: e74d90b5e46422e7ab7bc40974a805220da50fbc
Summary:
**Summary:** This PR contains the infrastructure of a new CUDA fuser. This CUDA fuser is based on many of the same principles as TensorExpressions and Halide, however the implementation is ground up. The fusion pass itself is similar to the default CUDA fuser; however, it has undergone some refactoring and uses the new code generation infrastructure. For those who are interested in how the code generation in this PR works, I would recommend reviewing _test/cpp/jit/test_gpu_fusion.cpp_ as well as the long comment section at the beginning of _torch/csrc/jit/codegen/cuda/transform_replay.h_. One of the largest differences between our approach and that of TVM/Halide is the concept of a "TensorView". A TensorView, from a high level, should be thought of similarly to how we think of working with Tensors in PyTorch. It's an N-D object which can undergo transformations that change its dimensionality. Dimensionality changes are done through the operations split/merge/reorder/computeAt. These transformations are similar to split/fuse/reorder/compute_at in TVM; they modify how a tensor is iterated over to generate GPU code. Interestingly, in our scheme these transformations are applied to tensors and only impact how that tensor is generated.
**Warning:** This PR is purposefully not feature complete with the current fuser. We wanted to separate out the infrastructure from the fusion capabilities. Once in, smaller incremental PRs will be submitted to expand capabilities of the fuser.
**Short term goals:**
Parity with current CUDA fuser (including performance):
- Dynamic shapes (no recompilation)
- Implicit handling of broadcast (broadcasted tensors are treated as tensors of the broadcasted size in the generated code)
- Dropout
**Mid-term goals:**
- Transposes fused with pointwise operations where transpose involves only 2 axes (across the fused operation).
- 1-D reductions fused with pointwise operations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34785
Reviewed By: ZolotukhinM
Differential Revision: D20650977
Pulled By: soumith
fbshipit-source-id: ee39c95a880e1b9822e874ed4cc180971572bf63
Summary:
This commit allows one to use an environment variable to enable the fuser in torch/csrc/jit/tensorexpr/
```
PYTORCH_TENSOREXPR=1 python benchmark.py
```
This commit also changes the registration to happen by default, removing the requirement for the python exposed "_jit_register_tensorexpr_fuser"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35341
Reviewed By: ZolotukhinM
Differential Revision: D20676348
Pulled By: bwasti
fbshipit-source-id: 4c997cdc310e7567c03905ebff72b3e8a4c2f464