Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58974
I don't know how we overlooked this for so long...
ghstack-source-id: 129932134
Test Plan:
Predictor test of model 184778294_0 using multiple request replay
threads. It's not clear to me why multithreading matters, except that perhaps
it makes it easier to get an unknown shape in the profile.
Reviewed By: navahgar
Differential Revision: D28702660
fbshipit-source-id: 565550b1d2e571d62d0c8b21150193f2a7ace334
Summary:
This gets rid of a lot of the try/else rigamarole.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58788
Reviewed By: ZolotukhinM
Differential Revision: D28621054
Pulled By: Chillee
fbshipit-source-id: d0d8a1b6466eb318d939a1ed172b78f492ee0d5b
Summary:
Finds a few bugs:
1. permute needs to wrap dimensions
2. slice needs to wrap dimensions
3. frac doesn't work correctly for negative values
4. permute has some other failures.
This PR also fixes bugs 1 and 2.
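As an eager-mode reference for the wrapping and sign behavior the fused code needs to match (shapes and values below are illustrative only):
```
import torch

x = torch.randn(2, 3, 4)
# Negative dims must wrap: permute(-1, 0, 1) is the same as permute(2, 0, 1).
assert torch.equal(x.permute(-1, 0, 1), x.permute(2, 0, 1))
# Slicing/narrowing along a negative dim must wrap the same way.
assert torch.equal(x.narrow(-1, 0, 2), x.narrow(2, 0, 2))
# frac keeps the sign of its input: frac(-1.5) is -0.5, not 0.5.
print(torch.frac(torch.tensor([-1.5, 1.5])))  # tensor([-0.5000,  0.5000])
```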
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58719
Reviewed By: SplitInfinity
Differential Revision: D28590457
Pulled By: Chillee
fbshipit-source-id: a67fce67799602f9396bfeef615e652364918fbd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58346
If `dim` is a variable, NNC doesn't know how to translate the result,
since the shape is unknown. This issue manifested as a `bad_variant_access`
when we try to pull an int constant out of that arg.
Note that, while the PE will pick up the resultant shape, it won't set guards accordingly.
ghstack-source-id: 129078971
Test Plan: new fuser test
Reviewed By: navahgar
Differential Revision: D28460956
fbshipit-source-id: 57ef918ef309ee57bfdf86717b910b6549750454
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58256
Size-1 dims mess up our output restriding logic, because they're
technically "dense" no matter what stride the dimension has. In this example a
size-1 dim has stride 1, which causes all the indices to be taken mod 1 (i.e.,
all indices become 0). We work around this peculiar case by skipping size-1 in
our layout logic, since it has no impact on the rest of the tensor's indexing.
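A small eager-mode illustration of why size-1 dims can be skipped (the `as_strided` views here are just for demonstration):
```
import torch

# A size-1 dim is "dense" no matter what stride it carries: both views below
# describe the same memory and both report as contiguous.
base = torch.randn(1, 4)
a = torch.as_strided(base, (1, 4), (4, 1))  # conventional stride in the size-1 dim
b = torch.as_strided(base, (1, 4), (1, 1))  # stride 1 in the size-1 dim
print(a.is_contiguous(), b.is_contiguous())  # True True
```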
ghstack-source-id: 128932739
Test Plan:
new unit test, plus
```
buck test mode/dev //langtech/mobile/audio_stream_processor:audio_stream_processor_test -- --exact 'langtech/mobile/audio_stream_processor:audio_stream_processor_test - AudioStreamProcessorTest.DemucsReadWriteFloat'
```
Reviewed By: eellison
Differential Revision: D28424388
fbshipit-source-id: e33e39eef2a5bf2797bee78a5987558308b6d110
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57749
Add to an FX test.
Test Plan: Imported from OSS
Reviewed By: huiguoo
Differential Revision: D28425974
fbshipit-source-id: 195c7a1944decb7a2a99c2831cab38485f32be17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58207
We probably don't even know what these tests check and there are no
plans on re-enabling them - let's just nuke them to keep the code clean.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D28403251
Pulled By: ZolotukhinM
fbshipit-source-id: fe12e978636a74f309f57e3408ab78d459fe4d29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58206
Tested on CUDA with and without `PYTORCH_TENSOREXPR_DONT_USE_LLVM=1`.
Closes #48053.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D28403250
Pulled By: ZolotukhinM
fbshipit-source-id: 1ae1cfed691e0077a37db646937e580fbd32b23f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58028
We were trying to translate the device argument and thus throwing an
"unsupported dtype" error.
ghstack-source-id: 128748658
Test Plan: predictor models
Reviewed By: navahgar
Differential Revision: D28347704
fbshipit-source-id: 331a5786339e01f9df1b1878970b0c5983a92980
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57798
Our instruction sequence was just plain wrong, instead of `fcmp une %x, +0.0`
(unordered equal 0.0) we were doing `fcmp uno`, which is just an unordered check
(i.e., is either side NaN).
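In eager-mode terms, the intended semantics are "not equal to zero" (with NaN counting as nonzero), not "is either operand NaN"; a quick sketch of the difference:
```
import torch

x = torch.tensor([0.0, 1.0, float("nan")])
# Intended behavior (fcmp une %x, +0.0): true wherever x != 0; NaN counts as true.
print(x.to(torch.bool))  # tensor([False,  True,  True])
# What fcmp uno computes instead: true only where an operand is NaN.
print(torch.isnan(x))    # tensor([False, False,  True])
```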
ghstack-source-id: 128586464
Test Plan: New unit test against the full cross-product of dtypes.
Reviewed By: navahgar
Differential Revision: D28276269
fbshipit-source-id: ba5e59778e07770fb78ef02309f10edde333a800
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57383
Notes: I picked up an activation from https://github.com/pytorch/pytorch/issues/56969. You can look at the [activations.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/Activation.cpp#L429) file which has both forward and backward kernel code to help you write the NNC lowering and the symbolic gradient.
I added a test in test_jit_fuser_te for the fusion, and I added an OpInfo and asserted that we expect to see autodiffable nodes to test the symbolic gradient.
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D28197820
Pulled By: eellison
fbshipit-source-id: 05305d85c5bb0847c8f911b95ba47b137dca7e90
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56308
But only for float tensors. Even on CUDA, int tensors just have weird
behavior with pow, and I bet FP is so much more common that it's just not worth
trying to fuse ints here.
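One of the int-pow corner cases alluded to above, sketched in eager mode:
```
import torch

# Float pow is well behaved and worth fusing.
print(torch.tensor([2.0]).pow(-1))  # tensor([0.5000])
# Integer pow is not: e.g. negative exponents raise in eager mode on CPU.
try:
    torch.tensor([2]).pow(-1)
except RuntimeError as err:
    print(err)  # Integers to negative integer powers are not allowed
```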
ghstack-source-id: 126769637
Test Plan: `pytest test_jit_fuser_te.py -k test_binary_pow`
Reviewed By: navahgar
Differential Revision: D27834694
fbshipit-source-id: 7274d72cf02ab95d63574b6c17995b8f34560810
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54605
For small sizes we generate a naive 3-layer loop nest; for bigger sizes
we generate an external call.
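For reference, the small-size path amounts to a naive triple loop of this shape (a plain-Python sketch, not the generated NNC IR):
```
def naive_matmul(A, B):
    # Reference 3-layer loop nest: C[i][j] += A[i][k] * B[k][j].
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i][j] += A[i][k] * B[k][j]
    return C
```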
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D27298364
Pulled By: ZolotukhinM
fbshipit-source-id: 2ddf275ff68d6fca16a3befca5ce5c26aef462b5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56120
This reverts commit ad17fadbfc (D27786457).
The big annoyance here is that depending on the threading mode you may not be
able to toggle num_threads at will, so the fusion tests won't fail.
I hate this solution, but I'm adding a secondary override for the TE fuser.
Now you need to turn on fusion (`_jit_override_can_fuse_on_cpu`); you're then
OK if you're running with 1 thread, or you can add
`_jit_set_texpr_parallel_cpu_enabled` to enable it anyway.
This is (a) mainly for tests, since a real user probably won't fiddle aimlessly
with the thread count, and (b) will go away once NNC's threading support is
fully baked.
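A sketch of how the two toggles compose, assuming both are exposed under `torch._C` as the commit describes:
```
import torch

# Turn on CPU fusion for the TE fuser.
torch._C._jit_override_can_fuse_on_cpu(True)

# With one thread this is enough; with more threads, opt in explicitly as well.
if torch.get_num_threads() > 1:
    torch._C._jit_set_texpr_parallel_cpu_enabled(True)
```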
Test Plan: Imported from OSS
Reviewed By: Krovatkin
Differential Revision: D27788199
Pulled By: bertmaher
fbshipit-source-id: 070d04474f15e9689dbdf8cc1fde43050c6506b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56119
There are apparently still more issues with fp16 on LLVM so let's just
nuke it from orbit while we develop a robust workaround.
ghstack-source-id: 126619411
Test Plan: compile
Reviewed By: ZolotukhinM
Differential Revision: D27787080
fbshipit-source-id: 9e771211fe48266f50fca1de8d40295922da5bca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55970
LLVM's support for float16 is not great, and we were seeing assertion
failures trying to generate code for vectorized uses. I note that clang
doesn't even try to vectorize operations involving half:
https://gcc.godbolt.org/z/86MW4xr17, so that's a good sign we shouldn't either.
Fixes #55905
ghstack-source-id: 126511474
Test Plan: pytest test_jit_fuser_te.py -k test_isnan
Reviewed By: asuhan
Differential Revision: D27752279
Pulled By: bertmaher
fbshipit-source-id: ac115080bf2a4a73d52b396d64a5bce0cf13abfe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55621
Fuser support for thread-level parallelism is a work in progress, so
only fuse when the program is running single-threaded.
ghstack-source-id: 126069259
Test Plan: observe the fusion groups formed when torch.get_num_threads() == 1 vs. when it is greater than 1
Reviewed By: ZolotukhinM
Differential Revision: D27652485
fbshipit-source-id: 182580cf758d99dd499cc4591eb9d080884aa7ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55213
Adds the integration of conv2d with the TE fuser. A few things of interest:
- I'm *super* selective of what convs get lowered. Only 3x3 depthwise, because
I've benchmarked those to death and I'm pretty sure it's a good change.
- I'm allowing single-node "fusion" groups for supported convs. (Maybe this is
a sign that conv2d codegen should go through a different path entirely, but
it seems to basically work).
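Concretely, "3x3 depthwise" here means `groups == in_channels == out_channels` with a 3x3 kernel; a small sketch with illustrative shapes:
```
import torch

# A 3x3 depthwise convolution: each input channel gets its own 3x3 filter.
dw = torch.nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3,
                     padding=1, groups=32)
x = torch.randn(1, 32, 56, 56)
print(dw(x).shape)  # torch.Size([1, 32, 56, 56])
```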
I'll share full benchmark results once I clean them up a little. To
summarize, I tested the following torchvision models containing depthwise
convolutions. Results are single-core on a skylake-avx512:
mobilenet_v2: 8% improvement
mobilenet_v3: 9% improvement
mnasnet: 10% improvement
shufflenet: 18% improvement
Note these are comparing against a baseline with a fast-but-buggy grouped
convolution implementation in MKLDNN. So perf results will be better if
compared on master, but I'm going to assume the MKLDNN bug will be fixed and
re-enabled.
Perf results are more complicated when comparing to freezing plus conversion to
mkldnn layout; mobilenet v2/v3 are still faster, but mnasnet and shufflenet are
not. Landing this doesn't prevent MKLDNN freezing from kicking in though, so
there's no harm (although landing mkldnn freezing will regress mobilenet, but
c'est la vie).
ghstack-source-id: 126076112
Test Plan: New unit test, plus torchvision
Reviewed By: ZolotukhinM
Differential Revision: D27530272
fbshipit-source-id: 92153fad234bc9f1eaa4f7624c543168d1294a87
Summary:
This tests a simple failure mode for a TypeCheck when a shape changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52933
Reviewed By: H-Huang
Differential Revision: D26727583
Pulled By: Krovatkin
fbshipit-source-id: b277218af9572cd6f89f2ece044f7d84d4c10283
Summary:
This is a second attempt to use the graph executor to run forward on a gradient. This allows a second chance to profile intermediate tensors introduced by autodiff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52136
Reviewed By: pbelevich
Differential Revision: D26693978
Pulled By: Krovatkin
fbshipit-source-id: 91dde8009a210950af8e5173668ada241e16dd52
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52264
When CPU fusion is enabled without LLVM support in PyTorch, it causes a huge slowdown (>50x). This PR makes the LLVM backend the default backend for TE. Now, an error will be reported if CPU fusion is enabled without LLVM support, to avoid this performance regression.
This PR also updates the tests to not use LLVM, so that the old flow is continued. This is necessary because tests run in CI do not have LLVM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52314
Reviewed By: ejguan
Differential Revision: D26491294
Pulled By: navahgar
fbshipit-source-id: 74561db1207da805d6d28039450db046ba2988fb
Summary:
This adds guarding for DifferentiableGraph nodes in order to not depend on
Also bailing out on required gradients for the CUDA fuser.
Fixes https://github.com/pytorch/pytorch/issues/49299
I still need to look into a handful of failing tests, but maybe it can be a discussion basis.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49433
Reviewed By: ngimel
Differential Revision: D25681374
Pulled By: Krovatkin
fbshipit-source-id: 8e7be53a335c845560436c0cceeb5e154c9cf296
Summary:
There is an internal user who is experiencing a bug with masked_fill. While I am almost certain this corresponds to an old pytorch version with the bug, the model that is breaking is important and time-sensitive and we are covering all bases to try to get it to work again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50147
Reviewed By: nhsoukai
Differential Revision: D25806541
Pulled By: eellison
fbshipit-source-id: 131bd71b5db9717a8a9cb97973d0b4f0e96455d6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49627
There was a bug in the test that was hidden by the `If eager mode doesn't support a dtype/op/device combo` try / catch, so cuda wasn't being tested. The fix is just to rename `aten::masked_fill` to `aten_masked_fill`.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D25696409
Pulled By: eellison
fbshipit-source-id: 83de1f5a194df54fe317b0035d4a6c1aed1d19a0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49357
This is a follow-up fix for PR #48679, which added support for integer inputs
to aten::abs by promoting integers to float and then demoting the result back
to integers. This PR supports integer inputs to aten::abs more efficiently in
the SimpleIREvaluator by implementing integer inputs for kAbs (renamed from kFabs).
- Rename kFabs to kAbs
- Add support for integer inputs to kAbs in SimpleIREvaluator (note that
llvm_codegen and cuda_codegen already support integer inputs to kAbs)
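For context, a quick eager-mode check of the behavior the evaluator should reproduce without the float round trip (the large value shows where a float detour would lose precision):
```
import torch

x = torch.tensor([-3, 2, -9223372036854775807], dtype=torch.int64)
# abs on an integer tensor stays integral; promoting to float and back
# would not represent the largest int64 magnitudes exactly.
print(torch.abs(x))        # tensor([3, 2, 9223372036854775807])
print(torch.abs(x).dtype)  # torch.int64
```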
Test Plan:
- `PYTORCH_TENSOREXPR_DONT_USE_LLVM=1 python test/test_jit_fuser_te.py
TestTEFuser.test_unary_ops`
- `python test/test_jit_fuser_te.py TestTEFuser.test_unary_ops`
Imported from OSS
Reviewed By: eellison
Differential Revision: D25545791
fbshipit-source-id: e52f51a352d149f66ce8341fb3beb479be08a230
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49396
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49271
Two things:
1. These throw exceptions in their constructor, which causes a segfault (*), so
move the exceptions to ::make.
2. They technically support FP types but the rules are complicated so let's not
bother.
(*) The reason for the segfault: all Exprs including these inherit from
KernelScopedObject, whose constructor adds the object to a list for destruction
at the end of the containing KernelArena's lifetime. But if the derived-class
constructor throws, the object is deleted even though it's still in the
KernelArena's list. So when the KernelArena is itself deleted, it double-frees
the pointer and dies. I've also fixed And, Or, and Xor in this diff.
ghstack-source-id: 118594998
Test Plan: `buck test //caffe2/test:jit`
Reviewed By: bwasti
Differential Revision: D25512052
fbshipit-source-id: 42670b3be0cc1600dc5cda6811f7f270a2c88bba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49247
uint8's expose all kind of corner cases in type promotion. As an example, consider:
```
>>> torch.tensor([1], dtype=torch.uint8).lt(-1)
tensor([True])
>>> torch.tensor([1], dtype=torch.uint8).lt(torch.tensor(-1))
tensor([True])
>>> torch.tensor([1], dtype=torch.uint8).lt(torch.tensor([-1]))
tensor([False])
```
the difference is how promotions involving scalars (or 0-dim tensors, which are treated like scalars) are prioritized compared to tensor dtypes.
Per eellison, the order is something like:
1. Tensor FP types
2. Scalar FP types
3. Tensor Int types
4. Scalar Int types
The logic for this is here: c73e97033a/aten/src/ATen/native/TypeProperties.cpp (L93)
AFAICT the effects are mainly visible for the unsigned byte type (the only unsigned type, besides bool) since the others degrade more or less gracefully.
It's hard to re-use this logic as is in TensorIterator/TypeProperties, and it's complicated enough that it's not worth re-implementing in TE unless there's evidence that it matters for real models.
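A few more eager-mode data points illustrating that priority order (assuming the default float dtype is float32):
```
import torch

t = torch.tensor([1], dtype=torch.uint8)
# Scalar FP outranks tensor int: the result promotes to the default float dtype.
print((t * 2.5).dtype)  # torch.float32
# Scalar int ranks below tensor int: the uint8 tensor dtype wins.
print((t * 2).dtype)    # torch.uint8
# A 0-dim tensor is treated like a scalar, so the tensor dtype still wins here,
# which is why .lt(torch.tensor(-1)) differs from .lt(torch.tensor([-1])) above.
print((t * torch.tensor(2, dtype=torch.int64)).dtype)  # torch.uint8
```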
ghstack-source-id: 118555597
Test Plan: `buck test //caffe2/test:jit`
Reviewed By: eellison
Differential Revision: D25489035
fbshipit-source-id: db3ab84286d472fd8a247aeb7b36c441293aad85
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49143
Riddle me this, batman: how could `torch.clamp(torch.tensor([0], dtype=torch.uint8), -10, 10)` equal `10`? The answer: the min/max args are first cast to the dtype of the input, giving min=246 and max 10. Then you have to apply Min and Max in the right order: `Min(Max(in, min), max)`. Differ in any way and you're doomed. Hooray.
This PR makes TE match eager mode for this operator, plus fixes a major facepalm in the llvm min/max codegen where we were always generating signed comparisons.
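A minimal eager-mode reproduction of the behavior described above (relying on the usual uint8 wraparound of -10 to 246):
```
import torch

x = torch.tensor([0], dtype=torch.uint8)
# Eager casts the min/max args to the input dtype first: -10 -> 246, 10 -> 10.
print(torch.clamp(x, -10, 10))  # tensor([10], dtype=torch.uint8)

# The same result with the casts and the Min(Max(in, min), max) order spelled out.
lo = torch.tensor(-10).to(torch.uint8)  # 246
hi = torch.tensor(10).to(torch.uint8)   # 10
print(torch.min(torch.max(x, lo), hi))  # tensor([10], dtype=torch.uint8)
```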
ghstack-source-id: 118415318
Test Plan: `buck test //caffe2/test:{jit,tensorexpr}`
Reviewed By: robieta
Differential Revision: D25456366
fbshipit-source-id: dde3c26c2134bdbe803227601fa3d23eaac750fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48679
This addresses the remaining problem reported in issue #48053
Data type support for aten kernels in the SimpleIREvaluator is not
consistent with the aten::native library implementation. In SimpleIREvaluator,
- only float/double are supported for aten::abs (integral types and half
are missing)
- only float/double are supported for aten::frac (half is missing)
It is also not clear from the kernel.cpp source code what the expected
input data types for an aten kernel are, leading to potential missing data
type issues down the road.
This commit addresses both issues in a limited way by:
- adding type promotion ops from half/integral input types to float
- adding skeleton support for type checking for aten kernels; currently it
only checks for valid data types for frac and abs to limit the scope of the
change, but the utility function can be used to consistently add type
checking for all aten functions
Known limitations:
- abs support for integral types can be made more effective by invoking
std::abs for integral tensors (currently kFabs maps to std::fabs).
Since that change is a bit more involved (e.g., changing IntrinsicsOp
kFabs to kAbs and other code generators accordingly), will leave it to
another issue
- other aten kernels may need similar type checking and some scrutiny
on the use of promoteToFloat to detect invalid data types early on.
That is also left for another issue
Test Plan:
test_jit_fuser_te.test_unary_ops
Imported from OSS
Reviewed By: asuhan
Differential Revision: D25344839
fbshipit-source-id: 95aca04c99b947dc20f11e4b3bae002f0ae37044
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48264
Preserves the strided representation of NNC Tensor outputs by transforming them into the right layout at the end of the kernel.
Fix for https://github.com/pytorch/pytorch/issues/45604
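A quick eager-mode example of the kind of strided layout being preserved (channels-last is used purely as an illustration):
```
import torch

x = torch.randn(2, 3, 4, 5).to(memory_format=torch.channels_last)
# Eager elementwise ops keep the channels-last strides of the input; the fused
# kernel's output should match that layout rather than being densified.
y = (x + 1).relu()
print(y.is_contiguous(memory_format=torch.channels_last))  # True
```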
Test Plan: Imported from OSS
Reviewed By: nikithamalgifb
Differential Revision: D25286213
Pulled By: eellison
fbshipit-source-id: 64d94ac463741e2568a1c9d44174e15ea26e511f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48700
fmod and remainder on int tensors will raise ZeroDivisionError if their divisors are 0. I don't think we should try to generate code that raises exceptions. If at some point we really wanted to fuse these, I might lean towards calling a C++ helper function from the generated code.
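The eager behavior being sidestepped, sketched below (CPU semantics assumed):
```
import torch

a = torch.tensor([4, 7], dtype=torch.int64)
b = torch.tensor([2, 0], dtype=torch.int64)
try:
    torch.fmod(a, b)  # an integer divisor of 0 raises in eager mode
except RuntimeError as err:
    print(err)  # ZeroDivisionError
```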
ghstack-source-id: 117845642
Test Plan: `buck test //caffe2/test:jit -- test_binary_ops`
Reviewed By: eellison
Differential Revision: D25265792
fbshipit-source-id: 0be56ba3feafa1dbf3c37f6bb8c1550cb6891e6d
Summary:
Add missing types for bitwise_ops in `SimpleIREvaluator`.
This is the first part of the fixes for issue https://github.com/pytorch/pytorch/issues/48053.
- The original implementation of bitwise_ops supported only int operands; the
fix adds support for all integral types supported by the IR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48179
Test Plan: `python test/test_jit_fuser_te.py TestTEFuser.test_bitwise_ops`
Reviewed By: ZolotukhinM
Differential Revision: D25126944
Pulled By: penguinwu
fbshipit-source-id: 04dc7fc00c93b2bf1bd9f9cd09f7252357840b85
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48326
The PR introduces a set of 'cuda-only' ops into the `isSupported` function.
This is done to disable `pow` lowering on CPU, where it's tricky to support
integer versions.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D25129211
Pulled By: ZolotukhinM
fbshipit-source-id: c62ae466e1d9ba9b3020519aadaa2a7fe7942d84
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48213
It was completely broken unless the RHS was a constant.
Test Plan: new unit test in test_jit_fuser_te.py
Reviewed By: eellison
Differential Revision: D25071639
fbshipit-source-id: ef1010a9fd551db646b83adfaa961648a5c388ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48085
We were treating it as a binary operator, which implies shape
broadcasting, even though the second arg is thrown away aside from the type.
Treating it as a unary is the proper approach.
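The op isn't named in this summary, but `type_as` is a familiar example of the same pattern, used here purely for illustration: the second argument contributes only its dtype, so no broadcasting should be implied.
```
import torch

x = torch.randn(4, 5)
ref = torch.zeros(7, dtype=torch.float64)  # shape deliberately non-broadcastable
y = x.type_as(ref)
print(y.shape, y.dtype)  # torch.Size([4, 5]) torch.float64
```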
ghstack-source-id: 116873680
Test Plan: new unit test
Reviewed By: ZolotukhinM
Differential Revision: D25017585
fbshipit-source-id: 0cfa89683c9bfd4fbb132617c74b47b268d7f368
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48084
as title
ghstack-source-id: 116870328
Test Plan: new unit test
Reviewed By: Krovatkin
Differential Revision: D25017489
fbshipit-source-id: 0d1998fccad6f509db04b6c67a4e4e4093d96751
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47884
We need to know output types of everything in a fusion group to ensure
that we generate correctly-typed tensors. We were incorrectly starting a
fusion group with an unknown-typed output.
Test Plan:
New unit tests:
```
buck test //caffe2/test:jit //caffe2/test/cpp/tensorexpr:tensorexpr
```
Reviewed By: eellison
Differential Revision: D24932786
fbshipit-source-id: 83978a951f32c1207bbc3555a7d3bd94fe4e70fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47374
A few small fixes needed to enable unary op cpu testing. If reviewers would prefer I split them up let me know.
Test Plan: Imported from OSS
Reviewed By: ansley
Differential Revision: D24805248
Pulled By: eellison
fbshipit-source-id: c2cfe2e3319a633e64da3366e68f5bf21d390cb7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46951
If e.g. we're casting from torch.int -> torch.bool, previously we would just truncate from int32 -> i8. Since torch.bool has 8 bits but only uses one of them, we need to make sure that one bit is set.
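A small eager-mode sketch of why truncation is wrong here:
```
import torch

x = torch.tensor([256, 1, 0], dtype=torch.int32)
# Correct semantics: int -> bool means "value != 0".
print(x.to(torch.bool))  # tensor([ True,  True, False])
# Truncating to the low 8 bits would turn 256 into 0 and flip the first entry.
print((x & 0xFF) != 0)   # tensor([False,  True, False])
```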
Test Plan: Imported from OSS
Reviewed By: ansley
Differential Revision: D24805253
Pulled By: eellison
fbshipit-source-id: af3aa323f10820d189827eb51037adfa7d80fed9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46950
Make sure that we're actually fusing in the fusion tests, and refactor to a more concise API for checking whether fusions have happened.
Test Plan: Imported from OSS
Reviewed By: ansley
Differential Revision: D24805250
Pulled By: eellison
fbshipit-source-id: f898008a64b74e761bb5fe85f91b3cdf2dbdf878
Summary:
References https://github.com/pytorch/pytorch/issues/42515
> Enable integer -> float unary type promotion for ops like sin
Will follow up for other such ops once this PR is merged.
cc: mruberry
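The eager behavior being matched, in brief (assuming the default float dtype is float32):
```
import torch

x = torch.tensor([0, 1, 2], dtype=torch.int32)
# Unary ops like sin promote integer inputs to the default float dtype.
print(torch.sin(x).dtype)  # torch.float32
```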
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45733
Reviewed By: zou3519
Differential Revision: D24431194
Pulled By: mruberry
fbshipit-source-id: db600bc5de0e535b538d2aa301c3526b7c75ed17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45788
We were only running the traced graph once, which would not yet have been fused at that point. We should run for num_profiled_runs + 1, and also assert that all nodes in the graph were fused.
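A sketch of the testing pattern described above; `num_profiled_runs = 2` is an illustrative value, not necessarily the project's default:
```
import torch

@torch.jit.script
def f(x):
    return (x + 1).relu()

x = torch.randn(8)
num_profiled_runs = 2  # illustrative value
# Run num_profiled_runs + 1 times so the profiling executor has specialized
# and the optimized (fused) graph is the one that actually executed.
for _ in range(num_profiled_runs + 1):
    f(x)
print(torch.jit.last_executed_optimized_graph())
```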
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24169537
Pulled By: eellison
fbshipit-source-id: 8499bb1a5bd9d2221b1f1c54d6352558cf07ba9a
Summary:
The Cuda HalfChecker casts up all loads and stores of Half to Float, so we do math in Float on the device. It didn't cast up HalfImmediate (i.e. constants), so they could insert mixed-size ops. The fix is to do that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45213
Reviewed By: ezyang
Differential Revision: D23885287
Pulled By: nickgg
fbshipit-source-id: 912991d85cc06ebb282625cfa5080d7525c8eba9
Summary:
For integral types, isnan is meaningless. Provide specializations for
maximum and minimum which don't call it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44984
Test Plan: python test/test_jit_fuser_te.py -k TestTEFuser.test_minmax_int_ops
Reviewed By: ezyang
Differential Revision: D23885259
Pulled By: asuhan
fbshipit-source-id: 2e6da2c43c0ed18f0b648a2383d510894c574437
Summary:
Arithmetic operations on Bool aren't fully supported in the evaluator. Moreover,
such semantics can be implemented by the client code through insertion of
explicit casts to widen and narrow to the desired types.
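What "explicit casts to widen and narrow" means in practice, roughly (eager-mode sketch):
```
import torch

a = torch.tensor([True, False, True])
b = torch.tensor([True, True, False])
# Instead of doing arithmetic directly on bool, widen to int, compute,
# then narrow back to the desired type.
summed = a.to(torch.int32) + b.to(torch.int32)
print(summed)                 # tensor([2, 1, 1], dtype=torch.int32)
print(summed.to(torch.bool))  # tensor([True, True, True])
```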
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44677
Test Plan:
test_tensorexpr --gtest_filter=TensorExprTest.ExprDisallowBoolArithmetic
python test/test_jit_fuser_te.py
Reviewed By: agolynski
Differential Revision: D23801412
Pulled By: asuhan
fbshipit-source-id: fff5284e3a216655dbf5a9a64d1cb1efda271a36
Summary:
Fixes a bug where FP16 values could be incorrectly cast to a half type that doesn't have a cast operator, by inserting the CUDA-specific cast to float during handling of the Cast node rather than as a wrapper around printing Loads and Stores. Two main changes: the HalfChecker now inserts the casts to float explicitly in the IR, and the PrioritizeLoad mutator now consumes both Loads and a Cast which immediately precedes a Load.
Tested with test_jit_fuser_te.py and test_tensorexpr.py, plus C++ tests obv.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44209
Reviewed By: izdeby
Differential Revision: D23575577
Pulled By: nickgg
fbshipit-source-id: 808605aeb2af812758f96f9fdc11b07e08053b46
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44073
We don't have proper support for it yet on the NNC side or in the JIT IR->NNC lowering.
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D23487905
Pulled By: ZolotukhinM
fbshipit-source-id: da0da7478fc8ce7b455176c95d8fd610c94352c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43635
Intern the symbol; no functional changes. Aliasing needs to be looked at, but that should be done in a separate PR; this PR is just changing the symbol.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23358806
Pulled By: eellison
fbshipit-source-id: f18bcd142a0daf514136f019ae607e4c3f45d9f8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43631
I added a new test for just profiler stuff - I don't think the test should go in test_jit.py. Maybe this should just go in test_tensorexpr_fuser, but I'm not really testing tensorexpr stuff either... LMK
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23358810
Pulled By: eellison
fbshipit-source-id: 074238e1b60e4c4a919a052b7a5312b790ad5d82
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43173
With this change the fuser starts to generate typechecks for the inputs of a
fusion group. For each fusion group we generate a typecheck and an if
node: the true block contains the fused subgraph, the false block
contains the unoptimized original subgraph.
Differential Revision: D23178230
Test Plan: Imported from OSS
Reviewed By: eellison
Pulled By: ZolotukhinM
fbshipit-source-id: f56e9529613263fb3e6575869fdb49973c7a520b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42766
**Summary**
Some python tests are missing in `caffe2/test/TARGETS`; add them to make the coverage more comprehensive.
According to [run_test.py](https://github.com/pytorch/pytorch/blob/master/test/run_test.py#L125), some tests are slower. Slow tests are added as independent targets and the others are put together into one `others` target. The reason is that we want to reduce overhead, especially for code coverage collection. Tests in one target can be run as a bundle, and then coverage can be collected together. Typically the coverage collection procedure is time-expensive, so this helps us save time.
Test Plan:
Run all the new test targets locally in dev server and record the time they cost.
**Statistics**
```
# jit target
real 33m7.694s
user 653m1.181s
sys 58m14.160s
--------- Compare to Initial Jit Target runtime: ----------------
real 32m13.057s
user 613m52.843s
sys 54m58.678s
```
```
# others target
real 9m2.920s
user 164m21.927s
sys 12m54.840s
```
```
# serialization target
real 4m21.090s
user 23m33.501s
sys 1m53.308s
```
```
# tensorexpr
real 11m28.187s
user 33m36.420s
sys 1m15.925s
```
```
# type target
real 3m36.197s
user 51m47.912s
sys 4m14.149s
```
Reviewed By: malfet
Differential Revision: D22979219
fbshipit-source-id: 12a30839bb76a64871359bc024e4bff670c5ca8b
Summary:
Remove `skipIfRocm` from most jit tests and enable `RUN_CUDA_HALF` tests for ROCm.
These changes passed more than three rounds of CI testing against the ROCm CI.
CC ezyang xw285cornell sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40447
Differential Revision: D22190711
Pulled By: xw285cornell
fbshipit-source-id: bac44825a2675d247b3abe2ec2f80420a95348a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40142
test_jit is becoming huge again, which makes it hard for editors to load and
for us to write new tests; this splits out the tracer-related tests.
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D22085035
Pulled By: wanchaol
fbshipit-source-id: 696bee84985ecfbfeac8e2ee5c27f1bdda8de394
Summary:
After an early return, we conditionalize all further execution. This means that currently the pattern of
`if return elif return elif return` generates better code than `if return if return if return`. It's obviously not good to have semantically equivalent code generate worse IR, so we should rewrite the graph to handle this case. This came up in https://github.com/pytorch/pytorch/pull/37171
```
@torch.jit.script
def test_foo(x: bool, y: bool):
    if x:
        return 1
    return 2

print(test_foo.code)
```
generates:
```
def test_foo(x: bool,
    y: bool) -> int:
  _0 = uninitialized(int)
  if x:
    _1, _2 = True, 1
  else:
    _1, _2 = False, _0
  if _1:
    _3 = _2
  else:
    _3 = 2
  return _3
```
while
```
@torch.jit.script
def test_foo(x: bool, y: bool):
    if x:
        return 1
    else:
        return 2

print(test_foo.code)
```
generates:
```
def test_foo(x: bool,
    y: bool) -> int:
  if x:
    _0 = 1
  else:
    _0 = 2
  return _0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38282
Differential Revision: D21576733
Pulled By: eellison
fbshipit-source-id: 80cf1ad7fbda6d8d58557abbfb21c90eafae7488
Summary:
The existing context manager only conditionally enabled profiling mode, which was counterintuitive. When we changed the default executor, it broke internal benchmarking as a result.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37825
Differential Revision: D21404611
Pulled By: eellison
fbshipit-source-id: 306b3c333ef4eb44ab6a6e5ab4e0682e5ce312ce
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35913
The pass itself is still disabled by default, but with this change we
don't need to register it as a custom pass anymore. It allows us to
control its behavior with env variables more easily.
Test Plan: Imported from OSS
Reviewed By: suo
Differential Revision: D20827189
Pulled By: ZolotukhinM
fbshipit-source-id: e74d90b5e46422e7ab7bc40974a805220da50fbc
Summary:
**Summary:** This PR contains the infrastructure of a new CUDA fuser. This CUDA fuser is based on many of the same principles as TensorExpressions and Halide, however the implementation is written from the ground up. The fusion pass itself is similar to the default CUDA fuser; however, it has undergone some refactoring and is using the new code generation infrastructure. For those who are interested in how the code generation in this PR works, I would recommend reviewing _test/cpp/jit/test_gpu_fusion.cpp_ as well as the long comment section at the beginning of _torch/csrc/jit/codegen/cuda/transform_replay.h_.

One of the largest differences between our approach and that of TVM/Halide is the concept of "TensorView". TensorView, from a high level, should be thought of similarly to how we think of working with Tensors in PyTorch. It's an N-D object which can undergo transformations that change its dimensionality. Dimensionality changes are done through the operations split/merge/reorder/computeAt. These transformations are similar to split/fuse/reorder/compute_at of TVM; they modify how a tensor is iterated over to generate GPU code. Interestingly, in our scheme these transformations are applied to tensors and only impact how that tensor is generated.
**Warning:** This PR is purposefully not feature complete with the current fuser. We wanted to separate out the infrastructure from the fusion capabilities. Once in, smaller incremental PRs will be submitted to expand capabilities of the fuser.
**Short term goals:**
Parity with current CUDA fuser (including performance):
- Dynamic shapes (no recompilation)
- Implicit handling of broadcast (broadcasted tensors are treated as tensors of the broadcasted size in the generated code)
- Dropout
**Mid-term goals:**
- Transposes fused with pointwise operations where transpose involves only 2 axes (across the fused operation).
- 1-D reductions fused with pointwise operations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34785
Reviewed By: ZolotukhinM
Differential Revision: D20650977
Pulled By: soumith
fbshipit-source-id: ee39c95a880e1b9822e874ed4cc180971572bf63
Summary:
This commit allows one to use an environment variable to enable the fuser in torch/csrc/jit/tensorexpr/
```
PYTORCH_TENSOREXPR=1 python benchmark.py
```
This commit also changes the registration to happen by default, removing the requirement for the Python-exposed `_jit_register_tensorexpr_fuser`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35341
Reviewed By: ZolotukhinM
Differential Revision: D20676348
Pulled By: bwasti
fbshipit-source-id: 4c997cdc310e7567c03905ebff72b3e8a4c2f464