If MLIR-generated kernels are enabled, some CastFunctor templates don't need to
be instantiated.
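A minimal sketch of the idea, not TensorFlow's actual code: explicit CastFunctor instantiations can be guarded behind a build macro so they are skipped when MLIR-generated kernels already provide the cast. The macro name below is an assumption for illustration only.

    #include <cstdint>

    // Simplified stand-in for the real CastFunctor template.
    template <typename Tout, typename Tin>
    struct CastFunctor {
      Tout operator()(Tin in) const { return static_cast<Tout>(in); }
    };

    #if !defined(MLIR_GENERATED_GPU_KERNELS_ENABLED)  // assumed guard name
    // Hand-written instantiations are only needed when MLIR-generated kernels
    // do not already cover these type pairs.
    template struct CastFunctor<float, int32_t>;
    template struct CastFunctor<int32_t, float>;
    #endif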
PiperOrigin-RevId: 383597713
Change-Id: Ia85077abbcf4b00132290f1120ea950a638bf7c0
It seems they don't benefit from vectorization and are faster without it.
PiperOrigin-RevId: 383595164
Change-Id: If3ebe5ec1e7efd6e7853c4890b9545be7e0eb873
This requires a mapping from mhlo::Log1pOp to complex::Log1pOp for complex types.
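A hedged sketch of the per-element mapping (illustrative, not the exact upstream lowering): when mhlo.log_plus_one is lowered to scalar code, complex element types map to complex.log1p while floating-point element types map to math.log1p. The helper name is hypothetical.

    #include "mlir/Dialect/Complex/IR/Complex.h"
    #include "mlir/Dialect/Math/IR/Math.h"
    #include "mlir/IR/Builders.h"

    // `loc`, `b`, and `arg` would come from the surrounding elementwise
    // lowering; this helper only picks the scalar op for log(1 + x).
    static mlir::Value EmitScalarLog1p(mlir::Location loc, mlir::OpBuilder &b,
                                       mlir::Value arg) {
      if (arg.getType().isa<mlir::ComplexType>())
        return b.create<mlir::complex::Log1pOp>(loc, arg);
      return b.create<mlir::math::Log1pOp>(loc, arg);
    }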
PiperOrigin-RevId: 383585333
Change-Id: Ia4d3f0c7c18b66db39638322e755bc9ef426c081
This is useful, e.g., for saved_model_cli aot_compile_cpu with a target
architecture of aarch64.
Previously this failed to build, but now it seems OK.
While we're at it, remove the manual tag from saved_model_cli_test.
PiperOrigin-RevId: 383569229
Change-Id: I91dcfccfbd0c817e12dee7e45985437001f56b50
This op reads the preceding cast op that casts a quantized tensor to tfr.tensor. From the quantization parameters, the scale and zero points are converted to constants.
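For reference, the standard affine (de)quantization relationship is what makes the scale and zero point usable as plain constants once read from the quantized type. This is only an illustration of that relationship, not the op's implementation; the struct and function names are hypothetical.

    #include <cmath>
    #include <cstdint>

    struct QuantParams {
      float scale;
      int32_t zero_point;
    };

    // Real value represented by a quantized element: scale * (q - zero_point).
    inline float Dequantize(int8_t q, const QuantParams &p) {
      return p.scale * static_cast<float>(q - p.zero_point);
    }

    // Quantized element for a real value (clamping omitted for brevity).
    inline int8_t Quantize(float real, const QuantParams &p) {
      return static_cast<int8_t>(std::lround(real / p.scale) + p.zero_point);
    }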
PiperOrigin-RevId: 383541459
Change-Id: I1c5d21f8f829c25813d1adfe91a59194d30b2249
Currently, the CallOnce operator guarantees a single invocation only within one
subgraph. When there are multiple entry-point subgraphs, the current
implementation of CallOnce invokes the same initialization separately for each
entry-point subgraph.
To avoid this, the CallOnce operator should cache the invocation information
across subgraphs. This requirement cannot be addressed at the operator level
alone; like resource sharing at the TensorFlow Lite Interpreter class level, it
requires interpreter-level status sharing.
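A hedged sketch of the idea (class and function names are hypothetical, not TFLite's actual API): an interpreter-level record of which initialization subgraphs have already run, shared by every entry-point subgraph's CallOnce op.

    #include <unordered_set>

    class InitializationStatus {
     public:
      // Returns true the first time it is called for `init_subgraph_index`,
      // false on every later call, so the initialization runs exactly once.
      bool MarkInvoked(int init_subgraph_index) {
        return invoked_.insert(init_subgraph_index).second;
      }

     private:
      std::unordered_set<int> invoked_;  // owned by the interpreter, not the op
    };

    // Pseudo-usage inside a CallOnce kernel: the status object comes from the
    // interpreter, so two entry-point subgraphs sharing it will not re-run the
    // same initialization subgraph.
    void MaybeRunInit(InitializationStatus &status, int init_subgraph_index) {
      if (status.MarkInvoked(init_subgraph_index)) {
        // invoke the initialization subgraph here
      }
    }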
PiperOrigin-RevId: 383540722
Change-Id: Ia9f6dc6ed3631b2c37ffc65352cfc1a66ba2581e
The buffer allocation and size information is useful for debugging and for writing memory unit tests.
PiperOrigin-RevId: 383524130
Change-Id: I93d2c1e982a7c27fd6e2e782842cb07c6e73f9ca
Imported from GitHub PR https://github.com/tensorflow/tensorflow/pull/50102
This PR is currently a WIP, shared in advance for discussion. More unit tests will be added.
The basic codegen flow will be as follows (a pipeline sketch follows the steps):
step 1, LhloLegalizeRootsToParallelLoops
step 2, InputInlineFusion
step 3, canonicalize & CSE to optimize the redundant index calculations; we will need a MemRefLoadCSE here since the general CSE will not process memref.load (this is only an optimization, and PRs will be sent as a second priority)
step 4, Inline the lmhlo.Fusion
step 5, ParallelToGPULaunch/ParallelToOpenMP
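A hedged sketch of how the steps above could be assembled into a pass pipeline. The commented-out create*Pass constructors for the DISC-specific passes are name assumptions based on this PR; canonicalize and CSE are standard MLIR passes.

    #include "mlir/Pass/PassManager.h"
    #include "mlir/Transforms/Passes.h"

    void BuildBasicLoopCodegenPipeline(mlir::OpPassManager &pm) {
      // step 1: lower the fusion roots to parallel loops
      // pm.addPass(createLhloLegalizeRootsToParallelLoopsPass());
      // step 2: fuse the remaining producers into the loops
      // pm.addPass(createInputInlineFusionPass());
      // step 3: clean up redundant index calculations
      pm.addPass(mlir::createCanonicalizerPass());
      pm.addPass(mlir::createCSEPass());
      // step 4: inline the lmhlo.Fusion regions
      // step 5: map parallel loops to gpu.launch (or OpenMP on CPU)
      // pm.addPass(mlir::createParallelLoopToGpuPass());
    }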
Codegen schedules other than the basic loop schedule are separated into incoming PRs. Support for lmhlo ops other than RealDynamicSlice, DynamicBroadcastInDim & BroadcastInDim in lhlo_elemental_util is also separated into subsequent PRs.
Other known TODOs include:
(1) There are potentially redundant Linearize/Delinearize calculations in the initial version. A memref.LinearizeIndex/Delinearize op will be brought in by future PRs to optimize the index calculation, as discussed in https://llvm.discourse.group/t/add-an-expanded-load-store-op-in-memref-dialect/3503/17. (See the index-linearization sketch after this list.)
(2) fusion_utils depends on https://github.com/tensorflow/tensorflow/pull/50020 and will be removed after it's done.
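Illustrative only: standard row-major linearization and delinearization of a multi-dimensional index. Fusing ops naively tends to emit a delinearize immediately after a linearize, which is the redundancy described in TODO (1).

    #include <cstdint>
    #include <vector>

    // Row-major flattening of a multi-dimensional index into a linear offset.
    int64_t LinearizeIndex(const std::vector<int64_t> &index,
                           const std::vector<int64_t> &shape) {
      int64_t linear = 0;
      for (size_t d = 0; d < shape.size(); ++d)
        linear = linear * shape[d] + index[d];
      return linear;
    }

    // Inverse operation: recover the multi-dimensional index from the offset.
    std::vector<int64_t> DelinearizeIndex(int64_t linear,
                                          const std::vector<int64_t> &shape) {
      std::vector<int64_t> index(shape.size());
      for (size_t d = shape.size(); d-- > 0;) {
        index[d] = linear % shape[d];
        linear /= shape[d];
      }
      return index;
    }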
Copybara import of the project:
--
056519dd90e60de865657908807c24a943ead385 by tashuang.zk <tashuang.zk@alibaba-inc.com>:
[MLIR][DISC] Add initial version of LhloLegalizeRootsToParallelLoops and InputInlineFusion Pass for codegen
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/tensorflow/pull/50102 from linearhit:disc_dev 056519dd90e60de865657908807c24a943ead385
PiperOrigin-RevId: 383518259
Change-Id: Ifc18e31c06c9faae56d4363a6f88c5292442dba1
The shapes are the same for each element of the tuple, but the actual output
lives at index {1}. This can be confusing to read, since one might assume that
the buffer holding the actual output is at index {0} instead.
PiperOrigin-RevId: 383515890
Change-Id: Ic153d7de9cba57aec03f4981518ff8d41d1392a1
It fails with "Internal: Failed to launch gpuprim::DeviceRadixSort::SortPairs, temp_storage_bytes: 3327status: invalid configuration argument"
Disable the kernel while we're triaging the issue.
PiperOrigin-RevId: 383492772
Change-Id: I6cf99f6d5cc8b39b081c48893a6db601ead6817c