Fix forward for cl/566274725: IsEqualAt is templated and was not written correctly. The check was unrelated to that CL anyway.
PiperOrigin-RevId: 566298256
Imported from GitHub PR https://github.com/openxla/xla/pull/5300
This is a new GPU SPMD optimization pass that rewrites the pattern
binary-op(all-gather(a), all-gather(b))
into
all-gather(binary-op(a, b))
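A minimal sketch of the idea, assuming XLA's HloInstruction API; the function name and the elided legality checks are illustrative, not the pass's actual code:

```c++
#include "absl/status/statusor.h"
#include "xla/hlo/ir/hlo_computation.h"
#include "xla/hlo/ir/hlo_instruction.h"
#include "xla/hlo/ir/hlo_opcode.h"
#include "tsl/platform/errors.h"

namespace xla {

// Hypothetical helper: rewrite binary-op(all-gather(a), all-gather(b))
// into all-gather(binary-op(a, b)). Returns true if a rewrite happened.
absl::StatusOr<bool> MaybeSinkBinaryOp(HloInstruction* binary_op) {
  HloInstruction* lhs = binary_op->mutable_operand(0);
  HloInstruction* rhs = binary_op->mutable_operand(1);
  if (lhs->opcode() != HloOpcode::kAllGather ||
      rhs->opcode() != HloOpcode::kAllGather) {
    return false;
  }
  // The two all-gathers must be compatible (same gather dimension,
  // replica groups, etc.); those legality checks are elided here.
  HloComputation* computation = binary_op->parent();
  // Apply the binary op to the un-gathered operands.
  HloInstruction* local_op = computation->AddInstruction(
      HloInstruction::CreateBinary(lhs->operand(0)->shape(),
                                   binary_op->opcode(),
                                   lhs->mutable_operand(0),
                                   rhs->mutable_operand(0)));
  // Re-create a single all-gather over the combined result.
  HloInstruction* gathered = computation->AddInstruction(
      lhs->CloneWithNewOperands(binary_op->shape(), {local_op}));
  TF_RETURN_IF_ERROR(computation->ReplaceInstruction(binary_op, gathered));
  return true;
}

}  // namespace xla
```

This replaces two collectives with one and shrinks the binary op to the per-shard shape.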
Copybara import of the project:
--
77aafc0686fb98a6e13b6664ee537ed3cde5e24f by kushanam <kahmadian@nvidia.com>:
adding a new pass to optimize reduce_scatter->all_gather->binary_op sequence
--
0b1e8eb599f8a7334b7c9826746db67e0923f2f7 by kushanam <kahmadian@nvidia.com>:
applying review refactors
--
9b181ec7487e7ded4610a779f8929d2e2a199e0d by kushanam <kahmadian@nvidia.com>:
removing reduce-scatter from the all-gather optimization
--
a8c49eb58f3b370627cd57c62f456696567ba60a by kushanam <kahmadian@nvidia.com>:
remove traversal all-gather search and rely on immediate parent
--
d90f5a148bc099455724450b84f1af8fb83ffc66 by kushanam <kahmadian@nvidia.com>:
remove extra gpu word from the directive
Merging this change closes #5300
PiperOrigin-RevId: 566298114
Imported from GitHub PR https://github.com/openxla/xla/pull/5670
Revert the ROCm side, in the same way as 5eb7734505.
@anlunx Thanks in advance!
Copybara import of the project:
--
9dedb1ce2a620bae69c0fbaa8e5822ababfd52bc by Chao Chen <cchen104@amd.com>:
ROCm revert 48cf922
Merging this change closes #5670
PiperOrigin-RevId: 566284041
With this logic, slices will be fused quite rarely (only as inputs, and only slices of direct computation parameters or tiny ones), because slices are generally better fused into their producers to reduce DRAM traffic.
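As a toy illustration of the heuristic (plain C++, not actual fusion code): fusing the slice toward the consumer still materializes the full producer output, while fusing it into the producer computes and writes only the sliced region.

```c++
#include <cstddef>
#include <vector>

// Slice kept separate from the producer: the full intermediate `full`
// is computed and written to memory, then mostly thrown away.
std::vector<float> ProducerThenSlice(const std::vector<float>& a,
                                     size_t lo, size_t hi) {
  std::vector<float> full(a.size());
  for (size_t i = 0; i < a.size(); ++i) full[i] = a[i] * 2.0f;
  return std::vector<float>(full.begin() + lo, full.begin() + hi);
}

// Slice fused into the producer: only the sliced region is ever
// computed or written, so the large intermediate never hits DRAM.
std::vector<float> SliceFusedIntoProducer(const std::vector<float>& a,
                                          size_t lo, size_t hi) {
  std::vector<float> out(hi - lo);
  for (size_t i = lo; i < hi; ++i) out[i - lo] = a[i] * 2.0f;
  return out;
}
```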
PiperOrigin-RevId: 566274725
mlir::ModuleOp::create returns a non-owning reference, which is almost never the intended usage. We may leak memory if we don't manually assign the result to an mlir::OwningOpRef. We actually hit an error like this a few weeks ago.
CreateMlirModuleOp returns an owning reference by default.
I added a check to our internal presubmit, which will fail for mlir::ModuleOp::create calls.
We can opt out of the check by adding /*ALLOW_MLIR_MODULE_OP_CREATE*/ on the same line as the mlir::ModuleOp::create call. I recommend doing this only if really needed, and only in a utility function rather than in general code.
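A minimal sketch of the difference, using standard MLIR APIs (the leaking variant is shown commented out):

```c++
#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/Location.h"
#include "mlir/IR/MLIRContext.h"
#include "mlir/IR/OwningOpRef.h"

void Example(mlir::MLIRContext* context) {
  // Leaks: the returned mlir::ModuleOp is a non-owning handle, and
  // nothing ever erases the underlying operation.
  // mlir::ModuleOp leaked =
  //     mlir::ModuleOp::create(mlir::UnknownLoc::get(context));

  // Correct: OwningOpRef erases the module when it goes out of scope.
  mlir::OwningOpRef<mlir::ModuleOp> module =
      mlir::ModuleOp::create(mlir::UnknownLoc::get(context)); /*ALLOW_MLIR_MODULE_OP_CREATE*/
}
```

In XLA code, prefer the CreateMlirModuleOp helper mentioned above, which returns an owning reference by default.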
PiperOrigin-RevId: 566231363
The recursive transpose algorithm is pretty fundamental. We can implement it on Neon by implementing just a few primitives.
While we are here, we also reduce code bloat by skipping instantiation of unspecialized micro-kernels.
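For illustration, a sketch of the recursive scheme in plain C++ (not the actual micro-kernel code): split along the larger dimension, recurse, and let only the small base-case tile require architecture-specific primitives (on Neon, e.g., zip/transpose instructions over 4x4 tiles).

```c++
// Transpose the rows x cols matrix at `in` into `out`.
// Strides are in elements; out[c * out_stride + r] = in[r * in_stride + c].
void TransposeRec(const float* in, float* out, int rows, int cols,
                  int in_stride, int out_stride) {
  if (rows <= 4 && cols <= 4) {
    // Base case: plain scalar code here; on Neon this tile would be
    // handled by a primitive built from vld1/vtrn/vzip instructions.
    for (int r = 0; r < rows; ++r)
      for (int c = 0; c < cols; ++c)
        out[c * out_stride + r] = in[r * in_stride + c];
    return;
  }
  if (rows >= cols) {
    // Split the rows in half; the halves land side by side in `out`.
    const int half = rows / 2;
    TransposeRec(in, out, half, cols, in_stride, out_stride);
    TransposeRec(in + half * in_stride, out + half, rows - half, cols,
                 in_stride, out_stride);
  } else {
    // Split the columns in half; the halves stack vertically in `out`.
    const int half = cols / 2;
    TransposeRec(in, out, rows, half, in_stride, out_stride);
    TransposeRec(in + half, out + half * out_stride, rows, cols - half,
                 in_stride, out_stride);
  }
}
```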
PiperOrigin-RevId: 566137150
Imported from GitHub PR https://github.com/openxla/xla/pull/5634
Fixed a ROCm build error caused by 7be97ae6ea, which forgot to include the corresponding ROCm change.
Thanks in advance! @tdanyluk @cheshire
Copybara import of the project:
--
9ea7cbda5746cab11348246ebe5b343a80a0f373 by Chao Chen <cchen104@amd.com>:
rocm updated graph api and fixed hlo_op_profiler_test
--
d5576d44459bed0424fb9c1dad57285562889354 by Chao Chen <cchen104@amd.com>:
fixed PluginConfig error
Merging this change closes #5634
PiperOrigin-RevId: 565732648
This CL combines several optimizations:
1. If the combiner is "sum", we avoid all computation and allocation related to gain-rescaling.
2. If the weights are a scalar, we broadcast the same weight to all tokens. This avoids executing a `Shape`->`Fill` to generate a uniform-weight vector (see the sketch after this list).
3. We use an array instead of a `Tensor` to store the temporary vector of rows.
4. A minor improvement to the code for extracting the row IDs from a `SparseTensor`.
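To illustrate (2), a toy sketch with hypothetical names (not the actual kernel code): when the weight is a scalar, read it once per token instead of first materializing a uniform-weight vector.

```c++
#include <cstddef>
#include "absl/types/span.h"

// If `weights` holds a single scalar, it applies to every token, so
// there is no need to Fill a uniform-weight vector up front.
float WeightedSum(absl::Span<const float> values,
                  absl::Span<const float> weights) {
  const bool scalar_weight = (weights.size() == 1);
  float sum = 0.0f;
  for (size_t i = 0; i < values.size(); ++i) {
    sum += (scalar_weight ? weights[0] : weights[i]) * values[i];
  }
  return sum;
}
```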
PiperOrigin-RevId: 565721609
The comparator needs to satisfy the strict weak ordering requirement; otherwise std::sort() may crash.
We can't really verify this robustly without considering all triples, but we can at least smoke-test irreflexivity on the first element: cmp(x, x) must be false.
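A sketch of what such a smoke test can look like (illustrative, not the actual check):

```c++
#include <algorithm>
#include <vector>
#include "absl/log/check.h"

template <typename T, typename Cmp>
void SortWithSmokeTest(std::vector<T>& elems, Cmp cmp) {
  if (!elems.empty()) {
    // A strict weak ordering must be irreflexive: cmp(x, x) == false.
    // Checking just the first element is cheap but already catches
    // common bugs such as comparing with <= instead of <.
    CHECK(!cmp(elems.front(), elems.front()))
        << "comparator violates strict weak ordering";
  }
  std::sort(elems.begin(), elems.end(), cmp);
}
```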
PiperOrigin-RevId: 565716297