and run it only once.
The plan for follow-up changes is to remove the vec×matrix reduction (which currently regresses some models for unrelated reasons) and keep only vec×vec.
PiperOrigin-RevId: 784555311
- The policy is not enforced anywhere, but changes that do not follow the deprecation policy for flags marked [Stable] can be rolled back.
PiperOrigin-RevId: 784532479
This adds the missing serialization support for the `TmaMetadata` field in `KernelThunk`. With this change we can serialize GPU programs that use NVIDIA's TMA for memory access.
PiperOrigin-RevId: 784514465
FollowTupleIndirection() is also used by InstructionFusion, so we could decide
to move it to a separate header. However, InstructionFusion also calls
GetInPlaceInputOutputPairs, so it will have to use AliasInfo in the future
anyway.
PiperOrigin-RevId: 784486223
Example of dumps without explicit collectives:
```
00.input_module.mlir
01.before_propagation.mlir
02.after_propagation.mlir
03.after_post_propagation_optimizations.mlir
04.output_module.mlir
```
PiperOrigin-RevId: 784338779
This update continues the development of the Triton block-level fusion emitter backend, which enables autotuning of tile configurations for custom Triton fusions in XLA.
This backend implements the following core interfaces:
- GetSupportedConfigs: Enumerates all supported combinations of tile sizes for the output tensors. The generated configs can be used during autotuning to explore different performance candidates. (Implemented in a previous PR-28808)
- GetDefaultConfig: Provides a default tile configuration for a given Triton fusion, used as a fallback when no tuning data is available. (Implemented in a previous PR-28515)
- ApplyConfig: Applies a selected block-level fusion configuration to a Triton fusion instruction by updating its GpuBackendConfig.
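To illustrate what "all supported combinations of tile sizes" can look like, here is a minimal sketch of a Cartesian-product enumeration. It is not the backend's code: the `sketch` namespace, `TileConfig`, `CandidateTileSizes`, and `EnumerateTileConfigs` are hypothetical names, and the choice of power-of-two candidates per dimension is an assumption made only for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical config type: one tile size per output dimension.
using TileConfig = std::vector<int64_t>;

// Assumption for this sketch: candidates are powers of two, from 1 up to the
// smallest power of two that covers the whole dimension.
std::vector<int64_t> CandidateTileSizes(int64_t dim_size) {
  assert(dim_size > 0);
  std::vector<int64_t> sizes;
  for (int64_t tile = 1;; tile *= 2) {
    sizes.push_back(tile);
    if (tile >= dim_size) break;
  }
  return sizes;
}

// Builds the Cartesian product of the per-dimension candidates.
std::vector<TileConfig> EnumerateTileConfigs(
    const std::vector<int64_t>& output_dims) {
  std::vector<TileConfig> configs(1);  // Start with one empty config.
  for (int64_t dim : output_dims) {
    std::vector<TileConfig> next;
    for (const TileConfig& prefix : configs) {
      for (int64_t tile : CandidateTileSizes(dim)) {
        TileConfig extended = prefix;
        extended.push_back(tile);
        next.push_back(std::move(extended));
      }
    }
    configs = std::move(next);
  }
  return configs;
}

}  // namespace sketch
```

For an output of shape 4x2 this sketch produces 3 x 2 = 6 candidates. The number of candidates grows multiplicatively with rank, which is one reason a default configuration is useful as a fallback when no tuning data is available.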
PiperOrigin-RevId: 784338001
`xla::ifrt::UserContextScope` provides tracking of the currently active
`xla::ifrt::UserContext` on the current thread. It gives a mechanism for IFRT
APIs to take a user-provided context and associate it with IFRT runtime objects
(`Array`, `LoadedExecutable`, etc.).
We begin the changes by first using this thread-local scoping mechanism, as
this allows more incremental steps. A helper function
`xla::ifrt::GetUserContext()` provides a way to get the current
`xla::ifrt::UserContext` in a uniform way across IFRT implementations.
The long-term plan remains to make the propagation of context objects
explicit by having IFRT APIs take a `user_context` argument. This
thread-local scoping mechanism can be transferred from the IFRT API to IFRT
users or runtimes in case they still prefer using this mechanism.
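The thread-local scoping idea can be sketched as an RAII guard over a per-thread slot. This is only an illustration, not the IFRT implementation: everything lives in a hypothetical `sketch` namespace, the `UserContext` struct is a stand-in with an invented `name` field, and the real class interfaces are not shown in this message.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

namespace sketch {

// Stand-in for a user-provided context; the `name` field is invented.
struct UserContext {
  std::string name;
};

// One slot per thread holding the innermost active context (or null).
inline std::shared_ptr<const UserContext>& CurrentSlot() {
  thread_local std::shared_ptr<const UserContext> slot;
  return slot;
}

// RAII guard: installs `context` for the current thread and restores the
// previously active context when the scope ends.
class UserContextScope {
 public:
  explicit UserContextScope(std::shared_ptr<const UserContext> context)
      : previous_(CurrentSlot()) {
    CurrentSlot() = std::move(context);
  }
  ~UserContextScope() { CurrentSlot() = std::move(previous_); }

  UserContextScope(const UserContextScope&) = delete;
  UserContextScope& operator=(const UserContextScope&) = delete;

 private:
  std::shared_ptr<const UserContext> previous_;
};

// Returns the context active on the calling thread, or null if none is set.
inline std::shared_ptr<const UserContext> GetUserContext() {
  return CurrentSlot();
}

// Usage example: scopes nest, and leaving a scope restores the outer one.
inline void NestedScopesExample() {
  assert(GetUserContext() == nullptr);
  {
    UserContextScope outer(std::make_shared<UserContext>(UserContext{"outer"}));
    assert(GetUserContext()->name == "outer");
    {
      UserContextScope inner(
          std::make_shared<UserContext>(UserContext{"inner"}));
      assert(GetUserContext()->name == "inner");
    }
    assert(GetUserContext()->name == "outer");
  }
  assert(GetUserContext() == nullptr);
}

}  // namespace sketch
```

Because the slot is thread-local, a context set on one thread is invisible to work scheduled on another thread, which is part of why explicit `user_context` arguments remain the long-term plan.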
PiperOrigin-RevId: 784333019
This change continues the work on the Triton block-level fusion emitter backend, which enables autotuning of tile configurations for custom Triton fusions in XLA.
This backend implements the following core interfaces:
- GetSupportedConfigs: Enumerates all supported combinations of tile sizes for the output tensors. The generated configs can be used during autotuning to explore different performance candidates.
- GetDefaultConfig: Provides a default tile configuration for a given Triton fusion, used as a fallback when no tuning data is available. (Implemented in a previous PR-28515)
- ApplyConfig: Applies a selected block-level fusion configuration to a Triton fusion instruction by updating its GpuBackendConfig. (will be added in the next PR)
PiperOrigin-RevId: 784233964
This PR updates test assertions in two XLA C++ test files by replacing `EXPECT_THAT(..., IsOkAndHolds(true))` with `ASSERT_THAT(...)`.
Rationale:
- Consistency: Aligns with other XLA tests, which use ASSERT for pass.Run() calls when subsequent checks depend on successful execution.
- Correctness: Ensures test failures are caught immediately, as ASSERT_THAT is fatal and prevents further checks from running on invalid state.
PiperOrigin-RevId: 784228038
- **Reshape Operations:** The pass now handles `tt.reshape` operations that add unit dimensions by converting them into `tt.expand_dims` operations.
- **Pointer Calculation:** A bug in pointer offset calculation within `SqueezeMakeTensorPtr` is fixed, ensuring correct behavior with non-zero offsets.
- **Load/Store Operations:**
  - The pass now correctly disables rewriting `tt.load` and `tt.store` operations that have masks.
  - A new safety check prevents folding a `tt.load` if one of the dimensions being squeezed is also subject to a boundary check.
- **Store Operation:** The `squeeze-store` pattern is now enabled by default, and its controlling option has been removed.
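As an illustration of the reshape case, the check "does this reshape only add unit dimensions?" can be sketched on plain shape vectors. This is not the pass's code: `sketch::FindInsertedUnitAxes` is a hypothetical helper, and it works on `std::vector<int64_t>` shapes rather than on MLIR types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch {

// Returns true if `output` equals `input` with extra size-1 dimensions
// inserted, and stores the positions (in output coordinates) of the inserted
// dimensions in `inserted_axes`. Returns false for any other reshape.
bool FindInsertedUnitAxes(const std::vector<int64_t>& input,
                          const std::vector<int64_t>& output,
                          std::vector<int64_t>* inserted_axes) {
  assert(inserted_axes != nullptr);
  inserted_axes->clear();
  std::size_t i = 0;  // Next input dimension that still needs a match.
  for (std::size_t j = 0; j < output.size(); ++j) {
    if (i < input.size() && output[j] == input[i]) {
      ++i;  // Output dimension j comes from input dimension i.
    } else if (output[j] == 1) {
      inserted_axes->push_back(static_cast<int64_t>(j));  // New unit dim.
    } else {
      return false;  // A non-unit dimension changed: not a pure expansion.
    }
  }
  // Every input dimension must have been matched.
  return i == input.size();
}

}  // namespace sketch
```

With input shape `{4, 8}` and output shape `{4, 1, 8}`, the helper reports a single inserted axis at position 1, which corresponds to one expansion at that axis. A reshape such as `{4, 8}` to `{8, 4}` is rejected, since it changes non-unit dimensions.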
These changes are accompanied by updated and new tests to validate the improved functionality and bug fixes.
With these changes, the pass is a win in all benchmarks I've looked at.
I'm planning to collect more extensive data and enable the pass in a separate change.
PiperOrigin-RevId: 784092242