This is adding `GpuExecutuable::ToProto` and `GpuExecutable::FromProto` which allow us to [de]serialize an instance of `GpuExecutable` and later reconstruct it.
PiperOrigin-RevId: 826470601
This CL introduces a new helper method SymbolicExpr::IsBinaryOp() to quickly determine if a SymbolicExpr is a binary operation (i.e., not a constant or a variable). This is used in indexing_map.cc in several places for AffineMap and it will simplify the refactor.
PiperOrigin-RevId: 826468454
Checking all buffers is way too heavy and causes timeouts, so we need the ability to focus on interesting parts of the thunk graph.
`--xla_gpu_experimental_thunk_buffer_debug_filter_by_thunk_id_ranges` allows limiting thunk IDs to selected ranges or values. The IDs are assigned in the order of emitting thunks, which should (TM) be stable and allow bisecting to find culprit thunk(s). The IDs are given as comma-separated list of integers, closed or half-open ranges (e.g. `:2,5,7:8,12:` to match <=2, 5, 7, 8 and >=12).
`--xla_gpu_experimental_thunk_buffer_debug_filter_by_profile_annotation_re` allows matching by thunk's profile annotation. This is a comma-separated list of regexes that will be matched against `ThunkInfo::profile_annotation`. The thunk's profile annotation needs to match any of the regexes.
They are meant to work with all thunk debug buffer instrumentation (currently: checksums, NaNs). If both flags are defined, the thunk will have to pass both the ID and profile annotation filters to get instrumented.
Implementation of the filtering logic is not included in this CL.
PiperOrigin-RevId: 826457166
`BufferAllocation::Slice` stores a raw pointer to the corresponding `BufferAllocation`. Now we keep the embedded thunk allocations alive by stroing unique_ptrs in the wrapping DynamicSliceThunk. The current design makes it hard to reuse the existing infrastructure, specifically to serialize `DynamicSliceThunk`. To address this, I'm changing fake_allocations to be `std::vector<BufferAllocation>`.
The move constructor `std::vector::vector(std::vector&&)` is guaranteed to have constant time complexity and therefore it steals the internal data buffer from the source vector. This infers that the pointers to allocations are kept stable as long as:
* we preallocate the vector size
* we never copy the vector, but move
To make it safer for later usage, we can explicitely prohibid BufferAllocation to be copyable/moveable. I'm going to do this in the following cl.
PiperOrigin-RevId: 826440060
The `CubSortThunk` constructor was calling a function that returns a `absl::StatusOr`, and ignoring non-ok statuses and just accessing the value.
Presumably in prod the status is always ok, but making this failure case explicit.
PiperOrigin-RevId: 826410861
This change moves the initialization of commonly used `SymbolicExpr` and a sample `SymbolicMap` into the `SymbolicMapTest` fixture to reduce code duplication across tests.
PiperOrigin-RevId: 826161168
I originally assumed the caller was always providing a full list of replacements but IndexingMap have some uses where the dim_replacement list is empty, resulting in a CHECK-fail.
So, I'm allowing the user to provide either dim or symbol empty lists to ReplaceDimsAndSymbols. In that case, the dims/symbols won't be replaced.
PiperOrigin-RevId: 826138814
The GlobalToLocal and LocalToGlobal custom-calls are for Shardy round trip. These get-tuple-elements will be removed when we import the Shardy dialect and thus they do not need to hold frontend attributes.
This can reduce the size of the generated HLO module text.
PiperOrigin-RevId: 826134489
Instead of 2 ra2a + concat, we can double the output buffer and adjust output offsets. This way we can save on latency by having only one multi-GPU synchronization.
PiperOrigin-RevId: 826122665