Change `gpu_version` parameter to const reference in `IntelGpuCompiler`.
This aligns the parameter type in `OptimizeHloConvolutionCanonicalization` with the base class signature.
PiperOrigin-RevId: 825083863
When a function has multiple instances with different manual axes, and dedup-functions-fully is on, it will have different copies of the same function.
For example:
sdy.manual_computation(%arg0) manual_axes={"x"} (%arg1: tensor<4xf32>) {
sdy.named_computation<"foo">(%arg1) (%arg2: tensor<4xf32>) {}
}
sdy.manual_computation(%arg0) manual_axes={"y"} (%arg1: tensor<4xf32>) {
sdy.named_computation<"foo">(%arg1) (%arg2: tensor<4xf32>) {}
}
sdy.named_computation<"foo">(%arg0) (%arg1: tensor<8xf32>) {}
----->
sdy.manual_computation(%arg0) manual_axes={"x"} (%arg1: tensor<4xf32>) {
call @foo(%arg1)
}
sdy.manual_computation(%arg0) manual_axes={"y"} (%arg1: tensor<4xf32>) {
call @foo_0(%arg1)
}
call @foo_1(%arg0)
The iteration order over the map/vector determines which 'foo' becomes 'foo_0' or 'foo_1', and which stays as 'foo'.
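To illustrate why the visit order matters, here is a minimal sketch of order-dependent naming. `AssignSymbols` is a hypothetical helper written for this note, not the actual pass; it only assumes that the first visited copy keeps the base name and later copies get numbered suffixes, as in the example above.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper, not the actual pass: returns one unique symbol per
// visited copy, in visit order. The first copy keeps the base name, later
// copies get the suffixes _0, _1, ...
std::vector<std::string> AssignSymbols(
    const std::vector<std::string>& base_names_in_visit_order) {
  std::unordered_map<std::string, int> copies_seen;
  std::vector<std::string> symbols;
  for (const std::string& base : base_names_in_visit_order) {
    int& seen = copies_seen[base];
    symbols.push_back(seen == 0 ? base
                                : base + "_" + std::to_string(seen - 1));
    ++seen;
  }
  return symbols;
}
```

Whichever copy is visited first keeps the name `foo`, so an unordered container could hand out `foo`, `foo_0`, and `foo_1` differently from run to run.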
PiperOrigin-RevId: 825074314
Imported from GitHub PR https://github.com/openxla/xla/pull/33149
📝 Summary of Changes
Add a CUDA graph dump option that prints only the primary graph, so that the log is not flooded with nested CUDA graphs.
🎯 Justification
Makes debug output easier to read.
🚀 Kind of Contribution
📚 Documentation
Copybara import of the project:
--
18d6939170fd5bf4fa9228d4f74ca3ff4e83ec17 by Shawn Wang <shawnw@nvidia.com>:
add cuda graph dump option that only prints out primary graph
Merging this change closes #33149
PiperOrigin-RevId: 825049186
The UnstableReductionDetector now considers reductions where all reduced dimensions have a size of 1 to be stable, as these operations are effectively no-ops and do not introduce numerical instability. A test case is added to verify this behavior.
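The condition can be sketched as follows. This is an illustration only: `AllReducedDimsHaveSizeOne` and its plain-vector arguments are hypothetical and do not reflect the detector's real interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: true when every reduced dimension of the operand has
// size 1, i.e. the reduction combines exactly one element per output and so
// cannot accumulate rounding error. An empty `reduced_dims` also yields true.
bool AllReducedDimsHaveSizeOne(const std::vector<int64_t>& operand_dims,
                               const std::vector<int64_t>& reduced_dims) {
  for (int64_t dim : reduced_dims) {
    if (operand_dims.at(static_cast<std::size_t>(dim)) != 1) {
      return false;
    }
  }
  return true;
}
```

For an operand of shape `[1, 8, 1]`, reducing over dimensions 0 and 2 would count as stable under this check, while reducing over dimension 1 would not.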
PiperOrigin-RevId: 825045042
When implementing this it turned out that the log is currently missing some information needed to reliably distinguish input/output checksums and different thunk executions. This adds the needed fields to the proto, but emitting them in the log will be a separate change.
While the extra data is missing, the tool assumes that all checksums refer to outputs and that each thunk execution produces the same results every time. The tests include the extra data, so once emitting it is implemented, the tool should just work.
PiperOrigin-RevId: 825040798
**Without this change**, the behaviour depends on `--use_xnnpack` as follows:
1. `--use_xnnpack=true`: the **default resolver** (that automatically applies an XNNPack delegate) is used and an XNNPack delegate that follows the options that are given on the command line is explicitly applied.
2. `--use_xnnpack=false`: the **resolver without the default XNNPack delegate** is used and no delegate is explicitly applied, i.e. no delegate is applied.
3. No `--use_xnnpack` is specified: the **default resolver** (that automatically applies an XNNPack delegate) is used.
Case 1 has issues because both the custom and the default delegates are applied,
and they may interfere with each other during initialization.
- Depending on the XNNPack options, some operations may or may not be delegated.
- As a result, either delegate may end up taking the ops.
- This makes initialization benchmarks wrong, since two delegates are
applied.
- This interferes with the XNNPack weight cache, since it can never be enabled
for the default delegate.
To solve this, the new behaviour is:
1. `--use_xnnpack=true`: the **resolver without the default XNNPack delegate**
is used **and** an XNNPack delegate that follows the options that are given on the command line is explicitly applied.
2. `--use_xnnpack=false`: the **resolver without the default XNNPack delegate** is used and no delegate is explicitly applied, i.e. no delegate is applied.
3. No `--use_xnnpack` is specified: the **default resolver** (that automatically applies an XNNPack delegate) is used.
Cases 2 and 3 are not affected by this change.
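The new behaviour amounts to the decision table below. This is a sketch with hypothetical names (`FlagState`, `Resolver`, `DelegateSetup`, `ChooseDelegateSetup`); it models only the three cases described above, not the benchmark tool's actual code.

```cpp
// Hypothetical tri-state for the --use_xnnpack flag.
enum class FlagState { kUnset, kTrue, kFalse };

// Hypothetical names for the two resolvers mentioned above.
enum class Resolver { kDefaultWithXnnpack, kWithoutDefaultDelegates };

struct DelegateSetup {
  Resolver resolver;
  bool apply_explicit_xnnpack_delegate;
};

// Maps the flag state to the resolver and to whether an XNNPack delegate
// configured from the command-line options is applied explicitly.
DelegateSetup ChooseDelegateSetup(FlagState use_xnnpack) {
  switch (use_xnnpack) {
    case FlagState::kTrue:
      return {Resolver::kWithoutDefaultDelegates, true};
    case FlagState::kFalse:
      return {Resolver::kWithoutDefaultDelegates, false};
    case FlagState::kUnset:
      break;
  }
  return {Resolver::kDefaultWithXnnpack, false};
}
```

In every case at most one XNNPack delegate ends up applied, which is the property the old case 1 violated.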
PiperOrigin-RevId: 825018995
So far, the output could be non-deterministic if multiple reductions are
grouped together. This change makes it deterministic.
PiperOrigin-RevId: 824965037
These new methods allow printing or converting to a string only the array representation of the tile assignment, without including the tile dimensions. The existing `Print` and `ToString` methods are updated to use these new array-specific printing functions.
PiperOrigin-RevId: 824816702
This currently happens implicitly in `ynn_create_runtime`, but that will not be the case soon. (Calling it multiple times is harmless.)
PiperOrigin-RevId: 824754921
Explicitly cast the operands of the size calculation to `size_t` before calling `memcpy`, to prevent potential integer overflow on 64-bit systems.
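The pattern looks roughly like this. The helper names and the `int32_t` operand types are assumptions made for illustration; the commit does not say which types the real operands have.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: byte count for `num_elements` items of `element_size`
// bytes. Casting each operand first makes the multiplication happen in
// size_t rather than in 32-bit int. Assumes both inputs are non-negative.
inline std::size_t ByteCount(int32_t num_elements, int32_t element_size) {
  return static_cast<std::size_t>(num_elements) *
         static_cast<std::size_t>(element_size);
}

// Hypothetical wrapper showing the byte count being passed to memcpy.
inline void CopyElements(void* dst, const void* src, int32_t num_elements,
                         int32_t element_size) {
  std::memcpy(dst, src, ByteCount(num_elements, element_size));
}
```

Casting only the product would be too late, since the overflow would already have happened in the narrower type.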
PiperOrigin-RevId: 824732102
`arith.trunci` to i1 simply keeps the lowest bit, but HLO expects a convert to i1 to mean value != 0. Emit this conversion as a compare-not-equal-to-0 instead. This is already done correctly for floats.
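The difference between the two semantics can be shown in plain C++. Both function names are hypothetical and only model the behaviour described above.

```cpp
#include <cstdint>

// Models integer truncation to one bit: only the lowest bit survives.
bool TruncateToI1(int32_t value) { return (value & 1) != 0; }

// Models the semantics HLO expects for a convert to a predicate.
bool ConvertToPred(int32_t value) { return value != 0; }
```

The two agree on 0 and 1 but differ on any even nonzero input: for the value 2, truncation yields false while the compare yields true.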
PiperOrigin-RevId: 824716165
This change introduces `is_warp_specialization_allowed` to `TritonGemmConfig` and `BlockLevelFusionConfig`. The autotuner now explores configurations with warp specialization enabled, but only on Blackwell+ devices and when TMA is also enabled. The fusion emitter uses this new parameter to set the `tt.warp_specialize` attribute.
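The gating condition could be sketched as below. `ShouldExploreWarpSpecialization` is a hypothetical name, and treating "Blackwell+" as CUDA compute capability major version 10 or higher is an assumption of this sketch, not something stated in the change.

```cpp
// Assumption: Blackwell-class devices report compute capability 10.x or newer.
constexpr int kAssumedBlackwellMajor = 10;

// Hypothetical predicate: the autotuner adds configurations with warp
// specialization enabled only when the device is new enough and TMA is on.
bool ShouldExploreWarpSpecialization(int compute_capability_major,
                                     bool tma_enabled) {
  return compute_capability_major >= kAssumedBlackwellMajor && tma_enabled;
}
```

Configurations with warp specialization disabled would still be explored in all cases; the predicate only controls whether the enabled variants are added.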
PiperOrigin-RevId: 824601781
This change removes the GetMutableAffineMap() method from xla::IndexingMap. The mutable access to the underlying mlir::AffineMap can't be used because we will use a different internal implementation (SymbolicMap). I also think it's cleaner to not provide this method.
PiperOrigin-RevId: 824536996