We use this field for two different buffer debug kernels that have different semantics. Technically we could have two different structures, but that does not make much sense at the moment. Let's use the one that we already have, with the generic name.
PiperOrigin-RevId: 824532743
Add an API to look up type id and info by type name. We can't rely on type ids for serialization, as they are not stable and are assigned at run time depending on the type registration order. Type names, on the other hand, must be stable.
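A minimal sketch of the idea; `TypeRegistry`, `RegisterType`, and `FindTypeByName` are illustrative names, not the actual FFI API:
```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical registry: ids are handed out in registration order (and are
// therefore unstable across runs), while the name key is stable.
class TypeRegistry {
 public:
  int64_t RegisterType(const std::string& name) {
    int64_t id = next_id_++;         // run-time id, depends on call order
    ids_by_name_.emplace(name, id);  // stable name -> id mapping
    return id;
  }
  std::optional<int64_t> FindTypeByName(const std::string& name) const {
    auto it = ids_by_name_.find(name);
    if (it == ids_by_name_.end()) return std::nullopt;
    return it->second;
  }

 private:
  int64_t next_id_ = 1;
  std::unordered_map<std::string, int64_t> ids_by_name_;
};
```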
PiperOrigin-RevId: 824512487
- Renamed `SymbolicToAffine` to `SymbolicExprToAffineExpr` and made it public (needed for `IndexingMap::GetConstraints`)
- Renamed `AffineToSymbolicExpr` to `AffineExprToSymbolicExpr`
- Added `AffineExprsToSymbolicExprs` to convert a list of `mlir::AffineExpr` to a vector of `xla::gpu::SymbolicExpr` (needed for `IndexingMap::ConstraintsSatisfied`); see the sketch below
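A rough sketch of what the new list helper can look like, assuming the per-element conversion named above; the actual signatures in XLA's symbolic expression library may differ (e.g. take an extra context argument):
```cpp
#include <vector>

#include "llvm/ADT/ArrayRef.h"
#include "mlir/IR/AffineExpr.h"

// Hedged sketch: converts each mlir::AffineExpr with the per-element
// AffineExprToSymbolicExpr helper. SymbolicExpr and the per-element helper
// come from XLA's symbolic expression library; signatures are assumed.
std::vector<SymbolicExpr> AffineExprsToSymbolicExprs(
    llvm::ArrayRef<mlir::AffineExpr> exprs) {
  std::vector<SymbolicExpr> result;
  result.reserve(exprs.size());
  for (mlir::AffineExpr expr : exprs) {
    result.push_back(AffineExprToSymbolicExpr(expr));
  }
  return result;
}
```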
PiperOrigin-RevId: 824492246
In a follow-up CL we will need to add this thunk to the buffer debug pass.
There we will also need to infer the buffer element type.
Another refactoring would be to rename the payload, which is currently the checksum, to something more generic like 'value' or 'result'.
One more thing we could do is reduce code duplication by merging the two thunks, the checksum one and the NaN counter one.
PiperOrigin-RevId: 824491914
The `DotDecomposer` pass runs ahead of layout assignment. Introducing non-default layouts at this stage causes complications for subsequent passes, in particular the `DotMerger` pass.
PiperOrigin-RevId: 824476578
Imported from GitHub PR https://github.com/openxla/xla/pull/32439
📝 Summary of Changes
Enable embedded device libs and in-process lld by default.
🎯 Justification
Moves amdgpu backend to be more filesystem layout independent.
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
N/A
🧪 Unit Tests:
None
🧪 Execution Tests:
None
Copybara import of the project:
--
46a100377d00d30dbc79e34c977b9219c54bda4b by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:
[ROCm] Fix and enable xla_gpu_use_embeded_device_lib and xla_gpu_use_inprocess_lld
Merging this change closes #32439
PiperOrigin-RevId: 824476138
absl::Hash is not deterministic across different runs of the same program. Use
Fingerprint128 instead, and don't include the address of the computation.
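A small sketch of the replacement pattern, assuming the key is derived from a serialized form of the computation rather than its address (the header path may vary across TSL versions):
```cpp
#include <cstdint>

#include "absl/strings/string_view.h"
#include "tsl/platform/fingerprint.h"

// absl::Hash is randomly seeded per process, so its values change between
// runs; tsl::Fingerprint128 is a pure function of its input and is stable.
uint64_t StableKey(absl::string_view serialized_computation) {
  tsl::Fprint128 fp = tsl::Fingerprint128(serialized_computation);
  return fp.low64 ^ fp.high64;  // fold 128 bits down to a 64-bit map key
}
```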
PiperOrigin-RevId: 824460524
Imported from GitHub PR https://github.com/openxla/xla/pull/31886
📝 Summary of Changes
This enhances the search for the CUDA libdevice path:
- Fix an invalid empty path being added when `TF_CUDA_TOOLKIT_PATH` is defined but empty
- Fix invalid paths based on runtime folders: `runfiles_dir.substr(0, runfiles_ind + runfiles_suffix.length())` is not meaningful when `runfiles_ind` isn't valid, i.e. `std::string::npos`
- Add `$CUDA_HOME` to the search paths. This is already used in TensorFlow
🎯 Justification
Without this, the libdevice file won't be found if CUDA isn't installed in a standard location, or if e.g. an updated version is available in a different location.
This is the case e.g. on HPC systems where multiple CUDA versions are available side by side.
🚀 Kind of Contribution
🐛 Bug Fix, ♻️ Cleanup
Fixes #28590
🧪 Unit Tests:
A simple test that, when `CandidateCudaRoots` returns anything, the result contains `$CUDA_HOME`
Copybara import of the project:
--
01788b896900717ee916377a71d5c14963e0176d by Alexander Grund <alexander.grund@tu-dresden.de>:
Fix libdevice search when outside test environment
When there is no `runfiles_suffix`, `rfind` returns `std::string::npos`,
which should be handled so as not to add meaningless paths.
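In code, the fix amounts to a guard like the following sketch (`MaybeAddRunfilesRoot` is an illustrative name, not the actual function):
```cpp
#include <string>
#include <vector>

// Only derive a candidate CUDA root when the runfiles suffix is actually
// present; otherwise rfind returns std::string::npos and the substr below
// would produce a meaningless path.
void MaybeAddRunfilesRoot(const std::string& runfiles_dir,
                          const std::string& runfiles_suffix,
                          std::vector<std::string>* roots) {
  std::string::size_type runfiles_ind = runfiles_dir.rfind(runfiles_suffix);
  if (runfiles_ind != std::string::npos) {
    roots->push_back(
        runfiles_dir.substr(0, runfiles_ind + runfiles_suffix.length()));
  }
}
```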
--
900715a846102bacdfc7688f14713cbe6101506d by Alexander Grund <alexander.grund@tu-dresden.de>:
Use `$CUDA_HOME` when searching for libdevice.
With a CUDA installed to a non-default location XLA/TF fails with:
> gpu_backend_lib.cc:579] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
> Searched for CUDA in the following directories:
> ./cuda_sdk_lib
> /builddir/TensorFlow/TensorFlow-2.x_mnist-test.py.runfiles/cuda_nvcc
> /buildi/cuda_nvcc
>
> /usr/local/cuda
> /software/TensorFlow/lib/python3.12/site-packages/tensorflow/python/platform/../../../nvidia/cuda_nvcc
> /software/TensorFlow/lib/python3.12/site-packages/tensorflow/python/platform/../../../../nvidia/cuda_nvcc
> /software/TensorFlow/lib/python3.12/site-packages/tensorflow/python/platform/../../cuda
Consider $CUDA_HOME as an additional location after the runfiles dirs (used for tests)
--
905d0596d199598036032f0f84b4487e9afd2bef by Alexander Grund <alexander.grund@tu-dresden.de>:
Don't add empty TF_CUDA_TOOLKIT_PATH to libdevice search
At least in some environments that define is the empty string, which
doesn't make sense to add to the search paths.
Add a check for that.
--
23eb59bfabd570caabf0b9ec3515233f46a4fae7 by Alexander Grund <alexander.grund@tu-dresden.de>:
Add test for $CUDA_HOME in CandidateCudaRoots
--
a8c215bc222b4ba8581f2f44549613ebd59b9cbb by Alexander Grund <alexander.grund@tu-dresden.de>:
Add braces to loops/conditions
--
39efc67f8b1d44e131f993c8040b7eb69ff52f0c by Alexander Grund <alexander.grund@tu-dresden.de>:
Use kIsOpenSource in skip condition
Merging this change closes #31886
PiperOrigin-RevId: 824450284
This is moving `Scalar`, `Array`, `Dictionary`, `FlatAttribute`, `FlatAttributeMap`, and `AttributeMap` from `CallFrameBuilder` into the `xla::ffi` namespace.
It also moves the code into `attribute_map.{cc|h}`.
All these types are basically aliases for some kind of `std::variant` type. This change is a preparation for making them proper types and adding `ToProto` and `FromProto` methods.
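For intuition, a simplified sketch of variant-based aliases of this kind; the real aliases carry more payload types than shown here:
```cpp
#include <cstdint>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Simplified stand-ins for the moved types: today they are plain variant
// aliases, which is why promoting them to proper types with ToProto and
// FromProto methods requires this kind of preparatory move.
using Scalar = std::variant<bool, int32_t, int64_t, float, double>;
using Array = std::variant<std::vector<int32_t>, std::vector<float>>;
using FlatAttribute = std::variant<Scalar, Array, std::string>;
using FlatAttributeMap = std::map<std::string, FlatAttribute>;
```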
PiperOrigin-RevId: 824435281
Also fixed the round-trip test to not ignore `kInvalid` returned from proto conversion, which is why we didn't catch this bug.
PiperOrigin-RevId: 824419619
The meaning of AsyncValue::IsUnique() is fuzzy for chains of indirect async values. Prefer a simpler check for uniqueness in the Future/Promise library.
Also update the AsyncValue::IsUnique() documentation.
PiperOrigin-RevId: 824256830
This change invalidates the autotune cache, which is necessary because enabling the generic emitter (cl/823475406) affected autotuning results.
PiperOrigin-RevId: 823818338
This is behaviorally a no-op for Shardy, because the call output and func result may mismatch only if the dedup-functions-fully option is true, and that option is false by default.
Shardy will add explicit reshards (during the Shardy partitioner) on the operations that use the output of a named computation, and it will do so assuming the output of the named computation is sharded as specified in the named computation's out shardings.
When the dedup-functions-fully option is true, however, the function that is actually called may end up having a different output sharding than the corresponding named computation. The users of the output should still use the sharding specified in the out shardings of the named computation. Hence, if there is a mismatch between the output sharding of the named computation and the result sharding of the function, we add a reshard on the output of the call.
PiperOrigin-RevId: 823494391
Explicitly set the operand precisions to `PrecisionConfig::DEFAULT` when creating a `ScaledDot` instruction from a composite call.
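For reference, a minimal sketch of what explicitly populating both operand precisions on the `PrecisionConfig` proto looks like:
```cpp
#include "xla/xla_data.pb.h"

// Explicitly record DEFAULT precision for both dot operands instead of
// leaving the repeated field empty.
xla::PrecisionConfig MakeDefaultPrecisionConfig() {
  xla::PrecisionConfig config;
  config.add_operand_precision(xla::PrecisionConfig::DEFAULT);
  config.add_operand_precision(xla::PrecisionConfig::DEFAULT);
  return config;
}
```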
PiperOrigin-RevId: 823488638
+ use `ptr` when using `AsPtr()` for consistency
+ rename `Wrap` to `AndThen` as it's more meaningful and makes profiles readable
PiperOrigin-RevId: 823476695
According to benchmarks we have reached neutrality with the legacy emitter. Switching to the new emitter by default. The legacy emitter will be kept for some time but is considered deprecated and should not be used. It will be deleted in the near future.
Reverts 85c99b1ecb
PiperOrigin-RevId: 823475406
Previously, we would never allow simplification when encountering a `dot`
instruction. But this constraint was overly conservative; the only dimensions
that we shouldn't simplify are those along which we intend to perform
non-standard padding to fit to hardware restrictions, i.e. the non-contracting
and contracting dimensions.
Restricting this pattern further works around a bug whereby expanding a
non-standardly padded dimension into a `1` dim can result in propagating a
tile with the wrong size.
The underlying reason for this is a bug in the `kPreserve` behaviour of
`IndexingMap` simplification, which will need to be fixed separately (the new
tiling should avoid this issue, since it shouldn't rely on the correctness of
`IndexingMap` simplification at this level).
PiperOrigin-RevId: 823258725
Note that, in order to maintain parity with MHLO optimizations, this enables the `assume-no-undeclared-side-effects` option. This matches the default behavior for MHLO, but StableHLO is more cautious by default. Empirically, past evidence suggests it's pretty safe given that MHLO has been doing it all this time. Disabling the flag can result in significantly larger HLO after lowering, so we enable it here.
PiperOrigin-RevId: 823234079
To avoid confusion between the different kinds of tasks we have in Worker/WorkQueue and the SlinklyThreadPool in XLA, use the more generic name "work item".
PiperOrigin-RevId: 823191886
This is just a short term solution to allow loading https://github.com/jax-ml/jax/blob/main/build/BUILD.bazel successfully. We'll need to figure out a better solution when working on supporting multiple python versions.
PiperOrigin-RevId: 823093519
Imported from GitHub PR https://github.com/openxla/xla/pull/32954
📝 Summary of Changes
Introduce pool name for rbe builds
🎯 Justification
Need separate pool name for gpu tests execution.
🚀 Kind of Contribution
✨ New Feature
📊 Benchmark (for Performance Improvements)
RBE support for the ROCm config CI job
🧪 Unit Tests:
Not relevant
🧪 Execution Tests:
Not relevant
Copybara import of the project:
--
d675bf9efcc44a8d740c1be7537737af3cd90f0b by Alexandros Theodoridis <alexandros.theodoridis@amd.com>:
Introduce pool name for rbe
--
d5ee82757aa74785bd2a1c68e3639c49d17ba740 by Alexandros Theodoridis <atheodor@amd.com>:
Introduce rocm rbe pools
--
36bfa7b258cb3e58430087faccccb413f9bf8a7c by Alexandros Theodoridis <atheodor@amd.com>:
First check for multigpu tag
--
9efa0a7cdfa76bb0d5102ebbee1f9b6a3dab270c by Alexandros Theodoridis <atheodor@amd.com>:
Address review comments
--
5b854a7f5915d0c106fd2ba9bc6ff774a885f907 by Alexandros Theodoridis <atheodor@amd.com>:
Fix buildifier issue
Merging this change closes #32954
PiperOrigin-RevId: 823077515
This change modifies `SymbolicExprContext` to use the `mlir::StorageUniquer` provided by `mlir::MLIRContext::getAffineUniquer()` instead of maintaining its own. This makes `SymbolicExprContext` creation very lightweight.
PiperOrigin-RevId: 823052287
The old code did not update `min_duration_with_optimzed_scratch_bytes` when the scratch sizes were equal. This could lead to a subtle situation where the kernel with the best time and scratch size is not picked, if all scratch sizes are the same but the time-optimal one does not appear at the end.
I've updated the associated test to verify this situation; the new test fails without this CL. A sketch of the corrected selection logic follows.
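A hypothetical sketch (names and types are illustrative, not the actual autotuner code):
```cpp
#include <cstdint>
#include <vector>

struct Candidate {
  double duration_ms;
  int64_t scratch_bytes;
};

// When two candidates tie on scratch size, the faster one must still win, so
// the best duration is updated on equality as well, not only when the
// scratch size strictly shrinks.
Candidate PickBest(const std::vector<Candidate>& candidates) {
  Candidate best = candidates.front();
  for (const Candidate& c : candidates) {
    if (c.scratch_bytes < best.scratch_bytes ||
        (c.scratch_bytes == best.scratch_bytes &&
         c.duration_ms < best.duration_ms)) {
      best = c;  // the equal-scratch case was previously skipped
    }
  }
  return best;
}
```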
PiperOrigin-RevId: 823019660
This change moves `YnnThreadpool` to the runtime/ynnpack/ subfolder, and changes the runtime to use our custom YnnThreadpool, instead of using a thread pool created by `ynn_create_threadpool`.
PiperOrigin-RevId: 822883993
Setting the type_id value to 0 is required for XLA to assign a unique type id; otherwise the type gets assigned a random value that happens to be on the caller's stack.
PiperOrigin-RevId: 822782898
This now always passes reference_held=true. This is fine because the only time
it was ever passed as false was when we were already on the compute stream, and
this bool is basically ignored if the stream is the compute stream (see
MaybeWaitForEventOnStream).
PiperOrigin-RevId: 822758577
This CL modifies the collective pipeliner to generate unique body and condition computations for newly generated while loop instructions.
PiperOrigin-RevId: 822719229
- Prioritize replacing `broadcast_in_dim` with `reshape` over merging nested `broadcast_in_dim` ops. The new behavior matches the relevant MHLO optimization behavior, which proved to be preferable.
- Fix an issue where `pad` ops that didn't change the dimensions would be removed even if they shifted elements around within the tensor (e.g. padding by -1 on one side and +1 on the opposite side).
PiperOrigin-RevId: 822701252
Imported from GitHub PR https://github.com/openxla/xla/pull/33008
📝 Summary of Changes
Add CI-specific bazelrc that will import both `rocm.bazelrc` from `/usertools` and `rocm_xla.bazelrc`
🎯 Justification
Temporary workaround until split logic in CI (which relies on `/usertools/rocm.bazelrc`) is removed
Copybara import of the project:
--
bb4cbf0c4fbf2c171110040c5c1470bddced203b by Milica Makevic <Milica.Makevic@amd.com>:
Add CI specific bazelrc
Merging this change closes #33008
PiperOrigin-RevId: 822700005
Instead of performing four separate AllToAll operations, the metadata tensors are reshaped, concatenated, and a single AllToAll is executed. The result is then sliced back into the individual metadata tensors. This avoids the latency of initiating several separate collective operations.
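A hedged sketch of the batching idea using the XLA client builder; the actual change operates on HLO, and the shapes and names here are illustrative only:
```cpp
#include <cstdint>
#include <vector>

#include "absl/types/span.h"
#include "xla/hlo/builder/xla_builder.h"

// Stacks k same-shaped 1-D metadata tensors of length n into a [k, n]
// tensor, runs a single AllToAll along the (shifted) data dimension, and
// slices the result back apart. Each row is exchanged independently, so this
// matches k separate AllToAlls over dimension 0 of the original tensors.
std::vector<xla::XlaOp> BatchedAllToAll(absl::Span<const xla::XlaOp> metadata,
                                        int64_t n, int64_t split_count) {
  xla::XlaBuilder* b = metadata.front().builder();
  std::vector<xla::XlaOp> rows;
  for (xla::XlaOp m : metadata) rows.push_back(xla::Reshape(m, {1, n}));
  xla::XlaOp combined = xla::ConcatInDim(b, rows, /*dimension=*/0);
  xla::XlaOp exchanged = xla::AllToAll(combined, /*split_dimension=*/1,
                                       /*concat_dimension=*/1, split_count);
  std::vector<xla::XlaOp> out;
  for (int64_t i = 0; i < static_cast<int64_t>(metadata.size()); ++i) {
    out.push_back(xla::Reshape(
        xla::Slice(exchanged, {i, 0}, {i + 1, n}, {1, 1}), {n}));
  }
  return out;
}
```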
PiperOrigin-RevId: 822674605
Introduce `addMissingShardingToControlFlow` option in `StablehloExportPipelineOptions` to control whether `ExportStablehloShardingsPass` adds missing shardings to control flow ops. Disable this option in `mlir_to_hlo.cc` when converting MLIR to HLO.
PiperOrigin-RevId: 822542288
Imported from GitHub PR https://github.com/openxla/xla/pull/32231
📝 Summary of Changes
The changes enable native support for forward convolutions with window dilation in XLA's GPU backend. Previously, all dilated convolutions were treated as non-canonical and required explicit padding materialization. Now, forward convolutions with window dilation (but not base dilation) are preserved and handled natively by cuDNN, avoiding unnecessary padding overhead.
🎯 Justification
Performance Problem: JAX shows 15-23x slower performance than PyTorch for dilated convolutions (33.5ms vs 1.4ms at dilation rate 2). This is because XLA materializes dilated convolutions as padded convolutions instead of using cuDNN's native support.
Solution: Allow forward convolutions with window dilation to bypass padding materialization and use cuDNN's native dilated convolution kernels directly.
🚀 Kind of Contribution
Performance Improvement
📊 Benchmark (for Performance Improvements)
dilation 1:
prev: 1.08 ms
now: 1.07 ms
dilation 2:
prev: 25.79 ms
now: 0.91 ms
dilation 1024:
prev: 26.24 ms
now: 2.34 ms
Copybara import of the project:
--
b5a38df2ed4715b43fc8ca8d652005a35290d47e by Chenhao Jiang <chenhaoj@nvidia.com>:
Support forward conv with dilation and add basic heuristic for differentiating forward/backward
Merging this change closes #32231
PiperOrigin-RevId: 822482265
Imported from GitHub PR https://github.com/openxla/xla/pull/32838
📝 Summary of Changes
The fallback logic now correctly identifies the highest known compatible architecture when given an unknown architecture as input.
🎯 Justification
Previously the logic would propose an incompatible architecture in this case.
🚀 Kind of Contribution
🐛 Bug Fix
🧪 Unit Tests:
Added a new test case showing the previously-failing case (it used to propose `sm_110`)
Copybara import of the project:
--
f060bb9837d72159343ff2d52f5f2f42b1b7e9a4 by Olli Lupton <olupton@nvidia.com>:
Fix family-conditional logic
--
fc44dcd1e76da67c0b6fe53c33d2a571c3a6ff50 by Olli Lupton <olupton@nvidia.com>:
Accept CR suggestion
Merging this change closes #32838
PiperOrigin-RevId: 822284790
Imported from GitHub PR https://github.com/openxla/xla/pull/32960
📝 Summary of Changes
(Partially) upstreaming changes from: https://github.com/ROCm/xla/pull/323, 9d358b9b26, and https://github.com/ROCm/xla/pull/385. It skips some asan/tsan changes for now.
🎯 Justification
These changes are ROCm-specific and help with ROCm-internal CI validation pipelines.
🚀 Kind of Contribution
🐛 Bug Fix, ♻️ Cleanup, 🧪 Tests
📊 Benchmark (for Performance Improvements)
/
🧪 Unit Tests:
/
🧪 Execution Tests:
/
Copybara import of the project:
--
804ff1b6a6fbba86a3e0a09d739179a4eb4f197d by Milica Makevic <Milica.Makevic@amd.com>:
Add missing cuda-only tag to cuda test
--
44ce7a2d56c9f0c80405447f431ae1e5a33f42e1 by Milica Makevic <Milica.Makevic@amd.com>:
Refactor test scripts
--
fb783c968e9d2ff5d92357908d99e4952235c2bc by Milica Makevic <Milica.Makevic@amd.com>:
Cover more mgpu tests
--
1f53712274f76202241bd3631dbf065826c0b960 by Milica Makevic <Milica.Makevic@amd.com>:
Switch from rocm_gcc to rocm_ci for sgpu tests
--
00e0c8ee2a763680f5a3665dab62202ab230731d by Milica Makevic <Milica.Makevic@amd.com>:
Changing file permissions
--
003c062a8900c12b73c0972e8d406f2661a27aba by Milica Makevic <Milica.Makevic@amd.com>:
Remove unnecessary import
--
214599355f40f1b65e0540daf0b9829d2c950115 by Harsha HS <Harsha.HavanurShamsundara@amd.com>:
Add license header
Merging this change closes #32960
PiperOrigin-RevId: 822245565
Imported from GitHub PR https://github.com/openxla/xla/pull/32846
📝 Summary of Changes
Allow mixed precision collective-permute in the verifier.
🎯 Justification
Partially addresses https://github.com/openxla/xla/issues/32845
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
N/A
🧪 Unit Tests:
Tests that verifier passes on mixed precision collective-permute.
🧪 Execution Tests:
N/A
Copybara import of the project:
--
666c38a19005a609d4a7aa8e5e9b9842b1c87175 by Jaroslav Sevcik <jsevcik@nvidia.com>:
Allow mixed precision for collective permute
Merging this change closes #32846
PiperOrigin-RevId: 822179840
Imported from GitHub PR https://github.com/openxla/xla/pull/32904
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.30.5 to 4.30.9.
Copybara import of the project:
--
c14a0d2198bee3dcd76ee7fa733da41a6d1fcd6b by dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>:
Bump github/codeql-action from 3.30.5 to 4.30.9
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.30.5 to 4.30.9.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](3599b3baa1...16140ae1a1)
---
updated-dependencies:
- dependency-name: github/codeql-action
dependency-version: 4.30.9
dependency-type: direct:production
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Merging this change closes #32904
PiperOrigin-RevId: 822178959
Imported from GitHub PR https://github.com/openxla/xla/pull/32688
📝 Summary of Changes
This PR enables the command buffer DynamicSliceCopy command to be recorded into an unrolled cuda-graph when it is surrounded by WhileCmd
🎯 Justification
This feature is required if we want to fully command buffer WhileCmd into an unrolled cuda-graph.
🚀 Kind of Contribution
✨ New Feature
🧪 Unit Tests:
xla/backends/gpu/runtime/command_buffer_cmd_test.cc: CommandBufferCmdTest:DynamicSliceCopyFusionCmd
Copybara import of the project:
--
feb2902fca397360460f6b9788ac0f7482cb547c by Shawn Wang <shawnw@nvidia.com>:
Enable command buffer DynamicSliceCopyFusion command unrolling
Merging this change closes #32688
PiperOrigin-RevId: 822104580
Imported from GitHub PR https://github.com/openxla/xla/pull/32719
📝 Summary of Changes
This PR enables the command buffer DynamicSliceFusion command to be recorded into an unrolled cuda-graph when it is surrounded by WhileCmd
🎯 Justification
This feature is required if we want to fully command buffer WhileCmd into an unrolled cuda-graph.
🚀 Kind of Contribution
✨ New Feature
🧪 Unit Tests:
xla/backends/gpu/codegen/dynamic_slice_fusion_test.cc
Copybara import of the project:
--
daa975804cbffcc3a6bc5b37e3494b51a2dbe2ca by Shawn Wang <shawnw@nvidia.com>:
DynamicSliceFusionCmd supports unrolling
Merging this change closes #32719
PiperOrigin-RevId: 822071751
According to benchmarks we have reached neutrality with the legacy emitter. Switching to the new emitter by default.
The legacy emitter will be kept for some time but is considered deprecated and should not be used. It will be deleted in the near future.
PiperOrigin-RevId: 822067921
The fission autotuner previously only searched for dot instructions in the entry computation of an HLO module. This caused it to miss dot operations located in nested computations, such as the body of a while loop, preventing the autotuner from applying configurations to them.
PiperOrigin-RevId: 822037141
We adjusted the emitter for the case when the scale is missing.
We also relaxed the HLO verifier a bit and tweaked the composite rewriter so that it accepts the dim indexes passed by JAX.
PiperOrigin-RevId: 822036474
When removing ops, we need to do so in a deterministic order. The reason is
that removing a user works by finding the position of the user in the users
vector, swapping it with the last element of the vector, and then popping the
last element of the vector. So if more than one element is removed from a users
list, it matters in which order the elements are removed; see the sketch below.
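A minimal sketch of the swap-with-last removal described above; it shows why the final order of `users` depends on the removal order:
```cpp
#include <algorithm>
#include <vector>

// Removes `user` by swapping it with the last element and popping. The
// element that was last takes the removed element's slot, so the resulting
// order of `users` depends on which removals happened first.
void RemoveUser(std::vector<int>& users, int user) {
  auto it = std::find(users.begin(), users.end(), user);
  if (it != users.end()) {
    *it = users.back();  // move the last element into the vacated position
    users.pop_back();    // drop the (now duplicated) last element
  }
}
```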
PiperOrigin-RevId: 822026351
Imported from GitHub PR https://github.com/openxla/xla/pull/32905
📝 Summary of Changes
Allow mixed precision asynchronous collective-permute in the verifier.
🎯 Justification
Fixes https://github.com/openxla/xla/issues/32845
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
N/A
🧪 Unit Tests:
Tests that verifier passes on mixed precision collective-permute-start and collective-permute-done.
🧪 Execution Tests:
Manually tested the JAX repro from https://github.com/openxla/xla/issues/32845
Copybara import of the project:
--
f44faa7ce7ecfbd810983cae170a118bb19a8bb3 by Jaroslav Sevcik <jsevcik@nvidia.com>:
Allow mixed precision operands for async collective permute
Merging this change closes #32905
PiperOrigin-RevId: 822023349
Imported from GitHub PR https://github.com/openxla/xla/pull/32773
📝 Summary of Changes
Remove hardcoded NHWC convolution layout for fp16 precision.
🎯 Justification
Performance drops for fp16 precision on gfx11xx and gfx12xx GPUs were observed internally, as well as by the [community](https://github.com/jax-ml/jax/issues/30548).
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark
A community member provided the script with which the [profiling can be done](https://github.com/jax-ml/jax/issues/30548#issue-3270872993).
Significant performance improvement for fp16 on gfx12xx:
```
Running on: rocm:0
Testing float32...
Avg time: 0.092307 s, Throughput: 1.68 TFLOP/s
Testing float16...
Avg time: 0.011742 s, Throughput: 13.17 TFLOP/s
Testing bfloat16...
Avg time: 0.011989 s, Throughput: 12.90 TFLOP/s
```
Results of the profiling before the fix:
```
Running on: rocm:0
Testing float32...
Avg time: 0.092312 s, Throughput: 1.67 TFLOP/s
Testing float16...
Avg time: 0.775142 s, Throughput: 0.20 TFLOP/s
Testing bfloat16...
Avg time: 0.011990 s, Throughput: 12.90 TFLOP/s
```
Copybara import of the project:
--
c9fdba79e32c13d9cbf640e61d941d071fabba9d by Aleksa Arsic <Aleksa.Arsic@amd.com>:
Remove hardcoded convolution NCHW layout assignment for fp16 precision.
--
69660d19999a14b24d63b52e6dae310cfbdcbb6b by Aleksa Arsic <Aleksa.Arsic@amd.com>:
Add unit tests for ROCm layout assignment.
Merging this change closes #32773
PiperOrigin-RevId: 822022522
Imported from GitHub PR https://github.com/openxla/xla/pull/32724
Copybara import of the project:
--
c3f4ff8ec6af27d24b61e2aa529585697b8aa77a by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:
Disable only the test cases that are failing and enable 3 test targets on B200.
--
1f6e52218ec124bb52d4dba70aa7832311762465 by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:
Disable test case in cudnn_test that fails on Google's B200.
Keep gpu_compiler_test off CI for now due to memory leak
found by ASAN, but don't revert the changes in the file,
so it can be enabled more easily in the future.
--
42e501a41e43c174538ab186c659a072101b4ab2 by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:
Disable ConvWgradWithNHWCLayoutExecutesCorrectly only on Blackwell.
Merging this change closes #32724
PiperOrigin-RevId: 821992088
This change implements native support for `xla::Executable::GetOutputLayouts()` in the PJRT C API when the PJRT Layouts extension is available. This support does not fetch the optimized HLO, which makes the method faster and more reliable.
Plugins that implemented Layouts extension v2 are strongly encouraged to upgrade to v3 to avoid an incompatibility.
PiperOrigin-RevId: 821834116
Future note: at some point, `NanoArray` will need to distinguish between a default layout and a concrete layout that is equal to the default layout. If the latter is used, `NanoArray::pjrt_layout()` is expected to return the concrete layout. This is not required by the IFRT API semantics yet, but it will be enforced in the future.
PiperOrigin-RevId: 821808592
We can now produce arbitrary iteration patterns for output tiles, simply by
parametrizing calls to `ComputeTiledHloInstructions` with different
`TiledHloSchedule`s.
PiperOrigin-RevId: 821796530
IFRT Proxy now returns a `nullptr` if it knows that the Array layout represents a default layout. User code has previously been migrated to handle this new behavior gracefully, obtaining a concrete default layout as before.
Caveat: the IFRT Proxy client infers the layout of the output arrays from `LoadedExecutable::GetOutputLayouts()`, which always returns concrete layouts today. Thus, these output arrays will use concrete layouts for default layouts, even if the arrays on the server side use `nullptr` for default layouts. This behavior is currently acceptable because all users convert the layout into a concrete one before using it; it will eventually change so that the IFRT Proxy client reflects the array layouts on the server side more accurately.
PiperOrigin-RevId: 821741105
The functionality was removed previously, but the option was never cleaned up. This does not remove the xla_ignore_channel_id debug option, because it also has a non-verifier use.
PiperOrigin-RevId: 821737613
Right now, we use `GetXlaPjrtCpuClient` which in turn calls `GetPjRtCpuClient`, but we will later update `GetXlaPjrtCpuClient` to use the C sandwich, in which case we must call `GetPjRtCpuClient` here in `PJRT_Client_Create`.
This change is a no-op.
PiperOrigin-RevId: 821732030
The dnn_version in device_description was not set. cl/816579045 fixed this for the old autotuner infra; this change ports that fix to the new autotuner infra.
PiperOrigin-RevId: 821728904
- We encounter this case very often (for the cuBLAS autotuner), so it makes sense to optimize it.
- Running cuBLAS kernels as part of autotuning has unintended side effects that change the optimized HLO; this fix also mitigates that issue while we look into it further.
PiperOrigin-RevId: 821716593
Imported from GitHub PR https://github.com/openxla/xla/pull/32782
📝 Summary of Changes
Fix hermetic build for rocm.
🎯 Justification
Introduce missing hipblaslt dependency.
Fix invalid libs linking and align with the data directories.
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
CI, not relevant
🧪 Unit Tests:
Not relevant
🧪 Execution Tests:
Not relevant
Copybara import of the project:
--
f5cb68b0df2265b7048d0068eedd07cccf67e228 by Alexandros Theodoridis <atheodor@amd.com>:
Add missing hermetic lib dependency
--
fe0c9a7fdd36180fea5cf63e20d864355ed98a6c by Alexandros Theodoridis <atheodor@amd.com>:
Add missing hipblaslt deps, fix the targets
--
540d79dd4287a013a3f178ef34a5b96fb8a8a92f by Alexandros Theodoridis <atheodor@amd.com>:
Make hipblaslt mandatory
--
3a6f2282669a1ece4518cc69a01ad76275b603a1 by Alexandros Theodoridis <atheodor@amd.com>:
Fix test
--
eb21b60d34978191315a0c9775d2cb53309dc72d by Alexandros Theodoridis <atheodor@amd.com>:
Ignore asnsigaltstack
--
54c8af2abd7dd682a8494caa05854d574209aa20 by Harsha Havanur Shamsundara <harsha.havanurshamsundara@amd.com>:
[ROCm] Use working sha256 for latest ROCm 7.0 docker image
--
9629a9fc9201a80dba7a0beecb8ee0797960ff6f by Harsha HS <Harsha.HavanurShamsundara@amd.com>:
[ROCm] Add ROCM_PATH repo_env to test scripts
--
1ef6772c6df6aeffcbcc2f27a0ede558fbc6270f by Alexandros Theodoridis <atheodor@amd.com>:
Fix buildifier warning
Merging this change closes #32782
PiperOrigin-RevId: 821614030
In cases where a program argument with AUTO layout is used in more than one Fragment, enforce the DEFAULT layout, as we cannot allow different compiled layouts.
PiperOrigin-RevId: 821612799
This enables migrating the Triton emitter to use the xtile entry, insert & extract emission in the child PR.
The main difference is the memref args in the entry function, for which `MemrefToPtr` & `PtrToMemref` were introduced; these closely resemble `UnrealizedConversionCastOp`, with additional verification, and will enable special folding of `memref::TransposeOp`.
PiperOrigin-RevId: 821593545
This gives us the two HalfClose events plus HandleEvent() and SendRawFrame() as
the API from the socket integration, and subclasses can handle these
accordingly. This also moves the responsibility for destruction into the
handler logic, with the contract that the event is removed from the loop on the second HalfClose event.
PiperOrigin-RevId: 821445213
Given a user seed, this will update the MSA sort order priority of a (small?) number of randomly selected instructions during compilation.
This causes small perturbations in the compiler's prefetching decisions, which allows for 2 main features:
1. finding out if there is a single instruction that was given a "wrong" priority by the compiler, so it can be fixed
- to do this, we run some benchmark many times with different seeds until we find a seed that drastically reduces the compiled code's runtime
- once we have found that seed, we can use binary search to narrow the "selection range" and zero in on the one specific offending instruction
2. finding a lot of small changes that together reduce the runtime
- we can do this using a "hill-climbing" method, sketched below
- try many perturbations until you find one slightly better than the baseline
- try many follow-up perturbations (perturbing the best perturbation from the previous stage) until you find one slightly better again
- repeat until no more improvements are found
NOTE: Right now there is no good way of finding which instructions had their priority adjusted (especially important in (1) to find the one offending instruction). The only way to do so is to increase the log level of the compilation debug print and then look at the logs.
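A hypothetical driver for the hill-climbing loop in (2); the benchmark hook stands in for an external compile-and-run harness and is not part of the MSA API:
```cpp
#include <cstdint>
#include <functional>

// Repeatedly perturbs the current best seed and keeps any candidate that
// beats it, stopping once a whole round yields no improvement.
uint64_t HillClimb(const std::function<double(uint64_t)>& benchmark_ms,
                   uint64_t base_seed, int trials_per_round) {
  uint64_t best_seed = base_seed;
  double best_ms = benchmark_ms(base_seed);
  for (bool improved = true; improved;) {
    improved = false;
    for (int t = 0; t < trials_per_round; ++t) {
      // Derive a candidate seed from the current best (an arbitrary mix).
      uint64_t candidate = best_seed * 6364136223846793005ULL + t + 1;
      double ms = benchmark_ms(candidate);
      if (ms < best_ms) {  // slightly better than the baseline: keep it
        best_ms = ms;
        best_seed = candidate;
        improved = true;
      }
    }
  }
  return best_seed;
}
```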
PiperOrigin-RevId: 821309046
This change is a no-op since both the newly introduced XLA:TPU option and the corresponding option on the ExportNamedComputation pass are false by default.
PiperOrigin-RevId: 821039969
The transposes are not identity permutations. Identity transposes
should already be eliminated separately in HandleTranspose.
PiperOrigin-RevId: 820903953
PjRt-IFRT directly or indirectly fetched the optimized HLO to get the output
layout mode and output layouts. This seems to introduce a regression in
some jobs that use the PJRT C API and have a too-large serialized HLO (> 2 GiB).
As a workaround, PjRt-IFRT gracefully handles output layout mode and
layout discovery errors and falls back to concrete layouts that are
directly obtained from output `PjRtBuffer`s, which should give the same
behavior as before the default layout handling change.
Further changes will follow to discover default layout modes and layouts
without going through `PjRtLoadedExecutable::GetHloModules()`.
PiperOrigin-RevId: 820785277
Add placeholders for future Type serialization/deserialization. This is not an ABI-breaking change, as it is unused today, and it avoids an ABI-breaking change later when FFI adds proper ser/des support for user-defined types.
PiperOrigin-RevId: 820676169
- The VLOG messages are updated to more accurately describe whether the autotuner found a config in the cache, used a default, or actively tuned for the best config.
- The error now contains the HLO instruction.
PiperOrigin-RevId: 820640768
This change utilizes recently added Triton support for smaller block sizes.
Skipping occupancy optimization for some configs is essentially a workaround for incompatible split_k values. The impact of these configs is limited, however, because they are only present in non-exhaustive mode, so they mostly get filtered out anyway.
PiperOrigin-RevId: 820617352
Before this change, we disallowed all-gather, so that the partitioner generated the `all-reduce(dynamic-update-slice())` pattern. With this change, we allow all-gather for two reasons.
1. In most cases, all-gather is allowed and preferred.
2. It is easier to read and match the partitioner result.
PiperOrigin-RevId: 820593767
Imported from GitHub PR https://github.com/openxla/xla/pull/32388
📝 Summary of Changes
Support collectives with non-minor-most last dimension in the sub-byte collective normalization pass.
🎯 Justification
Makes more collectives efficient by not requiring type conversion.
🚀 Kind of Contribution
Performance Improvement.
📊 Benchmark (for Performance Improvements)
```
Before:
## Execution time, file=u4_all_gather_1x8.hlo repeat=1 duration=68384ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=2 duration=67744ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=3 duration=66976ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=4 duration=67040ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=5 duration=66816ns
After:
## Execution time, file=u4_all_gather_1x8.hlo repeat=1 duration=41216ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=2 duration=40960ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=3 duration=40960ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=4 duration=41056ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=5 duration=40960ns
```
Measured on 8xH100 DGX.
🧪 Unit Tests:
yes
🧪 Execution Tests:
yes
Copybara import of the project:
--
a3777523ffffbcc59da285544e3fb5575d098b9c by Ilia Sergachev <isergachev@nvidia.com>:
[GPU] Sub-byte collective normalization: support collectives with non-minor-most last dimension.
Merging this change closes #32388
PiperOrigin-RevId: 820585923
Imported from GitHub PR https://github.com/openxla/xla/pull/32678
📝 Summary of Changes
- Fix sha256 of docker image to ensure CI is not broken due to malformed image
- Fix test scripts by passing ROCM_PATH to bazel sandbox via repo_env
🎯 Justification
Continued CI runs
🚀 Kind of Contribution
🧪 Tests
Copybara import of the project:
--
3ca8114613d8e002c137f28bb6608639d08a724a by Harsha Havanur Shamsundara <harsha.havanurshamsundara@amd.com>:
[ROCm] Use working sha256 for latest ROCm 7.0 docker image
--
09ddfbdf205a6406cdd67e20671f41455fffe0f9 by Harsha HS <Harsha.HavanurShamsundara@amd.com>:
[ROCm] Add ROCM_PATH repo_env to test scripts
Merging this change closes #32678
PiperOrigin-RevId: 820582560
Imported from GitHub PR https://github.com/openxla/xla/pull/32718
📝 Summary of Changes
This PR adds conv fusion support in cudnn fusion compiler.
* add a conv type in `CuDnnFusionConfig` to represent different types of conv. We are getting rid of the conv custom-call target, so this info has to be preserved in the fusion config.
* add `ConvDimensionAdapter` to generate an NCHW **logical layout** for the cudnn frontend, while the physical layout can be NHWC (the most preferable layout) or NCHW (for int conv). Only the NHWC layout is used in the unit tests because layout assignment currently doesn't handle conv fusion to transform other layouts to NHWC; this needs to be addressed in a separate PR.
* add a conv translation rule from XLA conv to the cudnn frontend graph API.
* Other parts of the lowering are handled automatically by the current cudnn fusion compiler: workspace allocation, graph validation, graph compilation, and graph serialization.
🎯 Justification
This is the first step toward unifying conv as a cudnn fusion in XLA. The conv custom call will be replaced with conv fusions in the future.
🚀 Kind of Contribution
✨ New Feature
📊 Benchmark (for Performance Improvements)
No Performance changes are expected.
🧪 Unit Tests:
Added 3 hand written NHWC conv unit tests for conv_fprop/conv_dgrad/conv_wgrad.
🧪 Execution Tests:
Added 3 hand written NHWC conv unit tests for conv_fprop/conv_dgrad/conv_wgrad.
Copybara import of the project:
--
57555cd0e3759aacb7a98135c3261f4cc3f642c2 by Cjkkkk <ske@nvidia.com>:
init
--
d6edecfa42a6371a0908e22daeb8deaf32998ece by Cjkkkk <ske@nvidia.com>:
address comments
--
17df6f8451274f070d7d332a126cfefa1ef7df83 by Cjkkkk <ske@nvidia.com>:
removed one comment
--
1b7c63b1ade7751cf8f68c7fb11cd68491440081 by Cjkkkk <ske@nvidia.com>:
add const
Merging this change closes #32718
PiperOrigin-RevId: 820574737
We're perfectly able to construct a schedule using only a subset of the
iteration space of a `tile_offsets_indexing`---and in fact need to when we are
processing nested fusions.
PiperOrigin-RevId: 820454010
* Deserializing MLIR modules still tries to parse the payload as a string first, as that's the default; on failure, it tries to uncompress and then parse.
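A hedged sketch of that fallback order; `Uncompress` is a placeholder for whatever decompression the serialization format actually uses:
```cpp
#include <string>

#include "llvm/ADT/StringRef.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Parser/Parser.h"

std::string Uncompress(llvm::StringRef data);  // hypothetical helper

// Try the plain-text parse first, since that is the default serialization;
// only on failure assume the payload is compressed and parse again.
mlir::OwningOpRef<mlir::ModuleOp> DeserializeModule(llvm::StringRef data,
                                                    mlir::MLIRContext* ctx) {
  mlir::ParserConfig config(ctx);
  if (auto module = mlir::parseSourceString<mlir::ModuleOp>(data, config)) {
    return module;
  }
  std::string uncompressed = Uncompress(data);
  return mlir::parseSourceString<mlir::ModuleOp>(uncompressed, config);
}
```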
PiperOrigin-RevId: 820396326