Commit Graph

28187 Commits

Ilya Tikhonovskiy
13376b4b8a [XLA:GPU] change 'checksum' field name to 'value'
We use this field for two different buffer debug kernels that have different semantics. Technically we could have two different structures, but that does not make much sense at the moment. Let's use the one we already have, with a generic name.

PiperOrigin-RevId: 824532743
2025-10-27 08:51:52 -07:00
Eusebio Durán Montaña
5e5976e01f Clean up includes and dependencies in the ../gpu/runtime directory.
Had to manually add an `IWYU pragma: keep` in select_k_exec_stub.cc; otherwise the `::xla::bfloat16` type isn't found.
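
For context, `IWYU pragma: keep` is a standard include-what-you-use annotation telling the tool not to strip an include it believes is unused. A minimal sketch (the header name below is illustrative, not necessarily the actual include in select_k_exec_stub.cc):

```cpp
// IWYU sees no direct reference to this header, but it provides
// ::xla::bfloat16, which is only used via templates, so keep it.
#include "xla/types.h"  // IWYU pragma: keep
```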

PiperOrigin-RevId: 824529669
2025-10-27 08:39:25 -07:00
A. Unique TensorFlower
fd2941bc67 Update calls to HloModule::CreateFromProto in hlo_module_util to remap instruction ids by default. This should speed up compilation.
PiperOrigin-RevId: 824521542
2025-10-27 08:16:29 -07:00
dependabot[bot]
60ac8fa628 PR #32968: Bump keras from 3.9.0 to 3.11.3 in /xla/backends/cpu/benchmarks/e2e/gemma2/keras
Imported from GitHub PR https://github.com/openxla/xla/pull/32968

Bumps [keras](https://github.com/keras-team/keras) from 3.9.0 to 3.11.3.

Release notes (sourced from keras's releases; truncated):

Keras 3.11.3
- Version bump to 3.11.3 by @rtg0795 in keras-team/keras#21607
Full changelog: https://github.com/keras-team/keras/compare/v3.11.2...v3.11.3

Keras 3.11.2
- Version bump 3.11.2 and nnx fix (#21565) by @laxmareddyp in keras-team/keras#21570
New contributors: @laxmareddyp made their first contribution in keras-team/keras#21570
Full changelog: https://github.com/keras-team/keras/compare/v3.11.1...v3.11.2

Keras 3.11.1
- Version bump 3.11.1 by @rtg0795 in keras-team/keras#21535
Full changelog: https://github.com/keras-team/keras/compare/v3.11.0...v3.11.1

Keras 3.11.0
- Add int4 quantization support.
- Support Grain (https://github.com/google/grain) data loaders in `fit()`/`evaluate()`/`predict()`.
- Add `keras.ops.kaiser` function.
- Add `keras.ops.hanning` function.
- Add `keras.ops.cbrt` function.
- Add `keras.ops.deg2rad` function.
- Add `keras.ops.layer_normalization` function to leverage backend-specific performance optimizations.
- Various bug fixes and performance optimizations.
Backend-specific changes:
- JAX backend: support the NNX library (Keras layers and models can now be used as NNX modules); support shape -1 for the slice op.
- TensorFlow backend: add support for multiple dynamic dimensions in the `Flatten` layer.
- OpenVINO backend: (truncated)

Commits (truncated):
- b491c86 Version bump to 3.11.3 (#21607)
- 251ac34 Version bump 3.11.2 and nnx fix #21565 (#21570)
- 0e11071 Version bump 3.11.1 (#21535)
- 7bf852c Update flax (#21527)
- 4085046 [OpenVINO backend] fix openvino model exported names to match keras names (#2...)
- 6bc6203 Fix a few typos in comments (#21525)
- 8bf6a58 Add `VectorizedMap` op class. (#21516)
- 7cb0e48 update python version (#21517)
- 7b9ab6a Fix: UpSampling2D bilinear set_image_data_format(channels_first) bug (#21456)
- 90c8da6 Fix `_can_use_flash_attention`. (#21512)
- Additional commits viewable in the compare view: https://github.com/keras-team/keras/compare/v3.9.0...v3.11.3

Copybara import of the project:

--
103d4253e3cb9ef8885a36014359c4a437c465a6 by dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>:

Bump keras in /xla/backends/cpu/benchmarks/e2e/gemma2/keras

Bumps [keras](https://github.com/keras-team/keras) from 3.9.0 to 3.11.3.
- [Release notes](https://github.com/keras-team/keras/releases)
- [Commits](https://github.com/keras-team/keras/compare/v3.9.0...v3.11.3)

---
updated-dependencies:
- dependency-name: keras
  dependency-version: 3.11.3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Merging this change closes #32968

PiperOrigin-RevId: 824516553
2025-10-27 07:57:21 -07:00
Eugene Zhulenev
5dfa57fd92 [xla:ffi] Use same id sequence for internal and external types
Add an API to look up type id and info by type name. We can't rely on type ids for serialization, as they are not stable: they are assigned at run time depending on the type registration order. Type names, on the other hand, must be stable.
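
A minimal sketch of the idea with hypothetical names (the actual XLA FFI registry differs): ids are handed out in registration order, so only the stable string name is safe to serialize.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical type registry. Ids depend on registration order, which can
// differ between processes, so they must never be serialized directly.
struct TypeInfo {
  int64_t type_id;  // run-time-assigned, unstable across runs
  size_t size;      // extra per-type info, illustrative
};

class TypeRegistry {
 public:
  int64_t Register(const std::string& name, size_t size) {
    auto [it, inserted] = types_.try_emplace(name, TypeInfo{next_id_, size});
    if (inserted) ++next_id_;
    return it->second.type_id;
  }

  // The stable lookup path: resolve a serialized name back to today's id.
  std::optional<TypeInfo> FindByName(const std::string& name) const {
    auto it = types_.find(name);
    if (it == types_.end()) return std::nullopt;
    return it->second;
  }

 private:
  int64_t next_id_ = 1;
  std::map<std::string, TypeInfo> types_;
};
```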

PiperOrigin-RevId: 824512487
2025-10-27 07:47:23 -07:00
A. Unique TensorFlower
0e809d4bc8 [XLA:GPU] Add simple multimem one-shot example.
PiperOrigin-RevId: 824508652
2025-10-27 07:34:24 -07:00
A. Unique TensorFlower
51d2e6931b Update pip dependency reference from @pypi_XXX//:pkg to @pypi//XXX.
PiperOrigin-RevId: 824505081
2025-10-27 07:21:22 -07:00
A. Unique TensorFlower
b15498a538 Add Symbolic/Affine converter methods for IndexingMap
- Renamed SymbolicToAffine to SymbolicExprToAffineExpr and made it public (needed for IndexingMap::GetConstraints)
- Renamed AffineToSymbolicExpr to AffineExprToSymbolicExpr
- Added AffineExprsToSymbolicExprs to convert a list of mlir::AffineExpr to a vector of xla::gpu::SymbolicExpr (needed for IndexingMap::ConstraintsSatisfied)

PiperOrigin-RevId: 824492246
2025-10-27 06:51:35 -07:00
Ilya Tikhonovskiy
aded8e05e0 [XLA:GPU] add buffer_nan_count_thunk for the buffer_nan_count_kernel
In a follow-up CL we will need to add this thunk to the buffer debug pass. There we will also need to infer the buffer element type.
Another refactoring would be to rename the payload, which is currently the checksum, to something more generic like 'value' or 'result'.
One more thing we could do is reduce code duplication by merging the two thunks, the checksum one and the NaN counter one.

PiperOrigin-RevId: 824491914
2025-10-27 06:41:06 -07:00
Karlo Basioli
3f5b49f242 [XLA:CPU] Add target machine options to compilation result proto and check compilation arch when loading aot result.
This is used to check whether the runtime and compilation environments are compatible.

PiperOrigin-RevId: 824481786
2025-10-27 06:05:08 -07:00
Thomas Joerg
e34b86def5 [XLA:GPU] Do not create transpose ops with non-default layout in DotDecomposer.
The `DotDecomposer` pass runs ahead of layout assignment. Introducing non-default layouts at this stage causes complications for subsequent passes, in particular the `DotMerger` pass.

PiperOrigin-RevId: 824476578
2025-10-27 05:54:10 -07:00
Dragan Mladjenovic
77bed2c6ef PR #32439: [ROCm] Fix and enable xla_gpu_use_embeded_device_lib and xla_gpu_use_inprocess_lld
Imported from GitHub PR https://github.com/openxla/xla/pull/32439

📝 Summary of Changes
Enable embedded device libs and in-process lld by default.

🎯 Justification
Makes the amdgpu backend more independent of the filesystem layout.

🚀 Kind of Contribution
🐛 Bug Fix

📊 Benchmark (for Performance Improvements)
N/A

🧪 Unit Tests:
None

🧪 Execution Tests:
None

Copybara import of the project:

--
46a100377d00d30dbc79e34c977b9219c54bda4b by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

[ROCm] Fix and enable xla_gpu_use_embeded_device_lib and xla_gpu_use_inprocess_lld

Merging this change closes #32439

PiperOrigin-RevId: 824476138
2025-10-27 05:44:57 -07:00
Adrian Kuegel
4f15c0c9d3 [XLA:GPU] Choose a deterministic function name for nested computations
absl::Hash is not deterministic across different runs of the same program. Use
Fingerprint128 instead, and don't include the address of the computation.
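
A sketch of the approach, assuming tsl::Fingerprint128 from tsl/platform/fingerprint.h (the function name format below is illustrative, not the emitter's actual scheme):

```cpp
#include <string>

#include "absl/strings/str_format.h"
#include "tsl/platform/fingerprint.h"

// Derive the emitted function name from the computation's name alone.
// absl::Hash is randomly seeded per process and a pointer differs per run,
// so neither may feed into the name; Fingerprint128 of a stable string is
// reproducible across runs.
std::string NestedComputationFnName(const std::string& computation_name) {
  tsl::Fprint128 fp = tsl::Fingerprint128(computation_name);
  return absl::StrFormat("nested_%016x_%016x", fp.high64, fp.low64);
}
```
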
PiperOrigin-RevId: 824460524
2025-10-27 04:53:10 -07:00
Alexander Grund
9ec8d8ece3 PR #31886: Fix libdevice search
Imported from GitHub PR https://github.com/openxla/xla/pull/31886

📝 Summary of Changes
This enhances the search for the CUDA libdevice path:
- Fix an invalid empty path that was added when `TF_CUDA_TOOLKIT_PATH` is defined but empty
- Fix invalid paths based on runtime folders: `runfiles_dir.substr(0, runfiles_ind + runfiles_suffix.length())` is not meaningful when `runfiles_ind` is invalid, i.e. `std::string::npos`
- Add `$CUDA_HOME` to the search paths, as TensorFlow already does (see the sketch after this list)
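
A condensed sketch of the three checks together; names are illustrative rather than the actual `CandidateCudaRoots` implementation:

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Illustrative combination of the three fixes described above.
std::vector<std::string> CandidateCudaRoots(const std::string& runfiles_dir) {
  std::vector<std::string> roots;

  // Only trust the runfiles prefix if the suffix was actually found;
  // otherwise runfiles_ind is std::string::npos and substr() is meaningless.
  const std::string runfiles_suffix = ".runfiles";
  size_t runfiles_ind = runfiles_dir.rfind(runfiles_suffix);
  if (runfiles_ind != std::string::npos) {
    roots.push_back(
        runfiles_dir.substr(0, runfiles_ind + runfiles_suffix.length()));
  }

  // Skip TF_CUDA_TOOLKIT_PATH when the define expands to an empty string.
#if defined(TF_CUDA_TOOLKIT_PATH)
  if (std::string path = TF_CUDA_TOOLKIT_PATH; !path.empty()) {
    roots.push_back(path);
  }
#endif

  // New: honor $CUDA_HOME for non-standard installs (e.g. HPC modules).
  if (const char* cuda_home = std::getenv("CUDA_HOME");
      cuda_home != nullptr && *cuda_home != '\0') {
    roots.push_back(cuda_home);
  }
  return roots;
}
```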

🎯 Justification
Without this, the libdevice file won't be found if CUDA isn't installed in a standard location, or if, e.g., an updated version is available in a different location.
This is the case on, e.g., HPC systems where multiple CUDA versions are available side by side.

🚀 Kind of Contribution
🐛 Bug Fix, ♻️ Cleanup

Fixes #28590

🧪 Unit Tests:

Simple test that when `CandidateCudaRoots` returns anything, it contains `$CUDA_HOME`

Copybara import of the project:

--
01788b896900717ee916377a71d5c14963e0176d by Alexander Grund <alexander.grund@tu-dresden.de>:

Fix libdevice search when outside test environment

When there is no `runfiles_suffix`, the `rfind` returns
`std::string::npos`, which should be handled so that meaningless paths are not added.

--
900715a846102bacdfc7688f14713cbe6101506d by Alexander Grund <alexander.grund@tu-dresden.de>:

Use `$CUDA_HOME` when searching for libdevice.

With a CUDA installed to a non-default location XLA/TF fails with:
> gpu_backend_lib.cc:579] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
> Searched for CUDA in the following directories:
>   ./cuda_sdk_lib
>   /builddir/TensorFlow/TensorFlow-2.x_mnist-test.py.runfiles/cuda_nvcc
>   /buildi/cuda_nvcc
>
>   /usr/local/cuda
>   /software/TensorFlow/lib/python3.12/site-packages/tensorflow/python/platform/../../../nvidia/cuda_nvcc
>   /software/TensorFlow/lib/python3.12/site-packages/tensorflow/python/platform/../../../../nvidia/cuda_nvcc
>   /software/TensorFlow/lib/python3.12/site-packages/tensorflow/python/platform/../../cuda

Consider $CUDA_HOME as an additional location after the runfiles dirs (used for tests)

--
905d0596d199598036032f0f84b4487e9afd2bef by Alexander Grund <alexander.grund@tu-dresden.de>:

Don't add empty TF_CUDA_TOOLKIT_PATH to libdevice search

At least in some environments, that define is the empty string, which
doesn't make sense to add to the search paths.
Add a check for that.

--
23eb59bfabd570caabf0b9ec3515233f46a4fae7 by Alexander Grund <alexander.grund@tu-dresden.de>:

Add test for $CUDA_HOME in CandidateCudaRoots

--
a8c215bc222b4ba8581f2f44549613ebd59b9cbb by Alexander Grund <alexander.grund@tu-dresden.de>:

Add braces to loops/conditions

--
39efc67f8b1d44e131f993c8040b7eb69ff52f0c by Alexander Grund <alexander.grund@tu-dresden.de>:

Use kIsOpenSource in skip condition

Merging this change closes #31886

PiperOrigin-RevId: 824450284
2025-10-27 04:10:14 -07:00
Will Froom
4d623afca2 [XLA][XTile] Make transpose folder work with xtile extract.
PiperOrigin-RevId: 824439434
2025-10-27 03:39:25 -07:00
Henning Becker
76a084f181 Move Attribute types from call_frame.cc into attribute_map.cc
This is moving `Scalar`, `Array`, `Dictionary`, `FlatAttribute`, `FlatAttributeMap`, and `AttributeMap` from `CallFrameBuilder` into the `xla::ffi` namespace.

It also moves the code into `attribute_map.{cc|h}`.

All these types are basically aliases for some kind of `std::variant` type. This change is a preparation for making them proper types and adding `ToProto` and `FromProto` methods.
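
Roughly, the moved aliases have this shape; a simplified sketch, not the actual definitions in attribute_map.h (the real Dictionary also nests recursively, elided here):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Simplified stand-ins for the moved aliases: each attribute is ultimately
// a std::variant, and a map associates attribute names with values.
using Scalar = std::variant<bool, int32_t, int64_t, float, double>;
using Array = std::variant<std::vector<int32_t>, std::vector<int64_t>,
                           std::vector<float>, std::vector<double>>;

// A "flat" attribute has no nested dictionaries, only leaf values.
using FlatAttribute = std::variant<Scalar, Array, std::string>;
using FlatAttributeMap = std::map<std::string, FlatAttribute>;

// Wrapping the bare alias in a proper class is what makes it possible to
// attach ToProto()/FromProto() methods in a follow-up change.
class AttributeMap {
 public:
  explicit AttributeMap(FlatAttributeMap attrs) : attrs_(std::move(attrs)) {}
  const FlatAttributeMap& attrs() const { return attrs_; }

 private:
  FlatAttributeMap attrs_;
};
```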

PiperOrigin-RevId: 824435281
2025-10-27 03:22:14 -07:00
Ilya Tikhonovskiy
78a0ca0b60 [XLA:GPU] add nan count cuda kernel
The kernel is similar to the one for the checksum calculation.

PiperOrigin-RevId: 824428856
2025-10-27 03:08:29 -07:00
Will Froom
dfeccf211b [XLA:CPU][XTile] Implement pass to rewrite dynamic vector extracts to static.
PiperOrigin-RevId: 824427163
2025-10-27 02:56:12 -07:00
Junwhan Ahn
9add8b7e61 Fix a bug where S2/U2 dtypes were missing proto conversion
Also fixed the round trip test to not ignore `kInvalid` returned from proto conversion, which is why we didn't catch this bug.

PiperOrigin-RevId: 824419619
2025-10-27 02:41:11 -07:00
Adrian Kuegel
5d42d91467 Make sure to produce a deterministic Memory usage report.
PiperOrigin-RevId: 824403476
2025-10-27 01:41:57 -07:00
Eugene Zhulenev
3630944d0f [xla:ffi] Add support for binding Context object to the handler
PiperOrigin-RevId: 824278531
2025-10-26 16:48:26 -07:00
Michael Kuperstein
2de4be94aa [XLA] Remove HLO unstacker.
The pass is not used.

PiperOrigin-RevId: 824274493
2025-10-26 16:31:48 -07:00
Eugene Zhulenev
cc5ee2577c [xla:ffi] Add support for std::variant<> attributes decoding
PiperOrigin-RevId: 824272994
2025-10-26 16:24:59 -07:00
A. Unique TensorFlower
32c1551f24 Add support for int8 dots, and allow bf16 to be used on any CPU.
PiperOrigin-RevId: 824272399
2025-10-26 16:13:53 -07:00
Eugene Zhulenev
5edcd28152 [xla:cpu:ynn] Do not track work stealing workers
```
name                                                               cpu/op         cpu/op      vs base
BM_ParallelFor/8/1/process_time   [#threads=8, #threadpools=1  ]    5.470m ±  5%   5.095m ± 3%  -6.87% (p=0.000 n=80)
BM_ParallelFor/8/2/process_time   [#threads=8, #threadpools=2  ]    2.857m ±  1%   2.595m ± 2%  -9.15% (n=80)
BM_ParallelFor/8/4/process_time   [#threads=8, #threadpools=4  ]    1.447m ± 10%   1.328m ± 1%  -8.23% (p=0.000 n=80)
BM_ParallelFor/8/8/process_time   [#threads=8, #threadpools=8  ]   1058.1µ ± 20%   974.5µ ± 1%  -7.90% (p=0.000 n=80)
BM_ParallelFor/8/16/process_time  [#threads=8, #threadpools=16 ]    741.5µ ± 26%   705.8µ ± 1%  -4.81% (p=0.000 n=80)
BM_ParallelFor/16/1/process_time  [#threads=16, #threadpools=1 ]    9.796m ± 29%   9.972m ± 2%       ~ (p=0.312 n=80)
BM_ParallelFor/16/2/process_time  [#threads=16, #threadpools=2 ]    7.871m ± 28%   7.706m ± 1%  -2.10% (p=0.030 n=80)
BM_ParallelFor/16/4/process_time  [#threads=16, #threadpools=4 ]    4.330m ±  2%   4.157m ± 1%  -3.99% (p=0.000 n=80)
BM_ParallelFor/16/8/process_time  [#threads=16, #threadpools=8 ]    2.678m ±  2%   2.638m ± 1%  -1.49% (p=0.014 n=80)
BM_ParallelFor/16/16/process_time [#threads=16, #threadpools=16]    1.791m ±  1%   1.807m ± 1%       ~ (p=0.325 n=80)
BM_ParallelFor/32/1/process_time  [#threads=32, #threadpools=1 ]    15.33m ±  1%   15.41m ± 1%       ~ (p=0.215 n=80)
BM_ParallelFor/32/2/process_time  [#threads=32, #threadpools=2 ]    13.99m ±  1%   13.80m ± 2%       ~ (p=0.400 n=80)
BM_ParallelFor/32/4/process_time  [#threads=32, #threadpools=4 ]    9.415m ±  1%   9.172m ± 1%  -2.58% (p=0.000 n=80)
BM_ParallelFor/32/8/process_time  [#threads=32, #threadpools=8 ]    5.759m ±  1%   5.647m ± 1%  -1.95% (p=0.004 n=80)
BM_ParallelFor/32/16/process_time [#threads=32, #threadpools=16]    3.932m ±  1%   3.864m ± 1%  -1.72% (p=0.006 n=80)
geomean                                                            4.051m         3.916m       -3.32%

name                                                               time/op        time/op     vs base
BM_ParallelFor/8/1/process_time   [#threads=8, #threadpools=1  ]    651.2µ ±  3%   600.3µ ± 4%  -7.80% (p=0.000 n=80)
BM_ParallelFor/8/2/process_time   [#threads=8, #threadpools=2  ]    329.4µ ±  0%   298.6µ ± 2%  -9.35% (n=80)
BM_ParallelFor/8/4/process_time   [#threads=8, #threadpools=4  ]    169.3µ ± 12%   155.7µ ± 1%  -8.05% (p=0.000 n=80)
BM_ParallelFor/8/8/process_time   [#threads=8, #threadpools=8  ]    125.8µ ± 21%   115.7µ ± 1%  -8.08% (p=0.000 n=80)
BM_ParallelFor/8/16/process_time  [#threads=8, #threadpools=16 ]    95.41µ ± 24%   89.56µ ± 1%  -6.13% (p=0.000 n=80)
BM_ParallelFor/16/1/process_time  [#threads=16, #threadpools=1 ]   1015.8µ ±  1%   952.0µ ± 1%  -6.29% (n=80)
BM_ParallelFor/16/2/process_time  [#threads=16, #threadpools=2 ]    556.5µ ±  1%   522.6µ ± 1%  -6.09% (n=80)
BM_ParallelFor/16/4/process_time  [#threads=16, #threadpools=4 ]    289.7µ ±  2%   274.4µ ± 1%  -5.30% (p=0.000 n=80)
BM_ParallelFor/16/8/process_time  [#threads=16, #threadpools=8 ]    178.8µ ±  2%   174.1µ ± 1%  -2.59% (p=0.000 n=80)
BM_ParallelFor/16/16/process_time [#threads=16, #threadpools=16]    123.9µ ±  2%   123.0µ ± 1%       ~ (p=0.098 n=80)
BM_ParallelFor/32/1/process_time  [#threads=32, #threadpools=1 ]    1.526m ±  3%   1.433m ± 3%  -6.07% (p=0.000 n=80)
BM_ParallelFor/32/2/process_time  [#threads=32, #threadpools=2 ]    835.2µ ±  2%   783.5µ ± 2%  -6.19% (p=0.000 n=80)
BM_ParallelFor/32/4/process_time  [#threads=32, #threadpools=4 ]    471.6µ ±  2%   455.1µ ± 1%  -3.52% (p=0.000 n=80)
BM_ParallelFor/32/8/process_time  [#threads=32, #threadpools=8 ]    296.1µ ±  2%   287.0µ ± 2%  -3.08% (p=0.000 n=80)
BM_ParallelFor/32/16/process_time [#threads=32, #threadpools=16]    215.0µ ±  2%   211.6µ ± 1%  -1.59% (p=0.018 n=80)
geomean                                                            330.2µ         312.3µ       -5.42%
```

PiperOrigin-RevId: 824259124
2025-10-26 15:16:20 -07:00
Eugene Zhulenev
e65144c31f [xla:ffi] Check that Type used as a state is registered before the handler
PiperOrigin-RevId: 824258481
2025-10-26 15:06:40 -07:00
Eugene Zhulenev
87e3b84514 [tsl:concurrency] In Promise replace IsUnique() with NumRef() == 1
The meaning of AsyncValue::IsUnique() is fuzzy for chains of indirect async values. Prefer the simpler uniqueness check in the Future/Promise library.

Also update AsyncValue::IsUnique() documentation.
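
The substituted check, as a self-contained sketch with a hypothetical stand-in for AsyncValue (the real class lives in tsl/concurrency):

```cpp
#include <atomic>

// Hypothetical stand-in exposing just the ref-count accessor.
class AsyncValue {
 public:
  unsigned NumRef() const { return refcount_.load(std::memory_order_acquire); }

 private:
  std::atomic<unsigned> refcount_{1};
};

// In Promise: an explicit ref-count comparison replaces IsUnique(), whose
// semantics are fuzzy when indirect async values form a chain.
bool PromiseHoldsLastRef(const AsyncValue& v) { return v.NumRef() == 1; }
```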

PiperOrigin-RevId: 824256830
2025-10-26 14:48:28 -07:00
Eugene Zhulenev
72d04ced58 [xla:cpu] Correctly measure CPU time in slinky thread pool benchmark
PiperOrigin-RevId: 824253351
2025-10-26 14:23:48 -07:00
Ivo Ristovski List
f3689e1314 Automated Code Change
PiperOrigin-RevId: 824094790
2025-10-26 00:36:30 -07:00
A. Unique TensorFlower
0338b08bee Add mechanism to prioritize ForceDelay custom calls
PiperOrigin-RevId: 823973702
2025-10-25 14:20:20 -07:00
A. Unique TensorFlower
6bede44c1a Integrate LLVM at llvm/llvm-project@621ed04e28
Updates LLVM usage to match
[621ed04e2878](https://github.com/llvm/llvm-project/commit/621ed04e2878)

PiperOrigin-RevId: 823941203
2025-10-25 11:22:30 -07:00
Christian Sigg
c50123703d Increment XLA GPU autotune cache version to 15.
This change invalidates the autotune cache, which is necessary because enabling the generic emitter (cl/823475406) affected autotuning results.

PiperOrigin-RevId: 823818338
2025-10-25 00:26:42 -07:00
Alexander Shaposhnikov
171247d500 Temporarily bring back the old logic for capturing RHS.
PiperOrigin-RevId: 823712382
2025-10-24 17:00:45 -07:00
Abhinav Gunjal
c7b4a8e3a5 Automated Code Change
Reverts 1b838a947b

PiperOrigin-RevId: 823696235
2025-10-24 16:11:34 -07:00
A. Unique TensorFlower
05c94a96e4 Integrate LLVM at llvm/llvm-project@704240125d
Updates LLVM usage to match
[704240125ddf](https://github.com/llvm/llvm-project/commit/704240125ddf)

PiperOrigin-RevId: 823662883
2025-10-24 14:26:34 -07:00
A. Unique TensorFlower
42d764666d Integrate LLVM at llvm/llvm-project@917d1f20ae
Updates LLVM usage to match
[917d1f20aecf](https://github.com/llvm/llvm-project/commit/917d1f20aecf)

PiperOrigin-RevId: 823542980
2025-10-24 08:50:00 -07:00
A. Unique TensorFlower
69c93c6f6a Reshard on call output if sharding mismatches with the func result.
Behaviorally, this is a no-op for Shardy, because the call output and func result may mismatch only if the dedup-functions-fully option is true, and this option is false by default.

Shardy adds explicit reshards (during the Shardy partitioner) on operations that use the output of a named computation, assuming that output is sharded as specified in the named computation's out shardings.

When the dedup-functions-fully option is true, however, the function that is actually called may end up having a different output sharding than the corresponding named computation. The users of the output should still see the sharding specified in the named computation's out shardings. Hence, if there is a mismatch between the named computation's output sharding and the function's result sharding, we add a reshard on the output of the call.

PiperOrigin-RevId: 823494391
2025-10-24 05:59:15 -07:00
Ilya Tikhonovskiy
0c0947cea6 [XLA:GPU] Initialize PrecisionConfig for ScaledDot in composite rewriter.
Explicitly set the operand precisions to `PrecisionConfig::DEFAULT` when creating a `ScaledDot` instruction from a composite call.

PiperOrigin-RevId: 823488638
2025-10-24 05:38:20 -07:00
Eugene Zhulenev
a5fca6a9b5 [tsl:concurrency] Do not use executor if detached future is unused
+ use `ptr` when using `AsPtr()` for consistency
+ rename `Wrap` to `AndThen` as it's more meaningful and makes profiles readable

PiperOrigin-RevId: 823476695
2025-10-24 04:55:18 -07:00
Christian Sigg
c8cc7f2fbb [XLA:GPU] Enable generic triton emitter for all gemms, second attempt.
According to benchmarks we have reached neutrality with the legacy emitter, so we are switching to the new emitter by default. The legacy emitter will be kept around for some time, but it is considered deprecated, should not be used, and will be deleted in the near future.

Reverts 85c99b1ecb

PiperOrigin-RevId: 823475406
2025-10-24 04:46:17 -07:00
Benjamin Chetioui
acf7f31c31 [XLA:GPU] Fix index of operand in call to GetNonContractingDims.
PiperOrigin-RevId: 823358506
2025-10-23 22:57:57 -07:00
A. Unique TensorFlower
cbbed7a2fd Automated Code Change
PiperOrigin-RevId: 823350718
2025-10-23 22:40:22 -07:00
Eugene Zhulenev
ef326c74ef [xla:cpu] Add benchmarks for SlinkyThreadPool
```
BM_ParallelFor/8/1      364687 ns       228963 ns         2974 items_per_second=43.6752M/s #threads=8, #threadpools=1
BM_ParallelFor/8/2      226687 ns       176171 ns         2877 items_per_second=56.763M/s #threads=8, #threadpools=2
BM_ParallelFor/8/4      211589 ns       184345 ns         5816 items_per_second=54.2462M/s #threads=8, #threadpools=4
BM_ParallelFor/8/8      177793 ns       162265 ns         3788 items_per_second=61.6275M/s #threads=8, #threadpools=8
BM_ParallelFor/8/16     206898 ns       192792 ns         3339 items_per_second=51.8693M/s #threads=8, #threadpools=16
```

PiperOrigin-RevId: 823321692
2025-10-23 21:14:28 -07:00
Alexander Shaposhnikov
3be9a21d7e Add initial support for offloading dots to YNNPACK.
PiperOrigin-RevId: 823318539
2025-10-23 21:04:45 -07:00
Zac Mustin
5893a54e81 Add PJRT c sandwich benchmarks to nanort benchmarks.
PiperOrigin-RevId: 823259666
2025-10-23 18:15:39 -07:00
Benjamin Chetioui
4ed3ee15e7 [XLA:GPU] Allow simplifying some dot point dimensions in SymbolicTileAnalysis.
Previously, we would never allow simplification when encountering a `dot`
instruction. But this constraint was overly conservative; the only dimensions
that we shouldn't simplify are those along which we intend to perform
non-standard padding to fit hardware restrictions, i.e. the non-contracting
and contracting dimensions.

Restricting this pattern further works around a bug whereby expanding a
non-standardly padded dimension into a `1` dim can result in propagating a
tile with the wrong size.

The underlying reason for this is a bug in the `kPreserve` behaviour of
`IndexingMap` simplification, which will need to be fixed separately (the new
tiling should avoid this issue, since it shouldn't rely on the correctness of
`IndexingMap` simplification at this level).

PiperOrigin-RevId: 823258725
2025-10-23 18:03:30 -07:00
A. Unique TensorFlower
ce800f5880 Re-enable testing of the thread pool in YnnFusionThunkTest
This was disabled when we didn't have the thread pool available, but now we do.

PiperOrigin-RevId: 823247913
2025-10-23 17:27:12 -07:00
Matthias Guenther
1b838a947b Enable Stablehlo -> HLO lowering by default.
Note that, in order to maintain parity with MHLO optimizations, this enables the `assume-no-undeclared-side-effects` option. This matches the default behavior for MHLO, but StableHLO is more cautious by default. Empirically, past evidence suggests it's pretty safe given that MHLO has been doing it all this time. Disabling the flag can result in significantly larger HLO after lowering, so we enable it here.

PiperOrigin-RevId: 823234079
2025-10-23 16:43:13 -07:00
A. Unique TensorFlower
774bc48035 Update XNNPACK in XLA
PiperOrigin-RevId: 823199731
2025-10-23 15:24:59 -07:00
Maxim Ermilov
9ee1d967e1 initialize nvml in CudaPlatform
PiperOrigin-RevId: 823193773
2025-10-23 15:03:33 -07:00
Eugene Zhulenev
d55e5c1d9f [xla:cpu] Use work_item instead of a task in WorkQueue/Worker API
To avoid confusion between the different kinds of tasks we have in Worker/WorkQueue and in the SlinkyThreadPool, XLA now uses the more generic name "work item".

PiperOrigin-RevId: 823191886
2025-10-23 14:46:12 -07:00
A. Unique TensorFlower
512b85961f [XLA] Support passing a random engine to LoadAndRunAndDump
By passing a random engine the user can run HLOs with deterministic
random inputs.
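
A sketch of what a deterministic input stream looks like on the caller side (the actual LoadAndRunAndDump parameter is not shown in this log, so the helper below is illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// With a fixed seed, the generated inputs are identical on every run, which
// makes HLO executions (and their failures) reproducible.
std::vector<float> MakeDeterministicInput(size_t n, uint64_t seed) {
  std::mt19937_64 engine(seed);
  std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
  std::vector<float> data(n);
  for (float& v : data) v = dist(engine);
  return data;
}
```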

PiperOrigin-RevId: 823184431
2025-10-23 14:28:48 -07:00
Eugene Zhulenev
7ba3317857 [xla:cpu:ynn] Implement SlinkyThreadPool on top of WorkQueue and Worker APIs
Remove `work_queue` and `worker`, which were originally forked from `xla::cpu::WorkQueue` and `Worker`.

PiperOrigin-RevId: 823179793
2025-10-23 14:14:06 -07:00
Matthias Guenther
96c1b6c0a6 Integrate StableHLO at openxla/stablehlo@baaf7475
PiperOrigin-RevId: 823160034
2025-10-23 13:24:44 -07:00
Eugene Zhulenev
eb36f8770a [xla:ffi] Add type checking isa<T>() APIs to RemainingArgs, RemainingRets and Dictionary
PiperOrigin-RevId: 823157056
2025-10-23 13:10:46 -07:00
Niklas Vangerow
809d5c7895 cpu_compiler_internals_test is compiler-specific. Invoke CpuCompiler directly.
PiperOrigin-RevId: 823138394
2025-10-23 12:19:12 -07:00
A. Unique TensorFlower
d631de7076 Rename ynn_threadpool_impl.cc -> ynn_threadpool.cc
PiperOrigin-RevId: 823106503
2025-10-23 11:05:33 -07:00
Yun Peng
e0592b3e22 Introduce a REQUIREMENTS variable for @python_version_repo in Bzlmod
This is just a short-term solution to allow loading https://github.com/jax-ml/jax/blob/main/build/BUILD.bazel successfully. We'll need to figure out a better solution when working on supporting multiple Python versions.

PiperOrigin-RevId: 823093519
2025-10-23 10:34:35 -07:00
Parker Schuh
aa21448fea Remove PjRtStreamExecutorBuffer in favor of CommonPjRtBufferImpl.
PiperOrigin-RevId: 823081955
2025-10-23 10:08:29 -07:00
Alex
bdb6ba9aa6 PR #32954: [ROCm] Introduce pool name for rbe
Imported from GitHub PR https://github.com/openxla/xla/pull/32954

📝 Summary of Changes
Introduce pool name for rbe builds

🎯 Justification
Need a separate pool name for GPU test execution.

🚀 Kind of Contribution
New Feature

📊 Benchmark (for Performance Improvements)
RBE support for the ROCm config CI job

🧪 Unit Tests:
Not relevant

🧪 Execution Tests:
Not relevant

Copybara import of the project:

--
d675bf9efcc44a8d740c1be7537737af3cd90f0b by Alexandros Theodoridis <alexandros.theodoridis@amd.com>:

Introduce pool name for rbe

--
d5ee82757aa74785bd2a1c68e3639c49d17ba740 by Alexandros Theodoridis <atheodor@amd.com>:

Introduce rocm rbe pools

--
36bfa7b258cb3e58430087faccccb413f9bf8a7c by Alexandros Theodoridis <atheodor@amd.com>:

First check for multigpu tag

--
9efa0a7cdfa76bb0d5102ebbee1f9b6a3dab270c by Alexandros Theodoridis <atheodor@amd.com>:

Address review comments

--
5b854a7f5915d0c106fd2ba9bc6ff774a885f907 by Alexandros Theodoridis <atheodor@amd.com>:

Fix buildifier issue

Merging this change closes #32954

PiperOrigin-RevId: 823077515
2025-10-23 09:52:20 -07:00
A. Unique TensorFlower
d98db379d1 Move "cell_reader_test.cc" to xla/tsl/lib/monitoring.
PiperOrigin-RevId: 823067036
2025-10-23 09:24:45 -07:00
A. Unique TensorFlower
c234838b70 Use MLIRContext's affine uniquer for SymbolicExpr.
This change modifies `SymbolicExprContext` to use the `mlir::StorageUniquer` provided by `mlir::MLIRContext::getAffineUniquer()` instead of maintaining its own. This makes SymbolicExprContext creation very lightweight.

PiperOrigin-RevId: 823052287
2025-10-23 08:41:23 -07:00
Dimitar (Mitko) Asenov
e9356bcee5 [XLA:GPU] Fix an edge-case with choosing autotuning configs based on scratch.
The old code did not update `min_duration_with_optimzed_scratch_bytes` when scratch sizes were equal. This could lead to a subtle situation where the kernel that is optimal in time (and cache-wise) is not picked: if all scratch sizes are the same, the time-optimal config only wins if it happens to appear at the end.

I've updated the associated test to verify this situation. The new test fails before this CL.
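
A hypothetical reconstruction of the corrected selection rule (names and structure are illustrative; the real autotuner tracks more state): with equal scratch sizes, the tie must break on duration so ordering no longer matters.

```cpp
#include <cstdint>
#include <vector>

struct AutotuneResult {
  int64_t duration_ns;
  int64_t scratch_bytes;
};

// Pick the config with the smallest scratch allocation, breaking ties by
// duration. The old code effectively skipped the equal-scratch case, so the
// fastest config could lose to an earlier one with identical scratch size.
int PickBest(const std::vector<AutotuneResult>& results) {
  int best = 0;
  for (int i = 1; i < static_cast<int>(results.size()); ++i) {
    const AutotuneResult& r = results[i];
    const AutotuneResult& b = results[best];
    if (r.scratch_bytes < b.scratch_bytes ||
        (r.scratch_bytes == b.scratch_bytes &&
         r.duration_ns < b.duration_ns)) {
      best = i;
    }
  }
  return best;
}
```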

PiperOrigin-RevId: 823019660
2025-10-23 06:58:09 -07:00
A. Unique TensorFlower
72bc89a5c6 Use YnnThreadpool
This change moves `YnnThreadpool` to the runtime/ynnpack/ subfolder and changes the runtime to use our custom `YnnThreadpool` instead of a thread pool created by `ynn_create_threadpool`.

PiperOrigin-RevId: 822883993
2025-10-22 23:03:18 -07:00
Parker Schuh
2cb47f1b59 Remove usage of PjRtStreamExecutorBuffer from
StreamExecutorGpuClient::MakeCrossHostReceiveBuffers.

PiperOrigin-RevId: 822860960
2025-10-22 21:33:06 -07:00
Mehrdad Khani
3f23e5eb0d [XLA:TPU] Introduce MutateBackendConfig(). This change allows applying in-place modifications to the underlying backend config proto in a thread-safe manner.
PiperOrigin-RevId: 822825180
2025-10-22 19:12:44 -07:00
Parker Schuh
c3e202374c Implement TrackedDeviceBuffer::GetReadyFuture, TrackedDeviceBuffer::Delete, and
TrackedDeviceBuffer::CloneWithControlDependency.

PiperOrigin-RevId: 822804278
2025-10-22 17:51:52 -07:00
Sean Talts
03be5156fe [XLA:CPU] Split IntrinsicFunction into its own lib for later use.
PiperOrigin-RevId: 822800836
2025-10-22 17:38:09 -07:00
Seher Ellis
6bd65b3630 [XLA] Refactor HloSchedule::Verify to allow per-computation verification.
PiperOrigin-RevId: 822796366
2025-10-22 17:24:59 -07:00
Eugene Zhulenev
ac8332f616 [jax:ffi] Do not pass uninitialized type_id value to type registration API
Setting the type_id value to 0 is required for XLA to assign a unique type id; otherwise the type gets assigned whatever random value happens to be on the caller's stack.
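
A self-contained sketch of the failure mode and the fix, with illustrative stand-ins (the real XLA FFI C API uses different names and a function table):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-in for the FFI type registration argument.
struct FfiTypeId {
  int64_t type_id;
};

static int64_t next_assigned_id = 1;

// The runtime assigns a fresh id only when the caller passes 0; any other
// value is treated as a caller-chosen id. That is why an uninitialized
// field, holding stack garbage, silently becomes a bogus "chosen" id.
void RegisterType(FfiTypeId* id) {
  if (id->type_id == 0) id->type_id = next_assigned_id++;
}

void RegisterMyType() {
  FfiTypeId id{0};  // must be zero-initialized: 0 means "assign a unique id"
  RegisterType(&id);
  assert(id.type_id != 0);
}
```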

PiperOrigin-RevId: 822782898
2025-10-22 16:42:00 -07:00
Parker Schuh
cf3e49ba23 Delete unused PjRtStreamExecutorBuffer::ScopedHold related code.
PiperOrigin-RevId: 822779025
2025-10-22 16:24:44 -07:00
Yulia Baturina
bcc803eeb6 Remove usage of mirrored tar files from CI, because the hermetic xz tool unpacks tar.xz faster.
PiperOrigin-RevId: 822773874
2025-10-22 16:08:18 -07:00
Parker Schuh
13ea97f3a9 Update the PjRtStreamExecutorClient main execute path to use CommonPjRtBuffer::ScopedHold. Crucially,
this now always passes reference_held=true. This is fine because the only time
false was ever passed was when we were already on the compute stream, and
this bool is essentially ignored when the stream is the compute stream (see
MaybeWaitForEventOnStream).

PiperOrigin-RevId: 822758577
2025-10-22 15:35:02 -07:00
A. Unique TensorFlower
880f245b56 Allow TSL CellReader to work with lazy metrics.
PiperOrigin-RevId: 822757884
2025-10-22 15:25:07 -07:00
Parker Schuh
420ca15b61 Promote check to connection close.
PiperOrigin-RevId: 822746430
2025-10-22 14:52:22 -07:00
Eugene Zhulenev
2cdd8ff5ce [xla:ffi] Keep FFI handler metadata with handler registration
PiperOrigin-RevId: 822741325
2025-10-22 14:34:37 -07:00
Hyeontaek Lim
70111bb38f Reverts 16064a6c08
PiperOrigin-RevId: 822724128
2025-10-22 14:04:41 -07:00
A. Unique TensorFlower
aeda5dabd4 [XLA] Handle nested while loops in CollectivePipeliner.
This CL modifies the collective pipeliner to generate unique body and condition computations for newly generated while loop instructions.

PiperOrigin-RevId: 822719229
2025-10-22 13:47:32 -07:00
Maxim Ermilov
7b277367dc Remove inheritance of GpuComputeCapability from std::variant
PiperOrigin-RevId: 822701900
2025-10-22 13:33:16 -07:00
Parker Schuh
a6889b6922 Switch to using CommonAsyncHostToDeviceTransferManager.
PiperOrigin-RevId: 822701589
2025-10-22 13:21:45 -07:00
Matthias Guenther
6d1a7019f0 Fix issues in optimization patterns for broadcast_in_dim and pad ops.
- Prioritize replacing `broadcast_in_dim` with `reshape` over merging nested `broadcast_in_dim` ops. The new behavior matches the relevant MHLO optimization behavior, which proved to be preferable.
- Fix an issue where `pad` ops that didn't change the dimensions would be removed even if they shifted elements around within the tensor (e.g. padding by -1 on one side and +1 on the opposite side).

PiperOrigin-RevId: 822701252
2025-10-22 13:11:10 -07:00
mmakevic-amd
a5524d43e6 PR #33008: [ROCm] Add CI specific bazelrc file
Imported from GitHub PR https://github.com/openxla/xla/pull/33008

📝 Summary of Changes
Add CI-specific bazelrc that will import both `rocm.bazelrc` from `/usertools` and `rocm_xla.bazelrc`

🎯 Justification
Temporary workaround until split logic in CI (which relies on `/usertools/rocm.bazelrc`) is removed

Copybara import of the project:

--
bb4cbf0c4fbf2c171110040c5c1470bddced203b by Milica Makevic <Milica.Makevic@amd.com>:

Add CI specific bazelrc

Merging this change closes #33008

PiperOrigin-RevId: 822700005
2025-10-22 12:50:14 -07:00
Zixuan Jiang
4d53eda2fe Refactor spmd partitioner.
PiperOrigin-RevId: 822689391
2025-10-22 12:23:05 -07:00
Maxim Ermilov
1b08f96abf Port to new GpuComputeCapability API. Last part
PiperOrigin-RevId: 822676102
2025-10-22 11:59:58 -07:00
Oleg Shyshkov
3503a61282 [XLA:GPU] Combine metadata AllToAlls in RaggedAllToAllMultiHostDecomposer.
Instead of performing four separate AllToAll operations, the metadata tensors are reshaped, concatenated, and then a single AllToAll is executed. The result is then sliced back into the individual metadata tensors. This reduces the latency cost of initiating separate collective operations.
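
The decomposer rewrites HLO directly; as a rough illustration of the same data movement expressed with the XlaBuilder client API (header path as in recent XLA checkouts; shapes, names, and the single returned slice are simplifications):

```cpp
#include <cstdint>
#include <vector>

#include "xla/hlo/builder/xla_builder.h"

// Illustrative only: fuse four [n]-shaped metadata tensors into one [4*n]
// tensor, run a single AllToAll, and slice one result back out. The real
// pass also reshapes as needed and recovers all four tensors.
xla::XlaOp CombinedMetadataAllToAll(xla::XlaBuilder* b,
                                    std::vector<xla::XlaOp> metadata,
                                    int64_t n, int64_t num_partitions) {
  // One concatenated tensor instead of four separate operands.
  xla::XlaOp combined = xla::ConcatInDim(b, metadata, /*dimension=*/0);
  // A single collective: initiating one AllToAll is cheaper than four.
  xla::XlaOp exchanged = xla::AllToAll(combined, /*split_dimension=*/0,
                                       /*concat_dimension=*/0,
                                       /*split_count=*/num_partitions);
  // Slice the first [n] metadata tensor back out.
  return xla::Slice(exchanged, /*start_indices=*/{0}, /*limit_indices=*/{n},
                    /*strides=*/{1});
}
```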

PiperOrigin-RevId: 822674605
2025-10-22 11:49:53 -07:00
Ken Franko
85c99b1ecb Reverts 2d4dd83773
PiperOrigin-RevId: 822637158
2025-10-22 10:17:06 -07:00
Eugene Zhulenev
4827802e7c [xla:pjrt:ffi] Remove unused type id registration API
PiperOrigin-RevId: 822630041
2025-10-22 10:01:45 -07:00
Will Froom
3353eeeab7 [XLA:CPU] Only add reassoc flag to reductions with a single floating point op.
PiperOrigin-RevId: 822598746
2025-10-22 08:33:14 -07:00
Dimitar (Mitko) Asenov
bbea04967a Reverts c28d80ae66
PiperOrigin-RevId: 822586242
2025-10-22 08:02:26 -07:00
Marcin Radomski
94d00be0e6 [XLA:GPU] Fix incorrect namespace in buffer_debug_log.*
It was moved to stream_executor/gpu, but the code remained in the stream_executor::cuda namespace.

PiperOrigin-RevId: 822584666
2025-10-22 07:51:36 -07:00
Oleg Shyshkov
53499fe9d0 [XLA:GPU] Move offset correction logic into a helper function.
PiperOrigin-RevId: 822572708
2025-10-22 07:29:58 -07:00
Alexander Belyaev
a34be3eb68 [XLA:GPU] Ignore zero-sized constants in layout normalization.
PiperOrigin-RevId: 822571991
2025-10-22 07:16:10 -07:00
A. Unique TensorFlower
39506ad1cd Deduplicate functions onto the one with the largest number of call sites.
Previously the representative was picked arbitrarily.
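
A sketch of the selection rule with hypothetical types (the real pass operates on HLO computations and their call graphs):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct FunctionInfo {
  int id;
  size_t num_call_sites;
};

// Pick the deduplication representative: the function with the most call
// sites, instead of an arbitrary member of the group. Assumes a non-empty
// group of functions already known to be equivalent.
int PickRepresentative(const std::vector<FunctionInfo>& group) {
  auto it = std::max_element(
      group.begin(), group.end(),
      [](const FunctionInfo& a, const FunctionInfo& b) {
        return a.num_call_sites < b.num_call_sites;
      });
  return it->id;
}
```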

PiperOrigin-RevId: 822566069
2025-10-22 06:55:15 -07:00
Thomas Joerg
83b84b3c46 [XLA:GPU] Add tests for transpose ops inserted by DotDecomposer.
Also be more precise about what is considered normal form and what is not.

PiperOrigin-RevId: 822554350
2025-10-22 06:18:34 -07:00
Kostiantyn Liepieshov
b5d09010cd Make adding missing shardings to control flow configurable in StableHLO export.
Introduce `addMissingShardingToControlFlow` option in `StablehloExportPipelineOptions` to control whether `ExportStablehloShardingsPass` adds missing shardings to control flow ops. Disable this option in `mlir_to_hlo.cc` when converting MLIR to HLO.

PiperOrigin-RevId: 822542288
2025-10-22 05:37:59 -07:00
A. Unique TensorFlower
3cc86433e3 Correctly set dnn_version in device_description when parsing from proto.
Removing the setting from the two other places, as it is no longer necessary.

PiperOrigin-RevId: 822533070
2025-10-22 05:02:14 -07:00
Chenhao Jiang
75fa34bbde PR #32231: Support forward conv with dilation and add basic heuristic for differentiating forward/backward
Imported from GitHub PR https://github.com/openxla/xla/pull/32231

📝 Summary of Changes
The changes enable native support for forward convolutions with window dilation in XLA's GPU backend. Previously, all dilated convolutions were treated as non-canonical and required explicit padding materialization. Now, forward convolutions with window dilation (but not base dilation) are preserved and handled natively by cuDNN, avoiding unnecessary padding overhead.

🎯 Justification
Performance Problem: JAX shows 15-23x slower performance than PyTorch for dilated convolutions (33.5ms vs 1.4ms at dilation rate 2). This is because XLA materializes dilated convolutions as padded convolutions instead of using cuDNN's native support.
Solution: Allow forward convolutions with window dilation to bypass padding materialization and use cuDNN's native dilated convolution kernels directly.

🚀 Kind of Contribution
Performance Improvement

📊 Benchmark (for Performance Improvements)
dilation 1:
	prev: 1.08 ms
	now: 1.07 ms
dilation 2:
	prev: 25.79 ms
	now: 0.91 ms
dilation 1024:
	prev: 26.24 ms
	now: 2.34 ms

Copybara import of the project:

--
b5a38df2ed4715b43fc8ca8d652005a35290d47e by Chenhao Jiang <chenhaoj@nvidia.com>:

Support forward conv with dilation and add basic heuristic for differentiating forward/backward

Merging this change closes #32231

PiperOrigin-RevId: 822482265
2025-10-22 02:03:50 -07:00
Jian Cai
95d3b6fe36 [XLA][Numerics][HLO Value Tracking] Handle original values in while loop fusible sinking pass
This reconstructs the original value for while loops with a rewritten input/output shape during the pass.

PiperOrigin-RevId: 822465131
2025-10-22 01:08:37 -07:00
Felix Wang
add51a87c3 [XLA:GPU] Update latency hiding scheduler cost models for B200/H100 FP8 matmul
PiperOrigin-RevId: 822446122
2025-10-22 00:01:00 -07:00
A. Unique TensorFlower
ca2365df32 Make the ApproxTopK op not fail with kMhloFrontendAttributes.
PiperOrigin-RevId: 822427505
2025-10-21 22:51:17 -07:00
Parker Schuh
68ad2b30fa Implement PjRtStreamExecutorRawBuffer::CopyTo in terms of raw buffers.
PiperOrigin-RevId: 822345080
2025-10-21 17:58:31 -07:00
Haibo Huang
bdb268c5c5 Add helper functions to check PjRtPlatformId types.
PiperOrigin-RevId: 822333726
2025-10-21 17:13:03 -07:00
Eugene Zhulenev
90491b0a55 [xla:pjrt:ffi] Prepare for legacy type registration removal
PiperOrigin-RevId: 822309311
2025-10-21 16:13:04 -07:00
Paul Ganssle
512611da80 Internal code migration
PiperOrigin-RevId: 822300362
2025-10-21 15:34:56 -07:00
Haibo Huang
b7d9295b52 Replace ComputationOrigin with the more general PjRtDeviceDimensions
PiperOrigin-RevId: 822288293
2025-10-21 15:11:47 -07:00
Olli Lupton
3cdcb03f18 PR #32838: Fix family-conditional logic
Imported from GitHub PR https://github.com/openxla/xla/pull/32838

📝 Summary of Changes
The fallback logic now correctly identifies the highest known compatible architecture when given an unknown architecture as input.

🎯 Justification
Previously the logic would propose an incompatible architecture in this case.

🚀 Kind of Contribution
🐛 Bug Fix

🧪 Unit Tests:
Added a new test case showing the previously-failing case (it used to propose `sm_110`)
Copybara import of the project:

--
f060bb9837d72159343ff2d52f5f2f42b1b7e9a4 by Olli Lupton <olupton@nvidia.com>:

Fix family-conditional logic

--
fc44dcd1e76da67c0b6fe53c33d2a571c3a6ff50 by Olli Lupton <olupton@nvidia.com>:

Accept CR suggestion

Merging this change closes #32838

PiperOrigin-RevId: 822284790
2025-10-21 14:59:18 -07:00
Eugene Zhulenev
0fc052399b [xla:cpu] Fix data race in ThunkExecutor
Also add tsl::down_pointer_cast to improve usability.

PiperOrigin-RevId: 822257137
2025-10-21 13:46:24 -07:00
Michael Whittaker
5776d2771c Pipe incarnations to jax.live_devices.
PiperOrigin-RevId: 822250955
2025-10-21 13:35:27 -07:00
mmakevic-amd
47cd01d4a5 PR #32960: [ROCm] Refactor testing scripts
Imported from GitHub PR https://github.com/openxla/xla/pull/32960

📝 Summary of Changes
(Partially) upstreaming changes from: https://github.com/ROCm/xla/pull/323, 9d358b9b26, and https://github.com/ROCm/xla/pull/385. It skips some asan/tsan changes for now.

🎯 Justification
These changes are ROCm-specific and help with ROCm internal CI validation pipelines.

🚀 Kind of Contribution
🐛 Bug Fix, ♻️ Cleanup, 🧪 Tests

📊 Benchmark (for Performance Improvements)
/

🧪 Unit Tests:
/

🧪 Execution Tests:
/

Copybara import of the project:

--
804ff1b6a6fbba86a3e0a09d739179a4eb4f197d by Milica Makevic <Milica.Makevic@amd.com>:

Add missing cuda-only tag to cuda test

--
44ce7a2d56c9f0c80405447f431ae1e5a33f42e1 by Milica Makevic <Milica.Makevic@amd.com>:

Refactor test scripts

--
fb783c968e9d2ff5d92357908d99e4952235c2bc by Milica Makevic <Milica.Makevic@amd.com>:

Cover more mgpu tests

--
1f53712274f76202241bd3631dbf065826c0b960 by Milica Makevic <Milica.Makevic@amd.com>:

Switch from rocm_gcc to rocm_ci for sgpu tests

--
00e0c8ee2a763680f5a3665dab62202ab230731d by Milica Makevic <Milica.Makevic@amd.com>:

Changing file permissions

--
003c062a8900c12b73c0972e8d406f2661a27aba by Milica Makevic <Milica.Makevic@amd.com>:

Remove unnecessary import

--
214599355f40f1b65e0540daf0b9829d2c950115 by Harsha HS <Harsha.HavanurShamsundara@amd.com>:

Add license header

Merging this change closes #32960

PiperOrigin-RevId: 822245565
2025-10-21 13:25:33 -07:00
Eugene Zhulenev
7a107e3571 [xla:ffi] Rename FFI_TypeID_Register API
PiperOrigin-RevId: 822240093
2025-10-21 13:12:40 -07:00
Felix Wang
95f3e6f33c [XLA:GPU]: Refactor the unit test of matmul_interpolator_test.cc to prepare for adding the mixed-precision fp8 unit test.
PiperOrigin-RevId: 822239646
2025-10-21 13:02:53 -07:00
Felix Wang
2de2bb8581 Populate the cost for async collectives in both the async-start and the computation root op.
PiperOrigin-RevId: 822223031
2025-10-21 12:22:08 -07:00
Eugene Zhulenev
633c3efcf9 [xla:cpu] Delete unused cpu_function_runtime header
PiperOrigin-RevId: 822215543
2025-10-21 12:15:40 -07:00
Eugene Zhulenev
6141496817 [xla:ffi] Document XLA:FFI binary API guarantees and add a supported API range check
PiperOrigin-RevId: 822214561
2025-10-21 12:02:12 -07:00
Kevin Gleason
fe624fe9ce [StableHLO->HLO] Only lower MHLO constants in MHLO prepare for export pass.
PiperOrigin-RevId: 822198262
2025-10-21 11:34:37 -07:00
A. Unique TensorFlower
7524326efd Create testing utilities for extension plugin testing
PiperOrigin-RevId: 822185331
2025-10-21 11:24:34 -07:00
Maxim Ermilov
f7bc8a8859 Port to new GpuComputeCapability API. Part 2
PiperOrigin-RevId: 822183464
2025-10-21 11:14:17 -07:00
A. Unique TensorFlower
2476ba49e0 Minor code improvement on tsl::monitoring::testing::CellReader.
Save a `GetLatestValueOrDefault` call when the result is not used. Also save two map lookups.

PiperOrigin-RevId: 822182539
2025-10-21 11:02:58 -07:00
Jaroslav Sevcik
4fc74ffdd2 PR #32846: Allow mixed precision operands for collective permute
Imported from GitHub PR https://github.com/openxla/xla/pull/32846

📝 Summary of Changes
Allow mixed precision collective-permute in the verifier.

🎯 Justification
Partially addresses https://github.com/openxla/xla/issues/32845

🚀 Kind of Contribution
🐛 Bug Fix

📊 Benchmark (for Performance Improvements)
N/A

🧪 Unit Tests:
Tests that verifier passes on mixed precision collective-permute.

🧪 Execution Tests:
N/A
Copybara import of the project:

--
666c38a19005a609d4a7aa8e5e9b9842b1c87175 by Jaroslav Sevcik <jsevcik@nvidia.com>:

Allow mixed precision for collective permute

Merging this change closes #32846

PiperOrigin-RevId: 822179840
2025-10-21 10:51:30 -07:00
dependabot[bot]
f717c02d1c PR #32904: Bump github/codeql-action from 3.30.5 to 4.30.9
Imported from GitHub PR https://github.com/openxla/xla/pull/32904

Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.30.5 to 4.30.9.

Release notes (sourced from github/codeql-action's releases; truncated):

v4.30.9 (17 Oct 2025)
- Update default CodeQL bundle version to 2.23.3. (#3205)
- Experimental: a new `setup-codeql` action has been added which is similar to `init`, except it only installs the CodeQL CLI and does not initialize a database. Do not use this in production as it is part of an internal experiment and subject to change at any time. (#3204)

v4.30.8 (10 Oct 2025)
- No user facing changes.

v4.30.7 (06 Oct 2025)
- [v4+ only] The CodeQL Action now runs on Node.js v24. (#3169)

v3.30.9 (17 Oct 2025)
- Same changes as v4.30.9: default bundle 2.23.3 (#3205) and the experimental `setup-codeql` action (#3204).

v3.30.8
- (truncated)

See https://github.com/github/codeql-action/releases and the per-release CHANGELOG.md for details.

Changelog (sourced from github/codeql-action's CHANGELOG.md; the 4.30.7-4.30.9 entries repeat the release notes above and are omitted here; truncated):

3.30.6 (02 Oct 2025)
- Update default CodeQL bundle version to 2.23.2. (#3168)

3.30.5 (26 Sep 2025)
- Fixed a bug introduced in 3.30.4 with `upload-sarif` which resulted in files without a `.sarif` extension not getting uploaded. (#3160)

3.30.4 (25 Sep 2025)
- Improved the CodeQL Action's ability to validate that the workflow it is used in does not use different versions of the CodeQL Action for different workflow steps. Mixing versions is unsupported and can lead to unpredictable results; `codeql-action/init` now warns if mixed versions are detected, and the other steps throw an error if they load a configuration file generated by a different version of `init`. (#3099, #3100)
- Added support for reducing the size of dependency caches for Java analyses, which will reduce cache usage and speed up workflows. This will be enabled automatically at a later time. (#3107)
- You can now run the latest CodeQL nightly bundle by passing `tools: nightly` to the `init` action. The nightly bundle is unstable and only recommended when directed by GitHub staff. (#3130)
- Update default CodeQL bundle version to 2.23.1. (#3118)

3.30.3 (10 Sep 2025)
- No user facing changes.

3.30.2 (09 Sep 2025)
- Fixed a bug which could cause language autodetection to fail. (#3084)
- Experimental: the `quality-queries` input added in 3.29.2 as part of an internal experiment is now deprecated and will be removed in an upcoming version; it is superseded by a new `analysis-kinds` input in the same experiment. Do not use this in production as it is subject to change at any time. (#3064)

3.30.1 (05 Sep 2025)
- Update default CodeQL bundle version to 2.23.0. (#3077)

3.30.0 (01 Sep 2025)
- (truncated)

Commits (truncated):
- 16140ae Merge pull request #3213 from github/update-v4.30.9-70205d3d1
- 30db5fe Update changelog for v4.30.9
- 70205d3 Merge pull request #3211 from github/mbg/init/starting-partial-config
- 697c209 Merge remote-tracking branch 'origin/main' into mbg/init/starting-partial-config
- 1bd53ba Merge pull request #3205 from github/update-bundle/codeql-bundle-v2.23.3
- cac4df0 Rebuild
- 77e5c0d Merge branch 'main' into update-bundle/codeql-bundle-v2.23.3
- 97a4f75 Merge pull request #3204 from github/mbg/setup-codeql
- 2d5512b Merge remote-tracking branch 'origin/main' into mbg/init/starting-partial-config
- fa7bdf0 Call `getAnalysisKinds` a second time, and ignore exceptions thrown during th...
- Additional commits viewable in the compare view: 3599b3baa1...16140ae1a1

<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=github/codeql-action&package-manager=github_actions&previous-version=3.30.5&new-version=4.30.9)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

</details>
Copybara import of the project:

--
c14a0d2198bee3dcd76ee7fa733da41a6d1fcd6b by dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>:

Bump github/codeql-action from 3.30.5 to 4.30.9

Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.30.5 to 4.30.9.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](3599b3baa1...16140ae1a1)

---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-version: 4.30.9
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Merging this change closes #32904

PiperOrigin-RevId: 822178959
2025-10-21 10:40:39 -07:00
Peter Hawkins
16064a6c08 Reverts 67e5eafb24
PiperOrigin-RevId: 822148458
2025-10-21 09:32:19 -07:00
A. Unique TensorFlower
ffc0e052de Adds an option for HloModule's CreateFromProto to not preserve instruction unique ids and instead reassign them in a compacted way. The option is turned off for now but should become the new default moving forward. Deprecates RemapInstructionIds.
PiperOrigin-RevId: 822146834
2025-10-21 09:22:36 -07:00
Pedro Gonnet
797ffee48d [xla:cpu:xnn] Re-enable XNNPACK by default.
Reverts b2f2568bcc

PiperOrigin-RevId: 822127662
2025-10-21 08:33:32 -07:00
Pedro Gonnet
3a5338e48f Update XNNPACK in XLA
PiperOrigin-RevId: 822105743
2025-10-21 07:39:28 -07:00
Shaogang Wang
97c777acc4 PR #32688: [XLA:GPU] Enable command buffer DynamicSliceCopyFusion command unrolling
Imported from GitHub PR https://github.com/openxla/xla/pull/32688

📝 Summary of Changes
This PR enables the command buffer DynamicSliceCopyFusion command to be recorded into an unrolled cuda-graph when it is surrounded by WhileCmd.

🎯 Justification
This feature is required if we want to record WhileCmd fully into an unrolled cuda-graph.

🚀 Kind of Contribution
New Feature

🧪 Unit Tests:
xla/backends/gpu/runtime/command_buffer_cmd_test.cc: CommandBufferCmdTest:DynamicSliceCopyFusionCmd

Copybara import of the project:

--
feb2902fca397360460f6b9788ac0f7482cb547c by Shawn Wang <shawnw@nvidia.com>:

Enable command buffer DynamicSliceCopyFusion command unrolling

Merging this change closes #32688

PiperOrigin-RevId: 822104580
2025-10-21 07:24:42 -07:00
Johannes Reifferscheid
c28d80ae66 Reverts 8be12edcfe
PiperOrigin-RevId: 822095736
2025-10-21 06:59:25 -07:00
Will Froom
2abafe5c4d [XLA][XTile] Make squeeze dims work with xtile extract/insert.
PiperOrigin-RevId: 822087994
2025-10-21 06:33:13 -07:00
Shaogang Wang
8c169d147d PR #32719: [XLA:GPU] Command buffer DynamicSliceFusionCmd supports cuda graph loop unrolling
Imported from GitHub PR https://github.com/openxla/xla/pull/32719

📝 Summary of Changes
This PR enables the command buffer DynamicSliceFusion command to be recorded into an unrolled cuda-graph when it is surrounded by WhileCmd.

🎯 Justification
This feature is required if we want to record WhileCmd fully into an unrolled cuda-graph.

🚀 Kind of Contribution
New Feature

🧪 Unit Tests:
xla/backends/gpu/codegen/dynamic_slice_fusion_test.cc
Copybara import of the project:

--
daa975804cbffcc3a6bc5b37e3494b51a2dbe2ca by Shawn Wang <shawnw@nvidia.com>:

DynamicSliceFusionCmd supports unrolling

Merging this change closes #32719

PiperOrigin-RevId: 822071751
2025-10-21 05:48:07 -07:00
Mikhail Goncharov
2d4dd83773 [XLA:GPU] enable generic triton emitter for all gemms
According to benchmarks we have reached neutrality with the legacy emitter, so we are switching to the new emitter by default.
The legacy emitter will be kept for some time, but it is considered deprecated and should not be used; it will be deleted in the near future.

PiperOrigin-RevId: 822067921
2025-10-21 05:34:23 -07:00
Will Froom
bd257617f7 [XLA:GPU][XTile] Move xtile lowering to compilation pipeline.
PiperOrigin-RevId: 822066890
2025-10-21 05:23:02 -07:00
A. Unique TensorFlower
ffdd9d06e0 Update autotune results for XLA AOT compile test.
The device description in the autotune results now specifies "DNN version: 8.9.4" instead of "DNN version: 0.0.0".

PiperOrigin-RevId: 822043677
2025-10-21 04:16:40 -07:00
Dirk Hornung
1aff85868d [Autotuner] Find dot instructions in nested computations for fusion autotuning.
The fission autotuner previously only searched for dot instructions in the entry computation of an HLO module. This caused it to miss dot operations located in nested computations, such as the body of a while loop, preventing the autotuner from applying configurations to them.
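For illustration, a minimal sketch of searching every computation rather than only the entry, using the standard `HloModule` API; the autotuner plumbing around it is omitted and `FindAllDots` is an invented name, not the actual helper.
```
#include <vector>

#include "xla/hlo/ir/hlo_computation.h"
#include "xla/hlo/ir/hlo_instruction.h"
#include "xla/hlo/ir/hlo_module.h"
#include "xla/hlo/ir/hlo_opcode.h"

std::vector<xla::HloInstruction*> FindAllDots(xla::HloModule& module) {
  std::vector<xla::HloInstruction*> dots;
  // MakeNonfusionComputations() also yields nested computations such as
  // while-loop bodies, which an entry-only search misses.
  for (xla::HloComputation* computation : module.MakeNonfusionComputations()) {
    for (xla::HloInstruction* instr : computation->instructions()) {
      if (instr->opcode() == xla::HloOpcode::kDot) {
        dots.push_back(instr);
      }
    }
  }
  return dots;
}
```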

PiperOrigin-RevId: 822037141
2025-10-21 04:07:19 -07:00
Ilya Tikhonovskiy
0836518bc5 [XLA:GPU] fine-tune the XLA part of the scaled-dot op implementation in order to support the case where we omit one of the scales and pass a bf16 argument instead.
We adjusted the emitter for the case where the scale is missing.
We also relaxed the HLO verifier a bit and tweaked the composite rewriter so that it accepts the dim indexes passed by JAX.

PiperOrigin-RevId: 822036474
2025-10-21 03:54:17 -07:00
Will Froom
373abf8de1 [XLA:CPU][XTile] Add support for strided extract/insert tile.
PiperOrigin-RevId: 822035319
2025-10-21 03:38:59 -07:00
Adrian Kuegel
e756c21611 [XLA:GPU] Remove a source of non-determinism from DotMerger pass.
When removing ops, we need to do that in a deterministic order. The reason is
that removing users works by finding the position of the user in the users
vector, swapping it with the last element of the vector, and then popping the
last element. So if more than one element is removed from a users list, the
order in which the elements are removed matters.
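As a standalone illustration (not the pass's actual code), here is why swap-and-pop removal makes the surviving order depend on removal order:
```
// Swap-and-pop removal: find the element, swap it with the back, pop.
#include <algorithm>
#include <vector>

void SwapRemove(std::vector<int>& users, int user) {
  auto it = std::find(users.begin(), users.end(), user);
  std::swap(*it, users.back());  // The last element moves into the hole...
  users.pop_back();              // ...so the remaining order now depends on
                                 // which element was removed first.
}
// Removing {1, 2} from {1, 2, 3, 4}: removing 1 then 2 leaves {4, 3},
// while removing 2 then 1 leaves {3, 4} -- hence the need for a
// deterministic removal order in the pass.
```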

PiperOrigin-RevId: 822026351
2025-10-21 03:32:31 -07:00
Jaroslav Sevcik
735f4bb631 PR #32905: Allow mixed precision operands for async collective permute
Imported from GitHub PR https://github.com/openxla/xla/pull/32905

📝 Summary of Changes
Allow mixed precision asynchronous collective-permute in the verifier.

🎯 Justification
Fixes https://github.com/openxla/xla/issues/32845

🚀 Kind of Contribution
🐛 Bug Fix

📊 Benchmark (for Performance Improvements)
N/A

🧪 Unit Tests:
Tests that the verifier passes on mixed precision collective-permute-start and collective-permute-done.

🧪 Execution Tests:
Manually tested the JAX repro from https://github.com/openxla/xla/issues/32845
Copybara import of the project:

--
f44faa7ce7ecfbd810983cae170a118bb19a8bb3 by Jaroslav Sevcik <jsevcik@nvidia.com>:

Allow mixed precision operands for async collective permute

Merging this change closes #32905

PiperOrigin-RevId: 822023349
2025-10-21 03:22:25 -07:00
Aleksa Arsic
b89fdab2a2 PR #32773: [ROCm] Fix convolution fp16 performance drop on gfx11xx, gfx12xx
Imported from GitHub PR https://github.com/openxla/xla/pull/32773

📝 Summary of Changes
Remove hardcoded NHWC convolution layout for fp16 precision.

🎯 Justification
Performance drops for fp16 precision on gfx11xx and gfx12xx GPUs were observed internally, as well as by the [community](https://github.com/jax-ml/jax/issues/30548).

🚀 Kind of Contribution
🐛 Bug Fix

📊 Benchmark
A community member provided the script with which the [profiling can be done](https://github.com/jax-ml/jax/issues/30548#issue-3270872993).
Significant performance improvement for fp16 on gfx12xx:
```
Running on: rocm:0

Testing float32...
Avg time: 0.092307 s, Throughput: 1.68 TFLOP/s

Testing float16...
Avg time: 0.011742 s, Throughput: 13.17 TFLOP/s

Testing bfloat16...
Avg time: 0.011989 s, Throughput: 12.90 TFLOP/s
```
Results of the profiling before the fix:
```
Running on: rocm:0

Testing float32...
Avg time: 0.092312 s, Throughput: 1.67 TFLOP/s

Testing float16...
Avg time: 0.775142 s, Throughput: 0.20 TFLOP/s

Testing bfloat16...
Avg time: 0.011990 s, Throughput: 12.90 TFLOP/s
```

@xla-rotation can you please review this PR?

Copybara import of the project:

--
c9fdba79e32c13d9cbf640e61d941d071fabba9d by Aleksa Arsic <Aleksa.Arsic@amd.com>:

Remove hardcoded convolution NCHW layout assignment for fp16 precision.

--
69660d19999a14b24d63b52e6dae310cfbdcbb6b by Aleksa Arsic <Aleksa.Arsic@amd.com>:

Add unit tests for ROCm layout assignment.

Merging this change closes #32773

PiperOrigin-RevId: 822022522
2025-10-21 03:08:12 -07:00
Will Froom
0f4c3f55b5 [XLA][XTile] Use xtile entry, extract & insert in triton emitter.
PiperOrigin-RevId: 822020719
2025-10-21 02:55:47 -07:00
Will Froom
7f64538e67 [XLA:CPU] Make tiled kernel test deterministic.
PiperOrigin-RevId: 822018419
2025-10-21 02:42:18 -07:00
Will Froom
900e2d4d5c [XLA][XTile] Add TiledBuffer interface to insert/extract ops.
PiperOrigin-RevId: 822009372
2025-10-21 02:04:04 -07:00
Dimitris Vardoulakis
81f29b3472 PR #32724: Disable only the test cases that are failing and enable 3 test targets on B200.
Imported from GitHub PR https://github.com/openxla/xla/pull/32724

Copybara import of the project:

--
c3f4ff8ec6af27d24b61e2aa529585697b8aa77a by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:

Disable only the test cases that are failing and enable 3 test targets on B200.

--
1f6e52218ec124bb52d4dba70aa7832311762465 by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:

Disable test case in cudnn_test that fails on Google's B200.
Keep gpu_compiler_test off CI for now due to a memory leak
found by ASAN, but don't revert the changes in the file,
so it can be enabled more easily in the future.

--
42e501a41e43c174538ab186c659a072101b4ab2 by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:

Disable ConvWgradWithNHWCLayoutExecutesCorrectly only on Blackwell.

Merging this change closes #32724

PiperOrigin-RevId: 821992088
2025-10-21 01:05:14 -07:00
Mudit Gokhale
898e238e5d Enable multi-host support for trace viewer.
PiperOrigin-RevId: 821973423
2025-10-21 00:16:33 -07:00
A. Unique TensorFlower
e7e50018a2 Reverts 7144ba7d80
PiperOrigin-RevId: 821972512
2025-10-20 23:54:39 -07:00
A. Unique TensorFlower
fdc9d00af9 Automated Code Change
PiperOrigin-RevId: 821967952
2025-10-20 23:37:29 -07:00
A. Unique TensorFlower
a31ff63e54 Refactor resource tracking logic done in GetResourcesForInstructionImpl.
PiperOrigin-RevId: 821957518
2025-10-20 22:55:19 -07:00
A. Unique TensorFlower
63d73059b0 Automated Code Change
PiperOrigin-RevId: 821947260
2025-10-20 22:21:39 -07:00
A. Unique TensorFlower
e39094330d Automated Code Change
PiperOrigin-RevId: 821946240
2025-10-20 22:11:41 -07:00
Eugene Zhulenev
5caf2a70d8 [xla:ffi] Revert ABI version change
Also relax the FFI version check.

PiperOrigin-RevId: 821905272
2025-10-20 19:43:47 -07:00
Subhankar Shah
14710459b6 [XLA:MSA] Allow MSA to schedule custom-call prefetches.
PiperOrigin-RevId: 821900214
2025-10-20 19:25:21 -07:00
A. Unique TensorFlower
7b8ce05b8b Integrate LLVM at llvm/llvm-project@32de3b9ef9
Updates LLVM usage to match
[32de3b9ef9e7](https://github.com/llvm/llvm-project/commit/32de3b9ef9e7)

PiperOrigin-RevId: 821892561
2025-10-20 18:56:42 -07:00
Benjamin Chetioui
fe31cef146 [XLA:GPU] Add a filter in the Triton fusion emitter to allow using a transposed iteration schedule for fusions rooted in dot.
The intent is to improve L2 cache hits in the case where the left-hand side argument
fully fits in L2.

PiperOrigin-RevId: 821882872
2025-10-20 18:15:08 -07:00
Haibo Huang
0e15b80eb6 Implement LogicalDeviceOfDefaultTypeForId for GPU / CPU
PiperOrigin-RevId: 821873337
2025-10-20 17:53:40 -07:00
Parker Schuh
20d14fbfb7 Rollforward of Update PjRtStreamExecutorRawBuffer::CopyRawHostToDeviceAndReturnEvent to
support staging host buffers (for non-pinned memory). This allows replacing the CopyRawToHost functions.

Reverts baf408c724

PiperOrigin-RevId: 821872812
2025-10-20 17:43:51 -07:00
Haibo Huang
d2e02ce8d9 Adds topology utility functions
PiperOrigin-RevId: 821858216
2025-10-20 17:06:02 -07:00
A. Unique TensorFlower
ce507e7993 Update XNNPACK in XLA
This update includes some workarounds that disable SME for old compilers that don't support it.

PiperOrigin-RevId: 821848400
2025-10-20 16:51:04 -07:00
Alexander Shaposhnikov
72ba588a80 Clean up debug_options usage. NFC.
PiperOrigin-RevId: 821848216
2025-10-20 16:38:52 -07:00
Maxim Ermilov
361f1c64eb Port to new GpuComputeCapability API
PiperOrigin-RevId: 821845460
2025-10-20 16:24:36 -07:00
Will Froom
dd4822d61c [XLA:CPU] Fix flaky test.
PiperOrigin-RevId: 821835738
2025-10-20 16:03:57 -07:00
Hyeontaek Lim
67e5eafb24 [PJRT C] Implement Executable::GetOutputLayouts() in the PJRT Layouts extension
This change implements native support for `xla::Executable::GetOutputLayouts()` in the PJRT C API when the PJRT Layouts extension is available. This support does not fetch the optimized HLO, which makes the method faster and more reliable.

Plugins that implemented Layouts extension v2 are strongly recommended to upgrade to v3 to avoid an incompatibility.

PiperOrigin-RevId: 821834116
2025-10-20 15:57:07 -07:00
Bill Varcho
a40f3bdebd [Upkeep] Resolve 4 instances of the following issue: Todo (resolved)
PiperOrigin-RevId: 821824006
2025-10-20 15:43:00 -07:00
Gunhyun Park
d2759cb154 Fix typo invompatible -> incompatible
PiperOrigin-RevId: 821816332
2025-10-20 15:16:51 -07:00
Sean Talts
8d940b9cc7 [XLA:CPU] Refactor: intrinsic::Type into its own library, pull out naming functions.
PiperOrigin-RevId: 821815511
2025-10-20 15:06:06 -07:00
Hyeontaek Lim
b915d3103a [NanoRt] NanoRt IFRT now returns a nullptr if it knows that the Array layout represents a default layout. User code has previously been migrated to handle this new behavior gracefully, obtaining a concrete default layout as before.
Future note: At some point, `NanoArray` would need to distinguish between a default layout and a concrete layout that is equal to the default layout. If the latter is used, `NanoArray::pjrt_layout()` is expected to return the concrete layout. This is not required by IFRT API semantics yet, but it will be enforced in the future.
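A minimal sketch of the fallback pattern user code follows after this change; `Layout`, `MyArray`, `MyClient`, and `GetDefaultLayout` are hypothetical stand-ins, not the real NanoRt/IFRT types.
```
#include <memory>

// Illustrative stand-ins for the real NanoRt/IFRT types.
struct Layout { /* layout details elided */ };
struct MyClient {
  std::shared_ptr<const Layout> GetDefaultLayout() const {
    return std::make_shared<const Layout>();
  }
};
struct MyArray {
  std::shared_ptr<const Layout> layout;  // nullptr means "default layout".
  std::shared_ptr<const Layout> pjrt_layout() const { return layout; }
};

// nullptr signals the default layout; fall back to a concrete one.
std::shared_ptr<const Layout> LayoutOrDefault(const MyArray& array,
                                              const MyClient& client) {
  if (auto layout = array.pjrt_layout()) return layout;
  return client.GetDefaultLayout();
}
```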

PiperOrigin-RevId: 821808592
2025-10-20 14:51:22 -07:00
Benjamin Chetioui
88e4cd4d01 [XLA] Use TiledHloSchedule in SymbolicTileAnalysis.
We can now produce arbitrary iteration patterns for output tiles, simply by
parametrizing calls to `ComputeTiledHloInstructions` with different
`TiledHloSchedule`s.

PiperOrigin-RevId: 821796530
2025-10-20 14:12:48 -07:00
Sohaib Iftikhar
cbeeef926f [XLA:GPU]: Remove unused method from all reduce test
PiperOrigin-RevId: 821742010
2025-10-20 12:28:17 -07:00
Hyeontaek Lim
cc9fd2b254 [IFRT Proxy] Array::pjrt_layout() uses nullptr to indicate a default layout
IFRT Proxy now returns a `nullptr` if it knows that the Array layout represents a default layout. User code has previously been migrated to handle this new behavior gracefully, obtaining a concrete default layout as before.

Caveat: The IFRT Proxy client infers the layout of the output arrays from `LoadedExecutable::GetOutputLayouts()`, which always returns concrete layouts today. Thus, these output arrays would use concrete layouts for default layouts, even if the arrays on the server side use `nullptr` for default layouts. This behavior is currently acceptable because all users convert the layout into a concrete one before using it; it will eventually change so that the IFRT Proxy client reflects the array layouts on the server side more accurately.
PiperOrigin-RevId: 821741105
2025-10-20 12:19:53 -07:00
Eugene Zhulenev
0e09f486e7 [xla:pjrt:ffi] Remove deprecated TypeID registration function
PiperOrigin-RevId: 821740142
2025-10-20 12:11:20 -07:00
Michael Kuperstein
b824d4e187 [XLA] Remove verify_unique_channel_ids verifier option.
The functionality has been removed previously, but the option was never cleaned up. This does not remove the xla_ignore_channel_id debug option because it also has a non-verifier use.

PiperOrigin-RevId: 821737613
2025-10-20 11:59:45 -07:00
Kanish Anand
16e1567819 Clarify field name
PiperOrigin-RevId: 821732575
2025-10-20 11:50:54 -07:00
Zac Mustin
903ce4c6ff Use GetPjRtCpuClient directly when making a PJRT C API CPU client.
Right now, we use `GetXlaPjrtCpuClient` which in turn calls `GetPjRtCpuClient`, but we will later update `GetXlaPjrtCpuClient` to use the C sandwich, in which case we must call `GetPjRtCpuClient` here in `PJRT_Client_Create`.

This change is a no-op.

PiperOrigin-RevId: 821732030
2025-10-20 11:44:41 -07:00
A. Unique TensorFlower
458995b35d Set DNN version in DeviceDescription for autotuner cache.
The dnn_version in device_description was not set. cl/816579045 fixed this for the old autotuner infra; this change ports that fix to the new autotuner infra.

PiperOrigin-RevId: 821728904
2025-10-20 11:34:10 -07:00
A. Unique TensorFlower
027a15b3dc [Autotuner] Early exit if there is only one supported config.
- We encounter this case very often (for the cuBLAS autotuner), so it makes sense to optimize it.
- Running cuBLAS kernels as part of autotuning has an unintended side effect that changes the optimized HLO; this fix also mitigates that issue while we look into it further (see the sketch below).
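A minimal sketch of the early exit; `Config`, `PickBestConfig`, and `ProfileAndPickBest` are illustrative names rather than the autotuner's real API.
```
#include <utility>
#include <vector>

#include "absl/status/status.h"
#include "absl/status/statusor.h"

struct Config { /* backend-specific tuning knobs */ };

// Stand-in for the expensive compile-and-profile path.
absl::StatusOr<Config> ProfileAndPickBest(std::vector<Config> configs);

absl::StatusOr<Config> PickBestConfig(std::vector<Config> configs) {
  if (configs.empty()) return absl::NotFoundError("no supported configs");
  // Early exit: with a single supported config there is nothing to compare,
  // so skip compiling and running candidates entirely.
  if (configs.size() == 1) return configs[0];
  return ProfileAndPickBest(std::move(configs));
}
```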

PiperOrigin-RevId: 821716593
2025-10-20 11:08:05 -07:00
Eugene Zhulenev
fd948cba88 [xla:cpu] Migrate tf2xla to BufferAllocationInfo
Reverts f2ed04aff6

PiperOrigin-RevId: 821660240
2025-10-20 08:41:06 -07:00
A. Unique TensorFlower
f32acff204 Automated Code Change
PiperOrigin-RevId: 821628227
2025-10-20 07:15:56 -07:00
A. Unique TensorFlower
789f9c8c92 Integrate LLVM at llvm/llvm-project@d5ce81dc81
Updates LLVM usage to match
[d5ce81dc8143](https://github.com/llvm/llvm-project/commit/d5ce81dc8143)

PiperOrigin-RevId: 821627987
2025-10-20 06:57:20 -07:00
Will Froom
4d5ed207cd [XLA:GPU][XTile] Handle scalar load/store in xtile lowering.
PiperOrigin-RevId: 821617202
2025-10-20 06:36:16 -07:00
Alex
8245a623aa PR #32782: [ROCm] Fix hermetic build for rocm
Imported from GitHub PR https://github.com/openxla/xla/pull/32782

📝 Summary of Changes
Fix hermetic build for rocm.

🎯 Justification
Introduce the missing hipblaslt dependency.
Fix invalid library linking and align with the data directories.

🚀 Kind of Contribution
🐛 Bug Fix

📊 Benchmark (for Performance Improvements)
CI, not relevant

🧪 Unit Tests:
Not relevant

🧪 Execution Tests:
Not relevant

Copybara import of the project:

--
f5cb68b0df2265b7048d0068eedd07cccf67e228 by Alexandros Theodoridis <atheodor@amd.com>:

Add missing hermetic lib dependency

--
fe0c9a7fdd36180fea5cf63e20d864355ed98a6c by Alexandros Theodoridis <atheodor@amd.com>:

Add missing hipblaslt deps, fix the targets

--
540d79dd4287a013a3f178ef34a5b96fb8a8a92f by Alexandros Theodoridis <atheodor@amd.com>:

Make hipblaslt mandatory

--
3a6f2282669a1ece4518cc69a01ad76275b603a1 by Alexandros Theodoridis <atheodor@amd.com>:

Fix test

--
eb21b60d34978191315a0c9775d2cb53309dc72d by Alexandros Theodoridis <atheodor@amd.com>:

Ignore asnsigaltstack

--
54c8af2abd7dd682a8494caa05854d574209aa20 by Harsha Havanur Shamsundara <harsha.havanurshamsundara@amd.com>:

[ROCm] Use working sha256 for latest ROCm 7.0 docker image

--
9629a9fc9201a80dba7a0beecb8ee0797960ff6f by Harsha HS <Harsha.HavanurShamsundara@amd.com>:

[ROCm] Add ROCM_PATH repo_env to test scripts

--
1ef6772c6df6aeffcbcc2f27a0ede558fbc6270f by Alexandros Theodoridis <atheodor@amd.com>:

Fix buildifier warning

Merging this change closes #32782

PiperOrigin-RevId: 821614030
2025-10-20 06:26:00 -07:00
Kostiantyn Liepieshov
5866a4f621 fix relayout propagation for MPMD.
In cases where a program argument with AUTO layout is used in more than one Fragment, enforce the DEFAULT layout, as we cannot allow different compiled layouts.

PiperOrigin-RevId: 821612799
2025-10-20 06:11:10 -07:00
A. Unique TensorFlower
5707a02d98 [XLA:GPU] Enable chlo.asinh -> kAsinh HloInstruction lowering.
PiperOrigin-RevId: 821610794
2025-10-20 05:57:12 -07:00
Adrian Kuegel
7144ba7d80 Migrate ListScheduler from TuplePointsToAnalysis to HloAliasAnalysis.
Reverts 22032a9edb

PiperOrigin-RevId: 821605784
2025-10-20 05:44:20 -07:00
Will Froom
beb48d90e2 [XLA][XTile] Add xtile lowering passes for triton.
This enables migrating the triton emitter to emit xtile entry, insert & extract in the child PR.

The main difference is the memref args in the entry function, for which `MemrefToPtr` & `PtrToMemref` were introduced; these closely resemble `UnrealizedConversionCastOp` with additional verification and will enable special folding of `memref::TransposeOp`.

PiperOrigin-RevId: 821593545
2025-10-20 04:57:30 -07:00
Christian Sigg
ea72bd7e48 [XLA:GPU] Allow kPad in nest_gemm_fusion.
PiperOrigin-RevId: 821581567
2025-10-20 04:16:25 -07:00
Nikita Putikhin
c842d810f4 [XLA:GPU] Initialize global_split_limits with the first config's split_k.
PiperOrigin-RevId: 821579201
2025-10-20 04:03:13 -07:00
Parker Schuh
f322c0c82d Refactor the event loop + socket integration so that it is separately testable.
This gives us the two HalfClose events + HandleEvent() and SendRawFrame() as
the API from the socket integration, and subclasses can handle these
accordingly. This also moves the responsibility for destruction into the
handler logic, with the contract that the event is removed from the loop on
the second HalfClose event.

PiperOrigin-RevId: 821445213
2025-10-19 19:02:26 -07:00
A. Unique TensorFlower
50808df60c Add random perturbations to the xla_tpu_msa_sort_order_overrides flag
Given a user seed, this will update the MSA sort order priority of a (small?) number of randomly selected instructions during compilation.

This causes small perturbations on the compiler's prefetching decisions, which allows for 2 main features:

1. finding out if there is a single instruction which was given a "wrong" priority by the compiler so it can be fixed
- to do this, we run some benchmark many times with different seeds until we find a seed that drastically reduces the compiled code's runtime
- once we find that seed, we can use binary search to decrease the "selection range" and zero in on the one specific offending instruction

2. finding a lot of small changes that together reduce the runtime
- we can do this using a "hill-climbing" method
- try many perturbations until you find one slightly better than the baseline.
- try many followup perturbations (perturbing the best perturbation from the previous stage) until you find one slightly better again
- repeat until no more improvements are found (see the sketch after the note below)

NOTE: Right now there's no "good way" of finding which instructions had their priority adjusted (especially important in (1) to find the one offending instruction). The only way to do so is to increase the log-level of the compilation debug print and then look at the logs.
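For feature 2, a sketch of the hill-climbing driver described above; `BenchmarkRuntime` is a hypothetical stand-in for compiling and running the benchmark with the flag seeded accordingly.
```
#include <cstdint>
#include <random>

// Hypothetical: compile + run the benchmark with
// xla_tpu_msa_sort_order_overrides seeded with `seed`; returns the runtime.
double BenchmarkRuntime(uint64_t seed);

uint64_t HillClimb(uint64_t base_seed, int tries_per_round) {
  uint64_t best_seed = base_seed;
  double best_time = BenchmarkRuntime(base_seed);
  bool improved = true;
  while (improved) {  // Repeat until no follow-up perturbation wins.
    improved = false;
    std::mt19937_64 rng(best_seed);  // Perturb the best seed so far.
    for (int i = 0; i < tries_per_round; ++i) {
      uint64_t candidate = rng();
      double t = BenchmarkRuntime(candidate);
      if (t < best_time) {  // Keep anything slightly better than baseline.
        best_time = t;
        best_seed = candidate;
        improved = true;
      }
    }
  }
  return best_seed;
}
```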
PiperOrigin-RevId: 821309046
2025-10-19 06:33:49 -07:00
David Majnemer
8cf42017ec [TSL] Consolidate NUMA code across different platforms.
No functional change is intended.

PiperOrigin-RevId: 821216963
2025-10-18 22:14:29 -07:00
Subhankar Shah
a139a50e56 [XLA:MSA] Allow allocation requests with a continuous default memory requirement to fall through without resulting in a failure requiring un-commit.
PiperOrigin-RevId: 821165250
2025-10-18 18:00:21 -07:00
A. Unique TensorFlower
ac5fb8fb7f Introduce and connect an XLA:TPU shardy option to fully deduplicate functions in Shardy.
This change is a no-op since both the newly introduced XLA:TPU option and the corresponding option on the ExportNamedComputation pass are false by default.

PiperOrigin-RevId: 821039969
2025-10-18 05:48:48 -07:00
Aliia Khasanova
4985a1c2f3 Add proto [de]serialization for HostExecuteDoneThunk
PiperOrigin-RevId: 821029998
2025-10-18 04:55:53 -07:00
A. Unique TensorFlower
17117898f9 Automated Code Change
PiperOrigin-RevId: 820964120
2025-10-17 23:15:56 -07:00
Maxim Ermilov
4a42fca868 First step to introduce GpuComputeCapability custom class instead of std::variant
PiperOrigin-RevId: 820940828
2025-10-17 21:24:11 -07:00
David Majnemer
4d358b2bac [TSL] Remove unused integral_types.h from TSL platform defaults.
This header provided typedefs for integral types within the `tsl` namespace, but these are no longer necessary.

PiperOrigin-RevId: 820935573
2025-10-17 21:01:34 -07:00
A. Unique TensorFlower
4beacf5a04 Make the change specifically target the reshape-transpose chain where
the transposes are not identity permutations. Identity transposes
should already be eliminated separately in HandleTranspose.

PiperOrigin-RevId: 820903953
2025-10-17 18:56:47 -07:00
Alexander Shaposhnikov
ce65a0ad5c [XLA:CPU] Add initial bits for YNNPACK support.
+ Do not build XLA with YNNPACK on Windows.

Co-authored-by: Penporn Koanantakool <penporn@google.com>
PiperOrigin-RevId: 820896434
2025-10-17 18:36:53 -07:00
A. Unique TensorFlower
f0057ee4b7 [XLA] Delete stale comment on ShapeUtil::PermuteDimensions usage.
PiperOrigin-RevId: 820874660
2025-10-17 16:58:56 -07:00
Zixuan Jiang
1a142dab0a Refactor shardy_xla_pass.
Remove unused code.

PiperOrigin-RevId: 820872613
2025-10-17 16:49:40 -07:00
A. Unique TensorFlower
f2ed04aff6 Reverts 0fab8daf15
PiperOrigin-RevId: 820869543
2025-10-17 16:39:59 -07:00
A. Unique TensorFlower
206f1c1891 Update XNNPACK in XLA
PiperOrigin-RevId: 820860720
2025-10-17 16:14:15 -07:00
Haibo Huang
a619e2de08 Expose new methods to PjRtTopologyDescription.
PiperOrigin-RevId: 820837477
2025-10-17 15:04:17 -07:00
A. Unique TensorFlower
119e1f6731 https://github.com/llvm/llvm-project/pull/162120 removed some automatic namespace determinations, so we need to explicitly specify some namespaces now. This is needed
for the LLVM integrate.

PiperOrigin-RevId: 820836649
2025-10-17 14:52:43 -07:00
David Majnemer
bdb78510d0 [TSL] Clean up integral types
Let's migrate to u?int\d+_t types instead of our own bespoke stuff.
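Illustrative before/after of the migration (not the actual diff):
```
#include <cstdint>

// Before: bespoke typedefs in the tsl namespace, e.g.
//   tsl::uint64 step_count;
//   tsl::int32  delta;
// After: the standard fixed-width types from <cstdint>.
uint64_t step_count = 0;
int32_t delta = -1;
```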

PiperOrigin-RevId: 820815523
2025-10-17 14:19:08 -07:00
Eugene Zhulenev
d531cdce30 [xla:ffi] Add TypeRegistry::TypeInfo to be able to register functions to manipulate user-defined types
PiperOrigin-RevId: 820811829
2025-10-17 13:41:40 -07:00
Kevin Gleason
46522b8a20 [StableHLO] Add transpose simplification
PiperOrigin-RevId: 820804015
2025-10-17 13:31:39 -07:00
Niklas Vangerow
13006913d2 Migrate sample_file_test to HloRunnerPjRt.
PiperOrigin-RevId: 820803579
2025-10-17 13:21:59 -07:00
Hyeontaek Lim
05101b9755 [PjRt-IFRT] Temporary workaround for output layout handling
PjRt-IFRT directly or indirectly fetched optimized HLO to get the output
layout mode and output layouts. This seems to introduce a regression in
some jobs that use the PJRT C API and have serialized HLO that is too large (> 2 GiB).

As a workaround, PjRt-IFRT gracefully handles output layout mode and
layout discovery errors, and falls back to concrete layouts that are
directly obtained from output `PjRtBuffer`s, which should give the same
behavior as before the default layout handling change.

Further changes will follow to discover default layout modes and layouts
without going through `PjRtLoadedExecutable::GetHloModules()`.

PiperOrigin-RevId: 820785277
2025-10-17 12:41:35 -07:00
Parker Schuh
b07145966f Add StatusOr to transfer server BulkTransportInterface on the bond id to
forward errors from bond connection failures to the control plane connection.

PiperOrigin-RevId: 820783819
2025-10-17 12:28:16 -07:00
Eugene Zhulenev
0fab8daf15 [xla:cpu] Migrate tf2xla to BufferAllocationInfo
Reverts 94fbd7554e

PiperOrigin-RevId: 820770766
2025-10-17 11:54:08 -07:00
Benjamin Chetioui
81798b5240 [XLA] Throw away TilingSpecification in the TransposedDotTiledHloSchedule.
After relaxing the constraints related to the iteration space in a recent
change, this is no longer necessary.

PiperOrigin-RevId: 820766539
2025-10-17 11:33:01 -07:00
A. Unique TensorFlower
94fbd7554e Reverts fb52ce8275
PiperOrigin-RevId: 820748684
2025-10-17 10:58:15 -07:00
Penporn Koanantakool
8614a97d98 [xla:cpu:ynn] Add build macros for YNNPACK integration.
We won't build XLA with YNNPACK on Windows yet.

PiperOrigin-RevId: 820744698
2025-10-17 10:40:45 -07:00
Kostiantyn Liepieshov
f910c98db0 Use R"hlo(...)hlo" for HLO text in sample_text_test.cc.
This improves readability and allows for better syntax highlighting of the embedded HLO strings.
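For illustration, the pattern in question: a raw string literal with a custom `hlo` delimiter keeps the embedded HLO free of escape sequences (the module body here is invented).
```
// No escaping needed for quotes, braces, or newlines inside the HLO text.
constexpr char kHloText[] = R"hlo(
HloModule add

ENTRY main {
  p0 = f32[4] parameter(0)
  p1 = f32[4] parameter(1)
  ROOT add = f32[4] add(p0, p1)
}
)hlo";
```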

PiperOrigin-RevId: 820710394
2025-10-17 09:12:53 -07:00
Eugene Zhulenev
fb52ce8275 [xla:cpu] Migrate tf2xla to BufferAllocationInfo
PiperOrigin-RevId: 820707093
2025-10-17 08:59:31 -07:00
Eugene Zhulenev
4752801386 [xla:ffi] Make TypeInfo mandatory in XLA_FFI_REGISTER_TYPE
Add placeholders for future Type serialization/deserialization. This is not an ABI-breaking change as it's unused today, and it allows us to avoid an ABI-breaking change in the future when FFI adds proper ser/des support for user-defined types.

PiperOrigin-RevId: 820676169
2025-10-17 07:20:25 -07:00
Aliia Khasanova
30d25d6d18 Add proto [de]serialization for HostExecuteStartThunk
PiperOrigin-RevId: 820645056
2025-10-17 05:32:26 -07:00
Karlo Basioli
0bb1532ddf [XLA] Enable multihost runner to load unoptimized hlo snapshots dumped without custom serialization.
PiperOrigin-RevId: 820643951
2025-10-17 05:26:10 -07:00
A. Unique TensorFlower
51fc1ac0d5 Improve logging and error messages from autotuner.
- The VLOG messages are updated to more accurately describe whether the autotuner is finding a config in cache, using a default, or actively tuning for the best config.
- The error contains the HLO instruction.

PiperOrigin-RevId: 820640768
2025-10-17 05:16:19 -07:00
Eugene Zhulenev
52749919c9 [xla:cpu] Add buffer_allocation_info to xla_cpu_runtime_hdrs
PiperOrigin-RevId: 820639686
2025-10-17 05:03:10 -07:00
Mohammed Anany
097f587e4e [XLA:GPU/WS] Adding test coverage for auto warp specialization via Triton.
PiperOrigin-RevId: 820637611
2025-10-17 04:49:39 -07:00
Nikita Putikhin
cc58fb18fd [XLA:GPU] Enable dots with block_n=8 in triton and autotuner
This change utilizes recently added Triton support for smaller block sizes.

Skipping occupancy optimization for some configs is essentially a workaround for incompatible split_k values. The impact of these configs is limited, however, because they are only present in non-exhaustive mode, so they mostly get filtered out anyway.

PiperOrigin-RevId: 820617352
2025-10-17 03:32:51 -07:00
Will Froom
abc19d2d20 [XLA:CPU] Combine optimization & lowering pass managers by using callback pass.
PiperOrigin-RevId: 820610316
2025-10-17 03:07:44 -07:00
Karlo Basioli
5da47fcdd8 [XLA:GPU][codegen] Emit shlo for broadcast_in_dim and lower to equivalent triton op.
PiperOrigin-RevId: 820598440
2025-10-17 02:33:27 -07:00
Zixuan Jiang
0ab4818f74 Use all-gather in the spmd_partitioner_test.
Before this change, we disallowed all-gather such that the partitioner generates `all-reduce(dynamic-update-slice())` pattern. With this change, we allow all-gather for two reasons.
1. In most cases, all-gather is allowed and preferred.
2. It is easier to read and match the partitioner result.

PiperOrigin-RevId: 820593767
2025-10-17 02:02:58 -07:00
Ilia Sergachev
4cd7465b84 PR #32388: [GPU] Sub-byte collective normalization: support collectives with non-minor-most last dimension.
Imported from GitHub PR https://github.com/openxla/xla/pull/32388

📝 Summary of Changes
Support collectives with non-minor-most last dimension in the sub-byte collective normalization pass.

🎯 Justification
Makes more collectives efficient by not requiring type conversion.

🚀 Kind of Contribution
Performance Improvement.

📊 Benchmark (for Performance Improvements)
```
Before:

## Execution time, file=u4_all_gather_1x8.hlo repeat=1 duration=68384ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=2 duration=67744ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=3 duration=66976ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=4 duration=67040ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=5 duration=66816ns

After:

## Execution time, file=u4_all_gather_1x8.hlo repeat=1 duration=41216ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=2 duration=40960ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=3 duration=40960ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=4 duration=41056ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=5 duration=40960ns
```
Measured on 8xH100 DGX.

🧪 Unit Tests:
yes

🧪 Execution Tests:
yes
Copybara import of the project:

--
a3777523ffffbcc59da285544e3fb5575d098b9c by Ilia Sergachev <isergachev@nvidia.com>:

[GPU] Sub-byte collective normalization: support collectives with non-minor-most last dimension.

Merging this change closes #32388

PiperOrigin-RevId: 820585923
2025-10-17 01:38:24 -07:00
Harsha H S
086937e138 PR #32678: [ROCm] Use working sha256 for latest ROCm 7.0 docker image and fix test scripts
Imported from GitHub PR https://github.com/openxla/xla/pull/32678

📝 Summary of Changes
- Fix sha256 of docker image to ensure CI is not broken due to malformed image
- Fix test scripts by passing ROCM_PATH to bazel sandbox via repo_env

🎯 Justification
Continued CI runs

🚀 Kind of Contribution
 🧪 Tests

Copybara import of the project:

--
3ca8114613d8e002c137f28bb6608639d08a724a by Harsha Havanur Shamsundara <harsha.havanurshamsundara@amd.com>:

[ROCm] Use working sha256 for latest ROCm 7.0 docker image

--
09ddfbdf205a6406cdd67e20671f41455fffe0f9 by Harsha HS <Harsha.HavanurShamsundara@amd.com>:

[ROCm] Add ROCM_PATH repo_env to test scripts

Merging this change closes #32678

PiperOrigin-RevId: 820582560
2025-10-17 01:25:06 -07:00
Shanbin Ke
f573329cc6 PR #32718: [XLA:GPU] add conv fusion support in cudnn fusion compiler
Imported from GitHub PR https://github.com/openxla/xla/pull/32718

📝 Summary of Changes
This PR adds conv fusion support in cudnn fusion compiler.

* add conv type in `CuDnnFusionConfig` to represent different types of conv. We are getting rid of the conv custom call target, so this info has to be preserved in the fusion config.
* add `ConvDimensionAdapter` to generate NCHW **logical layout** for the cudnn frontend, while the physical layout could be NHWC (the most preferable layout) or NCHW (for int conv). Only the NHWC layout is used in the unit tests because layout assignment currently doesn't handle conv fusions to transform other layouts to NHWC; this needs to be addressed in a separate PR.
* add conv translation rule from XLA conv to the cudnn frontend graph API.
* Other parts of the lowering are taken care of automatically by the current cudnn fusion compiler: workspace allocation/graph validation/graph compilation/graph serialization.

🎯 Justification
This is the first step to unify the conv as cudnn fusion in XLA. Conv custom call will be replaced with conv fusions in the future.

🚀 Kind of Contribution
 New Feature

📊 Benchmark (for Performance Improvements)
No Performance changes are expected.

🧪 Unit Tests:
Added 3 hand written NHWC conv unit tests for conv_fprop/conv_dgrad/conv_wgrad.

🧪 Execution Tests:
Added 3 hand written NHWC conv unit tests for conv_fprop/conv_dgrad/conv_wgrad.
Copybara import of the project:

--
57555cd0e3759aacb7a98135c3261f4cc3f642c2 by Cjkkkk <ske@nvidia.com>:

init

--
d6edecfa42a6371a0908e22daeb8deaf32998ece by Cjkkkk <ske@nvidia.com>:

address comments

--
17df6f8451274f070d7d332a126cfefa1ef7df83 by Cjkkkk <ske@nvidia.com>:

removed one comment

--
1b7c63b1ade7751cf8f68c7fb11cd68491440081 by Cjkkkk <ske@nvidia.com>:

add const

Merging this change closes #32718

PiperOrigin-RevId: 820574737
2025-10-17 00:58:07 -07:00
Jacques Pienaar
2096501975 Remove register everything.
This should just be the IR one.

PiperOrigin-RevId: 820548236
2025-10-16 23:22:26 -07:00
A. Unique TensorFlower
1ddcd859d3 Move absl_thread_pool to XLA as YnnThreadpool
PiperOrigin-RevId: 820544939
2025-10-16 23:13:24 -07:00
Christian Sigg
c9d8d37611 [xla:gpu] Relax nested gemm fusion constraints.
This change removes dimension ordering constraints in `AcceptDotOperand`.

PiperOrigin-RevId: 820542964
2025-10-16 23:02:42 -07:00
A. Unique TensorFlower
d46c1b99a9 Automated Code Change
PiperOrigin-RevId: 820542824
2025-10-16 22:51:48 -07:00
Gregory Pataky
c0d9a60f83 Internal changes to project structure
PiperOrigin-RevId: 820527062
2025-10-16 21:52:14 -07:00
Penporn Koanantakool
b2f2568bcc [xla:cpu:xnn] Temporarily disable XNNPACK by default.
PiperOrigin-RevId: 820519075
2025-10-16 21:31:15 -07:00
A. Unique TensorFlower
5592d364ec Automated Code Change
PiperOrigin-RevId: 820505039
2025-10-16 20:36:41 -07:00
A. Unique TensorFlower
a8a747470e Update XNNPACK in XLA
PiperOrigin-RevId: 820502825
2025-10-16 20:24:07 -07:00
Eugene Zhulenev
ef3a678718 [xla:cpu] Fix BufferAllocationInfo::InOutParameter constructor
PiperOrigin-RevId: 820456592
2025-10-16 17:49:08 -07:00
Kevin Gleason
e0f3263a48 [StableHLO Builder] Add API to set frontend attributes
PiperOrigin-RevId: 820455957
2025-10-16 17:34:02 -07:00
Benjamin Chetioui
c19b0d8727 [XLA] Relax restrictions on the iteration_space parameter in Schedule.
We're perfectly able to construct a schedule using only a subset of the
iteration space of a `tile_offsets_indexing`, and in fact need to when we are
processing nested fusions.

PiperOrigin-RevId: 820454010
2025-10-16 17:25:27 -07:00
Haibo Huang
c3ce8a9881 Add PjRtDeviceDimensions struct and proto.
PiperOrigin-RevId: 820440467
2025-10-16 16:41:32 -07:00
Yulia Baturina
180445df5b Add more textual hdrs to mkl_dnn_acl target to enable usage of --config=mkl_aarch64_threadpool with --config=rbe_cross_compile_linux_aarch64 in JAX wheel builds.
This addition fixes the error below:

```
2025-09-03 23:33:24,197 - INFO - external/mkl_dnn_acl_compatible/src/graph/interface/partition_impl.cpp:17:10: fatal error: 'graph/interface/partition_impl.hpp' file not found
2025-09-03 23:33:24,197 - INFO -    17 | #include "graph/interface/partition_impl.hpp"
2025-09-03 23:33:24,197 - INFO -       |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2025-09-03 23:33:24,197 - INFO - 1 error generated.
```

PiperOrigin-RevId: 820423794
2025-10-16 15:58:17 -07:00
A. Unique TensorFlower
a04d8eeb75 Update XNNPACK in XLA
PiperOrigin-RevId: 820419078
2025-10-16 15:43:29 -07:00
Eugene Zhulenev
a2ab65d20c [xla:ffi] Fix XLA_FFI_REGISTER_TYPE_X macro to support optional args
PiperOrigin-RevId: 820414595
2025-10-16 15:30:26 -07:00
Parker Schuh
fd09236fe4 Rollforward of: Implement PjRtStreamExecutorRawBuffer::CopyToLiteralAsync and allow
PjRtStreamExecutorBuffer to just use inherited literal conversion logic.

Reverts 009d8fdbf4

PiperOrigin-RevId: 820409509
2025-10-16 15:17:15 -07:00
A. Unique TensorFlower
1f3321fa67 * Adds flag tf_serialize_mlir_to_compressed_bytecode to serialize to compressed bytecode.
* Deserializing MLIR modules still tries to parse as a string first, as that's the default; on failure it tries to uncompress and parse (a sketch follows below).
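A hedged sketch of that fallback, using MLIR's parsing API; `Uncompress` and `DeserializeModule` are placeholders for the actual codec and entry point.
```
#include <optional>
#include <string>

#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/MLIRContext.h"
#include "mlir/Parser/Parser.h"

// Placeholder for the real decompression routine.
std::optional<std::string> Uncompress(const std::string& payload);

mlir::OwningOpRef<mlir::ModuleOp> DeserializeModule(
    const std::string& payload, mlir::MLIRContext& ctx) {
  // First attempt: parse the bytes directly, as that's the default format.
  if (auto module = mlir::parseSourceString<mlir::ModuleOp>(payload, &ctx)) {
    return module;
  }
  // Fallback: assume compressed bytecode; uncompress, then parse again.
  std::optional<std::string> raw = Uncompress(payload);
  if (!raw) return nullptr;
  return mlir::parseSourceString<mlir::ModuleOp>(*raw, &ctx);
}
```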

PiperOrigin-RevId: 820396326
2025-10-16 14:49:31 -07:00
A. Unique TensorFlower
5863476a05 Integrate LLVM at llvm/llvm-project@bfee9db785
Updates LLVM usage to match
[bfee9db78577](https://github.com/llvm/llvm-project/commit/bfee9db78577)

PiperOrigin-RevId: 820396282
2025-10-16 14:38:22 -07:00
Kanish Anand
a6a2128d5f Add IFTTT for named sharding types
PiperOrigin-RevId: 820392329
2025-10-16 14:25:10 -07:00
Maxim Ermilov
5c18a50655 Use nvml impl lib based wrapper
PiperOrigin-RevId: 820376041
2025-10-16 13:59:12 -07:00
Michael Kuperstein
83a98ac48d [XLA] Remove dead module-group related code from HloPassPipelineTest
PiperOrigin-RevId: 820373683
2025-10-16 13:50:15 -07:00
Will Froom
168eb2c36a [XLA:CPU][XTile] Add lowering for tensor extract and from_elements & fix dot with scalar output.
PiperOrigin-RevId: 820368257
2025-10-16 13:37:25 -07:00
Michael Kuperstein
4e29e9da6c [XLA] Remove unused RecordPassStartMetadata overload.
PiperOrigin-RevId: 820365407
2025-10-16 13:22:35 -07:00
A. Unique TensorFlower
c8b47aac14 In profile_data_lib.cc, throw std::runtime_error instead of check fail.
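The pattern change, in sketch form (illustrative, not the file's actual code):
```
#include <stdexcept>
#include <string>

void ValidateProfile(const std::string& path, bool parsed_ok) {
  // Before: CHECK(parsed_ok) << "bad profile: " << path;  // aborts process
  // After: throw, so a caller (e.g. language bindings) can recover.
  if (!parsed_ok) {
    throw std::runtime_error("failed to parse profile data: " + path);
  }
}
```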
PiperOrigin-RevId: 820351230
2025-10-16 12:47:14 -07:00
A. Unique TensorFlower
83c407040a [XLA:GPU] Don't fail Autotuner::GetSupportedConfigs if one of the backends fails
PiperOrigin-RevId: 820303427
2025-10-16 10:58:41 -07:00