Commit Graph

28187 Commits

Author SHA1 Message Date
Parker Schuh
68ad2b30fa Implement PjRtStreamExecutorRawBuffer::CopyTo in terms of raw buffers.
PiperOrigin-RevId: 822345080
2025-10-21 17:58:31 -07:00
Haibo Huang
bdb268c5c5 Add helper functions to check PjRtPlatformId types.
PiperOrigin-RevId: 822333726
2025-10-21 17:13:03 -07:00
Eugene Zhulenev
90491b0a55 [xla:pjrt:ffi] Prepare for legacy type registration removal
PiperOrigin-RevId: 822309311
2025-10-21 16:13:04 -07:00
Paul Ganssle
512611da80 Internal code migration
PiperOrigin-RevId: 822300362
2025-10-21 15:34:56 -07:00
Haibo Huang
b7d9295b52 Replace ComputationOrigin with the more general PjRtDeviceDimensions
PiperOrigin-RevId: 822288293
2025-10-21 15:11:47 -07:00
Olli Lupton
3cdcb03f18 PR #32838: Fix family-conditional logic
Imported from GitHub PR https://github.com/openxla/xla/pull/32838

📝 Summary of Changes
The fallback logic now correctly identifies the highest known compatible architecture when given an unknown architecture as input.

🎯 Justification
Previously the logic would propose an incompatible architecture in this case.

🚀 Kind of Contribution
🐛 Bug Fix

🧪 Unit Tests:
Added a new test case showing the previously-failing case (it used to propose `sm_110`)
Copybara import of the project:

--
f060bb9837d72159343ff2d52f5f2f42b1b7e9a4 by Olli Lupton <olupton@nvidia.com>:

Fix family-conditional logic

--
fc44dcd1e76da67c0b6fe53c33d2a571c3a6ff50 by Olli Lupton <olupton@nvidia.com>:

Accept CR suggestion

Merging this change closes #32838

PiperOrigin-RevId: 822284790
2025-10-21 14:59:18 -07:00
Eugene Zhulenev
0fc052399b [xla:cpu] Fix data race in ThunkExecutor
Also add tsl::down_pointer_cast to improve usability.

PiperOrigin-RevId: 822257137
2025-10-21 13:46:24 -07:00
Michael Whittaker
5776d2771c Pipe incarnations to jax.live_devices.
PiperOrigin-RevId: 822250955
2025-10-21 13:35:27 -07:00
mmakevic-amd
47cd01d4a5 PR #32960: [ROCm] Refactor testing scripts
Imported from GitHub PR https://github.com/openxla/xla/pull/32960

📝 Summary of Changes
(Partially) upstreaming changes from: https://github.com/ROCm/xla/pull/323, 9d358b9b26, and https://github.com/ROCm/xla/pull/385. It skips some asan/tsan changes for now.

🎯 Justification
These changes are ROCm-specific and help with ROCm's internal CI validation pipelines.

🚀 Kind of Contribution
🐛 Bug Fix, ♻️ Cleanup, 🧪 Tests

Copybara import of the project:

--
804ff1b6a6fbba86a3e0a09d739179a4eb4f197d by Milica Makevic <Milica.Makevic@amd.com>:

Add missing cuda-only tag to cuda test

--
44ce7a2d56c9f0c80405447f431ae1e5a33f42e1 by Milica Makevic <Milica.Makevic@amd.com>:

Refactor test scripts

--
fb783c968e9d2ff5d92357908d99e4952235c2bc by Milica Makevic <Milica.Makevic@amd.com>:

Cover more mgpu tests

--
1f53712274f76202241bd3631dbf065826c0b960 by Milica Makevic <Milica.Makevic@amd.com>:

Switch from rocm_gcc to rocm_ci for sgpu tests

--
00e0c8ee2a763680f5a3665dab62202ab230731d by Milica Makevic <Milica.Makevic@amd.com>:

Changing file permissions

--
003c062a8900c12b73c0972e8d406f2661a27aba by Milica Makevic <Milica.Makevic@amd.com>:

Remove unnecessary import

--
214599355f40f1b65e0540daf0b9829d2c950115 by Harsha HS <Harsha.HavanurShamsundara@amd.com>:

Add license header

Merging this change closes #32960

PiperOrigin-RevId: 822245565
2025-10-21 13:25:33 -07:00
Eugene Zhulenev
7a107e3571 [xla:ffi] Rename FFI_TypeID_Register API
PiperOrigin-RevId: 822240093
2025-10-21 13:12:40 -07:00
Felix Wang
95f3e6f33c [XLA:GPU]: Refactor the unit test of matmul_interpolator_test.cc to prepare for adding the mixed-precision fp8 unit test.
PiperOrigin-RevId: 822239646
2025-10-21 13:02:53 -07:00
Felix Wang
2de2bb8581 Populate the cost for async collective in both async-start and the computation root op.
PiperOrigin-RevId: 822223031
2025-10-21 12:22:08 -07:00
Eugene Zhulenev
633c3efcf9 [xla:cpu] Delete unused cpu_function_runtime header
PiperOrigin-RevId: 822215543
2025-10-21 12:15:40 -07:00
Eugene Zhulenev
6141496817 [xla:ffi] Document XLA:FFI binary API guarantees and add a supported API range check
PiperOrigin-RevId: 822214561
2025-10-21 12:02:12 -07:00
Kevin Gleason
fe624fe9ce [StableHLO->HLO] Only lower MHLO constants in MHLO prepare for export pass.
PiperOrigin-RevId: 822198262
2025-10-21 11:34:37 -07:00
A. Unique TensorFlower
7524326efd Create testing utilities for extension plugin testing
PiperOrigin-RevId: 822185331
2025-10-21 11:24:34 -07:00
Maxim Ermilov
f7bc8a8859 Port to new GpuComputeCapability API. Part 2
PiperOrigin-RevId: 822183464
2025-10-21 11:14:17 -07:00
A. Unique TensorFlower
2476ba49e0 Minor code improvement on tsl::monitoring::testing::CellReader.
Save a `GetLatestValueOrDefault` call when the result is not used. Also save two map lookups.

PiperOrigin-RevId: 822182539
2025-10-21 11:02:58 -07:00
Jaroslav Sevcik
4fc74ffdd2 PR #32846: Allow mixed precision operands for collective permute
Imported from GitHub PR https://github.com/openxla/xla/pull/32846

📝 Summary of Changes
Allow mixed precision collective-permute in the verifier.

🎯 Justification
Partially addresses https://github.com/openxla/xla/issues/32845

🚀 Kind of Contribution
🐛 Bug Fix

📊 Benchmark (for Performance Improvements)
N/A

🧪 Unit Tests:
Tests that the verifier passes on mixed precision collective-permute.

🧪 Execution Tests:
N/A
Copybara import of the project:

--
666c38a19005a609d4a7aa8e5e9b9842b1c87175 by Jaroslav Sevcik <jsevcik@nvidia.com>:

Allow mixed precision for collective permute

Merging this change closes #32846

PiperOrigin-RevId: 822179840
2025-10-21 10:51:30 -07:00
dependabot[bot]
f717c02d1c PR #32904: Bump github/codeql-action from 3.30.5 to 4.30.9
Imported from GitHub PR https://github.com/openxla/xla/pull/32904

Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.30.5 to 4.30.9.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/github/codeql-action/releases">github/codeql-action's releases</a>.</em></p>
<blockquote>
<h2>v4.30.9</h2>
<h1>CodeQL Action Changelog</h1>
<p>See the <a href="https://github.com/github/codeql-action/releases">releases page</a> for the relevant changes to the CodeQL CLI and language packs.</p>
<h2>4.30.9 - 17 Oct 2025</h2>
<ul>
<li>Update default CodeQL bundle version to 2.23.3. <a href="https://redirect.github.com/github/codeql-action/pull/3205">#3205</a></li>
<li>Experimental: A new <code>setup-codeql</code> action has been added which is similar to <code>init</code>, except it only installs the CodeQL CLI and does not initialize a database. Do not use this in production as it is part of an internal experiment and subject to change at any time. <a href="https://redirect.github.com/github/codeql-action/pull/3204">#3204</a></li>
</ul>
<p>See the full <a href="https://github.com/github/codeql-action/blob/v4.30.9/CHANGELOG.md">CHANGELOG.md</a> for more information.</p>
<h2>v4.30.8</h2>
<h1>CodeQL Action Changelog</h1>
<p>See the <a href="https://github.com/github/codeql-action/releases">releases page</a> for the relevant changes to the CodeQL CLI and language packs.</p>
<h2>4.30.8 - 10 Oct 2025</h2>
<p>No user facing changes.</p>
<p>See the full <a href="https://github.com/github/codeql-action/blob/v4.30.8/CHANGELOG.md">CHANGELOG.md</a> for more information.</p>
<h2>v4.30.7</h2>
<h1>CodeQL Action Changelog</h1>
<p>See the <a href="https://github.com/github/codeql-action/releases">releases page</a> for the relevant changes to the CodeQL CLI and language packs.</p>
<h2>4.30.7 - 06 Oct 2025</h2>
<ul>
<li>[v4+ only] The CodeQL Action now runs on Node.js v24. <a href="https://redirect.github.com/github/codeql-action/pull/3169">#3169</a></li>
</ul>
<p>See the full <a href="https://github.com/github/codeql-action/blob/v4.30.7/CHANGELOG.md">CHANGELOG.md</a> for more information.</p>
<h2>v3.30.9</h2>
<h1>CodeQL Action Changelog</h1>
<p>See the <a href="https://github.com/github/codeql-action/releases">releases page</a> for the relevant changes to the CodeQL CLI and language packs.</p>
<h2>3.30.9 - 17 Oct 2025</h2>
<ul>
<li>Update default CodeQL bundle version to 2.23.3. <a href="https://redirect.github.com/github/codeql-action/pull/3205">#3205</a></li>
<li>Experimental: A new <code>setup-codeql</code> action has been added which is similar to <code>init</code>, except it only installs the CodeQL CLI and does not initialize a database. Do not use this in production as it is part of an internal experiment and subject to change at any time. <a href="https://redirect.github.com/github/codeql-action/pull/3204">#3204</a></li>
</ul>
<p>See the full <a href="https://github.com/github/codeql-action/blob/v3.30.9/CHANGELOG.md">CHANGELOG.md</a> for more information.</p>
<h2>v3.30.8</h2>
<h1>CodeQL Action Changelog</h1>
<p>See the <a href="https://github.com/github/codeql-action/releases">releases page</a> for the relevant changes to the CodeQL CLI and language packs.</p>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/github/codeql-action/blob/main/CHANGELOG.md">github/codeql-action's changelog</a>.</em></p>
<blockquote>
<h1>CodeQL Action Changelog</h1>
<p>See the <a href="https://github.com/github/codeql-action/releases">releases page</a> for the relevant changes to the CodeQL CLI and language packs.</p>
<h2>[UNRELEASED]</h2>
<p>No user facing changes.</p>
<h2>4.30.9 - 17 Oct 2025</h2>
<ul>
<li>Update default CodeQL bundle version to 2.23.3. <a href="https://redirect.github.com/github/codeql-action/pull/3205">#3205</a></li>
<li>Experimental: A new <code>setup-codeql</code> action has been added which is similar to <code>init</code>, except it only installs the CodeQL CLI and does not initialize a database. Do not use this in production as it is part of an internal experiment and subject to change at any time. <a href="https://redirect.github.com/github/codeql-action/pull/3204">#3204</a></li>
</ul>
<h2>4.30.8 - 10 Oct 2025</h2>
<p>No user facing changes.</p>
<h2>4.30.7 - 06 Oct 2025</h2>
<ul>
<li>[v4+ only] The CodeQL Action now runs on Node.js v24. <a href="https://redirect.github.com/github/codeql-action/pull/3169">#3169</a></li>
</ul>
<h2>3.30.6 - 02 Oct 2025</h2>
<ul>
<li>Update default CodeQL bundle version to 2.23.2. <a href="https://redirect.github.com/github/codeql-action/pull/3168">#3168</a></li>
</ul>
<h2>3.30.5 - 26 Sep 2025</h2>
<ul>
<li>We fixed a bug that was introduced in <code>3.30.4</code> with <code>upload-sarif</code> which resulted in files without a <code>.sarif</code> extension not getting uploaded. <a href="https://redirect.github.com/github/codeql-action/pull/3160">#3160</a></li>
</ul>
<h2>3.30.4 - 25 Sep 2025</h2>
<ul>
<li>We have improved the CodeQL Action's ability to validate that the workflow it is used in does not use different versions of the CodeQL Action for different workflow steps. Mixing different versions of the CodeQL Action in the same workflow is unsupported and can lead to unpredictable results. A warning will now be emitted from the <code>codeql-action/init</code> step if different versions of the CodeQL Action are detected in the workflow file. Additionally, an error will now be thrown by the other CodeQL Action steps if they load a configuration file that was generated by a different version of the <code>codeql-action/init</code> step. <a href="https://redirect.github.com/github/codeql-action/pull/3099">#3099</a> and <a href="https://redirect.github.com/github/codeql-action/pull/3100">#3100</a></li>
<li>We added support for reducing the size of dependency caches for Java analyses, which will reduce cache usage and speed up workflows. This will be enabled automatically at a later time. <a href="https://redirect.github.com/github/codeql-action/pull/3107">#3107</a></li>
<li>You can now run the latest CodeQL nightly bundle by passing <code>tools: nightly</code> to the <code>init</code> action. In general, the nightly bundle is unstable and we only recommend running it when directed by GitHub staff. <a href="https://redirect.github.com/github/codeql-action/pull/3130">#3130</a></li>
<li>Update default CodeQL bundle version to 2.23.1. <a href="https://redirect.github.com/github/codeql-action/pull/3118">#3118</a></li>
</ul>
<h2>3.30.3 - 10 Sep 2025</h2>
<p>No user facing changes.</p>
<h2>3.30.2 - 09 Sep 2025</h2>
<ul>
<li>Fixed a bug which could cause language autodetection to fail. <a href="https://redirect.github.com/github/codeql-action/pull/3084">#3084</a></li>
<li>Experimental: The <code>quality-queries</code> input that was added in <code>3.29.2</code> as part of an internal experiment is now deprecated and will be removed in an upcoming version of the CodeQL Action. It has been superseded by a new <code>analysis-kinds</code> input, which is part of the same internal experiment. Do not use this in production as it is subject to change at any time. <a href="https://redirect.github.com/github/codeql-action/pull/3064">#3064</a></li>
</ul>
<h2>3.30.1 - 05 Sep 2025</h2>
<ul>
<li>Update default CodeQL bundle version to 2.23.0. <a href="https://redirect.github.com/github/codeql-action/pull/3077">#3077</a></li>
</ul>
<h2>3.30.0 - 01 Sep 2025</h2>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="16140ae1a1"><code>16140ae</code></a> Merge pull request <a href="https://redirect.github.com/github/codeql-action/issues/3213">#3213</a> from github/update-v4.30.9-70205d3d1</li>
<li><a href="30db5fee08"><code>30db5fe</code></a> Update changelog for v4.30.9</li>
<li><a href="70205d3d12"><code>70205d3</code></a> Merge pull request <a href="https://redirect.github.com/github/codeql-action/issues/3211">#3211</a> from github/mbg/init/starting-partial-config</li>
<li><a href="697c209bfc"><code>697c209</code></a> Merge remote-tracking branch 'origin/main' into mbg/init/starting-partial-config</li>
<li><a href="1bd53ba38c"><code>1bd53ba</code></a> Merge pull request <a href="https://redirect.github.com/github/codeql-action/issues/3205">#3205</a> from github/update-bundle/codeql-bundle-v2.23.3</li>
<li><a href="cac4df0c79"><code>cac4df0</code></a> Rebuild</li>
<li><a href="77e5c0d0a2"><code>77e5c0d</code></a> Merge branch 'main' into update-bundle/codeql-bundle-v2.23.3</li>
<li><a href="97a4f751be"><code>97a4f75</code></a> Merge pull request <a href="https://redirect.github.com/github/codeql-action/issues/3204">#3204</a> from github/mbg/setup-codeql</li>
<li><a href="2d5512b361"><code>2d5512b</code></a> Merge remote-tracking branch 'origin/main' into mbg/init/starting-partial-config</li>
<li><a href="fa7bdf0559"><code>fa7bdf0</code></a> Call <code>getAnalysisKinds</code> a second time, and ignore exceptions thrown during th...</li>
<li>Additional commits viewable in <a href="3599b3baa1...16140ae1a1">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=github/codeql-action&package-manager=github_actions&previous-version=3.30.5&new-version=4.30.9)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

</details>
Copybara import of the project:

--
c14a0d2198bee3dcd76ee7fa733da41a6d1fcd6b by dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>:

Bump github/codeql-action from 3.30.5 to 4.30.9

Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.30.5 to 4.30.9.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](3599b3baa1...16140ae1a1)

---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-version: 4.30.9
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Merging this change closes #32904

PiperOrigin-RevId: 822178959
2025-10-21 10:40:39 -07:00
Peter Hawkins
16064a6c08 Reverts 67e5eafb24
PiperOrigin-RevId: 822148458
2025-10-21 09:32:19 -07:00
A. Unique TensorFlower
ffc0e052de Adds an option for HloModule's CreateFromProto to not preserve instruction unique ids and instead reassign them in a compacted way. The option is turned off for now but should become the new default moving forward. Deprecates RemapInstructionIds.
PiperOrigin-RevId: 822146834
2025-10-21 09:22:36 -07:00
Pedro Gonnet
797ffee48d [xla:cpu:xnn] Re-enable XNNPACK by default.
Reverts b2f2568bcc

PiperOrigin-RevId: 822127662
2025-10-21 08:33:32 -07:00
Pedro Gonnet
3a5338e48f Update XNNPACK in XLA
PiperOrigin-RevId: 822105743
2025-10-21 07:39:28 -07:00
Shaogang Wang
97c777acc4 PR #32688: [XLA:GPU] Enable command buffer DynamicSliceCopyFusion command unrolling
Imported from GitHub PR https://github.com/openxla/xla/pull/32688

📝 Summary of Changes
This PR enables the command buffer DynamicSliceCopy command to be recorded into an unrolled cuda-graph when it is surrounded by a WhileCmd.

🎯 Justification
This feature is required if we want to capture WhileCmd fully into an unrolled cuda-graph.

🚀 Kind of Contribution
New Feature

🧪 Unit Tests:
xla/backends/gpu/runtime/command_buffer_cmd_test.cc: CommandBufferCmdTest:DynamicSliceCopyFusionCmd

Copybara import of the project:

--
feb2902fca397360460f6b9788ac0f7482cb547c by Shawn Wang <shawnw@nvidia.com>:

Enable command buffer DynamicSliceCopyFusion command unrolling

Merging this change closes #32688

PiperOrigin-RevId: 822104580
2025-10-21 07:24:42 -07:00
Johannes Reifferscheid
c28d80ae66 Reverts 8be12edcfe
PiperOrigin-RevId: 822095736
2025-10-21 06:59:25 -07:00
Will Froom
2abafe5c4d [XLA][XTile] Make squeeze dims work with xtile extract/insert.
PiperOrigin-RevId: 822087994
2025-10-21 06:33:13 -07:00
Shaogang Wang
8c169d147d PR #32719: [XLA:GPU] Command buffer DynamicSliceFusionCmd supports cuda graph loop unrolling
Imported from GitHub PR https://github.com/openxla/xla/pull/32719

📝 Summary of Changes
This PR enables the command buffer DynamicSliceFusion command to be recorded into an unrolled cuda-graph when it is surrounded by a WhileCmd.

🎯 Justification
This feature is required if we want to capture WhileCmd fully into an unrolled cuda-graph.

🚀 Kind of Contribution
New Feature

🧪 Unit Tests:
xla/backends/gpu/codegen/dynamic_slice_fusion_test.cc
Copybara import of the project:

--
daa975804cbffcc3a6bc5b37e3494b51a2dbe2ca by Shawn Wang <shawnw@nvidia.com>:

DynamicSliceFusionCmd supports unrolling

Merging this change closes #32719

PiperOrigin-RevId: 822071751
2025-10-21 05:48:07 -07:00
Mikhail Goncharov
2d4dd83773 [XLA:GPU] enable generic triton emitter for all gemms
According to benchmarks, we have reached neutrality with the legacy emitter. Switching to the new emitter by default.
The legacy emitter will be kept for some time but is considered deprecated and should not be used. It will be deleted in the near future.

PiperOrigin-RevId: 822067921
2025-10-21 05:34:23 -07:00
Will Froom
bd257617f7 [XLA:GPU][XTile] Move xtile lowering to compilation pipeline.
PiperOrigin-RevId: 822066890
2025-10-21 05:23:02 -07:00
A. Unique TensorFlower
ffdd9d06e0 Update autotune results for XLA AOT compile test.
The device description in the autotune results now specifies "DNN version: 8.9.4" instead of "DNN version: 0.0.0".

PiperOrigin-RevId: 822043677
2025-10-21 04:16:40 -07:00
Dirk Hornung
1aff85868d [Autotuner] Find dot instructions in nested computations for fusion autotuning.
The fusion autotuner previously only searched for dot instructions in the entry computation of an HLO module. This caused it to miss dot operations located in nested computations, such as the body of a while loop, preventing the autotuner from applying configurations to them.

PiperOrigin-RevId: 822037141
2025-10-21 04:07:19 -07:00
Ilya Tikhonovskiy
0836518bc5 [XLA:GPU] fine-tune the XLA part of the scaled-dot op implementation to support the case where one of the scales is omitted and a bf16 argument is passed instead.
We adjusted the emitter for the case when the scale is missing.
We also relaxed the HLO verifier a bit and tweaked the composite rewriter so that it accepts the dim indexes passed by JAX.

PiperOrigin-RevId: 822036474
2025-10-21 03:54:17 -07:00
Will Froom
373abf8de1 [XLA:CPU][XTile] Add support for strided extract/insert tile.
PiperOrigin-RevId: 822035319
2025-10-21 03:38:59 -07:00
Adrian Kuegel
e756c21611 [XLA:GPU] Remove a source of non-determinism from DotMerger pass.
When removing ops, we need to do that in a deterministic order. The reason is
that removing users works by finding the position of the user in the users
vector, then swapping with the last element of the vector, then popping the
last element of the vector. So if more than one element is removed from a users
list, it matters in which order the elements are removed.

PiperOrigin-RevId: 822026351
2025-10-21 03:32:31 -07:00
Jaroslav Sevcik
735f4bb631 PR #32905: Allow mixed precision operands for async collective permute
Imported from GitHub PR https://github.com/openxla/xla/pull/32905

📝 Summary of Changes
Allow mixed precision asynchronous collective-permute in the verifier.

🎯 Justification
Fixes https://github.com/openxla/xla/issues/32845

🚀 Kind of Contribution
🐛 Bug Fix

📊 Benchmark (for Performance Improvements)
N/A

🧪 Unit Tests:
Tests that the verifier passes on mixed precision collective-permute-start and collective-permute-done.

🧪 Execution Tests:
Manually tested the JAX repro from https://github.com/openxla/xla/issues/32845
Copybara import of the project:

--
f44faa7ce7ecfbd810983cae170a118bb19a8bb3 by Jaroslav Sevcik <jsevcik@nvidia.com>:

Allow mixed precision operands for async collective permute

Merging this change closes #32905

PiperOrigin-RevId: 822023349
2025-10-21 03:22:25 -07:00
Aleksa Arsic
b89fdab2a2 PR #32773: [ROCm] Fix convolution fp16 performance drop on gfx11xx, gfx12xx
Imported from GitHub PR https://github.com/openxla/xla/pull/32773

📝 Summary of Changes
Remove hardcoded NHWC convolution layout for fp16 precision.

🎯 Justification
Performance drops for fp16 precision on gfx11xx and gfx12xx GPUs were observed internally, as well as by the [community](https://github.com/jax-ml/jax/issues/30548).

🚀 Kind of Contribution
🐛 Bug Fix

📊 Benchmark
A community member provided the script with which the [profiling can be done](https://github.com/jax-ml/jax/issues/30548#issue-3270872993).
Significant performance improvement for fp16 on gfx12xx after the fix:
```
Running on: rocm:0

Testing float32...
Avg time: 0.092307 s, Throughput: 1.68 TFLOP/s

Testing float16...
Avg time: 0.011742 s, Throughput: 13.17 TFLOP/s

Testing bfloat16...
Avg time: 0.011989 s, Throughput: 12.90 TFLOP/s
```
Results of the profiling before the fix:
```
Running on: rocm:0

Testing float32...
Avg time: 0.092312 s, Throughput: 1.67 TFLOP/s

Testing float16...
Avg time: 0.775142 s, Throughput: 0.20 TFLOP/s

Testing bfloat16...
Avg time: 0.011990 s, Throughput: 12.90 TFLOP/s
```

@xla-rotation can you please review this PR?

Copybara import of the project:

--
c9fdba79e32c13d9cbf640e61d941d071fabba9d by Aleksa Arsic <Aleksa.Arsic@amd.com>:

Remove hardcoded convolution NCHW layout assignment for fp16 precision.

--
69660d19999a14b24d63b52e6dae310cfbdcbb6b by Aleksa Arsic <Aleksa.Arsic@amd.com>:

Add unit tests for ROCm layout assignment.

Merging this change closes #32773

PiperOrigin-RevId: 822022522
2025-10-21 03:08:12 -07:00
Will Froom
0f4c3f55b5 [XLA][XTile] Use xtile entry, extract & insert in triton emitter.
PiperOrigin-RevId: 822020719
2025-10-21 02:55:47 -07:00
Will Froom
7f64538e67 [XLA:CPU] Make tiled kernel test deterministic.
PiperOrigin-RevId: 822018419
2025-10-21 02:42:18 -07:00
Will Froom
900e2d4d5c [XLA][XTile] Add TiledBuffer interface to insert/extract ops.
PiperOrigin-RevId: 822009372
2025-10-21 02:04:04 -07:00
Dimitris Vardoulakis
81f29b3472 PR #32724: Disable only the test cases that are failing and enable 3 test targets on B200.
Imported from GitHub PR https://github.com/openxla/xla/pull/32724

Copybara import of the project:

--
c3f4ff8ec6af27d24b61e2aa529585697b8aa77a by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:

Disable only the test cases that are failing and enable 3 test targets on B200.

--
1f6e52218ec124bb52d4dba70aa7832311762465 by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:

Disable test case in cudnn_test that fails on Google's B200.
Keep gpu_compiler_test off CI for now due to memory leak
found by ASAN, but don't revert the changes in the file,
so it can be enabled more easily in the future.

--
42e501a41e43c174538ab186c659a072101b4ab2 by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:

Disable ConvWgradWithNHWCLayoutExecutesCorrectly only on Blackwell.

Merging this change closes #32724

PiperOrigin-RevId: 821992088
2025-10-21 01:05:14 -07:00
Mudit Gokhale
898e238e5d Enable multi-host support for trace viewer.
PiperOrigin-RevId: 821973423
2025-10-21 00:16:33 -07:00
A. Unique TensorFlower
e7e50018a2 Reverts 7144ba7d80
PiperOrigin-RevId: 821972512
2025-10-20 23:54:39 -07:00
A. Unique TensorFlower
fdc9d00af9 Automated Code Change
PiperOrigin-RevId: 821967952
2025-10-20 23:37:29 -07:00
A. Unique TensorFlower
a31ff63e54 Refactor resource tracking logic done in GetResourcesForInstructionImpl.
PiperOrigin-RevId: 821957518
2025-10-20 22:55:19 -07:00
A. Unique TensorFlower
63d73059b0 Automated Code Change
PiperOrigin-RevId: 821947260
2025-10-20 22:21:39 -07:00
A. Unique TensorFlower
e39094330d Automated Code Change
PiperOrigin-RevId: 821946240
2025-10-20 22:11:41 -07:00
Eugene Zhulenev
5caf2a70d8 [xla:ffi] Revert ABI version change
Also relax the FFI version check.

PiperOrigin-RevId: 821905272
2025-10-20 19:43:47 -07:00
Subhankar Shah
14710459b6 [XLA:MSA] Allow MSA to schedule custom-call prefetches.
PiperOrigin-RevId: 821900214
2025-10-20 19:25:21 -07:00
A. Unique TensorFlower
7b8ce05b8b Integrate LLVM at llvm/llvm-project@32de3b9ef9
Updates LLVM usage to match
[32de3b9ef9e7](https://github.com/llvm/llvm-project/commit/32de3b9ef9e7)

PiperOrigin-RevId: 821892561
2025-10-20 18:56:42 -07:00
Benjamin Chetioui
fe31cef146 [XLA:GPU] Add a filter in the Triton fusion emitter to allow using a transposed iteration schedule for fusions rooted in dot.
The intent is to improve L2 cache hits in the case where the left-hand side argument
fully fits in L2.

PiperOrigin-RevId: 821882872
2025-10-20 18:15:08 -07:00
Haibo Huang
0e15b80eb6 Implement LogicalDeviceOfDefaultTypeForId for GPU / CPU
PiperOrigin-RevId: 821873337
2025-10-20 17:53:40 -07:00
Parker Schuh
20d14fbfb7 Rollforward of Update PjRtStreamExecutorRawBuffer::CopyRawHostToDeviceAndReturnEvent to
support staging host buffers (for non-pinned memory). This allows replacing the CopyRawToHost functions.

Reverts baf408c724

PiperOrigin-RevId: 821872812
2025-10-20 17:43:51 -07:00
Haibo Huang
d2e02ce8d9 Adds topology utility functions
PiperOrigin-RevId: 821858216
2025-10-20 17:06:02 -07:00
A. Unique TensorFlower
ce507e7993 Update XNNPACK in XLA
This update includes some workarounds that disable SME for old compilers that don't support it.

PiperOrigin-RevId: 821848400
2025-10-20 16:51:04 -07:00
Alexander Shaposhnikov
72ba588a80 Clean up debug_options usage. NFC.
PiperOrigin-RevId: 821848216
2025-10-20 16:38:52 -07:00
Maxim Ermilov
361f1c64eb Port to new GpuComputeCapability API
PiperOrigin-RevId: 821845460
2025-10-20 16:24:36 -07:00
Will Froom
dd4822d61c [XLA:CPU] Fix flaky test.
PiperOrigin-RevId: 821835738
2025-10-20 16:03:57 -07:00
Hyeontaek Lim
67e5eafb24 [PJRT C] Implement Executable::GetOutputLayouts() in the PJRT Layouts extension
This change implements native support for `xla::Executable::GetOutputLayouts()` in the PJRT C API when the PJRT Layouts extension is available. This support does not fetch the optimized HLO, which makes the method more reliable and faster.

Plugins that implemented Layouts extension v2 are strongly recommended to upgrade to v3 to avoid an incompatibility.

PiperOrigin-RevId: 821834116
2025-10-20 15:57:07 -07:00
Bill Varcho
a40f3bdebd [Upkeep] Resolve 4 instances of the following issue: Todo (resolved)
PiperOrigin-RevId: 821824006
2025-10-20 15:43:00 -07:00
Gunhyun Park
d2759cb154 Fix typo invompatible -> incompatible
PiperOrigin-RevId: 821816332
2025-10-20 15:16:51 -07:00
Sean Talts
8d940b9cc7 [XLA:CPU] Refactor: intrinsic::Type into its own library, pull out naming functions.
PiperOrigin-RevId: 821815511
2025-10-20 15:06:06 -07:00
Hyeontaek Lim
b915d3103a [NanoRt] NanoRt IFRT now returns a nullptr if it knows that the Array layout represents a default layout. User code has previously been migrated to handle this new behavior gracefully, obtaining a concrete default layout as before.
Future note: At some point, `NanoArray` will need to distinguish between a default layout and a concrete layout that is equal to the default layout. If the latter is used, `NanoArray::pjrt_layout()` is expected to return the concrete layout. This is not required by IFRT API semantics yet, but it will be enforced in the future.

PiperOrigin-RevId: 821808592
2025-10-20 14:51:22 -07:00
Benjamin Chetioui
88e4cd4d01 [XLA] Use TiledHloSchedule in SymbolicTileAnalysis.
We can now produce arbitrary iteration patterns for output tiles, simply by
parametrizing calls to `ComputeTiledHloInstructions` with different
`TiledHloSchedule`s.

PiperOrigin-RevId: 821796530
2025-10-20 14:12:48 -07:00
Sohaib Iftikhar
cbeeef926f [XLA:GPU]: Remove unused method from all reduce test
PiperOrigin-RevId: 821742010
2025-10-20 12:28:17 -07:00
Hyeontaek Lim
cc9fd2b254 [IFRT Proxy] Array::pjrt_layout() uses nullptr to indicate a default layout
IFRT Proxy now returns a `nullptr` if it knows that the Array layout represents a default layout. User code has previously been migrated to handle this new behavior gracefully, obtaining a concrete default layout as before.

Caveat: the IFRT Proxy client infers the layout of the output arrays from `LoadedExecutable::GetOutputLayouts()`, which always returns concrete layouts today. Thus, these output arrays would use concrete layouts for default layouts, even if the arrays on the server side use `nullptr` for default layouts. This behavior is currently acceptable because all users convert the layout into a concrete one before using it; eventually the IFRT Proxy client will reflect the server-side array layouts more accurately.
PiperOrigin-RevId: 821741105
2025-10-20 12:19:53 -07:00
Eugene Zhulenev
0e09f486e7 [xla:pjrt:ffi] Remove deprecated TypeID registration function
PiperOrigin-RevId: 821740142
2025-10-20 12:11:20 -07:00
Michael Kuperstein
b824d4e187 [XLA] Remove verify_unique_channel_ids verifier option.
The functionality has been removed previously, but the option was never cleaned up. This does not remove the xla_ignore_channel_id debug option because it also has a non-verifier use.

PiperOrigin-RevId: 821737613
2025-10-20 11:59:45 -07:00
Kanish Anand
16e1567819 Clarify field name
PiperOrigin-RevId: 821732575
2025-10-20 11:50:54 -07:00
Zac Mustin
903ce4c6ff Use GetPjRtCpuClient directly when making a PJRT C API CPU client.
Right now, we use `GetXlaPjrtCpuClient` which in turn calls `GetPjRtCpuClient`, but we will later update `GetXlaPjrtCpuClient` to use the C sandwich, in which case we must call `GetPjRtCpuClient` here in `PJRT_Client_Create`.

This change is a no-op.

PiperOrigin-RevId: 821732030
2025-10-20 11:44:41 -07:00
A. Unique TensorFlower
458995b35d Set DNN version in DeviceDescription for autotuner cache.
The dnn_version in device_description was not set. cl/816579045 fixed this for the old autotuner infra; this change ports that fix to the new autotuner infra.

PiperOrigin-RevId: 821728904
2025-10-20 11:34:10 -07:00
A. Unique TensorFlower
027a15b3dc [Autotuner] Early exit if there is only one supported config.
- We encounter this case very often (for the cuBLAS autotuner), so it makes sense to optimize it.
- Running cuBLAS kernels as part of autotuning has an unintended side effect that changes the optimized HLO; this fix also mitigates that issue while we look into it further.

PiperOrigin-RevId: 821716593
2025-10-20 11:08:05 -07:00
Eugene Zhulenev
fd948cba88 [xla:cpu] Migrate tf2xla to BufferAllocationInfo
Reverts f2ed04aff6

PiperOrigin-RevId: 821660240
2025-10-20 08:41:06 -07:00
A. Unique TensorFlower
f32acff204 Automated Code Change
PiperOrigin-RevId: 821628227
2025-10-20 07:15:56 -07:00
A. Unique TensorFlower
789f9c8c92 Integrate LLVM at llvm/llvm-project@d5ce81dc81
Updates LLVM usage to match
[d5ce81dc8143](https://github.com/llvm/llvm-project/commit/d5ce81dc8143)

PiperOrigin-RevId: 821627987
2025-10-20 06:57:20 -07:00
Will Froom
4d5ed207cd [XLA:GPU][XTile] Handle scalar load/store in xtile lowering.
PiperOrigin-RevId: 821617202
2025-10-20 06:36:16 -07:00
Alex
8245a623aa PR #32782: [ROCm] Fix hermetic build for rocm
Imported from GitHub PR https://github.com/openxla/xla/pull/32782

📝 Summary of Changes
Fix hermetic build for rocm.

🎯 Justification
Introduce missing hipblaslt dependency.
Fix invalid libs linking and align with the data directories.

🚀 Kind of Contribution
🐛 Bug Fix

📊 Benchmark (for Performance Improvements)
CI, not relevant

🧪 Unit Tests:
Not relevant

🧪 Execution Tests:
Not relevant

Copybara import of the project:

--
f5cb68b0df2265b7048d0068eedd07cccf67e228 by Alexandros Theodoridis <atheodor@amd.com>:

Add missing hermetic lib dependency

--
fe0c9a7fdd36180fea5cf63e20d864355ed98a6c by Alexandros Theodoridis <atheodor@amd.com>:

Add missing hipblaslt deps, fix the targets

--
540d79dd4287a013a3f178ef34a5b96fb8a8a92f by Alexandros Theodoridis <atheodor@amd.com>:

Make hipblaslt mandatory

--
3a6f2282669a1ece4518cc69a01ad76275b603a1 by Alexandros Theodoridis <atheodor@amd.com>:

Fix test

--
eb21b60d34978191315a0c9775d2cb53309dc72d by Alexandros Theodoridis <atheodor@amd.com>:

Ignore asnsigaltstack

--
54c8af2abd7dd682a8494caa05854d574209aa20 by Harsha Havanur Shamsundara <harsha.havanurshamsundara@amd.com>:

[ROCm] Use working sha256 for latest ROCm 7.0 docker image

--
9629a9fc9201a80dba7a0beecb8ee0797960ff6f by Harsha HS <Harsha.HavanurShamsundara@amd.com>:

[ROCm] Add ROCM_PATH repo_env to test scripts

--
1ef6772c6df6aeffcbcc2f27a0ede558fbc6270f by Alexandros Theodoridis <atheodor@amd.com>:

Fix buildifier warning

Merging this change closes #32782

PiperOrigin-RevId: 821614030
2025-10-20 06:26:00 -07:00
Kostiantyn Liepieshov
5866a4f621 fix relayout propagation for MPMD.
In cases where a program argument with AUTO layout is used in more than one Fragment, enforce the DEFAULT layout, as we cannot allow different compiled layouts.

PiperOrigin-RevId: 821612799
2025-10-20 06:11:10 -07:00
A. Unique TensorFlower
5707a02d98 [XLA:GPU] Enable chlo.asinh -> kAsinh HloInstruction lowering.
PiperOrigin-RevId: 821610794
2025-10-20 05:57:12 -07:00
Adrian Kuegel
7144ba7d80 Migrate ListScheduler from TuplePointsToAnalysis to HloAliasAnalysis.
Reverts 22032a9edb

PiperOrigin-RevId: 821605784
2025-10-20 05:44:20 -07:00
Will Froom
beb48d90e2 [XLA][XTile] Add xtile lowering passes for triton.
This enables migrating the Triton emitter to emit xtile entry, insert & extract in the child PR.

The main difference is the memref args in the entry function, for which `MemrefToPtr` & `PtrToMemref` were introduced. These closely resemble `UnrealizedConversionCastOp` with additional verification, and will enable special folding of `memref::TransposeOp`.

PiperOrigin-RevId: 821593545
2025-10-20 04:57:30 -07:00
Christian Sigg
ea72bd7e48 [XLA:GPU] Allow kPad in nest_gemm_fusion.
PiperOrigin-RevId: 821581567
2025-10-20 04:16:25 -07:00
Nikita Putikhin
c842d810f4 [XLA:GPU] Initialize global_split_limits with the first config's split_k.
PiperOrigin-RevId: 821579201
2025-10-20 04:03:13 -07:00
Parker Schuh
f322c0c82d Refactor the event loop + socket integration so that it is separately testable.
This gives us the two HalfClose events plus HandleEvent() and SendRawFrame() as
the API from the socket integration, so subclasses can handle these
accordingly. This also moves the responsibility for destruction into the handler
logic, with the contract that the event is removed from the loop on the second HalfClose event.

PiperOrigin-RevId: 821445213
2025-10-19 19:02:26 -07:00
A. Unique TensorFlower
50808df60c Add random perturbations to the xla_tpu_msa_sort_order_overrides flag
Given a user seed, this will update the MSA sort order priority of a (small?) number of randomly selected instructions during compilation.

This causes small perturbations on the compiler's prefetching decisions, which allows for 2 main features:

1. finding out if there is a single instruction which was given a "wrong" priority by the compiler so it can be fixed
- to do this, we run some benchmark many times with different seeds until we find a seed that drastically reduces the compiled code's runtime
- once we found that seed, we can use binary search to decrease the "selection range" and zero-in on the one specific offending instruction

2. finding a lot of small changes that together reduce the runtime
- we can do this using a "hill-climbing" method
- try many perturbations until you find one slightly better than the baseline.
- try many followup perturbations (perturbing the best perturbation from the previous stage) until you find one slightly better again
- repeat until no more improvements are found

NOTE: Right now there's no "good way" of finding which instructions had their priority adjusted (especially important in (1) to find the one offending instruction). The only way to do so is to increase the log level of the compilation debug print and then look at the logs.
PiperOrigin-RevId: 821309046
2025-10-19 06:33:49 -07:00
David Majnemer
8cf42017ec [TSL] Consolidate NUMA code across different platforms.
No functional change is intended.

PiperOrigin-RevId: 821216963
2025-10-18 22:14:29 -07:00
Subhankar Shah
a139a50e56 [XLA:MSA] Allow allocation requests with a continuous default memory requirement to fall through without resulting in a failure requiring un-commit.
PiperOrigin-RevId: 821165250
2025-10-18 18:00:21 -07:00
A. Unique TensorFlower
ac5fb8fb7f Introduce and connect an XLA:TPU shardy option to fully deduplicate functions in Shardy.
This change is a no-op since both the newly introduced XLA:TPU option and the corresponding option on the ExportNamedComputation pass are false by default.

PiperOrigin-RevId: 821039969
2025-10-18 05:48:48 -07:00
Aliia Khasanova
4985a1c2f3 Add proto [de]serialization for HostExecuteDoneThunk
PiperOrigin-RevId: 821029998
2025-10-18 04:55:53 -07:00
A. Unique TensorFlower
17117898f9 Automated Code Change
PiperOrigin-RevId: 820964120
2025-10-17 23:15:56 -07:00
Maxim Ermilov
4a42fca868 First step to introduce GpuComputeCapability custom class instead of std::variant
PiperOrigin-RevId: 820940828
2025-10-17 21:24:11 -07:00
David Majnemer
4d358b2bac [TSL] Remove unused integral_types.h from TSL platform defaults.
This header provided typedefs for integral types within the `tsl` namespace, but these are no longer necessary.

PiperOrigin-RevId: 820935573
2025-10-17 21:01:34 -07:00
A. Unique TensorFlower
4beacf5a04 Make the chain specifically target the reshape-transpose chain where
the transposes are not identity permutations. Identity transposes
should be eliminated separately in HandleTranspose already.

PiperOrigin-RevId: 820903953
2025-10-17 18:56:47 -07:00
Alexander Shaposhnikov
ce65a0ad5c [XLA:CPU] Add initial bits for YNNPACK support.
+ Do not build XLA with YNNPACK on Windows.

Co-authored-by: Penporn Koanantakool <penporn@google.com>
PiperOrigin-RevId: 820896434
2025-10-17 18:36:53 -07:00
A. Unique TensorFlower
f0057ee4b7 [XLA] Delete stale comment on ShapeUtil::PermuteDimensions usage.
PiperOrigin-RevId: 820874660
2025-10-17 16:58:56 -07:00
Zixuan Jiang
1a142dab0a Refactor shardy_xla_pass.
Remove unused code.

PiperOrigin-RevId: 820872613
2025-10-17 16:49:40 -07:00
A. Unique TensorFlower
f2ed04aff6 Reverts 0fab8daf15
PiperOrigin-RevId: 820869543
2025-10-17 16:39:59 -07:00
A. Unique TensorFlower
206f1c1891 Update XNNPACK in XLA
PiperOrigin-RevId: 820860720
2025-10-17 16:14:15 -07:00
Haibo Huang
a619e2de08 Expose new methods to PjRtTopologyDescription.
PiperOrigin-RevId: 820837477
2025-10-17 15:04:17 -07:00
A. Unique TensorFlower
119e1f6731 https://github.com/llvm/llvm-project/pull/162120 removed some automatic namespace determinations, so we need to explicitly specify some namespaces now. This is needed
for the LLVM integrate.

PiperOrigin-RevId: 820836649
2025-10-17 14:52:43 -07:00
David Majnemer
bdb78510d0 [TSL] Clean up integral types
Let's migrate to u?int\d+_t types instead of our own bespoke stuff.

PiperOrigin-RevId: 820815523
2025-10-17 14:19:08 -07:00
Eugene Zhulenev
d531cdce30 [xla:ffi] Add TypeRegistry::TypeInfo to be able to register functions to manipulate user-defined types
PiperOrigin-RevId: 820811829
2025-10-17 13:41:40 -07:00
Kevin Gleason
46522b8a20 [StableHLO] Add transpose simplification
PiperOrigin-RevId: 820804015
2025-10-17 13:31:39 -07:00
Niklas Vangerow
13006913d2 Migrate sample_file_test to HloRunnerPjRt.
PiperOrigin-RevId: 820803579
2025-10-17 13:21:59 -07:00
Hyeontaek Lim
05101b9755 [PjRt-IFRT] Temporary workaround for output layout handling
PjRt-IFRT directly or indirectly fetched optimized HLO to get the output
layout mode and output layouts. This seems to introduce a regression in
some jobs that use the PJRT C API and have a serialized HLO that is too large (> 2 GiB).

As a workaround, PjRt-IFRT gracefully handles output layout mode and
layout discovery errors and falls back to concrete layouts that are
directly obtained from output `PjRtBuffer`s, which should give the same
behavior before and after the default layout handling change.

Further changes will follow to discover default layout modes and layouts
without going through `PjRtLoadedExecutable::GetHloModules()`.

PiperOrigin-RevId: 820785277
2025-10-17 12:41:35 -07:00
Parker Schuh
b07145966f Add StatusOr to transfer server BulkTransportInterface on the bond id to
forward errors from bond connection failures to the control plane connection.

PiperOrigin-RevId: 820783819
2025-10-17 12:28:16 -07:00
Eugene Zhulenev
0fab8daf15 [xla:cpu] Migrate tf2xla to BufferAllocationInfo
Reverts 94fbd7554e

PiperOrigin-RevId: 820770766
2025-10-17 11:54:08 -07:00
Benjamin Chetioui
81798b5240 [XLA] Throw away TilingSpecification in the TransposedDotTiledHloSchedule.
After relaxing the constraints related to the iteration space in a recent
change, this is no longer necessary.

PiperOrigin-RevId: 820766539
2025-10-17 11:33:01 -07:00
A. Unique TensorFlower
94fbd7554e Reverts fb52ce8275
PiperOrigin-RevId: 820748684
2025-10-17 10:58:15 -07:00
Penporn Koanantakool
8614a97d98 [xla:cpu:ynn] Add build macros for YNNPACK integration.
We won't build XLA with YNNPACK on Windows yet.

PiperOrigin-RevId: 820744698
2025-10-17 10:40:45 -07:00
Kostiantyn Liepieshov
f910c98db0 Use R"hlo(...)hlo" for HLO text in sample_text_test.cc.
This improves readability and allows for better syntax highlighting of the embedded HLO strings.

PiperOrigin-RevId: 820710394
2025-10-17 09:12:53 -07:00
Eugene Zhulenev
fb52ce8275 [xla:cpu] Migrate tf2xla to BufferAllocationInfo
PiperOrigin-RevId: 820707093
2025-10-17 08:59:31 -07:00
Eugene Zhulenev
4752801386 [xla:ffi] Make TypeInfo mandatory in XLA_FFI_REGISTER_TYPE
Add placeholders for future type serialization/deserialization. This is not an ABI-breaking change as the field is unused today, and it avoids an ABI-breaking change in the future when FFI adds proper ser/des support for user-defined types.

PiperOrigin-RevId: 820676169
2025-10-17 07:20:25 -07:00
Aliia Khasanova
30d25d6d18 Add proto [de]serialization for HostExecuteStartThunk
PiperOrigin-RevId: 820645056
2025-10-17 05:32:26 -07:00
Karlo Basioli
0bb1532ddf [XLA] Enable multihost runner to load unoptimized hlo snapshots dumped without custom serialization.
PiperOrigin-RevId: 820643951
2025-10-17 05:26:10 -07:00
A. Unique TensorFlower
51fc1ac0d5 Improve logging and error messages from autotuner.
- The VLOG messages are updated to more accurately describe whether the autotuner is finding a config in cache, using a default, or actively tuning for the best config.
- The error contains the HLO instruction.

PiperOrigin-RevId: 820640768
2025-10-17 05:16:19 -07:00
Eugene Zhulenev
52749919c9 [xla:cpu] Add buffer_allocation_info to xla_cpu_runtime_hdrs
PiperOrigin-RevId: 820639686
2025-10-17 05:03:10 -07:00
Mohammed Anany
097f587e4e [XLA:GPU/WS] Adding test coverage for auto warp specialization via Triton.
PiperOrigin-RevId: 820637611
2025-10-17 04:49:39 -07:00
Nikita Putikhin
cc58fb18fd [XLA:GPU] Enable dots with block_n=8 in triton and autotuner
This change utilizes recently added Triton support for smaller block sizes.

Skipping occupancy optimization for some configs is essentially a workaround for incompatible split_k values. The impact of these configs is limited, however, because they are only present in non-exhaustive mode, so they mostly get filtered out anyway.

PiperOrigin-RevId: 820617352
2025-10-17 03:32:51 -07:00
Will Froom
abc19d2d20 [XLA:CPU] Combine optimization & lowering pass managers by using callback pass.
PiperOrigin-RevId: 820610316
2025-10-17 03:07:44 -07:00
Karlo Basioli
5da47fcdd8 [XLA:GPU][codegen] Emit shlo for broadcast_in_dim and lower to equivalent triton op.
PiperOrigin-RevId: 820598440
2025-10-17 02:33:27 -07:00
Zixuan Jiang
0ab4818f74 Use all-gather in the spmd_partitioner_test.
Before this change, we disallowed all-gather such that the partitioner generates `all-reduce(dynamic-update-slice())` pattern. With this change, we allow all-gather for two reasons.
1. In most cases, all-gather is allowed and preferred.
2. It is easier to read and match the partitioner result.

PiperOrigin-RevId: 820593767
2025-10-17 02:02:58 -07:00
Ilia Sergachev
4cd7465b84 PR #32388: [GPU] Sub-byte collective normalization: support collectives with non-minor-most last dimension.
Imported from GitHub PR https://github.com/openxla/xla/pull/32388

📝 Summary of Changes
Support collectives with non-minor-most last dimension in the sub-byte collective normalization pass.

🎯 Justification
Makes more collectives efficient by not requiring type conversion.

🚀 Kind of Contribution
Performance Improvement.

📊 Benchmark (for Performance Improvements)
```
Before:

## Execution time, file=u4_all_gather_1x8.hlo repeat=1 duration=68384ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=2 duration=67744ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=3 duration=66976ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=4 duration=67040ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=5 duration=66816ns

After:

## Execution time, file=u4_all_gather_1x8.hlo repeat=1 duration=41216ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=2 duration=40960ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=3 duration=40960ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=4 duration=41056ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=5 duration=40960ns
```
Measured on 8xH100 DGX.

🧪 Unit Tests:
yes

🧪 Execution Tests:
yes
Copybara import of the project:

--
a3777523ffffbcc59da285544e3fb5575d098b9c by Ilia Sergachev <isergachev@nvidia.com>:

[GPU] Sub-byte collective normalization: support collectives with non-minor-most last dimension.

Merging this change closes #32388

PiperOrigin-RevId: 820585923
2025-10-17 01:38:24 -07:00
Harsha H S
086937e138 PR #32678: [ROCm] Use working sha256 for latest ROCm 7.0 docker image and fix test scripts
Imported from GitHub PR https://github.com/openxla/xla/pull/32678

📝 Summary of Changes
- Fix sha256 of docker image to ensure CI is not broken due to malformed image
- Fix test scripts by passing ROCM_PATH to bazel sandbox via repo_env

🎯 Justification
Continued CI runs

🚀 Kind of Contribution
 🧪 Tests

Copybara import of the project:

--
3ca8114613d8e002c137f28bb6608639d08a724a by Harsha Havanur Shamsundara <harsha.havanurshamsundara@amd.com>:

[ROCm] Use working sha256 for latest ROCm 7.0 docker image

--
09ddfbdf205a6406cdd67e20671f41455fffe0f9 by Harsha HS <Harsha.HavanurShamsundara@amd.com>:

[ROCm] Add ROCM_PATH repo_env to test scripts

Merging this change closes #32678

PiperOrigin-RevId: 820582560
2025-10-17 01:25:06 -07:00
Shanbin Ke
f573329cc6 PR #32718: [XLA:GPU] add conv fusion support in cudnn fusion compiler
Imported from GitHub PR https://github.com/openxla/xla/pull/32718

📝 Summary of Changes
This PR adds conv fusion support in cudnn fusion compiler.

* add a conv type in `CuDnnFusionConfig` to represent different types of conv. We are getting rid of the conv custom call target, so this info has to be preserved in the fusion config.
* add `ConvDimensionAdapter` to generate an NCHW **logical layout** for the cudnn frontend, while the physical layout could be NHWC (the most preferable layout) or NCHW (for int conv). Only the NHWC layout is used in the unit tests because layout assignment currently doesn't handle conv fusion to transform other layouts to NHWC; this needs to be addressed in a separate PR.
* add a conv translation rule from XLA conv to the cudnn frontend graph API.
* Other parts of the lowering are handled automatically by the current cudnn fusion compiler: workspace allocation, graph validation, graph compilation, and graph serialization.

🎯 Justification
This is the first step to unify the conv as cudnn fusion in XLA. Conv custom call will be replaced with conv fusions in the future.

🚀 Kind of Contribution
 New Feature

📊 Benchmark (for Performance Improvements)
No Performance changes are expected.

🧪 Unit Tests:
Added 3 hand written NHWC conv unit tests for conv_fprop/conv_dgrad/conv_wgrad.

🧪 Execution Tests:
Added 3 hand written NHWC conv unit tests for conv_fprop/conv_dgrad/conv_wgrad.
Copybara import of the project:

--
57555cd0e3759aacb7a98135c3261f4cc3f642c2 by Cjkkkk <ske@nvidia.com>:

init

--
d6edecfa42a6371a0908e22daeb8deaf32998ece by Cjkkkk <ske@nvidia.com>:

address comments

--
17df6f8451274f070d7d332a126cfefa1ef7df83 by Cjkkkk <ske@nvidia.com>:

removed one comment

--
1b7c63b1ade7751cf8f68c7fb11cd68491440081 by Cjkkkk <ske@nvidia.com>:

add const

Merging this change closes #32718

PiperOrigin-RevId: 820574737
2025-10-17 00:58:07 -07:00
Jacques Pienaar
2096501975 Remove register everything.
This should just be the IR one.

PiperOrigin-RevId: 820548236
2025-10-16 23:22:26 -07:00
A. Unique TensorFlower
1ddcd859d3 Move absl_thread_pool to XLA as YnnThreadpool
PiperOrigin-RevId: 820544939
2025-10-16 23:13:24 -07:00
Christian Sigg
c9d8d37611 [xla:gpu] Relax nested gemm fusion constraints.
This change removes dimension ordering constraints in `AcceptDotOperand`.

PiperOrigin-RevId: 820542964
2025-10-16 23:02:42 -07:00
A. Unique TensorFlower
d46c1b99a9 Automated Code Change
PiperOrigin-RevId: 820542824
2025-10-16 22:51:48 -07:00
Gregory Pataky
c0d9a60f83 Internal changes to project structure
PiperOrigin-RevId: 820527062
2025-10-16 21:52:14 -07:00
Penporn Koanantakool
b2f2568bcc [xla:cpu:xnn] Temporarily disable XNNPACK by default.
PiperOrigin-RevId: 820519075
2025-10-16 21:31:15 -07:00
A. Unique TensorFlower
5592d364ec Automated Code Change
PiperOrigin-RevId: 820505039
2025-10-16 20:36:41 -07:00
A. Unique TensorFlower
a8a747470e Update XNNPACK in XLA
PiperOrigin-RevId: 820502825
2025-10-16 20:24:07 -07:00
Eugene Zhulenev
ef3a678718 [xla:cpu] Fix BufferAllocationInfo::InOutParameter constructor
PiperOrigin-RevId: 820456592
2025-10-16 17:49:08 -07:00
Kevin Gleason
e0f3263a48 [StableHLO Builder] Add API to set frontend attributes
PiperOrigin-RevId: 820455957
2025-10-16 17:34:02 -07:00
Benjamin Chetioui
c19b0d8727 [XLA] Relax restrictions on the iteration_space parameter in Schedule.
We're perfectly able to construct a schedule using only a subset of the
iteration space of a `tile_offsets_indexing`, and in fact need to when we are
processing nested fusions.

PiperOrigin-RevId: 820454010
2025-10-16 17:25:27 -07:00
Haibo Huang
c3ce8a9881 Add PjRtDeviceDimensions struct and proto.
PiperOrigin-RevId: 820440467
2025-10-16 16:41:32 -07:00
Yulia Baturina
180445df5b Add more textual hdrs to mkl_dnn_acl target to enable usage of --config=mkl_aarch64_threadpool with --config=rbe_cross_compile_linux_aarch64 in JAX wheel builds.
This addition fixes the error below:

```
2025-09-03 23:33:24,197 - INFO - external/mkl_dnn_acl_compatible/src/graph/interface/partition_impl.cpp:17:10: fatal error: 'graph/interface/partition_impl.hpp' file not found
2025-09-03 23:33:24,197 - INFO -    17 | #include "graph/interface/partition_impl.hpp"
2025-09-03 23:33:24,197 - INFO -       |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2025-09-03 23:33:24,197 - INFO - 1 error generated.
```

PiperOrigin-RevId: 820423794
2025-10-16 15:58:17 -07:00
A. Unique TensorFlower
a04d8eeb75 Update XNNPACK in XLA
PiperOrigin-RevId: 820419078
2025-10-16 15:43:29 -07:00
Eugene Zhulenev
a2ab65d20c [xla:ffi] Fix XLA_FFI_REGISTER_TYPE_X macro to support optional args
PiperOrigin-RevId: 820414595
2025-10-16 15:30:26 -07:00
Parker Schuh
fd09236fe4 Rollforward of: Implement PjRtStreamExecutorRawBuffer::CopyToLiteralAsync and allow
PjRtStreamExecutorBuffer to just use inherited literal conversion logic.

Reverts 009d8fdbf4

PiperOrigin-RevId: 820409509
2025-10-16 15:17:15 -07:00
A. Unique TensorFlower
1f3321fa67 * Adds flag tf_serialize_mlir_to_compressed_bytecode to serialize to compressed bytecode.
* Deserializing MLIR modules still tries to parse as a string first, as that's the default; on failure, it tries to uncompress and parse.

PiperOrigin-RevId: 820396326
2025-10-16 14:49:31 -07:00
A. Unique TensorFlower
5863476a05 Integrate LLVM at llvm/llvm-project@bfee9db785
Updates LLVM usage to match
[bfee9db78577](https://github.com/llvm/llvm-project/commit/bfee9db78577)

PiperOrigin-RevId: 820396282
2025-10-16 14:38:22 -07:00
Kanish Anand
a6a2128d5f Add IFTTT for named sharding types
PiperOrigin-RevId: 820392329
2025-10-16 14:25:10 -07:00
Maxim Ermilov
5c18a50655 Use nvml impl lib based wrapper
PiperOrigin-RevId: 820376041
2025-10-16 13:59:12 -07:00
Michael Kuperstein
83a98ac48d [XLA] Remove dead module-group related code from HloPassPipelineTest
PiperOrigin-RevId: 820373683
2025-10-16 13:50:15 -07:00
Will Froom
168eb2c36a [XLA:CPU][XTile] Add lowering for tensor extract and from_elements & fix dot with scalar output.
PiperOrigin-RevId: 820368257
2025-10-16 13:37:25 -07:00
Michael Kuperstein
4e29e9da6c [XLA] Remove unused RecordPassStartMetadata overload.
PiperOrigin-RevId: 820365407
2025-10-16 13:22:35 -07:00
A. Unique TensorFlower
c8b47aac14 In profile_data_lib.cc, throw std::runtime_error instead of check fail.
PiperOrigin-RevId: 820351230
2025-10-16 12:47:14 -07:00
A. Unique TensorFlower
83c407040a [XLA:GPU] Don't fail Autotuner::GetSupportedConfigs if one of the backend fails
PiperOrigin-RevId: 820303427
2025-10-16 10:58:41 -07:00
Yulia Baturina
3c991bd608 Remove linking libnvidia-ml.so from hermetic CUDA forward compatibility mode.
`libnvidia-ml.so` version is coupled with kernel mode driver version, hence we can't provide a custom version of `libnvidia-ml.so` if the machine has a different KMD installed on it.

PiperOrigin-RevId: 820291348
2025-10-16 10:28:04 -07:00
Michael Whittaker
7a51446051 Propagate NCCL aborts into user exceptions.
If a multi-controller JAX program is running a collective (e.g., AllReduce) and
one of the collective participants fails, we'll abort the collective.
Previously, a JAX programmer didn't have a way to know that their collective
was aborted. The collective would just return garbage data. This commit changes
the TFRT GPU client to propagate aborted collectives into Python exceptions.

There is some subtlety in detecting when we abort a collective. The NCCL API
unfortunately doesn't provide a direct way to query whether or not a collective
was aborted. Instead, we check the health of the participants after the
collective returns. If the participants are stale, then we conservatively
assume the collective was canceled.

This leads to the possibility that some processes think the collective was
aborted and some don't. I'll address this lack of atomicity in future changes.

PiperOrigin-RevId: 820272371
2025-10-16 09:49:10 -07:00
A. Unique TensorFlower
c12ac785ab [XLA] Add asinh as a native HLO opcode.
This change promotes `asinh` from a composite operation to a native HLO opcode (`kAsinh`). This allows for direct lowering to device-specific intrinsics which should be more performant. Support is added for GPU.

PiperOrigin-RevId: 820245338
2025-10-16 08:34:29 -07:00
Sevin Fide Varoglu
b81b3316be PR #31030: [XLA:GPU] Move ReduceScatterCreator after AlgebraicSimplifier
Imported from GitHub PR https://github.com/openxla/xla/pull/31030

📝 Summary of Changes
This PR moves the ReduceScatterCreator pass to run after AlgebraicSimplifier, simplifying the transformation pattern and allowing ReduceScatterCreator to convert more all-reduces into reduce-scatters that would otherwise be missed.

🎯 Justification
Running ReduceScatterCreator after AlgebraicSimplifier makes the input patterns easier to recognize. This allows more all-reduces to be converted into reduce-scatters, which would otherwise be missed, leading to better performance. _This was reported internally as an optimization for llama3.3-70b._

🚀 Kind of Contribution
️ Performance Improvement,

📊 Benchmark (for Performance Improvements)
On H100:
|  | PR | main |
|----------|----------|----------|
| llama31_8b_bf16_1x8    | 1372251 us   | 1369631 us    |
| llama31_8b_fp8_1x8    | 1106135 us   | 1107605 us    |
| llama31_8b_bf16_2x8    | 1373637 us   | 1370564 us    |
| llama31_8b_fp8_2x8    | 1111912 us   | 1108061 us    |
| llama31_70b_bf16_16x8    | 13933022 us   | 13913957 us    |
| llama31_70b_fp8_16x8    | 9848173 us   | 9867955 us    |
| llama31_70b_bf16_32x8    | 14103619 us   | 14065225 us    |
| llama31_70b_fp8_32x8    | 9732961 us   | 9760739 us    |
| llama31_405b_bf16_64x8    | 52926476 us   | 52886529 us    |
| llama31_405b_fp8_64x8    | 35576505 us   | 37929776 us   |
| mixtral_8x7b_bf16_1x8   | 744367 us   | 744491 us    |
| mixtral_8x7b_bf16_2x8    | 1126425 us   | 1130912 us    |

🧪 Unit Tests:
Added a new unit test

🧪 Execution Tests:
Tested for functionality with llama3.3 70b zero1 + gradient accumulation and saw ~5% performance improvement.

Copybara import of the project:

--
2d999987762ac3d90960179b06587bc95fc954d1 by Sevin Varoglu <svaroglu@nvidia.com>:

Move ReduceScatterCreator after AlgebraicSimplifier

--
0e41c2b8281234eec9af21a98fd5f81bd4884689 by Sevin Varoglu <svaroglu@nvidia.com>:

Add unit test

Merging this change closes #31030

PiperOrigin-RevId: 820221148
2025-10-16 07:56:21 -07:00
Benjamin Chetioui
edf3f8bb43 [XLA] Implement a TiledHloSchedule that transposes the iteration order over the non-contracting dimensions of a dot.
A concrete use case when such a schedule is useful is when we have a matrix
multiplication such that a chunk of shape `(block_m, k)` of the left-hand
side argument fully fits into L2. The transposed iteration order will step
through the `n` dimension first, allowing to hit L2 cache more often when
loading tiles of the left-hand side.

This schedule is intentionally restricted at the moment in order to unblock
launching the generic Triton emitter for GEMMs.

PiperOrigin-RevId: 820214481
2025-10-16 07:45:23 -07:00
Will Froom
78e02816b8 [XLA:CPU][XTile] Add lowering for StableHLO DotGeneral.
PiperOrigin-RevId: 820214413
2025-10-16 07:32:50 -07:00
Eugene Zhulenev
d52ccd4d4b [xla:ffi] Fix clang macro expansion warnings
PiperOrigin-RevId: 820212080
2025-10-16 07:10:34 -07:00
Aliia Khasanova
f7524f08b8 Add proto [de]serialization for SelectKThunk.
PiperOrigin-RevId: 820210212
2025-10-16 06:50:53 -07:00
Marcin Radomski
373e68f60c [XLA:GPU] Change SDC names to more descriptive ones
And dump the log proto into a file called buffer_debug_log rather than sdc_log.

Changes to implementation details:
- Renames:
  - SdcLogProto -> BufferDebugLogProto
  - SdcLog -> BufferDebugLog
  - SdcBufferId -> ThunkBufferId
  - SdcThunk -> BuffersChecksumThunk
  - SdcXorChecksumKernel -> BufferDebugXorChecksumKernel
- move BufferDebugLog to stream_executor/gpu from stream_executor/cuda as it's not CUDA-specific

PiperOrigin-RevId: 820186034
2025-10-16 05:24:40 -07:00
Will Froom
deac36865f [XLA:GPU] Unconditionally emit func.func from triton emitter.
PiperOrigin-RevId: 820175948
2025-10-16 04:55:04 -07:00
Will Froom
ba9c63910a [XLA:CPU][XTile] Create simple lowering for tiled ops.
PiperOrigin-RevId: 820160792
2025-10-16 04:20:15 -07:00
Mohammed Anany
9d2df1c2a5 [Triton] Fixing getLastInductionValue utility to also accept Index type. This would otherwise crash when warp specialization is enabled.
PiperOrigin-RevId: 820159796
2025-10-16 04:08:58 -07:00
Kanish Anand
b31aff76d2 Update the mesh definition to better match its use cases of querying tile index from device IDs or vice versa. Refactor into separate classes.
#hloshardingv3

PiperOrigin-RevId: 820154911
2025-10-16 03:54:39 -07:00
Christian Sigg
e17b7829f7 NFC: Simplify padding logic in Split-K GEMM rewriter.
Remove checks for `padded_k_size.has_value()` as `padded_k_size` is always expected to be present after its initialization. This simplifies the conditions for needing padding and the calculation of the padding amount.

PiperOrigin-RevId: 820148929
2025-10-16 03:28:40 -07:00
Marcin Radomski
72dc64133a [XLA:GPU] Add experimental buffer checksum tracing
A debugging tool meant to pinpoint nondeterministic computations by finding
differences in buffer values across multiple runs. It makes XLA calculate
checksums of input/output buffers, and dump them to the output directory.

Enabling the new `--xla_gpu_experimental_enable_checksum_tracing_on_thunks`
flag enables a new ThunkChecksumTracingPass, which adds checksum thunks to the
thunk graph:

- Inserts SDC log initialization at the beginning.
- Replaces each thunk with a SequentialThunk [checksum inputs, run original
  thunk, checksum outputs].
- Inserts a thunk that dumps SDC log to a file at the end of execution.

PiperOrigin-RevId: 820148916
2025-10-16 03:18:07 -07:00
A. Unique TensorFlower
41549024ea Remove obsolete Triton LLVM integration patches.
These patches are no longer needed as the corresponding changes have been integrated into the Triton codebase.

PiperOrigin-RevId: 820140492
2025-10-16 03:04:09 -07:00
A. Unique TensorFlower
379d3eba7b Set call result shardings to the out shardings of func that is created or found from cache.
It is a no-op when `dedupFunctionsFully` is false, which is also the default.

outShardings is the output shardings of the named computation at hand. However, if dedupFunctionsFully is true, the func we pick from `createFuncOpOrGetFromCache`, which is the func the call will actually be calling, may have a different output sharding than the named computation, and the call result sharding should be set to the output sharding of the func it calls. For example,

namedComputation1(foo): insharding={"x"} outsharding={"y"}
namedComputation2(foo): insharding={"x"} outsharding={"z"}

call1 to namedComputation1
call2 to namedComputation2

When dedupFunctionsFully is false, we have separate instances of foo as their outshardings are different.

func foo1 insharding={"x"} outsharding={"y"} {...}
func foo2 insharding={"x"} outsharding={"z"} {...}
call1 to foo_1 resultsharding={"y"}
call2 to foo_2 resultsharding={"z"}

When dedupFunctionsFully is true, we do not have separate instances of foo; we need to pick either namedComputation1 or namedComputation2. Say we pick namedComputation1, hence it becomes:

func foo insharding={"x"} outsharding={"y"} {...}
call1 to foo resultsharding={"y"}
call2 to foo resultsharding={"y"}

As a result, call2 should have resultsharding={"y"} since it is calling foo, instead of the out sharding of namedComputation2, which is {"z"}.
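The cache behavior described above can be sketched as follows (hypothetical types and key scheme; the real `createFuncOpOrGetFromCache` operates on MLIR funcs). The key point is that when functions are fully deduped, the cache key ignores the out sharding, so the call's result sharding must be taken from the cached func rather than from the named computation:

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a func with in/out shardings.
struct Func { std::string in_sharding, out_sharding; };

// Cache keyed by body + in-sharding only (out sharding deliberately ignored,
// mimicking dedupFunctionsFully == true).
std::map<std::string, Func> cache;

const Func& CreateFuncOpOrGetFromCache(const std::string& body,
                                       const Func& computation) {
  auto [it, inserted] =
      cache.try_emplace(body + computation.in_sharding, computation);
  return it->second;
}

std::string CallResultSharding(const std::string& body, const Func& comp) {
  // Use the out sharding of the func actually called, not comp.out_sharding.
  return CreateFuncOpOrGetFromCache(body, comp).out_sharding;
}
```

With the `foo` example above, the second call deduplicates onto the first func and therefore inherits its out sharding.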

PiperOrigin-RevId: 820139879
2025-10-16 02:55:05 -07:00
Christian Sigg
6ac8b2ea6e [xla:gpu] Add padding to split-k to allow pipelining.
Loads are required to be 16-byte aligned for Triton to apply pipelining. This change adds extra padding to both split-k rewriters so that the reduction dimensions are a multiple of 16 bytes.
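The padding requirement amounts to rounding the reduction dimension up so its size in bytes is a 16-byte multiple. A hedged sketch (hypothetical helper, not the rewriter code; assumes the element size divides 16):

```cpp
#include <cstdint>

// Round a reduction dimension (in elements) up so that its size in bytes
// is a multiple of 16, the alignment Triton needs to pipeline loads.
int64_t PadTo16Bytes(int64_t num_elements, int64_t bytes_per_element) {
  const int64_t kAlign = 16;
  int64_t bytes = num_elements * bytes_per_element;
  int64_t padded_bytes = (bytes + kAlign - 1) / kAlign * kAlign;
  return padded_bytes / bytes_per_element;
}
```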

PiperOrigin-RevId: 820134896
2025-10-16 02:42:45 -07:00
A. Unique TensorFlower
c1e8fc6dc4 [XLA:GPU] Add abstract class for multicast memory to GpuExecutor.
PiperOrigin-RevId: 820115707
2025-10-16 01:24:32 -07:00
Aliia Khasanova
4dab5ef4a6 Add proto [de]serialization for Memset32BitValueThunk.
PiperOrigin-RevId: 820109174
2025-10-16 01:02:43 -07:00
Adrian Kuegel
83f3904c5f [XLA:GPU] Consider multi-output fusions supported by Triton codegen.
Currently we would fail when trying to check whether the element type of the
root tuple is supported. We should not even access the element type on a tuple
shape. Therefore we skip the root tuple.

PiperOrigin-RevId: 820096050
2025-10-16 00:16:18 -07:00
Alexander Belyaev
6c440133b8 [XLA:GPU] Move AsyncStreamKind and CollectiveOpGroupMode to xla_data.proto.
This is a preparation CL before adding serialization for collective thunks.

PiperOrigin-RevId: 820091670
2025-10-16 00:05:34 -07:00
Adrian Kuegel
06ea67005b [XLA:GPU] Run hlo lit tests on several GPU platforms.
This increases test coverage.
Also remove the empty test suite mlir_lit_tests. These tests have been moved to
another directory long ago.

PiperOrigin-RevId: 820074643
2025-10-15 23:10:39 -07:00
A. Unique TensorFlower
2edf3555cf Sort op's first operand is now generated without duplicates if the
sort is stable.

PiperOrigin-RevId: 820067242
2025-10-15 22:46:43 -07:00
A. Unique TensorFlower
88d1adfc68 Automated Code Change
PiperOrigin-RevId: 820049303
2025-10-15 21:55:41 -07:00
A. Unique TensorFlower
d33383d214 Introduce tsl::WithCurrentContext for capturing the current context.
PiperOrigin-RevId: 820042807
2025-10-15 21:19:52 -07:00
Subhankar Shah
4df1a3c67f [XLA:MSA] When block prefetching, finalize the original value if a sliced value is prefetched successfully and the original value is not.
We already have a pinned allocation for the original value; it should be finalized to avoid re-allocation causing multiple pinned allocations for the same buffer.

PiperOrigin-RevId: 820015337
2025-10-15 19:56:19 -07:00
Hyeontaek Lim
55371dfcb4 [PjRt-IFRT] ifrt::PjRtArray::pjrt_layout() uses nullptr to indicate a default layout
PjRt-IFRT now returns a `nullptr` if it knows that the Array layout represents a default layout. The user code has previously been migrated to handle this new behavior gracefully, obtaining a concrete default layout as before.

`ifrt::PjRtArray` creation now requests extra information on whether the underlying `PjRtBuffer` is using a custom layout, as IFRT tracks the defaultness of array layouts. This information cannot be inferred correctly from `PjRtBuffer` alone because `PjRtBuffer::layout()` only returns a concrete layout. PjRt would mostly work fine today if a default layout were said to be a custom layout, but some strict layout equality checks can fail and require more precise information to be supplied.

A few test cases in IFRT ArrayImplTest against PjRt CPU and GPU clients
have been disabled because the output array does not track the
non-default-ness of the layout correctly when
`MakeArraysFromHostBufferShards()` is implemented using
`ClientMakeArraysFromHostBufferShards()`.

PiperOrigin-RevId: 819995407
2025-10-15 18:47:15 -07:00
Parker Schuh
0c8f3eab9a Change EnterHostCallback() and
LeaveHostCallback() to use a c++ raii object to ensure
that Enter and Leave are always matched.
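The RAII pairing described above can be sketched like this (a minimal sketch with hypothetical names; the real Enter/Leave functions do actual host-callback bookkeeping):

```cpp
// Hypothetical counter standing in for the real Enter/Leave bookkeeping.
int depth = 0;
void EnterHostCallback() { ++depth; }
void LeaveHostCallback() { --depth; }

// RAII guard: Enter on construction, Leave on destruction, so the pair is
// matched even on early return or exception.
class HostCallbackScope {
 public:
  HostCallbackScope() { EnterHostCallback(); }
  ~HostCallbackScope() { LeaveHostCallback(); }
  HostCallbackScope(const HostCallbackScope&) = delete;
  HostCallbackScope& operator=(const HostCallbackScope&) = delete;
};

int RunCallback() {
  HostCallbackScope scope;  // Enter
  return depth;             // Leave runs automatically at scope exit
}
```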

PiperOrigin-RevId: 819993376
2025-10-15 18:35:59 -07:00
Eugene Zhulenev
61785a4328 [xla:ffi] Add a test for automatic FFI handler signature inference from C++ function
PiperOrigin-RevId: 819988900
2025-10-15 18:14:47 -07:00
Yun Peng
0ce64afaf3 Introduce HERMETIC_PYTHON_VERSION_KIND for the Bzlmod build.
Add a placeholder for `HERMETIC_PYTHON_VERSION_KIND` in the generated `py_version.bzl` file. This new variable is currently set to an empty string until we figure out how to deal with it.

PiperOrigin-RevId: 819956767
2025-10-15 16:33:03 -07:00
Parker Schuh
a6a11a6036 Implement StreamExecutorGpuClient::ScheduleRemoteSend. This allows migrating
CopyToRemoteDevice to CommonPjRtBuffer APIs.

PiperOrigin-RevId: 819949965
2025-10-15 16:12:06 -07:00
Karlo Basioli
dd90f5fa76 [XLA:GPU][codegen] Emit stablehlo for iota and implement lowering of stablehlo.iota to tt.make_range
PiperOrigin-RevId: 819934458
2025-10-15 15:31:07 -07:00
A. Unique TensorFlower
644b4a83b5 Replace stream->BlockHostUntilDone() with BlockHostUntilDoneWithHostCallback().
BlockHostUntilDone calls `cuStreamSynchronize`, which has some performance issues.

PiperOrigin-RevId: 819924678
2025-10-15 15:03:44 -07:00
Frederik Gossen
2582934b0b [XLA:GPU] Add verbose tracing for BlockHostUntilDone and stream synchronization
PiperOrigin-RevId: 819914599
2025-10-15 14:35:36 -07:00
Mohammed Anany
f147bddb8d Extract launch information from the Triton compilation pipeline and use it instead of XLA's calculation. This is necessary in cases where the pipeline overrides the expected launch configuration.
This was observed when auto warp specialization was enabled. Triton requires more threads per block than expected, and this information is available in the module attributes.

PiperOrigin-RevId: 819893926
2025-10-15 13:57:39 -07:00
A. Unique TensorFlower
c265f586c6 Integrate LLVM at llvm/llvm-project@267fa8dd1e
Updates LLVM usage to match
[267fa8dd1efc](https://github.com/llvm/llvm-project/commit/267fa8dd1efc)

PiperOrigin-RevId: 819892951
2025-10-15 13:51:53 -07:00
Sean Talts
ccd875910a [XLA:CPU] Use asm to set name of intrinsic generated IR functions.
PiperOrigin-RevId: 819885948
2025-10-15 13:42:06 -07:00
Sean Talts
3c7395e014 [XLA:CPU] Fix intrinsic library failing when passed an already vectorized call. From Will Froom.
PiperOrigin-RevId: 819867554
2025-10-15 13:18:23 -07:00
Aliia Khasanova
1108cc983b Add proto [de]serialization for CholeskyThunk.
The only non-obvious part of the thunk is `solver_context_creator`, but we can retrieve it during the deserialization from `stream_executor::Platform`, which is available during runtime.

PiperOrigin-RevId: 819863398
2025-10-15 13:00:47 -07:00
Sean Talts
23736ecfc6 [XLA:CPU] Add test showing exp intrinsic vectorizations.
This test will serve to illustrate an upcoming change in intrinsic_lib's vectorization logic.

PiperOrigin-RevId: 819851790
2025-10-15 12:11:45 -07:00
Karlo Basioli
948d0df409 [XLA:GPU][codegen] Emit tensor dialect for bitcast and implement lowering of bitcast from tensor dialect to triton.
PiperOrigin-RevId: 819833904
2025-10-15 11:57:19 -07:00
Eugene Zhulenev
503198fb6b [xla:cpu] Construct BufferAllocationInfo from BufferAssignment
This is no-op change, preparing for migration from cpu_function_runtime::BufferInfo to new BufferAllocationInfo type.

PiperOrigin-RevId: 819827983
2025-10-15 11:36:46 -07:00
A. Unique TensorFlower
6c32106238 Integrate Triton up to [de2ba394](de2ba3946b)
https://github.com/openxla/triton/tree/triton_integrate_branch-

PiperOrigin-RevId: 819807700
2025-10-15 11:01:37 -07:00
A. Unique TensorFlower
b545b61c0d [XLA:GPU] Provide functions to setup multicast from a single process.
PiperOrigin-RevId: 819790003
2025-10-15 10:48:13 -07:00
Aliia Khasanova
0ab9f48846 Refactor SelectKThunk to accept ThunkInfo instead of HloInstruction pointer.
PiperOrigin-RevId: 819786719
2025-10-15 10:37:40 -07:00
Marcin Radomski
f0ea4b75e3 [XLA:GPU] ThunkPassPipeline: pass HloModule* to Run()
This allows SDC log dumper to derive unique path for each module execution.

PiperOrigin-RevId: 819781581
2025-10-15 10:08:55 -07:00
Peter Hawkins
009d8fdbf4 Reverts 7dbc996979
PiperOrigin-RevId: 819777372
2025-10-15 09:48:29 -07:00
Alexander Shaposhnikov
4626ec956f Bump XNNPACK version for open source builds.
PiperOrigin-RevId: 819774605
2025-10-15 09:33:48 -07:00
A. Unique TensorFlower
0290b24ad8 Internal visibility change.
PiperOrigin-RevId: 819771473
2025-10-15 09:26:34 -07:00
Mohammed Anany
6969cce01e [XLA:GPU/WS] Adding xla_gpu_experimental_enable_triton_warp_specialization flag. This is currently only used to decorate the contracting dimension loop for dot fusions going through Triton with tt.warp_specialize, enabling the feature in Triton.
PiperOrigin-RevId: 819765526
2025-10-15 09:18:52 -07:00
Joshua Lang
1b2ecc8924 Disable broken se_gpu_pjrt_client_test_2gpu_b200 test
PiperOrigin-RevId: 819764723
2025-10-15 09:06:41 -07:00
Marcin Radomski
1aa192d839 [XLA:GPU] Avoid use-after-free in StreamExecutorGpuClientTest::CopyRawToHostOutOfRange
PiperOrigin-RevId: 819763300
2025-10-15 08:44:41 -07:00
Peter Hawkins
baf408c724 Reverts 5a3a4bcd44
PiperOrigin-RevId: 819762394
2025-10-15 08:23:20 -07:00
Kostiantyn Liepieshov
2b17e0e0c0 Support SparseActivationsUnstack and SparseActivationsUnstackInterleaved custom calls always returning a tuple result
PiperOrigin-RevId: 819743515
2025-10-15 07:30:48 -07:00
A. Unique TensorFlower
9567225474 [XLA:GPU] Enable chlo.asin -> kAsin HloInstruction lowering.
PiperOrigin-RevId: 819720031
2025-10-15 06:49:12 -07:00
Mohammed Anany
aa3cb5c5d8 [NFC] Moving extraction utility out of fusion_emitter to emitter_helpers. Also added a test for coverage as I realized this function wasn't tested.
More utilities will follow as part of an upcoming change, so this refactor makes sense to land first.

PiperOrigin-RevId: 819716328
2025-10-15 06:35:52 -07:00
Eugene Zhulenev
339325c6d7 [xla:ffi] Add XLA_FFI_TypeInfo in preparation for adding it to TypeRegistry
PiperOrigin-RevId: 819715434
2025-10-15 06:22:37 -07:00
Ilia Sergachev
2408b9968e PR #32003: [GPU][NFC] Merge methods querying fusion kind.
Imported from GitHub PR https://github.com/openxla/xla/pull/32003

Copybara import of the project:

--
2a3ad034522e871edc9c7f580e86fc3980025542 by Ilia Sergachev <isergachev@nvidia.com>:

[GPU][NFC] Merge methods querying fusion kind.

--
ebeb25599d6017d34ea92ece415a255d109af049 by Ilia Sergachev <isergachev@nvidia.com>:

Address review requests.

Merging this change closes #32003

PiperOrigin-RevId: 819692807
2025-10-15 04:57:40 -07:00
Aleksa Arsic
9a25b01c7e PR #32283: [ROCm] Change misleading method name RocmComputeCapability::has_amd_matrix_core()
Imported from GitHub PR https://github.com/openxla/xla/pull/32283

📝 Summary of Changes
Change misleading method name RocmComputeCapability::has_amd_matrix_core() to the more suitable name has_amd_mat_acc_instructions(), as gfx11xx do not have matrix cores but support the matrix acceleration instruction set known as WMMA.

🎯 Justification
RocmComputeCapability::has_amd_matrix_core() is misleading as gfx11xx do not have matrix cores but still support the matrix acceleration instruction set, WMMA.

🚀 Kind of Contribution
♻️ Cleanup

@xla-rotation please review my changes.

Copybara import of the project:

--
23cf1ab79fdcc4ee2ee4996973dee2c103d2762a by Aleksa Arsic <aleksa.arsic@amd.com>:

Change misleading method name RocmComputeCapability::has_amd_matrix_core() to the more suitable name has_amd_mat_acc_instructions(), as gfx11xx do not have matrix cores but support the matrix acceleration instruction set known as WMMA.

Merging this change closes #32283

PiperOrigin-RevId: 819652238
2025-10-15 02:53:07 -07:00
Thomas Joerg
28c0be7a10 [XLA:GPU] Run GpuKernelTilingTests on default GPU platforms. So far, this test is limited to Pascal.
PiperOrigin-RevId: 819650786
2025-10-15 02:36:10 -07:00
Ilia Sergachev
a1891cea11 PR #31994: [NFC] Move computation simplification methods from command buffer scheduling to a new library.
Imported from GitHub PR https://github.com/openxla/xla/pull/31994

Copybara import of the project:

--
dd037f3ef1c2da262029a9ebc34845ddb3c8a7f1 by Ilia Sergachev <isergachev@nvidia.com>:

[NFC] Move computation simplification methods from command buffer scheduling to a new library.

--
2594c7a473945f5d410ae8e8894b7e90f5812c1e by Ilia Sergachev <isergachev@nvidia.com>:

Address review feedback.

Merging this change closes #31994

PiperOrigin-RevId: 819631409
2025-10-15 01:57:26 -07:00
Alex
5d0658679a PR #32642: [ROCm] Fix invalid run_under script for ci job and asan ignore files
Imported from GitHub PR https://github.com/openxla/xla/pull/32642

📝 Summary of Changes
Fix rocm build with asan settings

🎯 Justification
Fix invalid run under script used in order to pass through the asan ignore lists file.

🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
Not relevant

🧪 Unit Tests:
Not relevant

🧪 Execution Tests:
Not relevant

Copybara import of the project:

--
708b0b274d18b88ca7467c3ab3f44aaa11710995 by Alexandros Theodoridis <atheodor@amd.com>:

Fix invalid run_under script for ci job and asan ignore files

Merging this change closes #32642

PiperOrigin-RevId: 819630816
2025-10-15 01:47:03 -07:00
Adrian Kuegel
a150bc01ab [XLA:GPU] Make sure to simplify both lhs and rhs of convolution.
We need to ensure that symbols for trivial dimensions are simplified away
consistently. If we simplify it on one side, we also need to simplify it on
the other, as we want to use the same iteration space for both lhs and rhs.

PiperOrigin-RevId: 819606712
2025-10-15 00:32:36 -07:00
A. Unique TensorFlower
09eaac74e8 Automated Code Change
PiperOrigin-RevId: 819566851
2025-10-14 22:14:42 -07:00
Joshua Lang
ae547f429c Enable multi-GPU tests on B200.
PiperOrigin-RevId: 819551562
2025-10-14 21:23:05 -07:00
Parker Schuh
5a3a4bcd44 Update PjRtStreamExecutorRawBuffer::CopyRawHostToDeviceAndReturnEvent to
support staging host buffers (for non-pinned memory). This allows replacing the CopyRawToHost functions.

PiperOrigin-RevId: 819514361
2025-10-14 19:35:11 -07:00
Parker Schuh
7dbc996979 Implement PjRtStreamExecutorRawBuffer::CopyToLiteralAsync and allow
PjRtStreamExecutorBuffer to just use inherited literal conversion logic.

PiperOrigin-RevId: 819490746
2025-10-14 18:23:53 -07:00
A. Unique TensorFlower
14d70d20ee [XLA] Add tests for tile_assignment.
PiperOrigin-RevId: 819461056
2025-10-14 17:09:39 -07:00
A. Unique TensorFlower
95340a0066 Use hostcallback for h2d
PiperOrigin-RevId: 819459051
2025-10-14 16:59:11 -07:00
Alex Pivovarov
e5c6a2b8ae Add support for CollectiveBroadcastThunk in Command Buffer.
This change integrates `CollectiveBroadcastStartThunk` and `CollectiveBroadcastDoneThunk` into the command buffer execution framework, allowing them to be converted into command buffer commands. It also includes a minor fix to an error message.

PiperOrigin-RevId: 819432425
2025-10-14 16:03:15 -07:00
A. Unique TensorFlower
b9024cb4f2 Rollback of PR #32389
Reverts 6e534c2dc1

PiperOrigin-RevId: 819424613
2025-10-14 15:49:43 -07:00
Benjamin Chetioui
2bd860f639 [XLA] Clarify an additional invariant with regards to command buffer compatibility in the FFI API.
PiperOrigin-RevId: 819424170
2025-10-14 15:36:43 -07:00
Michael Whittaker
5750ca22b1 Adds more logging to coordination service.
PiperOrigin-RevId: 819391653
2025-10-14 14:16:16 -07:00
Parker Schuh
5c5d76e7ad Add more precise stream synchronization, which allows more aggressive stream
synchronization in raw buffer APIs (namely CopyRawHostToDeviceAndReturnEvent
and CopyRawDeviceToHostAndReturnEvent). Old buffers will require no
synchronization, but recent buffers will get a cached compute_stream event,
and subsequent operations will sync with this compute_stream event repeatedly
instead of syncing with the stream itself.

PiperOrigin-RevId: 819380480
2025-10-14 13:54:10 -07:00
Oleg Shyshkov
256d7c57ca [XLA:GPU] Add a check that each replica group is equally distributed across hosts.
This way we also make sure that we don't decompose a ra2a that stays within a single host.

PiperOrigin-RevId: 819311950
2025-10-14 11:31:17 -07:00
Karlo Basioli
c61dd4a830 [XLA:CPU] Erase legacy compute function path in cpu_executable.
This has been unused for a while now.

PiperOrigin-RevId: 819276283
2025-10-14 10:27:15 -07:00
Will Froom
263a774f70 [XLA][XTile] Create initial shared tiled dialect + ops.
PiperOrigin-RevId: 819273024
2025-10-14 10:03:30 -07:00
Yun Peng
26658a5e2e Reformat MODULE.bazel and adjust pip repo usage.
This change reorders arguments within various override and extension calls for consistency. It also removes the explicit `xla_pypi_311_numpy` from the `use_repo(pip, ...)` call.

PiperOrigin-RevId: 819260347
2025-10-14 09:53:09 -07:00
Marcin Radomski
5a43dd8d5e [XLA:GPU] Add SdcThunk
A Thunk that calculates checksums for all configured buffers and stores them in
an SdcLog.

PiperOrigin-RevId: 819258906
2025-10-14 09:34:56 -07:00
Eugene Zhulenev
812c201256 [xla:ffi] Add support for returning TypeId for stateful handlers
PiperOrigin-RevId: 819234816
2025-10-14 08:58:32 -07:00
A. Unique TensorFlower
6085657619 Automated Code Change
PiperOrigin-RevId: 819231654
2025-10-14 08:44:48 -07:00
Byungchul Kim
f474e0b8f2 Fix incorrect python interpreter path for non-bzlmod builds
Remove duplicated "_host" suffix.

PiperOrigin-RevId: 819229058
2025-10-14 08:36:47 -07:00
A. Unique TensorFlower
f23728131e Make all files depending on IndexingMap to use SymbolicExprContext
This CL is a key step in integrating the new SymbolicExpr library into IndexingMap (b/433696544). The primary goal is to replace the existing `mlir::MLIRContext` with `gpu::SymbolicExprContext` throughout every class that depends on IndexingMap.

Goal:
  - Enables Symbolic Reasoning: `SymbolicExprContext` is designed to manage symbolic expressions, which will allow for more powerful analysis and optimization of indexing maps.
  - Performance: We believe, with pifon@, that by using a dedicated context, compilation time could be improved by not overusing the lock inside MLIRContext (which is used everywhere). This should be confirmed with real data after finishing the entire migration.
  - Foundation: This refactoring is a prerequisite for fully replacing `AffineExpr` with `SymbolicExpr` in `IndexingMap`. This should unblock the replacement of the internal implementation (cl/802100018).

Changes:
   - Signature Updates: Function signatures across numerous files in `xla/service/gpu`, `xla/backends/gpu`, `xla/backends/cpu`, and `xla/hlo/analysis` have been updated to accept `gpu::SymbolicExprContext*` instead of `mlir::MLIRContext*`.
   - Context Propagation: The `SymbolicExprContext` is now created and owned by `GpuCompiler` and propagated down to the various components, including fusion emitters, autotuners, and performance models.
   - Test Updates: Unit tests and test utilities have been updated to use the new context.
   - Some areas required temporary workarounds where the `SymbolicExprContext` is created on the fly from an existing `mlir::MLIRContext`. These are marked with TODOs (b/446856820, b/446856303) to be cleaned up in follow-up CLs as the integration progresses.

Ideally no functional changes are intended, but the `mlir::MLIRContext` was inconsistently managed across the codebase, requiring careful tracing and updates to ensure the new `SymbolicExprContext` is correctly propagated everywhere. This made the refactoring process time-consuming, and I would appreciate careful review because I could have made some mistakes as well. In this process I hit three different and unrelated segmentation faults and a crash in StorageUniquer from not using the same context in different places.

This CL represents step 2 of the integration plan outlined in b/433696544#comment9.

PiperOrigin-RevId: 819228363
2025-10-14 08:24:41 -07:00
Joshua Lang
a88366a806 Add B200 GPU spec to XLA GPU device info tests.
PiperOrigin-RevId: 819220544
2025-10-14 08:04:57 -07:00
Eusebio Durán Montaña
d555ed2c74 Add (de)serialization for FftThunk
This one is a pretty direct mapping from the struct to the proto.

PiperOrigin-RevId: 819214943
2025-10-14 07:52:20 -07:00
Michael Kuperstein
0064d2d1bb Rolling forward with fix.
Reverts eefde23194

PiperOrigin-RevId: 819210668
2025-10-14 07:37:44 -07:00
A. Unique TensorFlower
5e244973e3 Fix test in case when driver is old.
PiperOrigin-RevId: 819197347
2025-10-14 07:25:18 -07:00
A. Unique TensorFlower
e57cf67a64 Reverts fe185826cd
PiperOrigin-RevId: 819194024
2025-10-14 07:10:15 -07:00
Eusebio Durán Montaña
1e7235a721 Clean up FftThunk includes and BUILD dependencies
Doing this before touching the files; also adding missing brackets in an if statement.

PiperOrigin-RevId: 819191695
2025-10-14 06:57:40 -07:00
A. Unique TensorFlower
fe185826cd Automated Code Change
PiperOrigin-RevId: 819166638
2025-10-14 06:00:33 -07:00
Eusebio Durán Montaña
bcd41217ed Add (de)serialization for the ConvolutionReorderThunk
PiperOrigin-RevId: 819165377
2025-10-14 05:48:11 -07:00
A. Unique TensorFlower
51ef995fff Automated Code Change
PiperOrigin-RevId: 819120115
2025-10-14 04:01:56 -07:00
Eusebio Durán Montaña
841ccebbd0 Refactor ConvolutionReorderThunk member fields
In practice the thunk always has:
* an input and output filter
* either:
  * no biases
  * both an input and output bias

So encode this invariant in the data structure, to make this more readable and to make it harder to create an invalid thunk.
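The invariant "no biases, or both an input and output bias" can be encoded by storing the biases as an optional pair (a minimal sketch with hypothetical types; the real thunk uses buffer allocation slices):

```cpp
#include <optional>

// Hypothetical stand-in for the real buffer slice type.
struct Slice { int id; };

// Biases are either absent or present as an input/output pair, so an
// "input bias without output bias" state is unrepresentable.
struct BiasPair {
  Slice input_bias;
  Slice output_bias;
};

struct ReorderThunkArgs {
  Slice input_filter;
  Slice output_filter;
  std::optional<BiasPair> biases;  // nullopt == no biases at all
};

bool HasBiases(const ReorderThunkArgs& args) { return args.biases.has_value(); }
```

Making invalid states unrepresentable moves the invariant check from runtime validation into the type system.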

PiperOrigin-RevId: 819099118
2025-10-14 03:12:50 -07:00
Marcin Radomski
733a820470 [XLA:GPU] SdcXorChecksumKernel: move trait to stream_executor/gpu
Use GpuKernelRegistry for loading the kernel rather than `TypedKernelFactory`.
The new header will help prevent errors related to use of "gpu"-tagged targets
in non-"gpu"-tagged ones.

Also, avoid using atomic fetch_add to prevent JAX build failures on <sm60.
We're going to ensure that with a runtime check.

PiperOrigin-RevId: 819098591
2025-10-14 03:02:11 -07:00
Yunlong Liu
7ea52be53e PR #32454: Respect debug options override in LHS.
Imported from GitHub PR https://github.com/openxla/xla/pull/32454

Any place inside the compiler has to respect the debug options override set by the users.

Copybara import of the project:

--
6971175737582aad4e9256f983890ac04009a074 by Yunlong Liu <yliu120@users.noreply.github.com>:

Respect debug options override in LHS.

Merging this change closes #32454

PiperOrigin-RevId: 819088927
2025-10-14 02:46:59 -07:00
A. Unique TensorFlower
d0564fd223 [XLA:GPU] Add functions to allocate memory with VMM API.
PiperOrigin-RevId: 819074758
2025-10-14 02:02:15 -07:00
Henning Becker
9793c54120 Replace CustomCallThunk::Slice by ShapedSlice
The two types have the same definition and are semantically equivalent.

PiperOrigin-RevId: 819044248
2025-10-14 00:39:15 -07:00
A. Unique TensorFlower
e50b78b75b Automated Code Change
PiperOrigin-RevId: 819038195
2025-10-14 00:28:17 -07:00
A. Unique TensorFlower
5b18f472ef Automated Code Change
PiperOrigin-RevId: 819010361
2025-10-13 23:10:32 -07:00
Oleg Shyshkov
dd73f5e444 [XLA:GPU] Support arbitrary replica groups in RaggedAllToAllMultiHostDecomposer.
This change lifts the original restriction that ra2a should have only one iota replica group.

PiperOrigin-RevId: 818996017
2025-10-13 22:24:51 -07:00
A. Unique TensorFlower
7cfed27c49 Automated Code Change
PiperOrigin-RevId: 818986694
2025-10-13 22:10:24 -07:00
A. Unique TensorFlower
b136b5e191 Automated Code Change
PiperOrigin-RevId: 818968287
2025-10-13 21:48:10 -07:00
A. Unique TensorFlower
ab00f6c182 Automated Code Change
PiperOrigin-RevId: 818968250
2025-10-13 21:39:35 -07:00
A. Unique TensorFlower
267f1b7fd7 Automated Code Change
PiperOrigin-RevId: 818965009
2025-10-13 21:23:42 -07:00
Eugene Zhulenev
152bbebe79 [tsl:concurrency] Add support for detached futures
Running OnReady and Map callbacks in the thread that calls promise.Set() can be dangerous for performance. Add an API to execute all callbacks in the given executor.

PiperOrigin-RevId: 818964744
2025-10-13 20:59:26 -07:00
A. Unique TensorFlower
b0674cd448 Automated Code Change
PiperOrigin-RevId: 818954805
2025-10-13 20:24:18 -07:00
A. Unique TensorFlower
76e891c4db Automated Code Change
PiperOrigin-RevId: 818950162
2025-10-13 20:13:43 -07:00
Haibo Huang
6a1c2a8a79 Add methods to query chip count and logical devices per chip to PjRtTopologyDescription.
PiperOrigin-RevId: 818946853
2025-10-13 19:59:38 -07:00
Frederik Gossen
43f9e0789c [XLA:GPU] Add verbose kernel scheduling tracing for debugging
PiperOrigin-RevId: 818918076
2025-10-13 18:20:43 -07:00
Hyeontaek Lim
631a48b8da [IFRT] Update the semantics of ifrt::Array::pjrt_layout() regarding default layouts
`ifrt::Array::pjrt_layout()` will soon return a `nullptr` for a default layout (along with removal of the `absl::StatusOr<>` part). The user can continue to get a concrete default layout via `ifrt::Client::GetDefaultPjRtLayout()`.

During the transition, IFRT implementations may return either `nullptr` or a concrete default layout; this state is permitted temporarily while they are migrated to return `nullptr` for default layouts.

PiperOrigin-RevId: 818911623
2025-10-13 18:04:27 -07:00
jparkerh
3a43ed26e8 Make the gpu_static_registration work properly
gpu_static_registration is intended to pull in all dependencies needed for the target; this change ensures that this is the case.

PiperOrigin-RevId: 818820817
2025-10-13 14:29:45 -07:00
Matthias Kramm
464202fa3b Apply memory optimizations if options.allow_in_place_mlir_modification is true.
PiperOrigin-RevId: 818818778
2025-10-13 14:06:27 -07:00
Parker Schuh
5c75040b40 Everyone sets untuple_result = true unconditionally
because that is the only supported way to return Tuples, so
flip the flag by default to true.

Callers can now stop setting this.

PiperOrigin-RevId: 818803753
2025-10-13 13:29:58 -07:00
Yulia Baturina
9445f93155 Bump rules_ml_toolchain version to integrate @cuda_nvrtc header.
PiperOrigin-RevId: 818772198
2025-10-13 12:24:23 -07:00
Michael Kuperstein
eefde23194 Rolling back due to breakage.
Reverts 47ec6671b8

PiperOrigin-RevId: 818768109
2025-10-13 12:14:00 -07:00
Allan Renucci
da9375fa74 [NFC] Simplifications to CommandBufferConversionPass.
PiperOrigin-RevId: 818763449
2025-10-13 12:03:39 -07:00
David Dunleavy
2b947eb9a4 Delete xla/tests/fuzz
These tests haven't demonstrated much value, and the generated HLOs are no longer used anywhere else.

PiperOrigin-RevId: 818759915
2025-10-13 11:53:59 -07:00
Eusebio Durán Montaña
056f013072 Refactor NCHW filter dimension parsing in ConvolutionReorderThunk
Instead of inputting the filter dimensions as a span of integers, which we implicitly expect to be of size 4, we pass them as a proto.

Using a proto instead of a struct since we'll need the `ConvolutionFilterDimensions` to serialize the `ConvolutionReorderThunk`. (We don't want to serialize the `FilterDescriptor` since most of its fields are only written during execution, so we'll serialize the `ConvolutionFilterDimensions` instead).

Not sure where the best place for the `ConvolutionFilterDimensions` proto to live is. Other options would be to define it in:
* In the thunk.proto, or;
* In some other file more closely related to convolution filters (not sure where that could be).

PiperOrigin-RevId: 818726237
2025-10-13 10:40:50 -07:00
Christian Sigg
929dc2d18b [xla:gpu] NFC: Hoist computation index map creation in Triton dot emitters.
This change moves the creation of the `computation_index_map` outside the loop body in `EmitDot` and `EmitScaledDot`, as it does not depend on the loop induction variable. It also simplifies how the tile size is retrieved in `GetDotLoopIterationCount` by using `TiledHloInstruction::tile_size`.

PiperOrigin-RevId: 818718168
2025-10-13 10:16:39 -07:00
Felix Wang
3ab1351fe1 Add comments for GPUCommunicationType.
PiperOrigin-RevId: 818705728
2025-10-13 09:57:36 -07:00
Felix Wang
34ac2390a2 Support async reduce-scatter for GetLatencyBetween in sol cost estimator && add verbose log
PiperOrigin-RevId: 818699549
2025-10-13 09:39:48 -07:00
A. Unique TensorFlower
893e85d01a Integrate LLVM at llvm/llvm-project@3a6b818132
Updates LLVM usage to match
[3a6b818132e3](https://github.com/llvm/llvm-project/commit/3a6b818132e3)

PiperOrigin-RevId: 818692560
2025-10-13 09:24:50 -07:00
A. Unique TensorFlower
66537d1bd8 Remove use of vector.splat
Removed in ea291d0e8c. The replacement is vector.broadcast.

PiperOrigin-RevId: 818680777
2025-10-13 08:57:51 -07:00
A. Unique TensorFlower
ec753f48a1 [XLA] Remove brittleness regarding the CHECK-HIGH-LEVEL directives in chlo_legalize_to_mhlo.mlir tests by adding a CHECK-HIGH-LEVEL-LABEL to discard any output preceding the beginning of the test. Without this, the tests are dependent; a CHECK-HIGH-LEVEL can match an output from a previous test, as it matches all output up until the latest preceding CHECK-HIGH-LEVEL. To prevent also matching output after the end of the test, a CHECK-HIGH-LEVEL-LABEL is added to each test following a test that contains a CHECK-HIGH-LEVEL.
Also, add `func.func` in some `CHECK-LABEL`s to make them more robust.

PiperOrigin-RevId: 818670103
2025-10-13 08:41:04 -07:00
Michael Kuperstein
47ec6671b8 [XLA] Remove RunOnModelGroup.
We no longer support module groups with size > 1, so there's no point in supporting the generic RunOnModelGroup. One possible use-case of module groups with size == 1 is to be able to *replace* the module (instead of modifying it in-place). This adds a new interface to support that.

PiperOrigin-RevId: 818665806
2025-10-13 08:26:53 -07:00
Alex
c88c02e21b PR #32475: [ROCm] Prepare asan builds to be rbe compatible, include sanitizer ignore lists as data dpependency
Imported from GitHub PR https://github.com/openxla/xla/pull/32475

📝 Summary of Changes
Make asan builds hermetic so they can be used with rbe

🎯 Justification
Add sanitizer ignore lists as a dependency to run_under script so they are available in rbe worker

🚀 Kind of Contribution
🐛 Bug Fix, ♻️ Cleanup

📊 Benchmark (for Performance Improvements)
not relevant

🧪 Unit Tests:
not relevant

🧪 Execution Tests:
not relevant

Copybara import of the project:

--
cae2ea8d4808c161becb80602fba605ba08a4bd5 by Alexandros Theodoridis <atheodor@amd.com>:

Adjust ci script to include asan ignore list as deps

Merging this change closes #32475

PiperOrigin-RevId: 818658730
2025-10-13 08:19:23 -07:00
gaurides
d39d15f929 PR #32357: Update gemma2 keras benchmark script - fix ttft, and use tokenizer
Imported from GitHub PR https://github.com/openxla/xla/pull/32357

📝 Summary of Changes

1. Update calculation for TTFT to be the time to first generated token. This will also impact TPOT calculations.
2. Use tokenizer to count the number of tokens generated instead of counting words using space

🎯 Justification
Currently the script computes TTFT as the time to the first token, which comes from the prompt and is still in the prefill stage.

🚀 Kind of Contribution
🐛 Bug Fix
Copybara import of the project:

--
25178775f936a6f40a205e6969582222f150f0dd by Gauri Deshpande <gauri1.deshpande@intel.com>:

Update gemma2 keras benchmark script - fix ttft, and use tokenizer

--
9b20ead588ad38152e648067c7d34314ba8a5645 by Gauri Deshpande <gauri1.deshpande@intel.com>:

address review comments

Merging this change closes #32357

PiperOrigin-RevId: 818656164
2025-10-13 08:08:09 -07:00
Henning Becker
a3e9afb2e8 Move custom call handler resolution into CustomCallThunk
For a custom call we need to resolve a target name (string) to a function pointer. So far this happens in `IrEmitterUnnested` (at the end of the compilation pipeline). But for thunk serialization we need this to happen at runtime (when the thunks are getting reconstructed from the proto representation). Therefore I'm moving this resolving step into the `CustomCallThunk` factory function.

Note that there remains a way to construct a `CustomCallThunk` from just a function pointer. These thunks will not be serializable and that's okay. The logic handles these cases and returns an error. It is important for tests to be able to quickly create a CustomCallThunk from a closure. If we had to register these calls in the registry first it would complicate our tests significantly.

In detail this change entails:
- Move resolver logic in new overloads of the `CustomCallThunk::Create` factory function.
- Call these overloads from `IrEmitterUnnested` and the custom kernel fusion emitter.
- Add tests for the new overloads
- Migrate some tests to FFI (the new custom call registry and API)
- Adjust some error codes in custom call tests. (If a custom call is not found, `kNotFound` is now returned instead of `kUnimplemented`.)
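The resolution step that moves into the factory can be sketched with a hypothetical registry (names and signatures here are illustrative, not the real XLA custom call or FFI API):

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical target signature; real custom call targets differ.
using CustomCallTarget = int (*)(int);

// Hypothetical process-wide registry mapping target names to functions.
std::map<std::string, CustomCallTarget>& Registry() {
  static std::map<std::string, CustomCallTarget> registry;
  return registry;
}

// Sketch of the resolution that now happens in the thunk factory, i.e. at
// runtime when thunks are reconstructed from their proto representation.
// Returns nullptr for a missing target (the real code returns a kNotFound
// status instead).
CustomCallTarget Resolve(const std::string& target_name) {
  auto it = Registry().find(target_name);
  return it == Registry().end() ? nullptr : it->second;
}
```

Thunks built directly from a function pointer (as tests do with closures) skip this lookup and simply aren't serializable.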

PiperOrigin-RevId: 818655300
2025-10-13 07:52:19 -07:00
Hyeontaek Lim
3e425f3c44 [IFRT] Migrate Array::pjrt_layout() callers to interpret nullptr as a default layout
`Array::pjrt_layout()` will be changed to return `nullptr` to indicate a default layout, where the callers can obtain the corresponding concrete default layout by using `Client::GetDefaultPjRtLayout()`.

This change adds `nullptr` handling preemptively before the new `Array::pjrt_layout()` semantics becomes effective so that the existing code works as before.

Tests using `Array::pjrt_layout()` method calls are minimally updated to add a non-nullness check. They will be updated as `Array::pjrt_layout()` actually returns `nullptr`.
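The caller-side pattern this migration introduces can be sketched as follows (simplified stand-in types, not the real IFRT classes):

```cpp
#include <cassert>
#include <memory>
#include <string>

// Simplified stand-ins; names follow the commit message but the shapes of
// these classes are illustrative only.
struct PjRtLayout {
  std::string description;
};

struct Client {
  std::shared_ptr<const PjRtLayout> GetDefaultPjRtLayout() const {
    return std::make_shared<const PjRtLayout>(PjRtLayout{"default"});
  }
};

struct Array {
  // After the migration, nullptr here means "the default layout".
  std::shared_ptr<const PjRtLayout> layout;
  std::shared_ptr<const PjRtLayout> pjrt_layout() const { return layout; }
};

// The nullptr-handling pattern callers adopt preemptively: treat a null
// layout as the default and fall back to the client's default layout.
std::shared_ptr<const PjRtLayout> ResolveLayout(const Array& array,
                                                const Client& client) {
  std::shared_ptr<const PjRtLayout> layout = array.pjrt_layout();
  if (layout == nullptr) {
    layout = client.GetDefaultPjRtLayout();
  }
  return layout;
}
```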

PiperOrigin-RevId: 818618831
2025-10-13 06:06:16 -07:00
Eusebio Durán Montaña
adf4c2bf01 Split ConvolutionReorderThunk into its own file
There's no shared logic in between the `ConvolutionReorderThunk` and the `ConvolutionThunk` thunk, so I think its cleaner for each to be defined in their own file.

PiperOrigin-RevId: 818613921
2025-10-13 05:50:34 -07:00
Greg Olechwierowicz
c359abbbdf [XLA:GPU] Add xla:friends to GPU transforms.
PiperOrigin-RevId: 818603646
2025-10-13 05:10:30 -07:00
Eusebio Durán Montaña
607bca2e9c Add proto (de)serialisation for ConvolutionThunk
PiperOrigin-RevId: 818597563
2025-10-13 04:46:06 -07:00
Kostiantyn Liepieshov
f0cea5d779 Allow ifrt::CopyArraysOp as IFRT IR program outputs.
The output of an IFRT IR program can be a direct argument (`mlir::BlockArgument`), the output of an `xla::ifrt::CallLoadedExecutableOp`, or the output of an `ifrt::CopyArraysOp`.

PiperOrigin-RevId: 818592194
2025-10-13 04:27:03 -07:00
A. Unique TensorFlower
b048c10512 Automated Code Change
PiperOrigin-RevId: 818581950
2025-10-13 04:04:36 -07:00
Dragan Mladjenovic
cab160061e PR #32504: [ROCm] Remove rocm_diagnostics
Imported from GitHub PR https://github.com/openxla/xla/pull/32504

📝 Summary of Changes
Remove rocm_diagnostics.cc

🎯 Justification
The RocmDiagnostics module never worked and provides no meaningful information to the user.

🚀 Kind of Contribution
♻️ Cleanup

📊 Benchmark (for Performance Improvements)
N/A

🧪 Unit Tests:
None

🧪 Execution Tests:
None

Copybara import of the project:

--
73c4357ea80c720e2e46ddc0f91c8943e571b1ca by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

[ROCm] Remove rocm_diagnostics

Merging this change closes #32504

PiperOrigin-RevId: 818581529
2025-10-13 03:57:36 -07:00
Eusebio Durán Montaña
aae796c51f Add proto (de)serialisation for GpuConvDescriptor
We need `GpuConvDescriptor` to be serializable to be able to add (de)serialisation for the `ConvolutionThunk`.

PiperOrigin-RevId: 818581222
2025-10-13 03:48:05 -07:00
Adam Paszke
0a9ce803d7 [XLA:GPU] Make sure to insert parameter copies if they feed Mosaic GPU collectives
Otherwise the kernels don't actually get the operands in symmetric memory and can crash.

PiperOrigin-RevId: 818560766
2025-10-13 02:41:38 -07:00
A. Unique TensorFlower
540e6982c0 Automated Code Change
PiperOrigin-RevId: 818508156
2025-10-13 00:18:46 -07:00
Henning Becker
ff3c33ab71 Remove unused nvtx_utils target
The code is not in use anymore and hasn't been for quite a while. Let's remove it.

PiperOrigin-RevId: 818506241
2025-10-13 00:02:46 -07:00
Henning Becker
993e1f75ad Remove MLIR-related unused code from CollectiveThunk
The removed data fields are not used and always initialized from a nullptr.
The removed functions are not called from anywhere. So let's remove all of that.

PiperOrigin-RevId: 818504444
2025-10-12 23:50:51 -07:00
A. Unique TensorFlower
f37f60d618 Automated Code Change
PiperOrigin-RevId: 818504106
2025-10-12 23:44:43 -07:00
A. Unique TensorFlower
a142538d86 Automated Code Change
PiperOrigin-RevId: 818504078
2025-10-12 23:35:12 -07:00
A. Unique TensorFlower
c52198581e Automated Code Change
PiperOrigin-RevId: 818496642
2025-10-12 23:01:57 -07:00
Alexander Belyaev
bbde4992ae [XLA:GPU] Remove unused mlir::Value fields in CollectiveThunk::Buffer.
PiperOrigin-RevId: 818477433
2025-10-12 21:48:19 -07:00
A. Unique TensorFlower
9cefa031f1 Support sinking all-reduce feeding an all-gather.
PiperOrigin-RevId: 818471462
2025-10-12 21:23:26 -07:00
A. Unique TensorFlower
a9ac35b8f5 - Don't move host offloading annotations in licm.
- Allow loop-related instructions between DS and host offloading annotations.

PiperOrigin-RevId: 818469793
2025-10-12 21:12:43 -07:00
A. Unique TensorFlower
7aacc01270 Replace a chain of transpose/reshape ops with a nop if the composite transpose permutation is the identity
PiperOrigin-RevId: 818461965
2025-10-12 20:54:18 -07:00
Aiden Grossman
80bc8984ea Use triple overload of lookupTarget
The overload accepting a llvm::StringRef will be removed when LLVM 22 branches.

PiperOrigin-RevId: 818460238
2025-10-12 20:42:42 -07:00
Eugene Zhulenev
0c7c2289a9 [tsl:concurrency] Don't submit tasks to executor if Future::Map result is unused
PiperOrigin-RevId: 818444632
2025-10-12 19:51:40 -07:00
A. Unique TensorFlower
5af93d1725 Automated Code Change
PiperOrigin-RevId: 818400843
2025-10-12 16:16:53 -07:00
A. Unique TensorFlower
dae3e53816 [XLA] Remove a redundant acosh_complex_f32 legalization test. The same block of lines exists in lines 467-617.
PiperOrigin-RevId: 818362259
2025-10-12 12:43:16 -07:00
A. Unique TensorFlower
bfaa51810d Automated Code Change
PiperOrigin-RevId: 818299030
2025-10-12 06:43:47 -07:00
Terry Sun
499c00e520 PR #31795: [GPU] Assign default color to tuples
Imported from GitHub PR https://github.com/openxla/xla/pull/31795

📝 Summary of Changes
[Downstream check](https://github.com/openxla/xla/blob/main/xla/pjrt/pjrt_executable.cc#L288-L299) assumes tuples are on the default memory space; force-assigning the default color to tuples gets around the check.

🎯 Justification
NCCL user buffer runs are crashing on MaxText main. This PR fixes the crash.

🚀 Kind of Contribution
🐛 Bug Fix

📊 Benchmark (for Performance Improvements)
N/A.

🧪 Unit Tests:
Existing unit tests.

🧪 Execution Tests:
Added multiple execution tests.

Copybara import of the project:

--
c60fe9d62827596eac57df2b480891520b40ab07 by Terry Sun <tesun@nvidia.com>:

assign default color to tuples

--
717412a55a94be71afcbb7627f03905c408f8b6a by Terry Sun <tesun@nvidia.com>:

add constant and polish doc string

--
c907b2d1ca5a62299b6bfd2534e99c6215313ffd by Terry Sun <tesun@nvidia.com>:

update test

Merging this change closes #31795

PiperOrigin-RevId: 818295813
2025-10-12 06:26:13 -07:00
Oleg Shyshkov
20e2d74013 [XLA:GPU] Add collective kind to rendezvous name in CollectiveThunk.
This makes the rendezvous name more informative by including the specific type of collective operation, which helps in debugging.

PiperOrigin-RevId: 818215415
2025-10-11 23:40:18 -07:00
Eugene Zhulenev
e4aca637fe [tsl:concurrency] Make it always safe to run AsyncValue waiters on executor
PiperOrigin-RevId: 818197306
2025-10-11 22:02:09 -07:00
A. Unique TensorFlower
19b9c603a7 Allow annotations to be appended or prepended in AnnotatedUserContext.
For some annotations, it is more user-friendly to print the annotation
before the annotated context. Add an option for this and remove the
automatic "; " delimiter to allow more flexibility in the annotation
format.

PiperOrigin-RevId: 818118289
2025-10-11 15:56:16 -07:00
Oleg Shyshkov
0721938cea [XLA:GPU] Only set channel id if it is present in the original instruction.
PiperOrigin-RevId: 818048132
2025-10-11 09:31:53 -07:00
Kanish Anand
d979afadb7 Add Proto suffix to proto types to avoid name conflicts with the corresponding C++ types, as both are under the xla namespace.
#hloshardingv3

PiperOrigin-RevId: 818005071
2025-10-11 05:41:22 -07:00
Jian Cai
05715ab5d2 [XLA][Numerics][HLO Value Tracking] Create an original value for compiler-inserted tuples during fusion
This also removes the code that handles original values in MergeFusionInstructionIntoMultiOutput, as it eventually calls into HloCallableInstruction::CloneAndAppendInstructionIntoCalledComputation to create a tuple result and the corresponding original value.

PiperOrigin-RevId: 817944313
2025-10-11 00:32:12 -07:00
Parker Schuh
738dacb09b Update TfrtGpuExecutable::ExecuteHelper to force untupling to at least be true whenever the result is a tuple (to match other backends).
PiperOrigin-RevId: 817897232
2025-10-10 20:46:47 -07:00
Hyeontaek Lim
6c026def7a [IFRT] Migrate Array::pjrt_layout() callers to interpret nullptr as a default layout
`Array::pjrt_layout()` will be changed to return `nullptr` to indicate a default layout, where the callers can obtain the corresponding concrete default layout by using `Client::GetDefaultPjRtLayout()`.

This change adds `nullptr` handling preemptively before the new `Array::pjrt_layout()` semantics becomes effective so that the existing code works as before.

Tests using `Array::pjrt_layout()` method calls are minimally updated to add a non-nullness check. They will be updated as `Array::pjrt_layout()` actually returns `nullptr`.

PiperOrigin-RevId: 817893146
2025-10-10 20:27:55 -07:00
Eugene Zhulenev
99c5f51e9b [tsl] Add AsExecutor() adaptor to tsl::thread::ThreadPool
It's hard to construct a ThreadPoolExecutor from a ThreadPool& reference and correctly manage the lifetime of the executor. Instead, make it possible to get a tsl::Executor adaptor from a ThreadPool instance.
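A minimal sketch of the adaptor idea (hypothetical simplified types; the real tsl classes differ): the pool hands out a non-owning executor view whose validity is trivially tied to the pool's lifetime.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Illustrative executor interface.
class Executor {
 public:
  virtual ~Executor() = default;
  virtual void Execute(std::function<void()> task) = 0;
};

class ThreadPool {
 public:
  // Runs tasks somewhere; inline here to keep the sketch self-contained.
  void Schedule(std::function<void()> task) { task(); }

  // Non-owning adaptor: valid as long as the pool outlives it, so there is
  // no separate executor lifetime to manage.
  class ExecutorAdaptor : public Executor {
   public:
    explicit ExecutorAdaptor(ThreadPool* pool) : pool_(pool) {}
    void Execute(std::function<void()> task) override {
      pool_->Schedule(std::move(task));
    }

   private:
    ThreadPool* pool_;
  };

  ExecutorAdaptor AsExecutor() { return ExecutorAdaptor(this); }
};
```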

PiperOrigin-RevId: 817872041
2025-10-10 19:11:47 -07:00
Parker Schuh
0573fab4a4 Implement CreateLinkedEventPromise() for PjRtStreamExecutorClient.
PiperOrigin-RevId: 817869433
2025-10-10 18:58:54 -07:00
Buddh Prakash
9a534e7b9d Introduce XLA_[Q|D]CHECK macros for Check failures.
These macros are carefully designed to
1. have the exact same API as absl::CHECKs
2. produce error messages and content in exactly the same format as absl::CHECK's.
3. respect absl flags e.g. ABSL_MIN_LOG_LEVEL

They only differ from absl::CHECK's in that, for Check failures, they
1. prepend an error code and append a link to the openxla.org webpage to the error message
2. additionally, append DebugMeContext information if available.
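The decoration described in points 1-2 can be sketched roughly like this (a hypothetical helper, not the actual macro implementation):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical sketch of the failure-message decoration described above:
// prepend an error code, append any available DebugMeContext information,
// and append a link to the openxla.org webpage. The real macros wrap the
// absl CHECK machinery and keep its message format.
std::string DecorateCheckFailure(const std::string& error_code,
                                 const std::string& check_message,
                                 const std::string& debug_me_context) {
  std::ostringstream os;
  os << error_code << ": " << check_message;
  if (!debug_me_context.empty()) {
    os << "; DebugMeContext: " << debug_me_context;
  }
  os << " (see https://openxla.org/)";
  return os.str();
}
```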

PiperOrigin-RevId: 817868212
2025-10-10 18:52:47 -07:00
Maxim Ermilov
25554481df bump rules_ml_toolchain version
PiperOrigin-RevId: 817856738
2025-10-10 18:13:51 -07:00
Gunhyun Park
7f6b55adaf Fix bug in IsInteger condition.
PiperOrigin-RevId: 817854819
2025-10-10 18:04:16 -07:00
Sean Talts
c79f78b706 [XLA:CPU] Emit basic aarch64 and x86 versions in cc_to_llvm_ir.
PiperOrigin-RevId: 817843552
2025-10-10 17:42:48 -07:00
Eugene Zhulenev
1783c8cccf [xla:ifrt] Replace Future::OnReady with Future::Map
PiperOrigin-RevId: 817838912
2025-10-10 17:30:48 -07:00
Eugene Zhulenev
8151205458 [tsl:concurrency] Add Future::map overrides that can take Executor to run the map functor
PiperOrigin-RevId: 817835460
2025-10-10 17:16:13 -07:00
Eugene Zhulenev
eba90df183 [xla:cpu] Create xla::cpu::BufferAllocationInfo from deprecated cpu_function_runtime::BufferInfo
PiperOrigin-RevId: 817834097
2025-10-10 17:03:10 -07:00
Eugene Zhulenev
c9269c03de [xla:cpu] Delete unused APIs from cpu_function_runtime
PiperOrigin-RevId: 817825043
2025-10-10 16:37:38 -07:00
Yurii Topin
48dc889ddd SYCL is built by default using hermetic Clang.
GCC is no longer supported for SYCL builds.

PiperOrigin-RevId: 817821428
2025-10-10 16:26:39 -07:00
Subhankar Shah
10e6b9eaee [XLA:MSA] Make all block-prefetch tests more readable by refactoring common elements.
PiperOrigin-RevId: 817818480
2025-10-10 16:20:37 -07:00
A. Unique TensorFlower
28fd47ca0d #HLODiffService Add forced mapping options to HLO Diff backend.
PiperOrigin-RevId: 817818436
2025-10-10 16:10:53 -07:00
Eugene Zhulenev
1ae52eb0f8 [tsl:concurrency] Add Future::OnReady overrides that can take Executor to run the callback
name                        cpu/op        cpu/op      vs base
BM_CreateOkFuture           1.784n ± 0%   1.860n ± 1%  +4.26% (p=0.000 n=40)
BM_CopyFuture               1.737n ± 0%   1.727n ± 0%  -0.58% (p=0.001 n=40)
BM_MapStatelessFuture       14.29n ± 0%   14.27n ± 0%       ~ (p=0.283 n=40)
BM_TryMapStatelessFuture    14.31n ± 0%   14.25n ± 0%       ~ (p=0.062 n=40)
BM_MapToFromStatelessFuture 14.43n ± 0%   14.08n ± 1%  -2.44% (p=0.000 n=40)
BM_MapStatefulFuture        14.55n ± 0%   14.51n ± 0%       ~ (p=0.607 n=40)
BM_TryMapStatefulFuture     14.54n ± 0%   14.49n ± 1%       ~ (p=0.405 n=40)
geomean                     7.908n        7.908n       -0.00%

PiperOrigin-RevId: 817817852
2025-10-10 15:59:58 -07:00
Eugene Zhulenev
9f772d8449 [tsl:concurrency] Optimize Future::Map for ready futures
```
name                        cpu/op        cpu/op      vs base
BM_CreateOkFuture           1.948n ± 0%   1.627n ± 0%  -16.50% (n=80)
BM_CopyFuture               1.946n ± 0%   1.625n ± 0%  -16.52% (n=80)
BM_MapStatelessFuture       37.16n ± 0%   15.90n ± 0%  -57.21% (n=80)
BM_TryMapStatelessFuture    36.03n ± 1%   15.88n ± 0%  -55.93% (n=80)
BM_MapToFromStatelessFuture 38.04n ± 0%   15.91n ± 0%  -58.19% (n=80)
BM_MapStatefulFuture        38.43n ± 0%   16.45n ± 0%  -57.18% (n=80)
BM_TryMapStatefulFuture     37.50n ± 0%   16.45n ± 0%  -56.13% (n=80)
geomean                     16.08n        8.368n       -47.97%
```
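The benchmark deltas above come from a fast path roughly like the following (a toy future, not the real tsl::Future): when the future is already ready, `Map` applies the functor inline instead of registering a waiter.

```cpp
#include <cassert>
#include <optional>
#include <utility>

// Toy future used only to illustrate the ready-future fast path.
template <typename T>
class ToyFuture {
 public:
  ToyFuture() = default;                                     // pending
  explicit ToyFuture(T value) : value_(std::move(value)) {}  // ready

  bool IsReady() const { return value_.has_value(); }
  const T& value() const { return *value_; }

  template <typename F>
  auto Map(F&& f) -> ToyFuture<decltype(f(std::declval<const T&>()))> {
    using U = decltype(f(std::declval<const T&>()));
    if (IsReady()) {
      // Fast path: apply the functor inline; no waiter is allocated and no
      // shared state is touched.
      return ToyFuture<U>(f(*value_));
    }
    // Slow path (elided in this sketch): register a waiter that runs `f`
    // once the producer sets the value.
    return ToyFuture<U>();
  }

 private:
  std::optional<T> value_;
};
```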

PiperOrigin-RevId: 817788917
2025-10-10 14:46:39 -07:00
Junwhan Ahn
2dafd2aff3 Avoid changing the MLIR context of an IFRT IR program during compilation if the program does not exclusively own the context
PiperOrigin-RevId: 817763225
2025-10-10 13:25:59 -07:00
Junwhan Ahn
f41c626f12 Use xla::ifrt::HloSharding for executable outputs
PiperOrigin-RevId: 817726203
2025-10-10 11:42:51 -07:00
A. Unique TensorFlower
c2974baa85 [IFRT] Expand visibility of mpmd_executable.
Adds `friends`, `internal`, and `users` to the visibility list for the `mpmd_executable` target.

PiperOrigin-RevId: 817713192
2025-10-10 11:21:50 -07:00
Yun Peng
134503de2b Support building XLA with Bzlmod
- Added MODULE.bazel and module extensions to introduce external dependencies with Bzlmod.
- Added a CI config for Linux CPU build with Bzlmod (enabled by `--config=bzlmod`)

TODOs:
- Support overriding Python runtime
- Support build with more configs and platforms
PiperOrigin-RevId: 817711851
2025-10-10 11:09:33 -07:00
Eugene Zhulenev
7fd35dfa02 [tsl:concurrency] Add ABSL_ATTRIBUTE_ALWAYS_INLINE to Future::{Map,OnReady}
```
name                        cpu/op        cpu/op      vs base
BM_MapStatelessFuture       46.03n ± 1%   36.93n ± 0%  -19.76% (p=0.000 n=40)
BM_TryMapStatelessFuture    46.15n ± 1%   36.00n ± 1%  -22.01% (p=0.000 n=40)
BM_MapToFromStatelessFuture 49.54n ± 0%   39.19n ± 0%  -20.90% (p=0.000 n=40)
BM_MapStatefulFuture        45.88n ± 0%   39.97n ± 3%  -12.87% (p=0.000 n=40)
BM_TryMapStatefulFuture     52.14n ± 0%   36.90n ± 0%  -29.24% (p=0.000 n=40)
```

PiperOrigin-RevId: 817699788
2025-10-10 10:38:14 -07:00
Eusebio Durán Montaña
6719e41da5 Add function to convert ConvolutionKind proto to the c++ enum
For the `ConvolutionThunk` (de)serialisation we need to make the `GpuConvDescriptor` serializable, and for that we need `CudnnConvKind` too.

A couple additional changes:

* Renamed the existing C++-enum-to-proto-enum conversion to a (hopefully) more readable name.
* Enforce that all C++ enums can be mapped to the proto version at compile time. I can't think of a case where we wouldn't want this, and with this change we can get rid of some non-ok Status invariants.
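The compile-time guarantee can be sketched with an exhaustive switch (enumerator names are illustrative, not the real CudnnConvKind or proto definitions): with `-Werror=switch`, adding a C++ enumerator without extending the mapping fails to compile, so no non-ok Status path is needed.

```cpp
#include <cassert>
#include <cstdlib>

// Illustrative enumerators only.
enum class CudnnConvKind { kForward, kBackwardInput, kBackwardFilter };
enum class ConvolutionKindProto {
  CONV_FORWARD,
  CONV_BACKWARD_INPUT,
  CONV_BACKWARD_FILTER,
};

ConvolutionKindProto ToProto(CudnnConvKind kind) {
  // No default case: the switch must stay exhaustive, which is what makes
  // the C++ -> proto mapping a compile-time invariant.
  switch (kind) {
    case CudnnConvKind::kForward:
      return ConvolutionKindProto::CONV_FORWARD;
    case CudnnConvKind::kBackwardInput:
      return ConvolutionKindProto::CONV_BACKWARD_INPUT;
    case CudnnConvKind::kBackwardFilter:
      return ConvolutionKindProto::CONV_BACKWARD_FILTER;
  }
  std::abort();  // Unreachable for valid enumerator values.
}
```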

PiperOrigin-RevId: 817676211
2025-10-10 09:43:00 -07:00
Shaogang Wang
12bb0a674b PR #28740: [XLA:GPU] Lowering dynamic update slice thunk into command buffer if it depends on loop iteration.
Imported from GitHub PR https://github.com/openxla/xla/pull/28740

This PR tries to lower DynamicSliceThunk into a command buffer, even if it depends on the loop iteration.

The command buffer implementation uses the same approach as DynamicSliceThunk (an HloEvaluator to compute new allocations at runtime) to get the sliced allocations; for each iteration, the CommandBuffer uses the HloEvaluator to get the new addresses and updates the graph with them.

The major change to the custom.cc file resolves the issue that once a module has been processed by the command buffer scheduler, it is rewritten into nested calls, which breaks the while-loop analysis and module extraction patterns. The fix introduces a cloned inlined module and performs the loop analysis and module extraction on it.

Copybara import of the project:

--
2fe7c75a9fcbc9ade65f5a275aba3a2bc996ba07 by Shawn Wang <shawnw@nvidia.com>:

add debug information for command_buffer_conversion_pass

--
88183dd7dc53c2bdc80f3a664a99b50e275311b2 by Shawn Wang <shawnw@nvidia.com>:

Lower dynamic update slice thunk into command buffer when its offset
value depends on loop iteration.

--
3cf46be90b3be2185f0b5106ea9eeaa45b088601 by Shawn Wang <shawnw@nvidia.com>:

fix

--
45b31f69f9299a13bac24a966625190c9e90c91e by Shawn Wang <shawnw@nvidia.com>:

fix

--
ce3af2b9b131c9902b45d6d9934424d861656d32 by Shawn Wang <shawnw@nvidia.com>:

fix

--
a7fc4ab02b5d7dec6d337fcc57bbfd38a3b205ed by Shawn Wang <shawnw@nvidia.com>:

fix

--
73784aa6530244559c1530b2f922cf81c6d43822 by Shawn Wang <shawnw@nvidia.com>:

change to gemm command for test

--
64b1cf454fc360bcc3255f29bd27c01799537e07 by Shawn Wang <shawnw@nvidia.com>:

fix

--
0a3d7a1b6c142a3c9aa2b299d902520ed7f91515 by Shawn Wang <shawnw@nvidia.com>:

clang format

--
3105ce82fa3751d73d41b0564402e108328ea147 by Shawn Wang <shawnw@nvidia.com>:

fix

--
85ce21672052c4bbfd50db54248dbe1ae2494230 by Shawn Wang <shawnw@nvidia.com>:

fix

Merging this change closes #28740

PiperOrigin-RevId: 817644265
2025-10-10 08:08:17 -07:00
Karlo Basioli
49b26980d7 [XLA:GPU][codegen] Separate out emitting shared dialect and lowering to triton as APIs used for testing.
This change also migrates some device tests to use the API.

PiperOrigin-RevId: 817624248
2025-10-10 07:00:46 -07:00
Adrian Kuegel
03ed995f8a [XLA:GPU] Allow unrolling for ReduceWindow with small window.
Experiments show that this still performs better than not unrolling.
Also rename the method MayPreventVectorization as the naming is misleading. The
other logic makes sure that we can at least vectorize the stores, so this
function should check whether there is an expected performance drop due to
unrolling, not whether we may be able to vectorize loads.

PiperOrigin-RevId: 817621544
2025-10-10 06:47:23 -07:00
Kanish Anand
7bf89c1611 Introduce NamedSharding to OpSharding proto and HloSharding.
#hloshardingv3

PiperOrigin-RevId: 817614842
2025-10-10 06:31:34 -07:00
Hyeontaek Lim
e74f6c1368 [IFRT] Migrate Array::pjrt_layout() callers to interpret nullptr as a default layout
`Array::pjrt_layout()` will be changed to return `nullptr` to indicate a default layout, where the callers can obtain the corresponding concrete default layout by using `Client::GetDefaultPjRtLayout()`.

This change adds `nullptr` handling preemptively before the new `Array::pjrt_layout()` semantics becomes effective so that the existing code works as before.

Tests using `Array::pjrt_layout()` method calls are minimally updated to add a non-nullness check. They will be updated as `Array::pjrt_layout()` actually returns `nullptr`.

PiperOrigin-RevId: 817600994
2025-10-10 05:56:10 -07:00
Dimitris Vardoulakis
d0850b6100 PR #32229: Changes to the compute capabilities to account for the two different Blackwell Edge GPUs.
Imported from GitHub PR https://github.com/openxla/xla/pull/32229

RTX PRO 6000 has CC 12.0.
Spark has CC 12.1.

Removed the IsAtLeastBlackwellPro method because there is no guarantee that future data center GPUs will have CC higher than 12.0.

Also skipped the latency estimator test on Edge GPUs because it uses the collective performance model and crashes here:
784702574e/xla/service/gpu/model/gpu_collective_performance_model.cc (L239)
Copybara import of the project:

--
ca47c656de78f8c5385dcf77b7454d7adc774203 by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:

Some Spark fixes. Rename kBlackwellPro to kBlackwell12, as the sm_12x compute capabilities also include Spark.

Fix the latency estimator test and the gemm fusion autotuner test for Spark.

Removed the IsAtLeastBlackwellPro method because there is no guarantee that future
data center GPUs will have CC higher than 12.0.

Merging this change closes #32229

PiperOrigin-RevId: 817600860
2025-10-10 05:34:49 -07:00
Sergey Kozub
e1f227f289 PR #32525: [XLA:GPU] Fix block scaled dot global scaling for older cuDNN versions
Imported from GitHub PR https://github.com/openxla/xla/pull/32525

📝 Summary of Changes
Pass cuDNN version to the `BlockScalingRewriter` pass, and make lowering decisions based on that.

🎯 Justification
The global scaling factor doesn't work before cuDNN v9.13 (the graph is compiled, but the scaling factor is not applied).
Use the slower lowering (apply global scaling factor outside the fusion) in this case.

🚀 Kind of Contribution
🐛 Bug Fix

Copybara import of the project:

--
a47ef5175d076270e371c9e5cf355fc1ad96efc8 by Sergey Kozub <skozub@nvidia.com>:

[XLA:GPU] Fix block scaled dot global scaling for older cuDNN versions

Merging this change closes #32525

PiperOrigin-RevId: 817592016
2025-10-10 05:03:09 -07:00
Marcin Radomski
9ba0ed96cf [XLA:GPU] Add SdcLog::ReadProto
A helper that does `SdcLog::ReadFromDevice` and returns the result as
`SdcLogProto`. The proto will be dumped to log directory for debugging.

PiperOrigin-RevId: 817587228
2025-10-10 04:50:12 -07:00
Henning Becker
fb63bc5217 Sort dumped execution files before checking.
The order of files returned by `GetMatchingPaths` is not guaranteed, so sorting ensures deterministic test behavior and should get rid of the flakiness of the test.

Also a tiny assertion cleanup for better error messages.
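The fix amounts to the usual pattern, sketched here with a plain helper (hypothetical name):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sketch: APIs like GetMatchingPaths return files in an unspecified order,
// so sort before making positional assertions on the dumped files in tests.
std::vector<std::string> SortedDumpPaths(std::vector<std::string> paths) {
  std::sort(paths.begin(), paths.end());
  return paths;
}
```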

PiperOrigin-RevId: 817564449
2025-10-10 03:29:25 -07:00
Sohaib Iftikhar
197b1c2454 [XLA:GPU]: Add support for loading HLO directly from profiler to graphviz
No change to the OSS version for this tool.

PiperOrigin-RevId: 817546670
2025-10-10 02:44:46 -07:00
Eusebio Durán Montaña
534a24b7a8 Move GpuConvConfig creation to the ConvolutionThunk constructor
To (de)serialize this thunk, we'll be using the `GpuConvDescriptor` instead of the `GpuConvConfig`, since it's easier to serialize (most of the config fields actually get populated during execution).

So we move the creation to the Thunk, so that in the next CL we can also store the descriptor to use for (de)serialisation. I didn't add the `GpuConvDescriptor descriptor_` field in this CL, since it's technically not needed yet.

PiperOrigin-RevId: 817523742
2025-10-10 01:07:40 -07:00
Hyeontaek Lim
4bf3e73aa9 [IFRT] Migrate Array::pjrt_layout() callers to interpret nullptr as a default layout
`Array::pjrt_layout()` will be changed to return `nullptr` to indicate a default layout, where the callers can obtain the corresponding concrete default layout by using `Client::GetDefaultPjRtLayout()`.

This change adds `nullptr` handling preemptively before the new `Array::pjrt_layout()` semantics becomes effective so that the existing code works as before.

Tests using `Array::pjrt_layout()` method calls are minimally updated to add a non-nullness check. They will be updated as `Array::pjrt_layout()` actually returns `nullptr`.

PiperOrigin-RevId: 817516042
2025-10-10 00:37:53 -07:00
A. Unique TensorFlower
4c9b4048f9 Automated Code Change
PiperOrigin-RevId: 817506473
2025-10-10 00:01:31 -07:00
Joshua Lang
3e05740a3d Increase timeout for gpu_spmd_e2e_compile_test.
PiperOrigin-RevId: 817494720
2025-10-09 23:14:58 -07:00
A. Unique TensorFlower
103b3b5fe1 Automated Code Change
PiperOrigin-RevId: 817489367
2025-10-09 23:05:29 -07:00
A. Unique TensorFlower
09f731ce0d Automated Code Change
PiperOrigin-RevId: 817488802
2025-10-09 22:53:48 -07:00
Junwhan Ahn
c868704652 Reverts 6cb439dba1
PiperOrigin-RevId: 817470084
2025-10-09 21:34:21 -07:00