Imported from GitHub PR https://github.com/openxla/xla/pull/32838
📝 Summary of Changes
The fallback logic now correctly identifies the highest known compatible architecture when given an unknown architecture as input.
🎯 Justification
Previously the logic would propose an incompatible architecture in this case.
🚀 Kind of Contribution
🐛 Bug Fix
🧪 Unit Tests:
Added a new test case showing the previously-failing case (it used to propose `sm_110`)
Copybara import of the project:
--
f060bb9837d72159343ff2d52f5f2f42b1b7e9a4 by Olli Lupton <olupton@nvidia.com>:
Fix family-conditional logic
--
fc44dcd1e76da67c0b6fe53c33d2a571c3a6ff50 by Olli Lupton <olupton@nvidia.com>:
Accept CR suggestion
Merging this change closes #32838
PiperOrigin-RevId: 822284790
Imported from GitHub PR https://github.com/openxla/xla/pull/32960
📝 Summary of Changes
This (partially) upstreams changes from https://github.com/ROCm/xla/pull/323, 9d358b9b26, and https://github.com/ROCm/xla/pull/385, skipping some ASan/TSan changes for now.
🎯 Justification
These changes are ROCm-specific and help with ROCm's internal CI validation pipelines.
🚀 Kind of Contribution
🐛 Bug Fix, ♻️ Cleanup, 🧪 Tests
📊 Benchmark (for Performance Improvements)
/
🧪 Unit Tests:
/
🧪 Execution Tests:
/
Copybara import of the project:
--
804ff1b6a6fbba86a3e0a09d739179a4eb4f197d by Milica Makevic <Milica.Makevic@amd.com>:
Add missing cuda-only tag to cuda test
--
44ce7a2d56c9f0c80405447f431ae1e5a33f42e1 by Milica Makevic <Milica.Makevic@amd.com>:
Refactor test scripts
--
fb783c968e9d2ff5d92357908d99e4952235c2bc by Milica Makevic <Milica.Makevic@amd.com>:
Cover more mgpu tests
--
1f53712274f76202241bd3631dbf065826c0b960 by Milica Makevic <Milica.Makevic@amd.com>:
Switch from rocm_gcc to rocm_ci for sgpu tests
--
00e0c8ee2a763680f5a3665dab62202ab230731d by Milica Makevic <Milica.Makevic@amd.com>:
Changing file permissions
--
003c062a8900c12b73c0972e8d406f2661a27aba by Milica Makevic <Milica.Makevic@amd.com>:
Remove unnecessary import
--
214599355f40f1b65e0540daf0b9829d2c950115 by Harsha HS <Harsha.HavanurShamsundara@amd.com>:
Add license header
Merging this change closes #32960
PiperOrigin-RevId: 822245565
Imported from GitHub PR https://github.com/openxla/xla/pull/32846
📝 Summary of Changes
Allow mixed precision collective-permute in the verifier.
🎯 Justification
Partially addresses https://github.com/openxla/xla/issues/32845
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
N/A
🧪 Unit Tests:
Tests that verifier passes on mixed precision collective-permute.
🧪 Execution Tests:
N/A
Copybara import of the project:
--
666c38a19005a609d4a7aa8e5e9b9842b1c87175 by Jaroslav Sevcik <jsevcik@nvidia.com>:
Allow mixed precision for collective permute
Merging this change closes #32846
PiperOrigin-RevId: 822179840
Imported from GitHub PR https://github.com/openxla/xla/pull/32904
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.30.5 to 4.30.9.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/github/codeql-action/releases">github/codeql-action's releases</a>.</em></p>
<blockquote>
<h2>v4.30.9</h2>
<h1>CodeQL Action Changelog</h1>
<p>See the <a href="https://github.com/github/codeql-action/releases">releases page</a> for the relevant changes to the CodeQL CLI and language packs.</p>
<h2>4.30.9 - 17 Oct 2025</h2>
<ul>
<li>Update default CodeQL bundle version to 2.23.3. <a href="https://redirect.github.com/github/codeql-action/pull/3205">#3205</a></li>
<li>Experimental: A new <code>setup-codeql</code> action has been added which is similar to <code>init</code>, except it only installs the CodeQL CLI and does not initialize a database. Do not use this in production as it is part of an internal experiment and subject to change at any time. <a href="https://redirect.github.com/github/codeql-action/pull/3204">#3204</a></li>
</ul>
<p>See the full <a href="https://github.com/github/codeql-action/blob/v4.30.9/CHANGELOG.md">CHANGELOG.md</a> for more information.</p>
<h2>v4.30.8</h2>
<h1>CodeQL Action Changelog</h1>
<p>See the <a href="https://github.com/github/codeql-action/releases">releases page</a> for the relevant changes to the CodeQL CLI and language packs.</p>
<h2>4.30.8 - 10 Oct 2025</h2>
<p>No user facing changes.</p>
<p>See the full <a href="https://github.com/github/codeql-action/blob/v4.30.8/CHANGELOG.md">CHANGELOG.md</a> for more information.</p>
<h2>v4.30.7</h2>
<h1>CodeQL Action Changelog</h1>
<p>See the <a href="https://github.com/github/codeql-action/releases">releases page</a> for the relevant changes to the CodeQL CLI and language packs.</p>
<h2>4.30.7 - 06 Oct 2025</h2>
<ul>
<li>[v4+ only] The CodeQL Action now runs on Node.js v24. <a href="https://redirect.github.com/github/codeql-action/pull/3169">#3169</a></li>
</ul>
<p>See the full <a href="https://github.com/github/codeql-action/blob/v4.30.7/CHANGELOG.md">CHANGELOG.md</a> for more information.</p>
<h2>v3.30.9</h2>
<h1>CodeQL Action Changelog</h1>
<p>See the <a href="https://github.com/github/codeql-action/releases">releases page</a> for the relevant changes to the CodeQL CLI and language packs.</p>
<h2>3.30.9 - 17 Oct 2025</h2>
<ul>
<li>Update default CodeQL bundle version to 2.23.3. <a href="https://redirect.github.com/github/codeql-action/pull/3205">#3205</a></li>
<li>Experimental: A new <code>setup-codeql</code> action has been added which is similar to <code>init</code>, except it only installs the CodeQL CLI and does not initialize a database. Do not use this in production as it is part of an internal experiment and subject to change at any time. <a href="https://redirect.github.com/github/codeql-action/pull/3204">#3204</a></li>
</ul>
<p>See the full <a href="https://github.com/github/codeql-action/blob/v3.30.9/CHANGELOG.md">CHANGELOG.md</a> for more information.</p>
<h2>v3.30.8</h2>
<h1>CodeQL Action Changelog</h1>
<p>See the <a href="https://github.com/github/codeql-action/releases">releases page</a> for the relevant changes to the CodeQL CLI and language packs.</p>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/github/codeql-action/blob/main/CHANGELOG.md">github/codeql-action's changelog</a>.</em></p>
<blockquote>
<h1>CodeQL Action Changelog</h1>
<p>See the <a href="https://github.com/github/codeql-action/releases">releases page</a> for the relevant changes to the CodeQL CLI and language packs.</p>
<h2>[UNRELEASED]</h2>
<p>No user facing changes.</p>
<h2>4.30.9 - 17 Oct 2025</h2>
<ul>
<li>Update default CodeQL bundle version to 2.23.3. <a href="https://redirect.github.com/github/codeql-action/pull/3205">#3205</a></li>
<li>Experimental: A new <code>setup-codeql</code> action has been added which is similar to <code>init</code>, except it only installs the CodeQL CLI and does not initialize a database. Do not use this in production as it is part of an internal experiment and subject to change at any time. <a href="https://redirect.github.com/github/codeql-action/pull/3204">#3204</a></li>
</ul>
<h2>4.30.8 - 10 Oct 2025</h2>
<p>No user facing changes.</p>
<h2>4.30.7 - 06 Oct 2025</h2>
<ul>
<li>[v4+ only] The CodeQL Action now runs on Node.js v24. <a href="https://redirect.github.com/github/codeql-action/pull/3169">#3169</a></li>
</ul>
<h2>3.30.6 - 02 Oct 2025</h2>
<ul>
<li>Update default CodeQL bundle version to 2.23.2. <a href="https://redirect.github.com/github/codeql-action/pull/3168">#3168</a></li>
</ul>
<h2>3.30.5 - 26 Sep 2025</h2>
<ul>
<li>We fixed a bug that was introduced in <code>3.30.4</code> with <code>upload-sarif</code> which resulted in files without a <code>.sarif</code> extension not getting uploaded. <a href="https://redirect.github.com/github/codeql-action/pull/3160">#3160</a></li>
</ul>
<h2>3.30.4 - 25 Sep 2025</h2>
<ul>
<li>We have improved the CodeQL Action's ability to validate that the workflow it is used in does not use different versions of the CodeQL Action for different workflow steps. Mixing different versions of the CodeQL Action in the same workflow is unsupported and can lead to unpredictable results. A warning will now be emitted from the <code>codeql-action/init</code> step if different versions of the CodeQL Action are detected in the workflow file. Additionally, an error will now be thrown by the other CodeQL Action steps if they load a configuration file that was generated by a different version of the <code>codeql-action/init</code> step. <a href="https://redirect.github.com/github/codeql-action/pull/3099">#3099</a> and <a href="https://redirect.github.com/github/codeql-action/pull/3100">#3100</a></li>
<li>We added support for reducing the size of dependency caches for Java analyses, which will reduce cache usage and speed up workflows. This will be enabled automatically at a later time. <a href="https://redirect.github.com/github/codeql-action/pull/3107">#3107</a></li>
<li>You can now run the latest CodeQL nightly bundle by passing <code>tools: nightly</code> to the <code>init</code> action. In general, the nightly bundle is unstable and we only recommend running it when directed by GitHub staff. <a href="https://redirect.github.com/github/codeql-action/pull/3130">#3130</a></li>
<li>Update default CodeQL bundle version to 2.23.1. <a href="https://redirect.github.com/github/codeql-action/pull/3118">#3118</a></li>
</ul>
<h2>3.30.3 - 10 Sep 2025</h2>
<p>No user facing changes.</p>
<h2>3.30.2 - 09 Sep 2025</h2>
<ul>
<li>Fixed a bug which could cause language autodetection to fail. <a href="https://redirect.github.com/github/codeql-action/pull/3084">#3084</a></li>
<li>Experimental: The <code>quality-queries</code> input that was added in <code>3.29.2</code> as part of an internal experiment is now deprecated and will be removed in an upcoming version of the CodeQL Action. It has been superseded by a new <code>analysis-kinds</code> input, which is part of the same internal experiment. Do not use this in production as it is subject to change at any time. <a href="https://redirect.github.com/github/codeql-action/pull/3064">#3064</a></li>
</ul>
<h2>3.30.1 - 05 Sep 2025</h2>
<ul>
<li>Update default CodeQL bundle version to 2.23.0. <a href="https://redirect.github.com/github/codeql-action/pull/3077">#3077</a></li>
</ul>
<h2>3.30.0 - 01 Sep 2025</h2>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="16140ae1a1"><code>16140ae</code></a> Merge pull request <a href="https://redirect.github.com/github/codeql-action/issues/3213">#3213</a> from github/update-v4.30.9-70205d3d1</li>
<li><a href="30db5fee08"><code>30db5fe</code></a> Update changelog for v4.30.9</li>
<li><a href="70205d3d12"><code>70205d3</code></a> Merge pull request <a href="https://redirect.github.com/github/codeql-action/issues/3211">#3211</a> from github/mbg/init/starting-partial-config</li>
<li><a href="697c209bfc"><code>697c209</code></a> Merge remote-tracking branch 'origin/main' into mbg/init/starting-partial-config</li>
<li><a href="1bd53ba38c"><code>1bd53ba</code></a> Merge pull request <a href="https://redirect.github.com/github/codeql-action/issues/3205">#3205</a> from github/update-bundle/codeql-bundle-v2.23.3</li>
<li><a href="cac4df0c79"><code>cac4df0</code></a> Rebuild</li>
<li><a href="77e5c0d0a2"><code>77e5c0d</code></a> Merge branch 'main' into update-bundle/codeql-bundle-v2.23.3</li>
<li><a href="97a4f751be"><code>97a4f75</code></a> Merge pull request <a href="https://redirect.github.com/github/codeql-action/issues/3204">#3204</a> from github/mbg/setup-codeql</li>
<li><a href="2d5512b361"><code>2d5512b</code></a> Merge remote-tracking branch 'origin/main' into mbg/init/starting-partial-config</li>
<li><a href="fa7bdf0559"><code>fa7bdf0</code></a> Call <code>getAnalysisKinds</code> a second time, and ignore exceptions thrown during th...</li>
<li>Additional commits viewable in <a href="3599b3baa1...16140ae1a1">compare view</a></li>
</ul>
</details>
[Dependabot compatibility score](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
Copybara import of the project:
--
c14a0d2198bee3dcd76ee7fa733da41a6d1fcd6b by dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>:
Bump github/codeql-action from 3.30.5 to 4.30.9
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.30.5 to 4.30.9.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](3599b3baa1...16140ae1a1)
---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-version: 4.30.9
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Merging this change closes #32904
PiperOrigin-RevId: 822178959
Imported from GitHub PR https://github.com/openxla/xla/pull/32688
📝 Summary of Changes
This PR enables the command buffer DynamicSliceCopy command to be recorded into an unrolled CUDA graph when it is surrounded by a WhileCmd.
🎯 Justification
This feature is required if we want to fully lower WhileCmd into an unrolled CUDA graph.
🚀 Kind of Contribution
✨ New Feature
🧪 Unit Tests:
xla/backends/gpu/runtime/command_buffer_cmd_test.cc: CommandBufferCmdTest:DynamicSliceCopyFusionCmd
Copybara import of the project:
--
feb2902fca397360460f6b9788ac0f7482cb547c by Shawn Wang <shawnw@nvidia.com>:
Enable command buffer DynamicSliceCopyFusion command unrolling
Merging this change closes #32688
PiperOrigin-RevId: 822104580
Imported from GitHub PR https://github.com/openxla/xla/pull/32719
📝 Summary of Changes
This PR enables the command buffer DynamicSliceFusion command to be recorded into an unrolled CUDA graph when it is surrounded by a WhileCmd.
🎯 Justification
This feature is required if we want to fully lower WhileCmd into an unrolled CUDA graph.
🚀 Kind of Contribution
✨ New Feature
🧪 Unit Tests:
xla/backends/gpu/codegen/dynamic_slice_fusion_test.cc
Copybara import of the project:
--
daa975804cbffcc3a6bc5b37e3494b51a2dbe2ca by Shawn Wang <shawnw@nvidia.com>:
DynamicSliceFusionCmd supports unrolling
Merging this change closes #32719
PiperOrigin-RevId: 822071751
According to benchmarks, we have reached performance neutrality with the legacy emitter, so we are switching to the new emitter by default.
The legacy emitter will be kept for some time but is considered deprecated and should not be used; it will be deleted in the near future.
PiperOrigin-RevId: 822067921
The fission autotuner previously only searched for dot instructions in the entry computation of an HLO module. This caused it to miss dot operations located in nested computations, such as the body of a while loop, preventing the autotuner from applying configurations to them.
PiperOrigin-RevId: 822037141
We adjusted the emitter to handle the case when the scale is missing.
We also relaxed the HLO verifier a bit and tweaked the composite rewriter so that it accepts the dimension indexes passed by JAX.
PiperOrigin-RevId: 822036474
When removing ops, we need to do so in a deterministic order. The reason is
that removing a user works by finding the position of the user in the users
vector, swapping it with the last element of the vector, and then popping the
last element. So if more than one element is removed from a users list, the
order in which the elements are removed matters.
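To make the order-dependence concrete, here is a minimal C++ sketch (illustrative only, not the actual XLA code) of swap-and-pop removal:
```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch of swap-and-pop user removal: find the user, swap
// it with the last element of the vector, then pop the back.
void RemoveUser(std::vector<char>& users, char user) {
  auto it = std::find(users.begin(), users.end(), user);
  std::iter_swap(it, users.end() - 1);
  users.pop_back();
}

// With users = {A, B, C, D}:
//   removing A then B leaves {D, C},
//   removing B then A leaves {C, D}.
// The surviving order differs, so removals must happen in a
// deterministic order to keep compilation deterministic.
```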
PiperOrigin-RevId: 822026351
Imported from GitHub PR https://github.com/openxla/xla/pull/32905📝 Summary of Changes
Allow mixed precision asynchronous collective-permute in the verifier.
🎯 Justification
Fixes https://github.com/openxla/xla/issues/32845
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
N/A
🧪 Unit Tests:
Tests that verifier passes on mixed precision collective-permute-start and collective-permute-done.
🧪 Execution Tests:
Manually tested the JAX repro from https://github.com/openxla/xla/issues/32845
Copybara import of the project:
--
f44faa7ce7ecfbd810983cae170a118bb19a8bb3 by Jaroslav Sevcik <jsevcik@nvidia.com>:
Allow mixed precision operands for async collective permute
Merging this change closes #32905
PiperOrigin-RevId: 822023349
Imported from GitHub PR https://github.com/openxla/xla/pull/32773
📝 Summary of Changes
Remove hardcoded NHWC convolution layout for fp16 precision.
🎯 Justification
Performance drops for fp16 precision on gfx11xx and gfx12xx GPUs were observed internally, as well as by the [community](https://github.com/jax-ml/jax/issues/30548).
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark
A community member provided [the script with which the profiling can be done](https://github.com/jax-ml/jax/issues/30548#issue-3270872993).
Significant performance improvement for fp16 on gfx12xx after the fix:
```
Running on: rocm:0
Testing float32...
Avg time: 0.092307 s, Throughput: 1.68 TFLOP/s
Testing float16...
Avg time: 0.011742 s, Throughput: 13.17 TFLOP/s
Testing bfloat16...
Avg time: 0.011989 s, Throughput: 12.90 TFLOP/s
```
Results of the profiling before the fix:
```
Running on: rocm:0
Testing float32...
Avg time: 0.092312 s, Throughput: 1.67 TFLOP/s
Testing float16...
Avg time: 0.775142 s, Throughput: 0.20 TFLOP/s
Testing bfloat16...
Avg time: 0.011990 s, Throughput: 12.90 TFLOP/s
```
@xla-rotation can you please review this PR?
Copybara import of the project:
--
c9fdba79e32c13d9cbf640e61d941d071fabba9d by Aleksa Arsic <Aleksa.Arsic@amd.com>:
Remove hardcoded convolution NCHW layout assignment for fp16 precision.
--
69660d19999a14b24d63b52e6dae310cfbdcbb6b by Aleksa Arsic <Aleksa.Arsic@amd.com>:
Add unit tests for ROCm layout assignment.
Merging this change closes #32773
PiperOrigin-RevId: 822022522
Imported from GitHub PR https://github.com/openxla/xla/pull/32724
Copybara import of the project:
--
c3f4ff8ec6af27d24b61e2aa529585697b8aa77a by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:
Disable only the test cases that are failing and enable 3 test targets on B200.
--
1f6e52218ec124bb52d4dba70aa7832311762465 by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:
Disable test case in cudnn_test that fails on Google's B200.
Keep gpu_compiler_test off CI for now due to memory leak
found by ASAN, but don't revert the changes in the file,
so it can be enabled more easily in the future.
--
42e501a41e43c174538ab186c659a072101b4ab2 by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:
Disable ConvWgradWithNHWCLayoutExecutesCorrectly only on Blackwell.
Merging this change closes #32724
PiperOrigin-RevId: 821992088
This change implements native support for `xla::Executable::GetOutputLayouts()` in the PJRT C API when the PJRT Layouts extension is available. This support does not fetch the optimized HLO, which makes the method faster and more reliable.
This change strongly recommends that plugins implementing Layouts extension v2 upgrade to v3 to avoid an incompatibility.
PiperOrigin-RevId: 821834116
Future note: At some point, `NanoArray` would need to distinguish between a default layout vs. a concrete layout that is equal to the default layout. If the latter is used, `NanoArray::pjrt_layout()` is expected to return the concrete layout. This is not required by IFRT API semantics yet, but it will be enforced in the future.
PiperOrigin-RevId: 821808592
We can now produce arbitrary iteration patterns for output tiles, simply by
parametrizing calls to `ComputeTiledHloInstructions` with different
`TiledHloSchedule`s.
PiperOrigin-RevId: 821796530
IFRT Proxy now returns a `nullptr` if it knows that the Array layout represents a default layout. The user code previously has been migrated to handle this new behavior gracefully, obtaining a concrete default layout as before.
Caveat: the IFRT Proxy client infers the layout of the output arrays from `LoadedExecutable::GetOutputLayouts()`, which always returns concrete layouts today. Thus, these output arrays would use concrete layouts for default layouts, even if the arrays on the server side use `nullptr` for default layouts. This behavior is currently acceptable because all users convert the layout into a concrete one before using it; eventually, this behavior will change so that the IFRT Proxy client reflects the array layouts on the server side more accurately.
PiperOrigin-RevId: 821741105
The functionality has been removed previously, but the option was never cleaned up. This does not remove the xla_ignore_channel_id debug option because it also has a non-verifier use.
PiperOrigin-RevId: 821737613
Right now, we use `GetXlaPjrtCpuClient` which in turn calls `GetPjRtCpuClient`, but we will later update `GetXlaPjrtCpuClient` to use the C sandwich, in which case we must call `GetPjRtCpuClient` here in `PJRT_Client_Create`.
This change is a no-op.
PiperOrigin-RevId: 821732030
The dnn_version in device_description was not set. cl/816579045 fixed this for the old autotuner infra; this change ports that fix to the new autotuner infra.
PiperOrigin-RevId: 821728904
- We encounter this case very often (for the cuBLAS autotuner), so it makes sense to optimize it.
- Running cuBLAS kernels as part of autotuning has an unintended side effect that changes the optimized HLO; this fix also mitigates that issue while we look into it further.
PiperOrigin-RevId: 821716593
Imported from GitHub PR https://github.com/openxla/xla/pull/32782
📝 Summary of Changes
Fix hermetic build for rocm.
🎯 Justification
Introduce the missing hipblaslt dependency.
Fix invalid library linking and align with the data directories.
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
CI, not relevant
🧪 Unit Tests:
Not relevant
🧪 Execution Tests:
Not relevant
Copybara import of the project:
--
f5cb68b0df2265b7048d0068eedd07cccf67e228 by Alexandros Theodoridis <atheodor@amd.com>:
Add missing hermetic lib dependency
--
fe0c9a7fdd36180fea5cf63e20d864355ed98a6c by Alexandros Theodoridis <atheodor@amd.com>:
Add missing hipblaslt deps, fix the targets
--
540d79dd4287a013a3f178ef34a5b96fb8a8a92f by Alexandros Theodoridis <atheodor@amd.com>:
Make hipblaslt mandatory
--
3a6f2282669a1ece4518cc69a01ad76275b603a1 by Alexandros Theodoridis <atheodor@amd.com>:
Fix test
--
eb21b60d34978191315a0c9775d2cb53309dc72d by Alexandros Theodoridis <atheodor@amd.com>:
Ignore asnsigaltstack
--
54c8af2abd7dd682a8494caa05854d574209aa20 by Harsha Havanur Shamsundara <harsha.havanurshamsundara@amd.com>:
[ROCm] Use working sha256 for latest ROCm 7.0 docker image
--
9629a9fc9201a80dba7a0beecb8ee0797960ff6f by Harsha HS <Harsha.HavanurShamsundara@amd.com>:
[ROCm] Add ROCM_PATH repo_env to test scripts
--
1ef6772c6df6aeffcbcc2f27a0ede558fbc6270f by Alexandros Theodoridis <atheodor@amd.com>:
Fix buildifier warning
Merging this change closes #32782
PiperOrigin-RevId: 821614030
In cases where a program argument with AUTO layout is used in more than one Fragment, enforce the DEFAULT layout, as we cannot allow different compiled layouts.
PiperOrigin-RevId: 821612799
This enables migrating the Triton emitter to use the xtile entry, insert & extract emission in the child PR.
The main difference is the memref arguments in the entry function, for which `MemrefToPtr` & `PtrToMemref` were introduced; these closely resemble `UnrealizedConversionCastOp` with additional verification and will enable special folding of `memref::TransposeOp`.
PiperOrigin-RevId: 821593545
This gives us the two HalfClose events plus HandleEvent() and SendRawFrame() as
the API from the socket integration, so subclasses can handle these
accordingly. This also moves the responsibility for destruction into the handler
logic, with the contract that the event is removed from the loop on the second HalfClose event.
PiperOrigin-RevId: 821445213
Given a user seed, this will update the MSA sort-order priority of a (small) number of randomly selected instructions during compilation.
This causes small perturbations in the compiler's prefetching decisions, which enables two main use cases:
1. finding out whether there is a single instruction that was given a "wrong" priority by the compiler so it can be fixed
- to do this, we run some benchmark many times with different seeds until we find a seed that drastically reduces the compiled code's runtime
- once we find that seed, we can use binary search to narrow the "selection range" and zero in on the one specific offending instruction
2. finding a lot of small changes that together reduce the runtime
- we can do this using a "hill-climbing" method
- try many perturbations until you find one slightly better than the baseline.
- try many followup perturbations (perturbing the best perturbation from the previous stage) until you find one slightly better again
- repeat until no more improvements are found
NOTE: Right now there is no good way of finding which instructions had their priority adjusted (which is especially important in (1) to find the one offending instruction). The only way to do so is to increase the log level of the compilation debug print and then look at the logs.
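A minimal sketch of the hill-climbing loop from (2), assuming a hypothetical `RunBenchmarkWithSeed` helper that compiles and times the module with a given perturbation seed:
```cpp
#include <cstdint>
#include <random>

// Hypothetical helper: compile and time the benchmark with the given
// MSA perturbation seed. Not a real XLA API.
double RunBenchmarkWithSeed(uint64_t seed);

// Hill climbing: keep perturbing the best seed found so far until a
// round of candidates yields no improvement.
uint64_t HillClimb(uint64_t initial_seed, int candidates_per_round) {
  uint64_t best_seed = initial_seed;
  double best_time = RunBenchmarkWithSeed(best_seed);
  bool improved = true;
  while (improved) {
    improved = false;
    std::mt19937_64 rng(best_seed);  // derive candidates from the best seed
    for (int i = 0; i < candidates_per_round; ++i) {
      const uint64_t candidate = rng();
      const double time = RunBenchmarkWithSeed(candidate);
      if (time < best_time) {
        best_time = time;
        best_seed = candidate;
        improved = true;
      }
    }
  }
  return best_seed;
}
```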
PiperOrigin-RevId: 821309046
This change is a no-op since both the newly introduced XLA:TPU option and the corresponding option on the ExportNamedComputation pass are false by default.
PiperOrigin-RevId: 821039969
the transposes are not identity permutations. Identity transposes
should be eliminated separately in HandleTranspose already.
PiperOrigin-RevId: 820903953
PjRt-IFRT directly or indirectly fetched the optimized HLO to get the output
layout mode and output layouts. This seems to introduce a regression in
some jobs that use the PJRT C API and have too large a serialized HLO (> 2 GiB).
As a workaround, PjRt-IFRT gracefully handles output layout mode and
layout discovery errors, and falls back to concrete layouts that are
directly obtained from output `PjRtBuffer`s, which should give the same
behavior before/after the default layout handling change.
Further changes will follow to discover default layout modes and layouts
without going through `PjRtLoadedExecutable::GetHloModules()`.
PiperOrigin-RevId: 820785277
Add placeholders for future Type serialization/deserialization. This is not an ABI-breaking change as it is unused today, and it allows avoiding an ABI-breaking change in the future when FFI adds proper ser/des support for user-defined types.
PiperOrigin-RevId: 820676169
- The VLOG messages are updated to more accurately describe whether the autotuner is finding a config in cache, using a default, or actively tuning for the best config.
- The error contains the HLO instruction.
PiperOrigin-RevId: 820640768
This change utilizes recently added Triton support for smaller block sizes.
Skipping occupancy optimization for some configs is essentially a workaround for incompatible split_k values. The impact of these configs is limited, however, because they are only present in non-exhaustive mode, so they mostly get filtered out anyway.
PiperOrigin-RevId: 820617352
Before this change, we disallowed all-gather so that the partitioner generates the `all-reduce(dynamic-update-slice())` pattern. With this change, we allow all-gather for two reasons.
1. In most cases, all-gather is allowed and preferred.
2. It is easier to read and match the partitioner result.
PiperOrigin-RevId: 820593767
Imported from GitHub PR https://github.com/openxla/xla/pull/32388
📝 Summary of Changes
Support collectives with non-minor-most last dimension in the sub-byte collective normalization pass.
🎯 Justification
Makes more collectives efficient by not requiring type conversion.
🚀 Kind of Contribution
Performance Improvement.
📊 Benchmark (for Performance Improvements)
```
Before:
## Execution time, file=u4_all_gather_1x8.hlo repeat=1 duration=68384ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=2 duration=67744ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=3 duration=66976ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=4 duration=67040ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=5 duration=66816ns
After:
## Execution time, file=u4_all_gather_1x8.hlo repeat=1 duration=41216ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=2 duration=40960ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=3 duration=40960ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=4 duration=41056ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=5 duration=40960ns
```
Measured on 8xH100 DGX.
🧪 Unit Tests:
yes
🧪 Execution Tests:
yes
Copybara import of the project:
--
a3777523ffffbcc59da285544e3fb5575d098b9c by Ilia Sergachev <isergachev@nvidia.com>:
[GPU] Sub-byte collective normalization: support collectives with non-minor-most last dimension.
Merging this change closes #32388
PiperOrigin-RevId: 820585923
Imported from GitHub PR https://github.com/openxla/xla/pull/32678
📝 Summary of Changes
- Fix sha256 of docker image to ensure CI is not broken due to malformed image
- Fix test scripts by passing ROCM_PATH to bazel sandbox via repo_env
🎯 Justification
Continued CI runs
🚀 Kind of Contribution
🧪 Tests
Copybara import of the project:
--
3ca8114613d8e002c137f28bb6608639d08a724a by Harsha Havanur Shamsundara <harsha.havanurshamsundara@amd.com>:
[ROCm] Use working sha256 for latest ROCm 7.0 docker image
--
09ddfbdf205a6406cdd67e20671f41455fffe0f9 by Harsha HS <Harsha.HavanurShamsundara@amd.com>:
[ROCm] Add ROCM_PATH repo_env to test scripts
Merging this change closes #32678
PiperOrigin-RevId: 820582560
Imported from GitHub PR https://github.com/openxla/xla/pull/32718
📝 Summary of Changes
This PR adds conv fusion support in cudnn fusion compiler.
* add a conv type in `CuDnnFusionConfig` to represent the different types of conv. We are getting rid of the conv custom call target, so this info has to be preserved in the fusion config.
* add `ConvDimensionAdapter` to generate an NCHW **logical layout** for the cudnn frontend, while the physical layout can be NHWC (the most preferable layout) or NCHW (for int conv). Only the NHWC layout is used in the unit tests because layout assignment currently doesn't handle conv fusions to transform other layouts to NHWC; this needs to be addressed in a separate PR.
* add conv translation rule from XLA conv to cudnn frontend graph API.
* Other parts of the lowering are taken care of automatically by the current cudnn fusion compiler: workspace allocation/graph validation/graph compilation/graph serialization.
🎯 Justification
This is the first step to unify the conv as cudnn fusion in XLA. Conv custom call will be replaced with conv fusions in the future.
🚀 Kind of Contribution
✨ New Feature
📊 Benchmark (for Performance Improvements)
No Performance changes are expected.
🧪 Unit Tests:
Added 3 hand written NHWC conv unit tests for conv_fprop/conv_dgrad/conv_wgrad.
🧪 Execution Tests:
Added 3 hand written NHWC conv unit tests for conv_fprop/conv_dgrad/conv_wgrad.
Copybara import of the project:
--
57555cd0e3759aacb7a98135c3261f4cc3f642c2 by Cjkkkk <ske@nvidia.com>:
init
--
d6edecfa42a6371a0908e22daeb8deaf32998ece by Cjkkkk <ske@nvidia.com>:
address comments
--
17df6f8451274f070d7d332a126cfefa1ef7df83 by Cjkkkk <ske@nvidia.com>:
removed one comment
--
1b7c63b1ade7751cf8f68c7fb11cd68491440081 by Cjkkkk <ske@nvidia.com>:
add const
Merging this change closes #32718
PiperOrigin-RevId: 820574737
We're perfectly able to construct a schedule using only a subset of the
iteration space of a `tile_offsets_indexing`---and in fact need to when we are
processing nested fusions.
PiperOrigin-RevId: 820454010
* Deserializing MLIR modules still tries to parse as a string first, as that's the default; on failure, it tries to uncompress and then parse.
PiperOrigin-RevId: 820396326
`libnvidia-ml.so` version is coupled with kernel mode driver version, hence we can't provide a custom version of `libnvidia-ml.so` if the machine has a different KMD installed on it.
PiperOrigin-RevId: 820291348
If a multi-controller JAX program is running a collective (e.g., AllReduce) and
one of the collective participants fails, we'll abort the collective.
Previously, a JAX programmer didn't have a way to know that their collective
was aborted. The collective would just return garbage data. This commit changes
the TFRT GPU client to propagate aborted collectives into Python exceptions.
There is some subtlety in detecting when we abort a collective. The NCCL API
unfortunately doesn't provide a direct way to query whether or not a collective
was aborted. Instead, we check the health of the participants after the
collective returns. If the participants are stale, then we conservatively
assume the collective was canceled.
This leads to the possibility that some processes think the collective was
aborted and some don't. I'll address this lack of atomicity in future changes.
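A rough sketch of that detection logic, with the `Participant` type and all names assumed rather than taken from the actual TFRT GPU client:
```cpp
#include "absl/status/status.h"
#include "absl/types/span.h"

// Assumed interface: "stale" means the participant missed recent
// health checks / heartbeats.
struct Participant {
  bool stale = false;
  bool IsStale() const { return stale; }
};

// Sketch of the conservative detection described above: NCCL has no
// direct "was this collective aborted?" query, so after the collective
// returns we inspect participant health and treat any staleness as an
// aborted collective.
absl::Status CheckCollectiveResult(
    absl::Span<const Participant> participants) {
  for (const Participant& p : participants) {
    if (p.IsStale()) {
      return absl::AbortedError(
          "A collective participant is unhealthy; conservatively assuming "
          "the collective was aborted and its result is invalid.");
    }
  }
  return absl::OkStatus();
}
```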
PiperOrigin-RevId: 820272371
This change promotes `asinh` from a composite operation to a native HLO opcode (`kAsinh`). This allows for direct lowering to device-specific intrinsics which should be more performant. Support is added for GPU.
PiperOrigin-RevId: 820245338
Imported from GitHub PR https://github.com/openxla/xla/pull/31030
📝 Summary of Changes
This PR moves the ReduceScatterCreator pass to run after AlgebraicSimplifier, simplifying the transformation pattern and allowing ReduceScatterCreator to convert more all-reduces into reduce-scatters that would otherwise be missed.
🎯 Justification
Running ReduceScatterCreator after AlgebraicSimplifier makes the input patterns easier to recognize. This allows more all-reduces to be converted into reduce-scatters, which would otherwise be missed, leading to better performance. _This was reported internally as an optimization for llama3.3-70b._
🚀 Kind of Contribution
⚡️ Performance Improvement,
📊 Benchmark (for Performance Improvements)
On H100:
| | PR | main |
|----------|----------|----------|
| llama31_8b_bf16_1x8 | 1372251 us | 1369631 us |
| llama31_8b_fp8_1x8 | 1106135 us | 1107605 us |
| llama31_8b_bf16_2x8 | 1373637 us | 1370564 us |
| llama31_8b_fp8_2x8 | 1111912 us | 1108061 us |
| llama31_70b_bf16_16x8 | 13933022 us | 13913957 us |
| llama31_70b_fp8_16x8 | 9848173 us | 9867955 us |
| llama31_70b_bf16_32x8 | 14103619 us | 14065225 us |
| llama31_70b_fp8_32x8 | 9732961 us | 9760739 us |
| llama31_405b_bf16_64x8 | 52926476 us | 52886529 us |
| llama31_405b_fp8_64x8 | 35576505 us | 37929776 us |
| mixtral_8x7b_bf16_1x8 | 744367 us | 744491 us |
| mixtral_8x7b_bf16_2x8 | 1126425 us | 1130912 us |
🧪 Unit Tests:
Added a new unit test
🧪 Execution Tests:
Tested for functionality with llama3.3 70b zero1 + gradient accumulation and saw ~5% performance improvement.
Copybara import of the project:
--
2d999987762ac3d90960179b06587bc95fc954d1 by Sevin Varoglu <svaroglu@nvidia.com>:
Move ReduceScatterCreator after AlgebraicSimplifier
--
0e41c2b8281234eec9af21a98fd5f81bd4884689 by Sevin Varoglu <svaroglu@nvidia.com>:
Add unit test
Merging this change closes #31030
PiperOrigin-RevId: 820221148
A concrete use case when such a schedule is useful is when we have a matrix
multiplication such that a chunk of shape `(block_m, k)` of the left-hand
side argument fully fits into L2. The transposed iteration order will step
through the `n` dimension first, allowing us to hit the L2 cache more often
when loading tiles of the left-hand side.
This schedule is intentionally restricted at the moment in order to unblock
launching the generic Triton emitter for GEMMs.
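An illustration of the iteration order (not the emitter's code; `ProcessTile` is a stand-in for emitting one output tile):
```cpp
// Illustration only: ProcessTile stands in for emitting or executing
// one (block_m, block_n) output tile of the GEMM.
void ProcessTile(int m, int n);

// Transposed schedule: the inner loop steps through `n`, so the
// (block_m, k) LHS chunk for a fixed `m` is loaded once and then hit in
// L2 by every subsequent tile of that row.
void VisitTiles(int num_m_tiles, int num_n_tiles) {
  for (int m = 0; m < num_m_tiles; ++m) {
    for (int n = 0; n < num_n_tiles; ++n) {
      ProcessTile(m, n);
    }
  }
}
```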
PiperOrigin-RevId: 820214481
And dump the log proto into a file called buffer_debug_log rather than sdc_log.
Changes to implementation details:
- Renames:
- SdcLogProto -> BufferDebugLogProto
- SdcLog -> BufferDebugLog
- SdcBufferId -> ThunkBufferId
- SdcThunk -> BuffersChecksumThunk
- SdcXorChecksumKernel -> BufferDebugXorChecksumKernel
- move BufferDebugLog to stream_executor/gpu from stream_executor/cuda as it's not CUDA-specific
PiperOrigin-RevId: 820186034
Remove checks for `padded_k_size.has_value()` as `padded_k_size` is always expected to be present after its initialization. This simplifies the conditions for needing padding and the calculation of the padding amount.
PiperOrigin-RevId: 820148929
A debugging tool meant to pinpoint nondeterministic computations by finding
differences in buffer values across multiple runs. It makes XLA calculate
checksums of input/output buffers, and dump them to the output directory.
Enabling the new `--xla_gpu_experimental_enable_checksum_tracing_on_thunks`
flag enables a new ThunkChecksumTracingPass, which adds checksum thunks to the
thunk graph:
- Inserts SDC log initialization at the beginning.
- Replaces each thunk with a SequentialThunk [checksum inputs, run original
thunk, checksum outputs].
- Inserts a thunk that dumps the SDC log to a file at the end of execution (sketched below).
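A sketch of the rewrite, with all names illustrative rather than the actual XLA thunk classes:
```cpp
#include <memory>
#include <utility>
#include <vector>

// All names below are illustrative, not the actual XLA thunk classes.
struct Thunk {
  virtual ~Thunk() = default;
};

std::unique_ptr<Thunk> MakeLogInitThunk();
std::unique_ptr<Thunk> MakeLogDumpThunk();
std::unique_ptr<Thunk> MakeInputChecksumThunk(const Thunk& t);
std::unique_ptr<Thunk> MakeOutputChecksumThunk(const Thunk& t);
std::unique_ptr<Thunk> MakeSequentialThunk(
    std::vector<std::unique_ptr<Thunk>> seq);

// Every thunk becomes a SequentialThunk of [checksum inputs, run
// original thunk, checksum outputs], bracketed by log-init and
// log-dump thunks.
std::vector<std::unique_ptr<Thunk>> AddChecksumTracing(
    std::vector<std::unique_ptr<Thunk>> thunks) {
  std::vector<std::unique_ptr<Thunk>> out;
  out.push_back(MakeLogInitThunk());
  for (std::unique_ptr<Thunk>& thunk : thunks) {
    std::vector<std::unique_ptr<Thunk>> seq;
    seq.push_back(MakeInputChecksumThunk(*thunk));    // checksum inputs
    seq.push_back(std::move(thunk));                  // run original thunk
    seq.push_back(MakeOutputChecksumThunk(*seq[1]));  // checksum outputs
    out.push_back(MakeSequentialThunk(std::move(seq)));
  }
  out.push_back(MakeLogDumpThunk());
  return out;
}
```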
PiperOrigin-RevId: 820148916
This is a no-op when `dedupFunctionsFully` is false, which is also the default.
outShardings is the output shardings of the named computation at hand. However, if dedupFunctionsFully is true, the func we pick from `createFuncOpOrGetFromCache` (the func the call will actually be calling) may have a different output sharding than the named computation, and the call result sharding should be set to the output sharding of the func it calls. For example,
namedComputation1(foo): insharding={"x"} outsharding={"y"}
namedComputation2(foo): insharding={"x"} outsharding={"z"}
call1 to namedComputation1
call2 to namedComputation2
When dedupFunctionsFully is false, we have separate instances of foo as their outshardings are different.
func foo1 insharding={"x"} outsharding={"y"} {...}
func foo2 insharding={"x"} outsharding={"z"} {...}
call1 to foo_1 resultsharding={"y"}
call2 to foo_2 resultsharding={"z"}
When dedupFunctionsFully is true, we do not have separate instances of foo; we need to pick either namedComputation1 or namedComputation2. Say we pick namedComputation1; hence it becomes:
func foo insharding={"x"} outsharding={"y"} {...}
call1 to foo resultsharding={"y"}
call2 to foo resultsharding={"y"}
As a result, call2 should have a resultsharding={"y"} since it is calling foo, instead of the out sharding of namedComputation2 which is {"z"}.
PiperOrigin-RevId: 820139879
Loads are required to be 16-byte aligned for Triton to apply pipelining. This change adds extra padding to both split-k rewriters so that the reduction dimensions are a multiple of 16 bytes.
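The padding arithmetic, sketched below; this is an illustration rather than the rewriters' actual code, and it assumes the element size divides 16 evenly:
```cpp
#include <cstdint>

// Round the reduction dimension up so that it spans a whole number of
// 16-byte units.
int64_t PadReductionDim(int64_t k, int64_t bytes_per_element) {
  const int64_t elements_per_16_bytes = 16 / bytes_per_element;
  return (k + elements_per_16_bytes - 1) / elements_per_16_bytes *
         elements_per_16_bytes;
}

// e.g. for bf16 (2 bytes, 8 elements per 16 bytes):
//   PadReductionDim(1000, 2) == 1000  (already a multiple of 16 bytes)
//   PadReductionDim(1001, 2) == 1008  (7 elements of padding added)
```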
PiperOrigin-RevId: 820134896
Currently we would fail when trying to check whether the element type of the
root tuple is supported. We should not even access the element type on a tuple
shape. Therefore we skip the root tuple.
PiperOrigin-RevId: 820096050
This increases test coverage.
Also remove the empty test suite mlir_lit_tests. These tests have been moved to
another directory long ago.
PiperOrigin-RevId: 820074643
We already have a pinned allocation for the original value; it should be finalized to avoid a re-allocation causing multiple pinned allocations for the same buffer.
PiperOrigin-RevId: 820015337
PjRt-IFRT now returns a `nullptr` if it knows that the Array layout represents a default layout. The user code previously has been migrated to handle this new behavior gracefully, obtaining a concrete default layout as before.
`ifrt::PjRtArray` creation now requests extra information on whether the underlying `PjRtBuffer` is using a custom layout, as IFRT tracks the defaultness of array layouts. This information cannot be inferred correctly from `PjRtBuffer` alone because `PjRtBuffer::layout()` only returns a concrete layout. PjRt would mostly work fine today if a default layout were said to be a custom layout, but some strict layout equality checks can fail and require more precise information to be supplied.
A few test cases in IFRT ArrayImplTest against PjRt CPU and GPU clients
have been disabled because the output array does not track the
non-default-ness of the layout correctly when
`MakeArraysFromHostBufferShards()` is implemented using
`ClientMakeArraysFromHostBufferShards()`.
PiperOrigin-RevId: 819995407
Add a placeholder for `HERMETIC_PYTHON_VERSION_KIND` in the generated `py_version.bzl` file. This new variable is currently set to an empty string until we figure out how to deal with it.
PiperOrigin-RevId: 819956767
This was observed when auto warp specialization was enabled. Triton requires more threads per block than expected, and this information is available in the module attributes.
PiperOrigin-RevId: 819893926
The only non-obvious part of the thunk is `solver_context_creator`, but we can retrieve it during the deserialization from `stream_executor::Platform`, which is available during runtime.
PiperOrigin-RevId: 819863398
Imported from GitHub PR https://github.com/openxla/xla/pull/32003
Copybara import of the project:
--
2a3ad034522e871edc9c7f580e86fc3980025542 by Ilia Sergachev <isergachev@nvidia.com>:
[GPU][NFC] Merge methods querying fusion kind.
--
ebeb25599d6017d34ea92ece415a255d109af049 by Ilia Sergachev <isergachev@nvidia.com>:
Address review requests.
Merging this change closes #32003
PiperOrigin-RevId: 819692807
Imported from GitHub PR https://github.com/openxla/xla/pull/32283
📝 Summary of Changes
Change the misleading method name RocmComputeCapability::has_amd_matrix_core() to the more suitable name has_amd_mat_acc_instructions(), as gfx11xx GPUs do not have matrix cores but do support the matrix acceleration instruction set known as WMMA.
🎯 Justification
RocmComputeCapability::has_amd_matrix_core() is misleading, as gfx11xx GPUs do not have matrix cores but still support the matrix acceleration instruction set (WMMA).
🚀 Kind of Contribution
♻️ Cleanup
@xla-rotation please review my changes.
Copybara import of the project:
--
23cf1ab79fdcc4ee2ee4996973dee2c103d2762a by Aleksa Arsic <aleksa.arsic@amd.com>:
Change misleading method name RocmComputeCapability::has_amd_matrix_core() to more suitable name has_amd_mat_acc_instructions() as gfx11xx do not have matrix cores, but support matrix acceleration instruction set known as WMMA.
Merging this change closes #32283
PiperOrigin-RevId: 819652238
Imported from GitHub PR https://github.com/openxla/xla/pull/31994
Copybara import of the project:
--
dd037f3ef1c2da262029a9ebc34845ddb3c8a7f1 by Ilia Sergachev <isergachev@nvidia.com>:
[NFC] Move computation simplification methods from command buffer scheduling to a new library.
--
2594c7a473945f5d410ae8e8894b7e90f5812c1e by Ilia Sergachev <isergachev@nvidia.com>:
Address review feedback.
Merging this change closes #31994
PiperOrigin-RevId: 819631409
Imported from GitHub PR https://github.com/openxla/xla/pull/32642
📝 Summary of Changes
Fix the ROCm build with ASan settings.
🎯 Justification
Fix the invalid run_under script used to pass through the ASan ignore-lists file.
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
Not relevant
🧪 Unit Tests:
Not relevant
🧪 Execution Tests:
Not relevant
Copybara import of the project:
--
708b0b274d18b88ca7467c3ab3f44aaa11710995 by Alexandros Theodoridis <atheodor@amd.com>:
Fix invalid run_under script for ci job and asan ignore files
Merging this change closes #32642
PiperOrigin-RevId: 819630816
We need to ensure that symbols for trivial dimensions are simplified away
consistently. If we simplify it on one side, we also need to simplify it on
the other, as we want to use the same iteration space for both lhs and rhs.
PiperOrigin-RevId: 819606712
This change integrates `CollectiveBroadcastStartThunk` and `CollectiveBroadcastDoneThunk` into the command buffer execution framework, allowing them to be converted into command buffer commands. It also includes a minor fix to an error message.
PiperOrigin-RevId: 819432425
synchronization in raw buffer APIs (namely CopyRawHostToDeviceAndReturnEvent
and CopyRawDeviceToHostAndReturnEvent). Old buffers will require no
synchronization, but recent buffers will get a cached compute_stream event
and will then sync with this compute_stream event repeatedly instead of
syncing with the stream itself.
PiperOrigin-RevId: 819380480
This change reorders arguments within various override and extension calls for consistency. It also removes the explicit `xla_pypi_311_numpy` from the `use_repo(pip, ...)` call.
PiperOrigin-RevId: 819260347
This CL is a key step in integrating the new SymbolicExpr library into IndexingMap (b/433696544). The primary goal is to replace the existing `mlir::MLIRContext` with `gpu::SymbolicExprContext` throughout every class that depends on IndexingMap.
Goal:
- Enables Symbolic Reasoning: `SymbolicExprContext` is designed to manage symbolic expressions, which will allow for more powerful analysis and optimization of indexing maps.
- Performance: Together with pifon@, we believe that by using a dedicated context, compilation time could be improved by not overusing the lock inside MLIRContext (which is used everywhere). This should be confirmed with real data after finishing the entire migration.
- Foundation: This refactoring is a prerequisite for fully replacing `AffineExpr` with `SymbolicExpr` in `IndexingMap`. This should unblock the replacement of the internal implementation (cl/802100018).
Changes:
- Signature Updates: Function signatures across numerous files in `xla/service/gpu`, `xla/backends/gpu`, `xla/backends/cpu`, and `xla/hlo/analysis` have been updated to accept `gpu::SymbolicExprContext*` instead of `mlir::MLIRContext*`.
- Context Propagation: The `SymbolicExprContext` is now created and owned by `GpuCompiler` and propagated down to the various components, including fusion emitters, autotuners, and performance models.
- Test Updates: Unit tests and test utilities have been updated to use the new context.
- Some areas required temporary workarounds where the `SymbolicExprContext` is created on the fly from an existing `mlir::MLIRContext`. These are marked with TODOs (b/446856820, b/446856303) to be cleaned up in follow-up CLs as the integration progresses.
No functional changes are intended, but the `mlir::MLIRContext` was inconsistently managed across the codebase, requiring careful tracing and updates to ensure the new `SymbolicExprContext` is correctly propagated everywhere. This made the refactoring time-consuming, and I would appreciate careful review because I could have made mistakes. In the process I hit 3 different and unrelated segmentation faults and a crash in StorageUniquer from not using the same context in different places.
This CL represents step 2 of the integration plan outlined in b/433696544#comment9.
PiperOrigin-RevId: 819228363
In practice the thunk always has:
* an input and output filter
* either:
* no biases
* both an input and output bias
So encode this invariant in the data structure, to make this more readable and to make it harder to create an invalid thunk (see the sketch below).
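A sketch of how the invariant can be encoded; `Buffer` stands in for the real buffer slice type, and all names here are assumptions:
```cpp
#include <optional>

// Illustrative stand-in for the real buffer slice type.
struct Buffer { /* ... */ };

// Biases come as an input/output pair or not at all, so "input bias
// without output bias" is unrepresentable by construction.
struct BiasPair {
  Buffer input;
  Buffer output;
};

struct ReorderThunkBuffers {
  Buffer input_filter;             // always present
  Buffer output_filter;            // always present
  std::optional<BiasPair> biases;  // nullopt == no biases at all
};
```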
PiperOrigin-RevId: 819099118
Use GpuKernelRegistry for loading the kernel rather than `TypedKernelFactory`.
The new header will help prevent errors related to use of "gpu"-tagged targets
in non-"gpu"-tagged ones.
Also, avoid using atomic fetch_add to prevent JAX build failures on <sm60.
We're going to ensure that with a runtime check.
PiperOrigin-RevId: 819098591
Imported from GitHub PR https://github.com/openxla/xla/pull/32454
Any place inside the compiler has to respect the debug options override set by the users.
Copybara import of the project:
--
6971175737582aad4e9256f983890ac04009a074 by Yunlong Liu <yliu120@users.noreply.github.com>:
Respect debug options override in LHS.
Merging this change closes #32454
PiperOrigin-RevId: 819088927
Running OnReady and Map callbacks in the thread that calls promise.Set() can be dangerous for performance. Add an API to execute all callbacks in the given executor.
PiperOrigin-RevId: 818964744
`ifrt::Array::pjrt_layout()` will soon return a `nullptr` for a default layout (along with the removal of the `absl::StatusOr<>` part). Users can continue to get a concrete default layout via `ifrt::Client::GetDefaultPjRtLayout()`.
During a transition, IFRT implementations may return either `nullptr` or a concrete default layout; this state will be permitted temporarily while they are migrated to return `nullptr` for default layouts.
PiperOrigin-RevId: 818911623
gpu_static_registration is intended to pull in all dependencies needed for the target - this ensures that this is the case.
PiperOrigin-RevId: 818820817
because that is the only supported way to return Tuples, flip the flag to true by default.
Callers can now stop setting this.
PiperOrigin-RevId: 818803753
Instead of passing the filter dimensions as a span of integers, which we implicitly expect to be of size 4, we pass them as a proto.
Using a proto instead of a struct since we'll need the `ConvolutionFilterDimensions` to serialize the `ConvolutionReorderThunk`. (We don't want to serialize the `FilterDescriptor` since most of its fields are only written during execution, so we'll serialize the `ConvolutionFilterDimensions` instead).
Not sure where the best place is for the `ConvolutionFilterDimensions` proto to live. Other options would be to define it:
* in thunk.proto, or
* in some other file more closely related to convolution filters (not sure where that could be).
PiperOrigin-RevId: 818726237
This change moves the creation of the `computation_index_map` outside the loop body in `EmitDot` and `EmitScaledDot`, as it does not depend on the loop induction variable. It also simplifies how the tile size is retrieved in `GetDotLoopIterationCount` by using `TiledHloInstruction::tile_size`.
PiperOrigin-RevId: 818718168
We no longer support module groups with size > 1, so there's no point in supporting the generic RunOnModelGroup. One possible use case of module groups with size == 1 is to be able to *replace* the module (instead of modifying it in place). This adds a new interface to support that.
PiperOrigin-RevId: 818665806
Imported from GitHub PR https://github.com/openxla/xla/pull/32475
📝 Summary of Changes
Make ASan builds hermetic so they can be used with RBE.
🎯 Justification
Add the sanitizer ignore lists as a dependency of the run_under script so they are available on the RBE worker.
🚀 Kind of Contribution
🐛 Bug Fix, ♻️ Cleanup
📊 Benchmark (for Performance Improvements)
not relevant
🧪 Unit Tests:
not relevant
🧪 Execution Tests:
not relevant
Copybara import of the project:
--
cae2ea8d4808c161becb80602fba605ba08a4bd5 by Alexandros Theodoridis <atheodor@amd.com>:
Adjust ci script to include asan ignore list as deps
Merging this change closes #32475
PiperOrigin-RevId: 818658730
Imported from GitHub PR https://github.com/openxla/xla/pull/32357
📝 Summary of Changes
1. Update the TTFT calculation to be the time to the first generated token. This will also impact TPOT calculations.
2. Use the tokenizer to count the number of generated tokens instead of counting space-separated words.
🎯 Justification
Currently the script computes TTFT as the time to the first token, which comes from the prompt and is still in the prefill stage.
🚀 Kind of Contribution
🐛 Bug Fix
Copybara import of the project:
--
25178775f936a6f40a205e6969582222f150f0dd by Gauri Deshpande <gauri1.deshpande@intel.com>:
Update gemma2 keras benchmark script - fix ttft, and use tokenizer
--
9b20ead588ad38152e648067c7d34314ba8a5645 by Gauri Deshpande <gauri1.deshpande@intel.com>:
address review comments
Merging this change closes #32357
PiperOrigin-RevId: 818656164
For a custom call we need to resolve a target name (string) to a function pointer. So far this happens in `IrEmitterUnnested` (at the end of the compilation pipeline). But for thunk serialization we need this to happen at runtime (when the thunks are getting reconstructed from the proto representation). Therefore I'm moving this resolving step into the `CustomCallThunk` factory function.
Note that there remains a way to construct a `CustomCallThunk` from just a function pointer. These thunks will not be serializable, and that's okay; the logic handles these cases and returns an error. It is important for tests to be able to quickly create a CustomCallThunk from a closure; if we had to register these calls in the registry first, it would complicate our tests significantly.
In detail this change entails:
- Move resolver logic in new overloads of the `CustomCallThunk::Create` factory function.
- Call these overloads from `IrEmitterUnnested` and the custom kernel fusion emitter.
- Add tests for the new overloads
- Migrate some tests to FFI (the new custom call registry and API)
- Adjust some error codes in custom call tests. (If a custom call is not found, `kNotFound` is now returned instead of `kUnimplemented`.)
PiperOrigin-RevId: 818655300
`Array::pjrt_layout()` will be changed to return `nullptr` to indicate a default layout, where the callers can obtain the corresponding concrete default layout by using `Client::GetDefaultPjRtLayout()`.
This change adds `nullptr` handling preemptively before the new `Array::pjrt_layout()` semantics becomes effective so that the existing code works as before.
Tests using `Array::pjrt_layout()` method calls are minimally updated to add a non-nullness check. They will be updated as `Array::pjrt_layout()` actually returns `nullptr`.
PiperOrigin-RevId: 818618831
There's no shared logic between the `ConvolutionReorderThunk` and the `ConvolutionThunk`, so I think it's cleaner for each to be defined in its own file.
PiperOrigin-RevId: 818613921
The output of an IFRT IR program can be either a direct argument (`mlir::BlockArgument`) or the output of an `xla::ifrt::CallLoadedExecutableOp` or an `ifrt::CopyArraysOp`.
PiperOrigin-RevId: 818592194
Imported from GitHub PR https://github.com/openxla/xla/pull/32504
📝 Summary of Changes
Remove rocm_diagnostics.cc
🎯 Justification
The RocmDiagnostics module never worked and provides no meaningful information to the user.
🚀 Kind of Contribution
♻️ Cleanup
📊 Benchmark (for Performance Improvements)
N/A
🧪 Unit Tests:
None
🧪 Execution Tests:
None
Copybara import of the project:
--
73c4357ea80c720e2e46ddc0f91c8943e571b1ca by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:
[ROCm] Remove rocm_diagnostics
Merging this change closes #32504
PiperOrigin-RevId: 818581529
The removed data fields are not used and are always initialized from a nullptr.
The removed functions are not called from anywhere. So let's remove all of that.
PiperOrigin-RevId: 818504444
Imported from GitHub PR https://github.com/openxla/xla/pull/31795
📝 Summary of Changes
A [downstream check](https://github.com/openxla/xla/blob/main/xla/pjrt/pjrt_executable.cc#L288-L299) assumes tuples are in the default memory space; force-assigning the default color to tuples gets around the check.
🎯 Justification
NCCL user buffer runs are crashing on MaxText main. This PR fixes the crash.
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
N/A.
🧪 Unit Tests:
Existing unit tests.
🧪 Execution Tests:
Added multiple execution tests.
Copybara import of the project:
--
c60fe9d62827596eac57df2b480891520b40ab07 by Terry Sun <tesun@nvidia.com>:
assign default color to tuples
--
717412a55a94be71afcbb7627f03905c408f8b6a by Terry Sun <tesun@nvidia.com>:
add constant and polish doc string
--
c907b2d1ca5a62299b6bfd2534e99c6215313ffd by Terry Sun <tesun@nvidia.com>:
update test
Merging this change closes #31795
PiperOrigin-RevId: 818295813
This makes the rendezvous name more informative by including the specific type of collective operation, which helps in debugging.
PiperOrigin-RevId: 818215415
For some annotations, it is more user-friendly to print the annotation
before the annotated context. Add an option for this and remove the
automatic "; " delimiter to allow more flexibility in the annotation
format.
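A minimal sketch of the resulting flexibility, with invented names; callers now control both the ordering and the delimiter:

```cpp
#include <string>

#include "absl/strings/str_cat.h"
#include "absl/strings/string_view.h"

// No automatic "; " is inserted; any delimiter is part of the annotation.
std::string Annotate(absl::string_view context, absl::string_view annotation,
                     bool annotation_first) {
  return annotation_first ? absl::StrCat(annotation, context)
                          : absl::StrCat(context, annotation);
}
```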
PiperOrigin-RevId: 818118289
This also removes the code that handles original values in MergeFusionInstructionIntoMultiOutput, as it eventually calls into HloCallableInstruction::CloneAndAppendInstructionIntoCalledComputation to create a tuple result and the corresponding original value.
PiperOrigin-RevId: 817944313
`Array::pjrt_layout()` will be changed to return `nullptr` to indicate a default layout, in which case callers can obtain the corresponding concrete default layout via `Client::GetDefaultPjRtLayout()`.
This change adds `nullptr` handling preemptively, before the new `Array::pjrt_layout()` semantics take effect, so that the existing code keeps working as before.
Tests that call `Array::pjrt_layout()` are minimally updated to add a non-nullness check. They will be updated once `Array::pjrt_layout()` actually returns `nullptr`.
PiperOrigin-RevId: 817893146
It's hard to construct a `ThreadPoolExecutor` from a `ThreadPool&` reference and correctly manage the lifetime of the executor. Instead, make it possible to get a `tsl::Executor` adaptor from a `ThreadPool` instance.
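A hedged sketch of the adaptor idea with invented, minimal interfaces (the real tsl types differ): borrowing an executor view from a pool the caller already owns sidesteps the lifetime question.

```cpp
#include <functional>
#include <utility>

// Invented stand-ins for illustration only.
class Executor {
 public:
  virtual ~Executor() = default;
  virtual void Execute(std::function<void()> fn) = 0;
};

class ThreadPool {
 public:
  void Schedule(std::function<void()> fn) { fn(); }  // stand-in: run inline

  // Borrow an executor whose lifetime is tied to the pool itself, rather
  // than constructing a separately owned executor from a ThreadPool&.
  Executor& AsExecutor() { return adaptor_; }

 private:
  struct Adaptor final : Executor {
    explicit Adaptor(ThreadPool* p) : pool(p) {}
    void Execute(std::function<void()> fn) override {
      pool->Schedule(std::move(fn));
    }
    ThreadPool* pool;
  };
  Adaptor adaptor_{this};
};
```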
PiperOrigin-RevId: 817872041
These macros are carefully designed to
1. have the exact same API as absl's `CHECK` macros,
2. produce error messages and content in exactly the same format as absl's `CHECK`, and
3. respect absl flags, e.g. `ABSL_MIN_LOG_LEVEL`.
They differ from absl's `CHECK` only in that, on check failure, they
1. prepend an error code to the error message and append a link to the openxla.org webpage, and
2. additionally append DebugMeContext information if available (see the sketch below).
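A minimal sketch of the failure-path decoration, using a hypothetical macro name, error-code scheme, and URL (the real macros differ):

```cpp
#include "absl/base/optimization.h"
#include "absl/log/log.h"

// Hypothetical macro: behaves like CHECK, but on failure prepends an error
// code and appends a documentation link to the message.
#define DEMO_CHECK(cond, error_code)                                    \
  do {                                                                  \
    if (ABSL_PREDICT_FALSE(!(cond))) {                                  \
      LOG(FATAL) << "E" << (error_code) << ": Check failed: " << #cond  \
                 << ". See https://openxla.org/ for details.";          \
    }                                                                   \
  } while (0)
```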
PiperOrigin-RevId: 817868212
- Added MODULE.bazel and module extensions to introduce external dependencies with Bzlmod.
- Added a CI config for Linux CPU build with Bzlmod (enabled by `--config=bzlmod`)
TODOs:
- Support overriding Python runtime
- Support build with more configs and platforms
PiperOrigin-RevId: 817711851
For the `ConvolutionThunk` (de)serialisation we need to make the `GpuConvDescriptor` serializable, and for that we need `CudnnConvKind` too.
A couple additional changes:
* Renamed the existing C++-enum-to-proto-enum conversion to a (hopefully) more readable name.
* Enforce at compile time that every C++ enum value can be mapped to the proto version. I can't think of a case where we wouldn't want this, and with this change we can get rid of some non-OK `Status` invariants (see the sketch below).
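A sketch of the compile-time mapping idea with illustrative enum values (the real `CudnnConvKind` cases differ): an exhaustive `switch` with no `default` makes the compiler flag unmapped values, so the conversion never needs a non-OK `Status` path.

```cpp
// Illustrative enums; not the actual XLA/proto definitions.
enum class CudnnConvKind { kForward, kBackwardInput, kBackwardFilter };
enum class CudnnConvKindProto { FORWARD, BACKWARD_INPUT, BACKWARD_FILTER };

// No default case: adding a CudnnConvKind value without extending this
// switch triggers a compiler diagnostic instead of a runtime error.
constexpr CudnnConvKindProto ToProto(CudnnConvKind kind) {
  switch (kind) {
    case CudnnConvKind::kForward:
      return CudnnConvKindProto::FORWARD;
    case CudnnConvKind::kBackwardInput:
      return CudnnConvKindProto::BACKWARD_INPUT;
    case CudnnConvKind::kBackwardFilter:
      return CudnnConvKindProto::BACKWARD_FILTER;
  }
}
```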
PiperOrigin-RevId: 817676211
Imported from GitHub PR https://github.com/openxla/xla/pull/28740
This PR tries to lower `DynamicSliceThunk` into a command buffer, even if it depends on the loop iteration.
The command buffer implementation uses the same approach as `DynamicSliceThunk` (using `HloEvaluator` to compute new allocation addresses at runtime): on each iteration, the command buffer uses `HloEvaluator` to get the new addresses and performs a graph update with them.
The major change to the custom.cc file resolves the issue that once a module has been processed by the command buffer scheduler, it is rewritten into nested calls, which breaks the while-loop analysis and module extraction patterns; the fix introduces a cloned, inlined module and performs the loop analysis and module extraction on that inlined module.
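A high-level sketch of the per-iteration flow, with invented stand-in types and helpers (`EvaluateOffsets`, `UpdateGraphAddresses`):

```cpp
#include <cstdint>
#include <vector>

// Invented stand-ins; the real XLA types and APIs differ.
struct OffsetModule {};
struct CommandBuffer {};
std::vector<void*> EvaluateOffsets(const OffsetModule&, int64_t iteration);
void UpdateGraphAddresses(CommandBuffer&, const std::vector<void*>&);
void Execute(CommandBuffer&);

void RunLoop(const OffsetModule& offsets, CommandBuffer& buffer,
             int64_t trip_count) {
  for (int64_t i = 0; i < trip_count; ++i) {
    // Re-evaluate the slice offsets for this iteration (HloEvaluator in
    // the real code), then patch the recorded graph with the new
    // addresses instead of re-recording it.
    UpdateGraphAddresses(buffer, EvaluateOffsets(offsets, i));
    Execute(buffer);
  }
}
```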
Copybara import of the project:
--
2fe7c75a9fcbc9ade65f5a275aba3a2bc996ba07 by Shawn Wang <shawnw@nvidia.com>:
add debug information for command_buffer_conversion_pass
--
88183dd7dc53c2bdc80f3a664a99b50e275311b2 by Shawn Wang <shawnw@nvidia.com>:
Lower dynamic update slice thunk into command buffer when its offset
value depends on loop iteration.
--
3cf46be90b3be2185f0b5106ea9eeaa45b088601 by Shawn Wang <shawnw@nvidia.com>:
fix
--
45b31f69f9299a13bac24a966625190c9e90c91e by Shawn Wang <shawnw@nvidia.com>:
fix
--
ce3af2b9b131c9902b45d6d9934424d861656d32 by Shawn Wang <shawnw@nvidia.com>:
fix
--
a7fc4ab02b5d7dec6d337fcc57bbfd38a3b205ed by Shawn Wang <shawnw@nvidia.com>:
fix
--
73784aa6530244559c1530b2f922cf81c6d43822 by Shawn Wang <shawnw@nvidia.com>:
change to gemm command for test
--
64b1cf454fc360bcc3255f29bd27c01799537e07 by Shawn Wang <shawnw@nvidia.com>:
fix
--
0a3d7a1b6c142a3c9aa2b299d902520ed7f91515 by Shawn Wang <shawnw@nvidia.com>:
clang format
--
3105ce82fa3751d73d41b0564402e108328ea147 by Shawn Wang <shawnw@nvidia.com>:
fix
--
85ce21672052c4bbfd50db54248dbe1ae2494230 by Shawn Wang <shawnw@nvidia.com>:
fix
Merging this change closes#28740
PiperOrigin-RevId: 817644265
Experiments show that this still performs better than not unrolling.
Also rename the method `MayPreventVectorization`, as the name is misleading: the
other logic makes sure that we can at least vectorize the stores, so this
function should check whether there is an expected performance drop due to
unrolling, not whether we may be able to vectorize loads.
PiperOrigin-RevId: 817621544
`Array::pjrt_layout()` will be changed to return `nullptr` to indicate a default layout, in which case callers can obtain the corresponding concrete default layout via `Client::GetDefaultPjRtLayout()`.
This change adds `nullptr` handling preemptively, before the new `Array::pjrt_layout()` semantics take effect, so that the existing code keeps working as before.
Tests that call `Array::pjrt_layout()` are minimally updated to add a non-nullness check. They will be updated once `Array::pjrt_layout()` actually returns `nullptr`.
PiperOrigin-RevId: 817600994
Imported from GitHub PR https://github.com/openxla/xla/pull/32229
RTX PRO 6000 has CC 12.0.
Spark has CC 12.1.
Removed the IsAtLeastBlackwellPro method because there is no guarantee that future data center GPUs will have CC higher than 12.0.
Also skipped the latency estimator test on Edge GPUs because it uses the collective performance model and crashes here:
784702574e/xla/service/gpu/model/gpu_collective_performance_model.cc (L239)
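For illustration, a sketch of why an "at least" predicate is unsafe here, using a minimal stand-in for the compute-capability type:

```cpp
// Stand-in for se::CudaComputeCapability.
struct ComputeCapability { int major; int minor; };

// RTX PRO 6000 is CC 12.0 and Spark is CC 12.1, so any "at least 12.0"
// test also matches Spark, and nothing guarantees future data center GPUs
// will have CC above 12.0; hence an explicit family check instead.
bool IsBlackwell12x(const ComputeCapability& cc) { return cc.major == 12; }
```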
Copybara import of the project:
--
ca47c656de78f8c5385dcf77b7454d7adc774203 by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:
Some Spark fixes. Rename kBlackwellPro to kBlackwell12, as the sm_12x compute capabilities also include Spark.
Fix the latency estimator test and the gemm fusion autotuner test for Spark.
Removed the IsAtLeastBlackwellPro method because there is no guarantee that future
data center GPUs will have CC higher than 12.0.
Merging this change closes#32229
PiperOrigin-RevId: 817600860
Imported from GitHub PR https://github.com/openxla/xla/pull/32525📝 Summary of Changes
Pass cuDNN version to the `BlockScalingRewriter` pass, and make lowering decisions based on that.
🎯 Justification
The global scaling factor doesn't work before cuDNN v9.13 (the graph is compiled, but the scaling factor is not applied).
Use the slower lowering (apply global scaling factor outside the fusion) in this case.
🚀 Kind of Contribution
🐛 Bug Fix
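A minimal sketch of the version gate, with an assumed version type; the pass would consult the cuDNN version passed in by the caller:

```cpp
#include <tuple>

// Stand-in for the plumbed-in cuDNN version.
struct CudnnVersion { int major; int minor; int patch; };

// Before cuDNN 9.13 the fused global scaling factor is silently ignored
// (the graph compiles but the factor is not applied), so fall back to
// applying the factor outside the fusion.
bool UseFusedGlobalScaling(const CudnnVersion& v) {
  return std::tie(v.major, v.minor) >= std::make_tuple(9, 13);
}
```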
Copybara import of the project:
--
a47ef5175d076270e371c9e5cf355fc1ad96efc8 by Sergey Kozub <skozub@nvidia.com>:
[XLA:GPU] Fix block scaled dot global scaling for older cuDNN versions
Merging this change closes#32525
PiperOrigin-RevId: 817592016
A helper that performs `SdcLog::ReadFromDevice` and returns the result as an
`SdcLogProto`. The proto will be dumped to the log directory for debugging.
PiperOrigin-RevId: 817587228
The order of files returned by `GetMatchingPaths` is not guaranteed, so sorting ensures deterministic test behavior and should get rid of the flakiness of the test.
Also a tiny assertion cleanup for better error messages.
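A sketch of the fix, assuming the usual `tsl::Env::GetMatchingPaths` shape:

```cpp
#include <algorithm>
#include <string>
#include <vector>

#include "absl/status/status.h"
#include "tsl/platform/env.h"
#include "tsl/platform/errors.h"

// GetMatchingPaths makes no ordering guarantee, so sort before asserting
// on the contents to keep the test deterministic.
absl::Status CollectSortedPaths(const std::string& pattern,
                                std::vector<std::string>* paths) {
  TF_RETURN_IF_ERROR(tsl::Env::Default()->GetMatchingPaths(pattern, paths));
  std::sort(paths->begin(), paths->end());
  return absl::OkStatus();
}
```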
PiperOrigin-RevId: 817564449
To (de)serialize this thunk, we'll be using the `GpuConvDescriptor` instead of the `GpuConvConfig`, since it's easier to serialize (most of the config fields actually get populated during execution).
So we move the descriptor creation into the thunk, so that in the next CL we can also store the descriptor to use for (de)serialisation. I didn't add the `GpuConvDescriptor descriptor_` field in this CL, since it's technically not needed yet.
PiperOrigin-RevId: 817523742
`Array::pjrt_layout()` will be changed to return `nullptr` to indicate a default layout, in which case callers can obtain the corresponding concrete default layout via `Client::GetDefaultPjRtLayout()`.
This change adds `nullptr` handling preemptively, before the new `Array::pjrt_layout()` semantics take effect, so that the existing code keeps working as before.
Tests that call `Array::pjrt_layout()` are minimally updated to add a non-nullness check. They will be updated once `Array::pjrt_layout()` actually returns `nullptr`.
PiperOrigin-RevId: 817516042