Imported from GitHub PR https://github.com/openxla/xla/pull/32838
📝 Summary of Changes
The fallback logic now correctly identifies the highest known compatible architecture when given an unknown architecture as input.
🎯 Justification
Previously the logic would propose an incompatible architecture in this case.
🚀 Kind of Contribution
🐛 Bug Fix
🧪 Unit Tests:
Added a new test case showing the previously-failing case (it used to propose `sm_110`)
Copybara import of the project:
--
f060bb9837d72159343ff2d52f5f2f42b1b7e9a4 by Olli Lupton <olupton@nvidia.com>:
Fix family-conditional logic
--
fc44dcd1e76da67c0b6fe53c33d2a571c3a6ff50 by Olli Lupton <olupton@nvidia.com>:
Accept CR suggestion
Merging this change closes #32838
PiperOrigin-RevId: 822284790
Imported from GitHub PR https://github.com/openxla/xla/pull/32960
📝 Summary of Changes
This (partially) upstreams changes from https://github.com/ROCm/xla/pull/323, 9d358b9b26, and https://github.com/ROCm/xla/pull/385, skipping some ASan/TSan changes for now.
🎯 Justification
These changes are ROCm-specific and help with ROCm's internal CI validation pipelines.
🚀 Kind of Contribution
🐛 Bug Fix, ♻️ Cleanup, 🧪 Tests
📊 Benchmark (for Performance Improvements)
/
🧪 Unit Tests:
/
🧪 Execution Tests:
/
Copybara import of the project:
--
804ff1b6a6fbba86a3e0a09d739179a4eb4f197d by Milica Makevic <Milica.Makevic@amd.com>:
Add missing cuda-only tag to cuda test
--
44ce7a2d56c9f0c80405447f431ae1e5a33f42e1 by Milica Makevic <Milica.Makevic@amd.com>:
Refactor test scripts
--
fb783c968e9d2ff5d92357908d99e4952235c2bc by Milica Makevic <Milica.Makevic@amd.com>:
Cover more mgpu tests
--
1f53712274f76202241bd3631dbf065826c0b960 by Milica Makevic <Milica.Makevic@amd.com>:
Switch from rocm_gcc to rocm_ci for sgpu tests
--
00e0c8ee2a763680f5a3665dab62202ab230731d by Milica Makevic <Milica.Makevic@amd.com>:
Changing file permissions
--
003c062a8900c12b73c0972e8d406f2661a27aba by Milica Makevic <Milica.Makevic@amd.com>:
Remove unnecessary import
--
214599355f40f1b65e0540daf0b9829d2c950115 by Harsha HS <Harsha.HavanurShamsundara@amd.com>:
Add license header
Merging this change closes #32960
PiperOrigin-RevId: 822245565
Imported from GitHub PR https://github.com/openxla/xla/pull/32846
📝 Summary of Changes
Allow mixed precision collective-permute in the verifier.
🎯 Justification
Partially addresses https://github.com/openxla/xla/issues/32845
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
N/A
🧪 Unit Tests:
Tests that verifier passes on mixed precision collective-permute.
🧪 Execution Tests:
N/A
Copybara import of the project:
--
666c38a19005a609d4a7aa8e5e9b9842b1c87175 by Jaroslav Sevcik <jsevcik@nvidia.com>:
Allow mixed precision for collective permute
Merging this change closes #32846
PiperOrigin-RevId: 822179840
Imported from GitHub PR https://github.com/openxla/xla/pull/32904
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.30.5 to 4.30.9.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/github/codeql-action/releases">github/codeql-action's releases</a>.</em></p>
<blockquote>
<h2>v4.30.9</h2>
<h1>CodeQL Action Changelog</h1>
<p>See the <a href="https://github.com/github/codeql-action/releases">releases page</a> for the relevant changes to the CodeQL CLI and language packs.</p>
<h2>4.30.9 - 17 Oct 2025</h2>
<ul>
<li>Update default CodeQL bundle version to 2.23.3. <a href="https://redirect.github.com/github/codeql-action/pull/3205">#3205</a></li>
<li>Experimental: A new <code>setup-codeql</code> action has been added which is similar to <code>init</code>, except it only installs the CodeQL CLI and does not initialize a database. Do not use this in production as it is part of an internal experiment and subject to change at any time. <a href="https://redirect.github.com/github/codeql-action/pull/3204">#3204</a></li>
</ul>
<p>See the full <a href="https://github.com/github/codeql-action/blob/v4.30.9/CHANGELOG.md">CHANGELOG.md</a> for more information.</p>
<h2>v4.30.8</h2>
<h1>CodeQL Action Changelog</h1>
<p>See the <a href="https://github.com/github/codeql-action/releases">releases page</a> for the relevant changes to the CodeQL CLI and language packs.</p>
<h2>4.30.8 - 10 Oct 2025</h2>
<p>No user facing changes.</p>
<p>See the full <a href="https://github.com/github/codeql-action/blob/v4.30.8/CHANGELOG.md">CHANGELOG.md</a> for more information.</p>
<h2>v4.30.7</h2>
<h1>CodeQL Action Changelog</h1>
<p>See the <a href="https://github.com/github/codeql-action/releases">releases page</a> for the relevant changes to the CodeQL CLI and language packs.</p>
<h2>4.30.7 - 06 Oct 2025</h2>
<ul>
<li>[v4+ only] The CodeQL Action now runs on Node.js v24. <a href="https://redirect.github.com/github/codeql-action/pull/3169">#3169</a></li>
</ul>
<p>See the full <a href="https://github.com/github/codeql-action/blob/v4.30.7/CHANGELOG.md">CHANGELOG.md</a> for more information.</p>
<h2>v3.30.9</h2>
<h1>CodeQL Action Changelog</h1>
<p>See the <a href="https://github.com/github/codeql-action/releases">releases page</a> for the relevant changes to the CodeQL CLI and language packs.</p>
<h2>3.30.9 - 17 Oct 2025</h2>
<ul>
<li>Update default CodeQL bundle version to 2.23.3. <a href="https://redirect.github.com/github/codeql-action/pull/3205">#3205</a></li>
<li>Experimental: A new <code>setup-codeql</code> action has been added which is similar to <code>init</code>, except it only installs the CodeQL CLI and does not initialize a database. Do not use this in production as it is part of an internal experiment and subject to change at any time. <a href="https://redirect.github.com/github/codeql-action/pull/3204">#3204</a></li>
</ul>
<p>See the full <a href="https://github.com/github/codeql-action/blob/v3.30.9/CHANGELOG.md">CHANGELOG.md</a> for more information.</p>
<h2>v3.30.8</h2>
<h1>CodeQL Action Changelog</h1>
<p>See the <a href="https://github.com/github/codeql-action/releases">releases page</a> for the relevant changes to the CodeQL CLI and language packs.</p>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/github/codeql-action/blob/main/CHANGELOG.md">github/codeql-action's changelog</a>.</em></p>
<blockquote>
<h1>CodeQL Action Changelog</h1>
<p>See the <a href="https://github.com/github/codeql-action/releases">releases page</a> for the relevant changes to the CodeQL CLI and language packs.</p>
<h2>[UNRELEASED]</h2>
<p>No user facing changes.</p>
<h2>4.30.9 - 17 Oct 2025</h2>
<ul>
<li>Update default CodeQL bundle version to 2.23.3. <a href="https://redirect.github.com/github/codeql-action/pull/3205">#3205</a></li>
<li>Experimental: A new <code>setup-codeql</code> action has been added which is similar to <code>init</code>, except it only installs the CodeQL CLI and does not initialize a database. Do not use this in production as it is part of an internal experiment and subject to change at any time. <a href="https://redirect.github.com/github/codeql-action/pull/3204">#3204</a></li>
</ul>
<h2>4.30.8 - 10 Oct 2025</h2>
<p>No user facing changes.</p>
<h2>4.30.7 - 06 Oct 2025</h2>
<ul>
<li>[v4+ only] The CodeQL Action now runs on Node.js v24. <a href="https://redirect.github.com/github/codeql-action/pull/3169">#3169</a></li>
</ul>
<h2>3.30.6 - 02 Oct 2025</h2>
<ul>
<li>Update default CodeQL bundle version to 2.23.2. <a href="https://redirect.github.com/github/codeql-action/pull/3168">#3168</a></li>
</ul>
<h2>3.30.5 - 26 Sep 2025</h2>
<ul>
<li>We fixed a bug that was introduced in <code>3.30.4</code> with <code>upload-sarif</code> which resulted in files without a <code>.sarif</code> extension not getting uploaded. <a href="https://redirect.github.com/github/codeql-action/pull/3160">#3160</a></li>
</ul>
<h2>3.30.4 - 25 Sep 2025</h2>
<ul>
<li>We have improved the CodeQL Action's ability to validate that the workflow it is used in does not use different versions of the CodeQL Action for different workflow steps. Mixing different versions of the CodeQL Action in the same workflow is unsupported and can lead to unpredictable results. A warning will now be emitted from the <code>codeql-action/init</code> step if different versions of the CodeQL Action are detected in the workflow file. Additionally, an error will now be thrown by the other CodeQL Action steps if they load a configuration file that was generated by a different version of the <code>codeql-action/init</code> step. <a href="https://redirect.github.com/github/codeql-action/pull/3099">#3099</a> and <a href="https://redirect.github.com/github/codeql-action/pull/3100">#3100</a></li>
<li>We added support for reducing the size of dependency caches for Java analyses, which will reduce cache usage and speed up workflows. This will be enabled automatically at a later time. <a href="https://redirect.github.com/github/codeql-action/pull/3107">#3107</a></li>
<li>You can now run the latest CodeQL nightly bundle by passing <code>tools: nightly</code> to the <code>init</code> action. In general, the nightly bundle is unstable and we only recommend running it when directed by GitHub staff. <a href="https://redirect.github.com/github/codeql-action/pull/3130">#3130</a></li>
<li>Update default CodeQL bundle version to 2.23.1. <a href="https://redirect.github.com/github/codeql-action/pull/3118">#3118</a></li>
</ul>
<h2>3.30.3 - 10 Sep 2025</h2>
<p>No user facing changes.</p>
<h2>3.30.2 - 09 Sep 2025</h2>
<ul>
<li>Fixed a bug which could cause language autodetection to fail. <a href="https://redirect.github.com/github/codeql-action/pull/3084">#3084</a></li>
<li>Experimental: The <code>quality-queries</code> input that was added in <code>3.29.2</code> as part of an internal experiment is now deprecated and will be removed in an upcoming version of the CodeQL Action. It has been superseded by a new <code>analysis-kinds</code> input, which is part of the same internal experiment. Do not use this in production as it is subject to change at any time. <a href="https://redirect.github.com/github/codeql-action/pull/3064">#3064</a></li>
</ul>
<h2>3.30.1 - 05 Sep 2025</h2>
<ul>
<li>Update default CodeQL bundle version to 2.23.0. <a href="https://redirect.github.com/github/codeql-action/pull/3077">#3077</a></li>
</ul>
<h2>3.30.0 - 01 Sep 2025</h2>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="16140ae1a1"><code>16140ae</code></a> Merge pull request <a href="https://redirect.github.com/github/codeql-action/issues/3213">#3213</a> from github/update-v4.30.9-70205d3d1</li>
<li><a href="30db5fee08"><code>30db5fe</code></a> Update changelog for v4.30.9</li>
<li><a href="70205d3d12"><code>70205d3</code></a> Merge pull request <a href="https://redirect.github.com/github/codeql-action/issues/3211">#3211</a> from github/mbg/init/starting-partial-config</li>
<li><a href="697c209bfc"><code>697c209</code></a> Merge remote-tracking branch 'origin/main' into mbg/init/starting-partial-config</li>
<li><a href="1bd53ba38c"><code>1bd53ba</code></a> Merge pull request <a href="https://redirect.github.com/github/codeql-action/issues/3205">#3205</a> from github/update-bundle/codeql-bundle-v2.23.3</li>
<li><a href="cac4df0c79"><code>cac4df0</code></a> Rebuild</li>
<li><a href="77e5c0d0a2"><code>77e5c0d</code></a> Merge branch 'main' into update-bundle/codeql-bundle-v2.23.3</li>
<li><a href="97a4f751be"><code>97a4f75</code></a> Merge pull request <a href="https://redirect.github.com/github/codeql-action/issues/3204">#3204</a> from github/mbg/setup-codeql</li>
<li><a href="2d5512b361"><code>2d5512b</code></a> Merge remote-tracking branch 'origin/main' into mbg/init/starting-partial-config</li>
<li><a href="fa7bdf0559"><code>fa7bdf0</code></a> Call <code>getAnalysisKinds</code> a second time, and ignore exceptions thrown during th...</li>
<li>Additional commits viewable in <a href="3599b3baa1...16140ae1a1">compare view</a></li>
</ul>
</details>
[Dependabot compatibility score](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
Copybara import of the project:
--
c14a0d2198bee3dcd76ee7fa733da41a6d1fcd6b by dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>:
Bump github/codeql-action from 3.30.5 to 4.30.9
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.30.5 to 4.30.9.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](3599b3baa1...16140ae1a1)
---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-version: 4.30.9
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Merging this change closes #32904
PiperOrigin-RevId: 822178959
Imported from GitHub PR https://github.com/openxla/xla/pull/32688
📝 Summary of Changes
This PR enables the command buffer DynamicSliceCopy command to be recorded into an unrolled CUDA graph when it is surrounded by a WhileCmd.
🎯 Justification
This feature is required if we want to fully lower WhileCmd into an unrolled CUDA graph.
🚀 Kind of Contribution
✨ New Feature
🧪 Unit Tests:
xla/backends/gpu/runtime/command_buffer_cmd_test.cc: CommandBufferCmdTest:DynamicSliceCopyFusionCmd
Copybara import of the project:
--
feb2902fca397360460f6b9788ac0f7482cb547c by Shawn Wang <shawnw@nvidia.com>:
Enable command buffer DynamicSliceCopyFusion command unrolling
Merging this change closes #32688
PiperOrigin-RevId: 822104580
Imported from GitHub PR https://github.com/openxla/xla/pull/32719
📝 Summary of Changes
This PR enables the command buffer DynamicSliceFusion command to be recorded into an unrolled CUDA graph when it is surrounded by a WhileCmd.
🎯 Justification
This feature is required if we want to fully lower WhileCmd into an unrolled CUDA graph.
🚀 Kind of Contribution
✨ New Feature
🧪 Unit Tests:
xla/backends/gpu/codegen/dynamic_slice_fusion_test.cc
Copybara import of the project:
--
daa975804cbffcc3a6bc5b37e3494b51a2dbe2ca by Shawn Wang <shawnw@nvidia.com>:
DynamicSliceFusionCmd supports unrolling
Merging this change closes #32719
PiperOrigin-RevId: 822071751
According to benchmarks, we have reached performance neutrality with the legacy emitter, so we are switching to the new emitter by default.
The legacy emitter will be kept for some time but is considered deprecated and should not be used; it will be deleted in the near future.
PiperOrigin-RevId: 822067921
The fission autotuner previously only searched for dot instructions in the entry computation of an HLO module. This caused it to miss dot operations located in nested computations, such as the body of a while loop, preventing the autotuner from applying configurations to them.
PiperOrigin-RevId: 822037141
We adjusted the emitter to handle the case when the scale is missing.
We also relaxed the HLO verifier a bit and tweaked the composite rewriter so that it accepts the dimension indexes passed by JAX.
PiperOrigin-RevId: 822036474
When removing ops, we need to do so in a deterministic order. The reason is
that removing a user works by finding the position of the user in the users
vector, swapping it with the last element of the vector, and then popping the
last element. So if more than one element is removed from a users list, the
order in which the elements are removed matters.
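To make the order-dependence concrete, here is a minimal C++ sketch (illustrative only, not the actual XLA code) of swap-and-pop removal:
```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch of swap-and-pop user removal: find the user, swap
// it with the last element of the vector, then pop the back.
void RemoveUser(std::vector<char>& users, char user) {
  auto it = std::find(users.begin(), users.end(), user);
  std::iter_swap(it, users.end() - 1);
  users.pop_back();
}

// With users = {A, B, C, D}:
//   removing A then B leaves {D, C},
//   removing B then A leaves {C, D}.
// The surviving order differs, so removals must happen in a
// deterministic order to keep compilation deterministic.
```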
PiperOrigin-RevId: 822026351
Imported from GitHub PR https://github.com/openxla/xla/pull/32905📝 Summary of Changes
Allow mixed precision asynchronous collective-permute in the verifier.
🎯 Justification
Fixes https://github.com/openxla/xla/issues/32845
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
N/A
🧪 Unit Tests:
Tests that verifier passes on mixed precision collective-permute-start and collective-permute-done.
🧪 Execution Tests:
Manually tested the JAX repro from https://github.com/openxla/xla/issues/32845
Copybara import of the project:
--
f44faa7ce7ecfbd810983cae170a118bb19a8bb3 by Jaroslav Sevcik <jsevcik@nvidia.com>:
Allow mixed precision operands for async collective permute
Merging this change closes #32905
PiperOrigin-RevId: 822023349
Imported from GitHub PR https://github.com/openxla/xla/pull/32773
📝 Summary of Changes
Remove hardcoded NHWC convolution layout for fp16 precision.
🎯 Justification
Performance drops for fp16 precision on gfx11xx and gfx12xx GPUs were observed internally, as well as by the [community](https://github.com/jax-ml/jax/issues/30548).
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark
A community member provided [the script with which the profiling can be done](https://github.com/jax-ml/jax/issues/30548#issue-3270872993).
Significant performance improvement for fp16 on gfx12xx after the fix:
```
Running on: rocm:0
Testing float32...
Avg time: 0.092307 s, Throughput: 1.68 TFLOP/s
Testing float16...
Avg time: 0.011742 s, Throughput: 13.17 TFLOP/s
Testing bfloat16...
Avg time: 0.011989 s, Throughput: 12.90 TFLOP/s
```
Results of the profiling before the fix:
```
Running on: rocm:0
Testing float32...
Avg time: 0.092312 s, Throughput: 1.67 TFLOP/s
Testing float16...
Avg time: 0.775142 s, Throughput: 0.20 TFLOP/s
Testing bfloat16...
Avg time: 0.011990 s, Throughput: 12.90 TFLOP/s
```
@xla-rotation can you please review this PR?
Copybara import of the project:
--
c9fdba79e32c13d9cbf640e61d941d071fabba9d by Aleksa Arsic <Aleksa.Arsic@amd.com>:
Remove hardcoded convolution NCHW layout assignment for fp16 precision.
--
69660d19999a14b24d63b52e6dae310cfbdcbb6b by Aleksa Arsic <Aleksa.Arsic@amd.com>:
Add unit tests for ROCm layout assignment.
Merging this change closes #32773
PiperOrigin-RevId: 822022522
Imported from GitHub PR https://github.com/openxla/xla/pull/32724
Copybara import of the project:
--
c3f4ff8ec6af27d24b61e2aa529585697b8aa77a by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:
Disable only the test cases that are failing and enable 3 test targets on B200.
--
1f6e52218ec124bb52d4dba70aa7832311762465 by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:
Disable test case in cudnn_test that fails on Google's B200.
Keep gpu_compiler_test off CI for now due to memory leak
found by ASAN, but don't revert the changes in the file,
so it can be enabled more easily in the future.
--
42e501a41e43c174538ab186c659a072101b4ab2 by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:
Disable ConvWgradWithNHWCLayoutExecutesCorrectly only on Blackwell.
Merging this change closes #32724
PiperOrigin-RevId: 821992088
This change implements native support for `xla::Executable::GetOutputLayouts()` in the PJRT C API when the PJRT Layouts extension is available. This support does not fetch the optimized HLO, which makes the method faster and more reliable.
This change strongly recommends that plugins implementing Layouts extension v2 upgrade to v3 to avoid an incompatibility.
PiperOrigin-RevId: 821834116
Future note: At some point, `NanoArray` would need to distinguish between a default layout vs. a concrete layout that is equal to the default layout. If the latter is used, `NanoArray::pjrt_layout()` is expected to return the concrete layout. This is not required by IFRT API semantics yet, but it will be enforced in the future.
PiperOrigin-RevId: 821808592
We can now produce arbitrary iteration patterns for output tiles, simply by
parametrizing calls to `ComputeTiledHloInstructions` with different
`TiledHloSchedule`s.
PiperOrigin-RevId: 821796530
IFRT Proxy now returns a `nullptr` if it knows that the Array layout represents a default layout. The user code previously has been migrated to handle this new behavior gracefully, obtaining a concrete default layout as before.
Caveat: the IFRT Proxy client infers the layout of the output arrays from `LoadedExecutable::GetOutputLayouts()`, which always returns concrete layouts today. Thus, these output arrays would use concrete layouts for default layouts, even if the arrays on the server side use `nullptr` for default layouts. This behavior is currently acceptable because all users convert the layout into a concrete one before using it; eventually, this behavior will change so that the IFRT Proxy client reflects the array layouts on the server side more accurately.
PiperOrigin-RevId: 821741105
The functionality has been removed previously, but the option was never cleaned up. This does not remove the xla_ignore_channel_id debug option because it also has a non-verifier use.
PiperOrigin-RevId: 821737613
Right now, we use `GetXlaPjrtCpuClient` which in turn calls `GetPjRtCpuClient`, but we will later update `GetXlaPjrtCpuClient` to use the C sandwich, in which case we must call `GetPjRtCpuClient` here in `PJRT_Client_Create`.
This change is a no-op.
PiperOrigin-RevId: 821732030
The dnn_version in device_description was not set. cl/816579045 fixed this for the old autotuner infra; this change ports that fix to the new autotuner infra.
PiperOrigin-RevId: 821728904
- We encounter this case very often (for the cuBLAS autotuner), so it makes sense to optimize it.
- Running cuBLAS kernels as part of autotuning has an unintended side effect that changes the optimized HLO; this fix also mitigates that issue while we look into it further.
PiperOrigin-RevId: 821716593
Imported from GitHub PR https://github.com/openxla/xla/pull/32782
📝 Summary of Changes
Fix hermetic build for rocm.
🎯 Justification
Introduce the missing hipblaslt dependency.
Fix invalid library linking and align with the data directories.
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
CI, not relevant
🧪 Unit Tests:
Not relevant
🧪 Execution Tests:
Not relevant
Copybara import of the project:
--
f5cb68b0df2265b7048d0068eedd07cccf67e228 by Alexandros Theodoridis <atheodor@amd.com>:
Add missing hermetic lib dependency
--
fe0c9a7fdd36180fea5cf63e20d864355ed98a6c by Alexandros Theodoridis <atheodor@amd.com>:
Add missing hipblaslt deps, fix the targets
--
540d79dd4287a013a3f178ef34a5b96fb8a8a92f by Alexandros Theodoridis <atheodor@amd.com>:
Make hipblaslt mandatory
--
3a6f2282669a1ece4518cc69a01ad76275b603a1 by Alexandros Theodoridis <atheodor@amd.com>:
Fix test
--
eb21b60d34978191315a0c9775d2cb53309dc72d by Alexandros Theodoridis <atheodor@amd.com>:
Ignore asnsigaltstack
--
54c8af2abd7dd682a8494caa05854d574209aa20 by Harsha Havanur Shamsundara <harsha.havanurshamsundara@amd.com>:
[ROCm] Use working sha256 for latest ROCm 7.0 docker image
--
9629a9fc9201a80dba7a0beecb8ee0797960ff6f by Harsha HS <Harsha.HavanurShamsundara@amd.com>:
[ROCm] Add ROCM_PATH repo_env to test scripts
--
1ef6772c6df6aeffcbcc2f27a0ede558fbc6270f by Alexandros Theodoridis <atheodor@amd.com>:
Fix buildifier warning
Merging this change closes #32782
PiperOrigin-RevId: 821614030
In cases where a program argument with AUTO layout is used in more than one Fragment, enforce the DEFAULT layout, as we cannot allow different compiled layouts.
PiperOrigin-RevId: 821612799
This enables migrating the Triton emitter to use the xtile entry, insert & extract emission in the child PR.
The main difference is the memref arguments in the entry function, for which `MemrefToPtr` & `PtrToMemref` were introduced; these closely resemble `UnrealizedConversionCastOp` with additional verification and will enable special folding of `memref::TransposeOp`.
PiperOrigin-RevId: 821593545
This gives us the two HalfClose events plus HandleEvent() and SendRawFrame() as
the API from the socket integration, so subclasses can handle these
accordingly. This also moves the responsibility for destruction into the handler
logic, with the contract that the event is removed from the loop on the second HalfClose event.
PiperOrigin-RevId: 821445213
Given a user seed, this will update the MSA sort-order priority of a (small) number of randomly selected instructions during compilation.
This causes small perturbations in the compiler's prefetching decisions, which enables two main use cases:
1. finding out whether there is a single instruction that was given a "wrong" priority by the compiler so it can be fixed
- to do this, we run some benchmark many times with different seeds until we find a seed that drastically reduces the compiled code's runtime
- once we find that seed, we can use binary search to narrow the "selection range" and zero in on the one specific offending instruction
2. finding a lot of small changes that together reduce the runtime
- we can do this using a "hill-climbing" method
- try many perturbations until you find one slightly better than the baseline.
- try many followup perturbations (perturbing the best perturbation from the previous stage) until you find one slightly better again
- repeat until no more improvements are found
NOTE: Right now there is no good way of finding which instructions had their priority adjusted (which is especially important in (1) to find the one offending instruction). The only way to do so is to increase the log level of the compilation debug print and then look at the logs.
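A minimal sketch of the hill-climbing loop from (2), assuming a hypothetical `RunBenchmarkWithSeed` helper that compiles and times the module with a given perturbation seed:
```cpp
#include <cstdint>
#include <random>

// Hypothetical helper: compile and time the benchmark with the given
// MSA perturbation seed. Not a real XLA API.
double RunBenchmarkWithSeed(uint64_t seed);

// Hill climbing: keep perturbing the best seed found so far until a
// round of candidates yields no improvement.
uint64_t HillClimb(uint64_t initial_seed, int candidates_per_round) {
  uint64_t best_seed = initial_seed;
  double best_time = RunBenchmarkWithSeed(best_seed);
  bool improved = true;
  while (improved) {
    improved = false;
    std::mt19937_64 rng(best_seed);  // derive candidates from the best seed
    for (int i = 0; i < candidates_per_round; ++i) {
      const uint64_t candidate = rng();
      const double time = RunBenchmarkWithSeed(candidate);
      if (time < best_time) {
        best_time = time;
        best_seed = candidate;
        improved = true;
      }
    }
  }
  return best_seed;
}
```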
PiperOrigin-RevId: 821309046
This change is a no-op since both the newly introduced XLA:TPU option and the corresponding option on the ExportNamedComputation pass are false by default.
PiperOrigin-RevId: 821039969
the transposes are not identity permutations. Identity transposes
should be eliminated separately in HandleTranspose already.
PiperOrigin-RevId: 820903953
PjRt-IFRT directly or indirectly fetched the optimized HLO to get the output
layout mode and output layouts. This seems to introduce a regression in
some jobs that use the PJRT C API and have too large a serialized HLO (> 2 GiB).
As a workaround, PjRt-IFRT gracefully handles output layout mode and
layout discovery errors, and falls back to concrete layouts that are
directly obtained from output `PjRtBuffer`s, which should give the same
behavior before/after the default layout handling change.
Further changes will follow to discover default layout modes and layouts
without going through `PjRtLoadedExecutable::GetHloModules()`.
PiperOrigin-RevId: 820785277
Add placeholders for future Type serialization/deserialization. This is not an ABI-breaking change as it is unused today, and it allows avoiding an ABI-breaking change in the future when FFI adds proper ser/des support for user-defined types.
PiperOrigin-RevId: 820676169
- The VLOG messages are updated to more accurately describe whether the autotuner is finding a config in cache, using a default, or actively tuning for the best config.
- The error contains the HLO instruction.
PiperOrigin-RevId: 820640768
This change utilizes recently added Triton support for smaller block sizes.
Skipping occupancy optimization for some configs is essentially a workaround for incompatible split_k values. The impact of these configs is limited, however, because they are only present in non-exhaustive mode, so they mostly get filtered out anyway.
PiperOrigin-RevId: 820617352
Before this change, we disallowed all-gather so that the partitioner generates the `all-reduce(dynamic-update-slice())` pattern. With this change, we allow all-gather for two reasons.
1. In most cases, all-gather is allowed and preferred.
2. It is easier to read and match the partitioner result.
PiperOrigin-RevId: 820593767
Imported from GitHub PR https://github.com/openxla/xla/pull/32388
📝 Summary of Changes
Support collectives with non-minor-most last dimension in the sub-byte collective normalization pass.
🎯 Justification
Makes more collectives efficient by not requiring type conversion.
🚀 Kind of Contribution
Performance Improvement.
📊 Benchmark (for Performance Improvements)
```
Before:
## Execution time, file=u4_all_gather_1x8.hlo repeat=1 duration=68384ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=2 duration=67744ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=3 duration=66976ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=4 duration=67040ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=5 duration=66816ns
After:
## Execution time, file=u4_all_gather_1x8.hlo repeat=1 duration=41216ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=2 duration=40960ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=3 duration=40960ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=4 duration=41056ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=5 duration=40960ns
```
Measured on 8xH100 DGX.
🧪 Unit Tests:
yes
🧪 Execution Tests:
yes
Copybara import of the project:
--
a3777523ffffbcc59da285544e3fb5575d098b9c by Ilia Sergachev <isergachev@nvidia.com>:
[GPU] Sub-byte collective normalization: support collectives with non-minor-most last dimension.
Merging this change closes #32388
PiperOrigin-RevId: 820585923
Imported from GitHub PR https://github.com/openxla/xla/pull/32678
📝 Summary of Changes
- Fix sha256 of docker image to ensure CI is not broken due to malformed image
- Fix test scripts by passing ROCM_PATH to bazel sandbox via repo_env
🎯 Justification
Continued CI runs
🚀 Kind of Contribution
🧪 Tests
Copybara import of the project:
--
3ca8114613d8e002c137f28bb6608639d08a724a by Harsha Havanur Shamsundara <harsha.havanurshamsundara@amd.com>:
[ROCm] Use working sha256 for latest ROCm 7.0 docker image
--
09ddfbdf205a6406cdd67e20671f41455fffe0f9 by Harsha HS <Harsha.HavanurShamsundara@amd.com>:
[ROCm] Add ROCM_PATH repo_env to test scripts
Merging this change closes #32678
PiperOrigin-RevId: 820582560
Imported from GitHub PR https://github.com/openxla/xla/pull/32718
📝 Summary of Changes
This PR adds conv fusion support in cudnn fusion compiler.
* add a conv type in `CuDnnFusionConfig` to represent the different types of conv. We are getting rid of the conv custom call target, so this info has to be preserved in the fusion config.
* add `ConvDimensionAdapter` to generate an NCHW **logical layout** for the cudnn frontend, while the physical layout can be NHWC (the most preferable layout) or NCHW (for int conv). Only the NHWC layout is used in the unit tests because layout assignment currently doesn't handle conv fusions to transform other layouts to NHWC; this needs to be addressed in a separate PR.
* add conv translation rule from XLA conv to cudnn frontend graph API.
* Other parts of the lowering are taken care of automatically by the current cudnn fusion compiler: workspace allocation/graph validation/graph compilation/graph serialization.
🎯 Justification
This is the first step to unify the conv as cudnn fusion in XLA. Conv custom call will be replaced with conv fusions in the future.
🚀 Kind of Contribution
✨ New Feature
📊 Benchmark (for Performance Improvements)
No Performance changes are expected.
🧪 Unit Tests:
Added 3 hand written NHWC conv unit tests for conv_fprop/conv_dgrad/conv_wgrad.
🧪 Execution Tests:
Added 3 hand written NHWC conv unit tests for conv_fprop/conv_dgrad/conv_wgrad.
Copybara import of the project:
--
57555cd0e3759aacb7a98135c3261f4cc3f642c2 by Cjkkkk <ske@nvidia.com>:
init
--
d6edecfa42a6371a0908e22daeb8deaf32998ece by Cjkkkk <ske@nvidia.com>:
address comments
--
17df6f8451274f070d7d332a126cfefa1ef7df83 by Cjkkkk <ske@nvidia.com>:
removed one comment
--
1b7c63b1ade7751cf8f68c7fb11cd68491440081 by Cjkkkk <ske@nvidia.com>:
add const
Merging this change closes #32718
PiperOrigin-RevId: 820574737
We're perfectly able to construct a schedule using only a subset of the
iteration space of a `tile_offsets_indexing`---and in fact need to when we are
processing nested fusions.
PiperOrigin-RevId: 820454010
* Deserializing MLIR modules still tries to parse as a string first, as that's the default; on failure, it tries to uncompress and then parse.
PiperOrigin-RevId: 820396326
`libnvidia-ml.so` version is coupled with kernel mode driver version, hence we can't provide a custom version of `libnvidia-ml.so` if the machine has a different KMD installed on it.
PiperOrigin-RevId: 820291348
If a multi-controller JAX program is running a collective (e.g., AllReduce) and
one of the collective participants fails, we'll abort the collective.
Previously, a JAX programmer didn't have a way to know that their collective
was aborted. The collective would just return garbage data. This commit changes
the TFRT GPU client to propagate aborted collectives into Python exceptions.
There is some subtlety in detecting when we abort a collective. The NCCL API
unfortunately doesn't provide a direct way to query whether or not a collective
was aborted. Instead, we check the health of the participants after the
collective returns. If the participants are stale, then we conservatively
assume the collective was canceled.
This leads to the possibility that some processes think the collective was
aborted and some don't. I'll address this lack of atomicity in future changes.
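A rough sketch of that detection logic, with the `Participant` type and all names assumed rather than taken from the actual TFRT GPU client:
```cpp
#include "absl/status/status.h"
#include "absl/types/span.h"

// Assumed interface: "stale" means the participant missed recent
// health checks / heartbeats.
struct Participant {
  bool stale = false;
  bool IsStale() const { return stale; }
};

// Sketch of the conservative detection described above: NCCL has no
// direct "was this collective aborted?" query, so after the collective
// returns we inspect participant health and treat any staleness as an
// aborted collective.
absl::Status CheckCollectiveResult(
    absl::Span<const Participant> participants) {
  for (const Participant& p : participants) {
    if (p.IsStale()) {
      return absl::AbortedError(
          "A collective participant is unhealthy; conservatively assuming "
          "the collective was aborted and its result is invalid.");
    }
  }
  return absl::OkStatus();
}
```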
PiperOrigin-RevId: 820272371
This change promotes `asinh` from a composite operation to a native HLO opcode (`kAsinh`). This allows for direct lowering to device-specific intrinsics which should be more performant. Support is added for GPU.
PiperOrigin-RevId: 820245338
Imported from GitHub PR https://github.com/openxla/xla/pull/31030
📝 Summary of Changes
This PR moves the ReduceScatterCreator pass to run after AlgebraicSimplifier, simplifying the transformation pattern and allowing ReduceScatterCreator to convert more all-reduces into reduce-scatters that would otherwise be missed.
🎯 Justification
Running ReduceScatterCreator after AlgebraicSimplifier makes the input patterns easier to recognize. This allows more all-reduces to be converted into reduce-scatters, which would otherwise be missed, leading to better performance. _This was reported internally as an optimization for llama3.3-70b._
🚀 Kind of Contribution
⚡️ Performance Improvement,
📊 Benchmark (for Performance Improvements)
On H100:
| | PR | main |
|----------|----------|----------|
| llama31_8b_bf16_1x8 | 1372251 us | 1369631 us |
| llama31_8b_fp8_1x8 | 1106135 us | 1107605 us |
| llama31_8b_bf16_2x8 | 1373637 us | 1370564 us |
| llama31_8b_fp8_2x8 | 1111912 us | 1108061 us |
| llama31_70b_bf16_16x8 | 13933022 us | 13913957 us |
| llama31_70b_fp8_16x8 | 9848173 us | 9867955 us |
| llama31_70b_bf16_32x8 | 14103619 us | 14065225 us |
| llama31_70b_fp8_32x8 | 9732961 us | 9760739 us |
| llama31_405b_bf16_64x8 | 52926476 us | 52886529 us |
| llama31_405b_fp8_64x8 | 35576505 us | 37929776 us |
| mixtral_8x7b_bf16_1x8 | 744367 us | 744491 us |
| mixtral_8x7b_bf16_2x8 | 1126425 us | 1130912 us |
🧪 Unit Tests:
Added a new unit test
🧪 Execution Tests:
Tested for functionality with llama3.3 70b zero1 + gradient accumulation and saw ~5% performance improvement.
Copybara import of the project:
--
2d999987762ac3d90960179b06587bc95fc954d1 by Sevin Varoglu <svaroglu@nvidia.com>:
Move ReduceScatterCreator after AlgebraicSimplifier
--
0e41c2b8281234eec9af21a98fd5f81bd4884689 by Sevin Varoglu <svaroglu@nvidia.com>:
Add unit test
Merging this change closes #31030
PiperOrigin-RevId: 820221148
A concrete use case when such a schedule is useful is when we have a matrix
multiplication such that a chunk of shape `(block_m, k)` of the left-hand
side argument fully fits into L2. The transposed iteration order will step
through the `n` dimension first, allowing us to hit the L2 cache more often
when loading tiles of the left-hand side.
This schedule is intentionally restricted at the moment in order to unblock
launching the generic Triton emitter for GEMMs.
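An illustration of the iteration order (not the emitter's code; `ProcessTile` is a stand-in for emitting one output tile):
```cpp
// Illustration only: ProcessTile stands in for emitting or executing
// one (block_m, block_n) output tile of the GEMM.
void ProcessTile(int m, int n);

// Transposed schedule: the inner loop steps through `n`, so the
// (block_m, k) LHS chunk for a fixed `m` is loaded once and then hit in
// L2 by every subsequent tile of that row.
void VisitTiles(int num_m_tiles, int num_n_tiles) {
  for (int m = 0; m < num_m_tiles; ++m) {
    for (int n = 0; n < num_n_tiles; ++n) {
      ProcessTile(m, n);
    }
  }
}
```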
PiperOrigin-RevId: 820214481
And dump the log proto into a file called buffer_debug_log rather than sdc_log.
Changes to implementation details:
- Renames:
- SdcLogProto -> BufferDebugLogProto
- SdcLog -> BufferDebugLog
- SdcBufferId -> ThunkBufferId
- SdcThunk -> BuffersChecksumThunk
- SdcXorChecksumKernel -> BufferDebugXorChecksumKernel
- move BufferDebugLog to stream_executor/gpu from stream_executor/cuda as it's not CUDA-specific
PiperOrigin-RevId: 820186034
Remove checks for `padded_k_size.has_value()` as `padded_k_size` is always expected to be present after its initialization. This simplifies the conditions for needing padding and the calculation of the padding amount.
PiperOrigin-RevId: 820148929
A debugging tool meant to pinpoint nondeterministic computations by finding
differences in buffer values across multiple runs. It makes XLA calculate
checksums of input/output buffers, and dump them to the output directory.
Enabling the new `--xla_gpu_experimental_enable_checksum_tracing_on_thunks`
flag enables a new ThunkChecksumTracingPass, which adds checksum thunks to the
thunk graph:
- Inserts SDC log initialization at the beginning.
- Replaces each thunk with a SequentialThunk [checksum inputs, run original
thunk, checksum outputs].
- Inserts a thunk that dumps the SDC log to a file at the end of execution (sketched below).
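A sketch of the rewrite, with all names illustrative rather than the actual XLA thunk classes:
```cpp
#include <memory>
#include <utility>
#include <vector>

// All names below are illustrative, not the actual XLA thunk classes.
struct Thunk {
  virtual ~Thunk() = default;
};

std::unique_ptr<Thunk> MakeLogInitThunk();
std::unique_ptr<Thunk> MakeLogDumpThunk();
std::unique_ptr<Thunk> MakeInputChecksumThunk(const Thunk& t);
std::unique_ptr<Thunk> MakeOutputChecksumThunk(const Thunk& t);
std::unique_ptr<Thunk> MakeSequentialThunk(
    std::vector<std::unique_ptr<Thunk>> seq);

// Every thunk becomes a SequentialThunk of [checksum inputs, run
// original thunk, checksum outputs], bracketed by log-init and
// log-dump thunks.
std::vector<std::unique_ptr<Thunk>> AddChecksumTracing(
    std::vector<std::unique_ptr<Thunk>> thunks) {
  std::vector<std::unique_ptr<Thunk>> out;
  out.push_back(MakeLogInitThunk());
  for (std::unique_ptr<Thunk>& thunk : thunks) {
    std::vector<std::unique_ptr<Thunk>> seq;
    seq.push_back(MakeInputChecksumThunk(*thunk));    // checksum inputs
    seq.push_back(std::move(thunk));                  // run original thunk
    seq.push_back(MakeOutputChecksumThunk(*seq[1]));  // checksum outputs
    out.push_back(MakeSequentialThunk(std::move(seq)));
  }
  out.push_back(MakeLogDumpThunk());
  return out;
}
```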
PiperOrigin-RevId: 820148916
This is a no-op when `dedupFunctionsFully` is false, which is also the default.
outShardings is the output shardings of the named computation at hand. However, if dedupFunctionsFully is true, the func we pick from `createFuncOpOrGetFromCache` (the func the call will actually be calling) may have a different output sharding than the named computation, and the call result sharding should be set to the output sharding of the func it calls. For example,
namedComputation1(foo): insharding={"x"} outsharding={"y"}
namedComputation2(foo): insharding={"x"} outsharding={"z"}
call1 to namedComputation1
call2 to namedComputation2
When dedupFunctionsFully is false, we have separate instances of foo as their outshardings are different.
func foo1 insharding={"x"} outsharding={"y"} {...}
func foo2 insharding={"x"} outsharding={"z"} {...}
call1 to foo_1 resultsharding={"y"}
call2 to foo_2 resultsharding={"z"}
When dedupFunctionsFully is true, we do not have separate instances of foo; we need to pick either namedComputation1 or namedComputation2. Say we pick namedComputation1; hence it becomes:
func foo insharding={"x"} outsharding={"y"} {...}
call1 to foo resultsharding={"y"}
call2 to foo resultsharding={"y"}
As a result, call2 should have a resultsharding={"y"} since it is calling foo, instead of the out sharding of namedComputation2 which is {"z"}.
PiperOrigin-RevId: 820139879
Loads are required to be 16-byte aligned for Triton to apply pipelining. This change adds extra padding to both split-k rewriters so that the reduction dimensions are a multiple of 16 bytes.
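The padding arithmetic, sketched below; this is an illustration rather than the rewriters' actual code, and it assumes the element size divides 16 evenly:
```cpp
#include <cstdint>

// Round the reduction dimension up so that it spans a whole number of
// 16-byte units.
int64_t PadReductionDim(int64_t k, int64_t bytes_per_element) {
  const int64_t elements_per_16_bytes = 16 / bytes_per_element;
  return (k + elements_per_16_bytes - 1) / elements_per_16_bytes *
         elements_per_16_bytes;
}

// e.g. for bf16 (2 bytes, 8 elements per 16 bytes):
//   PadReductionDim(1000, 2) == 1000  (already a multiple of 16 bytes)
//   PadReductionDim(1001, 2) == 1008  (7 elements of padding added)
```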
PiperOrigin-RevId: 820134896
Currently we would fail when trying to check whether the element type of the
root tuple is supported. We should not even access the element type on a tuple
shape. Therefore we skip the root tuple.
PiperOrigin-RevId: 820096050
This increases test coverage.
Also remove the empty test suite mlir_lit_tests. These tests have been moved to
another directory long ago.
PiperOrigin-RevId: 820074643
We already have a pinned allocation for the original value; it should be finalized to avoid a re-allocation causing multiple pinned allocations for the same buffer.
PiperOrigin-RevId: 820015337
PjRt-IFRT now returns a `nullptr` if it knows that the Array layout represents a default layout. The user code previously has been migrated to handle this new behavior gracefully, obtaining a concrete default layout as before.
`ifrt::PjRtArray` creation now requests extra information on whether the underlying `PjRtBuffer` is using a custom layout, as IFRT tracks the defaultness of array layouts. This information cannot be inferred correctly from `PjRtBuffer` alone because `PjRtBuffer::layout()` only returns a concrete layout. PjRt would mostly work fine today if a default layout were said to be a custom layout, but some strict layout equality checks can fail and require more precise information to be supplied.
A few test cases in IFRT ArrayImplTest against PjRt CPU and GPU clients
have been disabled because the output array does not track the
non-default-ness of the layout correctly when
`MakeArraysFromHostBufferShards()` is implemented using
`ClientMakeArraysFromHostBufferShards()`.
PiperOrigin-RevId: 819995407
Add a placeholder for `HERMETIC_PYTHON_VERSION_KIND` in the generated `py_version.bzl` file. This new variable is currently set to an empty string until we figure out how to deal with it.
PiperOrigin-RevId: 819956767
This was observed when auto warp specialization was enabled. Triton requires more threads per block than expected, and this information is available in the module attributes.
PiperOrigin-RevId: 819893926
The only non-obvious part of the thunk is `solver_context_creator`, but we can retrieve it during the deserialization from `stream_executor::Platform`, which is available during runtime.
PiperOrigin-RevId: 819863398
Imported from GitHub PR https://github.com/openxla/xla/pull/32003
Copybara import of the project:
--
2a3ad034522e871edc9c7f580e86fc3980025542 by Ilia Sergachev <isergachev@nvidia.com>:
[GPU][NFC] Merge methods querying fusion kind.
--
ebeb25599d6017d34ea92ece415a255d109af049 by Ilia Sergachev <isergachev@nvidia.com>:
Address review requests.
Merging this change closes #32003
PiperOrigin-RevId: 819692807
Imported from GitHub PR https://github.com/openxla/xla/pull/32283
📝 Summary of Changes
Change the misleading method name RocmComputeCapability::has_amd_matrix_core() to the more suitable name has_amd_mat_acc_instructions(), as gfx11xx GPUs do not have matrix cores but do support the matrix acceleration instruction set known as WMMA.
🎯 Justification
RocmComputeCapability::has_amd_matrix_core() is misleading, as gfx11xx GPUs do not have matrix cores but still support the matrix acceleration instruction set (WMMA).
🚀 Kind of Contribution
♻️ Cleanup
@xla-rotation please review my changes.
Copybara import of the project:
--
23cf1ab79fdcc4ee2ee4996973dee2c103d2762a by Aleksa Arsic <aleksa.arsic@amd.com>:
Change misleading method name RocmComputeCapability::has_amd_matrix_core() to more suitable name has_amd_mat_acc_instructions() as gfx11xx do not have matrix cores, but support matrix acceleration instruction set known as WMMA.
Merging this change closes #32283
PiperOrigin-RevId: 819652238
Imported from GitHub PR https://github.com/openxla/xla/pull/31994
Copybara import of the project:
--
dd037f3ef1c2da262029a9ebc34845ddb3c8a7f1 by Ilia Sergachev <isergachev@nvidia.com>:
[NFC] Move computation simplification methods from command buffer scheduling to a new library.
--
2594c7a473945f5d410ae8e8894b7e90f5812c1e by Ilia Sergachev <isergachev@nvidia.com>:
Address review feedback.
Merging this change closes #31994
PiperOrigin-RevId: 819631409
Imported from GitHub PR https://github.com/openxla/xla/pull/32642
📝 Summary of Changes
Fix the ROCm build with ASan settings.
🎯 Justification
Fix the invalid run_under script used to pass through the ASan ignore-lists file.
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
Not relevant
🧪 Unit Tests:
Not relevant
🧪 Execution Tests:
Not relevant
Copybara import of the project:
--
708b0b274d18b88ca7467c3ab3f44aaa11710995 by Alexandros Theodoridis <atheodor@amd.com>:
Fix invalid run_under script for ci job and asan ignore files
Merging this change closes #32642
PiperOrigin-RevId: 819630816
We need to ensure that symbols for trivial dimensions are simplified away
consistently. If we simplify it on one side, we also need to simplify it on
the other, as we want to use the same iteration space for both lhs and rhs.
PiperOrigin-RevId: 819606712
This change integrates `CollectiveBroadcastStartThunk` and `CollectiveBroadcastDoneThunk` into the command buffer execution framework, allowing them to be converted into command buffer commands. It also includes a minor fix to an error message.
PiperOrigin-RevId: 819432425
synchronization in raw buffer APIs (namely CopyRawHostToDeviceAndReturnEvent
and CopyRawDeviceToHostAndReturnEvent). Old buffers will require no
synchronization, but recent buffers will get a cached compute_stream event
and will then sync with this compute_stream event repeatedly instead of
syncing with the stream itself.
PiperOrigin-RevId: 819380480
This change reorders arguments within various override and extension calls for consistency. It also removes the explicit `xla_pypi_311_numpy` from the `use_repo(pip, ...)` call.
PiperOrigin-RevId: 819260347
This CL is a key step in integrating the new SymbolicExpr library into IndexingMap (b/433696544). The primary goal is to replace the existing `mlir::MLIRContext` with `gpu::SymbolicExprContext` throughout every class that depends on IndexingMap.
Goal:
- Enables Symbolic Reasoning: `SymbolicExprContext` is designed to manage symbolic expressions, which will allow for more powerful analysis and optimization of indexing maps.
- Performance: Together with pifon@, we believe that by using a dedicated context, compilation time could be improved by not overusing the lock inside MLIRContext (which is used everywhere). This should be confirmed with real data after finishing the entire migration.
- Foundation: This refactoring is a prerequisite for fully replacing `AffineExpr` with `SymbolicExpr` in `IndexingMap`. This should unblock the replacement of the internal implementation (cl/802100018).
Changes:
- Signature Updates: Function signatures across numerous files in `xla/service/gpu`, `xla/backends/gpu`, `xla/backends/cpu`, and `xla/hlo/analysis` have been updated to accept `gpu::SymbolicExprContext*` instead of `mlir::MLIRContext*`.
- Context Propagation: The `SymbolicExprContext` is now created and owned by `GpuCompiler` and propagated down to the various components, including fusion emitters, autotuners, and performance models.
- Test Updates: Unit tests and test utilities have been updated to use the new context.
- Some areas required temporary workarounds where the `SymbolicExprContext` is created on the fly from an existing `mlir::MLIRContext`. These are marked with TODOs (b/446856820, b/446856303) to be cleaned up in follow-up CLs as the integration progresses.
No functional changes are intended, but the `mlir::MLIRContext` was inconsistently managed across the codebase, requiring careful tracing and updates to ensure the new `SymbolicExprContext` is correctly propagated everywhere. This made the refactoring time-consuming, and I would appreciate careful review because I could have made mistakes. In the process I hit 3 different and unrelated segmentation faults and a crash in StorageUniquer from not using the same context in different places.
This CL represents step 2 of the integration plan outlined in b/433696544#comment9.
PiperOrigin-RevId: 819228363
In practice the thunk always has:
* an input and output filter
* either:
* no biases
* both an input and output bias
So encode this invariant in the data structure, to make this more readable and to make it harder to create an invalid thunk (see the sketch below).
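A sketch of how the invariant can be encoded; `Buffer` stands in for the real buffer slice type, and all names here are assumptions:
```cpp
#include <optional>

// Illustrative stand-in for the real buffer slice type.
struct Buffer { /* ... */ };

// Biases come as an input/output pair or not at all, so "input bias
// without output bias" is unrepresentable by construction.
struct BiasPair {
  Buffer input;
  Buffer output;
};

struct ReorderThunkBuffers {
  Buffer input_filter;             // always present
  Buffer output_filter;            // always present
  std::optional<BiasPair> biases;  // nullopt == no biases at all
};
```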
PiperOrigin-RevId: 819099118
Use GpuKernelRegistry for loading the kernel rather than `TypedKernelFactory`.
The new header will help prevent errors related to use of "gpu"-tagged targets
in non-"gpu"-tagged ones.
Also, avoid using atomic fetch_add to prevent JAX build failures on <sm60.
We're going to ensure that with a runtime check.
PiperOrigin-RevId: 819098591
Imported from GitHub PR https://github.com/openxla/xla/pull/32454
Any place inside the compiler has to respect the debug options override set by the users.
Copybara import of the project:
--
6971175737582aad4e9256f983890ac04009a074 by Yunlong Liu <yliu120@users.noreply.github.com>:
Respect debug options override in LHS.
Merging this change closes #32454
PiperOrigin-RevId: 819088927
Running OnReady and Map callbacks in the thread that calls promise.Set() can be dangerous for performance. Add an API to execute all callbacks in the given executor.
PiperOrigin-RevId: 818964744
`ifrt::Array::pjrt_layout()` will soon return a `nullptr` for a default layout (along with the removal of the `absl::StatusOr<>` part). Users can continue to get a concrete default layout via `ifrt::Client::GetDefaultPjRtLayout()`.
During a transition, IFRT implementations may return either `nullptr` or a concrete default layout; this state will be permitted temporarily while they are migrated to return `nullptr` for default layouts.
PiperOrigin-RevId: 818911623
gpu_static_registration is intended to pull in all dependencies needed for the target - this ensures that this is the case.
PiperOrigin-RevId: 818820817
because that is the only supported way to return Tuples, flip the flag to true by default.
Callers can now stop setting this.
PiperOrigin-RevId: 818803753
Instead of passing the filter dimensions as a span of integers, which we implicitly expect to be of size 4, we pass them as a proto.
Using a proto instead of a struct since we'll need the `ConvolutionFilterDimensions` to serialize the `ConvolutionReorderThunk`. (We don't want to serialize the `FilterDescriptor` since most of its fields are only written during execution, so we'll serialize the `ConvolutionFilterDimensions` instead).
Not sure where the best place is for the `ConvolutionFilterDimensions` proto to live. Other options would be to define it:
* in thunk.proto, or
* in some other file more closely related to convolution filters (not sure where that could be).
PiperOrigin-RevId: 818726237
This change moves the creation of the `computation_index_map` outside the loop body in `EmitDot` and `EmitScaledDot`, as it does not depend on the loop induction variable. It also simplifies how the tile size is retrieved in `GetDotLoopIterationCount` by using `TiledHloInstruction::tile_size`.
PiperOrigin-RevId: 818718168
We no longer support module groups with size > 1, so there's no point in supporting the generic RunOnModelGroup. One possible use case of module groups with size == 1 is to be able to *replace* the module (instead of modifying it in place). This adds a new interface to support that.
PiperOrigin-RevId: 818665806
Imported from GitHub PR https://github.com/openxla/xla/pull/32475
📝 Summary of Changes
Make ASan builds hermetic so they can be used with RBE.
🎯 Justification
Add the sanitizer ignore lists as a dependency of the run_under script so they are available on the RBE worker.
🚀 Kind of Contribution
🐛 Bug Fix, ♻️ Cleanup
📊 Benchmark (for Performance Improvements)
not relevant
🧪 Unit Tests:
not relevant
🧪 Execution Tests:
not relevant
Copybara import of the project:
--
cae2ea8d4808c161becb80602fba605ba08a4bd5 by Alexandros Theodoridis <atheodor@amd.com>:
Adjust ci script to include asan ignore list as deps
Merging this change closes #32475
PiperOrigin-RevId: 818658730
Imported from GitHub PR https://github.com/openxla/xla/pull/32357
📝 Summary of Changes
1. Update the TTFT calculation to be the time to the first generated token. This will also impact TPOT calculations.
2. Use the tokenizer to count the number of generated tokens instead of counting space-separated words.
🎯 Justification
Currently the script computes TTFT as the time to the first token, which comes from the prompt and is still in the prefill stage.
🚀 Kind of Contribution
🐛 Bug Fix
Copybara import of the project:
--
25178775f936a6f40a205e6969582222f150f0dd by Gauri Deshpande <gauri1.deshpande@intel.com>:
Update gemma2 keras benchmark script - fix ttft, and use tokenizer
--
9b20ead588ad38152e648067c7d34314ba8a5645 by Gauri Deshpande <gauri1.deshpande@intel.com>:
address review comments
Merging this change closes #32357
PiperOrigin-RevId: 818656164
For a custom call we need to resolve a target name (string) to a function pointer. So far this happens in `IrEmitterUnnested` (at the end of the compilation pipeline). But for thunk serialization we need this to happen at runtime (when the thunks are getting reconstructed from the proto representation). Therefore I'm moving this resolving step into the `CustomCallThunk` factory function.
Note that there remains a way to construct a `CustomCallThunk` from just a function pointer. These thunks will not be serializable, and that's okay; the logic handles these cases and returns an error. It is important for tests to be able to quickly create a CustomCallThunk from a closure; if we had to register these calls in the registry first, it would complicate our tests significantly.
In detail this change entails:
- Move resolver logic in new overloads of the `CustomCallThunk::Create` factory function.
- Call these overloads from `IrEmitterUnnested` and the custom kernel fusion emitter.
- Add tests for the new overloads
- Migrate some tests to FFI (the new custom call registry and API)
- Adjust some error codes in custom call tests. (If a custom call is not found, `kNotFound` is now returned instead of `kUnimplemented`.)
PiperOrigin-RevId: 818655300
`Array::pjrt_layout()` will be changed to return `nullptr` to indicate a default layout, where the callers can obtain the corresponding concrete default layout by using `Client::GetDefaultPjRtLayout()`.
This change adds `nullptr` handling preemptively before the new `Array::pjrt_layout()` semantics becomes effective so that the existing code works as before.
Tests using `Array::pjrt_layout()` method calls are minimally updated to add a non-nullness check. They will be updated as `Array::pjrt_layout()` actually returns `nullptr`.
PiperOrigin-RevId: 818618831
There's no shared logic between the `ConvolutionReorderThunk` and the `ConvolutionThunk`, so I think it's cleaner for each to be defined in its own file.
PiperOrigin-RevId: 818613921
The output of an IFRT IR program can be either a direct argument (`mlir::BlockArgument`) or the output of an `xla::ifrt::CallLoadedExecutableOp` or an `ifrt::CopyArraysOp`.
PiperOrigin-RevId: 818592194
Imported from GitHub PR https://github.com/openxla/xla/pull/32504
📝 Summary of Changes
Remove rocm_diagnostics.cc
🎯 Justification
The RocmDiagnostics module never worked and provides no meaningful information to the user.
🚀 Kind of Contribution
♻️ Cleanup
📊 Benchmark (for Performance Improvements)
N/A
🧪 Unit Tests:
None
🧪 Execution Tests:
None
Copybara import of the project:
--
73c4357ea80c720e2e46ddc0f91c8943e571b1ca by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:
[ROCm] Remove rocm_diagnostics
Merging this change closes #32504
PiperOrigin-RevId: 818581529
The removed data fields are not used and are always initialized from a nullptr.
The removed functions are not called from anywhere. So let's remove all of that.
PiperOrigin-RevId: 818504444
Imported from GitHub PR https://github.com/openxla/xla/pull/31795
📝 Summary of Changes
A [downstream check](https://github.com/openxla/xla/blob/main/xla/pjrt/pjrt_executable.cc#L288-L299) assumes tuples are in the default memory space; force-assigning the default color to tuples gets around the check.
🎯 Justification
NCCL user buffer runs are crashing on MaxText main. This PR fixes the crash.
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
N/A.
🧪 Unit Tests:
Existing unit tests.
🧪 Execution Tests:
Added multiple execution tests.
Copybara import of the project:
--
c60fe9d62827596eac57df2b480891520b40ab07 by Terry Sun <tesun@nvidia.com>:
assign default color to tuples
--
717412a55a94be71afcbb7627f03905c408f8b6a by Terry Sun <tesun@nvidia.com>:
add constant and polish doc string
--
c907b2d1ca5a62299b6bfd2534e99c6215313ffd by Terry Sun <tesun@nvidia.com>:
update test
Merging this change closes #31795
PiperOrigin-RevId: 818295813
This makes the rendezvous name more informative by including the specific type of collective operation, which helps in debugging.
PiperOrigin-RevId: 818215415
For some annotations, it is more user-friendly to print the annotation
before the annotated context. Add an option for this and remove the
automatic "; " delimiter to allow more flexibility in the annotation
format.
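A minimal sketch of the resulting flexibility, with invented names; callers now control both the ordering and the delimiter:

```cpp
#include <string>

#include "absl/strings/str_cat.h"
#include "absl/strings/string_view.h"

// No automatic "; " is inserted; any delimiter is part of the annotation.
std::string Annotate(absl::string_view context, absl::string_view annotation,
                     bool annotation_first) {
  return annotation_first ? absl::StrCat(annotation, context)
                          : absl::StrCat(context, annotation);
}
```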
PiperOrigin-RevId: 818118289
This also removes the code that handles original values in MergeFusionInstructionIntoMultiOutput, as it eventually calls into HloCallableInstruction::CloneAndAppendInstructionIntoCalledComputation to create a tuple result and the corresponding original value.
PiperOrigin-RevId: 817944313
`Array::pjrt_layout()` will be changed to return `nullptr` to indicate a default layout, in which case callers can obtain the corresponding concrete default layout via `Client::GetDefaultPjRtLayout()`.
This change adds `nullptr` handling preemptively, before the new `Array::pjrt_layout()` semantics take effect, so that the existing code keeps working as before.
Tests that call `Array::pjrt_layout()` are minimally updated to add a non-nullness check. They will be updated once `Array::pjrt_layout()` actually returns `nullptr`.
PiperOrigin-RevId: 817893146
It's hard to construct a `ThreadPoolExecutor` from a `ThreadPool&` reference and correctly manage the lifetime of the executor. Instead, make it possible to get a `tsl::Executor` adaptor from a `ThreadPool` instance.
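A hedged sketch of the adaptor idea with invented, minimal interfaces (the real tsl types differ): borrowing an executor view from a pool the caller already owns sidesteps the lifetime question.

```cpp
#include <functional>
#include <utility>

// Invented stand-ins for illustration only.
class Executor {
 public:
  virtual ~Executor() = default;
  virtual void Execute(std::function<void()> fn) = 0;
};

class ThreadPool {
 public:
  void Schedule(std::function<void()> fn) { fn(); }  // stand-in: run inline

  // Borrow an executor whose lifetime is tied to the pool itself, rather
  // than constructing a separately owned executor from a ThreadPool&.
  Executor& AsExecutor() { return adaptor_; }

 private:
  struct Adaptor final : Executor {
    explicit Adaptor(ThreadPool* p) : pool(p) {}
    void Execute(std::function<void()> fn) override {
      pool->Schedule(std::move(fn));
    }
    ThreadPool* pool;
  };
  Adaptor adaptor_{this};
};
```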
PiperOrigin-RevId: 817872041
These macros are carefully designed to
1. have the exact same API as absl's `CHECK` macros,
2. produce error messages and content in exactly the same format as absl's `CHECK`, and
3. respect absl flags, e.g. `ABSL_MIN_LOG_LEVEL`.
They differ from absl's `CHECK` only in that, on check failure, they
1. prepend an error code to the error message and append a link to the openxla.org webpage, and
2. additionally append DebugMeContext information if available (see the sketch below).
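A minimal sketch of the failure-path decoration, using a hypothetical macro name, error-code scheme, and URL (the real macros differ):

```cpp
#include "absl/base/optimization.h"
#include "absl/log/log.h"

// Hypothetical macro: behaves like CHECK, but on failure prepends an error
// code and appends a documentation link to the message.
#define DEMO_CHECK(cond, error_code)                                    \
  do {                                                                  \
    if (ABSL_PREDICT_FALSE(!(cond))) {                                  \
      LOG(FATAL) << "E" << (error_code) << ": Check failed: " << #cond  \
                 << ". See https://openxla.org/ for details.";          \
    }                                                                   \
  } while (0)
```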
PiperOrigin-RevId: 817868212
- Added MODULE.bazel and module extensions to introduce external dependencies with Bzlmod.
- Added a CI config for Linux CPU build with Bzlmod (enabled by `--config=bzlmod`)
TODOs:
- Support overriding Python runtime
- Support build with more configs and platforms
PiperOrigin-RevId: 817711851
For the `ConvolutionThunk` (de)serialisation we need to make the `GpuConvDescriptor` serializable, and for that we need `CudnnConvKind` too.
A couple additional changes:
* Renamed the existing C++-enum-to-proto-enum conversion to a (hopefully) more readable name.
* Enforce at compile time that every C++ enum value can be mapped to the proto version. I can't think of a case where we wouldn't want this, and with this change we can get rid of some non-OK `Status` invariants (see the sketch below).
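A sketch of the compile-time mapping idea with illustrative enum values (the real `CudnnConvKind` cases differ): an exhaustive `switch` with no `default` makes the compiler flag unmapped values, so the conversion never needs a non-OK `Status` path.

```cpp
// Illustrative enums; not the actual XLA/proto definitions.
enum class CudnnConvKind { kForward, kBackwardInput, kBackwardFilter };
enum class CudnnConvKindProto { FORWARD, BACKWARD_INPUT, BACKWARD_FILTER };

// No default case: adding a CudnnConvKind value without extending this
// switch triggers a compiler diagnostic instead of a runtime error.
constexpr CudnnConvKindProto ToProto(CudnnConvKind kind) {
  switch (kind) {
    case CudnnConvKind::kForward:
      return CudnnConvKindProto::FORWARD;
    case CudnnConvKind::kBackwardInput:
      return CudnnConvKindProto::BACKWARD_INPUT;
    case CudnnConvKind::kBackwardFilter:
      return CudnnConvKindProto::BACKWARD_FILTER;
  }
}
```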
PiperOrigin-RevId: 817676211
Imported from GitHub PR https://github.com/openxla/xla/pull/28740
This PR tries to lower `DynamicSliceThunk` into a command buffer, even if it depends on the loop iteration.
The command buffer implementation uses the same approach as `DynamicSliceThunk` (using `HloEvaluator` to compute new allocation addresses at runtime): on each iteration, the command buffer uses `HloEvaluator` to get the new addresses and performs a graph update with them.
The major change to the custom.cc file resolves the issue that once a module has been processed by the command buffer scheduler, it is rewritten into nested calls, which breaks the while-loop analysis and module extraction patterns; the fix introduces a cloned, inlined module and performs the loop analysis and module extraction on that inlined module.
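A high-level sketch of the per-iteration flow, with invented stand-in types and helpers (`EvaluateOffsets`, `UpdateGraphAddresses`):

```cpp
#include <cstdint>
#include <vector>

// Invented stand-ins; the real XLA types and APIs differ.
struct OffsetModule {};
struct CommandBuffer {};
std::vector<void*> EvaluateOffsets(const OffsetModule&, int64_t iteration);
void UpdateGraphAddresses(CommandBuffer&, const std::vector<void*>&);
void Execute(CommandBuffer&);

void RunLoop(const OffsetModule& offsets, CommandBuffer& buffer,
             int64_t trip_count) {
  for (int64_t i = 0; i < trip_count; ++i) {
    // Re-evaluate the slice offsets for this iteration (HloEvaluator in
    // the real code), then patch the recorded graph with the new
    // addresses instead of re-recording it.
    UpdateGraphAddresses(buffer, EvaluateOffsets(offsets, i));
    Execute(buffer);
  }
}
```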
Copybara import of the project:
--
2fe7c75a9fcbc9ade65f5a275aba3a2bc996ba07 by Shawn Wang <shawnw@nvidia.com>:
add debug information for command_buffer_conversion_pass
--
88183dd7dc53c2bdc80f3a664a99b50e275311b2 by Shawn Wang <shawnw@nvidia.com>:
Lower dynamic update slice thunk into command buffer when its offset
value depends on loop iteration.
--
3cf46be90b3be2185f0b5106ea9eeaa45b088601 by Shawn Wang <shawnw@nvidia.com>:
fix
--
45b31f69f9299a13bac24a966625190c9e90c91e by Shawn Wang <shawnw@nvidia.com>:
fix
--
ce3af2b9b131c9902b45d6d9934424d861656d32 by Shawn Wang <shawnw@nvidia.com>:
fix
--
a7fc4ab02b5d7dec6d337fcc57bbfd38a3b205ed by Shawn Wang <shawnw@nvidia.com>:
fix
--
73784aa6530244559c1530b2f922cf81c6d43822 by Shawn Wang <shawnw@nvidia.com>:
change to gemm command for test
--
64b1cf454fc360bcc3255f29bd27c01799537e07 by Shawn Wang <shawnw@nvidia.com>:
fix
--
0a3d7a1b6c142a3c9aa2b299d902520ed7f91515 by Shawn Wang <shawnw@nvidia.com>:
clang format
--
3105ce82fa3751d73d41b0564402e108328ea147 by Shawn Wang <shawnw@nvidia.com>:
fix
--
85ce21672052c4bbfd50db54248dbe1ae2494230 by Shawn Wang <shawnw@nvidia.com>:
fix
Merging this change closes#28740
PiperOrigin-RevId: 817644265
Experiments show that this still performs better than not unrolling.
Also rename the method `MayPreventVectorization`, as the name is misleading: the
other logic makes sure that we can at least vectorize the stores, so this
function should check whether there is an expected performance drop due to
unrolling, not whether we may be able to vectorize loads.
PiperOrigin-RevId: 817621544
`Array::pjrt_layout()` will be changed to return `nullptr` to indicate a default layout, in which case callers can obtain the corresponding concrete default layout via `Client::GetDefaultPjRtLayout()`.
This change adds `nullptr` handling preemptively, before the new `Array::pjrt_layout()` semantics take effect, so that the existing code keeps working as before.
Tests that call `Array::pjrt_layout()` are minimally updated to add a non-nullness check. They will be updated once `Array::pjrt_layout()` actually returns `nullptr`.
PiperOrigin-RevId: 817600994
Imported from GitHub PR https://github.com/openxla/xla/pull/32229
RTX PRO 6000 has CC 12.0.
Spark has CC 12.1.
Removed the IsAtLeastBlackwellPro method because there is no guarantee that future data center GPUs will have CC higher than 12.0.
Also skipped the latency estimator test on Edge GPUs because it uses the collective performance model and crashes here:
784702574e/xla/service/gpu/model/gpu_collective_performance_model.cc (L239)
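For illustration, a sketch of why an "at least" predicate is unsafe here, using a minimal stand-in for the compute-capability type:

```cpp
// Stand-in for se::CudaComputeCapability.
struct ComputeCapability { int major; int minor; };

// RTX PRO 6000 is CC 12.0 and Spark is CC 12.1, so any "at least 12.0"
// test also matches Spark, and nothing guarantees future data center GPUs
// will have CC above 12.0; hence an explicit family check instead.
bool IsBlackwell12x(const ComputeCapability& cc) { return cc.major == 12; }
```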
Copybara import of the project:
--
ca47c656de78f8c5385dcf77b7454d7adc774203 by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:
Some Spark fixes. Rename kBlackwellPro to kBlackwell12, as the sm_12x compute capabilities also include Spark.
Fix the latency estimator test and the gemm fusion autotuner test for Spark.
Removed the IsAtLeastBlackwellPro method because there is no guarantee that future
data center GPUs will have CC higher than 12.0.
Merging this change closes#32229
PiperOrigin-RevId: 817600860
Imported from GitHub PR https://github.com/openxla/xla/pull/32525📝 Summary of Changes
Pass cuDNN version to the `BlockScalingRewriter` pass, and make lowering decisions based on that.
🎯 Justification
The global scaling factor doesn't work before cuDNN v9.13 (the graph is compiled, but the scaling factor is not applied).
Use the slower lowering (apply global scaling factor outside the fusion) in this case.
🚀 Kind of Contribution
🐛 Bug Fix
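A minimal sketch of the version gate, with an assumed version type; the pass would consult the cuDNN version passed in by the caller:

```cpp
#include <tuple>

// Stand-in for the plumbed-in cuDNN version.
struct CudnnVersion { int major; int minor; int patch; };

// Before cuDNN 9.13 the fused global scaling factor is silently ignored
// (the graph compiles but the factor is not applied), so fall back to
// applying the factor outside the fusion.
bool UseFusedGlobalScaling(const CudnnVersion& v) {
  return std::tie(v.major, v.minor) >= std::make_tuple(9, 13);
}
```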
Copybara import of the project:
--
a47ef5175d076270e371c9e5cf355fc1ad96efc8 by Sergey Kozub <skozub@nvidia.com>:
[XLA:GPU] Fix block scaled dot global scaling for older cuDNN versions
Merging this change closes#32525
PiperOrigin-RevId: 817592016
A helper that performs `SdcLog::ReadFromDevice` and returns the result as an
`SdcLogProto`. The proto will be dumped to the log directory for debugging.
PiperOrigin-RevId: 817587228
The order of files returned by `GetMatchingPaths` is not guaranteed, so sorting ensures deterministic test behavior and should get rid of the flakiness of the test.
Also a tiny assertion cleanup for better error messages.
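A sketch of the fix, assuming the usual `tsl::Env::GetMatchingPaths` shape:

```cpp
#include <algorithm>
#include <string>
#include <vector>

#include "absl/status/status.h"
#include "tsl/platform/env.h"
#include "tsl/platform/errors.h"

// GetMatchingPaths makes no ordering guarantee, so sort before asserting
// on the contents to keep the test deterministic.
absl::Status CollectSortedPaths(const std::string& pattern,
                                std::vector<std::string>* paths) {
  TF_RETURN_IF_ERROR(tsl::Env::Default()->GetMatchingPaths(pattern, paths));
  std::sort(paths->begin(), paths->end());
  return absl::OkStatus();
}
```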
PiperOrigin-RevId: 817564449
To (de)serialize this thunk, we'll be using the `GpuConvDescriptor` instead of the `GpuConvConfig`, since it's easier to serialize (most of the config fields actually get populated during execution).
So we move the descriptor creation into the thunk, so that in the next CL we can also store the descriptor to use for (de)serialisation. I didn't add the `GpuConvDescriptor descriptor_` field in this CL, since it's technically not needed yet.
PiperOrigin-RevId: 817523742
`Array::pjrt_layout()` will be changed to return `nullptr` to indicate a default layout, in which case callers can obtain the corresponding concrete default layout via `Client::GetDefaultPjRtLayout()`.
This change adds `nullptr` handling preemptively, before the new `Array::pjrt_layout()` semantics take effect, so that the existing code keeps working as before.
Tests that call `Array::pjrt_layout()` are minimally updated to add a non-nullness check. They will be updated once `Array::pjrt_layout()` actually returns `nullptr`.
PiperOrigin-RevId: 817516042