We use this field for two different buffer debug kernels that have different semantics. Technically we could have two different structures, but that does not make much sense at the moment. Let's use the one that we already have, with the generic name.
PiperOrigin-RevId: 824532743
Add an API to look up type id and info by type name. We can't rely on type ids for serialization, as they are not stable and are assigned at run time depending on the type registration order. Type names, on the other hand, must be stable.
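A minimal sketch of the idea; `TypeRegistry`, `RegisterType`, and `FindTypeByName` are illustrative names, not the actual FFI API:
```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical registry: ids are handed out in registration order (and are
// therefore unstable across runs), while the name key is stable.
class TypeRegistry {
 public:
  int64_t RegisterType(const std::string& name) {
    int64_t id = next_id_++;         // run-time id, depends on call order
    ids_by_name_.emplace(name, id);  // stable name -> id mapping
    return id;
  }
  std::optional<int64_t> FindTypeByName(const std::string& name) const {
    auto it = ids_by_name_.find(name);
    if (it == ids_by_name_.end()) return std::nullopt;
    return it->second;
  }

 private:
  int64_t next_id_ = 1;
  std::unordered_map<std::string, int64_t> ids_by_name_;
};
```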
PiperOrigin-RevId: 824512487
- Renamed `SymbolicToAffine` to `SymbolicExprToAffineExpr` and made it public (needed for `IndexingMap::GetConstraints`)
- Renamed `AffineToSymbolicExpr` to `AffineExprToSymbolicExpr`
- Added `AffineExprsToSymbolicExprs` to convert a list of `mlir::AffineExpr` to a vector of `xla::gpu::SymbolicExpr` (needed for `IndexingMap::ConstraintsSatisfied`); see the sketch below
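A rough sketch of what the new list helper can look like, assuming the per-element conversion named above; the actual signatures in XLA's symbolic expression library may differ (e.g. take an extra context argument):
```cpp
#include <vector>

#include "llvm/ADT/ArrayRef.h"
#include "mlir/IR/AffineExpr.h"

// Hedged sketch: converts each mlir::AffineExpr with the per-element
// AffineExprToSymbolicExpr helper. SymbolicExpr and the per-element helper
// come from XLA's symbolic expression library; signatures are assumed.
std::vector<SymbolicExpr> AffineExprsToSymbolicExprs(
    llvm::ArrayRef<mlir::AffineExpr> exprs) {
  std::vector<SymbolicExpr> result;
  result.reserve(exprs.size());
  for (mlir::AffineExpr expr : exprs) {
    result.push_back(AffineExprToSymbolicExpr(expr));
  }
  return result;
}
```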
PiperOrigin-RevId: 824492246
In a follow-up CL we will need to add this thunk to the buffer debug pass.
There we will also need to infer the buffer element type.
Another refactoring would be to rename the payload, which is currently the checksum, to something more generic like 'value' or 'result'.
One more thing we could do is reduce code duplication by merging the two thunks, the checksum one and the NaN counter one.
PiperOrigin-RevId: 824491914
The `DotDecomposer` pass runs ahead of layout assignment. Introducing non-default layouts at this stage causes complications for subsequent passes, in particular the `DotMerger` pass.
PiperOrigin-RevId: 824476578
Imported from GitHub PR https://github.com/openxla/xla/pull/32439
📝 Summary of Changes
Enable embedded device libs and in-process lld by default.
🎯 Justification
Moves amdgpu backend to be more filesystem layout independent.
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
N/A
🧪 Unit Tests:
None
🧪 Execution Tests:
None
Copybara import of the project:
--
46a100377d00d30dbc79e34c977b9219c54bda4b by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:
[ROCm] Fix and enable xla_gpu_use_embeded_device_lib and xla_gpu_use_inprocess_lld
Merging this change closes #32439
PiperOrigin-RevId: 824476138
absl::Hash is not deterministic across different runs of the same program. Use
Fingerprint128 instead, and don't include the address of the computation.
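A small sketch of the replacement pattern, assuming the key is derived from a serialized form of the computation rather than its address (the header path may vary across TSL versions):
```cpp
#include <cstdint>

#include "absl/strings/string_view.h"
#include "tsl/platform/fingerprint.h"

// absl::Hash is randomly seeded per process, so its values change between
// runs; tsl::Fingerprint128 is a pure function of its input and is stable.
uint64_t StableKey(absl::string_view serialized_computation) {
  tsl::Fprint128 fp = tsl::Fingerprint128(serialized_computation);
  return fp.low64 ^ fp.high64;  // fold 128 bits down to a 64-bit map key
}
```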
PiperOrigin-RevId: 824460524
Imported from GitHub PR https://github.com/openxla/xla/pull/31886
📝 Summary of Changes
This enhances the search for the CUDA libdevice path:
- Fix an invalid empty path being added when `TF_CUDA_TOOLKIT_PATH` is defined but empty
- Fix invalid paths based on runtime folders: `runfiles_dir.substr(0, runfiles_ind + runfiles_suffix.length())` is not meaningful when `runfiles_ind` isn't valid, i.e. `std::string::npos`
- Add `$CUDA_HOME` to the search paths. This is already used in TensorFlow
🎯 Justification
Without this, the libdevice file won't be found if CUDA isn't installed in a standard location, or if e.g. an updated version is available in a different location.
This is the case e.g. on HPC systems where multiple CUDA versions are available side by side.
🚀 Kind of Contribution
🐛 Bug Fix, ♻️ Cleanup
Fixes #28590
🧪 Unit Tests:
A simple test that, when `CandidateCudaRoots` returns anything, the result contains `$CUDA_HOME`
Copybara import of the project:
--
01788b896900717ee916377a71d5c14963e0176d by Alexander Grund <alexander.grund@tu-dresden.de>:
Fix libdevice search when outside test environment
When there is no `runfiles_suffix`, `rfind` returns `std::string::npos`,
which should be handled so as not to add meaningless paths.
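In code, the fix amounts to a guard like the following sketch (`MaybeAddRunfilesRoot` is an illustrative name, not the actual function):
```cpp
#include <string>
#include <vector>

// Only derive a candidate CUDA root when the runfiles suffix is actually
// present; otherwise rfind returns std::string::npos and the substr below
// would produce a meaningless path.
void MaybeAddRunfilesRoot(const std::string& runfiles_dir,
                          const std::string& runfiles_suffix,
                          std::vector<std::string>* roots) {
  std::string::size_type runfiles_ind = runfiles_dir.rfind(runfiles_suffix);
  if (runfiles_ind != std::string::npos) {
    roots->push_back(
        runfiles_dir.substr(0, runfiles_ind + runfiles_suffix.length()));
  }
}
```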
--
900715a846102bacdfc7688f14713cbe6101506d by Alexander Grund <alexander.grund@tu-dresden.de>:
Use `$CUDA_HOME` when searching for libdevice.
With a CUDA installed to a non-default location XLA/TF fails with:
> gpu_backend_lib.cc:579] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
> Searched for CUDA in the following directories:
> ./cuda_sdk_lib
> /builddir/TensorFlow/TensorFlow-2.x_mnist-test.py.runfiles/cuda_nvcc
> /buildi/cuda_nvcc
>
> /usr/local/cuda
> /software/TensorFlow/lib/python3.12/site-packages/tensorflow/python/platform/../../../nvidia/cuda_nvcc
> /software/TensorFlow/lib/python3.12/site-packages/tensorflow/python/platform/../../../../nvidia/cuda_nvcc
> /software/TensorFlow/lib/python3.12/site-packages/tensorflow/python/platform/../../cuda
Consider $CUDA_HOME as an additional location after the runfiles dirs (used for tests)
--
905d0596d199598036032f0f84b4487e9afd2bef by Alexander Grund <alexander.grund@tu-dresden.de>:
Don't add empty TF_CUDA_TOOLKIT_PATH to libdevice search
At least in some environments that define is the empty string, which
doesn't make sense to add to the search paths.
Add a check for that.
--
23eb59bfabd570caabf0b9ec3515233f46a4fae7 by Alexander Grund <alexander.grund@tu-dresden.de>:
Add test for $CUDA_HOME in CandidateCudaRoots
--
a8c215bc222b4ba8581f2f44549613ebd59b9cbb by Alexander Grund <alexander.grund@tu-dresden.de>:
Add braces to loops/conditions
--
39efc67f8b1d44e131f993c8040b7eb69ff52f0c by Alexander Grund <alexander.grund@tu-dresden.de>:
Use kIsOpenSource in skip condition
Merging this change closes #31886
PiperOrigin-RevId: 824450284
This is moving `Scalar`, `Array`, `Dictionary`, `FlatAttribute`, `FlatAttributeMap`, and `AttributeMap` from `CallFrameBuilder` into the `xla::ffi` namespace.
It also moves the code into `attribute_map.{cc|h}`.
All these types are basically aliases for some kind of `std::variant` type. This change is a preparation for making them proper types and adding `ToProto` and `FromProto` methods.
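For intuition, a simplified sketch of variant-based aliases of this kind; the real aliases carry more payload types than shown here:
```cpp
#include <cstdint>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Simplified stand-ins for the moved types: today they are plain variant
// aliases, which is why promoting them to proper types with ToProto and
// FromProto methods requires this kind of preparatory move.
using Scalar = std::variant<bool, int32_t, int64_t, float, double>;
using Array = std::variant<std::vector<int32_t>, std::vector<float>>;
using FlatAttribute = std::variant<Scalar, Array, std::string>;
using FlatAttributeMap = std::map<std::string, FlatAttribute>;
```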
PiperOrigin-RevId: 824435281
Also fixed the round-trip test to not ignore `kInvalid` returned from proto conversion, which is why we didn't catch this bug.
PiperOrigin-RevId: 824419619
The meaning of AsyncValue::IsUnique() is fuzzy for chains of indirect async values. Prefer a simpler check for uniqueness in the Future/Promise library.
Also update the AsyncValue::IsUnique() documentation.
PiperOrigin-RevId: 824256830
This change invalidates the autotune cache, which is necessary because enabling the generic emitter (cl/823475406) affected autotuning results.
PiperOrigin-RevId: 823818338
This is behaviorally a no-op for Shardy, because the call output and func result may mismatch only if the dedup-functions-fully option is true, and that option is false by default.
Shardy will add explicit reshards (during the Shardy partitioner) on the operations that use the output of a named computation, and it will do so assuming the output of the named computation is sharded as specified in the named computation's out shardings.
When the dedup-functions-fully option is true, however, the function that is actually called may end up having a different output sharding than the corresponding named computation. The users of the output should still use the sharding specified in the out shardings of the named computation. Hence, if there is a mismatch between the output sharding of the named computation and the result sharding of the function, we add a reshard on the output of the call.
PiperOrigin-RevId: 823494391
Explicitly set the operand precisions to `PrecisionConfig::DEFAULT` when creating a `ScaledDot` instruction from a composite call.
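For reference, a minimal sketch of what explicitly populating both operand precisions on the `PrecisionConfig` proto looks like:
```cpp
#include "xla/xla_data.pb.h"

// Explicitly record DEFAULT precision for both dot operands instead of
// leaving the repeated field empty.
xla::PrecisionConfig MakeDefaultPrecisionConfig() {
  xla::PrecisionConfig config;
  config.add_operand_precision(xla::PrecisionConfig::DEFAULT);
  config.add_operand_precision(xla::PrecisionConfig::DEFAULT);
  return config;
}
```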
PiperOrigin-RevId: 823488638
+ use `ptr` when using `AsPtr()` for consistency
+ rename `Wrap` to `AndThen` as it's more meaningful and makes profiles readable
PiperOrigin-RevId: 823476695
According to benchmarks we have reached neutrality with the legacy emitter. Switching to the new emitter by default. The legacy emitter will be kept for some time but is considered deprecated and should not be used. It will be deleted in the near future.
Reverts 85c99b1ecb
PiperOrigin-RevId: 823475406
Previously, we would never allow simplification when encountering a `dot`
instruction. But this constraint was overly conservative; the only dimensions
that we shouldn't simplify are those along which we intend to perform
non-standard padding to fit to hardware restrictions, i.e. the non-contracting
and contracting dimensions.
Restricting this pattern further works around a bug whereby expanding a
non-standardly padded dimension into a `1` dim can result in propagating a
tile with the wrong size.
The underlying reason for this is a bug in the `kPreserve` behaviour of
`IndexingMap` simplification, which will need to be fixed separately (the new
tiling should avoid this issue, since it shouldn't rely on the correctness of
`IndexingMap` simplification at this level).
PiperOrigin-RevId: 823258725
Note that, in order to maintain parity with MHLO optimizations, this enables the `assume-no-undeclared-side-effects` option. This matches the default behavior for MHLO, but StableHLO is more cautious by default. Empirically, past evidence suggests it's pretty safe given that MHLO has been doing it all this time. Disabling the flag can result in significantly larger HLO after lowering, so we enable it here.
PiperOrigin-RevId: 823234079
To avoid confusion between the different kinds of tasks we have in Worker/WorkQueue and the SlinklyThreadPool in XLA, use the more generic name "work item".
PiperOrigin-RevId: 823191886
This is just a short term solution to allow loading https://github.com/jax-ml/jax/blob/main/build/BUILD.bazel successfully. We'll need to figure out a better solution when working on supporting multiple python versions.
PiperOrigin-RevId: 823093519
Imported from GitHub PR https://github.com/openxla/xla/pull/32954
📝 Summary of Changes
Introduce pool name for rbe builds
🎯 Justification
Need separate pool name for gpu tests execution.
🚀 Kind of Contribution
✨ New Feature
📊 Benchmark (for Performance Improvements)
RBE support for the ROCm config CI job
🧪 Unit Tests:
Not relevant
🧪 Execution Tests:
Not relevant
Copybara import of the project:
--
d675bf9efcc44a8d740c1be7537737af3cd90f0b by Alexandros Theodoridis <alexandros.theodoridis@amd.com>:
Introduce pool name for rbe
--
d5ee82757aa74785bd2a1c68e3639c49d17ba740 by Alexandros Theodoridis <atheodor@amd.com>:
Introduce rocm rbe pools
--
36bfa7b258cb3e58430087faccccb413f9bf8a7c by Alexandros Theodoridis <atheodor@amd.com>:
First check for multigpu tag
--
9efa0a7cdfa76bb0d5102ebbee1f9b6a3dab270c by Alexandros Theodoridis <atheodor@amd.com>:
Address review comments
--
5b854a7f5915d0c106fd2ba9bc6ff774a885f907 by Alexandros Theodoridis <atheodor@amd.com>:
Fix buildifier issue
Merging this change closes #32954
PiperOrigin-RevId: 823077515
This change modifies `SymbolicExprContext` to use the `mlir::StorageUniquer` provided by `mlir::MLIRContext::getAffineUniquer()` instead of maintaining its own. This makes `SymbolicExprContext` creation very lightweight.
PiperOrigin-RevId: 823052287
The old code did not update `min_duration_with_optimzed_scratch_bytes` when the scratch sizes were equal. This could lead to a subtle situation where the kernel with the best time and scratch size is not picked, if all scratch sizes are the same but the time-optimal one does not appear at the end.
I've updated the associated test to verify this situation; the new test fails without this CL. A sketch of the corrected selection logic follows.
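A hypothetical sketch (names and types are illustrative, not the actual autotuner code):
```cpp
#include <cstdint>
#include <vector>

struct Candidate {
  double duration_ms;
  int64_t scratch_bytes;
};

// When two candidates tie on scratch size, the faster one must still win, so
// the best duration is updated on equality as well, not only when the
// scratch size strictly shrinks.
Candidate PickBest(const std::vector<Candidate>& candidates) {
  Candidate best = candidates.front();
  for (const Candidate& c : candidates) {
    if (c.scratch_bytes < best.scratch_bytes ||
        (c.scratch_bytes == best.scratch_bytes &&
         c.duration_ms < best.duration_ms)) {
      best = c;  // the equal-scratch case was previously skipped
    }
  }
  return best;
}
```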
PiperOrigin-RevId: 823019660
This change moves `YnnThreadpool` to the runtime/ynnpack/ subfolder, and changes the runtime to use our custom YnnThreadpool, instead of using a thread pool created by `ynn_create_threadpool`.
PiperOrigin-RevId: 822883993
Setting the type_id value to 0 is required for XLA to assign a unique type id; otherwise the type gets assigned a random value that happens to be on the caller's stack.
PiperOrigin-RevId: 822782898
This now always passes reference_held=true. This is fine because the only time
it was ever passed as false was when we were already on the compute stream, and
this bool is basically ignored if the stream is the compute stream (see
MaybeWaitForEventOnStream).
PiperOrigin-RevId: 822758577
This CL modifies the collective pipeliner to generate unique body and condition computations for newly generated while loop instructions.
PiperOrigin-RevId: 822719229
- Prioritize replacing `broadcast_in_dim` with `reshape` over merging nested `broadcast_in_dim` ops. The new behavior matches the relevant MHLO optimization behavior, which proved to be preferable.
- Fix an issue where `pad` ops that didn't change the dimensions would be removed even if they shifted elements around within the tensor (e.g. padding by -1 on one side and +1 on the opposite side).
PiperOrigin-RevId: 822701252
Imported from GitHub PR https://github.com/openxla/xla/pull/33008
📝 Summary of Changes
Add CI-specific bazelrc that will import both `rocm.bazelrc` from `/usertools` and `rocm_xla.bazelrc`
🎯 Justification
Temporary workaround until split logic in CI (which relies on `/usertools/rocm.bazelrc`) is removed
Copybara import of the project:
--
bb4cbf0c4fbf2c171110040c5c1470bddced203b by Milica Makevic <Milica.Makevic@amd.com>:
Add CI specific bazelrc
Merging this change closes #33008
PiperOrigin-RevId: 822700005
Instead of performing four separate AllToAll operations, the metadata tensors are reshaped, concatenated, and a single AllToAll is executed. The result is then sliced back into the individual metadata tensors. This avoids the latency of initiating several separate collective operations.
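A hedged sketch of the batching idea using the XLA client builder; the actual change operates on HLO, and the shapes and names here are illustrative only:
```cpp
#include <cstdint>
#include <vector>

#include "absl/types/span.h"
#include "xla/hlo/builder/xla_builder.h"

// Stacks k same-shaped 1-D metadata tensors of length n into a [k, n]
// tensor, runs a single AllToAll along the (shifted) data dimension, and
// slices the result back apart. Each row is exchanged independently, so this
// matches k separate AllToAlls over dimension 0 of the original tensors.
std::vector<xla::XlaOp> BatchedAllToAll(absl::Span<const xla::XlaOp> metadata,
                                        int64_t n, int64_t split_count) {
  xla::XlaBuilder* b = metadata.front().builder();
  std::vector<xla::XlaOp> rows;
  for (xla::XlaOp m : metadata) rows.push_back(xla::Reshape(m, {1, n}));
  xla::XlaOp combined = xla::ConcatInDim(b, rows, /*dimension=*/0);
  xla::XlaOp exchanged = xla::AllToAll(combined, /*split_dimension=*/1,
                                       /*concat_dimension=*/1, split_count);
  std::vector<xla::XlaOp> out;
  for (int64_t i = 0; i < static_cast<int64_t>(metadata.size()); ++i) {
    out.push_back(xla::Reshape(
        xla::Slice(exchanged, {i, 0}, {i + 1, n}, {1, 1}), {n}));
  }
  return out;
}
```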
PiperOrigin-RevId: 822674605
Introduce `addMissingShardingToControlFlow` option in `StablehloExportPipelineOptions` to control whether `ExportStablehloShardingsPass` adds missing shardings to control flow ops. Disable this option in `mlir_to_hlo.cc` when converting MLIR to HLO.
PiperOrigin-RevId: 822542288
Imported from GitHub PR https://github.com/openxla/xla/pull/32231
📝 Summary of Changes
The changes enable native support for forward convolutions with window dilation in XLA's GPU backend. Previously, all dilated convolutions were treated as non-canonical and required explicit padding materialization. Now, forward convolutions with window dilation (but not base dilation) are preserved and handled natively by cuDNN, avoiding unnecessary padding overhead.
🎯 Justification
Performance Problem: JAX shows 15-23x slower performance than PyTorch for dilated convolutions (33.5ms vs 1.4ms at dilation rate 2). This is because XLA materializes dilated convolutions as padded convolutions instead of using cuDNN's native support.
Solution: Allow forward convolutions with window dilation to bypass padding materialization and use cuDNN's native dilated convolution kernels directly.
🚀 Kind of Contribution
Performance Improvement
📊 Benchmark (for Performance Improvements)
dilation 1:
prev: 1.08 ms
now: 1.07 ms
dilation 2:
prev: 25.79 ms
now: 0.91 ms
dilation 1024:
prev: 26.24 ms
now: 2.34 ms
Copybara import of the project:
--
b5a38df2ed4715b43fc8ca8d652005a35290d47e by Chenhao Jiang <chenhaoj@nvidia.com>:
Support forward conv with dilation and add basic heuristic for differentiating forward/backward
Merging this change closes #32231
PiperOrigin-RevId: 822482265
Imported from GitHub PR https://github.com/openxla/xla/pull/32838
📝 Summary of Changes
The fallback logic now correctly identifies the highest known compatible architecture when given an unknown architecture as input.
🎯 Justification
Previously the logic would propose an incompatible architecture in this case.
🚀 Kind of Contribution
🐛 Bug Fix
🧪 Unit Tests:
Added a new test case showing the previously-failing case (it used to propose `sm_110`)
Copybara import of the project:
--
f060bb9837d72159343ff2d52f5f2f42b1b7e9a4 by Olli Lupton <olupton@nvidia.com>:
Fix family-conditional logic
--
fc44dcd1e76da67c0b6fe53c33d2a571c3a6ff50 by Olli Lupton <olupton@nvidia.com>:
Accept CR suggestion
Merging this change closes #32838
PiperOrigin-RevId: 822284790
Imported from GitHub PR https://github.com/openxla/xla/pull/32960
📝 Summary of Changes
(Partially) upstreaming changes from: https://github.com/ROCm/xla/pull/323, 9d358b9b26, and https://github.com/ROCm/xla/pull/385. It skips some asan/tsan changes for now.
🎯 Justification
These changes are ROCm-specific and help with ROCm-internal CI validation pipelines.
🚀 Kind of Contribution
🐛 Bug Fix, ♻️ Cleanup, 🧪 Tests
📊 Benchmark (for Performance Improvements)
/
🧪 Unit Tests:
/
🧪 Execution Tests:
/
Copybara import of the project:
--
804ff1b6a6fbba86a3e0a09d739179a4eb4f197d by Milica Makevic <Milica.Makevic@amd.com>:
Add missing cuda-only tag to cuda test
--
44ce7a2d56c9f0c80405447f431ae1e5a33f42e1 by Milica Makevic <Milica.Makevic@amd.com>:
Refactor test scripts
--
fb783c968e9d2ff5d92357908d99e4952235c2bc by Milica Makevic <Milica.Makevic@amd.com>:
Cover more mgpu tests
--
1f53712274f76202241bd3631dbf065826c0b960 by Milica Makevic <Milica.Makevic@amd.com>:
Switch from rocm_gcc to rocm_ci for sgpu tests
--
00e0c8ee2a763680f5a3665dab62202ab230731d by Milica Makevic <Milica.Makevic@amd.com>:
Changing file permissions
--
003c062a8900c12b73c0972e8d406f2661a27aba by Milica Makevic <Milica.Makevic@amd.com>:
Remove unnecessary import
--
214599355f40f1b65e0540daf0b9829d2c950115 by Harsha HS <Harsha.HavanurShamsundara@amd.com>:
Add license header
Merging this change closes #32960
PiperOrigin-RevId: 822245565
Imported from GitHub PR https://github.com/openxla/xla/pull/32846
📝 Summary of Changes
Allow mixed precision collective-permute in the verifier.
🎯 Justification
Partially addresses https://github.com/openxla/xla/issues/32845
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
N/A
🧪 Unit Tests:
Tests that verifier passes on mixed precision collective-permute.
🧪 Execution Tests:
N/A
Copybara import of the project:
--
666c38a19005a609d4a7aa8e5e9b9842b1c87175 by Jaroslav Sevcik <jsevcik@nvidia.com>:
Allow mixed precision for collective permute
Merging this change closes #32846
PiperOrigin-RevId: 822179840
Imported from GitHub PR https://github.com/openxla/xla/pull/32904
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.30.5 to 4.30.9.
Copybara import of the project:
--
c14a0d2198bee3dcd76ee7fa733da41a6d1fcd6b by dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>:
Bump github/codeql-action from 3.30.5 to 4.30.9
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.30.5 to 4.30.9.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](3599b3baa1...16140ae1a1)
---
updated-dependencies:
- dependency-name: github/codeql-action
dependency-version: 4.30.9
dependency-type: direct:production
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Merging this change closes #32904
PiperOrigin-RevId: 822178959
Imported from GitHub PR https://github.com/openxla/xla/pull/32688
📝 Summary of Changes
This PR enables the command buffer DynamicSliceCopy command to be recorded into an unrolled cuda-graph when it is surrounded by WhileCmd
🎯 Justification
This feature is required if we want to fully command buffer WhileCmd into an unrolled cuda-graph.
🚀 Kind of Contribution
✨ New Feature
🧪 Unit Tests:
xla/backends/gpu/runtime/command_buffer_cmd_test.cc: CommandBufferCmdTest:DynamicSliceCopyFusionCmd
Copybara import of the project:
--
feb2902fca397360460f6b9788ac0f7482cb547c by Shawn Wang <shawnw@nvidia.com>:
Enable command buffer DynamicSliceCopyFusion command unrolling
Merging this change closes #32688
PiperOrigin-RevId: 822104580
Imported from GitHub PR https://github.com/openxla/xla/pull/32719
📝 Summary of Changes
This PR enables the command buffer DynamicSliceFusion command to be recorded into an unrolled cuda-graph when it is surrounded by WhileCmd
🎯 Justification
This feature is required if we want to fully command buffer WhileCmd into an unrolled cuda-graph.
🚀 Kind of Contribution
✨ New Feature
🧪 Unit Tests:
xla/backends/gpu/codegen/dynamic_slice_fusion_test.cc
Copybara import of the project:
--
daa975804cbffcc3a6bc5b37e3494b51a2dbe2ca by Shawn Wang <shawnw@nvidia.com>:
DynamicSliceFusionCmd supports unrolling
Merging this change closes #32719
PiperOrigin-RevId: 822071751
According to benchmarks we have reached neutrality with the legacy emitter. Switching to the new emitter by default.
The legacy emitter will be kept for some time but is considered deprecated and should not be used. It will be deleted in the near future.
PiperOrigin-RevId: 822067921
The fission autotuner previously only searched for dot instructions in the entry computation of an HLO module. This caused it to miss dot operations located in nested computations, such as the body of a while loop, preventing the autotuner from applying configurations to them.
PiperOrigin-RevId: 822037141
We adjusted the emitter for the case when the scale is missing.
We also relaxed the HLO verifier a bit and tweaked the composite rewriter so that it accepts the dim indexes passed by JAX.
PiperOrigin-RevId: 822036474
When removing ops, we need to do so in a deterministic order. The reason is
that removing a user works by finding the position of the user in the users
vector, swapping it with the last element of the vector, and then popping the
last element of the vector. So if more than one element is removed from a users
list, it matters in which order the elements are removed; see the sketch below.
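A minimal sketch of the swap-with-last removal described above; it shows why the final order of `users` depends on the removal order:
```cpp
#include <algorithm>
#include <vector>

// Removes `user` by swapping it with the last element and popping. The
// element that was last takes the removed element's slot, so the resulting
// order of `users` depends on which removals happened first.
void RemoveUser(std::vector<int>& users, int user) {
  auto it = std::find(users.begin(), users.end(), user);
  if (it != users.end()) {
    *it = users.back();  // move the last element into the vacated position
    users.pop_back();    // drop the (now duplicated) last element
  }
}
```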
PiperOrigin-RevId: 822026351
Imported from GitHub PR https://github.com/openxla/xla/pull/32905
📝 Summary of Changes
Allow mixed precision asynchronous collective-permute in the verifier.
🎯 Justification
Fixes https://github.com/openxla/xla/issues/32845
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
N/A
🧪 Unit Tests:
Tests that verifier passes on mixed precision collective-permute-start and collective-permute-done.
🧪 Execution Tests:
Manually tested the JAX repro from https://github.com/openxla/xla/issues/32845
Copybara import of the project:
--
f44faa7ce7ecfbd810983cae170a118bb19a8bb3 by Jaroslav Sevcik <jsevcik@nvidia.com>:
Allow mixed precision operands for async collective permute
Merging this change closes #32905
PiperOrigin-RevId: 822023349
Imported from GitHub PR https://github.com/openxla/xla/pull/32773
📝 Summary of Changes
Remove hardcoded NHWC convolution layout for fp16 precision.
🎯 Justification
Performance drops for fp16 precision on gfx11xx and gfx12xx GPUs were observed internally, as well as by the [community](https://github.com/jax-ml/jax/issues/30548).
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark
A community member provided the script with which the [profiling can be done](https://github.com/jax-ml/jax/issues/30548#issue-3270872993).
Significant performance improvement for fp16 on gfx12xx:
```
Running on: rocm:0
Testing float32...
Avg time: 0.092307 s, Throughput: 1.68 TFLOP/s
Testing float16...
Avg time: 0.011742 s, Throughput: 13.17 TFLOP/s
Testing bfloat16...
Avg time: 0.011989 s, Throughput: 12.90 TFLOP/s
```
Results of the profiling before the fix:
```
Running on: rocm:0
Testing float32...
Avg time: 0.092312 s, Throughput: 1.67 TFLOP/s
Testing float16...
Avg time: 0.775142 s, Throughput: 0.20 TFLOP/s
Testing bfloat16...
Avg time: 0.011990 s, Throughput: 12.90 TFLOP/s
```
Copybara import of the project:
--
c9fdba79e32c13d9cbf640e61d941d071fabba9d by Aleksa Arsic <Aleksa.Arsic@amd.com>:
Remove hardcoded convolution NCHW layout assignment for fp16 precision.
--
69660d19999a14b24d63b52e6dae310cfbdcbb6b by Aleksa Arsic <Aleksa.Arsic@amd.com>:
Add unit tests for ROCm layout assignment.
Merging this change closes #32773
PiperOrigin-RevId: 822022522
Imported from GitHub PR https://github.com/openxla/xla/pull/32724
Copybara import of the project:
--
c3f4ff8ec6af27d24b61e2aa529585697b8aa77a by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:
Disable only the test cases that are failing and enable 3 test targets on B200.
--
1f6e52218ec124bb52d4dba70aa7832311762465 by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:
Disable test case in cudnn_test that fails on Google's B200.
Keep gpu_compiler_test off CI for now due to memory leak
found by ASAN, but don't revert the changes in the file,
so it can be enabled more easily in the future.
--
42e501a41e43c174538ab186c659a072101b4ab2 by Dimitris Vardoulakis <dvardoulakis@nvidia.com>:
Disable ConvWgradWithNHWCLayoutExecutesCorrectly only on Blackwell.
Merging this change closes #32724
PiperOrigin-RevId: 821992088
This change implements native support for `xla::Executable::GetOutputLayouts()` in the PJRT C API when the PJRT Layouts extension is available. This support does not fetch the optimized HLO, which makes the method faster and more reliable.
Plugins that implemented Layouts extension v2 are strongly encouraged to upgrade to v3 to avoid an incompatibility.
PiperOrigin-RevId: 821834116
Future note: at some point, `NanoArray` will need to distinguish between a default layout and a concrete layout that is equal to the default layout. If the latter is used, `NanoArray::pjrt_layout()` is expected to return the concrete layout. This is not required by the IFRT API semantics yet, but it will be enforced in the future.
PiperOrigin-RevId: 821808592
We can now produce arbitrary iteration patterns for output tiles, simply by
parametrizing calls to `ComputeTiledHloInstructions` with different
`TiledHloSchedule`s.
PiperOrigin-RevId: 821796530
IFRT Proxy now returns a `nullptr` if it knows that the Array layout represents a default layout. User code has previously been migrated to handle this new behavior gracefully, obtaining a concrete default layout as before.
Caveat: the IFRT Proxy client infers the layout of the output arrays from `LoadedExecutable::GetOutputLayouts()`, which always returns concrete layouts today. Thus, these output arrays will use concrete layouts for default layouts, even if the arrays on the server side use `nullptr` for default layouts. This behavior is currently acceptable because all users convert the layout into a concrete one before using it; it will eventually change so that the IFRT Proxy client reflects the array layouts on the server side more accurately.
PiperOrigin-RevId: 821741105
The functionality was removed previously, but the option was never cleaned up. This does not remove the xla_ignore_channel_id debug option, because it also has a non-verifier use.
PiperOrigin-RevId: 821737613
Right now, we use `GetXlaPjrtCpuClient` which in turn calls `GetPjRtCpuClient`, but we will later update `GetXlaPjrtCpuClient` to use the C sandwich, in which case we must call `GetPjRtCpuClient` here in `PJRT_Client_Create`.
This change is a no-op.
PiperOrigin-RevId: 821732030
The dnn_version in device_description was not set. cl/816579045 fixed this for the old autotuner infra; this change ports that fix to the new autotuner infra.
PiperOrigin-RevId: 821728904
- We encounter this case very often (for the cuBLAS autotuner), so it makes sense to optimize it.
- Running cuBLAS kernels as part of autotuning has unintended side effects that change the optimized HLO; this fix also mitigates that issue while we look into it further.
PiperOrigin-RevId: 821716593
Imported from GitHub PR https://github.com/openxla/xla/pull/32782
📝 Summary of Changes
Fix hermetic build for rocm.
🎯 Justification
Introduce missing hipblaslt dependency.
Fix invalid libs linking and align with the data directories.
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
CI, not relevant
🧪 Unit Tests:
Not relevant
🧪 Execution Tests:
Not relevant
Copybara import of the project:
--
f5cb68b0df2265b7048d0068eedd07cccf67e228 by Alexandros Theodoridis <atheodor@amd.com>:
Add missing hermetic lib dependency
--
fe0c9a7fdd36180fea5cf63e20d864355ed98a6c by Alexandros Theodoridis <atheodor@amd.com>:
Add missing hipblaslt deps, fix the targets
--
540d79dd4287a013a3f178ef34a5b96fb8a8a92f by Alexandros Theodoridis <atheodor@amd.com>:
Make hipblaslt mandatory
--
3a6f2282669a1ece4518cc69a01ad76275b603a1 by Alexandros Theodoridis <atheodor@amd.com>:
Fix test
--
eb21b60d34978191315a0c9775d2cb53309dc72d by Alexandros Theodoridis <atheodor@amd.com>:
Ignore asnsigaltstack
--
54c8af2abd7dd682a8494caa05854d574209aa20 by Harsha Havanur Shamsundara <harsha.havanurshamsundara@amd.com>:
[ROCm] Use working sha256 for latest ROCm 7.0 docker image
--
9629a9fc9201a80dba7a0beecb8ee0797960ff6f by Harsha HS <Harsha.HavanurShamsundara@amd.com>:
[ROCm] Add ROCM_PATH repo_env to test scripts
--
1ef6772c6df6aeffcbcc2f27a0ede558fbc6270f by Alexandros Theodoridis <atheodor@amd.com>:
Fix buildifier warning
Merging this change closes #32782
PiperOrigin-RevId: 821614030
In cases where a program argument with AUTO layout is used in more than one Fragment, enforce the DEFAULT layout, as we cannot allow different compiled layouts.
PiperOrigin-RevId: 821612799
This enables migrating the Triton emitter to use the xtile entry, insert & extract emission in the child PR.
The main difference is the memref args in the entry function, for which `MemrefToPtr` & `PtrToMemref` were introduced; these closely resemble `UnrealizedConversionCastOp`, with additional verification, and will enable special folding of `memref::TransposeOp`.
PiperOrigin-RevId: 821593545
This gives us the two HalfClose events plus HandleEvent() and SendRawFrame() as
the API from the socket integration, and subclasses can handle these
accordingly. This also moves the responsibility for destruction into the
handler logic, with the contract that the event is removed from the loop on the second HalfClose event.
PiperOrigin-RevId: 821445213
Given a user seed, this will update the MSA sort order priority of a (small?) number of randomly selected instructions during compilation.
This causes small perturbations in the compiler's prefetching decisions, which allows for 2 main features:
1. finding out if there is a single instruction that was given a "wrong" priority by the compiler, so it can be fixed
- to do this, we run some benchmark many times with different seeds until we find a seed that drastically reduces the compiled code's runtime
- once we have found that seed, we can use binary search to narrow the "selection range" and zero in on the one specific offending instruction
2. finding a lot of small changes that together reduce the runtime
- we can do this using a "hill-climbing" method, sketched below
- try many perturbations until you find one slightly better than the baseline
- try many follow-up perturbations (perturbing the best perturbation from the previous stage) until you find one slightly better again
- repeat until no more improvements are found
NOTE: Right now there is no good way of finding which instructions had their priority adjusted (especially important in (1) to find the one offending instruction). The only way to do so is to increase the log level of the compilation debug print and then look at the logs.
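A hypothetical driver for the hill-climbing loop in (2); the benchmark hook stands in for an external compile-and-run harness and is not part of the MSA API:
```cpp
#include <cstdint>
#include <functional>

// Repeatedly perturbs the current best seed and keeps any candidate that
// beats it, stopping once a whole round yields no improvement.
uint64_t HillClimb(const std::function<double(uint64_t)>& benchmark_ms,
                   uint64_t base_seed, int trials_per_round) {
  uint64_t best_seed = base_seed;
  double best_ms = benchmark_ms(base_seed);
  for (bool improved = true; improved;) {
    improved = false;
    for (int t = 0; t < trials_per_round; ++t) {
      // Derive a candidate seed from the current best (an arbitrary mix).
      uint64_t candidate = best_seed * 6364136223846793005ULL + t + 1;
      double ms = benchmark_ms(candidate);
      if (ms < best_ms) {  // slightly better than the baseline: keep it
        best_ms = ms;
        best_seed = candidate;
        improved = true;
      }
    }
  }
  return best_seed;
}
```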
PiperOrigin-RevId: 821309046
This change is a no-op since both the newly introduced XLA:TPU option and the corresponding option on the ExportNamedComputation pass are false by default.
PiperOrigin-RevId: 821039969
The transposes are not identity permutations. Identity transposes
should already be eliminated separately in HandleTranspose.
PiperOrigin-RevId: 820903953
PjRt-IFRT directly or indirectly fetched the optimized HLO to get the output
layout mode and output layouts. This seems to introduce a regression in
some jobs that use the PJRT C API and have a too-large serialized HLO (> 2 GiB).
As a workaround, PjRt-IFRT gracefully handles output layout mode and
layout discovery errors and falls back to concrete layouts that are
directly obtained from output `PjRtBuffer`s, which should give the same
behavior as before the default layout handling change.
Further changes will follow to discover default layout modes and layouts
without going through `PjRtLoadedExecutable::GetHloModules()`.
PiperOrigin-RevId: 820785277
Add placeholders for future Type serialization/deserialization. This is not an ABI-breaking change, as it is unused today, and it avoids an ABI-breaking change later when FFI adds proper ser/des support for user-defined types.
PiperOrigin-RevId: 820676169
- The VLOG messages are updated to more accurately describe whether the autotuner found a config in the cache, used a default, or actively tuned for the best config.
- The error now contains the HLO instruction.
PiperOrigin-RevId: 820640768
This change utilizes recently added Triton support for smaller block sizes.
Skipping occupancy optimization for some configs is essentially a workaround for incompatible split_k values. The impact of these configs is limited, however, because they are only present in non-exhaustive mode, so they mostly get filtered out anyway.
PiperOrigin-RevId: 820617352
Before this change, we disallowed all-gather, so that the partitioner generated the `all-reduce(dynamic-update-slice())` pattern. With this change, we allow all-gather for two reasons.
1. In most cases, all-gather is allowed and preferred.
2. It is easier to read and match the partitioner result.
PiperOrigin-RevId: 820593767
Imported from GitHub PR https://github.com/openxla/xla/pull/32388
📝 Summary of Changes
Support collectives with non-minor-most last dimension in the sub-byte collective normalization pass.
🎯 Justification
Makes more collectives efficient by not requiring type conversion.
🚀 Kind of Contribution
Performance Improvement.
📊 Benchmark (for Performance Improvements)
```
Before:
## Execution time, file=u4_all_gather_1x8.hlo repeat=1 duration=68384ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=2 duration=67744ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=3 duration=66976ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=4 duration=67040ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=5 duration=66816ns
After:
## Execution time, file=u4_all_gather_1x8.hlo repeat=1 duration=41216ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=2 duration=40960ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=3 duration=40960ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=4 duration=41056ns
## Execution time, file=u4_all_gather_1x8.hlo repeat=5 duration=40960ns
```
Measured on 8xH100 DGX.
🧪 Unit Tests:
yes
🧪 Execution Tests:
yes
Copybara import of the project:
--
a3777523ffffbcc59da285544e3fb5575d098b9c by Ilia Sergachev <isergachev@nvidia.com>:
[GPU] Sub-byte collective normalization: support collectives with non-minor-most last dimension.
Merging this change closes #32388
PiperOrigin-RevId: 820585923
Imported from GitHub PR https://github.com/openxla/xla/pull/32678
📝 Summary of Changes
- Fix sha256 of docker image to ensure CI is not broken due to malformed image
- Fix test scripts by passing ROCM_PATH to bazel sandbox via repo_env
🎯 Justification
Continued CI runs
🚀 Kind of Contribution
🧪 Tests
Copybara import of the project:
--
3ca8114613d8e002c137f28bb6608639d08a724a by Harsha Havanur Shamsundara <harsha.havanurshamsundara@amd.com>:
[ROCm] Use working sha256 for latest ROCm 7.0 docker image
--
09ddfbdf205a6406cdd67e20671f41455fffe0f9 by Harsha HS <Harsha.HavanurShamsundara@amd.com>:
[ROCm] Add ROCM_PATH repo_env to test scripts
Merging this change closes #32678
PiperOrigin-RevId: 820582560
Imported from GitHub PR https://github.com/openxla/xla/pull/32718
📝 Summary of Changes
This PR adds conv fusion support in cudnn fusion compiler.
* add a conv type in `CuDnnFusionConfig` to represent different types of conv. We are getting rid of the conv custom-call target, so this info has to be preserved in the fusion config.
* add `ConvDimensionAdapter` to generate an NCHW **logical layout** for the cudnn frontend, while the physical layout can be NHWC (the most preferable layout) or NCHW (for int conv). Only the NHWC layout is used in the unit tests because layout assignment currently doesn't handle conv fusion to transform other layouts to NHWC; this needs to be addressed in a separate PR.
* add a conv translation rule from XLA conv to the cudnn frontend graph API.
* Other parts of the lowering are handled automatically by the current cudnn fusion compiler: workspace allocation, graph validation, graph compilation, and graph serialization.
🎯 Justification
This is the first step toward unifying conv as a cudnn fusion in XLA. The conv custom call will be replaced with conv fusions in the future.
🚀 Kind of Contribution
✨ New Feature
📊 Benchmark (for Performance Improvements)
No Performance changes are expected.
🧪 Unit Tests:
Added 3 hand written NHWC conv unit tests for conv_fprop/conv_dgrad/conv_wgrad.
🧪 Execution Tests:
Added 3 hand written NHWC conv unit tests for conv_fprop/conv_dgrad/conv_wgrad.
Copybara import of the project:
--
57555cd0e3759aacb7a98135c3261f4cc3f642c2 by Cjkkkk <ske@nvidia.com>:
init
--
d6edecfa42a6371a0908e22daeb8deaf32998ece by Cjkkkk <ske@nvidia.com>:
address comments
--
17df6f8451274f070d7d332a126cfefa1ef7df83 by Cjkkkk <ske@nvidia.com>:
removed one comment
--
1b7c63b1ade7751cf8f68c7fb11cd68491440081 by Cjkkkk <ske@nvidia.com>:
add const
Merging this change closes #32718
PiperOrigin-RevId: 820574737
We're perfectly able to construct a schedule using only a subset of the
iteration space of a `tile_offsets_indexing`---and in fact need to when we are
processing nested fusions.
PiperOrigin-RevId: 820454010
* Deserializing MLIR modules still tries to parse the payload as a string first, as that's the default; on failure, it tries to uncompress and then parse.
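A hedged sketch of that fallback order; `Uncompress` is a placeholder for whatever decompression the serialization format actually uses:
```cpp
#include <string>

#include "llvm/ADT/StringRef.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Parser/Parser.h"

std::string Uncompress(llvm::StringRef data);  // hypothetical helper

// Try the plain-text parse first, since that is the default serialization;
// only on failure assume the payload is compressed and parse again.
mlir::OwningOpRef<mlir::ModuleOp> DeserializeModule(llvm::StringRef data,
                                                    mlir::MLIRContext* ctx) {
  mlir::ParserConfig config(ctx);
  if (auto module = mlir::parseSourceString<mlir::ModuleOp>(data, config)) {
    return module;
  }
  std::string uncompressed = Uncompress(data);
  return mlir::parseSourceString<mlir::ModuleOp>(uncompressed, config);
}
```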
PiperOrigin-RevId: 820396326