Commit Graph

186163 Commits

Author SHA1 Message Date
Parker Schuh
13ea97f3a9 Update PjRtStreamExecutorClient main execute path to use CommonPjRtBuffer::ScopedHold.
Crucially, this now always passes reference_held=true. This is fine because the
only time false was ever passed was when this was already on the compute stream,
and the bool is effectively ignored when the stream is the compute stream (see
MaybeWaitForEventOnStream).

PiperOrigin-RevId: 822758577
2025-10-22 15:35:02 -07:00
A. Unique TensorFlower
880f245b56 Allow TSL CellReader to work with lazy metrics.
PiperOrigin-RevId: 822757884
2025-10-22 15:25:07 -07:00
Parker Schuh
420ca15b61 Promote check to connection close.
PiperOrigin-RevId: 822746430
2025-10-22 14:52:22 -07:00
Eugene Zhulenev
2cdd8ff5ce [xla:ffi] Keep FFI handler metadata with handler registration
PiperOrigin-RevId: 822741325
2025-10-22 14:34:37 -07:00
Hyeontaek Lim
70111bb38f Reverts 16064a6c08
PiperOrigin-RevId: 822724128
2025-10-22 14:04:41 -07:00
A. Unique TensorFlower
aeda5dabd4 [XLA] Handle nested while loops in CollectivePipeliner.
This CL modifies the collective pipeliner to generate unique body and condition computations for newly generated while loop instructions.

PiperOrigin-RevId: 822719229
2025-10-22 13:47:32 -07:00
Maxim Ermilov
7b277367dc Remove inheritance of GpuComputeCapability from std::variant
PiperOrigin-RevId: 822701900
2025-10-22 13:33:16 -07:00
Parker Schuh
a6889b6922 Switch to using CommonAsyncHostToDeviceTransferManager.
PiperOrigin-RevId: 822701589
2025-10-22 13:21:45 -07:00
Matthias Guenther
6d1a7019f0 Fix issues in optimization patterns for broadcast_in_dim and pad ops.
- Prioritize replacing `broadcast_in_dim` with `reshape` over merging nested `broadcast_in_dim` ops. The new behavior matches the relevant MHLO optimization behavior, which proved to be preferable.
- Fix an issue where `pad` ops that didn't change the dimensions would be removed even if they shifted elements around within the tensor (e.g. padding by -1 on one side and +1 on the opposite side).
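
A minimal sketch of the `pad` case described above, in plain Python rather than StableHLO/MHLO. The helper `pad_1d` is hypothetical and only models edge padding on a 1-D list, with negative amounts cropping from that edge. It shows that a pad which leaves the shape unchanged can still move elements, so it cannot be treated as a no-op:

```python
def pad_1d(values, low, high, fill=0):
    """Pad a 1-D list; a negative amount crops from that edge instead."""
    # Negative low padding drops elements from the front.
    body = values[-low:] if low < 0 else [fill] * low + values
    # Negative high padding drops elements from the back.
    if high < 0:
        body = body[:high]
    else:
        body = body + [fill] * high
    return body


x = [1, 2, 3, 4]
shifted = pad_1d(x, low=-1, high=1)
print(shifted)  # [2, 3, 4, 0]: same length as x, different contents
```

Only a pad with zero padding on every edge is safe to remove based on the shape alone.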

PiperOrigin-RevId: 822701252
2025-10-22 13:11:10 -07:00
mmakevic-amd
a5524d43e6 PR #33008: [ROCm] Add CI specific bazelrc file
Imported from GitHub PR https://github.com/openxla/xla/pull/33008

📝 Summary of Changes
Add CI-specific bazelrc that will import both `rocm.bazelrc` from `/usertools` and `rocm_xla.bazelrc`

🎯 Justification
Temporary workaround until split logic in CI (which relies on `/usertools/rocm.bazelrc`) is removed

Copybara import of the project:

--
bb4cbf0c4fbf2c171110040c5c1470bddced203b by Milica Makevic <Milica.Makevic@amd.com>:

Add CI specific bazelrc

Merging this change closes #33008

PiperOrigin-RevId: 822700005
2025-10-22 12:50:14 -07:00
Zixuan Jiang
4d53eda2fe Refactor spmd partitioner.
PiperOrigin-RevId: 822689391
2025-10-22 12:23:05 -07:00
Maxim Ermilov
1b08f96abf Port to new GpuComputeCapability API. Last part
PiperOrigin-RevId: 822676102
2025-10-22 11:59:58 -07:00
Oleg Shyshkov
3503a61282 [XLA:GPU] Combine metadata AllToAlls in RaggedAllToAllMultiHostDecomposer.
Instead of performing four separate AllToAll operations, the metadata tensors are reshaped and concatenated, and a single AllToAll is executed. The result is then sliced back into the individual metadata tensors. This avoids the latency of initiating four separate collective operations.
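
The idea can be sketched in plain Python, simulating devices as nested lists. This is an illustration under assumptions, not the pass itself: `all_to_all`, `separate`, and `combined` are hypothetical helpers, and the reshape step of the real decomposer is omitted because the chunks here are already flat lists.

```python
def all_to_all(per_device):
    """per_device[src][dst] is the chunk that device src sends to device dst."""
    n = len(per_device)
    return [[per_device[src][dst] for src in range(n)] for dst in range(n)]


def separate(metadata):
    """One collective per metadata tensor; metadata[k][src][dst] is a chunk."""
    return [all_to_all(tensor) for tensor in metadata]


def combined(metadata):
    """Concatenate chunks, run a single collective, then slice the result."""
    n = len(metadata[0])
    sizes = [len(tensor[0][0]) for tensor in metadata]  # chunk length per tensor
    packed = [
        [sum((tensor[src][dst] for tensor in metadata), []) for dst in range(n)]
        for src in range(n)
    ]
    received = all_to_all(packed)  # the only collective in this path
    out, offset = [], 0
    for size in sizes:
        out.append(
            [[chunk[offset:offset + size] for chunk in received[dst]]
             for dst in range(n)]
        )
        offset += size
    return out


n = 3  # simulated device count
metadata = [
    [[[k * 100 + src * 10 + dst] for dst in range(n)] for src in range(n)]
    for k in range(4)  # four metadata tensors, as in the commit message
]
assert combined(metadata) == separate(metadata)
```

Both paths produce the same per-tensor results; the combined path issues one collective instead of four.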

PiperOrigin-RevId: 822674605
2025-10-22 11:49:53 -07:00
Ken Franko
85c99b1ecb Reverts 2d4dd83773
PiperOrigin-RevId: 822637158
2025-10-22 10:17:06 -07:00
Eugene Zhulenev
4827802e7c [xla:pjrt:ffi] Remove unused type id registration API
PiperOrigin-RevId: 822630041
2025-10-22 10:01:45 -07:00
Will Froom
d8dcad1639 [XLA:CPU] Reenable new fusions in xla_ops_test.
PiperOrigin-RevId: 822608974
2025-10-22 09:02:50 -07:00
Will Froom
3353eeeab7 [XLA:CPU] Only add reassoc flag to reductions with a single floating point op.
PiperOrigin-RevId: 822598746
2025-10-22 08:33:14 -07:00
Dimitar (Mitko) Asenov
bbea04967a Reverts c28d80ae66
PiperOrigin-RevId: 822586242
2025-10-22 08:02:26 -07:00
Marcin Radomski
94d00be0e6 [XLA:GPU] Fix incorrect namespace in buffer_debug_log.*
It was moved to stream_executor/gpu, but the code remained in the stream_executor::cuda namespace.

PiperOrigin-RevId: 822584666
2025-10-22 07:51:36 -07:00
Oleg Shyshkov
53499fe9d0 [XLA:GPU] Move offset correction logic into a helper function.
PiperOrigin-RevId: 822572708
2025-10-22 07:29:58 -07:00
Alexander Belyaev
a34be3eb68 [XLA:GPU] Ignore zero-sized constants in layout normalization.
PiperOrigin-RevId: 822571991
2025-10-22 07:16:10 -07:00
A. Unique TensorFlower
39506ad1cd Deduplicate functions onto the one with the largest number of call sites.
Previously the surviving function was picked arbitrarily.
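
A rough sketch of this selection rule, assuming functions can be compared by their body and that per-function call-site counts are available. The names `pick_canonical`, `functions`, and `call_sites` are hypothetical and do not come from the change itself:

```python
from collections import defaultdict


def pick_canonical(functions, call_sites):
    """Map each function name to the duplicate that should survive.

    functions:  name -> body (any hashable stand-in for the function body)
    call_sites: name -> number of call sites
    """
    groups = defaultdict(list)
    for name, body in functions.items():
        groups[body].append(name)

    replacement = {}
    for names in groups.values():
        # The most-called duplicate wins; ties fall back to the name so the
        # result is deterministic rather than arbitrary.
        keeper = max(names, key=lambda n: (call_sites.get(n, 0), n))
        for name in names:
            replacement[name] = keeper
    return replacement


functions = {"f": "x + 1", "g": "x + 1", "h": "x * 2"}
call_sites = {"f": 1, "g": 5, "h": 2}
print(pick_canonical(functions, call_sites))
# {'f': 'g', 'g': 'g', 'h': 'h'}
```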

PiperOrigin-RevId: 822566069
2025-10-22 06:55:15 -07:00
Thomas Joerg
83b84b3c46 [XLA:GPU] Add tests for transpose ops inserted by DotDecomposer.
Also be more precise about what is considered normal form and what is not.

PiperOrigin-RevId: 822554350
2025-10-22 06:18:34 -07:00
Kostiantyn Liepieshov
b5d09010cd Make adding missing shardings to control flow configurable in StableHLO export.
Introduce `addMissingShardingToControlFlow` option in `StablehloExportPipelineOptions` to control whether `ExportStablehloShardingsPass` adds missing shardings to control flow ops. Disable this option in `mlir_to_hlo.cc` when converting MLIR to HLO.

PiperOrigin-RevId: 822542288
2025-10-22 05:37:59 -07:00
A. Unique TensorFlower
3cc86433e3 Correctly set dnn_version in device_description when parsing from proto.
Remove the setting from the other two places, as it is no longer necessary.

PiperOrigin-RevId: 822533070
2025-10-22 05:02:14 -07:00
A. Unique TensorFlower
dfea7bb9a7 Automated Code Change
PiperOrigin-RevId: 822524939
2025-10-22 04:29:06 -07:00
A. Unique TensorFlower
e5e060f167 Refactor Dynamic_Update_Slice operator in preparation for porting to TFLM.
PiperOrigin-RevId: 822494887
2025-10-22 02:46:40 -07:00
A. Unique TensorFlower
85eff6042f Update GraphDef version to 2388.
PiperOrigin-RevId: 822485942
2025-10-22 02:32:52 -07:00
A. Unique TensorFlower
8f8707055c compat: Update forward compatibility horizon to 2025-10-22
PiperOrigin-RevId: 822485927
2025-10-22 02:22:32 -07:00
Chenhao Jiang
75fa34bbde PR #32231: Support forward conv with dilation and add basic heuristic for differ…
Imported from GitHub PR https://github.com/openxla/xla/pull/32231

📝 Summary of Changes
The changes enable native support for forward convolutions with window dilation in XLA's GPU backend. Previously, all dilated convolutions were treated as non-canonical and required explicit padding materialization. Now, forward convolutions with window dilation (but not base dilation) are preserved and handled natively by cuDNN, avoiding unnecessary padding overhead.

🎯 Justification
Performance Problem: JAX shows 15-23x slower performance than PyTorch for dilated convolutions (33.5ms vs 1.4ms at dilation rate 2). This is because XLA materializes dilated convolutions as padded convolutions instead of using cuDNN's native support.
Solution: Allow forward convolutions with window dilation to bypass padding materialization and use cuDNN's native dilated convolution kernels directly.
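
For intuition, window dilation is mathematically equivalent to a dense convolution whose kernel has zeros inserted between taps. The 1-D sketch below illustrates that equivalence in plain Python; it is not XLA's or cuDNN's implementation, the exact rewrite XLA performed is not shown in this message, and `conv1d` and `materialize_dilation` are hypothetical helper names.

```python
def conv1d(signal, kernel, dilation=1):
    """Valid (unpadded) 1-D cross-correlation with window dilation."""
    span = (len(kernel) - 1) * dilation + 1  # input extent covered by the window
    return [
        sum(kernel[k] * signal[i + k * dilation] for k in range(len(kernel)))
        for i in range(len(signal) - span + 1)
    ]


def materialize_dilation(kernel, dilation):
    """Insert dilation - 1 zeros between taps to get an equivalent dense kernel."""
    dense = []
    for k, weight in enumerate(kernel):
        if k:
            dense.extend([0] * (dilation - 1))
        dense.append(weight)
    return dense


signal = list(range(10))
kernel = [1, 2, 3]
native = conv1d(signal, kernel, dilation=2)
materialized = conv1d(signal, materialize_dilation(kernel, 2))
assert native == materialized
```

The materialized kernel grows with the dilation rate and spends multiplications on zeros, whereas the dilated form touches only the original taps.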

🚀 Kind of Contribution
Performance Improvement

📊 Benchmark (for Performance Improvements)
dilation 1:
	prev: 1.08 ms
	now: 1.07 ms
dilation 2:
	prev: 25.79 ms
	now: 0.91 ms
dilation 1024:
	prev: 26.24 ms
	now: 2.34 ms

Copybara import of the project:

--
b5a38df2ed4715b43fc8ca8d652005a35290d47e by Chenhao Jiang <chenhaoj@nvidia.com>:

Support forward conv with dilation and add basic heuristic for differentiating forward/backward

Merging this change closes #32231

PiperOrigin-RevId: 822482265
2025-10-22 02:03:50 -07:00
Jian Cai
95d3b6fe36 [XLA][Numerics][HLO Value Tracking] Handle original values in while loop fusible sinking pass
This reconstructs the original value for while loops with a rewritten input/output shape during the pass.

PiperOrigin-RevId: 822465131
2025-10-22 01:08:37 -07:00
Felix Wang
add51a87c3 [XLA:GPU] Update latency hiding scheduler cost models for B200/H100 FP8 matmul
PiperOrigin-RevId: 822446122
2025-10-22 00:01:00 -07:00
A. Unique TensorFlower
ca2365df32 Make the ApproxTopK op not fail with kMhloFrontendAttributes.
PiperOrigin-RevId: 822427505
2025-10-21 22:51:17 -07:00
Majid Dadashi
64f382ac25 Add support for kTfLiteInt2 (srq) in tfl.fully_connected.
PiperOrigin-RevId: 822405584
2025-10-21 21:33:00 -07:00
Parker Schuh
68ad2b30fa Implement PjRtStreamExecutorRawBuffer::CopyTo in terms of raw buffers.
PiperOrigin-RevId: 822345080
2025-10-21 17:58:31 -07:00
Haibo Huang
bdb268c5c5 Add helper functions to check PjRtPlatformId types.
PiperOrigin-RevId: 822333726
2025-10-21 17:13:03 -07:00
Derek Murray
69079b7e0d Add flag enable_fatal_error_on_collective_abort.
PiperOrigin-RevId: 822315284
2025-10-21 16:29:26 -07:00
Eugene Zhulenev
90491b0a55 [xla:pjrt:ffi] Prepare for legacy type registration removal
PiperOrigin-RevId: 822309311
2025-10-21 16:13:04 -07:00
Paul Ganssle
512611da80 Internal code migration
PiperOrigin-RevId: 822300362
2025-10-21 15:34:56 -07:00
Haibo Huang
b7d9295b52 Replace ComputationOrigin with the more general PjRtDeviceDimensions
PiperOrigin-RevId: 822288293
2025-10-21 15:11:47 -07:00
Olli Lupton
3cdcb03f18 PR #32838: Fix family-conditional logic
Imported from GitHub PR https://github.com/openxla/xla/pull/32838

📝 Summary of Changes
The fallback logic now correctly identifies the highest known compatible architecture when given an unknown architecture as input.

🎯 Justification
Previously the logic would propose an incompatible architecture in this case.

🚀 Kind of Contribution
🐛 Bug Fix

🧪 Unit Tests:
Added a new test case showing the previously-failing case (it used to propose `sm_110`)
Copybara import of the project:

--
f060bb9837d72159343ff2d52f5f2f42b1b7e9a4 by Olli Lupton <olupton@nvidia.com>:

Fix family-conditional logic

--
fc44dcd1e76da67c0b6fe53c33d2a571c3a6ff50 by Olli Lupton <olupton@nvidia.com>:

Accept CR suggestion

Merging this change closes #32838

PiperOrigin-RevId: 822284790
2025-10-21 14:59:18 -07:00
Eugene Zhulenev
0fc052399b [xla:cpu] Fix data race in ThunkExecutor
Also add tsl::down_pointer_cast to improve usability.

PiperOrigin-RevId: 822257137
2025-10-21 13:46:24 -07:00
Michael Whittaker
5776d2771c Pipe incarnations to jax.live_devices.
PiperOrigin-RevId: 822250955
2025-10-21 13:35:27 -07:00
mmakevic-amd
47cd01d4a5 PR #32960: [ROCm] Refactor testing scripts
Imported from GitHub PR https://github.com/openxla/xla/pull/32960

📝 Summary of Changes
(Partially) upstreaming changes from: https://github.com/ROCm/xla/pull/323, 9d358b9b26, and https://github.com/ROCm/xla/pull/385. It skips some asan/tsan changes for now.

🎯 Justification
These changes are ROCm-specific and help with ROCm internal CI validation pipelines.

🚀 Kind of Contribution
🐛 Bug Fix, ♻️ Cleanup, 🧪 Tests

📊 Benchmark (for Performance Improvements)
/

🧪 Unit Tests:
/

🧪 Execution Tests:
/

Copybara import of the project:

--
804ff1b6a6fbba86a3e0a09d739179a4eb4f197d by Milica Makevic <Milica.Makevic@amd.com>:

Add missing cuda-only tag to cuda test

--
44ce7a2d56c9f0c80405447f431ae1e5a33f42e1 by Milica Makevic <Milica.Makevic@amd.com>:

Refactor test scripts

--
fb783c968e9d2ff5d92357908d99e4952235c2bc by Milica Makevic <Milica.Makevic@amd.com>:

Cover more mgpu tests

--
1f53712274f76202241bd3631dbf065826c0b960 by Milica Makevic <Milica.Makevic@amd.com>:

Switch from rocm_gcc to rocm_ci for sgpu tests

--
00e0c8ee2a763680f5a3665dab62202ab230731d by Milica Makevic <Milica.Makevic@amd.com>:

Changing file permissions

--
003c062a8900c12b73c0972e8d406f2661a27aba by Milica Makevic <Milica.Makevic@amd.com>:

Remove unnecessary import

--
214599355f40f1b65e0540daf0b9829d2c950115 by Harsha HS <Harsha.HavanurShamsundara@amd.com>:

Add license header

Merging this change closes #32960

PiperOrigin-RevId: 822245565
2025-10-21 13:25:33 -07:00
Eugene Zhulenev
7a107e3571 [xla:ffi] Rename FFI_TypeID_Register API
PiperOrigin-RevId: 822240093
2025-10-21 13:12:40 -07:00
Felix Wang
95f3e6f33c [XLA:GPU]: Refactor the unit tests in matmul_interpolator_test.cc to prepare for adding the mixed-precision FP8 unit test.
PiperOrigin-RevId: 822239646
2025-10-21 13:02:53 -07:00
Felix Wang
2de2bb8581 Populate the cost for async collective in both async-start and the computation root op.
PiperOrigin-RevId: 822223031
2025-10-21 12:22:08 -07:00
Eugene Zhulenev
633c3efcf9 [xla:cpu] Delete unused cpu_function_runtime header
PiperOrigin-RevId: 822215543
2025-10-21 12:15:40 -07:00
Eugene Zhulenev
6141496817 [xla:ffi] Document XLA:FFI binary API guarantees and add a supported API range check
PiperOrigin-RevId: 822214561
2025-10-21 12:02:12 -07:00
Majid Dadashi
c37a4aaa58 Add support for kTfLiteInt2 to Dequantize kernels.
This change enables the Dequantize and PerChannelDequantize operations to handle 2-bit integer inputs (`kTfLiteInt2`). It includes logic to unpack the packed 2-bit integers into int8_t before performing the dequantization and adds new test cases for both per-tensor and per-channel dequantization with kTfLiteInt2.
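
The unpack-then-dequantize flow can be sketched as follows. This is an illustration only: the bit order within a byte (first value in the least-significant bits) is an assumption and may not match the TFLite kernels, and `unpack_int2` and `dequantize` are hypothetical names.

```python
def unpack_int2(packed, count):
    """Unpack `count` signed 2-bit values from bytes, four values per byte.

    Assumption: the first value sits in the least-significant two bits.
    """
    values = []
    for index in range(count):
        byte = packed[index // 4]
        raw = (byte >> (2 * (index % 4))) & 0b11
        # Sign-extend two's complement: 0b10 -> -2, 0b11 -> -1.
        values.append(raw - 4 if raw >= 2 else raw)
    return values


def dequantize(values, scale, zero_point=0):
    """Per-tensor affine dequantization: real = scale * (q - zero_point)."""
    return [scale * (v - zero_point) for v in values]


packed = bytes([0b11100100])  # holds 0, 1, -2, -1 under the assumed bit order
unpacked = unpack_int2(packed, 4)
print(unpacked)                   # [0, 1, -2, -1]
print(dequantize(unpacked, 0.5))  # [0.0, 0.5, -1.0, -0.5]
```

Per-channel dequantization would apply the same formula with a separate scale (and zero point) for each channel.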

PiperOrigin-RevId: 822207279
2025-10-21 11:49:46 -07:00