This now always passes reference_held=true, which is fine: the only case where it was ever passed as false was when we were already on the compute stream, and the flag is effectively ignored when the stream is the compute stream (see MaybeWaitForEventOnStream).
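A minimal, hypothetical sketch of the control flow this reasoning relies on (the names and signature below are illustrative, not the real XLA API): when the target stream is the compute stream, the reference_held flag has no effect, so hardcoding it to true is safe.

```python
# Hypothetical illustration only; names and signature are not the real XLA API.
def maybe_wait_for_event_on_stream(event, stream, compute_stream, reference_held):
    if stream == compute_stream:
        # Work on the compute stream is already ordered after the event, so
        # reference_held is never consulted on this path.
        return "no-op (flag ignored)"
    # Off the compute stream the wait is real and the flag matters.
    return f"wait on {stream}, reference_held={reference_held}"

# Hardcoding reference_held=True is therefore harmless when we are already on
# the compute stream:
print(maybe_wait_for_event_on_stream("e0", "compute", "compute", reference_held=True))
print(maybe_wait_for_event_on_stream("e0", "transfer", "compute", reference_held=True))
```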
PiperOrigin-RevId: 822758577
This CL modifies the collective pipeliner to generate unique body and condition computations for newly generated while loop instructions.
PiperOrigin-RevId: 822719229
- Prioritize replacing `broadcast_in_dim` with `reshape` over merging nested `broadcast_in_dim` ops. The new behavior matches the corresponding MHLO optimization, which proved preferable.
- Fix an issue where `pad` ops that didn't change the dimensions would be removed even if they shifted elements around within the tensor (e.g. padding by -1 on one side and +1 on the opposite side); see the sketch below.
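A small JAX sketch of the pad case from the second bullet (values are arbitrary): padding by -1 on the low edge and +1 on the high edge leaves the shape unchanged but shifts every element, so such an op must not be dropped.

```python
import jax.numpy as jnp
from jax import lax

x = jnp.array([1., 2., 3., 4.])
# Pad config (low, high, interior) = (-1, 1, 0): the shape stays (4,), but the
# elements are shifted left and a padding value enters on the right.
y = lax.pad(x, jnp.float32(0), [(-1, 1, 0)])
print(x)  # [1. 2. 3. 4.]
print(y)  # [2. 3. 4. 0.]
```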
PiperOrigin-RevId: 822701252
Imported from GitHub PR https://github.com/openxla/xla/pull/33008
📝 Summary of Changes
Add a CI-specific bazelrc that imports both `rocm.bazelrc` from `/usertools` and `rocm_xla.bazelrc`
🎯 Justification
Temporary workaround until the split logic in CI (which relies on `/usertools/rocm.bazelrc`) is removed
Copybara import of the project:
--
bb4cbf0c4fbf2c171110040c5c1470bddced203b by Milica Makevic <Milica.Makevic@amd.com>:
Add CI specific bazelrc
Merging this change closes #33008
PiperOrigin-RevId: 822700005
Instead of performing four separate AllToAll operations, the metadata tensors are reshaped, concatenated, and then a single AllToAll is executed. The result is then sliced back into the individual metadata tensors. This reduces the latency overhead of launching separate collective operations.
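A hypothetical NumPy sketch of the pack/slice pattern described above (the helper names and shapes are made up): the metadata tensors are flattened and concatenated so a single collective can run on one buffer, and the result is sliced back into the individual tensors.

```python
import numpy as np

# Hypothetical sketch of the packing pattern; not the actual XLA implementation.
def pack(tensors):
    # Flatten each tensor and record the split sizes for later unpacking.
    flat = [t.reshape(-1) for t in tensors]
    sizes = [f.size for f in flat]
    return np.concatenate(flat), sizes

def unpack(buffer, sizes, shapes):
    # Slice the combined result back into the individual metadata tensors.
    offsets = np.cumsum([0] + sizes)
    return [buffer[offsets[i]:offsets[i + 1]].reshape(shape)
            for i, shape in enumerate(shapes)]

metadata = [np.full((2, 3), 1.0), np.full((4,), 2.0),
            np.full((2, 2), 3.0), np.full((5,), 4.0)]
packed, sizes = pack(metadata)
# ... one AllToAll would run on `packed` here instead of four separate ones ...
restored = unpack(packed, sizes, [t.shape for t in metadata])
assert all(np.array_equal(a, b) for a, b in zip(metadata, restored))
```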
PiperOrigin-RevId: 822674605
Introduce `addMissingShardingToControlFlow` option in `StablehloExportPipelineOptions` to control whether `ExportStablehloShardingsPass` adds missing shardings to control flow ops. Disable this option in `mlir_to_hlo.cc` when converting MLIR to HLO.
PiperOrigin-RevId: 822542288
Imported from GitHub PR https://github.com/openxla/xla/pull/32231
📝 Summary of Changes
The changes enable native support for forward convolutions with window dilation in XLA's GPU backend. Previously, all dilated convolutions were treated as non-canonical and required explicit padding materialization. Now, forward convolutions with window dilation (but not base dilation) are preserved and handled natively by cuDNN, avoiding unnecessary padding overhead.
🎯 Justification
Performance Problem: JAX shows 15-23x slower performance than PyTorch for dilated convolutions (33.5ms vs 1.4ms at dilation rate 2). This is because XLA materializes dilated convolutions as padded convolutions instead of using cuDNN's native support.
Solution: Allow forward convolutions with window dilation to bypass padding materialization and use cuDNN's native dilated convolution kernels directly.
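A minimal JAX sketch of the distinction involved (shapes are arbitrary): `rhs_dilation` is the window (kernel) dilation that this change lets cuDNN handle natively, while `lhs_dilation` is the base (input) dilation that still takes the canonicalization/padding path.

```python
import jax
import jax.numpy as jnp

x = jnp.ones((1, 8, 32, 32), dtype=jnp.float32)   # NCHW input
w = jnp.ones((16, 8, 3, 3), dtype=jnp.float32)    # OIHW kernel
y = jax.lax.conv_general_dilated(
    x, w,
    window_strides=(1, 1),
    padding="SAME",
    lhs_dilation=(1, 1),   # base (input) dilation: unchanged by this change
    rhs_dilation=(2, 2),   # window (kernel) dilation: per the description above,
                           # now handled natively instead of via padding
)
print(y.shape)  # (1, 16, 32, 32)
```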
🚀 Kind of Contribution
Performance Improvement
📊 Benchmark (for Performance Improvements)
dilation 1:    prev 1.08 ms → now 1.07 ms
dilation 2:    prev 25.79 ms → now 0.91 ms
dilation 1024: prev 26.24 ms → now 2.34 ms
Copybara import of the project:
--
b5a38df2ed4715b43fc8ca8d652005a35290d47e by Chenhao Jiang <chenhaoj@nvidia.com>:
Support forward conv with dilation and add basic heuristic for differentiating forward/backward
Merging this change closes #32231
PiperOrigin-RevId: 822482265
Imported from GitHub PR https://github.com/openxla/xla/pull/32838
📝 Summary of Changes
The fallback logic now correctly identifies the highest known compatible architecture when given an unknown architecture as input.
🎯 Justification
Previously the logic would propose an incompatible architecture in this case.
🚀 Kind of Contribution
🐛 Bug Fix
🧪 Unit Tests:
Added a new test case showing the previously-failing case (it used to propose `sm_110`)
Copybara import of the project:
--
f060bb9837d72159343ff2d52f5f2f42b1b7e9a4 by Olli Lupton <olupton@nvidia.com>:
Fix family-conditional logic
--
fc44dcd1e76da67c0b6fe53c33d2a571c3a6ff50 by Olli Lupton <olupton@nvidia.com>:
Accept CR suggestion
Merging this change closes #32838
PiperOrigin-RevId: 822284790
Imported from GitHub PR https://github.com/openxla/xla/pull/32960
📝 Summary of Changes
Partially upstreams changes from https://github.com/ROCm/xla/pull/323, 9d358b9b26, and https://github.com/ROCm/xla/pull/385. Some asan/tsan changes are skipped for now.
🎯 Justification
These changes are ROCm-specific and help with ROCm's internal CI validation pipelines.
🚀 Kind of Contribution
🐛 Bug Fix, ♻️ Cleanup, 🧪 Tests
📊 Benchmark (for Performance Improvements)
/
🧪 Unit Tests:
/
🧪 Execution Tests:
/
Copybara import of the project:
--
804ff1b6a6fbba86a3e0a09d739179a4eb4f197d by Milica Makevic <Milica.Makevic@amd.com>:
Add missing cuda-only tag to cuda test
--
44ce7a2d56c9f0c80405447f431ae1e5a33f42e1 by Milica Makevic <Milica.Makevic@amd.com>:
Refactor test scripts
--
fb783c968e9d2ff5d92357908d99e4952235c2bc by Milica Makevic <Milica.Makevic@amd.com>:
Cover more mgpu tests
--
1f53712274f76202241bd3631dbf065826c0b960 by Milica Makevic <Milica.Makevic@amd.com>:
Switch from rocm_gcc to rocm_ci for sgpu tests
--
00e0c8ee2a763680f5a3665dab62202ab230731d by Milica Makevic <Milica.Makevic@amd.com>:
Changing file permissions
--
003c062a8900c12b73c0972e8d406f2661a27aba by Milica Makevic <Milica.Makevic@amd.com>:
Remove unnecessary import
--
214599355f40f1b65e0540daf0b9829d2c950115 by Harsha HS <Harsha.HavanurShamsundara@amd.com>:
Add license header
Merging this change closes #32960
PiperOrigin-RevId: 822245565
This change enables the Dequantize and PerChannelDequantize operations to handle 2-bit integer inputs (`kTfLiteInt2`). It includes logic to unpack the packed 2-bit integers into int8_t before performing the dequantization, and adds new test cases for both per-tensor and per-channel dequantization with `kTfLiteInt2`.
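A hypothetical NumPy sketch of the unpack-then-dequantize flow described above (the LSB-first bit order and helper names are assumptions, not the actual TFLite layout): four signed 2-bit values per byte are widened to int8 before the usual scale/zero-point dequantization is applied.

```python
import numpy as np

# Hypothetical sketch; the LSB-first packing order and helper names are
# assumptions, not the actual kTfLiteInt2 layout.
def unpack_int2(packed: np.ndarray, num_values: int) -> np.ndarray:
    out = np.empty(num_values, dtype=np.int8)
    for i in range(num_values):
        bits = int((packed[i // 4] >> (2 * (i % 4))) & 0b11)  # one 2-bit field
        out[i] = bits - 4 if bits >= 2 else bits               # sign-extend to [-2, 1]
    return out

def dequantize(values: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    # Standard affine dequantization: scale * (value - zero_point).
    return scale * (values.astype(np.float32) - zero_point)

packed = np.array([0b11100100], dtype=np.uint8)  # raw 2-bit fields: 0, 1, 2, 3
print(dequantize(unpack_int2(packed, 4), scale=0.5, zero_point=0))
# -> [ 0.   0.5 -1.  -0.5]
```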
PiperOrigin-RevId: 822207279