Future note: At some point, `NanoArray` will need to distinguish between a default layout and a concrete layout that is equal to the default layout. If the latter is used, `NanoArray::pjrt_layout()` is expected to return the concrete layout. This is not required by IFRT API semantics yet, but it will be enforced in the future.
PiperOrigin-RevId: 821808592
When specifying a mapping name to `CreateFileMappingA()`, that function returns
previous mappings that match the same name disregarding the newly requested
mapping size. This doesn't work well with the weight cache that is built (and
mapped) incrementally.
By making the mapping objects anonymous, we ensure that the mapping returned
will have the requested size.
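
The incremental build-and-map pattern can be illustrated with a minimal sketch. It uses Python's standard `mmap` module rather than the actual weight cache code, and the file name and chunk sizes are made up for the example. On Windows, `mmap.mmap` accepts an optional `tagname` that names the mapping; leaving it out gives an unnamed mapping, so each call maps the size that was requested.

```python
import mmap
import os
import tempfile


def map_whole_file(f):
    """Map the current contents of an open file, read-only.

    No `tagname` is passed, so on Windows the mapping object is unnamed
    and cannot alias an earlier, smaller mapping of the same file.
    """
    size = os.fstat(f.fileno()).st_size
    return mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)


mapped_sizes = []
with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical cache file that grows as more data is appended.
    path = os.path.join(tmp_dir, "weights.cache")
    with open(path, "w+b") as f:
        for chunk in (b"a" * 4096, b"b" * 8192):
            f.seek(0, os.SEEK_END)
            f.write(chunk)
            f.flush()
            mapping = map_whole_file(f)
            mapped_sizes.append(len(mapping))
            mapping.close()
```

Each remap covers the whole file as it exists at that point, which is the property the anonymous mapping objects are meant to guarantee.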
Note: this doesn't increase the total memory used by the process, but the
accounting by the Windows system is different. Compared to a fix that allocates
memory instead of mapping the file, less memory is committed and private, and
more is shareable.
Testing `litert_llm_main` on [Gemma3-1B-IT] on Windows 11.
| Fix | Commit (KB) | Working Set (KB) | Shareable (KB) | Private (KB) |
| ---------: | -----------:| ----------------:| --------------:| ------------:|
| Anon. map | 1 208 416 | 1 678 396 | 1 079 620 | 599 096 |
| Mem. alloc | 1 705 620 | 1 678 572 | 582 428 | 1 096 144 |
| | | | | |
| diff. | +497 204 | +176 | -497 192 | +497 048 |
[Gemma3-1B-IT]: https://huggingface.co/litert-community/Gemma3-1B-IT/blob/main/gemma3-1b-it-int4.litertlm
PiperOrigin-RevId: 821807004
We can now produce arbitrary iteration patterns for output tiles, simply by
parametrizing calls to `ComputeTiledHloInstructions` with different
`TiledHloSchedule`s.
PiperOrigin-RevId: 821796530
IFRT Proxy now returns a `nullptr` if it knows that the Array layout represents a default layout. User code has already been migrated to handle this new behavior gracefully, obtaining a concrete default layout as before.
Caveat: The IFRT Proxy client infers the layout of the output arrays from `LoadedExecutable::GetOutputLayouts()`, which always returns concrete layouts today. Thus, these output arrays use concrete layouts for default layouts, even if the arrays on the server side use `nullptr` for default layouts. This behavior is currently acceptable because all users convert the layout into a concrete one before using it. It will eventually change so that the IFRT Proxy client reflects the array layouts on the server side more accurately.
PiperOrigin-RevId: 821741105
The functionality was removed previously, but the option was never cleaned up. This does not remove the `xla_ignore_channel_id` debug option because it also has a non-verifier use.
PiperOrigin-RevId: 821737613
Right now, we use `GetXlaPjrtCpuClient` which in turn calls `GetPjRtCpuClient`, but we will later update `GetXlaPjrtCpuClient` to use the C sandwich, in which case we must call `GetPjRtCpuClient` here in `PJRT_Client_Create`.
This change is a no-op.
PiperOrigin-RevId: 821732030
The `dnn_version` in `device_description` was not set. cl/816579045 fixed it for the old autotuner infra; this change ports that fix to the new autotuner infra.
PiperOrigin-RevId: 821728904
- We encounter this case very often (for the cuBLAS autotuner), so it makes sense to optimize it.
- Running cuBLAS kernels as part of autotuning has an unintended side effect that changes the optimized HLO. This fix also mitigates that issue while we look into it further.
PiperOrigin-RevId: 821716593
Imported from GitHub PR https://github.com/openxla/xla/pull/32782
📝 Summary of Changes
Fix hermetic build for rocm.
🎯 Justification
Introduce missing hipblaslt dependency.
Fix invalid libs linking and align with the data directories.
🚀 Kind of Contribution
🐛 Bug Fix
📊 Benchmark (for Performance Improvements)
CI, not relevant
🧪 Unit Tests:
Not relevant
🧪 Execution Tests:
Not relevant
Copybara import of the project:
--
f5cb68b0df2265b7048d0068eedd07cccf67e228 by Alexandros Theodoridis <atheodor@amd.com>:
Add missing hermetic lib dependency
--
fe0c9a7fdd36180fea5cf63e20d864355ed98a6c by Alexandros Theodoridis <atheodor@amd.com>:
Add missing hipblaslt deps, fix the targets
--
540d79dd4287a013a3f178ef34a5b96fb8a8a92f by Alexandros Theodoridis <atheodor@amd.com>:
Make hipblaslt mandatory
--
3a6f2282669a1ece4518cc69a01ad76275b603a1 by Alexandros Theodoridis <atheodor@amd.com>:
Fix test
--
eb21b60d34978191315a0c9775d2cb53309dc72d by Alexandros Theodoridis <atheodor@amd.com>:
Ignore asnsigaltstack
--
54c8af2abd7dd682a8494caa05854d574209aa20 by Harsha Havanur Shamsundara <harsha.havanurshamsundara@amd.com>:
[ROCm] Use working sha256 for latest ROCm 7.0 docker image
--
9629a9fc9201a80dba7a0beecb8ee0797960ff6f by Harsha HS <Harsha.HavanurShamsundara@amd.com>:
[ROCm] Add ROCM_PATH repo_env to test scripts
--
1ef6772c6df6aeffcbcc2f27a0ede558fbc6270f by Alexandros Theodoridis <atheodor@amd.com>:
Fix buildifier warning
Merging this change closes #32782
PiperOrigin-RevId: 821614030
In cases where a program argument with AUTO layout is used in more than one Fragment, enforce the DEFAULT layout, as we cannot allow different compiled layouts.
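
The rule can be sketched as follows. This is only an illustration under assumptions: the names `resolve_arg_layouts`, `requested`, and `fragments` are hypothetical and do not come from the actual change, and layouts are modeled as plain strings.

```python
from collections import Counter
from typing import Dict, List


def resolve_arg_layouts(requested: Dict[str, str],
                        fragments: List[List[str]]) -> Dict[str, str]:
    """Downgrade AUTO to DEFAULT for arguments used by more than one fragment.

    `requested` maps argument name -> "AUTO" or "DEFAULT" (hypothetical model).
    `fragments` lists, per fragment, the argument names it consumes.
    """
    # Count in how many distinct fragments each argument appears.
    uses = Counter(arg for frag in fragments for arg in set(frag))
    return {
        arg: "DEFAULT" if mode == "AUTO" and uses[arg] > 1 else mode
        for arg, mode in requested.items()
    }


resolved = resolve_arg_layouts(
    {"a": "AUTO", "b": "AUTO", "c": "DEFAULT"},
    [["a", "b"], ["a", "c"]],
)
```

Here `a` is shared by both fragments and is forced to DEFAULT, while `b` is used by a single fragment and keeps AUTO.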
PiperOrigin-RevId: 821612799
This enables migrating the Triton emitter to emit xtile entry, insert, and extract ops in the child PR.
The main difference is the memref arguments of the entry function, for which `MemrefToPtr` and `PtrToMemref` were introduced. These closely resemble `UnrealizedConversionCastOp`, with additional verification, and will enable special folding of `memref::TransposeOp`.
PiperOrigin-RevId: 821593545
This gives us the two HalfClose events plus HandleEvent() and SendRawFrame() as
the API from the socket integration, and subclasses can handle these
accordingly. This also moves the responsibility for destruction into the handler logic,
with the contract that the event is removed from the loop on the second HalfClose event.
PiperOrigin-RevId: 821445213
Given a user seed, this updates the MSA sort order priority of a (small?) number of randomly selected instructions during compilation.
This causes small perturbations in the compiler's prefetching decisions, which enables two main use cases:
1. finding out if there is a single instruction which was given a "wrong" priority by the compiler so it can be fixed
   - to do this, we run some benchmark many times with different seeds until we find a seed that drastically reduces the compiled code's runtime
   - once we have found that seed, we can use binary search to decrease the "selection range" and zero in on the one specific offending instruction
2. finding a lot of small changes that together reduce the runtime
   - we can do this using a "hill-climbing" method
     - try many perturbations until you find one slightly better than the baseline
     - try many follow-up perturbations (perturbing the best perturbation from the previous stage) until you find one slightly better again
     - repeat until no more improvements are found
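
The hill-climbing loop in (2) could look roughly like the sketch below. Everything here is an assumption made for illustration: a perturbation is modeled as a list of seeds applied in sequence, `measure` stands in for "compile with these seeds and run the benchmark", and `simulated_runtime` is a toy stand-in with a floor so the search terminates.

```python
import random
from typing import Callable, List, Tuple

Perturbation = List[int]  # hypothetical model: seeds applied one after another


def hill_climb(measure: Callable[[Perturbation], float],
               tries_per_round: int,
               rng: random.Random) -> Tuple[Perturbation, float]:
    """Stack perturbations for as long as some candidate beats the current best."""
    best: Perturbation = []
    best_time = measure(best)  # baseline: no perturbation
    while True:
        for _ in range(tries_per_round):
            # Perturb the best perturbation found so far with one more seed.
            candidate = best + [rng.randrange(2**31)]
            candidate_time = measure(candidate)
            if candidate_time < best_time:
                best, best_time = candidate, candidate_time
                break  # start the next round from the new best
        else:
            # A whole round without improvement: stop.
            return best, best_time


def simulated_runtime(seeds: Perturbation) -> float:
    """Toy benchmark: each seed shifts the runtime by -2..+2, floored at 90."""
    return max(90.0, 100.0 + sum(seed % 5 - 2 for seed in seeds))


best, best_time = hill_climb(simulated_runtime, tries_per_round=20,
                             rng=random.Random(0))
```

With a real benchmark, measurement noise would have to be handled before accepting a candidate as "slightly better"; the sketch ignores that.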
NOTE: Right now there is no good way to find which instructions had their priority adjusted (this is especially important in (1), to find the one offending instruction). The only way to do so is to increase the log level of the compilation debug output and then look at the logs.
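
The binary search from (1) can be sketched as below, assuming a single instruction is responsible for the speed-up. The hook `is_fast(lo, hi)` is hypothetical: it stands for "recompile with the good seed while perturbing only instructions in the range [lo, hi), and check whether the speed-up is still there". The simulated version just checks whether a fixed index falls in the range.

```python
from typing import Callable


def narrow_to_instruction(is_fast: Callable[[int, int], bool],
                          lo: int, hi: int) -> int:
    """Halve the selection range [lo, hi) until one instruction index remains."""
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_fast(lo, mid):
            hi = mid  # the speed-up survives in the lower half
        else:
            lo = mid  # otherwise the offender is in the upper half
    return lo


OFFENDER = 37  # made-up index, used only by the simulated benchmark


def simulated_is_fast(lo: int, hi: int) -> bool:
    return lo <= OFFENDER < hi


found = narrow_to_instruction(simulated_is_fast, 0, 100)
```

Each step costs one compile-and-benchmark run, so a range of N instructions needs about log2(N) runs.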
PiperOrigin-RevId: 821309046
This change is a no-op, since both the newly introduced XLA:TPU option and the corresponding option on the ExportNamedComputation pass are false by default.
PiperOrigin-RevId: 821039969
the transposes are not identity permutations. Identity transposes
should be eliminated separately in HandleTranspose already.
PiperOrigin-RevId: 820903953