Commit Graph

20081 Commits

Author SHA1 Message Date
Ilya Tikhonovskiy
4f3f2c9444 [XLA:GPU] Add NanCount thunk to thunk_buffer_debug_pass
We call the pass for f32 and bf16 output buffers.

PiperOrigin-RevId: 826808271
2025-11-01 03:12:22 -07:00
A. Unique TensorFlower
4618f903c4 Reverts bec8916f32
PiperOrigin-RevId: 826722506
2025-10-31 20:19:17 -07:00
Eugene Zhulenev
752a654e9e [jax:ffi] Declare ffi::TypeInfo as a struct static member
PiperOrigin-RevId: 826717163
2025-10-31 20:00:24 -07:00
Terry Sun
8134117476 PR #32836: [GPU] Dispatch S-curve model to single-partition multi-host topology
Imported from GitHub PR https://github.com/openxla/xla/pull/32836

📝 Summary of Changes
Updated the SINGLE_HOST communication type to SINGLE_PARTITION (fast-interconnect domain) to meet the needs of the multi-node NVLink (MNNVL) topology. Piped the auto-detected partition size through for communication-type determination, and also exposed the partition size in SolGPUCostModel::Config for AOT compilation.

🎯 Justification
The S-curve model cannot handle NVLink latency; a single fast-interconnect domain, including an MNNVL topology, should use the latency-table model instead. This PR updates the routing mechanism so that MNNVL is treated as a single partition, whereas previously a host was assumed to be equivalent to a partition.
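
A minimal sketch of the dispatch idea described above, assuming a hypothetical `ClassifyCollective` helper and enum; the real routing lives in SolGPUCostModel and differs in detail:

```cpp
#include <cstdint>

// Hypothetical enum mirroring the communication types named above.
enum class CommType { kSinglePartition, kMultiPartition };

// Collectives that fit inside one fast-interconnect domain (e.g. an
// MNNVL partition) go to the latency-table model; everything larger
// keeps the S-curve model. Names are illustrative, not the actual
// SolGPUCostModel API.
CommType ClassifyCollective(int64_t num_participating_devices,
                            int64_t partition_size) {
  return num_participating_devices <= partition_size
             ? CommType::kSinglePartition   // latency-table model
             : CommType::kMultiPartition;   // S-curve model
}
```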

🚀 Kind of Contribution
New Feature

📊 Benchmark (for Performance Improvements)
N/A

🧪 Unit Tests:
Added unit tests for the model dispatching mechanism.

🧪 Execution Tests:
N/A; behavior is unchanged for non-MNNVL topologies.

Copybara import of the project:

--
a9544375934873f7b888fdb5ff6c9dc6ee8b0e6c by Terry Sun <tesun@nvidia.com>:

use partition size for static model dispatching

--
e3445a5deb8da10146e90c50da5598f91cfe0a69 by Terry Sun <tesun@nvidia.com>:

expose partition size to config

--
212535ce891b8eb96ebb3c1e215a91d2b5035594 by Terry Sun <tesun@nvidia.com>:

better modularity

--
a9fe8a0f89dea9e2811d76a3570c7398df8dd756 by Terry Sun <tesun@nvidia.com>:

better code structure and doc string

--
a64a2b5ed1d45d815c6a2c47628b4d9ebb8368bd by Terry Sun <tesun@nvidia.com>:

update naming

Merging this change closes #32836

PiperOrigin-RevId: 826697791
2025-10-31 18:28:25 -07:00
Eugene Zhulenev
dad4fb74cd [xla:ffi] Remove deprecated TypeInfo constructor and replace it with XLA_FFI_TypeInfo alias
PiperOrigin-RevId: 826692285
2025-10-31 18:05:12 -07:00
Eugene Zhulenev
fbd032df67 [xla:cpu] Pass HloModule pointer to Thunk SerDes
Reverts 993369077a

PiperOrigin-RevId: 826675119
2025-10-31 17:12:35 -07:00
Penporn Koanantakool
00be2bc09e [xla:cpu:onednn] Skip failing tests on Aarch64 CPUs.
PiperOrigin-RevId: 826675056
2025-10-31 16:58:45 -07:00
Eugene Zhulenev
0b5bc94a83 [xla:ffi] Migrate to xla::ffi::MakeTypeInfo() API
PiperOrigin-RevId: 826667736
2025-10-31 16:42:38 -07:00
A. Unique TensorFlower
15e235f79b Allow IFRT-proxy to expand error-status payloads that are specific to Pathways.
PiperOrigin-RevId: 826656416
2025-10-31 15:58:26 -07:00
Sevin Fide Varoglu
c655468288 PR #31375: [XLA:GPU] Add NVLink domain check to CollectiveBackendAssigner
Imported from GitHub PR https://github.com/openxla/xla/pull/31375

📝 Summary of Changes
This PR updates the CollectiveBackendAssigner pass to account for NVLink domain connectivity when deciding between NVSHMEM and DEFAULT backends. It does this by adding a slice_size parameter to the compilation pipeline and introducing an IsIntraNVLinkDomain check.

🎯 Justification
The CollectiveBackendAssigner now uses NVSHMEM not only for single-host scenarios, but also when all devices are within the same NVLink domain.
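
A hedged sketch of the kind of check the PR describes, assuming for illustration that devices are numbered so that each NVLink domain holds `slice_size` consecutive IDs; this is a simplified stand-in, not the actual pass code:

```cpp
#include <cstdint>
#include <vector>

// Illustrative stand-in for the IsIntraNVLinkDomain check: a
// collective stays intra-domain iff all participating devices map to
// the same slice index.
bool IsIntraNVLinkDomain(const std::vector<int64_t>& device_ids,
                         int64_t slice_size) {
  if (device_ids.empty() || slice_size <= 0) return false;
  const int64_t slice = device_ids.front() / slice_size;
  for (int64_t id : device_ids) {
    if (id / slice_size != slice) return false;  // crosses domains
  }
  return true;  // eligible for the NVSHMEM backend
}
```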

🚀 Kind of Contribution
Performance Improvement, 🧪 Tests

📊 Benchmark (for Performance Improvements)
H100
| Benchmark | NVSHMEM enabled | NVSHMEM disabled |
|----------|----------|----------|
| llama31_8b_fp8_1x8    | 1095330 us   | 1093816 us    |
| llama31_8b_bf16_2x8    | 1368948 us   | 1370896 us   |
| llama31_8b_fp8_2x8    | 1096447 us   | 1092437 us   |
| llama31_70b_fp8_16x8    | 9723821 us   | 9707544 us    |

🧪 Unit Tests:
Added unit tests to xla/service/gpu/transforms/collectives/collective_backend_assigner_test.cc

🧪 Execution Tests:
Tested with llama3-8b on 2 GB200 nodes (fsdp = 8). The average step time in the NVSHMEM case was 3.69s (vs. 3.76s in the default case).

Copybara import of the project:

--
a02b77cec9622314af01ae481d0fb28b149f1b45 by Sevin Varoglu <svaroglu@nvidia.com>:

Add NVLink domain check to CollectiveBackendAssigner

Merging this change closes #31375

PiperOrigin-RevId: 826649437
2025-10-31 15:48:52 -07:00
Zixuan Jiang
bf84442f21 Refactor mesh and axis representation.
PiperOrigin-RevId: 826647907
2025-10-31 15:36:42 -07:00
Jian Cai
9c620f90b8 [XLA][Numerics][HLO Original Value] Support original values for more cases in while loop simplifier pass
This updates the original value of a while loop if its input was a nested tuple that got flattened during the pass.

PiperOrigin-RevId: 826644894
2025-10-31 15:23:52 -07:00
A. Unique TensorFlower
80048022c7 Update XNNPACK in XLA
PiperOrigin-RevId: 826626298
2025-10-31 14:36:06 -07:00
Bill Varcho
261e077984 [ReplicaGroupV3][MeshAxesReplicaGroupList][2/2] Add flattened_replica_groups function for MeshAxesReplicaGroupList.
PiperOrigin-RevId: 826619318
2025-10-31 14:19:13 -07:00
Jian Cai
a6e123761d [XLA][Numerics][HLO Original Values] Handles original values of while loops in TPU reduce code motion pass
This updates the original value of a while loop after its input/output shape changes when the pass sinks qualified reduce instructions into its body.

PiperOrigin-RevId: 826618908
2025-10-31 14:05:49 -07:00
Parker Schuh
eef0661fc5 Rollforward with fixes of "Change RawSEDeviceMemory to be AsyncValueRef".
Reverts c7055c2e5b

PiperOrigin-RevId: 826608975
2025-10-31 13:39:01 -07:00
Bill Varcho
d008dc3999 Reverts d25ccb438d
PiperOrigin-RevId: 826583855
2025-10-31 12:35:26 -07:00
Haibo Huang
8572aaa4e9 Unify topology in PjRtTopologyDescription
The topology at the PjRt layer can be seen as:

(process, chip, logical device) or (process, chip, core)

For CPU, it is (1, num devices, 1).

For GPU, it is (num hosts, GPUs per host, 1).
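
As a hedged illustration of those triples (the struct and field names here are hypothetical, not the PjRtTopologyDescription API):

```cpp
#include <cstdint>

// Hypothetical shape of the unified triple described above; the real
// PjRtTopologyDescription carries much more than these three counts.
struct TopologyShape {
  int64_t num_processes;      // hosts / processes
  int64_t chips_per_process;  // physical devices per process
  int64_t cores_per_chip;     // logical devices / cores per chip
};

// CPU: one process exposing N devices, one core each (N=8 assumed).
constexpr TopologyShape kCpu{1, /*num_devices=*/8, 1};
// GPU: one process per host, several GPUs per host, one core each.
constexpr TopologyShape kGpu{/*num_hosts=*/4, /*gpus_per_host=*/8, 1};
```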

PiperOrigin-RevId: 826581627
2025-10-31 12:22:41 -07:00
A. Unique TensorFlower
e0f6a6c7f3 Integrate LLVM at llvm/llvm-project@42a8ff877d
Updates LLVM usage to match
[42a8ff877d47](https://github.com/llvm/llvm-project/commit/42a8ff877d47)

PiperOrigin-RevId: 826574010
2025-10-31 12:01:33 -07:00
Aliia Khasanova
6ff7f9c87f Add de/serialization of fake_allocations in DynamicSliceThunk.
PiperOrigin-RevId: 826541399
2025-10-31 10:36:21 -07:00
Eusebio Durán Montaña
ecc2510eb0 Use Deserializer lambda for embedded thunks in DynamicSliceThunk
PiperOrigin-RevId: 826474606
2025-10-31 07:20:46 -07:00
Henning Becker
26d0882419 Add proto serialization for GpuExecutable
This adds `GpuExecutable::ToProto` and `GpuExecutable::FromProto`, which allow us to [de]serialize an instance of `GpuExecutable` and later reconstruct it.
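
The commit names the two entry points; a sketch of the round-trip pattern they follow, using stand-in types and signatures that are assumptions for illustration rather than the actual XLA declarations:

```cpp
#include <memory>
#include <string>
#include <utility>

#include "absl/status/statusor.h"

// Stand-in for the serialized form; the real proto has more fields.
struct GpuExecutableProtoSketch {
  std::string serialized_ir;
};

class GpuExecutableSketch {
 public:
  explicit GpuExecutableSketch(std::string ir) : ir_(std::move(ir)) {}

  // Serialize the executable into a proto...
  absl::StatusOr<GpuExecutableProtoSketch> ToProto() const {
    return GpuExecutableProtoSketch{ir_};
  }

  // ...and later reconstruct an equivalent executable from it.
  static absl::StatusOr<std::unique_ptr<GpuExecutableSketch>> FromProto(
      const GpuExecutableProtoSketch& proto) {
    return std::make_unique<GpuExecutableSketch>(proto.serialized_ir);
  }

 private:
  std::string ir_;
};
```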

PiperOrigin-RevId: 826470601
2025-10-31 07:07:33 -07:00
A. Unique TensorFlower
f73a954906 Add SymbolicExpr::IsBinaryOp() method
This CL introduces a new helper method SymbolicExpr::IsBinaryOp() to quickly determine whether a SymbolicExpr is a binary operation (i.e., not a constant or a variable). This pattern is used in several places in indexing_map.cc for AffineMap, and the helper will simplify the refactor.
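
An illustrative sketch of the helper as the commit defines it, on a stand-in type; the real SymbolicExpr has more expression kinds and a different representation:

```cpp
// Stand-in expression kinds; only the constant/variable distinction
// matters for the helper below.
enum class ExprKind { kConstant, kVariable, kAdd, kMul, kFloorDiv, kMod };

struct SymbolicExprSketch {
  ExprKind kind;

  // A binary operation is anything that is neither a constant nor a
  // variable, mirroring the definition in the commit message.
  bool IsBinaryOp() const {
    return kind != ExprKind::kConstant && kind != ExprKind::kVariable;
  }
};
```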

PiperOrigin-RevId: 826468454
2025-10-31 06:54:52 -07:00
Marcin Radomski
718fe5695e [XLA:GPU] Add flags for filtering debugged thunks
Checking all buffers is way too heavy and causes timeouts, so we need the ability to focus on interesting parts of the thunk graph.

`--xla_gpu_experimental_thunk_buffer_debug_filter_by_thunk_id_ranges` allows limiting thunk IDs to selected ranges or values. The IDs are assigned in the order the thunks are emitted, which should (TM) be stable and allow bisecting to find the culprit thunk(s). The IDs are given as a comma-separated list of integers and closed or half-open ranges (e.g. `:2,5,7:8,12:` to match <=2, 5, 7, 8 and >=12).

`--xla_gpu_experimental_thunk_buffer_debug_filter_by_profile_annotation_re` allows matching by the thunk's profile annotation. This is a comma-separated list of regexes that will be matched against `ThunkInfo::profile_annotation`; the thunk's profile annotation needs to match at least one of them.

Both flags are meant to work with all thunk buffer debug instrumentation (currently: checksums, NaNs). If both are defined, a thunk has to pass both the ID filter and the profile-annotation filter to get instrumented.

Implementation of the filtering logic is not included in this CL.
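
Since the implementation is deferred, here is a hedged sketch of how the `:2,5,7:8,12:` syntax could be parsed and matched; the flag's actual parser may differ:

```cpp
#include <cstdint>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// An inclusive ID range; a missing bound means "open on that side".
struct IdRange {
  int64_t lo = std::numeric_limits<int64_t>::min();
  int64_t hi = std::numeric_limits<int64_t>::max();
};

// Parses e.g. ":2,5,7:8,12:" into {<=2, 5, 7..8, >=12}.
std::vector<IdRange> ParseRanges(const std::string& spec) {
  std::vector<IdRange> ranges;
  std::stringstream ss(spec);
  std::string item;
  while (std::getline(ss, item, ',')) {
    IdRange r;
    auto colon = item.find(':');
    if (colon == std::string::npos) {
      r.lo = r.hi = std::stoll(item);  // single value, e.g. "5"
    } else {
      if (colon > 0) r.lo = std::stoll(item.substr(0, colon));  // "12:"
      if (colon + 1 < item.size()) {
        r.hi = std::stoll(item.substr(colon + 1));  // ":2"
      }
    }
    ranges.push_back(r);
  }
  return ranges;
}

// A thunk ID passes if it falls into any of the parsed ranges.
bool MatchesThunkId(int64_t id, const std::vector<IdRange>& ranges) {
  for (const IdRange& r : ranges) {
    if (id >= r.lo && id <= r.hi) return true;
  }
  return false;
}
```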

PiperOrigin-RevId: 826457166
2025-10-31 06:11:25 -07:00
Kanish Anand
e7dcad735e Add equality operator for NamedSharding
PiperOrigin-RevId: 826442714
2025-10-31 05:16:43 -07:00
Aliia Khasanova
add489fd8d Use std::vector<BufferAllocation> instead of std::vector<std::unique_ptr<BufferAllocation>> in DynamicSliceThunk.
`BufferAllocation::Slice` stores a raw pointer to the corresponding `BufferAllocation`. Currently we keep the embedded thunk allocations alive by storing unique_ptrs in the wrapping DynamicSliceThunk. This design makes it hard to reuse the existing infrastructure, specifically to serialize `DynamicSliceThunk`. To address this, I'm changing fake_allocations to be `std::vector<BufferAllocation>`.

The move constructor `std::vector::vector(std::vector&&)` is guaranteed to have constant time complexity, and therefore it steals the internal data buffer from the source vector. This implies that the pointers to the allocations stay stable as long as:
* we preallocate the vector size
* we never copy the vector, only move it

To make this safer for later use, we can explicitly prohibit BufferAllocation from being copyable/movable. I'm going to do this in a following CL.
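
A minimal, self-contained illustration of the pointer-stability argument above, using a stand-in type instead of the real BufferAllocation:

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct Allocation {  // stand-in for BufferAllocation
  int index;
};

int main() {
  std::vector<Allocation> allocations;
  allocations.reserve(2);  // preallocate: no reallocation later
  allocations.push_back({0});
  allocations.push_back({1});
  // Raw pointer into the vector, like BufferAllocation::Slice holds.
  const Allocation* slice_target = &allocations[1];

  // Moving the vector steals its internal buffer in O(1), so element
  // addresses, and hence the raw pointer, remain valid.
  std::vector<Allocation> moved = std::move(allocations);
  assert(slice_target == &moved[1]);
  return 0;
}
```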

PiperOrigin-RevId: 826440060
2025-10-31 05:05:43 -07:00
A. Unique TensorFlower
e32304ddc5 [Autotuner] Add support for sharded autotuning in the pass.
PiperOrigin-RevId: 826417614
2025-10-31 03:50:55 -07:00
Eusebio Durán Montaña
e32f20dd91 Use factory function to create CubSortThunk
The `CubSortThunk` constructor was calling a function that returns an `absl::StatusOr`, ignoring non-ok statuses and just accessing the value.

Presumably the status is always ok in prod, but this change makes the failure case explicit.
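
A sketch of the factory-function pattern the commit describes, using a stand-in type and a hypothetical scratch-size query; the real CubSortThunk interface differs:

```cpp
#include <cstdint>
#include <memory>

#include "absl/status/status.h"
#include "absl/status/statusor.h"

class SortThunkSketch {
 public:
  // The fallible setup now runs before construction, so a non-ok
  // status propagates to the caller instead of being ignored.
  static absl::StatusOr<std::unique_ptr<SortThunkSketch>> Create(
      int64_t num_items) {
    absl::StatusOr<int64_t> scratch = ComputeScratchBytes(num_items);
    if (!scratch.ok()) return scratch.status();
    return std::unique_ptr<SortThunkSketch>(new SortThunkSketch(*scratch));
  }

 private:
  explicit SortThunkSketch(int64_t scratch_bytes)
      : scratch_bytes_(scratch_bytes) {}

  // Stand-in for the StatusOr-returning query the constructor used to
  // call and silently unwrap.
  static absl::StatusOr<int64_t> ComputeScratchBytes(int64_t num_items) {
    if (num_items < 0) return absl::InvalidArgumentError("negative size");
    return num_items * 8;
  }

  int64_t scratch_bytes_;
};
```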

PiperOrigin-RevId: 826410861
2025-10-31 03:37:19 -07:00
Kanish Anand
adfd891fde Refactor Mesh ctors
PiperOrigin-RevId: 826410314
2025-10-31 03:26:09 -07:00
A. Unique TensorFlower
8734ec41d5 Disable capturing of dot RHS operands
This is proving to be unreliable.

PiperOrigin-RevId: 826395008
2025-10-31 02:45:34 -07:00
A. Unique TensorFlower
d6d4e02248 [XLA:GPU] Add multimem setup.
PiperOrigin-RevId: 826391581
2025-10-31 02:35:40 -07:00
A. Unique TensorFlower
993369077a Reverts bf23bf1b32
PiperOrigin-RevId: 826380939
2025-10-31 01:51:12 -07:00
A. Unique TensorFlower
d25ccb438d Reverts cef240807a
PiperOrigin-RevId: 826374657
2025-10-31 01:32:52 -07:00
Bill Varcho
cef240807a [ReplicaGroupV3][MeshAxesReplicaGroupList][1/2] Add initial class definition for V3 replica group.
PiperOrigin-RevId: 826334561
2025-10-30 23:18:40 -07:00
Felix Wang
d9c76aafeb Adjust the collective-permute cross host type to MULTI_HOST_NON_WORLD_LEVEL only.
PiperOrigin-RevId: 826327580
2025-10-30 22:54:49 -07:00
Eugene Zhulenev
d90723f48e [xla:pjrt:cpu] Add e2e test for YnnFusion + PJRT client
PiperOrigin-RevId: 826323865
2025-10-30 22:41:49 -07:00
Eugene Zhulenev
7ad55e8818 [xla:cpu] Add an end-to-end test for ynn fusions
PiperOrigin-RevId: 826318525
2025-10-30 22:20:44 -07:00
Eugene Zhulenev
bf23bf1b32 [xla:cpu] Pass HloModule pointer to Thunk SerDes
PiperOrigin-RevId: 826312546
2025-10-30 22:11:41 -07:00
Eugene Zhulenev
56d3b19280 [xla:cpu] NFC: Rename protos for Xnn/Ynn fusion options
PiperOrigin-RevId: 826304955
2025-10-30 22:01:47 -07:00
A. Unique TensorFlower
a95c558dc4 Save compile options with the compiled IFRT IR program to be used later for serialization
PiperOrigin-RevId: 826301016
2025-10-30 21:54:24 -07:00
A. Unique TensorFlower
b2334ac330 Integrate LLVM at llvm/llvm-project@22079e3f36
Updates LLVM usage to match
[22079e3f3698](https://github.com/llvm/llvm-project/commit/22079e3f3698)

PiperOrigin-RevId: 826294004
2025-10-30 20:44:41 -07:00
Eugene Zhulenev
db273660ba [xla:pjrt] Remove PjRtFuture type alias
Cleaning up BUILD files and includes will be done separately.

PiperOrigin-RevId: 826280389
2025-10-30 19:44:40 -07:00
Eugene Zhulenev
429a0cf1c7 [xla:cpu] Add target machine features to the error message
PiperOrigin-RevId: 826253599
2025-10-30 17:49:12 -07:00
Eugene Zhulenev
d9024af6d4 [xla:cpu] Do not register legacy runtime symbols with XLA:CPU custom calls
PiperOrigin-RevId: 826208548
2025-10-30 16:25:55 -07:00
Niklas Vangerow
31bb7c01ff Migrate multioutput_fusion_test to use PjRt.
PiperOrigin-RevId: 826203532
2025-10-30 15:18:22 -07:00
Parker Schuh
c3d0bf7023 Add an additional way to poison a connection (to allow testing different poisoning strategies).

PiperOrigin-RevId: 826193232
2025-10-30 14:52:58 -07:00
A. Unique TensorFlower
c40bb10b96 Add the option to dump before/after autotuned instructions in AutotunerConfig.
- This change is required to keep supporting the functionality of xla_gpu_dump_autotuned_gemm_fusions in the new infra.

PiperOrigin-RevId: 826161466
2025-10-30 14:39:24 -07:00
A. Unique TensorFlower
8f60516a86 Refactor: Move common SymbolicMapTest setup to the fixture.
This change moves the initialization of commonly used `SymbolicExpr` and a sample `SymbolicMap` into the `SymbolicMapTest` fixture to reduce code duplication across tests.
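
An illustrative gtest sketch of the fixture pattern described above, with stand-in members rather than the real SymbolicExpr/SymbolicMap types: shared objects move into the fixture so each TEST_F no longer rebuilds them.

```cpp
#include <string>

#include <gtest/gtest.h>

class SymbolicMapTestSketch : public ::testing::Test {
 protected:
  void SetUp() override {
    common_expr_ = "d0 + s0";      // stand-in for a shared SymbolicExpr
    sample_map_ = "(d0) -> (d0)";  // stand-in for the sample SymbolicMap
  }
  std::string common_expr_;
  std::string sample_map_;
};

// Tests use the shared members instead of duplicating the setup.
TEST_F(SymbolicMapTestSketch, SetupIsShared) {
  EXPECT_FALSE(common_expr_.empty());
  EXPECT_FALSE(sample_map_.empty());
}
```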

PiperOrigin-RevId: 826161168
2025-10-30 14:19:16 -07:00
A. Unique TensorFlower
7736af79a6 Only enable YNNPACK for bf16 and int8 for now.
We plan to enable this in stages, starting with int8 and bf16, where the improvement is more significant.

PiperOrigin-RevId: 826160602
2025-10-30 14:05:02 -07:00
Karlo Basioli
f4ebf9d47d [XLA][codegen] Migrate triton operations for which shared dialect lowerings are implemented.
These were missed in previous commits.
Addresses transpose and bitcast.

PiperOrigin-RevId: 826158776
2025-10-30 13:54:31 -07:00