Imported from GitHub PR https://github.com/openxla/xla/pull/32836
📝 Summary of Changes
Updated the SINGLE_HOST communication type to SINGLE_PARTITION (fast-interconnect domain) to support multi-node NVLink (MNNVL) topologies. Piped the auto-detected partition size through for communication type determination, and exposed the partition size in SolGPUCostModel::Config for AOT compilation.
🎯 Justification
The S-curve model cannot handle NVLink latency, so a single fast-interconnect domain, including an MNNVL topology, should use the latency table model. This PR updates the routing mechanism so that MNNVL is treated as a single partition; previously, a host was assumed to be equivalent to a partition.
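As a rough illustration of the routing described above, the following Python sketch classifies a collective by whether all participating devices sit in one fast-interconnect partition, then picks a cost model. Only `SINGLE_PARTITION` and the two model names come from the description; `MULTI_PARTITION`, the function names, and the assumption that device IDs are laid out contiguously per partition are hypothetical and not taken from the PR.

```python
from enum import Enum


class CommunicationType(Enum):
    SINGLE_PARTITION = "single_partition"  # name from the PR description
    MULTI_PARTITION = "multi_partition"    # hypothetical label for all other cases


def classify_communication(device_ids, partition_size):
    """Classify a collective by the partitions its devices fall into.

    Assumes (hypothetically) that device d belongs to partition
    d // partition_size, i.e. partitions are contiguous ID ranges.
    """
    if partition_size <= 0:
        raise ValueError("partition_size must be positive")
    partitions = {d // partition_size for d in device_ids}
    if len(partitions) <= 1:
        return CommunicationType.SINGLE_PARTITION
    return CommunicationType.MULTI_PARTITION


def pick_cost_model(comm_type):
    """Route single-partition traffic to the latency table, the rest to the S-curve."""
    if comm_type is CommunicationType.SINGLE_PARTITION:
        return "latency_table"
    return "s_curve"
```

With 16 devices and a partition size of 8 (one host per partition), the collective spans two partitions and is routed to the S-curve model; with a partition size of 16 or more, as in an MNNVL domain spanning both hosts, the same collective is routed to the latency table model.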
🚀 Kind of Contribution
✨ New Feature
📊 Benchmark (for Performance Improvements)
N/A
🧪 Unit Tests:
Added unit tests for the model dispatching mechanism.
🧪 Execution Tests:
N/A. Behavior is unchanged for non-MNNVL topologies.
Copybara import of the project:
--
a9544375934873f7b888fdb5ff6c9dc6ee8b0e6c by Terry Sun <tesun@nvidia.com>:
use partition size for static model dispatching
--
e3445a5deb8da10146e90c50da5598f91cfe0a69 by Terry Sun <tesun@nvidia.com>:
expose partition size to config
--
212535ce891b8eb96ebb3c1e215a91d2b5035594 by Terry Sun <tesun@nvidia.com>:
better modularity
--
a9fe8a0f89dea9e2811d76a3570c7398df8dd756 by Terry Sun <tesun@nvidia.com>:
better code structure and doc string
--
a64a2b5ed1d45d815c6a2c47628b4d9ebb8368bd by Terry Sun <tesun@nvidia.com>:
update naming
Merging this change closes #32836
PiperOrigin-RevId: 826697791
Imported from GitHub PR https://github.com/openxla/xla/pull/31375
📝 Summary of Changes
This PR updates the CollectiveBackendAssigner pass to account for NVLink domain connectivity when deciding between NVSHMEM and DEFAULT backends. It does this by adding a slice_size parameter to the compilation pipeline and introducing an IsIntraNVLinkDomain check.
🎯 Justification
The CollectiveBackendAssigner now uses NVSHMEM not only for single-host scenarios, but also when all devices are within the same NVLink domain.
🚀 Kind of Contribution
⚡️ Performance Improvement, 🧪 Tests
📊 Benchmark (for Performance Improvements)
H100
| Benchmark | NVSHMEM enabled | NVSHMEM disabled |
|----------|----------|----------|
| llama31_8b_fp8_1x8 | 1095330 us | 1093816 us |
| llama31_8b_bf16_2x8 | 1368948 us | 1370896 us |
| llama31_8b_fp8_2x8 | 1096447 us | 1092437 us |
| llama31_70b_fp8_16x8 | 9723821 us | 9707544 us |
🧪 Unit Tests:
Added unit tests to xla/service/gpu/transforms/collectives/collective_backend_assigner_test.cc
🧪 Execution Tests:
Tested with llama3-8b on 2 GB200 nodes (fsdp = 8). The average step time with NVSHMEM was 3.69 s (vs. 3.76 s with the default backend).
Copybara import of the project:
--
a02b77cec9622314af01ae481d0fb28b149f1b45 by Sevin Varoglu <svaroglu@nvidia.com>:
Add NVLink domain check to CollectiveBackendAssigner
Merging this change closes #31375
PiperOrigin-RevId: 826649437
This updates the original value of a while loop when its input/output shape changes because the pass sinks qualifying reduce instructions into its body.
PiperOrigin-RevId: 826618908
The topology at the PjRt layer can be viewed as:
(process, chip, logical device) or (process, chip, core)
For CPU, it is (1, num devices, 1)
For GPU, it is (num hosts, GPUs per host, 1)
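The triples above can be sketched as a small Python structure. The `Topology` type and the helper names are hypothetical and only mirror the description; they are not part of any PjRt API.

```python
from typing import NamedTuple


class Topology(NamedTuple):
    """Hypothetical (process, chip, core) triple mirroring the description."""
    processes: int
    chips_per_process: int
    cores_per_chip: int

    def num_devices(self) -> int:
        # Total logical devices is the product of the three dimensions.
        return self.processes * self.chips_per_process * self.cores_per_chip


def cpu_topology(num_devices: int) -> Topology:
    # CPU: a single process, one "chip" per device, one core per chip.
    return Topology(1, num_devices, 1)


def gpu_topology(num_hosts: int, gpus_per_host: int) -> Topology:
    # GPU: one process per host, one chip per GPU, one core per chip.
    return Topology(num_hosts, gpus_per_host, 1)
```

For example, `gpu_topology(2, 8)` describes two hosts with eight GPUs each, for 16 devices in total.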
PiperOrigin-RevId: 826581627
This adds `GpuExecutable::ToProto` and `GpuExecutable::FromProto`, which allow us to [de]serialize an instance of `GpuExecutable` and later reconstruct it.
PiperOrigin-RevId: 826470601
This CL introduces a new helper method SymbolicExpr::IsBinaryOp() to quickly determine if a SymbolicExpr is a binary operation (i.e., not a constant or a variable). This is used in indexing_map.cc in several places for AffineMap and it will simplify the refactor.
PiperOrigin-RevId: 826468454
Checking all buffers is way too heavy and causes timeouts, so we need the ability to focus on interesting parts of the thunk graph.
`--xla_gpu_experimental_thunk_buffer_debug_filter_by_thunk_id_ranges` allows limiting thunk IDs to selected ranges or values. The IDs are assigned in the order of emitting thunks, which should (TM) be stable and allow bisecting to find culprit thunk(s). The IDs are given as a comma-separated list of integers and closed or half-open ranges (e.g. `:2,5,7:8,12:` to match <=2, 5, 7, 8 and >=12).
`--xla_gpu_experimental_thunk_buffer_debug_filter_by_profile_annotation_re` allows matching by a thunk's profile annotation. This is a comma-separated list of regexes that will be matched against `ThunkInfo::profile_annotation`. The thunk's profile annotation needs to match at least one of the regexes.
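To make the range syntax concrete, here is a minimal Python sketch of how such a spec could be parsed and matched. It is not the flag's actual parser; the function names are hypothetical, and the treatment of whitespace, empty items, and a bare `:` is an assumption.

```python
def parse_id_ranges(spec):
    """Parse e.g. ':2,5,7:8,12:' into inclusive (lo, hi) pairs.

    A bound of None means "unbounded" on that side. Skipping empty items
    and treating a bare ':' as "match everything" are assumptions.
    """
    ranges = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if ":" in item:
            lo_text, hi_text = item.split(":", 1)
            lo = int(lo_text) if lo_text else None
            hi = int(hi_text) if hi_text else None
        else:
            # A single integer is a range containing only itself.
            lo = hi = int(item)
        ranges.append((lo, hi))
    return ranges


def id_matches(thunk_id, ranges):
    """Return True if thunk_id falls in at least one inclusive range."""
    return any(
        (lo is None or thunk_id >= lo) and (hi is None or thunk_id <= hi)
        for lo, hi in ranges
    )
```

With the example spec from above, `id_matches(7, parse_id_ranges(":2,5,7:8,12:"))` is true, while IDs 3, 4, 6 and 9 through 11 are rejected.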
They are meant to work with all thunk debug buffer instrumentation (currently: checksums, NaNs). If both flags are defined, the thunk will have to pass both the ID and profile annotation filters to get instrumented.
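The way the two filters combine can be sketched as follows. This is an illustration only, not the CL's filtering logic: the function names are hypothetical, an unset filter is assumed to impose no constraint, regexes are assumed to match anywhere in the annotation (partial match), and Python's `re` syntax may differ from the regex library XLA uses.

```python
import re


def annotation_matches(profile_annotation, regexes_csv):
    """True if the annotation matches at least one comma-separated regex.

    Partial matching via re.search is an assumption.
    """
    patterns = [p for p in regexes_csv.split(",") if p]
    return any(re.search(p, profile_annotation) for p in patterns)


def should_instrument(thunk_id, profile_annotation,
                      id_filter=None, annotation_regexes=None):
    """Combine both filters: every filter that is set must pass.

    id_filter is a callable taking a thunk ID (hypothetical stand-in for
    the ID-range flag); annotation_regexes is the comma-separated regex
    list. A filter left as None is treated as "no constraint".
    """
    if id_filter is not None and not id_filter(thunk_id):
        return False
    if annotation_regexes is not None and not annotation_matches(
            profile_annotation, annotation_regexes):
        return False
    return True
```

For instance, `should_instrument(5, "fusion.1", id_filter=lambda i: i <= 5, annotation_regexes="gemm,fusion")` passes both filters, whereas changing the thunk ID to 6 or the annotation to `"copy.3"` rejects the thunk.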
Implementation of the filtering logic is not included in this CL.
PiperOrigin-RevId: 826457166
`BufferAllocation::Slice` stores a raw pointer to the corresponding `BufferAllocation`. Currently we keep the embedded thunk allocations alive by storing unique_ptrs in the wrapping DynamicSliceThunk. The current design makes it hard to reuse the existing infrastructure, specifically to serialize `DynamicSliceThunk`. To address this, I'm changing fake_allocations to be `std::vector<BufferAllocation>`.
The move constructor `std::vector::vector(std::vector&&)` is guaranteed to have constant time complexity and therefore it steals the internal data buffer from the source vector. This implies that the pointers to allocations are kept stable as long as:
* we preallocate the vector size
* we never copy the vector, but move
To make later usage safer, we can explicitly prohibit BufferAllocation from being copyable/movable. I'm going to do this in a follow-up CL.
PiperOrigin-RevId: 826440060
The `CubSortThunk` constructor was calling a function that returns an `absl::StatusOr`, ignoring non-OK statuses, and accessing the value directly.
Presumably the status is always OK in prod, but this change makes the failure case explicit.
PiperOrigin-RevId: 826410861
This change moves the initialization of commonly used `SymbolicExpr` and a sample `SymbolicMap` into the `SymbolicMapTest` fixture to reduce code duplication across tests.
PiperOrigin-RevId: 826161168