pass.
This is because the pass is needed to linearize point-to-point Send and Recv
chains for an HLO scheduler.
Modify the GPU HLO scheduler to call the P2PSchedulePreparation pass regardless
of whether the latency hiding scheduler is on.
PiperOrigin-RevId: 565374605
The output is mutable for no good reason, which causes issues when we want
to express "this fusion instruction's roots or the instruction if it's not
a fusion".
PiperOrigin-RevId: 565341699
Changed SlicedAllocationFinder to
- accept a method to determine if allocations are permitted to begin at a given offset
- expose a method to test if a sliced allocation can fit at a specific offset
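The two additions can be pictured with a small sketch. This is not the SlicedAllocationFinder code: it ignores slicing entirely, and `fits_at`, `find_first_fit`, `free_ranges` and the alignment predicate are hypothetical names chosen for illustration.

```python
# Illustrative sketch only; names and the free-range model are assumptions,
# and slicing is not modeled.

def fits_at(offset, size, free_ranges, is_offset_allowed):
    """Test whether an allocation of `size` can be placed at `offset`."""
    if not is_offset_allowed(offset):
        return False
    # The allocation must lie entirely inside one free half-open range.
    return any(start <= offset and offset + size <= end
               for start, end in free_ranges)

def find_first_fit(size, free_ranges, is_offset_allowed):
    """Return the lowest permitted offset that fits, or None."""
    for start, end in sorted(free_ranges):
        for offset in range(start, end - size + 1):
            if fits_at(offset, size, free_ranges, is_offset_allowed):
                return offset
    return None

free_ranges = [(0, 6), (10, 40)]          # half-open [start, end)
aligned_to_8 = lambda offset: offset % 8 == 0

first = find_first_fit(8, free_ranges, aligned_to_8)
can_fit_at_0 = fits_at(0, 8, free_ranges, aligned_to_8)
```

Here the caller-supplied predicate rejects unaligned offsets, and `fits_at` can also be called on its own to probe one specific offset.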
PiperOrigin-RevId: 565231151
Fix a bug in which we over-allocate space for slices when they are colocated with larger buffers.
The interaction causing this behavior is as follows:
A) GlobalDecreasingSizeBestFitHeap::FindChunkCandidates() adds additional space to the last chunk in a sliced allocation, to account for max_colocation_size.
B) When AlternateMemoryBestFitHeap::CheckPrefetchFit() computes slices_for_pending_chunks, it recomputes the size of the sliced allocation as the sum of the sizes of the chunks returned from A. Note, we do not recompute the size for the allocation in a non-sliced world.
C) Before committing a chunk, GlobalDecreasingSizeBestFitHeap::CommitChunk() changes the chunk's size to fit the size from B. Thus, in the sliced case we keep the extra max_colocation_size space, since we recalculated the allocation size with it. In the non-sliced case, we adjust the chunk size back to what is needed for the request.
So, this change is a no-op for non-slices.
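A toy model of steps A to C, for illustration only. It is not the heap code: the function names and numbers are made up, the padding rule in step A is an assumption, and the "after" function assumes the fix makes the sliced case commit only what the request needs, as the non-sliced case already does.

```python
# Toy model of the interaction above; all names and sizes are assumptions.

def find_chunk_candidates(slice_sizes, max_colocation_size):
    """Step A: pad the last chunk so the allocation can hold the largest colocation."""
    chunks = list(slice_sizes)
    chunks[-1] += max(0, max_colocation_size - sum(slice_sizes))
    return chunks

def committed_size_before_fix(slice_sizes, max_colocation_size):
    """Steps B and C: size is recomputed from the padded chunks, so padding is kept."""
    return sum(find_chunk_candidates(slice_sizes, max_colocation_size))

def committed_size_after_fix(slice_sizes, max_colocation_size):
    """Assumed behavior after the fix: commit only what the request needs."""
    return sum(slice_sizes)

slice_sizes = [16, 16]        # a 32-byte request split into two slices
max_colocation_size = 64      # a larger buffer is colocated with it

before = committed_size_before_fix(slice_sizes, max_colocation_size)
after = committed_size_after_fix(slice_sizes, max_colocation_size)
```

In this model the sliced allocation commits 64 bytes before the fix and 32 bytes after it.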
PiperOrigin-RevId: 565217603
There's a short period during ParameterServerStrategy initialization / cluster connection in which worker preemptions will lead to UnavailableErrors from CreateContext calls. This adds configurable retries to SetServerDef so that a single connection failure does not stop the whole job. Retries will be enabled as the default behavior for PSS in a follow-up change.
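The retry behavior can be sketched roughly as follows. This is not the SetServerDef implementation: `call_with_retries`, its `retries` and `backoff_secs` parameters, and the stand-in `UnavailableError` class are all hypothetical.

```python
import time

class UnavailableError(Exception):
    """Stand-in for the error raised when a worker is unavailable."""

def call_with_retries(fn, retries, backoff_secs=0.0):
    """Call fn(), retrying up to `retries` times on UnavailableError."""
    attempt = 0
    while True:
        try:
            return fn()
        except UnavailableError:
            if attempt >= retries:
                raise  # retry budget exhausted, surface the failure
            attempt += 1
            time.sleep(backoff_secs)

attempts = []

def flaky_set_server_def():
    """Hypothetical connection call that fails twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise UnavailableError("worker preempted")
    return "connected"

result = call_with_retries(flaky_set_server_def, retries=5)
```

With `retries=0` the first failure propagates, which corresponds to the behavior without retries.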
PiperOrigin-RevId: 565214961
The loop that runs Autotune will fetch current values for available CPU and RAM on each iteration. This helps in situations where the hardware resources available to tf.data may be vertically scaled up or down based on usage during the process' lifetime.
PiperOrigin-RevId: 565197940
The owned PjRtBuffers in `owned_executable_args` need to live until execution is complete. Currently this is achieved by blocking until all the executable outputs are ready. However, this seemed to cause performance overheads, see b/299683272 and b/300102691.
With this change, we don't block until execution is complete. The ownership of `owned_executable_args` is moved to a lambda which is executed as a callback when the PjRtFuture returned by ExecutePortable is ready (which happens when the execution is complete).
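The ownership hand-off can be illustrated with Python's `concurrent.futures.Future` standing in for PjRtFuture. This is an analogy only: `OwnedBuffer` and `execute` are hypothetical, and the real change is in C++.

```python
from concurrent.futures import Future

released_at_completion = []

class OwnedBuffer:
    """Stand-in for an owned argument buffer."""
    def __init__(self, name):
        self.name = name

def execute(owned_args):
    """Start a (pretend) execution and return its future without blocking."""
    future = Future()

    # The callback holds the only remaining reference to the buffers,
    # so they stay alive until the future becomes ready.
    def on_ready(_future, keep_alive=owned_args):
        released_at_completion.append([b.name for b in keep_alive])

    future.add_done_callback(on_ready)
    return future

args = [OwnedBuffer("arg0"), OwnedBuffer("arg1")]
future = execute(args)
del args                                  # the caller no longer owns the buffers
pending = list(released_at_completion)    # nothing has run yet
future.set_result("outputs")              # execution completes, callback runs
```

The caller returns immediately after `execute`, and the buffers are only dropped once the callback has run.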
PiperOrigin-RevId: 565169152
-Add WithReplicaGroups implementation for HloInstructionPattern to match with the collective instruction's replica groups.
PiperOrigin-RevId: 565160403
Small collectives might be better off when sunk, and there are other potential use cases.
Also fix a bug where we were accepting reuse of the data that we were storing, and change the tests using that pattern to match the fix.
PiperOrigin-RevId: 565080772
instructions through control dependence.
This is because the generated HLO program is correct even without the control
dependence chaining. The purpose of the control dependence chaining is to
support a scheduler, such as the latency hiding scheduler, and thus will be
added to the latency hiding scheduler preparation pass. Not producing the
control dependence chaining while decomposing collective-permute can also
simplify the implementation of collective-pipeliner in pipelining Send and
Recv instructions.
PiperOrigin-RevId: 565073772