LocalAI/core/services/nodes/replicapicker.go
LocalAI [bot] 8bbe89a537 fix(distributed): route per request across loaded replicas + cache probeHealth (#9968)
* refactor(distributed): extract PickBestReplica from FindAndLockNodeWithModel

Lifts the replica-selection policy (in_flight ASC, last_used ASC,
available_vram DESC) out of the SQL ORDER BY into a pure Go function in
the new replicapicker.go. The SQL clause keeps its FOR UPDATE atomicity
and remains the production path used by SmartRouter; PickBestReplica is
the canonical implementation that the future per-frontend rotating
replica cache (TODO referenced from pkg/model) will call against an
in-memory snapshot without paying a DB round-trip per inference.

A new registry_test mirror spec seeds a multi-tier scenario and asserts
both layers pick the same replica, so any future tweak to either side
fails the test until the other side is updated.
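
The mirror idea can be illustrated without a database. The following is a minimal sketch, not the actual registry_test spec: the SQL side is emulated by a full sort with "LIMIT 1" semantics, and the Go side is a linear scan in the style of PickBestReplica. The `Replica` type, the helper names, and the seeded values are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Replica is a hypothetical stand-in for ReplicaCandidate.
type Replica struct {
	NodeID        string
	InFlight      int
	LastUsed      time.Time
	AvailableVRAM uint64
}

// pickByScan is the linear-scan side: keep the best candidate seen so far.
func pickByScan(rs []Replica) *Replica {
	if len(rs) == 0 {
		return nil
	}
	best := &rs[0]
	for i := 1; i < len(rs); i++ {
		c := &rs[i]
		switch {
		case c.InFlight != best.InFlight:
			if c.InFlight < best.InFlight {
				best = c
			}
		case !c.LastUsed.Equal(best.LastUsed):
			if c.LastUsed.Before(best.LastUsed) {
				best = c
			}
		case c.AvailableVRAM > best.AvailableVRAM:
			best = c
		}
	}
	return best
}

// pickBySort emulates "ORDER BY in_flight ASC, last_used ASC,
// available_vram DESC LIMIT 1" on a copy of the input.
func pickBySort(rs []Replica) *Replica {
	if len(rs) == 0 {
		return nil
	}
	sorted := append([]Replica(nil), rs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if a.InFlight != b.InFlight {
			return a.InFlight < b.InFlight // in_flight ASC
		}
		if an, bn := a.LastUsed.UnixNano(), b.LastUsed.UnixNano(); an != bn {
			return an < bn // last_used ASC
		}
		return a.AvailableVRAM > b.AvailableVRAM // available_vram DESC
	})
	return &sorted[0]
}

// seed returns a multi-tier scenario: one busy replica, one recently used
// idle replica, and two idle replicas that differ only in free VRAM.
func seed() []Replica {
	t0 := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	return []Replica{
		{NodeID: "busy", InFlight: 3, LastUsed: t0, AvailableVRAM: 48},
		{NodeID: "recent", InFlight: 0, LastUsed: t0.Add(time.Minute), AvailableVRAM: 48},
		{NodeID: "small", InFlight: 0, LastUsed: t0, AvailableVRAM: 8},
		{NodeID: "large", InFlight: 0, LastUsed: t0, AvailableVRAM: 24},
	}
}

func main() {
	rs := seed()
	fmt.Println(pickByScan(rs).NodeID, pickBySort(rs).NodeID) // large large
}
```

The seeded data exercises all three tiers, so a change to either comparator that alters the tier order would make the two picks diverge.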

No behavior change.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* fix(distributed): route per inference request and cache probeHealth

Two related fixes that together restore load balancing across loaded
replicas of the same model.

1. ModelLoader.Load and LoadModel bypass the local *Model cache when
   modelRouter is set. The cached *Model wraps an InFlightTrackingClient
   bound to a single (nodeID, replicaIndex) — reusing it pinned every
   subsequent request to whichever node won the very first pick, so
   FindAndLockNodeWithModel's round-robin never got a chance to run
   even after the reconciler scaled the model out to a second node. In
   distributed mode SmartRouter.Route now runs per request, and
   PickBestReplica picks the least-loaded replica each time.

   SmartRouter has its own coalescing (advisory DB lock for first-time
   loads + singleflight on backend.install RPC) so concurrent first
   requests for a not-yet-loaded model still produce a single
   worker-side install.

2. SmartRouter.probeHealth memoizes successful gRPC HealthCheck results
   in a new probeCache (probe_cache.go) with a 30s TTL. With per-request
   routing every inference call hits probeHealth, and llama.cpp-style
   backends serialize HealthCheck behind active Predict — so a burst of
   incoming requests stalled on the probe to a node already mid-stream,
   tripping the 2s timeout and falling through to the install path.
   singleflight collapses N concurrent first-time probes for the same
   (node, addr) into one round-trip, failed probes invalidate the entry
   so the staleness-recovery path still triggers, and the TTL matches
   pkg/model/model.go's healthCheckTTL so the single-process and
   distributed paths share a staleness budget. The background
   HealthMonitor still reaps actually-dead backends within ~45s.

The bypass introduces one short FindAndLockNodeWithModel transaction per
inference. A TODO in pkg/model/loader.go documents the future per-modelID
rotating-replica cache that would reuse PickBestReplica against an
in-memory snapshot and skip the DB round-trip for hot paths.
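
The bypass itself can be pictured with a small sketch. Everything here is hypothetical (the `Loader`, `Client`, and `Router` types are illustrative and are not the ModelLoader API): when a router is configured, every Load call asks it for a client; otherwise the first loaded client is cached and reused.

```go
package main

import "fmt"

// Client is a hypothetical handle bound to one node.
type Client struct{ NodeID string }

// Router is a hypothetical per-request routing hook.
type Router func(modelID string) (*Client, error)

// Loader is an illustrative loader with a local cache and an optional router.
type Loader struct {
	cache     map[string]*Client
	router    Router                       // nil in single-process mode
	loadLocal func(modelID string) *Client // used only when router is nil
}

// Load bypasses the cache whenever a router is set, so the routing decision
// is made again for every request instead of being pinned to the first pick.
func (l *Loader) Load(modelID string) (*Client, error) {
	if l.router != nil {
		return l.router(modelID)
	}
	if c, ok := l.cache[modelID]; ok {
		return c, nil
	}
	c := l.loadLocal(modelID)
	l.cache[modelID] = c
	return c, nil
}

// alternating returns a toy router that cycles through the given nodes.
func alternating(nodes ...string) Router {
	i := 0
	return func(string) (*Client, error) {
		c := &Client{NodeID: nodes[i%len(nodes)]}
		i++
		return c, nil
	}
}

func main() {
	l := &Loader{cache: map[string]*Client{}, router: alternating("node-a", "node-b")}
	for n := 0; n < 3; n++ {
		c, _ := l.Load("llama")
		fmt.Println(c.NodeID) // node-a, node-b, node-a
	}
}
```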

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-24 08:15:27 +00:00

package nodes

import "time"

// ReplicaCandidate is the minimum view of a loaded model replica needed to
// apply the routing policy. It is intentionally decoupled from the gorm models
// (BackendNode, NodeModel) so the same picker can run against fresh DB rows
// (SmartRouter.Route → FindAndLockNodeWithModel) and against an in-memory
// snapshot (the per-frontend rotating cache flagged in pkg/model; see TODO
// below).
type ReplicaCandidate struct {
	NodeID        string
	Address       string
	ReplicaIndex  int
	InFlight      int
	LastUsed      time.Time
	AvailableVRAM uint64
}

// PickBestReplica is the single source of truth for which loaded replica of a
// model serves the next request.
//
// Policy (ordered tiers, first non-tie wins):
//  1. Least in-flight wins: primary load-balancing signal.
//  2. Oldest last_used wins: round-robin between equally-loaded replicas.
//     Every successful pick refreshes last_used (in FindAndLockNodeWithModel's
//     transaction and in TouchNodeModel on cache hits), so the "oldest" tier
//     naturally rotates through the candidate set without a separate cursor.
//  3. Largest available_vram wins: cold-start tiebreaker for replicas that
//     have never been picked (identical last_used).
//
// Two callers must agree on this policy:
//
//   - SmartRouter.Route, via the SQL ORDER BY in FindAndLockNodeWithModel
//     (registry.go). That query MUST mirror this function; TestPickerSQLMirror
//     asserts both sides agree on a representative dataset.
//
//   - The per-frontend rotating-replica cache (NOT YET IMPLEMENTED; see
//     pkg/model/loader.go and pkg/model/initializers.go for the integration
//     point). When that cache lands, it will call PickBestReplica against an
//     in-memory snapshot using locally-tracked in-flight counters and skip the
//     per-request DB round-trip.
//
// Returns nil when the candidate list is empty. Does not allocate.
func PickBestReplica(candidates []ReplicaCandidate) *ReplicaCandidate {
	if len(candidates) == 0 {
		return nil
	}
	best := &candidates[0]
	for i := 1; i < len(candidates); i++ {
		c := &candidates[i]
		if betterReplica(c, best) {
			best = c
		}
	}
	return best
}

// betterReplica reports whether candidate a is preferred over candidate b
// under the policy documented on PickBestReplica.
func betterReplica(a, b *ReplicaCandidate) bool {
	if a.InFlight != b.InFlight {
		return a.InFlight < b.InFlight
	}
	if !a.LastUsed.Equal(b.LastUsed) {
		return a.LastUsed.Before(b.LastUsed)
	}
	return a.AvailableVRAM > b.AvailableVRAM
}