mirror of
https://github.com/mudler/LocalAI.git
synced 2026-05-30 03:25:42 -04:00
* refactor(distributed): extract PickBestReplica from FindAndLockNodeWithModel

  Lifts the replica-selection policy (in_flight ASC, last_used ASC, available_vram DESC) out of the SQL ORDER BY into a pure Go function in the new replicapicker.go. The SQL clause keeps its FOR UPDATE atomicity and remains the production path used by SmartRouter; PickBestReplica is the canonical implementation that the future per-frontend rotating replica cache (TODO referenced from pkg/model) will call against an in-memory snapshot without paying a DB round-trip per inference.

  A new registry_test mirror spec seeds a multi-tier scenario and asserts both layers pick the same replica, so any future tweak to either side fails the test until the other side is updated. No behavior change.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
  Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* fix(distributed): route per inference request and cache probeHealth

  Two related fixes that together restore load balancing across loaded replicas of the same model.

  1. ModelLoader.Load and LoadModel bypass the local *Model cache when modelRouter is set. The cached *Model wraps an InFlightTrackingClient bound to a single (nodeID, replicaIndex); reusing it pinned every subsequent request to whichever node won the very first pick, so FindAndLockNodeWithModel's round-robin never got a chance to run even after the reconciler scaled the model out to a second node. In distributed mode SmartRouter.Route now runs per request, and PickBestReplica picks the least-loaded replica each time. SmartRouter has its own coalescing (advisory DB lock for first-time loads + singleflight on backend.install RPC) so concurrent first requests for a not-yet-loaded model still produce a single worker-side install.

  2. SmartRouter.probeHealth memoizes successful gRPC HealthCheck results in a new probeCache (probe_cache.go) with a 30s TTL. With per-request routing every inference call hits probeHealth, and llama.cpp-style backends serialize HealthCheck behind active Predict, so a burst of incoming requests stalled on the probe to a node already mid-stream, tripping the 2s timeout and falling through to the install path. singleflight collapses N concurrent first-time probes for the same (node, addr) into one round-trip, failed probes invalidate the entry so the staleness-recovery path still triggers, and the TTL matches pkg/model/model.go's healthCheckTTL so the single-process and distributed paths share a staleness budget. The background HealthMonitor still reaps actually-dead backends within ~45s.

  The bypass introduces one short FindAndLockNodeWithModel transaction per inference. A TODO in pkg/model/loader.go documents the future per-modelID rotating-replica cache that would reuse PickBestReplica against an in-memory snapshot and skip the DB round-trip for hot paths.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
  Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
70 lines
2.5 KiB
Go
package nodes

import "time"

// ReplicaCandidate is the minimum view of a loaded model replica needed to
// apply the routing policy. It is intentionally decoupled from the gorm models
// (BackendNode, NodeModel) so the same picker can run against fresh DB rows
// (SmartRouter.Route → FindAndLockNodeWithModel) and against an in-memory
// snapshot (the per-frontend rotating cache flagged in pkg/model; see TODO
// below).
type ReplicaCandidate struct {
	NodeID        string
	Address       string
	ReplicaIndex  int
	InFlight      int
	LastUsed      time.Time
	AvailableVRAM uint64
}

// PickBestReplica is the single source of truth for which loaded replica of a
// model serves the next request.
//
// Policy (ordered tiers, first non-tie wins):
//  1. Least in-flight wins: primary load-balancing signal.
//  2. Oldest last_used wins: round-robin between equally-loaded replicas.
//     Every successful pick refreshes last_used (in FindAndLockNodeWithModel's
//     transaction and in TouchNodeModel on cache hits), so the "oldest" tier
//     naturally rotates through the candidate set without a separate cursor.
//  3. Largest available_vram wins: cold-start tiebreaker for replicas that
//     have never been picked (identical last_used).
//
// Two callers must agree on this policy:
//
//   - SmartRouter.Route, via the SQL ORDER BY in FindAndLockNodeWithModel
//     (registry.go). That query MUST mirror this function; TestPickerSQLMirror
//     asserts both sides agree on a representative dataset.
//
//   - The per-frontend rotating-replica cache (NOT YET IMPLEMENTED; see
//     pkg/model/loader.go and pkg/model/initializers.go for the integration
//     point). When that cache lands, it will call PickBestReplica against an
//     in-memory snapshot using locally-tracked in-flight counters and skip the
//     per-request DB round-trip.
//
// Returns nil when the candidate list is empty. Does not allocate.
func PickBestReplica(candidates []ReplicaCandidate) *ReplicaCandidate {
	if len(candidates) == 0 {
		return nil
	}
	best := &candidates[0]
	for i := 1; i < len(candidates); i++ {
		c := &candidates[i]
		if betterReplica(c, best) {
			best = c
		}
	}
	return best
}

// betterReplica reports whether candidate a is preferred over candidate b
// under the policy documented on PickBestReplica.
func betterReplica(a, b *ReplicaCandidate) bool {
	if a.InFlight != b.InFlight {
		return a.InFlight < b.InFlight
	}
	if !a.LastUsed.Equal(b.LastUsed) {
		return a.LastUsed.Before(b.LastUsed)
	}
	return a.AvailableVRAM > b.AvailableVRAM
}