* refactor(distributed): extract PickBestReplica from FindAndLockNodeWithModel
Lifts the replica-selection policy (in_flight ASC, last_used ASC,
available_vram DESC) out of the SQL ORDER BY into a pure Go function in
the new replicapicker.go. The SQL clause keeps its FOR UPDATE atomicity
and remains the production path used by SmartRouter; PickBestReplica is
the canonical implementation that the future per-frontend rotating
replica cache (TODO referenced from pkg/model) will call against an
in-memory snapshot without paying a DB round-trip per inference.
A new registry_test mirror spec seeds a multi-tier scenario and asserts
both layers pick the same replica, so any future tweak to either side
fails the test until the other side is updated.
No behavior change.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* fix(distributed): route per inference request and cache probeHealth
Two related fixes that together restore load balancing across loaded
replicas of the same model.
1. ModelLoader.Load and LoadModel bypass the local *Model cache when
modelRouter is set. The cached *Model wraps an InFlightTrackingClient
bound to a single (nodeID, replicaIndex) — reusing it pinned every
subsequent request to whichever node won the very first pick, so
FindAndLockNodeWithModel's round-robin never got a chance to run
even after the reconciler scaled the model out to a second node. In
distributed mode SmartRouter.Route now runs per request, and
PickBestReplica picks the least-loaded replica each time.
SmartRouter has its own coalescing (advisory DB lock for first-time
loads + singleflight on backend.install RPC) so concurrent first
requests for a not-yet-loaded model still produce a single worker
side install.
2. SmartRouter.probeHealth memoizes successful gRPC HealthCheck results
in a new probeCache (probe_cache.go) with a 30s TTL. With per-request
routing every inference call hits probeHealth, and llama.cpp-style
backends serialize HealthCheck behind active Predict — so a burst of
incoming requests stalled on the probe to a node already mid-stream,
tripping the 2s timeout and falling through to the install path.
singleflight collapses N concurrent first-time probes for the same
(node, addr) into one round-trip, failed probes invalidate the entry
so the staleness-recovery path still triggers, and the TTL matches
pkg/model/model.go's healthCheckTTL so the single-process and
distributed paths share a staleness budget. The background
HealthMonitor still reaps actually-dead backends within ~45s.
The bypass introduces one short FindAndLockNodeWithModel transaction per
inference. A TODO in pkg/model/loader.go documents the future per modelID
rotating-replica cache that would reuse PickBestReplica against an
in-memory snapshot and skip the DB round-trip for hot paths.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>