mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-07 00:06:51 -04:00
070e001dfd0fa0f6cd6c88cfae0a4e480d59799f
10 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
3a932a9803 |
feat(distributed): Add NATS JWT authentication and TLS/mTLS options (#10159)
* feat(distributed): NATS JWT auth, TLS/mTLS options, and e2e coverage Mint per-node NATS user JWTs at registration when LOCALAI_NATS_ACCOUNT_SEED is set, and connect workers with scoped credentials from the register response. Add optional LOCALAI_NATS_TLS_CA/CERT/KEY for private CA and mTLS alongside tls:// URLs, plus test-e2e-distributed and NatsJWT container e2e specs. Document JWT setup (nats-auth-setup.sh) and TLS env vars in distributed-mode. Assisted-by: Grok:grok grok-build Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(distributed): correct NATS JWT scoping and harden client auth The JWT-auth path added in 46467cc7 had several gaps that fail silently under LOCALAI_NATS_REQUIRE_AUTH: - Agent-worker minted JWTs did not allow the subjects the agent worker actually subscribes to (jobs.mcp-ci.new and nodes.<id>.backend.stop), so MCP-CI jobs and backend-stop session cleanup were silently dropped. Scope the agent permission set to those subjects. - NATS subscription permission violations were swallowed (Subscribe returned a live-but-dead subscription). Confirm subscriptions with a server round-trip so a denial surfaces synchronously, and log async permission errors. - The backend worker connected anonymously when given a JWT without its paired seed; reject the unpaired credential instead. - The documented service-user permissions in nats-auth-setup.sh omitted prefixcache.>, which the frontend publishes and subscribes; add it. Also: add a credential-provider hook to the messaging client (consumed by the follow-up credential-lifecycle change), drop the always-nil error from NatsMessagingOptions, run go mod tidy (jwt/v2 and nkeys are now direct), and gofmt the feature's files. Tests: an agent-JWT e2e spec that connects to the enforcing NATS server and exercises every subscription the agent worker makes, plus permission allow-list coverage unit tests. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(distributed): acquire and auto-refresh worker NATS credentials Workers fetched NATS credentials once at startup, which broke two cases under JWT auth: a worker that registered while still pending admin approval never received a minted JWT (it connected unauthenticated and gave up), and a long-running worker's 24h JWT expired with no way to renew it. Introduce workerregistry.NATSCredentialManager, built on idempotent re-registration (the frontend preserves the node row and mints a fresh JWT each call): - Acquire re-registers through admin approval until the node is approved and credentials are minted (or returns the first success when auth is not required, preserving anonymous-NATS behavior). - RefreshLoop re-registers before the JWT expires (~75% of its lifetime), updating the credentials served to the connection. - Both are bounded (default 100 attempts / consecutive failures) and return an error on exhaustion, so an unapprovable or unrenewable worker exits non-zero and surfaces the problem instead of hanging or drifting toward an expired credential. The messaging client gains WithUserJWTProvider, fetching credentials on each (re)connect so the connection transparently adopts a refreshed JWT when the server expires the old one. RegisterFull exposes the approval status and full response; Register delegates to it. Both the backend worker and the agent worker are wired to this: explicit env credentials are used as-is, minted credentials are acquired-with-wait and refreshed, and a permanent refresh failure shuts the worker down so it restarts and re-acquires. Tests cover Acquire (wait-through-pending, bounded give-up, context cancel), RefreshLoop (refresh-before-expiry, bounded failure, no-expiry exit) and jwtExpiry decoding. Docs updated in distributed-mode.md. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com> |
||
|
|
a44bdb29d4 |
feat: prefix-cache-aware routing for distributed mode (#10071)
* feat(radixtree): generic prefix tree skeleton with longest-match Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): Insert with path recency refresh and entry cap Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): TTL idle-expiry and Evict sweep with branch pruning Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): recency-weighted per-value Weight Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): Remove all entries for a value Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(radixtree): race-free concurrency smoke test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(radixtree): reclaim empty branches, RWMutex reads, TTL boundary, empty-key guard Address review findings on the generic prefix tree: - Extract a shared pruneWalk helper parameterized by a shouldClear predicate and use it from Evict, Remove, and the MaxEntries path. Previously evictOldestLocked cleared a victim's value but never removed the now value-less node or its childless ancestors, so internal nodes accumulated under sustained churn at the cap. The MaxEntries path now prunes the victim and its empty ancestors. - DRY: pruneWalk replaces the duplicated logic in the former pruneLocked and Remove's inner closure. - Switch Tree.mu to sync.RWMutex; LongestMatch, Weight and Len take the read lock (RLock) while Insert, Evict and Remove keep the write lock. Confirmed race-clean under go test -race. - Document the strict greater-than TTL boundary on Options.TTL and expired: age exactly equal to TTL is still live. - Guard Insert against an empty key (no-op): the root never holds a value. Adds Ginkgo specs covering MaxEntries eviction, ancestor reclamation, the no-growth-past-cap invariant, the TTL boundary, and empty-key behavior for both Insert and LongestMatch. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): RoutePolicy enum with parse/resolve Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): Config with defaults and validation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): deterministic xxhash prefix-chain extractor Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): pure filter-then-score replica selection Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): Provider interface and radix-tree-backed Index Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * style(prefixcache): gofmt policy enum comment alignment Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): head-first prefix chunking and hoist Weight out of sort Address code-quality review findings in the prefixcache package. Correctness: ExtractChain now chunks from absolute offset 0 with fixed [0,W),[W,2W),... boundaries and caps the chain to the FIRST MaxDepth head blocks. The previous tail-keeping logic shifted the byte offset by a non-window amount once a conversation grew past MaxDepth*WindowBytes, changing every hash each turn and silently breaking cross-turn longest-prefix matching. The reusable KV/prefix cache lives at the head of the prompt, so anchoring at offset 0 makes the chain a true prefix-chain: P and P+suffix share their full leading overlap. Add a regression spec proving cross-turn stability past the cap. Performance: Index.Decide precomputes each candidate's Weight once (decorate-sort-undecorate) instead of calling the O(tree size) Weight inside the O(n log n) sort comparator. Behavior is unchanged. Lint: encode prev with binary.LittleEndian.PutUint64 instead of a manual byte loop, clearing the modernize rangeint finding. Also add a concurrent Decide/Observe/Invalidate spec to exercise Index's documented concurrency safety under go test -race. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(messaging): prefixcache observe/invalidate subjects and payloads Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): NATS sync publish/apply for observe and invalidate Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributedhdr): ctx carrier for prefix-hash chain Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributedhdr): PrefixChainHook indirection for backend-side chain build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend): stash prompt prefix chain on ctx before distributed routing Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend): mirror modelID fallback for prefix-chain salt parity Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): scheduling config columns for prefix-cache routing Add RoutePolicy and per-model balance/prefix-match override columns to ModelSchedulingConfig and include them in the SetModelScheduling upsert DoUpdates list so updates are not dropped on conflict. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): optional route preference in FindAndLockNodeWithModel Add a RoutePreference type and a new pref parameter so the atomic pick+lock+increment can be biased toward a preferred node without weakening atomicity. A nil preference reproduces the previous ORDER BY behavior exactly. Update the ModelRouter interface, both router.go call sites (pass nil for now; Phase 5 builds the real preference), the test doubles, and the distributed e2e caller. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): make Sync satisfy Provider with Evict Sync.Observe now returns whether the local index treated the assignment as new or extended, and Sync gains an Evict method that delegates to the wrapped index. Together these let SmartRouter hold a single prefixcache.Provider that broadcasts via NATS. Adds a compile-time Provider assertion and an Evict-delegates behavioral test. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): prefix-cache-aware preference and observe in SmartRouter.Route Add a PrefixProvider + PrefixConfig to SmartRouterOptions/SmartRouter (nil keeps routing byte-for-byte the round-robin floor). On each request Route now calls buildPreference: it reads the prompt prefix chain from ctx (distributedhdr.PrefixChain), resolves the per-model policy/thresholds over the global config, loads candidate replica in-flight via a new registry read LoadedReplicaStats (deduped to one entry per node using the MIN in-flight across that node's replicas), asks the provider to Decide, and runs prefixcache.Select. The chosen node is passed as the RoutePreference to FindAndLockNodeWithModel on all three pick paths (cache hit, locked re-pick, cold scheduleAndLoad), and the served node is recorded via Observe only when the resolved policy is prefix_cache so round-robin models never pollute the tree. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): invalidate prefix-cache entries on unload and stale removal UnloadModel and both staleness fall-through paths in Route (after a failed gRPC probe and RemoveNodeModel) now call prefixProvider.Invalidate(model, nodeID), guarded by a nil-provider check so the round-robin floor is unchanged. At runtime the provider is the *prefixcache.Sync, so invalidations also broadcast to peer frontends. Adds a test that a previously hot prefix no longer Decides to a node after UnloadModel. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): rolling forced-disturb pressure counter Add a concurrency-safe per-model rolling counter that tracks how many times a request had a usable hot prefix match but the load guard forced it off the warm node. Entries outside the window are dropped lazily on Count so the backing slice stays bounded. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): autoscale on prefix-cache forced-disturb pressure Wire the rolling forced-disturb counter into the SmartRouter and the ReplicaReconciler. Router: in buildPreference, after Decide + Select, record a forced-disturb when a usable hot prefix match existed (d.HotNodeID != "" and d.MatchRatio >= cfg.MinPrefixMatch) but Select chose a different node (or nothing) because the load guard ruled the warm node out. This is the scale-worthy signal: the cache-warm replica is saturated. It deliberately does not fire for all-unique workloads (no hot match), avoiding false-positive scale-ups. Pressure is optional on SmartRouterOptions; nil keeps the path a no-op. Reconciler: read the same Pressure instance in reconcileModel as an extra scale-up reason, reusing the existing MaxReplicas + ClusterCapacityForModel guards and the UnsatisfiableUntil cooldown that gates the whole method. Pressure never overrides MaxReplicas and never force-evicts; a no-capacity model does not spin. Window and threshold come from prefixcache.Config (PressureWindow default 1m, PressureScaleThreshold default 1) and are configurable via ReplicaReconcilerOptions. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): bound Pressure slice in Record; drop dead reconciler pressureWindow Record now prunes entries older than the rolling window (the same prune Count does), via a shared pruneLocked helper, so a model that takes forced-disturb records but is never Counted (e.g. one with zero loaded replicas the reconciler skips) no longer grows its backing slice unbounded. Also removes the dead pressureWindow struct field and the ReplicaReconcilerOptions.PressureWindow option from the reconciler: they were stored but never read (the window lives inside the *prefixcache.Pressure instance). The scale block now reads pressure.Count once into a local. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(api): prefix-cache fields in scheduling endpoint DTO with validation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ui): prefix-cache routing controls in node scheduling form Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): wire prefix-cache index, NATS sync, and config Activates prefix-cache-aware routing in distributed mode. Builds the prefixcache Index + NATS-backed Sync + Pressure counter, installs the distributedhdr.PrefixChainHook so core/backend/llm.go attaches a prefix chain per request, subscribes to prefixcache.observe/prefixcache.invalidate to apply peers' events to the local index (no re-broadcast), threads PrefixProvider/PrefixConfig/Pressure into the SmartRouter and Pressure/PressureThreshold into the ReplicaReconciler, and runs a background eviction ticker (every TTL/2) bound to the app context. Enabled by default; --distributed-prefix-cache=false (LOCALAI_DISTRIBUTED_PREFIX_CACHE) opts out and leaves the provider/pressure nil so routing stays round-robin. --distributed-prefix-cache-ttl (LOCALAI_DISTRIBUTED_PREFIX_CACHE_TTL, default 5m) controls entry idle-timeout and eviction cadence. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(nodes): round-robin-floor invariant for prefix-cache routing Drives Select directly: a saturated hot node (in_flight 50 vs 0) is never picked even with a perfect prefix match (round-robin floor holds), while a balanced hot node within the load slack is reused. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(prefixcache): clear branch lint findings and em dashes Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): validate prefix-cache config at startup wiring Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * perf(radixtree): single-walk WeightsFor for batch value weights Add Tree.WeightsFor(values, now) which computes the recency-weighted weight for many values in a single O(N + len(values)) tree traversal, versus calling Weight once per value (O(len(values) * N)). Consumers that score K candidates against the tree under the read lock no longer pay K full walks. Extract the per-entry contribution math into an unexported helper shared by both Weight and WeightsFor so the metric stays identical (DRY). Weight's public behavior is unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(config): add ModelConfig.ModelID() single source of truth The c.Name fallback to c.Model was duplicated in core/backend/options.go (feeding model.WithModelID) and hand-copied into core/backend/llm.go (the prefix-chain salt). These MUST agree or the prefix-cache salt diverges silently from the id the model loader tracks. Consolidate both into a new config.ModelConfig.ModelID() helper and call it from both sites. Behavior is identical. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * perf(prefixcache): reuse one xxhash.Digest in ExtractChain ExtractChain allocated a fresh xxhash.New() Digest per block (up to MaxDepth per call) and grew the chain slice without preallocation. Reuse a single Digest via Reset() before each block and preallocate the chain to min(nBlocks, MaxDepth). xxhash seed 0 is stateless, so Reset()+Write produces the byte-identical value to a fresh New()+Write. Output hashes are unchanged, preserving the cross-process determinism that peers rely on over NATS. Verified by capturing ExtractChain output for the existing test inputs before and after the refactor: identical. Existing extractor tests pass unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): drop hot match when matched node is not a candidate; weigh cold candidates in one walk Index.Decide called radixtree.LongestMatch over the whole tree, so the deepest match could be a node that is offline, unloaded, or simply not in the passed candidate set. Honoring that as HotNodeID produced a false forced-disturb signal upstream (buildPreference records pressure when chosen != HotNodeID), making it look like a warm replica was load saturated when it was actually absent. Build the candidate set once and only set HotNodeID/MatchRatio when the matched node is an actual candidate; otherwise fall back to cold placement. A future refinement could ask the tree for the longest match restricted to the candidate nodes (shallower-but-valid) instead of dropping it. Also replace the per-candidate tree.Weight call in the cold-order sort with a single tree.WeightsFor walk, turning O(K*N) under the read lock into O(N + K). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(prefixcache): remove Select's unreachable deterministic fallback buildPreference always passes ColdOrder as a permutation of the full candidate set, so the cold-order loop hits every eligible candidate. The trailing best/bestIF scan was dead. Replace it with a plain "return """ and document that ColdOrder is guaranteed to cover all candidates, so "" means none were eligible. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(nodes): fetch model scheduling config once per Route GetModelScheduling was read three times per request - in resolveSelectorCandidates, buildPreference, and nodeMatchesScheduling - three DB round-trips for one row that is immutable for the life of the request, and not a consistent snapshot. Fetch it once near the top of Route and thread the *ModelSchedulingConfig (may be nil) into all three helpers. scheduleNewModel keeps its own fetch since it runs outside the Route snapshot. Behavior is identical for nil sched. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(autoscale): add Pressure.Reset to consume forced-disturb signal Pressure.Count is non-draining (it prunes only by age), so a single burst of forced-disturbs stays within the rolling window for the whole window and keeps Count >= threshold on every reconciler tick. The reconciler will use Reset to clear a model's events after acting on the signal so a fresh scale-up requires fresh forced-disturbs to accumulate, rather than one burst driving the model toward MaxReplicas. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(autoscale): at most one scale-up per reconcile tick, consume pressure Two autoscale bugs: 1. Over-scaling: the pressure scale-up block read Pressure.Count but never consumed it. With a non-draining counter a single forced-disturb burst kept Count >= threshold across the whole window, firing scaleUp on every tick and pushing the model toward MaxReplicas off one transient burst. After a successful pressure-triggered scale-up the reconciler now calls Pressure.Reset to consume the signal. 2. Double scale-up in one tick: the all-replicas-busy block and the pressure block could both fire in the same reconcileModel pass, each calling scaleUp(+1) against the same `current` read once at the top, so a model that was both busy and over threshold scaled +2 and could overshoot MaxReplicas by one. A scaledUp flag now enforces at most one scaleUp(+1) per tick: the pressure block is skipped if the busy block already scaled, and scale-down is skipped in any tick that scaled up. MinReplicas enforcement, UnsatisfiableUntil backoff, and capacity guards are unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): replica-removed chokepoint hook for prefix-cache invalidation Add SetReplicaRemovedHook to NodeRegistry and fire it from both RemoveNodeModel and RemoveAllNodeModelReplicas after a successful delete. This is the single chokepoint every replica-removal path funnels through (router eviction, reconciler scale-down, probe reaper, health-monitor node-down reap, RemoteUnloaderAdapter), so the prefix-cache index can be invalidated by construction rather than wiring each call site individually. The hook is stored in an atomic.Pointer so the startup wiring (setter) and the request/reconcile-time fire are race-free; it is nil-safe when unset. GORM Delete reports no error for a no-op delete, so the hook also fires when nothing was removed; the consumer's Invalidate(model, node) is idempotent so this is harmless. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): invalidate prefix-cache on any replica removal via registry hook Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(prefixcache): single source of truth for threshold bounds Extract ValidateThresholds into prefixcache/config.go so the per-model override validation (nodes.go endpoint) and Config.Validate share one implementation of the numeric bounds (min_prefix_match in [0,1], balance_abs_threshold >= 0, balance_rel_threshold == 0-or->= 1) instead of hard-coding them in two places. The route_policy allow-list stays explicit (not ParsePolicy, which maps typos to Default). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(nodes): preserve prefix-cache settings on partial scheduling update A scheduling POST that omitted route_policy/thresholds (e.g. a min_replicas-only update) full-replaced every column and silently reset the model's previously-configured prefix-cache settings to empty/zero. Make the four prefix-cache request fields pointers so omitted is distinguishable from explicit zero, and merge PATCH-style in SetSchedulingEndpoint: a provided pointer wins, an omitted one preserves the existing config value (zero default when none). Non-prefix fields keep their full-replace PUT semantics. Validation now runs on the resolved values via prefixcache.ValidateThresholds. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): make Invalidate a no-op for uncached models and skip empty broadcasts A registry chokepoint fires Sync.Invalidate(model, nodeID) for every replica removal of every model, including round-robin models that never used the prefix cache. Index.Invalidate previously called tree(model), which lazily created and permanently retained an empty radix tree for any model that ever lost a replica, growing the trees map without bound. Sync.Invalidate also published a NATS PrefixCacheInvalidateEvent on every call, amplifying no-op removals across the cluster. Index.Invalidate now looks the tree up read-only via existingTree and returns without allocating when none exists. The Provider interface is unchanged; Sync gates the broadcast through an optional invalidateExisting(bool) capability type-asserted from the wrapped Index, falling back to the prior always-broadcast behavior for other Provider implementations. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * perf(prefixcache): derive Decide candidacy from WeightsFor and skip trivial sort WeightsFor already returns a map keyed by every requested candidate, so the separate candidates set built to validate the hot match was redundant: a node is a candidate iff it is a key in the weights map. Drop the extra map and gate the hot-match check on weights membership. Also skip the sort when there is at most one candidate, since the input order is already the cold order. Behavior is unchanged. Deferred follow-up: skipping the WeightsFor walk entirely when a hot match wins would need lazy cross-file changes and is out of scope here. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(nodes): fire replica-removed hook on bulk node_models deletes; trim LoadedReplicaStats columns Bulk node-scoped node_models deletes (Register re-register cleanup, MarkOffline, MarkDraining, Deregister) removed rows directly without firing the replica-removed hook, so the prefix-cache index kept pointing at nodes whose models were gone. Capture the DISTINCT model names before each bulk delete and fire fireReplicaRemoved once per model after a successful delete, restoring the single-chokepoint invariant for all removal paths. The pre-query is skipped when no hook is set so the no-hook path stays cheap. Also narrow LoadedReplicaStats to SELECT only node_id and in_flight (the only fields the router consumer reads), dropping the JOIN-side available_vram fetch and unused columns while keeping the []ReplicaCandidate return type unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(reconciler): consume autoscale signals only on a real scale-up scaleUp was fire-and-forget (void) yet its callers unconditionally consumed the pressure signal (Pressure.Reset) and the MinReplicas hysteresis (ClearUnsatisfiable) right after calling it. If scaleUp added nothing (ScheduleAndLoadModel errored, or no node could be loaded) the saturated warm replica got no new replica AND its accumulated forced-disturb history was wiped, forcing the signal to re-accumulate over a full PressureWindow before the next attempt. Make scaleUp return whether at least one replica was actually scheduled, and gate the side effects on it: - pressure block (2b): set scaledUp and call Pressure.Reset only on success; on failure preserve the signal so the next tick retries off the same accumulated pressure. - busy-burst block (2): set scaledUp from the return value so a failed attempt does not suppress the pressure path or scale-down. - MinReplicas block: call ClearUnsatisfiable only on success so a failed attempt does not reset the unsatisfiable counter. All existing invariants (MaxReplicas, capacity gating, UnsatisfiableUntil cooldown, at-most-one-scale-up-per-tick) are preserved. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(nodes): drop router's redundant prefix-cache Invalidate calls The NodeRegistry removal chokepoint (RemoveNodeModel / RemoveAllNodeModelReplicas) now fires SetReplicaRemovedHook, which invalidates the prefix-cache index. The router was also calling prefixProvider.Invalidate explicitly right after each registry removal on the two stale-replica health-probe fall-throughs in Route and in UnloadModel, so every router-side eviction invalidated twice (double tree-prune + double NATS broadcast). Remove the three redundant explicit Invalidate calls and their empty nil-guards. Each removed call sat immediately after a registry removal that fires the hook, so invalidation is preserved via the chokepoint. Decide/Observe usage is untouched. Re-point the unit test (fake registry fires no hook) to assert the removal chokepoint is exercised on unload instead of the router's direct invalidation. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): broadcast invalidations unconditionally for cross-frontend coherence Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): reject TTL<=0 in Config.Validate (eviction ticker would panic) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(nodes): make capture+delete atomic in bulk node_models removal paths MarkOffline, MarkDraining, and the Register re-register cleanup ran the nodeModelNames SELECT and the bulk node_models DELETE as two separate statements on r.db with no transaction. A SetNodeModel landing between the two was deleted but its replica-removed hook never fired, leaving the prefix-cache index pointing at a removed replica until TTL or candidacy self-heal. Wrap the capture and the delete in a single db.Transaction in each path (mirroring how Deregister already does it). The captured model names are collected into a slice declared outside the closure; the replica-removed hook fires for each only after the transaction commits, so a rollback never invalidates the index for a removal that did not persist. The set of fired hooks now equals exactly the set of node_models rows actually deleted, with no interleaving gap. The status flip in MarkOffline/MarkDraining (setStatus) is a separate, pre-existing operation and routing already filters non-healthy nodes, so it stays outside the transaction; return contracts are unchanged. Deregister was already correct and is untouched. The cheap-path skip (no hook -> skip the SELECT) is preserved. Adds a spec asserting MarkOffline fires hooks for exactly the rows it deletes and leaves no node_models row behind (consistent snapshot). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(nodes): debug logging for prefix-cache routing decisions and observations Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(radixtree): match shared prefixes by valuing every node on insert Insert recorded the value (node id) only on the final node of the key chain, leaving every intermediate prefix node valueless. LongestMatch returns the deepest node that hasValue, so two chains that share a leading block but diverge in the tail never matched: only exact-repeat queries hit. That broke the prefix-cache routing core use cases (shared system prompt, multi-turn extension, volatile tail), all of which rely on prefix matching rather than exact-repeat. Set value/hasValue/lastSeen at every node along the chain so each prefix-block node remembers the node id that served that prefix (SGLang/vLLM-style). The deepest match wins, and the last writer owns a shared prefix node (a recency heuristic: the most recent chain through a block is the one most likely still warm). size now counts valued nodes, which is the intended meaning. Updated radixtree tests to the new semantics: deepest-prefix test uses non-overlapping chains, a new test asserts last-writer-owns-shared-node, Evict/Remove/MaxEntries expectations recomputed for per-prefix-node counting, and a shared-prefix LongestMatch red test added. Added a prefixcache Decide test proving a prefix-only query routes to the warm node. No prefixcache .go logic changed. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(distributed): lock in prefix-cache routing behavior end to end Add a DB-backed e2e spec that drives SmartRouter against a real NodeRegistry (Postgres testcontainer) and the real prefixcache.Index radix-tree provider, using a fake gRPC backend factory so no real inference runs. Covers the five behaviors validated by hand: 1. Cold miss + observe: an unseen prefix chain cold-places and is recorded. 2. Hot-match affinity: the same chain returns to its warm node X. 3. Shared-prefix match: a divergent chain sharing X's leading prefix still routes to X (the radix-tree regression we fixed). 4. Negative control: an unrelated chain is a cold miss, not a false hot match on X. 5. Failover + invalidation: removing X's replica fires the registry chokepoint hook to invalidate the prefix entry, and the chain fails over to surviving node Y and re-homes there. Replaces the need for manual docker-compose re-runs. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(prefixcache): make prefix-cache affinity replica-granular Track prefix-cache affinity per loaded replica (a backend process with its own KV cache) instead of per node, so multiple replicas of the same model on one node each keep distinct affinity and a hot prefix routes back to the exact replica that served it. - radixtree: add RemoveFunc(pred) and reimplement Remove on top of it. - prefixcache: introduce ReplicaKey{NodeID, Replica}; Index/Candidate/ PrefixDecision/Select/Provider now key on ReplicaKey. Add InvalidateNode to drop every replica of a node; Invalidate drops one replica. Select returns (ReplicaKey, bool) and gains a deterministic least-in-flight eligible fallback (tiebreak NodeID then Replica). - messaging: carry Replica on PrefixCacheObserveEvent and PrefixCacheInvalidateEvent (Replica < 0 means all replicas of the node). - Sync delegates + broadcasts with replica; InvalidateNode broadcasts Replica=-1; ApplyInvalidate routes negative replica to InvalidateNode. This is part 1 of 2; the registry/router/wiring consumers are updated separately. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): make prefix-cache routing replica-granular Wire the SmartRouter, NodeRegistry, and distributed startup to the replica-keyed prefixcache API. Affinity is now tracked per replica (each replica is a separate process with its own KV cache), so a prefix served by (node,0) no longer leaks onto the same-node sibling (node,1). - RoutePreference gains PreferredReplica; FindAndLockNodeWithModel locks the EXACT (node_id, replica_index) row, falling through to the default ORDER BY when that replica is not loaded. - SetReplicaRemovedHook now carries replicaIndex; RemoveNodeModel fires the specific replica, RemoveAllNodeModelReplicas and the four bulk node-scoped deletes fire replica<0 (all replicas of the node). - buildPreference builds one Candidate per loaded replica and locks the exact replica the policy chose; observePrefix records the served ReplicaKey at every call site. - distributed.go routes the hook to InvalidateNode (replica<0) or Invalidate(key). - Tests updated to the replica-keyed API plus new coverage: a hot prefix on (node,0) prefers replica 0 over the same-node sibling (router unit + e2e), and FindAndLock locks the exact preferred replica. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(distributed): derive prefix chain from messages for tokenizer-template models Prefix-cache-aware routing built its prompt-prefix chain from the rendered prompt string `s` in ModelInference. For models with TemplateConfig.UseTokenizerTemplate the frontend never renders a prompt - the backend tokenizes the structured messages itself - so `s` is empty, the chain is empty, and routing silently falls back to round-robin. That covers the bulk of modern chat models (qwen3, llama3, ...), so the feature effectively never engaged for them. Fall back to messagesPrefixSource(messages): a deterministic, prefix-stable head-first serialization of the conversation (role + content per turn). Two requests sharing a leading system prompt and early turns share a leading byte prefix, which ExtractChain maps to a shared chain prefix - landing both on the same cache-warm replica. The rendered `s` is still preferred when present (higher fidelity for non-template models). Found via the multi-replica-per-node e2e: zero "prefix-cache routing decision" logs despite per-request Route calls, traced to the empty-chain guard. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): document prefix-cache routing roadmap Add a routing-and-caching roadmap section to the distributed-mode guide, linking the epic (#10063) and the follow-up issues (#10064-#10070) surfaced from a survey of SGLang, vLLM production-stack, Ray Serve, llm-d, AIBrix, and NVIDIA Dynamo. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
8d6548c0b9 |
fix(distributed): sync gallery OpCache + caches across frontend replicas (#9983)
When the LocalAI frontend deployment is scaled past one replica, the UI's
/api/operations poll round-robins between pods. Each pod kept the OpCache
(galleryID->jobID), OpStatus map, and the post-install in-memory caches
(ModelConfigLoader, UpgradeChecker) purely in-process. Reads never
consulted PostgreSQL or NATS even though writes already published to PG.
Symptoms:
- A user installing a model on replica A saw the operation card flicker
in and out as the load balancer alternated.
- The Models page re-fetched the whole gallery on every flicker because
useEffect([operations.length]) re-fires when the count changes.
- A chat completion that landed on replica B after the install completed
on replica A failed to find the new model — B's ModelConfigLoader was
still the old one because nothing told it to reload from disk.
- The UpgradeChecker 6-hour cache stayed stale on peer replicas after a
backend upgrade, so /api/backends/upgrades kept surfacing an upgrade
that had already shipped.
Mirror the jobs Dispatcher pattern for gallery ops:
- OpCache learns SetMessagingClient/SetGalleryStore + a Start(ctx) that
hydrates from PostgreSQL and subscribes to gallery.opcache.{start,end}.
Set/SetBackend now upsert cache_key + is_backend_op on the gallery_
operations row and broadcast OpCacheEvent so peers merge it in. The
hydrate path uses a new GalleryStore.ListActive() (status in {pending,
downloading, processing} and updated within 30 min).
- GalleryService.SubscribeBroadcasts wires a SubjectGalleryProgress-
Wildcard subscriber that calls a new lock-light mergeStatus into the
local statuses map, plus a SubjectGalleryCancelWildcard subscriber that
runs the locally-registered cancel func. Hydrate() restores active rows
from PostgreSQL on startup so a freshly-started replica is not
observably empty mid-install. CancelOperation tolerates the cancel func
living on a different replica and publishes anyway.
- modelHandler and backendHandler publish on the new
SubjectCacheInvalidateModels / SubjectCacheInvalidateBackends after
a successful install/delete/upgrade. SubscribeBroadcasts wires peers
to refresh: OnModelsChanged (re-runs LoadModelConfigsFromPath) and
OnBackendOpCompleted (re-triggers UpgradeChecker). The originating
replica reloads inline so it never enters the broadcast handler.
- OpStatus.Error (an error interface) flat-marshalled to "{}" over JSON,
so a failed install replicated to a peer arrived with a nil error and
the UI's failure banner never appeared. Add MarshalJSON/UnmarshalJSON
via an opStatusWire shim that round-trips Error as a string.
- UpdateStatus and CancelOperation now drop the mutex before publishing
to NATS or persisting to PostgreSQL. The wildcard subscriber's
mergeStatus loops back into the same service on the publishing replica
and would deadlock otherwise; this also prevents future PG round-trips
from stalling concurrent readers on every progress tick.
Tests cover the OpStatus error round-trip, OpCache propagation through a
shared in-memory bus, OpCache PostgreSQL hydration (active-only),
GalleryService progress + cancel broadcast, Nodes preservation across a
peer's bare progress tick, GalleryService hydration from PG, and the
two cache-invalidation broadcasts (models + backends). 44 specs total
in galleryop; routes/operations specs and jobs/agents suites still pass.
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
a0f3e26245 |
fix(distributed): make admin backend installs resilient and observable (#9958)
* feat(distributed): add configurable NATS backend install/upgrade timeouts Adds BackendInstallTimeout and BackendUpgradeTimeout to DistributedConfig with 15m defaults, following the existing MCPToolTimeout / WorkerWaitTimeout pattern. These will replace the hardcoded literals in RemoteUnloaderAdapter so admin-driven backend installs across the cluster survive long OCI image pulls that previously timed out at 3m. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * style(distributed): gofmt alignment after timeout fields Re-aligns the Validate() negative-duration map and the Default* const block so the new BackendInstall/UpgradeTimeout entries do not leave the surrounding columns mis-padded. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(cli): surface LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT and _UPGRADE_TIMEOUT Parses the two new env vars on the run CLI and threads them through the existing AppOption builder so DistributedConfig picks them up. Invalid duration strings now fail loudly at startup rather than silently falling back to the default. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): inject NATS install/upgrade timeouts into RemoteUnloaderAdapter Removes the hardcoded 3m / 15m literals from RemoteUnloaderAdapter and threads in DistributedConfig.BackendInstallTimeoutOrDefault() and BackendUpgradeTimeoutOrDefault() at construction. Install now defaults to 15m (was 3m); cold OCI image pulls on Jetson Wi-Fi routinely blew past the old ceiling. Scripted messaging client captures the timeout so tests can assert the configured value actually reaches the NATS request. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): introduce galleryop.ErrWorkerStillInstalling sentinel When the NATS request-reply for backend.install (or .upgrade) times out the worker is almost always still pulling the OCI image. Wrap the timeout in a typed sentinel so the manager above can distinguish "worker hung" from "worker still working" and leave the pending_backend_ops row in place for the reconciler to confirm via backend.list. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): treat NATS install timeout as in-progress, not failure When a worker times out replying to backend.install but the install is still running on the worker, enqueueAndDrainBackendOp now reports a running_on_worker status and pushes NextRetryAt out by the install timeout so the reconciler does not immediately re-fire another install while the worker is still pulling the image. The pending_backend_ops row stays in place for the next reconciler pass to confirm via backend.list. InstallBackend wraps the result in galleryop.ErrWorkerStillInstalling so callers can branch (galleryop renders yellow in-progress instead of red error). UpgradeBackend uses the same wrap. Adds RemoteUnloaderAdapter.InstallTimeout() so the manager can push NextRetryAt by the configured timeout without reaching into a private field, and NodeRegistry.RecordPendingBackendOpInFlight as the soft cousin of RecordPendingBackendOpFailure. Also includes incidental gofmt-driven struct-field alignment in registry.go on lines unrelated to the change (touched files are re-formatted to canonical form per project policy). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(distributed): don't increment Attempts on in-flight install timeout An in-flight timeout (worker still pulling the OCI image) is not a failed attempt, it's a delayed one. Incrementing Attempts let genuinely-progressing slow installs (e.g. 30 GB CUDA images on Wi-Fi) trip the reconciler's maxPendingBackendOpAttempts cap and dead-letter the queue row while the worker was still legitimately working. RecordPendingBackendOpInFlight now only updates LastError and NextRetryAt. Also documents "running_on_worker" in the NodeOpStatus.Status enum comment so Task 6 implementers see the full surface. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(galleryop): surface ErrWorkerStillInstalling as non-error OpStatus When the distributed backend manager returns an error that wraps ErrWorkerStillInstalling, backendHandler now completes the op with a "still installing in background" message rather than marking it as a red failure. Admin UI sees a yellow in-progress state; reconciler confirms completion on its next pass. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(distributed): end-to-end install-timeout-then-reconcile Wires Task 1-6 end-to-end so any seam mismatch surfaces in CI rather than during a real cluster install. NATS times out, the queue row stays alive with running_on_worker status, the worker eventually reports the backend installed via backend.list, the manager surfaces it via ListBackends. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): document LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT / _UPGRADE_TIMEOUT Add the two new operator-tunable env vars to the Frontend Configuration table in the distributed-mode docs. Explains the 15m default, when to raise it (slow links pulling multi-GB OCI images), and the new "still installing in background" admin-UI state when the round-trip times out but the worker is still working. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): clear pending install rows when backend.list confirms DistributedBackendManager.ListBackends now proactively clears pending_backend_ops install rows whose (nodeID, backend) is reported installed by backend.list. Operator UI updates immediately instead of waiting up to installTimeout (default 15m) for the next reconciler tick after NextRetryAt. Only install rows are cleared; upgrade and delete intents are not satisfied by presence in backend.list and continue to drain through their normal reconciler paths. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(messaging): add BackendInstallProgressEvent wire type and subject New NATS subject nodes.<nodeID>.backend.install.<opID>.progress lets the worker publish transient progress events (file, current/total bytes, percentage, phase) while a long-running install pulls its OCI image. BackendInstallRequest gains an optional OpID field so the worker knows which subject to publish on. Transient pub/sub (not JetStream): the install reply remains ground truth for success/failure; dropped progress events are tolerable. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * style(messaging): drop em-dash from BackendInstallProgress test comment Per project convention (no em-dashes anywhere). Comment substance is unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): worker publishes debounced install progress over NATS When BackendInstallRequest.OpID is set, the worker's backend.install handler wires a debounced publisher (250ms window) into the gallery download callback. Each tick becomes a BackendInstallProgressEvent on nodes.<nodeID>.backend.install.<opID>.progress; the publisher always emits a final event on Flush so the UI sees the terminal percentage. Old masters that do not set OpID continue to run silent installs: no behavior change for them. Lock ordering: the publisher releases its mutex before calling messaging.Publish so a slow network never stalls the install loop. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): RemoteUnloaderAdapter subscribes to install progress InstallBackend gains opID + onProgress parameters. When both are set, the adapter subscribes to nodes.<nodeID>.backend.install.<opID>.progress BEFORE publishing the install request, decodes each message into the caller's onProgress callback in a goroutine (so a slow callback never stalls the NATS reader thread), and unsubscribes after RequestJSON returns. When onProgress is nil OR opID is empty (the reconciler retry path), subscription is skipped entirely - silent installs cost nothing extra. Subscribe failure is logged at Warn and the install proceeds without progress streaming; the NATS round-trip still owns terminal status. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): forward backend install progress into galleryop OpStatus DistributedBackendManager.InstallBackend now passes the gallery op ID and a progress bridge into the adapter call. Each BackendInstallProgressEvent from the worker becomes a galleryop.ProgressCallback tick - which the existing backendHandler already turns into OpStatus.UpdateStatus, so the admin UI/SSE polling sees per-byte progress for distributed installs without any UI-side change. UpgradeBackend is intentionally left silent for now: its wire request (BackendUpgradeRequest) does not carry OpID, and rolling-update fallback is the rarer path. Will be picked up in a follow-up if the worker upgrade path also gets a progress channel. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(distributed): InstallBackend tolerates silent (pre-Phase-2) workers A worker on pre-Phase-2 code never publishes progress events. The new master subscribes optimistically; this spec pins that a silent worker still produces a green install with no progressCb ticks. The install reply is the source of truth for terminal state; the progress stream is a best-effort UX enrichment. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): document install progress streaming Note the new nodes.<nodeID>.backend.install.<opID>.progress subject and the silent-worker compatibility behavior so operators know to expect real-time progress and what happens on a mixed-version cluster. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): note progress-event ordering trade-off in InstallBackend Document near the goroutine dispatch why ordering at the consumer is best-effort, why it rarely matters in practice (worker debounce >> goroutine jitter), and what a future hardening pass would look like (Seq field + stale-by-seq drop). Stops the next reader from accidentally "fixing" the goroutine pool away. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(galleryop): add NodeProgress + OpStatus.Nodes for per-node breakdown Adds the data model the UI needs to render an expandable per-node breakdown of a fanned-out backend install. NodeProgress carries node identity (ID + name), per-node status (queued / running_on_worker / success / error / downloading), the current file + bytes + percentage from the Phase 2 progress stream, and any per-node error. OpStatus.Nodes is the slice the /api/operations handler will surface in a follow-up. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(galleryop): UpdateNodeProgress merges per-node ticks by NodeID GalleryService.UpdateNodeProgress(opID, nodeID, np) merges a NodeProgress into OpStatus.Nodes (keyed by NodeID, no duplicates) and mirrors the latest tick into the aggregate Progress / FileName / DownloadedFileSize / TotalFileSize fields so the legacy single-bar OperationsBar view keeps working unchanged alongside the new per-node breakdown. Concurrent-safe via the existing g.Mutex. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): write per-node OpStatus entries during install fan-out DistributedBackendManager now accepts a nodeProgressSink and feeds it two streams: 1. enqueueAndDrainBackendOp emits a per-node terminal entry on each status it appends to BackendOpResult (queued, success, error, running_on_worker). The opID is threaded through the function so the sink gets the right gallery op identity. 2. The install apply closure fans each BackendInstallProgressEvent into the sink as a downloading entry, alongside the legacy progressCb path so the aggregate single-bar view stays correct. Production wiring passes the GalleryService (which implements UpdateNodeProgress via Task 2) as the sink. Single-node tests pass nil. DeleteBackend and UpgradeBackend pass an empty opID so the sink path no-ops for ops that aren't gallery-tracked the same way as Install. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(operations): expose per-node breakdown on /api/operations When an operation's OpStatus has Nodes entries (populated by the Phase 4 progress sink wiring), surface them as a "nodes" array on the /api/operations response, sorted by node_name for stable rendering. Backward compatible: legacy clients ignore the field; ops without any node entries (single-node mode, model installs) omit the array entirely thanks to the empty-slice guard. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ui): per-node breakdown in OperationsBar When an install op fans out to more than one worker, the operations bar now shows a "N nodes" chevron that expands into a per-node list. Each row carries the node's status (color-coded pill), the current file being downloaded, byte counts, percentage, and a thin per-node progress bar. Yellow "Worker busy" pill marks running_on_worker status with a tooltip explaining the NATS round-trip timed out but the worker is still installing in the background. Backward compatible: ops without a nodes field (legacy or single-node mode) render as before. State for expand/collapse is local to the component, keyed by jobID/id - reload starts collapsed. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): document per-node breakdown in the operations bar Adds a short subsection covering the expandable "N nodes" chevron in the OperationsBar admin UI, the meaning of each status pill, and how it relates to the /api/operations nodes array. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(galleryop): UpdateStatus preserves Nodes when caller sends none Real-world bug surfaced by the Phase 4 multi-worker smoke test: the nodes[] array in /api/operations flickered between a single node at a time on a 2-worker install. Root cause: the Phase 2 progress bridge also calls the legacy progressCb -> UpdateStatus(&OpStatus{...}) on every tick. UpdateStatus then overwrote the entire status pointer, wiping the Nodes slice that UpdateNodeProgress had just merged in. Fix: in UpdateStatus, if the incoming op has an empty Nodes slice, carry forward the previous status's Nodes before storing. Callers that explicitly populate Nodes still win (their slice replaces the prior one, no merge across the two code paths). Two regression specs added pinning both directions of the contract. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): strip implementation details from user-facing docs Trim the new install/upgrade timeout rows and the install-progress sections to focus on what the operator sees and tunes. Drops: - the NATS subject names and pub/sub mechanics - "round-trip" / reconciler / backend.list jargon - /api/operations polling cadence - "pre-2026-05-22" version references Reframes the breakdown text around the admin UI (Operations Bar, chevron, status pills, "Worker busy" tooltip). Implementation context lives in the agent notes and code comments. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(config): move DistributedConfig.Validate flag names to constants The negative-duration check map was a wall of literal kebab-case strings that had to stay in sync with the kong-derived CLI flag names manually. Move them to a Flag* const block alongside the existing Default* block so a rename of either the Go field or the CLI naming convention forces a compile error rather than silent drift. Sole consumer today is Validate; the constants are exported so future operator-facing surfaces (e.g. error messages on other validation paths) can reference them by name instead of repeating the literals. Tests pin both the literal values (so a future "let's just rename this" doesn't accidentally regress the CLI flag) and the negative- duration error message for the new BackendInstall / BackendUpgrade fields. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(distributed): extract NodeStatus and Phase enums to constants Sweep for the same literal-string-as-identifier pattern called out on the Validate flag names: the per-node install status enum ("queued" | "downloading" | "running_on_worker" | "success" | "error") appeared as raw literals across managers_distributed.go (10+ sites, including 3 separate `n.Status == "running_on_worker"` checks), operation.go, and the test suite. Same shape for the Phase enum ("resolving" | "downloading" | "extracting" | "starting") in the worker-side progress publisher. Promote both to exported const blocks: - galleryop.NodeStatus{Queued,Downloading,RunningOnWorker,Success,Error} shared between galleryop.NodeProgress.Status (the wire field) and nodes.NodeOpStatus.Status (the in-process per-node summary) - messaging.Phase{Resolving,Downloading,Extracting,Starting} shared between the worker publisher and any future consumer that needs to switch on phase Tests pin both the literal values (so a future "let's just rename" doesn't silently change the JSON wire) and use the constants in setup (so the producer side stays drift-protected). Wire-format assertions on the /api/operations JSON output keep their literals deliberately, so the constant value can never silently diverge from what the UI receives. Out of scope for this PR (separate cleanup): the finetune and quantization job-status enums have the same anti-pattern with 14+ literal sites each, but predate this PR's work. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
e5d7b84216 |
fix(distributed): split NATS backend.upgrade off install + dedup loads (#9717)
* feat(messaging): add backend.upgrade NATS subject + payload types
Splits the slow force-reinstall path off backend.install so it can run on
its own subscription goroutine, eliminating head-of-line blocking between
routine model loads and full gallery upgrades.
Wire-level Force flag on BackendInstallRequest is kept for one release as
the rolling-update fallback target; doc note marks it deprecated.
Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed/worker): add per-backend mutex helper to backendSupervisor
Different backend names lock independently; same backend serializes. This
is the synchronization primitive used by the upcoming concurrent install
handler — without it, wrapping the NATS callback in a goroutine would
race the gallery directory when two requests target the same backend.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(distributed/worker): run backend.install handler in a goroutine
NATS subscriptions deliver messages serially on a single per-subscription
goroutine. With a synchronous install handler, a multi-minute gallery
download would head-of-line-block every other install request to the
same worker — manifesting upstream as a 5-minute "nats: timeout" on
unrelated routine model loads.
The body now runs in its own goroutine, with a per-backend mutex
(lockBackend) protecting the gallery directory from concurrent operations
on the same backend. Different backend names install in parallel.
Backward-compat: req.Force=true is still honored here, so an older master
that hasn't been updated to send on backend.upgrade keeps working.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed/worker): subscribe to backend.upgrade as a separate path
Slow force-reinstall now lives on its own NATS subscription, so a
multi-minute gallery pull cannot head-of-line-block the routine
backend.install handler on the same worker. Same per-backend mutex
guards both — concurrent install + upgrade for the same backend
serialize at the gallery directory; different backends are independent.
upgradeBackend stops every live process for the backend, force-installs
from gallery, and re-registers. It does not start a new process — the
next backend.install will spawn one with the freshly-pulled binary.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): add UpgradeBackend on NodeCommandSender; drop Force from InstallBackend
Master now sends to backend.upgrade for force-reinstall, with a
nats.ErrNoResponders fallback to the legacy backend.install Force=true
path so a rolling update with a new master + an old worker still
converges. The Force parameter leaves the public Go API surface
entirely — only the internal fallback sets it on the wire.
InstallBackend timeout drops 5min -> 3min (most replies are sub-second
since the worker short-circuits on already-running or already-installed).
UpgradeBackend timeout is 15min, sized for real-world Jetson-on-WiFi
gallery pulls.
Updates the admin install HTTP endpoint
(core/http/endpoints/localai/nodes.go) to the new signature too.
router_test.go's fakeUnloader does not yet implement the new interface
shape; Task 3.2 will catch it up before the next package-level test run.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(distributed): update fakeUnloader for new NodeCommandSender shape
InstallBackend lost its force bool param (Force is not part of the public
Go API anymore — only the internal upgrade-fallback path sets it on the
wire). UpgradeBackend gained a method. Fake records both call slices and
provides an installHook concurrency seam for upcoming singleflight tests.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(distributed): cover UpgradeBackend's new subject + rolling-update fallback
Task 3.1 changed the master to publish UpgradeBackend on the new
backend.upgrade subject; the existing UpgradeBackend tests scripted the
old install subject and so all 3 began failing as expected. Updates them
to script SubjectNodeBackendUpgrade with BackendUpgradeReply.
Adds two new specs for the rolling-update fallback:
- ErrNoResponders on backend.upgrade triggers a backend.install
Force=true retry on the same node.
- Non-NoResponders errors propagate to the caller unchanged.
scriptedMessagingClient gains scriptNoResponders (real nats sentinel) and
scriptReplyMatching (predicate-matched canned reply, used to assert that
the fallback path actually sets Force=true on the install retry).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(distributed): coalesce concurrent identical backend.install via singleflight
Six simultaneous chat completions for the same not-yet-loaded model were
observed firing six independent NATS install requests, each serializing
through the worker's per-subscription goroutine and amplifying queue
depth. SmartRouter now wraps the NATS round-trip in a singleflight.Group
keyed by (nodeID, backend, modelID, replica): N concurrent identical
loads share one round-trip and one reply.
Distinct (modelID, replica) keys still fire independent calls, so
multi-replica scaling and multi-model fan-out are unaffected.
fakeUnloader gains a sync.Mutex around its recording slices to keep
concurrent test goroutines race-clean.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(e2e/distributed): drop force arg from InstallBackend test calls
Two e2e test call sites still passed the trailing force bool that was
removed from RemoteUnloaderAdapter.InstallBackend in
|
||
|
|
447c186089 |
fix(distributed): make backend upgrade actually re-install on workers (#9708)
* fix(distributed): make backend upgrade actually re-install on workers UpgradeBackend dispatched a vanilla backend.install NATS event to every node hosting the backend. The worker's installBackend short-circuits on "already running for this (model, replica) slot" and returns the existing address — so the gallery install path was skipped, no artifact was re-downloaded, no metadata was written. The frontend's drift detection then re-flagged the same backends every cycle (installedDigest stays empty → mismatch → "Backend upgrade available (new build)") while "Backend upgraded successfully" landed in the logs at the same time. The user-visible symptom: clicking "Upgrade All" silently does nothing and the same N backends sit on the upgrade list forever. Two coupled fixes, one PR: 1. Force flag on backend.install. Add `Force bool` to BackendInstallRequest and thread it through NodeCommandSender -> RemoteUnloaderAdapter. UpgradeBackend (and the reconciler's pending-op drain when retrying an upgrade) sets force=true; routine load events and admin install endpoints keep force=false. On the worker, force=true stops every live process that uses this backend (resolveProcessKeys for peer replicas, plus the exact request processKey), skips the findBackend short-circuit, and passes force=true into gallery.InstallBackendFromGallery so the on-disk artifact is overwritten. After the gallery install completes, startBackend brings up a fresh process at the same processKey on a new port. 2. Liveness check on the fast path. installBackend's "already running" branch read getAddr without verifying the process was alive, so a gRPC backend that died without the supervisor noticing left a stale (key, addr) entry. The reconciler then dialed that address, got ECONNREFUSED, marked the replica failed, retried install — and the supervisor said "already running addr=…" again. Loop forever, exactly what we observed on a node whose llama-cpp process had died but whose supervisor record persisted. Verify s.isRunning(processKey) before trusting getAddr; if the entry is stale, stopBackendExact cleans up and we fall through to a real install. Backwards-compatible: the new Force field is omitempty, older workers ignore it (their default behavior matches force=false). The signature change on NodeCommandSender.InstallBackend is internal-only. Verified: unit tests in core/services/nodes pass (108s suite). The pre-existing core/backend build break (proto regen pending for word-level timestamps) blocks core/cli and core/http/endpoints/localai package tests but is unrelated to this change. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] * test(e2e/distributed): pass force=false to adapter.InstallBackend NodeCommandSender.InstallBackend gained a final force bool in the upgrade-force commit; the e2e distributed lifecycle tests still called the old 8-arg signature and broke compilation. These tests exercise the routine install path (single replica, default behavior), so force=false preserves their existing semantics. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
6b63b47f61 |
feat(distributed): support multiple replicas of one model on the same node (#9583)
* feat(distributed): support multiple replicas of one model on the same node The distributed scheduler implicitly assumed `(node_id, model_name)` was unique, but the schema didn't enforce it and the worker keyed all gRPC processes by model name alone. With `MinReplicas=2` against a single worker, the reconciler "scaled up" every 30s but the registry never advanced past 1 row — the worker re-loaded the model in-place every tick until VRAM fragmented and the gRPC process died. This change introduces multi-replica-per-node as a first-class concept, with capacity-aware scheduling, a circuit breaker, and VRAM soft-reservation. Operators can declare per-node capacity via the worker flag `--max-replicas-per-model` (mirrored as auto-label `node.replica-slots=N`) or override per-node from the UI. * Schema: BackendNode gains MaxReplicasPerModel (default 1) and ReservedVRAM. NodeModel gains ReplicaIndex (composite with node_id + model_name). ModelSchedulingConfig gains UnsatisfiableUntil/Ticks for the reconciler circuit breaker. * Registry: replica_index threaded through SetNodeModel, RemoveNodeModel, IncrementInFlight, DecrementInFlight, TouchNodeModel, GetNodeModel, SetNodeModelLoadInfo and the InFlightTrackingClient. New helpers: CountReplicasOnNode, NextFreeReplicaIndex (with ErrNoFreeSlot), RemoveAllNodeModelReplicas, FindNodesWithFreeSlot, ClusterCapacityForModel, ReserveVRAM/ReleaseVRAM (atomic UPDATE with ErrInsufficientVRAM), and the unsatisfiable-flag CRUD. * Worker: processKey now `<modelID>#<replicaIndex>` so concurrent loads of the same model land on distinct ports. Adds CLI flag --max-replicas-per-model (env LOCALAI_MAX_REPLICAS_PER_MODEL, default 1) and emits the auto-label. * Router: scheduleNewModel filters candidates by free slot, allocates the replica index, and soft-reserves VRAM before installing the backend. evictLRUAndFreeNode now deletes the targeted row by ID instead of all replicas of the model on the node — fixes a latent bug where evicting one replica orphaned its siblings. * Reconciler: caps scale-up at ClusterCapacityForModel so a misconfig (MinReplicas > capacity) doesn't loop forever. After 3 consecutive ticks of capacity==0 it sets UnsatisfiableUntil for a 5m cooldown and emits a warning. ClearAllUnsatisfiable fires from Register, ApproveNode, SetNodeLabel(s), RemoveNodeLabel and UpdateMaxReplicasPerModel so a new node joining or label changes wake the reconciler immediately. scaleDownIdle removes highest-replica-index first to keep slots compact. * Heartbeat resets reserved_vram to 0 — worker is the source of truth for actual free VRAM; the reservation is only for the in-tick race window between two scheduling decisions. * Probe path (reconciler.probeLoadedModels and health.doCheckAll) now pass the row's replica_index to RemoveNodeModel so an unreachable replica doesn't orphan healthy siblings. * Admin override: PUT /api/nodes/:id/max-replicas-per-model sets a sticky override (preserved across worker re-registration). DELETE clears the override so the worker's flag applies again on next register. Required because Kong defaults the worker flag to 1, so every worker restart would have silently reverted the UI value. * React UI: always-visible slot badge on the node row (muted at default 1, accented when >1); inline editor in the expanded drawer with pencil-to-edit, Save/Cancel, Esc/Enter, "(override)" indicator when the value is admin-set, and a "Reset" button to hand control back to the worker. Soft confirm when shrinking the cap below the count of loaded replicas. Scheduling rules table gets an "Unsatisfiable until HH:MM" status badge surfacing the cooldown. * node.replica-slots filtered out of the labels strip on the row to avoid duplicating the slot badge. 23 new Ginkgo specs (registry, reconciler, inflight, health) cover: multi-replica row independence, RemoveNodeModel of one replica preserving siblings, NextFreeReplicaIndex slot allocation including ErrNoFreeSlot, capacity-gated scale-up with circuit breaker tripping and recovery on Register, scheduleDownIdle ordering, ClusterCapacity math, ReserveVRAM admission gating, Heartbeat reset, override survival across worker re-registration, and ResetMaxReplicasPerModel handing control back. Plus 8 stdlib tests for the worker processKey / CLI / auto-label. Closes the flap reproduced on Qwen3.6-35B against the nvidia-thor worker (single 128 GiB node, MinReplicas=2): the reconciler now caps the scale-up at the cluster's actual capacity instead of looping. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Read] [Edit] [Bash] [Skill:critique] [Skill:audit] [Skill:polish] [Skill:golang-testing] * refactor(react-ui/nodes): tighten capacity editor copy + adopt ActionMenu for row actions * Capacity editor hint trimmed from operator-doc-style ("Sourced from the worker's `--max-replicas-per-model` flag. Changing it here makes it a sticky admin override that survives worker restarts." → "Saved values stick across worker restarts.") and the override-state copy similarly compressed. The full mechanic is no longer needed in the UI — the override pill carries the meaning and the docs cover the rest. * Node row actions migrated from an inline cluster of icon buttons (Drain / Resume / Trash) to the kebab ActionMenu used by /manage for per-row model actions, so dense Nodes tables stay clean. Approve stays as a prominent primary button — it's a stateful admission gate, not a routine action, and elevating it matches how /manage surfaces install-time decisions outside the menu. * The expanded drawer's Labels section now filters node.replica-slots out of the editable label list. The label is owned by the Capacity editor above; surfacing it again as an editable label invited confusion (the Capacity save would clobber any direct edit). Both backend and agent workers benefit — they share the row rendering path, so the action menu and label filter apply to both. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] [Skill:critique] [Skill:audit] [Skill:polish] * fix(react-ui/nodes): suppress slot badge on agent workers Agent workers don't load models, so the per-node replica capacity is inapplicable to them. Showing "1× slots" on agent rows was a tiny inconsistency from the unified rendering path — gate the badge on node_type !== 'agent' so it only appears on backend workers. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] * refactor(react-ui/nodes): distill expanded drawer + restyle scheduling form The expanded node drawer used to stack five panels — slot badge, filled capacity box, Loaded Models h4+empty-state, Installed Backends h4+empty-state, Labels h4+chips+form — making routine inspections feel like a control panel. The scheduling rule form wrapped its mode toggle as two 50%-width filled buttons that competed visually with the actual primary action. * Drawer: collapse three rarely-touched config zones (Capacity, Backends, Labels) into one `<details>` "Manage" disclosure (closed by default) with small uppercase eyebrow labels for each zone instead of parallel h4 sub-headings. Loaded Models stays as the at-a-glance headline with a single-line empty hint instead of a boxed empty state. CapacityEditor renders flat (no filled background) — the Manage disclosure provides framing. * Scheduling form: replace the chunky 50%-width button-tabs with the project's existing `.segmented` control (icon + label, sized to content). Mode hint becomes a single tied line below. Fields stack vertically with helper text under inputs and a hairline divider above the right-aligned Save / Cancel. The empty drawer collapses from ~5 stacked sections (~280px tall) to two lines (~80px). The scheduling form now reads as a designed dialog instead of raw building blocks. Both surfaces now match the typographic density and weight of the rest of the admin pages. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] [Skill:distill] [Skill:audit] [Skill:polish] * feat(react-ui/nodes): replace scheduling form's model picker with searchable combobox The native <select> made operators scroll through every gallery entry to find a model name. The project already has SearchableModelSelect (used in Studio/Talk/etc.) which combines free-text search with the gallery list and accepts typed model names that aren't installed yet — useful for pre-staging a scheduling rule before the node it'll run on has finished bootstrapping. Also drops the now-unused useModels import (the combobox manages the gallery hook internally). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] * refactor(react-ui/nodes): consolidate key/value chip editor + add replica preset chips The Nodes page was rendering the same key=value chip pattern in two places with subtly different markup: the Labels editor in the expanded drawer and (post-distill) the Node Selector input in the scheduling form. The form's input was also a comma-separated string that operators were getting wrong. * Extract <KeyValueChips> as a fully controlled chip-builder. Parent owns the map and decides what onAdd/onRemove does — form state for the scheduling form, API calls for the live drawer Labels editor. Same visuals everywhere; one component to change when polish needs apply. * Replace the comma-separated Node Selector text input with KeyValueChips. Operators were copying syntax from docs and missing commas; the chip vocabulary makes the key=value structure self-documenting. * Add <ReplicaInput>: numeric input + quick-pick preset chips for Min/Max replicas. Picked over a slider because replica counts are exact specs derived from VRAM math (operator decision, not a fuzzy estimate). The chips give one-click access to common values (1/2/3/4 for Min, 0=no-limit/2/4/8 for Max) without the slider's special-value problem (MaxReplicas=0 is categorical, not a position on a continuum). * Drop the now-unused labelInputs state in the Nodes page (the inline label editor's per-node draft state lived there and is now owned by KeyValueChips). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [Skill:distill] * test: fix CI fallout from multi-replica refactor (e2e/distributed + playwright) Two breakages caught by CI that didn't surface in the local run: * tests/e2e/distributed/*.go — multiple files used the pre-PR2 registry signatures for SetNodeModel / IncrementInFlight / DecrementInFlight / RemoveNodeModel / TouchNodeModel / GetNodeModel / SetNodeModelLoadInfo and one stale adapter.InstallBackend call in node_lifecycle_test.go. All updated to pass replicaIndex=0 — these tests don't exercise multi-replica behavior, they just need to compile against the new signatures. The chip-builder tests in core/services/nodes/ already cover the multi-replica logic. * core/http/react-ui/e2e/nodes-per-node-backend-actions.spec.js — the drawer's distill refactor moved Backends inside a "Manage" <details> disclosure that's collapsed by default. The test helper expanded the node row but never opened Manage, so the per-node backend table was never in the DOM. Helper now clicks `.node-manage > summary` after expanding the row. All 100 playwright tests pass locally; tests/e2e/distributed compiles clean. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [Bash] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
02bb715c0a |
fix(distributed): pass ExternalURI through NATS backend install (#9446)
When installing a backend with a custom OCI URI in distributed mode, the URI was captured in ManagementOp.ExternalURI by the HTTP handler but never forwarded to workers. BackendInstallRequest had no URI field, so workers fell through to the gallery lookup and failed with "no backend found with name <custom-name>". Add URI/Name/Alias fields to BackendInstallRequest and thread them from ManagementOp through DistributedBackendManager.InstallBackend() and the RemoteUnloaderAdapter. On the worker side, route to InstallExternalBackend when URI is set instead of InstallBackendFromGallery. Update all remaining InstallBackend call sites (UpgradeBackend, reconciler pending-op drain, router auto-install) to pass empty strings for the new params. Assisted-by: Claude Code:claude-sonnet-4-6 Signed-off-by: Russell Sim <rsl@simopolis.xyz> |
||
|
|
75a63f87d8 |
feat(distributed): sync state with frontends, better backend management reporting (#9426)
* fix(distributed): detect backend upgrades across worker nodes
Before this change `DistributedBackendManager.CheckUpgrades` delegated to the
local manager, which read backends from the frontend filesystem. In
distributed deployments the frontend has no backends installed locally —
they live on workers — so the upgrade-detection loop never ran and the UI
silently never surfaced upgrades even when the gallery advertised newer
versions or digests.
Worker-side: NATS backend.list reply now carries Version, URI and Digest
for each installed backend (read from metadata.json).
Frontend-side: DistributedBackendManager.ListBackends aggregates per-node
refs (name, status, version, digest) instead of deduping, and CheckUpgrades
feeds that aggregation into gallery.CheckUpgradesAgainst — a new entrypoint
factored out of CheckBackendUpgrades so both paths share the same core
logic.
Cluster drift policy: when per-node version/digest tuples disagree, the
backend is flagged upgradeable regardless of whether any single node
matches the gallery, and UpgradeInfo.NodeDrift enumerates the outliers so
operators can see *why* it is out of sync. The next upgrade-all realigns
the cluster.
Tests cover: drift detection, unanimous-match (no upgrade), and the
empty-installed-version path that the old distributed code silently
missed.
* feat(ui): surface backend upgrades in the System page
The System page (Manage.jsx) only showed updates as a tiny inline arrow,
so operators routinely missed them. Port the Backend Gallery's upgrade UX
so System speaks the same visual language:
- Yellow banner at the top of the Backends tab when upgrades are pending,
with an "Upgrade all" button (serial fan-out, matches the gallery) and a
"Updates only" filter toggle.
- Warning pill (↑ N) next to the tab label so the count is glanceable even
when the banner is scrolled out of view.
- Per-row labeled "Upgrade to vX.Y" button (replaces the icon-only button
that silently flipped semantics between Reinstall and Upgrade), plus an
"Update available" badge in the new Version column.
- New columns: Version (with upgrade + drift chips), Nodes (per-node
attribution badges for distributed mode, degrading to a compact
"on N nodes · M offline" chip above three nodes), Installed (relative
time).
- System backends render a "Protected" chip instead of a bare "—" so rows
still align and the reason is obvious.
- Delete uses the softer btn-danger-ghost so rows don't scream red; the
ConfirmDialog still owns the "are you sure".
The upgrade checker also needed the same per-worker fix as the previous
commit: NewUpgradeChecker now takes a BackendManager getter so its
periodic runs call the distributed CheckUpgrades (which asks workers)
instead of the empty frontend filesystem. Without this the /api/backends/
upgrades endpoint stayed empty in distributed mode even with the protocol
change in place.
New CSS primitives — .upgrade-banner, .tab-pill, .badge-row, .cell-stack,
.cell-mono, .cell-muted, .row-actions, .btn-danger-ghost — all live in
App.css so other pages can adopt them without duplicating styles.
* feat(ui): polish the Nodes page so it reads like a product
The Nodes page was the biggest visual liability in distributed mode.
Rework the main dashboard surfaces in place without changing behavior:
StatCards: uniform height (96px min), left accent bar colored by the
metric's semantic (success/warning/error/primary), icon lives in a
36x36 soft-tinted chip top-right, value is left-aligned and large.
Grid auto-fills so the row doesn't collapse on narrow viewports. This
replaces the previous thin-bordered boxes with inconsistent heights.
Table rows: expandable rows now show a chevron cue on the left (rotates
on expand) so users know rows open. Status cell became a dedicated chip
with an LED-style halo dot instead of a bare bullet. Action buttons gained
labels — "Approve", "Resume", "Drain" — so the icons aren't doing all
the semantic work; the destructive remove action uses the softer
btn-danger-ghost variant so rows don't scream red, with the ConfirmDialog
still owning the real "are you sure". Applied cell-mono/cell-muted
utility classes so label chips and addresses share one spacing/font
grammar instead of re-declaring inline styles everywhere.
Expanded drawer: empty states for Loaded Models and Installed Backends
now render as a proper drawer-empty card (dashed border, icon, one-line
hint) instead of a plain muted string that read like broken formatting.
Tabs: three inline-styled buttons became the shared .tab class so they
inherit focus ring, hover state, and the rest of the design system —
matches the System page.
"Add more workers" toggle turned into a .nodes-add-worker dashed-border
button labelled "Register a new worker" (action voice) instead of a
chevron + muted link that operators kept mistaking for broken text.
New shared CSS primitives carry over to other pages:
.stat-grid + .stat-card, .row-chevron, .node-status, .drawer-empty,
.nodes-add-worker.
* feat(distributed): durable backend fan-out + state reconciliation
Two connected problems handled together:
1) Backend delete/install/upgrade used to silently skip non-healthy nodes,
so a delete during an outage left a zombie on the offline node once it
returned. The fan-out now records intent in a new pending_backend_ops
table before attempting the NATS round-trip. Currently-healthy nodes
get an immediate attempt; everyone else is queued. Unique index on
(node_id, backend, op) means reissuing the same operation refreshes
next_retry_at instead of stacking duplicates.
2) Loaded-model state could drift from reality: a worker OOM'd, got
killed, or restarted a backend process would leave a node_models row
claiming the model was still loaded, feeding ghost entries into the
/api/nodes/models listing and the router's scheduling decisions.
The existing ReplicaReconciler gains two new passes that run under a
fresh KeyStateReconciler advisory lock (non-blocking, so one wedged
frontend doesn't freeze the cluster):
- drainPendingBackendOps: retries queued ops whose next_retry_at has
passed on currently-healthy nodes. Success deletes the row; failure
bumps attempts and pushes next_retry_at out with exponential backoff
(30s → 15m cap). ErrNoResponders also marks the node unhealthy.
- probeLoadedModels: gRPC-HealthChecks addresses the DB thinks are
loaded but hasn't seen touched in the last probeStaleAfter (2m).
Unreachable addresses are removed from the registry. A pluggable
ModelProber lets tests substitute a fake without standing up gRPC.
DistributedBackendManager exposes DeleteBackendDetailed so the HTTP
handler can surface per-node outcomes ("2 succeeded, 1 queued") to the
UI in a follow-up commit; the existing DeleteBackend still returns
error-only for callers that don't care about node breakdown.
Multi-frontend safety: the state pass uses advisorylock.TryWithLockCtx
on a new key so N frontends coordinate — the same pattern the health
monitor and replica reconciler already rely on. Single-node mode runs
both passes inline (adapter is nil, state drain is a no-op).
Tests cover the upsert semantics, backoff math, the probe removing an
unreachable model but keeping a reachable one, and filtering by
probeStaleAfter.
* feat(ui): show cluster distribution of models in the System page
When a frontend restarted in distributed mode, models that workers had
already loaded weren't visible until the operator clicked into each node
manually — the /api/models/capabilities endpoint only knew about
configs on the frontend's filesystem, not the registry-backed truth.
/api/models/capabilities now joins in ListAllLoadedModels() when the
registry is active, returning loaded_on[] with node id/name/state/status
for each model. Models that live in the registry but lack a local config
(the actual ghosts, not recovered from the frontend's file cache) still
surface with source="registry-only" so operators can see and persist
them; without that emission they'd be invisible to this frontend.
Manage → Models replaces the old Running/Idle pill with a distribution
cell that lists the first three nodes the model is loaded on as chips
colored by state (green loaded, blue loading, amber anything else). On
wider clusters the remaining count collapses into a +N chip with a
title-attribute breakdown. Disabled / single-node behavior unchanged.
Adopted models get an extra "Adopted" ghost-icon chip with hover copy
explaining what it means and how to make it permanent.
Distributed mode also enables a 10s auto-refresh and a "Last synced Xs
ago" indicator next to the Update button so ghost rows drop off within
one reconcile tick after their owning process dies. Non-distributed
mode is untouched — no polling, no cell-stack, same old Running/Idle.
* feat(ui): NodeDistributionChip — shared per-node attribution component
Large clusters were going to break the Manage → Backends Nodes column:
the old inline logic rendered every node as a badge and would shred the
layout at >10 workers, plus the Manage → Models distribution cell had
copy-pasted its own slightly-different version.
NodeDistributionChip handles any cluster size with two render modes:
- small (≤3 nodes): inline chips of node names, colored by health.
- large: a single "on N nodes · M offline · K drift" summary chip;
clicking opens a Popover with a per-node table (name, status,
version, digest for backends; name, status, state for models).
Drift counting mirrors the backend's summarizeNodeDrift so the UI
number matches UpgradeInfo.NodeDrift. Digests are truncated to the
docker-style 12-char form with the full value preserved in the title.
Popover is a new general-purpose primitive: fixed positioning anchored
to the trigger, flips above when there's no room below, closes on
outside-click or Escape, returns focus to the trigger. Uses .card as
its surface so theming is inherited. Also useful for a future
labels-editor popup and the user menu.
Manage.jsx drops its duplicated inline Nodes-column + loaded_on cell
and uses the shared chip with context="backends" / "models"
respectively. Delete code removes ~40 lines of ad-hoc logic.
* feat(ui): shared FilterBar across the System page tabs
The Backends gallery had a nice search + chip + toggle strip; the System
page had nothing, so the two surfaces felt like different apps. Lift the
pattern into a reusable FilterBar and wire both System tabs through it.
New component core/http/react-ui/src/components/FilterBar.jsx renders a
search input, a role="tablist" chip row (aria-selected for a11y), and
optional toggles / right slot. Chips support an optional `count` which
the System page uses to show "User 3", "Updates 1" etc.
System Models tab: search by id or backend; chips for
All/Running/Idle/Disabled/Pinned plus a conditional Distributed chip in
distributed mode. "Last synced" + Update button live in the right slot.
System Backends tab: search by name/alias/meta-backend-for; chips for
All/User/System/Meta plus conditional Updates / Offline-nodes chips
when relevant. The old ad-hoc "Updates only" toggle from the upgrade
banner folded into the Updates chip — one source of truth for that
filter. Offline chip only appears in distributed mode when at least
one backend has an unhealthy node, so the chip row stays quiet on
healthy clusters.
Filter state persists in URL query params (mq/mf/bq/bf) so deep links
and tab switches keep the operator's filter context instead of
resetting every time.
Also adds an "Adopted" distribution path: when a model in
/api/models/capabilities carries source="registry-only" (discovered on
a worker but not configured locally), the Models tab shows a ghost chip
labelled "Adopted" with hover copy explaining how to persist it — this
is what closes the loop on the ghost-model story end-to-end.
|
||
|
|
59108fbe32 |
feat: add distributed mode (#9124)
* feat: add distributed mode (experimental) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix data races, mutexes, transactions Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix events and tool stream in agent chat Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * use ginkgo Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(cron): compute correctly time boundaries avoiding re-triggering Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * enhancements, refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * do not flood of healthy checks Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * do not list obvious backends as text backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * tests fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop redundant healthcheck Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * enhancements, refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |