mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-14 11:49:33 -04:00
4bb592cf91ebd33f342eee2dcaf559e8daca0e71
11 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
a906438a69 |
fix(config): backend-gate the top_k=40 sampler default (#6632) (#10285)
fix(config): gate top_k=40 default on backend family (#6632) SetDefaults injected top_k=40 (llama.cpp's sampling default) for every model config regardless of backend. That value is wrong for backends whose native default differs: mlx_lm's intended default is top_k=0 (disabled) and mlx does not remap 0->40, so a client that omits top_k silently got 40 shipped to mlx, changing sampling. The mlx backend's own getattr(request,'TopK',0) fallback is dead because proto3 int32 is always present. Gate the injection on backend family via UsesLlamaSamplerDefaults: keep top_k=40 for the llama.cpp family and for the empty/auto backend (the GGUF auto-detect path resolves to llama.cpp, so existing behavior is preserved), but leave TopK nil for the known non-llama backends (mlx, mlx-vlm, mlx-distributed). gRPCPredictOpts now sends 0 when TopK is nil, which is the value mlx actually wants. Only TopK is gated - the confirmed bug. The sibling sampler defaults (top_p, temperature, min_p) are left global to avoid widening scope and introducing nil-deref risk; revisit per-backend if needed. Assisted-by: claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
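The gating described in this commit can be sketched as below. This is a minimal illustration, not the project's code: `ModelConfig` is pared down to two fields, `usesLlamaSamplerDefaults`, `setDefaults`, and `wireTopK` are hypothetical stand-ins for `UsesLlamaSamplerDefaults`, `SetDefaults`, and the `gRPCPredictOpts` behaviour, and the deny-list treats every backend not named in the message as llama-style, which the real predicate may not do.

```go
package main

import "fmt"

// ModelConfig is a simplified stand-in for the real model config; only the
// fields needed for this illustration are included (assumption).
type ModelConfig struct {
	Backend string
	TopK    *int // nil means neither the client nor the config set top_k
}

// nonLlamaBackends lists the backends the commit message names as having a
// native top_k default that differs from llama.cpp's.
var nonLlamaBackends = map[string]bool{
	"mlx":             true,
	"mlx-vlm":         true,
	"mlx-distributed": true,
}

// usesLlamaSamplerDefaults is a hypothetical version of the gate: the empty
// (auto-detect) backend and anything outside the deny-list keep top_k=40.
func usesLlamaSamplerDefaults(backend string) bool {
	return !nonLlamaBackends[backend]
}

// setDefaults injects top_k=40 only when the value is unset and the gate allows it.
func setDefaults(c *ModelConfig) {
	if c.TopK == nil && usesLlamaSamplerDefaults(c.Backend) {
		k := 40
		c.TopK = &k
	}
}

// wireTopK mirrors the described request-building step: a nil TopK is sent as 0.
func wireTopK(c ModelConfig) int32 {
	if c.TopK == nil {
		return 0
	}
	return int32(*c.TopK)
}

func main() {
	for _, backend := range []string{"", "llama-cpp", "mlx"} {
		c := ModelConfig{Backend: backend}
		setDefaults(&c)
		fmt.Printf("backend=%q top_k on the wire=%d\n", backend, wireTopK(c))
	}
}
```

Keeping `TopK` as a pointer is what lets the sketch tell "unset" apart from an explicit 0, which a plain proto3 `int32` cannot express.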
||
|
|
085fc53bbc |
fix(router): production-ready request router + auto-size batch for embedding/rerank (#10104)
* fix(router): score classifier production-readiness Conversation trimming runs through the classifier model's chat template and trims by exact token count, sized to the model's n_batch which is now scaled to context so long probes can't crash the backend. Missing chat_message templates are a hard error at router build time. Router-facing factories (Embedder/Scorer/Reranker/TokenCounter) re-resolve ModelConfig per call so a model installed post-startup doesn't bind a stub Backend="" config and silently fall into the loader's auto-iterate path. New 'vector_store' backend trace recorded inside localVectorStore on every Search/Insert, including the backend-load-failure path that previously vanished into an xlog.Warn, with outcome tagging (hit/miss/empty_store/backend_load_error/find_error/insert_error/ok). Companion cleanup drops misleading similarity:0 and input_tokens_count:0 from non-hit and text-mode traces. Gallery local-store-development aliases to 'local-store' so the master image satisfies pkg/model.LocalStoreBackend lookups from the embedding cache. Misc: llama-cpp TokenizeString reads the correct 'prompt' JSON key (the original bug); ModelTokenize nil-guard; non-fatal mitm proxy startup; PII 'route_local' renamed to 'allow' with docs/UI in sync; model-editor footer no longer eats the edit area on small screens; several config-editor template/dropdown/section fixes. Tests: e2e router specs (casual/code-hint + long-conversation trim), vector_store trace specs, lazy-factory specs, gallery dev-alias resolution, Playwright trace badge + scroll regression. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(backend): auto-size batch to context for embedding and rerank models Embedding and rerank models pool over the whole input in a single physical batch (n_ubatch).
With batch left at the 512 default, the backend rejects longer inputs with "input is too large to process", silently capping a large-context embedder (e.g. 8k/32k) at 512 tokens. Size n_batch to the context for these single-pass usecases, mirroring the existing FLAG_SCORE behaviour; an explicit batch: still wins. Extracts EffectiveContextSize/EffectiveBatchSize from grpcModelOpts so the effective decode window has one home for other callers to reuse. Adds an e2e-aio regression test that embeds a >512-token input. The AIO embedding model is switched to nomic-embed-text-v1.5 (2048 context) because the previous granite model was capped at 512 tokens and could not exercise the larger batch. Assisted-by: claude-code:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(gallery): raise arch-router scoring output cap via parallel:64 Scoring decodes the whole prompt+candidate in a single llama_decode and reads one logit row per candidate token. The vendored llama.cpp server caps causal output rows at n_parallel, so the default of 1 aborts with GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) on multi-token route labels. Set options: [parallel:64] on both arch-router quant entries to lift the cap; kv_unified (the grpc-server default) keeps the full context per sequence, so this does not split the KV cache. Assisted-by: claude-code:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com> |
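The batch-sizing rule from the auto-size commit can be sketched as follows. The `modelOpts` struct and the `effectiveBatchSize` function are hypothetical simplifications of the `EffectiveBatchSize` helper the message mentions. The `SinglePass` flag stands in for the embedding, rerank, and scoring use cases, and a batch value of 0 is assumed to mean "not set".

```go
package main

import "fmt"

// defaultBatch is the 512-token default named in the commit message.
const defaultBatch = 512

// modelOpts is a hypothetical, pared-down view of a model config.
type modelOpts struct {
	ContextSize int  // effective context window in tokens
	Batch       int  // explicit batch setting; 0 is assumed to mean unset
	SinglePass  bool // true for embedding, rerank, or scoring use cases
}

// effectiveBatchSize sketches the described precedence: an explicit batch
// wins, single-pass models get a batch as large as the context, and every
// other model keeps the default.
func effectiveBatchSize(o modelOpts) int {
	switch {
	case o.Batch > 0:
		return o.Batch
	case o.SinglePass && o.ContextSize > 0:
		return o.ContextSize
	default:
		return defaultBatch
	}
}

func main() {
	cases := []modelOpts{
		{ContextSize: 8192, SinglePass: true},              // large-context embedder
		{ContextSize: 8192, SinglePass: true, Batch: 1024}, // explicit batch set
		{ContextSize: 8192},                                // ordinary chat model
	}
	for _, o := range cases {
		fmt.Printf("%+v -> n_batch=%d\n", o, effectiveBatchSize(o))
	}
}
```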
||
|
|
a44bdb29d4 |
feat: prefix-cache-aware routing for distributed mode (#10071)
* feat(radixtree): generic prefix tree skeleton with longest-match Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): Insert with path recency refresh and entry cap Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): TTL idle-expiry and Evict sweep with branch pruning Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): recency-weighted per-value Weight Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): Remove all entries for a value Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(radixtree): race-free concurrency smoke test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(radixtree): reclaim empty branches, RWMutex reads, TTL boundary, empty-key guard Address review findings on the generic prefix tree: - Extract a shared pruneWalk helper parameterized by a shouldClear predicate and use it from Evict, Remove, and the MaxEntries path. Previously evictOldestLocked cleared a victim's value but never removed the now value-less node or its childless ancestors, so internal nodes accumulated under sustained churn at the cap. The MaxEntries path now prunes the victim and its empty ancestors. - DRY: pruneWalk replaces the duplicated logic in the former pruneLocked and Remove's inner closure. - Switch Tree.mu to sync.RWMutex; LongestMatch, Weight and Len take the read lock (RLock) while Insert, Evict and Remove keep the write lock. Confirmed race-clean under go test -race. - Document the strict greater-than TTL boundary on Options.TTL and expired: age exactly equal to TTL is still live. - Guard Insert against an empty key (no-op): the root never holds a value. Adds Ginkgo specs covering MaxEntries eviction, ancestor reclamation, the no-growth-past-cap invariant, the TTL boundary, and empty-key behavior for both Insert and LongestMatch. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): RoutePolicy enum with parse/resolve Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): Config with defaults and validation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): deterministic xxhash prefix-chain extractor Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): pure filter-then-score replica selection Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): Provider interface and radix-tree-backed Index Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * style(prefixcache): gofmt policy enum comment alignment Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): head-first prefix chunking and hoist Weight out of sort Address code-quality review findings in the prefixcache package. Correctness: ExtractChain now chunks from absolute offset 0 with fixed [0,W),[W,2W),... boundaries and caps the chain to the FIRST MaxDepth head blocks. The previous tail-keeping logic shifted the byte offset by a non-window amount once a conversation grew past MaxDepth*WindowBytes, changing every hash each turn and silently breaking cross-turn longest-prefix matching. The reusable KV/prefix cache lives at the head of the prompt, so anchoring at offset 0 makes the chain a true prefix-chain: P and P+suffix share their full leading overlap. Add a regression spec proving cross-turn stability past the cap. Performance: Index.Decide precomputes each candidate's Weight once (decorate-sort-undecorate) instead of calling the O(tree size) Weight inside the O(n log n) sort comparator. Behavior is unchanged. Lint: encode prev with binary.LittleEndian.PutUint64 instead of a manual byte loop, clearing the modernize rangeint finding. Also add a concurrent Decide/Observe/Invalidate spec to exercise Index's documented concurrency safety under go test -race. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(messaging): prefixcache observe/invalidate subjects and payloads Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): NATS sync publish/apply for observe and invalidate Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributedhdr): ctx carrier for prefix-hash chain Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributedhdr): PrefixChainHook indirection for backend-side chain build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend): stash prompt prefix chain on ctx before distributed routing Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend): mirror modelID fallback for prefix-chain salt parity Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): scheduling config columns for prefix-cache routing Add RoutePolicy and per-model balance/prefix-match override columns to ModelSchedulingConfig and include them in the SetModelScheduling upsert DoUpdates list so updates are not dropped on conflict. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): optional route preference in FindAndLockNodeWithModel Add a RoutePreference type and a new pref parameter so the atomic pick+lock+increment can be biased toward a preferred node without weakening atomicity. A nil preference reproduces the previous ORDER BY behavior exactly. Update the ModelRouter interface, both router.go call sites (pass nil for now; Phase 5 builds the real preference), the test doubles, and the distributed e2e caller. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): make Sync satisfy Provider with Evict Sync.Observe now returns whether the local index treated the assignment as new or extended, and Sync gains an Evict method that delegates to the wrapped index. Together these let SmartRouter hold a single prefixcache.Provider that broadcasts via NATS. 
Adds a compile-time Provider assertion and an Evict-delegates behavioral test. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): prefix-cache-aware preference and observe in SmartRouter.Route Add a PrefixProvider + PrefixConfig to SmartRouterOptions/SmartRouter (nil keeps routing byte-for-byte the round-robin floor). On each request Route now calls buildPreference: it reads the prompt prefix chain from ctx (distributedhdr.PrefixChain), resolves the per-model policy/thresholds over the global config, loads candidate replica in-flight via a new registry read LoadedReplicaStats (deduped to one entry per node using the MIN in-flight across that node's replicas), asks the provider to Decide, and runs prefixcache.Select. The chosen node is passed as the RoutePreference to FindAndLockNodeWithModel on all three pick paths (cache hit, locked re-pick, cold scheduleAndLoad), and the served node is recorded via Observe only when the resolved policy is prefix_cache so round-robin models never pollute the tree. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): invalidate prefix-cache entries on unload and stale removal UnloadModel and both staleness fall-through paths in Route (after a failed gRPC probe and RemoveNodeModel) now call prefixProvider.Invalidate(model, nodeID), guarded by a nil-provider check so the round-robin floor is unchanged. At runtime the provider is the *prefixcache.Sync, so invalidations also broadcast to peer frontends. Adds a test that a previously hot prefix no longer Decides to a node after UnloadModel. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): rolling forced-disturb pressure counter Add a concurrency-safe per-model rolling counter that tracks how many times a request had a usable hot prefix match but the load guard forced it off the warm node. Entries outside the window are dropped lazily on Count so the backing slice stays bounded. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): autoscale on prefix-cache forced-disturb pressure Wire the rolling forced-disturb counter into the SmartRouter and the ReplicaReconciler. Router: in buildPreference, after Decide + Select, record a forced-disturb when a usable hot prefix match existed (d.HotNodeID != "" and d.MatchRatio >= cfg.MinPrefixMatch) but Select chose a different node (or nothing) because the load guard ruled the warm node out. This is the scale-worthy signal: the cache-warm replica is saturated. It deliberately does not fire for all-unique workloads (no hot match), avoiding false-positive scale-ups. Pressure is optional on SmartRouterOptions; nil keeps the path a no-op. Reconciler: read the same Pressure instance in reconcileModel as an extra scale-up reason, reusing the existing MaxReplicas + ClusterCapacityForModel guards and the UnsatisfiableUntil cooldown that gates the whole method. Pressure never overrides MaxReplicas and never force-evicts; a no-capacity model does not spin. Window and threshold come from prefixcache.Config (PressureWindow default 1m, PressureScaleThreshold default 1) and are configurable via ReplicaReconcilerOptions. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): bound Pressure slice in Record; drop dead reconciler pressureWindow Record now prunes entries older than the rolling window (the same prune Count does), via a shared pruneLocked helper, so a model that takes forced-disturb records but is never Counted (e.g. one with zero loaded replicas the reconciler skips) no longer grows its backing slice unbounded. Also removes the dead pressureWindow struct field and the ReplicaReconcilerOptions.PressureWindow option from the reconciler: they were stored but never read (the window lives inside the *prefixcache.Pressure instance). The scale block now reads pressure.Count once into a local. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(api): prefix-cache fields in scheduling endpoint DTO with validation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ui): prefix-cache routing controls in node scheduling form Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): wire prefix-cache index, NATS sync, and config Activates prefix-cache-aware routing in distributed mode. Builds the prefixcache Index + NATS-backed Sync + Pressure counter, installs the distributedhdr.PrefixChainHook so core/backend/llm.go attaches a prefix chain per request, subscribes to prefixcache.observe/prefixcache.invalidate to apply peers' events to the local index (no re-broadcast), threads PrefixProvider/PrefixConfig/Pressure into the SmartRouter and Pressure/PressureThreshold into the ReplicaReconciler, and runs a background eviction ticker (every TTL/2) bound to the app context. Enabled by default; --distributed-prefix-cache=false (LOCALAI_DISTRIBUTED_PREFIX_CACHE) opts out and leaves the provider/pressure nil so routing stays round-robin. --distributed-prefix-cache-ttl (LOCALAI_DISTRIBUTED_PREFIX_CACHE_TTL, default 5m) controls entry idle-timeout and eviction cadence. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(nodes): round-robin-floor invariant for prefix-cache routing Drives Select directly: a saturated hot node (in_flight 50 vs 0) is never picked even with a perfect prefix match (round-robin floor holds), while a balanced hot node within the load slack is reused. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(prefixcache): clear branch lint findings and em dashes Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): validate prefix-cache config at startup wiring Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * perf(radixtree): single-walk WeightsFor for batch value weights Add Tree.WeightsFor(values, now) which computes the recency-weighted weight for many values in a single O(N + len(values)) tree traversal, versus calling Weight once per value (O(len(values) * N)). Consumers that score K candidates against the tree under the read lock no longer pay K full walks. Extract the per-entry contribution math into an unexported helper shared by both Weight and WeightsFor so the metric stays identical (DRY). Weight's public behavior is unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(config): add ModelConfig.ModelID() single source of truth The c.Name fallback to c.Model was duplicated in core/backend/options.go (feeding model.WithModelID) and hand-copied into core/backend/llm.go (the prefix-chain salt). These MUST agree or the prefix-cache salt diverges silently from the id the model loader tracks. Consolidate both into a new config.ModelConfig.ModelID() helper and call it from both sites. Behavior is identical. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * perf(prefixcache): reuse one xxhash.Digest in ExtractChain ExtractChain allocated a fresh xxhash.New() Digest per block (up to MaxDepth per call) and grew the chain slice without preallocation. Reuse a single Digest via Reset() before each block and preallocate the chain to min(nBlocks, MaxDepth). xxhash seed 0 is stateless, so Reset()+Write produces the byte-identical value to a fresh New()+Write. Output hashes are unchanged, preserving the cross-process determinism that peers rely on over NATS. 
Verified by capturing ExtractChain output for the existing test inputs before and after the refactor: identical. Existing extractor tests pass unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): drop hot match when matched node is not a candidate; weigh cold candidates in one walk Index.Decide called radixtree.LongestMatch over the whole tree, so the deepest match could be a node that is offline, unloaded, or simply not in the passed candidate set. Honoring that as HotNodeID produced a false forced-disturb signal upstream (buildPreference records pressure when chosen != HotNodeID), making it look like a warm replica was load saturated when it was actually absent. Build the candidate set once and only set HotNodeID/MatchRatio when the matched node is an actual candidate; otherwise fall back to cold placement. A future refinement could ask the tree for the longest match restricted to the candidate nodes (shallower-but-valid) instead of dropping it. Also replace the per-candidate tree.Weight call in the cold-order sort with a single tree.WeightsFor walk, turning O(K*N) under the read lock into O(N + K). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(prefixcache): remove Select's unreachable deterministic fallback buildPreference always passes ColdOrder as a permutation of the full candidate set, so the cold-order loop hits every eligible candidate. The trailing best/bestIF scan was dead. Replace it with a plain "return """ and document that ColdOrder is guaranteed to cover all candidates, so "" means none were eligible. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(nodes): fetch model scheduling config once per Route GetModelScheduling was read three times per request - in resolveSelectorCandidates, buildPreference, and nodeMatchesScheduling - three DB round-trips for one row that is immutable for the life of the request, and not a consistent snapshot. 
Fetch it once near the top of Route and thread the *ModelSchedulingConfig (may be nil) into all three helpers. scheduleNewModel keeps its own fetch since it runs outside the Route snapshot. Behavior is identical for nil sched. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(autoscale): add Pressure.Reset to consume forced-disturb signal Pressure.Count is non-draining (it prunes only by age), so a single burst of forced-disturbs stays within the rolling window for the whole window and keeps Count >= threshold on every reconciler tick. The reconciler will use Reset to clear a model's events after acting on the signal so a fresh scale-up requires fresh forced-disturbs to accumulate, rather than one burst driving the model toward MaxReplicas. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(autoscale): at most one scale-up per reconcile tick, consume pressure Two autoscale bugs: 1. Over-scaling: the pressure scale-up block read Pressure.Count but never consumed it. With a non-draining counter a single forced-disturb burst kept Count >= threshold across the whole window, firing scaleUp on every tick and pushing the model toward MaxReplicas off one transient burst. After a successful pressure-triggered scale-up the reconciler now calls Pressure.Reset to consume the signal. 2. Double scale-up in one tick: the all-replicas-busy block and the pressure block could both fire in the same reconcileModel pass, each calling scaleUp(+1) against the same `current` read once at the top, so a model that was both busy and over threshold scaled +2 and could overshoot MaxReplicas by one. A scaledUp flag now enforces at most one scaleUp(+1) per tick: the pressure block is skipped if the busy block already scaled, and scale-down is skipped in any tick that scaled up. MinReplicas enforcement, UnsatisfiableUntil backoff, and capacity guards are unchanged. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): replica-removed chokepoint hook for prefix-cache invalidation Add SetReplicaRemovedHook to NodeRegistry and fire it from both RemoveNodeModel and RemoveAllNodeModelReplicas after a successful delete. This is the single chokepoint every replica-removal path funnels through (router eviction, reconciler scale-down, probe reaper, health-monitor node-down reap, RemoteUnloaderAdapter), so the prefix-cache index can be invalidated by construction rather than wiring each call site individually. The hook is stored in an atomic.Pointer so the startup wiring (setter) and the request/reconcile-time fire are race-free; it is nil-safe when unset. GORM Delete reports no error for a no-op delete, so the hook also fires when nothing was removed; the consumer's Invalidate(model, node) is idempotent so this is harmless. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): invalidate prefix-cache on any replica removal via registry hook Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(prefixcache): single source of truth for threshold bounds Extract ValidateThresholds into prefixcache/config.go so the per-model override validation (nodes.go endpoint) and Config.Validate share one implementation of the numeric bounds (min_prefix_match in [0,1], balance_abs_threshold >= 0, balance_rel_threshold == 0-or->= 1) instead of hard-coding them in two places. The route_policy allow-list stays explicit (not ParsePolicy, which maps typos to Default). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(nodes): preserve prefix-cache settings on partial scheduling update A scheduling POST that omitted route_policy/thresholds (e.g. a min_replicas-only update) full-replaced every column and silently reset the model's previously-configured prefix-cache settings to empty/zero. 
Make the four prefix-cache request fields pointers so omitted is distinguishable from explicit zero, and merge PATCH-style in SetSchedulingEndpoint: a provided pointer wins, an omitted one preserves the existing config value (zero default when none). Non-prefix fields keep their full-replace PUT semantics. Validation now runs on the resolved values via prefixcache.ValidateThresholds. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): make Invalidate a no-op for uncached models and skip empty broadcasts A registry chokepoint fires Sync.Invalidate(model, nodeID) for every replica removal of every model, including round-robin models that never used the prefix cache. Index.Invalidate previously called tree(model), which lazily created and permanently retained an empty radix tree for any model that ever lost a replica, growing the trees map without bound. Sync.Invalidate also published a NATS PrefixCacheInvalidateEvent on every call, amplifying no-op removals across the cluster. Index.Invalidate now looks the tree up read-only via existingTree and returns without allocating when none exists. The Provider interface is unchanged; Sync gates the broadcast through an optional invalidateExisting(bool) capability type-asserted from the wrapped Index, falling back to the prior always-broadcast behavior for other Provider implementations. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * perf(prefixcache): derive Decide candidacy from WeightsFor and skip trivial sort WeightsFor already returns a map keyed by every requested candidate, so the separate candidates set built to validate the hot match was redundant: a node is a candidate iff it is a key in the weights map. Drop the extra map and gate the hot-match check on weights membership. Also skip the sort when there is at most one candidate, since the input order is already the cold order. Behavior is unchanged. 
Deferred follow-up: skipping the WeightsFor walk entirely when a hot match wins would need lazy cross-file changes and is out of scope here. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(nodes): fire replica-removed hook on bulk node_models deletes; trim LoadedReplicaStats columns Bulk node-scoped node_models deletes (Register re-register cleanup, MarkOffline, MarkDraining, Deregister) removed rows directly without firing the replica-removed hook, so the prefix-cache index kept pointing at nodes whose models were gone. Capture the DISTINCT model names before each bulk delete and fire fireReplicaRemoved once per model after a successful delete, restoring the single-chokepoint invariant for all removal paths. The pre-query is skipped when no hook is set so the no-hook path stays cheap. Also narrow LoadedReplicaStats to SELECT only node_id and in_flight (the only fields the router consumer reads), dropping the JOIN-side available_vram fetch and unused columns while keeping the []ReplicaCandidate return type unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(reconciler): consume autoscale signals only on a real scale-up scaleUp was fire-and-forget (void) yet its callers unconditionally consumed the pressure signal (Pressure.Reset) and the MinReplicas hysteresis (ClearUnsatisfiable) right after calling it. If scaleUp added nothing (ScheduleAndLoadModel errored, or no node could be loaded) the saturated warm replica got no new replica AND its accumulated forced-disturb history was wiped, forcing the signal to re-accumulate over a full PressureWindow before the next attempt. Make scaleUp return whether at least one replica was actually scheduled, and gate the side effects on it: - pressure block (2b): set scaledUp and call Pressure.Reset only on success; on failure preserve the signal so the next tick retries off the same accumulated pressure. 
- busy-burst block (2): set scaledUp from the return value so a failed attempt does not suppress the pressure path or scale-down. - MinReplicas block: call ClearUnsatisfiable only on success so a failed attempt does not reset the unsatisfiable counter. All existing invariants (MaxReplicas, capacity gating, UnsatisfiableUntil cooldown, at-most-one-scale-up-per-tick) are preserved. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(nodes): drop router's redundant prefix-cache Invalidate calls The NodeRegistry removal chokepoint (RemoveNodeModel / RemoveAllNodeModelReplicas) now fires SetReplicaRemovedHook, which invalidates the prefix-cache index. The router was also calling prefixProvider.Invalidate explicitly right after each registry removal on the two stale-replica health-probe fall-throughs in Route and in UnloadModel, so every router-side eviction invalidated twice (double tree-prune + double NATS broadcast). Remove the three redundant explicit Invalidate calls and their empty nil-guards. Each removed call sat immediately after a registry removal that fires the hook, so invalidation is preserved via the chokepoint. Decide/Observe usage is untouched. Re-point the unit test (fake registry fires no hook) to assert the removal chokepoint is exercised on unload instead of the router's direct invalidation. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): broadcast invalidations unconditionally for cross-frontend coherence Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): reject TTL<=0 in Config.Validate (eviction ticker would panic) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(nodes): make capture+delete atomic in bulk node_models removal paths MarkOffline, MarkDraining, and the Register re-register cleanup ran the nodeModelNames SELECT and the bulk node_models DELETE as two separate statements on r.db with no transaction. 
A SetNodeModel landing between the two was deleted but its replica-removed hook never fired, leaving the prefix-cache index pointing at a removed replica until TTL or candidacy self-heal. Wrap the capture and the delete in a single db.Transaction in each path (mirroring how Deregister already does it). The captured model names are collected into a slice declared outside the closure; the replica-removed hook fires for each only after the transaction commits, so a rollback never invalidates the index for a removal that did not persist. The set of fired hooks now equals exactly the set of node_models rows actually deleted, with no interleaving gap. The status flip in MarkOffline/MarkDraining (setStatus) is a separate, pre-existing operation and routing already filters non-healthy nodes, so it stays outside the transaction; return contracts are unchanged. Deregister was already correct and is untouched. The cheap-path skip (no hook -> skip the SELECT) is preserved. Adds a spec asserting MarkOffline fires hooks for exactly the rows it deletes and leaves no node_models row behind (consistent snapshot). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(nodes): debug logging for prefix-cache routing decisions and observations Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(radixtree): match shared prefixes by valuing every node on insert Insert recorded the value (node id) only on the final node of the key chain, leaving every intermediate prefix node valueless. LongestMatch returns the deepest node that hasValue, so two chains that share a leading block but diverge in the tail never matched: only exact-repeat queries hit. That broke the prefix-cache routing core use cases (shared system prompt, multi-turn extension, volatile tail), all of which rely on prefix matching rather than exact-repeat. Set value/hasValue/lastSeen at every node along the chain so each prefix-block node remembers the node id that served that prefix (SGLang/vLLM-style). 
The deepest match wins, and the last writer owns a shared prefix node (a recency heuristic: the most recent chain through a block is the one most likely still warm). size now counts valued nodes, which is the intended meaning. Updated radixtree tests to the new semantics: deepest-prefix test uses non-overlapping chains, a new test asserts last-writer-owns-shared-node, Evict/Remove/MaxEntries expectations recomputed for per-prefix-node counting, and a shared-prefix LongestMatch red test added. Added a prefixcache Decide test proving a prefix-only query routes to the warm node. No prefixcache .go logic changed. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(distributed): lock in prefix-cache routing behavior end to end Add a DB-backed e2e spec that drives SmartRouter against a real NodeRegistry (Postgres testcontainer) and the real prefixcache.Index radix-tree provider, using a fake gRPC backend factory so no real inference runs. Covers the five behaviors validated by hand: 1. Cold miss + observe: an unseen prefix chain cold-places and is recorded. 2. Hot-match affinity: the same chain returns to its warm node X. 3. Shared-prefix match: a divergent chain sharing X's leading prefix still routes to X (the radix-tree regression we fixed). 4. Negative control: an unrelated chain is a cold miss, not a false hot match on X. 5. Failover + invalidation: removing X's replica fires the registry chokepoint hook to invalidate the prefix entry, and the chain fails over to surviving node Y and re-homes there. Replaces the need for manual docker-compose re-runs. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(prefixcache): make prefix-cache affinity replica-granular Track prefix-cache affinity per loaded replica (a backend process with its own KV cache) instead of per node, so multiple replicas of the same model on one node each keep distinct affinity and a hot prefix routes back to the exact replica that served it. 
- radixtree: add RemoveFunc(pred) and reimplement Remove on top of it.
- prefixcache: introduce ReplicaKey{NodeID, Replica}; Index/Candidate/PrefixDecision/Select/Provider now key on ReplicaKey. Add InvalidateNode to drop every replica of a node; Invalidate drops one replica. Select returns (ReplicaKey, bool) and gains a deterministic least-in-flight eligible fallback (tiebreak NodeID then Replica).
- messaging: carry Replica on PrefixCacheObserveEvent and PrefixCacheInvalidateEvent (Replica < 0 means all replicas of the node).
- Sync delegates + broadcasts with replica; InvalidateNode broadcasts Replica=-1; ApplyInvalidate routes negative replica to InvalidateNode.
This is part 1 of 2; the registry/router/wiring consumers are updated separately.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): make prefix-cache routing replica-granular
Wire the SmartRouter, NodeRegistry, and distributed startup to the replica-keyed prefixcache API. Affinity is now tracked per replica (each replica is a separate process with its own KV cache), so a prefix served by (node,0) no longer leaks onto the same-node sibling (node,1).
- RoutePreference gains PreferredReplica; FindAndLockNodeWithModel locks the EXACT (node_id, replica_index) row, falling through to the default ORDER BY when that replica is not loaded.
- SetReplicaRemovedHook now carries replicaIndex; RemoveNodeModel fires the specific replica, RemoveAllNodeModelReplicas and the four bulk node-scoped deletes fire replica<0 (all replicas of the node).
- buildPreference builds one Candidate per loaded replica and locks the exact replica the policy chose; observePrefix records the served ReplicaKey at every call site.
- distributed.go routes the hook to InvalidateNode (replica<0) or Invalidate(key).
- Tests updated to the replica-keyed API plus new coverage: a hot prefix on (node,0) prefers replica 0 over the same-node sibling (router unit + e2e), and FindAndLock locks the exact preferred replica.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(distributed): derive prefix chain from messages for tokenizer-template models
Prefix-cache-aware routing built its prompt-prefix chain from the rendered prompt string `s` in ModelInference. For models with TemplateConfig.UseTokenizerTemplate the frontend never renders a prompt - the backend tokenizes the structured messages itself - so `s` is empty, the chain is empty, and routing silently falls back to round-robin. That covers the bulk of modern chat models (qwen3, llama3, ...), so the feature effectively never engaged for them.
Fall back to messagesPrefixSource(messages): a deterministic, prefix-stable head-first serialization of the conversation (role + content per turn). Two requests sharing a leading system prompt and early turns share a leading byte prefix, which ExtractChain maps to a shared chain prefix - landing both on the same cache-warm replica. The rendered `s` is still preferred when present (higher fidelity for non-template models).
Found via the multi-replica-per-node e2e: zero "prefix-cache routing decision" logs despite per-request Route calls, traced to the empty-chain guard.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(distributed): document prefix-cache routing roadmap
Add a routing-and-caching roadmap section to the distributed-mode guide, linking the epic (#10063) and the follow-up issues (#10064-#10070) surfaced from a survey of SGLang, vLLM production-stack, Ray Serve, llm-d, AIBrix, and NVIDIA Dynamo.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
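The radixtree fix in the commit message above (record a value at every node along the chain so that LongestMatch can hit a shared prefix) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Tree`, `Insert`, and `LongestMatch` names, the `uint64` block hashes, and the string owner are hypothetical and are not taken from LocalAI's radixtree package.

```go
package main

import "fmt"

// node is one block in a prefix chain. All names here are hypothetical.
type node struct {
	children map[uint64]*node
	value    string // owner (for example a replica id) that last served this prefix
	hasValue bool
}

// Tree is a toy prefix tree keyed by a chain of block hashes.
type Tree struct{ root *node }

// NewTree returns an empty tree.
func NewTree() *Tree { return &Tree{root: &node{children: map[uint64]*node{}}} }

// Insert records owner at every node along the chain, so each prefix
// remembers who served it. The last writer owns a shared prefix node.
func (t *Tree) Insert(chain []uint64, owner string) {
	cur := t.root
	for _, block := range chain {
		next, ok := cur.children[block]
		if !ok {
			next = &node{children: map[uint64]*node{}}
			cur.children[block] = next
		}
		next.value, next.hasValue = owner, true
		cur = next
	}
}

// LongestMatch walks the chain and returns the owner of the deepest valued
// node it reaches, together with the number of blocks matched.
func (t *Tree) LongestMatch(chain []uint64) (owner string, depth int, found bool) {
	cur := t.root
	for i, block := range chain {
		next, ok := cur.children[block]
		if !ok {
			break
		}
		if next.hasValue {
			owner, depth, found = next.value, i+1, true
		}
		cur = next
	}
	return owner, depth, found
}

func main() {
	tree := NewTree()
	tree.Insert([]uint64{1, 2, 3}, "X")
	// The query shares the leading blocks 1,2 but diverges in the tail.
	owner, depth, ok := tree.LongestMatch([]uint64{1, 2, 9})
	fmt.Println(owner, depth, ok)
}
```

If `Insert` set the value only on the final node, the divergent query in `main` would find no valued node and miss, which is the behavior the commit describes as broken.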
||
|
|
73cfedc023 |
fix: tool-call JSON leaks into content with stream+tools on tokenizer-template models (#10052) (#10057)
* fix(grammars): honor properties_order entry at index 0
The JSON-schema-to-GBNF property sort used `aOrder != 0 && bOrder != 0` as its "is this key ordered?" guard. That treats index 0 (the first key listed in properties_order) as unset, so `properties_order: name,arguments` fell back to alphabetical ordering and still emitted "arguments" before "name".
Use presence in the order map instead: listed keys sort by their index and ahead of unlisted keys, which keep a stable alphabetical order. This makes the documented `properties_order: name,arguments` actually produce name-first tool-call JSON.
Relates to #10052.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(functions): defer tool grammar to the backend when the tokenizer template owns templating (#10052)
When use_tokenizer_template delegates templating to the backend (llama.cpp), the backend also owns tool-call grammar generation and parsing. LocalAI was still generating its own GBNF grammar and sending it down. With a grammar present, llama.cpp does not hand the tools to its template, so its native peg/json tool parser never engages: it streams the grammar-constrained tool-call JSON back as plain content instead of emitting tool_calls. In streaming mode the JSON object leaked into the content field, and the Go-side incremental detector never gated content because the LocalAI-generated grammar emitted "arguments" before "name".
The GGUF auto-import path already couples use_tokenizer_template with grammar.disable, but that block is skipped when a template is already configured, so gallery and hand-written configs (e.g. qwen3) that set the tokenizer template directly never got the paired grammar.disable.
- SetDefaults now enforces the coupling for every config: when use_tokenizer_template is set, grammar generation is disabled and tools flow to the backend's native (name-first) pipeline. This also fixes already-installed models without editing each config.
- Set function.grammar.disable in the shared gallery/qwen3.yaml, which is the base config referenced by every qwen3 gallery entry.
Verified end to end against qwen3-4b with stream:true + tools: content no longer carries the tool-call JSON, reasoning is classified separately, and tool calls stream as proper name-first tool_calls deltas.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
6a80e23733 |
feat(middleware): Model routing, PII filtering, Cloud model proxies (#9802)
Add a routing middleware stack and a cloud-proxy backend.
* cloud-proxy: a Go gRPC backend that forwards OpenAI- and Anthropic-shaped chat requests to upstream providers, with an optional translate mode (OpenAI request -> Anthropic /v1/messages -> OpenAI response) and full tool-calling support.
* routing: admission control, content-aware model routing (embedding cache + classifier + rerank + Arch-Router score), PII detection/redaction (regex + NER) with streaming filter and OpenAI/Anthropic adapters, and a per-user/per-key billing recorder backed by GORM or in-memory storage.
* middleware: UsageMiddleware records usage via the billing recorder, plus admission, route-model, usage-stamp and trace middlewares.
* observability: BackendTrace ring buffer stores full request bodies (capped), MITM proxy emits structured trace events, and router classifier decisions surface at /api/router/decide.
* gallery: Arch-Router-1.5B (Q4_K_M and Q8_0).
* UI: cloud-proxy model-editor fields, classifier system-prompt and score-normalization config, and a Traces page rendering request bodies.
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com> |
||
|
|
bbcaebc1ef |
feat(concurrency-groups): per-model exclusive groups for backend loading (#9662)
* feat(concurrency-groups): per-model exclusive groups for backend loading
Adds `concurrency_groups: [...]` to model YAML configs. Two models that share a group cannot be loaded concurrently on the same node: loading one evicts the others, reusing the existing pinned/busy/retry policy from LRU eviction.
Layered design:
- Watchdog (pkg/model): per-node correctness floor: on every Load(), evict any loaded model that shares a group with the requested one. Pinned skips surface NeedMore so the loader retries (and ultimately logs a clear warning), instead of silently allowing the rule to be violated.
- Distributed scheduler (core/services/nodes): soft anti-affinity hint: scheduleNewModel prefers nodes that don't already host a same-group model, falling back to eviction only if every candidate has a conflict. Composes with NodeSelector at the same point in the candidate pipeline.
Per-node, not cluster-wide: VRAM is a node-local resource, and two heavy models running on different nodes is fine.
The ConfigLoader is wired into SmartRouter via a small ConcurrencyConflictResolver interface so the nodes package keeps a narrow surface on core/config.
Refactors the inner LRU eviction body into a shared collectEvictionsLocked helper and the loader retry loop into retryEnforce(fn, maxRetries, interval), so both LRU and group enforcement share busy/pinned/retry semantics.
Closes #9659.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(watchdog): sync pinned + concurrency_groups at startup
The startup-time watchdog setup lives in initializeWatchdog (startup.go), not in startWatchdog (watchdog.go). The latter is only invoked from the runtime-settings RestartWatchdog path. As a result, neither SyncPinnedModelsToWatchdog nor SyncModelGroupsToWatchdog ran at boot, so `pinned: true` and `concurrency_groups: [...]` only became effective after a settings-driven watchdog restart.
Fix by adding both sync calls to initializeWatchdog. Confirmed end-to-end: loading model A in group "heavy", then C with no group (coexists), then B in group "heavy" now correctly evicts A and leaves [B, C].
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(test): satisfy errcheck on new os.Remove in concurrency_groups spec
CI lint runs new-from-merge-base, so the pre-existing `defer os.Remove(tmp.Name())` lines are baseline-grandfathered but the one introduced by the concurrency_groups YAML round-trip test is held to errcheck. Wrap the remove in a closure that discards the error.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
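The per-node group rule described in this commit (evict any loaded model that shares a group with the requested one, and signal a retry when a pinned model blocks eviction) can be sketched roughly as below. The `loadedModel` type and the `conflicts` function are hypothetical names for illustration; they do not reproduce the watchdog's collectEvictionsLocked or retryEnforce helpers.

```go
package main

import "fmt"

// loadedModel is a hypothetical view of a model currently held by a node.
type loadedModel struct {
	Name   string
	Groups []string
	Pinned bool
}

// conflicts returns the loaded models that must be evicted before a model
// with the given groups can load. needMore is true when a pinned model
// shares a group and therefore cannot be evicted.
func conflicts(loaded []loadedModel, groups []string) (evict []string, needMore bool) {
	wanted := make(map[string]struct{}, len(groups))
	for _, g := range groups {
		wanted[g] = struct{}{}
	}
	for _, m := range loaded {
		shares := false
		for _, g := range m.Groups {
			if _, ok := wanted[g]; ok {
				shares = true
				break
			}
		}
		if !shares {
			continue // no common group: the models may coexist
		}
		if m.Pinned {
			needMore = true // caller is expected to retry or warn
			continue
		}
		evict = append(evict, m.Name)
	}
	return evict, needMore
}

func main() {
	loaded := []loadedModel{
		{Name: "A", Groups: []string{"heavy"}},
		{Name: "C"}, // no group, never conflicts
	}
	evict, needMore := conflicts(loaded, []string{"heavy"})
	fmt.Println(evict, needMore)
}
```

This mirrors the scenario in the commit message: with A in group "heavy" and C ungrouped, loading B in group "heavy" selects only A for eviction.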
||
|
|
4916f8c880 |
feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map (#9563)
* feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map
LocalAI's vLLM backend wraps a small typed subset of vLLM's
AsyncEngineArgs (quantization, tensor_parallel_size, dtype, etc.).
Anything outside that subset -- pipeline/data/expert parallelism,
speculative_config, kv_transfer_config, all2all_backend, prefix
caching, chunked prefill, etc. -- requires a new protobuf field, a
Go struct field, an options.go line, and a backend.py mapping per
feature. That cadence is the bottleneck on shipping vLLM's
production feature set.
Add a generic `engine_args:` map on the model YAML that is
JSON-serialised into a new ModelOptions.EngineArgs proto field and
applied verbatim to AsyncEngineArgs at LoadModel time. Validation
is done by the Python backend via dataclasses.fields(); unknown
keys fail with the closest valid name as a hint.
dataclasses.replace() is used so vLLM's __post_init__ re-runs and
auto-converts dict values into nested config dataclasses
(CompilationConfig, AttentionConfig, ...). speculative_config and
kv_transfer_config flow through as dicts; vLLM converts them at
engine init.
Operators can now write:
    engine_args:
      data_parallel_size: 8
      enable_expert_parallel: true
      all2all_backend: deepep_low_latency
      speculative_config:
        method: deepseek_mtp
        num_speculative_tokens: 3
      kv_cache_dtype: fp8
without further proto/Go/Python plumbing per field.
Production defaults seeded by hooks_vllm.go: enable_prefix_caching
and enable_chunked_prefill default to true unless explicitly set.
Existing typed YAML fields (gpu_memory_utilization,
tensor_parallel_size, etc.) remain for back-compat; engine_args
overrides them when both are set.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* chore(vllm): pin cublas13 to vLLM 0.20.0 cu130 wheel
vLLM's PyPI wheel is built against CUDA 12 (libcudart.so.12) and won't
load on a cu130 host. Switch the cublas13 build to vLLM's per-tag cu130
simple-index (https://wheels.vllm.ai/0.20.0/cu130/) and pin
vllm==0.20.0. The cu130-flavoured wheel ships libcudart.so.13 and
includes the DFlash speculative-decoding method that landed in 0.20.0.
cublas13 install gets --index-strategy=unsafe-best-match so uv consults
both the cu130 index and PyPI when resolving — PyPI also publishes
vllm==0.20.0, but with cu12 binaries that error at import time.
Verified: Qwen3.5-4B + z-lab/Qwen3.5-4B-DFlash loads and serves chat
completions on RTX 5070 Ti (sm_120, cu130).
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* ci(vllm): bot job to bump cublas13 vLLM wheel pin
vLLM's cu130 wheel index URL is itself version-locked
(wheels.vllm.ai/<TAG>/cu130/, no /latest/ alias upstream), so a vLLM
bump means rewriting two values atomically — the URL segment and the
version constraint. bump_deps.sh handles git-sha-in-Makefile only;
add a sibling bump_vllm_wheel.sh and a matching workflow job that
mirrors the existing matrix's PR-creation pattern.
The bumper queries /releases/latest (which excludes prereleases),
strips the leading 'v', and seds both lines unconditionally. When the
file is already on the latest tag the rewrite is a no-op and
peter-evans/create-pull-request opens no PR.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* docs(vllm): document engine_args and speculative decoding
The new engine_args: map plumbs arbitrary AsyncEngineArgs through to
vLLM, but the public docs only covered the basic typed fields. Add a
short subsection in the vLLM section explaining the typed/generic
split and showing a worked DFlash speculative-decoding config, with
pointers to vLLM's SpeculativeConfig reference and z-lab's drafter
collection.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
|
||
|
|
7270a98ce5 |
fix(realtime): Use user provided voice and allow pipeline models to have no backend (#8415)
* fix(realtime): Use the voice provided by the user or none at all
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(ui,config): Allow pipeline models to have no backend and use same validation in frontend
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com> |
||
|
|
8bd7143a44 |
fix: propagate validation errors (#7787)
fix: validate MCP configuration in model config
Fixes #7334
The Validate() function was not checking if MCP configuration (mcp.stdio and mcp.remote) contains valid JSON. This caused malformed JSON with missing commas to be silently accepted.
Changes:
- Add MCP configuration validation to ModelConfig.Validate()
- Properly report validation errors instead of discarding them
- Add test cases for valid and invalid MCP configurations
The fix ensures that malformed JSON in MCP config sections will now be caught and reported during validation.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Signed-off-by: majiayu000 <1835304752@qq.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> |
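The kind of check this commit describes (reject an MCP section whose content is not valid JSON, and return the error instead of discarding it) can be sketched with the standard library alone. `validateMCPJSON` is a hypothetical helper for illustration and is not the code added to ModelConfig.Validate().

```go
package main

import (
	"encoding/json"
	"fmt"
)

// validateMCPJSON returns an error when a non-empty MCP section does not
// contain syntactically valid JSON. An empty section is treated as unset.
func validateMCPJSON(section, raw string) error {
	if raw == "" {
		return nil
	}
	if !json.Valid([]byte(raw)) {
		return fmt.Errorf("invalid JSON in %s", section)
	}
	return nil
}

func main() {
	// Well-formed section: no error.
	fmt.Println(validateMCPJSON("mcp.remote", `{"mcpServers": {}}`))
	// Missing comma between members: reported instead of silently accepted.
	fmt.Println(validateMCPJSON("mcp.stdio", `{"command": "x" "args": []}`))
}
```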
||
|
|
77bbeed57e |
feat(importer): unify importing code with CLI (#7299)
* feat(importer): support ollama and OCI, unify code
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat: support importing from local file
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* support also yaml config files
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Correctly handle local files
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Extract importing errors
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add importer tests
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add integration tests
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(UX): improve and specify supported URI formats
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fail if backend does not have a runfile
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Adapt tests
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(gallery): add cache for galleries
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ui): remove handler duplicate
File input handlers are now handled by Alpine.js @change handlers in chat.html. Removed duplicate listeners to prevent files from being processed twice
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ui): be consistent in attachments in the chat
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fail if no importer matches
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix: propagate ops correctly
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
60b6472fa0 |
feat: Add Agentic MCP support with a new chat/completion endpoint (#6381)
* WIP - add endpoint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Rename
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Wire the Completion API
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Try to make it functional
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Almost functional
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Bump golang versions used in tests
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add description of the tool
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Make it working
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Small optimizations
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Cleanup/refactor
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Update docs
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |