LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-12 02:38:19 -04:00

Author	SHA1	Message	Date
LocalAI [bot]	e837921c2c	feat: forward reasoning_effort to the backend so jinja models honor it (#10184) * feat: forward reasoning_effort to the backend so jinja models honor it reasoning_effort was only mapped to the binary enable_thinking toggle and otherwise reached Go-side templates — it was never sent to the backend. So jinja-templated models whose chat template keys on reasoning_effort (gpt-oss Harmony, LFM2.5) could not be driven by it: LFM2.5 ignores enable_thinking and kept emitting <think>. Forward the effective reasoning_effort to the backend as a chat_template_kwarg (mirroring enable_thinking) in grpc-server.cpp, and put it in PredictOptions metadata (gRPCPredictOpts). Add a config-level default: ModelConfig.reasoning_effort and Pipeline.reasoning_effort, resolved by ModelConfig.ApplyReasoningEffort (request value overrides config default, none->disable / level->enable, an operator's reasoning.disable wins). request.go now uses that helper. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): set the pipeline LLM's reasoning_effort Apply Pipeline.ReasoningEffort to the pipeline's LLM config when the realtime model is built (per-session copy, overrides the LLM's own reasoning_effort), and surface the resolved effort on the template input so Go-templated models get it too. jinja models receive it via the backend metadata. This lets a realtime pipeline disable thinking on models that only honor reasoning_effort (e.g. LFM2.5), which enable_thinking can't. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 13:45:43 +00:00
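
The resolution rule stated in commit e837921c2c (request value overrides the config default, "none" disables thinking, any other level enables it, and an operator's reasoning.disable wins) can be sketched roughly as below. This is only an illustration under stated assumptions: resolveReasoningEffort and its signature are hypothetical and are not LocalAI's ModelConfig.ApplyReasoningEffort, and the choice to leave the toggle untouched when no effort is set anywhere is a guess, not something the commit message states.

```go
package main

import "fmt"

// resolveReasoningEffort is a HYPOTHETICAL helper for illustration only; it is
// not LocalAI's ModelConfig.ApplyReasoningEffort. It returns the effective
// effort and a thinking toggle, where a nil toggle means "leave unchanged"
// (an assumption made for this sketch).
func resolveReasoningEffort(requestEffort, configDefault string, operatorDisable bool) (string, *bool) {
	// The request value overrides the config-level default.
	effort := configDefault
	if requestEffort != "" {
		effort = requestEffort
	}

	off, on := false, true
	switch {
	case operatorDisable:
		// An operator's reasoning.disable wins over any requested level.
		return effort, &off
	case effort == "":
		// Nothing requested and nothing configured: no opinion on thinking.
		return "", nil
	case effort == "none":
		// "none" maps to disabling thinking.
		return effort, &off
	default:
		// Any concrete level maps to enabling thinking.
		return effort, &on
	}
}

func main() {
	effort, thinking := resolveReasoningEffort("none", "high", false)
	fmt.Println(effort, *thinking)
}
```

In this sketch the effort string is what would be forwarded to the backend as a chat-template argument, while the boolean mirrors the existing enable_thinking toggle.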
LocalAI [bot]	a44bdb29d4	feat: prefix-cache-aware routing for distributed mode (#10071) * feat(radixtree): generic prefix tree skeleton with longest-match Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): Insert with path recency refresh and entry cap Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): TTL idle-expiry and Evict sweep with branch pruning Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): recency-weighted per-value Weight Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): Remove all entries for a value Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(radixtree): race-free concurrency smoke test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(radixtree): reclaim empty branches, RWMutex reads, TTL boundary, empty-key guard Address review findings on the generic prefix tree: - Extract a shared pruneWalk helper parameterized by a shouldClear predicate and use it from Evict, Remove, and the MaxEntries path. Previously evictOldestLocked cleared a victim's value but never removed the now value-less node or its childless ancestors, so internal nodes accumulated under sustained churn at the cap. The MaxEntries path now prunes the victim and its empty ancestors. - DRY: pruneWalk replaces the duplicated logic in the former pruneLocked and Remove's inner closure. - Switch Tree.mu to sync.RWMutex; LongestMatch, Weight and Len take the read lock (RLock) while Insert, Evict and Remove keep the write lock. Confirmed race-clean under go test -race. - Document the strict greater-than TTL boundary on Options.TTL and expired: age exactly equal to TTL is still live. - Guard Insert against an empty key (no-op): the root never holds a value. Adds Ginkgo specs covering MaxEntries eviction, ancestor reclamation, the no-growth-past-cap invariant, the TTL boundary, and empty-key behavior for both Insert and LongestMatch.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): RoutePolicy enum with parse/resolve Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): Config with defaults and validation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): deterministic xxhash prefix-chain extractor Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): pure filter-then-score replica selection Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): Provider interface and radix-tree-backed Index Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * style(prefixcache): gofmt policy enum comment alignment Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): head-first prefix chunking and hoist Weight out of sort Address code-quality review findings in the prefixcache package. Correctness: ExtractChain now chunks from absolute offset 0 with fixed [0,W),[W,2W),... boundaries and caps the chain to the FIRST MaxDepth head blocks. The previous tail-keeping logic shifted the byte offset by a non-window amount once a conversation grew past MaxDepthWindowBytes, changing every hash each turn and silently breaking cross-turn longest-prefix matching. The reusable KV/prefix cache lives at the head of the prompt, so anchoring at offset 0 makes the chain a true prefix-chain: P and P+suffix share their full leading overlap. Add a regression spec proving cross-turn stability past the cap. Performance: Index.Decide precomputes each candidate's Weight once (decorate-sort-undecorate) instead of calling the O(tree size) Weight inside the O(n log n) sort comparator. Behavior is unchanged. Lint: encode prev with binary.LittleEndian.PutUint64 instead of a manual byte loop, clearing the modernize rangeint finding. Also add a concurrent Decide/Observe/Invalidate spec to exercise Index's documented concurrency safety under go test -race. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> feat(messaging): prefixcache observe/invalidate subjects and payloads Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): NATS sync publish/apply for observe and invalidate Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributedhdr): ctx carrier for prefix-hash chain Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributedhdr): PrefixChainHook indirection for backend-side chain build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend): stash prompt prefix chain on ctx before distributed routing Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend): mirror modelID fallback for prefix-chain salt parity Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): scheduling config columns for prefix-cache routing Add RoutePolicy and per-model balance/prefix-match override columns to ModelSchedulingConfig and include them in the SetModelScheduling upsert DoUpdates list so updates are not dropped on conflict. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): optional route preference in FindAndLockNodeWithModel Add a RoutePreference type and a new pref parameter so the atomic pick+lock+increment can be biased toward a preferred node without weakening atomicity. A nil preference reproduces the previous ORDER BY behavior exactly. Update the ModelRouter interface, both router.go call sites (pass nil for now; Phase 5 builds the real preference), the test doubles, and the distributed e2e caller. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): make Sync satisfy Provider with Evict Sync.Observe now returns whether the local index treated the assignment as new or extended, and Sync gains an Evict method that delegates to the wrapped index. Together these let SmartRouter hold a single prefixcache.Provider that broadcasts via NATS. 
Adds a compile-time Provider assertion and an Evict-delegates behavioral test. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): prefix-cache-aware preference and observe in SmartRouter.Route Add a PrefixProvider + PrefixConfig to SmartRouterOptions/SmartRouter (nil keeps routing byte-for-byte the round-robin floor). On each request Route now calls buildPreference: it reads the prompt prefix chain from ctx (distributedhdr.PrefixChain), resolves the per-model policy/thresholds over the global config, loads candidate replica in-flight via a new registry read LoadedReplicaStats (deduped to one entry per node using the MIN in-flight across that node's replicas), asks the provider to Decide, and runs prefixcache.Select. The chosen node is passed as the RoutePreference to FindAndLockNodeWithModel on all three pick paths (cache hit, locked re-pick, cold scheduleAndLoad), and the served node is recorded via Observe only when the resolved policy is prefix_cache so round-robin models never pollute the tree. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): invalidate prefix-cache entries on unload and stale removal UnloadModel and both staleness fall-through paths in Route (after a failed gRPC probe and RemoveNodeModel) now call prefixProvider.Invalidate(model, nodeID), guarded by a nil-provider check so the round-robin floor is unchanged. At runtime the provider is the prefixcache.Sync, so invalidations also broadcast to peer frontends. Adds a test that a previously hot prefix no longer Decides to a node after UnloadModel. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> feat(prefixcache): rolling forced-disturb pressure counter Add a concurrency-safe per-model rolling counter that tracks how many times a request had a usable hot prefix match but the load guard forced it off the warm node. Entries outside the window are dropped lazily on Count so the backing slice stays bounded. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): autoscale on prefix-cache forced-disturb pressure Wire the rolling forced-disturb counter into the SmartRouter and the ReplicaReconciler. Router: in buildPreference, after Decide + Select, record a forced-disturb when a usable hot prefix match existed (d.HotNodeID != "" and d.MatchRatio >= cfg.MinPrefixMatch) but Select chose a different node (or nothing) because the load guard ruled the warm node out. This is the scale-worthy signal: the cache-warm replica is saturated. It deliberately does not fire for all-unique workloads (no hot match), avoiding false-positive scale-ups. Pressure is optional on SmartRouterOptions; nil keeps the path a no-op. Reconciler: read the same Pressure instance in reconcileModel as an extra scale-up reason, reusing the existing MaxReplicas + ClusterCapacityForModel guards and the UnsatisfiableUntil cooldown that gates the whole method. Pressure never overrides MaxReplicas and never force-evicts; a no-capacity model does not spin. Window and threshold come from prefixcache.Config (PressureWindow default 1m, PressureScaleThreshold default 1) and are configurable via ReplicaReconcilerOptions. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): bound Pressure slice in Record; drop dead reconciler pressureWindow Record now prunes entries older than the rolling window (the same prune Count does), via a shared pruneLocked helper, so a model that takes forced-disturb records but is never Counted (e.g. one with zero loaded replicas the reconciler skips) no longer grows its backing slice unbounded. Also removes the dead pressureWindow struct field and the ReplicaReconcilerOptions.PressureWindow option from the reconciler: they were stored but never read (the window lives inside the prefixcache.Pressure instance). The scale block now reads pressure.Count once into a local. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> feat(api): prefix-cache fields in scheduling endpoint DTO with validation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ui): prefix-cache routing controls in node scheduling form Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): wire prefix-cache index, NATS sync, and config Activates prefix-cache-aware routing in distributed mode. Builds the prefixcache Index + NATS-backed Sync + Pressure counter, installs the distributedhdr.PrefixChainHook so core/backend/llm.go attaches a prefix chain per request, subscribes to prefixcache.observe/prefixcache.invalidate to apply peers' events to the local index (no re-broadcast), threads PrefixProvider/PrefixConfig/Pressure into the SmartRouter and Pressure/PressureThreshold into the ReplicaReconciler, and runs a background eviction ticker (every TTL/2) bound to the app context. Enabled by default; --distributed-prefix-cache=false (LOCALAI_DISTRIBUTED_PREFIX_CACHE) opts out and leaves the provider/pressure nil so routing stays round-robin. --distributed-prefix-cache-ttl (LOCALAI_DISTRIBUTED_PREFIX_CACHE_TTL, default 5m) controls entry idle-timeout and eviction cadence. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(nodes): round-robin-floor invariant for prefix-cache routing Drives Select directly: a saturated hot node (in_flight 50 vs 0) is never picked even with a perfect prefix match (round-robin floor holds), while a balanced hot node within the load slack is reused. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(prefixcache): clear branch lint findings and em dashes Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): validate prefix-cache config at startup wiring Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * perf(radixtree): single-walk WeightsFor for batch value weights Add Tree.WeightsFor(values, now) which computes the recency-weighted weight for many values in a single O(N + len(values)) tree traversal, versus calling Weight once per value (O(len(values) * N)). Consumers that score K candidates against the tree under the read lock no longer pay K full walks. Extract the per-entry contribution math into an unexported helper shared by both Weight and WeightsFor so the metric stays identical (DRY). Weight's public behavior is unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(config): add ModelConfig.ModelID() single source of truth The c.Name fallback to c.Model was duplicated in core/backend/options.go (feeding model.WithModelID) and hand-copied into core/backend/llm.go (the prefix-chain salt). These MUST agree or the prefix-cache salt diverges silently from the id the model loader tracks. Consolidate both into a new config.ModelConfig.ModelID() helper and call it from both sites. Behavior is identical. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * perf(prefixcache): reuse one xxhash.Digest in ExtractChain ExtractChain allocated a fresh xxhash.New() Digest per block (up to MaxDepth per call) and grew the chain slice without preallocation. Reuse a single Digest via Reset() before each block and preallocate the chain to min(nBlocks, MaxDepth). xxhash seed 0 is stateless, so Reset()+Write produces the byte-identical value to a fresh New()+Write. Output hashes are unchanged, preserving the cross-process determinism that peers rely on over NATS. 
Verified by capturing ExtractChain output for the existing test inputs before and after the refactor: identical. Existing extractor tests pass unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): drop hot match when matched node is not a candidate; weigh cold candidates in one walk Index.Decide called radixtree.LongestMatch over the whole tree, so the deepest match could be a node that is offline, unloaded, or simply not in the passed candidate set. Honoring that as HotNodeID produced a false forced-disturb signal upstream (buildPreference records pressure when chosen != HotNodeID), making it look like a warm replica was load saturated when it was actually absent. Build the candidate set once and only set HotNodeID/MatchRatio when the matched node is an actual candidate; otherwise fall back to cold placement. A future refinement could ask the tree for the longest match restricted to the candidate nodes (shallower-but-valid) instead of dropping it. Also replace the per-candidate tree.Weight call in the cold-order sort with a single tree.WeightsFor walk, turning O(KN) under the read lock into O(N + K). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> refactor(prefixcache): remove Select's unreachable deterministic fallback buildPreference always passes ColdOrder as a permutation of the full candidate set, so the cold-order loop hits every eligible candidate. The trailing best/bestIF scan was dead. Replace it with a plain "return """ and document that ColdOrder is guaranteed to cover all candidates, so "" means none were eligible. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(nodes): fetch model scheduling config once per Route GetModelScheduling was read three times per request - in resolveSelectorCandidates, buildPreference, and nodeMatchesScheduling - three DB round-trips for one row that is immutable for the life of the request, and not a consistent snapshot. 
Fetch it once near the top of Route and thread the ModelSchedulingConfig (may be nil) into all three helpers. scheduleNewModel keeps its own fetch since it runs outside the Route snapshot. Behavior is identical for nil sched. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> fix(autoscale): add Pressure.Reset to consume forced-disturb signal Pressure.Count is non-draining (it prunes only by age), so a single burst of forced-disturbs stays within the rolling window for the whole window and keeps Count >= threshold on every reconciler tick. The reconciler will use Reset to clear a model's events after acting on the signal so a fresh scale-up requires fresh forced-disturbs to accumulate, rather than one burst driving the model toward MaxReplicas. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(autoscale): at most one scale-up per reconcile tick, consume pressure Two autoscale bugs: 1. Over-scaling: the pressure scale-up block read Pressure.Count but never consumed it. With a non-draining counter a single forced-disturb burst kept Count >= threshold across the whole window, firing scaleUp on every tick and pushing the model toward MaxReplicas off one transient burst. After a successful pressure-triggered scale-up the reconciler now calls Pressure.Reset to consume the signal. 2. Double scale-up in one tick: the all-replicas-busy block and the pressure block could both fire in the same reconcileModel pass, each calling scaleUp(+1) against the same `current` read once at the top, so a model that was both busy and over threshold scaled +2 and could overshoot MaxReplicas by one. A scaledUp flag now enforces at most one scaleUp(+1) per tick: the pressure block is skipped if the busy block already scaled, and scale-down is skipped in any tick that scaled up. MinReplicas enforcement, UnsatisfiableUntil backoff, and capacity guards are unchanged. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): replica-removed chokepoint hook for prefix-cache invalidation Add SetReplicaRemovedHook to NodeRegistry and fire it from both RemoveNodeModel and RemoveAllNodeModelReplicas after a successful delete. This is the single chokepoint every replica-removal path funnels through (router eviction, reconciler scale-down, probe reaper, health-monitor node-down reap, RemoteUnloaderAdapter), so the prefix-cache index can be invalidated by construction rather than wiring each call site individually. The hook is stored in an atomic.Pointer so the startup wiring (setter) and the request/reconcile-time fire are race-free; it is nil-safe when unset. GORM Delete reports no error for a no-op delete, so the hook also fires when nothing was removed; the consumer's Invalidate(model, node) is idempotent so this is harmless. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): invalidate prefix-cache on any replica removal via registry hook Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(prefixcache): single source of truth for threshold bounds Extract ValidateThresholds into prefixcache/config.go so the per-model override validation (nodes.go endpoint) and Config.Validate share one implementation of the numeric bounds (min_prefix_match in [0,1], balance_abs_threshold >= 0, balance_rel_threshold == 0-or->= 1) instead of hard-coding them in two places. The route_policy allow-list stays explicit (not ParsePolicy, which maps typos to Default). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(nodes): preserve prefix-cache settings on partial scheduling update A scheduling POST that omitted route_policy/thresholds (e.g. a min_replicas-only update) full-replaced every column and silently reset the model's previously-configured prefix-cache settings to empty/zero. 
Make the four prefix-cache request fields pointers so omitted is distinguishable from explicit zero, and merge PATCH-style in SetSchedulingEndpoint: a provided pointer wins, an omitted one preserves the existing config value (zero default when none). Non-prefix fields keep their full-replace PUT semantics. Validation now runs on the resolved values via prefixcache.ValidateThresholds. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): make Invalidate a no-op for uncached models and skip empty broadcasts A registry chokepoint fires Sync.Invalidate(model, nodeID) for every replica removal of every model, including round-robin models that never used the prefix cache. Index.Invalidate previously called tree(model), which lazily created and permanently retained an empty radix tree for any model that ever lost a replica, growing the trees map without bound. Sync.Invalidate also published a NATS PrefixCacheInvalidateEvent on every call, amplifying no-op removals across the cluster. Index.Invalidate now looks the tree up read-only via existingTree and returns without allocating when none exists. The Provider interface is unchanged; Sync gates the broadcast through an optional invalidateExisting(bool) capability type-asserted from the wrapped Index, falling back to the prior always-broadcast behavior for other Provider implementations. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * perf(prefixcache): derive Decide candidacy from WeightsFor and skip trivial sort WeightsFor already returns a map keyed by every requested candidate, so the separate candidates set built to validate the hot match was redundant: a node is a candidate iff it is a key in the weights map. Drop the extra map and gate the hot-match check on weights membership. Also skip the sort when there is at most one candidate, since the input order is already the cold order. Behavior is unchanged. 
Deferred follow-up: skipping the WeightsFor walk entirely when a hot match wins would need lazy cross-file changes and is out of scope here. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(nodes): fire replica-removed hook on bulk node_models deletes; trim LoadedReplicaStats columns Bulk node-scoped node_models deletes (Register re-register cleanup, MarkOffline, MarkDraining, Deregister) removed rows directly without firing the replica-removed hook, so the prefix-cache index kept pointing at nodes whose models were gone. Capture the DISTINCT model names before each bulk delete and fire fireReplicaRemoved once per model after a successful delete, restoring the single-chokepoint invariant for all removal paths. The pre-query is skipped when no hook is set so the no-hook path stays cheap. Also narrow LoadedReplicaStats to SELECT only node_id and in_flight (the only fields the router consumer reads), dropping the JOIN-side available_vram fetch and unused columns while keeping the []ReplicaCandidate return type unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(reconciler): consume autoscale signals only on a real scale-up scaleUp was fire-and-forget (void) yet its callers unconditionally consumed the pressure signal (Pressure.Reset) and the MinReplicas hysteresis (ClearUnsatisfiable) right after calling it. If scaleUp added nothing (ScheduleAndLoadModel errored, or no node could be loaded) the saturated warm replica got no new replica AND its accumulated forced-disturb history was wiped, forcing the signal to re-accumulate over a full PressureWindow before the next attempt. Make scaleUp return whether at least one replica was actually scheduled, and gate the side effects on it: - pressure block (2b): set scaledUp and call Pressure.Reset only on success; on failure preserve the signal so the next tick retries off the same accumulated pressure. 
- busy-burst block (2): set scaledUp from the return value so a failed attempt does not suppress the pressure path or scale-down. - MinReplicas block: call ClearUnsatisfiable only on success so a failed attempt does not reset the unsatisfiable counter. All existing invariants (MaxReplicas, capacity gating, UnsatisfiableUntil cooldown, at-most-one-scale-up-per-tick) are preserved. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(nodes): drop router's redundant prefix-cache Invalidate calls The NodeRegistry removal chokepoint (RemoveNodeModel / RemoveAllNodeModelReplicas) now fires SetReplicaRemovedHook, which invalidates the prefix-cache index. The router was also calling prefixProvider.Invalidate explicitly right after each registry removal on the two stale-replica health-probe fall-throughs in Route and in UnloadModel, so every router-side eviction invalidated twice (double tree-prune + double NATS broadcast). Remove the three redundant explicit Invalidate calls and their empty nil-guards. Each removed call sat immediately after a registry removal that fires the hook, so invalidation is preserved via the chokepoint. Decide/Observe usage is untouched. Re-point the unit test (fake registry fires no hook) to assert the removal chokepoint is exercised on unload instead of the router's direct invalidation. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): broadcast invalidations unconditionally for cross-frontend coherence Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): reject TTL<=0 in Config.Validate (eviction ticker would panic) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(nodes): make capture+delete atomic in bulk node_models removal paths MarkOffline, MarkDraining, and the Register re-register cleanup ran the nodeModelNames SELECT and the bulk node_models DELETE as two separate statements on r.db with no transaction. 
A SetNodeModel landing between the two was deleted but its replica-removed hook never fired, leaving the prefix-cache index pointing at a removed replica until TTL or candidacy self-heal. Wrap the capture and the delete in a single db.Transaction in each path (mirroring how Deregister already does it). The captured model names are collected into a slice declared outside the closure; the replica-removed hook fires for each only after the transaction commits, so a rollback never invalidates the index for a removal that did not persist. The set of fired hooks now equals exactly the set of node_models rows actually deleted, with no interleaving gap. The status flip in MarkOffline/MarkDraining (setStatus) is a separate, pre-existing operation and routing already filters non-healthy nodes, so it stays outside the transaction; return contracts are unchanged. Deregister was already correct and is untouched. The cheap-path skip (no hook -> skip the SELECT) is preserved. Adds a spec asserting MarkOffline fires hooks for exactly the rows it deletes and leaves no node_models row behind (consistent snapshot). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(nodes): debug logging for prefix-cache routing decisions and observations Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(radixtree): match shared prefixes by valuing every node on insert Insert recorded the value (node id) only on the final node of the key chain, leaving every intermediate prefix node valueless. LongestMatch returns the deepest node that hasValue, so two chains that share a leading block but diverge in the tail never matched: only exact-repeat queries hit. That broke the prefix-cache routing core use cases (shared system prompt, multi-turn extension, volatile tail), all of which rely on prefix matching rather than exact-repeat. Set value/hasValue/lastSeen at every node along the chain so each prefix-block node remembers the node id that served that prefix (SGLang/vLLM-style). 
The deepest match wins, and the last writer owns a shared prefix node (a recency heuristic: the most recent chain through a block is the one most likely still warm). size now counts valued nodes, which is the intended meaning. Updated radixtree tests to the new semantics: deepest-prefix test uses non-overlapping chains, a new test asserts last-writer-owns-shared-node, Evict/Remove/MaxEntries expectations recomputed for per-prefix-node counting, and a shared-prefix LongestMatch red test added. Added a prefixcache Decide test proving a prefix-only query routes to the warm node. No prefixcache .go logic changed. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(distributed): lock in prefix-cache routing behavior end to end Add a DB-backed e2e spec that drives SmartRouter against a real NodeRegistry (Postgres testcontainer) and the real prefixcache.Index radix-tree provider, using a fake gRPC backend factory so no real inference runs. Covers the five behaviors validated by hand: 1. Cold miss + observe: an unseen prefix chain cold-places and is recorded. 2. Hot-match affinity: the same chain returns to its warm node X. 3. Shared-prefix match: a divergent chain sharing X's leading prefix still routes to X (the radix-tree regression we fixed). 4. Negative control: an unrelated chain is a cold miss, not a false hot match on X. 5. Failover + invalidation: removing X's replica fires the registry chokepoint hook to invalidate the prefix entry, and the chain fails over to surviving node Y and re-homes there. Replaces the need for manual docker-compose re-runs. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(prefixcache): make prefix-cache affinity replica-granular Track prefix-cache affinity per loaded replica (a backend process with its own KV cache) instead of per node, so multiple replicas of the same model on one node each keep distinct affinity and a hot prefix routes back to the exact replica that served it. 
- radixtree: add RemoveFunc(pred) and reimplement Remove on top of it. - prefixcache: introduce ReplicaKey{NodeID, Replica}; Index/Candidate/ PrefixDecision/Select/Provider now key on ReplicaKey. Add InvalidateNode to drop every replica of a node; Invalidate drops one replica. Select returns (ReplicaKey, bool) and gains a deterministic least-in-flight eligible fallback (tiebreak NodeID then Replica). - messaging: carry Replica on PrefixCacheObserveEvent and PrefixCacheInvalidateEvent (Replica < 0 means all replicas of the node). - Sync delegates + broadcasts with replica; InvalidateNode broadcasts Replica=-1; ApplyInvalidate routes negative replica to InvalidateNode. This is part 1 of 2; the registry/router/wiring consumers are updated separately. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): make prefix-cache routing replica-granular Wire the SmartRouter, NodeRegistry, and distributed startup to the replica-keyed prefixcache API. Affinity is now tracked per replica (each replica is a separate process with its own KV cache), so a prefix served by (node,0) no longer leaks onto the same-node sibling (node,1). - RoutePreference gains PreferredReplica; FindAndLockNodeWithModel locks the EXACT (node_id, replica_index) row, falling through to the default ORDER BY when that replica is not loaded. - SetReplicaRemovedHook now carries replicaIndex; RemoveNodeModel fires the specific replica, RemoveAllNodeModelReplicas and the four bulk node-scoped deletes fire replica<0 (all replicas of the node). - buildPreference builds one Candidate per loaded replica and locks the exact replica the policy chose; observePrefix records the served ReplicaKey at every call site. - distributed.go routes the hook to InvalidateNode (replica<0) or Invalidate(key). - Tests updated to the replica-keyed API plus new coverage: a hot prefix on (node,0) prefers replica 0 over the same-node sibling (router unit + e2e), and FindAndLock locks the exact preferred replica. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(distributed): derive prefix chain from messages for tokenizer-template models Prefix-cache-aware routing built its prompt-prefix chain from the rendered prompt string `s` in ModelInference. For models with TemplateConfig.UseTokenizerTemplate the frontend never renders a prompt - the backend tokenizes the structured messages itself - so `s` is empty, the chain is empty, and routing silently falls back to round-robin. That covers the bulk of modern chat models (qwen3, llama3, ...), so the feature effectively never engaged for them. Fall back to messagesPrefixSource(messages): a deterministic, prefix-stable head-first serialization of the conversation (role + content per turn). Two requests sharing a leading system prompt and early turns share a leading byte prefix, which ExtractChain maps to a shared chain prefix - landing both on the same cache-warm replica. The rendered `s` is still preferred when present (higher fidelity for non-template models). Found via the multi-replica-per-node e2e: zero "prefix-cache routing decision" logs despite per-request Route calls, traced to the empty-chain guard. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): document prefix-cache routing roadmap Add a routing-and-caching roadmap section to the distributed-mode guide, linking the epic (#10063) and the follow-up issues (#10064-#10070) surfaced from a survey of SGLang, vLLM production-stack, Ray Serve, llm-d, AIBrix, and NVIDIA Dynamo. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-30 23:24:22 +02:00
Richard Palethorpe	6a80e23733	feat(middleware): Model routing, PII filtering, Cloud model proxies (#9802 ) Add a routing middleware stack and a cloud-proxy backend. * cloud-proxy: a Go gRPC backend that forwards OpenAI- and Anthropic-shaped chat requests to upstream providers, with an optional translate mode (OpenAI request -> Anthropic /v1/messages -> OpenAI response) and full tool-calling support. * routing: admission control, content-aware model routing (embedding cache + classifier + rerank + Arch-Router score), PII detection/redaction (regex + NER) with streaming filter and OpenAI/Anthropic adapters, and a per-user/per-key billing recorder backed by GORM or in-memory storage. * middleware: UsageMiddleware records usage via the billing recorder, plus admission, route-model, usage-stamp and trace middlewares. * observability: BackendTrace ring buffer stores full request bodies (capped), MITM proxy emits structured trace events, and router classifier decisions surface at /api/router/decide. * gallery: Arch-Router-1.5B (Q4_K_M and Q8_0). * UI: cloud-proxy model-editor fields, classifier system-prompt and score-normalization config, and a Traces page rendering request bodies. Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash] Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-25 09:28:27 +02:00
LocalAI [bot]	1198d10b58	fix(traces): cap backend trace Data to keep admin UI responsive (#9960 ) * fix(traces): cap backend trace Data field so the admin UI stays responsive The previous fix (#9946) capped API trace bodies but missed backend traces, which carry the same blast radius: - LLM backend traces store the full chat messages JSON, full response, and full streaming deltas. Every agent-pool reasoning step ships the full RAG-augmented history (50-500 KiB per trace, often 100+ traces queued). - TTS / audio_transform / transcript traces embed a 30s audio snippet as base64, around 1.3 MiB per trace. Both blow the /api/backend-traces JSON past tens of MiB. The admin Traces page then keeps re-downloading and re-parsing the buffer faster than the 5s auto-refresh and stays in the loading state forever, the same symptom the API-side fix addressed. Apply two complementary caps, both honoring LOCALAI_TRACING_MAX_BODY_BYTES: Option A (safety net in core/trace): RecordBackendTrace walks the Data map recursively and replaces any string value larger than the cap with "<truncated: N bytes>". Catches anything a future producer forgets. Option B (head-preserving at the producer): - core/backend/llm.go: TruncateToBytes on messages, response, and chat_deltas content/reasoning_content so the leading content stays readable in the UI. - core/trace/audio_snippet.go: omit audio_wav_base64 when the encoded blob would exceed the cap (truncated base64 is undecodable). The quality metrics still ship and the UI's WaveformPlayer simply skips when the field is absent. TruncateToBytes is bounded to <= maxBytes so Option A leaves the producer's head-preserving output alone instead of replacing it with the bare marker. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 * fix(react-ui): expose tracing_max_body_bytes in Settings and Traces panels The setting was already plumbed through env (LOCALAI_TRACING_MAX_BODY_BYTES), CLI flag, and the runtime_settings.json GET/PUT schema, but neither the main Settings page nor the inline Traces panel offered an input for it. Admins hitting the "Traces UI stuck loading" symptom had to know to set an env var or PUT raw JSON to /api/settings to dial the cap. Add a "Max Body Bytes" row next to "Max Items" in both places. Same input type, same disabled-when-tracing-off semantics, placeholder shows the 65536 default so users see what they're inheriting. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 * test(react-ui): disambiguate Max Items locator after adding Max Body Bytes The Tracing settings panel now has two number inputs. The previous spec matched 'input[type="number"]' which became ambiguous and triggered a Playwright strict-mode violation in CI. Switch to getByPlaceholder('100') for Max Items and add a parallel spec for the new Max Body Bytes field using getByPlaceholder('65536'). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-23 14:50:40 +02:00
LocalAI [bot]	c500461c69	feat(config): default prompt_cache_all to true (#9951 ) Upstream llama.cpp defaults `cache_prompt = true` (common/common.h), but `parse_options` in the grpc-server backend unconditionally forwards the proto `PromptCacheAll` field, so any model that didn't set `prompt_cache_all: true` in its YAML was getting `cache_prompt=false` — silently overriding llama.cpp's own default. With `kv_unified` and `cache_idle_slots` already on by default, this was the last piece preventing the per-request prompt cache from being usable out of the box. Make `PromptCacheAll` tristate (`*bool`), default it to `true` in `SetDefaults`, and dereference at the proto boundary. Users can still opt out with an explicit `prompt_cache_all: false`. Same pattern as `MMap`, `MMlock`, `Reranking`, etc. Co-authored-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 22:06:22 +02:00
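The tristate pattern described in c500461c69 can be sketched as follows. modelConfig and setDefaults are hypothetical simplifications of ModelConfig and SetDefaults. The point is that a *bool distinguishes "unset" (nil) from an explicit false, so the default applies only when the user said nothing.

```go
package main

import "fmt"

// modelConfig is a hypothetical, trimmed-down stand-in for ModelConfig.
type modelConfig struct {
	// nil means "not set in YAML"; a non-nil value is an explicit choice.
	PromptCacheAll *bool
}

// setDefaults fills in true only when the field was left unset.
func (c *modelConfig) setDefaults() {
	if c.PromptCacheAll == nil {
		t := true
		c.PromptCacheAll = &t
	}
}

func main() {
	unset := modelConfig{}
	unset.setDefaults()

	off := false
	optedOut := modelConfig{PromptCacheAll: &off}
	optedOut.setDefaults()

	// Dereference at the boundary, as the commit does at the proto layer.
	fmt.Println(*unset.PromptCacheAll, *optedOut.PromptCacheAll)
}
```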
LocalAI [bot]	70cf8ac546	fix(backend): resolve relative draft_model paths against the models dir (#9680 ) * fix(backend): resolve relative draft_model paths against the models dir The main model file and mmproj are joined with the configured models directory before reaching the backend, but draft_model was sent verbatim. With a relative draft_model in the YAML config, llama.cpp opens the path from the backend process's CWD and fails with "No such file or directory", forcing users to hard-code an absolute path. Mirror the existing mmproj resolution: if draft_model is relative, join it with modelPath. Absolute paths are passed through unchanged. Adds an e2e regression test against the mock backend that asserts the main model file, mmproj, and draft_model all arrive at the backend resolved to absolute paths. Closes #9675 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write] * fix(backend): always join draft_model with models dir (drop IsAbs shortcut) The previous commit kept absolute draft_model paths intact via an IsAbs check. That left a path-traversal vector open: a user-supplied YAML config could set draft_model to /etc/passwd (or any other host file the backend process can read) and the path would be sent through unchanged. filepath.Join cleans the leading slash from absolute components, so joining unconditionally — the way mmproj already does — keeps the result rooted at the configured models directory regardless of input. Adds a second e2e spec that feeds an absolute draft_model into the mock backend and asserts the path is clamped under modelsPath. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7-1m [Read] [Edit] [Bash] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-06 00:58:38 +02:00
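The clamping behavior that 70cf8ac546 relies on can be seen in a minimal sketch, assuming a Unix-like path separator. resolveUnderModels is a hypothetical helper name; filepath.Join is the standard library call named in the commit message.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveUnderModels is a hypothetical helper: it joins the configured
// draft_model value with the models directory unconditionally, with no
// IsAbs shortcut.
func resolveUnderModels(modelsDir, draftModel string) string {
	return filepath.Join(modelsDir, draftModel)
}

func main() {
	// A relative path is resolved against the models directory.
	fmt.Println(resolveUnderModels("/models", "drafts/small.gguf"))
	// An absolute path loses its leading slash and stays under /models.
	fmt.Println(resolveUnderModels("/models", "/etc/passwd"))
}
```

This sketch covers only the absolute-path case the commit describes. Join resolves `..` components lexically, so those would need a separate check.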
Richard Palethorpe	4916f8c880	feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map (#9563 ) * feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map LocalAI's vLLM backend wraps a small typed subset of vLLM's AsyncEngineArgs (quantization, tensor_parallel_size, dtype, etc.). Anything outside that subset -- pipeline/data/expert parallelism, speculative_config, kv_transfer_config, all2all_backend, prefix caching, chunked prefill, etc. -- requires a new protobuf field, a Go struct field, an options.go line, and a backend.py mapping per feature. That cadence is the bottleneck on shipping vLLM's production feature set. Add a generic `engine_args:` map on the model YAML that is JSON-serialised into a new ModelOptions.EngineArgs proto field and applied verbatim to AsyncEngineArgs at LoadModel time. Validation is done by the Python backend via dataclasses.fields(); unknown keys fail with the closest valid name as a hint. dataclasses.replace() is used so vLLM's __post_init__ re-runs and auto-converts dict values into nested config dataclasses (CompilationConfig, AttentionConfig, ...). speculative_config and kv_transfer_config flow through as dicts; vLLM converts them at engine init. Operators can now write: engine_args: data_parallel_size: 8 enable_expert_parallel: true all2all_backend: deepep_low_latency speculative_config: method: deepseek_mtp num_speculative_tokens: 3 kv_cache_dtype: fp8 without further proto/Go/Python plumbing per field. Production defaults seeded by hooks_vllm.go: enable_prefix_caching and enable_chunked_prefill default to true unless explicitly set. Existing typed YAML fields (gpu_memory_utilization, tensor_parallel_size, etc.) remain for back-compat; engine_args overrides them when both are set. 
Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * chore(vllm): pin cublas13 to vLLM 0.20.0 cu130 wheel vLLM's PyPI wheel is built against CUDA 12 (libcudart.so.12) and won't load on a cu130 host. Switch the cublas13 build to vLLM's per-tag cu130 simple-index (https://wheels.vllm.ai/0.20.0/cu130/) and pin vllm==0.20.0. The cu130-flavoured wheel ships libcudart.so.13 and includes the DFlash speculative-decoding method that landed in 0.20.0. cublas13 install gets --index-strategy=unsafe-best-match so uv consults both the cu130 index and PyPI when resolving — PyPI also publishes vllm==0.20.0, but with cu12 binaries that error at import time. Verified: Qwen3.5-4B + z-lab/Qwen3.5-4B-DFlash loads and serves chat completions on RTX 5070 Ti (sm_120, cu130). Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * ci(vllm): bot job to bump cublas13 vLLM wheel pin vLLM's cu130 wheel index URL is itself version-locked (wheels.vllm.ai/<TAG>/cu130/, no /latest/ alias upstream), so a vLLM bump means rewriting two values atomically — the URL segment and the version constraint. bump_deps.sh handles git-sha-in-Makefile only; add a sibling bump_vllm_wheel.sh and a matching workflow job that mirrors the existing matrix's PR-creation pattern. The bumper queries /releases/latest (which excludes prereleases), strips the leading 'v', and seds both lines unconditionally. When the file is already on the latest tag the rewrite is a no-op and peter-evans/create-pull-request opens no PR. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * docs(vllm): document engine_args and speculative decoding The new engine_args: map plumbs arbitrary AsyncEngineArgs through to vLLM, but the public docs only covered the basic typed fields. 
Add a short subsection in the vLLM section explaining the typed/generic split and showing a worked DFlash speculative-decoding config, with pointers to vLLM's SpeculativeConfig reference and z-lab's drafter collection. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com> Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2026-04-29 00:49:28 +02:00
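The default seeding mentioned in 4916f8c880 (enable_prefix_caching and enable_chunked_prefill default to true unless explicitly set) could be sketched as below. seedEngineArgs is a hypothetical name and this is not the code in hooks_vllm.go. It only illustrates "seed a default without overriding an explicit value" on a generic map.

```go
package main

import "fmt"

// seedEngineArgs returns a copy of args with the two production defaults
// added when the operator did not set them. Hypothetical helper.
func seedEngineArgs(args map[string]any) map[string]any {
	out := make(map[string]any, len(args)+2)
	for k, v := range args {
		out[k] = v
	}
	for _, k := range []string{"enable_prefix_caching", "enable_chunked_prefill"} {
		// Presence check, not truthiness: an explicit false is kept.
		if _, ok := out[k]; !ok {
			out[k] = true
		}
	}
	return out
}

func main() {
	args := seedEngineArgs(map[string]any{
		"data_parallel_size":    8,
		"enable_prefix_caching": false,
	})
	fmt.Println(args["enable_prefix_caching"], args["enable_chunked_prefill"])
}
```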
Ettore Di Giacinto	8862e3ce60	feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler (#9186 ) * always enable parallel requests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: move tests to ginkgo Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(smart router): order by available vram Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-31 08:28:56 +02:00
Ettore Di Giacinto	59108fbe32	feat: add distributed mode (#9124 ) * feat: add distributed mode (experimental) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix data races, mutexes, transactions Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix events and tool stream in agent chat Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * use ginkgo Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(cron): compute correctly time boundaries avoiding re-triggering Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * enhancements, refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * do not flood of healthy checks Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * do not list obvious backends as text backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * tests fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop redundant healthcheck Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * enhancements, refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di 
Giacinto <mudler@localai.io>	2026-03-30 00:47:27 +02:00
Ettore Di Giacinto	031a36c995	feat: inferencing default, automatic tool parsing fallback and wire min_p (#9092 ) * feat: wire min_p Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: inferencing defaults Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(refactor): re-use iterative parser Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: generate automatically inference defaults from unsloth Instead of trying to re-invent the wheel and maintain here the inference defaults, prefer to consume unsloth ones, and contribute there as necessary. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: apply defaults also to models installed via gallery Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: be consistent and apply fallback to all endpoint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-22 00:57:15 +01:00
Richard Palethorpe	35d509d8e7	feat(ui): Per model backend logs and various fixes (#9028 ) * feat(gallery): Switch to expandable box instead of pop-over and display model files Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(ui, backends): Add individual backend logging Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(ui): Set the context settings from the model config Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-03-18 08:31:26 +01:00
LocalAI [bot]	c6a51289b0	fix: Automatically disable mmap for Intel SYCL backends (#9012 ) (#9015 ) * fix: Automatically disable mmap for Intel SYCL backends Fixes issue #9012 where Qwen3.5 models fail to load on Intel Arc GPU with RPC EOF error. The Intel SYCL backend has a known issue where mmap enabled causes the backend to hang. This change automatically disables mmap when detecting Intel or SYCL backends. References: - https://github.com/mudler/LocalAI/issues/9012 - Documentation mentions: SYCL hangs when mmap: true is set * feat: Add logging for mmap auto-disable on Intel SYCL backends As requested in PR review, add xlog.Info call to log when mmap is automatically disabled for Intel SYCL backends. This helps with debugging and confirms the auto-disable logic is working. --------- Co-authored-by: localai-bot <localai-bot@users.noreply.github.com>	2026-03-15 21:06:35 +01:00
Ettore Di Giacinto	580517f9db	feat: pass-by metadata to predict options (#8795 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-05 22:50:10 +01:00
Ettore Di Giacinto	5f6c941399	fix(llama.cpp/mmproj): fix loading mmproj in nested sub-dirs different from model path (#7832 ) fix(mmproj): fix loading mmproj in nested sub-dirs Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-01-02 20:17:30 +01:00
Ettore Di Giacinto	c37785b78c	chore(refactor): move logging to common package based on slog (#7668 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-12-21 19:33:13 +01:00
Ettore Di Giacinto	d7f9f3ac93	feat: add support to logitbias and logprobs (#7283 ) * feat: add support to logprobs in results Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: add support to logitbias Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-11-16 13:27:36 +01:00
Ettore Di Giacinto	cd1e1124ea	fix(llama.cpp): correctly set grammar triggers (#6432 ) * fix(llama.cpp): correctly set grammar triggers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Do not enable lazy by default Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-10-10 19:50:17 +02:00
Ettore Di Giacinto	739573e41b	feat(flash_attention): set auto for flash_attention in llama.cpp (#6168 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-08-31 17:59:09 +02:00
Ettore Di Giacinto	089efe05fd	feat(backends): add system backend, refactor (#6059 ) - Add a system backend path - Refactor and consolidate system information in system state - Use system state in all the components to figure out the system paths to use whenever needed - Refactor BackendConfig -> ModelConfig. This was otherwise misleading as now we do have a backend configuration which is not the model config. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-08-14 19:38:26 +02:00
Ettore Di Giacinto	98e5291afc	feat: refactor build process, drop embedded backends (#5875 ) * feat: split remaining backends and drop embedded backends - Drop silero-vad, huggingface, and stores backend from embedded binaries - Refactor Makefile and Dockerfile to avoid building grpc backends - Drop golang code that was used to embed backends - Simplify building by using goreleaser Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(gallery): be specific with llama-cpp backend templates Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(docs): update Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(ci): minor fixes Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: drop all ffmpeg references Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: run protogen-go Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Always enable p2p mode Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Update gorelease file Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(stores): do not always load Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix linting issues Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Simplify Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Mac OS fixup Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-07-22 16:31:04 +02:00
Ettore Di Giacinto	dfadc3696e	feat(llama.cpp): allow to set kv-overrides (#5745 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-06-28 21:26:07 +02:00
Ettore Di Giacinto	3b0cf52f6a	feat(llama.cpp): add reranking (#5396 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-05-22 21:49:30 +02:00
Ettore Di Giacinto	b2f9fc870b	chore(defaults): enlarge defaults, drop gpu layers which is inferred (#5308 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-05-03 18:44:51 +02:00
Ettore Di Giacinto	61cc76c455	chore(autogptq): drop archived backend (#5214 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-04-19 15:52:29 +02:00
Ettore Di Giacinto	2c425e9c69	feat(loader): enhance single active backend by treating as singleton (#5107 ) feat(loader): enhance single active backend by treating as singleton Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-04-01 20:58:11 +02:00
Ettore Di Giacinto	67f7bffd18	chore(deps): update llama.cpp and sync with upstream changes (#4950 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-03-06 00:40:58 +01:00
Brandon Beiler	6a6e1a0ea9	feat(vllm): Additional vLLM config options (Disable logging, dtype, and Per-Prompt media limits) (#4855 ) * Adding the following vLLM config options: disable_log_status, dtype, limit_mm_per_prompt Signed-off-by: TheDropZone <brandonbeiler@gmail.com> * using " marks in the config.yaml file Signed-off-by: TheDropZone <brandonbeiler@gmail.com> * adding in missing colon Signed-off-by: TheDropZone <brandonbeiler@gmail.com> --------- Signed-off-by: TheDropZone <brandonbeiler@gmail.com>	2025-02-18 19:27:58 +01:00
Ettore Di Giacinto	1d6afbd65d	feat(llama.cpp): Add support to grammar triggers (#4733 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-02-02 13:25:03 +01:00
Ettore Di Giacinto	7d0ac1ea3f	chore(vall-e-x): Drop backend (#4619 ) There are many new architectures that are SOTA and replace vall-e-x nowadays. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-01-17 09:35:10 +01:00
Ettore Di Giacinto	d4c1746c7d	feat(llama.cpp): expose cache_type_k and cache_type_v for quant of kv cache (#4329 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2024-12-06 10:23:59 +01:00
Ettore Di Giacinto	44a5dac312	feat(backend): add stablediffusion-ggml (#4289 ) * feat(backend): add stablediffusion-ggml Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(ci): track stablediffusion-ggml Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use default scheduler and sampler if not specified Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Move cfg scale out of diffusers block Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Make it working Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: set free_params_immediately to false to call the model in sequence https://github.com/leejet/stable-diffusion.cpp/issues/366 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2024-12-03 22:41:22 +01:00
Ettore Di Giacinto	6daef00d30	chore(refactor): drop unnecessary code in loader (#4096 ) * chore: simplify passing options to ModelOptions Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(refactor): do not expose internal backend Loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2024-11-08 21:54:25 +01:00
Ettore Di Giacinto	947224b952	feat(diffusers): allow multiple lora adapters (#4081 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2024-11-05 15:14:33 +01:00
Ettore Di Giacinto	ae1ec4e096	feat(vllm): expose 'load_format' (#3943 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2024-10-23 15:34:57 +02:00
Ettore Di Giacinto	0965c6cd68	feat: track internally started models by ID (#3693 ) * chore(refactor): track internally started models by ID Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Just extend options, no need to copy Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Improve debugging for rerankers failures Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Simplify model loading with rerankers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Be more consistent when generating model options Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Uncommitted code Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Make deleteProcess more idiomatic Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt CLI for sound generation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup threads definition Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Handle corner case where c.Seed is nil Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Consistently use ModelOptions Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt new code to refactoring Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Dave <dave@gray101.com>	2024-10-02 08:55:58 +02:00
Sertaç Özercan	ee21b00a8d	feat: auto load into memory on startup (#3627 ) Signed-off-by: Sertac Ozercan <sozercan@gmail.com>	2024-09-22 10:03:30 +02:00
Ettore Di Giacinto	35561edb6e	feat(llama.cpp): support embeddings endpoints (#2871 ) * feat(llama.cpp): add embeddings Also enable embeddings by default for llama.cpp models Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(Makefile): prepare llama.cpp sources only once Otherwise we keep cloning llama.cpp for each of the variants Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * do not set embeddings to false Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs: add embeddings to the YAML config reference Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2024-07-15 22:54:16 +02:00
Ettore Di Giacinto	a8bfb6f9c2	feat(options): add `repeat_last_n` (#2660 ) feat(options): add repeat_last_n Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2024-06-26 14:58:50 +02:00
Sertaç Özercan	5866fc8ded	chore: fix go.mod module (#2635 ) Signed-off-by: Sertac Ozercan <sozercan@gmail.com>	2024-06-23 08:24:36 +00:00
Ettore Di Giacinto	e49ea0123b	feat(llama.cpp): add `flash_attention` and `no_kv_offloading` (#2310 ) feat(llama.cpp): add flash_attn and no_kv_offload Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2024-05-13 19:07:51 +02:00
Dave	2cd4936c99	fix: security scanner warning noise: error handlers part 1 (#2141 ) first group of error handlers to reduce security scanner warning noise level Signed-off-by: Dave Lee <dave@gray101.com>	2024-04-26 10:34:31 +02:00
Dave	c8dd8e5ef4	fix: reduce chmod permissions for created files and directories (#2137 ) quiet more security scanner issues: pass one of chmod restriction to remove group and other permissions Signed-off-by: Dave Lee <dave@gray101.com>	2024-04-26 00:47:06 +02:00
Taikono-Himazin	03adc1f60d	Add tensor_parallel_size setting to vllm setting items (#2085 ) Signed-off-by: Taikono-Himazin <kazu@po.harenet.ne.jp>	2024-04-20 14:37:02 +00:00
Ettore Di Giacinto	af9e5a2d05	Revert #1963 (#2056 ) * Revert "fix(fncall): fix regression introduced in #1963 (#2048)" This reverts commit `6b06d4e0af`. * Revert "fix: action-tmate back to upstream, dead code removal (#2038)" This reverts commit `fdec8a9d00`. * Revert "feat(grpc): return consumed token count and update response accordingly (#2035)" This reverts commit `e843d7df0e`. * Revert "refactor: backend/service split, channel-based llm flow (#1963)" This reverts commit `eed5706994`. * feat(grpc): return consumed token count and update response accordingly Fixes: #1920 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2024-04-17 23:33:49 +02:00
Dave	eed5706994	refactor: backend/service split, channel-based llm flow (#1963 ) Refactor: channel based llm flow and services split --------- Signed-off-by: Dave Lee <dave@gray101.com>	2024-04-13 09:45:34 +02:00
Ettore Di Giacinto	8342553214	fix(llama.cpp): set better defaults for llama.cpp (#1961 ) fix(defaults): set better defaults for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2024-04-06 22:56:45 +02:00
Ettore Di Giacinto	ff77d3bc22	fix(seed): generate random seed per-request if -1 is set (#1952 ) * fix(seed): generate random seed per-request if -1 is set Also update ci with new workflows and allow the aio tests to run with an api key Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(openvino): Add OpenVINO example Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2024-04-03 22:25:47 +02:00
Ettore Di Giacinto	f895d06605	fix(config): set better defaults for inferencing (#1822 ) * fix(defaults): set better defaults for inferencing This changeset aims to have better defaults and to properly detect when no inference settings are provided with the model. If not specified, we default to mirostat sampling, and offload all the GPU layers (if a GPU is detected). Related to https://github.com/mudler/LocalAI/issues/1373 and https://github.com/mudler/LocalAI/issues/1723 * Adapt tests * Also pre-initialize default seed	2024-03-13 10:05:30 +01:00
Ettore Di Giacinto	5d1018495f	feat(intel): add diffusers/transformers support (#1746 ) * feat(intel): add diffusers support * try to consume upstream container image * Debug * Manually install deps * Map transformers/hf cache dir to modelpath if not specified * fix(compel): update initialization, pass by all gRPC options * fix: add dependencies, implement transformers for xpu * base it from the oneapi image * Add pillow * set threads if specified when launching the API * Skip conda install if intel * defaults to non-intel * ci: add to pipelines * prepare compel only if enabled * Skip conda install if intel * fix cleanup * Disable compel by default * Install torch 2.1.0 with Intel * Skip conda on some setups * Detect python * Quiet output * Do not override system python with conda * Prefer python3 * Fixups * exllama2: do not install without conda (overrides pytorch version) * exllama/exllama2: do not install if not using cuda * Add missing dataset dependency * Small fixups, symlink to python, add requirements * Add neural_speed to the deps * correctly handle model offloading * fix: device_map == xpu * go back at calling python, fixed at dockerfile level * Exllama2 restricted to only nvidia gpus * Tokenizer to xpu	2024-03-07 14:37:45 +01:00
Ludovic Leroux	939411300a	Bump vLLM version + more options when loading models in vLLM (#1782 ) * Bump vLLM version to 0.3.2 * Add vLLM model loading options * Remove transformers-exllama * Fix install exllama	2024-03-01 22:48:53 +01:00

54 Commits