Commit Graph

54 Commits

Author SHA1 Message Date
LocalAI [bot]
e837921c2c feat: forward reasoning_effort to the backend so jinja models honor it (#10184)
* feat: forward reasoning_effort to the backend so jinja models honor it

reasoning_effort was only mapped to the binary enable_thinking toggle and
otherwise reached Go-side templates — it was never sent to the backend. So
jinja-templated models whose chat template keys on reasoning_effort (gpt-oss
Harmony, LFM2.5) could not be driven by it: LFM2.5 ignores enable_thinking and
kept emitting <think>.

Forward the effective reasoning_effort to the backend as a chat_template_kwarg
(mirroring enable_thinking) in grpc-server.cpp, and put it in PredictOptions
metadata (gRPCPredictOpts). Add a config-level default: ModelConfig.reasoning_effort
and Pipeline.reasoning_effort, resolved by ModelConfig.ApplyReasoningEffort
(request value overrides config default, none->disable / level->enable, an
operator's reasoning.disable wins). request.go now uses that helper.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): set the pipeline LLM's reasoning_effort

Apply Pipeline.ReasoningEffort to the pipeline's LLM config when the realtime
model is built (per-session copy, overrides the LLM's own reasoning_effort),
and surface the resolved effort on the template input so Go-templated models
get it too. jinja models receive it via the backend metadata. This lets a
realtime pipeline disable thinking on models that only honor reasoning_effort
(e.g. LFM2.5), which enable_thinking can't.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 13:45:43 +00:00
LocalAI [bot]
a44bdb29d4 feat: prefix-cache-aware routing for distributed mode (#10071)
* feat(radixtree): generic prefix tree skeleton with longest-match

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(radixtree): Insert with path recency refresh and entry cap

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(radixtree): TTL idle-expiry and Evict sweep with branch pruning

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(radixtree): recency-weighted per-value Weight

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(radixtree): Remove all entries for a value

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(radixtree): race-free concurrency smoke test

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(radixtree): reclaim empty branches, RWMutex reads, TTL boundary, empty-key guard

Address review findings on the generic prefix tree:

- Extract a shared pruneWalk helper parameterized by a shouldClear
  predicate and use it from Evict, Remove, and the MaxEntries path.
  Previously evictOldestLocked cleared a victim's value but never
  removed the now value-less node or its childless ancestors, so
  internal nodes accumulated under sustained churn at the cap. The
  MaxEntries path now prunes the victim and its empty ancestors.
- DRY: pruneWalk replaces the duplicated logic in the former
  pruneLocked and Remove's inner closure.
- Switch Tree.mu to sync.RWMutex; LongestMatch, Weight and Len take
  the read lock (RLock) while Insert, Evict and Remove keep the write
  lock. Confirmed race-clean under go test -race.
- Document the strict greater-than TTL boundary on Options.TTL and
  expired: age exactly equal to TTL is still live.
- Guard Insert against an empty key (no-op): the root never holds a
  value.

Adds Ginkgo specs covering MaxEntries eviction, ancestor reclamation,
the no-growth-past-cap invariant, the TTL boundary, and empty-key
behavior for both Insert and LongestMatch.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): RoutePolicy enum with parse/resolve

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): Config with defaults and validation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): deterministic xxhash prefix-chain extractor

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): pure filter-then-score replica selection

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): Provider interface and radix-tree-backed Index

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* style(prefixcache): gofmt policy enum comment alignment

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): head-first prefix chunking and hoist Weight out of sort

Address code-quality review findings in the prefixcache package.

Correctness: ExtractChain now chunks from absolute offset 0 with fixed
[0,W),[W,2W),... boundaries and caps the chain to the FIRST MaxDepth
head blocks. The previous tail-keeping logic shifted the byte offset by a
non-window amount once a conversation grew past MaxDepth*WindowBytes,
changing every hash each turn and silently breaking cross-turn
longest-prefix matching. The reusable KV/prefix cache lives at the head
of the prompt, so anchoring at offset 0 makes the chain a true
prefix-chain: P and P+suffix share their full leading overlap. Add a
regression spec proving cross-turn stability past the cap.

Performance: Index.Decide precomputes each candidate's Weight once
(decorate-sort-undecorate) instead of calling the O(tree size) Weight
inside the O(n log n) sort comparator. Behavior is unchanged.

Lint: encode prev with binary.LittleEndian.PutUint64 instead of a manual
byte loop, clearing the modernize rangeint finding.

Also add a concurrent Decide/Observe/Invalidate spec to exercise Index's
documented concurrency safety under go test -race.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(messaging): prefixcache observe/invalidate subjects and payloads

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): NATS sync publish/apply for observe and invalidate

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributedhdr): ctx carrier for prefix-hash chain

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributedhdr): PrefixChainHook indirection for backend-side chain build

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend): stash prompt prefix chain on ctx before distributed routing

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend): mirror modelID fallback for prefix-chain salt parity

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): scheduling config columns for prefix-cache routing

Add RoutePolicy and per-model balance/prefix-match override columns to
ModelSchedulingConfig and include them in the SetModelScheduling upsert
DoUpdates list so updates are not dropped on conflict.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): optional route preference in FindAndLockNodeWithModel

Add a RoutePreference type and a new pref parameter so the atomic
pick+lock+increment can be biased toward a preferred node without
weakening atomicity. A nil preference reproduces the previous ORDER BY
behavior exactly. Update the ModelRouter interface, both router.go call
sites (pass nil for now; Phase 5 builds the real preference), the test
doubles, and the distributed e2e caller.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): make Sync satisfy Provider with Evict

Sync.Observe now returns whether the local index treated the assignment as
new or extended, and Sync gains an Evict method that delegates to the wrapped
index. Together these let SmartRouter hold a single prefixcache.Provider that
broadcasts via NATS. Adds a compile-time Provider assertion and an
Evict-delegates behavioral test.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): prefix-cache-aware preference and observe in SmartRouter.Route

Add a PrefixProvider + PrefixConfig to SmartRouterOptions/SmartRouter (nil
keeps routing byte-for-byte the round-robin floor). On each request Route now
calls buildPreference: it reads the prompt prefix chain from ctx
(distributedhdr.PrefixChain), resolves the per-model policy/thresholds over
the global config, loads candidate replica in-flight via a new registry read
LoadedReplicaStats (deduped to one entry per node using the MIN in-flight
across that node's replicas), asks the provider to Decide, and runs
prefixcache.Select. The chosen node is passed as the RoutePreference to
FindAndLockNodeWithModel on all three pick paths (cache hit, locked re-pick,
cold scheduleAndLoad), and the served node is recorded via Observe only when
the resolved policy is prefix_cache so round-robin models never pollute the
tree.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): invalidate prefix-cache entries on unload and stale removal

UnloadModel and both staleness fall-through paths in Route (after a failed
gRPC probe and RemoveNodeModel) now call prefixProvider.Invalidate(model,
nodeID), guarded by a nil-provider check so the round-robin floor is
unchanged. At runtime the provider is the *prefixcache.Sync, so invalidations
also broadcast to peer frontends. Adds a test that a previously hot prefix no
longer Decides to a node after UnloadModel.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): rolling forced-disturb pressure counter

Add a concurrency-safe per-model rolling counter that tracks how many
times a request had a usable hot prefix match but the load guard forced
it off the warm node. Entries outside the window are dropped lazily on
Count so the backing slice stays bounded.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): autoscale on prefix-cache forced-disturb pressure

Wire the rolling forced-disturb counter into the SmartRouter and the
ReplicaReconciler.

Router: in buildPreference, after Decide + Select, record a forced-disturb
when a usable hot prefix match existed (d.HotNodeID != "" and
d.MatchRatio >= cfg.MinPrefixMatch) but Select chose a different node (or
nothing) because the load guard ruled the warm node out. This is the
scale-worthy signal: the cache-warm replica is saturated. It deliberately
does not fire for all-unique workloads (no hot match), avoiding
false-positive scale-ups. Pressure is optional on SmartRouterOptions; nil
keeps the path a no-op.

Reconciler: read the same Pressure instance in reconcileModel as an extra
scale-up reason, reusing the existing MaxReplicas + ClusterCapacityForModel
guards and the UnsatisfiableUntil cooldown that gates the whole method.
Pressure never overrides MaxReplicas and never force-evicts; a no-capacity
model does not spin. Window and threshold come from prefixcache.Config
(PressureWindow default 1m, PressureScaleThreshold default 1) and are
configurable via ReplicaReconcilerOptions.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): bound Pressure slice in Record; drop dead reconciler pressureWindow

Record now prunes entries older than the rolling window (the same prune
Count does), via a shared pruneLocked helper, so a model that takes
forced-disturb records but is never Counted (e.g. one with zero loaded
replicas the reconciler skips) no longer grows its backing slice
unbounded.

Also removes the dead pressureWindow struct field and the
ReplicaReconcilerOptions.PressureWindow option from the reconciler: they
were stored but never read (the window lives inside the *prefixcache.Pressure
instance). The scale block now reads pressure.Count once into a local.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(api): prefix-cache fields in scheduling endpoint DTO with validation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): prefix-cache routing controls in node scheduling form

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): wire prefix-cache index, NATS sync, and config

Activates prefix-cache-aware routing in distributed mode. Builds the
prefixcache Index + NATS-backed Sync + Pressure counter, installs the
distributedhdr.PrefixChainHook so core/backend/llm.go attaches a prefix
chain per request, subscribes to prefixcache.observe/prefixcache.invalidate
to apply peers' events to the local index (no re-broadcast), threads
PrefixProvider/PrefixConfig/Pressure into the SmartRouter and
Pressure/PressureThreshold into the ReplicaReconciler, and runs a
background eviction ticker (every TTL/2) bound to the app context.

Enabled by default; --distributed-prefix-cache=false (LOCALAI_DISTRIBUTED_PREFIX_CACHE)
opts out and leaves the provider/pressure nil so routing stays round-robin.
--distributed-prefix-cache-ttl (LOCALAI_DISTRIBUTED_PREFIX_CACHE_TTL, default 5m)
controls entry idle-timeout and eviction cadence.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(nodes): round-robin-floor invariant for prefix-cache routing

Drives Select directly: a saturated hot node (in_flight 50 vs 0) is never
picked even with a perfect prefix match (round-robin floor holds), while a
balanced hot node within the load slack is reused.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(prefixcache): clear branch lint findings and em dashes

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): validate prefix-cache config at startup wiring

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* perf(radixtree): single-walk WeightsFor for batch value weights

Add Tree.WeightsFor(values, now) which computes the recency-weighted
weight for many values in a single O(N + len(values)) tree traversal,
versus calling Weight once per value (O(len(values) * N)). Consumers
that score K candidates against the tree under the read lock no longer
pay K full walks.

Extract the per-entry contribution math into an unexported helper shared
by both Weight and WeightsFor so the metric stays identical (DRY).
Weight's public behavior is unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(config): add ModelConfig.ModelID() single source of truth

The c.Name fallback to c.Model was duplicated in core/backend/options.go
(feeding model.WithModelID) and hand-copied into core/backend/llm.go (the
prefix-chain salt). These MUST agree or the prefix-cache salt diverges
silently from the id the model loader tracks. Consolidate both into a new
config.ModelConfig.ModelID() helper and call it from both sites. Behavior
is identical.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* perf(prefixcache): reuse one xxhash.Digest in ExtractChain

ExtractChain allocated a fresh xxhash.New() Digest per block (up to MaxDepth
per call) and grew the chain slice without preallocation. Reuse a single
Digest via Reset() before each block and preallocate the chain to
min(nBlocks, MaxDepth).

xxhash seed 0 is stateless, so Reset()+Write produces the byte-identical
value to a fresh New()+Write. Output hashes are unchanged, preserving the
cross-process determinism that peers rely on over NATS. Verified by capturing
ExtractChain output for the existing test inputs before and after the
refactor: identical. Existing extractor tests pass unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): drop hot match when matched node is not a candidate; weigh cold candidates in one walk

Index.Decide called radixtree.LongestMatch over the whole tree, so the
deepest match could be a node that is offline, unloaded, or simply not in
the passed candidate set. Honoring that as HotNodeID produced a false
forced-disturb signal upstream (buildPreference records pressure when
chosen != HotNodeID), making it look like a warm replica was load
saturated when it was actually absent.

Build the candidate set once and only set HotNodeID/MatchRatio when the
matched node is an actual candidate; otherwise fall back to cold
placement. A future refinement could ask the tree for the longest match
restricted to the candidate nodes (shallower-but-valid) instead of
dropping it.

Also replace the per-candidate tree.Weight call in the cold-order sort
with a single tree.WeightsFor walk, turning O(K*N) under the read lock
into O(N + K).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(prefixcache): remove Select's unreachable deterministic fallback

buildPreference always passes ColdOrder as a permutation of the full
candidate set, so the cold-order loop hits every eligible candidate. The
trailing best/bestIF scan was dead. Replace it with a plain "return """
and document that ColdOrder is guaranteed to cover all candidates, so ""
means none were eligible.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(nodes): fetch model scheduling config once per Route

GetModelScheduling was read three times per request - in
resolveSelectorCandidates, buildPreference, and nodeMatchesScheduling -
three DB round-trips for one row that is immutable for the life of the
request, and not a consistent snapshot. Fetch it once near the top of
Route and thread the *ModelSchedulingConfig (may be nil) into all three
helpers. scheduleNewModel keeps its own fetch since it runs outside the
Route snapshot. Behavior is identical for nil sched.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(autoscale): add Pressure.Reset to consume forced-disturb signal

Pressure.Count is non-draining (it prunes only by age), so a single burst
of forced-disturbs stays within the rolling window for the whole window and
keeps Count >= threshold on every reconciler tick. The reconciler will use
Reset to clear a model's events after acting on the signal so a fresh
scale-up requires fresh forced-disturbs to accumulate, rather than one burst
driving the model toward MaxReplicas.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(autoscale): at most one scale-up per reconcile tick, consume pressure

Two autoscale bugs:

1. Over-scaling: the pressure scale-up block read Pressure.Count but never
   consumed it. With a non-draining counter a single forced-disturb burst
   kept Count >= threshold across the whole window, firing scaleUp on every
   tick and pushing the model toward MaxReplicas off one transient burst.
   After a successful pressure-triggered scale-up the reconciler now calls
   Pressure.Reset to consume the signal.

2. Double scale-up in one tick: the all-replicas-busy block and the pressure
   block could both fire in the same reconcileModel pass, each calling
   scaleUp(+1) against the same `current` read once at the top, so a model
   that was both busy and over threshold scaled +2 and could overshoot
   MaxReplicas by one. A scaledUp flag now enforces at most one scaleUp(+1)
   per tick: the pressure block is skipped if the busy block already scaled,
   and scale-down is skipped in any tick that scaled up.

MinReplicas enforcement, UnsatisfiableUntil backoff, and capacity guards are
unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): replica-removed chokepoint hook for prefix-cache invalidation

Add SetReplicaRemovedHook to NodeRegistry and fire it from both
RemoveNodeModel and RemoveAllNodeModelReplicas after a successful
delete. This is the single chokepoint every replica-removal path funnels
through (router eviction, reconciler scale-down, probe reaper,
health-monitor node-down reap, RemoteUnloaderAdapter), so the
prefix-cache index can be invalidated by construction rather than wiring
each call site individually.

The hook is stored in an atomic.Pointer so the startup wiring (setter)
and the request/reconcile-time fire are race-free; it is nil-safe when
unset. GORM Delete reports no error for a no-op delete, so the hook also
fires when nothing was removed; the consumer's Invalidate(model, node)
is idempotent so this is harmless.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): invalidate prefix-cache on any replica removal via registry hook

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(prefixcache): single source of truth for threshold bounds

Extract ValidateThresholds into prefixcache/config.go so the per-model
override validation (nodes.go endpoint) and Config.Validate share one
implementation of the numeric bounds (min_prefix_match in [0,1],
balance_abs_threshold >= 0, balance_rel_threshold == 0-or->= 1) instead
of hard-coding them in two places. The route_policy allow-list stays
explicit (not ParsePolicy, which maps typos to Default).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): preserve prefix-cache settings on partial scheduling update

A scheduling POST that omitted route_policy/thresholds (e.g. a
min_replicas-only update) full-replaced every column and silently reset
the model's previously-configured prefix-cache settings to empty/zero.

Make the four prefix-cache request fields pointers so omitted is
distinguishable from explicit zero, and merge PATCH-style in
SetSchedulingEndpoint: a provided pointer wins, an omitted one preserves
the existing config value (zero default when none). Non-prefix fields
keep their full-replace PUT semantics. Validation now runs on the
resolved values via prefixcache.ValidateThresholds.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): make Invalidate a no-op for uncached models and skip empty broadcasts

A registry chokepoint fires Sync.Invalidate(model, nodeID) for every replica
removal of every model, including round-robin models that never used the
prefix cache. Index.Invalidate previously called tree(model), which lazily
created and permanently retained an empty radix tree for any model that ever
lost a replica, growing the trees map without bound. Sync.Invalidate also
published a NATS PrefixCacheInvalidateEvent on every call, amplifying no-op
removals across the cluster.

Index.Invalidate now looks the tree up read-only via existingTree and returns
without allocating when none exists. The Provider interface is unchanged;
Sync gates the broadcast through an optional invalidateExisting(bool) capability
type-asserted from the wrapped Index, falling back to the prior always-broadcast
behavior for other Provider implementations.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* perf(prefixcache): derive Decide candidacy from WeightsFor and skip trivial sort

WeightsFor already returns a map keyed by every requested candidate, so the
separate candidates set built to validate the hot match was redundant: a node
is a candidate iff it is a key in the weights map. Drop the extra map and gate
the hot-match check on weights membership. Also skip the sort when there is at
most one candidate, since the input order is already the cold order. Behavior
is unchanged.

Deferred follow-up: skipping the WeightsFor walk entirely when a hot match wins
would need lazy cross-file changes and is out of scope here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): fire replica-removed hook on bulk node_models deletes; trim LoadedReplicaStats columns

Bulk node-scoped node_models deletes (Register re-register cleanup,
MarkOffline, MarkDraining, Deregister) removed rows directly without
firing the replica-removed hook, so the prefix-cache index kept
pointing at nodes whose models were gone. Capture the DISTINCT model
names before each bulk delete and fire fireReplicaRemoved once per
model after a successful delete, restoring the single-chokepoint
invariant for all removal paths. The pre-query is skipped when no hook
is set so the no-hook path stays cheap.

Also narrow LoadedReplicaStats to SELECT only node_id and in_flight
(the only fields the router consumer reads), dropping the JOIN-side
available_vram fetch and unused columns while keeping the
[]ReplicaCandidate return type unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(reconciler): consume autoscale signals only on a real scale-up

scaleUp was fire-and-forget (void) yet its callers unconditionally
consumed the pressure signal (Pressure.Reset) and the MinReplicas
hysteresis (ClearUnsatisfiable) right after calling it. If scaleUp
added nothing (ScheduleAndLoadModel errored, or no node could be
loaded) the saturated warm replica got no new replica AND its
accumulated forced-disturb history was wiped, forcing the signal to
re-accumulate over a full PressureWindow before the next attempt.

Make scaleUp return whether at least one replica was actually
scheduled, and gate the side effects on it:

- pressure block (2b): set scaledUp and call Pressure.Reset only on
  success; on failure preserve the signal so the next tick retries off
  the same accumulated pressure.
- busy-burst block (2): set scaledUp from the return value so a failed
  attempt does not suppress the pressure path or scale-down.
- MinReplicas block: call ClearUnsatisfiable only on success so a
  failed attempt does not reset the unsatisfiable counter.

All existing invariants (MaxReplicas, capacity gating,
UnsatisfiableUntil cooldown, at-most-one-scale-up-per-tick) are
preserved.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(nodes): drop router's redundant prefix-cache Invalidate calls

The NodeRegistry removal chokepoint (RemoveNodeModel /
RemoveAllNodeModelReplicas) now fires SetReplicaRemovedHook, which
invalidates the prefix-cache index. The router was also calling
prefixProvider.Invalidate explicitly right after each registry removal
on the two stale-replica health-probe fall-throughs in Route and in
UnloadModel, so every router-side eviction invalidated twice (double
tree-prune + double NATS broadcast).

Remove the three redundant explicit Invalidate calls and their empty
nil-guards. Each removed call sat immediately after a registry removal
that fires the hook, so invalidation is preserved via the chokepoint.
Decide/Observe usage is untouched.

Re-point the unit test (fake registry fires no hook) to assert the
removal chokepoint is exercised on unload instead of the router's
direct invalidation.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): broadcast invalidations unconditionally for cross-frontend coherence

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): reject TTL<=0 in Config.Validate (eviction ticker would panic)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): make capture+delete atomic in bulk node_models removal paths

MarkOffline, MarkDraining, and the Register re-register cleanup ran the
nodeModelNames SELECT and the bulk node_models DELETE as two separate
statements on r.db with no transaction. A SetNodeModel landing between
the two was deleted but its replica-removed hook never fired, leaving
the prefix-cache index pointing at a removed replica until TTL or
candidacy self-heal.

Wrap the capture and the delete in a single db.Transaction in each path
(mirroring how Deregister already does it). The captured model names are
collected into a slice declared outside the closure; the
replica-removed hook fires for each only after the transaction commits,
so a rollback never invalidates the index for a removal that did not
persist. The set of fired hooks now equals exactly the set of
node_models rows actually deleted, with no interleaving gap.

The status flip in MarkOffline/MarkDraining (setStatus) is a separate,
pre-existing operation and routing already filters non-healthy nodes, so
it stays outside the transaction; return contracts are unchanged.
Deregister was already correct and is untouched. The cheap-path skip
(no hook -> skip the SELECT) is preserved.

Adds a spec asserting MarkOffline fires hooks for exactly the rows it
deletes and leaves no node_models row behind (consistent snapshot).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(nodes): debug logging for prefix-cache routing decisions and observations

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(radixtree): match shared prefixes by valuing every node on insert

Insert recorded the value (node id) only on the final node of the key
chain, leaving every intermediate prefix node valueless. LongestMatch
returns the deepest node that hasValue, so two chains that share a
leading block but diverge in the tail never matched: only exact-repeat
queries hit. That broke the prefix-cache routing core use cases (shared
system prompt, multi-turn extension, volatile tail), all of which rely
on prefix matching rather than exact-repeat.

Set value/hasValue/lastSeen at every node along the chain so each
prefix-block node remembers the node id that served that prefix
(SGLang/vLLM-style). The deepest match wins, and the last writer owns a
shared prefix node (a recency heuristic: the most recent chain through a
block is the one most likely still warm). size now counts valued nodes,
which is the intended meaning.

Updated radixtree tests to the new semantics: deepest-prefix test uses
non-overlapping chains, a new test asserts last-writer-owns-shared-node,
Evict/Remove/MaxEntries expectations recomputed for per-prefix-node
counting, and a shared-prefix LongestMatch red test added. Added a
prefixcache Decide test proving a prefix-only query routes to the warm
node. No prefixcache .go logic changed.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(distributed): lock in prefix-cache routing behavior end to end

Add a DB-backed e2e spec that drives SmartRouter against a real
NodeRegistry (Postgres testcontainer) and the real prefixcache.Index
radix-tree provider, using a fake gRPC backend factory so no real
inference runs. Covers the five behaviors validated by hand:

1. Cold miss + observe: an unseen prefix chain cold-places and is recorded.
2. Hot-match affinity: the same chain returns to its warm node X.
3. Shared-prefix match: a divergent chain sharing X's leading prefix
   still routes to X (the radix-tree regression we fixed).
4. Negative control: an unrelated chain is a cold miss, not a false
   hot match on X.
5. Failover + invalidation: removing X's replica fires the registry
   chokepoint hook to invalidate the prefix entry, and the chain fails
   over to surviving node Y and re-homes there.

Replaces the need for manual docker-compose re-runs.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(prefixcache): make prefix-cache affinity replica-granular

Track prefix-cache affinity per loaded replica (a backend process with its
own KV cache) instead of per node, so multiple replicas of the same model on
one node each keep distinct affinity and a hot prefix routes back to the exact
replica that served it.

- radixtree: add RemoveFunc(pred) and reimplement Remove on top of it.
- prefixcache: introduce ReplicaKey{NodeID, Replica}; Index/Candidate/
  PrefixDecision/Select/Provider now key on ReplicaKey. Add InvalidateNode to
  drop every replica of a node; Invalidate drops one replica. Select returns
  (ReplicaKey, bool) and gains a deterministic least-in-flight eligible
  fallback (tiebreak NodeID then Replica).
- messaging: carry Replica on PrefixCacheObserveEvent and
  PrefixCacheInvalidateEvent (Replica < 0 means all replicas of the node).
- Sync delegates + broadcasts with replica; InvalidateNode broadcasts
  Replica=-1; ApplyInvalidate routes negative replica to InvalidateNode.

This is part 1 of 2; the registry/router/wiring consumers are updated
separately.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): make prefix-cache routing replica-granular

Wire the SmartRouter, NodeRegistry, and distributed startup to the
replica-keyed prefixcache API. Affinity is now tracked per replica
(each replica is a separate process with its own KV cache), so a prefix
served by (node,0) no longer leaks onto the same-node sibling (node,1).

- RoutePreference gains PreferredReplica; FindAndLockNodeWithModel locks
  the EXACT (node_id, replica_index) row, falling through to the default
  ORDER BY when that replica is not loaded.
- SetReplicaRemovedHook now carries replicaIndex; RemoveNodeModel fires
  the specific replica, RemoveAllNodeModelReplicas and the four bulk
  node-scoped deletes fire replica<0 (all replicas of the node).
- buildPreference builds one Candidate per loaded replica and locks the
  exact replica the policy chose; observePrefix records the served
  ReplicaKey at every call site.
- distributed.go routes the hook to InvalidateNode (replica<0) or
  Invalidate(key).
- Tests updated to the replica-keyed API plus new coverage: a hot prefix
  on (node,0) prefers replica 0 over the same-node sibling (router unit +
  e2e), and FindAndLock locks the exact preferred replica.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): derive prefix chain from messages for tokenizer-template models

Prefix-cache-aware routing built its prompt-prefix chain from the rendered
prompt string `s` in ModelInference. For models with
TemplateConfig.UseTokenizerTemplate the frontend never renders a prompt - the
backend tokenizes the structured messages itself - so `s` is empty, the chain
is empty, and routing silently falls back to round-robin. That covers the bulk
of modern chat models (qwen3, llama3, ...), so the feature effectively never
engaged for them.

Fall back to messagesPrefixSource(messages): a deterministic, prefix-stable
head-first serialization of the conversation (role + content per turn). Two
requests sharing a leading system prompt and early turns share a leading byte
prefix, which ExtractChain maps to a shared chain prefix - landing both on the
same cache-warm replica. The rendered `s` is still preferred when present
(higher fidelity for non-template models).

Found via the multi-replica-per-node e2e: zero "prefix-cache routing decision"
logs despite per-request Route calls, traced to the empty-chain guard.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document prefix-cache routing roadmap

Add a routing-and-caching roadmap section to the distributed-mode guide,
linking the epic (#10063) and the follow-up issues (#10064-#10070) surfaced
from a survey of SGLang, vLLM production-stack, Ray Serve, llm-d, AIBrix, and
NVIDIA Dynamo.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-30 23:24:22 +02:00
Richard Palethorpe
6a80e23733 feat(middleware): Model routing, PII filtering, Cloud model proxies (#9802)
Add a routing middleware stack and a cloud-proxy backend.

* cloud-proxy: a Go gRPC backend that forwards OpenAI- and
  Anthropic-shaped chat requests to upstream providers, with an
  optional translate mode (OpenAI request -> Anthropic /v1/messages
  -> OpenAI response) and full tool-calling support.

* routing: admission control, content-aware model routing
  (embedding cache + classifier + rerank + Arch-Router score),
  PII detection/redaction (regex + NER) with streaming filter and
  OpenAI/Anthropic adapters, and a per-user/per-key billing recorder
  backed by GORM or in-memory storage.

* middleware: UsageMiddleware records usage via the billing recorder,
  plus admission, route-model, usage-stamp and trace middlewares.

* observability: BackendTrace ring buffer stores full request bodies
  (capped), MITM proxy emits structured trace events, and router
  classifier decisions surface at /api/router/decide.

* gallery: Arch-Router-1.5B (Q4_K_M and Q8_0).

* UI: cloud-proxy model-editor fields, classifier system-prompt and
  score-normalization config, and a Traces page rendering request
  bodies.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-25 09:28:27 +02:00
LocalAI [bot]
1198d10b58 fix(traces): cap backend trace Data to keep admin UI responsive (#9960)
* fix(traces): cap backend trace Data field so the admin UI stays responsive

The previous fix (#9946) capped API trace bodies but missed backend traces,
which carry the same blast radius:

  - LLM backend traces store the full chat messages JSON, full response, and
    full streaming deltas. Every agent-pool reasoning step ships the full
    RAG-augmented history (50-500 KiB per trace, often 100+ traces queued).
  - TTS / audio_transform / transcript traces embed a 30s audio snippet as
    base64, around 1.3 MiB per trace.

Both blow the /api/backend-traces JSON past tens of MiB. The admin Traces
page then keeps re-downloading and re-parsing the buffer faster than the
5s auto-refresh and stays in the loading state forever, the same symptom
the API-side fix addressed.

Apply two complementary caps, both honoring LOCALAI_TRACING_MAX_BODY_BYTES:

Option A (safety net in core/trace): RecordBackendTrace walks the Data map
recursively and replaces any string value larger than the cap with
"<truncated: N bytes>". Catches anything a future producer forgets.

Option B (head-preserving at the producer):
  - core/backend/llm.go: TruncateToBytes on messages, response, and
    chat_deltas content/reasoning_content so the leading content stays
    readable in the UI.
  - core/trace/audio_snippet.go: omit audio_wav_base64 when the encoded
    blob would exceed the cap (truncated base64 is undecodable). The
    quality metrics still ship and the UI's WaveformPlayer simply skips
    when the field is absent.

TruncateToBytes is bounded to <= maxBytes so Option A leaves the producer's
head-preserving output alone instead of replacing it with the bare marker.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

* fix(react-ui): expose tracing_max_body_bytes in Settings and Traces panels

The setting was already plumbed through env (LOCALAI_TRACING_MAX_BODY_BYTES),
CLI flag, and the runtime_settings.json GET/PUT schema, but neither the main
Settings page nor the inline Traces panel offered an input for it. Admins
hitting the "Traces UI stuck loading" symptom had to know to set an env var
or PUT raw JSON to /api/settings to dial the cap.

Add a "Max Body Bytes" row next to "Max Items" in both places. Same input
type, same disabled-when-tracing-off semantics, placeholder shows the 65536
default so users see what they're inheriting.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

* test(react-ui): disambiguate Max Items locator after adding Max Body Bytes

The Tracing settings panel now has two number inputs. The previous spec
matched 'input[type="number"]' which became ambiguous and triggered a
Playwright strict-mode violation in CI. Switch to getByPlaceholder('100')
for Max Items and add a parallel spec for the new Max Body Bytes field
using getByPlaceholder('65536').

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-23 14:50:40 +02:00
LocalAI [bot]
c500461c69 feat(config): default prompt_cache_all to true (#9951)
Upstream llama.cpp defaults `cache_prompt = true` (common/common.h),
but `parse_options` in the grpc-server backend unconditionally forwards
the proto `PromptCacheAll` field, so any model that didn't set
`prompt_cache_all: true` in its YAML was getting `cache_prompt=false` —
silently overriding llama.cpp's own default. With `kv_unified` and
`cache_idle_slots` already on by default, this was the last piece
preventing the per-request prompt cache from being usable out of the
box.

Make `PromptCacheAll` tristate (`*bool`), default it to `true` in
`SetDefaults`, and dereference at the proto boundary. Users can still
opt out with an explicit `prompt_cache_all: false`. Same pattern as
`MMap`, `MMlock`, `Reranking`, etc.

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 22:06:22 +02:00
LocalAI [bot]
70cf8ac546 fix(backend): resolve relative draft_model paths against the models dir (#9680)
* fix(backend): resolve relative draft_model paths against the models dir

The main model file and mmproj are joined with the configured models
directory before reaching the backend, but draft_model was sent
verbatim. With a relative draft_model in the YAML config, llama.cpp
opens the path from the backend process's CWD and fails with "No such
file or directory", forcing users to hard-code an absolute path.

Mirror the existing mmproj resolution: if draft_model is relative,
join it with modelPath. Absolute paths are passed through unchanged.

Adds an e2e regression test against the mock backend that asserts the
main model file, mmproj, and draft_model all arrive at the backend
resolved to absolute paths.

Closes #9675

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write]

* fix(backend): always join draft_model with models dir (drop IsAbs shortcut)

The previous commit kept absolute draft_model paths intact via an
IsAbs check. That left a path-traversal vector open: a user-supplied
YAML config could set draft_model to /etc/passwd (or any other host
file the backend process can read) and the path would be sent through
unchanged.

filepath.Join cleans the leading slash from absolute components, so
joining unconditionally — the way mmproj already does — keeps the
result rooted at the configured models directory regardless of input.

Adds a second e2e spec that feeds an absolute draft_model into the
mock backend and asserts the path is clamped under modelsPath.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7-1m [Read] [Edit] [Bash]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-06 00:58:38 +02:00
Richard Palethorpe
4916f8c880 feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map (#9563)
* feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map

LocalAI's vLLM backend wraps a small typed subset of vLLM's
AsyncEngineArgs (quantization, tensor_parallel_size, dtype, etc.).
Anything outside that subset -- pipeline/data/expert parallelism,
speculative_config, kv_transfer_config, all2all_backend, prefix
caching, chunked prefill, etc. -- requires a new protobuf field, a
Go struct field, an options.go line, and a backend.py mapping per
feature. That cadence is the bottleneck on shipping vLLM's
production feature set.

Add a generic `engine_args:` map on the model YAML that is
JSON-serialised into a new ModelOptions.EngineArgs proto field and
applied verbatim to AsyncEngineArgs at LoadModel time. Validation
is done by the Python backend via dataclasses.fields(); unknown
keys fail with the closest valid name as a hint.
dataclasses.replace() is used so vLLM's __post_init__ re-runs and
auto-converts dict values into nested config dataclasses
(CompilationConfig, AttentionConfig, ...). speculative_config and
kv_transfer_config flow through as dicts; vLLM converts them at
engine init.

Operators can now write:

  engine_args:
    data_parallel_size: 8
    enable_expert_parallel: true
    all2all_backend: deepep_low_latency
    speculative_config:
      method: deepseek_mtp
      num_speculative_tokens: 3
    kv_cache_dtype: fp8

without further proto/Go/Python plumbing per field.

Production defaults seeded by hooks_vllm.go: enable_prefix_caching
and enable_chunked_prefill default to true unless explicitly set.

Existing typed YAML fields (gpu_memory_utilization,
tensor_parallel_size, etc.) remain for back-compat; engine_args
overrides them when both are set.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore(vllm): pin cublas13 to vLLM 0.20.0 cu130 wheel

vLLM's PyPI wheel is built against CUDA 12 (libcudart.so.12) and won't
load on a cu130 host. Switch the cublas13 build to vLLM's per-tag cu130
simple-index (https://wheels.vllm.ai/0.20.0/cu130/) and pin
vllm==0.20.0. The cu130-flavoured wheel ships libcudart.so.13 and
includes the DFlash speculative-decoding method that landed in 0.20.0.

cublas13 install gets --index-strategy=unsafe-best-match so uv consults
both the cu130 index and PyPI when resolving — PyPI also publishes
vllm==0.20.0, but with cu12 binaries that error at import time.

Verified: Qwen3.5-4B + z-lab/Qwen3.5-4B-DFlash loads and serves chat
completions on RTX 5070 Ti (sm_120, cu130).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* ci(vllm): bot job to bump cublas13 vLLM wheel pin

vLLM's cu130 wheel index URL is itself version-locked
(wheels.vllm.ai/<TAG>/cu130/, no /latest/ alias upstream), so a vLLM
bump means rewriting two values atomically — the URL segment and the
version constraint. bump_deps.sh handles git-sha-in-Makefile only;
add a sibling bump_vllm_wheel.sh and a matching workflow job that
mirrors the existing matrix's PR-creation pattern.

The bumper queries /releases/latest (which excludes prereleases),
strips the leading 'v', and seds both lines unconditionally. When the
file is already on the latest tag the rewrite is a no-op and
peter-evans/create-pull-request opens no PR.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* docs(vllm): document engine_args and speculative decoding

The new engine_args: map plumbs arbitrary AsyncEngineArgs through to
vLLM, but the public docs only covered the basic typed fields. Add a
short subsection in the vLLM section explaining the typed/generic
split and showing a worked DFlash speculative-decoding config, with
pointers to vLLM's SpeculativeConfig reference and z-lab's drafter
collection.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-29 00:49:28 +02:00
Ettore Di Giacinto
8862e3ce60 feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler (#9186)
* always enable parallel requests

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: move tests to ginkgo

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(smart router): order by available vram

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 08:28:56 +02:00
Ettore Di Giacinto
59108fbe32 feat: add distributed mode (#9124)
* feat: add distributed mode (experimental)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix data races, mutexes, transactions

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactorings

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix events and tool stream in agent chat

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* use ginkgo

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(cron): compute correctly time boundaries avoiding re-triggering

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* enhancements, refactorings

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* do not flood of healthy checks

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* do not list obvious backends as text backends

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* tests fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Drop redundant healthcheck

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* enhancements, refactorings

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-30 00:47:27 +02:00
Ettore Di Giacinto
031a36c995 feat: inferencing default, automatic tool parsing fallback and wire min_p (#9092)
* feat: wire min_p

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: inferencing defaults

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(refactor): re-use iterative parser

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: generate automatically inference defaults from unsloth

Instead of trying to re-invent the wheel and maintain here the inference
defaults, prefer to consume unsloth ones, and contribute there as
necessary.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: apply defaults also to models installed via gallery

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: be consistent and apply fallback to all endpoint

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-22 00:57:15 +01:00
Richard Palethorpe
35d509d8e7 feat(ui): Per model backend logs and various fixes (#9028)
* feat(gallery): Switch to expandable box instead of pop-over and display model files

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(ui, backends): Add individual backend logging

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(ui): Set the context settings from the model config

Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-18 08:31:26 +01:00
LocalAI [bot]
c6a51289b0 fix: Automatically disable mmap for Intel SYCL backends (#9012) (#9015)
* fix: Automatically disable mmap for Intel SYCL backends

Fixes issue #9012 where Qwen3.5 models fail to load on Intel Arc GPU
with RPC EOF error.

The Intel SYCL backend has a known issue where mmap enabled causes
the backend to hang. This change automatically disables mmap when
detecting Intel or SYCL backends.

References:
- https://github.com/mudler/LocalAI/issues/9012
- Documentation mentions: SYCL hangs when mmap: true is set

* feat: Add logging for mmap auto-disable on Intel SYCL backends

As requested in PR review, add xlog.Info call to log when mmap
is automatically disabled for Intel SYCL backends. This helps
with debugging and confirms the auto-disable logic is working.

---------

Co-authored-by: localai-bot <localai-bot@users.noreply.github.com>
2026-03-15 21:06:35 +01:00
Ettore Di Giacinto
580517f9db feat: pass-by metadata to predict options (#8795)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-05 22:50:10 +01:00
Ettore Di Giacinto
5f6c941399 fix(llama.cpp/mmproj): fix loading mmproj in nested sub-dirs different from model path (#7832)
fix(mmproj): fix loading mmproj in nested sub-dirs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-01-02 20:17:30 +01:00
Ettore Di Giacinto
c37785b78c chore(refactor): move logging to common package based on slog (#7668)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-12-21 19:33:13 +01:00
Ettore Di Giacinto
d7f9f3ac93 feat: add support to logitbias and logprobs (#7283)
* feat: add support to logprobs in results

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: add support to logitbias

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-11-16 13:27:36 +01:00
Ettore Di Giacinto
cd1e1124ea fix(llama.cpp): correctly set grammar triggers (#6432)
* fix(llama.cpp): correctly set grammar triggers

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Do not enable lazy by default

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-10-10 19:50:17 +02:00
Ettore Di Giacinto
739573e41b feat(flash_attention): set auto for flash_attention in llama.cpp (#6168)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-08-31 17:59:09 +02:00
Ettore Di Giacinto
089efe05fd feat(backends): add system backend, refactor (#6059)
- Add a system backend path
- Refactor and consolidate system information in system state
- Use system state in all the components to figure out the system paths
  to used whenever needed
- Refactor BackendConfig -> ModelConfig. This was otherway misleading as
  now we do have a backend configuration which is not the model config.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-08-14 19:38:26 +02:00
Ettore Di Giacinto
98e5291afc feat: refactor build process, drop embedded backends (#5875)
* feat: split remaining backends and drop embedded backends

- Drop silero-vad, huggingface, and stores backend from embedded
  binaries
- Refactor Makefile and Dockerfile to avoid building grpc backends
- Drop golang code that was used to embed backends
- Simplify building by using goreleaser

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(gallery): be specific with llama-cpp backend templates

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(docs): update

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(ci): minor fixes

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: drop all ffmpeg references

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: run protogen-go

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Always enable p2p mode

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Update gorelease file

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(stores): do not always load

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Fix linting issues

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Simplify

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Mac OS fixup

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-22 16:31:04 +02:00
Ettore Di Giacinto
dfadc3696e feat(llama.cpp): allow to set kv-overrides (#5745)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-06-28 21:26:07 +02:00
Ettore Di Giacinto
3b0cf52f6a feat(llama.cpp): add reranking (#5396)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-05-22 21:49:30 +02:00
Ettore Di Giacinto
b2f9fc870b chore(defaults): enlarge defaults, drop gpu layers which is infered (#5308)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-05-03 18:44:51 +02:00
Ettore Di Giacinto
61cc76c455 chore(autogptq): drop archived backend (#5214)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-04-19 15:52:29 +02:00
Ettore Di Giacinto
2c425e9c69 feat(loader): enhance single active backend by treating as singleton (#5107)
feat(loader): enhance single active backend by treating at singleton

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-04-01 20:58:11 +02:00
Ettore Di Giacinto
67f7bffd18 chore(deps): update llama.cpp and sync with upstream changes (#4950)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-03-06 00:40:58 +01:00
Brandon Beiler
6a6e1a0ea9 feat(vllm): Additional vLLM config options (Disable logging, dtype, and Per-Prompt media limits) (#4855)
* Adding the following vLLM config options: disable_log_status, dtype, limit_mm_per_prompt

Signed-off-by: TheDropZone <brandonbeiler@gmail.com>

* using " marks in the config.yaml file

Signed-off-by: TheDropZone <brandonbeiler@gmail.com>

* adding in missing colon

Signed-off-by: TheDropZone <brandonbeiler@gmail.com>

---------

Signed-off-by: TheDropZone <brandonbeiler@gmail.com>
2025-02-18 19:27:58 +01:00
Ettore Di Giacinto
1d6afbd65d feat(llama.cpp): Add support to grammar triggers (#4733)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-02-02 13:25:03 +01:00
Ettore Di Giacinto
7d0ac1ea3f chore(vall-e-x): Drop backend (#4619)
There are many new architectures that are SOTA and replaces vall-e-x
nowadays.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-01-17 09:35:10 +01:00
Ettore Di Giacinto
d4c1746c7d feat(llama.cpp): expose cache_type_k and cache_type_v for quant of kv cache (#4329)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-12-06 10:23:59 +01:00
Ettore Di Giacinto
44a5dac312 feat(backend): add stablediffusion-ggml (#4289)
* feat(backend): add stablediffusion-ggml

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(ci): track stablediffusion-ggml

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Use default scheduler and sampler if not specified

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Move cfg scale out of diffusers block

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Make it working

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: set free_params_immediately to false to call the model in sequence

https://github.com/leejet/stable-diffusion.cpp/issues/366

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-12-03 22:41:22 +01:00
Ettore Di Giacinto
6daef00d30 chore(refactor): drop unnecessary code in loader (#4096)
* chore: simplify passing options to ModelOptions

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(refactor): do not expose internal backend Loader

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-11-08 21:54:25 +01:00
Ettore Di Giacinto
947224b952 feat(diffusers): allow multiple lora adapters (#4081)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-11-05 15:14:33 +01:00
Ettore Di Giacinto
ae1ec4e096 feat(vllm): expose 'load_format' (#3943)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-10-23 15:34:57 +02:00
Ettore Di Giacinto
0965c6cd68 feat: track internally started models by ID (#3693)
* chore(refactor): track internally started models by ID

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Just extend options, no need to copy

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Improve debugging for rerankers failures

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Simplify model loading with rerankers

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Be more consistent when generating model options

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Uncommitted code

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Make deleteProcess more idiomatic

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Adapt CLI for sound generation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Fixup threads definition

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Handle corner case where c.Seed is nil

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Consistently use ModelOptions

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Adapt new code to refactoring

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Dave <dave@gray101.com>
2024-10-02 08:55:58 +02:00
Sertaç Özercan
ee21b00a8d feat: auto load into memory on startup (#3627)
Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
2024-09-22 10:03:30 +02:00
Ettore Di Giacinto
35561edb6e feat(llama.cpp): support embeddings endpoints (#2871)
* feat(llama.cpp): add embeddings

Also enable embeddings by default for llama.cpp models

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(Makefile): prepare llama.cpp sources only once

Otherwise we keep cloning llama.cpp for each of the variants

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* do not set embeddings to false

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs: add embeddings to the YAML config reference

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-07-15 22:54:16 +02:00
Ettore Di Giacinto
a8bfb6f9c2 feat(options): add repeat_last_n (#2660)
feat(options): add repeat_last_n

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-06-26 14:58:50 +02:00
Sertaç Özercan
5866fc8ded chore: fix go.mod module (#2635)
Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
2024-06-23 08:24:36 +00:00
Ettore Di Giacinto
e49ea0123b feat(llama.cpp): add flash_attention and no_kv_offloading (#2310)
feat(llama.cpp): add flash_attn and no_kv_offload

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-05-13 19:07:51 +02:00
Dave
2cd4936c99 fix: security scanner warning noise: error handlers part 1 (#2141)
first group of error handlers to reduce security scanner warning noise level

Signed-off-by: Dave Lee <dave@gray101.com>
2024-04-26 10:34:31 +02:00
Dave
c8dd8e5ef4 fix: reduce chmod permissions for created files and directories (#2137)
quiet more security scanner issues: pass one of chmod restriction to remove group and other permissions

Signed-off-by: Dave Lee <dave@gray101.com>
2024-04-26 00:47:06 +02:00
Taikono-Himazin
03adc1f60d Add tensor_parallel_size setting to vllm setting items (#2085)
Signed-off-by: Taikono-Himazin <kazu@po.harenet.ne.jp>
2024-04-20 14:37:02 +00:00
Ettore Di Giacinto
af9e5a2d05 Revert #1963 (#2056)
* Revert "fix(fncall): fix regression introduced in #1963 (#2048)"

This reverts commit 6b06d4e0af.

* Revert "fix: action-tmate back to upstream, dead code removal (#2038)"

This reverts commit fdec8a9d00.

* Revert "feat(grpc): return consumed token count and update response accordingly (#2035)"

This reverts commit e843d7df0e.

* Revert "refactor: backend/service split, channel-based llm flow (#1963)"

This reverts commit eed5706994.

* feat(grpc): return consumed token count and update response accordingly

Fixes: #1920

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-04-17 23:33:49 +02:00
Dave
eed5706994 refactor: backend/service split, channel-based llm flow (#1963)
Refactor: channel based llm flow and services split

---------

Signed-off-by: Dave Lee <dave@gray101.com>
2024-04-13 09:45:34 +02:00
Ettore Di Giacinto
8342553214 fix(llama.cpp): set better defaults for llama.cpp (#1961)
fix(defaults): set better defaults for llama.cpp

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-04-06 22:56:45 +02:00
Ettore Di Giacinto
ff77d3bc22 fix(seed): generate random seed per-request if -1 is set (#1952)
* fix(seed): generate random seed per-request if -1 is set

Also update ci with new workflows and allow the aio tests to run with an
api key

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(openvino): Add OpenVINO example

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-04-03 22:25:47 +02:00
Ettore Di Giacinto
f895d06605 fix(config): set better defaults for inferencing (#1822)
* fix(defaults): set better defaults for inferencing

This changeset aim to have better defaults and to properly detect when
no inference settings are provided with the model.

If not specified, we defaults to mirostat sampling, and offload all the
GPU layers (if a GPU is detected).

Related to https://github.com/mudler/LocalAI/issues/1373 and https://github.com/mudler/LocalAI/issues/1723

* Adapt tests

* Also pre-initialize default seed
2024-03-13 10:05:30 +01:00
Ettore Di Giacinto
5d1018495f feat(intel): add diffusers/transformers support (#1746)
* feat(intel): add diffusers support

* try to consume upstream container image

* Debug

* Manually install deps

* Map transformers/hf cache dir to modelpath if not specified

* fix(compel): update initialization, pass by all gRPC options

* fix: add dependencies, implement transformers for xpu

* base it from the oneapi image

* Add pillow

* set threads if specified when launching the API

* Skip conda install if intel

* defaults to non-intel

* ci: add to pipelines

* prepare compel only if enabled

* Skip conda install if intel

* fix cleanup

* Disable compel by default

* Install torch 2.1.0 with Intel

* Skip conda on some setups

* Detect python

* Quiet output

* Do not override system python with conda

* Prefer python3

* Fixups

* exllama2: do not install without conda (overrides pytorch version)

* exllama/exllama2: do not install if not using cuda

* Add missing dataset dependency

* Small fixups, symlink to python, add requirements

* Add neural_speed to the deps

* correctly handle model offloading

* fix: device_map == xpu

* go back at calling python, fixed at dockerfile level

* Exllama2 restricted to only nvidia gpus

* Tokenizer to xpu
2024-03-07 14:37:45 +01:00
Ludovic Leroux
939411300a Bump vLLM version + more options when loading models in vLLM (#1782)
* Bump vLLM version to 0.3.2

* Add vLLM model loading options

* Remove transformers-exllama

* Fix install exllama
2024-03-01 22:48:53 +01:00