* feat(radixtree): generic prefix tree skeleton with longest-match
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(radixtree): Insert with path recency refresh and entry cap
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(radixtree): TTL idle-expiry and Evict sweep with branch pruning
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(radixtree): recency-weighted per-value Weight
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(radixtree): Remove all entries for a value
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(radixtree): race-free concurrency smoke test
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(radixtree): reclaim empty branches, RWMutex reads, TTL boundary, empty-key guard
Address review findings on the generic prefix tree:
- Extract a shared pruneWalk helper parameterized by a shouldClear
predicate and use it from Evict, Remove, and the MaxEntries path.
Previously evictOldestLocked cleared a victim's value but never
removed the now value-less node or its childless ancestors, so
internal nodes accumulated under sustained churn at the cap. The
MaxEntries path now prunes the victim and its empty ancestors.
- DRY: pruneWalk replaces the duplicated logic in the former
pruneLocked and Remove's inner closure.
- Switch Tree.mu to sync.RWMutex; LongestMatch, Weight and Len take
the read lock (RLock) while Insert, Evict and Remove keep the write
lock. Confirmed race-clean under go test -race.
- Document the strict greater-than TTL boundary on Options.TTL and
expired: age exactly equal to TTL is still live.
- Guard Insert against an empty key (no-op): the root never holds a
value.
Adds Ginkgo specs covering MaxEntries eviction, ancestor reclamation,
the no-growth-past-cap invariant, the TTL boundary, and empty-key
behavior for both Insert and LongestMatch.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(prefixcache): RoutePolicy enum with parse/resolve
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(prefixcache): Config with defaults and validation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(prefixcache): deterministic xxhash prefix-chain extractor
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(prefixcache): pure filter-then-score replica selection
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(prefixcache): Provider interface and radix-tree-backed Index
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* style(prefixcache): gofmt policy enum comment alignment
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(prefixcache): head-first prefix chunking and hoist Weight out of sort
Address code-quality review findings in the prefixcache package.
Correctness: ExtractChain now chunks from absolute offset 0 with fixed
[0,W),[W,2W),... boundaries and caps the chain to the FIRST MaxDepth
head blocks. The previous tail-keeping logic shifted the byte offset by a
non-window amount once a conversation grew past MaxDepth*WindowBytes,
changing every hash each turn and silently breaking cross-turn
longest-prefix matching. The reusable KV/prefix cache lives at the head
of the prompt, so anchoring at offset 0 makes the chain a true
prefix-chain: P and P+suffix share their full leading overlap. Add a
regression spec proving cross-turn stability past the cap.
Performance: Index.Decide precomputes each candidate's Weight once
(decorate-sort-undecorate) instead of calling the O(tree size) Weight
inside the O(n log n) sort comparator. Behavior is unchanged.
Lint: encode prev with binary.LittleEndian.PutUint64 instead of a manual
byte loop, clearing the modernize rangeint finding.
Also add a concurrent Decide/Observe/Invalidate spec to exercise Index's
documented concurrency safety under go test -race.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(messaging): prefixcache observe/invalidate subjects and payloads
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(prefixcache): NATS sync publish/apply for observe and invalidate
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributedhdr): ctx carrier for prefix-hash chain
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributedhdr): PrefixChainHook indirection for backend-side chain build
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(backend): stash prompt prefix chain on ctx before distributed routing
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(backend): mirror modelID fallback for prefix-chain salt parity
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(nodes): scheduling config columns for prefix-cache routing
Add RoutePolicy and per-model balance/prefix-match override columns to
ModelSchedulingConfig and include them in the SetModelScheduling upsert
DoUpdates list so updates are not dropped on conflict.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(nodes): optional route preference in FindAndLockNodeWithModel
Add a RoutePreference type and a new pref parameter so the atomic
pick+lock+increment can be biased toward a preferred node without
weakening atomicity. A nil preference reproduces the previous ORDER BY
behavior exactly. Update the ModelRouter interface, both router.go call
sites (pass nil for now; Phase 5 builds the real preference), the test
doubles, and the distributed e2e caller.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(prefixcache): make Sync satisfy Provider with Evict
Sync.Observe now returns whether the local index treated the assignment as
new or extended, and Sync gains an Evict method that delegates to the wrapped
index. Together these let SmartRouter hold a single prefixcache.Provider that
broadcasts via NATS. Adds a compile-time Provider assertion and an
Evict-delegates behavioral test.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(nodes): prefix-cache-aware preference and observe in SmartRouter.Route
Add a PrefixProvider + PrefixConfig to SmartRouterOptions/SmartRouter (nil
keeps routing byte-for-byte the round-robin floor). On each request Route now
calls buildPreference: it reads the prompt prefix chain from ctx
(distributedhdr.PrefixChain), resolves the per-model policy/thresholds over
the global config, loads candidate replica in-flight via a new registry read
LoadedReplicaStats (deduped to one entry per node using the MIN in-flight
across that node's replicas), asks the provider to Decide, and runs
prefixcache.Select. The chosen node is passed as the RoutePreference to
FindAndLockNodeWithModel on all three pick paths (cache hit, locked re-pick,
cold scheduleAndLoad), and the served node is recorded via Observe only when
the resolved policy is prefix_cache so round-robin models never pollute the
tree.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(nodes): invalidate prefix-cache entries on unload and stale removal
UnloadModel and both staleness fall-through paths in Route (after a failed
gRPC probe and RemoveNodeModel) now call prefixProvider.Invalidate(model,
nodeID), guarded by a nil-provider check so the round-robin floor is
unchanged. At runtime the provider is the *prefixcache.Sync, so invalidations
also broadcast to peer frontends. Adds a test that a previously hot prefix no
longer Decides to a node after UnloadModel.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(prefixcache): rolling forced-disturb pressure counter
Add a concurrency-safe per-model rolling counter that tracks how many
times a request had a usable hot prefix match but the load guard forced
it off the warm node. Entries outside the window are dropped lazily on
Count so the backing slice stays bounded.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(nodes): autoscale on prefix-cache forced-disturb pressure
Wire the rolling forced-disturb counter into the SmartRouter and the
ReplicaReconciler.
Router: in buildPreference, after Decide + Select, record a forced-disturb
when a usable hot prefix match existed (d.HotNodeID != "" and
d.MatchRatio >= cfg.MinPrefixMatch) but Select chose a different node (or
nothing) because the load guard ruled the warm node out. This is the
scale-worthy signal: the cache-warm replica is saturated. It deliberately
does not fire for all-unique workloads (no hot match), avoiding
false-positive scale-ups. Pressure is optional on SmartRouterOptions; nil
keeps the path a no-op.
Reconciler: read the same Pressure instance in reconcileModel as an extra
scale-up reason, reusing the existing MaxReplicas + ClusterCapacityForModel
guards and the UnsatisfiableUntil cooldown that gates the whole method.
Pressure never overrides MaxReplicas and never force-evicts; a no-capacity
model does not spin. Window and threshold come from prefixcache.Config
(PressureWindow default 1m, PressureScaleThreshold default 1) and are
configurable via ReplicaReconcilerOptions.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(prefixcache): bound Pressure slice in Record; drop dead reconciler pressureWindow
Record now prunes entries older than the rolling window (the same prune
Count does), via a shared pruneLocked helper, so a model that takes
forced-disturb records but is never Counted (e.g. one with zero loaded
replicas the reconciler skips) no longer grows its backing slice
unbounded.
Also removes the dead pressureWindow struct field and the
ReplicaReconcilerOptions.PressureWindow option from the reconciler: they
were stored but never read (the window lives inside the *prefixcache.Pressure
instance). The scale block now reads pressure.Count once into a local.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(api): prefix-cache fields in scheduling endpoint DTO with validation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): prefix-cache routing controls in node scheduling form
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): wire prefix-cache index, NATS sync, and config
Activates prefix-cache-aware routing in distributed mode. Builds the
prefixcache Index + NATS-backed Sync + Pressure counter, installs the
distributedhdr.PrefixChainHook so core/backend/llm.go attaches a prefix
chain per request, subscribes to prefixcache.observe/prefixcache.invalidate
to apply peers' events to the local index (no re-broadcast), threads
PrefixProvider/PrefixConfig/Pressure into the SmartRouter and
Pressure/PressureThreshold into the ReplicaReconciler, and runs a
background eviction ticker (every TTL/2) bound to the app context.
Enabled by default; --distributed-prefix-cache=false (LOCALAI_DISTRIBUTED_PREFIX_CACHE)
opts out and leaves the provider/pressure nil so routing stays round-robin.
--distributed-prefix-cache-ttl (LOCALAI_DISTRIBUTED_PREFIX_CACHE_TTL, default 5m)
controls entry idle-timeout and eviction cadence.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(nodes): round-robin-floor invariant for prefix-cache routing
Drives Select directly: a saturated hot node (in_flight 50 vs 0) is never
picked even with a perfect prefix match (round-robin floor holds), while a
balanced hot node within the load slack is reused.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(prefixcache): clear branch lint findings and em dashes
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): validate prefix-cache config at startup wiring
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* perf(radixtree): single-walk WeightsFor for batch value weights
Add Tree.WeightsFor(values, now) which computes the recency-weighted
weight for many values in a single O(N + len(values)) tree traversal,
versus calling Weight once per value (O(len(values) * N)). Consumers
that score K candidates against the tree under the read lock no longer
pay K full walks.
Extract the per-entry contribution math into an unexported helper shared
by both Weight and WeightsFor so the metric stays identical (DRY).
Weight's public behavior is unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(config): add ModelConfig.ModelID() single source of truth
The c.Name fallback to c.Model was duplicated in core/backend/options.go
(feeding model.WithModelID) and hand-copied into core/backend/llm.go (the
prefix-chain salt). These MUST agree or the prefix-cache salt diverges
silently from the id the model loader tracks. Consolidate both into a new
config.ModelConfig.ModelID() helper and call it from both sites. Behavior
is identical.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* perf(prefixcache): reuse one xxhash.Digest in ExtractChain
ExtractChain allocated a fresh xxhash.New() Digest per block (up to MaxDepth
per call) and grew the chain slice without preallocation. Reuse a single
Digest via Reset() before each block and preallocate the chain to
min(nBlocks, MaxDepth).
xxhash seed 0 is stateless, so Reset()+Write produces the byte-identical
value to a fresh New()+Write. Output hashes are unchanged, preserving the
cross-process determinism that peers rely on over NATS. Verified by capturing
ExtractChain output for the existing test inputs before and after the
refactor: identical. Existing extractor tests pass unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(prefixcache): drop hot match when matched node is not a candidate; weigh cold candidates in one walk
Index.Decide called radixtree.LongestMatch over the whole tree, so the
deepest match could be a node that is offline, unloaded, or simply not in
the passed candidate set. Honoring that as HotNodeID produced a false
forced-disturb signal upstream (buildPreference records pressure when
chosen != HotNodeID), making it look like a warm replica was load
saturated when it was actually absent.
Build the candidate set once and only set HotNodeID/MatchRatio when the
matched node is an actual candidate; otherwise fall back to cold
placement. A future refinement could ask the tree for the longest match
restricted to the candidate nodes (shallower-but-valid) instead of
dropping it.
Also replace the per-candidate tree.Weight call in the cold-order sort
with a single tree.WeightsFor walk, turning O(K*N) under the read lock
into O(N + K).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(prefixcache): remove Select's unreachable deterministic fallback
buildPreference always passes ColdOrder as a permutation of the full
candidate set, so the cold-order loop hits every eligible candidate. The
trailing best/bestIF scan was dead. Replace it with a plain "return """
and document that ColdOrder is guaranteed to cover all candidates, so ""
means none were eligible.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(nodes): fetch model scheduling config once per Route
GetModelScheduling was read three times per request - in
resolveSelectorCandidates, buildPreference, and nodeMatchesScheduling -
three DB round-trips for one row that is immutable for the life of the
request, and not a consistent snapshot. Fetch it once near the top of
Route and thread the *ModelSchedulingConfig (may be nil) into all three
helpers. scheduleNewModel keeps its own fetch since it runs outside the
Route snapshot. Behavior is identical for nil sched.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(autoscale): add Pressure.Reset to consume forced-disturb signal
Pressure.Count is non-draining (it prunes only by age), so a single burst
of forced-disturbs stays within the rolling window for the whole window and
keeps Count >= threshold on every reconciler tick. The reconciler will use
Reset to clear a model's events after acting on the signal so a fresh
scale-up requires fresh forced-disturbs to accumulate, rather than one burst
driving the model toward MaxReplicas.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(autoscale): at most one scale-up per reconcile tick, consume pressure
Two autoscale bugs:
1. Over-scaling: the pressure scale-up block read Pressure.Count but never
consumed it. With a non-draining counter a single forced-disturb burst
kept Count >= threshold across the whole window, firing scaleUp on every
tick and pushing the model toward MaxReplicas off one transient burst.
After a successful pressure-triggered scale-up the reconciler now calls
Pressure.Reset to consume the signal.
2. Double scale-up in one tick: the all-replicas-busy block and the pressure
block could both fire in the same reconcileModel pass, each calling
scaleUp(+1) against the same `current` read once at the top, so a model
that was both busy and over threshold scaled +2 and could overshoot
MaxReplicas by one. A scaledUp flag now enforces at most one scaleUp(+1)
per tick: the pressure block is skipped if the busy block already scaled,
and scale-down is skipped in any tick that scaled up.
MinReplicas enforcement, UnsatisfiableUntil backoff, and capacity guards are
unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(nodes): replica-removed chokepoint hook for prefix-cache invalidation
Add SetReplicaRemovedHook to NodeRegistry and fire it from both
RemoveNodeModel and RemoveAllNodeModelReplicas after a successful
delete. This is the single chokepoint every replica-removal path funnels
through (router eviction, reconciler scale-down, probe reaper,
health-monitor node-down reap, RemoteUnloaderAdapter), so the
prefix-cache index can be invalidated by construction rather than wiring
each call site individually.
The hook is stored in an atomic.Pointer so the startup wiring (setter)
and the request/reconcile-time fire are race-free; it is nil-safe when
unset. GORM Delete reports no error for a no-op delete, so the hook also
fires when nothing was removed; the consumer's Invalidate(model, node)
is idempotent so this is harmless.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): invalidate prefix-cache on any replica removal via registry hook
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(prefixcache): single source of truth for threshold bounds
Extract ValidateThresholds into prefixcache/config.go so the per-model
override validation (nodes.go endpoint) and Config.Validate share one
implementation of the numeric bounds (min_prefix_match in [0,1],
balance_abs_threshold >= 0, balance_rel_threshold == 0-or->= 1) instead
of hard-coding them in two places. The route_policy allow-list stays
explicit (not ParsePolicy, which maps typos to Default).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(nodes): preserve prefix-cache settings on partial scheduling update
A scheduling POST that omitted route_policy/thresholds (e.g. a
min_replicas-only update) full-replaced every column and silently reset
the model's previously-configured prefix-cache settings to empty/zero.
Make the four prefix-cache request fields pointers so omitted is
distinguishable from explicit zero, and merge PATCH-style in
SetSchedulingEndpoint: a provided pointer wins, an omitted one preserves
the existing config value (zero default when none). Non-prefix fields
keep their full-replace PUT semantics. Validation now runs on the
resolved values via prefixcache.ValidateThresholds.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(prefixcache): make Invalidate a no-op for uncached models and skip empty broadcasts
A registry chokepoint fires Sync.Invalidate(model, nodeID) for every replica
removal of every model, including round-robin models that never used the
prefix cache. Index.Invalidate previously called tree(model), which lazily
created and permanently retained an empty radix tree for any model that ever
lost a replica, growing the trees map without bound. Sync.Invalidate also
published a NATS PrefixCacheInvalidateEvent on every call, amplifying no-op
removals across the cluster.
Index.Invalidate now looks the tree up read-only via existingTree and returns
without allocating when none exists. The Provider interface is unchanged;
Sync gates the broadcast through an optional invalidateExisting(bool) capability
type-asserted from the wrapped Index, falling back to the prior always-broadcast
behavior for other Provider implementations.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* perf(prefixcache): derive Decide candidacy from WeightsFor and skip trivial sort
WeightsFor already returns a map keyed by every requested candidate, so the
separate candidates set built to validate the hot match was redundant: a node
is a candidate iff it is a key in the weights map. Drop the extra map and gate
the hot-match check on weights membership. Also skip the sort when there is at
most one candidate, since the input order is already the cold order. Behavior
is unchanged.
Deferred follow-up: skipping the WeightsFor walk entirely when a hot match wins
would need lazy cross-file changes and is out of scope here.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(nodes): fire replica-removed hook on bulk node_models deletes; trim LoadedReplicaStats columns
Bulk node-scoped node_models deletes (Register re-register cleanup,
MarkOffline, MarkDraining, Deregister) removed rows directly without
firing the replica-removed hook, so the prefix-cache index kept
pointing at nodes whose models were gone. Capture the DISTINCT model
names before each bulk delete and fire fireReplicaRemoved once per
model after a successful delete, restoring the single-chokepoint
invariant for all removal paths. The pre-query is skipped when no hook
is set so the no-hook path stays cheap.
Also narrow LoadedReplicaStats to SELECT only node_id and in_flight
(the only fields the router consumer reads), dropping the JOIN-side
available_vram fetch and unused columns while keeping the
[]ReplicaCandidate return type unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(reconciler): consume autoscale signals only on a real scale-up
scaleUp was fire-and-forget (void) yet its callers unconditionally
consumed the pressure signal (Pressure.Reset) and the MinReplicas
hysteresis (ClearUnsatisfiable) right after calling it. If scaleUp
added nothing (ScheduleAndLoadModel errored, or no node could be
loaded) the saturated warm replica got no new replica AND its
accumulated forced-disturb history was wiped, forcing the signal to
re-accumulate over a full PressureWindow before the next attempt.
Make scaleUp return whether at least one replica was actually
scheduled, and gate the side effects on it:
- pressure block (2b): set scaledUp and call Pressure.Reset only on
success; on failure preserve the signal so the next tick retries off
the same accumulated pressure.
- busy-burst block (2): set scaledUp from the return value so a failed
attempt does not suppress the pressure path or scale-down.
- MinReplicas block: call ClearUnsatisfiable only on success so a
failed attempt does not reset the unsatisfiable counter.
All existing invariants (MaxReplicas, capacity gating,
UnsatisfiableUntil cooldown, at-most-one-scale-up-per-tick) are
preserved.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(nodes): drop router's redundant prefix-cache Invalidate calls
The NodeRegistry removal chokepoint (RemoveNodeModel /
RemoveAllNodeModelReplicas) now fires SetReplicaRemovedHook, which
invalidates the prefix-cache index. The router was also calling
prefixProvider.Invalidate explicitly right after each registry removal
on the two stale-replica health-probe fall-throughs in Route and in
UnloadModel, so every router-side eviction invalidated twice (double
tree-prune + double NATS broadcast).
Remove the three redundant explicit Invalidate calls and their empty
nil-guards. Each removed call sat immediately after a registry removal
that fires the hook, so invalidation is preserved via the chokepoint.
Decide/Observe usage is untouched.
Re-point the unit test (fake registry fires no hook) to assert the
removal chokepoint is exercised on unload instead of the router's
direct invalidation.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(prefixcache): broadcast invalidations unconditionally for cross-frontend coherence
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(prefixcache): reject TTL<=0 in Config.Validate (eviction ticker would panic)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(nodes): make capture+delete atomic in bulk node_models removal paths
MarkOffline, MarkDraining, and the Register re-register cleanup ran the
nodeModelNames SELECT and the bulk node_models DELETE as two separate
statements on r.db with no transaction. A SetNodeModel landing between
the two was deleted but its replica-removed hook never fired, leaving
the prefix-cache index pointing at a removed replica until TTL or
candidacy self-heal.
Wrap the capture and the delete in a single db.Transaction in each path
(mirroring how Deregister already does it). The captured model names are
collected into a slice declared outside the closure; the
replica-removed hook fires for each only after the transaction commits,
so a rollback never invalidates the index for a removal that did not
persist. The set of fired hooks now equals exactly the set of
node_models rows actually deleted, with no interleaving gap.
The status flip in MarkOffline/MarkDraining (setStatus) is a separate,
pre-existing operation and routing already filters non-healthy nodes, so
it stays outside the transaction; return contracts are unchanged.
Deregister was already correct and is untouched. The cheap-path skip
(no hook -> skip the SELECT) is preserved.
Adds a spec asserting MarkOffline fires hooks for exactly the rows it
deletes and leaves no node_models row behind (consistent snapshot).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(nodes): debug logging for prefix-cache routing decisions and observations
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(radixtree): match shared prefixes by valuing every node on insert
Insert recorded the value (node id) only on the final node of the key
chain, leaving every intermediate prefix node valueless. LongestMatch
returns the deepest node that hasValue, so two chains that share a
leading block but diverge in the tail never matched: only exact-repeat
queries hit. That broke the prefix-cache routing core use cases (shared
system prompt, multi-turn extension, volatile tail), all of which rely
on prefix matching rather than exact-repeat.
Set value/hasValue/lastSeen at every node along the chain so each
prefix-block node remembers the node id that served that prefix
(SGLang/vLLM-style). The deepest match wins, and the last writer owns a
shared prefix node (a recency heuristic: the most recent chain through a
block is the one most likely still warm). size now counts valued nodes,
which is the intended meaning.
Updated radixtree tests to the new semantics: deepest-prefix test uses
non-overlapping chains, a new test asserts last-writer-owns-shared-node,
Evict/Remove/MaxEntries expectations recomputed for per-prefix-node
counting, and a shared-prefix LongestMatch red test added. Added a
prefixcache Decide test proving a prefix-only query routes to the warm
node. No prefixcache .go logic changed.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(distributed): lock in prefix-cache routing behavior end to end
Add a DB-backed e2e spec that drives SmartRouter against a real
NodeRegistry (Postgres testcontainer) and the real prefixcache.Index
radix-tree provider, using a fake gRPC backend factory so no real
inference runs. Covers the five behaviors validated by hand:
1. Cold miss + observe: an unseen prefix chain cold-places and is recorded.
2. Hot-match affinity: the same chain returns to its warm node X.
3. Shared-prefix match: a divergent chain sharing X's leading prefix
still routes to X (the radix-tree regression we fixed).
4. Negative control: an unrelated chain is a cold miss, not a false
hot match on X.
5. Failover + invalidation: removing X's replica fires the registry
chokepoint hook to invalidate the prefix entry, and the chain fails
over to surviving node Y and re-homes there.
Replaces the need for manual docker-compose re-runs.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(prefixcache): make prefix-cache affinity replica-granular
Track prefix-cache affinity per loaded replica (a backend process with its
own KV cache) instead of per node, so multiple replicas of the same model on
one node each keep distinct affinity and a hot prefix routes back to the exact
replica that served it.
- radixtree: add RemoveFunc(pred) and reimplement Remove on top of it.
- prefixcache: introduce ReplicaKey{NodeID, Replica}; Index/Candidate/
PrefixDecision/Select/Provider now key on ReplicaKey. Add InvalidateNode to
drop every replica of a node; Invalidate drops one replica. Select returns
(ReplicaKey, bool) and gains a deterministic least-in-flight eligible
fallback (tiebreak NodeID then Replica).
- messaging: carry Replica on PrefixCacheObserveEvent and
PrefixCacheInvalidateEvent (Replica < 0 means all replicas of the node).
- Sync delegates + broadcasts with replica; InvalidateNode broadcasts
Replica=-1; ApplyInvalidate routes negative replica to InvalidateNode.
This is part 1 of 2; the registry/router/wiring consumers are updated
separately.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): make prefix-cache routing replica-granular
Wire the SmartRouter, NodeRegistry, and distributed startup to the
replica-keyed prefixcache API. Affinity is now tracked per replica
(each replica is a separate process with its own KV cache), so a prefix
served by (node,0) no longer leaks onto the same-node sibling (node,1).
- RoutePreference gains PreferredReplica; FindAndLockNodeWithModel locks
the EXACT (node_id, replica_index) row, falling through to the default
ORDER BY when that replica is not loaded.
- SetReplicaRemovedHook now carries replicaIndex; RemoveNodeModel fires
the specific replica, RemoveAllNodeModelReplicas and the four bulk
node-scoped deletes fire replica<0 (all replicas of the node).
- buildPreference builds one Candidate per loaded replica and locks the
exact replica the policy chose; observePrefix records the served
ReplicaKey at every call site.
- distributed.go routes the hook to InvalidateNode (replica<0) or
Invalidate(key).
- Tests updated to the replica-keyed API plus new coverage: a hot prefix
on (node,0) prefers replica 0 over the same-node sibling (router unit +
e2e), and FindAndLock locks the exact preferred replica.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(distributed): derive prefix chain from messages for tokenizer-template models
Prefix-cache-aware routing built its prompt-prefix chain from the rendered
prompt string `s` in ModelInference. For models with
TemplateConfig.UseTokenizerTemplate the frontend never renders a prompt - the
backend tokenizes the structured messages itself - so `s` is empty, the chain
is empty, and routing silently falls back to round-robin. That covers the bulk
of modern chat models (qwen3, llama3, ...), so the feature effectively never
engaged for them.
Fall back to messagesPrefixSource(messages): a deterministic, prefix-stable
head-first serialization of the conversation (role + content per turn). Two
requests sharing a leading system prompt and early turns share a leading byte
prefix, which ExtractChain maps to a shared chain prefix - landing both on the
same cache-warm replica. The rendered `s` is still preferred when present
(higher fidelity for non-template models).
Found via the multi-replica-per-node e2e: zero "prefix-cache routing decision"
logs despite per-request Route calls, traced to the empty-chain guard.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(distributed): document prefix-cache routing roadmap
Add a routing-and-caching roadmap section to the distributed-mode guide,
linking the epic (#10063) and the follow-up issues (#10064-#10070) surfaced
from a survey of SGLang, vLLM production-stack, Ray Serve, llm-d, AIBrix, and
NVIDIA Dynamo.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): add per-request node ID context holder
Introduce pkg/distributedhdr, a leaf package carrying a per-request
*atomic.Value holder for the picked worker node ID from the
SmartRouter (core/services/nodes) up to the HTTP response writer
wrapper (core/http/middleware). Avoids the import cycle that a shared
key in either consumer would create.
Exposes NewHolder, WithHolder, Holder, Stamp, Load, Inherit. The
holder is atomic.Value so cross-goroutine publish from the router to
the response writer wrapper is race-clean.
Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): add ExposeNodeHeader middleware + response writer wrapper
New ApplicationConfig.ExposeNodeHeader bool + --expose-node-header CLI
flag / LOCALAI_EXPOSE_NODE_HEADER env var (default off; the node ID
reveals internal topology and is opt-in).
The middleware creates a per-request *atomic.Value holder, attaches it
to c.Request().Context() via distributedhdr.WithHolder, and wraps
c.Response().Writer with a custom http.ResponseWriter that sets the
X-LocalAI-Node header on first Write / WriteHeader / Flush by reading
the holder. Implements http.Flusher, http.Hijacker, Unwrap so it
composes cleanly with Echo and http.NewResponseController.
request.go propagates the holder onto derived contexts via
distributedhdr.Inherit so the holder survives the correlation-ID
context replacement.
Unit + race-clean concurrency + integration specs.
Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): stamp node ID in router and wire middleware to inference routes
ModelRouterAdapter.Route stamps the picked node ID into the
per-request holder via distributedhdr.Stamp(ctx, result.Node.ID) right
after replica selection.
Wire ExposeNodeHeader middleware to:
- OpenAI chat/completion/embeddings + audio transcriptions/speech + image generations/inpainting
- Anthropic /v1/messages
- Ollama /api/chat, /api/generate, /api/embed, /api/embeddings
- Jina /v1/rerank
- LocalAI /v1/vad
The middleware's wrapper reads the holder on first byte and sets the
X-LocalAI-Node response header before delegating to the underlying
writer. Per-request scope means no race under concurrent multi-replica
routing.
Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(distributed): thread request context through backend Load + cover ctx propagation
Five non-OpenAI backend helpers were silently using app.Context instead
of the request context for the gRPC backend call: transcription, TTS,
image generation, rerank, VAD. Effect: distributedhdr.Stamp in the
router callback was a silent no-op for these paths, AND client
cancellation didn't propagate to in-flight inference.
Thread c.Request().Context() (or the equivalent input.Context after
the request middleware has installed the correlation-ID derived
context) through each helper and into ModelOptions via
model.WithContext(ctx). ImageGeneration's signature gains a leading
ctx parameter; in-tree callers (openai image, openai inpainting,
openai inpainting_test) are updated to match.
ModelEmbedding gains a leading ctx parameter for the same reason; the
openai and ollama embedding handlers pass the request context through.
chat_stream_workers.go defers the initial role=assistant chunk
emission until the first token callback so the wrapper's lazy
X-LocalAI-Node lookup against the loader runs AFTER ml.Load has
stamped the per-modelID node ID; semantically identical for clients
(role still arrives before any text).
Regression test core/backend/ctx_propagation_test.go pins ctx
propagation for all five helpers.
Docs updated to enumerate the full endpoint coverage of the
--expose-node-header flag.
Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(traces): cap backend trace Data field so the admin UI stays responsive
The previous fix (#9946) capped API trace bodies but missed backend traces,
which carry the same blast radius:
- LLM backend traces store the full chat messages JSON, full response, and
full streaming deltas. Every agent-pool reasoning step ships the full
RAG-augmented history (50-500 KiB per trace, often 100+ traces queued).
- TTS / audio_transform / transcript traces embed a 30s audio snippet as
base64, around 1.3 MiB per trace.
Both blow the /api/backend-traces JSON past tens of MiB. The admin Traces
page then keeps re-downloading and re-parsing the buffer faster than the
5s auto-refresh and stays in the loading state forever, the same symptom
the API-side fix addressed.
Apply two complementary caps, both honoring LOCALAI_TRACING_MAX_BODY_BYTES:
Option A (safety net in core/trace): RecordBackendTrace walks the Data map
recursively and replaces any string value larger than the cap with
"<truncated: N bytes>". Catches anything a future producer forgets.
Option B (head-preserving at the producer):
- core/backend/llm.go: TruncateToBytes on messages, response, and
chat_deltas content/reasoning_content so the leading content stays
readable in the UI.
- core/trace/audio_snippet.go: omit audio_wav_base64 when the encoded
blob would exceed the cap (truncated base64 is undecodable). The
quality metrics still ship and the UI's WaveformPlayer simply skips
when the field is absent.
TruncateToBytes is bounded to <= maxBytes so Option A leaves the producer's
head-preserving output alone instead of replacing it with the bare marker.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7
* fix(react-ui): expose tracing_max_body_bytes in Settings and Traces panels
The setting was already plumbed through env (LOCALAI_TRACING_MAX_BODY_BYTES),
CLI flag, and the runtime_settings.json GET/PUT schema, but neither the main
Settings page nor the inline Traces panel offered an input for it. Admins
hitting the "Traces UI stuck loading" symptom had to know to set an env var
or PUT raw JSON to /api/settings to dial the cap.
Add a "Max Body Bytes" row next to "Max Items" in both places. Same input
type, same disabled-when-tracing-off semantics, placeholder shows the 65536
default so users see what they're inheriting.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7
* test(react-ui): disambiguate Max Items locator after adding Max Body Bytes
The Tracing settings panel now has two number inputs. The previous spec
matched 'input[type="number"]' which became ambiguous and triggered a
Playwright strict-mode violation in CI. Switch to getByPlaceholder('100')
for Max Items and add a parallel spec for the new Max Body Bytes field
using getByPlaceholder('65536').
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(gallery): verify backend OCI images with keyless cosign
Close a trust gap where a registry compromise or MITM could silently
replace a backend image: the gallery YAML tells LocalAI which image to
pull, but until now nothing verified the bytes came from our CI.
Consumer (pkg/oci/cosignverify):
- New package using sigstore-go to verify keyless-cosign signatures.
- OCI 1.1 referrers API + new bundle format (no legacy :tag.sig).
- Policy fields: Issuer / IssuerRegex / Identity / IdentityRegex /
NotBefore. NotBefore is the revocation lever — keyless Fulcio certs
are ephemeral so revocation is policy-side; advancing not_before in
the gallery YAML invalidates every signature predating the cutoff.
- TUF trusted root cached process-wide so N backends from one gallery
do 1 fetch, not N.
Plumbing:
- pkg/downloader: ImageVerifier interface + WithImageVerifier option
threaded through DownloadFileWithContext. Verification runs between
oci.GetImage and oci.ExtractOCIImage, with digest pinning via
pinnedImageRef to close the TOCTOU window. Skips the verifier's HEAD
when the ref is already digest-pinned.
- core/config: Gallery.Verification YAML block.
- core/gallery: backendDownloadOptions builds the verifier from the
policy; applied on initial URI, mirrors, and tag fallbacks.
- core/gallery/upgrade: the upgrade path now routes through the same
options builder. A regression Ginkgo spec pins this contract —
without it, UpgradeBackend silently bypassed verification.
- core/cli: --require-backend-integrity (LOCALAI_REQUIRE_BACKEND_INTEGRITY)
escalates missing policy / empty SHA256 from warn to hard-fail.
Producer (.github/workflows/backend_merge.yml):
- id-token: write at job scope (PR-fork-safe via existing event gate).
- sigstore/cosign-installer@v3 pinned to v2.4.1.
- After each docker buildx imagetools create, resolve the manifest
list digest and run cosign sign --recursive --new-bundle-format
--registry-referrers-mode=oci-1-1 against repo@digest. --recursive
signs the index and every per-arch entry, matching how the consumer
resolves a tag to a platform-specific manifest before verifying.
Rollout: backend/index.yaml has no `verification:` block yet, so this
PR is backward-compatible — installs proceed with a warning until the
gallery is populated. Strict mode is opt-in.
Assisted-by: claude-code:claude-opus-4-7 [Bash] [Edit] [Read] [Write] [WebSearch] [WebFetch]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* refactor(gallery): plumb RequireBackendIntegrity through config instead of env
The previous implementation re-exported the --require-backend-integrity
CLI flag into LOCALAI_REQUIRE_BACKEND_INTEGRITY via os.Setenv, then
re-read it in core/gallery via os.Getenv. This leaked process state
into the gallery package and made the flag impossible to override
per-call or test without touching the env.
Add RequireBackendIntegrity to ApplicationConfig (with a matching
WithRequireBackendIntegrity AppOption) and thread the bool through
every install/upgrade path: InstallBackend, InstallBackendFromGallery,
UpgradeBackend, InstallModelFromGallery, InstallExternalBackend,
ApplyGalleryFromString/File, startup.InstallModels. Worker subcommands
gain the same env-bound flag on WorkerFlags so distributed-worker
installs honor it consistently with the worker daemon path.
Add a forbidigo lint rule against os.Getenv / os.LookupEnv / os.Environ
to keep the env-leak pattern from creeping back. Existing offenders
(p2p, config loaders, etc.) are baseline-grandfathered by the existing
new-from-merge-base: origin/master setting; targeted path exclusions
cover the legitimate cases — kong CLI entry points, backend
subprocesses, system capability probes, gRPC AUTH_TOKEN inheritance,
test gating env vars.
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Upstream llama.cpp (PR #21962) switched the server-side mtmd media
marker to a random per-server string and removed the legacy
"<__media__>" backward-compat replacement in mtmd_tokenizer. The
Go layer still emitted the hardcoded "<__media__>", so on the
non-tokenizer-template path the prompt arrived with a marker mtmd
did not recognize and tokenization failed with "number of bitmaps
(1) does not match number of markers (0)".
Report the active media marker via ModelMetadataResponse.media_marker
and substitute the sentinel "<__media__>" with it right before the
gRPC call, after the backend has been loaded and probed. Also skip
the Go-side multimodal templating entirely when UseTokenizerTemplate
is true — llama.cpp's oaicompat_chat_params_parse already injects its
own marker and StringContent is unused in that path. Backends that do
not expose the field keep the legacy "<__media__>" behavior.
* feat: add distributed mode (experimental)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix data races, mutexes, transactions
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactorings
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix events and tool stream in agent chat
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* use ginkgo
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(cron): compute correctly time boundaries avoiding re-triggering
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* enhancements, refactorings
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* do not flood of healthy checks
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* do not list obvious backends as text backends
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* tests fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Drop redundant healthcheck
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* enhancements, refactorings
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(gallery): Switch to expandable box instead of pop-over and display model files
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(ui, backends): Add individual backend logging
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(ui): Set the context settings from the model config
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(realtime): WebRTC support
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(tracing): Show full LLM opts and deltas
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(functions): add peg-based parsing
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat: support returning toolcalls directly from backends
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore: do run PEG only if backend didn't send deltas
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(loader): refactor single active backend support to LRU
This changeset introduces LRU management of loaded backends. Users can
set now a maximum number of models to be loaded concurrently, and, when
setting LocalAI in single active backend mode we set LRU to 1 for
backward compatibility.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore: add tests
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Update docs
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat: add support to logprobs in results
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat: add support to logitbias
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): allow to cancel ops
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Improve progress text
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Cancel queued ops, don't show up message cancellation always
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix: fixup displaying of total progress over multiple files
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama.cpp): expose env vars as options for consistency
This allows to configure everything in the YAML file of the model rather
than have global configurations
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama.cpp): respect usetokenizertemplate and use llama.cpp templating system to process messages
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* WIP
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Detect template exists if use tokenizer template is enabled
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Better recognization of chat
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixes to support tool calls while using templates from tokenizer
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Drop template guessing, fix passing tools to tokenizer
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Extract grammar and other options from chat template, add schema struct
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* WIP
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* WIP
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Automatically set use_jinja
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Cleanups, identify by default gguf models for chat
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Update docs
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
- Add a system backend path
- Refactor and consolidate system information in system state
- Use system state in all the components to figure out the system paths
to used whenever needed
- Refactor BackendConfig -> ModelConfig. This was otherway misleading as
now we do have a backend configuration which is not the model config.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): Initial Realtime API implementation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore: go mod tidy
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat: Implement transcription only mode for realtime API
Reduce the scope of the real time API for the initial realease and make
transcription only mode functional.
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* chore(build): Build backends on a separate layer to speed up core only changes
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* Add machine tag option, add extraUsage option, grpc-server -> proto -> endpoint extraUsage data is broken for now
Signed-off-by: mintyleaf <mintyleafdev@gmail.com>
* remove redurant timing fields, fix not working timings output
Signed-off-by: mintyleaf <mintyleafdev@gmail.com>
* use middleware for Machine-Tag only if tag is specified
Signed-off-by: mintyleaf <mintyleafdev@gmail.com>
---------
Signed-off-by: mintyleaf <mintyleafdev@gmail.com>
* Use pb.Reply instead of []byte with Reply.GetMessage() in llama grpc to get the proper usage data in reply streaming mode at the last [DONE] frame
* Fix 'hang' on empty message from the start
Seems like that empty message marker trick was unnecessary
---------
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Use pb.Reply instead of []byte with Reply.GetMessage() in llama grpc to get the proper usage data in reply streaming mode at the last [DONE] frame
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
When offloading template construction to the backend, we want to keep
text around in case of multimodal requests.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(refactor): track internally started models by ID
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Just extend options, no need to copy
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Improve debugging for rerankers failures
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Simplify model loading with rerankers
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Be more consistent when generating model options
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Uncommitted code
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Make deleteProcess more idiomatic
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Adapt CLI for sound generation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixup threads definition
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Handle corner case where c.Seed is nil
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Consistently use ModelOptions
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Adapt new code to refactoring
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Dave <dave@gray101.com>
This prepares the API to receive videos as well for video understanding.
It works similarly to images, where the request should be in the form:
{
"type": "video_url",
"video_url": { "url": "url or base64 data" }
}
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat: extract output with regexes from LLMs
This changset adds `extract_regex` to the LLM config. It is a list of
regexes that can match output and will be used to re extract text from
the LLM output. This is particularly useful for LLMs which outputs final
results into tags.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add tests, enhance output in case of configuration error
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* start by checking /scan during the checksum update
Signed-off-by: Dave Lee <dave@gray101.com>
* add back in golang side features: downloader/uri gets struct and scan function, gallery uses it, and secscan/models calls it.
Signed-off-by: Dave Lee <dave@gray101.com>
* add a param to scan specific urls - useful for debugging
Signed-off-by: Dave Lee <dave@gray101.com>
* helpful printouts
Signed-off-by: Dave Lee <dave@gray101.com>
* fix offsets
Signed-off-by: Dave Lee <dave@gray101.com>
* fix error and naming
Signed-off-by: Dave Lee <dave@gray101.com>
* expose error
Signed-off-by: Dave Lee <dave@gray101.com>
* fix json tags
Signed-off-by: Dave Lee <dave@gray101.com>
* slight wording change
Signed-off-by: Dave Lee <dave@gray101.com>
* go mod tidy - getting warnings
Signed-off-by: Dave Lee <dave@gray101.com>
* split out python to make editing easier, add some simple code to delete contaminated entries from gallery
Signed-off-by: Dave Lee <dave@gray101.com>
* o7 to my favorite part of our old name, go-skynet
Signed-off-by: Dave Lee <dave@gray101.com>
* merge fix
Signed-off-by: Dave Lee <dave@gray101.com>
* merge fix
Signed-off-by: Dave Lee <dave@gray101.com>
* merge fix
Signed-off-by: Dave Lee <dave@gray101.com>
* address review comments
Signed-off-by: Dave Lee <dave@gray101.com>
* forgot secscan could accept multiple URL at once
Signed-off-by: Dave Lee <dave@gray101.com>
* invert naming and actually use it
Signed-off-by: Dave Lee <dave@gray101.com>
* missed cli/models.go
Signed-off-by: Dave Lee <dave@gray101.com>
* Update .github/check_and_update.py
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Signed-off-by: Dave <dave@gray101.com>
---------
Signed-off-by: Dave Lee <dave@gray101.com>
Signed-off-by: Dave <dave@gray101.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
* refactor(gallery): move under core/
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(unarchive): do not allow symlinks
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(defaults): set better defaults for inferencing
This changeset aim to have better defaults and to properly detect when
no inference settings are provided with the model.
If not specified, we defaults to mirostat sampling, and offload all the
GPU layers (if a GPU is detected).
Related to https://github.com/mudler/LocalAI/issues/1373 and https://github.com/mudler/LocalAI/issues/1723
* Adapt tests
* Also pre-initialize default seed