mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-19 06:09:07 -04:00
3bafc2e1ab7fd6df85ff9ed23384ee41b03935db
147 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
3bafc2e1ab |
feat(ui): canvas fullscreen toggle + keyboard tab navigation
The canvas header gains a fullscreen toggle (expands the panel to cover the viewport; resize handle hidden while fullscreen). The artifact tab strip is now a proper ARIA tablist with roving tabindex and Left/Right arrow-key navigation. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
8991b94a80 |
feat(ui): jump-to-latest pill when scrolled up in chat
When the user scrolls away from the bottom of a conversation, a floating "Jump to latest" pill appears (sticky, centered above the composer); clicking it smooth-scrolls to the newest message and re-pins auto-scroll. Resets on chat switch. Adds the chat.actions.jumpToLatest i18n key to all locales. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
5510262aab |
feat(ui): image attachments show a thumbnail in the composer
Staged image attachments now preview as a 28px thumbnail (from their data URL) instead of a bare file icon; other types keep the icon. File names truncate and the remove button gets an aria-label. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
eb8b9b125d |
feat(ui): per-message code blocks get a language header + copy button
Chat code blocks now render inside a framed block with a header showing the language and a copy button (delegated handler, copies the block and flips to a check briefly). Decoration + highlighting run from a MutationObserver scoped to the messages container, which fires reliably for streamed responses AND for chats loaded/switched from storage - the prior render-keyed effect missed the load path (code was left unhighlighted on reload). The observer disconnects while mutating so it does not retrigger on its own edits. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
a5a4701f05 |
feat(ui): canvas drag-to-resize + slide-in, fix hooks order, typed download
Canvas was a fixed pane; make it a workbench: - Drag the panel's left edge to resize (clamped 360px..75vw), persisted to localStorage, double-click to reset; hidden and full-width on narrow screens. - Slide-in/fade on open via canvasSlideIn (honored by reduced-motion). - Fix a rules-of-hooks bug: the `if (!current) return null` early return sat above useEffect, so the hook count changed when artifacts emptied. All hooks now run unconditionally before the guard. - Downloads use the artifact language's real extension + MIME (a Python artifact saves as .py, not .txt) via extensionForLanguage. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
a753f82823 |
feat(ui): ThemeToggle i18n + animated icon, drop transition:all
The theme toggle hard-coded its English tooltip; route it through the existing nav switchToLightMode/switchToDarkMode keys and add an aria-label. The sun/moon icon now replays a small rotate+fade on theme change (keyed remount; honored by the global reduced-motion block). Replace the .theme-toggle `transition: all` with explicit properties. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
67e2f2caba |
test(ui): home greeting asserts sans font, not the dropped serif
The greeting render-smoke still asserted Fraunces; update it to assert the Geist sans display font (and not Fraunces), matching the all-sans direction. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
30245d7bec |
refactor(ui): adopt PageHeader on ops/admin/media pages (batch B)
Replace hand-rolled .page-header title blocks with the shared editorial PageHeader component across 14 pages (Manage, Middleware, Models, NodeBackendLogs, Nodes, P2P, SkillEdit, Skills, Sound, Traces, TTS, Usage, Users, VideoGen). Title/subtitle move into PageHeader; header-own action clusters (Models stats+buttons, Skills search+buttons) move into the actions slot. Tabs, filters, stat cards, ResourceMonitor and page body stay as siblings. Eyebrow is left to auto-derive from the route. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
f1132f9143 |
refactor(ui): adopt PageHeader on agent/media/import/backend pages (batch A)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] |
||
|
|
a24cc84f7d |
feat(ui): editorial PageHeader with section eyebrow + scroll-to-top on nav
PageHeader now derives its eyebrow from the route's section/console (Build /
Operate / Create) via sectionKeyForPath, so pages get a consistent, meaningful
eyebrow with no per-page wiring (override with the eyebrow prop, suppress with
eyebrow={null}). Settings adopts it as the first consumer.
Also fix a navigation scroll bug: the default layout uses the document as its
scroll container and route changes did not reset it, so navigating the console
rail from a scrolled page landed mid-view. App now scrolls to top on pathname
change.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
7cca8352cb |
refactor(ui): declutter Home - discoverable + dismissable API, vertical balance
Home felt overloaded and top-heavy. Three changes from review: - The API endpoint catalog (12 endpoints) is collapsed by default behind a "Browse the API" disclosure; only the base URL + copy stay visible, so the catalog is discoverable without dominating the page. - The whole connect card is dismissable (x): dismissing unmounts it so the vertical space is recovered, and the choice is remembered (localStorage). - .home-page now fills its column and vertically centers its content when there is slack, so sparse states (no models / card dismissed) read as a balanced launcher instead of content jammed at the top. Overflow-safe: tall content flows from the top and scrolls. Adds connect.browse / connect.hide / connect.dismiss i18n keys to all locales. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
e8cac3506b |
refactor(ui): all-sans editorial headings + tint-only active nav
Per design review, pivot the heading strategy from hybrid-serif to a refined grotesk: drop the Fraunces dependency, token, and import; page titles, the Home greeting, and section/empty-state titles now use Geist at semibold with the editorial fluid sizing and tight tracking. No serif anywhere. Active sidebar item is now a tint-only treatment (accent text + tinted background); the left accent rail is removed and the shared base .nav-item.active inset bar is suppressed in the sidebar (as the console rail already does). Update the design-system e2e specs to assert the sans display font and the tinted-background active state. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
f9a130e446 |
fix(ui): single focus ring (no double-ring) + neutralize stagger delay under reduced motion
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] |
||
|
|
76f3e032ae |
feat(ui): Home loaded-models skeleton list, button hierarchy, EmptyState wizard
Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
c7d9dbda6b |
feat(ui): Home editorial header + status line (north-star redesign)
Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
0838d7d3e6 |
feat(ui): StatusPill primitive + time-of-day greeting helper
Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
abd8162aad |
feat(ui): PageHeader + SectionHeading editorial primitives
Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
b73b93850c |
feat(ui): Skeleton shimmer primitive
Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
f31f311e38 |
feat(ui): EmptyState primitive with serif title
Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
24623d52e8 |
refactor(ui): route boot fallback through the LoadingSpinner primitive
Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
6ef88c78f2 |
refactor(ui): fix dead token refs + dedupe toggle to one primitive
Migrate all .toggle-slider consumers (Users, Chat, AgentChat) to the canonical BEM toggle primitive and delete the legacy duplicate CSS block. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
7d501f1ba1 |
feat(ui): orchestrated page reveal + stagger motion primitives
Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
53af1807b6 |
feat(ui): un-overload accent — nav rail, stronger focus ring, neutral hover
Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
b5d8992531 |
feat(ui): serif display tier + section-heading typography scale
Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
9110002592 |
feat(ui): add Fraunces variable serif + --font-serif token
Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
3fa7b2955c |
feat(pii): NER tier engine — privacy-filter.cpp backend + NER-centric PII filter (#10360)
Squashed feat/pii-ner-tier-engine rebased onto master (was 45 commits; see backup/pii-ner-tier-engine-prerebase). Net change: - privacy-filter.cpp: standalone GGML engine for the openai-privacy-filter PII/NER token classifier, wired as a LocalAI gRPC backend (CPU/CUDA/Vulkan). TokenClassify moves off the patched llama.cpp path onto this backend. - PII filter reworked to be NER-centric (encoder/NER detection tier scanning whole conversations as one document), with a recreated bounded restricted- regex secret-matching pattern detector tier alongside it (per-model pii_detection.builtins / .patterns + core/services/routing/piipattern). - Detection labelled by source (ner vs pattern); backend trace / confidence / debug observability; analyze/redact exposed as a synchronous API. - Instance-wide default detector policy + per-usecase default-on; request filtering extended to completions, embeddings, edits & Ollama. - React UI: NER-centric PII editor, detector-models table, pattern/builtins editor, middleware default-policy UI. - Gallery: privacy-filter-multilingual token-classify model + NER install filter; token_classify known_usecase; batch sized to context for NER models. privacy-filter backend registered in the backend gallery (cpu/vulkan/cuda-13 meta + image entries with a capabilities map) matching its CI matrix jobs, and an /import-model auto-detect importer (PrivacyFilterImporter, narrow privacy-filter GGUF detection) replacing the prior pref-only registration. Reconciled against master's independent evolution: - Dropped master's PIIPatternOverrides feature (global-pattern runtime overrides + /api/pii/patterns API + runtime_settings.json persistence). The per-model NER + pattern-detector design supersedes it; it was built on the global redactor pattern set this branch replaced. - Reverted the llama.cpp Score carry-patch (0006-server-task-type-score): removed the patch and restored master's grpc-server.cpp Score RPC (direct llama_decode, slot-loop bypass) and LLAMA_VERSION pin, plus master's model_config validation forbidding score + chat/completion/embeddings on llama-cpp. token_classify is unaffected (it runs on the privacy-filter backend, not llama-cpp). Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> |
||
|
|
88726f2da4 |
fix(react-ui): restore sidebar collapse in dev + stop Talk page auto-scroll (#10383)
The sidebar collapse toggle silently no-op'd in dev builds. toggleCollapse ran its side effects (localStorage write + sidebar-collapse dispatch) inside the setCollapsed updater. StrictMode double-invokes updaters in dev to surface impurity, and the synchronous dispatch re-entered setState from the App/Sidebar listeners mid-update, so the toggle never committed. Production builds don't double-invoke, which is why only the dev server was affected. Compute next from current state and move the persist + broadcast into the handler body so the updater is pure. Also fix the Talk page anchoring to the transcript box on load. The transcript is its own overflow container, but scrollIntoView bubbles to every scrollable ancestor including the window, yanking the whole page down on mount. Scroll the transcript container directly instead. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
5ac864dbed |
feat(ui): console-based navigation + drop-in API endpoint section (#10377)
* feat(ui): restructure sidebar into Create/Recognition/Build tiers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(ui): preserve exact sidebar gating for agent items and fine-tune/quantize Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * i18n(ui): add nav tier + console keys to all locales Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ui): add grouped admin console via pathless layout route Wrap the existing admin pages in a pathless AdminConsoleLayout route so they keep their exact flat URLs while gaining a grouped left rail (Inference / Cluster / Observability / Access / System). Rail item gating mirrors the sidebar (adminOnly / authOnly / feature + /api/features). The layout forwards the App-level outlet context (addToast) to the wrapped pages, which read it via useOutletContext(). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ui): fold Audio Transform into Studio as a tab Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(ui): update e2e specs for tiered nav + admin console Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(ui): gate embedded Studio transform view on audio_transform feature Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ui): visual polish + console-ize Build/Recognition tiers Generalize the one-off admin console into a reusable ConsoleLayout driven by a shared consoleConfig (single source of truth for the rail, its gating, and the sidebar entry that opens it — removes the prior rail/sidebar drift). - Promote Install Models to the top menu next to Home. - Build and Operate are now console tiers (secondary rail); Create stays inline. - Fold Recognition (Faces/Voices) into the Build console as a group alongside Automation and Training so it no longer feels split off. - Style the console rail as a panel (header, grouped dividers, rounded active pills) with a hover nudge; sidebar items become inset rounded pills. The rail slide-in plays only when entering a console, not on item-to-item sub-nav (which remounts the layout), so switching no longer flashes the menu. All token-based (light + dark), respects reduced-motion. - Add a delayed RouteFallback loader so lazy routes no longer flash blank; scoped inside ConsoleLayout so the rail stays put while the body loads. - Update e2e specs for the new structure (.console-* classes, console entries). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ui): persist console layout across sub-nav + add drop-in endpoint section - Keep the page-transition key stable within a console (derived from the shared console config) so the ConsoleLayout and its rail persist across item-to-item navigation instead of remounting — fixes the submenu flash. Cache /api/features across mounts and play the rail entrance animation only when actually entering a console. - Add a "One endpoint, every API" section to Home: leads with LocalAI's own native API (images, video, realtime voice over WebRTC/WS, depth, object detection, rerank, audio/TTS, face & voice recognition) plus a Full API reference link, then the drop-in compatibility layer (OpenAI, Anthropic, Ollama, OpenAI Responses) with the live copyable base URL. All 7 locales. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(ui): revert Middleware nav label rename (keep Middleware in all locales) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
9ba8521e7e |
feat(react-ui): localize models and fix 'Import' typo (#10341)
* feat(react-ui): localize SearchableSelect component Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com> * feat(react-ui): localize ModelSelector component Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com> * fix(react-ui): dynamically localize back navigation caption to match page title Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com> * feat(react-ui): localize back navigation state on Models page Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com> * feat(react-ui): localize ModelEditor page Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com> * fix(react-ui): fix Indonesian typo 'Import' to 'Impor' in importModel locale Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com> --------- Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com> Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com> |
||
|
|
fdc352a618 |
fix(settings): start watchdog on cold-enable from the React UI (#9125) (#10287)
fix(watchdog): start the live watchdog on a cold enable from Settings (#9125) The React Settings "Enable Watchdog" master toggle only ever writes the idle/busy flags; watchdog_enabled is vestigial in that UI. The live start/stop decision in UpdateSettingsEndpoint keyed off the raw, stale watchdog_enabled request field, so a cold enable (idle/busy=true, watchdog_enabled=false) called StopWatchdog() and the watchdog stayed stopped until the next restart - at which point startup re-derived it from the idle flag. Net: enabling the watchdog appeared to do nothing. Derive the run-state from idle||busy as the single source of truth, mirroring the startup invariant: - ApplyRuntimeSettings now sets WatchDog = idle||busy whenever either field is present (so a full disable also brings it down), while an API client posting only watchdog_enabled keeps its explicit value. - Add ApplicationConfig.WatchdogShouldRun() mirroring startWatchdog's gating (idle/busy, LRU eviction, memory reclaimer); the /api/settings handler uses it to decide start vs stop. - Belt-and-suspenders: the Settings.jsx master toggle also writes watchdog_enabled = idle||busy. Assisted-by: claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
36b4a81d1e |
feat(i18n): add Korean (ko) translation (#10312)
Add a full Korean locale (core/http/react-ui/public/locales/ko/, 13 namespaces,
840 keys, full parity with en/) and register ko in SUPPORTED_LANGUAGES
(core/http/react-ui/src/i18n/index.js). All i18next {{interpolation}} and
_one/_other plural keys preserved; brand/model names kept untranslated.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: moduvoice <moduvoicr77@gmail.com>
|
||
|
|
7637f8cf1b |
feat(distributed): declarative per-model scheduling via env/args (#10308)
* feat(distributed): add SpreadAll column and authoritative scheduling seeding Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): parse declarative model scheduling config (env/file) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): reconcile spread_all to one replica per matching node Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): wire LOCALAI_MODEL_SCHEDULING env/args and startup seeding Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): expose spread_all on the scheduling API endpoint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): add spread-to-all-nodes mode to the scheduling UI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): document LOCALAI_MODEL_SCHEDULING env/args Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): clarify replica modes and all-nodes spread in scheduling config Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
e1556aa1dc |
fix(react-ui): make agent chat timestamps format-agnostic (#9867) (#10290)
fix(agents): make React agent chat timestamps format-agnostic The agent SSE bridge emits the json_message timestamp in three different encodings depending on deploy mode: an RFC3339 string (standalone agent pool), Unix milliseconds (local dispatcher), and Unix nanoseconds (the older NATS path). The React AgentChat handler passed data.timestamp straight through, so the standalone string and any numeric value outside the millisecond range rendered as "Invalid Timestamp" or a constant epoch-ish time. Add a small pure helper, normalizeTimestampMs, that accepts an RFC3339 string or a numeric epoch in s/ms/us/ns and returns JS milliseconds, falling back to Date.now() on null/empty/unparseable input. Use it in the json_message handler so the rendered time is correct regardless of which backend path produced it. Fixes #9867 Assisted-by: claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
99c8205740 |
fix(react-ui): stop Talk pipeline overflow and center collapsed-rail avatar (#10305)
Two small visual fixes in the React UI:
- Talk page pipeline summary: the four-column grid used
`repeat(4, 1fr)`, which resolves to `minmax(auto, 1fr)` so each track
refuses to shrink below the min-content width of its `nowrap` model
name. Long names (e.g. a verbose GGUF LLM id) blew the grid out past
the container despite the per-cell ellipsis styling. Switching to
`minmax(0, 1fr)` lets the tracks shrink and the ellipsis take effect.
- Sidebar user avatar: the desktop collapsed look centers the avatar via
`.sidebar.collapsed .sidebar-user{-link}` rules, but the tablet
icon-rail (640-1023px) collapses visually through `.sidebar:not(.open)`
without necessarily carrying the `.collapsed` class, so the avatar kept
its left-aligned negative margins and looked misaligned. Mirror the
centering rules under `.sidebar:not(.open)`.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
51f4f67c47 |
fix(agents): emit chat event timestamps in milliseconds (#9867) (#10243)
Agent chat replies rendered a broken timestamp in the web UI
("Invalid Timestamp" / "12:00 AM", identical for every reply) because
the SSE timestamp unit was inconsistent across producers.
EventBridge.PublishEvent emitted Unix nanoseconds while the local
dispatcher (dispatcher.go) already emitted Unix milliseconds, and the
React UI fed the value straight into `new Date(ts)` after dividing by
1e6. Nanoseconds also overflow JS's safe-integer range (~1.7e18).
Standardize on Unix milliseconds: switch PublishEvent to UnixMilli and
drop the /1e6 conversion in AgentChat.jsx so both SSE paths agree and
match the React UI's expectation. Add a regression test asserting the
published timestamp is in milliseconds.
|
||
|
|
085fc53bbc |
fix(router): production-ready request router + auto-size batch for embedding/rerank (#10104)
* fix(router): score classifier production-readiness Conversation trimming runs through the classifier model's chat template and trims by exact token count, sized to the model's n_batch which is now scaled to context so long probes can't crash the backend. Missing chat_message templates are a hard error at router build time. Router- facing factories (Embedder/Scorer/Reranker/TokenCounter) re-resolve ModelConfig per call so a model installed post-startup doesn't bind a stub Backend="" config and silently fall into the loader's auto- iterate path. New 'vector_store' backend trace recorded inside localVectorStore on every Search/Insert — including the backend-load-failure path that previously vanished into an xlog.Warn — with outcome tagging (hit/miss/empty_store/backend_load_error/find_error/insert_error/ok). Companion cleanup drops misleading similarity:0 and input_tokens_count:0 from non-hit and text-mode traces. Gallery local-store-development aliases to 'local-store' so the master image satisfies pkg/model.LocalStoreBackend lookups from the embedding cache. Misc: llama-cpp TokenizeString reads the correct 'prompt' JSON key (the original bug); ModelTokenize nil-guard; non-fatal mitm proxy startup; PII 'route_local' renamed to 'allow' with docs/UI in sync; model-editor footer no longer eats the edit area on small screens; several config-editor template/dropdown/section fixes. Tests: e2e router specs (casual/code-hint + long-conversation trim), vector_store trace specs, lazy-factory specs, gallery dev-alias resolution, Playwright trace badge + scroll regression. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(backend): auto-size batch to context for embedding and rerank models Embedding and rerank models pool over the whole input in a single physical batch (n_ubatch). With batch left at the 512 default, the backend rejects longer inputs with "input is too large to process", silently capping a large-context embedder (e.g. 8k/32k) at 512 tokens. Size n_batch to the context for these single-pass usecases, mirroring the existing FLAG_SCORE behaviour; an explicit batch: still wins. Extracts EffectiveContextSize/EffectiveBatchSize from grpcModelOpts so the effective decode window has one home for other callers to reuse. Adds an e2e-aio regression test that embeds a >512-token input. The AIO embedding model is switched to nomic-embed-text-v1.5 (2048 context) because the previous granite model was capped at 512 tokens and could not exercise the larger batch. Assisted-by: claude-code:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(gallery): raise arch-router scoring output cap via parallel:64 Scoring decodes the whole prompt+candidate in a single llama_decode and reads one logit row per candidate token. The vendored llama.cpp server caps causal output rows at n_parallel, so the default of 1 aborts with GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) on multi-token route labels. Set options: [parallel:64] on both arch-router quant entries to lift the cap; kv_unified (the grpc-server default) keeps the full context per sequence, so this does not split the KV cache. Assisted-by: claude-code:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com> |
||
|
|
1cea96f09f |
feat(react-ui): add Indonesian language support (#10266)
Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com> |
||
|
|
892fc49949 |
feat(realtime): stream the LLM / TTS / transcription pipeline stages (#10176)
* feat(realtime): pipeline streaming + disable_thinking config
Add a nested pipeline.streaming.{llm,tts,transcription} block plus
pipeline.disable_thinking, with StreamLLM/StreamTTS/StreamTranscription/
ThinkingDisabled helpers. Pointer-bools so unset keeps the unary path;
existing configs are unaffected. Wiring into the realtime handler follows.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): sentence segmenter for streamed LLM->TTS pipelining
streamSegmenter accumulates streamed LLM tokens and emits complete
sentence/clause segments (terminator+whitespace, or newline) so TTS can
synthesize each segment as it completes instead of waiting for the whole
reply. Pure helper; the streaming handler wiring consumes it next.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): streaming TTS/transcription methods on Model interface
Add TTSStream and TranscribeStream to the realtime Model interface and
implement them on wrappedModel (delegating to backend.ModelTTSStream /
ModelTranscriptionStream) and transcriptOnlyModel. ttsStream adapts the
backend's WAV-framed stream (44-byte header carrying the sample rate, then
PCM) into raw PCM + sample rate for the realtime transports. Handler wiring
that consumes these (flag-gated) follows.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): emitSpeech with flag-gated streaming TTS
emitSpeech synthesizes a piece of text and forwards audio to the client,
streaming one output_audio.delta per backend PCM chunk when the pipeline
sets streaming.tts, or one delta for the whole utterance otherwise. WebRTC
gets raw PCM (it resamples internally); WebSocket gets base64 PCM at the
session rate. It emits no transcript/audio-done events so a streamed reply
can be split into multiple spoken segments sharing one response.
Adds fakeModel/fakeTransport test doubles for the realtime Model/Transport
interfaces, driving streaming assertions deterministically.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): route response audio through emitSpeech (streaming TTS)
Replace the inline unary TTS block in the response handler with emitSpeech,
which streams a response.output_audio.delta per backend PCM chunk when
pipeline.streaming.tts is set and otherwise preserves the single-delta unary
behaviour. emitSpeech returns the accumulated base64 audio, stored on the
conversation item as before. Transcript and audio-done events stay in the
handler so later per-segment streaming can reuse emitSpeech.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): streaming transcription text deltas
Add emitTranscription and route commitUtterance through it. With
pipeline.streaming.transcription set it streams each transcript fragment as
a conversation.item.input_audio_transcription.delta via TranscribeStream
then a completed event; otherwise it preserves the single completed-event
unary behaviour. Returns the final transcript for response generation.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): pipeline disable_thinking maps to enable_thinking off
applyPipelineThinking forces the LLM's ReasoningConfig.DisableReasoning when
pipeline.disable_thinking is set, which gRPCPredictOpts turns into the
enable_thinking=false backend metadata. Applied at newModel construction on
the per-session LLM config copy, so it doesn't leak to other model users and
needs no realtime-specific request plumbing.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): speechStreamer for token-streamed LLM->TTS
emitSpeech now returns raw PCM (caller base64-encodes) so streamed segments
accumulate correctly. speechStreamer consumes streamed LLM tokens: it strips
reasoning via the streaming ReasoningExtractor, emits a transcript delta per
content fragment, and sentence-pipes content into emitSpeech so each sentence
is synthesized as soon as it's ready. Handler wiring (plain-content turns)
follows.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): wire streamLLMResponse for token-streamed replies
triggerResponseAtTurn takes a streamed path when pipeline.streaming.llm is
set, the turn has no tools, and audio is requested: streamLLMResponse
announces the assistant item, drives the LLM token callback through a
speechStreamer (reasoning-stripped transcript deltas + sentence-piped TTS),
and emits the terminal events. Tool turns and non-streaming pipelines keep
the existing buffered path unchanged, so this is strictly opt-in.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(realtime): document pipeline streaming + disable_thinking
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(realtime): register pipeline streaming/thinking config fields
TestAllFieldsHaveRegistryEntries (core/config/meta) requires every config
field to have a meta registry entry. The four new pipeline fields
(disable_thinking, streaming.{llm,tts,transcription}) had none, failing
tests-linux/tests-apple. Add toggle entries for them.
Also handle the os.Remove return in realtime_speech_test.go to satisfy
errcheck (golangci-lint).
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(realtime): always strip reasoning from spoken output
disable_thinking maps to ReasoningConfig.DisableReasoning=true on the LLM
config, which the backend reads as enable_thinking=false. But the realtime
handler reads that SAME config to drive reasoning extraction, and there
DisableReasoning=true means "skip stripping". PredictConfig() returns this
LLM config, so both the streamed (speechStreamer) and buffered realtime
paths stopped stripping <think>…</think> exactly when disable_thinking was
on — leaking raw reasoning to the client whenever the model ignored the
enable_thinking hint (e.g. lfm2.5).
Add spokenReasoningConfig() which clears DisableReasoning for extraction
(keeping custom tokens/tag pairs) and route both realtime paths through it.
Spoken output now always strips reasoning, independent of the backend
suppression hint.
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(realtime): clean TTS temp path before read (gosec G304)
emitSpeech reads the WAV file the TTS backend wrote. The read moved here
from realtime.go, so code-scanning flagged it as a new G304 alert even
though the path is backend-controlled (a temp file), not user input.
Wrap it in filepath.Clean — a real path normalization that also clears
the alert, keeping with the repo's no-#nosec convention.
Assisted-by: Claude:claude-opus-4-8 gosec, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(realtime): buffer whole message for TTS, drop sentence segmenter
Per review (richiejp): the sentence segmenter pipelined unary TTS by
splitting on ASCII .!?/newline, which does nothing for languages without
those boundaries (CJK/Thai) — there it already degraded to buffering the
whole message anyway.
Replace it with a uniform model: stream the LLM transcript live, buffer the
full message, then synthesize it once. emitSpeech already streams the audio
chunks when the backend implements TTSStream and falls back to a single
unary delta otherwise, so this is real streaming TTS where supported and a
clean whole-message synthesis elsewhere — no per-sentence emulation, no
language assumptions. speechStreamer becomes transcriptStreamer (transcript
deltas only); the whole-message synthesis moves into streamLLMResponse.
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): stream tool-call turns via tokenizer-template autoparser
Per review (richiejp): tool-call deltas exist, so streaming should work with
tools too. It does — for models that use their tokenizer template. The C++
autoparser then clears reply.Message and delivers content + tool calls via
ChatDeltas, so the streamed transcript carries only spoken content (no
tool-call JSON leak) and the tool calls are parsed from the final response.
- Drop the len(tools)==0 gate; stream when no tools OR use_tokenizer_template
(grammar-based function calling still buffers, since its call is emitted as
JSON in the token stream and would leak into the transcript).
- streamLLMResponse takes tools/toolChoice/toolTurn, reads ChatDelta content
in the token callback, parses tool calls from the final ChatDeltas, and
creates the assistant content item lazily so a content-less tool turn emits
only the tool calls.
- Extract emitToolCallItems from the buffered path so both paths finalize tool
calls, response.done, and server-side assistant-tool follow-ups identically.
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): script-aware clause chunking + streamed-reply fixes
Opt-in pipeline.streaming.clause_chunking splits the streamed LLM reply
into speakable clauses and synthesizes each as soon as it completes,
lowering time-to-first-audio instead of buffering the whole message. The
splitter is script-aware (rivo/uniseg, pure Go): UAX#29 sentence
segmentation handles CJK 。!? with no whitespace, CJK clause
punctuation (,、;:) and Thai/Lao spaces give finer cuts, and a UAX#14
line-break cap bounds an over-long punctuation-less run. Unlike the old
ASCII .!?/newline segmenter (dropped in
|
||
|
|
9323f4b5ca |
feat(llama-cpp): video input support (mtmd #24269) (#10216)
* chore(llama-cpp): bump to 8f83d6c for mtmd video input support Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(llama-cpp): forward video input to mtmd (template + non-template paths) Wire request->videos() into grpc-server.cpp mirroring the existing image and audio handling: a video_data build + non-template files extraction, and input_video chat chunks on the tokenizer-template path. allow_video is auto-set at model load by the vendored upstream chat_params. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ui): add video attachment support to the chat UI Mirror the image/audio attachment path for video: emit video_url content parts, accept video/* in the picker, keep video files as base64, show a film icon badge, and render attached video inline with a <video> player. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(llama-cpp): patch mtmd video stdin double-close (heap crash) Upstream mtmd video input (ggml-org/llama.cpp#24269) double-fcloses the ffmpeg/ffprobe stdin FILE: feed_stdin() fclose()s the FILE returned by subprocess_stdin() (which is sp->stdin_file), then subprocess_destroy() fclose()s the same pointer again -> heap corruption that aborts the backend on any base64 input_video request (the CLI --video file path is unaffected). Vendor a one-line fix (null sp->stdin_file after fclose) via prepare.sh's patches/ until upstream merges it. Verified e2e with gemma-4-e2b-it-qat-q4_0: video frames decode via ffmpeg and the model answers correctly (red clip -> 'Red', blue -> 'Blue'). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(llama-cpp): re-pin to upstream #24316, drop vendored stdin patch Upstream replaced the ad-hoc video stdin handling with a proper RAII refactor (ggml-org/llama.cpp#24316, "mtmd: refactor video subproc handling"), which includes the same `sp->stdin_file = nullptr` guard our patch added (plus join-before-destroy ordering). Re-pin LLAMA_VERSION to that branch head and drop patches/0001 - it's now redundant. Verified e2e with gemma-4-e2b-it-qat-q4_0: no crash, video frames decode and the model answers correctly (red clip -> "Red", blue -> "Blue"). NOTE: #24316 is not yet merged, so this pins to its branch-head commit (28ca1e60). Re-pin to the squash-merge commit on master once it lands, otherwise `git fetch` may lose the commit after the branch is deleted. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
92dea961c2 |
fix: distributed backend reinstall/upgrade UI stuck on 'reinstalling' (#10214)
* fix(galleryop): self-evict terminal ops from OpCache.GetStatus The processingBackends map (the UI 'reinstalling' spinner source) only cleared an op when a client polled /api/backends/job/:uid. The Manage-page Reinstall and Upgrade buttons never poll, so completed installs leaked into processingBackends forever and the backend card spun 'reinstalling' even though the install had finished. Evict terminal ops on the list read instead; DeleteUUID already broadcasts the eviction so peer replicas converge. Reproduced on a live 5-node distributed cluster: 5 backends sat in processingBackends with underlying jobs reporting completed:true,progress:100. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(nodes): clear pending backend ops behind offline/draining nodes ListDuePendingBackendOps filters status=healthy, so a backend op queued against a node that went offline (stale heartbeat) or draining (admin action) was never retried, aged out, or deleted - it leaked forever and kept the UI operation spinning. Add DeleteStalePendingBackendOps and run it each reconcile pass: draining nodes are cleared immediately (model rows already purged), offline nodes once their heartbeat is older than a grace window (blip protection). Reproduced on a live cluster: orphaned llama-cpp install rows targeting an offline (nvidia-thor) and a draining (mac-mini-m4) node sat at attempts=0 indefinitely. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(nodes): stream per-node progress during backend upgrade The install dispatch subscribed to a per-op progress subject and streamed per-node download ticks; the upgrade dispatch did a bare 15-minute blocking NATS round-trip with no subscription, so the UI showed progress:0 the whole time (the 'reinstalling but nothing happens' report on a slow node). Thread the op ID through BackendManager.UpgradeBackend -> the distributed manager -> the adapter, and have the adapter subscribe to the per-op progress subject before the request (extracted into a shared subscribeProgress helper reused by install/upgrade/force-fallback). The worker's upgradeBackend now creates the same DebouncedInstallProgressPublisher installBackend uses. An upgrade is a force-reinstall, so it reuses SubjectNodeBackendInstallProgress rather than minting a new subject - no new NATS permission, no new rolling-update compat surface. Reconciler-driven retries pass empty opID/onProgress and stay on the silent path. Reproduced on a live cluster: upgrade of llama-cpp-development on agx-orin-slow sat at progress:0 for 4+ minutes with no per-node feedback. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(galleryop): persist cancellation + periodically reap orphaned ops Two distributed gaps surfaced when a replica was killed mid-upgrade on a live cluster, leaving the backend stuck 'processing' in the UI forever: 1. CancelOperation flipped the in-memory status to cancelled and broadcast a NATS event but never persisted the terminal status. On the next replica restart the still-active row re-hydrated straight back into processingBackends and the UI spun again. It now calls store.Cancel(id) so the cancel survives a restart. 2. CleanStale (which marks abandoned active ops failed) only ran once on startup, so an op orphaned AFTER startup - its owning replica's foreground handler goroutine gone - was never reaped until the next restart. Add GalleryService.ReapStaleOperations and run it on a 15m ticker (CleanStale now returns the reaped count for observability). Neither is covered by the OpCache self-evict fix: an orphaned op never reaches Processed, so it would never self-evict. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(review): address self-review findings on the distributed install fixes Three findings from an adversarial review of this branch: 1. CRITICAL - OpCache.GetStatus crashed under concurrent load. m.Map() returns the live internal map by reference, so deleting from it on the read path was an unsynchronized write to a map four HTTP handlers poll every ~1s -> a 'concurrent map writes' fatal. Rewritten to iterate a Keys() snapshot, build a fresh result map, and apply evictions via the locked DeleteUUID after the loop. Added a -race concurrency regression guard. 2. HIGH - GetStatus evicted failed ops too, hiding them from /api/operations and breaking the dismiss-failed-op flow (the panel keeps Error != nil ops so the admin can read the error and click Dismiss). Eviction now fires only for terminal ops with Error == nil (success/cancelled); failures are retained. 3. MEDIUM - DeleteStalePendingBackendOps missed StatusUnhealthy nodes. A node marked unhealthy on a NATS ErrNoResponders never transitions to offline (health.go skips re-marking it), so its pending ops leaked exactly like the offline case. Unhealthy is now reaped via the same stale-heartbeat grace path (a fresh-heartbeat node is recovering and keeps its op). Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(review-2): don't evict the still-installing soft-path; don't spin on failed ops Second review pass found two issues: 1. MEDIUM (Go) - OpCache.GetStatus evicted the ErrWorkerStillInstalling soft-path op. That op is deliberately Processed=true with no error to show a yellow in-progress state when a worker timed out the NATS round-trip but is still installing in the background; the reconciler confirms the real outcome later. Evicting it (and broadcasting OpEnd + marking the DB completed) hid an install that may still fail. Eviction is now scoped to a clean success (progress 100 + 'completed', matching the job-poll's historical condition) or a cancellation - the soft-path (progress != 100) and failures are kept. 2. MEDIUM (React) - the Backends gallery card rendered ANY operation as an 'Installing...' spinner, so a failed op (now intentionally kept in the list for the OperationsBar error + Dismiss) spun forever. Exclude errored ops from the card spinner, mirroring Models.jsx (isInstalling already excludes op.error). The error + Dismiss still surface in the global OperationsBar. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(ui): refresh Manage backends table when an operation settles The Manage backends table fetched installed backends only on mount/after delete and checked upgrades only on tab activation. After a reinstall/upgrade completed neither re-ran, so the installed-version cell and the 'update available' badge stayed stale until the user switched tabs - the op looked like it 'did nothing'. Watch the operations list (via useOperations) and re-fetch installed backends + available upgrades whenever the count settles, mirroring the operations.length watch Backends.jsx already uses. Consolidates the prior tab-activation upgrades check into the same effect. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
5a0013defe |
test(react-ui): add page render-smoke specs, reset the coverage gate (#10122)
The UI coverage gate was tightened to 0.1pp against a fast-local measurement (39.86% baseline); CI's slower runners measure ~0.9pp lower, so tests-ui-e2e failed there. UI e2e coverage is diffusely non-deterministic and tracks machine speed — a 0.1pp band can't hold across environments. Rather than loosen the gate, raise the floor under it: a render-smoke spec mounts each lazy page (navigate + assert the header renders), covering a dozen previously-untested pages and lifting coverage from ~39% to ~42.7% locally. Restore the tolerance to 0.8pp and set the baseline conservatively (40.0), below the slow-CI floor, so the ratchet holds without flapping. Document the coverage policy — install the git hooks and don't bypass them (no --no-verify, no hand-lowering the baseline or widening the tolerance); raise coverage by adding tests instead; set the UI baseline below the slow-CI floor — in AGENTS.md, CONTRIBUTING.md and .agents/building-and-testing.md. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> |
||
|
|
718223f33b |
feat(localvqe/audio): v1.3 release and add spectrograms to audio transform UI (#10113)
* chore(localvqe): update backend to v1.3, add v1.2/v1.3 gallery models Bump the LocalVQE backend pin 72bfb4c6 -> b0f0378a, which adds the v1.2 (1.3 M) and v1.3 (4.8 M) GGUF SHA-256s to the upstream released-models allowlist (and the arch_version=3 loader) so both load without LOCALVQE_ALLOW_UNHASHED. Add gallery entries for localvqe-v1.2-1.3m and localvqe-v1.3-4.8m (SHA-256 verified against the downloaded weights) and update the audio-transform docs to make v1.3 the current default while noting the compact v1.1/v1.2 alternatives. Assisted-by: Claude:claude-opus-4-8 Claude-Code Signed-off-by: Richard Palethorpe <io@richiejp.com> * chore(flake): add ffmpeg-headless to the dev shell pkg/utils/ffmpeg_test.go shells out to the `ffmpeg` CLI, and the pre-commit gate runs those tests via `make test-coverage`. Without ffmpeg in the dev shell the gate fails with "executable file not found in $PATH". The headless build provides the CLI without GUI/X deps. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(localvqe): parse WAV by walking RIFF sub-chunks Walk the RIFF chunk list instead of assuming the canonical 44-byte header layout. Real inputs (browser-recorded clips, ffmpeg output with an 18/40-byte extensible `fmt ` chunk or trailing LIST/INFO metadata) would otherwise splice header/metadata bytes into the PCM stream as an audible impulse. Honour the `data` chunk size and validate that both `fmt ` and `data` chunks are present. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(security-headers): allow blob: in connect-src for waveform fetch The waveform renderer XHRs/fetches a freshly-created blob: object URL (e.g. an uploaded or enhanced clip before it has a server URL). XHR/fetch of blob: is governed by connect-src, not media-src, so it was blocked by the CSP. Add blob: to connect-src. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(react-ui): add input/output spectrogram view to AudioTransform The transform page only showed time-domain amplitude waveforms, so you could see how loud a clip was but not which frequencies the model touched. Add a time x frequency spectrogram heatmap and render the input and output spectrums side by side, so it's visible which bands the enhancement attenuates (bright input bands that go dark in the output). Computed client-side via a Hann-windowed STFT over both clips (a small dependency-free radix-2 FFT), defaulting to the LocalVQE 512/256 frame geometry. This shows the net input->output spectral change; the model's internal gain mask is not exposed by the backend. - src/utils/fft.js radix-2 FFT - src/hooks/useSpectrogram.js decode + STFT -> normalised dB magnitude grid - src/components/audio/Spectrogram.jsx canvas heatmap (magma colormap) - AudioTransform.jsx dual-spectrogram panel + CSS - e2e spec + UI coverage baseline bump (38.29 -> 39.0; measured ~39.4-40.2) Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * test(react-ui): make UI coverage deterministic, tighten the gate UI e2e line coverage swung ~1pp run-to-run (39.1% <-> 40.2%), which forced a loose 0.8pp tolerance on the monotonic gate — a band wide enough to let a real ~300-line regression through silently. The swing was a bug, not inherent jitter: the 'Create Agent navigates' spec ended on the URL assertion, so AgentCreate.jsx's ~400 lines were collected only when its render happened to beat the coverage teardown. Wait for the page to actually render (assert its heading) so those lines are covered every run. With the race gone, repeated runs land within ~0.013pp of each other, so: - tighten UI_COVERAGE_TOLERANCE 0.8 -> 0.1 (noise floor, not a drift band) - set the baseline to the real, reliably-achieved value (39.0 -> 39.86) Localised by running the V8-coverage suite repeatedly and diffing per-file line coverage; AgentCreate.jsx was the sole ~1pp flipper. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com> |
||
|
|
a44bdb29d4 |
feat: prefix-cache-aware routing for distributed mode (#10071)
* feat(radixtree): generic prefix tree skeleton with longest-match Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): Insert with path recency refresh and entry cap Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): TTL idle-expiry and Evict sweep with branch pruning Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): recency-weighted per-value Weight Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): Remove all entries for a value Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(radixtree): race-free concurrency smoke test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(radixtree): reclaim empty branches, RWMutex reads, TTL boundary, empty-key guard Address review findings on the generic prefix tree: - Extract a shared pruneWalk helper parameterized by a shouldClear predicate and use it from Evict, Remove, and the MaxEntries path. Previously evictOldestLocked cleared a victim's value but never removed the now value-less node or its childless ancestors, so internal nodes accumulated under sustained churn at the cap. The MaxEntries path now prunes the victim and its empty ancestors. - DRY: pruneWalk replaces the duplicated logic in the former pruneLocked and Remove's inner closure. - Switch Tree.mu to sync.RWMutex; LongestMatch, Weight and Len take the read lock (RLock) while Insert, Evict and Remove keep the write lock. Confirmed race-clean under go test -race. - Document the strict greater-than TTL boundary on Options.TTL and expired: age exactly equal to TTL is still live. - Guard Insert against an empty key (no-op): the root never holds a value. Adds Ginkgo specs covering MaxEntries eviction, ancestor reclamation, the no-growth-past-cap invariant, the TTL boundary, and empty-key behavior for both Insert and LongestMatch. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): RoutePolicy enum with parse/resolve Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): Config with defaults and validation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): deterministic xxhash prefix-chain extractor Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): pure filter-then-score replica selection Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): Provider interface and radix-tree-backed Index Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * style(prefixcache): gofmt policy enum comment alignment Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): head-first prefix chunking and hoist Weight out of sort Address code-quality review findings in the prefixcache package. Correctness: ExtractChain now chunks from absolute offset 0 with fixed [0,W),[W,2W),... boundaries and caps the chain to the FIRST MaxDepth head blocks. The previous tail-keeping logic shifted the byte offset by a non-window amount once a conversation grew past MaxDepth*WindowBytes, changing every hash each turn and silently breaking cross-turn longest-prefix matching. The reusable KV/prefix cache lives at the head of the prompt, so anchoring at offset 0 makes the chain a true prefix-chain: P and P+suffix share their full leading overlap. Add a regression spec proving cross-turn stability past the cap. Performance: Index.Decide precomputes each candidate's Weight once (decorate-sort-undecorate) instead of calling the O(tree size) Weight inside the O(n log n) sort comparator. Behavior is unchanged. Lint: encode prev with binary.LittleEndian.PutUint64 instead of a manual byte loop, clearing the modernize rangeint finding. Also add a concurrent Decide/Observe/Invalidate spec to exercise Index's documented concurrency safety under go test -race. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(messaging): prefixcache observe/invalidate subjects and payloads Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): NATS sync publish/apply for observe and invalidate Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributedhdr): ctx carrier for prefix-hash chain Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributedhdr): PrefixChainHook indirection for backend-side chain build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend): stash prompt prefix chain on ctx before distributed routing Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend): mirror modelID fallback for prefix-chain salt parity Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): scheduling config columns for prefix-cache routing Add RoutePolicy and per-model balance/prefix-match override columns to ModelSchedulingConfig and include them in the SetModelScheduling upsert DoUpdates list so updates are not dropped on conflict. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): optional route preference in FindAndLockNodeWithModel Add a RoutePreference type and a new pref parameter so the atomic pick+lock+increment can be biased toward a preferred node without weakening atomicity. A nil preference reproduces the previous ORDER BY behavior exactly. Update the ModelRouter interface, both router.go call sites (pass nil for now; Phase 5 builds the real preference), the test doubles, and the distributed e2e caller. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): make Sync satisfy Provider with Evict Sync.Observe now returns whether the local index treated the assignment as new or extended, and Sync gains an Evict method that delegates to the wrapped index. Together these let SmartRouter hold a single prefixcache.Provider that broadcasts via NATS. Adds a compile-time Provider assertion and an Evict-delegates behavioral test. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): prefix-cache-aware preference and observe in SmartRouter.Route Add a PrefixProvider + PrefixConfig to SmartRouterOptions/SmartRouter (nil keeps routing byte-for-byte the round-robin floor). On each request Route now calls buildPreference: it reads the prompt prefix chain from ctx (distributedhdr.PrefixChain), resolves the per-model policy/thresholds over the global config, loads candidate replica in-flight via a new registry read LoadedReplicaStats (deduped to one entry per node using the MIN in-flight across that node's replicas), asks the provider to Decide, and runs prefixcache.Select. The chosen node is passed as the RoutePreference to FindAndLockNodeWithModel on all three pick paths (cache hit, locked re-pick, cold scheduleAndLoad), and the served node is recorded via Observe only when the resolved policy is prefix_cache so round-robin models never pollute the tree. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): invalidate prefix-cache entries on unload and stale removal UnloadModel and both staleness fall-through paths in Route (after a failed gRPC probe and RemoveNodeModel) now call prefixProvider.Invalidate(model, nodeID), guarded by a nil-provider check so the round-robin floor is unchanged. At runtime the provider is the *prefixcache.Sync, so invalidations also broadcast to peer frontends. Adds a test that a previously hot prefix no longer Decides to a node after UnloadModel. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): rolling forced-disturb pressure counter Add a concurrency-safe per-model rolling counter that tracks how many times a request had a usable hot prefix match but the load guard forced it off the warm node. Entries outside the window are dropped lazily on Count so the backing slice stays bounded. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): autoscale on prefix-cache forced-disturb pressure Wire the rolling forced-disturb counter into the SmartRouter and the ReplicaReconciler. Router: in buildPreference, after Decide + Select, record a forced-disturb when a usable hot prefix match existed (d.HotNodeID != "" and d.MatchRatio >= cfg.MinPrefixMatch) but Select chose a different node (or nothing) because the load guard ruled the warm node out. This is the scale-worthy signal: the cache-warm replica is saturated. It deliberately does not fire for all-unique workloads (no hot match), avoiding false-positive scale-ups. Pressure is optional on SmartRouterOptions; nil keeps the path a no-op. Reconciler: read the same Pressure instance in reconcileModel as an extra scale-up reason, reusing the existing MaxReplicas + ClusterCapacityForModel guards and the UnsatisfiableUntil cooldown that gates the whole method. Pressure never overrides MaxReplicas and never force-evicts; a no-capacity model does not spin. Window and threshold come from prefixcache.Config (PressureWindow default 1m, PressureScaleThreshold default 1) and are configurable via ReplicaReconcilerOptions. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): bound Pressure slice in Record; drop dead reconciler pressureWindow Record now prunes entries older than the rolling window (the same prune Count does), via a shared pruneLocked helper, so a model that takes forced-disturb records but is never Counted (e.g. one with zero loaded replicas the reconciler skips) no longer grows its backing slice unbounded. Also removes the dead pressureWindow struct field and the ReplicaReconcilerOptions.PressureWindow option from the reconciler: they were stored but never read (the window lives inside the *prefixcache.Pressure instance). The scale block now reads pressure.Count once into a local. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(api): prefix-cache fields in scheduling endpoint DTO with validation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ui): prefix-cache routing controls in node scheduling form Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): wire prefix-cache index, NATS sync, and config Activates prefix-cache-aware routing in distributed mode. Builds the prefixcache Index + NATS-backed Sync + Pressure counter, installs the distributedhdr.PrefixChainHook so core/backend/llm.go attaches a prefix chain per request, subscribes to prefixcache.observe/prefixcache.invalidate to apply peers' events to the local index (no re-broadcast), threads PrefixProvider/PrefixConfig/Pressure into the SmartRouter and Pressure/PressureThreshold into the ReplicaReconciler, and runs a background eviction ticker (every TTL/2) bound to the app context. Enabled by default; --distributed-prefix-cache=false (LOCALAI_DISTRIBUTED_PREFIX_CACHE) opts out and leaves the provider/pressure nil so routing stays round-robin. --distributed-prefix-cache-ttl (LOCALAI_DISTRIBUTED_PREFIX_CACHE_TTL, default 5m) controls entry idle-timeout and eviction cadence. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(nodes): round-robin-floor invariant for prefix-cache routing Drives Select directly: a saturated hot node (in_flight 50 vs 0) is never picked even with a perfect prefix match (round-robin floor holds), while a balanced hot node within the load slack is reused. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(prefixcache): clear branch lint findings and em dashes Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): validate prefix-cache config at startup wiring Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * perf(radixtree): single-walk WeightsFor for batch value weights Add Tree.WeightsFor(values, now) which computes the recency-weighted weight for many values in a single O(N + len(values)) tree traversal, versus calling Weight once per value (O(len(values) * N)). Consumers that score K candidates against the tree under the read lock no longer pay K full walks. Extract the per-entry contribution math into an unexported helper shared by both Weight and WeightsFor so the metric stays identical (DRY). Weight's public behavior is unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(config): add ModelConfig.ModelID() single source of truth The c.Name fallback to c.Model was duplicated in core/backend/options.go (feeding model.WithModelID) and hand-copied into core/backend/llm.go (the prefix-chain salt). These MUST agree or the prefix-cache salt diverges silently from the id the model loader tracks. Consolidate both into a new config.ModelConfig.ModelID() helper and call it from both sites. Behavior is identical. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * perf(prefixcache): reuse one xxhash.Digest in ExtractChain ExtractChain allocated a fresh xxhash.New() Digest per block (up to MaxDepth per call) and grew the chain slice without preallocation. Reuse a single Digest via Reset() before each block and preallocate the chain to min(nBlocks, MaxDepth). xxhash seed 0 is stateless, so Reset()+Write produces the byte-identical value to a fresh New()+Write. Output hashes are unchanged, preserving the cross-process determinism that peers rely on over NATS. Verified by capturing ExtractChain output for the existing test inputs before and after the refactor: identical. Existing extractor tests pass unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): drop hot match when matched node is not a candidate; weigh cold candidates in one walk Index.Decide called radixtree.LongestMatch over the whole tree, so the deepest match could be a node that is offline, unloaded, or simply not in the passed candidate set. Honoring that as HotNodeID produced a false forced-disturb signal upstream (buildPreference records pressure when chosen != HotNodeID), making it look like a warm replica was load saturated when it was actually absent. Build the candidate set once and only set HotNodeID/MatchRatio when the matched node is an actual candidate; otherwise fall back to cold placement. A future refinement could ask the tree for the longest match restricted to the candidate nodes (shallower-but-valid) instead of dropping it. Also replace the per-candidate tree.Weight call in the cold-order sort with a single tree.WeightsFor walk, turning O(K*N) under the read lock into O(N + K). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(prefixcache): remove Select's unreachable deterministic fallback buildPreference always passes ColdOrder as a permutation of the full candidate set, so the cold-order loop hits every eligible candidate. The trailing best/bestIF scan was dead. Replace it with a plain "return """ and document that ColdOrder is guaranteed to cover all candidates, so "" means none were eligible. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(nodes): fetch model scheduling config once per Route GetModelScheduling was read three times per request - in resolveSelectorCandidates, buildPreference, and nodeMatchesScheduling - three DB round-trips for one row that is immutable for the life of the request, and not a consistent snapshot. Fetch it once near the top of Route and thread the *ModelSchedulingConfig (may be nil) into all three helpers. scheduleNewModel keeps its own fetch since it runs outside the Route snapshot. Behavior is identical for nil sched. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(autoscale): add Pressure.Reset to consume forced-disturb signal Pressure.Count is non-draining (it prunes only by age), so a single burst of forced-disturbs stays within the rolling window for the whole window and keeps Count >= threshold on every reconciler tick. The reconciler will use Reset to clear a model's events after acting on the signal so a fresh scale-up requires fresh forced-disturbs to accumulate, rather than one burst driving the model toward MaxReplicas. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(autoscale): at most one scale-up per reconcile tick, consume pressure Two autoscale bugs: 1. Over-scaling: the pressure scale-up block read Pressure.Count but never consumed it. With a non-draining counter a single forced-disturb burst kept Count >= threshold across the whole window, firing scaleUp on every tick and pushing the model toward MaxReplicas off one transient burst. After a successful pressure-triggered scale-up the reconciler now calls Pressure.Reset to consume the signal. 2. Double scale-up in one tick: the all-replicas-busy block and the pressure block could both fire in the same reconcileModel pass, each calling scaleUp(+1) against the same `current` read once at the top, so a model that was both busy and over threshold scaled +2 and could overshoot MaxReplicas by one. A scaledUp flag now enforces at most one scaleUp(+1) per tick: the pressure block is skipped if the busy block already scaled, and scale-down is skipped in any tick that scaled up. MinReplicas enforcement, UnsatisfiableUntil backoff, and capacity guards are unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): replica-removed chokepoint hook for prefix-cache invalidation Add SetReplicaRemovedHook to NodeRegistry and fire it from both RemoveNodeModel and RemoveAllNodeModelReplicas after a successful delete. This is the single chokepoint every replica-removal path funnels through (router eviction, reconciler scale-down, probe reaper, health-monitor node-down reap, RemoteUnloaderAdapter), so the prefix-cache index can be invalidated by construction rather than wiring each call site individually. The hook is stored in an atomic.Pointer so the startup wiring (setter) and the request/reconcile-time fire are race-free; it is nil-safe when unset. GORM Delete reports no error for a no-op delete, so the hook also fires when nothing was removed; the consumer's Invalidate(model, node) is idempotent so this is harmless. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): invalidate prefix-cache on any replica removal via registry hook Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(prefixcache): single source of truth for threshold bounds Extract ValidateThresholds into prefixcache/config.go so the per-model override validation (nodes.go endpoint) and Config.Validate share one implementation of the numeric bounds (min_prefix_match in [0,1], balance_abs_threshold >= 0, balance_rel_threshold == 0-or->= 1) instead of hard-coding them in two places. The route_policy allow-list stays explicit (not ParsePolicy, which maps typos to Default). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(nodes): preserve prefix-cache settings on partial scheduling update A scheduling POST that omitted route_policy/thresholds (e.g. a min_replicas-only update) full-replaced every column and silently reset the model's previously-configured prefix-cache settings to empty/zero. Make the four prefix-cache request fields pointers so omitted is distinguishable from explicit zero, and merge PATCH-style in SetSchedulingEndpoint: a provided pointer wins, an omitted one preserves the existing config value (zero default when none). Non-prefix fields keep their full-replace PUT semantics. Validation now runs on the resolved values via prefixcache.ValidateThresholds. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): make Invalidate a no-op for uncached models and skip empty broadcasts A registry chokepoint fires Sync.Invalidate(model, nodeID) for every replica removal of every model, including round-robin models that never used the prefix cache. Index.Invalidate previously called tree(model), which lazily created and permanently retained an empty radix tree for any model that ever lost a replica, growing the trees map without bound. Sync.Invalidate also published a NATS PrefixCacheInvalidateEvent on every call, amplifying no-op removals across the cluster. Index.Invalidate now looks the tree up read-only via existingTree and returns without allocating when none exists. The Provider interface is unchanged; Sync gates the broadcast through an optional invalidateExisting(bool) capability type-asserted from the wrapped Index, falling back to the prior always-broadcast behavior for other Provider implementations. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * perf(prefixcache): derive Decide candidacy from WeightsFor and skip trivial sort WeightsFor already returns a map keyed by every requested candidate, so the separate candidates set built to validate the hot match was redundant: a node is a candidate iff it is a key in the weights map. Drop the extra map and gate the hot-match check on weights membership. Also skip the sort when there is at most one candidate, since the input order is already the cold order. Behavior is unchanged. Deferred follow-up: skipping the WeightsFor walk entirely when a hot match wins would need lazy cross-file changes and is out of scope here. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(nodes): fire replica-removed hook on bulk node_models deletes; trim LoadedReplicaStats columns Bulk node-scoped node_models deletes (Register re-register cleanup, MarkOffline, MarkDraining, Deregister) removed rows directly without firing the replica-removed hook, so the prefix-cache index kept pointing at nodes whose models were gone. Capture the DISTINCT model names before each bulk delete and fire fireReplicaRemoved once per model after a successful delete, restoring the single-chokepoint invariant for all removal paths. The pre-query is skipped when no hook is set so the no-hook path stays cheap. Also narrow LoadedReplicaStats to SELECT only node_id and in_flight (the only fields the router consumer reads), dropping the JOIN-side available_vram fetch and unused columns while keeping the []ReplicaCandidate return type unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(reconciler): consume autoscale signals only on a real scale-up scaleUp was fire-and-forget (void) yet its callers unconditionally consumed the pressure signal (Pressure.Reset) and the MinReplicas hysteresis (ClearUnsatisfiable) right after calling it. If scaleUp added nothing (ScheduleAndLoadModel errored, or no node could be loaded) the saturated warm replica got no new replica AND its accumulated forced-disturb history was wiped, forcing the signal to re-accumulate over a full PressureWindow before the next attempt. Make scaleUp return whether at least one replica was actually scheduled, and gate the side effects on it: - pressure block (2b): set scaledUp and call Pressure.Reset only on success; on failure preserve the signal so the next tick retries off the same accumulated pressure. - busy-burst block (2): set scaledUp from the return value so a failed attempt does not suppress the pressure path or scale-down. - MinReplicas block: call ClearUnsatisfiable only on success so a failed attempt does not reset the unsatisfiable counter. All existing invariants (MaxReplicas, capacity gating, UnsatisfiableUntil cooldown, at-most-one-scale-up-per-tick) are preserved. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(nodes): drop router's redundant prefix-cache Invalidate calls The NodeRegistry removal chokepoint (RemoveNodeModel / RemoveAllNodeModelReplicas) now fires SetReplicaRemovedHook, which invalidates the prefix-cache index. The router was also calling prefixProvider.Invalidate explicitly right after each registry removal on the two stale-replica health-probe fall-throughs in Route and in UnloadModel, so every router-side eviction invalidated twice (double tree-prune + double NATS broadcast). Remove the three redundant explicit Invalidate calls and their empty nil-guards. Each removed call sat immediately after a registry removal that fires the hook, so invalidation is preserved via the chokepoint. Decide/Observe usage is untouched. Re-point the unit test (fake registry fires no hook) to assert the removal chokepoint is exercised on unload instead of the router's direct invalidation. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): broadcast invalidations unconditionally for cross-frontend coherence Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): reject TTL<=0 in Config.Validate (eviction ticker would panic) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(nodes): make capture+delete atomic in bulk node_models removal paths MarkOffline, MarkDraining, and the Register re-register cleanup ran the nodeModelNames SELECT and the bulk node_models DELETE as two separate statements on r.db with no transaction. A SetNodeModel landing between the two was deleted but its replica-removed hook never fired, leaving the prefix-cache index pointing at a removed replica until TTL or candidacy self-heal. Wrap the capture and the delete in a single db.Transaction in each path (mirroring how Deregister already does it). The captured model names are collected into a slice declared outside the closure; the replica-removed hook fires for each only after the transaction commits, so a rollback never invalidates the index for a removal that did not persist. The set of fired hooks now equals exactly the set of node_models rows actually deleted, with no interleaving gap. The status flip in MarkOffline/MarkDraining (setStatus) is a separate, pre-existing operation and routing already filters non-healthy nodes, so it stays outside the transaction; return contracts are unchanged. Deregister was already correct and is untouched. The cheap-path skip (no hook -> skip the SELECT) is preserved. Adds a spec asserting MarkOffline fires hooks for exactly the rows it deletes and leaves no node_models row behind (consistent snapshot). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(nodes): debug logging for prefix-cache routing decisions and observations Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(radixtree): match shared prefixes by valuing every node on insert Insert recorded the value (node id) only on the final node of the key chain, leaving every intermediate prefix node valueless. LongestMatch returns the deepest node that hasValue, so two chains that share a leading block but diverge in the tail never matched: only exact-repeat queries hit. That broke the prefix-cache routing core use cases (shared system prompt, multi-turn extension, volatile tail), all of which rely on prefix matching rather than exact-repeat. Set value/hasValue/lastSeen at every node along the chain so each prefix-block node remembers the node id that served that prefix (SGLang/vLLM-style). The deepest match wins, and the last writer owns a shared prefix node (a recency heuristic: the most recent chain through a block is the one most likely still warm). size now counts valued nodes, which is the intended meaning. Updated radixtree tests to the new semantics: deepest-prefix test uses non-overlapping chains, a new test asserts last-writer-owns-shared-node, Evict/Remove/MaxEntries expectations recomputed for per-prefix-node counting, and a shared-prefix LongestMatch red test added. Added a prefixcache Decide test proving a prefix-only query routes to the warm node. No prefixcache .go logic changed. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(distributed): lock in prefix-cache routing behavior end to end Add a DB-backed e2e spec that drives SmartRouter against a real NodeRegistry (Postgres testcontainer) and the real prefixcache.Index radix-tree provider, using a fake gRPC backend factory so no real inference runs. Covers the five behaviors validated by hand: 1. Cold miss + observe: an unseen prefix chain cold-places and is recorded. 2. Hot-match affinity: the same chain returns to its warm node X. 3. Shared-prefix match: a divergent chain sharing X's leading prefix still routes to X (the radix-tree regression we fixed). 4. Negative control: an unrelated chain is a cold miss, not a false hot match on X. 5. Failover + invalidation: removing X's replica fires the registry chokepoint hook to invalidate the prefix entry, and the chain fails over to surviving node Y and re-homes there. Replaces the need for manual docker-compose re-runs. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(prefixcache): make prefix-cache affinity replica-granular Track prefix-cache affinity per loaded replica (a backend process with its own KV cache) instead of per node, so multiple replicas of the same model on one node each keep distinct affinity and a hot prefix routes back to the exact replica that served it. - radixtree: add RemoveFunc(pred) and reimplement Remove on top of it. - prefixcache: introduce ReplicaKey{NodeID, Replica}; Index/Candidate/ PrefixDecision/Select/Provider now key on ReplicaKey. Add InvalidateNode to drop every replica of a node; Invalidate drops one replica. Select returns (ReplicaKey, bool) and gains a deterministic least-in-flight eligible fallback (tiebreak NodeID then Replica). - messaging: carry Replica on PrefixCacheObserveEvent and PrefixCacheInvalidateEvent (Replica < 0 means all replicas of the node). - Sync delegates + broadcasts with replica; InvalidateNode broadcasts Replica=-1; ApplyInvalidate routes negative replica to InvalidateNode. This is part 1 of 2; the registry/router/wiring consumers are updated separately. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): make prefix-cache routing replica-granular Wire the SmartRouter, NodeRegistry, and distributed startup to the replica-keyed prefixcache API. Affinity is now tracked per replica (each replica is a separate process with its own KV cache), so a prefix served by (node,0) no longer leaks onto the same-node sibling (node,1). - RoutePreference gains PreferredReplica; FindAndLockNodeWithModel locks the EXACT (node_id, replica_index) row, falling through to the default ORDER BY when that replica is not loaded. - SetReplicaRemovedHook now carries replicaIndex; RemoveNodeModel fires the specific replica, RemoveAllNodeModelReplicas and the four bulk node-scoped deletes fire replica<0 (all replicas of the node). - buildPreference builds one Candidate per loaded replica and locks the exact replica the policy chose; observePrefix records the served ReplicaKey at every call site. - distributed.go routes the hook to InvalidateNode (replica<0) or Invalidate(key). - Tests updated to the replica-keyed API plus new coverage: a hot prefix on (node,0) prefers replica 0 over the same-node sibling (router unit + e2e), and FindAndLock locks the exact preferred replica. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(distributed): derive prefix chain from messages for tokenizer-template models Prefix-cache-aware routing built its prompt-prefix chain from the rendered prompt string `s` in ModelInference. For models with TemplateConfig.UseTokenizerTemplate the frontend never renders a prompt - the backend tokenizes the structured messages itself - so `s` is empty, the chain is empty, and routing silently falls back to round-robin. That covers the bulk of modern chat models (qwen3, llama3, ...), so the feature effectively never engaged for them. Fall back to messagesPrefixSource(messages): a deterministic, prefix-stable head-first serialization of the conversation (role + content per turn). Two requests sharing a leading system prompt and early turns share a leading byte prefix, which ExtractChain maps to a shared chain prefix - landing both on the same cache-warm replica. The rendered `s` is still preferred when present (higher fidelity for non-template models). Found via the multi-replica-per-node e2e: zero "prefix-cache routing decision" logs despite per-request Route calls, traced to the empty-chain guard. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): document prefix-cache routing roadmap Add a routing-and-caching roadmap section to the distributed-mode guide, linking the epic (#10063) and the follow-up issues (#10064-#10070) surfaced from a survey of SGLang, vLLM production-stack, Ray Serve, llm-d, AIBrix, and NVIDIA Dynamo. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
a473a32678 |
test(react-ui): cover models gallery empty-state reset flow (#10019)
Exercise the filtered empty-state path in the models gallery and verify that the clear-filters action restores the list and resets the filter selection. Assisted-by: Codex:gpt-5 Signed-off-by: Ching Kao <0980124jim@gmail.com> |
||
|
|
b81a6d01b3 |
perf(react-ui): code-split bundle, speed up coverage suite (#10042)
* Curate the highlight.js build to ~29 languages (lib/core + the common set) instead of the full ~190-grammar default: -787 KB raw / -230 KB gz on the base bundle. * Code-split every route via React.lazy with a per-layout <Suspense> in App.jsx so the sidebar stays mounted on navigation. Initial entry chunk drops from 3194 KB raw / 887 KB gz to 397 KB / 122 KB (-87%). Warm chunks on sidebar hover/focus/touch via a preload registry so the click finds the chunk already in flight or cached. * Migrate Playwright coverage from istanbul (build-time counters) to native Chromium V8 coverage, with per-worker accumulation + conversion. Suite drops from 71s to 30s at 20 workers (~58%) at the non-instrumented floor. * Keep the coverage gate bundling-invariant: the coverage build inlines dynamic imports so every shipped source file lands in the denominator (otherwise untested page chunks silently drop out and inflate the percentage). Production builds stay code-split. * Add UI_TEST_WORKERS=N Makefile knob; tighten coverage tolerance to 0.8pp now that jitter sits near istanbul's ~0.5pp again. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> |
||
|
|
373dc44992 |
fix(react-ui): force .check() on hidden Toggle input in fits-filter e2e (#10031)
* fix(react-ui): force .check() on hidden Toggle input in fits-filter e2e The polish PR (#10030) swapped the raw <input type=checkbox> for the shared <Toggle> component, which visually hides its native input via opacity:0;width:0;height:0. Playwright's .check() waits for visibility before clicking and times out after 30 s, breaking two UI E2E tests: - enabling fits filter hides models that exceed available VRAM - fits filter state persists after reload Pass { force: true } to skip the visibility check; the input is still the real focusable checkbox and toggles state on click. The companion .toBeChecked() assertion only reads state and works unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 * fix(react-ui): click visible Toggle track in fits-filter e2e force:true skips the actionability checks but not the viewport check, and the Toggle's hidden input has width:0;height:0 so Playwright still reports "Element is outside of the viewport". Click the visible .toggle__track inside the filter-bar-group__toggle wrapper instead — that's what a real user clicks, and label-input association toggles the wrapped checkbox naturally. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
02a0e70396 |
fix(react-ui): polish 'Fits in my GPU' filter to use design-system Toggle (#10030)
* fix(react-ui): polish 'Fits in my GPU' filter to use design-system Toggle The recently added VRAM-fit filter in the Models page used a raw <input type="checkbox"> next to the themed range slider, breaking the visual language of the rest of the row. Swap it for the shared <Toggle> component (already used by Backends, Settings, Traces, AgentCreate), adopt the filter-bar-group__toggle class to drop the duplicated inline styles, add a fa-microchip icon to mirror the per-row fit indicator, and add a subtle left divider so the filter reads as separate from the context-size slider on its left. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(react-ui): move 'Fits in GPU' filter to filter row and unify copy Two follow-ups on the previous polish pass: 1. Move the toggle from the context-slider row into the filter-button row above. The toggle is a filter on the result set, not a config for VRAM estimation, so it belongs with the type chips and backend select. The context slider stays its own thing. 2. Unify the label copy. The same locale file had "Fits in my GPU" for the filter and "Fits in GPU" for the per-row indicator; pick the shorter, possessive-free variant everywhere (en/de/es/it/zh-CN). Update e2e selectors to match. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
893e69cbf8 |
fix(react-ui): share single /api/operations poller across consumers (#10029)
useOperations() spun up its own setInterval per hook instance, so on pages like /app/models the OperationsBar in App.jsx plus the page's own useOperations() call each polled /api/operations at 1 Hz - 2 RPS sustained for the whole session, repeated on Backends and Chat. Lift the poller into an OperationsProvider mounted under AuthProvider so all consumers (OperationsBar, Models, Backends, Chat) share one timer. The hook file re-exports from the context to keep call sites unchanged. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
c9a1a7e6a0 |
UI: add 'Fits in my GPU' filter on Install Models (#10017)
* feat(ui): add GPU fit filter on models install page * Delete docs/vram-fits-filter-backend-optionals.md Signed-off-by: Siddharth More <siddimore@gmail.com> --------- Signed-off-by: Siddharth More <siddimore@gmail.com> |
||
|
|
8d70855ea6 |
test: add Go + React UI coverage gates and fill test gaps (#9989)
- Strict monotonic Go coverage gate (make test-coverage-check, 45% baseline) run in CI; fixes ginkgo dropping all-but-one coverprofile across multiple recursive roots, builds with -tags auth, and folds in the in-process tests/e2e suite via --coverpkg. - React UI e2e coverage (make test-ui-coverage: vite-plugin-istanbul + nyc, nix-provided Chromium) plus e2e specs for 6 previously-untested pages, and a UI coverage gate (make test-ui-coverage-check) with a small tolerance since e2e line coverage jitters ~0.5pp run-to-run. - pre-commit hook: lint + coverage on Go changes, Playwright e2e + UI coverage gate on react-ui changes; install with make install-hooks. - New Go handler tests (settings, branding), hermetic base64 download test. - fix(ui): model editor reads vram_display (snake_case), so the VRAM estimate renders again; covered by a regression test. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Richard Palethorpe <io@richiejp.com> |