fix(ci): use namespace import for js-yaml in changed-backends.js
js-yaml's ESM build exposes only named exports (load, dump, ...) and no
default export. Bun's strict ESM interop rejects the default import with
'Missing default export in module js-yaml.mjs', failing the detect-changes
and generate-matrix CI jobs. Import the namespace instead; yaml.load (the
only usage) resolves to the named export, so behavior is unchanged.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(backend): call vram.EstimateModelMultiContext for model size estimate
core/backend/options.go called vram.EstimateModel, which does not exist in
the vram package (it exposes EstimateModelMultiContext). This broke the build
on master (undefined: vram.EstimateModel). Use EstimateModelMultiContext with
a nil context-size slice (defaults to a single 8192 estimate); the returned
MultiContextEstimate.SizeBytes is exactly what the caller consumes, so size
estimation behavior is unchanged.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(watchdog): add size-aware LRU eviction mode
When the model count hits the LRU limit or the memory reclaimer fires,
evict the largest model by on-disk file size first rather than the
least-recently-used one. For GGUF models the file size is a reliable
proxy for GPU/RAM footprint, so evicting the largest candidate maximises
freed memory per eviction round while keeping small utility models
(embeddings, classifiers, rerankers) resident.
Changes:
- `pkg/model/watchdog.go`: add `sizeAwareEviction` flag and
`modelSizes map[string]int64` to `WatchDog`; sort candidates by
`sizeBytes` desc (LRU time as tiebreaker) when the flag is set;
add `RegisterModelSize`, `SetSizeAwareEviction`, `GetSizeAwareEviction`
- `pkg/model/watchdog_options.go`: add `WithSizeAwareEviction` option
- `pkg/model/initializers.go`: stat model file after load and call
`RegisterModelSize` so size data is available before the first eviction
- `core/config/application_config.go`, `runtime_settings.go`: add
`SizeAwareEviction` field and `WithSizeAwareEviction` app option;
expose via `ToRuntimeSettings` / `ApplyRuntimeSettings` for the
`POST /api/settings` live-reload path
- `core/cli/run.go`: add `--size-aware-eviction` flag /
`LOCALAI_SIZE_AWARE_EVICTION` env var
- `core/application/startup.go`, `watchdog.go`: wire the new option
through to `NewWatchDog`
- `pkg/model/watchdog_test.go`: new specs covering option enable, dynamic
toggle, largest-first ordering, equal-size LRU tiebreaker, no-size
fallback to LRU, and size-map cleanup on eviction
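The ordering rule described above can be sketched roughly as follows. This is an illustration in Python, not the Go code in `pkg/model/watchdog.go`; the `Candidate` type and `eviction_order` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    last_used: float  # timestamp of last use; smaller means older
    size_bytes: int   # 0 when no size was registered (assumption of this sketch)


def eviction_order(candidates, size_aware):
    """Return candidates in the order they would be evicted."""
    if not size_aware:
        # Plain LRU: least recently used first.
        return sorted(candidates, key=lambda c: c.last_used)
    # Largest first; among equal sizes, least recently used first.
    # With no sizes registered (all 0) this degrades to plain LRU.
    return sorted(candidates, key=lambda c: (-c.size_bytes, c.last_used))


models = [
    Candidate("embedder", last_used=10.0, size_bytes=300_000_000),
    Candidate("chat-7b", last_used=30.0, size_bytes=4_000_000_000),
    Candidate("chat-13b", last_used=20.0, size_bytes=8_000_000_000),
]
```

With the flag set, the small embedder stays resident longest even though it is the least recently used entry.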
Closes #9375
Signed-off-by: supermario_leo <leo.stack@outlook.com>
* refactor(watchdog): use vram estimation scaffolding for model size
Replace the brittle os.Stat(modelFile) approach with a proper call to
pkg/vram, which handles multi-file models (DownloadFiles, MMProj) and
all weight file types, not just single GGUF files.
- Add estimateModelSizeBytes() in core/backend/options.go that collects
all weight file URIs from the model config, resolves them to file://
URIs, and calls vram.Estimate() with the shared DefaultCachedSizeResolver
(15-min TTL cache avoids redundant stat calls on repeated loads)
- Thread the result through via a new WithModelSizeBytes() loader option
- In initializers.go, consume the pre-computed size instead of calling
os.Stat; if no size was supplied (e.g. for external/router-dispatched
models) the registration is simply skipped
Signed-off-by: supermario_leo <leo.stack@outlook.com>
* refactor(watchdog): use EstimateModel with HF fallback for size estimation
Switch estimateModelSizeBytes from calling vram.Estimate directly to the
unified vram.EstimateModel entry point, which adds automatic fallbacks:
file-based GGUF metadata → HF API → size string.
Also extract the HuggingFace repo ID from model URIs (huggingface://,
hf://, https://huggingface.co/ and org/model short-form) and pass it
as ModelEstimateInput.HFRepo, so models not yet downloaded locally can
still get a size estimate via the HF API.
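A rough sketch of the repo-ID extraction for the four URI shapes listed above. It is written in Python for illustration only; `extract_hf_repo` is a hypothetical name, and the exact parsing rules (for example how the short form is told apart from a local path) are assumptions, not the Go implementation.

```python
from typing import Optional
from urllib.parse import urlparse


def extract_hf_repo(uri: str) -> Optional[str]:
    """Return "org/model" for a recognised HuggingFace URI, else None."""
    for scheme in ("huggingface://", "hf://"):
        if uri.startswith(scheme):
            parts = uri[len(scheme):].split("/")
            if len(parts) >= 2 and all(parts[:2]):
                return "/".join(parts[:2])
            return None
    if uri.startswith("https://huggingface.co/"):
        parts = [p for p in urlparse(uri).path.split("/") if p]
        return "/".join(parts[:2]) if len(parts) >= 2 else None
    # Short form: exactly "org/model" with no scheme (assumed rule).
    if "://" not in uri and uri.count("/") == 1 and all(uri.split("/")):
        return uri
    return None
```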
Addresses @mudler's review feedback: "better to rely on EstimateModel
and pass by the HF URL of the model extracted from the URI".
Signed-off-by: supermario_leo <leo.stack@outlook.com>
* feat(webui): add Size-Aware Eviction toggle to settings page
The size-aware eviction setting was wired through the CLI flag and the
RuntimeSettings live-reload path (POST /api/settings) but had no control
on the React settings page, so it could not be toggled from the UI.
Add a Size-Aware Eviction toggle to the Watchdog section, next to the
existing Force Eviction When Busy / LRU eviction controls. The settings
page loads and saves the whole RuntimeSettings object, so the new
size_aware_eviction key is picked up with no extra plumbing.
Addresses @mudler's review feedback: the application config setting
should land on the same UI settings page as the other controls.
Signed-off-by: supermario_leo <leo.stack@outlook.com>
---------
Signed-off-by: supermario_leo <leo.stack@outlook.com>
* fix(vllm): don't stream raw tool-call markup as content when a tool parser is active
When a tool_parser is configured and the request carries tools, the streaming
loop emitted every text delta as delta.content — including the model's raw
tool-call markup (e.g. <tool_call>...) — because extract_tool_calls only runs
on the full output after the stream. Clients streaming a tool call therefore
saw the unparsed tool-call syntax as assistant content.
Buffer the text while a tool parser is active for the request; the existing
end-of-stream chat_delta already carries the parsed tool_calls (or the cleaned
content), which the Go side converts to SSE deltas. Non-tool-parser streaming
is unchanged.
Add a server-less regression test covering both the tool-call case (no raw
markup leaked as content) and the plain-text case (content delivered exactly
once — guards against double-emitting the buffered content).
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* test(vllm): add expectedFailure test for progressive streaming with tool parser (Case 3, #582)
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* test(vllm): add Cases 4+5 — marker split across chunks + false-positive prefix (TDD, Option B state machine, #582)
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* feat(vllm): progressive streaming via parser.extract_tool_calls_streaming
When a tool parser is active for a tool-enabled streaming request,
#10346 buffers the entire generation and surfaces it on the final
chunk to prevent raw tool-call markup from leaking as delta.content.
This is correct but turns the request into effectively non-streaming
for plain-text responses — the client sees nothing until the model
stops.
Every concrete tool parser shipped with vLLM 0.23+ already implements
extract_tool_calls_streaming (Granite4, Qwen3Coder, DeepSeekV31, Jamba,
Ernie45, Hermes2Pro, llama3_json, mistral, …). Use it: instantiate
the parser before the streaming loop and call its streaming method per
delta, emitting DeltaMessage(content=…) or DeltaMessage(tool_calls=[…])
when the parser is ready.
Falls back to the existing #10346 buffer path when:
- the parser does not have extract_tool_calls_streaming, OR
- extract_tool_calls_streaming raises mid-stream (logged, the
rest of the request finishes via post-loop extract_tool_calls).
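The control flow (native streaming when available, buffer path otherwise or after an exception) can be sketched as below. `FakeStreamingParser` is a hypothetical stand-in with a simplified three-argument method; real vLLM parsers take more arguments, and the backend emits DeltaMessage objects rather than dicts.

```python
class FakeStreamingParser:
    """Hypothetical stand-in for a tool parser with a streaming method."""

    def extract_tool_calls_streaming(self, previous_text, current_text, delta_text):
        if "<tool_call>" in current_text:
            return None  # hold markup back until the parser is ready
        return {"content": delta_text}


def stream(deltas, parser):
    """Yield per-delta messages; buffer when streaming is unavailable."""
    native = hasattr(parser, "extract_tool_calls_streaming")
    text = ""
    for delta in deltas:
        previous, text = text, text + delta
        if not native:
            continue  # buffer path: nothing is emitted mid-stream
        try:
            msg = parser.extract_tool_calls_streaming(previous, text, delta)
        except Exception:
            native = False  # buffer the rest of the request
            continue
        if msg is not None:
            yield msg
    # End of stream: hand the full text to the non-streaming extraction step.
    # This is input for post-loop parsing, not content sent to the client.
    yield {"full_text": text}
```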
Tests (TestStreamingToolParser):
1. Buffer path: no markup leaked, no content duplication
2. Native streaming: plain-text response streams progressively
3. Native streaming: tool_call structured, no markup leaked
4. Native streaming exception → graceful fallback, no markup, no crash
5. No tool parser → unchanged per-delta content stream
E2E verified against qwen3_coder on vLLM 0.23.0 (NVIDIA GB10 / arm64 / CUDA 13).
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* docs(vllm): add server-side TTFT benchmark for the streaming tool-parser path
Self-contained stdlib-only script that measures time-to-first-token (TTFT)
for the vLLM backend's two streaming scenarios:
- tool_call: request mentions a tool; model is expected to call it
- plain_text: request offers a tool but explicitly asks for prose
Use this to compare:
- the buffer-all path (#10346) → plain_text TTFT ≈ total response time
- the native-streaming path (this PR) → plain_text TTFT ≈ true first-token time
python examples/vllm-bench/ttft_streaming_tool_parser.py \
--url http://localhost:8080 --model my-coder --runs 3
Lives under examples/ so it does not interfere with the test suite.
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* examples/vllm-bench: add long-text scenario (8 paragraphs, 1500 tokens)
The long-text scenario shows the buffering vs streaming difference most
dramatically: with the buffer-all path, the client receives nothing for
20+ seconds and then the entire 1500-token response at once. With native
streaming, the first token arrives in tens of milliseconds and the
response flows progressively.
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
---------
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
Co-authored-by: Philipp Wacker <philipp.wacker@ibf-solutions.com>
* feat(nemo): enable word-level timestamps for ASR models
The nemo backend ignored timestamp_granularities and always returned a
single segment with start=0 end=0, making word-level timestamps
impossible to obtain even though the NeMo models (parakeet-tdt, etc.)
fully support them.
Changes:
- Add _get_stride_seconds() to compute frame duration from the model's
preprocessor window_stride and encoder subsampling_factor.
- Add _build_segments_with_words() that extracts word offsets from the
NeMo Hypothesis.timestamp dict and converts frame indices to
nanosecond timestamps.
- Support 'word' granularity (one segment per word) and 'segment'
granularity (merge at time-gap boundaries using a dynamic threshold).
- Populate TranscriptSegment.words with TranscriptWord entries so
callers get both segment-level and word-level timing.
- Only request timestamps from NeMo when the caller actually asks for
them (timestamp_granularities is non-empty), keeping the fast path
unchanged for callers that don't need timestamps.
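A minimal sketch of the arithmetic behind the changes above, with simplified helper names. The word-offset keys (`word`, `start_offset`, `end_offset`) and the fixed `max_gap_ns` threshold are assumptions of this sketch; the commit uses a dynamic threshold whose rule is not spelled out here.

```python
def stride_seconds(window_stride: float, subsampling_factor: int) -> float:
    """Duration of one encoder output frame in seconds."""
    return window_stride * subsampling_factor


def words_to_ns(word_offsets, stride_s):
    """Convert frame offsets to (word, start_ns, end_ns) tuples."""
    return [
        (
            w["word"],
            round(w["start_offset"] * stride_s * 1e9),
            round(w["end_offset"] * stride_s * 1e9),
        )
        for w in word_offsets
    ]


def merge_segments(words, max_gap_ns):
    """Group consecutive words into segments, splitting at large time gaps."""
    segments = []
    for text, start, end in words:
        if segments and start - segments[-1]["end"] <= max_gap_ns:
            segments[-1]["text"] += " " + text
            segments[-1]["end"] = end
        else:
            segments.append({"text": text, "start": start, "end": end})
    return segments
```

For example, a 10 ms window stride with 8x subsampling gives 80 ms per frame, so frame offset 10 maps to 0.8 s.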
Tested with nvidia/parakeet-tdt-0.6b-v3 on the JFK "ask not" clip:
curl -X POST /v1/audio/transcriptions \
-F file=@jfk.wav -F model=nemo-parakeet-tdt-0.6b \
-F 'timestamp_granularities[]=word' -F response_format=verbose_json
→ each word has correct start/end times in seconds.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
* fix(nemo): address Copilot review feedback
- Narrow exception handling in _get_stride_seconds to catch only
AttributeError, KeyError, TypeError instead of bare Exception, and
emit a warning when falling back to the hardcoded stride.
- Remove explicit return_hypotheses=False when timestamps are requested;
timestamps=True already forces NeMo to return Hypothesis objects.
- Add a warning when NeMo does not return Hypothesis objects despite
timestamps being requested.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
---------
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
The parakeet-specific word accessors can return stale initialisation
data (model name, binary blobs) for segments with no real speech.
Add isValidWord() to filter out words that have:
- empty or whitespace-only text
- U+FFFD replacement characters (from binary data scrubbing)
- negative timestamps
- zero duration (end <= start)
Also skip empty segments entirely when they have no recognisable
content (empty text AND no valid words), preventing spurious subtitle
entries like '00:45:33,592 --> 00:45:33,592 parakeet@rH\u000b\ufffdI'.
Applies to both AudioTranscription and AudioTranscriptionStream.
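The filter amounts to the following checks, shown here as a Python illustration of the rules listed above (the real isValidWord() lives in the Go backend; the signature below is an assumption):

```python
def is_valid_word(text: str, start: float, end: float) -> bool:
    """Reject words that look like stale or binary data rather than speech."""
    if not text.strip():
        return False  # empty or whitespace-only text
    if "\ufffd" in text:
        return False  # U+FFFD left behind by binary data scrubbing
    if start < 0 or end < 0:
        return False  # negative timestamps
    return end > start  # zero or negative duration is rejected
```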
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
fix(vllm): structured outputs silently ignored on vLLM >= 0.23
vLLM >= 0.23 removed GuidedDecodingParams (now StructuredOutputsParams) and
renamed the SamplingParams field guided_decoding -> structured_outputs. The
import failed, HAS_GUIDED_DECODING became False, and the whole guided-decoding
block was skipped, so response_format / grammar constraints were silently
ignored. Adapt the existing request.Grammar path to the new class/field.
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
Opening a model row's kebab (ActionMenu) on the Manage dashboard snapped the
page scroll to the top and rendered the menu detached from its trigger, making
it impossible to operate.
Two compounding causes:
- The menu auto-focus called el.focus() without preventScroll, so the browser
scrolled the focused element into view, yanking the page to the top.
- The position:fixed Popover was rendered inline inside the table row. The
editorial UI overhaul added hover transforms to rows/cards, and a transformed
ancestor re-anchors position:fixed to itself instead of the viewport, so the
menu (positioned from the trigger's viewport rect) landed in the wrong place.
Fix: portal the Popover to document.body so position:fixed always resolves
against the viewport, position it before paint with useLayoutEffect (no {0,0}
flash), and pass preventScroll:true to both focus calls.
Adds an e2e regression test that reproduces the symptom (scroll jumped from 564
to 0 on the old code) and asserts the menu tracks its trigger.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
refactor(config): single source of truth for default values across config + backend
Defaults were decided in two areas with duplicated/drifted literals: the config
SetDefaults tiers vs core/backend/options.go's grpcModelOpts (which translates a
ModelConfig to the backend wire format and supplied its own fallbacks). They had
drifted - n_gpu_layers 9999999 (options.go) vs 99999999 (gguf.go), two 512 batch
constants, context 1024 (gguf) vs 4096 (backend) scattered as bare literals.
Introduce core/config/defaults.go as the canonical home (DefaultContextSize=4096,
GGUFFallbackContextSize=1024, DefaultNGPULayers=99999999, DefaultFlashAttention=
auto). gguf.go / hooks_llamacpp.go use them directly; core/backend references them
(backend imports config, never the reverse) so DefaultContextSize/DefaultBatchSize
and the flash-attn / n_gpu_layers fallbacks resolve to one place. The two context
values (1024 GGUF-no-estimate vs 4096 general) are kept distinct but now named +
documented, not blind literals. Behavior-preserving; config + backend suites green.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(config): enable cross-request prefix caching for serving (Phase 2)
The llama.cpp backend ships n_cache_reuse=0 (cross-request KV prefix reuse via
shifting disabled). Enable it by default (256) so repeated prefixes - system
prompts, RAG context, agent scaffolds, multi-turn chat - aren't recomputed. This
is the universally-useful part of 'paged attention' (shared-prefix reuse, which
the upstream maintainers themselves identify as where paged attn actually helps)
and needs none of the block-KV machinery.
Lives in a serving_defaults.go sibling to hardware_defaults.go (device-driven vs
serving-policy defaults); both run from SetDefaults and only fill unset values.
Explicit cache_reuse/n_cache_reuse always wins. Device-independent, so it
propagates to distributed nodes via the model options with no router change.
Shares the backendOptionSet helper with the Phase-1 parallel default.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(config): extract generic fallback defaults into ApplyGenericDefaults
Behavior-preserving: move the inline sampling-param + runtime-flag fallbacks out
of SetDefaults into ApplyGenericDefaults, completing the domain-grouped tiers
(ApplyInferenceDefaults=family, ApplyHardwareDefaults=device, ApplyServingDefaults
=serving, ApplyGenericDefaults=generic fallbacks). SetDefaults is now a clean
orchestrator. Same order (runs after the family/hardware/serving tiers so those
win) and same conditions (TopK gated on UsesLlamaSamplerDefaults, MMap on XPU).
No behavior change; full config suite green. (NGPULayers stays in the GGUF-read
path for now - it's device-driven but coupled to model-size detection; a separate
follow-up.)
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(config): add model alias field and self-validation
Add ModelConfig.Alias (yaml: alias), IsAlias(), and an alias
short-circuit at the top of Validate() that rejects self-reference and
forbids setting backend/parameters.model on a pure-redirect alias.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(config): resolve and validate model alias targets in the loader
Assisted-by: Claude:opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(middleware): resolve model aliases and stamp requested/served identity
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(modeladmin): reject alias configs with invalid targets on create/edit
Validate alias targets at create/swap entry points (ImportModelEndpoint,
EditYAML, PatchConfig) so a dangling, chained, or disabled alias target is
rejected at save time rather than surfacing as a runtime error.
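Taken together with the self-validation added earlier in this series, the checks can be sketched as below. This is a Python illustration over plain dicts; `validate_alias`, the `disabled` key and the error strings are hypothetical, not the Go API.

```python
def validate_alias(name, cfg, all_configs):
    """Return an error string, or None when the config is acceptable."""
    target = cfg.get("alias")
    if not target:
        return None  # not an alias config
    if target == name:
        return "alias cannot reference itself"
    if cfg.get("backend") or cfg.get("parameters", {}).get("model"):
        return "alias must not set backend or parameters.model"
    target_cfg = all_configs.get(target)
    if target_cfg is None:
        return "alias target does not exist"  # dangling
    if target_cfg.get("alias"):
        return "alias target is itself an alias"  # chained
    if target_cfg.get("disabled"):
        return "alias target is disabled"
    return None
```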
Assisted-by: Claude:opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(api): add GET /api/aliases to list model aliases
Adds an admin-gated read-only endpoint that lists every model alias
config as {name, target} pairs, backed by the loader's existing
GetAllModelsConfigs().
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(mcp): add set_alias and list_aliases tools
Expose model-alias management over the LocalAI Assistant MCP surface:
list_aliases (read-only, GET /api/aliases) and set_alias (mutating).
SetAlias is swap-first: PATCH /api/models/config-json/:name swaps an
existing alias's target (validated, non-destructive) and a 404 falls
back to POST /models/import to create a fresh {name, alias} config. The
inproc client mirrors this via ConfigService.PatchConfig + a create path
modeled on ImportModelEndpoint. Deletion reuses delete_model.
Assisted-by: Claude:claude-opus-4 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* style(mcp): replace em dashes in alias tool comments
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(config-meta): expose alias as a model-select field
Add an 'alias' section to DefaultSections() and an 'alias' field override
in DefaultRegistry() so the schema-driven React editor renders the new
top-level ModelConfig.Alias field as a model picker in its own section.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): add alias template card and Manage alias badge
Add an 'Alias / Routing' template to the create-flow gallery that seeds a
minimal name + alias config, and a read-only 'alias -> target' badge on the
Manage Models tab. The capabilities row payload does not carry the alias
field, so the badge resolves targets from GET /api/aliases looked up by name.
Assisted-by: Claude:claude-opus-4 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs: document model aliases
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(swagger): regenerate for GET /api/aliases
Adds the /api/aliases path and AliasInfo schema generated from the
ListAliasesEndpoint annotation.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(localai): check os.RemoveAll error in aliases_test
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix: correct alias conversion docs and advertise /api/aliases in instructions
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(mcp): write alias config 0600 to satisfy gosec G306
The inproc createAlias path wrote the alias YAML with 0644, which gosec
flags as a new G306 finding on the PR. The LocalAI process is the sole
reader/writer of model configs, so 0600 is correct and keeps the scan clean.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(gallery): add Depth Anything V2 models + bump native version
Add Depth Anything V2 (DA2) support to the depth-anything backend. DA2 is
depth-only (no camera pose, no confidence) and ships both relative
(relative inverse depth) and metric (depth in metres) variants. The Go
backend is model-agnostic, so no backend code changes are required — only
a native version bump and new gallery entries.
- backend/go/depth-anything-cpp/Makefile: pin DEPTHANYTHING_VERSION to the
depth-anything.cpp commit that adds the DA2 engine + C-API routing
(e3dec57f13a52366bbc4f279ef44804915960a6b, kept alive by the upstream tag
da2-support so it survives a squash-merge).
- gallery/index.yaml: add 12 DA2 entries (4 base quants, small, large, plus
Hypersim indoor and VKITTI outdoor metric models in S/B/L). Metric models
carry the metric-depth tag; none carry camera-pose.
Assisted-by: Claude:claude-opus-4-8
* chore(depth-anything-cpp): pin to merged DA2 master commit
PR #1 (mudler/depth-anything.cpp) merged to master as f4e17de (squash); repoint
the pin from the pre-merge commit to the canonical master commit.
Assisted-by: Claude:claude-opus-4-8
---------
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(config): node-aware hardware defaults — larger physical batch on Blackwell
A larger physical batch (n_batch/n_ubatch) materially lifts MoE prefill on
NVIDIA Blackwell consumer GPUs (sm_120/121, incl. GB10 / DGX Spark) — measured
on a GB10 with Qwen3-Coder-30B-A3B, the prefill ceiling rises (ub512 ~2994 ->
ub2048 ~3316 t/s) and saturates around 2048.
The heuristic lives in core/config alongside the other config overriders
(ApplyInferenceDefaults, guessDefaultsFromFile/NGPULayers) — they all fill the
ModelConfig from heuristics, so hardware tuning is the same domain and stays in
one place. It is parameterized on a GPU descriptor (not direct detection) so it
works in both deployment shapes:
- Single host: SetDefaults applies it with the LocalGPU.
- Distributed: only the worker sees the GPU, so the worker reports its compute
capability on registration (gpu_compute_capability -> BackendNode), and the
router re-applies the SAME core/config heuristic for the SELECTED node before
loading — fixing the case where the frontend has no GPU at all.
Explicit `batch:` always wins (only managed default values are touched).
xsysinfo gains NVIDIAComputeCapability() (detection only); all interpretation
lives in core/config. Tests: core/config, pkg/xsysinfo, core/services/nodes.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(config): injectable local-GPU seam + single-instance coverage
Make local GPU detection an injectable package var (localGPU) so the
single-instance path (SetDefaults -> ApplyHardwareDefaults) is deterministically
testable without a real GPU, mirroring the distributed override's coverage.
Adds specs asserting SetDefaults sets the Blackwell physical batch, leaves it
unset on non-Blackwell, and never overrides an explicit batch.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(config): default concurrent serving (n_parallel) by GPU VRAM
The llama.cpp backend defaults n_parallel=1, which serializes multi-user requests
and leaves continuous batching off (it auto-enables only at n_parallel>1). Fold a
VRAM-scaled parallel-slot default into the hardware-config path so multi-user
serving works out of the box: >=32GiB->8, >=8GiB->4, >=4GiB->2, else unchanged.
With the backend's unified KV the slots SHARE the context budget, so this adds
concurrency without multiplying KV memory. Explicit parallel/n_parallel always
wins. EnsureParallelOption is shared by the single-host path (ApplyHardwareDefaults
with the local GPU) and the distributed router (per selected node's reported VRAM,
since the frontend may have no GPU). LocalGPU now also reports VRAM.
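The VRAM tiers above reduce to a small lookup. A Python illustration follows; `default_parallel` is a hypothetical name, and returning None to mean "leave the backend default" is an assumption of this sketch.

```python
GIB = 1024 ** 3


def default_parallel(vram_bytes, explicit=None):
    """Pick a parallel-slot default from VRAM; an explicit value always wins."""
    if explicit is not None:
        return explicit
    if vram_bytes >= 32 * GIB:
        return 8
    if vram_bytes >= 8 * GIB:
        return 4
    if vram_bytes >= 4 * GIB:
        return 2
    return None  # below 4 GiB: leave the backend default unchanged
```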
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Add apex-quant (MoE per-tensor/per-layer quantization recipe) to the
"Backends built by us" section as a note after the engines table, since
it is a quantization recipe rather than a native inference engine.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ⬆️ Update ggml-org/llama.cpp
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(llama-cpp): adapt grpc-server to upstream server-schema split
Upstream llama.cpp (e475fa2) extracted the JSON request-schema evaluation
out of the static server_task::params_from_json_cmpl into the new
server_schema::eval_llama_cmpl_schema (tools/server/server-schema.cpp).
The grpc-server unity build still called the old static member, breaking
every llama-cpp backend build with "no member named 'params_from_json_cmpl'
in 'server_task'".
Pull server-schema.cpp into the translation unit and call the new function,
keeping both guarded by __has_include so forks that predate the split (e.g.
llama-cpp-turboquant, which still exposes params_from_json_cmpl) keep
compiling against the old static member.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(realtime): raise WebRTC data-channel max-message-size for large events
Browsers advertise a conservative SCTP max-message-size in their SDP offer
(Chrome uses 256 KiB). pion enforces the remote's advertised value on send, so
a single realtime event larger than it cannot be sent over the "oai-events"
data channel: SendText fails, the event is dropped, and the turn silently
yields no response. Some turns legitimately produce a >256 KiB JSON event —
notably tool calls with sizeable schemas or results.
Browsers advertise the value conservatively but their SCTP stacks reassemble
much larger messages, so raise the max-message-size honored for our own
server-generated events by rewriting the attribute in the offer before
SetRemoteDescription.
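The attribute rewrite can be sketched as a text substitution on the offer SDP. This Python illustration is not the Go code; the helper name is hypothetical, and leaving a value of 0 (which signals "any size") or an already larger value untouched is a choice made for this sketch.

```python
import re


def raise_max_message_size(sdp: str, new_size: int) -> str:
    """Raise a=max-message-size in an SDP blob without ever lowering it."""

    def repl(match):
        current = int(match.group(1))
        if current == 0 or current >= new_size:
            return match.group(0)  # 0 means unlimited; keep larger values
        return f"a=max-message-size:{new_size}"

    return re.sub(r"a=max-message-size:(\d+)", repl, sdp)
```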
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(realtime): keep the WebRTC sendLoop alive when one event send fails
A failed SendText on the oai-events data channel exited the sender goroutine,
so a single dropped event (e.g. one over the negotiated SCTP max-message-size)
tore down the session and silently dropped every subsequent event. Log and skip
the offending event instead and keep draining; a genuinely dead transport is
still handled by the closed / connection-state path.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(downloader): stall timeout, resume-safe cancel, and stale-partial reaping
Large model installs would hang forever or never finish. Three defects in
the HTTP download path, all hit by big GGUF pulls over a slow or flaky link:
1. No stall timeout. The shared download client sets no body deadline
(correct for streaming) but also no read-idle timeout, and the
transport's IdleConnTimeout does not cover an in-flight body read. A
silently-dropped TCP connection (no FIN/RST) blocked the body Read
forever, freezing an install at N bytes until an external reaper killed
it. Add an idle-timeout reader that closes the body after a window of
zero progress (DownloadStallTimeout, default 60s), turning an indefinite
hang into a fast, retryable error. A read that returns data resets the
clock, so a slow-but-steady transfer is unaffected.
2. Cancellation deleted the partial. On context.Canceled the code removed
the .partial file, so any frontend restart (deploy, OOM) mid-download
wiped all progress and the retry restarted from zero. At slow egress,
files larger than the restart interval never completed. Keep the
.partial on cancel so the next attempt resumes via Range.
3. Partials leaked. Cleanup only ran on the context-cancel path, never on a
stall or a SIGKILL/OOM, so abandoned .partial files accumulated and could
fill the models volume. Add CleanupStalePartialFiles and reap partials
older than 24h on startup.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(downloader): discard the .partial on a deliberate user cancel
Review follow-up. The previous commit kept the .partial on every cancellation
so restarts could resume, but that also left a dangling partial when a user
*intentionally* cancelled an install — the file lingered until the 24h reaper.
Distinguish the two: cancel the gallery operation's context with a cause
(downloader.ErrUserCancelled) so the download layer can tell a deliberate
abort (discard the partial) from an incidental one such as a shutdown/restart
(keep it for resume). Detect cancellation via the context rather than the
returned error, because an HTTP request cancelled with a cause surfaces the
cause error, not context.Canceled.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(downloader): resolve gosec G122 in CleanupStalePartialFiles
CI's code-scanning (gosec) flagged G122 (symlink TOCTOU) for the os.Remove
call inside the filepath.WalkDir callback. Collect the stale paths during the
walk and delete them afterwards instead of mutating the tree from inside the
callback. Behavior is unchanged; the existing specs still pass.
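The collect-then-delete shape looks roughly like this. It is a Python illustration, not CleanupStalePartialFiles itself; judging age by file modification time and the function signature are assumptions.

```python
import os
import time


def cleanup_stale_partials(root, max_age_s=24 * 3600, now=None):
    """Delete .partial files older than max_age_s; return the removed paths."""
    now = time.time() if now is None else now
    stale = []
    # Pass 1: only collect paths, so the tree is not mutated during the walk.
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            if fname.endswith(".partial") and now - os.path.getmtime(path) > max_age_s:
                stale.append(path)
    # Pass 2: delete after the walk has finished.
    for path in stale:
        try:
            os.remove(path)
        except OSError:
            pass  # already gone or not removable; skip it
    return stale
```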
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(crispasr): add word-level timestamp support
Add word-level timestamp extraction to the crispasr backend by calling
the CrispASR C library's word accessor functions that are already
exported by libgocraspasr but were not previously bound by the Go
wrapper.
Two families of word functions are supported:
1. Session-based (get_word_count/text/t0/t1) — works per-segment for
whisper-like backends.
2. Parakeet-specific (get_parakeet_word_count/text/t0/t1) — returns a
global word list for TDT/CTC/RNNT parakeet models where the session
API does not expose per-segment word data.
The Go code tries session-based first and falls back to parakeet-specific
when the session word count is zero.
Depends on #10402 (grpc server Words forwarding) for the words to reach
the HTTP response.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
* fix(crispasr): use portable sed -i.bak for macOS compatibility
BSD sed requires -i '' for in-place editing while GNU sed uses -i.
Replace with -i.bak which works on both platforms, then remove the
backup file.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
---------
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
Vulkan backends bundled their own loader and ICD manifests but neither the
Mesa driver the manifests point at nor a way to make the loader find them,
so on a runtime base image without Mesa the loader enumerated zero devices
and the GPU silently fell back to CPU (only NVIDIA worked, since its ICD is
injected by the container toolkit).
- scripts/build/package-gpu-libs.sh: for each installed ICD manifest, bundle
the driver .so its library_path names — no hard-coded, platform-dependent
soname list — plus that driver's ldd dependencies, skipping manifests whose
driver isn't installed. Rewrite each library_path to a bare soname so the
bundled driver resolves via the LD_LIBRARY_PATH run.sh already sets.
- .docker/install-base-deps.sh, backend/Dockerfile.golang,
backend/Dockerfile.python: install mesa-vulkan-drivers in every Vulkan
builder so the driver + manifests exist to be packaged (the LunarG SDK
ships only the loader and shader tooling).
- pkg/model/process.go: when a backend ships vulkan/icd.d/, point the loader
at it via VK_DRIVER_FILES/VK_ICD_FILENAMES at launch (no-op otherwise).
Covered by pkg/model/process_vulkan_test.go.
- backend/go/parakeet-cpp/package.sh: complete the L0 stub (was missing the
libc-family ldd walk + GPU-lib packaging) by mirroring whisper, so the
vulkan-parakeet image actually bundles its GPU runtime.
Assisted-by: Claude Code:claude-opus-4-8
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* Use inference defaults in repo src rather than fetching
The repo already contains inference_defaults.json, which is regularly
updated by GitHub Actions, so use it directly. This also avoids hash
mismatch errors in the flake.
Signed-off-by: Souheab <souheab@protonmail.com>
* Update vendor hash
Signed-off-by: Souheab <souheab@protonmail.com>
* Create react-ui derivation as it is required for go build
Signed-off-by: Souheab <souheab@protonmail.com>
* Add FHS env wrapper to make #!/bin/bash scripts work
Signed-off-by: Souheab <souheab@protonmail.com>
* Use pkgs.importNpmLock to deal with npm dependencies instead of using npmDepsHash
Signed-off-by: Souheab <souheab@protonmail.com>
---------
Signed-off-by: Souheab <souheab@protonmail.com>
The gRPC server wrapper in pkg/grpc/server.go reconstructs
TranscriptSegment messages when relaying AudioTranscription results
from backends. The Words field was not being copied, causing all
word-level timestamps to be silently dropped regardless of backend
support.
This was introduced when PR #9621 added the TranscriptWord proto
message and transcriptResultFromProto (server-side), but did not
update the server-side gRPC relay to forward the new field.
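The class of bug is easy to reproduce with plain structs. In this sketch Segment and Word are hypothetical stand-ins for the generated TranscriptSegment and TranscriptWord types, and relaySegment is an illustrative name for the field-by-field rebuild:

```go
package main

import "fmt"

// Word and Segment are simplified stand-ins for the proto messages.
type Word struct {
	Text       string
	Start, End int64
}

type Segment struct {
	Text  string
	Words []Word
}

// relaySegment rebuilds a segment field by field, as a relay layer would.
// Any field not copied here is silently dropped, which is why Words must
// be forwarded explicitly.
func relaySegment(in Segment) Segment {
	out := Segment{Text: in.Text}
	out.Words = append([]Word(nil), in.Words...) // forward word timestamps
	return out
}

func main() {
	in := Segment{Text: "hi there", Words: []Word{{"hi", 0, 20}, {"there", 20, 55}}}
	fmt.Println(len(relaySegment(in).Words))
}
```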
Fixes #9306
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
* feat(ui): legible Usage charts - distinct prompt/completion hues + chart a11y
Prompt and completion were the same color (primary at 0.35 opacity), so the
stacked token charts read as one blurry blob. Completion now uses a distinct
data-viz hue (--color-data-3) at full opacity across the time chart, the
per-model distribution bars, and the tooltip. The source-mix chart is no longer
aria-hidden: it exposes role="img" with a label.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): sortable Users table
The admin Users table is now sortable by name, email, provider, role, status,
and created date - clickable headers with an aria-sort state, a direction
caret, and keyboard activation (Enter/Space). Permissions and Actions stay
non-sortable.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): unsaved-changes guard on Settings and Agent create/edit
Add a reusable UnsavedChangesGuard (router useBlocker + beforeunload) that
prompts before navigating away or closing the tab with unsaved edits. Wired to
Settings (existing isDirty) and AgentCreate (snapshot the loaded form, compare;
suppressed while saving so the post-save redirect is not blocked). Adds the
common.unsaved i18n keys.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): sortable Traces tables
Both trace tables are now sortable: the API table by method/path/status and the
backend table by type/time/model/duration, with aria-sort, a direction caret,
and keyboard activation. Sort and the expanded row reset when switching tabs
(the two tables have different columns).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): responsive table reflow (cards on mobile), applied to Users
Dense admin tables sideways-scroll on phones. Add a reusable ResponsiveTable
that mirrors the <thead> labels onto each body cell (data-label) and a
<=640px stylesheet that stacks rows into label/value cards. Wired to both
Users tables; reusable for the other dense tables next.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): roll responsive table reflow to Traces, Models, Manage, Nodes
Apply ResponsiveTable to the remaining dense tables so they stack into
label/value cards on phones instead of scrolling sideways. Harden the
component for these tables: scope label-mirroring and the card CSS to direct
children (nested detail tables render normally), override inline min-width on
mobile, and pass through table/container inline styles. Nested expansion
tables in Nodes/Models/Manage are intentionally left as-is.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): unsaved-changes guard on the Fine-Tuning form
Editing the long fine-tune job form and navigating away silently discarded
everything. Snapshot the assembled getFormConfig() as a baseline, treat the
open form as dirty when it diverges, and reuse UnsavedChangesGuard to prompt
before leaving. The baseline is rebased after a job is submitted so leaving
afterward does not warn.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): add Fraunces variable serif + --font-serif token
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): serif display tier + section-heading typography scale
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): un-overload accent — nav rail, stronger focus ring, neutral hover
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): orchestrated page reveal + stagger motion primitives
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(ui): fix dead token refs + dedupe toggle to one primitive
Migrate all .toggle-slider consumers (Users, Chat, AgentChat) to the
canonical BEM toggle primitive and delete the legacy duplicate CSS block.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(ui): route boot fallback through the LoadingSpinner primitive
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): EmptyState primitive with serif title
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): Skeleton shimmer primitive
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): PageHeader + SectionHeading editorial primitives
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): StatusPill primitive + time-of-day greeting helper
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): Home editorial header + status line (north-star redesign)
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): Home loaded-models skeleton list, button hierarchy, EmptyState wizard
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ui): single focus ring (no double-ring) + neutralize stagger delay under reduced motion
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* refactor(ui): all-sans editorial headings + tint-only active nav
Per design review, pivot the heading strategy from hybrid-serif to a
refined grotesk: drop the Fraunces dependency, token, and import; page
titles, the Home greeting, and section/empty-state titles now use Geist
at semibold with the editorial fluid sizing and tight tracking. No serif
anywhere.
Active sidebar item is now a tint-only treatment (accent text + tinted
background); the left accent rail is removed and the shared base
.nav-item.active inset bar is suppressed in the sidebar (as the console
rail already does). Update the design-system e2e specs to assert the
sans display font and the tinted-background active state.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(e2e): add --host flag to ui-test-server
Allow binding the e2e/preview server to an arbitrary address (e.g.
0.0.0.0 to review the UI from another device on the LAN). Defaults to
127.0.0.1 so existing e2e behavior is unchanged.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(ui): declutter Home - discoverable + dismissable API, vertical balance
Home felt overloaded and top-heavy. Three changes from review:
- The API endpoint catalog (12 endpoints) is collapsed by default behind a
"Browse the API" disclosure; only the base URL + copy stay visible, so the
catalog is discoverable without dominating the page.
- The whole connect card is dismissable (x): dismissing unmounts it so the
vertical space is recovered, and the choice is remembered (localStorage).
- .home-page now fills its column and vertically centers its content when
there is slack, so sparse states (no models / card dismissed) read as a
balanced launcher instead of content jammed at the top. Overflow-safe:
tall content flows from the top and scrolls.
Adds connect.browse / connect.hide / connect.dismiss i18n keys to all locales.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): editorial PageHeader with section eyebrow + scroll-to-top on nav
PageHeader now derives its eyebrow from the route's section/console (Build /
Operate / Create) via sectionKeyForPath, so pages get a consistent, meaningful
eyebrow with no per-page wiring (override with the eyebrow prop, suppress with
eyebrow={null}). Settings adopts it as the first consumer.
Also fix a navigation scroll bug: the default layout uses the document as its
scroll container and route changes did not reset it, so navigating the console
rail from a scrolled page landed mid-view. App now scrolls to top on pathname
change.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(ui): adopt PageHeader on agent/media/import/backend pages (batch A)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* refactor(ui): adopt PageHeader on ops/admin/media pages (batch B)
Replace hand-rolled .page-header title blocks with the shared editorial
PageHeader component across 14 pages (Manage, Middleware, Models,
NodeBackendLogs, Nodes, P2P, SkillEdit, Skills, Sound, Traces, TTS, Usage,
Users, VideoGen). Title/subtitle move into PageHeader; header-own action
clusters (Models stats+buttons, Skills search+buttons) move into the actions
slot. Tabs, filters, stat cards, ResourceMonitor and page body stay as
siblings. Eyebrow is left to auto-derive from the route.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(ui): home greeting asserts sans font, not the dropped serif
The greeting render-smoke still asserted Fraunces; update it to assert the
Geist sans display font (and not Fraunces), matching the all-sans direction.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): ThemeToggle i18n + animated icon, drop transition:all
The theme toggle hard-coded its English tooltip; route it through the existing
nav switchToLightMode/switchToDarkMode keys and add an aria-label. The sun/moon
icon now replays a small rotate+fade on theme change (keyed remount; honored by
the global reduced-motion block). Replace the .theme-toggle `transition: all`
with explicit properties.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): canvas drag-to-resize + slide-in, fix hooks order, typed download
Canvas was a fixed pane; make it a workbench:
- Drag the panel's left edge to resize (clamped 360px..75vw), persisted to
localStorage, double-click to reset; hidden and full-width on narrow screens.
- Slide-in/fade on open via canvasSlideIn (honored by reduced-motion).
- Fix a rules-of-hooks bug: the `if (!current) return null` early return sat
above useEffect, so the hook count changed when artifacts emptied. All hooks
now run unconditionally before the guard.
- Downloads use the artifact language's real extension + MIME (a Python
artifact saves as .py, not .txt) via extensionForLanguage.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): per-message code blocks get a language header + copy button
Chat code blocks now render inside a framed block with a header showing the
language and a copy button (delegated handler, copies the block and flips to a
check briefly). Decoration + highlighting run from a MutationObserver scoped to
the messages container, which fires reliably for streamed responses AND for
chats loaded/switched from storage - the prior render-keyed effect missed the
load path (code was left unhighlighted on reload). The observer disconnects
while mutating so it does not retrigger on its own edits.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): image attachments show a thumbnail in the composer
Staged image attachments now preview as a 28px thumbnail (from their data URL)
instead of a bare file icon; other types keep the icon. File names truncate and
the remove button gets an aria-label.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): jump-to-latest pill when scrolled up in chat
When the user scrolls away from the bottom of a conversation, a floating
"Jump to latest" pill appears (sticky, centered above the composer); clicking
it smooth-scrolls to the newest message and re-pins auto-scroll. Resets on
chat switch. Adds the chat.actions.jumpToLatest i18n key to all locales.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): canvas fullscreen toggle + keyboard tab navigation
The canvas header gains a fullscreen toggle (expands the panel to cover the
viewport; resize handle hidden while fullscreen). The artifact tab strip is now
a proper ARIA tablist with roving tabindex and Left/Right arrow-key navigation.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): image result lightbox (zoom, prev/next, download, keyboard)
Generated/history images on the Image page are now clickable, opening a
fullscreen Lightbox with a download button, prev/next navigation, an N/M
counter, and keyboard control (Esc to close, Left/Right to navigate). Adds a
reusable `Lightbox` component (usable later for Video) and the
media.image.actions.view i18n key.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): generation progress with placeholder tiles + elapsed timer
Image generation replaces the bare spinner with a GenerationProgress scaffold:
shimmer placeholder tiles matching the requested count plus a live elapsed-time
readout, so the (often slow) wait feels accountable. Reusable for the other
media generation pages.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): generation progress on Video, TTS, and Sound pages
Reuse GenerationProgress (placeholder tile + elapsed timer) in place of the
bare spinner on the remaining media generation pages, so every slow generation
gives the same accountable feedback.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): agent chat gets per-message code-copy + reliable highlighting
AgentChat now shares Chat's code-block treatment: it runs highlightAll +
enhanceCodeBlocks from a MutationObserver on its messages container (the same
proven path), so agent responses get language headers, copy buttons, and
highlighting that fires for both streamed and loaded messages - closing the
divergence with the main chat without a large refactor.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): Talk voice visualizer
Add a hero frequency-bar visualizer at the top of the Talk page so users get
ambient feedback that they are heard and that the assistant is speaking - the
audit's main Talk gap (the only prior feedback was a small status pill; the
waveform was buried in the dev diagnostics panel).
VoiceVisualizer is self-contained: it builds its own AudioContext + analysers
from the output <audio> stream (speaking) and the mic stream (listening) so it
does not touch the existing WebRTC/diagnostics graph. Bars are status-tinted
(idle/connected/listening/speaking/error) and animate with a gentle idle wave
when not connected. Live mic/output animation is exercised on a real session.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
chore: bump localrecall to include PostgreSQL table-name sanitization fix
Pulls mudler/localrecall#48, which makes sanitizeTableName allowlist valid
identifier characters so collection names containing ':' (e.g. the per-user
"legacy-api-key:<agent>" namespace) no longer break PostgreSQL CREATE TABLE
with "syntax error at or near ':'".
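An allowlist sanitizer of this kind might look like the sketch below. It is not the localrecall implementation: allowlistIdent is a hypothetical name, and replacing rejected characters with '_' (and prefixing names that start with a digit) are assumptions made for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// allowlistIdent keeps only ASCII letters, digits and '_' and replaces every
// other character with '_', so the result is safe as an unquoted identifier.
func allowlistIdent(name string) string {
	var b strings.Builder
	for _, r := range name {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9', r == '_':
			b.WriteRune(r)
		default:
			b.WriteByte('_')
		}
	}
	s := b.String()
	// Unquoted identifiers cannot be empty or start with a digit.
	if s == "" || (s[0] >= '0' && s[0] <= '9') {
		s = "_" + s
	}
	return s
}

func main() {
	fmt.Println(allowlistIdent("legacy-api-key:agent")) // legacy_api_key_agent
}
```

An allowlist fails closed: a character nobody anticipated is rewritten rather than passed through to CREATE TABLE.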
Fixes #10375
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
docs: document the privacy-filter.cpp backend in README and compatibility table
The privacy-filter.cpp backend (#10360) was registered in backend/index.yaml
and referenced from the PII feature docs, but was missing from the backend
catalog surfaces. Add it to the README "Backends built by us" table, the
compatibility table (Utilities & Other, CPU/CUDA 13/Vulkan), and the backend
type list in the backends feature doc.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Squashed feat/pii-ner-tier-engine rebased onto master (was 45 commits; see
backup/pii-ner-tier-engine-prerebase). Net change:
- privacy-filter.cpp: standalone GGML engine for the openai-privacy-filter
PII/NER token classifier, wired as a LocalAI gRPC backend (CPU/CUDA/Vulkan).
TokenClassify moves off the patched llama.cpp path onto this backend.
- PII filter reworked to be NER-centric (encoder/NER detection tier scanning
whole conversations as one document), with a recreated bounded restricted-
regex secret-matching pattern detector tier alongside it (per-model
pii_detection.builtins / .patterns + core/services/routing/piipattern).
- Detection labelled by source (ner vs pattern); backend trace / confidence /
debug observability; analyze/redact exposed as a synchronous API.
- Instance-wide default detector policy + per-usecase default-on; request
filtering extended to completions, embeddings, edits & Ollama.
- React UI: NER-centric PII editor, detector-models table, pattern/builtins
editor, middleware default-policy UI.
- Gallery: privacy-filter-multilingual token-classify model + NER install
filter; token_classify known_usecase; batch sized to context for NER models.
privacy-filter backend registered in the backend gallery (cpu/vulkan/cuda-13
meta + image entries with a capabilities map) matching its CI matrix jobs,
and an /import-model auto-detect importer (PrivacyFilterImporter, narrow
privacy-filter GGUF detection) replacing the prior pref-only registration.
Reconciled against master's independent evolution:
- Dropped master's PIIPatternOverrides feature (global-pattern runtime
overrides + /api/pii/patterns API + runtime_settings.json persistence). The
per-model NER + pattern-detector design supersedes it; it was built on the
global redactor pattern set this branch replaced.
- Reverted the llama.cpp Score carry-patch (0006-server-task-type-score):
removed the patch and restored master's grpc-server.cpp Score RPC (direct
llama_decode, slot-loop bypass) and LLAMA_VERSION pin, plus master's
model_config validation forbidding score + chat/completion/embeddings on
llama-cpp. token_classify is unaffected (it runs on the privacy-filter
backend, not llama-cpp).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The sidebar collapse toggle silently no-op'd in dev builds. toggleCollapse
ran its side effects (localStorage write + sidebar-collapse dispatch) inside
the setCollapsed updater. StrictMode double-invokes updaters in dev to surface
impurity, and the synchronous dispatch re-entered setState from the
App/Sidebar listeners mid-update, so the toggle never committed. Production
builds don't double-invoke, which is why only the dev server was affected.
Compute next from current state and move the persist + broadcast into the
handler body so the updater is pure.
Also fix the Talk page anchoring to the transcript box on load. The transcript
is its own overflow container, but scrollIntoView bubbles to every scrollable
ancestor including the window, yanking the whole page down on mount. Scroll
the transcript container directly instead.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): restructure sidebar into Create/Recognition/Build tiers
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ui): preserve exact sidebar gating for agent items and fine-tune/quantize
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* i18n(ui): add nav tier + console keys to all locales
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): add grouped admin console via pathless layout route
Wrap the existing admin pages in a pathless AdminConsoleLayout route so
they keep their exact flat URLs while gaining a grouped left rail
(Inference / Cluster / Observability / Access / System). Rail item gating
mirrors the sidebar (adminOnly / authOnly / feature + /api/features). The
layout forwards the App-level outlet context (addToast) to the wrapped
pages, which read it via useOutletContext().
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): fold Audio Transform into Studio as a tab
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(ui): update e2e specs for tiered nav + admin console
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ui): gate embedded Studio transform view on audio_transform feature
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): visual polish + console-ize Build/Recognition tiers
Generalize the one-off admin console into a reusable ConsoleLayout driven by
a shared consoleConfig (single source of truth for the rail, its gating, and
the sidebar entry that opens it — removes the prior rail/sidebar drift).
- Promote Install Models to the top menu next to Home.
- Build and Operate are now console tiers (secondary rail); Create stays inline.
- Fold Recognition (Faces/Voices) into the Build console as a group alongside
Automation and Training so it no longer feels split off.
- Style the console rail as a panel (header, grouped dividers, rounded active
pills) with a hover nudge; sidebar items become inset rounded pills. The rail
slide-in plays only when entering a console, not on item-to-item sub-nav
(which remounts the layout), so switching no longer flashes the menu. All
token-based (light + dark), respects reduced-motion.
- Add a delayed RouteFallback loader so lazy routes no longer flash blank;
scoped inside ConsoleLayout so the rail stays put while the body loads.
- Update e2e specs for the new structure (.console-* classes, console entries).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): persist console layout across sub-nav + add drop-in endpoint section
- Keep the page-transition key stable within a console (derived from the
shared console config) so the ConsoleLayout and its rail persist across
item-to-item navigation instead of remounting — fixes the submenu flash.
Cache /api/features across mounts and play the rail entrance animation only
when actually entering a console.
- Add a "One endpoint, every API" section to Home: leads with LocalAI's own
native API (images, video, realtime voice over WebRTC/WS, depth, object
detection, rerank, audio/TTS, face & voice recognition) plus a Full API
reference link, then the drop-in compatibility layer (OpenAI, Anthropic,
Ollama, OpenAI Responses) with the live copyable base URL. All 7 locales.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ui): revert Middleware nav label rename (keep Middleware in all locales)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Bring the Backend & Model Compatibility Table up to the full set of
backends published in backend/index.yaml (60+), organized by modality
with per-backend acceleration targets. Add an "Available Backends"
pointer and expand the backend-type list in the backends feature doc.
Update the README backend count to 60+ and add a "Backends built by us"
section listing the native C/C++/GGML engines maintained by the LocalAI
project (parakeet.cpp, voxtral.c, vibevoice.cpp, rf-detr.cpp,
locate-anything.cpp, depth-anything.cpp, LocalVQE, local-store).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
feat(ds4): wire SSD streaming + quality engine options, add 128GB DeepSeek gallery models
The ds4 backend zero-initialized ds4_engine_options and exposed none of the
engine's tunable knobs, so SSD streaming (run a model larger than RAM by
streaming routed MoE experts from the GGUF on SSD) and the quality/perf knobs
were unreachable from LocalAI model YAMLs.
Map ModelOptions.Options onto ds4_engine_options through a declarative table
(kEngineOptSpecs + apply_engine_option) instead of per-field branches: the
struct is fixed C with no reflection, so the field set is enumerated once and a
future knob is a one-line table row. Two fields use ds4's own typed parsers
(GiB budgets, cache-experts count-or-NGB). Bare flags (e.g. "ssd_streaming")
mean true; path-type options (mtp_path, expert_profile_path,
directional_steering_file) resolve relative to the model directory so a gallery
entry can reference a companion file by bare filename. mtp_draft/mtp_margin are
now validated rather than parsed with throwing std::stoi/std::stof.
Add gallery entries for the 128 GB class:
- deepseek-v4-flash-q2-q4 (~91 GB, mixed q2/q4, fits RAM, higher quality)
- deepseek-v4-flash-q4-ssd (~153 GB full 4-bit, runs on 128 GB via SSD streaming)
- deepseek-v4-flash-q2-mtp (~81 GB + MTP speculative draft weights)
- deepseek-v4-pro-q2-ssd (~433 GB Pro, experimental SSD streaming)
SSD streaming is Metal (Darwin) only; the options are inert on CUDA/CPU.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>