* fix(pii): post-merge review fixes + live NER e2e for the privacy-filter tier
Follow-up to the NER tier engine (#10360), already on master. This carries
only the incremental review fixes and tests that postdate that merge — the
feature itself is not re-introduced.
Review fixes:
- openai_completion.go: remove the dead `elem >= 0` conjunct in applyAnyText
(the `elem < 0` guard above already returns).
- application.go: collapse ResolvePIIPolicy's inline re-implementation of
PIIIsEnabled to a single cfg.PIIIsEnabled() call (sole source of the
"explicit pii.enabled wins, else cloud-proxy default" rule) and return true
past the !enabled guard where it is provable.
- pattern.go: hoist the triple `appConfig != nil && EnableTracing` check in
patternDetector.Detect into one local.
- grammar.go: MaxQuantifier was 4096, but Go's regexp/syntax rejects repeat
bounds above 1000 at Parse time, so walk()'s {n,m} guard could never fire —
dead code shadowed by the parser. Lower it to 512 so a bound in (512,1000]
is rejected here with an actionable error; >1000 still fails closed via
Parse. Specs pin the relationship so the guard can't silently revert.
- PatternListEditor.jsx: clamp a directly-typed negative min_len to >=0 and
force the DOM value back when clamping (min={0} only constrained the spinner,
so a negative reached saved config and silently disabled the length filter).
Tests:
- piipattern_test.go: MaxQuantifier guard specs (must stay live, not dead).
- model-config.spec.js: assert the min_len clamp, and that entity_actions
collapses a duplicate group to a single row (map semantics; regression guard
against emitting an array that drops a row on save).
- tests/e2e-backends: token_classify capability driving the TokenClassify gRPC
RPC against the backend image, asserting byte-correct, UTF-8 rune-aligned
spans (entity.Text == text[start:end]) at threshold 0. Verified on CPU via
`make test-extra-backend-privacy-filter` (3/3 specs).
- Makefile: test-extra-backend-privacy-filter wrapper.
- tests/e2e: e2e_pii_ner_test.go drives /api/pii/analyze + /api/pii/redact
(mask + block) through the full HTTP -> detector -> redactor path; gated on
PII_NER_MODEL_GGUF so the default suite is unaffected.
- .github/workflows/tests-pii-ner-e2e.yml: path-filtered / nightly CI job
running the container harness on CPU.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(gallery): add privacy-filter-nemotron (f16 + q8)
GGUF conversions of OpenMed/privacy-filter-nemotron — a fine-grained English
PII token-classifier (55 categories / 221 BIOES classes), fine-tuned from
openai/privacy-filter on NVIDIA's Nemotron-PII dataset. Sibling to the existing
privacy-filter-multilingual entry, trading language breadth for category depth.
- privacy-filter-nemotron: F16 reference artifact (~2.8 GB).
- privacy-filter-nemotron-q8: Q8_0 quant (~1.64 GB) for RAM-constrained / edge
use; description notes the size/speed tradeoff and to validate on your own
data (a single dropped span is a PII leak).
Both run on the privacy-filter backend with known_usecases [token_classify] and
a default mask policy (min_score 0.5); operators add per-category entity_actions
as needed. sha256s taken from the HF repo's LFS object ids.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
fix(diffusers): pin diffusers and transformers to a known-good pair
The diffusers backend tracked git+https://github.com/huggingface/diffusers
(main) with an unpinned transformers. transformers v5 restructured
CLIPTextModel and removed the .text_model attribute that diffusers' single
-file loader reads, so loading any single-file Stable Diffusion checkpoint
fails:
create_diffusers_clip_model_from_ldm (single_file_utils.py)
position_embedding_dim = model.text_model.embeddings.position_embedding...
AttributeError: 'CLIPTextModel' object has no attribute 'text_model'
No released diffusers (<=0.38.0) supports transformers v5 - only unreleased
diffusers main does. Because the requirements tracked main plus an unpinned
transformers, every backend image froze whichever pair existed at build
time, and images built once transformers v5 shipped but before diffusers
main caught up are permanently broken.
Pin the last known-good released pair across all requirements files:
diffusers==0.38.0 and transformers==4.57.6. 0.38.0 still exposes every
pipeline backend.py imports (Flux, Wan, Sana, LTX2, Qwen, GGUF), so no
functionality is lost, and builds become reproducible instead of drifting
into the broken window.
Fixes#9979
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
File-staging progress lived only in the SmartRouter's in-memory
StagingTracker on the replica performing the transfer. In a multi-replica
deployment behind a round-robin load balancer, a /api/operations poll
that lands on any other replica saw no staging row, so the progress
("processing file ... Total ... Current ...") flickered in and out as
polls rotated between frontends.
Mirror the pattern already used for gallery-install progress: the origin
replica broadcasts staging ticks over NATS (SubjectStagingProgress, a
new staging.<model>.progress subject), and peers merge them via
ApplyRemote (SubscribeBroadcasts on the wildcard). Byte-level ticks are
leading-edge debounced (~1/s); Start/FileComplete/Complete always
publish. A locally-owned op stays authoritative so the origin's own echo
and stray peer events can't clobber it, and mirrored remote ops expire
after a TTL so a missed Done event can't leave a phantom row. The UI read
path (StagingTracker.GetAll) is unchanged.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
A model not yet loaded on a worker is staged lazily on the inference
request path. Staging a multi-GB model takes minutes - far longer than
any client keeps its HTTP request open - so a browser refresh, an
ingress/LB idle-timeout, or a round-robined retry landing on another
frontend replica cancels the request context and aborts the upload with
"context canceled" mid-transfer. Large models then never finish staging,
so they never load (observed in a 2-replica deployment: both frontends
repeatedly failed to stage a 15.7 GB GGUF, each attempt dying at a
different offset).
Bind the cold load (staging + LoadModel + the per-model advisory lock) to
context.WithoutCancel(ctx): it keeps the request's values (prefix chain)
but drops cancellation/deadline. Each long step keeps its own bound (the
file stager's resume budget, LoadModel's 5m timeout), and the advisory
lock still de-dupes concurrent loaders across replicas.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
LocalAI pulls models from OCI registries (via go-containerregistry), the
Ollama registry, and OCI blob stores (via oras), but every request went
out with the underlying library's generic User-Agent, so registry
operators had no way to attribute traffic to LocalAI.
Add an oci.UserAgent() helper that returns "LocalAI" (or
"LocalAI/<version>" when the binary is built with a version stamp via
internal.Version) and wire it into all three pull paths:
- pkg/oci/image.go: remote.WithUserAgent on the go-containerregistry
image and digest requests
- pkg/oci/ollama.go: a User-Agent header on the Ollama manifest request
- pkg/oci/blob.go: a LocalAI User-Agent on the oras blob client. This
mirrors oras' auth.DefaultClient (same retry.DefaultClient policy);
only the advertised User-Agent changes.
Implements #6258.
Assisted-by: Claude:claude-opus-4-8 golangci-lint
Signed-off-by: Vijay Sai <vijaysaijnv@gmail.com>
* feat(ced): sketch sound-classification backend (CED audio tagger)
Wires ced.cpp (CED, 527-class AudioSet sound-event tagger; baby cry,
footsteps, glass, alarms, dog bark) into LocalAI as a Go/purego backend.
SKETCH (backend skeleton real; core REST wiring + CI/gallery is a checklist
in DESIGN.md):
- backend/backend.proto: new SoundDetection rpc + SoundClass messages
(run `make protogen-go` to regenerate pkg/grpc/proto).
- backend/go/ced: main.go (purego dlopen libced.so + ced_capi.h),
goced.go (Ced gRPC backend: Load + SoundDetection), Makefile
(clone-at-pin CED_VERSION, ggml static-PIC shared build), run.sh,
package.sh, .gitignore.
- DESIGN.md: REST /v1/audio/classification wiring (handler/route/capability
registration checklist), gallery/index + CI registration, and a scoping
note for the realtime/websocket live-recognition path (sliding-window
classify over the existing ws transport + voicegate; the ced C-API
per-PCM entry point is already window-friendly).
Backend code does not compile until protogen-go regenerates the pb types
and a libced.so is built (Makefile clones+builds it).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): REST /v1/audio/classification endpoint + capability registration
Wires the ced sound-event classification backend (AudioSet audio tagger)
end to end through the REST surface, mirroring the transcription path.
- Handler: core/http/endpoints/openai/sound_classification.go parses the
multipart audio upload, temp-files it, resolves the model config and
calls the SoundDetection RPC; returns {model, detections[]} JSON.
- Backend wrapper: core/backend/sound_classification.go (ModelSoundDetection)
loads the model and normalizes the proto response into schema types.
- Schema: core/schema/sound_classification.go (SoundClassificationResult).
- gRPC layer: SoundDetection wired through the LocalAI wrapper (interface,
Backend client, Client, embed, server, base default) so the loader-typed
client exposes the RPC; proto regenerated via make protogen-go.
- Route: POST /v1/audio/classification (+ /audio/classification alias) with
the audio/multipart default-model middleware in routes/openai.go.
- Capability surfaces: swagger @Tags/@Router on the handler; FLAG_SOUND_
CLASSIFICATION usecase flag + UsecaseSoundClassification + UsecaseInfoMap +
GuessUsecases + ModalityGroups + GetAllModelConfigUsecases; meta usecase
option; /api/instructions audio area updated; auth RouteFeatureRegistry +
FeatureAudioClassification (APIFeatures, default ON) + FeatureMetas; UI
usecaseFilters, capabilities.js CAP_SOUND_CLASSIFICATION, Models.jsx filter
+ i18n; docs page features/audio-classification.md + whats-new + crosslink.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): realtime sound-event detection over the websocket API
When a realtime pipeline configures a sound-classification model, each
VAD-committed utterance (the same window the transcription path produces)
is also run through the CED sound-event classifier and the scored AudioSet
tags are emitted as a new server event. No new backend rpc is needed: the
SoundDetection gRPC method already exists on this branch.
- config: add Pipeline.SoundDetection (yaml/json sound_detection,omitempty)
beside Transcription/VAD.
- realtime: add Model.SoundDetection(ctx, audio, topK, threshold) to the
ModelInterface; implement it on wrappedModel and transcriptOnlyModel by
calling backend.ModelSoundDetection with the session's sound-classification
model config (mirrors how Transcribe dispatches). Load the optional config
in newModel / newTranscriptionOnlyModel; nil config keeps it additive.
- types: add ConversationItemSoundDetectionEvent (item_id, content_index,
detections[]{label,score,index}) with type conversation.item.sound_detection,
its ServerEventType constant and MarshalJSON, mirroring the transcription
completed event.
- realtime: add emitSoundDetection (unary path: classify the committed window,
build the event, t.SendEvent) and wire it at the utterance-commit hook right
after emitTranscription; gated on session.SoundDetectionEnabled (resolved
from Pipeline.SoundDetection at session setup, defaults top_k=5, threshold=0).
Its error is logged via xlog but never aborts the turn.
- test: Ginkgo specs for emitSoundDetection (tags emitted, empty detections,
classifier error) plus a SoundDetection method on the fakeModel double.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ced): implement SoundDetection in nodes backend test doubles
The SoundDetection method added to the grpc backend interface left two
test doubles (fakeBackendClient, fakeGRPCBackend) incomplete, so
core/services/nodes failed to compile under `go vet`/`go test` (go build
missed it: the doubles live in _test.go). Add the method to both,
mirroring their existing Detect mock. Repairs CI for the nodes package.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): decouple realtime sound detection from VAD (sound-only sessions)
Sound-event detection must activate on sounds, not speech, so it no longer
runs through the voice VAD/transcription path. A sound-detection-only
pipeline (sound_detection set, no transcription/LLM) now:
- is accepted by prepareRealtimeConfig (sound_detection counts as a pipeline
stage),
- builds a lightweight model via newSoundDetectionOnlyModel (no VAD/STT/LLM/TTS
loaded), and
- defaults the session to turn_detection none (no VAD) with no transcription
stage, so the client drives windowing via input_audio_buffer.commit
(option A: client-side sliding window). The per-PCM C-API already supports
arbitrary windows.
commitUtterance gains a sound-only branch: it emits the
conversation.item.sound_detection event (scored AudioSet tags) and stops -
no transcription, no LLM response. generateResponse is now guarded on a
transcription stage being present, so a sound-only turn never invokes the LLM.
Existing transcription/VAD sessions are unchanged (additive). Added a
commitUtterance sound-only Ginkgo spec asserting it emits the sound event and
neither transcribes nor generates a response. go vet + golangci-lint
(new-from-merge-base) clean; openai suite green.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): register sound-classification backend in gallery + CI
Mechanical backend-image registration for the ced sound-event classifier,
mirroring the parakeet-cpp Go/purego backend everywhere it is wired up.
- .github/backend-matrix.yml: add the ced build matrix, field-for-field copies
of the parakeet-cpp entries (cpu amd64/arm64, cublas cuda 12/13 amd64,
l4t cuda-13 arm64, l4t-jetpack cuda-12 arm64, sycl f32/f16, vulkan
amd64/arm64, rocm hipblas, and the metal darwin entry), changing only
backend and tag-suffix. dockerfile stays ./backend/Dockerfile.golang.
- backend/index.yaml: add the &ced meta anchor (capabilities map per platform)
plus ced-development and the per-arch image entries, each uri/mirror
tag-suffix matching the matrix exactly. The model gallery (GGUF) entry is
intentionally deferred pending the HuggingFace publish (TODO note inline).
- scripts/changed-backends.js: add an explicit item.backend === "ced" branch in
inferBackendPath mapping to backend/go/ced/, same mechanism and ordering as
the parakeet-cpp branch (before the generic golang fallthrough).
- .github/workflows/bump_deps.yaml: register mudler/ced.cpp -> CED_VERSION in
backend/go/ced/Makefile so the daily bot bumps the pin.
- swagger/{docs.go,swagger.json,swagger.yaml}: regenerated via make swagger so
the existing /v1/audio/classification annotations land in the generated spec.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): server-side windowing for realtime sound detection (option B)
Adds an optional server-driven sliding-window classifier so a sound-only
realtime client only has to stream audio (no input_audio_buffer.commit):
- Pipeline.sound_detection_window_ms / sound_detection_hop_ms config knobs.
When both > 0 on a sound-only session, the server classifies the last
window of streamed audio every hop and emits a conversation.item.sound_
detection event; the input buffer is trimmed to one window so a long
stream stays bounded. When unset, the session stays client-driven
(option A). Runs independent of VAD (sound events are not speech).
- handleSoundWindow (ticker) + classifySoundWindow (one tick, extracted so
it is unit-testable) + writeWindowWAV, which declares the true
InputSampleRate (NewWAVHeaderWithRate) so the classifier resamples
correctly. Goroutine is started after toggleVAD and torn down with the
session (close + wg.Wait).
- Register pipeline.sound_detection (+window_ms/hop_ms) in the config meta
registry; the earlier realtime commit added pipeline.sound_detection
without a registry entry, failing TestAllFieldsHaveRegistryEntries. This
fixes that and covers the two new knobs.
Tests: classifySoundWindow emits an event + trims the buffer to one window,
no-ops on too-little audio; writeWindowWAV declares the given sample rate.
go build/vet + golangci-lint (new-from-merge-base) clean; config + openai
suites green.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): add ced-base GGUF model gallery entries (f16 + q8_0)
The ced-base weights are now published at mudler/ced-base-gguf (Apache-2.0,
converted from mispeech/ced-base). Adds gallery/ced.yaml (backend: ced +
known_usecases: sound_classification) and two gallery/index.yaml entries
(ced-base-f16 default, ced-base-q8 smallest) with sha256-pinned files, and
removes the now-resolved TODO from backend/index.yaml.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): add tiny/mini/small GGUF model gallery entries
Publishes the rest of the CED family (same architecture, metadata-driven port
verified end-to-end on ced-tiny) to mudler/ced-{tiny,mini,small}-gguf and adds
their f16 + q8_0 gallery entries:
ced-tiny (5.5M, edge/Pi-class) f16 11MB / q8_0 6MB
ced-mini (9.6M) f16 19MB / q8_0 11MB
ced-small (22M) f16 42MB / q8_0 23MB
All sha256-pinned. ced-base remains the accuracy default.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(ced): point gallery entries at the consolidated mudler/ced-gguf repo
All CED quantizations (tiny/mini/small/base, f16/q8_0) now live in a single
HuggingFace repo, mudler/ced-gguf, instead of per-model repos. Repoint the 8
gallery model entries' urls + file uris accordingly. sha256 and filenames are
unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(ced): bump CED_VERSION to the short-clip fix
Pin the ced backend to ced.cpp 99c6ed3, which fixes a crash on any clip
shorter than target_length (~10.11s): time_pos_embed was added at its full
63-frame grid instead of being sliced to the clip's actual time grid, tripping
ggml_can_repeat in ggml_add. Surfaced by the live realtime e2e (sub-10s
windows) and gated with a short-clip parity test upstream.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(ced): list ced.cpp as a LocalAI-team engine + backend-guide directive
- README.md: add ced.cpp to the "native C/C++/GGML engines developed and
maintained by the LocalAI project" table.
- docs/content/features/backends.md: add a Sound Classification backend
category (sound-event classification / audio tagging) listing ced.cpp.
- .agents/adding-backends.md: add a "Documenting the backend" section and two
verification-checklist items requiring new backends to be documented in the
backends.md category list, and in-house native engines to be added to the
README maintained-engines table. This directive was missing.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(ced): repin CED_VERSION to the v0.1.0 release commit
ced.cpp history was squashed into a single release commit (tagged v0.1.0), so
the previous pin (99c6ed3) no longer exists upstream. Pin to c04ac14, the
v0.1.0 release commit, so the backend builds against a commit that exists.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ced): silence gosec G304/G103 + govet unsafeptr on audited paths
- sound_classification.go: os.Create(dst) where dst = temp dir + path.Base of
the upload (no traversal). #nosec G304, matching the depth-anything-cpp handler.
- goced.go: reading a NUL-terminated C string from a libced-owned buffer.
#nosec G103 (gosec) + //nolint:govet (golangci-lint's unsafeptr check), since
the uintptr is a C-owned malloc'd buffer, not Go-GC memory.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): add voice_recognition enforce + identity config
Add Enforce *bool and Identity *VoiceIdentityConfig to
PipelineVoiceRecognition, plus EnforceGate/IdentityEnabled/
AnnounceEnabled/PersonalizeEnabled helpers. Enforce nil defaults to
gating (backward compatible); identity surfacing is independent of the
gate.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): add Speaker type and conversation.item.speaker event
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(realtime): split voiceGate into Resolve + authorize
Split the speaker authorization into a Resolve step (embed once, produce a
types.Speaker identity) and a pure authorize policy step, with a 0..100
confidence score mirroring /v1/voice/identify. The legacy Authorize wrapper is
kept so existing specs stay green.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): resolve speaker per turn and emit conversation.item.speaker
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): personalize LLM turns with recognized speaker
Set the per-message name field on each recognized user turn and append a
current-speaker note to the system message, both gated by the voice
recognition identity config.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(realtime): document speaker identity surfacing and personalization
Document the new voice_recognition keys (enforce, identity.*) and the
LocalAI-extension conversation.item.speaker server event in the realtime
feature docs.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(realtime): cover when:first+identity re-resolution and multi-speaker history
Add two integration specs to harden the speaker-aware realtime path:
- when:first with an Identity block re-resolves the speaker every turn even
though re-authorization is skipped after the first match: a later resolve
error now fails closed, while a clean later resolve still surfaces and names
the speaker.
- multi-speaker history attribution: each user turn carries its own per-message
name and the injected system note reflects the latest speaker.
Test-only change; no production behavior was modified.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): surface speaker labels in conversation.item.speaker
Carry the registered speaker's labels (identify mode) on types.Speaker so
they flow into the conversation.item.speaker event and the stored item.
Verify mode has no labels, so the field is omitted there.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(e2e): cover conversation.item.speaker over a real websocket
Add a realtime-pipeline-identity config (verify mode, enforce:false, identity
announce+announce_unknown+personalize) and two e2e specs driving the real
server over a real WebSocket with the mock VoiceEmbed backend: an authorized
speaker yields a conversation.item.speaker event naming e2e-speaker (matched
true) and reaches response.done; an unauthorized speaker yields an unknown
(matched false, no name) event and still responds, proving enforce:false
never drops a turn.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(config): register voice_recognition enforce + identity fields
The meta registry coverage test (TestAllFieldsHaveRegistryEntries) requires
every config field to have an entry in core/config/meta/registry.go. The new
voice_recognition.enforce and voice_recognition.identity.* fields were missing,
failing tests-linux and tests-apple. Add registry entries (toggles) so the
fields are surfaced in the model-config editor and the coverage test passes.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
The chat input only accepted attachments via the file picker, so users
who copied an image from a webpage or a screen region had to first save
it to a file before attaching it (#10361).
Add an onPaste handler on the input textarea that pulls image items out
of the clipboard and routes them through the same staging path as the
file picker. The per-file processing in handleFileChange is extracted
into a shared processFiles helper so both entry points stay in sync.
Clipboard images, which arrive unnamed or as a generic "image.png", are
given unique typed names so multiple pastes don't collide, and the
default paste is suppressed only when an image is actually attached so
normal text paste is unaffected.
Closes#10361
Signed-off-by: Anai-Guo <antai12232931@outlook.com>
fix(ci): use namespace import for js-yaml in changed-backends.js
js-yaml's ESM build exposes only named exports (load, dump, ...) and no
default export. Bun's strict ESM interop rejects the default import with
'Missing default export in module js-yaml.mjs', failing the detect-changes
and generate-matrix CI jobs. Import the namespace instead; yaml.load (the
only usage) resolves to the named export, so behavior is unchanged.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(backend): call vram.EstimateModelMultiContext for model size estimate
core/backend/options.go called vram.EstimateModel, which does not exist in
the vram package (it exposes EstimateModelMultiContext). This broke the build
on master (undefined: vram.EstimateModel). Use EstimateModelMultiContext with
a nil context-size slice (defaults to a single 8192 estimate); the returned
MultiContextEstimate.SizeBytes is exactly what the caller consumes, so size
estimation behavior is unchanged.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(watchdog): add size-aware LRU eviction mode
When the model count hits the LRU limit or the memory reclaimer fires,
evict the largest model by on-disk file size first rather than the
least-recently-used one. For GGUF models the file size is a reliable
proxy for GPU/RAM footprint, so evicting the largest candidate maximises
freed memory per eviction round while keeping small utility models
(embeddings, classifiers, rerankers) resident.
Changes:
- `pkg/model/watchdog.go`: add `sizeAwareEviction` flag and
`modelSizes map[string]int64` to `WatchDog`; sort candidates by
`sizeBytes` desc (LRU time as tiebreaker) when the flag is set;
add `RegisterModelSize`, `SetSizeAwareEviction`, `GetSizeAwareEviction`
- `pkg/model/watchdog_options.go`: add `WithSizeAwareEviction` option
- `pkg/model/initializers.go`: stat model file after load and call
`RegisterModelSize` so size data is available before the first eviction
- `core/config/application_config.go`, `runtime_settings.go`: add
`SizeAwareEviction` field and `WithSizeAwareEviction` app option;
expose via `ToRuntimeSettings` / `ApplyRuntimeSettings` for the
`POST /api/settings` live-reload path
- `core/cli/run.go`: add `--size-aware-eviction` flag /
`LOCALAI_SIZE_AWARE_EVICTION` env var
- `core/application/startup.go`, `watchdog.go`: wire the new option
through to `NewWatchDog`
- `pkg/model/watchdog_test.go`: 5 new specs — option enable, dynamic
toggle, largest-first ordering, equal-size LRU tiebreaker, no-size
fallback to LRU, and size-map cleanup on eviction
Closes#9375
Signed-off-by: supermario_leo <leo.stack@outlook.com>
* refactor(watchdog): use vram estimation scaffolding for model size
Replace the brittle os.Stat(modelFile) approach with a proper call to
pkg/vram, which handles multi-file models (DownloadFiles, MMProj) and
all weight file types, not just single GGUF files.
- Add estimateModelSizeBytes() in core/backend/options.go that collects
all weight file URIs from the model config, resolves them to file://
URIs, and calls vram.Estimate() with the shared DefaultCachedSizeResolver
(15-min TTL cache avoids redundant stat calls on repeated loads)
- Thread the result through via a new WithModelSizeBytes() loader option
- In initializers.go, consume the pre-computed size instead of calling
os.Stat; if no size was supplied (e.g. for external/router-dispatched
models) the registration is simply skipped
Signed-off-by: supermario_leo <leo.stack@outlook.com>
* refactor(watchdog): use EstimateModel with HF fallback for size estimation
Switch estimateModelSizeBytes from calling vram.Estimate directly to the
unified vram.EstimateModel entry point, which adds automatic fallbacks:
file-based GGUF metadata → HF API → size string.
Also extract the HuggingFace repo ID from model URIs (huggingface://,
hf://, https://huggingface.co/ and org/model short-form) and pass it
as ModelEstimateInput.HFRepo, so models not yet downloaded locally can
still get a size estimate via the HF API.
Addresses @mudler's review feedback: "better to rely on EstimateModel
and pass by the HF URL of the model extracted from the URI".
Signed-off-by: supermario_leo <leo.stack@outlook.com>
* feat(webui): add Size-Aware Eviction toggle to settings page
The size-aware eviction setting was wired through the CLI flag and the
RuntimeSettings live-reload path (POST /api/settings) but had no handle
on the React settings page, so it could not be toggled from the UI.
Add a Size-Aware Eviction toggle to the Watchdog section, next to the
existing Force Eviction When Busy / LRU eviction handles. The settings
page loads and saves the whole RuntimeSettings object, so the new
size_aware_eviction key is picked up with no extra plumbing.
Addresses @mudler's review feedback: the application config setting
should land on the same UI settings page as the other handles.
Signed-off-by: supermario_leo <leo.stack@outlook.com>
---------
Signed-off-by: supermario_leo <leo.stack@outlook.com>
* fix(vllm): don't stream raw tool-call markup as content when a tool parser is active
When a tool_parser is configured and the request carries tools, the streaming
loop emitted every text delta as delta.content — including the model's raw
tool-call markup (e.g. <tool_call>...) — because extract_tool_calls only runs
on the full output after the stream. Clients streaming a tool call therefore
saw the unparsed tool-call syntax as assistant content.
Buffer the text while a tool parser is active for the request; the existing
end-of-stream chat_delta already carries the parsed tool_calls (or the cleaned
content), which the Go side converts to SSE deltas. Non-tool-parser streaming
is unchanged.
Add a server-less regression test covering both the tool-call case (no raw
markup leaked as content) and the plain-text case (content delivered exactly
once — guards against double-emitting the buffered content).
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* test(vllm): add expectedFailure test for progressive streaming with tool parser (Case 3, #582)
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* test(vllm): add Cases 4+5 — marker split across chunks + false-positive prefix (TDD, Option B state machine, #582)
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* feat(vllm): progressive streaming via parser.extract_tool_calls_streaming
When a tool parser is active for a tool-enabled streaming request,
#10346 buffers the entire generation and surfaces it on the final
chunk to prevent raw tool-call markup from leaking as delta.content.
This is correct but turns the request into effectively non-streaming
for plain-text responses — the client sees nothing until the model
stops.
Every concrete tool parser shipped with vLLM 0.23+ already implements
extract_tool_calls_streaming (Granite4, Qwen3Coder, DeepSeekV31, Jamba,
Ernie45, Hermes2Pro, llama3_json, mistral, …). Use it: instantiate
the parser before the streaming loop and call its streaming method per
delta, emitting DeltaMessage(content=…) or DeltaMessage(tool_calls=[…])
when the parser is ready.
Falls back to the existing #10346 buffer path when:
- the parser does not have extract_tool_calls_streaming, OR
- extract_tool_calls_streaming raises mid-stream (logged, the
rest of the request finishes via post-loop extract_tool_calls).
Tests (TestStreamingToolParser):
1. Buffer path: no markup leaked, no content duplication
2. Native streaming: plain-text response streams progressively
3. Native streaming: tool_call structured, no markup leaked
4. Native streaming exception → graceful fallback, no markup, no crash
5. No tool parser → unchanged per-delta content stream
E2E verified against qwen3_coder on vLLM 0.23.0 (NVIDIA GB10 / arm64 / CUDA 13).
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* docs(vllm): add server-side TTFT benchmark for the streaming tool-parser path
Self-contained stdlib-only script that measures time-to-first-token (TTFT)
for the vLLM backend's two streaming scenarios:
- tool_call: request mentions a tool; model is expected to call it
- plain_text: request offers a tool but explicitly asks for prose
Use this to compare:
- the buffer-all path (#10346) → plain_text TTFT ≈ total response time
- the native-streaming path (this PR) → plain_text TTFT ≈ true first-token time
python examples/vllm-bench/ttft_streaming_tool_parser.py \\
--url http://localhost:8080 --model my-coder --runs 3
Lives under examples/ so it does not interfere with the test suite.
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* examples/vllm-bench: add long-text scenario (8 paragraphs, 1500 tokens)
The long-text scenario shows the buffering vs streaming difference most
dramatically: with the buffer-all path, the client receives nothing for
20+ seconds and then the entire 1500-token response at once. With native
streaming, the first token arrives in tens of milliseconds and the
response flows progressively.
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
---------
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
Co-authored-by: Philipp Wacker <philipp.wacker@ibf-solutions.com>
* feat(nemo): enable word-level timestamps for ASR models
The nemo backend ignored timestamp_granularities and always returned a
single segment with start=0 end=0, making word-level timestamps
impossible to obtain even though the NeMo models (parakeet-tdt, etc.)
fully support them.
Changes:
- Add _get_stride_seconds() to compute frame duration from the model's
preprocessor window_stride and encoder subsampling_factor.
- Add _build_segments_with_words() that extracts word offsets from the
NeMo Hypothesis.timestamp dict and converts frame indices to
nanosecond timestamps.
- Support 'word' granularity (one segment per word) and 'segment'
granularity (merge at time-gap boundaries using a dynamic threshold).
- Populate TranscriptSegment.words with TranscriptWord entries so
callers get both segment-level and word-level timing.
- Only request timestamps from NeMo when the caller actually asks for
them (timestamp_granularities is non-empty), keeping the fast path
unchanged for callers that don't need timestamps.
Tested with nvidia/parakeet-tdt-0.6b-v3 on the JFK "ask not" clip:
curl -X POST /v1/audio/transcriptions \
-F file=@jfk.wav -F model=nemo-parakeet-tdt-0.6b \
-F 'timestamp_granularities[]=word' -F response_format=verbose_json
→ each word has correct start/end times in seconds.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
* fix(nemo): address Copilot review feedback
- Narrow exception handling in _get_stride_seconds to catch only
AttributeError, KeyError, TypeError instead of bare Exception, and
emit a warning when falling back to the hardcoded stride.
- Remove explicit return_hypotheses=False when timestamps are requested;
timestamps=True already forces NeMo to return Hypothesis objects.
- Add a warning when NeMo does not return Hypothesis objects despite
timestamps being requested.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
---------
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
The parakeet-specific word accessors can return stale initialisation
data (model name, binary blobs) for segments with no real speech.
Add isValidWord() to filter out words that have:
- empty or whitespace-only text
- U+FFFD replacement characters (from binary data scrubbing)
- negative timestamps
- zero duration (end <= start)
Also skip empty segments entirely when they have no recognisable
content (empty text AND no valid words), preventing spurious subtitle
entries like '00:45:33,592 --> 00:45:33,592 parakeet@rH\u000b\ufffdI'.
Applies to both AudioTranscription and AudioTranscriptionStream.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
fix(vllm): structured outputs silently ignored on vLLM >= 0.23
vLLM >= 0.23 removed GuidedDecodingParams (now StructuredOutputsParams) and
renamed the SamplingParams field guided_decoding -> structured_outputs. The
import failed, HAS_GUIDED_DECODING became False, and the whole guided-decoding
block was skipped, so response_format / grammar constraints were silently
ignored. Adapt the existing request.Grammar path to the new class/field.
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
Opening a model row's kebab (ActionMenu) on the Manage dashboard snapped the
page scroll to the top and rendered the menu detached from its trigger, making
it impossible to operate.
Two compounding causes:
- The menu auto-focus called el.focus() without preventScroll, so the browser
scrolled the focused element into view, yanking the page to the top.
- The position:fixed Popover was rendered inline inside the table row. The
editorial UI overhaul added hover transforms to rows/cards, and a transformed
ancestor re-anchors position:fixed to itself instead of the viewport, so the
menu (positioned from the trigger's viewport rect) landed in the wrong place.
Fix: portal the Popover to document.body so position:fixed always resolves
against the viewport, position it before paint with useLayoutEffect (no {0,0}
flash), and pass preventScroll:true to both focus calls.
Adds an e2e regression test that reproduces the symptom (scroll jumped from 564
to 0 on the old code) and asserts the menu tracks its trigger.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
refactor(config): single source of truth for default values across config + backend
Defaults were decided in two areas with duplicated/drifted literals: the config
SetDefaults tiers vs core/backend/options.go's grpcModelOpts (which translates a
ModelConfig to the backend wire format and supplied its own fallbacks). They had
drifted - n_gpu_layers 9999999 (options.go) vs 99999999 (gguf.go), two 512 batch
constants, context 1024 (gguf) vs 4096 (backend) scattered as bare literals.
Introduce core/config/defaults.go as the canonical home (DefaultContextSize=4096,
GGUFFallbackContextSize=1024, DefaultNGPULayers=99999999, DefaultFlashAttention=
auto). gguf.go / hooks_llamacpp.go use them directly; core/backend references them
(backend imports config, never the reverse) so DefaultContextSize/DefaultBatchSize
and the flash-attn / n_gpu_layers fallbacks resolve to one place. The two context
values (1024 GGUF-no-estimate vs 4096 general) are kept distinct but now named +
documented, not blind literals. Behavior-preserving; config + backend suites green.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(config): enable cross-request prefix caching for serving (Phase 2)
The llama.cpp backend ships n_cache_reuse=0 (cross-request KV prefix reuse via
shifting disabled). Enable it by default (256) so repeated prefixes - system
prompts, RAG context, agent scaffolds, multi-turn chat - aren't recomputed. This
is the universally-useful part of 'paged attention' (shared-prefix reuse, which
the upstream maintainers themselves identify as where paged attn actually helps)
and needs none of the block-KV machinery.
Lives in a serving_defaults.go sibling to hardware_defaults.go (device-driven vs
serving-policy defaults); both run from SetDefaults and only fill unset values.
Explicit cache_reuse/n_cache_reuse always wins. Device-independent, so it
propagates to distributed nodes via the model options with no router change.
Shares the backendOptionSet helper with the Phase-1 parallel default.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(config): extract generic fallback defaults into ApplyGenericDefaults
Behavior-preserving: move the inline sampling-param + runtime-flag fallbacks out
of SetDefaults into ApplyGenericDefaults, completing the domain-grouped tiers
(ApplyInferenceDefaults=family, ApplyHardwareDefaults=device, ApplyServingDefaults
=serving, ApplyGenericDefaults=generic fallbacks). SetDefaults is now a clean
orchestrator. Same order (runs after the family/hardware/serving tiers so those
win) and same conditions (TopK gated on UsesLlamaSamplerDefaults, MMap on XPU).
No behavior change; full config suite green. (NGPULayers stays in the GGUF-read
path for now - it's device-driven but coupled to model-size detection; a separate
follow-up.)
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(config): add model alias field and self-validation
Add ModelConfig.Alias (yaml: alias), IsAlias(), and an alias
short-circuit at the top of Validate() that rejects self-reference and
forbids setting backend/parameters.model on a pure-redirect alias.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(config): resolve and validate model alias targets in the loader
Assisted-by: Claude:opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(middleware): resolve model aliases and stamp requested/served identity
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(modeladmin): reject alias configs with invalid targets on create/edit
Validate alias targets at create/swap entry points (ImportModelEndpoint,
EditYAML, PatchConfig) so a dangling, chained, or disabled alias target is
rejected at save time rather than surfacing as a runtime error.
Assisted-by: Claude:opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(api): add GET /api/aliases to list model aliases
Adds an admin-gated read-only endpoint that lists every model alias
config as {name, target} pairs, backed by the loader's existing
GetAllModelsConfigs().
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(mcp): add set_alias and list_aliases tools
Expose model-alias management over the LocalAI Assistant MCP surface:
list_aliases (read-only, GET /api/aliases) and set_alias (mutating).
SetAlias is swap-first: PATCH /api/models/config-json/:name swaps an
existing alias's target (validated, non-destructive) and a 404 falls
back to POST /models/import to create a fresh {name, alias} config. The
inproc client mirrors this via ConfigService.PatchConfig + a create path
modeled on ImportModelEndpoint. Deletion reuses delete_model.
Assisted-by: Claude:claude-opus-4 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* style(mcp): replace em dashes in alias tool comments
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(config-meta): expose alias as a model-select field
Add an 'alias' section to DefaultSections() and an 'alias' field override
in DefaultRegistry() so the schema-driven React editor renders the new
top-level ModelConfig.Alias field as a model picker in its own section.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): add alias template card and Manage alias badge
Add an 'Alias / Routing' template to the create-flow gallery that seeds a
minimal name + alias config, and a read-only 'alias -> target' badge on the
Manage Models tab. The capabilities row payload does not carry the alias
field, so the badge resolves targets from GET /api/aliases looked up by name.
Assisted-by: Claude:claude-opus-4 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs: document model aliases
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(swagger): regenerate for GET /api/aliases
Adds the /api/aliases path and AliasInfo schema generated from the
ListAliasesEndpoint annotation.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(localai): check os.RemoveAll error in aliases_test
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix: correct alias conversion docs and advertise /api/aliases in instructions
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(mcp): write alias config 0600 to satisfy gosec G306
The inproc createAlias path wrote the alias YAML with 0644, which gosec
flags as a new G306 finding on the PR. The LocalAI process is the sole
reader/writer of model configs, so 0600 is correct and keeps the scan clean.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(gallery): add Depth Anything V2 models + bump native version
Add Depth Anything V2 (DA2) support to the depth-anything backend. DA2 is
depth-only (no camera pose, no confidence) and ships both relative
(relative inverse depth) and metric (depth in metres) variants. The Go
backend is model-agnostic, so no backend code changes are required — only
a native version bump and new gallery entries.
- backend/go/depth-anything-cpp/Makefile: pin DEPTHANYTHING_VERSION to the
depth-anything.cpp commit that adds the DA2 engine + C-API routing
(e3dec57f13a52366bbc4f279ef44804915960a6b, kept alive by the upstream tag
da2-support so it survives a squash-merge).
- gallery/index.yaml: add 12 DA2 entries (4 base quants, small, large, plus
Hypersim indoor and VKITTI outdoor metric models in S/B/L). Metric models
carry the metric-depth tag; none carry camera-pose.
Assisted-by: Claude:claude-opus-4-8
* chore(depth-anything-cpp): pin to merged DA2 master commit
PR #1 (mudler/depth-anything.cpp) merged to master as f4e17de (squash); repoint
the pin from the pre-merge commit to the canonical master commit.
Assisted-by: Claude:claude-opus-4-8
---------
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(config): node-aware hardware defaults — larger physical batch on Blackwell
A larger physical batch (n_batch/n_ubatch) materially lifts MoE prefill on
NVIDIA Blackwell consumer GPUs (sm_120/121, incl. GB10 / DGX Spark) — measured
on a GB10 with Qwen3-Coder-30B-A3B, the prefill ceiling rises (ub512 ~2994 ->
ub2048 ~3316 t/s) and saturates around 2048.
The heuristic lives in core/config alongside the other config overriders
(ApplyInferenceDefaults, guessDefaultsFromFile/NGPULayers) — they all fill the
ModelConfig from heuristics, so hardware tuning is the same domain and stays in
one place. It is parameterized on a GPU descriptor (not direct detection) so it
works in both deployment shapes:
- Single host: SetDefaults applies it with the LocalGPU.
- Distributed: only the worker sees the GPU, so the worker reports its compute
capability on registration (gpu_compute_capability -> BackendNode), and the
router re-applies the SAME core/config heuristic for the SELECTED node before
loading — fixing the case where the frontend has no GPU at all.
Explicit `batch:` always wins (only managed default values are touched).
xsysinfo gains NVIDIAComputeCapability() (detection only); all interpretation
lives in core/config. Tests: core/config, pkg/xsysinfo, core/services/nodes.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(config): injectable local-GPU seam + single-instance coverage
Make local GPU detection an injectable package var (localGPU) so the
single-instance path (SetDefaults -> ApplyHardwareDefaults) is deterministically
testable without a real GPU, mirroring the distributed override's coverage.
Adds specs asserting SetDefaults sets the Blackwell physical batch, leaves it
unset on non-Blackwell, and never overrides an explicit batch.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(config): default concurrent serving (n_parallel) by GPU VRAM
The llama.cpp backend defaults n_parallel=1, which serializes multi-user requests
and leaves continuous batching off (it auto-enables only at n_parallel>1). Fold a
VRAM-scaled parallel-slot default into the hardware-config path so multi-user
serving works out of the box: >=32GiB->8, >=8GiB->4, >=4GiB->2, else unchanged.
With the backend's unified KV the slots SHARE the context budget, so this adds
concurrency without multiplying KV memory. Explicit parallel/n_parallel always
wins. EnsureParallelOption is shared by the single-host path (ApplyHardwareDefaults
with the local GPU) and the distributed router (per selected node's reported VRAM,
since the frontend may have no GPU). LocalGPU now also reports VRAM.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Add apex-quant (MoE per-tensor/per-layer quantization recipe) to the
"Backends built by us" section as a note after the engines table, since
it is a quantization recipe rather than a native inference engine.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ⬆️ Update ggml-org/llama.cpp
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(llama-cpp): adapt grpc-server to upstream server-schema split
Upstream llama.cpp (e475fa2) extracted the JSON request-schema evaluation
out of the static server_task::params_from_json_cmpl into the new
server_schema::eval_llama_cmpl_schema (tools/server/server-schema.cpp).
The grpc-server unity build still called the old static member, breaking
every llama-cpp backend build with "no member named 'params_from_json_cmpl'
in 'server_task'".
Pull server-schema.cpp into the translation unit and call the new function,
keeping both guarded by __has_include so forks that predate the split (e.g.
llama-cpp-turboquant, which still exposes params_from_json_cmpl) keep
compiling against the old static member.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(realtime): raise WebRTC data-channel max-message-size for large events
Browsers advertise a conservative SCTP max-message-size in their SDP offer
(Chrome uses 256 KiB). pion enforces the remote's advertised value on send, so
a single realtime event larger than it cannot be sent over the "oai-events"
data channel: SendText fails, the event is dropped, and the turn silently
yields no response. Some turns legitimately produce a >256 KiB JSON event —
notably tool calls with sizeable schemas or results.
Browsers advertise the value conservatively but their SCTP stacks reassemble
much larger messages, so raise the max-message-size honored for our own
server-generated events by rewriting the attribute in the offer before
SetRemoteDescription.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(realtime): keep the WebRTC sendLoop alive when one event send fails
A failed SendText on the oai-events data channel exited the sender goroutine,
so a single dropped event (e.g. one over the negotiated SCTP max-message-size)
tore down the session and silently dropped every subsequent event. Log and skip
the offending event instead and keep draining; a genuinely dead transport is
still handled by the closed / connection-state path.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(downloader): stall timeout, resume-safe cancel, and stale-partial reaping
Large model installs would hang forever or never finish. Three defects in
the HTTP download path, all hit by big GGUF pulls over a slow or flaky link:
1. No stall timeout. The shared download client sets no body deadline
(correct for streaming) but also no read-idle timeout, and the
transport's IdleConnTimeout does not cover an in-flight body read. A
silently-dropped TCP connection (no FIN/RST) blocked the body Read
forever, freezing an install at N bytes until an external reaper killed
it. Add an idle-timeout reader that closes the body after a window of
zero progress (DownloadStallTimeout, default 60s), turning an indefinite
hang into a fast, retryable error. A read that returns data resets the
clock, so a slow-but-steady transfer is unaffected.
2. Cancellation deleted the partial. On context.Canceled the code removed
the .partial file, so any frontend restart (deploy, OOM) mid-download
wiped all progress and the retry restarted from zero. At slow egress,
files larger than the restart interval never completed. Keep the
.partial on cancel so the next attempt resumes via Range.
3. Partials leaked. Cleanup only ran on the context-cancel path, never on a
stall or a SIGKILL/OOM, so abandoned .partial files accumulated and could
fill the models volume. Add CleanupStalePartialFiles and reap partials
older than 24h on startup.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(downloader): discard the .partial on a deliberate user cancel
Review follow-up. The previous commit kept the .partial on every cancellation
so restarts could resume, but that also left a dangling partial when a user
*intentionally* cancelled an install — the file lingered until the 24h reaper.
Distinguish the two: cancel the gallery operation's context with a cause
(downloader.ErrUserCancelled) so the download layer can tell a deliberate
abort (discard the partial) from an incidental one such as a shutdown/restart
(keep it for resume). Detect cancellation via the context rather than the
returned error, because an HTTP request cancelled with a cause surfaces the
cause error, not context.Canceled.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(downloader): resolve gosec G122 in CleanupStalePartialFiles
CI's code-scanning (gosec) flagged G122 (symlink TOCTOU) for the os.Remove
call inside the filepath.WalkDir callback. Collect the stale paths during the
walk and delete them afterwards instead of mutating the tree from inside the
callback. Behavior is unchanged; the existing specs still pass.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(crispasr): add word-level timestamp support
Add word-level timestamp extraction to the crispasr backend by calling
the CrispASR C library's word accessor functions that are already
exported by libgocraspasr but were not previously bound by the Go
wrapper.
Two families of word functions are supported:
1. Session-based (get_word_count/text/t0/t1) — works per-segment for
whisper-like backends.
2. Parakeet-specific (get_parakeet_word_count/text/t0/t1) — returns a
global word list for TDT/CTC/RNNT parakeet models where the session
API does not expose per-segment word data.
The Go code tries session-based first and falls back to parakeet-specific
when the session word count is zero.
Depends on #10402 (grpc server Words forwarding) for the words to reach
the HTTP response.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
* fix(crispasr): use portable sed -i.bak for macOS compatibility
BSD sed requires -i '' for in-place editing while GNU sed uses -i.
Replace with -i.bak which works on both platforms, then remove the
backup file.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
---------
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
Vulkan backends bundled their own loader and ICD manifests but neither the
Mesa driver the manifests point at nor a way to make the loader find them,
so on a runtime base image without Mesa the loader enumerated zero devices
and the GPU silently fell back to CPU (only NVIDIA worked, since its ICD is
injected by the container toolkit).
- scripts/build/package-gpu-libs.sh: for each installed ICD manifest, bundle
the driver .so its library_path names — no hard-coded, platform-dependent
soname list — plus that driver's ldd dependencies, skipping manifests whose
driver isn't installed. Rewrite each library_path to a bare soname so the
bundled driver resolves via the LD_LIBRARY_PATH run.sh already sets.
- .docker/install-base-deps.sh, backend/Dockerfile.golang,
backend/Dockerfile.python: install mesa-vulkan-drivers in every Vulkan
builder so the driver + manifests exist to be packaged (the LunarG SDK
ships only the loader and shader tooling).
- pkg/model/process.go: when a backend ships vulkan/icd.d/, point the loader
at it via VK_DRIVER_FILES/VK_ICD_FILENAMES at launch (no-op otherwise).
Covered by pkg/model/process_vulkan_test.go.
- backend/go/parakeet-cpp/package.sh: complete the L0 stub (was missing the
libc-family ldd walk + GPU-lib packaging) by mirroring whisper, so the
vulkan-parakeet image actually bundles its GPU runtime.
Assisted-by: Claude Code:claude-opus-4-8
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* Use inference defaults in repo src rather than fetching
there are inference_defaults.json already in the repo so we can use
those, they are regularly updated with github actions, and we avoid hash
mismatch errors in the flake this way
Signed-off-by: Souheab <souheab@protonmail.com>
* Update vendor hash
Signed-off-by: Souheab <souheab@protonmail.com>
* Create react-ui derivation as it is required for go build
Signed-off-by: Souheab <souheab@protonmail.com>
* Add FHS env wrapper to make #!/bin/bash scripts work
Signed-off-by: Souheab <souheab@protonmail.com>
* use pkgs.importNpmLock to deal with npm dependencies instead of using npmDepsHash
Signed-off-by: Souheab <souheab@protonmail.com>
---------
Signed-off-by: Souheab <souheab@protonmail.com>
The gRPC server wrapper in pkg/grpc/server.go reconstructs
TranscriptSegment messages when relaying AudioTranscription results
from backends. The Words field was not being copied, causing all
word-level timestamps to be silently dropped regardless of backend
support.
This was introduced when PR #9621 added the TranscriptWord proto
message and transcriptResultFromProto (server-side), but did not
update the server-side gRPC relay to forward the new field.
Fixes#9306
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
* feat(ui): legible Usage charts - distinct prompt/completion hues + chart a11y
Prompt and completion were the same color (primary at 0.35 opacity), so the
stacked token charts read as one blurry blob. Completion now uses a distinct
data-viz hue (--color-data-3) at full opacity across the time chart, the
per-model distribution bars, and the tooltip. The source-mix chart is no longer
aria-hidden: it exposes role="img" with a label.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): sortable Users table
The admin Users table is now sortable by name, email, provider, role, status,
and created date - clickable headers with an aria-sort state, a direction
caret, and keyboard activation (Enter/Space). Permissions and Actions stay
non-sortable.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): unsaved-changes guard on Settings and Agent create/edit
Add a reusable UnsavedChangesGuard (router useBlocker + beforeunload) that
prompts before navigating away or closing the tab with unsaved edits. Wired to
Settings (existing isDirty) and AgentCreate (snapshot the loaded form, compare;
suppressed while saving so the post-save redirect is not blocked). Adds the
common.unsaved i18n keys.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): sortable Traces tables
Both trace tables are now sortable: the API table by method/path/status and the
backend table by type/time/model/duration, with aria-sort, a direction caret,
and keyboard activation. Sort and the expanded row reset when switching tabs
(the two tables have different columns).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): responsive table reflow (cards on mobile), applied to Users
Dense admin tables sideways-scroll on phones. Add a reusable ResponsiveTable
that mirrors the <thead> labels onto each body cell (data-label) and a
<=640px stylesheet that stacks rows into label/value cards. Wired to both
Users tables; reusable for the other dense tables next.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): roll responsive table reflow to Traces, Models, Manage, Nodes
Apply ResponsiveTable to the remaining dense tables so they stack into
label/value cards on phones instead of scrolling sideways. Harden the
component for these tables: scope label-mirroring and the card CSS to direct
children (nested detail tables render normally), override inline min-width on
mobile, and pass through table/container inline styles. Nested expansion
tables in Nodes/Models/Manage are intentionally left as-is.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): unsaved-changes guard on the Fine-Tuning form
Editing the long fine-tune job form and navigating away silently discarded
everything. Snapshot the assembled getFormConfig() as a baseline, treat the
open form as dirty when it diverges, and reuse UnsavedChangesGuard to prompt
before leaving. The baseline is rebased after a job is submitted so leaving
afterward does not warn.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): add Fraunces variable serif + --font-serif token
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): serif display tier + section-heading typography scale
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): un-overload accent — nav rail, stronger focus ring, neutral hover
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): orchestrated page reveal + stagger motion primitives
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(ui): fix dead token refs + dedupe toggle to one primitive
Migrate all .toggle-slider consumers (Users, Chat, AgentChat) to the
canonical BEM toggle primitive and delete the legacy duplicate CSS block.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(ui): route boot fallback through the LoadingSpinner primitive
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): EmptyState primitive with serif title
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): Skeleton shimmer primitive
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): PageHeader + SectionHeading editorial primitives
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): StatusPill primitive + time-of-day greeting helper
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): Home editorial header + status line (north-star redesign)
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): Home loaded-models skeleton list, button hierarchy, EmptyState wizard
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ui): single focus ring (no double-ring) + neutralize stagger delay under reduced motion
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* refactor(ui): all-sans editorial headings + tint-only active nav
Per design review, pivot the heading strategy from hybrid-serif to a
refined grotesk: drop the Fraunces dependency, token, and import; page
titles, the Home greeting, and section/empty-state titles now use Geist
at semibold with the editorial fluid sizing and tight tracking. No serif
anywhere.
Active sidebar item is now a tint-only treatment (accent text + tinted
background); the left accent rail is removed and the shared base
.nav-item.active inset bar is suppressed in the sidebar (as the console
rail already does). Update the design-system e2e specs to assert the
sans display font and the tinted-background active state.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(e2e): add --host flag to ui-test-server
Allow binding the e2e/preview server to an arbitrary address (e.g.
0.0.0.0 to review the UI from another device on the LAN). Defaults to
127.0.0.1 so existing e2e behavior is unchanged.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(ui): declutter Home - discoverable + dismissable API, vertical balance
Home felt overloaded and top-heavy. Three changes from review:
- The API endpoint catalog (12 endpoints) is collapsed by default behind a
"Browse the API" disclosure; only the base URL + copy stay visible, so the
catalog is discoverable without dominating the page.
- The whole connect card is dismissable (x): dismissing unmounts it so the
vertical space is recovered, and the choice is remembered (localStorage).
- .home-page now fills its column and vertically centers its content when
there is slack, so sparse states (no models / card dismissed) read as a
balanced launcher instead of content jammed at the top. Overflow-safe:
tall content flows from the top and scrolls.
Adds connect.browse / connect.hide / connect.dismiss i18n keys to all locales.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): editorial PageHeader with section eyebrow + scroll-to-top on nav
PageHeader now derives its eyebrow from the route's section/console (Build /
Operate / Create) via sectionKeyForPath, so pages get a consistent, meaningful
eyebrow with no per-page wiring (override with the eyebrow prop, suppress with
eyebrow={null}). Settings adopts it as the first consumer.
Also fix a navigation scroll bug: the default layout uses the document as its
scroll container and route changes did not reset it, so navigating the console
rail from a scrolled page landed mid-view. App now scrolls to top on pathname
change.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(ui): adopt PageHeader on agent/media/import/backend pages (batch A)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* refactor(ui): adopt PageHeader on ops/admin/media pages (batch B)
Replace hand-rolled .page-header title blocks with the shared editorial
PageHeader component across 14 pages (Manage, Middleware, Models,
NodeBackendLogs, Nodes, P2P, SkillEdit, Skills, Sound, Traces, TTS, Usage,
Users, VideoGen). Title/subtitle move into PageHeader; header-own action
clusters (Models stats+buttons, Skills search+buttons) move into the actions
slot. Tabs, filters, stat cards, ResourceMonitor and page body stay as
siblings. Eyebrow is left to auto-derive from the route.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(ui): home greeting asserts sans font, not the dropped serif
The greeting render-smoke still asserted Fraunces; update it to assert the
Geist sans display font (and not Fraunces), matching the all-sans direction.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): ThemeToggle i18n + animated icon, drop transition:all
The theme toggle hard-coded its English tooltip; route it through the existing
nav switchToLightMode/switchToDarkMode keys and add an aria-label. The sun/moon
icon now replays a small rotate+fade on theme change (keyed remount; honored by
the global reduced-motion block). Replace the .theme-toggle `transition: all`
with explicit properties.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): canvas drag-to-resize + slide-in, fix hooks order, typed download
Canvas was a fixed pane; make it a workbench:
- Drag the panel's left edge to resize (clamped 360px..75vw), persisted to
localStorage, double-click to reset; hidden and full-width on narrow screens.
- Slide-in/fade on open via canvasSlideIn (honored by reduced-motion).
- Fix a rules-of-hooks bug: the `if (!current) return null` early return sat
above useEffect, so the hook count changed when artifacts emptied. All hooks
now run unconditionally before the guard.
- Downloads use the artifact language's real extension + MIME (a Python
artifact saves as .py, not .txt) via extensionForLanguage.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): per-message code blocks get a language header + copy button
Chat code blocks now render inside a framed block with a header showing the
language and a copy button (delegated handler, copies the block and flips to a
check briefly). Decoration + highlighting run from a MutationObserver scoped to
the messages container, which fires reliably for streamed responses AND for
chats loaded/switched from storage - the prior render-keyed effect missed the
load path (code was left unhighlighted on reload). The observer disconnects
while mutating so it does not retrigger on its own edits.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): image attachments show a thumbnail in the composer
Staged image attachments now preview as a 28px thumbnail (from their data URL)
instead of a bare file icon; other types keep the icon. File names truncate and
the remove button gets an aria-label.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): jump-to-latest pill when scrolled up in chat
When the user scrolls away from the bottom of a conversation, a floating
"Jump to latest" pill appears (sticky, centered above the composer); clicking
it smooth-scrolls to the newest message and re-pins auto-scroll. Resets on
chat switch. Adds the chat.actions.jumpToLatest i18n key to all locales.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): canvas fullscreen toggle + keyboard tab navigation
The canvas header gains a fullscreen toggle (expands the panel to cover the
viewport; resize handle hidden while fullscreen). The artifact tab strip is now
a proper ARIA tablist with roving tabindex and Left/Right arrow-key navigation.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): image result lightbox (zoom, prev/next, download, keyboard)
Generated/history images on the Image page are now clickable, opening a
fullscreen Lightbox with a download button, prev/next navigation, an N/M
counter, and keyboard control (Esc to close, Left/Right to navigate). Adds a
reusable `Lightbox` component (usable later for Video) and the media.image
.actions.view i18n key.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): generation progress with placeholder tiles + elapsed timer
Image generation replaces the bare spinner with a GenerationProgress scaffold:
shimmer placeholder tiles matching the requested count plus a live elapsed-time
readout, so the (often slow) wait feels accountable. Reusable for the other
media generation pages.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): generation progress on Video, TTS, and Sound pages
Reuse GenerationProgress (placeholder tile + elapsed timer) in place of the
bare spinner on the remaining media generation pages, so every slow generation
gives the same accountable feedback.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): agent chat gets per-message code-copy + reliable highlighting
AgentChat now shares Chat's code-block treatment: it runs highlightAll +
enhanceCodeBlocks from a MutationObserver on its messages container (the same
proven path), so agent responses get language headers, copy buttons, and
highlighting that fires for both streamed and loaded messages - closing the
divergence with the main chat without a large refactor.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): Talk voice visualizer
Add a hero frequency-bar visualizer at the top of the Talk page so users get
ambient feedback that they are heard and that the assistant is speaking - the
audit's main Talk gap (the only prior feedback was a small status pill; the
waveform was buried in the dev diagnostics panel).
VoiceVisualizer is self-contained: it builds its own AudioContext + analysers
from the output <audio> stream (speaking) and the mic stream (listening) so it
does not touch the existing WebRTC/diagnostics graph. Bars are status-tinted
(idle/connected/listening/speaking/error) and animate with a gentle idle wave
when not connected. Live mic/output animation is exercised on a real session.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
chore: bump localrecall to include PostgreSQL table-name sanitization fix
Pulls mudler/localrecall#48, which makes sanitizeTableName allowlist valid
identifier characters so collection names containing ':' (e.g. the per-user
"legacy-api-key:<agent>" namespace) no longer break PostgreSQL CREATE TABLE
with "syntax error at or near ':'".
Fixes#10375
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
docs: document the privacy-filter.cpp backend in README and compatibility table
The privacy-filter.cpp backend (#10360) was registered in backend/index.yaml
and referenced from the PII feature docs, but was missing from the backend
catalog surfaces. Add it to the README "Backends built by us" table, the
compatibility table (Utilities & Other, CPU/CUDA 13/Vulkan), and the backend
type list in the backends feature doc.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Squashed feat/pii-ner-tier-engine rebased onto master (was 45 commits; see
backup/pii-ner-tier-engine-prerebase). Net change:
- privacy-filter.cpp: standalone GGML engine for the openai-privacy-filter
PII/NER token classifier, wired as a LocalAI gRPC backend (CPU/CUDA/Vulkan).
TokenClassify moves off the patched llama.cpp path onto this backend.
- PII filter reworked to be NER-centric (encoder/NER detection tier scanning
whole conversations as one document), with a recreated bounded restricted-
regex secret-matching pattern detector tier alongside it (per-model
pii_detection.builtins / .patterns + core/services/routing/piipattern).
- Detection labelled by source (ner vs pattern); backend trace / confidence /
debug observability; analyze/redact exposed as a synchronous API.
- Instance-wide default detector policy + per-usecase default-on; request
filtering extended to completions, embeddings, edits & Ollama.
- React UI: NER-centric PII editor, detector-models table, pattern/builtins
editor, middleware default-policy UI.
- Gallery: privacy-filter-multilingual token-classify model + NER install
filter; token_classify known_usecase; batch sized to context for NER models.
privacy-filter backend registered in the backend gallery (cpu/vulkan/cuda-13
meta + image entries with a capabilities map) matching its CI matrix jobs,
and an /import-model auto-detect importer (PrivacyFilterImporter, narrow
privacy-filter GGUF detection) replacing the prior pref-only registration.
Reconciled against master's independent evolution:
- Dropped master's PIIPatternOverrides feature (global-pattern runtime
overrides + /api/pii/patterns API + runtime_settings.json persistence). The
per-model NER + pattern-detector design supersedes it; it was built on the
global redactor pattern set this branch replaced.
- Reverted the llama.cpp Score carry-patch (0006-server-task-type-score):
removed the patch and restored master's grpc-server.cpp Score RPC (direct
llama_decode, slot-loop bypass) and LLAMA_VERSION pin, plus master's
model_config validation forbidding score + chat/completion/embeddings on
llama-cpp. token_classify is unaffected (it runs on the privacy-filter
backend, not llama-cpp).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The sidebar collapse toggle silently no-op'd in dev builds. toggleCollapse
ran its side effects (localStorage write + sidebar-collapse dispatch) inside
the setCollapsed updater. StrictMode double-invokes updaters in dev to surface
impurity, and the synchronous dispatch re-entered setState from the
App/Sidebar listeners mid-update, so the toggle never committed. Production
builds don't double-invoke, which is why only the dev server was affected.
Compute next from current state and move the persist + broadcast into the
handler body so the updater is pure.
Also fix the Talk page anchoring to the transcript box on load. The transcript
is its own overflow container, but scrollIntoView bubbles to every scrollable
ancestor including the window, yanking the whole page down on mount. Scroll
the transcript container directly instead.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): restructure sidebar into Create/Recognition/Build tiers
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ui): preserve exact sidebar gating for agent items and fine-tune/quantize
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* i18n(ui): add nav tier + console keys to all locales
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): add grouped admin console via pathless layout route
Wrap the existing admin pages in a pathless AdminConsoleLayout route so
they keep their exact flat URLs while gaining a grouped left rail
(Inference / Cluster / Observability / Access / System). Rail item gating
mirrors the sidebar (adminOnly / authOnly / feature + /api/features). The
layout forwards the App-level outlet context (addToast) to the wrapped
pages, which read it via useOutletContext().
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): fold Audio Transform into Studio as a tab
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(ui): update e2e specs for tiered nav + admin console
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ui): gate embedded Studio transform view on audio_transform feature
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): visual polish + console-ize Build/Recognition tiers
Generalize the one-off admin console into a reusable ConsoleLayout driven by
a shared consoleConfig (single source of truth for the rail, its gating, and
the sidebar entry that opens it — removes the prior rail/sidebar drift).
- Promote Install Models to the top menu next to Home.
- Build and Operate are now console tiers (secondary rail); Create stays inline.
- Fold Recognition (Faces/Voices) into the Build console as a group alongside
Automation and Training so it no longer feels split off.
- Style the console rail as a panel (header, grouped dividers, rounded active
pills) with a hover nudge; sidebar items become inset rounded pills. The rail
slide-in plays only when entering a console, not on item-to-item sub-nav
(which remounts the layout), so switching no longer flashes the menu. All
token-based (light + dark), respects reduced-motion.
- Add a delayed RouteFallback loader so lazy routes no longer flash blank;
scoped inside ConsoleLayout so the rail stays put while the body loads.
- Update e2e specs for the new structure (.console-* classes, console entries).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): persist console layout across sub-nav + add drop-in endpoint section
- Keep the page-transition key stable within a console (derived from the
shared console config) so the ConsoleLayout and its rail persist across
item-to-item navigation instead of remounting — fixes the submenu flash.
Cache /api/features across mounts and play the rail entrance animation only
when actually entering a console.
- Add a "One endpoint, every API" section to Home: leads with LocalAI's own
native API (images, video, realtime voice over WebRTC/WS, depth, object
detection, rerank, audio/TTS, face & voice recognition) plus a Full API
reference link, then the drop-in compatibility layer (OpenAI, Anthropic,
Ollama, OpenAI Responses) with the live copyable base URL. All 7 locales.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ui): revert Middleware nav label rename (keep Middleware in all locales)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Bring the Backend & Model Compatibility Table up to the full set of
backends published in backend/index.yaml (60+), organized by modality
with per-backend acceleration targets. Add an "Available Backends"
pointer and expand the backend-type list in the backends feature doc.
Update the README backend count to 60+ and add a "Backends built by us"
section listing the native C/C++/GGML engines maintained by the LocalAI
project (parakeet.cpp, voxtral.c, vibevoice.cpp, rf-detr.cpp,
locate-anything.cpp, depth-anything.cpp, LocalVQE, local-store).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
feat(ds4): wire SSD streaming + quality engine options, add 128GB DeepSeek gallery models
The ds4 backend zero-initialized ds4_engine_options and exposed none of the
engine's tunable knobs, so SSD streaming (run a model larger than RAM by
streaming routed MoE experts from the GGUF on SSD) and the quality/perf knobs
were unreachable from LocalAI model YAMLs.
Map ModelOptions.Options onto ds4_engine_options through a declarative table
(kEngineOptSpecs + apply_engine_option) instead of per-field branches: the
struct is fixed C with no reflection, so the field set is enumerated once and a
future knob is a one-line table row. Two fields use ds4's own typed parsers
(GiB budgets, cache-experts count-or-NGB). Bare flags (e.g. "ssd_streaming")
mean true; path-type options (mtp_path, expert_profile_path,
directional_steering_file) resolve relative to the model directory so a gallery
entry can reference a companion file by bare filename. mtp_draft/mtp_margin are
now validated rather than parsed with throwing std::stoi/std::stof.
Add gallery entries for the 128 GB class:
- deepseek-v4-flash-q2-q4 (~91 GB, mixed q2/q4, fits RAM, higher quality)
- deepseek-v4-flash-q4-ssd (~153 GB full 4-bit, runs on 128 GB via SSD streaming)
- deepseek-v4-flash-q2-mtp (~81 GB + MTP speculative draft weights)
- deepseek-v4-pro-q2-ssd (~433 GB Pro, experimental SSD streaming)
SSD streaming is Metal (Darwin) only; the options are inert on CUDA/CPU.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(depth): add depth-anything-3-metric-large gallery entry
DA3METRIC-LARGE (ViT-L) single-file metric-scale depth + sky, served by the
existing depth-anything backend (same single-GGUF path as mono-large). GGUF
published at mudler/depth-anything.cpp-gguf.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(depth): serve nested metric model (two-file load)
The DA3 nested model needs both branches (anyview GIANT + metric ViT-L) loaded
together. Wire it through the backend:
- Load reads a 'metric_model:<file>' entry from ModelOptions.Options and, when
present, calls da_capi_load_nested(anyview, metric) instead of da_capi_load
(registers the new abi-4 symbol; helper optionValue + unit test).
- gallery: depth-anything-3-nested (model=anyview, options=metric branch, both
GGUFs fetched) for metric-scale depth + pose.
- bump depth-anything.cpp pin to cce5edc (abi 4 / da_capi_load_nested).
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The Hugo relearn theme does not provide an "alert" shortcode, so the
docs deploy failed at the Build site step:
failed to extract shortcode: template for shortcode "alert" not found
docs/content/features/image-generation.md:106
Convert the vae_decode_only note to the theme-supported notice shortcode
used everywhere else in the docs.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(backend): add depth-anything (Depth Anything 3) C++/ggml backend + gallery
Mirrors the locate-anything-cpp backend to register a new depth-anything
backend that wraps the Depth Anything 3 ggml port (depth-anything.cpp) via
purego (cgo-less, no Python at inference).
- backend/go/depth-anything-cpp/: gRPC backend (Load + Predict + GenerateImage),
purego binding to the da_capi_* C ABI, CMake/Makefile/run/package/test scripts
building depth-anything.cpp's DA_SHARED static .so per CPU variant.
- backend/index.yaml: depth-anything backend meta + all hardware-variant
capability entries (cpu/cuda12/cuda13/intel-sycl-f32+f16/vulkan/nvidia-l4t).
- gallery/index.yaml: 8 Depth Anything 3 GGUF models (base q4_k/q8_0/f16/f32,
small, large, giant, mono-large).
- .github/backend-matrix.yml: one build entry per hardware variant.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(depth): typed Depth RPC + REST endpoint exposing full DA3 data
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(depth): pin depth-anything.cpp to e0b6814 (ABI 3 dense C-API)
The Depth RPC handler calls da_capi_depth_dense / da_capi_points (C-API ABI 3);
pin the native build to the commit that exports them.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(depth): pin depth-anything.cpp to v0.1.0 release (b515c31)
Repoint the native version from the now-orphaned e0b6814 to the
b515c31 release commit, kept alive by the upstream v0.1.0 tag.
C-API is unchanged (da_capi_abi_version == 3).
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(depth): wire depth-anything-cpp into build, CI bump, and importer
The backend dir, gallery index, and CI build-matrix were present but the
backend was never wired into the integration points that adding-backends.md
requires:
- root Makefile: add to .NOTPARALLEL, the test-extra chain, a BACKEND_*
definition, the docker-build target eval, and docker-build-backends
(mirrors parakeet-cpp; the backend's own Makefile already documented that
its `test` target is driven by test-extra).
- bump_deps.yaml: register the DEPTHANYTHING_VERSION pin so the daily
auto-bump bot tracks mudler/depth-anything.cpp master (it cannot see an
unregistered Makefile pin).
- import form: add a preference-only KnownBackend entry so depth-anything is
selectable at /import-model (mirrors sam3-cpp; no reliable GGUF auto-detect
signal, so pref-only per the doc's default).
changed-backends.js needs no entry: the generic golang suffix branch already
resolves backend/go/depth-anything-cpp/.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(depth): auto-detect importer for depth-anything GGUFs
Replace the preference-only entry with a real auto-detect importer
(mirrors parakeet-cpp / locate-anything):
- DepthAnythingImporter matches a .gguf whose name carries a
depth-anything token (depth-anything-<size>-<quant>.gguf), so
/import-model recognises mudler/depth-anything.cpp-gguf repos and direct
GGUF URLs without an explicit backend preference. preferences.backend=
"depth-anything" still forces it.
- Registered before LlamaCPPImporter so its GGUF bundles aren't claimed by
the generic .gguf importer; the narrow name match means it cannot claim
arbitrary llama GGUFs or the upstream safetensors PyTorch repos.
- Multi-quant repos pick the smallest quant by default (q4_k -> ... -> f32,
depth stays >0.998 corr even at q4_k); quantizations preference overrides.
- Drops the now-redundant knownPrefOnlyBackends entry (importer-backed
backends are not listed there, matching parakeet-cpp).
- Table-driven Ginkgo test covers detection, negative cases (llama GGUF,
upstream safetensors), default/override/fallback quant pick, and direct
URL import. 10/10 specs pass.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(depth): check conn.Close error in grpc Depth client (errcheck)
The new Depth() client method used a bare `defer conn.Close()`. golangci-lint
runs with new-from-merge-base, so although the 39 sibling methods use the same
bare form (grandfathered), the newly added line trips errcheck. Drop the result
explicitly to satisfy the linter.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8
* fix(depth): bump depth-anything.cpp to v0.1.1 (embeddable CMake)
v0.1.0 (b515c31) used ${CMAKE_SOURCE_DIR} for its include dirs, which
points at the parent project when built via add_subdirectory() as this
backend does, so the container build failed with missing stb_image.h /
da_gguf_keys.h. v0.1.1 (2d42897) switches to project-relative paths.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8
* fix(depth): resolve gosec findings in the backend wrapper
The code-scanning gate flagged three new failure-level alerts in
godepthanythingcpp.go (gosec runs with -no-fail; GitHub gates on new alerts):
- G301: export dirs were created with 0o755. Tighten to 0o750 (no world
access needed for backend-written export output).
- G304: writeDepthPNG creates req.GetDst(). That path is chosen by the
LocalAI core as the intended output destination (same pattern every
image backend uses), not attacker input, so annotate with #nosec G304
and document why.
The remaining G103 "audit unsafe" notes on the unsafe.Slice C-buffer copies
are warning-level (the same purego interop whisper/parakeet use) and do not
gate the check, per the supertonic exclusion precedent in secscan.yaml.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8
* fix(depth): bump depth-anything.cpp to v0.1.2 (CUDA cross-build arch)
v0.1.1 forced CMAKE_CUDA_ARCHITECTURES=native, which breaks the GPU-less
l4t/cublas CI builds (nvcc "Unsupported gpu architecture 'compute_'" on
CMake 3.22). v0.1.2 (442eea4) drops the override and lets ggml pick its
default cross-build arch list.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(config): add chat_template_kwargs model field + resolver
Adds the ChatTemplateKwargs model-config map and RequestMetadata carrier,
plus ResolveChatTemplateKwargs which layers the config map under coerced
request metadata. Foundation for generic jinja chat-template kwargs (issue #10329).
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(backend): forward resolved chat_template_kwargs blob to backends
gRPCPredictOpts now merges per-request client metadata over the server-derived
enable_thinking/reasoning_effort (reaching all backends via the standalone keys)
and serialises the resolved chat_template_kwargs map into a JSON blob for
llama.cpp, written last so a client cannot clobber it. Issue #10329.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(http): wire request metadata to config.RequestMetadata
The OpenAI request metadata field was parsed but unused; stamp it onto the
per-request ModelConfig so gRPCPredictOpts forwards it as chat_template_kwargs
overrides. Issue #10329.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): generic chat_template_kwargs merge (drop per-key blocks)
Replace the per-key enable_thinking/reasoning_effort handling in both the
streaming and non-streaming chat paths with a single block that parses the
chat_template_kwargs JSON blob resolved by the Go layer and merges every key
into body_json. New jinja template levers (e.g. preserve_thinking) now need
no C++ change. Issue #10329.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs: document custom chat_template_kwargs (model + per-request)
Issue #10329.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(backend): pin reasoning_effort as a string in the chat_template_kwargs blob
Issue #10329.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(http): e2e guard pinning chat_template_kwargs forwarded to gRPC
Adds an ECHO_PREDICT_METADATA marker to the mock-backend that echoes the
received PredictOptions.Metadata, and an app_test.go spec that drives a real
/v1/chat/completions request (model chat_template_kwargs + per-request metadata
override) and asserts the exact metadata + chat_template_kwargs blob the REST
layer forwards to gRPC. Locks the REST->gRPC contract against regressions. Issue #10329.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(config): grandfather chat_template_kwargs in registry coverage
chat_template_kwargs is a free-form map[string]any (like engine_args, already
on the list), not a scalar the config UI registry can surface, so it is exempt
from the registry-entry requirement. Fixes the TestAllFieldsHaveRegistryEntries
failure introduced by the new field. Issue #10329.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ⬆️ Update leejet/stable-diffusion.cpp
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(stablediffusion-ggml): adapt gosd.cpp to upstream sd_ctx_params_t API
The bump to 5a34bc7 restructured sd_ctx_params_t: the boolean CPU-offload
knobs (offload_params_to_cpu, keep_clip_on_cpu, keep_vae_on_cpu,
keep_control_net_on_cpu) were replaced by backend assignment specs
(backend/params_backend), and vae_decode_only / free_params_immediately
were dropped entirely. The build broke with "no member named ..." on
every arch.
Translate the legacy options we still accept from gallery configs into
the new backend assignment specs, mirroring prepare_backend_assignments()
in the upstream CLI, so offload_params_to_cpu / keep_*_on_cpu keep
working. vae_decode_only is parsed and ignored for config compatibility.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(stablediffusion-ggml): expose backend/params placement options
The upstream bump introduced new sd_ctx_params_t fields for device and
memory placement (backend, params_backend, rpc_servers, max_vram,
stream_layers) plus PuLID-Flux weights (pulid_weights_path). Wire them up
as backend options so models can be split across CPU/GPU/disk/RPC:
- backend: per-component compute placement (e.g. clip=cpu,vae=cuda0)
- params_backend: per-component weight storage incl. disk mmap
- max_vram / stream_layers: graph-cut segmented parameter offload budget
- rpc_servers: offload compute to remote RPC servers
- pulid_weights_path: PuLID-Flux identity injection
The legacy keep_*_on_cpu / offload_params_to_cpu booleans now seed and
compose with the explicit backend/params_backend specs, matching upstream
prepare_backend_assignments(). Option values are taken as everything after
the first ':' so colon-bearing values (rpc_servers host:port) survive
parsing. Documented the new options in the image-generation guide.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(stablediffusion-ggml): distributed RPC across ggml workers
Enable the ggml RPC backend (-DSD_RPC=ON) so image generation can be
sharded across remote rpc-server workers. The ggml rpc-server is
backend-agnostic, so this reuses the exact same worker pool as the
llama.cpp backend - one set of `local-ai worker llama-cpp-rpc` /
`p2p-llama-cpp-rpc` workers accelerates both text and image generation.
RPC servers are selected by precedence:
- the explicit `rpc_servers` option, else
- the LLAMACPP_GRPC_SERVERS env var, which LocalAI's p2p worker mode
populates automatically with discovered workers (the backend inherits
it from the parent process env), so distributed image generation needs
no per-model configuration.
Documented manual and p2p setup in the image-generation guide.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ⬆️ Update antirez/ds4
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(ds4): add Homebrew include/lib prefix for Darwin grpc-proto build
The darwin/metal ds4 backend job runs for the first time on this bump
(it was skipped on prior ds4 PRs) and fails compiling backend.pb.cc with
'google/protobuf/runtime_version.h' file not found.
hw_grpc_proto links neither protobuf::libprotobuf nor gRPC::grpc++, so
the generated proto sources rely on default system include paths. That
works on Linux (/usr/include) but not on macOS, where Homebrew installs
under /opt/homebrew. Add the Homebrew prefix to include/link dirs on
Darwin, mirroring the llama-cpp backend that already builds on Darwin CI.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(ds4): install nlohmann-json on Darwin CI for ds4 backend
After the protobuf include-path fix the ds4 darwin build advances to
compiling dsml_renderer.cpp, which includes <nlohmann/json.hpp> and
#errors when absent. On Linux the header comes from apt nlohmann-json3-dev
in the build image; the macOS runner had no equivalent. Add the
header-only nlohmann-json formula to the shared Darwin backend brew
install/link list and Homebrew cache, alongside the existing deps.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(ds4): build proper OCI image tar for Darwin backend
The darwin packaging referenced scripts/build/oci-pack.sh, which was
never added to the tree, so it fell back to a plain 'tar' that omits
manifest.json. 'local-ai backends install' then rejects the tarball
with 'file manifest.json not found in tar'.
Use './local-ai util create-oci-image' (already built by the 'build'
prerequisite of the backends/ds4-darwin target), mirroring
llama-cpp-darwin.sh, to emit a real OCI image the installer accepts.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(launcher): truncate download status labels to stop dialog blowout
The download progress windows place a ProgressBar and a status Label in the
same VBox. On failure the status label is set to "Download failed: <error>",
and the error commonly contains a long, unbreakable URL/path. A Fyne label
with default settings reports its MinSize as the full single-line text width,
so a long message stretches the window — and the progress bar sharing the
VBox — arbitrarily wide (fixes#10355).
Set Truncation = fyne.TextTruncateEllipsis on the four affected status labels
(the main-window status label plus the status label in each of the three
showDownloadProgress implementations). Truncation collapses the label's
MinSize to roughly one character plus the ellipsis regardless of content, so
the window keeps its intended size. TextWrapWord is not enough because it
cannot break a spaceless URL. The full error text remains visible via the
dialog.ShowError call already present in each path.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The qwen3-tts backend migrated from predict-woo/qwen3-tts.cpp to
ServeurpersoCom/qwentts.cpp (the Makefile QWEN3TTS_REPO already points
there), but the bump_deps matrix still tracked the old repo. That made
the nightly bumper open PRs (e.g. #10334) against the wrong upstream.
Point the matrix entry at the new repo and its master branch.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The Gemma 4 QAT MTP assistant-head gallery entries currently fail to load in the stock llama.cpp backend with unknown architecture errors. Hide them until the assistant GGUFs are verified against the supported backend path.
Assisted-by: Codex:GPT-5 [gh] [git]
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(react-ui): localize SearchableSelect component
Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com>
* feat(react-ui): localize ModelSelector component
Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com>
* fix(react-ui): dynamically localize back navigation caption to match page title
Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com>
* feat(react-ui): localize back navigation state on Models page
Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com>
* feat(react-ui): localize ModelEditor page
Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com>
* fix(react-ui): fix Indonesian typo 'Import' to 'Impor' in importModel locale
Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com>
---------
Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Expands sherpa-onnx Piper TTS coverage in the model gallery. Previously only
5 single-speaker Piper voices shipped (it_IT-paola, en_US-amy, es_ES-davefx,
fr_FR-siwis, de_DE-thorsten). This adds 19 entries:
Italian (it_IT): dii-high, miro-high, riccardo-x_low.
UK English (en_GB): alan (low+medium), alba-medium, aru-medium, cori
(high+medium), dii-high, jenny_dioco-medium, miro-high,
northern_english_male-medium, semaine-medium, southern_english_female
(low+medium), southern_english_male-medium, vctk-medium, sweetbbak-amy.
Each entry mirrors the existing Piper block (sherpa-onnx-tts.yaml base config).
sha256, ONNX path, sample rate and speaker count were read from the actual
release tarballs; licenses and source URLs were taken from each archive's
MODEL_CARD/README rather than assumed:
- dii/miro voices are OpenVoiceOS models under CC BY-NC-SA 4.0 (non-commercial),
labelled as such in both the license field and description.
- cori is LibriVox public-domain (cc0-1.0); OpenSLR-83 voices are CC BY-SA 4.0;
alba/vctk are CC BY 4.0.
- vctk (109), aru (12) and semaine (4) are multi-speaker; tagged accordingly
with a note to select the speaker via the numeric voice id.
The legacy underscore-named southern_english_female_medium duplicate is
intentionally skipped. No backend change is needed: sherpa-onnx auto-detects
single-speaker VITS vs Kokoro, and each tarball ships its own espeak-ng-data.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Composed realtime pipelines (VAD+STT+LLM+TTS) defaulted to unlimited history,
so a long-running session grew every turn and fed the whole conversation to the
LLM until its context window filled. Add an optional pipeline.max_history_items
to cap the trailing items per turn; explicit value (including 0=unlimited) wins
over the per-model-type default. Self-contained any-to-any models keep their
6-item default.
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fix(watchdog): start the live watchdog on a cold enable from Settings (#9125)
The React Settings "Enable Watchdog" master toggle only ever writes the
idle/busy flags; watchdog_enabled is vestigial in that UI. The live
start/stop decision in UpdateSettingsEndpoint keyed off the raw, stale
watchdog_enabled request field, so a cold enable (idle/busy=true,
watchdog_enabled=false) called StopWatchdog() and the watchdog stayed
stopped until the next restart - at which point startup re-derived it
from the idle flag. Net: enabling the watchdog appeared to do nothing.
Derive the run-state from idle||busy as the single source of truth,
mirroring the startup invariant:
- ApplyRuntimeSettings now sets WatchDog = idle||busy whenever either
field is present (so a full disable also brings it down), while an API
client posting only watchdog_enabled keeps its explicit value.
- Add ApplicationConfig.WatchdogShouldRun() mirroring startWatchdog's
gating (idle/busy, LRU eviction, memory reclaimer); the /api/settings
handler uses it to decide start vs stop.
- Belt-and-suspenders: the Settings.jsx master toggle also writes
watchdog_enabled = idle||busy.
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
A model whose ModelFile is a single file (e.g. sherpa-onnx VITS/piper: the
.onnx) failed to load on remote worker nodes because the sibling assets the
backend resolves from the model dir — tokens.txt, lexicon.txt, the
espeak-ng-data / dict directories, Kokoro's voices.bin — were never staged.
Only the declared ModelFile was shipped, so the worker hit "failed to create
sherpa-onnx TTS engine" and TTS produced no audio.
Lean on the existing option-path staging instead of hardcoding filenames:
- stageGenericOptions now also resolves an option value relative to the model's
own directory (not just the frontend models dir), so a shared config can
declare companions with bare names regardless of whether Model includes a
subdirectory; and it expands directory-valued options (e.g. espeak-ng-data)
file-by-file rather than handing a directory fd to the stager.
- gallery/sherpa-onnx-tts.yaml declares the companion assets as option paths
(tokens, lexicon, espeak-ng-data, voices.bin, dict, per-lang lexicons). The
backend ignores these keys and keeps resolving siblings from the model dir;
they exist only so distributed staging ships them. Absent files are skipped.
Adds router_optionstage_test.go covering file + directory companion staging via
the model-dir fallback.
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add mock-backend VoiceEmbed/VoiceVerify (deterministic DC-offset speaker
discrimination) and a verify-mode gated realtime pipeline, then drive the
real HTTP/WS stack: an authorized speaker reaches response.done while an
unauthorized one is dropped before the LLM with a speaker_not_authorized
event.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): add pipeline voice_recognition gate config schema
Add the PipelineVoiceRecognition config block that gates a realtime
pipeline behind speaker verification (identify against the voice
registry, or verify against reference audios), with Normalize defaults
and Validate enum/shape checks. Register the new fields in the config
meta registry so the UI renders them with proper labels/components
(required by the registry-coverage gate).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* fix(realtime): range-check voice gate threshold and floor UI min
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* feat(realtime): add cosineDistance helper for voice gate
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* feat(realtime): add voiceGate identify-mode authorization
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* test(realtime): cover voice gate fail-closed error paths
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* feat(realtime): add voiceGate verify-mode authorization
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* feat(realtime): add voiceGate decide policy helper
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* feat(realtime): add newVoiceGate constructor
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* feat(realtime): gate pipeline responses behind voice recognition
Run speaker verification concurrently with transcription and join on a
hard barrier before generateResponse, so unauthorized utterances never
reach the LLM, tools, or TTS. Supports identify (registry) and verify
(reference) modes with multiple authorized speakers, per-utterance or
first-utterance checking, and drop-with-event or silent-drop on reject.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* fix(realtime): harden voice gate goroutine lifecycle
Only launch the verification goroutine on the transcription path and
drain it before the temp WAV is removed on the transcription-error
return, so an in-flight backend read never races the deferred cleanup.
Drop the write-only voiceMatched field; log the matched speaker instead.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* docs(realtime): document the voice_recognition pipeline gate
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* fix(realtime): fail closed on an incomplete voice_recognition block
A present voice_recognition block with no model previously disabled the
gate silently, authorizing every speaker. Treat block presence as the
intent signal and reject an empty model in Validate, so the session is
refused instead of running unprotected.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* test(realtime): integration-test the voice gate through commitUtterance
Drive the real commitUtterance path (gate goroutine, hard join before the
LLM, reject event, when:first session trust) with the existing
transport/model doubles: authorized speakers reach a full response,
unauthorized ones are dropped before the LLM with a speaker_not_authorized
event, backend errors fail closed, drop_silent stays quiet, and when:first
trusts the session after one match.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Add the localai-org/localai-realtime-demo Go client to the README
Examples list and to the realtime docs (integrations + realtime feature
page). Refresh the Latest News section with June 2026 highlights pulled
from history since v4.3.0: realtime pipeline streaming, the parakeet.cpp
and CrispASR speech work, new backends (locate-anything.cpp, Ideogram4,
llama.cpp video input), and distributed-mode hardening.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Add a full Korean locale (core/http/react-ui/public/locales/ko/, 13 namespaces,
840 keys, full parity with en/) and register ko in SUPPORTED_LANGUAGES
(core/http/react-ui/src/i18n/index.js). All i18next {{interpolation}} and
_one/_other plural keys preserved; brand/model names kept untranslated.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: moduvoice <moduvoicr77@gmail.com>
* feat(omnivoice-cpp): add C wrapper + CMake/Makefile build over OmniVoice ov_* ABI
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(omnivoice-cpp): add option/language parsing + WAV framing helpers with tests
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(omnivoice-cpp): wire purego binding with TTS + streaming TTSStream
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* build(omnivoice-cpp): wire backend into root Makefile
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(omnivoice-cpp): add build matrix entries + dep-bump registration
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(omnivoice-cpp): register backend meta + image entries
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(omnivoice-cpp): expose as preference-only importable backend
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(gallery): add omnivoice-cpp TTS models (Q8_0 default + BF16 HQ)
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(omnivoice-cpp): document the OmniVoice TTS backend
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(omnivoice-cpp): add env-gated e2e for TTS + streaming
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(omnivoice-cpp): honor tts.audio_path/tts.voice config as default cloning reference
The model config tts.audio_path (ModelOptions.AudioPath) and tts.voice now
provide a default voice-cloning reference used when a request omits Voice, so a
cloned voice can be pinned in the model YAML instead of passed per request. A
per-request voice still overrides. Paths resolve relative to the model dir.
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(omnivoice-cpp): add missing omnivoice-cpp-development backend meta
Mirrors the whisper/vibevoice convention: a -development meta aggregating the
master-tagged image variants (the production meta and per-variant prod+dev image
entries already existed; only the development meta aggregator was missing).
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Wire the Kokoro model family into the sherpa-onnx backend (which only
supported VITS/Piper before) and add gallery voices for Italian, English,
Spanish, French and German plus a multilingual Kokoro model.
- csrc/shim.{c,h}: kokoro_* config setters (model/voices/tokens/data_dir/
dict_dir/lexicon/lang/length_scale) mirroring the VITS path, with the
matching frees in tts_config_free.
- backend.go: loadTTS now detects a Kokoro model (a voices.bin beside the
ONNX) and routes to configureKokoroTTS, otherwise configureVitsTTS.
Kokoro picks up espeak-ng-data, the jieba dict and the per-language
lexicons (only one English variant, to avoid tens of thousands of
duplicate-word warnings at load); the language= option hints the lang.
- backend_test.go: functional test for isKokoroModel detection.
- gallery: 5 Piper VITS voices (it_IT-paola, en_US-amy, es_ES-davefx,
fr_FR-siwis, de_DE-thorsten) + kokoro-multi-lang-v1.0, served through
sherpa-onnx-tts.yaml with native streaming TTS.
Verified by building the backend and synthesizing with a real Piper and
Kokoro model (31/31 specs pass, including real-model synth smokes).
Assisted-by: Claude:claude-opus-4-8 gofmt golangci-lint go-test
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(xsysinfo): make reported system RAM total cgroup/lxcfs-aware (#8059)
GetSystemRAMInfo derived Total from memory.TotalMemory(), which on Linux
uses syscall.Sysinfo().Totalram - the HOST kernel total. lxcfs/LXD does
NOT virtualize that value, while MemAvailable (used for Free/Available)
IS virtualized. Inside an LXD/container with a 128Gi host but a ~10Gi
container view this produced Total=128Gi, Available=10Gi => Used=118Gi,
reporting ~92% RAM usage on an idle container.
Derive Total instead from the minimum of all non-zero, non-unlimited
candidates: cgroup v2 memory.max, cgroup v1 memory.limit_in_bytes (the
kernel unlimited sentinel is ignored), /proc/meminfo MemTotal (which
lxcfs virtualizes), and the syscall.Sysinfo total as the bare-metal
fallback. On bare metal every candidate is unlimited or equals the host
total, so behavior is unchanged.
The selection/parsing lives in a pure function chooseTotalMemory(...)
taking file CONTENTS, unit-tested without a real LXD host; OS file
reads stay in a thin wrapper.
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(agents): make React agent chat timestamps format-agnostic
The agent SSE bridge emits the json_message timestamp in three different
encodings depending on deploy mode: an RFC3339 string (standalone agent
pool), Unix milliseconds (local dispatcher), and Unix nanoseconds (the
older NATS path). The React AgentChat handler passed data.timestamp
straight through, so the standalone string and any numeric value outside
the millisecond range rendered as "Invalid Timestamp" or a constant
epoch-ish time.
Add a small pure helper, normalizeTimestampMs, that accepts an RFC3339
string or a numeric epoch in s/ms/us/ns and returns JS milliseconds,
falling back to Date.now() on null/empty/unparseable input. Use it in
the json_message handler so the rendered time is correct regardless of
which backend path produced it.
Fixes#9867
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Two small visual fixes in the React UI:
- Talk page pipeline summary: the four-column grid used
`repeat(4, 1fr)`, which resolves to `minmax(auto, 1fr)` so each track
refuses to shrink below the min-content width of its `nowrap` model
name. Long names (e.g. a verbose GGUF LLM id) blew the grid out past
the container despite the per-cell ellipsis styling. Switching to
`minmax(0, 1fr)` lets the tracks shrink and the ellipsis take effect.
- Sidebar user avatar: the desktop collapsed look centers the avatar via
`.sidebar.collapsed .sidebar-user{-link}` rules, but the tablet
icon-rail (640-1023px) collapses visually through `.sidebar:not(.open)`
without necessarily carrying the `.collapsed` class, so the avatar kept
its left-aligned negative margins and looked misaligned. Mirror the
centering rules under `.sidebar:not(.open)`.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The gallery has metal-ds4 / metal-ds4-development entries, and the build
recipe exists (make backends/ds4-darwin, special-cased in
backend_build_darwin.yml), but ds4 was never listed in the darwin matrix,
so no metal-darwin-arm64-ds4 image was ever published and the entries
dangled.
- Add ds4 to the darwin matrix (includeDarwin), mirroring the llama-cpp
form (the reusable workflow builds it via 'make backends/ds4-darwin').
- Fix inferBackendPathDarwin in scripts/changed-backends.js to map ds4 to
backend/cpp/ds4/ (like llama-cpp): ds4 is C++ but the matrix entry carries
lang=go, so without this its darwin build would only ever run on a release
(FORCE_ALL), never incrementally when backend/cpp/ds4 changes.
sherpa-onnx and speaker-recognition are already in the darwin matrix on
master and are not changed here.
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(gallery): correct meta-backend definitions in backend/index.yaml
Backends that ship per-platform images must be meta backends (a capabilities
map and NO uri) so the right variant is auto-selected per platform - mirroring
llama-cpp/whisper. Several entries were misdefined; fixed here:
- Concrete base + metal sibling (could not select the Apple Silicon variant):
silero-vad, piper, kitten-tts, local-store (+ their -development). Converted
each anchor to a meta and added the cpu-<name> concrete.
- mlx family (mlx, mlx-vlm, mlx-audio, mlx-distributed + -development): anchor
had both a uri AND a capabilities map, so IsMeta() was false and the map was
ignored (always resolved to the metal-darwin image); the metal-<name> target
did not exist. Removed the uri and added the missing metal-<name> concretes.
- Dangling capability targets: diffusers/kokoro nvidia-l4t-cuda-12 repointed to
the existing nvidia-l4t-<name> concrete; coqui nvidia-cuda-13 key removed
(no cuda13-coqui image).
- locate-anything: the meta existed but its concrete entries were never added,
so it was un-installable on every platform. Added the full concrete set plus
the locate-anything-development meta, mirroring rfdetr-cpp. Image tags grounded
against the published quay.io tags.
- trl (cuda12/13): repointed the stale 'cublas-cuda12/13-trl' image tags to the
actually-published 'gpu-nvidia-cuda-12/13-trl' tags (fixes#9236).
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(gallery): make opus a meta backend so the platform variant is auto-selected (#9813)
The realtime/WebRTC path loads the "opus" codec backend by name, but on
macOS arm64 only "metal-opus" is installable, so Load("opus") failed with
"opus backend not available".
The root cause: unlike llama-cpp and whisper, the opus entry was a concrete
CPU backend (it carried a uri and no capabilities map) rather than a meta
backend, so nothing mapped "opus" to the platform-appropriate variant.
Restructure opus to mirror llama-cpp/whisper: "opus" becomes a meta backend
with a capabilities map (default -> cpu-opus, metal -> metal-opus) and no
uri; the CPU image moves to a new "cpu-opus" concrete (and its dev variant
to "cpu-opus-development"). Installing "opus" now resolves to metal-opus on
Apple Silicon and cpu-opus elsewhere, and Load("opus") works on every
platform via the meta pointer - so the realtime endpoint needs no special
casing. This reverts the realtime_webrtc.go resolution helper from the
earlier approach in favor of the gallery-level fix.
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(neutts): pin torchaudio to match torch to avoid ABI mismatch (#9798)
neucodec pulls torchaudio transitively but it was unpinned, so an
incompatible torchaudio could be resolved against the pinned torch==2.8.0,
producing the 'undefined symbol: torch_library_impl' load failure. Pin
torchaudio==2.8.0 alongside torch in the cpu and cublas12 requirements.
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(mcp): bump cogito to handle non-text tool result content
Fixes#10101: the API panicked with "interface conversion: mcp.Content
is *mcp.ImageContent, not *mcp.TextContent" when an MCP tool returned an
image. Upstream cogito PR #50 replaced the unchecked TextContent
assertion in the tool-result loop with a contentToString type-switch
that handles image (and other non-text) content blocks gracefully.
Bump github.com/mudler/cogito to v0.10.1-0.20260609212329-bf4010d31047,
which includes the fix.
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Extends the piper voice set with a couple of voices per language for 42 more
languages (Arabic, Bulgarian, Catalan, Czech, Welsh, Danish, Greek, Spanish,
Basque, Persian, Finnish, French, Hindi, Hungarian, Indonesian, Icelandic,
Georgian, Kazakh, Luxembourgish, Latvian, Malayalam, Nepali, Dutch, Norwegian,
Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Albanian, Swedish,
Swahili, Telugu, Turkish, Ukrainian, Urdu, Vietnamese, Chinese, ...), run
through the crispasr backend's backend:piper engine and hosted at
LocalAI-Community/piper-voices-GGUF.
All converted from rhasspy/piper-voices with CrispASR's convert-piper-to-gguf.py
and screened end-to-end on the pinned engine. Only single-speaker low/medium
voices are included; high-quality decoders and multi-speaker models segfault and
are excluded (e.g. zh_CN-chaowen dropped, huayan kept).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(config): gate top_k=40 default on backend family (#6632)
SetDefaults injected top_k=40 (llama.cpp's sampling default) for every
model config regardless of backend. That value is wrong for backends
whose native default differs: mlx_lm's intended default is top_k=0
(disabled) and mlx does not remap 0->40, so a client that omits top_k
silently got 40 shipped to mlx, changing sampling. The mlx backend's own
getattr(request,'TopK',0) fallback is dead because proto3 int32 is always
present.
Gate the injection on backend family via UsesLlamaSamplerDefaults: keep
top_k=40 for the llama.cpp family and for the empty/auto backend (the GGUF
auto-detect path resolves to llama.cpp, so existing behavior is preserved),
but leave TopK nil for the known non-llama backends (mlx, mlx-vlm,
mlx-distributed). gRPCPredictOpts now sends 0 when TopK is nil, which is
the value mlx actually wants.
Only TopK is gated - the confirmed bug. The sibling sampler defaults
(top_p, temperature, min_p) are left global to avoid widening scope and
introducing nil-deref risk; revisit per-backend if needed.
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
A transcription session.update that carries only a language (no model) —
e.g. a client forcing the STT input language — has an empty
Transcription.Model. updateSession unconditionally copied that into
session.ModelConfig.Pipeline.Transcription, blanking the pipeline's
configured transcription backend. The next utterance then transcribed
against an empty model and the backend RPC failed with "unimplemented"
(surfaced to the client as transcription_failed), so transcription
silently stopped whenever a language was selected.
Only adopt the incoming transcription model when it is non-empty, and
preserve the existing model otherwise (mirroring updateTransSession).
Signed-off-by: mudler <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Agent chat replies rendered a broken timestamp in the web UI
("Invalid Timestamp" / "12:00 AM", identical for every reply) because
the SSE timestamp unit was inconsistent across producers.
EventBridge.PublishEvent emitted Unix nanoseconds while the local
dispatcher (dispatcher.go) already emitted Unix milliseconds, and the
React UI fed the value straight into `new Date(ts)` after dividing by
1e6. Nanoseconds also overflow JS's safe-integer range (~1.7e18).
Standardize on Unix milliseconds: switch PublishEvent to UnixMilli and
drop the /1e6 conversion in AgentChat.jsx so both SSE paths agree and
match the React UI's expectation. Add a regression test asserting the
published timestamp is in milliseconds.
* fix(darwin): never package a go backend build tree as a working image
The darwin/arm64 vibevoice-cpp image shipped the source tree with a
half-built CMake directory (build-libgovibevoicecpp-fallback.so/) and no
backend binary, so the backend could never start: run.sh exec'd a
vibevoice-cpp binary that was not in the package and LocalAI timed out
waiting for the gRPC service.
Two durable, backend-agnostic defenses:
- backend/go/vibevoice-cpp/Makefile: mirror whisper's cleanup discipline so a
partial CMake tree cannot survive into packaging. Run `make purge` before
each variant build and `rm -rfv build*` after. The old recipe only removed
its build dir after a successful `mv`, so a failed build left the half-built
tree behind.
- scripts/build/golang-darwin.sh: before creating the OCI image, remove any
stray build-* directory and assert that the binary run.sh launches actually
exists. A build that produced no binary now fails the job loudly instead of
publishing a source tree as a working backend. The binary name is derived
from run.sh's `exec $CURDIR/<binary>` line (parakeet-cpp launches
parakeet-cpp-grpc, so it is not always ${BACKEND}) with a ${BACKEND}
fallback.
The underlying native build failure that left vibevoice-cpp half-built still
needs to be reproduced and fixed on Apple Silicon; this change ensures such a
failure can never again be published as a working image.
Refs #10267
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(vibevoice-cpp): build libvibevoice.a on darwin (link target, not path)
The darwin build failed with:
No rule to make target 'vibevoice/libvibevoice.a', needed by
'libgovibevoicecpp.so'. Stop.
The upstream vibevoice project is added with add_subdirectory(... EXCLUDE_FROM_ALL),
so its `vibevoice` static-library target is only built when something links it
as a target. The Apple branch linked only `$<TARGET_FILE:vibevoice>` - a bare
archive path with no target reference - so CMake never emitted a rule to build
libvibevoice.a, while the Linux branch worked because it passes the `vibevoice`
target name inside the --whole-archive flags.
Link the `vibevoice` target on Apple (establishing the build dependency) and
apply -force_load as a separate link option to keep whole-archive semantics so
purego can dlsym the vv_capi_* symbols.
Refs #10267
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Vision-language checkpoints such as mlx-community/gemma-4-E4B-it-qat-4bit
declare the "image-text-to-text" pipeline tag on HuggingFace. The mlx
importer hardcoded backend "mlx" for every mlx-community model, so these
VLMs were served by the text-only mlx-lm backend whose tokenizer does not
carry the processor chat template. The template was never applied and the
model produced degenerate, looping output that echoed the prompt.
Detect the "image-text-to-text" pipeline tag in the importer and route those
models to mlx-vlm, which applies the processor-aware chat template. An
explicit backend preference still wins.
As a defensive backstop, the mlx backend now warns loudly when the loaded
model has no chat template, so a misrouted VLM surfaces the problem instead
of silently looping.
Fixes#10269
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ⬆️ Update mudler/parakeet.cpp
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(parakeet-cpp): close streaming segments on <EOB> after ABI v5 eou/eob split
parakeet.cpp ABI v5 (the pin this PR bumps to) splits the streaming JSON
"eou" flag: in v4 "eou":1 fired for either <EOU> (end of utterance) or
<EOB> (backchannel); in v5 "eou" means <EOU> only, with a new separate
"eob" field for the backchannel token.
The streamSegmenter closed a segment on "eou" alone, so after the bump a
backchannel token would silently stop ending a segment and merge into the
next utterance. Read the new "eob" field and flush on either signal to
preserve the v4 segmentation boundaries. The flat stream_feed eou_out path
is unaffected: its mask is still non-zero for either event.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
CrispASR's piper backend phonemizes non-English text via espeak-ng (dlopen,
the MIT-clean path; English uses a built-in G2P). The FROM scratch crispasr
image shipped none of it, so non-English piper voices loaded but failed
synthesis with "phonemization failed". Bundle the espeak-ng runtime so they
work:
- Dockerfile.golang: install espeak-ng-data + libespeak-ng1 and its libpcaudio0
/ libsonic0 deps in the crispasr builder (espeak's dlopen fails without the
latter two).
- package.sh: copy libespeak-ng.so.1, libpcaudio.so.0, libsonic.so.0 into
package/lib/ and the espeak-ng-data dir into the package root.
- run.sh: export CRISPASR_ESPEAK_DATA_PATH so the bundled data is found.
Add 9 single-speaker piper voices (de/en/it, incl. Italian paola + riccardo) to
the gallery, run through backend:piper, hosted at
LocalAI-Community/piper-voices-GGUF (converted from rhasspy/piper-voices with
CrispASR's convert-piper-to-gguf.py). Only single-speaker low/medium voices are
included; the engine does not yet support multi-speaker or high-quality piper
decoders.
All 9 verified end-to-end: each synthesizes a WAV at the model's native sample
rate using only the image-bundled espeak payload.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
CrispASR's piper backend returns PCM at the voice's native rate (from the GGUF
piper.sample_rate key: 16 kHz for x_low/low, 22.05 kHz for medium/high) and does
not resample, but the Go WAV encoder hardcoded 24000 Hz. Every piper voice was
therefore written with a wrong header and played back at the wrong pitch/speed.
Read piper.sample_rate from the model's GGUF metadata at Load via the vendored
gguf-parser-go and use it for the WAV header, falling back to the 24 kHz default
for the other CrispASR TTS engines (vibevoice/orpheus/chatterbox/qwen3-tts) that
emit 24 kHz and carry no such key.
Adds unit specs (minimal crafted GGUFs + WAV-header decode) and an env-gated
end-to-end spec (CRISPASR_PIPER_MODEL_PATH). Verified e2e: en_GB-cori-medium
synthesizes a 22050 Hz WAV through backend:piper.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Neither the sherpa-onnx nor the speaker-recognition backend had a
darwin/arm64 image, so `local-ai backends install` failed with "no child
with platform darwin/arm64" on macOS. This left /v1/audio/diarization (the
sherpa-onnx path) and /v1/voice/embed without any usable backend on Apple
Silicon.
Both backends build on darwin/arm64:
- sherpa-onnx (Go) already fetches the onnxruntime osx-arm64 runtime in its
Makefile; it only needed a darwin matrix entry (build-type metal, lang go,
like whisper and silero-vad).
- speaker-recognition (Python) needed a requirements-mps.txt so the mps build
installs plain onnxruntime (which ships a macOS arm64 wheel) instead of the
onnxruntime-gpu pulled by its base requirements (which does not).
Add both to the includeDarwin build matrix, wire the metal capability and
metal image aliases into the gallery, and add the speaker-recognition
requirements-mps.txt.
Fixes#10268
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
ggml leaves GGML_CUDA_GRAPHS off by default. Passing -DGGML_CUDA_GRAPHS=ON
for cublas builds lets the CUDA backend capture and replay the compute
graph for a small free speedup (about 1% measured on a GB10, never
negative). It is not gated by parakeet.cpp's CMake options, so it passes
straight through to ggml.
Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(router): score classifier production-readiness
Conversation trimming runs through the classifier model's chat template
and trims by exact token count, sized to the model's n_batch which is
now scaled to context so long probes can't crash the backend. Missing
chat_message templates are a hard error at router build time. Router-
facing factories (Embedder/Scorer/Reranker/TokenCounter) re-resolve
ModelConfig per call so a model installed post-startup doesn't bind a
stub Backend="" config and silently fall into the loader's auto-
iterate path.
New 'vector_store' backend trace recorded inside localVectorStore on
every Search/Insert — including the backend-load-failure path that
previously vanished into an xlog.Warn — with outcome tagging
(hit/miss/empty_store/backend_load_error/find_error/insert_error/ok).
Companion cleanup drops misleading similarity:0 and input_tokens_count:0
from non-hit and text-mode traces.
Gallery local-store-development aliases to 'local-store' so the master
image satisfies pkg/model.LocalStoreBackend lookups from the embedding
cache.
Misc: llama-cpp TokenizeString reads the correct 'prompt' JSON key
(the original bug); ModelTokenize nil-guard; non-fatal mitm proxy
startup; PII 'route_local' renamed to 'allow' with docs/UI in sync;
model-editor footer no longer eats the edit area on small screens;
several config-editor template/dropdown/section fixes.
Tests: e2e router specs (casual/code-hint + long-conversation trim),
vector_store trace specs, lazy-factory specs, gallery dev-alias
resolution, Playwright trace badge + scroll regression.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(backend): auto-size batch to context for embedding and rerank models
Embedding and rerank models pool over the whole input in a single physical batch (n_ubatch). With batch left at the 512 default, the backend rejects longer inputs with "input is too large to process", silently capping a large-context embedder (e.g. 8k/32k) at 512 tokens. Size n_batch to the context for these single-pass usecases, mirroring the existing FLAG_SCORE behaviour; an explicit batch: still wins.
Extracts EffectiveContextSize/EffectiveBatchSize from grpcModelOpts so the effective decode window has one home for other callers to reuse.
Adds an e2e-aio regression test that embeds a >512-token input. The AIO embedding model is switched to nomic-embed-text-v1.5 (2048 context) because the previous granite model was capped at 512 tokens and could not exercise the larger batch.
Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(gallery): raise arch-router scoring output cap via parallel:64
Scoring decodes the whole prompt+candidate in a single llama_decode and
reads one logit row per candidate token. The vendored llama.cpp server
caps causal output rows at n_parallel, so the default of 1 aborts with
GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) on multi-token route
labels. Set options: [parallel:64] on both arch-router quant entries to
lift the cap; kv_unified (the grpc-server default) keeps the full context
per sequence, so this does not split the KV cache.
Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Hugo fails the gh-pages build with REF_NOT_FOUND because the relref
in model-configuration.md uses the 'docs/' prefix; refs are resolved
relative to content/, so the page lives at 'features/openai-realtime'
(as the other ref in the same file already uses).
Assisted-by: Claude Code:claude-fable-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): pipeline streaming + disable_thinking config
Add a nested pipeline.streaming.{llm,tts,transcription} block plus
pipeline.disable_thinking, with StreamLLM/StreamTTS/StreamTranscription/
ThinkingDisabled helpers. Pointer-bools so unset keeps the unary path;
existing configs are unaffected. Wiring into the realtime handler follows.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): sentence segmenter for streamed LLM->TTS pipelining
streamSegmenter accumulates streamed LLM tokens and emits complete
sentence/clause segments (terminator+whitespace, or newline) so TTS can
synthesize each segment as it completes instead of waiting for the whole
reply. Pure helper; the streaming handler wiring consumes it next.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): streaming TTS/transcription methods on Model interface
Add TTSStream and TranscribeStream to the realtime Model interface and
implement them on wrappedModel (delegating to backend.ModelTTSStream /
ModelTranscriptionStream) and transcriptOnlyModel. ttsStream adapts the
backend's WAV-framed stream (44-byte header carrying the sample rate, then
PCM) into raw PCM + sample rate for the realtime transports. Handler wiring
that consumes these (flag-gated) follows.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): emitSpeech with flag-gated streaming TTS
emitSpeech synthesizes a piece of text and forwards audio to the client,
streaming one output_audio.delta per backend PCM chunk when the pipeline
sets streaming.tts, or one delta for the whole utterance otherwise. WebRTC
gets raw PCM (it resamples internally); WebSocket gets base64 PCM at the
session rate. It emits no transcript/audio-done events so a streamed reply
can be split into multiple spoken segments sharing one response.
Adds fakeModel/fakeTransport test doubles for the realtime Model/Transport
interfaces, driving streaming assertions deterministically.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): route response audio through emitSpeech (streaming TTS)
Replace the inline unary TTS block in the response handler with emitSpeech,
which streams a response.output_audio.delta per backend PCM chunk when
pipeline.streaming.tts is set and otherwise preserves the single-delta unary
behaviour. emitSpeech returns the accumulated base64 audio, stored on the
conversation item as before. Transcript and audio-done events stay in the
handler so later per-segment streaming can reuse emitSpeech.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): streaming transcription text deltas
Add emitTranscription and route commitUtterance through it. With
pipeline.streaming.transcription set it streams each transcript fragment as
a conversation.item.input_audio_transcription.delta via TranscribeStream
then a completed event; otherwise it preserves the single completed-event
unary behaviour. Returns the final transcript for response generation.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): pipeline disable_thinking maps to enable_thinking off
applyPipelineThinking forces the LLM's ReasoningConfig.DisableReasoning when
pipeline.disable_thinking is set, which gRPCPredictOpts turns into the
enable_thinking=false backend metadata. Applied at newModel construction on
the per-session LLM config copy, so it doesn't leak to other model users and
needs no realtime-specific request plumbing.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): speechStreamer for token-streamed LLM->TTS
emitSpeech now returns raw PCM (caller base64-encodes) so streamed segments
accumulate correctly. speechStreamer consumes streamed LLM tokens: it strips
reasoning via the streaming ReasoningExtractor, emits a transcript delta per
content fragment, and sentence-pipes content into emitSpeech so each sentence
is synthesized as soon as it's ready. Handler wiring (plain-content turns)
follows.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): wire streamLLMResponse for token-streamed replies
triggerResponseAtTurn takes a streamed path when pipeline.streaming.llm is
set, the turn has no tools, and audio is requested: streamLLMResponse
announces the assistant item, drives the LLM token callback through a
speechStreamer (reasoning-stripped transcript deltas + sentence-piped TTS),
and emits the terminal events. Tool turns and non-streaming pipelines keep
the existing buffered path unchanged, so this is strictly opt-in.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(realtime): document pipeline streaming + disable_thinking
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(realtime): register pipeline streaming/thinking config fields
TestAllFieldsHaveRegistryEntries (core/config/meta) requires every config
field to have a meta registry entry. The four new pipeline fields
(disable_thinking, streaming.{llm,tts,transcription}) had none, failing
tests-linux/tests-apple. Add toggle entries for them.
Also handle the os.Remove return in realtime_speech_test.go to satisfy
errcheck (golangci-lint).
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(realtime): always strip reasoning from spoken output
disable_thinking maps to ReasoningConfig.DisableReasoning=true on the LLM
config, which the backend reads as enable_thinking=false. But the realtime
handler reads that SAME config to drive reasoning extraction, and there
DisableReasoning=true means "skip stripping". PredictConfig() returns this
LLM config, so both the streamed (speechStreamer) and buffered realtime
paths stopped stripping <think>…</think> exactly when disable_thinking was
on — leaking raw reasoning to the client whenever the model ignored the
enable_thinking hint (e.g. lfm2.5).
Add spokenReasoningConfig() which clears DisableReasoning for extraction
(keeping custom tokens/tag pairs) and route both realtime paths through it.
Spoken output now always strips reasoning, independent of the backend
suppression hint.
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(realtime): clean TTS temp path before read (gosec G304)
emitSpeech reads the WAV file the TTS backend wrote. The read moved here
from realtime.go, so code-scanning flagged it as a new G304 alert even
though the path is backend-controlled (a temp file), not user input.
Wrap it in filepath.Clean — a real path normalization that also clears
the alert, keeping with the repo's no-#nosec convention.
Assisted-by: Claude:claude-opus-4-8 gosec, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(realtime): buffer whole message for TTS, drop sentence segmenter
Per review (richiejp): the sentence segmenter pipelined unary TTS by
splitting on ASCII .!?/newline, which does nothing for languages without
those boundaries (CJK/Thai) — there it already degraded to buffering the
whole message anyway.
Replace it with a uniform model: stream the LLM transcript live, buffer the
full message, then synthesize it once. emitSpeech already streams the audio
chunks when the backend implements TTSStream and falls back to a single
unary delta otherwise, so this is real streaming TTS where supported and a
clean whole-message synthesis elsewhere — no per-sentence emulation, no
language assumptions. speechStreamer becomes transcriptStreamer (transcript
deltas only); the whole-message synthesis moves into streamLLMResponse.
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): stream tool-call turns via tokenizer-template autoparser
Per review (richiejp): tool-call deltas exist, so streaming should work with
tools too. It does — for models that use their tokenizer template. The C++
autoparser then clears reply.Message and delivers content + tool calls via
ChatDeltas, so the streamed transcript carries only spoken content (no
tool-call JSON leak) and the tool calls are parsed from the final response.
- Drop the len(tools)==0 gate; stream when no tools OR use_tokenizer_template
(grammar-based function calling still buffers, since its call is emitted as
JSON in the token stream and would leak into the transcript).
- streamLLMResponse takes tools/toolChoice/toolTurn, reads ChatDelta content
in the token callback, parses tool calls from the final ChatDeltas, and
creates the assistant content item lazily so a content-less tool turn emits
only the tool calls.
- Extract emitToolCallItems from the buffered path so both paths finalize tool
calls, response.done, and server-side assistant-tool follow-ups identically.
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): script-aware clause chunking + streamed-reply fixes
Opt-in pipeline.streaming.clause_chunking splits the streamed LLM reply
into speakable clauses and synthesizes each as soon as it completes,
lowering time-to-first-audio instead of buffering the whole message. The
splitter is script-aware (rivo/uniseg, pure Go): UAX#29 sentence
segmentation handles CJK 。!? with no whitespace, CJK clause
punctuation (,、;:) and Thai/Lao spaces give finer cuts, and a UAX#14
line-break cap bounds an over-long punctuation-less run. Unlike the old
ASCII .!?/newline segmenter (dropped in 076dcdbe) it does not degrade to
whole-message buffering for CJK/Thai; scripts needing a dictionary
(Khmer/Burmese) stay buffered until a space or end-of-message. Clauses
are synthesized synchronously in the token callback (the LLM keeps
generating into the gRPC stream meanwhile), so audio still starts
mid-generation. Off by default — the whole-message path is unchanged.
Also fix the streamed-reply path and the Talk page:
- Don't swallow streamed autoparser content as reasoning: the
tokenizer-template path already delivers reasoning-free content via
ChatDeltas, so prefilling the thinking start token re-tagged it as an
unclosed reasoning block, leaving no spoken reply. Disable the prefill
on that path; closed tag pairs are still stripped (#9985).
- Generate collision-free realtime IDs (16 random bytes) instead of a
constant, so per-item bookkeeping (cancel, conversation.item.retrieve)
works.
- Key the Talk transcript by the server item_id and upsert entries.
Realtime events arrive over a WebRTC data channel — outside React's
event system — so React defers the setTranscript updaters while
synchronous ref writes in handler bodies run first; the old
index-tracking ref rendered a duplicate assistant bubble on
completion. Upserts by item_id are idempotent and order-independent.
- Drop the partial assistant bubble on a cancelled response (barge-in):
the server discards the interrupted item and sends response.done with
status "cancelled"; mirror that in the UI so the regenerated reply
isn't rendered as a second assistant message.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Assisted-by: Claude:claude-fable-5 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Richard Palethorpe <io@richiejp.com>
fix(vllm): restore compatibility with vLLM >= 0.22 (get_tokenizer moved)
vLLM 0.22 moved get_tokenizer from vllm.transformers_utils.tokenizer
to vllm.tokenizers. Since the backend requirements install vllm
unpinned, freshly built/installed vllm backends currently fail to
start with ModuleNotFoundError: No module named
'vllm.transformers_utils.tokenizer' (surfacing as 'grpc service not
ready' when loading a model).
Use the same try/except version-compat import pattern already used
elsewhere in this file: try the new vllm.tokenizers location first and
fall back to the pre-0.22 path.
Tested on a DGX Spark (GB10, ARM64) with the
cuda13-nvidia-l4t-arm64-vllm backend and vllm 0.22.0: model load, chat
completions and tool calls all work with this patch applied.
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fix(distributed): track in-flight for non-LLM inference methods
InFlightTrackingClient only wrapped a subset of the grpc.Backend
inference methods (Predict, Embeddings, TTS, AudioTranscription, Detect,
Rerank, ...). Methods like VAD were left as embedded passthrough, so
track() never ran for them.
In distributed mode every model is loaded with in_flight=1 as a
reservation; that reservation is only released by the OnFirstComplete
callback, which fires after the first *tracked* inference call completes.
A VAD-only model (e.g. silero-vad) never calls a tracked method, so the
reservation is never released and in-flight stays pinned at 1 forever -
which also blocks the router's idle-eviction logic.
Wrap the remaining unary inference methods (VAD, Diarize, Face*, Voice*,
TokenClassify, Score, AudioEncode, AudioDecode, AudioTransform) with the
same track()/reconcile() pattern. The three bidi-stream constructors
(AudioTransformStream, AudioToAudioStream, Forward) are deliberately left
as passthrough - their inference spans the stream lifetime, not the
constructor call, so track() there would fire onFirstComplete before any
data flows.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The /v1/realtime WebRTC handler created the peer connection with a bare
webrtc.Configuration and no SettingEngine, so pion gathered a host ICE
candidate for every local interface. Under Docker host networking that
includes bridge addresses (docker0/veth, 172.x) a remote browser cannot
route to; the call establishes on a good pair and then drops once ICE
consent freshness checks fail on the unreachable candidates.
Add two opt-in knobs, applied via a pion SettingEngine:
- LOCALAI_WEBRTC_NAT_1TO1_IPS: advertise these IPs as the host candidates
(e.g. the host LAN IP)
- LOCALAI_WEBRTC_ICE_INTERFACES: restrict ICE gathering to these interfaces
Defaults are unchanged (empty => current all-interface behavior).
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Propagate terminal write errors from the chat prompt and explicitly ignore stream close errors during cleanup.
Update chat tests to assert response writer errors so errcheck passes without hiding failed writes.
Tests:
- go test -count=1 ./core/cli/chat
- go test -count=1 ./core/cli
Assisted-by: Codex:GPT-5
Signed-off-by: Ching Kao <0980124jim@gmail.com>
Add an opt-in `local-ai chat` command for testing chat models directly from the terminal without manually sending curl requests.
The command connects to a running LocalAI server, lists available models through the existing OpenAI-compatible API, streams chat completions, and supports interactive commands such as `/models`, `/model`, `/clear`, and `/exit`.
Keep `local-ai run` focused on the server lifecycle so the web UI, API clients, and multiple chat terminals can coexist against the same server.
Document the new command and terminal workflow in the README and CLI docs.
Tests:
- go test -count=1 ./core/cli/chat
- go test -count=1 ./core/cli
Assisted-by: Codex:GPT-5
Signed-off-by: Ching Kao <0980124jim@gmail.com>
* dev knowledge.go structure
Signed-off-by: Pete Chen <petechentw@gmail.com>
* feat(agents): append KB source citations to responses
Render structured KB citations as a Sources block after agent responses, linking each source to the existing raw collection entry endpoint.
Keep long-term memory writes on the original model response so citation blocks do not get stored back into the knowledge base.
Tested with: go test ./core/services/agents
Assisted-by: Codex:gpt-5
Signed-off-by: Pete Chen <petechentw@gmail.com>
* Collect KB citations from tool searches
Signed-off-by: Pete Chen <petechentw@gmail.com>
* fix(agents): append KB sources in local chats
Apply the shared KB citation post-processing to standalone LocalAGI chat responses so the React agent chat receives the same clickable Sources block as the native executor path. Also fix the run target to use the current cmd/local-ai entrypoint.
Assisted-by: Codex:gpt-5
Signed-off-by: Pete Chen <petechentw@gmail.com>
---------
Signed-off-by: Pete Chen <petechentw@gmail.com>
Co-authored-by: shihyunhuang <shihyunhuang88@gmail.com>
Co-authored-by: TLoE419 <tloemizuchizu@gmail.com>
Co-authored-by: Ching Kao <0980124jim@gmail.com>
* fix(reasoning): stop prefilled <think> from swallowing tag-less answers
When a chat template injects the thinking start token into the prompt (so
DetectThinkingStartToken returns e.g. "<think>"), the model's output begins
inside a reasoning block and carries only the closing tag. The non-jinja
autoparser fallback (peg-native "pure content" mode, issue #9985) prepends the
start token so the extractor can pair it with the model's </think>.
But on a COMPLETE response that contains no closing tag, the model answered
directly with no reasoning at all. Prepending the start token there manufactures
an unclosed block that swallows the entire answer into reasoning, leaving the
OpenAI `content` field empty. This breaks short/direct answers — session names,
JSON summaries, any terse completion where the model skips the think block —
which come back with empty content. Regression surfaced by #9991, which added
the defensive prefill extraction to the complete-response paths.
Add reasoning.ExtractReasoningComplete: it only honors a prefilled start token
when the response actually contains the matching closing tag (proof a reasoning
block exists). Genuine reasoning tags already in the content still extract;
tag-less content stays content. Apply it at every complete-response site
(applyAutoparserOverride, realtime, openresponses). The streaming per-token
extractor is intentionally left on ExtractReasoningWithConfig — mid-stream an
as-yet-unclosed block is legitimate and must surface as reasoning deltas.
Also adds reasoning.ClosingTokenForStart and hoists the default reasoning tag
pairs to package scope so both helpers share one source of truth.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(reasoning): cover the enable_thinking=false non-thinking-mode regression
Adds the end-to-end case that actually broke session summaries / auto-titles
and was not covered before: a request with enable_thinking=false against a
<think>-capable model. In non-thinking mode the model emits no reasoning block,
so llama.cpp's autoparser returns ChatDeltas with content set and
reasoning_content empty (verified against stock llama-server: same model with
chat_template_kwargs.enable_thinking=false returns reasoning_content=null,
content="hello"). thinkingStartToken is still "<think>" because it is detected
per-model from the enable_thinking=true render, so the old code prepended it and
swallowed the answer. The test fails without the ExtractReasoningComplete gate.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(llama-cpp): bump to 8f83d6c for mtmd video input support
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): forward video input to mtmd (template + non-template paths)
Wire request->videos() into grpc-server.cpp mirroring the existing image
and audio handling: a video_data build + non-template files extraction, and
input_video chat chunks on the tokenizer-template path. allow_video is
auto-set at model load by the vendored upstream chat_params.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): add video attachment support to the chat UI
Mirror the image/audio attachment path for video: emit video_url content
parts, accept video/* in the picker, keep video files as base64, show a
film icon badge, and render attached video inline with a <video> player.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(llama-cpp): patch mtmd video stdin double-close (heap crash)
Upstream mtmd video input (ggml-org/llama.cpp#24269) double-fcloses the
ffmpeg/ffprobe stdin FILE: feed_stdin() fclose()s the FILE returned by
subprocess_stdin() (which is sp->stdin_file), then subprocess_destroy()
fclose()s the same pointer again -> heap corruption that aborts the
backend on any base64 input_video request (the CLI --video file path is
unaffected). Vendor a one-line fix (null sp->stdin_file after fclose)
via prepare.sh's patches/ until upstream merges it.
Verified e2e with gemma-4-e2b-it-qat-q4_0: video frames decode via
ffmpeg and the model answers correctly (red clip -> 'Red', blue -> 'Blue').
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(llama-cpp): re-pin to upstream #24316, drop vendored stdin patch
Upstream replaced the ad-hoc video stdin handling with a proper RAII
refactor (ggml-org/llama.cpp#24316, "mtmd: refactor video subproc
handling"), which includes the same `sp->stdin_file = nullptr` guard our
patch added (plus join-before-destroy ordering). Re-pin LLAMA_VERSION to
that branch head and drop patches/0001 - it's now redundant.
Verified e2e with gemma-4-e2b-it-qat-q4_0: no crash, video frames decode
and the model answers correctly (red clip -> "Red", blue -> "Blue").
NOTE: #24316 is not yet merged, so this pins to its branch-head commit
(28ca1e60). Re-pin to the squash-merge commit on master once it lands,
otherwise `git fetch` may lose the commit after the branch is deleted.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ⬆️ Update CrispStrobe/CrispASR
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(crispasr): link crispasr-lib CMake target instead of crispasr
The dependency-bump regeneration of this branch reset CMakeLists.txt to
master and dropped the prior link-target fix, reintroducing the
`cannot find -lcrispasr` failure. Upstream CrispASR (f7838a3) defines the
library as the CMake target `crispasr-lib` (with OUTPUT_NAME crispasr);
there is no target named `crispasr`, so target_link_libraries falls back
to a bare `-lcrispasr` linker flag that cannot be resolved. Point the link
at the real target name.
Verified locally: CPU cmake-configure of the bumped source generates a
gocrispasr link line referencing sources/CrispASR/src/libcrispasr.a with no
dangling -lcrispasr.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ⬆️ Update antirez/ds4
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(ds4): link ds4_ssd.o into the backend build
Upstream antirez/ds4 splits the SSD expert-cache into its own ds4_ssd.c
translation unit, whose symbols (ds4_ssd_memory_lock_acquire/release,
ds4_ssd_cache_experts_for_byte_budget, ds4_ssd_auto_cache_plan) are
referenced by ds4.c/ds4_cpu.o. The dependency-bump automation regenerated
this branch from clean master and dropped the prior linkage fix, so the
cpu-ds4 / cublas-ds4 backend builds fail again with undefined references.
Re-apply the ds4_ssd.o linkage GPU-agnostically (mirroring ds4_distributed.o)
in both the backend Makefile (DS4_OBJ_TARGET + the engine-object build rule
for every GPU mode) and CMakeLists.txt (list(APPEND DS4_OBJS ds4_ssd.o)).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Add the remaining official Google Gemma 4 QAT Q4_0 GGUFs (E2B, E4B,
26B-A4B, 31B) next to the existing 12B entry, each shipping its
multimodal mmproj.
Also add three MTP (Multi-Token Prediction) speculative-decoding bundles
that pair each QAT target with a QAT-matched assistant/drafter head:
- 12B <- Janvitos/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF
- 26B-A4B <- boxwrench/gemma-4-qat-mtp-assistant-heads
- 31B <- boxwrench/gemma-4-qat-mtp-assistant-heads
The assistant heads use the gemma4_assistant architecture and are not
standalone chat models, so each entry bundles the target + draft and
sets draft_model together with the draft-mtp spec options
(spec_type:draft-mtp / spec_n_max:6 / spec_p_min:0.75), matching
MTPSpecOptions() in core/config/mtp.go. QAT-matched heads raise draft
acceptance substantially over generic non-QAT heads.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(galleryop): self-evict terminal ops from OpCache.GetStatus
The processingBackends map (the UI 'reinstalling' spinner source) only cleared
an op when a client polled /api/backends/job/:uid. The Manage-page Reinstall and
Upgrade buttons never poll, so completed installs leaked into processingBackends
forever and the backend card spun 'reinstalling' even though the install had
finished. Evict terminal ops on the list read instead; DeleteUUID already
broadcasts the eviction so peer replicas converge.
Reproduced on a live 5-node distributed cluster: 5 backends sat in
processingBackends with underlying jobs reporting completed:true,progress:100.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(nodes): clear pending backend ops behind offline/draining nodes
ListDuePendingBackendOps filters status=healthy, so a backend op queued against
a node that went offline (stale heartbeat) or draining (admin action) was never
retried, aged out, or deleted - it leaked forever and kept the UI operation
spinning. Add DeleteStalePendingBackendOps and run it each reconcile pass:
draining nodes are cleared immediately (model rows already purged), offline
nodes once their heartbeat is older than a grace window (blip protection).
Reproduced on a live cluster: orphaned llama-cpp install rows targeting an
offline (nvidia-thor) and a draining (mac-mini-m4) node sat at attempts=0
indefinitely.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(nodes): stream per-node progress during backend upgrade
The install dispatch subscribed to a per-op progress subject and streamed
per-node download ticks; the upgrade dispatch did a bare 15-minute blocking
NATS round-trip with no subscription, so the UI showed progress:0 the whole
time (the 'reinstalling but nothing happens' report on a slow node).
Thread the op ID through BackendManager.UpgradeBackend -> the distributed
manager -> the adapter, and have the adapter subscribe to the per-op progress
subject before the request (extracted into a shared subscribeProgress helper
reused by install/upgrade/force-fallback). The worker's upgradeBackend now
creates the same DebouncedInstallProgressPublisher installBackend uses. An
upgrade is a force-reinstall, so it reuses SubjectNodeBackendInstallProgress
rather than minting a new subject - no new NATS permission, no new
rolling-update compat surface. Reconciler-driven retries pass empty
opID/onProgress and stay on the silent path.
Reproduced on a live cluster: upgrade of llama-cpp-development on agx-orin-slow
sat at progress:0 for 4+ minutes with no per-node feedback.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(galleryop): persist cancellation + periodically reap orphaned ops
Two distributed gaps surfaced when a replica was killed mid-upgrade on a live
cluster, leaving the backend stuck 'processing' in the UI forever:
1. CancelOperation flipped the in-memory status to cancelled and broadcast a
NATS event but never persisted the terminal status. On the next replica
restart the still-active row re-hydrated straight back into
processingBackends and the UI spun again. It now calls store.Cancel(id) so
the cancel survives a restart.
2. CleanStale (which marks abandoned active ops failed) only ran once on
startup, so an op orphaned AFTER startup - its owning replica's foreground
handler goroutine gone - was never reaped until the next restart. Add
GalleryService.ReapStaleOperations and run it on a 15m ticker (CleanStale
now returns the reaped count for observability).
Neither is covered by the OpCache self-evict fix: an orphaned op never reaches
Processed, so it would never self-evict.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(review): address self-review findings on the distributed install fixes
Three findings from an adversarial review of this branch:
1. CRITICAL - OpCache.GetStatus crashed under concurrent load. m.Map() returns
the live internal map by reference, so deleting from it on the read path was
an unsynchronized write to a map four HTTP handlers poll every ~1s -> a
'concurrent map writes' fatal. Rewritten to iterate a Keys() snapshot, build
a fresh result map, and apply evictions via the locked DeleteUUID after the
loop. Added a -race concurrency regression guard.
2. HIGH - GetStatus evicted failed ops too, hiding them from /api/operations
and breaking the dismiss-failed-op flow (the panel keeps Error != nil ops so
the admin can read the error and click Dismiss). Eviction now fires only for
terminal ops with Error == nil (success/cancelled); failures are retained.
3. MEDIUM - DeleteStalePendingBackendOps missed StatusUnhealthy nodes. A node
marked unhealthy on a NATS ErrNoResponders never transitions to offline
(health.go skips re-marking it), so its pending ops leaked exactly like the
offline case. Unhealthy is now reaped via the same stale-heartbeat grace path
(a fresh-heartbeat node is recovering and keeps its op).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(review-2): don't evict the still-installing soft-path; don't spin on failed ops
Second review pass found two issues:
1. MEDIUM (Go) - OpCache.GetStatus evicted the ErrWorkerStillInstalling
soft-path op. That op is deliberately Processed=true with no error to show a
yellow in-progress state when a worker timed out the NATS round-trip but is
still installing in the background; the reconciler confirms the real outcome
later. Evicting it (and broadcasting OpEnd + marking the DB completed) hid an
install that may still fail. Eviction is now scoped to a clean success
(progress 100 + 'completed', matching the job-poll's historical condition) or
a cancellation - the soft-path (progress != 100) and failures are kept.
2. MEDIUM (React) - the Backends gallery card rendered ANY operation as an
'Installing...' spinner, so a failed op (now intentionally kept in the list
for the OperationsBar error + Dismiss) spun forever. Exclude errored ops from
the card spinner, mirroring Models.jsx (isInstalling already excludes
op.error). The error + Dismiss still surface in the global OperationsBar.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ui): refresh Manage backends table when an operation settles
The Manage backends table fetched installed backends only on mount/after delete
and checked upgrades only on tab activation. After a reinstall/upgrade completed
neither re-ran, so the installed-version cell and the 'update available' badge
stayed stale until the user switched tabs - the op looked like it 'did nothing'.
Watch the operations list (via useOperations) and re-fetch installed backends +
available upgrades whenever the count settles, mirroring the operations.length
watch Backends.jsx already uses. Consolidates the prior tab-activation upgrades
check into the same effect.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
When the models directory holds many GGUF files, startup parsed every
model's full GGUF — including the tokenizer vocab arrays
(tokenizer.ggml.tokens/scores/merges, often >100k entries) — once per
model while guessing defaults. On slow storage (e.g. a models directory
on a Docker volume) those hundreds of thousands of tiny reads dominate
boot time before the HTTP server comes up.
The default-guessing path and the VRAM metadata reader only consume
scalar metadata and array lengths, never the array contents. Parse with
SkipLargeMetadata (seek past large arrays) and UseMMap (fault in a few
header pages instead of issuing per-element read() syscalls). For a
256k-token vocab this cuts the parse from ~524k read() syscalls to 8.
The mapping is released when ParseGGUFFile returns.
Fixes#9790
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
Gemma4 MTP (ggml-org/llama.cpp#23398) registers the prediction head as a
separate `gemma4-assistant` architecture. That assistant GGUF still carries
`<arch>.nextn_predict_layers`, so the architecture-agnostic detection in
HasEmbeddedMTPHead matched it and appended the `spec_type:draft-mtp` defaults.
Unlike the DeepSeek/Qwen embedded-head models, an assistant checkpoint cannot
self-speculate: it is a draft model that requires a paired target context
(`ctx_other`) and throws if loaded alone. Auto-applying the self-spec defaults
to a standalone assistant import therefore produces a broken config.
Guard the detection against draft-only assistant architectures (the `-assistant`
suffix is upstream's naming convention) so importing one no longer yields a
self-speculation config. Two-model target+draft pairing remains expressible
manually via `draft_model:` and is left to a follow-up.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(parakeet-cpp): real segment timestamps (NeMo-faithful)
Offline: replace the single synthetic whole-clip segment with multiple
segments grouped exactly like NeMo's get_segment_offsets - a new segment
after sentence-ending punctuation ('. ? !'), each carrying start/end and
its time-window token ids. The optional model option segment_gap_threshold
(NeMo's unit: encoder FRAMES, default 0=off) adds NeMo's silence-gap split,
converted to seconds via the JSON frame_sec the engine now reports.
Per-segment words are still gated behind timestamp_granularities=["word"];
a zero-word document falls back to a single text segment.
Streaming: when libparakeet.so exposes the ABI v4 JSON entry points
(probed), drive parakeet_capi_stream_feed_json / _finalize_json and
accumulate the streamed per-word timestamps into per-utterance segments
(EOU stays the boundary), so streaming FinalResult segments now carry
start/end. Falls back to the text-only feed against an older library.
Pure-Go specs cover splitWordsIntoSegments (punctuation + gap rules, NeMo
elif order, fallback), transcriptResultFromDoc (multi-segment, token
windows, word-granularity gate), and the streaming segmenter.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(audio): document parakeet-cpp segment timestamps + segment_gap_threshold
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(parakeet-cpp): update model-gated specs for multi-segment output
The offline AudioTranscription specs asserted the old single synthetic
segment (Segments HaveLen(1), Segments[0].Text == res.Text). With
NeMo-faithful segmentation a multi-sentence clip now yields multiple
punctuation-delimited segments, so assert the new contract instead:
one-or-more time-ordered segments, each with text and (under word
granularity) per-segment words whose span tracks the segment start/end.
Caught by running the model-gated suite on the dgx (GB10) against the
real tdt_ctc-110m + realtime_eou models.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* chore(turboquant): bump TheTom/llama-cpp-turboquant to 7d9715f1
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(turboquant): drop obsolete legacy-spec shim after fork rebased
The TheTom/llama-cpp-turboquant fork (pin c9aa86a) rebased past the
upstream common_params_speculative refactor (ggml-org/llama.cpp
#22397/#22838/#22964), the model_tgt rename (#22838) and get_media_marker
(#21962). The old fork-compat shim forced now-wrong legacy code paths,
breaking the build with errors like 'struct common_params_speculative has
no member named mparams_dft / type' and 'server_context_impl has no member
named model'.
Remove the obsolete LOCALAI_LEGACY_LLAMA_CPP_SPEC branches from the shared
grpc-server.cpp (stock llama-cpp and the modern fork both take the modern
path now), and narrow the one remaining gap (the fork still lacks
common_params::checkpoint_min_step) to a dedicated
LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP guard injected by
patch-grpc-server.sh. The patch script now only adds the turbo2/3/4
KV-cache types and injects that one macro.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(turboquant): HIP-port the fork's CUDA additions (copy2d 3D-peer + cudaEventCreate)
The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA APIs that
ggml's HIP/MUSA shim does not provide, breaking the -gpu-rocm-hipblas-turboquant
build. patches/0001-hip-guard-copy2d-peer-fastpath.patch (applied by
apply-patches.sh) ports them:
- Guard ggml_cuda_copy2d_across_devices's 3D-peer copy fast path with
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) so HIP/MUSA fall through
to the existing cudaMemcpyAsync staging fallback (HIP genuinely lacks
cudaMemcpy3DPeerAsync, per the fork's own comment).
- Create the device event in ggml_backend_cuda_device_event_new with the
HIP-aliased cudaEventCreateWithFlags(.., cudaEventDisableTiming) instead of the
un-aliased plain cudaEventCreate, matching this file's own usage elsewhere.
CUDA builds are unaffected.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* ci(turboquant): drop the ROCm/hipblas build flavor
The TheTom/llama-cpp-turboquant fork is not ROCm-clean at the current pin:
beyond the CUDA-API gaps already patched (3D-peer copy, cudaEventCreate),
its llama.cpp base fails to compile the flash-attention MMA f16 kernels for
head-dim 640 under HIP (cols_per_warp evaluates to 0 -> division-by-zero /
non-constant static asserts in fattn-mma-f16.cuh). That is a deep
ggml-on-ROCm kernel issue, not something a small fork patch can paper over.
Drop -gpu-rocm-hipblas-turboquant from the build matrix so turboquant still
ships for cpu / cublas / vulkan / sycl. Re-add it once the fork's HIP path
compiles (or upstream ggml fixes the large-head-dim MMA kernels for ROCm).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The Hugo relearn theme does not provide an "alert" shortcode, so the
docs deploy failed at the Build site step:
failed to extract shortcode: template for shortcode "alert" not found
docs/content/features/distributed-mode.md:136
Convert the warning block to the theme-supported notice shortcode used
everywhere else in the docs.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(stablediffusion-ggml): support Ideogram4 unconditional diffusion model
Bump stable-diffusion.cpp from 1f9ee88 to b9254dd, the upstream commit that
adds Ideogram4 support (leejet/stable-diffusion.cpp#1609). Ideogram4 derives
its classifier-free guidance from a separate unconditional diffusion model,
exposed upstream through the new sd_ctx_params_t.uncond_diffusion_model_path
field.
Wire that field into the gosd wrapper via a new uncond_diffusion_model_path
option. The _path suffix is deliberate: the Go loader only resolves options
whose name contains "path" to an absolute path under the model directory, so
this keeps the option consistent with diffusion_model_path and
high_noise_diffusion_model_path.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(gallery): add Ideogram4 stablediffusion-ggml models
Single-file GGUF weights for Ideogram4 are now published
(stduhpf/ideogram-4-gguf), so add the model to the gallery. Ideogram4 is a
text-to-image model with strong, accurate in-image text rendering, driven by
a Qwen3-VL-8B text encoder and real classifier-free guidance from a separate
unconditional diffusion model (the uncond_diffusion_model_path support added
in the preceding commit).
Two index entries, both built on gallery/virtual.yaml with the full config
inlined in overrides (same pattern as the other models, no dedicated template
file):
- ideogram-4-iq4nl-ggml (4-bit, ~11.6GB diffusion)
- ideogram-4-q8_0-ggml (8-bit, ~20GB diffusion)
Each bundles the diffusion + unconditional GGUF (stduhpf), the
Qwen3-VL-8B-Instruct text encoder (unsloth), and the FLUX.2 VAE (Comfy-Org
mirror, non-gated). cfg_scale is 7 to match the upstream Ideogram4 default,
since it performs real CFG unlike the guidance-distilled Flux/Z-Image models.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(parakeet-cpp): honor request language (multilingual nemotron) on batched + streaming paths
Reads opts.GetLanguage() and threads it through to the new
parakeet_capi_transcribe_pcm_batch_json_lang and parakeet_capi_stream_begin_lang
C-API entry points, both probed with Dlsym so the backend still loads against an
older libparakeet.so (falling back to the non-lang paths, i.e. model default).
parakeet.cpp's batched C-API takes a single target_lang for the whole batch, so
the dispatcher only coalesces same-language requests: a request whose language
differs from the batch leader is held as a single carry-over and becomes the
leader of the next batch, never dropped and never left waiting (including on
shutdown). A new batcher test asserts no dispatched batch is ever mixed-language
and that every submitted request still receives a reply.
Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(gallery): add parakeet-cpp-nemotron-3.5-asr-streaming-0.6b; bump parakeet.cpp pin
Adds the multilingual prompt-conditioned streaming model to the gallery (q8_0
default, OpenMDW-1.1) and bumps the parakeet-cpp backend pin to the parakeet.cpp
commit that ships nemotron support plus batched causal subsampling and the
batched target_lang C-API.
Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
chore(parakeet-cpp): bump pin to banded long-audio attention (843600590)
Update PARAKEET_VERSION to mudler/parakeet.cpp@843600590f
(merge of parakeet.cpp#9). Brings NeMo rel_pos_local_attn banded/Longformer
attention with the chunk-matmul construction: long audio now uses O(T*window)
attention instead of global O(T^2), fixing the encoder OOM on long clips
(~16.6-min clip: 54GB->9.4GB peak, ~4x faster) at NeMo's full [128,128] window.
Short clips are unchanged (global path). No C-ABI change.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
chore: bump LocalAGI and localrecall (index-backed RRF hybrid search)
Bumps the agent stack to pull in the PostgreSQL hybrid-search fix:
- mudler/localrecall -> v0.6.3-...-9a3b3321a9cd (mudler/LocalRecall#46, merged)
- mudler/LocalAGI -> ...-14aed1ae4336 (mudler/LocalAGI#477, merged)
localrecall's hybrid search previously sorted on a wrapped scalar
similarity expression, which blinded the planner into a full sequential
scan over every row and exceeded the statement timeout on large
collections, returning an empty result set. It now uses the canonical
Reciprocal Rank Fusion pattern (index-backed candidate retrieval + FULL
OUTER JOIN + weighted RRF).
Fixes#10186
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat: forward reasoning_effort to the backend so jinja models honor it
reasoning_effort was only mapped to the binary enable_thinking toggle and
otherwise reached Go-side templates — it was never sent to the backend. So
jinja-templated models whose chat template keys on reasoning_effort (gpt-oss
Harmony, LFM2.5) could not be driven by it: LFM2.5 ignores enable_thinking and
kept emitting <think>.
Forward the effective reasoning_effort to the backend as a chat_template_kwarg
(mirroring enable_thinking) in grpc-server.cpp, and put it in PredictOptions
metadata (gRPCPredictOpts). Add a config-level default: ModelConfig.reasoning_effort
and Pipeline.reasoning_effort, resolved by ModelConfig.ApplyReasoningEffort
(request value overrides config default, none->disable / level->enable, an
operator's reasoning.disable wins). request.go now uses that helper.
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): set the pipeline LLM's reasoning_effort
Apply Pipeline.ReasoningEffort to the pipeline's LLM config when the realtime
model is built (per-session copy, overrides the LLM's own reasoning_effort),
and surface the resolved effort on the template input so Go-templated models
get it too. jinja models receive it via the backend metadata. This lets a
realtime pipeline disable thinking on models that only honor reasoning_effort
(e.g. LFM2.5), which enable_thinking can't.
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The worker HTTP file-transfer server is authenticated by the registration
token via checkBearerToken, which fails open on an empty token: every
/v1/files, /v1/files-list and /v1/backend-logs request is then served
unauthenticated, granting read/write to the worker's models/staging/data
directories. The fail-open was also silent (the only auth log sat on the
unreachable reject branch), and the worker process never runs
DistributedConfig.Validate(), so the existing frontend warning did not
cover the component that exposes the server.
Mirror the NatsRequireAuth pattern: keep anonymous as the default but make
it loud and opt-in enforceable.
- Log a prominent warning when the file-transfer server starts tokenless.
- Add LOCALAI_REGISTRATION_REQUIRE_AUTH: DistributedConfig.Validate() errors
on an empty token (frontend) and the worker refuses to start (fail-fast,
before registration), so production can fail closed. Also satisfies the
F-003 suggestion to fail Validate() on distributed + empty token.
- Add LOCALAI_DISTRIBUTED_REQUIRE_AUTH umbrella switch implying both
RegistrationRequireAuth and NatsRequireAuth — one production knob locking
down the registration/file-transfer layer and the NATS bus together; the
granular flags remain available as single-layer overrides. Wired into the
frontend, supervisor worker, and agent worker (vLLM worker has neither a
NATS connection nor a file-transfer server, so it is left untouched).
- Document in distributed-mode.md (warning callout + flag tables).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(distributed): self-heal stale 'model not loaded' routing
In distributed mode the registry can list a model as loaded on a node
while the worker has evicted it (autonomous LRU eviction, an out-of-band
unload, etc.) yet the backend process survives. The router's cached-node
check only verifies the process is alive (probeHealth), so it routes there
and inference fails with "<backend>: model not loaded" — and stays broken
until the controller restarts and rebuilds its registry.
InFlightTrackingClient now reconciles this: when a tracked inference call
returns a model-not-loaded error, it drops the stale replica row
(RemoveNodeModel) so the next request reloads the model on a healthy node
instead of routing back to the evicted one. The original error is returned
unchanged; only the registry is corrected.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(distributed): typed model-not-loaded error via gRPC status code
Replace the controller-side error-string match with a shared, code-aware
helper. Go error types don't survive the gRPC boundary, so the signal is
carried as a status code (FailedPrecondition):
- pkg/grpc/grpcerrors: ModelNotLoaded(backend) constructor +
IsModelNotLoaded(err) checker (status-code first, message fallback for
backends not yet migrated).
- InFlightTrackingClient.reconcile now uses grpcerrors.IsModelNotLoaded.
- Migrate the Go backends that emit this error (parakeet-cpp, cloud-proxy,
rfdetr-cpp) to the typed constructor.
Acting on a false positive is harmless (the model is just reloaded).
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
FLAG_FACE_RECOGNITION and FLAG_SPEAKER_RECOGNITION already existed as
ModelConfigUsecase bitmask flags, and GuessUsecases already gate-checks
both backends by name — but BackendCapabilities had no entries for
either, so the UI could not classify them.
Also missing were the Method* constants for the five proto-defined RPCs
these backends implement (FaceVerify, FaceAnalyze, VoiceVerify,
VoiceEmbed, VoiceAnalyze) and the corresponding Usecase* strings
and UsecaseInfoMap entries needed to wire them into the rest of the
capability system.
Changes:
- Add MethodFaceVerify, MethodFaceAnalyze, MethodVoiceVerify,
MethodVoiceEmbed, MethodVoiceAnalyze GRPCMethod constants
- Add UsecaseFaceRecognition ("face_recognition") and
UsecaseSpeakerRecognition ("speaker_recognition") Usecase constants
- Add UsecaseInfoMap entries for both new usecases, referencing the
existing FLAG_FACE_RECOGNITION and FLAG_SPEAKER_RECOGNITION flags
- Register insightface: Embedding + Detect + FaceVerify + FaceAnalyze
- Register speaker-recognition: VoiceVerify + VoiceEmbed + VoiceAnalyze
Follows up on #10107 which left these two out because they needed new
constants first.
Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
Distributed file-staging treated every model path field (ModelFile, etc.)
as a single regular file: it os.Open'd the path and streamed its fd as the
HTTP PUT body. For directory-based models — e.g. qwen3-tts-cpp, whose
weights and tokenizer ggufs live under one directory referenced by
parameters.model — opening the directory succeeds but reading its fd
returns EISDIR, so routing the model to a remote NATS worker failed with
"read /models/<model>: is a directory". Single-file models were unaffected,
so only multi-file pipelines (e.g. the realtime TTS stage) broke.
stageModelFiles now detects a directory path field and stages each
contained file individually (via the new stageDirectory helper), preserving
structure with the existing StagingKeyMapper and rewriting the field to the
remote directory (deriving ModelPath as before). countStageableFiles makes
the progress total count a directory's files so the staging tracker stays
accurate.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The qwen3-tts.cpp backend honored the request `language` field only via exact lowercase two-letter codes in the C++ language_to_id table, silently defaulting to English for anything else (en-US, EN, english, ...).
Add normalizeLanguage() in the Go handler: lowercase + trim, strip the region/locale suffix (en-US, pt_BR, zh-Hans -> en/pt/zh), and resolve common English full names (english -> en). The canonical codes match the existing C++ table, so no C++ change is needed. Covered by a pure-Go Ginkgo spec. Also document the language field and accepted forms under the Qwen3-TTS docs.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The OpenAI-compatible TTS endpoint accepts an `instructions` field, but it
was silently dropped at the HTTP->gRPC boundary: neither schema.TTSRequest
nor the gRPC TTSRequest proto carried it, so backends could only read such a
value from static YAML options (identical for every request). This blocked
per-line emotion/style and, for Qwen3-TTS VoiceDesign, limited a model config
to a single designed voice.
Plumb a generic per-request instruction string end to end, plus an optional
backend-specific params map:
- proto: add `optional string instructions` and `map<string,string> params`
to TTSRequest.
- schema: add Instructions (maps OpenAI `instructions`) and Params (LocalAI
extension) to schema.TTSRequest.
- core: thread both through ModelTTS/ModelTTSStream via a newTTSRequest helper
that attaches instructions only when non-empty (so backends can fall back to
YAML when unset); forward them from the /v1/audio/speech handler.
- qwen-tts: prefer the per-request instruction over the YAML `instruct` option
(used by both mode detection and generation) and merge per-request params.
- chatterbox: merge per-request params (coerced to float/int/bool) over YAML
options into generate() kwargs.
Fully backward compatible: empty instructions fall back to the YAML option and
backends that don't support style/voice instructions ignore the field.
Closes#10164
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): NATS JWT auth, TLS/mTLS options, and e2e coverage
Mint per-node NATS user JWTs at registration when LOCALAI_NATS_ACCOUNT_SEED
is set, and connect workers with scoped credentials from the register response.
Add optional LOCALAI_NATS_TLS_CA/CERT/KEY for private CA and mTLS alongside
tls:// URLs, plus test-e2e-distributed and NatsJWT container e2e specs.
Document JWT setup (nats-auth-setup.sh) and TLS env vars in distributed-mode.
Assisted-by: Grok:grok grok-build
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(distributed): correct NATS JWT scoping and harden client auth
The JWT-auth path added in 46467cc7 had several gaps that fail silently
under LOCALAI_NATS_REQUIRE_AUTH:
- Agent-worker minted JWTs did not allow the subjects the agent worker
actually subscribes to (jobs.mcp-ci.new and nodes.<id>.backend.stop),
so MCP-CI jobs and backend-stop session cleanup were silently dropped.
Scope the agent permission set to those subjects.
- NATS subscription permission violations were swallowed (Subscribe
returned a live-but-dead subscription). Confirm subscriptions with a
server round-trip so a denial surfaces synchronously, and log async
permission errors.
- The backend worker connected anonymously when given a JWT without its
paired seed; reject the unpaired credential instead.
- The documented service-user permissions in nats-auth-setup.sh omitted
prefixcache.>, which the frontend publishes and subscribes; add it.
Also: add a credential-provider hook to the messaging client (consumed by
the follow-up credential-lifecycle change), drop the always-nil error from
NatsMessagingOptions, run go mod tidy (jwt/v2 and nkeys are now direct),
and gofmt the feature's files.
Tests: an agent-JWT e2e spec that connects to the enforcing NATS server
and exercises every subscription the agent worker makes, plus permission
allow-list coverage unit tests.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(distributed): acquire and auto-refresh worker NATS credentials
Workers fetched NATS credentials once at startup, which broke two cases
under JWT auth: a worker that registered while still pending admin
approval never received a minted JWT (it connected unauthenticated and
gave up), and a long-running worker's 24h JWT expired with no way to renew
it.
Introduce workerregistry.NATSCredentialManager, built on idempotent
re-registration (the frontend preserves the node row and mints a fresh JWT
each call):
- Acquire re-registers through admin approval until the node is approved
and credentials are minted (or returns the first success when auth is
not required, preserving anonymous-NATS behavior).
- RefreshLoop re-registers before the JWT expires (~75% of its lifetime),
updating the credentials served to the connection.
- Both are bounded (default 100 attempts / consecutive failures) and
return an error on exhaustion, so an unapprovable or unrenewable worker
exits non-zero and surfaces the problem instead of hanging or drifting
toward an expired credential.
The messaging client gains WithUserJWTProvider, fetching credentials on
each (re)connect so the connection transparently adopts a refreshed JWT
when the server expires the old one. RegisterFull exposes the approval
status and full response; Register delegates to it.
Both the backend worker and the agent worker are wired to this: explicit
env credentials are used as-is, minted credentials are acquired-with-wait
and refreshed, and a permanent refresh failure shuts the worker down so it
restarts and re-acquires.
Tests cover Acquire (wait-through-pending, bounded give-up, context
cancel), RefreshLoop (refresh-before-expiry, bounded failure, no-expiry
exit) and jwtExpiry decoding. Docs updated in distributed-mode.md.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The direct (non-batched) transcription path handed the original upload
path straight to the C library via parakeet_capi_transcribe_path_json.
That loader only understands 16 kHz mono WAV/PCM, so any other format
(MP3, etc.) failed with "parakeet: failed to load audio: <file>".
Only the batched path converted the input (via decodeWavMono16k ->
utils.AudioToWav). Every other audio backend (whisper, crispasr)
converts unconditionally with utils.AudioToWav before handing the file
to its engine; the parakeet-cpp fallback was the lone exception.
Extract a convertToWavMono16k helper (reused by decodeWavMono16k) that
produces a 16 kHz mono WAV in a temp dir, and run the non-batched path
through it before calling the C loader. WAV inputs already in the target
format are passed through without ffmpeg.
Add specs covering the helper (decodable copy + cleanup, and an error on
a missing input) that need neither the model, the C library, nor ffmpeg.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-03 15:06:57 +02:00
710 changed files with 53534 additions and 9474 deletions
**If the backend is a native C/C++/GGML engine created and maintained by the
LocalAI team** (a from-scratch port like `parakeet.cpp`, `ced.cpp`,
`vibevoice.cpp`, `rf-detr.cpp`, not a wrapper around a third-party runtime), it
ALSO belongs in the top-level **`README.md`** table under "native C/C++/GGML
engines ... developed and maintained by the LocalAI project itself". Add a row
linking the upstream engine repo with a one-line description. This is the
project's showcase of its own engines; a new in-house backend that is missing
from it is a documentation bug.
## 5. Verification Checklist
After adding a new backend, verify:
@@ -211,6 +232,8 @@ After adding a new backend, verify:
- [ ] No YAML syntax errors (check with linter)
- [ ] No Makefile syntax errors (check with linter)
- [ ] Follows the same pattern as similar backends (e.g., if it's a transcription backend, follow `faster-whisper` pattern)
- [ ] Documented: added to the category list in `docs/content/features/backends.md` (and any new endpoint/realtime capability documented under `docs/content/`)
- [ ] If it is an in-house native C/C++/GGML engine, added to the maintained-engines table in the top-level `README.md`
**LocalAI** is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.
**A small core, not a bundle.** Each backend wraps a best-in-class engine (llama.cpp, vLLM, whisper.cpp, stable-diffusion, MLX...) in its own image, pulled only when a model needs it. You install nothing you don't use.
@@ -149,12 +161,26 @@ local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
local-ai run oci://localai/phi-2:latest
```
To test a running LocalAI server from the terminal, open an interactive chat session from another shell. Inside the prompt, `/models` lists installed models and `/model <name>` switches between them.
> **Automatic Backend Detection**: LocalAI automatically detects your GPU capabilities and downloads the appropriate backend. For advanced options, see [GPU Acceleration](https://localai.io/features/gpu-acceleration/).
For more details, see the [Getting Started guide](https://localai.io/basics/getting_started/).
## Latest News
- **June 2026**: New [realtime voice assistant demo](https://github.com/localai-org/localai-realtime-demo) (a tiny Go client for the Realtime API with a full talk-back voice loop and tool calling), plus [streaming of the realtime LLM / TTS / transcription pipeline stages](https://github.com/mudler/LocalAI/pull/10176) and [configurable WebRTC ICE candidates](https://github.com/mudler/LocalAI/pull/10231).
- **June 2026**: Big speech push: the [parakeet.cpp](https://github.com/mudler/parakeet.cpp) ASR engine gains [NeMo-faithful segment timestamps](https://github.com/mudler/LocalAI/pull/10207), a [multilingual streaming Nemotron-3.5 model](https://github.com/mudler/LocalAI/pull/10199), [dynamic batching for concurrent transcription](https://github.com/mudler/LocalAI/pull/10112) and [CUDA graphs](https://github.com/mudler/LocalAI/pull/10273); the new [CrispASR backend](https://github.com/mudler/LocalAI/pull/10099) adds multi-architecture ASR + TTS, and [60 Piper TTS voices across 42 languages](https://github.com/mudler/LocalAI/pull/10296) land in the gallery (plus [per-request TTS instructions and params](https://github.com/mudler/LocalAI/pull/10172)).
- **June 2026**: New backends and models: [locate-anything.cpp](https://github.com/mudler/LocalAI/pull/10264) for open-vocabulary object detection via ggml, [Ideogram4 image generation](https://github.com/mudler/LocalAI/pull/10201) in stablediffusion-ggml, [llama.cpp video input](https://github.com/mudler/LocalAI/pull/10216), and the [Gemma 4 QAT family with MTP speculative-decoding pairs](https://github.com/mudler/LocalAI/pull/10215). Plus an [interactive CLI chat mode](https://github.com/mudler/LocalAI/pull/10226) and [RAG source citations in agent responses](https://github.com/mudler/LocalAI/pull/10228).
- **June 2026**: Distributed mode hardening: [prefix-cache-aware routing](https://github.com/mudler/LocalAI/pull/10071), a [production-ready request router with auto-sized embedding/rerank batches](https://github.com/mudler/LocalAI/pull/10104), [ds4 layer-split distributed inference](https://github.com/mudler/LocalAI/pull/10098), [NATS JWT auth + TLS/mTLS](https://github.com/mudler/LocalAI/pull/10159), and [resumable file uploads](https://github.com/mudler/LocalAI/pull/10109).
- **May 2026**: **LocalAI 4.3.0** - `llama.cpp` [prompt cache on by default](https://github.com/mudler/LocalAI/pull/9925) (repeated system prompts collapse from minutes to seconds), [keyless cosign signing of backend OCI images](https://github.com/mudler/LocalAI/pull/9823), [per-API-key + per-user usage attribution](https://github.com/mudler/LocalAI/pull/9920), Distributed v3 with [per-request replica routing](https://github.com/mudler/LocalAI/pull/9968). [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.3.0)
- **May 2026**: **LocalAI 4.2.0** - LocalAI sees and hears: [voice recognition](https://github.com/mudler/LocalAI/pull/9500), [face recognition + antispoofing liveness](https://github.com/mudler/LocalAI/pull/9480), speaker diarization. Plus [drop-in Ollama API](https://github.com/mudler/LocalAI/pull/9284), [video generation](https://github.com/mudler/LocalAI/pull/9420), redesigned UI with i18n + admin-configurable branding, vLLM at feature parity with llama.cpp, and 11 new backends. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.2.0)
- **April 2026**: **LocalAI 4.1.0** - LocalAI becomes a control tower: distributed cluster mode with VRAM-aware smart routing + autoscaling, multi-user platform with OIDC and API keys, per-user quotas with predictive analytics, in-UI fine-tuning with TRL (auto-export to GGUF), on-the-fly quantization backend, visual pipeline editor. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.1.0)
@@ -194,10 +220,29 @@ For older news and full release notes, see [GitHub Releases](https://github.com/
## Supported Backends & Acceleration
LocalAI supports **36+ backends** including llama.cpp, vLLM, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for **NVIDIA** (CUDA 12/13), **AMD** (ROCm), **Intel** (oneAPI/SYCL), **Apple Silicon** (Metal), **Vulkan**, and **NVIDIA Jetson** (L4T). All backends can be installed on-the-fly from the [Backend Gallery](https://localai.io/backends/).
LocalAI supports **60+ backends** including llama.cpp, vLLM, SGLang, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for **NVIDIA** (CUDA 12/13), **AMD** (ROCm), **Intel** (oneAPI/SYCL), **Apple Silicon** (Metal), **Vulkan**, and **NVIDIA Jetson** (L4T). All backends can be installed on-the-fly from the [Backend Gallery](https://localai.io/backends/).
See the full [Backend & Model Compatibility Table](https://localai.io/model-compatibility/) and [GPU Acceleration guide](https://localai.io/features/gpu-acceleration/).
### Backends built by us
Most backends wrap a best-in-class upstream engine. A handful of them are native C/C++/GGML engines (no Python at inference) developed and maintained by the LocalAI project itself:
| Backend | What it does |
|---------|-------------|
| [parakeet.cpp](https://github.com/mudler/parakeet.cpp) | C++/GGML port of NVIDIA NeMo Parakeet ASR (tdt/ctc/rnnt/hybrid), with cache-aware streaming transcription |
| [ced.cpp](https://github.com/mudler/ced.cpp) | C++/GGML port of the CED audio-tagging models: sound-event classification (527-class AudioSet) over REST and the realtime API for live recognition |
| [voxtral.c](https://github.com/mudler/voxtral.c) | Voxtral Realtime 4B speech-to-text in pure C |
| [vibevoice.cpp](https://github.com/mudler/vibevoice.cpp) | Native port of Microsoft VibeVoice for TTS (voice cloning) and long-form ASR with speaker diarization |
We also maintain [apex-quant](https://github.com/localai-org/apex-quant), a per-tensor, per-layer quantization recipe for Mixture-of-Experts models that exploits their structural sparsity to produce GGUFs matching or beating Q8_0 quality - and they run out of the box on stock llama.cpp.
## Resources
- [Documentation](https://localai.io/)
@@ -207,7 +252,7 @@ See the full [Backend & Model Compatibility Table](https://localai.io/model-comp
- [Integrations & community projects](https://localai.io/docs/integrations/)
- [Installation video walkthrough](https://www.youtube.com/watch?v=cMVNnlqwfw4)
- [Media & blog posts](https://localai.io/basics/news/#media-blogs-social)
- [Examples](https://github.com/mudler/LocalAI-examples) — including the [realtime voice assistant demo](https://github.com/localai-org/localai-realtime-demo) (Go client for the Realtime API with tool calling)
src:=unsafe.Slice((*float32)(unsafe.Pointer(ptr)),int(n))//nolint:govet // ptr addresses C-allocated PCM returned across the purego boundary; copied out immediately below, before tts_free.
out:=make([]float32,int(n))// copy out of C memory before free
out:=make([]float32,int(n))// copy out of C memory before free
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.