WeSpeaker ResNet34 runs as one nChw16c blocked island (2 reorders/forward vs
~60) on AVX-512, default; per-conv directconv fallback on AVX2. +2.9% @1t /
+17-19% @8t vs per-conv directconv, parity cosine=1.0. The conv microkernel is
already FMA-bound near peak (~0.86-0.98x MLAS-implied); residual to MLAS is
sub-peak edge + non-conv tail, documented in docs/cpu-optimization.md.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
voice-detect.cpp -> e9c56ae: weight-as-src0 mul_mat order in ECAPA's F32
conv1d_same (routes through tinyBLAS sgemm); ECAPA embed 1.67x @1t / ~1.3x @8t,
parity cosine=1.0. Isolated to encoder.cpp (ECAPA-only); ERes2Net/CAM++/WeSpeaker
do not call conv1d_same so are provably unaffected.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
voice-detect.cpp -> d2839ca (CAM++ FCM 2D convs through Winograd, -15.5%/-10.3%);
face-detect.cpp -> c1db23d (AVX2-vectorized Winograd tile transforms, SCRFD
detect -14%/-9.6%). Final CPU optimization round; the conv-kernel lever class is
now exhausted (parity held cosine=1.0; GGUF/parity unchanged, HF GGUFs valid).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
voice-detect.cpp -> 45122ec (Winograd F(2x2,3x3) for WeSpeaker/ERes2Net 3x3
convs, -22%/-20% @8t); face-detect.cpp -> cd5c962 (Winograd F(4x4,3x3) for
SCRFD large maps, -22% @1t on top of F(2x2), more load-stable). Parity held
(cosine=1.0); GGUF format unchanged, HF GGUFs valid.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
voice-detect.cpp -> 0d9c1b3 (radix-2 FFT FBank, threads, flash attn + cached
pos-conv); face-detect.cpp -> 523aee1 (thread-gated direct conv, threads).
Brings the CPU optimizations into the LocalAI backend builds. GGUF format and
parity unchanged, so the published HF GGUFs remain valid.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
LocalAI spawns one backend process per model and serves requests
concurrently, so the engines' own min(hardware_concurrency, 8) default
can oversubscribe cores. Forward the per-model Threads value from the
gRPC LoadModel options into the engine via VOICEDETECT_THREADS /
FACEDETECT_THREADS (read at backend construction) before the capi load.
A non-positive Threads is treated as unset, leaving the engine default.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Add the missing metal BUILD_TYPE branch to the voice-detect Makefile
forwarding -DVOICEDETECT_GGML_METAL=ON, mirroring face-detect, so the
darwin metal CI artifact is built with the Metal backend instead of
CPU-only.
Expand the 4 face-detect gallery models' known_usecases to
[face_recognition, detection, embeddings] to match the backend
capabilities map and the mirrored insightface-buffalo entries, so
auto-selection for /v1/detect and /embeddings works.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Register the face-detect.cpp face detection / embedding / verification /
analysis backend (added in Face-INT-A) into LocalAI's distribution
surfaces, mirroring the voice-detect wiring (the closest mudler C++/ggml
recognition analogue):
- backend/index.yaml: add the &facedetect meta-backend (capabilities
platform map, no top-level uri to avoid the meta-backend gotcha) plus
the full set of concrete per-arch image entries (cpu/cuda12/cuda13/
metal/rocm/sycl-f16/sycl-f32/vulkan/l4t and the -development variants),
22 entries. Referential integrity audited: every alias target resolves.
- gallery/index.yaml: add 4 model entries on backend face-detect -
face-detect-buffalo-l/m/s (insightface SCRFD + ArcFace/MBF, NON-COMMERCIAL)
and face-detect-yunet-sface (OpenCV-Zoo YuNet + SFace, APACHE-2.0, the
commercial-friendly alternative). The detector/embedder architecture is
read from GGUF metadata (facedetect.arch) at load; only the real
verify_threshold option is set (0.35 buffalo, 0.363 sface). GGUF
artifacts are not yet published: each files: entry points at the
intended mudler/face-detect-gguf location with a TODO to fill sha256
after upload (no fabricated hashes).
- core/config/backend_capabilities.go: register face-detect in the
backend capability map (Embedding/Detect/FaceVerify/FaceAnalyze ->
face_recognition), mirroring insightface.
- .github/backend-matrix.yml: add the linux build matrix block + the
darwin metal entry mirroring voice-detect.
- .github/workflows/bump_deps.yaml: track mudler/face-detect.cpp via
FACEDETECT_VERSION (pin 636a1963).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Add the LocalAI Go backend that dlopens libfacedetect.so (the flat
facedetect_capi_* C-ABI) via purego, mirroring the sibling voice-detect
backend. Implements the Face subset of the Backend gRPC service:
- Embeddings(PredictOptions): Images[0] base64 -> temp file -> embed_path
-> L2-normalized ArcFace embedding.
- Detect(DetectOptions): src -> detect_path_json -> Detection boxes
(class_name "face", [x1,y1,x2,y2] -> x/y/w/h).
- FaceVerify(FaceVerifyRequest): two images + threshold + anti_spoof ->
verify_paths; best-effort img areas via detect.
- FaceAnalyze(FaceAnalyzeRequest): img -> analyze_path_json -> per-face
age + gender ("M"/"F" normalized to "Man"/"Woman").
The Makefile pins face-detect.cpp to 636a1963 and builds the shared lib
with ggml + vendored libjpeg-turbo static (PIC), so the .so is
ldd-clean (no libggml) and exports only facedetect_capi_* (no jpeg_
symbols). Gated Ginkgo e2e mirrors voice-detect.
Note for the gallery-wiring task: backend registration (index.yaml,
gallery, core/config/backend_capabilities.go) is intentionally not
touched here.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Register the voice-detect.cpp speaker-recognition + voice-analysis
backend (added in Voice-INT-A) into LocalAI's distribution surfaces,
mirroring the ced backend (the closest mudler C++/ggml audio analogue):
- backend/index.yaml: add the &voicedetect meta-backend (capabilities
platform map, no top-level uri) plus the full set of concrete per-arch
image entries (cpu/cuda12/cuda13/metal/rocm/sycl/vulkan/l4t and the
-development variants). Referential integrity audited - every alias
target resolves.
- gallery/index.yaml: add 5 model entries on backend voice-detect -
ECAPA-TDNN, WeSpeaker ResNet34, 3D-Speaker ERes2Net, CAM++ and the
wav2vec2 age/gender/emotion analyze model. The engine architecture is
read from GGUF metadata (voicedetect.arch) at load. GGUF artifacts are
not yet published: each files: entry points at the intended
mudler/voice-detect-gguf location with a TODO to fill sha256 after
upload (no fabricated hashes).
- .github/backend-matrix.yml: add the linux build matrix block + the
darwin metal entry mirroring ced.
- .github/workflows/bump_deps.yaml: track mudler/voice-detect.cpp via
VOICEDETECT_VERSION (pin 47546430, = 4754643).
- core/config/backend_capabilities.go: register voice-detect in the
backend capability map (VoiceVerify/VoiceEmbed/VoiceAnalyze ->
speaker_recognition), mirroring speaker-recognition.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Add backend/go/voice-detect implementing the Backend gRPC voice subset
(VoiceEmbed/VoiceVerify/VoiceAnalyze) over libvoicedetect.so via purego,
mirroring the parakeet-cpp / omnivoice-cpp backends.
The flat voicedetect_capi C ABI is dlopen'd cgo-less; malloc'd string and
float-vector returns are owned by Go and released through the matching capi
free functions, with the per-ctx last error surfaced into Go errors. Calls are
serialized via base.SingleThread since the C context is not reentrant.
Proto field mapping:
- VoiceEmbed: VoiceEmbedRequest.audio (path) -> embed_path -> Embedding+Model.
- VoiceVerify: audio1/audio2 + threshold (<=0 falls back to the
verify_threshold option, default 0.25) -> verify_paths -> verified/distance/
threshold/confidence/model/processing_time_ms.
- VoiceAnalyze: audio (path) -> analyze_path_json; the JSON age/gender/emotion
document maps to a single VoiceAnalysis segment (start/end 0; gender "label"
-> dominant_gender with the remaining float scores as the gender map; emotion
label/scores -> dominant_emotion/emotion).
The Makefile pins voice-detect.cpp to 47546430, clones+builds libvoicedetect.so
with ggml static-linked (PIC, GGML_NATIVE off) so dlopen needs no external
libggml/libvoicedetect; ldd on the artifact shows only system libs. Ginkgo
tests cover option parsing and analyze-JSON mapping; embed/verify smoke specs
gate on VOICEDETECT_BACKEND_TEST_MODEL + VOICEDETECT_BACKEND_TEST_WAV.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(ced): sketch sound-classification backend (CED audio tagger)
Wires ced.cpp (CED, 527-class AudioSet sound-event tagger; baby cry,
footsteps, glass, alarms, dog bark) into LocalAI as a Go/purego backend.
SKETCH (backend skeleton real; core REST wiring + CI/gallery is a checklist
in DESIGN.md):
- backend/backend.proto: new SoundDetection rpc + SoundClass messages
(run `make protogen-go` to regenerate pkg/grpc/proto).
- backend/go/ced: main.go (purego dlopen libced.so + ced_capi.h),
goced.go (Ced gRPC backend: Load + SoundDetection), Makefile
(clone-at-pin CED_VERSION, ggml static-PIC shared build), run.sh,
package.sh, .gitignore.
- DESIGN.md: REST /v1/audio/classification wiring (handler/route/capability
registration checklist), gallery/index + CI registration, and a scoping
note for the realtime/websocket live-recognition path (sliding-window
classify over the existing ws transport + voicegate; the ced C-API
per-PCM entry point is already window-friendly).
Backend code does not compile until protogen-go regenerates the pb types
and a libced.so is built (Makefile clones+builds it).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): REST /v1/audio/classification endpoint + capability registration
Wires the ced sound-event classification backend (AudioSet audio tagger)
end to end through the REST surface, mirroring the transcription path.
- Handler: core/http/endpoints/openai/sound_classification.go parses the
multipart audio upload, temp-files it, resolves the model config and
calls the SoundDetection RPC; returns {model, detections[]} JSON.
- Backend wrapper: core/backend/sound_classification.go (ModelSoundDetection)
loads the model and normalizes the proto response into schema types.
- Schema: core/schema/sound_classification.go (SoundClassificationResult).
- gRPC layer: SoundDetection wired through the LocalAI wrapper (interface,
Backend client, Client, embed, server, base default) so the loader-typed
client exposes the RPC; proto regenerated via make protogen-go.
- Route: POST /v1/audio/classification (+ /audio/classification alias) with
the audio/multipart default-model middleware in routes/openai.go.
- Capability surfaces: swagger @Tags/@Router on the handler; FLAG_SOUND_
CLASSIFICATION usecase flag + UsecaseSoundClassification + UsecaseInfoMap +
GuessUsecases + ModalityGroups + GetAllModelConfigUsecases; meta usecase
option; /api/instructions audio area updated; auth RouteFeatureRegistry +
FeatureAudioClassification (APIFeatures, default ON) + FeatureMetas; UI
usecaseFilters, capabilities.js CAP_SOUND_CLASSIFICATION, Models.jsx filter
+ i18n; docs page features/audio-classification.md + whats-new + crosslink.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): realtime sound-event detection over the websocket API
When a realtime pipeline configures a sound-classification model, each
VAD-committed utterance (the same window the transcription path produces)
is also run through the CED sound-event classifier and the scored AudioSet
tags are emitted as a new server event. No new backend rpc is needed: the
SoundDetection gRPC method already exists on this branch.
- config: add Pipeline.SoundDetection (yaml/json sound_detection,omitempty)
beside Transcription/VAD.
- realtime: add Model.SoundDetection(ctx, audio, topK, threshold) to the
ModelInterface; implement it on wrappedModel and transcriptOnlyModel by
calling backend.ModelSoundDetection with the session's sound-classification
model config (mirrors how Transcribe dispatches). Load the optional config
in newModel / newTranscriptionOnlyModel; nil config keeps it additive.
- types: add ConversationItemSoundDetectionEvent (item_id, content_index,
detections[]{label,score,index}) with type conversation.item.sound_detection,
its ServerEventType constant and MarshalJSON, mirroring the transcription
completed event.
- realtime: add emitSoundDetection (unary path: classify the committed window,
build the event, t.SendEvent) and wire it at the utterance-commit hook right
after emitTranscription; gated on session.SoundDetectionEnabled (resolved
from Pipeline.SoundDetection at session setup, defaults top_k=5, threshold=0).
Its error is logged via xlog but never aborts the turn.
- test: Ginkgo specs for emitSoundDetection (tags emitted, empty detections,
classifier error) plus a SoundDetection method on the fakeModel double.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ced): implement SoundDetection in nodes backend test doubles
The SoundDetection method added to the grpc backend interface left two
test doubles (fakeBackendClient, fakeGRPCBackend) incomplete, so
core/services/nodes failed to compile under `go vet`/`go test` (go build
missed it: the doubles live in _test.go). Add the method to both,
mirroring their existing Detect mock. Repairs CI for the nodes package.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): decouple realtime sound detection from VAD (sound-only sessions)
Sound-event detection must activate on sounds, not speech, so it no longer
runs through the voice VAD/transcription path. A sound-detection-only
pipeline (sound_detection set, no transcription/LLM) now:
- is accepted by prepareRealtimeConfig (sound_detection counts as a pipeline
stage),
- builds a lightweight model via newSoundDetectionOnlyModel (no VAD/STT/LLM/TTS
loaded), and
- defaults the session to turn_detection none (no VAD) with no transcription
stage, so the client drives windowing via input_audio_buffer.commit
(option A: client-side sliding window). The per-PCM C-API already supports
arbitrary windows.
commitUtterance gains a sound-only branch: it emits the
conversation.item.sound_detection event (scored AudioSet tags) and stops -
no transcription, no LLM response. generateResponse is now guarded on a
transcription stage being present, so a sound-only turn never invokes the LLM.
Existing transcription/VAD sessions are unchanged (additive). Added a
commitUtterance sound-only Ginkgo spec asserting it emits the sound event and
neither transcribes nor generates a response. go vet + golangci-lint
(new-from-merge-base) clean; openai suite green.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): register sound-classification backend in gallery + CI
Mechanical backend-image registration for the ced sound-event classifier,
mirroring the parakeet-cpp Go/purego backend everywhere it is wired up.
- .github/backend-matrix.yml: add the ced build matrix, field-for-field copies
of the parakeet-cpp entries (cpu amd64/arm64, cublas cuda 12/13 amd64,
l4t cuda-13 arm64, l4t-jetpack cuda-12 arm64, sycl f32/f16, vulkan
amd64/arm64, rocm hipblas, and the metal darwin entry), changing only
backend and tag-suffix. dockerfile stays ./backend/Dockerfile.golang.
- backend/index.yaml: add the &ced meta anchor (capabilities map per platform)
plus ced-development and the per-arch image entries, each uri/mirror
tag-suffix matching the matrix exactly. The model gallery (GGUF) entry is
intentionally deferred pending the HuggingFace publish (TODO note inline).
- scripts/changed-backends.js: add an explicit item.backend === "ced" branch in
inferBackendPath mapping to backend/go/ced/, same mechanism and ordering as
the parakeet-cpp branch (before the generic golang fallthrough).
- .github/workflows/bump_deps.yaml: register mudler/ced.cpp -> CED_VERSION in
backend/go/ced/Makefile so the daily bot bumps the pin.
- swagger/{docs.go,swagger.json,swagger.yaml}: regenerated via make swagger so
the existing /v1/audio/classification annotations land in the generated spec.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): server-side windowing for realtime sound detection (option B)
Adds an optional server-driven sliding-window classifier so a sound-only
realtime client only has to stream audio (no input_audio_buffer.commit):
- Pipeline.sound_detection_window_ms / sound_detection_hop_ms config knobs.
When both > 0 on a sound-only session, the server classifies the last
window of streamed audio every hop and emits a conversation.item.sound_
detection event; the input buffer is trimmed to one window so a long
stream stays bounded. When unset, the session stays client-driven
(option A). Runs independent of VAD (sound events are not speech).
- handleSoundWindow (ticker) + classifySoundWindow (one tick, extracted so
it is unit-testable) + writeWindowWAV, which declares the true
InputSampleRate (NewWAVHeaderWithRate) so the classifier resamples
correctly. Goroutine is started after toggleVAD and torn down with the
session (close + wg.Wait).
- Register pipeline.sound_detection (+window_ms/hop_ms) in the config meta
registry; the earlier realtime commit added pipeline.sound_detection
without a registry entry, failing TestAllFieldsHaveRegistryEntries. This
fixes that and covers the two new knobs.
Tests: classifySoundWindow emits an event + trims the buffer to one window,
no-ops on too-little audio; writeWindowWAV declares the given sample rate.
go build/vet + golangci-lint (new-from-merge-base) clean; config + openai
suites green.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): add ced-base GGUF model gallery entries (f16 + q8_0)
The ced-base weights are now published at mudler/ced-base-gguf (Apache-2.0,
converted from mispeech/ced-base). Adds gallery/ced.yaml (backend: ced +
known_usecases: sound_classification) and two gallery/index.yaml entries
(ced-base-f16 default, ced-base-q8 smallest) with sha256-pinned files, and
removes the now-resolved TODO from backend/index.yaml.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): add tiny/mini/small GGUF model gallery entries
Publishes the rest of the CED family (same architecture, metadata-driven port
verified end-to-end on ced-tiny) to mudler/ced-{tiny,mini,small}-gguf and adds
their f16 + q8_0 gallery entries:
ced-tiny (5.5M, edge/Pi-class) f16 11MB / q8_0 6MB
ced-mini (9.6M) f16 19MB / q8_0 11MB
ced-small (22M) f16 42MB / q8_0 23MB
All sha256-pinned. ced-base remains the accuracy default.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(ced): point gallery entries at the consolidated mudler/ced-gguf repo
All CED quantizations (tiny/mini/small/base, f16/q8_0) now live in a single
HuggingFace repo, mudler/ced-gguf, instead of per-model repos. Repoint the 8
gallery model entries' urls + file uris accordingly. sha256 and filenames are
unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(ced): bump CED_VERSION to the short-clip fix
Pin the ced backend to ced.cpp 99c6ed3, which fixes a crash on any clip
shorter than target_length (~10.11s): time_pos_embed was added at its full
63-frame grid instead of being sliced to the clip's actual time grid, tripping
ggml_can_repeat in ggml_add. Surfaced by the live realtime e2e (sub-10s
windows) and gated with a short-clip parity test upstream.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(ced): list ced.cpp as a LocalAI-team engine + backend-guide directive
- README.md: add ced.cpp to the "native C/C++/GGML engines developed and
maintained by the LocalAI project" table.
- docs/content/features/backends.md: add a Sound Classification backend
category (sound-event classification / audio tagging) listing ced.cpp.
- .agents/adding-backends.md: add a "Documenting the backend" section and two
verification-checklist items requiring new backends to be documented in the
backends.md category list, and in-house native engines to be added to the
README maintained-engines table. This directive was missing.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(ced): repin CED_VERSION to the v0.1.0 release commit
ced.cpp history was squashed into a single release commit (tagged v0.1.0), so
the previous pin (99c6ed3) no longer exists upstream. Pin to c04ac14, the
v0.1.0 release commit, so the backend builds against a commit that exists.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ced): silence gosec G304/G103 + govet unsafeptr on audited paths
- sound_classification.go: os.Create(dst) where dst = temp dir + path.Base of
the upload (no traversal). #nosec G304, matching the depth-anything-cpp handler.
- goced.go: reading a NUL-terminated C string from a libced-owned buffer.
#nosec G103 (gosec) + //nolint:govet (golangci-lint's unsafeptr check), since
the uintptr is a C-owned malloc'd buffer, not Go-GC memory.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(vllm): don't stream raw tool-call markup as content when a tool parser is active
When a tool_parser is configured and the request carries tools, the streaming
loop emitted every text delta as delta.content — including the model's raw
tool-call markup (e.g. <tool_call>...) — because extract_tool_calls only runs
on the full output after the stream. Clients streaming a tool call therefore
saw the unparsed tool-call syntax as assistant content.
Buffer the text while a tool parser is active for the request; the existing
end-of-stream chat_delta already carries the parsed tool_calls (or the cleaned
content), which the Go side converts to SSE deltas. Non-tool-parser streaming
is unchanged.
Add a server-less regression test covering both the tool-call case (no raw
markup leaked as content) and the plain-text case (content delivered exactly
once — guards against double-emitting the buffered content).
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* test(vllm): add expectedFailure test for progressive streaming with tool parser (Case 3, #582)
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* test(vllm): add Cases 4+5 — marker split across chunks + false-positive prefix (TDD, Option B state machine, #582)
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* feat(vllm): progressive streaming via parser.extract_tool_calls_streaming
When a tool parser is active for a tool-enabled streaming request,
#10346 buffers the entire generation and surfaces it on the final
chunk to prevent raw tool-call markup from leaking as delta.content.
This is correct but turns the request into effectively non-streaming
for plain-text responses — the client sees nothing until the model
stops.
Every concrete tool parser shipped with vLLM 0.23+ already implements
extract_tool_calls_streaming (Granite4, Qwen3Coder, DeepSeekV31, Jamba,
Ernie45, Hermes2Pro, llama3_json, mistral, …). Use it: instantiate
the parser before the streaming loop and call its streaming method per
delta, emitting DeltaMessage(content=…) or DeltaMessage(tool_calls=[…])
when the parser is ready.
Falls back to the existing #10346 buffer path when:
- the parser does not have extract_tool_calls_streaming, OR
- extract_tool_calls_streaming raises mid-stream (logged, the
rest of the request finishes via post-loop extract_tool_calls).
Tests (TestStreamingToolParser):
1. Buffer path: no markup leaked, no content duplication
2. Native streaming: plain-text response streams progressively
3. Native streaming: tool_call structured, no markup leaked
4. Native streaming exception → graceful fallback, no markup, no crash
5. No tool parser → unchanged per-delta content stream
E2E verified against qwen3_coder on vLLM 0.23.0 (NVIDIA GB10 / arm64 / CUDA 13).
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* docs(vllm): add server-side TTFT benchmark for the streaming tool-parser path
Self-contained stdlib-only script that measures time-to-first-token (TTFT)
for the vLLM backend's two streaming scenarios:
- tool_call: request mentions a tool; model is expected to call it
- plain_text: request offers a tool but explicitly asks for prose
Use this to compare:
- the buffer-all path (#10346) → plain_text TTFT ≈ total response time
- the native-streaming path (this PR) → plain_text TTFT ≈ true first-token time
python examples/vllm-bench/ttft_streaming_tool_parser.py \\
--url http://localhost:8080 --model my-coder --runs 3
Lives under examples/ so it does not interfere with the test suite.
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* examples/vllm-bench: add long-text scenario (8 paragraphs, 1500 tokens)
The long-text scenario shows the buffering vs streaming difference most
dramatically: with the buffer-all path, the client receives nothing for
20+ seconds and then the entire 1500-token response at once. With native
streaming, the first token arrives in tens of milliseconds and the
response flows progressively.
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
---------
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
Co-authored-by: Philipp Wacker <philipp.wacker@ibf-solutions.com>
* feat(nemo): enable word-level timestamps for ASR models
The nemo backend ignored timestamp_granularities and always returned a
single segment with start=0 end=0, making word-level timestamps
impossible to obtain even though the NeMo models (parakeet-tdt, etc.)
fully support them.
Changes:
- Add _get_stride_seconds() to compute frame duration from the model's
preprocessor window_stride and encoder subsampling_factor.
- Add _build_segments_with_words() that extracts word offsets from the
NeMo Hypothesis.timestamp dict and converts frame indices to
nanosecond timestamps.
- Support 'word' granularity (one segment per word) and 'segment'
granularity (merge at time-gap boundaries using a dynamic threshold).
- Populate TranscriptSegment.words with TranscriptWord entries so
callers get both segment-level and word-level timing.
- Only request timestamps from NeMo when the caller actually asks for
them (timestamp_granularities is non-empty), keeping the fast path
unchanged for callers that don't need timestamps.
Tested with nvidia/parakeet-tdt-0.6b-v3 on the JFK "ask not" clip:
curl -X POST /v1/audio/transcriptions \
-F file=@jfk.wav -F model=nemo-parakeet-tdt-0.6b \
-F 'timestamp_granularities[]=word' -F response_format=verbose_json
→ each word has correct start/end times in seconds.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
* fix(nemo): address Copilot review feedback
- Narrow exception handling in _get_stride_seconds to catch only
AttributeError, KeyError, TypeError instead of bare Exception, and
emit a warning when falling back to the hardcoded stride.
- Remove explicit return_hypotheses=False when timestamps are requested;
timestamps=True already forces NeMo to return Hypothesis objects.
- Add a warning when NeMo does not return Hypothesis objects despite
timestamps being requested.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
---------
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
The parakeet-specific word accessors can return stale initialisation
data (model name, binary blobs) for segments with no real speech.
Add isValidWord() to filter out words that have:
- empty or whitespace-only text
- U+FFFD replacement characters (from binary data scrubbing)
- negative timestamps
- zero duration (end <= start)
Also skip empty segments entirely when they have no recognisable
content (empty text AND no valid words), preventing spurious subtitle
entries like '00:45:33,592 --> 00:45:33,592 parakeet@rH\u000b\ufffdI'.
Applies to both AudioTranscription and AudioTranscriptionStream.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
fix(vllm): structured outputs silently ignored on vLLM >= 0.23
vLLM >= 0.23 removed GuidedDecodingParams (now StructuredOutputsParams) and
renamed the SamplingParams field guided_decoding -> structured_outputs. The
import failed, HAS_GUIDED_DECODING became False, and the whole guided-decoding
block was skipped, so response_format / grammar constraints were silently
ignored. Adapt the existing request.Grammar path to the new class/field.
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* feat(gallery): add Depth Anything V2 models + bump native version
Add Depth Anything V2 (DA2) support to the depth-anything backend. DA2 is
depth-only (no camera pose, no confidence) and ships both relative
(relative inverse depth) and metric (depth in metres) variants. The Go
backend is model-agnostic, so no backend code changes are required — only
a native version bump and new gallery entries.
- backend/go/depth-anything-cpp/Makefile: pin DEPTHANYTHING_VERSION to the
depth-anything.cpp commit that adds the DA2 engine + C-API routing
(e3dec57f13a52366bbc4f279ef44804915960a6b, kept alive by the upstream tag
da2-support so it survives a squash-merge).
- gallery/index.yaml: add 12 DA2 entries (4 base quants, small, large, plus
Hypersim indoor and VKITTI outdoor metric models in S/B/L). Metric models
carry the metric-depth tag; none carry camera-pose.
Assisted-by: Claude:claude-opus-4-8
* chore(depth-anything-cpp): pin to merged DA2 master commit
PR #1 (mudler/depth-anything.cpp) merged to master as f4e17de (squash); repoint
the pin from the pre-merge commit to the canonical master commit.
Assisted-by: Claude:claude-opus-4-8
---------
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ⬆️ Update ggml-org/llama.cpp
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(llama-cpp): adapt grpc-server to upstream server-schema split
Upstream llama.cpp (e475fa2) extracted the JSON request-schema evaluation
out of the static server_task::params_from_json_cmpl into the new
server_schema::eval_llama_cmpl_schema (tools/server/server-schema.cpp).
The grpc-server unity build still called the old static member, breaking
every llama-cpp backend build with "no member named 'params_from_json_cmpl'
in 'server_task'".
Pull server-schema.cpp into the translation unit and call the new function,
keeping both guarded by __has_include so forks that predate the split (e.g.
llama-cpp-turboquant, which still exposes params_from_json_cmpl) keep
compiling against the old static member.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(crispasr): add word-level timestamp support
Add word-level timestamp extraction to the crispasr backend by calling
the CrispASR C library's word accessor functions that are already
exported by libgocraspasr but were not previously bound by the Go
wrapper.
Two families of word functions are supported:
1. Session-based (get_word_count/text/t0/t1) — works per-segment for
whisper-like backends.
2. Parakeet-specific (get_parakeet_word_count/text/t0/t1) — returns a
global word list for TDT/CTC/RNNT parakeet models where the session
API does not expose per-segment word data.
The Go code tries session-based first and falls back to parakeet-specific
when the session word count is zero.
Depends on #10402 (grpc server Words forwarding) for the words to reach
the HTTP response.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
* fix(crispasr): use portable sed -i.bak for macOS compatibility
BSD sed requires -i '' for in-place editing while GNU sed uses -i.
Replace with -i.bak which works on both platforms, then remove the
backup file.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
---------
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
Vulkan backends bundled their own loader and ICD manifests but neither the
Mesa driver the manifests point at nor a way to make the loader find them,
so on a runtime base image without Mesa the loader enumerated zero devices
and the GPU silently fell back to CPU (only NVIDIA worked, since its ICD is
injected by the container toolkit).
- scripts/build/package-gpu-libs.sh: for each installed ICD manifest, bundle
the driver .so its library_path names — no hard-coded, platform-dependent
soname list — plus that driver's ldd dependencies, skipping manifests whose
driver isn't installed. Rewrite each library_path to a bare soname so the
bundled driver resolves via the LD_LIBRARY_PATH run.sh already sets.
- .docker/install-base-deps.sh, backend/Dockerfile.golang,
backend/Dockerfile.python: install mesa-vulkan-drivers in every Vulkan
builder so the driver + manifests exist to be packaged (the LunarG SDK
ships only the loader and shader tooling).
- pkg/model/process.go: when a backend ships vulkan/icd.d/, point the loader
at it via VK_DRIVER_FILES/VK_ICD_FILENAMES at launch (no-op otherwise).
Covered by pkg/model/process_vulkan_test.go.
- backend/go/parakeet-cpp/package.sh: complete the L0 stub (was missing the
libc-family ldd walk + GPU-lib packaging) by mirroring whisper, so the
vulkan-parakeet image actually bundles its GPU runtime.
Assisted-by: Claude Code:claude-opus-4-8
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Squashed feat/pii-ner-tier-engine rebased onto master (was 45 commits; see
backup/pii-ner-tier-engine-prerebase). Net change:
- privacy-filter.cpp: standalone GGML engine for the openai-privacy-filter
PII/NER token classifier, wired as a LocalAI gRPC backend (CPU/CUDA/Vulkan).
TokenClassify moves off the patched llama.cpp path onto this backend.
- PII filter reworked to be NER-centric (encoder/NER detection tier scanning
whole conversations as one document), with a recreated bounded restricted-
regex secret-matching pattern detector tier alongside it (per-model
pii_detection.builtins / .patterns + core/services/routing/piipattern).
- Detection labelled by source (ner vs pattern); backend trace / confidence /
debug observability; analyze/redact exposed as a synchronous API.
- Instance-wide default detector policy + per-usecase default-on; request
filtering extended to completions, embeddings, edits & Ollama.
- React UI: NER-centric PII editor, detector-models table, pattern/builtins
editor, middleware default-policy UI.
- Gallery: privacy-filter-multilingual token-classify model + NER install
filter; token_classify known_usecase; batch sized to context for NER models.
privacy-filter backend registered in the backend gallery (cpu/vulkan/cuda-13
meta + image entries with a capabilities map) matching its CI matrix jobs,
and an /import-model auto-detect importer (PrivacyFilterImporter, narrow
privacy-filter GGUF detection) replacing the prior pref-only registration.
Reconciled against master's independent evolution:
- Dropped master's PIIPatternOverrides feature (global-pattern runtime
overrides + /api/pii/patterns API + runtime_settings.json persistence). The
per-model NER + pattern-detector design supersedes it; it was built on the
global redactor pattern set this branch replaced.
- Reverted the llama.cpp Score carry-patch (0006-server-task-type-score):
removed the patch and restored master's grpc-server.cpp Score RPC (direct
llama_decode, slot-loop bypass) and LLAMA_VERSION pin, plus master's
model_config validation forbidding score + chat/completion/embeddings on
llama-cpp. token_classify is unaffected (it runs on the privacy-filter
backend, not llama-cpp).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
feat(ds4): wire SSD streaming + quality engine options, add 128GB DeepSeek gallery models
The ds4 backend zero-initialized ds4_engine_options and exposed none of the
engine's tunable knobs, so SSD streaming (run a model larger than RAM by
streaming routed MoE experts from the GGUF on SSD) and the quality/perf knobs
were unreachable from LocalAI model YAMLs.
Map ModelOptions.Options onto ds4_engine_options through a declarative table
(kEngineOptSpecs + apply_engine_option) instead of per-field branches: the
struct is fixed C with no reflection, so the field set is enumerated once and a
future knob is a one-line table row. Two fields use ds4's own typed parsers
(GiB budgets, cache-experts count-or-NGB). Bare flags (e.g. "ssd_streaming")
mean true; path-type options (mtp_path, expert_profile_path,
directional_steering_file) resolve relative to the model directory so a gallery
entry can reference a companion file by bare filename. mtp_draft/mtp_margin are
now validated rather than parsed with throwing std::stoi/std::stof.
Add gallery entries for the 128 GB class:
- deepseek-v4-flash-q2-q4 (~91 GB, mixed q2/q4, fits RAM, higher quality)
- deepseek-v4-flash-q4-ssd (~153 GB full 4-bit, runs on 128 GB via SSD streaming)
- deepseek-v4-flash-q2-mtp (~81 GB + MTP speculative draft weights)
- deepseek-v4-pro-q2-ssd (~433 GB Pro, experimental SSD streaming)
SSD streaming is Metal (Darwin) only; the options are inert on CUDA/CPU.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(depth): add depth-anything-3-metric-large gallery entry
DA3METRIC-LARGE (ViT-L) single-file metric-scale depth + sky, served by the
existing depth-anything backend (same single-GGUF path as mono-large). GGUF
published at mudler/depth-anything.cpp-gguf.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(depth): serve nested metric model (two-file load)
The DA3 nested model needs both branches (anyview GIANT + metric ViT-L) loaded
together. Wire it through the backend:
- Load reads a 'metric_model:<file>' entry from ModelOptions.Options and, when
present, calls da_capi_load_nested(anyview, metric) instead of da_capi_load
(registers the new abi-4 symbol; helper optionValue + unit test).
- gallery: depth-anything-3-nested (model=anyview, options=metric branch, both
GGUFs fetched) for metric-scale depth + pose.
- bump depth-anything.cpp pin to cce5edc (abi 4 / da_capi_load_nested).
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(backend): add depth-anything (Depth Anything 3) C++/ggml backend + gallery
Mirrors the locate-anything-cpp backend to register a new depth-anything
backend that wraps the Depth Anything 3 ggml port (depth-anything.cpp) via
purego (cgo-less, no Python at inference).
- backend/go/depth-anything-cpp/: gRPC backend (Load + Predict + GenerateImage),
purego binding to the da_capi_* C ABI, CMake/Makefile/run/package/test scripts
building depth-anything.cpp's DA_SHARED static .so per CPU variant.
- backend/index.yaml: depth-anything backend meta + all hardware-variant
capability entries (cpu/cuda12/cuda13/intel-sycl-f32+f16/vulkan/nvidia-l4t).
- gallery/index.yaml: 8 Depth Anything 3 GGUF models (base q4_k/q8_0/f16/f32,
small, large, giant, mono-large).
- .github/backend-matrix.yml: one build entry per hardware variant.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(depth): typed Depth RPC + REST endpoint exposing full DA3 data
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(depth): pin depth-anything.cpp to e0b6814 (ABI 3 dense C-API)
The Depth RPC handler calls da_capi_depth_dense / da_capi_points (C-API ABI 3);
pin the native build to the commit that exports them.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(depth): pin depth-anything.cpp to v0.1.0 release (b515c31)
Repoint the native version from the now-orphaned e0b6814 to the
b515c31 release commit, kept alive by the upstream v0.1.0 tag.
C-API is unchanged (da_capi_abi_version == 3).
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(depth): wire depth-anything-cpp into build, CI bump, and importer
The backend dir, gallery index, and CI build-matrix were present but the
backend was never wired into the integration points that adding-backends.md
requires:
- root Makefile: add to .NOTPARALLEL, the test-extra chain, a BACKEND_*
definition, the docker-build target eval, and docker-build-backends
(mirrors parakeet-cpp; the backend's own Makefile already documented that
its `test` target is driven by test-extra).
- bump_deps.yaml: register the DEPTHANYTHING_VERSION pin so the daily
auto-bump bot tracks mudler/depth-anything.cpp master (it cannot see an
unregistered Makefile pin).
- import form: add a preference-only KnownBackend entry so depth-anything is
selectable at /import-model (mirrors sam3-cpp; no reliable GGUF auto-detect
signal, so pref-only per the doc's default).
changed-backends.js needs no entry: the generic golang suffix branch already
resolves backend/go/depth-anything-cpp/.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(depth): auto-detect importer for depth-anything GGUFs
Replace the preference-only entry with a real auto-detect importer
(mirrors parakeet-cpp / locate-anything):
- DepthAnythingImporter matches a .gguf whose name carries a
depth-anything token (depth-anything-<size>-<quant>.gguf), so
/import-model recognises mudler/depth-anything.cpp-gguf repos and direct
GGUF URLs without an explicit backend preference. preferences.backend=
"depth-anything" still forces it.
- Registered before LlamaCPPImporter so its GGUF bundles aren't claimed by
the generic .gguf importer; the narrow name match means it cannot claim
arbitrary llama GGUFs or the upstream safetensors PyTorch repos.
- Multi-quant repos pick the smallest quant by default (q4_k -> ... -> f32,
depth stays >0.998 corr even at q4_k); quantizations preference overrides.
- Drops the now-redundant knownPrefOnlyBackends entry (importer-backed
backends are not listed there, matching parakeet-cpp).
- Table-driven Ginkgo test covers detection, negative cases (llama GGUF,
upstream safetensors), default/override/fallback quant pick, and direct
URL import. 10/10 specs pass.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(depth): check conn.Close error in grpc Depth client (errcheck)
The new Depth() client method used a bare `defer conn.Close()`. golangci-lint
runs with new-from-merge-base, so although the 39 sibling methods use the same
bare form (grandfathered), the newly added line trips errcheck. Drop the result
explicitly to satisfy the linter.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8
* fix(depth): bump depth-anything.cpp to v0.1.1 (embeddable CMake)
v0.1.0 (b515c31) used ${CMAKE_SOURCE_DIR} for its include dirs, which
points at the parent project when built via add_subdirectory() as this
backend does, so the container build failed with missing stb_image.h /
da_gguf_keys.h. v0.1.1 (2d42897) switches to project-relative paths.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8
* fix(depth): resolve gosec findings in the backend wrapper
The code-scanning gate flagged three new failure-level alerts in
godepthanythingcpp.go (gosec runs with -no-fail; GitHub gates on new alerts):
- G301: export dirs were created with 0o755. Tighten to 0o750 (no world
access needed for backend-written export output).
- G304: writeDepthPNG creates req.GetDst(). That path is chosen by the
LocalAI core as the intended output destination (same pattern every
image backend uses), not attacker input, so annotate with #nosec G304
and document why.
The remaining G103 "audit unsafe" notes on the unsafe.Slice C-buffer copies
are warning-level (the same purego interop whisper/parakeet use) and do not
gate the check, per the supertonic exclusion precedent in secscan.yaml.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8
* fix(depth): bump depth-anything.cpp to v0.1.2 (CUDA cross-build arch)
v0.1.1 forced CMAKE_CUDA_ARCHITECTURES=native, which breaks the GPU-less
l4t/cublas CI builds (nvcc "Unsupported gpu architecture 'compute_'" on
CMake 3.22). v0.1.2 (442eea4) drops the override and lets ggml pick its
default cross-build arch list.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>