* feat(realtime): EOU-driven semantic_vad turn detection
Add a `semantic_vad` turn-detection mode to the realtime API that feeds
the transcription model live and decides "the user finished speaking"
from the `<EOU>` end-of-utterance token rather than from silence alone.
When EOU fires the turn commits immediately (~0.3s); otherwise it falls
back to an eagerness-scaled silence threshold (low/med/high = 8/4/2s).
Plumbing, bottom to top:
- proto: `AudioTranscriptionLive` bidirectional RPC (config-first oneof,
mono float PCM @16k, ready-ack / Unimplemented degrade signal) plus
`TranscriptResult.eou` for the unary retranscribe gate.
- pkg/grpc: client/server/base/embed scaffolding for the bidi stream,
modeled on AudioTransformStream; release stream conns on terminal Recv.
- parakeet-cpp: live transcription RPC with per-C-call engine locking
(one live stream per turn, finalize+free at commit); bump parakeet.cpp
to ABI v5 — incremental StreamingMel (no more quadratic per-feed mel
recompute that delayed EOU on long turns) and the <EOU>/<EOB> split;
strip the literal <EOU>/<EOB> from offline text and set Eou.
- core/backend: LiveTranscriptionSession wrapper + pipeline
`turn_detection:` config block (type/eagerness/retranscribe).
- realtime: semantic_vad integration — live input captions streamed as
transcription deltas while the user speaks, EOU-immediate commit with
eagerness fallback, optional retranscribe gate (batch re-decode must
also end in <EOU> to confirm), clause synthesis off the LLM token
callback, and per-turn live-transcription / model_load telemetry.
- UI: show the realtime pipeline components as a vertical list.
Docs and tests included; opt-in via the pipeline YAML or per-session
`session.update`. Non-streaming STT backends degrade to silence-only.
Assisted-by: Claude Code:claude-opus-4-8 [Read] [Edit] [Write] [Bash]
Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(realtime): explicit formally-verified state machines + parakeet streaming driver
The realtime API had several implicit state machines whose state was inferred
from scattered booleans, channels, and five separate mutexes, leaving
illegal/inconsistent states reachable. Make them explicit and keep the
implementation in step with a formal design; rework the parakeet streaming
backend along the same lines.
Realtime state machines (M1-M5). Each is a sealed sum-type State/Event/Effect
with a total, pure Next(state,event)->(state,[]effect) behind a single-writer
Coordinator:
M1 conncoord connection lifecycle: VAD toggle + once-only teardown
(replaces vadServerStarted + a `done` channel closed from
two sites).
M2 turncoord turn detection: collapses speechStarted and the live-stream
"turn open" flag into one state, so discardTurn can no longer
desync them and suppress the next onset.
M3 respcoord response coordination: serializes the dual-writer
start/cancel so at most one response is live; one
response.done per response.create.
M4 compactcoord conversation compaction: single-flight (replaces the
`compacting atomic.Bool` CAS).
M5 ttscoord TTS pipeline: open->closing->closed, idempotent wait(),
rejects enqueue-after-close (was a silent drop).
The Coordinator/Sink/Next plumbing — only the sealed types and Next differed
per machine — is extracted once into core/http/endpoints/openai/coordinator as
a generic Coordinator[S,E,F]; each machine keeps its public API via type
aliases, so no sink, call-site, or test moved.
Hierarchy. session_lifecycle.fizz models M1 as the parent region with its
children (M2/M3/M4) as one statechart and asserts ChildrenDieWithParent (conn
torn => all children terminal, none start after teardown). respcoord and
compactcoord gain an absorbing Terminated state + Shutdown event; conncoord's
teardown drives the children terminal. This closes a compaction teardown gap: a
fire-and-forget compaction could outlive a torn session — compactionSink now
takes a session-scoped cancellable context + WaitGroup and joins the in-flight
summarize+evict on shutdown.
Formal verification. formal-verification/ holds one authoritative FizzBee spec
per machine plus the composition spec, each with an always-assertion and a
documented one-line edit that makes the checker fail (verified non-vacuous).
scripts/realtime-conformance.sh is fail-closed: all Go conformance suites under
-race AND a model-check of every .fizz spec; a missing FizzBee is a hard error
(only the loud REALTIME_CONFORMANCE_SKIP_FIZZBEE=1 bypasses it, never in CI).
FizzBee is pinned by sha256 and installed via scripts/install-fizzbee.sh into
.tools/ (gitignored). Wired as make test-realtime-conformance, a CI workflow,
and a pre-commit path filter. Go conformance tests are Ginkgo/Gomega (per the
repo's forbidigo lint): transition tables + fixed-seed property walks +
concurrent/-race specs, no rapid dependency. Design map:
docs/design/realtime-state-machines.md.
Parakeet streaming backend. The same treatment applied to the parakeet-cpp
streaming paths:
- AudioTranscriptionStream returns codes.Unimplemented for non-streaming models
instead of decoding offline and emitting it as one delta + final. A client
that asked for streaming learns the model cannot stream rather than receiving
a batch result shaped like a stream. New grpcerrors.StreamTranscriptionUnsupported
carries that signal; the HTTP /v1/audio/transcriptions stream path surfaces it
as an SSE error event. Mirrors AudioTranscriptionLive, which already did this.
- utteranceBoundary (boundary.go): a single definition of the end-of-utterance
latch, replacing three open-coded finalEou toggles. Modelled as a two-valued
type so illegal states are unrepresentable.
- Shared decode driver (driver.go): streamFeedResult (one per-feed event) +
feedChunk (hides the ABI v4 JSON vs text-only split) + feedSlices + flushTail.
The feed loop is written once.
- AudioTranscriptionLive becomes a bidi adapter: it streams the per-feed
{delta,eou,eob,words} the realtime turn detector consumes and a terminal
FinalResult carrying only Text. Segments/duration/eou are offline-only and no
longer produced (nor read) on the live path; liveTraceState drops the terminal
eou and keeps the per-feed eou_events count.
- AudioTranscriptionStream + streamJSON merge into one driver-based function;
streamSegmenter is generalized to the unified event with a text-only fallback
that preserves the legacy (no-words) library's per-utterance segmentation.
Verified: build/vet/gofumpt clean, golangci-lint 0 issues, all coordinator and
parakeet packages under -race, the fail-closed conformance gate green, and
make test-realtime (12 e2e WS+WebRTC).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
chore(recon): re-pin voice/face-detect to squashed release commits
The voice-detect.cpp and face-detect.cpp engine repos were squashed to a single
release commit, which orphaned the previous pins (voice 3d51077, face 06914b0).
Re-pin to the new single-commit SHAs (voice 1db1759, face e22260d).
These also fold in a real correctness fix: the persistent graph-cache fingerprint
now includes op_params, so two structurally identical GGML_OP_CUSTOM graphs (a
blocked 3x3 vs a blocked 1x1 strided conv) can no longer false-hit the cache and
replay the wrong kernel. voice CI was failing test_blocked/conv1x1_s2 with an
out-of-bounds write on the GGML_NATIVE=OFF build; both engine repos are now green
and WeSpeaker embed parity is 1.0 vs golden.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(voice-detect): add Go purego backend for voice-detect.cpp
Add backend/go/voice-detect implementing the Backend gRPC voice subset
(VoiceEmbed/VoiceVerify/VoiceAnalyze) over libvoicedetect.so via purego,
mirroring the parakeet-cpp / omnivoice-cpp backends.
The flat voicedetect_capi C ABI is dlopen'd cgo-less; malloc'd string and
float-vector returns are owned by Go and released through the matching capi
free functions, with the per-ctx last error surfaced into Go errors. Calls are
serialized via base.SingleThread since the C context is not reentrant.
Proto field mapping:
- VoiceEmbed: VoiceEmbedRequest.audio (path) -> embed_path -> Embedding+Model.
- VoiceVerify: audio1/audio2 + threshold (<=0 falls back to the
verify_threshold option, default 0.25) -> verify_paths -> verified/distance/
threshold/confidence/model/processing_time_ms.
- VoiceAnalyze: audio (path) -> analyze_path_json; the JSON age/gender/emotion
document maps to a single VoiceAnalysis segment (start/end 0; gender "label"
-> dominant_gender with the remaining float scores as the gender map; emotion
label/scores -> dominant_emotion/emotion).
The Makefile pins voice-detect.cpp to 47546430, clones+builds libvoicedetect.so
with ggml static-linked (PIC, GGML_NATIVE off) so dlopen needs no external
libggml/libvoicedetect; ldd on the artifact shows only system libs. Ginkgo
tests cover option parsing and analyze-JSON mapping; embed/verify smoke specs
gate on VOICEDETECT_BACKEND_TEST_MODEL + VOICEDETECT_BACKEND_TEST_WAV.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(voice-detect): wire backend into index, gallery and build
Register the voice-detect.cpp speaker-recognition + voice-analysis
backend (added in Voice-INT-A) into LocalAI's distribution surfaces,
mirroring the ced backend (the closest mudler C++/ggml audio analogue):
- backend/index.yaml: add the &voicedetect meta-backend (capabilities
platform map, no top-level uri) plus the full set of concrete per-arch
image entries (cpu/cuda12/cuda13/metal/rocm/sycl/vulkan/l4t and the
-development variants). Referential integrity audited - every alias
target resolves.
- gallery/index.yaml: add 5 model entries on backend voice-detect -
ECAPA-TDNN, WeSpeaker ResNet34, 3D-Speaker ERes2Net, CAM++ and the
wav2vec2 age/gender/emotion analyze model. The engine architecture is
read from GGUF metadata (voicedetect.arch) at load. GGUF artifacts are
not yet published: each files: entry points at the intended
mudler/voice-detect-gguf location with a TODO to fill sha256 after
upload (no fabricated hashes).
- .github/backend-matrix.yml: add the linux build matrix block + the
darwin metal entry mirroring ced.
- .github/workflows/bump_deps.yaml: track mudler/voice-detect.cpp via
VOICEDETECT_VERSION (pin 47546430, = 4754643).
- core/config/backend_capabilities.go: register voice-detect in the
backend capability map (VoiceVerify/VoiceEmbed/VoiceAnalyze ->
speaker_recognition), mirroring speaker-recognition.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(face-detect): add purego Go backend for face-detect.cpp
Add the LocalAI Go backend that dlopens libfacedetect.so (the flat
facedetect_capi_* C-ABI) via purego, mirroring the sibling voice-detect
backend. Implements the Face subset of the Backend gRPC service:
- Embeddings(PredictOptions): Images[0] base64 -> temp file -> embed_path
-> L2-normalized ArcFace embedding.
- Detect(DetectOptions): src -> detect_path_json -> Detection boxes
(class_name "face", [x1,y1,x2,y2] -> x/y/w/h).
- FaceVerify(FaceVerifyRequest): two images + threshold + anti_spoof ->
verify_paths; best-effort img areas via detect.
- FaceAnalyze(FaceAnalyzeRequest): img -> analyze_path_json -> per-face
age + gender ("M"/"F" normalized to "Man"/"Woman").
The Makefile pins face-detect.cpp to 636a1963 and builds the shared lib
with ggml + vendored libjpeg-turbo static (PIC), so the .so is
ldd-clean (no libggml) and exports only facedetect_capi_* (no jpeg_
symbols). Gated Ginkgo e2e mirrors voice-detect.
Note for the gallery-wiring task: backend registration (index.yaml,
gallery, core/config/backend_capabilities.go) is intentionally not
touched here.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(voice-detect): replace em dashes in net-new descriptions
Project style forbids em/en dashes. Replace the three U+2014 chars
introduced by the voice-detect gallery/index wiring with `-`/`:`.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(face-detect): wire backend into index, gallery and build
Register the face-detect.cpp face detection / embedding / verification /
analysis backend (added in Face-INT-A) into LocalAI's distribution
surfaces, mirroring the voice-detect wiring (the closest mudler C++/ggml
recognition analogue):
- backend/index.yaml: add the &facedetect meta-backend (capabilities
platform map, no top-level uri to avoid the meta-backend gotcha) plus
the full set of concrete per-arch image entries (cpu/cuda12/cuda13/
metal/rocm/sycl-f16/sycl-f32/vulkan/l4t and the -development variants),
22 entries. Referential integrity audited: every alias target resolves.
- gallery/index.yaml: add 4 model entries on backend face-detect -
face-detect-buffalo-l/m/s (insightface SCRFD + ArcFace/MBF, NON-COMMERCIAL)
and face-detect-yunet-sface (OpenCV-Zoo YuNet + SFace, APACHE-2.0, the
commercial-friendly alternative). The detector/embedder architecture is
read from GGUF metadata (facedetect.arch) at load; only the real
verify_threshold option is set (0.35 buffalo, 0.363 sface). GGUF
artifacts are not yet published: each files: entry points at the
intended mudler/face-detect-gguf location with a TODO to fill sha256
after upload (no fabricated hashes).
- core/config/backend_capabilities.go: register face-detect in the
backend capability map (Embedding/Detect/FaceVerify/FaceAnalyze ->
face_recognition), mirroring insightface.
- .github/backend-matrix.yml: add the linux build matrix block + the
darwin metal entry mirroring voice-detect.
- .github/workflows/bump_deps.yaml: track mudler/face-detect.cpp via
FACEDETECT_VERSION (pin 636a1963).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(recon): voice-detect metal build branch + face-detect gallery usecases
Add the missing metal BUILD_TYPE branch to the voice-detect Makefile
forwarding -DVOICEDETECT_GGML_METAL=ON, mirroring face-detect, so the
darwin metal CI artifact is built with the Metal backend instead of
CPU-only.
Expand the 4 face-detect gallery models' known_usecases to
[face_recognition, detection, embeddings] to match the backend
capabilities map and the mirrored insightface-buffalo entries, so
auto-selection for /v1/detect and /embeddings works.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* docs(recon): document voice-detect and face-detect ggml backends
Document the new standalone C++/ggml biometric backends as the
recommended/default option for face and voice recognition, keeping the
existing Python insightface / speaker-recognition backends framed as the
legacy path.
- features/face-recognition.md: add a face-detect (ggml) backend section
with the gallery entries (buffalo-l/m/s non-commercial, yunet-sface
Apache-2.0), licensing, and verify/detect/analyze quickstart.
- features/voice-recognition.md: add a voice-detect (ggml) backend
section with the gallery entries (ecapa-tdnn, wespeaker-resnet34,
eres2net, campplus speaker recognizers; emotion-wav2vec2 non-commercial
analyze head) and quickstart.
- reference/compatibility-table.md: add face-detect.cpp and
voice-detect.cpp rows to the Vision, Detection & Recognition table.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(gallery): publish recon backend GGUF uris + sha256
Fill in the published HuggingFace GGUF uris and verified sha256 for the
9 recon gallery entries (voice-detect-* and face-detect-*), and remove
the TODO publish markers. Correct the eres2net, campplus, and
emotion-wav2vec2 uris to the actual published filenames.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(gallery): re-embed buffalo anti-spoof + add audeering age/gender voice model
Update the 3 buffalo face-detect GGUF sha256 (anti-spoof ensemble now
embedded and re-uploaded under the same filenames/uris) and note the
FaceVerify anti_spoof request flag in each description. Add a new
voice-detect-age-gender-wav2vec2 gallery entry mirroring the emotion
model.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(gallery): add face-detect-buffalo-sc and antelopev2 packs
Add gallery entries for two newly-published insightface face packs on
the face-detect backend: buffalo_sc (smallest pack, SCRFD-500M + small
ArcFace) and antelopev2 (higher-accuracy, SCRFD-10G + ArcFace glint360k
R100, 512-d). Both are non-commercial research-only.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(recon): honor LocalAI per-model threads in voice/face-detect backends
LocalAI spawns one backend process per model and serves requests
concurrently, so the engines' own min(hardware_concurrency, 8) default
can oversubscribe cores. Forward the per-model Threads value from the
gRPC LoadModel options into the engine via VOICEDETECT_THREADS /
FACEDETECT_THREADS (read at backend construction) before the capi load.
A non-positive Threads is treated as unset, leaving the engine default.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump backend pins to CPU-optimized engine commits
voice-detect.cpp -> 0d9c1b3 (radix-2 FFT FBank, threads, flash attn + cached
pos-conv); face-detect.cpp -> 523aee1 (thread-gated direct conv, threads).
Brings the CPU optimizations into the LocalAI backend builds. GGUF format and
parity unchanged, so the published HF GGUFs remain valid.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump backend pins to round-2 CPU-optimized engines
voice-detect.cpp -> fe7e6a3 (ERes2Net 1x1->mul_mat, CAM++ layout+context,
wav2vec2 conv-LN, ECAPA capture-drop, AVX512 dispatch opt-in); face-detect.cpp
-> 9c8adb7 (AVX2 Winograd F(2x2,3x3) for SCRFD/ArcFace 3x3 convs, ArcFace
BN-fold). Parity unchanged (cosine=1.0); GGUF format unchanged, HF GGUFs valid.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump backend pins to round-3 Winograd engines
voice-detect.cpp -> 45122ec (Winograd F(2x2,3x3) for WeSpeaker/ERes2Net 3x3
convs, -22%/-20% @8t); face-detect.cpp -> cd5c962 (Winograd F(4x4,3x3) for
SCRFD large maps, -22% @1t on top of F(2x2), more load-stable). Parity held
(cosine=1.0); GGUF format unchanged, HF GGUFs valid.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump backend pins to round-4 Winograd engines (CPU opt complete)
voice-detect.cpp -> d2839ca (CAM++ FCM 2D convs through Winograd, -15.5%/-10.3%);
face-detect.cpp -> c1db23d (AVX2-vectorized Winograd tile transforms, SCRFD
detect -14%/-9.6%). Final CPU optimization round; the conv-kernel lever class is
now exhausted (parity held cosine=1.0; GGUF/parity unchanged, HF GGUFs valid).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump face-detect pin to deep-kernel engine (7ae5c4d)
face-detect.cpp -> 7ae5c4d: register-blocked winograd-domain GEMM microkernel
(2.8x isolated GFLOP/s), AVX-512 zmm evolution behind runtime CPUID dispatch
(ship-safe, AVX2 fallback bit-identical), bias/relu fused into the winograd
output transform, and SFace Conv+BN fold + bias/PReLU fusion. SCRFD detect
~1.4x faster end-to-end vs the round-4 baseline; parity bit-exact; portable
single binary (function-multiversioned, no global -mavx512f).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump voice-detect pin to ECAPA operand-order win (e9c56ae)
voice-detect.cpp -> e9c56ae: weight-as-src0 mul_mat order in ECAPA's F32
conv1d_same (routes through tinyBLAS sgemm); ECAPA embed 1.67x @1t / ~1.3x @8t,
parity cosine=1.0. Isolated to encoder.cpp (ECAPA-only); ERes2Net/CAM++/WeSpeaker
do not call conv1d_same so are provably unaffected.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump pins to FMA-throughput engines (voice f7b9f89, face 2d2d5f0)
face -> 2d2d5f0: route ArcFace 3x3 body convs through the AVX-512 winograd
microkernel (kWinoMinSize 80->14); ArcFace 1.62x @1t, SCRFD detect to 0.966 of
MLAS @1t, no regression. voice -> f7b9f89: runtime-CPUID-dispatched AVX-512
winograd-GEMM microkernel (ship-safe, AVX2 fallback bit-identical); WeSpeaker
1.90x @1t. Parity cosine=1.0 throughout; portable single binaries.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump pins to MLAS-class direct-conv engines (voice 7ecfd07, face be22d67)
Hand-tuned nChw16c AVX-512 register-tiled direct-conv microkernel (~263 GFLOP/s,
within 6-7% of MLAS per-op efficiency), runtime-CPUID-dispatched + AVX2 fallback,
fused bias/relu. voice 7ecfd07: default 3x3-s1 kernel for WeSpeaker (+37%/+32%)
+ ERes2Net, CAM++ pinned to Winograd. face be22d67: shape-gated to the ArcFace
recognizer body (+25-27% @8t); SCRFD detector stays on Winograd (no regression).
Parity cosine=1.0 / detect <=1px on AVX-512 + AVX2 paths. Portable single binaries.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump voice pin to Phase-A blocked backbone (f4e7eef)
WeSpeaker ResNet34 runs as one nChw16c blocked island (2 reorders/forward vs
~60) on AVX-512, default; per-conv directconv fallback on AVX2. +2.9% @1t /
+17-19% @8t vs per-conv directconv, parity cosine=1.0. The conv microkernel is
already FMA-bound near peak (~0.86-0.98x MLAS-implied); residual to MLAS is
sub-peak edge + non-conv tail, documented in docs/cpu-optimization.md.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump pins to breadth blocked-backbone (voice 7f66871, face d80092b)
voice 7f66871: AVX2-vectorized (ymm) blocked island - AVX2-only hosts now run
the blocked backbone for WeSpeaker (2.3x over per-conv-AVX2, cosine=1.0);
ERes2Net stays per-conv (blocked regresses, opt-in only); CAM++ Winograd-pinned.
face d80092b: ArcFace recognizer blocked island, AVX-512 default (-13% @8t, ~0.90x
MLAS, the closest conv result), auto per-conv on AVX2; SCRFD untouched on Winograd
(0 island invocations during detect). Parity cosine=1.0 / detect <=1px throughout.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump pins to small-spatial + stem conv kernels (voice 99b1804, face 47fdab6)
Measured-gap-driven conv kernels: small-spatial (fill the register tile when
output width <= tile width) + small-IC stem + strided-1x1/downsample recovery.
ArcFace recognizer 0.57 -> 0.70x MLAS @1t (the closest conv model), WeSpeaker
0.65 -> 0.79x @1t. Parity cosine=1.0 / detect <=1px. The OC-block-sharing lever
was a measured dead-end (deep stride-1 is L3-weight-bandwidth bound, not
read-port bound) and was NOT shipped. Kernel ceiling reached; further gap needs
an algorithm-class change (cache-blocked weight-stationary GEMM, or q8 weights).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump pins to GPU persistent-graph + multi-model-safe cache (voice 45d2e6b, face 0a4799a)
GPU wins (CUDA/ggml backend, no CPU-path change): persistent per-shape graph+context
cache in Backend::compute() eliminates the per-call cudaGraph re-instantiation churn
-> wav2vec2 emotion+age-gender now AT GPU parity with torch-cuDNN on GB10 (0.97-0.98x),
CAM++ -5.7ms; bit-identical parity. Cache hardened multi-model-safe (invalidate-on-free
keyed by the ModelLoader weights buffer) so LocalAI multi-model hosting cannot stale-hit.
Conv models still trail cuDNN (im2col-materialization-bound) - cuDNN implicit-GEMM lever next.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump pins to cuDNN-conv-capable engines (voice b6e4356, face 6107a24)
Adds the opt-in cuDNN implicit-GEMM conv path (VOICEDETECT_GGML_CUDNN /
FACEDETECT_GGML_CUDNN, DEFAULT OFF -> zero build/runtime dep until enabled).
On GPU it kills the im2col-materialization bottleneck and reaches torch-cuDNN
parity on the spill-bound convs: SCRFD detect 14.8->6.4ms (2.3x, ~parity),
WeSpeaker ~parity, ERes2Net beats torch (1.10x); ArcFace/CAM++ neutral (no
spill). Parity exact (SCRFD <=1px, cosine=1.0). To USE it in LocalAI, the CUDA
backend build must enable the flag AND bundle libcudnn - deferred until a
cuDNN-bundled GPU image; flag stays OFF here.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(recon): enable cuDNN conv path on arm64+CUDA13 recon backends
The voice-detect.cpp / face-detect.cpp engines have an opt-in cuDNN
implicit-GEMM conv path behind VOICEDETECT_GGML_CUDNN / FACEDETECT_GGML_CUDNN
(default OFF) that kills im2col on the GPU and reaches torch-cuDNN parity
(SCRFD 2.3x, WeSpeaker/ERes2Net parity), measured on the GB10
(arm64, CUDA 13, sm_121a).
Enable it for the CUDA build, but only where cuDNN actually ships: the
arm64 + CUDA 13 image (GB10/Jetson/L4T). x86 CUDA images carry no cuDNN,
so flipping it on globally for BUILD_TYPE=cublas would be a link failure.
The Makefiles gate on CUDA_MAJOR_VERSION=13 + arch (TARGETARCH from the
matrix/Docker build, uname -m fallback for local builds).
backend/Dockerfile.golang already installs the runtime libcudnn9-cuda-13
in the arm64+CUDA13 apt block; add the matching libcudnn9-dev-cuda-13 so
the build-time link resolves.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump voice-detect pin to ERes2Net blocked-default (30beecd)
Defaults VD_ERES2NET_BLOCKED ON: routes the ERes2Net Res2Net body through the
blocked nChw16c AVX-512 directconv island instead of the 1x1 mul_mat fast path
(CONT-transpose + skinny low-K GEMM). On the shipped GGML_NATIVE=OFF build (ggml
mul_mat is AVX2-only) this wins ~2x at every thread count (2.07x@1t, 2.2x@4t,
2.05x@8t); pure-AVX2 fallback still 1.3-1.62x. Parity exact (cosine=1.000000 vs
golden), so registered voices + verify/identify thresholds are unaffected. The
prior default-OFF rested on a stale comment whose 23pct regression only held on
the non-shipping GGML_NATIVE=ON build.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* docs(readme): announce native voice-detect + face-detect backends in Latest News
Add a Latest News entry for the new from-scratch C++/ggml biometric backends
(voice-detect.cpp + face-detect.cpp) that replace the Python insightface and
speaker-recognition backends: no Python/onnxruntime at inference, self-contained
GGUF, bit-exact parity, GPU cuDNN parity. Mirrors the parakeet.cpp /
locate-anything.cpp native-backend news entries. Refs PR #10441.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): re-pin to the squashed engine release commits
The voice-detect.cpp and face-detect.cpp histories were squashed to a single
release commit, which orphaned the previous pins (voice 30beecd, face 6107a24).
Re-pin to the new single-commit SHAs (voice 3d51077, face 06914b0); the tree is
identical, so the backend build is unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(backends): whisper darwin run.sh loads whichever fallback lib exists
The macOS branch hardcoded WHISPER_LIBRARY=$CURDIR/libgowhisper-fallback.dylib,
but the cmake build emits a Mach-O named libgowhisper-fallback.so on darwin, so
the Go loader panicked at runtime ("dlopen ...dylib: no such file") and the
backend exited ("grpc service not ready") — breaking e.g. the silero-vad-ggml
VAD on darwin. Pick whichever of .dylib/.so is present so it is robust to the
build's naming either way.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(backends): darwin packaging for silero-vad
silero-vad was the last Go backend with Linux-only darwin packaging:
- package.sh fell through to "Could not detect architecture" -> exit 1 on
macOS (no Darwin branch), so its darwin image never packaged.
- run.sh exported LD_LIBRARY_PATH, which macOS dyld ignores, so the bundled
libonnxruntime.dylib couldn't be found at runtime.
Add a Darwin branch to package.sh (skip the glibc/ld.so bundling; add an
@loader_path/lib rpath so @rpath resolves to package/lib/) and a
DYLD_LIBRARY_PATH branch to run.sh — mirroring the piper darwin fix (#10525).
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The metal-darwin-arm64-piper backend crashed at launch on macOS:
DYLD "Library missing"
Library not loaded: @rpath/libucd.dylib
Referenced from: .../piper
Reason: no LC_RPATH's found
The piper binary links libucd, libespeak-ng, libpiper_phonemize and
libonnxruntime via @rpath, but ships with no LC_RPATH, so dyld cannot
expand @rpath and aborts before piper runs. The libraries themselves are
already bundled in package/lib/ by package.sh.
Additionally, package.sh's architecture detection only handled the Linux
glibc loaders (/lib64/ld-linux-x86-64.so.2, /lib/ld-linux-aarch64.so.1)
and otherwise hit `echo "Error: Could not detect architecture"; exit 1`,
so on macOS packaging failed outright.
Add a Darwin branch (before the Linux checks) that skips the glibc/ld.so
bundling macOS has no use for and instead runs
`install_name_tool -add_rpath @loader_path/lib` on the piper binary, so
@rpath resolves to the bundled package/lib/ directory.
Also mirror sherpa-onnx/opus in run.sh: export DYLD_LIBRARY_PATH on
Darwin (LD_LIBRARY_PATH is Linux-only) as a defensive fallback.
Validated by hand on Apple Silicon: with the rpath added, piper
synthesized a real WAV. The darwin build is validated in CI.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The opus Go backend (WebRTC audio codec) never built on macOS, so the
published master-metal-darwin-arm64-opus image shipped source only — no
opus binary and no libopusshim — because every step assumed Linux.
- Makefile: hardcoded libopusshim.so with no OS handling. Mirror
sherpa-onnx: SHIM_EXT=so / dylib on Darwin and build
libopusshim.$(SHIM_EXT). On Darwin link the shim with
-undefined dynamic_lookup so it resolves opus_encoder_ctl from the
already globally-loaded libopus (codec.go dlopens it RTLD_GLOBAL
first) instead of baking an absolute Homebrew path into the dylib,
keeping the packaged shim relocatable.
- run.sh: hardcoded LD_LIBRARY_PATH + libopusshim.so even on macOS. Add
a Darwin branch exporting DYLD_LIBRARY_PATH and the .dylib shim, like
sherpa-onnx/run.sh.
- package.sh: bundle libopusshim.$(SHIM_EXT) and libopus*.dylib (not
just .so) into package/lib so the OCI image (which ships package/.)
is self-contained on a runtime with no Homebrew; add a Darwin arch
branch so it doesn't warn/skip.
- backend_build_darwin.yml: install + link opus and pkg-config via brew
so the Makefile's `pkg-config opus` resolves on the macOS runner, and
cache opus' Cellar dir.
Go code is unchanged; darwin build is validated in CI.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(backends): quote $CURDIR in run.sh so backends work in paths with spaces
The backend launcher scripts derive their own directory with
CURDIR=$(dirname "$(realpath $0)") and then referenced it unquoted as
$CURDIR (e.g. [ -f $CURDIR/lib/ld.so ], export LD_LIBRARY_PATH=$CURDIR/lib:...,
exec $CURDIR/<binary> "$@"). When a backend is installed under a path that
contains a space - notably macOS's ~/Library/Application Support/... - bash
word-splits the unquoted $CURDIR, so the test builtin fails with
"binary operator expected" and exec tries to run ".../Library/Application",
yielding "No such file or directory". The backend never starts, surfacing as
a gRPC "service not ready" error and an HTTP 500. Quote $CURDIR (and the
realpath "$0") in every affected run.sh; no logic changes.
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
feat(backends): darwin build for the localvqe backend
LocalVQE (acoustic echo cancellation / noise suppression / dereverberation)
already builds on Darwin - its Makefile takes the OS=Darwin branch with
GGML_METAL=OFF (upstream is CPU + Vulkan only), producing a native arm64 CPU
image. It was just never wired into CI.
- .github/backend-matrix.yml: add localvqe to includeDarwin (build-type metal,
lang go) - the darwin/arm64 build profile; the backend itself stays CPU.
- backend/index.yaml: metal: capability + concrete metal-localvqe(-development)
entries pointing at the -metal-darwin-arm64-localvqe images.
- backend/go/localvqe/Makefile: note on the existing Darwin branch (also the
per-backend change the CI path filter needs to build it here).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
feat(backends): darwin/Metal builds for the vision C++/ggml backends
depth-anything-cpp, locate-anything-cpp, rfdetr-cpp and sam3-cpp already carry
a Darwin/Metal path in their Makefiles (GGML_METAL=ON when build-type=metal),
but were never wired into CI, so no Metal image was published and Apple Silicon
could not install them.
- .github/backend-matrix.yml: add the four to includeDarwin (build-type metal,
lang go), matching the other go+ggml *-cpp Metal entries.
- backend/index.yaml: add metal: to each backend's capabilities map (main and
-development) plus concrete metal-<backend>(-development) entries pointing at
the latest/master -metal-darwin-arm64-<backend> images.
- backend/go/*/Makefile: a one-line note on the existing Darwin branch (also
the per-backend change the CI path filter needs to actually build them here).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(parakeet-cpp): darwin/metal support (libparakeet.dylib + DYLD path)
The parakeet-cpp backend had no macOS support and panicked at startup on
Apple/Metal nodes when purego.Dlopen could not find "libparakeet.so".
Fix it across the same four layers the sibling voxtral backend already
handles correctly:
- main.go: default the dlopen target to libparakeet.dylib on darwin
(runtime.GOOS), libparakeet.so elsewhere; PARAKEET_LIBRARY still wins.
- Makefile: also stage the built libparakeet.dylib next to the Go sources.
- package.sh: accept either the Linux .so[.X.Y] or the macOS .dylib when
bundling instead of hard-failing when no .so is present (the macOS case);
note that on Darwin only system frameworks are linked.
- run.sh: on Darwin set DYLD_LIBRARY_PATH and PARAKEET_LIBRARY to the
packaged .dylib; keep LD_LIBRARY_PATH + .so on Linux.
Mirrors backend/go/voxtral.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(backends): darwin/metal support across purego Go backends
The parakeet-cpp fix in the previous commit was an instance of a bug
shared by nearly every purego/dlopen Go backend: the dlopen target was
hardcoded to a .so name and run.sh exported only LD_LIBRARY_PATH, so the
backend panicked at startup on macOS/Apple-Metal nodes (dyld needs the
.dylib name and DYLD_LIBRARY_PATH). voxtral was the only backend handling
this correctly.
Apply the same four-layer fix (mirroring backend/go/voxtral) to the
remaining affected backends:
whisper, sherpa-onnx, ced, stablediffusion-ggml, vibevoice-cpp,
qwen3-tts-cpp, omnivoice-cpp, crispasr, acestep-cpp, locate-anything-cpp,
depth-anything-cpp, rfdetr-cpp, sam3-cpp, localvqe
Per backend:
- main.go (sherpa-onnx: backend.go, two libraries): default the dlopen
target to the .dylib on darwin (runtime.GOOS), .so elsewhere; the
existing <BACKEND>_LIBRARY env override still wins.
- run.sh: on Darwin set DYLD_LIBRARY_PATH and point <BACKEND>_LIBRARY at
the packaged .dylib; keep LD_LIBRARY_PATH + the Linux CPU-variant
(avx/avx2/avx512) selection unchanged in the else branch.
- package.sh: also bundle the .dylib and stop hard-failing when no .so is
present (the macOS case).
- Makefile: also stage the built .dylib.
Notes:
- stablediffusion-ggml and acestep-cpp build their lib as a CMake MODULE,
which emits .so (not .dylib) on macOS; run.sh prefers .dylib and falls
back to .so so both layouts work.
- sherpa-onnx was already partly darwin-aware (Makefile/package.sh); only
run.sh and the two dlopen defaults needed fixing.
Linux behavior is unchanged. Verified gofmt-clean and
`CGO_ENABLED=0 go build` for every backend.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The supertonic Go TTS backend dlopens ONNX Runtime, but its runtime and
packaging scripts were Linux-only: run.sh exported LD_LIBRARY_PATH, pointed
ONNXRUNTIME_LIB_PATH at libonnxruntime.so, and always tried the ld.so exec
path, while package.sh hard-failed on any non-Linux host. On macOS dyld has
no ld.so loader, uses DYLD_LIBRARY_PATH, and ONNX Runtime ships as a .dylib.
This applies the same purego .dylib/DYLD_LIBRARY_PATH fix that PR #10481
landed for 15 other ONNX/purego backends (sherpa-onnx, silero-vad, etc.) but
which omitted supertonic:
- run.sh: on darwin export DYLD_LIBRARY_PATH and point ONNXRUNTIME_LIB_PATH
at libonnxruntime.dylib; guard the ld.so exec path to Linux only.
- package.sh: recognize Darwin instead of erroring out; the bundled .dylib is
resolved via DYLD_LIBRARY_PATH, no glibc/ld.so to bundle.
- helper.go: platform-native default library extension (dylib on darwin) for
the last-resort dlopen fallback.
It also wires the darwin CI build and gallery entries, resolving the
inconsistency where backend/index.yaml advertised metal for supertonic but no
includeDarwin matrix entry built the image:
- .github/backend-matrix.yml: add the -metal-darwin-arm64-supertonic Go entry.
- backend/index.yaml: declare metal capabilities and add the concrete
metal-supertonic / metal-supertonic-development child entries.
The Makefile already detects Darwin/osx/arm64 and stages the per-OS ONNX
Runtime tarball, mirroring sherpa-onnx, so no Makefile change is required.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): sketch sound-classification backend (CED audio tagger)
Wires ced.cpp (CED, 527-class AudioSet sound-event tagger; baby cry,
footsteps, glass, alarms, dog bark) into LocalAI as a Go/purego backend.
SKETCH (backend skeleton real; core REST wiring + CI/gallery is a checklist
in DESIGN.md):
- backend/backend.proto: new SoundDetection rpc + SoundClass messages
(run `make protogen-go` to regenerate pkg/grpc/proto).
- backend/go/ced: main.go (purego dlopen libced.so + ced_capi.h),
goced.go (Ced gRPC backend: Load + SoundDetection), Makefile
(clone-at-pin CED_VERSION, ggml static-PIC shared build), run.sh,
package.sh, .gitignore.
- DESIGN.md: REST /v1/audio/classification wiring (handler/route/capability
registration checklist), gallery/index + CI registration, and a scoping
note for the realtime/websocket live-recognition path (sliding-window
classify over the existing ws transport + voicegate; the ced C-API
per-PCM entry point is already window-friendly).
Backend code does not compile until protogen-go regenerates the pb types
and a libced.so is built (Makefile clones+builds it).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): REST /v1/audio/classification endpoint + capability registration
Wires the ced sound-event classification backend (AudioSet audio tagger)
end to end through the REST surface, mirroring the transcription path.
- Handler: core/http/endpoints/openai/sound_classification.go parses the
multipart audio upload, temp-files it, resolves the model config and
calls the SoundDetection RPC; returns {model, detections[]} JSON.
- Backend wrapper: core/backend/sound_classification.go (ModelSoundDetection)
loads the model and normalizes the proto response into schema types.
- Schema: core/schema/sound_classification.go (SoundClassificationResult).
- gRPC layer: SoundDetection wired through the LocalAI wrapper (interface,
Backend client, Client, embed, server, base default) so the loader-typed
client exposes the RPC; proto regenerated via make protogen-go.
- Route: POST /v1/audio/classification (+ /audio/classification alias) with
the audio/multipart default-model middleware in routes/openai.go.
- Capability surfaces: swagger @Tags/@Router on the handler; FLAG_SOUND_
CLASSIFICATION usecase flag + UsecaseSoundClassification + UsecaseInfoMap +
GuessUsecases + ModalityGroups + GetAllModelConfigUsecases; meta usecase
option; /api/instructions audio area updated; auth RouteFeatureRegistry +
FeatureAudioClassification (APIFeatures, default ON) + FeatureMetas; UI
usecaseFilters, capabilities.js CAP_SOUND_CLASSIFICATION, Models.jsx filter
+ i18n; docs page features/audio-classification.md + whats-new + crosslink.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): realtime sound-event detection over the websocket API
When a realtime pipeline configures a sound-classification model, each
VAD-committed utterance (the same window the transcription path produces)
is also run through the CED sound-event classifier and the scored AudioSet
tags are emitted as a new server event. No new backend rpc is needed: the
SoundDetection gRPC method already exists on this branch.
- config: add Pipeline.SoundDetection (yaml/json sound_detection,omitempty)
beside Transcription/VAD.
- realtime: add Model.SoundDetection(ctx, audio, topK, threshold) to the
ModelInterface; implement it on wrappedModel and transcriptOnlyModel by
calling backend.ModelSoundDetection with the session's sound-classification
model config (mirrors how Transcribe dispatches). Load the optional config
in newModel / newTranscriptionOnlyModel; nil config keeps it additive.
- types: add ConversationItemSoundDetectionEvent (item_id, content_index,
detections[]{label,score,index}) with type conversation.item.sound_detection,
its ServerEventType constant and MarshalJSON, mirroring the transcription
completed event.
- realtime: add emitSoundDetection (unary path: classify the committed window,
build the event, t.SendEvent) and wire it at the utterance-commit hook right
after emitTranscription; gated on session.SoundDetectionEnabled (resolved
from Pipeline.SoundDetection at session setup, defaults top_k=5, threshold=0).
Its error is logged via xlog but never aborts the turn.
- test: Ginkgo specs for emitSoundDetection (tags emitted, empty detections,
classifier error) plus a SoundDetection method on the fakeModel double.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ced): implement SoundDetection in nodes backend test doubles
The SoundDetection method added to the grpc backend interface left two
test doubles (fakeBackendClient, fakeGRPCBackend) incomplete, so
core/services/nodes failed to compile under `go vet`/`go test` (go build
missed it: the doubles live in _test.go). Add the method to both,
mirroring their existing Detect mock. Repairs CI for the nodes package.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): decouple realtime sound detection from VAD (sound-only sessions)
Sound-event detection must activate on sounds, not speech, so it no longer
runs through the voice VAD/transcription path. A sound-detection-only
pipeline (sound_detection set, no transcription/LLM) now:
- is accepted by prepareRealtimeConfig (sound_detection counts as a pipeline
stage),
- builds a lightweight model via newSoundDetectionOnlyModel (no VAD/STT/LLM/TTS
loaded), and
- defaults the session to turn_detection none (no VAD) with no transcription
stage, so the client drives windowing via input_audio_buffer.commit
(option A: client-side sliding window). The per-PCM C-API already supports
arbitrary windows.
commitUtterance gains a sound-only branch: it emits the
conversation.item.sound_detection event (scored AudioSet tags) and stops -
no transcription, no LLM response. generateResponse is now guarded on a
transcription stage being present, so a sound-only turn never invokes the LLM.
Existing transcription/VAD sessions are unchanged (additive). Added a
commitUtterance sound-only Ginkgo spec asserting it emits the sound event and
neither transcribes nor generates a response. go vet + golangci-lint
(new-from-merge-base) clean; openai suite green.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): register sound-classification backend in gallery + CI
Mechanical backend-image registration for the ced sound-event classifier,
mirroring the parakeet-cpp Go/purego backend everywhere it is wired up.
- .github/backend-matrix.yml: add the ced build matrix, field-for-field copies
of the parakeet-cpp entries (cpu amd64/arm64, cublas cuda 12/13 amd64,
l4t cuda-13 arm64, l4t-jetpack cuda-12 arm64, sycl f32/f16, vulkan
amd64/arm64, rocm hipblas, and the metal darwin entry), changing only
backend and tag-suffix. dockerfile stays ./backend/Dockerfile.golang.
- backend/index.yaml: add the &ced meta anchor (capabilities map per platform)
plus ced-development and the per-arch image entries, each uri/mirror
tag-suffix matching the matrix exactly. The model gallery (GGUF) entry is
intentionally deferred pending the HuggingFace publish (TODO note inline).
- scripts/changed-backends.js: add an explicit item.backend === "ced" branch in
inferBackendPath mapping to backend/go/ced/, same mechanism and ordering as
the parakeet-cpp branch (before the generic golang fallthrough).
- .github/workflows/bump_deps.yaml: register mudler/ced.cpp -> CED_VERSION in
backend/go/ced/Makefile so the daily bot bumps the pin.
- swagger/{docs.go,swagger.json,swagger.yaml}: regenerated via make swagger so
the existing /v1/audio/classification annotations land in the generated spec.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): server-side windowing for realtime sound detection (option B)
Adds an optional server-driven sliding-window classifier so a sound-only
realtime client only has to stream audio (no input_audio_buffer.commit):
- Pipeline.sound_detection_window_ms / sound_detection_hop_ms config knobs.
When both > 0 on a sound-only session, the server classifies the last
window of streamed audio every hop and emits a conversation.item.sound_
detection event; the input buffer is trimmed to one window so a long
stream stays bounded. When unset, the session stays client-driven
(option A). Runs independent of VAD (sound events are not speech).
- handleSoundWindow (ticker) + classifySoundWindow (one tick, extracted so
it is unit-testable) + writeWindowWAV, which declares the true
InputSampleRate (NewWAVHeaderWithRate) so the classifier resamples
correctly. Goroutine is started after toggleVAD and torn down with the
session (close + wg.Wait).
- Register pipeline.sound_detection (+window_ms/hop_ms) in the config meta
registry; the earlier realtime commit added pipeline.sound_detection
without a registry entry, failing TestAllFieldsHaveRegistryEntries. This
fixes that and covers the two new knobs.
Tests: classifySoundWindow emits an event + trims the buffer to one window,
no-ops on too-little audio; writeWindowWAV declares the given sample rate.
go build/vet + golangci-lint (new-from-merge-base) clean; config + openai
suites green.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): add ced-base GGUF model gallery entries (f16 + q8_0)
The ced-base weights are now published at mudler/ced-base-gguf (Apache-2.0,
converted from mispeech/ced-base). Adds gallery/ced.yaml (backend: ced +
known_usecases: sound_classification) and two gallery/index.yaml entries
(ced-base-f16 default, ced-base-q8 smallest) with sha256-pinned files, and
removes the now-resolved TODO from backend/index.yaml.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): add tiny/mini/small GGUF model gallery entries
Publishes the rest of the CED family (same architecture, metadata-driven port
verified end-to-end on ced-tiny) to mudler/ced-{tiny,mini,small}-gguf and adds
their f16 + q8_0 gallery entries:
ced-tiny (5.5M, edge/Pi-class) f16 11MB / q8_0 6MB
ced-mini (9.6M) f16 19MB / q8_0 11MB
ced-small (22M) f16 42MB / q8_0 23MB
All sha256-pinned. ced-base remains the accuracy default.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(ced): point gallery entries at the consolidated mudler/ced-gguf repo
All CED quantizations (tiny/mini/small/base, f16/q8_0) now live in a single
HuggingFace repo, mudler/ced-gguf, instead of per-model repos. Repoint the 8
gallery model entries' urls + file uris accordingly. sha256 and filenames are
unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(ced): bump CED_VERSION to the short-clip fix
Pin the ced backend to ced.cpp 99c6ed3, which fixes a crash on any clip
shorter than target_length (~10.11s): time_pos_embed was added at its full
63-frame grid instead of being sliced to the clip's actual time grid, tripping
ggml_can_repeat in ggml_add. Surfaced by the live realtime e2e (sub-10s
windows) and gated with a short-clip parity test upstream.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(ced): list ced.cpp as a LocalAI-team engine + backend-guide directive
- README.md: add ced.cpp to the "native C/C++/GGML engines developed and
maintained by the LocalAI project" table.
- docs/content/features/backends.md: add a Sound Classification backend
category (sound-event classification / audio tagging) listing ced.cpp.
- .agents/adding-backends.md: add a "Documenting the backend" section and two
verification-checklist items requiring new backends to be documented in the
backends.md category list, and in-house native engines to be added to the
README maintained-engines table. This directive was missing.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(ced): repin CED_VERSION to the v0.1.0 release commit
ced.cpp history was squashed into a single release commit (tagged v0.1.0), so
the previous pin (99c6ed3) no longer exists upstream. Pin to c04ac14, the
v0.1.0 release commit, so the backend builds against a commit that exists.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ced): silence gosec G304/G103 + govet unsafeptr on audited paths
- sound_classification.go: os.Create(dst) where dst = temp dir + path.Base of
the upload (no traversal). #nosec G304, matching the depth-anything-cpp handler.
- goced.go: reading a NUL-terminated C string from a libced-owned buffer.
#nosec G103 (gosec) + //nolint:govet (golangci-lint's unsafeptr check), since
the uintptr is a C-owned malloc'd buffer, not Go-GC memory.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The parakeet-specific word accessors can return stale initialisation
data (model name, binary blobs) for segments with no real speech.
Add isValidWord() to filter out words that have:
- empty or whitespace-only text
- U+FFFD replacement characters (from binary data scrubbing)
- negative timestamps
- zero duration (end <= start)
Also skip empty segments entirely when they have no recognisable
content (empty text AND no valid words), preventing spurious subtitle
entries like '00:45:33,592 --> 00:45:33,592 parakeet@rH\u000b\ufffdI'.
Applies to both AudioTranscription and AudioTranscriptionStream.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
* feat(gallery): add Depth Anything V2 models + bump native version
Add Depth Anything V2 (DA2) support to the depth-anything backend. DA2 is
depth-only (no camera pose, no confidence) and ships both relative
(relative inverse depth) and metric (depth in metres) variants. The Go
backend is model-agnostic, so no backend code changes are required — only
a native version bump and new gallery entries.
- backend/go/depth-anything-cpp/Makefile: pin DEPTHANYTHING_VERSION to the
depth-anything.cpp commit that adds the DA2 engine + C-API routing
(e3dec57f13a52366bbc4f279ef44804915960a6b, kept alive by the upstream tag
da2-support so it survives a squash-merge).
- gallery/index.yaml: add 12 DA2 entries (4 base quants, small, large, plus
Hypersim indoor and VKITTI outdoor metric models in S/B/L). Metric models
carry the metric-depth tag; none carry camera-pose.
Assisted-by: Claude:claude-opus-4-8
* chore(depth-anything-cpp): pin to merged DA2 master commit
PR #1 (mudler/depth-anything.cpp) merged to master as f4e17de (squash); repoint
the pin from the pre-merge commit to the canonical master commit.
Assisted-by: Claude:claude-opus-4-8
---------
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(crispasr): add word-level timestamp support
Add word-level timestamp extraction to the crispasr backend by calling
the CrispASR C library's word accessor functions that are already
exported by libgocraspasr but were not previously bound by the Go
wrapper.
Two families of word functions are supported:
1. Session-based (get_word_count/text/t0/t1) — works per-segment for
whisper-like backends.
2. Parakeet-specific (get_parakeet_word_count/text/t0/t1) — returns a
global word list for TDT/CTC/RNNT parakeet models where the session
API does not expose per-segment word data.
The Go code tries session-based first and falls back to parakeet-specific
when the session word count is zero.
Depends on #10402 (grpc server Words forwarding) for the words to reach
the HTTP response.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
* fix(crispasr): use portable sed -i.bak for macOS compatibility
BSD sed requires -i '' for in-place editing while GNU sed uses -i.
Replace with -i.bak which works on both platforms, then remove the
backup file.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
---------
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
Vulkan backends bundled their own loader and ICD manifests but neither the
Mesa driver the manifests point at nor a way to make the loader find them,
so on a runtime base image without Mesa the loader enumerated zero devices
and the GPU silently fell back to CPU (only NVIDIA worked, since its ICD is
injected by the container toolkit).
- scripts/build/package-gpu-libs.sh: for each installed ICD manifest, bundle
the driver .so its library_path names — no hard-coded, platform-dependent
soname list — plus that driver's ldd dependencies, skipping manifests whose
driver isn't installed. Rewrite each library_path to a bare soname so the
bundled driver resolves via the LD_LIBRARY_PATH run.sh already sets.
- .docker/install-base-deps.sh, backend/Dockerfile.golang,
backend/Dockerfile.python: install mesa-vulkan-drivers in every Vulkan
builder so the driver + manifests exist to be packaged (the LunarG SDK
ships only the loader and shader tooling).
- pkg/model/process.go: when a backend ships vulkan/icd.d/, point the loader
at it via VK_DRIVER_FILES/VK_ICD_FILENAMES at launch (no-op otherwise).
Covered by pkg/model/process_vulkan_test.go.
- backend/go/parakeet-cpp/package.sh: complete the L0 stub (was missing the
libc-family ldd walk + GPU-lib packaging) by mirroring whisper, so the
vulkan-parakeet image actually bundles its GPU runtime.
Assisted-by: Claude Code:claude-opus-4-8
Signed-off-by: Richard Palethorpe <io@richiejp.com>