mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-30 19:37:00 -04:00
326 lines
11 KiB
Markdown
+++
disableToc = false
title = "Voice Recognition"
weight = 15
url = "/features/voice-recognition/"
+++

LocalAI supports voice (speaker) recognition: speaker verification
(1:1), speaker identification (1:N) against a built-in vector store,
speaker embedding, and demographic analysis (age / gender / emotion
from voice).

Voice recognition is the audio analog of
[Face Recognition](/features/face-recognition/). Two backends serve the
same `/v1/voice/*` HTTP API:

- **`voice-detect` (recommended, default).** A standalone C++/ggml
  engine ([voice-detect.cpp](https://github.com/mudler/voice-detect.cpp)):
  no Python, no onnxruntime, no torch runtime. Each gallery entry is a
  single self-describing GGUF. Use this backend for new deployments.
- **`speaker-recognition` (Python).** The original SpeechBrain / ONNX
  backend. Still supported; see [the Python backend](#speaker-recognition-python-backend)
  below.

Both backends expose the identical wire format, so the API examples on
this page work with either - only the gallery entry name (the `model`
field) changes.

## voice-detect (ggml) backend

The `voice-detect` backend reads the embedding (or analysis)
architecture (`voicedetect.arch`) directly from the GGUF metadata, so
installing a gallery entry is all that is needed to select an engine. It
drives the `VoiceEmbed` / `VoiceVerify` / `VoiceAnalyze` gRPC RPCs behind the
`/v1/voice/{embed,verify,analyze,register,identify,forget}` endpoints.

### Gallery entries

| Gallery entry | Model | Embedding dim | License |
|---|---|---|---|
| `voice-detect-ecapa-tdnn` | SpeechBrain ECAPA-TDNN (VoxCeleb) | 192 | **Apache 2.0 - commercial-safe** |
| `voice-detect-wespeaker-resnet34` | WeSpeaker ResNet34 (VoxCeleb) | 256 | CC-BY-4.0 |
| `voice-detect-eres2net` | 3D-Speaker ERes2Net (VoxCeleb) | 192 | **Apache 2.0 - commercial-safe** |
| `voice-detect-campplus` | 3D-Speaker CAM++ (VoxCeleb) | 192 | **Apache 2.0 - commercial-safe** |
| `voice-detect-emotion-wav2vec2` | audEERING wav2vec2 (age / gender / emotion) | analyze head | **CC-BY-NC-SA-4.0 - non-commercial** |

The four speaker-recognition entries drive verify / embed / identify.
`voice-detect-emotion-wav2vec2` is the analysis head behind
`/v1/voice/analyze` (continuous age estimate plus gender and emotion
class scores) and is **non-commercial / research use only**.

### Quickstart

Install the default entry:

```bash
local-ai models install voice-detect-ecapa-tdnn
```

Verify that two audio clips were spoken by the same person:

```bash
curl -sX POST http://localhost:8080/v1/voice/verify \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voice-detect-ecapa-tdnn",
    "audio1": "https://example.com/alice_1.wav",
    "audio2": "https://example.com/alice_2.wav"
  }'
```

Analyze age / gender / emotion (install the analyze entry first):

```bash
local-ai models install voice-detect-emotion-wav2vec2

curl -sX POST http://localhost:8080/v1/voice/analyze \
  -H "Content-Type: application/json" \
  -d '{"model": "voice-detect-emotion-wav2vec2", "audio": "https://example.com/alice.wav"}'
```

The 1:N register / identify / forget workflow and the rest of the API
work as described in the [API reference](#api-reference) below - just
pass a `voice-detect-*` model name. The default verify threshold is
~0.25 for the ECAPA-TDNN / ERes2Net / CAM++ recognizers and ~0.30 for
WeSpeaker ResNet34.

## speaker-recognition (Python) backend

The `speaker-recognition` backend ships two engines in one image.

### Engines

| Gallery entry | Model | Size | License |
|---|---|---|---|
| `speechbrain-ecapa-tdnn` | ECAPA-TDNN on VoxCeleb (SpeechBrain) | ~17 MB | **Apache 2.0 - commercial-safe** |
| `wespeaker-resnet34` | WeSpeaker ResNet34 ONNX | ~26 MB | **Apache 2.0 - commercial-safe** |

Both entries are commercial-safe Apache-2.0. SpeechBrain is the
default: it is a lightweight pure-PyTorch checkpoint that
auto-downloads on first use. The `wespeaker-resnet34` entry wires the
direct-ONNX path for CPU-only deployments that don't want the torch
runtime.

## Quickstart

Install the default backend and model:

```bash
local-ai models install speechbrain-ecapa-tdnn
```

Verify that two audio clips were spoken by the same person:

```bash
curl -sX POST http://localhost:8080/v1/voice/verify \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speechbrain-ecapa-tdnn",
    "audio1": "https://example.com/alice_1.wav",
    "audio2": "https://example.com/alice_2.wav"
  }'
```

Response:

```json
{
  "verified": true,
  "distance": 0.18,
  "threshold": 0.25,
  "confidence": 28.0,
  "model": "speechbrain-ecapa-tdnn",
  "processing_time_ms": 340.0
}
```

## 1:N identification workflow (register → identify → forget)

The flow is the same as for face recognition and uses the same
in-memory vector store under the hood.

1. Register known speakers:

   ```bash
   curl -sX POST http://localhost:8080/v1/voice/register \
     -H "Content-Type: application/json" \
     -d '{
       "model": "speechbrain-ecapa-tdnn",
       "name": "Alice",
       "audio": "https://example.com/alice.wav"
     }'
   # → {"id": "b2f...", "name": "Alice", "registered_at": "2026-04-22T..."}
   ```

2. Identify an unknown probe:

   ```bash
   curl -sX POST http://localhost:8080/v1/voice/identify \
     -H "Content-Type: application/json" \
     -d '{
       "model": "speechbrain-ecapa-tdnn",
       "audio": "https://example.com/unknown.wav",
       "top_k": 5
     }'
   # → {"matches": [{"id":"b2f...","name":"Alice","distance":0.19,"match":true,...}]}
   ```

3. Remove a speaker by ID:

   ```bash
   curl -sX POST http://localhost:8080/v1/voice/forget \
     -H "Content-Type: application/json" \
     -d '{"id": "b2f..."}'
   # → 204 No Content
   ```

{{% notice warning %}}
**Storage caveat.** The default vector store is in-memory. All
registered speakers are lost when LocalAI restarts. Persistent storage
(pgvector) is a tracked future enhancement shared with face
recognition; the voice-recognition HTTP API is designed so that the
backing store can be swapped without changing the wire format.
{{% /notice %}}

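Because the default store is in-memory, a deployment that needs its enrolled speakers after a restart has to register them again. One way to script that, as a minimal sketch: keep a tab-separated manifest of `name<TAB>audio` pairs and replay it against `/v1/voice/register`. The manifest format, the `register_bodies` and `replay_manifest` functions, and the `LOCALAI_URL` variable are assumptions made for illustration; none of them are LocalAI features.

```bash
# Hypothetical manifest: one "name<TAB>audio URL" pair per line.
# register_bodies prints one /v1/voice/register JSON body per manifest line.
# Assumes names and URLs contain no quotes or backslashes (no JSON escaping).
register_bodies() {
  local model="$1" manifest="$2" name audio
  while IFS="$(printf '\t')" read -r name audio || [ -n "$name" ]; do
    [ -n "$name" ] || continue   # skip blank lines
    printf '{"model":"%s","name":"%s","audio":"%s"}\n' "$model" "$name" "$audio"
  done < "$manifest"
}

# Replay the manifest against a running server (LOCALAI_URL is an assumption).
replay_manifest() {
  local model="$1" manifest="$2" body
  register_bodies "$model" "$manifest" | while IFS= read -r body; do
    curl -sX POST "${LOCALAI_URL:-http://localhost:8080}/v1/voice/register" \
      -H "Content-Type: application/json" -d "$body"
  done
}
```

Each replayed registration returns a fresh `id`, so any IDs stored by clients before the restart should be treated as stale.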
## API reference

### `POST /v1/voice/verify` (1:1)

| field | type | description |
|---|---|---|
| `model` | string | gallery entry name (e.g. `speechbrain-ecapa-tdnn`) |
| `audio1`, `audio2` | string | URL, base64, or data-URI of an audio file |
| `threshold` | float, optional | cosine-distance cutoff; default 0.25 for ECAPA-TDNN |
| `anti_spoofing` | bool, optional | reserved; unused in the current release |

Returns `verified`, `distance`, `threshold`, `confidence`, `model`,
and `processing_time_ms`.

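The request fields above can be assembled with a small shell helper. This is a minimal sketch rather than anything LocalAI ships: `verify_body` is a hypothetical function name, it assumes the arguments contain no characters that need JSON escaping, and it adds `threshold` only when a fourth argument is given, so the per-model default still applies otherwise.

```bash
# Hypothetical helper (not part of LocalAI): build a /v1/voice/verify body.
# Assumes arguments contain no quotes or backslashes (no JSON escaping is done).
verify_body() {
  local model="$1" audio1="$2" audio2="$3" threshold="${4:-}"
  if [ -n "$threshold" ]; then
    printf '{"model":"%s","audio1":"%s","audio2":"%s","threshold":%s}\n' \
      "$model" "$audio1" "$audio2" "$threshold"
  else
    printf '{"model":"%s","audio1":"%s","audio2":"%s"}\n' \
      "$model" "$audio1" "$audio2"
  fi
}

# Print a body with an explicit threshold; pipe it to `curl ... -d @-` to send it.
verify_body voice-detect-ecapa-tdnn \
  https://example.com/alice_1.wav https://example.com/alice_2.wav 0.25
```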
### `POST /v1/voice/analyze`

Returns demographic attributes (age, gender, emotion) inferred from
speech:

| field | type | description |
|---|---|---|
| `model` | string | gallery entry |
| `audio` | string | URL / base64 / data-URI |
| `actions` | string[] | subset of `["age","gender","emotion"]`; empty = all supported |

Emotion is inferred from the SUPERB emotion-recognition checkpoint
(`superb/wav2vec2-base-superb-er`, Apache 2.0), a 4-way categorical
classifier: neutral / happy / angry / sad. The model auto-downloads on
the first analyze call.

Age and gender are **opt-in**: no standard-transformers checkpoint
with a clean classifier head is shipped as the default. The
high-accuracy Audeering age/gender model uses a custom multi-task
head that `AutoModelForAudioClassification` doesn't load safely
(the age weights are silently dropped and the classifier is
re-initialised with random values). To enable age/gender, set
`age_gender_model:<repo>` in the model YAML's `options:` pointing at
a checkpoint with a vanilla `Wav2Vec2ForSequenceClassification`
head. Override the emotion default similarly via `emotion_model:`.
Set either to an empty string to disable that head.

If a head fails to load (offline, disk full, `transformers`
missing), the engine degrades gracefully: it still returns the
attributes it could compute. When nothing can be computed the backend
returns `501 Unimplemented`.

Analyze is supported by both `speechbrain-ecapa-tdnn` and
`wespeaker-resnet34`: the speaker recognizer and the analysis head
are independent.

### `POST /v1/voice/register` (1:N enrollment)

| field | type | description |
|---|---|---|
| `model` | string | voice recognition model |
| `audio` | string | speaker audio to enroll |
| `name` | string | human-readable label |
| `labels` | map[string]string, optional | arbitrary metadata |
| `store` | string, optional | vector store model; defaults to local-store |

Returns `{id, name, registered_at}`. The `id` is an opaque UUID used
by `/v1/voice/identify` and `/v1/voice/forget`.

### `POST /v1/voice/identify` (1:N recognition)

| field | type | description |
|---|---|---|
| `model` | string | voice recognition model |
| `audio` | string | probe audio |
| `top_k` | int, optional | max matches to return; default 5 |
| `threshold` | float, optional | cosine-distance cutoff; default 0.25 |
| `store` | string, optional | vector store model |

Returns a list of matches sorted by ascending distance, each with
`id`, `name`, `labels`, `distance`, `confidence`, and `match`
(`distance ≤ threshold`).

### `POST /v1/voice/forget`

| field | type | description |
|---|---|---|
| `id` | string | ID returned by `/v1/voice/register` |

Returns `204 No Content` on success, `404 Not Found` if the ID is
unknown.

### `POST /v1/voice/embed`

Returns the L2-normalized speaker embedding vector.

| field | type | description |
|---|---|---|
| `model` | string | voice model |
| `audio` | string | URL / base64 / data-URI |

Returns `{embedding: float[], dim: int, model: string}`. Dimension
depends on the recognizer: 192 for ECAPA-TDNN, 256 for WeSpeaker
ResNet34.

> **Note:** the OpenAI-compatible `/v1/embeddings` endpoint is
> intentionally text-only and does nothing useful with audio input.
> Use `/v1/voice/embed` for audio.

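Embeddings returned by `/v1/voice/embed` can also be compared client-side. The sketch below assumes the conventional definition of cosine distance, `1 - cos(a, b)`, which this page does not spell out, and that each embedding is available as a whitespace-separated list of floats. `cosine_distance` and `is_match` are hypothetical helper names, not part of LocalAI.

```bash
# cosine_distance "a1 a2 ..." "b1 b2 ..." -> prints 1 - cos(a, b) with 6 decimals.
# Fails (exit status 1) on mismatched dimensions or a zero-length vector.
cosine_distance() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, " "); m = split(b, y, " ")
    if (n != m || n == 0) exit 1
    for (i = 1; i <= n; i++) { dot += x[i]*y[i]; na += x[i]*x[i]; nb += y[i]*y[i] }
    if (na == 0 || nb == 0) exit 1
    printf "%.6f\n", 1 - dot / sqrt(na * nb)
  }'
}

# is_match DISTANCE THRESHOLD -> exit status 0 when distance <= threshold.
is_match() {
  awk -v d="$1" -v t="$2" 'BEGIN { exit !(d + 0 <= t + 0) }'
}
```

For example, `is_match "$(cosine_distance "$a" "$b")" 0.25 && echo same-speaker` applies the same `distance ≤ threshold` rule that the identify endpoint reports in its `match` field.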
## Audio input

Audio is materialised by the HTTP layer to a temporary WAV file
before the gRPC call. All audio fields accept:

- `http://` / `https://` URLs (downloaded server-side, subject to
  `ValidateExternalURL` safety checks).
- Raw base64 (no prefix).
- Data URIs (`data:audio/wav;base64,...`).

The backend itself always receives a filesystem path, the same
convention the Whisper / Voxtral transcription backends use.

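For a local file, the data-URI form can be produced with standard tools. A minimal sketch, assuming a `base64` utility is on the `PATH`; `wav_data_uri` is a hypothetical helper name and `clip.wav` is a placeholder file:

```bash
# Hypothetical helper: print a data URI for a local WAV file.
# `base64 | tr -d '\n'` strips line wrapping in a way that works with both
# GNU and BSD base64.
wav_data_uri() {
  printf 'data:audio/wav;base64,%s\n' "$(base64 < "$1" | tr -d '\n')"
}

# Usage sketch (file name and server URL are placeholders):
# curl -sX POST http://localhost:8080/v1/voice/embed \
#   -H "Content-Type: application/json" \
#   -d "{\"model\": \"voice-detect-ecapa-tdnn\", \"audio\": \"$(wav_data_uri clip.wav)\"}"
```

Long clips can exceed the shell's argument-length limit when inlined this way; writing the JSON body to a file and passing it with `curl -d @body.json` avoids that.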
## Threshold reference

| Recognizer | Cosine-distance threshold |
|---|---|
| ECAPA-TDNN (SpeechBrain, VoxCeleb) | ~0.25 |
| WeSpeaker ResNet34 | ~0.30 |
| 3D-Speaker ERes2Net | ~0.28 |

Pass `threshold` explicitly when switching recognizers - the per-model
default only applies when `threshold` is omitted.

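A client that switches between recognizers may want to pin these values itself rather than rely on server defaults. The lookup below is a sketch that copies the approximate values from the table above; `default_threshold` is a hypothetical helper, and matching on the suffix of the gallery entry name is an assumption about how entries are named.

```bash
# Hypothetical client-side lookup built from the threshold table.
# Prints the approximate cosine-distance threshold for a gallery entry name,
# or returns non-zero for recognizers the table does not list.
default_threshold() {
  case "$1" in
    *ecapa-tdnn)         echo 0.25 ;;
    *wespeaker-resnet34) echo 0.30 ;;
    *eres2net)           echo 0.28 ;;
    *)                   return 1 ;;
  esac
}
```

The value it prints can be passed as the `threshold` field of a verify or identify request.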
## Related features

- [Face Recognition](/features/face-recognition/) - the image analog;
  the two share a registry design.
- [Audio to Text](/features/audio-to-text/) - transcription (Whisper,
  Voxtral, faster-whisper). Runs in addition to, not instead of,
  voice recognition.
- [Stores](/features/stores/) - the generic vector store powering
  both the face and voice 1:N recognition pipelines.
- [Embeddings](/features/embeddings/) - text-only OpenAI-compatible
  embedding endpoint; for audio embeddings use `/v1/voice/embed`.