docs(recon): document voice-detect and face-detect ggml backends

Document the new standalone C++/ggml biometric backends as the
recommended/default option for face and voice recognition, keeping the
existing Python insightface / speaker-recognition backends framed as the
legacy path.

- features/face-recognition.md: add a face-detect (ggml) backend section
  with the gallery entries (buffalo-l/m/s non-commercial, yunet-sface
  Apache-2.0), licensing, and verify/detect/analyze quickstart.
- features/voice-recognition.md: add a voice-detect (ggml) backend
  section with the gallery entries (ecapa-tdnn, wespeaker-resnet34,
  eres2net, campplus speaker recognizers; emotion-wav2vec2
  non-commercial analyze head) and quickstart.
- reference/compatibility-table.md: add face-detect.cpp and
  voice-detect.cpp rows to the Vision, Detection & Recognition table.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
2026-06-22 07:39:02 -04:00 · 2026-06-22 08:43:30 +00:00
parent 46d7d59a82
commit b843f498ca
3 changed files with 170 additions and 15 deletions
--- a/docs/content/features/face-recognition.md
+++ b/docs/content/features/face-recognition.md
@@ -7,16 +7,93 @@ url = "/features/face-recognition/"

 ![Face recognition: 1:N match against a vector store, with an anti-spoofing liveness gate that can veto a verification](/images/diagrams/face-recognition-flow.png)

-LocalAI supports face recognition through the `insightface` backend:
-face verification (1:1), face identification (1:N) against a built-in
-vector store, face embedding, face detection, demographic analysis
-(age / gender), and antispoofing / liveness detection.
+LocalAI supports face recognition: face verification (1:1), face
+identification (1:N) against a built-in vector store, face embedding,
+face detection, demographic analysis (age / gender), and anti-spoofing /
+liveness detection.

-The backend ships **two interchangeable engines** under one image, each
-paired with a distinct gallery entry so users can pick by license and
-accuracy needs.
+The same `/v1/face/*` HTTP API is served by two backends:

-## Licensing — read this first
+- **`face-detect` (recommended, default).** A standalone C++/ggml
+  engine ([face-detect.cpp](https://github.com/mudler/face-detect.cpp)):
+  no Python, no onnxruntime, no torch runtime. Each gallery entry is a
+  single self-describing GGUF. This is the recommended option for new
+  deployments.
+- **`insightface` (Python).** The original ONNX Runtime backend. Still
+  supported; see [the Python backend](#insightface-python-backend) below.
+
+Both backends expose the identical wire format, so the API examples on
+this page work with either - only the gallery entry name (the `model`
+field) changes.
+
+## face-detect (ggml) backend
+
+The `face-detect` backend reads the detector and recognizer architecture
+(`facedetect.arch`) directly from the GGUF metadata, so installing a
+gallery entry is all that is needed to select an engine. It drives the
+Embeddings / Detect / FaceVerify / FaceAnalyze gRPC methods behind the
+`/v1/face/{embed,verify,analyze,detect,register,identify,forget}`
+endpoints.
+
+### Licensing - read this first
+
+| Gallery entry | Detector + recognizer | Embedding dim | License |
+|---|---|---|---|
+| `face-detect-buffalo-l` | SCRFD-10GF + ArcFace R50 + GenderAge | 512 | **Non-commercial research only** (upstream insightface weights) |
+| `face-detect-buffalo-m` | SCRFD-2.5GF + ArcFace R50 + GenderAge | 512 | **Non-commercial research only** |
+| `face-detect-buffalo-s` | SCRFD-500MF + MBF + GenderAge | 512 | **Non-commercial research only** |
+| `face-detect-yunet-sface` | YuNet + SFace (OpenCV Zoo) | 128 | **Apache 2.0 - commercial-safe** |
+
+The insightface buffalo packs (buffalo_l / buffalo_m / buffalo_s) are
+released by the upstream maintainers for **non-commercial research use
+only**. Pick the `face-detect-yunet-sface` entry for production /
+commercial deployments.
+
+### Quickstart
+
+Install the commercial-safe entry (recommended for copy-paste):
+
+```bash
+local-ai models install face-detect-yunet-sface
+```
+
+Verify that two images depict the same person:
+
+```bash
+curl -sX POST http://localhost:8080/v1/face/verify \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "face-detect-yunet-sface",
+    "img1": "https://example.com/alice_1.jpg",
+    "img2": "https://example.com/alice_2.jpg"
+  }'
+```
+
+Detect faces and analyze demographics (buffalo entries populate
+age / gender; YuNet + SFace returns regions only):
+
+```bash
+curl -sX POST http://localhost:8080/v1/face/detect \
+  -H "Content-Type: application/json" \
+  -d '{"model": "face-detect-buffalo-l", "img": "https://example.com/group.jpg"}'
+
+curl -sX POST http://localhost:8080/v1/face/analyze \
+  -H "Content-Type: application/json" \
+  -d '{"model": "face-detect-buffalo-l", "img": "https://example.com/alice.jpg"}'
+```
+
+The 1:N register / identify / forget workflow and the rest of the API
+are identical to the [API reference](#api-reference) below - just pass a
+`face-detect-*` model name. The per-engine verify thresholds are ~0.35
+for the buffalo ArcFace/MBF recognizers and ~0.363 for SFace.
+
+## insightface (Python) backend
+
+The `insightface` backend ships **two interchangeable engines** under
+one image, each paired with a distinct gallery entry so users can pick
+by license and accuracy needs.
+
+### Licensing - read this first

 | Gallery entry | Detector + recognizer | Size | License |
 |---|---|---|---|
--- a/docs/content/features/voice-recognition.md
+++ b/docs/content/features/voice-recognition.md
@@ -7,16 +7,92 @@ url = "/features/voice-recognition/"

 ![Voice recognition: register, identify, and forget voiceprints in a vector store, for 1:1 verify or 1:N identify](/images/diagrams/voice-recognition-flow.png)

-LocalAI supports voice (speaker) recognition through the
-`speaker-recognition` backend: speaker verification (1:1), speaker
-identification (1:N) against a built-in vector store, speaker
-embedding, and demographic analysis (age / gender / emotion from
-voice).
+LocalAI supports voice (speaker) recognition: speaker verification
+(1:1), speaker identification (1:N) against a built-in vector store,
+speaker embedding, and demographic analysis (age / gender / emotion
+from voice).

 The audio analog to [Face Recognition](/features/face-recognition/),
-following the same two-engine pattern under one image.
+served over the same `/v1/voice/*` HTTP API by two backends:

-## Engines
+- **`voice-detect` (recommended, default).** A standalone C++/ggml
+  engine ([voice-detect.cpp](https://github.com/mudler/voice-detect.cpp)):
+  no Python, no onnxruntime, no torch runtime. Each gallery entry is a
+  single self-describing GGUF. This is the recommended option for new
+  deployments.
+- **`speaker-recognition` (Python).** The original SpeechBrain / ONNX
+  backend. Still supported; see [the Python backend](#speaker-recognition-python-backend)
+  below.
+
+Both backends expose the identical wire format, so the API examples on
+this page work with either - only the gallery entry name (the `model`
+field) changes.
+
+## voice-detect (ggml) backend
+
+The `voice-detect` backend reads the embedding (or analysis)
+architecture (`voicedetect.arch`) directly from the GGUF metadata, so
+installing a gallery entry is all that is needed to select an engine. It
+drives the VoiceEmbed / VoiceVerify / VoiceAnalyze gRPC methods behind the
+`/v1/voice/{embed,verify,analyze,register,identify,forget}` endpoints.
+
+### Gallery entries
+
+| Gallery entry | Model | Embedding dim | License |
+|---|---|---|---|
+| `voice-detect-ecapa-tdnn` | SpeechBrain ECAPA-TDNN (VoxCeleb) | 192 | **Apache 2.0 - commercial-safe** |
+| `voice-detect-wespeaker-resnet34` | WeSpeaker ResNet34 (VoxCeleb) | 256 | CC-BY-4.0 |
+| `voice-detect-eres2net` | 3D-Speaker ERes2Net (VoxCeleb) | 192 | **Apache 2.0 - commercial-safe** |
+| `voice-detect-campplus` | 3D-Speaker CAM++ (VoxCeleb) | 192 | **Apache 2.0 - commercial-safe** |
+| `voice-detect-emotion-wav2vec2` | audEERING wav2vec2 (age / gender / emotion) | analyze head | **CC-BY-NC-SA-4.0 - non-commercial** |
+
+The four speaker-recognition entries drive verify / embed / identify.
+`voice-detect-emotion-wav2vec2` is the analysis head behind
+`/v1/voice/analyze` (continuous age estimate plus gender and emotion
+class scores) and is **non-commercial / research use only**.
+
+### Quickstart
+
+Install the default entry (recommended for copy-paste):
+
+```bash
+local-ai models install voice-detect-ecapa-tdnn
+```
+
+Verify that two audio clips were spoken by the same person:
+
+```bash
+curl -sX POST http://localhost:8080/v1/voice/verify \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "voice-detect-ecapa-tdnn",
+    "audio1": "https://example.com/alice_1.wav",
+    "audio2": "https://example.com/alice_2.wav"
+  }'
+```
+
+Analyze age / gender / emotion (install the analyze entry first):
+
+```bash
+local-ai models install voice-detect-emotion-wav2vec2
+
+curl -sX POST http://localhost:8080/v1/voice/analyze \
+  -H "Content-Type: application/json" \
+  -d '{"model": "voice-detect-emotion-wav2vec2", "audio": "https://example.com/alice.wav"}'
+```
+
+The 1:N register / identify / forget workflow and the rest of the API
+are identical to the [API reference](#api-reference) below - just pass a
+`voice-detect-*` model name. The default verify threshold is ~0.25 for
+the ECAPA-TDNN / ERes2Net / CAM++ recognizers and ~0.30 for WeSpeaker
+ResNet34.
+
+## speaker-recognition (Python) backend
+
+The `speaker-recognition` backend follows the same two-engine pattern
+under one image.
+
+### Engines

 | Gallery entry | Model | Size | License |
 |---|---|---|---|
--- a/docs/content/reference/compatibility-table.md
+++ b/docs/content/reference/compatibility-table.md
@@ -97,6 +97,8 @@ All backends listed here can be installed on demand from the [Backend Gallery]({
 | [locate-anything.cpp](https://github.com/mudler/locate-anything.cpp) | Open-vocabulary object detection and visual grounding (LocateAnything-3B) in C/C++ using GGML | CPU, CUDA 12/13, Intel SYCL, Vulkan, Jetson L4T |
 | [depth-anything.cpp](https://github.com/mudler/depth-anything.cpp) | Depth Anything 3 monocular metric depth + camera pose in C/C++ using GGML | CPU, CUDA 12/13, Intel SYCL, Vulkan, Jetson L4T |
 | [sam3.cpp](https://github.com/PABannier/sam3.cpp) | Segment Anything (SAM 3/2/EdgeTAM) with text/point/box prompts in C/C++ using GGML | CPU, CUDA 12/13, Intel SYCL, Vulkan, Jetson L4T |
+| [face-detect.cpp](https://github.com/mudler/face-detect.cpp) | Native face detection, recognition, embedding, demographics and anti-spoofing (SCRFD/ArcFace, YuNet/SFace) in C/C++ using GGML | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
+| [voice-detect.cpp](https://github.com/mudler/voice-detect.cpp) | Native speaker (voice) recognition and voice analysis (ECAPA-TDNN, WeSpeaker, ERes2Net, CAM++, wav2vec2) in C/C++ using GGML | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
 | [insightface](https://github.com/deepinsight/insightface) | Face verification, embedding, and anti-spoofing liveness (ONNX Runtime) | CPU, CUDA 12 |
 | [speaker-recognition](https://speechbrain.github.io/) | Speaker (voice) recognition via SpeechBrain ECAPA-TDNN | CPU, CUDA 12, Metal |
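
Note on the quickstarts added above: the face and voice verify calls differ only in the URL path and in the names of the two input fields (`img1` / `img2` versus `audio1` / `audio2`). The Python sketch below builds either request without sending it. It assumes the host, port, paths, and field names shown in the curl examples; `verify_request` is a hypothetical helper for illustration and is not part of LocalAI.

```python
import json
import urllib.request

# Same host and port as the curl examples in the diff above.
BASE_URL = "http://localhost:8080"

# Input field names per modality, copied from the quickstart payloads.
INPUT_FIELDS = {
    "face": ("img1", "img2"),
    "voice": ("audio1", "audio2"),
}


def verify_request(kind, model, first, second, base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/{kind}/verify.

    Hypothetical helper: `kind` is "face" or "voice"; `model` is a
    gallery entry name such as "face-detect-yunet-sface".
    """
    first_field, second_field = INPUT_FIELDS[kind]
    body = {"model": model, first_field: first, second_field: second}
    return urllib.request.Request(
        f"{base_url}/v1/{kind}/verify",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


face_req = verify_request(
    "face",
    "face-detect-yunet-sface",
    "https://example.com/alice_1.jpg",
    "https://example.com/alice_2.jpg",
)
voice_req = verify_request(
    "voice",
    "voice-detect-ecapa-tdnn",
    "https://example.com/alice_1.wav",
    "https://example.com/alice_2.wav",
)
```

Passing either object to `urllib.request.urlopen` would send it to a running server. The response schema is not shown in this change, so the sketch does not parse it.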